*Published on Wednesday, 23 March 2022*

## Calculate correlation coefficient between arrays of different length

Can the Pearson coefficient be computed between two numeric arrays of different sizes? Let's see what that means.

## Scenario

If we look at the definition of the **Pearson correlation coefficient** for discrete data series, we see that it is *mathematically* wrong to apply it to series of different lengths, because every term of the formula pairs one value of the first series with one value of the second. We should also keep in mind that **we shouldn't create missing data** out of nothing. *BUT* in real-world problems we sometimes prefer a simple "good enough" procedure to a *perfect* but impossible one, so...
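To make the length requirement concrete, here is a minimal sketch of the coefficient written straight from its definition. The helper name `pearson` is hypothetical (it is not part of the original code); it only illustrates that the sum of products needs one value of `y` for each value of `x`.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation from its definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each product below pairs x[i] with y[i], so the lengths must match.
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


r = pearson([1, 2, 3, 4], [2, 4, 6, 9])
print(r)  # same value as np.corrcoef(...)[0, 1] on the same data

try:
    pearson([1, 2, 3], [1, 2])
except ValueError as err:
    print("cannot correlate:", err)
```

`numpy.corrcoef` behaves the same way and raises a `ValueError` when the two inputs differ in length, which is why the series have to be brought to a common length first.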

Our scenario is the following: we have two pandas time series and want to understand how they correlate with each other, but their sizes or sampling rates differ and some data may be missing.

## Procedure

- Select a **startTime** and **endTime** index that delimit the overlapping time period, and reduce both vectors to that range. This must be done because we do *NOT* want to *extrapolate* any data outside that range, only interpolate missing data (...it's less error prone!). Let's call the resulting data vectors **sMin** and **sMax**, where *len(sMax) >= len(sMin)*.
- We choose to interpolate the longest vector, i.e. **sMax**. That's because we assume it is denser, i.e. it has more elements over the same time range, which generally makes interpolation errors less relevant. So we turn the DatetimeIndex (of both series!) into absolute integer values **t**, and then create an interpolation function **f(t)** based on **sMax**.
- Use **f(t)** to evaluate **sMax** at the timestamps of **sMin**. The resulting values are put into a new vector, which we call **sInt**.
- Now **sMin** and **sInt** have the same length, so we can compute the correlation coefficient between them.
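One detail of the second step is worth checking: the integer values **t** should be linear in time, otherwise the interpolation weights are distorted. The sketch below compares two possible encodings on four consecutive days. The variable names are illustrative only, and the epoch-based encoding is one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Four consecutive days that cross a month boundary.
idx = pd.date_range("2021-01-30", periods=4, freq=pd.Timedelta(days=1))

# Encoding A: integer ticks since the Unix epoch
# (nanoseconds for a datetime64[ns] index).
t_epoch = idx.astype("int64").to_numpy()

# Encoding B: the date digits read as one integer, e.g. 20210131.
t_digits = np.array([int(s) for s in idx.strftime("%Y%m%d")])

gaps_epoch = np.diff(t_epoch)
gaps_digits = np.diff(t_digits)
print(gaps_epoch)   # every gap is the same: one day
print(gaps_digits)  # the gap jumps at the month boundary
```

With the digit encoding, the step from 31 January to 1 February looks 70 times longer than a normal day, so an interpolation function built on it would weight neighbouring points incorrectly.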

## Python code

```
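from scipy import interpolate
from numpy import corrcoef


def corr(s1, s2):
    # Keep only the time range covered by both series (no extrapolation).
    startDate = max(s1.index.min(), s2.index.min())
    endDate = min(s1.index.max(), s2.index.max())
    s1 = s1.loc[(s1.index >= startDate) & (s1.index <= endDate)]
    s2 = s2.loc[(s2.index >= startDate) & (s2.index <= endDate)]
    # Turn timestamps into integers that are linear in time (ticks since the
    # Unix epoch). Digit strings such as "%Y%m%d%H%M%S" are not evenly spaced
    # across minute/hour/day/month boundaries and would distort interpolation.
    s1.index = s1.index.astype("int64")
    s2.index = s2.index.astype("int64")
    # sMax is the longer (assumed denser) series: the one we interpolate.
    sMin, sMax = (s2, s1) if len(s1) >= len(s2) else (s1, s2)
    f = interpolate.interp1d(sMax.index, sMax.values)
    # Drop sMin points outside the range of sMax, where f is undefined.
    minBound = sMax.index.min()
    maxBound = sMax.index.max()
    sMin = sMin[(sMin.index >= minBound) & (sMin.index <= maxBound)]
    # Evaluate sMax at the timestamps of sMin and correlate the two vectors.
    sInt = f(sMin.index)
    return corrcoef(sMin.values, sInt)[0, 1]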
```

## Testing with Mock Data

```
import datetime
import pandas as pd


def corr(s1, s2):
    ...


if __name__ == '__main__':
    i1 = pd.date_range(start=datetime.datetime(2019, 1, 1), end=datetime.datetime(2022, 1, 1))
    i2 = pd.date_range(start=datetime.datetime(2020, 2, 3), end=datetime.datetime(2022, 3, 3))
    s1 = pd.Series(index=i1, data=[i for i in range(len(i1))])
    s2 = pd.Series(index=i2, data=[i**2 for i in range(len(i2))])
    print(corr(s1, s2))  # Output: 0.9681593362286723
```
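
Both mock series above are sampled daily, so the interpolation function is only ever evaluated at points that already exist in the denser series. The following is a minimal sketch of a test with genuinely different sampling rates (every 6 hours versus daily, offset by 3 hours). It is a compact variant, not the original function: `corr_interp` and `days_since` are hypothetical names, and `numpy.interp` is swapped in for `scipy.interpolate.interp1d` to keep the block self-contained. Both indexes are assumed to share the same datetime resolution.

```python
import numpy as np
import pandas as pd


def corr_interp(s1, s2):
    """Compact variant of the procedure, using numpy.interp (an assumption)."""
    # Restrict both series to their overlapping time range.
    start = max(s1.index.min(), s2.index.min())
    end = min(s1.index.max(), s2.index.max())
    s1 = s1[(s1.index >= start) & (s1.index <= end)]
    s2 = s2[(s2.index >= start) & (s2.index <= end)]
    # The longer series is the one that gets interpolated.
    s_min, s_max = (s2, s1) if len(s1) >= len(s2) else (s1, s2)
    t_min = s_min.index.astype("int64").to_numpy()
    t_max = s_max.index.astype("int64").to_numpy()
    # Skip points of the shorter series outside the longer one's range,
    # so that nothing is extrapolated.
    inside = (t_min >= t_max.min()) & (t_min <= t_max.max())
    s_int = np.interp(t_min[inside], t_max, s_max.to_numpy())
    return np.corrcoef(s_min.to_numpy()[inside], s_int)[0, 1]


def days_since(index, ref):
    """Elapsed time since `ref`, in days, as a float array."""
    return ((index - ref) / pd.Timedelta(days=1)).to_numpy()


ref = pd.Timestamp("2021-01-01")
dense = pd.date_range("2021-01-01", "2021-03-01", freq=pd.Timedelta(hours=6))
sparse = pd.date_range("2021-01-15 03:00", "2021-04-01", freq=pd.Timedelta(days=1))

# Two linear signals of the same underlying time variable.
s1 = pd.Series(days_since(dense, ref), index=dense)
s2 = pd.Series(2.0 * days_since(sparse, ref) + 1.0, index=sparse)

r = corr_interp(s1, s2)
print(r)  # close to 1.0: linear interpolation of a linear signal is exact up to rounding
```

Because the daily timestamps fall between the 6-hourly ones, this case actually exercises the interpolation, and the first daily point is dropped since it precedes the first point of the denser series inside the overlap.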