I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between each row in the matrix. I have written Python code like this:
for i in xrange(rows): # rows are the number of rows in the matrix. for j in xrange(i, rows): r = scipy.stats.pearsonr(data[i,:], data[j,:]) print r
Please note that I am making use of the
pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?
After looking at Joe Kington's answer, I decided to look into the
corrcoef() code and was inspired by it to do the following implementation.
ms = data.mean(axis=1)[(slice(None,None,None),None)] datam = data - ms datass = np.sqrt(scipy.stats.ss(datam,axis=1)) for i in xrange(rows): temp = np.dot(datam[i:],datam[i].T) rs = temp / (datass[i:]*datass[i])
Each loop through generates the Pearson coefficients between row i and rows i through to the last row. It is very fast. It is at least 1.5x as fast as using
corrcoef() alone because it doesn't redundantly calculate the coefficients and a few other things. It will also be faster and won't give you the memory problems with a 50,000 row matrix because then you can choose to either store each set of r's or process them before generating another set. Without storing any of the r's long term, I was able to get the above code to run on 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.
First, I wouldn't recommend printing out the r's to the screen. For 100 rows (10 columns), this is a difference of 19.79 seconds with printing vs. 0.301 seconds without using your code. Just store the r's and use them later if you would like, or do some processing on them as you go along like looking for some of the largest r's.
Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using some quantities that you can precalculate rather than calculating every time that a row is used. Also, you aren't using the p-value (which is also returned by
pearsonr() so let's scratch that too. Using the below code:
r = np.zeros((rows,rows)) ms = data.mean(axis=1) datam = np.zeros_like(data) for i in xrange(rows): datam[i] = data[i] - ms[i] datass = scipy.stats.ss(datam,axis=1) for i in xrange(rows): for j in xrange(i,rows): r_num = np.add.reduce(datam[i]*datam[j]) r_den = np.sqrt(datass[i]*datass[j]) r[i,j] = min((r_num / r_den), 1.0)
I get a speed-up of about 4.8x over the straight scipy code when I've removed the p-value stuff - 8.8x if I leave the p-value stuff in there (I used 10 columns with hundreds of rows). I also checked that it does give the same results. This isn't a really huge improvement, but it might help.
Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 Pearson coefficients. With the above code, I expect that it would take about 4 1/4 hours to do your computation if you have 10 columns to your data based on my results on smaller datasets.
You can get some improvement by taking the above code into Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight Scipy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.