Fastest file format to save and load matrices in Octave? - file-io

I have a matrix that is about 11,000 x 1,000, saved as a csv. It takes forever to load.
What is the fastest (or recommended) format to save matrices in?

Where does the data come from?
Way back when I was in graduate school, I generated simulation data and results in a C++ program. As I owned the data, I wrote a routine to write the matrix data in the binary format expected by Octave --- and which point reading is pretty fast as it becomes a single fread call.

Don't forget the -binary option. For example,
save -binary myfile.mat X Y Z; % save X, Y, and Z matrices to myfile.mat
load myfile.mat; % load X, Y, and Z matrices from myfile.mat
When I forgot to use the -binary option, my 80,000 x 402 matrix of doubles took more than 22 minutes to load. With the -binary option, it took less than 2.5 seconds.

Related

Best way to calculate cursory statistics from very large CSV

I have some data in CSV format (16 billion rows, 170 columns).
I can extract each column using cut and load "just" one column from a file into Pandas using pd.load_csv(), but it is painfully slow and uses about 228GB of RAM while loading then settles back to 46GB for one of the columns while for some others tested my system with 256GB of RAM starts swapping and grinds to a halt.
Is there some way which is reasonably fast and requires less RAM to calculate standard stats like mean, median, standard deviation, and standard error on each column?
System(s) are all running Ubuntu 20.04.3 LTS and I can install any package available through standard repos.
NOTE: Some columns have u for unknown/missing data while some just have nothing for the same but otherwise all the columns are either integers or floats.
If anyone is looking for an answer, the comments have some good suggestions for not using CSV files.
In almost all cases, using something other than CSV is best, but sometimes (like in my case), it's what you have to work with. There are a couple of solutions that work reasonably well depending on factors.
I was unable to find a solution, so I just wrote my own.
Calculating the Standard Deviation and Standard Error (and Confidence Intervals) does not require holding all variables in RAM; however, if you opt not to hold them in RAM you will have to read them twice. Once to calculate the Mean, and the second for the sum of the difference between the mean and the values squared (sometimes referred to the Mean Squares). With those two numbers and the number of variables you can calculate most of the most-common stats.
Example code:
#!/usr/bin/env python3
import csv
import math
def calc_some_stats(infile, col_idx):
n, tot = 0, 0
with open(infile, 'r') as fh:
reader = csv.reader(fh)
for row in reader:
try:
val = float(row[col_idx])
n += 1
tot += val
except ValueError:
# Ignore nulls, 'u', and 'nan'
pass
pass
pass
mean, sum_mean_sq = tot / n, 0
with open(infile, 'r') as fh:
reader = csv.reader(fh)
for row in reader:
try:
val = float(row[col_idx])
sum_mean_sq += (mean - val)**2
except ValueError:
pass
pass
pass
variance = sum_mean_sq / n
standard_deviation = math.sqrt(variance)
standard_error = standard_deviation / math.sqrt(n)
return n, mean, standard_deviation, standard_error
n, mean, stdev, sem = calc_some_stats("somefile.csv", 12)

How to calculate a very large correlation matrix

I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
for i2 in range(i1+1):
r = np.corrcoef(z[i1,:],z[i2,:])[0,1]
if r > 0.95:
file.write("%6d %6d %.3f\n" % (i1,i2,r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?
After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my Macbook), it still took a very long time (at least a day) -- presumably due to inefficient element-by-element writes to the outputfile. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py
# read nparray, dimensions (102000, 60)
infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()
# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)
# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd
# transpose back
z = ztrans.T
# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
arr /= z.shape[0]
# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
hf.create_dataset("correlation", data=arr)
From my rough calculations, you want a correlation matrix that has 100,000^2 elements. That takes up around 40 GB of memory, assuming floats.
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero mean data the covariance can be found using a dot product.
z0 = z - mean(z, 1)[:, None]
cov = dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation by normalizing by the variances
sigma = std(z, 1)
corr = cov
corr /= sigma
corr /= sigma[:, None]
Of course memory usage is still an issue.
You can work around this with memory mapped arrays (make sure it's opened for reading and writing) and the out parameter of dot (For another example see Optimizing my large data code with little RAM)
N = z.shape[0]
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N,N))
dot(z0, z0.T, out=arr)
arr /= sigma
arr /= sigma[:, None]
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory).
You can use scipy.spatial.distance.pdist with metric = correlation to get all the correlations without the symmetric terms. Unfortunately this will still leave you with about 5e10 terms that will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
You best bet is probably brute forcing blocks of data using scipy.spatial.distance.cdist(..., metric = correlation), and then keep only the high correlations in each block. Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture it should be much faster than doing one at a time.
please check out deepgraph package.
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried on z.shape = (2500, 60) and pearsonr for 2500 * 2500. It has an extreme fast speed.
Not sure for 100000 x 100000 but worth trying.

Zoom in on np.fft2 result

Is there a way to chose the x/y output axes range from np.fft2 ?
I have a piece of code computing the diffraction pattern of an aperture. The aperture is defined in a 2k x 2k pixel array. The diffraction pattern is basically the inner part of the 2D FT of the aperture. The np.fft2 gives me an output array same size of the input but with some preset range of the x/y axes. Of course I can zoom in by using the image viewer, but I have already lost detail. What is the solution?
Thanks,
Gert
import numpy as np
import matplotlib.pyplot as plt
r= 500
s= 1000
y,x = np.ogrid[-s:s+1, -s:s+1]
mask = x*x + y*y <= r*r
aperture = np.ones((2*s+1, 2*s+1))
aperture[mask] = 0
plt.imshow(aperture)
plt.show()
ffta= np.fft.fft2(aperture)
plt.imshow(np.log(np.abs(np.fft.fftshift(ffta))**2))
plt.show()
Unfortunately, much of the speed and accuracy of the FFT come from the outputs being the same size as the input.
The conventional way to increase the apparent resolution in the output Fourier domain is by zero-padding the input: np.fft.fft2(aperture, [4 * (2*s+1), 4 * (2*s+1)]) tells the FFT to pad your input to be 4 * (2*s+1) pixels tall and wide, i.e., make the input four times larger (sixteen times the number of pixels).
Begin aside I say "apparent" resolution because the actual amount of data you have hasn't increased, but the Fourier transform will appear smoother because zero-padding in the input domain causes the Fourier transform to interpolate the output. In the example above, any feature that could be seen with one pixel will be shown with four pixels. Just to make this fully concrete, this example shows that every fourth pixel of the zero-padded FFT is numerically the same as every pixel of the original unpadded FFT:
# Generate your `ffta` as above, then
N = 2 * s + 1
Up = 4
fftup = np.fft.fft2(aperture, [Up * N, Up * N])
relerr = lambda dirt, gold: np.abs((dirt - gold) / gold)
print(np.max(relerr(fftup[::Up, ::Up] , ffta))) # ~6e-12.
(That relerr is just a simple relative error, which you want to be close to machine precision, around 2e-16. The largest error between every 4th sample of the zero-padded FFT and the unpadded FFT is 6e-12 which is quite close to machine precision, meaning these two arrays are nearly numerically equivalent.) End aside
Zero-padding is the most straightforward way around your problem. But it does cost you a lot of memory. And it is frustrating because you might only care about a tiny, tiny part of the transform. There's an algorithm called the chirp z-transform (CZT, or colloquially the "zoom FFT") which can do this. If your input is N (for you 2*s+1) and you want just M samples of the FFT's output evaluated anywhere, it will compute three Fourier transforms of size N + M - 1 to obtain the desired M samples of the output. This would solve your problem too, since you can ask for M samples in the region of interest, and it wouldn't require prohibitively-much memory, though it would need at least 3x more CPU time. The downside is that a solid implementation of CZT isn't in Numpy/Scipy yet: see the scipy issue and the code it references. Matlab's CZT seems reliable, if that's an option; Octave-forge has one too and the Octave people usually try hard to match/exceed Matlab.
But if you have the memory, zero-padding the input is the way to go.

Multiply one fixed matrix by a huge number of vectors

I'll need to change the basis of some 10^7 vectors, each having
200 coordinates. So I will multiply one [200 x 200] matrix by 10^7 [200 x 1] vectors. I need it to run very fast but I need to code it fast (one day or less)
and my CUDA is poor, so I don't want to code it from scratch in CUDA or OpenCL. Maybe some existing library can do it for me? Notice that, if the solution uses GPGPU, the matrix should be transfered to the GPU only once, otherwise the performance will be poor. Could I could use OpenACC (or OpenMP, I don't know)? Is it possible to do this in a day?
I prefer open source solutions (for both convenience and ethical reasons) but I can tolerate a closed source solution, even paid (assuming it is not too costly).
This is for my dissertation.
Thank you for your attention.
You can put your vectors in a matrix, 200 * 10^7 is perhaps to much space at once depending on our system, so you can split it.
And then you use any code that is optimized for matrix matrix multiplication, like BLAS. There are many implementations on CPUs, GPUs (cuBLAS, MAGMA,...), multicores (PLASMA,...), or distributed memory.
Since you will have big matrices you vill have a better acceleration than by doing matrix vector multiplications.
You're going to multiply 10 million big vectors by a huge matrix that is the same for all of them.
It would be fastest if all possible decision-making could be compiled-out ahead of time.
In other words, there are lots of index calculations and loop testing that would be identically repeated millions of times.
This sounds like a perfect case for pre-compilation:
Write a small program that would take as input your 200x200 matrix data values, and have it print out a piece of program text defining a function capable of inputting the input vector and outputting the result vector.
It could look something like this:
void multTheMatrixByTheVector(double a[200], double b[200]){
b[0] = 0
+ a[0] * <a constant, the value of mat[0][0]>
+ a[1] * <a constant, the value of mat[1][0]>
...
+ a[199] * <a constant, the value of mat[199][0]>
;
b[1] = 0
+ a[0] * <a constant, the value of mat[0][1]>
+ a[1] * <a constant, the value of mat[1][1]>
...
+ a[199] * <a constant, the value of mat[199][1]>
;
...
b[199] = etc. etc.
}
You see, that function will be around 40000 lines long, but a decent compiler should be able to handle it.
Of course, if any of the matrix elements are zero, i.e. there's some sparsity, you can omit those lines (or let the compiler optimizer do it).
To do this on CUDA or vectorized instructions, you'd have to modify it accordingly, but that should be do-able.
When you include that function in your main program, it should be able to run about as fast as the machine can go.
It's not wasting any cycles doing index calculations, loop testing, or multiplying by empty matrix cells.
Then if it takes 10ns per multiply and add, my back-of-the envelope says it should take 400 usec per vector, or 4000 seconds overall - a little over an hour.

Optimizing Python KD Tree Searches

Scipy (http://www.scipy.org/) offers two KD Tree classes; the KDTree and the cKDTree.
The cKDTree is much faster, but is less customizable and query-able than the KDTree (as far as I can tell from the docs).
Here is my problem:
I have a list of 3million 2 dimensional (X,Y) points. I need to return all of the points within a distance of X units from every point.
With the KDtree, there is an option to do just this: KDtree.query_ball_tree() It generates a list of lists of all the points within X units from every other point. HOWEVER: this list is enormous and quickly fills up my virtual memory (about 744 million items long).
Potential solution #1: Is there a way to parse this list into a text file as it is writing?
Potential solution #2: I have tried using a for loop (for every point in the list) and then finding that single point's neighbors within X units by employing: KDtree.query_ball_point(). HOWEVER: this takes forever as it needs to run the query millions of times. Is there a cKDTree equivalent of this KDTree tool?
Potential solution #3: Beats me, anyone else have any ideas?
From scipy 0.12 on, both KD Tree classes have feature parity. Quoting its announcement:
cKDTree feature-complete
Cython version of KDTree, cKDTree, is now feature-complete. Most
operations (construction, query, query_ball_point, query_pairs,
count_neighbors and sparse_distance_matrix) are between 200 and 1000
times faster in cKDTree than in KDTree. With very minor caveats,
cKDTree has exactly the same interface as KDTree, and can be used as a
drop-in replacement.
Try using KDTree.query_ball_point instead. It takes a single point, or array of points, and produces the points within a given distance of the input point(s).
You can use this function to perform batch queries. Give it, say, 100000 points at a time, and then write the results out to a file. Something like this:
BATCH_SIZE = 100000
for i in xrange(0, len(pts), BATCH_SIZE):
neighbours = tree.query_ball_point(pts[i:i+BATCH_SIZE], X)
# write neighbours to a file...