Is there a way to reduce scipy/numpy precision to reduce memory consumption?

On my 64-bit Debian/Lenny system (4GByte RAM + 4GByte swap partition) I can successfully do:
v=array(10000*random([512,512,512]),dtype=np.int16)
f=fftn(v)
but since f is np.complex128, the memory consumption is shocking, and I can't do much more with the result (e.g. modulate the coefficients and then f=ifftn(f)) without a MemoryError traceback.
Rather than installing some more RAM and/or expanding my swap partitions, is there some way of controlling the scipy/numpy "default precision" and having it compute a complex64 array instead?
I know I can just reduce it afterwards with f=array(f,dtype=np.complex64); I'm looking to have it actually do the FFT work in 32-bit precision and use half the memory.

It doesn't look like there's any way to do this in scipy's fft functions (see http://www.astro.rug.nl/efidad/scipy.fftpack.basic.html).
Unless you're able to find a fixed point FFT library for python, it's unlikely that the function you want exists, since the transforms are done in the native 64-bit double format (complex128 is just a pair of doubles). It does look like you could use the rfft method, which exploits the fact that your input is purely real and returns only the non-redundant half of the spectrum, and that would save half your RAM.
I ran the following in interactive python:
>>> from numpy import *
>>> v = array(10000*random.random([512,512,512]),dtype=int16)
>>> shape(v)
(512, 512, 512)
>>> type(v[0,0,0])
<type 'numpy.int16'>
At this point the RSS (Resident Set Size) of python was 265MB.
>>> f = fft.fft(v)
And at this point the RSS of python was 2.3GB.
>>> type(f)
<type 'numpy.ndarray'>
>>> type(f[0,0,0])
<type 'numpy.complex128'>
>>> v = []
And at this point the RSS goes down to 2.0GB, since I've freed up v.
Using "fft.rfft(v)" to transform the real-valued input results in a 1.3GB RSS (almost half, as expected).
Doing:
>>> f = complex64(fft.fft(v))
is the worst of both worlds: it first computes the complex128 version (2.3GB) and then copies that into the complex64 version (1.3GB), which means the peak RSS on my machine was 3.6GB before it settled down to 1.3GB again.
I think that if you've got 4GB RAM, this should all work just fine (as it does for me). What's the issue?
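For what it's worth, here is a minimal sketch (my addition, not part of the measurements above) of the real-input transform on a 3D array using numpy's rfftn; the output keeps only the non-redundant half of the last axis, which is where the memory saving comes from:
import numpy as np

# Smaller than 512^3 so the sketch runs anywhere; the shapes scale the same way.
v = np.array(10000 * np.random.random([64, 64, 64]), dtype=np.int16)
f = np.fft.rfftn(v)
print(f.shape, f.dtype)   # (64, 64, 33) complex128: half the coefficients of fftn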

Scipy 0.8 will have single precision support for almost all the fft code (The code is already in the trunk, so you can install scipy from svn if you need the feature now).
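As a rough sketch of what single-precision support looks like (this assumes a modern SciPy where the scipy.fft module preserves the input precision, not the scipy 0.8 trunk interface):
import numpy as np
import scipy.fft

# float32 input stays in single precision, so the result is complex64,
# half the memory of the default complex128.
v = np.random.random([64, 64, 64]).astype(np.float32)
f = scipy.fft.fftn(v)
print(f.dtype)   # complex64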

Related

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB).
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask

@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but in general it's a good idea to keep workloads to a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
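Following that rule of thumb, here is a minimal sketch (my own, not from the answer above) of a more memory-safe black-box function: since only the last row is returned anyway, it builds the frame in chunks so each task's peak memory stays far below the worker limit. The chunk size of 1e7 rows is an arbitrary choice.
import numpy as np
import pandas as pd
import dask

@dask.delayed
def do_pandas_thing_chunked(id, N=int(1.5e8), chunk=int(1e7)):
    # Work through the range in chunks; peak memory is ~chunk rows
    # (a few hundred MB) instead of N rows (~2.2 GB).
    last_row = None
    for start in range(0, N, chunk):
        stop = min(start + chunk, N)
        df = pd.DataFrame({"a": np.arange(start, stop), "b": np.arange(start, stop)})
        last_row = df.iloc[[-1]]
    return last_row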

How to calculate a very large correlation matrix

I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
    for i2 in range(i1+1):
        r = np.corrcoef(z[i1,:], z[i2,:])[0,1]
        if r > 0.95:
            file.write("%6d %6d %.3f\n" % (i1, i2, r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?
After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my Macbook), it still took a very long time (at least a day), presumably due to inefficient element-by-element writes to the output file. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py

# read nparray, dimensions (102000, 60)
infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()

# z-normalize the data -- first compute means and standard deviations
xave = np.average(x, axis=1)
xstd = np.std(x, axis=1)

# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd

# transpose back
z = ztrans.T

# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
arr /= z.shape[1]   # divide by the number of observations per row (60), not the number of rows

# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
    hf.create_dataset("correlation", data=arr)
From my rough calculations, you want a correlation matrix that has 100,000^2 elements. That takes up around 40 GB of memory for 32-bit floats (and twice that for the float64 numpy uses by default).
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero mean data the covariance can be found using a dot product.
z0 = z - np.mean(z, axis=1)[:, None]
cov = np.dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation by normalizing by the variances
sigma = np.std(z, axis=1)
corr = cov
corr /= sigma
corr /= sigma[:, None]
Of course memory usage is still an issue.
You can work around this with memory-mapped arrays (make sure they're opened for reading and writing) and the out parameter of dot (for another example, see Optimizing my large data code with little RAM).
N = z.shape[0]
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N, N))
np.dot(z0, z0.T, out=arr)   # note: z0 must also be float32, since dot's out array has to match the result dtype
arr /= z.shape[-1]          # divide by the number of observations, as above
arr /= sigma
arr /= sigma[:, None]
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory).
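A rough sketch of that final loop (my addition, reusing arr, N, and the 0.95 threshold from above): scan the memmap in row blocks so the comparison never materializes the full boolean matrix, and write the hits in the same format as the question.
import numpy as np

block = 1000  # rows per block; tune to available memory (output file name is arbitrary)
with open('high_corr.txt', 'w') as out:
    for start in range(0, N, block):
        chunk = np.asarray(arr[start:start + block])   # pull one block of rows into memory
        rows, cols = np.nonzero(chunk > 0.95)
        for r, c in zip(rows, cols):
            i = start + r
            if c < i:                                  # lower triangle only, skip the diagonal
                out.write("%6d %6d %.3f\n" % (i, c, chunk[r, c]))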
You can use scipy.spatial.distance.pdist with metric = correlation to get all the correlations without the symmetric terms. Unfortunately this will still leave you with about 5e9 terms, which will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
Your best bet is probably brute forcing blocks of data using scipy.spatial.distance.cdist(..., metric = correlation), keeping only the high correlations in each block (see the sketch below). Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture, it should be much faster than doing one pair at a time.
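A minimal sketch of that blockwise idea (my own, not the answerer's code), using the fact that correlation distance is 1 - r, so r > 0.95 corresponds to a cdist value below 0.05:
import numpy as np
from scipy.spatial.distance import cdist

def high_correlations(z, threshold=0.95, block=2000):
    """Yield (i, j, r) with i > j and correlation r > threshold, block by block."""
    n = z.shape[0]
    for i0 in range(0, n, block):
        zi = z[i0:i0 + block]
        for j0 in range(0, i0 + block, block):              # only blocks on or below the diagonal
            zj = z[j0:j0 + block]
            r = 1.0 - cdist(zi, zj, metric='correlation')   # correlation distance = 1 - r
            for a, b in zip(*np.nonzero(r > threshold)):
                i, j = i0 + a, j0 + b
                if j < i:                                   # strict lower triangle
                    yield i, j, r[a, b]
Each iteration only needs a block x block distance matrix in memory, so the peak footprint is controlled by the block size rather than by the 100,000 x 100,000 result.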
Please check out the deepgraph package.
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried it on z.shape = (2500, 60), computing pearsonr for all 2500 x 2500 pairs, and it was extremely fast.
Not sure about 100000 x 100000, but it's worth trying.

Impossible to sparse or pickle DataFrame (kernel crash)

The problem is very strange.
I create a pandas DataFrame like this (my index has 4 levels):
df = pd.DataFrame(np.zeros((300000, 300000)), index=index, columns=index)
The matrix is built successfully when I use np.zeros (without it my kernel crashes), but it is impossible to pickle it or to make it sparse. Python uses almost 60 GB of memory on my Mac with 8 GB of RAM. I also tried a cluster with more than 60 GB of RAM. Why is such a simple matrix impossible to manage? Am I doing something wrong?
SparseDataFrames (SDFs) are row-based, so building an SDF with such a wide columns index is the wrong way to go about it.
See: https://github.com/pandas-dev/pandas/issues/16197
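If the matrix really is mostly zeros, one alternative (my suggestion, not from the linked issue) is to keep it in a scipy.sparse format instead of a dense DataFrame; an essentially empty 300000 x 300000 sparse matrix costs next to nothing, whereas the dense float64 version would need roughly 720 GB.
import scipy.sparse as sp

n = 300000
m = sp.lil_matrix((n, n), dtype='float64')   # stores only the nonzero entries
m[12, 34] = 1.5                              # fill in whatever entries are actually needed
m = m.tocsr()                                # CSR is better for arithmetic and saving
sp.save_npz('matrix.npz', m)                 # file name is arbitrary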

Fastest Cython implementation depends on computer?

I am converting a Python script to Cython and optimizing it for more speed. Right now I have 2 versions: on my desktop V2 is twice as fast as V1; unfortunately, on my laptop V1 is twice as fast as V2, and I am unable to find out why there is such a big difference.
Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4, 64-bit, 16GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2, 64-bit, 8GB RAM
V1 - you can find the full code here. The only changes made are renaming go.py and preprocessing.py to go.pyx and preprocessing.pyx and using
import pyximport; pyximport.install() to compile them. You can run test.py. This version uses a 2d numpy array board to store data in go.pyx and a list comprehension in the get_board function in preprocessing.pyx to process the data. During the test no function is called from go.py; only the numpy array board is used.
V2 - you can find the full code here. Quite some stuff has changed; below you can find a list with everything affecting this test case. Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
the 2d numpy array is replaced by:
cdef char board[ 362 ]
and the function get_board_feature in go.pyx replaces the numpy list comprehension:
cdef char get_board_feature( self, short location ):
    # return correct board feature value
    # 0 active player stone
    # 1 opponent stone
    # 2 empty location
    cdef char value = self.board[ location ]

    if value == EMPTY:
        return 2

    if value == self.player_current:
        return 0

    return 1
get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int get_board(self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
    """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
       always refers to the current player and plane 1 to the opponent
    """
    cdef short location

    for location in range( 0, state.size * state.size ):
        tensor[ offSet + state.get_board_feature( location ), location ] = 1

    return offSet + 3
Please let me know if I should include any other information or run certain tests.
cmp, diff test
the V2 go.c and preprocessing.c files are identical.
V1 does not generate a .c file to compare
update compared .so files
the V2 go.so files are different:
goD.so goL.so differ: byte 473, line 1
the preprocessing.so files are identical; not sure what to think of that.
They are two different machines and behave differently. There's a reason why processor reviews use large benchmark suites. It could be said that the desktop CPU performs better on average, but the execution times of two small but non-trivial pieces of code don't have to favor the desktop CPU, and differences in execution time definitely don't have to follow any linear relationship. Performance is always dependent on a huge number of factors. Possible explanations include, but are not limited to, the smaller L1 and L2 caches on the desktop and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.
Generally it's a good idea to concentrate on using good algorithms and to identify and remove bottlenecks when optimizing performance. Trying to stare at benchmarks between different machines will probably only cause a headache.
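As a rough illustration of that advice (my own addition, with a stand-in workload since I don't have the linked code), profiling the test on each machine shows where the time actually goes, which is more useful than comparing total runtimes across different CPUs:
import cProfile
import pstats

def run_test():
    # Stand-in for the real workload; in practice this would call the
    # entry point of test.py from V1 or V2.
    total = 0
    for location in range(361):
        total += location % 3
    return total

cProfile.run('run_test()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)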

numpy correlation coefficient: np.dot(A, A.T) on large arrays causing seg fault

NOTE:
Speed is not as important as getting a final result.
However, some speed up over worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoef # internally casts the results as doubles
I just borrowed their code and wrote my own cov/corr without casting into doubles, since I really only need 32-bit floats. And I ditched the conj() since my data are always real.
cov = A.dot(A.T)/n   # where A is an array of 32 bit floats
diag = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(diag, diag))
I still run out of memory, and I'm using a large-memory machine (264GB).
I've been told that the fast C libraries probably use a routine which breaks the dot product up into pieces, and that to optimize this, the number of elements is padded to a power of 2.
I don't really need to compute the symmetric half of the correlation coefficient matrix.
However, I don't see a way to do this in a reasonable amount of time "manually" with python loops.
Does anybody know of a way to ask numpy for a decent dot product routine that balances memory usage with speed?
Cheers
UPDATE:
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
http://wiki.scipy.org/PerformanceTips
Not sure that I follow it....so, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
TIA
EDIT: I apologize because my array is really much bigger than I thought.
The array size is actually 151,000 x 265.
I'm running out of memory on a machine with 264 GB, with at least 230 GB free.
I'm surprised that the numpy call to BLAS dgemm, even being careful with C-order arrays, didn't do squat.
Python compiled with Intel's MKL will run this with 12GB of memory in about 30 seconds:
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> A.dot(A.T)
array([[ 86.54410553, 64.25226593, 67.24698639, ..., 68.5118103 ,
64.57299805, 66.69223785],
...,
[ 66.69223785, 62.01016235, 67.35866547, ..., 66.66306305,
65.75863647, 86.3017807 ]], dtype=float32)
If you do not have access to Intel's MKL, download the Anaconda Python distribution and install the accelerate package, which has a 30-day trial version, is free for academics, and contains an MKL build. Various other C++ BLAS libraries should work as well; even if one copies the array from C to F order, it should not take more than ~30GB of memory.
The only thing I can think of is that your installation is trying to hold the entire 50,000 x 50,000 x 265 array in memory, which is quite frankly terrible. For reference, a float32 50,000 x 50,000 array is only 10GB, while the aforementioned array is 2.6TB...
If its a gemm issue you can try a chunk gemm formula:
def chunk_gemm(A, B, csize):
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for i in xrange(0, A.shape[0], csize):
        iend = i + csize
        for j in xrange(0, B.shape[1], csize):
            jend = j + csize
            out[i:iend, j:jend] = np.dot(A[i:iend], B[:, j:jend])
    return out
This will be slower, but will hopefully get over your memory issues.
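For the use case in the question, the call would be a drop-in replacement for the full dot product (a sketch; A, n, and the normalization lines are taken from the question's own snippet, and csize=10000 is an arbitrary starting point):
cov = chunk_gemm(A, A.T, csize=10000)   # in place of A.dot(A.T)
cov /= n
diag = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(diag, diag))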
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS-optimized routines, which sometimes require copies of arrays to be in Fortran order; I'm not sure if that's the case here. einsum will buffer its inputs and use vectorized SIMD operations where possible, but outside of that it is basically going to run the naive three nested loops to compute the matrix product.
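If you want to check before committing to the full 151,000-row run, a quick micro-benchmark on a smaller slice (my addition; sizes are arbitrary) will tell you which one wins on your build:
import numpy as np
from timeit import timeit

A = np.random.rand(5000, 265).astype(np.float32)

t_dot = timeit(lambda: A.dot(A.T), number=3)
t_einsum = timeit(lambda: np.einsum('ij,kj->ik', A, A), number=3)
print("dot:    %.2f s" % t_dot)
print("einsum: %.2f s" % t_einsum)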
UPDATE: It turns out the dot product completed without error, but upon careful inspection the output array consists of zeros from column 95,000 to the end of the 151,000 columns.
That is, out[:,94999] is non-zero but out[:,95000] is 0 for all rows...
This is super annoying...
Another Blas description
The exchange mentions something that I thought about too... since BLAS is Fortran, shouldn't the order of the input be F order? Whereas the scipy doc page below says C order.
Trying F order caused a segmentation fault, so I'm back to square one.
ORIGINAL POST
I finally tracked down my problem, which was in the details as usual.
I'm using arrays of np.float32 which were stored in F order. I can't control the F order, to my knowledge, since the data is loaded from images using an imaging library.
import numpy as np
import scipy.linalg.blas

roi = np.ascontiguousarray( roi )  # force C order; see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 BLAS routine does the trick. My problem was twofold:
roi.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And... I was using BLAS dgemm, NOT sgemm. The 'd' is for 'double' and the 's' for 'single'.
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed...I went back and read the wikipedia article on blas routines to understand level 3 vs other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, performing A.dot(A.T).
Thanks everyone for your thoughts...knowing that it could be done was most important.
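Pulling the pieces together, here is a sketch of the whole float32 correlation pipeline as I understand it from this thread (array sizes reduced so it runs anywhere; the sgemm call is the one from the snippet above, everything else is my own wording):
import numpy as np
import scipy.linalg.blas

# Stand-in data: (n_vars, n_obs), e.g. 20,000 variables with 265 observations each.
A = np.random.rand(20000, 265).astype(np.float32)

# z-score each row so the scaled dot product gives correlation coefficients directly.
z = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
z = np.ascontiguousarray(z, dtype=np.float32)   # sgemm needs single precision, C order

# corr[i, j] = (1 / n_obs) * sum_k z[i, k] * z[j, k]
corr = scipy.linalg.blas.sgemm(alpha=1.0 / z.shape[1], a=z, b=z, trans_b=True)
print(corr.shape, corr.dtype)   # (20000, 20000) float32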