I parallelized my pandas.apply with dask.map_partitions and it works fine on a test DataFrame (600 rows), cutting my calculation from 6 to 2 minutes and computing the inner function as expected.
However, when I run the full set (6000 rows) the code gets unresponsive, the CPU and memory appear fully loaded, and the process ultimately gets killed:
import dask.dataframe as dd
# df, params, inData and my_fun are defined earlier in my code
ddf = dd.from_pandas(df, npartitions=params.parallel)
applied = ddf.map_partitions(
    lambda dframe: dframe.apply(lambda x: my_fun(inData, x), axis=1)
).compute(scheduler='processes')
After 15 minutes (BTW, a simple pandas.apply on the full dataset takes ca. 15 minutes) I get this from the console:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
I am running macOS, Python 3.7, calling from PyCharm - any hints?
PS. I tried both a few partitions (2, 4) and a lot (32 - since I thought maybe it was a memory issue).
PS2. I pass quite big objects in inData to my apply function. In classic pandas that does not matter, since it is applied sequentially, but here I think it may be the issue (e.g. I have a network graph of ca. 700 MB in there) - can that be it?
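A quick way to see how much data gets shipped to each worker process (my own diagnostic sketch; it assumes my_fun and inData are defined as in the snippet above and uses cloudpickle, which dask relies on for serializing functions):

import cloudpickle

# serialize the same callable that map_partitions sends to each process;
# the captured inData (e.g. the ~700 MB graph) is included in the payload
payload = cloudpickle.dumps(
    lambda dframe: dframe.apply(lambda x: my_fun(inData, x), axis=1)
)
print(f"~{len(payload) / 1e6:.0f} MB serialized per task")

If that number is close to the size of the graph, every partition ships its own copy to its worker process, which multiplies the memory use across processes.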
Related
I have recently upgraded my Intel MacBook Pro 13" to a MacBook Pro 14" with M1 Pro. Been working hard on getting my software to compile and work again. No big issues fortunately, except for floating point problems in some obscure fortran code and in python. With regard to python/numpy I have the following question.
I have a large code base, but for simplicity I will use this simple function, which converts flight level to pressure, to show the issue.
import numpy as np

def fl2pres(FL):
    P0 = 101325
    T0 = 288.15
    T1 = 216.65
    g = 9.80665
    R = 287.0528742
    GAMMA = 0.0065
    P11 = P0 * np.exp(-g / GAMMA / R * np.log(T0 / T1))
    h = FL * 30.48
    return np.where(h <= 11000,
                    P0 * np.exp(-g / GAMMA / R * np.log(T0 / (T0 - GAMMA * h))),
                    P11 * np.exp(-g / R / T1 * (h - 11000)))
When I run the code on my M1 Pro, I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[2]: array([18753.90334892, 46563.239766  ])
and;
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.90234375, 46563.25080916])
Doing the same on my older Intel MacBook Pro I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[2]: array([18753.90334892, 46563.239766 ])
and;
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.90429688, 46563.24778944])
The float64 calculations match but the float32 ones do not. We use float32 quite a lot throughout our code for memory optimisation. I understand that due to architectural differences this sort of floating point error can occur, but I was wondering whether a simple fix is possible, as currently some unit tests fail. I could include the architecture in these tests but am hoping for an easier solution.
Converting all inputs to float64 makes my unit tests pass and hence fixes this issue, but since we have quite large arrays and dataframes, the impact on memory is unwanted.
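One possibility (my own sketch, not a confirmed fix) is to keep float32 in the production code but loosen the unit tests to a relative tolerance of a few float32 ULP against a float64 reference, instead of requiring bit-identical results:

import numpy as np

# compare the float32 result against the float64 reference with a small
# relative tolerance; differences of 1-2 ULP on either architecture pass
reference = fl2pres(np.float64([400, 200]))
result = fl2pres(np.float32([400, 200]))
np.testing.assert_allclose(result, reference, rtol=5e-6)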
Both laptops run python 3.9.10 installed through homebrew, pandas 1.4.1 and numpy 1.22.3 (installed to map against accelerate and blas).
EDIT
I have changed the function to print intermediate values to see where the changes occur:
def fl2pres(FL):
    P0 = 101325
    T0 = 288.15
    T1 = 216.65
    g = 9.80665
    R = 287.0528742
    GAMMA = 0.0065
    P11 = P0 * np.exp(-g / GAMMA / R * np.log(T0 / T1))
    h = FL * 30.48
    A = np.log(T0 / (T0 - GAMMA * h))
    B = np.exp(-g / GAMMA / R * A)
    C = np.exp(-g / R / T1 * (h - 11000))
    print(f"P11:{P11}, h:{h}, A:{A}, B:{B}, C:{C}")
    return np.where(h <= 11000, P0 * B, P11 * C)
Running this function with the same input as above for the float32 case, I get on M1 Pro:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161594 0.14793371], B:[0.1844504 0.45954345], C:[0.82864394 2.16691503]
array([18753.90334892, 46563.239766 ])
On Intel:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161596 0.14793368], B:[0.18445034 0.45954353], C:[0.828644 2.166915]
array([18753.90429688, 46563.24778944])
As per the issue I created at numpy's GitHub:
the differences you are experiencing seem to be all within a single "ULP" (unit in the last place), maybe 2? For special math functions, like exp or sin, small errors are unfortunately expected and can be system dependent (both hardware and OS/math libraries).
One thing that might have a slightly larger effect could be the use of SVML that NumPy has on newer machines (i.e. only on the Intel one). That can be disabled at build time using NPY_DISABLE_SVML=1 as an environment variable, but I don't think you can disable its use without building NumPy. (However, right now, it may well be that the M1 machine is the less precise one, or that they are both roughly the same, just different.)
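A quick sanity check (my own addition) that the float32 outputs quoted above indeed differ by roughly one ULP:

import numpy as np

m1    = np.array([18753.90234375, 46563.25080916])   # float32 result on M1 (from above)
intel = np.array([18753.90429688, 46563.24778944])   # float32 result on Intel (from above)

# express the difference in units of float32 spacing at that magnitude
ulps = np.abs(m1 - intel) / np.spacing(np.float32(m1))
print(ulps)   # roughly one ULP per element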
I haven't tried compiling numpy using NPY_DISABLE_SVML=1 and my plan now is to use a docker container that can run on all my platforms and use a single "truth" for my tests.
I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time

import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask


@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )

    # Simulate a "long" computation
    time.sleep(5)

    return df.iloc[[-1]]  # return the last row


if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]

    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but it's a good idea to have workloads that are a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
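One plausible contributor to the per-task peak, sketched below (my illustration, not verified against the original workload): building the DataFrame from a dict of arrays makes pandas copy the data into its own blocks, so for a moment both the source arrays and the DataFrame's copy are alive, roughly doubling the footprint inside the worker.

import numpy as np
import pandas as pd

N = int(1.5e8)
a = np.arange(N, dtype="float64")   # ~1.1 GiB
b = np.arange(N, dtype="float64")   # ~1.1 GiB
# pandas copies the two columns into its own consolidated block (~2.2 GiB more);
# until a and b are released, peak usage is roughly twice the final DataFrame size
df = pd.DataFrame({"a": a, "b": b})
print(f"{df.memory_usage().sum() / 2**30:.2f} GiB")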
I am using the built-in pandas Series interpolate function to linearly interpolate missing values (NaNs) in a time series. However, for long time series (N=10^9) the code never stops running and eventually crashes.
A very simple example is:
import numpy as np
import pandas as pd

N = 10**6
s = pd.Series(np.arange(N))
s[::10] = np.nan  # Set every 10th value to NaN
s = s.interpolate()  # Linearly interpolate NaNs
For N=10^6 the interpolation step takes 0.06 seconds, for N=10^7 it takes 0.6 seconds, for N=10^8 it takes 6 seconds.
The issue starts when N is increased to 10^9 (which is roughly the length of the time series I'm working with). Now my computer just crashes... I'm assuming it's some kind of memory error, without knowing much about the internal workings of computers.
Is there a method to avoid this, without creating a slow for loop and splitting the time series up?
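For reference, a single float64 series of length 10^9 is already ~7.5 GiB, and Series.interpolate allocates masks and intermediate copies on top of that. One alternative sketch for purely linear interpolation (my own, not a drop-in replacement for all of interpolate's options, and the index arrays for the known points are themselves several GB at this scale) is np.interp on the NaN positions:

import numpy as np

def interpolate_nans_inplace(values):
    """Linearly interpolate NaN runs in a 1-D float array, in place."""
    nan_mask = np.isnan(values)
    values[nan_mask] = np.interp(
        np.flatnonzero(nan_mask),    # positions of the missing samples
        np.flatnonzero(~nan_mask),   # positions of the known samples
        values[~nan_mask],           # known values
    )

arr = np.arange(10**6, dtype="float64")
arr[::10] = np.nan
interpolate_nans_inplace(arr)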
The problem is very strange.
I create a pandas matrix like this (my index is a 4-level MultiIndex):
df = pd.DataFrame(np.zeros((300_000, 300_000)), index=index, columns=index)
The matrix is built successfully when I use np.zeros (without it my kernel crashes), but it is impossible to pickle it or to make it sparse. Python uses almost 60 GB of memory on my Mac with 8 GB of RAM. I also tried a cluster with more than 60 GB of RAM. Why is such a simple matrix impossible to manage? Am I doing something wrong?
SparseDataFrames (SDFs) are row-based, so building an SDF with a column index like this is the wrong approach.
See: https://github.com/pandas-dev/pandas/issues/16197
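For scale, a quick back-of-the-envelope calculation (my addition) of what the dense matrix alone requires:

n = 300_000
dense_bytes = n * n * 8            # float64 elements
print(dense_bytes / 1024**3)       # ~670 GiB before any pickling overhead

This would also explain why np.zeros alone appears to work: the operating system typically allocates zero pages lazily, and the memory is only materialized once the data is touched, e.g. while pickling or converting to sparse.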
NOTE:
Speed is not as important as getting a final result.
However, some speed up over worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoef  # internally casts the results as doubles
I just borrowed their code and wrote my own cov/corr, not casting into doubles, since I really only need 32-bit floats. And I ditched the conj() since my data are always real.
cov = A.dot(A.T) / n  # where A is an array of 32-bit floats
diag = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(diag, diag))
I still run out of memory, and I'm using a large-memory machine (264 GB).
I've been told that the fast C libraries probably use a routine which breaks the dot product up into pieces, and that to optimize this, the number of elements is padded to a power of 2.
I don't really need to compute the symmetric half of the correlation coefficient matrix.
However, I don't see a way to do this in a reasonable amount of time doing it "manually" with Python loops.
Does anybody know of a way to ask numpy for a decent dot product routine, that balances memory usage with speed...?
Cheers
UPDATE:
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
http://wiki.scipy.org/PerformanceTips
Not sure that I follow it....so, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
TIA
EDIT: I apologize because my array is really much bigger than I thought.
The array size is actually 151,000 x 265.
I'm running out of memory on a machine with 264 GB, with at least 230 GB free.
I'm surprised that the numpy call to BLAS dgemm, even being careful with C-order arrays, didn't do squat.
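For scale, a quick calculation (my own, not from the original post) of what the full correlation matrix takes:

n = 151_000
print(n * n * 4 / 1024**3)   # float32 output: ~85 GiB
print(n * n * 8 / 1024**3)   # float64 output (what np.corrcoef produces): ~170 GiB

A float64 intermediate plus a float32 copy already approaches the 230 GB that is free, before counting any copies BLAS may make.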
Python compiled with Intel's MKL will run this with 12 GB of memory in about 30 seconds:
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> A.dot(A.T)
array([[ 86.54410553, 64.25226593, 67.24698639, ..., 68.5118103 ,
64.57299805, 66.69223785],
...,
[ 66.69223785, 62.01016235, 67.35866547, ..., 66.66306305,
65.75863647, 86.3017807 ]], dtype=float32)
If you do not have access to Intel's MKL, download Anaconda Python and install the accelerate package, which has a 30-day trial version (free for academics) and contains an MKL build. Various other C/C++ BLAS libraries should work as well - even if one copies the array from C to F order it should not take more than ~30 GB of memory.
The only thing I can think of is that your installation is trying to hold the entire 50,000 x 50,000 x 265 array in memory, which is quite frankly terrible. For reference, a float32 50,000 x 50,000 array is only 10 GB, while the aforementioned array is 2.6 TB...
If it's a GEMM issue, you can try a chunked GEMM approach:
import numpy as np

def chunk_gemm(A, B, csize):
    # multiply block by block so only one csize x csize tile is computed at a time
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for i in range(0, A.shape[0], csize):
        iend = i + csize
        for j in range(0, B.shape[1], csize):
            jend = j + csize
            out[i:iend, j:jend] = np.dot(A[i:iend], B[:, j:jend])
    return out
This will be slower, but will hopefully get over your memory issues.
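A quick way to sanity-check the chunked version against a direct product on a size that fits comfortably in memory (my own usage example):

import numpy as np

A = np.random.rand(1000, 265).astype(np.float32)
out = chunk_gemm(A, A.T, csize=256)
np.testing.assert_allclose(out, A.dot(A.T), rtol=1e-4)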
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order, not sure if that's the case here. einsum will buffer its inputs, and use vectorized SIMD operations where possible, but outside that it is basically going to run the naive three nested loops to compute the matrix product.
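For what it's worth, a small check (my addition) that the einsum expression computes the same product as dot:

import numpy as np

A = np.random.rand(100, 265).astype(np.float32)
np.testing.assert_allclose(np.einsum('ij,kj->ik', A, A), A.dot(A.T), rtol=1e-4)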
UPDATE: Turns out the dot product completed without error, but upon careful inspection the output array consists of zeros from column 95,000 to the end of the 151,000 columns.
That is, out[:,94999] is non-zero but out[:,95000] = 0 for all rows...
This is super annoying...
Another Blas description
The exchange mentions something that I thought about too... since BLAS is Fortran, shouldn't the order of the input be F order? Whereas the scipy doc page below says C order.
Trying F order caused a segmentation fault. So I'm back to square one.
ORIGINAL POST
I finally tracked down my problem, which was in the details as usual.
I'm using arrays of np.float32 which were stored in F order. I can't control the F order, to my knowledge, since the data is loaded from images using an imaging library.
import numpy as np
import scipy.linalg.blas

roi = np.ascontiguousarray(roi)  # see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 BLAS routine does the trick. My problem was twofold:
roi.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And... I was using BLAS dgemm, NOT sgemm. The 'd' is for 'double' and 's' for 'single'.
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed...I went back and read the wikipedia article on blas routines to understand level 3 vs other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, performing:
A.dot(A.T)
Thanks everyone for your thoughts...knowing that it could be done was most important.
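Since only one triangle of the symmetric product is actually needed (as noted earlier), the BLAS symmetric rank-k update may also be worth a look. A sketch under the assumption that scipy exposes ssyrk the same way it exposes sgemm; I have not benchmarked this on the arrays above:

import numpy as np
import scipy.linalg.blas

# ssyrk computes alpha * roi @ roi.T but only fills one triangle of the result,
# doing roughly half the flops of sgemm; the other triangle is left as zeros
roi32 = np.ascontiguousarray(roi, dtype=np.float32)
upper = scipy.linalg.blas.ssyrk(alpha=1.0, a=roi32)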