Fastest Cython implementation depends on computer? - numpy

I am converting a Python script to Cython and optimizing it for speed. Right now I have two versions: on my desktop V2 is twice as fast as V1, but unfortunately on my laptop V1 is twice as fast as V2, and I am unable to find out why there is such a big difference.
Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4, 64-bit, 16 GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2, 64-bit, 8 GB RAM
V1 - you can find the full code here. The only changes made are renaming go.py and preprocessing.py to go.pyx and preprocessing.pyx, and using
import pyximport; pyximport.install() to compile them. You can run test.py. This version uses a 2D numpy array board to store data in go.pyx and a list comprehension in the get_board function in preprocessing.pyx to process the data. During the test no function is called from go.pyx; only the numpy array board is used.
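For reference, a minimal sketch of the V1 pyximport setup described above (the include-dirs argument is an assumption, needed only if the .pyx files cimport numpy; module names follow the post):
import numpy as np
import pyximport
# Compile go.pyx / preprocessing.pyx transparently on first import.
pyximport.install(setup_args={"include_dirs": [np.get_include()]})

import go             # compiled from go.pyx
import preprocessing  # compiled from preprocessing.pyx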
V2 - you can find the full code here. Quite a lot has changed; below is a list of everything affecting this test case (a build-script sketch for the command follows the code below). Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
The 2D numpy array is replaced by:
cdef char board[ 362 ]
and the function get_board_feature in go.pyx replaces the numpy list comprehension:
cdef char get_board_feature( self, short location ):
    # return correct board feature value
    #   0 active player stone
    #   1 opponent stone
    #   2 empty location
    cdef char value = self.board[ location ]

    if value == EMPTY:
        return 2

    if value == self.player_current:
        return 0

    return 1
The get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location:
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int get_board( self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
    """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
       always refers to the current player and plane 1 to the opponent
    """
    cdef short location

    for location in range( 0, state.size * state.size ):
        tensor[ offSet + state.get_board_feature( location ), location ] = 1

    return offSet + 3
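For reference, a hedged sketch of a build script matching the command above (python test.py build_ext --inplace); the real test.py presumably contains this plus the benchmark itself, and the exact cythonize options are assumptions:
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np

# Compile go.pyx and preprocessing.pyx in place when invoked with
# "python test.py build_ext --inplace".
setup(
    ext_modules=cythonize(["go.pyx", "preprocessing.pyx"]),
    include_dirs=[np.get_include()],
)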
Please let me know if I should include any other information or run certain tests.
Update: cmp, diff test
The V2 go.c and preprocessing.c files generated on the two machines are identical.
V1 does not generate a .c file to compare.
Update: compared .so files
The V2 go.so files are different:
goD.so goL.so differ: byte 473, line 1
The preprocessing.so files are identical; I'm not sure what to make of that.

They are two different machines and behave differently. There's a reason why processor reviews use large benchmark suites. It could be said that the desktop CPU performs better on average, but execution times of two small but non-trivial pieces of code do not 'have' to favor the desktop CPU, and the differences in execution time definitely do not have to follow any linear relationship. Performance always depends on a huge number of factors. Possible explanations include, but are not limited to, differences in the cache hierarchy and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.
Generally it's a good idea to concentrate on using good algorithms and to identify and remove bottlenecks when optimizing performance. Trying to stare at benchmarks between different machines will probably only cause a headache.
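To make that last point concrete, a generic profiling sketch (it assumes test.py runs its benchmark at import time; adjust the profiled statement to however the benchmark is actually invoked):
import cProfile
import pstats

# Profile the whole benchmark run and list the 15 most expensive call sites.
cProfile.run("import test", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(15)
Comparing these per-function timings between the desktop and the laptop shows where the two machines actually diverge, which is far more informative than comparing total runtimes.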

Related

Simultaneous matrix vector multiplication with multiprocessing

What I want to do
I have an m x n numpy array A, where m << n, that I want to load on a node where all 20 CPUs on that node can share memory. On each CPU, I want to multiply A by an n x 1 vector v, where the vector v differs on each CPU but matrix A stays the same.
Constraint
Matrix A is sufficiently large so that I cannot load A on each CPU, so I would like to put A in shared node memory. And since A*v is just m x 1, I think I never need to store matrices of size m x n on each CPU (just one copy of A in shared memory).
The references below tell me I can use shared memory with the multiprocessing module.
My question
If I have 1 worker per CPU, can each worker simultaneously compute A x v (where v is different for each worker) using the multiprocessing module?
I am concerned that, since each worker simultaneously accesses the same shared memory, multiprocessing would copy matrix A for each CPU, which would cause memory problems for me.
Edit:
The question has been edited to remove the part about ray.io because I just learned Ray has its own forum, so I am asking the Ray version of this question over there.
References
Shared-memory objects in multiprocessing
How to do parallel programming in Python?
For posterity, here is the answer copy-pasted from discuss.ray.io:
The recommended way to do this would be to wrap these all in one Ray job. The setup is a bit different compared to something like MPI or multiprocessing; instead of you launching python workers explicitly, Ray will automatically start workers and execute tasks for you. Also, all of the tasks and shared-memory objects created during one “application” are associated with the “driver”, aka the process that executes the main python script, so once that script exits, all of the associated tasks and objects will be cleaned up as well.
The code from the other post is a good start:
import ray

@ray.remote
def multiply(A, v):
    return A * v  # Put your worker code here.

A_ref = ray.put(A)  # Put A in Ray's shared-memory object store.
refs = [multiply.remote(A_ref, v) for v in vs]
results = ray.get(refs)
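For completeness, since the question originally asked about the multiprocessing module: a minimal sketch using multiprocessing.shared_memory (Python 3.8+), where every worker attaches to the single shared copy of A instead of receiving its own. Shapes, dtypes, and the worker count are illustrative assumptions.
import numpy as np
from multiprocessing import Pool
from multiprocessing import shared_memory

M, N = 100, 10_000

def worker(args):
    shm_name, v = args
    # Attach to the existing shared block; no per-worker copy of A is made.
    shm = shared_memory.SharedMemory(name=shm_name)
    A = np.ndarray((M, N), dtype=np.float64, buffer=shm.buf)
    result = A @ v      # m-length result, small enough to return directly
    del A               # drop the view before closing the buffer
    shm.close()
    return result

if __name__ == "__main__":
    A = np.random.rand(M, N)
    shm = shared_memory.SharedMemory(create=True, size=A.nbytes)
    np.ndarray(A.shape, dtype=A.dtype, buffer=shm.buf)[:] = A  # copy A in once

    vs = [np.random.rand(N) for _ in range(4)]
    with Pool(processes=4) as pool:
        results = pool.map(worker, [(shm.name, v) for v in vs])

    shm.close()
    shm.unlink()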

np.float32 floating point differences between intel MacBook and M1

I have recently upgraded my Intel MacBook Pro 13" to a MacBook Pro 14" with M1 Pro. I have been working hard on getting my software to compile and work again. Fortunately there were no big issues, except for floating point problems in some obscure Fortran code and in Python. With regard to python/numpy I have the following question.
I have a large code base, but for simplicity I will use this simple function, which converts flight level to pressure, to show the issue.
def fl2pres(FL):
    P0 = 101325
    T0 = 288.15
    T1 = 216.65
    g = 9.80665
    R = 287.0528742
    GAMMA = 0.0065
    P11 = P0 * np.exp(-g / GAMMA / R * np.log(T0 / T1))
    h = FL * 30.48
    return np.where(h <= 11000,
                    P0 * np.exp(-g / GAMMA / R * np.log(T0 / (T0 - GAMMA * h))),
                    P11 * np.exp(-g / R / T1 * (h - 11000)))
When I run the code on my M1 Pro, I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[2]: array([18753.90334892, 46563.239766 ])
and;
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.90234375, 46563.25080916])
Doing the same on my older Intel MacBook Pro I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[2]: array([18753.90334892, 46563.239766 ])
and;
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.904296888, 46563.24778944])
The float64 calculations match but the float32 ones do not. We use float32 quite a lot throughout our code for memory optimisation. I understand that this sort of floating point difference can occur due to architectural differences, but I was wondering whether a simple fix is possible, as currently some unit tests fail. I could make these tests architecture-dependent, but I am hoping for an easier solution.
Converting all inputs to float64 makes my unit tests pass and hence fixes the issue, but since we have quite a few large arrays and dataframes, the impact on memory is unwanted.
Both laptops run python 3.9.10 installed through homebrew, pandas 1.4.1 and numpy 1.22.3 (installed to map against accelerate and blas).
EDIT
I have changed the function to print intermediate values to see where the differences occur:
def fl2pres(FL):
    P0 = 101325
    T0 = 288.15
    T1 = 216.65
    g = 9.80665
    R = 287.0528742
    GAMMA = 0.0065
    P11 = P0 * np.exp(-g / GAMMA / R * np.log(T0 / T1))
    h = FL * 30.48
    A = np.log(T0 / (T0 - GAMMA * h))
    B = np.exp(-g / GAMMA / R * A)
    C = np.exp(-g / R / T1 * (h - 11000))
    print(f"P11:{P11}, h:{h}, A:{A}, B:{B}, C:{C}")
    return np.where(h <= 11000, P0 * B, P11 * C)
Running this function with the same input as above for the float32 case, I get on M1 Pro:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161594 0.14793371], B:[0.1844504 0.45954345], C:[0.82864394 2.16691503]
array([18753.90334892, 46563.239766 ])
On Intel:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161596 0.14793368], B:[0.18445034 0.45954353], C:[0.828644 2.166915]
array([18753.90429688, 46563.24778944])
As per the issue I created at numpy's GitHub:
The differences you are experiencing seem to be all within a single "ULP" (unit in the last place), maybe 2? For special math functions, like exp or sin, small errors are unfortunately expected and can be system dependent (both hardware and OS/math libraries).
One thing that might have a slightly larger effect could be the use of SVML, which NumPy uses on newer machines (i.e. only on the Intel one). That can be disabled at build time using NPY_DISABLE_SVML=1 as an environment variable, but I don't think you can disable its use without rebuilding NumPy. (However, right now, it may well be that the M1 machine is the less precise one, or that they are both roughly the same, just different.)
I haven't tried compiling numpy with NPY_DISABLE_SVML=1; my plan now is to use a Docker container that can run on all my platforms and serve as a single "truth" for my tests.
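Not part of the thread above, but one way to make such unit tests tolerant of architecture-level differences without converting to float64 is to compare at the relative-error/ULP level. A sketch using the fl2pres function from the question (the reference values below are illustrative assumptions):
import numpy as np

expected = np.float32([18753.904, 46563.25])   # illustrative reference values
actual = fl2pres(np.float32([400, 200]))

# One or two ULPs correspond to a relative error of a few times float32
# machine epsilon (~1.19e-7), so a small rtol comfortably covers them.
np.testing.assert_allclose(actual, expected, rtol=1e-6)

# Alternatively, compare in units in the last place directly:
np.testing.assert_array_max_ulp(actual, expected, maxulp=4)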

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8, since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask

@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )

    # Simulate a "long" computation
    time.sleep(5)

    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but it's a good idea to have workloads that are a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
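As a hedged illustration of that rule of thumb applied to the posted function: shrinking the per-task object keeps each task well below the 4 GB worker limit. Here this is done with a 32-bit dtype instead of the 64-bit values the posted code creates; the dtype choice is an assumption and only applies if the real values fit in 32 bits.
import numpy as np
import pandas as pd

N = int(1.5e8)
df = pd.DataFrame({
    "a": np.arange(N, dtype=np.int32),  # 4 bytes per value instead of 8
    "b": np.arange(N, dtype=np.int32),
})
print(f"df memory usage {df.memory_usage().sum() / 2**30:.3f} GB")  # ~1.1 GB instead of ~2.2 GB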

numpy correlation coefficient: np.dot(A, A.T) on large arrays causing seg fault

NOTE:
Speed is not as important as getting a final result.
However, some speed-up over the worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoef  # internally casts the results as doubles
I just borrowed its code and wrote my own cov/corr without casting to doubles, since I really only need 32-bit floats. And I ditched the conj() since my data are always real.
cov = A.dot(A.T) / n  # where A is an array of 32-bit floats
d = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(d, d))
I still run out of memory, and I'm using a large-memory machine with 264 GB.
I've been told that the fast C libraries probably use a routine which breaks the dot product up into pieces, and that to optimize this the number of elements is padded to a power of two.
I don't really need to compute the symmetric half of the correlation coefficient matrix.
However, I don't see a way to do this in a reasonable amount of time by doing it "manually" with Python loops.
Does anybody know of a way to ask numpy for a decent dot product routine that balances memory usage with speed?
Cheers
UPDATE:
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
http://wiki.scipy.org/PerformanceTips
Not sure that I follow it....so, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
TIA
EDIT: I apologize; my array is really much bigger than I thought.
The array size is actually 151,000 x 265.
I'm running out of memory on a machine with 264 GB, with at least 230 GB free.
I'm surprised that the numpy call to BLAS dgemm, even being careful with C-order arrays, didn't help at all.
Python compiled with Intel's MKL will run this with 12 GB of memory in about 30 seconds:
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> A.dot(A.T)
array([[ 86.54410553, 64.25226593, 67.24698639, ..., 68.5118103 ,
64.57299805, 66.69223785],
...,
[ 66.69223785, 62.01016235, 67.35866547, ..., 66.66306305,
65.75863647, 86.3017807 ]], dtype=float32)
If you do not have access to Intel's MKL, download Python Anaconda and install the Accelerate package, which has a 30-day trial version (free for academics) and contains an MKL build. Various other BLAS libraries should work as well; even if one copies the array from C to F order, it should not take more than ~30 GB of memory.
The only thing I can think of is that your installation is trying to hold the entire 50,000 x 50,000 x 265 intermediate array in memory, which is quite frankly terrible. For reference, a float32 50,000 x 50,000 array is only 10 GB, while the aforementioned array is 2.6 TB...
If it's a GEMM issue, you can try a chunked GEMM:
def chunk_gemm(A, B, csize):
    # Build the full product block by block so only one csize x csize
    # block is computed at a time.
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for i in xrange(0, A.shape[0], csize):  # xrange under Python 2; use range on Python 3
        iend = i + csize
        for j in xrange(0, B.shape[1], csize):
            jend = j + csize
            out[i:iend, j:jend] = np.dot(A[i:iend], B[:, j:jend])
    return out
This will be slower, but will hopefully get over your memory issues.
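A hypothetical usage of chunk_gemm above on the smaller case from the question (swap xrange for range on Python 3); csize is an assumption and can be tuned so one csize x csize float32 block fits comfortably in memory:
import numpy as np

A = np.random.rand(20000, 265).astype(np.float32)
# Builds the 20,000 x 20,000 float32 Gram matrix (~1.6 GB) one
# 5,000 x 5,000 block (~100 MB) at a time.
G = chunk_gemm(A, A.T, csize=5000)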
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS-optimized routines, which sometimes require copies of arrays to be in Fortran order; I'm not sure if that's the case here. einsum will buffer its inputs and use vectorized SIMD operations where possible, but beyond that it basically runs the naive three nested loops to compute the matrix product.
UPDATE: It turns out the dot product completed without error, but upon careful inspection the output array consists of zeros from column 95,000 to the end of the 151,000 columns.
That is, out[:,94999] is non-zero but out[:,95000] is 0 for all rows...
This is super annoying...
Another Blas description
The exchange mentions something that I thought about too... since BLAS is Fortran, shouldn't the order of the input be F order? Whereas the scipy doc page below says C order.
Trying F order caused a segmentation fault, so I'm back to square one.
ORIGINAL POST
I finally tracked down my problem, which was in the details as usual.
I'm using arrays of np.float32 which were stored in F order. I can't control the F order, to my knowledge, since the data is loaded from images using an imaging library.
import numpy as np
import scipy.linalg.blas

roi = np.ascontiguousarray(roi)  # see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 BLAS routine does the trick. My problem was twofold:
roi.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And... I was using BLAS dgemm, NOT sgemm. The 'd' stands for 'double' and the 's' for 'single' precision.
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed...I went back and read the wikipedia article on blas routines to understand level 3 vs other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, computing:
A.dot(A.T)
Thanks everyone for your thoughts...knowing that it could be done was most important.
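Tying the pieces together, a hedged sketch of going from the sgemm Gram matrix to the correlation matrix entirely in float32, following the same (uncentred) formula used earlier in the question; the array sizes are illustrative:
import numpy as np
import scipy.linalg.blas

A = np.ascontiguousarray(np.random.rand(20000, 265).astype(np.float32))
n = A.shape[1]

# float32 Gram matrix scaled by 1/n, 20,000 x 20,000 (~1.6 GB).
cov = scipy.linalg.blas.sgemm(alpha=1.0 / n, a=A, b=A, trans_b=True)
d = np.sqrt(np.diag(cov))

# Normalise in place to avoid allocating another full-size temporary.
cov /= d[:, None]
cov /= d[None, :]
corr = cov  # correlation matrix, still float32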

Is there a way to reduce scipy/numpy precision to reduce memory consumption?

On my 64-bit Debian/Lenny system (4GByte RAM + 4GByte swap partition) I can successfully do:
v=array(10000*random([512,512,512]),dtype=np.int16)
f=fftn(v)
but with f being np.complex128 the memory consumption is shocking, and I can't do much more with the result (e.g. modulate the coefficients and then f=ifftn(f)) without a MemoryError traceback.
Rather than installing some more RAM and/or expanding my swap partitions, is there some way of controlling the scipy/numpy "default precision" and have it compute a complex64 array instead ?
I know I can just reduce it afterwards with f=array(f,dtype=np.complex64); I'm looking to have it actually do the FFT work in 32-bit precision and half the memory.
It doesn't look like there's any function to do this in scipy's fft functions ( see http://www.astro.rug.nl/efidad/scipy.fftpack.basic.html ).
Unless you're able to find a fixed-point FFT library for Python, it's unlikely that the function you want exists, since the FFT routines work in your hardware's native double precision and produce complex128 output. It does look like you could use the rfft method, which exploits the fact that the input is real-valued and stores only the non-redundant half of the spectrum, and that would save about half your RAM.
I ran the following in interactive python:
>>> from numpy import *
>>> v = array(10000*random.random([512,512,512]),dtype=int16)
>>> shape(v)
(512, 512, 512)
>>> type(v[0,0,0])
<type 'numpy.int16'>
At this point the RSS (Resident Set Size) of python was 265MB.
>>> f = fft.fft(v)
And at this point the RSS of python was 2.3GB.
>>> type(f)
<type 'numpy.ndarray'>
>>> type(f[0,0,0])
<type 'numpy.complex128'>
>>> v = []
And at this point the RSS goes down to 2.0GB, since I've freed up v.
Using "fft.rfft(v)" to compute only the non-redundant half of the spectrum for the real input results in a 1.3GB RSS (almost half, as expected).
Doing:
>>> f = complex64(fft.fft(v))
Is the worst of both worlds, since it first computes the complex128 version (2.3GB) and then copies that into the complex64 version (1.3GB) which means the peak RSS on my machine was 3.6GB, and then it settled down to 1.3GB again.
I think that if you've got 4GB RAM, this should all work just fine (as it does for me). What's the issue?
Scipy 0.8 will have single precision support for almost all the fft code (The code is already in the trunk, so you can install scipy from svn if you need the feature now).
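Since this answer was written, the newer scipy.fft interface (SciPy 1.4 and later) preserves single precision, so the whole transform can be done in 32-bit and returns complex64 directly; a short sketch (the array size is reduced here so it runs quickly):
import numpy as np
import scipy.fft

v = np.array(10000 * np.random.random([128, 128, 128]), dtype=np.int16)

f = scipy.fft.fftn(v.astype(np.float32))  # complex64 output, half the memory of complex128
print(f.dtype)                            # complex64

# For purely real input, rfftn stores only the non-redundant half-spectrum,
# saving roughly another factor of two.
fr = scipy.fft.rfftn(v.astype(np.float32))
print(fr.dtype, fr.shape)                 # complex64 (128, 128, 65)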