np.float32 floating point differences between Intel MacBook and M1 - numpy

I recently upgraded from an Intel MacBook Pro 13" to a MacBook Pro 14" with an M1 Pro, and I have been working hard on getting my software to compile and run again. Fortunately there are no big issues, except for floating-point problems in some obscure Fortran code and in Python. With regard to Python/NumPy I have the following question.
I have a large code base, but for simplicity I will use this simple function, which converts flight level to pressure, to show the issue.
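import numpy as np

def fl2pres(FL):
    P0 = 101325        # sea-level pressure [Pa]
    T0 = 288.15        # sea-level temperature [K]
    T1 = 216.65        # temperature above 11000 m [K]
    g = 9.80665        # gravitational acceleration [m/s^2]
    R = 287.0528742    # specific gas constant of air [J/(kg K)]
    GAMMA = 0.0065     # temperature lapse rate [K/m]
    P11 = P0*np.exp(-g/GAMMA/R*np.log(T0/T1))  # pressure at 11000 m
    h = FL*30.48       # flight level (hundreds of feet) to metres
    return np.where(h <= 11000,
                    P0*np.exp(-g/GAMMA/R*np.log(T0/(T0-GAMMA*h))),
                    P11*np.exp(-g/R/T1*(h-11000)))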
When I run the code on my M1 Pro, I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[2]: array([18753.90334892, 46563.239766 ])
and:
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.90234375, 46563.25080916])
Doing the same on my older Intel MacBook Pro I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[2]: array([18753.90334892, 46563.239766 ])
and:
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.90429688, 46563.24778944])
The float64 calculations match but the float32 ones do not. We use float32 quite a lot throughout our code for memory optimisation. I understand that architectural differences can cause this sort of floating-point discrepancy, but I was wondering whether a simple fix is possible, as some unit tests currently fail. I could include the architecture in these tests, but I am hoping for an easier solution.
Converting all inputs to float64 makes my unit tests pass and hence fixes this issue, but since we have some quite large arrays and dataframes, the impact on memory is unwanted.
Both laptops run Python 3.9.10 installed through Homebrew, pandas 1.4.1 and NumPy 1.22.3 (installed to link against Accelerate and BLAS).
EDIT
I have changed the function to print intermediate values to see where the differences arise:
def fl2pres(FL):
    P0 = 101325
    T0 = 288.15
    T1 = 216.65
    g = 9.80665
    R = 287.0528742
    GAMMA = 0.0065
    P11 = P0*np.exp(-g/GAMMA/R*np.log(T0/T1))
    h = FL*30.48
    A = np.log(T0/(T0-GAMMA*h))
    B = np.exp(-g/GAMMA/R*A)
    C = np.exp(-g/R/T1*(h-11000))
    print(f"P11:{P11}, h:{h}, A:{A}, B:{B}, C:{C}")
    return np.where(h <= 11000, P0*B, P11*C)
Running this function with the same input as above for the float32 case, I get on M1 Pro:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161594 0.14793371], B:[0.1844504 0.45954345], C:[0.82864394 2.16691503]
array([18753.90334892, 46563.239766 ])
On Intel:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161596 0.14793368], B:[0.18445034 0.45954353], C:[0.828644 2.166915]
array([18753.90429688, 46563.24778944])

As per the issue I created at numpy's GitHub:
the differences you are experiencing seem to be all within a single
"ULP" (unit in the last place), maybe 2? For special math functions,
like exp, or sin, small errors are unfortunately expected and can be
system dependent (both hardware and OS/math libraries).
One thing that might have a slightly larger effect
could be use of SVML that NumPy has on newer machines (i.e. only on
the Intel one). That can be disabled at build time using
NPY_DISABLE_SVML=1 as an environment variable, but I don't think you
can disable its use without building NumPy. (However, right now, it
may well be that the M1 machine is the less precise one, or that they
are both roughly the same, just different)
I haven't tried compiling NumPy with NPY_DISABLE_SVML=1; my plan now is to use a Docker container that can run on all my platforms and serves as a single "truth" for my tests.
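Another option, sketched here on the assumption that agreement to within a few float32 ULPs is good enough for the tests, is to compare against the float64 reference with a relative tolerance instead of exact values. The numbers are the outputs quoted in the question; `rtol=1e-6` is an illustrative choice, not a value recommended by the NumPy developers.

```python
import numpy as np

# Outputs quoted in the question.
ref = np.array([18753.90334892, 46563.239766])       # float64 input, both machines
m1 = np.array([18753.90234375, 46563.25080916])      # float32 input, M1 Pro
intel = np.array([18753.90429688, 46563.24778944])   # float32 input, Intel

# Spacing between adjacent float32 values near the first result (one ULP).
ulp = float(np.spacing(np.float32(ref[0])))

# Distance between the two machines for the first element, in float32 ULPs.
ulps_apart = abs(m1[0] - intel[0]) / ulp

# Illustrative tolerance (an assumption): a handful of float32 epsilons.
rtol = 1e-6
np.testing.assert_allclose(m1, ref, rtol=rtol)
np.testing.assert_allclose(intel, ref, rtol=rtol)

print(f"ULP near {ref[0]:.2f}: {ulp}; machines differ by ~{ulps_apart:.2f} ULP")
```

Both float32 results pass against the float64 reference with this tolerance, so a test written this way would not need to know which architecture it runs on.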

Related

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a Dask-delayed decorator on it. In my real use case this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase the limit to memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask


@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row


if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)
    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))
    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but it's a good idea to have workloads that are a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
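To put rough numbers on the "fraction of available memory" advice, here is a back-of-the-envelope sketch for the script in the question. The factor of two for the peak is an assumption made for illustration (one full transient copy while the DataFrame is built from the source arrays); the real overhead depends on the pandas version and on what the function does.

```python
GIB = 2**30

n_rows = int(1.5e8)   # N from the question
n_cols = 2            # columns "a" and "b"
itemsize = 8          # np.arange(1.5e8) yields float64, 8 bytes per value

frame_gib = n_rows * n_cols * itemsize / GIB

# "4GB" is decimal (4e9 bytes), i.e. the 3.73 GiB limit shown in the warning.
limit_gib = 4e9 / GIB

# Assumption for illustration: the source arrays and the finished frame
# briefly coexist, doubling the footprint.
peak_gib = 2 * frame_gib

print(f"frame: {frame_gib:.3f} GiB, limit: {limit_gib:.2f} GiB, "
      f"assumed peak: {peak_gib:.2f} GiB")
```

Under that assumption the transient peak is larger than the worker limit even though the finished dataframe fits comfortably, which would be consistent with the warning reporting high process memory but no stored data.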

An alternative to NCCL on Windows 10

I am on Windows 10 and am now using multiple GPUs to train a machine learning model based on a GAN algorithm. You can check the full code over here :
At one point the gradients need to be summed across the different GPU devices, as follows:
if len(devices) > 1:
    with tf.name_scope('SumAcrossGPUs'), tf.device(None):
        for var_idx, grad_shape in enumerate(self._grad_shapes):
            g = [dev_grads[dev][var_idx][0] for dev in devices]
            if np.prod(grad_shape): # nccl does not support zero-sized tensors
                g = tf.contrib.nccl.all_sum(g)
            for dev, gg in zip(devices, g):
                dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])
In this part I get an error regarding NCCL, which, as I found out, is not supported on Windows and needs Linux, so I am stuck here. What is the workaround? How can I use NCCL on Windows, or replace the code above with an alternative? Is there a simple way to do that? Thanks in advance.
Note: I have already checked some Stack Overflow questions. However, no existing answer solves my problem.
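For readers unfamiliar with the operation being asked about, here is a minimal NumPy sketch of what an all-reduce sum computes: every participant ends up with the element-wise sum of all inputs. The `all_sum` below is a hypothetical stand-in written for illustration, not the TensorFlow function, and it says nothing about how the transfer between GPUs is carried out.

```python
import numpy as np

def all_sum(per_device_grads):
    """Return one copy of the element-wise sum for every input array."""
    total = np.sum(per_device_grads, axis=0)
    return [total.copy() for _ in per_device_grads]

# Two "devices", each holding its own gradient for the same variable.
grads = [np.array([1.0, 2.0]), np.array([10.0, 20.0])]
summed = all_sum(grads)
print(summed)
```

Any replacement for the NCCL call would have to produce the same values, for example by adding the per-device gradients in one place and handing the total back to each device.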

Causes of floating point non-determinism? Including NumPy?

IEEE floating point operations are deterministic, but see How can floating point calculations be made deterministic? for one way that an overall floating point computation can be non-deterministic:
... parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
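The reason the order matters is that floating-point addition is not associative. A minimal illustration in plain Python:

```python
# Same three numbers, two groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)   # False
print(left_to_right, right_to_left)
```

A parallel reduction that groups the terms differently from run to run can therefore return results that differ in the last bits.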
Two-part question:
How else can an overall floating point computation be non-deterministic, yielding results that are not exactly equal?
Consider a single-threaded Python program that calls NumPy, CVXOPT, and SciPy subroutines such as scipy.optimize.fsolve(), which in turn call native libraries like MINPACK and GLPK and optimized linear algebra subroutines like BLAS, ATLAS, and MKL. “If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything.”
Do these native libraries ever parallelize in a way that introduces non-deterministic results?
Assumptions:
The same software, with the same inputs, on the same hardware. The output of multiple runs should be equal.
If that works, it's highly desirable to test that the output after doing a code refactoring is equal. (Yes, some changes in order of operations can make some of the output not-equal.)
All random numbers in the program are pseudo-random numbers used in a consistent way from the same seeds across all runs.
No uninitialized values. Python is generally safe in that way but numpy.empty() returns a new array without initializing entries. And it's not clear that it's much faster in practice. So beware!
@PaulPanzer's test shows that numpy.empty() does return an uninitialized array and it can easily and quickly recycle a recent array:
import numpy as np
np.arange(100); np.empty(100, int); np.empty(100, int)
np.arange(100, 200.0); np.empty(100, float); np.empty(100, float)
It's tricky to get useful timing measurements for these routines! In a timeit loop, numpy.empty() can just keep reallocating the same one or two memory nodes. The time is independent of the array size. To prevent recycling:
from timeit import timeit
timeit('l.append(numpy.empty(100000))', 'import numpy; l = []')
timeit('l.append(numpy.zeros(100000))', 'import numpy; l = []')
but reducing that array size to numpy.zeros(10000) takes 15x as long; reducing it to numpy.zeros(1000) takes 1.3x as long (on my MBP). Puzzling.
See also:
Hash values are salted in Python 3 and each dict preserves insertion order. That could vary the order of operations from run to run. [I'm wrangling with this problem in Python 2.7.15.]
I found that most (not all) of the non-determinism problems I'm experiencing seem to be fixed in the code for OpenBLAS 0.3.5.
A bunch of threading problems in earlier versions of OpenBLAS are fixed in release 0.3.4, but that release has a macOS compatibility bug that's fixed in the code for release 0.3.5. The bugs also occur with Apple's Accelerate framework version 1.1 and Intel's MKL mkl==2019.0.
See how to install OpenBLAS and compile NumPy and SciPy on it.
Perhaps the remaining problems I'm experiencing are due to other libraries linked to Accelerate?
Note: I'm still open to more answers to this question.
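To test the assumption that repeated runs give equal output, it helps to distinguish bit-exact equality from closeness within a tolerance. The helper `bit_identical` below is a hypothetical name introduced for this sketch; it compares dtype, shape and raw bytes.

```python
import numpy as np

def bit_identical(a, b):
    """True only if both arrays have the same dtype, shape and raw bytes."""
    a, b = np.asarray(a), np.asarray(b)
    return a.dtype == b.dtype and a.shape == b.shape and a.tobytes() == b.tobytes()

# The same sum evaluated in two different orders.
x = np.array([(0.1 + 0.2) + 0.3])
y = np.array([0.1 + (0.2 + 0.3)])

print(bit_identical(x, x.copy()))  # True
print(bit_identical(x, y))         # False
print(np.allclose(x, y))           # True
```

A regression test after refactoring can use the strict check where the order of operations is unchanged and fall back to a tolerance where it is not.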

Graph Lasso algorithm: sklearn.covariance.graph_lasso()

I am working on replicating a paper titled “Improving Mean Variance Optimization through Sparse Hedging Restriction”. The authors’ idea is to use the Graphical Lasso algorithm to introduce some bias into the estimation of the inverse of the sample covariance matrix. The graphical lasso algorithm works perfectly fine in R, but when I use Python on the same data with the same parameters I get two sorts of errors:
If I use coordinate descent (cd) mode as the solver, I get a floating-point error: FloatingPointError: Non SPD result: the system is too ill-conditioned for this solver. The system is too ill-conditioned for this solver (the thing that bugs me is that I tried this solver on a simulated positive definite matrix and it gave me this error).
If I use the Least Angle Regression (LARS) mode (which is less stable but recommended for ill-conditioned matrices), I get an overflow error: OverflowError: int too large to convert to float
To my knowledge, unlike C++ and other languages, Python places no upper limit on integers (besides the capacity of the machine itself), whereas floats are limited. I think this might be the source of the latter problem. (I have also heard in the past that R is much more robust in dealing with ill-conditioned matrices.) I would be glad to hear about your experience with graph lasso in R or Python.
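The integer/float asymmetry described above can be reproduced in isolation. This sketch only shows how that error message arises in plain Python; where exactly it is triggered inside the LARS code path is not established here.

```python
import sys

big = 10**400                # Python ints have arbitrary precision
print(sys.float_info.max)    # largest finite float, about 1.8e308

try:
    float(big)               # too large for a double
    overflowed = False
    message = ""
except OverflowError as exc:
    overflowed = True
    message = str(exc)

print(overflowed, message)
```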
With this email, I have attached a little Python code that reproduces this problem in a few lines. Any input would be greatly appreciated.
Thank you all,
Skander
from sklearn.covariance import graph_lasso
from sklearn.datasets import make_spd_matrix
symetric_PD_mx= make_spd_matrix(100)
glout = graph_lasso(emp_cov=symetric_PD_mx, alpha=0.01,mode="lars")

numpy correlation coefficient: np.dot(A, A.T) on large arrays causing seg fault

NOTE:
Speed is not as important as getting a final result.
However, some speed up over worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoef # internally casts the results as doubles
I just borrowed their code and wrote my own cov/corr that does not cast to doubles, since I really only need 32-bit floats. I also dropped the conj(), since my data are always real.
cov = A.dot(A.T) / n  # A is an array of 32-bit floats
d = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(d, d))
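A self-contained version of this idea, checked against np.corrcoef on a small array, might look as follows. The function name `corrcoef_f32` and the test data are made up for illustration. The rows are centred explicitly because np.corrcoef does so; the snippet above omits that step, presumably because the data were already centred, which is an assumption.

```python
import numpy as np

def corrcoef_f32(A):
    """Row-wise correlation coefficients, kept in float32 throughout."""
    A = np.asarray(A, dtype=np.float32)
    Ac = A - A.mean(axis=1, keepdims=True)   # centre each row
    cov = Ac @ Ac.T                          # the 1/n factor cancels in the ratio
    d = np.diag(cov)
    return cov / np.sqrt(np.multiply.outer(d, d))

rng = np.random.default_rng(0)
A = rng.random((50, 265)).astype(np.float32)
corr = corrcoef_f32(A)

print(corr.dtype, corr.shape)
print(np.allclose(corr, np.corrcoef(A), atol=1e-4))
```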
I still run out of memory and I'm using a large memory machine, 264GB
I've been told, that the fast C libraries, are probably using a routine which breaks the
dot product up into pieces, and to optimize this, the number of elements is padded to a power of 2.
I don't really need to compute the symmetric half of the correlation coefficient matrix.
However, I don't see a way to do this in reasonable amount of time doing it "manually", with python loops.
Does anybody know of a way to ask numpy for a decent dot product routine, that balances memory usage with speed...?
Cheers
UPDATE:
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
http://wiki.scipy.org/PerformanceTips
Not sure that I follow it....so, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
TIA
EDIT: I apologize because my array is really much bigger than I thought.
array size is actually 151,000 x 265
I'm running out of memory on a machine with 264 GB with at least 230 GB free.
I'm surprised that the numpy call to blas dgemm and being careful with C order arrays
didn't do squat.
Python compiled with intel's mkl will run this with 12GB of memory in about 30 seconds:
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> A.dot(A.T)
array([[ 86.54410553, 64.25226593, 67.24698639, ..., 68.5118103 ,
64.57299805, 66.69223785],
...,
[ 66.69223785, 62.01016235, 67.35866547, ..., 66.66306305,
65.75863647, 86.3017807 ]], dtype=float32)
If you do not have access to Intel's MKL, download Anaconda Python and install the accelerate package, which contains an MKL build and has a 30-day trial version or is free for academics. Various other C++ BLAS libraries should also work; even if one copies the array from C to F order, it should not take more than ~30GB of memory.
The only thing I can think of is that your installation is trying to hold the entire 50,000 x 50,000 x 265 array in memory, which is quite frankly terrible. For reference, a float32 50,000 x 50,000 array is only 10GB, while the aforementioned array is 2.6TB...
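The sizes quoted here are plain arithmetic and can be checked without allocating anything. The sketch below also includes the 151,000-row case from the question's edit; `array_gb` is a helper name invented for this example.

```python
def array_gb(shape, itemsize):
    """Size in decimal gigabytes of a dense array with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize / 1e9

out_50k = array_gb((50_000, 50_000), 4)          # float32 result
cube_50k = array_gb((50_000, 50_000, 265), 4)    # the pathological intermediate
out_151k_f32 = array_gb((151_000, 151_000), 4)   # float32 result, larger case
out_151k_f64 = array_gb((151_000, 151_000), 8)   # same result in float64

print(out_50k, cube_50k, out_151k_f32, out_151k_f64)
```

For the 151,000-row case the output alone is roughly 91 GB in float32 and roughly 182 GB in float64, so an unintended upcast to double precision would use most of a 264 GB machine before any temporaries are counted.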
If it's a gemm issue, you can try a chunked gemm:
import numpy as np

def chunk_gemm(A, B, csize):
    # Compute A @ B block by block to limit the size of temporaries.
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for i in range(0, A.shape[0], csize):
        iend = i + csize
        for j in range(0, B.shape[1], csize):
            jend = j + csize
            out[i:iend, j:jend] = np.dot(A[i:iend], B[:, j:jend])
    return out
This will be slower, but will hopefully get over your memory issues.
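Since the question notes that only half of the symmetric result is really needed, the chunked idea can be specialised to the product of A with its own transpose: compute the blocks on and above the diagonal and mirror the rest. This is a sketch only; `chunked_gram` is a hypothetical helper, and it still allocates the full output array.

```python
import numpy as np

def chunked_gram(A, csize):
    """Compute A @ A.T block by block, reusing symmetry for the lower half."""
    n = A.shape[0]
    out = np.empty((n, n), dtype=A.dtype)
    for i in range(0, n, csize):
        for j in range(i, n, csize):
            block = A[i:i + csize] @ A[j:j + csize].T
            out[i:i + csize, j:j + csize] = block
            if j != i:
                # Mirror the off-diagonal block instead of recomputing it.
                out[j:j + csize, i:i + csize] = block.T
    return out

rng = np.random.default_rng(0)
A = rng.random((23, 265)).astype(np.float32)
G = chunked_gram(A, csize=7)

print(G.dtype, np.allclose(G, A @ A.T, rtol=1e-4))
```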
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order, not sure if that's the case here. einsum will buffer its inputs, and use vectorized SIMD operations where possible, but outside that it is basically going to run the naive three nested loops to compute the matrix product.
UPDATE: It turns out the dot product completed without error, but upon careful inspection
the output array consists of zeros from column 95,000 to the end of the 151,000 columns.
That is, out[:,94999] is non-zero but out[:,95000] = 0 for all rows...
This is super annoying...
Another Blas description
The exchange mentions something that I thought about too: since BLAS is Fortran, shouldn't
the order of the input be F order? The scipy doc page below, however, says C order.
Trying F order caused a segmentation fault. So I'm back to square one.
ORIGINAL POST
I finally tracked down my problem, which was in the details as usual.
I'm using an array of np.float32 which were stored as F order. I can't control the F order to my knowledge, since the data is loaded from images using an imaging library.
import numpy as np
import scipy.linalg.blas
roi = np.ascontiguousarray(roi)  # see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 BLAS routine does the trick. My problem was twofold:
roi.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And... I was using BLAS dgemm, NOT sgemm. The 'd' is for 'double' and 's' for 'single'.
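Both halves of the problem can be inspected with NumPy alone before calling any BLAS routine. The small array below is made up for illustration; it shows how to check memory order and dtype, and that converting float32 data to double precision doubles its size. Whether a particular BLAS wrapper makes such a copy internally is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for image data that arrives in Fortran (column-major) order.
roi = np.asfortranarray(rng.random((4, 3)).astype(np.float32))
print(roi.flags.f_contiguous, roi.flags.c_contiguous)

# Same values, C (row-major) layout, dtype unchanged.
roi_c = np.ascontiguousarray(roi)
print(roi_c.flags.c_contiguous, roi_c.dtype)

# A double-precision routine needs float64 input, i.e. twice the bytes.
as_double = roi.astype(np.float64)
print(as_double.nbytes // roi.nbytes)
```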
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed...I went back and read the wikipedia article on blas routines to understand level 3 vs other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, performing:
A \dot A.T
Thanks everyone for your thoughts...knowing that it could be done was most important.