Memory error with numpy. array - numpy

I get a memory error when using numpy.arange with large numbers. My code is as follows:
import numpy as np
list = np.arange(0, 10**15, 10**3)
profit_list = []
for diff in list:
x = do_some_calculation
profit_list.append(x)
What can be a replacement so I can avoid getting the memory error?

If you replace list¹ with a generator, that is, you do
for diff in range(10**15, 10**3):
x = do_some_calculation
profit_list.append(x)
then that will no longer cause MemoryErrors as you no longer initiate the full list. In this world, though, profit_list will probably by causing issues instead, as you are trying to add 10^12 items to that. Again, you can probably get around that by not storing the values explicitly, but rather yield them as you need them, using generators.
¹: Side note: Don't use list as a variable name as it shadows a built-in.

Related

filtering "events" in awkward-array

I am reading data from a file of "events". For each event, there is some number of "tracks". For each track there are a series of "variables". A stripped down version of the code (using awkward0 as awkward) looks like
f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")
afile = awkward.hdf5(f)
pocaz = np.asarray(afile["poca_z"].astype(dtype_X))
pocaMx = np.asarray(afile["major_axis_x"].astype(dtype_X))
pocaMy = np.asarray(afile["major_axis_y"].astype(dtype_X))
pocaMz = np.asarray(afile["major_axis_z"].astype(dtype_X))
In this snippet of code, "pocaz", "pocaMx", etc. are what I have called variables (a physics label, not a Python data type). On rare occasions, pocaz takes on an extreme value, pocaMx and/or pocaMy take on nan values, and/or pocaMz takes on the value inf. I would like to remove these tracks from the events using some syntactically simple method. I am guessing this functionality exists (perhaps in the current version of awkward but not awkward0), but cannot find it described in a transparent way. Is there a simple example anywhere?
Thanks,
Mike
It looks to me, from the fact that you're able to call np.asarray on these arrays without error, that they are one-dimensional arrays of numbers. If so, then Awkward Array isn't doing anything for you here; you should be able to find the one-dimensional NumPy arrays inside
f["poca_z"], f["major_axis_x"], f["major_axis_y"], f["major_axis_z"]
as groups (note that this is f, not afile) and leave Awkward Array entirely out of it.
The reason I say that is because you can use np.isfinite on these NumPy arrays. (There's an equivalent in Awkward v1, v2, but you're talking about Awkward v0 and I don't remember.) That will give you an array of booleans for you to slice these arrays.
I don't have the HDF5 file for testing, but I think it would go like this:
f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")
pocaz = np.asarray(a["poca_z"]["0"], dtype=dtype_X)
pocaMx = np.asarray(a["major_axis_x"]["0"], dtype=dtype_X) # the only array
pocaMy = np.asarray(a["major_axis_y"]["0"], dtype=dtype_X) # in each group
pocaMz = np.asarray(a["major_axis_z"]["0"], dtype=dtype_X) # is named "0"
good = np.ones(len(pocaz), dtype=bool)
good &= np.isfinite(pocaz)
good &= np.isfinite(pocaMx)
good &= np.isfinite(pocaMy)
good &= np.isfinite(pocaMz)
pocaz[good], pocaMx[good], pocaMy[good], pocaMz[good]
If you also need to cut extreme finite values, you can include
good &= (-1000 < pocaz) & (pocaz < 1000)
etc. in the good selection criteria.
(The way you'd do this in Awkward Array is not any different, since Awkward is just generalizing what NumPy does here, but if you don't need it, you might as well leave it out.)
If you want numpy arrays, why not read the data with h5py functions? It provides a very natural way to return the datasets as arrays. Code would look like this. (FYI, I used the file context manager to open the file.)
with h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r") as h5f:
# the [()] returns the dataset as an array:
pocaz_arr = h5f["poca_z"]["0"][()]
# verify array shape and datatype:
print(f"Shape: {pocaz_arr.shape}, Dtype: {poca_z_arr.dtype})")
pocaMx_arr = h5f["major_axis_x"]["0"][()] # the only dataset
pocaMy_arr = h5f["major_axis_y"]["0"][()] # in each group
pocaMz_arr = h5f["major_axis_z"]["0"][()] # is named "0"

What is the difference between SeedSequence.spawn and SeedSequence.generate_state

I am trying to use numpy's SeedSequence to seed RNGs in different processes. However I am not sure whether I should use ss.generate_state or ss.spawn:
import concurrent.futures
import numpy as np
def worker(seed):
rng = np.random.default_rng(seed)
return rng.random(1)
num_repeats = 1000
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
with concurrent.futures.ProcessPoolExecutor() as pool:
result1 = np.hstack(list(pool.map(worker, ss.generate_state(num_repeats))))
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
with concurrent.futures.ProcessPoolExecutor() as pool:
result2 = np.hstack(list(pool.map(worker, ss.spawn(num_repeats))))
What are the differences between the two approaches and which should I use?
Using ss.generate_state is ~10% faster for the basic example above, likely because we are serializing floats instead of objects.
Well the SeedSequence is intended to generate good quality seeds from not so good seeds.
Performance
The generate_state is much faster than spawn
The difference of time for transfering the object or the state should not be the main reason for the difference you notice. You can test this without any
%%timeit
ss.generate_state(num_repeats)
100 times faster than
ss.spawn(num_repeats)
Seed size
In the code map(worker, ss.generate_state(num_repeats)) the RNGs are seeded with integers, while in the map(worker, ss.spawn(num_repeats)) the RNGs are seeded with SeedSequence that can potentially give result in quality initialization, the seed sequence can be used internally to generate a state vector with as many bits as required to completely initialize the RNG. Well, to be honest I expect it to do some expansion on this number to initialize the RNG, not simply padding with zeros for example.
Repeated use
The most important difference is that generate_state gives the same result if called multiple times. On the other hand, spawn gives different results each call.
For illustration, check the following example
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
print('With generate_state')
print(np.hstack([worker(s) for s in ss.generate_state(5)]))
print(np.hstack([worker(s) for s in ss.generate_state(5)]))
print('With spawn')
print(np.hstack([worker(s) for s in ss.spawn(5)]))
print(np.hstack([worker(s) for s in ss.spawn(5)]))
With generate_state
[0.6625651 0.17654256 0.25323331 0.38250588 0.52670541]
[0.6625651 0.17654256 0.25323331 0.38250588 0.52670541]
With spawn
[0.06988312 0.40886412 0.55733136 0.43249601 0.53394111]
[0.64885573 0.16788206 0.12435154 0.14676836 0.51876499]
As you see the arrays generated by seeding different RNGs with generate_state gives the same result, not only immediately after construction, but every time the method is called. Spawn should give you the same results using a newly constructed SeedSequence (I am using numpy 1.19.2), however if you run the same code twice using the same instance, the second time will give produce different seeds.

Custom equations with bygroup and apply in pandas - MemoryError

All,
I am running code to calcluate a new variable (newvar) for each constituent (group) in a panel using the apply function:
df['newvar'] = df.groupby('group')['var1'].apply(lambda x : x - x.shift() + df['var2'] - df['var3'])
The code returns a memory error (MemoryError). I think what's happening is that the code generates a large number of separate dataframes which then causes the system to run out of memory because df itself is quite a large file. I can probalby do this with a for-loop, but is there a more verbose/computationally efficient way of doing this?
Many thanks,
Andres

Using pyfftw properly for speed up over numpy

I am in the midst of trying to make the leap from Matlab to numpy, but I desperately need speed in my fft's. Now I know of pyfftw, but I don't know that I am using it properly. My approach is going something like
import numpy as np
import pyfftw
import timeit
pyfftw.interfaces.cache.enable()
def wrapper(func, *args):
def wrapped():
return func(*args)
return wrapped
def my_fft(v):
global a
global fft_object
a[:] = v
return fft_object()
def init_cond(X):
return my_fft(2.*np.cosh(X)**(-2))
def init_cond_py(X):
return np.fft.fft(2.*np.cosh(X)**(-2))
K = 2**16
Llx = 10.
KT = 2*K
dx = Llx/np.float64(K)
X = np.arange(-Llx,Llx,dx)
global a
global b
global fft_object
a = pyfftw.n_byte_align_empty(KT, 16, 'complex128')
b = pyfftw.n_byte_align_empty(KT, 16, 'complex128')
fft_object = pyfftw.FFTW(a,b)
wrapped = wrapper(init_cond, X)
print min(timeit.repeat(wrapped,repeat=100,number=1))
wrapped_two = wrapper(init_cond_py, X)
print min(timeit.repeat(wrapped_two,repeat=100,number=1))
I appreciate that there are builder functions and also standard interfaces to the scipy and numpy fft calls through pyfftw. These have all behaved very slowly though. By first creating an instance of the fft_object and then using it globally, I have been able to get speeds as fast or slightly faster than numpy's fft call.
That being said, I am working under the assumption that wisdom is implicitly being stored. Is that true? Do I need to make that explicit? If so, what is the best way to do that?
Also, I think timeit is completely opaque. Am I using it properly? Is it storing wisdom as I call repeat? Thanks in advance for any help you might be able to give.
In an interactive (ipython) session, I think the following is what you want to do (timeit is very nicely handled by ipython):
In [1]: import numpy as np
In [2]: import pyfftw
In [3]: K = 2**16
In [4]: Llx = 10.
In [5]: KT = 2*K
In [6]: dx = Llx/np.float64(K)
In [7]: X = np.arange(-Llx,Llx,dx)
In [8]: a = pyfftw.n_byte_align_empty(KT, 16, 'complex128')
In [9]: b = pyfftw.n_byte_align_empty(KT, 16, 'complex128')
In [10]: fft_object = pyfftw.FFTW(a,b)
In [11]: a[:] = 2.*np.cosh(X)**(-2)
In [12]: timeit np.fft.fft(a)
100 loops, best of 3: 4.96 ms per loop
In [13]: timeit fft_object(a)
100 loops, best of 3: 1.56 ms per loop
In [14]: np.allclose(fft_object(a), np.fft.fft(a))
Out[14]: True
Have you read the tutorial? What don't you understand?
I would recommend using the builders interface to construct the FFTW object. Have a play with the various settings, most importantly the number of threads.
The wisdom is not stored by default. You need to extract it yourself.
All your globals are unnecessary - the objects you want to change are mutable, so you can handle them just fine. fft_object always points to the same thing, so no problem with that not being a global. Ideally, you simply don't want that loop over ii. I suggest working out how to structure your arrays in order that you can do all your operations in a single call
Edit:
[edit edit: I wrote the following paragraph with only a cursory glance at your code, and clearly with it being a recursive update, vectorising is not an obvious approach without some serious cunning. I have a few comments on your implementation at the bottom though]
I suspect your problem is a more fundamental misunderstanding of how to best use a language like Python (or indeed Matlab) for numerical processing. The core tenet is vectorise as much as possible. By this, I mean roll up your python calls to be as few as possible. I can't see how to do that with your example unfortunately (though I've only thought about it for 2 mins). If that's still failing, think about cython - though make sure you really want to go down that route (i.e. you've exhausted the other options).
Regarding the globals: Don't do it that way. If you want to create an object with state, use a class (that is what they are for) or perhaps a closure in your case. The global is almost never what you want (I think I have one at least vaguely legit use for it in all my writing of python, and that's in the cache code in pyfftw). I suggest reading this nice SO question. Matlab is a crappy language - one of the many reasons for this is its crap scoping facilities which tend to lead to bad habits.
You only need global if you want to modify a reference globally. I suggest reading a bit more about the Python scoping rules and what variables really are in python.
FFTW objects carry with them all the arrays you need so you don't need to pass them around separately. Using the call interface carries almost no overhead (particularly if you disable the normalisation) either for setting or returning the values - if you're at that level of optimisation, I strongly suspect you've hit the limit (I'd caveat this that this may not quite be true for many many very small FFTs, but at this point you want to rethink your algorithm to vectorise the calls to FFTW). If you find a substantial overhead in updating the arrays every time (using the call interface), this is a bug and you should submit it as such (and I'd be pretty surprised).
Bottom line, don't worry about updating the arrays on every call. This is almost certainly not your bottleneck, though make sure you're aware of the normalisation and disable it if you wish (it might slow things down slightly compared to raw accessing of the update_arrays() and execute() methods).
Your code makes no use of the cache. The cache is only used when you're using the interfaces code, and reduces the Python overhead in creating new FFTW objects internally. Since you're handling the FFTW object yourself, there is no reason for a cache.
The builders code is a less constrained interface to get an FFTW object. I almost always use the builders now (it's much more convenient that creating a FFTW object from scratch). The cases in which you want to create an FFTW object directly are pretty rare and I'd be interested to know what they are.
Comments on the algorithm implementation:
I'm not familiar with the algorithm you're implementing. However, I have a few comments on how you've written it at the moment.
You're computing nl_eval(wp) on every loop, but as far as I can tell that's just the same as nl_eval(w) from the previous loop, so you don't need to compute it twice (but this comes with the caveat that it's pretty hard to see what's going on when you have globals everywhere, so I might be missing something).
Don't bother with the copies in my_fft or my_ifft. Simply do fft_object(u) (2.29 ms versus 1.67 ms on my machine for the forward case). The internal array update routine makes the copy unnecessary. Also, as you've written it, you're copying twice: c[:] means "copy into the array c", and the array you're copying into c is v.copy(), i.e. a copy of v (so two copies in total).
More sensible (and probably necessary) is copying the output into holding arrays (since that avoids clobbering interim results on calls to the FFTW object), though make sure your holding arrays are properly aligned. I'm sure you've noted this is important but it's rather more understandable to copy the output.
You can move all your scalings together. The 3 in the computation of wn can be be moved inside my_fft in nl_eval. You can also combine this with the normalisation constant from the ifft (and turn it off in pyfftw).
Take a look at numexpr for the basic array operations. It can offer quite a bit of speed-up over vanilla numpy.
Anyway take what you will from all that. No doubt I've missed something or said something incorrect, so please accept it with as much humility as I can offer. It's worth spending a little time working out how Python ticks compared to Matlab (in fact, just forget the latter).

How to find large objects in ZODB

I'm trying to analyze my ZODB because it grew really large (it's also large after packing).
The package zodbbrowser has a feature that displays the amount of bytes of an object. It does so by getting the length of the pickled state (name of the variable), but it also does a bit of magic which I don't fully understand.
How would I go to find the largest objects in my ZODB?
I've written a method which should do exactly this. Feel free to use it, but be aware that this is very memory consuming. The package zodbbrowser must be installed.
def zodb_objects_by_size(self):
"""
Recurse over the ZODB tree starting from self.aq_parent. For
each object, use zodbbrowser's implementation to get the raw
object state. Put each length into a Counter object and
return a list of the biggest objects, specified by path and
size.
"""
from zodbbrowser.history import ZodbObjectHistory
from collections import Counter
def recurse(obj, results):
# Retrieve state pickle from ZODB, get length
history = ZodbObjectHistory(obj)
pstate = history.loadStatePickle()
length = len(pstate)
# Add length to Counter instance
path = '/'.join(obj.getPhysicalPath())
results[path] = length
# Recursion
for child in obj.contentValues():
# Play around portal tools and other weird objects which
# seem to contain themselves
if child.contentValues() == obj.contentValues():
continue
# Rolling in the deep
try:
recurse(child, results)
except (RuntimeError, AttributeError), err:
import pdb; pdb.set_trace() ## go debug
results = Counter()
recurse(self.aq_parent, results)
return results.most_common()