I have code that creates several GB of data with np.tile and np.repeat on every iteration of a loop.
After a few iterations the code runs out of memory. Since each tiled or repeated array is only used inside a single iteration, I am thinking about how to reuse that memory.
Ideally, in order to reuse memory, I would like to do something like this:
large_matrix = np.zeros(N*M)
for data in generator:
    np.repeat(data, M, out=large_matrix)
    [...]  # here I will use large_matrix
Unfortunately np.repeat has no such out keyword, so I had to create my own njit(parallel=True) numba functions to replicate numpy's repeat.
However, before I start rewriting many other numpy functions in numba, my question is: what is the numpy-thonic way to store numpy results in already existing arrays so as to keep memory usage under control?
Numpy's in-place assignment is large_matrix[:] = np.repeat(data, M), though note that this still builds the repeated array as a temporary before copying it into large_matrix.
Better: encapsulate the body of your for-loop in a function (e.g. def process(data):). That way, all arrays except the returned outputs are freed when each iteration is done. If the outputs are big, write them to disk instead of accumulating them in RAM, as in the sketch below.
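A minimal sketch of that pattern; generator, M and the per-iteration work are stand-ins for the question's real objects:

import numpy as np

M = 1000
generator = (np.random.rand(10_000) for _ in range(5))   # stand-in for the real generator

def process(data):
    repeated = np.repeat(data, M)      # big temporary lives only inside this call
    return repeated.sum()              # keep only a small result

results = [process(data) for data in generator]           # temporaries become freeable each iteration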
It's very rare that tile or repeat can't be replaced with smart broadcasting.
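For example, the np.repeat above can often be expressed as a broadcast assignment into a preallocated buffer; a sketch with made-up shapes:

import numpy as np

N, M = 4, 3
large_matrix = np.zeros(N * M)
data = np.arange(N, dtype=float)

# Equivalent of large_matrix[:] = np.repeat(data, M), without the repeated temporary:
large_matrix.reshape(N, M)[:] = data[:, None]   # broadcast each value across M slots

Often the expanded array is not needed at all, since downstream arithmetic can broadcast data[:, None] directly.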
I have a dataset of images that is too large to store in memory. What I plan to do is load pairs of image paths and corresponding labels as my dataset, then use a generator function during training to convert only the paths in my batch into images before feeding them to the network.
Is tf.data.Dataset.map() a good way to do this? Does it return a mapping function that can be applied only to the current batch during training, or does it perform the mapping operation on the whole dataset at once, occupying lots of memory? In the second case, what is an alternative?
A few tutorials I went through made me believe the mapping takes place per batch, but this quote from the documentation suggests a whole new dataset is returned: "This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input."
The key thing to understand here is that tf.data.Dataset objects are generally "lazy" in that elements are only processed as needed (in a batched Dataset, elements == batches). When iterating over a dataset, this usually means that only the next requested element is prepared and then returned. So to answer your question: When using map to load data from disk, and applying this to a dataset of file names, only one batch of the loaded data should be stored in memory at the same time, and you should be able to process the dataset just fine. However, this can significantly slow down training if loading the files is a bottleneck in terms of speed.
There are some exceptions though, for example:
When you use the shuffle method, you need to provide a buffer size, and AFAIK the entire buffer is preprocessed at once. This can lead to issues since you want a large buffer for good shuffling, but this requires more memory. Thus you probably want to use shuffle before applying map.
The prefetch method results in multiple elements being prepared in order to avoid the model having to wait for the next batch to be processed.
Note that this lazy behavior also has some disadvantages, e.g.
You can only iterate over datasets sequentially; there is no random access.
A dataset doesn't even know how many elements it contains (this would require iterating over the entire set).
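Putting the pieces above together, a lazily evaluated pipeline might look roughly like this; file names, image size and batch size are made up, and decode_jpeg assumes JPEG files:

import tensorflow as tf

paths = ["img_0.jpg", "img_1.jpg"]          # placeholder file names
labels = [0, 1]

def load_image(path, label):
    # Runs per element, and only when the element is actually requested.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, label

ds = (tf.data.Dataset.from_tensor_slices((paths, labels))
      .shuffle(len(paths))                              # shuffle the cheap path/label pairs
      .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))                      # keep a few batches ready, not the whole set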
I'm trying to filter rows in an np.ndarray with a column condition. Usually I use this code:
def filter(data, column_index, threshold):
    left = data[data[:, column_index] <= threshold]
    right = data[data[:, column_index] > threshold]
    return left, right
But I have a large dataset and many thresholds, so I need to optimize the execution time. Is there a way to optimize this code using Cython? My attempt:
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def filter(data, int column_index, float threshold):
    cdef np.ndarray left = data[data[:, column_index] <= threshold]
    cdef np.ndarray right = data[data[:, column_index] > threshold]
    return left, right
Thanks.
Assuming the column does not contain any NaN, values that are smaller than or equal to the threshold are not greater than it, and vice versa. Thus, you can store the boolean mask data[:, column_index] <= threshold temporarily so that the second comparison does not have to be computed again.
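In plain numpy, reusing the mask looks like this (a minimal sketch):

import numpy as np

def filter_with_mask(data, column_index, threshold):
    # Compute the comparison once and reuse its negation for the other side.
    mask = data[:, column_index] <= threshold
    return data[mask], data[~mask]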
Moreover, the boolean indexing creates multiple temporary arrays. This can be avoided by creating one big output array, iterating over the rows and writing each one either to the left or to the right end of that array, and finally slicing the big array to return the left and right parts. This method is called a partition.
Besides this, the biggest problem comes from the fact that the column access data[:, column_index] is not contiguous in memory. Since the array is presumably big, there is a good chance that it does not fit in the CPU cache (so it is stored in RAM), and data.shape[1] is not small. As a result, fetching non-contiguous data from RAM with a large stride is pretty inefficient, and the computation will likely be bound by memory latency. There is not much you can do about this in Cython, apart from using the above methods (or performing a local contiguous copy if the values contain NaN). If data is really huge (> 100,000 x 100,000), you can use multiple threads to mitigate the latency cost; otherwise, the cost of creating threads, distributing the work and gathering the results will likely be too big to make threads worthwhile. Note that you could adapt the input data structure so that accesses are more contiguous, if that is possible (probably not).
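One possible shape of that partition in Cython, as a rough sketch: it assumes contiguous float64 data, ignores NaN, and returns the right-hand rows in reverse order of appearance, all of which are simplifications.

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def partition_filter(double[:, :] data, Py_ssize_t column_index, double threshold):
    cdef Py_ssize_t n = data.shape[0]
    cdef Py_ssize_t m = data.shape[1]
    cdef Py_ssize_t i, j, front = 0, back = n - 1
    # One big temporary array: left rows fill from the front, right rows from the back.
    out = np.empty((n, m), dtype=np.float64)
    cdef double[:, :] out_view = out
    for i in range(n):
        if data[i, column_index] <= threshold:
            for j in range(m):
                out_view[front, j] = data[i, j]
            front += 1
        else:
            for j in range(m):
                out_view[back, j] = data[i, j]
            back -= 1
    # Slice the big array into the two parts.
    return out[:front], out[front:]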
Summary
adataframe is a DataFrame with 800k rows. Naturally, it consumes a bit of memory. When I do this:
adataframe = adataframe.tail(144)
memory is not released.
You could argue that the memory is actually released and only appears to be used, i.e. that it's marked free and will be reused by Python. However, if I attempt to create a new 800k-row DataFrame and again keep only a small slice, memory usage grows. If I do it again, it grows again, ad infinitum.
I'm using Debian Jessie's Python 3.4.2 with Pandas 0.18.1 and numpy 1.11.1.
Demonstration with minimal program
With the following program I create a dictionary
data = {
    0: a_DataFrame_loaded_from_a_CSV,_only_the_last_144_rows,
    1: same_thing,
    # ...
    9: same_thing,
}
and I monitor memory usage while I'm creating the dictionary. Here it is:
#!/usr/bin/env python3

from resource import getrusage, RUSAGE_SELF

import pandas as pd

def print_memory_usage():
    print(getrusage(RUSAGE_SELF).ru_maxrss)

def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    result = result.tail(144)
    return result

print_memory_usage()
data = {}
for i in range(10):
    with open('data.csv') as f:
        data[i] = read_dataframe_from_csv(f)
    print_memory_usage()
Results
If data.csv only contains a few rows (e.g. 144, in which case the slicing is redundant), memory usage grows very slowly. But if data.csv contains 800k rows, the results are similar to these:
52968
153388
178972
199760
225312
244620
263656
288300
309436
330568
349660
(Adding gc.collect() before print_memory_usage() does not make any significant difference.)
What can I do about it?
As @Alex noted, slicing a dataframe only gives you a view of the original frame, but does not delete it; you need to use .copy() for that. However, even when I used .copy(), memory usage grew and grew, albeit at a slower rate.
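For illustration, the change amounts to something like this inside read_dataframe_from_csv (a minimal sketch):

result = result.tail(144).copy()  # detach the 144-row slice so the full frame can be freed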
I suspect that this has to do with how Python, numpy and pandas use memory. A dataframe is not a single object in memory; it contains pointers to other objects (in this particular case, especially to the strings of the "flags" column). When the dataframe and these objects are freed, the reclaimed memory space can be fragmented. Later, when a huge new dataframe is created, it might not be able to use the fragmented space, and new space might need to be allocated. The details depend on many little things, such as the Python, numpy and pandas versions, and the particulars of each case.
Rather than investigating these little details, I decided that reading a huge time series and then slicing it is a no-go, and that I must read only the part I need right from the start. I like some of the code I created for that, namely the textbisect module and the FilePart class.
You could argue that the memory is actually released and only appears to be used, i.e. that it's marked free and will be reused by Python.
Correct, that is how maxrss works (it measures peak memory usage). See here.
So the question then is why is the garbage collector not cleaning up the original DataFrames after they have been subsetted.
I suspect it is because subsetting returns a DataFrame that acts as a proxy to the original one (so values don't need to be copied). This would result in a relatively fast subset operation but also memory leaks like the one you found and weird speed characteristics when setting values.
Profiling a piece of numpy code shows that I'm spending most of the time within these two functions
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I'm using something like my_array[arg] and it's getting more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called and how can I speed up by reducing the number of calls to this function?
Thanks!
You could avoid using matrices as much and just use 2-D numpy arrays. I typically only use matrices for a short time to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).
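For instance, running the slicing loop from the question on a plain ndarray rather than an np.matrix avoids the subclass machinery on every slice (sizes here are illustrative):

import numpy as np

m = np.matrix(np.random.rand(1, 100_000))    # matrix subclass: every slice goes through
                                             # __getitem__ and __array_finalize__
a = np.asarray(m).ravel()                    # plain ndarray view of the same data

res = np.empty(1_000)
for i in range(1_000):
    res[i] = a[i:i + 10].mean()              # ndarray slicing, no matrix overhead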
But, to your questions:
1) There really is no shortcut to __getitem__ unless defmatrix overrides __getslice__, which it could do but doesn't yet. There are the .item and .itemset methods, which are optimized for integer getting and setting (and return Python objects rather than NumPy's array scalars).
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C-function that every array-creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of sub-classes defined purely in Python, it is calling back into the Python interpreter from C which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
Question 1:
Since array slices can sometimes require a copy of the underlying data structure (holding the pointers to the data in memory), they can be quite expensive. If you're really bottlenecked by this in the example above, you can perform the mean operations by iterating over the i to i+10 elements and computing the mean manually. For some operations this won't give any performance improvement, but avoiding the creation of new data structures will generally speed things up.
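For the specific length-10 window mean in the question, the whole loop can also be replaced by a cumulative-sum trick (a sketch, assuming a 1-D float array; results match up to floating-point rounding):

import numpy as np

my_array = np.random.rand(100_000)            # stand-in for the real data
window = 10

csum = np.cumsum(np.insert(my_array, 0, 0.0))
res = (csum[window:] - csum[:-window]) / window
# res[i] == my_array[i:i + window].mean() for every valid i, with no per-step slice objects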
Another note: if you're not using native types inside numpy, you will get a very large performance penalty when manipulating a numpy array. Say your array has dtype=float64 and your native machine float size is float32 -- this will cost a lot of extra computation power for numpy and performance overall will drop. Sometimes this is fine and you can just take the hit for maintaining a data type. Other times it's arbitrary what type the float or int is stored as internally. In these cases try dtype=float instead of dtype=float64. Numpy should default to your native type. I've had 3x+ speedups on numpy-intensive algorithms by making this change.
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray", according to SciPy. Thus this is the effect described in the first question. When you slice and make a new array, you have to finalize that array by either making structural copies or wrapping the original structure. This operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.
I have a large numpy array that I am going to take a linear projection of using randomly generated values.
>>> input_array.shape
(50, 200000)
>>> random_array = np.random.normal(size=(200000, 300))
>>> output_array = np.dot(input_array, random_array)
Unfortunately, random_array takes up a lot of memory, and my machine starts swapping. It seems to me that I don't actually need all of random_array around at once; in theory, I ought to be able to generate it lazily during the dot product calculation...but I can't figure out how.
How can I reduce the memory footprint of the calculation of output_array from input_array?
This obviously isn't the fastest solution, but have you tried:
m, inner = input_array.shape
n = 300
out = np.empty((m, n))
for i in xrange(n):
    out[:, i] = np.dot(input_array, np.random.normal(size=inner))
This might be a situation where using Cython could reduce your memory usage. You could generate the random numbers on the fly and accumulate the result as you go. I don't have the time to write and test the full function, but you would definitely want to use randomkit (the library that numpy uses under the hood) at the C level.
You can take a look at some example code I wrote for another application to see how to wrap randomkit:
https://github.com/synapticarbors/pylangevin-integrator/blob/master/cIntegrator.pyx
And also check out how matrix multiplication is implemented in the following paper on cython:
http://conference.scipy.org/proceedings/SciPy2009/paper_2/full_text.pdf
Instead of having both arrays as inputs, just have input_array as one, and then in the method, generate small chunks of the random array as you go.
Sorry if it is just a sketch instead of actual code, but hopefully it is enough to get you started.
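A numpy-only illustration of the chunked idea (without Cython or randomkit): generate a block of columns of the random matrix at a time, so only (inner, chunk) random values exist at once. Names, chunk size and the use of default_rng are assumptions, and the random values differ from a single big draw.

import numpy as np

def project_in_chunks(input_array, n_out=300, chunk=32, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    m, inner = input_array.shape
    out = np.empty((m, n_out))
    for start in range(0, n_out, chunk):
        stop = min(start + chunk, n_out)
        block = rng.normal(size=(inner, stop - start))   # small slice of random_array
        out[:, start:stop] = input_array @ block
    return out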