Copy a NumPy 1D array to a Python list fast

I need to copy a large NumPy float32 1D array (think 1 million elements) to a Python list, fast.
Doing numpyarray.tolist() takes almost 0.5 seconds. I could write a C++ wrapper and memcpy the data, but that seems convoluted for something so simple.
Is there a simpler way to do this copy/conversion?
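For reference, a minimal timing sketch to reproduce the comparison (the array name and size are placeholders); note that the generic list(arr) constructor is usually slower than arr.tolist() and yields numpy scalars rather than Python floats:

import time
import numpy as np

numpyarray = np.random.rand(10**6).astype(np.float32)

t0 = time.perf_counter()
as_python_list = numpyarray.tolist()   # C-level conversion to Python floats
t1 = time.perf_counter()
as_scalar_list = list(numpyarray)      # iterates, yields np.float32 scalars
t2 = time.perf_counter()

print("tolist():", t1 - t0, "s; list():", t2 - t1, "s")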

numpy-thonic way to store data on already-allocated data

I have code that iteratively creates several GB of data with np.tile and np.repeat.
After a few iterations the code runs out of memory. Since each tile and repeat result is used only inside its own iteration, I am thinking about how to save memory.
Ideally, in order to reuse memory, I would like to do something like this:
large_matrix = np.zeros(N*M)
for data in generator:
    np.repeat(data, M, out=large_matrix)
    [...] # here I will use large_matrix
Unfortunately np.repeat has no such out keyword, and I had to create my own njit(parallel=True) numba function to replicate numpy's repeat.
However, before I start rewriting many other numpy functions in numba, my question is: what is the numpy-thonic way to store numpy results in already existing arrays, so as to keep memory usage under control?
NumPy's in-place assignment is large_matrix[:] = np.repeat(data, M). Note that this still allocates the temporary on the right-hand side; it only reuses large_matrix's storage for the result.
Better: encapsulate the inside of your for-loop as a function (e.g. def process(data):). This way, all matrices except for the returned outputs are freed when the iteration is done. If the outputs are big, write them to disk instead of accumulating them in RAM.
It's very rare that tile or repeat can't be replaced with smart broadcasting.
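For instance, the np.repeat(data, M) pattern above can be rewritten as a broadcast assignment into a preallocated buffer, so no size-N*M temporary is ever created (a minimal sketch, assuming each data has length N):

import numpy as np

N, M = 1000, 500
large_matrix = np.zeros(N * M)
view = large_matrix.reshape(N, M)   # a view on the same memory, no copy

for data in (np.random.rand(N) for _ in range(3)):   # stand-in for the generator
    view[:] = data[:, None]   # broadcasts data[i] across M columns in place,
                              # same layout as np.repeat(data, M)
    # ... use large_matrix here ...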

Intersection of sorted numpy arrays

I have a list of sorted numpy arrays. What is the most efficient way to compute the sorted intersection of these arrays?
In my application, I expect the number of arrays to be less than 10^4, I expect the individual arrays to be of length less than 10^7, and I expect the length of the intersection to be close to p*N, where N is the length of the largest array and where 0.99 < p <= 1.0. The arrays are loaded from disk and can be loaded in batches if they won't all fit in memory at once.
A quick and dirty approach is to repeatedly invoke numpy.intersect1d(). That seems inefficient, though, as intersect1d() does not take advantage of the fact that the arrays are sorted.
Since intersect1d sorts its arguments each time, it is indeed inefficient.
Here you have to sweep the running intersection and each sample together to build the new intersection, which can be done in linear time while maintaining order.
Such a task often has to be tuned by hand with low-level routines.
Here is a way to do that with numba:
from numba import njit
import numpy as np

@njit
def drop_missing(intersect, sample):
    i = j = k = 0
    new_intersect = np.empty_like(intersect)
    while i < intersect.size and j < sample.size:
        if intersect[i] == sample[j]:  # the 99% case
            new_intersect[k] = intersect[i]
            k += 1
            i += 1
            j += 1
        elif intersect[i] < sample[j]:
            i += 1
        else:
            j += 1
    return new_intersect[:k]
Now the samples:
n = 10**7
ref = np.random.randint(0, n, n)
ref.sort()

def perturbation(sample, k):
    rands = np.random.randint(0, n, k - 1)
    rands.sort()
    l = np.split(sample, rands)
    return np.concatenate([a[:-1] for a in l])

samples = [perturbation(ref, 100) for _ in range(10)]  # similar samples
And a run for 10 samples:
def find_intersect(samples):
    intersect = samples[0]
    for sample in samples[1:]:
        intersect = drop_missing(intersect, sample)
    return intersect
In [18]: %time u=find_intersect(samples)
Wall time: 307 ms
In [19]: len(u)
Out[19]: 9999009
At roughly 30 ms per merge, it seems the full job (up to 10^4 arrays) can be done in about 5 minutes, not counting loading time.
A few months ago, I wrote a C++-based Python extension for this exact purpose. The package is called sortednp and is available via pip. The intersection of multiple sorted numpy arrays, for example a, b, and c, can be calculated with
import sortednp as snp
i = snp.kway_intersect(a, b, c)
By default, this uses an exponential search to advance the array indices internally, which is pretty fast in cases where the intersection is small. In your case, it might be faster if you add algorithm=snp.SIMPLE_SEARCH to the method call.
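Since the arrays can be loaded from disk in batches, a running pairwise intersection keeps only two arrays in memory at a time. A sketch, assuming the arrays are stored as sorted .npy files (snp.intersect is sortednp's pairwise intersection):

import numpy as np
import sortednp as snp

def batched_intersect(paths):
    # paths: filenames of sorted 1D arrays saved with np.save (assumption)
    intersect = np.load(paths[0])
    for p in paths[1:]:
        intersect = snp.intersect(intersect, np.load(p))  # result stays sorted
    return intersect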

Fastest way to apply arithmetic operations to System.Array in IronPython

I would like to add (arithmetics) two large System.Arrays element-wise in IronPython and store the result in the first array like this:
for i in range(arrA.Count):
    arrA.SetValue(arrA.GetValue(i) + arrB.GetValue(i), i)  # System.Array.SetValue takes (value, index)
However, this seems very slow. Having a C background, I would like to use pointers or iterators, but I do not know the fast IronPython idiom for this. I cannot use Python lists, as my objects are strictly of type System.Array. The element type is 3d float.
What is the fastest (or at least a fast) way to perform this computation?
Edit:
The number of elements is approx. 256^3.
3d float means that the array can be accessed like array.GetValue(indexX, indexY, indexZ). I am not sure how the underlying memory is organized in IronPython's System.Array.
Background: I wrote an interface to an IronPython API which gives access to data in a simulation software tool. I retrieve 3d scalar data and accumulate it into a temporary array in my IronPython script. The accumulation is performed 10,000 times and should be fast, so that the simulation does not take ages.
Is it possible to use the numpy library developed for IronPython?
https://pytools.codeplex.com/wikipage?title=NumPy%20and%20SciPy%20for%20.Net
It appears to be supported, and as far as I know it is as close as you can get in Python to C-style pointer functionality with arrays and such.
Create an array:
x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
Multiply all elements by 3.0:
x *= 3.0
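Applied to the question, the accumulation itself then becomes one vectorized statement. A minimal sketch (the conversion from System.Array into numpy arrays a and b is assumed, not shown):

import numpy as np

# stand-ins for the converted arrA / arrB data
a = np.zeros((256, 256, 256), dtype=np.float32)
b = np.ones((256, 256, 256), dtype=np.float32)

a += b   # element-wise in-place add, no Python-level loop over 256^3 elements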

Understanding Numpy internals for profiling purposes

Profiling a piece of numpy code shows that I'm spending most of the time within these two functions
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I'm using something like my_array[arg], and it gets more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called and how can I speed up by reducing the number of calls to this function?
Thanks!
You could avoid matrices and just use 2d numpy arrays. I typically only use matrices for a short time, to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less).
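For example, a minimal sketch of getting matrix products from plain 2d arrays:

import numpy as np

A = np.arange(6.0).reshape(2, 3)   # plain ndarray, not np.matrix
B = np.arange(12.0).reshape(3, 4)
C = A.dot(B)                       # matrix product, shape (2, 4)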
But, to your questions:
1) There really is no shortcut for __getitem__ unless defmatrix overrides __getslice__, which it could do but doesn't yet. There are also the .item and .itemset methods, which are optimized for integer getting and setting (and return Python objects rather than NumPy array-scalars).
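A small sketch of those accessors (note that .itemset was deprecated and later removed in NumPy 2.0, but it was current for this answer):

import numpy as np

my_array = np.arange(20.0)
v = my_array.item(3)        # fast scalar get, returns a plain Python float
my_array.itemset(3, 7.0)    # fast scalar set by flat index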
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C-function that every array-creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of sub-classes defined purely in Python, it is calling back into the Python interpreter from C which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
Question 1:
Since array slices must create a new array object (a new header holding the pointers to the data in memory), they can be quite expensive. If you're really bottlenecked by this in your example above, you can compute the mean by iterating over the i to i+10 elements yourself. For some operations this won't give any performance improvement, but avoiding the creation of new array objects generally speeds things up.
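A sketch of that manual accumulation, using the names from the question's example:

import numpy as np

my_array = np.random.rand(100)
idx = 50
res = np.empty(idx)

for i in range(idx):
    s = 0.0
    for j in range(i, i + 10):
        s += my_array.item(j)   # scalar access, no new array object per step
    res[i] = s / 10.0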
Another note: if you're not using native types inside numpy, you will pay a very large performance penalty when manipulating a numpy array. Say your array has dtype=float64 while your native machine float size is float32 -- this costs numpy a lot of extra computation and overall performance drops. Sometimes this is fine and you can just take the hit to keep a data type. Other times the internal storage type of the float or int is arbitrary. In those cases, try dtype=float instead of dtype=float64; numpy should default to your native type. I've had 3x+ speedups on numpy-intensive algorithms by making this change.
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray", according to the SciPy docs. So this ties back to the first question: when you slice and make a new array, that array has to be finalized, either by copying the structural metadata or by wrapping the original structure, and this operation takes time. Avoiding slices will save this operation, though for multidimensional data it may be impossible to avoid calls to __array_finalize__ completely.
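A small sketch that makes those calls visible, using a toy ndarray subclass written just for counting:

import numpy as np

class Tracked(np.ndarray):
    calls = 0
    def __array_finalize__(self, obj):
        Tracked.calls += 1   # fires for every new view, slice, or copy

a = np.zeros(10).view(Tracked)   # view creation -> one call
b = a[2:5]                       # slicing -> another call
print(Tracked.calls)             # 2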

Numpy: Reduce memory footprint of dot product with random data

I have a large numpy array that I am going to take a linear projection of using randomly generated values.
>>> input_array.shape
(50, 200000)
>>> random_array = np.random.normal(size=(200000, 300))
>>> output_array = np.dot(input_array, random_array)
Unfortunately, random_array takes up a lot of memory, and my machine starts swapping. It seems to me that I don't actually need all of random_array around at once; in theory, I ought to be able to generate it lazily during the dot product calculation...but I can't figure out how.
How can I reduce the memory footprint of the calculation of output_array from input_array?
This obviously isn't the fastest solution, but have you tried:
m, inner = input_array.shape
n = 300
out = np.empty((m, n))
for i in xrange(n):   # range in Python 3
    out[:, i] = np.dot(input_array, np.random.normal(size=inner))
This might be a situation where using Cython could reduce your memory usage. You could generate the random numbers on the fly and accumulate the result as you go. I don't have time to write and test the full function, but you would definitely want to use randomkit (the library that numpy uses under the hood) at the C level.
You can take a look at some example code I wrote for another application to see how to wrap randomkit:
https://github.com/synapticarbors/pylangevin-integrator/blob/master/cIntegrator.pyx
And also check out how matrix multiplication is implemented in the following paper on cython:
http://conference.scipy.org/proceedings/SciPy2009/paper_2/full_text.pdf
Instead of taking both arrays as inputs, take only input_array, and then generate small chunks of the random array inside the method as you go.
Sorry that this is just a sketch instead of actual code, but hopefully it is enough to get you started.
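In plain numpy, that chunked sketch could look roughly like this (the chunk size of 32 columns is an arbitrary assumption; tune it to your memory budget):

import numpy as np

def chunked_projection(input_array, n_out=300, chunk=32):
    # Generate the random projection in column chunks so that only an
    # (inner, chunk) block of random values exists at any one time.
    m, inner = input_array.shape
    out = np.empty((m, n_out))
    for start in range(0, n_out, chunk):
        stop = min(start + chunk, n_out)
        block = np.random.normal(size=(inner, stop - start))
        out[:, start:stop] = input_array.dot(block)   # one BLAS call per block
    return out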