Profiling a piece of NumPy code shows that I'm spending most of the time in these two functions:
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I use something like my_array[arg], and it gets more expensive if arg is a slice rather than an integer. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called, and how can I speed things up by reducing the number of calls to this function?
Thanks!
You could avoid using matrices so much and just use 2-D NumPy arrays. I typically only use matrices for a short time to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).
But, to your questions:
1) There really is no shortcut to __getitem__ unless defmatrix overrides __getslice__, which it could do but doesn't yet. There are the .item and .itemset methods, which are optimized for integer getting and setting (and return Python objects rather than NumPy array scalars).
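A minimal sketch of the .item / .itemset route (plain NumPy API; note that itemset was deprecated in recent NumPy releases):
import numpy as np

a = np.arange(100.0)

x = a[5]        # regular indexing returns a NumPy array scalar (numpy.float64)
y = a.item(5)   # returns a plain Python float, skipping the array-scalar wrapper

a.itemset(5, 42.0)   # optimized single-element set (deprecated in NumPy >= 1.25)
print(x, y, a[5])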
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C function that every array creation gets funneled through: https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of subclasses defined purely in Python, this means calling back into the Python interpreter from C, which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
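To see when __array_finalize__ fires, here's a minimal ndarray-subclass sketch (my own illustration, not from the question's code):
import numpy as np

class Tracked(np.ndarray):
    calls = 0
    def __array_finalize__(self, obj):
        # Fires on explicit construction, view casting, and new-from-template
        # creation, which includes every slice.
        Tracked.calls += 1

a = np.arange(100).view(Tracked)   # view casting -> one call
b = a[10:20]                       # slicing -> another call
print(Tracked.calls)               # 2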
Question 1:
Since creating an array slice means building a new array object (holding the pointers to the data in memory), slicing can be quite expensive. If you're really bottlenecked by this in the example above, you can compute the mean by iterating over the i to i+10 elements yourself instead of slicing. For some operations this won't give any performance improvement, but avoiding the creation of new data structures will generally speed things up.
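For the windowed-mean loop from the question specifically, a cumulative sum computes every window mean in one pass, with no per-iteration slice objects; a sketch assuming the fixed window of 10 from the example:
import numpy as np

my_array = np.random.rand(1000)
idx = len(my_array) - 10
res = np.empty(idx)

# Loop version: one new array object (slice) per iteration.
for i in range(idx):
    res[i] = my_array[i:i+10].mean()

# Vectorized version: csum[k] holds the sum of the first k elements, so each
# 10-element window mean is (csum[i+10] - csum[i]) / 10.
csum = np.cumsum(np.concatenate(([0.0], my_array)))
res_vec = (csum[10:10 + idx] - csum[:idx]) / 10.0

assert np.allclose(res, res_vec)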
Another note: if you're not using native types inside NumPy, you will pay a very large performance penalty when manipulating a NumPy array. Say your array has dtype=float64 while your native machine float size is float32: this will cost a lot of extra computation and overall performance will drop. Sometimes this is fine and you can just take the hit to keep a particular data type. Other times it's arbitrary what type the float or int is stored as internally. In those cases, try dtype=float instead of dtype=float64; NumPy should default to your native type. I've had 3x+ speedups on NumPy-intensive algorithms by making this change.
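A quick way to check what dtype=float resolves to on your machine:
import numpy as np

print(np.dtype(float))           # the platform default; float64 on most systems
a = np.zeros(10, dtype=float)    # let NumPy pick that native default
print(a.dtype, a.itemsize)       # e.g. float64 8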
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray" according to SciPy. Thus this is a result described in the first question. When you slice and make a new array, you have to finalize that array by either making structural copies or wrapping the original structure. This operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.
Related
I have code where I have to iteratively create several GB of data with np.tile and np.repeat.
After a few iterations the code runs out of memory. Since each tile and repeat result is used only inside an iteration, I am thinking about how to save memory.
Ideally, in order to reuse memory, I would like to do something like this:
large_matrix = np.zeros(N * M)
for data in generator:
    np.repeat(data, M, out=large_matrix)
    [...]  # here I will use large_matrix
Unfortunately there is no such out keyword on np.repeat, and I had to create my own njit(parallel=True) Numba functions to replicate NumPy's repeat.
However, before I start rewriting many other NumPy functions in Numba, my question is: what is the NumPy-thonic way to store NumPy results in already existing arrays so as to keep memory usage under control?
NumPy's in-place assignment is large_matrix[:] = np.repeat(data, M).
Better: encapsulate the inside of your for-loop as a function (e.g. def process(data):). This way, all matrices except the returned outputs are freed when each iteration is done. If the outputs are big, write them to disk instead of accumulating them in RAM.
It's very rare that tile or repeat can't be replaced with smart broadcasting; see the sketch below.
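Putting the in-place assignment and the broadcasting idea together, a small sketch (my own illustration; N, M, and data here are placeholders standing in for the question's generator output):
import numpy as np

N, M = 4, 3
large_matrix = np.empty(N * M)
data = np.arange(N, dtype=float)

# In-place assignment: reuses large_matrix's buffer, although np.repeat
# still allocates a temporary on the right-hand side.
large_matrix[:] = np.repeat(data, M)

# Broadcasting: view the buffer as (N, M) and broadcast data across the
# columns, so no np.repeat temporary is allocated at all.
large_matrix.reshape(N, M)[:] = data[:, None]

print(large_matrix)   # [0. 0. 0. 1. 1. 1. 2. 2. 2. 3. 3. 3.]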
As far as I can tell, there are at least two different ways to recover a Tensor from a TensorProto in TensorFlow 2.3. Say, for the sake of example, that we have
tensor = tf.range(10)
tproto = tf.make_tensor_proto(tensor)
Then:
You can use tf.make_ndarray like so
tf.constant(tf.make_ndarray(tproto))
Or you can use tf.io.parse_tensor like so
tf.io.parse_tensor(tproto.SerializeToString(), out_type=tf.int32)
I feel both of these are a bit artificial: in the former you end up with an intermediate NumPy array, and in the latter you have to serialize the TensorProto to a string and parse it back. Additionally, parse_tensor won't automatically recover the correct data type from the TensorProto. So:
Is there a function to do the conversion in a single step? I'd like to see something like tf.from_tensor_proto doing the conversion all at once, optimizing for speed and memory allocation (or, if tf.constant(tf.make_ndarray(tproto)) is the best you can do, just a wrapper around that).
Otherwise, which of the two options above should be preferred (in terms of efficiency, memory usage, etc.)?
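For what it's worth, here's a hypothetical one-step wrapper (tensor_from_proto is my own name, not a TensorFlow API); it reads the dtype off the proto so parse_tensor gets the right out_type automatically:
import tensorflow as tf

def tensor_from_proto(tproto):
    # The proto carries its dtype as a DataType enum; as_dtype converts it
    # to a tf.DType, addressing the out_type complaint above.
    dtype = tf.dtypes.as_dtype(tproto.dtype)
    return tf.io.parse_tensor(tproto.SerializeToString(), out_type=dtype)

tensor = tf.range(10)
tproto = tf.make_tensor_proto(tensor)
restored = tensor_from_proto(tproto)  # dtype recovered: tf.int32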
I'm building a Deep Neural Network in Kotlin (I know Python would be better, but I have to do that in Kotlin).
For training the net I need a huge amount of data from the MNIST database, this means I need to read about 60,000 images from a single file in IDX format and store them for simultaneous use.
Every image consists of 784 Bytes. So the total size is:
784*60,000 = 47,040,000 = ~47 MB of training data.
Which ain't that much, since I'm running the JVM in an 8 GB RAM environment.
After reading an image I need to convert it to a KMatrix, a custom data structure for matrix math operations. Under the hood of a KMatrix there's an Array<Array<Double>>.
I need a structure to store all the images at once, so I'm currently using a List<KMatrix>, which basically translates to a List<Array<Array<Double>>>.
The problem is that while building the List<KMatrix> the garbage collector runs out of memory, throwing an OutOfMemoryError: GC overhead limit exceeded.
I wonder if the problem is the data structures I'm using (e.g. should I use an ArrayList instead of an Array?) or maybe how I'm building the whole thing up (i.e. some optimization work is needed).
I'll put the code, if needed, as soon as I can.
Thanks for your help.
Self-answer with the summarized solution (Thanks to answers by #Tenfour04 and #gidds)
As #Tenfour04 stated, you have basically three alternatives to the Array<Array<Double>> for the KMatrix:
an Array<DoubleArray>, which maintains the same logic as the original but saves lots of memory and increases performance;
a 1-dimensional DoubleArray, which saves a bit of extra memory and gains some performance, but with increased complexity due to the index mapping (the [i, j] element of the matrix is given by the [i * w + j] element of the array); this probably isn't worth it, as #gidds pointed out;
a 1-D DoubleBuffer created with ByteBuffer.allocateDirect(8 * size).asDoubleBuffer(), which improves performance even further but offers only get and put methods, so it won't do if you need anything beyond simple reads and writes.
Conclusion
I chose option 2, since in my case I'm performing very intensive operations. In common cases, though, option 1 is probably best, as it balances complexity and performance.
If you need the highest-performance structure and get/put methods are enough, I'd say option 3 is what you're looking for.
Hope this helps someone.
Suppose I have a PyTorch Variable in GPU:
var = Variable(torch.rand((100,100,100))).cuda()
What's the best way to copy (not bridge) this variable to a NumPy array?
var.clone().data.cpu().numpy()
or
var.data.cpu().numpy().copy()
By running a quick benchmark, .clone() was slightly faster than .copy(). However, .clone() + .numpy() will create a PyTorch Variable plus a NumPy bridge, while .copy() will create a NumPy bridge plus a new NumPy array.
This is a very interesting question. In my view it is a little bit opinion-based, and I would like to share my opinion on it.
Of the two approaches above, I would prefer the first one (using clone()). Since your goal is to copy information, you essentially need to invest extra memory either way. clone() and copy() should take a similar amount of storage, since creating the NumPy bridge doesn't allocate extra memory. Also, I didn't understand what you meant by "copy() will create two NumPy arrays". And since, as you mentioned, clone() is faster than copy(), I don't see any other problem with using clone().
I would love to give this a second thought if anyone can provide some counter-arguments.
Because clone() is recorded by autograd, the second option is less costly. There are a few other options you may also consider, such as detaching first; see the sketch below.
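A sketch of the detach-based route (my own illustration, assuming a CUDA device is available as in the question; in modern PyTorch, Variable has been merged into Tensor):
import torch

var = torch.rand(100, 100, 100).cuda()

# detach() drops the autograd history without copying; .cpu() then copies
# the data to host memory, and .numpy() bridges that host tensor.
arr = var.detach().cpu().numpy().copy()

# Since .cpu() on a CUDA tensor already allocates fresh host memory, the
# trailing .copy() is arguably redundant here; it is kept only for safety.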
I am working with bidimensional arrays on Numpy for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2, everything runs smoothly. However, if I use scipy.linalg.pinv, problems sometimes arise (30-40% of the time).
The reason why I am using pinv2 is because I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j of all ones, pinv(H) has huge coefficients at row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some nan values (Y is an array of small integers).
Now, I am not well-versed enough in linear algebra and numerical computation to understand whether this is a bug or some precision-related property of the two functions. I would like you to answer this question so that, if it turns out to be a bug, I can file a bug report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
Any help is appreciated. In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56 and 99 have huge values, while in pinv2(H) the same rows have far more reasonable values.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses a least-squares solver, which may be quite compute-intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
Note that in some cases these functions cannot converge to a solution and will then raise a scipy.linalg.LinAlgError. In that case you may try the second pinv implementation (pinv2), which will greatly reduce the number of errors you receive.
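A minimal sketch of that fallback (applicable to SciPy versions prior to 1.7, where pinv2 still exists; robust_pinv is my own name):
import scipy.linalg

def robust_pinv(H):
    # Try the least-squares based pinv first; fall back to the SVD-based
    # pinv2 if it fails to converge, as suggested above.
    try:
        return scipy.linalg.pinv(H)
    except scipy.linalg.LinAlgError:
        return scipy.linalg.pinv2(H)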
Starting from SciPy 1.7.0, pinv2 is deprecated, and pinv is likewise based on an SVD solution:
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means numpy.linalg.pinv, scipy.linalg.pinv, and scipy.linalg.pinv2 now all compute equivalent solutions. They are also about equally fast, with SciPy being slightly faster.
import numpy as np
import scipy.linalg  # import the submodule explicitly; "import scipy" alone may not expose scipy.linalg

arr = np.random.rand(1000, 2000)

# All three now use an SVD-based algorithm (SciPy >= 1.7).
res1 = np.linalg.pinv(arr)
res2 = scipy.linalg.pinv(arr)
res3 = scipy.linalg.pinv2(arr)  # deprecated alias; emits a DeprecationWarning

np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)