Row / column vs linear indexing speed (spatial locality) - optimization

Related question:
This one
I am using a spatial grid which can potentially get big (10^6 nodes) or even bigger. I will regularly have to perform displacement operations (like a particle from a node to another). I'm not a crack in informatics but I begin to understand the concepts of cache lines and spatial locality, though not well yet. So, I was wandering if it is preferible to use a 2D array (and if yes, which one? I'd prefer to avoid boost for now, but maybe I will link it later) and indexing the displacement for example like this:
Array[i][j] -> Array[i-1][j+2]
or, with a 1D array, if NX is the "equivalent" number of columns:
Array[i*NX+j] -> Array[(i-1)*NX+j+2]
Knowing that it will be done nearly one million times per iteration, with nearly one million iteration as well.

With modern compilers and optimization enabled both of these will probably generate the exact same code
Array[i-1][j+2] // Where Array is 2-dimensional
and
Array[(i-1)*NX+j+2] // Where Array is 1-dimensional
assuming NX is the dimension of the second subscript in the 2-dimensional Array (the number of columns).

Related

Typed lists vs ND-arrays in Numba

Could someone, please clarify that what is the benefit of using a Numba typed list over an ND array? Also, how do the two compares in terms of speed, and in what context would it be recommended to use the typed list?
Typed lists are useful when your need to append a sequence of elements but you do not know the total number of elements and you could not even find a reasonable bound. Such a data structure is significantly more expensive than a 1D array (both in memory space and computation time).
1D arrays cannot be resized efficiently: a new array needs to be created and a copy must be performed. However, the indexing of 1D arrays is very cheap. Numpy also provide many functions that can natively operate on them (lists are implicitly converted to arrays when passed to a Numpy function and this process is expensive). Note that is the number of items can be bounded to a reasonably size (ie. not much higher than the number of actual element), you can create a big array, then add the elements and finally work on a sub-view of the array.
ND arrays cannot be directly compared with lists. Note that lists of lists are similar to jagged array (they can contains lists of different sizes) while ND array are likes a (fixed-size) N x ... x M table. Lists of lists are very inefficient and often not needed.
As a result, use ND arrays when you can and you do not need to often resize them (or append/remove elements). Otherwise, use typed lists.

Kotlin's Array vs ArrayList vs List for storing large amounts of data

I'm building a Deep Neural Network in Kotlin (I know Python would be better, but I have to do that in Kotlin).
For training the net I need a huge amount of data from the MNIST database, this means I need to read about 60,000 images from a single file in IDX format and store them for simultaneous use.
Every image consists of 784 Bytes. So the total size is:
784*60,000 = 47,040,000 = ~47 MB of training data.
Which ain't that much, since I'm running the JVM in an 8GB RAM env.
After reading an image i need to convert it to a KMatrix, a custom data structure for matrix math operations. Under the hood of a KMatrix there's an Array<Array<Double>>.
I need a structure to store all the images at once, so I'm currently using a List<KMatrix>, which basically tranlates to a List<Array<Array<Double>>>
The problem is that while building the List<KMatrix> the Garbage Collector runs out of memory, launching a OutOfMemoryException: GC overhead limit exceeded.
I wonder if the problem is which data structures I'm using (i.e. should I use an ArrayList instead of an Array?) or maybe how I'm building the entire thing up (i.e. I need some optimization work to do).
I'll put the code, if needed, as soon as I can.
Thanks for your help.
Self-answer with the summarized solution (Thanks to answers by #Tenfour04 and #gidds)
As #Tenfour04 stated, you have basically three alternatives to the Array<Array<Double>> for the KMatrix:
an Array<DoubleArray> which mantains the same logic as the original, but saving lots of memory and increasing performance;
a 1-Dimensional DoubleArray which saves a bit of extra memory and performance, but with increased complexity given by the index-mapping of the array (the [i;j] element of the matrix is given by the [i * w + j] element of the array), and this probably isn't worth it as #gidds pointed out;
a 1-D DoubleBuffer created with ByteBuffer.allocateDirect(8 * size).asDoubleBuffer(), which improves performances even further but has only get and put methods, so it is useless if you need simple and direct set operations.
Conclusion
I choose the option 2, since in my case I'm performing very intensive operations, but in common cases, probably option 1 is the best as it is balanced in complexity and performance.
If you need a highest-performance structure and read/put methods are enough, I'd say that option 3 is what you're looking for.
Hope this helps someone

Implementing a 2D recursive spatial filter using Scipy

Minimally, I would like to know how to achieve what is stated in the title. Specifically, signal.lfilter seems like the only implementation of a difference equation filter in scipy, but it is 1D, as shown in the docs. I would like to know how to implement a 2D version as described by this difference equation. If that's as simple as "bro, use this function," please let me know, pardon my naiveté, and feel free to disregard the rest of the post.
I am new to DSP and acknowledging there might be a different approach to answering my question so I will explain the broader goal and give context for the question in the hopes someone knows how do want I want with Scipy, or perhaps a better way than what I explicitly asked for.
To get straight into it, broadly speaking I am using vectorized computation methods (Numpy/Scipy) to implement a Monte Carlo simulation to improve upon a naive for loop. I have successfully abstracted most of my operations to array computation / linear algebra, but a few specific ones (recursive computations) have eluded my intuition and I continually end up in the digital signal processing world when I go looking for how this type of thing has been done by others (that or machine learning but those "frameworks" are much opinionated). The reason most of my google searches end up on scipy.signal or scipy.ndimage library references is clear to me at this point, and subsequent to accepting the "signal" representation of my data, I have spent a considerable amount of time (about as much as reasonable for a field that is not my own) ramping up the learning curve to try and figure out what I need from these libraries.
My simulation entails updating a vector of data representing the state of a system each period for n periods, and then repeating that whole process a "Monte Carlo" amount of times. The updates in each of n periods are inherently recursive as the next depends on the state of the prior. It can be characterized as a difference equation as linked above. Additionally this vector is theoretically indexed on an grid of points with uneven stepsize. Here is an example vector y and its theoretical grid t:
y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
I need to iteratively perform numerous operations to y for each of n "updates." Specifically, I am computing the curvature along the curve y(t) using finite difference approximations and using the result at each point to adjust the corresponding y(t) prior to the next update. In a loop this amounts to inplace variable reassignment with the desired update in each iteration.
y += some_function(y)
Not only does this seem inefficient, but vectorizing things seems intuitive given y is a vector to begin with. Furthermore I am interested in preserving each "updated" y(t) along the n updates, which would require a data structure of dimensions len(y) x n. At this point, why not perform the updates inplace in the array? This is wherein lies the question. Many of the update operations I have succesfully vectorized the "Numpy way" (such as adding random variates to each point), but some appear overly complex in the array world.
Specifically, as mentioned above the one involving computing curvature at each element using its neighbouring two elements, and then imediately using that result to update the next row of the array before performing its own curvature "update." I was able to implement a non-recursive version (each row fails to consider its "updated self" from the prior row) of the curvature operation using ndimage generic_filter. Given the uneven grid, I have unique coefficients (kernel weights) for each triplet in the kernel footprint (instead of always using [1,-2,1] for y'' if I had a uniform grid). This last part has already forced me to use a spatial filter from ndimage rather than a 1d convolution. I'll point out, something conceptually similar was discussed in this math.exchange post, and it seems to me only the third response saliently addressed the difference between mathematical notion of "convolution" which should be associative from general spatial filtering kernels that would require two sequential filtering operations or a cleverly merged kernel.
In any case this does not seem to actually address my concern as it is not about 2D recursion filtering but rather having a backwards looking kernel footprint. Additionally, I think I've concluded it is not applicable in that this only allows for "recursion" (backward looking kernel footprints in the spatial filtering world) in a manner directly proportional to the size of the recursion. Meaning if I wanted to filter each of n rows incorporating calculations on all prior rows, it would require a convolution kernel far too big (for my n anyways). If I'm understanding all this correctly, a recursive linear filter is algorithmically more efficient in that it returns (for use in computation) the result of itself applied over the previous n samples (up to a level where the stability of the algorithm is affected) using another companion vector (z). In my case, I would only need to look back one step at output signal y[n-1] to compute y[n] from curvature at x[n] as the rest works itself out like a cumsum. signal.lfilter works for this, but I can't used that to compute curvature, as that requires a kernel footprint that can "see" at least its left and right neighbors (pixels), which is how I ended up using generic_filter.
It seems to me I should be able to do both simultaneously with one filter namely spatial and recursive filtering; or somehow I've missed the maths of how this could be mathematically simplified/combined (convolution of multiples kernels?).
It seems like this should be a common problem, but perhaps it is rarely relevant to do both at once in signal processing and image filtering. Perhaps this is why you don't use signals libraries solely to implement a fast monte carlo simulation; though it seems less esoteric than using a tensor math library to implement a recursive neural network scan ... which I'm attempting to do right now.
EDIT: For those familiar with the theoretical side of DSP, I know that what I am describing, the process of designing a recursive filters with arbitrary impulse responses, is achieved by employing a mathematical technique called the z-transform which I understand is generally used for two things:
converting between the recursion coefficients and the frequency response
combining cascaded and parallel stages into a single filter
Both are exactly what I am trying to accomplish.
Also, reworded title away from FIR / IIR because those imply specific definitions of "recursion" and may be confusing / misnomer.

Element-wise operations on arrays of different rank

How do I multiply two arrays of different rank, element-wise? For example, element-wise multiplying every row of a matrix with a vector.
real :: a(m,n), b(n)
My initial thought was to use spread(b,...), but it is my understanding that this tiles b in memory, which would make it undesirable for large arrays.
In MATLAB I would use bsxfun for this.
If the result of the expression is simply being assigned to another variable (versus being an intermediate in a more complicated expression or being used as an actual argument), then a loop (DO [CONCURRENT]) or FORALL assignment is likely to be best from the point of view of execution speed (though it will be processor dependent).

Optimizing Python KD Tree Searches

Scipy (http://www.scipy.org/) offers two KD Tree classes; the KDTree and the cKDTree.
The cKDTree is much faster, but is less customizable and query-able than the KDTree (as far as I can tell from the docs).
Here is my problem:
I have a list of 3million 2 dimensional (X,Y) points. I need to return all of the points within a distance of X units from every point.
With the KDtree, there is an option to do just this: KDtree.query_ball_tree() It generates a list of lists of all the points within X units from every other point. HOWEVER: this list is enormous and quickly fills up my virtual memory (about 744 million items long).
Potential solution #1: Is there a way to parse this list into a text file as it is writing?
Potential solution #2: I have tried using a for loop (for every point in the list) and then finding that single point's neighbors within X units by employing: KDtree.query_ball_point(). HOWEVER: this takes forever as it needs to run the query millions of times. Is there a cKDTree equivalent of this KDTree tool?
Potential solution #3: Beats me, anyone else have any ideas?
From scipy 0.12 on, both KD Tree classes have feature parity. Quoting its announcement:
cKDTree feature-complete
Cython version of KDTree, cKDTree, is now feature-complete. Most
operations (construction, query, query_ball_point, query_pairs,
count_neighbors and sparse_distance_matrix) are between 200 and 1000
times faster in cKDTree than in KDTree. With very minor caveats,
cKDTree has exactly the same interface as KDTree, and can be used as a
drop-in replacement.
Try using KDTree.query_ball_point instead. It takes a single point, or array of points, and produces the points within a given distance of the input point(s).
You can use this function to perform batch queries. Give it, say, 100000 points at a time, and then write the results out to a file. Something like this:
BATCH_SIZE = 100000
for i in xrange(0, len(pts), BATCH_SIZE):
neighbours = tree.query_ball_point(pts[i:i+BATCH_SIZE], X)
# write neighbours to a file...