How to interpolate a 5 dimensional array? - numpy

I have an array of shape [41, 101, 6, 4, 280]. I want to interpolate it so that if I give it a temperature value (along the axis of 41) and a density value (along the axis of 101), it spits out an array of shape [6, 4, 280]. Is there a NumPy function that can deal with this?

Let's take this step by step:
Q : Is there a NumPy function that can deal with this?
Yes, there is.
The first step is to generate an instance of a 5D numpy.ndarray that will contain your known data points (don't mind the dtype; it is just a reminder that we can go literally from bits up to complex128 values here, if later needed):
>>> import numpy as np
>>>
>>> a5Dtensor = np.ndarray( (41, 101, 6, 4, 280 ), dtype = np.uint8 )
Now, let's validate its .shape:
>>> a5Dtensor.shape
(41, 101, 6, 4, 280)
The core trick is NumPy's built-in smart slicing:
>>> a5Dtensor[0,0,:,:,:].shape
(6, 4, 280)
This indeed returns the requested 3D-cube of data-points.
Slicing is also smart in that it produces no new memory allocation, which becomes important once the sizes grow beyond the L1/L2/L3 CPU-cache horizons, and even more so once the data exceeds a few GB:
>>> a5Dtensor[0,0,:,:,:].flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False <------ may enjoy FORTRAN efficient data layout, where needed
OWNDATA : False <------ 3D-cube data not "copied", rather "viewed" inside 5D
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Last, but not least: if aTemperatureVALUE and aDensityVALUE are not indices into the 5D array but data values lying between the known grid points, then you are after a real interpolation of the 3D cube's datapoint values. NumPy can serve a piecewise-linear interpolation here (with some constraints), but each value in the resulting 3D cube of interpolated values requires its own 2D interpolation, run per 3D-cube coordinate and based on the values held at the nearest lower and upper temperature and density grid points present in the original 5D data.
NumPy has other smart tools for this (the nD .meshgrid() method, .argwhere(), and others), yet pre-sorting or indirect indexing may be needed if the original 5D data points are not already sorted along the first two dimensions; having them sorted makes the sought-after 2D (temperature, density) interpolator much easier to build (whether tailor-made for dtype=uint8, float64, complex128, or object).
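As a minimal sketch of that idea (assuming the 41 temperature and 101 density grid values live in sorted 1D arrays, here called temps and dens, both hypothetical names), a pure-NumPy bilinear interpolation over the first two axes could look like this:
import numpy as np

def interp_cube(data, temps, dens, T, rho):
    # data: shape (41, 101, 6, 4, 280); temps, dens: sorted 1D grids
    i = np.clip(np.searchsorted(temps, T) - 1, 0, temps.size - 2)
    j = np.clip(np.searchsorted(dens, rho) - 1, 0, dens.size - 2)
    tT = (T - temps[i]) / (temps[i + 1] - temps[i])  # fractional position in the grid cell
    tD = (rho - dens[j]) / (dens[j + 1] - dens[j])
    return ((1 - tT) * (1 - tD) * data[i, j]
            + tT * (1 - tD) * data[i + 1, j]
            + (1 - tT) * tD * data[i, j + 1]
            + tT * tD * data[i + 1, j + 1])          # shape (6, 4, 280)
(For regular grids like this, scipy.interpolate.RegularGridInterpolator provides the same piecewise-linear behaviour out of the box, including trailing value dimensions.)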

Related

Numpy Random Choice with Non-regular Array Size

I'm making an array of sums of random choices from a negative binomial distribution (nbd), with each sum being of non-regular length. Right now I implement it as follows:
import numpy as np
from numpy.random import default_rng
rng = default_rng()
nbd = rng.negative_binomial(1, 0.5, int(1e6))
gmc = [12, 35, 4, 67, 2]
n_pp = np.empty(len(gmc))
for i in range(len(gmc)):
    n_pp[i] = np.sum(rng.choice(nbd, gmc[i]))
This works, but when I perform it over my actual data it's very slow (gmc has length 1e6), and I would like to vary this for multiple values of n and p in the nbd (in this example they're set to 1 and 0.5, respectively).
I'd like to work out a pythonic way to do this which eliminates the loop, but I'm not sure it's possible. I want to keep default_rng for the better random generation than the older way of doing it (np.random.choice), if possible.
The distribution of the sum of m samples from the negative binomial distribution with parameters (n, p) is the negative binomial distribution with parameters (m*n, p). So instead of summing random selections from a large, precomputed sample of negative_binomial(1, 0.5), you can generate your result directly with negative_binomial(gmc, 0.5):
In [68]: gmc = [12, 35, 4, 67, 2]
In [69]: npp = rng.negative_binomial(gmc, 0.5)
In [70]: npp
Out[70]: array([ 9, 34, 1, 72, 7])
(The negative_binomial method will broadcast its inputs, so we can pass gmc as an argument to generate all the samples with one call.)
More generally, if you want to vary the n that is used to generate nbd, you would multiply that n by the corresponding element in gmc and pass the product to rng.negative_binomial.
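For instance, a small sketch of that generalization (n and p here are placeholder values):
import numpy as np
rng = np.random.default_rng()
gmc = np.array([12, 35, 4, 67, 2])
n, p = 3, 0.5                              # hypothetical negative-binomial parameters
n_pp = rng.negative_binomial(gmc * n, p)   # element i ~ sum of gmc[i] draws from nbd(n, p)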

How to conveniently use operations on numpy fortran contiguous arrays?

Some numpy functions like np.matmul(a, b) have convenient behavior for stacks of matrices.
The manual states:
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
Thus, for a.shape = (10, 2, 4) and b.shape = (10, 4, 2), the statement a @ b is meaningful and will have shape (10, 2, 2).
However, I'm coming from the linear algebra world, where I'm used to a Fortran contiguous array layout.
The same a represented as a Fortran contiguous array would have shape (4, 2, 10) and similarly b.shape = (2, 4, 10).
To do a @ b as before I would have to invoke
(a.T @ b.T).T .
Even worse, assume you naively created the same Fortran-contiguous array a with the behavior of matmul in mind, such that it has shape (10, 4, 2).
Then a.strides = (8, 80, 320), with the smallest stride on the 'stack' index, which actually should have the highest stride.
Is this really the way to go or am I missing something?
While numpy can handle all sorts of layouts, many details are designed with the "C" layout in mind. Good examples are how nested lists translate into arrays, and the way numpy operations batch excess dimensions as in the matmul case.
It is correct that, as a rule of thumb, results in numpy do not depend on array layout (Fortran, C, non-contiguous); speed, however, certainly does, and heavily so:
import numpy as np
from timeit import timeit

rng = np.random.default_rng()
a = rng.random((100, 111, 200))
b = rng.random((111, 77, 200))
af = np.array(a, order="F")
bf = np.array(b, order="F")
np.allclose((b.T @ a.T).T, (bf.T @ af.T).T)
# True
timeit(lambda: (b.T @ a.T).T, number=10)
# 5.972857117187232
timeit(lambda: (bf.T @ af.T).T, number=10)
# 0.1994628761895001
In fact, sometimes it is totally worth it to non-lazily transpose, i.e. copy your data into the best layout:
timeit(lambda: (np.array(b.T, order="C") @ np.array(a.T, order="C")).T, number=10)
# 0.3931349152699113
My advice: if you want speed and convenience, it is probably best to go with the "C" layout; it doesn't take all that long to get used to and saves you a lot of potential headaches.
numpy's matrix multiplication works regardless of the internal layout of the array. For example, here are two C-ordered arrays:
>>> import numpy as np
>>> a = np.random.rand(10, 2, 4)
>>> b = np.random.rand(10, 4, 2)
>>> print('a', a.shape, a.strides)
>>> print('b', b.shape, b.strides)
a (10, 2, 4) (64, 32, 8)
b (10, 4, 2) (64, 16, 8)
Here are the equivalent arrays in Fortran order:
>>> af = np.asfortranarray(a)
>>> bf = np.asfortranarray(b)
>>> print('af', af.shape, af.strides)
>>> print('bf', bf.shape, bf.strides)
af (10, 2, 4) (8, 80, 160)
bf (10, 4, 2) (8, 80, 320)
Numpy treats equivalent arrays as equivalent, regardless of their internal layout:
>>> np.allclose(a, af) and np.allclose(b, bf)
True
The results of a matrix multiplication do not depend on the internal layout:
>>> np.allclose(a @ b, af @ bf)
True
and you can even mix layouts if you wish:
>>> np.allclose(a @ bf, af @ b)
True
In short, the most convenient way to use Fortran-ordered arrays in numpy is to not worry about internal array layout: the shape is all that matters.
If your array shapes differ from what the numpy matmul API expects, your best bet is to reshape the arrays, for example using a.transpose(2, 0, 1) @ b.transpose(2, 0, 1) or similar, depending on what is appropriate for your use-case. But don't worry: for C- or Fortran-contiguous arrays this operation only adjusts the metadata of the array view; it does not copy or re-order the underlying data buffer.
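A quick way to convince yourself of that last point (a minimal sketch with made-up shapes):
>>> a = np.asfortranarray(np.random.rand(4, 2, 10))  # Fortran-style axis order
>>> v = a.transpose(2, 0, 1)                         # move the stack axis to the front
>>> v.shape
(10, 4, 2)
>>> v.base is a                                      # v is only a view; no data was copied
True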

Numpy: stack arrays whose internal dimensions differ

I have a situation similar to the following:
import numpy as np
a = np.random.rand(55, 1, 3)
b = np.random.rand(55, 626, 3)
Here the shapes represent the number of observations, then the number of time slices per observation, then the number of dimensions of the observation at the given time slice. So b is a full representation of 3 dimensions for each of the 55 observations at one new time interval.
I'd like to stack a and b into an array with shape 55, 627, 3. How can one accomplish this in numpy? Any suggestions would be greatly appreciated!
To follow up on Divakar's answer above: the axis argument in numpy is the index of a given dimension within an array's shape. Here I want to stack a and b along their middle dimension, which is at index 1:
import numpy as np
a = np.random.rand(5, 1, 3)
b = np.random.rand(5, 100, 3)
# create the desired result shape: (5, 101, 3) (the question's (55, 627, 3), scaled down)
stacked = np.concatenate((b, a), axis=1)
# validate that a was appended to the end of b
print(stacked[:, -1, :], '\n\n\n', a.squeeze())
This returns:
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
[[0.72598529 0.99395887 0.21811998]
[0.9833895 0.465955 0.29518207]
[0.38914048 0.61633291 0.0132326 ]
[0.05986115 0.81354865 0.43589306]
[0.17706517 0.94801426 0.4567973 ]]
A purist might instead use np.all(stacked[:, -1, :] == a.squeeze()) to validate this equivalence. All glory to @Divakar!
Strictly for the curious, the use case for this concatenation is a kind of wonky data preparation pipeline for a Long Short Term Memory Neural Network. In that kind of network, the training data shape should be number_of_observations, number_of_time_intervals, number_of_dimensions_per_observation. I am generating new predictions of each object at a new time interval, so those predictions have shape number_of_observations, 1, number_of_dimensions_per_observation. To visualize the sequence of observations' positions over time, I want to add the new positions to the array of previous positions, hence the question above.
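As a small sketch of that last step (all names here are hypothetical), each new prediction cube gets appended along the time axis exactly as above:
import numpy as np
history = np.random.rand(55, 626, 3)   # positions observed so far
new_step = np.random.rand(55, 1, 3)    # stand-in for the model's new predictions
history = np.concatenate((history, new_step), axis=1)   # -> shape (55, 627, 3)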

Numpy index array of unknown dimensions?

I need to compare a bunch of numpy arrays with different dimensions, say:
a = np.array([1,2,3])
b = np.array([1,2,3],[4,5,6])
assert(a == b[0])
How can I do this if I know neither the shape of a nor that of b, beyond the fact that
len(shape(a)) == len(shape(b)) - 1
and I also do not know which dimension of b to skip? I'd like to use np.index_exp, but that does not seem to help me ...
def compare_arrays(a, b, skip_row):
    u = np.index_exp[ ... ]
    assert(a[:] == b[u])
Edit
Or, to put it otherwise: I want to construct the slicing, given the shape of the array and the dimension I want to skip. How do I dynamically create the np.index_exp, given the number of dimensions and the positions where to put ":" and where to put "0"?
I was just looking at the code for apply_along_axis and apply_over_axis, studying how they construct indexing objects.
Let's make a 4d array:
In [355]: b=np.ones((2,3,4,3),int)
Make a list of slices (using list * replicate)
In [356]: ind=[slice(None)]*b.ndim
In [357]: b[tuple(ind)].shape   # same as b[:,:,:,:]; modern NumPy requires a tuple, not a list
Out[357]: (2, 3, 4, 3)
In [358]: ind[2]=2              # replace one slice with an index
In [359]: b[tuple(ind)].shape   # indexing on the third dim
Out[359]: (2, 3, 3)
Or with your example
In [361]: b = np.array([1,2,3],[4,5,6]) # missing []
...
TypeError: data type not understood
In [362]: b = np.array([[1,2,3],[4,5,6]])
In [366]: ind=[slice(None)]*b.ndim
In [367]: ind[0]=0
In [368]: a==b[tuple(ind)]
Out[368]: array([ True, True, True], dtype=bool)
This indexing is basically the same as np.take, but the same idea can be extended to other cases.
I don't quite follow your questions about the use of :. Note that when building an indexing list I use slice(None). The interpreter translates all indexing : into slice objects: [start:stop:step] => slice(start, stop, step).
Usually you don't need to use a[:]==b[0]; a==b[0] is sufficient. With lists, alist[:] makes a copy; with arrays it does nothing (unless used on the left-hand side of an assignment, a[:]=...).
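Putting the pieces together, a small helper (a sketch; the function name is mine) that builds such an index tuple dynamically. Note that np.index_exp produces exactly these tuples, e.g. np.index_exp[:, 0] == (slice(None, None, None), 0):
import numpy as np

def take_index(b, axis, index=0):
    # ':' on every axis except `axis`, where a single index is used
    ind = [slice(None)] * b.ndim
    ind[axis] = index
    return b[tuple(ind)]

b = np.array([[1, 2, 3], [4, 5, 6]])
take_index(b, axis=0)   # -> array([1, 2, 3])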

Numpy Integer to String Based on Lookup Table

I have a numpy array called "landuse" that's a series of numbers 1-3 representing different landuse categories. I want to convert this to a string based on a lookup table.
ids = [0,1,2,3]
lookup_table = ['None', 'Forest', 'Water', 'Urban']
First let me explain why your loop isn't working. In Python, an assignment such as a = 1 takes the object 1 and gives it the name a. When you do name = "Water", name forgets what it was pointing to before and now points to "Water"; that doesn't mean the previous object that was assigned to name gets replaced with "Water".
That's the problem; now for a fix. If you have your landuse as an array of integer codes, you can just use a lookup table. The table should be big enough that you don't get an indexing error when you do lookup_table[landuse.max()]:
import numpy as np
landuse = np.array([1,2,3,1,2,4])
lookup_table = np.array(['None', 'Forest', 'Water', 'Urban', 'Other'])
landuse_title = lookup_table[landuse]
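With the sample landuse above, the fancy-indexing lookup yields:
print(landuse_title)
# ['Forest' 'Water' 'Urban' 'Forest' 'Water' 'Other']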
And for the final part of your question: the numpy ndarray is a homogeneous data structure, meaning everything in the array must have the same data type. With that limitation in mind, it should be clear that you cannot take a row of integers and replace it with a row of strings. Numpy does have "flexible dtypes" which allow you to do something like:
>>> dt = np.dtype([('name', 'S4'), ('age', 'int'), ('height', 'float')])
>>> array = np.array([('Mark', 25, 70.5),('Ben',40,72.75)], dtype=dt)
>>> array
array([('Mark', 25, 70.5), ('Ben', 40, 72.75)],
dtype=[('name', '|S4'), ('age', '<i4'), ('height', '<f8')])
>>> array.shape
(2,)
>>> array['name']
array(['Mark', 'Ben'],
dtype='|S4')
We've created an array that holds a name, an age, and a height for each person, but notice that the shape of the array is (2,) because we have two "people" in the array. I'm not sure exactly what your needs are, but you could try to use a flexible dtype to hold all the information in one array if that's what you need. Depending on my end goal, I often find it's easier to just use a few separate arrays, or a list of arrays. Hope that helps.
I am not entirely clear what your question is, but it seems you could use a dictionary for this:
import numpy as np
landuse = np.array([1, 2, 3, 1, 2, 4], dtype=int)
a = {1: 'Forest', 2: 'Water'}
print([a.setdefault(i, 'Urban') for i in landuse])
which will emit a list containing the strings you are interested in:
['Forest', 'Water', 'Urban', 'Forest', 'Water', 'Urban']
If your objective is to have the final result in a numpy array of strings, you can do this:
name = np.array([a.setdefault(i, 'Urban') for i in landuse], dtype='|S10')