How do I index a lower dimensional data array with a higher dimensional index array?
E.g.: Given a 1d data array and a 2d index array:
data = np.array([11,12,13])
idx = np.array([[0,1],
                [1,2]])
I would like to get a 2d data array:
np.array([[11,12],
          [12,13]])
This is very easy with NumPy's advanced (fancy) indexing: you simply index the data array with the index array, e.g. data[idx].
data = np.array([11,12,13])
idx = np.array([[0,1],
                [1,2]])
# this will produce the correct result
data[idx]
# array([[11, 12],
#        [12, 13]])
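The key point is that with an integer index array the result takes the shape of the index array, so the same pattern works for any number of index dimensions. A minimal sketch (the 3D index array here is made up purely for illustration):
import numpy as np

data = np.array([11, 12, 13])

# a made-up 3D index array; the result has the same shape as idx
idx = np.array([[[0, 1], [1, 2]],
                [[2, 0], [0, 0]]])

print(data[idx].shape)  # (2, 2, 2)
print(data[idx])
# [[[11 12]
#   [12 13]]
#
#  [[13 11]
#   [11 11]]]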
Related
How can I create a NumPy array of zeros with the same dtype as another array?
I know about zeros_like() but it gives the same shape as the passed array. I am looking for a different shape but the same dtype as the source array.
NumPy arrays have an attribute called dtype. Simply pass this attribute when creating a new array to get the same data type.
a = np.array([0,1,2])
print(a.dtype)  # int64
b = np.array([3.1,4.1,5.1], dtype=a.dtype)
print(b)  # [3 4 5]
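For the original question (zeros of a different shape but the same dtype), you can pass the dtype straight to np.zeros; the shape below is just an example:
# zeros with a different shape but the same dtype as `a`
z = np.zeros((2, 4), dtype=a.dtype)
print(z.dtype)  # int64 (matches a.dtype)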
I have a 4D numpy array A of shape (N,N,N,N) that I would like to convert to a 2D matrix M of shape (N,N) by fixing pairs of indices. For example
M[i,j] = A[i,j,i,j]
How should this be done in numpy, avoiding for loops?
Edit:
I will subsequently access the elements of M using an index array provided by numpy.ix_, so accessing the elements of the 4D array in an analogous way would be a solution as well.
This is a workaround:
i, j = np.arange(N), np.arange(N)
j_idx, i_idx = np.meshgrid(i, j)
M = A[i_idx, j_idx, i_idx, j_idx]
This uses meshgrid to generate the index pattern up front and then fancy-indexes the array A to get M. As @hpaulj suggested, you can pass sparse=True to np.meshgrid() to obtain broadcastable 1D arrays instead of full 2D index arrays and save some space.
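A minimal sketch of that sparse variant (assuming N and A are already defined as above):
# sparse=True returns broadcastable 1D index arrays instead of full (N, N) grids
i_idx, j_idx = np.meshgrid(np.arange(N), np.arange(N), indexing='ij', sparse=True)
M = A[i_idx, j_idx, i_idx, j_idx]   # M[i, j] == A[i, j, i, j]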
You can also do this with np.ix_():
ixgrid = np.ix_(i, j)
M = A[ixgrid + ixgrid]
Since ixgrid is a 2-tuple, ixgrid + ixgrid produces the 4-tuple required for indexing A.
I am trying to insert 72 matrices with dimensions (24,12) from a NumPy array into a preexisting MultiIndex DataFrame indexed according to an np.array with dimensions (72,2). I don't care about indexing the content of the (24,12) matrices; I just need to index the 72 matrices, even as objects, for rearrangement purposes. It is like a map to reorder according to some conditions and then unstack the columns.
What I have tried so far is:
cosphi.shape
(72, 2)
MFPAD_RCR.shape
(72, 24, 12)
df = pd.MultiIndex.from_arrays(cosphi.T, names=("costheta","phi"))
This successfully creates a MultiIndex with two levels (costheta, phi) and 72 rows. Then I try to add the 72 matrices:
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR},index=df)
or possibly
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR.astype(object)},index=df)
I get the error
Exception: Data must be 1-dimensional.
Any idea?
After a bit of careful research, I found that my question has already been answered here (the right answer) and here (a solution using a deprecated function).
For my specific question, the answer is something like:
data = MFPAD_RCR.reshape(72, 288).T
# phiM and cosM are the coordinate arrays for the two inner dimensions
# (presumably of lengths 24 and 12), used to build the per-cell MultiIndex
df = pd.DataFrame(
    data=data,
    index=pd.MultiIndex.from_product([phiM, cosM], names=["phi", "cos(theta)"]),
    columns=['item {}'.format(i) for i in range(72)]
)
Note that the 3D np array has to be reshaped with the second dimension equal to the product of the major and minor index lengths (here 24 × 12 = 288).
df1 = df.T
I want to be able to sort my items (aka matrices) according to extra indexes coming from cosphi:
cosn = np.array([col[0] for col in cosphi])  # first column of cosphi
phin = np.array([col[1] for col in cosphi])  # second column of cosphi
Note: the length of the new indexes has to match the number of items (matrices), i.e. 72.
df1.set_index(pd.Index(cosn, name="cos_ph"), append=True, inplace=True)
df1.set_index(pd.Index(phin, name="phi_ph"), append=True, inplace=True)
And after this one can sort
df1.sort_index(level=1, inplace=True, kind="mergesort")
and reshape
outarray = df1.T.values.reshape(24, 12, 72).transpose(2, 0, 1)
Any suggestion to make the code faster / prettier is more than welcome!
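For reference, a compact, self-contained sketch of the same steps on made-up small dimensions (4 items of shape (3, 2)); all names and values here are illustrative only:
import numpy as np
import pandas as pd

n_items, n_phi, n_cos = 4, 3, 2              # stand-ins for 72, 24, 12
MFPAD = np.random.rand(n_items, n_phi, n_cos)
phiM, cosM = np.arange(n_phi), np.arange(n_cos)

# one column per item, one row per (phi, cos) cell
df = pd.DataFrame(
    MFPAD.reshape(n_items, n_phi * n_cos).T,
    index=pd.MultiIndex.from_product([phiM, cosM], names=["phi", "cos(theta)"]),
    columns=['item {}'.format(i) for i in range(n_items)],
)
df1 = df.T

# extra per-item sort key (hypothetical values standing in for cosphi[:, 0])
cosn = np.random.rand(n_items)
df1 = df1.set_index(pd.Index(cosn, name="cos_ph"), append=True)
df1 = df1.sort_index(level=1, kind="mergesort")

# back to a 3D array, now in the sorted item order
outarray = df1.to_numpy().reshape(n_items, n_phi, n_cos)
print(outarray.shape)  # (4, 3, 2)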
I have 8000 3D arrays stacked in a 4D array with dimensions 8000x16x8x8:
import numpy as np
arr = np.zeros((8000,16,8,8))
Now I have 3 arrays (each of size 8000) of indices to access along each axis:
arr_x = np.random.randint(size=8000, high=16 , low=0)
arr_y = np.random.randint(size=8000, high=8 , low=0)
arr_z = np.random.randint(size=8000, high=8 , low=0)
For every 3D array located at index i (where i goes from 0 to 7999), I want to simultaneously access the specific cell at index arr[i, arr_x[i], arr_y[i], arr_z[i]].
A naive for-loop implementation, with a simple print, would look like:
for i in range(0, 8000):
    print(arr[i, arr_x[i], arr_y[i], arr_z[i]])
Sure, I tried arr[:, arr_x, arr_y, arr_z], but instead of pairing index i with the i-th 3D array it fetches all 8000 indexed cells for every 3D array, giving an (8000, 8000) result.
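A minimal sketch of the usual fix: give the first axis an explicit index array as well, so each i is paired with its own arr_x[i], arr_y[i], arr_z[i]:
idx = np.arange(8000)
values = arr[idx, arr_x, arr_y, arr_z]   # shape (8000,)
# equivalent to: values[i] == arr[i, arr_x[i], arr_y[i], arr_z[i]]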
I would like to calculate the SVD of a large matrix with Dask. I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So, I'm looking for a workaround. I tried saving the large array (around 65,000 x 65,000, or even more) into HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using a memory-mapped NumPy array (np.memmap) instead?
Below, I shared a sample code, without any dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes in long strings and, with a window size of 8, generates combinations of two-letter words. In the actual data, the window size will be 20 and the words will be 8 letters long. The input string can be 3 GB long.
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16,16)) # in actual sample we expect 65000x65000 array
# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"
# accumulate results
all_pairs=[]
def generate_pairs(sequence):
    pairs = []
    # slide a window of length 8 across the sequence
    for i in range(len(sequence) - 8 + 1):
        window = sequence[i:i+8]
        # split the window into four non-overlapping two-letter words
        words = [window[k:k+2] for k in range(0, len(window), 2)]
        for pair in itertools.combinations(words, 2):
            pairs.append(pair)
    return pairs
# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))
# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[two_index[j[0]], two_index[j[1]]] += 1
print(counts)
EDIT: I might have phrased the question in a complicated way, so let me try to paraphrase it. I need to construct a single large 2D array of size ~65000x65000. The array needs to be filled by counting occurrences of (word1, word2) pairs. Since Dask does not allow item assignment/mutation on Dask arrays, I cannot fill the array as the pairs are processed. Is there a workaround to generate/fill a large 2D array with Dask?
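One common workaround is not to mutate a shared array at all, but to build a partial count matrix per chunk of pairs and sum the partials at the end. A minimal sketch with dask.delayed on the 16x16 toy case, reusing all_pairs and two_index from the code above (the chunking of all_pairs and the chunk size are made up; for the real 65000x65000 case the partial matrices would have to be sparse to fit in memory):
import dask

@dask.delayed
def count_chunk(pairs, index, size):
    # dense partial count matrix for one chunk of (word1, word2) pairs
    partial = np.zeros((size, size))
    for w1, w2 in pairs:
        partial[index[w1], index[w2]] += 1
    return partial

# split the pair list into chunks (chunk size is arbitrary here)
chunks = [all_pairs[i:i + 1000] for i in range(0, len(all_pairs), 1000)]
partials = [count_chunk(c, two_index, 16) for c in chunks]

# sum the per-chunk matrices lazily, then compute the final counts
counts = dask.delayed(sum)(partials).compute()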
Here's simpler code to test:
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16,16))
for i in range(len(seq) - 8 + 1):
    window = seq[i:i+8]
    words = [window[k:k+2] for k in range(0, len(window), 2)]
    for pair in itertools.combinations(words, 2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!
print(counts)
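If the goal is just to avoid per-element assignment in the inner loop, the counting step itself can also be vectorized on the NumPy side with np.add.at, which accumulates repeated index pairs correctly. A minimal sketch, reusing seq and two_index from the snippet above:
# collect all index pairs first, then accumulate them in one call
rows, cols = [], []
for i in range(len(seq) - 8 + 1):
    window = seq[i:i+8]
    words = [window[k:k+2] for k in range(0, len(window), 2)]
    for w1, w2 in itertools.combinations(words, 2):
        rows.append(two_index[w1])
        cols.append(two_index[w2])

counts = np.zeros((16, 16))
np.add.at(counts, (np.array(rows), np.array(cols)), 1)  # duplicate pairs accumulate
print(counts)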