How do I index a lower dimensional data array with a higher dimensional index array?
E.g.: Given a 1d data array and a 2d index array:
data = np.array([11,12,13])
idx = np.array([[0,1],
                [1,2]])
I would like to get a 2d data array:
np.array([[11,12],
          [12,13]])
This is very easy with NumPy's advanced (fancy) indexing: you simply index the data array with the index array, e.g. data[idx].
data = np.array([11,12,13])
idx = np.array([[0,1],
                [1,2]])
# this will produce the correct result
data[idx]
# array([[11, 12],
#        [12, 13]])
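The key point is that with an integer index array the result takes the shape of the index array, so the same pattern works for any number of index dimensions. A minimal sketch (the 3D index array here is made up purely for illustration):
import numpy as np

data = np.array([11, 12, 13])

# a made-up 3D index array; the result has the same shape as idx
idx = np.array([[[0, 1], [1, 2]],
                [[2, 0], [0, 0]]])

print(data[idx].shape)  # (2, 2, 2)
print(data[idx])
# [[[11 12]
#   [12 13]]
#
#  [[13 11]
#   [11 11]]]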
Related
How can I create a NumPy array of zeros with the same dtype as another array?
I know about zeros_like() but it gives the same shape as the passed array. I am looking for a different shape but the same dtype as the source array.
NumPy arrays have an attribute called dtype. Simply pass this attribute when creating a new array to get the same data type.
a = np.array([0,1,2])
print(a.dtype)  # int64
b = np.array([3.1,4.1,5.1], dtype=a.dtype)
print(b)  # [3 4 5]
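For the original question (zeros of a different shape but the same dtype), you can pass the dtype straight to np.zeros; the shape below is just an example:
# zeros with a different shape but the same dtype as `a`
z = np.zeros((2, 4), dtype=a.dtype)
print(z.dtype)  # int64 (matches a.dtype)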
I have a 4D numpy array A of shape (N,N,N,N) that I would like to convert to a 2D matrix M of shape (N,N) by fixing pairs of indices. For example
M[i,j] = A[i,j,i,j]
How should this be done in numpy, avoiding for loops?
Edit:
I will subsequently access the elements of M using an index array provided by numpy.ix_, so accessing the elements of the 4D array in an analogous way would be a solution as well.
This is a workaround:
i, j = np.arange(N), np.arange(N)
j_idx, i_idx = np.meshgrid(i, j)
M = A[i_idx, j_idx, i_idx, j_idx]
This uses meshgrid to generate the index pattern up front and then fancy-indexes the array A to get M. As @hpaulj suggested, you can pass sparse=True to np.meshgrid() to obtain broadcastable 1D arrays instead of full 2D index arrays and save some space.
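A minimal sketch of that sparse variant (assuming N and A are already defined as above):
# sparse=True returns broadcastable 1D index arrays instead of full (N, N) grids
i_idx, j_idx = np.meshgrid(np.arange(N), np.arange(N), indexing='ij', sparse=True)
M = A[i_idx, j_idx, i_idx, j_idx]   # M[i, j] == A[i, j, i, j]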
You can also do this with np.ix_():
ixgrid = np.ix_(i, j)
M = A[ixgrid + ixgrid]
Since ixgrid is a 2-tuple, ixgrid + ixgrid produces the 4-tuple required for indexing A.
I am trying to insert 72 matrices with dimensions (24,12) from a NumPy array into a preexisting MultiIndex DataFrame indexed according to an np.array with dimensions (72,2). I don't care about indexing the content of the (24,12) matrices; I just need to index the 72 matrices, even as objects, for rearrangement purposes. It is like a map to reorder according to some conditions and then unstack the columns.
What I have tried so far is:
cosphi.shape
(72, 2)
MFPAD_RCR.shape
(72, 24, 12)
df = pd.MultiIndex.from_arrays(cosphi.T, names=("costheta","phi"))
This successfully creates a MultiIndex with two levels (costheta, phi) and 72 rows. Then I try to add the 72 matrices:
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR},index=df)
or possibly
df1 = pd.DataFrame({'MFPAD':MFPAD_RCR.astype(object)},index=df)
I get the error
Exception: Data must be 1-dimensional.
Any idea?
After a bit of careful research, I found that my question has already been answered here (the right answer) and here (a solution using a deprecated function).
For my specific question, the answer is something like:
data = MFPAD_RCR.reshape(72, 288).T
# phiM and cosM are the coordinate arrays for the two inner dimensions
# (presumably of lengths 24 and 12), used to build the per-cell MultiIndex
df = pd.DataFrame(
    data=data,
    index=pd.MultiIndex.from_product([phiM, cosM], names=["phi", "cos(theta)"]),
    columns=['item {}'.format(i) for i in range(72)]
)
Note that the 3D np array has to be reshaped with the second dimension equal to the product of the major and minor index lengths (here 24 × 12 = 288).
df1 = df.T
I want to be able to sort my items (aka matrices) according to extra indexes coming from cosphi:
cosn = np.array([col[0] for col in cosphi])  # first column of cosphi
phin = np.array([col[1] for col in cosphi])  # second column of cosphi
Note: the length of the new indexes has to match the number of items (matrices), i.e. 72.
df1.set_index(pd.Index(cosn, name="cos_ph"), append=True, inplace=True)
df1.set_index(pd.Index(phin, name="phi_ph"), append=True, inplace=True)
And after this one can sort
df1.sort_index(level=1, inplace=True, kind="mergesort")
and reshape
outarray = df1.T.values.reshape(24, 12, 72).transpose(2, 0, 1)
Any suggestion to make the code faster / prettier is more than welcome!
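For reference, a compact, self-contained sketch of the same steps on made-up small dimensions (4 items of shape (3, 2)); all names and values here are illustrative only:
import numpy as np
import pandas as pd

n_items, n_phi, n_cos = 4, 3, 2              # stand-ins for 72, 24, 12
MFPAD = np.random.rand(n_items, n_phi, n_cos)
phiM, cosM = np.arange(n_phi), np.arange(n_cos)

# one column per item, one row per (phi, cos) cell
df = pd.DataFrame(
    MFPAD.reshape(n_items, n_phi * n_cos).T,
    index=pd.MultiIndex.from_product([phiM, cosM], names=["phi", "cos(theta)"]),
    columns=['item {}'.format(i) for i in range(n_items)],
)
df1 = df.T

# extra per-item sort key (hypothetical values standing in for cosphi[:, 0])
cosn = np.random.rand(n_items)
df1 = df1.set_index(pd.Index(cosn, name="cos_ph"), append=True)
df1 = df1.sort_index(level=1, kind="mergesort")

# back to a 3D array, now in the sorted item order
outarray = df1.to_numpy().reshape(n_items, n_phi, n_cos)
print(outarray.shape)  # (4, 3, 2)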
I have 8000 3D arrays stacked in a 4D array with dimensions 8000x16x8x8:
import numpy as np
arr = np.zeros((8000,16,8,8))
Now I have 3 arrays (each of size 8000) of indices to access along each axis:
arr_x = np.random.randint(size=8000, high=16 , low=0)
arr_y = np.random.randint(size=8000, high=8 , low=0)
arr_z = np.random.randint(size=8000, high=8 , low=0)
For every 3D array located at index i (where i goes from 0 to 7999), I want to simultaneously access the specific cell at index arr[i, arr_x[i], arr_y[i], arr_z[i]].
A naive for-loop implementation, with a simple print, would look like:
for i in range(0, 8000):
    print(arr[i, arr_x[i], arr_y[i], arr_z[i]])
Sure, I tried arr[:, arr_x, arr_y, arr_z], but instead of pairing index i with the i-th 3D array it fetches all 8000 indexed cells for every 3D array, giving an (8000, 8000) result.
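A minimal sketch of the usual fix: give the first axis an explicit index array as well, so each i is paired with its own arr_x[i], arr_y[i], arr_z[i]:
idx = np.arange(8000)
values = arr[idx, arr_x, arr_y, arr_z]   # shape (8000,)
# equivalent to: values[i] == arr[i, arr_x[i], arr_y[i], arr_z[i]]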
I would like to calculate the SVD of a large matrix with Dask. I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So, I'm looking for a workaround. I tried saving the large array (around 65,000 x 65,000, or even more) into HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using a memory-mapped NumPy array (np.memmap) instead?
Below, I shared a sample code, without any dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes in long strings and, with a window size of 8, generates combinations of two-letter words. In the actual data, the window size will be 20 and the words will be 8 letters long. The input string can be 3 GB long.
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16,16)) # in actual sample we expect 65000x65000 array
# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"
# accumulate results
all_pairs=[]
def generate_pairs(sequence):
    pairs = []
    # slide a window of length 8 across the sequence
    for i in range(len(sequence) - 8 + 1):
        window = sequence[i:i+8]
        # split the window into four non-overlapping two-letter words
        words = [window[k:k+2] for k in range(0, len(window), 2)]
        for pair in itertools.combinations(words, 2):
            pairs.append(pair)
    return pairs
# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))
# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[two_index[j[0]], two_index[j[1]]] += 1
print(counts)
EDIT: I might have phrased the question in a complicated way, so let me try to paraphrase it. I need to construct a single large 2D array of size ~65000x65000. The array needs to be filled by counting occurrences of (word1, word2) pairs. Since Dask does not allow item assignment/mutation on Dask arrays, I cannot fill the array as the pairs are processed. Is there a workaround to generate/fill a large 2D array with Dask?
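One common workaround is not to mutate a shared array at all, but to build a partial count matrix per chunk of pairs and sum the partials at the end. A minimal sketch with dask.delayed on the 16x16 toy case, reusing all_pairs and two_index from the code above (the chunking of all_pairs and the chunk size are made up; for the real 65000x65000 case the partial matrices would have to be sparse to fit in memory):
import dask

@dask.delayed
def count_chunk(pairs, index, size):
    # dense partial count matrix for one chunk of (word1, word2) pairs
    partial = np.zeros((size, size))
    for w1, w2 in pairs:
        partial[index[w1], index[w2]] += 1
    return partial

# split the pair list into chunks (chunk size is arbitrary here)
chunks = [all_pairs[i:i + 1000] for i in range(0, len(all_pairs), 1000)]
partials = [count_chunk(c, two_index, 16) for c in chunks]

# sum the per-chunk matrices lazily, then compute the final counts
counts = dask.delayed(sum)(partials).compute()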
Here's simpler code to test:
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16,16))
for i in range(len(seq) - 8 + 1):
    window = seq[i:i+8]
    words = [window[k:k+2] for k in range(0, len(window), 2)]
    for pair in itertools.combinations(words, 2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!
print(counts)
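If the goal is just to avoid per-element assignment in the inner loop, the counting step itself can also be vectorized on the NumPy side with np.add.at, which accumulates repeated index pairs correctly. A minimal sketch, reusing seq and two_index from the snippet above:
# collect all index pairs first, then accumulate them in one call
rows, cols = [], []
for i in range(len(seq) - 8 + 1):
    window = seq[i:i+8]
    words = [window[k:k+2] for k in range(0, len(window), 2)]
    for w1, w2 in itertools.combinations(words, 2):
        rows.append(two_index[w1])
        cols.append(two_index[w2])

counts = np.zeros((16, 16))
np.add.at(counts, (np.array(rows), np.array(cols)), 1)  # duplicate pairs accumulate
print(counts)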