Fastest way to iterate over array columns and stack them vertically - numpy

Some of this might not make much sense, but I tried to simulate my problem as faithfully as possible. Despite this, the core of the problem is very simple: it's an iteration in which columns are grabbed sequentially and stacked vertically, growing an array.
Is there a more efficient way to achieve this? With my real data, this procedure takes almost 30 minutes. The random data in the snippet below is a downsized version of the problem.
import numpy as np
#Initialize big 3d arrays:
big_3d_arr_1=np.random.randint(1, 5, size=(20000, 15000,2))
big_3d_arr_2=np.random.randint(1, 5, size=(20000, 15000,2))
#Get slices and stack them:
slice_1 = np.vstack((big_3d_arr_1[:,-1,1],big_3d_arr_1[:,0,1],big_3d_arr_1[:,0,0] ) ).T
slice_2 = np.vstack((big_3d_arr_2[:,0,1],big_3d_arr_2[:,-1,1],big_3d_arr_2[:,0,0] ) ).T
slices = np.vstack(( slice_1,slice_2) )
#Iterate over rest of columns and grow a stacked array
for i in range(1, 1000):
    print(i)
    slice_1 = np.vstack((big_3d_arr_1[:, -1, 1], big_3d_arr_1[:, i, 1], big_3d_arr_1[:, 0, 0])).T
    slice_2 = np.vstack((big_3d_arr_2[:, i, 1], big_3d_arr_2[:, -1, 1], big_3d_arr_2[:, 0, 0])).T
    slices_0 = np.vstack((slice_1, slice_2))
    slices = np.vstack((slices_0, slices))  # new arrays go on top
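One way to avoid the quadratic cost (each np.vstack call reallocates and copies the whole accumulated result) is to collect the blocks in a Python list and stack once at the end. A minimal sketch of that idea, with reduced array sizes and the same column choices as above; the final row order matches the loop, with later blocks on top:
import numpy as np
rng = np.random.default_rng(0)
big_3d_arr_1 = rng.integers(1, 5, size=(2000, 1500, 2))  # reduced shapes for the sketch
big_3d_arr_2 = rng.integers(1, 5, size=(2000, 1500, 2))
blocks = []
for i in range(999, -1, -1):  # reversed so newer blocks end up on top, as in the loop above
    blocks.append(np.column_stack((big_3d_arr_1[:, -1, 1], big_3d_arr_1[:, i, 1], big_3d_arr_1[:, 0, 0])))
    blocks.append(np.column_stack((big_3d_arr_2[:, i, 1], big_3d_arr_2[:, -1, 1], big_3d_arr_2[:, 0, 0])))
slices = np.vstack(blocks)  # a single allocation instead of ~1000 growing copies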

Related

Adding a third dimension to my 2D array in a for loop

I have a for loop that produces a 16 x 8 2D array on each iteration. I want to stack all of these 2D arrays along the z-axis into a 3D array, so that I can compute the variance over the z-axis. I have tried multiple approaches, such as np.dstack, matrix3D[p,:,:] = ... and np.newaxis, both inside and outside the loop. However, the closest I've come to my desired output is the last array repeated and stacked on top of itself, and the dimensions were way off. I need to keep the original 16 x 8 format. By now I'm in a bit too deep and could use a nudge in the right direction!
My code:
excludedElectrodes = [1, a.numberOfColumnsInArray, a.numberOfElectrodes-a.numberOfColumnsInArray+1, a.numberOfElectrodes]
matrixEA = np.full([a.numberOfRowsInArray, a.numberOfColumnsInArray], np.nan)
for iElectrode in range(a.numberOfElectrodes):
    if a.numberOfDeflectionsPerElectrode[iElectrode] != 0:
        matrixEA[iElectrode // a.numberOfColumnsInArray][iElectrode % a.numberOfColumnsInArray] = 0
for iElectrode in range(a.numberOfElectrodes):
    if iElectrode+1 not in excludedElectrodes:
        """Preprocessing"""
        # Loop over heartbeats
        for p in range(1, len(iLAT)):
            # Calculate parameters, store them in right row-col combo (electrode number)
            matrixEA[iElectrode // a.numberOfColumnsInArray][iElectrode % a.numberOfColumnsInArray] = (np.trapz(abs(correctedElectrogram[limitA[0]:limitB[0]]-totalBaseline[limitA[0]:limitB[0]]))/(1000))
# Stack all matrixEA arrays along z axis
matrix3D = np.dstack(matrixEA)
This example snippet does what you want, although I suspect your errors have more to do with the parts that aren't related to the concatenation. Here, we index with None to create a new axis of length one (along which we then concatenate the 2D arrays).
import numpy as np
# Function creates a dummy (16, 8) array
def foo(a):
    return np.random.random((16, 8)) + a
arrays2D = []
# Your loop
for i in range(10):
    # Calculate your (16, 8) array
    f = foo(i)
    # And append it to the list
    arrays2D.append(f)
# Stack arrays along a new last dimension
array3D = np.concatenate([i[..., None] for i in arrays2D], axis=-1)
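For what it's worth, np.stack can do the same in a single call by inserting the new axis itself; a small sketch of that variant, continuing the snippet above and including the variance over the new axis:
array3D = np.stack(arrays2D, axis=-1)  # shape (16, 8, 10); same result as the concatenate above
variance = array3D.var(axis=-1)        # variance over the z-axis, shape (16, 8)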

How to build a numpy matrix one row at a time?

I'm trying to build a matrix one row at a time.
import numpy as np
f = np.matrix([])
f = np.vstack([ f, np.matrix([1]) ])
This is the error message.
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 1
As you can see, np.matrix([]) is NOT an empty list. I'm going to have to do this some other way. But what? I'd rather not do an ugly workaround kludge.
You have to give the initial matrix some dimensions, either by filling it with zeros or by using np.empty():
f = np.empty(shape = [1,1])
f = np.vstack([f,np.matrix([1])])
You can use np.hstack for the first row instead, then use np.vstack iteratively.
arr = np.array([])
arr = np.hstack((arr, np.array([1,1,1])))
arr = np.vstack((arr, np.array([2,2,2])))
Now you can convert into a matrix.
mat = np.asmatrix(arr)
Good grief. It appears there is no way to do what I want. Kludgetown it is. I'll build an array with a bogus first entry, then when I'm done make a copy without the bogosity.
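A pattern that avoids both the empty-matrix error and the bogus first entry, sketched here with made-up rows, is to collect the rows in a plain Python list and stack once at the end:
import numpy as np
rows = []
for i in range(1, 4):
    rows.append([i, i, i])  # build each row however you need
mat = np.vstack(rows)       # stack once at the end; shape (3, 3)
# mat = np.asmatrix(mat)    # only if you really need the np.matrix type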

generate large array in dask

I would like to calculate the SVD of a large matrix with Dask. I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So I'm looking for a workaround. I tried saving the large (around 65,000 x 65,000, or even more) array into HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using a memory-mapped numpy array (np.memmap) instead?
Below I share sample code without any Dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes long strings and, with a window size of 8, generates combinations of two-letter words. In the actual data the window size will be 20, the words will be 8 letters long, and the input string can be 3 GB long.
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16,16)) # in actual sample we expect 65000x65000 array
# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"
# accumulate results
all_pairs=[]
def generate_pairs(sequence):
    pairs = []
    for i in range(len(sequence)-8+1):
        window = sequence[i:i+8]
        words = [window[i:i+2] for i in range(0, len(window), 2)]
        for pair in itertools.combinations(words, 2):
            pairs.append(pair)
    return pairs
# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))
# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[two_index[j[0]], two_index[j[1]]] += 1
print(counts)
EDIT: I might have made the question more complicated than it needs to be, so let me paraphrase. I need to construct a single large 2D array of size ~65000x65000, filled with counts of how often each (word1, word2) pair occurs. Since Dask arrays do not support item assignment/mutation, I cannot fill the array as pairs are processed. Is there a workaround to generate/fill a large 2D array with Dask?
Here's simpler code to test:
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16,16))
for i in range(len(seq)-8+1):
    window = seq[i:i+8]
    words = [window[i:i+2] for i in range(0, len(window), 2)]
    for pair in itertools.combinations(words, 2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!
print(counts)
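One workaround (not using Dask, so this is only a sketch of the counting side) is to accumulate the pair counts in an ordinary dictionary first and only then build the matrix, for example as a scipy.sparse COO matrix, so the ~65000 x 65000 result never has to exist as a dense array. scipy.sparse and collections.Counter here are my suggestion, not part of the original question:
import itertools
from collections import Counter
from scipy.sparse import coo_matrix
bases = ['A', 'C', 'G', 'T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {w: k for k, w in enumerate(all_two)}
seq = "AAAAACCATCGACTACGACTAC"
pair_counts = Counter()
# count (word1, word2) pairs without ever mutating a big array
for i in range(len(seq)-8+1):
    window = seq[i:i+8]
    words = [window[j:j+2] for j in range(0, len(window), 2)]
    pair_counts.update(itertools.combinations(words, 2))
rows = [two_index[a] for a, b in pair_counts]
cols = [two_index[b] for a, b in pair_counts]
data = list(pair_counts.values())
# sparse matrix built in one shot; .toarray() is only sensible for this small test case
counts = coo_matrix((data, (rows, cols)), shape=(16, 16))
print(counts.toarray())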

Store regression coefficients, merge back into data-frame

I'm trying to estimate a random effects model, and store those coefficients. I then want to merge them to the data-frame to predict the dependent variable.
There is a random effect coefficient for each group. In the data-frame, if an observation belongs to group 1, I want the group 1 coefficient listed there. For observations in group 2, the group 2 coefficient and so on.
I am able to access and store the coefficients, but I'm not able to merge them back into the data-frame, and I'm not sure how to go about it. Here is the code I have so far:
md = smf.mixedlm('y ~ x', data=df, groups=train['GroupID'])
mdf = md.fit()
I tried storing the coefficients in three ways:
re_coeffs = pd.Series(mdf.random_effects.values) #creates a series with shape (1,)
re_coeffs = [(k) for k in mdf.random_effects.values()] #creates a list with the coeffs
re_coeffs = np.array(mdf.random_effects.values) #creates array with shape ()
All of them work, but none of them lets me merge the coefficients back into the original data-frame. I'm not sure whether to use a dictionary or a list, or generally how to think about the merge.
I'd appreciate any suggestions.
This seems to work:
md = smf.mixedlm('y ~ x', data=train, groups=train['GroupID'])
mdf = md.fit()
re_coeffs = [(k) for k in mdf.random_effects.values()]
df = pd.DataFrame(re_coeffs)
df['GroupID'] = df.index  # so the merge key below exists; assumes groups are labeled 0..n-1 in this order
merged = pd.merge(train, df, on=['GroupID'])
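A slightly more robust variant, sketched below and assuming a random-intercept-only model, takes the group labels directly from the keys of mdf.random_effects rather than relying on the positional index matching the group IDs:
import pandas as pd
# mdf.random_effects is a dict mapping group label -> Series of random-effect terms
re_df = pd.DataFrame({
    'GroupID': list(mdf.random_effects.keys()),
    're_intercept': [v.iloc[0] for v in mdf.random_effects.values()],
})
# left-join so every observation gets its group's coefficient
merged = train.merge(re_df, on='GroupID', how='left')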

Faster alternative than a loop to create a numpy array

I have an array that contains point clouds (about 100 ladar points each). I need to create a set of numpy arrays from it as quickly as possible.
out = np.empty(
    shape=(len(sweep.points),),
    dtype=[
        ('point', np.float64, 3),
        ('intensity', np.float32),
        ## ..... more fields ....
    ]
)
for index, point in enumerate(sweep.points):
    out[index]['point'] = (point.x, point.y, point.z)
    out[index]['intensity'] = point.intensity
    ## ....more fields...
Writing an explicit loop is very inefficient and slow. Is there a better way to go about this?
It's slightly faster to use a list comprehension to format the data and pass it directly to a numpy array:
arr = np.array(
    [((point.x, point.y, point.z), point.intensity)
     for point in points],
    dtype=[('point', np.float64, 3),
           ('intensity', np.float32)],
)
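Once built, fields of the structured array can be pulled out by name (arr here refers to the result of the expression above):
xyz = arr['point']              # (N, 3) float64 view of all coordinates
intensities = arr['intensity']  # (N,) float32 view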