Faster alternative to a loop for creating a numpy array

I have an array which contains point clouds (about 100 ladar points). I need to create a set of numpy arrays as quickly as possible.
# the output array gets its own name so that sweep.points stays reachable in the loop
sweep_arr = np.empty(
    shape=(len(sweep.points),),
    dtype=[
        ('point', np.float64, 3),
        ('intensity', np.float32),
        ## ..... more fields ....
    ]
)
for index, point in enumerate(sweep.points):
    sweep_arr[index]['point'] = (point.x, point.y, point.z)
    sweep_arr[index]['intensity'] = point.intensity
    ## ....more fields...
Writing an explicit loop is very inefficient and slow. Is there a better way to go about this?

It's slightly faster to use a list comprehension to format the data and pass it directly to a numpy array:
np.array([((point.x, point.y, point.z), point.intensity)
          for point in points],
         dtype=[('point', np.float64, 3),
                ('intensity', np.float32)])
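
As a minimal runnable sketch of the answer above: the `Point` namedtuple and the two sample points are hypothetical stand-ins for the question's point objects, which are not shown in the source. The second variant, filling one field at a time, is an alternative that may or may not be faster for your data.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for the question's point objects.
Point = namedtuple("Point", "x y z intensity")
points = [Point(1.0, 2.0, 3.0, 0.5), Point(4.0, 5.0, 6.0, 0.25)]

dtype = [("point", np.float64, 3), ("intensity", np.float32)]

# Variant 1: one list comprehension, one np.array call (as in the answer).
arr = np.array([((p.x, p.y, p.z), p.intensity) for p in points], dtype=dtype)

# Variant 2: allocate once, then assign whole fields instead of single records.
arr_by_field = np.empty(len(points), dtype=dtype)
arr_by_field["point"] = [(p.x, p.y, p.z) for p in points]
arr_by_field["intensity"] = [p.intensity for p in points]

print(arr["point"].shape)  # one (x, y, z) row per point
```

Both variants touch every point in Python once, so the gain over the explicit loop comes from avoiding per-record indexing into the structured array.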

Related

Fastest way to iterate over array columns and stack them vertically

Some of this might not make much sense, but I tried to simulate my problem as faithfully as possible. Despite this, the core of the problem is simple: it's just an iteration where columns are grabbed sequentially and then stacked vertically, growing an array.
Is there a more efficient way to achieve this? With my real data, this procedure takes almost 30 minutes. I've downsized the problem using the random data in the snippet below.
import numpy as np
#Initialize big 3d arrays:
big_3d_arr_1=np.random.randint(1, 5, size=(20000, 15000,2))
big_3d_arr_2=np.random.randint(1, 5, size=(20000, 15000,2))
#Get slices and stack them:
slice_1 = np.vstack((big_3d_arr_1[:,-1,1],big_3d_arr_1[:,0,1],big_3d_arr_1[:,0,0] ) ).T
slice_2 = np.vstack((big_3d_arr_2[:,0,1],big_3d_arr_2[:,-1,1],big_3d_arr_2[:,0,0] ) ).T
slices = np.vstack(( slice_1,slice_2) )
#Iterate over rest of columns and grow a stacked array
for i in range(1, 1000):
    print(i)
    slice_1 = np.vstack((big_3d_arr_1[:,-1,1],big_3d_arr_1[:,i,1],big_3d_arr_1[:,0,0] ) ).T
    slice_2 = np.vstack((big_3d_arr_2[:,i,1],big_3d_arr_2[:,-1,1],big_3d_arr_2[:,0,0] ) ).T
    slices_0 = np.vstack(( slice_1,slice_2) )
    slices = np.vstack(( slices_0,slices) ) #new arrays go on top
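
The page gives no answer to this question, so the following is only a sketch of one possible approach: allocate the result once and fill it with broadcast assignments instead of calling `np.vstack` on a growing array, which copies everything accumulated so far on every iteration. The function names, the small array sizes and the `n_cols` parameter are assumptions made for illustration; `stack_by_growing` restates the question's loop so the two results can be compared.

```python
import numpy as np

def stack_by_growing(a1, a2, n_cols):
    """Reference: same logic as the question's loop (new blocks go on top)."""
    slices = None
    for i in range(n_cols):
        s1 = np.vstack((a1[:, -1, 1], a1[:, i, 1], a1[:, 0, 0])).T
        s2 = np.vstack((a2[:, i, 1], a2[:, -1, 1], a2[:, 0, 0])).T
        block = np.vstack((s1, s2))
        slices = block if slices is None else np.vstack((block, slices))
    return slices

def stack_preallocated(a1, a2, n_cols):
    """Fill one preallocated array; axis 0 = column index i, axis 1 = slice_1/slice_2."""
    n_rows = a1.shape[0]
    out = np.empty((n_cols, 2, n_rows, 3), dtype=np.result_type(a1, a2))
    out[:, 0, :, 0] = a1[:, -1, 1]          # constant column, broadcast over i
    out[:, 0, :, 1] = a1[:, :n_cols, 1].T   # the column that changes with i
    out[:, 0, :, 2] = a1[:, 0, 0]
    out[:, 1, :, 0] = a2[:, :n_cols, 1].T
    out[:, 1, :, 1] = a2[:, -1, 1]
    out[:, 1, :, 2] = a2[:, 0, 0]
    # Reverse the block order because the loop puts the newest block on top.
    return out[::-1].reshape(-1, 3)

rng = np.random.default_rng(0)
small_1 = rng.integers(1, 5, size=(6, 8, 2))
small_2 = rng.integers(1, 5, size=(6, 8, 2))
result = stack_preallocated(small_1, small_2, n_cols=5)
print(np.array_equal(result, stack_by_growing(small_1, small_2, 5)))
```

Whether this helps with the real data depends on available memory, since the full result still has to fit in RAM.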

How to build a numpy matrix one row at a time?

I'm trying to build a matrix one row at a time.
import numpy as np
f = np.matrix([])
f = np.vstack([ f, np.matrix([1]) ])
This is the error message.
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 1
As you can see, np.matrix([]) is NOT an empty list. I'm going to have to do this some other way. But what? I'd rather not do an ugly workaround kludge.

You have to give the initial matrix some dimensions. Either fill it with zeros or use np.empty(). Note that with shape [1,1] the first row is a placeholder that has to be dropped at the end:
f = np.empty(shape = [1,1])
f = np.vstack([f,np.matrix([1])])

You can use np.hstack for the first row instead, then use vstack iteratively.
arr = np.array([])
arr = np.hstack((arr, np.array([1,1,1])))
arr = np.vstack((arr, np.array([2,2,2])))
Now you can convert into a matrix.
mat = np.asmatrix(arr)
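
Two further options, sketched here under the assumption that every row has the same known width (3 in this made-up example): collect the rows in a Python list and convert once, or start from an array with zero rows so that no bogus first entry is needed. The row values are placeholders.

```python
import numpy as np

# Option 1: accumulate plain Python rows, convert once at the end.
rows = []
for i in range(3):
    rows.append([i, 2 * i, 3 * i])  # placeholder row data
from_list = np.array(rows, dtype=float)

# Option 2: start with zero rows and three columns, then vstack as usual.
grown = np.empty((0, 3))
for row in rows:
    grown = np.vstack([grown, row])

print(from_list.shape, grown.shape)
```

Option 1 is usually the cheaper of the two, because each `np.vstack` call in option 2 copies the whole array built so far. Either result can be passed to `np.asmatrix` if a matrix is required.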

Good grief. It appears there is no way to do what I want. Kludgetown it is. I'll build an array with a bogus first entry, then when I'm done make a copy without the bogosity.

generate large array in dask

I would like to compute the SVD of a large matrix with Dask. I naively tried to create an empty 2D array and update it in a loop, but Dask does not allow mutating the array.
So I'm looking for a workaround. I tried saving a large array (around 65,000 x 65,000, or even more) to HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using mmap (a memory-mapped numpy array) instead?
Below is sample code without any Dask implementation. Should I use dask.bag or dask.delayed for this operation?
The sample code takes long strings and, in a window of size 8, generates combinations of two-letter words. In the actual data, the window size would be 20 and the words 8 letters long, and the input string can be 3 Gb long.
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.)
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16,16)) # in actual sample we expect 65000x65000 array
# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"
# accumulate results
all_pairs=[]
def generate_pairs(sequence):
    pairs=[]
    for i in range(len(sequence)-8+1):
        window=sequence[i:i+8]
        words= [window[k:k+2] for k in range(0, len(window), 2)]
        for pair in itertools.combinations(words,2):
            pairs.append(pair)
    return pairs
# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))
# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[ two_index[j[0]], two_index[j[1]] ] += 1
print(counts)
EDIT: I may have made the question more complicated than necessary, so let me paraphrase it. I need to construct a single large 2D array of size ~65000x65000. The array needs to be filled with counts of occurrences of (word1, word2) pairs. Since Dask does not allow item assignment on a Dask array, I cannot fill the array as pairs are processed. Is there a workaround to generate and fill a large 2D array with Dask?
Here's simpler code to test:
import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}
seq = "AAAAACCATCGACTACGACTAC"
counts = np.zeros(shape=(16,16))
for i in range(len(seq)-8+1):
    window=seq[i:i+8]
    words= [window[k:k+2] for k in range(0, len(window), 2)]
    for pair in itertools.combinations(words,2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1 # problematic part!
print(counts)
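
No answer to this question appears on the page. As a hedged sketch of one way to avoid per-cell assignment: encode every word position as an integer, then count each (word1, word2) offset pair with `np.bincount` on a flattened index. This uses plain NumPy only; the helper names `encode_words` and `count_pairs` are made up for illustration, and the input is assumed to contain only the letters A, C, G and T.

```python
import itertools

import numpy as np

BASES = "ACGT"
WORD, WINDOW = 2, 8              # word length and window size from the question
N_WORDS = len(BASES) ** WORD     # 16 possible two-letter words

def encode_words(seq):
    """Integer index (AA -> 0, AC -> 1, ...) of the word starting at each position."""
    lookup = np.full(256, -1, dtype=np.int64)
    for code, base in enumerate(BASES):
        lookup[ord(base)] = code
    codes = lookup[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]
    idx = np.zeros(len(seq) - WORD + 1, dtype=np.int64)
    for k in range(WORD):
        idx = idx * len(BASES) + codes[k:len(codes) - WORD + 1 + k]
    return idx

def count_pairs(seq):
    """Pair counts over all windows of one sequence, without a per-window loop."""
    idx = encode_words(seq)
    n_win = len(seq) - WINDOW + 1
    flat_counts = np.zeros(N_WORDS * N_WORDS, dtype=np.int64)
    # Word offsets inside a window (0, 2, 4, 6) and every ordered pair of them.
    for a, b in itertools.combinations(range(0, WINDOW, WORD), 2):
        flat = idx[a:a + n_win] * N_WORDS + idx[b:b + n_win]
        flat_counts += np.bincount(flat, minlength=N_WORDS * N_WORDS)
    return flat_counts.reshape(N_WORDS, N_WORDS)

seq = "AAAAACCATCGACTACGACTAC"
counts = count_pairs(seq)
print(counts.sum())
```

Because the counts for separate sequences (or separate chunks of one sequence, with enough overlap at the boundaries) are independent and simply add up, this kind of function is the sort of unit that could be wrapped in `dask.delayed` and summed. At the real size a dense 65000x65000 accumulator may not fit in memory, so a sparse or chunked accumulator would probably be needed; that part is not shown here.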

Python: AttributeError: "'numpy.float64' object has no attribute 'tanh'"

I have seen a couple of questions with similar titles, but none of them answers my question satisfactorily: how do I take the tan inverse, or say the exp, of a numpy ndarray? For instance, a piece of my code looks like this:
import numpy as np
from numpy import ndarray,zeros,array,dot,exp
import itertools
from functools import reduce
def zetta_G(x,spr_g,theta_g,c_g):
    #this function computes estimated g:
    #c_g is basically a matrix of dim equal to g and whose elements contains list of centers that describe the fuzzy system for each element of g:
    m,n=c_g.shape[0],c_g.shape[1]
    #creating an empty matrix of dim mxn to hold regressors:
    zetta_g=zeros((m,n),dtype=ndarray)
    #creating an empty matrix of dim mxn to hold estimated g:
    z_g=np.zeros((m,n),dtype=ndarray)
    #for filling rows
    for k in range(m):
        #for filling columns
        for p in range(n):
            #container to hold-length being equal to number of inputs(e1,e2,e3 etc)
            Mu=[[] for i in range(len(x))]
            for i in range(len(x)):
                #filling that with number of zeros equal to len of center
                Mu[i]=np.zeros(len(c_g[k][p]))
            #creating an empty list for holding rules
            M=[]
            #piece of code for creating rules-all possible combinations
            for i in range(len(x)):
                for j in range(len(c_g[k][p])):
                    Mu[i][j]=exp(-.5*((x[i]-c_g[k][p][j])/spr_g[k][p])**2)
            b=list(itertools.product(*Mu))
            for i in range(len(b)):
                M.append(reduce(lambda x,y:x*y,b[i]))
            M=np.array(M)
            S=np.sum(M)
            #import pdb;pdb.set_trace()
            zetta_g[k][p]=M/S
            z_g[k][p]=dot(M/S,theta_g[k][p])
    return zetta_g,z_g
if __name__=='__main__':
    x=[1.2,.2,.4]
    cg11,cg12,cg13,cg21,cg22,cg23,cg31,cg32,cg33=[-10,-8,-6,-4,-2,0,2,4,6,8,10],[-10,-8,-6,-4,-2,0,2,4,6,8,10],[-10,-8,-6,-4,-2,0,2,4,6,8,10],[-10,-8,-6,-4,-2,0,2,4,6,8,10],[-10,-8,-6,-4,-2,0,2,4,6,8,10],[-12,-9,-6,-3,0,3,6,9,12],[-6.5,-4.5,-2.5,0,2.5,4.5,6.5],[-5,-4,-3,-2,-1,0,1,2,3,4,5],[-3.5,-2.5,-1.5,0,1.5,2.5,3.5]
    C,spr_f=array([[-10,-8,-6,-4,-2,0,2,4,6,8,10],[-10,-8,-6,-4,-2,0,2,4,6,8,10],[-10,-8,-6,-4,-2,0,2,4,6,8,10]]),[2.2,2,2.1]
    #the inner lists have different lengths, so recent NumPy versions need dtype=object here
    c_g=array([[cg11,cg12,cg13],[cg21,cg22,cg23],[cg31,cg32,cg33]],dtype=object)
    spr_g=array([[2,2.1,2],[2.1,2.2,3],[2.5,1,1.5]])
    theta_g=np.zeros((c_g.shape[0],c_g.shape[1]),dtype=ndarray)
    #import pdb;pdb.set_trace()
    N=0
    for i in range(c_g.shape[0]):
        for j in range(c_g.shape[1]):
            length=len(c_g[i][j])**len(x)
            theta_g[i][j]=np.random.sample(length)
            N=N+(len(c_g[i][j]))**len(x)
    zetta_g,z_g=zetta_G(x,spr_g,theta_g,c_g)
    #zetta is a function that accepts following args-- x: which is a list of certain dim, spr_g: is a matrix of dimension similar to theta_g and c_g. theta_g and c_g are numpy matrices with lists as individual elements
    print(zetta_g)
    print(z_g)
    inv=np.tanh(z_g)
    print(inv)

In [89]: a=np.array([[1],[3],[2]],dtype=np.ndarray)
In [90]: a
Out[90]:
array([[1],
       [3],
       [2]], dtype=object)
Note that the dtype is object, not ndarray. If the dtype isn't one of the recognized numeric or string types, it is object, a generic pointer, just like the elements of a list.
In [91]: np.tanh(a)
AttributeError: 'int' object has no attribute 'tanh'
np.tanh is trying to delegate the task to the elements of the array. Math on object dtype arrays is commonly performed by list-like iteration over the elements. It does not use the fast compiled numeric numpy math.
If a is ordinary number array:
In [95]: np.tanh(np.array([[1],[3],[2]]))
Out[95]:
array([[0.76159416],
       [0.99505475],
       [0.96402758]])
With object dtype arrays, your ability to do numeric calculations is limited. Some things work, others don't. It's hit-or-miss.
Here's a first stab at cleaning up your code; it's not tested.
def zetta_G(x,spr_g,theta_g,c_g):
    m,n=c_g.shape[0],c_g.shape[1]
    #creating an empty matrix of dim mxn to hold regressors:
    zetta_g=zeros((m,n),dtype=object)
    #creating an empty matrix of dim mxn to hold estimated g:
    z_g=np.zeros((m,n),dtype=object)
    #for filling rows
    for k in range(m):
        #for filling columns
        for p in range(n):
            #container to hold-length being equal to number of inputs(e1,e2,e3 etc)
            Mu = np.zeros((len(x), len(c_g[k,p])))
            #c_g[k,p] is a list, so convert it to an array before subtracting
            for i in range(len(x)):
                Mu[i,:]=exp(-.5*((x[i]-np.array(c_g[k,p]))/spr_g[k,p])**2)
            # probably can calc Mu without any loop
            #creating an empty list for holding rules
            M = []
            b=list(itertools.product(*Mu))
            for i in range(len(b)):
                M.append(reduce(lambda x,y:x*y,b[i]))
            M=np.array(M)
            S=np.sum(M)
            zetta_g[k,p]=M/S
            z_g[k,p]=dot(M/S,theta_g[k,p])
    return zetta_g,z_g
Running your code, and adding some .shape displays I see that
z_g is (3,3) and contains just single numbers. So it can be initialized as a plain 2d float array:
z_g=np.zeros((m,n))
theta_g is (3,3), but with variable length array elements
print([i.shape for i in theta_g.flat])
[(1331,), (1331,), (1331,), (1331,), (1331,), (729,), (343,), (1331,), (343,)]
zetta_g matches in shapes
If I change:
x=np.array([1.2,.2,.4])
I can calculate Mu without a loop with:
Mu = exp(-.5*((x[:,None]-np.array(c_g[k,p])[None,:])/spr_g[k,p])**2)
c_g is a (3,3) array with variable length lists; I can vectorize the
((x[i]-c_g[k,p][j])
expression with:
x[:,None]-np.array(c_g[k,p])[None,:]
Not a big time saver here since x has 3 elements and c_g elements are only 7-11 long. But cleaner.
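
To make the broadcasting step concrete, here is a small self-contained comparison of the double loop and the one-line version. The `centers` and `spread` values are taken from one cell of the question's data (cg33 and its entry in spr_g), but the variable names are chosen for this sketch only.

```python
import numpy as np

x = np.array([1.2, 0.2, 0.4])
centers = np.array([-3.5, -2.5, -1.5, 0, 1.5, 2.5, 3.5])  # one c_g[k,p] list
spread = 1.5                                              # matching spr_g[k,p]

# Loop version, as in the question.
mu_loop = np.zeros((len(x), len(centers)))
for i in range(len(x)):
    for j in range(len(centers)):
        mu_loop[i, j] = np.exp(-0.5 * ((x[i] - centers[j]) / spread) ** 2)

# Broadcast version: (3, 1) minus (1, 7) gives a (3, 7) array in one step.
mu_vec = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / spread) ** 2)

print(np.allclose(mu_loop, mu_vec))
```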
In this running code I don't see a tanh, so I don't know what kinds of arrays are using that.

You set the type of the array's elements to dtype=np.ndarray. Replace it with, let's say, dtype=np.float64 or any other numeric type.
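
A short sketch of the difference, assuming an array like z_g that holds one number per cell: with object dtype the ufunc call fails (the exception type and message vary between NumPy versions, so both are caught here), while the same values converted to float64 work. The names `z_obj` and `z_float` are illustrative only.

```python
import numpy as np

# Object-dtype array holding plain numbers, similar to z_g in the question.
z_obj = np.zeros((2, 2), dtype=object)
for idx in np.ndindex(z_obj.shape):
    z_obj[idx] = np.float64(0.5)

try:
    np.tanh(z_obj)
    failed = False
except (AttributeError, TypeError):
    # Older NumPy raises AttributeError, newer versions raise TypeError.
    failed = True

# Converting to a numeric dtype restores the compiled elementwise math.
z_float = z_obj.astype(np.float64)
result = np.tanh(z_float)
print(failed, result.dtype)
```

The conversion only works when every cell holds a single number; cells that hold arrays of different lengths, like zetta_g, have to stay in an object array or a list.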

How to delete the small non-zero values and increase the sparsity?

I have a large csr_matrix (46000*46000), but this matrix is very dense; its sparsity is about 0.05%. Most of the non-zero values are less than 1. I want to delete these values and increase the sparsity.
import scipy.sparse as sp
cgc=sp.load_npz('/root/cg.npz')
print(cgc.count_nonzero()) #2115920056
cgc=cgc[cgc>1] #too slow

You have two options:
Zero the elements in place, and then convert. This works on the array in place to save memory and time, but changes your original array (which seems to be saved to disk anyway, so it shouldn't be a problem):
cgc[cgc<1] = 0
cgc = sp.csr_matrix(cgc)
Build the sparse indices and construct the sparse matrix from them (doesn't overwrite the original, but is slow and memory-intensive):
i, j = np.nonzero(cgc > 1)
cgc_sparse = sp.csr_matrix((cgc[i, j], (i, j)), shape=cgc.shape)
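
Since cgc in the question is already a csr_matrix, a third possibility is to threshold its stored values directly and then drop the explicit zeros. This is a sketch on a tiny made-up matrix, not a benchmark; it uses the `data` attribute and the `eliminate_zeros()` method of SciPy's CSR format and follows the question's cutoff of 1.

```python
import numpy as np
import scipy.sparse as sp

# Tiny stand-in for the large matrix loaded from disk in the question.
dense = np.array([[0.0, 0.2, 3.0],
                  [1.5, 0.0, 0.7],
                  [0.0, 0.0, 2.0]])
small = sp.csr_matrix(dense)

# Work only on the stored (non-zero) values, never on the full 2D grid.
small.data[small.data < 1] = 0
small.eliminate_zeros()   # remove the entries that were just set to zero

print(small.nnz)
```

This modifies the matrix in place, so reload the file or take a copy first if the original values are still needed.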
