How to dot product (1,10^{13}) (10^13,1) scipy sparse matrix - numpy

Basically what the title entails.
The two matrices are mostly zeros. And the first is 1 x 9999999999999 and the second is 9999999999999 x 1
When I try to do a dot product I get this.
Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
Full traceback </br>
MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
In [31]: imputed.dot(s)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-31-670cfc69d4cf> in <module>
----> 1 imputed.dot(s)
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in dot(self, other)
357
358 """
--> 359 return self * other
360
361 def power(self, n, dtype=None):
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in __mul__(self, other)
478 if self.shape[1] != other.shape[0]:
479 raise ValueError('dimension mismatch')
--> 480 return self._mul_sparse_matrix(other)
481
482 # If it's a list or whatever, treat it like a matrix
~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other)
499
500 major_axis = self._swap((M, N))[0]
--> 501 other = self.__class__(other) # convert to this format
502
503 idx_dtype = get_index_dtype((self.indptr, self.indices,
~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
32 arg1 = arg1.copy()
33 else:
---> 34 arg1 = arg1.asformat(self.format)
35 self._set_self(arg1)
36
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in asformat(self, format, copy)
320 # Forward the copy kwarg, if it's accepted.
321 try:
--> 322 return convert_method(copy=copy)
323 except TypeError:
324 return convert_method()
~/.local/lib/python3.8/site-packages/scipy/sparse/csc.py in tocsr(self, copy)
135 idx_dtype = get_index_dtype((self.indptr, self.indices),
136 maxval=max(self.nnz, N))
--> 137 indptr = np.empty(M + 1, dtype=idx_dtype)
138 indices = np.empty(self.nnz, dtype=idx_dtype)
139 data = np.empty(self.nnz, dtype=upcast(self.dtype))
MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
It seems the scipy is trying to create a temp array.
I am using the .dot method that scipy provides.
I am also open to non-scipy solutions.
Thanks!

In [105]: from scipy import sparse
If I make a (100,1) csr matrix:
In [106]: A = sparse.random(100,1,format='csr')
In [107]: A
Out[107]:
<100x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
The data and indices are:
In [109]: A.data
Out[109]: array([0.19060481])
In [110]: A.indices
Out[110]: array([0], dtype=int32)
In [112]: A.indptr
Out[112]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
So even with only 1 nonzero term, one array is large (101).
On the other hand the csc format for the same array has a much smaller storage. But csc with (1,100) shape will look like the csr.
In [113]: Ac = A.tocsc()
In [114]: Ac.indptr
Out[114]: array([0, 1], dtype=int32)
In [115]: Ac.indices
Out[115]: array([88], dtype=int32)
Math, especially matrix products is done with csr/csc formats. So it may be hard to avoid this 80 TB memory use.
Looking at the traceback I see that it's trying to convert other to the format that matches self.
So with A.dot(B), and A is (1,N) csr, the small shape. B is (N,1) csc, also the small shape. But B.tocsr() requires the large (N+1,) shaped indptr.
Let's try an alternative to dot
First 2 matrices:
In [122]: A = sparse.random(1,100, .2,format='csr')
In [123]: B = sparse.random(100,1, .2,format='csc')
In [124]: A
Out[124]:
<1x100 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
In [125]: B
Out[125]:
<100x1 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Column format>
In [126]: A#B
Out[126]:
<1x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
In [127]: _.A
Out[127]: array([[1.33661021]])
Their nonzero element indices. Only the ones that match matter.
In [128]: A.indices, B.indices
Out[128]:
(array([16, 20, 23, 28, 30, 37, 39, 40, 43, 49, 54, 59, 61, 63, 67, 70, 74,
91, 94, 99], dtype=int32),
array([ 5, 8, 15, 25, 34, 35, 40, 46, 47, 51, 53, 60, 68, 70, 75, 81, 87,
90, 91, 94], dtype=int32))
equality matrix:
In [129]: mask = A.indices[:,None]==B.indices
In [132]: np.nonzero(mask.any(axis=0))
Out[132]: (array([ 6, 13, 18, 19]),)
In [133]: np.nonzero(mask.any(axis=1))
Out[133]: (array([ 7, 15, 17, 18]),)
The matching indices:
In [139]: A.indices[Out[133]]
Out[139]: array([40, 70, 91, 94], dtype=int32)
In [140]: B.indices[Out[132]]
Out[140]: array([40, 70, 91, 94], dtype=int32)
sum of the corresponding data values matches [127]
In [141]: (A.data[Out[133]]*B.data[Out[132]]).sum()
Out[141]: 1.3366102138511582

Related

CountVectorizer not working in ColumnTransformer

Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a reproduced case:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create a sample data frame
df = pd.DataFrame({
'corpus': ['This is the first document.', 'This document is the second document.', 'And this is the third one.',
'Is this the first document?', 'I have the fourth document'],
'word_length': [27, 37, 26, 27, 26]
})
text_feature = ["corpus"]
count_transformer = CountVectorizer()
# Create the ColumnTransformer
ct = ColumnTransformer(transformers=[
("count", count_transformer, text_feature)],
remainder='passthrough')
ct.fit_transform(df)
The output says:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 5
I tried the code below which does the job but is doesn't scale easily as ColumnTransformer().
np.c_[count_transformer.fit_transform(df["corpus"]).toarray(), df["word_length"].values].
The result is the numpy array below:
array([[ 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 27],
0, 2, 0, 0, 0, 1, 0, 1, 1, 0, 1, 37],
1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 26],
0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 27],
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 26]], dtype=int64)

How to one hot encode numpy array of arrays using keras to_categorical?

I have target data y_tr in shape (92107, 49) where each sample (row) is an array of length 49. I would like to loop over y_tr and one hot encode each array. I have been using keras to_categorical though I am facing an Indexerror.
Array in y_tr:
array([ 379, 379, 379, 252, 4166, 391, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0], dtype=int32)
The number of classes to encode each integer index = 10,000. (I do not wish to encode zero values):
onehots=[]
for seq in y_tr[:,1:]:
try:
onehots.append(to_categorical(seq - 1, num_classes=tar_vocab_size, dtype=np.int32))
except IndexError as e:
raise e
76 n = y.shape[0]
77 categorical = np.zeros((n, num_classes), dtype=dtype)
---> 78 categorical[np.arange(n), y] = 1
79 output_shape = input_shape + (num_classes,)
80 categorical = np.reshape(categorical, output_shape)
IndexError: index 28350 is out of bounds for axis 1 with size 10000

Counting zeros in a rolling - numpy array (including NaNs)

I am trying to find a way of Counting zeros in a rolling using numpy array ?
Using pandas I can get it using:
df['demand'].apply(lambda x: (x == 0).rolling(7).sum()).fillna(0))
or
df['demand'].transform(lambda x: x.rolling(7).apply(lambda x: 7 - np.count _nonzero(x))).fillna(0)
In numpy, using the code from Here
def rolling_window(a, window_size):
shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
print(shape)
strides = (a.strides[0],) + a.strides
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
np.count_nonzero(rolling_window(arr==0, 7), axis=1)
Output:
array([2, 3])
However, I need the first 6 NaNs as well, and fill it with zeros:
Expected output:
array([0, 0, 0, 0, 0, 0, 2, 3])
Think an efficient one would be with 1D convolution -
def sum_occurences_windowed(arr, W):
K = np.ones(W, dtype=int)
out = np.convolve(arr==0,K)[:len(arr)]
out[:W-1] = 0
return out
Sample run -
In [42]: arr
Out[42]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [43]: sum_occurences_windowed(arr,W=7)
Out[43]: array([0, 0, 0, 0, 0, 0, 2, 3])
Timings on varying length arrays and window of 7
Including count_rolling from #Quang Hoang's post.
Using benchit package (few benchmarking tools packaged together; disclaimer: I am its author) to benchmark proposed solutions.
import benchit
funcs = [sum_occurences_windowed, count_rolling]
in_ = {n:(np.random.randint(0,5,(n)),7) for n in [10,20,50,100,200,500,1000,2000,5000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Length')
t.plot(logx=True, save='timings.png')
Extending to generic n-dim arrays
from scipy.ndimage.filters import convolve1d
def sum_occurences_windowed_ndim(arr, W, axis=-1):
K = np.ones(W, dtype=int)
out = convolve1d((arr==0).astype(int),K,axis=axis,origin=-(W//2))
out.swapaxes(axis,0)[:W-1] = 0
return out
So, on a 2D array, for counting along each row, use axis=1 and for cols, axis=0 and so on.
Sample run -
In [155]: np.random.seed(0)
In [156]: a = np.random.randint(0,3,(3,10))
In [157]: a
Out[157]:
array([[0, 1, 0, 1, 1, 2, 0, 2, 0, 0],
[0, 2, 1, 2, 2, 0, 1, 1, 1, 1],
[0, 1, 0, 0, 1, 2, 0, 2, 0, 1]])
In [158]: sum_occurences_windowed_ndim(a, W=7)
Out[158]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
# Verify with earlier 1D solution
In [159]: np.vstack([sum_occurences_windowed(i,7) for i in a])
Out[159]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
Let's test out our original 1D input array -
In [187]: arr
Out[187]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [188]: sum_occurences_windowed_ndim(arr, W=7)
Out[188]: array([0, 0, 0, 0, 0, 0, 2, 3])
I would modify the function as follow:
def count_rolling(a, window_size):
shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
strides = (a.strides[0],) + a.strides
rolling = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
out = np.zeros_like(a)
out[window_size-1:] = (rolling == 0).sum(1)
return out
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
count_rolling(arr,7)
Output:
array([0, 0, 0, 0, 0, 0, 2, 3])

Can we use pytorch scatter_ on GPU

I'm trying to do one hot encoding on some data with pyTorch on GPU mode, however, it keeps giving me an exception. Can anybody help me?
Here's one example:
def char_OneHotEncoding(x):
coded = torch.zeros(x.shape[0], x.shape[1], 101)
for i in range(x.shape[1]):
coded[:,i] = scatter(x[:,i])
return coded
def scatter(x):
return torch.zeros(x.shape[0], 101).scatter_(1, x.view(-1,1), 1)
So if I give it an tensor on GPU, it shows like this:
x_train = [[ 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0],
[14, 13, 83, 18, 14],
[ 0, 0, 0, 0, 0]]
print(char_OneHotEncoding(torch.tensor(x_train, dtype=torch.long).cuda()).shape)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-62-95c0c4ade406> in <module>()
4 [14, 13, 83, 18, 14],
5 [ 0, 0, 0, 0, 0]]
----> 6 print(char_OneHotEncoding(torch.tensor(x_train, dtype=torch.long).cuda()).shape)
7 x_train[:5, maxlen:maxlen+5]
<ipython-input-53-055f1bf71306> in char_OneHotEncoding(x)
2 coded = torch.zeros(x.shape[0], x.shape[1], 101)
3 for i in range(x.shape[1]):
----> 4 coded[:,i] = scatter(x[:,i])
5 return coded
6
<ipython-input-53-055f1bf71306> in scatter(x)
7
8 def scatter(x):
----> 9 return torch.zeros(x.shape[0], 101).scatter_(1, x.view(-1,1), 1)
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #3 'index'
BTW, if we simply remove the .cuda() here, everything goes one well
print(char_OneHotEncoding(torch.tensor(x_train, dtype=torch.long)).shape)
torch.Size([5, 5, 101])
Yes, it is possible. You have to pay attention that all tensors are on GPU. In particular, by default, constructors like torch.zeros allocate on CPU, which will lead to this kind of mismatches. Your code can be fixed by constructing with device=x.device, as below
import torch
def char_OneHotEncoding(x):
coded = torch.zeros(x.shape[0], x.shape[1], 101, device=x.device)
for i in range(x.shape[1]):
coded[:,i] = scatter(x[:,i])
return coded
def scatter(x):
return torch.zeros(x.shape[0], 101, device=x.device).scatter_(1, x.view(-1,1), 1)
x_train = torch.tensor([
[ 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0],
[14, 13, 83, 18, 14],
[ 0, 0, 0, 0, 0]
], dtype=torch.long, device='cuda')
print(char_OneHotEncoding(x_train).shape)
Another alternative are constructors called xxx_like, for instance zeros_like, though in this case, since you need different shapes than x, I found device=x.device more readable.

Fill blocks at random places on each 2D slice of a 3D array

I have 3D numpy array, for example, like this:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]]])
Is there a way to index it in such a way that I select, for example, top right corner of 2x2 elements in the first plane, and a center 2x2 elements subarray from the second plane? So that I could then zero out the elements 2,3,6,7,21,22,25,26:
array([[[ 0, 1, 0, 0],
[ 4, 5, 0, 0],
[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 0, 0, 23],
[24, 0, 0, 27],
[28, 29, 30, 31]]])
I have a batch of images, and I need to zero out a small window of fixed size, but at different (random) locations for each image in the batch. The first dimension is number of images.
Something like this:
a[:, x: x+2, y: y+2] = 0
where x and y are vectors which have different values for each first dimension of a.
Approach #1 : Here'e one approach that's mostly based on linear-indexing -
def random_block_fill_lidx(a, N, fillval=0):
# a is input array
# N is blocksize
# Store shape info
m,n,r = a.shape
# Get all possible starting linear indices for each 2D slice
possible_start_lidx = (np.arange(n-N+1)[:,None]*r + range(r-N+1)).ravel()
# Get random start indices from all possible ones for all 2D slices
start_lidx = np.random.choice(possible_start_lidx, m)
# Get linear indices for the block of (N,N)
offset_arr = (a.shape[-1]*np.arange(N)[:,None] + range(N)).ravel()
# Add in those random start indices with the offset array
idx = start_lidx[:,None] + offset_arr
# On a 2D view of the input array, use advance-indexing to set fillval.
a.reshape(m,-1)[np.arange(m)[:,None], idx] = fillval
return a
Approach #2 : Here's another and possibly more efficient one (for large 2D slices) using advanced-indexing -
def random_block_fill_adv(a, N, fillval=0):
# a is input array
# N is blocksize
# Store shape info
m,n,r = a.shape
# Generate random start indices for second and third axes keeping proper
# distance from the boundaries for the block to be accomodated within.
idx0 = np.random.randint(0,n-N+1,m)
idx1 = np.random.randint(0,r-N+1,m)
# Setup indices for advanced-indexing.
# First axis indices would be simply the range array to select one per elem.
# We need to extend this to 3D so that the latter dim indices could be aligned.
dim0 = np.arange(m)[:,None,None]
# Second axis indices would idx0 with broadcasted additon of blocksized
# range array to cover all block indices along this axis. Repeat for third.
dim1 = idx0[:,None,None] + np.arange(N)[:,None]
dim2 = idx1[:,None,None] + range(N)
a[dim0, dim1, dim2] = fillval
return a
Approach #3 : With the old-trusty loop -
def random_block_fill_loopy(a, N, fillval=0):
# a is input array
# N is blocksize
# Store shape info
m,n,r = a.shape
# Generate random start indices for second and third axes keeping proper
# distance from the boundaries for the block to be accomodated within.
idx0 = np.random.randint(0,n-N+1,m)
idx1 = np.random.randint(0,r-N+1,m)
# Iterate through first and use slicing to assign fillval.
for i in range(m):
a[i, idx0[i]:idx0[i]+N, idx1[i]:idx1[i]+N] = fillval
return a
Sample run -
In [357]: a = np.arange(2*4*7).reshape(2,4,7)
In [358]: a
Out[358]:
array([[[ 0, 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12, 13],
[14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27]],
[[28, 29, 30, 31, 32, 33, 34],
[35, 36, 37, 38, 39, 40, 41],
[42, 43, 44, 45, 46, 47, 48],
[49, 50, 51, 52, 53, 54, 55]]])
In [359]: random_block_fill_adv(a, N=3, fillval=0)
Out[359]:
array([[[ 0, 0, 0, 0, 4, 5, 6],
[ 7, 0, 0, 0, 11, 12, 13],
[14, 0, 0, 0, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27]],
[[28, 29, 30, 31, 32, 33, 34],
[35, 36, 37, 38, 0, 0, 0],
[42, 43, 44, 45, 0, 0, 0],
[49, 50, 51, 52, 0, 0, 0]]])
Fun stuff : Being in-place filling, if we keep running random_block_fill_adv(a, N=3, fillval=0), we will eventually end up with all zeros a. Thus, also verifying the code.
Runtime test
In [579]: a = np.random.randint(0,9,(10000,4,4))
In [580]: %timeit random_block_fill_lidx(a, N=2, fillval=0)
...: %timeit random_block_fill_adv(a, N=2, fillval=0)
...: %timeit random_block_fill_loopy(a, N=2, fillval=0)
...:
1000 loops, best of 3: 545 µs per loop
1000 loops, best of 3: 891 µs per loop
100 loops, best of 3: 10.6 ms per loop
In [581]: a = np.random.randint(0,9,(1000,40,40))
In [582]: %timeit random_block_fill_lidx(a, N=10, fillval=0)
...: %timeit random_block_fill_adv(a, N=10, fillval=0)
...: %timeit random_block_fill_loopy(a, N=10, fillval=0)
...:
1000 loops, best of 3: 739 µs per loop
1000 loops, best of 3: 671 µs per loop
1000 loops, best of 3: 1.27 ms per loop
So, which one to choose depends on the first axis length and blocksize.