How to one hot encode numpy array of arrays using keras to_categorical? - numpy

I have target data y_tr in shape (92107, 49) where each sample (row) is an array of length 49. I would like to loop over y_tr and one hot encode each array. I have been using keras to_categorical though I am facing an Indexerror.
Array in y_tr:
array([ 379, 379, 379, 252, 4166, 391, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0], dtype=int32)
The number of classes to encode each integer index = 10,000. (I do not wish to encode zero values):
onehots=[]
for seq in y_tr[:,1:]:
try:
onehots.append(to_categorical(seq - 1, num_classes=tar_vocab_size, dtype=np.int32))
except IndexError as e:
raise e
76 n = y.shape[0]
77 categorical = np.zeros((n, num_classes), dtype=dtype)
---> 78 categorical[np.arange(n), y] = 1
79 output_shape = input_shape + (num_classes,)
80 categorical = np.reshape(categorical, output_shape)
IndexError: index 28350 is out of bounds for axis 1 with size 10000

Related

CountVectorizer not working in ColumnTransformer

Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a reproduced case:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create a sample data frame
df = pd.DataFrame({
'corpus': ['This is the first document.', 'This document is the second document.', 'And this is the third one.',
'Is this the first document?', 'I have the fourth document'],
'word_length': [27, 37, 26, 27, 26]
})
text_feature = ["corpus"]
count_transformer = CountVectorizer()
# Create the ColumnTransformer
ct = ColumnTransformer(transformers=[
("count", count_transformer, text_feature)],
remainder='passthrough')
ct.fit_transform(df)
The output says:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 5
I tried the code below which does the job but is doesn't scale easily as ColumnTransformer().
np.c_[count_transformer.fit_transform(df["corpus"]).toarray(), df["word_length"].values].
The result is the numpy array below:
array([[ 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 27],
0, 2, 0, 0, 0, 1, 0, 1, 1, 0, 1, 37],
1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 26],
0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 27],
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 26]], dtype=int64)

How to dot product (1,10^{13}) (10^13,1) scipy sparse matrix

Basically what the title entails.
The two matrices are mostly zeros. And the first is 1 x 9999999999999 and the second is 9999999999999 x 1
When I try to do a dot product I get this.
Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
Full traceback </br>
MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
In [31]: imputed.dot(s)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-31-670cfc69d4cf> in <module>
----> 1 imputed.dot(s)
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in dot(self, other)
357
358 """
--> 359 return self * other
360
361 def power(self, n, dtype=None):
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in __mul__(self, other)
478 if self.shape[1] != other.shape[0]:
479 raise ValueError('dimension mismatch')
--> 480 return self._mul_sparse_matrix(other)
481
482 # If it's a list or whatever, treat it like a matrix
~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other)
499
500 major_axis = self._swap((M, N))[0]
--> 501 other = self.__class__(other) # convert to this format
502
503 idx_dtype = get_index_dtype((self.indptr, self.indices,
~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
32 arg1 = arg1.copy()
33 else:
---> 34 arg1 = arg1.asformat(self.format)
35 self._set_self(arg1)
36
~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in asformat(self, format, copy)
320 # Forward the copy kwarg, if it's accepted.
321 try:
--> 322 return convert_method(copy=copy)
323 except TypeError:
324 return convert_method()
~/.local/lib/python3.8/site-packages/scipy/sparse/csc.py in tocsr(self, copy)
135 idx_dtype = get_index_dtype((self.indptr, self.indices),
136 maxval=max(self.nnz, N))
--> 137 indptr = np.empty(M + 1, dtype=idx_dtype)
138 indices = np.empty(self.nnz, dtype=idx_dtype)
139 data = np.empty(self.nnz, dtype=upcast(self.dtype))
MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64
It seems the scipy is trying to create a temp array.
I am using the .dot method that scipy provides.
I am also open to non-scipy solutions.
Thanks!
In [105]: from scipy import sparse
If I make a (100,1) csr matrix:
In [106]: A = sparse.random(100,1,format='csr')
In [107]: A
Out[107]:
<100x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
The data and indices are:
In [109]: A.data
Out[109]: array([0.19060481])
In [110]: A.indices
Out[110]: array([0], dtype=int32)
In [112]: A.indptr
Out[112]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
So even with only 1 nonzero term, one array is large (101).
On the other hand the csc format for the same array has a much smaller storage. But csc with (1,100) shape will look like the csr.
In [113]: Ac = A.tocsc()
In [114]: Ac.indptr
Out[114]: array([0, 1], dtype=int32)
In [115]: Ac.indices
Out[115]: array([88], dtype=int32)
Math, especially matrix products is done with csr/csc formats. So it may be hard to avoid this 80 TB memory use.
Looking at the traceback I see that it's trying to convert other to the format that matches self.
So with A.dot(B), and A is (1,N) csr, the small shape. B is (N,1) csc, also the small shape. But B.tocsr() requires the large (N+1,) shaped indptr.
Let's try an alternative to dot
First 2 matrices:
In [122]: A = sparse.random(1,100, .2,format='csr')
In [123]: B = sparse.random(100,1, .2,format='csc')
In [124]: A
Out[124]:
<1x100 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
In [125]: B
Out[125]:
<100x1 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Column format>
In [126]: A#B
Out[126]:
<1x1 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
In [127]: _.A
Out[127]: array([[1.33661021]])
Their nonzero element indices. Only the ones that match matter.
In [128]: A.indices, B.indices
Out[128]:
(array([16, 20, 23, 28, 30, 37, 39, 40, 43, 49, 54, 59, 61, 63, 67, 70, 74,
91, 94, 99], dtype=int32),
array([ 5, 8, 15, 25, 34, 35, 40, 46, 47, 51, 53, 60, 68, 70, 75, 81, 87,
90, 91, 94], dtype=int32))
equality matrix:
In [129]: mask = A.indices[:,None]==B.indices
In [132]: np.nonzero(mask.any(axis=0))
Out[132]: (array([ 6, 13, 18, 19]),)
In [133]: np.nonzero(mask.any(axis=1))
Out[133]: (array([ 7, 15, 17, 18]),)
The matching indices:
In [139]: A.indices[Out[133]]
Out[139]: array([40, 70, 91, 94], dtype=int32)
In [140]: B.indices[Out[132]]
Out[140]: array([40, 70, 91, 94], dtype=int32)
sum of the corresponding data values matches [127]
In [141]: (A.data[Out[133]]*B.data[Out[132]]).sum()
Out[141]: 1.3366102138511582

Numpy array of u32 to binary bool array

I have an array:
a = np.array([[3701687786, 458299110, 2500872618, 3633119408],
[2377269574, 2599949379, 717229868, 137866584]], dtype=np.uint32)
How do I convert this to an array of shape (2, 128) where 128 represents the binary value for all the numbers in each row (boolean dtype) ?
I can sort of get the bytes by using list comprehension:
b''.join(struct.pack('I', x) for x in a[0])
But this is not quite right. How can I do this using numpy?
You can do bitwise and with the powers of 2:
vals = (a[...,None] & 2**np.arange(32, dtype='uint32')[[None,None,::-1])
out = (vals>0).astype(int).reshape(a.shape[0],-1)
# test
out[0,:32]
Output:
array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 1, 0, 1, 0, 1, 0])

Counting zeros in a rolling - numpy array (including NaNs)

I am trying to find a way of Counting zeros in a rolling using numpy array ?
Using pandas I can get it using:
df['demand'].apply(lambda x: (x == 0).rolling(7).sum()).fillna(0))
or
df['demand'].transform(lambda x: x.rolling(7).apply(lambda x: 7 - np.count _nonzero(x))).fillna(0)
In numpy, using the code from Here
def rolling_window(a, window_size):
shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
print(shape)
strides = (a.strides[0],) + a.strides
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
np.count_nonzero(rolling_window(arr==0, 7), axis=1)
Output:
array([2, 3])
However, I need the first 6 NaNs as well, and fill it with zeros:
Expected output:
array([0, 0, 0, 0, 0, 0, 2, 3])
Think an efficient one would be with 1D convolution -
def sum_occurences_windowed(arr, W):
K = np.ones(W, dtype=int)
out = np.convolve(arr==0,K)[:len(arr)]
out[:W-1] = 0
return out
Sample run -
In [42]: arr
Out[42]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [43]: sum_occurences_windowed(arr,W=7)
Out[43]: array([0, 0, 0, 0, 0, 0, 2, 3])
Timings on varying length arrays and window of 7
Including count_rolling from #Quang Hoang's post.
Using benchit package (few benchmarking tools packaged together; disclaimer: I am its author) to benchmark proposed solutions.
import benchit
funcs = [sum_occurences_windowed, count_rolling]
in_ = {n:(np.random.randint(0,5,(n)),7) for n in [10,20,50,100,200,500,1000,2000,5000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Length')
t.plot(logx=True, save='timings.png')
Extending to generic n-dim arrays
from scipy.ndimage.filters import convolve1d
def sum_occurences_windowed_ndim(arr, W, axis=-1):
K = np.ones(W, dtype=int)
out = convolve1d((arr==0).astype(int),K,axis=axis,origin=-(W//2))
out.swapaxes(axis,0)[:W-1] = 0
return out
So, on a 2D array, for counting along each row, use axis=1 and for cols, axis=0 and so on.
Sample run -
In [155]: np.random.seed(0)
In [156]: a = np.random.randint(0,3,(3,10))
In [157]: a
Out[157]:
array([[0, 1, 0, 1, 1, 2, 0, 2, 0, 0],
[0, 2, 1, 2, 2, 0, 1, 1, 1, 1],
[0, 1, 0, 0, 1, 2, 0, 2, 0, 1]])
In [158]: sum_occurences_windowed_ndim(a, W=7)
Out[158]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
# Verify with earlier 1D solution
In [159]: np.vstack([sum_occurences_windowed(i,7) for i in a])
Out[159]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
Let's test out our original 1D input array -
In [187]: arr
Out[187]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [188]: sum_occurences_windowed_ndim(arr, W=7)
Out[188]: array([0, 0, 0, 0, 0, 0, 2, 3])
I would modify the function as follow:
def count_rolling(a, window_size):
shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
strides = (a.strides[0],) + a.strides
rolling = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
out = np.zeros_like(a)
out[window_size-1:] = (rolling == 0).sum(1)
return out
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
count_rolling(arr,7)
Output:
array([0, 0, 0, 0, 0, 0, 2, 3])

Can't use deployed TF BERT model to get GCloud online predictions from SavedModel: "Bad Request" error

I trained a BERT model based on this notebook.
I export it as a tf SavedModel this way:
def serving_input_fn():
receiver_tensors = {
"input_ids": tf.placeholder(dtype=tf.int32, shape=[1, MAX_SEQ_LENGTH])
}
features = {
"input_ids": receiver_tensors['input_ids'],
"input_mask": 1 - tf.cast(tf.equal(receiver_tensors['input_ids'], 0), dtype=tf.int32),
"segment_ids": tf.zeros(dtype=tf.int32, shape=[1, MAX_SEQ_LENGTH]),
"label_ids": tf.placeholder(tf.int32, [None], name='label_ids')
}
return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)
estimator._export_to_tpu = False
estimator.export_saved_model("export", serving_input_fn)
Then if I try to use the saved model locally it works:
from tensorflow.contrib import predictor
predict_fn = predictor.from_saved_model("export/1575241274/")
print(predict_fn({
"input_ids": [[101, 10468, 99304, 11496, 171, 112, 10176, 22873, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
}))
# {'probabilities': array([[-0.01023898, -4.5866656 ]], dtype=float32), 'labels': 0}
Then I uploaded the SavedModel to a bucket and created a model and a model version on gcloud this way:
gcloud alpha ai-platform versions create v1gpu --model [...] --origin=[...] --python-version=3.5 --runtime-version=1.14 --accelerator=^:^count=1:type=nvidia-tesla-k80 --machine-type n1-highcpu-4
No issue there, the model is deployed and displayed as working in the console.
But if I try to get predictions, as such:
import googleapiclient.discovery
service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/[project_name]/models/[model_name]/versions/v1gpu'
response = service.projects().predict(
name=name,
body={'instances': [{
"input_ids": [[101, 10468, 99304, 11496, 171, 112, 10176, 22873, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
}]}
).execute()
print(response["predictions"])
All I get is the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/googleapiclient/http.py", line 851, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/[project_name]/models/[model_name]/versions/v1gpu:predict?alt=json returned "Bad Request">
I get the same error if I test the model from the gcloud console using the "Test your model with sample input data" feature.
Edit:
The saved_model has a tagset "serve" and signature_def "serving_default".
Output of "saved_model_cli show --dir 1575241274/ --tag_set serve --signature_def serving_default":
The given SavedModel SignatureDef contains the following input(s):
inputs['input_ids'] tensor_info:
dtype: DT_INT32
shape: (1, 128)
name: Placeholder:0
The given SavedModel SignatureDef contains the following output(s):
outputs['labels'] tensor_info:
dtype: DT_INT32
shape: ()
name: loss/Squeeze:0
outputs['probabilities'] tensor_info:
dtype: DT_FLOAT
shape: (1, 2)
name: loss/LogSoftmax:0
Method name is: tensorflow/serving/predict
The body of the request sent to the API has the form:
{"instances": [<instance 1>, <instance 2>, ...]}
As specified in documentation we need something like this:
{
"instances": [
<object>
...
]
}
In this case you have:
{
"instances": [
{
"input_ids":
[ <object> ]
}
...
]
}
You need to replace input_ids to instances:
{
"instances":
[
[101, 10468, 99304, 11496, 171, 112, 10176, 22873, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
}
Note If you can show the saved_model_cli will be great.
Also gcloud local predict command is also a good option for testing.
It depend of the signature of the model. In my case I have the following signature (just keeping the input part):
The given SavedModel SignatureDef contains the following input(s):
inputs['attention_mask'] tensor_info:
dtype: DT_INT32
shape: (-1, 128)
name: serving_default_attention_mask:0
inputs['input_ids'] tensor_info:
dtype: DT_INT32
shape: (-1, 128)
name: serving_default_input_ids:0
inputs['token_type_ids'] tensor_info:
dtype: DT_INT32
shape: (-1, 128)
name: serving_default_token_type_ids:0
and I need to pass data in the following format (in this case 2 examples):
{'instances':
[
{'input_ids': [101, 143, 18267, 15470, 90395, ...],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, .....],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, .....]
},
{'input_ids': [101, 17664, 143, 30728, .........],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, .......],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, ....]
}
]
}
I am using it with a Keras model with Tensorflow 2.2.0
I guess in your case you need (for 2 examples):
{'instances':
[
{'input_ids': [101, 143, 18267, 15470, 90395, ...]},
{'input_ids': [101, 17664, 143, 30728, .........]}
]
}