View support in CUDA

Does numba+CUDA and/or CuPy support NumPy views on the GPU? Specifically, views with a different dtype, like this:
In [179]: x = np.random.randint(0,100,(10,5), dtype=np.int16)
In [180]: y = x[:,4].view(dtype=np.float16)
In [188]: y[:] = 0
In [189]: x
Out[189]:
array([[97, 14, 75, 42, 0],
[30, 87, 78, 62, 0],
[23, 92, 90, 37, 0],
[15, 12, 58, 36, 0],
[21, 32, 88, 83, 0],
[99, 70, 92, 16, 0],
[ 3, 88, 93, 16, 0],
[52, 32, 24, 15, 0],
[52, 99, 17, 97, 0],
[20, 33, 59, 56, 0]], dtype=int16)
In [191]: y
Out[191]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float16)
I need float16 to work.
numba: https://github.com/numba/numba/issues/4402
CuPy seems to have some support for float16.
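For CuPy, a minimal sketch of the same reinterpretation on the GPU (an untested assumption on my part; whether CuPy accepts a dtype-changing view of a non-contiguous column slice may depend on the version):

import cupy as cp

x = cp.random.randint(0, 100, (10, 5), dtype=cp.int16)  # device array
y = x[:, 4].view(dtype=cp.float16)  # reinterpret the 2-byte int16 column as float16
y[:] = 0  # should write through to the last column of x, as in NumPy
print(x)
print(y)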

Related

Keep getting nan-loss when using Seq2SeqTransformer

I am trying to train a transformer model on text data. The task is to predict missing (masked) words, so e.g. the sentence "How are you ?" becomes the masked input "How [MASK] you ?", with the original sentence as the target:
inputs = [69, 4, 1337, 666] # How [MASK] you ?
targets = [69, 42, 1337, 666] # How are you ?
The problem is that after a few steps, sometimes after a few hundred, sometimes after a few thousand, the loss becomes NaN.
I have tried a model with just 90k parameters and also one with 10M parameters. The result is always the same.
The code below shows how I instantiate a Seq2SeqTransformer.
Using the debugger does not give me anything in the "Graph Executions" section. All I see is this:
Any idea what I could be doing wrong here? The learning rate is already rather small, so I can't imagine that this is the problem.
model = Seq2SeqTransformer(
    vocab_size=vocab_size,  # <= 2000
    embedding_width=32,
    dropout_rate=0.1,
    encoder_layer=TransformerEncoder(
        num_layers=1, num_attention_heads=2, intermediate_size=64,
        dropout_rate=0.1, intermediate_dropout=0.1, attention_dropout_rate=0.1
    ),
    decoder_layer=TransformerDecoder(
        num_layers=1, num_attention_heads=2, intermediate_size=64,
        dropout_rate=0.1, intermediate_dropout=0.1, attention_dropout_rate=0.1
    )
)
optimizer = Adam(
    learning_rate=TransformerSchedule(
        min_lr=2.5e-6,
        max_lr=1.5e-4,
        warmup_steps=6000,
        warm_steps=30000
    )
)
model.compile(
    optimizer=optimizer,
    loss=SmoothedSparseCategoricalCrossentropy(0.1),
)
model.fit(
    train_data.repeat(),
    steps_per_epoch=1000,
    epochs=500,
    callbacks=callbacks,
    validation_data=valid_data,
    validation_steps=100,
)
Batch Data
Just to verify that the data I present to the model is alright, here is the output of print(a_batch). Since samples get bucketed, they all have the same length, which is also why input_masks is all 1s.
Note: ID of [MASK] is 4.
X
{'inputs': <tf.Tensor: shape=(119, 41), dtype=int32, numpy=
array([[ 2, 192, 214, ..., 525, 7, 3],
[ 2, 57, 964, ..., 15, 7, 3],
[ 2, 4, 191, ..., 646, 7, 3],
...,
[ 2, 430, 29, ..., 675, 4, 3],
[ 2, 101, 45, ..., 15, 7, 3],
[ 2, 421, 11, ..., 15, 4, 3]], dtype=int32)>,
'input_masks': <tf.Tensor: shape=(119, 41), dtype=float32, numpy=
array([[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]], dtype=float32)>,
'targets': <tf.Tensor: shape=(119, 41), dtype=int32, numpy=
array([[ 2, 192, 214, ..., 525, 7, 3],
[ 2, 57, 964, ..., 15, 7, 3],
[ 2, 104, 191, ..., 646, 7, 3],
...,
[ 2, 430, 29, ..., 675, 7, 3],
[ 2, 101, 45, ..., 15, 7, 3],
[ 2, 421, 11, ..., 15, 7, 3]], dtype=int32)>}
Y
tf.Tensor(
[[ 2 192 214 ... 525 7 3]
[ 2 57 964 ... 15 7 3]
[ 2 104 191 ... 646 7 3]
...
[ 2 430 29 ... 675 7 3]
[ 2 101 45 ... 15 7 3]
[ 2 421 11 ... 15 7 3]], shape=(119, 41), dtype=int32)
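One generic way to localize where the NaNs first appear in TensorFlow 2.x (a debugging sketch, not specific to this model) is to enable numeric checking before building and fitting the model:

import tensorflow as tf

# Raises an error naming the first op that produces a NaN or Inf,
# instead of letting it propagate silently into the loss.
tf.debugging.enable_check_numerics()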

Fastest way to do vectorized reduce product with boolean mask

I have a 3D numpy array A and 2D numpy boolean mask B.
The first two dimensions of A match those of B.
I'm wondering if there is any fast way to, for each index along the first dimension of A, select entries along the second dimension based on B and take a reduced product over that second dimension.
My expected output C would be a 2D numpy array whose first dimension is the first dimension of A and whose second dimension is the third dimension of A.
My current solution is C = np.prod(A*np.repeat(B[...,np.newaxis], A.shape[-1], 2), 1)
Is there any better alternative?
With a concrete example:
In [364]: A=np.arange(1,25).reshape(2,3,4); B=np.arange(1,7).reshape(2,3)
In [365]: C = np.prod(A*np.repeat(B[...,np.newaxis], A.shape[-1], 2), 1)
That repeat does:
In [366]: np.repeat(B[...,np.newaxis], A.shape[-1], 2)
Out[366]:
array([[[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]],
[[4, 4, 4, 4],
[5, 5, 5, 5],
[6, 6, 6, 6]]])
In [367]: _.shape
Out[367]: (2, 3, 4)
In [368]: A*np.repeat(B[...,np.newaxis], A.shape[-1], 2)
Out[368]:
array([[[ 1, 2, 3, 4],
[ 10, 12, 14, 16],
[ 27, 30, 33, 36]],
[[ 52, 56, 60, 64],
[ 85, 90, 95, 100],
[126, 132, 138, 144]]])
But by broadcasting rules, the repeat is not needed:
In [369]: A*B[...,np.newaxis]
Out[369]:
array([[[ 1, 2, 3, 4],
[ 10, 12, 14, 16],
[ 27, 30, 33, 36]],
[[ 52, 56, 60, 64],
[ 85, 90, 95, 100],
[126, 132, 138, 144]]])
In [371]: np.prod(_369, axis=1)
Out[371]:
array([[ 270, 720, 1386, 2304],
[556920, 665280, 786600, 921600]])
You could apply prod to A and B individually, but I don't know if that makes much of a difference:
In [373]: np.prod(A,1)*np.prod(B,1)[:,None]
Out[373]:
array([[ 270, 720, 1386, 2304],
[556920, 665280, 786600, 921600]])
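If B is genuinely boolean and masked-out entries should be skipped rather than multiplied in as 0, a sketch of one variant (my reading of the question, not part of the answer above) substitutes the multiplicative identity:

import numpy as np

A = np.arange(1, 25).reshape(2, 3, 4)
B = np.array([[True, False, True],
              [False, True, True]])

# Masked-out rows become 1, so they drop out of the product instead of zeroing it.
C = np.prod(np.where(B[..., np.newaxis], A, 1), axis=1)
print(C.shape)  # (2, 4)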

operation of Einstein sum of 3D matrices

The following code indicates that the Einstein sum of two 3D (2x2x2) matrices is a 4D (2x2x2x2) matrix.
$c_{ijlm} = \sum_k a_{ijk} b_{klm}$
$c_{0,0,0,0} = \sum_k a_{0,0,k} b_{k,0,0} = 1 \times 9 + 5 \times 11 = 64$
But $c_{0,0,0,0} = 35$ according to the result below:
>>> a=np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
>>> b=np.array([[[9,10],[11,12]],[[13,14],[15,16]]])
>>> c=np.einsum('ijk,klm->ijlm', a,b)
>>> c
array([[[[ 35, 38],
[ 41, 44]],
[[ 79, 86],
[ 93, 100]]],
[[[123, 134],
[145, 156]],
[[167, 182],
[197, 212]]]])
Could someone explain how the operation is carried out?
The particular element that you are testing, c[0,0,0,0], is calculated with:
In [167]: a[0,0,:]*b[:,0,0]
Out[167]: array([ 9, 26])
In [168]: a[0,0,:]
Out[168]: array([1, 2])
In [169]: b[:,0,0]
Out[169]: array([ 9, 13])
Summing those products gives 1*9 + 2*13 = 9 + 26 = 35. It may be easier to understand if we reshape both arrays to 2d:
In [170]: A=a.reshape(-1,2); B=b.reshape(2,-1)
In [171]: A
Out[171]:
array([[1, 2],
[3, 4],
[5, 6],
[7, 8]])
In [172]: B
Out[172]:
array([[ 9, 10, 11, 12],
[13, 14, 15, 16]])
In [173]: A@B
Out[173]:
array([[ 35, 38, 41, 44],
[ 79, 86, 93, 100],
[123, 134, 145, 156],
[167, 182, 197, 212]])
The same numbers, but in (4,4) instead of (2,2,2,2). It's easier to read the (1,2) and (9,13) off of A and B.
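Putting the two views together, a small check with the same arrays:

import numpy as np

a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
b = np.array([[[9, 10], [11, 12]], [[13, 14], [15, 16]]])

c = np.einsum('ijk,klm->ijlm', a, b)
# c[0,0,0,0] sums a[0,0,:] * b[:,0,0] = 1*9 + 2*13 = 35
print(c[0, 0, 0, 0])                    # 35
print(np.sum(a[0, 0, :] * b[:, 0, 0]))  # 35

# The reshaped matmul gives the same numbers in a (4, 4) layout:
C = a.reshape(-1, 2) @ b.reshape(2, -1)
print(np.array_equal(c, C.reshape(2, 2, 2, 2)))  # True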

How to skip 'for' loop when dealing with numpy arrays

Here is my code:
import numpy as np
>>> x
array([[ 1, 57],
[ 2, 21],
[ 4, 34],
...,
[3348, 29],
[3350, 23],
[3353, 11]])
>>> x.shape
(1310, 2)
>>> pic # greyscale image
array([[223, 222, 225, ..., 217, 219, 214],
[224, 222, 219, ..., 220, 219, 216],
[223, 224, 220, ..., 219, 215, 213],
...,
[228, 226, 231, ..., 224, 228, 229],
[229, 227, 227, ..., 216, 225, 227],
[226, 228, 225, ..., 218, 225, 230]], dtype=uint8)
pic = np.stack((pic,pic,pic), axis=2)
>>> pic.shape
(2208, 2752, 3)
>>>labels.shape
(2208, 2752)
color = [0, 0, 255]
for i in x:
    B = np.full((i[1], 3), color).astype('int')
    pic[labels == i[0]] = B
It colors blue (RGB 0, 0, 255) all the pixels of the grayscale image pic that satisfy the condition labels == i[0]. Now, this is very slow because of the for loop (for i in x).
Is there any efficient 'NumPy way' that avoids the for loop and would therefore be much faster? Thanks for your kind help!
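Since the color is the same for every label listed in x[:, 0], one common vectorization (a sketch under that assumption, using stand-in data in place of the real image) builds a single boolean mask with np.isin and assigns once:

import numpy as np

# Stand-in data shaped like the question's arrays (hypothetical values).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3400, size=(2208, 2752))
pic = np.stack([rng.integers(0, 256, size=(2208, 2752), dtype=np.uint8)] * 3, axis=2)
x = np.array([[1, 57], [2, 21], [4, 34], [3348, 29], [3350, 23], [3353, 11]])
color = [0, 0, 255]

# One mask covering every label in x[:, 0]; the per-label pixel counts in
# x[:, 1] are not needed because the color is constant.
mask = np.isin(labels, x[:, 0])
pic[mask] = color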

How to multiply an NxN matrix H with an Nx1 array t = [t1,t2,...,tN] such that H*t = [H*t1,H*t2,...,H*tN] using numpy?

I know how to do this with for loops, but is there a way to use numpy arrays and their operations to achieve this type of multiplication?
You can use np.multiply.outer:
>>> import numpy as np
>>>
>>> H = np.arange(9).reshape(3, 3)
>>> t = np.c_[10:40:10]
>>>
>>> H
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> t
array([[10],
[20],
[30]])
>>>
>>> np.multiply.outer(t.ravel(), H)
array([[[ 0, 10, 20],
[ 30, 40, 50],
[ 60, 70, 80]],
[[ 0, 20, 40],
[ 60, 80, 100],
[120, 140, 160]],
[[ 0, 30, 60],
[ 90, 120, 150],
[180, 210, 240]]])
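The same stack can also be had with plain broadcasting, reshaping t to (N, 1, 1) so each scalar scales a full copy of H:

>>> out = t.reshape(-1, 1, 1) * H
>>> np.array_equal(out, np.multiply.outer(t.ravel(), H))
True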