I am trying to train a transformer model on text data. The task is to predict missing (masked) words, so e.g. the input "How are you ?" is mapped to "How [MASK] you ?":
inputs = [69, 4, 1337, 666] # How [MASK] you ?
targets = [69, 42, 1337, 666] # How are you ?
The problem is that after a few steps, sometimes after a few hundred, sometimes after a few thousand, the loss becomes NaN.
I have tried a model with just 90k parameters as well as one with 10M parameters. The result is always the same.
The code below shows how I instantiate a Seq2SeqTransformer.
Using the debugger does not give me anything in the "Graph Executions" section either; it just stays empty.
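For reference, this is roughly how I enable the dump that feeds the Debugger V2 "Graph Executions" view (a sketch; the logdir path is just a placeholder):

import tensorflow as tf

# Must be called before any model/graph is built; "FULL_HEALTH" records
# NaN/Inf statistics per tensor, which is what Debugger V2 displays.
tf.debugging.experimental.enable_dump_debug_info(
    "/tmp/tfdbg2_logdir",
    tensor_debug_mode="FULL_HEALTH",
    circular_buffer_size=-1,
)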
Any idea what I could be doing wrong here? The learning rate is already rather small so I can't imagine that this is the problem.
model = Seq2SeqTransformer(
    vocab_size=vocab_size,  # <= 2000
    embedding_width=32,
    dropout_rate=0.1,
    encoder_layer=TransformerEncoder(
        num_layers=1, num_attention_heads=2, intermediate_size=64,
        dropout_rate=0.1, intermediate_dropout=0.1, attention_dropout_rate=0.1
    ),
    decoder_layer=TransformerDecoder(
        num_layers=1, num_attention_heads=2, intermediate_size=64,
        dropout_rate=0.1, intermediate_dropout=0.1, attention_dropout_rate=0.1
    )
)

optimizer = Adam(
    learning_rate=TransformerSchedule(
        min_lr=2.5e-6,
        max_lr=1.5e-4,
        warmup_steps=6000,
        warm_steps=30000
    )
)

model.compile(
    optimizer=optimizer,
    loss=SmoothedSparseCategoricalCrossentropy(0.1),
)

model.fit(
    train_data.repeat(),
    steps_per_epoch=1000,
    epochs=500,
    callbacks=callbacks,
    validation_data=valid_data,
    validation_steps=100,
)
Batch Data
Just to verify that the data I present to the model is alright, here is print(a_batch). Since samples get bucketed, they all have the same length, which is also why input_masks is all 1s.
Note: ID of [MASK] is 4.
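For reference, masking is applied roughly like this (a sketch, not the exact pipeline code; the real pipeline also leaves special tokens such as 2 and 3 untouched):

import tensorflow as tf

MASK_ID = 4  # ID of [MASK]

def mask_tokens(targets, mask_rate=0.15):
    # Replace a random ~15% of positions with the [MASK] id;
    # the untouched sequence is kept as the target.
    mask = tf.random.uniform(tf.shape(targets)) < mask_rate
    inputs = tf.where(mask, tf.fill(tf.shape(targets), MASK_ID), targets)
    return inputs, targets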
X
{'inputs': <tf.Tensor: shape=(119, 41), dtype=int32, numpy=
array([[ 2, 192, 214, ..., 525, 7, 3],
[ 2, 57, 964, ..., 15, 7, 3],
[ 2, 4, 191, ..., 646, 7, 3],
...,
[ 2, 430, 29, ..., 675, 4, 3],
[ 2, 101, 45, ..., 15, 7, 3],
[ 2, 421, 11, ..., 15, 4, 3]], dtype=int32)>,
'input_masks': <tf.Tensor: shape=(119, 41), dtype=float32, numpy=
array([[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]], dtype=float32)>,
'targets': <tf.Tensor: shape=(119, 41), dtype=int32, numpy=
array([[ 2, 192, 214, ..., 525, 7, 3],
[ 2, 57, 964, ..., 15, 7, 3],
[ 2, 104, 191, ..., 646, 7, 3],
...,
[ 2, 430, 29, ..., 675, 7, 3],
[ 2, 101, 45, ..., 15, 7, 3],
[ 2, 421, 11, ..., 15, 7, 3]], dtype=int32)>}
Y
tf.Tensor(
[[ 2 192 214 ... 525 7 3]
[ 2 57 964 ... 15 7 3]
[ 2 104 191 ... 646 7 3]
...
[ 2 430 29 ... 675 7 3]
[ 2 101 45 ... 15 7 3]
[ 2 421 11 ... 15 7 3]], shape=(119, 41), dtype=int32)
Related
Given the following ndarray t -
In [26]: t.shape
Out[26]: (3, 3, 2)
In [27]: t
Out[27]:
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 6, 7],
[ 8, 9],
[10, 11]],
[[12, 13],
[14, 15],
[16, 17]]])
this piecewise linear interpolant for the points t[:, 0, 0] can be evaluated at [0, 0.66666667, 1.33333333, 2.] as follows using numpy.interp -
In [38]: x = np.linspace(0, t.shape[0]-1, 4)
In [39]: x
Out[39]: array([0. , 0.66666667, 1.33333333, 2. ])
In [30]: xp = np.arange(t.shape[0])
In [31]: xp
Out[31]: array([0, 1, 2])
In [32]: fp = t[:,0,0]
In [33]: fp
Out[33]: array([ 0, 6, 12])
In [40]: np.interp(x, xp, fp)
Out[40]: array([ 0., 4., 8., 12.])
How can all the interpolants be efficiently calculated and returned together, one for every such 1d slice fp of t -
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 4, 5],
[ 6, 7],
[ 8, 9]],
[[ 8, 9],
[10, 11],
[12, 13]],
[[12, 13],
[14, 15],
[16, 17]]])
As the interpolation is 1d with changing y values, it must be run for each 1d slice of t. It's probably faster to loop explicitly, but neater to use np.apply_along_axis:
import numpy as np

t = np.arange(18).reshape(3, 3, 2)
x = np.linspace(0, t.shape[0] - 1, 4)
xp = np.arange(t.shape[0])

def interfunc(arr):
    """ Function interpolates a 1d array. """
    return np.interp(x, xp, arr)

np.apply_along_axis(interfunc, 0, t)  # apply function along axis 0
""" Result
array([[[ 0., 1.],
[ 2., 3.],
[ 4., 5.]],
[[ 4., 5.],
[ 6., 7.],
[ 8., 9.]],
[[ 8., 9.],
[10., 11.],
[12., 13.]],
[[12., 13.],
[14., 15.],
[16., 17.]]]) """
With explicit loops
result = np.zeros((4, 3, 2))
for c in range(t.shape[1]):
    for p in range(t.shape[2]):
        result[:, c, p] = np.interp(x, xp, t[:, c, p])
On my machine the second option runs in half the time.
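For completeness, because the sample points xp = np.arange(t.shape[0]) are uniformly spaced, the interpolation can also be fully vectorized by computing the linear weights by hand (a sketch):

idx = np.clip(np.floor(x).astype(int), 0, t.shape[0] - 2)  # left neighbour of each x
w = (x - idx)[:, None, None]                               # fractional part, broadcast over the slice axes
result = (1 - w) * t[idx] + w * t[idx + 1]                 # shape (4, 3, 2)

This produces the same (4, 3, 2) result without any Python-level loop over slices.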
Edit to use np.nditer
As the result and the parameter have different shapes, I seem to have to create two np.nditer objects, one for the parameter and one for the result. This is my first attempt at using nditer for anything, so it may be overcomplicated.
def test(t):
    ts = t.shape
    result = np.zeros((ts[0] + 1, ts[1], ts[2]))
    param = np.nditer([t], ['external_loop'], ['readonly'], order='F')
    with np.nditer([result], ['external_loop'], ['writeonly'], order='F') as res:
        for p, r in zip(param, res):
            r[:] = interfunc(p)
    return result
It's slightly slower than the explicit loops and less easy to follow than either of the other solutions.
As requested by @Tis Chris, here is a solution using np.nditer with the multi_index flag, but I prefer the explicit nested for loops above because they are about 10% faster.
In [29]: t = np.arange( 18 ).reshape(3,3,2)
In [30]: ax0old = np.arange(t.shape[0])
In [31]: ax0new = np.linspace(0, t.shape[0]-1, 4)
In [32]: tnew = np.zeros((len(ax0new), t.shape[1], t.shape[2]))
In [33]: it = np.nditer(t[0], flags=['multi_index'])
In [34]: for _ in it:
    ...:     tnew[:, it.multi_index[0], it.multi_index[1]] = np.interp(ax0new, ax0old, t[:, it.multi_index[0], it.multi_index[1]])
    ...:
In [35]: tnew
Out[35]:
array([[[ 0., 1.],
[ 2., 3.],
[ 4., 5.]],
[[ 4., 5.],
[ 6., 7.],
[ 8., 9.]],
[[ 8., 9.],
[10., 11.],
[12., 13.]],
[[12., 13.],
[14., 15.],
[16., 17.]]])
You could try scipy.interpolate.interp1d:
from scipy.interpolate import interp1d
import numpy as np
t = np.array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 6, 7],
[ 8, 9],
[10, 11]],
[[12, 13],
[14, 15],
[16, 17]]])
# for the first slice
f = interp1d(np.arange(t.shape[0]), t[..., 0], axis=0)
# returns a function which you call with values within range np.arange(t.shape[0])
# data used for interpolation
t[..., 0]
>>> array([[ 0, 2, 4],
[ 6, 8, 10],
[12, 14, 16]])
f(1)
>>> array([ 6., 8., 10.])
f(1.5)
>>> array([ 9., 11., 13.])
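Note that interp1d accepts an N-d y directly, so the whole array can be interpolated in a single call (a sketch):

f_all = interp1d(np.arange(t.shape[0]), t, axis=0)
f_all(np.linspace(0, t.shape[0] - 1, 4))  # shape (4, 3, 2), the full expected result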
I have a three-dimensional numpy array of size (5000, 8, 9)
I would like to insert a row of 0s as the first row for each of the 5000 arrays, such that the new shape will be (5000, 9, 9) and the first row will be 0s.
How can I do this elegantly in numpy?
EDIT:
Thanks for the inspiration, Ben. I'm trying but I clearly don't have it yet. Here's an MWE of what I have so far:
>>> import numpy as np
>>> n1 = np.array([[[1,2,3], [4,5,6], [7, 8, 9], [10, 11, 12]], [[3, 2, 1], [4, 3, 2], [5,4,3], [6,5,4]]])
>>> n1
array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]],
[[ 3, 2, 1],
[ 4, 3, 2],
[ 5, 4, 3],
[ 6, 5, 4]]])
>>> proper = np.zeros(((3, 4, 3)))
>>> proper
array([[[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]],
[[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]],
[[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]])
>>> np.insert(proper, n1, axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: insert() takes at least 3 arguments (3 given)
>>> np.insert(proper, 0, n1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../pkgs/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 4435, in insert
new[tuple(slobj)] = values
ValueError: could not broadcast input array from shape (2,4,3) into shape (2)
>>> np.insert(n1, 0, proper)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../pkgs/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 4435, in insert
new[tuple(slobj)] = values
ValueError: could not broadcast input array from shape (3,4,3) into shape (3)
Consider the following:
Let n1 be any array of your required shape:

n1 = np.empty(shape=(5000, 8, 9))
print(n1.shape)

We insert a row of zeros at index 0 along axis 1 (np.insert broadcasts the value, so a scalar 0 would work too):

n2 = np.insert(n1, 0, np.zeros(shape=(1,)), axis=1)
print(n2.shape)

You can verify with

print(n2[0][0])
print(n2[1][0])
Hope it helps.
You can create a new array filled with zeros and insert your original array into it using indexing:
a = np.arange(5000*8*9).reshape(5000,8,9)
b = np.zeros((5000,9,9))
b[:,1:,:] = a
b[0]
>>> array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 2., 3., 4., 5., 6., 7., 8.],
[ 9., 10., 11., 12., 13., 14., 15., 16., 17.],
[18., 19., 20., 21., 22., 23., 24., 25., 26.],
[27., 28., 29., 30., 31., 32., 33., 34., 35.],
[36., 37., 38., 39., 40., 41., 42., 43., 44.],
[45., 46., 47., 48., 49., 50., 51., 52., 53.],
[54., 55., 56., 57., 58., 59., 60., 61., 62.],
[63., 64., 65., 66., 67., 68., 69., 70., 71.]])
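As an aside, np.pad expresses the same operation in a single call (a sketch):

b = np.pad(a, ((0, 0), (1, 0), (0, 0)), mode='constant')  # prepend one zero row along axis 1 -> shape (5000, 9, 9)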
I am trying to use tf.gather_nd to convert

R = tf.eye(3, batch_shape=[4])

to:
array([[[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.]],
[[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.]],
[[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.]],
[[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.]]], dtype=float32)
With the index:
ind = array([[0, 2, 1],
[2, 1, 0],
[1, 2, 0],
[0, 2, 1]], dtype=int32)
I found out that if I convert the index matrix to something like:
ind_c = np.array([[[0, 0], [0, 2], [0, 1]],
[[1, 2], [1, 1], [1, 0]],
[[2, 1], [2, 2], [2, 0]],
[[3, 0], [3, 2], [3, 1]]])
gather_nd will do the job. So my questions are:

1. Is there a better way than converting the index ind to ind_c?
2. If this is the only way, how can I convert ind to ind_c with TensorFlow? (For now I have done it manually.)

Thanks
You can try the following:
ind = tf.constant([[0, 2, 1],[2, 1, 0],[1, 2, 0],[0, 2, 1]], dtype=tf.int32)
# Creates the row indices matrix
row = tf.tile(tf.expand_dims(tf.range(tf.shape(ind)[0]), 1), [1, tf.shape(ind)[1]])
# Concat to the ind to form the index matrix
ind_c = tf.concat([tf.expand_dims(row,-1), tf.expand_dims(ind, -1)], axis=2)
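Then tf.gather_nd(R, ind_c) returns the permuted matrices. As an aside, on newer TF versions (1.14+, if I remember correctly) tf.gather with batch_dims avoids building ind_c altogether (a sketch):

R = tf.eye(3, batch_shape=[4])
result = tf.gather(R, ind, batch_dims=1)  # same result as tf.gather_nd(R, ind_c)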
I used the tf.nn.top_k() function from TensorFlow, with k=5, to visualize the certainty of a model's predictions from its softmax probabilities on 5 new images. I got the output below, which I am not sure how to interpret. Could anyone explain it, please?
TopKV2(values=array([[ 1., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0.]], dtype=float32), indices=array([[13, 0, 1, 2, 3],
[13, 0, 1, 2, 3],
[13, 0, 1, 2, 3],
[26, 0, 1, 2, 3],
[13, 0, 1, 2, 3]], dtype=int32))
From the documentation, it returns two tensors: the first with the top k values and the second with the indices of those values in the original tensor.
So for your data, what I see is that the original tensor is always one-hot: each row has a single 1.0 entry and is 0 everywhere else.
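A tiny example (a sketch) makes the two return values concrete:

probs = tf.constant([[0.1, 0.7, 0.2]])
values, indices = tf.nn.top_k(probs, k=2)
# values  -> [[0.7, 0.2]]  the two largest probabilities per row
# indices -> [[1, 2]]      the class ids they belong to

So each of your rows puts probability 1.0 on a single class (13 or 26); the trailing indices 0, 1, 2, 3 are just ties at 0.0 and carry no information. A softmax this saturated usually means the model is extremely (over)confident on every image.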
There is a minimal example of an RNN in the Skflow documentation. The input data is a matrix with shape (4, 5). Why is the data split for input according to the following function?
def input_fn(X):
    return tf.split(1, 5, X)
This function returns a list of 5 arrays, each with shape (4, 1):
[array([[ 2.],
[ 2.],
[ 3.],
[ 2.]], dtype=float32), array([[ 1.],
[ 2.],
[ 3.],
[ 4.]], dtype=float32), array([[ 2.],
[ 3.],
[ 1.],
[ 5.]], dtype=float32), array([[ 2.],
[ 4.],
[ 2.],
[ 4.]], dtype=float32), array([[ 3.],
[ 5.],
[ 1.],
[ 1.]], dtype=float32)]
And what is the difference/impact on the RNN between the above function and defining it like this (both input functions run)?

def input_fn(X):
    return tf.split(1, 1, X)
Which returns the following:
[[[ 1., 3., 3., 2., 1.],
  [ 2., 3., 4., 5., 6.]]]
The full example is presented here:
def testRNN(self):
    random.seed(42)
    import numpy as np
    data = np.array(list([[2, 1, 2, 2, 3],
                          [2, 2, 3, 4, 5],
                          [3, 3, 1, 2, 1],
                          [2, 4, 5, 4, 1]]), dtype=np.float32)
    # labels for classification
    labels = np.array(list([1, 0, 1, 0]), dtype=np.float32)
    # targets for regression
    targets = np.array(list([10, 16, 10, 16]), dtype=np.float32)
    test_data = np.array(list([[1, 3, 3, 2, 1], [2, 3, 4, 5, 6]]))

    def input_fn(X):
        return tf.split(1, 5, X)

    # Classification
    classifier = skflow.TensorFlowRNNClassifier(
        rnn_size=2, cell_type='lstm', n_classes=2, input_op_fn=input_fn)
    classifier.fit(data, labels)
    classifier.weights_
    classifier.bias_
    predictions = classifier.predict(test_data)
    self.assertAllClose(predictions, np.array([1, 0]))
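For what it's worth, here is how I picture the difference (a sketch using the newer tf.split(value, num_or_size_splits, axis) argument order; skflow's tf.split(1, 5, X) uses the old (axis, num_splits, value) order):

import numpy as np
import tensorflow as tf

X = tf.constant(np.arange(20, dtype=np.float32).reshape(4, 5))
steps_5 = tf.split(X, 5, axis=1)  # 5 tensors of shape (4, 1): 5 time steps with 1 feature each
steps_1 = tf.split(X, 1, axis=1)  # 1 tensor of shape (4, 5): a single time step with 5 features

So with tf.split(1, 5, X) the RNN unrolls over 5 time steps, while with tf.split(1, 1, X) it sees a single step and the recurrence effectively does nothing.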