Any non-zero recurrent_dropout yields NaN losses and weights; the latter are either 0 or NaN. This happens for stacked, shallow, and stateful models, any return_sequences setting, with & without Bidirectional(), with activation='relu' and loss='binary_crossentropy'. The NaNs occur within a few batches.
Any fixes? Help's appreciated.
TROUBLESHOOTING ATTEMPTED:
recurrent_dropout=0.2,0.1,0.01,1e-6
kernel_constraint=maxnorm(0.5,axis=0)
recurrent_constraint=maxnorm(0.5,axis=0)
clipnorm=50 (empirically determined), Nadam optimizer
activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
lr=2e-6,2e-5 - no NaNs, weights stable, tested for up to 10 batches
lr=5e-5 - no NaNs, weights stable, for 3 batches - NaNs on batch 4
batch_shape=(32,48,16) - large loss for 2 batches, NaNs on batch 3
NOTE: batch_shape=(32,672,16), 17 calls to train_on_batch per batch
ENVIRONMENT:
Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 via Anaconda
GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
CuDNN 10+, latest Nvidia drivers
ADDITIONAL INFO:
Model divergence is spontaneous, occurring at different train updates even with fixed seeds - Numpy, Random, and TensorFlow random seeds. Furthermore, when first diverging, LSTM layer weights are all normal - only going to NaN later.
Below are, in order: (1) inputs to the LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs -- the three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before divergence; "AFTER" = 1 train update after divergence.
BEFORE divergence:
AT divergence:
## LSTM outputs, flattened, stats
(mean,std) = (inf,nan)
(min,max) = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)
AFTER divergence:
## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., 0.],
[ 0., -0., -0., ..., -0., 0., 0.],
...,
[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., -0.],
[ 0., 0., -0., ..., -0., 0., 0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
MINIMAL REPRODUCIBLE EXAMPLE:
from keras.layers import Input,Dense,LSTM,Dropout
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np
ipt = Input(batch_shape=(32, 672, 16))
x = LSTM(512, activation='relu', return_sequences=False,
         recurrent_dropout=0.3,
         kernel_constraint=maxnorm(0.5, axis=0),
         recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)

optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

for train_update, _ in enumerate(range(100)):
    x = np.random.randn(32, 672, 16)
    y = np.array([1]*5 + [0]*27)  # 5 positive, 27 negative labels per batch
    np.random.shuffle(y)
    loss = model.train_on_batch(x, y)
    print(train_update + 1, loss, np.sum(y))
Observations: the following speed up divergence:
Higher units (LSTM)
Higher # of layers (LSTM)
Higher lr << no divergence when <=1e-4, tested up to 400 trains
Fewer '1' labels << no divergence with y below, even with lr=1e-3; tested up to 400 trains
y = np.random.randint(0,2,32) # makes more '1' labels
UPDATE: not fixed in TF2; also reproducible using from tensorflow.keras imports.
Studying the LSTM formulae more deeply and digging into the source code, everything has become crystal clear.
Verdict: recurrent_dropout has nothing to do with it; something is being looped where no one expects it.
Actual culprit: the activation argument, here 'relu', is applied to the recurrent transformations - contrary to virtually every tutorial that shows it as the harmless 'tanh'.
I.e., activation is not only for the hidden-to-output transform (see the source code); it operates directly in computing both recurrent states, cell and hidden:
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
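To see why an unbounded activation inside the recurrence is dangerous, here is a toy illustration - not the actual Keras code path; the sizes and the weight scale are arbitrary - of repeatedly passing a hidden state through relu vs. tanh:

import numpy as np

np.random.seed(0)
W_rec = np.random.randn(64, 64) * 0.5           # stand-in recurrent kernel
h_relu = h_tanh = np.random.randn(64) * 0.1     # stand-in hidden state
for t in range(50):                             # 50 "timesteps"
    h_relu = np.maximum(0, W_rec @ h_relu)      # unbounded: can reinforce itself
    h_tanh = np.tanh(W_rec @ h_tanh)            # squashed into [-1, 1]
print(np.abs(h_relu).max(), np.abs(h_tanh).max())  # relu state typically blows up; tanh stays bounded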
Solution(s):
Apply BatchNormalization to the LSTM's inputs, especially if the previous layer's outputs are unbounded (ReLU, ELU, etc.) - see the sketch after this list
If previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before activations (use activation=None, then BN, then Activation layer)
Use activation='selu'; more stable, but can still diverge
Use lower lr
Apply gradient clipping
Use fewer timesteps
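A minimal sketch of the first point, with placeholder Conv1D settings (the filter count and kernel size are arbitrary, not taken from the model above):

from keras.layers import Input, Conv1D, BatchNormalization, LSTM, Dense
from keras.models import Model

ipt = Input(batch_shape=(32, 672, 16))
x = Conv1D(64, 8, padding='same', activation='relu')(ipt)  # unbounded ReLU outputs
x = BatchNormalization()(x)   # normalize before they enter the recurrent loop
x = LSTM(512, activation='relu', recurrent_dropout=0.3)(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile('nadam', loss='binary_crossentropy')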
More answers to some remaining questions:
Why was recurrent_dropout suspected? An unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence - which may be explained by it zeroing the non-relu contributions that would otherwise offset multiplicative reinforcement.
Why do nonzero mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating - facilitating large pre-activations, hence large ReLUs.
Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations - whereas a high lr jumps too far too quickly.
Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLU's --> fireworks.
UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, upscaling hidden transformations during training, easing divergent behavior over many timesteps. Git Issue on this here
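For reference, a toy illustration of the upscaling that inverted dropout performs at train time (the rate and vector are arbitrary):

import numpy as np

rate = 0.3
h = np.ones(10)                                    # stand-in hidden contributions
mask = (np.random.rand(10) >= rate).astype(float)  # drop ~30% of units
h_dropped = h * mask / (1.0 - rate)                # survivors scaled by 1/(1 - rate) ~= 1.43
print(h_dropped)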
Related
I have to build a model in keras. I am really struggling with my actual dataset, hence I am just trying to figure out the basics on a simpler dataset.
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(32, input_dim=X_train.shape[1], activation="sigmoid"),
    Dense(2, activation="softmax"),
])
metrics = [
    tf.keras.metrics.TruePositives(name="tp"),
    tf.keras.metrics.TrueNegatives(name="tn"),
    tf.keras.metrics.FalseNegatives(name="fn"),
    tf.keras.metrics.FalsePositives(name="fp"),
    tf.keras.metrics.Recall(name="recall"),
    tf.keras.metrics.Precision(name="precision"),
]
model.compile(loss="categorical_crossentropy", metrics=metrics, optimizer="sgd")

evaluation = model.evaluate(X_test, y_test)
for i, m in enumerate(model.metrics_names):
    print(m, evaluation[i])
This gets printed out:
loss 0.4604386021425058
tp 2965.5
tn 2965.5
fn 531.25
fp 531.25
recall 0.8480753898620605
precision 0.8480753898620605
Something is really strange about these results. I believe it is due to using softmax with two nodes.
y_train looks something like this:
array([[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.]], dtype=float32)
I tried a sigmoid, but then the whole model breaks down; at least here the fitting works.
Is there a way to configure recall and precision so they consider one output node as Positive?
The only solution in your case is to transform the problem into a one-dimensional one, i.e.
Use Dense(1,activation='sigmoid') instead of Dense(2,activation='softmax'); change [0,1] to 0 and [1,0] to 1 as an example.
Use binary_crossentropy instead of categorical_crossentropy (a sketch combining both changes follows).
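A rough sketch of that reformulation, reusing the layer sizes from the question (X_train, X_test and the one-hot y_train, y_test are assumed to already exist):

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

y_train_1d = y_train[:, 0]   # [1,0] -> 1, [0,1] -> 0
y_test_1d = y_test[:, 0]

model = Sequential([
    Dense(32, input_dim=X_train.shape[1], activation="sigmoid"),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="sgd",
              metrics=[tf.keras.metrics.Precision(name="precision"),
                       tf.keras.metrics.Recall(name="recall")])
model.fit(X_train, y_train_1d, epochs=10, verbose=0)
print(dict(zip(model.metrics_names, model.evaluate(X_test, y_test_1d))))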
Otherwise, you can implement a special callback to retrieve those metrics (using scikit-learn, like in the example below):
How to get other metrics in Tensorflow 2.0 (not only accuracy)?
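A rough sketch of such a callback (the class and argument names are mine, not copied from the linked answer), assuming one-hot targets as in the question:

import numpy as np
import tensorflow as tf
from sklearn.metrics import precision_score, recall_score

class SklearnMetrics(tf.keras.callbacks.Callback):
    def __init__(self, X_val, y_val):
        super().__init__()
        self.X_val, self.y_val = X_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        y_pred = np.argmax(self.model.predict(self.X_val), axis=1)
        y_true = np.argmax(self.y_val, axis=1)
        print("epoch %d: precision=%.3f recall=%.3f"
              % (epoch, precision_score(y_true, y_pred), recall_score(y_true, y_pred)))

# usage: model.fit(X_train, y_train, callbacks=[SklearnMetrics(X_val, y_val)])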
I have the following ndarray (a stack of 351 3x3 matrices):
import numpy as np

tensor = np.ones((351, 3, 3))
b = np.ones((351, 3))
Applying a function such as:
np.linalg.tensorinv(tensor)
np.linalg.tensorsolve(tensor, b)
gives me the following error:
"{LinAlgError}Last 2 dimensions of the array must be square"
Why does that error occur? I mean, the last two dimensions are square (3x3). It does not even work with tensor.T (which is 3x3x351). Thanks for any help.
The sense in which the tensorinv operation defines square dimensions is somewhat unusual. tensorinv takes a parameter ind and a tensor is "square" if the product of the indices up to (but not including) ind and the product of the indices from ind to the last index are equal, i.e. prod(tensor.shape[:ind]) == prod(tensor.shape[ind:]). This is useful for defining inverses of tensor operations or solving tensor contraction equations, but based on the shape of your examples, I expect this isn't what you are trying to do.
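To make that condition concrete, here is a small demonstration in the style of the numpy docs (the shapes are chosen so that the two products match):

import numpy as np

a = np.eye(4 * 6).reshape(4, 6, 8, 3)   # prod((4, 6)) == prod((8, 3)) == 24
ainv = np.linalg.tensorinv(a, ind=2)    # "square" in the tensorinv sense
print(ainv.shape)                       # (8, 3, 4, 6)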
You seem to want to solve 351 different linear systems of equations Ax=b. You should be able to do this with just np.linalg.solve(tensor, b) (though not with the exact arrays in your question, as your tensor of all ones is a stack of singular matrices). Rewriting your example to make tensor smaller and a collection of identity matrices rather than all ones:
>>> temp=np.eye(3)
>>> tensor=np.repeat(temp[np.newaxis,:,:],4,axis=0)
>>> tensor.shape
(4, 3, 3)
>>> b=np.ones((4,3))
>>> b
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
>>> np.linalg.solve(tensor,b)
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
Here is an example of how to solve similar problems, taken from issue #43561.
When I was trying to load the sequential model here using tf.keras.models.load_model in TF 2.3.1, the following error was thrown:
~/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py in _should_skip_first_node(layer)
1031 return (isinstance(layer, Functional) and
1032 # Filter out Sequential models without an input shape.
-> 1033 isinstance(layer._layers[0], input_layer_module.InputLayer))
1034
1035
IndexError: list index out of range
The model is believed to have been trained with Keras under TF 1.9; the model definition can be found here, and here is the code for training.
Here you can find the full stack trace and running code under TF 2.3.1: https://colab.research.google.com/drive/1Lfo0O7D0cM8EtR0h6noqCoWqoqf8bzAD?usp=sharing
Then I downgraded to TF 2.2 and 2.1 with the same code as above; it threw the same error as #35934 Keras Model Errors on Loading - 'list' object has no attribute 'items'.
Then I downgraded to TF 2.0; the code executed indefinitely, and I finally had to stop it manually:
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py in IsMapping(o)
2569
2570 """
-> 2571 return _pywrap_tensorflow_internal.IsMapping(o)
2572
2573 def IsMappingView(o):
KeyboardInterrupt:
Here you can find the full stack trace when I stopped the code under TF 2.0: https://colab.research.google.com/drive/1fCR-ci05NuYhQ8M9O2lRVG0F0YzI9Ggo?usp=sharing
Then I tried to use keras instead of tf.keras with TF 2.3.1 and Keras 2.3.1. First I encountered an error that can be solved as described here: https://github.com/tensorflow/tensorflow/issues/38589#issuecomment-665930503 . Then another error occurred:
~/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py in function(inputs, outputs, updates, name, **kwargs)
3931 if updates:
3932 raise ValueError('`updates` argument is not supported during '
-> 3933 'eager execution. You passed: %s' % (updates,))
3934 from tensorflow.python.keras import models # pylint: disable=g-import-not-at-top
3935 from tensorflow.python.keras.utils import tf_utils # pylint: disable=g-import-not-at-top
ValueError: `updates` argument is not supported during eager execution. You passed: [<tf.Variable 'UnreadVariable' shape=() dtype=int64, numpy=0>, <tf.Variable 'UnreadVariable' shape=(3, 3, 3, 32) dtype=float32, numpy=
array([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
......
The gist is here: https://colab.research.google.com/drive/1OovMHVrMBsIwcwn2PUcgbEXHUfPLMdyM?usp=sharing
So this way fails.
Solutions
One way is to use TF 1.15.4 and Keras 2.3.1, and with that combination it finally worked out fine: inputs, outputs, summary, etc. are all parsed correctly, and data can be run through the model: https://colab.research.google.com/drive/1XaRMeiT1SefS6Q10wsa0y9rEercyFlCR?usp=sharing
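For completeness, the working combination boils down to something like this (the model file name is a placeholder; see the linked notebook for the exact code used):

# pip install tensorflow==1.15.4 keras==2.3.1
from keras.models import load_model

model = load_model("model.h5")   # placeholder path for the model from the issue
model.summary()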
Another is to modify the TF 2.3.1 source code so that the model can be used in the latest version with tensorflow.keras. You have to redefine _should_skip_first_node in the file tensorflow/python/keras/engine/functional.py:
def _should_skip_first_node(layer):
    """Returns True if the first layer node should not be saved or loaded."""
    # Networks that are constructed with an Input layer/shape start with a
    # pre-existing node linking their input to output. This node is excluded from
    # the network config.
    if layer._layers:
        return (isinstance(layer, Functional) and
                # Filter out Sequential models without an input shape.
                isinstance(layer._layers[0], input_layer_module.InputLayer))
    else:
        return isinstance(layer, Functional)
Afterwards
I have submitted PR #43570 to TensorFlow; I hope it will be fixed in future TF versions.
I have a general question about time series forecasting in machine learning. It's not about coding yet, and I'm just trying to understand how I should build the model.
Below is some code I have related to my model:
import tensorflow as tf

def build_model(my_learning_rate, feature_layer):
    model = tf.keras.models.Sequential()
    model.add(feature_layer)
    model.add(tf.keras.layers.Dense(units=64, activation="relu"))
    model.add(tf.keras.layers.Dense(units=1))
    model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=my_learning_rate),
                  loss="mean_squared_error",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model
Here is my feature layer:
<tf.Tensor: shape=(3000, 31), dtype=float32, numpy=
array([[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 1., 0.],
[0., 0., 1., ..., 0., 0., 1.]], dtype=float32)>
The time series forecasting modeling technique I learned recently is totally different than how I have been building the model. The technique involves time windows that use past values (my labels!) as features and the next value as the label. It also involves RNN and LSTM.
Are the way I built the model and the time series forecasting technique fundamentally different, and will they generate different outcomes? Is the way I have been modeling this reasonable, or should I switch to a proper time series forecasting approach?
Yes. LSTM and recurrent layers are usually used for time series, since data from previous timestamps is essential for building a model that makes accurate and precise predictions. For example, when I build time series models, I usually use time-distributed 1-dimensional convolutional layers. Code below:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TimeDistributed, Conv1D, Flatten, LSTM, BatchNormalization, Dense

model = Sequential()
model.add(TimeDistributed(Conv1D(filters=64, kernel_size=1, activation='relu'),
                          input_shape=(None, n_steps, n_features)))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(100, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1))
If you want to implement this yourself, you must reshape the original X array into n_steps (timestamps) and n_features (the number of features in the data); see the sketch below.
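A rough sketch of one way to do that reshaping for the model above; the raw data, window sizes, and the extra subsequence dimension required by the TimeDistributed wrapper are my own assumptions:

import numpy as np

raw = np.random.randn(1000, 8)   # placeholder series: 1000 timesteps, 8 features
n_steps, n_features = 4, 8       # timestamps per subsequence, features per timestamp
n_seq = 2                        # subsequences per sample (consumed by TimeDistributed)
window = n_seq * n_steps         # timesteps per training sample

X, y = [], []
for i in range(len(raw) - window):
    X.append(raw[i:i + window].reshape(n_seq, n_steps, n_features))
    y.append(raw[i + window, 0])  # predict the next value of the first feature
X, y = np.array(X), np.array(y)
print(X.shape, y.shape)           # (992, 2, 4, 8) (992,)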
Hope this helps!
This is my code.
I'm getting a broadcast error and I'm unable to understand why. I have looked at other similar questions, which spoke about problems with dimensions, but I was unable to find the problem. Any help is appreciated.
Thanks in advance.
I have attached the image.
Broadcast error
Both arrays (ns and X_train_grade_encoded) are of the same shape, but there is an error. Why?
So I looked at your notebook image. It is a small png that requires zoom to read. We strongly encourage, some even demand, that you copy-n-paste code and errors. We need to see the problem, right up front, not hidden. Otherwise we are likely to move to the next question.
Broadcast errors usually occur when doing some sort of math on two arrays, or when (my second guess) assigning one array to a slice of another. But this case is a more obscure one: trying to make an object dtype array from (n,4) and (n,300) shaped arrays.
You are doing hstack((ns, array2)). With an ordinary np.hstack that would work and produce an (n, 304) shaped array. But you are using scipy.sparse.hstack. I don't know if that was intentional or a mistake. You haven't hinted that you are working with sparse matrices.
ns probably was constructed from a sparse matrix, since you use toarray(). But it is now a dense (numpy) array.
sparse.hstack is intended for sparse matrices, returning a sparse matrix. I don't know the exact limits on using dense array inputs. I believe it can convert dense to coo sparse and then do its join, but here the error occurred before it got to that step.
This reproduces your error:
In [37]: from scipy import sparse
Trying to use sparse hstack on two dense arrays:
In [38]: sparse.hstack([np.ones((3,4)),np.zeros((3,2))])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-a9d8036b5a44> in <module>
----> 1 sparse.hstack([np.ones((3,4)),np.zeros((3,2))])
/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
463
464 """
--> 465 return bmat([blocks], format=format, dtype=dtype)
466
467
/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
543 """
544
--> 545 blocks = np.asarray(blocks, dtype='object')
546
547 if blocks.ndim != 2:
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not broadcast input array from shape (3,4) into shape (3)
But if we first convert one (even the 2nd) to sparse:
In [39]: sparse.hstack([np.ones((3,4)),sparse.coo_matrix(np.zeros((3,2)))])
Out[39]:
<3x6 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in COOrdinate format>
In [40]: _.A
Out[40]:
array([[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 0., 0.]])
of course the right way to join two dense arrays:
In [41]: np.hstack([np.ones((3,4)),np.zeros((3,2))])
Out[41]:
array([[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 0., 0.]])
The array(...,object) error is a bit obscure; it arises because both arrays are dense and have the same first dimension. It's a known issue in numpy. Since sparse.hstack was intended for use on sparse matrices, its developers can be excused for ignoring this numpy misuse.
===
sparse.vstack does work with dense arrays, with shapes like (3,4) and (5,4), because np.array(..., object) does make a valid object dtype array. But if the shapes match, e.g. (3,4) and (3,4), neither hstack nor vstack work, but the error message is different from yours.
In [66]: sparse.hstack((np.ones((3,2)),np.zeros((3,2))))
...
ValueError: blocks must be 2-D
So we need to take the docs seriously.