Recall and precision not working correctly (keras) - tensorflow

I have to build a model in keras. I am really struggling with my actual dataset, hence I am just trying to figure out the basics on a simpler dataset.
model = Sequential([
    Dense(32, input_dim=X_train.shape[1], activation="sigmoid"),
    Dense(2, activation="softmax"),
])
metrics = [
    tf.keras.metrics.TruePositives(name="tp"),
    tf.keras.metrics.TrueNegatives(name="tn"),
    tf.keras.metrics.FalseNegatives(name="fn"),
    tf.keras.metrics.FalsePositives(name="fp"),
    tf.keras.metrics.Recall(name="recall"),
    tf.keras.metrics.Precision(name="precision"),
]
model.compile(loss="categorical_crossentropy", metrics=metrics, optimizer="sgd")
evaluation = model.evaluate(X_test, y_test)
for i, m in enumerate(model.metrics_names):
    print(m, evaluation[i])
This gets printed out:
loss 0.4604386021425058
tp 2965.5
tn 2965.5
fn 531.25
fp 531.25
recall 0.8480753898620605
precision 0.8480753898620605
Something is really strange about these results. I believe it is due to using softmax with two output nodes.
y_train looks something like this:
array([[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.]], dtype=float32)
I tried a sigmoid output instead, but then the whole model breaks down; at least with the softmax the fitting works.
Is there a way to configure recall and precision so they consider one output node as the positive class?

The only solution in your case is to transform the problem into a one-dimensional one, i.e.
Use Dense(1,activation='sigmoid') instead of Dense(2,activation='softmax'); change [0,1] to 0 and [1,0] to 1 as an example.
Use binary_crossentropy instead of categorical_crossentropy.
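For illustration, here is a minimal sketch of that reformulation, reusing the X_train/X_test/y_train/y_test from the question (epoch count and layer sizes are arbitrary):

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Collapse the one-hot targets to a single column; here column 0 ([1, 0])
# is taken as the positive class -- use column 1 if you want the other one.
y_train_bin = y_train[:, 0]
y_test_bin = y_test[:, 0]

model = Sequential([
    Dense(32, input_dim=X_train.shape[1], activation="sigmoid"),
    Dense(1, activation="sigmoid"),   # single output node
])
model.compile(
    loss="binary_crossentropy",
    optimizer="sgd",
    metrics=[
        tf.keras.metrics.Recall(name="recall"),
        tf.keras.metrics.Precision(name="precision"),
    ],
)

model.fit(X_train, y_train_bin, epochs=5, verbose=0)
model.evaluate(X_test, y_test_bin)

With a single sigmoid output, Recall and Precision apply their default 0.5 threshold to that one node, so "positive" now unambiguously means the class you mapped to 1.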
Otherwise, you can implement a special callback to retrieve those metrics (using scikit-learn, like in the example below):
How to get other metrics in Tensorflow 2.0 (not only accuracy)?


What is the meaning of the seed in TensorFlow?

I'm a beginner at TensorFlow, and my book said I should put this code first to produce the same sequence of results.
seed = 3
np.random.seed(seed)
tf.random.set_seed(seed)
I put various values in seed, and the results showed a big difference.
Does it affect the initial value setting of weights?
What role exactly does setting that seed value play?
This is not a term specific to TensorFlow. Almost every programming language has a seed to determine the behavior of its random number generators, and with seeds you can reproduce your results. To see the difference, you need to restart the kernel between runs. For example:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
seed = 3
np.random.seed(seed)
tf.random.set_seed(seed)
model = Sequential()
model.add(Dense(2, activation="relu", input_shape=(28, 28, 1)))
model.add(Dense(1))
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=['accuracy'])
This will give, 1st run:
>> model.get_weights()
[array([[-1.036157 , 0.19191754]], dtype=float32),
array([0., 0.], dtype=float32),
array([[-0.86388123],
[-0.8873284 ]], dtype=float32),
array([0.], dtype=float32)]
If you create the same model for the second time:
>> model.get_weights()
[array([[-0.4151404 , 0.81533253]], dtype=float32),
array([0., 0.], dtype=float32),
array([[ 0.39391458],
[-0.48291653]], dtype=float32),
array([0.], dtype=float32)]
Regarding "produce the same sequence of results": now restart the kernel, run the seeding code, and create the model twice again. The weights of the first and second runs will match the ones above.
If you want to create a model with the exact same weights every time, you need to do:
initializer = tf.keras.initializers.GlorotNormal(seed=seed)
model.add(Dense(2, activation="relu", input_shape = (28,28,1), kernel_initializer=initializer))
model.add(Dense(1, kernel_initializer=initializer))
>> model.get_weights()
[array([[-0.6233864, 1.5557635]], dtype=float32),
array([0., 0.], dtype=float32),
array([[-0.6233864],
[ 1.5557635]], dtype=float32),
array([0.], dtype=float32)]
The weights will be the same no matter how many times you create the model, even after restarting the kernel.
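As a quick sanity check, you can build the model twice with the seeded initializer and compare the weights (a sketch assuming a recent TF 2.x release, where a seeded initializer is deterministic on every call; build_model is just a helper name):

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def build_model(seed=3):
    # Seeding the initializer itself (not only the global seeds) pins the
    # initial weights.
    initializer = tf.keras.initializers.GlorotNormal(seed=seed)
    model = Sequential()
    model.add(Dense(2, activation="relu", input_shape=(28, 28, 1),
                    kernel_initializer=initializer))
    model.add(Dense(1, kernel_initializer=initializer))
    return model

w1 = build_model().get_weights()
w2 = build_model().get_weights()
print(all(np.allclose(a, b) for a, b in zip(w1, w2)))  # expected: True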

LSTM 'recurrent_dropout' with 'relu' yields NaNs

Any non-zero recurrent_dropout yields NaN losses and weights; the latter are either 0 or NaN. This happens for stacked, shallow, stateful, return_sequences = any, with & w/o Bidirectional(), activation='relu', and loss='binary_crossentropy'. NaNs occur within a few batches.
Any fixes? Help's appreciated.
TROUBLESHOOTING ATTEMPTED:
recurrent_dropout=0.2,0.1,0.01,1e-6
kernel_constraint=maxnorm(0.5,axis=0)
recurrent_constraint=maxnorm(0.5,axis=0)
clipnorm=50 (empirically determined), Nadam optimizer
activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
lr=2e-6,2e-5 - no NaNs, weights stable, tested for up to 10 batches
lr=5e-5 - no NaNs, weights stable, for 3 batches - NaNs on batch 4
batch_shape=(32,48,16) - large loss for 2 batches, NaNs on batch 3
NOTE: batch_shape=(32,672,16), 17 calls to train_on_batch per batch
ENVIRONMENT:
Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 via Anaconda
GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
CuDNN 10+, latest Nvidia drivers
ADDITIONAL INFO:
Model divergence is spontaneous, occurring at different train updates even with fixed seeds - Numpy, Random, and TensorFlow random seeds. Furthermore, when first diverging, LSTM layer weights are all normal - only going to NaN later.
Below are, in order: (1) inputs to LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs -- the three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before; "AFTER" = 1 train update after.
BEFORE divergence: (plots omitted)
AT divergence:
## LSTM outputs, flattened, stats
(mean,std) = (inf,nan)
(min,max) = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)
AFTER divergence:
## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., 0.],
[ 0., -0., -0., ..., -0., 0., 0.],
...,
[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., -0.],
[ 0., 0., -0., ..., -0., 0., 0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
MINIMAL REPRODUCIBLE EXAMPLE:
from keras.layers import Input, Dense, LSTM, Dropout
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np

ipt = Input(batch_shape=(32, 672, 16))
x = LSTM(512, activation='relu', return_sequences=False,
         recurrent_dropout=0.3,
         kernel_constraint=maxnorm(0.5, axis=0),
         recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)

optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

for train_update, _ in enumerate(range(100)):
    x = np.random.randn(32, 672, 16)
    y = np.array([1]*5 + [0]*27)
    np.random.shuffle(y)
    loss = model.train_on_batch(x, y)
    print(train_update + 1, loss, np.sum(y))
Observations: the following speed up divergence:
Higher units (LSTM)
Higher # of layers (LSTM)
Higher lr << no divergence when <=1e-4, tested up to 400 trains
Fewer '1' labels << no divergence with the y below (which makes more '1' labels), even with lr=1e-3; tested up to 400 trains
y = np.random.randint(0,2,32) # makes more '1' labels
UPDATE: not fixed in TF2; reproducible also using from tensorflow.keras imports.
Studying the LSTM formulae more deeply and digging into the source code, everything has become crystal clear.
Verdict: recurrent_dropout has nothing to do with it; a thing's being looped where none expect it.
Actual culprit: the activation argument, here 'relu', is applied to the recurrent transformations - contrary to virtually every tutorial that shows it as the harmless 'tanh'.
I.e., activation is not only for the hidden-to-output transform - source code; it operates directly on computing both recurrent states, cell and hidden:
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
Solution(s):
Apply BatchNormalization to LSTM's inputs, especially if the previous layer's outputs are unbounded (ReLU, ELU, etc.) - see the sketch after this list
If previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before activations (use activation=None, then BN, then Activation layer)
Use activation='selu'; more stable, but can still diverge
Use lower lr
Apply gradient clipping
Use fewer timesteps
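As a rough illustration of that first point, the sketch below normalizes the LSTM's inputs coming out of a ReLU Conv1D stack (layer sizes are made up, not taken from the original model):

from tensorflow.keras.layers import Input, Conv1D, BatchNormalization, LSTM, Dense
from tensorflow.keras.models import Model

ipt = Input(shape=(672, 16))

# Conv stack with unbounded (ReLU) outputs, as in the original setup.
x = Conv1D(64, 8, padding='same', activation='relu')(ipt)

# Normalize the LSTM's inputs so it never sees unbounded activations directly;
# keeping activation='tanh' (the default) also keeps the recurrent states bounded.
x = BatchNormalization()(x)
x = LSTM(512, activation='tanh', recurrent_dropout=0.3)(x)

out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(optimizer='nadam', loss='binary_crossentropy')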
More answers, to some remaining questions:
Why was recurrent_dropout suspected? An unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence - which may be explained by it zeroing the non-relu contributions that would otherwise offset multiplicative reinforcement.
Why do nonzero mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating - facilitating large pre-activations, hence large ReLUs.
Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations - whereas a high lr jumps too far too quickly.
Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLU's --> fireworks.
UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, upscaling hidden transformations during training, easing divergent behavior over many timesteps. Git Issue on this here

Is the dropout layer still active in a frozen Keras model (i.e. trainable=False)?

I have two trained models (model_A and model_B), and both of them have dropout layers. I have frozen model_A and model_B and merged them with a new dense layer to get model_AB (but I have not removed model_A's and model_B's dropout layers). model_AB's weights will be non-trainable, except for the added dense layer.
Now my question is: are the dropout layers in model_A and model_B active (i.e. drop neurons) when I am training model_AB?
Short answer: The dropout layers will continue dropping neurons during training, even if you set their trainable property to False.
Long answer: There are two distinct notions in Keras:
Updating the weights and states of a layer: this is controlled using trainable property of that layer, i.e. if you set layer.trainable = False then the weights and internal states of the layer would not be updated.
Behavior of a layer in training and testing phases: as you know, a layer like dropout may behave differently during training and testing phases. The learning phase in Keras is set using keras.backend.set_learning_phase(). For example, when you call model.fit(...) the learning phase is automatically set to 1 (i.e. training), whereas when you use model.predict(...) it is automatically set to 0 (i.e. test). Further, note that a learning phase of 1 (i.e. training) does not necessarily imply updating the weights/states of a layer. You can run your model with a learning phase of 1 (i.e. training phase) and no weights will be updated; the layers will just switch to their training behavior (see this answer for more information). Further, there is another way to set the learning phase for each individual layer, by passing the training=True argument when calling a layer on a tensor (see this answer for more information).
So according to the above points, when you set trainable=False on a dropout layer and use that in training mode (e.g. either by calling model.fit(...), or manually setting learning phase to training like example below), the neurons would still be dropped by the dropout layer.
Here is a reproducible example which illustrates this point:
from keras import layers
from keras import models
from keras import backend as K
import numpy as np
inp = layers.Input(shape=(10,))
out = layers.Dropout(0.5)(inp)
model = models.Model(inp, out)
model.layers[-1].trainable = False # set dropout layer as non-trainable
model.compile(optimizer='adam', loss='mse') # IMPORTANT: we must always compile model after changing `trainable` attribute
# create a custom backend function so that we can control the learning phase
func = K.function(model.inputs + [K.learning_phase()], model.outputs)
x = np.ones((1,10))
# learning phase = 1, i.e. training mode
print(func([x, 1]))
# the output will be:
[array([[2., 2., 2., 0., 0., 2., 2., 2., 0., 0.]], dtype=float32)]
# as you can see some of the neurons have been dropped
# now set learning phase = 0, i.e test mode
print(func([x, 0]))
# the output will be:
[array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)]
# unsurprisingly, no neurons have been dropped in test phase
The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all inputs is unchanged.
Note that the Dropout layer only applies when training is set to True such that no values are dropped during inference. When using model.fit, training will be appropriately set to True automatically, and in other contexts, you can set the kwarg explicitly to True when calling the layer.
(This is in contrast to setting trainable=False for a Dropout layer. trainable does not affect the layer's behavior, as Dropout does not have any variables/weights that can be frozen during training.)
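For completeness, here is a minimal TF2-style sketch of that training kwarg (assuming tensorflow 2.x; shapes are arbitrary):

import numpy as np
import tensorflow as tf

inp = tf.keras.Input(shape=(10,))
out = tf.keras.layers.Dropout(0.5)(inp)
model = tf.keras.Model(inp, out)

x = np.ones((1, 10), dtype="float32")

# Inference default: dropout is inactive and inputs pass through unchanged.
print(model(x, training=False).numpy())

# Forcing training mode: roughly half the units are zeroed and the survivors
# are scaled by 1/(1 - rate) = 2, regardless of any `trainable` setting.
print(model(x, training=True).numpy())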
Check the official doc here.

Using seed to sample in tensorflow-probability

I am trying to use tensorflow-probability and started off with something really simple:
import tensorflow as tf
import tensorflow_probability as tfp
tf.enable_eager_execution()
tfd = tfp.distributions
poiss = tfd.Poisson(0.8)
poiss.sample(2, seed=1)
#> Out: <tf.Tensor: id=3569, shape=(2,), dtype=float32, numpy=array([0., 0.], dtype=float32)>
poiss.sample(2, seed=1)
#> Out: <tf.Tensor: id=3695, shape=(2,), dtype=float32, numpy=array([1., 0.], dtype=float32)>
poiss.sample(2, seed=1)
#> Out: <tf.Tensor: id=3824, shape=(2,), dtype=float32, numpy=array([2., 2.], dtype=float32)>
poiss.sample(2, seed=1)
#> Out: <tf.Tensor: id=3956, shape=(2,), dtype=float32, numpy=array([0., 1.], dtype=float32)>
I was thinking I would get the same results when re-using the same seed, but somehow that's not true.
I also tried without eager execution, but the results still weren't reproducible. Same story if I add something like tf.set_random_seed(12).
I suppose there is something basic I am missing?
For those interested, I am running Python 3.5.2 on Ubuntu 16.04 with
tensorflow-probability==0.5.0
tensorflow==1.12.0
For deterministic output in graph mode, you need to set both the graph random seed (tf.set_random_seed) and the op random seed (seed= in your sample call).
The workings of random samplers in TFv2 are still being sorted out. For now, my best understanding is that you can call tf.set_random_seed prior to each call to a sampler, and pass the sampler a seed=, if you want deterministic output in eager.
This is now cleaner: TFP supports fully deterministic randomness. You can pass a tuple of two ints for seed, or a Tensor of shape (2,), to trigger the deterministic behavior. tfp.random.split_seed is also relevant here.
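A minimal sketch of that stateless-seed style (assuming a recent tensorflow_probability; the particular values drawn will depend on your versions):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
poiss = tfd.Poisson(0.8)

# A seed of shape (2,) triggers stateless, fully reproducible sampling.
print(poiss.sample(2, seed=(1, 2)))
print(poiss.sample(2, seed=(1, 2)))   # identical to the previous draw

# Derive independent sub-seeds from one root seed.
s1, s2 = tfp.random.split_seed((1, 2), n=2)
print(poiss.sample(2, seed=s1))
print(poiss.sample(2, seed=s2))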
Besides setting the seed for sample or sample_chain in mcmc, you might need to set the following as well:
import os
import random
import numpy as np
import tensorflow as tf

seed = 24
os.environ['TF_DETERMINISTIC_OPS'] = 'true'
os.environ['PYTHONHASHSEED'] = f'{seed}'
np.random.seed(seed)
random.seed(seed)
tf.random.set_seed(seed)

tensorflow softmax_cross_entropy code

Since the source code of tf.nn.softmax_cross_entropy_with_logits in gen_nn_ops is hidden, could anyone perhaps explain to me how TensorFlow computes the cross entropy after the softmax? I mean, the softmax might output 0 because of limited precision, which would give rise to a NaN problem in the cross entropy. Does TensorFlow clip the softmax output to bound it?
The implementation of tf.nn.softmax_cross_entropy_with_logits goes down to native C++ code (here is the XLA implementation). The logits are not bounded, and a softmax output of 0 is possible when one of the logits is much bigger than the others. Example:
>>> session.run(tf.nn.softmax([10.0, 50.0, 100.0, 200.0]))
array([ 0., 0., 0., 1.], dtype=float32)
If you wish, you can clip the logits just before the softmax, but it's not recommended because it kills the gradient when the output is large. A better option is to use batch normalization to make the activations more normally distributed.
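To illustrate why computing the cross entropy from the logits avoids the NaN, here is a TF2 sketch of the standard log-sum-exp formulation (the fused kernel is native code, so this shows the math, not its literal implementation):

import tensorflow as tf

logits = tf.constant([[10.0, 50.0, 100.0, 200.0]])
labels = tf.constant([[1.0, 0.0, 0.0, 0.0]])

# Naive route: the softmax underflows to exactly 0, log(0) = -inf, and the
# sum over label-weighted log-probabilities ends up NaN.
naive = -tf.reduce_sum(labels * tf.math.log(tf.nn.softmax(logits)), axis=-1)
print(naive)   # [nan]

# Stable route: stay in log-space, log softmax_i = x_i - logsumexp(x).
log_softmax = logits - tf.reduce_logsumexp(logits, axis=-1, keepdims=True)
stable = -tf.reduce_sum(labels * log_softmax, axis=-1)
print(stable)  # [190.]

# The fused op gives the same finite result without materializing the
# softmax probabilities.
print(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))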