keras batchnorm layer moving average for variance - tensorflow

I've been trying to understand the Keras BatchNorm layer behavior in my Keras NN model. One question I encountered was how the BN layer is calculating the moving average of the 'variance'. My understanding is Keras is using exponential-weighted-average method to calculate the moving average for both mean and variance from the training mini-batches. But regardless of this, after a really large number of epochs, this moving average should approach the mean/variance of the training data set. But in my simple example, the 'variance' moving average is always different from the training data 'variance'. Below is my code and output:
from keras.layers import Input, BatchNormalization
from keras.models import Model
from keras.optimizers import Adam, RMSprop
import numpy as np
X_input = Input(shape=(6,))
X = BatchNormalization(axis=-1)(X_input)
model = Model(inputs=X_input, outputs=X)
model.compile(optimizer=RMSprop(), loss='mean_squared_error')
np.random.seed(3)
train_data = np.random.random((5,6))
train_label = np.random.random((5,6))
model.fit(x=train_data, y=train_label, epochs=10000, batch_size=6, verbose=False)
bn_gamma, bn_beta, bn_mean, bn_var = model.layers[1].get_weights()
train_mean = np.mean(train_data, axis=0)
train_var = np.var(train_data, axis=0)
print("train_mean: {}".format(train_mean))
print("moving_mean: {}".format(bn_mean))
print("train_var: {}".format(train_var))
print("moving_var: {}".format(bn_var))
Below is the output:
train_mean: [0.42588575 0.47785879 0.32170309 0.49151921 0.355046 0.60104636]
moving_mean: [0.4258843 0.47785735 0.32170165 0.49151778 0.35504454 0.60104346]
train_var: [0.03949981 0.05228663 0.04027516 0.02522536 0.10261097 0.0838988 ]
moving_var: [0.04938692 0.06537427 0.05035637 0.03153942 0.12829503 0.10489936]
If you see, the train_mean is the same as the moving average mean of BN layer, but train_var (variance) is not. Can anyone please help here? Thanks.

If you look at the source code of batchnorm, you can see that the unbiased estimator of population variance is used, here is the relevant line:
variance *= sample_size / (sample_size - (1.0 + self.epsilon))
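In your case the sample size is 5, so you should have train_var * 5 / (5 - (1 + epsilon)) == moving_var. With the default epsilon of 1e-3 that factor is about 1.2503, i.e. roughly train_var * 5./4, which matches your output.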
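The correction factor can be checked against the numbers printed in the question. This is a minimal NumPy sketch, assuming the layer's default epsilon of 1e-3 and that all 5 training rows land in a single batch; the arrays are copied from the output above, nothing is retrained.

```python
import numpy as np

# Values copied from the question's printed output.
train_var = np.array([0.03949981, 0.05228663, 0.04027516,
                      0.02522536, 0.10261097, 0.0838988])
moving_var = np.array([0.04938692, 0.06537427, 0.05035637,
                       0.03153942, 0.12829503, 0.10489936])

sample_size = 5   # 5 training rows, batch_size=6, so one batch per epoch
epsilon = 1e-3    # assumption: the BatchNormalization default

# Same rescaling as the quoted source line.
factor = sample_size / (sample_size - (1.0 + epsilon))
predicted_moving_var = train_var * factor

print("factor:", factor)
print("max abs difference:", np.max(np.abs(predicted_moving_var - moving_var)))
```

The remaining difference is far smaller than the gap between train_var and moving_var, which suggests the rescaling accounts for the mismatch.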

Related

XOR problem with 2-2-1 configuration should always predict output accurately?

I am trying to solve the XOR problem using the following code:
import numpy as np
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, Concatenate
from tensorflow.keras.utils import plot_model
from tensorflow.keras.optimizers import SGD, Adam
# input data
x = np.array([[0,0], [0,1], [1,0], [1,1]], 'float32')
y = np.array([[0], [1], [1], [0]], 'float32')
### Model
model = Sequential()
# add layers (architecture)
model.add(Dense(2, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
# compile
model.compile(loss = 'mean_squared_error',
optimizer = SGD(learning_rate = 0.1, momentum=0.8),
metrics = ['accuracy'])
# train
model.fit(x, y, epochs = 25000, batch_size = 1)
# evaluate
ev = model.evaluate(x, y)
I already tested:
using different activation functions in the hidden layer (sigmoid and tanh)
using different learning rates and momentum
Also, I am running with a high number of epochs (25000). Still, it only predicts all outputs accurately a few times. Most of the time the accuracy is equal to 0.5 or 0.75.
I have read that this is the minimum configuration to solve this problem. However, it also seems that the error surface presents a number of regions with local minima.
My question is:
Should I assume that the model is correct and can learn the problem, although sometimes it gets 'stuck' in a local minimum, OR do I still need to improve my model somehow to solve the XOR more accurately and consistently?
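One way to separate "the architecture cannot represent XOR" from "training does not always find a solution" is to set the weights of a 2-2-1 network by hand. The sketch below uses plain NumPy and hand-picked weights (they are an illustration, not something the Keras model is guaranteed to learn) with a ReLU hidden layer and a sigmoid output, as in the question.

```python
import numpy as np

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([0, 1, 1, 0])

# Hand-picked (not trained) weights for a 2-2-1 network.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # hidden layer weights, shape (inputs, units)
b1 = np.array([0.0, -1.0])       # hidden layer biases
W2 = np.array([10.0, -20.0])     # output layer weights
b2 = -5.0                        # output layer bias

hidden = np.maximum(0.0, x @ W1 + b1)    # ReLU hidden layer
logits = hidden @ W2 + b2
probs = 1.0 / (1.0 + np.exp(-logits))    # sigmoid output
pred = (probs > 0.5).astype(int)

print(pred)
```

If the predictions match y, the 2-2-1 configuration can represent XOR, so runs that end at 0.5 or 0.75 accuracy point to the optimization (initialization, dead ReLU units, poor minima) rather than to the model's capacity.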

Why can't I classify my data perfectly on this simple problem using a NN?

I have a set of observations made of 10 features, each of these features being a real number in the interval (0,2). Say I wanted to train a simple neural network to classify whether the average of those features is above or below 1.0.
Unless I'm missing something, a two-layer network with one neuron in each layer should be enough. The activation functions would be a linear one (i.e. no activation function) on the first layer and a sigmoid on the output layer. An example of a NN with this architecture that would work is one that calculates the average on the first layer (i.e. all weights = 0.1 and bias = 0) and assesses whether that is above or below 1.0 in the second layer (i.e. weight = 1.0 and bias = -1.0).
When I implement this using TensorFlow (see code below), I get a very high accuracy quite quickly, but never get to 100% accuracy... I would like some help to understand conceptually why this is the case. I don't see why the backpropagation algorithm does not reach a set of optimal weights (maybe this is related to the loss function I'm using, which has local minima?). I would also like to know whether 100% accuracy is achievable if I use different activations and/or a different loss function.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
X = [np.random.random(10)*2.0 for _ in range(10000)]
X = np.array(X)
y = X.mean(axis=1) >= 1.0
y = y.astype('int')
train_ratio = 0.8
train_len = int(X.shape[0]*train_ratio)
X_train, X_test = X[:train_len,:], X[train_len:,:]
y_train, y_test = y[:train_len], y[train_len:]
def create_classifier(lr = 0.001):
    classifier = tf.keras.Sequential()
    # linear layer (no activation) followed by a single sigmoid unit
    classifier.add(tf.keras.layers.Dense(units=1))
    classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))#, input_shape=input_shape))
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    # a plain list; the original trailing comma turned this into a tuple
    metrics = [tf.keras.metrics.BinaryAccuracy()]
    classifier.compile(optimizer=optimizer, loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), metrics=metrics)
    return classifier
classifier = create_classifier(lr = 0.1)
history = classifier.fit(X_train, y_train, batch_size=1000, validation_split=0.1, epochs=2000)
Ignoring the fact that a neural network is an odd approach for this problem, and answering your specific question - it looks like your learning rate might be too high which could explain the fluctuations around the optimal point.
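As a sanity check on the construction described in the question, the hand-set weights (0.1 for every input, then weight 1.0 and bias -1.0) can be evaluated in plain NumPy. This is only a sketch with a seeded random generator of my choosing. It shows that a perfect set of weights exists for this architecture, but not that gradient descent will land on it.

```python
import numpy as np

rng = np.random.default_rng(0)             # seed chosen arbitrarily
X = rng.random((10000, 10)) * 2.0          # features in [0, 2)
y = (X.mean(axis=1) >= 1.0).astype(int)

# Hand-set weights from the question.
w1, b1 = np.full(10, 0.1), 0.0             # first layer computes the average
w2, b2 = 1.0, -1.0                         # second layer compares it with 1.0

hidden = X @ w1 + b1
logits = w2 * hidden + b2
# sigmoid(logit) >= 0.5 is equivalent to logit >= 0
pred = (logits >= 0.0).astype(int)

accuracy = (pred == y).mean()
print("accuracy:", accuracy)
```

A trained model only has to drift slightly from this decision boundary to misclassify the samples whose average is very close to 1.0, which is consistent with accuracy that is high but short of 100%.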

LSTM training error is very high and relatively unchanging

As a learning exercise, I'm trying to use an LSTM model with the Keras framework to predict the stock market based on multiple data points. The size of my input array is roughly [5000, 100]. Based on other questions on this site and articles online, the approach seems fairly standard: put the data in a numpy array, scale it, reshape it to 3 dimensions for the LSTM, split it into train and test sections, and feed it through the model. Running only the training portion of the model, I am consistently getting loss scores around 400,000,000. This is not changed by altering the batch size, the number of epochs, the number of layers, replacing the normalization with dropout layers, changing the sizes of each layer, or using different optimizers and loss functions. Any idea why the loss is so high and what I can do to fix that? Attached is the code. All advice is greatly appreciated.
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses, optimizers, Model, preprocessing
from keras.utils import plot_model
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
scaler = MinMaxScaler(feature_range=(0, 1))
features_df = pd.read_csv("dataset.csv")
features_np = np.array(features_df)
features_np.astype(np.float64)
scaler.fit_transform(features_np)
num_features=features_np.shape[1]
features = np.reshape(features_np, (features_np.shape[0], 1, features_np.shape[1]))
labels_np = np.array(pd.read_csv("output.csv"))
scaler.fit_transform(labels_np)
test_in = features_np[int(features_np.shape[0] * 0.75):]
test_in = np.reshape(test_in, (test_in.shape[0], 1, test_in.shape[1]))
test_out = labels_np[int(labels_np.shape[0] * 0.75):]
test_out = np.reshape(test_out, (test_out.shape[0], 1, test_out.shape[1]))
inputs = layers.Input(shape=(1, features.shape[2]))
x = layers.LSTM(5000, return_sequences=True)(inputs)
lstm1 = layers.LSTM(1000, return_sequences=True)(x)
norm1 = layers.BatchNormalization()(lstm1)
lstm2 = layers.LSTM(1000, return_sequences=True)(norm1)
lstm3 = layers.LSTM(1000, return_sequences=True)(lstm2)
norm2 = layers.BatchNormalization()(lstm3)
lstm4 = layers.LSTM(1000, return_sequences=True)(norm2)
lstm5 = layers.LSTM(1000)(lstm4)
dense1 = layers.Dense(1000, activation='relu')(lstm5)
dense2 = layers.Dense(1000, activation='sigmoid')(dense1)
outputs = layers.Dense(2)(dense2)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(features, labels_np, epochs=1, batch_size=4)
evaluate = model.evaluate(test_in, test_out, verbose=2)
While I have not solved the error, implementing the Sequential() model and using only two LSTM layers and a Dense layer changed the error: the training error is now very low while testing remains high. This now appears to be a (relatively) simple problem of overfitting rather than the more confusing error of high training loss. Hopefully, this helps anyone having a similar problem.
There are two things I notice and don't understand why you use them. The first is the dense2 layer with sigmoid activation. I don't think sigmoid activation is beneficial when we are trying to solve a regression problem. Can you change that to relu and see what happens? The second is that your output layer has two units. You did not specify it, but I think you are predicting two values from the same inputs. If you are trying to predict just one value, you should change that to
outputs = layers.Dense(1)(dense2)
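Another thing that may be worth checking in the question's preprocessing: both astype and fit_transform return new arrays rather than modifying their input, and the code never assigns those results. If so, the model is trained on unscaled features and labels, which could contribute to a squared-error loss in the hundreds of millions. The NumPy-only sketch below illustrates the pattern; min_max_scale is a hypothetical stand-in for MinMaxScaler(feature_range=(0, 1)).fit_transform, and the numbers are made up.

```python
import numpy as np

def min_max_scale(a):
    """Hypothetical stand-in for MinMaxScaler(feature_range=(0, 1)).fit_transform."""
    a_min = a.min(axis=0)
    a_max = a.max(axis=0)
    return (a - a_min) / (a_max - a_min)

# Made-up data with large raw values, for illustration only.
raw = np.array([[100.0, 20000.0],
                [150.0, 30000.0],
                [200.0, 25000.0]])

# Pattern from the question: the scaled result is computed and then discarded.
min_max_scale(raw)
print(raw.max())                     # still the unscaled maximum

# Keeping the return value gives data in [0, 1].
scaled = min_max_scale(raw)
print(scaled.min(), scaled.max())
```

In the question's code this would correspond to assigning the results, for example features_np = scaler.fit_transform(features_np), and using a separate scaler for the labels so predictions can be mapped back.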

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network where the outputs are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm taking my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with a softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Here below is my code, true_p are the true parameters I use to generate the data and I would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category, and can't really distinguish between the first and the third one. What's worse, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0,1,0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried using other optimizers like Adam or Adagrad playing with the hyper-params, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
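To see why treating the softmax output as logits would produce the behaviour in the question, note that the distribution then applies a second softmax to values that already lie in [0, 1]. The NumPy sketch below (no TFP involved; softmax is a local helper) computes the probabilities the distribution would effectively use for the learned_p reported above, and the largest probability any class can reach under this double softmax.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Layer output reported in the question, interpreted as logits by Categorical.
learned_p = np.array([0.005, 0.989, 0.006])
effective = softmax(learned_p)

# Best case for class 1: the layer outputs exactly [0, 1, 0].
best_case = softmax([0.0, 1.0, 0.0])

print("effective probabilities:", effective)
print("upper bound for one class:", best_case[1])
```

The upper bound is e / (e + 2), below the true value of 0.7 for the second class, so minimizing the negative log-likelihood keeps pushing the layer output toward [0, 1, 0] without ever over-estimating in the distribution's own terms.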

Resnet-50 adversarial training with cleverhans FGSM accuracy stuck at 5%

I am facing a strange problem when adversarially training a resnet-50, and I am not sure whether it's a logical error or a bug somewhere in the code/libraries.
I am adversarially training a resnet-50 that's loaded from Keras, using the FastGradientMethod from cleverhans, and expecting the adversarial accuracy to rise at least above 90% (probably 99.x%). The training algorithm and the training and attack params should be visible in the code.
The problem, as already stated in the title, is that the accuracy is stuck at 5% after training on ~3000 of 39002 training inputs in the first epoch (GermanTrafficSignRecognitionBenchmark, GTSRB).
When training without an adversarial loss function, the accuracy does not get stuck after 3000 samples, but continues to rise to > 0.95 in the first epoch.
When substituting the network with a lenet-5, alexnet or vgg19, the code works as expected, and an accuracy comparable to that of the non-adversarial categorical_crossentropy loss function is achieved. I've also tried running the procedure using solely tf-cpu and different versions of tensorflow; the result is always the same.
Code for obtaining ResNet-50:
def build_resnet50(num_classes, img_size):
    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras import Model
    from tensorflow.keras.layers import Dense, Flatten
    resnet = ResNet50(weights='imagenet', include_top=False, input_shape=img_size)
    x = Flatten(input_shape=resnet.output.shape)(resnet.output)
    x = Dense(1024, activation='sigmoid')(x)
    predictions = Dense(num_classes, activation='softmax', name='pred')(x)
    model = Model(inputs=[resnet.input], outputs=[predictions])
    return model
Training:
def lr_schedule(epoch):
    # decreasing learning rate depending on epoch
    return 0.001 * (0.1 ** int(epoch / 10))
def train_model(model, xtrain, ytrain, xtest, ytest, lr=0.001, batch_size=32,
                epochs=10, result_folder=""):
    from cleverhans.attacks import FastGradientMethod
    from cleverhans.utils_keras import KerasModelWrapper
    import tensorflow as tf
    from tensorflow.keras.optimizers import SGD
    from tensorflow.keras.callbacks import LearningRateScheduler, ModelCheckpoint
    sgd = SGD(lr=lr, decay=1e-6, momentum=0.9, nesterov=True)
    model(model.input)
    wrap = KerasModelWrapper(model)
    sess = tf.compat.v1.keras.backend.get_session()
    fgsm = FastGradientMethod(wrap, sess=sess)
    fgsm_params = {'eps': 0.01,
                   'clip_min': 0.,
                   'clip_max': 1.}
    loss = get_adversarial_loss(model, fgsm, fgsm_params)
    model.compile(loss=loss, optimizer=sgd, metrics=['accuracy'])
    model.fit(xtrain, ytrain,
              batch_size=batch_size,
              validation_data=(xtest, ytest),
              epochs=epochs,
              callbacks=[LearningRateScheduler(lr_schedule)])
Loss-function:
def get_adversarial_loss(model, fgsm, fgsm_params):
    def adv_loss(y, preds):
        import tensorflow as tf
        tf.keras.backend.set_learning_phase(False) #turn off dropout during input gradient calculation, to avoid unconnected gradients
        # Cross-entropy on the legitimate examples
        cross_ent = tf.keras.losses.categorical_crossentropy(y, preds)
        # Generate adversarial examples
        x_adv = fgsm.generate(model.input, **fgsm_params)
        # Consider the attack to be constant
        x_adv = tf.stop_gradient(x_adv)
        # Cross-entropy on the adversarial examples
        preds_adv = model(x_adv)
        cross_ent_adv = tf.keras.losses.categorical_crossentropy(y, preds_adv)
        tf.keras.backend.set_learning_phase(True) #turn back on
        return 0.5 * cross_ent + 0.5 * cross_ent_adv
    return adv_loss
Versions used:
tf+tf-gpu: 1.14.0
keras: 2.3.1
cleverhans: > 3.0.1 - latest version pulled from github
It is a side-effect of the way we estimate the moving averages on BatchNormalization.
The mean and variance of the training data that you used are different from the ones of the dataset used to train the ResNet50. Because the momentum on the BatchNormalization has a default value of 0.99, with only 10 iterations it does not converge quickly enough to the correct values for the moving mean and variance. This is not obvious during training when the learning_phase is 1 because BN uses the mean/variance of the batch. Nevertheless when we set learning_phase to 0, the incorrect mean/variance values which are learned during training significantly affect the accuracy.
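The effect of the momentum can be illustrated with the update rule moving = momentum * moving + (1 - momentum) * batch_statistic. The sketch below is plain Python with made-up numbers (a pretrained moving mean of 0.0 and a new-data batch mean of 1.0 are assumptions for illustration); it shows how much of the new statistic has been absorbed after 10 updates.

```python
def run_moving_average(initial, batch_stat, momentum, num_updates):
    """Apply the exponential moving-average update num_updates times."""
    moving = initial
    for _ in range(num_updates):
        moving = momentum * moving + (1.0 - momentum) * batch_stat
    return moving

# Hypothetical statistics: value stored in the pretrained layer vs. the new data.
pretrained_mean = 0.0
new_data_mean = 1.0

for momentum in (0.99, 0.5):
    moving = run_moving_average(pretrained_mean, new_data_mean, momentum, 10)
    # After n updates the old value still carries a weight of momentum ** n.
    print(momentum, round(moving, 4), "old-value weight:", round(momentum ** 10, 4))
```

With momentum 0.99 the old value still carries about 90% of the weight after 10 updates, while with 0.5 it is almost entirely replaced, which is the motivation for the two options listed below.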
You can fix this problem with one of the following approaches:
More iterations
Reduce the batch size from 32 to 16 (to perform more updates per epoch) and increase the number of epochs from 10 to 250. This way the moving mean and variance will converge to the correct values.
Change the momentum of BatchNormalization
Keep the number of iterations fixed but change the momentum of the BatchNormalization layer to update more aggressively the rolling mean and variance (not recommended for production models).
On the original snippet, add the following code between reading the base_model and defining the new layers:
# ....
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=input_shape)
# PATCH MOMENTUM - START
import json
conf = json.loads(base_model.to_json())
for l in conf['config']['layers']:
    if l['class_name'] == 'BatchNormalization':
        l['config']['momentum'] = 0.5
m = Model.from_config(conf['config'])
for l in base_model.layers:
    m.get_layer(l.name).set_weights(l.get_weights())
base_model = m
# PATCH MOMENTUM - END
x = base_model.output
# ....
I would also recommend trying another hack provided by us here.