Error comparing SGD+momentum vs. SGD on MNIST - optimization

I'm working on a toy project to compare the performance of SGD and SGD+momentum optimizers on MNIST data. To do this, I have created 2 cell blocks, one for SGD:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0)

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
Loss after 10 epochs for SGD : loss: 0.674
I've then created another cell for SGD+momentum:
sgd_momentum = optim.SGD(net.parameters(), lr=0.001, momentum=0.7)

for epoch in range(10):
    # same training loop as for SGD, just using the sgd_momentum optimizer
The problem I'm facing is that SGD+momentum is trying to optimize from the point where SGD left off. This is the loss for the first minibatch, epoch 1:
[1, 2000] loss: 0.506
How do I ensure that SGD+momentum starts from the original, untrained loss? I cannot figure out why this happens.

In PyTorch, once optimizer.step() has been called, the updates are applied to the network's weights in place, so a second optimizer built on the same net continues from wherever the previous training left off.
Re-creating the network (e.g. net = Net()) before building the second optimizer did the trick, as it reinitializes the weights.
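A minimal sketch of that fix, assuming the network class used above is called Net (the class name is not shown in the question):

import torch.nn as nn
import torch.optim as optim

# Rebuild the network so the momentum run starts from fresh, untrained weights
net = Net()  # hypothetical name; use whatever class built the original net

criterion = nn.CrossEntropyLoss()
sgd_momentum = optim.SGD(net.parameters(), lr=0.001, momentum=0.7)

# ...then run exactly the same training loop as in the SGD cell,
# calling sgd_momentum.zero_grad() and sgd_momentum.step()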

Related

eager mode and keras.fit have different results

I am trying to convert model.fit() in Keras to eager-mode training. The model is an autoencoder with one encoder and two decoders, and the decoders have different loss functions. The loss functions for the decoders are the same in the eager model and in model.fit(); I tried to set everything up exactly as in model.fit(), but the resulting losses are different. I would really appreciate some help.
The link for google colab: https://colab.research.google.com/drive/1XNOwJ9oVgs1z9qqXIs_ldnKuSm3Dn2Ud?usp=sharing
The definition and training of the model are shown below; I use model.fit() for training. The output at the end shows the loss values.
def fit_ae(x_unlab, p_m, alpha, parameters):
    # Parameters
    _, dim = x_unlab.shape
    epochs = parameters['epochs']
    batch_size = parameters['batch_size']

    # Build model
    inputs = contrib_layers.Input(shape=(dim,))
    # Encoder
    h = contrib_layers.Dense(int(256), activation='relu', name='encoder1')(inputs)
    h = contrib_layers.Dense(int(128), activation='relu', name='encoder2')(h)
    h = contrib_layers.Dense(int(26), activation='relu', name='encoder3')(h)
    # Mask estimator
    output_1 = contrib_layers.Dense(dim, activation='sigmoid', name='mask')(h)
    # Feature estimator
    output_2 = contrib_layers.Dense(dim, activation='sigmoid', name='feature')(h)
    # Projection network
    model = Model(inputs=inputs, outputs=[output_1, output_2])

    model.compile(optimizer='rmsprop',
                  loss={'mask': 'binary_crossentropy',
                        'feature': 'mean_squared_error'},
                  loss_weights={'mask': 1, 'feature': alpha})

    m_unlab = mask_generator(p_m, x_unlab)
    m_label, x_tilde = pretext_generator(m_unlab, x_unlab)

    # Fit model on unlabeled data
    model.fit(x_tilde, {'mask': m_label, 'feature': x_unlab}, epochs=epochs, batch_size=batch_size)
########### OUTPUT
Epoch 1/15
4/4 [==============================] - 1s 32ms/step - loss: 1.0894 - mask_loss: 0.6560 - feature_loss: 0.2167
Epoch 2/15
4/4 [==============================] - 0s 23ms/step - loss: 0.6923 - mask_loss: 0.4336 - feature_loss: 0.1293
Epoch 3/15
4/4 [==============================] - 0s 26ms/step - loss: 0.4720 - mask_loss: 0.3022 - feature_loss: 0.0849
Epoch 4/15
4/4 [==============================] - 0s 23ms/step - loss: 0.4054 - mask_loss: 0.2581 - feature_loss: 0.0736
In the following code, I implemented the same model in eager mode. The optimizer and loss functions are set the same as in the code above, and both models are trained on the same data.
###################################################### MODEL AUTOENCODER ============================================
def eager_ae(x_unlab, p_m, alpha, parameters):
    # import pdb; pdb.set_trace()
    _, dim = x_unlab.shape
    epochs = parameters['epochs']
    batch_size = parameters['batch_size']

    E = keras.Sequential([
        Input(shape=[dim,]),
        Dense(256, activation='relu'),
        Dense(128, activation='relu'),
        Dense(26, activation='relu'),
    ])
    # Mask estimator
    output_1 = keras.Sequential([
        Dense(dim, activation='sigmoid'),
    ])
    # Feature estimator
    output_2 = keras.Sequential([
        Dense(dim, activation='sigmoid'),
    ])

    optimizer = tf.keras.optimizers.RMSprop()
    loss_mask = tf.keras.losses.BinaryCrossentropy()
    loss_feature = tf.keras.losses.MeanSquaredError()

    # Generate corrupted samples
    m_unlab = mask_generator(p_m, x_unlab)
    m_label, x_tilde = pretext_generator(m_unlab, x_unlab)

    for epoch in range(epochs):
        loss_metric = tf.keras.metrics.Mean(name='train_loss')
        len_batch = range(int(x_unlab.shape[0] / batch_size))
        for i in len_batch:
            samples = x_tilde[i * batch_size:(i + 1) * batch_size]
            mask = m_label[i * batch_size:(i + 1) * batch_size]
            # train_step(samples, tgt)
            with tf.GradientTape() as tape:
                latent = E(samples, training=True)
                out_mask = output_1(latent)
                out_feat = output_2(latent)
                # import pdb; pdb.set_trace()
                lm = loss_mask(out_mask, tf.Variable(mask, dtype=tf.float32))
                lf = loss_feature(out_feat, tf.Variable(samples, dtype=tf.float32))
                pred_loss = lm + alpha * lf
            trainable_vars = E.trainable_weights + output_1.trainable_weights + output_2.trainable_weights
            grads = tape.gradient(pred_loss, trainable_vars)
            optimizer.apply_gradients(zip(grads, trainable_vars))
            loss_metric.update_state(pred_loss)
        print(f'Epoch {epoch}, Loss {loss_metric.result()}')
    return E
############# OUTPUT
Epoch 0, Loss 7.902271747589111
Epoch 1, Loss 5.336598873138428
Epoch 2, Loss 2.880791664123535
Epoch 3, Loss 1.9296690225601196
Epoch 4, Loss 1.6377944946289062
Epoch 5, Loss 1.5342860221862793
Epoch 6, Loss 1.5015968084335327
Epoch 7, Loss 1.4912563562393188
The total loss in the first code ends up well below one (≈0.25), while the total loss in the second code stays above 1 (≈1.3). I cannot find the issue in my second implementation (the second code).

constant loss values with normal CNNs and transfer learning

I am working on the dataset given in the paper https://arxiv.org/ftp/arxiv/papers/1511/1511.02459.pdf
In this paper, a dataset of images (portraits of people) is labeled with a floating-point number between 1 and 5 (1 ugly, 5 good looking). I wanted to work on this dataset and use MobileNetV2 with transfer learning (pretrained on ImageNet) in Tensorflow 2.4.0-dev20201009 with CUDA 11.1 on my RTX 3070 8gb. I don't really see my mistake, but training my model often yields a constant validation loss, for example:
78/78 [==============================] - ETA: 0s - loss: 52145660442.33472020-11-20 13:19:36.796481: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:596] layout failed: Invalid argument: Size of values 2 does not match size of permutation 4 # fanin shape insequential/dense/BiasAdd-0-TransposeNHWCToNCHW-LayoutOptimizer
78/78 [==============================] - 16s 70ms/step - loss: 51654522711.5709 - val_loss: 9.5415
Epoch 2/300
78/78 [==============================] - 4s 52ms/step - loss: 9.4870 - val_loss: 9.5415
Epoch 3/300
78/78 [==============================] - 4s 52ms/step - loss: 9.3986 - val_loss: 9.5415
Epoch 4/300
78/78 [==============================] - 4s 51ms/step - loss: 9.4950 - val_loss: 9.5415
Epoch 5/300
78/78 [==============================] - 4s 52ms/step - loss: 9.4076 - val_loss: 9.5415
Epoch 6/300
78/78 [==============================] - 4s 52ms/step - loss: 9.4993 - val_loss: 9.5415
Epoch 7/300
78/78 [==============================] - 4s 52ms/step - loss: 9.3758 - val_loss: 9.5415
...
The validation loss would remain constant for 300 epochs. My code can be found here below. Let me summarize:
I used transfer-learning from Imagenet and froze the convolutional base of MobileNetV2.
I added a dense layer as the classifier and 1 output neuron. The loss function I used is MSE. The optimizer in the code is SGD; I also tried Adam, which could also yield constant loss values on the validation set.
The above behavior (constant val_loss) also occurs with different learning rates and with Adam. Sometimes the same learning rate yields a non-constant, reasonable loss; I assume this is due to the random weight initialization of the dense layers in my classifier. I even tried absurd learning rates like 10, and the values are still constant. With an lr that high, changes should clearly be visible, but they are not. What is wrong?
My code:
import os
from typing import Dict, Any
from PIL import Image
from sklearn.model_selection import GridSearchCV
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras import layers
from tensorflow import keras
import matplotlib.pyplot as plt
import pickle
import numpy as np
import cv2
import random

# method to create the model
def create_model(IMG_SIZE, lr):
    # Limit memory usage of GPU
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            tf.config.experimental.set_virtual_device_configuration(gpus[0], [
                tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024*7)])
        except RuntimeError as e:
            print(e)

    model = keras.Sequential()
    model.add(MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False))
    model.layers[0].trainable = False
    model.add(layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dropout(0.8))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(1, activation="relu"))

    # use adam or sgd as optimizers
    adam = tf.keras.optimizers.Adam(learning_rate=lr, beta_1=0.9, beta_2=0.98,
                                    epsilon=1e-9)
    sgd = tf.keras.optimizers.SGD(lr=lr, decay=1e-6, momentum=0.5)

    model.compile(optimizer=sgd,
                  loss=tf.losses.mean_squared_error,
                  )
    model.summary()
    return model

# preprocessing
def loadImages(IMG_SIZE):
    path = os.path.join(os.getcwd(), 'data\\Images')
    training_data = []
    labelMap = getLabelMap()
    for img in os.listdir(path):
        out_array = np.zeros((350, 350, 3), np.float32)  # original size of images in the dataset
        try:
            img_array = cv2.imread(os.path.join(path, img))
            img_array = img_array.astype('float32')  # cast to float to prevent normalization errors
            out_array = cv2.normalize(img_array, out_array, 0, 1, cv2.NORM_MINMAX)  # normalize image
            out_array = cv2.resize(out_array, (IMG_SIZE, IMG_SIZE))  # resize, bc we need 224x224 for Imagenet pretrained weights
            training_data.append([out_array, float(labelMap[img])])
        except Exception as e:
            pass
    return training_data

# preprocessing, the txt file All_labels.txt has lines of the form 'filename.jpg 3.2' and 3.2 is the label
def getLabelMap():
    map = {}
    path = os.getcwd()
    path = os.path.join(path, "data\\train_test_files\\All_labels.txt")
    f = open(path, "r")
    for line in f:
        line = line.split()
        map[line[0]] = line[1]
    f.close()
    return map

# not important, in case you want to see the images after preprocessing
def showimg(image):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    plt.imshow(image)
    plt.show()

# pickle the preprocessed data
def pickle_it(training_set, IMG_SIZE):
    X = []
    Y = []
    for features, label in training_set:
        X.append(features)
        Y.append(label)
    X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 3)
    Y = np.array(Y)
    pickle_out = open("X.pickle", "wb")
    pickle.dump(X, pickle_out)
    pickle_out.close()
    pickle_out = open("Y.pickle", "wb")
    pickle.dump(Y, pickle_out)
    pickle_out.close()

# for prediction after training the model
def betterThan(y, Y):
    Z = np.sort(Y)
    cnt = 0
    for z in Z:
        if z > y:
            break
        else:
            cnt = cnt + 1
    return float(cnt / len(Y))

# for prediction after training the model
def predictImage(image, model, Y):
    img_array = cv2.imread(image)
    img_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
    img_array = np.array(img_array).reshape(-1, IMG_SIZE, IMG_SIZE, 3)
    y = model.predict(img_array)
    per = betterThan(y, Y)
    print('You look better than ' + str(per) + '% of the dataset')

# Main/Driver section
# Preprocessing
IMG_SIZE = 224
training_set = []
training_set = loadImages(IMG_SIZE)
random.shuffle(training_set)
pickle_it(training_set, IMG_SIZE)  # I pickle my data, so that I don't always have to go through the preprocessing

# Load preprocessed data
X = pickle.load(open("X.pickle", "rb"))
Y = pickle.load(open("Y.pickle", "rb"))

# Just to check that the images look correct
showimg(X[0])

# define the grid search parameters, feel free to edit the grids
batch_size = [64]
epochsGrid = [300]
learning_rate = [0.1]

# save models and best parameters found in grid search
size_histories = {}
min_val_loss = 10
best_para = {}

# ignore this, used for bugs on my gpu... You possibly don't need this
config = tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True))
sess = tf.compat.v1.Session(config=config)

# grid search, training the model
for epochs in epochsGrid:
    for batch in batch_size:
        for lr in learning_rate:
            model = create_model(IMG_SIZE, lr)
            model_name = str(epochs) + '_' + str(batch) + '_' + str(lr)
            # train the model with the given hyperparameters
            size_histories[model_name] = model.fit(X, Y, batch_size=batch, epochs=epochs, validation_split=0.1)
            # save model with the best loss value
            if min(size_histories[model_name].history['val_loss']) < min_val_loss:
                min_val_loss = min(size_histories[model_name].history['val_loss'])
                best_para['epoch'] = epochs
                best_para['batch'] = batch
                best_para['lr'] = lr
                model.save('savedModel')

# If you want to make a prediction
model = tf.keras.models.load_model("savedModel")
image = os.path.join(os.getcwd(), 'data\\otherImages\\beautifulWomen.jpg')
predictImage(image, model, Y)
EDIT:
I have found the issue: it is the 'relu' in the output neuron. When I change my loss from RMSE to MAPE, I see that I get a 100 percent error on validation. I assume this is because all my validation data is output as 0, which is only possible when the value in the output neuron before the 'relu' is negative. I don't know why this is the case, but removing the 'relu' yields better training.
Does anyone know why 'relu' causes this problem in regression problems?
If this is your last layer
model.add(layers.Dense(1, activation="relu"))
then your model's final output is y if y > 0, else 0. In the untrained state, your model could very well have y pinned to something like -17 or +17 with roughly equal chance. In the case of -17, the relu converts that to 0 and also sets the gradient to 0, which means the network doesn't learn: the network learns nothing through any part of the network where a relu unit outputs 0. In the case of the layer before
model.add(layers.Dense(128, activation="relu"))
there will be a really good chance that about half of the units will fire with a positive value and so they learn, so that layer is fine.
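As a quick sanity check of that claim (my own sketch, not part of the original answer), the gradient of relu really is zero for a negative pre-activation, so nothing feeding that unit gets a learning signal:

import tensorflow as tf

# relu passes gradient 1 for positive inputs and gradient 0 for negative inputs
x = tf.Variable([-17.0, 17.0])
with tf.GradientTape() as tape:
    y = tf.nn.relu(x)
print(tape.gradient(y, x).numpy())  # [0. 1.] -- the -17 unit is "dead"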
What can be done in the case of a bad initialization, or after training reaches a bad state in which the output of that last layer is pushed below 0? Well, what if we just don't use relu? What activation should we use instead? None! Let's look at what that would be:
1: model = keras.Sequential()
2: model.add(MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False))
3: model.layers[0].trainable = False
4: model.add(layers.GlobalAveragePooling2D())
5: model.add(tf.keras.layers.Dropout(0.8))
6: model.add(layers.Dense(128, activation="relu"))
7: model.add(layers.Dense(1))
Lines 1-6 are all the same. It is important to note that the output of line 6 passes through the non-linear relu activation, so there is the capability to learn non-linearities. Line 7, without an activation function, is a linear combination of line 6, with full ability to generate gradients in both the positive and negative output regions. When backprop is applied to learn the target values of 1 to 5, if the network outputs -17, it can learn to output a larger number. Yeah!
If you'd like to have 2 layers of nonlinearity, I'd suggest the following
1: model = keras.Sequential()
2: model.add(MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False))
3: model.layers[0].trainable = False
4: model.add(layers.GlobalAveragePooling2D())
5: model.add(layers.Dense(128, activation="tanh"))
6: model.add(layers.Dense(64, activation="tanh"))
7: model.add(layers.Dense(1))
Ditch the dropout unless you have actual proof that it helps in this very specific network (and right now I suspect you don't). Try tanh as your hidden-layer activation function. It has some nice features: it is both positive and negative, it still has a gradient for large and/or negative inputs, and it acts somewhat like an automatic weight regularizer. But, importantly, the last layer has no activation function.

Tensorflow Probability returns unstable Predictions

I'm using a TensorFlow Probability model. Of course the outcome is probabilistic, and the derivative of the error does not go to zero (otherwise the model would be deterministic). The prediction is not stable, because the derivative of the loss varies over a range, say from 1.2 to 0.2 in a convex optimization, as an example.
This interval generates a different prediction each time the model is trained. Sometimes I get an excellent fit (red=real, blue lines=predicted +2 std deviation and -2 std deviation):
Sometimes not, with the same hyper-parameters:
Sometimes mirrored:
For business purposes this is quite problematic, since a prediction is expected to produce a stable output.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

np.random.seed(42)
dataframe = pd.read_csv('Apple_Data_300.csv').ix[0:800, :]
dataframe.head()

plt.plot(range(0, dataframe.shape[0]), dataframe.iloc[:, 1])

x1 = np.array(dataframe.iloc[:, 1] + np.random.randn(dataframe.shape[0])).astype(np.float32).reshape(-1, 1)
y = np.array(dataframe.iloc[:, 1]).T.astype(np.float32).reshape(-1, 1)

tfd = tfp.distributions

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, kernel_initializer='glorot_uniform'),
    tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1)),
    tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1)),
    tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1))
])

negloglik = lambda x, rv_x: -rv_x.log_prob(x)

model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.0001), loss=negloglik)
model.fit(x1, y, epochs=500, verbose=True)

yhat = model(x1)
mean = yhat.mean()

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    mm = sess.run(mean)
    mean = yhat.mean()
    stddev = yhat.stddev()
    mean_plus_2_std = sess.run(mean - 2. * stddev)
    mean_minus_2_std = sess.run(mean + 2. * stddev)

plt.figure(figsize=(8, 6))
plt.plot(y, color='red', linewidth=1)
# plt.plot(mm)
plt.plot(mean_minus_2_std, color='blue', linewidth=1)
plt.plot(mean_plus_2_std, color='blue', linewidth=1)
Loss:
Epoch 498/500
801/801 [==============================] - 0s 32us/sample - loss: 2.4169
Epoch 499/500
801/801 [==============================] - 0s 30us/sample - loss: 2.4078
Epoch 500/500
801/801 [==============================] - 0s 31us/sample - loss: 2.3944
Is there a way to control the prediction output for a probabilistic model? The loss stops at 1.42, even when decreasing the learning rate and increasing the number of training epochs. What am I missing here?
WORKING CODE AFTER ANSWER:
init = tf.global_variables_initializer()

with tf.Session() as sess:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, kernel_initializer='glorot_uniform'),
        tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1))
    ])
    negloglik = lambda x, rv_x: -rv_x.log_prob(x)
    model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.0001), loss=negloglik)
    model.fit(x1, y, epochs=500, verbose=True, batch_size=16)
    yhat = model(x1)
    mean = yhat.mean()
    sess.run(init)
    mm = sess.run(mean)
    mean = yhat.mean()
    stddev = yhat.stddev()
    mean_plus_2_std = sess.run(mean - 3. * stddev)
    mean_minus_2_std = sess.run(mean + 3. * stddev)
Are you running tf.global_variables_initializer too late?
I found this in an answer to Understanding tf.global_variables_initializer:
Variable initializers must be run explicitly before other ops in your
model can be run. The easiest way to do that is to add an op that runs
all the variable initializers, and run that op before using the model.
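A minimal TF 1.x-style sketch of that ordering (my own illustration, matching the question's tf.Session usage): build the graph first, then run the initializer op inside the session before evaluating anything that depends on variables.

import tensorflow as tf

# Build the graph first
w = tf.Variable(tf.zeros([1]))
y = w + 1.0

# Op that runs all the variable initializers
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)      # run the initializer before any other op...
    print(sess.run(y))  # ...then it is safe to evaluate tensors that use variables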

Softmax logistic regression: Different performance by scikit-learn and TensorFlow

I'm trying to learn a simple linear softmax model on some data. The LogisticRegression in scikit-learn seems to work fine, and now I am trying to port the code to TensorFlow, but I'm not getting the same performance; it's quite a bit worse. I understand that the results will not be exactly equal (scikit-learn has regularization parameters etc.), but this is too far off.
total = pd.read_feather('testfile.feather')
labels = total['labels']
features = total[['f1', 'f2']]
print(labels.shape)
print(features.shape)

classifier = linear_model.LogisticRegression(C=1e5, solver='newton-cg', multi_class='multinomial')
classifier.fit(features, labels)
pred_labels = classifier.predict(features)

print("SCI-KITLEARN RESULTS: ")
print('\tAccuracy:', classifier.score(features, labels))
print('\tPrecision:', precision_score(labels, pred_labels, average='macro'))
print('\tRecall:', recall_score(labels, pred_labels, average='macro'))
print('\tF1:', f1_score(labels, pred_labels, average='macro'))

# now try softmax regression with tensorflow
print("\n\nTENSORFLOW RESULTS: ")

## By default, the OneHotEncoder class will return a more efficient sparse encoding.
## This may not be suitable for some applications, such as use with the Keras deep learning library.
## In this case, we disable the sparse return type by setting the sparse=False argument.
enc = OneHotEncoder(sparse=False)
enc.fit(labels.values.reshape(len(labels), 1))  # Reshape is required as Encoder expects 2D data as input
labels_one_hot = enc.transform(labels.values.reshape(len(labels), 1))

# tf Graph Input
x = tf.placeholder(tf.float32, [None, 2])  # 2 input features
y = tf.placeholder(tf.float32, [None, 5])  # 5 output classes

# Set model weights
W = tf.Variable(tf.zeros([2, 5]))
b = tf.Variable(tf.zeros([5]))

# Construct model
pred = tf.nn.softmax(tf.matmul(x, W) + b)  # Softmax
clas = tf.argmax(pred, axis=1)

# Minimize error using cross entropy
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1))
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start training
with tf.Session() as sess:
    # Run the initializer
    sess.run(init)

    # Training cycle
    for epoch in range(1000):
        # Run optimization op (backprop) and cost op (to get loss value)
        _, c = sess.run([optimizer, cost], feed_dict={x: features, y: labels_one_hot})

    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    class_out = clas.eval({x: features})
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print("\tAccuracy:", accuracy.eval({x: features, y: labels_one_hot}))
    print('\tPrecision:', precision_score(labels, class_out, average='macro'))
    print('\tRecall:', recall_score(labels, class_out, average='macro'))
    print('\tF1:', f1_score(labels, class_out, average='macro'))
The output of this code is
(1681,)
(1681, 2)
SCI-KITLEARN RESULTS:
Accuracy: 0.822129684711
Precision: 0.837883361162
Recall: 0.784522522208
F1: 0.806251963817
TENSORFLOW RESULTS:
Accuracy: 0.694825
Precision: 0.735883666192
Recall: 0.649145125846
F1: 0.678045562185
I inspected the result of the one-hot encoding and the data, but I have no idea why the result in TF is so much worse.
Any suggestion would be really appreciated.
The problem turned out to be silly: I just needed more epochs and a smaller learning rate (and, for efficiency, I switched to AdamOptimizer). The results are now equal, although the TF implementation is much slower.
(1681,)
(1681, 2)
SCI-KITLEARN RESULTS:
Accuracy: 0.822129684711
Precision: 0.837883361162
Recall: 0.784522522208
F1: 0.806251963817
TENSORFLOW RESULTS:
Accuracy: 0.82213
Precision: 0.837883361162
Recall: 0.784522522208
F1: 0.806251963817
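For concreteness, the change amounts to something like the following in the graph above (the exact learning rate and epoch count are my own guesses; the answer only says "smaller" and "more"):

# Replace the plain gradient-descent step
# optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
# with Adam at a smaller learning rate, and train for more epochs:
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10000):  # more epochs than the original 1000
        _, c = sess.run([optimizer, cost], feed_dict={x: features, y: labels_one_hot})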

TensorFlow - low accuracy on CNN MNIST dataset / How to batch accuracy calculations

The code is in Python 3.5.2 with TensorFlow. The neural network returns an accuracy between 0.10 and 5.00, with the higher value tending to be the training-data accuracy, which exceeds the test accuracy by a factor of roughly 6. I cannot tell whether the neural network is legitimately doing worse than random guessing or whether the accuracy code I am using has a serious fault I cannot see.
The neural network consists of 5 layers:
input
conv1 (with max pooling relu and dropout)
conv2 (with max pooling relu and dropout)
fully connected (with relu)
output
uses default Adam optimizer
I am very suspicious of my accuracy calculations, as I wrote them differently from what I have seen elsewhere because of RAM constraints. The accuracy calculation covers both the train and the test data.
acc_total = 0
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
for _ in range(int(mnist.test.num_examples/batch_size)):
    test_x, test_y = mnist.test.next_batch(batch_size)
    acc = accuracy.eval(feed_dict={x: test_x, y: test_y})
    acc_total += acc
    print('Accuracy:', acc_total*batch_size/float(mnist.test.num_examples), end='\r')
print('Epoch', epoch, 'current test set accuracy : ', acc_total*batch_size/float(mnist.test.num_examples))

acc_total = 0
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
for _ in range(int(mnist.train.num_examples/batch_size)):
    train_x, train_y = mnist.train.next_batch(batch_size)
    acc = accuracy.eval(feed_dict={x: train_x, y: train_y})
    acc_total += acc
    print('Accuracy:', acc_total*batch_size/float(mnist.train.num_examples), end='\r')
print('Epoch', epoch, 'current train set accuracy : ', acc_total*batch_size/float(mnist.test.num_examples))
This is a sample of the outputs:
Epoch 0 completed out of 20 loss: 10333239.3396 83.29 ts 429
Epoch 0 current test set accuracy : 0.7072
Epoch 0 current train set accuracy : 3.8039
Epoch 1 completed out of 20 loss: 1831489.40747 39.24 ts 858
Epoch 1 current test set accuracy : 0.7765
Epoch 1 current train set accuracy : 4.2239
Epoch 2 completed out of 20 loss: 1010191.40466 25.89 ts 1287
Epoch 2 current test set accuracy : 0.8069
Epoch 2 current train set accuracy : 4.3898
Epoch 3 completed out of 20 loss: 631960.809082 0.267 ts 1716
Epoch 3 current test set accuracy : 0.8277
Epoch 3 current train set accuracy : 4.4955
Epoch 4 completed out of 20 loss: 439149.724823 2.001 ts 2145
Epoch 4 current test set accuracy : 0.8374
Epoch 4 current train set accuracy : 4.5674
The full code is as follows (sorry about the length; I added a lot of comments for my own use):
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Imported data set
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# amount of output classes
n_classes = 10
# amount of examples processed at once
# memory impact of ~500MB for 128 with more on eval runs
batch_size = 128
# Times to cycle through the entire input data set
epoch_amm = 20

# Input and output placeholders
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32)

# Dropout is 1-keeprate; fc - fully connected layer dropout; conv - conv layer dropout
keep_rate_fc = .5
keep_rate_conv = .75
keep_prob = tf.placeholder(tf.float32)

# Regularization parameters
Regularization_active = False  # True and False MUST be capitalized
Lambda = 1.0  # 'weight' of the weights on the loss function

# counter for total steps taken by trainer
training_steps = 1

# Learning rate for network
base_Rate = .03
decay_steps = 64
decay_rate = .96
Staircase = True
Learning_Rate = tf.train.exponential_decay(base_Rate, training_steps, decay_steps, decay_rate, staircase='Staircase', name='Exp_decay')

# Convolution function returns neurons that act on a section of prev. layer
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# Pooling function returns max value in 2 by 2 sections
def maxpool2d(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

def relu(x):
    return tf.nn.relu(x, 'relu')

def add(x, b):
    return tf.add(x, b)

# 'Main' method, contains the neural network
def convolutional_neural_network(x):
    weights = {'W_conv1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
               'W_conv2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
               'W_fc': tf.Variable(tf.random_normal([7*7*64, 1024])),
               'W_out': tf.Variable(tf.random_normal([1024, n_classes]))}
    biases = {'B_conv1': tf.Variable(tf.random_normal([32])),
              'B_conv2': tf.Variable(tf.random_normal([64])),
              'B_fc': tf.Variable(tf.random_normal([1024])),
              'B_out': tf.Variable(tf.random_normal([n_classes]))}

    # Input layer
    x = tf.reshape(x, shape=[-1, 28, 28, 1])

    # first layer: pass inputs through conv2d and save as conv1, then apply maxpool2d
    conv1 = conv2d(x, weights['W_conv1'])
    conv1 = add(conv1, biases['B_conv1'])
    conv1 = relu(conv1)
    conv1 = maxpool2d(conv1)
    conv1 = tf.nn.dropout(conv1, keep_rate_conv)

    # second layer does same as first layer
    conv2 = conv2d(conv1, weights['W_conv2'])
    conv2 = add(conv2, biases['B_conv2'])
    conv2 = relu(conv2)
    conv2 = maxpool2d(conv2)
    conv2 = tf.nn.dropout(conv2, keep_rate_conv)

    # 3rd layer fully connected
    fc = tf.reshape(conv2, [-1, 7*7*64])
    fc = tf.matmul(fc, weights['W_fc'])
    fc = add(fc, biases['B_fc'])
    fc = relu(fc)
    fc = tf.nn.dropout(fc, keep_rate_fc)

    # 4th and final layer
    output = tf.matmul(fc, weights['W_out'])
    output = add(output, biases['B_out'])
    return output

# Trains the neural network
def train_neural_network(x):
    training_steps = 0
    # Initiate the network
    prediction = convolutional_neural_network(x)
    # Define the cost and cost function
    # tf.reduce_mean averages the values of a tensor into one value
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction, y))
    # Apply regularization if active
    # if Regularization_active:
    #     print('DEBUG!! LINE 84 REGULARIZATION ACTIVE')
    #     cost = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction, y)) +
    #             (Lambda*(tf.nn.l2_loss(weight['W_conv1']) +
    #                      tf.nn.l2_loss(weight['W_conv2']) +
    #                      tf.nn.l2_loss(weight['W_fc']) +
    #                      tf.nn.l2_loss(weight['W_out']) +
    #                      tf.nn.l2_loss(biases['B_conv1']) +
    #                      tf.nn.l2_loss(biases['B_conv2']) +
    #                      tf.nn.l2_loss(biases['B_fc']) +
    #                      tf.nn.l2_loss(biases['B_out']))))
    # Optimizer + Learning_Rate passthrough
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    # Get epoch amount
    hm_epochs = epoch_amm
    # Starts C++ training session
    print('Session Started')
    with tf.Session() as sess:
        # Initiate all variables
        sess.run(tf.global_variables_initializer())
        # Begin logs
        summary_writer = tf.summary.FileWriter('/tmp/logs', sess.graph)
        # Start training
        for epoch in range(hm_epochs):
            epoch_loss = 0
            for count in range(int(mnist.train.num_examples/batch_size)):
                training_steps = (training_steps+1)
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                count, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
                print('Epoch', epoch, 'current epoch loss', epoch_loss, 'batch loss', c, 'ts', training_steps, ' ', end='\r')
            # Log the loss per epoch
            print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss, ' ')
            acc_total = 0
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            for _ in range(int(mnist.test.num_examples/batch_size)):
                test_x, test_y = mnist.test.next_batch(batch_size)
                acc = accuracy.eval(feed_dict={x: test_x, y: test_y})
                acc_total += acc
                print('Accuracy:', acc_total*batch_size/float(mnist.test.num_examples), end='\r')
            print('Epoch', epoch, 'current test set accuracy : ', acc_total*batch_size/float(mnist.test.num_examples))
            acc_total = 0
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            for _ in range(int(mnist.train.num_examples/batch_size)):
                train_x, train_y = mnist.train.next_batch(batch_size)
                acc = accuracy.eval(feed_dict={x: train_x, y: train_y})
                acc_total += acc
                print('Accuracy:', acc_total*batch_size/float(mnist.train.num_examples), end='\r')
            print('Epoch', epoch, 'current train set accuracy : ', acc_total*batch_size/float(mnist.test.num_examples))
        print('Complete')
        sess.close()

# Run the neural network
train_neural_network(x)
The CNN had low results because of 4 reasons (a sketch of the fixes follows the list):
Improper (lack of) feeding of dropout
- the keep rate was not being fed into accuracy.eval(feed_dict={x: test_x, y: test_y}), causing the network to underperform in its accuracy evaluations
Poor initialization of weights
- ReLU neurons work significantly better with weights initialized close to zero than with a plain normal distribution
Far too high learning rate
- a learning rate of .03, even with decay, was far too high and stopped the network from training effectively
Errors in the accuracy function
- the training-set accuracy calculation was receiving the dataset size from mnist.test.num_examples instead of the correct mnist.train.num_examples, which caused nonsensical accuracy values in excess of 100%
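Putting those four fixes together, a minimal sketch of the changed lines might look like this (the exact stddev and learning rate are illustrative, not taken from the original answer):

# 1. Feed dropout through a placeholder so it can be switched off during evaluation
keep_prob = tf.placeholder(tf.float32)
# in the network:  conv1 = tf.nn.dropout(conv1, keep_prob)
# training step:   sess.run(optimizer, feed_dict={x: epoch_x, y: epoch_y, keep_prob: 0.75})
# accuracy eval:   accuracy.eval(feed_dict={x: test_x, y: test_y, keep_prob: 1.0})

# 2. Initialize weights close to zero instead of a unit-variance normal
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))

# 3. Use a much smaller learning rate
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)

# 4. Average the training accuracy over the matching dataset size
print('train accuracy:', acc_total * batch_size / float(mnist.train.num_examples))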