I read Michael Nielsen's book neuralnetworksanddeeplearning.com about neural networks. He always uses the MNIST data as his example.
I now took his code and built exactly the same network in TensorFlow, but I realized that the results in TensorFlow are not the same (they are much worse).
Here are the details:
1) The code from Michael Nielsen can be found at https://github.com/kanban1992/MNIST_Comparison/tree/master/Michael_Nielsen.
You can start everything with
python start_2.py
The network has:
3 hidden layers with 30 neurons each.
All activation functions are sigmoids.
I use stochastic gradient descent (learning rate 3.0) with backpropagation. The batch size is 10.
A quadratic cost function without any regularization is used.
The weight matrix connecting layer l and layer l+1 is initialized from a Gaussian distribution with stddev = 1/sqrt(number of neurons in layer l) and mean 0.0. The biases are initialized from a standard normal distribution (sketched below).
After training for 5 epochs I get 95% of the images in the validation set classified correctly.
This approach has to be correct, because it works well and I did not modify it!
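For reference, the initialization described above corresponds to roughly the following NumPy sketch (my paraphrase of the description, not Nielsen's actual code):
import numpy as np

sizes = [784, 30, 30, 30, 10]  # neurons per layer
biases = [np.random.randn(n, 1) for n in sizes[1:]]            # standard normal
weights = [np.random.randn(n_next, n_prev) / np.sqrt(n_prev)   # stddev = 1/sqrt(fan-in)
           for n_prev, n_next in zip(sizes[:-1], sizes[1:])]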
2) The TensorFlow implementation was done by me and has exactly the same structure as the Nielsen net described in point 1) above.
The full code can be found at https://github.com/kanban1992/MNIST_Comparison/tree/master/tensorflow and run with
python start_train.py
With the TensorFlow approach I get an accuracy of 10% (which is the same as random guessing!), so something is not working and I have no idea what.
Here is a snippet of the most important part of the code:
x_training,y_training,x_validation,y_validation,x_test,y_test = mnist_loader.load_data_wrapper()
N_training=len(x_training)
N_validation=len(x_validation)
N_test=len(x_test)
N_epochs = 5
learning_rate = 3.0
batch_size = 10
N1 = 784 #equals N_inputs
N2 = 30
N3 = 30
N4 = 30
N5 = 10
N_in=N1
N_out=N5
x = tf.placeholder(tf.float32,[None,N1])#don't take the shape=(batch_size,N1) argument, because we need this for different batch sizes
W2 = tf.Variable(tf.random_normal([N1, N2],mean=0.0,stddev=1.0/math.sqrt(N1*1.0)))# initialize each weight with stddev 1/sqrt(number of weights entering the neuron, i.e. the number of neurons in the previous layer)
b2 = tf.Variable(tf.random_normal([N2]))
a2 = tf.sigmoid(tf.matmul(x, W2) + b2) #x=a1
W3 = tf.Variable(tf.random_normal([N2, N3],mean=0.0,stddev=1.0/math.sqrt(N2*1.0)))
b3 = tf.Variable(tf.random_normal([N3]))
a3 = tf.sigmoid(tf.matmul(a2, W3) + b3)
W4 = tf.Variable(tf.random_normal([N3, N4],mean=0.0,stddev=1.0/math.sqrt(N3*1.0)))
b4 = tf.Variable(tf.random_normal([N4]))
a4 = tf.sigmoid(tf.matmul(a3, W4) + b4)
W5 = tf.Variable(tf.random_normal([N4, N5],mean=0.0,stddev=1.0/math.sqrt(N4*1.0)))
b5 = tf.Variable(tf.random_normal([N5]))
y = tf.sigmoid(tf.matmul(a4, W5) + b5)
y_ = tf.placeholder(tf.float32,[None,N_out]) # ,shape=(batch_size,N_out)
quadratic_cost= tf.scalar_mul(1.0/(N_training*2.0),tf.reduce_sum(tf.squared_difference(y,y_)))
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(quadratic_cost)
init = tf.initialize_all_variables()
#launch the graph
sess = tf.Session()
sess.run(init)
#batch size of training input
N_training_batch=N_training/batch_size #rounds down to the nearest integer
correct=[0]*N_epochs
cost_training_data=[0.0]*N_epochs
for i in range(0,N_epochs):
    for j in range(0,N_training_batch):
        start=j*batch_size
        end=(j+1)*batch_size
        batch_x=x_training[start:end]
        batch_y=y_training[start:end]
        sess.run(train_step, feed_dict={x: batch_x,
                                        y_: batch_y})
    perm = np.arange(N_training)
    np.random.shuffle(perm)
    x_training = x_training[perm]
    y_training = y_training[perm]
    #cost after each epoch
    cost_training_data[i]=sess.run(quadratic_cost, feed_dict={x: x_training,
                                                              y_: y_training})
    #correct predictions after each epoch
    y_out_validation=sess.run(y,feed_dict={x: x_validation})
    for k in range(0,len(y_out_validation)):
        arg=np.argmax(y_out_validation[k])
        if 1.0==y_validation[k][arg]:
            correct[i]+=1
    print "correct after "+str(i)+ " epochs: "+str(correct[i])
It would be really great if you could tell me what's going wrong :-)
Your learning rate seems too high for gradient descent. Try a number more like 0.0001, then raise or lower it from there.
I like the Adam optimizer; make sure you start with a smaller learning rate (0.001 is, I think, the default for Adam):
optimizer = tf.train.AdamOptimizer(learning_rate)
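For the code in the question, the swap would look roughly like this (a sketch reusing the quadratic_cost tensor defined there; the exact rate is something to tune):
learning_rate = 0.001  # Adam's usual default, much smaller than 3.0
train_step = tf.train.AdamOptimizer(learning_rate).minimize(quadratic_cost)
# or, staying with plain gradient descent at a much lower rate:
# train_step = tf.train.GradientDescentOptimizer(0.0001).minimize(quadratic_cost)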
There is something about the workings of GradientTape that escapes my understanding.
Suppose we want to train an agent on the classic bandit problem using an actor-critic RL framework. There are two bandits, A and B, and the agent must learn to select A, which yields higher returns on average. The training consists of, say, 1000 epochs, in each of which the agent draws, say, 100 samples from each bandit. The reward is 1 every time the agent selects A, and 0 otherwise.
Let's see how the agent learns by observing rewards over 10 training simulations. Here is the code defining the agent and the environment (neither needs to be more complicated than below).
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
n_sims = 10  # number of simulations
for n in range(n_sims):
    # define actors and optimizers for each simulation
    actor_input = Input(shape=(2,))
    actor_output = Dense(2, activation='softmax')(actor_input)
    globals()[f'actor_{n}'] = Model(inputs=actor_input, outputs=actor_output)
    globals()[f'actor_opt_{n}'] = Adam(learning_rate=.1)
    # define critics and optimizers for each simulation
    critic_input = Input(shape=(2,))
    critic_output = Dense(1, activation='softmax')(critic_input)
    globals()[f'critic_{n}'] = Model(inputs=critic_input, outputs=critic_output)
    globals()[f'critic_opt_{n}'] = Adam(learning_rate=.1)
    globals()[f'mean_rewards_{n}'] = []  # list to store rewards over training epochs for each simulation
A = np.random.normal(loc=10, scale=15, size=int(1e5))  # bandit A
B = np.random.normal(loc=0, scale=1, size=int(1e5))    # bandit B
n_training_epochs = 1000
n_samples = 100
Let's consider two alternative versions of the training loop using GradientTape, both based on a simple 'vanilla' loss function.
The first is the slow one: it literally involves a for loop over the samples drawn in each epoch. Cumulative actor and critic losses are accumulated iteratively, and their means are then used to update the respective network weights.
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    for n in range(n_sims):
        cum_actor_loss, cum_critic_loss, cum_reward = 0, 0, 0
        with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
            for A_sample, B_sample in zip(A_samples, B_samples):
                probs = globals()[f'actor_{n}'](tf.reshape([A_sample, B_sample], (1,-1)))[0]
                action = np.random.choice(['A','B'], p=np.squeeze(probs))
                reward = 1 if action == 'A' else 0
                cum_reward += reward
                action_prob = probs[['A','B'].index(action)]
                value = globals()[f'critic_{n}'](tf.reshape([A_sample, B_sample], (1,-1)))[0]
                advantage = reward - value
                cum_actor_loss += -tf.math.log(action_prob)*advantage
                cum_critic_loss += advantage**2
            mean_actor_loss = cum_actor_loss/n_samples
            mean_critic_loss = cum_critic_loss/n_samples
            globals()[f'mean_rewards_{n}'].append(cum_reward/n_samples)
        actor_grads = actor_tape.gradient(mean_actor_loss, globals()[f'actor_{n}'].trainable_variables)
        globals()[f'actor_opt_{n}'].apply_gradients(zip(actor_grads, globals()[f'actor_{n}'].trainable_variables))
        critic_grads = critic_tape.gradient(mean_critic_loss, globals()[f'critic_{n}'].trainable_variables)
        globals()[f'critic_opt_{n}'].apply_gradients(zip(critic_grads, globals()[f'critic_{n}'].trainable_variables))
If you plot the average training rewards over each epoch, you'll probably get something like this figure.
In the second option, instead of using an explicit for loop over samples in each epoch, we perform operations on arrays. This alternative is much faster in terms of computation time.
for _ in range(n_training_epochs):
    A_samples = np.random.choice(A, size=n_samples)
    B_samples = np.random.choice(B, size=n_samples)
    for n in range(n_sims):
        with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
            probs = globals()[f'actor_{n}'](tf.reshape([[A_sample, B_sample] for A_sample, B_sample in zip(A_samples, B_samples)], (n_samples,-1)))
            actions = np.array([np.random.choice(['A','B'], p=np.squeeze(probs[i])) for i in range(len(probs))]).reshape(n_samples, -1)
            rewards = np.array([1.0 if action == 'A' else 0.0 for action in actions]).reshape(n_samples, -1)
            globals()[f'mean_rewards_{n}'].append(np.mean(rewards))
            values = globals()[f'critic_{n}'](tf.reshape([[A_sample, B_sample] for A_sample, B_sample in zip(A_samples, B_samples)], (n_samples,-1)))
            advantages = rewards + tf.math.negative(values)
            actions_num = [['A','B'].index(action) for action in actions]
            action_probs = tf.reduce_sum(tf.one_hot(actions_num, len(['A','B'])) * probs, axis=1)
            mean_actor_loss = -tf.reduce_mean(advantages * tf.math.log(action_probs))
            mean_critic_loss = tf.reduce_mean(tf.pow(advantages, 2))
        actor_grads = actor_tape.gradient(mean_actor_loss, globals()[f'actor_{n}'].trainable_variables)
        globals()[f'actor_opt_{n}'].apply_gradients(zip(actor_grads, globals()[f'actor_{n}'].trainable_variables))
        critic_grads = critic_tape.gradient(mean_critic_loss, globals()[f'critic_{n}'].trainable_variables)
        globals()[f'critic_opt_{n}'].apply_gradients(zip(critic_grads, globals()[f'critic_{n}'].trainable_variables))
Let's plot the average reward over epochs, to obtain something like this.
As you can see, the agent tends to learn earlier and more stably in the first case than in the second (where learning may not even happen), although the two training loops are in theory mathematically equivalent. How can that be? The reason probably has something to do with the fact that, in the first option, GradientTape watches the trainable variables several times per epoch before the gradient is applied, whereas in the second option it does so only once. Even so, I can't figure out why exactly this produces the observed results. Can you help me understand?
I am training a U-Net-shaped CNN and have to deal with class imbalance. I want to minimise false negatives, so I want to implement a custom loss function that does so. I created the following loss function:
from tensorflow.keras import backend as K

def fbeta_loss(y_true, y_pred, beta=2., epsilon=K.epsilon()):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    tp = K.sum(y_true_f * y_pred_f)
    predicted_positive = K.sum(y_pred_f)
    actual_positive = K.sum(y_true_f)
    precision = tp/(predicted_positive+epsilon) # calculating precision
    recall = tp/(actual_positive+epsilon) # calculating recall
    # calculating fbeta
    beta_squared = K.square(beta)
    fb = (1+beta_squared)*precision*recall / (beta_squared*precision + recall + epsilon)
    return 1-fb
However, I am not sure whether y_pred is binary or a float between 0 and 1. In my final layer I use a sigmoid activation. Does that mean that in a custom loss function y_pred is a float between 0 and 1, and I should add a step that maps every value higher than a threshold (0.5) to 1 and everything lower to 0? Or is that step already included in the Keras model? In similar custom loss implementations that step is often not included.
Hopefully this is reasonably clear; I am relatively new to Stack Overflow. Let me know if anything is missing! Thanks in advance.
The output of the sigmoid activation function, S(x) = 1 / (1 + exp(-x)), is always between 0 and 1.
As x tends towards infinity, S(x) converges to 1, and as x tends towards negative infinity, S(x) converges to 0. Here, "converges" means that S(x) never actually reaches 0 or 1; it only approaches them.
So the output of S(x) is always a float strictly between 0 and 1.
Range of S(x):
0 < S(x) < 1
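To make this concrete: a Keras custom loss receives the model's raw output, so with a final sigmoid layer y_pred is a float tensor in (0, 1); no thresholding is applied for you. Thresholding inside the loss would also be counterproductive, because rounding has zero gradient almost everywhere. A small sketch with made-up example values (the threshold is only for metrics/reporting, not for the loss):
import tensorflow as tf
from tensorflow.keras import backend as K

y_true = tf.constant([[1.0], [0.0], [1.0]])
y_pred = tf.constant([[0.9], [0.4], [0.6]])  # what a sigmoid layer produces: floats in (0, 1)

# inside a custom loss, use the soft predictions directly (differentiable)
tp_soft = K.sum(K.flatten(y_true) * K.flatten(y_pred))

# thresholding is only appropriate for evaluation metrics, not for the loss
y_pred_hard = K.cast(K.greater(y_pred, 0.5), 'float32')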
I want to increase the learning rate from batch to batch within one epoch, so that the first data the net sees in an epoch gets a low learning rate and the last data it sees gets a high learning rate. How do I do this in tf.keras?
To modify the learning rate after every epoch, you can use tf.keras.callbacks.LearningRateScheduler as mentioned in the docs here.
But in our case, we need to modify the learning rate after every batch is passed to the model. We'll use tf.keras.optimizers.schedules.LearningRateSchedule for this purpose. This modifies the learning rate after each step, i.e. after each gradient update.
Suppose I have 100 samples in my training dataset and my batch size is 5. The number of steps will be 100 / 5 = 20. In other words, in a single epoch, 20 batches will be passed to the model and 20 gradient updates will occur.
Using the code given in the docs,
batch_size = 5
num_train_samples = 100

class MyLRSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

    def __init__(self, initial_learning_rate):
        self.initial_learning_rate = initial_learning_rate

    def __call__(self, step):
        return self.initial_learning_rate / (step + 1)

optimizer = tf.keras.optimizers.SGD(learning_rate=MyLRSchedule(0.1))
The value of step will go from 0 to 19 for the 1st epoch, considering our example. For the 2nd epoch, it will go from 20 to 39. For your use-case, we can modify the above like this,
batch_size = 5
num_train_samples = 100
num_steps = num_train_samples / batch_size

class MyLRSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

    def __init__(self, initial_learning_rate):
        self.initial_learning_rate = initial_learning_rate

    def __call__(self, step):
        step_in_epoch = step - ((step // num_steps) * num_steps)
        # Update LR according to step_in_epoch

optimizer = tf.keras.optimizers.SGD(learning_rate=MyLRSchedule(0.1))
The value of step_in_epoch will go from 0 to 19 for the 1st epoch. For the 2nd epoch, it will go from 0 to 19 again, and likewise for all epochs. Update the LR accordingly; one possible way is sketched below.
Make sure that num_train_samples is perfectly divisible by the batch size. This eases the calculation of the number of steps.
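For instance, here is a minimal sketch of one way to fill in __call__ for an increasing learning rate within each epoch, using a linear ramp (the IncreasingLRSchedule name and the base_lr/max_lr values are illustrative choices, not fixed by the question):
import tensorflow as tf

batch_size = 5
num_train_samples = 100
num_steps = num_train_samples // batch_size  # 20 gradient updates per epoch

class IncreasingLRSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

    def __init__(self, base_lr, max_lr):
        self.base_lr = base_lr
        self.max_lr = max_lr

    def __call__(self, step):
        # position within the current epoch, from 0 to num_steps - 1
        step_in_epoch = step - ((step // num_steps) * num_steps)
        # linearly ramp the LR from base_lr (first batch) to max_lr (last batch)
        fraction = tf.cast(step_in_epoch, tf.float32) / float(num_steps - 1)
        return self.base_lr + fraction * (self.max_lr - self.base_lr)

optimizer = tf.keras.optimizers.SGD(learning_rate=IncreasingLRSchedule(0.001, 0.01))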
I am training a classification problem using tensorflow estimators.
I want to calculate the F1 score for each batch of data along with precision and recall.
I calculate precision and recall using the code below and log them for evaluation and training.
I also calculate the fscore using the formula, but while logging the fscore I get an error.
pre = tf.metrics.precision(labels=labels,predictions=pred,name="precision")
rec = tf.metrics.recall(labels=labels,predictions=pred,name="recall")
fscore_val = tf.reduce_mean((2*pre[0]*rec[0]) / (pre[0] + rec[0] + 1e-5))
fscore_update = tf.group(pre[1], rec[1])
fscore = (fscore_val, fscore_update)
# logging metric at evaluation time
metrics['precision'] = pre
metrics['recall'] = rec
metrics['fscore'] = fscore
# logging metric at training time
tf.summary.scalar('precision', pre[1])
tf.summary.scalar('recall', rec[1])
tf.summary.scalar('fscore', fscore)
This is the error that I get.
TypeError: Expected float32, got <tf.Operation 'metrics_Left_Lane_Type/group_deps' type=NoOp> of type 'Operation' instead.
I understand why I am getting this error.
It is because fscore should be a pair of values (a value tensor and an update op), similar to precision and recall.
Can someone please help me on how to do this in tensorflow estimators?
First of all, TensorFlow has its own F1 score, tf.contrib.metrics.f1_score, and it is rather straightforward to use. The only possible downside is that it hides the threshold value from the user, choosing the best one from a specified number of possible thresholds.
predictions = tf.sigmoid(logits)
tf.contrib.metrics.f1_score(labels, predictions, num_thresholds=20)
If, for any reason, you want a custom implementation, you need to group the update ops. Every TensorFlow metric has an operation that increments its value. You can set the threshold manually when defining predictions:
predictions = tf.greater(tf.sigmoid(logits), 0.5)
def f1_score(labels, predictions):
    precision, update_op_precision = tf.metrics.precision(labels, predictions)
    recall, update_op_recall = tf.metrics.recall(labels, predictions)
    eps = 1e-5 # small constant for numerical stability
    f1 = 2 * precision * recall / (precision + recall + eps)
    f1_upd = 2 * update_op_precision * update_op_recall / (update_op_precision + update_op_recall + eps)
    return f1, f1_upd
f1_score = f1_score(labels, predictions)
Then you can add it to the eval_metric_ops dict or pass it to tf.summary.scalar:
eval_metric_ops = {'f1': f1_score}
tf.summary.scalar('f1', f1_score[1])
It actually gives results very close to the metric from the contrib module.
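To map this back onto the snippet in the question (a sketch reusing the pre, rec and metrics names from there), the key point is that the second element of the metric tuple must itself be a float tensor, not a tf.group NoOp:
eps = 1e-5
fscore_val = 2 * pre[0] * rec[0] / (pre[0] + rec[0] + eps)
fscore_upd = 2 * pre[1] * rec[1] / (pre[1] + rec[1] + eps)  # a tensor, not a NoOp

metrics['fscore'] = (fscore_val, fscore_upd)
tf.summary.scalar('fscore', fscore_upd)  # summaries need a tensor, not a (value, update) tuple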
Suppose my loss function is of the following form:
loss = a*loss_1 + (1-a)*loss_2
Suppose also that I am training for 100 steps. How can I dynamically change the loss function in TensorFlow so that "a" gradually changes from 1 to 0 during the 100 steps of training?
To be precise, I want my loss to be
loss = 1*loss_1+0*loss_2 = loss_1
at the beginning of training (at step 1)
and
loss = 0*loss_1+1*loss_2 = loss_2 at the end (step 100)
with some kind of gradual (doesn't have to be continuous) decrease in between.
Assuming that the value of a does not depend on the computation done at the current step, create a placeholder for a and then pass the value you want using the feed dictionary:
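A minimal sketch of that placeholder approach, assuming TF1-style graph code; loss_1 and loss_2 below are dummy stand-ins for the tensors from your model:
import tensorflow as tf

# dummy stand-ins so the sketch runs; in practice loss_1 and loss_2 come from your model
w = tf.Variable(1.0)
loss_1 = tf.square(w - 3.0)
loss_2 = tf.abs(w + 1.0)

a = tf.placeholder(tf.float32, shape=[], name='loss_weight')
loss = a * loss_1 + (1 - a) * loss_2
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    n_steps = 100
    for step in range(n_steps):
        a_value = 1.0 - step / (n_steps - 1)  # 1.0 at the first step, 0.0 at the last
        sess.run(train_step, feed_dict={a: a_value})  # merge in your usual data feeds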
You can use tf.train.polynomial_decay.
tf.train.polynomial_decay(learning_rate=1, global_step=step_from_placeholder,
                          decay_steps=100, end_learning_rate=0,
                          power=1.0, cycle=False, name=None)
This computes
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) * \
                        (1 - global_step / decay_steps) ** (power) + end_learning_rate
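With power=1.0 this decays linearly from 1 to 0 over decay_steps, so the result can be used directly as the coefficient a. A minimal sketch combining it with the loss from the question (loss_1 and loss_2 are dummy stand-ins here):
import tensorflow as tf

step_from_placeholder = tf.placeholder(tf.int32, shape=[], name='global_step')
a = tf.train.polynomial_decay(learning_rate=1.0, global_step=step_from_placeholder,
                              decay_steps=100, end_learning_rate=0.0, power=1.0)

# dummy stand-ins so the sketch runs; in practice these come from your model
w = tf.Variable(1.0)
loss_1 = tf.square(w - 3.0)
loss_2 = tf.abs(w + 1.0)

loss = a * loss_1 + (1 - a) * loss_2  # equals loss_1 when a == 1 and loss_2 when a == 0
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(101):  # a is 1.0 at step 0 and 0.0 at step 100
        sess.run(train_step, feed_dict={step_from_placeholder: step})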