I only recently started learning TensorFlow. I tried to run a simple regression example, but I got a bad result.
My input X is a 10000x10 matrix: 10000 samples, each a 10-dimensional vector.
The desired output Y is just the first component of each sample (the first column of X).
My code is as follows:
import tensorflow as tf
import numpy as np
from numpy.random import RandomState
rdm=RandomState(1)
data_size=10000
xdim=10
X=rdm.rand(data_size,xdim)
Y = [x1[0] for x1 in X]
x=tf.placeholder(tf.float32,shape=(None,xdim))
y=tf.placeholder(tf.float32,shape=(None))
#logits = modelFun(x)
Weights = tf.Variable(tf.random_normal([xdim, 1]))
biases = tf.Variable(0.1)
logits = tf.matmul(x, Weights) + biases
loss = tf.reduce_mean(tf.square(logits - y))
optimizer = tf.train.GradientDescentOptimizer(0.005).minimize(loss)
batch_size=50
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    steps=20001
    for i in range(steps):
        start = i * batch_size % data_size
        end = min(start + batch_size,data_size)
        sess.run(optimizer,feed_dict={x:X[start:end],y:Y[start:end]})
        if i % 5000 == 0:
            ypred,training_loss= sess.run([logits,loss],feed_dict={x:X,y:Y})
            print("Epoch %d: loss=%g"%(i,training_loss))
The output results are as follows:
Epoch 0: loss=6.31555
Epoch 5000: loss=0.0798763
Epoch 10000: loss=0.0797333
Epoch 15000: loss=0.0797259
Epoch 20000: loss=0.079724
The loss won't go below about 0.0797.
I checked part of the output. They are far from the correct answer.
>>>print(ypred[:10].T[0])
[ 0.49342471 0.49475971 0.50192004 0.48912409 0.50592101 0.48473218 0.48652697 0.50261581 0.50218904 0.48906678]
>>>print(np.array(Y[:10]))
[ 0.417022 0.41919451 0.80074457 0.09834683 0.98886109 0.01936696 0.10233443 0.90340192 0.88330609 0.11474597]
What is the reason for this, and how can I fix it?
Thanks for your help!
You're asking for too much from your model. You're generating 10000 points of ten-dimensional random data, so there's no structure to learn, and then doing linear regression with a single neuron; your model doesn't have the capacity to even begin to memorize your input, so guessing that every y is about 0.5 is the best it can do.
The biggest issue is the random input. Most types of machine learning models make strong assumptions about the structure of what they're trying to learn, and random data doesn't have that structure. A large enough neural network could memorize your input data and give you a low training error, but it would completely fail to generalize (the test error would be high), and generalizing is usually the goal.
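One more thing that may be worth checking alongside the explanation above: in the question's code, logits has shape (N, 1) while y is fed with shape (N,), so logits - y broadcasts to an (N, N) matrix instead of an elementwise difference. The NumPy sketch below is only an illustration under assumptions (NumPy stands in for TensorFlow because it broadcasts the same way here; the sample count and the helper names broadcast_mse and elementwise_mse are made up). It shows that under such a loss a constant prediction near the mean of Y scores better than perfect predictions, which would be consistent with outputs clustering around 0.5.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 1000                       # illustrative sample count (smaller than the question's 10000)
X = rng.rand(n, 10)
Y = X[:, 0]                    # target: first component of each sample, shape (n,)

perfect = Y.reshape(-1, 1)             # perfect predictions, shaped (n, 1) like tf.matmul output
constant = np.full((n, 1), Y.mean())   # always predict the mean of Y

def broadcast_mse(pred, target):
    # (n, 1) - (n,) broadcasts to (n, n): every prediction is compared with every target
    return np.mean((pred - target) ** 2)

def elementwise_mse(pred, target):
    # Flatten predictions to (n,) so each one is compared only with its own target
    return np.mean((pred.ravel() - target) ** 2)

print((perfect - Y).shape)             # (1000, 1000)
print(broadcast_mse(perfect, Y))       # twice the variance of Y
print(broadcast_mse(constant, Y))      # the variance of Y
print(elementwise_mse(perfect, Y))     # 0.0
```

If this is what is happening, making the two shapes agree (for example reshaping logits with tf.reshape(logits, [-1]), or feeding Y as a column with a placeholder of shape (None, 1)) should turn the loss back into an elementwise one.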
I am trying to find out exactly how the BatchNormalization layer behaves in TensorFlow. I came up with the following piece of code, which to the best of my knowledge should be a perfectly valid Keras model; however, the moving mean and variance of BatchNormalization don't appear to be updated.
From docs https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain why the BatchNormalization layer does not update its internal values?
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(3, 5) * 5 + 0.3
    bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
    z = input = tf.keras.layers.Input([5])
    z = bn(z)
    model = tf.keras.Model(inputs=input, outputs=z)
    for i in range(10):
        print(x)
        print(model.predict(x))
        print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving average is being updated during training not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example a long sequence of highly pathological input samples [e.g. such that their generating distribution differs drastically from the one on which the network was trained] could potentially bias the network and result in worse performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training. It is not needed to get inference behavior: when you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of the batch averages.
The code below demonstrates this clearly:
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
    np.random.seed(1)
    x = np.random.randn(10, 5) * 5 + 0.3
    z = input = tf.keras.layers.Input([5])
    z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)
    model = tf.keras.Model(inputs=input, outputs=z)
    # a dummy loss function
    model.compile(loss=lambda x, y: (x - y) ** 2)
    # a dummy fit just to update the batchnorm moving averages
    model.fit(x, x, batch_size=3, epochs=10)
    # first predict uses the moving averages from training
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
    # outputs the same thing as previous predict
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
    # here calling the model with training=True results in update of moving averages
    # furthermore, it uses the batch mean and variance as in training,
    # so the result is very different
    pred = model(x, training=True).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
    # here we see again that the moving averages are used but they differ slightly after
    # the previous call, as expected
    pred = model(x).numpy()
    print(pred.mean(axis=0))
    print(pred.var(axis=0))
    print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with a similar misunderstanding in the future.
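The behavior described above can be imitated with a few lines of NumPy. This is only a sketch: TinyBatchNorm is a hypothetical stand-in, not the Keras layer. It assumes the zero and one initial values of the moving mean and variance and the update rule moving = momentum * moving + (1 - momentum) * batch_statistic, and it leaves out the learned scale and offset.

```python
import numpy as np

class TinyBatchNorm:
    """Hypothetical NumPy stand-in for a batch-norm layer (no learned scale/offset)."""

    def __init__(self, dim, momentum=0.99, epsilon=1e-9):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = np.zeros(dim)   # assumed initial value: zeros
        self.moving_var = np.ones(dim)     # assumed initial value: ones

    def __call__(self, x, training=False):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Moving statistics are only updated on the training path
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)

np.random.seed(1)
x = np.random.randn(10, 5) * 5 + 0.3
bn = TinyBatchNorm(5)

first = bn(x)                      # inference: uses the initial moving statistics
second = bn(x)                     # identical, nothing was updated
trained = bn(x, training=True)     # normalizes with batch statistics and updates the averages
third = bn(x)                      # inference again, now with shifted averages

print(np.allclose(first, second))  # True
print(np.allclose(first, third))   # False
```

With untouched moving statistics (mean 0, variance 1) the inference output is essentially the input itself, which is why repeated predict calls in the question returned the same values every time.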
Consider a simple line fitting a * x + b = x, where a, b are the optimized parameters and x is the observed vector given by
import torch
X = torch.randn(1000,1,1)
One can immediately see that the exact solution is a=1, b=0 for any x and it can be found as easily as:
import numpy as np
np.polyfit(X.numpy().flatten(), X.numpy().flatten(), 1)
I am trying now to find this solution by means of gradient descent in PyTorch, where the mean square error is used as an optimization criterion.
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam, SGD, Adagrad, ASGD
X = torch.randn(1000,1,1) # Sample data
class SimpleNet(nn.Module): # Trivial neural network containing two weights
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.f1 = nn.Linear(1,1)
    def forward(self, x):
        x = self.f1(x)
        return x
# Testing default setting of 3 basic optimizers
K = 500
net = SimpleNet()
optimizer = Adam(params=net.parameters())
Adam_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        Adam_losses.append(float(loss.detach()))
net = SimpleNet()
optimizer = SGD(params=net.parameters(), lr=0.0001)
SGD_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        SGD_losses.append(float(loss.detach()))
net = SimpleNet()
optimizer = Adagrad(params=net.parameters())
Adagrad_losses = []
optimizer.zero_grad() # zero the gradient buffers
for k in range(K):
    for b in range(1): # single batch
        loss = torch.mean((net.forward(X[b,:,:]) - X[b,:, :])**2)
        loss.backward()
        optimizer.step()
        Adagrad_losses.append(float(loss.detach()))
The training progress in terms of loss evolution can be shown as:
[figure: loss evolution for the three optimizers]
What surprises me is the very slow convergence of the algorithms in their default settings. I thus have 2 questions:
1) Is it possible to achieve an arbitrarily small error (loss) purely by means of some PyTorch optimizer? Since the loss function is convex, it should definitely be possible; however, I am not able to figure out how to achieve this using PyTorch. Note that the above 3 optimizers cannot do that; see the loss progress in log scale for 20000 iterations:
[figure: loss in log scale over 20000 iterations]
2) I am wondering how the optimizers can work well in complex examples when they do not work well even in this extremely simple example. Or (and that is the second question) is there something wrong in how I applied them above that I missed?
You call zero_grad in the wrong place: once before the loop instead of once per iteration. Because backward() adds to the stored gradients rather than overwriting them, each step uses the sum of all previous gradients. The parameters get close to the solution, but the accumulated gradient throws them off it again, so the loss oscillates.
The code below will easily perform the task:
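import torch
from torch.optim import Adam

# SimpleNet is the class defined in the question.
EPOCHS = 5000  # assumed upper bound; the original answer does not define it

X = torch.randn(1000,1,1)
net = SimpleNet()
optimizer = Adam(params=net.parameters())
for epoch in range(EPOCHS):
    optimizer.zero_grad() # zero the gradient buffers on every iteration
    loss = torch.mean((net(X) - X) ** 2)
    if loss < 1e-8:
        print(epoch, loss.item())
        break
    loss.backward()
    optimizer.step()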
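To see why the placement of zero_grad matters, here is a small pure-Python sketch. It is an illustration under assumptions, not the question's setup: plain gradient descent stands in for Adam, the five data points and the learning rate are made up, and buf_a and buf_b play the role of the .grad buffers.

```python
# Pure-Python sketch: fit a*x + b to the target x with plain gradient descent
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # illustrative data, not the question's random sample
lr = 0.1                            # illustrative learning rate

def loss_and_grads(a, b):
    n = len(xs)
    residuals = [a * x + b - x for x in xs]
    loss = sum(r * r for r in residuals) / n
    grad_a = sum(2 * r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(2 * r for r in residuals) / n
    return loss, grad_a, grad_b

def train(zero_each_step, steps=200):
    a, b = 0.0, 0.5
    buf_a = buf_b = 0.0              # plays the role of the .grad buffers
    losses = []
    for _ in range(steps):
        if zero_each_step:
            buf_a = buf_b = 0.0      # what optimizer.zero_grad() does
        loss, grad_a, grad_b = loss_and_grads(a, b)
        buf_a += grad_a              # backward() adds to the buffer, it does not overwrite it
        buf_b += grad_b
        a -= lr * buf_a
        b -= lr * buf_b
        losses.append(loss)
    return losses

zeroed = train(zero_each_step=True)
accumulated = train(zero_each_step=False)
print(zeroed[-1])                    # shrinks towards 0
print(max(accumulated[-50:]))        # stays large: the loss keeps oscillating
```

With the buffer cleared on every step the error shrinks geometrically; with a buffer that is never cleared the update behaves like momentum without any damping, so in this toy setting the parameters keep swinging past the solution.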
1) Is it possible to achieve an arbitrary small error (loss) purely by
means of some Pytorch optimizer?
Yes. The precision above is reached in roughly 1500 epochs, and you can go lower, down to machine precision (float in this case).
2) I am wondering how the optimizers can work well in complex
examples when they do not work well even in this extremely simple
example.
Currently, we don't have anything better (at least nothing widespread) for network optimization than first-order methods. They are used because it's much faster to calculate gradients than the Hessians that higher-order methods need. Also, complex, non-convex functions may have a lot of minima that more or less fulfill the task we throw at them, so there is no need for global minima per se (although they may under some conditions, see this paper).
I used tensorflow to implement a simple RNN model to learn possible trends of time series data and predict future values. However, the model always produces same values after training. Actually, the best model it got is:
y = b.
The RNN structure is:
InputLayer -> BasicRNNCell -> Dense -> OutputLayer
RNN code:
def RNN(n_timesteps, n_input, n_output, n_units):
    tf.reset_default_graph()
    X = tf.placeholder(dtype=tf.float32, shape=[None, n_timesteps, n_input])
    cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_units)]
    stacked_rnn = tf.contrib.rnn.MultiRNNCell(cells)
    stacked_output, states = tf.nn.dynamic_rnn(stacked_rnn, X, dtype=tf.float32)
    stacked_output = tf.layers.dense(stacked_output, n_output)
    return X, stacked_output
During training, n_timesteps=1, n_input=1, n_output=1, n_units=2, and learning_rate=0.0000001. The loss is the mean squared error.
The input is a sequence of values for consecutive days. The output is the values for the days that follow the input.
(Maybe these are not good settings, but no matter how I change them, the results are almost the same, so I fixed them at these values to show the results below.)
I found out that this is because the weights and bias of BasicRNNCell are not trained: they stay the same from the beginning, and only the weights and bias of Dense keep changing. During training, I get output like this:
In the beginning:
loss: 1433683500.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
After a while:
loss: 175372340.0
rnn/multi_rnn_cell/cell_0/cell0/kernel:0 [KEEP UNCHANGED]
rnn/multi_rnn_cell/cell_0/cell0/bias:0 [KEEP UNCHANGED]
dense/kernel:0 [CHANGING]
dense/bias:0 [CHANGING]
[figure: true data (orange line) and the model's predictions (blue line)]
The orange line indicates the true data, and the blue line indicates the results of my code. During training, the blue line keeps going up until the model reaches a stable loss.
I suspected that my implementation was wrong, so I generated a set of data with y = 10x + 5 for testing. This time, my model learned the correct result.
In the beginning: [figure]
In the end: [figure]
I have tried:
adding more layers of both BasicRNNCell and Dense
increasing the number of hidden units in the RNN cell (n_units) to 128
decreasing learning_rate to 1e-10
increasing timesteps to 60
None of them worked.
So, my questions are:
Is it because my model is too simple? I don't think the trend in my data is that complicated to learn. At the very least, something like y = ax + b would produce a smaller loss than y = b.
What may lead to these results?
Or how should I go on debugging?
I now also wonder whether BasicRNNCell is not fully implemented, so that users have to implement some of its functions themselves. I have no prior experience with TensorFlow.
It seems your net is just not a good fit for that kind of data or, from another point of view, your data is badly scaled. When I add the 4 lines below after the split_data call, I get some sort of learning behavior, similar to the a*x+b case:
data = read_data(work_dir, input_file)
plot_data(data)
input_data, output_data, n_batches = split_data(data, n_timesteps, n_input, n_output)
# scale input and output data
input_data = input_data-input_data[0]
input_data = input_data/np.max(input_data)*1000
output_data = output_data-output_data[0]
output_data = output_data/np.max(output_data)*1000
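If you rescale the data like this, you will usually also want to map predictions back to the original units. The sketch below shows one possible way to do that, assuming a plain standardization (zero mean, unit variance) in place of the scaling above. The helper names fit_scaler, scale, and unscale are hypothetical, and the series is made up.

```python
import numpy as np

def fit_scaler(values):
    # Remember the statistics needed to undo the scaling later
    values = np.asarray(values, dtype=np.float64)
    std = values.std()
    return {"mean": values.mean(), "std": std if std > 0 else 1.0}

def scale(values, scaler):
    return (np.asarray(values, dtype=np.float64) - scaler["mean"]) / scaler["std"]

def unscale(values, scaler):
    return np.asarray(values, dtype=np.float64) * scaler["std"] + scaler["mean"]

# Made-up series with large magnitudes, standing in for the question's data
raw = np.array([31000.0, 32500.0, 34100.0, 36000.0, 38200.0, 40100.0])
scaler = fit_scaler(raw)
scaled = scale(raw, scaler)

print(scaled.mean(), scaled.std())                # roughly 0 and 1
print(np.allclose(unscale(scaled, scaler), raw))  # True
```

Fitting the scaler on the training portion only and reusing it for later data keeps the two consistent.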
I'm running into a weird problem with TensorFlow. I've set up a very simple classification problem, four input variables, one binary output variable, one layer of weights and bias, output goes through a sigmoid to 0 or 1.
The problem is, memory consumption is quadratic in the number of records of training data! With only 5,000 records, it's already 900 megabytes; at 10,000, it runs into a few gigabytes. Since I want to end up using at least a few tens of thousands of records, this is a problem.
It is happening specifically in the back propagation step; when I just try to evaluate the cost function, memory consumption is linear in the number of records, as expected.
Code follows. What am I doing wrong?
import numpy as np
import os
import psutil
import tensorflow as tf
process = psutil.Process(os.getpid())
sess = tf.InteractiveSession()
# Parameters
learning_rate = 0.01
random_seed = 1
tf.set_random_seed(random_seed)
# Data
data = np.loadtxt('train.csv', delimiter=',', dtype=np.float32)
train_X = data[:, :-1]
train_Y = data[:, -1]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(np.float32, shape=(rows, cols))
Y = tf.placeholder(np.float32, shape=rows,)
# Weights
W = tf.Variable(tf.random_normal((cols, 1)))
b = tf.Variable(tf.random_normal(()))
# Model
p = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Just one optimizer step is enough to demonstrate the problem
optimizer.run({X: train_X, Y: train_Y})
# Memory consumption is quadratic in number of rows
print('{0:,} bytes'.format(process.memory_info().peak_wset))
It turns out to be a problem of shape again. Using matmul the way I did there generates output of shape (n,1). Using that in a context where shape (n,) is expected silently broadcasts to shape (n,n), which is the quadratic blowup.
The solution is squeeze. Specifically, tf.squeeze(tf.matmul(X, W)).
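The size of the blowup is easy to measure in NumPy, which follows the same broadcasting rule. This is a sketch with made-up sizes and random data, not the asker's train.csv:

```python
import numpy as np

n, cols = 1000, 4                       # illustrative sizes
rng = np.random.RandomState(1)
X = rng.rand(n, cols).astype(np.float32)
W = rng.rand(cols, 1).astype(np.float32)
Y = rng.rand(n).astype(np.float32)      # labels with shape (n,)

p = X @ W                               # shape (n, 1), like tf.matmul(X, W)
bad = p - Y                             # broadcasts to (n, n)
good = np.squeeze(p, axis=1) - Y        # shape (n,)

print(bad.shape, bad.nbytes)            # (1000, 1000) 4000000
print(good.shape, good.nbytes)          # (1000,) 4000
```

Passing an explicit axis to squeeze (tf.squeeze accepts one too) avoids also collapsing the batch dimension when there happens to be a single row.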
It makes sense that memory consumption blows up like that since the backprop requires the extra memory to keep track of the gradients of each operation (though I can't figure out how it ends up being quadratic).
Solution: Mini-batches
This is usually the go-to method when it comes to training models. Split up your training data into little mini-batches, each containing a fixed number of samples (this is rarely more than 200 samples), and feed them to the optimizer one mini-batch at a time. So if your batch_size=64, then the train_X and train_Y fed to the optimizer will be of the shapes (64, 4) and (64,) respectively.
I would try something like this:
# Assumes X and Y are declared with shapes (None, cols) and (None,)
# so that a batch of any size can be fed
batch_size = 64
n_batches = (rows + batch_size - 1) // batch_size  # round up to cover all rows
for i in range(n_batches):
    batch_X = train_X[i*batch_size : (i + 1)*batch_size]
    batch_Y = train_Y[i*batch_size : (i + 1)*batch_size]
    optimizer.run({X: batch_X, Y: batch_Y})
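The slicing can also be wrapped in a small generator, which makes it easy to shuffle once per pass. This is a sketch: iterate_minibatches is a hypothetical helper, and the arrays are placeholders standing in for train_X and train_Y.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, rng=None):
    """Yield (features, labels) slices; shuffle once per pass if rng is given."""
    n = len(features)
    order = np.arange(n) if rng is None else rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

# Placeholder arrays standing in for train_X and train_Y
train_X = np.arange(40, dtype=np.float32).reshape(10, 4)
train_Y = np.arange(10, dtype=np.float32)

sizes = [len(bx) for bx, by in iterate_minibatches(train_X, train_Y, batch_size=4)]
print(sizes)   # [4, 4, 2]
```

The last batch is smaller when the number of rows is not a multiple of batch_size, so anything that consumes the batches has to accept a variable batch dimension.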
I'm learning TensorFlow and trying to apply it to a simple linear regression problem. data is a numpy.ndarray of shape [42x2].
I'm a bit puzzled why the loss is increasing after each successive epoch. Isn't the loss expected to go down with each successive epoch?
Here is my code (let me know if you'd like me to share the output as well). Thanks a lot for taking the time to answer.
1) created the placeholders for dependent / independent variables
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32,name='Y')
2) created vars for weight, bias, total_loss (after each epoch)
w = tf.Variable(0.0,name='weights')
b = tf.Variable(0.0,name='bias')
3) defined loss function & optimizer
Y_pred = X * w + b
loss = tf.reduce_sum(tf.square(Y - Y_pred), name = 'loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001).minimize(loss)
4) created summary events & event file writer
tf.summary.scalar(name = 'weight', tensor = w)
tf.summary.scalar(name = 'bias', tensor = b)
tf.summary.scalar(name = 'loss', tensor = loss)
merged = tf.summary.merge_all()
evt_file = tf.summary.FileWriter('def_g')
evt_file.add_graph(tf.get_default_graph())
5) and execute all in a session
with tf.Session() as sess1:
    sess1.run(tf.variables_initializer(tf.global_variables()))
    for epoch in range(10):
        summary, _,l = sess1.run([merged,optimizer,loss],feed_dict={X:data[:,0],Y:data[:,1]})
        evt_file.add_summary(summary,epoch+1)
        evt_file.flush()
        print(" new_loss: {}".format(sess1.run(loss,feed_dict={X:data[:,0],Y:data[:,1]})))
Cheers!
The short answer is that your learning rate is too big. I was able to get reasonable results by changing it from 0.001 to 0.0001, but I only used the 23 points from your second-last comment (I initially didn't notice your last comment), so using all the data might require an even lower number.
0.001 seems like a really low learning rate. However, the real problem is that your loss function is using reduce_sum instead of reduce_mean. This causes your loss to be a large number, which sends a very strong signal to the GradientDescentOptimizer, so it's overshooting despite the low learning rate. The problem would only get worse if you added more points to your training data. So use reduce_mean to get the average squared error and your algorithms will be much better behaved.
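The effect of summing versus averaging can be reproduced without TensorFlow. The sketch below uses 42 synthetic points on a made-up line (the asker's data is not shown, so the numbers are purely illustrative) and runs ten steps of plain gradient descent with the question's learning rate of 0.001, once with a summed loss and once with an averaged loss.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 42)   # 42 synthetic points; the asker's data is not shown
y = 3.0 * x + 2.0                # made-up linear relationship
lr = 0.001                       # learning rate from the question

def run(reduce, epochs=10):
    w, b = 0.0, 0.0
    losses = []
    for _ in range(epochs):
        err = w * x + b - y
        losses.append(reduce(err ** 2))
        # Gradient of reduce(err**2) with respect to w and b
        grad_w = reduce(2 * err * x)
        grad_b = reduce(2 * err)
        w -= lr * grad_w
        b -= lr * grad_b
    return losses

sum_losses = run(np.sum)
mean_losses = run(np.mean)
print(sum_losses[0], sum_losses[-1])     # grows from epoch to epoch
print(mean_losses[0], mean_losses[-1])   # shrinks from epoch to epoch
```

The summed loss has gradients 42 times larger than the averaged one, so the same learning rate overshoots with the sum and behaves with the mean on this data.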