Working example:
import numpy as np
import tensorflow as tf
## construct data
np.random.seed(723888)
N,P = 50,3 # number and dimensionality of observations
Xbase = np.random.multivariate_normal(mean=np.zeros((P,)), cov=np.eye(P), size=N)
## construct model
X = tf.placeholder(dtype=tf.float32, shape=(None, P), name='X')
mu = tf.Variable(np.random.normal(loc=0.0, scale=0.1, size=(P,)), dtype=tf.float32, name='mu')
xDist = tf.contrib.distributions.MultivariateNormalDiag(loc=mu, scale_diag=tf.ones(shape=(P,), dtype=tf.float32), name='xDist')
xProbs = xDist.prob(X, name='xProbs')
## prepare optimizer
eta = 1e-3 # learning rate
loss = -tf.reduce_mean(tf.log(xProbs), name='loss')
optimizer = tf.train.AdamOptimizer(learning_rate=eta).minimize(loss)
## launch session
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    sess.run(optimizer, feed_dict={X: Xbase})
I want to optimize the parameters of a multivariate Gaussian distribution in TensorFlow, as in the example above. I can successfully run commands like sess.run(loss, feed_dict={X: Xbase}), so I believe I have implemented the distribution correctly. When I try to run the optimization op, I get an odd error message:
InvalidArgumentError: -1 is not between 0 and 3
[[Node: gradients_1/xDist_7/xProbs/Prod_grad/InvertPermutation = InvertPermutation[T=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](gradients_1/xDist_7/xProbs/Prod_grad/concat)]]
Caused by op 'gradients_1/xDist_7/xProbs/Prod_grad/InvertPermutation'
I do not understand this error.
I get the same error message if I use tf.contrib.distributions.MultivariateNormalFullCovariance instead of tf.contrib.distributions.MultivariateNormalDiag. I do not get the error if scale_diag, rather than loc, is the variable being optimized.
I'm still looking into why this is failing, but for a short-term fix, does making the following change work?
xLogProbs = xDist.log_prob(X, name='xLogProbs')
loss = -tf.reduce_mean(xLogProbs, name='loss')
Note: this is actually preferable to tf.log(xProbs) because it is never less numerically precise, and sometimes substantially more precise. (This is true of all tf.Distributions.)
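To illustrate why a direct log-density can be more precise than taking the log of a density, here is a minimal pure-Python sketch for a one-dimensional, unit-variance Gaussian. The helper names gaussian_pdf and gaussian_log_pdf are made up for this illustration; they are not TensorFlow functions, and the actual implementation inside tf.Distributions may differ.

```python
import math

def gaussian_pdf(x, mu=0.0):
    # Density of a unit-variance Gaussian (hypothetical helper)
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def gaussian_log_pdf(x, mu=0.0):
    # Log-density computed directly, without ever forming the density
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

x_far = 40.0                           # a point far from the mean
density = gaussian_pdf(x_far)          # exp(-800) underflows to 0.0
log_density = gaussian_log_pdf(x_far)  # stays finite

print(density)
print(log_density)
# math.log(density) would raise ValueError here, because density is 0.0
```

The density underflows in floating point, so its log is no longer recoverable, while the directly computed log-density is still an ordinary finite number.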
In TensorFlow, the optimizer class has only two relevant functions:
compute_gradients
apply_gradients
where apply_gradients returns an op that performs the update w <- w + Δw via a tf.assign_add function.
However I need direct access to the Δw itself. (or w' = w+Δw). I know that the optimizer adds nodes to the computational graph which compute this Δw for each variable. How can I access them? Or do I have to re-implement the optimizer myself?
The reason is that I need to compute gradients dw'/dw, as I am working on something related to gradient based hyperparameter optimization (cf. https://arxiv.org/abs/1703.01785)
The "delta" applied to each variable is not accessible through any common method or name. In fact, looking a bit into the source, it seems rather difficult to extract, as it varies from one optimizer to another.
What you can do, at least, is to compute the differences between variable values and their updates. For example it could work like this:
import tensorflow as tf
with tf.Graph().as_default():
    # Setup example model
    x = tf.placeholder(tf.float32, [None, 1])
    y = tf.placeholder(tf.float32, [None, 2])
    w = tf.Variable([[1., 2.]], dtype=tf.float32)
    pred = x @ w
    loss = (tf.reduce_sum(tf.squared_difference(pred, y))
            / tf.cast(tf.shape(x)[0], tf.float32))
    # Record variable values before training step
    # (tf.identity should work here but it does not so we use
    # a trivial add operation to enforce the control dependency)
    w_old = w + 0
    # Train after having recorded variable values
    with tf.control_dependencies([w_old]):
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    # Compute deltas after training
    with tf.control_dependencies([train_op]):
        w_delta = w.read_value() - w_old
    init_op = tf.global_variables_initializer()
    # Test
    with tf.Session() as sess:
        sess.run(init_op)
        print(sess.run(w))
        # [[1. 2.]]
        _, w_delta_val = sess.run(
            [train_op, w_delta],
            feed_dict={x: [[1.], [2.]], y: [[3., 4], [5., 6.]]})
        print(w_delta_val)
        # [[0.79999995 0.5999999 ]]
        print(sess.run(w))
        # [[1.8 2.6]]
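As a sanity check on the printed values, the same single step can be reproduced by hand in NumPy. This is only a sketch for plain gradient descent, where the delta is simply -learning_rate times the gradient; other optimizers compute their updates differently, so this shortcut does not carry over to them.

```python
import numpy as np

x = np.array([[1.0], [2.0]])
y = np.array([[3.0, 4.0], [5.0, 6.0]])
w = np.array([[1.0, 2.0]])
learning_rate = 0.1

# loss = sum((x @ w - y) ** 2) / N, so d(loss)/dw = (2 / N) * x.T @ (x @ w - y)
residual = x @ w - y
grad = (2.0 / x.shape[0]) * (x.T @ residual)

w_delta = -learning_rate * grad   # the update applied by gradient descent
w_new = w + w_delta               # the variable value after the step

print(w_delta)
print(w_new)
```

Up to float32 rounding, w_delta and w_new agree with the values printed by the TensorFlow session.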
To get the updated w', you can just print w directly after you have executed optimizer.apply_gradients(). The value of w at that point is w'.
Now, if you want the update Δw that was applied to w, just compute w' - w.
I have a simple linear regression question. My code is below:
import tensorflow as tf
import numpy as np
batch_xs=np.array([[0,0,1],[1,1,1],[1,0,1],[0,1,1]])
batch_ys=np.array([[0],[1],[1],[0]])
x = tf.placeholder(tf.float32, [None, 3])
W = tf.Variable(tf.zeros([3, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.nn.sigmoid(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 1])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
learning_rate = 0.05
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
sess = tf.Session()
tf.global_variables_initializer().run(session=sess)
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Prediction:
x0=np.array([[1.,0.,0.]])
x0=np.float32(x0)
y0=tf.nn.softmax(tf.matmul(x0,W) + b)
print(y0)
However, print(y0) shows Tensor("Softmax_2:0", shape=(1, 1), dtype=float32) instead of a number. I expect y0 to be around 0.99.
I tried y0.eval(), but I got ValueError: Cannot evaluate tensor using 'eval()': No default session is registered.
How can I make a change to obtain the result? Thanks!
There are a couple of ways to get things to print out while writing TensorFlow code. Of course, there’s the classic Python built-in, print (Or the function print(), if we’re being Python 3 about it). And then there’s TensorFlow’s print function, tf.Print (notice the capital P).
When working with TensorFlow, it’s important to remember that everything is ultimately a graph computation. This means that if you print a TensorFlow operation using Python’s print, it will simply show a description of what that operation is, since no values have been passed through it yet. It will also often show the dimensions that are expected to be in that node, if they’re known.
If you want to print the values that are ‘flowing’ through a particular part of the graph as it’s being executed, then you need to use tf.Print.
I'm running into a weird problem with TensorFlow. I've set up a very simple classification problem, four input variables, one binary output variable, one layer of weights and bias, output goes through a sigmoid to 0 or 1.
The problem is, memory consumption is quadratic in the number of records of training data! With only 5,000 records, it's already 900 megabytes; at 10,000, it runs into a few gigabytes. Since I want to end up using at least a few tens of thousands of records, this is a problem.
It is happening specifically in the back propagation step; when I just try to evaluate the cost function, memory consumption is linear in the number of records, as expected.
Code follows. What am I doing wrong?
import numpy as np
import os
import psutil
import tensorflow as tf
process = psutil.Process(os.getpid())
sess = tf.InteractiveSession()
# Parameters
learning_rate = 0.01
random_seed = 1
tf.set_random_seed(random_seed)
# Data
data = np.loadtxt('train.csv', delimiter=',', dtype=np.float32)
train_X = data[:, :-1]
train_Y = data[:, -1]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(np.float32, shape=(rows, cols))
Y = tf.placeholder(np.float32, shape=(rows,))
# Weights
W = tf.Variable(tf.random_normal((cols, 1)))
b = tf.Variable(tf.random_normal(()))
# Model
p = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_sum((p-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Just one optimizer step is enough to demonstrate the problem
optimizer.run({X: train_X, Y: train_Y})
# Memory consumption is quadratic in number of rows
print('{0:,} bytes'.format(process.memory_info().peak_wset))
It turns out to be a shape problem again. Using matmul the way I did there generates output of shape (n, 1). Using that in a context where shape (n,) was expected silently broadcasts to shape (n, n), which is the quadratic blowup.
The solution is squeeze. Specifically, tf.squeeze(tf.matmul(X, W)).
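The shape issue can be reproduced in NumPy, whose broadcasting rules TensorFlow follows for elementwise operations. This is a small sketch with a made-up n = 5, not the original training data:

```python
import numpy as np

n = 5
p = np.zeros((n, 1))   # shape (n, 1), like the output of tf.matmul(X, W)
Y = np.zeros(n)        # shape (n,), like the label placeholder

# (n, 1) minus (n,) broadcasts to (n, n): one entry per pair of rows
diff = p - Y

# Dropping the trailing axis first keeps the result at shape (n,)
diff_squeezed = np.squeeze(p, axis=1) - Y

print(diff.shape)
print(diff_squeezed.shape)
```

With n rows, the unsqueezed difference holds n * n values, which matches the quadratic memory growth described in the question.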
It makes sense that memory consumption blows up like that since the backprop requires the extra memory to keep track of the gradients of each operation (though I can't figure out how it ends up being quadratic).
Solution: Mini-batches
This is usually the go-to method when it comes to training models. Split up your training data into little mini-batches, each containing a fixed number of samples (this is rarely more than 200 samples), and feed them to the optimizer one mini-batch at a time. So if your batch_size=64 then the train_X and train_Y fed to the optimizer will be of the shapes (64, 4) and (64,) respectively.
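I would try something like this (the X and Y placeholders must then be declared with a batch dimension of None instead of rows):
batch_size = 64
# Loop over the full batches; a trailing partial batch is skipped here
for i in range(rows // batch_size):
    batch_X = train_X[i*batch_size : (i + 1)*batch_size]
    batch_Y = train_Y[i*batch_size : (i + 1)*batch_size]
    optimizer.run({X: batch_X, Y: batch_Y})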
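If the last, shorter batch should not be dropped, the slicing can be wrapped in a small generator. The sketch below uses plain NumPy and a hypothetical helper name, iterate_minibatches, with toy data in place of train.csv:

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size):
    # Yield consecutive slices; the final batch may be smaller than batch_size
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        yield features[start:end], labels[start:end]

# Toy data standing in for the real training set: 10 rows, 4 columns
train_X = np.arange(40, dtype=np.float32).reshape(10, 4)
train_Y = np.arange(10, dtype=np.float32)

batch_shapes = [(bx.shape, by.shape)
                for bx, by in iterate_minibatches(train_X, train_Y, 4)]
print(batch_shapes)
```

Each yielded pair could then be fed to the training op, provided the placeholders accept a variable batch size.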
I am trying to train a linear regression model in Tensorflow using some generated data. The model seems to learn the slope of the line, but is unable to learn the bias.
I have tried changing the number of epochs, the weight (slope) and the bias, but every time, the bias learnt by the model comes out to be zero. I don't know where I am going wrong and some help would be appreciated.
Here is the code.
import numpy as np
import tensorflow as tf
# assume the linear model to be Y = W*X + b
X = tf.placeholder(tf.float32, [None, 1])
Y = tf.placeholder(tf.float32, [None,1])
# the weight and biases
W = tf.Variable(tf.zeros([1,1]))
b = tf.Variable(tf.zeros([1]))
# the model
prediction = tf.matmul(X,W) + b
# the cost function
cost = tf.reduce_mean(tf.square(Y - prediction))
# Use gradient descent
learning_rate = 0.000001
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
steps = 1000
epochs = 10
Verbose = False
# In the end, the model should learn these values
test_w = 3
bias = 10
for _ in range(epochs):
    for i in range(steps):
        # make fake data for the model
        # feed one example at a time
        # stochastic gradient descent, because we only use one example at a time
        x_temp = np.array([[i]])
        y_temp = np.array([[test_w*i + bias]])
        # train the model using the data
        feed_dict = {X: x_temp, Y: y_temp}
        sess.run(train_step, feed_dict=feed_dict)
        if Verbose and i%100 == 0:
            print("Iteration No: %d" % i)
            print("W = %f" % sess.run(W))
            print("b = %f" % sess.run(b))
print("Finally:")
print("W = %f" % sess.run(W))
print("b = %f" % sess.run(b))
# These values should be close to the values we used to generate data
https://github.com/HarshdeepGupta/tensorflow_notebooks/blob/master/Linear%20Regression.ipynb
Outputs are in the last line of code.
The model needs to learn test_w and bias (in the notebook link, it is in the 3rd cell, after the first comment), which are set to 3 and 10 respectively.
The model correctly learns the weight(slope), but is unable to learn the bias. Where is the error?
The main problem is that you are feeding just one sample at a time to the model. This makes your optimizer very unstable, which is why you have to use such a small learning rate. I suggest feeding more samples in each step.
If you insist on feeding one sample at a time, maybe you should consider using an optimizer with momentum, like tf.train.AdamOptimizer(learning_rate). This way you can increase the learning rate and reach convergence.
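One way to see what the tiny learning rate does to the bias is to replay the question's loop in plain Python. This is a sketch of the same per-sample squared-error updates written by hand, not TensorFlow's optimizer; the function name run_sgd is made up for the illustration.

```python
def run_sgd(learning_rate=1e-6, epochs=10, steps=1000, true_w=3.0, true_b=10.0):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in range(steps):
            x = float(i)
            y = true_w * x + true_b
            error = (w * x + b) - y
            # Gradients of error ** 2 for this single sample
            grad_w = 2.0 * error * x
            grad_b = 2.0 * error
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

w, b = run_sgd()
print(w, b)
```

The gradient for w is x times the gradient for b, and x runs into the hundreds, so w absorbs almost all of the correction. In this sketch w should end close to 3 while b stays far from 10, which matches the behaviour described in the question.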
I am trying to implement a very basic neural network in TensorFlow but I am having some problems. It is a very basic network that takes two values as input (hours of sleep and hours of study) and predicts the score on a test (I found this example on YouTube). So basically I have only one hidden layer with three units, each of which computes an activation function (sigmoid); the cost function is the sum of squared errors, and I am using gradient descent to minimize it. The problem is that when I train the net with the training data and try to make some predictions using the same training data, the results do not quite match, and they also appear strange because they look nearly equal to each other.
import tensorflow as tf
import numpy as np
import input_data
sess = tf.InteractiveSession()
# create a 2-D version of input for plotting
trX = np.matrix(([3,5], [5,1],[10,2]), dtype=float)
trY = np.matrix(([85], [82], [93]), dtype=float) # 3X1 matrix
trX = trX / np.max(trX, axis=0)
trY = trY / 100 # 100 is the maximum score allowed
teX = np.matrix(([3,5]), dtype=float)
teY = np.matrix(([85]), dtype=float)
teX = teX/np.amax(teX, axis=0)
teY = teY/100
def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))
def model(X, w_h, w_o):
    z2 = tf.matmul(X, w_h)
    a2 = tf.nn.sigmoid(z2) # this is a basic mlp, think 2 stacked logistic regressions
    z3 = tf.matmul(a2, w_o)
    yHat = tf.nn.sigmoid(z3)
    return yHat # note that we dont take the softmax at the end because our cost fn does that for us
X = tf.placeholder("float", [None, 2])
Y = tf.placeholder("float", [None, 1])
W1 = init_weights([2, 3]) # create symbolic variables
W2 = init_weights([3, 1])
sess.run(tf.initialize_all_variables())
py_x = model(X, W1, W2)
cost = tf.reduce_mean(tf.square(py_x - Y))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cost) # construct an optimizer
predict_op = py_x
sess.run(train_op, feed_dict={X: trX, Y: trY})
print(sess.run(predict_op, feed_dict={X: trX}))
sess.close()
It yields:
[[ 0.51873487]
[ 0.51874501]
[ 0.51873082]]
and I believe it should be similar to the training data results.
I am quite new to neural nets and machine learning so pardon me for any mistakes, thanks in advance.
The main reason that your network isn't training is that the statement:
sess.run(train_op, feed_dict={X: trX, Y: trY})
…only executes once. In TensorFlow, running train_op (or whatever operation is returned from Optimizer.minimize()) will only cause the network to take a single gradient descent step. You should execute it in a loop to perform iterative training, and the weights will eventually converge.
Two other tips: (i) you might achieve faster convergence if you feed a subset of your training data in each step, rather than the entire dataset; and (ii) the learning rate of 0.5 is probably too high (although this depends on the data).
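The effect of looping can be sketched without TensorFlow. The following is a hand-written NumPy version of the same 2-3-1 sigmoid network, using the question's data, a mean squared error cost and plain gradient descent; the train helper and the choice of 2000 steps are assumptions made for this illustration.

```python
import numpy as np

trX = np.array([[3.0, 5.0], [5.0, 1.0], [10.0, 2.0]])
trY = np.array([[85.0], [82.0], [93.0]])
trX = trX / trX.max(axis=0)
trY = trY / 100.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(steps, learning_rate=0.5):
    # Same small random initialisation on every call, for a fair comparison
    rng = np.random.RandomState(0)
    W1 = rng.normal(scale=0.01, size=(2, 3))
    W2 = rng.normal(scale=0.01, size=(3, 1))
    for _ in range(steps):
        a2 = sigmoid(trX @ W1)
        y_hat = sigmoid(a2 @ W2)
        # Backpropagate the mean squared error through both sigmoid layers
        d_z3 = 2.0 * (y_hat - trY) / trY.size * y_hat * (1.0 - y_hat)
        d_z2 = (d_z3 @ W2.T) * a2 * (1.0 - a2)
        grad_W2 = a2.T @ d_z3
        grad_W1 = trX.T @ d_z2
        W2 -= learning_rate * grad_W2
        W1 -= learning_rate * grad_W1
    y_hat = sigmoid(sigmoid(trX @ W1) @ W2)
    return float(np.mean((y_hat - trY) ** 2))

loss_one_step = train(steps=1)
loss_many_steps = train(steps=2000)
print(loss_one_step, loss_many_steps)
```

After a single step the predictions are still close to 0.5, as in the question's output; repeating the step many times lowers the training error.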