Keras GAN (generator) not training well despite accurate discriminator - tensorflow

I've tried sorting this out for a few days now, following many pieces of advice found on forums etc, and now would welcome any suggestions to what is wrong!
I'm attempting to get my first GAN training - a simple feedforward deep net - very similar to using MNIST dataset, but with spectrum power windows derived from the VCTK-Corpus (size(1, 513)).
You can see from the Tensorboard graphs below that the networks seem to be interacting, and there is some sort of training going on:
Tensorboard graph overview. Tensorboard graph zoom.
However, results are poor and noisy: generated and validation comparison
The generator takes normal noise (usually 30 to 100 vectors) with a mean of 0 and stdev of 0.5.
def gan_generator(x_shape, frame_size):
g_input = Input(shape=x_shape)
H = BatchNormalization()(g_input)
H = Dense(128)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Dense(128)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Dense(256)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Dense(256)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Dense(256)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
out = Dense(frame_size[1], activation='linear')(H)
generator = Model(g_input, out)
generator.summary()
return generator
The discriminator determines a one-hot categorisation of generated frames:
(not sure about batch normalisation here - I've read it shouldn't be used if you're mixing real and generated into one batch. However, the generator makes much more convincing results with it than without - despite having a higher loss.)
def gan_discriminator(input_shape):
d_input = Input(shape=input_shape)
H = Dropout(0.1)(d_input)
H = Dense(256)(H)
H = Dropout(0.1)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Dense(128)(H)
H = Dropout(0.1)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Dense(100)(H)
H = Dropout(0.1)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Dense(100)(H)
H = Dropout(0.1)(H)
H = LeakyReLU()(H)
H = BatchNormalization()(H)
H = Reshape((1, -1))(H)
d_V = Dense(2, activation='softmax')(H)
discriminator = Model(d_input,d_V)
discriminator.summary()
return discriminator
The GAN is easy:
def init_gan(generator, discriminator):
x = Input(shape=generator.inputs[0].shape[1:])
#Generator makes a prediction
pred = generator(x)
#Discriminator attempts to categorise prediction
y = discriminator(pred)
GAN = Model(x, y)
return GAN
Some training variables:
GAN (generator): Adam, lr=1e-4, categorical_crossentropy
discriminator: Adam, lr=1e-3, categorical_crossentropy
batch size: around 8000 samples
Mini-batch (weight update cycle): 32
The training loop:
#Pre-training Discriminator Network
#Load new batch of real frames
frames = load_data(data_dir)
frames_label = np.zeros((frames.shape[0], 1, 2))
frames_label[:, :, 0] = 1 #mark as real frames
#Generate Frames from noise vector
X_noise = noisegen((frames.shape[0], 1, n_noise))
generated_frames = generator.predict(X_noise)
generated_label = np.zeros((generated_frames.shape[0], 1, 2))
generated_label[:, :, 1] = 1 #mark as false frames
#Prep Data - concat real and false data
dis_batch_x = np.concatenate((frames, generated_frames), axis=0)
dis_batch_y = np.concatenate((frames_label, generated_label), axis=0)
#Make discriminator trainable and train for 8 epochs
make_trainable(discriminator, True)
discriminator.compile(optimizer=dis_optimizer, loss=dis_loss)
fit_model(discriminator, dis_batch_x, dis_batch_y, 8)
#Training Loop
for d in range(data_sets):
print "Starting New Dataset: {0}/{1}".format(d+1, data_sets)
""" Fit Discriminator """
#Load new batch of real frames
frames = load_data(data_dir)
frames_label = np.zeros((frames.shape[0], 1, 2))
frames_label[:, :, 0] = 1 #mark as real frames
#Generate Frames from noise vector
X_noise = noisegen((frames.shape[0], 1, n_noise))
generated_frames = generator.predict(X_noise)
generated_label = np.zeros((generated_frames.shape[0], 1, 2))
generated_label[:, :, 1] = 1 #mark as false frames
#Prep Data - concat real and false data
dis_batch_x = np.concatenate((frames, generated_frames), axis=0)
dis_batch_y = np.concatenate((frames_label, generated_label), axis=0)
#Make discriminator trainable & fit
make_trainable(discriminator, True)
discriminator.compile(optimizer=dis_optimizer, loss=dis_loss)
fit_model(discriminator, dis_batch_x, dis_batch_y)
""" Fit Generator """
#Prep Data
X_noise = noisegen((frames.shape[0], 1, n_noise))
generated_label = np.zeros((generated_frames.shape[0], 1, 2))
generated_label[:, :, 1] = 1 #mark as false frames
make_trainable(discriminator, False)
GAN.layers[2].trainable = False #done twice just to be sure
GAN.compile(optimizer=GAN_optimizer, loss=GAN_loss)
fit_model(GAN, X_noise, generated_label)
And finally a little bit of system info:
OSX 10.12
Tensorflow 1.5.0 (GPU)
Keras 2.1.3
Python 2.7
Many thanks in advance!

What the solution actually was that I didn't swap my True/False class in Generator training (suggested https://github.com/soumith/ganhacks), which I think effectively makes it gradient ascent.
Clarification on this would be nice to have.

Related

I want train a set of weight using pytorch, but the weights do not even change

I want to reproduce a method from a paper, the code in this paper was written in tensorflow1.0 and I want to rewrite it in pytorch. A brief description, I want to get a set of G that can be used to reweight input data but in training, the G doesn't even change, this is the tensorflow code:
n,p = X_input.shape
n_e, p_e = X_encoder_input.shape
display_step = 100
X = tf.placeholder("float", [None, p])
X_encoder = tf.placeholder("float", [None, p_e])
G = tf.Variable(tf.ones([n,1]))
loss_balancing = tf.constant(0, tf.float32)
for j in range(1,p+1):
X_j = tf.slice(X_encoder, [j*n,0],[n,p_e])
I = tf.slice(X, [0,j-1],[n,1])
balancing_j = tf.divide(tf.matmul(tf.transpose(X_j),G*G*I),tf.maximum(tf.reduce_sum(G*G*I),tf.constant(0.1))) - tf.divide(tf.matmul(tf.transpose(X_j),G*G*(1-I)),tf.maximum(tf.reduce_sum(G*G*(1-I)),tf.constant(0.1)))
loss_balancing += tf.norm(balancing_j,ord=2)
loss_regulizer = (tf.reduce_sum(G*G)-n)**2 + 10*(tf.reduce_sum(G*G-1))**2#
loss = loss_balancing + 0.0001*loss_regulizer
optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
saver = tf.train.Saver()
sess = tf.Session()
sess.run(tf.global_variables_initializer())
and this is my rewriting pytorch code:
n, p = x_test.shape
loss_balancing = torch.tensor(0.0)
G = nn.Parameter(torch.ones([n,1]))
optimizer = torch.optim.RMSprop([G] , lr=0.001)
for i in range(num_steps):
for j in range(1, p+1):
x_j = x_all_encoder[j * n : j*n + n , :]
I = x_test[0:n , j-1:j]
balancing_j = torch.divide(torch.matmul(torch.transpose(x_j,0,1) , G * G * I) ,
torch.maximum( (G * G * I).sum() ,
torch.tensor(0.1) -
torch.divide(torch.matmul(torch.transpose(x_j,0,1) ,G * G * (1-I)),
torch.maximum( (G*G*(1-I)).sum() , torch.tensor(0.1) )
)
)
)
loss_balancing += nn.Parameter(torch.norm(balancing_j))
loss_regulizer = nn.Parameter(((G * G) - n).sum() ** 2 + 10 * ((G * G - 1).sum()) ** 2)
loss = nn.Parameter( loss_balancing + 0.0001 * loss_regulizer )
optimizer.zero_grad()
loss.backward()
optimizer.step()
if i % 100 == 0:
print('Loss:{:.4f}'.format(loss.item()))
and the G.grad = None, I want to know how to get the G a set of value by iteration to minimize the Loss , Thanks.
Firstly, please provide a minimal reproducible example. It will be very helpful for people to answer your question.
Since G.grad has no value, it indicates that loss.backward() didn't properly work.
The computation of gradient can be disturbed by many factors, but in this case, I suspect the maximum operation in your code prevents the backward flow since the maximum operation is not differentiable in general.
To check if this hypothesis is correct, you could check the gradient of a tensor created after the maximum operation which I can't do because provided code is not executable in my case.

If I don't want to train in batches and my state is a vector, what should my tensors have for a shape?

I'm trying to use tensorflow to solve a reinforced learning problem. I created an gym environment of my own. The state is a one dimensional array (size 224) and there are 170 actions to choose from (0...169). I do not want to train in batches. What I want is to make the most simple version of the RL problem running with tensorflow.
My main problem is, i guess the dimensions. I would assume that TF would allow me to input the state as 1D tensor. But then I get an error when I want to calculate W*input=action. Dimensions error make it hard to know whats right. Also, examples on the web focus on training from images, in batches.
In general, I started in this tutorial, but the state is encoded differently, which again makes it hard to follow (especially since I'm not really familiar with python).
import gym
import numpy as np
import random
import tensorflow as tf
env = gym.make('MyOwnEnv-v0')
n_state = 224
n_action = 170
sess = tf.InteractiveSession()
# Implementing the network itself
inputs1 = tf.placeholder(shape=[1,n_state],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([n_state,n_action],0,0.01))
Qout = tf.transpose(tf.matmul(inputs1,W))
predict = tf.reshape(tf.argmax(Qout,1), [n_action,1])
#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[n_action,1],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)
# Training the network
init = tf.global_variables_initializer()
print("input: ", inputs1.get_shape()
, "\nW: ", W.get_shape()
, "\nQout: ", Qout.get_shape()
, "\npredict:", predict.get_shape()
, "\nnextQ: ", nextQ.get_shape()
, "\nloss: ", loss.get_shape())
# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
sess.run(init)
for i in range(num_episodes):
#Reset environment and get first new observation
s = env.reset()
rAll = 0
d = False
j = 0
#The Q-Network
while j < 99:
j+=1
#Choose an action by greedily (with e chance of random action) from the Q-network
a,allQ = sess.run([predict,Qout],feed_dict={inputs1:s})
if np.random.rand(1) < e:
a = env.action_space.sample()
#Get new state and reward from environment
s1,r,d,_ = env.step(a)
#Obtain the Q' values by feeding the new state through our network
Q1 = sess.run(Qout,feed_dict={inputs1:s1})
#Obtain maxQ' and set our target value for chosen action.
maxQ1 = np.max(Q1)
targetQ = allQ
#targetQ[0,a[0]] = r + y*maxQ1
targetQ[a,0] = r + y*maxQ1
#Train our network using target and predicted Q values
_,W1 = sess.run([updateModel,W],feed_dict={inputs1:s,nextQ:targetQ})
rAll += r
s = s1
if d == True:
#Reduce chance of random action as we train the model.
e = 1./((i/50) + 10)
break
jList.append(j)
rList.append(rAll)
print('Percent of succesful episodes: ' + str(sum(rList)/num_episodes) + '%')

(De-)Convutional lstm autoencoder - error jumps

I'm trying to build a convolutional lstm autoencoder (that also predicts future and past) with Tensorflow, and it works to a certain degree, but the error sometimes jumps back up, so essentially, it never converges.
The model is as follows:
The encoder starts with a 64x64 frame from a 20 frame bouncing mnist video for each time step of the lstm. Every stacking layer of LSTM halfs it and increases the depth via 2x2 convolutions with a stride of 2. (so -->32x32x3 -->...--> 1x1x96)
On the other hand, the lstm performs 3x3 convolutions with a stride of 1 on its state. Both results are concatenated to form the new state. In the same way, the decoder uses transposed convolutions to go back to the original format. Then the squared error is calculated.
The error starts at around 2700 and it takes around 20 hours (geforce1060) to get down to ~1700. At which point the jumping back up (and it sometimes jumps back up to 2300 or even ridiculous values like 440300) happens often enough that I can't really get any lower. Also at that point, it can usually pinpoint where the number should be, but its too fuzzy to actually make out the digit...
I tried different learning rates and optimizers, so if anybody knows why that jumping happens, that'd make me happy :)
Here is a graph of the loss with epochs:
import tensorflow as tf
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
#based on code by loliverhennigh (Github)
class ConvCell(tf.contrib.rnn.RNNCell):
count = 0 #exists only to remove issues with variable scope
def __init__(self, shape, num_features, transpose = False):
self.shape = shape
self.num_features = num_features
self._state_is_tuple = True
self._transpose = transpose
ConvCell.count+=1
self.count = ConvCell.count
#property
def state_size(self):
return (tf.contrib.rnn.LSTMStateTuple(self.shape[0:4],self.shape[0:4]))
#property
def output_size(self):
return tf.TensorShape(self.shape[1:4])
#here comes to the actual conv lstm implementation, if transpose = true, it performs a deconvolution on the input
def __call__(self, inputs, state, scope=None):
with tf.variable_scope(scope or type(self).__name__+str(self.count)):
c, h = state
state_shape = h.shape
input_shape = inputs.shape
#filter variables and convolutions on data coming from the same cell, a time step previous
h_filters = tf.get_variable("h_filters",[3,3,state_shape[3],self.num_features])
h_filters_gates = tf.get_variable("h_filters_gates",[3,3,state_shape[3],3])
h_partial = tf.nn.conv2d(h,h_filters,[1,1,1,1],'SAME')
h_partial_gates = tf.nn.conv2d(h,h_filters_gates,[1,1,1,1],'SAME')
c_filters = tf.get_variable("c_filters",[3,3,state_shape[3],3])
c_partial = tf.nn.conv2d(c,c_filters,[1,1,1,1],'SAME')
#filters and convolutions/deconvolutions on data coming fromthe cell input
if self._transpose:
x_filters = tf.get_variable("x_filters",[2,2,self.num_features,input_shape[3]])
x_filters_gates = tf.get_variable("x_filters_gates",[2,2,3,input_shape[3]])
x_partial = tf.nn.conv2d_transpose(inputs,x_filters,[int(state_shape[0]),int(state_shape[1]),int(state_shape[2]),self.num_features],[1,2,2,1],'VALID')
x_partial_gates = tf.nn.conv2d_transpose(inputs,x_filters_gates,[int(state_shape[0]),int(state_shape[1]),int(state_shape[2]),3],[1,2,2,1],'VALID')
else:
x_filters = tf.get_variable("x_filters",[2,2,input_shape[3],self.num_features])
x_filters_gates = tf.get_variable("x_filters_gates",[2,2,input_shape[3],3])
x_partial = tf.nn.conv2d(inputs,x_filters,[1,2,2,1],'VALID')
x_partial_gates = tf.nn.conv2d(inputs,x_filters_gates,[1,2,2,1],'VALID')
#some more lstm gate business
gate_bias = tf.get_variable("gate_bias",[1,1,1,3])
h_bias = tf.get_variable("h_bias",[1,1,1,self.num_features*2])
gates = h_partial_gates + x_partial_gates + c_partial + gate_bias
i,f,o = tf.split(gates,3,axis=3)
#concatenate the units coming from the spacial and the temporal dimension to build a unified state
concat = tf.concat([h_partial,x_partial],3) + h_bias
new_c = tf.nn.relu(concat)*tf.sigmoid(i)+c*tf.sigmoid(f)
new_h = new_c * tf.sigmoid(o)
new_state = tf.contrib.rnn.LSTMStateTuple(new_c,new_h)
return new_h, new_state #its redundant, but thats how tensorflow likes it, apparently
#global variables
LEARNING_RATE = 0.005
ITERATIONS_PER_EPOCH = 80
BATCH_SIZE = 75
TEST = False #manual switch to go from training to testing
if TEST:
BATCH_SIZE = 1
inputs = tf.placeholder(tf.float32, (20, BATCH_SIZE, 64, 64,1))
shape0 = [BATCH_SIZE,64,64,2]
shape1 = [BATCH_SIZE,32,32,6]
shape2 = [BATCH_SIZE,16,16,12]
shape3 = [BATCH_SIZE,8,8,24]
shape4 = [BATCH_SIZE,4,4,48]
shape5 = [BATCH_SIZE,2,2,96]
shape6 = [BATCH_SIZE,1,1,192]
#apparently tf.multirnncell has very specific requirements for the initial states oO
initial_state1 = (tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape1),tf.zeros(shape1)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape2),tf.zeros(shape2)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape3),tf.zeros(shape3)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape4),tf.zeros(shape4)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape5),tf.zeros(shape5)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape6),tf.zeros(shape6)))
initial_state2 = (tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape5),tf.zeros(shape5)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape4),tf.zeros(shape4)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape3),tf.zeros(shape3)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape2),tf.zeros(shape2)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape1),tf.zeros(shape1)),tf.contrib.rnn.LSTMStateTuple(tf.zeros(shape0),tf.zeros(shape0)))
#encoding part of the autoencoder graph
cell1 = ConvCell(shape1,3)
cell2 = ConvCell(shape2,6)
cell3 = ConvCell(shape3,12)
cell4 = ConvCell(shape4,24)
cell5 = ConvCell(shape5,48)
cell6 = ConvCell(shape6,96)
mcell = tf.contrib.rnn.MultiRNNCell([cell1,cell2,cell3,cell4,cell5,cell6])
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(mcell, inputs[0:20,:,:,:],initial_state=initial_state1,dtype=tf.float32, time_major=True)
#decoding part of the autoencoder graph, forward block and backwards block
cell9a = ConvCell(shape5,48,transpose = True)
cell10a = ConvCell(shape4,24,transpose = True)
cell11a = ConvCell(shape3,12,transpose = True)
cell12a = ConvCell(shape2,6,transpose = True)
cell13a = ConvCell(shape1,3,transpose = True)
cell14a = ConvCell(shape0,1,transpose = True)
mcella = tf.contrib.rnn.MultiRNNCell([cell9a,cell10a,cell11a,cell12a,cell13a,cell14a])
cell9b = ConvCell(shape5,48,transpose = True)
cell10b = ConvCell(shape4,24,transpose = True)
cell11b= ConvCell(shape3,12,transpose = True)
cell12b = ConvCell(shape2,6,transpose = True)
cell13b = ConvCell(shape1,3,transpose = True)
cell14b = ConvCell(shape0,1,transpose = True)
mcellb = tf.contrib.rnn.MultiRNNCell([cell9b,cell10b,cell11b,cell12b,cell13b,cell14b])
def PredictionLayer(rnn_outputs,viewPoint = 11, reverse = False):
predLength = viewPoint-2 if reverse else 20-viewPoint #vision is the input for the decoder
vision = tf.concat([rnn_outputs[viewPoint-1:viewPoint,:,:,:],tf.zeros([predLength,BATCH_SIZE,1,1,192])],0)
if reverse:
rnn_outputs2, rnn_states = tf.nn.dynamic_rnn(mcellb, vision, initial_state = initial_state2, time_major=True)
else:
rnn_outputs2, rnn_states = tf.nn.dynamic_rnn(mcella, vision, initial_state = initial_state2, time_major=True)
mean = tf.reduce_mean(rnn_outputs2,4)
if TEST:
return mean
if reverse:
return tf.reduce_sum(tf.square(mean-inputs[viewPoint-2::-1,:,:,:,0]))
else:
return tf.reduce_sum(tf.square(mean-inputs[viewPoint-1:20,:,:,:,0]))
if TEST:
mean = tf.concat([PredictionLayer(rnn_outputs,11,True)[::-1,:,:,:],createPredictionLayer(rnn_outputs,11)],0)
else: #training part of the graph
error = tf.zeros([1])
for i in range(8,15): #range size of 7 or less works, 9 or more does not, no idea why
error += PredictionLayer(rnn_outputs, i)
error += PredictionLayer(rnn_outputs, i, True)
train_fn = tf.train.RMSPropOptimizer(learning_rate=LEARNING_RATE).minimize(error)
################################################################################
## TRAINING LOOP ##
################################################################################
#code based on siemanko/tf_lstm.py (Github)
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
saver = tf.train.Saver(restore_sequentially=True, allow_empty=True,)
session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
session.run(tf.global_variables_initializer())
vids = np.load("mnist_test_seq.npy") #20/10000/64/64 , moving mnist dataset from http://www.cs.toronto.edu/~nitish/unsupervised_video/
vids = vids[:,0:6000,:,:] #training set
saver.restore(session,tf.train.latest_checkpoint('./conv_lstm_multiples_v2/'))
#saver.restore(session,'.\conv_lstm_multiples\iteration-74')
for epoch in range(1000):
if TEST:
break
epoch_error = 0
#randomize batches each epoch
vids = np.swapaxes(vids,0,1)
np.random.shuffle(vids)
vids = np.swapaxes(vids,0,1)
for i in range(ITERATIONS_PER_EPOCH):
#running the graph and feeding data
err,_ = session.run([error, train_fn], {inputs: np.expand_dims(vids[:,i*BATCH_SIZE:(i+1)*BATCH_SIZE,:,:],axis=4)})
print(err)
epoch_error += err
#training error each epoch and regular saving
epoch_error /= (ITERATIONS_PER_EPOCH*BATCH_SIZE*4096*20*7)
if (epoch+1) % 5 == 0:
saver.save(session,'.\conv_lstm_multiples_v2\iteration',global_step=epoch)
print("saved")
print("Epoch %d, train error: %f" % (epoch, epoch_error))
#testing
plt.ion()
f, axarr = plt.subplots(2)
vids = np.load("mnist_test_seq.npy")
for i in range(6000,10000):
img = session.run([mean], {inputs: np.expand_dims(vids[:,i:i+1,:,:],axis=4)})
for j in range(20):
axarr[0].imshow(img[0][j,0,:,:])
axarr[1].imshow(vids[j,i,:,:])
plt.show()
plt.pause(0.1)
Usually this happens when gradients' magnitude is very high at some point and causes your network parameters to change a lot. To verify that it is indeed the case, you can produce the same plot of gradient magnitudes and see if they jump right before the loss jump. Assuming this is the case, the classic approach is to use gradient clipping (or go all the way to natural gradient).

tensorflow giving nans when calculating gradient with sparse tensors

The following snippet is from a fairly large piece of code but hopefully I can give all the information necessary:
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
y1 and y2 are 128x30 and ymask is 30x30. ystar is 128x30. dist is 1x30. When ymask is the identity matrix, everything works fine. But when I set it to be all zeros, apart from a single 1 along the diagonal (so as to set all columns but one in y2 to be zero), I get nans for the gradient of dist with respect to y2, using tf.gradients(dist, [y2]). The specific value of dist is [0,0,7.9,0,...], with all the ystar-y2 values being around the range (-1,1) in the third column and zero elsewhere.
I'm pretty confused as to why a numerical issue would occur here, given there are no logs or divisions, is this underflow? Am I missing something in the maths?
For context, I'm doing this to try to train individual dimensions of y, one at a time, using the whole network.
longer version to reproduce:
import tensorflow as tf
import numpy as np
import pandas as pd
batchSize = 128
eta = 0.8
tasks = 30
imageSize = 32**2
groups = 3
tasksPerGroup = 10
trainDatapoints = 10000
w = np.zeros([imageSize, groups * tasksPerGroup])
toyIndex = 0
for toyLoop in range(groups):
m = np.ones([imageSize]) * np.random.randn(imageSize)
for taskLoop in range(tasksPerGroup):
w[:, toyIndex] = m * 0.1 * np.random.randn(1)
toyIndex += 1
xRand = np.random.normal(0, 0.5, (trainDatapoints, imageSize))
taskLabels = np.matmul(xRand, w) + np.random.normal(0,0.5,(trainDatapoints, groups * tasksPerGroup))
DF = np.concatenate((xRand, taskLabels), axis=1)
trainDF = pd.DataFrame(DF[:trainDatapoints, ])
# define graph variables
x = tf.placeholder(tf.float32, [None, imageSize])
W = tf.Variable(tf.zeros([imageSize, tasks]))
b = tf.Variable(tf.zeros([tasks]))
ystar = tf.placeholder(tf.float32, [None, tasks])
ymask = tf.placeholder(tf.float32, [tasks, tasks])
dataLength = tf.cast(tf.shape(ystar)[0],dtype=tf.float32)
y1 = tf.matmul(x, W) + b
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
mse = tf.reciprocal(dataLength) * tf.reduce_mean(tf.square(dist))
grads = tf.gradients(dist, [y2])
trainStep = tf.train.GradientDescentOptimizer(eta).minimize(mse)
# build graph
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
randTask = np.random.randint(0, 9)
ymaskIn = np.zeros([tasks, tasks])
ymaskIn[randTask, randTask] = 1
batch = trainDF.sample(batchSize)
batch_xs = batch.iloc[:, :imageSize]
batch_ys = np.zeros([batchSize, tasks])
batch_ys[:, randTask] = batch.iloc[:, imageSize + randTask]
gradOut = sess.run(grads, feed_dict={x: batch_xs, ystar: batch_ys, ymask: ymaskIn})
sess.run(trainStep, feed_dict={x: batch_xs, ystar: batch_ys, ymask:ymaskIn})
Here's a very simple reproduction:
import tensorflow as tf
with tf.Graph().as_default():
y = tf.zeros(shape=[1], dtype=tf.float32)
dist = tf.norm(y,axis=0)
(grad,) = tf.gradients(dist, [y])
with tf.Session():
print(grad.eval())
Prints:
[ nan]
The issue is that tf.norm computes sum(x**2)**0.5. The gradient is x / sum(x**2) ** 0.5 (see e.g. https://math.stackexchange.com/a/84333), so when sum(x**2) is zero we're dividing by zero.
There's not much to be done in terms of a special case: the gradient as x approaches all zeros depends on which direction it's approaching from. For example if x is a single-element vector, the limit as x approaches 0 could either be 1 or -1 depending on which side of zero it's approaching from.
So in terms of solutions, you could just add a small epsilon:
import tensorflow as tf
def safe_norm(x, epsilon=1e-12, axis=None):
return tf.sqrt(tf.reduce_sum(x ** 2, axis=axis) + epsilon)
with tf.Graph().as_default():
y = tf.constant([0.])
dist = safe_norm(y,axis=0)
(grad,) = tf.gradients(dist, [y])
with tf.Session():
print(grad.eval())
Prints:
[ 0.]
Note that this is not actually the Euclidean norm. It's a good approximation as long as the input is much larger than epsilon.

How to use `sparse_softmax_cross_entropy_with_logits`: without getting Incompatible Shapes Error

I would like to use the sparse_softmax_cross_entropy_with_logits
with the julia TensorFlow wrapper.
The operations is defined in the code here.
Basically, as I understand it the first argument should be logits, that would normally be fed to softmax to get them to be category probabilities (~1hot output).
And the second should be the correct labels as label ids.
I have adjusted the example code from the TensorFlow.jl readme
See below:
using Distributions
using TensorFlow
# Generate some synthetic data
x = randn(100, 50)
w = randn(50, 10)
y_prob = exp(x*w)
y_prob ./= sum(y_prob,2)
function draw(probs)
y = zeros(size(probs))
for i in 1:size(probs, 1)
idx = rand(Categorical(probs[i, :]))
y[i, idx] = 1
end
return y
end
y = draw(y_prob)
# Build the model
sess = Session(Graph())
X = placeholder(Float64)
Y_obs = placeholder(Float64)
Y_obs_lbl = indmax(Y_obs, 2)
variable_scope("logisitic_model", initializer=Normal(0, .001)) do
global W = get_variable("weights", [50, 10], Float64)
global B = get_variable("bias", [10], Float64)
end
L = X*W + B
Y=nn.softmax(L)
#costs = log(Y).*Y_obs #Dense (Orginal) way
costs = nn.sparse_softmax_cross_entropy_with_logits(L, Y_obs_lbl+1) #sparse way
Loss = -reduce_sum(costs)
optimizer = train.AdamOptimizer()
minimize_op = train.minimize(optimizer, Loss)
saver = train.Saver()
# Run training
run(sess, initialize_all_variables())
cur_loss, _ = run(sess, [Loss, minimize_op], Dict(X=>x, Y_obs=>y))
When I run it however, I get an error:
Tensorflow error: Status: Incompatible shapes: [1,100] vs. [100,10]
[[Node: gradients/SparseSoftmaxCrossEntropyWithLogits_10_grad/mul = Mul[T=DT_DOUBLE, _class=[], _device="/job:localhost/replica:0/task:0/cpu:0"](gradients/SparseSoftmaxCrossEntropyWithLogits_10_grad/ExpandDims, SparseSoftmaxCrossEntropyWithLogits_10:1)]]
in check_status(::TensorFlow.Status) at /home/ubuntu/.julia/v0.5/TensorFlow/src/core.jl:101
in run(::TensorFlow.Session, ::Array{TensorFlow.Port,1}, ::Array{Any,1}, ::Array{TensorFlow.Port,1}, ::Array{Ptr{Void},1}) at /home/ubuntu/.julia/v0.5/TensorFlow/src/run.jl:96
in run(::TensorFlow.Session, ::Array{TensorFlow.Tensor,1}, ::Dict{TensorFlow.Tensor,Array{Float64,2}}) at /home/ubuntu/.julia/v0.5/TensorFlow/src/run.jl:143
This only happens when I try to train it.
If I don't include an optimise function/output then it works fine.
So I am doing something that screws up the gradient math.