Tensorflow: sparse_tensor_dense_matmul slower than regular matmul - tensorflow

I have 2 scenarios:
scenario 1:
op: sparse_tensor_dense_matmul
A: 1000x1000 sparsity = 90%
B: 1000x1000 sparsity = 0%
scenario 2:
op: matmul
A: 1000x1000 sparsity = 0%
B: 1000x1000 sparsity = 0%
I understand that GPUs do not handle sparse matrix multiplication well, but I would certainly expect them to perform it at least as well as non-sparse matrix multiplication. In my code, sparse_tensor_dense_matmul is 10x slower!
import tensorflow as tf
import numpy as np
import time
import itertools
rate = 0.1
N = 1000
itrs = 1000
num = int(rate * N * N)
combs = np.array(list(itertools.product(range(N), range(N))))
choices = range(len(combs))
_idxs = np.random.choice(a=choices, size=num, replace=False).tolist()
_idxs = combs[_idxs]
_idxs = _idxs.tolist()
_idxs = sorted(_idxs)
_vals = np.float32(np.random.rand(num))
_y = np.random.uniform(low=-1., high=1., size=(N, N))
_z = np.random.uniform(low=-1., high=1., size=(N, N))
################################################
x = tf.SparseTensor(indices=_idxs, values=_vals, dense_shape=(N, N))
y = tf.Variable(_y, dtype=tf.float32)
z = tf.Variable(_z, dtype=tf.float32)
sparse_dot = tf.sparse_tensor_dense_matmul(x, y)
dot = tf.matmul(z, y)
################################################
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()
start = time.time()
for i in range(itrs):
    [_sparse_dot] = sess.run([sparse_dot], feed_dict={})
total = time.time() - start
print(total)

start = time.time()
for i in range(itrs):
    [_dot] = sess.run([dot], feed_dict={})
total = time.time() - start
print(total)
################################################
25.357680797576904
2.7684502601623535
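One extra timing worth adding (an editorial suggestion, not part of the original post) is to densify the same sparse data once with tf.sparse_tensor_to_dense and run the regular matmul on it, so both measurements use identical values and only the operator differs:

x_dense = tf.sparse_tensor_to_dense(x)   # one-off conversion of the SparseTensor defined above
dense_dot = tf.matmul(x_dense, y)

start = time.time()
for i in range(itrs):
    [_dense_dot] = sess.run([dense_dot], feed_dict={})
total = time.time() - start
print(total)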

Related

PPO: NaN Policy return in Tensorflow Keras

I am trying to implement the PPO algorithm with a clipped loss in addition to KL penalties, and to run training on MuJoCo Gym environments. After ~15000 gradient steps, the policy collapses into returning NaN.
This is the policy training info right before the policy collapses:
A: tf.Tensor(-0.10426917, shape=(), dtype=float32)
LOG_A: tf.Tensor(37.021107, shape=(), dtype=float32)
LOSS: tf.Tensor(0.16812761, shape=(), dtype=float32)
GRAD: tf.Tensor(
[[-3.4624012e-04 -1.2807851e-04 -1.9778654e-01 ... -2.7586846e+00
-1.2552655e-01 -1.7212760e-03]
[ 4.6312678e-05 -2.2251482e-04 5.5088173e-03 ... 9.5249921e-02
2.2186586e-03 2.0080474e-04]
[ 2.0314787e-05 -1.6381161e-04 7.1509695e-03 ... 1.1740552e-01
3.4010289e-03 1.2105847e-04]
...
[ 1.7827883e-04 -1.1712313e-05 5.8873045e-01 ... 9.2354174e+00
2.9186043e-01 -2.2818900e-03]
[-9.0385452e-05 3.0951984e-03 -3.6487404e-02 ... -2.6829168e-01
-3.9602429e-02 2.0654879e-03]
[ 2.2925157e-04 4.6892464e-03 5.9946489e-01 ... 9.3497839e+00
3.0514282e-01 -1.3834883e-03]], shape=(11, 256), dtype=float32)
A: tf.Tensor(nan, shape=(), dtype=float32)
LOG_A: tf.Tensor(nan, shape=(), dtype=float32)
Note: The gradient info captures only the gradients of the first layer, as I have found capturing all gradient info to be messy and seemingly redundant.
What I have tried:
Tuning hyperparameters: I have tried multiple sets of hyperparameters, including the ones documented in the original paper. The same error occurs (the hyperparameter setup provided in the example below was chosen for higher sampling efficiency, i.e. faster debugging).
Gradient clipping: Gradient norm has been clipped to be unitary, and as shown above, it does not appear to have the exploding gradient issue.
Guaranteed numerical stability of tanh squashing of policy log probability: A small epsilon was used to clip the sum of squares so that action log probability does not return inf after tanh squashing.
Minimal code example:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import gym
import scipy.signal
import time
from tensorflow.keras import Model
import matplotlib.pyplot as plt
import random
import tensorflow_probability as tfp
tf.keras.backend.set_floatx('float32')
EPSILON = 1e-10
################## GLOBAL SETUP P1 ##################
problem = "Hopper-v2"
env = gym.make(problem)
eval_env = gym.make(problem)
num_states = env.observation_space.shape[0]
print("Size of State Space -> {}".format(num_states), flush=True)
num_actions = env.action_space.shape[0]
print("Size of Action Space -> {}".format(num_actions), flush=True)
upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]
print("Max Value of Action -> {}".format(upper_bound), flush=True)
print("Min Value of Action -> {}".format(lower_bound), flush=True)
minibatch_size = 256
##########*****####################*****##########
#################### Auxiliaries ####################
def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]
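# Editorial example, not in the original code: with x = [1., 1., 1.] and discount = 0.5
# this returns [1.75, 1.5, 1.0], i.e. y[t] = x[t] + discount * y[t + 1] evaluated from
# the end of the trajectory backwards.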
##########*****####################*****##########
#################### Replay Buffer ####################
class Buffer:
    def __init__(self, observation_dimensions, action_dimensions, size, gamma=0.99, lam=0.95):
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros((size, action_dimensions), dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)
        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]
        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]
        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        rindex = np.random.choice(self.pointer, minibatch_size)
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer[rindex]),
            np.std(self.advantage_buffer[rindex]),
        )
        return (
            self.observation_buffer[rindex],
            self.action_buffer[rindex],
            (self.advantage_buffer[rindex] - advantage_mean) / advantage_std,
            self.return_buffer[rindex],
            self.logprobability_buffer[rindex],
        )

    def clear(self):
        self.pointer, self.trajectory_start_index = 0, 0
##########*****####################*****##########
#################### Models ####################
class Actor(Model):
    def __init__(self):
        super().__init__()
        self.action_dim = num_actions
        self.dense1_layer = layers.Dense(256, activation="relu")
        self.dense2_layer = layers.Dense(256, activation="relu")
        self.mean_layer = layers.Dense(self.action_dim)
        self.stdev_layer = layers.Dense(self.action_dim)

    def call(self, state, eval_mode=False):
        a1 = self.dense1_layer(state)
        a2 = self.dense2_layer(a1)
        mu = self.mean_layer(a2)
        log_sigma = self.stdev_layer(a2)
        sigma = tf.exp(log_sigma)
        covar_m = tf.linalg.diag(sigma**2)
        dist = tfp.distributions.MultivariateNormalTriL(loc=mu, scale_tril=tf.linalg.cholesky(covar_m))
        if eval_mode:
            action_ = mu
        else:
            action_ = dist.sample()
        action = tf.tanh(action_)
        log_pi_ = dist.log_prob(action_)
        log_pi = log_pi_ - tf.reduce_sum(tf.math.log(tf.clip_by_value(1 - action**2, EPSILON, 1.0)), axis=1)
        return action*upper_bound, log_pi


def get_critic():
    state_input = layers.Input(shape=(num_states))
    state_out = layers.Dense(256, activation="relu")(state_input)
    out = layers.Dense(256, activation="relu")(state_out)
    outputs = layers.Dense(1, dtype='float32')(out)
    model = tf.keras.Model(state_input, outputs)
    return model
##########*****####################*****##########
#################### GLOBAL SETUP P2 ####################
# Hyperparameters of the PPO algorithm
horizon = 2048
iterations = 2000
gamma = 0.99
clip_ratio = 0.2
epochs = 500
lam = 0.97
target_kl = 0.01
beta = 1.0
render = False
actor_model = Actor()
critic_model = get_critic()
lr = 0.0003
policy_optimizer = tf.keras.optimizers.Adam(learning_rate=lr,
                                            # )
                                            clipnorm=1.0)
value_optimizer = tf.keras.optimizers.Adam(learning_rate=lr,
                                           # )
                                           clipnorm=1.0)
buffer = Buffer(num_states, num_actions, horizon)
##########*****####################*****##########
#################### Training ####################
observation, episode_return, episode_length = env.reset(), 0, 0
tf_observation = tf.expand_dims(observation, 0)
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):
    global beta
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        action, log_a = actor_model(observation_buffer)
        # print("A: ", tf.reduce_mean(action))
        # print("LOG_A: ", tf.reduce_mean(log_a))
        ratio = tf.exp(
            log_a
            - logprobability_buffer
        )
        # print("R: ", tf.reduce_mean(ratio), flush=True)
        cd_ratio = tf.clip_by_value(ratio, (1 - clip_ratio), (1 + clip_ratio))
        min_advantage = cd_ratio * advantage_buffer
        _kl = -beta*tf.math.reduce_max(logprobability_buffer - log_a)
        policy_loss = -tf.reduce_mean(tf.minimum(ratio * advantage_buffer, min_advantage) + _kl)
        # print("LOSS: ", policy_loss)
    policy_grads = tape.gradient(policy_loss, actor_model.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor_model.trainable_variables))
    # print("GRAD: ", policy_grads[0], flush=True)
    action_opt, log_a_opt = actor_model(observation_buffer)
    kl = tf.reduce_mean(
        logprobability_buffer
        - log_a_opt
    )
    if kl < target_kl/1.5:
        beta = beta/2
    if kl > target_kl*1.5:
        beta = beta*2
    return kl


def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = tf.reduce_mean((return_buffer - critic_model(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic_model.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic_model.trainable_variables))
for ite in range(iterations):
    for t in range(horizon):
        if render:
            env.render()
        action, log_pi_a = actor_model(tf_observation)
        action = action[0]
        observation_new, reward, done, _ = env.step(action)
        episode_return += reward
        episode_length += 1
        value_t = critic_model(tf_observation)
        buffer.store(observation, action, reward, value_t, log_pi_a)
        observation = observation_new
        tf_observation = tf.expand_dims(observation, 0)
        terminal = done
        if terminal or (t == horizon - 1):
            last_value = 0 if done else critic_model(tf_observation)
            buffer.finish_trajectory(last_value)
            observation, episode_return, episode_length = env.reset(), 0, 0
            tf_observation = tf.expand_dims(observation, 0)
    for _ in range(epochs):
        (
            observation_buffer,
            action_buffer,
            advantage_buffer,
            return_buffer,
            logprobability_buffer,
        ) = buffer.get()
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        train_value_function(observation_buffer, return_buffer)
    buffer.clear()
##########*****####################*****##########
Note:
The code base combines a modified version of the official Keras PPO tutorial (https://keras.io/examples/rl/ppo_cartpole/) with modules (mainly the policy network) that have been tested in other implementations.
I refrained from using the tf.function decorator, as I am very new to TensorFlow and do not yet understand its impact, and I have read in various GitHub issues that such a declaration can sometimes cause numerical instability due to caching. However, it could be a source of my issues.
Any help is appreciated, and apologies if something is missing or unclear.

Deep neural-network with backpropagation implementation does not work - python

I want to implement a multilayer NN with backpropagation. I have been trying for days, but it simply does not work. It is extremely clear in my head how it is supposed to work, and I have streamlined my code to be as simple as possible, but I can't get it right. It's probably something stupid, but I cannot see it.
The implementation has an input layer of 784 (28x28), two (L) hidden layers of 300 units, and an output of 10 classes. I have a bias in every layer (except the last...).
The output activation is softmax and the hidden activation is ReLU.
I use mini-batches of 600 examples over a dataset of 60k examples, with 50 to 500 epochs.
Here is the core of my code:
Preparation:
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
L = 2
K = len(np.unique(train_labels))
lr = 0.001
nb_epochs = 50
node_per_hidden_layer = 300
nb_batches = 100
W = []
losses_test = []
X_train = np.reshape(train_images, (train_images.shape[0], train_images.shape[1]*train_images.shape[2]))
X_test = np.reshape(test_images, (test_images.shape[0], train_images.shape[1]*train_images.shape[2]))
Y_train = np.zeros((train_labels.shape[0], K))
Y_train[np.arange(Y_train.shape[0]), train_labels] = 1
Y_test = np.zeros((test_labels.shape[0], K))
Y_test[np.arange(Y_test.shape[0]), test_labels] = 1
W.append(np.random.normal(0, 0.01, (X_train.shape[1]+1, node_per_hidden_layer)))
for i in range(L-1):
    W.append(np.random.normal(0, 0.01, (node_per_hidden_layer+1, node_per_hidden_layer)))
W.append(np.random.normal(0, 0.01, (node_per_hidden_layer+1, K)))
Helper function:
def softmax(z):
    exp = np.exp(z - z.max(1)[:,np.newaxis])
    return np.array(exp / exp.sum(1)[:,np.newaxis])

def softmax_derivative(z):
    sm = softmax(z)
    return sm * (1-sm)

def ReLU(z):
    return np.maximum(z, 0)

def ReLU_derivative(z):
    return (z >= 0).astype(int)

def get_loss(y, y_pred):
    return -np.sum(y * np.log(y_pred))
fitting
def fit():
    minibatch_size = len(X_train) // nb_batches
    for epoch in range(nb_epochs):
        permutaion = list(np.random.permutation(X_train.shape[0]))
        X_shuffle = X_train[permutaion]
        Y_shuffle = Y_train[permutaion]
        print("Epoch----------------", epoch)
        for batche in range(0, X_shuffle.shape[0], minibatch_size):
            Z = [None] * (L + 2)
            a = [None] * (L + 2)
            delta = [None] * (L + 2)
            X = X_train[batche:batche+minibatch_size]
            Y = Y_shuffle[batche:batche+minibatch_size]
            ### forward propagation
            a[0] = np.append(X, np.ones((minibatch_size, 1)), axis=1)
            for i in range(L):
                Z[i + 1] = a[i] @ W[i]
                a[i + 1] = np.append(ReLU(Z[i+1]), np.ones((minibatch_size, 1), dtype=int), axis=1)
            Z[-1] = a[L] @ W[L]
            a[-1] = softmax(Z[-1])
            ### back propagation
            delta[-1] = (Y - a[-1]) * softmax_derivative(Z[-1])
            for i in range(L, 0, -1):
                delta[i] = (delta[i+1] @ W[i].T)[:,:-1] * ReLU_derivative(Z[i])
            for i in range(len(W)):
                g = a[i].T @ delta[i+1] / minibatch_size
                W[i] = W[i] + lr * g
        get_loss_on_test()
loss
def get_loss_on_test():
    Z_test = [None] * (L + 2)
    a_test = [None] * (L + 2)
    a_test[0] = np.append(X_test, np.ones((len(X_test), 1)), axis=1)
    for i in range(L):
        Z_test[i + 1] = a_test[i] @ W[i]
        a_test[i + 1] = np.append(ReLU(Z_test[i+1]), np.ones((len(X_test), 1)), axis=1)
    Z_test[-1] = a_test[L] @ W[L]
    a_test[-1] = softmax(Z_test[-1])
    losses_test.append(get_loss(Y_test, a_test[-1]))
main
losses_test.clear()
fit()
plt.plot(losses_test)
plt.show()
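A standard way to hunt for this kind of bug (sketched here as an editorial aside, not part of the original post) is a finite-difference check on a single weight, to compare against the g computed in fit(). The names loss_fn, W_l, i and j below are hypothetical:

def numerical_grad(loss_fn, W_l, i, j, eps=1e-5):
    # Perturb one weight, re-evaluate the loss, then restore the original value.
    W_l[i, j] += eps
    loss_plus = loss_fn()
    W_l[i, j] -= 2 * eps
    loss_minus = loss_fn()
    W_l[i, j] += eps
    return (loss_plus - loss_minus) / (2 * eps)

Here loss_fn would be a zero-argument function that runs the forward pass on a fixed mini-batch and returns get_loss for it, and W_l is the weight matrix being checked.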
If you want to see it in my notebook with an example of the loss graph, here is the link: https://github.com/beurnii/INF8225/blob/master/tp2/jpt.ipynb
If you want more details on my assignment, this is part 1b (page 2 for English):
https://github.com/beurnii/INF8225/blob/master/tp2/INF8225_TP2_2020.pdf

Why does the cost in my implementation of a deep neural network increase after a few iterations?

I am a beginner in machine learning and neural networks. Recently, after watching Andrew Ng's lectures on deep learning, I tried to implement a binary classifier using deep neural networks on my own.
The cost function is expected to decrease after each iteration.
In my program, it decreases slightly in the beginning, but rapidly increases later. I tried changing the learning rate and the number of iterations, but to no avail. I am very confused.
Here is my code:
1. Neural network classifier class
class NeuralNetwork:
    def __init__(self, X, Y, dimensions, alpha=1.2, iter=3000):
        self.X = X
        self.Y = Y
        self.dimensions = dimensions  # Including input layer and output layer. Let example be dimensions=4
        self.alpha = alpha  # Learning rate
        self.iter = iter  # Number of iterations
        self.length = len(self.dimensions)-1
        self.params = {}  # To store parameters W and b for each layer
        self.cache = {}  # To store cache Z and A for each layer
        self.grads = {}  # To store dA, dZ, dW, db
        self.cost = 1  # Initial value does not matter

    def initialize(self):
        np.random.seed(3)
        # If dimensions is 4, then layer 0 and 3 are input and output layers
        # So we only need to initialize w1, w2 and w3
        # There is no need of w0 for input layer
        for l in range(1, len(self.dimensions)):
            self.params['W'+str(l)] = np.random.randn(self.dimensions[l], self.dimensions[l-1])*0.01
            self.params['b'+str(l)] = np.zeros((self.dimensions[l], 1))

    def forward_propagation(self):
        self.cache['A0'] = self.X
        # For last layer, ie, the output layer 3, we need to activate using sigmoid
        # For layer 1 and 2, we need to use relu
        for l in range(1, len(self.dimensions)-1):
            self.cache['Z'+str(l)] = np.dot(self.params['W'+str(l)], self.cache['A'+str(l-1)]) + self.params['b'+str(l)]
            self.cache['A'+str(l)] = relu(self.cache['Z'+str(l)])
        l = len(self.dimensions)-1
        self.cache['Z'+str(l)] = np.dot(self.params['W'+str(l)], self.cache['A'+str(l-1)]) + self.params['b'+str(l)]
        self.cache['A'+str(l)] = sigmoid(self.cache['Z'+str(l)])

    def compute_cost(self):
        m = self.Y.shape[1]
        A = self.cache['A'+str(len(self.dimensions)-1)]
        self.cost = -1/m*np.sum(np.multiply(self.Y, np.log(A)) + np.multiply(1-self.Y, np.log(1-A)))
        self.cost = np.squeeze(self.cost)

    def backward_propagation(self):
        A = self.cache['A' + str(len(self.dimensions) - 1)]
        m = self.X.shape[1]
        self.grads['dA'+str(len(self.dimensions)-1)] = -(np.divide(self.Y, A) - np.divide(1-self.Y, 1-A))
        # Sigmoid derivative for final layer
        l = len(self.dimensions)-1
        self.grads['dZ' + str(l)] = self.grads['dA' + str(l)] * sigmoid_prime(self.cache['Z' + str(l)])
        self.grads['dW' + str(l)] = 1 / m * np.dot(self.grads['dZ' + str(l)], self.cache['A' + str(l - 1)].T)
        self.grads['db' + str(l)] = 1 / m * np.sum(self.grads['dZ' + str(l)], axis=1, keepdims=True)
        self.grads['dA' + str(l - 1)] = np.dot(self.params['W' + str(l)].T, self.grads['dZ' + str(l)])
        # Relu derivative for previous layers
        for l in range(len(self.dimensions)-2, 0, -1):
            self.grads['dZ'+str(l)] = self.grads['dA'+str(l)] * relu_prime(self.cache['Z'+str(l)])
            self.grads['dW'+str(l)] = 1/m*np.dot(self.grads['dZ'+str(l)], self.cache['A'+str(l-1)].T)
            self.grads['db'+str(l)] = 1/m*np.sum(self.grads['dZ'+str(l)], axis=1, keepdims=True)
            self.grads['dA'+str(l-1)] = np.dot(self.params['W'+str(l)].T, self.grads['dZ'+str(l)])

    def update_parameters(self):
        for l in range(1, len(self.dimensions)):
            self.params['W'+str(l)] = self.params['W'+str(l)] - self.alpha*self.grads['dW'+str(l)]
            self.params['b'+str(l)] = self.params['b'+str(l)] - self.alpha*self.grads['db'+str(l)]

    def train(self):
        np.random.seed(1)
        self.initialize()
        for i in range(self.iter):
            #print(self.params)
            self.forward_propagation()
            self.compute_cost()
            self.backward_propagation()
            self.update_parameters()
            if i % 100 == 0:
                print('Cost after {} iterations is {}'.format(i, self.cost))
2. Testing code for odd or even number classifier
import numpy as np
from main import NeuralNetwork
X = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
Y = np.array([[1, 0, 1, 0, 1, 0, 1, 0, 1, 0]])
clf = NeuralNetwork(X, Y, [1, 1, 1], alpha=0.003, iter=7000)
clf.train()
3. Helper Code
import math
import numpy as np
def sigmoid_scalar(x):
    return 1/(1+math.exp(-x))

def sigmoid_prime_scalar(x):
    return sigmoid_scalar(x)*(1-sigmoid_scalar(x))

def relu_scalar(x):
    if x > 0:
        return x
    else:
        return 0

def relu_prime_scalar(x):
    if x > 0:
        return 1
    else:
        return 0

sigmoid = np.vectorize(sigmoid_scalar)
sigmoid_prime = np.vectorize(sigmoid_prime_scalar)
relu = np.vectorize(relu_scalar)
relu_prime = np.vectorize(relu_prime_scalar)
Output (printout omitted: as described above, the cost decreases slightly at first and then increases)
I believe your cross-entropy derivative is wrong. Instead of this:
# WRONG!
self.grads['dA'+str(len(self.dimensions)-1)] = -(np.divide(self.Y, A) - np.divide(1-self.Y, A))
... do this:
# CORRECT
self.grads['dA'+str(len(self.dimensions)-1)] = np.divide(A - self.Y, (1 - A) * A)
See these lecture notes for the details. I think you meant the formula (5), but forgot 1-A. Anyway, use formula (6).
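As a quick sanity check (an editorial addition, not part of the original answer), a NumPy finite-difference comparison confirms the corrected expression:

import numpy as np

A = np.array([0.3, 0.7, 0.9])   # example activations
Y = np.array([0.0, 1.0, 1.0])   # example labels

def loss(a):
    return -np.sum(Y * np.log(a) + (1 - Y) * np.log(1 - a))

analytic = np.divide(A - Y, (1 - A) * A)
eps = 1e-6
numeric = np.array([(loss(A + eps * np.eye(3)[i]) - loss(A - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
print(np.allclose(analytic, numeric))   # True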

`scipy.optimize` functions hang even with `maxiter=0`

I am trying to train the MNIST data (which I downloaded from Kaggle) with simple multi-class logistic regression, but the scipy.optimize functions hang.
Here's the code:
import csv
from math import exp
from numpy import *
from scipy.optimize import fmin, fmin_cg, fmin_powell, fmin_bfgs
# Prepare the data
def getIiter(ifname):
    """
    Get the iterator from a csv file with filename ifname
    """
    ifile = open(ifname, 'r')
    iiter = csv.reader(ifile)
    iiter.__next__()
    return iiter

def parseRow(s):
    y = [int(x) for x in s]
    lab = y[0]
    z = y[1:]
    return (lab, z)

def getAllRows(ifname):
    iiter = getIiter(ifname)
    x = []
    l = []
    for row in iiter:
        lab, z = parseRow(row)
        x.append(z)
        l.append(lab)
    return x, l

def cutData(x, y):
    """
    70% training
    30% testing
    """
    m = len(x)
    t = int(m * .7)
    return [(x[:t], y[:t]), (x[t:], y[t:])]

def num2IndMat(l):
    t = array(l)
    tt = [vectorize(int)((t == i)) for i in range(10)]
    return array(tt).T

def readData(ifname):
    x, l = getAllRows(ifname)
    t = [[1] + y for y in x]
    return array(t), num2IndMat(l)
# Calculate the cost function
def sigmoid(x):
    return 1 / (1 + exp(-x))

vSigmoid = vectorize(sigmoid)
vLog = vectorize(log)

def costFunction(theta, x, y):
    sigxt = vSigmoid(dot(x, theta))
    cm = (- y * vLog(sigxt) - (1 - y) * vLog(1 - sigxt)) / m / N
    return sum(cm)

def unflatten(flatTheta):
    return [flatTheta[i * N : (i + 1) * N] for i in range(n + 1)]

def costFunctionFlatTheta(flatTheta):
    return costFunction(unflatten(flatTheta), trainX, trainY)

def costFunctionFlatTheta1(flatTheta):
    return costFunction(flatTheta.reshape(785, 10), trainX, trainY)
x, y = readData('train.csv')
[(trainX, trainY), (testX, testY)] = cutData(x, y)
m = len(trainX)
n = len(trainX[0]) - 1
N = len(trainY[0])
initTheta = zeros(((n + 1), N))
flatInitTheta = ndarray.flatten(initTheta)
flatInitTheta1 = initTheta.reshape(1, -1)
In the last two lines we flatten initTheta because the fmin{,_cg,_bfgs,_powell} functions seem to only take vectors as the initial-value argument x0. I also flatten initTheta using reshape, in the hope that this answer can be of help.
There is no problem computing the cost function which takes up less than 2 seconds on my computer:
print(costFunctionFlatTheta(flatInitTheta), costFunctionFlatTheta1(flatInitTheta1))
# 0.69314718056 0.69314718056
But all the fmin functions hang, even if I set maxiter=0.
e.g.
newFlatTheta = fmin(costFunctionFlatTheta, flatInitTheta, maxiter=0)
or
newFlatTheta1 = fmin(costFunctionFlatTheta1, flatInitTheta1, maxiter=0)
When I interrupt the program, it seems to me it all hangs at lines in optimize.py calling the cost functions, lines like this:
return function(*(wrapper_args + args))
For example, if I use fmin_cg, this would be line 292 in optimize.py (Version 0.5).
How do I solve this problem?
OK I found a way to stop fmin_cg from hanging.
Basically I just need to write a function that computes the gradient of the cost function, and pass it to the fprime parameter of fmin_cg.
def gradient(theta, x, y):
    return dot(x.T, vSigmoid(dot(x, theta)) - y) / m / N

def gradientFlatTheta(flatTheta):
    return ndarray.flatten(gradient(flatTheta.reshape(785, 10), trainX, trainY))
Then
newFlatTheta = fmin_cg(costFunctionFlatTheta, flatInitTheta, fprime=gradientFlatTheta, maxiter=0)
terminates within seconds, and with maxiter set to a higher number (say 100) the model can be trained in a reasonable amount of time.
The documentation of fmin_cg says the gradient would be numerically computed if no fprime is given, which is what I suspect caused the hanging.
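A rough back-of-the-envelope estimate (an editorial illustration, not part of the original answer) shows why that looks like a hang: a numerical gradient needs at least one cost evaluation per parameter.

n_params = 785 * 10        # size of the flattened theta
secs_per_eval = 2.0        # the cost evaluation above took just under 2 seconds
print(n_params * secs_per_eval / 3600)   # roughly 4.4 hours for a single numerical gradient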
Thanks to this notebook by zgo2016@Kaggle, which helped me find the solution.

tensorflow giving nans when calculating gradient with sparse tensors

The following snippet is from a fairly large piece of code but hopefully I can give all the information necessary:
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
y1 and y2 are 128x30 and ymask is 30x30. ystar is 128x30. dist is 1x30. When ymask is the identity matrix, everything works fine. But when I set it to be all zeros, apart from a single 1 along the diagonal (so as to set all columns but one in y2 to be zero), I get nans for the gradient of dist with respect to y2, using tf.gradients(dist, [y2]). The specific value of dist is [0,0,7.9,0,...], with all the ystar-y2 values being around the range (-1,1) in the third column and zero elsewhere.
I'm pretty confused as to why a numerical issue would occur here, given there are no logs or divisions, is this underflow? Am I missing something in the maths?
For context, I'm doing this to try to train individual dimensions of y, one at a time, using the whole network.
longer version to reproduce:
import tensorflow as tf
import numpy as np
import pandas as pd
batchSize = 128
eta = 0.8
tasks = 30
imageSize = 32**2
groups = 3
tasksPerGroup = 10
trainDatapoints = 10000
w = np.zeros([imageSize, groups * tasksPerGroup])
toyIndex = 0
for toyLoop in range(groups):
    m = np.ones([imageSize]) * np.random.randn(imageSize)
    for taskLoop in range(tasksPerGroup):
        w[:, toyIndex] = m * 0.1 * np.random.randn(1)
        toyIndex += 1
xRand = np.random.normal(0, 0.5, (trainDatapoints, imageSize))
taskLabels = np.matmul(xRand, w) + np.random.normal(0,0.5,(trainDatapoints, groups * tasksPerGroup))
DF = np.concatenate((xRand, taskLabels), axis=1)
trainDF = pd.DataFrame(DF[:trainDatapoints, ])
# define graph variables
x = tf.placeholder(tf.float32, [None, imageSize])
W = tf.Variable(tf.zeros([imageSize, tasks]))
b = tf.Variable(tf.zeros([tasks]))
ystar = tf.placeholder(tf.float32, [None, tasks])
ymask = tf.placeholder(tf.float32, [tasks, tasks])
dataLength = tf.cast(tf.shape(ystar)[0],dtype=tf.float32)
y1 = tf.matmul(x, W) + b
y2 = tf.matmul(y1,ymask)
dist = tf.norm(ystar-y2,axis=0)
mse = tf.reciprocal(dataLength) * tf.reduce_mean(tf.square(dist))
grads = tf.gradients(dist, [y2])
trainStep = tf.train.GradientDescentOptimizer(eta).minimize(mse)
# build graph
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
randTask = np.random.randint(0, 9)
ymaskIn = np.zeros([tasks, tasks])
ymaskIn[randTask, randTask] = 1
batch = trainDF.sample(batchSize)
batch_xs = batch.iloc[:, :imageSize]
batch_ys = np.zeros([batchSize, tasks])
batch_ys[:, randTask] = batch.iloc[:, imageSize + randTask]
gradOut = sess.run(grads, feed_dict={x: batch_xs, ystar: batch_ys, ymask: ymaskIn})
sess.run(trainStep, feed_dict={x: batch_xs, ystar: batch_ys, ymask:ymaskIn})
Here's a very simple reproduction:
import tensorflow as tf

with tf.Graph().as_default():
    y = tf.zeros(shape=[1], dtype=tf.float32)
    dist = tf.norm(y, axis=0)
    (grad,) = tf.gradients(dist, [y])
    with tf.Session():
        print(grad.eval())
Prints:
[ nan]
The issue is that tf.norm computes sum(x**2)**0.5. The gradient is x / sum(x**2) ** 0.5 (see e.g. https://math.stackexchange.com/a/84333), so when sum(x**2) is zero we're dividing by zero.
There's not much to be done in terms of a special case: the gradient as x approaches all zeros depends on which direction it's approaching from. For example if x is a single-element vector, the limit as x approaches 0 could either be 1 or -1 depending on which side of zero it's approaching from.
So in terms of solutions, you could just add a small epsilon:
import tensorflow as tf

def safe_norm(x, epsilon=1e-12, axis=None):
    return tf.sqrt(tf.reduce_sum(x ** 2, axis=axis) + epsilon)

with tf.Graph().as_default():
    y = tf.constant([0.])
    dist = safe_norm(y, axis=0)
    (grad,) = tf.gradients(dist, [y])
    with tf.Session():
        print(grad.eval())
Prints:
[ 0.]
Note that this is not actually the Euclidean norm. It's a good approximation as long as the input is much larger than epsilon.