Why does my TensorFlow distributed strategy run sequentially instead of in parallel? - tensorflow

I am trying to achieve a parallel computation with my 4 GPUs and tf.distribute.MirroredStrategy().
I wrote a simple version and expected the print(y) output for each GPU to appear simultaneously. But the operations always seem to run sequentially, in the order replica_y:GPU0, replica_y:GPU1, replica_y:GPU2, replica_y:GPU3. Am I doing something wrong? I am using Python 3.6 in the Spyder environment.
#tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        with tf.GradientTape() as tape:
            y = 0.0
            for j in range(100) :
                y = y + tf.reduce_sum(xs_var) ** inputs
            print(y)
        return y
    y_ = strategy.run(step_fn, args = [dist_inputs] )
    return y_
strategy = tf.distribute.MirroredStrategy()
dataset = tf.data.Dataset.from_tensor_slices([ [1.], [2.], [3.], [4.] ]).batch(4)
dist_dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    optimizer = tf.keras.optimizers.Adam(learning_rate = 1E-1)
    xs_var = tf.Variable(tf.ones(2), dtype = tf.float32)
    var_list = [xs_var]
for inputs in dist_dataset:
    print(inputs)
    y = train_step(inputs)

Related

My Pytorch model is giving very bad results

I am new to deep learning with PyTorch. I am more experienced with TensorFlow, so I am not new to deep learning itself.
Currently, I am working on a simple ANN classification. There are only 2 classes so quite naturally I am using a Softmax BCELoss combination.
The dataset is like this:
shape of X_train (891, 7)
Shape of Y_train (891,)
Shape of x_test (418, 7)
I transformed the X_train and others to torch tensors as train_data and so on. The next step is:
train_ds = TensorDataset(train_data, train_label)
# Define data loader
batch_size = 32
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
I made the model class like:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(7, 32)
        self.bc1 = nn.BatchNorm1d(32)
        self.fc2 = nn.Linear(32, 64)
        self.bc2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 128)
        self.bc3 = nn.BatchNorm1d(128)
        self.fc4 = nn.Linear(128, 32)
        self.bc4 = nn.BatchNorm1d(32)
        self.fc5 = nn.Linear(32, 10)
        self.bc5 = nn.BatchNorm1d(10)
        self.fc6 = nn.Linear(10, 1)
        self.bc6 = nn.BatchNorm1d(1)
        self.drop = nn.Dropout2d(p=0.5)
    def forward(self, x):
        torch.nn.init.xavier_uniform(self.fc1.weight)
        x = self.fc1(x)
        x = self.bc1(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.bc2(x)
        x = F.relu(x)
        #x = self.drop(x)
        x = self.fc3(x)
        x = self.bc3(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc4(x)
        x = self.bc4(x)
        x = F.relu(x)
        #x = self.drop(x)
        x = self.fc5(x)
        x = self.bc5(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc6(x)
        x = self.bc6(x)
        x = torch.sigmoid(x)
        return x
model = Net()
The loss function and the optimizer are defined:
loss = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
At last, the task is to run the forward in epochs:
num_epochs = 1000
# Repeat for given number of epochs
for epoch in range(num_epochs):
# Train with batches of data
for xb,yb in train_dl:
pred = model(xb)
yb = torch.unsqueeze(yb, 1)
#print(pred, yb)
print('grad', model.fc1.weight.grad)
l = loss(pred, yb)
#print('loss',l)
# 3. Compute gradients
l.backward()
# 4. Update parameters using gradients
optimizer.step()
# 5. Reset the gradients to zero
optimizer.zero_grad()
# Print the progress
if (epoch+1) % 10 == 0:
print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, l.item()))
I can see in the output that after each iteration with all the batches, the hard weights are non-zero, after this zero_grad is applied.
However, the model is pretty bad. I get an F1 score of around 50% only! And the model is bad when I call it to predict the train_dl itself!!!
I am wondering what the reason is. The grad of weights not zero but not updating properly? The optimizer not optimizing the weights? Or what else?
Can someone please have a look?
I already tried different loss functions and optimizers. I tried with smaller datasets, bigger batches, different hyperparameters.
Thanks! :)
First of all, you don't use softmax activation for BCE loss, unless you have 2 output nodes, which is not the case. In PyTorch, BCE loss doesn't apply any activation function before calculating the loss, unlike the CCE which has a built-in softmax function. So, if you want to use BCE, you have to use sigmoid (or any function f: R -> [0, 1]) at the output layer, which you don't have.
Moreover, you should ideally do optimizer.zero_grad() for each batch if you want to do SGD (which is the default). If you don't do that, you will be just doing full-batch gradient descent, which is quite slow and gets stuck in local minima easily.
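To illustrate the point about BCE expecting probabilities, here is a minimal NumPy sketch of what a binary cross-entropy computes. It is not PyTorch's implementation; the helper names sigmoid and binary_cross_entropy, the eps clipping value, and the sample logits are all assumptions made for the example. The loss only makes sense when its input already lies in [0, 1], which is why the sigmoid is applied to the raw outputs first.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy; expects probabilities, not raw logits."""
    # Clip to avoid log(0); eps is an illustrative choice.
    probs = np.clip(probs, eps, 1.0 - eps)
    per_sample = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(np.mean(per_sample))

# Hypothetical raw network outputs (logits) and binary labels.
logits = np.array([0.0, 2.0, -2.0])
targets = np.array([1.0, 1.0, 0.0])

# The sigmoid has to be applied before the loss, as described above.
loss = binary_cross_entropy(sigmoid(logits), targets)
print(loss)
```

In PyTorch itself, the same pairing would be a sigmoid output layer followed by nn.BCELoss.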

Can neural networks handle redundant inputs?

I have a fully connected neural network with the following number of neurons in each layer [4, 20, 20, 20, ..., 1]. I am using TensorFlow and the 4 real-valued inputs correspond to a particular point in space and time, i.e. (x, y, z, t), and the 1 real-valued output corresponds to the temperature at that point. The loss function is just the mean square error between my predicted temperature and the actual temperature at that point in (x, y, z, t). I have a set of training data points with the following structure for their inputs:
(x,y,z,t):
(0.11,0.12,1.00,0.41)
(0.34,0.43,1.00,0.92)
(0.01,0.25,1.00,0.65)
...
(0.71,0.32,1.00,0.49)
(0.31,0.22,1.00,0.01)
(0.21,0.13,1.00,0.71)
Namely, what you will notice is that the training data all have the same redundant value in z, but x, y, and t are generally not redundant. Yet what I find is my neural network cannot train on this data due to the redundancy. In particular, every time I start training the neural network, it appears to fail and the loss function becomes nan. But, if I change the structure of the neural network such that the number of neurons in each layer is [3, 20, 20, 20, ..., 1], i.e. now data points only correspond to an input of (x, y, t), everything works perfectly and training is all right. But is there any way to overcome this problem? (Note: it occurs whether any of the variables are identical, e.g. either x, y, or t could be redundant and cause this error.) I have also attempted different activation functions (e.g. ReLU) and varying the number of layers and neurons in the network, but these changes do not resolve the problem.
My question: is there any way to still train the neural network while keeping the redundant z as an input? It just so happens the particular training data set I am considering at the moment has all z redundant, but in general, I will have data coming from different z in the future. Therefore, a way to ensure the neural network can robustly handle inputs at the present moment is sought.
A minimal working example is encoded below. When running this example, the loss output is nan, but if you simply uncomment the x_z in line 12 to ensure there is now variation in x_z, then there is no longer any problem. But this is not a solution since the goal is to use the original x_z with all constant values.
import numpy as np
import tensorflow as tf
end_it = 10000 #number of iterations
frac_train = 1.0 #randomly sampled fraction of data to create training set
frac_sample_train = 0.1 #randomly sampled fraction of data from training set to train in batches
layers = [4, 20, 20, 20, 20, 20, 20, 20, 20, 1]
len_data = 10000
x_x = np.array([np.linspace(0.,1.,len_data)])
x_y = np.array([np.linspace(0.,1.,len_data)])
x_z = np.array([np.ones(len_data)*1.0])
#x_z = np.array([np.linspace(0.,1.,len_data)])
x_t = np.array([np.linspace(0.,1.,len_data)])
y_true = np.array([np.linspace(-1.,1.,len_data)])
N_train = int(frac_train*len_data)
idx = np.random.choice(len_data, N_train, replace=False)
x_train = x_x.T[idx,:]
y_train = x_y.T[idx,:]
z_train = x_z.T[idx,:]
t_train = x_t.T[idx,:]
v1_train = y_true.T[idx,:]
sample_batch_size = int(frac_sample_train*N_train)
np.random.seed(1234)
tf.set_random_seed(1234)
import logging
logging.getLogger('tensorflow').setLevel(logging.ERROR)
tf.logging.set_verbosity(tf.logging.ERROR)
class NeuralNet:
    def __init__(self, x, y, z, t, v1, layers):
        X = np.concatenate([x, y, z, t], 1)
        self.lb = X.min(0)
        self.ub = X.max(0)
        self.X = X
        self.x = X[:,0:1]
        self.y = X[:,1:2]
        self.z = X[:,2:3]
        self.t = X[:,3:4]
        self.v1 = v1
        self.layers = layers
        self.weights, self.biases = self.initialize_NN(layers)
        self.sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False,
                                                     log_device_placement=False))
        self.x_tf = tf.placeholder(tf.float32, shape=[None, self.x.shape[1]])
        self.y_tf = tf.placeholder(tf.float32, shape=[None, self.y.shape[1]])
        self.z_tf = tf.placeholder(tf.float32, shape=[None, self.z.shape[1]])
        self.t_tf = tf.placeholder(tf.float32, shape=[None, self.t.shape[1]])
        self.v1_tf = tf.placeholder(tf.float32, shape=[None, self.v1.shape[1]])
        self.v1_pred = self.net(self.x_tf, self.y_tf, self.z_tf, self.t_tf)
        self.loss = tf.reduce_mean(tf.square(self.v1_tf - self.v1_pred))
        self.optimizer = tf.contrib.opt.ScipyOptimizerInterface(self.loss,
                                                                method = 'L-BFGS-B',
                                                                options = {'maxiter': 50,
                                                                           'maxfun': 50000,
                                                                           'maxcor': 50,
                                                                           'maxls': 50,
                                                                           'ftol' : 1.0 * np.finfo(float).eps})
        init = tf.global_variables_initializer()
        self.sess.run(init)
    def initialize_NN(self, layers):
        weights = []
        biases = []
        num_layers = len(layers)
        for l in range(0,num_layers-1):
            W = self.xavier_init(size=[layers[l], layers[l+1]])
            b = tf.Variable(tf.zeros([1,layers[l+1]], dtype=tf.float32), dtype=tf.float32)
            weights.append(W)
            biases.append(b)
        return weights, biases
    def xavier_init(self, size):
        in_dim = size[0]
        out_dim = size[1]
        xavier_stddev = np.sqrt(2/(in_dim + out_dim))
        return tf.Variable(tf.truncated_normal([in_dim, out_dim], stddev=xavier_stddev), dtype=tf.float32)
    def neural_net(self, X, weights, biases):
        num_layers = len(weights) + 1
        H = 2.0*(X - self.lb)/(self.ub - self.lb) - 1.0
        for l in range(0,num_layers-2):
            W = weights[l]
            b = biases[l]
            H = tf.tanh(tf.add(tf.matmul(H, W), b))
        W = weights[-1]
        b = biases[-1]
        Y = tf.add(tf.matmul(H, W), b)
        return Y
    def net(self, x, y, z, t):
        v1_out = self.neural_net(tf.concat([x,y,z,t], 1), self.weights, self.biases)
        v1 = v1_out[:,0:1]
        return v1
    def callback(self, loss):
        global Nfeval
        print(str(Nfeval)+' - Loss in loop: %.3e' % (loss))
        Nfeval += 1
    def fetch_minibatch(self, x_in, y_in, z_in, t_in, den_in, N_train_sample):
        idx_batch = np.random.choice(len(x_in), N_train_sample, replace=False)
        x_batch = x_in[idx_batch,:]
        y_batch = y_in[idx_batch,:]
        z_batch = z_in[idx_batch,:]
        t_batch = t_in[idx_batch,:]
        v1_batch = den_in[idx_batch,:]
        return x_batch, y_batch, z_batch, t_batch, v1_batch
    def train(self, end_it):
        it = 0
        while it < end_it:
            x_res_batch, y_res_batch, z_res_batch, t_res_batch, v1_res_batch = self.fetch_minibatch(self.x, self.y, self.z, self.t, self.v1, sample_batch_size) # Fetch residual mini-batch
            tf_dict = {self.x_tf: x_res_batch, self.y_tf: y_res_batch, self.z_tf: z_res_batch, self.t_tf: t_res_batch,
                       self.v1_tf: v1_res_batch}
            self.optimizer.minimize(self.sess,
                                    feed_dict = tf_dict,
                                    fetches = [self.loss],
                                    loss_callback = self.callback)
    def predict(self, x_star, y_star, z_star, t_star):
        tf_dict = {self.x_tf: x_star, self.y_tf: y_star, self.z_tf: z_star, self.t_tf: t_star}
        v1_star = self.sess.run(self.v1_pred, tf_dict)
        return v1_star
model = NeuralNet(x_train, y_train, z_train, t_train, v1_train, layers)
Nfeval = 1
model.train(end_it)
I think your problem is in this line:
H = 2.0*(X - self.lb)/(self.ub - self.lb) - 1.0
In the third column of X, corresponding to the z variable, self.lb and self.ub are the same value (the constant from the example, in this case 1), so it is actually computing:
2.0*(1.0 - 1.0)/(1.0 - 1.0) - 1.0 = 2.0*0.0/0.0 - 1.0
which is nan. You can work around the issue in a few different ways; a simple option is:
# Avoids dividing by zero
X_d = tf.math.maximum(self.ub - self.lb, 1e-6)
H = 2.0*(X - self.lb)/X_d - 1.0
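The effect is easy to reproduce outside TensorFlow. The sketch below is a NumPy analogue, using np.maximum in place of tf.math.maximum, and takes three of the sample rows from the question as input. The variable names naive, span, and safe are made up for the example. It shows the unguarded scaling turning the constant z column into nan, while the guarded version maps it to a finite constant.

```python
import numpy as np

# Three sample rows from the question: columns are (x, y, z, t), z is constant.
X = np.array([[0.11, 0.12, 1.00, 0.41],
              [0.34, 0.43, 1.00, 0.92],
              [0.01, 0.25, 1.00, 0.65]])
lb, ub = X.min(axis=0), X.max(axis=0)

# Unguarded min-max scaling: the z column evaluates 0.0 / 0.0, which is nan.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = 2.0 * (X - lb) / (ub - lb) - 1.0

# Guarded scaling: clamp the range away from zero before dividing.
span = np.maximum(ub - lb, 1e-6)
safe = 2.0 * (X - lb) / span - 1.0

print(np.isnan(naive[:, 2]).all())  # the whole z column is nan
print(safe[:, 2])                   # the constant column becomes a finite constant
```

With the guard, a constant input simply contributes a fixed value to the first layer, so it behaves like an extra bias term and training can proceed.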
This is an interesting situation. A quick check with an online regression tool shows that even simple regression is unable to fit the data points when one of the inputs is constant throughout the dataset. The algebraic solution for a two-variable linear regression problem involves division by the standard deviation, which is zero for a constant input, and that is a problem.
As far as solving through backprop is concerned (as is the case in your neural network), I strongly suspect that the derivative of the loss with respect to the input (these expressions) is the culprit, and that the algorithm is not able to update the weights W using W := W - α.dZ, and ends up remaining constant.

How to define a layer only for training phase in TensorFlow?

I wanted to know if it's possible to define a layer (convolution, element-wise summation, etc.) only for the training phase in TensorFlow.
For example, I want to have an element-wise summation layer in my network only for the training phase and I want to ignore this layer in the test phase.
This is easily doable in Caffe, I wanted to know if it's possible to do so in TensorFlow as well.
You might want to do this with the "tf.cond" control_flow operation. https://www.tensorflow.org/api_docs/python/control_flow_ops/control_flow_operations#cond
I think you can use a boolean placeholder with tf.cond().
Just like this:
train_phase = tf.placeholder(tf.bool, [])
x = tf.constant(2)
def f1(): return tf.add(x, 1)
def f2(): return tf.identity(x)
r = tf.cond(train_phase, f1, f2)
sess.run(r, feed_dict={train_phase: True}) # training phase, r = tf.add(x, 1) = x + 1
sess.run(r, feed_dict={train_phase: False}) # testing phase, r = tf.identity(x) = x
I think you can do this with a plain Python if statement when building the graph:
Train = False
x = tf.constant(5.)
y = x + 1
if Train:
    y = y + 2
y = y + 3
with tf.Session() as sess:
    res = sess.run(y) # 11 if Train else 9

SGD converges but batch learning does not, simple regression in tensorflow

I have run into an issue where batch learning in TensorFlow fails to converge to the correct solution for a simple convex optimization problem, whereas SGD converges. A small example is given below in both Julia and Python. I have verified that exactly the same behaviour results from using TensorFlow from either language.
I'm trying to fit the linear model y = s*W + B with parameters W and B
The cost function is quadratic, so the problem is convex and should be easily solved using a small enough step size. If I feed all data at once, the end result is just a prediction of the mean of y. If, however, I feed one datapoint at a time (commented code in the Julia version), the optimization converges to the correct parameters very fast.
I have also verified that the gradients computed by TensorFlow differ between the batch example and summing up the gradients for each datapoint individually.
Any ideas on where I have failed?
using TensorFlow
s = linspace(1,10,10)
s = [s reverse(s)]
y = s*[1,4] + 2
session = Session(Graph())
s_ = placeholder(Float32, shape=[-1,2])
y_ = placeholder(Float32, shape=[-1,1])
W = Variable(0.01randn(Float32, 2,1), name="weights1")
B = Variable(Float32(1), name="bias3")
q = s_*W + B
loss = reduce_mean((y_ - q).^2)
train_step = train.minimize(train.AdamOptimizer(0.01), loss)
function train_critic(s,targets)
    for i = 1:1000
        # for i = 1:length(y)
        #     run(session, train_step, Dict(s_ => s[i,:]', y_ => targets[i]))
        # end
        ts = run(session, [loss,train_step], Dict(s_ => s, y_ => targets))[1]
        println(ts)
    end
    v = run(session, q, Dict(s_ => s, y_ => targets))
    plot(s[:,1],v, lab="v (Predicted value)")
    plot!(s[:,1],y, lab="y (Correct value)")
    gui();
end
run(session, initialize_all_variables())
train_critic(s,y)
Same code in python (I'm not a python user so this might be ugly)
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import tensorflow as tf
from tensorflow.python.framework.ops import reset_default_graph
s = np.linspace(1,10,50).reshape((50,1))
s = np.concatenate((s,s[::-1]),axis=1).astype('float32')
y = np.add(np.matmul(s,[1,4]), 2).astype('float32')
reset_default_graph()
rng = np.random
s_ = tf.placeholder(tf.float32, [None, 2])
y_ = tf.placeholder(tf.float32, [None])
weight_initializer = tf.truncated_normal_initializer(stddev=0.1)
with tf.variable_scope('model'):
    W = tf.get_variable('W', [2, 1],
                        initializer=weight_initializer)
    B = tf.get_variable('B', [1],
                        initializer=tf.constant_initializer(0.0))
q = tf.matmul(s_, W) + B
loss = tf.reduce_mean(tf.square(tf.sub(y_ , q)))
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(loss)
num_epochs = 200
train_cost= []
with tf.Session() as sess:
    init = tf.initialize_all_variables()
    sess.run(init)
    for e in range(num_epochs):
        feed_dict_train = {s_: s, y_: y}
        fetches_train = [train_op, loss]
        res = sess.run(fetches=fetches_train, feed_dict=feed_dict_train)
        train_cost = [res[1]]
        print train_cost
The answer turned out to be that when I fed in the targets, I fed a vector and not an Nx1 matrix. The operation y_-q then turned into a broadcast operation and instead of returning the elementwise difference, it returned an NxN matrix with the desired difference along the diagonal. In Julia, I solved this by modifying the line
train_critic(s,y)
to
train_critic(s,reshape(y, length(y),1))
to ensure y being a matrix.
A subtle error that took me a very long time to find! Part of the confusion was that TensorFlow seems to treat vectors as row vectors and not as column vectors like Julia, hence the broadcast operation in y_-q
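The same shape mismatch can be reproduced with NumPy, whose broadcasting rules give the same result here. This is only a sketch: N and the array values are arbitrary, and q is set equal to the targets so that the correct loss should be exactly zero.

```python
import numpy as np

N = 5
y = np.arange(N, dtype=np.float32)   # targets as a vector, shape (N,)
q = y.reshape(N, 1).copy()           # predictions as a column, shape (N, 1), here perfect

# (N,) minus (N, 1) broadcasts to (N, N): every target against every prediction.
bad = y - q
# Reshaping the targets to (N, 1) gives the intended elementwise difference.
good = y.reshape(N, 1) - q

bad_loss = float(np.mean(bad ** 2))
good_loss = float(np.mean(good ** 2))

print(bad.shape, good.shape)
print(bad_loss, good_loss)
```

Even with perfect predictions the broadcast loss is nonzero. A loss summed over all (target, prediction) pairs is minimized by predicting the mean of y for every point, which matches the behaviour described in the question.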

TensorFlow first attempt, bad results

I can't solve my problem, please help. This is my first attempt at neural networks. I tried to build a network that checks whether a number is in the interval (3:6) or not. I followed several documents from the internet and put together the listing below, but it doesn't produce working results: the output is always "not in (3:6)", and I can't understand what I'm doing wrong.
#Is number between (3:6)
import tensorflow as tf
import numpy as np
import random
def is_num_between(num):
    right_border = 6
    left_border = 3
    if num < right_border and num > left_border:
        return 1
    return 0
def is_num_around(num):
    right_border = 6
    left_border = 3
    if num <= left_border or num >= right_border:
        return 1
    return 0
def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))
def model(X, w_h, w_o):
    h = tf.nn.tanh(tf.matmul(X, w_h))
    return tf.nn.sigmoid(tf.matmul(h, w_o))
def included_or_not(i, prediction):
    return [str(i) + " is in (3:6)", str(i) + " not in (3:6)"][prediction]
NUM_COUNT = 2
NUM_HIDDEN = 10
BATCH_SIZE = 10000
pre_trX = [np.random.random_sample() * 10 for i in range(100000)]
pre_trY1 = [is_num_between(i) for i in pre_trX]
pre_trY2 = [is_num_around(i) for i in pre_trX]
trX = np.array([np.array([pre_trX[i], 1]) for i in range(len(pre_trX))])
trY = np.array([np.array([pre_trY1[i], pre_trY2[i]]) for i in range(len(pre_trX))])
# print(type(trX))
# print(pre_trX)
# print(pre_trY1)
# print(pre_trY2)
# print(trX[0])
# exit()
X = tf.placeholder("float", [None, NUM_COUNT])
Y = tf.placeholder("float", [None, 2])
w_h = init_weights([NUM_COUNT, NUM_HIDDEN])
w_o = init_weights([NUM_HIDDEN, 2])
py_X = model(X, w_h, w_o)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(py_X, Y))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
predict_op = tf.argmax(py_X, 1)
with tf.Session() as sess:
    tf.initialize_all_variables().run()
    for epoch in range(200):
        p = np.random.permutation(range(len(trX)))
        trX, trY = trX[p], trY[p]
        for start in range(0, len(trX), BATCH_SIZE):
            end = start + BATCH_SIZE
            sess.run(train_op, feed_dict={X: trX[start:end], Y: trY[start:end]})
        print(epoch, np.mean(np.argmax(trY, axis=1) ==
                             sess.run(predict_op, feed_dict={X: trX, Y: trY})))
    # Supposedly trained by now; it needs to be tested
    def check_nnetwork():
        numbers = [np.array([np.random.random_sample()*10, 1])]
        teX = np.array(numbers)
        teY = sess.run(predict_op, feed_dict={X: teX})
        output = np.vectorize(included_or_not)("%.3f" % numbers[0][0], teY)
        print(output)
    for i in range(40):
        check_nnetwork()
What does your loss function look like?
Also how many positive examples are there compared to negative examples? If the data is too skewed it might learn to just always predict negative as that is what minimizes the loss function.
The other issue might be that there is a fundamental problem with your architecture in that you expect a one-level neural network to learn a non-linear function which isn't actually possible.
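The class-balance question can be checked directly. The sketch below regenerates data the same way the question does (uniform numbers in [0, 10), positive when strictly between 3 and 6) using only the standard library. The seed and the variable names are assumptions made for the example.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility

# Same sampling scheme as the question: uniform numbers in [0, 10).
samples = [random.random() * 10 for _ in range(100000)]

# Positive class: strictly inside (3, 6), as in is_num_between.
positives = sum(1 for x in samples if 3 < x < 6)
positive_fraction = positives / len(samples)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(positive_fraction, 1.0 - positive_fraction)

print(positive_fraction, baseline_accuracy)
```

Roughly 30% of the samples are positive, so a network that always answers "not in (3:6)" already reaches about 70% accuracy. The accuracy printed during training should be compared against that baseline.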