Why does shuffling the data give significantly higher accuracy? - tensorflow

In TensorFlow, I've written a large model for a two-class image classification problem. My question concerns the following code snippet:
X, y, X_val, y_val = prepare_data()
probs = calc_probs(model, session, X)
accuracy = float(np.equal(np.argmax(probs, 1), np.argmax(y, 1)).sum()) / probs.shape[0]
loss = log_loss(y, probs)
X is an np.array of shape (25000, 244, 244, 3). This code gives accuracy = 0.5834 (close to random) and loss = 2.7106. But when I shuffle the data by adding these 3 lines after the first line:
sample_idx = random.sample(range(0, X.shape[0]), 25000)
X = X[sample_idx]
y = y[sample_idx]
the results become reasonable: accuracy = 0.9933 and loss = 0.0208.
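For reference, the same in-unison shuffle can also be written with a single permutation (a minimal sketch, not part of the original code):
import numpy as np

# one shared permutation keeps X and y aligned
perm = np.random.permutation(X.shape[0])
X = X[perm]
y = y[perm]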
Why can shuffling the data give significantly higher accuracy? What could be the reason for that?
The function calc_probs is mainly a run call:
probs = session.run(model.probs, feed_dict={model.X: X})
Update:
After hours of debugging, I figured out that evaluating a single image gives a different result each time. For example, if you run the following line of code multiple times, you get a different result each run:
session.run(model.probs, feed_dict={model.X: [X[20]]})
My data is normally sorted: X contains the class-1 samples first, then the class-2 samples. In the calc_probs function, I run each batch of the data sequentially, so without shuffling each run sees data from a single class.
I've also noticed that, with shuffling, if the batch size is very small I get near-random accuracy.
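A quick way to confirm the nondeterminism described in the update is the following hedged sketch, reusing the question's session, model and X:
import numpy as np

# Evaluate the same single image several times; if inference were purely
# deterministic, every run would return identical probabilities.
single = X[20:21]  # keep the batch dimension
runs = [session.run(model.probs, feed_dict={model.X: single}) for _ in range(5)]
print([np.allclose(runs[0], r) for r in runs[1:]])  # any False means the output changes between runs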

There is some mathematical justification for this in the context of the randomized Kaczmarz algorithm. The regular Kaczmarz algorithm is an old method that can be seen as non-shuffling SGD on a least-squares problem, and provably faster convergence rates come out if you use randomization; follow the references in http://www.cs.ubc.ca/~nickhar/W15/Lecture21Notes.pdf
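For intuition, here is a small self-contained sketch (not taken from the linked notes) contrasting cyclic and randomized Kaczmarz on a consistent least-squares system; rows are sampled uniformly for simplicity:
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 20
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true  # consistent system, so Kaczmarz can converge to x_true

def kaczmarz(A, b, iters=2000, randomized=True, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    for k in range(iters):
        i = rng.integers(A.shape[0]) if randomized else k % A.shape[0]
        a = A[i]
        # project x onto the hyperplane <a, x> = b[i]
        x += (b[i] - a @ x) / (a @ a) * a
    return x

print(np.linalg.norm(kaczmarz(A, b, randomized=False) - x_true))  # cyclic (no shuffling)
print(np.linalg.norm(kaczmarz(A, b, randomized=True) - x_true))   # randomized row order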

Related

How to load huge time series windows dataset without memory errors?

I want to convert a typical time-series dataset of about 1 million lines into 100-item windows with 50% overlap. Note that it's multivariate, so for example, given 8 features and 1000 windows of 100 items, the final shape would be (1000, 100, 8), i.e. (n_samples, n_timesteps, n_features). The goal is to use it for training machine learning algorithms, including deep neural networks.
So far, I've enjoyed using numpy's sliding_window_view as shown below:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100).reshape(20, 5)
v = sliding_window_view(x, (3, 5))
v
Unfortunately, I get crashes as I run out of RAM on large datasets with millions of lines. Do you have any suggestions?
Additionally, one serious restriction is that there is a consecutive (integer) label for every timestep, by which the dataset needs to be grouped (using pandas), so this limits the options for reading it in portions.
I think you are looking for tf.data.Dataset. I'm working on a dataset with a million rows, and the following code runs well for me:
import tensorflow as tf

convert = tf.data.TextLineDataset("path_to_file.txt")  # read the file lazily, line by line
dataset = tf.data.Dataset.zip(convert)                 # zip is typically used to combine datasets (e.g. features and labels)
Now you have initialized your dataset; to avoid running into memory issues, batch and prefetch it:
def dataset_batches(ds, batch_size):
    return (
        ds
        .cache()
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
        # you can chain more operations here
    )

train_batches = dataset_batches(dataset, 64)
And to run it, you'll have to loop:
for (batch, row) in enumerate(train_batches):
    # do stuff
    # batch = the current batch index (0, 1, 2, ...); if your dataset has 1600 rows
    # and you used batch_size=16, you'll have 100 batches
    # row = the actual data for that batch (a tensor)
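If the goal from the original question is specifically overlapping windows, one option (a sketch that assumes the raw series fits in a single (n_timesteps, n_features) array) is to let tf.data build the windows lazily instead of materializing them:
import numpy as np
import tensorflow as tf

series = np.random.rand(10_000, 8).astype("float32")  # placeholder for the real data

ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(size=100, shift=50, drop_remainder=True)  # 100-step windows, 50% overlap
ds = ds.flat_map(lambda w: w.batch(100))                 # each window becomes one (100, 8) tensor
ds = ds.batch(64).prefetch(tf.data.AUTOTUNE)

for batch in ds.take(1):
    print(batch.shape)  # (64, 100, 8), i.e. (n_samples, n_timesteps, n_features) per batch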

model.evaluate() returns different values for the same metric depending on whether it is returned as the loss or as a metric

I compiled and trained a model like so:
model.compile(optimizer=opt, loss=pixelwise_weighted_binary_crossentropy, metrics=[pixelwise_weighted_binary_crossentropy, dice_coef, dice_loss])
Now during evaluation I get different values for loss_weighted_cross_entropy_value_1 and weighted_cross_entropy_value_2, when running:
(loss_weighted_cross_entropy_value_1, weighted_cross_entropy_value_2, dice_value, dice_loss_value) = model.evaluate(data_generator)
Here, weighted_cross_entropy_value_2 returns the value I expect (same value as during training, when running on the validation dataset), but loss_weighted_cross_entropy_value_1 seems to randomly fluctuate around that value, depending on batch-size.
If I had to wager a guess, it seems as if loss_weighted_cross_entropy_value_1 is the value for only the last batch of the evaluation data. Whereas weighted_cross_entropy_value_2 is the averaged value across all batches of the evaluation data.
Is this correct, or what is going on here?
Edit:
I now ran the evaluation on each batch individually by getting them from the generator first and feeding them to model.evaluate(...) as numpy arrays (see code below). Averaging over the batch-results of loss_weighted_cross_entropy_val_1 and weighted_cross_entropy_val_2 gives the same result in this case:
Averaged loss_weighted_cross_entropy_val_1 - per-sample pass: 0.08109399276593375; std: 0.005511607824946092
Averaged weighted_cross_entropy_val_2 - per-sample pass: 0.08109399271862848; std: 0.005511607193872294
I see this as further indication for my interpretation above.
Code:
nr_of_samples = len(data_generator)
result = nr_of_samples * [None]
loss_weighted_cross_entropy_val_1 = np.zeros(nr_of_samples)
weighted_cross_entropy_val_2 = np.zeros(nr_of_samples)
dice_val = np.zeros(nr_of_samples)
dice_loss_val = np.zeros(nr_of_samples)

for index, sample in enumerate(data_generator):
    image = sample[0]
    mask_weight = sample[1]
    (loss_weighted_cross_entropy_val_1[index],
     weighted_cross_entropy_val_2[index],
     dice_val[index],
     dice_loss_val[index]) = model.evaluate(image, mask_weight)
    print(f"Sample {index}/{nr_of_samples}")
If you are using the same function as both the loss and a metric, you will usually see minor differences in the results due to floating-point precision errors.
Please refer to this SO answer, which explains this case in detail.
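For completeness, here is a minimal self-contained sketch of the setup in question (made-up data, not the asker's model): the same function is used as both the loss and a metric, so the two values returned by model.evaluate() can be compared directly:
import numpy as np
import tensorflow as tf

x = np.random.rand(256, 10).astype("float32")
y = np.random.randint(0, 2, size=(256, 1)).astype("float32")

bce = tf.keras.losses.binary_crossentropy
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss=bce, metrics=[bce])  # same function as loss and metric

loss_value, metric_value = model.evaluate(x, y, batch_size=32, verbose=0)
print(loss_value, metric_value)  # expected to agree up to floating-point error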

How to accumulate losses after each iteration in TensorFlow

I am trying to achieve the following: compute the losses of the previous 25 predictions and sum them before computing the gradient. I have tried this:
loss_summation = tf.Variable(0, dtype=tf.dtypes.float32, name="loss")
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=next_element[1], logits=logits2, name="xentropy")
loss = tf.math.reduce_sum(tf.reduce_mean(xentropy, name="loss"))
loss_summation = tf.assign(loss_summation, loss_summation + loss)

optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
gvs = optimizer.compute_gradients(loss_summation, [vars])

with tf.Session() as sess:
    for i in range(25):
        b = sess.run([loss_summation])
However, optimizer.compute_gradients() complains that "None values not supported". How can I get around this?
I am actually trying to implement the following function (the feedforward pass of an LSTM) in TensorFlow to predict the next word given the previous ones:
def feedforward(self, x_s, hpre, targets, p_s):
    fts, its, gts, css, ots, output, inputs = [], [], [], [], [], [], []
    losses = []
    hprev = hpre
    hts = [hprev]
    loss = 0
    previous_state = p_s
    css.append(previous_state)
    for x, y in zip(x_s, targets):
        # one-hot encode the current word and stack it with the previous hidden state
        k = np.zeros((self.vocab_size, 1))
        k[x] = 1
        M_c = np.row_stack((hprev, k))
        ft = self.sigmoid(np.dot(self.W1, M_c) + self.b1)  # forget gate
        fts.append(ft)
        it = self.sigmoid(np.dot(self.W2, M_c) + self.b2)  # input gate
        its.append(it)
        gt = np.tanh(np.dot(self.W3, M_c) + self.b3)       # candidate cell state
        gts.append(gt)
        cs = (ft * previous_state) + (it * gt)             # new cell state
        previous_state = cs
        css.append(cs)
        ot = self.sigmoid(np.dot(self.W4, M_c) + self.b4)  # output gate
        ots.append(ot)
        ht = ot * np.tanh(cs)                              # new hidden state
        hts.append(ht)
        yt = self.softmax(np.dot(self.W5, ht) + self.b5)   # predicted distribution over the vocabulary
        hprev = ht
        output.append(yt)
        inputs.append(M_c)
        loss += -np.log(yt[y])                             # cross-entropy of the target word
        losses.append(loss)
    return fts, its, gts, css, ots, output, hts, loss, hts[-1], css[-1], inputs
x_s is a list of integers representing words.
x_s=[0,1,2,3,4,5,6,7,8....,24]
targets is the list of expected integers, i.e. if the current input is 0 then the next word is 1:
targets=[1,2,3,4,5,6,7,8,9...,25]
The loss, which is the sum of the 25 per-step losses, is what will be minimized.
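What the question describes can also be written directly in eager (TF2) style, summing the 25 per-step losses in a Python list before a single gradient step; the model, optimizer and data below are placeholders, not the asker's network:
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(26)])
optimizer = tf.keras.optimizers.Adam(1e-3)
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((100, 26)), tf.random.uniform((100,), maxval=26, dtype=tf.int32))
).batch(1)

losses = []
with tf.GradientTape() as tape:
    for x, y in dataset.take(25):
        logits = model(x)
        losses.append(tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)))
    total_loss = tf.add_n(losses)  # the summation of 25 losses that will be minimized
grads = tape.gradient(total_loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))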
There are a few things you need to address here:
Is there a good reason not to just use larger batches? Are you trying to implement the lookahead optimizer or something?
You look like you're getting started with TensorFlow. Consider turning on eager execution with tf.enable_eager_execution(). TensorFlow 2.0 is coming soon; don't waste your time messing with tf.Session.
Variables are not differentiable. So accumulating the losses in a variable doesn't make any sense.
I would make a copy of all the model's variables and accumulate new values there. Then, after N iterations, assign those values back to the model. Something like:
model = tf.keras.Sequential(...)
vars = model.trainable_variables
weight_acc = [tf.Variable(var) for var in model.trainable_variables]

for n, (batch, label) in enumerate(dataset):
    with tf.GradientTape() as tape:
        pred = model(batch)
        loss = cal_loss(pred, label)
    grads = tape.gradient(loss, vars)
    for g, a in zip(grads, weight_acc):
        a.assign_sub(learning_rate * g)  # accumulate gradient-descent steps on the copies
    if n % 25 == 0:
        for a, v in zip(weight_acc, vars):
            v.assign_add(lookahead_fraction * (a - v))  # pull the model variables a fraction toward the accumulators
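The copy-and-sync pattern above mirrors the slow-weights idea behind the lookahead optimizer mentioned at the start of this answer: weight_acc plays the role of the fast weights, and every 25 iterations the model's variables are pulled a fraction of the way toward them.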

Use of DeepExplainer to get shap values for an MLP model in Keras with tensorflow backend

I am playing around with DeepExplainer to get SHAP values for deep learning models. By following some tutorials I can get some results, i.e. which variables are pushing the model prediction away from the base value (the average model output over the training set).
I have around 5,000 observations and 70 features. The performance of DeepExplainer is quite satisfactory. My code is:
model0 = load_model(model_p+'health0.h5')
background = healthScaler.transform(train[healthFeatures])
e = shap.DeepExplainer(model0, background)
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))
test2 = test[healthFeatures].copy()
test2[healthFeatures] = healthScaler.transform(test[healthFeatures])
shap.force_plot(e.expected_value[0], shap_values[0][947,:], test2.iloc[947,:])
And the plot is the following:
Here the base value is 0.012 (it can also be seen through e.expected_value[0]) and is very close to the output value, which is 0.01.
At this point I have some questions:
1) The output value is not identical to the prediction obtained through model0.predict(test[healthFeatures])[947] = -0.103. How should I interpret the output value?
2) As can be seen, I am using the whole training set as the background to approximate the conditional expectations of the SHAP values. What is the difference between using random samples from the training set and using the entire set? Is it only a matter of performance?
Many thanks in advance!
Probably too late, but this is still a very common question that will benefit other beginners. To answer (1): the expected and output values will be different. The expected value is, as the name suggests, the average over the scores predicted by your model; e.g., if the model outputs probabilities, then it is the average of the probabilities the model produces. For (2): as long as the background has fewer than about 5k rows, it won't change much, but if it is larger than 5k your calculations will take days to finish.
See this (lines 21-25) for more comprehensive answers.
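Regarding (2), a common approach is to pass a random subsample of the scaled training data as the background; a sketch reusing the question's variable names (the sample size of 1000 is an arbitrary choice):
import numpy as np
import shap

background_full = healthScaler.transform(train[healthFeatures])
idx = np.random.choice(background_full.shape[0], size=1000, replace=False)
background_sample = background_full[idx]  # random background subsample

e = shap.DeepExplainer(model0, background_sample)
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))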

Tensorflow: opt.compute_gradients() returns values different from the weight difference of opt.apply_gradients()

Question: What is the most efficient way to get the delta of my weights in a TensorFlow network?
Background: I've got the operators hooked up as follows (thanks to this SO question):
self.cost = `the rest of the network`
self.rmsprop = tf.train.RMSPropOptimizer(lr,rms_decay,0.0,rms_eps)
self.comp_grads = self.rmsprop.compute_gradients(self.cost)
self.grad_placeholder = [(tf.placeholder("float", shape=grad[1].get_shape(), name="grad_placeholder"), grad[1]) for grad in self.comp_grads]
self.apply_grads = self.rmsprop.apply_gradients(self.grad_placeholder)
Now, to feed in information, I run the following:
feed_dict = `training variables`
grad_vals = self.sess.run([grad[0] for grad in self.comp_grads], feed_dict=feed_dict)
feed_dict2 = `feed_dict plus gradient values added to self.grad_placeholder`
self.sess.run(self.apply_grads, feed_dict=feed_dict2)
The run(self.apply_grads) call updates the network weights, but when I compute the difference between the starting and ending weights (via run(self.w1)), those numbers are different from what is stored in grad_vals[0]. I figure this is because the RMSPropOptimizer does more to the raw gradients, but I'm not sure what, or where to find out what it does.
So, back to the question: how do I get the delta of my weights in the most efficient way? Am I stuck running self.w1.eval(sess) multiple times to get the weights and computing the difference? Is there something I'm missing with the tf.RMSPropOptimizer function?
Thanks!
RMSprop does not subtract the raw gradient from the parameters; it uses a more complicated formula involving a combination of:
a momentum term, if the corresponding hyperparameter is not 0;
a gradient step, rescaled non-uniformly (per coordinate) by the square root of a running average of the squared gradients.
For more information you can refer to these slides or this recent paper.
The delta is first computed in memory by TensorFlow in the slot variable 'momentum', and then the variable is updated (see the C++ operator).
Thus, you should be able to access it and construct a delta node with delta_w1 = self.rmsprop.get_slot(self.w1, 'momentum'). (I have not tried it yet.)
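For reference, a small sketch of the per-parameter update this describes, matching the 'momentum' slot mentioned above (an illustration with the question's hyperparameter names, not TensorFlow's actual source):
import numpy as np

def rmsprop_step(w, g, ms, mom, lr, rms_decay, momentum, rms_eps):
    ms = rms_decay * ms + (1.0 - rms_decay) * g * g         # running average of squared gradients
    mom = momentum * mom + lr * g / np.sqrt(ms + rms_eps)   # the delta kept in the 'momentum' slot
    return w - mom, ms, mom

w, ms, mom = np.ones(3), np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.2, 0.3])
w_new, ms, mom = rmsprop_step(w, g, ms, mom, lr=0.01, rms_decay=0.9, momentum=0.0, rms_eps=1e-8)
print(w - w_new)  # this difference is the applied delta, not the raw gradient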
You can add the weights to the list of things to fetch in each run call, and then compute the deltas outside of TensorFlow, since you will have the successive iterates. This should be reasonably efficient, although it incurs an extra elementwise difference; to avoid that you would have to dig into the guts of the optimizer, find where it stores the update before applying it, and fetch that each step. Fetching the weights on each call shouldn't cause wasteful extra evaluations of parts of the graph, at least.
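A sketch of that suggestion using the question's own attributes (assuming self.w1 is the weight tensor of interest):
w1_before = self.sess.run(self.w1)
self.sess.run(self.apply_grads, feed_dict=feed_dict2)
w1_after = self.sess.run(self.w1)
delta_w1 = w1_after - w1_before  # the change RMSProp actually applied, not the raw gradient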
RMSProp does complicated per-weight scaling of the learning rate. Basically, it divides the learning rate for a weight by a running average of the magnitudes of that weight's recent gradients.