Why does the default GRU implementation in PyTorch and Keras differ significantly?

After implementing my own GRU cell, I was trying to validate it against the default implementations available in PyTorch and Keras. My implementation was very close to PyTorch's but significantly different from Keras's. So I first decided to compare the PyTorch and Keras implementations against each other and found that they also differ significantly. Here is some code:
import numpy as np
import torch as tt
import torch.nn as nn
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# data generation
input_size = 2
seq_len = 4
hidden_size = 3
batch_size=1
rng = np.random.default_rng(10)
xx = rng.uniform(size=(batch_size, seq_len, input_size)).astype(np.float32)
print(xx.shape)
output:
(1, 4, 2)
First, create a Keras GRU layer and do one forward pass to get the output:
gru_keras = layers.GRU(hidden_size, return_sequences=True)
out_gru_keras = gru_keras(xx)
out_gru_keras
output:
<tf.Tensor: shape=(1, 4, 3), dtype=float32, numpy=
array([[[0.05563376, 0.22639018, 0.14037813],
[0.05021746, 0.27509487, 0.17843583],
[0.0273784 , 0.23740683, 0.1731183 ],
[0.10505023, 0.38027072, 0.31583947]]], dtype=float32)>
Then check the weights of the Keras GRU:
gru_keras_weights = gru_keras.get_weights()
print(f'{len(gru_keras_weights)=}')
for w in gru_keras_weights:
    print(w.shape, w.dtype, w)
output:
len(gru_keras_weights)=3
(2, 9) float32 [[ 0.38249677 -0.67729133 -0.28855678 0.3081903 -0.530349 0.1531434
0.09444886 0.2978403 0.1516701 ]
[ 0.0833146 0.27516943 0.4720915 -0.7370237 -0.20921749 0.38180763
0.23018956 0.39872426 0.5722596 ]]
(3, 9) float32 [[-0.21449113 0.2944518 -0.25759113 0.2292317 0.35174483 0.42021522
-0.5116475 0.42759803 -0.05884372]
[-0.3477073 -0.15120703 0.7333025 0.418491 -0.07980055 -0.21007833
-0.1924745 0.20504965 0.11737184]
[ 0.01270523 0.15124948 0.4014033 -0.568793 0.4513449 -0.03860948
-0.39513308 -0.36090007 -0.02702253]]
(2, 9) float32 [[0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Now create a PyTorch GRU and check its weights:
gru_torch = nn.GRU(
    input_size=input_size,
    hidden_size=hidden_size,
    bias=True,
    batch_first=True,
    num_layers=1,
    dropout=0.0,
    bidirectional=False,
    dtype=tt.float32
)
sd_gru_torch = gru_torch.state_dict()
for k,v in sd_gru_torch.items():
    print(f'{k}, {v.shape}')
output:
weight_ih_l0, torch.Size([9, 2])
weight_hh_l0, torch.Size([9, 3])
bias_ih_l0, torch.Size([9])
bias_hh_l0, torch.Size([9])
Clearly, PyTorch implements two separate biases, bias_ih_l0 and bias_hh_l0, while Keras has a single stacked bias of shape (2, 9) (which is zero here). Now I copied the weights accordingly:
with tt.no_grad():
    gru_torch.get_parameter('weight_ih_l0').copy_(tt.tensor(gru_keras_weights[0].T))
    gru_torch.get_parameter('weight_hh_l0').copy_(tt.tensor(gru_keras_weights[1].T))
    gru_torch.get_parameter('bias_ih_l0').copy_(tt.tensor(gru_keras_weights[2].T[:,0]))
    gru_torch.get_parameter('bias_hh_l0').copy_(tt.tensor(gru_keras_weights[2].T[:,1]))
... and check the parameters again
with tt.no_grad():
    for p in gru_torch.parameters():
        print(p.shape, p)
output:
torch.Size([9, 2]) Parameter containing:
tensor([[ 0.3825, 0.0833],
[-0.6773, 0.2752],
[-0.2886, 0.4721],
[ 0.3082, -0.7370],
[-0.5303, -0.2092],
[ 0.1531, 0.3818],
[ 0.0944, 0.2302],
[ 0.2978, 0.3987],
[ 0.1517, 0.5723]], requires_grad=True)
torch.Size([9, 3]) Parameter containing:
tensor([[-0.2145, -0.3477, 0.0127],
[ 0.2945, -0.1512, 0.1512],
[-0.2576, 0.7333, 0.4014],
[ 0.2292, 0.4185, -0.5688],
[ 0.3517, -0.0798, 0.4513],
[ 0.4202, -0.2101, -0.0386],
[-0.5116, -0.1925, -0.3951],
[ 0.4276, 0.2050, -0.3609],
[-0.0588, 0.1174, -0.0270]], requires_grad=True)
torch.Size([9]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
torch.Size([9]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
Now the weights match exactly. Take a forward pass through the PyTorch GRU with the same data:
with tt.no_grad():
    out_gru_torch, _ = gru_torch(tt.tensor(xx))
    print(out_gru_torch.shape, out_gru_torch)
output:
torch.Size([1, 4, 3]) tensor([[[0.0638, 0.2232, 0.1145],
[0.0553, 0.2741, 0.1618],
[0.0306, 0.2399, 0.1644],
[0.1231, 0.3970, 0.3153]]])
But it can be seen that the outputs differ in value:
out_gru_torch.numpy() - out_gru_keras.numpy()
output:
array([[[ 0.00813178, -0.00323714, -0.02592391],
[ 0.0050362 , -0.000976 , -0.01662555],
[ 0.00321813, 0.0024882 , -0.00869839],
[ 0.0180769 , 0.01671427, -0.00050169]]], dtype=float32)
Summing up the absolute differences:
np.sum(np.abs(out_gru_torch.numpy() - out_gru_keras.numpy()))
output:
0.10962818
Why is there such a significant difference?
I also tried the same thing with LSTMs and found the difference to be insignificant, which can probably be attributed to floating-point rounding errors. Here is the code.
Make a Keras LSTM:
lstm_keras = layers.LSTM(hidden_size, return_sequences=True)
out_lstm_keras = lstm_keras(xx)
out_lstm_keras
output:
<tf.Tensor: shape=(1, 4, 3), dtype=float32, numpy=
array([[[-0.12176377, -0.07746243, -0.08807365],
[-0.17760691, -0.11547467, -0.12406464],
[-0.16980645, -0.1159803 , -0.12289675],
[-0.16330168, -0.08463871, -0.18625976]]], dtype=float32)>
Check the Keras LSTM weights:
lstm_keras_weights = lstm_keras.get_weights()
print(f'{len(lstm_keras_weights)=}')
for w in lstm_keras_weights:
    print(w.shape, w.dtype, w)
output:
len(lstm_keras_weights)=3
(2, 12) float32 [[-0.18769234 0.6526979 0.27196562 -0.23817068 -0.05964065 -0.11090988
-0.6442989 -0.4168117 -0.344454 0.12466687 -0.6536666 -0.28540143]
[-0.35713544 0.34027737 0.09951967 -0.21514818 0.47551024 0.305395
0.16330504 0.22410381 -0.13371867 0.21646535 0.01366949 0.4818431 ]]
(3, 12) float32 [[-0.22128499 -0.17296375 0.03671373 0.16226508 0.19011612 -0.41836154
-0.5816412 -0.32847112 -0.31468534 -0.27402246 0.05426207 0.24291728]
[ 0.4650536 -0.57491106 0.01105271 -0.0380749 0.2271702 0.39930764
0.11620218 -0.19071549 -0.30224687 -0.13937864 -0.27111995 0.08010413]
[ 0.45147026 0.3288076 -0.37750363 0.35117835 -0.31541684 -0.1335725
-0.0910389 -0.24736843 0.03350063 -0.27691114 -0.28898126 -0.27222085]]
(12,) float32 [0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
Now create a PyTorch LSTM and check its weights:
lstm_torch = nn.LSTM(
    input_size=input_size,
    hidden_size=hidden_size,
    bias=True,
    batch_first=True,
    num_layers=1,
    dropout=0.0,
    bidirectional=False,
    dtype=tt.float32
)
sd_lstm_torch = lstm_torch.state_dict()
for k,v in sd_lstm_torch.items():
    print(f'{k}, {v.shape}')
output:
weight_ih_l0, torch.Size([12, 2])
weight_hh_l0, torch.Size([12, 3])
bias_ih_l0, torch.Size([12])
bias_hh_l0, torch.Size([12])
It can be noted that the Keras LSTM implements only one bias (not a stack of two biases as in the GRU), so I chose one of the biases in the PyTorch LSTM and copied the Keras bias vector to it; the other bias is set to zero. It doesn't make much difference which one is copied to. Here I copy to bias_ih_l0 and set bias_hh_l0 to zero.
with tt.no_grad():
    lstm_torch.get_parameter('weight_ih_l0').copy_(tt.tensor(lstm_keras_weights[0].T))
    lstm_torch.get_parameter('weight_hh_l0').copy_(tt.tensor(lstm_keras_weights[1].T))
    lstm_torch.get_parameter('bias_ih_l0').copy_(tt.tensor(lstm_keras_weights[2]))
    lstm_torch.get_parameter('bias_hh_l0').copy_(tt.zeros(lstm_keras_weights[2].shape))
Checking the parameters again
with tt.no_grad():
    for p in lstm_torch.parameters():
        print(p.shape, p)
output:
torch.Size([12, 2]) Parameter containing:
tensor([[-0.1877, -0.3571],
[ 0.6527, 0.3403],
[ 0.2720, 0.0995],
[-0.2382, -0.2151],
[-0.0596, 0.4755],
[-0.1109, 0.3054],
[-0.6443, 0.1633],
[-0.4168, 0.2241],
[-0.3445, -0.1337],
[ 0.1247, 0.2165],
[-0.6537, 0.0137],
[-0.2854, 0.4818]], requires_grad=True)
torch.Size([12, 3]) Parameter containing:
tensor([[-0.2213, 0.4651, 0.4515],
[-0.1730, -0.5749, 0.3288],
[ 0.0367, 0.0111, -0.3775],
[ 0.1623, -0.0381, 0.3512],
[ 0.1901, 0.2272, -0.3154],
[-0.4184, 0.3993, -0.1336],
[-0.5816, 0.1162, -0.0910],
[-0.3285, -0.1907, -0.2474],
[-0.3147, -0.3022, 0.0335],
[-0.2740, -0.1394, -0.2769],
[ 0.0543, -0.2711, -0.2890],
[ 0.2429, 0.0801, -0.2722]], requires_grad=True)
torch.Size([12]) Parameter containing:
tensor([0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0.], requires_grad=True)
torch.Size([12]) Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)
Now do a forward pass through the PyTorch LSTM with the same data:
with tt.no_grad():
    out_lstm_torch, _ = lstm_torch(tt.tensor(xx))
    print(out_lstm_torch.shape, out_lstm_torch)
output:
torch.Size([1, 4, 3]) tensor([[[-0.1218, -0.0775, -0.0881],
[-0.1776, -0.1155, -0.1241],
[-0.1698, -0.1160, -0.1229],
[-0.1633, -0.0846, -0.1863]]])
Summing the absolute differences:
np.sum(np.abs(out_lstm_torch.numpy() - out_lstm_keras.numpy()))
output:
2.7567148e-07
The difference is much smaller. I also found that the initial hidden states are assumed to be all zeros by both PyTorch and Keras, so I don't need to set them manually.
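For reference, here is a minimal sketch (my own check, assuming the gru_torch and gru_keras objects from above) of passing explicit all-zero initial states; the outputs should be identical to the runs without an initial state:
with tt.no_grad():
    h0 = tt.zeros(1, batch_size, hidden_size)   # shape (num_layers, batch, hidden)
    out_torch_h0, _ = gru_torch(tt.tensor(xx), h0)
out_keras_h0 = gru_keras(xx, initial_state=tf.zeros((batch_size, hidden_size)))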

Related

batch axis in keras custom layer

I want to make a custom layer that does the following, given a batch of input vectors.
For each vector a in the batch:
get the first element a[0].
multiply the vector a by a[0] elementwise.
So if the batch is
[[ 1., 2., 3.],
[ 4., 5., 6.],
[ 7., 8., 9.],
[10., 11., 12.]]
This should be a batch of 4 vectors, each with dimension 3 (or am I wrong here?).
Then my layer should transform the batch to the following:
[[ 1., 2., 3.],
[ 16., 20., 24.],
[ 49., 56., 63.],
[100., 110., 120.]]
Here is my implementation for the layer:
class MyLayer(keras.layers.Layer):
    def __init__(self, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.activation = keras.activations.get(activation)

    def call(self, a):
        scale = a[0]
        return self.activation(a * scale)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config,
                "activation": keras.activations.serialize(self.activation)}
But the output is different from what I expected:
batch = tf.Variable([[1,2,3],
                     [4,5,6],
                     [7,8,9],
                     [10,11,12]], dtype=tf.float32)
layer = MyLayer()
print(layer(batch))
Output:
tf.Tensor(
[[ 1. 4. 9.]
[ 4. 10. 18.]
[ 7. 16. 27.]
[10. 22. 36.]], shape=(4, 3), dtype=float32)
It looks like the implementation actually treats each column as a vector, which is strange to me because other pre-written models, such as the Sequential model, specify the input shape as (batch_size, ...), which means each row, rather than each column, is a vector.
How should I modify my code so that it behaves the way I want?
Actually, your input shape is (4, 3), so when you slice this tensor with a[0] you get the first row, which is [1, 2, 3]. To get what you want, you should instead take the first column and then transpose the matrix back to get the desired result, like this:
def call(self, a):
    scale = a[:,0]
    return tf.transpose(self.activation(tf.transpose(a) * scale))
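Alternatively, here is a sketch of an equivalent fix (my own variant, not from the answer above) that avoids the double transpose by keeping the sliced column as shape (batch_size, 1) so it broadcasts across each row:
def call(self, a):
    # a[:, :1] has shape (batch_size, 1) and broadcasts over the last axis,
    # multiplying each row by its own first element.
    scale = a[:, :1]
    return self.activation(a * scale)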

tf.keras.losses.CategoricalCrossentropy gives different values than plain implementation

Does anyone know why a raw implementation of the categorical cross-entropy function differs so much from tf.keras's API function?
import tensorflow as tf
import numpy as np
import math
tf.enable_eager_execution()
y_true =np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
ce = tf.keras.losses.CategoricalCrossentropy()
res = ce(y_true, y_pred).numpy()
print("use api:")
print(res)
print()
print("implementation:")
step1 = -y_true * np.log(y_pred )
step2 = np.sum(step1, axis=1)
print("step1.shape:", step1.shape)
print(step1)
print("sum step1:", np.sum(step1, ))
print("mean step1", np.mean(step1))
print()
print("step2.shape:", step2.shape)
print(step2)
print("sum step2:", np.sum(step2, ))
print("mean step2", np.mean(step2))
Above gives:
use api:
0.3239681124687195
implementation:
step1.shape: (3, 3)
[[0.10536052 0. 0. ]
[0. 0.11653382 0. ]
[0. 0. 0.0618754 ]]
sum step1: 0.2837697356318653
mean step1 0.031529970625762814
step2.shape: (3,)
[0.10536052 0.11653382 0.0618754 ]
sum step2: 0.2837697356318653
mean step2 0.09458991187728844
If now with another y_true and y_pred:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
It gives:
use api:
16.11809539794922
implementation:
step1.shape: (1, 2)
[[-0. 25.32843602]]
sum step1: 25.328436022934504
mean step1 12.664218011467252
step2.shape: (1,)
[25.32843602]
sum step2: 25.328436022934504
mean step2 25.328436022934504
The difference is because of these values: [.5, .89, .6], since their sum is not equal to 1. I think you made a mistake and meant this instead: [.05, .89, .06].
If you provide values that sum to 1, then both formulas give the same results:
import tensorflow as tf
import numpy as np
y_true = np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#output
#[0.10536052 0.11653382 0.0618754 ]
#[0.10536052 0.11653382 0.0618754 ]
However, let's explore how it is calculated when the y_pred tensor is not scaled (i.e., the values do not sum to 1). If you look at the source code of categorical cross-entropy here, you will see that it scales y_pred so that the class probabilities of each sample sum to 1:
if not from_logits:
    # scale preds so that the class probas of each sample sum to 1
    output /= tf.reduce_sum(output,
                            reduction_indices=len(output.get_shape()) - 1,
                            keep_dims=True)
Since we passed a prediction whose probabilities do not sum to 1, let's see how this operation changes our tensor [.5, .89, .6]:
output = tf.constant([.5, .89, .6])
output /= tf.reduce_sum(output,
                        axis=len(output.get_shape()) - 1,
                        keepdims=True)
print(output.numpy())
# array([0.2512563 , 0.44723618, 0.30150756], dtype=float32)
So, the results should be equal if we take the output of the above operation (the scaled y_pred) and feed it to your own categorical cross-entropy implementation, while passing the unscaled y_pred to the TensorFlow implementation:
y_true =np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
#unscaled y_pred
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
#scaled y_pred (categorical_crossentropy scales above tensor to this internally)
y_pred = np.array([[.9, .05, .05], [0.2512563 , 0.44723618, 0.30150756], [.05, .01, .94]])
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
[0.10536052 0.80466845 0.0618754 ]
[0.10536052 0.80466846 0.0618754 ]
Now, let's explore your second example. Why does it produce a different output?
If you check the source code again, you will see this line:
output = tf.clip_by_value(output, epsilon, 1. - epsilon)
which clips values to the range [epsilon, 1 - epsilon]. Your input [0.99999999999, 0.00000000001] will be converted to [0.9999999, 0.0000001] by this line, so it gives you a different result:
y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
#now let's first clip the values less than epsilon, then compare loss
epsilon=1e-7
y_pred = tf.clip_by_value(y_pred, epsilon, 1. - epsilon)
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))
Output:
#results without clipping values
[16.11809565]
[25.32843602]
#results after clipping values if there is a value less than epsilon (1e-7)
[16.11809565]
[16.11809565]
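As a side note (my own addition, not part of the answer above): if you construct the loss with from_logits=True and feed raw scores, Keras applies a softmax internally and skips both the rescaling and the clipping, so a plain implementation that also applies a softmax should match it directly. A minimal sketch with made-up logits:
import numpy as np
import tensorflow as tf
y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
logits = np.array([[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.2, 3.0]])
ce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(ce(y_true, logits).numpy())
# plain implementation: softmax, then cross-entropy, then mean over the batch
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(np.mean(np.sum(-y_true * np.log(probs), axis=1)))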

How to specify spearman rank correlation as a loss function in keras?

I wanted to write a loss function that maximizes the Spearman rank correlation between two vectors in Keras. Unfortunately, I could not find an existing implementation, nor a good method to calculate the rank of a vector in Keras, so that I could use the formula to implement it myself.
def rank_correlation(y_true, y_pred):
    pass

model = tensorflow.keras.Sequential()
#### More model code
model.compile(loss=rank_correlation)
Can anyone please help me implement rank_correlation ?
You can try something like the following (adapted from a referenced implementation):
from scipy.stats import spearmanr
import numpy as np

def compute_spearmanr(y, y_pred):
    spearsum = 0
    cnt = 0
    for col in range(y_pred.shape[1]):
        v = spearmanr(y_pred[:,col], y[:,col]).correlation
        if np.isnan(v):
            continue
        spearsum += v
        cnt += 1
    res = spearsum / cnt
    return res

a = np.array([[2., 1., 2., 3.], [3., 3., 4., 5.]])
b = np.array([[1., 0., 0., 3.], [1., 0., 3., 3.]])
compute_spearmanr(a, b)
0.9999999999999999
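Note that compute_spearmanr above operates on NumPy arrays, so it cannot be passed to model.compile directly. Below is a minimal sketch (my own addition) of wrapping it with tf.py_function so Keras can call it on tensors; keep in mind that spearmanr is not differentiable, so no useful gradients flow through it and it is mainly usable as a monitoring metric, with a differentiable surrogate needed for actual training:
import numpy as np
import tensorflow as tf

def rank_correlation(y_true, y_pred):
    # Negate so that maximizing the correlation corresponds to minimizing the loss.
    def _neg_spearman(y_t, y_p):
        return np.float32(-compute_spearmanr(y_t.numpy(), y_p.numpy()))
    return tf.py_function(_neg_spearman, inp=[y_true, y_pred], Tout=tf.float32)

# model.compile(loss=rank_correlation)  # as in the question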

T5 Encoder model output all zeros?

I am trying out a project where I use the T5EncoderModel from HuggingFace in order to obtain hidden representations of my input sentences. I have 100K sentences which I tokenize and pad as follows:
for sentence in dataset[original]:
    sentence = tokenizer(sentence, max_length=40, padding='max_length', return_tensors='tf', truncation=True)
    original_sentences.append(sentence.input_ids)
    org_mask.append(sentence.attention_mask)
This gives me the right outputs and tokenizes everything decently. The problem arises when I try to actually train the model. The setup is a bit complex and is taken from https://keras.io/examples/vision/semantic_image_clustering/, which I am trying to apply to text.
The set-up for training is as follows:
def create_encoder(rep_dim):
    encoder = TFT5EncoderModel.from_pretrained('t5-small', output_hidden_states=True)
    encoder.trainable = True
    original_input = Input(shape=(max_length), name='originalIn', dtype=tf.int32)
    augmented_input = Input(shape=(max_length), name='originalIn', dtype=tf.int32)
    concat = keras.layers.Concatenate(axis=1)([original_input, augmented_input])
    # Take 0-index because it returns a TFBERTmodel type, and 0 returns a tensor
    encoded = encoder(input_ids=concat)[0]
    # This outputs shape: [sentences, max_length, encoded_dims]
    output = Dense(rep_dim, activation='relu')(encoded)
    return encoder
This function is fed into the RepresentationLearner class from the above link, as such:
class RepresentationLearner(keras.Model):
    def __init__(
        self,
        encoder,
        projection_units,
        temperature=0.8,
        dropout_rate=0.1,
        l2_normalize=False,
        **kwargs
    ):
        super(RepresentationLearner, self).__init__(**kwargs)
        self.encoder = encoder
        # Create projection head.
        self.projector = keras.Sequential(
            [
                layers.Dropout(dropout_rate),
                layers.Dense(units=projection_units, use_bias=False),
                layers.BatchNormalization(),
                layers.ReLU(),
            ]
        )
        self.temperature = temperature
        self.l2_normalize = l2_normalize
        self.loss_tracker = keras.metrics.Mean(name="loss")

    @property
    def metrics(self):
        return [self.loss_tracker]

    def compute_contrastive_loss(self, feature_vectors, batch_size):
        num_augmentations = tf.shape(feature_vectors)[0] // batch_size
        if self.l2_normalize:
            feature_vectors = tf.math.l2_normalize(feature_vectors, -1)
        # The logits shape is [num_augmentations * batch_size, num_augmentations * batch_size].
        logits = (
            tf.linalg.matmul(feature_vectors, feature_vectors, transpose_b=True)
            / self.temperature
        )
        # Apply log-max trick for numerical stability.
        logits_max = tf.math.reduce_max(logits, axis=1)
        logits = logits - logits_max
        # The shape of targets is [num_augmentations * batch_size, num_augmentations * batch_size].
        # targets is a matrix consisting of num_augmentations submatrices of shape [batch_size * batch_size].
        # Each [batch_size * batch_size] submatrix is an identity matrix (diagonal entries are ones).
        targets = tf.tile(tf.eye(batch_size), [num_augmentations, num_augmentations])
        # Compute cross entropy loss
        return keras.losses.categorical_crossentropy(
            y_true=targets, y_pred=logits, from_logits=True
        )

    def call(self, inputs):
        features = self.encoder(inputs[0])[0]
        # Apply projection head.
        return self.projector(features[0])

    def train_step(self, inputs):
        batch_size = tf.shape(inputs)[0]
        # Run the forward pass and compute the contrastive loss
        with tf.GradientTape() as tape:
            feature_vectors = self(inputs, training=True)
            loss = self.compute_contrastive_loss(feature_vectors, batch_size)
        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        # Update loss tracker metric
        self.loss_tracker.update_state(loss)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

    def test_step(self, inputs):
        batch_size = tf.shape(inputs)[0]
        feature_vectors = self(inputs, training=False)
        loss = self.compute_contrastive_loss(feature_vectors, batch_size)
        self.loss_tracker.update_state(loss)
        return {"loss": self.loss_tracker.result()}
In order to train it, I use the Colab TPU and train it as such:
with strategy.scope():
    encoder = create_encoder(rep_dim)
    training_model = RepresentationLearner(encoder=encoder, projection_units=128, temperature=0.1)
    lr_scheduler = keras.experimental.CosineDecay(initial_learning_rate=0.001, decay_steps=500, alpha=0.1)
    training_model.compile(optimizer=tfa.optimizers.AdamW(learning_rate=lr_scheduler, weight_decay=0.0001))
    history = training_model.fit(x=[original_train, augmented_train], batch_size=32*8, epochs=10)
    training_model.save_weights('representation_learner.h5', overwrite=True)
Note that I am giving my model two inputs. When I predict on my input data, I get all zeros, and I cannot seem to understand why. I predict as follows:
training_model.load_weights('representation_learner.h5')
feature_vectors= training_model.predict([[original_train, augmented_train]], verbose = 1)
And the output is:
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
With a way too large shape of (1000000, 128)

How does a process of optimization go with tensorflow?

I have a simple graph in TensorFlow:
(1) X = tf.Variable(dtype=tf.float32, shape=(1, 3), name="X", initial_value=np.array([[1,2,3]]))
(2) y = tf.reduce_sum(tf.square(X)) - 2 * tf.reduce_sum(tf.sin(tf.square(X)))
(3) training_op = tf.train.GradientDescentOptimizer(0.3).minimize(y)
Here's the code for 5 steps of gradient descent:
with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
        (4) *res, _ = sess.run(fetches=[X, y, training_op])
        print(res)
[array([[1., 2., 3.]], dtype=float32), 13.006426]
[array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
[array([[ 0.9910337 , -1.0735381 , 0.10702228]], dtype=float32), -1.3677568]
[array([[ 1.0567244 , -0.95272505, 0.17122723]], dtype=float32), -1.3784065]
[array([[ 0.978967 , -1.0848547 , 0.27387527]], dtype=float32), -1.4229481]
I'm trying to figure out how its optimization process goes. Could you please explain it step by step?
I thought it should be like this:
Evaluate X (1)
Evaluate y (2)
Calculate the gradient and make a step (3) (as it says here, "Calling minimize() takes care of both computing the gradients and applying them to the variables.")
Then yield all the variables requested in fetches (4)
But the output shows that the first run yields the initial values, so I'm confused...
tf version == '1.15.0'
Thank you in advance!
upd1. If I change the order in the fetches list, the output is still the same.
with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
        _, *res = sess.run(fetches=[training_op, X, y])
        print(res)
[array([[1., 2., 3.]], dtype=float32), 13.006426]
[array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
[array([[ 0.9910337 , -1.0735381 , 0.10702228]], dtype=float32), -1.3677568]
[array([[ 1.0567244 , -0.95272505, 0.17122723]], dtype=float32), -1.3784065]
[array([[ 0.978967 , -1.0848547 , 0.27387527]], dtype=float32), -1.4229481]
upd2. A slight modification of the answer by @thushv89 does what I initially expected to see:
with tf.Session() as sess:
    sess.run(init)
    for i in range(2):
        res = sess.run(fetches=[X, y])
        print('Variables before the step', res)
        sess.run(training_op)
        res = sess.run(fetches=[X, y])
        print('Variables after the step', res)
        print()
Variables before the step [array([[1., 2., 3.]], dtype=float32), 13.006426]
Variables after the step [array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
Variables before the step [array([[ 1.0483627 , -0.76874477, -2.080069 ]], dtype=float32), 4.9738936]
Variables after the step [array([[ 0.9910337 , -1.0735381 , 0.10702228]], dtype=float32), -1.3677568]
You have fetches=[X, y, training_op]. These don't respect the order (at least, you shouldn't expect sess.run() to respect the order). This means that all of the following,
Evaluate X (so the training_op hasn't happened yet)
Evaluate y (still the training_op hasn't happened yet)
Execute training_op (now X and y have changed)
get executed, and then the results are fetched. If you want the variable X to change first:
Option 1: Breaking the sess.run() function
r1 = sess.run(X)
_, r2 = sess.run(fetches=[training_op, y])
print(r1,r2)
Option 2: Using a separate tf.Variable with tf.control_dependencies
X = tf.Variable(dtype=tf.float32, shape=(1, 3), name="X", initial_value=np.array([[1,2,3]]))
prevX = tf.Variable(dtype=tf.float32, shape=(1, 3), name="prevX", initial_value=np.array([[1,2,3]]))
y = tf.reduce_sum(tf.square(X)) - 2 * tf.reduce_sum(tf.sin(tf.square(X)))
assign_op = tf.assign(prevX, X)
with tf.control_dependencies([assign_op]):
    training_op = tf.train.GradientDescentOptimizer(0.3).minimize(y)

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    for i in range(5):
        *res, _ = sess.run(fetches=[prevX, y, training_op])
        print(res)