TensorFlow: running multiple independent models training asynchronously - tensorflow

I use Ubuntu.
OK, I want to fine-tune a simple NN and evaluate TensorFlow settings. So I have combinations of [arg1=activation function, arg2=optimizer, arg3=loss_function] and I want to run the training N times and evaluate the results. My problem is that the model functions run synchronously and the GPU is basically idle.
I found out that my model needs about 10 MB of memory,
so I split 90% of the GPU memory into 15 MB blocks and got N logical_devices.
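A minimal sketch of that kind of split (TF 2.x; N and the per-device memory limit are illustrative, and the logical-device configuration must be applied before the GPU is first initialized):
# Minimal sketch: split one physical GPU into N small logical devices.
# N and memory_limit (in MB) are illustrative values.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
N = 8
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=15) for _ in range(N)],
)
logical_devices = tf.config.list_logical_devices("GPU")
print([d.name for d in logical_devices])  # e.g. /device:GPU:0 ... /device:GPU:7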
Now I want to instruct the GPU logical_devices to run the following code asynchronously and fill a results list.
def independant_model_train(device_name, activationarg, optimizerarg, lossarg, btparam, epparam):
    with tf.device(device_name):
        model = keras.Sequential()
        model.add(keras.layers.Dense(units=output_dimension,
                                     batch_input_shape=[1, input_sample_dimension],
                                     use_bias=True))
        model.add(keras.layers.Dense(units=output_dimension, activation=activationarg, use_bias=True))
        model.compile(optimizer=optimizerarg, loss=lossarg)
        model.fit(inputs, labels, batch_size=int(btparam), epochs=int(epparam))
        predictionsarray = model.predict(datatopredict)
        return predictionsarray

for li in range(len(logical_devices)):
    print("Working with GPU..:")
    print(logical_devices[li].name)
    # ...
    # How can I make this call run asynchronously and fill myresult asynchronously?
    myresult.append(independant_model_train(logical_devices[li].name, <the_other_args>))

# How can I determine that every model ended up with a prediction, and then print the result?
print(myresult)
Thank you!
...I tried to make independant_model_train asynchronous, but in the Ubuntu terminal I get the training output one after the other, as if everything still runs serially/synchronously:
asyncresult = asyncio.run(independant_model_train(...))
myresult.append(asyncresult)
in combination with
async def independant_model_train(...):
For 2 logical GPUs, it outputs in the Ubuntu terminal:
working with..: /device:GPU:0
Epoch 1/2
3/3 [==============================] - 1s 4ms/step - loss: 3.0154
Epoch 2/2
3/3 [==============================] - 0s 3ms/step - loss: 2.6918
..and then..
working with..: /device:GPU:1
Epoch 1/2
3/3 [==============================] - 1s 4ms/step - loss: 3.0154
Epoch 2/2
3/3 [==============================] - 0s 3ms/step - loss: 2.6918
Not even close to async, and the fit times simply add up, unchanged for every model.
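One possible direction (a minimal sketch, not verified on the setup above): model.fit() is a blocking call, so an async def wrapper alone still runs serially; submitting each per-device training to its own thread is one way to overlap them, since TensorFlow releases the GIL for most of its C++ work. The data, model and argument values below are toy placeholders.
# Minimal sketch: run one training per logical GPU concurrently using threads.
# inputs/labels and the layer sizes are toy stand-ins for the real setup.
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import tensorflow as tf
from tensorflow import keras

logical_devices = tf.config.list_logical_devices("GPU")
inputs = np.random.rand(32, 4).astype("float32")
labels = np.random.rand(32, 2).astype("float32")

def train_on_device(device_name, activationarg, optimizerarg, lossarg):
    # hypothetical per-device worker, mirroring independant_model_train()
    with tf.device(device_name):
        model = keras.Sequential([
            keras.layers.Dense(2, input_shape=(4,), use_bias=True),
            keras.layers.Dense(2, activation=activationarg, use_bias=True),
        ])
        model.compile(optimizer=optimizerarg, loss=lossarg)
        model.fit(inputs, labels, batch_size=8, epochs=2, verbose=0)
        return model.predict(inputs, verbose=0)

with ThreadPoolExecutor(max_workers=max(len(logical_devices), 1)) as pool:
    futures = [pool.submit(train_on_device, dev.name, "relu", "adam", "mse")
               for dev in logical_devices]
    myresult = [f.result() for f in futures]  # blocks until every model has a prediction

print(len(myresult), "models finished")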

Related

Training model in Keras [duplicate]

How is Accuracy defined when the loss function is mean square error? Is it mean absolute percentage error?
The model I use has a linear output activation and is compiled with loss='mean_squared_error':
model.add(Dense(1))
model.add(Activation('linear')) # number
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
and the output looks like this:
Epoch 99/100
1000/1000 [==============================] - 687s 687ms/step - loss: 0.0463 - acc: 0.9689 - val_loss: 3.7303 - val_acc: 0.3250
Epoch 100/100
1000/1000 [==============================] - 688s 688ms/step - loss: 0.0424 - acc: 0.9740 - val_loss: 3.4221 - val_acc: 0.3701
So what does e.g. val_acc: 0.3250 mean? Mean squared error should be a scalar, not a percentage - shouldn't it? So is val_acc mean squared error, mean percentage error, or some other function?
From the definition of MSE on Wikipedia: https://en.wikipedia.org/wiki/Mean_squared_error
The MSE is a measure of the quality of an estimator—it is always
non-negative, and values closer to zero are better.
Does that mean a value of val_acc: 0.0 is better than val_acc: 0.325?
Edit: more examples of the accuracy metric's output as I train - the accuracy increases as I train more, while the loss function (MSE) should decrease. Is accuracy well defined for MSE, and how is it defined in Keras?
PoolAllocator: After 14014 get requests, put_count=14032 evicted_count=1000 eviction_rate=0.0712657 and unsatisfied allocation rate=0.071714
1000/1000 [==============================] - 453s 453ms/step - loss: 17.4875 - acc: 0.1443 - val_loss: 98.0973 - val_acc: 0.0333
Epoch 2/100
1000/1000 [==============================] - 443s 443ms/step - loss: 6.6793 - acc: 0.1973 - val_loss: 11.9101 - val_acc: 0.1500
Epoch 3/100
1000/1000 [==============================] - 444s 444ms/step - loss: 6.3867 - acc: 0.1980 - val_loss: 6.8647 - val_acc: 0.1667
Epoch 4/100
1000/1000 [==============================] - 445s 445ms/step - loss: 5.4062 - acc: 0.2255 - val_loss: 5.6029 - val_acc: 0.1600
Epoch 5/100
783/1000 [======================>.......] - ETA: 1:36 - loss: 5.0148 - acc: 0.2306
There are at least two separate issues with your question.
The first one should be clear by now from the comments by Dr. Snoopy and the other answer: accuracy is meaningless in a regression problem, such as yours; see also the comment by patyork in this Keras thread. For good or bad, the fact is that Keras will not "protect" you or any other user from making meaningless requests in your code, i.e. you will not get any error, or even a warning, that you are attempting something that does not make sense, such as requesting the accuracy in a regression setting.
Having clarified that, the other issue is:
Since Keras does indeed return an "accuracy", even in a regression setting, what exactly is it and how is it calculated?
To shed some light here, let's revert to a public dataset (since you do not provide any details about your data), namely the Boston house price dataset (saved locally as housing.csv), and run a simple experiment as follows:
import numpy as np
import pandas
import keras
from keras.models import Sequential
from keras.layers import Dense
# load dataset
dataframe = pandas.read_csv("housing.csv", delim_whitespace=True, header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]
model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model asking for accuracy, too:
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y,
          batch_size=5,
          epochs=100,
          verbose=1)
As in your case, the model fitting history (not shown here) shows a decreasing loss and a roughly increasing accuracy. Let's now evaluate the model performance on the same training set, using the appropriate Keras built-in function:
score = model.evaluate(X, Y, verbose=0)
score
# [16.863721372581754, 0.013833992168483997]
The exact contents of the score array depend on what exactly we have requested during model compilation; in our case here, the first element is the loss (MSE), and the second one is the "accuracy".
At this point, let us have a look at the definition of Keras binary_accuracy in the metrics.py file:
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
So, after Keras has generated the predictions y_pred, it first rounds them, and then checks to see how many of them are equal to the true labels y_true, before getting the mean.
Let's replicate this operation using plain Python & Numpy code in our case, where the true labels are Y:
y_pred = model.predict(X)
l = len(Y)
acc = sum([np.round(y_pred[i])==Y[i] for i in range(l)])/l
acc
# array([0.01383399])
Well, bingo! This is actually the same value returned by score[1] above...
To make a long story short: since you (erroneously) request metrics=['accuracy'] in your model compilation, Keras will do its best to satisfy you, and will return some "accuracy" indeed, calculated as shown above, despite this being completely meaningless in your setting.
There are quite a few settings where Keras, under the hood, performs rather meaningless operations without giving any hint or warning to the user; two of them I have happened to encounter are:
Giving meaningless results when, in a multi-class setting, one happens to request loss='binary_crossentropy' (instead of categorical_crossentropy) with metrics=['accuracy'] - see my answers in Keras binary_crossentropy vs categorical_crossentropy performance? and Why is binary_crossentropy more accurate than categorical_crossentropy for multiclass classification in Keras?
Disabling completely Dropout, in the extreme case when one requests a dropout rate of 1.0 - see my answer in Dropout behavior in Keras with rate=1 (dropping all input units) not as expected
The loss function (mean squared error in this case) is used to indicate how far your predictions deviate from the target values. In the training phase, the weights are updated based on this quantity. If you are dealing with a classification problem, it is quite common to define an additional metric called accuracy. It monitors in how many cases the correct class was predicted. This is expressed as a value between 0 and 1: 0.0 means no correct decisions and 1.0 means only correct decisions.
While your network is training, the loss is decreasing and usually the accuracy increases.
Note that, in contrast to the loss, the accuracy is usually not used to update the parameters of your network. It helps to monitor the learning progress and the current performance of the network.
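If the goal is simply to have an interpretable metric while training a regression model, one option is to request a regression metric such as mean absolute error instead of accuracy. A minimal sketch, reusing X and Y from the housing example above:
# Minimal sketch: ask for a regression metric (MAE) instead of 'accuracy'.
# X and Y are the features/targets loaded from housing.csv earlier.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
model.fit(X, Y, batch_size=5, epochs=100, verbose=1)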
@desertnaut has said it very clearly.
Consider the following two pieces of code: the compile code and the binary_accuracy code:
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
Your labels should be integers, because Keras does not round y_true (only y_pred), and that is how you get a high "accuracy"...

How to use a language model for prediction after fine-tuning?

I've trained/fine-tuned a Spanish RoBERTa model that has recently been pre-trained for a variety of NLP tasks except for text classification.
Since the baseline model seems promising, I want to fine-tune it for a different task: text classification, more precisely sentiment analysis of Spanish tweets, and use it to predict labels for the scraped tweets I have.
The preprocessing and the training seem to work correctly. However, I don't know how I can use this model afterwards for prediction.
I'll leave out the preprocessing part because I don't think there's an issue there.
Code:
# Training with native TensorFlow
from transformers import TFAutoModelForSequenceClassification
## Model Definition
model = TFAutoModelForSequenceClassification.from_pretrained("BSC-TeMU/roberta-base-bne", from_pt=True, num_labels=3)
## Model Compilation
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.metrics.SparseCategoricalAccuracy()
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=metric)
## Fitting the data
history = model.fit(train_dataset.shuffle(1000).batch(64), epochs=3, batch_size=64)
Output:
/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py:337: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`.
"Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/5
16/16 [==============================] - 35s 1s/step - loss: 1.0455 - sparse_categorical_accuracy: 0.4452
Epoch 2/5
16/16 [==============================] - 18s 1s/step - loss: 0.6923 - sparse_categorical_accuracy: 0.7206
Epoch 3/5
16/16 [==============================] - 18s 1s/step - loss: 0.3533 - sparse_categorical_accuracy: 0.8885
Epoch 4/5
16/16 [==============================] - 18s 1s/step - loss: 0.1871 - sparse_categorical_accuracy: 0.9477
Epoch 5/5
16/16 [==============================] - 18s 1s/step - loss: 0.1031 - sparse_categorical_accuracy: 0.9714
Question:
How can I use the model after fine-tuning for text classification/sentiment analysis? (I want to create a predicted label for each tweet I scraped.)
What would be a good way of approaching this?
I've tried to save the model, but I don't know where I can find it and how to use it afterwards:
# Save the model
model.save_pretrained('Twitter_Roberta_Model')
I've also tried to just add it to a HuggingFace pipeline like the following. But I'm not sure if this works correctly.
classifier = pipeline('sentiment-analysis',
                      model=model,
                      tokenizer=AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-bne"))
Although this is an example for a specific model (DistilBERT), the following prediction code should work similarly (with small modifications according to your needs). You just need to replace DistilBERT with your model class (TFAutoModelForSequenceClassification) and, of course, ensure the proper tokenizer is used.
import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
loaded_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
loaded_model.load_weights('./distillbert_tf.h5')

input_text = "The text on which I test"
input_text_tokenized = tokenizer.encode(input_text,
                                        truncation=True,
                                        padding=True,
                                        return_tensors="tf")
prediction = loaded_model(input_text_tokenized)  # returns the logits
prediction_logits = prediction[0]
prediction_probs = tf.nn.softmax(prediction_logits, axis=1).numpy()
print(f'The prediction probs are: {prediction_probs}')
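Adapting this to the fine-tuned RoBERTa above, a minimal sketch (assuming the model was saved with model.save_pretrained('Twitter_Roberta_Model') as in the question; the example tweets and the index-to-label mapping are placeholders):
# Minimal sketch: reload the saved fine-tuned model and predict a label per tweet.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-bne")
loaded_model = TFAutoModelForSequenceClassification.from_pretrained("Twitter_Roberta_Model")

tweets = ["Me encanta este producto", "Qué día tan horrible"]  # placeholder tweets
encodings = tokenizer(tweets, truncation=True, padding=True, return_tensors="tf")

logits = loaded_model(encodings).logits
probs = tf.nn.softmax(logits, axis=1).numpy()
predicted_labels = probs.argmax(axis=1)  # map indices 0..2 to your own label names
print(predicted_labels)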

How can I evaluate a Huggingface model after fine-tuning? [duplicate]


Why is a Keras custom metric not called after each epoch? [SOLVED]

I'm using TensorFlow 2.1 and tf.keras. Whether I add a custom metric as a simple function or as a Metric subclass instance, during training the custom metric appears to be called only at the start of training and not after each epoch, as I was expecting. The start of a run is shown below as an example (with a much larger validation split than usual, as a test), where the metric prints during update_state() and result(), but only until epoch 1. In this case the metric returns a dummy number that increments on each update call, reaching 2 and going no further. Supplied metrics such as BinaryAccuracy do produce varying numbers after each epoch, so I assume I'm missing or misunderstanding something. What could explain the observed behaviour?
update
result
Train on 4418 samples, validate on 4418 samples
Epoch 1/120
update
result
result
3800/4418 [========================>.....] - ETA: 0s - loss: 0.1597 - binary_true_positives: 2.0000 result
4418/4418 [==============================] - 3s 622us/sample - loss: 0.1500 - binary_true_positives: 2.0000 - val_loss: 0.0986 - val_binary_true_positives: 2.0000
Epoch 2/120
4418/4418 [==============================] - 0s 89us/sample - loss: 0.0868 - binary_true_positives: 2.0000 - val_loss: 0.0643 - val_binary_true_positives: 2.0000
SOLVED: After looking further, the custom metrics are being produced after each epoch as expected; it's just that the Python code that produces them isn't being executed. It dawned on me that the code was likely transformed into a TensorFlow graph (it is viewable in TensorBoard), and the metrics were generated by executing the compiled graph rather than the original Python code from which the graph representation was produced.
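A minimal sketch of that behaviour with a dummy counting metric (not the original one): the Python-side print() only fires while Keras traces update_state() into a graph, whereas tf.print() is embedded in the graph and fires on every batch.
# Minimal sketch (TF 2.x): print() runs only during tracing, tf.print() on every call.
import tensorflow as tf

class CountingMetric(tf.keras.metrics.Metric):
    def __init__(self, name="counting_metric", **kwargs):
        super().__init__(name=name, **kwargs)
        self.count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        print("update (Python side, only during tracing)")
        tf.print("update (graph side, every batch)")
        self.count.assign_add(1.0)

    def result(self):
        return self.count

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse", metrics=[CountingMetric()])
model.fit(tf.random.normal((64, 4)), tf.random.normal((64, 1)), epochs=2, verbose=0)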
Because nothing changes after each epoch: the training and validation samples are the same and don't change between iterations.

Can I provide multiple targets to a seq2seq model?

I'm doing video captioning on the MSR-VTT dataset.
In this dataset, I've got 10,000 videos and, for each video, I've got 20 different captions.
My model consists of a seq2seq RNN. Encoder's inputs are the videos features, decoder's inputs are embedded target captions and decoder's output are predicted captions.
I'm wondering whether using the same videos several times with different captions is useful or not.
Since I couldn't find explicit info, I tried to benchmark it.
Benchmark:
Model 1: One caption for each video
I trained it on 1108 sport videos, with a batch size of 5, over 60 epochs. This configuration takes about 211 seconds per epoch.
Epoch 1/60 ; Batch loss: 5.185806 ; Batch accuracy: 14.67% ; Test accuracy: 17.64%
Epoch 2/60 ; Batch loss: 4.453338 ; Batch accuracy: 18.51% ; Test accuracy: 20.15%
Epoch 3/60 ; Batch loss: 3.992785 ; Batch accuracy: 21.82% ; Test accuracy: 54.74%
...
Epoch 10/60 ; Batch loss: 2.388662 ; Batch accuracy: 59.83% ; Test accuracy: 58.30%
...
Epoch 20/60 ; Batch loss: 1.228056 ; Batch accuracy: 69.62% ; Test accuracy: 52.13%
...
Epoch 30/60 ; Batch loss: 0.739343; Batch accuracy: 84.27% ; Test accuracy: 51.37%
...
Epoch 40/60 ; Batch loss: 0.563297 ; Batch accuracy: 85.16% ; Test accuracy: 48.61%
...
Epoch 50/60 ; Batch loss: 0.452868 ; Batch accuracy: 87.68% ; Test accuracy: 56.11%
...
Epoch 60/60 ; Batch loss: 0.372100 ; Batch accuracy: 91.29% ; Test accuracy: 57.51%
Model 2: 12 captions for each video
Then I trained on the same 1108 sport videos, with a batch size of 64.
This configuration takes about 470 seconds per epoch.
Since I have 12 captions for each video, the total number of samples in my dataset is 1108*12.
That's why I took this batch size (64 ~= 12*old_batch_size), so the two models launch the optimizer the same number of times.
Epoch 1/60 ; Batch loss: 5.356736 ; Batch accuracy: 09.00% ; Test accuracy: 20.15%
Epoch 2/60 ; Batch loss: 4.435441 ; Batch accuracy: 14.14% ; Test accuracy: 57.79%
Epoch 3/60 ; Batch loss: 4.070400 ; Batch accuracy: 70.55% ; Test accuracy: 62.52%
...
Epoch 10/60 ; Batch loss: 2.998837 ; Batch accuracy: 74.25% ; Test accuracy: 68.07%
...
Epoch 20/60 ; Batch loss: 2.253024 ; Batch accuracy: 78.94% ; Test accuracy: 65.48%
...
Epoch 30/60 ; Batch loss: 1.805156 ; Batch accuracy: 79.78% ; Test accuracy: 62.09%
...
Epoch 40/60 ; Batch loss: 1.449406 ; Batch accuracy: 82.08% ; Test accuracy: 61.10%
...
Epoch 50/60 ; Batch loss: 1.180308 ; Batch accuracy: 86.08% ; Test accuracy: 65.35%
...
Epoch 60/60 ; Batch loss: 0.989979 ; Batch accuracy: 88.45% ; Test accuracy: 63.45%
Here is the intuitive representation of my datasets:
How can I interpret these results?
When I manually looked at the test predictions, Model 2 predictions looked more accurate than Model 1 ones.
In addition, I used a batch size of 64 for Model 2. That means I could obtain even better results by choosing a smaller batch size. It seems I can't improve the training method for Model 1, since its batch size is already very low.
On the other hand, Model 1 has better loss and training accuracy results...
What should I conclude?
Does Model 2 constantly overwrite the previously trained captions with the new ones instead of adding new possible captions?
I'm wondering whether using the same videos several times with different captions is useful or not.
I think it definitely is. It can be interpreted as the mapping from video to captions not being one-to-one, and thus the weights get trained more on the video context.
Since the video-to-caption mapping is not one-to-one, even if the neural network is arbitrarily dense it should never achieve 100% training accuracy (or zero loss), which reduces overfitting significantly.
When I manually looked at the test predictions, Model 2 predictions looked more accurate than Model 1 ones.
Nice! Same is visible here:
Model1; Batch accuracy: 91.29% ; Test accuracy: 57.51%
Model2; Batch accuracy: 88.45% ; Test accuracy: 63.45%
Increasing Generalization!!
In addition, I used a batch size of 64 for Model 2. That means I could obtain even better results by choosing a smaller batch size. It seems I can't improve the training method for Model 1, since its batch size is already very low.
I might not be the right person to comment on the value of the batch_size here, but increasing it a bit more should be worth a try.
batch_size is a balance between moving the previous knowledge towards the current batch (trying to converge in different directions over time, based on the learning rate) vs. trying to learn similar knowledge again and again (converging in almost the same direction).
And remember, there are a lot of other ways to improve the results.
On the other hand, Model 1 has better loss and training accuracy results...
What should I conclude?
The training accuracy and loss value tell you how the model performs on the training data, not on the validation/test data. In other words, a very small loss value might mean memorization.
Does Model 2 constantly overwrite the previously trained captions with the new ones instead of adding new possible captions?
That depends on how the data is being split into batches: are the multiple captions of the same video in the same batch, or spread over multiple batches?
Remember, Model 2 has multiple captions per video, which might be a major factor behind the generalization - and which also keeps the training loss value higher.
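To make that concrete, here is a minimal sketch (with placeholder data standing in for the real MSR-VTT features and captions) of expanding each video into one sample per caption and shuffling, so that the 12 captions of a video get spread over different batches:
# Minimal sketch with placeholder data: one (features, caption) sample per pair,
# shuffled so captions of the same video land in different batches.
import random

video_ids = [f"video{i}" for i in range(1108)]
video_features = {vid: f"features_of_{vid}" for vid in video_ids}  # placeholders
captions_per_video = {vid: [f"caption {k} of {vid}" for k in range(12)] for vid in video_ids}

pairs = [(video_features[vid], cap) for vid in video_ids for cap in captions_per_video[vid]]
random.shuffle(pairs)  # spreads a video's 12 captions across different batches

batch_size = 64
batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
print(len(pairs), len(batches))  # 13296 samples, 208 batches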
Thanks!