I'm loading the MNIST data into Google Colab, which I can see contains about 60,000 samples.
But for some reason it only trains on 1875 samples:
What is going on here?
1875 is not the number of samples; it is the number of steps (batches) per epoch.
Keras trains in batches, and the default batch size is 32.
60000 / 32 = 1875
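As a quick illustration (a minimal sketch with a placeholder model, not the asker's code), the progress bar counts steps, and changing batch_size changes that count without changing how many samples are seen:

import math
import tensorflow as tf

# Load MNIST: 60,000 training samples.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# A small placeholder classifier, just to have something to fit.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print(math.ceil(len(x_train) / 32))                    # 1875 steps per epoch with the default batch size
model.fit(x_train, y_train, epochs=1, batch_size=64)   # progress bar now shows 938 steps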
Related
I would like to fit a keras model but instead of running through all the batches in each epoch, I want to shuffle the whole training dataset (at the start of each epoch) then train only on the first 20 batches.
I have tried using steps_per_epoch but that restarts the next epoch where the previous one left off. What I want to do is, for every epoch:
shuffle the training dataset
split into batches of size 100
train on the first 20 batches
Can I achieve this using Keras? I suspect I may need to use TensorFlow directly rather than Keras, but I'm not sure how to go about it.
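One way to express this with the tf.data API (a hedged sketch, assuming the data already lives in train_x / train_y arrays and model is a compiled Keras model; the names are illustrative):

import tensorflow as tf

# Reshuffle the whole training set at the start of every epoch, split it into
# batches of 100, and keep only the first 20 batches.
dataset = (tf.data.Dataset.from_tensor_slices((train_x, train_y))
           .shuffle(buffer_size=len(train_x), reshuffle_each_iteration=True)
           .batch(100)
           .take(20))

# fit() re-iterates the dataset each epoch, and the shuffle happens before the
# take, so every epoch trains on a fresh random subset of 20 batches.
model.fit(dataset, epochs=10)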
I'm trying out a simple sequential model with the below dataset.
Using Colab PRO with 35 GB RAM + 225 GB Disk space.
Total sentences - 59000
Total words - 160000
Padded seq length - 38
So train_x (59000,37), train_y (59000)
I'm using FastText for the embedding layer. The FastText model generated a weight matrix with
vocab_size (rows): 113000
embedding_size (columns/dimensionality): 8815
Here is what model.summary() looks like:
It takes about 15 minutes to compile the model, but .fit crashes because it runs out of memory.
I've brought the batch_size down to 4 (vs. the default of 32), but still no luck.
epochs = 2
verbose = 0
batch_size = 4
history = seq_model.fit(train_x, train_y, epochs=epochs, verbose=verbose,
                        callbacks=[csv_logger], batch_size=batch_size)
Appreciate any ideas to make this work.
If what I am seeing is right, your model is simply too large!
It has almost 1.5 billion parameters. That's far too many.
Reducing the batch size will not help at all.
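A rough back-of-the-envelope check (a sketch; the total depends on the rest of the architecture, but the embedding layer alone dominates):

vocab_size = 113_000
embedding_size = 8_815

embedding_params = vocab_size * embedding_size    # ~996 million parameters in the embedding alone
embedding_bytes = embedding_params * 4            # float32 weights -> roughly 4 GB
print(embedding_params, embedding_bytes / 1e9)

Optimizers such as Adam keep additional per-parameter state, so the real footprint during training is a multiple of that, which is why shrinking the batch size barely moves the needle. Typical FastText embeddings are on the order of 100-300 dimensions, so an embedding_size of 8815 is worth double-checking.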
Previously I implemented two CNN models (ResNet50V2 and InceptionResNetV2) with a dataset containing 3662 images. Both worked fine in Google Colab during training and validation. Now, when I re-run exactly the same code, the training samples per epoch have dropped to only 92 (it used to be 2929 per epoch). The two models are in separate notebooks, and both show this behaviour now.
I thought it might be due to limited RAM (after a month of Google Colab it seemed to have been cut in half), so I upgraded to Colab Pro with 25 GB of RAM. That didn't solve the problem.
Has anyone had the same issue? Can anyone give a clue as to what the reason could be and how to fix it? Many thanks!
Some code from the end of the workflow (it worked well before):
model = tf.keras.applications.InceptionResNetV2(
    include_top=True, weights=None, input_tensor=None, input_shape=None,
    pooling=None, classes=5)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(X, y_orig, epochs=20, batch_size=32, validation_split=0.2,
          callbacks=[tensorboard_callback])
So I think I found the reason: the number displayed during training is the number of batches, not the number of samples. In my case: 2929 (training samples) / 32 (batch_size) = 91.5, which is rounded up to the 92 shown during training.
To test it, I changed the batch size to 8 and got 366 steps per epoch. The overall training time also stays the same, suggesting that the number of training samples hasn't actually changed.
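A quick sanity check of that arithmetic (a sketch, assuming the 80/20 split implied by validation_split=0.2 in the fit call above):

import math

total_images = 3662
train_samples = int(total_images * 0.8)    # 2929 samples go to training
print(math.ceil(train_samples / 32))       # 92 -> the figure shown per epoch is steps, not samples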
Are you using TensorFlow v1 or v2?
Does the problem persist if you switch to 1.x by running a cell with %tensorflow_version 1.x before importing TensorFlow?
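For reference, a minimal Colab cell showing that switch (the %tensorflow_version magic works only in Colab and must run before TensorFlow is imported for the first time):

%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)   # should report a 1.x release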
Apologies if my questions are relatively simple, but I've only recently started approaching TensorFlow, with the aim of learning new skills.
I'm following the example, but there are several things I can't figure out:
In the "explore the data" section, the dataset sizes come back as 60k/10k for train and test respectively.
- Where is the train/test split size declared?
- Packages like scikit-learn allow this to be specified as a percentage when invoking the split methods.
In the "train the model" part, when the 5 epochs are trained, the number 1875 appears below.
- What is that number?
- I was expecting the training to run over the 60k items, but even multiplying 1875 by 5 doesn't reach 10k.
The dataset is loaded using the TensorFlow Datasets API.
The source itself defines the split: 60K (train) and 10K (test).
https://www.tensorflow.org/datasets/catalog/fashion_mnist
An epoch is one complete pass over all the training samples. The training is done in batches; in the example you refer to, a batch size of 32 is used, so completing one epoch takes 1875 steps (60000 / 32).
Hope this helps.
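A small sketch of how to confirm both numbers yourself (assuming the tensorflow_datasets package is available):

import math
import tensorflow_datasets as tfds

# The train/test sizes are defined by the dataset itself, not by your code.
_, info = tfds.load('fashion_mnist', with_info=True)
train_size = info.splits['train'].num_examples   # 60000
test_size = info.splits['test'].num_examples     # 10000

# The 1875 shown by fit() is steps per epoch, not samples.
print(train_size, test_size, math.ceil(train_size / 32))   # 60000 10000 1875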
I'm training with TensorFlow on a GPU. I have a 1-layer GRU cell, a batch size of 800, and I run 10 epochs. I see these spikes in the accuracy graph from TensorBoard and I don't understand why. See the image.
If you count the spikes, there are 10 of them, the same as the number of epochs. I tried this with different configurations, reducing the batch size and increasing the number of layers, but the spikes are still there.
You can find the code here if it helps.
I use tf.RandomShuffleQueue for the data with infinite epochs, and I calculate how many steps it should run. I don't think the problem is in how I calculate the accuracy (here). Do you have any suggestions as to why this happens?
EDIT
min_after_dequeue=2000