I'm trying out a simple sequential model with the below dataset.
Using Colab PRO with 35 GB RAM + 225 GB Disk space.
Total sentences - 59000
Total words - 160000
Padded seq length - 38
So train_x (59000,37), train_y (59000)
I'm using FastText for the embedding layer. The FastText model generated weights with:
(rows) vocab_size 113000
(columns/dimensionality) embedding_size 8815
Here is what model.summary() looks like.
It takes about 15 minutes to compile the model, but .fit crashes because it runs out of memory.
I've brought the batch_size down to 4 (vs the default of 32), still no luck.
epochs=2
verbose=0
batch_size=4
history = seq_model.fit(train_x,train_y, epochs=epochs, verbose=verbose,callbacks=[csv_logger],batch_size=batch_size)
Appreciate any ideas to make this work.
If what I am seeing is right, your model is simply too large!
It has almost 1.5 billion parameters. That's way too much.
Reducing the batch size will not help at all.
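For a sense of scale, a rough back-of-the-envelope calculation, assuming the embedding matrix really is vocab_size x embedding_size as described above (a sketch, not a measurement of the actual model):

vocab_size = 113_000
embedding_size = 8_815

# Parameter count of the embedding layer alone, stored as float32.
embedding_params = vocab_size * embedding_size
print(embedding_params)                      # ~996 million weights
print(embedding_params * 4 / 1024**3, "GB")  # ~3.7 GB just for the embedding matrix

The embedding layer alone accounts for the bulk of the parameter count, so shrinking the embedding dimension or the vocabulary will do far more for memory than shrinking the batch size.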
Are these two the same batch size, or do they have different meanings?
BATCH_SIZE=10
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.batch(BATCH_SIZE)
And the second:
history = model.fit(train_ds,
                    epochs=EPOCHS,
                    validation_data=create_dataset(X_valid, y_valid_bin),
                    max_queue_size=1,
                    workers=1,
                    batch_size=10,
                    use_multiprocessing=False)
I'm having a problem with running out of RAM.
Training images: about 333,000
RAM: 30 GB
GPU: 12 GB
What should the batch size be for this?
Full Code Here
Data-Set (Batch Size)
Batch size only means how much data passes through the pipeline you defined at a time. In the case of a Dataset, the batch size represents how much data is passed to the model in one iteration. For example, if you build a data generator and set the batch size to 8, then on every iteration the generator yields 8 records.
Model.fit (Batch Size)
In model.fit, when we set the batch size it means the model will calculate the loss after being fed a number of records equal to the batch size. Deep learning models compute a loss on the forward pass and then improve themselves through backpropagation. So if you set a batch size of 8 in model.fit, 8 records are passed to the model, the loss is calculated over those 8 records, and the model updates itself from that loss.
Example:
If you set the dataset batch size to 4 and the model.fit batch size to 8, your dataset generator has to iterate twice to hand 8 images to the model, and model.fit performs only one iteration to calculate the loss.
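A minimal sketch of the dataset side of this, using toy tensors in place of real images (the names here are illustrative, not taken from the question's code):

import tensorflow as tf

# Toy data: 16 samples of 4 features each, plus integer labels.
features = tf.random.normal((16, 4))
labels = tf.random.uniform((16,), maxval=2, dtype=tf.int32)

# Batching happens on the dataset itself.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

for batch_x, batch_y in dataset:
    print(batch_x.shape)  # (8, 4) -- each iteration yields one batch of 8 records

Note that in recent TF 2.x releases, Keras refuses the batch_size argument when model.fit is given an already-batched tf.data.Dataset, so in that case the .batch() on the dataset is the setting that actually takes effect.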
RAM Issue
What is your image size? Try reducing the batch size: steps per epoch are not what drives RAM use, the batch size is. If you use a batch size of 10, then 10 images have to be loaded into RAM at once for processing, and your RAM may not be able to hold 10 of them at the same time. Try a batch size of 4 or 2; that may help.
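Since the pipeline above already starts from from_tensor_slices((filenames, labels)), another way to keep RAM use low is to decode images lazily inside the pipeline instead of loading them all up front. A sketch under that assumption (the load_image helper and the 224x224 size are illustrative):

import tensorflow as tf

def load_image(path, label):
    # Decode and resize one image at a time, only when its batch is built.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((filenames, labels))
           .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(4)                      # small batch to limit memory per step
           .prefetch(tf.data.AUTOTUNE))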
Previously I implemented 2 CNN models (ResNet50V2 and InceptionResNetV2) with a dataset containing 3662 images. Both worked fine in Google Colab during training and validation. Now I re-run exactly the same code and the training samples shown per epoch dropped to only 92 on their own (before it was 2929/epoch). The two models are in separate notebooks and both behave like this now.
I thought it might be due to limited RAM (after a month of using Google Colab it seemed to have been cut in half), so I upgraded to Colab Pro with 25 GB of RAM. That didn't solve the problem.
Has anyone had the same issue? Can anyone give a clue as to what the reason could be and how to fix it? Many thanks!
Some code from the end of the workflow is here (it worked well before):
model = tf.keras.applications.InceptionResNetV2(
    include_top=True, weights=None, input_tensor=None, input_shape=None,
    pooling=None, classes=5)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
model.fit(X, y_orig, epochs = 20, batch_size = 32, validation_split = 0.2, callbacks=[tensorboard_callback])
So I think I found the reason. It is the number of batches displayed during training. In my case: 2929 (no. of training samples) / 32 (batch_size) = 91.5, which matches the number now displayed during training.
To test this, I changed the batch size to 8 and got 366/epoch. The overall training time also stays the same, suggesting the number of training samples is actually the same as before.
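For reference, newer Keras progress bars report steps (batches) per epoch rather than samples, so the displayed figure is just the sample count divided by the batch size, rounded up. A quick check under that assumption:

import math

train_samples = 2929   # 3662 images minus the 20% validation split, roughly
batch_size = 32

print(math.ceil(train_samples / batch_size))  # 92 -- the figure shown in the progress bar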
Are you using tensorflow v1 or v2?
Does this problem persist if you switch to 1.x by running a cell with %tensorflow_version 1.x prior to importing tensorflow?
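In other words, something like this in the first cell of the notebook (Colab-specific magic, as mentioned above), before any other TensorFlow import:

%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)  # should report a 1.x release if the switch took effect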
Apologies if my questions are relatively simple, but I have recently been getting into TensorFlow with the aim of learning new skills.
In the example there are several things I can't work out:
In the explore-data section, the sizes of the datasets are returned as 60k/10k respectively for train and test.
Where is the train/test split size declared?
Packages like scikit-learn allow this to be specified as a percentage when invoking the split methods.
In the training part, when the 5 epochs are trained, the number 1875 appears below.
- What is that?
- I was expecting the training to run over the 60k items, but even multiplying 1875 by 5 doesn't reach the 10k.
The dataset is loaded using the TensorFlow Datasets API.
The source itself defines the split: 60K (train) and 10K (test).
https://www.tensorflow.org/datasets/catalog/fashion_mnist
An Epoch is a complete run with all the training samples. The training is done in batches. In the example you refer to, a batch size of 32 is used. So to complete one epoch, 1875 batches (60000 / 32) are run.
Hope this helps.
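As a concrete illustration (this uses the keras.datasets loader rather than the tensorflow_datasets API from the tutorial, but the split and the arithmetic are the same):

import tensorflow as tf

# Fashion-MNIST ships with a fixed 60k/10k train/test split.
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.fashion_mnist.load_data()
print(train_x.shape, test_x.shape)   # (60000, 28, 28) (10000, 28, 28)
print(60000 // 32)                   # 1875 batches per epoch at the default batch size of 32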
I am using TF as a backend to Keras. I am using custom loss functions so I essentially use Keras as a wrapper for TF. I have a big model, which consists of 4 smaller ones, where 3 of them are pre-trained and loaded while the fourth gets trained.
The issue is, that when calling
self.session.run(tf.global_variables_initializer())
TF ends up with an error about trying to allocate too much memory on the GPU. The model itself has around 280,000,000 parameters (70 million trainable), yet the TF graph has 1,000,000,000 variables. That's where the math doesn't add up.
Allocating 1 billion floats should take up around 4 GB of memory, and TF has 5.3 GB of VRAM available. The 1 billion variables should, as far as I know, include all stored activations, gradients, and optimizer parameters (1 per trained parameter, using RMSprop).
There are very few activations because I only use quite small conv layers, so the activations for the whole thing should take around 6.5 MB per sample, and I'm using a batch size of only 32, so about 208 MB total.
Do you have any idea what's going on here? Does the model just barely not fit or is there a bigger problem somewhere?
Any advice appreciated!
EDIT: The model definition code: https://pastebin.com/6FRczTc0 (the first function is used for the 4 submodels and the second one puts them together into the bigger net)
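One way to sanity-check where the 1 billion figure comes from is to sum the variable shapes in the graph directly (a sketch for TF 1.x, matching the session-based setup above; it counts only the raw float32 weights, not activations or allocator overhead):

import numpy as np
import tensorflow as tf  # TF 1.x style, matching self.session.run(...) above

total = sum(np.prod(v.get_shape().as_list()) for v in tf.global_variables())
print("variables:", total, "->", total * 4 / 1024**3, "GB as float32")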
I have been trying to use Google's RNN based seq2seq model.
I have been training a model for text summarization and am feeding in textual data of approximately 1 GB. The model quickly fills up my entire RAM (8 GB), starts filling up even the swap memory (a further 8 GB), and crashes, after which I have to do a hard shutdown.
The configuration of my LSTM network is as follows:
model: AttentionSeq2Seq
model_params:
  attention.class: seq2seq.decoders.attention.AttentionLayerDot
  attention.params:
    num_units: 128
  bridge.class: seq2seq.models.bridges.ZeroBridge
  embedding.dim: 128
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.AttentionDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
I tried decreasing the batch size from 32 to 16, but it still did not help. What specific changes should I make in order to prevent my model from taking up all of the RAM and crashing? (Like decreasing the data size, decreasing the number of stacked LSTM cells, further decreasing the batch size, etc.)
My system runs Python 2.7.x, TensorFlow version 1.1.0, and CUDA 8.0. It has an Nvidia GeForce GTX 1050 Ti (768 CUDA cores) with 4 GB of memory, plus 8 GB of RAM and a further 8 GB of swap.
Your model looks pretty small. The only thing that is at all big is the training data. Please check that your get_batch() function has no bugs; it is possible that on each batch you are actually loading the whole data set.
To quickly test this, cut your training data down to something very small (such as 1/10 of the current size) and see if that helps. Note that it should not make a difference if mini-batching is working correctly, so if it does resolve the problem, fix your get_batch() function.
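For comparison, the shape a healthy get_batch() usually takes is roughly this (a sketch; the actual function isn't shown in the question, and the names here are illustrative):

def get_batch(source_seqs, target_seqs, batch_size=16):
    # Yield one mini-batch at a time instead of materializing the whole corpus per call.
    for start in range(0, len(source_seqs), batch_size):
        yield (source_seqs[start:start + batch_size],
               target_seqs[start:start + batch_size])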