TensorFlow dataset pipeline for large HDF5 file

Let's say I have 10 million training samples stored in an HDF5 file and I want to generate batches of size 1000 with a TensorFlow dataset. However, the 10 million samples are too big to load into memory.
What I want to do is load the NumPy data from the HDF5 file into memory 1 million samples at a time, then iterate over that chunk to generate my batches of size 1000. When the 1 million samples are used up, I want to load the next 1 million from the HDF5 file and continue. I would like to manage this with a single Dataset in TensorFlow.
However, I don't see how to do this with the Dataset API in TensorFlow.
How can I iterate on two levels like this (1st level = big chunks of 1 million, 2nd level = small batches of 1000)?
Thanks
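
One approach that fits this two-level pattern is a Python generator that pulls one large chunk of the HDF5 file into memory at a time and yields individual samples, wrapped with tf.data.Dataset.from_generator and then batched. A minimal sketch is below; the file name, dataset keys, and feature shape are assumptions, not taken from the question.

import h5py
import tensorflow as tf

CHUNK_SIZE = 1_000_000   # samples held in memory at once (1st level)
BATCH_SIZE = 1_000       # batch size produced by the pipeline (2nd level)

def hdf5_chunk_generator():
    # Read the HDF5 datasets in big slices so only one chunk is resident
    # in memory, then yield individual samples from that chunk.
    with h5py.File("train.h5", "r") as f:              # assumed file name
        features = f["features"]                       # assumed dataset key
        labels = f["labels"]                           # assumed dataset key
        n = features.shape[0]
        for start in range(0, n, CHUNK_SIZE):
            x_chunk = features[start:start + CHUNK_SIZE]   # ~1M rows loaded as NumPy
            y_chunk = labels[start:start + CHUNK_SIZE]
            for x, y in zip(x_chunk, y_chunk):
                yield x, y

ds = tf.data.Dataset.from_generator(
    hdf5_chunk_generator,
    output_signature=(
        tf.TensorSpec(shape=(128,), dtype=tf.float32),     # assumed feature shape
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)
ds = ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

The generator is re-run from the start on every epoch, so the chunked reads repeat automatically; shuffling, if needed, can be done within each chunk before yielding.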

Related

TensorFlow Datasets with different structure: merging to batch the merged dataset

I have two TensorFlow datasets. One dataset has 19 million rows and the other one has 24,000. I want to create a 19 million x 24,000 input feature matrix. Can I achieve this using TensorFlow dataset operations?
Currently I am trying to merge the datasets using pandas, but I get a memory error in the AWS SageMaker pipeline because the merge needs 2.74 TiB. So I decided to do batch inference after converting my two data frames, users_df and items_df, to TensorFlow datasets.
Is there any way to do this pandas merge operation with TensorFlow datasets?
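
A pandas-style cross join can be expressed lazily with tf.data, so the full 19M x 24K matrix is never materialized in memory. The sketch below uses flat_map for the cross product; the column names and the way the two datasets are built are placeholders, not the original code.

import tensorflow as tf

# Placeholder stand-ins for the datasets built from users_df and items_df.
users_ds = tf.data.Dataset.from_tensor_slices({"user_id": tf.range(19_000_000)})
items_ds = tf.data.Dataset.from_tensor_slices({"item_id": tf.range(24_000)})

# Lazy cross join: pair every user record with every item record.
# Rows are produced on the fly as the pipeline is iterated, so memory stays flat.
pairs = users_ds.flat_map(
    lambda user: items_ds.map(lambda item: {**user, **item})
)

batched = pairs.batch(4096).prefetch(tf.data.AUTOTUNE)   # feed batches straight to inference

Because the join stays inside the input pipeline, each batch can be fed to the model for inference without ever holding the 2.74 TiB product.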

How to predict results from 20 million records using Hugging Face Model in minimum time

I am trying to predict sentiment for 20 million records using the model available in Hugging Face.
https://huggingface.co/finiteautomata/beto-sentiment-analysis
This model takes 1 hour and 20 minutes to predict 70000 records.
The model is saved locally and accessed locally by loading it.
Can anyone please suggest how I can use it efficiently to predict 20 million records in minimal time?
Also, I am using a zero-shot classification model on the same data, and it takes
7 minutes to predict 1000 records.
Kindly suggest a way to minimize prediction time for this as well.
from transformers import pipeline

model_path = 'path where model is saved'   # local path of the saved sentiment model

classifier = pipeline("zero-shot-classification",
                      model="Recognai/bert-base-spanish-wwm-cased-xnli")

def predict(row):
    topics = [...]  # five candidate labels here
    res = classifier(row, topics)
    return res

df['Predict'] = df['Message'].apply(lambda x: predict(x))  # This df contains 70k records
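
The per-row apply() call runs the pipeline on one message at a time, which is usually the main bottleneck at this scale. A rough sketch of batched inference is below; the batch_size, device index, and the way the top label is extracted are assumptions that would need tuning for the actual hardware and the output format required.

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="Recognai/bert-base-spanish-wwm-cased-xnli",
                      device=0)            # assumes a GPU is available; use device=-1 for CPU

topics = [...]  # the same five candidate labels as above

texts = df['Message'].tolist()
# Passing the whole list with a batch_size lets the pipeline batch tokenization
# and forward passes instead of running one record per call.
results = classifier(texts, topics, batch_size=32)
df['Predict'] = [r['labels'][0] for r in results]   # keep the highest-scoring label

The same idea, batched calls on a GPU (optionally with a smaller or distilled model), applies to the sentiment model when scaling up to the 20 million records.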

With TensorFlow/Keras, is there a way to split up input data for training so I don't run out of RAM?

I am using TensorFlow with Keras to process audio data through Conv1D layers; however, I am running out of RAM (only 8 GB available). I have an input .wav and a target .wav for the network to learn from, and each file is 40 MB (about 4 minutes of audio).
In this model, one sample of audio is learned from the previous 200 samples. In order to accomplish this, I am taking (for example) the input 8000000 samples and “unfolding” it into (8000000, 200, 1), where each audio sample becomes an array of the previous 200 samples. Then I call “model.fit(unfolded_input_samples, target_samples)”.
The problem is I quickly run out of RAM when unfolding the input data. Is there a way around creating this massive array while still telling TensorFlow to use the previous 200 samples for each data point? Can I break up the unfolded input array into chunks and pass each to fit() without starting a new epoch? Or is there an easier way to accomplish this with TensorFlow?
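
One way to avoid materializing the (8000000, 200, 1) array is to build the windows lazily in the input pipeline. A minimal sketch using tf.keras.utils.timeseries_dataset_from_array is below; input_samples, the batch size, and the epoch count are placeholders, and the target slicing assumes each window predicts the sample aligned with its last position.

import tensorflow as tf

WINDOW = 200     # each prediction is conditioned on the previous 200 samples
BATCH = 64       # placeholder batch size

# input_samples and target_samples are the raw 1-D audio arrays (~8M samples each).
# Windows of length 200 are generated on the fly instead of being stored as a
# giant (8000000, 200, 1) array.
dataset = tf.keras.utils.timeseries_dataset_from_array(
    data=input_samples[:, None],             # shape (N, 1): one value per timestep
    targets=target_samples[WINDOW - 1:],     # align each target with the end of its window
    sequence_length=WINDOW,
    batch_size=BATCH,
)

model.fit(dataset, epochs=10)   # batches stream through fit(); no unfolded array in RAM

Passing a dataset to fit() this way keeps whole epochs intact, so there is no need to split the data into chunks and call fit() repeatedly.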

TensorFlow image classification colab sheet from training material: newbie questions

Apologies if my questions are relatively simple, but I have been approaching the TensorFlow bit recently with the aim of learning new skills.
In the example, there are several things I can't get:
in the explore data section, the size of the datasets is reported as 60k/10k respectively for train and test.
where is the train/test split size declared?
packages like scikit-learn allow this to be specified as a percentage when invoking the split methods.
in the model training part, when the 5 epochs are trained, the number 1875 appears below.
- what is that?
- I was expecting the training to run over the 60k items, but even multiplying 1875 by 5 doesn't reach the 10k.
The dataset is loaded using the TensorFlow Datasets API.
The source itself has a split of 60K (train) and 10K (test):
https://www.tensorflow.org/datasets/catalog/fashion_mnist
An epoch is a complete run over all the training samples. The training is done in batches; in the example you refer to, a batch size of 32 is used, so completing one epoch takes 1875 batches (60000 / 32).
Hope this helps.
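
For reference, a small sketch of where those numbers come from when the dataset is loaded through tensorflow_datasets (the batch size of 32 mirrors the tutorial's default; the rest is illustrative):

import tensorflow_datasets as tfds

# Fashion-MNIST ships with a fixed 60k/10k train/test split.
(ds_train, ds_test), info = tfds.load(
    "fashion_mnist",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

print(info.splits["train"].num_examples)   # 60000
print(info.splits["test"].num_examples)    # 10000

batch_size = 32
steps_per_epoch = info.splits["train"].num_examples // batch_size
print(steps_per_epoch)                     # 1875 -> the number shown per epoch during training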

Keras + TF, why is my model too big?

I am using TF as a backend to Keras. I am using custom loss functions so I essentially use Keras as a wrapper for TF. I have a big model, which consists of 4 smaller ones, where 3 of them are pre-trained and loaded while the fourth gets trained.
The issue is that when calling
self.session.run(tf.global_variables_initializer())
TF fails with an error about trying to allocate too much memory on the GPU. The model itself has around 280,000,000 params (70 million are trainable), and the TF graph has 1,000,000,000 variables. That's where the math doesn't add up.
Allocating 1 billion floats should take up around 4 GB of memory. And TF has 5.3 GB of VRAM available. The 1 billion variables should afaik include all stored activations and gradients and optimizer params (1 per trained param, using rmsprop).
There are very few activations because I only use quite small conv layers, so the activations for the whole thing per 1 sample should take around 6.5 MB and I'm using batch size of only 32, so 208 MB total.
Do you have any idea what's going on here? Does the model just barely not fit or is there a bigger problem somewhere?
Any advice appreciated!
EDIT: The model definition code: https://pastebin.com/6FRczTc0 (the first function is used for the 4 submodels and the second one puts them together into the bigger net)
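
One way to see where the extra billion variable elements come from is to tally the graph's variables directly before initialization; a quick sketch in the same TF1 session style as the post (the top-20 cutoff is just for illustration):

import numpy as np
import tensorflow as tf

# Sum the number of elements held by every variable in the default graph.
# Optimizer slots (e.g. RMSprop accumulators) and any duplicated sub-model
# variables will show up here alongside the model weights.
sizes = {v.name: int(np.prod(v.shape.as_list())) for v in tf.global_variables()}

total = sum(sizes.values())
print("total variable elements:", total)
print("approx. float32 memory: %.2f GB" % (total * 4 / 1e9))

# Largest variables first, to spot which sub-model or slot dominates.
for name, n in sorted(sizes.items(), key=lambda kv: -kv[1])[:20]:
    print(name, n)

Comparing the total against the expected 280 million (plus one RMSprop slot per trainable parameter) should show whether the 1 billion figure really comes from variables, or whether some sub-model is being built more than once.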