With TensorFlow 2.3, my input/output data comes from a CSV file (floating-point numbers). Unfortunately, my dataset does not fit in memory. It is probably possible to avoid loading the whole dataset into memory by, for example, splitting it.
I have already done a lot of research, including StackOverflow questions, but the answers are not always clear or refer to previous versions of TensorFlow.
how to fit tensorflow dataset
Dataset does not fit in memory
https://www.tensorflow.org/tutorials/load_data/csv
https://medium.com/#mrgarg.rajat/training-on-large-datasets-that-dont-fit-in-memory-in-keras-60a974785d71
If someone could provide a simple starter example ...
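A minimal starter sketch with the tf.data API in TensorFlow 2.3, assuming a CSV file named train.csv with all-float columns and a label column named target (both names are hypothetical); the data is streamed from disk in batches, so the file never has to fit in memory:

import tensorflow as tf

# Stream the CSV in batches; the file is never fully loaded into memory.
dataset = tf.data.experimental.make_csv_dataset(
    'train.csv',               # hypothetical file name
    batch_size=32,
    label_name='target',       # hypothetical label column
    num_epochs=1,
    shuffle=True,
    shuffle_buffer_size=10000)

# Stack the per-column feature tensors into a single float vector per example.
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), label

dataset = dataset.map(pack).prefetch(tf.data.experimental.AUTOTUNE)

# Any compiled Keras model can then be trained directly on the dataset:
# model.fit(dataset, epochs=10)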
There are a variety of ways to get a dataset you can train on in TensorFlow. One of the things TensorFlow Transform does is provide the ability to do preprocessing via AnalyzeAndTransformDataset and TransformDataset. Surprisingly, the dataset being referred to is not a TensorFlow dataset, but rather a dataset in the Apache Beam sense. That is understandable to some degree, given that the function is tft_beam.AnalyzeAndTransformDataset.
The heart of my question is this: given that the metadata is already known by TensorFlow, why aren't there easier ways to get from a TensorFlow dataset to a Beam dataset? I understand that a TensorFlow dataset will generally repeat itself forever, but is there a way to transform a TensorFlow dataset into a dataset that can be processed by Beam? Or is the only solution to have the Beam dataset created by pointing to the original data on disk? Does this have to do with the unboundedness of a TensorFlow dataset, or is there some other reason that a TensorFlow dataset cannot be analyzed/transformed through appropriate transformations so that it's abstracted from the developer? All of the examples I have seen start with dictionaries, and there is another Stack Overflow question here that talks about this to some extent, but doesn't fully explain why this is the way it is.
This seems to be a question for the TensorFlow team rather than Apache Beam, but the TFX transforms you referred to are built on top of Beam transforms (so Beam is used as a utility). You are not directly working with Beam constructs (PCollections, PTransforms, etc.). If you want to build a Beam pipeline using the intermediate data, you might need to start with TFRecord files and use Beam's tfrecordio source, as the other post mentioned.
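For completeness, a minimal sketch of what that Beam side could look like, assuming the intermediate data has already been materialized as TFRecord files on disk (the path pattern data/train-*.tfrecord is hypothetical):

import apache_beam as beam

# Build a Beam pipeline directly from TFRecord files on disk,
# since a tf.data.Dataset cannot be handed to Beam directly.
with beam.Pipeline() as p:
    _ = (p
         | 'ReadTFRecords' >> beam.io.ReadFromTFRecord('data/train-*.tfrecord')
         | 'CountRecords' >> beam.combiners.Count.Globally()
         | 'Print' >> beam.Map(print))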
I'm coming from a Scikit Learn background.
I'm having difficulty understanding how to preprocess data sets for Tensorflow.
I'm trying to implement svm with the iris data set.
If I have two numpy arrays, one containing a list of the features, and the other containing the list of the labels, which functions would I use to create the classifier?
estimator = SVM(
    example_id_column='example_id',
    feature_columns=[real_feature_column, sparse_feature_column],
    l2_regularization=10.0)
I'm assuming the example_id_column would be
example_id_column = '0,1,2'
I'm not sure how to obtain the feature_columns.
I think the most effective way is to use TFRecord files. There's a comprehensive tutorial available that's still mostly relevant, too. This also has the advantage of letting you define a lot more of your pipeline as part of the graph, being able to do concurrent reads from the source files, and not needing to fit your dataset in memory. It's definitely worth the effort.
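As an illustration of that route, here is a minimal sketch of writing two numpy arrays (features and labels, as in the question) into a TFRecord file; the array shapes and the file name iris.tfrecord are hypothetical:

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the two numpy arrays from the question.
features = np.random.rand(150, 4).astype(np.float32)   # N x 4 float features
labels = np.random.randint(0, 3, size=150)              # N integer labels

with tf.io.TFRecordWriter('iris.tfrecord') as writer:
    for x, y in zip(features, labels):
        # One serialized tf.train.Example per sample.
        example = tf.train.Example(features=tf.train.Features(feature={
            'features': tf.train.Feature(float_list=tf.train.FloatList(value=x.tolist())),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(y)])),
        }))
        writer.write(example.SerializeToString())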
I have specific questions about how to train a neural network that is larger than RAM. I want to use the de facto standard, which appears to be Keras and TensorFlow.
What are the key classes and methods that I need to use, from NumPy, to SciPy, to pandas, to h5py, to Keras, in order not to exceed my meager 8 GB of RAM? I have time to train the model; I don't have cash. My dataset requires 200 GB of RAM.
In Keras there is a model.fit() method. It requires X and Y numpy arrays. How do I get it to accept HDF5 arrays on disk? And when specifying the model architecture itself, how do I save RAM? Wouldn't the working memory require more than 8 GB at times?
Regarding fit_generator, does that accept HDF5 files? If the model.fit() method can accept HDF5, do I even need fit_generator? It seems that you still need to be able to fit the entire model in RAM even with these methods?
In Keras, does the model include the training data when calculating its memory requirements? If so, I think I am in trouble.
In essence, I am under the assumption that at no time can I exceed my 8 GB of RAM, whether from one-hot encoding, to loading the model, to training on even a small batch of samples. I am just not sure how to accomplish this concretely.
I cannot answer everything, and I'm also very interested in those answers, because I'm facing that 8GB problem too.
I can only suggest how to pass little batches at a time.
Question 2:
I don't think Keras will support passing the h5py file (but I really don't know), but you can create a loop to load the file partially (if the file is properly saved for that).
You can create an outer loop to:
create a little array with only one or two samples from the file
use the method train_on_batch passing only that little array.
release the memory by disposing of the array, or fill this same array with the next sample(s) (see the sketch below).
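A minimal sketch of that loop, assuming a hypothetical data.h5 file holding datasets 'X' (N x 100 floats) and 'Y' (N x 1 targets) and a small compiled Keras model as a placeholder:

import h5py
from tensorflow import keras

# Small placeholder model; any compiled Keras model would do.
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(100,)),
    keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

batch_size = 2
with h5py.File('data.h5', 'r') as f:
    n = f['X'].shape[0]
    for epoch in range(10):
        for start in range(0, n, batch_size):
            # h5py only reads the requested slice from disk.
            x_batch = f['X'][start:start + batch_size]
            y_batch = f['Y'][start:start + batch_size]
            model.train_on_batch(x_batch, y_batch)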
Question 3:
I also don't know about the h5py file; is the object that opens the file a Python generator?
If not, you can create the generator yourself.
The idea is to make the generator load only part of the file and yield little batch arrays with one or two data samples. (Pretty much the same as in question 2, but the loop goes inside a generator.)
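A minimal sketch of such a generator, using the same hypothetical data.h5 layout as above:

import h5py

def h5_generator(path, batch_size=2):
    with h5py.File(path, 'r') as f:
        n = f['X'].shape[0]
        while True:  # Keras generators are expected to loop indefinitely
            for start in range(0, n, batch_size):
                yield (f['X'][start:start + batch_size],
                       f['Y'][start:start + batch_size])

# model.fit_generator(h5_generator('data.h5'), steps_per_epoch=500, epochs=10)
# where steps_per_epoch is the number of samples divided by the batch size.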
Usually for very large sample sets an "online" training method is used. This means that instead of training your neural network in one go with a large batch, it allows the neural network to be updated incrementally as more samples are obtained. See: Stochastic Gradient Descent
Why is the TFRecords file sharded in the Inception model example in TensorFlow?
For randomness, can't the list of files be shuffled before creating one single TFRecord file?
Why is the TFRecords file sharded in the Inception model example in TensorFlow?
According to the object detection API, there are two advantages to sharding your dataset:
Files can be read in parallel, improving data loading speed
Examples can be shuffled better by sharding
You probably already knew the second point as it is in your second question:
For randomness, can't the list of files be shuffled before creating one single TFRecord file?
Shuffling the dataset before creating the record is indeed a good practice, because shuffling a TFRecord can only be done partially: you can only load a certain number of examples into memory, and shuffling is then done by randomly selecting the next example among the ones in memory. You can see more in this question
However, if you only shuffle the dataset when creating the record, your network will always see the examples in the same order in successive training epochs. This might result in unwanted convergence behaviours due to the random order that was fixed once and for all. It is therefore better to shuffle the dataset on the fly, so that the ordering differs between epochs.
Sharding your dataset makes this shuffling easier. Instead of being forced to always read in the same order from one single file, you can read a bit from each file, chosen at random.
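A minimal sketch of that kind of on-the-fly shuffling with the tf.data API, assuming sharded record files matching the hypothetical pattern train-*-of-01024: the file order is reshuffled on each pass, several files are read in parallel, and examples are shuffled again in an in-memory buffer:

import tensorflow as tf

# Shuffled, repeating list of shard file names.
files = tf.data.Dataset.list_files('train-*-of-01024', shuffle=True).repeat()
dataset = (files
           .interleave(tf.data.TFRecordDataset,
                       cycle_length=8,
                       num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(buffer_size=10000)   # partial, in-memory example shuffle
           .batch(32))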
I am a new user of TensorFlow. I would like to use it to train on a dataset of 2M images. I did this experiment in Caffe using the LMDB file format.
After reading TensorFlow-related posts, I realized that TFRecord is the most suitable file format to do so. Therefore, I am looking for complete CNN examples that use TFRecord data. I noticed that the image-related tutorials (mnist and cifar10 in link1 and link2) come with a different binary file format where the entire dataset is loaded at once. Therefore, I would like to know whether these tutorials (mnist and cifar10) are available using TFRecord data (for both CPU and GPU).
I assume that you want to both write and read TFRecord files. I think what is done in reading_data.py should help you convert MNIST data into TFRecords.
For reading it back, this script does the trick: fully_connected_reader.py
This could be done similarly with cifar10.
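For reference, a minimal sketch of a modern tf.data reading pipeline (rather than the queue-based code in fully_connected_reader.py), assuming records that store a JPEG-encoded image under the key 'image' and an int64 label under 'label'; the key names and the file pattern train-*.tfrecord are hypothetical:

import tensorflow as tf

feature_spec = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    # Decode one serialized tf.train.Example into an image/label pair.
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed['image'], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed['label']

dataset = (tf.data.TFRecordDataset(tf.io.gfile.glob('train-*.tfrecord'))
           .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(1000)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))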