Keras alternative to ImageDataGenerator for loading arbitrary numpy tensor - tensorflow

Keras's ImageDataGenerator looks great for progressively loading images and passing an iterator to the model.fit function. However, it seems to be usable only for images, and only for classification tasks.
I want to do regression, i.e., my labels are also arrays of the same shape as the training inputs. In practice they are multidimensional (>1 channel) arrays, image-like, but not images.
Any suggestions on what class to use to simply spit batches of data to a keras model.fit() for training a deep neural net?
The problem, of course, is that my datasets are much too large to fit in memory, which is why I need to use these generators/iterators.

The best solution for your case is to use tf.data.Dataset.
While it may take a little time to get accustomed to, it is the recommended way to load your data and use it with model.fit().
You can consult the documentation here: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
It is new, fast, beautifully designed and easily extensible.
For instance, for your problem you may want to use tf.data.Dataset.from_tensor_slices(); I will leave you to discover its features :D.
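Since the arrays here don't fit in memory, a sketch using the closely related tf.data.Dataset.from_generator may be more useful than from_tensor_slices; the shapes and the random-array loader below are placeholder assumptions, not from the question (output_signature needs a recent TensorFlow 2.x):

import numpy as np
import tensorflow as tf

def sample_generator():
    # Hypothetical loader: yield one (input, target) pair at a time,
    # e.g. read from files on disk instead of holding everything in RAM.
    while True:
        x = np.random.rand(64, 64, 4).astype(np.float32)  # multi-channel input
        y = np.random.rand(64, 64, 4).astype(np.float32)  # same-shape regression target
        yield x, y

dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 4), dtype=tf.float32),
        tf.TensorSpec(shape=(64, 64, 4), dtype=tf.float32),
    ),
).batch(32).prefetch(tf.data.AUTOTUNE)

# The generator is infinite, so tell model.fit how long an epoch is:
# model.fit(dataset, steps_per_epoch=100, epochs=10)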

A quick solution would be to use Colab, whose GPU instances come with 24 GB of RAM to work with. You could also reduce your memory use when you load the numpy array, the way I did here.

Related

Using dynamically generated data with keras

I'm training a neural network using keras but I'm not sure how to feed the training data into the model in the way that I want.
My training data set is effectively infinite; I have some code to generate training examples as needed, so I just want to pipe a continuous stream of novel data into the network. Keras seems to want me to specify my entire dataset in advance by creating a numpy array with everything in it, but this obviously won't work with my approach.
I've experimented with creating a generator class based on keras.utils.Sequence, which seems like a better fit, but it still requires me to specify a length via the __len__ method, which makes me think it will only create that many examples before recycling them. Can someone suggest a better approach?
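For what it's worth, __len__ only tells Keras how many batches count as one epoch; __getitem__ is free to ignore its index and synthesize fresh data on every call. A minimal sketch, where the random arrays stand in for the asker's (unspecified) generation code:

import numpy as np
from tensorflow import keras

class InfiniteSequence(keras.utils.Sequence):
    def __init__(self, batch_size=32, steps_per_epoch=100):
        self.batch_size = batch_size
        self.steps_per_epoch = steps_per_epoch

    def __len__(self):
        # Only defines how many batches Keras counts as one epoch;
        # it does not cap how many examples are ever generated.
        return self.steps_per_epoch

    def __getitem__(self, index):
        # Ignore `index` and build a brand-new batch every call.
        x = np.random.rand(self.batch_size, 10).astype(np.float32)  # placeholder inputs
        y = np.random.rand(self.batch_size, 1).astype(np.float32)   # placeholder targets
        return x, y

# model.fit(InfiniteSequence(), epochs=10)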

preprocess data sets for Tensorflow high-level estimators

I'm coming from a Scikit Learn background.
I'm having difficulty understanding how to preprocess data sets for Tensorflow.
I'm trying to implement svm with the iris data set.
If I have two numpy arrays, one containing a list of the features, and the other containing the list of the labels, which functions would I use to create the classifier?
estimator = SVM(
    example_id_column='example_id',
    feature_columns=[real_feature_column, sparse_feature_column],
    l2_regularization=10.0)
I'm assuming the example_id_column would be
example_id_column = '0,1,2'
I'm not sure how to obtain the feature_columns.
I think the most effective way is using the TFRecords files. There's a comprehensive tutorial available that's still mostly relevant, too. This also has the advantage of letting you define a lot more of your pipeline as part of the graph, being able to do concurrent reads from the source files, and not needing to fit your dataset in memory. It's definitely worth the effort.
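As a rough illustration of the pattern, here is a sketch of the TFRecords round trip using the modern tf.io / tf.data API (the question's contrib-era estimator would consume this through an input function, so treat this as the general write/read pattern rather than a drop-in answer); X and y are the two numpy arrays from the question:

import tensorflow as tf

# Writing: serialize each (features, label) pair into a TFRecord file.
def serialize_example(features, label):
    feature = {
        "features": tf.train.Feature(
            float_list=tf.train.FloatList(value=features.ravel())),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

with tf.io.TFRecordWriter("iris.tfrecord") as writer:
    for features, label in zip(X, y):
        writer.write(serialize_example(features, label))

# Reading: stream records back without loading everything into memory.
def parse_example(record):
    spec = {
        "features": tf.io.FixedLenFeature([4], tf.float32),  # 4 iris features
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["features"], parsed["label"]

dataset = tf.data.TFRecordDataset("iris.tfrecord").map(parse_example).batch(32)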

Specifically, how to train a neural network when it is larger than RAM?

I have specific questions about how to train a neural network that is larger than RAM. I want to use the de facto standard, which appears to be Keras and tensorflow.
What are the key classes and methods that I need to use, from NumPy, SciPy, pandas, h5py, and Keras, in order not to exceed my meager 8 GB of RAM? I have time to train the model; I don't have cash. My dataset requires 200 GB of RAM.
In keras there is a model.fit() method. It requires X and Y numpy arrays. How do I get it to accept HDF5 arrays on disk? And when specifying the model architecture itself, how do I save RAM, because wouldn't the working memory sometimes require more than 8 GB?
Regarding fit_generator, does that accept HDF5 files? If the model.fit() method can accept HDF5, do I even need fit_generator? It seems that you still need to be able to fit the entire model in RAM even with these methods?
In keras, does the model include the training data when calculating its memory requirements? If so, I think I am in trouble.
In essence, I am under the assumption that at no time can I exceed my 8 GB of RAM, whether from one-hot encoding, to loading the model, to training on even a small batch of samples. I am just not sure how to accomplish this concretely.
I cannot answer everything, and I'm also very interested in those answers, because I'm facing that 8GB problem too.
I can only suggest how to pass little batches at a time.
Question 2:
I don't think Keras supports passing an h5py file directly (but I really don't know), but you can create a loop to load the file partially (if the file was saved appropriately for that).
You can create an outer loop (sketched after this list) to:
create a little array with only one or two samples from the file;
call the method train_on_batch, passing only that little array;
release the memory, disposing of the array or filling the same array with the next sample(s).
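A minimal sketch of that loop, assuming a compiled model already exists and the data was saved to "data.h5" under the (hypothetical) dataset names "X" and "Y":

import h5py

with h5py.File("data.h5", "r") as f:
    X, Y = f["X"], f["Y"]   # h5py datasets: slicing reads only that slice from disk
    n = X.shape[0]
    batch_size = 2          # one or two samples at a time
    for epoch in range(10):
        for start in range(0, n, batch_size):
            x_batch = X[start:start + batch_size]  # small numpy array in RAM
            y_batch = Y[start:start + batch_size]
            model.train_on_batch(x_batch, y_batch)
            # x_batch / y_batch are replaced (and garbage-collected) on the next iteration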
Question 3:
I also don't know about h5py files: is the object that opens the file a Python generator?
If not, you can create the generator yourself.
The idea is to make the generator load only part of the file and yield little batch arrays with one or two data samples. (Pretty much the same as done in question 2, but the loop goes inside a generator.)
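Such a generator could look like this, reusing the hypothetical "data.h5" layout from the sketch above; it is passed to fit_generator with steps_per_epoch set to the number of batches in one pass over the file:

import h5py

def h5_batch_generator(path, batch_size=2):
    # Loop forever, yielding small (x, y) batches read straight from disk.
    with h5py.File(path, "r") as f:
        X, Y = f["X"], f["Y"]
        n = X.shape[0]
        while True:
            for start in range(0, n, batch_size):
                yield (X[start:start + batch_size],
                       Y[start:start + batch_size])

# model.fit_generator(h5_batch_generator("data.h5"),
#                     steps_per_epoch=n_samples // 2, epochs=10)  # n_samples: total sample count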
Usually for very large sample sets an "online" training method is used. This means that instead of training your neural network in one go with a large batch, it allows the neural network to be updated incrementally as more samples are obtained. See: Stochastic Gradient Descent

Tensorflow - How to ignore certain labels

I'm trying to implement a fully convolutional network and train it on the Pascal VOC dataset; however, after reading up on the labels in the set, I see that I need to somehow ignore the "void" label. In Caffe, the softmax function has an argument to ignore labels, so I'm wondering what the mechanism is, so I can implement something similar in tensorflow.
Thanks
In tensorflow you're feeding the data in feed_dict, right? Generally you'd want to just pre-process the data and remove the unwanted samples; don't give them to tensorflow for processing.
My preferred approach is a producer-consumer model, where you fire up a tensorflow queue and load it with samples from a loader thread which simply skips enqueuing your void samples.
In training, your model dequeues samples inside the graph (you don't use feed_dict in the optimize step). This way you're not bothering to write out a whole new dataset with the specific preprocessing step you're interested in today (tomorrow you're likely to find you want to do some other preprocessing step).
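If whole samples can't be dropped (with Pascal VOC segmentation, "void" is a per-pixel label inside otherwise valid images), a common alternative, different from the queue approach above, is to mask the void entries out of the loss. A minimal sketch, assuming integer labels with void encoded as 255, as in the VOC masks:

import tensorflow as tf

VOID_LABEL = 255  # Pascal VOC marks "void" pixels as 255

def masked_loss(logits, labels):
    # Flatten to (num_pixels, num_classes) and (num_pixels,).
    num_classes = tf.shape(logits)[-1]
    logits = tf.reshape(logits, [-1, num_classes])
    labels = tf.reshape(labels, [-1])
    # Keep only the pixels whose label is not "void".
    valid = tf.not_equal(labels, VOID_LABEL)
    logits = tf.boolean_mask(logits, valid)
    labels = tf.boolean_mask(labels, valid)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))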
As a side comment, I think tensorflow is a little more do-it-yourself than some other frameworks. But I tend to like that: it abstracts enough to be convenient, but not so much that you don't understand what's happening. "When you implement it, you understand it" is the motto that comes to mind with tensorflow.

How to predict using Tensorflow?

This is a newbie question for the tensorflow experts:
I am reading a lot of data from a power transformer connected to an array of solar panels using Arduinos. My question is: can I use tensorflow to predict the power generation in the future?
I am completely new to tensorflow, so if you can point me to something similar, I can start with that, or any GitHub repo which is doing similar predictive modeling.
Edit: Kyle pointed me to the MNIST data, which I believe is an image dataset. Again, not sure if tensorflow is the right computation library for this problem, or does it only work on image datasets?
thanks, Rajesh
Surely you can use tensorflow to solve your problem.
TensorFlow™ is an open source software library for numerical computation using data flow graphs.
So it works not only on image datasets but also on others. Don't worry about this.
And about prediction: first you need to train a model (such as linear regression) on your dataset, then predict. The tutorial code can be found on the tensorflow homepage.
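As a concrete starting point, here is a minimal sketch of that train-then-predict loop in Keras, where the synthetic sine wave stands in for the asker's (unavailable) sensor readings and "predict the next reading from the previous 24" is an assumed framing:

import numpy as np
from tensorflow import keras

# Stand-in for real sensor data: a 1-D array of past power measurements.
readings = np.sin(np.linspace(0, 100, 2000)).astype(np.float32)

k = 24  # predict the next reading from the previous 24
X = np.stack([readings[i:i + k] for i in range(len(readings) - k)])
y = readings[k:]

model = keras.Sequential([keras.layers.Dense(1, input_shape=(k,))])  # linear regression
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32)

next_value = model.predict(readings[-k:].reshape(1, k))  # one-step-ahead forecast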
Get your hands dirty; you will find it works on your dataset.
Good luck.
You can absolutely use TensorFlow to predict time series. There are plenty of examples out there, like this one. And this is a really interesting one on using an RNN to predict basketball trajectories.
In general, TF is a very flexible platform for solving problems with machine learning. You can create any kind of network you can think of in it, and train that network to act as a model for your process. Depending on what kind of cost you define and how you train it, you can build a network to classify data into categories, predict a time series forward a number of steps, and do other cool stuff.
There is, sadly, no short answer for how to do this, but that's just because the possibilities are endless! Have fun!