TensorFlow Federated - How to work with SparseTensors

I am using TensorFlow Federated to simulate a scenario in which clients hosted on a remote server can work with our very sparse dataset in a federated setting.
Presently, the code is capable of running with a small subset of the very sparse dataset loaded on the server side and passed to the remote workers hosted on another device. The data is in SVMLight format and can be loaded through sklearn's load_svmlight_file function, but it needs to be converted into Tensors to work within tff. The current solution converts the very sparse data into a dense array and then builds a dataset with tf.data.Dataset.from_tensor_slices for use with a Keras model (following the existing examples for tff).
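For reference, a rough sketch of that current dense approach (the file path is a placeholder and the batch size is arbitrary):

import tensorflow as tf
from sklearn.datasets import load_svmlight_file

# Placeholder path; the real dataset location will differ.
X_sparse, y = load_svmlight_file("data/train.svmlight")

# Current approach: densify the scipy sparse matrix, which is what exhausts memory.
X_dense = X_sparse.toarray()

dataset = tf.data.Dataset.from_tensor_slices((X_dense, y)).batch(20)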
This works, but it consumes significant memory and is not suitable for the full dataset: it cannot be run remotely with more than six samples because of the serialized size of the densified data, nor locally with more than a few hundred samples because of the in-memory size.
To mitigate this, I converted the data into SparseTensors, but this approach fails because tff.learning.from_keras_model expects an input_spec made of a pair of TensorSpec values, not a SparseTensorSpec for the features with a TensorSpec for the labels.
So, are there any concrete examples or known methods for working with SparseTensors in Keras models within tff? Or do they have to be dense Tensors for now? The data loads fine when it is not converted to dense Tensors, so I need a solution that works with the sparse representation.
If there is presently no way to do so, are there examples of strategies within tff to work with very small subsets of data at a time, either being loaded directly with the remote client or being passed from the server?
Thanks!

I'd say the best approach right now is to work with TF's underlying representation of a tf.SparseTensor, that is, a tuple of three tensors: indices, values, and dense_shape.
Since the problem is that Keras requires the input not to be sparse tensors, you can instead pass the input as, for instance, a dictionary consisting of these three tensors, and convert it to a tf.sparse.SparseTensor as part of your tf.data pipeline.
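A minimal sketch of that idea, assuming the data starts as a scipy CSR matrix from load_svmlight_file; the x_client_csr and y_client names stand in for one client's slice of the data:

import numpy as np
import tensorflow as tf

def to_sparse_components(x_csr, y):
    # Represent the sparse features as a dict of three dense tensors.
    coo = x_csr.tocoo()
    features = {
        "indices": np.stack([coo.row, coo.col], axis=1).astype(np.int64),
        "values": coo.data.astype(np.float32),
        "dense_shape": np.array(x_csr.shape, dtype=np.int64),
    }
    return features, y

def assemble_sparse(features, label):
    # Rebuild a tf.sparse.SparseTensor inside the tf.data pipeline.
    sparse = tf.sparse.SparseTensor(
        indices=features["indices"],
        values=features["values"],
        dense_shape=features["dense_shape"],
    )
    return sparse, label

# One client's data as a single dataset element; in a simulation this would be built per client.
features, labels = to_sparse_components(x_client_csr, y_client)
ds = tf.data.Dataset.from_tensors((features, labels)).map(assemble_sparse)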
See also this tutorial which I think is doing something related to what you are looking for, and please ask more detailed questions if needed!

Related

Using dynamically generated data with keras

I'm training a neural network using keras but I'm not sure how to feed the training data into the model in the way that I want.
My training data set is effectively infinite; I have some code to generate training examples as needed, so I just want to pipe a continuous stream of novel data into the network. Keras seems to want me to specify my entire dataset in advance by creating a numpy array with everything in it, but this obviously won't work with my approach.
I've experimented with creating a generator class based on keras.utils.Sequence which seems like a better fit, but it still requires me to specify a length via the __len__ method which makes me think it will only create that many examples before recycling them. Can someone suggest a better approach?
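For context, one common pattern for an effectively infinite dataset is tf.data.Dataset.from_generator rather than keras.utils.Sequence. In this sketch, generate_example() is a stand-in for the questioner's own example-generating code, and the tensor shapes are assumptions:

import tensorflow as tf

def example_stream():
    # Endless stream of freshly generated training examples.
    while True:
        features, label = generate_example()  # hypothetical user-supplied function
        yield features, label

dataset = tf.data.Dataset.from_generator(
    example_stream,
    output_signature=(
        tf.TensorSpec(shape=(10,), dtype=tf.float32),  # assumed feature shape
        tf.TensorSpec(shape=(), dtype=tf.float32),     # assumed scalar label
    ),
).batch(32)

# model.fit(dataset, steps_per_epoch=1000, epochs=5)  # steps_per_epoch bounds each "epoch"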

Keras alternative to ImageDataGenerator for loading arbitrary numpy tensor

Keras's ImageDataGenerator looks great for progressively loading images and passing an iterator to the model.fit function. However, it seems to be usable only for images and only for classification tasks.
I want to do regression, i.e., my labels are also arrays of the same shape as the training examples. In practice, they are multidimensional (more than one channel) arrays like images, but they are not images.
Any suggestions on what class to use to simply spit batches of data to a keras model.fit() for training a deep neural net?
The problem, of course, is that my datasets are much too large to fit in memory, which is why I need to use these generators/iterators.
The best solution for your case is to use tf.data.Dataset.
While it may take a little time to get accustomed to it, it is the recommended way to load your data and feed it to model.fit().
You can consult the documentation here: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
It is new, fast, beautifully designed, and easily extensible.
For instance, for your problem you may want to use tf.data.Dataset.from_tensor_slices(); I will leave you to discover its features :D.
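For instance, a small sketch for the regression case described above, with array-valued labels (the shapes and sizes are assumptions; for data that does not fit in memory, a generator-backed dataset would replace from_tensor_slices):

import numpy as np
import tensorflow as tf

# Assumed shapes: image-like inputs with same-shaped multi-channel targets.
x_train = np.random.rand(100, 64, 64, 3).astype(np.float32)
y_train = np.random.rand(100, 64, 64, 3).astype(np.float32)

dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(buffer_size=100)
    .batch(16)
    .prefetch(tf.data.AUTOTUNE)
)

# model.fit(dataset, epochs=10)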
A quick solution would be to use Colab, whose GPU instance has 24 GB of RAM to work with. You could also reduce your memory usage when you load the numpy array, like the way I did here.

Tensorflow Stored Learning

I haven't tried TensorFlow yet, but I'm still curious: how does it store the acquired learning of a machine learning program for later use, and in what form, data type, and file type?
For example, TensorFlow was used to sort cucumbers in Japan. The computer took a long time to learn, from the example images given, what good cucumbers look like. In what form was that learning saved for future use?
I ask because I think it would be inefficient if the program had to re-learn from the images every time it needs to sort cucumbers.
Ultimately, a high-level way to think about a machine learning model is as three components: the code for the model, the data (weights) for that model, and the metadata needed to make the model run.
In TensorFlow, the code for this model is written in Python and is saved in what is known as a GraphDef. This uses a serialization format created at Google called Protobuf. Other libraries commonly use formats such as Python's native Pickle.
The main reason you write this code is to "learn" from some training data, which is ultimately a large set of matrices full of numbers. These are the "weights" of the model, and they too are stored using Protobuf, although other formats such as HDF5 exist.
TensorFlow also stores metadata associated with the model: for instance, what the input should look like (e.g. an image? some text?) and what the output should look like (e.g. a class of image, such as cucumber grade 1 or 2, with or without scores). This too is stored in Protobuf.
At prediction time, your code loads up the graph, the weights, and the metadata, then takes some input data and produces an output. More information here.
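As a rough illustration of that save-and-reload cycle with the Keras API (the model, file name, and data here are placeholders, not the actual cucumber pipeline):

import tensorflow as tf

# Train once...
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, epochs=5)

# ...then persist the architecture, weights, and training configuration to disk.
model.save("cucumber_sorter.h5")

# Later, reload without retraining.
restored = tf.keras.models.load_model("cucumber_sorter.h5")
# predictions = restored.predict(new_images)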
Are you talking about the symbolic math library, or the idea of tensor flow in general? Please be more specific here.
Here are some resources that discuss the library and tensor flow
These are some tutorials
And here is some background on the field
And this is the github page
If you want a more specific answer, please give more details as to what sort of work you are interested in.
Edit: So I'm presuming your question is more related to the general field of tensor flow than any particular application. Your question still is too vague for this website, but I'll try to point you toward a few resources you might find interesting.
The tensorflow used in image recognition often uses an ANN (Artificial Neural Network) as the object on which to act. What this means is that the tensorflow library helps in the number crunching for the neural network, which I'm sure you can read all about with a quick google search.
The point is that TensorFlow isn't a form of machine learning itself; rather, it serves as a useful number-crunching library, similar to numpy in Python, for large-scale deep learning work. You should read more here.

preprocess data sets for Tensorflow highlevel estimators

I'm coming from a Scikit Learn background.
I'm having difficulty understanding how to preprocess data sets for Tensorflow.
I'm trying to implement svm with the iris data set.
If I have two numpy arrays, one containing a list of the features, and the other containing the list of the labels, which functions would I use to create the classifier?
estimator = SVM(
    example_id_column='example_id',
    feature_columns=[real_feature_column, sparse_feature_column],
    l2_regularization=10.0)
I'm assuming the example_id_column would be
example_id_column = '0,1,2'
I'm not sure how to obtain the feature_columns.
I think the most effective way is to use TFRecords files. There's a comprehensive tutorial available that's still mostly relevant, too. This also has the advantage of letting you define a lot more of your pipeline as part of the graph, do concurrent reads from the source files, and avoid having to fit your dataset in memory. It's definitely worth the effort.
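A rough sketch of that TFRecords round trip, assuming plain float features and an integer label (the file name, field names, and input arrays are illustrative):

import tensorflow as tf

# Write: serialize each example as a tf.train.Example protobuf.
with tf.io.TFRecordWriter("iris.tfrecord") as writer:
    for features, label in zip(feature_array, label_array):  # assumed numpy inputs
        example = tf.train.Example(features=tf.train.Features(feature={
            "features": tf.train.Feature(float_list=tf.train.FloatList(value=features)),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
        }))
        writer.write(example.SerializeToString())

# Read: parse the records back inside a tf.data pipeline.
feature_spec = {
    "features": tf.io.FixedLenFeature([4], tf.float32),  # the four iris features
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    parsed = tf.io.parse_single_example(record, feature_spec)
    return parsed["features"], parsed["label"]

dataset = tf.data.TFRecordDataset("iris.tfrecord").map(parse).batch(32)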

Specifically, how to train neural network when it is larger than ram?

I have specific questions about how to train a neural network that is larger than RAM. I want to use the de facto standard, which appears to be Keras and TensorFlow.
What are the key classes and methods that I need to use, from NumPy, SciPy, pandas, and h5py to Keras, in order not to exceed my meager 8 GB of RAM? I have time to train the model; I don't have cash. My dataset requires 200 GB of RAM.
In Keras there is a model.fit() method. It requires X and Y numpy arrays. How do I get it to accept HDF5 arrays on disk? And when specifying the model architecture itself, how do I save RAM? Wouldn't the working memory require more than 8 GB at times?
Regarding fit_generator, does that accept HDF5 files? If the model.fit() method can accept HDF5, do I even need fit_generator? It seems that you still need to be able to fit the entire model in RAM even with these methods?
In Keras, does the model include the training data when calculating its memory requirements? If so, I think I am in trouble.
In essence, I am working under the constraint that at no time can I exceed my 8 GB of RAM, whether during one-hot encoding, loading the model, or training on even a small batch of samples. I am just not sure how to accomplish this concretely.
I cannot answer everything, and I'm also very interested in those answers, because I'm facing that 8GB problem too.
I can only suggest how to pass little batches at a time.
Question 2:
I don't think Keras will support passing the h5py file (but I really don't know), but you can create a loop to load the file partially (if the file is properly saved for that).
You can create an outer loop (see the sketch after this list) to:
- create a little array with only one or two samples from the file;
- use the method train_on_batch, passing only that little array;
- release the memory by disposing of the array, or by filling the same array with the next sample(s).
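A sketch of that loop, assuming an HDF5 file with 'x' and 'y' datasets and an already-compiled Keras model (the file name, dataset names, batch size, and epoch count are assumptions):

import h5py

batch_size = 2  # keep the in-memory slice tiny

with h5py.File("train.h5", "r") as f:
    x_ds, y_ds = f["x"], f["y"]  # assumed dataset names inside the file
    n_samples = x_ds.shape[0]
    for epoch in range(5):
        for start in range(0, n_samples, batch_size):
            # h5py slicing reads only this small chunk from disk.
            x_batch = x_ds[start:start + batch_size]
            y_batch = y_ds[start:start + batch_size]
            loss = model.train_on_batch(x_batch, y_batch)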
Question 3:
I also don't know about the h5py file: is the object that opens the file a Python generator?
If not, you can create the generator yourself.
The idea is to make the generator load only part of the file and yield little batch arrays with one or two data samples. (Pretty much the same as in question 2, but the loop goes inside a generator.)
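A matching sketch of such a generator, under the same assumptions about the HDF5 file layout:

import h5py

def h5_batch_generator(path, batch_size=2):
    # Yield small (x, y) batches read straight from disk, looping indefinitely.
    with h5py.File(path, "r") as f:
        x_ds, y_ds = f["x"], f["y"]  # assumed dataset names inside the file
        n_samples = x_ds.shape[0]
        while True:
            for start in range(0, n_samples, batch_size):
                yield x_ds[start:start + batch_size], y_ds[start:start + batch_size]

# With recent Keras, model.fit accepts the generator directly (fit_generator is the older name):
# model.fit(h5_batch_generator("train.h5"), steps_per_epoch=100, epochs=5)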
Usually, for very large sample sets, an "online" training method is used. This means that instead of training your neural network in one go with a large batch, the network is updated incrementally as more samples are obtained. See: Stochastic Gradient Descent