I'm coming from a scikit-learn background.
I'm having difficulty understanding how to preprocess data sets for TensorFlow.
I'm trying to implement an SVM with the iris data set.
If I have two numpy arrays, one containing a list of the features, and the other containing the list of the labels, which functions would I use to create the classifier?
estimator = SVM(
    example_id_column='example_id',
    feature_columns=[real_feature_column, sparse_feature_column],
    l2_regularization=10.0)
I'm assuming the example_id_column would be
example_id_column = '0,1,2'
I'm not sure how to obtain the feature_columns.
I think the most effective way is to use TFRecord files. There's a comprehensive tutorial available that's still mostly relevant, too. This approach also lets you define much more of your pipeline as part of the graph, read concurrently from the source files, and avoid having to fit your dataset in memory. It's definitely worth the effort.
I'm training a neural network using Keras, but I'm not sure how to feed the training data into the model in the way that I want.
My training data set is effectively infinite: I have some code to generate training examples as needed, so I just want to pipe a continuous stream of novel data into the network. Keras seems to want me to specify my entire dataset in advance by creating a numpy array with everything in it, but this obviously won't work with my approach.
I've experimented with creating a generator class based on keras.utils.Sequence, which seems like a better fit, but it still requires me to specify a length via the __len__ method, which makes me think it will only create that many examples before recycling them. Can someone suggest a better approach?
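One possible direction, sketched here under assumptions: a plain Python generator has no length at all, so nothing is ever recycled. In the sketch below, make_example is a hypothetical stand-in for the example-generating code mentioned above, and the batches are plain lists to keep it free of dependencies (in practice they would be NumPy arrays).

```python
import random

def make_example():
    """Hypothetical stand-in for the real example-generating code."""
    x = [random.random() for _ in range(4)]
    y = sum(x)  # toy target so the sketch is self-contained
    return x, y

def batch_stream(batch_size):
    """Yield (features, targets) batches forever; every example is freshly generated."""
    while True:
        examples = [make_example() for _ in range(batch_size)]
        features = [x for x, _ in examples]
        targets = [y for _, y in examples]
        yield features, targets

stream = batch_stream(batch_size=32)
features, targets = next(stream)  # one batch of 32 new examples
```

Keras's model.fit accepts a generator like this together with steps_per_epoch, in which case steps_per_epoch only decides how many batches are counted as one "epoch" for logging and callbacks; it does not cause data to be reused.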
I am using TensorFlow Federated to simulate a scenario in which clients hosted on a remote server can work with our very sparse dataset in a federated setting.
Presently, the code is capable of running with a small subset of the very sparse dataset being loaded on the server-side and passing it to the remote workers hosted on another device. The data is in SVM Light format and can be loaded through sklearn's load_svmlight_file function, but needs to be converted into Tensors to work within tff. The current solution to do so involves converting the very sparse data into a dense array, then setting it up through the tf.data.Dataset.from_tensor_slices function for use with a keras model (following existing examples for tff).
This works, but takes up significant memory resources and is not suitable for the dataset as it cannot be run remotely for more than six samples due to the sparse data's serialized size, nor locally with more than a few hundred samples due to the size in memory.
To mitigate this, I converted the data into SparseTensors, but this approach fails due to the tff.learning.from_keras_model function expecting a pair of TensorSpec input_spec values, not a SparseTensorSpec input_spec with the labels being TensorSpec.
So, are there any concrete examples or known methods for working with SparseTensors within Keras models in tff? Or must the inputs be dense Tensors for now? The data loads fine when it is kept sparse, so I need to find a solution that works with the sparse data.
If there is presently no way to do so, are there examples of strategies within tff to work with very small subsets of data at a time, either being loaded directly with the remote client or being passed from the server?
Thanks!
I'd say the best approach for now is to work with TF's underlying representation of a tf.SparseTensor, that is, a tuple of three tensors: indices, values and dense_shape.
So if the problem is that Keras requires the input not to be sparse tensors, you can pass in the input as, for instance, a dictionary of these three tensors, and convert it to a tf.sparse.SparseTensor as part of your tf.data pipeline.
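To make the three components concrete, here is a minimal plain-Python sketch (no TensorFlow involved) of how a mostly-zero matrix maps to indices, values and dense_shape and back. The helper names dense_to_coo and coo_to_dense are made up for illustration; in a real pipeline the three fields would be tensors, and the sparse tensor would be rebuilt with tf.sparse.SparseTensor(indices, values, dense_shape).

```python
def dense_to_coo(dense):
    """Return (indices, values, dense_shape) for a 2-D list of numbers."""
    indices, values = [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v != 0:
                indices.append([r, c])  # position of each non-zero entry
                values.append(v)        # the non-zero entry itself
    dense_shape = [len(dense), len(dense[0]) if dense else 0]
    return indices, values, dense_shape

def coo_to_dense(indices, values, dense_shape):
    """Rebuild the dense 2-D list from the three components."""
    rows, cols = dense_shape
    dense = [[0] * cols for _ in range(rows)]
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense

dense = [[0, 0, 3], [0, 0, 0], [5, 0, 0]]
indices, values, dense_shape = dense_to_coo(dense)

# Three ordinary fields that can be passed around as a dictionary.
example = {"indices": indices, "values": values, "dense_shape": dense_shape}
```

Only the non-zero entries are stored, which is where the memory and serialization savings come from. Note that the number of non-zeros differs between examples, so batching these fields may need padding or per-example handling.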
See also this tutorial which I think is doing something related to what you are looking for, and please ask more detailed questions if needed!
I have a data model consisting only of categorical features and a categorical label.
So when I build that model manually in XGBoost, I would basically transform the features to binary columns (using LabelEncoder and OneHotEncoder), and the label into classes using LabelEncoder. I would then run a multiclass classification (multi:softmax).
I tried that with my dataset and ended up with an accuracy around 0.4 (unfortunately I can't share the dataset due to confidentiality).
Now, if I run the same dataset in Azure AutoML, I end up with an accuracy around 0.85 in the best experiment. But what is really interesting is that AutoML uses SparseNormalizer and XGBoostClassifier, with reg:logistic as the objective.
So if I interpret this right, AzureML just normalizes the data (somehow from categorical data?) and then executes a logistic regression? Is this even possible / does this make sense with categorical data?
Thanks in advance.
TL;DR You're right that normalization doesn't make sense for training gradient-boosted decision trees (GBDTs) on categorical data, but it won't have an adverse impact. AutoML is an automated framework for modeling: you give up fine-grained control over the pipeline in exchange for ease of use. It is still worth verifying first that AutoML is receiving data with the columns properly encoded as categorical.
Think of an AutoML model as effectively a sklearn Pipeline, which is a bundled set of pre-processing steps along with a predictive Estimator. AutoML will attempt to sample from a large swath of pre-configured Pipelines such that the most accurate Pipeline will be discovered. As the docs say:
In every automated machine learning experiment, your data is automatically scaled or normalized to help algorithms perform well. During model training, one of the following scaling or normalization techniques will be applied to each model.
To see this, you can call .named_steps on your fitted model. Also check out fitted_model.get_featurization_summary().
I especially empathize with your concern w.r.t. how LightGBM (MSFT's GBDT implementation) is leveraged by AutoML. LightGBM accepts categorical columns and, instead of one-hot encoding them, partitions the categories into two subsets at each split. Despite this, AutoML will pre-process away the categorical columns by one-hot encoding, scaling, and/or normalization, so this unique categorical approach is never utilized in AutoML.
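A toy illustration of the difference, in plain Python and with made-up data: a split on a one-hot column can only separate one category from all the others, while a native categorical split can send any subset of categories to one side. The subset below is hand-picked for illustration; in LightGBM it would be chosen during training.

```python
column = ["red", "green", "blue", "green", "red", "yellow"]
categories = sorted(set(column))  # ['blue', 'green', 'red', 'yellow']

# One-hot encoding: one binary column per category.
one_hot = [[int(value == cat) for cat in categories] for value in column]

# A split on a single one-hot column isolates exactly one category ("red" here).
red_index = categories.index("red")
one_hot_split = [row[red_index] == 1 for row in one_hot]

# A native categorical split sends a whole subset of categories to the left.
left_subset = {"red", "yellow"}  # hand-picked, not learned
subset_split = [value in left_subset for value in column]
```

Reproducing the subset split with one-hot columns would take two consecutive splits (one on "red", one on "yellow"), which is why the native handling can yield shallower trees.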
If you're interested in "manual" ML in Azure ML, I highly suggest looking into Estimators and Azure ML Pipelines.
Keras's ImageDataGenerator looks great for progressively loading images and passing an iterator to the model.fit function. However, it seems to be usable only for images and for classification tasks.
I want to do regression, i.e., my labels are also arrays of the same shape as the training set ones. In practice, they are multidimensional (>1 channels) arrays like images but they are not images.
Any suggestions on what class to use to simply spit batches of data to a keras model.fit() for training a deep neural net?
The problem, of course, is that my datasets are much too large to fit in memory, which is why I need to use these generators/iterators.
The best solution for your case is to use tf.data.Dataset.
While it may take a little time to get accustomed to it, it is the recommended way to load your data and use model.fit().
You can consult the documentation here: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
It is new, fast, beautifully designed and easily extensible.
For instance, for your problem you may want to use tf.data.Dataset.from_tensor_slices(); I will leave you to discover its features :D. Note that from_tensor_slices() expects the data to already be in memory; for data that does not fit, tf.data.Dataset.from_generator() is the closer match.
A quick solution would be to use Colab, whose GPU instance has 24 GB of RAM to work with. You could also reduce your memory usage when you load the numpy array, the way I did here.
Currently I am using Python, NumPy, pandas and scikit-learn to do data preprocessing (LabelEncoder, MinMaxScaler, fillna, etc.), and then feeding the processed data to DNN models built with TensorFlow 2.0. This input pipeline meets my needs when the data is small enough to fit in a PC's RAM.
Now I have some large datasets, more than 10 GB, and some are larger. I also plan to deploy the models in a production environment, which means there will be new data coming in every day. For DNN model training there is the distribution strategy of TensorFlow 2.0. But for data preprocessing I obviously cannot use pandas and scikit-learn on the large datasets with one PC. It seems to me I need to use a for-loop where I repeatedly fetch a small part of the data and use it for training?
I am wondering what people typically use in either an experimental or a production environment for big data preprocessing.
Should I use Spark (Scala) / PySpark together with a TensorFlow input pipeline?
Yeah, the way you are currently doing preprocessing will not scale well.
PySpark is one good way to run your preprocessing layer. Set up a simple standalone Spark cluster with a few workers and then run your preprocessing (LabelEncoder/OneHotEncoder/fillna/...). This solution should scale well, and it abstracts away the distributed computation layer.
PS: PySpark might not be the only way forward, but it is one of the good ways forward for this use case.
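If a cluster is more than you need, the chunked for-loop idea from the question can also work on a single machine for simple transforms. Below is a minimal sketch of two-pass min-max scaling that holds only one chunk in memory at a time. It uses only the standard library; the tiny in-memory CSV and the helper names (iter_chunks, column_min_max, scaled_chunks) are hypothetical stand-ins for a large file on disk and your own code.

```python
import csv
import io

def iter_chunks(rows, chunk_size):
    """Group an iterator of rows into lists of at most chunk_size rows."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def column_min_max(open_source, chunk_size):
    """First pass: track per-column min and max, one chunk at a time."""
    lo = hi = None
    for chunk in iter_chunks(csv.reader(open_source()), chunk_size):
        for row in chunk:
            values = [float(v) for v in row]
            lo = values if lo is None else [min(a, b) for a, b in zip(lo, values)]
            hi = values if hi is None else [max(a, b) for a, b in zip(hi, values)]
    return lo, hi

def scaled_chunks(open_source, lo, hi, chunk_size):
    """Second pass: yield chunks scaled to [0, 1]; constant columns map to 0."""
    for chunk in iter_chunks(csv.reader(open_source()), chunk_size):
        yield [
            [(float(v) - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in chunk
        ]

# In-memory stand-in for a large CSV file on disk (hypothetical data).
DATA = "1,10\n3,30\n5,20\n2,40\n"
open_source = lambda: io.StringIO(DATA)

lo, hi = column_min_max(open_source, chunk_size=2)
batches = list(scaled_chunks(open_source, lo, hi, chunk_size=2))
```

In real use you would not collect the batches into a list but feed each scaled chunk straight into training. The statistics from the first pass (lo and hi here) should be saved so that the same scaling is applied to new data in production.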