Efficient management of large amounts of data with SageMaker for training a keras model - amazon-s3

I'm working on a deep learning project with about 700GB of table-like time series data in thousands of .csv files (each about 15MB).
All the data is on S3 and it needs some preprocessing before being fed into the model. The question is how to best go about automating the process of loading, preprocessing and training. Is a custom keras generator with some built in preprocessing the best solution?

Preprocessing implies that this is something you might want to decouple from the model execution and run separately, possibly on a schedule or in response to new data flowing in.
If so, you'll probably want to do the preprocessing outside of SageMaker. You could orchestrate it using Glue, or you could write a custom job and run it through AWS Batch or alternatively on an EMR cluster.
That way, your Keras notebook can load the already preprocessed data, train and test through SageMaker.
With a little care, you should be able to perform at least some of the heavy lifting incrementally in the preprocessing step, saving both time and cost downstream in the Deep Learning pipeline.

Related

Solutions for big data preprecessing for feeding deep neural network models built with TensorFlow 2.0?

Currently I am using Python, Numpy, pandas, scikit-learn to do data preprocessing (LabelEncoder, MinMaxScaler, fillna, etc.), and then feeding the processed data to DNN models built with Tensorflow 2.0. This input pipeline meets my needs when data is small enough to fit a PC's RAM.
Now I have some large datasets, more than 10GB, some are larger. I also plan to deploy the models in a production environment, which means there will be new data coming everyday. For DNN model training there is distributed strategy of tensorflow 2.0. But for data preprocessing obviously I cannot use pandas, scikitlearn on the large datasets with one PC. It seems to me I need to use a for-loop where I repeatedly fetch a small part of the data and use it for training?
I am wondering what do people typically use in either experiment or production environment for big data preprocessing?
Should I use Spark(Scala) / PySpark and Tensorflow input pipeline?
Yeah, with the current way you are doing preprocessing, it'll not scale well.
PySpark is one right way to run your preprocessing layer. Setup a simple standalone spark cluster with few workers and then run your preprocessing (labelEncoder/OneHotEncoder/fillNA/...) This solution should scale well and it abstracts the distributed computation layer.
PS : PySpark might not be the only way forward, but it is one of the good way forward for this use case.

Reusing transformations between training and predictions

I'd like to apply stemming to my training data set. I can do this outside of tensorflow as part of training data prep, but I then need to do the same process on prediction request data before calling the (stored) model.
Is there a way of implementing this transformation in tensorflow itself so the transformation is used for both training and predictions?
This problem becomes more annoying if the transformation requires knowledge of the whole dataset, normalisation for example.
Can you easily express your processing (e.g. stemming) as a tensorflow operation? If yes, then you can build your graph in a way that both your inputs and predictions can make use of the same set of operations. Otherwise, there isn't much harm in calling the same (non tensorflow) function for both pre-processing and for predictions.
Re normalisation: you would find the dataset statistics (means, variance, etc. depending on how exactly you are normalizing) and then hardcode them for the pre/post-processing so I don't think that's really an annoying case.

Dataproc, Dataprep and Tensorflow

I'm trying to create ML models dealing with big datasets. My question is more related to the preprocessing of these big datasets. In this sense, I'd like to know what are the differences between doing the preprocessing with Dataprep, Dataproc or Tensorflow.
Any help would be appreciated.
Those are 3 different things, you can't really compare them.
Dataprep - data service for visually exploring, cleaning, and
preparing structured and unstructured data for analysis
In other words, if you have a large training data and you want to clean it up, visualize etc. google dataprep enables you to do that easily.
Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for
running Apache Spark and Apache Hadoop clusters in a simpler, more
cost-efficient way.
Within the context of your question, after you cleanup your data and it is ready to feed into your ML algorithm, you can use Cloud Dataproc to distribute it across multiple nodes and process it much faster. In some machine learning algorithms the disk read speed might be a bottleneck so it could greatly improve your machine learning algorithms running time.
Finally Tensorflow:
TensorFlowâ„¢ is an open source software library for numerical
computation using data flow graphs. Nodes in the graph represent
mathematical operations, while the graph edges represent the
multidimensional data arrays (tensors) communicated between them.
So after your data is ready to process; you can use Tensorflow to implement machine learning algorithms. Tensorflow is a python library so it is relatively easy to pick up. Tensorflow also enables to run your algorithms on GPU instead of CPU and (recently) also on Google Cloud TPUs(hardware made specifically for machine learning, even better performance than GPUs).
In the context of preprocessing for Machine Learning, I would like to put a time to answer this question in details. So, please bear with me!
Google provides four different processing products. Since, preprocessing has different aspects and covers many different ML prerequisites, each of these platforms is more suitable for a particular preprocessing domain. Products are as follows:
Google ML Engine/ Cloud AI: This product is based on Tensorflow. You can run your Machine Learning code in Tensorflow on the ML Engine. For specific types of data like image, text or sequential, tf.keras.preprocessing or tf.contrib.learn.preprocessing Libraries are available to make the appropriate input/tensor format of data for Tensorflow rapidly.
You may also need to transform your data via tf.Transform in a preprocessing step. tf.Transform, a library for TensorFlow, allows users to define preprocessing pipelines as part of a TensorFlow graph. tf.Transform ensures that no skew can arise during preprocessing.
Cloud DataPrep: Preprocessing sometimes is defined as data cleaning, data cleansing, data prepping and data alteration. For this purposes, Cloud DataPrep is the best option. For instance, if you want to get rid of null values or some ASCII characters which may cause errors in your ML model, you can use Cloud DataPrep.
Cloud DataFlow, Cloud Dataproc: Feature extraction, feature selection, scaling, dimension reduction also can be considered as a part of ML preprocessing. Since Cloud DataFlow and DataProc both support Spark, one can use Spark libraries for distributed fast preprocessing of the ML models input. Apache Spark MLlib can also be applied to many ML preprocessing/processing. Note that since Cloud DataFlow supports Apache Beam, it is more into stream processing while Cloud DataProc is more Hadoop-based and is better for batch preprocessing. For more details, please refer to Using Apache Spark with TensorFlow document

Deep Learning with TensorFlow on Compute Engine VM

I'm actualy new in Machine Learning, but this theme is vary interesting for me, so Im using TensorFlow to classify some images from MNIST datasets...I run this code on Compute Engine(VM) at Google Cloud, because my computer is to weak for this. And the code actualy run well, but the problam is that when I each time enter to my VM and run the same code I need to wait while my model is training on CNN, and after I can make some tests or experiment with my data to plot or import some external images to impruve my accuracy etc.
Is There is some way to save my result of trainin model just once, some where, that when I will decide for example to enter to the same VM tomorrow...and dont wait anymore while my model is training. Is that possible to do this ?
Or there is maybe some another way to do something similar ?
You can save a trained model in TensorFlow and then use it later by loading it; that way you only have to train your model once, and use it as many times as you want. To do that, you can follow the TensorFlow documentation regarding that topic, where you can find information on how to save and load the model. In short, you will have to use the SavedModelBuilder class to define the type and location of your saved model, and then add the MetaGraphs and variables you want to save. Loading the saved model for posterior usage is even easier, as you will only have to run a command pointing to the location of the file in which the model was exported.
On the other hand, I would strongly recommend you to change your working environment in such a way that it can be more profitable for you. In Google Cloud you have the Cloud ML Engine service, which might be good for the type of work you are developing. It allows you to train your models and perform predictions without the need of an instance running all the required software. I happen to have worked a little bit with TensorFlow recently, and at first I was also working with a virtualized instance, but after following some tutorials I was able to save some money by migrating my work to ML Engine, as you are only charged for the usage. If you are using your VM only with that purpose, take a look at it.
You can of course consult all the available documentation, but as a first quickstart, if you are interested in ML Engine, I recommend you to have a look at how to train your models and how to get your predictions.

TensorFlow in production: How to retrain your models

I have a question related to this one:
TensorFlow in production for real time predictions in high traffic app - how to use?
I want to setup TensorFlow Serving to do inference as a service for our other application. I see how TensorFlow Serving helps me to do that. Additionally, it mentions a continuous training pipeline, which probably is related to the possibility that TensorFlow Serving can serve with multiple versions of a trained model. But what I am not sure is how to retrain your model as you get new data. The other post mentions the idea to run retraining with cron jobs. However, I am not sure if automatic retraining is a good idea. What architecture would you propose for a continuous retraining pipeline with a system continuously facing new, labelled data?
Edit: It is a supervised learning case. The question is would you automatically retrain your model after n new datapoints came in or would you retrain during the downtime of the customer automatically or just retrain manually?
You probably want to use some kind of semi-supervised training. There's fairly extensive research in that area.
A crude, but expedient way, which works well, is to use the current best models that you have to label the new, incoming data. Models are typically able to produce a score (hopefully a logprob). You can use that score to only train on the data that fits well.
That is an approach that we have used in speech recognition and is an excellent baseline.