I'm trying to create ML models that deal with big datasets. My question is mainly about the preprocessing of these big datasets. Specifically, I'd like to know the differences between doing the preprocessing with Dataprep, Dataproc, or TensorFlow.
Any help would be appreciated.
Those are three different things; you can't really compare them directly.
Dataprep - data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis
In other words, if you have large training data and you want to clean it up, visualize it, etc., Google Dataprep enables you to do that easily.
Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Within the context of your question, after you clean up your data and it is ready to feed into your ML algorithm, you can use Cloud Dataproc to distribute it across multiple nodes and process it much faster. For some machine learning algorithms, disk read speed can be a bottleneck, so this can greatly reduce your algorithm's running time.
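For instance, here is a minimal PySpark sketch of that idea; the bucket paths, file names, and partition count are hypothetical:

```python
# Minimal PySpark sketch for a Dataproc cluster; paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prep-training-data").getOrCreate()

# Dataproc clusters can read gs:// paths directly via the GCS connector.
df = spark.read.csv("gs://your-bucket/cleaned/training_data.csv",
                    header=True, inferSchema=True)

# Repartition so downstream work is spread across the cluster's executors.
df = df.repartition(64)
df.write.parquet("gs://your-bucket/prepared/training_data.parquet", mode="overwrite")
```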
Finally Tensorflow:
TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
So, once your data is ready to process, you can use TensorFlow to implement machine learning algorithms. TensorFlow is a Python library, so it is relatively easy to pick up. TensorFlow also lets you run your algorithms on a GPU instead of a CPU, and (more recently) on Google Cloud TPUs (hardware made specifically for machine learning, with even better performance than GPUs).
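As a rough illustration of that last step, here is a minimal TensorFlow 2.x sketch with made-up shapes and random stand-in data:

```python
# Minimal TensorFlow 2.x sketch; shapes, layer sizes, and data are illustrative only.
import numpy as np
import tensorflow as tf

# Check whether TensorFlow sees a GPU; training uses it automatically if present.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random stand-in for the data you prepared upstream.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000, 1))
model.fit(x_train, y_train, epochs=2, batch_size=256)
```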
In the context of preprocessing for machine learning, I would like to take some time to answer this question in detail. So, please bear with me!
Google provides four different processing products. Since preprocessing has different aspects and covers many different ML prerequisites, each of these platforms is more suitable for a particular preprocessing domain. The products are as follows:
Google ML Engine / Cloud AI: This product is based on TensorFlow. You can run your machine learning code written in TensorFlow on ML Engine. For specific types of data such as image, text, or sequential data, the tf.keras.preprocessing or tf.contrib.learn.preprocessing libraries are available to quickly build the appropriate input/tensor format for TensorFlow.
You may also need to transform your data via tf.Transform in a preprocessing step. tf.Transform, a library for TensorFlow, allows users to define preprocessing pipelines as part of a TensorFlow graph, which ensures that no training-serving skew can arise during preprocessing.
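A minimal sketch of what a tf.Transform preprocessing function can look like, assuming hypothetical feature names (age, country, label):

```python
# Sketch of a tf.Transform preprocessing_fn; feature names are hypothetical.
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Analyzers (min/max, vocabulary) run over the full dataset once;
    the resulting transform is embedded in the serving graph, avoiding skew."""
    outputs = {}
    outputs["age_scaled"] = tft.scale_to_0_1(inputs["age"])
    outputs["country_id"] = tft.compute_and_apply_vocabulary(inputs["country"])
    outputs["label"] = inputs["label"]
    return outputs
```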
Cloud Dataprep: Preprocessing is sometimes described as data cleaning, data cleansing, data prepping, or data alteration. For these purposes, Cloud Dataprep is the best option. For instance, if you want to get rid of null values or stray ASCII characters that may cause errors in your ML model, you can use Cloud Dataprep.
Cloud Dataflow, Cloud Dataproc: Feature extraction, feature selection, scaling, and dimensionality reduction can also be considered part of ML preprocessing. Since Cloud Dataproc runs Apache Spark, you can use Spark MLlib and other Spark libraries for fast, distributed preprocessing of your ML model's input. Note that Cloud Dataflow runs Apache Beam pipelines and is more oriented towards stream processing, while Cloud Dataproc is Hadoop/Spark-based and better suited for batch preprocessing. For more details, please refer to the Using Apache Spark with TensorFlow document.
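As an illustration, a small PySpark MLlib sketch for distributed feature scaling; the input path and column names are assumptions:

```python
# PySpark MLlib sketch for distributed feature scaling; paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("mllib-preprocessing").getOrCreate()
df = spark.read.parquet("gs://your-bucket/prepared/training_data.parquet")

# Pack raw numeric columns into a single feature vector, then standardize it.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features",
                        withMean=True, withStd=True)

assembled = assembler.transform(df)
scaled = scaler.fit(assembled).transform(assembled)
```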
We are looking for ML model serving with a developer experience where the ML engineers don't need to know DevOps.
Ideally we are looking for the following ergonomics or something similar:
Initialize a new model serving endpoint, preferably via a CLI, and get back a GCS bucket
Each time we train a new model, we put it in the GCS bucket from step 1.
The serving system guarantees that the most recent model in the bucket is served, unless a specific model is requested by version number.
We are also looking for a service that optimizes cost and latency.
Any suggestions?
Have you considered https://www.tensorflow.org/tfx/serving/architecture? You can definitely automate the entire workflow using tfx. I think the guide here does a good job walking through it. Depending on your use case, you may want to use tft instead of Kubeflow like they're doing in that guide.
Besides serving automation, you may also want to consider pipeline automation to separate the feature engineering from the pipeline mechanics itself. For example, you can build the pipeline, abstract the feature engineering into a TensorFlow function meeting certain requirements, and automate the deployment process as well. This way you don't need to deal with the feature specs/schemas manually, and you know that your transformations are the same during serving as they were during training.
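For instance, to get the "serve the most recent model" behaviour you describe, TensorFlow Serving by default serves the highest-numbered version subdirectory under a model's base path, so a sketch of the export step could look like this (the bucket path and versioning scheme are illustrative, and writing directly to gs:// assumes your TensorFlow build has GCS filesystem support):

```python
# Sketch: export each new model under an increasing version number so
# TensorFlow Serving picks up the newest one automatically.
# The bucket path and model are illustrative assumptions.
import time
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

version = int(time.time())  # monotonically increasing version id
export_path = f"gs://your-model-bucket/my_model/{version}"
model.save(export_path)  # writes a SavedModel; TF Serving watches the parent dir
```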
You can do the same thing with scikit-learn also, and I believe serving scikit-learn models is also supported under the vertex-ai umbrella.
To your point about latency, you definitely want the pipeline doing the transformations on the GPU; as such, I would recommend TensorFlow over something like scikit-learn if the use case is truly time-sensitive.
Best of luck!
Currently I am studying the usage of Apache Spark 3.0 with RAPIDS GPU acceleration. In the official spark-rapids docs I came across this page, which states:
There are cases where you may want to get access to the raw data on the GPU, preferably without copying it. One use case for this is exporting the data to an ML framework after doing feature extraction.
To me this sounds as if one could make data that is already available on the GPU from some upstream Spark ETL process directly available to a framework such as Tensorflow or PyTorch. If this is the case how can I access the data from within any of these frameworks? If I am misunderstanding something here, what is the quote exactly referring to?
The link you reference really only gives you access to the data still sitting on the GPU; using that data in another framework, like TensorFlow or PyTorch, is not that simple.
TL;DR: Unless you have a library explicitly set up to work with the RAPIDS accelerator, you probably want to run your ETL with RAPIDS, save the result, and launch a new job to train your models using that data.
There are still a number of issues that you would need to solve. We have worked on these in the case of XGBoost, but it has not been something that we have tried to tackle for Tensorflow or PyTorch yet.
The big issues are:
Getting the data to the correct process. Even if the data is on the GPU, because of security, it is tied to a given user process. PyTorch and Tensorflow generally run as python processes and not in the same JVM that Spark is running in. This means that the data has to be sent to the other process. There are several ways to do this, but it is non-trivial to try and do it as a zero-copy operation.
The format of the data is not what TensorFlow or PyTorch want. The data for RAPIDS is in an Arrow-compatible format. TensorFlow and PyTorch have APIs for importing data in standard formats from the CPU, but it might take a bit of work to get the data into a format that the frameworks want and to find an API that lets you pull it in directly from the GPU.
Sharing GPU resources. Spark only recently added support for scheduling GPUs. Prior to that, people would just launch a single Spark task per executor and a single Python process, so that the Python process would own the entire GPU when doing training or inference. With the RAPIDS accelerator the GPU is not free any more, and you need a way to share the resources. RMM provides some of this if both libraries are updated to use it and they are in the same process, but PyTorch and TensorFlow typically run in separate Python processes, so figuring out how to share the GPU is hard.
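To make the TL;DR concrete, here is a rough sketch of that two-job pattern; all paths and column names are made up:

```python
# Job 1: Spark ETL (optionally GPU-accelerated via the RAPIDS plugin) writes Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rapids-etl").getOrCreate()
features = spark.read.parquet("/data/raw/").selectExpr("f1", "f2", "label")
features.write.parquet("/data/features/", mode="overwrite")
```

```python
# Job 2 (a separate process): load the Parquet output into TensorFlow for training.
import pandas as pd
import tensorflow as tf

pdf = pd.read_parquet("/data/features/")  # small enough to fit in memory here
dataset = tf.data.Dataset.from_tensor_slices(
    (pdf[["f1", "f2"]].values, pdf["label"].values)
).batch(1024)
```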
Currently I am using Python, NumPy, pandas, and scikit-learn to do data preprocessing (LabelEncoder, MinMaxScaler, fillna, etc.), and then feeding the processed data to DNN models built with TensorFlow 2.0. This input pipeline meets my needs when the data is small enough to fit in a PC's RAM.
Now I have some large datasets, more than 10GB, some larger. I also plan to deploy the models in a production environment, which means there will be new data coming every day. For DNN model training there is the distributed strategy of TensorFlow 2.0. But for data preprocessing I obviously cannot use pandas or scikit-learn on the large datasets with one PC. It seems to me I need to use a for-loop where I repeatedly fetch a small part of the data and use it for training?
I am wondering what do people typically use in either experiment or production environment for big data preprocessing?
Should I use Spark(Scala) / PySpark and Tensorflow input pipeline?
Yeah, with the current way you are doing preprocessing, it won't scale well.
PySpark is one right way to run your preprocessing layer. Set up a simple standalone Spark cluster with a few workers and then run your preprocessing (LabelEncoder / OneHotEncoder / fillna / ...). This solution should scale well, and it abstracts away the distributed computation layer.
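As a rough sketch of what that can look like (column names and paths are made up, and the Spark estimators are approximate stand-ins for their scikit-learn counterparts):

```python
# Rough PySpark (Spark 3.x) equivalent of the pandas/scikit-learn steps;
# column names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinMaxScaler

spark = SparkSession.builder.appName("big-data-preprocessing").getOrCreate()
df = spark.read.csv("/data/train/*.csv", header=True, inferSchema=True)

# fillna equivalent
df = df.fillna({"city": "unknown", "income": 0.0})

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="city", outputCol="city_idx"),            # ~LabelEncoder
    OneHotEncoder(inputCols=["city_idx"], outputCols=["city_ohe"]),
    VectorAssembler(inputCols=["income"], outputCol="income_vec"),   # MinMaxScaler needs a vector column
    MinMaxScaler(inputCol="income_vec", outputCol="income_scaled"),  # ~MinMaxScaler
])
processed = pipeline.fit(df).transform(df)
processed.write.parquet("/data/train_processed/", mode="overwrite")
```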
PS: PySpark might not be the only way forward, but it is one of the good ways forward for this use case.
I'm using ML Engine to run a hyperparameter tuning job for a Keras/TensorFlow model. I originally set the machine type to complex_model_l, which costs $1.65/hour. However, I'm using TFRecords saved on Google Cloud Storage for my training and validation sets.
Given that they only take up ~6GB of space combined, is there any need for such a large machine? Could I use a standard machine ($0.27/hour) and run the tuning job just as quickly?
Any advice would be awesome! I'm just not sure to what degree Tensorflow can make use of multiple cores by default.
Thanks!
I have implemented a neural network model using Python and Tensorflow, which normally runs on my own computer.
Now I would like to train it on new datasets on the Google Cloud Platform. Do you think it is possible? Do I need to change my code?
Thank you very much for your help!
Google Cloud offers the Cloud ML Engine service, which allows you to train your models and perform predictions without the need to run and maintain an instance with the required software.
In order to run the TensorFlow NN models you already have, you will not need to change your code; you will only have to package the trainer appropriately, as described in the documentation, and run an ML Engine job that performs the training itself. Once you have your model, you can also deploy it in the same service and later get predictions in different ways depending on your requirements (urgency in getting the predictions, data set sources, etc.).
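As a sketch of what "packaging the trainer" means in practice (the package name, module layout, and dependency list are illustrative assumptions):

```python
# setup.py -- minimal packaging sketch for an ML Engine trainer package.
# Assumes your training code lives in trainer/task.py (with trainer/__init__.py).
from setuptools import setup, find_packages

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=["pandas"],  # any extra pip dependencies your code needs
)

# The job itself is then submitted with something along the lines of:
#   gcloud ml-engine jobs submit training my_job \
#       --module-name trainer.task --package-path trainer/ \
#       --region us-central1 --job-dir gs://your-bucket/jobs/my_job
```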
Alternatively, as suggested in the comments, you can always launch a Compute Engine instance and run your TensorFlow model there, just as you would locally on your computer. However, I would strongly recommend the approach I proposed earlier, as you will save some money (you are only charged for your usage, i.e. training jobs and/or predictions) and do not need to configure an instance from scratch.