How does the TensorFlow dataset handle large data that cannot fit into memory on a server? - tensorflow

Question
How does the TensorFlow dataset handle large data that cannot fit into memory on a server?
Spark RDD can handle large data with multiple nodes. For the question Tensorflow Transform: How to find the mean of a variable over the entire dataset, the answer is to use TensorFlow Transform, which uses Apache Beam and therefore requires a distributed computation cluster such as Spark.
If we have a large dataset, say a CSV file that is 50 GB, how do we calculate the mean or other similar statistics?
Hence I suppose TensorFlow requires a multi-node cluster, but it is not clear whether TensorFlow has its own cluster implementation or reuses existing technologies. Since TensorFlow pre-processing, e.g. getting the mean or std of a column, requires Apache Beam, I guess it is Apache Beam based too, but I am not sure.
A google paper Large-Scale Machine Learning on Heterogeneous Distributed Systems shows multiple workers.
This article TensorFlow: A new paradigm for large scale ML in distributed systems tells the system components.
In terms of system components, TensorFlow consists of Master, Worker and Client for distributed coordination and execution.
This GitHub repository, TensorFlow2-tutorial/05-distributed-training/, shows TF_CONFIG specifying the node IP/port.
TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python worker.py
The TensorFlow example on GitHub, Distributed TensorFlow, has the section below, but I do not see the node setup detail.
Create a tf.train.ClusterSpec to describe the cluster
Hence there apparently is a way to set up a TensorFlow cluster, which I suppose handles loading a large dataset into a TF dataset.
However, Install TensorFlow 2 only shows:
# Current stable release for CPU and GPU
pip install tensorflow
Please point me to step-by-step documentation on how to set up a TensorFlow multi-node cluster, and to resources that explain how large data loading is handled in TF (similar to the Spark RDD/DataFrame explanation and internals).

You need to use generator functions that pull in chunked data. Each chunk you want to send is emitted through a yield statement. TensorFlow lets you create a Dataset that returns Tensors yielded by a generator function. This dataset is finally consumed by the .fit method as follows:
import itertools

import tensorflow as tf

def gen():
    # Yield one chunk at a time; only the current chunk is held in memory.
    for i in itertools.count(1):
        yield (i, [1] * i)

dataset = tf.data.Dataset.from_generator(
    gen,
    (tf.int64, tf.int64),
    (tf.TensorShape([]), tf.TensorShape([None])))

list(dataset.take(3).as_numpy_iterator())
train(dataset, max_steps=100)  # train() stands in for model.fit(dataset, ...)
This approach has several benefits:
it limits RAM usage during training (to the size of a chunk)
it allows you to stream asynchronously (e.g. from a large file, a remote database, a web-scraping bot, etc.)
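For the 50 GB CSV from the original question, the same streaming idea also covers statistics such as the mean: let tf.data read the file in batches so only one batch is resident in memory, and accumulate a running sum and count. A minimal sketch, assuming a numeric column named "price" and a placeholder file name:

import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    "large.csv",           # hypothetical 50 GB file
    batch_size=10_000,     # only one batch is held in memory at a time
    select_columns=["price"],
    num_epochs=1,
    shuffle=False)

total, count = 0.0, 0
for batch in dataset:
    values = tf.cast(batch["price"], tf.float64)
    total += float(tf.reduce_sum(values))
    count += int(tf.size(values))

print("mean:", total / count)

The same batched dataset can also be handed to model.fit directly, so statistics gathering and training can share one streaming pipeline.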

Related

Transforming tensorflow datasets to beam datasets

There are a variety of ways to get a dataset you can train on in tensorflow. One of the things tensorflow transform does is provide the ability to do preprocessing via AnalyzeAndTransformDataset and TransformDataset. Surprisingly, the dataset being referred to is not a tensorflow dataset, but rather a dataset in the apache beam sense. That is understandable to some degree, given that the function is tft_beam.AnalyzeAndTransformDataset.
The heart of my question is this: given that the metadata is already known by TensorFlow, why aren't there easier ways to get from a TensorFlow dataset to a Beam dataset? I understand that a TensorFlow dataset will generally repeat itself forever, but is there a way to transform a TensorFlow dataset into a dataset that can be processed by Beam? Or is the only solution to have the Beam dataset created by pointing at the original data on disk? Does this have to do with the unboundedness of a TensorFlow dataset, or is there some other reason that a TensorFlow dataset cannot be analyzed/transformed through appropriate transformations so that it is abstracted from the developer? All of the examples I have seen started with dictionaries, and there is another Stack Overflow question here that talks about this to some extent, but it doesn't fully explain why this is the way it is.
This seems to be a question for the TensorFlow team rather than Apache Beam, but the TFX transforms you referred to are built on top of Beam transforms (so Beam is used as a utility). You are not directly working with Beam constructs (PCollections, PTransforms, etc.). If you want to build a Beam pipeline using the intermediate data, you might need to start with TFRecord files and use Beam's tfrecordio source, as the other post mentioned.
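For reference, here is a minimal sketch of that suggested starting point: reading TFRecord files with Beam's tfrecordio source. The file pattern is a placeholder, and the parsing step only illustrates turning serialized protos back into tf.train.Example messages:

import apache_beam as beam
import tensorflow as tf

def parse_example(serialized):
    # Decode the serialized bytes into a tf.train.Example proto message.
    return tf.train.Example.FromString(serialized)

with beam.Pipeline() as pipeline:
    examples = (
        pipeline
        | "Read" >> beam.io.ReadFromTFRecord("gs://my-bucket/train-*.tfrecord")
        | "Parse" >> beam.Map(parse_example))
    # ... downstream PTransforms (filtering, statistics, re-writing) go here.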

Solutions for big data preprocessing for feeding deep neural network models built with TensorFlow 2.0?

Currently I am using Python, NumPy, pandas and scikit-learn to do data preprocessing (LabelEncoder, MinMaxScaler, fillna, etc.), and then feeding the processed data to DNN models built with TensorFlow 2.0. This input pipeline meets my needs when the data is small enough to fit in a PC's RAM.
Now I have some large datasets, more than 10 GB, and some are larger still. I also plan to deploy the models in a production environment, which means there will be new data coming in every day. For DNN model training there is the distribution strategy of TensorFlow 2.0, but for data preprocessing I obviously cannot use pandas or scikit-learn on the large datasets with one PC. It seems to me I need to use a for-loop where I repeatedly fetch a small part of the data and use it for training?
I am wondering what do people typically use in either experiment or production environment for big data preprocessing?
Should I use Spark(Scala) / PySpark and Tensorflow input pipeline?
Yeah, the current way you are doing preprocessing will not scale well.
PySpark is one right way to run your preprocessing layer. Set up a simple standalone Spark cluster with a few workers and then run your preprocessing (LabelEncoder/OneHotEncoder/fillna/...). This solution should scale well, and it abstracts away the distributed computation layer.
PS: PySpark might not be the only way forward, but it is a good one for this use case.
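To make that concrete, here is a minimal sketch of such a standalone-cluster preprocessing job in PySpark; the file paths, column names ("country", "age") and the particular transformers are placeholders rather than a definitive pipeline:

from pyspark.sql import SparkSession
from pyspark.ml.feature import MinMaxScaler, StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("preprocess").getOrCreate()
df = spark.read.csv("data/*.csv", header=True, inferSchema=True)

# fillna, label encoding and scaling, all executed across the cluster
df = df.fillna({"age": 0, "country": "unknown"})
df = StringIndexer(inputCol="country", outputCol="country_idx").fit(df).transform(df)
df = VectorAssembler(inputCols=["age"], outputCol="age_vec").transform(df)
df = MinMaxScaler(inputCol="age_vec", outputCol="age_scaled").fit(df).transform(df)

# Write the result out (e.g. Parquet) for the tf.data input pipeline to consume.
df.write.mode("overwrite").parquet("preprocessed/")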

Does TensorFlow automatically use multiple CPUs?

I have programmed some code doing an inference with Tensorflow's C API (CPU only). It is running on a cluster node, where I have access to 24 CPUs and 1 GPU. I do not make use of the GPU as I will need to do the task CPU-only later on.
Somehow, every time I call the TensorFlow code from the other program (OpenFOAM), TensorFlow seems to run parallelized across all CPUs. However, I have not done anything to cause this behavior. Now I would like to know whether TensorFlow does this parallelization by default?
Greets and thanks in advance!
I am not sure how you are using TensorFlow, but a typical TensorFlow training job has an input pipeline that can be thought of as an ETL process. The following are the main activities involved:
Extract: Read data from persistent storage
Transform: Use CPU cores to parse and perform preprocessing operations on the data such as image decompression, data augmentation transformations (such as random crop, flips, and color distortions), shuffling, and batching.
Load: Load the transformed data onto the accelerator device(s) (for example, GPU(s) or TPU(s)) that execute the machine learning model.
CPUs are generally used during the data transformation. During the transformation, the data input elements are preprocessed. To improve the performance of the pre-processing, it is parallelized across multiple CPU cores by default.
TensorFlow provides the tf.data API, which offers the tf.data.Dataset.map transformation. To control the parallelism, map provides the num_parallel_calls argument.
Read more on this from here:
https://www.tensorflow.org/guide/performance/datasets
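As a minimal illustration of that knob (the file pattern and the decode function below are placeholders; older TensorFlow versions use tf.data.experimental.AUTOTUNE), the map step can be parallelized across CPU cores and overlapped with training:

import tensorflow as tf

def decode_image(path):
    # Read, decode and resize one image; this runs on CPU worker threads.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [224, 224])

files = tf.data.Dataset.list_files("images/*.jpg")
dataset = (files
           .map(decode_image, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # overlap preprocessing with model execution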

How can I evaluate the performance of my TensorFlow model on specific slices of a large dataset?

How can I evaluate the performance of my TensorFlow model on specific slices (segments) of a large evaluation dataset?
Use TensorFlow Model Analysis (TFMA) which is an open-source library that combines TensorFlow and Apache Beam to compute and visualize evaluation metrics. It is designed for this use case, and allows you to evaluate your models on large amounts of data in a distributed fashion, using the same metrics defined in your TensorFlow trainer. These metrics can also be computed over different slices of data, and the results can be visualized in Jupyter Notebooks. TFMA uses Apache Beam to do a full pass over your specified evaluation dataset. This not only allows more accurate calculation of metrics, but also scales up to massive evaluation datasets, since Beam pipelines can be run using distributed processing back-ends.
See https://github.com/tensorflow/model-analysis for more information.
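As a rough illustration only (the model path, data location and slicing column are placeholders, and the exact API surface varies between TFMA versions), a sliced evaluation can look roughly like this:

import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],
    slicing_specs=[
        tfma.SlicingSpec(),                          # overall metrics
        tfma.SlicingSpec(feature_keys=["country"]),  # metrics per slice of "country"
    ])

eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path="serving_model_dir", eval_config=eval_config),
    eval_config=eval_config,
    data_location="eval-*.tfrecord",
    output_path="tfma_output")

The underlying Beam pipeline can be pointed at a distributed runner when the evaluation data is too large for a single machine, which is what the answer above means by distributed processing back-ends.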

Dataproc, Dataprep and Tensorflow

I'm trying to create ML models dealing with big datasets. My question is more related to the preprocessing of these big datasets. In this sense, I'd like to know what are the differences between doing the preprocessing with Dataprep, Dataproc or Tensorflow.
Any help would be appreciated.
Those are 3 different things, you can't really compare them.
Dataprep - data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis
In other words, if you have large training data and you want to clean it up, visualize it, etc., Google Dataprep enables you to do that easily.
Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Within the context of your question, after you clean up your data and it is ready to feed into your ML algorithm, you can use Cloud Dataproc to distribute it across multiple nodes and process it much faster. In some machine learning algorithms the disk read speed might be a bottleneck, so this could greatly improve your algorithm's running time.
Finally, TensorFlow:
TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
So after your data is ready to process, you can use TensorFlow to implement machine learning algorithms. TensorFlow is a Python library, so it is relatively easy to pick up. TensorFlow also enables you to run your algorithms on GPU instead of CPU and (recently) also on Google Cloud TPUs (hardware made specifically for machine learning, with even better performance than GPUs).
In the context of preprocessing for machine learning, I would like to take some time to answer this question in detail. So, please bear with me!
Google provides four different processing products. Since preprocessing has different aspects and covers many different ML prerequisites, each of these platforms is more suitable for a particular preprocessing domain. The products are as follows:
Google ML Engine / Cloud AI: This product is based on TensorFlow. You can run your machine learning code written in TensorFlow on ML Engine. For specific types of data such as image, text or sequential data, the tf.keras.preprocessing or tf.contrib.learn.preprocessing libraries are available to quickly produce the appropriate input/tensor format for TensorFlow.
You may also need to transform your data via tf.Transform in a preprocessing step. tf.Transform, a library for TensorFlow, allows users to define preprocessing pipelines as part of a TensorFlow graph. tf.Transform ensures that no skew can arise during preprocessing.
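This connects back to the mean question in the first post: tf.Transform analyzers such as tft.mean make a full pass over the dataset (via Apache Beam) before the per-row transform is applied. A minimal sketch of a preprocessing_fn, assuming a single numeric feature named "x":

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # tft.mean and tft.scale_to_0_1 are analyzers: they compute dataset-wide
    # statistics in a Beam pass, then embed the results as constants in the graph.
    x = inputs["x"]
    return {
        "x_centered": x - tft.mean(x),
        "x_scaled": tft.scale_to_0_1(x),
    }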
Cloud DataPrep: Preprocessing is sometimes defined as data cleaning, data cleansing, data prepping and data alteration. For these purposes, Cloud DataPrep is the best option. For instance, if you want to get rid of null values or some ASCII characters which may cause errors in your ML model, you can use Cloud DataPrep.
Cloud DataFlow, Cloud Dataproc: Feature extraction, feature selection, scaling and dimension reduction can also be considered part of ML preprocessing. Since Cloud Dataproc supports Spark, one can use Spark libraries for fast, distributed preprocessing of the ML models' input; Apache Spark MLlib can also be applied to many ML preprocessing/processing tasks. Note that since Cloud DataFlow supports Apache Beam, it is more oriented towards stream processing, while Cloud Dataproc is more Hadoop-based and better suited to batch preprocessing. For more details, please refer to the Using Apache Spark with TensorFlow document.