Java API for feeding TFRecords bytes to a TensorFlow model - tensorflow

Does anyone know of a high-level API in Java, if one exists, for directly reading TFRecords and feeding them to a TensorFlow SavedModel? The Python API allows both example.proto (TFRecords) and tensors to be fed to a TF model for inference. The only API I have seen in Java is for creating raw tensors. Is there a way, similar to the Python SDK, to feed TFRecords (example.proto) directly to a saved model bundle in Java as well?

I just came across the same scenario and used TFRecordIO from the Apache Beam Java SDK to read the records. For example:
pipeline
    .apply(TFRecordIO.read().from(dataPath))
    .apply(ParDo.of(new ModelEvaluationFn()));
Inside ModelEvaluationFn I do the scoring using the SavedModel. With the Apache Beam Java SDK, you can run locally, on GCP Dataflow, Spark, Flink, etc. If you are using Spark directly, there is also spark-tensorflow-connector.
Another thing I ran into was how to parse the TFRecords in Java, because I needed to get the label value and group by some column values to get breakdown scores. The org.tensorflow/proto package can help you do that. Here are examples: example1, example2. Essentially, it's Example.parseFrom(byte[]).
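To make the byte[] passed to Example.parseFrom more concrete, here is a minimal Python sketch of the TFRecord framing: each record is an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The helper names (iter_tfrecord_payloads, frame_record) are made up for illustration, and the sketch skips CRC verification, so treat it as a way to understand the layout rather than a replacement for TFRecordIO.

```python
import io
import struct


def iter_tfrecord_payloads(stream):
    """Yield the raw payload bytes of each record in a TFRecord byte stream.

    The two CRC fields are read but NOT verified in this sketch.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte CRC of the length
        if not header:
            return  # clean end of stream
        if len(header) < 12:
            raise ValueError("truncated TFRecord header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # 4-byte CRC of the payload
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated TFRecord payload")
        yield payload


def frame_record(payload):
    """Frame a payload for the demo only; the CRC fields are zero placeholders."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


# Two fake records in memory; a real file would be opened with open(path, "rb").
demo = io.BytesIO(frame_record(b"first") + frame_record(b"second"))
payloads = list(iter_tfrecord_payloads(demo))
```

Each yielded payload is the serialized Example message, i.e. the bytes that would be handed to Example.parseFrom in Java. Because the demo writes zero CRCs, readers that verify checksums would not accept these demo bytes.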

Related

Transforming tensorflow datasets to beam datasets

There are a variety of ways to get a dataset you can train on in tensorflow. One of the things tensorflow transform does is provide the ability to do preprocessing via AnalyzeAndTransformDataset and TransformDataset. Surprisingly, the dataset being referred to is not a tensorflow dataset, but rather a dataset in the apache beam sense. That is understandable to some degree, given that the function is tft_beam.AnalyzeAndTransformDataset.
The heart of my question is this: given that the metadata is already known by tensorflow, why aren't there easier ways to get from a tensorflow dataset to a beam dataset? I understand that a tensorflow dataset will generally repeat itself forever, but is there a way to transform a tensorflow dataset to a dataset that can be processed by beam? Or is the only solution to have the beam dataset created by pointing to the original data on disk? Does this have to do with the unboundedness of a tensorflow dataset, or is there some other reason that a tensorflow dataset cannot be analyzed/transformed through appropriate transformations so that it's abstracted from the developer? All of the examples I have seen started with dictionaries, and there is another stack overflow question here that talks about this to some extent, but doesn't fully explain why this is the way it is.
This seems to be a question for the TensorFlow team rather than Apache Beam, but the TFX transforms you referred to are built on top of Beam transforms (so Beam is used as a utility). You are not directly working with Beam constructs (PCollections, PTransforms, etc.). If you want to build a Beam pipeline using the intermediate data, you might need to start with TFRecord files and use Beam's tfrecordio source, as the other post mentioned.

How to access Spark DataFrame data in GPU from ML Libraries such as PyTorch or Tensorflow

Currently I am studying the usage of Apache Spark 3.0 with Rapids GPU Acceleration. In the official spark-rapids docs I came across this page which states:
There are cases where you may want to get access to the raw data on the GPU, preferably without copying it. One use case for this is exporting the data to an ML framework after doing feature extraction.
To me this sounds as if one could make data that is already available on the GPU from some upstream Spark ETL process directly available to a framework such as Tensorflow or PyTorch. If this is the case how can I access the data from within any of these frameworks? If I am misunderstanding something here, what is the quote exactly referring to?
The link you reference really only allows you to get access to the data still sitting on the GPU, but using that data in another framework, like Tensorflow or PyTorch, is not that simple.
TL;DR: Unless you have a library explicitly set up to work with the RAPIDS accelerator, you probably want to run your ETL with RAPIDS, then save it, and launch a new job to train your models using that data.
There are still a number of issues that you would need to solve. We have worked on these in the case of XGBoost, but it has not been something that we have tried to tackle for Tensorflow or PyTorch yet.
The big issues are:
Getting the data to the correct process. Even if the data is on the GPU, because of security, it is tied to a given user process. PyTorch and Tensorflow generally run as python processes and not in the same JVM that Spark is running in. This means that the data has to be sent to the other process. There are several ways to do this, but it is non-trivial to do it as a zero-copy operation.
The format of the data is not what Tensorflow or PyTorch want. The data for RAPIDS is in an Arrow-compatible format. Tensorflow and PyTorch have APIs for importing data in standard formats from the CPU, but it might take a bit of work to get the data into a format that the frameworks want and to find an API that lets you pull it in directly from the GPU.
Sharing GPU resources. Spark only recently added support for scheduling GPUs. Prior to that, people would just launch a single Spark task per executor and a single python process, so that the python process would own the entire GPU when doing training or inference. With the RAPIDS accelerator the GPU is not free any more and you need a way to share the resources. RMM provides some of this if both libraries are updated to use it and they are in the same process, but PyTorch and Tensorflow typically run in python processes, so figuring out how to share the GPU is hard.

How does the TensorFlow dataset handle large data that cannot fit into the memory in a server?

Question
How does the TensorFlow dataset handle large data that cannot fit into the memory in a server?
Spark RDDs can handle large data across multiple nodes. For the question in Tensorflow Transform: How to find the mean of a variable over the entire dataset, the answer is to use Tensorflow Transform, which uses Apache Beam, which in turn requires a distributed computation cluster such as Spark.
If we have a large dataset, say a CSV file that is 50GB, how do we calculate the mean or other similar statistics?
Hence I suppose TensorFlow requires a multi-node cluster, but it is not clear whether TensorFlow has its own cluster implementation or reuses existing technologies. Since TensorFlow pre-processing, e.g. getting the mean or std of a column, requires Apache Beam, I guess it is Apache Beam based too, but I am not sure.
A Google paper, Large-Scale Machine Learning on Heterogeneous Distributed Systems, shows multiple workers.
The article TensorFlow: A new paradigm for large scale ML in distributed systems describes the system components:
In terms of system components, TensorFlow consists of Master, Worker and Client for distributed coordination and execution.
The GitHub repository TensorFlow2-tutorial/05-distributed-training/ shows TF_CONFIG specifying the node IP/port:
TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python worker.py
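For readers unfamiliar with the layout of that value, here is a small sketch that parses the same JSON and works out which cluster entry belongs to the current task. The function name describe_task is made up for illustration; in a real worker the string would normally come from the TF_CONFIG environment variable rather than a literal.

```python
import json

# Same JSON as the TF_CONFIG value above; a worker would usually read it
# from os.environ["TF_CONFIG"] instead of a literal.
tf_config_json = (
    '{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, '
    '"task": {"index": 0, "type": "worker"}}'
)


def describe_task(raw):
    """Return the role, index and address of the task described by a TF_CONFIG string."""
    config = json.loads(raw)
    task = config["task"]
    peers = config["cluster"][task["type"]]  # all addresses sharing this role
    return {
        "role": task["type"],
        "index": task["index"],
        "address": peers[task["index"]],
        "num_peers": len(peers),
    }


info = describe_task(tf_config_json)
```

The "cluster" entry is identical on every node, while "task" differs per node: the second machine would use index 1.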
The TensorFlow example on GitHub, Distributed TensorFlow, has the section below, but I do not see any node setup detail.
Create a tf.train.ClusterSpec to describe the cluster
Hence there is apparently a way to set up a TensorFlow cluster, which I suppose handles loading a large dataset into a TF dataset.
However, Install TensorFlow 2 only shows:
# Current stable release for CPU and GPU
pip install tensorflow
Please point to step-by-step documentation on how to set up a TensorFlow multi-node cluster, and to resources that explain in detail how large data loading is handled in TF (similar to the Spark RDD/DataFrame explanation and internals).
You need to use generator functions that pull in chunked data. Each chunk that you want to send goes through a yield operation. Tensorflow allows one to create a Dataset that returns Tensors from the values yielded by a generator function. This dataset is then consumed by the .fit methods as follows:
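import itertools
import tensorflow as tf

def gen():
    # Endless generator: yields (i, [1, ..., 1]) with i ones, one element at a time.
    for i in itertools.count(1):
        yield (i, [1] * i)

# Wrap the generator in a Dataset; each element is a scalar and a variable-length vector.
dataset = tf.data.Dataset.from_generator(
    gen,
    (tf.int64, tf.int64),
    (tf.TensorShape([]), tf.TensorShape([None])))

# Peek at the first three elements.
list(dataset.take(3).as_numpy_iterator())
# train is the caller's own training function; it is not defined in this snippet.
train(dataset, max_steps=100)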
This approach has several benefits:
it limits the usage of RAM when training (to the size of chunks)
it allows one to stream asynchronously (like from a large file, remote database, web scraping bot, etc.)
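As an illustration of the same chunked idea applied to the 50GB CSV question, here is a minimal sketch that computes the mean and standard deviation of one column in a single pass. It uses plain Python and Welford's online algorithm rather than any TensorFlow API, and the function names and sample data are made up for illustration.

```python
import csv
import io
import math


def iter_column(lines, column):
    """Yield one float per CSV row, so only one row is held in memory at a time."""
    for row in csv.DictReader(lines):
        yield float(row[column])


def streaming_mean_std(values):
    """Welford's online algorithm: one pass over the data, constant memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        raise ValueError("no values")
    return mean, math.sqrt(m2 / count)  # population standard deviation


# Small in-memory stand-in for a large file; open("big.csv", newline="")
# could be passed to iter_column in the same way.
sample = io.StringIO("price,qty\n1.0,3\n2.0,4\n3.0,5\n4.0,6\n")
mean, std = streaming_mean_std(iter_column(sample, "price"))
```

Memory use stays flat regardless of file size because neither function keeps earlier rows around. This runs on a single machine; it does not address distributing the work across nodes.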

Does Tensorflow server serve/support non-tensorflow based libraries like scikit-learn?

We are creating a platform for putting AI use cases into production. TFX is the first choice, but what if we want to use non-TensorFlow libraries such as scikit-learn and include a Python script to create models? Will the output of such a model be served by TensorFlow Serving? How can I make sure that both TensorFlow-based models and models from non-TensorFlow libraries can run in one system design? Please suggest.
Below is the procedure to deploy and serve a scikit-learn model on Google Cloud Platform.
The first step is to save/export the scikit-learn model using the code below:
# sklearn.externals.joblib has been removed from scikit-learn; import the standalone joblib package instead.
import joblib
joblib.dump(clf, 'model.joblib')
Next step is to upload the model.joblib file to Google Cloud Storage.
After that, we need to create our model and version, specifying that we are loading a scikit-learn model, and select the runtime version of Cloud ML Engine as well as the version of Python that we used to export this model.
Next, we need to present the data to Cloud ML Engine as a simple array, encoded as a JSON file, as shown below. We can use the json library as well.
print(list(X_test.iloc[10:11].values))
Next, we need to run the command below to perform inference:
gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances $INPUT_FILE
For more information, please refer to this link.
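The input-file step above can be sketched with the json library. This assumes the prediction service expects newline-delimited JSON with one instance per line; the feature values and the write_json_instances helper are hypothetical, and in the steps above the rows would come from X_test.

```python
import json
import os
import tempfile


def write_json_instances(rows, path):
    """Write one JSON array per line, one line per instance."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            # float() turns numpy scalars into plain floats that json can encode.
            handle.write(json.dumps([float(v) for v in row]) + "\n")


# Hypothetical feature rows standing in for X_test.iloc[10:12].values.
rows = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]
input_file = os.path.join(tempfile.mkdtemp(), "input.json")
write_json_instances(rows, input_file)

# Read the file back to show what was written.
with open(input_file, encoding="utf-8") as handle:
    written = handle.read().splitlines()
```

The resulting path is what would be passed as $INPUT_FILE. Using json.dumps rather than print avoids numpy's array(...) repr ending up in the file.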

How to parse tensorflow model with C++ API

I want to parse a pre-trained TensorFlow model. For example, given a model, I want to get the full list of operation nodes, including their names and dependencies.
I first searched the Java API, but apparently few APIs are supported by the Java interface. So I looked for a C++ API, but failed to find the right one.
The reason I don't use Python is that I need to do this on Android devices.
The TensorFlow graph is stored as a GraphDef protocol buffer. You should be able to build a Java version of this and use it to inspect the stored graph. This will have the list of operations and their dependencies, but will not have the values of the weights.
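As a rough sketch of what "operations and their dependencies" looks like once the GraphDef is parsed: each node carries a name, an op type, and a list of input references, where a leading "^" marks a control dependency and a ":N" suffix selects an output of the producing node. The Node tuple below is a hypothetical stand-in mirroring those NodeDef fields, and the graph is invented, so no TensorFlow install is needed to run it.

```python
from collections import namedtuple

# Hypothetical stand-in for a GraphDef node, mirroring the NodeDef fields
# name, op and input.
Node = namedtuple("Node", ["name", "op", "input"])


def producer_name(input_ref):
    """Strip the control-dependency marker '^' and any ':<output index>' suffix."""
    return input_ref.lstrip("^").split(":")[0]


def dependency_map(nodes):
    """Map each node name to the names of the nodes it depends on."""
    return {node.name: [producer_name(ref) for ref in node.input] for node in nodes}


# Invented example graph.
nodes = [
    Node("x", "Placeholder", []),
    Node("w", "VariableV2", []),
    Node("init", "NoOp", []),
    Node("matmul", "MatMul", ["x", "w"]),
    Node("unstack", "Unpack", ["matmul"]),
    Node("out", "Identity", ["unstack:1", "^init"]),
]
deps = dependency_map(nodes)
```

The same loop would apply to the node list of a parsed GraphDef in Java or C++, since the generated protobuf classes expose the same three fields.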