TensorFlow - Is a Tensor limited to a single node like a NumPy array?

Question
Can a Tensor in TensorFlow be distributed among multiple nodes? Or is it like a NumPy array that can only exist in the memory of a single machine?
If a Tensor can be distributed across multiple nodes, how do you configure TensorFlow to use multiple nodes, e.g. how do you specify the IP addresses of the nodes to TensorFlow?
I tried to find an answer, but so far have only found material about distributed training.
Background
A matrix on Spark is distributed among Spark worker nodes, so the size of a matrix is not restricted to the memory of a single machine.
I would like to know if we can use a TensorFlow Tensor to run math calculations on large data, e.g. a large matrix that cannot fit into the memory of a single machine.

Related

How does the TensorFlow dataset handle large data that cannot fit into the memory in a server?

Question
How does the TensorFlow dataset handle large data that cannot fit into the memory in a server?
A Spark RDD can handle large data with multiple nodes. For the question in Tensorflow Transform: How to find the mean of a variable over the entire dataset, the answer is to use TensorFlow Transform, which uses Apache Beam and requires a distributed computation cluster such as Spark.
If we have a large dataset, say a CSV file that is 50GB, then how do you calculate the mean or other similar statistics?
Hence I suppose TensorFlow requires a multi-node cluster, but it is not clear if TensorFlow has its own cluster implementation or re-uses existing technologies. Since TensorFlow pre-processing, e.g. getting the mean or std of a column, requires Apache Beam, I guess it is Apache Beam based too, but I am not sure.
A google paper Large-Scale Machine Learning on Heterogeneous Distributed Systems shows multiple workers.
This article TensorFlow: A new paradigm for large scale ML in distributed systems tells the system components.
In terms of system components, TensorFlow consists of Master, Worker and Client for distributed coordination and execution.
This GitHub repository, TensorFlow2-tutorial/05-distributed-training/, shows TF_CONFIG specifying the node IP/port:
TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python worker.py
The TensorFlow example on GitHub, Distributed TensorFlow, has the section below, but I do not see node setup details.
Create a tf.train.ClusterSpec to describe the cluster
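For illustration, a ClusterSpec describing the two workers from the TF_CONFIG example above might look like this (the parameter-server entry is a made-up example, not from the original post):
import tensorflow as tf

# Each entry maps a job name to a list of "host:port" addresses.
cluster = tf.train.ClusterSpec({
    "worker": ["10.1.10.58:12345", "10.1.10.250:12345"],
    "ps": ["10.1.10.58:2222"],  # assumed parameter server, for illustration only
})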
Hence apparently there is a way to set up a TensorFlow cluster, which I suppose handles loading a large dataset into a TF dataset.
However, Install TensorFlow 2 only shows:
# Current stable release for CPU and GPU
pip install tensorflow
Please point to step-by-step documentation on how to set up a TensorFlow multi-node cluster, and to resources that explain how large data loading is handled in TF (similar to the Spark RDD/DataFrame explanation and internals).
You need to use generator functions that pull in chunked data. Each chunk that you want to send is returned through a yield statement. TensorFlow allows one to create a Dataset whose elements are Tensors yielded by a generator function. This dataset is then consumed by .fit (or a training loop) as follows:
import itertools
import tensorflow as tf

# Generator that yields progressively longer chunks, one at a time.
def gen():
    for i in itertools.count(1):
        yield (i, [1] * i)

dataset = tf.data.Dataset.from_generator(
    gen,
    (tf.int64, tf.int64),
    (tf.TensorShape([]), tf.TensorShape([None])))

list(dataset.take(3).as_numpy_iterator())  # peek at the first three elements
train(dataset, max_steps=100)  # placeholder for your training call, e.g. model.fit(dataset)
This approach has several benefits:
it limits RAM usage during training (to the size of the chunks)
it allows one to stream data asynchronously (e.g. from a large file, a remote database, a web-scraping bot, etc.); one concrete case, streaming a large CSV in chunks, is sketched below
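As a hedged sketch for the 50GB-CSV case mentioned in the question (assuming pandas is available, the file contains only numeric columns, and big_file.csv is a placeholder path), one can stream the file in chunks through a generator and even compute a running mean without ever loading it whole:
import pandas as pd
import tensorflow as tf

CSV_PATH = "big_file.csv"  # placeholder path, not from the original post

def csv_chunks(chunk_rows=100_000):
    # pandas reads the file lazily, chunk_rows rows at a time,
    # so only one chunk is ever held in memory.
    for chunk in pd.read_csv(CSV_PATH, chunksize=chunk_rows):
        yield chunk.to_numpy(dtype="float32")

dataset = tf.data.Dataset.from_generator(
    csv_chunks,
    output_signature=tf.TensorSpec(shape=(None, None), dtype=tf.float32))

# Illustrative only: running mean over every numeric value in the file.
total = dataset.reduce(
    (tf.zeros([], tf.float32), tf.zeros([], tf.float32)),
    lambda acc, chunk: (acc[0] + tf.reduce_sum(chunk),
                        acc[1] + tf.cast(tf.size(chunk), tf.float32)))
mean = total[0] / total[1]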

Does TensorFlow automatically use multiple CPUs?

I have written some code that does inference with TensorFlow's C API (CPU only). It runs on a cluster node where I have access to 24 CPUs and 1 GPU. I do not make use of the GPU, as I will need to run the task CPU-only later on.
Somehow, every time I call the TensorFlow code from the other program (OpenFOAM), TensorFlow seems to run parallelized across all CPUs, although I have not done anything to cause this behavior. I would like to know whether TensorFlow does this parallelization by default.
Greets and thanks in advance!
I am not sure how you are using TensorFlow, but a typical TensorFlow training job has an input pipeline that can be thought of as an ETL process. The following are the main activities involved:
Extract: Read data from persistent storage
Transform: Use CPU cores to parse and perform preprocessing operations on the data such as image decompression, data augmentation transformations (such as random crop, flips, and color distortions), shuffling, and batching.
Load: Load the transformed data onto the accelerator device(s) (for example, GPU(s) or TPU(s)) that execute the machine learning model.
CPUs are generally used during the data transformation. During the transformation, the data input elements are preprocessed. To improve the performance of the pre-processing, it is parallelized across multiple CPU cores by default.
TensorFlow provides the tf.data API, which offers the tf.data.Dataset.map transformation. To control the parallelism, map provides the num_parallel_calls argument.
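A minimal sketch of that (the preprocess function here is a placeholder for whatever per-element work you do, such as decoding or augmentation):
import tensorflow as tf

def preprocess(x):
    # Placeholder for per-element preprocessing work.
    return x * 2

dataset = tf.data.Dataset.range(1000)
# AUTOTUNE lets tf.data pick how many CPU cores to use for the map calls.
dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).batch(32)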
Read more on this from here:
https://www.tensorflow.org/guide/performance/datasets

How to split TensorFlow graph (model) onto multiple GPUs to avoid OOM?

So I have this very large and deep model I implemented with TensorFlow r1.2, running on an NVIDIA Tesla K40 with 12 GB of memory. The model consists of several RNNs, a bunch of weight and embedding matrices as well as bias vectors. When I launched the training program, it first took about 2-3 hours to build the model, and then crashed due to OOM issues. I tried reducing the batch size to even 1 sample per batch, but still ran into the same issue.
If I google tensorflow multiple gpu, the examples I find mainly focus on utilizing multiple GPUs with a data-parallel design, which means having each GPU run the same graph and having the CPU calculate the total gradient, which is then propagated back to each parameter.
I know one possible solution might be running the model on a GPU with more memory. But I wonder if there is a way to split my graph (model) sequentially into different parts and assign them to different GPUs?
The official guide on using GPUs shows an example of this in "Using multiple GPUs". You just need to create the operations within different tf.device contexts; the nodes will still be added to the same graph, but they will be annotated with device directives indicating where they should be run. For example:
with tf.device("/gpu:0"):
net0 = make_subnet0()
with tf.device("/gpu:1"):
net1 = make_subnet1()
result = combine_subnets(net0, net1)
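Tensors that cross a device boundary here (net0 and net1 feeding combine_subnets) are copied between devices by TensorFlow automatically, so you only have to decide where each part of the graph should live.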

Deep networks on Cloud ML

I am trying to train a very deep model on Cloud ML; however, I am having serious memory issues that I am not managing to get around. The model is a very deep convolutional neural network for auto-tagging music.
The model can be found in the image below. A batch of 20 with a tensor of 12x38832x1 is fed into the network.
The music was originally 465894x1 samples, which was then split into 12 windows; hence 12x38832x1. When using the map_fn function, each iteration processes a separate 38832x1 window (conv1d).
Processing one window at a time yields better results than processing the whole track with one CNN. The data was split prior to storing it in TFRecords in order to minimise the processing needed during training. It is loaded in a queue with a maximum queue size of 200 samples (i.e. 10 batches).
Once dequeued, it is transposed to have the window dimension (12) first, so it can be used in the map_fn function for processing the windows. It is not transposed prior to being queued, as the first dimension needs to match the batch dimension of the output, which is [20, 50], where 20 is the batch size and 50 is the number of tags.
For each window, the data is processed and the results of each map_fn are pooled together using a smaller network. The processing of the windows is done by a very deep neural network, which is where I am having trouble: every configuration I try gives me out-of-memory errors. (A sketch of the window processing is given below.)
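For illustration only, here is a minimal sketch of the transpose-then-map_fn pattern described above; the shapes come from the post, but window_cnn and the mean-pooling at the end are made-up placeholders for the real per-window network and the smaller pooling network:
import tensorflow as tf

batch = tf.zeros([20, 12, 38832, 1])               # [batch, windows, samples, channels]
windows_first = tf.transpose(batch, [1, 0, 2, 3])  # [12, 20, 38832, 1]: windows lead

def window_cnn(window):
    # Placeholder for the deep per-window network; window has shape [20, 38832, 1].
    return tf.zeros([20, 50])

per_window = tf.map_fn(window_cnn, windows_first)  # [12, 20, 50]
pooled = tf.reduce_mean(per_window, axis=0)        # [20, 50]: simple stand-in for the pooling network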
As a model, I am using one similar to the Census TensorFlow model.
First and foremost, I am not sure if this is the best option, since for evaluation a separate graph is built rather than sharing variables. This would require double the amount of parameters.
Secondly, as a cluster setup, I have been using one complex_l master, 3 complex_l workers and 3 large_model parameter servers. I do not know if I am underestimating the amount of memory needed here.
My model previously worked as a much smaller network. However, increasing its size started giving me serious out-of-memory errors.
My questions are:
The memory requirement is big, but I am sure it can be processed on Cloud ML. Am I underestimating the amount of memory needed? What are your suggestions about the cluster for such a network?
When using a tf.train.Server in the dispatch function, do you need to pass the cluster_spec so it is used in the replica_device_setter? Or does it allocate on its own? When not using it, and enabling log placement in tf.ConfigProto, all the variables seem to be on the master worker. In the Census example, in task.py, this is not passed on; can I assume this is correct?
How does one calculate how much memory is needed for a model (a rough estimate to select the cluster)?
Is there any other TensorFlow core tutorial on how to set up such big jobs (other than Census)?
When training a big model with distributed between-graph replication, does the whole model need to fit on the worker, or does the worker only run ops and then transmit the results to the PS? Does that mean that the workers can have low memory, enough just for single ops?
PS: With smaller models the network trained successfully. I am trying to deepen the network for better ROC.
Questions coming up from on-going troubleshooting:
When using the replica_device_setter with the cluster parameter, I noticed that the master has very little memory and CPU usage, and checking the log placement, there are very few ops on the master. I checked the TF_CONFIG that is loaded, and it says the following for the cluster field:
u'cluster': {u'ps': [u'ps-4da746af4e-0:2222'], u'worker': [u'worker-4da746af4e-0:2222'], u'master': [u'master-4da746af4e-0:2222']}
On the other hand, the tf.train.ClusterSpec documentation only shows workers. Does that mean that the master is not considered a worker? What happens in such a case?
Is the error a memory issue or something else? An EOF error?

Will adding GPU cards automatically scale TensorFlow usage?

Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with TensorFlow. Now suppose I want to train with a larger sample size 2N and/or a deeper network 2L, and I get an out-of-memory error.
Will plugging in additional GPU cards automatically solve this problem (assuming the total memory of all GPU cards is sufficient to hold the batch and its gradients)? Or is it impossible with pure TensorFlow?
I've read that there are Bitcoin or Ethereum miners that build mining farms with multiple GPU cards and that these farms mine faster.
Will a mining farm also perform better for deep learning?
Will plugging additional GPU cards automatically solve this problem?
No. You have to change your TensorFlow code to explicitly compute different operations on different devices (e.g. compute the gradients over a separate batch slice on every GPU, then send the computed gradients to a coordinator that accumulates them and updates the model parameters by averaging these gradients).
Also, TensorFlow is flexible enough to let you specify different operations for every device (or for different remote nodes, which works the same way).
You could do data augmentation on a single computational node and let the others process the data without applying this function. You can execute a certain operation on one device, or on a set of devices, only.
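As a minimal sketch of the gradient-averaging pattern described above (assuming TensorFlow 2 and the tf.distribute API, which is one way to implement it rather than the only one; the tiny model is a placeholder):
import tensorflow as tf

# MirroredStrategy splits each batch across all visible GPUs, computes the
# gradients on each device, reduces (averages) them, and applies the update
# to a single shared copy of the variables.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="relu"),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset) then runs data-parallel training across the GPUs.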
Is it impossible with pure TensorFlow?
It is possible with TensorFlow, but you have to change the code you wrote for a single training/inference device.
I've read that there are Bitcoin or Ethereum miners that build mining farms with multiple GPU cards and that these farms mine faster.
Will a mining farm also perform better for deep learning?
Blockchains that use POW (Proof of Work) require solving a difficult problem with a brute-force-like approach (miners compute lots of hashes with different inputs until they find a valid hash).
That means that if your single GPU can compute 1000 hashes/s, 2 identical GPUs can compute 2 x 1000 hashes/s.
The computations the GPUs are doing are completely uncorrelated: the data produced by GPU:0 is not used by GPU:1, and there are no synchronization points between the computations. This means that the task one GPU does can be executed in parallel by another GPU (obviously with different inputs per GPU, so the devices compute hashes to solve different problems given by the network).
Back to TensorFlow: once you have modified your code to work with multiple GPUs, you can train your network faster (in short, because you are using bigger batches).