Tensorflow distributed inference - tensorflow

I have a large model (GPT-J 6B) and two 16G GPUs (V100 with no NVLink). I would like to do inference (generation). GPT-J needs 24G memory, so I need to split the model across my two GPUs.
For me, simplicity is much more important than maximizing utilization/throughput.
What is the easiest way to distribute the network? Is it possible to put first half of the network on one GPU and the second half on the other? (pipeline-parallel essentially)

Related

Prediction with GPU is much slower than with CPU?

curiously I just found out that my CPU is much faster for predictions.
Doing inference with GPU is much slower then with CPU.
I have tf.keras (tf2) NN model with a simple dense layer:
input = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = X = tf.keras.layers.Dense(2)(input)
model = tf.keras.Model(input,X)
#also initiialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True )
model.layers[-1].set_weights(weights)
scores = model.predict_on_batch(data)
For 100 samples doing predictions I get:
2 s for GPU
0.07 s for CPU (!)
I am using a simple geforce mx150 with 2gb
I also tried the predict_on_batch(x) as someone suggested this as it is more faster than just predict. But here it is of same time.
Refer: Why does keras model predict slower after compile?
Has anyone an idea, what is going on there? What could be an issue possibly?
Using the GPU puts a lot of overhead to load data on the GPU memory (through the relatively slow PCI bus) and to get the results back.
In order for the GPU to be more efficient than the CPU, the model must to be very big, have plenty of data and use algorithms that can run fully inside the GPU, without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the quantity of memory and of cores inside your GPU, so you must do some tests, but the following rules apply:
Your NN must have at least >10k parameters, training data set must have at least 10k records. Otherwise your overhead will probably kill the performances of GPU
When you model.fit, use a large batch_size (pay attention, the default is only 32), possibly to contain your whole dataset, or at least a multiple of 1024. Do some test to find the optimum for you.
For some GPUs, it might help performing computations in float16 instead of float32. Follow this tutorial to see how to activate it.
If your GPU has specific Tensor Cores, in order to use efficiently its hardware, several data must be multiples of 8. In the preceding tutorial, see at the paragraph "Ensuring GPU Tensor Cores are used" what parameters must be changed and how. In general, it's a bad idea to use layers which contain a number of neurons not multiple of 8.
Some type of layers, namely RNNs, have an architecture which cannot be solved directly by the GPU. In this case, data must be moved constantly back and forth to CPU and the speed is lost. If a RNN is really needed, Tensorflow v2 has an implementation of the LSTM layer which is optimized for GPU, but some limitations on the parameters are present: see this thread and the documetation.
If you are training a Reinforcement Learning, activate an Experience Replay and use a memory buffer for the experience which is at least >10x your batch_size. This way, you will activate the NN training only when a big bunch of data is ready.
Deactivate as much verbosity as possible
If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.
GPU is good if you have compute-intensive tasks (large models) due to the overhead of copying your data and results between the host and GPU. In your case, the model is very small. It means it will take you longer to copy data than to predict. Even if the CPU is slower than the GPU, you don't have to copy the data, so it's ultimately faster.

Parallelization strategies for deep learning

What strategies and forms of parallelization are feasible and available for training and serving a neural network?:
inside a machine across cores (e.g. GPU / TPU / CPU)
across machines on a network or a rack
I'm also looking for evidence for how they may also be used in e.g. TensorFlow, PyTorch or MXNet.
Training
To my knowledge, when training large neural networks on large datasets, one could at least have:
Different cores or machines operate on different parts of the graph ("graph splitting"). E.g. backpropagation through the graph itself can be parallelized e.g. by having different layers hosted on different machines since (I think?) the autodiff graph is always a DAG.
Different cores or machines operate on different samples of data ("data splitting"). In SGD, the computation of gradients across batches or samples can also be parallelized (e.g. the gradients can be combined after computing them independently on different batches). I believe this is also called gradient accumulation (?).
When is each strategy better for what type of problem or neural network? Which modes are supported by modern libraries? and can one combine all four (2x2) strategies?
On top of that, I have read about:
Asynchronous training
Synchronous training
but I don't know what exactly that refers to, e.g. is it the computation of gradients on different data batches or the computation of gradients on different subgraphs? Or perhaps it refers to something else altogether?
Serving
If the network is huge, prediction / inference may also be slow, and the model may not fit on a single machine in memory at serving time. Are there any known multi-core and multi-node prediction solutions that work that can handle such models?
Training
In general, there are two strategies of parallelizing model training: data parallelism and model parallelism.
1. Data parallelism
This strategy splits training data into N partitions, each of which will be trained on different “devices” (different CPU cores, GPUs, or even machines). In contrast to training without data parallelism which produces one gradient per minibatch, we now have N gradients for each minibatch step. The next question is how we should combine these N gradients.
One way to do it is by averaging all the N gradients and then updating the model parameters once based on the average. This technique is called synchronous distributed SGD. By doing the average, we have a more accurate gradient, but with a cost of waiting all the devices to finish computing its own local gradient.
Another way is by not combining the gradients — each gradient will instead be used to update the model parameters independently. So, there will be N parameter updates for each minibatch step, in contrast to only one for the previous technique. This technique is called asynchronous distributed SGD. Because it doesn't have to wait other devices to finish, the async approach will take less time to complete a minibatch step than the sync approach will do. However, the async approach will produce a more noisy gradient, so it might need to complete more minibatch steps to catch up with the performance (in terms of loss) of the sync approach.
There are many papers proposing some improvements and optimizations on either approach, but the main idea is generally the same as described above.
In the literature there's been some disagreement on which technique is better in practice. At the end most people now settle on the synchronous approach.
Data Parallelism in PyTorch
To do synchronous SGD, we can wrap our model with torch.nn.parallel.DistributedDataParallel:
from torch.nn.parallel import DistributedDataParallel as DDP
# `model` is the model we previously initialized
model = ...
# `rank` is a device number starting from 0
model = model.to(rank)
ddp_model = DDP(model, device_ids=[rank])
Then we can train it similarly. For more details, you can refer to the official tutorial.
For doing asynchronous SGD in PyTorch, we need to implement it more manually since there is no wrapper similar to DistributedDataParallel for it.
Data Parallelism in TensorFlow/Keras
For synchronous SGD, we can use tf.distribute.MirroredStrategy to wrap the model initalization:
import tensorflow as tf
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = Model(...)
model.compile(...)
Then we can train it as usual. For more details, you can refer to the official guides on Keras website and TensorFlow website.
For asynchronous SGD, we can use tf.distribute.experimental.ParameterServerStrategy similarly.
2. Model Parallelism
This strategy splits the model into N parts, each of which will be computed on different devices. A common way to split the model is based on layers: different sets of layers are placed on different devices. But we can also split it more intricately depending on the model architecture.
Model Parallelism in TensorFlow and PyTorch
To implement model parallelism in either TensorFlow or PyTorch, the idea is the same: to move some model parameters into a different device.
In PyTorch we can use torch.nn.Module.to method to move a module into a different device. For example, suppose we want to create two linear layers each of which is placed on a different GPU:
import torch.nn as nn
linear1 = nn.Linear(16, 8).to('cuda:0')
linear2 = nn.Linear(8, 4).to('cuda:1')
In TensorFlow we can use tf.device to place an operation into a specific device. To implement the PyTorch example above in TensorFlow:
import tensorflow as tf
from tensorflow.keras import layers
with tf.device('/GPU:0'):
linear1 = layers.Dense(8, input_dim=16)
with tf.device('/GPU:1'):
linear2 = layers.Dense(4, input_dim=8)
For more details you can refer to the official PyTorch tutorial; or if you use TensorFlow you can even use a more high-level library like mesh.
3. Hybrid: Data and Model Parallelism
Recall that data parallelism only splits the training data, whereas model parallelism only splits the model structures. If we have a model so large that even after using either parallelism strategy it still doesn't fit in the memory, we can always do both.
In practice most people prefer data parallelism to model parallelism since the former is more decoupled (in fact, independent) from the model architecture than the latter. That is, by using data parallelism they can change the model architecture as they like, without worrying which part of the model should be parallelized.
Model Inference / Serving
Parallelizing model serving is easier than parallelizing model training since the model parameters are already fixed and each request can be processed independently. Similar to scaling a regular Python web service, we can scale model serving by spawning more processes (to workaround Python's GIL) in a single machine, or even spawning more machine instances.
When we use a GPU to serve the model, though, we need to do more work to scale it. Because of how concurrency is handled differently by a GPU compared to a CPU, in order to maximize the performance, we need to do inference request batching. The idea is when a request comes, instead of immediately processing it, we wait some timeout duration for other requests to come. When the timeout is up, even if the number of requests is only one, we batch them all to be processed on the GPU.
In order to minimize the average request latency, we need to find the optimal timeout duration. To find it we need to observe that there is a trade-off between minimizing the timeout duration and maximizing the number of batch size. If the timeout is too low, the batch size will be small, so the GPU will be underutilized. But if the timeout is too high, the requests that come early will wait too long before they get processed. So, the optimal timeout duration depends on the model complexity (hence, the inference duration) and the average requests per second to receive.
Implementing a scheduler to do request batching is not a trivial task, so instead of doing it manually, we'd better use TensorFlow Serving or PyTorch Serve which already supports it.
To learn more about parallel and distributed learning, you can read this review paper.
As the question is quite broad, I'll try to shed a little different light and touch on different topics than what was shown in
#Daniel's in-depth answer.
Training
Data parallelization vs model parallelization
As mentioned by #Daniel data parallelism is used way more often and is easier to do correctly. Major caveat of model parallelism is the need to wait for part of neural network and synchronization between them.
Say you have a simple feedforward 5 layer neural network spread across 5 different GPUs, each layer for one device. In this case, during each forward pass each device has to wait for computations from the previous layers. In this simplistic case, copying data between devices and synchronization would take a lot longer and won't bring benefits.
On the other hand, there are models better suited for model parallelization like Inception networks, see picture below:
Here you can see 4 independent paths from previous layer which could go in parallel and only 2 synchronization points (Filter concatenation and Previous Layer).
Questions
E.g. backpropagation through the graph itself can be parallelized e.g.
by having different layers hosted on different machines since (I
think?) the autodiff graph is always a DAG.
It's not that easy. Gradients are calculated based on the loss value (usually) and you need to know gradients of deeper layers to calculate gradients for the more shallow ones. As above, if you have independent paths it's easier and may help, but it's way easier on a single device.
I believe this is also called gradient accumulation (?)
No, it's actually reduction across multiple devices. You can see some of that in PyTorch tutorial. Gradient accumulation is when you run your forward pass (either on single or multiple devices) N times and backpropagate (the gradient is kept in the graph and the values are added during each pass) and optimizer only makes a single step to change neural network's weights (and clears the gradient). In this case, loss is usually divided by the number of steps without optimizer. This is used for more reliable gradient estimation, usually when you are unable to use large batches.
Reduction across devices looks like this:
This is all-reduce in data parallelization, each device calculates the values which are send to all other devices and backpropagated there.
When is each strategy better for what type of problem or neural
network?
Described above, data parallel is almost always fine if you have enough of data and the samples are big (up to 8k samples or more can be done at once without very big struggle).
Which modes are supported by modern libraries?
tensorflow and pytorch both support either, most modern and maintained libraries have those functionalities implemented one way or another
can one combine all four (2x2) strategies
Yes, you can parallelize both model and data across and within machines.
synchronous vs asynchronous
asynchronous
Described by #Daniel in brief, but it's worth mentioning updates are not totally separate. That would make little sense, as we would essentially train N different models based on their batches.
Instead, there is a global parameter space, where each replica is supposed to share calculated updates asynchronously (so forward pass, backward, calculate update with optimizer and share this update to global params).
This approach has one problem though: there is no guarantee that when one worker calculated forward pass another worker updated the parameters, so the update is calculated with respect to old set of params and this is called stale gradients. Due to this, convergence might be hurt.
Other approach is to calculate N steps and updates for each worker and synchronize them afterwards, though it's not used as often.
This part was based on great blogpost and you should definitely read it if interested (there is more about staleness and some solutions).
synchronous
Mostly described previously, there are different approaches but PyTorch gathers output from network and backpropagates on them (torch.nn.parallel.DistributedDataParallel)[https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel]. BTW. You should solely this (no torch.nn.DataParallel) as it overcomes Python's GIL problem.
Takeaways
Data parallelization is always almost used when going for speed up as you "only" have to replicate neural network on each device (either over the network or within single machine), run part of batch on each during the forward pass, concatenate them into a single batch (synchronization) on one device and backpropagate on said.
There are multiple ways to do data parallelization, already introduced by #Daniel
Model parallelization is done when the model is too large to fit on single machine (OpenAI's GPT-3 would be an extreme case) or when the architecture is suited for this task, but both are rarely the case AFAIK.
The more and the longer parallel paths the model has (synchronization points), the better it might be suited for model parallelization
It's important to start workers at similar times with similar loads in order not to way for synchronization processes in synchronous approach or not to get stale gradients in asynchronous (though in the latter case it's not enough).
Serving
Small models
As you are after large models I won't delve into options for smaller ones, just a brief mention.
If you want to serve multiple users over the network you need some way to scale your architecture (usually cloud like GCP or AWS). You could do that using Kubernetes and it's PODs or pre-allocate some servers to handle requests, but that approach would be inefficient (small number of users and running servers would generate pointless costs, while large numbers of users may halt the infrastructure and take too long to process resuests).
Other way is to use autoscaling based on serverless approach. Resources will be provided based on each request so it has large scaling abilities + you don't pay when the traffic is low. You can see Azure Functions as they are on the path to improve it for ML/DL tasks, or torchlambda for PyTorch (disclaimer, I'm the author) for smaller models.
Large models
As mentioned previously, you could use Kubernetes with your custom code or ready to use tools.
In the first case, you can spread the model just the same as for training, but only do forward pass. In this way even giant models can be put up on the network (once again, GPT-3 with 175B parameters), but requires a lot of work.
In the second, #Daniel provided two possibilities. Others worth mentioning could be (read respective docs as those have a lot of functionalities):
KubeFlow - multiple frameworks, based on Kubernetes (so auto-scaling, multi-node), training, serving and what not, connects with other things like MLFlow below
AWS SageMaker - training and serving with Python API, supported by Amazon
MLFlow - multiple frameworks, for experiment handling and serving
BentoML - multiple frameworks, training and serving
For PyTorch, you could read more here, while tensorflow has a lot of serving functionality out of the box via Tensorflow EXtended (TFX).
Questions from OP's comment
Are there any forms of parallelism that are better within a machine vs
across machines
The best for of parallelism would probably be within one giant computer as to minimize transfer between devices.
Additionally, there are different backends (at least in PyTorch) one can choose from (mpi, gloo, nccl) and not all of them support direct sending, receiving, reducing etc. data between devices (some may support CPU to CPU, others GPU to GPU). If there is no direct link between devices, those have to be first copied to another device and copied again to target device (e.g. GPU on other machine -> CPU on host -> GPU on host). See pytorch info.
The more data and the bigger network, the more profitable it should be to parallelize computations. If whole dataset can be fit on a single device there is no need for parallelization. Additionally, one should take into account things like internet transfer speed, network reliability etc. Those costs may outweigh benefits.
In general, go for data parallelization if you have lots of of data (say ImageNet with 1.000.000 images) or big samples (say images 2000x2000). If possible, within a single machine as to minimize between-machines transfer. Distribute model only if there is no way around it (e.g. it doesn't fit on GPU). Don't otherwise (there is little to no point to parallelize when training MNIST as the whole dataset will easily fit in RAM and the read will be fastest from it).
why bother build custom ML-specific hardware such as TPUs?
CPUs are not the best suited for highly parallel computations (e.g. matrices multiplication) + CPU may be occupied with many other tasks (like data loading), hence it makes sense to use GPU.
As GPU was created with graphics in mind (so algebraic transformation), it can take some of CPU duties and can be specialized (many more cores when compared to CPU but simpler ones, see V100 for example).
Now, TPUs are tailored specificially for tensor computations (so deep learning mainly) and originated in Google, still WIP when compared to GPUs. Those are suited for certain types of models (mainly convolutional neural networks) and can bring speedups in this case. Additionally one should use the largest batches with this device (see here), best to be divisible by 128. You can compare that to NVidia's Tensor Cores technology (GPU) where you are fine with batches (or layer sizes) divisible by 16 or 8 (float16 precision and int8 respectively) for good utilization (although the more the better and depends on number of cores, exact graphic card and many other stuff, see some guidelines here).
On the other hand, TPUs support still isn't the best, although two major frameworks support it (tensorflow officially, while PyTorch with torch_xla package).
In general, GPU is a good default choice in deep learning right now, TPUs for convolution heavy architectures, though might give some headache tbh. Also (once again thanks #Daniel), TPUs are more power effective, hence should be cheaper when comparing single floating point operation cost.

Assign Keras/TF/PyTorch layer to hardware type

Suppose we have the following architecture:
Multiple CNN layers
RNN layer
(Time-distributed) Dense classification layer
We want to train this architecture now. Our fancy GPU is very fast at solving the CNN layers. Although using a lower clockrate, it can perform many convolutions in parallel, thus the speed. Our fancy CPU however is faster for the (very long) resulting time series, because the time steps cannot be parallelized, and the processing profits from the higher CPU clockrate. So the (supposedly) smart idea for execution would look like this:
Multiple CNN layers (run on GPU)
RNN layer (run on CPU)
(Time-distributed) Dense classification layer (run on GPU/CPU)
This lead me to two important questions:
Is it possible, with any of the frameworks mentioned in the title, to distribute certain layers to certain hardware, and how?
If it is possible, would the overhead for the additional memory operations, e.g. tranferring between GPU-/CPU-RAM, render the whole idea useless?
Basically, in Pytorch you can control the device on which variables/parameters reside. AFAIK, it is your responsibility to make sure that for each operation all the arguments reside on the same device: i.e., you cannot conv(x, y) where x is on GPU and y is on CPU.
This is done via pytorch's .to() method that moves a module/variable .to('cpu') or .to('cuda:0')
As Shai mentioned you can control this yourself in pytorch so in theory you can have parts of your model on different devices. You have then to move data between devices in your forward pass.
I think the overhead as you mentioned would make the performance worst. The cuda RNN implementation benefits greatly from running on a gpu anyways :)

Will adding GPU cards automatically scale tensorflow usage?

Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with tensorflow. Now, suppose I want to train with larger sample 2N and/or deeper network 2L and getting out of memory error.
Will plugging additional GPU cards automatically solve this problem (suppose, that total amount of memory of all GPU cards is sufficient to hold batch and it's gradients)? Or it is impossible with pure tensorflow?
I'v read, that there are bitcoin or etherium miners, that can build mining farm with multiple GPU cards and that this farm will mine faster.
Will mining farm also perform better for deep learning?
Will plugging additional GPU cards automatically solve this problem?
No. You have to change your Tensorflow code to explicitly compute different operations on different devices (e.g: compute the gradients over a single batch on every GPU, then send the computed gradients to a coordinator that accumulates the received gradients and updates the model parameters averaging these gradients).
Also, Tensorflow is so flexible that allows you to specify different operations for every different device (or different remote nodes, it's the same).
You could do data augmentation on a single computational node and let the others process the data without applying this function. You can execute certain operation on a device or set of devices only.
it is impossible with pure tensorflow?
It's possible with tensorflow, but you have to change the code you wrote for a single train/inference device.
I'v read, that there are bitcoin or etherium miners, that can build mining farm with multiple GPU cards and that this farm will mine faster.
Will mining farm also perform better for deep learning?
Blockchains that work using POW (Proof Of Work) requires to solve a difficult problem using a brute-force like approach (they compute a lot's of hash with different inputs until they found a valid hash).
That means that if your single GPU can guess 1000 hash/s, 2 identical GPUs can guess 2 x 1000 hash/s.
The computation the GPUs are doing are completely uncorrelated: the data produced by the GPU:0 is not used by the GPU:1 and there are no synchronization points between the computations. This means that the task that a GPU do can be executed in parallel by another GPU (obviously with different inputs per GPU, so the devices compute hashes to solve different problems given by the network)
Back to Tensorflow: once you modified your code to work with different GPUs, you could train your network faster (in short because you're using bigger batches)

tensorflow: difference between multi GPUs and distributed tensorflow

I am little confused about these two concepts.
I saw some examples about multi GPU without using clusters and servers in the code.
Are these two different? What is the difference?
Thanks a lot!
It depends a little on the perspective from which you look at it. In any multi-* setup, either multi-GPU or multi-machine, you need to decide how to split up your computation across the parallel resources. In a single-node, multi-GPU setup, there are two very reasonable choices:
(1) Intra-model parallelism. If a model has long, independent computation paths, then you can split the model across multiple GPUs and have each compute a part of it. This requires careful understanding of the model and the computational dependencies.
(2) Replicated training. Start up multiple copies of the model, train them, and then synchronize their learning (the gradients applied to their weights & biases).
Our released Inception model has some good diagrams in the readme that show how both multi-GPU and distributed training work.
But to tl;dr that source: In a multi-GPU setup, it's often best to synchronously update the model by storing the weights on the CPU (well, in its attached DRAM). But in a multi-machine setup, we often use a separate "parameter server" that stores and propagates the weight updates. To scale that to a lot of replicas, you can shard the parameters across multiple parameter servers.
With multiple GPUs and parameter servers, you'll find yourself being more careful about device placement using constructs such as with tf.device('/gpu:1'), or placing weights on the parameter servers using tf.train.replica_device_setter to assign it on /job:ps or /job:worker.
In general, training on a bunch of GPUs in a single machine is much more efficient -- it takes more than 16 distributed GPUs to equal the performance of 8 GPUs in a single machine -- but distributed training lets you scale to even larger numbers, and harness more CPU.
Well until recently there was no open-source cluster version of tensor flow - just single machine with zero or more GPU.
The new release v0.9 may or may not have changed things.
The article in the original release documentation (Oct 2015) showed that Google has cluster-based solutions - but they had not open-sourced them.
Here is what the whitepaper says:
3.2 Multi-Device Execution Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required
communication of data across device boundaries implied by these
placement decisions. This subsection discusses these two issues