Any existing implementation of distributed matrix multiplication in tensorflow?

From the GitHub code, it seems the MatMul op doesn't support partitioned matrices. So is there any tool in TensorFlow that supports multiplication of two huge matrices that are distributed across multiple nodes?

Support for distributing computation across machines is built into TensorFlow. I would recommend reading the distributed TensorFlow docs to figure out how to set up a TensorFlow cluster.
Once the cluster is set up, you can decide how to partition your problem and use with tf.device blocks to assign each worker its partition of the work.
For instance, suppose you are multiplying a*a', you want to split the intermediate multiplications evenly over 2 workers, and you want to aggregate the results on a 3rd.
You would do something like this:
with tf.device(worker0):
    # load a1
    b1 = tf.matmul(a1, tf.transpose(a1))
with tf.device(worker1):
    # load a2
    b2 = tf.matmul(a2, tf.transpose(a2))
with tf.device(worker2):
    result = b1 + b2
The "load a1" part depends on where your matrix is stored. If it is huge, then loading a1 will probably mean reading its block from disk. If the whole matrix fits in memory, you can take a partition with a1 = a[:, :m//2] and a2 = a[:, m//2:]; the split has to be along the inner (column) dimension so that a*a' = a1*a1' + a2*a2'.
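Putting the pieces together, a minimal end-to-end sketch might look like the following (the device strings, the gRPC session target, and the use of tf.constant to ship each block are illustrative assumptions, not part of the original answer):

import numpy as np
import tensorflow as tf

# Illustrative device strings; adapt them to your own cluster spec.
worker0 = "/job:worker/task:0"
worker1 = "/job:worker/task:1"
worker2 = "/job:worker/task:2"

a = np.random.rand(1000, 4000).astype(np.float32)  # toy stand-in for the real matrix
m = a.shape[1]

with tf.device(worker0):
    a1 = tf.constant(a[:, :m // 2])       # first half of the columns
    b1 = tf.matmul(a1, tf.transpose(a1))
with tf.device(worker1):
    a2 = tf.constant(a[:, m // 2:])       # second half of the columns
    b2 = tf.matmul(a2, tf.transpose(a2))
with tf.device(worker2):
    result = b1 + b2                      # a @ a.T == a1 @ a1.T + a2 @ a2.T

# The session target should point at any node of the running cluster.
with tf.Session("grpc://localhost:2222") as sess:
    print(sess.run(result))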

Related

Create round-robin sharding while generating sharded tfrecords

I am new to TensorFlow and I am working on an image segmentation problem in TensorFlow 1.14. I have a huge dataset, and generating one big tfrecord file is very slow, so I would like to create 'n' shards of tfrecords instead. I could not find a way to do it online. Say I have 600 images and 600 masks; I want to generate 6 shards of tfrecords, with 100 images and 100 masks each, filled in round-robin fashion. A high-level / pseudo-code of what I want is as follows:
sharded_tf_record_writer:
    create n TFRecordWriters
    for each item in the dataset:
        write_example to one of the n writers, in round-robin fashion
I did search online and could not find a relevant answer. I do not want to use Apache Beam for sharding. I would appreciate any idea/help/guidance to achieve this.
I had asked the same question in one of the issues of tensorflow/datasets and the user Conchylicultor responded with this:
Writing is done by _TFRecordWriter. Tfds will automatically compute the required number of shards and distribute examples across shards; however, each shard is written sequentially.
You do not have control over the number of shards; it is also computed automatically.
However, the fact that examples are distributed between shards does not make the writing faster, as examples are not pre-processed in parallel. If you want parallelism, then you'll have to use Apache Beam, which allows scaling even to huge datasets.
The link to the tensorflow/datasets issue is - https://github.com/tensorflow/datasets/issues/676
This might help.
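For reference, if all you need is the round-robin behaviour from the pseudo-code above, without tfds or Apache Beam, a minimal single-process sketch could look like this (assuming TF 1.14's tf.io.TFRecordWriter and a create_tf_example helper of your own; file names and the samples iterable are illustrative):

import tensorflow as tf

num_shards = 6
writers = [
    tf.io.TFRecordWriter(
        'train-{:05d}-of-{:05d}.tfrecord'.format(i, num_shards))
    for i in range(num_shards)
]

# `samples` and `create_tf_example` come from your own pipeline.
for idx, (image, mask) in enumerate(samples):
    example = create_tf_example(image, mask)  # returns a tf.train.Example
    writers[idx % num_shards].write(example.SerializeToString())

for writer in writers:
    writer.close()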
Since you are working with images in TensorFlow, there is some nice code in the official TensorFlow models repository (written for the object detection API) that will do what you want. Note this code is for TensorFlow 2 (not sure if it'll work in TF 1).
See this example of writing sharded tfrecords from COCO annotations. The idea is that you open a list of TFRecordWriters in an exit stack (using contextlib2.ExitStack()), which will automatically close the TFRecords when each thread finishes writing to it.
The utility function open_sharded_output_tfrecords creates this list of TFRecordWriters.
import contextlib2
import tensorflow as tf
with contextlib2.ExitStack() as tf_record_close_stack, tf.gfile.GFile(
        annotations_file, 'r') as fid:
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_path, num_shards)
Next, you can use a ProcessPoolExecutor to write tfrecords into each shard in a round-robin fashion, in parallel (4 workers in this example):
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(4) as executor:
    futures = []
    for idx, image in enumerate(images):
        future = executor.submit(
            _write_tf_record,
            image,
            idx,
            num_shards,
            output_tfrecords,
        )
        futures.append(future)
    for future in futures:
        future.result()
where _write_tf_record may look something like this:
def _write_tf_record(image, idx, num_shards, output_tfrecords):
    tf_example = create_tf_example(image)
    shard_idx = idx % num_shards
    output_tfrecords[shard_idx].write(tf_example.SerializeToString())
Just make sure you have more shards than multiprocess workers, otherwise the same writer may be accessed by two different processes.

Is there any good reason to transpose a tensor from NHWC to NCHW?

I often see the transpose implementation in TensorFlow code. I wonder why one would want to transpose an NHWC tensor to NCHW. Please give me a good example and the reason behind it.
Rather than citing the documentation, you should read up on how CUDA works and think about how most operations are implemented.
The reason NCHW is generally faster than NHWC is the way the CUDA kernels are written. In CUDA you need to specify what each thread is doing, for example:
const int threads = 32;
dim3 block(threads, threads);
// up2 is presumably a helper that rounds up to a whole number of blocks.
dim3 grid(up2(W / 2, threads), up2(H, threads), B);
kernel<Dtype><<<grid, block>>>(args...);
Here you get 3 indices, threadIdx.z, threadIdx.y, threadIdx.x, and these threads are organized in warps (a hardware design).
And you want coalesced memory transactions, which means the threads are ordered in such a way that consecutive threads access consecutive addresses, so the GPU can serve them in a single fast transaction.
To sum it up:
You want "threadIdx.x" to be the innermost loop, and you should organize the data layout so that it is read in a coalesced way. The ideal data structure should be accessible by
b * C * H * W + c * H * W + h * W + w
where lower-case letters denote the index and capitalized letters denote the shape (e.g., 0 <= w < W).
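As a quick sanity check on that formula (a small NumPy sketch, not part of the original answer; it relies on NumPy's default row-major layout):

import numpy as np

B, C, H, W = 2, 3, 4, 5
x = np.arange(B * C * H * W).reshape(B, C, H, W)  # NCHW tensor, row-major in memory

b, c, h, w = 1, 2, 3, 4
flat_index = b * C * H * W + c * H * W + h * W + w
assert x.reshape(-1)[flat_index] == x[b, c, h, w]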
In convolution operations (part of the most-used layer), what you are essentially doing is cropping a region in each channel and computing a dot product with a region in another channel (from another tensor). So the indices which need to run fastest are the height index and the width index. In the end you add along the channel axis (as the convolution formula suggests). This also explains why it makes no difference whether you swap H and W (NWHC, NCWH).
This has an impact on how you order the data, and it is the reason you want the memory layout described above.
The worst layout would be:
H, C, B in threadIdx.z, threadIdx.y, threadIdx.x
The best layout would be:
B, C, H in threadIdx.z, threadIdx.y, threadIdx.x
The same is (mostly) true for GEMM as well (here one matrix should be transposed). The cuDNN source is not available, but you might be interested in looking into CUTLASS.
From the TensorFlow performance guide:
NHWC is the TensorFlow default and NCHW is the optimal format to use
when training on NVIDIA GPUs using cuDNN. [...] The brief history of these two formats is that TensorFlow started by using NHWC because it was a little faster on CPUs. In the long term, we are working on tools to auto rewrite graphs to make switching between the formats transparent and take advantage of micro optimizations where a GPU Op may be faster using NHWC than the normally most efficient NCHW.
Essentially, cuDNN is optimized for NCHW, while CPU-only tensorflow is optimized for NHWC. Switching from one to the other is just a matter of performance maximization and/or unavailability of certain operations in a specific data format.
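If you do need to switch layouts inside a graph, the conversion itself is just an axis permutation; a small sketch (shapes are illustrative):

import tensorflow as tf

nhwc = tf.random.uniform([8, 224, 224, 3])    # batch, height, width, channels
nchw = tf.transpose(nhwc, perm=[0, 3, 1, 2])  # -> batch, channels, height, width
back = tf.transpose(nchw, perm=[0, 2, 3, 1])  # -> back to NHWC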

Splitting Training Data to train optimal number of n models

Let's assume we have a huge database providing us with the training data D and a dedicated, smaller test data set T for a machine learning problem.
The data covers many aspects of a real world problem and thus is very diverse in its structure.
When we now train an otherwise unspecified machine learning algorithm (neural network, SVM, random forest, ...) on D and finally test the resulting model against T, we obtain a certain performance measure P (confusion matrix, MSE, ...).
The question: if I could achieve better performance by dividing the problem into smaller sub-problems, e.g. by clustering D into several distinct training sets D1, D2, D3, ..., how could I find the optimal clusters (number of clusters, centroids, ...)?
In a brute-force fashion I am thinking about using k-means clustering with a random number of clusters C, which leads to the training sets D1, D2, ..., Dc.
I would then train C different models and finally test them against the test sets T1, T2, ..., Tc, where the same clustering has been used to split T into the C test sets T1, ..., Tc.
The combination which gives me the best overall performance mean(P1, P2, ..., Pc) would be the one I would choose.
I was just wondering whether you know a more sophisticated way than brute-forcing this?
Many thanks in advance
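For what it's worth, the brute-force loop described above could be sketched roughly as follows (an illustrative scikit-learn sketch, not part of the question; KMeans stands in for the clustering step, and train_model/score are placeholders for your own learner and metric):

import numpy as np
from sklearn.cluster import KMeans

def evaluate_partitioning(D_X, D_y, T_X, T_y, n_clusters, train_model, score):
    """Cluster D, train one model per cluster, and return the mean test score."""
    km = KMeans(n_clusters=n_clusters).fit(D_X)
    test_labels = km.predict(T_X)  # assign test points to the same clusters
    scores = []
    for c in range(n_clusters):
        model = train_model(D_X[km.labels_ == c], D_y[km.labels_ == c])
        mask = test_labels == c
        if mask.any():
            scores.append(score(model, T_X[mask], T_y[mask]))
    return np.mean(scores)

# Brute force over the number of clusters:
# best_k = max(range(2, 11),
#              key=lambda k: evaluate_partitioning(D_X, D_y, T_X, T_y, k, train_model, score))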
Clustering is hard.
Much harder than classification, because you don't have labels to tell you whether you are doing okay or not well at all. It can't do magic; it requires you to carefully choose parameters and evaluate the result.
You cannot just dump your data into k-means and expect anything useful to come out. You'd first need to really, really carefully clean and preprocess your data, and then you might simply figure out that it actually is only one single large clump...
Furthermore, if the clustering works well and you train classifiers on each cluster independently, then every classifier will miss crucial data. The result will most likely perform really badly!
If you want to only train on parts of the data, use a random forest.
But it sounds like you are more interested in a hierarchical classification approach. That may work, if you have good hierarchy information. You'd first train a classifier on the category, then another within the category only to get the final class.

In distributed tensorflow, how to write to summary from workers as well

I am using the Google Cloud ML distributed sample for training a model on a cluster of computers. Input and output (i.e. tfrecords, checkpoints, tfevents) are all on gs:// (Google Storage).
Similarly to the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary, in order to use parameter hypertuning, either within Cloud ML or using my own stack of tools.
But rather than performing a single evaluation on a large batch of data, I am running several evaluation steps, in order to retrieve statistics on the performance criterion, because I don't want to be limited to a single value. I want information about the performance interval. In particular, the variance of the performance is important to me. I'd rather select a model with lower average performance but better worst cases.
I therefore run several evaluation steps. What I would like to do is parallelize these evaluation steps, because right now only the master is evaluating. When using large clusters this is a source of inefficiency, and I would like to task the workers with evaluation as well.
Basically, the supervisor is created as:
self.sv = tf.train.Supervisor(
    graph,
    is_chief=self.is_master,
    logdir=train_dir(self.args.output_path),
    init_op=init_op,
    saver=self.saver,
    # Write summary_ops by hand.
    summary_op=None,
    global_step=self.tensors.global_step,
    # No saving; we do it manually in order to easily evaluate immediately
    # afterwards.
    save_model_secs=0)
At the end of training I call the summary writer:
# Only on master; this is what I want to remove.
if self.is_master and not self.should_stop:
    # I want to have an idea of the statistics of accuracy,
    # not just the mean, hence I run on 10 batches.
    for i in range(10):
        self.global_step += 1
        # I call an evaluator, and extract the accuracy.
        evaluation_values = self.evaluator.evaluate()
        accuracy_value = self.model.accuracy_value(evaluation_values)
        # Now I dump the accuracy, ready to use within hptune.
        eval_summary = tf.Summary(value=[
            tf.Summary.Value(
                tag='training/hptuning/metric', simple_value=accuracy_value)
        ])
        self.sv.summary_computed(session, eval_summary, self.global_step)
I tried to write summaries from the workers as well, but I got an error: basically, summaries can be written from the master only. Is there any easy way to work around this? The error is: "Writing a summary requires a summary writer."
My guess is you'd create a separate summary writer on each worker yourself, and write out summaries directly rather than going through the supervisor.
I suspect you wouldn't use a Supervisor for the eval processing either. Just load a session on each worker for doing eval with the latest checkpoint, and write out independent summaries.
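A minimal sketch of that suggestion (using the TF 1.x tf.summary.FileWriter API; output_path, task_index, master_target, checkpoint_dir, global_step, and the evaluator/model/saver objects are assumed to exist in your code and are only illustrative here):

import os
import tensorflow as tf

# One writer per worker, each in its own subdirectory so event files don't collide.
summary_writer = tf.summary.FileWriter(
    os.path.join(output_path, 'eval_worker_%d' % task_index))

with tf.Session(master_target) as session:
    saver.restore(session, tf.train.latest_checkpoint(checkpoint_dir))
    for i in range(10):
        evaluation_values = evaluator.evaluate()
        accuracy_value = model.accuracy_value(evaluation_values)
        eval_summary = tf.Summary(value=[
            tf.Summary.Value(tag='training/hptuning/metric',
                             simple_value=accuracy_value)
        ])
        # Write directly, bypassing the supervisor's (chief-only) summary writer.
        summary_writer.add_summary(eval_summary, global_step + i)
summary_writer.flush()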

How tensorflow deals with large Variables which can not be stored in one box

I want to train a DNN model on training data with more than one billion feature dimensions, so the shape of the first layer's weight matrix will be (1,000,000,000, 512). This weight matrix is too large to be stored in one box.
As of now, is there any solution for dealing with such large variables, for example partitioning the large weight matrix across multiple boxes?
Update:
Thanks Olivier and keveman. Let me add more detail about my problem.
The examples are very sparse and all features are binary: 0 or 1. The parameter weight looks like tf.Variable(tf.truncated_normal([1000000000, 512], stddev=0.1)).
The solution keveman gave seems reasonable, and I will update with results after trying it.
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512-dimensional vector per feature as an embedding. If each of your examples in the data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to look up the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use tf.nn.embedding_lookup_sparse to look up the embeddings.
In both these cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions is a list of tensors. You would shard your large weight matrix and locate the shards in different machines. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
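A rough sketch of what such a sharded params list could look like (illustrative only; the number of shards, the parameter-server device strings, and the sparse_ids input are assumptions):

import tensorflow as tf

vocab_size = 1000000000   # one billion feature ids
embedding_dim = 512
num_shards = 32           # one shard per parameter server task

# Together these variables form the logical (1e9, 512) weight matrix.
shards = []
for i in range(num_shards):
    with tf.device('/job:ps/task:%d' % i):
        shards.append(tf.get_variable(
            'weights_shard_%d' % i,
            shape=[vocab_size // num_shards, embedding_dim],
            initializer=tf.truncated_normal_initializer(stddev=0.1)))

# sparse_ids: a tf.SparseTensor holding the feature ids of a mini-batch.
activations = tf.nn.embedding_lookup_sparse(
    shards, sparse_ids, sp_weights=None, combiner='sum')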
If you really want to do some dense operation on the weight matrix, say, multiply the matrix with another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow to handle that. You would still shard your weight matrix across machines. But then, you have to manually construct a sequence of matrix multiplies on the distributed blocks of your weight matrix, and combine the results.