How can I use multiple GPUs in CuPy?

I am trying to parallelise multiple matrix multiplications across multiple GPUs in CuPy.
CuPy accelerates matrix multiplication (e.g. $A\times B$).
Suppose I have four square matrices A, B, C, D. I want to calculate AB and CD on two different local GPUs. How can I do this in CuPy?
For example, in tensorflow,
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
Is there a similar way in CuPy? My concern is that CuPy executes code straight away, so that it cannot run the next line (e.g. $C\times D$) until the current line (e.g. $A\times B$) finishes.
Thanks to Tos for the help. Now the new question is:
say I have ten of these matrix pairs stored in two 3-D NumPy arrays (say ?*?*10). How can I write a loop to store the results of the multiplications?
anumpy  # size (1e5, 1e5, 10)
bnumpy  # size (1e5, 1e5, 10)
for i in range(10):
    # say I have 3 GPUs
    with cupy.cuda.Device(i % 3):
        a = cupy.array(anumpy[:, :, i])
        b = cupy.array(bnumpy[:, :, i])
        ab[:, :, math.floor(i / 3)] = a @ b
How can I combine these 3 ab arrays from the different devices?
Can I have arrays with the same name on different GPUs?

Use with cupy.cuda.Device(i) and avoid blocking operations. For example, to compute the matrix products of pairs of CPU arrays, send the results back to the CPU (cupy.asnumpy) only after all the matmul operations have been launched.
with cupy.cuda.Device(0):
    a = cupy.array(a)
    b = cupy.array(b)
    ab = a @ b
    # ab = cupy.asnumpy(ab)  # not here
with cupy.cuda.Device(1):
    c = cupy.array(c)
    d = cupy.array(d)
    cd = c @ d
cd = cupy.asnumpy(cd)
ab = cupy.asnumpy(ab)

CuPy does not synchronize the device execution in most operations. Code like A.dot(B) returns immediately after launching the matrix product on the device, without waiting for the device-side operation itself, so if the operation is heavy enough (e.g. the matrices are large), the computation effectively overlaps with the second matrix product on another device.
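A minimal sketch of what that overlap looks like in practice (this assumes two GPUs and matrices large enough that the kernels dominate the launch overhead; the Device.synchronize() calls are only there to make the timing meaningful):
import time
import numpy as np
import cupy

n = 8000
a_cpu, b_cpu = np.random.rand(n, n), np.random.rand(n, n)
c_cpu, d_cpu = np.random.rand(n, n), np.random.rand(n, n)

with cupy.cuda.Device(0):
    a, b = cupy.asarray(a_cpu), cupy.asarray(b_cpu)
with cupy.cuda.Device(1):
    c, d = cupy.asarray(c_cpu), cupy.asarray(d_cpu)

start = time.perf_counter()
with cupy.cuda.Device(0):
    ab = a @ b                     # returns as soon as the kernel is launched
with cupy.cuda.Device(1):
    cd = c @ d                     # launched while GPU 0 is still computing
cupy.cuda.Device(0).synchronize()  # now wait for both devices to finish
cupy.cuda.Device(1).synchronize()
print('both products finished in', time.perf_counter() - start, 's')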

I'm not 100% sure if I understand the question properly, but I guess it can be something like this:
def my_cal(gpu_id, anumpy, bnumpy):
    # indices this GPU is responsible for: gpu_id, gpu_id + 3, gpu_id + 6, ...
    my_indices = range(gpu_id, anumpy.shape[2], 3)
    with cupy.cuda.Device(gpu_id):
        ab = cupy.empty((anumpy.shape[0], anumpy.shape[1], len(my_indices)))
        for j, i in enumerate(my_indices):
            a = cupy.array(anumpy[:, :, i])
            b = cupy.array(bnumpy[:, :, i])
            ab[:, :, j] = a @ b
        return cupy.asnumpy(ab)

np_ab0 = my_cal(0, anumpy, bnumpy)
np_ab1 = my_cal(1, anumpy, bnumpy)
np_ab2 = my_cal(2, anumpy, bnumpy)
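To combine the three ab arrays, move each one back to the host with cupy.asnumpy (as my_cal already does) and scatter the slices into one NumPy array. A hedged sketch under the same 3-GPU, 10-pair assumption, where np_ab0/np_ab1/np_ab2 come from the calls above:
import numpy as np

num_gpus = 3
parts = [np_ab0, np_ab1, np_ab2]   # per-GPU results, already on the host

ab_all = np.empty_like(anumpy)     # same (n, n, 10) layout as the inputs
for g in range(num_gpus):
    # GPU g handled pairs g, g + 3, g + 6, ..., so scatter its slices back there
    ab_all[:, :, g::num_gpus] = parts[g]
Reusing the same variable names (a, b, ab) on different GPUs is also fine: each CuPy array remembers which device it was allocated on, and cupy.asnumpy copies it back from that device.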

Related

tensorflow profile explanation

I use the TensorFlow profiler to test the inference of my model, and here are the profile details. I see four row numbers, 0, 1, 2, 3, where rows 1 and 2 are blank. What is the meaning of rows 0-3, and why are rows 1 and 2 blank?
The machine has 80 cores; does this mean that the inference only occupies 4 of them?
Thanks.
I suppose that each row corresponds to a worker thread that runs operators.
So your inference only occupies 4 cores, as you say.
TensorFlow uses multiple threads when:
there are some independent graph parts, or
there is an operator that itself uses multiple threads.
So you can use multiple cores effectively if your graph has many independent parts.
In the following code, the graph has many independent parts, so the number of rows in the profiler matches "inter_op_parallelism_threads".
config = tf.ConfigProto(inter_op_parallelism_threads=5, intra_op_parallelism_threads=1)
with tf.device("/cpu:0"):
list_r = []
for i in range(80):
r = tf.random_normal(shape=[100, 100])
list_r.append(r)
v = tf.add_n(list_r)
global_step = tf.train.create_global_step()
hook = tf.train.ProfilerHook(save_steps=1)
increment_global = global_step.assign_add(1)
with tf.train.SingularMonitoredSession(hooks=[hook], config=config) as sess:
sess.run([v, increment_global])
If you want to know the details of ConfigProto, see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto
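The snippet above illustrates the first case (many independent graph parts). Here is a small counterpart sketch for the second case, where a single heavy operator uses several threads internally once intra_op_parallelism_threads > 1; the sizes and thread counts are just illustrative assumptions:
import tensorflow as tf

config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=8)
a = tf.random_normal(shape=[2000, 2000])
b = tf.random_normal(shape=[2000, 2000])
c = tf.matmul(a, b)   # one operator, internally multi-threaded

with tf.Session(config=config) as sess:
    sess.run(c)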

How to connect two models

I have model A (autoencoder) which takes as input a batch of images A_in (original images), and outputs a batch of images A_out (reconstructed images). Then I have model B (binary classifier) which takes as input a batch of images B_in, which is a mixture of A_in and A_out.
I want B to distinguish between A_in and A_out, to see if A is doing a good job reconstructing images. B_out is a probability that a given image is A_in.
B trains in parallel with A to classify the two kinds of images. B_loss = (B_out - label). Labels are 0 or 1 (original or reconstructed). When we optimize B_loss we only update B parameters.
I want to train model A so that it optimizes a combined loss function: Combined_Loss = reconstruction error (A_out - A_in) - classification error (B_out - label), so that it tries to reconstruct the images and fool B at the same time. Here I want to only update A parameters (we don't want to help B here).
Now, my question is about constructing that mixture of A_in and A_out, and feeding it to B so that the graphs A and B are connected.
Right now it's like this:
A_out = autoencoder(A_in: orig_images)
B_out = classifier(B_in: numpy(mix(A_in, A_out)))
How do I define it like this:
A_out = autoencoder(A_in: orig_images)
B_out = classifier(mix(A_out, A_in))
So that when I train A and B at the same time:
sess.run([autoencoder_train_op, classifier_train_op],
         feed_dict={A_in: orig_images, B_in: classifier_images, labels: classifier_labels})
I wouldn't need B_in placeholder (the graphs would be connected)?
Here's my Numpy code that constructs classifier_images (mix(A_in, A_out)):
reconstr_images = sess.run(A_out, feed_dict={A_in: orig_images})
half_and_half_images = np.concatenate((reconstr_images[:batch_size // 2], orig_images[batch_size // 2:]))
half_and_half_labels = np.zeros(labels.shape)
half_and_half_labels[batch_size // 2:] = 1
random_indices = np.random.permutation(batch_size)
classifier_images = half_and_half_images[random_indices]
classifier_labels = half_and_half_labels[random_indices]
How do I convert it into TensorFlow node?
You can connect your models directly. In other words, you don't use a placeholder for B's inputs but use your mixture of A_in and A_out. If you just want to run B, you can still feed your inputs into the tensors coming from A: feeding only placeholders is common, but TensorFlow supports feeding a value into any tensor. If it makes it easier to think about, you can pass A's outputs through tf.identity so that you have something like a placeholder.
Another approach is what is usually done in GANs (where the generator output is fed into the discriminator). You can create two "towers" of operations that share variables. One tower will be just B, and you can feed your inputs into B's placeholders to run just B. The other tower can be B on top of A, which you can use to run/train A and B together. The Bs in these two towers will have the same structure and share variables, but have separate ops. This approach is likely the cleanest and most flexible.
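Here is a hedged sketch of the first approach, building the mix inside the graph so B consumes A's output directly; autoencoder, classifier, batch_size, and the 28x28x1 image shape are stand-ins for the poster's own code, not known values:
import tensorflow as tf

A_in = tf.placeholder(tf.float32, [batch_size, 28, 28, 1])   # original images
A_out = autoencoder(A_in)                                    # reconstructed images

half = batch_size // 2
mixed_images = tf.concat([A_out[:half], A_in[half:]], axis=0)
mixed_labels = tf.concat([tf.zeros([half]), tf.ones([batch_size - half])], axis=0)

# shuffle images and labels with the same permutation
perm = tf.random_shuffle(tf.range(batch_size))
B_in = tf.gather(mixed_images, perm)
labels = tf.gather(mixed_labels, perm)

B_out = classifier(B_in)   # no separate B_in placeholder needed
When training A on the combined loss, pass var_list with only A's variables to its optimizer (and only B's variables to B's optimizer) so each training op updates the intended part of the graph.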

Numpy - Deep Learning, Training Examples

Silly question: I am going through the third week of Andrew Ng's newest deep learning course, and I'm getting stuck on a fairly simple NumPy function (I think?).
The exercise is to find how many training examples, m, we have.
Any idea what the NumPy function is to find the number of examples in a preloaded training set?
Thanks!
shape_X = X.shape
shape_Y = Y.shape
m = ?
print ('The shape of X is: ' + str(shape_X))
print ('The shape of Y is: ' + str(shape_Y))
print ('I have m = %d training examples!' % (m))
It depends on what kind of storage approach you use.
Most Python-based tools use the [n_samples, n_features] layout, where the first dimension is the sample dimension and the second dimension is the feature dimension (as in scikit-learn and co.). Expressed differently: samples are rows and features are columns.
So:
#              feature  1  2  3  4
x = np.array([[1, 2, 3, 4],   # first sample
              [2, 3, 4, 5],   # second sample
              [3, 4, 5, 6]])  # third sample
is a training-set of 3 samples with 4 features each.
You can get the sizes M, N (again: the interpretation might differ for other tools) with:
M, N = x.shape
because NumPy's first dimension is rows and its second dimension is columns, as in matrix algebra.
For the above example, the target array would be of shape (M,) = (n_samples,).
Anytime you want the total number of elements in an array, you can use
m = X.size
but note that this returns the product of all dimensions (rows x columns here), so it only equals the number of training examples when each example has a single feature.
In this course the examples are stored as columns, i.e. X has shape (n_x, m), so the way to get the number of training examples is
m = X.shape[1]
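A tiny illustration, assuming the course's layout where examples are stored as columns (X of shape (n_x, m), Y of shape (1, m)); the numbers are made up:
import numpy as np

X = np.random.rand(2, 400)   # 2 features, 400 training examples
Y = np.random.rand(1, 400)

m = X.shape[1]               # number of training examples
print('I have m = %d training examples!' % m)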

How to shift a tensor using an API in TensorFlow, just like numpy.roll() or shift? [duplicate]

Let's say that we want to process images (or n-dim vectors) using Keras/TensorFlow.
And we want, as a fancy regularization, to shift each input by a random number of positions to the left (overflowed portions reappearing on the right side).
How could this be viewed and solved?
1) Is there any variation of the numpy roll function for TensorFlow?
2) With x a 2-D tensor and ri a random integer:
concatenate(x[:, ri:], x[:, 0:ri], axis=1)  # executed for each single input to the layer, ri being random again and again (I can live with random only per batch)
In TensorFlow v1.15.0 and up, you can use tf.roll, which works just like numpy roll: https://github.com/tensorflow/tensorflow/pull/14953
To improve on the answer above you can do:
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.roll(tensor, shift=i, axis=[1])
For older versions, starting from v1.6.0, you will have to use tf.manip.roll:
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.manip.roll(tensor, shift=i, axis=[1])
I just had to do this myself, and I don't think there is a TensorFlow op for np.roll, unfortunately. Your code above looks basically correct, except that it doesn't roll by ri but by (x.shape[1] - ri).
Also, be careful when choosing your random integer: it should come from range(1, x.shape[1] + 1) rather than range(0, x.shape[1]), because if ri were 0, then x[:, 0:ri] would be empty.
So what I would suggest would be something more like (for rolling along dimension 1):
x_len = x.get_shape().as_list()[1]
i = np.random.randint(0,x_len) # The amount you want to roll by
y = tf.concat([x[:,x_len-i:], x[:,:x_len-i]], axis=1)
EDIT: added missing colon after hannes' correct comment.
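Since the question mentions Keras, here is a hedged sketch of wrapping the roll in a Lambda layer; it assumes a TensorFlow version that provides tf.roll and 2-D inputs of shape (batch, width), and the shift is drawn once per batch, which the question says is acceptable:
import tensorflow as tf
from tensorflow.keras.layers import Lambda

def random_roll(x):
    width = tf.shape(x)[1]
    shift = tf.random.uniform(shape=[], maxval=width, dtype=tf.int32)
    return tf.roll(x, shift=shift, axis=1)

roll_layer = Lambda(random_roll)   # use like any other layer in the model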

How to write expensive summaries less often in tensorflow

I have a TensorFlow model with different summaries in it. Some, such as loss and accuracy, are inexpensive, and I want to write them often. Others, like accuracy on the test set, are more expensive to calculate, and I want to write them, say, 100 times less often than the normal summaries. What is the best way to implement this in TensorFlow?
Instead of merging all summaries with merge_all(), create a few different groups of summaries with merge() and write them at different frequencies. Something like this:
s1 = tf.summary.image(...)
s2 = tf.summary.scalar(...)
s3 = tf.summary.histogram(...)
s4 = tf.summary.audio(...)
summary_expensive = tf.summary.merge([s1, s4])
summary_cheap = tf.summary.merge([s2, s3])
# open a session `sess`
# init variables
# create a writer `writer`
for i in xrange(many_steps):
    summary1 = sess.run(summary_cheap)
    writer.add_summary(summary1, i)
    if i % 100 == 0:
        summary2 = sess.run(summary_expensive)
        writer.add_summary(summary2, i)