I'm trying to implement the AdaRound quantization algorithm, and I need to train my layers one by one.
I'm using a dataset of 1024 samples with a batch size of 32, and I need to iterate over the dataset for roughly 312 epochs (about 10k iterations over the batched dataset). I've noticed that the data is copied from the host to the device on every iteration and is not cached on the GPU (despite the same data being used repeatedly) - the GPU is idle 30~40% of the time.
Idle GPU percentage
The data is still copied from the host to the device on later iterations:
memcpyH2D chunk in a single iteration
I've tried using
tf.data.experimental.prefetch_to_device
tf.data.experimental.copy_to_device
but while the tensors are stored on the GPU when I iterate over the data right after prefetch_to_device or copy_to_device, if I use repeat to go over the dataset again, the tensors end up stored on the CPU.
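For reference, the kind of pipeline I tried looks roughly like this (a sketch built on the batched_train_data pipeline from the snippet further down; the '/gpu:0' device string is just an example). Since prefetch_to_device must be the final transformation in the pipeline, repeat has to come before it:
fetched_train_data = batched_train_data.repeat().apply(
    tf.data.experimental.prefetch_to_device('/gpu:0'))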
I also tried using model.fit without dataset.repeat but with multiple epochs, and I get similar behavior.
I also tried calling model.fit with tensors that are already stored on the GPU, but fit converts them to a Dataset internally, which forces the data back to the CPU.
A code snippet to recreate the issue:
import numpy as np
import tensorflow as tf

input_shape = (56, 56, 64)
output_shape = (54, 54, 64)

conv = tf.keras.layers.Conv2D(64, (3, 3))
mock_input = tf.keras.layers.Input(input_shape)
mock_output = conv(mock_input)
train_model = tf.keras.Model(inputs=mock_input, outputs=mock_output)

input_data = np.random.rand(1024, *input_shape)
output_data = np.random.rand(1024, *output_shape)

input_dataset = tf.data.Dataset.from_tensor_slices(input_data)
output_dataset = tf.data.Dataset.from_tensor_slices(output_data)

train_model.compile(
    optimizer='adam',
    loss='mse'
)

train_data = tf.data.Dataset.zip((input_dataset, output_dataset))
batched_train_data = train_data.batch(32).cache()
fetched_train_data = batched_train_data.prefetch(tf.data.AUTOTUNE).repeat()

with tf.profiler.experimental.Profile('logs'):
    train_model.fit(fetched_train_data, steps_per_epoch=1024, epochs=1)
Is there a way to apply the dataset.repeat operation on the GPU?
I'm using TensorFlow 2.5.2 with Python 3.6.9.
Detailed answer
NVIDIA has a package named NVIDIA DALI. This package offers an efficient wrapper around TensorFlow's dataset (and more, but that is the relevant feature I used here). I had to install 2 packages - nvidia-dali-cuda110 and nvidia-dali-tf-plugin-cuda110 (a detailed installation guide can be found here).
The class I used is called DALIDataset; to instantiate it properly, I first had to initialize a pipeline object.
single iteration when using properly initialized DALIDataset
Code snippet:
import tensorflow as tf
from nvidia.dali.plugin.tf import DALIDataset
from nvidia.dali import pipeline_def, fn

def prep_dataset_dali(dir1, dir2, batch_size):
    @pipeline_def(batch_size=batch_size, num_threads=3, device_id=0)
    def pipe(path1, path2):
        data1 = fn.readers.numpy(device='cpu', file_root=path1, file_filter='*.npy')
        data2 = fn.readers.numpy(device='cpu', file_root=path2, file_filter='*.npy')
        return data1.gpu(), data2.gpu()

    my_pipe = pipe(dir1, dir2)
    my_pipe.build()
    return DALIDataset(my_pipe,
                       output_dtypes=(tf.float32, tf.float32),
                       output_shapes=((None, 56, 56, 64), (None, 56, 56, 64)))
Note:
An external source pipeline doesn't work with DALIDataset, but it might work with the DALIDatasetWithInputs class from the experimental section.
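For completeness, the resulting dataset can then be passed straight to fit, roughly like this (a sketch; the directory names and step counts are placeholders):
dali_train_data = prep_dataset_dali('data/inputs', 'data/outputs', batch_size=32)
train_model.fit(dali_train_data, steps_per_epoch=32, epochs=312)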
Related
I am using TensorFlow to build a CNN model that consists of 4 layers.
Each layer contains one Conv2D, one batch normalization, and one activation, in that order.
In detail,
def __call__(self, hidden):
    hidden = self._layers(hidden)
    hidden = self._norm(hidden)
    hidden = self._act(hidden)
    return hidden
The Conv2D layer is built by
tf.keras.layers.Conv2D(depth, kernel, stride, pad, **kwargs).
For example, in the first layer, depth = 64, kernel = [4, 4], stride = 2, pad = 'valid'.
The depth doubles at each subsequent layer, while the other arguments stay the same.
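In other words, the four Conv2D layers are built roughly like this (a sketch; only the first layer's arguments are stated above, the rest follow the doubling rule):
# depth = 64 in the first layer and doubles afterwards; kernel, stride, and pad stay the same
convs = [tf.keras.layers.Conv2D(64 * 2**i, [4, 4], 2, 'valid') for i in range(4)]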
The batch normalization is implemented by
self.scale = self.add_weight(
    'scale', input_shape[-1], tf.float32, 'Ones')
self.offset = self.add_weight(
    'offset', input_shape[-1], tf.float32, 'Zeros')
mean, var = tf.nn.moments(x, 0, keepdims=True)
tf.nn.batch_normalization(x, mean, var, self.offset, self.scale, 1e-3)
where x is the input, which is also the output of the previous Conv2D layer.
The activation is tf.nn.elu.
I checked GPU memory at each step.
Before the Conv2D layer, the GPU memory used is 4 GB. The input data has shape (6490, 60, 80, 4), type float16.
After the Conv2D layer and before BN, memory is 6 GB. Features have shape (6490, 29, 39, 64), type float16.
After BN and before the activation, memory is 10 GB. Features have shape (6490, 29, 39, 64), type float16.
After the activation, memory is 18 GB. The output has shape (6490, 29, 39, 64), type float16.
TensorFlow memory growth is enabled via tf.config.experimental.set_memory_growth.
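For reference, memory growth is enabled like this (a minimal sketch, assuming a single GPU):
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)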
When the 4 layers of the CNN model complete and the model forwards to the next stage, the memory used stays at 18 GB, so an out-of-memory error occurs in that next stage.
I am not sure whether there is a bug or this is simply due to excessive input data. I really wonder why so much memory is used when passing through these layers during training, especially the BN and activation layers.
Any help will be appreciated!
System Info
OS : Ubuntu 20.04
Python : 3.8.10
Tensorflow : 2.11.0
I'm using Google's cloud TPUs (v2.8) to train a Tensorflow/keras model with a custom keras layer, which I call a pairwise conv2d. Tensorflow/keras code is below. The model compiles fine, but XLA compilation hangs indefinitely. If I scale down or remove pairwise conv2d, everything works normally.
Pairwise conv2d extracts all possible pairs of rows from an "image" and runs conv2d (1 filter) on it using a kernel size of (2,x), where x right now is 6. The current "image" size is (493x28) with one channel. Pairs of rows from the "image" are extracted, followed by applying conv2d. So conv2d is operating on a tensor with shape (batch_size, 2, 28, 1). All possible pairs of rows is 493*492/2 = 121278 separate conv2d calls. The output from each conv2d call is then stacked to generate the output.
So yep, that's a lot of conv2d calls, and definitely the source of the problem. If I reduce the number of conv2d calls down to 100, XLA compilation proceeds normally.
The "image" here is not an image -- it's a matrix of binding probabilities for transcription factors binding to DNA sites at different positions. So the rows here are different transcription factors (493) and the columns are different DNA sites (28 positions, maxpooled). We expect that adjacent/nearby transcription factors could interact with one another and so taking all possible pairs of rows is the same as considering all possible pairs of transcription factors.
Are there smart way of debugging XLA compilation? I can dump the generated files using XLA_FLAGS="--xla_dump_to=/tmp/generated" TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 train_model.py
but that doesn't really help me.
Are there better ways of accomplishing the pairwise conv2d that don't split the conv2d into 121278 calls? The tensor size is only 15.5 MB (per batch). I tried lowering the batch size to 32, but I don't think that affects XLA compilation. I don't think this is a memory issue, as model training hasn't even begun yet.
Any help would be appreciated! Thanks in advance.
EDIT #1. tf.map_fn is not supported by XLA on TPUs. The code below was edited to replace the map_fn call with a for loop + tf.stack. A few initial observations: [1] The for loop is unrolled by XLA, but there is a limit of 50000 loops. [2] The layer call() is called several times during model compilation. [3] XLA compilation triggers a Segfault (likely out of memory) when running PairwiseConv2D on the 121278 slices of the image (3 separate PairwiseConv2D layers). This was reduced to a single PairwiseConv2D layer (50000 slices of the image), but it still triggered a SegFault. Now running at 10000 slices of the image and memory usage on TPU v2.8 (64 GiB) is flat at around 60%.
import tensorflow as tf
from tensorflow.keras import layers


class PairwiseConv2D(layers.Layer):
    """Layer that carries out Conv2D on specified pairs of rows (axis=1) within an input tensor using a specified kernel"""

    def __init__(self, indices, kernel_size, dtype=None, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        self.indices = indices              # tf.convert_to_tensor(itertools.combinations(range(493), 2), dtype=tf.int32)
        self.numFilters = indices.shape[0]  # 493*492/2
        self.kernel_size = kernel_size      # (2, 6)

    def build(self, input_shape=None):
        self.filter_weights = self.add_weight(
            "weights",
            shape=[self.numFilters, self.kernel_size[0], self.kernel_size[1], 1, 1],
            initializer="zeros",
            dtype=self.dtype)

    @tf.function
    def call(self, inputs):
        ylist = []
        for n in range(self.numFilters):
            print('iteration #%s/%s' % (n, self.numFilters))
            # Stack the two selected rows into a (batch, 2, width, 1) tensor and convolve it
            y = tf.nn.conv2d(tf.stack([inputs[:, tf.gather(self.indices, n)[0], :, :],
                                       inputs[:, tf.gather(self.indices, n)[1], :, :]], axis=1),
                             tf.reshape(self.filter_weights[n, :, :, :, :],
                                        [self.kernel_size[0], self.kernel_size[1], 1, 1]),
                             strides=1,
                             padding='SAME')
            # ReLU activation
            y = tf.nn.relu(y)
            ylist.append(y)
        x = tf.stack(ylist, axis=1)
        return x

    def get_config(self):
        config = super().get_config()
        config.update({
            "indices": list(self.indices.numpy()),
            "kernel_size": self.kernel_size
        })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)
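A possible way to avoid the 121278 separate conv2d calls that I am considering is to gather all row pairs at once and express the per-pair convolutions as a single extract_patches + einsum. This is a rough, unverified sketch: it assumes tf.image.extract_patches uses the same 'SAME' padding as tf.nn.conv2d, and it has not been checked numerically against the loop above:
def call_vectorized(self, inputs):
    # inputs: (batch, 493, 28, 1); self.indices: (num_pairs, 2)
    pairs = tf.gather(inputs, self.indices, axis=1)        # (batch, num_pairs, 2, 28, 1)
    batch = tf.shape(inputs)[0]
    width = inputs.shape[2]
    kh, kw = self.kernel_size
    flat = tf.reshape(pairs, [batch * self.numFilters, 2, width, 1])
    # Each patch holds the receptive field of one output position of the (kh, kw) kernel
    patches = tf.image.extract_patches(flat,
                                       sizes=[1, kh, kw, 1],
                                       strides=[1, 1, 1, 1],
                                       rates=[1, 1, 1, 1],
                                       padding='SAME')     # (batch*num_pairs, 2, 28, kh*kw)
    patches = tf.reshape(patches, [batch, self.numFilters, 2, width, kh * kw])
    weights = tf.reshape(self.filter_weights, [self.numFilters, kh * kw])
    # One dot product per pair and output position, equivalent to one conv2d per pair
    y = tf.einsum('bphwk,pk->bphw', patches, weights)
    return tf.nn.relu(y)[..., tf.newaxis]                  # (batch, num_pairs, 2, 28, 1)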
I have inputs (shape: (None, 416, 416, 3)) and targets (shape: (None, 13, 13, 5, 19)) that I'm trying to zip together into a data pipeline. Both inputs and targets are placeholders so I have been feeding in the numpy arrays consisting of my data in the sess.run(iterator.initializer, feed_dict = {inputs: _inputs, targets: _targets}) function. I am using an initializable iterator and my dataset consists of ~10000 entries. However, when I run my code, it stops prematurely on the sess.run method below with zero stack trace. Running %ERRORLEVEL% gives me -1073741819. What does this error level mean?
I have tried slicing my data into smaller numpy arrays and calling sess.run every time I exhaust an array. This approach works, but has significant hang time every time sess.run is called. This approach makes every 10 epochs take roughly 5 minutes longer than just ditching tf.data and using feed dict exclusively. I've also looked into system ram while the program is running and it never exceeds 40%. In addition, I tried setting the parameter shuf_buf_size to 100 (even smaller than the slices above).
Code I used to construct the pipeline
def dataset(inputs, targets, shuf_buf_size, batch_size, prefetch_buf_size):
    with tf.device('/cpu:0'):
        in_dataset = tf.data.Dataset.from_tensor_slices(inputs)
        tar_dataset = tf.data.Dataset.from_tensor_slices(targets)
        zipped = tf.data.Dataset.zip((in_dataset, tar_dataset))
        return zipped.shuffle(shuf_buf_size).batch(batch_size).prefetch(prefetch_buf_size)
Code in session
# training
with tf.Session() as sess:
    if os.path.isfile(ckpt_path + '.index'):
        saver = blocks['saver']
        saver.restore(sess, ckpt_path)
    else:
        initializer.run()
        print('Initialized all variables')
    ...
    sess.run(iterator.initializer,
             feed_dict={inputs: _inputs,
                        targets: _targets})
    ...
Memory shouldn't be a problem, since I was able to load the numpy arrays without issue. However, I know the problem is related to memory, because slicing my numpy arrays before feeding them in worked.
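For context, the chunked workaround mentioned above looks roughly like this (a sketch; chunk_size and the train_op name are placeholders):
chunk_size = 1000  # placeholder value
for start in range(0, len(_inputs), chunk_size):
    sess.run(iterator.initializer,
             feed_dict={inputs: _inputs[start:start + chunk_size],
                        targets: _targets[start:start + chunk_size]})
    while True:
        try:
            sess.run(train_op)
        except tf.errors.OutOfRangeError:
            break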
I am doing transfer learning using a pre-trained inception-resnet-v2 model. From one of the conv layers I am extracting the best activation (best quality) to calculate the predicted landmarks using OpenCV and numpy operations. The loss function I am applying is the mean_squared_error loss. Unfortunately, when I execute this function, I get an error message that no gradients are available for any of the variables. I have been struggling with this problem for two weeks and I don't know how to proceed. While debugging, I could see that the problem occurred when the apply_gradients function got executed internally. I have searched and used some solutions from here, like these:
ValueError: No gradients provided for any variable in Tensorflow
selecting trainable variables to compute gradient "No variables to optimize"
Tensorflow: How to replace or modify gradient?
...
In addition, I have tried to write my own operation with gradient support, using this awesome tutorial: https://code-examples.net/en/q/253d718, because this solution wraps my Python and OpenCV code in TensorFlow. Unfortunately, the issue still remains. Tracing the path from the output of the network to the mean_squared_error function using TensorBoard, I could see that the path is available and continuous, too.
# Extracts the best predicted images from a specific activation layer
# PYTHON function: get_best_images(...) -> uses numpy and opencv
# PYTHON function: extract_landmarks(...) -> uses numpy

# Endpoints is the conv layer that gets extracted
best_predicted = tf.py_func(get_best_images,
                            [input, end_points['Conv2d_1a_3x3']],
                            tf.uint8)  # Gets best activation
best_predicted.set_shape(input.shape)

# Gets the predicted landmarks and processes both target and predicted for further calculation
proc_landmarks = tf.py_func(get_landmarks,
                            [best_predicted, target_landmarks],
                            [tf.int32, tf.int32])
proc_landmarks[0].set_shape(target_landmarks.shape)  # target landmarks
proc_landmarks[1].set_shape(target_landmarks.shape)  # predicted landmarks

# --> HERE COMES THE COMPUTATION TO PROCESS THE TARGET AND PREDICTED LANDMARKS

# Flattens and reshapes the tensors to (68, 1)
target_flatten = tf.reshape(target_result[0], [-1])
target_flatten = tf.reshape(target_flatten, [68, 1])
predicted_flatten = tf.reshape(predicted_result[1], [-1])
predicted_flatten = tf.reshape(predicted_flatten, [68, 1])
edit_target_landmarks = tf.cast(target_flatten, dtype=tf.float32)
edit_predicted_landmarks = tf.cast(predicted_flatten, dtype=tf.float32)

# Calculating the loss
mse_loss = tf.losses.mean_squared_error(labels=edit_target_landmarks,
                                        predictions=edit_predicted_landmarks)

optimizer = tf.train.AdamOptimizer(learning_rate=0.001,
                                   name='ADAM_OPT').minimize(mse_loss)  # <-- here the error occurs
The error message is this one (for brevity, only some variables are listed):
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients, between variables ["<tf.Variable 'InceptionResnetV2/Conv2d_1a_3x3/weights:0' shape=(3, 3, 3, 32) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/beta:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2a_3x3/weights:0' shape=(3, 3, 32, 32) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/beta:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2b_3x3/weights:0' shape=(3, 3, 32, 64) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2b_3x3/BatchNorm/beta:0' shape=(64,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_3b_1x1/weights:0' shape=(1, 1, 64, 80) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_3b_1x1/BatchNorm/beta:0' shape=(80,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_4a_3x3/weights:0' shape=(3, 3, 80, 192) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_4a_3x3/BatchNorm/beta:0' shape=(192,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Mixed_5b/Branch_0/Conv2d_1x1/weights:0' shape=(1, 1, 192, 96) dtype=float32_ref>", "
EDIT:
I have managed to compute the gradients for the first two variables of the train list using this guide: Override Tensorflow Backward-Propagation. It turned out I had forgotten the third parameter (referred to as the d parameter in the guide) in the forward and backward propagation functions, which in my case is the conv layer output of the net. Nevertheless, only the first two gradients get computed and all the others are missing. Do I have to compute and return a gradient in the backpropagation function for every trainable variable? If I understand correctly, in the backpropagation function we compute the derivatives with respect to the op's inputs, which in my case are 2 variables (target and predicted) and one for the conv layer output (i.e. return grad * op.inputs[0], grad * op.inputs[1], grad * op.inputs[2]). I thought that the overall computation for all trainable variables happens after defining the custom gradient computation, when applying the opt.compute_gradients function with the variable list as the second parameter. Am I right or wrong?
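For reference, the wrapper pattern from that guide looks roughly like this (a sketch; py_func_with_grad and _custom_grad are illustrative names, and the gradient function returns one tensor per input of the op, not one per trainable variable of the network):
import numpy as np
import tensorflow as tf

def py_func_with_grad(func, inp, Tout, grad_func, name=None):
    # Register a uniquely named gradient and override the PyFunc op's gradient with it
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 10**8))
    tf.RegisterGradient(rnd_name)(grad_func)
    g = tf.get_default_graph()
    with g.gradient_override_map({'PyFunc': rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=True, name=name)

def _custom_grad(op, grad):
    # One return value per op input: the target landmarks, the predicted landmarks,
    # and the conv layer output
    return grad * op.inputs[0], grad * op.inputs[1], grad * op.inputs[2]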
I have posted the part of the TensorBoard output for the mean_squared_error op. The image shows the additional loss function which I had left out to simplify my problem. This loss function works well. The arrow from the mean_squared_error function to the gradient computation is missing, because of the issue. I hope this gives a better overview.
I have a TensorFlow CNN model that is performing well and we would like to implement this model in hardware; i.e., an FPGA. It's a relatively small network but it would be ideal if it were smaller. With that goal, I've examined the kernels and find that there are some where the weights are quite strong and there are others that aren't doing much at all (the kernel values are all close to zero). This occurs specifically in layer 2, corresponding to the tf.Variable() named, "W_conv2". W_conv2 has shape [3, 3, 32, 32]. I would like to freeze/lock the values of W_conv2[:, :, 29, 13] and set them to zero so that the rest of the network can be trained to compensate. Setting the values of this kernel to zero effectively removes/prunes the kernel from the hardware implementation thus achieving the goal stated above.
I have found similar questions with suggestions that generally revolve around one of two approaches:
Suggestion #1:
tf.Variable(some_initial_value, trainable = False)
Implementing this suggestion freezes the entire variable. I want to freeze just a slice, specifically W_conv2[:, :, 29, 13].
Suggestion #2:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list)
Again, implementing this suggestion does not allow the use of slices. For instance, if I try the inverse of my stated goal (optimize only a single kernel of a single variable) as follows:
Optimizer = tf.train.RMSPropOptimizer(0.001).minimize(loss, var_list = W_conv2[:, :, 0, 0])
I get the following error:
NotImplementedError: ('Trying to optimize unsupported type ', <tf.Tensor 'strided_slice_2228:0' shape=(3, 3) dtype=float32>)
Slicing tf.Variables() isn't possible in the way that I've tried it here. The only thing that I've tried which comes close to doing what I want is using .assign() but this is extremely inefficient, cumbersome, and caveman-like as I've implemented it as follows (after the model is trained):
for _ in range(10000):
    # get a new batch of data
    # reset the values of W_conv2[:, :, 29, 13] = 0 each time through
    for m in range(3):
        for n in range(3):
            assign_op = W_conv2[m, n, 29, 13].assign(0)
            sess.run(assign_op)
    # re-train the rest of the network
    _, loss_val = sess.run([optimizer, loss], feed_dict={
        dict_stuff_here
    })
    print(loss_val)
The model was started in Keras and then moved to TensorFlow, since Keras didn't seem to have a mechanism to achieve the desired result. I'm starting to think that TensorFlow doesn't allow for pruning, but I find this hard to believe; it just needs the correct implementation.
A possible approach is to initialize these specific weights with zeros, and modify the minimization process such that gradients won't be applied to them. It can be done by replacing the call to minimize() with something like:
W_conv2_weights = np.ones((3, 3, 32, 32))
W_conv2_weights[:, :, 29, 13] = 0
W_conv2_weights_const = tf.constant(W_conv2_weights, dtype=tf.float32)

optimizer = tf.train.RMSPropOptimizer(0.001)

# tf.gradients returns a list, so take its single element for W_conv2
W_conv2_orig_grads = tf.gradients(loss, [W_conv2])[0]
W_conv2_grads = tf.multiply(W_conv2_weights_const, W_conv2_orig_grads)
W_conv2_train_op = optimizer.apply_gradients(zip([W_conv2_grads], [W_conv2]))

rest_grads = tf.gradients(loss, rest_of_vars)
rest_train_op = optimizer.apply_gradients(zip(rest_grads, rest_of_vars))

train_op = tf.group(rest_train_op, W_conv2_train_op)
That is:
Prepare a constant tensor for canceling the appropriate gradients.
Compute gradients only for W_conv2, multiply them element-wise with the constant W_conv2_weights to zero out the appropriate gradients, and only then apply them.
Compute and apply gradients "normally" for the rest of the variables.
Group the 2 train ops into a single training op.
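If the variable already holds trained values (as in the question), the kernel's current values can also be zeroed once up front; the masked gradients then keep it at zero (a sketch, assuming the TF1 session setup from the question):
zero_kernel_op = W_conv2[:, :, 29, 13].assign(tf.zeros([3, 3]))
sess.run(zero_kernel_op)  # run once before training; the masked gradient keeps this slice at zero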