I am using tensorflow to build a CNN model, which consists of 4 layers.
Each layer contain one Conv2D, one batch normalization and one activation in order.
For details,
def __call__(self, hidden):
hidden = self._layers(hidden)
hidden = self._norm(hidden)
hidden = self._act(hidden)
return hidden
the Conv2D layer is built by
tf.keras.layers.Conv2D(depth,kernel,stride,pad,**kwargs).
For example in the first layer, depth = 64, kernel = [4, 4], stride = 2, pad = 'valid'.
Depth times 2 at each layer afterwards while others keep same.
the batch normalization is implemented by
self.scale = self.add_weight(
'scale', input_shape[-1], tf.float32, 'Ones')
self.offset = self.add_weight(
'offset', input_shape[-1], tf.float32, 'Zeros')
mean, var = tf.nn.moments(x, 0, keepdims=True)
tf.nn.batch_normalization(x, mean, var, self.offset, self.scale, 1e-3)
where x is input that is also the output of previous Conv2d layer.
the activation is tf.nn.elu.
I checked GPU memory at each step.
Before Conv2D layer, used GPU memory is 4GB. The input data has shape (6490, 60, 80, 4), type is float16.
After Conv2D layer and before BN, memory is 6GB. Features has shape (6490, 29, 39, 64), type is float16.
After BN and before act, memory is 10GB. Features has shape (6490, 29, 39, 64), type is float16.
After act, memory is 18GB. Output has shape (6490, 29, 39, 64), type is float16.
Tensorflow has tf.config.experimental.set_memory_growth(True)
When completes 4 layers of CNN model and forwards to the next process, the memory used still keeps 18GB. So in the next process, out of memory occurs.
I am not sure whether there is a bug or this is just due to excessive input data. I really wonder why such a large memory space is used when passing through those layers in training model, especially for BN and act layer.
Any help will be appreciated!
System Info
OS : Ubuntu 20.04
Python : 3.8.10
Tensorflow : 2.11.0
Related
I'm using Google's cloud TPUs (v2.8) to train a Tensorflow/keras model with a custom keras layer, which I call a pairwise conv2d. Tensorflow/keras code is below. The model compiles fine, but XLA compilation hangs indefinitely. If I scale down or remove pairwise conv2d, everything works normally.
Pairwise conv2d extracts all possible pairs of rows from an "image" and runs conv2d (1 filter) on it using a kernel size of (2,x), where x right now is 6. The current "image" size is (493x28) with one channel. Pairs of rows from the "image" are extracted, followed by applying conv2d. So conv2d is operating on a tensor with shape (batch_size, 2, 28, 1). All possible pairs of rows is 493*492/2 = 121278 separate conv2d calls. The output from each conv2d call is then stacked to generate the output.
So yep, that's a lot of conv2d calls, and definitely the source of the problem. If I reduce the number of conv2d calls down to 100, XLA compilation proceeds normally.
The "image" here is not an image -- it's a matrix of binding probabilities for transcription factors binding to DNA sites at different positions. So the rows here are different transcription factors (493) and the columns are different DNA sites (28 positions, maxpooled). We expect that adjacent/nearby transcription factors could interact with one another and so taking all possible pairs of rows is the same as considering all possible pairs of transcription factors.
Are there smart way of debugging XLA compilation? I can dump the generated files using XLA_FLAGS="--xla_dump_to=/tmp/generated" TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 train_model.py
but that doesn't really help me.
Are there better ways of accomplishing the pairwise conv2d that doesn't split the conv2d into 121278 calls? The tensor size is only 15.5 MB (per batch). I tried lowering the batch size to 32, but I don't think affects XLA compilation. I don't think this is a memory issue as model training doesn't even begin yet.
Any help would be appreciated! Thanks in advance.
EDIT #1. tf.map_fn is not supported by XLA on TPUs. The code below was edited to replace the map_fn call with a for loop + tf.stack. A few initial observations: [1] The for loop is unrolled by XLA, but there is a limit of 50000 loops. [2] The layer call() is called several times during model compilation. [3] XLA compilation triggers a Segfault (likely out of memory) when running PairwiseConv2D on the 121278 slices of the image (3 separate PairwiseConv2D layers). This was reduced to a single PairwiseConv2D layer (50000 slices of the image), but it still triggered a SegFault. Now running at 10000 slices of the image and memory usage on TPU v2.8 (64 GiB) is flat at around 60%.
class PairwiseConv2D(layers.Layer):
"""Layer that carries out Conv2D on specified pairs of rows (axis=1) within an input tensor using a specified kernel"""
def __init__(self, indices, kernel_size, dtype=None, **kwargs):
super().__init__(dtype=dtype, **kwargs)
self.indices = indices #tf.convert_to_tensor(itertools.combinations(range(493),2),dtype=tf.int32)
self.numFilters = indices.shape[0] #493*492/2
self.kernel_size = kernel_size #(2,6)
def build(self, input_shape = None):
self.filter_weights = self.add_weight("weights", shape=[self.numFilters, self.kernel_size[0], self.kernel_size[1], 1, 1], initializer="zeros", dtype=self.dtype)
#tf.function
def call(self, inputs):
ylist = []
for n in range(self.numFilters):
print('iteration #%s/%s' % (n, self.numFilters) )
y = tf.nn.conv2d( tf.stack([inputs[:, tf.gather(self.indices,n)[0], :, : ], inputs[:, tf.gather(self.indices,n)[1], :, :]], axis=1),
tf.reshape(self.filter_weights[n, :,:,:,:], [self.kernel_size[0], self.kernel_size[1], 1, 1]),
strides=1,
padding='SAME')
# ReLu Activation
y = tf.nn.relu(y)
ylist.append( y )
x = tf.stack(ylist, axis=1)
return x
def get_config(self):
config = super().get_config()
config.update({
"indices" : list(self.indices.numpy()),
"kernel_size" : self.kernel_size
})
return config
#classmethod
def from_config(cls, config):
return cls(**config)
```
The new tf.keras.preprocessing.timeseries_dataset_from_array function is used to create sliding minibatch windows over the sequential data, for example for tasks involving rnn networks.
According to the docs it returns a minibatch of inputs and targets. However, the target minibatch this function returns does not have a sequence_length (timesteps) dimension. For example.
data = timeseries_dataset_from_array(
data=tokens,
targets=targets,
sequence_length=25,
batch_size=32,
)
for minbatch in data:
inputs, targets = minbatch
assert(inputs.shape[1] == targets.shape[1]) # error
The inputs have [32, 25, 1] shape in case you already just have word indices there and targets confusingly have [32, 1] shape.
So, my question is how am I supposed to map a tensor of inputs with a window of 25 units to a target tensor with a window of 0 units?
How I always train sequence models is by feeding the input tensor of [32, 25, 1] which is then projected into [32, 25, 100] and then you feed the target tensor to the network of size [32, 25, 1] to your loss function or if you have multi-class problem a target vector of [32, 25, num_of_classes].
That is why I am confused by the shape of the target tensor from timeseries_dataset_from_array and the intuition behind it.
I implemented Resnet34 model in federated images classification tutorial. After 10 rounds the training accuracy can be higher than 90%, however, the evaluation accuracy using the last round's state.model is always around 50%.
evaluation = tff.learning.build_federated_evaluation(model_fn)
federated_test_data = make_federated_data(emnist_test, sample_clients)
test_metrics = evaluation(state.model, federated_test_data)
str(test_metrics)
I am very confused what's possibly wrong with the evaluation part? Also, I printed the untrainable variables (mean and variance in BatchNorm) of the server's model, which are 0 and 1 with no updates/averaging after those rounds. Should they be like that or that could be the problem?
Thanks very much!
Updates:
The codes to prepare training data and printed results:
len(emnist_train.client_ids)
4
emnist_train.element_type_structure
OrderedDict([('label', TensorSpec(shape=(), dtype=tf.int64, name=None)),('pixels',TensorSpec(shape=(256, 256, 3), dtype=tf.float32, name=None))])
NUM_CLIENTS = 4
NUM_EPOCHS = 1
BATCH_SIZE = 30
SHUFFLE_BUFFER = 500
def preprocess(dataset):
def element_fn(element):
return collections.OrderedDict([
('x', element['pixels']),
('y', tf.reshape(element['label'], [1])),
])
return dataset.repeat(NUM_EPOCHS).map(element_fn).shuffle(
SHUFFLE_BUFFER).batch(BATCH_SIZE)
sample_clients = emnist_train.client_ids[0:NUM_CLIENTS]
federated_train_data = make_federated_data(emnist_train, sample_clients)
preprocessed_example_dataset = preprocess(example_dataset)
sample_batch = tf.nest.map_structure(
lambda x: x.numpy(), iter(preprocessed_example_dataset).next())
def make_federated_data(client_data, client_ids):
return [preprocess(client_data.create_tf_dataset_for_client(x))
for x in client_ids]
len(federated_train_data), federated_train_data[0]
(4,<BatchDataset shapes: OrderedDict([(x, (None, 256, 256, 3)), (y, (None, 1))]), types: OrderedDict([(x, tf.float32), (y, tf.int64)])>)
The training and evaluation codes:
def create_compiled_keras_model():
base_model = tf.keras.applications.resnet.ResNet50(include_top=False, weights='imagenet', input_shape=(256,256,3,))
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
prediction_layer = tf.keras.layers.Dense(2, activation='softmax')
model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])
model.compile(optimizer = tf.keras.optimizers.SGD(lr = 0.001, momentum=0.9), loss = tf.keras.losses.SparseCategoricalCrossentropy(), metrics = [tf.keras.metrics.SparseCategoricalAccuracy()])
return model
def model_fn():
keras_model = create_compiled_keras_model()
return tff.learning.from_compiled_keras_model(keras_model, sample_batch)
iterative_process = tff.learning.build_federated_averaging_process(model_fn)
state = iterative_process.initialize()
for round_num in range(2, 12):
state, metrics = iterative_process.next(state, federated_train_data)
print('round {:2d}, metrics={}'.format(round_num, metrics, state))
evaluation = tff.learning.build_federated_evaluation(model_fn)
federated_test_data = make_federated_data(emnist_test, sample_clients)
len(federated_test_data), federated_test_data[0]
(4,
<BatchDataset shapes: OrderedDict([(x, (None, 256, 256, 3)), (y, (None, 1))]), types: OrderedDict([(x, tf.float32), (y, tf.int64)])>)
test_metrics = evaluation(state.model, federated_test_data)
str(test_metrics)
The training and evaluations results after each round:
round 1, metrics=<sparse_categorical_accuracy=0.5089045763015747,loss=0.7813001871109009,keras_training_time_client_sum_sec=0.008826255798339844>
<sparse_categorical_accuracy=0.49949443340301514,loss=8.0671968460083,keras_training_time_client_sum_sec=0.0>
round 2, metrics=<sparse_categorical_accuracy=0.519825279712677,loss=0.7640910148620605,keras_training_time_client_sum_sec=0.011750459671020508>
<sparse_categorical_accuracy=0.49949443340301514,loss=8.0671968460083,keras_training_time_client_sum_sec=0.0>
round 3, metrics=<sparse_categorical_accuracy=0.5099126100540161,loss=0.7513422966003418,keras_training_time_client_sum_sec=0.0039823055267333984>
<sparse_categorical_accuracy=0.49949443340301514,loss=8.0671968460083,keras_training_time_client_sum_sec=0.0>
round 4, metrics=<sparse_categorical_accuracy=0.5278897881507874,loss=0.7905193567276001,keras_training_time_client_sum_sec=0.0010638236999511719>
<sparse_categorical_accuracy=0.49949443340301514,loss=8.0671968460083,keras_training_time_client_sum_sec=0.0>
round 5, metrics=<sparse_categorical_accuracy=0.5199933052062988,loss=0.7782396674156189,keras_training_time_client_sum_sec=0.012729644775390625>
<sparse_categorical_accuracy=0.49949443340301514,loss=8.0671968460083,keras_training_time_client_sum_sec=0.0>
There are a few nuances and a few open research problems in Federated Learning and this question has struck a couple of them.
Training loss looks much better than evaluation loss: when using Federated Averaging (the optimization algorithm used in the Federated Learning for Image Classification tutorial) one needs to be careful interpreting metrics as they have nuanced differences from centralized model training. Especially training loss, which is the average over many sequence steps or batches. This means after one round, each client may have fit the model to their local data very well (obtaining a high accuracy), but after averaging these updates into the global model the global model may still be far away from "good", resulting in a low test accuracy. Additionally, 10 rounds may be too few; one of the original academic papers on Federated Learning demonstrated at least 20 rounds until 99% accuracy (McMahan 2016) with IID data, and more than 100 rounds in with non-IID data.
BatchNorm in the federated setting: its an open research problem on how to combine the batchnorm parameters, particularly with non-IID client data. Should each new client start with fresh parameters, or receive the global model parameters? TFF may not be communicating them between the server and client (since it currently is implemented only to communicate trainable variables), and may be leading to unexpected behavior. It may we good to print the state parameters watch what happens each round to them.
I found that the initialization is the reason why ResNet has poor performance. It is possibly because that ttf uses relatively simple state initialization which doesn't consider some layers like batch norm, so when I assigned the normal Keras model initial weights to the server instead of using its default initialization, the federated results were much better.
I've heard that machine learning algorithms rarely get stuck in local minima, but my CNN (in tensorflow) is predicting a constant output for all values and I am using a mean square error loss function so I think this must be a local minima given the properties of MSE. I have a network with 2 convolution layers and 1 dense layer (+1 dense output layer for regression) with 24, 32 and 100 neurons respectively, but I've tried changing the numbers of layers/neurons and the issue is not solved. I have relu activations for the hidden layers and absolute value on the output layer (I know this is uncommon but it converges faster to a lower MSE than the softplus function which still has the same problem and I need strictly positive outputs). I also have a 50% dropout layer between the dense and output layers and a pooling layer between the 2 convolutions. I have also tried changing the learning rate (currently 0.0001) and batch size. I am using an Adam Optimizer.
I have seen it suggested to change/add bias but I'm not sure how to initialize it in tf.layers.conv2d/tf.layers.dense (for which I have bias=True), and I can't see any options for bias with tf.nn.conv2d which I used for my first layer so I could initialize the kernel easily.
Any suggestions would be really appreciated, thanks.
Here's the section of my code with the network:
filter_shape = [3,3,12,24]
def nn_model(input):
weights = tf.Variable(tf.truncated_normal(filter_shape, mean=10,
stddev=3), name='weights')
conv1 = tf.nn.conv2d(input, weights, [1,1,1,1], padding='SAME')
conv2 = tf.layers.conv2d(inputs=conv1, filters=32, kernel_size=[3,3],
padding="same", activation=tf.nn.relu)
pool = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2,
padding='same')
flat = tf.reshape(pool, [-1, 32*3*3])
dense_3 = tf.layers.dense(flat, neurons, activation = tf.nn.relu)
dropout_2 = tf.layers.dropout(dense_3, rate = rate)
prediction = tf.layers.dense(dropout_2, 1, activation=tf.nn.softplus)
return prediction
My inputs are 5x5 images with 12 channels of environmental data and I have ~100,000 training samples. My current MSE is ~90 on values of ~25.
I used to face the same problem with bigger images. I incresed the number of convolution layers to solve it. Maybe you should try to add even more convolution layers.
In my opinion, the problem comes from the fact you don't have enough parameters and thus get stuck in a local minimum. If you increase your number of parameters, it can help the updates to converge to a better minimum.
Also, I can't see the optimizer you are using. Is it Adam ? You can try to start with a bigger learning-rate and use a decay to decrease it epoch after epoch.
I'm trying to use batch normalization in a conv2d_transpose as follows:
h1 = tf.layers.conv2d_transpose(inputs, 64, 4, 2, padding='SAME',
kernel_initializer=tf.variance_scaling_initializer,
bias_initializer=tf.ones_initializer,
activity_regularizer=tf.layers.batch_normalization,
)
h2 = tf.layers.conv2d_transpose(h1, 3, 4, 2, padding='SAME',
kernel_initializer=tf.variance_scaling_initializer,
bias_initializer=tf.ones_initializer,
activity_regularizer=tf.layers.batch_normalization,
)
And I am receiving the following error:
ValueError: Dimension 1 in both shapes must be equal, but are 32 and 64
From merging shape 2 with other shapes. for 'tower0/AddN' (op: 'AddN') with input shapes: [?,32,32,64], [?,64,64,3].
I've seen that other people have had this error in Keras because of the difference in dimension ordering between TensorFlow and Theano. However, I'm using pure TensorFlow, all of my variables are in TensorFlow dimension format (batch_size, height, width, channels), and the data_format of the conv2d_transpose layer should be the default 'channels_last'. What am I missing here?
tf.layers.batch_normalization should be added as a layer, not a regularizer. activity_regularizer is a function that takes activity (layer's output) and produces an extra loss term that is added to the overall loss term of the whole network. For example, you might want to penalize networks that produce high activation. You can see how activity_regularizer is called on the outputs and its result added to the loss here.