Memory error while creating large one-hot encoding for LSTM - numpy

I am trying to build a character-level LSTM model using Keras, and for that I need to create a one-hot encoding for the characters to feed into the model. I have around 1,000 characters in each line and around 160,000 lines.
I tried to create a numpy array of zeros and set the corresponding entries to 1, but I am getting a memory error due to the large size of the matrix. Is there any other way to do this?

Sure:
Create batches. Only process, say, 10,000 entries (characters) at a time, computing them and feeding them into your neural network just before they're needed (for example, by using a generator instead of a list). Keras has a fit_generator training function for exactly this; a minimal sketch follows.
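For example, a minimal generator sketch (here `lines`, `targets`, `char_to_idx`, and `vocab_size` are placeholders for your own data, not names from the question):

import numpy as np

def one_hot_batches(lines, targets, char_to_idx, vocab_size, batch_size=128):
    # Loop forever, as Keras expects from a generator; one-hot encode only
    # batch_size lines at a time instead of the whole corpus
    while True:
        for start in range(0, len(lines), batch_size):
            batch = lines[start:start + batch_size]
            max_len = max(len(line) for line in batch)
            X = np.zeros((len(batch), max_len, vocab_size), dtype=np.float32)
            for i, line in enumerate(batch):
                for t, ch in enumerate(line):
                    X[i, t, char_to_idx[ch]] = 1.0
            yield X, targets[start:start + batch_size]

# model.fit_generator(one_hot_batches(lines, targets, char_to_idx, vocab_size),
#                     steps_per_epoch=len(lines) // 128)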
Group chunks of data together. Instead of a line being a matrix of the one-hot encodings of its characters, take the elementwise sum/max over those rows to produce a single vector for the line. Now each line is only a single vector, with dimensionality equal to the number of unique characters in your data set. E.g., instead of [[0, 0, 1], [0, 1, 0], [0, 0, 1]], use [0, 1, 1] to represent the entire line.
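In numpy this collapse is a one-liner (illustrative shapes only):

import numpy as np

line = np.array([[0, 0, 1],
                 [0, 1, 0],
                 [0, 0, 1]])  # (n_chars, vocab): one row per character
line_vec = line.max(axis=0)   # array([0, 1, 1]): one multi-hot vector for the line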

Perhaps an easier and more intuitive solution is to add a custom one-hot encoding layer to your Keras model architecture:
def build_model(self, batch_size, print_summary=False):
    X = Input(shape=(self.sequence_length,), batch_size=batch_size)
    embedding = OneHotEncoding(num_classes=self.vocab_size + 1,
                               sequence_length=self.sequence_length)(X)
    encoder = Bidirectional(CuDNNLSTM(units=self.recurrent_units,
                                      return_sequences=True))(embedding)
    ...
where we can define the OneHotEncoding layer as follows:
from tensorflow.keras.layers import Lambda
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer  # for creating custom layers

class OneHotEncoding(Layer):
    def __init__(self, num_classes=None, sequence_length=None):
        if num_classes is None or sequence_length is None:
            raise ValueError("Can't leave params #num_classes or #sequence_length empty")
        super(OneHotEncoding, self).__init__()
        self.num_classes = num_classes
        self.sequence_length = sequence_length

    def encode(self, inputs):
        return K.one_hot(indices=inputs,
                         num_classes=self.num_classes)

    def call(self, inputs):
        return Lambda(function=self.encode,
                      input_shape=(self.sequence_length,))(inputs)
Here we are utilizing the fact that the Keras model is fed the training samples in appropriately sized batches (with the standard fit function), so only one batch is one-hot encoded at a time and the full encoded matrix is never materialized, which in turn avoids the MemoryError.
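As a quick standalone check of what the layer computes per batch (shapes are illustrative, not from the question):

import tensorflow as tf
from tensorflow.keras import backend as K

ids = tf.random.uniform((4, 10), maxval=100, dtype=tf.int32)  # (batch, sequence_length) integer ids
onehot = K.one_hot(ids, num_classes=100)                      # (4, 10, 100), built one batch at a time
print(onehot.shape)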

Related

Tensorflow TPU XLA hangs indefinitely on custom keras layer (121278 conv2d calls on slices of a 15.5 MB tensor, "pairwise conv2d"). How to debug/fix?

I'm using Google's cloud TPUs (v2.8) to train a Tensorflow/keras model with a custom keras layer, which I call a pairwise conv2d. Tensorflow/keras code is below. The model compiles fine, but XLA compilation hangs indefinitely. If I scale down or remove pairwise conv2d, everything works normally.
Pairwise conv2d extracts all possible pairs of rows from an "image" and runs conv2d (1 filter) on each pair using a kernel size of (2, x), where x is currently 6. The current "image" size is (493x28) with one channel, so conv2d operates on a tensor of shape (batch_size, 2, 28, 1). All possible pairs of rows gives 493*492/2 = 121278 separate conv2d calls. The outputs of the conv2d calls are then stacked to produce the layer's output.
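For reference, the pair extraction described above can be expressed as a single tf.gather; this is an illustrative sketch using the sizes from the question, not the author's exact code:

import itertools
import tensorflow as tf

pairs = tf.constant(list(itertools.combinations(range(493), 2)), dtype=tf.int32)  # (121278, 2)
x = tf.random.normal([32, 493, 28, 1])   # (batch, rows, sites, channels)
pair_rows = tf.gather(x, pairs, axis=1)  # (32, 121278, 2, 28, 1): every row pair in one op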
So yep, that's a lot of conv2d calls, and definitely the source of the problem. If I reduce the number of conv2d calls down to 100, XLA compilation proceeds normally.
The "image" here is not an image -- it's a matrix of binding probabilities for transcription factors binding to DNA sites at different positions. So the rows here are different transcription factors (493) and the columns are different DNA sites (28 positions, maxpooled). We expect that adjacent/nearby transcription factors could interact with one another and so taking all possible pairs of rows is the same as considering all possible pairs of transcription factors.
Is there a smart way of debugging XLA compilation? I can dump the generated files using
XLA_FLAGS="--xla_dump_to=/tmp/generated" TF_XLA_FLAGS="--tf_xla_auto_jit=2" python3 train_model.py
but that doesn't really help me.
Are there better ways of accomplishing the pairwise conv2d that don't split it into 121278 separate calls? The tensor size is only 15.5 MB (per batch). I tried lowering the batch size to 32, but I don't think that affects XLA compilation. I don't think this is a memory issue, as model training hasn't even begun yet.
Any help would be appreciated! Thanks in advance.
EDIT #1. tf.map_fn is not supported by XLA on TPUs, so the code below was edited to replace the map_fn call with a for loop + tf.stack. A few initial observations: [1] The for loop is unrolled by XLA, but there is a limit of 50000 iterations. [2] The layer's call() is invoked several times during model compilation. [3] XLA compilation triggers a segfault (likely out of memory) when running PairwiseConv2D on the 121278 slices of the image (3 separate PairwiseConv2D layers). This was reduced to a single PairwiseConv2D layer (50000 slices of the image), but it still triggered a segfault. Now running at 10000 slices of the image, and memory usage on the TPU v2.8 (64 GiB) is flat at around 60%.
class PairwiseConv2D(layers.Layer):
    """Layer that carries out Conv2D on specified pairs of rows (axis=1) within an input tensor using a specified kernel"""

    def __init__(self, indices, kernel_size, dtype=None, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        self.indices = indices              # tf.convert_to_tensor(list(itertools.combinations(range(493), 2)), dtype=tf.int32)
        self.numFilters = indices.shape[0]  # 493*492/2
        self.kernel_size = kernel_size      # (2, 6)

    def build(self, input_shape=None):
        self.filter_weights = self.add_weight(
            "weights",
            shape=[self.numFilters, self.kernel_size[0], self.kernel_size[1], 1, 1],
            initializer="zeros",
            dtype=self.dtype)

    @tf.function
    def call(self, inputs):
        ylist = []
        for n in range(self.numFilters):
            print('iteration #%s/%s' % (n, self.numFilters))
            # Stack the two rows of the n-th pair, then convolve with the n-th kernel
            y = tf.nn.conv2d(
                tf.stack([inputs[:, tf.gather(self.indices, n)[0], :, :],
                          inputs[:, tf.gather(self.indices, n)[1], :, :]], axis=1),
                tf.reshape(self.filter_weights[n, :, :, :, :],
                           [self.kernel_size[0], self.kernel_size[1], 1, 1]),
                strides=1,
                padding='SAME')
            # ReLU activation
            y = tf.nn.relu(y)
            ylist.append(y)
        x = tf.stack(ylist, axis=1)
        return x

    def get_config(self):
        config = super().get_config()
        config.update({
            "indices": list(self.indices.numpy()),
            "kernel_size": self.kernel_size
        })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)

tensor slicing in tensorflow

I want to do the same numpy operation as follows to make a custom layer:
img = cv2.imread('img.jpg')  # img.shape => (600, 600, 3)
mask = np.random.randint(0, 2, size=img.shape[:2], dtype='bool')
img2 = np.expand_dims(img, axis=0)  # img2.shape => (1, 600, 600, 3)
img2[:, mask, :].shape  # => (1, 204030, 3)
This is my first attempt, but I failed; I can't do the same operation on tensorflow tensors:
class Sampling_layer(keras.layers.Layer):
    def __init__(self, sampling_matrix):
        super(Sampling_layer, self).__init__()
        self.sampling_matrix = sampling_matrix

    def call(self, input_img):
        return input_img[:, self.sampling_matrix, :]
More explanation:
I want to define a Keras layer so that, given a batch of images, it uses a sampling matrix to give me a batch of sampled vectors for the images. The sampling matrix is a random boolean matrix the same size as the image. The slicing operation I used is straightforward for numpy arrays and works perfectly, but I can't get it done with tensors in tensorflow. I tried to use loops to perform the operation manually, but I failed.
You can do the following.
import numpy as np
import tensorflow as tf

# Batch of images
img = np.random.normal(size=[2, 600, 600, 3])  # img.shape => (2, 600, 600, 3)
# You'll need to match the first 3 dimensions of the mask with the img;
# for that we'll repeat the first axis twice
mask = np.random.randint(0, 2, size=img.shape[1:3], dtype='bool')
mask = np.repeat(np.expand_dims(mask, axis=0), 2, axis=0)

# Defining input layers
inp1 = tf.keras.layers.Input(shape=(600, 600, 3))
mask_inp = tf.keras.layers.Input(shape=(600, 600), dtype='bool')

# The layer you're looking for
out = tf.keras.layers.Lambda(lambda x: tf.boolean_mask(x[0], x[1]))([inp1, mask_inp])
model = tf.keras.models.Model([inp1, mask_inp], out)

# Predict on sample data
toy_out = model.predict([img, mask])
Note that both your images and mask need to have the same batch size. I couldn't find a way to make this work without repeating the mask along the batch axis to match the batch size of the images. This is the only solution that came to my mind (assuming that your mask changes for every batch of data).
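For intuition on why the output loses its spatial shape, here is a standalone snippet (shapes illustrative): tf.boolean_mask flattens every masked dimension into one, which is why a batch of images and a matching mask produce a (num_true, 3) result.

import numpy as np
import tensorflow as tf

x = tf.random.normal([2, 600, 600, 3])
m = tf.constant(np.random.randint(0, 2, size=(2, 600, 600)).astype(bool))
out = tf.boolean_mask(x, m)  # shape (num_true, 3): the masked dims collapse into one
print(out.shape)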

What would be the output from a tensorflow dense layer if we assign its output back as its input while making a neural network?

I have been going through the implementation of the neural network in OpenAI's code for Vanilla Policy Gradient (as a matter of fact, this part is used nearly everywhere). The code looks something like this:
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes) + [act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)
    return pi, logp, logp_pi
and this multi-layer perceptron network is defined as follows:
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(inputs=x, units=h, activation=activation)
    return tf.layers.dense(inputs=x, units=hidden_sizes[-1], activation=output_activation)
My question is: what is the return from this mlp function? I mean the structure or shape. Is it an N-dimensional tensor? If so, how is it given as an input to tf.random.categorical? If not, and it just has the shape [hidden_layer2, output], then what happened to the other layers? As per the website description of tf.random.categorical, it only takes a 2-D input. The complete code of OpenAI's VPG algorithm can be found here. The mlp is implemented here. I would be highly grateful if someone would just tell me what this mlp_categorical_policy() is doing.
Note: The hidden size is [64, 64], the action dimension is 3
Thanks and cheers
Note that this is a discrete action space - there are action_space.n different possible actions at every step, and the agent chooses one.
To do this, the MLP returns the logits (which are a function of the probabilities) of the different actions. This is specified in the code by + [act_dim], which appends the count of the action_space as the final MLP layer. Note that the last layer of an MLP is the output layer; the input layer is not specified in tensorflow, it is inferred from the inputs.
tf.random.categorical takes the logits and samples a policy action pi from them, which is returned as a number.
mlp_categorical_policy also returns logp, the log probability of the action a (used to assign credit), and logp_pi, the log probability of the policy action pi.
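To see those pieces in isolation, a small sketch with toy logits (act_dim = 3, batch of 1; the values are made up):

import tensorflow as tf

logits = tf.constant([[1.0, 0.5, -1.0]])                   # (batch=1, act_dim=3)
logp_all = tf.nn.log_softmax(logits)                       # log probabilities of all actions
pi = tf.squeeze(tf.random.categorical(logits, 1), axis=1)  # sampled action id, shape (1,)
logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=3) * logp_all, axis=1)  # log prob of the sampled action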
It seems your question is more about the return from the mlp.
The mlp creates a series of fully connected layers in a loop. In each iteration of the loop, the mlp creates a new layer using the previous layer x as input and assigns its output to overwrite x, with this line: x = tf.layers.dense(inputs=x, units=h, activation=activation).
So the output is not the same as the input; on each iteration x is overwritten with the value of the new layer. This is the same kind of coding trick as x = x + 1, which increments x by 1. It effectively chains the layers together.
The output of tf.layers.dense is a tensor of size [:, h], where : is the batch dimension (and can usually be ignored). The creation of the last layer happens outside the loop; it can be seen that the number of nodes in this layer is act_dim (so its shape is [:, 3]). You can check the shape by doing this:
import tensorflow.compat.v1 as tf
import numpy as np

def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

obs = np.array([[1.0, 2.0]])
logits = mlp(obs, [64, 64, 3], tf.nn.relu, None)
print(logits.shape)
result: TensorShape([1, 3])
Note that the observation in this case is [1., 2.]; it is nested inside a batch of size 1.

Tensorflow: How to define a one-hot feature column for a canned estimator

My one-hot encoding appears to incorrectly have 3 dimensions during training (I think it should have 2), which causes an OOM. How am I constructing the one-hot feature column incorrectly?
I get this error when I begin to train the neural net:
OOM when allocating tensor with shape[114171,829,829]
[[Node: dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot =
  OneHot[T=DT_FLOAT, TI=DT_INT64, axis=-1, _device="/job:localhost/replica:0/task:0/gpu:0"](
    dnn/input_from_feature_columns/input_layer/air_store_id_indicator/SparseToDense/_149,
    dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot/depth,
    dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot/on_value,
    dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot/off_value)]]
I tried to define a one-hot feature column for use in my DNNRegressor as follows:
tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_identity(
        key='id', num_buckets=df_train['id'].unique().size))
In my input_fn to DNNRegressor::fit(), I populate the one-hot encoding like this:
labels, uniques = pd.factorize(df_train['id'])
returned_feature_columns[k] = tf.one_hot(labels, uniques.size, 1, 0)
When I print that one-hot encoding, its dimensions appear correct, because I have 114171 training examples, and 829 unique ids:
Tensor("one_hot:0", shape=(114171, 829), dtype=int32)
The defined tensor is consuming too much memory. There is a 2 GB limit for the tf.GraphDef protocol buffer. You should train your model with smaller batches. There is a nice higher-level Estimator API to build an input_fn for pandas dataframes:
input_fn = tf.estimator.inputs.pandas_input_fn(
    x=pd.DataFrame({'x': x_data}),
    num_epochs=num_epochs,
    shuffle=True)
For more details you can find the documentation here.
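A sketch of how the pieces could fit together with a DNNRegressor (the 'target' column name, batch size, and hidden_units are assumptions for illustration, not from the question):

feature_columns = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_identity(
            key='id', num_buckets=df_train['id'].unique().size))
]
input_fn = tf.estimator.inputs.pandas_input_fn(
    x=df_train[['id']], y=df_train['target'],
    batch_size=128, num_epochs=num_epochs, shuffle=True)
estimator = tf.estimator.DNNRegressor(hidden_units=[64, 32],
                                      feature_columns=feature_columns)
estimator.train(input_fn=input_fn)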

Oversampling images during inference

It is a common practice in convolutional neural networks to oversample a given image during inference, i.e., to create a batch from different transformations of the same image (most commonly different crops and mirroring), pass the entire batch through the network, and average (or apply some other reducing function) over the results to get a single prediction (caffe example).
How can this approach be implemented in tensorflow?
You can take a look at the TF cnn tutorial. In particular, the function distorted_inputs does the image preprocessing step.
In short, there are a couple of TF functions in the tf.image package that help with distorting the images. You can use either them or regular numpy functions to create an extra dimension for the output, over which you can average the results:
Before:
input_place = tf.placeholder(tf.float32, [None, 256, 256, 3])
prediction = some_model(input_place) # size: [None]
sess.run(prediction, feed_dict={input_place: batch_of_images})
After:
input_place = tf.placeholder(tf.float32, [None, NUM_OF_DISTORTIONS, 256, 256, 3])
prediction = some_model(input_place)  # make sure it is of size [None, NUM_DISTORTIONS]
new_prediction = tf.reduce_mean(prediction, axis=1)

new_batch = np.zeros((batch_size, NUM_OF_DISTORTIONS, 256, 256, 3))  # np.zeros takes a shape tuple
for i in range(len(batch_of_images)):
    for f in range(len(distortion_functions)):
        new_batch[i, f, :, :, :] = distortion_functions[f](batch_of_images[i])

sess.run(new_prediction, feed_dict={input_place: new_batch})
Take a look at TF's image-related functions. You could apply those transformations at test time to some input image and stack all of them together to make a batch.
I imagine you could also do this using OpenCV or some other image-processing tool. I don't see a need to do it in the computation graph. You could create the batches beforehand and pass them through in feed_dict.
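A minimal numpy sketch of that idea, using mirror-based distortions only (`some_image`, `sess`, `prediction`, and `input_place` refer to the earlier snippets and are assumed to exist):

import numpy as np

def tta_views(image):
    # image: (H, W, 3); build four mirrored views of the same image
    return np.stack([image,
                     image[:, ::-1, :],      # horizontal flip
                     image[::-1, :, :],      # vertical flip
                     image[::-1, ::-1, :]],  # both flips
                    axis=0)                  # (4, H, W, 3)

views = tta_views(some_image)
preds = sess.run(prediction, feed_dict={input_place: views})
final_prediction = preds.mean(axis=0)  # average over the distorted views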