TensorFlow: How to define a one-hot feature column for a canned estimator

My one-hot encoding appears to incorrectly have 3 dimensions during training (I think it should have 2), which causes an OOM. How am I constructing the one-hot feature column incorrectly?
I get this error when I begin to train the neural net:
OOM when allocating tensor with shape[114171,829,829]
[[Node:
dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot
= OneHot[T=DT_FLOAT, TI=DT_INT64, axis=-1, _device="/job:localhost/replica:0/task:0/gpu:0"](dnn/input_from_feature_columns/input_layer/air_store_id_indicator/SparseToDense/_149,
dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot/depth,
dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot/on_value,
dnn/input_from_feature_columns/input_layer/air_store_id_indicator/one_hot/off_value)]]
I tried to define a one-hot feature column for use in my DNNRegressor as follows:
tf.feature_column.indicator_column(
tf.feature_column.categorical_column_with_identity(key='id', num_buckets=df_train['id'].unique().size))
In my input_fn to DNNRegressor::fit(), I populate the one-hot encoding like this:
labels, uniques = pd.factorize(df_train['id'])
returned_feature_columns[k] = tf.one_hot(labels, uniques.size, 1, 0)
When I print that one-hot encoding, its dimensions appear correct, because I have 114171 training examples, and 829 unique ids:
Tensor("one_hot:0", shape=(114171, 829), dtype=int32)

The tensor you define consumes too much memory. There is a 2 GB limit for the tf.GraphDef protocol buffer. You should train your model with smaller batches. There is a nice higher-level Estimator API for building an input_fn from pandas DataFrames:
input_fn = tf.estimator.inputs.pandas_input_fn(
x=pd.DataFrame({'x':x_data}),
num_epochs=num_epochs,
shuffle=True)
For more details you can find documentation here.
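One possible reading of the 3-D shape in the error: indicator_column applies its own one-hot step to whatever the input_fn returns (the failing node is air_store_id_indicator/one_hot), so a tensor that is already one-hot encoded gets encoded a second time. The NumPy sketch below only imitates that behaviour with small, made-up sizes; `one_hot` is a hypothetical helper, not the TensorFlow implementation.

```python
import numpy as np

def one_hot(indices, depth):
    """Hypothetical stand-in for a one-hot op: appends an axis of size `depth`."""
    return np.eye(depth, dtype=np.float32)[indices]

# Made-up toy data: 6 examples, 4 unique ids (the question has 114171 and 829).
ids = np.array([0, 3, 1, 1, 2, 0])
depth = 4

# What the input_fn in the question returns: an already encoded tensor.
encoded_once = one_hot(ids, depth)
# What a second one-hot step does to that tensor: it adds another axis.
encoded_twice = one_hot(encoded_once.astype(np.int64), depth)

print(encoded_once.shape)   # (6, 4)
print(encoded_twice.shape)  # (6, 4, 4)

# Size in bytes of a float32 tensor with the shape from the error message.
bytes_needed = 114171 * 829 * 829 * 4
print(f"{bytes_needed / 1e9:.0f} GB")
```

Under that reading, the input_fn would return the integer ids from pd.factorize directly (shape (114171,)) and leave the encoding to the feature column.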

Related

Training with Dataset API and numpy array yields completely different results

I have a CNN regression model and feature comes in (2000, 3000, 1) shape, where 2000 is total number of samples with each being a (3000, 1) 1D array. Batch size is 8, 20% of the full dataset is used for validation.
However, zipping the features and labels into a tf.data.Dataset gives completely different scores from feeding the NumPy arrays in directly.
The tf.data.Dataset code looks like:
# Load features and labels
features = np.array(features) # shape is (2000, 3000, 1)
labels = np.array(labels) # shape is (2000,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# Training model
model.fit(train_dataset, validation_data=val_dataset,
batch_size=8, epochs=1000)
The numpy code looks like:
# Load features and labels
features = np.array(features) # exactly the same as previous
labels = np.array(labels) # exactly the same as previous
# Training model
model.fit(x=features, y=labels, shuffle=True, validation_split=0.2,
batch_size=8, epochs=1000)
Except for this, other code is exactly the same, for example
# Set global random seed
tf.random.set_seed(0)
np.random.seed(0)
# No preprocessing of feature at all
# Load model (exactly the same)
model = load_model()
# Compile model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss=tf.keras.losses.MeanSquaredError(),
metrics=[tf.keras.metrics.mean_absolute_error, ],
)
The former method, via the tf.data.Dataset API, yields a mean absolute error (MAE) of around 10^-3 on both the training and validation sets, which looks quite suspicious because the model has no dropout or regularization to prevent overfitting. Feeding the NumPy arrays in directly, on the other hand, gives a training MAE of around 0.1 and a validation MAE of around 1.
The low MAE of the tf.data.Dataset method looks very suspicious, but I couldn't find anything wrong with the code. I also confirmed that there are 200 training batches and 50 validation batches, meaning I didn't use the training set for validation.
I tried varying the global random seed and using different shuffle seeds, which didn't change the results much. Training was done on NVIDIA V100 GPUs, and I tried TensorFlow versions 2.9, 2.10 and 2.11, which didn't make much difference.
The problem lies in the default behaviour of the shuffle method of tf.data.Dataset, more specifically its reshuffle_each_iteration argument, which is True by default. This means that if I run the following code:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
model.fit(train_dataset, validation_data=val_dataset, batch_size=8, epochs=1000)
The dataset is actually reshuffled at the start of every epoch, although that is not obvious from the code. Because take and skip are applied after the shuffle, each epoch draws a different set of batches for training and for validation, so the validation data leaks into the training set (in fact, there is no distinction between the two sets, as the order changes every epoch).
So make sure to set reshuffle_each_iteration to False if you want to shuffle the dataset and then do a train-validation split.
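A minimal sketch of that fix, assuming toy data of 10 samples and a fixed seed (both made up for illustration). It iterates the two splits several times, the way successive epochs would, and checks that they stay the same and never overlap.

```python
import tensorflow as tf

# Toy stand-in for the real data: 10 samples, labels equal to the sample ids.
features = tf.range(10)
labels = tf.range(10)

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle once and keep that order for every later iteration (epoch).
dataset = dataset.shuffle(buffer_size=10, seed=0, reshuffle_each_iteration=False)
dataset = dataset.batch(2)

train_dataset = dataset.take(4)  # first 4 batches -> 8 samples
val_dataset = dataset.skip(4)    # remaining batch -> 2 samples

def sample_ids(ds):
    """Collect the feature values a dataset yields in one full pass."""
    return {int(x) for batch, _ in ds for x in batch.numpy()}

# Simulate three epochs and record which samples land in each split.
train_per_epoch = [sample_ids(train_dataset) for _ in range(3)]
val_per_epoch = [sample_ids(val_dataset) for _ in range(3)]

for train_ids, val_ids in zip(train_per_epoch, val_per_epoch):
    print(sorted(train_ids), sorted(val_ids))
```

An alternative that keeps per-epoch reshuffling for training is to split first and then call shuffle only on the training part.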
UPDATE: The TensorFlow team has confirmed this issue, and a warning will be added to future docs.
PS: It's a hard lesson for me, as I have been using the model for analysing the results for several months (as a graduating MPhil student).

How to use embedding models in tensorflow hub with LSTM layer?

I'm learning TensorFlow 2 by working through the text classification with TF Hub tutorial. It uses an embedding module from TF Hub. I was wondering if I could modify the model to include an LSTM layer. Here's what I've tried:
train_data, validation_data, test_data = tfds.load(
name="imdb_reviews",
split=('train[:60%]', 'train[60%:]', 'test'),
as_supervised=True)
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
dtype=tf.string, trainable=True)
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Embedding(10000, 50))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()
model.compile(optimizer='adam',
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
history = model.fit(train_data.shuffle(10000).batch(512),
epochs=10,
validation_data=validation_data.batch(512),
verbose=1)
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))
I don't know how to get the vocabulary size from the hub_layer, so I just put 10000 there. When I run it, it throws this exception:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[480,1] = -6 is not in [0, 10000)
[[node sequential/embedding/embedding_lookup (defined at .../learning/tensorflow/text_classify.py:36) ]] [Op:__inference_train_function_36284]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/embedding/embedding_lookup:
sequential/embedding/embedding_lookup/34017 (defined at Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py:112)
Function call stack:
train_function
I'm stuck here. My questions are:
How should I use the embedding module from TF Hub to feed an LSTM layer? It looks like the embedding lookup has some issues with this setup.
How do I get the vocabulary size from the hub layer?
Thanks
I finally figured out how to link pre-trained embeddings to an LSTM or other layers. I'm posting the steps here in case anyone finds them helpful.
The embedding layer has to be the first layer in the model (hub_layer plays the role of the Embedding layer, so a separate Embedding layer is not needed). The not very intuitive part is that any text input to the hub layer is converted to a single vector of shape [embedding_dim]. You need to do sentence splitting and tokenization to make sure that whatever is input to the model is a sequence in the form of an array of arrays, e.g. "Let us prepare the data." should be converted to [["let"], ["us"], ["prepare"], ["the"], ["data"]]. You will also need to pad the sequences if you are using batch mode.
In addition, you will need to convert your target tokens to int if your training labels are strings. The input to the model is an array of strings with shape [batch, seq_length], which the hub embedding layer converts to [batch, seq_length, embed_dim]. (If you add an LSTM or other RNN layer, the output from that layer is [batch, seq_length, rnn_units].) The output dense layer will output the index of the text instead of the actual text. The index of the text is stored in the downloaded tfhub directory as "tokens.txt". You can load the file and convert text to the corresponding index. Otherwise you cannot compute the loss.
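The tokenization and padding step described above could look roughly like the sketch below. It covers only the data preparation, not how the hub layer is then applied to each token; `tokenize`, `pad_batch` and the empty-string pad token are hypothetical choices made for illustration.

```python
import numpy as np

PUNCTUATION = ".,!?;:"

def tokenize(sentence):
    """Very rough tokenizer: split on whitespace, strip punctuation, lowercase."""
    tokens = (tok.strip(PUNCTUATION).lower() for tok in sentence.split())
    return [tok for tok in tokens if tok]

def pad_batch(token_lists, pad_token=""):
    """Right-pad every token list to the longest one (pad_token is an assumption)."""
    seq_length = max(len(tokens) for tokens in token_lists)
    padded = [tokens + [pad_token] * (seq_length - len(tokens))
              for tokens in token_lists]
    return np.array(padded, dtype=object)

sentences = ["Let us prepare the data.", "Hello world!"]
batch = pad_batch([tokenize(s) for s in sentences])

print(batch.shape)        # (2, 5), i.e. [batch, seq_length]
print(batch[0].tolist())  # ['let', 'us', 'prepare', 'the', 'data']
```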

Memory error while creating large one hot encoding for lstm

I am trying to build a character-level LSTM model using Keras, and for that I need to create a one-hot encoding of the characters to feed into the model. I have around 1000 characters in each line and around 160,000 lines.
I tried to create a NumPy array of zeros and set the corresponding entries to 1, but I am getting a memory error because of the size of the matrix. Is there any other way to do this?
Sure:
Create batches. Only process, say, 10,000 entries (characters) at a time, computing them and feeding them into your neural network just before they're needed (say, by using a generator instead of a list). Keras has a fit_generator training function for this (in recent TensorFlow releases it is deprecated and fit accepts generators directly).
Group chunks of data together. Say, instead of representing a line as a matrix of the one-hot encodings of its characters, use the sum/max of all those one-hot vectors to produce a single vector for the line. Now each line is only a single vector, with dimensionality equal to the number of unique characters in your data set. E.g., instead of [[0, 0, 1], [0, 1, 0], [0, 0, 1]], use [0, 1, 1] to represent the entire line.
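Both suggestions can be sketched in plain NumPy. The generator below is a hypothetical helper (`one_hot_batches`) that assumes all lines have the same length and uses made-up data; only one batch is ever held in one-hot form.

```python
import numpy as np

def one_hot_batches(lines, char_to_index, batch_size):
    """Yield one-hot arrays of shape (batch, line_length, vocab_size), one batch at a time."""
    vocab_size = len(char_to_index)
    for start in range(0, len(lines), batch_size):
        chunk = lines[start:start + batch_size]
        # Integer-encode the characters; assumes all lines have equal length.
        indices = np.array([[char_to_index[c] for c in line] for line in chunk])
        # Only this batch is expanded to one-hot form.
        yield np.eye(vocab_size, dtype=np.float32)[indices]

lines = ["abca", "cbaa", "bbbc"]  # made-up data
char_to_index = {"a": 0, "b": 1, "c": 2}

batches = list(one_hot_batches(lines, char_to_index, batch_size=2))
print([b.shape for b in batches])  # [(2, 4, 3), (1, 4, 3)]

# Second suggestion: collapse a line's one-hot vectors into a single vector.
line = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 1]])
print(line.max(axis=0))  # [0 1 1]
print(line.sum(axis=0))  # [0 1 2]
```

Note that the [0, 1, 1] in the example corresponds to the max; the sum keeps character counts but, like the max, discards character order.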
Perhaps an easier and more intuitive solution is to add a custom one-hot encoding layer in your Keras model architecture.
def build_model(self, batch_size, print_summary=False):
    X = Input(shape=(self.sequence_length,), batch_size=batch_size)
    embedding = OneHotEncoding(num_classes=self.vocab_size+1,
                               sequence_length=self.sequence_length)(X)
    encoder = Bidirectional(CuDNNLSTM(units=self.recurrent_units,
                                      return_sequences=True))(embedding)
    ...
where we can define the OneHotEncoding layer as follows:
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer  # base class for custom layers

class OneHotEncoding(Layer):
    def __init__(self, num_classes=None, sequence_length=None, **kwargs):
        if num_classes is None or sequence_length is None:
            raise ValueError("Can't leave params #num_classes or #sequence_length empty")
        super(OneHotEncoding, self).__init__(**kwargs)
        self.num_classes = num_classes
        self.sequence_length = sequence_length

    def call(self, inputs):
        # Inputs are float by default, but one_hot needs integer indices.
        # Encoding directly here avoids building a new Lambda layer on every call.
        return K.one_hot(indices=K.cast(inputs, "int32"),
                         num_classes=self.num_classes)
Here we are utilizing the fact that the Keras model is fed the training samples in appropriate batch sizes (with the standard fit function), which in turn doesn't yield a MemoryError.

MultiClass Keras Classifier prediction output meaning

I have a Keras classifier built using the Keras wrapper of the Scikit-Learn API. The neural network has 10 output nodes, and the training data is all represented using one-hot encoding.
According to Tensorflow documentation, the predict function outputs a shape of (n_samples,). When I fitted 514541 samples, the function returned an array with shape (514541, ), and each entry of the array ranged from 0 to 9.
Since I have ten different outputs, does the numerical value of each entry correspond exactly to the result that I encoded in my training matrix?
i.e. if index 5 of my one-hot encoding of y_train represents "orange", does a prediction value of 5 mean that the neural network predicted "orange"?
Here is a sample of my model:
model = Sequential()
model.add(Dropout(0.2, input_shape=(32,) ))
model.add(Dense(21, activation='selu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
There are some issues with your question.
The neural network has 10 output nodes, and the training data is all represented using one-hot encoding.
Since your network has 10 output nodes and your labels are one-hot encoded, your model's output should also be 10-dimensional, i.e. of shape (n_samples, 10). Moreover, since you use a softmax activation for your final layer, each element of the 10-dimensional output lies in [0, 1] and is interpreted as the probability that the sample belongs to the respective (one-hot encoded) class.
According to Tensorflow documentation, the predict function outputs a shape of (n_samples,).
It's puzzling why you refer to Tensorflow, while your model is clearly a Keras one; you should refer to the predict method of the Keras sequential API.
When I fitted 514541 samples, the function returned an array with shape (514541, ), and each entry of the array ranged from 0 to 9.
If something like that happens, an argmax is being applied somewhere: either in a later part of your code that you do not show here, or inside the scikit-learn wrapper itself, whose predict returns class labels rather than probabilities. In any case, the idea is to find the index of the highest value in each 10-dimensional network output (since the values are interpreted as probabilities, it is intuitive that the element with the highest value is the most probable). In other words, somewhere there must be something like this:
pred = model.predict(x_test)
y = np.argmax(pred, axis=1) # numpy must have been imported as np
which will give an array of shape (n_samples,), with each y an integer between 0 and 9, as you report.
i.e. if index 5 of my one-hot encoding of y_train represents "orange", does a prediction value of 5 mean that the neural network predicted "orange"?
Provided that the above hold, yes.
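As an illustration of that index-to-class correspondence, here is a small NumPy sketch. The class names and the softmax outputs are made up; the only thing taken from the question is that index 5 stands for "orange".

```python
import numpy as np

# Hypothetical class order; index 5 is "orange" as in the question.
class_names = ["apple", "banana", "cherry", "grape", "lemon",
               "orange", "peach", "pear", "plum", "kiwi"]

# Made-up network outputs for 3 samples, turned into softmax probabilities.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 10))
logits[0, 5] = 10.0  # make class 5 the clear winner for the first sample
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

y = np.argmax(probs, axis=1)  # shape (n_samples,), values in 0..9
predicted = [class_names[i] for i in y]

print(probs.shape, y.shape)  # (3, 10) (3,)
print(predicted[0])          # orange
```

The mapping only holds if the list of names follows the same column order that was used to build the one-hot y_train.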

Oversampling images during inference

It is common practice in convolutional neural networks to oversample a given image during inference,
i.e. to create a batch from different transformations of the same image (most commonly different crops and mirroring), pass the entire batch through the network, and average the results (or apply another reducing function) to get a single prediction (Caffe example).
How can this approach be implemented in TensorFlow?
You can take a look at the TF cnn tutorial. In particular, the function distorted_inputs does the image preprocessing step.
In short, there are a couple of TF functions in the tf.image package that help with distorting the images. You can use either them or regular numpy functions to create an extra dimension for the output, for which you can average the results:
Before:
input_place = tf.placeholder(tf.float32, [None, 256, 256, 3])
prediction = some_model(input_place) # size: [None]
sess.run(prediction, feed_dict={input_place: batch_of_images})
After:
input_place = tf.placeholder(tf.float32, [None, NUM_OF_DISTORTIONS, 256, 256, 3])
prediction = some_model(input_place) # make sure it is of size [None, NUM_OF_DISTORTIONS]
new_prediction = tf.reduce_mean(prediction, axis=1)
# np.zeros takes the shape as a single tuple
new_batch = np.zeros((batch_size, NUM_OF_DISTORTIONS, 256, 256, 3))
for i in range(len(batch_of_images)):
    for f in range(len(distortion_functions)):
        new_batch[i, f, :, :, :] = distortion_functions[f](batch_of_images[i])
sess.run(new_prediction, feed_dict={input_place: new_batch})
Take a look at TF's image-related functions. You could apply those transformations at test time to some input image, and stack all of them together to make a batch.
I imagine you could also do this using OpenCV or some other image processing tool. I don't see a need to do it in the computation graph. You could create the batches beforehand, and pass it through in feed_dict.
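A sketch of that second approach, building the distorted copies outside the graph with NumPy and averaging the predictions afterwards. The crop helpers and `toy_model` are hypothetical stand-ins; in practice `toy_model` would be replaced by a call into the real network.

```python
import numpy as np

def center_crop(image, size):
    """Crop a (H, W, C) image to (size, size, C) around its center."""
    top = (image.shape[0] - size) // 2
    left = (image.shape[1] - size) // 2
    return image[top:top + size, left:left + size]

def corner_crop(image, size):
    """Crop the top-left (size, size, C) corner."""
    return image[:size, :size]

def mirrored_center_crop(image, size):
    """Center crop flipped along the width axis."""
    return center_crop(image, size)[:, ::-1]

def toy_model(batch):
    """Hypothetical stand-in for a network: one score per image (its mean pixel)."""
    return batch.mean(axis=(1, 2, 3))

distortion_functions = [center_crop, corner_crop, mirrored_center_crop]

def oversampled_predict(images, size):
    """Average the model's predictions over all distortions of each image."""
    preds = []
    for image in images:
        variants = np.stack([f(image, size) for f in distortion_functions])
        preds.append(toy_model(variants).mean())
    return np.array(preds)

images = np.random.default_rng(0).random((4, 256, 256, 3))
print(oversampled_predict(images, size=224).shape)  # (4,)
```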