I have several small models, each 1-2 MB on disk: some are just fully connected layers, one is a Transformer, and a couple are Bidirectional GRU networks. They have very different memory requirements when I load them in TensorFlow Serving.
Does anyone know why a very simple Bidirectional GRU network, shown below, takes >100 MB of RAM to load, while a Transformer with several layers and twice as many weights takes only half that, about 50 MB? (The fully connected networks are very small, 5-10 MB of RAM.)
vocab_size = 200
embedding_dim = 32
enc_units = 64
self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
gru = tf.keras.layers.GRU(enc_units,
self.bidi = tf.keras.layers.Bidirectional(gru, merge_mode="concat")
Is there any way to reduce the RAM requirements for the network?
I tried reducing enc_units from 64 to 32, and likewise embedding_dim from 64 to 32, but that didn't change the RAM usage at all.
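A rough parameter count can help rule out the weights themselves as the source of the memory footprint. The sketch below assumes the usual Keras weight layouts for Embedding and GRU. The GRU call in the post is cut off, so the reset_after setting is an assumption, and the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def gru_params(input_dim, units, reset_after=True):
    # Three gates, each with an input kernel, a recurrent kernel and a bias.
    # With reset_after=True (assumed here) Keras keeps two bias vectors per gate.
    bias_terms = 2 if reset_after else 1
    return 3 * units * (input_dim + units + bias_terms)


def bidirectional_gru_params(input_dim, units, reset_after=True):
    # A Bidirectional wrapper holds one forward and one backward copy.
    return 2 * gru_params(input_dim, units, reset_after)


vocab_size, embedding_dim, enc_units = 200, 32, 64
total = (embedding_params(vocab_size, embedding_dim)
         + bidirectional_gru_params(embedding_dim, enc_units))
weight_bytes = total * 4  # assuming float32 weights

print(total, weight_bytes)
```

Under these assumptions the model has about 44 thousand parameters, or roughly 172 KiB of float32 weights, which is far below the RAM figure reported and fits the observation that shrinking enc_units made no difference.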
How to select the following: Size of filter for convolution, strides, pooling, and densely connected layer
There is no single answer to this question. This Reddit and this answer have some nice discussion. To quote the second post on the Reddit, "Start simple."
CelebA has a similar, maybe exactly the same, image size. When I was working with CelebA on a DCGAN project, I gently cropped and then resized the images to 64 x 64 x 3. My discriminator was a convolutional neural network with 4 convolutional layers and one fully connected layer. All conv layers had a 5 x 5 window size and a stride of 2 x 2, with SAME padding and no pooling. The output channels per layer were 128 -> 256 -> 512 -> 1024, so the last conv layer output a 4 x 4 x 1024 tensor. My dense layer then had a weight size of classes x 1024. (I had 1 class, since its purpose was to determine whether the input image was from the dataset or made by the generator.)
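The shape arithmetic for that discriminator can be checked with a few lines of Python. This is only a sketch of the SAME-padding rule (output size = ceil(input size / stride)); the function names are hypothetical and no framework is involved.

```python
import math


def same_padding_output_size(input_size, stride):
    # With SAME padding the spatial output size depends only on the stride,
    # not on the kernel size.
    return math.ceil(input_size / stride)


def conv_stack_shapes(input_size, channels, stride):
    # Track (height, width, channels) after each conv layer of a square input.
    shapes = []
    size = input_size
    for out_channels in channels:
        size = same_padding_output_size(size, stride)
        shapes.append((size, size, out_channels))
    return shapes


shapes = conv_stack_shapes(64, (128, 256, 512, 1024), stride=2)
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

print(shapes)
print(flattened)
```

Each stride-2 layer halves the spatial size (64 -> 32 -> 16 -> 8 -> 4), and flattening the final 4 x 4 x 1024 tensor yields 16384 values per image.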
That relatively simple architecture had good results, but it was intentionally built to not overpower the generator. If you're looking for pure classification, you may want a deeper architecture. You might not want to crop as aggressively as I did; then you can include more conv layers before the fully connected layer. You may want to use a 3 x 3 window size with a stride of 1 x 1 and use pooling, although I see architectures abandoning pooling in favor of a larger stride. If your dataset is small, the model is prone to overfitting. Having fewer weights helps combat this when dropout isn't enough, which means fewer output channels per layer.
There are a lot of possibilities when choosing an architecture, and there is no hard-and-fast rule for the best architecture. Remember to start simple.
Currently, I have a neural network, built in TensorFlow, that classifies time sequence data into one of 6 categories. The network is composed of:
2 fully connected layers -> LSTM unit -> softmax -> output
All layers have regularization in the form of dropout and/or layer normalization. To speed up training, I am using mini-batches, where the mini-batch size = # of categories = 6. Each mini-batch contains exactly one sample for each of the 6 categories, arranged randomly in the mini-batch. Below is the feed-forward code, where x is of shape [batch_size, number of time steps, number of features], and the various get* functions are simple helpers that create standard fully connected layers and LSTM units with regularization.
def getFullyConnected(input, hidden, dropout, layer, phase):
    weight = tf.Variable(tf.random_normal([input.shape.dims[1].value, hidden]), name="weight_layer"+str(layer))
    bias = tf.Variable(tf.random_normal([1]), name="bias_layer"+str(layer))
    layer = tf.add(tf.matmul(input, weight), bias)
    layer = tf.contrib.layers.batch_norm(layer,
                                         center=True, scale=True,
    layer = tf.minimum(tf.nn.relu(layer), FLAGS.relu_clip)
    layer = tf.nn.dropout(layer, (1.0 - dropout))
    return layer
def RNN(x, weights, biases, time_steps):
    # shape the input as [batch_size*time_steps, input_depth]
    x = tf.reshape(x, [-1, input_depth])
    layer1 = getFullyConnected(input=x, hidden=16, dropout=full_drop, layer=1, phase=True)
    layer2 = getFullyConnected(input=layer1, hidden=input_depth*3, dropout=full_drop, layer=2, phase=True)
    rnn_input = tf.reshape(layer2, [-1, time_steps, input_depth*3])
    # 1-layer LSTM with n_hidden units.
    LSTM_cell = getLSTMcell(n_hidden)
    # generate prediction
    outputs, state = tf.nn.dynamic_rnn(LSTM_cell,
    # good old tensorboard saves
    tf.summary.histogram('weight', weights['out'])
    # there are time_steps outputs, but only grab the last output for the classification
    return tf.sigmoid(tf.matmul(outputs[:, -1, :], weights['out']) + biases['out'])
Surprisingly, this network trained extremely well, giving me about 99.75% accuracy on my test data (which the trained network had never seen). However, it only scored this high when I fed the training data into the network with the same mini-batch size as during training, 6. If I fed the training data one sample at a time (mini-batch size = 1), the network scored around 60%. What is weird is that if I train the network on single samples (mini-batch size = 1), the trained network works perfectly fine, with high accuracy. This leads me to the odd conclusion that the network is almost learning to use the batch size, so much so that it becomes dependent on the mini-batch to classify correctly.
Is it a thing for a deep network to become so dependent on the mini-batch size used during training that the final trained network requires input data with the same mini-batch size just to perform correctly?
Any ideas or thoughts would be appreciated!
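One mechanism that can produce this kind of batch dependence is batch normalization left in training mode at prediction time. The RNN function above passes phase=True to getFullyConnected, which calls tf.contrib.layers.batch_norm. Whether phase is actually forwarded as is_training is not visible in the post, so this is an assumption rather than a diagnosis. The NumPy sketch below uses hypothetical helper names and is not TensorFlow's implementation. It only shows that training-mode normalization makes a row's output depend on the other rows in the batch, while normalization with fixed statistics does not.

```python
import numpy as np


def batch_norm_training_mode(x, eps=1e-5):
    # Normalize each feature with statistics of the CURRENT batch (rows of x).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def batch_norm_inference_mode(x, moving_mean, moving_var, eps=1e-5):
    # Normalize each feature with FIXED statistics collected during training.
    return (x - moving_mean) / np.sqrt(moving_var + eps)


rng = np.random.default_rng(0)
batch = rng.normal(size=(6, 4))  # 6 rows, 4 features

# Same first row, once inside the batch of 6 and once on its own.
in_batch = batch_norm_training_mode(batch)[0]
alone = batch_norm_training_mode(batch[:1])[0]

# With fixed statistics the batch composition no longer matters.
moving_mean = np.zeros((1, 4))
moving_var = np.ones((1, 4))
fixed_in_batch = batch_norm_inference_mode(batch, moving_mean, moving_var)[0]
fixed_alone = batch_norm_inference_mode(batch[:1], moving_mean, moving_var)[0]

print(in_batch)
print(alone)
```

In training mode a single row normalizes to all zeros, because it equals its own mean, so the layers after it see very different activations than they did during training with larger batches. In the network above the input is reshaped to [batch_size*time_steps, input_depth], so the statistics would be taken over batch_size*time_steps rows rather than over a single row, but the dependence on batch composition is the same.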
I am working on the MNIST dataset in TensorFlow with a deep neural network classifier. I am using the following structure for the network.
MNIST_DATASET = input_data.read_data_sets(mnist_data_path)
train_data = np.array(MNIST_DATASET.train.images, 'int64')
train_target = np.array(MNIST_DATASET.train.labels, 'int64')
test_data = np.array(MNIST_DATASET.test.images, 'int64')
test_target = np.array(MNIST_DATASET.test.labels, 'int64')
classifier = tf.contrib.learn.DNNClassifier(
feature_columns=[tf.contrib.layers.real_valued_column("", dimension=784)],
n_classes=10, #0 to 9 - 10 classes
hidden_units=[2500, 1000, 1500, 2000, 500],
classifier.fit(train_data, train_target, steps=1000)
However, I get only 40% accuracy when I run the following line.
accuracy_score = 100*classifier.evaluate(test_data, test_target)['accuracy']
How can I tune the network? Am I doing something wrong? Similar studies in academia have reached 99% accuracy.
Thank you.
I found a better configuration on GitHub.
To be clear, it is not the best configuration: academic studies have already reached 99.79% accuracy on the test set.
classifier = tf.contrib.learn.DNNClassifier(
, n_classes=10
, hidden_units=[128, 32]
, optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=learning_rate)
, activation_fn = tf.nn.relu
The following parameters are also passed to the classifier.
epoch = 15000
learning_rate = 0.1
batch_size = 40
With this setup, the model reaches 97.83% accuracy on the test set and 99.77% accuracy on the training set.
Speaking from experience, it is a good idea to use no more than 2 hidden layers in a fully connected network for the MNIST dataset, e.g. hidden_units=[500, 500]. That should get to over 90% accuracy.
What is the problem? An extreme number of model parameters. For example, the second hidden layer alone requires 2500*1000+1000 = 2,501,000 parameters. A rule of thumb is to keep the number of trainable parameters roughly comparable to the number of training examples, at least in classical machine learning. Otherwise, regularize the model rigorously.
What steps can be taken here?
Use a simpler model: decrease the number of hidden units and the number of layers.
Use a model with fewer parameters. Convolutional layers, for instance, generally use far fewer parameters for the same number of units. For instance, 1000 convolutional neurons with 3x3 kernels would need only 1000*(3*3+1) parameters, assuming a single input channel.
Apply regularization: batch normalization, noise injection into the input, dropout, and weight decay are good places to start.
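To put numbers on the parameter-count argument, here is a small sketch that counts weights and biases for the fully connected configurations mentioned in this thread. The helper names are hypothetical, and the figure of 55,000 training examples assumes the default train/validation split of the TensorFlow MNIST loader.

```python
def dense_params(layer_sizes):
    # Each dense layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def dnn_params(n_inputs, hidden_units, n_classes):
    return dense_params([n_inputs, *hidden_units, n_classes])


n_train = 55_000  # assumed size of the MNIST training split

large = dnn_params(784, [2500, 1000, 1500, 2000, 500], 10)
medium = dnn_params(784, [500, 500], 10)
small = dnn_params(784, [128, 32], 10)

for name, count in [("large", large), ("medium", medium), ("small", small)]:
    print(name, count, round(count / n_train, 1))
```

Under that assumption, the five-layer network from the question has roughly 180 trainable parameters per training example, while the [128, 32] network has about 2.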
I am using a pretrained model in Keras to generate features for a set of images:
model = InceptionV3(weights='imagenet', include_top=False)
train_data = model.predict(data).reshape(data.shape[0],-1)
However, I have a lot of images, and the ImageNet model outputs 131072 features (columns) for each image.
With 200k images I would get an array of (200000, 131072) which is too large to fit into memory.
More importantly, I need to save this array to disk, and it would take about 100 GB of space when saved as .npy or HDF5 (via h5py).
I could circumvent the memory problem by feeding batches of, say, 1000 images at a time and saving them to disk, but that does not solve the disk space problem.
How can I make the model smaller without losing too much information?
As the answer suggested, I included the next layer in the model as well:
base_model = InceptionV3(weights='imagenet')
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
This reduced the output to (200000, 2048).
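The storage figures follow directly from the array shapes. A quick check, assuming float32 values (4 bytes each) and decimal gigabytes; the function name is hypothetical:

```python
def array_size_gb(n_rows, n_cols, bytes_per_value=4):
    # Raw size of a dense 2-D array in decimal gigabytes (1 GB = 1e9 bytes).
    return n_rows * n_cols * bytes_per_value / 1e9


# 131072 = 8 * 8 * 2048: the last convolutional feature map, flattened.
full = array_size_gb(200_000, 8 * 8 * 2048)
# After the 'avg_pool' layer only the 2048 channels remain.
pooled = array_size_gb(200_000, 2048)

print(full, pooled)
```

That is a 64-fold reduction, from about 105 GB to about 1.6 GB, before any compression.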
update 2:
Another interesting solution may be the bcolz package, which can reduce the size of NumPy arrays: https://github.com/Blosc/bcolz
I see at least two solutions to your problem:
Apply model = AveragePooling2D((8, 8), strides=(8, 8))(model), where model is the InceptionV3 object you loaded (without the top). This is the next step in the InceptionV3 architecture, so one may reasonably assume that these features still hold plenty of discriminative information.
Apply some kind of dimensionality reduction (e.g. PCA): fit it on a sample of the data, then reduce the dimensionality of all the data to get a reasonable file size.
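Both ideas can be sketched on already-extracted feature maps with plain NumPy. This is an illustration under assumptions, not the Keras code path: the helper names are hypothetical, the feature maps are random stand-ins with 16 channels instead of 2048, and PCA is done with a plain SVD rather than a library implementation.

```python
import numpy as np


def global_average_pool(feature_maps):
    # An 8x8 average pool with stride 8 over an 8x8 map is the mean over
    # both spatial axes: (n, 8, 8, channels) -> (n, channels).
    return feature_maps.mean(axis=(1, 2))


def fit_pca(sample, n_components):
    # Fit PCA on a sample: center it and keep the top right-singular vectors.
    mean = sample.mean(axis=0)
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(features, mean, components):
    # Project features onto the principal components fitted on the sample.
    return (features - mean) @ components.T


rng = np.random.default_rng(0)
maps = rng.normal(size=(10, 8, 8, 16))  # stand-in for InceptionV3 outputs

pooled = global_average_pool(maps)
mean, components = fit_pca(pooled[:8], n_components=4)  # fit on a sample
reduced = apply_pca(pooled, mean, components)           # transform everything

print(pooled.shape, reduced.shape)
```

Fitting on a sample keeps the SVD cheap, and the projection can then be applied batch by batch so the full feature array never has to be held in memory.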
The TensorFlow tutorial gives the following example for tf.train.shuffle_batch():
# Creates batches of 32 images and 32 labels.
image_batch, label_batch = tf.train.shuffle_batch(
[single_image, single_label],
I am not very clear about the meaning of capacity and min_after_dequeue. In this example, they are set to 50000 and 10000 respectively. What is the logic behind this kind of setup, and what does it mean? If the input has 200 images and 200 labels, what will happen?
The tf.train.shuffle_batch() function uses a tf.RandomShuffleQueue internally to accumulate batches of batch_size elements, which are sampled uniformly at random from the elements currently in the queue.
Many training algorithms, such as the stochastic gradient descent–based algorithms that TensorFlow uses to optimize neural networks, rely on sampling records uniformly at random from the entire training set. However, it is not always practical to load the entire training set in memory (in order to sample from it), so tf.train.shuffle_batch() offers a compromise: it fills an internal buffer with between min_after_dequeue and capacity elements, and samples uniformly at random from that buffer. For many training processes, this improves the accuracy of the model and provides adequate randomization.
The min_after_dequeue and capacity arguments have an indirect effect on training performance. Setting a large min_after_dequeue value will delay the start of training, because TensorFlow has to process at least that many elements before training can start. The capacity is an upper bound on the amount of memory that the input pipeline will consume: setting this too large may cause the training process to run out of memory (and possibly start swapping, which will impair the training throughput).
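The roles of the two arguments can be illustrated with a small pure-Python model of a shuffle buffer. This is a simplified, synchronous sketch with hypothetical function names, not the actual tf.RandomShuffleQueue implementation: the real queue is filled by background threads and blocks instead of raising. It assumes the 200-record dataset from the question is read repeatedly, epoch after epoch.

```python
import itertools
import random
from collections import Counter


def fill_buffer(source, buffer, capacity):
    # Producer side: top the buffer up to `capacity` elements, never beyond.
    for record in itertools.islice(source, capacity - len(buffer)):
        buffer.append(record)


def dequeue_batch(buffer, batch_size, min_after_dequeue, rng):
    # Consumer side: a batch may only be taken if at least
    # `min_after_dequeue` elements remain afterwards for mixing.
    if len(buffer) - batch_size < min_after_dequeue:
        raise RuntimeError("would block: not enough elements buffered")
    batch = []
    for _ in range(batch_size):
        # Pick a uniformly random element, swap it to the end, remove it.
        i = rng.randrange(len(buffer))
        buffer[i], buffer[-1] = buffer[-1], buffer[i]
        batch.append(buffer.pop())
    return batch


rng = random.Random(0)
source = itertools.cycle(range(200))  # 200 records, repeated indefinitely
buffer = []

fill_buffer(source, buffer, capacity=50000)
copies = Counter(buffer)              # how often each record is buffered
batch = dequeue_batch(buffer, batch_size=32, min_after_dequeue=10000, rng=rng)

print(len(batch), len(buffer), copies[0])
```

With capacity=50000 and only 200 distinct records, the buffer ends up holding 250 copies of every record, which is the inefficiency that arises when the dataset is much smaller than the buffer.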
If the dataset has only 200 images, it would be easy to load the entire dataset in memory. tf.train.shuffle_batch() would be quite inefficient, because it would enqueue each image and label multiple times in the tf.RandomShuffleQueue. In this case, you may find it more efficient to do the following instead, using tf.train.slice_input_producer() and tf.train.batch():
random_image, random_label = tf.train.slice_input_producer([all_images, all_labels],
image_batch, label_batch = tf.train.batch([random_image, random_label],