How to better organize the nodes in tensorboard with keras? - tensorflow

I'm using Keras instead of raw TensorFlow because of its simplicity. But when I tried to visualize the computational graph by passing a keras.callbacks.TensorBoard instance to model.fit()'s callbacks argument, the graph I got in TensorBoard is awkward.
For demonstration purposes I only built a very simple linear classifier with 1 unit in 1 dense layer, yet the graph is still hard to read.
Can I do the same thing as in plain TensorFlow, e.g. use tf.name_scope to group things together and give the layers, biases and weights names? In the graph here, it's such a mess: I can only recognize the Dense layer and a logistic loss namespace. Typically with TensorFlow we see something like a train namespace, and not so many nodes floating around without a namespace. How can I make it clearer?
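For reference, a minimal sketch of the setup described above (the input dimension, the x_train/y_train variables and the log directory are placeholders, not part of the question):

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import TensorBoard

# 1-unit dense layer = the simple linear/logistic classifier from the question
model = Sequential()
model.add(Dense(1, activation='sigmoid', input_dim=2))
model.compile(optimizer='sgd', loss='binary_crossentropy')

# the TensorBoard callback writes the graph that shows up so cluttered
model.fit(x_train, y_train, epochs=10,
          callbacks=[TensorBoard(log_dir='./logs')])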

The TensorFlow graph shows all the computations being called, so you won't be able to simplify it much.
As an alternative, Keras has its own layer-by-layer graph, which shows a clear and concise structure of your network. You can generate it by calling:
from keras.utils import plot_model
plot_model(model, to_file='/some/pathname/model.png')
Lastly, you can also call model.summary(), which generates a textual version of the graph, with additional summaries.
Here is an example output of model.summary():
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_1 (InputLayer) (None, 2048) 0
____________________________________________________________________________________________________
activation_1 (Activation) (None, 2048) 0
____________________________________________________________________________________________________
dense_1 (Dense) (None, 511) 1047039
____________________________________________________________________________________________________
activation_2 (Activation) (None, 511) 0
____________________________________________________________________________________________________
decoder_layer_1 (DecoderLayer) (None, 512) 0
____________________________________________________________________________________________________
ctg_output (OrLayer) (None, 201) 102912
____________________________________________________________________________________________________
att_output (OrLayer) (None, 312) 159744
====================================================================================================
Total params: 1,309,695.0
Trainable params: 1,309,695.0
Non-trainable params: 0.0

Related

How to split dataset to implement svm classifier after extracting features from Inception v3 transfer learning?

The training dataset consists of 42,848 images in 4 class subdirectories.
image_size = [520, 578]
BATCH_SIZE = 32
Model:
# Assumed imports for this snippet:
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

inception = InceptionV3(input_shape=CROP_SHAPE + [3], weights='imagenet', include_top=False)
for layer in inception.layers:
    layer.trainable = False  # freeze the pre-trained backbone

x = inception.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
prediction = Dense(len(folders), activation='softmax')(x)
Here's the model summary.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                                  Output Shape           Param #    Connected to
==================================================================================================
 input_1 (InputLayer)                          [(None, 520, 578, 3)]  0          []
 conv2d (Conv2D)                               (None, 259, 288, 32)   864        ['input_1[0][0]']
 batch_normalization (BatchNormalization)      (None, 259, 288, 32)   96         ['conv2d[0][0]']
 ...                                           (intermediate Inception v3 layers omitted)
 mixed9_1 (Concatenate)                        (None, 14, 16, 768)    0          ['activation_87[0][0]',
                                                                                  'activation_88[0][0]']
 concatenate_1 (Concatenate)                   (None, 14, 16, 768)    0          ['activation_91[0][0]',
                                                                                  'activation_92[0][0]']
 activation_93 (Activation)                    (None, 14, 16, 192)    0          ['batch_normalization_93[0][0]']
 mixed10 (Concatenate)                         (None, 14, 16, 2048)   0          ['activation_85[0][0]',
                                                                                  'mixed9_1[0][0]',
                                                                                  'concatenate_1[0][0]',
                                                                                  'activation_93[0][0]']
 global_average_pooling2d (GlobalAveragePooling2D)  (None, 2048)      0          ['mixed10[0][0]']
 dense (Dense)                                 (None, 1024)           2098176    ['global_average_pooling2d[0][0]']
 dropout (Dropout)                             (None, 1024)           0          ['dense[0][0]']
 Feature_extractor (Dense)                     (None, 64)             65600      ['dropout[0][0]']
 dense_1 (Dense)                               (None, 4)              260        ['Feature_extractor[0][0]']
==================================================================================================
Total params: 23,966,820
Trainable params: 2,164,036
Non-trainable params: 21,802,784
__________________________________________________________________________________________________
I've extracted features from Inception v3 model. Now I want to split the features using sci-kit learn to train SVM classifier.
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model  # assumed import

model_feat = Model(inputs=loaded_model.input,
                   outputs=loaded_model.get_layer('Feature_extractor').output)
feat_trainX = model_feat.predict(train_data)
...
X_train2, X_test2, y_train2, y_test2 = train_test_split(feat_trainX, train_gen, test_size=0.25, random_state=42)
Here, the train_gen directory iterator is used as the "label", but this gives inconsistent values. I've also had the same problem with tf.keras.utils.image_dataset_from_directory.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
train_gen = train_datagen.flow_from_directory(
    '/content/DeepSeagrass/Training',
    target_size=CROP_SHAPE,
    batch_size=BATCH_SIZE,
    shuffle=True,
    class_mode='categorical')
Found 42848 images belonging to 4 classes.
Is there any solution for labeling a large image dataset for training discriminative algorithms?
It seems that flow_from_directory and tf.keras.utils.image_dataset_from_directory both yield the data in small, randomly shuffled batches, which can cause a mismatch between features and labels.
Now, the main question is how to tackle this situation with limited GPU memory. I'm using Google Colab, which runs out of GPU memory if I try to convert the whole dataset into a NumPy array.
Also, if that is possible, how can I save the features together with the labels in a CSV file for further visualization, e.g. a t-SNE plot?
Below are references I've found, but none gives an exact solution:
How to store CNN extracted features to train a SVM classifier
How to implement t-SNE in tensorflow?
https://pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/
Actually, the issue is that you're using flow_from_directory() with a batch_size smaller than the entire input, which is why it only produces 1,339 elements (batches): the number of items in the iterator created by flow_from_directory() is total_images / batch_size. Either configure flow_from_directory() so that everything lands in a single batch, or load all the images into memory before train_test_split(). Either way, everything ends up loaded in memory.
It's also worth noting that train_gen is not an array; it is an iterator that yields both x and y values. You most likely don't want both features and labels where only the labels belong, right? If you need one or the other, you'll have to debug and look into that variable; whenever I do, it contains arrays x and y corresponding to features and labels respectively (though the naming can differ). You could technically do it this way, but as in my comment, given the random shuffling of flow_from_directory(), it probably won't be as robust as loading everything into memory (some images could get skipped, some picked multiple times).
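If loading everything into memory is not an option, a related sketch (not part of the answer above) is to keep the generator but disable shuffling, so the order of predict() output matches the iterator's classes attribute; it reuses train_datagen, CROP_SHAPE, BATCH_SIZE and model_feat from the question:

ordered_gen = train_datagen.flow_from_directory(
    '/content/DeepSeagrass/Training',
    target_size=CROP_SHAPE,
    batch_size=BATCH_SIZE,
    shuffle=False,               # keep file order fixed so features and labels line up
    class_mode='categorical')

features = model_feat.predict(ordered_gen)   # shape (42848, 64): Feature_extractor outputs
labels = ordered_gen.classes                 # shape (42848,): integer class ids in the same order

X_train2, X_test2, y_train2, y_test2 = train_test_split(
    features, labels, test_size=0.25, random_state=42)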

Tensorflow super simple model? 10 inputs, 1 output, so 11 trainable parameters

I am a little new to TensorFlow; I'm using TensorFlow.js, but feel free to post your Python code.
What I am trying to achieve is the following:
I want to train a simple model with 10 inputs and 1 output.
I have 10 inputs, each of the same dimensions [255, 255].
The output should be of size [255, 255] as well and should add the inputs together according to some weights, so there will be 10 weights (+ bias); the output is simply a linear combination of the inputs.
I want to train these 10 weights so that the result is as close as possible to a validation (target) matrix of size [255, 255]. I think absoluteDifference is the best loss function for this.
However, I have no idea how to build this trainable model in TensorFlow. So far this is what I have:
const model = tf.sequential();
model.add(tf.layers.dense({inputShape: [255,255], units: 10, activation: 'relu'}));
/* Prepare the model for training: Specify the loss and the optimizer. */
model.compile({loss: 'absoluteDifference', optimizer: 'momentum'});
In Python it would be something like this:
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(255, 255, 10)),  # 10 inputs of 255x255
    keras.layers.Dense(9, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # assuming binary classification, we use sigmoid
])
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy())  # from_logits left False: the last layer already applies a sigmoid
Quick note: in TF 2.0 the absolute_difference loss does not exist; you'd have to use TF 1.x.
You can go through a detailed example of it in the TF documentation.
EDIT:
Model Summary
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_3 (Flatten) (None, 650250) 0
_________________________________________________________________
dense_5 (Dense) (None, 9) 5852259
_________________________________________________________________
dense_6 (Dense) (None, 1) 10
=================================================================
Total params: 5,852,269
Trainable params: 5,852,269
Non-trainable params: 0
_________________________________________________________________
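For what it's worth, the 11-parameter linear combination the question describes can also be expressed directly. The following is only a sketch in Python/Keras (not part of the answer above): it assumes the 10 inputs are stacked as the channels of a single (255, 255, 10) tensor, so a 1x1 convolution with one filter computes w1*x1 + ... + w10*x10 + b at every pixel, i.e. exactly 10 weights plus 1 bias.

import tensorflow as tf

# 1x1 conv over 10 channels = per-pixel linear combination with 11 parameters
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(filters=1, kernel_size=1, use_bias=True,
                           input_shape=(255, 255, 10))   # output: (255, 255, 1)
])
model.compile(optimizer=tf.keras.optimizers.SGD(momentum=0.9),
              loss=tf.keras.losses.MeanAbsoluteError())
model.summary()   # Total params: 11

The target for training would be the [255, 255] validation matrix, reshaped to (255, 255, 1).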

How to calculate the number of multiplications happening in BatchNormalization layer during test evaluation?

or: why do my CNN's test evaluations take significantly longer with BatchNormalization than without?
I need to approximate the theoretical runtime for the evaluation of a trained CNN (using Keras with the TF backend) on a test set, so I attempted to count the multiplications happening during evaluation and use that as a metric.
But for some reason, Batch Normalization (BN) appears to have a significant impact on the evaluation time, even though in my understanding it should not matter much in theory.
I can calculate the number of multiplications for Dense and Conv layers, and I thought I could ignore the computations for the activation function and for Batch Normalization, as both only add one multiplication per input, which is far less than what the convolutional layers do.
However, when I test the same network once with and once without Batch Normalization after every conv layer, I notice that I cannot ignore it:
In the simple example given below, there is only one conv layer with filter size (3x3), followed by a softmax-activated dense layer, as I'm doing classification.
With BN after the conv layer, it takes me ~4.6 seconds to work through the test set.
Using the otherwise exact same net architecture without BN, the same test set is processed in half the time.
Summary of the test configuration with BN (finishes test set evaluation in ~4.6s):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
batch_normalization_1 (Batch (None, 32, 32, 32) 128
_________________________________________________________________
flatten_1 (Flatten) (None, 32768) 0
_________________________________________________________________
dense_1 (Dense) (None, 43) 1409067
=================================================================
Total params: 1,410,091
Trainable params: 1,410,027
Non-trainable params: 64
Without BN (finishes test set evaluation in ~2.3s):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_2 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
flatten_2 (Flatten) (None, 32768) 0
_________________________________________________________________
dense_2 (Dense) (None, 43) 1409067
=================================================================
Total params: 1,409,963
Trainable params: 1,409,963
Non-trainable params: 0
I don't know how this scales, as I don't understand the cause in the first place, but I can tell that I have tested other nets with 3 to 6 identical conv layers (using padding='same' to keep the dimensions constant), and the difference in test evaluation time varied between ~25% and ~50% in most cases (the one-conv-layer example shown here is even ~100%).
Why does BN have such a big impact? In other words, what calculations are happening that I'm missing?
I thought: BN just adds one multiplication per input. So, for example, in the network with BN given above:
I expected batch_normalization_1 to add 32*32*32 multiplications, and conv2d_1 to add 32*32*32*3*3*3 multiplications (output positions × filters × kernel size × input channels).
But then, how does BN have so much impact on the overall runtime, even though the conv layer adds far more multiplications?
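To make the counting argument concrete, here is a small sketch of those per-layer multiplication counts for the one-conv-layer network in this question (inference only; it assumes BN costs one multiply and one add per activation, with the scale already folded in):

# Rough inference-time multiplication counts
H, W, C_in = 32, 32, 3      # input image: 32x32 RGB
filters, k = 32, 3          # conv2d_1: 32 filters, 3x3 kernel, padding='same'
classes = 43

conv_mults  = H * W * filters * k * k * C_in   # 884,736
bn_mults    = H * W * filters                  # 32,768 (one per activation)
dense_mults = (H * W * filters) * classes      # 1,409,024

print(conv_mults, bn_mults, dense_mults)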
Code used to build the model:
from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", input_shape=x_train.shape[1:], padding="same"))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(43, activation='softmax'))
with x_train.shape[1:] being (32, 32, 3), representing a 32x32 image with RGB colors.
Kind of answering my own question here, in case anyone stumbles across the same issue.
By embedding the Fritz AI benchmark library (https://docs.fritz.ai/python-library/benchmark.html), I could actually check the number of FLOPS per layer, and it turned out that the normalization indeed only adds a negligible amount of computation.
----------------------------------------------------------------------------------------------------------------------
Layer (type) Output Shape MFLOPS Weights Core ML Compatible
======================================================================================================================
conv2d_1 (Conv2D) [None, 32, 32, 32] 0.92 896 True
----------------------------------------------------------------------------------------------------------------------
batch_normalization_1 (BatchNormalization) [None, 32, 32, 32] 0.07 128 True
----------------------------------------------------------------------------------------------------------------------
flatten_1 (Flatten) [None, 32768] 0.00 0 True
----------------------------------------------------------------------------------------------------------------------
dense_1 (Dense) [None, 43] 2.82 1,409,067 True
----------------------------------------------------------------------------------------------------------------------
That said, the slowdown must be caused by some inefficient routine, or even a bug, in Keras when evaluating models with Batch Normalization. Weird, but that is the only explanation I can come up with.

Sentiment classifier training with Keras

I am using Keras (TensorFlow backend) to classify sentiments from Amazon reviews.
The model starts with an embedding layer (which uses GloVe), then an LSTM layer, and finally a Dense layer as output. Model summary below:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 100) 2258700
_________________________________________________________________
lstm_1 (LSTM) (None, 16) 7488
_________________________________________________________________
dense_1 (Dense) (None, 5) 85
=================================================================
Total params: 2,266,273
Trainable params: 2,266,273
Non-trainable params: 0
_________________________________________________________________
Train on 454728 samples, validate on 113683 samples
During training, the train and validation accuracy are both about 74%, and the loss (train and validation) is around 0.6.
I've tried changing the number of units in the LSTM layer, as well as adding dropout, recurrent dropout and a regularizer, and swapping in a GRU (instead of the LSTM). The accuracy then increased a bit (~76%).
What else could I try in order to improve my results?
I have had much better success with sentiment analysis using a Bidirectional LSTM. Stacking two LSTM layers vertically (i.e. two LSTMs forming a deeper network) also helped, and try increasing the number of LSTM units to around 128.
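A minimal sketch of that suggestion (the vocabulary size is derived from the 2,258,700 embedding parameters in the summary above, i.e. 22,587 words x 100-dimensional GloVe vectors; embedding_matrix, the loss and the metrics are assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Embedding(input_dim=22587, output_dim=100,
              weights=[embedding_matrix], trainable=False),  # pre-built GloVe matrix (assumed variable)
    Bidirectional(LSTM(128, return_sequences=True)),  # first stacked LSTM
    Bidirectional(LSTM(128)),                         # second stacked LSTM
    Dense(5, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])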

Can I use transfer learning to retrain a Neural Network on different subsets of the data to solve memory problems?

I am trying to train a Neural Network on the Amazon Reviews dataset so that it learns to classify correctly between positive and negative sentiment. The approach I am trying is to first use Google's Word2Vec model to vectorize each review by looking up each word's vector in the model, then feed the result into a Convolutional Neural Network to train it.
I obtained Google's pre-trained Word2Vec model from here, which gives me 300-dimensional word vectors, and by truncating each review to 80 words I obtain an 80 x 300 matrix per review.
The Convolutional Neural Network I train has the following structure:
Layer (type) Output Shape
- conv2d_1 (Conv2D) (None, 1, 300, 128)
- conv2d_2 (Conv2D) (None, 1, 300, 64)
- conv2d_3 (Conv2D) (None, 1, 300, 32)
- conv2d_4 (Conv2D) (None, 1, 300, 16)
- flatten_1 (Flatten) (None, 4800)
- dropout_1 (Dropout 0.5) (None, 4800)
- dense_1 (Dense) (None, 256)
- batch_normalization_1 (Batch (None, 256)
- activation_1 (Relu) (None, 256)
- dropout_2 (Dropout 0.5) (None, 256)
- dense_2 (Dense) (None, 1)
I use a large network with many neurons and heavy dropout to reduce overfitting on the training data.
However, my main problem is that I am unable to train on most of the data because I can't load it all into memory, and since the featurized vectors consist mostly of high-precision decimals, they take up a lot of memory, and a lot of disk space if I serialize them.
Is it possible for me to use transfer learning to get around not being able to train on all the data at once? The approach I plan to use is:
Load a subset of the dataset that can fit into the memory
Vectorize it using Google's Word2Vec model (This part takes around 5-10 minutes)
Train the model for 50-100 epochs
Load in a second subset of the dataset and repeat.
Is this a valid approach for training a large model? Since I am re-training the model on subsets of the same dataset, am I correct in assuming that I won't have to freeze any layers?
Also, is Stochastic Gradient Descent a good optimizer for this problem, since I will be training on a large amount of data?
Instead of your proposed method, I think it would be more appropriate to use a data generator.
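A minimal sketch of that idea, assuming a tf.keras setup; texts, labels and vectorize_fn (the 80 x 300 Word2Vec lookup from the question) are placeholders:

import numpy as np
from tensorflow.keras.utils import Sequence

class ReviewSequence(Sequence):
    """Yields (batch of 80x300 review matrices, labels) one batch at a time,
    so the full featurized dataset never has to sit in memory or on disk."""
    def __init__(self, texts, labels, vectorize_fn, batch_size=64):
        self.texts = texts                 # raw review strings
        self.labels = np.asarray(labels)   # 0/1 sentiment labels
        self.vectorize = vectorize_fn      # text -> (80, 300) Word2Vec matrix
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.texts) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x = np.stack([self.vectorize(t) for t in self.texts[sl]])
        return x, self.labels[sl]

# model.fit(ReviewSequence(train_texts, train_labels, vectorize_fn), epochs=10)

Keras then calls the generator batch by batch, so only one batch of featurized reviews is in memory at any time, and no repeated re-training over subsets is needed.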