I am using Keras (TensorFlow backend) to classify the sentiment of Amazon reviews.
The model starts with an embedding layer (initialized with GloVe), followed by an LSTM layer and finally a Dense output layer. Model summary below:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 100) 2258700
_________________________________________________________________
lstm_1 (LSTM) (None, 16) 7488
_________________________________________________________________
dense_1 (Dense) (None, 5) 85
=================================================================
Total params: 2,266,273
Trainable params: 2,266,273
Non-trainable params: 0
_________________________________________________________________
Train on 454728 samples, validate on 113683 samples
During training, the train and validation accuracy is about 74% and the loss (train and validation) is around 0.6.
I've tried changing the number of units in the LSTM layer, adding dropout, recurrent dropout and regularizers, and swapping the LSTM for a GRU. With that the accuracy increased a bit (~76%).
What else could I try in order to improve my results?
I have had better success with sentiment analysis using a bidirectional LSTM. Stacking two layers vertically, i.e. two LSTMs forming a deeper network, also helped, and try increasing the number of LSTM units to around 128.
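For what it's worth, a rough sketch of that kind of architecture (unit counts and the vocab_size / glove_matrix names are assumptions, not taken from your model):
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[glove_matrix]))  # GloVe-initialized embedding, as in the question
model.add(Bidirectional(LSTM(128, return_sequences=True)))     # first LSTM passes the full sequence on
model.add(Bidirectional(LSTM(128)))                            # second LSTM returns only its final state
model.add(Dense(5, activation='softmax'))                      # 5 sentiment classes, as in the question
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])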
In the neural network below, the 2nd layer is non-trainable. When calculating the gradient for the 1st layer, however, will the 2nd layer participate?
In short, when a layer is set to non-trainable, does it affect the gradient descent of the other layers?
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 256) 200960
my_pca_2 (my_pca) (None, 10) 2570
=================================================================
Total params: 203,530
Trainable params: 200,960
Non-trainable params: 2,570
_________________________________________________________________
The second layer in the neural network is set to non-trainable. This only means that the weights in that layer will not be updated during the training process.
However, when calculating the gradient for the first layer, the second layer still participates. The gradient of the loss with respect to the first layer's weights is obtained by back-propagating through the second layer, so the second layer's (frozen) weights enter that computation and affect the gradients of the first layer. In other words, the non-trainable status of a layer only affects its own weight updates, not its impact on the gradients of other layers.
This is the essence of backpropagation: the chain rule dictates how the gradient of each layer is computed, and there is no way a layer could fail to affect the gradients of its predecessor layer.
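A quick sketch (toy shapes and random data, not from the question) that makes this concrete: even with the second layer frozen, the first layer still receives gradients, and they are computed through the frozen layer's weights.
import tensorflow as tf

first = tf.keras.layers.Dense(256, activation='relu')
second = tf.keras.layers.Dense(10)
second.trainable = False  # frozen: excluded from weight updates

x = tf.random.normal((4, 784))   # dummy batch
y = tf.random.normal((4, 10))    # dummy targets

with tf.GradientTape() as tape:
    out = second(first(x))
    loss = tf.reduce_mean(tf.square(out - y))

# Chain rule: the gradient w.r.t. the first layer's weights is computed
# through the (frozen) second layer's weights.
grads = tape.gradient(loss, first.trainable_variables)
print([g.shape for g in grads])  # gradients exist and are non-trivial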
I am a little new to TensorFlow. I'm using TensorFlow.js, but feel free to post your Python code.
What I am trying to achieve is the following:
I want to train a simple model of 10 inputs and 1 output.
I have 10 inputs of consistent dimensions [255,255].
The output should be of size [255,255] as well, and should add the inputs together according to some weights. So there are 10 weights (+ bias); the output is simply a linear combination of the inputs.
I want to train these 10 weights so that the result is as close as possible to a validation matrix of size [255,255]. I think absoluteDifference is the best loss function for this.
However, I have no idea how to build this trainable model in TensorFlow. So far this is what I've got:
const model = tf.sequential();
model.add(tf.layers.dense({inputShape: [255,255], units: 10, activation: 'relu'}));
/* Prepare the model for training: Specify the loss and the optimizer. */
model.compile({loss: 'absoluteDifference', optimizer: 'momentum'});
In Python it would be something like this:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(255, 255, 10)),  # 10 inputs of 255x255
    keras.layers.Dense(9, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')  # assuming binary classification, hence sigmoid
])
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(from_logits=False))  # the last layer already applies a sigmoid, so from_logits=False
A quick note: in TF 2.0 the absolute_difference loss does not exist; you'd have to use TF 1.x (or its TF 2.x equivalent, mean absolute error).
You can go through a detailed example of it in the TF documentation.
EDIT:
Model Summary
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_3 (Flatten) (None, 650250) 0
_________________________________________________________________
dense_5 (Dense) (None, 9) 5852259
_________________________________________________________________
dense_6 (Dense) (None, 1) 10
=================================================================
Total params: 5,852,269
Trainable params: 5,852,269
Non-trainable params: 0
_________________________________________________________________
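If the goal really is just 10 scalar weights plus a bias applied to 10 [255,255] inputs, one way to express that in Keras is a 1x1 convolution over the inputs stacked as channels. This is only a sketch of my own (the kernel_size=1 Conv2D trick and the 'mae' loss are assumptions, not something from the question or the answer above):
import tensorflow as tf

model = tf.keras.Sequential([
    # the 10 inputs stacked along the channel axis: shape (255, 255, 10)
    tf.keras.layers.Conv2D(1, kernel_size=1, use_bias=True,
                           input_shape=(255, 255, 10))  # exactly 10 weights + 1 bias
])
model.compile(optimizer='adam', loss='mae')  # mean absolute error ~ absoluteDifference
model.summary()  # Total params: 11; output shape (None, 255, 255, 1)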
I would love some insight on this question - I've tried to find explanations in the literature, but I'm stumped. I am building a neural network (using Keras) to solve a regression problem. I have ~500,000 samples with 20,000 features each, and am trying to predict a numerical output. Think predicting a house price based on a bunch of numerical measurements of the house, yard, etc. The features are arranged alphabetically, so neighboring features have essentially no relationship to each other.
When I first tried to create a neural network, it suffered from severe overfitting if I provided all 20,000 features - manually reducing it to 1,000 features improved performance massively.
I read about 1x1 convolutions being used for feature reduction, but everything I found was about images and 2D inputs.
So I built a basic neural network with 3 layers:
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

model = Sequential()
model.add(Conv1D(128, kernel_size=1, activation="relu", input_shape=(n_features, 1)))
model.add(Flatten())
model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
I also reshaped my training set from (n_samples, n_features) to conform to the expected input of Conv1D:
reshaped = X_train.reshape(n_samples, n_features, 1)
Contrary to normal dense neural networks, this works as though I had manually selected the top-performing features. My question is: why does this work? Replacing the convolution layer with a dense layer completely kills the performance. Does this even have anything to do with feature reduction, or is something else going on entirely?
I thought 2D images use 1x1 convolutions to reduce the channel dimension of the image - but I only have 1 channel with a 1x1 convolution, so what's being reduced? Does setting my 1D convolution layer's filters to 128 mean I have selected 128 features which are subsequently fed to the next layer? Are the features selected based on loss backpropagation?
I'm having a lot of trouble visualizing what is happening to the information from my features.
Lastly, what if I were to add another convolution layer down the road? Is there a way to conceptualize what would happen if I added another 1x1 layer? Is it further subsampling of features?
Thank you!
Let's replace the Conv1D layer in your model with a Dense layer of 128 units and compare the summaries of the two models.
Conv Model
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
n_features = 1000 # your sequence length
model = Sequential()
model.add(Conv1D(128, kernel_size=1, activation="relu", input_shape=(n_features,1)))
model.add(Flatten())
model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 1000, 128) 256
_________________________________________________________________
flatten_1 (Flatten) (None, 128000) 0
_________________________________________________________________
dense_8 (Dense) (None, 100) 12800100
_________________________________________________________________
dense_9 (Dense) (None, 1) 101
=================================================================
Total params: 12,800,457
Trainable params: 12,800,457
Non-trainable params: 0
FC Model
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model, Sequential
n_features = 1000 # your sequence length
model = Sequential()
model.add(Dense(128, activation="relu", input_shape=(n_features,1)))
model.add(Flatten())
model.add(Dense(100, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_10 (Dense) (None, 1000, 128) 256
_________________________________________________________________
flatten_2 (Flatten) (None, 128000) 0
_________________________________________________________________
dense_11 (Dense) (None, 100) 12800100
_________________________________________________________________
dense_12 (Dense) (None, 1) 101
=================================================================
Total params: 12,800,457
Trainable params: 12,800,457
Non-trainable params: 0
_____________________________
As you can see, both models have an identical number of parameters in each layer, but inherently they are completely different.
Say we have an input of length 4. A 1x1 convolution with 3 filters will use 3 separate kernels on those 4 inputs; each kernel operates on a single element of the input at a time, because we chose kernel_size=1. So each kernel is just a single scalar value that is multiplied with the input array one element at a time (a bias is added). The point is that the 1x1 convolution doesn't look anywhere besides the current input element: it has no spatial freedom, it only looks at one input position at a time (this will be useful for the explanation below).
Now, with a dense/FC layer every neuron is connected to every input, so the FC layer has full spatial freedom: it looks everywhere. The equivalent Conv layer would be one with kernel_size = 1000 (the actual input length).
So why might the Conv1D 1x1 convolution perform better?
It's hard to tell without actually looking at the data properties, but one guess is that your features don't have any spatial dependency.
The features are ordered arbitrarily, and looking at many input features at once doesn't help but instead learns extra noise. That could be why you get better performance with a Conv layer, which only looks at one feature at a time, than with an FC layer, which looks at all of them and mixes them.
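To make the kernel_size=1 behaviour concrete, here is a small sketch (toy shapes and random data, not from the question) showing that such a convolution is just a per-element scaling: each of the 3 filters multiplies every input position by one scalar and adds a bias.
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 1).astype("float32")    # batch of 1, "sequence" of 4 features, 1 channel

conv = tf.keras.layers.Conv1D(3, kernel_size=1)  # 3 filters, each a single scalar + bias
y = conv(x)                                      # shape (1, 4, 3): every position scaled by the same 3 scalars

w, b = conv.get_weights()                        # kernel has shape (1, 1, 3)
manual = x * w.reshape(1, 1, 3) + b              # reproduce the conv by hand
print(np.allclose(y.numpy(), manual))            # True: no mixing across feature positions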
Or: why do my CNN's test evaluations take significantly longer with BatchNormalization than without?
I need to approximate the theoretical runtime for evaluating a trained CNN (built with Keras on the TF backend) on a test set, so I attempted to count the multiplications happening during evaluation and use that number as a metric.
But for some reason, Batch Normalization (BN) appears to have a significant impact on the evaluation time, despite not being relevant in theory as far as I understand.
I can calculate the number of multiplications for Dense and Conv layers, and I thought I could ignore the computations for the activation function and for Batch Normalization, as both only add one multiplication per input, which is far less than what the convolutional layers do.
However, when I test the same network once with and once without Batch Normalization after every conv layer, I notice that I cannot ignore it:
In the simple example given below, there is only one conv layer with filter size (3x3), followed by a softmax-activated dense layer, as I'm doing classification.
With BN after the conv layer, it takes ~4.6 seconds to work through the test set.
Using the otherwise identical architecture without BN, the same test set is processed in half that time.
Summary of the test configuration with BN (finishes test set evaluation in ~4.6s):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
batch_normalization_1 (Batch (None, 32, 32, 32) 128
_________________________________________________________________
flatten_1 (Flatten) (None, 32768) 0
_________________________________________________________________
dense_1 (Dense) (None, 43) 1409067
=================================================================
Total params: 1,410,091
Trainable params: 1,410,027
Non-trainable params: 64
Without BN (finishes test set evaluation in ~2.3s):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_2 (Conv2D) (None, 32, 32, 32) 896
_________________________________________________________________
flatten_2 (Flatten) (None, 32768) 0
_________________________________________________________________
dense_2 (Dense) (None, 43) 1409067
=================================================================
Total params: 1,409,963
Trainable params: 1,409,963
Non-trainable params: 0
I don't know how this scales, as I don't understand the cause in the first place, but I can tell that I have tested other nets with 3 to 6 identical conv layers (using padding='same' to keep the dimensions constant), and the difference in test evaluation time varied between ~25% and ~50% in most cases (the one-conv-layer example here even shows ~100%).
Why does BN have such a big impact? In other words, what calculations are happening that I'm missing?
I thought BN just adds one multiplication per input. So, for example, in the network with BN given above:
I expected batch_normalization_1 to add 32*32*32 multiplications, and conv2d_1 32*32*32*3*3 multiplications.
But then, how can BN have so much impact on the overall runtime, even though the conv layers add far more multiplications?
Code used to build the model:
from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", input_shape=x_train.shape[1:], padding="same"))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(43, activation='softmax'))
with x_train.shape[1:] being (32, 32, 3), representing a 32x32 image with RGB colors.
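For reference, a rough sketch (not the original benchmark code; x_test, y_test and the batch size are assumptions) of how such a timing comparison can be run on a compiled model:
import time

def timed_eval(model, x_test, y_test, batch_size=128):
    # Time a single pass over the test set with model.evaluate()
    start = time.time()
    model.evaluate(x_test, y_test, batch_size=batch_size, verbose=0)
    return time.time() - start

# print(timed_eval(model_with_bn, x_test, y_test))
# print(timed_eval(model_without_bn, x_test, y_test))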
Kind of answering my own question here, in case anyone stumbles across the same issue.
By embedding the Fritz AI benchmark library (https://docs.fritz.ai/python-library/benchmark.html), I could actually check the number of FLOPS per layer, and it indeed turned out that the normalization adds only a negligible amount of computation.
----------------------------------------------------------------------------------------------------------------------
Layer (type) Output Shape MFLOPS Weights Core ML Compatible
======================================================================================================================
conv2d_1 (Conv2D) [None, 32, 32, 32] 0.92 896 True
----------------------------------------------------------------------------------------------------------------------
batch_normalization_1 (BatchNormalization) [None, 32, 32, 32] 0.07 128 True
----------------------------------------------------------------------------------------------------------------------
flatten_1 (Flatten) [None, 32768] 0.00 0 True
----------------------------------------------------------------------------------------------------------------------
dense_1 (Dense) [None, 43] 2.82 1,409,067 True
----------------------------------------------------------------------------------------------------------------------
That said, the issue must be caused by some inefficient routine, or even a bug, in how Keras evaluates models with Batch Normalization. Weird, but that is the only explanation I can come up with.
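If the inference-time overhead matters in practice, one workaround (not from the original post; it assumes the Conv2D + BatchNormalization layout above with default BN settings) is to fold the trained BN parameters into the preceding convolution, so the evaluated model contains no BN op at all:
import numpy as np

conv = model.layers[0]  # Conv2D
bn = model.layers[1]    # BatchNormalization

w, b = conv.get_weights()                  # conv kernel and bias
gamma, beta, mean, var = bn.get_weights()  # BN scale, offset, moving mean/variance
scale = gamma / np.sqrt(var + bn.epsilon)

w_folded = w * scale                       # broadcast over the output-channel axis
b_folded = (b - mean) * scale + beta

# Build the same architecture without the BN layer and load (w_folded, b_folded)
# into its conv layer; its inference output matches the original conv + BN.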
I'm using Keras instead of dealing with TensorFlow directly because of its simplicity. But when I tried to visualize the computational graph in Keras by passing a keras.callbacks.TensorBoard instance to the callbacks argument of model.fit(), the graph I got from TensorBoard was very cluttered.
For demonstration purposes, I only built a very simple linear classifier with 1 unit in 1 dense layer, but the graph looks like this:
Could I do the same thing as in plain TensorFlow, e.g. use name scopes to group things together and give the layers, biases and weights names? I mean, the graph here is such a mess that I can only recognize the Dense layer and a logistic-loss namespace, whereas with plain TensorFlow you typically see something like a train namespace and far fewer nodes without a namespace. How can I make it clearer?
The TensorFlow graph shows all the computations being called; you won't be able to simplify it.
As an alternative, Keras has its own layer-by-layer graph, which shows a clear and concise structure of your network. You can generate it by calling:
from keras.utils import plot_model
plot_model(model, to_file='/some/pathname/model.png')
Lastly, you can also call model.summary(), which generates a textual version of the graph with additional summaries.
Here is an example output of model.summary():
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_1 (InputLayer) (None, 2048) 0
____________________________________________________________________________________________________
activation_1 (Activation) (None, 2048) 0
____________________________________________________________________________________________________
dense_1 (Dense) (None, 511) 1047039
____________________________________________________________________________________________________
activation_2 (Activation) (None, 511) 0
____________________________________________________________________________________________________
decoder_layer_1 (DecoderLayer) (None, 512) 0
____________________________________________________________________________________________________
ctg_output (OrLayer) (None, 201) 102912
____________________________________________________________________________________________________
att_output (OrLayer) (None, 312) 159744
====================================================================================================
Total params: 1,309,695.0
Trainable params: 1,309,695.0
Non-trainable params: 0.0
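If you also want more readable node names in the TensorBoard graph itself, giving each Keras layer an explicit name helps, since those names are used in the graph, in plot_model and in model.summary(). A small sketch (layer size and names are my own, not from the question):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential(name='linear_classifier')
model.add(Dense(1, input_shape=(10,), activation='sigmoid', name='logistic_output'))
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.summary()  # the layer now shows up as "logistic_output (Dense)"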