How to mask paddings in LSTM model for speech emotion recognition - tensorflow

Given a few directories of .wav audio files, I have extracted their features in terms of a 3D array (batch, step, features).
For my case, the training dataset is (1883,100,136).
Basically, each audio file has been analyzed at 100 time steps (imagine that as 1 fps), and 136 features have been extracted at each step. However, the audio files differ in length, so some of them do not have 100 steps.
For instance, one of the audio files has only 50 effective sets of 136 features, so the remaining 50 sets were padded with zeros.
Here is my model.
def LSTM_model_building(units=200, learning_rate=0.005, epochs=20, dropout=0.19, recurrent_dropout=0.2):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Bidirectional(LSTM(units, dropout=dropout, recurrent_dropout=recurrent_dropout, input_shape=(X_train.shape[0], 100, 136))))
    # model.add(tf.keras.layers.Bidirectional(LSTM(32)))
    model.add(Dense(num_classes, activation='softmax'))
    adamopt = tf.keras.optimizers.Adam(lr=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    opt = tf.keras.optimizers.RMSprop(lr=learning_rate, rho=0.9, epsilon=1e-6)
    # opt = tf.keras.optimizers.SGD(lr=learning_rate, momentum=0.9, decay=0., nesterov=False)
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        verbose=1)
    score, acc = model.evaluate(X_test, y_test)
    return history
I wish to mask the padding; however, the instructions on the Keras website use an embedding layer, which I believe is usually used for NLP. I have no idea how to use the embedding layer for my model.
Can anyone teach me how to apply masking for my LSTM model?

The Embedding layer is not suited to your case. Consider the Masking layer instead. It is easy to integrate into your model structure, as shown below.
I also remind you that the input shape must be specified in the first layer of a sequential model, and that you don't need to pass the sample dimension. In your case, the input shape is (100,136), which corresponds to (timesteps,n_features).
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense

units,learning_rate,dropout,recurrent_dropout = 200,0.005,0.19,0.2
num_classes = 3
model = tf.keras.models.Sequential()
# timesteps whose 136 features are all 0.0 are skipped by the layers that follow
model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(100,136)))
model.add(tf.keras.layers.Bidirectional(LSTM(units, dropout=dropout, recurrent_dropout=recurrent_dropout)))
model.add(Dense(num_classes, activation='softmax'))
adamopt = tf.keras.optimizers.Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
opt = tf.keras.optimizers.RMSprop(learning_rate=learning_rate, rho=0.9, epsilon=1e-6)
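To make the rule behind the Masking layer concrete, here is a minimal pure-Python sketch of the decision it documents: a timestep is masked only when every feature at that step equals mask_value. The helper names (timestep_mask, pad_sequence) and the toy 3-feature data are assumptions for illustration; this is not how Keras implements the layer internally.

```python
def timestep_mask(sequence, mask_value=0.0):
    """Return one boolean per timestep: True = keep, False = masked."""
    # A step is kept as soon as one feature differs from mask_value.
    return [any(v != mask_value for v in step) for step in sequence]


def pad_sequence(sequence, max_steps, n_features, pad_value=0.0):
    """Append all-pad_value timesteps until the sequence has max_steps steps."""
    padding = [[pad_value] * n_features for _ in range(max_steps - len(sequence))]
    return sequence + padding


n_features = 3  # toy size; the question uses 136 features and 100 steps
real_steps = [[0.5, 1.2, -0.3], [0.1, 0.0, 0.7]]
padded = pad_sequence(real_steps, max_steps=4, n_features=n_features)

mask = timestep_mask(padded)
print(mask)  # [True, True, False, False]
```

Note that a genuine timestep whose features all happen to be exactly 0.0 would be masked as well, so it may be worth checking whether that can occur in the extracted features.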


Why does model.summary() give shape None when input shape is clear and fixed?

The code below, adapted from TensorFlow:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(len(x_train), -1)
model = tf.keras.models.Sequential([
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, verbose=0)
gives output Shape (32, 10), whereas this code
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, verbose=0)
gives Output Shape (None, 10).
I'm aware that 32 is the batch size and 10 is the number of output classes. I'd just like to know where the None comes from when the input shape is clear and fixed.
The first dimension is the number of samples (batch_size). Since it should be flexible and work with any number of samples or batch sizes, it is represented as None. So, don't worry about it. Your model does not care about the first dimension.
For example, in your case the input shape is (28,28) and the output shape is (10). The model treats (None,28,28) and (None,10) as its input and output shapes. It means that you can feed the model any number of samples, as long as each input sample is (28,28), and the model returns the same number of samples, each with 10 outputs. This is the reason that you don't need to set the batch_size in the input_shape parameter of your first layer.
Another example of the first dimension varying is training versus prediction. For training you may pass an input array of shape, say, (10,28,28), which means 10 samples of size 28x28. But when you want a prediction from your model using model.predict(), you may pass a single sample of shape (1,28,28). So the first dimension varies during the model's life cycle, and that is why it is set to None.
The first model shows (32,10) because you called model.summary() after model.fit() and you didn't specify input_shape in your first layer, so the model infers its shapes from the training procedure. model.fit() sets batch_size to 32 by default, so the summary shows the batch size.
But if you set input_shape, which should not include the batch size, the model is built with None as the first dimension.
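The idea that None stands for "any number of samples" can be sketched in a few lines of plain Python. The function shape_matches below is a hypothetical helper written for this illustration, not part of Keras; it only mimics the way a None dimension accepts any size while the remaining dimensions must match exactly.

```python
def shape_matches(declared, actual):
    """True if `actual` fits `declared`, where None accepts any size."""
    if len(declared) != len(actual):
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


model_input = (None, 28, 28)  # what a model built with input_shape=(28, 28) expects

print(shape_matches(model_input, (10, 28, 28)))  # True: 10 training samples
print(shape_matches(model_input, (1, 28, 28)))   # True: one sample for predict
print(shape_matches(model_input, (10, 32, 32)))  # False: wrong image size
print(shape_matches((32, 28, 28), (1, 28, 28)))  # False: batch size fixed at 32
```

The last call shows why a hard-coded batch dimension would be inconvenient: a single-sample prediction would no longer fit the declared shape.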

DCNN for Binary Classification Converges to 50%/50%

I am new to Keras and have never asked a question here, so please excuse any rookie mistakes I might make.
What I am trying to do is to implement a binary classifier, operating on images (CTs to be exact).
My model is based on a pretrained net that performed classification on 14 classes (see wonderful git here).
As the saying goes, "crawl before you walk, walk before you run", my current humble goal is to achieve overfitting of the network on some 100 examples.
My current problem is that the net converges to a weird solution, with the output neuron (I'm using sigmoid) always very close to 50% and 100% of the predictions going to one class (so I'm stuck at about 50% accuracy). My loss and accuracy do not change at all from epoch 1 or so.
Things I tried/considered:
using different optimizers (I used the Adam optimizer and the SGD shown below).
switching to categorical crossentropy (with a softmax layer at the end instead of sigmoid), since some say it might perform better (see "Keras' fit_generator() for binary classification predictions always 50%").
adding an additional dense layer (I thought I might be underfitting somehow).
changing the batch size to 128 (and overfitting on 1000 examples).
All failed miserably, so I'm kind of at a loss here. I would be happy to provide more details if needed, and would appreciate any help or insights you might have. Major parts of my code are attached. Note that the ModelFactory() that I'm loading and using is the pretrained one.
Thanks in advance!
data generator code
rescale = 1./255.0
target_size = (224, 224)
batch_size = 128
train_datagen = ImageDataGenerator(
train_generator = train_datagen.flow_from_dataframe(
my model
def get_model():
    base_model = ModelFactory().get_model(class_names=[str(i) for i in range(14)],
    x = base_model.output
    x = keras.layers.Dense(1024, activation='relu')(x)
    x = keras.layers.BatchNormalization(trainable=True)(x)
    predictions = keras.layers.Dense(1, activation='sigmoid')(x)
    model = keras.models.Model(inputs=base_model.inputs, outputs=predictions)
    for layer in base_model.layers:
        layer.trainable = False
    return model
training the model
class_weight = sklearn.utils.class_weight.compute_class_weight('balanced', classes=np.unique(train_csv['class']), y=train_csv['class'])
model.compile(keras.optimizers.SGD(lr=1e-6, decay=1e-6, momentum=0.9, nesterov=True),
history = model.fit_generator(

Why I'm getting bad result with Keras vs random forest or knn?

I'm learning deep learning with Keras and trying to compare the results (accuracy) with machine learning algorithms from sklearn (e.g. random forest, k_neighbors).
It seems that with Keras I'm getting the worst results.
I'm working on a simple classification problem: the iris dataset.
My Keras code looks like this:
samples = datasets.load_iris()
X = samples.data
y = samples.target
df = pd.DataFrame(data=X)
df.columns = samples.feature_names
df['Target'] = y
# prepare data
X = df[df.columns[:-1]]
y = df[df.columns[-1]]
# hot encoding
encoder = LabelEncoder()
y1 = encoder.fit_transform(y)
y = pd.get_dummies(y1).values
# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
# build model
model = Sequential()
model.add(Dense(1000, activation='tanh', input_shape = ((df.shape[1]-1),)))
model.add(Dense(500, activation='tanh'))
model.add(Dense(250, activation='tanh'))
model.add(Dense(125, activation='tanh'))
model.add(Dense(64, activation='tanh'))
model.add(Dense(32, activation='tanh'))
model.add(Dense(9, activation='tanh'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train)
score, acc = model.evaluate(X_test, y_test, verbose=0)
#score = 0.77
#acc = 0.711
I have tried adding layers, changing the number of units per layer, and changing the activation function (to relu), but the results never get above 0.85.
With sklearn random forest or k_neighbors I'm getting results (on the same dataset) above 0.95.
What am I missing?
With sklearn I made little effort and got good results, while with Keras I made a lot of changes and still did not match the sklearn results. Why is that?
How can I get the same results with Keras?
In short, you need:
ReLU activations
Simpler model
Data normalization
More epochs
In detail:
The first issue here is that tanh is rarely used nowadays for the intermediate layers of a feed-forward network like this one. In such problems, we practically always use activation='relu'.
The second issue is that you have built quite a large Keras model, and it may well be that, with only about 100 iris samples in your training set, you have too little data to train such a large model effectively. Try drastically reducing both the number of layers and the number of nodes per layer. Start simpler.
Large neural networks really thrive when we have lots of data, but in cases of small datasets, like here, their expressiveness and flexibility may become a liability instead, compared with simpler algorithms, like RF or k-nn.
The third issue is that, in contrast to tree-based models like Random Forests, neural networks generally require normalized data, and you don't normalize yours. In truth, k-nn also requires normalized data, but in this special case, since all iris features are on the same scale, the lack of normalization does not hurt its performance.
Last but not least, you seem to run your Keras model for only one epoch (the default value if you don't specify anything in model.fit); this is somewhat equivalent to building a random forest with a single tree (which, BTW, is still much better than a single decision tree).
All in all, with the following changes in your code:
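from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# fit the scaler on the training data only, then reuse it for the test data
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
model = Sequential()
model.add(Dense(150, activation='relu', input_shape = ((df.shape[1]-1),)))
model.add(Dense(150, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
# compile exactly as in the question
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100)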
and everything else as is, we get:
score, acc = model.evaluate(X_test, y_test, verbose=0)
# 0.9333333373069763
We can do better: use slightly more training data and stratify them, i.e.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.20, # a few more samples for training
                                                    stratify=y)
And with the same model & training epochs you can get a perfect accuracy of 1.0 in the test set:
score, acc = model.evaluate(X_test, y_test, verbose=0)
# 1.0
(Details might differ due to some randomness imposed by default in such experiments).
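For readers who want to see what the normalization step does, here is a rough pure-Python sketch of standardization: per-feature mean and population standard deviation are computed on the training rows only and then reused for the test rows. The helper names (fit_scaler, transform) and the toy numbers are assumptions for illustration; in practice StandardScaler does this for you, and the sketch assumes no feature is constant (which would give a zero standard deviation).

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Per-column mean and population standard deviation of the training rows."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def transform(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data: two features per sample (made-up values, not the iris dataset).
train = [[5.0, 3.0], [6.0, 3.5], [7.0, 4.0]]
test = [[6.0, 3.0]]

means, stds = fit_scaler(train)          # statistics come from train only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # test reuses the train statistics

print(means)
print(test_scaled)
```

Reusing the training statistics for the test rows mirrors the fit_transform / transform split above and keeps information about the test set out of the preprocessing.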
Adding some dropout might help you improve accuracy. See TensorFlow's documentation for more information.
You add a Dropout layer in much the same way as you added the Dense() layers.
Note: A parameter of 0.2 means that 20% of the layer's inputs are randomly set to zero during training, which reduces the interdependencies between units and thereby reduces overfitting.
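As a rough illustration of what a dropout rate of 0.2 does during training, here is a pure-Python sketch of inverted dropout: each value is zeroed with probability rate, and the survivors are scaled by 1 / (1 - rate) so the expected sum stays the same. The function name dropout and the fixed seed are assumptions for this sketch; it is not the Keras implementation, only the same idea on a plain list.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    scale the survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)  # close to the rate
average = sum(dropped) / len(dropped)              # close to the original 1.0
print(round(zero_fraction, 2), round(average, 2))
```

At inference time dropout is switched off, which is why the scaling is applied during training rather than afterwards.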

How to use augmented data when using transfer learning?

I have used VGG16 for transfer learning and got very low accuracy. Is it possible to use data augmentation technique to increase the accuracy when using transfer learning?
Following is the code for better understanding:
# Show the image paths
train_path = 'myNetDB/train' # Relative Path
valid_path = 'myNetDB/valid'
test_path = 'myNetDB/test'
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(224, 224), classes=['dog', 'cat'], batch_size=10)
valid_batches = ImageDataGenerator().flow_from_directory(valid_path, target_size=(224, 224), classes=['dog', 'cat'], batch_size=4)
test_batches = ImageDataGenerator().flow_from_directory(test_path, target_size=(224, 224), classes=['dog', 'cat'], batch_size=10)
vgg16_model= load_model('Fetched_VGG.h5')
# transform the model to Sequential
model= Sequential()
for layer in vgg16_model.layers[:-1]:
    model.add(layer)
# Freezing the layers (prevents their weights from being updated)
for layer in model.layers:
    layer.trainable = False
# adding the last layer
model.add(Dense(2, activation='softmax'))
model.compile(Adam(lr=.0001), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(train_batches, steps_per_epoch=4,
                    validation_data=valid_batches, validation_steps=4, epochs=5, verbose=2)
predictions = model.predict_generator(test_batches, steps=1, verbose=0)
If you got very low accuracy, your dataset may be very different from the one VGG16 was trained on. There are two possibilities:
your dataset is big enough that you can train your model starting from the pre-trained weights.
your dataset is small. In this case there are no shortcuts. You should consider a simpler model than VGG16 so that you're less likely to overfit.
In both cases, to answer your question: yes, augmentation techniques, when applied carefully, help increase the accuracy.
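To show the principle behind augmentation without depending on any particular Keras version, here is a tiny pure-Python sketch: a random horizontal flip produces a new training image while the label stays the same. The helpers horizontal_flip and augment, the toy 2x3 "image", and the "cat" label are assumptions for illustration; a real pipeline would use the augmentation options of the image generator or preprocessing layers instead.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment(image, label, rng, flip_probability=0.5):
    """Return the image, flipped with the given probability; the label is unchanged."""
    if rng.random() < flip_probability:
        image = horizontal_flip(image)
    return image, label


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
image = [[1, 2, 3],
         [4, 5, 6]]

flipped = horizontal_flip(image)
print(flipped)  # [[3, 2, 1], [6, 5, 4]]

# Each pass over the data can see a different variant of the same image.
batch = [augment(image, "cat", rng) for _ in range(4)]
```

Whether a given transformation is appropriate depends on the data: a horizontal flip is usually harmless for dog and cat photos, but it would change the meaning of, say, images of text.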

RNN Not Generalizing on Text Classification

I am using Keras and an RNN to classify Slack text data on whether the text is reaction worthy or not (1 - emoji, 0 - no emoji). I have removed usernames and URLs from the text and dropped duplicates with different target variables.
I am not able to get the model to generalize to unseen data. The losses on the train/val sets look good and decrease continually, but the accuracy on the val set only decreases.
I am using a pretrained GloVe word embedding since my training set is only about 25,000 sentences.
I have added additional layers, changed my regularization value and increased dropout, but I get similar results. Is my model not complex enough to generalize to the data? When I added additional layers, they were much smaller but deeper, because the training time was about 2 min per epoch.
Any insight would be appreciated.
embedding_layer = Embedding(len(word_index) + 1,
# Creating the Model
model = Sequential()
model.add(Convolution1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compiling the model with our given Optimizer
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.000025)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])