Understanding output of Dense layer for higher dimension - tensorflow

I have no problem understanding the output shape of a Dense layer that follows a Flatten layer. The output shape matches my understanding, i.e. (batch_size, units).
nn= keras.Sequential()
nn.add(keras.layers.Conv2D(8,kernel_size=(2,2),input_shape=(4,5,1)))
nn.add(keras.layers.Conv2D(1,kernel_size=(2,2)))
nn.add(keras.layers.Flatten())
nn.add(keras.layers.Dense(5))
nn.add(keras.layers.Dense(1))
nn.summary()
Output is:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 3, 4, 8) 40
_________________________________________________________________
conv2d_2 (Conv2D) (None, 2, 3, 1) 33
_________________________________________________________________
flatten_1 (Flatten) (None, 6) 0
_________________________________________________________________
dense_1 (Dense) (None, 5) 35
_________________________________________________________________
dense_2 (Dense) (None, 1) 6
=================================================================
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
But I am having trouble understanding the output shape of a Dense layer for multidimensional input. So for the following code
nn= keras.Sequential()
nn.add(keras.layers.Conv2D(8,kernel_size=(2,2),input_shape=(4,5,1)))
nn.add(keras.layers.Conv2D(1,kernel_size=(2,2)))
#nn.add(keras.layers.Flatten())
nn.add(keras.layers.Dense(5))
nn.add(keras.layers.Dense(1))
nn.summary()
output is
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 3, 4, 8) 40
_________________________________________________________________
conv2d_2 (Conv2D) (None, 2, 3, 1) 33
_________________________________________________________________
dense_1 (Dense) (None, 2, 3, 5) 10
_________________________________________________________________
dense_2 (Dense) (None, 2, 3, 1) 6
=================================================================
Total params: 89
Trainable params: 89
I am unable to build an intuition for the output shapes of the dense_1 and dense_2 layers. Shouldn't the final output be a scalar or (batch, units)?
The following answer to a similar question tries to explain the intuition, but I cannot fully grasp the concept.
From the same answer:
That is, each output "pixel" (i, j) in the 640x959 grid is calculated as a dense combination of the 8 different convolution channels at point (i, j) from the previous layer.
Maybe an explanation with pictures would be useful.

This is tricky, but it does fit with the Keras documentation on dense layers:
Output shape
nD tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units)
Note it is not the clearest, but they are saying with the ... that the final dimension of the input shape will be replaced by the number of dense units. Basically, for each item along the final dimension, create a connection to each of the requested dense nodes in the coming dense layer.
In your case you have something which is 2 x 3 x 1. So there is "one thing" (the 2 x 3 thing) to be connected to each of the 5 dense layer nodes, hence 2 x 3 x 5. You can think of it like channels of a CNN layer in this particular case. There is a distinct 2 x 3 sheet of outputs for each of the 5 output "nodes".
In a purely 2-D case, (batch_size, input_dim), each item iterated along the final dimension is itself a scalar value, so you end up with an output of exactly the size of the number of dense nodes requested.
But in a higher-dimensional case, each item you iterate along the final dimension of the input will itself still be a higher-dimensional thing, and so the output is k distinct "clones" of those higher-dimensional things, where k is the dense layer size requested, and by "clone" we mean the output for a single dense connection has the same shape as the items in the final dimension of the input.
Then the Dense-ness means that each specific element of that output has a connection to each element of the corresponding set of inputs. But be careful about this. Dense layers are defined by having "one" connection between each item of the output and each item of the input. So even though you have 5 "2x3 things" in your output, they each just have one solitary weight associated with them about how they are connected to the 2x3 thing that is the input. Keras also defaults to using a bias vector (not a bias tensor), so if the dense layer has dimension k and the final dimension of the previous layer is n, you should expect (n+1)k trainable parameters. Here n = 1 and k = 5, which gives the 10 parameters reported for dense_1. These will always be used with numpy-like broadcasting to make the lower-dimensional shape of the weight and bias vectors conformable to the actual shapes of the input tensors.
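As a rough illustration of the broadcasting described above, here is a minimal NumPy sketch rather than Keras itself. The batch size of 4 and the random values are assumptions made for the example; the point is only that a single (n, k) weight matrix and a length-k bias vector are applied along the last axis while the leading axes are carried through.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, n, k = 4, 1, 5                    # n: last input dimension, k: Dense units
x = rng.normal(size=(batch, 2, 3, n))    # stand-in for the conv2d_2 output

kernel = rng.normal(size=(n, k))         # one weight matrix shared by all 2 x 3 positions
bias = np.zeros(k)                       # a bias vector, not a bias tensor

# Matrix product over the last axis; the (batch, 2, 3) axes are carried along
# and the bias is broadcast over them.
out = x @ kernel + bias

n_params = kernel.size + bias.size       # (n + 1) * k
print(out.shape, n_params)               # (4, 2, 3, 5) 10
```

The parameter count does not depend on the 2 x 3 part of the shape, which is why dense_1 reports only 10 parameters.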
It is customary to use Flatten as in your first example if you want to enforce the exact size of the coming dense layer. You would use a multidimensional Dense layer when you want different "(n - 1)D" groups of connections to each Dense node. This is probably extremely rare for higher-dimensional inputs because you'd typically want a CNN type of operation, but I could imagine some cases where a model predicts pixel-wise values, or where you are generating a full nD output, like from the decoder portion of an encoder-decoder network, in which you might want a dense array of cells that matches the dimensions of some expected structured output type like an image or video.

Related

How to extract convolutional neural network from Keras model object to Networkx DiGraph object keeping weights as an edge attribute?

I'm interested in using the Networkx Python package to perform network analysis on convolutional neural networks. To achieve this I want to extract the edge and weight information from Keras model objects and put them into a Networkx Digraph object where it can be (1) written to a graphml file and (2) be subject to the graph analysis tools available in Networkx.
Before jumping in further, let me clarify how to consider pooling. Pooling (examples: max or average) means that the entries within a pooling window will be aggregated, creating an ambiguity about which entry would be used in the graph I want to create. To resolve this, I would like every possible choice included in the graph, as I can account for this later as needed.
For the sake of example, let's consider doing this with VGG16. Keras makes it pretty easy to access the weights while looping over the layers.
from keras.applications.vgg16 import VGG16
model = VGG16()
for layer_index, layer in enumerate(model.layers):
    GW = layer.get_weights()
    if layer_index == 0:
        print(layer_index, layer.get_config()['name'], layer.get_config()['batch_input_shape'])
    elif GW:
        W, B = GW
        print(layer_index, layer.get_config()['name'], W.shape, B.shape)
    else:
        print(layer_index, layer.get_config()['name'])
Which will print the following:
0 input_1 (None, 224, 224, 3)
1 block1_conv1 (3, 3, 3, 64) (64,)
2 block1_conv2 (3, 3, 64, 64) (64,)
3 block1_pool
4 block2_conv1 (3, 3, 64, 128) (128,)
5 block2_conv2 (3, 3, 128, 128) (128,)
6 block2_pool
7 block3_conv1 (3, 3, 128, 256) (256,)
8 block3_conv2 (3, 3, 256, 256) (256,)
9 block3_conv3 (3, 3, 256, 256) (256,)
10 block3_pool
11 block4_conv1 (3, 3, 256, 512) (512,)
12 block4_conv2 (3, 3, 512, 512) (512,)
13 block4_conv3 (3, 3, 512, 512) (512,)
14 block4_pool
15 block5_conv1 (3, 3, 512, 512) (512,)
16 block5_conv2 (3, 3, 512, 512) (512,)
17 block5_conv3 (3, 3, 512, 512) (512,)
18 block5_pool
19 flatten
20 fc1 (25088, 4096) (4096,)
21 fc2 (4096, 4096) (4096,)
22 predictions (4096, 1000) (1000,)
For the convolutional layers, I've read that the tuples will represent (filter_x, filter_y, filter_z, num_filters) where filter_x, filter_y, filter_z give the shape of the filter and num_filters is the number of filters. There's one bias term for each filter, so the last tuple in these rows will also equal the number of filters.
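To make the shape tuples concrete, here is a small sketch that counts the parameters implied by a convolution kernel shape. The helper name conv_param_count is made up for this example; the 7 x 7 x 512 figure assumes the default 224 x 224 input, which five 2 x 2 poolings reduce to 7 x 7 positions.

```python
from math import prod

def conv_param_count(kernel_shape):
    """Weights plus one bias per filter for a (x, y, in_channels, num_filters) kernel."""
    num_filters = kernel_shape[-1]
    return prod(kernel_shape) + num_filters

block1_conv1 = conv_param_count((3, 3, 3, 64))
block1_conv2 = conv_param_count((3, 3, 64, 64))

# Assumption: block5_pool outputs 7 x 7 positions with 512 channels,
# which the flatten layer turns into the 25088 inputs of fc1.
fc1_inputs = 7 * 7 * 512

print(block1_conv1, block1_conv2, fc1_inputs)  # 1792 36928 25088
```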
While I've read explanations of how the convolutions within a convolutional neural network behave conceptually, I seem to be having a mental block when I get to handling the shapes of the layers in the model object.
Once I know how to loop over the edges of the Keras model, with Networkx I should be able to easily code the construction of the Networkx object. The code for this might loosely resemble something like this, where keras_edges is an iterable that contains tuples formatted as (in_node, out_node, edge_weight).
import networkx as nx
g = nx.DiGraph()
g.add_weighted_edges_from(keras_edges)
nx.write_graphml(g, 'vgg16.graphml')
So to be specific, how do I loop over all the edges in a way that accounts for the shape of the layers and the pooling in the way I described above?
Keras doesn't have an edge element, and a Keras node seems to be something totally different: a Keras node is an entire layer when it is used, that is, the layer as presented in the graph of the model.
So, assuming you are using the smallest image possible (which is equal to the kernel size), and that you're creating nodes manually (sorry, I don't know how it works in networkx):
For a convolution that:
Has i input channels (channels in the image that comes in)
Has o output channels (the selected number of filters in keras)
Has kernel_size = (x, y)
You already know the weights, which are shaped (x, y, i, o).
You would have something like:
import numpy as np

# assumes `weights` is the (x, y, i, o) kernel array of one convolution layer
# and `image` is one input array shaped (height, width, channels)

# assuming a node here is one pixel from one channel only:
# kernel sizes x and y
kSizeX = weights.shape[0]
kSizeY = weights.shape[1]
# in and out channels
inChannels = weights.shape[2]
outChannels = weights.shape[3]
# slide steps x and y
stepsX = image.shape[0] - kSizeX + 1
stepsY = image.shape[1] - kSizeY + 1
# stores the final results
all_filter_results = []
for ko in range(outChannels):  # for each output filter
    one_image_results = np.zeros((stepsX, stepsY))
    # for each position of the sliding window
    # if you used the smallest size image, start here
    for pos_x in range(stepsX):
        for pos_y in range(stepsY):
            # storing the results of a single step of a filter here:
            one_slide_nodes = []
            # for each weight in the filter
            for kx in range(kSizeX):
                for ky in range(kSizeY):
                    for ki in range(inChannels):
                        # the input node is a pixel in a single channel
                        in_node = image[pos_x + kx, pos_y + ky, ki]
                        # one multiplication, single weight x single pixel
                        one_slide_nodes.append(weights[kx, ky, ki, ko] * in_node)
                        # so, here, you have in_node and weights
            # the result of each step in the slide is the sum of one_slide_nodes:
            slide_result = sum(one_slide_nodes)
            one_image_results[pos_x, pos_y] = slide_result
    all_filter_results.append(one_image_results)
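The same loops can be adapted to emit the (in_node, out_node, edge_weight) tuples the question asks for. This is only a sketch under assumptions: the function name conv_edges and the tuple-style node labels (layer, row, col, channel) are made up for the example, and it covers a stride-1 convolution without padding, as in the loops above. VGG16 itself pads its convolutions, so border positions would need extra handling.

```python
import numpy as np

def conv_edges(weights, in_height, in_width, layer_in="in", layer_out="out"):
    """Yield (in_node, out_node, weight) for a stride-1, unpadded 2D convolution.

    Node labels are hypothetical tuples: (layer, row, col, channel).
    """
    k_x, k_y, in_ch, out_ch = weights.shape
    steps_x = in_height - k_x + 1
    steps_y = in_width - k_y + 1
    for ko in range(out_ch):
        for pos_x in range(steps_x):
            for pos_y in range(steps_y):
                out_node = (layer_out, pos_x, pos_y, ko)
                for kx in range(k_x):
                    for ky in range(k_y):
                        for ki in range(in_ch):
                            in_node = (layer_in, pos_x + kx, pos_y + ky, ki)
                            yield in_node, out_node, float(weights[kx, ky, ki, ko])

# Toy kernel: 2 x 2 window, 2 input channels, 3 filters, on a 3 x 3 input.
weights = np.arange(2 * 2 * 2 * 3, dtype=float).reshape(2, 2, 2, 3)
keras_edges = list(conv_edges(weights, in_height=3, in_width=3))
print(len(keras_edges))  # 3 filters * 4 positions * 8 weights = 96
```

The resulting list has the format expected by g.add_weighted_edges_from(keras_edges). The same weight value appears on many edges because a convolution shares its kernel across positions.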

Luong's Concat Alignment Score Issue

I've been trying to understand how the attention mechanism works. I'm currently looking at the tfjs-examples/date-conversion-attention example. I've found that the example uses the dot product alignment score (from Effective Approaches to Attention-based Neural Machine Translation).
So this expression, $\mathrm{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s$, is represented as
let attention = tf.layers.dot({axes: [2, 2]}).apply([decoder, encoder]);
in the code.
The decoder (h_t) has a shape of [10,64] and the encoder (h_s) is [12,64] so the result will have a shape of [10,12]. So far so good.
Now I'm trying to implement the concat alignment score, which looks like this:
$\mathrm{score}(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s])$
So the first thing to do is to concatenate h_t and h_s. However, they have different shapes, so I don't know how to proceed. Should I somehow reshape the tensors? If so, what would the shape be?
I've been googling around to find out how other people do this and found this.
#For concat scoring, decoder hidden state and encoder outputs are concatenated first
out = torch.tanh(self.fc(decoder_hidden+encoder_outputs))
But this doesn't seem right as they sum the values instead of concatenating.
Any guidance would be appreciated.
UPDATE Here is the model summary:
__________________________________________________________________________________________________
Layer (type) Output shape Param # Receives inputs
==================================================================================================
input1 (InputLayer) [null,12] 0
__________________________________________________________________________________________________
embedding_Embedding1 (Embedding [null,12,64] 2240 input1[0][0]
__________________________________________________________________________________________________
input2 (InputLayer) [null,10] 0
__________________________________________________________________________________________________
lstm_LSTM1 (LSTM) [null,12,64] 33024 embedding_Embedding1[0][0]
__________________________________________________________________________________________________
embedding_Embedding2 (Embedding [null,10,64] 832 input2[0][0]
__________________________________________________________________________________________________
encoderLast (GetLastTimestepLay [null,64] 0 lstm_LSTM1[0][0]
__________________________________________________________________________________________________
lstm_LSTM2 (LSTM) [null,10,64] 33024 embedding_Embedding2[0][0]
encoderLast[0][0]
encoderLast[0][0]
__________________________________________________________________________________________________
dot_Dot1 (Dot) [null,10,12] 0 lstm_LSTM2[0][0]
lstm_LSTM1[0][0]
__________________________________________________________________________________________________
attention (Activation) [null,10,12] 0 dot_Dot1[0][0]
__________________________________________________________________________________________________
context (Dot) [null,10,64] 0 attention[0][0]
lstm_LSTM1[0][0]
__________________________________________________________________________________________________
concatenate_Concatenate1 (Conca [null,10,128] 0 context[0][0]
lstm_LSTM2[0][0]
__________________________________________________________________________________________________
time_distributed_TimeDistribute [null,10,64] 8256 concatenate_Concatenate1[0][0]
__________________________________________________________________________________________________
time_distributed_TimeDistribute [null,10,13] 845 time_distributed_TimeDistributed1
==================================================================================================
Total params: 78221
Trainable params: 78221
Non-trainable params: 0
__________________________________________________________________________________________________
First, for tf.layers.dot to work, both inputs must have the same size along the axes being contracted.
To perform a concatenation, you can use tf.concat([h_t, h_s]). The new shape will depend on the axis over which the concatenation is performed.
Let's suppose that both h_t and h_s have the shape [a, b]. If the concatenation is done over axis 0, the new shape would be [2a, b]; if it is done over axis 1, the resulting shape would be [a, 2b].
Then you can apply tf.tanh to the input or create a custom layer that does it for you.
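The shape rule for concatenation can be checked quickly. The sketch below uses NumPy's concatenate as a stand-in for tf.concat, with a = 10 and b = 64 chosen arbitrarily; only the resulting shapes matter here.

```python
import numpy as np

a, b = 10, 64
h_t = np.zeros((a, b))
h_s = np.ones((a, b))

along_axis_0 = np.concatenate([h_t, h_s], axis=0)
along_axis_1 = np.concatenate([h_t, h_s], axis=1)

print(along_axis_0.shape)  # (20, 64), i.e. [2a, b]
print(along_axis_1.shape)  # (10, 128), i.e. [a, 2b]
```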
Update:
Since tf.layers.dot is performed over 3D data whose shapes happen not to match on the second axis (axis = 1), the concatenation can be done only on that axis, and the resulting shape would be [1, 10 + 12, 64].
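For comparison, here is a different arrangement that keeps a [10, 12] score matrix like the dot score does: pair every decoder step with every encoder step, concatenate along the feature axis, and then apply the concat score. This is a hedged NumPy sketch, not the tfjs example's code. The names W_a and v_a follow the Luong notation, but their sizes (d_a = 32) and random values are assumptions, and W_a is stored transposed so it can be applied with a right-hand matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
T_dec, T_enc, d = 10, 12, 64
h_t = rng.normal(size=(T_dec, d))   # decoder states
h_s = rng.normal(size=(T_enc, d))   # encoder states

# Pair every decoder step with every encoder step, then concatenate features.
h_t_pairs = np.broadcast_to(h_t[:, None, :], (T_dec, T_enc, d))
h_s_pairs = np.broadcast_to(h_s[None, :, :], (T_dec, T_enc, d))
pairs = np.concatenate([h_t_pairs, h_s_pairs], axis=-1)   # (10, 12, 128)

# Hypothetical parameters; in a real model these would be trainable.
d_a = 32
W_a = rng.normal(size=(2 * d, d_a)) * 0.1
v_a = rng.normal(size=(d_a,)) * 0.1

# One scalar score per (decoder step, encoder step) pair.
score = np.tanh(pairs @ W_a) @ v_a
print(pairs.shape, score.shape)  # (10, 12, 128) (10, 12)
```

Unlike the dot score, this variant has trainable parameters and materializes a tensor with one row per pair, so it costs more memory for long sequences.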

Concatenating layers with different shapes

I have 2 layers with the following shapes.
Layer 1: LSTM
K.int_shape(x)
(None, None, 500)
Layer 2 : Conv2D -> Flatten -> Reshape
K.int_shape(y)
(None, 1, 2352)
I need to concatenate them, but I get the following error.
ValueError: A 'Concatenate' layer requires inputs with matching shapes
except for the concat axis. Got inputs shapes: [(None, None, 500),
(None, 1, 2352)]
I'm using Keras v2.1.4

How does the LSTM know the number of time steps and features in a Conv1D-LSTM network?

I have a time series signal (n samples, each sample has 81 time steps and 3 features = n x 81 x 3).
I am using a Conv1D-LSTM network. n_timesteps = 81, n_features = 3.
A normal LSTM specifies both n_timesteps and n_features; however, when combined with Conv1D, these are not specified.
1. How does the LSTM know how many time steps and features there are in its input?
2. How does the LSTM know the end of the sequence for each sample?
3. Are the time steps "stored up" and then fed into the LSTM, or are they processed one time step at a time and fed into the LSTM one time step at a time?
4. If I include the "flatten" (below) it fails. Why?
5. Does the number of filters in the Conv1D have to correspond to the number of units in the LSTM?
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(Dropout(0.5))
model.add(MaxPooling1D(pool_size=2))
#model.add(Flatten())
#model.add(LSTM(units=128, input_shape=(n_timesteps, n_features), return_sequences=True))
model.add(LSTM(units=128, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=64, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))
model.add(LSTM(units=32, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
1 and 2
Everything is based on tensors (sort of like matrices, but with any number of dimensions).
The tensors have shapes and everything is based on the shapes. Your data tensors are three-dimensional: (samples, time_steps, features).
It happens that 1D convolutions also use the same 3D tensors: (samples, length, channels). So:
samples = examples = sequences
time_steps = length
features = channels
There is no secret. The data is structured and the layers will use this structure. Look at your model.summary() and see the number of steps and features for every layer's output.
3
There is no interleaving between layers.
The conv layer will process its entire input tensor and generate an output tensor.
The next conv layer will take this entire output and produce another entire output
The LSTM layer will do the same, take an entire input and output an entire tensor.
4
If you flatten the data, your 3D tensors (samples, steps, feats) will become 2D tensors (samples, something). No more data structure that can be understood by the layers.
5
There is absolutely no requirement for number of filters or units. The only thing is that the final output of your model needs to have the same shape of your y_train data.
Here is my model summary. It appears that the number of features has changed from the original 3 (of the input) to 32 (for the conv1d). Is it correct that the LSTM will now process all the time steps (~81) on the 32 features of the conv1d instead of the 3 features of the input?
Example of summary:
The first LSTM will take an input shape of (None, 38,32). This means this LSTM will process:
38 steps
32 features
The convolutions are discarding border steps and the maxpooling is halving the steps.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 79, 32) 320
_________________________________________________________________
conv1d_1 (Conv1D) (None, 77, 32) 3104
_________________________________________________________________
dropout (Dropout) (None, 77, 32) 0
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 38, 32) 0
_________________________________________________________________
lstm (LSTM) (None, 38, 128) 82432
_________________________________________________________________
dropout_1 (Dropout) (None, 38, 128) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 38, 64) 49408
_________________________________________________________________
lstm_2 (LSTM) (None, 32) 12416
_________________________________________________________________
dense (Dense) (None, 16) 528
_________________________________________________________________
dropout_2 (Dropout) (None, 16) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 148,225
Trainable params: 148,225
Non-trainable params: 0
_________________________________________________________________
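The step counts and LSTM parameter counts in this summary can be reproduced with a little arithmetic. The helper names below are made up for this sketch, and it assumes the defaults used in the question's code: unpadded convolutions with stride 1, and a pooling stride equal to the pool size.

```python
def conv1d_valid_steps(steps, kernel_size):
    # An unpadded, stride-1 convolution drops kernel_size - 1 border steps.
    return steps - kernel_size + 1

def max_pool_steps(steps, pool_size):
    # Non-overlapping pooling windows; a leftover step is discarded.
    return steps // pool_size

def lstm_param_count(input_features, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_features + units) * units + units)

steps = 81
steps = conv1d_valid_steps(steps, 3)   # 79
steps = conv1d_valid_steps(steps, 3)   # 77
steps = max_pool_steps(steps, 2)       # 38

lstm_params = [lstm_param_count(32, 128),
               lstm_param_count(128, 64),
               lstm_param_count(64, 32)]
print(steps, lstm_params)  # 38 [82432, 49408, 12416]
```

None of the LSTM parameter counts depend on the number of steps, which is why the layer does not need n_timesteps to be specified.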

the impact of flatten layer and its correct usage

I am working on a simple 1D convolution model, which is built as follows
model1= Sequential()
model1.add(Conv1D(60,32, strides=1, activation='relu',padding='causal',input_shape=(64,1)))
model1.add(Conv1D(80,10, strides=1, activation='relu',padding='causal'))
model1.add(Conv1D(100,5, strides=1, activation='relu',padding='causal'))
model1.add(MaxPooling1D(2))
model1.add(Dense(300,activation='relu'))
model1.add(Flatten())
model1.add(Dense(1,activation='relu'))
print(model1.summary())
Its model summary is as follows
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 64, 60) 1980
_________________________________________________________________
conv1d_2 (Conv1D) (None, 64, 80) 48080
_________________________________________________________________
conv1d_3 (Conv1D) (None, 64, 100) 40100
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 32, 100) 0
_________________________________________________________________
dense_1 (Dense) (None, 32, 300) 30300
_________________________________________________________________
flatten_1 (Flatten) (None, 9600) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 9601
=================================================================
Total params: 130,061
Trainable params: 130,061
Non-trainable params: 0
_________________________________________________________________
If I move the Flatten layer before the first Dense layer, as follows, I get the following model architecture. It seems that the number of model parameters of this one is much larger than the previous one. Why does the placement of the Flatten layer have such a large impact? What is the correct way to place the Flatten layer?
model1= Sequential()
model1.add(Conv1D(60,32, strides=1, activation='relu',padding='causal',input_shape=(64,1)))
model1.add(Conv1D(80,10, strides=1, activation='relu',padding='causal'))
model1.add(Conv1D(100,5, strides=1, activation='relu',padding='causal'))
model1.add(MaxPooling1D(2))
model1.add(Flatten())
model1.add(Dense(300,activation='relu'))
model1.add(Dense(1,activation='relu'))
The difference is that in the first case the dense layer is applied independently at each position of its input. The layer maps the 100 channels at one position to 300 outputs, using 100 x 300 = 30,000 weights and 300 biases, for a total of 30,300 parameters. The same operation, with the same weights, is repeated for each of the 32 positions of its input from max_pooling1d_1.
In the second case you flatten the input first, so now you have 3,200 inputs and map them to 300 outputs, requiring 300 x 3,200 + 300 = 960,300 parameters.
Which one is correct is up to you. In the first case the network is much smaller, will learn quicker, and will be much less prone to overfitting, but might not have the expressivity necessary to give usable performance on your dataset. But does it make sense to force the dense layer to treat all positions the same way? Only experiments can tell. You have to try both ways and see which one yields better results.
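The parameter counts for the two placements can be compared with a short calculation. The helper dense_params is made up for this sketch; the 32 positions and 100 channels are taken from the max_pooling1d_1 output shape in the summary.

```python
def dense_params(inputs, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return inputs * units + units

positions, channels = 32, 100   # output shape of max_pooling1d_1

# Dense(300) -> Flatten -> Dense(1)
dense_first = dense_params(channels, 300) + dense_params(positions * 300, 1)

# Flatten -> Dense(300) -> Dense(1)
flatten_first = dense_params(positions * channels, 300) + dense_params(300, 1)

print(dense_first, flatten_first)  # 39901 960601
```

In the first ordering the large count moves to nowhere: the final Dense(1) sees 9,600 inputs but has a single unit, so the two dense layers together stay under 40,000 parameters.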