Fully Connected Layer dimensions - tensorflow

I have a few uncertainties regarding the fully connected layer of a convolutional neural network. Let's say the input is the output of a convolutional layer. I understand the previous layer is flattened. But can it have multiple channels? (For example, can the input to the fully connected layer be 16x16x3, i.e. 3 channels, flattened into a vector of 768 elements?)
Next, I understand the equation for outputs is,
outputs = activation(inputs * weights' + bias)
Is there 1 weight per input? (for example, in the example above, would there be 768 weights?)
Next, how many biases are there? 1 per channel (so 3)? 1 no matter what? Something else?
Lastly, how do filters work in the fully connected layer? Can there be more than 1?

You might have a misunderstanding of how a fully connected neural network works. To get a better understanding of it, you could always check some good tutorials, such as the online courses from Stanford HERE
To answer your first question: yes, whatever dimensions you have, you need to flatten it before sending to fully connected layers.
To answer your second question, you have to understand that a fully connected layer is actually a matrix multiplication followed by a vector addition:
input^T * weights + bias = output
where you have an input of dimension 1xIN, weights of size INxOUT, and an output of size 1xOUT, so 1xIN * INxOUT = 1xOUT. Altogether, you have INxOUT weights (OUT weights for each input element). You also need OUT biases, so the full equation is 1xIN * INxOUT + 1xOUT (bias term).
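To make that concrete, here is a minimal NumPy sketch of that computation for the 16x16x3 example from your question (the choice of 10 output nodes is an arbitrary assumption for illustration):

import numpy as np

IN, OUT = 16 * 16 * 3, 10           # 768 flattened inputs; 10 outputs is assumed
x = np.random.rand(1, IN)           # input row vector, shape 1xIN
weights = np.random.rand(IN, OUT)   # IN*OUT weights in total
bias = np.random.rand(OUT)          # one bias per output node, not per channel
output = x @ weights + bias         # shape 1xOUT
assert output.shape == (1, OUT)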
There are no filters, since you are not doing convolution.
Note that a fully connected layer is equivalent to a 1x1 convolution layer, and many implementations use the latter for fully connected layers, which can be confusing for beginners. For details, please refer to HERE

What are the effects of padding a tensor?

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must be of dimensionality (timesteps, features), i.e. if you have 1000 training samples, each consisting of 100 timesteps, with each timestep having 10 values, your input shape will be (100, 10,). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
from keras.layers import Input, LSTM

myLongInput = Input(shape=(8192, 8,))
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction = LSTM(4, return_sequences=True)
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see, the new output is now a 3rd-order tensor, with the second dimension being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Side note: make sure the activation of your last layer is a softmax if your output is in one-hot format.)
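Putting it together, here is a minimal sketch of such a stack for your shapes (the hidden size of 64 and the optimizer are assumptions, not prescriptions):

from keras.models import Model
from keras.layers import Input, LSTM, Dense

inp = Input(shape=(8192, 8,))
x = LSTM(64, return_sequences=True)(inp)   # keeps the time dimension for the next LSTM
x = LSTM(64)(x)                            # last time step only -> (None, 64)
out = Dense(4, activation='softmax')(x)    # matches the one-hot Y of shape (num_samples, 4)
model = Model(inp, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')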
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

What's the dimensionality of the output space in Tensorflow's docs?

I've been going through the docs recently, and in many different functions, like tf.layers.dense or tf.nn.conv2d, I came across the arguments units and filters respectively, and I can't understand the point of them. Can someone clearly describe the meaning of
dimensionality of the output space
in the above cases or maybe more general terms? Thanks in advance.
In my opinion:
units in tf.layers.dense:
means how many output nodes the dense layer should return.
A fully connected (dense) layer consists of input nodes and output nodes.
So "dimensionality of the output space" translates to the number of output nodes.
If units = 1, all the input nodes are connected to one output node.
In Inception v3 and other classification models, the units of the final dense layer is always the number of classes.
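For instance, a minimal sketch (the 2048-dimensional feature vector and the 1000 classes are just illustrative values):

import tensorflow as tf  # TF 1.x style, matching tf.layers.dense

features = tf.placeholder(tf.float32, [None, 2048])  # hypothetical feature vector
logits = tf.layers.dense(features, units=1000)       # e.g. one output node per class
print(logits.shape)                                  # (?, 1000)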
filters in tf.nn.conv2d:
as stated in the API doc:
filter: A Tensor. Must have the same type as input. A 4-D tensor of shape [filter_height, filter_width, in_channels, out_channels]
The confusing point is perhaps out_channels.
out_channels can be understood as how many filters we want to scan the input tensor with;
so out_channels is the number of kernels.
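A minimal sketch (all shapes here are illustrative assumptions):

import tensorflow as tf  # TF 1.x style, matching tf.nn.conv2d

images = tf.placeholder(tf.float32, [None, 64, 64, 16])  # in_channels = 16
kernel = tf.get_variable('kernel', [3, 3, 16, 32])       # out_channels = 32 kernels
conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME')
print(conv.shape)                                        # (?, 64, 64, 32)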

How to decide the number of nodes for a CNN model for image classification using tensorflow? Images are of size 178x218

How do I select the following: the size of the convolution filters, strides, pooling, and the densely connected layer?
There is no single answer to this question. This Reddit and this answer have some nice discussion. To quote the second post on the Reddit, "Start simple."
CelebA has a similar, maybe exactly the same, image size. When I was working with CelebA on a DCGAN project, I gently cropped and then reshaped the images to 64 x 64 x 3. My discriminator was a convolutional neural network that used 4 convolutional layers and one fully connected layer. All conv layers had a 5 x 5 window size and a stride of 2 x 2, with SAME padding and no pooling. The output channels per layer were 128 -> 256 -> 512 -> 1024, so the last conv layer output a 4 x 4 x 1024 tensor. My dense layer then had a weight size of classes x 1024. (I had 1 class, since its purpose was to determine whether the input image was from the dataset or made by the generator.)
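A rough Keras sketch of that stack (the ReLU activations are my assumption for brevity; DCGAN discriminators typically use LeakyReLU, and the final sigmoid reflects the single real-vs-fake class):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

model = Sequential([
    Conv2D(128, 5, strides=2, padding='same', activation='relu', input_shape=(64, 64, 3)),
    Conv2D(256, 5, strides=2, padding='same', activation='relu'),
    Conv2D(512, 5, strides=2, padding='same', activation='relu'),
    Conv2D(1024, 5, strides=2, padding='same', activation='relu'),  # -> 4 x 4 x 1024
    Flatten(),
    Dense(1, activation='sigmoid'),  # 1 class: real vs. generated
])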
That relatively simple architecture had good results, but it was intentionally built not to overpower the generator. If you're looking for pure classification, you may want a deeper architecture. You might not want to crop as aggressively as I did; then you can include more conv layers before the fully connected layer. You may want to use a 3 x 3 window size with a stride of 1 x 1 and use pooling, although I see architectures abandoning pooling in favor of larger stride sizes. If your dataset is small, it is prone to overfitting; having fewer weights helps combat this when dropout isn't enough. That means fewer output channels per layer.
There are a lot of possibilities when choosing an architecture, and there is no hard-and-fast rule for the best architecture. Remember to start simple.

Detection Text from natural images

I wrote code in TensorFlow using a convolutional neural network to detect text in images. I used a TFRecords file to read the Street View Text dataset, then resized the images to 128 in height and width.
I used 9 conv layers with zero padding and three max_pool layers with a window size of (2x2) and a stride of 2. Since I use just three pooling layers, the last layer's shape is (16x16). The last conv layer has 256 filters.
I also used two fully connected regression layers (tf.nn.sigmoid) and tf.losses.mean_squared_error as the loss function.
My questions are:
Is this architecture enough for the detection process? I know there is something called NMS for detection. Also, what would the label be in this case?
In general, and this is not a rule, just based on my experience: you should start with a smaller net of 2 or 3 conv layers and see what happens. If you get some good results, focus on the winning topology and adapt the hyperparameters (learning rate, batch size, and so on); if you don't get good results at all, go deeper, meaning add conv layers, and evaluate again. 9 conv layers is really huge; your problem's complexity should be huge too, otherwise you will reach a good accuracy but waste a lot of computing power and time for nothing. And by the way, use a pyramid form, meaning start wide and finish tiny.

Dense final layer vs. another rnn layer

It is common to add a dense fully-connected layer as the last layer on top of a recurrent neural network (which has one or more layers) in order to learn the reduction to the final output dimensionality.
Let's say I need one output with a -1 to 1 range, in which case I would use a dense layer with a tanh activation function.
My question is: Why not add another recurrent layer instead with an internal size of 1?
It will be different (in the sense of propagating that through time) but will it have a disadvantage over the dense layer?
If I understand correctly, the two alternatives you present do the exact same computation, so they should behave identically.
In TensorFlow, if you're using dynamic_rnn, it's much easier if all time steps are identical, though; hence processing the output with a dense layer rather than giving the last step a different size.
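For illustration, a minimal Keras sketch of the two alternatives (the input shape and the hidden size of 32 are arbitrary assumptions):

from keras.models import Model
from keras.layers import Input, LSTM, Dense

inp = Input(shape=(100, 8,))
# Option A: recurrent layer, then a dense reduction to one tanh output
h = LSTM(32)(inp)                            # (None, 32), last time step only
out_dense = Dense(1, activation='tanh')(h)
# Option B: a second recurrent layer with an internal size of 1
h_seq = LSTM(32, return_sequences=True)(inp)
out_rnn = LSTM(1, activation='tanh')(h_seq)  # (None, 1)
model_a, model_b = Model(inp, out_dense), Model(inp, out_rnn)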