What does it mean to have more neurons than bits in the input layer in TensorFlow? - tensorflow

I am creating a model like this to input 32 bit integers into my nn:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(32, input_dim=32, activation='relu'))
From what I understand, there are 32 neurons (one for each bit); the output of each neuron is -1 if the bit is 0 and 1 if it is 1.
I noticed that it's also possible to do this:
model = Sequential()
model.add(Dense(40, input_dim=32, activation='relu'))
Now I have 40 neurons for 32 bits. It runs fine. But why does it work and what does it mean?
Edit: I'll try to explain it a little further. I want to input a 32 bit unsigned integer. So I have 32 bits. When I have 32 neurons in the features/input layer it's very easy:
bit 0 goes into neuron 0
bit 1 goes into neuron 1
...
bit 31 goes into neuron 31
Now when I have 32 bits and 40 neurons it looks like this:
bit 0 goes into neuron 0
bit 1 goes into neuron 1
...
bit 31 goes into neuron 31
What goes into neuron 32?
On playground.tensorflow.org you can see stuff like sin(x1) in the features layer. I want to know how exactly these values are calculated.

Having more neurons than bits in the input layer means that the first Dense layer has more units than the input has dimensions. This can be useful when you want the network to capture more complex relationships between the inputs, since the layer projects the input into a higher-dimensional space before further processing.
Note that there is no one-to-one mapping between bits and neurons in a Dense layer. Each of the 40 neurons is connected to all 32 input bits through its own set of 32 weights plus a bias, and computes activation(w . x + b). Nothing special "goes into neuron 32"; like every other neuron, it receives a weighted combination of all 32 bits.
In your example, the only difference from the 32-unit version is the shape of the layer's weight matrix: (32, 40) instead of (32, 32).
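You can check this directly by inspecting the layer's weight shapes (a minimal sketch, assuming tf.keras):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(40, input_dim=32, activation='relu'))

weights, biases = model.layers[0].get_weights()
print(weights.shape)  # (32, 40): each of the 40 units is connected to all 32 bits
print(biases.shape)   # (40,): one bias per unit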
Regarding your question about functions like sin(x1) on playground.tensorflow.org: these are not extra raw inputs but transformed features computed from the inputs. sin(x1) is simply the sine function applied to the first input feature x1; it is evaluated for every input sample, and the resulting values are fed into the network as additional inputs. This kind of feature engineering can help when a transformation of the raw features makes the underlying relationship easier for the network to learn.
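For example, you could precompute such features yourself and feed them alongside the raw inputs (a sketch using NumPy; the chosen features are just illustrative):
import numpy as np

# x: raw input features, shape (num_samples, 2)
x = np.random.rand(100, 2)

# Playground-style engineered features, computed per sample.
features = np.column_stack([
    x[:, 0],            # x1
    x[:, 1],            # x2
    np.sin(x[:, 0]),    # sin(x1)
    x[:, 0] * x[:, 1],  # x1 * x2
])
print(features.shape)  # (100, 4): four input features per sample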

Having more neurons than bits in the input layer means that the neural network can learn more complex representations of the input data beyond the individual bits. In the case of the example you provided, the additional neurons can learn higher-order features that combine the values of multiple bits in non-linear ways, allowing the network to capture more complex relationships between the input variables. This can lead to improved performance in tasks such as classification or regression.

Related

Fully Connected Layer dimensions

I have a few uncertainties regarding the fully connected layer of a convolutional neural network. Let's say the input is the output of a convolutional layer. I understand the previous layer is flattened. But can it have multiple channels? (For example, can the input to the fully connected layer be 16x16x3, i.e. 3 channels, flattened into a vector of 768 elements?)
Next, I understand the equation for outputs is,
outputs = activation(inputs * weights' + bias)
Is there 1 weight per input? (for example, in the example above, would there be 768 weights?)
Next, how many biases are there? 1 per channel (so 3)? 1 no matter what? Something else?
Lastly, how do filters work in the fully connected layer? Can there be more than 1?
You might have a misunderstanding of how the fully connected neural network works. To get a better understanding of it, you could always check some good tutorials such as online courses from Stanford HERE
To answer your first question: yes, whatever dimensions you have, you need to flatten them before sending them to the fully connected layers.
To answer your second question, you have to understand that a fully connected layer is just a matrix multiplication followed by a vector addition:
output = input^T * weights + bias
where the input has dimension 1xIN, the weights have size INxOUT, and the output has size 1xOUT, so 1xIN * INxOUT = 1xOUT. Altogether you have INxOUT weights (OUT weights per input element, not one per input; in your example that is 768 x OUT), plus OUT biases, one per output unit rather than one per channel. The full computation is (1xIN) * (INxOUT) + (1xOUT bias term) = 1xOUT.
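A minimal sketch that checks these shapes for the 16x16x3 example (assuming tf.keras and an arbitrary output size of 10 units):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=(16, 16, 3)))  # 16*16*3 = 768 flattened inputs
model.add(Dense(10, activation='relu'))      # OUT = 10, an arbitrary example size

weights, biases = model.layers[1].get_weights()
print(weights.shape)  # (768, 10): IN x OUT weights in total, not one per input
print(biases.shape)   # (10,): one bias per output unit, regardless of input channels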
There are no filters here, since you are not doing convolution.
Note that a fully connected layer is also equivalent to a 1x1 convolution layer (when the spatial dimensions are 1x1), and many implementations use the latter for fully connected layers, which can be confusing for beginners. For details, please refer to HERE

How to prune neurons in neural network

Context:
Suppose we have a simple 3-layer feed-forward network. The hidden size of the first linear layer is 100000, i.e. W1[input_size, 100000], where input_size is a number much smaller than 100000. Some of the neurons won't learn anything. I want to select and shut down these neurons using pruning.
Expected outcomes
After pruning the selected neurons, we will have a smaller network with fewer neurons in the first layer, say reduced to 500. And this smaller network turns out to have the same predictive capacity as the large one.
My implementation:
According to some criterion (some metric applied to check weight similarities after each backpropagation update), I have cherry-picked the indices of the neurons I want to shut down, e.g., [1, 7, 8, ...].
Zero out the weights represented by those indices in W1, i.e. W1[:, [1, 7, 8, ...]] = 0, so that no information is passed forward via these neurons to the next layer.
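In code, this is roughly what I mean (a sketch; prune_idx, the layer name and the toy input size are just illustrative, and it follows the W1[input_size, 100000] orientation above):
import torch
import torch.nn as nn

fc1 = nn.Linear(in_features=64, out_features=100000)  # toy input_size of 64 for illustration
prune_idx = torch.tensor([1, 7, 8])  # indices picked by my similarity criterion

with torch.no_grad():
    # nn.Linear stores weight as [out_features, in_features], so the columns
    # of my W1[input_size, 100000] correspond to rows here.
    fc1.weight[prune_idx, :] = 0.0
    fc1.bias[prune_idx] = 0.0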
Will that be enough? Should I be manually intervening in the backpropagation as well? Zeroing out the neurons only stops the forward computation, but for learning/updating the weights, backpropagation matters more. Since I am using PyTorch, it would be great if illustrations were provided in PyTorch; other frameworks like TensorFlow or Keras are also fine.

What are the effects of padding a tensor?

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must have dimensionality (timesteps, features) per sample. That is, if you have 1000 training samples, each consisting of 100 timesteps with 10 values per timestep, your data tensor is (1000, 100, 10) and the input shape per sample is (100, 10). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
from keras.layers import Input, LSTM

myLongInput = Input(shape=(8192, 8,))
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction=LSTM(4, return_sequences=True)
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
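Putting the pieces together, a minimal sketch of such a stack (the hidden sizes of 32 are illustrative, not a recommendation):
from keras.models import Model
from keras.layers import Input, LSTM, Dense

inputs = Input(shape=(8192, 8))
x = LSTM(32, return_sequences=True)(inputs)  # keeps the time dimension for the next LSTM
x = LSTM(32)(x)                              # last recurrent layer returns only the final state
outputs = Dense(4, activation='softmax')(x)  # one-hot target with 4 classes

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')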
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

How to decide number of nodes for CNN model for image classification using tensorflow? Images are of size 178x218

How to select the following: Size of filter for convolution, strides, pooling, and densely connected layer
There is no single answer to this question. This Reddit and this answer have some nice discussion. To quote the second post on the Reddit, "Start simple."
CelebA has a similar, maybe exactly the same, image size. When I was working with CelebA on a DCGAN project, I gently cropped and then reshaped the images to 64 x 64 x 3. My discriminator was a convolutional neural network with 4 convolutional layers and one fully connected layer. All conv layers had a 5 x 5 window size and a stride of 2 x 2, SAME padding and no pooling. The output channels per layer were 128 -> 256 -> 512 -> 1024, so the last conv layer output a 4 x 4 x 1024 tensor. My dense layer then had a weight size of classes x 1024. (I had 1 class since its purpose was to determine whether the input image was from the dataset or made by the generator.)
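For reference, a rough sketch of that discriminator in tf.keras (an approximation of the description above; the LeakyReLU activations and the flattened dense input are assumptions, not the original code):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, LeakyReLU, Flatten, Dense

model = Sequential([
    # Four 5x5 conv layers, stride 2, SAME padding, no pooling: 64 -> 32 -> 16 -> 8 -> 4
    Conv2D(128, 5, strides=2, padding='same', input_shape=(64, 64, 3)),
    LeakyReLU(),
    Conv2D(256, 5, strides=2, padding='same'),
    LeakyReLU(),
    Conv2D(512, 5, strides=2, padding='same'),
    LeakyReLU(),
    Conv2D(1024, 5, strides=2, padding='same'),
    LeakyReLU(),
    Flatten(),                       # 4 x 4 x 1024 feature map
    Dense(1, activation='sigmoid'),  # single "class": real vs. generated
])
model.summary()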
That relatively simple architecture had good results, but it was intentionally built not to overpower the generator. If you're looking for pure classification, you may want a deeper architecture. You might not want to crop as aggressively as I did; then you can include more conv layers before the fully connected layer. You may want to use a 3 x 3 window size with a stride of 1 x 1 and use pooling, although I see architectures abandoning pooling in favor of larger stride sizes. If your dataset is small, it is prone to overfitting. Having fewer weights helps combat this when dropout isn't enough; that means fewer output channels per layer.
There are a lot of possibilities when choosing an architecture, and there is no hard-and-fast rule for the best architecture. Remember to start simple.

How to derive the shapes of tensors given the networks architecture, number of outputs and number of samples

How can we calculate the shapes of the various tensors involved in the computation graph once we know the network architecture (number of hidden layers, number of units in each layer), the number of outputs, the number of inputs and the number of samples in the training set? (Assume the network is fully connected.)
For example, let's say there are 100 features, 10000 samples, 2 hidden layers (H0, H1) and 10 outputs. H0 has 500 units and H1 has 5000 units. Assume ReLU activation is used in H0/H1 and softmax is used for the output layer.
In this case, although the sequence of calculations that needs to happen is very clear, finding the correct shape for each constant/variable/placeholder is difficult.
I am trying to understand if there is a standard method which we can follow to do this.
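A sketch of the bookkeeping for the example above (100 features, hidden sizes 500 and 5000, 10 outputs), assuming tf.keras; model.summary() prints every weight shape:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(500, activation='relu', input_shape=(100,)),  # W0: (100, 500), b0: (500,)
    Dense(5000, activation='relu'),                     # W1: (500, 5000), b1: (5000,)
    Dense(10, activation='softmax'),                    # W_out: (5000, 10), b_out: (10,)
])
model.summary()

# Activations for a batch of N samples (N = 10000 if the whole training set is fed at once):
#   input: (N, 100) -> H0: (N, 500) -> H1: (N, 5000) -> output: (N, 10)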