Increase dimension of RNN LSTM cell in Keras


I want to increase the number of recurrent weights in an RNN or LSTM cell.
If you look at the code below, you will see that the LSTM cell's input shape is (2,1), which means 2 timesteps and 1 feature.
%tensorflow_version 2.x
import tensorflow as tf
m = tf.keras.models.Sequential()
lstm = tf.keras.layers.LSTM(1, use_bias=False)
input = tf.keras.Input(shape=(2,1))
m.add(input)
m.add(lstm)
lstm.get_weights()
The output is
[array([[ 0.878217 , 0.89324415, 0.404307 , -1.0542995 ]], dtype=float32),
array([[-0.24181306, -0.341401 , 0.65207034, 0.63227856]], dtype=float32)]
4 weights for each feature, and 4 weights for previous outputs
Now if I change Input shape like this
input = tf.keras.Input(shape=(2,1))
then the output of get_weights function will be like this:
[array([[-0.9725287 , -0.90078545, 0.97881985, -0.9623983 ],
[-0.9644511 , 0.90705967, 0.05965471, 0.32613564]], dtype=float32),
array([[-0.24867296, -0.22346373, -0.6410606 , 0.69084513]], dtype=float32)]
Now my question is: how do I increase the number of weights in the second array, which keeps the (1,4) shape?
The idea is that I want the RNN or LSTM to take not only the previous output (moment t-1) but also earlier values (moments t-2, t-3, t-4).
Is there a way to do it in Keras with the TF backend?

I can't understand the change, I think you had a typo in your question, but:
Length - Time steps:
The number of time steps will never change the number of weights. The layer is "recurrent", meaning it will "loop" the time steps. It's not supposed to have different weights for each step.
The whole purpose of the layer is to apply the same operations over and over and over for each time step.
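As a rough illustration of this weight sharing, here is a minimal pure-Python sketch of a bias-free LSTM with 1 unit and 1 input feature. It is not the Keras implementation: the gate order (i, f, c, o), the sigmoid gate activation, and the weight values are assumptions made for the example, and `lstm_scalar` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_scalar(sequence, kernel, recurrent_kernel):
    """Run a 1-unit, 1-feature, bias-free LSTM over a list of floats.

    kernel and recurrent_kernel each hold 4 weights, assumed to be
    ordered as (input gate, forget gate, cell candidate, output gate).
    """
    w_i, w_f, w_c, w_o = kernel
    u_i, u_f, u_c, u_o = recurrent_kernel
    h, c = 0.0, 0.0  # hidden state and cell state start at zero
    for x in sequence:  # the same 8 weights are reused at every step
        i = sigmoid(x * w_i + h * u_i)
        f = sigmoid(x * w_f + h * u_f)
        c_candidate = math.tanh(x * w_c + h * u_c)
        o = sigmoid(x * w_o + h * u_o)
        c = f * c + i * c_candidate  # memory carried across steps
        h = o * math.tanh(c)
    return h


# Illustrative weights only (not taken from a trained model).
kernel = (0.88, 0.89, 0.40, -1.05)
recurrent = (-0.24, -0.34, 0.65, 0.63)

# A 2-step and a 5-step sequence are processed with the same 8 weights.
short_out = lstm_scalar([0.5, -0.2], kernel, recurrent)
long_out = lstm_scalar([0.5, -0.2, 0.1, 0.9, -0.7], kernel, recurrent)
print(short_out, long_out)
```

The sequence length only changes how many times the loop runs, not how many weights exist.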
Input features:
Input features are the last dimension of the input. They define one dimension of the weights.
Units = Output features:
Output features, also the last dimension of the output, are another dimension of the weights.
Two types of kernels
The LSTM layers have two groups of kernels:
What they call simply kernels - with shape=(input_dim, self.units * 4)
What they call recurrent kernels - with shape=(self.units, self.units * 4)
The first group acts on the input data; its shape depends on the input features and the output features.
The second group acts on the inner states, and its shape depends only on the output features (units).
From the source code:
self.kernel = self.add_weight(shape=(input_dim, self.units * 4),
                              name='kernel',
                              initializer=self.kernel_initializer,
                              regularizer=self.kernel_regularizer,
                              constraint=self.kernel_constraint)
self.recurrent_kernel = self.add_weight(
    shape=(self.units, self.units * 4),
    name='recurrent_kernel',
    initializer=self.recurrent_initializer,
    regularizer=self.recurrent_regularizer,
    constraint=self.recurrent_constraint)
The last array in the list:
The last array in the list of weights holds the 4 recurrent kernels, each with shape (1, 1), grouped into one.
So:
You can increase the kernels with more input features. Transform Input((anything, 1)) into Input((anything, more)) for instance.
You can increase the kernels and the recurrent_kernels (and biases, when considered) with bigger output features. Transform LSTM(1, ...) into LSTM(more, ...)
Weights are independent of the length. It's even possible to have Input((None, 1)), meaning a variable length.
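The shape rules above can be summarized in a small helper. This is only a sketch that mirrors the shapes quoted from the source code (plus the usual bias of length units * 4); `lstm_weight_shapes` is a hypothetical name, not a Keras function.

```python
def lstm_weight_shapes(input_dim, units, use_bias=True):
    """Return the weight shapes an LSTM layer would create.

    Note that the number of timesteps is not a parameter at all.
    """
    shapes = {
        "kernel": (input_dim, units * 4),
        "recurrent_kernel": (units, units * 4),
    }
    if use_bias:
        shapes["bias"] = (units * 4,)
    return shapes


# 1 feature, 1 unit: both arrays are (1, 4), as in the question.
print(lstm_weight_shapes(1, 1, use_bias=False))
# 2 features, 1 unit: only the kernel grows, to (2, 4).
print(lstm_weight_shapes(2, 1, use_bias=False))
# 1 feature, 3 units: the recurrent kernel grows to (3, 12).
print(lstm_weight_shapes(1, 3, use_bias=False))
```

Only a larger `units` value changes the recurrent kernel, which is the second array the question asks about.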
Using more than just the last step
This should be automatic. LSTM layers are designed to have memory. The memory is an inner state that participates in all time steps. There are gates (the kernels) that decide how a new step will participate in this memory. Since all steps participate in the same memory, LSTM layer theoretically considers "all" time steps from the beginning.
So, you shouldn't really worry about this.
But if you do want this, there are maybe two ways. I don't know if they will bring any improvement, though.
One is to concatenate shifted inputs as features:
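def pad_and_shift(x):
    steps = 3
    # pad only the time axis, at the beginning, with steps-1 zeros
    paddings = tf.constant([[0, 0], [steps - 1, 0], [0, 0]])
    x = tf.pad(x, paddings)
    # shifted copies of the sequence, all with the original length
    to_concat = [x[:, i:i - steps + 1] for i in range(steps - 1)]
    to_concat.append(x[:, steps - 1:])
    return tf.concat(to_concat, axis=-1)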
given_inputs = ....
out = Lambda(pad_and_shift)(given_inputs)
out = LSTM(units, ...)(out)
The other involves editing the source code of the LSTM, which would be very complicated and probably not worth it.
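To see what the shifted-input idea produces, here is a pure-Python sketch for a single sequence with one feature. It only illustrates the data layout; `shift_and_concat` is a hypothetical helper, and zero padding at the start is an assumption matching the padding used above.

```python
def shift_and_concat(sequence, steps=3, pad_value=0.0):
    """For each time step t, return [x[t-steps+1], ..., x[t-1], x[t]]."""
    padded = [pad_value] * (steps - 1) + list(sequence)
    return [padded[t:t + steps] for t in range(len(sequence))]


windows = shift_and_concat([1.0, 2.0, 3.0, 4.0], steps=3)
for row in windows:
    print(row)
# [0.0, 0.0, 1.0]
# [0.0, 1.0, 2.0]
# [1.0, 2.0, 3.0]
# [2.0, 3.0, 4.0]
```

Each time step now carries 3 features (the current value and the two before it), so the kernel of the following LSTM would have 3 rows instead of 1.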

Related

Number of nodes in output layer greater than number of classes in a neural network

While training a neural network, on the fashion mnist dataset, I decided to have a greater number of nodes in my output layer than the number of classes in the dataset.
The dataset has 10 classes, while I trained my network to have 15 nodes in the output layer. I also used a softmax.
Now surprisingly, this gave me an accuracy of 97% which is quite good.
This leads me to the question, what do those extra 5 nodes even mean, and what do they do here?
Why is my softmax able to work properly when the label range(0-9) isn't equal to the number of nodes(15)?
And finally, in general, what does it mean to have more nodes in your output layer than the number of classes, in a classification task?
I understand the effects of having fewer nodes than the number of classes, and also that the rule of thumb is to use number of nodes = number of classes. Yet, I've never seen someone use a greater number of nodes, and I'd like to understand why/why not.
I'm attaching some code so that the results can be reproduced. This was done using Tensorflow 2.3
import tensorflow as tf
print(tf.__version__)
mnist = tf.keras.datasets.mnist
(training_images, training_labels) , (test_images, test_labels) = mnist.load_data()
training_images = training_images/255.0
test_images = test_images/255.0
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
tf.keras.layers.Dense(256, activation=tf.nn.relu),
tf.keras.layers.Dense(15, activation=tf.nn.softmax)])
model.compile(optimizer = 'adam',
loss = 'sparse_categorical_crossentropy',
metrics = ['accuracy'])
model.fit(training_images, training_labels, epochs=5)
model.evaluate(test_images, test_labels)

The only reason you are able to use such a configuration is because you have specified your loss function as sparse_categorical_crossentropy.
let's understand the effects of greater output nodes in forward propagation.
Consider a neural network with 2 layers.
1st layer - 6 neurons (Hidden layer)
2nd layer - 4 neurons (output layer)
You have a dataset X whose shape is (100, 12), i.e. 12 features and 100 rows.
You have labels y whose shape is (100,), containing two unique values, 0 and 1.
Therefore essentially this is a binary classification problem, but we will use 4 neurons in our output layer.
Consider each neuron as a logistic regression unit. Therefore each neuron in the 1st layer will have 12 weights (w1, w2, ..., w12).
Why? - Because you have 12 features.
Each neuron will output a single term given by a. I will give the computation of a in two steps.
z = w1*x1 + w2*x2 + ... + w12*x12 + w0 # w0 is bias
a = activation(z)
Therefore, your 1st layer will output 6 values for each row in our dataset.
So now you have a feature matrix of 100 * 6.
This is passed to the 2nd layer and the same process repeats.
So in essence you are able to complete the forward propagation step even when you have more neurons than the actual classes.
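The forward pass described above can be sketched in pure Python for a single row. This is only an illustration: the random weights, the zero biases, the sigmoid activation, and the helper names `dense` and `init_layer` are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(row, weights, biases, activation):
    """Apply one fully connected layer to a single input row.

    Each neuron computes z = sum(w_k * x_k) + w0 and then a = activation(z).
    """
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, row)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def init_layer(n_inputs, n_neurons, rng):
    """Create illustrative random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons


rng = random.Random(0)
row = [rng.random() for _ in range(12)]   # one example with 12 features
w1, b1 = init_layer(12, 6, rng)           # hidden layer: 6 neurons
w2, b2 = init_layer(6, 4, rng)            # output layer: 4 neurons

hidden = dense(row, w1, b1, sigmoid)      # 6 values
output = dense(hidden, w2, b2, sigmoid)   # 4 values, although there are only 2 classes
print(len(hidden), len(output))
```

Nothing in the forward pass depends on how many distinct labels exist, which is why 4 output neurons work for a 2-class problem.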
Now let's see backpropagation.
For backpropagation to exist you must be able to calculate the loss_value.
we will take a small example:
y_true has two labels as in our problem and y_pred has 4 probability values since we have 4 units in our final layer.
y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy() # 3.7092905
How is it calculated:
-( log(0.03) + log(0.02) ) / 2
So essentially we can compute the loss so we can also compute its gradients.
Therefore no problem in using backpropagation too.
Therefore our model can very well train and achieve 90 % accuracy.
So, the final question: what do these extra neurons (neuron 2 and neuron 3) represent?
Answer: They represent the probability of the example belonging to class 2 and class 3 respectively. But since the labels contain no values of class 2 or class 3, they contribute no terms of their own to the loss value.
Note: If you one-hot encode your labels with one column per actual class and use categorical_crossentropy as your loss, you will encounter an error.
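The loss value from the example above can be reproduced by hand. The following is a minimal pure-Python sketch of the averaging formula, not the Keras implementation (which, as far as I know, also clips the probabilities).

```python
import math


def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean of -log(p[label]) over the batch.

    The integer label is used as an index into the predicted probabilities,
    so the prediction may have more columns than there are distinct labels.
    """
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


y_true = [0, 1]
y_pred = [[0.03, 0.90, 0.02, 0.05], [0.15, 0.02, 0.8, 0.03]]

loss = sparse_categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))  # 3.7093
```

Only columns 0 and 1 are ever indexed here, so the probabilities in columns 2 and 3 never appear as a log term.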

What would be the output from tensorflow dense layer if we assign itself as input and output while making a neural network?

I have been going through the implementation of neural network in openAI code for any Vanilla Policy Gradient (As a matter of fact, this part is used nearly everywhere). The code looks something like this :
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes) + [act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)
    return pi, logp, logp_pi
and this multi-layered perceptron network is defined as follows :
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(inputs=x, units=h, activation=activation)
    return tf.layers.dense(inputs=x, units=hidden_sizes[-1], activation=output_activation)
My question is: what is the return value of this mlp function? I mean its structure or shape. Is it an N-dimensional tensor? If so, how is it given as an input to tf.random.categorical? If not, and it just has the shape [hidden_layer2, output], then what happened to the other layers? As per the website description, random.categorical only takes a 2-D input. The complete code of openAI's VPG algorithm can be found here. The mlp is implemented here. I would be highly grateful if someone would just tell me what this mlp_categorical_policy() is doing.
Note: The hidden size is [64, 64], the action dimension is 3
Thanks and cheers

Note that this is a discrete action space - there are action_space.n different possible actions at every step, and the agent chooses one.
To do this the MLP is returning the logits (which are a function of the probabilities) of the different actions. This is specified in the code by + [act_dim] which is appending count of the action_space as the final MLP layer. Note that the last layer of an MLP is the output layer. The input layer is not specified in tensorflow, it is inferred from the inputs.
tf.random.categorical takes the logits and samples a policy action pi from them, which is returned as a number.
mlp_categorical_policy also returns logp, the log probability of the action a (used to assign credit), and logp_pi, the log probability of the policy action pi.
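To make the logp calculation concrete, here is a pure-Python sketch for a single observation. It mirrors the one-hot-times-log-softmax trick from the code in the question; the logit values and the chosen action are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return [v - log_sum for v in logits]


def one_hot(index, depth):
    return [1.0 if i == index else 0.0 for i in range(depth)]


logits = [0.2, -1.0, 1.5]  # hypothetical MLP output for act_dim = 3
a = 2                      # hypothetical action that was taken

logp_all = log_softmax(logits)
# Multiplying by the one-hot vector and summing simply picks logp_all[a].
logp = sum(mask * lp for mask, lp in zip(one_hot(a, len(logits)), logp_all))
print(logp_all, logp)
```

The one-hot product is just a batched way of indexing, which is convenient when `a` is a tensor of actions rather than a single integer.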
It seems your question is more about the return from the mlp.
The mlp creates a series of fully connected layers in a loop. In each iteration of the loop, the mlp creates a new layer using the previous layer x as an input and assigns its output to overwrite x, with this line: x = tf.layers.dense(inputs=x, units=h, activation=activation).
So the output is not the same as the input, on each iteration x is overwritten with the value of the new layer. This is the same kind of coding trick as x = x + 1, which increments x by 1. This effectively chains the layers together.
The output of tf.layers.dense is a tensor of size [:,h] where : is the batch dimension (and can usually be ignored). The creation of the last layer happens outside the loop; it can be seen that the number of nodes in this layer is act_dim (so the shape is [:,3]). You can check the shape by doing this:
import tensorflow.compat.v1 as tf
import numpy as np
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)
obs = np.array([[1.0,2.0]])
logits = mlp(obs, [64, 64, 3], tf.nn.relu, None)
print(logits.shape)
result: TensorShape([1, 3])
Note that the observation in this case is [1.,2.], it is nested inside a batch of size 1.
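The way x is overwritten layer by layer can also be traced without TensorFlow. This sketch only tracks shapes, under the assumption that each dense layer replaces the last dimension with its number of units; `mlp_shapes` is a hypothetical helper.

```python
def mlp_shapes(input_shape, layer_sizes):
    """Trace the shape of x as it is overwritten by each dense layer."""
    shape = input_shape
    history = [shape]
    for units in layer_sizes:
        # A dense layer keeps the batch dimension and replaces the last one.
        shape = (shape[0], units)
        history.append(shape)
    return history


# Batch of 1 observation with 2 features, hidden sizes [64, 64], act_dim 3.
history = mlp_shapes((1, 2), [64, 64, 3])
print(history)  # [(1, 2), (1, 64), (1, 64), (1, 3)]
```

The returned tensor is always 2-D (batch, act_dim), which is what tf.random.categorical expects.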

How to deploy a trigger word detection with tensorflow

I'm working on the "trigger word detection" model, and I decided to deploy the model to my phone.
The input shape of the model is (None, 5511, 101).
The output shape is (None, 1375, 1).
But in a real deployed app, the model can't get the 5511 timesteps all at once; instead, the phone's sensor produces the audio frames one by one.
How can I feed these pieces of data to the model one by one and get the output at each timestep?
The model is a recurrent one. But the "model.predict()" takes a first parameter of (None,5511,101), and what I intend to do is
output = []
for i in range(5511):
    a = model.func(i, (None,1,101))
    output.append(a)
structure of the model:

This problem can be solved by making the timesteps axis dynamic. In other words, when you define the model, the number of timesteps should be set to None. Here is an example illustrating how it would work for a simplified version of your model:
from keras.layers import GRU, Input, Conv1D
from keras.models import Model
import numpy as np
x = Input(shape=(None, 101))
h = Conv1D(196, 15, strides=4)(x)
h = GRU(1, return_sequences=True)(h)
model = Model(x, h)
# The model works for the original number of timesteps (5511)
batch_size = 2
out = model.predict(np.random.rand(batch_size, 5511, 101))
print(out.shape)
# ... but also for fewer timesteps (say 32)
out = model.predict(np.random.rand(batch_size, 32, 101))
print(out.shape)
# However, it will not work if timesteps < Conv1D filter_size (15)!
out = model.predict(np.random.rand(batch_size, 14, 101))
print(out.shape)
Note, however, that you will not be able to feed fewer than 15 timesteps (the size of the Conv1D filters) unless you pad the input sequences to 15.
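The output lengths printed by the example follow from the usual formula for a convolution without padding. A small sketch, assuming the default 'valid' padding; `conv1d_output_length` is a hypothetical helper, not a Keras function.

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Number of output steps of a 1D convolution with 'valid' padding."""
    if input_length < kernel_size:
        raise ValueError("input is shorter than the convolution kernel")
    return (input_length - kernel_size) // stride + 1


print(conv1d_output_length(5511, 15, 4))  # 1375
print(conv1d_output_length(32, 15, 4))    # 5

try:
    conv1d_output_length(14, 15, 4)
except ValueError as error:
    print("14 timesteps:", error)
```

This matches the (None, 1375, 1) output shape in the question for 5511 input steps.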

You should either change your model into a recurrent one where you can feed pieces of data one at a time, or you should think about changing the model and using something that works on (overlapping) windows in time, where you apply the model every few pieces of data and get a partial output.
Still depending on the model you might get the output you want only at the end. You should design it accordingly.
Here is an example: https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/

For passing inputs step by step, you need recurrent layers with stateful=True.
The convolutional layer will certainly prevent you from achieving what you want. Either you remove it or you pass inputs in groups of 15 steps (where 15 is your kernel size for the convolution).
You would need to coordinate these 15 steps with the stride of 4, and you might need padding. If I may suggest, to avoid mathematical difficulties, you could use kernel_size=16, strides=4 and input_steps=5512, which is a multiple of your stride value 4. This avoids padding and makes the calculations easier, and your output will have exactly 1375 steps.
Then your model would be like:
inputs = Input(batch_shape=(batch_size,None, 101)) #where you will always use input shapes of (batch_size, 16, 101)
out = Conv1D(196, 16, strides=4)(inputs)
...
...
out = GRU(..., stateful=True)(out)
...
out = GRU(..., stateful=True)(out)
...
...
model = Model(inputs, out)
It's necessary to have a fixed batch size with a stateful=True model. It can be 1, but for optimizing your processing speed, if you have more than one sequence to process in parallel (and independently from each other), use a bigger batch size.
For working it step by step, you need, first of all, to reset states (whenever you use a stateful=True model, you need to keep resetting states every time you are going to feed a new sequence or a new batch of parallel sequences).
So:
#will start a new batch containing a number of sequences equal to batch_size:
model.reset_states()
#received 16 steps from batch_size sequences:
steps = an_array_shaped((batch_size, 16, 101))
#for training
model.train_on_batch(steps, something_for_y_shaped((batch_size, 1, 1)), ...)
#I don't recommend to train like this because of the batch normalizations
#If you can train the entire length at once, do it.
#never forget: for full length training, you would need model.reset_states() every batch.
#for predicting:
predictions = model.predict_on_batch(steps, ...)
#received 4 new steps from X sequences:
steps = np.concatenate([steps[:,4:], new_steps], axis=1)
#these new steps belong to the "same" batch_size sequences! Don't call reset states!
#repeat one of the above for training or predicting
new_predictions = model.predict_on_batch(steps, ...)
predictions = np.concatenate([predictions, new_predictions], axis=1)
#keep repeating this loop until you reach the last step
Finally, when you have reached the last step, call `model.reset_states()` again for safety; everything that you input afterwards will be "new" sequences, not new "steps" of the previous sequences.
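The buffering part of this loop (keep 16 steps, drop the oldest 4 when 4 new ones arrive) can be sketched independently of Keras. This is a hypothetical helper using only the standard library; integers stand in for the 101-feature audio frames, and the model call itself is left out.

```python
from collections import deque


def stream_windows(frames, window=16, hop=4):
    """Yield overlapping windows from a stream of frames.

    The first window is emitted once `window` frames have arrived;
    afterwards a new window is emitted every `hop` frames.
    """
    buffer = deque(maxlen=window)  # old frames fall out automatically
    for count, frame in enumerate(frames, start=1):
        buffer.append(frame)
        if count >= window and (count - window) % hop == 0:
            yield list(buffer)


# Integers stand in for audio frames in this illustration.
windows = list(stream_windows(range(5512), window=16, hop=4))
print(len(windows))     # 1375
print(windows[0][:4])   # [0, 1, 2, 3]
print(windows[1][:4])   # [4, 5, 6, 7]
```

Each yielded window would be what gets passed to `model.predict_on_batch`, giving one output step per window and 1375 output steps for 5512 input steps.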
------------
# Training hint
If you are able to train with the full sequences (not step by step), use a `stateful=False` model and train normally with `model.fit(...)`. Later, recreate the model exactly, but with `stateful=True`, copy the weights with `new_model.set_weights(old_model.get_weights())`, and use the new model for predicting as above.

Getting keras LSTM layer to accept two inputs?

I'm working with padded sequences of maximum length 50. I have two types of sequence data:
1) A sequence, seq1, of integers (1-100) that correspond to event types (e.g. [3,6,3,1,45,45....3])
2) A sequence, seq2, of integers representing time, in minutes, from the last event in seq1. So the last element is zero, by definition. So for example [100, 96, 96, 45, 44, 12,... 0]. seq1 and seq2 are the same length, 50.
I'm trying to run the LSTM primarily on the event/seq1 data, but have the time/seq2 strongly influence the forget gate within the LSTM. The reason for this is I want the LSTM to tend to really penalize older events and be more likely to forget them. I was thinking about multiplying the forget weight by the inverse of the current value of the time/seq2 sequence. Or maybe (1/seq2_element + 1), to handle cases where it's zero minutes.
I see in the keras code (LSTMCell class) where the change would have to be:
f = self.recurrent_activation(x_f + K.dot(h_tm1_f,self.recurrent_kernel_f))
So I need to modify keras' LSTM code to accept multiple inputs. As an initial test, within the LSTMCell class, I changed the call function to look like this:
def call(self, inputs, states, training=None):
time_input = inputs[1]
inputs = inputs[0]
So that it can handle two inputs given as a list.
When I try running the model with the Functional API:
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
## Input 2: time vectors
auxiliary_input = Input(shape=(max_seq_length,1), dtype='float32', name='aux_input')
m = Masking(mask_value = 99999999.0)(auxiliary_input)
lstm_out = LSTM(32)(x, time_vector = m)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An arbitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input, auxiliary_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
print(model.summary())
np.random.seed(21)
model.fit([train_X1, train_X2], [train_Y, train_Y], epochs=1, batch_size=200)
However, I get the following error:
An `initial_state` was passed that is not compatible with `cell.state_size`. Received `state_spec`=[InputSpec(shape=(None, 50, 1), ndim=3)]; however `cell.state_size` is (32, 32)
Any advice?

You can't pass a list of inputs to the default recurrent layers in Keras. The input_spec is fixed and the recurrent code is implemented for a single tensor input, as also pointed out in the documentation; i.e. it doesn't magically iterate over 2 inputs with the same timesteps and pass them to the cell. This is partly because of how the iterations are optimised and the assumptions made when the network is unrolled, etc.
If you would like 2 inputs, you can pass constants (doc) to the cell, which will pass the tensor as is. This is mainly intended for implementing attention models in the future. So 1 input will iterate over timesteps while the other will not. If you really want 2 inputs to be iterated like a zip() in Python, you will have to implement a custom layer.

I would like to throw in a few different ideas here. They don't require you to modify the Keras code.
After the embedding layer of the event types, stack the embeddings with the elapsed time. The Keras layer for this is keras.layers.Concatenate(axis=-1). Imagine this: a single event type is mapped to an n-dimensional vector by the embedding layer. You just add the elapsed time as one more dimension after the embedding so that it becomes an (n+1)-dimensional vector.
Another idea, somewhat related to your problem and possibly helpful here, is 1D convolution. The convolution can happen right after the concatenated embeddings. The intuition for applying convolution to event types and elapsed time is actually a 1x1 convolution: you linearly combine the two together, and the parameters are trained. Note that in terms of convolution, the dimensions of the vectors are called channels. Of course, you can also convolve more than 1 event at a step. Just try it. It may or may not help.
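The concatenation idea can be pictured with plain lists. In this sketch the embedding table, its values, and the helper name `concat_time_feature` are all hypothetical; in a real model the Embedding and Concatenate layers would do this on tensors.

```python
def concat_time_feature(embedded_events, elapsed_minutes):
    """Append the elapsed time as one extra feature to each embedding vector."""
    return [vector + [minutes]
            for vector, minutes in zip(embedded_events, elapsed_minutes)]


# Hypothetical 4-dimensional embeddings for three event types.
embedding_table = {
    3: [0.10, -0.20, 0.05, 0.70],
    6: [0.33, 0.12, -0.41, 0.08],
    45: [-0.52, 0.27, 0.19, -0.03],
}

seq1 = [3, 6, 45]          # event types
seq2 = [100.0, 96.0, 0.0]  # minutes from the last event

embedded = [embedding_table[event] for event in seq1]
combined = concat_time_feature(embedded, seq2)
for row in combined:
    print(row)  # n + 1 = 5 values per time step
```

The LSTM then receives one tensor with n+1 features per step, so no change to the LSTM code is needed.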

Understanding TensorBoard (weight) histograms

It is really straightforward to see and understand the scalar values in TensorBoard. However, it's not clear how to understand histogram graphs.
For example, these are the histograms of my network weights.
(After fixing a bug thanks to sunside)
What is the best way to interpret these? Layer 1 weights look mostly flat, what does this mean?
I added the network construction code here.
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
x_image = tf.reshape(X, [-1, 6, 10, 1])
tf.summary.image('input', x_image, 4)
# First layer of weights
with tf.name_scope("layer1"):
    W1 = tf.get_variable("W1", shape=[input_size, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.matmul(X, W1)
    layer1_act = tf.nn.tanh(layer1)
    tf.summary.histogram("weights", W1)
    tf.summary.histogram("layer", layer1)
    tf.summary.histogram("activations", layer1_act)
# Second layer of weights
with tf.name_scope("layer2"):
    W2 = tf.get_variable("W2", shape=[hidden_layer_neurons, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer2 = tf.matmul(layer1_act, W2)
    layer2_act = tf.nn.tanh(layer2)
    tf.summary.histogram("weights", W2)
    tf.summary.histogram("layer", layer2)
    tf.summary.histogram("activations", layer2_act)
# Third layer of weights
with tf.name_scope("layer3"):
    W3 = tf.get_variable("W3", shape=[hidden_layer_neurons, hidden_layer_neurons],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer3 = tf.matmul(layer2_act, W3)
    layer3_act = tf.nn.tanh(layer3)
    tf.summary.histogram("weights", W3)
    tf.summary.histogram("layer", layer3)
    tf.summary.histogram("activations", layer3_act)
# Fourth layer of weights
with tf.name_scope("layer4"):
    W4 = tf.get_variable("W4", shape=[hidden_layer_neurons, output_size],
                         initializer=tf.contrib.layers.xavier_initializer())
    Qpred = tf.nn.softmax(tf.matmul(layer3_act, W4)) # Bug fixed: Qpred = tf.nn.softmax(tf.matmul(layer3, W4))
    tf.summary.histogram("weights", W4)
    tf.summary.histogram("Qpred", Qpred)
# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(tf.float32, [None, output_size], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")
# Loss function
# Sum (Ai*logp(yi|xi))
log_lik = -Y * tf.log(Qpred)
loss = tf.reduce_mean(tf.reduce_sum(log_lik * advantages, axis=1))
tf.summary.scalar("Q", tf.reduce_mean(Qpred))
tf.summary.scalar("Y", tf.reduce_mean(Y))
tf.summary.scalar("log_likelihood", tf.reduce_mean(log_lik))
tf.summary.scalar("loss", loss)
# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

It appears that the network hasn't learned anything in the layers one to three. The last layer does change, so that means that there either may be something wrong with the gradients (if you're tampering with them manually), you're constraining learning to the last layer by optimizing only its weights or the last layer really 'eats up' all error. It could also be that only biases are learned. The network appears to learn something though, but it might not be using its full potential. More context would be needed here, but playing around with the learning rate (e.g. using a smaller one) might be worth a shot.
In general, histograms display the number of occurrences of a value relative to the other values. Simply speaking, if the possible values are in a range of 0..9 and you see a spike of amount 10 on the value 0, this means that 10 inputs assume the value 0; in contrast, if the histogram shows a plateau of 1 for all values of 0..9, it means that for 10 inputs, each possible value 0..9 occurs exactly once.
You can also use histograms to visualize probability distributions when you normalize all histogram values by their total sum; if you do that, you'll intuitively obtain the likelihood with which a certain value (on the x axis) will appear (compared to other inputs).
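The spike and plateau cases, and the normalization into a probability distribution, can be reproduced in a few lines of standard-library Python. The helper names are hypothetical, and this counts exact values rather than using TensorBoard's own bucketing.

```python
from collections import Counter


def histogram(values):
    """Count how often each value occurs."""
    return Counter(values)


def normalize(counts):
    """Turn counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


spike = histogram([0] * 10)      # 10 inputs that all have the value 0
plateau = histogram(range(10))   # 10 inputs, each value 0..9 exactly once

print(spike[0])                  # 10
print(plateau[0], plateau[9])    # 1 1
print(normalize(plateau)[3])     # 0.1
```

After normalization the plateau corresponds to a uniform distribution: every value is equally likely.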
Now for layer1/weights, the plateau means that:
most of the weights are in the range of -0.15 to 0.15
it is (mostly) equally likely for a weight to have any of these values, i.e. they are (almost) uniformly distributed
Said differently, almost the same number of weights have the values -0.15, 0.0, 0.15 and everything in between. There are some weights having slightly smaller or higher values.
So in short, this simply looks like the weights have been initialized using a uniform distribution with zero mean and value range -0.15..0.15 ... give or take. If you do indeed use uniform initialization, then this is typical when the network has not been trained yet.
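The range of roughly -0.15..0.15 is plausible for the Xavier (Glorot) uniform initializer used in the question, whose limit is sqrt(6 / (fan_in + fan_out)). The sketch below assumes input_size = 60 (inferred from the 6 x 10 reshape in the question) and a hypothetical hidden_layer_neurons = 200, since the actual value is not given.

```python
import math


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn uniformly from [-limit, +limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


input_size = 60             # assumption: 6 * 10 from the reshape in the question
hidden_layer_neurons = 200  # hypothetical; not stated in the question

limit = glorot_uniform_limit(input_size, hidden_layer_neurons)
print(round(limit, 3))  # 0.152
```

With other layer sizes the limit changes, so the exact width of the plateau depends on the real network dimensions.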
In comparison, layer1/activations forms a bell curve (gaussian)-like shape: The values are centered around a specific value, in this case 0, but they may also be greater or smaller than that (equally likely so, since it's symmetric). Most values appear close around the mean of 0, but values do range from -0.8 to 0.8.
I assume that the layer1/activations is taken as the distribution over all layer outputs in a batch. You can see that the values do change over time.
The layer 4 histogram doesn't tell me anything specific. From the shape, it's just showing that some weight values around -0.1, 0.05 and 0.25 tend to occur with a higher probability; a reason could be that different parts of each neuron there actually pick up the same information and are basically redundant. This can mean that you could actually use a smaller network or that your network has the potential to learn more distinguishing features in order to prevent overfitting. These are just assumptions though.
Also, as already stated in the comments below, do add bias units. By leaving them out, you are forcefully constraining your network to a possibly invalid solution.

Here I will explain the plot indirectly by giving a minimal example. The following code produces a simple histogram plot in TensorBoard.
from datetime import datetime
import tensorflow as tf
filename = datetime.now().strftime("%Y%m%d-%H%M%S")
fw = tf.summary.create_file_writer(f'logs/fit/{filename}')
with fw.as_default():
    for i in range(10):
        # maxval is passed by keyword; the second positional argument is minval
        t = tf.random.uniform((2, 2), maxval=1000)
        tf.summary.histogram(
            "train/hist",
            t,
            step=i
        )
        print(t)
We see that generating a 2x2 matrix with a maximum of 1000 produces values in the range 0-1000. To show how this tensor might look, I am putting the log of a few of them here.
tf.Tensor(
[[398.65747 939.9828 ]
[942.4269 59.790222]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[869.5309 980.9699 ]
[149.97845 454.524 ]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[967.5063 100.77594 ]
[ 47.620544 482.77008 ]], shape=(2, 2), dtype=float32)
We logged into TensorBoard 10 times. To the right of the plot, a timeline is generated to indicate the timesteps. The depth of the histogram indicates which values are new: the lighter/front values are newer and the darker/far values are older.
Values are gathered into buckets, which are indicated by those triangle structures. The x-axis indicates the range of values where each bunch lies.
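As a rough picture of what gathering values into buckets means, here is a pure-Python sketch that bins the first logged tensor into 10 equal-width buckets. Equal-width buckets are an assumption made for the illustration; TensorBoard chooses its own bucket boundaries, and `bucketize` is a hypothetical helper.

```python
def bucketize(values, low, high, n_buckets):
    """Count how many values fall into each equal-width bucket."""
    width = (high - low) / n_buckets
    counts = [0] * n_buckets
    for value in values:
        # Clamp so that value == high lands in the last bucket.
        index = min(int((value - low) / width), n_buckets - 1)
        counts[index] += 1
    return counts


# The four entries of the first tensor printed above.
values = [398.65747, 939.9828, 942.4269, 59.790222]
counts = bucketize(values, 0.0, 1000.0, 10)
print(counts)  # [1, 0, 0, 1, 0, 0, 0, 0, 0, 2]
```

The two values above 900 share one bucket, which is what shows up as a taller peak at the right end of that step's histogram.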
