How does loss of information lead to better accuracy? [closed] - tensorflow

So, I’ve been looking into the following code
# Define the model
model = tf.keras.models.Sequential([
    # Add convolutions and max pooling
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Add the same layers as before
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Print the model summary
model.summary()

# Use same settings
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
print(f'\nMODEL TRAINING:')
model.fit(training_images, training_labels, epochs=5)

# Evaluate on the test set
print(f'\nMODEL EVALUATION:')
test_loss = model.evaluate(test_images, test_labels)
From what I understand, Conv2D convolves the 28x28 input into 32 smaller 26x26 feature maps, so each map has lost some of the original data. Then we use MaxPooling2D(2, 2), which causes further data loss: each 2x2 block is reduced to a single value, keeping only 25% of the values. Then we repeat the whole process, losing even more data.
This also seems to be confirmed by the graph.
So intuition says: since there is less data available, classification should become less accurate, just as you can't correctly identify an object when your vision is blurred.
But surprisingly, the accuracy here goes up.
Can anyone help me figure out why?

The loss of information is a by-product of mapping the image onto a lower-dimensional target (compressing the representation in a lossy fashion), which is actually what you want. The relevant information content, however, is preserved as much as possible, while the irrelevant or redundant information is reduced. The built-in 'bias' of the pooling operation (the assumption that nearby patterns can be summarized this way) and the learned set of convolution kernels do this effectively.
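One way to see this concretely is to print the model summary for the network in the question. The spatial grid shrinks at every convolution/pooling step, but each remaining position carries 32 learned feature channels rather than raw pixel values, so what survives is a smaller yet more task-relevant description of the image. A rough sketch:
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# The summary shows the spatial size shrinking 28x28 -> 26x26 -> 13x13 -> 11x11 -> 5x5,
# while 32 feature channels are kept at each position (5*5*32 = 800 values feed the Dense head).
model.summary()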

The purpose of most algorithms is to lose unnecessary information.
You want to decide if there is a dog in the image? Then you have to destroy all the information that is irrelevant to identifying a dog. From what remains, is the hair color relevant to deciding whether it is a dog? If not, then delete it. Ignore it.
Is it relevant whether the dog is in the top left or the bottom right of the image? If not, ignore it.
That is why you augment your data by reflecting it, rotating it, and cutting out pieces: you are teaching the network what to ignore, what is not important.
If the neural network focuses on unimportant things, it is overfitting.
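For example, a minimal augmentation pipeline with Keras preprocessing layers might look something like this (the choice of transforms and their amounts is arbitrary and depends on what really is irrelevant for your labels):
import tensorflow as tf

# Random flips, rotations and zooms tell the network that these variations
# do not change the label, i.e. that they are safe to ignore.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    augmentation,  # only active during training
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])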
You want to sort an array? Then you are destroying the information about how it was ordered. A sorted array only remembers whether 5 is in the array, not where it was. 5 cannot come before 1, and cannot come after 7, and that is exactly why sorting is useful: you can find things more easily, because the array carries less information.
A main tool for finding a solution to a problem is to discard everything that is not relevant. Intelligence is mostly about simplifying problems, about finding only what is relevant about the problem.

Your intuition may be misleading here: max pooling is applied to a feature "cube" (the stack of feature maps produced by the convolutions), and what it does is keep the most informative value in each 2x2 region. You do not apply it to the raw image directly, which makes it different from your example of blurred vision.
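A tiny sketch of that behaviour on a made-up 4x4 feature map (the values are arbitrary; the point is that the strongest detector response in each 2x2 window survives):
import tensorflow as tf

fmap = tf.constant([[0., 1., 0., 0.],
                    [9., 0., 0., 7.],
                    [0., 0., 0., 2.],
                    [0., 3., 5., 0.]])
fmap = tf.reshape(fmap, (1, 4, 4, 1))  # (batch, height, width, channels)

pooled = tf.nn.max_pool2d(fmap, ksize=2, strides=2, padding='VALID')
print(tf.squeeze(pooled).numpy())
# [[9. 7.]
#  [3. 5.]]  -> the strongest response in each region is kept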

Related

Why doesn't my CNN validation accuracy increase? [closed]

I am attempting to create a simple CNN to be able to distinguish eye (retinal) scans of different severities. It is a multi-class classification problem, 5 classes. This by now is probably a fairly standard, textbook case for CNNs. I am using the Kaggle EyePACs dataset. The photos are very big, so I'm using a dataset that has rescaled them.
My issue is, when I'm training the model, I expect to see the usual learning curves where both training and validation curves increase together like this example from google:
However my curves look like this:
I haven't done any image pre-processing on the data, I was hoping that there would be some rudimentary learning going on which I can then improve upon using CLAHE and what have you. I've changed the classes so that instead of trying to predict the grades from 0 to 4, I've removed the middle classes so that we just have the extremes: 0 and 4 (and thus it became a binary classification problem, where class 4 was relabelled 1 and so it's 0 and 1). However the curve didn't change much and still looks like this:
What could be the issue? I thought that as the model gets better on the training data, it must also improve on the validation data. Yes, this is overfitting, but I assumed that kicks in after some positive learning, not straight away. The validation set doesn't seem to be learning at all. Also, shouldn't these models start with random parameters, so that the initial accuracy would be random too? Instead it's around 0.75 from the get-go, and it just doesn't learn after that. What's going on? What should I look at changing? Is this a data problem or a hyperparameter problem? Shall I include the code here? Many thanks.
=============================Edit=============================
Here's the code I used. I know it's rudimentary, it's a mishmash of both the 'image classification from scratch' Keras tutorial as well as some standard MNIST tutorials you get around the web. Grateful for any pointers.
Creating the image-label dataset objects for train (+validation split) and test:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "/content/drive/MyDrive/Colab Notebooks/resized train 15/Binary 0-4",
    labels="inferred",
    label_mode="binary",
    validation_split=0.2,
    seed=1337,
    subset="training",
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "/content/drive/MyDrive/Colab Notebooks/resized train 15/Binary 0-4",
    labels="inferred",
    label_mode="binary",
    validation_split=0.2,
    seed=1337,
    subset="validation",
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "/content/drive/MyDrive/Colab Notebooks/resized test 15/0-4/",
    labels="inferred",
    label_mode="binary",
)
Found 26518 files belonging to 2 classes.
Using 21215 files for training.
Found 26518 files belonging to 2 classes.
Using 5303 files for validation.
Found 36759 files belonging to 2 classes.
#To make it run faster (I think?):
train_ds = train_ds.prefetch(buffer_size=32)
val_ds = val_ds.prefetch(buffer_size=32)
test_ds = test_ds.prefetch(buffer_size=32)
#The architecture:
from keras.models import Sequential
from keras.layers import Dense, Rescaling, Conv2D, MaxPool2D, Flatten
model = Sequential()
model.add(Rescaling(1.0 / 255))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(256,256,3)))
model.add(MaxPool2D(pool_size=(2, 2), strides=2))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2), strides=2))
model.add(Flatten())
model.add(Dense(units=2, activation='sigmoid'))
#Compile it:
from keras import optimizers
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#And then finally train (run) it:
history = model.fit(
    x=train_ds,
    epochs=30,
    validation_data=val_ds,
)
#I think this is how I evaluate the trained model against the test data:
loss, acc = model.evaluate(test_ds)
print("Accuracy", acc)
#It prints out the following output:
1149/1149 [==============================] - 278s 238ms/step - loss: 0.1408 - accuracy: 0.9672
Accuracy 0.9671916961669922
And then of course I end it with model.save('Binary CNN 0-4').
I think I have spotted one thing I can change already -- that's to change the loss function to binary_crossentropy and adjust the number of units at the final dense layer to 1 (instead of 2)(?). But surely that little change won't actually address why the validation set isn't learning.
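In code, that change would presumably amount to something like this (just a sketch of the swap; it wouldn't by itself explain the flat validation curve):
# Final layer: a single unit with sigmoid, matching the 0/1 labels from label_mode="binary"
model.add(Dense(units=1, activation='sigmoid'))
# Compile with the matching loss
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='binary_crossentropy', metrics=['accuracy'])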
You've not included code, so I hope it's OK to give a couple of tentative general answers.
Q) Initial accuracy: how can it be as high as 0.75?
A) TensorFlow reports the average training accuracy over the epoch, and if there are many batches then the model already learns during epoch 0. The first accuracy reported is the average over epoch 0 and can be much better than random. If, for example, the input data is unbalanced and 75% of the labels fall in one category, the model may learn very quickly that it can achieve 75% accuracy simply by predicting that category for every sample.
Q) Can overfitting start at the beginning?
A) It can start very close to the beginning. A network may in effect just be memorising the training set. There are standard approaches to overfitting, which include:
i) Try a simpler network. It makes sense anyway to start simple and add complexity as required.
ii) Regularization of layers - add (e.g.) L2 regularizers to your layers.
iii) Add dropout layers between hidden layers.
iv) Batch normalisation between hidden layers.
v) Image augmentation (randomly add some rotation, shift, flipping if appropriate).
vi) Get more training data.
vii) Use transfer learning, as another answer has suggested. This is most likely appropriate if you don't have much training data. You can then just add a layer or two to the pre-built model (probably removing its last layer or two), and train only the new layers.
Only trial and error will show what works.
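As a rough illustration of points ii) and iii) applied to the architecture in the question (only a sketch; the regularization strengths are placeholders and may well need tuning for this dataset):
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(256, 256, 3)),
    layers.Conv2D(32, (3, 3), activation='relu',
                  kernel_regularizer=regularizers.l2(1e-4)),  # ii) L2 weight penalty
    layers.MaxPool2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu',
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.MaxPool2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),  # iii) dropout before the classifier
    layers.Dense(1, activation='sigmoid'),  # single-unit binary output
])

model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss='binary_crossentropy',
              metrics=['accuracy'])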

How to avoid overfitting with keras? [closed]

import numpy as np
from tensorflow import keras

def build_model():
    model = keras.models.Sequential()
    model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
    model.add(keras.layers.Dropout(rate=0.2))  # wrapped in model.add() so the dropout is actually part of the model
    model.add(keras.layers.Dense(500, activation="relu"))
    model.add(keras.layers.Dropout(rate=0.2))
    model.add(keras.layers.Dense(300, activation="relu"))
    model.add(keras.layers.Dropout(rate=0.2))
    model.add(keras.layers.Dense(10, activation="softmax"))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=keras.optimizers.SGD(), metrics=['accuracy'])
    return model

keras_clf = keras.wrappers.scikit_learn.KerasClassifier(build_model)

def exponential_decay_fn(epoch):
    return 0.05 * 0.1**(epoch / 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

history = keras_clf.fit(np.array(X_train_new), np.array(y_train_new), epochs=100,
                        validation_data=(np.array(X_validation), np.array(y_validation)),
                        callbacks=[keras.callbacks.EarlyStopping(patience=10), lr_scheduler])
I use dropout, early stopping, and a learning-rate scheduler. The results still seem to be overfitting. I tried reducing the number of neurons in the hidden layers to (300, 100), but then the results were underfitting: the accuracy on the train set was only around 0.5.
Are there any suggestions?
When dealing with these issues I first start out with a simple model, just a few dense layers without a lot of nodes. I run the model and look at the resulting training accuracy. The first step in modelling is to get a high training accuracy: you can add more layers and/or more nodes per layer until you get a satisfactory level of accuracy. Once that is achieved, start to evaluate the validation loss. If after a certain number of epochs the training loss continues to decrease but the validation loss starts to TREND upward, then you are in an overfitting condition. The word TREND is important here. I can't tell from your graphs whether you are really overfitting, but it looks to me like the validation loss has reached its minimum and is probably oscillating around it. This is normal and is NOT overfitting. An adjustable learning-rate callback that monitors validation loss, or alternatively a learning-rate scheduler, may get you to a lower minimum loss by lowering the learning rate, but at some point (provided you run for enough epochs) continually reducing the learning rate no longer gets you a lower minimum loss. The model has just done the best it can.
Now if you are REALLY overfitting you can take remedial action. One option is to add more dropout, at the cost of potentially reduced training accuracy. Another is to add L1 and/or L2 regularization; documentation for that is here. If your training accuracy is high but your validation accuracy is poor, it usually implies you need more training samples, because the samples you have are not fully representative of the data's probability distribution. More training data is always better. I notice you have 10 classes. Look at the balance of your dataset: if the classes have significantly different numbers of samples, this can cause problems. There are a number of methods to handle that, such as over-sampling under-represented classes, under-sampling over-represented classes, or a combination of both. An easy method is to use the class_weight parameter in model.fit. Also look at your validation set and make sure it does not use too many samples from under-represented classes. It is always best to select the validation set randomly from the overall dataset.
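For the class-imbalance part, a rough way to build that class_weight dictionary directly from the training labels (a sketch; it assumes y_train_new holds integer class labels for all classes and that the scikit-learn wrapper passes class_weight through to model.fit):
import numpy as np

# Weight each class by max_count / count so under-represented classes
# contribute more to the loss.
counts = np.bincount(np.array(y_train_new).astype(int).ravel())
class_weights = {cls: counts.max() / c for cls, c in enumerate(counts)}

history = keras_clf.fit(np.array(X_train_new), np.array(y_train_new), epochs=100,
                        validation_data=(np.array(X_validation), np.array(y_validation)),
                        class_weight=class_weights,
                        callbacks=[keras.callbacks.EarlyStopping(patience=10), lr_scheduler])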

How to design an optimal CNN? [closed]

I am working on a Ph.D. project whose objective is to reduce CO2 emissions on Earth.
I have a dataset, and I was able to successfully implement a CNN, which gives 80% accuracy (worst-case scenario). However, the field where I work is very demanding, and I have the impression that I could get better accuracy with a well-optimized CNN.
How do experts design CNN's? How could I choose between Inception Modules, Dropout Regularization, Batch Normalization, convolutional filter size, size and depth of convolutional channels, number of fully-connected layers, activations neurons, etc? How do people navigate this large optimization problem in a scientific manner? The combinations are endless. Are there any real-life examples where this problem is navigated, addressing its full complexity (not just optimizing a few hyper-parameters)?
Hopefully, my dataset is not too large, so the CNN models that I am considering should have very few parameters.
How do experts design CNN's? How could I choose between Inception Modules, Dropout Regularization, Batch Normalization, convolutional filter size, size and depth of convolutional channels, number of fully-connected layers, activations neurons, etc? How do people navigate this large optimization problem in a scientific manner? The combinations are endless.
You are right that the number of combinations is huge, and without the right approach you can end up nowhere. As the saying goes, machine learning is an art as much as a science, and results are data-dependent. Here are a few tips regarding your concerns.
Log everything: During training, save the necessary logs of every experiment, such as training loss, validation loss, weight files, execution times, visualizations, etc. Some of these can be saved with CSVLogger, ModelCheckpoint, etc. TensorBoard is a great tool for inspecting training logs, visualizations, and much more.
Strong validation strategies: This is very important. To build a stable cross-validation (CV) scheme, we must have a good understanding of the data and of the challenges it poses. We check that the validation set has a distribution similar to the training set and the test set, and we try to make sure our models improve both on the CV and on the test set (if ground truth is available for the test set). Partitioning the data randomly is usually not enough to satisfy this. Understanding the data and how to partition it without introducing data leakage into the CV is key to avoiding overfitting.
Change only one thing at a time: During experiments, change one thing at a time and save the observations (logs) for each change. For example, gradually increase the image size from, say, 224 upward and observe the results. Start with a small combination; while experimenting with image size, keep the other factors such as model architecture and learning rate fixed. The same goes for the learning rate or the architecture itself. Later, we may need to change more than one thing at once when we have found some promising combinations. In Kaggle competitions, these are very common approaches to follow.
However, as you said, your Ph.D. project is about reducing CO2 emissions on Earth. In my understanding this is more an application-specific problem than an algorithm-specific one, so I think it's better to take advantage of well-recognized pre-trained models.
If we do want to write our own CNN, we should give it a decent amount of time. Start with a very simple one, for example:
Conv2D(16, 3, 'relu') -> MaxPool(2)
Conv2D(32, 3, 'relu') -> MaxPool(2)
Conv2D(64, 3, 'relu') -> MaxPool(2)
Conv2D(128, 3, 'relu') -> MaxPool(2)
Here we gradually increase the depth while reducing the feature dimension, so that by the last layer more semantic information has emerged. While stacking Conv2D layers, it's common practice to increase the channel depth in an order such as 16, 32, 64, 128, etc. If we want to put an Inception or Residual block inside our network, I think we should first do some basic math about what feature shapes will come out of it. Following this idea, we may also want to look at approaches like SENet or ResNeSt. Regarding Dropout: if we observe that the model is overfitting during training, we should add some. In the final layer, we may want to choose GlobalAveragePooling over a Flatten layer feeding the fully-connected classifier. By now we can probably see that a lot of ablation studies are needed to get a satisfactory CNN model.
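Put together, such a simple starting point might look roughly like this in Keras (the filter counts and input size are placeholders to be adapted to the data):
import tensorflow as tf
from tensorflow.keras import layers

def build_baseline(input_shape=(224, 224, 3), num_classes=10):
    # Channel depth grows (16 -> 32 -> 64 -> 128) while the spatial size shrinks.
    return tf.keras.Sequential([
        layers.Conv2D(16, 3, activation='relu', input_shape=input_shape),
        layers.MaxPool2D(2),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPool2D(2),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPool2D(2),
        layers.Conv2D(128, 3, activation='relu'),
        layers.MaxPool2D(2),
        layers.GlobalAveragePooling2D(),  # preferred here over Flatten
        layers.Dense(num_classes, activation='softmax'),
    ])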
In this regard, we suggest you explore two things in particular: (1) read one of the pre-trained model papers/blogs/videos about the strategies used to build the architecture, for example this EfficientNet Explained write-up; (2) then explore its source code. That will give you more intuition and encourage you to build your own model.
We would like to end with one last working example. See the model diagram below; it's a small Inception network (source). If we look closely, we will see that it consists of the following three modules.
Conv Module
Inception Module
Downsample Module
Take a close look at each module's configuration, such as filter sizes and strides. Let's try to understand and implement these modules. Before that, here are two good references (1, 2) to refresh the Inception concept.
Conv Module
From the diagram we can see that it consists of one convolutional layer, one batch normalization, and one ReLU activation, and that it produces C feature maps with K x K filters and S x S strides. To implement it, we create a class that inherits from tf.keras.layers.Layer.
class ConvModule(tf.keras.layers.Layer):
    def __init__(self, kernel_num, kernel_size, strides, padding='same'):
        super(ConvModule, self).__init__()
        # conv layer
        self.conv = tf.keras.layers.Conv2D(kernel_num,
                                           kernel_size=kernel_size,
                                           strides=strides, padding=padding)
        # batch norm layer
        self.bn = tf.keras.layers.BatchNormalization()

    def call(self, input_tensor, training=False):
        x = self.conv(input_tensor)
        x = self.bn(x, training=training)
        x = tf.nn.relu(x)
        return x
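A quick sanity check of the block on an arbitrary input (the 64x64x3 shape is just an example):
block = ConvModule(kernel_num=32, kernel_size=(3, 3), strides=(1, 1))
y = block(tf.random.normal((1, 64, 64, 3)))
print(y.shape)  # (1, 64, 64, 32) -- 'same' padding keeps the spatial size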
Inception Module
Next comes the Inception module. According to the diagram, it consists of two convolutional modules whose outputs are then merged. To merge them, we need to ensure that the output feature maps have the same spatial dimensions (height and width).
class InceptionModule(tf.keras.layers.Layer):
    def __init__(self, kernel_size1x1, kernel_size3x3):
        super(InceptionModule, self).__init__()
        # two conv modules: they will take the same input tensor
        self.conv1 = ConvModule(kernel_size1x1, kernel_size=(1,1), strides=(1,1))
        self.conv2 = ConvModule(kernel_size3x3, kernel_size=(3,3), strides=(1,1))
        self.cat = tf.keras.layers.Concatenate()

    def call(self, input_tensor, training=False):
        x_1x1 = self.conv1(input_tensor)
        x_3x3 = self.conv2(input_tensor)
        x = self.cat([x_1x1, x_3x3])
        return x
Here you may notice that we have hard-coded the exact kernel sizes and strides for both convolutional layers according to the network diagram. Also, in ConvModule we have already set padding to 'same', so that the feature map dimensions are the same for both self.conv1 and self.conv2, which is required in order to concatenate them at the end.
Again, in this module two variables act as placeholders, kernel_size1x1 and kernel_size3x3. This is on purpose, because we will need different numbers of feature maps at different stages of the model. If we look at the model diagram, we will see that InceptionModule takes a different number of filters at different stages of the model.
Downsample Module
Lastly, the downsampling module. The main intuition behind downsampling is that we want to keep the feature information that best represents the input, while removing less useful detail so that the model can focus on what is most relevant. There are many ways to reduce the dimension of the feature maps (or inputs), for example using stride 2 or a conventional pooling operation. There are several types of pooling operations, namely MaxPooling, AveragePooling, and GlobalAveragePooling.
From the diagram, we can see that the downsampling module contains one convolutional layer and one max-pooling layer whose outputs are later merged. If we look closely at the diagram (top-right), we see that the convolutional layer uses a 3 x 3 filter with strides of 2 x 2, and the pooling layer (here MaxPooling) uses a pooling size of 3 x 3 with strides of 2 x 2. We also need to make sure that the dimensions coming out of each of them are the same so that they can be merged at the end. Remember that when we designed ConvModule we purposely set the padding argument to 'same'; in this case we need to set it to 'valid'.
class DownsampleModule(tf.keras.layers.Layer):
    def __init__(self, kernel_size):
        super(DownsampleModule, self).__init__()
        # conv layer
        self.conv3 = ConvModule(kernel_size, kernel_size=(3,3),
                                strides=(2,2), padding="valid")
        # pooling layer
        self.pool = tf.keras.layers.MaxPooling2D(pool_size=(3, 3),
                                                 strides=(2,2))
        self.cat = tf.keras.layers.Concatenate()

    def call(self, input_tensor, training=False):
        # forward pass
        conv_x = self.conv3(input_tensor, training=training)
        pool_x = self.pool(input_tensor)
        # merged
        return self.cat([conv_x, pool_x])
Okay, now we have built all three modules: ConvModule, InceptionModule, and DownsampleModule. Let's initialize their parameters according to the diagram.
class MiniInception(tf.keras.Model):
    def __init__(self, num_classes=10):
        super(MiniInception, self).__init__()
        # the first conv module
        self.conv_block = ConvModule(96, (3,3), (1,1))
        # 2 inception modules and 1 downsample module
        self.inception_block1 = InceptionModule(32, 32)
        self.inception_block2 = InceptionModule(32, 48)
        self.downsample_block1 = DownsampleModule(80)
        # 4 inception modules and 1 downsample module
        self.inception_block3 = InceptionModule(112, 48)
        self.inception_block4 = InceptionModule(96, 64)
        self.inception_block5 = InceptionModule(80, 80)
        self.inception_block6 = InceptionModule(48, 96)
        self.downsample_block2 = DownsampleModule(96)
        # 2 inception modules
        self.inception_block7 = InceptionModule(176, 160)
        self.inception_block8 = InceptionModule(176, 160)
        # average pooling
        self.avg_pool = tf.keras.layers.AveragePooling2D((7,7))
        # model tail
        self.flat = tf.keras.layers.Flatten()
        self.classfier = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, input_tensor, training=True, **kwargs):
        # forward pass
        x = self.conv_block(input_tensor)
        x = self.inception_block1(x)
        x = self.inception_block2(x)
        x = self.downsample_block1(x)
        x = self.inception_block3(x)
        x = self.inception_block4(x)
        x = self.inception_block5(x)
        x = self.inception_block6(x)
        x = self.downsample_block2(x)
        x = self.inception_block7(x)
        x = self.inception_block8(x)
        x = self.avg_pool(x)
        x = self.flat(x)
        return self.classfier(x)
The number of filters for each computational block is set according to the design of the model (see the diagram). After initializing all the blocks (in the __init__ function), we connect them according to the design (in the call function).
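A quick way to sanity-check the whole stack is to call it once on a dummy batch and print the summary (the 32x32x3 input shape is just an example, e.g. CIFAR-10-sized images):
model = MiniInception(num_classes=10)
_ = model(tf.zeros((1, 32, 32, 3)))  # build the model by running a dummy batch through it
model.summary()

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])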
I think you are way off in your estimate of the number of parameters needed. Think more like a few million, which is what you will get if you use transfer learning. You can struggle trying to build your own model if you wish, but you will probably not do any better (and more likely nowhere near as good) than the results you will get from transfer learning. I highly recommend the MobileNetV2 model. You can make that, or any of the other models, perform better if you use an adjustable learning rate via ReduceLROnPlateau; documentation for that is here. The other thing I recommend is the Keras callback EarlyStopping; documentation is here. Set it to monitor validation loss and set restore_best_weights=True. Set the number of epochs to a large number so this callback gets triggered and returns the model with the weights from the epoch with the lowest validation loss. My recommended code is shown below.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adamax

height=224
width=224
img_shape=(height, width, 3)
dropout=.3
lr=.001
class_count=156 # number of classes
base_model=tf.keras.applications.MobileNetV2(include_top=False, input_shape=img_shape, pooling='max', weights='imagenet')
x=base_model.output
x=keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
x=Dense(512, kernel_regularizer=regularizers.l2(0.016), activity_regularizer=regularizers.l1(0.006),
        bias_regularizer=regularizers.l1(0.006), activation='relu',
        kernel_initializer=tf.keras.initializers.GlorotUniform(seed=123))(x)
x=Dropout(rate=dropout, seed=123)(x)
output=Dense(class_count, activation='softmax', kernel_initializer=tf.keras.initializers.GlorotUniform(seed=123))(x)
model=Model(inputs=base_model.input, outputs=output)
model.compile(Adamax(learning_rate=lr), loss='categorical_crossentropy', metrics=['accuracy'])
rlronp=tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, verbose=1, mode='auto',
                                            min_delta=0.0001, cooldown=0, min_lr=0)
estop=tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0, patience=4,
                                       verbose=1, mode="auto", baseline=None,
                                       restore_best_weights=True)
callbacks=[rlronp, estop]
Also look at the balance in your dataset. That is, compare how many training samples you have for each class. If the ratio of most samples to least samples is greater than 2 or 3, you may want to take action to mitigate that. Numerous methods are available; the simplest is to use the class_weight parameter in model.fit. To do that you need to create a class_weights dictionary. The process is outlined below.
Let's say your class distribution is:
class0 - 500 samples
class1- 2000 samples
class2 - 1500 samples
class3 - 200 samples
Then your dictionary would be
class_weights={0: 2000/500, 1:2000/2000, 2: 2000/1500, 3: 2000/200}
In model.fit, set class_weight=class_weights.
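Putting the pieces together, the training call would then look something like this (train_gen and valid_gen are placeholders for whatever training and validation data iterators you use):
history = model.fit(train_gen,
                    epochs=40,  # large enough for EarlyStopping to trigger
                    validation_data=valid_gen,
                    class_weight=class_weights,
                    callbacks=callbacks)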

How do you decide on the dimensions for the activation layer in tensorflow

The tensorflow hub docs have this example code for text classification:
hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", output_shape=[50],
                           input_shape=[], dtype=tf.string)
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
I don't understand how we decide if 16 is the right magic number for the relu layer. Can someone explain this please.
The choice of 16 units in the hidden layer is not a uniquely determined magic value. Like Shubham commented, it's all about experimenting and finding values that work well for your problem. Here is some folklore to guide your experimentation:
The usual range for the number of units in hidden layers is tens to thousands.
Powers of two may utilize specific hardware (like GPUs) more effectively.
Simple feed-forward networks like the one above often decrease the number of units between successive layers. A commonly cited intuition is to progress from many basic features to fewer, more abstract ones. (Hidden layers tend to produce dense representations like embeddings, not discrete features, but the reasoning applies analogously to the dimension of the feature space.)
The code snippet above does not show regularization. When trying whether more hidden units help, watch out for the gap between training and validation quality. A widening gap may indicate the need to regularize more.
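If you want to turn that experimentation into code, a simple sweep over candidate widths might look like this (train_ds and val_ds are placeholders for your training and validation datasets of string examples and binary labels):
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub

results = {}
for units in [8, 16, 32, 64]:
    model = keras.Sequential([
        hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                       output_shape=[50], input_shape=[], dtype=tf.string),
        keras.layers.Dense(units, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(train_ds, validation_data=val_ds, epochs=5, verbose=0)
    results[units] = history.history['val_accuracy'][-1]

# Pick the smallest width whose validation accuracy is acceptable,
# and watch the train/validation gap as the width grows.
print(results)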

Why can't my neural network learn to recover information from input?

I am trying to set up a neural network to learn to predict the consensus statistic given several input statistics (for multiple data points in each sample).
In other words my input data is of the shape (n_samples, n_statistics, n_points)
and output is (n_samples, 1_consensus_statistic, n_points).
The neural network is supposed to learn the single consensus output given the n_statistics about it; possibly the data in the other n_points might help recover the relationship, so it might be worth adding a fully connected layer. Note that one of the n_statistics resembles the 1_consensus_stat well in terms of magnitude (most of the n_points are about 3.1, but some values are small, 0.5 or 0.2, and the first and third of the n_statistics track that well); the other n_statistics have smaller values, but there should also be a nonlinear relationship between them and the consensus output, i.e. they should help the network learn the consensus statistic.
Instead the network learns to predict that all n_points are about 2.5-3.0 in magnitude. Somehow it manages to lose information that could be recovered even by a simple linear regression model. Here is my setup in Keras. What am I doing wrong?
from keras.models import Sequential
from keras.layers import LocallyConnected1D, AveragePooling1D, Dense

n_statistics = 6  # number of input statistics per point

model = Sequential()
# activation must be passed to the layer itself, not to model.add()
model.add(LocallyConnected1D(n_points, 3, activation='relu', kernel_initializer="uniform",
                             input_shape=(n_statistics, n_points)))
model.add(AveragePooling1D(strides=2))
model.add(LocallyConnected1D(n_points, 2, activation='relu', kernel_initializer="uniform"))
model.add(Dense(n_points, activation='relu', kernel_initializer="uniform"))
model.add(Dense(n_points, activation='softplus', kernel_initializer="uniform"))
model.compile(optimizer='adamax', loss='mean_squared_logarithmic_error', metrics=['mse'])