CNTK classification model classifies all 1

I have a CNTK model which takes in features related to clicks and other information and predicts whether something will be clicked in the future. Using the same features, a random forest works fine; however, CNTK classifies everything as 1. Why does this happen? Is there any parameter tuning needed? The features have varying scales.
My train action looks like this:
BrainScriptNetworkBuilder = [
    inputD = $inputD$
    labelD = $labelD$
    #hidden1 = $hidden1$
    model(features) = {
        w0 = ParameterTensor{(1 : 2), initValueScale=10}; b0 = ParameterTensor{1, initValueScale=10};
        h1 = w0*features + b0; #hidden layer
        z = Sigmoid (h1)
    }.z
    features = Input(inputD)
    labels = Input(labelD)
    z = model(features)
    #now that we have output, find error
    err = SquareError (labels, z)
    lr = Logistic (labels, z)
    output = z
    criterionNodes = (err)
    evaluationNodes = (err)
    outputNodes = (z)
]
SGD = [
    epochSize = 4 #learn
    minibatchSize = 1 #learn
    maxEpochs = 1000 #learn
    learningRatesPerSample = 1
    numMBsToShowResult = 10000
    firstMBsToShowResult = 10
]

In addition to what KeD said, a random forest does not care about the actual values of the features, only about their relative order.
Unlike trees, neural networks are sensitive to the actual values of the features (rather than just their relative order).
Your input might contain some features with very large values, so you should probably rescale them. There are different schemes for doing this. One possibility is to subtract the mean from each feature and scale it to [-1, 1], or divide by its standard deviation. Another possibility for positive features is a transformation such as f => log(1+f). You could also use a batch normalization layer.
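A minimal preprocessing sketch of these options, assuming the raw features are in a NumPy array X of shape (n_samples, n_features); the array below is only placeholder data:

import numpy as np

# Placeholder data with wildly different feature scales
X = np.random.rand(1000, 5) * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# Option 1: zero mean, unit variance (z-scoring)
X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# Option 2: rescale each feature to [-1, 1]
X_minmax = 2 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-8) - 1

# Option 3: for non-negative features, compress large values with log(1 + f)
X_log = np.log1p(X)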

Since your features are of varying scales, I would suggest you normalize them. You mentioned that CNTK classifies all input as 1; I am assuming this happens when you predict using the trained model. But what happens during training? Can you plot the training and test error on a graph (CNTK supports TensorBoard now)? That would give you some indication of whether your model is over-fitting. Moreover, as a side note, I would suggest increasing the model's learning capacity (most likely by increasing the number of hidden layers) to learn a better distribution of your data.

It seems the learning rate is too high; please try learningRatesPerSample = 0.001.

Related

Keras sequential model with altered loss does not learn

(I am new to stackexchange, but I believe I have correctly classified this question. If there's something off about my question, please inform me.)
I am trying to write a machine learning algorithm that learns to move an arm by contracting muscles. I have done my best to work out every possible bug I can think of but I have come to an impasse where every individual part of the program seems to run correctly, yet the algorithm does not learn. Fundamentally, all this model is doing is finding the inverse of a function by training a neural network to said function's inputs and outputs. The only thing that makes it even remotely nontrivial is that it uses an intermediary function when calculating the loss.
Working in Python with TensorFlow, we first define some constants and a function that converts deltoid and bicep muscle contractions to hand positions:
import tensorflow as tf

lH = 1.0
lU =1.0
lCD=0.1
lHD=0.1
lHB=0.9
lUB=0.1
lD_max = lCD+lHD
lD_min = abs(lCD-lHD)
lD_diff = lD_max-lD_min
lB_max = lHB+lUB
lB_min = abs(lHB-lUB)
lB_diff = lB_max-lB_min
max_muscle_contraction = 0.9
min_muscle_contraction = 0.1
lD_min_eff = lD_min + min_muscle_contraction*lD_diff
lD_max_eff = lD_min + max_muscle_contraction*lD_diff
lB_min_eff = lB_min + min_muscle_contraction*lB_diff
lB_max_eff = lB_min + max_muscle_contraction*lB_diff
def contractionToPosition(c):
    # Takes an (n, m, 2) tensor of contraction pairs and returns an (n, m, 2) tensor of the resulting positions.
    # Commonly takes (n, 2, 2) contraction tensors: a vector of initial and final vectors of deltoid-bicep pairs.
    cosD = (lCD**2 + lHD**2 - tf.math.square(c[:,:,0]))/(2*lCD*lHD)
    cosD = tf.math.minimum(cosD, 2*max_muscle_contraction-1)
    cosD = tf.math.maximum(cosD, 2*min_muscle_contraction-1) # Equivalent to limiting the contraction
    sinD = tf.math.sqrt(1-tf.math.square(cosD))
    cosB = (lHB**2 + lUB**2 - tf.math.square(c[:,:,1]))/(2*lHB*lUB)
    cosB = tf.math.minimum(cosB, 2*max_muscle_contraction-1)
    cosB = tf.math.maximum(cosB, 2*min_muscle_contraction-1) # Equivalent to limiting the contraction
    sinB = tf.math.sqrt(1-tf.math.square(cosB))
    px = lH*cosD + lU*sinB*sinD - lU*cosB*cosD
    py = -lH*sinD + lU*sinB*cosD + lU*cosB*sinD
    p = tf.stack([px, py], axis=-1) # px[i,j] is the [i,j]th px value, paired with the [i,j]th py value
    return p
Regardless of the validity of the above values and function, the algorithm should still be able to learn from it, because the data itself is synthetically generated with this same function. This function is also what the neural network is (approximately) trying to invert. Note that the neural network should take in the initial position and the planned final position and return a change in the muscle contractions. Calculating the difference between the true final positions and the planned final positions thus requires that we also know the initial contraction. Toward this, we generate the synthetic data that we will later train the algorithm on:
def generateContraction(samples): # Returns a random vector of contraction lengths
    cD = tf.zeros(samples)
    cD += tf.random.uniform(shape=cD.shape, minval=lD_min_eff, maxval=lD_max_eff)
    cB = tf.zeros(samples)
    cB += tf.random.uniform(shape=cB.shape, minval=lB_min_eff, maxval=lB_max_eff)
    return tf.transpose(tf.stack([cD, cB]))

def data(samples):
    ci = generateContraction(samples)
    cf = generateContraction(samples)
    c = tf.stack([ci, cf], axis=1)
    p = contractionToPosition(c)
    return p, c

sample_size = 10000
positions, contractions = data(sample_size)
initial_contractions = contractions[:,0]
final_contractions = contractions[:,1]
features = positions
labels = tf.subtract(final_contractions, initial_contractions)
initial_data = initial_contractions
I have meticulously tested the entire process of this data's construction and every step has proven accurate. We then load this raw data into a dataset for the learning algorithm,
def load_array(data_arrays, batch_size, is_train=True):
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset

batch_size = 64
data_iter = load_array((features, labels, initial_data), batch_size)
The network model doesn't need to be very complicated to tell whether the learning algorithm works, since there is no statistical error in the data. We also intend this model to act like the neural network found in the cerebellum of mammals. Specifically, this implies that for simple motions it is a shallow sequential neural network with ReLU activation. As such, we construct it fairly simply:
net = tf.keras.Sequential([
    tf.keras.Input(shape=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=64,
                          activation='relu',
                          kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1)),
    tf.keras.layers.Dense(units=2,
                          activation='relu',
                          kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1))
])
Finally, we write our learning algorithm based on the TensorFlow documentation, https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch . Note that we are optimizing the squared distance between the planned and actual final positions rather than optimizing the difference in the contractions. This is the only thing that makes this ever so slightly nontrivial.
loss = tf.keras.losses.MeanSquaredError()

def train(net, train_iter, loss, epochs, lr):
    trainer = tf.keras.optimizers.Adam(learning_rate=lr)
    params = net.trainable_variables
    for epoch in range(epochs):
        epochError = 0
        for X, y, I in train_iter:
            with tf.GradientTape() as g:
                g.watch(params)
                P_hat = contractionToPosition(tf.reshape(net(X, training=True) + I, (-1, 1, 2)))
                P = contractionToPosition(tf.reshape(y + I, (-1, 1, 2))) # We have to reshape because of our function contractionToPosition
                l = loss(P, P_hat)
                epochError += l
                error = l
            grads = g.gradient(l, params)
            trainer.apply_gradients(zip(grads, params))
        print(f'epoch {epoch + 1}, '
              f'loss: {epochError}')

train(net, data_iter, loss, 5, 0.05)
The result of all this, though, is a complete lack of learning. Usually the epoch loss is about 109 (which is expected for no learning) with no significant change in said loss (it usually fluctuates within +/-0.7). If anything is at fault, I would suspect this final code snippet, specifically the gradient tape. I have probed every aspect of the gradient tape, however, and everything seems to be functioning correctly. Overall, I cannot think of a part of my code I have not dissected at this point, so I am at a total loss here.
Any and all help is deeply appreciated!

CNN + LSTM model for images performs poorly on validation data set

My training and loss curves look like the ones below, and yes, similar graphs have received comments like "classic overfitting", and I get it.
My model looks like this:
input_shape_0 = keras.Input(shape=(3,100, 100, 1), name="img3")
model = tf.keras.layers.TimeDistributed(Conv2D(8, 3, activation="relu"))(input_shape_0)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(16, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(32, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(Flatten())(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.4))(model)
model = LSTM(16, kernel_regularizer=tf.keras.regularizers.l2(0.007))(model)
# model = Dense(100, activation="relu")(model)
# model = Dense(200, activation="relu",kernel_regularizer=tf.keras.regularizers.l2(0.001))(model)
model = Dense(60, activation="relu")(model)
# model = Flatten()(model)
model = Dropout(0.15)(model)
out = Dense(30, activation='softmax')(model)
model = keras.Model(inputs=input_shape_0, outputs = out, name="mergedModel")
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr

opt = tf.keras.optimizers.RMSprop()
lr_metric = get_lr_metric(opt)
# merged.compile(loss='sparse_categorical_crossentropy',
#                optimizer='adam', metrics=['accuracy'])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt, metrics=['accuracy', lr_metric])
model.summary()
In the above model building code, please consider the commented lines as some of the approaches I have tried so far.
I have followed the suggestions given as answers and comments to this kind of question and none seems to be working for me. Maybe I am missing something really important?
Things that I have tried:
Dropouts at different places and different amounts.
Played with including and removing dense layers and varying their number of units.
Tried different numbers of units in the LSTM layer (started from as low as 1; at 16, I now have the best performance).
Came across weight regularization techniques and tried to implement them as shown in the code above, placing the regularizer at different layers (I would like to know how to decide where to apply it instead of simple trial and error, which is what I did and it seems wrong).
Implemented a learning rate scheduler with which I reduce the learning rate after a certain number of epochs as training progresses.
Tried two LSTM layers with the first one having return_sequences=True.
After all these, I still cannot overcome the overfitting problem.
My data set is properly shuffled and divided in a train/val ratio of 80/20.
Data augmentation is one more thing I found commonly suggested and have yet to try, but first I want to see whether I am making some mistake so far that I can correct, so I can avoid diving into data augmentation for now. My data set has the below sizes:
Training images: 6780
Validation images: 1484
The numbers shown are samples, and each sample has 3 images. So basically, I input 3 images at once as one sample to my time-distributed CNN, which is then followed by the other layers as shown in the model description. Correspondingly, my training images number 6780 * 3 and my validation images 1484 * 3. Each image is 100 * 100 with a single channel.
I am using RMSprop as the optimizer, which performed better than Adam in my testing.
UPDATE
I tried some different architectures and some regularizations and dropouts at different places, and I am now able to achieve a val_acc of 59%. Below is the new model.
# kernel_regularizer=tf.keras.regularizers.l2(0.004)
# kernel_constraint=max_norm(3)
model = tf.keras.layers.TimeDistributed(Conv2D(32, 3, activation="relu"))(input_shape_0)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(64, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Conv2D(128, 3, activation="relu"))(model)
model = tf.keras.layers.TimeDistributed(MaxPooling2D(2))(model)
model = tf.keras.layers.TimeDistributed(Dropout(0.3))(model)
model = tf.keras.layers.TimeDistributed(GlobalAveragePooling2D())(model)
model = LSTM(128, return_sequences=True,kernel_regularizer=tf.keras.regularizers.l2(0.040))(model)
model = Dropout(0.60)(model)
model = LSTM(128, return_sequences=False)(model)
model = Dropout(0.50)(model)
out = Dense(30, activation='softmax')(model)
Try performing data augmentation as a preprocessing step; a lack of data samples can lead to such curves. You can also try k-fold cross validation, as sketched below.
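A rough sketch of the k-fold idea for this setup, assuming X has shape (N, 3, 100, 100, 1), y has shape (N,), and build_model() is a hypothetical factory that rebuilds and compiles the CNN+LSTM architecture from the question for each fold:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
val_accuracies = []
for train_idx, val_idx in kf.split(X):
    model = build_model()   # hypothetical: returns a freshly compiled model
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=30, batch_size=32, verbose=0)
    # index 1 assumes 'accuracy' is the first compiled metric
    val_accuracies.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
print("mean validation accuracy:", np.mean(val_accuracies))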
There are many ways to prevent overfitting, according to the papers below:
Dropout layers (randomly disabling neurons). https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
Input noise (e.g. random Gaussian noise on the images). https://arxiv.org/pdf/2010.07532.pdf
Random Data Augmentations (e.g. Rotating, Shifting, Scaling, etc.).
https://arxiv.org/pdf/1906.11052.pdf
Adjusting Number of Layers & Units.
https://clgiles.ist.psu.edu/papers/UMD-CS-TR-3617.what.size.neural.net.to.use.pdf
Regularization Functions (e.g. L1, L2, etc)
https://www.researchgate.net/publication/329150256_A_Comparison_of_Regularization_Techniques_in_Deep_Neural_Networks
Early stopping: if you notice that for N successive epochs your model's training loss keeps decreasing but it performs poorly on the validation data set, that is a good sign to stop the training.
Shuffling the training data or k-fold cross validation is also a common way of dealing with overfitting.
I found this great repository, which contains examples of how to implement data augmentations:
https://github.com/kochlisGit/random-data-augmentations
Also, this repository here seems to have examples of CNNs that implement most of the above methods:
https://github.com/kochlisGit/Tensorflow-State-of-the-Art-Neural-Networks
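For the sequence data described in the question, here is a minimal augmentation sketch, assuming an unbatched tf.data pipeline that yields (3, 100, 100, 1) frame stacks with integer labels; the specific transforms are illustrative:

import tensorflow as tf

def augment_sequence(frames, label):
    # frames: (3, 100, 100, 1). The same flip decision is applied to all three
    # frames so their temporal relationship is preserved.
    flip = tf.random.uniform(()) > 0.5
    frames = tf.cond(flip,
                     lambda: tf.image.flip_left_right(frames),
                     lambda: frames)
    # One brightness offset for the whole sample
    frames = tf.image.random_brightness(frames, max_delta=0.1)
    return frames, label

# Assumed usage on an unbatched tf.data.Dataset of (frames, label) pairs:
# train_ds = train_ds.map(augment_sequence, num_parallel_calls=tf.data.AUTOTUNE).batch(32)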
The goal should be to get the model to predict correctly irrespective of the order in which the 3 images in the sample are arranged.
If the order of the images in each sample is not important for the training, I think your model does the inverse: the TimeDistributed layers followed by an LSTM take the order of the three images into account. As a solution, you can primarily add images by reordering the images within each sample (augmented data). Secondly, try to treat the three images as one image with three channels and remove the TimeDistributed layers (I'm not sure that the three channels are more efficient, but you can give it a try); see the sketch below.
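A minimal sketch of that second suggestion (three frames stacked as three channels, no TimeDistributed/LSTM), assuming the data is reshaped offline from (N, 3, 100, 100, 1) to (N, 100, 100, 3); layer sizes are illustrative, not tuned:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Assumed offline reshape: (N, 3, 100, 100, 1) -> (N, 100, 100, 3)
# X_channels = np.transpose(X, (0, 2, 3, 1, 4)).reshape(-1, 100, 100, 3)

inputs = keras.Input(shape=(100, 100, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(30, activation="softmax")(x)

single_stream_model = keras.Model(inputs, outputs)
single_stream_model.compile(optimizer="rmsprop",
                            loss="sparse_categorical_crossentropy",
                            metrics=["accuracy"])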

How to adjust Model for rare binary outcome with Tensorflow or GBM

I'm currently working on data with rare binary outcome, i.e. the response vector contains mostly 0 and only a few 1 (approximately 1.5% ones). I've got about 20 continuous explanatory variables. I tried to train models using GBM, Random Forests, TensorFlow with Keras backend.
I observed a special behavior of the models, regardless which method I used:
The accuracy is high (~98%), but the model predicts a probability of ~98.5% for class "0" and ~1.5% for class "1" for all observations.
How can I prevent this behavior?
I'm using RStudio. For example, a TF model with Keras would be:
model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(20)) %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 2, activation = "sigmoid")
parallel_model <- multi_gpu_model(model, gpus=2)
parallel_model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "binary_accuracy")
history <- parallel_model %>% fit(
  x_train, y_train,
  batch_size = 64,
  epochs = 100,
  class_weight = list("0"=1, "1"=70),
  verbose = 1,
  validation_split = 0.2
)
But my observation is not limited to TF, which makes my question more general. I'm not asking for specific adjustments to the model above; rather, I'd like to discuss at what point all outcomes end up assigned the same probability.
I can guess, the issue is connected to the loss-function.
I know there is no way to use AUC as a loss function, since it is not differentiable. If I evaluate the models by AUC on unseen data, the result is no better than random guessing.
I don't mind answers with code in Python, since this isn't a coding problem but rather a question about general behavior and algorithms.
When your problem has unbalanced classes, I suggest using SMOTE (on the training data only! Never apply SMOTE to your test data!) before training the model.
For example:
from imblearn.over_sampling import SMOTE

# Note: newer imbalanced-learn versions use sampling_strategy=... and fit_resample(...)
X_trn_balanced, Y_trn_balanced = SMOTE(random_state=1, ratio=1).fit_sample(X_trn, Y_trn)

# next, fit the model with the balanced data
model.fit(X_trn_balanced, Y_trn_balanced)
In my (not so big) experience with AUC problems and rare positives, I use models with a single output class (not two): the output is either "result is positive (1)" or "result is negative (0)".
Metrics like accuracy are useless for these problems, you should use AUC based metrics with big batch sizes.
For these problems, it doesn't matter whether the outcome probabilities are too little, as long as there is a difference between them. (Forests, GBM, etc. will indeed output these little values, but this is not a problem)
For neural networks, you can try to use class weights to increase the output probabilities. But notice that if you split the result into two separate classes (when only one class should be positive), it doesn't matter if you use weights, because:
For the first class, low weights: predict all ones is good
For the second class, high weights: predict all zeros is good (weighted to very good)
So, as an initial solution, you can:
Use a 'softmax' activation (to guarantee your model will have only one correct output) and a 'categorical_crossentropy' loss.
(Or, preferably) Use a model with only one output class and keep 'sigmoid' with 'binary_crossentropy'.
I always work with the preferable option above. In this case, if you use batch sizes that are big enough to contain one or two positive examples (batch size around 100 for you), weights may even be discarded. If the batch sizes are too small and many batches contain no positive results, you may get too many weight updates towards plain zeros, which is bad.
You may also resample your data and, for instance, multiply by 10 the number of positive examples, so your batches contain more positives and training becomes easier.
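A hedged Keras sketch of the preferred single-output setup described above, assuming 20 continuous features as in the question; the layer sizes, the AUC metric, and the example class weights are illustrative:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(20,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # single positive-class probability
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])

# With ~1.5% positives, a batch size of ~128 usually contains a couple of positive
# examples; class weights can still be passed if desired:
# model.fit(x_train, y_train, batch_size=128, epochs=100,
#           class_weight={0: 1.0, 1: 70.0}, validation_split=0.2)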
Example of AUC metric to determine when training should end:
# in Python - considering outputs with only one class
# (imports added for completeness; the original snippet assumed tf and the Keras backend K were already imported)
import tensorflow as tf
from tensorflow.keras import backend as K

def aucMetric(true, pred):
    true = K.flatten(true)
    pred = K.flatten(pred)
    totalCount = K.shape(true)[0]
    values, indices = tf.nn.top_k(pred, k=totalCount)
    sortedTrue = K.gather(true, indices)
    tpCurve = K.cumsum(sortedTrue)
    negatives = 1 - sortedTrue
    auc = K.sum(tpCurve * negatives)
    totalCount = K.cast(totalCount, K.floatx())
    positiveCount = K.sum(true)
    negativeCount = totalCount - positiveCount
    totalArea = positiveCount * negativeCount
    return auc / totalArea
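A possible way to wire this metric into training and stop when validation AUC stops improving (the model and data names are assumed):

# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[aucMetric])
# early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_aucMetric', mode='max',
#                                               patience=10, restore_best_weights=True)
# model.fit(x_train, y_train, batch_size=128, epochs=100,
#           validation_split=0.2, callbacks=[early_stop])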

Tensorflow variable value different on same training set

I built a neural network model in Python 3.6.
I'm trying to predict the price of condominiums based on attributes such as lat, lng, distance to public transport, year built, and so on.
I use the same training set for the model; however, each time I print out the values of the variables in the hidden layer, they are different.
testing_df_w_price = testing_df.copy()
testing_df.drop('PricePerSq',axis = 1, inplace = True)
training_df, testing_df = training_df.drop(['POID'], axis=1), testing_df.drop(['POID'], axis=1)
col_train = list(training_df.columns)
col_train_bis = list(training_df.columns)
col_train_bis.remove('PricePerSq')
mat_train = np.matrix(training_df)
mat_test = np.matrix(testing_df)
mat_new = np.matrix(training_df.drop('PricePerSq', axis = 1))
mat_y = np.array(training_df.PricePerSq).reshape((training_df.shape[0],1))
prepro_y = MinMaxScaler()
prepro_y.fit(mat_y)
prepro = MinMaxScaler()
prepro.fit(mat_train)
prepro_test = MinMaxScaler()
prepro_test.fit(mat_new)
train = pd.DataFrame(prepro.transform(mat_train),columns = col_train)
test = pd.DataFrame(prepro_test.transform(mat_test),columns = col_train_bis)
# List of features
COLUMNS = col_train
FEATURES = col_train_bis
LABEL = "PricePerSq"
# Columns for tensorflow
feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]
# Training set and Prediction set with the features to predict
training_set = train[COLUMNS]
prediction_set = train.PricePerSq
# Train and Test
x_train, x_test, y_train, y_test = train_test_split(training_set[FEATURES] , prediction_set, test_size=0.25, random_state=42)
y_train = pd.DataFrame(y_train, columns = [LABEL])
training_set = pd.DataFrame(x_train, columns = FEATURES).merge(y_train, left_index = True, right_index = True) # good
# Training for submission
training_sub = training_set[col_train] # good
# Same thing but for the test set
y_test = pd.DataFrame(y_test, columns = [LABEL])
testing_set = pd.DataFrame(x_test, columns = FEATURES).merge(y_test, left_index = True, right_index = True) # good
# Model
# tf.logging.set_verbosity(tf.logging.INFO)
tf.logging.set_verbosity(tf.logging.ERROR)
regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
hidden_units=[int(len(col_train)+1/2)],
model_dir = "/tmp/tf_model")
for k in regressor.get_variable_names():
    print(k)
    print(regressor.get_variable_value(k))
Example of hidden layer value difference
The variables are initialized with random values when you construct the network. Since there are likely to be many local minima of your loss function, the fitted parameters will change every time you run the network.
In addition, even if your loss function were convex (with only one, global, minimum), the ordering of the hidden units is somewhat arbitrary. If, for example, you fit a network with 1 hidden layer of 2 hidden nodes, the parameters of node 1 in your first run might correspond to the parameters of node 2 in the next, and vice versa.
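A tiny NumPy illustration of that permutation symmetry, using a hypothetical 1-2-1 network with tanh activations: swapping the two hidden units, together with the matching output weights, leaves the network function unchanged.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=(2, 1))   # hidden layer (2 units)
W2 = rng.normal(size=(1, 2))                                 # output layer
x = rng.normal(size=(1, 5))                                  # 5 scalar inputs

def net(W1, b1, W2, x):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h

perm = [1, 0]  # swap hidden unit 1 and hidden unit 2
print(np.allclose(net(W1, b1, W2, x),
                  net(W1[perm], b1[perm], W2[:, perm], x)))  # True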
In machine learning, the current "knowledge state" of your neural network is expressed through the weights of the connections in your graph. Generally speaking, your whole network represents a high-dimensional function, and the task of learning means finding the global optimum of this function. The learning process changes the weights of the connections in your neural network according to the specified optimizer, which in your case is the default of tf.contrib.learn.DNNRegressor (the Adagrad optimizer). But there are other parameters that affect the final "knowledge state" of your model, for instance (and I make no claim of completeness in the following list):
The initial learning rate in your model
The learning rate schedule that adapts the learning rate over time
Any regularization and early stopping that may be defined
The initialization strategy used for weight initialization (e.g. He-initialization or random initialization)
Plus (and this is maybe the most important thing for understanding why your weights differ after each retraining), you have to consider that you use a stochastic gradient descent algorithm during training. This means that for each optimization step the algorithm chooses a random subset of your whole training set. Therefore, one optimization step doesn't always point toward the global optimum of your high-dimensional function, but toward the steepest descent that can be computed with the randomly chosen subset. Because of this stochastic component in the optimization process, you will likely never reach the global optimum for your task. But with carefully chosen hyperparameters (and of course good data) you will reach a good approximate solution, which lies within a local optimum of the function and which can change every time you retrain the model.
So to conclude, don't look at the weights to judge the performance of your model, because they will be slightly different each time. Use a performance measure like the accuracy computed in a cross validation or a confusion matrix computed on the test set.
P.S. tf.contrib.learn.DNNRegressor is deprecated in the newest TensorFlow release, as you can see in the docs. Use tf.estimator.DNNRegressor instead, as sketched below.
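A hedged sketch of that replacement, reusing the FEATURES list of column names from the question; the hidden-unit count is illustrative:

import tensorflow as tf

feature_cols = [tf.feature_column.numeric_column(k) for k in FEATURES]
regressor = tf.estimator.DNNRegressor(
    feature_columns=feature_cols,
    hidden_units=[10],              # illustrative layer size
    model_dir="/tmp/tf_model")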

Neural network only converges when data cloud is close to 0

I am new to tensorflow and am learning the basics at the moment so please bear with me.
My problem concerns strange non-convergent behaviour of neural networks when presented with the supposedly simple task of finding a regression function for a small training set consisting only of m = 100 data points {(x_1, y_1), (x_2, y_2),...,(x_100, y_100)}, where x_i and y_i are real numbers.
I first constructed a function that automatically generates a computational graph corresponding to a classical fully connected feedforward neural network:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import math

def neural_network_constructor(arch_list = [1,3,3,1],
                               act_func = tf.nn.sigmoid,
                               w_initializer = tf.contrib.layers.xavier_initializer(),
                               b_initializer = tf.zeros_initializer(),
                               loss_function = tf.losses.mean_squared_error,
                               training_method = tf.train.GradientDescentOptimizer(0.5)):
    n_input = arch_list[0]
    n_output = arch_list[-1]
    X = tf.placeholder(dtype = tf.float32, shape = [None, n_input])
    layer = tf.contrib.layers.fully_connected(
        inputs = X,
        num_outputs = arch_list[1],
        activation_fn = act_func,
        weights_initializer = w_initializer,
        biases_initializer = b_initializer)
    for N in arch_list[2:-1]:
        layer = tf.contrib.layers.fully_connected(
            inputs = layer,
            num_outputs = N,
            activation_fn = act_func,
            weights_initializer = w_initializer,
            biases_initializer = b_initializer)
    Phi = tf.contrib.layers.fully_connected(
        inputs = layer,
        num_outputs = n_output,
        activation_fn = tf.identity,
        weights_initializer = w_initializer,
        biases_initializer = b_initializer)
    Y = tf.placeholder(tf.float32, [None, n_output])
    loss = loss_function(Y, Phi)
    train_step = training_method.minimize(loss)
    return [X, Phi, Y, train_step]
With the above default values for the arguments, this function would construct a computational graph corresponding to a neural network with 1 input neuron, 2 hidden layers with 3 neurons each and 1 output neuron. The activation function is per default the sigmoid function. X corresponds to the input tensor, Y to the labels of the training data and Phi to the feedforward output of the neural network. The operation train_step performs one gradient-descent step when executed in the session environment.
So far, so good. If I now test a particular neural network (constructed with this function and the exact default values for the arguments given above) by making it learn a simple regression function for artificial data extracted from a sinewave, strange things happen:
Before training, the network output seems to be a flat line. After 100,000 training iterations, it manages to partially learn the function, but only the part which is closer to 0. After that, it becomes flat again. Further training does not decrease the loss function anymore.
This gets even stranger when I take the exact same data set but shift all x-values by adding 500:
Here, the network completely refuses to learn. I cannot understand why this is happening. I have tried changing the architecture of the network and its learning rate, but have observed similar effects: the closer the x-values of the data cloud are to the origin, the more easily the network can learn. Beyond a certain distance from the origin, learning stops completely. Changing the activation function from sigmoid to ReLU has only made things worse; there, the network tends to just converge to the average, no matter where the data cloud is.
Is there something wrong with my implementation of the neural-network constructor? Or does this have something to do with the initialization values? I have tried to get a deeper understanding of this problem for quite a while now and would greatly appreciate some advice. What could be the cause of this? All thoughts on why this behaviour is occurring are very much welcome!
Thanks,
Joker