Keras sequential model with altered loss does not learn - tensorflow

I am trying to write a machine learning algorithm that learns to move an arm by contracting muscles. I have done my best to work out every possible bug I can think of but I have come to an impasse where every individual part of the program seems to run correctly, yet the algorithm does not learn. Fundamentally, all this model is doing is finding the inverse of a function by training a neural network to said function's inputs and outputs. The only thing that makes it even remotely nontrivial is that it uses an intermediary function when calculating the loss.
Working in python with TensorFlow, we first define some constants and a function that converts deltoid and bicep muscle contractions to hand positions,
lH =1.0
lU =1.0
lCD=0.1
lHD=0.1
lHB=0.9
lUB=0.1
lD_max = lCD+lHD
lD_min = abs(lCD-lHD)
lD_diff = lD_max-lD_min
lB_max = lHB+lUB
lB_min = abs(lHB-lUB)
lB_diff = lB_max-lB_min
max_muscle_contraction = 0.9
min_muscle_contraction = 0.1
lD_min_eff = lD_min + min_muscle_contraction*lD_diff
lD_max_eff = lD_min + max_muscle_contraction*lD_diff
lB_min_eff = lB_min + min_muscle_contraction*lB_diff
lB_max_eff = lB_min + max_muscle_contraction*lB_diff
def contractionToPosition(c):
    # Takes a (n, m, 2) tensor of contraction pairs and returns a (n, m, 2) tensor of the resulting positions
    # Commonly takes (n, 2, 2) contraction tensors: a vector of initial and final vectors of deltoid-bicep pairs.
    cosD = (lCD**2+lHD**2 - tf.math.square(c[:,:,0]))/(2*lCD*lHD)
    cosD = tf.math.minimum(cosD, 2*max_muscle_contraction-1)
    cosD = tf.math.maximum(cosD, 2*min_muscle_contraction-1) # Equivalent to limiting the contraction
    sinD = tf.math.sqrt(1-tf.math.square(cosD))
    cosB = (lHB**2+lUB**2 - tf.math.square(c[:,:,1]))/(2*lHB*lUB)
    cosB = tf.math.minimum(cosB, 2*max_muscle_contraction-1)
    cosB = tf.math.maximum(cosB, 2*min_muscle_contraction-1) # Equivalent to limiting the contraction
    sinB = tf.math.sqrt(1-tf.math.square(cosB))
    px = lH*cosD + lU*sinB*sinD - lU*cosB*cosD
    py = -lH*sinD + lU*sinB*cosD + lU*cosB*sinD
    p = tf.stack([px, py], axis=-1) # px[i,j] is the [i,j]th px value, paired with the [i,j]th py value
    return p
Regardless of the validity of the above values and function, the algorithm should still be able to learn from it because the data itself is synthetically generated with this same function. This function is also what the neural network is (approximately) trying to invert. Note that the neural network should take in the initial position and the planned final position, returning a change in the muscle contractions. Calculating the difference between the true final positions and the planned final positions will thus require that we also know the initial contraction. Toward this, we generate the synthetic data that we will later train the algorithm on,
def generateContraction(samples): # Returns a random vector of contraction lengths
    cD = tf.zeros(samples)
    cD += tf.random.uniform(shape=cD.shape, minval=lD_min_eff, maxval=lD_max_eff)
    cB = tf.zeros(samples)
    cB += tf.random.uniform(shape=cB.shape, minval=lB_min_eff, maxval=lB_max_eff)
    return tf.transpose(tf.stack([cD,cB]))

def data(samples):
    ci = generateContraction(samples)
    cf = generateContraction(samples)
    c = tf.stack([ci,cf], axis=1)
    p = contractionToPosition(c)
    return p, c
sample_size = 10000
positions, contractions = data(sample_size)
initial_contractions = contractions[:,0]
final_contractions = contractions[:,1]
features = positions
labels = tf.subtract(final_contractions, initial_contractions)
initial_data = initial_contractions
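As a quick sanity check on the shapes (a sketch of what I mean by testing the data construction, not my full test code):
print(features.shape)      # expected (10000, 2, 2): initial/final position pairs
print(labels.shape)        # expected (10000, 2): change in deltoid/bicep contraction
print(initial_data.shape)  # expected (10000, 2): initial deltoid/bicep contraction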
I have meticulously tested the entire process of this data's construction and every step has proven accurate. We then load this raw data into a dataset for the learning algorithm,
def load_array(data_arrays, batch_size, is_train=True):
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset
batch_size = 64
data_iter = load_array((features, labels, initial_data), batch_size)
The network model doesn't need to be very complicated to tell whether the learning algorithm works, since there is no statistical error in the data. We also intend this model to act like the neural network found in the cerebellum of mammals; specifically, this implies that for simple motions it is a shallow sequential neural network with ReLU activation. As such, we construct it fairly simply,
net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=64,
                          activation='relu',
                          kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1)),
    tf.keras.layers.Dense(units=2,
                          activation='relu',
                          kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1))
])
Finally, we write our learning algorithm based on the TensorFlow documentation, https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch . Note that we are optimizing the squared distance between the planned and actual final positions rather than optimizing the difference in the contractions directly. This is the only thing that makes this ever so slightly nontrivial.
loss = tf.keras.losses.MeanSquaredError()

def train(net, train_iter, loss, epochs, lr):
    trainer = tf.keras.optimizers.Adam(learning_rate=lr)
    params = net.trainable_variables
    for epoch in range(epochs):
        epochError = 0
        for X, y, I in train_iter:
            with tf.GradientTape() as g:
                g.watch(params)
                P_hat = contractionToPosition(tf.reshape(net(X, training=True) + I, (-1,1,2)))
                P = contractionToPosition(tf.reshape(y + I, (-1,1,2))) # We have to reshape because of our function contractionToPosition
                l = loss(P, P_hat)
            epochError += l
            error = l
            grads = g.gradient(l, params)
            trainer.apply_gradients(zip(grads, params))
        print(f'epoch {epoch + 1}, '
              f'loss: {epochError}')

train(net, data_iter, loss, 5, 0.05)
The result of all this, though, is a complete lack of learning. Usually the epoch loss is about 109 (which is expected for no learning) with no significant change in said loss (it usually fluctuates within +/- 0.7). If anything is at fault, I would suspect this final code snippet, specifically the gradient tape. I have probed every aspect of the gradient tape, however, and everything seems to be functioning correctly. Overall, I cannot think of a part of my code I have not dissected at this point, so I am at a total loss here.
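For concreteness, the kind of gradient-tape probe I mean looks roughly like this (a sketch, not my exact notebook code; it reuses data_iter, net, loss, and contractionToPosition from above):
# Rough sketch of a gradient sanity check on a single batch.
for X, y, I in data_iter.take(1):
    with tf.GradientTape() as g:
        P_hat = contractionToPosition(tf.reshape(net(X, training=True) + I, (-1, 1, 2)))
        P = contractionToPosition(tf.reshape(y + I, (-1, 1, 2)))
        l = loss(P, P_hat)
    grads = g.gradient(l, net.trainable_variables)
    for v, grad in zip(net.trainable_variables, grads):
        # A None or all-zero gradient here would point to a broken connection in the tape.
        print(v.name, None if grad is None else float(tf.norm(grad)))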
Any and all help is deeply appreciated!

Related

Tensorflow variable value different on same training set

I built a neural network model in Python 3.6.
I'm trying to predict the price of condominiums based on attributes such as lat, lng, distance to public transport, year built, and so on.
I use the same training set for the model. However, each time I run it, the values I print out for the variables in the hidden layer are different.
testing_df_w_price = testing_df.copy()
testing_df.drop('PricePerSq',axis = 1, inplace = True)
training_df, testing_df = training_df.drop(['POID'], axis=1), testing_df.drop(['POID'], axis=1)
col_train = list(training_df.columns)
col_train_bis = list(training_df.columns)
col_train_bis.remove('PricePerSq')
mat_train = np.matrix(training_df)
mat_test = np.matrix(testing_df)
mat_new = np.matrix(training_df.drop('PricePerSq', axis = 1))
mat_y = np.array(training_df.PricePerSq).reshape((training_df.shape[0],1))
prepro_y = MinMaxScaler()
prepro_y.fit(mat_y)
prepro = MinMaxScaler()
prepro.fit(mat_train)
prepro_test = MinMaxScaler()
prepro_test.fit(mat_new)
train = pd.DataFrame(prepro.transform(mat_train),columns = col_train)
test = pd.DataFrame(prepro_test.transform(mat_test),columns = col_train_bis)
# List of features
COLUMNS = col_train
FEATURES = col_train_bis
LABEL = "PricePerSq"
# Columns for tensorflow
feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]
# Training set and Prediction set with the features to predict
training_set = train[COLUMNS]
prediction_set = train.PricePerSq
# Train and Test
x_train, x_test, y_train, y_test = train_test_split(training_set[FEATURES] , prediction_set, test_size=0.25, random_state=42)
y_train = pd.DataFrame(y_train, columns = [LABEL])
training_set = pd.DataFrame(x_train, columns = FEATURES).merge(y_train, left_index = True, right_index = True) # good
# Training for submission
training_sub = training_set[col_train] # good
# Same thing but for the test set
y_test = pd.DataFrame(y_test, columns = [LABEL])
testing_set = pd.DataFrame(x_test, columns = FEATURES).merge(y_test, left_index = True, right_index = True) # good
# Model
# tf.logging.set_verbosity(tf.logging.INFO)
tf.logging.set_verbosity(tf.logging.ERROR)
regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
                                          hidden_units=[int(len(col_train)+1/2)],
                                          model_dir="/tmp/tf_model")
for k in regressor.get_variable_names():
    print(k)
    print(regressor.get_variable_value(k))
Example of hidden layer value difference
The variables are initialized with random values when you construct the network. Since there's likely to be many local minima of your loss function, the fitted parameters will change every time you run the network.
In addition, even if your loss function is convex (only one (global) minimum), the order of the variables is somewhat arbitrary. If, for example, you fit a network with 1 hidden layer with 2 hidden nodes, the parameters of node 1 in your first run might correspond to the parameters of node 2 in the next run, and vice versa.
In machine learning, the current "knowledge state" of your neural network is expressed through the weights of the connections in your graph. Generally speaking, your whole network represents a high-dimensional function, and the task of learning means finding the global optimum of this function. The learning process changes the weights of the connections in your neural network according to the specified optimizer, which in your case is the default of tf.contrib.learn.DNNRegressor (the Adagrad optimizer). But there are other parameters that affect the final "knowledge state" of your model, for instance (and I make no claim of completeness in the following list):
The initial learning rate in your model
The learning rate schedule that adapts the learning rate over time
Possibly some form of regularization and early stopping
The initialization strategy used for weight initialization (e.g. He-initialization or random initialization)
Plus (and this is maybe the most important thing to understand why your weights are different after each retraining), you have to consider that you use a stochastic gradient descent algorithm during training. This means that for each optimization step the algorithm chooses a random subset of your whole training set. Therefore, one optimization step doesn't always point to the global optimum of your high-dimensional function, but to the steepest descent that can be computed with the randomly chosen subset. Because of this stochastic component in the optimization process, you will likely never reach the global optimum for your task. But with carefully chosen hyperparameters (and of course good data) you will reach a good approximate solution, which lies within a local optimum of the function and which can change every time you retrain the model.
So, to conclude, don't look at the weights to judge the performance of your model, because they will be slightly different each time. Use a performance measure like accuracy computed via cross-validation or a confusion matrix computed on the test set.
P.S. tf.contrib.learn.DNNRegressor is a deprecated function in the newest TensorFlow release, as you can see in the docs. Use tf.estimator.DNNRegressor instead.
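If you want run-to-run reproducibility purely for debugging, a minimal sketch (assuming TF 1.x, as in your code) is to fix the random seeds before building the estimator:
import numpy as np
import tensorflow as tf

# Fixing the seeds makes the random weight initialization (and hence the printed
# variable values) reproducible across runs; it does not change model quality.
np.random.seed(42)
tf.set_random_seed(42)  # in TF 2.x this would be tf.random.set_seed(42)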

Pytorch how to get the gradient of loss function twice

Here is what I'm trying to implement:
We calculate loss based on F(X), as usual. But we also define "adversarial loss" which is a loss based on F(X + e). e is defined as dF(X)/dX multiplied by some constant. Both loss and adversarial loss are backpropagated for the total loss.
In tensorflow, this part (getting dF(X)/dX) can be coded like below:
grad, = tf.gradients( loss, X )
grad = tf.stop_gradient(grad)
e = constant * grad
Below is my pytorch code:
class DocReaderModel(object):
    def __init__(self, embedding=None, state_dict=None):
        self.train_loss = AverageMeter()
        self.embedding = embedding
        self.network = DNetwork(opt, embedding)
        self.optimizer = optim.SGD(parameters)

    def adversarial_loss(self, batch, loss, embedding, y):
        self.optimizer.zero_grad()
        loss.backward(retain_graph=True)
        grad = embedding.grad
        grad.detach_()
        perturb = F.normalize(grad, p=2) * 0.5
        self.optimizer.zero_grad()
        adv_embedding = embedding + perturb
        network_temp = DNetwork(self.opt, adv_embedding) # This is how to get F(X)
        network_temp.training = False
        network_temp.cuda()
        start, end, _ = network_temp(batch) # This is how to get F(X)
        del network_temp # I even deleted this instance.
        return F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])

    def update(self, batch):
        self.network.train()
        start, end, pred = self.network(batch)
        loss = F.cross_entropy(start, y[0]) + F.cross_entropy(end, y[1])
        loss_adv = self.adversarial_loss(batch, loss, self.network.lexicon_encoder.embedding.weight, y)
        loss_total = loss + loss_adv
        self.optimizer.zero_grad()
        loss_total.backward()
        self.optimizer.step()
I have a few questions:
1) I substituted tf.stop_gradient with grad.detach_(). Is this correct?
2) I was getting "RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time." so I added retain_graph=True to loss.backward. That specific error went away.
However, now I'm getting a memory error after a few epochs (RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.cu:58). I suspect I'm unnecessarily retaining the graph.
Can someone let me know pytorch's best practice on this? Any hint / even short comment will be highly appreciated.
I think you are trying to implement a generative adversarial network (GAN), but from the code I can't quite follow what you are trying to achieve, as there are a few pieces missing for a GAN to work. I can see there's a discriminator network module, DNetwork, but the generator network module is missing.
If I had to guess, when you say 'loss function twice', I assume you mean you have one loss function for the discriminator net and another for the generator net. If that's the case, let me share how I would implement a basic GAN model.
As an example, let's take a look at this Wasserstein GAN Jupyter notebook
I'll skip the less important bits and zoom into the important ones here:
First, import PyTorch libraries and set up
# Set up batch size, image size, and size of noise vector:
bs, sz, nz = 64, 64, 100 # nz is the size of the latent z vector for creating some random noise later
Build a discriminator module
class DCGAN_D(nn.Module):
    def __init__(self):
        ... truncated, the usual neural nets stuffs, layers, etc ...
    def forward(self, input):
        ... truncated, the usual neural nets stuffs, layers, etc ...
Build a generator module
class DCGAN_G(nn.Module):
    def __init__(self):
        ... truncated, the usual neural nets stuffs, layers, etc ...
    def forward(self, input):
        ... truncated, the usual neural nets stuffs, layers, etc ...
Put them all together
netG = DCGAN_G().cuda()
netD = DCGAN_D().cuda()
Optimizer needs to be told what variables to optimize. A module automatically keeps track of its variables.
optimizerD = optim.RMSprop(netD.parameters(), lr = 1e-4)
optimizerG = optim.RMSprop(netG.parameters(), lr = 1e-4)
One forward step and one backward step for Discriminator
Here, the network calculates the gradient during the backward pass, depending on the input to this function. So, in my case, I have 3 types of losses: the generator loss, the discriminator real-image loss, and the discriminator fake-image loss. I can get the gradient of the loss function three times for 3 different net passes.
def step_D(input, init_grad):
    # input can be from generator's generated image data or input image from dataset
    err = netD(input)
    err.backward(init_grad) # backward pass net to calculate gradient
    return err # loss
Control trainable parameters [IMPORTANT]
Trainable parameters in the model are those that require gradients.
def make_trainable(net, val):
    for p in net.parameters():
        p.requires_grad = val # note: this is later set to False for netD in the netG update of the train loop.
In TensorFlow, this part can be coded like below:
grad = tf.gradients(loss, X)
grad = tf.stop_gradient(grad)
So, I think this will answer your first question, "I substituted tf.stop_gradient with grad.detach_(). Is this correct?"
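For illustration, here is a minimal standalone sketch (plain tensors, not your DocReaderModel) of how detach() plays the role of tf.stop_gradient:
import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()
grad, = torch.autograd.grad(loss, x)   # dF(X)/dX
e = 0.5 * grad.detach()                # detached: treated as a constant, like tf.stop_gradient
adv_loss = ((x + e) ** 2).sum()        # gradients flow through x only, not through e
adv_loss.backward()                    # x.grad is populated; e contributes nothing to the graph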
Train loop
You can see here how the 3 different loss functions are being called.
def train(niter, first=True):
    for epoch in range(niter):
        # Make iterable from PyTorch DataLoader
        data_iter = iter(dataloader)
        i = 0
        while i < n:
            ###########################
            # (1) Update D network
            ###########################
            make_trainable(netD, True)
            # train the discriminator d_iters times
            d_iters = 100
            j = 0
            while j < d_iters and i < n:
                j += 1
                i += 1
                # clamp parameters to a cube
                for p in netD.parameters():
                    p.data.clamp_(-0.01, 0.01)
                data = next(data_iter)
                ##### train with real #####
                real_cpu, _ = data
                real_cpu = real_cpu.cuda()
                real = Variable( data[0].cuda() )
                netD.zero_grad()
                # Real image discriminator loss
                errD_real = step_D(real, one)
                ##### train with fake #####
                fake = netG(create_noise(real.size()[0]))
                input.data.resize_(real.size()).copy_(fake.data)
                # Fake image discriminator loss
                errD_fake = step_D(input, mone)
                # Discriminator loss
                errD = errD_real - errD_fake
                optimizerD.step()
            ###########################
            # (2) Update G network
            ###########################
            make_trainable(netD, False)
            netG.zero_grad()
            # Generator loss
            errG = step_D(netG(create_noise(bs)), one)
            optimizerG.step()
            print('[%d/%d][%d/%d] Loss_D: %f Loss_G: %f Loss_D_real: %f Loss_D_fake %f'
                  % (epoch, niter, i, n,
                     errD.data[0], errG.data[0], errD_real.data[0], errD_fake.data[0]))
"I was getting "RuntimeError: Trying to backward through the graph a second time..."
PyTorch has this behaviour; to reduce GPU memory usage, during the .backward() call, all the intermediary results (if you have like saved activations, etc.) are deleted when they are not needed anymore. Therefore, if you try to call .backward() again, the intermediary results don't exist and the backward pass cannot be performed (and you get the error you see).
It depends on what you are trying to do. You can call .backward(retain_graph=True) to make a backward pass that will not delete intermediary results, and so you will be able to call .backward() again. All but the last call to backward should have the retain_graph=True option.
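A minimal standalone sketch (not from your code) of that behaviour:
import torch

x = torch.randn(4, requires_grad=True)
loss = (x ** 3).sum()
loss.backward(retain_graph=True)  # keeps the intermediary buffers alive
loss.backward()                   # second backward works; without retain_graph it raises the error above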
Can someone let me know pytorch's best practice on this
As you can see from the PyTorch code above, and from the way things are done in PyTorch (which tries to stay Pythonic), you can get a sense of PyTorch's best practices there.
If you want to work with higher-order derivatives (i.e. a derivative of a derivative) take a look at the create_graph option of backward.
For example:
loss = get_loss()
loss.backward(create_graph=True)
loss_grad_penalty = loss + loss.grad
loss_grad_penalty.backward()

Neural network only converges when data cloud is close to 0

I am new to tensorflow and am learning the basics at the moment so please bear with me.
My problem concerns strange non-convergent behaviour of neural networks when presented with the supposedly simple task of finding a regression function for a small training set consisting only of m = 100 data points {(x_1, y_1), (x_2, y_2),...,(x_100, y_100)}, where x_i and y_i are real numbers.
I first constructed a function that automatically generates a computational graph corresponding to a classical fully connected feedforward neural network:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import math
def neural_network_constructor(arch_list = [1,3,3,1],
                               act_func = tf.nn.sigmoid,
                               w_initializer = tf.contrib.layers.xavier_initializer(),
                               b_initializer = tf.zeros_initializer(),
                               loss_function = tf.losses.mean_squared_error,
                               training_method = tf.train.GradientDescentOptimizer(0.5)):
    n_input = arch_list[0]
    n_output = arch_list[-1]
    X = tf.placeholder(dtype = tf.float32, shape = [None, n_input])
    layer = tf.contrib.layers.fully_connected(
        inputs = X,
        num_outputs = arch_list[1],
        activation_fn = act_func,
        weights_initializer = w_initializer,
        biases_initializer = b_initializer)
    for N in arch_list[2:-1]:
        layer = tf.contrib.layers.fully_connected(
            inputs = layer,
            num_outputs = N,
            activation_fn = act_func,
            weights_initializer = w_initializer,
            biases_initializer = b_initializer)
    Phi = tf.contrib.layers.fully_connected(
        inputs = layer,
        num_outputs = n_output,
        activation_fn = tf.identity,
        weights_initializer = w_initializer,
        biases_initializer = b_initializer)
    Y = tf.placeholder(tf.float32, [None, n_output])
    loss = loss_function(Y, Phi)
    train_step = training_method.minimize(loss)
    return [X, Phi, Y, train_step]
With the above default values for the arguments, this function would construct a computational graph corresponding to a neural network with 1 input neuron, 2 hidden layers with 3 neurons each and 1 output neuron. The activation function is per default the sigmoid function. X corresponds to the input tensor, Y to the labels of the training data and Phi to the feedforward output of the neural network. The operation train_step performs one gradient-descent step when executed in the session environment.
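For completeness, the returned handles are meant to be used roughly like this (a sketch, not my exact test script; x_data and y_data stand for NumPy arrays of shape (m, 1)):
# Hypothetical usage of the constructor's return values in a TF 1.x session.
X, Phi, Y, train_step = neural_network_constructor()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100000):
        sess.run(train_step, feed_dict={X: x_data, Y: y_data})
    predictions = sess.run(Phi, feed_dict={X: x_data})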
So far, so good. If I now test a particular neural network (constructed with this function and the exact default values for the arguments given above) by making it learn a simple regression function for artificial data extracted from a sinewave, strange things happen:
Before training, the network output seems to be a flat line. After 100,000 training iterations, it manages to partially learn the function, but only the part which is closer to 0. After this, it becomes flat again. Further training does not decrease the loss function anymore.
This get even stranger, when I take the exact same data set, but shift all x-values by adding 500:
Here, the network completely refuses to learn. I cannot understand why this is happening. I have tried changing the architecture of the network and its learning rate, but have observed similar effects: the closer the x-values of the data cloud are to the origin, the easier the network can learn. After a certain distance to the origin, learning stops completely. Changing the activation function from sigmoid to ReLu has only made things worse; here, the network tends to just converge to the average, no matter what position the data cloud is in.
Is there something wrong with my implementation of the neural-network constructor? Or does this have something to do with the initialization values? I have tried to get a deeper understanding of this problem for quite a while now and would greatly appreciate some advice. What could be the cause of this? All thoughts on why this behaviour is occurring are very much welcome!
Thanks,
Joker

TensorFlow: Linear Regression loss increasing (instead of decreasing) with successive epochs

I'm learning TensorFlow and trying to apply it to a simple linear regression problem. data is a numpy.ndarray of shape [42x2].
I'm a bit puzzled why after each successive epoch the loss is increasing. Isn't the loss expected to go down with each successive epoch?
Here is my code (let me know if you'd like me to share the output as well!). Thanks a lot for taking the time to answer.
1) created the placeholders for dependent / independent variables
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32,name='Y')
2) created vars for weight, bias, total_loss (after each epoch)
w = tf.Variable(0.0,name='weights')
b = tf.Variable(0.0,name='bias')
3) defined loss function & optimizer
Y_pred = X * w + b
loss = tf.reduce_sum(tf.square(Y - Y_pred), name = 'loss')
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.001).minimize(loss)
4) created summary events & event file writer
tf.summary.scalar(name = 'weight', tensor = w)
tf.summary.scalar(name = 'bias', tensor = b)
tf.summary.scalar(name = 'loss', tensor = loss)
merged = tf.summary.merge_all()
evt_file = tf.summary.FileWriter('def_g')
evt_file.add_graph(tf.get_default_graph())
5) and execute all in a session
with tf.Session() as sess1:
    sess1.run(tf.variables_initializer(tf.global_variables()))
    for epoch in range(10):
        summary, _, l = sess1.run([merged, optimizer, loss], feed_dict={X: data[:,0], Y: data[:,1]})
        evt_file.add_summary(summary, epoch+1)
        evt_file.flush()
        print(" new_loss: {}".format(sess1.run(loss, feed_dict={X: data[:,0], Y: data[:,1]})))
Cheers!
The short answer is that your learning rate is too big. I was able to get reasonable results by changing it from 0.001 to 0.0001, but I only used the 23 points from your second-last comment (I initially didn't notice your last comment), so using all the data might require an even lower number.
0.001 seems like a really low learning rate. However, the real problem is that your loss function is using reduce_sum instead of reduce_mean. This causes your loss to be a large number, which sends a very strong signal to the GradientDescentOptimizer, so it's overshooting despite the low learning rate. The problem would only get worse if you added more points to your training data. So use reduce_mean to get the average squared error and your algorithms will be much better behaved.
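Concretely, the change would be (a sketch, assuming the rest of the graph stays the same):
Y_pred = X * w + b
loss = tf.reduce_mean(tf.square(Y - Y_pred), name='loss')   # average, not sum, of squared errors
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)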

CNTK classification model Classifies all 1

I have a CNTK model which takes in features related to clicks and other information and predicts whether something will be clicked in the future. Using the same features in a random forest works fine; however, CNTK classifies everything as 1. Why does this happen? Is there any parameter tuning needed? The features have varying scales.
My train action looks like this:
BrainScriptNetworkBuilder = [
    inputD = $inputD$
    labelD = $labelD$
    #hidden1 = $hidden1$
    model(features) = {
        w0 = ParameterTensor{(1 : 2), initValueScale=10}; b0 = ParameterTensor{1, initValueScale=10};
        h1 = w0*features + b0; #hidden layer
        z = Sigmoid (h1)
    }.z
    features = Input(inputD)
    labels = Input(labelD)
    z = model(features)
    #now that we have output, find error
    err = SquareError (labels, z)
    lr = Logistic (labels, z)
    output = z
    criterionNodes = (err)
    evaluationNodes = (err)
    outputNodes = (z)
]
SGD = [
    epochSize = 4 #learn
    minibatchSize = 1 #learn
    maxEpochs = 1000 #learn
    learningRatesPerSample = 1
    numMBsToShowResult = 10000
    firstMBsToShowResult = 10
]
In addition to what KeD said, a random forest does not care about the actual values of the features, only about their relative order.
Unlike trees, neural networks are sensitive to the actual values of the features (rather than just their relative order).
Your input might contain some features with very large values. You should probably rescale them. There are different schemes for doing this. One possibility is to subtract the mean from each feature and scale it to [-1, 1], or divide by its standard deviation. Another possibility for positive features is a transformation such as f => log(1+f). You could also use a batch normalization layer.
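For example, a rough NumPy sketch (hypothetical data, outside CNTK) of the first two schemes:
import numpy as np

features = np.random.rand(1000, 5) * 1e4                                  # stand-in for raw features of varying scale
standardized = (features - features.mean(axis=0)) / features.std(axis=0)  # zero mean, unit variance per feature
log_scaled = np.log1p(features)                                           # log(1 + f), for non-negative features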
Since your features are of varying scales, I would suggest you normalize the features. You mentioned that CNTK classifies all input as 1. I am assuming that happens when you predict using the trained model. But what happens during training? Can you plot a graph of the training + test error (CNTK supports TensorBoard now)? That would give you some indication of whether your model is over-fitting. Moreover, as a side note, I would suggest increasing the model's learning capability (most likely by increasing the number of hidden layers) to learn a better distribution of your data.
It seems the learning rate is too high, please try learningRatesPerSample = 0.001