Understanding the result from Encog neural network example - encog

I'm playing around with Encog 3.2 for Java. Starting from the example (http://www.heatonresearch.com/wiki/Hello_World), I made my own network with 4 neurons in the input layer and 2 neurons in the output layer.
1.0,1.0, actual=0.22018401281844316,ideal=1.0
-1.0,-1.0, actual=0.9903002141301814,ideal=0.0
Can someone explain how to interpret the result (actual vs. ideal, and the numbers before them)?
Thank you very much.

Note that at this stage, the network has been trained, and you are now in the testing stage.
The network has 2 input neurons and 1 output neuron.
The first two numbers in your result are given to the trained network as the inputs. Using the internal weights and biases (which are not changed during testing), it computes the result/output, listed as actual.
ideal is what the result should be, i.e. the number listed in the dataset for that sample/row.
Generally, when you want a 0 or 1 output (e.g. one of n), you will round the actual result.
So in this case the network computes
1 XOR 1 = 0.22 (rounded = 0), which is wrong (according to ideal).
-1 XOR -1 = 0.99 (rounded = 1), which is also wrong.
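To make the "round the actual result and compare with ideal" step concrete, here is a tiny illustrative Python sketch (not Encog code) using the two rows from the output above:

import math

# Each row: (inputs, actual network output, ideal/target value)
rows = [
    ((1.0, 1.0), 0.22018401281844316, 1.0),
    ((-1.0, -1.0), 0.9903002141301814, 0.0),
]

for inputs, actual, ideal in rows:
    predicted = round(actual)            # threshold the raw output at 0.5
    verdict = "correct" if predicted == ideal else "wrong"
    print(inputs, "actual=%.4f" % actual, "rounded=%d" % predicted,
          "ideal=%g" % ideal, verdict)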

Related

After quantisation in neural network, will the output need to be scaled with the inverse of the weight scaling

I'm currently writing a script to quantise a Keras model down to 8 bits. I'm doing a fairly basic linear scaling on the weights, by assuming a normal distribution of weights and biases, and then interpolating all the values within 2 standard deviations of the mean, to the range [-128, 127].
This all works, and I run the model through inference, but the output image is very bad. I know there will be a small performance hit, but I'm seeing roughly a 10x degradation.
My question is, after this scaling of the weights, do I need to do the inverse scaling operation to my output? None of the papers I've been reading seem to mention this, but I'm unsure why else my results would be so bad.
The network is for image demosaicing. It takes in a RAW image, and is meant to output an image with very low noise, and no demosaicing artefacts. My full precision model is very good, with image PSNRs of around 40-43dB, but after quantisation, I'm getting 4-8dB, and incredibly bad looking images.
Code, for anyone who's bothered to read it:
import numpy as np

# First pass: collect weight statistics over the layers being quantised.
count = 0
mean_of_mean = 0.0
max_std = 0.0
for i in layer_index:
    count = count + 1
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    std = np.std(weights_act)
    if std > max_std:
        max_std = std
    mean = np.mean(weights_act)
    mean_of_mean = mean_of_mean + mean
mean_of_mean = mean_of_mean / count

# Map everything within 2 standard deviations of the mean to [-128, 127].
max_bound = mean_of_mean + 2 * max_std
min_bound = mean_of_mean - 2 * max_std
print(max_bound, min_bound)

# Second pass: linearly rescale weights and biases into int8.
for i in layer_index:
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    weights_shape = weights_act.shape
    bias_shape = bias_act.shape
    new_weights = np.empty(weights_shape, dtype=np.int8)
    print(new_weights.dtype)
    new_biases = np.empty(bias_shape, dtype=np.int8)
    for a in range(weights_shape[0]):
        for b in range(weights_shape[1]):
            for c in range(weights_shape[2]):
                for d in range(weights_shape[3]):
                    new_weight = ((weights_act[a, b, c, d] - min_bound) * (127 - (-128))
                                  / (max_bound - min_bound)) + (-128)
                    new_weights[a, b, c, d] = np.int8(new_weight)
                    # print(new_weights[a, b, c, d], weights_act[a, b, c, d])
    for e in range(bias_shape[0]):
        new_bias = ((bias_act[e] - min_bound) * (127 - (-128))
                    / (max_bound - min_bound)) + (-128)
        new_biases[e] = np.int8(new_bias)
    new_weight_layer = (new_weights, new_biases)
    layer.set_weights(new_weight_layer)
You are not doing what you think you are doing; let me explain.
If you wish to take a pre-trained model and quantize it, you have to add scales after each operation that involves weights. Let's take the convolution operation as an example.
The convolution operation is linear. In this explanation I will ignore the bias for simplicity (adding it back is relatively easy). Let X be the input, Y the output and W the weights; convolution can be written as:
Y = W*X
where '*' represents the convolution operation. What you are basically doing is taking the weights, multiplying them by some scalar (let's call it 'a') and shifting them by some other scalar (let's call it 'b'), so your model actually uses W' where: W' = Wa + b
Returning to the convolution operation, in your quantized network you are effectively computing: Y' = W'*X = (Wa + b)*X
Because convolution is linear, this gives: Y' = a(W*X) + b*X
Don't forget that you want Y, not Y', at the output of the convolution, so you must shift and rescale to recover the correct answer.
With that explanation in mind, the problem in your network should be clear: you apply this scale and shift to all of the weights and never compensate for it. I think the confusion arises because the papers you read trained models in quantized mode from the beginning, rather than taking a pre-trained model and quantizing it.
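A minimal NumPy sketch of the compensation idea (illustrative only, not the original code, and using a 1-D dot product as a stand-in for the convolution; all names and values are made up):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8,))          # float weights of a linear layer
X = rng.normal(size=(8,))          # an input patch

# Affine "quantization" of the weights: W' = W*a + b (a, b chosen arbitrarily here).
a, b = 50.0, 3.0
W_q = W * a + b

Y = W @ X                          # true output
Y_q = W_q @ X                      # output of the layer with rescaled weights

# Because the layer is linear, Y_q = a*(W @ X) + b*sum(X), so it can be undone:
Y_recovered = (Y_q - b * X.sum()) / a
print(Y, Y_q, Y_recovered)         # Y and Y_recovered agree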
For your problem, I think the TensorFlow graph transform tool might help; take a look at:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md
If you wish to read more about quantizing pre-trained models, there is more information here (for more academic material, try scholar.google.com):
https://www.tensorflow.org/lite/performance/post_training_quantization
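For reference, the post-training quantization flow described on that TFLite page boils down to something like the following sketch (assuming a TF 2.x Keras model named model; exact options vary by version):

import tensorflow as tf

# Post-training quantization with the TFLite converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)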

What is the end result of this machine learning tutorial?

I've been trying to learn tensorflow and machine learning and this article was one of the first tutorials I've stumbled onto: https://medium.com/towards-data-science/tensorflow-for-absolute-beginners-28c1544fb0d6. I stepped through the code and thought I understood the vast majority of it but then I got to the final output which was a set of 3 numbers, the weights. How are these weights supposed to be used? That is, how would I put this result to use in a real world scenario?
Weights are what you are trying to optimize.
The goal is to find a set of weights that, when given a set of inputs, will output the right answer.
In this case, you have 1 (True) and -1 (False) inputs and a bias that is always one. The goal is to learn the AND function. The function should return 1 (True) only when both inputs are 1 (True), and -1 (False) otherwise.
When given a new input [1, -1, 1] (the bias is always one in this case), the function will multiply these inputs by the weights you computed earlier and sum the result. If the result is greater than 0 it will output 1; if not, it will output -1.
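A tiny sketch of how the three learned weights would be applied to new data (illustrative only; the weight values below are placeholders, not the tutorial's actual output):

import numpy as np

weights = np.array([0.3, 0.3, -0.4])      # placeholder values for [w1, w2, w_bias]

def predict(x1, x2):
    inputs = np.array([x1, x2, 1.0])      # the third input is the always-on bias
    return 1 if inputs @ weights > 0 else -1

print(predict(1, 1))    # 1  (True AND True)
print(predict(1, -1))   # -1 (True AND False)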

tensorflow: how to recognize untrained new letter

I have a question. I want the system to return -1 as an unknown character for new, untrained letters. For example, if I have trained 1/2/3/4, then when I test the character '5' or '6', TensorFlow should return -1 as an unknown character.
Is it possible?
Thanks.
I'd think for simple classifications, you're looking for anything that has less than a certain confidence/score of being a known class.
To be fair, I've only use Keras on top of TensorFlow, so YMMV.
I'd just train it on the 4 categories you know; then, when it classifies, if the top result has less than a certain raw score/weight (say it classifies an unknown 7 as a 4, but with a mediocre score), treat it as a -1.
This might not work with every loss/objective function you train your model on, but it should work with MSE or categorical cross-entropy if you can get the raw final output.
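A rough sketch of that thresholding idea with a Keras model (hypothetical model and cutoff; the 0.8 threshold would need tuning on a validation set):

import numpy as np

def classify_with_reject(model, x, threshold=0.8):
    # model is assumed to end in a softmax over the 4 known classes
    probs = model.predict(x[np.newaxis, ...])[0]
    top = int(np.argmax(probs))
    return top if probs[top] >= threshold else -1   # -1 means "unknown character"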

Why shuffling data gives significantly higher accuracy?

In TensorFlow, I've written a big model for a 2-class image problem. My question concerns the following code snippet:
X, y, X_val, y_val = prepare_data()
probs = calc_probs(model, session, X)
accuracy = float(np.equal(np.argmax(probs, 1), np.argmax(y, 1)).sum()) / probs.shape[0]
loss = log_loss(y, probs)
X is an np.array of shape (25000, 244, 244, 3). That code results in accuracy=0.5834 (close to random) and loss=2.7106. But
when I shuffle the data by adding these 3 lines after the first line:
sample_idx = random.sample(range(0, X.shape[0]), 25000)
X = X[sample_idx]
y = y[sample_idx]
the results become reasonable: accuracy=0.9933 and loss=0.0208.
Why can shuffling the data give significantly higher accuracy, or what could be the reason for that?
The function calc_probs is mainly a run call:
probs = session.run(model.probs, feed_dict={model.X: X})
Update:
After hours of debugging, I figured out that evaluating a single image gives a different result each time. For example, if you run the following line of code multiple times, you get a different result each time:
session.run(model.probs, feed_dict={model.X: [X[20]]})
My data is sorted by class: X contains class 1 samples first, then class 2. In the calc_probs function, I run each batch of the data sequentially, so without shuffling each run contains data of a single class.
I've also noticed that with shuffling, if the batch size is very small, I get random-level accuracy.
There is some mathematical justification for this in the context of the randomized Kaczmarz algorithm. The regular Kaczmarz algorithm is an old method that can be seen as non-shuffling SGD on a least-squares problem, and provably faster convergence rates come out if you use randomization; follow the references in http://www.cs.ubc.ca/~nickhar/W15/Lecture21Notes.pdf
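A small sketch (not from the answer above, with made-up sizes) comparing cyclic and randomized Kaczmarz iterations on a least-squares problem, to illustrate how row/sample ordering affects convergence:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
x_true = rng.normal(size=20)
b = A @ x_true

def kaczmarz(A, b, n_iters=2000, randomized=False, seed=0):
    pick = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for k in range(n_iters):
        i = pick.integers(m) if randomized else k % m   # random vs. cyclic row choice
        a_i = A[i]
        x += (b[i] - a_i @ x) / (a_i @ a_i) * a_i        # project onto the i-th hyperplane
    return x

for randomized in (False, True):
    x_hat = kaczmarz(A, b, randomized=randomized)
    print("randomized" if randomized else "cyclic   ", np.linalg.norm(x_hat - x_true))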

Given logistic regression coefficients computed in SSAS, create a formula to calculate a continuous output value

I've trained a simple logistic regression model in SSAS, using Gender and NIC as discrete input nodes (NIC is 0 for non-smoker, 1 for smoker) with Score (0-100) as a continuous output node.
I want to predict the score based on a new participant's values for Gender and NIC. Of course, I can run a singleton query in DMX; for example, the following produces a value of 49.51....
SELECT Predict(Score)
FROM [MyModel]
NATURAL PREDICTION JOIN
(SELECT 'M' AS Gender, '1' AS NIC) as t
But instead of using DMX, I want to create a formula from the model in order to calculate scores while "disconnected" from SSAS.
Investigating the model, I have the following information in the NODE_DISTRIBUTION of the output node:
ATTRIBUTE_NAME ATTRIBUTE_VALUE SUPPORT PROBABILITY VARIANCE VALUETYPE
Gender:F 0.459923854 0 0 0 7 (Coefficient)
Gender:M 0.273306289 0 0 0 7 (Coefficient)
Nic:0 -0.282281195 0 0 0 7 (Coefficient)
Nic:1 -0.802106901 0 0 0 7 (Coefficient)
0.013983007 0 0 0.647513829 7 (Coefficient)
Score 75.03691517 0 0 0 3 (Continuous
Plugging these coefficients into a logistic regression formula -- that I am being disallowed from uploading as a new user :) -- for the smoking male example above,
f(...) = 1 / (1 + exp(0 - (0.0139830071136734 -- Constant(?)
+ 0 * 0.459923853918008 -- Gender:F = 0
+ 1 * 0.273306289390897 -- Gender:M = 1
+ 1 * -0.802106900621717 -- Nic:1 = 1
+ 0 * -0.282281195489355))) -- Nic:0 = 0
results in a value of 0.374.... But how do I "map" this value back to the score distribution of 0-100? In other words, how do I extend the equation above to produce the same value that the DMX singleton query does? I'm assuming it will require the stdev and mean of my Score distribution, but I'm stuck on exactly how to use those values. I'm also unsure whether I'm using the ATTRIBUTE_VALUE in the fifth row correctly as the constant.
Any help you can provide will be appreciated!
I'm no expert, but it sounds to me like you don't want to use logistic regression at all; you want to train a linear regression. You currently have a logistic regression model, and these are typically used for binary classification, not continuous values such as 0-100.
How to do linear regression in SAS
Wikipedia: linear regression
More details: as with most data mining/machine learning problems, the answer really depends on your data. If your data is bimodal, with more than 90% of the training set very close to either 1 or 100, then a logistic regression MIGHT be used. The equation used in logistic regression is specifically designed to render YES/NO answers. It is technically a continuous function, so results such as .34 are possible, but they are statistically very unlikely (in typical usage you would round down to 0).
However, if your data is normally distributed (most of nature is), the better method is linear regression. The only problem is that it CAN predict outside your 0-100 range if given a particularly bad data point. In that case you would be best off clipping the result to 0-100, or ignoring the data point as an outlier.
In the case of gender, a quick hack would be to map male to 0 and female to 1, then treat gender as an input for the model.
SSAS linear regression
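A rough sketch of that suggestion outside of SSAS (hypothetical data, plain least squares via NumPy): encode gender as 0/1, fit a linear model, and clip predictions to [0, 100].

import numpy as np

# Hypothetical training data: columns are [gender (M=0, F=1), nic (0/1), intercept]
X = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)
scores = np.array([48.0, 80.0, 75.0, 55.0])        # made-up 0-100 scores

coef, *_ = np.linalg.lstsq(X, scores, rcond=None)  # ordinary least squares fit

def predict(gender, nic):
    raw = np.array([gender, nic, 1.0]) @ coef
    return float(np.clip(raw, 0, 100))             # clip to the valid score range

print(predict(0, 1))   # e.g. a smoking male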
You do not want to be using logistic regression if you are trying to model a score restricted to the interval [0,100]. Logistic regression is used to model either binary data or proportions based on a binomial distribution. Assuming a logit link function, what you are actually modelling with logistic regression is a function of probability (log odds), and as such the entire process is geared to give you values in the interval [0,1]. Trying to use this to map to a score does not seem to be the right type of analysis at all.
In addition, I cannot see how regular linear regression will help you either, as your fitted model will be capable of generating values well outside your target interval [0,100]; and if you have to perform ad hoc truncation of values to this range, can you really be sure that your results have any effective meaning?
I would like to be able to point you to the type of analysis that you require, but I have not encountered it. My advice would be to abandon the logistic regression approach and consider joining the ALLSTAT mailing list, used by professional statisticians and mathematicians, and asking for advice there. Or something similar.