Multilabel/ Multitask/ Multiclass Regression in machine learning - tensorflow

My challenge is to train a neural network to recognize certain actions and events for different classes of task or how you want to call it given the input.
I see that most of the input/output when training neural networks is either 0 or 1 or [0,1]. But in my scenario I want my input to be in the form of integers which are arbitrarily big and the same form is expected for the output.
Let me give you an example:
Input
X = [ 23, 4, 0, 1233423, 1, 0, 0] ->
Y = [ 2, 1, 1]
Now each element in X[i] represent different properties of the same entity.
Let's say it want to describe a human being:
23 -> maps to a place he/she was born
4 -> maps to a school they graduated
etc.
Each entry in Y[i], on the other hand, means what is more likely the human to do in 3 different categories ( as len(Y) is 3 in this case ):
Y[0] = 2 -> maps to eating icecream ( from a variety of other choices )
Y[1] = 1 -> maps to a time of day moment ( morning, noon, afternoon, evening, etc...)
Y[2] = 1 -> maps to a day of the week for example
Now of course if the task was just a multi label problem I would apply a sigmoid on the output layer and do a binary_crossentropy as the loss function but that of course does not work.
Here because my output is obviously not between [0,1].
Also I am not really sure what loss function to apply since I want all classes/subclasses in Y to be correctly predicted. What I am basically saying is that each Y[i] is itself is a class of its own.
It would be more accurate if my output was in the shape of (3, labels_per_class)
and the loss function would calculate a loss for each of the 3 different classes
trying to optimize the result in such a way that each of the 3 classes would have the correct labels.
I am not sure if that is possible or how at least.
I am really still in the beginnings with my neural network knowledge and learning so clearly I am struggling with this problem.
But really to put it more simply I have a better idea how to describe it. It is more or less like an auto-encoder but the inputs and outputs are integers. The difference is that in my case the output has a different size from the input where in the auto-encoder they are the same.
My solution was to apply a relu at the output layer, ( and of course relu-like activations on all other layers as well ) and binary_crossentropy as the loss functions but the accuracy of the network is very low, around 15%.

For a standard classification you would probably do a dense layer with a number of nodes equal to the number of classes then apply softmax. The loss would be tf.losses.softmax_cross_entropy. You would do a sigmoid if you want to allow multiple classes, not just one.
Now you have multiple classification tasks. One way to do it is to take the last hidden layer (the one before the one where you do softmax). For each task do a dense layer with a number of nodes equals to the number of classes for that task and apply softmax. To compute the loss just add the losses together.
If the tasks are too different you may want to have more than one layer for each prediction.
You can also put some weights on the different losses if, say, eating ice-cream is a lot more important than getting the time of day right.
Only use relu if the prediction space is continous. Say time of day is continous but the choice between eating ice-cream, going to work, watching TV is not. If you use relu use a loss like L1(tf.losses.absolut_difference) or L2 (tf.losses.mean_squared_error).

Related

What is the significance of normalization of data before feeding it to a ML/DL model?

I just started learning Deep Learning and was working with the Fashion MNIST data-set.
As a part of pre-processing the X-labels, the training and test images, dividing the pixel values by 255 is included as a part of normalization of the input data.
training_images = training_images/255.0
test_images = test_images/255.0
I understand that this is to scale down the values to [0,1] because neural networks are more efficient while handling such values. However, if I try to skip these two lines, my model predicts something entire different for a particular test_image.
Why does this happen?
Let's see both the scenarios with the below details.
1. With Unnormaized data:
Since your network is tasked with learning how to combine inputs through a series of linear combinations and nonlinear activations, the parameters associated with each input will exist on different scales.
Unfortunately, this can lead toward an awkward loss function topology which places more emphasis on certain parameter gradients.
Or in a simple definition as Shubham Panchal mentioned in comment.
If the images are not normalized, the input pixels will range from [ 0 , 255 ]. These will produce huge activation values ( if you're using ReLU ). After the forward pass, you'll end up with a huge loss value and gradients.
2. With Normalized data:
By normalizing our inputs to a standard scale, we're allowing the network to more quickly learn the optimal parameters for each input node.
Additionally, it's useful to ensure that our inputs are roughly in the range of -1 to 1 to avoid weird mathematical artifacts associated with floating-point number precision. In short, computers lose accuracy when performing math operations on really large or really small numbers. Moreover, if your inputs and target outputs are on a completely different scale than the typical -1 to 1 range, the default parameters for your neural network (ie. learning rates) will likely be ill-suited for your data. In the case of image the pixel intensity range is bound by 0 and 1(mean =0 and variance =1).

Conv 1x1 configuration for feature reduction

I am using 1x1 convolution in the deep network to reduce a feature x: Bx2CxHxW to BxCxHxW. I have three options:
x -> Conv (1x1) -> Batchnorm-->ReLU. Code will be output = ReLU(BN(Conv(x))). Reference resnet
x -> BN -> ReLU-> Conv. So the code will be output = Conv(ReLU(BN(x))) . Reference densenet
x-> Conv. The code is output = Conv(x)
Which one is most using for feature reduction? Why?
Since you are going to train your net end-to-end, whatever configuration you are using - the weights will be trained to accommodate them.
BatchNorm?
I guess the first question you need to ask yourself is do you want to use BatchNorm? If your net is deep and you are concerned with covariate shifts then you probably should have a BatchNorm -- and here goes option no. 3
BatchNorm first?
If your x is the output of another conv layer, than there's actually no difference between your first and second alternatives: your net is a cascade of ...-conv-bn-ReLU-conv-BN-ReLU-conv-... so it's only an "artificial" partitioning of the net into triplets of functions conv, bn, relu and up to the very first and last functions you can split things however you wish. Moreover, since Batch norm is a linear operation (scale + bias) it can be "folded" into an adjacent conv layer without changing the net, so you basically left with conv-relu pairs.
So, there's not really a big difference between the first two options you highlighted.
What else to consider?
Do you really need ReLU when changing dimension of features? You can think of the reducing dimensions as a linear mapping - decomposing the weights mapping to x into a lower rank matrix that ultimately maps into c dimensional space instead of 2c space. If you consider a linear mapping, then you might omit the ReLU altogether.
See fast RCNN SVD trick for an example.

LSTM Sequence Prediction in Keras just outputs last step in the input

I am currently working with Keras using Tensorflow as the backend. I have a LSTM Sequence Prediction model shown below that I am using to predict one step ahead in a data series (input 30 steps [each with 4 features], output predicted step 31).
model = Sequential()
model.add(LSTM(
input_dim=4,
output_dim=75,
return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(
150,
return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(
output_dim=4))
model.add(Activation("linear"))
model.compile(loss="mse", optimizer="rmsprop")
return model
The issue I'm having is that after training the model and testing it - even with the same data it trained on - what it outputs is essentially the 30th step in the input. My first thought is the patterns of my data must be too complex to accurately predict, at least with this relatively simple model, so the best answer it can return is essentially the last element of the input. To limit the possibility of over-fitting I've tried turning training epochs down to 1 but the same behavior appears. I've never observed this behavior before though and I have worked with this type of data before with successful results (for context, I'm using vibration data taken from 4 points on a complex physical system that has active stabilizers; the prediction is used in a pid loop for stabilization hence why, at least for now, I'm using a simpler model to keep things fast).
Does that sound like the most likely cause, or does anyone have another idea? Has anyone seen this behavior before? In case it helps with visualization here is what the prediction looks like for one vibration point compared to the desired output (note, these screenshots are zoomed in smaller selections of a very large dataset - as #MarcinMożejko noticed I did not zoom quite the same both times so any offset between the images is due to that, the intent is to show the horizontal offset between the prediction and true data within each image):
...and compared to the 30th step of the input:
Note: Each data point seen by the Keras model is an average over many actual measurements with the window of the average processed along in time. This is done because the vibration data is extremely chaotic at the smallest resolution I can measure so instead I use this moving average technique to predict the larger movements (which are the more important ones to counteract anyway). That is why the offset in the first image appears as many points off instead of just one, it is 'one average' or 100 individual points of offset.
.
-----Edit 1, code used to get from the input datasets 'X_test, y_test' to the plots shown above-----
model_1 = lstm.build_model() # The function above, pulled from another file 'lstm'
model_1.fit(
X_test,
Y_test,
nb_epoch=1)
prediction = model_1.predict(X_test)
temp_predicted_sensor_b = (prediction[:, 0] + 1) * X_b_orig[:, 0]
sensor_b_y = (Y_test[:, 0] + 1) * X_b_orig[:, 0]
plot_results(temp_predicted_sensor_b, sensor_b_y)
plot_results(temp_predicted_sensor_b, X_b_orig[:, 29])
For context:
X_test.shape = (41541, 30, 4)
Y_test.shape = (41541, 4)
X_b_orig is the raw (averaged as described above) data from the b sensor. This is multiplied by the prediction and input data when plotting to undo normalization I do to improve the prediction. It has shape (41541, 30).
----Edit 2----
Here is a link to a complete project setup to demonstrate this behavior:
https://github.com/ebirck/lstm_sequence_prediction
That is because for your data(stock data?), the best prediction for 31st value is the 30th value itself.The model is correct and fits the data.
I also have similar experience predicting the stock data.
I feel I should post a follow-up, since it seems this post has been getting more attention than my other questions.
Ferret Zhang's answer is correct (and has been accepted), and I find this discovery is actually quite funny when you understand it in relation to stock / cryptocurrency data which some have commented about. What sequence prediction is ultimately doing is assigning statistical weights to different moves, to pick the highest probability move and 'predict' it will happen. In the case of stock data, in a vacuum it is (at least at this scale) completely random, there is equal probability of moving up or down, and hence the model predicts that it will stay the exact same.
The model, in a sense, learned that the best way to play is to not play at all :)

Approximating multidimensional functions with neural networks

Is it possible to fit or approximate multidimensional functions with neural networks?
Let's say I want to model the function f(x,y)=sin(x)+y from some given measurement data. (f(x,y) is considered as ground truth and is not known). Also if it's possible some code examples written in Tensorflow or Keras would be great.
As said by #AndreHolzner, theoretically you can approximate any continuous function with a neural network as well as you want, on any compact subset of R^n, even with only one hidden layer.
However, in practice, the neural net can have to be very large for some functions, and sometimes be untrainable (the optimal weights may be hard to find without getting in a local minimum). So here are a few practical suggestions (unfortunately vague, because the details depend too much on your data and are hard to predict without multiple tries):
Keep the network not too big (it'hard to define though, unfortunately): you'll just overfit. You'll probably need a LOT of training samples.
A big number of reasonably-sized layers is usually better than a reasonable number of big layers.
If you have some priors about the function, use them: for instance, if you believe there is some kind of periodicity in f (like in your example, but it could be more complicated), you could add the sin() function to some of of the outputs of the first layer (not all, that would give you a truly periodic output). If you suspect a polynom of degree n, just augment you input x with x², ...x^n and use a linear regression on that input, etc. It will be much easier than learning the weights.
The universal approximator theorem is true on any compact subset of R^n, not on the entire multidimensional space. In particular, you'll never be able to predict the value for an input that's way bigger than any of the training samples for instance (say you trained on numbers from 0 to 100, don't test on 200, it will fail).
For an example of regression you can look here for instance. To regress a more complicated function you'd need to put more complicated functions to get pred from x, for instance like this:
n_layers = 3
x = tf.placeholder(shape=[-1, n_dimensions], dtype=tf.float32)
last_layer = x
# Add n_layers dense hidden layers
for i in range(n_layers):
last_layer = tf.layers.dense(inputs=last_layer, units=128, activation=tf.nn.relu)
# Get the output prediction
pred = tf.layers.dense(inputs=last_layer, units=1, activation=None)
# Get the cost, training op, etc, just like in the linear regression example

How to deal with multi step time series forecasting in multivariate LSTM in keras

I am trying to do multi-step time series forecasting using multivariate LSTM in Keras. Specifically, I have two variables (var1 and var2) for each time step originally. Having followed the online tutorial here, I decided to use data at time (t-2) and (t-1) to predict the value of var2 at time step t. As sample data table shows, I am using the first 4 columns as input, Y as output. The code I have developed can be seen here, but I have got three questions.
var1(t-2) var2(t-2) var1(t-1) var2(t-1) var2(t)
2 1.5 -0.8 0.9 -0.5 -0.2
3 0.9 -0.5 -0.1 -0.2 0.2
4 -0.1 -0.2 -0.3 0.2 0.4
5 -0.3 0.2 -0.7 0.4 0.6
6 -0.7 0.4 0.2 0.6 0.7
Q1: I have trained an LSTM model with the data above. This model does
well in predicting the value of var2 at time step t. However, what
if I want to predict var2 at time step t+1. I feel it is hard
because the model cannot tell me the value of var1 at time step t. If I want to do it, how should I modify the code to build the model?
Q2: I have seen this question asked a lot, but I am still confused. In
my example, what should be the correct time step in [samples, time
steps, features] 1 or 2?
Q3: I just started studying LSTMs. I have
read here that one of the biggest advantages of LSTM is that it
learns the temporal dependence/sliding window size by itself, then
why must we always covert time series data into format like the
table above?
Update: LSTM result (blue line is the training seq, orange line is the ground truth, green is the prediction)
Question 1:
From your table, I see you have a sliding window over a single sequence, making many smaller sequences with 2 steps.
For predicting t, you take first line of your table as input
For predicting t+1, you take the second line as input.
If you're not using the table: see question 3
Question 2:
Assuming you're using that table as input, where it's clearly a sliding window case taking two time steps as input, your timeSteps is 2.
You should probably work as if var1 and var2 were features in the same sequence:
input_shape = (2,2) - Two time steps and two features/vars.
Question 3:
We do not need to make tables like that or build a sliding window case. That is one possible approach.
Your model is actually capable of learning things and deciding the size of this window itself.
If on one hand your model is capable of learning long time dependencies, allowing you not to use windows, on the other hand, it may learn to identify different behaviors at the beginning and at the middle of a sequence. In this case, if you want to predict using sequences that start from the middle (not including the beginning), your model may work as if it were the beginning and predict a different behavior. Using windows eliminate this very long influence. Which is better may depend on testing, I guess.
Not using windows:
If your data has 800 steps, feed all the 800 steps at once for training.
Here, we will need to separate two models, one for training, another for predicting. In training, we will take advantage of the parameter return_sequences=True. This means that for each input step, we will get an output step.
For predicting later, we will want only one output, then we will use return_sequences= False. And in case we are going to use the predicted outputs as inputs for following steps, we are going to use a stateful=True layer.
Training:
Have your input data shaped as (1, 799, 2), 1 sequence, taking the steps from 1 to 799. Both vars in the same sequence (2 features).
Have your target data (Y) shaped also as (1, 799, 2), taking the same steps shifted, from 2 to 800.
Build a model with return_sequences=True. You may use timeSteps=799, but you may also use None (allowing variable amount of steps).
model.add(LSTM(units, input_shape=(None,2), return_sequences=True))
model.add(LSTM(2, return_sequences=True)) #it could be a Dense 2 too....
....
model.fit(X, Y, ....)
Predicting:
For predicting, create a similar model, now with return_sequences=False.
Copy the weights:
newModel.set_weights(model.get_weights())
You can make an input with length 800, for instance (shape: (1,800,2)) and predict just the next step:
step801 = newModel.predict(X)
If you want to predict more, we are going to use the stateful=True layers. Use the same model again, now with return_sequences=False (only in the last LSTM, the others keep True) and stateful=True (all of them). Change the input_shape by batch_input_shape=(1,None,2).
#with stateful=True, your model will never think that the sequence ended
#each new batch will be seen as new steps instead of new sequences
#because of this, we need to call this when we want a sequence starting from zero:
statefulModel.reset_states()
#predicting
X = steps1to800 #input
step801 = statefulModel.predict(X).reshape(1,1,2)
step802 = statefulModel.predict(step801).reshape(1,1,2)
step803 = statefulModel.predict(step802).reshape(1,1,2)
#the reshape is because return_sequences=True eliminates the step dimension
Actually, you could do everything with a single stateful=True and return_sequences=True model, taking care of two things:
When training, reset_states() for every epoch. (Train with a manual loop and epochs=1)
When predicting from more than one step, take only the last step of the output as the desired result.
Actually you can't just feed in the raw time series data, as the network won't fit to it naturally. The current state of RNNs still requires you to input multiple 'features' (manually or automatically derived) for it to properly learn something useful.
Usually the prior steps needed are:
Detrend
Deseasonalize
Scale (normalize)
A great source of information is this post from a Microsoft researcher which won a time series forecasting competition by the means of a LSTM Network.
Also this post: CNTK - Time series Prediction