Loss function for ordinal multi classification in pytorch - optimization

I am a beginner with DNNs and PyTorch.
I am dealing with a multi-class classification problem where my labels are encoded as one-hot vectors, say of dimension D.
For this I am using CrossEntropyLoss. However, I now want to modify or replace this criterion so that it penalizes predictions far from the true value more heavily: for example, predicting 4 instead of 5 should be better than predicting 2 instead of 5.
Is there a built-in function in PyTorch that implements this behavior? Otherwise, how can I modify CrossEntropyLoss to achieve it?

This could help you. It is a PyTorch implementation of ordinal regression:
https://www.ethanrosenthal.com/2018/12/06/spacecutter-ordinal-regression/
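
One simple way to make the loss distance-aware, separate from the ordinal regression approach in the link, is to add the expected distance between the predicted class and the true class to the usual cross-entropy. The sketch below is written in plain NumPy to show the arithmetic only; the function name ordinal_penalized_ce and the weight alpha are assumptions made for illustration, and targets are integer class indices (take argmax of the one-hot labels first). For training, the same expression would have to be written with torch tensor operations so that it stays differentiable.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ordinal_penalized_ce(logits, targets, alpha=1.0):
    """Cross-entropy plus alpha * expected |class index - true class| (illustrative)."""
    probs = softmax(logits)                               # (N, D)
    n, d = probs.shape
    ce = -np.log(probs[np.arange(n), targets])            # (N,)
    classes = np.arange(d)                                # (D,)
    dist = np.abs(classes[None, :] - targets[:, None])    # (N, D) distance to the true class
    expected_dist = (probs * dist).sum(axis=1)            # (N,)
    return float((ce + alpha * expected_dist).mean())

true_class = np.array([5])
near_miss = np.zeros((1, 6)); near_miss[0, 4] = 4.0   # most probability mass on class 4
far_miss = np.zeros((1, 6)); far_miss[0, 2] = 4.0     # most probability mass on class 2

loss_near = ordinal_penalized_ce(near_miss, true_class)
loss_far = ordinal_penalized_ce(far_miss, true_class)
```

With alpha=0 the two predictions get the same loss, because plain cross-entropy only looks at the probability of the true class; with alpha>0 the prediction centred on class 2 is penalized more than the one centred on class 4.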

Related

Is there any difference between keras.utils.to_categorical and pd.get_dummies?

I think sklearn.OneHotEncoder, pandas.get_dummies, and keras.to_categorical serve the same purpose, but I don't know the difference between them.
Apart from the input/output types there is no difference; they all achieve the same result.
There are some technical differences:
Keras is very simple: you give it the target vector and it one-hot encodes it. Use Keras if you need to encode the label vector.
Pandas is the most complex: it creates a new column for every class in the data. The good part is that it works on DataFrames where you want to one-hot encode only one of the columns (so it is more of a multi-purpose method, but not the preferable option if you need to train a NN).
Sklearn lets you one-hot encode multiple features in the same variable, which is a bit more flexible than what Keras offers. If the Keras method is too simple, try sklearn; if Keras is enough, stick with it.
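
To make "the same result" concrete, here is a minimal NumPy sketch of the matrix that all three helpers produce, up to type and column naming, for a vector of integer labels. It does not call any of the three libraries; the variable names are made up for the example.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # integer class ids
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label array encodes the whole vector at once.
one_hot = np.eye(num_classes, dtype="float32")[labels]

# argmax along the class axis recovers the original integer labels.
decoded = one_hot.argmax(axis=1)
```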

Calculate average and class-wise precision/recall for multiple classes in TensorFlow

I have a multiclass model with 4 classes. I have already implemented a callback able to calculate the precision/recall for each class and their macro average. But for some technical reason, I have to calculate them using the metrics mechanism.
I'm using TensorFlow 2 and Keras 2.3.0. I have already used the tensorflow.keras.metrics.Recall/Precision to get the class-wise metrics:
from tensorflow.keras.metrics import Precision, Recall

# One Recall and one Precision metric per class, selected via class_id.
metrics_list = ['accuracy']
metrics_list.extend([Recall(class_id=i, name="recall_{}".format(label_names[i])) for i in range(n_category)])
metrics_list.extend([Precision(class_id=i, name="precision_{}".format(label_names[i])) for i in range(n_category)])
model = Model(...)
model.compile(..., metrics=metrics_list)
However, this solution is not satisfying:
First, tensorflow.keras.metrics.Recall/Precision uses a threshold to decide class membership, whereas it should use argmax to pick the most probable class when class_id is defined.
Second, I have to create 2 new metrics to compute the average over all classes, which in turn requires computing the class-wise metrics. Calculating the same thing twice is inelegant and inefficient.
Is there a way to create a class or a function that directly calculates the class-wise and average precision/recall using the TensorFlow/Keras metrics logic?
Apparently I can easily obtain the confusion matrix using tf.math.confusion_matrix(). However, I do not see how to report a list of scalars at once instead of returning a single scalar.
Any comment is welcome!
It turns out that in my very specific case I can simply use CategoricalAccuracy() as the only metric, because I'm using batch_size=1. In this case, accuracy=recall=precision={1.|0.} for a batch. That only partially solves the problem. The best solution would be to update the confusion matrix using argmax at the end of each batch, then calculate the precision/recall from it. I don't know how to do that yet, but it should be doable.
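
The confusion-matrix idea described above can be sketched in plain NumPy as follows. ConfusionMatrixMetrics is a hypothetical helper, not a Keras class; it only shows the accumulation and the arithmetic. Wrapping the same logic in a tf.keras.metrics.Metric subclass is left open here, since a metric's result is normally a single value.

```python
import numpy as np

class ConfusionMatrixMetrics:
    """Hypothetical helper: accumulates an argmax-based confusion matrix over batches."""

    def __init__(self, n_classes):
        # Rows index the true class, columns the predicted class.
        self.cm = np.zeros((n_classes, n_classes), dtype=np.int64)

    def update(self, y_true_onehot, y_prob):
        true_ids = np.argmax(y_true_onehot, axis=1)
        pred_ids = np.argmax(y_prob, axis=1)      # most probable class, no threshold
        np.add.at(self.cm, (true_ids, pred_ids), 1)

    def result(self):
        tp = np.diag(self.cm).astype(float)
        predicted = self.cm.sum(axis=0)           # column sums: times each class was predicted
        actual = self.cm.sum(axis=1)              # row sums: times each class occurred
        precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
        recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
        # Class-wise values plus their macro averages, computed from one matrix.
        return precision, recall, precision.mean(), recall.mean()

metrics = ConfusionMatrixMetrics(n_classes=3)
y_true = np.eye(3)[[0, 1, 2, 2]]
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
metrics.update(y_true, y_prob)
precision, recall, macro_precision, macro_recall = metrics.result()
```

Because everything is derived from the one accumulated matrix, the class-wise values and the averages are not computed twice.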

RNN LSTM Keras custom loss function

I'm just starting out with Keras and TensorFlow.
I have an LSTM model learning on a dataset of stock prices.
I don't want my model to learn to predict the next steps, as it does today. I want it to learn, at each step, whether it should buy, sell or do nothing, and how much.
I think I need to write a custom loss function, but I really don't know how to code my concept: buy, sell or do nothing, and how much, based on a starting capital of, say, 100 units. The objective would be to have the highest possible capital at the end.
Should I take an existing function such as MSE and customise it? If yes, how?
Should I let my model learn the time series and then add buy/sell layer(s)? If yes, how?
Something else?
I am pretty lost.
Thanks a lot for your help.
Sam
I would try categorical cross-entropy.
You have three options: buy (0), sell (1), and do nothing (2). You can encode them like this:
[1,0,0] <- means 'buy'
[0,1,0] <- means 'sell'
[0,0,1] <- means 'do nothing'
And don't forget to add a softmax function at the end of your NN.
As I understand it, we have a stock price dataset and, at each point, we need to predict the decision: buy/sell/nothing.
For each point, we should choose a window size that we believe influences the current point.
Use this window as the time series input to the LSTM layer. Using a moving window, we can create multiple inputs. The corresponding output is the decision, which can be encoded in 3 bits.
For time point t, use the time series (0..t-1) as input and the decision [0,0,1], [0,1,0] or [1,0,0] as output. The model will learn to predict probabilities for each decision.
To compute the loss, categorical cross-entropy will be useful, as mentioned by Paddy.
Also, if you haven't looked into pre-processing the data, detrending it is useful in such cases. This link might be useful.
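
The moving-window construction can be sketched like this in NumPy. The function make_windows and, in particular, the labelling rule (buy if the next price rises by more than a threshold, sell if it falls by more than that, otherwise do nothing) are assumptions made purely for illustration; the answers above do not say how the target decisions are obtained.

```python
import numpy as np

def make_windows(prices, window, threshold=0.01):
    """Build (samples, window, 1) inputs and one-hot buy/sell/nothing targets.

    Labelling rule (an assumption, not from the thread): compare the price at
    time t with the last price in the window and apply a relative threshold.
    """
    prices = np.asarray(prices, dtype="float32")
    xs, ys = [], []
    for t in range(window, len(prices)):
        past = prices[t - window:t]                 # the window ending at t-1
        change = (prices[t] - past[-1]) / past[-1]
        if change > threshold:
            label = 0                               # buy
        elif change < -threshold:
            label = 1                               # sell
        else:
            label = 2                               # do nothing
        xs.append(past)
        ys.append(label)
    x = np.array(xs)[..., None]                     # (samples, window, 1) for an LSTM
    y = np.eye(3, dtype="float32")[ys]              # (samples, 3) one-hot decisions
    return x, y

prices = [100, 101, 103, 102, 102, 99, 104]
x, y = make_windows(prices, window=3)
```

The arrays x and y have the shapes expected by an LSTM layer followed by a 3-unit softmax output trained with categorical cross-entropy.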

taking the gradient in Tensorflow, tf.gradient

I am using this TensorFlow function to get the Jacobian of my function. I came across two problems:
1. The TensorFlow documentation contradicts itself in the following two paragraphs, if I am not mistaken:
"gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys."
"Returns:
A list of sum(dy/dx) for each x in xs."
According to my test, it in fact returns a vector of length len(ys), which is the sum(dy/dx) for each x in xs.
2. I do not understand why they designed it so that the return value is the sum of the columns (or rows, depending on how you define your Jacobian).
3. How can I really get the Jacobian?
4. In the loss, I need the partial derivative of my function with respect to the input (x). However, when I optimize with respect to the network weights, I define x as a placeholder whose value is fed later, while the weights are variables. In this case, can I still define the symbolic derivative of the function with respect to the input (x) and put it in the loss (which, when we later optimize with respect to the weights, will bring in second-order derivatives of the function)?
1. I think you are right and there is a typo there; it was probably meant to be "of length len(ys)".
2. For efficiency. I can't explain the exact reasoning, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
3. There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and again issue #675. Basically, you need one call to tf.gradients per column/row.
4. Yes, of course. You can compute whatever gradients you want; there is really no difference between a placeholder and any other operation. A few operations do not have a gradient because it is not well defined or not implemented (in which case it will generally return 0), but that's all.
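
The relationship between the summed gradient and the full Jacobian can be illustrated numerically. The sketch below uses NumPy with central finite differences as a stand-in for automatic differentiation; it does not call TensorFlow, and the example function f is made up. Each per-component gradient plays the role of one tf.gradients call on a single output.

```python
import numpy as np

def f(x):
    # Example function R^2 -> R^3 (hypothetical, chosen only for illustration).
    return np.array([x[0] ** 2, x[0] * x[1], np.sin(x[1])])

def grad_of_scalar(g, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued function g at x."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grad[j] = (g(x + step) - g(x - step)) / (2 * eps)
    return grad

x0 = np.array([1.5, 0.5])
n_outputs = f(x0).size

# One gradient computation per output component gives one Jacobian row each.
jacobian = np.stack(
    [grad_of_scalar(lambda x, i=i: f(x)[i], x0) for i in range(n_outputs)]
)

# Differentiating all outputs at once amounts to differentiating their sum,
# which equals the sum of the Jacobian rows: one entry per component of x.
summed = grad_of_scalar(lambda x: f(x).sum(), x0)
```

Here jacobian has shape (3, 2), while summed has shape (2,) and matches jacobian.sum(axis=0) up to finite-difference error.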

Should my seq2seq RNN idea work?

I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the prediction output into the next input timestep.
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace the last 3 elements of the input with zeros to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements, the last 3 of which are zeros. The predictions I care about are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it adds nothing to the model's learning while reducing its capacity, as you force the model to learn an identity mapping for the first 47 steps (50-3). Note that providing 0 as input is equivalent to providing no input to an RNN: a zero input multiplied by a weight matrix is still zero, so the only sources of information are the bias and the output from the previous timestep, both of which are already there in the original formulation. As for the second addition, the outputs for the first 47 steps: there is nothing to be gained by learning the identity mapping, yet the network will have to "pay the price" for it, since it will need to use weights to encode this mapping in order not to be penalised.
So in short: yes, your idea will work, but it is nearly impossible to get better results this way than with the original approach (you do not provide any new information and do not really modify the learning dynamics, yet you limit capacity by requiring the identity mapping to be learned per step; and since it is an extremely easy thing to learn, gradient descent will discover this relation first, before even trying to "model the future").
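
The point about zero inputs can be checked with a single vanilla RNN step. This is a minimal NumPy sketch with randomly initialised, made-up parameters (W_x, W_h, b); it is not a trained model and uses a plain tanh cell rather than an LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, features = 4, 1

# Randomly initialised parameters of one vanilla RNN cell (illustrative only).
W_x = rng.normal(size=(hidden, features))   # input-to-hidden weights
W_h = rng.normal(size=(hidden, hidden))     # hidden-to-hidden weights
b = rng.normal(size=hidden)
h_prev = rng.normal(size=hidden)

def rnn_step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Mask the last 3 steps of a 50-step sequence, as proposed in the question.
sequence = rng.normal(size=(50, features))
masked = sequence.copy()
masked[-3:] = 0.0

# With a zero input the W_x term vanishes, so the new state depends only
# on the previous state and the bias.
h_zero_input = rnn_step(masked[-1], h_prev)
h_no_input = np.tanh(W_h @ h_prev + b)
```

The two states are identical, which is why zeroing the last 3 inputs gives the network nothing beyond what the recurrent state and bias already carry.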