How to inject future sequential input for multi-step time series forecasts in LSTM networks - tensorflow

I am trying to do multi-step (i.e., sequence-to-sequence) forecasts for product sales using both (multivariate) sequential and non-sequential inputs.
Specifically, I am using sales numbers as well as some other sequential inputs (e.g., price, is day before holiday, etc...) of the past n days to predict the sales for future m days. Additionally, I have some non-sequential features characterizing the product itself.
Definitions:
n_seq_features <- number of sequential features (in the multivariate time-series) including sales
n_non_seq_features <- number of non-sequential features characterizing a product
I got as far as building a hybrid-model, where first the sequential input is passed through some LSTM layers. The output of the final LSTM layer is then concatenated with the non-sequential features and fed into some dense layers.
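For reference, here is roughly what the current hybrid model looks like (a minimal sketch with placeholder layer sizes):

```python
from tensorflow.keras import layers, Model

n, n_seq_features, n_non_seq_features, m = 30, 5, 8, 7  # placeholder sizes

# sequential branch: past n days of the multivariate time series (incl. sales)
past_in = layers.Input(shape=(n, n_seq_features), name="past_seq")
x = layers.LSTM(64, return_sequences=True)(past_in)
x = layers.LSTM(32)(x)  # output of the final LSTM layer

# non-sequential branch: static product features
static_in = layers.Input(shape=(n_non_seq_features,), name="static")

# concatenate and map to the m-step sales forecast via dense layers
z = layers.Concatenate()([x, static_in])
z = layers.Dense(64, activation="relu")(z)
out = layers.Dense(m, name="sales_forecast")(z)

model = Model(inputs=[past_in, static_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
```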
What I can't quite get my head around, though, is how to feed in future sequential input (everything except the sales numbers for the following m days) in a way that efficiently utilizes the sequential information (i.e., causality, etc.). For m=1, I can simply input the sequential data for this one day together with the non-sequential input after the LSTM layers; however, as soon as m becomes greater than 1 this appears to be a waste of causal information.
The only ways I could think of were:
to incorporate the sequential information for the future m days as features in the LSTM input, blowing up the input shape from (..., n, n_seq_features) to (..., n, n_seq_features + m*(n_seq_features-1))
to add a separate LSTM branch handling the future data, the output of which is then 'somehow' fed into the dense layers at the last stage of the model (roughly as sketched below)
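For the second option, the model would look roughly like this (again only a sketch, extending the one above with an extra input for the known future covariates, i.e. everything except sales for the next m days):

```python
from tensorflow.keras import layers, Model

n, n_seq_features, n_non_seq_features, m = 30, 5, 8, 7  # placeholder sizes

# past window of all sequential features (including sales)
past_in = layers.Input(shape=(n, n_seq_features), name="past_seq")
past_enc = layers.LSTM(64, return_sequences=True)(past_in)
past_enc = layers.LSTM(32)(past_enc)

# separate branch for the known future covariates (sales excluded, hence -1)
future_in = layers.Input(shape=(m, n_seq_features - 1), name="future_seq")
future_enc = layers.LSTM(16)(future_in)

# static product features
static_in = layers.Input(shape=(n_non_seq_features,), name="static")

z = layers.Concatenate()([past_enc, future_enc, static_in])
z = layers.Dense(64, activation="relu")(z)
out = layers.Dense(m, name="sales_forecast")(z)

model = Model(inputs=[past_in, future_in, static_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
```

A possible refinement would be to initialise the future branch's LSTM with the final state of the past encoder (via return_state=True and initial_state=...), so the future covariates are processed conditional on the history rather than in isolation.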
I only started using LSTM networks a while ago so I unfortunately have only limited intuition on how they are best utilized (especially in hybrid approaches). For this reason, I would like to ask:
Is the general approach of injecting sequential and non-sequential input at different stages of the same model (i.e., trained concurrently) useful or would one rather split it into separate models which can be trained independently for more fine-grained control?
How is future sequential input injected into an LSTM network so that causal information is preserved? Can this be achieved with a high-level frontend like Keras, or does it require a very deep dive into the TensorFlow backend?
Are LSTM networks not the way to go for this specific problem in the first place?
Cheers and thanks in advance for any advice, resources or thoughts on the matter. :)

In case someone is having a similar issue with future sequential (or temporal) data, University of Oxford and Google Cloud AI have come up with a new architecture to handle all three types of input (past temporal, future temporal as well as static). It is called Temporal Fusion Transformer and, at least from reading the paper, looks like a neat fit. However, I have yet to implement and test it. There is also a PyTorch Tutorial available.
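From the pytorch-forecasting tutorial, the setup would look roughly like this. This is an untested sketch: the column names are placeholders for my own data, and the exact arguments may differ between library versions.

```python
import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# toy long-format frame: 2 products x 60 days (placeholder data and column names)
df = pd.DataFrame({
    "time_idx": np.tile(np.arange(60), 2),
    "product_id": np.repeat(["A", "B"], 60),
    "price": np.random.rand(120),
    "is_day_before_holiday": np.random.randint(0, 2, 120).astype(float),
    "sales": np.random.rand(120) * 100,
})

training = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="sales",
    group_ids=["product_id"],
    max_encoder_length=30,                     # past n days
    max_prediction_length=7,                   # future m days
    static_categoricals=["product_id"],        # non-sequential (static) features
    time_varying_known_reals=["price", "is_day_before_holiday"],  # known in the future
    time_varying_unknown_reals=["sales"],      # only known in the past
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    loss=QuantileLoss(),
)
```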

Related

How to implement feature importance on nominal categorical features in tree based classifiers?

I am using SKLearn XGBoost model for my binary classification problem. My data contains nominal categorical features (such as race) for which one hot encoding should be used to feed them to the tree based models.
On the other hand, using the feature_importances_ attribute of XGBoost gives us the importance of each column in the trained model. So if I do the encoding and then get the feature importances of the columns, the result will include names like race_2 and their importances.
What should I do to solve this problem and get a whole score for each nominal feature? Can I take the average of one hot encoded columns importance scores that belong to one feature? (like race_1, race_2 and race_3)
First of all, if your goal is to select the most useful features for later training, I would advise you to use regularization in your model. In the case of xgboost, you can tune the parameter gamma so the model would actually be more dependent on "more useful" features (i.e. tune the minimum loss reduction required for the model to add a partition leaf). Here is a good article on implementing regularization into xgboost models.
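For illustration, here is a minimal sketch of setting gamma on the sklearn-style xgboost estimator (toy data; the actual value needs tuning, e.g. via cross-validation):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# gamma = minimum loss reduction required to make a further partition on a leaf;
# larger values prune splits that rely on weakly useful features
model = XGBClassifier(gamma=1.0, n_estimators=200)
model.fit(X, y)
```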
On the other hand, if you insist on doing feature importance, I would say grouping the encoded variables and simply adding them is not a good decision. This would result in feature-importance results that do not consider the relationship between these dummy variables.
My suggestion would be to take a look at permutation importance for this. The basic idea is that you take your dataset, shuffle the values in the column whose importance you want to measure, score the (already fitted) model and record the result. Repeating this over the different columns, the drop in performance caused by each shuffle is a sign of that feature's importance.
It is actually easier done than said: sklearn has this built in for you; check out the example provided here.
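A minimal sketch of sklearn's built-in permutation importance on a fitted model (toy data; with one-hot encoded columns, each dummy column simply appears as a separate entry here):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200).fit(X_train, y_train)

# shuffle one column at a time on held-out data and measure the score drop
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```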

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes and I want to compare four of them against one and the same class. This isn't a One vs Rest classifier, as for each output I want to score them against one base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that's wasting computation time if the first layers are BERT preprocessing + pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (finetuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
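Roughly what I have in mind is the following sketch. It assumes the Hugging Face transformers Keras interface (TFBertModel); for simplicity it shares all of BERT and only branches at the final Dense heads, and num_tasks and max_len are placeholders.

```python
import tensorflow as tf
from transformers import TFBertModel

num_tasks = 4   # base class vs. A/B/C/D
max_len = 128   # placeholder sequence length

bert = TFBertModel.from_pretrained("bert-base-uncased")

input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

# shared encoder: pooled [CLS] representation
pooled = bert(input_ids, attention_mask=attention_mask).pooler_output

# one small binary head per task, sharing the encoder underneath
outputs = [
    tf.keras.layers.Dense(1, activation="sigmoid", name=f"task_{i}")(pooled)
    for i in range(num_tasks)
]

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="binary_crossentropy")
```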
I have tried many alternative options, including multi-class and multi-label classifiers, with actual and generated (e.g., machine-annotated) labels in the case of multiple input labels, and different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?
When you train a DNN for a specific task, it will (in the vast majority of cases) be better than a more general model that handles several tasks simultaneously. That said, in my experience a properly trained general model produces results very similar to the original binary ones. Anyway, here are a couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers and use them to label your datasets (i.e., use the binary classifier trained on dataset 2 to label datasets 1, 3 and 4). Then train your joint model as a multi-label task using all the newly labelled datasets (don't forget to randomize the samples before feeding them to the trainer ;) ). Here you will need to experiment with whether to apply a threshold and set each label to 0/1, or to use the raw scores of the binary classifiers.
Create a custom loss function that does not penalize a class when no label information is provided for it. So when you introduce a sample from (say) dataset 2, the loss is calculated only for the 2nd class.
Of course you can apply both simultaneously. For example, if you know that the binary classifiers produce polarized scores (most results are near 0 or 1), you can use weak labels and automatically label your data with those scores. Then, during the second stage, weight each sample's loss contribution by x' = 4(x - 0.5)^2, where x is the weak-label score (note that you get logits from the model, so you will need to apply a sigmoid first). This way you increase the contribution of the samples the binary classifier is confident about and reduce that of the less certain ones.
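A sketch of what such a loss could look like in Keras (my assumption here: the joint model has one sigmoid output per class, and labels are encoded with -1 meaning "no information for this class"):

```python
import tensorflow as tf

def masked_binary_crossentropy(y_true, y_pred):
    """Binary cross-entropy that ignores classes with no label information.

    y_true: (batch, num_classes) with entries in {0, 1, -1}; -1 means the
            sample's source dataset provides no label for that class.
    y_pred: (batch, num_classes) sigmoid outputs of the joint model.
    """
    mask = tf.cast(tf.not_equal(y_true, -1.0), y_pred.dtype)
    y_clipped = tf.clip_by_value(y_true, 0.0, 1.0)          # replace -1 by a dummy 0
    eps = tf.keras.backend.epsilon()
    p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    bce = -(y_clipped * tf.math.log(p) + (1.0 - y_clipped) * tf.math.log(1.0 - p))
    # average only over the classes that actually carry a label
    return tf.reduce_sum(bce * mask, axis=-1) / tf.maximum(tf.reduce_sum(mask, axis=-1), 1.0)

# The confidence weighting from the weak labels (x' = 4*(x - 0.5)**2) could be
# passed per sample via sample_weight in model.fit, or multiplied into bce above.
```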
As for unfreezing the last layers of BERT, unfreezing the upper 3-6 layers is usually enough. Unfreezing more layers improves results very little while increasing time and memory requirements.

TensorFlow model with time series data, having different input shapes for training and prediction

I have a somewhat decent working neural net, using mostly LSTM, Dropout and Dense layers. I usually use it for sales prediction only, but now my issue is that I'd like to train and predict on datasets of different shapes.
I have several columns showing marketing spending per channel, as well as sales for different products. Below you find an image illustrating the dataset. Now, the orange data (marketing channels and product sales) are supposed to be the training data. When I do a many-to-many prediction, I could just forecast all the columns, like I do when I have a dataset containing only sales.
But I already know the marketing spending for the future, because it is already planned ahead. For that I could just use pystats (OLS, for example), but LSTMs are really good at remembering past marketing spending and sales.
Actual Question:
Is there a way to utilise a TensorFlow neural net with different input shapes for the training and test data? The test data in this case would be either actual test data or the actual future.
Or any other comparable model? Unfortunately, I have not found any solution during my research.
Thanks for your time.

Time series classification using LSTM - How to approach?

I am working on an experiment with LSTM for time series classification and I have been going through several HOWTOs, but still, I am struggling with some very basic questions:
Is the main idea of training the LSTM to take the same sample from every time series?
E.g., if I have time series A (with samples a1, a2, a3, a4), B (b1, b2, b3, b4) and C (c1, c2, c3, c4), will I feed the LSTM batches of (a1, b1, c1), then (a2, b2, c2), etc.? Meaning that all time series need to be of the same size/number of samples?
If so, could anyone more experienced be so kind as to describe to me, very simply, how to approach the whole process of training the LSTM and building the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then your data should be a time series plus a label. During training, you feed each series into the LSTM, look only at the last output and backprop as necessary.
Judging from your question, you are probably confused about batching -- you can train on multiple items at once. However, each item in the batch gets its own hidden state, and only the parameters of the layers are shared and updated.
The time series within a single batch should be of the same length. You should terminate each sequence with an END token and pad sequences that are too short with a special PAD token -- the LSTM should learn that PADs after an END are useless.
There is no need for different batches to have the same number of items, nor to have items of the same length.
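A minimal Keras sketch of this setup (here padding is handled with a dedicated value plus a Masking layer so the LSTM ignores the padded steps; an explicit END token would be an additional refinement, and all sizes are placeholders):

```python
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# three toy univariate series of different lengths, each with a class label
series = [np.array([0.1, 0.4, 0.3, 0.9]),
          np.array([0.2, 0.8]),
          np.array([0.5, 0.1, 0.7])]
labels = np.array([0, 1, 0])

PAD = -1.0  # padding value, masked out below
X = pad_sequences(series, padding="post", value=PAD, dtype="float32")
X = X[..., np.newaxis]                      # shape (batch, timesteps, 1 feature)

inp = layers.Input(shape=(None, 1))         # variable-length sequences
x = layers.Masking(mask_value=PAD)(inp)     # LSTM skips the padded timesteps
x = layers.LSTM(32)(x)                      # only the last output is used
out = layers.Dense(1, activation="sigmoid")(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, labels, epochs=2, verbose=0)
```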

Using unlabeled dataset in Keras

Usually, when using Keras, the datasets used to train the neural network are labeled.
For example, if I have 100,000 rows of patients with 12 fields per row, the last field will indicate whether the patient is diabetic or not (0 or 1).
And then after training is finished I can insert a new record and predict whether this person is diabetic or not.
But in the case of unlabeled datasets, where I cannot label the data for some reason, how can I train the neural network so that it learns what the normal records look like, and flags any new record that does not match them as malicious or not accepted?
This is called one-class learning and is usually done using autoencoders. You train an autoencoder on the training data to reconstruct the data itself; the label in this case is the input itself. This gives you a reconstruction error. https://en.wikipedia.org/wiki/Autoencoder
Now you can define a threshold on the reconstruction error to decide whether a record is benign or not. The hope is that the reconstruction of the good data is better than the reconstruction of the bad data.
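A minimal sketch of this idea in Keras, using random stand-in data for the 12-field records from the question (the threshold choice, e.g. a percentile of the training reconstruction errors, is up to you):

```python
import numpy as np
from tensorflow.keras import layers, Model

n_features = 12
X_train = np.random.rand(1000, n_features).astype("float32")  # stand-in for normal records

inp = layers.Input(shape=(n_features,))
encoded = layers.Dense(6, activation="relu")(inp)             # compressed representation
decoded = layers.Dense(n_features, activation="linear")(encoded)

autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=10, batch_size=32, verbose=0)  # target = input

# reconstruction error per record; pick a threshold, e.g. the 95th percentile
train_err = np.mean((autoencoder.predict(X_train, verbose=0) - X_train) ** 2, axis=1)
threshold = np.percentile(train_err, 95)

def is_suspicious(record):
    err = np.mean((autoencoder.predict(record[np.newaxis], verbose=0) - record) ** 2)
    return err > threshold
```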
Edit to answer the question about the difference in performance between supervised and unsupervised learning.
This cannot be said with any certainty, because I have not tried it and I do not know what the final accuracy will be. But as a rough estimate, supervised learning will perform better on data similar to the training data, because more information is supplied to the algorithm. However, if the actual data is quite different from the training data, the network will underperform in practice, while the autoencoder tends to cope better with unfamiliar data. Additionally, as a rule of thumb you should have about 5,000 examples per class to train a neural network reliably, so labeling could take some time. But you will need some labeled data for testing anyway.
It sounds like you need to fit two different models:
a model for bad record detection
a model for prediction of a patient's likelihood to be diabetic
For both of these models you will need labels. For the first model, the labels would indicate whether the record is good or bad (malicious); for the second, whether the patient is diabetic or not.
In order to detect bad records, you may find that simple logistic regression or SVM performs adequately.
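For instance, a minimal sketch of such a record classifier with scikit-learn (toy data; here y = 1 marks a bad/malicious record):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy stand-in: X = the 12 record fields, y = 1 for bad records, 0 for good ones
X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(clf.predict_proba(X[:5]))  # probability of each record being bad
```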