Predict a nonlinear array based on 2 features with scalar values using XGBoost or equivalent - xgboost

So I have been looking at XGBoost as a place to start with this, however I am not sure the best way to accomplish what I want.
My data is set up something like this
Where every value, whether it be input or output is numerical. The issue I'm facing is that I only have 3 input data points per several output data points.
I have seen that XGBoost has a multi-output regression method, however I am only really seeing it used to predict around 2 outputs per 1 input, whereas my data may have upwards of 50 output points needing to be predicted with only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking at mainly these two demos (they are the same idea just one is scikit and the other xgboost)
https://machinelearningmastery.com/multi-output-regression-models-with-python/
https://xgboost.readthedocs.io/en/stable/python/examples/multioutput_regression.html

Related

Multiple-input multiple-output CNN with custom loss function

I have a set of 2D input arrays m x n namely A,B,C and I have to predict two 2D output arrays namely d,e for which I do have the expected values. You can think of the inputs/outputs as grey images if you like.
Because of the spatial information is relevant (these are actually 2D physical domains) I want to use a Convolutional Neural Network to predict d and e. My design (not tested yet) looks as follows:
Because I have multiple inputs, I guess I should use multiple columns (or branches) to find different features for each of the inputs (they look fairly different). Each of these columns follows a encoding-decoding architecture used in segmentation (see SegNet): Conv2D block involves a convolution+batch normalisation+ReLU layer. Deconv2D involves a deconvolution+batch normalisation+ReLU.
Then, I can merge the output of each column by either concatenating, averaging or taking the maximum for example. To obtain the original m x n shape for each of the outputs I have seen I could do this with a 1 x 1 kernel convolution.
I want to predict the two outputs from that single layer. Is that okay from the network structure point of view? Finally my loss function depends on the outputs themselves compared to the target plus another relation I want to impose.
A would like to have some expert opinion on this since this is my first design of a CNN and I am not sure if I it makes sense as it is now and/or if there are better approaches (or network architectures) to this problem.
I posted this originally in datascience but I did not get much feedback. I am now posting it here since there is a bigger community on these topics plus I would be very grateful to receive implementation tips beside network architectural ones. Thanks.
I think your design makes sense in general:
since A, B, and C are fairly different, you make each input a transform sub-network, and then fuse them together, which is your intermediate representation.
from the intermediate representation, you apply additional CNN to decode D and E, respectively.
Several things:
A, B, and C looking different does not necessarily mean you can't stack them together as a 3-channel input. The decision should be made upon the fact that whether the values in A, B, and C mean differently or not. For example, if A is a gray scale image, B is a depth map, C is a also a gray image captured by a different camera. Then A and B are better processed in your suggested way, but A and C can be concatenated as one single input before feeding it to your network.
D and E are two outputs of the network and will be trained in the multi-task manner. Of course, they should share some latent feature, and one should split at this feature to apply a down-stream non-shared weight branch for each output. However, where to split is usually tricky.
It is really a broad question, asking for answers relying mostly on opinions. Here are my two cents though, which you might find interesting as it does not go along the previous answers here and on datascience.
First, I wouldn't go with separate columns for each input. AFAIK, when different inputs are processed by different columns, it is almost always the case that the network is some sort of Siemese network and the columns share the same weights; or at least the columns all need to produce a similar code. It is not your case here, so I would simply not bother.
Second, you are blessed with a problem with a dense output and no need to learn a code. This should direct you straight to U-nets, which outperforms any bottleneck-designed network without much effort. U-nets were introduced for dense segmentation but they shine at any dense-output problem really.
In short, just stack your inputs together and use a U-net.

Should my seq2seq RNN idea work?

I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction into the next input timestep like this:
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace last 3 elements of the input by zero to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements. The last 3 are zeros. The predictions I care are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces its capacity, as you force your model to learn identity mapping for first 47 steps (50-3). Note, that providing 0 as inputs is equivalent of not providing input for an RNN, as zero input, after multiplying by a weight matrix is still zero, so the only source of information is bias and output from previous timestep - both are already there in the original formulation. Now second addon, where we have output for first 47 steps - there is nothing to be gained by learning the identity mapping, yet network will have to "pay the price" for it - it will need to use weights to encode this mapping in order not to be penalised.
So in short - yes, your idea will work, but it is nearly impossible to get better results this way as compared to the original approach (as you do not provide any new information, do not really modify learning dynamics, yet you limit capacity by requesting identity mapping to be learned per-step; especially that it is an extremely easy thing to learn, so gradient descent will discover this relation first, before even trying to "model the future").

Is multiple regression the best approach for optimization?

I am being asked to take a look at a scenario where a company has many projects that they wish to complete, but with any company budget comes into play. There is a Y value of a predefined score, with multiple X inputs. There are also 3 main constraints of Capital Costs, Expense Cost and Time for Completion in Months.
The ask is could an algorithmic approach be used to optimize which projects should be done for the year given the 3 constraints. The approach also should give different results if the constraint values change. The suggested method is multiple regression. Though I have looked into different approaches in detail. I would like to ask the wider community, if anyone has dealt with a similar problem, and what approaches have you used.
Fisrt thing we should understood, a conclution of something is not base on one argument.
this is from communication theory, that every human make a frame of knowledge (understanding conclution), where the frame construct from many piece of knowledge / information).
the concequence is we cannot use single linear regression in math to create a ML / DL system.
at least we should use two different variabel to make a sub conclution. if we push to use single variable with use linear regression (y=mx+c). it's similar to push computer predict something with low accuration. what ever optimization method that you pick...it's still low accuracy..., why...because linear regresion if you use in real life, it similar with predict 'habbit' base on data, not calculating the real condition.
that's means...., we should use multiple linear regression (y=m1x1+m2x2+ ... + c) to calculate anything in order to make computer understood / have conclution / create model of regression. but, not so simple like it. because of computer try to make a conclution from data that have multiple character / varians ... you must classified the data and the conclution.
for an example, try to make computer understood phitagoras.
we know that phitagoras formula is c=((a^2)+(b^2))^(1/2), and we want our computer can make prediction the phitagoras side (c) from two input values (a and b). so to do that, we should make a model or a mutiple linear regresion formula of phitagoras.
step 1 of course we should make a multi character data of phitagoras.
this is an example
a b c
3 4 5
8 6 10
3 14 etc..., try put 10 until 20 data
try to make a conclution of regression formula with multiple regression to predic the c base on a and b values.
you will found that some data have high accuration (higher than 98%) for some value and some value is not to accurate (under 90%). example a=3 and b=14 or b=15, will give low accuration result (under 90%).
so you must make and optimization....but how to do it...
I know many method to optimize, but i found in manual way, if I exclude the data that giving low accuracy result and put them in different group then, recalculate again to the data group that excluded, i will get more significant result. do again...until you reach the accuracy target that you want.
each group data, that have a new regression, is a new class.
means i will have several multiple regression base on data that i input (the regression come from each group of data / class) and the accuracy is really high, 99% - 99.99%.
and with the several class, the regresion have a fuction as a 'label' of the class, this is what happens in the backgroud of the automation computation. but with many module, the user of the module, feel put 'string' object as label, but the truth is, the string object binding to a regresion that constructed as label.
with some conditional parameter you can get the good ML with minimum number of data train.
try it on excel / libreoffice before step more further...
try to follow the tutorial from this video
and implement it in simple data that easy to construct in excel, like pythagoras.
so the answer is yes...the multiple regression is the best approach for optimization.

What is the output of XGboost using 'rank:pairwise'?

I use the python implementation of XGBoost. One of the objectives is rank:pairwise and it minimizes the pairwise loss (Documentation). However, it does not say anything about the scope of the output. I see numbers between -10 and 10, but can it be in principle -inf to inf?
good question. you may have a look in kaggle competition:
Actually, in Learning to Rank field, we are trying to predict the relative score for each document to a specific query. That is, this is not a regression problem or classification problem. Hence, if a document, attached to a query, gets a negative predict score, it means and only means that it's relatively less relative to the query, when comparing to other document(s), with positive scores.
It gives predicted score for ranking.
However, the scores are valid for ranking only in their own groups.
So we must set the groups for input data.
For esay ranking, refer to my project xgboostExtension
If I understand your questions correctly, you mean the output of the predict function on a model fitted using rank:pairwise.
Predict gives the predicted variable (y_hat).
This is the same for reg:linear / binary:logistic etc. The only difference is that reg:linear builds trees to Min(RMSE(y, y_hat)), while rank:pairwise build trees to Max(Map(Rank(y), Rank(y_hat))). However, output is always y_hat.
Depending on the values of your dependent variables, output can be anything. But I typically expect output to be much smaller in variance vs the dependent variable. This is usually the case as it is not necessary to fit extreme data values, the tree just needs to produce predictors that are large/small enough to be ranked first/last in the group.

How to plot a Pearson correlation given a time series?

I am using the code in this website http://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html to implement a Pearson Correlation given two time series data like so:
require 'gsl'
pearson_correlation = GSL::Stats::correlation(
GSL::Vector.alloc(first_metrics),GSL::Vector.alloc(second_metrics)
)
This returns a number such as -0.2352461593569471.
I'm currently using the highcharts library and am feeding it two sets of timeseries data. Given that I have a finite time series for both sets, can I do something with this number (-0.2352461593569471) to create a third time series showing the slope of this curve? If anyone can point me in the right direction I'd really appreciate it!
No, correlation doesn't tell you anything about the slope of the line of best fit. It just tells you approximately how much of the variability in one variable (or one time series, in this case) can be explained by the other. There is a reasonably good description here: http://www.graphpad.com/support/faqid/1141/.
How you deal with the data in your specific case is highly dependent on what you're trying to achieve. Are you trying to show that variable X causes variable Y? If so, you could start by dropping the time-series-ness, and just treat the data as paired values, and use linear regression. If you're trying to find a model of how X and Y vary together over time, you could look at multivariate linear regression (I'm not very familiar with this, though).