How to effectively use knn in Stata - error-handling

I have two questions with executing discrim knn in Stata.
1) How do you properly code the command? I've tried various versions, but seem to always get an error that there are too many variables specified.
The vector with the correct result is buy.
I am trying: discrim knn buy, group(train test) k(1)
2) My understanding with KNN was that factor variables (binary) were fine for using KNN, even encouraged. However I get the error message that factor variables and time-series operators not allowed.
Lastly, though I know this isn't the best space for this question, should each vector be normalized for knn? I've heard conflicting responses.

I'm guessing that the error you're getting is
group(): too many variables specified
This is because you can only group by 1 variable with knn. knn performs discriminant analysis based on a single grouping variable, in your case, distinguishing the training from the test. I imagine your train and test variables are binary, in which case using only one of the variables is enough, as they are merely logical opposites of each other. A single variable has enough information to distinguish the two groups.

Related

Solving an optimization problem bounded by conditional constrains

Basically, I have a dataset that contains 'weights' for some (207) variables, some are more important than the others for determining the class variable (binary) and therefore they are bigger etc. at the end all weigths are summed up across all columns so that the resulting cumulative weight is obtained for each observation.
If this weight is higher then some number then class variable is 1 otherwise is 0. I do have true labels for a class variable so the problem is to minimize false positives.
The thing is, for me it looks like a OR problem as it's about finding optimal weights. However, I am not sure if there is an OR method for such problem, at least I have not heard about one. Question is: does anyone recognize this type of problems and can send some keywords for me to research?
Another thing of course is to predict that with machine learning rather then deterministic methods but I need to do it this way.
Thank you!
Are the variables discrete (integer numbers etc) or continuous (floating point numbers)?
If they are discrete, it sounds like the knapsack problem, which constraint solvers like OptaPlanner (see this training that builds a knapsack solver) excel at.
If they are continuous, look for an LP solver, like CPLEX.
Either way, you'll get much better results than machine learning approaches, because neural nets et al are great at pattern recognition use cases (image/voice recognition, prediction, catagorization, ...), but consistently inferior for constraint optimization problems (like this, I presume).

Inference on several inputs in order to calculate the loss function

I am modeling a perceptual process in tensorflow. In the setup I am interested in, the modeled agent is playing a resource game: it has to choose 1 out of n resouces, by relying only on the label that a classifier gives to the resource. Each resource is an ordered pair of two reals. The classifier only sees the first real, but payoffs depend on the second. There is a function taking first to second.
Anyway, ideally I'd like to train the classifier in the following way:
In each run, the classifier give labels to n resources.
The agent then gets the payoff of the resource corresponding to the highest label in some predetermined ranking (say, A > B > C > D), and randomly in case of draw.
The loss is taken to be the normalized absolute difference between the payoff thus obtained and the maximum payoff in the set of resources. I.e., (Payoff_max - Payoff) / Payoff_max
For this to work, one needs to run inference n times, once for each resource, before calculating the loss. Is there a way to do this in tensorflow? If I am tackling the problem in the wrong way feel free to say so, too.
I don't have much knowledge in ML aspects of this, but from programming point of view, I can see doing it in two ways. One is by copying your model n times. All the copies can share the same variables. The output of all of these copies would go into some function that determines the the highest label. As long as this function is differentiable, variables are shared, and n is not too large, it should work. You would need to feed all n inputs together. Note that, backprop will run through each copy and update your weights n times. This is generally not a problem, but if it is, I heart about some fancy tricks one can do by using partial_run.
Another way is to use tf.while_loop. It is pretty clever - it stores activations from each run of the loop and can do backprop through them. The only tricky part should be to accumulate the inference results before feeding them to your loss. Take a look at TensorArray for this. This question can be helpful: Using TensorArrays in the context of a while_loop to accumulate values

xgboost vs H2o Gradient Boosting

I have a dataset having a large missing values (more than 40% missing). Genrated a model in xgboost and H2o gradient boosting - got a decent model in both cases. However, the xgboost shows this variable as one of the key contributors to the model but as per H2o Gradient Boosting the variable is not important. Does xgboost handle variables with missing values differently. All the configuration to both the models are exactly the same.
Both missing value handling and variable importances are slightly different between the two methods. Both are treating missing values as information (i.e., they learn from them, and don't just impute with a simple constant). The variable importances are computed from the gains of their respective loss functions during tree construction. H2O uses squared error, and XGBoost uses a more complicated one based on gradient and hessian.
One thing you could check is the variance of the variable importances between different runs with different seeds, to see how stable each method is in terms of variable importances.
PS. If you have categoricals, then you're better off leaving the column as a factor for H2O, no need to do your own encoding. This can lead to a different effective count of columns vs XGBoost's dataset, so for column sampling, things will be different.

Is multiple regression the best approach for optimization?

I am being asked to take a look at a scenario where a company has many projects that they wish to complete, but with any company budget comes into play. There is a Y value of a predefined score, with multiple X inputs. There are also 3 main constraints of Capital Costs, Expense Cost and Time for Completion in Months.
The ask is could an algorithmic approach be used to optimize which projects should be done for the year given the 3 constraints. The approach also should give different results if the constraint values change. The suggested method is multiple regression. Though I have looked into different approaches in detail. I would like to ask the wider community, if anyone has dealt with a similar problem, and what approaches have you used.
Fisrt thing we should understood, a conclution of something is not base on one argument.
this is from communication theory, that every human make a frame of knowledge (understanding conclution), where the frame construct from many piece of knowledge / information).
the concequence is we cannot use single linear regression in math to create a ML / DL system.
at least we should use two different variabel to make a sub conclution. if we push to use single variable with use linear regression (y=mx+c). it's similar to push computer predict something with low accuration. what ever optimization method that you pick...it's still low accuracy..., why...because linear regresion if you use in real life, it similar with predict 'habbit' base on data, not calculating the real condition.
that's means...., we should use multiple linear regression (y=m1x1+m2x2+ ... + c) to calculate anything in order to make computer understood / have conclution / create model of regression. but, not so simple like it. because of computer try to make a conclution from data that have multiple character / varians ... you must classified the data and the conclution.
for an example, try to make computer understood phitagoras.
we know that phitagoras formula is c=((a^2)+(b^2))^(1/2), and we want our computer can make prediction the phitagoras side (c) from two input values (a and b). so to do that, we should make a model or a mutiple linear regresion formula of phitagoras.
step 1 of course we should make a multi character data of phitagoras.
this is an example
a b c
3 4 5
8 6 10
3 14 etc..., try put 10 until 20 data
try to make a conclution of regression formula with multiple regression to predic the c base on a and b values.
you will found that some data have high accuration (higher than 98%) for some value and some value is not to accurate (under 90%). example a=3 and b=14 or b=15, will give low accuration result (under 90%).
so you must make and optimization....but how to do it...
I know many method to optimize, but i found in manual way, if I exclude the data that giving low accuracy result and put them in different group then, recalculate again to the data group that excluded, i will get more significant result. do again...until you reach the accuracy target that you want.
each group data, that have a new regression, is a new class.
means i will have several multiple regression base on data that i input (the regression come from each group of data / class) and the accuracy is really high, 99% - 99.99%.
and with the several class, the regresion have a fuction as a 'label' of the class, this is what happens in the backgroud of the automation computation. but with many module, the user of the module, feel put 'string' object as label, but the truth is, the string object binding to a regresion that constructed as label.
with some conditional parameter you can get the good ML with minimum number of data train.
try it on excel / libreoffice before step more further...
try to follow the tutorial from this video
and implement it in simple data that easy to construct in excel, like pythagoras.
so the answer is yes...the multiple regression is the best approach for optimization.

How do I have to train a HMM with Baum-Welch and multiple observations?

I am having some problems understanding how the Baum-Welch algorithm exactly works. I read that it adjusts the parameters of the HMM (the transition and the emission probabilities) in order to maximize the probability that my observation sequence may be seen by the given model.
However, what does happen if I have multiple observation sequences? I want to train my HMM against a huge lot of observations (and I think this is what is usually done).
ghmm for example can take both a single observation sequence and a full set of observations for the baumWelch method.
Does it work the same in both situations? Or does the algorithm have to know all observations at the same time?
In Rabiner's paper, the parameters of GMMs (weights, means and covariances) are re-estimated in the Baum-Welch algorithm using these equations:
These are just for the single observation sequence case. In the multiple case, the numerators and denominators are just summed over all observation sequences, and then divided to get the parameters. (this can be done since they simply represent occupation counts, see pg. 273 of the paper)
So it's not required to know all observation sequences during an invocation of the algorithm. As an example, the HERest tool in HTK has a mechanism that allows splitting up the training data amongst multiple machines. Each machine computes the numerators and denominators and dumps them to a file. In the end, a single machine reads these files, sums up the numerators and denominators and divides them to get the result. See pg. 129 of the HTK book v3.4