How is the Gini-Index minimized in CART Algorithm for Decision Trees? - optimization

For neural networks, for example, I minimize the cost function by using the backpropagation algorithm. Is there something equivalent for the Gini index in decision trees?
The CART algorithm always states "choose the partition of set A that minimizes the Gini index", but how do I actually find that partition mathematically?
Any input on this would be helpful :)

For a decision tree, there are different methods for splitting continuous variables like age, weight, income, etc.
A) Discretize the continuous variable to use it as a categorical variable in all aspects of the DT algorithm. This can be done:
- only once at the start, keeping this discretization static, or
- at every stage where a split is required, using percentiles, interval ranges, or clustering to bucketize the variable.
B) Split at all possible distinct values of the variable and see where there is the highest decrease in the Gini index. This can be computationally expensive, so there are optimized variants where you sort the values and, instead of trying all distinct values, choose the midpoints between consecutive values as candidate splits. For example, if the variable 'weight' has 70, 80, 90 and 100 kg in the data points, try 75, 85 and 95 as splits and pick the best one (highest decrease in Gini or other impurity), as in the sketch below.
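A minimal sketch of approach B in Python (the gini and best_split helpers and the example labels are mine, for illustration only, not taken from any particular library):

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels: 1 - sum over classes of p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(values, labels):
    # Sort by the continuous variable, then try the midpoint between each pair
    # of consecutive distinct values and keep the split with the lowest
    # weighted Gini impurity.
    order = np.argsort(values)
    values = np.asarray(values, dtype=float)[order]
    labels = np.asarray(labels)[order]
    best_threshold, best_score = None, float("inf")
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue  # identical values give no usable midpoint
        threshold = (values[i] + values[i + 1]) / 2.0
        left, right = labels[:i + 1], labels[i + 1:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# weights in kg from the example above, with a made-up binary class
print(best_split([70, 80, 90, 100], [0, 0, 1, 1]))  # (85.0, 0.0)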
But which exact split algorithm is implemented in scikit-learn in Python, rpart in R, and MLlib in PySpark, and how they differ in splitting a continuous variable, is something I am not sure of either and am still researching.

Here is a good example of the CART algorithm. Basically, we compute the Gini index like this:
For each attribute we have different values, each of which has a Gini index according to the classes its records belong to. For example, if we had two classes (positive and negative), each value of an attribute would have some records that belong to the positive class and some that belong to the negative class, so we can calculate the probabilities. Say the attribute is called weather and it has two values (e.g. rainy and sunny), and we have this information:
rainy: 2 positive, 3 negative
sunny: 1 positive, 2 negative
we could say:
Gini(rainy) = 1 - (2/5)^2 - (3/5)^2 = 0.48
Gini(sunny) = 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
Then we can take the weighted sum of the Gini indexes for weather (we had a total of 8 records, 5 rainy and 3 sunny):
Gini(weather) = (5/8) * 0.48 + (3/8) * 0.444 ≈ 0.467
We do this for all the other attributes (like we did for weather) and at the end we choose the attribute with the lowest Gini index to be the one to split the tree on. We have to do all this at each split (unless we can classify the sub-tree without further splitting).
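For concreteness, here is a small Python sketch of that calculation; the gini helper and the dictionary of class counts are just a restatement of the weather example above:

def gini(counts):
    # Gini index for a list of class counts, e.g. [2, 3] = 2 positive, 3 negative
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# class counts per value of the weather attribute: [positive, negative]
weather = {"rainy": [2, 3], "sunny": [1, 2]}
n = sum(sum(c) for c in weather.values())  # 8 records in total

weighted = sum(sum(c) / n * gini(c) for c in weather.values())
print(round(weighted, 3))  # 0.467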

Related

Algorithm - finding the order of HMM from observations

I am given a data that consists of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be some confusion about the word "order":
A first-order HMM is an HMM whose transition matrix depends only on the previous state. A second-order HMM is one whose transition matrix depends on the two previous states, and so on. As the order increases, the theory gets "thicker" (i.e., the equations), and few such complex models are implemented in mainstream libraries.
A search on your favorite browser with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and assuming that you use a single distribution assigned to each state (i.e., you do not use HMMs with mixtures of distributions), then indeed the only hyperparameter you need to tune is the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length criterion, which are based on the model's likelihood. Usually, using these criteria requires training multiple models so that there are meaningful likelihood values to compare.
If you just want a rough idea of a good K value that may not be optimal, k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, say, 90% of the variance of the observations in your training set, then an X-state HMM is a good start. The first three criteria are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components in the mixture models).
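As one possible way to carry out the BIC approach in Python, here is a rough sketch using the hmmlearn library (not mentioned in the thread, so treat the specifics as an assumption); the parameter count is an approximation that assumes a GaussianHMM with diagonal covariances, and the data is a placeholder:

import numpy as np
from hmmlearn import hmm

def bic(model, X, lengths):
    # BIC = k * ln(n_obs) - 2 * logL, with k approximated as the number of
    # free parameters of a GaussianHMM with diagonal covariances
    n_states, n_dims = model.n_components, X.shape[1]
    k = (n_states - 1) + n_states * (n_states - 1) + 2 * n_states * n_dims
    return k * np.log(len(X)) - 2 * model.score(X, lengths)

# X: observation sequences stacked row-wise, lengths: length of each sequence
X = np.random.randn(300, 2)   # placeholder data for the sketch
lengths = [100, 100, 100]

best_k, best_bic = None, np.inf
for n_states in range(2, 8):
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50, random_state=0)
    model.fit(X, lengths)
    score = bic(model, X, lengths)
    if score < best_bic:
        best_k, best_bic = n_states, score
print(best_k)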

Can we treat a discrete variable as a continuous variable when it is one of the covariates in a regression?

For example, can we consider the count of emergency room visit as a continuous variable when we do regression?
In general, it is risky to treat a discrete numerical variable as though it is equivalent to a continuous variable. This is especially true if your discrete variable represents some sort of categorical information (e.g. red/blue/green), unless the categories have some natural one-dimensional ordering (e.g. ages grouped into 10-year bands), and the numbers representing the different categories are, in some sense, appropriately spaced when mapped into the continuous space.
In your case, if the discrete variable is a count of patient visits, it may be more reasonable to treat this as though it is a continuous variable, especially if those counts tend to be large. Under those circumstances, it may be more reasonable to assume that the counts resemble random numbers drawn from a Gaussian distribution (following the Central Limit theorem), which may fit well with the underlying statistical assumptions of popular regression algorithms. However, if the counts are smaller, or have a high probability of outliers, then it may be more risky to treat them as though they are continuous variables.
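As a hedged illustration with statsmodels (the data below is made up), the choice amounts to entering the count directly as a numeric covariate versus wrapping it in C() to treat it as categorical:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "er_visits": rng.poisson(6, size=200),   # discrete count covariate
    "age": rng.normal(50, 10, size=200),
})
df["outcome"] = 0.5 * df["er_visits"] + 0.1 * df["age"] + rng.normal(size=200)

# Treat the count as continuous: a single slope coefficient
fit_numeric = smf.ols("outcome ~ er_visits + age", data=df).fit()

# Treat the count as categorical: one coefficient per observed level
fit_categorical = smf.ols("outcome ~ C(er_visits) + age", data=df).fit()

print(fit_numeric.params)
print(fit_categorical.params.head())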

How should the seasons of a year be defined as an input variable for an ANN in MATLAB?

I wanted to define the season of the year (4 seasons) as one of the input variables for a neural network in MATLAB. Can I just use numbers from 1 to 4? Thanks for any suggestion.
Since it is a categorical variable, it's better to use 1-hot coding:
0001: summer
0010: fall
0100: winter
1000: spring
So, your season input will become 4 binary inputs.
Generally there are two ways to do this. The first is to use a single input and scale the integer values, e.g. (1,...,4) for season, to the range of the other input variables. However, this approach assumes a hierarchy among the categories, e.g. that spring is 'better' or 'higher' than summer. Since this is not the case, you need to create one input node for each possible realization of the category, i.e. 4 input variables for the season, where all are set to '0' except the active category, which is set to '1'. I would not advise encoding the integer categorical values into compact binary codes to reduce the number of required input nodes: you would end up with a correlation bias among categories that have a bit set to '1' at the same time, e.g. for (hot, mild, cold) = ([0,1], [1,0], [1,1]), the encoding for 'cold' shares a bit with both 'hot' and 'mild', creating an artificial similarity between categories that have no such relationship.
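For illustration, the same encoding sketched in Python (rather than MATLAB) just to show the shape of the inputs; in MATLAB the idea is identical, four binary inputs per sample:

import numpy as np

seasons = ["spring", "summer", "fall", "winter"]

def one_hot(season):
    # 4-element binary vector with a single 1 at the position of the given season
    vec = np.zeros(len(seasons))
    vec[seasons.index(season)] = 1.0
    return vec

print(one_hot("fall"))  # [0. 0. 1. 0.]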

Markov chains with Redis

For self-education purposes, I want to implement a Markov chain generator, using as much Redis, and as little application-level logic as possible.
Let's say I want to build a word generator, based on frequency table with history depth N (say, 2).
As a not very interesting example, for dictionary of two words bar and baz, the frequency table is as follows ("." is terminator, numbers are weights):
. . -> b x2
. b -> a x2
b a -> r x1
b a -> z x1
a r -> . x1
a z -> . x1
When I generate a word, I start with a history of two terminators . .
There is only one possible outcome for the first two letters, b a.
Third letter may be either r or z, with equal probabilities, since their weights are equal.
Fourth letter is always a terminator.
(Things would be more interesting with longer words in dictionary.)
Anyway, how to do this with Redis elegantly?
Redis sets have SRANDMEMBER, but do not have weights.
Redis sorted sets have weights, but do not have random member retrieval.
Redis lists allow weights to be represented as entry copies, but how do I take set intersections with them?
Looks like application code is doomed to do some data processing...
You can accomplish a weighted random selection with a redis sorted set, by assigning each member a score between zero and one, according to the cumulative probability of the members of the set considered thus far, including the current member.
The ordering you use is irrelevant; you may choose any order which is convenient for you. The random selection is then accomplished by generating a random floating point number r uniformly distributed between zero and one, and calling
ZRANGEBYSCORE zset r 1 LIMIT 0 1,
which will return the first element with a score greater than or equal to r.
A little bit of reasoning should convince you that the probability of choosing a member is thus weighted correctly.
Unfortunately, the fact that the scores assigned to the elements need to be proportional to the cumulative probability would seem to make it difficult to use the sorted set union or intersection operations in a way that preserves the significance of the scores for random selection of elements. That part would seem to require some significant application logic.
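To make the cumulative-score construction concrete, here is a small sketch with redis-py; the key name and the weights are made up, reusing the "b a" history from the example above:

import random
import redis

r = redis.Redis()  # assumes a local Redis instance

# Weighted successors of the history "b a": r and z, weight 1 each
transitions = {"r": 1, "z": 1}
total = sum(transitions.values())

# Score each member with the cumulative probability up to and including it
cumulative = 0.0
scores = {}
for member, weight in transitions.items():
    cumulative += weight / total
    scores[member] = cumulative
r.zadd("markov:b a", scores)

# Weighted random pick: the first member whose score is >= a uniform random number
pick = r.zrangebyscore("markov:b a", random.random(), 1, start=0, num=1)
print(pick[0].decode())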

How to design acceptance probability function for simulated annealing with multiple distinct costs?

I am using simulated annealing to solve an NP-complete resource scheduling problem. For each candidate ordering of the tasks I compute several different costs (or energy values). Some examples are (though the specifics are probably irrelevant to the question):
global_finish_time: The total number of days that the schedule spans.
split_cost: The number of days by which each task is delayed due to interruptions by other tasks (this is meant to discourage interruption of a task once it has started).
deadline_cost: The sum of the squared number of days by which each missed deadline is overdue.
The traditional acceptance probability function looks like this (in Python):
import math

def acceptance_probability(old_cost, new_cost, temperature):
    if new_cost < old_cost:
        return 1.0
    else:
        return math.exp((old_cost - new_cost) / temperature)
So far I have combined my first two costs into one by simply adding them, so that I can feed the result into acceptance_probability. But what I would really want is for deadline_cost to always take precedence over global_finish_time, and for global_finish_time to take precedence over split_cost.
So my question to Stack Overflow is: how can I design an acceptance probability function that takes multiple energies into account but always considers the first energy to be more important than the second energy, and so on? In other words, I would like to pass in old_cost and new_cost as tuples of several costs and return a sensible acceptance probability.
Edit: After a few days of experimenting with the proposed solutions I have concluded that the only way that works well enough for me is Mike Dunlavey's suggestion, even though this creates many other difficulties with cost components that have different units. I am practically forced to compare apples with oranges.
So, I put some effort into "normalizing" the values. First, deadline_cost is a sum of squares, so it grows quadratically while the other components grow linearly. To address this I take the square root, which gives it a similar growth rate. Second, I developed a function that computes a linear combination of the costs, but auto-adjusts the coefficients according to the highest cost component seen so far.
For example, if the tuple of highest costs is (A, B, C) and the input cost vector is (x, y, z), the linear combination is BCx + Cy + z. That way, no matter how high z gets it will never be more important than an x value of 1.
This creates "jaggies" in the cost function as new maximum costs are discovered. For example, if C goes up then BCx and Cy will both be higher for a given (x, y, z) input and so will differences between costs. A higher cost difference means that the acceptance probability will drop, as if the temperature was suddenly lowered an extra step. In practice though this is not a problem because the maximum costs are updated only a few times in the beginning and do not change later. I believe this could even be theoretically proven to converge to a correct result since we know that the cost will converge toward a lower value.
One thing that still has me somewhat confused is what happens when the maximum costs are 1.0 and lower, say 0.5. With a maximum vector of (0.5, 0.5, 0.5) this would give the linear combination 0.5*0.5*x + 0.5*y + z, i.e. the order of precedence is suddenly reversed. I suppose the best way to deal with it is to use the maximum vector to scale all values to given ranges, so that the coefficients can always be the same (say, 100x + 10y + z). But I haven't tried that yet.
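For what it's worth, one possible reading of that auto-adjusting combination as Python (the names and the starting maxima are mine, and the square root is applied to deadline_cost as described above):

import math

max_costs = [1.0, 1.0, 1.0]   # highest (deadline, finish, split) costs seen so far

def combined_cost(deadline_cost, global_finish_time, split_cost):
    # Square-root the sum-of-squares term so it grows at a similar rate,
    # then form B*C*x + C*y + z with (A, B, C) = the running maxima.
    costs = [math.sqrt(deadline_cost), global_finish_time, split_cost]
    for i, c in enumerate(costs):
        max_costs[i] = max(max_costs[i], c)
    _, B, C = max_costs
    x, y, z = costs
    return B * C * x + C * y + z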
mbeckish is right.
Could you make a linear combination of the different energies, and adjust the coefficients?
Possibly log-transforming them in and out?
I've done some MCMC using Metropolis-Hastings. In that case I define the (non-normalized) log-likelihood of a particular state (given its priors), and I find that to be a good way to clarify my thinking about what I want.
I would take a hint from multi-objective evolutionary algorithms (MOEA) and have it transition only if all of the objectives simultaneously pass the acceptance_probability function you gave. This has the effect of exploring the Pareto front, much like standard simulated annealing explores plateaus of same-energy solutions.
However, this does give up on the idea of having the first one take priority.
You will probably have to tweak your parameters, such as giving it a higher initial temperature.
I would consider something along the lines of:
If (new deadline_cost > old deadline_cost)
    return (calculate probability)
else if (new global finish time > old global finish time)
    return (calculate probability)
else if (new split cost > old split cost)
    return (calculate probability)
else
    return (1.0)
Of course each of the three places you calculate the probability could use a different function.
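A runnable Python version of that pseudocode, assuming the costs are passed as (deadline_cost, global_finish_time, split_cost) tuples and that the same probability function is used at each level:

import math

def acceptance_probability(old_cost, new_cost, temperature):
    # Walk the cost components in priority order; the first component that
    # got worse determines the acceptance probability.
    for old_c, new_c in zip(old_cost, new_cost):
        if new_c > old_c:
            return math.exp((old_c - new_c) / temperature)
    return 1.0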
It depends on what you mean by "takes precedence".
For example, what if the deadline_cost goes down by 0.001, but the global_finish_time cost goes up by 10000? Do you return 1.0, because the deadline_cost decreased, and that takes precedence over anything else?
This seems like it is a judgment call that only you can make, unless you can provide enough background information on the project so that others can suggest their own informed judgment call.