How does xgboost split the root node, and a question about the Taylor expansion

I know xgboost uses Gain = Score(L) + Score(R) - Score(L+R) to split a node, but how does xgboost split the root node? Also, why not use the fourth or fifth derivative of the loss function in the Taylor expansion?

Before the root node there is an initial value named 'base_score' (default 0.5 for classification) that serves as the initial prediction, so at the root node you can calculate every sample's gradient and hessian from that prediction and obtain the score used in the gain.
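To make this concrete, here is a small sketch (my own illustration, not xgboost source code) of the root-node quantities for logistic loss with base_score = 0.5; the toy labels and the lambda value are assumptions:
import numpy as np

y = np.array([1, 0, 1, 1, 0], dtype=float)   # toy binary labels
p = np.full_like(y, 0.5)                     # base_score = 0.5 as the initial prediction

# first and second derivatives of the logistic loss w.r.t. the raw score at that prediction
g = p - y          # gradients
h = p * (1 - p)    # hessians

lam = 1.0                            # assumed L2 regularization term (lambda)
G, H = g.sum(), h.sum()
root_score  = G ** 2 / (H + lam)     # similarity score entering the Gain (up to the 1/2 factor and gamma)
root_weight = -G / (H + lam)         # leaf weight the root would get if it were never split
print(root_score, root_weight)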

Related

Is max operation differentiable in Pytorch?

I am using PyTorch to train some neural networks. The part I am confused about is:
prediction = myNetwork(img_batch)
max_act = prediction.max(1)[0].sum()
loss = softcrossentropy_loss - alpha * max_act
In the above codes, "prediction" is the output tensor of "myNetwork".
I want to maximize the largest output of "prediction" over a batch.
For example:
[[-1.2, 2.0, 5.0, 0.1, -1.5], [9.6, -1.1, 0.7, 4.3, 3.3]]
For the first prediction vector, the 3rd element is the largest, while for the second vector, the 1st element is the largest. I want to maximize "5.0 + 9.6", even though we cannot know in advance which index will hold the largest output for new input data.
In fact, my training seems to be successful, because the "max_act" part really did increase, which is the behaviour I want. However, I have heard some discussion about whether the max() operation is differentiable or not:
Some say that, mathematically, max() is not differentiable.
Some say that max() is just an identity function selecting the largest element, and that largest element is differentiable.
So now I am confused, and I am worried that my idea of maximizing "max_act" was wrong from the beginning.
Could someone provide some guidance if max() operation is differentiable in Pytorch?
max is differentiable with respect to the values, not the indices. It is perfectly valid in your application.
From the gradient point of view, d(max_value)/d(v) is 1 if max_value==v and 0 otherwise. You can consider it as a selector.
d(max_index)/d(v) is not really meaningful as it is a discontinuous function, with only 0 and undefined as possible gradients.
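A quick sketch (mine, not part of the answer) makes that selector behaviour visible, reusing the example values from the question: the gradient of the summed row-wise max flows only into the argmax entries.
import torch

x = torch.tensor([[-1.2, 2.0, 5.0, 0.1, -1.5],
                  [ 9.6, -1.1, 0.7, 4.3, 3.3]], requires_grad=True)
max_act = x.max(1)[0].sum()   # 5.0 + 9.6
max_act.backward()
print(x.grad)
# tensor([[0., 0., 1., 0., 0.],
#         [1., 0., 0., 0., 0.]])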

What is meaning of "parameter optimization of SVM by PSO"?

I can change the parameters C and epsilon manually to obtain an optimised result, but I have found papers about parameter optimization of an SVM by PSO (or other optimization algorithms) that never spell out an algorithm. What does it mean: how can PSO automatically optimize the SVM parameters? I have read several papers on this topic, but I'm still not sure.
Particle Swarm Optimization is a technique that uses the ML parameters (SVM parameters, in your case) as its features.
Each "particle" in the swarm is characterized by those parameter values. For instance, you might have initial coordinates of
       degree   epsilon   gamma   C
p1     3        0.001     0.25    1.0
p2     3        0.003     0.20    0.9
p3     2        0.0003    0.30    1.2
p4     4        0.010     0.25    0.5
...
pn     ...
The "fitness" of each particle (p1-p4 shown here out of a population of n particles) is measured by the accuracy of the resulting model: the PSO algorithm trains and tests a model for each particle, returning that model's error rate as the value analogous to that from the training loss function (which it how the value is computed).
On each iteration, particles move toward their fittest neighbours. The process repeats until a maximum (hopefully the global one) emerges as a convergence point. The search is loosely analogous to gradient descent, although PSO itself never computes a gradient.
There are two basic PSO variants. In gbest (global best), every particle affects every other particle, sort of a universal gravitation principle. It converges quickly, but may well miss a global max in favor of a local max that happened to be nearer to the swarm's original center. In lbest (local best), a particle responds to only its k closest neighbors. This can form localized clusters; it converges more slowly, but is more likely to find the global max in a non-convex space.
I'll try to briefly explain enough to answer your clarification questions. If that doesn't work, I'm afraid you'll probably have to find someone to discuss this in front of a white board.
To use PSO, you have to decide which SVM parameters you'll try to optimize and how many particles you want to use. PSO is a meta-algorithm, so its features are the SVM parameters. The PSO parameters are the population size (how many particles to use), the update neighbourhood (the lbest size plus a distance function; gbest is the all-inclusive case), and the velocity (the learning rate for the SVM parameters).
For a bit of illustration, let's assume the particle table above, extended to a population of 20 particles. We'll use lbest with a neighbourhood of 4, and a velocity of 0.1. We choose (randomly, in a grid, or however we think might give us nice results) the initial values of degree, epsilon, gamma, and C for each of the 20 particles.
Each iteration of PSO works like this:
# Train the model described by each particle's "position"
for each of the 20 particles:
    train an SVM with the SVM input and that particle's parameter values
    test the SVM; return its error rate as the PSO loss function value
# Update the particle positions
for each of the 20 particles:
    find the nearest 4 neighbours (using the PSO distance function)
    identify the neighbour with the lowest loss (SVM error rate)
    adjust this particle's features (degree, epsilon, gamma, C) 0.1 of the way
        toward that neighbour's features; 0.1 is our learning rate / velocity
    (yes, I realize that changing degree, a discrete value, needs a special case
     in the update routine)
Continue iterating through PSO until the particles have converged to your liking.
gbest is simply lbest with an infinite neighbourhood; in that case, you don't need a distance function on the particle space.
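If it helps, here is a rough, self-contained sketch of the idea (my own illustration, not the answerer's code): lbest PSO with a neighbourhood of 4 and a velocity of 0.1, tuning gamma and C of a scikit-learn SVR on toy data. All names, ranges, and the cross-validation scoring are assumptions.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # toy inputs
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)    # toy targets

n_particles, k_neighbours, velocity = 20, 4, 0.1
# each particle's "position" is [log10(gamma), log10(C)]
positions = rng.uniform(low=[-3.0, -1.0], high=[0.0, 2.0], size=(n_particles, 2))

def fitness(pos):
    gamma, C = 10.0 ** pos
    model = SVR(gamma=gamma, C=C, epsilon=0.01)
    # negative MSE from cross-validation: higher is better (plays the role of 1 - error rate)
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()

for _ in range(30):                                     # PSO iterations
    scores = np.array([fitness(p) for p in positions])
    new_positions = positions.copy()
    for i, p in enumerate(positions):
        # lbest: only the k nearest particles (in parameter space) influence this one
        dists = np.linalg.norm(positions - p, axis=1)
        neighbours = np.argsort(dists)[:k_neighbours + 1]   # includes the particle itself
        best = neighbours[np.argmax(scores[neighbours])]
        # move a fraction (the velocity) of the way toward the fittest neighbour
        new_positions[i] = p + velocity * (positions[best] - p)
    positions = new_positions

best = max(positions, key=fitness)
print("gamma = %.4g, C = %.4g" % tuple(10.0 ** best))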

Tensorflow: opt.compute_gradients() returns values different from the weight difference of opt.apply_gradients()

Question: what is the most efficient way to get the delta of my weights in a TensorFlow network?
Background: I've got the operators hooked up as follows (thanks to this SO question):
self.cost = `the rest of the network`
self.rmsprop = tf.train.RMSPropOptimizer(lr,rms_decay,0.0,rms_eps)
self.comp_grads = self.rmsprop.compute_gradients(self.cost)
self.grad_placeholder = [(tf.placeholder("float", shape=grad[1].get_shape(), name="grad_placeholder"), grad[1]) for grad in self.comp_grads]
self.apply_grads = self.rmsprop.apply_gradients(self.grad_placeholder)
Now, to feed in information, I run the following:
feed_dict = `training variables`
grad_vals = self.sess.run([grad[0] for grad in self.comp_grads], feed_dict=feed_dict)
feed_dict2 = `feed_dict plus gradient values added to self.grad_placeholder`
self.sess.run(self.apply_grads, feed_dict=feed_dict2)
The run(self.apply_grads) call does update the network weights, but when I compute the difference between the starting and ending weights (via run(self.w1)), those numbers are different from what is stored in grad_vals[0]. I figure this is because the RMSPropOptimizer does more to the raw gradients, but I'm not sure what, or where to find out what it does.
So back to the question: how do I get the delta of my weights in the most efficient way? Am I stuck running self.w1.eval(sess) multiple times to get the weights and computing the difference myself? Is there something I'm missing with the tf.RMSPropOptimizer function?
Thanks!
RMSprop does not simply subtract the gradient from the parameters; it uses a more complicated formula involving a combination of:
a momentum term, if the corresponding momentum parameter is not 0
a gradient step, rescaled non-uniformly (per coordinate) by the square root of a running average of the squared gradients.
For more information you can refer to these slides or this recent paper.
The delta is first computed in memory by TensorFlow in the slot variable 'momentum', and then the variable is updated (see the C++ operator).
Thus, you should be able to access it and construct a delta node with delta_w1 = self.rmsprop.get_slot(self.w1, 'momentum'). (I have not tried it yet.)
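In code, that suggestion might look roughly like this (an untested sketch that reuses the variables from the question):
# 'momentum' is one of RMSPropOptimizer's slot names (the other is 'rms'); the slot
# holds the step that was subtracted from w1, so new_w1 = old_w1 - dw1
delta_w1 = self.rmsprop.get_slot(self.w1, 'momentum')

self.sess.run(self.apply_grads, feed_dict=feed_dict2)  # apply the update as before
dw1 = self.sess.run(delta_w1)                           # fetch the delta that was just applied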
You can add the weights to the list of things to fetch in each run call and compute the deltas outside of TensorFlow, since you will then have the successive iterates. This should be reasonably efficient, although it costs an extra elementwise difference; avoiding that would mean digging into the guts of the optimizer to find where it stores the update before applying it and fetching that each step. At least fetching the weights on each call shouldn't trigger wasteful extra evaluations of other parts of the graph.
RMSProp does complicated scaling of the learning rate for each weight. Basically it divides the learning rate for a weight by a running average of the magnitudes of recent gradients of that weight.
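For reference, here is a sketch of the per-weight update that TensorFlow's RMSProp performs, written as a plain NumPy function; the names are mine, with decay, momentum, lr, and eps corresponding to the constructor arguments and ms/mom to the 'rms' and 'momentum' slots:
import numpy as np

def rmsprop_step(w, grad, ms, mom, lr, decay, momentum, eps):
    """One per-weight RMSProp update, following the formula in the TF documentation."""
    ms  = decay * ms + (1.0 - decay) * grad ** 2          # running average of squared gradients
    mom = momentum * mom + lr * grad / np.sqrt(ms + eps)  # the delta (the 'momentum' slot)
    w   = w - mom
    return w, ms, mom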

Update equation for gradient descent

If we have an approximation function y = f(w, x), where x is the input, y is the output, and w is the weight, then according to the gradient descent rule we should update the weight as w = w - df/dw. But is it possible to update the weight as w = w - w * df/dw instead? Has anyone seen this before? The reason I want to do this is that it is easier to implement this way in my algorithm.
Recall that gradient descent is based on the Taylor expansion of f(w, x) in the close vicinity of w, and its purpose---in your context---is to repeatedly modify the weight in small steps. The negative gradient direction is just a search direction, based on very local knowledge of the function f(w, x).
Usually the iterative update of the weight includes a step length, yielding the expression
w_(i+1) = w_(i) - nu_i df/dw,
where the value of the step length nu_i is found by line search, see e.g. https://en.wikipedia.org/wiki/Line_search.
Hence, based on the discussion above, to answer your question: no, it is not a good idea to update according to
w_(i+1) = w_(i) - w_(i) df/dw.
Why? If w_(i) is large (in context), we'll take a huge step based on very local information, and we would be using something very different from the fine-stepped gradient descent method.
Also, as lejlot points out in the comments below, a negative value of w_(i) would mean you move in the (positive) direction of the gradient, i.e., in the direction in which the function grows most rapidly, which is, locally, the worst possible search direction (for minimization problems).
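A tiny numeric illustration of that point (my own, using an assumed toy objective f(w) = (w - 3)^2):
def grad(w):
    return 2.0 * (w - 3.0)        # derivative of f(w) = (w - 3)**2

nu = 0.1
w_std = w_scaled = -1.0
for _ in range(8):
    w_std    = w_std - nu * grad(w_std)              # w_(i+1) = w_(i) - nu df/dw
    w_scaled = w_scaled - w_scaled * grad(w_scaled)  # w_(i+1) = w_(i) - w_(i) df/dw

print(w_std)     # moves steadily toward the minimizer 3 (about 2.3 after 8 steps)
print(w_scaled)  # explodes: the negative start steps along +gradient and the magnitude blows up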

PyMC: How can I describe a state space model?

I used to code my MCMC using C. But I'd like to give PyMC a try.
Suppose X_n is the underlying state, whose dynamics follow a Markov chain, and Y_n is the observed data. In particular,
Y_n has Poisson distribution with mean depending on X_n and a multidimensional unknown parameter theta
X_n | X_{n-1} has distribution depending on theta
How should I describe this model using PyMC?
Another question: I can find conjugate priors for theta but not for X_n. Is it possible to specify which posteriors are updated using conjugate priors and which using MCMC?
Here is an example of a state-space model in PyMC on the PyMC wiki. It basically involves populating a list and allowing PyMC to treat it as a container of PyMC nodes.
As for the second part of the question, you could certainly calculate some of your conjugate posteriors ahead of time and put them into the model. For example, if you observed binomial data x=4, n=10, you could insert a Beta node p = Beta('p', 5, 7) to represent that posterior (it's really just a prior, as far as the model is concerned, but it is the posterior given the data x). Then PyMC would draw a sample from this posterior at every iteration, to be used wherever it is needed in the model.
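As a starting point, here is a minimal sketch of the list/container idea using the classic PyMC 2-style API, for a hypothetical AR(1)-type latent chain with Poisson observations; the priors, the log link, and all names are my assumptions rather than anything specified in the question:
import numpy as np
import pymc as pm   # PyMC 2.x-style API

y_obs = np.array([3, 5, 4, 8, 6])            # hypothetical observed counts
N = len(y_obs)

theta = pm.Normal('theta', mu=0.0, tau=1.0)  # unknown transition parameter

# latent chain X_n | X_{n-1} ~ Normal(theta * X_{n-1}, 1), built up in a plain list
X = [pm.Normal('X_0', mu=0.0, tau=1.0)]
for n in range(1, N):
    mu_n = pm.Lambda('mu_X_%d' % n, lambda t=theta, xp=X[n - 1]: t * xp)
    X.append(pm.Normal('X_%d' % n, mu=mu_n, tau=1.0))

# observations Y_n ~ Poisson(exp(X_n)), i.e. a log link between state and mean
Y = [pm.Poisson('Y_%d' % n,
                mu=pm.Lambda('rate_%d' % n, lambda x=X[n]: np.exp(x)),
                value=y_obs[n], observed=True)
     for n in range(N)]

mcmc = pm.MCMC([theta] + X + Y)   # PyMC treats the lists as containers of nodes
mcmc.sample(iter=10000, burn=2000)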