Pecuiliar output with Liblinear - optimization

I am facing a very peculiar problem with lib-linear package.
I have two levels (+1, -1).
Say I have only one feature which takes values $x_1$, $x_2$,..., $x_n$ for n points. It classifies well giving some positive weight $w*$ and cost C say for example.
Now if I stack $1$ to the previous feature to make a new feature vectors [1 x_i] i=1, 2, ...,n; Now with this new problem lib-linear gives the following:
a weight vector [w_1 -w_2]; w_i>0 i.e. weights to 1 is w_1 and to x is w_2.
Cost C1 much greater than previous cost C.
I understand that new feature (1) has no variation throughout and hence the weight to it should automatically go zero.
It is a minimization problem so it should give w_1~0 so that now the cost C1 is at most equal to C.
Can anyone help?

Since you have a constant input dimension, its contribution in the decision function will also be constant. LIBLINEAR's decision function is
f(x)=sign(w^T*x-rho)
My guess is that your new model corrects for the extra term (due to non-zero w_1) through rho. I can't say I have a good idea as to why w_1 was not minimized to zero, though. Are the predictions of both models equal?

Related

Asynchrony loss function over an array of 1D signals

So I have an array of N 1D-signals (e.g. time series) with same number of samples per signal (all in equal resolution) and I want to define a differentiable loss function to penalize asynchrony among them and therefore be zero if all N 1D signals will be equal to each other. I've been searching the literature to find something but haven't had luck yet.
Few remarks:
1 - since N (number of signals) could be quite large I can not afford to calculate Mean squared loss between every single pair which could grow combinatorialy large. also I'm not quite sure whether it would be optimal in any mathematical sense for the goal to achieve.
There are two naive loss functions that I could think of :
a) Total variation loss for each time sample across all signals (to force to reach ideally zero variation). the problem is here the weight needs to be very large to yield zero varion. masking any other loss term that is going to be added and also there is no inherent order among the N signals, which doesnt make it suitable to TV loss to begin with.
b) minimizing the sum of variance at each time point among all signals. however, choice of the reference of variance (aka mean) could be crucial I believe as just using the sample mean might not really yield the desired result, not quite sure.

Explained variance calculation

My questions are specific to https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.
I don't understand why you square eigenvalues
https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/decomposition/pca.py#L444
here?
Also, explained_variance is not computed for new transformed data other than original data used to compute eigen-vectors. Is that not normally done?
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)
pca.transform(Y)
In this case, won't you separately calculate explained variance for data Y as well. For that purpose, I think we would have to use point 3 instead of using eigen-values.
Explained variance can be also computed by taking the variance of each axis in the transformed space and dividing by the total variance. Any reason that is not done here?
Answers to your questions:
1) The square roots of the eigenvalues of the scatter matrix (e.g. XX.T) are the singular values of X (see here: https://math.stackexchange.com/a/3871/536826). So you square them. Important: the initial matrix X should be centered (data has been preprocessed to have zero mean) in order for the above to hold.
2) Yes this is the way to go. explained_variance is computed based on the singular values. See point 1.
3) It's the same but in the case you describe you HAVE to project the data and then do additional computations. No need for that if you just compute it using the eigenvalues / singular values (see point 1 again for the connection between these two).
Finally, keep in mind that not everyone really wants to project the data. Someone can only get the eigenvalues and then immediately estimate the explained variance WITHOUT projecting the data. So that's the best gold standard way to do it.
EDIT 1:
Answer to edited Point 2
No. PCA is an unsupervised method. It only transforms the X data not the Y (labels).
Again, the explained variance can be computed fast, easily, and with half line of code using the eigenvalues/singular values OR as you said using the projected data e.g. estimating the covariance of the projected data, then variances of PCs will be in the diagonal.

Balance Dataset for Tensorflow Object Detection

I currently want to use Tensorflows Object Detection API for my custom problem.
I already created the dataset, but its pretty unbalanced.
The Dataset has 3 classes and my main problem is, that one class has about 16k samples and another class has only about 2.5k samples.
So I think I have to balance the dataset. Someone told me, that there is something called sample/class weights(Not sure if this is 100% correct), which balance the samples for training, so that the biggest class has a smaller impact on training then the smallest class.
I'm not able to find this method for balancing. Can someone pleas give me a hint where to start?
You can do normal cross entropy, giving you a ? x 1 tensor, X of losses
If you want class number N to count T times more, you can do
X = X * tf.reduce_sum(tf.multiply(one_hot_label, class_weight), axis = 1)
tf.multiply
scales the label by whatever weight you want,
tf.reduce_sum
converts the label vector a to a scalar, so you end up with a ? x 1 tensor filled with the class weightings. Then you simply multiply the tensor of losses with the tensor of weightings to achieve desired results.
Since one class is 6.4 times more common than the other, I would apply the weightings 1 and 6.4 to the more common and less common class respectively. This will mean that every time the less common class occurs, it has 6.4 times the affect of the more common class, so it's like it saw the same number of samples from each.
You might want to modify it so that the weighting add up to the number of classes. This matches the default case is all of the weightings are 1. In that case we have 1 /7.4 and 6.4/7.4

Reason why setting tensorflow's variable with small stddev

I have a question about a reason why setting TensorFlow's variable with small stddev.
I guess many people do test MNIST test code from TensorFlow beginner's guide.
As following it, the first layer's weights are initiated by using truncated_normal with stddev 0.1.
And I guessed if setting it with more bigger value, then it would be the same result, which is exactly accurate.
But although increasing epoch count, it doesn't work.
Is there anybody know this reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)
The reason is because you want to keep all the layer's variances (or standard deviations) approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strong each weight influences the input to reach the final output; layer's weight variance directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all values z greater or smaller than +/- 4 are very close to one (for z > 4) or zero (for z < -4) and only values within that range tend to have some meaningful "change".
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, all very large or very small values will optimize very slowly, since their influence on the result y = sigmoid(W*x+b) is extremely small. Now the pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. by initializing the weights with a high variance (i.e. standard deviation), the result will necessarily be (relatively) large, leading to said problem. This is also the reason why truncated_normal is used rather than a correct normal distribution: The latter only guarantees that most of the values are very close to the mean, with some less than 5% chance that this is not the case, while truncated_normal simply clips away every value that is too big or too small, guaranteeing that all weights are in the same range, while still being normally distributed.
To make matters worse, in a typical neural network - especially in deep learning - each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (a variation of the vanishing gradients, where gradients are getting smaller).
The reason that this is a problem is because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: Its weights get adjusted very strongly - likely overcorrecting the actual problem - and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic by itself and there are somewhat more sophisticated methods such as Xavier initialization or Layer-sequential unit variance, however small normally distributed values are usually simply a good guess.

How to design acceptance probability function for simulated annealing with multiple distinct costs?

I am using simulated annealing to solve an NP-complete resource scheduling problem. For each candidate ordering of the tasks I compute several different costs (or energy values). Some examples are (though the specifics are probably irrelevant to the question):
global_finish_time: The total number of days that the schedule spans.
split_cost: The number of days by which each task is delayed due to interruptions by other tasks (this is meant to discourage interruption of a task once it has started).
deadline_cost: The sum of the squared number of days by which each missed deadline is overdue.
The traditional acceptance probability function looks like this (in Python):
def acceptance_probability(old_cost, new_cost, temperature):
if new_cost < old_cost:
return 1.0
else:
return math.exp((old_cost - new_cost) / temperature)
So far I have combined my first two costs into one by simply adding them, so that I can feed the result into acceptance_probability. But what I would really want is for deadline_cost to always take precedence over global_finish_time, and for global_finish_time to take precedence over split_cost.
So my question to Stack Overflow is: how can I design an acceptance probability function that takes multiple energies into account but always considers the first energy to be more important than the second energy, and so on? In other words, I would like to pass in old_cost and new_cost as tuples of several costs and return a sensible value .
Edit: After a few days of experimenting with the proposed solutions I have concluded that the only way that works well enough for me is Mike Dunlavey's suggestion, even though this creates many other difficulties with cost components that have different units. I am practically forced to compare apples with oranges.
So, I put some effort into "normalizing" the values. First, deadline_cost is a sum of squares, so it grows exponentially while the other components grow linearly. To address this I use the square root to get a similar growth rate. Second, I developed a function that computes a linear combination of the costs, but auto-adjusts the coefficients according to the highest cost component seen so far.
For example, if the tuple of highest costs is (A, B, C) and the input cost vector is (x, y, z), the linear combination is BCx + Cy + z. That way, no matter how high z gets it will never be more important than an x value of 1.
This creates "jaggies" in the cost function as new maximum costs are discovered. For example, if C goes up then BCx and Cy will both be higher for a given (x, y, z) input and so will differences between costs. A higher cost difference means that the acceptance probability will drop, as if the temperature was suddenly lowered an extra step. In practice though this is not a problem because the maximum costs are updated only a few times in the beginning and do not change later. I believe this could even be theoretically proven to converge to a correct result since we know that the cost will converge toward a lower value.
One thing that still has me somewhat confused is what happens when the maximum costs are 1.0 and lower, say 0.5. With a maximum vector of (0.5, 0.5, 0.5) this would give the linear combination 0.5*0.5*x + 0.5*y + z, i.e. the order of precedence is suddenly reversed. I suppose the best way to deal with it is to use the maximum vector to scale all values to given ranges, so that the coefficients can always be the same (say, 100x + 10y + z). But I haven't tried that yet.
mbeckish is right.
Could you make a linear combination of the different energies, and adjust the coefficients?
Possibly log-transforming them in and out?
I've done some MCMC using Metropolis-Hastings. In that case I'm defining the (non-normalized) log-likelihood of a particular state (given its priors), and I find that a way to clarify my thinking about what I want.
I would take a hint from multi-objective evolutionary algorithm (MOEA) and have it transition if all of the objectives simultaneously pass with the acceptance_probability function you gave. This will have the effect of exploring the Pareto front much like the standard simulated annealing explores plateaus of same-energy solutions.
However, this does give up on the idea of having the first one take priority.
You will probably have to tweak your parameters, such as giving it a higher initial temperature.
I would consider something along the lines of:
If (new deadline_cost > old deadline_cost)
return (calculate probability)
else if (new global finish time > old global finish time)
return (calculate probability)
else if (new split cost > old split cost)
return (calculate probability)
else
return (1.0)
Of course each of the three places you calculate the probability could use a different function.
It depends on what you mean by "takes precedence".
For example, what if the deadline_cost goes down by 0.001, but the global_finish_time cost goes up by 10000? Do you return 1.0, because the deadline_cost decreased, and that takes precedence over anything else?
This seems like it is a judgment call that only you can make, unless you can provide enough background information on the project so that others can suggest their own informed judgment call.