If the lasso is equivalent to linear regression with a Laplace prior, how can there be mass on sets with components at zero? - lasso-regression

We are all familiar with the notion, well documented in the literature, that lasso optimization (for simplicity, confine attention here to the case of linear regression)
loss = || y - x b ||_2^2 + c || b ||_1
is equivalent to the linear model with Gaussian errors in which the parameters are given the Laplace prior
exp(-c || b ||_1)
We are also aware that the higher one sets the tuning parameter c, the larger the share of parameters that are set to zero. That said, I have the following thought question:
Consider that, from the Bayesian point of view, we can calculate the posterior probability that, say, the non-zero parameter estimates lie in any given collection of intervals and the parameters set to zero by the lasso are equal to zero. What confuses me is this: given that the Laplace prior is continuous (in fact absolutely continuous), how can there be any mass on a set that is a product of intervals and singletons at {0}?
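To make the question concrete, here is a minimal sketch for the scalar case (one parameter, x = 1). It is not part of the original post. It relies on the standard fact that the lasso estimate is the mode of the posterior, and it assumes the posterior density is proportional to exp(-loss), which fixes the noise variance implicitly. The function names are hypothetical. The mode can sit exactly at zero, while the probability assigned to a shrinking interval around zero keeps shrinking.

```python
import math


def lasso_scalar(y, c):
    """Minimiser of (y - b)**2 + c*abs(b) over b: soft-thresholding at c/2."""
    return math.copysign(max(abs(y) - c / 2.0, 0.0), y)


def posterior_mass_near_zero(y, c, eps, lo=-10.0, hi=10.0, steps=40000):
    """Approximate P(|b| <= eps | y) under the density proportional to
    exp(-((y - b)**2 + c*abs(b))), using a midpoint Riemann sum."""
    h = (hi - lo) / steps
    total = 0.0
    near = 0.0
    for i in range(steps):
        b = lo + (i + 0.5) * h
        w = math.exp(-((y - b) ** 2 + c * abs(b)))
        total += w
        if abs(b) <= eps:
            near += w
    return near / total


y, c = 0.3, 1.0
print("lasso estimate (posterior mode):", lasso_scalar(y, c))
for eps in (0.1, 0.01, 0.001):
    print("eps =", eps, "posterior mass:", posterior_mass_near_zero(y, c, eps))
```

The point estimate and the posterior distribution are different objects here: the first is a single optimiser, the second a continuous distribution.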

How to perform dynamic optimization for a nonlinear discrete optimization problem with nonlinear constraints, using non-linear solvers like SNOPT?

I am new to the field of optimization and need help with the following optimization problem. I have tried to solve it with ordinary (non-optimization) code to check whether I get the correct results. However, the results I get are different, and I am not sure whether my analysis is correct. Here is a short description of the problem:
The objective function shown in the picture is used to find the optimal temperature of the insulating system that minimizes the total cost over a given horizon.
[This image provides the mathematical description of the objective function and the constraints](https://i.stack.imgur.com/yidrO.png)
The data of the problem are as follows:
1- Problem data:
A=1.07×10^8
h=1
T_ref=87.5
N=20
p1=0.001;
p2=0.0037;
This is the curve I want to obtain
2- Optimization variable:
u_t
3- Model type:
The model is a nonlinear cost function with non-linear constraints and it is solved using non-linear solver SNOPT.
4- The meaning of the symbols in the objective and constraint functions:
The optimization is performed over a prediction horizon of N years.
T_ref is the reference temperature.
X_DP represents the degree of polymerization in the kth year.
u_t represents the temperature of the insulating system in the kth year.
h is the time step (1 year) of the discrete-time model.
R is the ratio of the load loss at the rated load to the no-load loss.
E is the activation energy.
A is the pre-exponential constant.
beta is a linear coefficient representing the cost due to the decrement of the temperature.
I have written source code in MATLAB to check whether my analysis is correct.
I initialized the Ut values as an increasing sequence and as a decreasing sequence so that I could obtain curves similar to the original one. [This is the curve I obtained](https://i.stack.imgur.com/KVv2q.png)
I simulated the problem using conventional coding without optimization and got the figure shown above.
close all; clear all;
h=1;
N=20;
a=250;
R=8.314;
A=1.07*10^8;
E=111000;
Tref=87.5;
p1=0.0019;
p2=0.0037;
p3=0.0037;
Utt=[80,80.7894736842105,81.5789473684211,82.3684210526316,83.1578947368421,... % Utt is the temperature profile increasing over the prediction horizon.
83.9473684210526,84.7368421052632,85.5263157894737,86.3157894736842,...
87.1052631578947,87.8947368421053,88.6842105263158,89.4736842105263,...
90.2631578947369,91.0526315789474,91.8421052631579,92.6315789473684,...
93.4210526315790,94.2105263157895,95];
Utt1 = [95,94.2105263157895,93.4210526315790,92.6315789473684,91.8421052631579,... % Utt1 is the temperature profile decreasing over the prediction horizon.
91.0526315789474,90.2631578947369,89.4736842105263,88.6842105263158,...
87.8947368421053,87.1052631578947,86.3157894736842,85.5263157894737,...
84.7368421052632,83.9473684210526,83.1578947368421,82.3684210526316,...
81.5789473684211,80.7894736842105,80];
Ut1=zeros(1,N);
Ut2=zeros(1,N);
Xdp =zeros(N,N);
Xdp(1,1)=1000;
Xdp1 =zeros(N,N);
Xdp1(1,1)=1000;
for L=1:N-1
for k=1:N-1
%vt(k+L)=Ut(k-L+1);
Xdq(k+1,L) =(1/Xdp(k,L))+A*exp((-1*E)/(R*(Utt(k)+273)))*24*365*h;
Xdp(k+1,L)=1/(Xdq(k+1,L));
Xdp(k,L+1)=1/(Xdq(k+1,L));
Xdq1(k+1,L) =(1/Xdp1(k,L))+A*exp((-1*E)/(R*(Utt1(k)+273)))*24*365*h;
Xdp1(k+1,L)=1/(Xdq1(k+1,L));
Xdp1(k,L+1)=1/(Xdq1(k+1,L));
end
end
% MATLAB code
for j =1:N-1
Ut1(j)= -p1*(Utt(j)-Tref);
Ut2(j)= -p2*(Utt1(j)-Tref);
end
sum00=sum(Ut1);
sum01=sum(Ut2);
X1=1./Xdp(:,1);
Xf=1./Xdp(:,20);
Total= table(X1,Xf);
Tdiff =a*(Total.Xf-Total.X1);
X22=1./Xdp1(:,1);
X2f=1./Xdp1(:,20);
Total22= table(X22,X2f);
Tdiff22 =a*(Total22.X2f-Total22.X22);
obj=(sum00+(Tdiff));
ob1 = min(obj);
obj2=sum01+Tdiff22;
ob2 = min(obj2);
plot(Utt,obj,'-o');
hold on
plot(Utt1,obj2) % objective for the decreasing profile (obj2), not obj a second time
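As a cross-check of the degradation recursion used in the MATLAB loop above, here is a small Python sketch for a single temperature profile. It assumes the update 1/DP(k+1) = 1/DP(k) + A*exp(-E/(R*(T(k)+273)))*24*365*h, read off from the code, with R taken as the gas constant 8.314 as in the code. It covers only the simulation step, not the cost function or the SNOPT optimization, and the function names are hypothetical.

```python
import math

A = 1.07e8              # pre-exponential constant (from the question)
E = 111000.0            # activation energy (from the code)
R_GAS = 8.314           # gas constant, as used in the code
H = 1.0                 # time step in years
HOURS_PER_YEAR = 24 * 365


def linear_profile(start, stop, n):
    """n equally spaced temperatures from start to stop (degrees Celsius)."""
    return [start + (stop - start) * i / (n - 1) for i in range(n)]


def simulate_dp(temps_celsius, dp0=1000.0):
    """Degree of polymerization over the horizon for one temperature profile."""
    dp = [dp0]
    for t in temps_celsius[:-1]:
        rate = A * math.exp(-E / (R_GAS * (t + 273.0))) * HOURS_PER_YEAR * H
        dp.append(1.0 / (1.0 / dp[-1] + rate))
    return dp


dp_up = simulate_dp(linear_profile(80.0, 95.0, 20))
dp_down = simulate_dp(linear_profile(95.0, 80.0, 20))
print("final DP, increasing profile:", dp_up[-1])
print("final DP, decreasing profile:", dp_down[-1])
```

Keeping one list per profile avoids the N-by-N arrays of the MATLAB version, which makes it easier to see which values feed the objective.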

effective number

In Gelman's book, the effective number is defined in terms of the following:
R hat
the between- and within-sequence variances of the MCMC draws, B and W
the number of MCMC samples, denoted by n
the number of chains, denoted by m
I do not know how sampling() calculates the between-sequence variance when chains=1, so I cannot calculate these terms (B, W, m). I want to implement an algorithm from this paper: https://arxiv.org/abs/1804.06788.
Roughly speaking, the paper constructs a test statistic that is uniformly distributed under the null hypothesis that the MCMC sampling is correct. If the MCMC sampling is not correct, the histogram of the test statistic becomes skewed, and this deviation from uniformity indicates that the MCMC draws are biased. I want to implement this, but it requires calculating the above quantities.
Is there a function in rstan to extract the above quantities? I think that, in the process of calculating the R hat statistic, the quantities B, W and m are retained somewhere in the stanfit S4 object.
I am sorry, I found n_eff, but I do not know what m is in the case chains=1.
In the case that only one chain is estimated (which should not be happening anyway), then m = 2 because the post-warmup draws from the single chain are split into the first half and the second half. This splitting method is discussed in the documentation.
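The splitting described above can be sketched in a few lines. This is an illustration of the textbook split-chain formulas for B, W and R hat applied to one chain. It is not rstan's code, and rstan's own diagnostics may differ in detail. The function name split_chain_stats is hypothetical.

```python
import random
from statistics import mean, variance


def split_chain_stats(draws):
    """Split one chain into two halves and compute m, n, B, W and R hat."""
    half = len(draws) // 2
    chains = [draws[:half], draws[half:2 * half]]
    m = len(chains)          # number of (split) sequences, here 2
    n = half                 # draws per sequence
    chain_means = [mean(c) for c in chains]
    # Between-sequence variance: n / (m - 1) * sum((mean_j - grand_mean)^2)
    B = n * variance(chain_means)
    # Within-sequence variance: average of the per-sequence sample variances
    W = mean(variance(c) for c in chains)
    var_plus = (n - 1) / n * W + B / n
    rhat = (var_plus / W) ** 0.5
    return {"m": m, "n": n, "B": B, "W": W, "rhat": rhat}


rng = random.Random(1)
stationary = [rng.gauss(0.0, 1.0) for _ in range(1000)]
trending = [i / 1000 for i in range(1000)]
print(split_chain_stats(stationary))
print(split_chain_stats(trending))
```

A chain whose two halves disagree, such as the trending one, gives an R hat well above 1 even though only one chain was run.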

pymc python change point detection for small probabilities. ZeroProbability Error

I am trying to use pymc to find a change point in a time-series. The value I am looking at over time is probability to "convert" which is very small, 0.009 on average with a range of 0.001-0.016.
I give the two probabilities a uniform distribution as a prior between zero and the max observation.
import numpy as np
import pymc as pm

# df and n_days are defined earlier in the script (not shown)
alpha = df.cnvrs.max()  # upper bound of the uniform priors
center_1_c = pm.Uniform("center_1_c", 0, alpha)
center_2_c = pm.Uniform("center_2_c", 0, alpha)
day_c = pm.DiscreteUniform("day_c", lower=1, upper=n_days)

@pm.deterministic
def lambda_(day_c=day_c, center_1_c=center_1_c, center_2_c=center_2_c):
    out = np.zeros(n_days)
    out[:day_c] = center_1_c  # rate before the change point
    out[day_c:] = center_2_c  # rate after the change point
    return out

observation = pm.Uniform("obs", lambda_, value=df.cnvrs.values, observed=True)
When I run this code I get:
ZeroProbability: Stochastic obs's value is outside its support,
or it forbids its parents' current values.
I'm pretty new to pymc so not sure if I'm missing something obvious. My guess is I might not have appropriate distributions for modelling small probabilities.
It's impossible to tell where you've introduced this bug—and programming is off-topic here, in any case—without more of your output. But there is a statistical issue here: You've somehow constructed a model that cannot produce either the observed variables or the current sample of latent ones.
To give a simple example, say you have a dataset with negative values, and you've assumed it to be gamma distributed; this will produce an error, because the data has zero probability under a gamma. Similarly, an error will be thrown if an impossible value is sampled during an MCMC chain.
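The same mechanism can be shown without any library. The sketch below is a hand-written uniform log-density, not PyMC's implementation, and the function names and numbers are hypothetical. One observation outside the support sends the whole log-probability to minus infinity, which is the situation the ZeroProbability error reports. In the question's model the observed values are declared Uniform with lambda_ as a bound, so this appears to happen whenever a sampled rate lands on the wrong side of an observation.

```python
import math


def uniform_logp(value, lower, upper):
    """Log-density of Uniform(lower, upper) at value; -inf outside the support."""
    if lower <= value <= upper:
        return -math.log(upper - lower)
    return -math.inf


def total_logp(values, lowers, upper):
    """Sum of log-densities when every observation has its own lower bound."""
    return sum(uniform_logp(v, lo, upper) for v, lo in zip(values, lowers))


observed = [0.009, 0.004, 0.012]      # made-up conversion rates
ok_bounds = [0.001, 0.001, 0.001]     # every observation lies inside its support
bad_bounds = [0.001, 0.010, 0.001]    # second observation is below its lower bound

print(total_logp(observed, ok_bounds, 1.0))
print(total_logp(observed, bad_bounds, 1.0))
```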

How does TensorFlow handle complex gradients?

Let z be a complex variable and C(z) its conjugate.
In complex analysis, the derivative of C(z) with respect to z does not exist. But in TensorFlow, we can calculate dC(z)/dz and the result is just 1.
Here is an example:
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (placeholder, Session)

x = tf.placeholder('complex64',(2,2))
y = tf.reduce_sum(tf.conj(x))
z = tf.gradients(y,x)
sess = tf.Session()
X = np.random.rand(2,2)+1.j*np.random.rand(2,2)
X = X.astype('complex64')
Z = sess.run(z,{x:X})[0]
The input X is
[[0.17014372+0.71475762j 0.57455420+0.00144318j]
[0.57871044+0.61303568j 0.48074263+0.7623235j ]]
and the result Z is
[[1.-0.j 1.-0.j]
[1.-0.j 1.-0.j]]
I don't understand why the gradient is set to 1.
I would also like to know how TensorFlow handles complex gradients in general.
How?
The equation used by TensorFlow for the gradient is:
$$\nabla_z f = \left( \frac{\partial f}{\partial z} + \frac{\partial f^*}{\partial z} \right)^*$$
where the '*' means conjugate.
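The partial derivatives with respect to z and z* are defined using Wirtinger calculus, which makes it possible to compute derivatives with respect to a complex variable for non-holomorphic functions. The Wirtinger definition, with z = x + iy, is:
$$\frac{\partial f}{\partial z} = \frac{1}{2}\left( \frac{\partial f}{\partial x} - i\,\frac{\partial f}{\partial y} \right), \qquad \frac{\partial f}{\partial z^*} = \frac{1}{2}\left( \frac{\partial f}{\partial x} + i\,\frac{\partial f}{\partial y} \right)$$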
Why this definition?
When using, for example, complex-valued neural networks (CVNN), the gradients are taken of non-holomorphic, real-valued scalar functions of one or several complex variables. TensorFlow's definition of the gradient can then be written as:
$$\nabla_z f = 2\,\frac{\partial f}{\partial z^*}$$
This definition agrees with the CVNN literature, for example chapter 4, section 4.3 of this book or Amin et al. (among countless examples).
Bit late, but I came across this issue recently too.
The key point is that TensorFlow defines the "gradient" of a complex-valued function f(z) of a complex variable as "the gradient of the real map F: (x,y) -> Re(f(x+iy)), expressed as a complex number" (the gradient of that real map is a vector in R^2, so we can express it as a complex number in the obvious way).
Presumably the reason for that definition is that in TF one is usually concerned with gradients for the purpose of running gradient descent on a loss function, and in particular for identifying the direction of maximum increase/decrease of that loss function. Using the above definition of gradient means that a complex-valued function of complex variables can be used as a loss function in a standard gradient descent algorithm, and the result will be that the real part of the function gets minimised (which seems to me a somewhat reasonable interpretation of "optimise this complex-valued function").
Now, to your question, an equivalent way to write that definition of gradient is
gradient(f) := dF/dx + idF/dy = conj(df/dz + dconj(f)/dz)
(you can easily verify that using the definition of d/dz). That's how TensorFlow handles complex gradients. As for the case of f(z):=conj(z), we have df/dz=0 (as you mention) and dconj(f)/dz=1, giving gradient(f)=1.
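The identity above can be checked numerically without TensorFlow. The sketch below uses central finite differences and is only an illustration of the definition given in this answer. It does not call TensorFlow, and the function names are hypothetical.

```python
def real_map_gradient(f, z, h=1e-6):
    """dF/dx + i*dF/dy for F(x, y) = Re(f(x + iy)), by central differences."""
    def F(x, y):
        return f(complex(x, y)).real
    x, y = z.real, z.imag
    dF_dx = (F(x + h, y) - F(x - h, y)) / (2 * h)
    dF_dy = (F(x, y + h) - F(x, y - h)) / (2 * h)
    return complex(dF_dx, dF_dy)


def wirtinger_form(f, z, h=1e-6):
    """conj(df/dz + dconj(f)/dz), with d/dz = (d/dx - i*d/dy) / 2."""
    def d_dz(g):
        g_x = (g(z + h) - g(z - h)) / (2 * h)
        g_y = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)
        return 0.5 * (g_x - 1j * g_y)
    total = d_dz(f) + d_dz(lambda w: f(w).conjugate())
    return total.conjugate()


def conj(w):
    return w.conjugate()


def square(w):
    return w * w


z0 = complex(0.3, -0.7)
print(real_map_gradient(conj, z0), wirtinger_form(conj, z0))
print(real_map_gradient(square, z0), wirtinger_form(square, z0))
```

For f(z) = conj(z) both expressions come out as 1, matching the TensorFlow result in the question. For f(z) = z^2 both give 2*conj(z).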
I wrote up a longer explanation here, if you're interested: https://github.com/tensorflow/tensorflow/issues/3348#issuecomment-512101921

Reason for initializing TensorFlow variables with a small stddev

I have a question about why TensorFlow variables are initialized with a small stddev.
I guess many people have run the MNIST test code from TensorFlow's beginner's guide.
Following it, the first layer's weights are initialized using truncated_normal with stddev 0.1.
I guessed that setting a much bigger value would give the same result, with the same accuracy.
But even after increasing the epoch count, it does not work.
Does anybody know the reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)
The reason is that you want to keep the variances (or standard deviations) of all layers approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strongly each weight influences the final output; the variance of a layer's weights directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all values z greater or smaller than +/- 4 are very close to one (for z > 4) or zero (for z < -4) and only values within that range tend to have some meaningful "change".
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, weights that produce very large or very small pre-activations are optimized very slowly, since a change in them has an extremely small effect on the result y = sigmoid(W*x+b). The pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. because the weights were initialized with a high variance (i.e. standard deviation), the result will tend to be large in magnitude, leading to said problem. This is also the reason why truncated_normal is used rather than a plain normal distribution: the latter only guarantees that most values are close to the mean, with a chance of a little under 5% that a value falls more than two standard deviations away, while truncated_normal discards and redraws every value that lies more than two standard deviations from the mean, guaranteeing that all weights stay within the same range while still being approximately normally distributed.
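The saturation argument can be made concrete with a small sketch. It assumes a single sigmoid unit with 784 inputs drawn uniformly from [0, 1] (roughly the scale of MNIST pixels) and weights drawn from a plain Gaussian rather than truncated_normal. It does not reproduce the tutorial network, and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(z)
    return s * (1.0 - s)


def mean_grad(stddev, n_inputs=784, trials=200, seed=0):
    """Average sigmoid derivative at z = sum(w_i * x_i) for random weights."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = sum(rng.gauss(0.0, stddev) * rng.random() for _ in range(n_inputs))
        total += sigmoid_grad(z)
    return total / trials


print("sigmoid(5), sigmoid(1000):", sigmoid(5.0), sigmoid(1000.0))
print("mean derivative, stddev=0.1:", mean_grad(0.1))
print("mean derivative, stddev=200:", mean_grad(200.0))
```

With the large standard deviation almost every pre-activation lands in the flat part of the sigmoid, so the derivative that backpropagation multiplies by is close to zero.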
To make matters worse, in a typical neural network - especially in deep learning - each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (a variation of the vanishing gradients, where gradients are getting smaller).
The reason that this is a problem is because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: Its weights get adjusted very strongly - likely overcorrecting the actual problem - and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic in itself, and there are more sophisticated methods such as Xavier initialization or layer-sequential unit variance; however, small normally distributed values are usually a good first guess.