How does Chi-Square based SPAM detection works in SpamAssassin - bayesian

I'm trying to understand about Bayes based spam detection, and have difficulty understanding how to code it.
To understand it, I'm reading code of SpamAssassin like below.
http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Bayes/CombineChi.pm?view=markup
But, I could not understand how the chi2q function.
# Chi-squared function (API changed; see comment above)
107 sub chi2q {
108 my ($x2, $halfv) = #_;
109
110 my $m = $x2 / 2.0;
111 my ($sum, $term);
112 $sum = $term = exp(0 - $m);
113
114 # replace 'for my $i (1 .. (($v/2)-1))' idiom, which creates a temp
115 # array, with a plain C-style for loop
116 my $i;
117 for ($i = 1; $i < $halfv; $i++) {
118 $term *= $m / $i;
119 $sum += $term;
120 }
121 return $sum < 1.0 ? $sum : 1.0;
122 }
I tried to google or read book, but cannot find full explanation including from theory to code.
Can you explain why it works?

The Chi Squared test can tell if two sets of numbers are "similar"
The best explaination I could find with googling quickly was here http://formulas.tutorvista.com/math/chi-square-formula.html
This involves finding the difference between an observed value and the expected value. Or the value in a different condition. Then the difference is squared. Squaring it has two effects, the squared numbers become positive and any differences are accentuated.
Then all the numbers found with this difference and squaring operation are added up and this makes a number. The number, together with the "degrees of freedom" in the observations is compared on a table to find the "p value" or probability of the result occurring by chance
It allows a match of similarity on two sets of values, without them being exactly the same
I'm sure you can imagine how useful this sort of comparison can be for detecting spam
Your code sample does not seem to do this, so I can only guess that there are other calculations happening in the rest of the spamassassin code base

Related

AMPL: "Bad suffix .npool for Initial" error with poolstub

I need to find a pool of solutions with AMPL (I am relatively new to it) using the option "poolstub" but I get an error when I try to retrive them. I will try to explain everything step by step. This is my code:
option solver cplex;
model my_model.mod;
data my_data.dat;
option cplex_options 'poolstub=multmip poolcapacity=10 populate=1 poolintensity=4 poolreplace=1';
solve;
At this point AMPLE gives me this:
CPLEX 20.1.0.0: poolstub=multmip
poolcapacity=10
populate=1
poolintensity=4
poolreplace=1
CPLEX 20.1.0.0: optimal solution; objective 4.153846154
66 dual simplex iterations (0 in phase I)
It seems like AMPL has not stored the solutions in the pool.
And in fact, if I try to retrive them with this code
for {i in 1..Current.npool} {
solution ('multmip' & i & '.sol');
display _varname, _var;
}
I get this error:
Bad suffix .npool for Initial
context: for {i in >>> 1..Current.npool} <<< {
Possible suffix values for Initial.suffix:
astatus exitcode message relax
result sstatus stage
for{...} { ? ampl: for{...} { ? ampl:
I have no integer variables, only real ones and I read that CPLEX doesn't support the populate method for linear programs. Could this be the problem or is something else missing? Thank you in advance
You have identified your problem correctly. Entity Initial does not have the npool suffix, which means the solver (in your case CPLEX) did not return one.
Gurobi can return that information for linear programs, but it seems to be identical to the optimal solution, so it would not give you any extra information (more info on AMPL-Gurobi options).
Here is an example AMPL script:
model net1.mod;
data net1.dat;
option solver gurobi;
option gurobi_options 'ams_stub=allopt ams_mode=1';
solve;
for {n in 1..Total_Cost.npool} {
solution ("allopt" & n & ".sol");
display Ship;
}
Output (on my machine):
Gurobi 9.1.1: ams_stub=allopt
ams_mode=2
ams_epsabs=0.5
Gurobi 9.1.1: optimal solution; objective 1819
1 simplex iterations
Alternative MIP solution 1, objective = 1819
1 alternative MIP solutions written to "allopt1.sol"
... "allopt1.sol".
Alternative solutions do not include dual variable values.
Best solution is available in "allopt1.sol".
suffix npool OUT;
Alternative MIP solution 1, objective = 1819
Ship :=
NE BOS 90
NE BWI 60
NE EWR 100
PITT NE 250
PITT SE 200
SE ATL 70
SE BWI 60
SE EWR 20
SE MCO 50
;
The files net1.mod and net1.dat are from the AMPL book.
When solving a MIP the solver can store sub-optimal solutions that it found along the way as they might be interesting for some reason to the modeler.
In terms of your LP, are you interested in the vertices the simplex algorithm visits?

Stan. Using target += syntax

I'm starting to learn Stan.
Could anyone explain when and how to use syntax such as... ?
target +=
instead of just:
y ~ normal(mu, sigma)
For example in Stan manual you can find the following example.
model {
real ps[K]; // temp for log component densities
sigma ~ cauchy(0, 2.5);
mu ~ normal(0, 10);
for (n in 1:N) {
for (k in 1:K) {
ps[k] = log(theta[k])
+ normal_lpdf(y[n] | mu[k], sigma[k]);
}
target += log_sum_exp(ps);
}
}
I think the target line increases the target value, that I think it's the logarithm of the posterior density.
But the posterior density for what parameter?
When is it updated and initialized?
After Stan finishes (and converges), how do you access its value and how I use it?
Other examples:
data {
int<lower=0> J; // number of schools
real y[J]; // estimated treatment effects
real<lower=0> sigma[J]; // s.e. of effect estimates
}
parameters {
real mu;
real<lower=0> tau;
vector[J] eta;
}
transformed parameters {
vector[J] theta;
theta = mu + tau * eta;
}
model {
target += normal_lpdf(eta | 0, 1);
target += normal_lpdf(y | theta, sigma);
}
the example above uses target twice instead of just once.
another example.
data {
int<lower=0> N;
vector[N] y;
}
parameters {
real mu;
real<lower=0> sigma_sq;
vector<lower=-0.5, upper=0.5>[N] y_err;
}
transformed parameters {
real<lower=0> sigma;
vector[N] z;
sigma = sqrt(sigma_sq);
z = y + y_err;
}
model {
target += -2 * log(sigma);
z ~ normal(mu, sigma);
}
This last example even mixes both methods.
To do it even more difficult I've read that
y ~ normal(0,1);
has the same effect than
increment_log_prob(normal_log(y,0,1));
Could anyone explain why, please?
Could anyone provide a simple example written in two different ways, with "target +=" and in the regular simpler "y ~" way, please?
Regards
The syntax
target += u;
adds u to the target log density.
The target density is the density from which the sampler samples and it needs to be equal to the joint density of all the parameters given the data up to a constant (which is usually achieved via Bayes's rule by coding as the joint density of parameters and modeled data up to a constant). You access it as lp__ in the posterior, but be careful, as it also contains the Jacobians arising from the constraints and drops constants in sampling statements---you do not want to use it for model comparison.
From a sampling perspective, writing
target += normal_lpdf(y | mu, sigma);
has the same effect as
y ~ normal(mu, sigma);
The _lpdf signals it's the log probability density function for the normal, which is implicit in the sampling notation. The sampling notation is just shorthand for the target += syntax, and in addition, drops constant terms in the log density.
It's explained in the statements section of the language reference (the second part of the manual) and used in multiple examples through the programmer's guide (the first part of the manual).
I am just starting to learn Stan and Bayesian statistics, and mainly rely on John Kruschke's book "Doing Bayesian Data Analysis". Here, in chapter 14.3.3, he explains:
Thus, the essence of computation in Stan is dealing with the logarithm
of the posterior probability density and its gradient; there is no
direct random sampling of parameters from distributions.
As a result (still rephrasing Kruschke), a
model [...] like y ∼ normal(mu,sigma) [actuallly] means to multiply the current posterior probability by the density of the normal distribution at the datum value y.
Following logarithm calculation rules, this multiplication is equal to add the log probability density of a given data y to the current log-probability. (log(a*b) = log(a) + log(b), hence the equality of multiplication and sum).
I concede that I don't grasp the full implications of that, but I think it points into the right direction into what, mathematically speaking, the targer += does.

Find global maximum in the lest number of computations

Let's say I have a function f defined on interval [0,1], which is smooth and increases up to some point a after which it starts decreasing. I have a grid x[i] on this interval, e.g. with a constant step size of dx = 0.01, and I would like to find which of those points has the highest value, by doing the smallest number of evaluations of f in the worst-case scenario. I think I can do much better than exhaustive search by applying something inspired with gradient-like methods. Any ideas? I was thinking of something like a binary search perhaps, or parabolic methods.
This is a bisection-like method I coded:
def optimize(f, a, b, fa, fb, dx):
if b - a <= dx:
return a if fa > fb else b
else:
m1 = 0.5*(a + b)
m1 = _round(m1, a, dx)
fm1 = fa if m1 == a else f(m1)
m2 = m1 + dx
fm2 = fb if m2 == b else f(m2)
if fm2 >= fm1:
return optimize(f, m2, b, fm2, fb, dx)
else:
return optimize(f, a, m1, fa, fm1, dx)
def _round(x, a, dx, right = False):
return a + dx*(floor((x - a)/dx) + right)
The idea is: find the middle of the interval and compute m1 and m2- the points to the right and to the left of it. If the direction there is increasing, go for the right interval and do the same, otherwise go for the left. Whenever the interval is too small, just compare the numbers on the ends. However, this algorithm still does not use the strength of the derivatives at points I computed.
Such a function is called unimodal.
Without computing the derivatives, you can work by
finding where the deltas x[i+1]-x[i] change sign, by dichotomy (the deltas are positive then negative after the maximum); this takes Log2(n) comparisons; this approach is very close to what you describe;
adapting the Golden section method to the discrete case; it takes Logφ(n) comparisons (φ~1.618).
Apparently, the Golden section is more costly, as φ<2, but actually the dichotomic search takes two function evaluations at a time, hence 2Log2(n)=Log√2(n) .
One can show that this is optimal, i.e. you can't go faster than O(Log(n)) for an arbitrary unimodal function.
If your function is very regular, the deltas will vary smoothly. You can think of the interpolation search, which tries to better predict the searched position by a linear interpolation rather than simple halving. In favorable conditions, it can reach O(Log(Log(n)) performance. I don't know of an adaptation of this principle to the Golden search.
Actually, linear interpolation on the deltas is very close to parabolic interpolation on the function values. The latter approach might be the best for you, but you need to be careful about the corner cases.
If derivatives are allowed, you can use any root solving method on the first derivative, knowing that there is an isolated zero in the given interval.
If only the first derivative is available, use regula falsi. If the second derivative is possible as well, you may consider Newton, but prefer a safe bracketing method.
I guess that the benefits of these approaches (superlinear and quadratic convergence) are made a little useless by the fact that you are working on a grid.
DISCLAIMER: Haven't test the code. Take this as an "inspiration".
Let's say you have the following 11 points
x,f(x) = (0,3),(1,7),(2,9),(3,11),(4,13),(5,14),(6,16),(7,5),(8,3)(9,1)(1,-1)
you can do something like inspired to the bisection method
a = 0 ,f(a) = 3 | b=10,f(b)=-1 | c=(0+10/2) f(5)=14
from here you can see that the increasing interval is [a,c[ and there is no need to that for the maximum because we know that in that interval the function is increasing. Maximum has to be in interval [c,b]. So at the next iteration you change the value of a s.t. a=c
a = 5 ,f(a) = 14 | b=10,f(b)=-1 | c=(5+10/2) f(6)=16
Again [a,c] is increasing so a is moved on the right
you can iterate the process until a=b=c.
Here the code that implements this idea. More info here:
int main(){
#define STEP (0.01)
#define SIZE (1/STEP)
double vals[(int)SIZE];
for (int i = 0; i < SIZE; ++i) {
double x = i*STEP;
vals[i] = -(x*x*x*x - (0.6)*(x*x));
}
for (int i = 0; i < SIZE; ++i) {
printf("%f ",vals[i]);
}
printf("\n");
int a=0,b=SIZE-1,c;
double fa=vals[a],fb=vals[b] ,fc;
c=(a+b)/2;
fc = vals[c];
while( a!=b && b!=c && a!=c){
printf("%i %i %i - %f %f %f\n",a,c,b, vals[a], vals[c],vals[b]);
if(fc - vals[c-1] > 0){ //is the function increasing in [a,c]
a = c;
}else{
b=c;
}
c=(a+b)/2;
fa=vals[a];
fb=vals[b];
fc = vals[c];
}
printf("The maximum is %i=%f with %f\n", c,(c*STEP),vals[a]);
}
Find points where derivative(of f(x))=(df/dx)=0
for derivative you could use five-point-stencil or similar algorithms.
should be O(n)
Then fit those multiple points (where d=0) on a polynomial regression / least squares regression .
should be also O(N). Assuming all numbers are neighbours.
Then find top of that curve
shouldn't be more than O(M) where M is resolution of trials for fit-function.
While taking derivative, you could leap by k-length steps until derivate changes sign.
When derivative changes sign, take square root of k and continue reverse direction.
When again, derivative changes sign, take square root of new k again, change direction.
Example: leap by 100 elements, find sign change, leap=10 and reverse direction, next change ==> leap=3 ... then it could be fixed to 1 element per step to find exact location.
I am assuming that the function evaluation is very costly.
In the special case, that your function could be approximately fitted with a polynomial, you can easily calculate the extrema in least number of function evaluations. And since you know that there is only one maximum, a polynomial of degree 2 (quadratic) might be ideal.
For example: If f(x) can be represented by a polynomial of some known degree, say 2, then, you can evaluate your function at any 3 points and calculate the polynomial coefficients using Newton's difference or Lagrange interpolation method.
Then its simple to solve for the maximum for this polynomial. For a degree 2 you can easily get a closed form expression for the maximum.
To get the final answer you can then search in the vicinity of the solution.

Stan version of a JAGS model which includes a sum of discrete values - Is it possible?

I was trying to run this model in Stan. I have a running JAGS version of it (that returns highly autocorrelated parameters) and I know how to formulate it as CDF of a double exponential (with two rates), which would probably run without problems. However, I would like to use this version as a starting point for similar but more complex models.
By now I have the suspicion that a model like this is not possible in Stan. Maybe because of the discreteness introduces by taking the sum of a Boolean value, Stan may not be able to calculate gradients.
Does anyone know whether this is the case, or if I do something else in a wrong way in this model? I paste the errors I get below the model code.
Many thanks in advance
Jan
Model:
data {
int y[11];
int reps[11];
real soas[11];
}
parameters {
real<lower=0.001,upper=0.200> v1;
real<lower=0.001,upper=0.200> v2;
}
model {
real dif[11,96];
real cf[11];
real p[11];
real t1[11,96];
real t2[11,96];
for (i in 1:11){
for (r in 1:reps[i]){
t1[i,r] ~ exponential(v1);
t2[i,r] ~ exponential(v2);
dif[i,r] <- (t1[i,r]+soas[i]<=(t2[i,r]));
}
cf[i] <- sum(dif[i]);
p[i] <-cf[i]/reps[i];
y[i] ~ binomial(reps[i],p[i]);
}
}
Here is some dummy data:
psy_dat = {
'soas' : numpy.array(range(-100,101,20)),
'y' : [47, 46, 62, 50, 59, 47, 36, 13, 7, 2, 1],
'reps' : [48, 48, 64, 64, 92, 92, 92, 64, 64, 48, 48]
}
And here are the errors:
DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal): Left-hand side of sampling statement (~) contains a non-linear transform of a parameter or local variable.
You must call increment_log_prob() with the log absolute determinant of the Jacobian of the transform.
Sampling Statement left-hand-side expression:
get_base1(get_base1(t1,i,"t1",1),r,"t1",2) ~ exponential_log(...)
Warning (non-fatal): Left-hand side of sampling statement (~) contains a non-linear transform of a parameter or local variable.
You must call increment_log_prob() with the log absolute determinant of the Jacobian of the transform.
Sampling Statement left-hand-side expression:
get_base1(get_base1(t2,i,"t2",1),r,"t2",2) ~ exponential_log(...)
And at runtime:
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
stan::prob::exponential_log(N4stan5agrad3varE): Random variable is nan:0, but must not be nan!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Rejecting proposed initial value with zero density.
Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model
Here is a working JAGS version of this model:
model {
for ( n in 1 : N ) {
for (r in 1 : reps[n]){
t1[r,n] ~ dexp(v1)
t2[r,n] ~ dexp(v2)
c[r,n] <- (1.0*((t1[r,n]+durs[n])<=t2[r,n]))
}
p[n] <- max((min(sum(c[,n]) / (reps[n]),0.99999999999999)), 1-0.99999999999999))
y[n] ~ dbin(p[n],reps[n])
}
v1 ~ dunif(0.0001,0.2)
v2 ~ dunif(0.0001,0.2)
}
With regard to the min() and max(): See this post https://stats.stackexchange.com/questions/130978/observed-node-inconsistent-when-binomial-success-rate-exactly-one?noredirect=1#comment250046_130978.
I'm still not sure what model you are trying to estimate (it would be best if you post the JAGS code) but what you have above cannot work in Stan. Stan is closer to C++ in the sense that you have to declare and then define objects. In your Stan program, you have the two declarations
real t1[11,96];
real t2[11,96];
but no definitions of t1 or t2. Consequently, they are initalized to NaN and when you do
t1[i,r] ~ exponential(v1);
that gets parsed as something like
for(i in 1:11) for(r in 1:reps[i]) lp__ += log(v1) - v1 * NaN
where lp__ is an internal symbol that holds value of the log-posterior, which becomes NaN and it cannot do Metropolis-style updates of the parameters.
Perhaps you meant for t1 and t2 to be unknown parameters, in which case they should be declared in the parameters block. The following [EDITED] Stan program is valid and should work, but it might not be the program you had in mind (it does not make a lot of sense to me and the discontinuity in dif will probably preclude Stan from sampling from this posterior distribution efficiently).
data {
int<lower=1> N;
int y[N];
int reps[N];
real soas[N];
}
parameters {
real<lower=0.001,upper=0.200> v1;
real<lower=0.001,upper=0.200> v2;
real t1[N,max(reps)];
real t2[N,max(reps)];
}
model {
for (i in 1:N) {
real dif[reps[i]];
for (r in 1:reps[i]) {
dif[r] <- (t1[i,r]+soas[i]) <= t2[i,r];
}
y[i] ~ binomial(reps[i], (1.0 + sum(dif)) / (1.0 + reps[i]));
}
to_array_1d(t1) ~ exponential(v1);
to_array_1d(t2) ~ exponential(v2);
}

What causes a nan error result?

I have this chunk of code which runs through a loop. The lower case x var always prints correctly. The upper case X var sometimes prints correctly, sometimes prints nan or junk. Why?
N.B. The data is always identical.
Link to FFT
Link to FFT example usage
Link to my other SO question which shows how this is being used. BOUNTY OF 200 points!
double (*x)[2];
double (*X)[2];
x = malloc(2 * 512 * sizeof(double));
X = malloc(2 * 512 * sizeof(double));
for (j = 0; j < 10; j++){
(*x)[j] = // values inserted from method argument.;
}
fft(512, x, X);
for (j = 0; j < 512; j++){
if (i==512*20) {
NSLog(#"PRE POST %f - %f",(*x)[j], (*X)[j]);
}
}
free(x);
free(X);
In floating point arithmetic, there are several operations that will result in a NaN error. Wikipedia points out these operations as resulting in a NaN:
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
(These are called indeterminate forms.)
Check your code to see if you're performing any operations that can't have a numeric answer.
As for the 'junk' results, they may be the result of messed up memory allocation, but you haven't given much detail so I can't be sure.
In other languages which I have worked in, "NaN" (not a number) is what you get when you divide a 0.0 by 0.0. I don't know Objective-C, but it's likely the same.
As for what is causing nan to be stored in X... you'll have to show us the body of fft before anyone can answer that. You said you think it might be a memory/pointer bug, because it's not consistent. I just looked up how NaN is represented in the IEE 7754 floating point format (which your platform is likely using) -- basically, several of the high-order bits (which normally hold the exponent of a floating-point number) all have to be filled with 1s.
If you do have a memory corruption bug, which is causing junk to be stored into X, then if those particular bits happened to all be 1s, that would cause the number to print as "nan".
Again, please show the body of fft, so someone can try to help you further.
I tried running this - initialised the data using (*x)[j] = j and removed the i==512*20 printing condition. All values came back fine. I also tried with random input data - still good. What is the nature of your input data?
(I'll look at your other question as well)
edit: I should point out I filled 512 values of the x array - your loop above only fills 10, so much of the input array is uninitialised.