What is the purpose of the tilde operator in this code?

Here is some code to find the IQR (interquartile range) in a dataset:
cols = ['M01AB', 'M01AE', 'N02BA', 'N02BE', 'N05B', 'N05C', 'R03', 'R06']
Q1 = df[cols].quantile(0.25)  # first quartile of each column
Q3 = df[cols].quantile(0.75)  # third quartile of each column
IQR = Q3 - Q1
# keep only the rows where no column value lies outside the 1.5 * IQR fences
df[~((df[cols] < (Q1 - 1.5 * IQR)) | (df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
I am unable to understand the use of the tilde operator (~) in the last line of this code.
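In short, ~ is the element-wise logical NOT operator on a boolean pandas mask: the expression inside the outer parentheses flags rows where any column falls outside the 1.5 * IQR fences, and ~ inverts that mask so the indexing keeps only the non-outlier rows. A minimal sketch with a made-up one-column DataFrame:
import pandas as pd

# hypothetical data, just to illustrate ~ on a boolean mask
df = pd.DataFrame({"x": [1, 2, 3, 100]})
mask = df["x"] > 10   # True where x is an "outlier": [False, False, False, True]
print(~mask)          # inverted mask: [True, True, True, False]
print(df[~mask])      # keeps only the rows where mask is False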

Related

Numerically stable calculation of invariant mass in particle physics?

In particle physics, we have to compute the invariant mass a lot. For a two-body decay it is given by
M**2 = m1**2 + m2**2 + 2 * (e1 * e2 - p1 · p2)
where e1 and e2 are the particle energies, e_i = sqrt(|p_i|**2 + m_i**2), and p1 · p2 is the dot product of the momentum vectors. The momenta (p1, p2) are sometimes very large (up to a factor of 1000 or more) compared to the masses (m1, m2). In that case, there is large cancellation between the last two terms when the calculation is carried out with floating-point numbers on a computer.
What kind of numerical tricks can be used to compute this accurately for any inputs?
The question is about suitable numerical tricks to improve the accuracy of the calculation with floating-point numbers, so the solution should be language-agnostic. For demonstration purposes, implementations in Python are preferred. Solutions which reformulate the problem and increase the number of elementary operations are acceptable, but solutions which suggest using other number types, like decimal or multi-precision floating-point numbers, are not.
Note: The original question presented a simplified one-dimensional problem in the form of a Python expression, but the question is about the general case where the momenta are given in three dimensions. The question was reformulated accordingly.
With a few tricks listed on Stack Overflow and the transformation described by Jakob Stark in his answer, it is possible to rewrite the equation into a form that no longer suffers from catastrophic cancellation.
The original question asked for a solution in 1D, which has a simple solution, but in practice we need the formula in 3D, where the solution is more complicated. See this notebook for a full derivation.
Example implementation of numerically stable calculation in 3D in Python:
import numpy as np

# numerically stable implementation
@np.vectorize
def msq2(px1, py1, pz1, px2, py2, pz2, m1, m2):
    p1_sq = px1 ** 2 + py1 ** 2 + pz1 ** 2
    p2_sq = px2 ** 2 + py2 ** 2 + pz2 ** 2
    m1_sq = m1 ** 2
    m2_sq = m2 ** 2
    x1 = m1_sq / p1_sq
    x2 = m2_sq / p2_sq
    x = x1 + x2 + x1 * x2
    a = angle(px1, py1, pz1, px2, py2, pz2)
    cos_a = np.cos(a)
    if cos_a >= 0:
        # stable rewrite of sqrt(x + 1) - cos(a) for cos(a) >= 0
        y1 = (x + np.sin(a) ** 2) / (np.sqrt(x + 1) + cos_a)
    else:
        # no cancellation when cos(a) < 0
        y1 = -cos_a + np.sqrt(x + 1)
    y2 = 2 * np.sqrt(p1_sq * p2_sq)
    return m1_sq + m2_sq + y1 * y2

# numerically stable calculation of the angle between two vectors
def angle(x1, y1, z1, x2, y2, z2):
    # cross product (the sign of cy does not matter, only the norm is used)
    cx = y1 * z2 - y2 * z1
    cy = x1 * z2 - x2 * z1
    cz = x1 * y2 - x2 * y1
    # norm of cross product
    c = np.sqrt(cx * cx + cy * cy + cz * cz)
    # dot product
    d = x1 * x2 + y1 * y2 + z1 * z2
    # arctan2 is accurate for both small and large angles
    return np.arctan2(c, d)
The numerically stable implementation can never produce a negative result, which is a commonly occurring problem with naive implementations, even in double precision.
Let's compare the numerically stable function with a naive implementation.
# naive implementation
def msq1(px1, py1, pz1, px2, py2, pz2, m1, m2):
    p1_sq = px1 ** 2 + py1 ** 2 + pz1 ** 2
    p2_sq = px2 ** 2 + py2 ** 2 + pz2 ** 2
    m1_sq = m1 ** 2
    m2_sq = m2 ** 2
    # energies of particles 1 and 2
    e1 = np.sqrt(p1_sq + m1_sq)
    e2 = np.sqrt(p2_sq + m2_sq)
    # dangerous cancellation in the third term
    return m1_sq + m2_sq + 2 * (e1 * e2 - (px1 * px2 + py1 * py2 + pz1 * pz2))
For the following comparison, the momenta p1 and p2 are randomly picked from 1 to 1e5 and the masses m1 and m2 from 1e-5 to 1e5. All implementations get the input values in single precision. The reference in both cases is calculated with mpmath, using the naive formula with 100 decimal places.
The naive implementation loses all accuracy for some inputs, while the numerically stable implementation does not.
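A small sketch of such a comparison, using msq1 and msq2 from above (the exact inputs behind the original benchmark are not reproduced here, so this is only illustrative):
import numpy as np
from mpmath import mp, mpf, sqrt as mpsqrt

mp.dps = 100  # reference: naive formula with 100 decimal places

def msq_ref(px1, py1, pz1, px2, py2, pz2, m1, m2):
    px1, py1, pz1, px2, py2, pz2, m1, m2 = (
        mpf(float(v)) for v in (px1, py1, pz1, px2, py2, pz2, m1, m2))
    e1 = mpsqrt(px1**2 + py1**2 + pz1**2 + m1**2)
    e2 = mpsqrt(px2**2 + py2**2 + pz2**2 + m2**2)
    return m1**2 + m2**2 + 2 * (e1 * e2 - (px1 * px2 + py1 * py2 + pz1 * pz2))

rng = np.random.default_rng(1)
p = rng.uniform(1, 1e5, size=6).astype(np.float32)     # momentum components
m = rng.uniform(1e-5, 1e5, size=2).astype(np.float32)  # masses
args = (*p, *m)

exact = float(msq_ref(*args))
print(abs(float(msq1(*args)) - exact) / exact)  # naive: can lose all accuracy
print(abs(float(msq2(*args)) - exact) / exact)  # stable: small relative error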
If you put e.g. m1 = 1e-4, m2 = 1e-4, p1 = 1 and p2 = 1 into the expression, you get about 4e-8 with double precision, but 0.0 with a single-precision calculation. I assume that your question is about how one can get the 4e-8 with a single-precision calculation as well.
What you can do is a Taylor expansion (around m1 = 0 and m2 = 0) of the expression above.
e ~ e|(m1=0,m2=0) + de/dm1|(m1=0,m2=0) * m1 + de/dm2|(m1=0,m2=0) * m2 + ...
If I calculated correctly, the zeroth and first order terms are 0 and the second order expansion would be
e ~ (p1+p2)/p1 * m1**2 + (p1+p2)/p2 * m2**2
This yields exactly 4e-8, even with a single-precision calculation. You can of course include more terms in the expansion if you need them, until you hit the precision limit of a single float.
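The simplified 1D expression was removed from the question; assuming it had the form (e1 + e2)**2 - (p1 + p2)**2 with e_i = sqrt(p_i**2 + m_i**2) (an assumption for illustration), a short sketch reproduces the numbers above:
import numpy as np

def msq_naive(p1, p2, m1, m2, dtype):
    # assumed form of the (removed) 1D expression from the question
    p1, p2, m1, m2 = (dtype(v) for v in (p1, p2, m1, m2))
    e1 = np.sqrt(p1 * p1 + m1 * m1)
    e2 = np.sqrt(p2 * p2 + m2 * m2)
    return (e1 + e2) ** 2 - (p1 + p2) ** 2

def msq_taylor(p1, p2, m1, m2, dtype):
    # second-order expansion around m1 = 0, m2 = 0
    p1, p2, m1, m2 = (dtype(v) for v in (p1, p2, m1, m2))
    return (p1 + p2) / p1 * m1 ** 2 + (p1 + p2) / p2 * m2 ** 2

print(msq_naive(1, 1, 1e-4, 1e-4, np.float64))   # ~4e-8
print(msq_naive(1, 1, 1e-4, 1e-4, np.float32))   # 0.0: catastrophic cancellation
print(msq_taylor(1, 1, 1e-4, 1e-4, np.float32))  # ~4e-8 even in single precision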
Edit
If the mi are not always much smaller than the pi, you can further massage the equation to get
e = m1**2 + m2**2 + 2 * p1 * p2 * [sqrt(x1 + x2 + x1 * x2 + 1) - 1]   with x1 = (m1/p1)**2 and x2 = (m2/p2)**2
The complicated part is now the one in the square brackets. It is essentially sqrt(x + 1) - 1 for a wide range of x values. If x is very small, we can use the Taylor expansion of the square root (e.g. like here). If x is larger, the formula works just fine, because the addition and subtraction of 1 no longer discard the value of x due to limited floating-point precision. So a threshold for x must be chosen, below which one switches to the Taylor expansion.
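As a side note (not part of the original answer): the threshold can be avoided entirely, because sqrt(x + 1) - 1 can be rewritten exactly as x / (sqrt(x + 1) + 1) by multiplying with the conjugate, which is stable for all x >= 0. A minimal sketch:
import numpy as np

def sqrt1pm1(x):
    # sqrt(x + 1) - 1 without cancellation: the denominator sqrt(x + 1) + 1
    # is never small for x >= 0, so no Taylor expansion or threshold is needed
    return x / (np.sqrt(x + 1.0) + 1.0)

print(sqrt1pm1(1e-20))             # 5e-21 (correct)
print(np.sqrt(1e-20 + 1.0) - 1.0)  # 0.0 (cancellation)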

Setting priors in rhierBinLogit function from bayesm package

I checked the sample code from the appendix of Bayesian Statistics and Marketing; the sample code uses the default prior, which is
nu = nvar + 3
V = nu * diag(nvar)
Deltabar = matrix(rep(0, nz * nvar), ncol = nvar)
ADelta = 0.01 * diag(nz)
I just wanted to try a different prior setting and see what it looks like, so I changed Deltabar to
Deltabar = matrix(rep(10, nz * nvar), ncol = nvar)
My understanding is that this sets the prior belief about Deltabar to be centered at 10, so the MCMC draws of Delta should start out around 10 (please correct me if I am wrong).
However, I get the same trace plot as with the default prior.

Using vDSP_biquad as a one pole filter

I'd like to be able to use the vDSP_biquad function as a one pole filter.
My one-pole filter looks like this:
output[i] = onePole->z1 = input[i] * onePole->a0 + onePole->z1 * onePole->b1;
where
b1 = exp(-2.0 * M_PI * (_frequency / sampleRate));
a0 = 1.0 - b1;
This one-pole filter works great, but of course it's not optimized, which is why I'd like to use the Accelerate framework to speed it up.
Because vDSP_biquad uses the Direct Form II of the biquad implementation, it seems to me I should be able to set the coefficients to use it as a one-pole filter. https://en.wikipedia.org/wiki/Digital_biquad_filter#Direct_form_2
filter->omega = 2 * M_PI * freq / sampleRate;
filter->b1 = exp(-filter->omega);
filter->b0 = 1 - filter->b1;
filter->b2 = 0;
filter->a1 = 0;
filter->a2 = 0;
However, this does not work as a one-pole filter. (The biquad implementation itself is fine; I use it for many other filter types. It's just that these coefficients don't have the desired effect.)
What am I doing wrong?
Also open to hearing other ways to optimize a one-pole filter with Accelerate or otherwise.
The formula in the Apple docs is:
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
In your code above, you're using b1, which multiplies the previous input, x[n-1]. For a one-pole filter you need feedback from the previous output, y[n-1], which means using the a1 coefficient.
So I think the coefficients you want are:
a1 = -exp(-2.0 * M_PI * (_frequency / sampleRate))
b0 = 1.0 + a1
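Not part of the original answer, but a quick way to sanity-check this coefficient mapping outside Accelerate, using scipy.signal.lfilter as a stand-in for vDSP_biquad (both implement the same difference equation):
import numpy as np
from scipy.signal import lfilter

fs, freq = 48000.0, 1000.0
b1 = np.exp(-2.0 * np.pi * freq / fs)  # feedback coefficient of the original one-pole
a0 = 1.0 - b1

x = np.random.default_rng(0).standard_normal(256)

# original recurrence: y[n] = a0 * x[n] + b1 * y[n-1]
y_ref = np.empty_like(x)
z1 = 0.0
for i, xi in enumerate(x):
    z1 = xi * a0 + z1 * b1
    y_ref[i] = z1

# biquad-style coefficients from the answer: a1 = -b1, b0 = 1 + a1
y_biquad = lfilter([a0], [1.0, -b1], x)
print(np.allclose(y_ref, y_biquad))  # True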

Is there a more concise way to calculate the P value for the Anderson-Darling A-squared statistic with VBA?

I have two bits of code in VBA for Excel. One calculates the A-squared statistic for the Anderson-Darling test; the code below calculates the P value of the A-squared statistic. I am curious whether there is a more concise or more efficient way to calculate this value in VBA:
Function AndDarP(AndDar, Elements)
'Calculates P value for level of significance for the
'Anderson-Darling Test for Normality
'AndDar is the Anderson-Darling Test Statistic
'Elements is the count of elements used in the
'Anderson-Darling test statistic.
'based on calculations at
'http://www.kevinotto.com/RSS/Software/Anderson-Darling%20Normality%20Test%20Calculator.xls
'accessed 21 May 2010
'www.kevinotto.com
'kevin_n_otto#yahoo.com
'Version 6.0
'Permission to freely distribute and modify when properly
'referenced and contact information maintained.
'
'"Keep in mind the test assumes normality, and is looking for sufficient evidence to reject normality.
'That is, a large p-value (often p > alpha = 0.05) would indicate normality.
' * * *
'Test Hypotheses:
'Ho: Data is sampled from a population that is normally distributed
'(no difference between the data and normal data).
'Ha: Data is sampled from a population that is not normally distributed"
Dim M As Double
M = AndDar * (1 + 0.75 / Elements + 2.25 / Elements ^ 2)
Select Case M
Case Is < 0.2
AndDarP = 1 - Exp(-13.436 + 101.14 * M - 223.73 * M ^ 2)
Case Is < 0.34
AndDarP = 1 - Exp(-8.318 + 42.796 * M - 59.938 * M ^ 2)
Case Is < 0.6
AndDarP = Exp(0.9177 - 4.279 * M - 1.38 * M ^ 2)
Case Is < 13
AndDarP = Exp(1.2937 - 5.709 * M + 0.0186 * M ^ 2)
Case Else
AndDarP = 0
End Select
End Function
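Purely as an illustration of one way to make this more concise (shown in Python here, but the same table-driven structure could be mirrored in VBA with arrays): the piecewise approximation is just a coefficient table plus one loop.
import math

# (upper bound of M, coefficients of c0 + c1*M + c2*M^2, complement flag)
_SEGMENTS = [
    (0.2,  (-13.436, 101.14, -223.73), True),
    (0.34, (-8.318, 42.796, -59.938),  True),
    (0.6,  (0.9177, -4.279, -1.38),    False),
    (13.0, (1.2937, -5.709, 0.0186),   False),
]

def and_dar_p(a_squared, n):
    # small-sample correction, then piecewise exponential-polynomial fit
    m = a_squared * (1 + 0.75 / n + 2.25 / n ** 2)
    for upper, (c0, c1, c2), complement in _SEGMENTS:
        if m < upper:
            p = math.exp(c0 + c1 * m + c2 * m * m)
            return 1 - p if complement else p
    return 0.0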

Number Rounding

I have the following code, written in VBA in MS Access 2010. This code prints the result of the expression below in Label1.Caption.
the_average = CDbl(TextBox1.Text) * 2 / 3 + CDbl(TextBox2.Text) / 3
the_average = Format(the_average, "#.###")
Label1.Caption = the_average
Some of the calculations produce more than three decimal places, and this code rounds the number when printing it in the label's caption. For example, if I have 1.66666666, it shows 1.667, but I want it to show 1.666, without rounding.
How can I do that?
Thanks in advance
the_average = CDbl(TextBox1.Text) * 2 / 3 + CDbl(TextBox2.Text) / 3
the_average = Format(the_average, "#.####")
Label1.Caption = Left(the_average, 5)
If your final string can vary in length, you will have to use the InStr() function (I believe that's it).
You can take advantage of Int()'s truncation:
the_average = FormatNumber(Int(the_average * 1000) / 1000, 3)
You can omit FormatNumber if you don't want trailing zeros (it pads .99 to .990).