is there a possibility to create a confidence interval for conditional probabilities? I found some formula to create a CI for probabilities, but not for this special case.
Confidence Intervals
I think you mean a confidence interval for a random variable. For example, if I have a continuous real-valued random variable X, I could say something like
P(a <= X <= b) = 0.95
meaning I am 95% confident that X is between a and b.
Conditional Confidence Intervals
A conditional confidence interval would be a statement like
P(a <= X <= b | Y = y) = 0.95
meaning, when Y is y, I am 95% confident that X is between a and b, where Y is some other random variable.
Related
These are the conditions:
if(x > 0)
{
y >= a;
z <= b;
}
It is quite easy to convert the conditions into Linear Programming constraints if x were binary variable. But I am not finding a way to do this.
You can do this in 2 steps
Step 1: Introduce a binary dummy variable
Since x is continuous, we can introduce a binary 0/1 dummy variable. Let's call it x_positive
if x>0 then we want x_positive =1. We can achieve that via the following constraint, where M is a very large number.
x < x_positive * M
Note that this forces x_positive to become 1, if x is itself positive. If x is negative, x_positive can be anything. (We can force it to be zero by adding it to the objective function with a tiny penalty of the appropriate sign.)
Step 2: Use the dummy variable to implement the next 2 constraints
In English: if x_positive = 1, then y >= a
However, if x_positive = 0, y can be anything (y > -inf)
y > a - M (1 - x_positive)
Similarly,
if x_positive = 1, then z <= b
z <= b + M * (1 - x_positive)
Both the linear constraints above will kick in if x>0 and will be trivially satisfied if x <=0.
I have curve that initially Y increases linearly with X, then reach a plateau at point C.
In other words, the curve can be defined as:
if X < C:
Y = k * X + b
else:
Y = k * C + b
The training data is a list of X ~ Y values. I need to determine k, b and C through a machine learning approach (or similar), since the data is noisy and refection point C changes over time. I want something more robust than get C through observing the current sample data.
How can I do it using sklearn or maybe scipy?
WLOG you can say the second equation is
Y = C
looks like you have a linear regression to fit the line and then a detection point to find the constant.
You know that in the high values of X, as in X > C you are already at the constant. So just check how far back down the values of X you get the same constant.
Then do a linear regression to find the line with value of X, X <= C
Your model is nonlinear
I think the smartest way to solve this is to do these steps:
find the maximum value of Y which is equal to k*C+b
M=max(Y)
drop this maximum value from your dataset
df1 = df[df.Y != M]
and then you have simple dataset to fit your X to Y and you can use sklearn for that
Can anybody give me an example for a continuous, monotone, and quasi concave function that is not concave?
Continuous
Has a value everywhere.
Monotone
(x <= y) => (f(x) <= f(y))
or
(x <= y) => (f(x) >= f(y))
Concave
f'(x) is monotonically decreasing. If f'(x) is constant on an interval [a, b] where a <> b, then f is not strictly concave.
The construction of an example
Take the logarithm as a starting point. It is, continuous monotonically (on the realm of positive real numbers) increasing, quasiconcave AND concave. Not quite the result we expect yet, but pretty close to it:
ln'(x) = 1/x
ln(1) = 0
now, let's define a function as follows:
f(x) = ln(x), if x < 1 or x > 2
f(1), if 1 <= x <= 2
I have X,Y data which i would like to bin according to X values.
However, I would like to determine the optimal number of X bins that satisfy a condition based on the resulting bin intervals and average Y of each bin. For example if i have
X=[2,3,4,5,6,7,8,9,10]
Y=[120,140,143,124,150,140,180,190,200]
I would like to determine the best number of X bins that will satisfy this condition: Average of Y bin/(8* width of X bin) should be above 20, but as close as possible to 20. The bins should also be integers e.g., [1,2,..].
I am currently using:
bin_means, bin_edges, binnumber = binned_statistic(X, Y, statistic='mean', bins=bins)
with bins being pre-defined. However, i would like an algorithim that can determine the optimal bins for me before using this.
One can easily determine it for a small data but for hundreds of points it becomes time consuming.
Thank you
If you NEED to iterate to find optimal nbins with your minimization function, take a look at numpy.digtize
https://numpy.org/doc/stable/reference/generated/numpy.digitize.html
And try:
start = min(X)
stop = max(X)
cut_dict = {
n: np.digitize(X, bins=np.linspace(start, stop, num=n+1))
for n in range(min_nbins, max_nbins)}
#input min/max_nbins
avg = {}
Y = pd.Series(Y).rename('Y')
avg = {nbins: Y.groupby(cut).mean().mean() for nbins, cut in cut_dict.items()}
avg = pd.Series(avg.values(), index=avg.keys()).rename('mean_ybins').to_frame()
Then you can find which is closest to 20 or if 20 is the right number...
I am processing a series of points which all have the same Y value, but different X values. I go through the points by incrementing X by one. For example, I might have Y = 50 and X is the integers from -30 to 30. Part of my algorithm involves finding the distance to the origin from each point and then doing further processing.
After profiling, I've found that the sqrt call in the distance calculation is taking a significant amount of my time. Is there an iterative way to calculate the distance?
In other words:
I want to efficiently calculate: r[n] = sqrt(x[n]*x[n] + y*y)). I can save information from the previous iteration. Each iteration changes by incrementing x, so x[n] = x[n-1] + 1. I can not use sqrt or trig functions because they are too slow except at the beginning of each scanline.
I can use approximations as long as they are good enough (less than 0.l% error) and the errors introduced are smooth (I can't bin to a pre-calculated table of approximations).
Additional information:
x and y are always integers between -150 and 150
I'm going to try a couple ideas out tomorrow and mark the best answer based on which is fastest.
Results
I did some timings
Distance formula: 16 ms / iteration
Pete's interperlating solution: 8 ms / iteration
wrang-wrang pre-calculation solution: 8ms / iteration
I was hoping the test would decide between the two, because I like both answers. I'm going to go with Pete's because it uses less memory.
Just to get a feel for it, for your range y = 50, x = 0 gives r = 50 and y = 50, x = +/- 30 gives r ~= 58.3. You want an approximation good for +/- 0.1%, or +/- 0.05 absolute. That's a lot lower accuracy than most library sqrts do.
Two approximate approaches - you calculate r based on interpolating from the previous value, or use a few terms of a suitable series.
Interpolating from previous r
r = ( x2 + y2 ) 1/2
dr/dx = 1/2 . 2x . ( x2 + y2 ) -1/2 = x/r
double r = 50;
for ( int x = 0; x <= 30; ++x ) {
double r_true = Math.sqrt ( 50*50 + x*x );
System.out.printf ( "x: %d r_true: %f r_approx: %f error: %f%%\n", x, r, r_true, 100 * Math.abs ( r_true - r ) / r );
r = r + ( x + 0.5 ) / r;
}
Gives:
x: 0 r_true: 50.000000 r_approx: 50.000000 error: 0.000000%
x: 1 r_true: 50.010000 r_approx: 50.009999 error: 0.000002%
....
x: 29 r_true: 57.825065 r_approx: 57.801384 error: 0.040953%
x: 30 r_true: 58.335225 r_approx: 58.309519 error: 0.044065%
which seems to meet the requirement of 0.1% error, so I didn't bother coding the next one, as it would require quite a bit more calculation steps.
Truncated Series
The taylor series for sqrt ( 1 + x ) for x near zero is
sqrt ( 1 + x ) = 1 + 1/2 x - 1/8 x2 ... + ( - 1 / 2 )n+1 xn
Using r = y sqrt ( 1 + (x/y)2 ) then you're looking for a term t = ( - 1 / 2 )n+1 0.36n with magnitude less that a 0.001, log ( 0.002 ) > n log ( 0.18 ) or n > 3.6, so taking terms to x^4 should be Ok.
Y=10000
Y2=Y*Y
for x=0..Y2 do
D[x]=sqrt(Y2+x*x)
norm(x,y)=
if (y==0) x
else if (x>y) norm(y,x)
else {
s=Y/y
D[round(x*s)]/s
}
If your coordinates are smooth, then the idea can be extended with linear interpolation. For more precision, increase Y.
The idea is that s*(x,y) is on the line y=Y, which you've precomputed distances for. Get the distance, then divide it by s.
I assume you really do need the distance and not its square.
You may also be able to find a general sqrt implementation that sacrifices some accuracy for speed, but I have a hard time imagining that beating what the FPU can do.
By linear interpolation, I mean to change D[round(x)] to:
f=floor(x)
a=x-f
D[f]*(1-a)+D[f+1]*a
This doesn't really answer your question, but may help...
The first questions I would ask would be:
"do I need the sqrt at all?".
"If not, how can I reduce the number of sqrts?"
then yours: "Can I replace the remaining sqrts with a clever calculation?"
So I'd start with:
Do you need the exact radius, or would radius-squared be acceptable? There are fast approximatiosn to sqrt, but probably not accurate enough for your spec.
Can you process the image using mirrored quadrants or eighths? By processing all pixels at the same radius value in a batch, you can reduce the number of calculations by 8x.
Can you precalculate the radius values? You only need a table that is a quarter (or possibly an eighth) of the size of the image you are processing, and the table would only need to be precalculated once and then re-used for many runs of the algorithm.
So clever maths may not be the fastest solution.
Well there's always trying optimize your sqrt, the fastest one I've seen is the old carmack quake 3 sqrt:
http://betterexplained.com/articles/understanding-quakes-fast-inverse-square-root/
That said, since sqrt is non-linear, you're not going to be able to do simple linear interpolation along your line to get your result. The best idea is to use a table lookup since that will give you blazing fast access to the data. And, since you appear to be iterating by whole integers, a table lookup should be exceedingly accurate.
Well, you can mirror around x=0 to start with (you need only compute n>=0, and the dupe those results to corresponding n<0). After that, I'd take a look at using the derivative on sqrt(a^2+b^2) (or the corresponding sin) to take advantage of the constant dx.
If that's not accurate enough, may I point out that this is a pretty good job for SIMD, which will provide you with a reciprocal square root op on both SSE and VMX (and shader model 2).
This is sort of related to a HAKMEM item:
ITEM 149 (Minsky): CIRCLE ALGORITHM
Here is an elegant way to draw almost
circles on a point-plotting display:
NEW X = OLD X - epsilon * OLD Y
NEW Y = OLD Y + epsilon * NEW(!) X
This makes a very round ellipse
centered at the origin with its size
determined by the initial point.
epsilon determines the angular
velocity of the circulating point, and
slightly affects the eccentricity. If
epsilon is a power of 2, then we don't
even need multiplication, let alone
square roots, sines, and cosines! The
"circle" will be perfectly stable
because the points soon become
periodic.
The circle algorithm was invented by
mistake when I tried to save one
register in a display hack! Ben Gurley
had an amazing display hack using only
about six or seven instructions, and
it was a great wonder. But it was
basically line-oriented. It occurred
to me that it would be exciting to
have curves, and I was trying to get a
curve display hack with minimal
instructions.