Transforming random variables from competing exponential distributions

I have a question regarding transforming one set of random variables to another.
Suppose that we have a sample of data collected from two competing exponential distributions with rates a and b. The observed random variables X then follow an exponential distribution with rate (a + b).
Now suppose that we know the rate a, and we want to transform the random variables X into a new set of random variables Y such that Y follows the exponential distribution with rate b.
Is there a way to do so? I would appreciate any answers or references.
Best regards.
We have looked into convolution of variables, but it does not seem to fit, since X cannot be represented as a sum of two random variables.
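For what it's worth, one textbook route (not from this thread) is the probability integral transform: push X through its own CDF to get a uniform variate, then through the inverse CDF of the target Exp(b) distribution. For exponentials this collapses to a simple rescaling, Y = (a + b)/b * X. A minimal Python sketch with made-up rates:
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.5, 0.7                                           # hypothetical known rates
x = rng.exponential(scale=1.0 / (a + b), size=100_000)    # X ~ Exp(a + b)

# Probability integral transform: U = F_X(X) is Uniform(0, 1), and
# Y = F_b^{-1}(U) ~ Exp(b).  For exponentials this reduces to rescaling.
y = (a + b) / b * x

print(y.mean(), 1.0 / b)   # the sample mean should be close to 1/b
Whether the rescaled values are meaningful for your application is a separate question, but as a pure change of marginal distribution this works.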

Determine the running time of an algorithm with two parameters

I have implemented an algorithm that uses two other algorithms for calculating the shortest path in a graph: Dijkstra and Bellman-Ford. Based on the time complexity of these algorithms, I can calculate the running time of my implementation, which is easy given the code.
Now, I want to experimentally verify my calculation. Specifically, I want to plot the running time as a function of the size of the input (I am following the method described here). The problem is that I have two parameters - number of edges and number of vertices.
I have tried to fix one parameter and change the other, but this approach results in two plots - one for varying number of edges and the other for varying number of vertices.
This leads me to my question - how can I determine the order of growth based on two plots? In general, how can one experimentally determine the running time complexity of an algorithm that has more than one parameter?
It's very difficult in general.
The usual way to experimentally gauge the running time in the single-variable case is to insert a counter that increments when your data structure does a fundamental (putatively O(1)) operation, take data for many different input sizes, and plot it on a log-log plot, i.e. log T vs. log N. If the running time is of the form n^k, you should see a straight line of slope k, or something approaching it. If the running time is like T(n) = n^{k log n}, you should see a parabola. And if T is exponential in n, you should still see exponential growth.
You can only hope to get information about the highest order term when you do this -- the low order terms get filtered out, in the sense of having less and less impact as n gets larger.
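As an illustration of the slope-fitting step, here is a minimal Python sketch; the operation counts below are fabricated stand-ins for real measurements:
import numpy as np

# Hypothetical measured operation counts T for input sizes n.
n = np.array([100, 200, 400, 800, 1600, 3200])
T = 3.5 * n**2 + 40 * n          # pretend these came from the counter

# Fit a line to (log n, log T); the slope estimates the exponent k.
slope, intercept = np.polyfit(np.log(n), np.log(T), 1)
print(f"estimated exponent: {slope:.2f}")   # close to 2 for large n
The low-order 40n term barely perturbs the fitted slope, which is exactly the filtering effect described above.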
In the two variable case, you could try to do a similar approach -- essentially, take 3 dimensional data, do a log-log-log plot, and try to fit a plane to that.
However, this will only really work if a single leading term dominates in most regimes.
Suppose my actual function is T(n, m) = n^4 + n^3 m^3 + m^4.
When m = O(1), then T(n, m) = O(n^4).
When n = O(1), then T(n, m) = O(m^4).
When n = m, then T(n, m) = O(n^6).
Each of these regimes is a "slice" through the plane of possible (n, m) values, and in each one a different term is dominant.
So there's no way to determine the function just from taking some points with fixed m and some points with fixed n. If you did that, you wouldn't get the right answer for n = m; you wouldn't be able to discover "middle" leading terms like that.
I would recommend that the best way to predict asymptotic growth when you have lots of variables or complicated data structures is with pencil and paper, doing traditional algorithmic analysis. Or possibly a hybrid approach: try to break the question of efficiency into different parts; if you can split it up into a sum or product of a few different functions, maybe some of them you can determine in the abstract and some you can estimate experimentally.
Luckily, two input parameters are still easy to visualize in a 3D scatter plot (the 3rd dimension is the measured running time), and you can check whether it looks like a plane (in log-log-log scale) or is curved. Naturally, random variations in measurements play a role here as well.
In Matlab I typically calculate a least-squares solution to a two-variable function like this (it just concatenates different powers and combinations of x and y horizontally; .* is an element-wise product):
% parameter_x, parameter_y and time are column vectors of measurements
x = log(parameter_x);
y = log(parameter_y);
% Least-squares fit of log(time) to a quadratic in (x, y); backslash
% solves the overdetermined linear system in the least-squares sense
p = [x.^2, x.*y, y.^2, x, y, ones(length(x),1)] \ log(time)
This fit can then be used to estimate running times for larger problem instances; ideally those would be confirmed experimentally, to check that the fitted model still works.
This approach also works in higher dimensions, but generating the terms gets tedious; maybe there is a more general way to do it, and this is just a workaround for my lack of knowledge.
I was going to write my own explanation but it wouldn't be any better than this.

sampling 2-dimensional surface: how many sample points along X & Y axes?

I have a set of the first 25 Zernike polynomials. A few are shown below in the Cartesian coordinate system.
z2 = 2*x
z3 = 2*y
z4 = sqrt(3)*(2*x^2+2*y^2-1)
:
:
z24 = sqrt(14)*(15*(x^2+y^2)^2-20*(x^2+y^2)+6)*(x^2-y^2)
I am not using the 1st since it is piston, so I have these 24 two-dimensional ANALYTICAL functions expressed in the X-Y Cartesian coordinate system. All are defined over the unit circle, as they are orthogonal over the unit circle. The problem I am describing here is relevant to other 2D surfaces as well, apart from Zernike polynomials.
Suppose that origin (0,0) of the XY co-ordinate system and the centre of the unit circle are same.
Next, I take a linear combination of these 24 polynomials to build a 2D wavefront shape, using 24 random input coefficients in the combination.
w(x,y) = sum_i a_i * z_i(x,y)   (i = 2, 3, ..., 25)
a_i = random coefficients
z_i = Zernike polynomials
Up to this point, everything is analytical and can be done on paper.
Now comes the discretization!
I know that when you want to reconstruct a signal (1-D or 2-D), your sampling frequency should be at least twice the maximum frequency present in the signal (the Nyquist-Shannon principle).
Here the signal is w(x,y) as given above, which is just a simple 2-D function of x and y. I now want to represent it on a computer. Obviously I cannot take all of the infinitely many points from -1 to +1 along the x axis, and the same holds for the y axis.
I have to take a finite number of data points (called sample points, or just samples) on this analytical 2-D surface w(x,y).
I am measuring x & y in metres, and -1 <= x <= +1; -1 <= y <= +1.
E.g. if I divide my x axis from -1 to 1 into 50 sample points, then dx = 2/50 = 0.04 metre, and the same for the y axis. Now my sampling frequency is 1/dx, i.e. 25 samples per metre, and the same for the y axis.
But I took 50 samples arbitrarily; I could have taken 10 samples or 1000 samples. That is the crux of the matter here: how many sample points? How do I determine this number?
There is one theorem (the Nyquist-Shannon theorem, mentioned above) which says that if I want to reconstruct w(x,y) faithfully, I must sample it on both axes so that my sampling frequency (i.e. the number of samples per metre) is at least twice the maximum frequency present in w(x,y). This amounts to finding the power spectrum of w(x,y). The idea is that any function in the space domain can also be represented in the spatial-frequency domain, which is nothing but taking the Fourier transform of the function. This tells us which (spatial) frequencies are present in the function w(x,y) and what the maximum frequency among them is.
Now my question is how to find this maximum frequency in my case. I cannot use MATLAB's fft2() or any other such tool, since that would mean I already have samples taken across the wavefront! Obviously the remaining option is to find it analytically! But that is time-consuming and difficult, since I have 24 polynomials and I would have to use the continuous Fourier transform, i.e. go for pen and paper.
Any help will be appreciated.
Thanks
Key Assumptions
You want to use the Nyquist-Shannon theorem to determine the sampling frequency:
"Obviously the remaining option is to find it analytically! But that is time-consuming and difficult, since I have 24 polynomials and I would have to use the continuous Fourier transform, i.e. go for pen and paper."
Given the assumption I have made (and noting that consideration of other mathematical techniques is out of scope for StackOverflow), you have no option but to calculate the continuous Fourier Transform.
However, I believe you haven't considered all the options for calculating the transform other than a laborious paper exercise, e.g.:
Numerical approximation of the continuous F.T. using code
Symbolic integration, e.g. Wolfram Alpha
Surely a numerical approximation of the Fourier Transform will be adequate for your solution?
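As an illustration of the first option (my own sketch, not from the answer): evaluate w(x, y) on a deliberately fine grid, take the 2-D spectrum, and see where the energy falls off; that gives a working estimate of the maximum significant spatial frequency. The coefficients and the energy threshold below are hypothetical:
import numpy as np

# Toy wavefront from a few low-order Zernike terms (Cartesian forms
# from the question); the coefficients are arbitrary examples.
def w(x, y):
    return (0.3 * 2 * x                                      # z2
            + 0.1 * 2 * y                                    # z3
            - 0.2 * np.sqrt(3) * (2 * x**2 + 2 * y**2 - 1))  # z4

n = 1024                                 # deliberately oversampled grid
x = np.linspace(-1, 1, n)
X, Y = np.meshgrid(x, x)
W = w(X, Y)
W[X**2 + Y**2 > 1] = 0.0                 # restrict to the unit circle

spectrum = np.abs(np.fft.fftshift(np.fft.fft2(W)))
freqs = np.fft.fftshift(np.fft.fftfreq(n, d=x[1] - x[0]))
# Highest frequency still carrying, say, more than 1e-6 of the peak.
mask = spectrum > 1e-6 * spectrum.max()
print("estimated max frequency:", np.abs(freqs[mask.any(axis=0)]).max())
Note that the circular aperture itself has a sharp edge, so the spectrum decays rather than cutting off; in practice you pick an energy threshold you can live with.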
I am assuming this is for coursework or research, so all you really care about as a physicist is the quickest solution that is accurate within the scope of your problem.
So to conclude, IMHO, don't waste time searching for a more mathematically elegant solution or trick; just solve the problem with one of the above methods.

Generating Matched Pairs for Statistical Analysis

In my study, a person is represented as a pair of real numbers (x, y), with x on [30, 80] and y on [60, 120]. There are two types of people, A and B. I have ~300 of each type. How can I generate the largest (or even a large) set of pairs of one person from A with one from B, ((xA, yA), (xB, yB)), such that each pair of points is close? Two points are close if abs(x1 - x2) < dX and abs(y1 - y2) < dY. Similar constraints are acceptable. (That is, this constraint is roughly a Manhattan metric, but Euclidean etc. is OK too.) Not all points need be used, but no point can be reused.
You're looking for the Hungarian Algorithm.
Suggested formulation: A are rows, B are columns, and each cell contains a distance metric between Ai and Bj, e.g. abs(X(Ai) - X(Bj)) + abs(Y(Ai) - Y(Bj)). (You can normalize the X and Y values to [0, 1] if you want distances to be proportional to the range of each variable.)
Then use the Hungarian Algorithm to minimize matching weight.
You can filter out matches with distances over your threshold. If you're worried that this filtering might cause the approach to be sub-optimal, you could set distances over your threshold to a very high number.
There are many implementations of this algorithm. A short search found one in every conceivable language, including VBA for Excel, plus some online solvers (not sure about matching a 300x300 matrix with those, though).
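For illustration, a minimal Python sketch using SciPy's linear_sum_assignment (a Hungarian-style optimal matcher); the data, the thresholds dX and dY, and the 1e9 penalty are all made-up examples:
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(42)
# Hypothetical stand-ins for the ~300 people of each type.
A = np.column_stack([rng.uniform(30, 80, 300), rng.uniform(60, 120, 300)])
B = np.column_stack([rng.uniform(30, 80, 300), rng.uniform(60, 120, 300)])

# Manhattan distance matrix: rows are A, columns are B.
cost = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

# Make threshold-violating pairs prohibitively expensive, per the
# suggestion above, so the solver avoids them whenever possible.
dX, dY = 5.0, 10.0
too_far = ((np.abs(A[:, None, 0] - B[None, :, 0]) >= dX)
           | (np.abs(A[:, None, 1] - B[None, :, 1]) >= dY))
cost[too_far] = 1e9

rows, cols = linear_sum_assignment(cost)   # minimize total matching weight
matched = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e9]
print(len(matched), "pairs within the thresholds")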
Hungarian algorithm did it, thanks Etov.
Source code available here: http://www.filedropper.com/stackoverflow1

How to plot a Pearson correlation given a time series?

I am using the code from this website http://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html to compute a Pearson correlation given two time series, like so:
require 'gsl'
pearson_correlation = GSL::Stats::correlation(
  GSL::Vector.alloc(first_metrics),
  GSL::Vector.alloc(second_metrics)
)
This returns a number such as -0.2352461593569471.
I'm currently using the highcharts library and am feeding it two sets of timeseries data. Given that I have a finite time series for both sets, can I do something with this number (-0.2352461593569471) to create a third time series showing the slope of this curve? If anyone can point me in the right direction I'd really appreciate it!
No, correlation doesn't tell you anything about the slope of the line of best fit. It just tells you how strongly the two variables move together linearly; its square says approximately how much of the variability in one variable (or one time series, in this case) can be explained by the other. There is a reasonably good description here: http://www.graphpad.com/support/faqid/1141/.
How you deal with the data in your specific case is highly dependent on what you're trying to achieve. Are you trying to show that variable X causes variable Y? If so, you could start by dropping the time-series-ness, and just treat the data as paired values, and use linear regression. If you're trying to find a model of how X and Y vary together over time, you could look at multivariate linear regression (I'm not very familiar with this, though).
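To make the distinction concrete, here is a small Python sketch on hypothetical paired data; the regression slope needs the two standard deviations in addition to the correlation, which is why the correlation alone cannot give you a slope:
import numpy as np

# Hypothetical aligned series standing in for the two metrics.
first_metrics = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
second_metrics = np.array([2.0, 2.3, 3.1, 2.8, 3.9])

r = np.corrcoef(first_metrics, second_metrics)[0, 1]
# Least-squares slope and intercept, recovered from r and the spreads.
slope = r * second_metrics.std() / first_metrics.std()
intercept = second_metrics.mean() - slope * first_metrics.mean()
print(r, slope, intercept)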

Need help generating discrete random numbers from distribution

I searched the site but did not find exactly what I was looking for... I want to generate a discrete random number from a normal distribution.
For example, suppose I have a range with a minimum of 4, a maximum of 10, and an average of 7. What code or function call (Objective-C preferred) would I need to return a number in that range? Naturally, due to the normal distribution, more of the returned numbers would center around the average of 7.
As a second example, can the bell curve/distribution be skewed toward one end or the other? Let's say I need to generate a random number in a range with a minimum of 4 and a maximum of 10, and I want the majority of the returned numbers to center around 8, with a natural fall-off based on a skewed bell curve.
Any help is greatly appreciated....
Anthony
What do you need this for? Can you do it the craps player's way?
Generate two random integers in the range of 2 to 5 (inclusive, of course) and add them together. Or flip a coin (0,1) six times and add 4 to the result.
Summing multiple dice approximates a normal distribution (a "bell curve"), while eliminating high or low throws can be used to skew the distribution in various ways.
The key is that you are going for discrete numbers (and I hope you mean integers by that). Multiple dice throws famously approximate a normal distribution. In fact, I think that's how we were first introduced to the Gaussian curve in school.
Of course the more throws, the more closely you approximate the bell curve. Rolling a single die gives a flat line. Rolling two dice just creates a ramp up and down that isn't terribly close to a bell. Six coin flips gets you closer.
So consider this...
If I understand your question correctly, you only have seven possible outcomes: the integers 4 through 10. You can set up an array of seven probabilities to approximate any distribution you like.
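For instance, a minimal Python sketch with hypothetical weights skewed toward 8:
import random

# Made-up weights over the outcomes 4..10, piled up around 8.
outcomes = [4, 5, 6, 7, 8, 9, 10]
weights = [1, 2, 4, 8, 12, 6, 3]

def skewed_discrete():
    # random.choices draws from a weighted discrete distribution.
    return random.choices(outcomes, weights=weights, k=1)[0]

print([skewed_discrete() for _ in range(20)])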
Many frameworks and libraries have this built-in.
Also, just like TokenMacGuy said, a normal distribution isn't characterized by the interval it's defined on, but rather by two parameters: the mean μ and the standard deviation σ. With both these parameters you can confine a certain quantile of the distribution to a certain interval, so that e.g. 95% of all points fall in that interval. But restricting it completely to any interval other than (−∞, ∞) is impossible.
There are several methods to generate normally distributed values from uniform random values (which is what most random or pseudorandom number generators produce):
The Box-Muller transform is probably the easiest, although not exactly fast to compute. Depending on how many numbers you need it should be sufficient, though, and it is definitely very easy to write (see the sketch after this list).
Another option is Marsaglia's polar method, which is usually faster [1].
A third method is the Ziggurat algorithm which is considerably faster to compute but much more complex to program. In applications that really use a lot of random numbers it may be the best choice, though.
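As referenced above, a minimal Python sketch of the Box-Muller transform; for brevity it discards the second, sine-based normal that the transform also yields:
import math
import random

def box_muller(mu=0.0, sigma=1.0):
    # Two independent uniforms become one standard normal sample;
    # using 1 - u1 keeps the argument of log strictly positive.
    u1 = random.random()
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z

print([round(box_muller(7, 1.5), 2) for _ in range(5)])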
As general advice, though: don't write it yourself if you have access to a library that generates normally distributed random numbers for you already.
For skewing your distribution I'd just use a regular normal distribution, choosing μ and σ appropriately for one side of your curve, then determining on which side of your wanted mean a point fell and stretching it appropriately to fit your desired distribution.
For generating only integers I'd suggest you round to the nearest integer when the random number happens to fall within your desired interval and reject it if it doesn't (drawing a new random number in that case). This way you won't artificially skew the distribution (as you would if you clamped the values at 4 or 10, respectively).
[1] In testing with deliberately bad random number generators (yes, worse than RANDU) I've noticed that the polar method results in an endless loop, rejecting every sample. That won't happen with random numbers that fulfill the usual statistical expectations, though.
Yes, there are sophisticated mathematical solutions, but for "simple but practical" I'd go with Nosredna's comment. For a simple Java solution:
import java.util.Random;

Random random = new Random();

// Adding six coin flips (0 or 1) to 4 yields an integer in [4, 10]
// with a binomial, bell-shaped distribution centered on 7.
public int bell7()
{
    int n = 4;
    for (int x = 0; x < 6; ++x)
        n += random.nextInt(2);   // each call returns 0 or 1
    return n;
}
If you're not a Java person, Random.nextInt(n) returns a random integer between 0 and n-1. I think the rest should be similar to what you'd see in any programming language.
If the range were large, then instead of the nextInt(2) calls I'd use a bigger number there, so there would be fewer iterations through the loop, depending on frequency of call and performance requirements.
Dan Dyer and Jay are exactly right. What you really want is a binomial distribution, not a normal distribution. The shape of a binomial distribution looks a lot like a normal distribution, but it is discrete and bounded whereas a normal distribution is continuous and unbounded.
Jay's code generates a binomial distribution with 6 trials and a 50% probability of success on each trial. If you want to "skew" your distribution, simply change the line that decides whether to add 1 to n so that the probability is something other than 50%.
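For example, a minimal Python sketch of that tweak; the success probability 0.65 is just an example, chosen because it pushes the mean to 4 + 6 * 0.65 ≈ 7.9, near the 8 the asker wanted:
import random

def skewed_bell(p=0.65, trials=6, lo=4):
    # Binomial(trials, p) shifted to start at lo: outcomes 4..10.
    # p > 0.5 skews the mass toward 10; p < 0.5 toward 4.
    return lo + sum(random.random() < p for _ in range(trials))

print([skewed_bell() for _ in range(20)])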
The normal distribution is not described by its endpoints. Normally it's described by its mean (which you have given as 7) and its standard deviation. An important consequence is that it is possible to get a value far outside the expected range from this distribution, although that will be vanishingly rare, the further you get from the mean.
The usual means of getting a value from a distribution is to generate a random value u from a uniform distribution, which is quite easily done with, for example, rand(), and then solve F(x) = u for x, where F is the cumulative distribution function (this is inverse transform sampling). For the normal distribution, the CDF is
F(x) = 0.5 + 0.5*erf((x - μ)/(σ*sqrt(2.0)))
where erf() is the error function, which may be described by a Taylor series:
erf(z) = (2/sqrt(π)) * sum_{n=0}^∞ (-1)^n * z^(2n+1) / (n! * (2n+1))
I'll leave it as an exercise to translate this into C.
If you prefer not to engage in the exercise, you might consider using the Gnu Scientific Library, which among many other features, has a technique to generate random numbers in one of many common distributions, of which the Gaussian Distribution (hint) is one.
Obviously, all of these functions return floating point values. You will have to use some rounding strategy to convert to a discrete value. A useful (but naive) approach is to simply downcast to integer.
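Putting those pieces together, a rough Python sketch (μ and σ are arbitrary example values) that inverts the CDF numerically by bisection and applies the round-and-reject idea from the earlier answer:
import math
import random

def normal_cdf(x, mu=7.0, sigma=1.5):
    # Normal CDF via the error function, as given above.
    return 0.5 + 0.5 * math.erf((x - mu) / (sigma * math.sqrt(2.0)))

def discrete_normal(lo=4, hi=10, mu=7.0, sigma=1.5):
    while True:
        u = random.random()
        a, b = mu - 10 * sigma, mu + 10 * sigma
        for _ in range(60):            # bisection solves F(x) = u
            m = 0.5 * (a + b)
            if normal_cdf(m, mu, sigma) < u:
                a = m
            else:
                b = m
        k = round(0.5 * (a + b))
        if lo <= k <= hi:              # reject values outside the range
            return k

print([discrete_normal() for _ in range(15)])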