How to plot a Pearson correlation given a time series? - data-visualization

I am using the code in this website http://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html to implement a Pearson Correlation given two time series data like so:
require 'gsl'
pearson_correlation = GSL::Stats::correlation(
  GSL::Vector.alloc(first_metrics), GSL::Vector.alloc(second_metrics)
)
This returns a number such as -0.2352461593569471.
I'm currently using the highcharts library and am feeding it two sets of timeseries data. Given that I have a finite time series for both sets, can I do something with this number (-0.2352461593569471) to create a third time series showing the slope of this curve? If anyone can point me in the right direction I'd really appreciate it!

No, correlation doesn't tell you anything about the slope of the line of best fit. It just tells you approximately how much of the variability in one variable (or one time series, in this case) can be explained by the other. There is a reasonably good description here: http://www.graphpad.com/support/faqid/1141/.
How you deal with the data in your specific case is highly dependent on what you're trying to achieve. Are you trying to show that variable X causes variable Y? If so, you could start by dropping the time-series-ness, and just treat the data as paired values, and use linear regression. If you're trying to find a model of how X and Y vary together over time, you could look at multivariate linear regression (I'm not very familiar with this, though).
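To illustrate the paired-values suggestion, here is a minimal sketch in Python with scipy (the data values are placeholders; a similar linear fit is also available through GSL and its Ruby bindings). The slope of the fitted line is exactly the quantity that the correlation coefficient alone does not give you.

# Minimal sketch: treat the two aligned series as (x, y) pairs and fit a line.
from scipy import stats

first_metrics = [1.0, 2.5, 2.1, 3.8, 4.0]     # placeholder data
second_metrics = [2.0, 1.9, 2.8, 2.2, 3.5]    # placeholder data

fit = stats.linregress(first_metrics, second_metrics)
print(fit.slope, fit.intercept, fit.rvalue)   # rvalue is the Pearson correlation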

Related

Is multiple regression the best approach for optimization?

I am being asked to take a look at a scenario where a company has many projects that they wish to complete, but, as with any company, budget comes into play. There is a Y value representing a predefined score, with multiple X inputs. There are also 3 main constraints: capital cost, expense cost, and time to completion in months.
The ask is whether an algorithmic approach could be used to optimize which projects should be done for the year given the 3 constraints. The approach should also give different results if the constraint values change. The suggested method is multiple regression, though I have looked into other approaches in detail. I would like to ask the wider community: has anyone dealt with a similar problem, and what approaches have you used?
The first thing we should understand is that a conclusion about something is not based on a single argument. This comes from communication theory: every person builds a frame of knowledge (an understood conclusion), and that frame is constructed from many pieces of knowledge/information.
The consequence is that we cannot use a single simple linear regression to build an ML/DL system. We should use at least two different variables to form a sub-conclusion. If we insist on a single variable with simple linear regression (y = mx + c), it is like forcing the computer to predict something with low accuracy, whatever optimization method you pick. Why? Because using simple linear regression on real-life data amounts to predicting a 'habit' from the data, not calculating the real condition.
That means we should use multiple linear regression (y = m1*x1 + m2*x2 + ... + c) so that the computer can form a conclusion / build a regression model. But it is not quite that simple: because the computer is trying to draw a conclusion from data with multiple characteristics/variances, you must classify both the data and the conclusions.
For example, try to make the computer 'understand' Pythagoras.
We know the Pythagorean formula is c = ((a^2) + (b^2))^(1/2), and we want the computer to predict the hypotenuse c from the two input values a and b. To do that, we build a model, i.e. a multiple linear regression formula, for it.
Step 1, of course, is to build a data set with those multiple characteristics. For example:
a   b   c
3   4   5
8   6   10
3   14  ...
(put in 10 to 20 rows)
Then try to fit a multiple regression formula that predicts c from the a and b values.
You will find that the regression is quite accurate for some values (above 98%) and not very accurate for others (below 90%); for example, a = 3 with b = 14 or b = 15 gives a low-accuracy result (under 90%).
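A minimal sketch of that fitting step in Python with numpy, rather than a spreadsheet (the first three rows come from the table above; the rest are made up to pad the data set):

import numpy as np

# Right-triangle sides (a, b) and the hypotenuse c = sqrt(a^2 + b^2).
a = np.array([3.0, 8.0, 3.0, 5.0, 6.0, 9.0, 12.0, 7.0])
b = np.array([4.0, 6.0, 14.0, 12.0, 8.0, 12.0, 5.0, 24.0])
c = np.sqrt(a**2 + b**2)

# Multiple linear regression c ~ m1*a + m2*b + intercept, solved by least squares.
X = np.column_stack([a, b, np.ones_like(a)])
coeffs, *_ = np.linalg.lstsq(X, c, rcond=None)
pred = X @ coeffs

# Per-row "accuracy" as 1 - relative error, mirroring the percentages in the answer.
accuracy = 1.0 - np.abs(pred - c) / c
print(coeffs)
print(np.round(accuracy * 100, 1))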
So you must optimize, but how?
I know many optimization methods, but I found a manual approach: if I exclude the data points that give low-accuracy results, put them in a separate group, and then recalculate the regression for that excluded group, I get a significantly better result. Repeat this until you reach the accuracy target you want.
Each group of data, with its own new regression, is a new class.
That means I end up with several multiple regressions based on the data I fed in (one regression per group/class), and the accuracy is very high, 99% to 99.99%.
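For what it's worth, a standalone sketch of that manual grouping procedure in Python with numpy (the data rows and the 90% threshold are the illustrative ones from this answer; the grouping logic is one reading of the description, not a standard algorithm):

import numpy as np

# Split off rows that a shared regression predicts poorly, then refit them separately.
a = np.array([3.0, 8.0, 3.0, 5.0, 6.0, 9.0, 12.0, 7.0])
b = np.array([4.0, 6.0, 14.0, 12.0, 8.0, 12.0, 5.0, 24.0])
c = np.sqrt(a**2 + b**2)
X = np.column_stack([a, b, np.ones_like(a)])

def fit_group(Xg, cg):
    # Least-squares multiple regression for one group, plus per-row accuracy.
    coeffs, *_ = np.linalg.lstsq(Xg, cg, rcond=None)
    accuracy = 1.0 - np.abs(Xg @ coeffs - cg) / cg
    return coeffs, accuracy

threshold = 0.90                      # the "under 90%" cut-off mentioned in the text
classes = []                          # one (coefficients, row indices) pair per class
remaining = np.arange(len(c))
while len(remaining) > 0:
    coeffs, acc = fit_group(X[remaining], c[remaining])
    good = acc >= threshold
    if good.all() or not good.any():
        classes.append((coeffs, remaining))    # nothing left to split off
        break
    classes.append((coeffs, remaining[good]))  # well-predicted rows become a class
    remaining = remaining[~good]               # refit the poorly predicted rows alone
print(len(classes), "classes found")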
With several classes, each regression acts as a 'label' for its class; this is what happens in the background of the automated computation. With many libraries, the user feels like they are attaching a string object as the label, but in truth that string object is bound to a regression constructed as the label.
With a few conditional parameters you can get a good ML model from a minimal amount of training data.
Try it in Excel / LibreOffice before going any further: follow the tutorial from this video and implement it on simple data that is easy to construct in a spreadsheet, like Pythagoras.
So the answer is yes: multiple regression is the best approach for optimization.

Determine the running time of an algorithm with two parameters

I have implemented an algorithm that uses two other algorithms for calculating the shortest path in a graph: Dijkstra and Bellman-Ford. Based on the time complexity of these algorithms, I can calculate the running time of my implementation, which is easy given the code.
Now, I want to experimentally verify my calculation. Specifically, I want to plot the running time as a function of the size of the input (I am following the method described here). The problem is that I have two parameters - number of edges and number of vertices.
I have tried to fix one parameter and change the other, but this approach results in two plots - one for varying number of edges and the other for varying number of vertices.
This leads me to my question - how can I determine the order of growth based on two plots? In general, how can one experimentally determine the running time complexity of an algorithm that has more than one parameter?
It's very difficult in general.
The usual way to experimentally gauge the running time in the single-variable case is to insert a counter that increments whenever your data structure does a fundamental (putatively O(1)) operation, take data for many different input sizes, and plot it on a log-log plot, that is, log T vs. log N. If the running time is of the form n^k you should see a straight line of slope k, or something approaching it. If the running time is more like T(n) = n^{k log n}, then you should see a parabola. And if T is exponential in n, you should still see exponential growth.
You can only hope to get information about the highest order term when you do this -- the low order terms get filtered out, in the sense of having less and less impact as n gets larger.
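For the single-variable case, a minimal sketch of reading the exponent off the log-log data in Python with numpy (the timings are fabricated, purely to show the mechanics):

import numpy as np

# Fabricated measurements: input sizes and operation counts / running times.
n = np.array([100, 200, 400, 800, 1600, 3200])
T = np.array([0.02, 0.09, 0.35, 1.4, 5.7, 23.0])   # roughly quadratic growth

# Fit a straight line to log T vs. log n; the slope estimates k in T ~ c * n^k.
slope, intercept = np.polyfit(np.log(n), np.log(T), 1)
print("estimated exponent:", slope)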
In the two variable case, you could try to do a similar approach -- essentially, take 3 dimensional data, do a log-log-log plot, and try to fit a plane to that.
However this will only really work if there's really only one leading term that dominates in most regimes.
Suppose my actual function is T(n, m) = n^4 + n^3 * m^3 + m^4.
When m = O(1), then T(n) = O(n^4).
When n = O(1), then T(n) = O(m^4).
When n = m, then T(n) = O(n^6).
In each of these regimes ('slices' through the plane of possible n, m values), a different one of the terms is the dominant term.
So there's no way to determine the function just from taking some points with fixed m, and some points with fixed n. If you did that, you wouldn't get the right answer for n = m -- you wouldn't be able to discover "middle" leading terms like that.
I would recommend that the best way to predict asymptotic growth when you have lots of variables / complicated data structures, is with a pencil and piece of paper, and do traditional algorithmic analysis. Or possibly, a hybrid approach. Try to break the question of efficiency into different parts -- if you can split the question up into a sum or product of a few different functions, maybe some of them you can determine in the abstract, and some you can estimate experimentally.
Luckily two input parameters is still easy to visualize in a 3D scatter plot (3rd dimension is the measured running time), and you can check if it looks like a plane (in log-log-log scale) or if it is curved. Naturally random variations in measurements plays a role here as well.
In Matlab I typically calculate a least-squares solution to a two-variable function like this (it just concatenates different powers and combinations of x and y horizontally; .* is an element-wise product):
x = log(parameter_x);
y = log(parameter_y);
% Find a least-squares fit
p = [x.^2, x.*y, y.^2, x, y, ones(length(x),1)] \ log(time)
Then this can be used to estimate running times for larger problem instances, ideally those would be confirmed experimentally to know that the fitted model works.
This approach also works for higher dimensions, but it gets tedious to write out; maybe there is a more general way to achieve it, and this is just a workaround for my lack of knowledge.
I was going to write my own explanation but it wouldn't be any better than this.

How do I plug distance data into scipy's agglomerative clustering methods?

So, I have a set of texts I'd like to do some clustering analysis on. I've computed the Normalized Compression Distance between every pair of texts, so now I basically have a complete graph with weighted edges that looks something like this:
text1, text2, 0.539
text2, text3, 0.675
I'm having tremendous difficulty figuring out the best way to plug this data into scipy's hierarchical clustering methods. I can probably convert the distance data into a table like the one on this page. How can I format this data so that it can easily be plugged into scipy's HAC code?
You're on the right track with converting the data into a table like the one on the linked page (a redundant distance matrix). According to the documentation, you should be able to pass that directly into scipy.cluster.hierarchy.linkage or a related function, such as scipy.cluster.hierarchy.single or scipy.cluster.hierarchy.complete. The related functions explicitly specify how distance between clusters should be calculated. scipy.cluster.hierarchy.linkage lets you specify whichever method you want, but defaults to single link (i.e. the distance between two clusters is the distance between their closest points). All of these methods will return a multidimensional array representing the agglomerative clustering. You can then use the rest of the scipy.cluster.hierarchy module to perform various actions on this clustering, such as visualizing or flattening it.
However, there's a catch. As of the time this question was written, you couldn't actually use a redundant distance matrix, despite the fact that the documentation says you can. Based on the fact that the github issue is still open, I don't think this has been resolved yet. As pointed out in the answers to the linked question, you can get around this issue by passing the complete distance matrix into the scipy.spatial.distance.squareform function, which will convert it into the format which is actually accepted (a flat array containing the upper-triangular portion of the distance matrix, called a condensed distance matrix). You can then pass the result to one of the scipy.cluster.hierarchy functions.
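A minimal sketch of that pipeline in Python (the distance values below are placeholders standing in for your NCD scores):

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Redundant (square, symmetric, zero-diagonal) distance matrix for, say, 4 texts.
D = np.array([
    [0.000, 0.539, 0.675, 0.810],
    [0.539, 0.000, 0.462, 0.702],
    [0.675, 0.462, 0.000, 0.588],
    [0.810, 0.702, 0.588, 0.000],
])

condensed = squareform(D)                           # condensed (flat upper-triangular) form
Z = hierarchy.linkage(condensed, method="single")   # or "complete", "average", ...
labels = hierarchy.fcluster(Z, t=0.6, criterion="distance")   # flatten the clustering
print(labels)
# hierarchy.dendrogram(Z) would visualize it (with matplotlib installed).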

Fitting curves to a set of points

Basically, I have a set of up to 100 co-ordinates, along with the desired tangents to the curve at the first and last point.
I have looked into various methods of curve-fitting, by which I mean an algorithm which takes the inputted data points and tangents and outputs the equation of the curve, such as the Gaussian method and interpolation, but I really struggled to understand them.
I am not asking for code (if you choose to give it, that's acceptable though :) ), I am simply looking for help with this algorithm. It will eventually be converted to Objective-C for an iPhone app, if that changes anything.
EDIT:
I know the order of all of the points. They are not too close together, so passing through all points is necessary - aka interpolation (unless anyone can suggest something else). And as far as I know, an algebraic curve is what I'm looking for. This is all being done on a 2D plane by the way
I'd recommend considering cubic splines. There is some explanation and code for calculating them in plain C in the Numerical Recipes book (chapter 3.3).
Most interpolation methods originally work with functions: given a set of x and y values, they compute a function which yields a y value for every x value, meeting the specified constraints. As a function can only ever produce a single y value for each x value, such a curve cannot loop back on itself.
To turn this into a real 2D setup, you want two functions which compute x and y values respectively, based on some parameter that is conventionally called t. So the first step is computing t values for your input data. You can usually get a good approximation by summing Euclidean distances: think of a polyline connecting all your points with straight segments. Then the parameter for each input pair is the distance along this line.
So now you have two interpolation problems: one to compute x from t and the other to compute y from t. You can formulate each as a spline interpolation, e.g. using cubic splines. That gives you a large system of linear equations which you can solve iteratively up to the desired precision.
The result of a spline interpolation will be a piecewise description of a suitable curve. If you wanted a single equation, a Lagrange interpolation would fit the bill, but the result can have odd twists and turns for many sets of input data.
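A minimal sketch of that parametric approach in Python with scipy (the points and tangent directions are placeholders; clamping the end derivatives via bc_type is how this sketch imposes the end tangents, and the scaling of those derivative values is an assumption of the sketch):

import numpy as np
from scipy.interpolate import CubicSpline

# Input points in order (placeholders) and desired tangent directions at the ends.
pts = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.5], [4.0, 0.5], [6.0, 1.0]])
tan_start = np.array([0.0, 1.0])   # tangent direction at the first point
tan_end   = np.array([1.0, 0.0])   # tangent direction at the last point

# Parameter t from cumulative Euclidean (chord) lengths along the polyline.
seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
t = np.concatenate([[0.0], np.cumsum(seg)])

# One cubic spline per coordinate; clamped ends impose dx/dt and dy/dt at t[0] and t[-1].
sx = CubicSpline(t, pts[:, 0], bc_type=((1, tan_start[0]), (1, tan_end[0])))
sy = CubicSpline(t, pts[:, 1], bc_type=((1, tan_start[1]), (1, tan_end[1])))

# Evaluate the curve densely for plotting or further use.
tt = np.linspace(t[0], t[-1], 200)
curve = np.column_stack([sx(tt), sy(tt)])
print(curve[:3])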

LabView cos fitting

I am working on a program that needs to fit numerous cosine waves in order to determine one of the parameters of the function. The equation that I am using is y = y_0 + A*cos((4*pi*L)/x + pi), where L is the value that I am trying to obtain from the best-fit line.
I know that it is possible to do this correctly by hand for each set of data, but what is the best way to automate the process? I am currently reading the data in from text files and running a loop in which the initial parameters change until I have an array of parameter values whose amplitude is similar to the data; then I check the percent difference between points on the center peak and the two end peaks to try to pick the best one. It is consistently picking lower values than what I get when fitting by hand (almost exactly one phase off). So is there a way to improve this method, or another method that works better?
Edit: My LabVIEW version has a cosine-fitting VI, which is what I am using; the problem is that when I try to automate the fitting by changing the initial parameters in a loop, I can't figure out how to get the program to pick the same best-fit line a human would pick.
Why not just use a Fast Fourier Transform? This should be way faster than fitting a cosine. In the result vector of complex numbers, look for the largest peak in the magnitudes; that gives you the frequency (its position in the FFT result vector), amplitude and phase.
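A minimal sketch of that idea in Python with numpy (LabVIEW's FFT VIs would give you the same spectrum): it pulls the dominant cosine's frequency, amplitude and phase out of uniformly sampled data. For the y_0 + A*cos((4*pi*L)/x + pi) model in the question you would first resample onto a uniform grid in u = 1/x, so the oscillation has a constant frequency there.

import numpy as np

def dominant_cosine(y, dt):
    # Find the strongest cosine component in uniformly sampled data y (sample spacing dt).
    y = np.asarray(y, dtype=float)
    n = len(y)
    spectrum = np.fft.rfft(y - y.mean())          # subtract the offset (y_0) first
    freqs = np.fft.rfftfreq(n, d=dt)
    k = np.argmax(np.abs(spectrum))               # largest peak in the magnitudes
    amplitude = 2.0 * np.abs(spectrum[k]) / n     # rfft scaling for a real signal
    phase = np.angle(spectrum[k])
    return freqs[k], amplitude, phase

# Self-check with a known cosine: should recover roughly (25.0, 0.8, pi/3).
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
y = 1.5 + 0.8 * np.cos(2 * np.pi * 25 * t + np.pi / 3)
print(dominant_cosine(y, dt=t[1] - t[0]))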
You can evaluate the goodness of the fit by computing the difference between the fitting curve and your data; a VI in the "Advanced curve fitting" palette does this. Then all you have to do is pick the best fit.