How do I plug distance data into scipy's agglomerative clustering methods? - numpy

So, I have a set of texts I'd like to do some clustering analysis on. I've taken a Normalized Compression Distance between every text, and now I have basically built a complete graph with weighted edges that looks something like this:
text1, text2, 0.539
text2, text3, 0.675
I'm having tremendous difficulty figuring out the best way to plug this data into scipy's hierarchical clustering methods. I can probably convert the distance data into a table like the one on this page. How can I format this data so that it can easily be plugged into scipy's HAC code?

You're on the right track with converting the data into a table like the one on the linked page (a redundant distance matrix). According to the documentation, you should be able to pass that directly into scipy.cluster.hierarchy.linkage or a related function, such as scipy.cluster.hierarchy.single or scipy.cluster.hierarchy.complete. The related functions explicitly specify how distance between clusters should be calculated. scipy.cluster.hierarchy.linkage lets you specify whichever method you want, but defaults to single link (i.e. the distance between two clusters is the distance between their closest points). All of these methods will return a multidimensional array representing the agglomerative clustering. You can then use the rest of the scipy.cluster.hierarchy module to perform various actions on this clustering, such as visualizing or flattening it.
However, there's a catch. As of the time this question was written, you couldn't actually use a redundant distance matrix, despite the fact that the documentation says you can. Based on the fact that the github issue is still open, I don't think this has been resolved yet. As pointed out in the answers to the linked question, you can get around this issue by passing the complete distance matrix into the scipy.spatial.distance.squareform function, which will convert it into the format which is actually accepted (a flat array containing the upper-triangular portion of the distance matrix, called a condensed distance matrix). You can then pass the result to one of the scipy.cluster.hierarchy functions.

Related

Camera matrix from essential matrix

I am trying to extract camera matrix from essential matrix. I found some answers about this.
determine camera rotation and translation matrix from essential matrix
Rotation and Translation from Essential Matrix incorrect
In these answers, they suggest me to use newE where [U,S,V] = svd(E) and newE = U*diag(1,1,0)*Vt. I don't understand why I need to use newE. As I know, singular values are unique. So changing singular values to diag(1,1,0) seems to make E to completely different values.
I read 'Multiple View Geometry in Computer Vision' also, but it just refers to the ideal case, i.e., singular values are (1,1,0). I didn't find the reason of using newE.
Please can anyone explain me why people use newE?
If I understand your question correctly, then since you source data (and thus E) is usually noisy real world data, then by using diag(1,1,0) you are constraining the matrix to be of the correct scale and rank and algebraically enforcing the geometric constraints.
Wikipedia also a has a nice section explaining this.

Implementing a 2D recursive spatial filter using Scipy

Minimally, I would like to know how to achieve what is stated in the title. Specifically, signal.lfilter seems like the only implementation of a difference equation filter in scipy, but it is 1D, as shown in the docs. I would like to know how to implement a 2D version as described by this difference equation. If that's as simple as "bro, use this function," please let me know, pardon my naiveté, and feel free to disregard the rest of the post.
I am new to DSP and acknowledging there might be a different approach to answering my question so I will explain the broader goal and give context for the question in the hopes someone knows how do want I want with Scipy, or perhaps a better way than what I explicitly asked for.
To get straight into it, broadly speaking I am using vectorized computation methods (Numpy/Scipy) to implement a Monte Carlo simulation to improve upon a naive for loop. I have successfully abstracted most of my operations to array computation / linear algebra, but a few specific ones (recursive computations) have eluded my intuition and I continually end up in the digital signal processing world when I go looking for how this type of thing has been done by others (that or machine learning but those "frameworks" are much opinionated). The reason most of my google searches end up on scipy.signal or scipy.ndimage library references is clear to me at this point, and subsequent to accepting the "signal" representation of my data, I have spent a considerable amount of time (about as much as reasonable for a field that is not my own) ramping up the learning curve to try and figure out what I need from these libraries.
My simulation entails updating a vector of data representing the state of a system each period for n periods, and then repeating that whole process a "Monte Carlo" amount of times. The updates in each of n periods are inherently recursive as the next depends on the state of the prior. It can be characterized as a difference equation as linked above. Additionally this vector is theoretically indexed on an grid of points with uneven stepsize. Here is an example vector y and its theoretical grid t:
y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
I need to iteratively perform numerous operations to y for each of n "updates." Specifically, I am computing the curvature along the curve y(t) using finite difference approximations and using the result at each point to adjust the corresponding y(t) prior to the next update. In a loop this amounts to inplace variable reassignment with the desired update in each iteration.
y += some_function(y)
Not only does this seem inefficient, but vectorizing things seems intuitive given y is a vector to begin with. Furthermore I am interested in preserving each "updated" y(t) along the n updates, which would require a data structure of dimensions len(y) x n. At this point, why not perform the updates inplace in the array? This is wherein lies the question. Many of the update operations I have succesfully vectorized the "Numpy way" (such as adding random variates to each point), but some appear overly complex in the array world.
Specifically, as mentioned above the one involving computing curvature at each element using its neighbouring two elements, and then imediately using that result to update the next row of the array before performing its own curvature "update." I was able to implement a non-recursive version (each row fails to consider its "updated self" from the prior row) of the curvature operation using ndimage generic_filter. Given the uneven grid, I have unique coefficients (kernel weights) for each triplet in the kernel footprint (instead of always using [1,-2,1] for y'' if I had a uniform grid). This last part has already forced me to use a spatial filter from ndimage rather than a 1d convolution. I'll point out, something conceptually similar was discussed in this math.exchange post, and it seems to me only the third response saliently addressed the difference between mathematical notion of "convolution" which should be associative from general spatial filtering kernels that would require two sequential filtering operations or a cleverly merged kernel.
In any case this does not seem to actually address my concern as it is not about 2D recursion filtering but rather having a backwards looking kernel footprint. Additionally, I think I've concluded it is not applicable in that this only allows for "recursion" (backward looking kernel footprints in the spatial filtering world) in a manner directly proportional to the size of the recursion. Meaning if I wanted to filter each of n rows incorporating calculations on all prior rows, it would require a convolution kernel far too big (for my n anyways). If I'm understanding all this correctly, a recursive linear filter is algorithmically more efficient in that it returns (for use in computation) the result of itself applied over the previous n samples (up to a level where the stability of the algorithm is affected) using another companion vector (z). In my case, I would only need to look back one step at output signal y[n-1] to compute y[n] from curvature at x[n] as the rest works itself out like a cumsum. signal.lfilter works for this, but I can't used that to compute curvature, as that requires a kernel footprint that can "see" at least its left and right neighbors (pixels), which is how I ended up using generic_filter.
It seems to me I should be able to do both simultaneously with one filter namely spatial and recursive filtering; or somehow I've missed the maths of how this could be mathematically simplified/combined (convolution of multiples kernels?).
It seems like this should be a common problem, but perhaps it is rarely relevant to do both at once in signal processing and image filtering. Perhaps this is why you don't use signals libraries solely to implement a fast monte carlo simulation; though it seems less esoteric than using a tensor math library to implement a recursive neural network scan ... which I'm attempting to do right now.
EDIT: For those familiar with the theoretical side of DSP, I know that what I am describing, the process of designing a recursive filters with arbitrary impulse responses, is achieved by employing a mathematical technique called the z-transform which I understand is generally used for two things:
converting between the recursion coefficients and the frequency response
combining cascaded and parallel stages into a single filter
Both are exactly what I am trying to accomplish.
Also, reworded title away from FIR / IIR because those imply specific definitions of "recursion" and may be confusing / misnomer.

Interpolation from irregular grid to regular grid

I have some 1D data (time series data) that is sampled irregularly; i.e., non-constant sample rate. I would like transform these data into a regularly sampled (uniform sample rate) time series. I have used linear interpolation in an attempt to accomplish this; however, this is not very effective when there is a large variation in the time between samples. This is no surprise. I have also attempted some ad hoc methods that again are not very effective.
I have looked at several papers on the use of matching pursuit for interpolation over irregular grids; but, how this approach could be used to obtain samples over a regular grid is not clear to me (at least not yet).
I would appreciate any suggestions on algorithms for interpolation from irregular grids to regular grids (1D data).
If you want to fit the data points exactly, run
scipy.interpolate.UnivariateSpline
with s=0
(and ask further if that's not clear).

Which features of Pandas DataFrame could be used to model GPS Tracklog data (read from GPX file)

It's been months now since I started to use Pandas DataFrame to deserialize GPS data and perform some data processing and analyses.
Although I am very impressed with Pandas robustness, flexibility and power, I'm a bit lost about which features, and in which way, I should use to properly model the data, both for clarity, simplicity and computational speed.
Basically, each DataFrame is primarily indexed by a datetime object, having at least one column for a latitude-longitude tuple, and one column for elevation.
The first thing I do is to calculate a new column with the geodesic distance between coordinate pairs (first one being 0.0), using a function that takes two coordinate pairs as arguments, and from that new column I can calculate the cumulative distance along the track, which I use as a Linear Referencing System
The questions I need to address would be:
Is there a way in which I can use, in the same dataframe, two different monotonically increasing columns (cumulative distance and timestamp), choosing whatever is more convenient in each given context at runtime, and use these indexes to auto-align newly inserted rows?
In the specific case of applying a diff function that could be vectorized (applied like an array operation instead of an iterative pairwise loop), is there a way to do that idiomatically in pandas? Should I create a "coordinate" class which support the diff (__sub__) operation so I could use dataframe.latlng.diff directly?
I'm not sure these questions are well formulated, but that is due, at least a bit, by the overwhelming number of possibilities, and a somewhat fragmented documentation (yet).
Also, any tip about using Pandas for GPS data (tracklogs) or Geospatial data in general is very much welcome.
Thanks for any help!

How to plot a Pearson correlation given a time series?

I am using the code in this website http://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html to implement a Pearson Correlation given two time series data like so:
require 'gsl'
pearson_correlation = GSL::Stats::correlation(
GSL::Vector.alloc(first_metrics),GSL::Vector.alloc(second_metrics)
)
This returns a number such as -0.2352461593569471.
I'm currently using the highcharts library and am feeding it two sets of timeseries data. Given that I have a finite time series for both sets, can I do something with this number (-0.2352461593569471) to create a third time series showing the slope of this curve? If anyone can point me in the right direction I'd really appreciate it!
No, correlation doesn't tell you anything about the slope of the line of best fit. It just tells you approximately how much of the variability in one variable (or one time series, in this case) can be explained by the other. There is a reasonably good description here: http://www.graphpad.com/support/faqid/1141/.
How you deal with the data in your specific case is highly dependent on what you're trying to achieve. Are you trying to show that variable X causes variable Y? If so, you could start by dropping the time-series-ness, and just treat the data as paired values, and use linear regression. If you're trying to find a model of how X and Y vary together over time, you could look at multivariate linear regression (I'm not very familiar with this, though).