Interpolation from irregular grid to regular grid - numpy

I have some 1D data (time series data) that is sampled irregularly; i.e., non-constant sample rate. I would like transform these data into a regularly sampled (uniform sample rate) time series. I have used linear interpolation in an attempt to accomplish this; however, this is not very effective when there is a large variation in the time between samples. This is no surprise. I have also attempted some ad hoc methods that again are not very effective.
I have looked at several papers on the use of matching pursuit for interpolation over irregular grids; but, how this approach could be used to obtain samples over a regular grid is not clear to me (at least not yet).
I would appreciate any suggestions on algorithms for interpolation from irregular grids to regular grids (1D data).

If you want to fit the data points exactly, run
scipy.interpolate.UnivariateSpline
with s=0
(and ask further if that's not clear).

Related

How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data was 2 or 3-dimensional, I can easily plot the data and just look at it. But I can't do it high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, if you can choose to focus on specific features. For
example, if you know a featues is subject to high variation, you can
draw a box plot. You can also draw a 2D graph if you want to focus on
2 features. THis shows how much the labelled outliers vary.
Next, there's a metric called a Z-score, which basically says how
many standard devations a point varies compared to the mean. The
Z-score is signed, meaning if a point is below the mean, the Z-score
will be negative. This can be used to analyze all the features of the
dataset. You can find the threshold value in your labelled dataset for which all the points above that threshold are labelled outliers
Lastly, we can find the interquartile range and similarly filter
based on it. The IQR is simply the difference between the 75
percentile and 25 percentile. You can also use this similarly to Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on densities, so it will be easy to apply the techniques to outliers.

Implementing a 2D recursive spatial filter using Scipy

Minimally, I would like to know how to achieve what is stated in the title. Specifically, signal.lfilter seems like the only implementation of a difference equation filter in scipy, but it is 1D, as shown in the docs. I would like to know how to implement a 2D version as described by this difference equation. If that's as simple as "bro, use this function," please let me know, pardon my naiveté, and feel free to disregard the rest of the post.
I am new to DSP and acknowledging there might be a different approach to answering my question so I will explain the broader goal and give context for the question in the hopes someone knows how do want I want with Scipy, or perhaps a better way than what I explicitly asked for.
To get straight into it, broadly speaking I am using vectorized computation methods (Numpy/Scipy) to implement a Monte Carlo simulation to improve upon a naive for loop. I have successfully abstracted most of my operations to array computation / linear algebra, but a few specific ones (recursive computations) have eluded my intuition and I continually end up in the digital signal processing world when I go looking for how this type of thing has been done by others (that or machine learning but those "frameworks" are much opinionated). The reason most of my google searches end up on scipy.signal or scipy.ndimage library references is clear to me at this point, and subsequent to accepting the "signal" representation of my data, I have spent a considerable amount of time (about as much as reasonable for a field that is not my own) ramping up the learning curve to try and figure out what I need from these libraries.
My simulation entails updating a vector of data representing the state of a system each period for n periods, and then repeating that whole process a "Monte Carlo" amount of times. The updates in each of n periods are inherently recursive as the next depends on the state of the prior. It can be characterized as a difference equation as linked above. Additionally this vector is theoretically indexed on an grid of points with uneven stepsize. Here is an example vector y and its theoretical grid t:
y = np.r_[0.0024, 0.004, 0.0058, 0.0083, 0.0099, 0.0133, 0.0164]
t = np.r_[0.25, 0.5, 1, 2, 5, 10, 20]
I need to iteratively perform numerous operations to y for each of n "updates." Specifically, I am computing the curvature along the curve y(t) using finite difference approximations and using the result at each point to adjust the corresponding y(t) prior to the next update. In a loop this amounts to inplace variable reassignment with the desired update in each iteration.
y += some_function(y)
Not only does this seem inefficient, but vectorizing things seems intuitive given y is a vector to begin with. Furthermore I am interested in preserving each "updated" y(t) along the n updates, which would require a data structure of dimensions len(y) x n. At this point, why not perform the updates inplace in the array? This is wherein lies the question. Many of the update operations I have succesfully vectorized the "Numpy way" (such as adding random variates to each point), but some appear overly complex in the array world.
Specifically, as mentioned above the one involving computing curvature at each element using its neighbouring two elements, and then imediately using that result to update the next row of the array before performing its own curvature "update." I was able to implement a non-recursive version (each row fails to consider its "updated self" from the prior row) of the curvature operation using ndimage generic_filter. Given the uneven grid, I have unique coefficients (kernel weights) for each triplet in the kernel footprint (instead of always using [1,-2,1] for y'' if I had a uniform grid). This last part has already forced me to use a spatial filter from ndimage rather than a 1d convolution. I'll point out, something conceptually similar was discussed in this math.exchange post, and it seems to me only the third response saliently addressed the difference between mathematical notion of "convolution" which should be associative from general spatial filtering kernels that would require two sequential filtering operations or a cleverly merged kernel.
In any case this does not seem to actually address my concern as it is not about 2D recursion filtering but rather having a backwards looking kernel footprint. Additionally, I think I've concluded it is not applicable in that this only allows for "recursion" (backward looking kernel footprints in the spatial filtering world) in a manner directly proportional to the size of the recursion. Meaning if I wanted to filter each of n rows incorporating calculations on all prior rows, it would require a convolution kernel far too big (for my n anyways). If I'm understanding all this correctly, a recursive linear filter is algorithmically more efficient in that it returns (for use in computation) the result of itself applied over the previous n samples (up to a level where the stability of the algorithm is affected) using another companion vector (z). In my case, I would only need to look back one step at output signal y[n-1] to compute y[n] from curvature at x[n] as the rest works itself out like a cumsum. signal.lfilter works for this, but I can't used that to compute curvature, as that requires a kernel footprint that can "see" at least its left and right neighbors (pixels), which is how I ended up using generic_filter.
It seems to me I should be able to do both simultaneously with one filter namely spatial and recursive filtering; or somehow I've missed the maths of how this could be mathematically simplified/combined (convolution of multiples kernels?).
It seems like this should be a common problem, but perhaps it is rarely relevant to do both at once in signal processing and image filtering. Perhaps this is why you don't use signals libraries solely to implement a fast monte carlo simulation; though it seems less esoteric than using a tensor math library to implement a recursive neural network scan ... which I'm attempting to do right now.
EDIT: For those familiar with the theoretical side of DSP, I know that what I am describing, the process of designing a recursive filters with arbitrary impulse responses, is achieved by employing a mathematical technique called the z-transform which I understand is generally used for two things:
converting between the recursion coefficients and the frequency response
combining cascaded and parallel stages into a single filter
Both are exactly what I am trying to accomplish.
Also, reworded title away from FIR / IIR because those imply specific definitions of "recursion" and may be confusing / misnomer.

How do I plug distance data into scipy's agglomerative clustering methods?

So, I have a set of texts I'd like to do some clustering analysis on. I've taken a Normalized Compression Distance between every text, and now I have basically built a complete graph with weighted edges that looks something like this:
text1, text2, 0.539
text2, text3, 0.675
I'm having tremendous difficulty figuring out the best way to plug this data into scipy's hierarchical clustering methods. I can probably convert the distance data into a table like the one on this page. How can I format this data so that it can easily be plugged into scipy's HAC code?
You're on the right track with converting the data into a table like the one on the linked page (a redundant distance matrix). According to the documentation, you should be able to pass that directly into scipy.cluster.hierarchy.linkage or a related function, such as scipy.cluster.hierarchy.single or scipy.cluster.hierarchy.complete. The related functions explicitly specify how distance between clusters should be calculated. scipy.cluster.hierarchy.linkage lets you specify whichever method you want, but defaults to single link (i.e. the distance between two clusters is the distance between their closest points). All of these methods will return a multidimensional array representing the agglomerative clustering. You can then use the rest of the scipy.cluster.hierarchy module to perform various actions on this clustering, such as visualizing or flattening it.
However, there's a catch. As of the time this question was written, you couldn't actually use a redundant distance matrix, despite the fact that the documentation says you can. Based on the fact that the github issue is still open, I don't think this has been resolved yet. As pointed out in the answers to the linked question, you can get around this issue by passing the complete distance matrix into the scipy.spatial.distance.squareform function, which will convert it into the format which is actually accepted (a flat array containing the upper-triangular portion of the distance matrix, called a condensed distance matrix). You can then pass the result to one of the scipy.cluster.hierarchy functions.

Where to start with Fourier Analysis

I'm reading data from the microphone and want to perform some analysis on it. I'm attempting to generate a spectrum analyser something like this:
What I have at the moment is this:
My understanding is that I need to perform a Fourier analysis - a Fast Fourier Transform ? - to extract the component frequencies and their amplitudes.
Can someone confirm my understanding is correct and exactly what type of Fourier transform I need to apply?
At the moment, I'm getting frames containing 4k samples from the mic (using NAudio). The buffer I've got is 16bits/sample (Signed Short). For reference, the above plot shows approx half a frame
I'm coding in VB so any .Net libraries/examples (preferably on NuGet) would be of most use. I believe implementations vary considerably so the less I have to massage my data, the better.
The top plot is that of a spectrograph, where each vertical time line is colored based on the magnitudes of the result from an FFT (likely windowed) of a slice in time (possibly overlapped) of the input waveform. The number of vertical points to plot (the frequency resolution) is related to the length of the FFT. Almost any FFT will do. If you use the most common complex-to-complex FFT, just set the imaginary portion of each complex input sample to zero, copy a slice in time of samples of your input waveform to the "real" part, FFT, and take the magnitude or log magnitude of each complex result bin, then map these values to colors per your preference.

Optimizing interpolation in Mathematica

As part of my work, I often have to visualize complex 3 dimensional densities. One program suite that I work with outputs the radial component of the densities as a set of 781 points on a logarithmic grid, ri = (Rmax/Rstep)^((i-1)/(pts-1), times a spherical harmonic. For low symmetry systems, the number of spherical harmonics can be fairly large to ensure accuracy, e.g. one system requires 49 harmonics corresponding to lmax = 6. So, to use this data within Mathematica, I would have a sum of up to 49 interpolated functions with each multiplied by a different spherical harmonic. While using v.6 and constructing the interpolated radial functions using Interpolation and setting r = Sqrt(x^2 + y^2 + z^2), I would stop ContourPlot3D after well over an hour without anything displayed. This included reducing both the InterpolationOrder and MaxRecursion to 1.
Several alternatives presented themselves:
Evaluate the density function on a fixed grid, and use ListContourPlot instead.
Or, linearly spline the radial function and use Piecewise to stitch them together. (This presented itself, as I could use simplify to help reduce the complexity of the resulting function.)
I ended up using both, as InterpolatingFunction gives a noticeable delay in its evaluation, and with up to 49 interpolated functions to evaluate, any delay can become noticeable. Also, ContourPlot3D was faster with the spline, but it didn't give me the speed up I desired.
I'll freely admit that I haven't tried Interpolation on v.7, nor I have tried this on my upgraded hardware (G4 v. Intel Core i5). However, I'm looking for alternatives to my current scheme; preferably, one where I can use ContourPlot3D directly. I could try some other form of spline, such as a B-spline, and possibly combine that with UnitBox instead of using Piecewise.
Edit: Just to clarify, my current implementation involves creating a first order spline for each radial part, multiplying each one by their respective spherical harmonic, summing and Simplifying the equations on each radial interval, and then using Piecewise to bind them into one function. So, my implementation is semi-analytical in that the spherical harmonics are exact, and only the radial part is numerical. This is part of the reason why I would like to be able to use ContourPlot3D, so that I can take advantage of the semi-analytical nature of the data. As a point of note, the radial grid is fine enough that a good representation of the radial part is generated and can be smoothly interpolated. While this gave me a significant speed-up, when I wrote the code, it was still to slow for the hardware I was using at the time.
So, instead of using ContourPlot3D, I would first generate the function, as above, then I would evaluate it on an 803 Cartesian grid. It is the data from this step that I used in ListContourPlot3D. Since this is not an adaptive grid, in some places this was too course, and I was missing features.
If you can do without Mathematica, I would suggest you have a look at Paraview (US government funded FOSS, all platforms) which I have found to be superior to everything when it comes to visualizing massive amounts of data.
The core of the software is the "Visualization Toolkit" VTK, and you can find/write other frontends if need be.
VTK/Paraview can handle almost any data-type: scalar and vector on structured grids or random points, polygons, time-series data, etc. From Mathematica I often just dump grid data into VTK legacy format which in then simplest case looks like this
# vtk DataFile Version 2.0
Generated by mma via vtkGridDump
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 49 25 15
SPACING 0.125 0.125 0.0625
ORIGIN 8.5 5. 0.7124999999999999
POINT_DATA 18375
SCALARS RF_pondpot_1V1MHz1amu double 1
LOOKUP_TABLE default
0.04709501616121583
0.04135197485227461
... <18373 more numbers> ...
HTH!
If it really is the interpolation of the radial functions that is slowing you down, you could consider hand-coding that part based on your knowledge of the sample points. As demonstrated below, this gives a significant speedup:
I set things up with your notation. lookuprvals is a list of 100000 r values to look up for timing.
First, look at stock interpolation as a basemark
With[{interp=Interpolation[N#Transpose#{rvals,yvals}]},
Timing[interp[lookuprvals]][[1]]]
Out[259]= 2.28466
Switching to 0th-order interpolation is already an order of magnitude faster (first order is almost same speed):
With[{interp=Interpolation[N#Transpose#{rvals,yvals},InterpolationOrder->0]},
Timing[interp[lookuprvals]][[1]]]
Out[271]= 0.146486
We can get another 1.5 order of magnitude by calculating indices directly:
Module[{avg=MovingAverage[yvals,2],idxfact=N[(pts-1) /Log[Rmax/Rstep]]},
Timing[res=Part[avg,Ceiling[idxfact Log[lookuprvals]]]][[1]]]
Out[272]= 0.006067
As a middle ground, do a log-linear interpolation by hand. This is slower than the above solution but still much faster than stock interpolation:
Module[{diffs=Differences[yvals],
idxfact=N[(pts-1) /Log[Rmax/Rstep]]},
Timing[Block[{idxraw,idxfloor,idxrel},
idxraw=1+idxfact Log[lookuprvals];
idxfloor=Floor[idxraw];
idxrel=idxraw-idxfloor;
res=Part[yvals,idxfloor]+Part[diffs,idxfloor]idxrel
]][[1]]]
Out[276]= 0.026557
If you have the memory for it, I would cache the spherical harmonics and radius (or even radius-index) on the full grid. Then flatten the grid caches so you can do
Sum[ interpolate[yvals[lm],gridrvals] gridylmvals[lm], {lm,lmvals} ]
and recreate your grid as discussed here.