PCA dimension reduction, scikit-learn

PCA dimension reduction, scikit-learn - variables

I have a dataset of shape (21263, 81). I need to reduce the first 81 columns to 2 dimensions and plot the result.
However, the last column is a continuous variable (temperature), not a categorical variable as I normally see. When plotting the 2D figure, how do I write the Python code so that the last column controls the point size?
Thank you.
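A minimal sketch of one way to approach this, not a definitive answer. It assumes the last column holds the temperature and every other column is a feature, it standardizes the features before PCA (a common but optional choice), and it uses random data in place of the real array. The helper name project_and_scale and the size range are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_and_scale(data, min_size=5.0, max_size=100.0):
    """Return 2D PCA coordinates of the feature columns and marker sizes
    derived from the last column (assumed here to be the temperature)."""
    features = data[:, :-1]
    temperature = data[:, -1]

    # Standardize so that no single feature dominates the components.
    scaled = StandardScaler().fit_transform(features)
    coords = PCA(n_components=2).fit_transform(scaled)

    # Map the continuous variable linearly onto [min_size, max_size].
    span = temperature.max() - temperature.min()
    if span == 0:
        sizes = np.full(temperature.shape, (min_size + max_size) / 2)
    else:
        fraction = (temperature - temperature.min()) / span
        sizes = min_size + fraction * (max_size - min_size)
    return coords, sizes


# Synthetic stand-in for the real array (assumption: same column layout).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 81))
data[:, -1] = rng.uniform(0.0, 130.0, size=500)

coords, sizes = project_and_scale(data)

fig, ax = plt.subplots()
# s is the marker area in points**2; c repeats the same variable as color.
points = ax.scatter(coords[:, 0], coords[:, 1], s=sizes, c=data[:, -1],
                    cmap="viridis", alpha=0.6)
fig.colorbar(points, ax=ax, label="temperature")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
fig.savefig("pca_scatter.png")
```

Because marker area is hard to read precisely, the sketch also maps the temperature to color; drop the c argument if only the size should vary.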

Related

How Can I Find Peak Values of Defined Areas from Spectrogram Data using numpy?

I have spectrogram data from an audio analysis which looks like this:
On one axis I have frequencies in Hz and on the other, times in seconds. I added the grid over the map to show the actual data points. Due to the nature of the frequency analysis used, the best results never give evenly spaced time and frequency values.
To allow comparing data from multiple sources, I would like to normalize this data. For this reason, I would like to calculate the peak values (maximum and minimum values) for specified areas in the map.
The second visualization shows the areas where I would like to calculate the peak values. I marked an area with a green rectangle to visualize this.
For the time values, I would like to use equally spaced ranges (e.g. 0.0-10.0, 10.0-20.0, 20.0-30.0), whereas the frequency ranges are unevenly distributed. At higher frequencies, they will be like 450-550, 550-1500, 1500-2500, ...
You can download an example data-set here: data.zip. You can unpack the datasets like this:
import numpy as np
with np.load(DATA_PATH) as data:
    frequency_labels = data['frequency_labels']
    time_labels = data['time_labels']
    spectrogram_data = data['data']
DATA_PATH has to point to the path of the .npz data file.
As input, I would provide an array of frequency and time ranges. The result should be another 2d NumPy ndarray with either the maximum or the minimum values. As the amount of data is huge, I would like to rely on NumPy as much as possible to speed up the calculations.
How do I calculate the maximum/minimum values of defined areas from a 2d data map?

A plot describing the density of data points in 2D space in Julia

I am trying to use Julia to create a gif animation showing how the density of data points changes with time (the data points are concentrated at the center at the beginning and then spread to the sides, a little like a 2D Gaussian whose variance increases with time). I have checked a catalogue of the kinds of plots available in Julia:
http://docs.juliaplots.org/latest/examples/gr/
I have tried a contour plot, a heatmap and a 2D histogram. However, it seems that the grid of a heatmap or contour plot has to be specified manually, which is highly inconvenient. A 2D histogram serves the purpose better, but it reflects the raw number of data points per bin, and when I set more bins to make the plot look more continuous, it no longer describes the density well. Is there a good Julia equivalent of the 2D density plot in matplotlib, such as the following?
https://python-graph-gallery.com/85-density-plot-with-matplotlib/

You can use a package like KernelDensity to calculate the point density, then plot that. Here's an example:
using StatsPlots, KernelDensity
a, b = randn(10000), randn(10000)
dens = kde((a,b))
plot(dens)
The philosophy, in the Plots package and other places in Julia, is that you generate the object you are interested in first, and then dispatch takes care of plotting it correctly.
Alternatively, you can always use PyPlot to plot anything using matplotlib directly.

Interpolating data onto a line of points

I have some irregularly spaced data and need to analyze it. I can successfully interpolate this data onto a regular grid using mlab.griddata (or rather, the natgrid implementation of it). This allows me to use pcolormesh and contour to generate plots, extract levels, etc. Using plt.contour, I then extract a certain level using get_paths on the entries of the contour set's CS.collections.
Now, what I'd like to do is then, with my original irregularly spaced data, interpolate some quantities onto this specific contour line (i.e., NOT onto a regular grid). The similarly named griddata function from Scipy allows for this behavior, and it almost works. However, I find that as I increase the number of original points, I can get odd erratic behavior in the interpolation. I'm wondering if there's a way around this, i.e., another way to interpolate irregularly spaced (or regularly spaced data for that matter, since I can use my regularly spaced data from mlab.griddata) onto a specific line.
Let me show some numerical examples of what I'm talking about. Take a look at this figure:
The top left shows my data as points, and the line shows an extracted level of level=0 from some data D that I have at those points (x,y) [note, I have data 'D', 'Energy', and 'Pressure', all defined in this (x,y) space]. Once I have this curve, I can plot the interpolated quantities of D, Energy, and Pressure onto my specific line. First, note the plot of D (middle, right). It should be zero at all points, but it's not quite zero at all points. The likely cause of this is that the line that corresponds to the 0 level is generated from a uniform set of points that came from mlab.griddata, whereas the plot of 'D' is generated from my ORIGINAL data interpolated onto that level curve. You can also see some unphysical wiggles in 'Energy' and 'Pressure'.
Okay, seems easy enough, right? Maybe I should just get more original data points along my level=0 curve. Getting some more of these points, I then generate the following plots:
First look at the top left. You can see that I've sampled the hell out of the (x,y) space in the vicinity of my level=0 curve. Furthermore, you can see that my new "D" plot (middle, right) now correctly interpolates to zero in the region that it originally didn't. But now I get some wiggles at the start of the curve, as well as getting some other wiggles in the 'Energy' and 'Pressure' in this space! It is far from obvious to me that this should occur, since my original data points are still there and I've only supplemented additional points. Furthermore, some regions where my interpolation is going bad aren't even near the points that I added in the second run -- they are exclusively neighbored by my original points.
So this brings me to my original question. I'm worried that the interpolation that produces the 'Energy', 'D', and 'Pressure' curves is not working correctly (this is SciPy's griddata, imported as scigrid). Mlab's griddata only interpolates to a regular grid, whereas I want to interpolate to this specific line shown in the top left plot. What's another way for me to do this?
Thanks for your time!

After posting this, I decided to try scipy.interpolate.SmoothBivariateSpline, which produced the following result:
You can now see that my line is smoothed, so it seems like this will work. I'll mark this as the answer unless someone posts something soon that hints that there may be an even better solution.
Edit: As requested, below is some of the code used to generate these plots. I don't have a minimally working example, and the above plots were generated in a larger framework of code, but I'll write the important parts schematically below with comments.
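# imports assumed by the names used below
# (note: matplotlib.mlab.griddata was removed in Matplotlib 3.1)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.mlab import griddata
from scipy.interpolate import griddata as scigrid, SmoothBivariateSpline
# x,y,z are lists of data where the first point is x[0],y[0],z[0], and so on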
minx=min(x)
maxx=max(x)
miny=min(y)
maxy=max(y)
# convert to numpy arrays
x=np.array(x)
y=np.array(y)
z=np.array(z)
# here we are creating a fine grid to interpolate the data onto
xi=np.linspace(minx,maxx,100)
yi=np.linspace(miny,maxy,100)
# here we interpolate our data from the original x,y,z unstructured grid to the new
# fine, regular grid in xi,yi, returning the values zi
zi=griddata(x,y,z,xi,yi)
# now let's do some plotting
plt.figure()
# returns the CS contour object, from which we'll be able to get the path for the
# level=0 curve
# (contour needs gridded input, so pass the interpolated xi,yi,zi)
CS=plt.contour(xi,yi,zi,levels=[0])
# can plot the original data if we want
plt.scatter(x,y,alpha=0.5,marker='x')
# now let's get the level=0 curve
for c in CS.collections:
    data=c.get_paths()[0].vertices
# lineX,lineY are simply the x,y coordinates for our level=0 curve, expressed as arrays
lineX=data[:,0]
lineY=data[:,1]
# so it's easy to plot this too
plt.plot(lineX,lineY)
# now what to do if we want to interpolate some other data we have, say z2
# (also at our original x,y positions), onto
# this level=0 curve?
# well, first I tried using scipy.interpolate.griddata == scigrid like so
origdata=np.transpose(np.vstack((x,y))) # just organizing this data like the
# scigrid routine expects
lineZ2=scigrid(origdata,z2,data,method='linear')
# plotting the above curve (as plt.plot(lineZ2)) gave me really bad results, so
# trying a spline approach
Z2spline=SmoothBivariateSpline(x,y,z2)
# the above creates a spline object on our original data. notice we haven't EVALUATED
# it anywhere yet (we'll want to evaluate it on our level curve)
Z2Line=[]
# here we evaluate the spline along all our points on the level curve, and store the
# result as a new list
for i in range(0,len(lineX)):
    # the [0][0] is just to get the value, which is enclosed in
    # some array structure for some reason otherwise
    Z2Line.append(Z2spline(lineX[i],lineY[i])[0][0])
# you can then easily plot this
plt.plot(Z2Line)
Hope this helps someone!
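As a possible simplification of the evaluation loop above, the spline object also has an ev method that evaluates it pointwise at paired coordinates in one call. The following is only a sketch on synthetic data: the plane z2 = x + 2y and the sine-shaped curve are stand-ins chosen for illustration, not the data behind the plots.

```python
import numpy as np
from scipy.interpolate import SmoothBivariateSpline, griddata

# Synthetic scattered samples of a known plane, standing in for x, y, z2.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = rng.uniform(0.0, 1.0, 200)
z2 = x + 2.0 * y

# Hypothetical curve standing in for the level=0 contour vertices.
t = np.linspace(0.2, 0.8, 25)
lineX = t
lineY = 0.5 + 0.2 * np.sin(2.0 * np.pi * t)

# Fit the spline on the scattered data, then evaluate it along the curve.
# ev() pairs lineX[i] with lineY[i] instead of building a full grid.
Z2spline = SmoothBivariateSpline(x, y, z2)
Z2Line = Z2spline.ev(lineX, lineY)

# For comparison: piecewise-linear interpolation on the Delaunay
# triangulation of the scattered points, evaluated at the same vertices.
curve = np.column_stack((lineX, lineY))
lineZ2_linear = griddata(np.column_stack((x, y)), z2, curve, method="linear")
```

Both results are 1D arrays with one value per curve vertex, so plt.plot(Z2Line) works as before. Keep in mind that the spline smooths by default (its s parameter), so it will not pass exactly through noisy data.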

Visualizing randomized four dimensional data set

I have a four-dimensional data set. None of the four variables is equally spaced. Right now, I visualize the data using a 3D scatter plot (with the color of the dots indicating the fourth dimension), but this becomes extremely unwieldy when printed. Had the variables been evenly spaced, a series of pcolor plots would have been an option. Is there some way to represent such data using a series of 2D plots? My data set looks something like this:
x = [3.67, 3.89, 25.6]
y = [4.88, 4.88, 322.9]
z = [1.0, 2.0, 3.0]
b = [300.0,411.0,414.5]

A scatter plot matrix is a common way to plot multiple dimensions. Here's a plot of four continuous variables colored by a fifth categorical variable.
To deal with the uneven spacing, it depends on the nature of the unevenness.
You might plot it as-is if the unevenness is significant.
You might make a second plot with the extreme values excluded.
You might apply a transformation (such as log or quantile) if the data justifies it.
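A rough sketch of the idea in matplotlib, assuming b is the variable to show as color and that x and y are positive and skewed enough to justify log axes. The random data and the choice of which axes get a log scale are illustrative assumptions only.

```python
import itertools
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the four unevenly spaced variables.
rng = np.random.default_rng(1)
n = 200
variables = {
    "x": rng.lognormal(1.0, 1.0, n),
    "y": rng.lognormal(2.0, 1.5, n),
    "z": rng.uniform(1.0, 3.0, n),
}
b = rng.uniform(300.0, 420.0, n)
skewed = ("x", "y")  # assumption: these two benefit from a log axis

# One 2D panel per pair of variables: (x, y), (x, z), (y, z).
pairs = list(itertools.combinations(variables, 2))
fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 3.5))
for ax, (name_a, name_b) in zip(axes, pairs):
    sc = ax.scatter(variables[name_a], variables[name_b], c=b, s=10,
                    cmap="viridis", vmin=b.min(), vmax=b.max())
    ax.set_xlabel(name_a)
    ax.set_ylabel(name_b)
    if name_a in skewed:
        ax.set_xscale("log")
    if name_b in skewed:
        ax.set_yscale("log")

# All panels share the same color limits, so one colorbar serves them all.
fig.colorbar(sc, ax=axes, label="b")
fig.savefig("pairwise_scatter.png")
```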

Put pcolormesh and contour onto same grid?

I'm trying to display 2D data with axis labels using both contour and pcolormesh. As has been noted on the matplotlib user list, these functions obey different conventions: pcolormesh expects the x and y values to specify the corners of the individual pixels, while contour expects the centers of the pixels.
What is the best way to make these behave consistently?
One option I've considered is to make a "centers-to-edges" function, assuming evenly spaced data:
def centers_to_edges(arr):
    dx = arr[1]-arr[0]
    newarr = np.linspace(arr.min()-dx/2,arr.max()+dx/2,arr.size+1)
    return newarr
Another option is to use imshow with the extent keyword set.
The first approach doesn't play nicely with 2D axes (e.g., as created by meshgrid or indices) and the second discards the axis numbers entirely

Is your data on a regular mesh? If it isn't, you can use griddata() to obtain one. If your data is too big, sub-sampling or regularization is always possible; the output image will probably be small compared with the data anyway, and you can exploit this.
If you use imshow() with extent and interpolation='nearest', you will see that the data is cell-centered and that extent gives the outer edges (corners) of the cells. contour, on the other hand, assumes that the data is cell-centered, so X,Y must be the cell centers. You therefore need to be careful about the input domain for contour. A trivial example is:
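import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-10,10,1)
X,Y = np.meshgrid(x,x)
P = X**2+Y**2
# extent gives the outer edges of the image; each cell is 1 unit wide
plt.imshow(P,extent=[-10,10,-10,10],interpolation='nearest',origin='lower')
# shift by half a cell so that contour sees the cell centers
plt.contour(X+0.5,Y+0.5,P,20,colors='k')
plt.show()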
In my tests pcolormesh() was a very slow routine, so I try to avoid it. griddata and imshow() have always been a good choice for me.
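If pcolormesh is still wanted, the centers-to-edges idea from the question can be extended to unevenly spaced 1D coordinates. The sketch below is one possible version, not an established API: it assumes strictly monotonic centers, places interior edges at the midpoints between neighbours, and extrapolates the two outer edges by half of the adjacent spacing.

```python
import numpy as np


def centers_to_edges(centers):
    """Return len(centers)+1 cell edges for a 1D, strictly monotonic array
    of cell centers. Interior edges are midpoints between neighbours; the
    outer edges are extrapolated by half of the adjacent spacing."""
    centers = np.asarray(centers, dtype=float)
    if centers.ndim != 1 or centers.size < 2:
        raise ValueError("need a 1D array with at least two centers")
    mid = 0.5 * (centers[:-1] + centers[1:])
    first = centers[0] - 0.5 * (centers[1] - centers[0])
    last = centers[-1] + 0.5 * (centers[-1] - centers[-2])
    return np.concatenate(([first], mid, [last]))


x = np.array([0.0, 1.0, 2.0, 4.0])  # uneven spacing is allowed
y = np.array([10.0, 20.0, 30.0])
X, Y = np.meshgrid(x, y)
P = X**2 + Y                        # shape (len(y), len(x))

x_edges = centers_to_edges(x)       # -> [-0.5, 0.5, 1.5, 3.0, 5.0]
y_edges = centers_to_edges(y)       # -> [5.0, 15.0, 25.0, 35.0]
```

With these arrays, pcolormesh(x_edges, y_edges, P) and contour(x, y, P) should line up on the same axes. As far as I know, Matplotlib 3.3 and later also accept the centers directly via pcolormesh(x, y, P, shading='nearest'), which may make the helper unnecessary.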
