Plot variable size/color-heatmap for multiple occurrences of points in scatter plot - numpy

I'm stuck with the following problem and I hope I can explain it coherently.
So, I have a number (about 10) of discrete positions on a coordinate system.
Now, I want to analyse data from a program where users could label each point as somethingA or somethingB.
I extracted the data points for each class, so I have about 60 points for the somethingA class and a little fewer for the other class. One class stands for good points and one for bad points. I want to find the positions which have the most good/bad labels. I do that with machine learning algorithms; here I just want to visualize it with plots.
I now want to plot those points, one plot per class. But since every point occurs at least once in each class, the two plots would look exactly the same.
However, the number of occurrences has a different distribution throughout the positions: if point A has 20 occurrences in class A and 1 in class B, both plots would still look identical.
So, my question is: how can I take the number of occurrences of each point into account when plotting scatters in Matplotlib?
Either with different colors (like a heatmap?), maybe with a nice legend,
or with different sizes (e.g. higher count = bigger circle).
Any help would be appreciated!

I don't know if this helps you, but I had a problem where I wanted a scatter plot to reflect both the positions of the data points and two variables attributed to them.
Passing those variables directly to the color and size arguments of the scatter function, i.e. something like
ax.scatter(..., c=whatEverFunction, s=numberOfOccurences, ...)
did not work for me.
What I did instead was to bin the values of the two variables I wanted to visualize, in my case the variable nodeMass and another variable.
for i in range(Number):
    mask[i] = False
    if lowerBound1 < variableOne[i] < upperBound1:
        mask[i] = True & pmask[i]
if len(positionX[mask]) > 0:
    ax.scatter(positionX[mask], positionY[mask], positionZ[mask], c='#424242', s=10, edgecolors='none')
for i in range(Number):
    mask[i] = False
    if lowerBound2 < variableOne[i] < upperBound2:
        mask[i] = True & pmask[i]
if len(positionX[mask]) > 0:
    ax.scatter(positionX[mask], positionY[mask], positionZ[mask], c='#9E0050', s=25, edgecolors='none')
I know it is not very elegant, but it worked for me. I had to write as many for loops as I had bins in my variables. With the if queries and the masks I could at least avoid redundant or unreadable plots.
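A more direct sketch for the occurrence-count question (my own minimal example, with made-up x and y arrays): collapse the duplicate positions with np.unique and pass the counts to both the s and c arguments of a single scatter call, with a colorbar serving as the legend.

import numpy as np
import matplotlib.pyplot as plt

# made-up example data: a few discrete positions, repeated with different frequencies
x = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3])
y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1])

# collapse duplicate (x, y) pairs and count how often each one occurs
points, counts = np.unique(np.column_stack((x, y)), axis=0, return_counts=True)

fig, ax = plt.subplots()
# encode the count both as marker size and as color (heatmap-style)
sc = ax.scatter(points[:, 0], points[:, 1], s=counts * 40, c=counts, cmap='viridis')
fig.colorbar(sc, label='number of occurrences')
plt.show()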

Related

"Zoom in" on a violinplot whilst keeping accurate quartile lines (matplotlib/seaborn)

TL;DR: How can I get a subrange of a violinplot whilst keeping accurate quartile lines?
I am using seaborn violinplots to make static charts for a report, but as far as I can tell, there's no way to redraw a particular area between limits whilst retaining the 25/median/75 quartile lines of the original dataset.
Here's my example dataset as a violin. The 25/median/75 values are, on the left side, 1.0/5.0/9.0; on the right side, 2.0/5.0/9.0.
My data has such a long tail that all the useful info is scrunched up into a tiny area. I want to ignore (but not throw away) the tail and show a closer look at the interesting bit.
I tried to reset the ylim using ax.set(ylim=(0, upp)), but the resultant graph is not great: it's jaggy and the inner lines don't meet the violin edge.
Is there a way to reset the y-axis limits but get a better quality result?
Next I tried to cut off the tail by dropping values from the dataset. I dropped anything over the 97th centile. The violin looks way better, but the quartile lines have been recalculated for this new dataset. They're showing a median of about 4, not 5 as per the original dataset.
I'm using inner="quartile", so the code that gets called in Seaborn is _ViolinPlotter::draw_quartiles
def draw_quartiles(self, ax, data, support, density, center, split=False):
    """Draw the quartiles as lines at width of density."""
    q25, q50, q75 = np.percentile(data, [25, 50, 75])
    self.draw_to_density(ax, center, q25, support, density, split,
                         linewidth=self.linewidth,
                         dashes=[self.linewidth * 1.5] * 2)
As you can see, it assumes (understandably) that one wants to draw the quartile lines at percentiles 25, 50 and 75. It'd be amazeballs if there was a way I could call draw_to_density with my own values (is there?).
At the moment, I am attempting to manually adjust the position of the lines. It's trivial to figure out and set the y-values:
for l in ax.lines:
    l.set_ydata(<get correct quartile value from original dataset>)
but I'm finding it hard to figure out the limits for x, i.e. the density of the distribution at the quartiles. It seems to involve a Gaussian KDE, and tbh it's getting hacky and inelegant at this point. Is there an easy way to calculate how long each line should be?
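Schematically, the manual adjustment I'm attempting looks like this, where center and scale are placeholders for quantities I haven't figured out how to recover from seaborn:

import numpy as np
from scipy.stats import gaussian_kde

# 'orig' is the full, untrimmed dataset for this violin
kde = gaussian_kde(orig)                # seaborn uses a Gaussian KDE internally too
q25, q50, q75 = np.percentile(orig, [25, 50, 75])
for line, q in zip(ax.lines, (q25, q50, q75)):
    line.set_ydata([q, q])
    half = kde(q)[0] * scale            # density at the quartile, scaled to axis units
    line.set_xdata([center - half, center + half])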
What do you suggest?
Thanks for your help
Lnr
With thanks to @JohanC:
I added gridsize=1000 to the params of the violinplot and used ax.set(ylim=(0, upp)) to resize the y-axis to show the range from 0 to upp, where upp is the upper limit. Much prettier lookin' graph:
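Put together, a minimal sketch of that fix (assuming the data lives in a DataFrame df and upp has already been computed):

import seaborn as sns
import matplotlib.pyplot as plt

# a large gridsize keeps the violin outline and inner lines smooth after zooming
ax = sns.violinplot(data=df, inner="quartile", gridsize=1000)
ax.set(ylim=(0, upp))   # zoom in without recomputing the quartile lines
plt.show()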

Interpolating data onto a line of points

I have some irregularly spaced data and need to analyze it. I can successfully interpolate this data onto a regular grid using mlab.griddata (or rather, the natgrid implementation of it). This allows me to use pcolormesh and contour to generate plots, extract levels, etc. Using plt.contour, I then extract a certain level using get_paths from the contour's CS.collections.
Now, what I'd like to do is then, with my original irregularly spaced data, interpolate some quantities onto this specific contour line (i.e., NOT onto a regular grid). The similarly named griddata function from Scipy allows for this behavior, and it almost works. However, I find that as I increase the number of original points, I can get odd erratic behavior in the interpolation. I'm wondering if there's a way around this, i.e., another way to interpolate irregularly spaced (or regularly spaced data for that matter, since I can use my regularly spaced data from mlab.griddata) onto a specific line.
Let me show some numerical examples of what I'm talking about. Take a look at this figure:
The top left shows my data as points, and the line shows an extracted level of level=0 from some data D that I have at those points (x,y) [note, I have data 'D', 'Energy', and 'Pressure', all defined in this (x,y) space]. Once I have this curve, I can plot the interpolated quantities of D, Energy, and Pressure onto my specific line. First, note the plot of D (middle, right). It should be zero at all points, but it's not quite zero at all points. The likely cause of this is that the line that corresponds to the 0 level is generated from a uniform set of points that came from mlab.griddata, whereas the plot of 'D' is generated from my ORIGINAL data interpolated onto that level curve. You can also see some unphysical wiggles in 'Energy' and 'Pressure'.
Okay, seems easy enough, right? Maybe I should just get more original data points along my level=0 curve. Getting some more of these points, I then generate the following plots:
First look at the top left. You can see that I've sampled the hell out of the (x,y) space in the vicinity of my level=0 curve. Furthermore, you can see that my new "D" plot (middle, right) now correctly interpolates to zero in the region that it originally didn't. But now I get some wiggles at the start of the curve, as well as getting some other wiggles in the 'Energy' and 'Pressure' in this space! It is far from obvious to me that this should occur, since my original data points are still there and I've only supplemented additional points. Furthermore, some regions where my interpolation is going bad aren't even near the points that I added in the second run -- they are exclusively neighbored by my original points.
So this brings me to my original question. I'm worried that the interpolation that produces the 'Energy', 'D', and 'Pressure' curves is not working correctly (this is SciPy's griddata, which I call scigrid below). mlab's griddata only interpolates to a regular grid, whereas I want to interpolate onto the specific line shown in the top-left plot. What's another way for me to do this?
Thanks for your time!
After posting this, I decided to try scipy.interpolate.SmoothBivariateSpline, which produced the following result:
You can now see that my line is smoothed, so it seems like this will work. I'll mark this as the answer unless someone posts something soon that hints that there may be an even better solution.
Edit: As requested, below is some of the code used to generate these plots. I don't have a minimal working example, and the above plots were generated in a larger framework of code, but I'll write the important parts schematically below with comments.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.mlab import griddata            # natgrid-backed griddata (old matplotlib)
from scipy.interpolate import griddata as scigrid, SmoothBivariateSpline

# x,y,z are lists of data where the first point is x[0],y[0],z[0], and so on
minx = min(x)
maxx = max(x)
miny = min(y)
maxy = max(y)
# convert to numpy arrays
x = np.array(x)
y = np.array(y)
z = np.array(z)
# here we are creating a fine grid to interpolate the data onto
xi = np.linspace(minx, maxx, 100)
yi = np.linspace(miny, maxy, 100)
# here we interpolate our data from the original x,y,z unstructured grid to the new
# fine, regular grid in xi,yi, returning the values zi
zi = griddata(x, y, z, xi, yi)
# now let's do some plotting
plt.figure()
# returns the CS contour object, from which we'll be able to get the path for the
# level=0 curve; note contour needs the gridded zi, not the unstructured z
CS = plt.contour(xi, yi, zi, levels=[0])
# can plot the original data if we want
plt.scatter(x, y, alpha=0.5, marker='x')
# now let's get the level=0 curve
for c in CS.collections:
    data = c.get_paths()[0].vertices
# lineX,lineY are simply the x,y coordinates for our level=0 curve, expressed as arrays
lineX = data[:, 0]
lineY = data[:, 1]
# so it's easy to plot this too
plt.plot(lineX, lineY)
# now what to do if we want to interpolate some other data we have, say z2
# (also at our original x,y positions), onto this level=0 curve?
# well, first I tried using scipy.interpolate.griddata == scigrid like so
origdata = np.transpose(np.vstack((x, y)))  # just organizing this data like the
                                            # scigrid routine expects
lineZ2 = scigrid(origdata, z2, data, method='linear')
# plotting the above curve (as plt.plot(lineZ2)) gave me really bad results, so
# trying a spline approach
Z2spline = SmoothBivariateSpline(x, y, z2)
# the above creates a spline object on our original data. notice we haven't EVALUATED
# it anywhere yet (we'll want to evaluate it on our level curve)
Z2Line = []
# here we evaluate the spline along all our points on the level curve, and store the
# result as a new list
for i in range(0, len(lineX)):
    Z2Line.append(Z2spline(lineX[i], lineY[i])[0][0])  # the [0][0] just unwraps the
                                                       # value from the array structure
                                                       # it is otherwise returned in
# you can then easily plot this
plt.plot(Z2Line)
Hope this helps someone!
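One small follow-up that should work here (assuming the same lineX and lineY arrays): bivariate splines also have an ev method that evaluates the spline at arrays of points, which replaces the Python loop above:

# evaluate the spline at every point of the level curve in one vectorized call
Z2Line = Z2spline.ev(lineX, lineY)
plt.plot(Z2Line)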

Reportlab LinePlot - how do I add a lineLegend...or label my lines?

I have a lineplot with 2 lines on it; they're two separate channels from the same data set. I would love to just label each one, but the "labels" options are all about giving a number for each point on your plot, and that is simply not helpful.
I would love to know how to do any of these (really, all, but I just need one to be happy):
plot each against its own y-axis and be able to sensibly label that axis with units (and color the numbers to correspond to the data they belong to)
put a legend on it; I can't figure out how to use lineLegend (a sketch of my best guess follows the list)
just put any kind of (singular) label in the vicinity of the lines.
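For the legend option, here is a minimal, untested sketch using LineLegend from reportlab.graphics.charts.legends; the channel data, positions, and labels are all made up:

from reportlab.lib import colors
from reportlab.graphics.shapes import Drawing
from reportlab.graphics.charts.lineplots import LinePlot
from reportlab.graphics.charts.legends import LineLegend

drawing = Drawing(440, 220)

lp = LinePlot()
lp.x, lp.y, lp.width, lp.height = 50, 30, 280, 150
lp.data = [[(0, 1), (1, 2), (2, 3)],   # channel 1 (made-up values)
           [(0, 2), (1, 1), (2, 2)]]   # channel 2
lp.lines[0].strokeColor = colors.red
lp.lines[1].strokeColor = colors.blue

legend = LineLegend()
legend.x, legend.y = 350, 170
# pair each line's color with the text to show beside it
legend.colorNamePairs = [(colors.red, 'channel 1'),
                         (colors.blue, 'channel 2')]

drawing.add(lp)
drawing.add(legend)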

Layered, not stacked column graph in Excel

I want to layer (superimpose) one column graph on another in Excel. So it would be like a stacked column graph, except that each column for a given category on the x-axis would have its origin at 0 on the y-axis. My data are before and after scores. By layering the columns instead of putting them side-by-side, it would be easier to visualize the magnitude and direction of the difference between the two scores. I've seen this done with R, but can't find examples in Excel. Anyone ever attempted this?
I tried the 3D suggestion and it worked. But the other answer I discovered was to choose a Clustered Column graph, click 'Format Data Series', and change the 'Overlap' percentage to 100%. I'm using a Mac so it's slightly different, but the person who helped me with this was on a PC and I've mainly used PCs. What I ended up discovering is that 90% looked quite nice, but 100% will achieve exactly what you're looking for.
I did the same thing for my thesis presentation. It's a little tricky, and I worked it out by myself. To do it, you have to create a 3D bar graph (not a stacked one) in which your columns are put in front of each other. You have to make sure that all the taller columns in each x-axis category are behind the shorter columns in that category.
Once you have created that graph, you can rotate the 3D graph so that it looks like a 2D graph (by resetting the rotation of the axes to zero). Now you have a bar graph in which every category has overlapping columns and all of the columns start at zero. ;)
Short answer: change the post score to (post - pre); then you can proceed with making the stacked bar chart.
Long and correct answer: DO NOT DO THIS. A clustered bar chart is much better because:
The visual baseline for comparison is the same line either way, so overlapping does nothing to aid understanding.
Any kind of overlapping of the bars conceals part of the post score's area, which induces visual distortion: a pre score of 10 and a post score of 20 should have a column area ratio of 1:2, but if you completely overlap them it is reduced to 1:1. Partial overlapping is equally problematic.

Put pcolormesh and contour onto same grid?

I'm trying to display 2D data with axis labels using both contour and pcolormesh. As has been noted on the matplotlib user list, these functions obey different conventions: pcolormesh expects the x and y values to specify the corners of the individual pixels, while contour expects the centers of the pixels.
What is the best way to make these behave consistently?
One option I've considered is to make a "centers-to-edges" function, assuming evenly spaced data:
def centers_to_edges(arr):
    dx = arr[1] - arr[0]
    newarr = np.linspace(arr.min() - dx/2, arr.max() + dx/2, arr.size + 1)
    return newarr
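For instance, a quick sketch of how this would be used (made-up 1-D center arrays and data):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.5, 9.5, 10)   # cell centers along x
y = np.linspace(0.5, 4.5, 5)    # cell centers along y
Z = np.add.outer(y, x)          # one value per cell, shape (5, 10)

fig, ax = plt.subplots()
# pcolormesh gets the cell edges, contour gets the cell centers
ax.pcolormesh(centers_to_edges(x), centers_to_edges(y), Z)
ax.contour(x, y, Z, colors='k')
plt.show()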
Another option is to use imshow with the extent keyword set.
The first approach doesn't play nicely with 2D axes (e.g., as created by meshgrid or indices), and the second discards the axis numbers entirely.
Is your data on a regular mesh? If not, you can use griddata() to obtain it. I think that if your data is too big, sub-sampling or regularization is always possible; and if the data is too big, the output image will likely be small compared with it anyway, which you can exploit.
If you use imshow() with extent and interpolation='nearest', you will see that the data is cell-centered, with extent providing the lower edges of the cells (corners). On the other hand, contour assumes that the data is cell-centered, so X and Y must be the centers of the cells. You therefore need to be careful about the input domain for contour. A trivial example:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-10, 10, 1)
X, Y = np.meshgrid(x, x)
P = X**2 + Y**2
plt.imshow(P, extent=[-10, 10, -10, 10], interpolation='nearest', origin='lower')
plt.contour(X + 0.5, Y + 0.5, P, 20, colors='k')
My tests told me that pcolormesh() is a very slow routine, so I always try to avoid it; griddata and imshow() are always a good choice for me.