Ordering Seaborn heatmap with non-index variable - pandas

I am moving from R and ggplot2 to seaborn for much of my work because R was struggling with the size of the data I was using. I am working on a fairly simple heatmap and have been able to render it without too many issues, but I am not sure how to adjust the ordering of my categoricals.
In this case my data has this header:
Sample Position Depth Order
Sample is the "y-axis" categorical and Position is the "x-axis" categorical. Depth is the value of the cell. Order is a meta-value calculated elsewhere, but I want to use Order as my ordering value for the y-axis, while retaining Sample as the label. Is there a way to do this?

You need to provide a rectangular format, or matrix, for sns.heatmap. Although you have an Order column for ordering Sample, it's not clear whether each Sample maps to a single Order value.
Below I use a simple example: you change 'Sample' to an ordered categorical whose categories are sorted by the mean value of 'Order'. It is like changing the factor levels in R. Also, you need to make sure there are no NaN values, otherwise the heatmap might complain:
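import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'Sample': np.repeat(['A', 'B', 'C'], 4),
                   'Position': [1, 2, 3, 4] * 3,
                   'Depth': np.random.normal(0, 1, 12),
                   'Order': np.repeat([2, 1, 3], 4)})
# Sort the Sample labels by their mean Order value
y_order = df.groupby('Sample')['Order'].agg('mean').sort_values().index
# Make Sample an ordered categorical so the pivot keeps that row order
df['Sample'] = pd.Categorical(df['Sample'], ordered=True, categories=y_order)
sns.heatmap(df.pivot(index='Sample', columns='Position', values='Depth'))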
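An alternative to the categorical approach, sketched here under the assumption that each Sample has exactly one Order value, is to pivot first and then reorder the rows of the resulting matrix with reindex. The data below is the same toy example, except that it uses a seeded random generator so the result is repeatable. The name row_order is introduced only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the toy data is repeatable
df = pd.DataFrame({
    "Sample": np.repeat(["A", "B", "C"], 4),
    "Position": [1, 2, 3, 4] * 3,
    "Depth": rng.normal(0, 1, 12),
    "Order": np.repeat([2, 1, 3], 4),
})

# Assumption: one Order value per Sample, so taking the first is enough.
row_order = df.groupby("Sample")["Order"].first().sort_values().index

# Pivot to the rectangular shape, then put the rows in the Order-based sequence.
matrix = df.pivot(index="Sample", columns="Position", values="Depth").reindex(row_order)
print(list(matrix.index))
```

Passing matrix to sns.heatmap should then draw the rows in this order, with the Sample names as labels.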

why is ggplot2 geom_col misreading discrete x axis labels as continuous?

Aim: plot a column chart representing concentration values at discrete sites
Problem: the 14 site labels are numeric, so I think ggplot2 is assuming continuous data and adding spaces for what it sees as 'missing numbers'. I only want 14 columns with 14 marks/labels, matching the 14 values in the dataframe. I've tried converting the sites to factors and to characters, but neither works.
Also, how do you ensure the y-axis ends at '0', so the bottom of the columns meet the x-axis?
Thanks
Data:
Sites: 2,4,6,7,8,9,10,11,12,13,14,15,16,17
Concentration: 10,16,3,15,17,10,11,19,14,12,14,13,18,16
You have two questions in one with two pretty straightforward answers:
1. How to force a discrete axis when your column is a continuous one? To make ggplot2 draw a discrete axis, the data must be discrete. You can force your numeric data to be discrete by converting to a factor. So, instead of x=Sites in your plot code, use x=as.factor(Sites).
2. How to eliminate the white space below the columns in a column plot? You can control the limits of the y axis via the scale_y_continuous() function. By default, the limits extend a bit past the actual data (in this case, from 0 to the max Concentration). You can override that behavior via the expand= argument. Check the documentation for expansion() for more details, but here I'm going to use mult=, which uses a multiplication to find the new limits based on the data. I'm using 0 for the lower limit to make the lower axis limit equal the minimum in your data (0), and 0.05 as the upper limit to expand the chart limits about 5% past the max value (this is default, I believe).
Here's the code and resulting plot.
library(ggplot2)
df <- data.frame(
  Sites = c(2,4,6,7,8,9,10,11,12,13,14,15,16,17),
  Concentration = c(10,16,3,15,17,10,11,19,14,12,14,13,18,16)
)
ggplot(df, aes(x=as.factor(Sites), y=Concentration)) +
  geom_col(color="black", fill="lightblue") +
  scale_y_continuous(expand=expansion(mult=c(0, 0.05))) +
  theme_bw()

Altair sorting Chart

I'm trying to plot the data of my DataFrame in a grouped chart and I want the columns to preserve the order I gave them. The data looks as follows (not all of it is shown, but the rest is organized the same way):
dataframe
When I plot it I get the following Graph:
graph
So the months were sorted even though I specified not to sort in the chart. I used the following code:
chart2 = alt.Chart(melted).mark_bar().encode(
    column=alt.Column('variable', sort=None),
    x=alt.X('room', sort=None),
    y=alt.Y('value'),
    color='room',
    tooltip=['room', 'value']
)
Does anyone know how I could fix that?
You've already used sort=None, which is the correct way to make scales in a non-faceted chart reflect the input order.
The missing piece is that faceted charts share scales by default (See Scale and Guide Resolution), so each facet is being forced to share an order.
If you make the x scale resolution independent, then each facet should retain the input order:
chart2 = alt.Chart(melted).mark_bar().encode(
    column=alt.Column('variable', sort=None),
    x=alt.X('room', sort=None),
    y=alt.Y('value'),
    color='room',
    tooltip=['room', 'value']
).resolve_scale(x='independent')

Plot variable size/color-heatmap for multiple occurrences of points in scatter plot

I'm stuck with the following problem and I hope I can explain it coherently.
So, I have a number (about 10) of discrete positions on a coordinate system.
Now, I want to analyse data from a program where users could label each point as somethingA or somethingB.
I extracted the data points for each class, so I have about 60 points for the somethingA class and slightly fewer for the other class. One class stands for good points and one for bad points. I want to find the positions which have the most good/bad labels. I do that with machine learning algorithms; here I just want to visualize it with plots.
I now want to plot those points, making one plot per class. But since every point occurs at least once in every class, the two plots would look exactly the same.
However, the number of occurrences is distributed differently across the positions.
For example, point A might have 20 occurrences in class A and 1 in class B, yet both plots would look the same.
So, my question is: how can I take the number of occurrences of each point into account when plotting scatters in Matplotlib?
Either with different colors (like a heatmap?), maybe with a cool legend,
or with different sizes (e.g. higher count = bigger circle).
Any help would be appreciated!
I don't know if this helps you, but I had a problem where I wanted a scatter plot to reflect the positions as well as two variables attributed to the data points.
Passing variables directly as the color and size arguments of the scatter function, i.e. something like
ax.scatter(..., c=whatEverFunction, s=numberOfOccurences, ...)
did not work for me, so I had to specify the color code and size in the usual way.
What I did was to bin the values of the two variables I wanted to visualize (in my case the variable nodeMass and another variable) and plot each bin separately:
# First bin: small grey markers
for i in range(Number):
    mask[i] = False
    if lowerBound1 < variableOne[i] < upperBound1:
        mask[i] = True & pmask[i]
if len(positionX[mask]) > 0:
    ax.scatter(positionX[mask], positionY[mask], positionZ[mask], c='#424242', s=10, edgecolors='none')
# Second bin: larger magenta markers
for i in range(Number):
    mask[i] = False
    if lowerBound2 < variableOne[i] < upperBound2:
        mask[i] = True & pmask[i]
if len(positionX[mask]) > 0:
    ax.scatter(positionX[mask], positionY[mask], positionZ[mask], c='#9E0050', s=25, edgecolors='none')
I know it is not very elegant, but it worked for me. I had to write as many for loops as I had bins in my variables. With the if checks and the masks I could at least avoid redundant or 'unreadable' plots.
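For the 2D case in the question, a shorter route may be possible: in current Matplotlib, scatter accepts array-like values for both s (marker size) and c (color). The sketch below counts how often each position was labelled and maps that count to size and color. The list labelled_points and the scale factor of 40 are made up for illustration.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical labels for one class: each tuple is the position of one labelled point.
labelled_points = [(0, 0), (0, 0), (1, 2), (1, 2), (1, 2), (3, 1)]

# Count how often each distinct position occurs.
counts = Counter(labelled_points)
xs = [p[0] for p in counts]
ys = [p[1] for p in counts]
n = [counts[p] for p in counts]

fig, ax = plt.subplots()
# Size and color both encode the number of occurrences.
sc = ax.scatter(xs, ys, s=[40 * k for k in n], c=n, cmap="viridis")
fig.colorbar(sc, ax=ax, label="number of occurrences")
fig.savefig("occurrences.png")
```

Running this once per class, with a shared color range (for example via the vmin and vmax arguments of scatter), should make the two plots directly comparable.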

How to change text of y-axes on a matplotlib generated picture

The page is
"http://matplotlib.sourceforge.net/examples/pylab_examples/histogram_demo_extended.html"
Looking at the y-axis, the numbers there do not make any sense. Could we change them to something more meaningful?
Except for the cumulative distribution plot and the last one, the y-axes show normalized histogram values, because the normed=1 keyword is set (i.e., the area under the histogram equals 1, as in the definition of a probability density function (PDF)).
You can use yticks(), see this example.
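As a rough sketch of the yticks() suggestion: the snippet below draws a normalized histogram and replaces the default tick labels with custom text. The data, the tick positions, and the label strings are all made up. It uses density=True, which replaced the normed keyword in newer Matplotlib releases.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(200, 25, 1000)  # hypothetical sample

fig, ax = plt.subplots()
ax.hist(data, bins=20, density=True)  # area under the bars sums to 1

# Choose where the ticks go and what text they show.
tick_positions = [0.0, 0.005, 0.010, 0.015]
tick_labels = ["0", "0.5% per unit", "1.0% per unit", "1.5% per unit"]
plt.yticks(tick_positions, tick_labels)
fig.savefig("histogram_yticks.png")
```

If raw counts are more meaningful than densities, leaving out the normalization keyword altogether is another option.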

Easiest way to plot values as symbols in scatter plot?

In an answer to an earlier question of mine regarding fixing the colorspace for scatter images of 4D data, Tom10 suggested plotting values as symbols in order to double-check my data. An excellent idea. I've run some similar demos in the past, but I can't for the life of me find the demo I remember being quite simple.
So, what's the easiest way to plot numerical values as the symbol in a scatter plot, instead of 'o' for example? Tom10 suggested plt.text(x,y,value), and that is the implementation used in a number of examples. However, I wonder if there's an easy way to evaluate "value" from my array of numbers. Can one simply say str(valuearray)?
Do you need a loop to evaluate the values for plotting as suggested in the matplotlib demo section for 3D text scatter plots?
Their example produces:
(source: sourceforge.net)
However, they're doing something fairly complex in evaluating the locations as well as changing text direction based on data. So, is there a cute way to plot x,y,C data (where C is a value often taken as the color in the plot data- but instead I wish to make the symbol)?
Again, I think we have a fair answer to this- I just wonder if there's an easier way?
The easiest way I've seen to do this is:
for x, y, val in zip(x_array, y_array, val_array):
    plt.text(x, y, val)
Also, btw, you suggested using str(valarray), and this, as you may have noticed, doesn't work. To convert an array of numbers to a sequence of strings you could use
valarray.astype(str)
to get a numpy array, or
[str(v) for v in valarray]
to get a Python list. But even with valarray as a proper sequence of strings, plt.text won't iterate over its inputs.
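Putting the loop into a complete script might look like the sketch below. The arrays are made-up sample data. The invisible scatter call is an addition that is not in the answer above: text does not rescale the axes by itself, so the scatter is there only to set the axis limits around the data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical x, y, C data.
x_array = np.array([0.5, 1.0, 2.0])
y_array = np.array([1.0, 3.0, 2.0])
val_array = np.array([3, 7, 1])

fig, ax = plt.subplots()
# Fully transparent markers: they only make the axes autoscale to the data.
ax.scatter(x_array, y_array, alpha=0)

# One text call per point, since text does not iterate over arrays.
for x, y, val in zip(x_array, y_array, val_array):
    ax.text(x, y, str(val), ha="center", va="center")

fig.savefig("values_as_symbols.png")
```

Centering the text with ha and va puts each value where the marker would have been.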