Annotating numeric values on a grouped bar chart in pyplot - matplotlib

Good evening all,
I have a pandas DataFrame called plot_eigen_vecs_df with shape (3, 11), and I am plotting each column value grouped by rows on a bar chart. I am using the following code:
plot_eigen_vecs_df.plot(kind='bar', figsize=(12, 8),
title='First 3 PCs factor loadings',
xlabel='Evects', legend=True)
The result is this graph:
[image: grouped bar chart of the factor loadings]
I would like to keep the graph (grouped) exactly as it is, but I need to show the numeric value above each bar.
Thank you
I tried the bar_label method, but unfortunately I am using a version of matplotlib that is not the most recent, so .bar_label doesn't work for me. Could you please help with this?
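For Matplotlib versions that predate Axes.bar_label (added in 3.4), one possible workaround is to loop over the bar rectangles in ax.patches and call ax.annotate on each. The following is only a sketch: the random data, the index and column names, and the label formatting are stand-ins, since the real plot_eigen_vecs_df is not shown in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Stand-in data with the same (3, 11) shape as in the question (assumption)
rng = np.random.default_rng(0)
plot_eigen_vecs_df = pd.DataFrame(
    rng.uniform(-1, 1, size=(3, 11)),
    index=["PC1", "PC2", "PC3"],
    columns=[f"f{i}" for i in range(11)],
)

ax = plot_eigen_vecs_df.plot(kind="bar", figsize=(12, 8),
                             title="First 3 PCs factor loadings", legend=True)
ax.set_xlabel("Evects")

# Each bar is a Rectangle in ax.patches; label it at the top of the bar
# (or just below the tip for negative values).
labels = []
for patch in ax.patches:
    height = patch.get_height()
    x_center = patch.get_x() + patch.get_width() / 2
    positive = height >= 0
    labels.append(
        ax.annotate(
            f"{height:.2f}",
            (x_center, height),
            xytext=(0, 3 if positive else -3),  # small offset in points
            textcoords="offset points",
            ha="center",
            va="bottom" if positive else "top",
            fontsize=7,
            rotation=90,
        )
    )
```

With 33 narrow bars the labels are rotated and kept small here. Both settings are a matter of taste and can be changed.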

Related

Cannot plot a histogram from a Pandas dataframe

I've used pandas.read_csv to generate a 1000-row dataframe with 32 columns. I'm looking to plot a histogram or bar chart (depending on data type) of each column. For columns of type 'int64', I've tried doing matplotlib.pyplot.hist(df['column']) and df.hist(column='column'), as well as calling matplotlib.pyplot.hist on df['column'].values and df['column'].to_numpy(). Weirdly, they all take a really long time (>30s), and when I've allowed them to complete, I get unit-height bars in multiple colors, as if there's some sort of implicit grouping and they're all being separated into different groups. Any ideas about what I can do to get a normal histogram? Unfortunately I closed the charts so I can't show you an example right now.
Edit - this seems to be a much bigger problem with Int columns, and casting them to float fixes the problem.
Follow these two steps:
import the pyplot module from the Matplotlib library
call its hist function, which accepts a DataFrame column as its argument
import matplotlib.pyplot as plt
plt.hist(df['column'], color='blue', edgecolor='black', bins=45)
Here's the source.
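Combining the asker's edit (casting to float helped) with the plt.hist call above, a minimal sketch might look like the following. The random integer column is a stand-in for the real CSV data, and the pd.to_numeric step is an assumption: it guards against a column that looks numeric but is stored with a different dtype.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for one int64 column of the 1000-row dataframe (assumption)
rng = np.random.default_rng(1)
df = pd.DataFrame({"column": rng.integers(0, 100, size=1000)})

# Coerce to a plain 1-D float series before plotting, dropping anything
# that could not be parsed as a number.
values = pd.to_numeric(df["column"], errors="coerce").astype(float).dropna()

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=45, color="blue", edgecolor="black")
```

If the bars still come out unit-height and multi-colored, it may be worth checking df['column'].dtype and the shape of the array being passed, since hist treats each column of a 2-D input as a separate dataset.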

visualizing longitudinal patient data: Adding specific icons or symbols to certain cells in a time series heatmap to indicate events/outcomes

I am currently involved in a clinical study. We are trying to visualize patients' blood work over time using seaborn cluster maps (for example, patient CPR levels). For reference: we have some 200 patients and up to 60 days of observed data, so the cells in the plot are pretty small.
Some patients either died or developed an outcome of interest during the observation period. We would love to visualize these key events with some form of symbol or icon. I am imagining something like this:
In addition to its color coding, the cell at the date of death gets a big dot right in the middle, or even a symbolic cross or some other symbol.
Things that might work, but I do not know how to do:
I am using lines to separate cells. Changing the width and color of the lines around the cell at the date an event occurred might work.
Things that don't work:
The cells in my heatmap are too small for custom annotations.
import pandas as pd
import seaborn as sns
df= pd.read_excel('data.xlsx')
heatmap = sns.clustermap(df, col_cluster=False, row_cluster=False, cmap='YlOrRd', mask=(df == 0), vmax=10, vmin=0, linewidths=1, linecolor='black', figsize=(20, 16), cbar_pos=(0.1, 0.2, .02, .6))
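One way to get a symbol into specific cells is to draw a scatter layer on the heatmap's axes, since seaborn places the center of cell (row i, column j) at the coordinates (j + 0.5, i + 0.5). The sketch below uses sns.heatmap to stay small. With a clustermap and clustering switched off, the same scatter call should work on heatmap.ax_heatmap. The random data and the events mapping (row position to column position) are hypothetical stand-ins for the study data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in data: 20 patients (rows) by 30 days (columns), values 0..10
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.integers(0, 11, size=(20, 30)).astype(float))

# Hypothetical events: row position of the patient -> column position of the day
events = {3: 12, 7: 25, 15: 4}

ax = sns.heatmap(df, cmap="YlOrRd", mask=(df == 0), vmin=0, vmax=10,
                 linewidths=1, linecolor="black")

rows = np.array(list(events.keys()))
cols = np.array(list(events.values()))

# Cell centers sit at (column + 0.5, row + 0.5) in the heatmap's data coordinates
markers = ax.scatter(cols + 0.5, rows + 0.5, marker="P", s=30,
                     color="black", zorder=3)
```

The marker size s is in points squared, so it can be tuned independently of the cell size. Other marker codes such as "o" or "X" give a dot or a cross.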

Matplotlib/Seaborn: Boxplot collapses on x axis

I am creating a series of boxplots in order to compare different cancer types with each other (based on 5 categories). For plotting I use seaborn/matplotlib. It works fine for most of the cancer types (see image right) however in some the x axis collapses slightly (see image left) or strongly (see image middle)
https://i.imgur.com/dxLR4B4.png
Looking into the code how seaborn plots a box/violin plot https://github.com/mwaskom/seaborn/blob/36964d7ffba3683de2117d25f224f8ebef015298/seaborn/categorical.py (line 961)
violin_data = remove_na(group_data[hue_mask])
I realized that this happens when there are too many NaNs.
Is there any way to prevent this collapsing by code only?
I do not want to modify my dataframe (e.g. replace the NaNs with zero).
Below you find my code:
boxp_df=pd.read_csv(pf_in,sep="\t",skip_blank_lines=False)
fig, ax = plt.subplots(figsize=(10, 10))
sns.violinplot(data=boxp_df, ax=ax)
plt.xticks(rotation=-45)
plt.ylabel("label")
plt.tight_layout()
plt.savefig(pf_out)
The output is a per cancer type differently sized plot
(depending on if there is any category completely nan)
I am expecting each plot to be in the same width.
Update
trying to use the order parameter as suggested leads to the following output:
https://i.imgur.com/uSm13Qw.png
Maybe this toy example helps ?
|Cat1|Cat2|Cat3|Cat4|Cat5
|3.93| |0.52| |6.01
|3.34| |0.89| |2.89
|3.39| |1.96| |4.63
|1.59| |3.66| |3.75
|2.73| |0.39| |2.87
|0.08| |1.25| |-0.27
Update
Apparently, the problem is not the data but the length of the title
https://github.com/matplotlib/matplotlib/issues/4413
Therefore I would close the question
@Diziet should I delete it, or might my issue help others?
Sorry for not including the line below in the code example:
ax.set_title("VERY LONG TITLE", fontsize=20)
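Given that the title length turned out to be the trigger, one possible workaround is to wrap the title onto several lines before calling tight_layout. This is only a sketch: the title text and the wrap width of 40 characters are arbitrary stand-ins, and whether this cures the collapsing in every Matplotlib version is an assumption.

```python
import textwrap

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Stand-in for the real, very long title (assumption)
long_title = ("VERY LONG TITLE " * 12).strip()

fig, ax = plt.subplots(figsize=(10, 10))

# Break the title into lines of at most 40 characters so that it stays
# narrower than the figure when tight_layout computes the axes size.
wrapped = textwrap.fill(long_title, width=40)
ax.set_title(wrapped, fontsize=20)
fig.tight_layout()
```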
It's hard to be sure without data to test it with, but I think you can pass the names of your categories/cancers to the order= parameter. This forces seaborn to use/display those, even if they are empty.
for instance:
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips, order=['Thur','Fri','Sat','Freedom Day','Sun','Durin\'s Day'])

Matplotlib stacked bar chart not showing all bars

I want to make a stacked bar chart in matplotlib. Somehow it doesn't include all the bars that I gave it (there should be about 50 bars stacked on top of each other).
The code:
import numpy as np
import matplotlib.pyplot as plt
N=45 #number of columns
max_el=50
ind=np.arange(N)
for bar in range(0,max_el):
    y=[dic[value][bar] for value in dic]
    plt.bar(ind,y)
plt.show()
note: I used the similar code and same data and made a stacked bar chart with plotly (which worked)
With plotly
With matplotlib
Some of the values of variables are zeros or 0.1. Could that be the problem ?
As described in the comments, you need to add a bottom array that keeps track of how far each layer should be moved up from the 0 line. Otherwise, all the layers start at 0 and overplot one another, with the tallest one sticking up to its value and each one hiding those that were plotted before it.
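A minimal sketch of that idea, keeping the variable names from the question: the contents of dic are random stand-ins (assumption), and bottom is the running total that each new layer is stacked on.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

N = 45       # number of columns
max_el = 50  # number of stacked layers per column

# Stand-in for the asker's data: one sequence of max_el values per column
rng = np.random.default_rng(3)
dic = {key: rng.uniform(0, 1, size=max_el) for key in range(N)}

ind = np.arange(N)
bottom = np.zeros(N)  # running height of what has been drawn so far

fig, ax = plt.subplots()
for bar in range(max_el):
    y = np.array([dic[value][bar] for value in dic])
    ax.bar(ind, y, bottom=bottom)  # start this layer on top of the previous ones
    bottom += y                    # raise the baseline for the next layer
```

After the loop, bottom holds the total height of each column, which is a quick way to check that nothing was overplotted.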

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
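Put together with the dataframe from the question, the seaborn call can be run as the self-contained sketch below. Only the Agg backend line is an addition, so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Same data as in the question
df = pd.DataFrame({
    "gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
    "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
    "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ", "USA", "EUR", "FIJ"],
})

# One axes, gene_id on the x axis, one colored box per cohort within each gene_id
ax = sns.boxplot(data=df, x="gene_id", y="density", hue="cohort")
legend_labels = sorted(text.get_text() for text in ax.get_legend().get_texts())
```

Note that with this toy data each gene_id/cohort pair has a single value, so every box collapses to a line. Real data with several points per group would show full boxes.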
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you could with pandas directly, by using the positions= parameter of boxplot (see doc). The catch is to generate the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
An alternative would be to use seaborn.swarmplot instead of boxplot. A swarmplot plots every point instead of the summary representation of a boxplot, and you can use the parameter dodge=False (named split in older seaborn versions) to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', dodge=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.