Pie chart for each column in pandas

I have a dataframe of categorical values, and want to tabulate, then make a pie graph on each column.
I can tabulate my table and create one massive plot, but I do not think this meets my needs, and would prefer a pie graph for each column instead:
df = pd.DataFrame({'a': ['table', 'chair', 'chair', 'lamp', 'bed'],
'b': ['lamp', 'candle', 'chair', 'lamp', 'bed'],
'c': ['mirror', 'mirror', 'mirror', 'mirror', 'mirror']})
df
df2 = df.apply(pd.Series.value_counts).fillna(0)  # pd.value_counts is deprecated in recent pandas
df2.plot.bar()
display()
I tried making pie plots for each column, but have been struggling the past few hours with:
df2.plot(kind='pie',subplots=True,autopct='%1.1f%%', startangle=270, fontsize=17)
display()
I think I am close, and hopefully someone can help me get over the final hurdle: making a pie graph for each column that is meaningful and interpretable (a title above each plot naming the column, the legend in an appropriate position) rather than this bungled mess. A pointer to the correct documentation to read would also help.

One easy thing to do is to increase the figure size and specify the layout:
df2.plot(kind='pie', subplots=True,
autopct='%1.1f%%', startangle=270, fontsize=17,
layout=(2,2), figsize=(10,10))
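If the combined subplots are still hard to read, another option is to draw each column on its own axes. The following is only a sketch, assuming plain matplotlib subplots and a non-interactive backend; the figure size and the one-row layout are illustrative choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'a': ['table', 'chair', 'chair', 'lamp', 'bed'],
                   'b': ['lamp', 'candle', 'chair', 'lamp', 'bed'],
                   'c': ['mirror', 'mirror', 'mirror', 'mirror', 'mirror']})

# One axes per column, laid out in a single row (illustrative layout).
fig, axes = plt.subplots(1, len(df.columns), figsize=(15, 5))

for ax, col in zip(axes, df.columns):
    # Tabulate this column only, so categories with zero count are not drawn.
    counts = df[col].value_counts()
    ax.pie(counts.values, labels=list(counts.index),
           autopct='%1.1f%%', startangle=270)
    ax.set_title(col)  # title above each pie naming the column

fig.tight_layout()
```

Counting each column separately avoids the zero-sized wedges that come from the shared df2 table, which is part of what makes the combined plot look cluttered.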

Related

Annotating numeric values on grouped bars chart in pyplot

Good evening all,
I have a pandas DataFrame called plot_eigen_vecs_df with shape (3, 11), and I am plotting each column value grouped by rows on a bar chart. I am using the following code:
plot_eigen_vecs_df.plot(kind='bar', figsize=(12, 8),
title='First 3 PCs factor loadings',
xlabel='Evects', legend=True)
The result is this graph:
I would like to keep the grouped graph exactly as it is, but I need to show the numeric value above each bar.
I tried the bar_label method, but unfortunately I am using an older version of matplotlib, so .bar_label doesn't work for me. Could you please help on the matter?
Thank you
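One way that does not depend on a recent matplotlib is to loop over the bar rectangles and annotate each one. This is a minimal sketch, assuming randomly generated stand-in data of the same (3, 11) shape; the index and column names, the label format and the font size are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for the sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 3 rows (PCs) x 11 columns.
rng = np.random.default_rng(0)
plot_eigen_vecs_df = pd.DataFrame(rng.normal(size=(3, 11)).round(2),
                                  index=['PC1', 'PC2', 'PC3'],
                                  columns=[f'x{i}' for i in range(11)])

ax = plot_eigen_vecs_df.plot(kind='bar', figsize=(12, 8),
                             title='First 3 PCs factor loadings', legend=True)
ax.set_xlabel('Evects')

labels = []
for bar in ax.patches:  # one Rectangle per bar
    height = bar.get_height()
    # Put the label just above positive bars and just below negative ones.
    offset = 3 if height >= 0 else -3
    valign = 'bottom' if height >= 0 else 'top'
    labels.append(ax.annotate(f'{height:.2f}',
                              xy=(bar.get_x() + bar.get_width() / 2, height),
                              xytext=(0, offset), textcoords='offset points',
                              ha='center', va=valign, fontsize=7))
```

The grouped layout is untouched because the loop only reads the rectangles that the pandas plot already created.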

Seaborn time series plotting: a different problem for each function

I'm trying to use seaborn dataframe functionality (e.g. passing column names to x, y and hue plot parameters) for my timeseries (in pandas datetime format) plots.
x should come from a timeseries column(converted from a pd.Series of strings with pd.to_datetime)
y should come from a float column
hue comes from a categorical column that I calculated.
There are multiple streams in the same series that I am trying to separate (and use the hue for separating them visually), and therefore they should not be connected by a line (like in a scatterplot)
I have tried the following plot types, each with a different problem:
sns.scatterplot: gets the plotting right and the labels right but has problems with the x limits, and I could not set them right with plt.xlim() using data.Dates.min and data.Dates.max
sns.lineplot: gets the limits and the labels right but I could not find a setting to disable the lines between the individual datapoints like in matplotlib. I tried setting the markers and the dashes parameters to no avail.
sns.stripplot: my last try, plotted the datapoints correctly and got the x limits right but messed up the tick labels
Example input data for easy reproduction:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dates = pd.to_datetime(('2017-11-15',
'2017-11-29',
'2017-12-15',
'2017-12-28',
'2018-01-15',
'2018-01-30',
'2018-02-15',
'2018-02-27',
'2018-03-15',
'2018-03-27',
'2018-04-13',
'2018-04-27',
'2018-05-15',
'2018-05-28',
'2018-06-15',
'2018-06-28',
'2018-07-13',
'2018-07-27'))
values = np.random.randn(len(dates))
clusters = np.random.randint(3, size=len(dates))  # three clusters; randint(1, ...) would return only zeros
D = {'Dates': dates, 'Values': values, 'Clusters': clusters}
data = pd.DataFrame(D)
To each of the functions I am passing the same arguments:
sns.OneOfThePlottingFunctions(x='Dates',
y='Values',
hue='Clusters',
data=data)
plt.show()
So to recap, what I want is a plot that uses seaborn's pandas functionality, and plots points(not lines) with correct x limits and readable x labels :)
Any help would be greatly appreciated.
ax = sns.scatterplot(x='Dates', y='Values', hue='Clusters', data=data)
ax.set_xlim(data['Dates'].min(), data['Dates'].max())
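A slightly fuller version of the two lines above might look like the following. It is a sketch only: the evenly spaced dates, the three clusters, and the 7-day padding on each side are assumptions made for illustration, and fig.autofmt_xdate() is just one way to keep the date labels readable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for the sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data in the same shape as the question's example.
rng = np.random.default_rng(0)
dates = pd.date_range('2017-11-15', periods=18, freq='15D')
data = pd.DataFrame({'Dates': dates,
                     'Values': rng.normal(size=len(dates)),
                     'Clusters': rng.integers(0, 3, size=len(dates))})

fig, ax = plt.subplots(figsize=(10, 5))
sns.scatterplot(x='Dates', y='Values', hue='Clusters', data=data, ax=ax)

# Set the x limits from the data, with a small pad so edge points are not clipped.
pad = pd.Timedelta(days=7)  # illustrative padding
ax.set_xlim(data['Dates'].min() - pad, data['Dates'].max() + pad)

# Rotate and align the date tick labels so they do not overlap.
fig.autofmt_xdate()
```

Note that min and max are called as methods here; passing data.Dates.min without parentheses hands matplotlib the method object rather than a date.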

Cannot create bar plot with pandas

I am trying to create a bar plot using pandas. I have the following code:
import pandas as pd
indexes = ['Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree']
df = pd.DataFrame({'Q7': [10, 11, 1, 0, 0]}, index=indexes)
df.plot.bar(indexes, df['Q7'].values)
By my reckoning this should work but I get a weird KeyError: 'Strongly agree' thrown at me. I can't figure out why this won't work.
By invoking plot as a Pandas method, you're referring to the data structures of df to make your plot.
The way you have it set up, with index=indexes, your bar plot's x values are stored in df.index. That's why Wen's suggestion in the comments to just use df.plot.bar() will work, as Pandas automatically looks to use df.index as the x-axis in this case.
Alternately, you can specify column names for x and y. In this case, you can move indexes into a column with reset_index() and then call the new index column explicitly:
df.reset_index().plot.bar(x="index", y="Q7")
Either approach will yield the correct plot.
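For reference, here are both approaches side by side as a runnable sketch. The non-interactive backend and the variable names ax_index and ax_column are assumptions added for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for the sketch
import matplotlib.pyplot as plt
import pandas as pd

indexes = ['Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree']
df = pd.DataFrame({'Q7': [10, 11, 1, 0, 0]}, index=indexes)

# Approach 1: let pandas use df.index as the x-axis.
ax_index = df.plot.bar()

# Approach 2: move the index into a column (named "index" by default)
# and pass column names for x and y.
ax_column = df.reset_index().plot.bar(x='index', y='Q7')

heights_index = [bar.get_height() for bar in ax_index.patches]
heights_column = [bar.get_height() for bar in ax_column.patches]
```

The original call failed because the positional arguments of df.plot.bar are interpreted as the x and y column names, so pandas went looking for a column called 'Strongly agree'.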

Why does DataFrameGroupBy.boxplot method throw error when given argument "subplots=True/False"?

I can use DataFrameGroupBy.boxplot(...) to create a boxplot in the following way:
In [15]: df = pd.DataFrame({"gene_length":[100,100,100,200,200,200,300,300,300],
...: "gene_id":[1,1,1,2,2,2,3,3,3],
...: "density":[0.4,1.1,1.2,1.9,2.0,2.5,2.2,3.0,3.3],
...: "cohort":["USA","EUR","FIJ","USA","EUR","FIJ","USA","EUR","FIJ"]})
In [17]: df.groupby("cohort").boxplot(column="density",by="gene_id")
In [18]: plt.show()
This produces the following image:
This is exactly what I want, except instead of making three subplots, I want all the plots to be in one plot (with different colors for USA, EUR, and FIJ). I've tried
In [17]: df.groupby("cohort").boxplot(column="density",subplots=False,by="gene_id")
but it produces the error
KeyError: 'gene_id'
I think the problem has something to do with the fact that by="gene_id" is a keyword sent to the matplotlib boxplot method. If someone has a better way of producing the plot I am after, perhaps by using DataFrame.boxplot(?) instead, please respond here. Thanks so much!
To use the pure pandas functions, I think you should not GroupBy before calling boxplot, but instead, request to group by certain columns in the call to boxplot on the DataFrame itself:
df.boxplot(column='density',by=['gene_id','cohort'])
To get a better-looking result, you might want to consider using the Seaborn library. It is designed to help with precisely this sort of task:
sns.boxplot(data=df,x='gene_id',y='density',hue='cohort')
EDIT to take into account comment below
If you want to have each of your cohort boxplots stacked/superimposed for each gene_id, it's a bit more complicated (plus you might end up with quite an ugly output). You cannot do this using Seaborn, AFAIK, but you could with pandas directly, by using the positions= parameter to boxplot (see doc). The catch is to generate the correct sequence of positions to place the boxplots where you want them, and you'll have to fix the tick labels and the legend yourself.
pos = [i for i in range(len(df.gene_id.unique())) for _ in range(len(df.cohort.unique()))]
df.boxplot(column='density',by=['gene_id','cohort'],positions=pos)
An alternative would be to use seaborn.swarmplot instead of using boxplot. A swarmplot plots every point instead of the summary representation of boxplots, but you can use the parameter dodge=False (called split in older seaborn versions) to get the points colored by cohort but stacked on top of each other for each gene_id.
sns.swarmplot(data=df,x='gene_id',y='density',hue='cohort', dodge=False)
Without knowing the actual content of your dataframe (number of points per gene and per cohort, and how separate they are in each cohort), it's hard to say which solution would be the most appropriate.
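The pandas-only suggestion (grouping inside boxplot via by= rather than calling groupby first) can be tried with the question's own data. This is a sketch under the assumption of a non-interactive backend; passing ax= and the figure size are illustrative additions. Each gene_id/cohort pair holds a single observation in this sample, so the boxes collapse to lines.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for the sketch
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"gene_length": [100, 100, 100, 200, 200, 200, 300, 300, 300],
                   "gene_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "density": [0.4, 1.1, 1.2, 1.9, 2.0, 2.5, 2.2, 3.0, 3.3],
                   "cohort": ["USA", "EUR", "FIJ", "USA", "EUR", "FIJ",
                              "USA", "EUR", "FIJ"]})

fig, ax = plt.subplots(figsize=(10, 5))

# Group by both columns in the boxplot call itself: one box per
# (gene_id, cohort) pair, all drawn on the same axes.
df.boxplot(column='density', by=['gene_id', 'cohort'], ax=ax)
ax.set_ylabel('density')
```

Because the grouping happens inside a single boxplot call, the by columns are still present in the frame, which is what the groupby version with subplots=False was missing.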

How to give a distinctive colour to each variable in matplotlib and pandas?

I am creating a time series graph with matplotlib. The code I'm using to produce the plot is:
plt.figure(); data_actuals_month.plot(figsize=(10, 7)); plt.legend(loc='best')
The problem is that many variables get the same colour, making it impossible to interpret the graph.
How can I change the colour scheme and give each variable its own colour? Thanks
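You can specify the colors used before plotting by giving a list of colors to matplotlib's rc settings (the old color_cycle setting has since been replaced by prop_cycle, which needs from cycler import cycler):
matplotlib.rc('axes', prop_cycle=cycler(color=['r', 'g', 'b', 'y']))
Or to ax :
ax1.set_prop_cycle(color=['c', 'm', 'y', 'k'])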
See : http://matplotlib.org/examples/color/color_cycle_demo.html
I don't know how pandas uses matplotlib, but I see no reason this shouldn't work. You might also want to try to use different dashstyle/marker/linewidth (http://matplotlib.org/examples/lines_bars_and_markers/index.html), although I'm not sure there is an equivalent of color_cycle for those.
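As a pandas-level alternative to the rc-based cycle described above, DataFrame.plot accepts a colormap argument. The sketch below uses that instead; the stand-in data (12 monthly series named var0 to var11) and the choice of the 'tab20' colormap are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for the sketch
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_actuals_month: 12 monthly time series.
rng = np.random.default_rng(0)
data_actuals_month = pd.DataFrame(
    rng.normal(size=(24, 12)).cumsum(axis=0),
    index=pd.date_range('2020-01-01', periods=24, freq='MS'),
    columns=[f'var{i}' for i in range(12)])

# 'tab20' has 20 distinct colours, more than the 12 series plotted here.
ax = data_actuals_month.plot(figsize=(10, 7), colormap='tab20')
ax.legend(loc='best')

# Collect the colour actually assigned to each line, as hex strings.
line_colors = [to_hex(line.get_color()) for line in ax.get_lines()]
```

With more series than the colormap has colours, the colours would start to repeat, at which point varying line style or markers as suggested above becomes the more practical option.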