Histogram as stacked bar chart based on categories - pandas

I have data with a numeric and categorical variable. I want to show a histogram of the numeric column, where each bar is stacked by the categorical variable. I tried to do this with ax.hist(data, histtype='bar', stacked=True), but couldn't quite get it to work.
If my data is
df = pd.DataFrame({'age': np.random.normal(45, 5, 100),
                   'job': np.random.choice(['engineer', 'barista', 'quantity surveyor'], size=100)})
I've organised it like this:
df['binned_age'] = pd.qcut(df.age, 5)
df.groupby('binned_age')['job'].value_counts().plot(kind='bar')
That gives me a bar chart divided the way I want, but the bars are side by side rather than stacked, and there are no different colours per category.
Is there a way to stack this plot? Or to just do a regular histogram, but stacked by category?

IIUC, you will need to reshape your dataset first. I will do that using pivot_table, with len as the aggregator since that gives you the frequency.
Then you can use code similar to the one you provided above.
df.drop('age', axis=1, inplace=True)
df_reshaped = df.pivot_table(index=['binned_age'], columns=['job'], aggfunc=len)
df_reshaped.plot(kind='bar', stacked=True, ylabel='Frequency', xlabel='Age binned',
                 title='Age group frequency by Job', rot=45)
This produces a stacked bar chart of age-bin frequencies split by job.
You can use the documentation to tailor the chart to your needs.

df['age_bin'] = pd.cut(df['age'], [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
df.groupby('age_bin')['job'].value_counts().unstack().plot(kind='bar', stacked=True)
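For completeness, the stacked-histogram route from the original attempt also works if you pass ax.hist one array of ages per job rather than the whole column. A minimal sketch, assuming the same df as above and matplotlib imported as plt:
import matplotlib.pyplot as plt

# one array of ages per job category, plus matching labels for the legend
groups = [grp['age'].values for _, grp in df.groupby('job')]
labels = [name for name, _ in df.groupby('job')]

fig, ax = plt.subplots()
ax.hist(groups, bins=10, histtype='bar', stacked=True, label=labels)
ax.set_xlabel('age')
ax.set_ylabel('Frequency')
ax.legend()
plt.show()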

Related

Matplotlib plot with x-axis as binned data and y-axis as the mean value of various variables in the bin?

My apologies if this is rather basic; I can't seem to find a good answer yet because everything refers only to histograms. I have circular data, with a degrees value as the index. I am using pd.cut() to create bins of a few degrees in order to summarize the dataset. Then, I use df.groupby() and .mean() to calculate mean values of all columns for the respective bins.
Now - I would like to plot this, with the bins on the x-axis, and lines for the columns.
I tried to iterate over the columns, adding them as:
for i in df.columns:
    ax.plot(df.index, df[i])
However, this gives me the error: "float() argument must be a string or number, not 'pandas._libs.interval.Interval'".
Therefore, I assume it wants the x-axis values to be numbers or strings and not intervals. Is there a way I can make this work?
To get the dataframe containing the mean values of each variable with respect to bins, I used:
bins = np.arange(0, 360, 5)
df = df.groupby(pd.cut(df['Dir'], bins)).mean()
Here is what df looks like at the point of plotting. Each column (0, 1, 2, etc.) holds the mean values of one variable per bin, which I would like plotted on the y-axis, and "Dir" is the index holding the bins.
0 1 2 3 4 5
Dir
(0, 5] 37.444135 2922.848675 3244.325904 4203.001446 36.262371 37.493497
(5, 10] 42.599494 3248.194328 3582.355759 4061.098517 36.351476 37.148341
(10, 15] 47.277694 2374.379517 2709.435714 2932.064076 36.537377 36.878293
(15, 20] 52.345712 2626.774240 2659.391040 3087.324800 36.114965 36.603918
(20, 25] 57.318976 2207.845000 2228.002353 2811.066176 36.279392 37.165979
(25, 30] 62.454386 2436.117405 2839.255696 3329.441772 36.762896 37.861577
(30, 35] 67.705955 3138.968411 3462.831977 4007.180620 36.462313 37.560977
(35, 40] 72.554786 2554.552620 2548.955581 3079.570159 36.256386 36.819579
(40, 45] 77.501479 2862.703066 2965.408491 2857.901887 36.170788 36.140976
(45, 50] 82.386679 2973.858188 2539.348967 2000.606359 36.067776 37.210645
We have multiple options; for example, we can obtain the middle of each bin as shown below. You can also access the left and right side of the bins, as described here. Let me know if you need any further help.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(data={'x': np.random.uniform(low=0, high=10, size=10),
                        'y': np.random.exponential(size=10)})
bins = range(0, 360, 5)
df['bin'] = pd.cut(df['x'], bins)
agg_df = df.groupby(by='bin').mean()
# this is the important step: recover the IntervalIndex from the categorical bins and take its midpoints
mids = pd.IntervalIndex(agg_df.index.get_level_values('bin')).mid
# to apply for plots, use the midpoints as the x values:
for col in agg_df.columns:
    plt.plot(mids, agg_df[col])
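For reference, the left and right edges mentioned above are exposed the same way; a short illustration reusing agg_df from the snippet:
idx = pd.IntervalIndex(agg_df.index.get_level_values('bin'))
lefts = idx.left    # left edge of each bin
rights = idx.right  # right edge of each bin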

Color problems with pandas matplotlib - graph colors are inconsistent

I have graphs with just 3 colors - Green, Red, Grey for values A, B, C. The application uses group by and value counts to get the cumulative count of A, B, and C across months, and shows a donut chart, a barh, and a bar chart. The colors shift from graph to graph: on one, A is green, while another graph with the same data shows A as red.
Simple fix, right?
def color_for_label(label):
    xlate = {'A': 'green',
             'B': 'red',
             'C': 'grey',
             }
    return [xlate[x] for x in label]

chart = (gb.unstack(level=-1)
           .plot.barh(color=color_for_label(gb.index[0:2].names),
                      width=.50,
                      stacked=True,
                      legend=None))
The data comes back with a plain index sometimes and a MultiIndex other times; the call chokes on some of those shapes but works on others.
I want the colors to be constant: Red/Green/Grey always going with the values A/B/C.
I've tried checking datatypes and try/except structures, but both got too complex quickly. Anyone got a simple solution to share?
Let's use the data from this example: pandas pivot table to stacked bar chart
df.assign(count=1).groupby(['battle_type']).count().plot.barh(stacked=True)
and (the latter preferred; I'm not loving the groupby inconsistencies)
df.pivot_table(index='battle_type', columns='attacker_outcome', aggfunc='size').plot.barh(stacked=True)
Both get me the same stacked bar chart.
I have a 3rd value, "Tie", in my example of A, B, C above, but let's ignore that for the moment.
I want to make sure that win is always green, lose is red, Tie is grey.
So I have my simple function:
def color_for_label(label):
    xlate = {'win': 'green',
             'lose': 'red',
             'Tie': 'grey',
             }
    return xlate[label]
So I add
....plot.barh(stacked=True, color=color_for_label(label))
And here I'm stuck - what do I set label to so that win is always green, lose is red and tie is grey?
Got it!
First, translate colors for the new example
def color_for_label(label):
    xlate = {'win': 'green',
             'loss': 'red',
             'tie': 'grey',
             }
    return [xlate[x] for x in label]
Then break it into two lines.
# create a dataframe
gb = df.pivot_table(index='battle_type', columns='attacker_outcome', aggfunc='size')
# pass the dataframe column values
gb.plot.barh(stacked=True, color=color_for_label(gb.columns.values))
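Since the mapping is keyed on the outcome labels themselves, the same helper should keep the colours consistent across the other chart types mentioned in the question. A sketch, not tested against your exact data (the wedgeprops donut styling is just one option):
colors = color_for_label(gb.columns.values)

# horizontal and vertical stacked bars with the same colour order
gb.plot.barh(stacked=True, color=colors)
gb.plot.bar(stacked=True, color=colors)

# donut chart of the totals, keyed on the same labels
totals = gb.sum()
totals.plot.pie(colors=color_for_label(totals.index), wedgeprops={'width': 0.4})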

How to plot outliers with regard to unique ids

I have item_code column in my data and another column, sales, which represents sales quantity for the particular item.
The data can have a particular item id many times; other columns tell these entries apart.
I want to plot only the outlier sales for each item (because data has thousands of different item ids, plotting every entry can be difficult).
Since I'm very new to this, what is the right way and tool to do this?
You can use pandas. You should choose a method to detect outliers, but here is an example:
If you want to get outliers for all sales (not per group), you can use apply with a function (for example, a lambda) to get the outlier rows.
import numpy as np
import pandas as pd
%matplotlib inline

df = pd.DataFrame({'item_id': [1, 1, 2, 1, 2, 1, 2],
                   'sales': [0, 2, 30, 3, 30, 30, 55]})
# keep only the rows whose sales lie more than one standard deviation from the mean
df[df.apply(lambda x: np.abs(x.sales - df.sales.mean()) / df.sales.std() > 1, 1)
   ].set_index('item_id').plot(style='.', color='red')
In this example we generate a data sample and find the points that lie more than one standard deviation from the mean (you can try another method). Then we just plot them, where y is the sales count and x is the item id. This method detects the points 0 and 55. If you want to search for outliers within groups, you can group the data first.
df.groupby('item_id').apply(lambda data: data.loc[
    data.apply(lambda x: np.abs(x.sales - data.sales.mean()) / data.sales.std() > 1, 1)
]).set_index('item_id').plot(style='.', color='red')
In this example we get the points 30 and 55, because 0 isn't an outlier within the group where item_id = 1, but 30 is.
Is this what you want to do? I hope it helps you get started.
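If the nested apply feels awkward, an equivalent way to write the grouped version is with groupby().transform(), which computes each row's group mean and std and keeps item_id available for the x-axis. A sketch reusing the same toy df and the same one-standard-deviation rule:
group_mean = df.groupby('item_id')['sales'].transform('mean')
group_std = df.groupby('item_id')['sales'].transform('std')

# rows whose sales lie more than one standard deviation from their own group's mean
outliers = df[np.abs(df['sales'] - group_mean) / group_std > 1]
outliers.set_index('item_id').plot(style='.', color='red')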

Pandas: update information based on bins / cutting

I'm working on a dataset which has a large amount of missing information.
I understand I could use fillna, but I'd like to base my updates on the binned values of another column.
Selection of missing data:
missing = train[train['field'].isnull()]
Bin the data (this works correctly):
filter_values = [0, 42, 63, 96, 118, 160]
labels = [1,2,3,4,5]
out = pd.cut(missing['field2'], bins = filter_values, labels=labels)
counts = pd.value_counts(out)
print(counts)
Now, based on the bin assignments, I would like to write the correct bin label into train['field'] for the missing rows assigned to each bin.
IIUC:
You just need to fillna
train['field'] = train['field'].fillna(out)
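A self-contained sketch of that pattern, with a hypothetical toy frame standing in for train (the cast to float is only there so the categorical bin labels fill a numeric column cleanly):
import numpy as np
import pandas as pd

train = pd.DataFrame({'field': [10.0, np.nan, 30.0, np.nan],
                      'field2': [50, 70, 100, 130]})

filter_values = [0, 42, 63, 96, 118, 160]
labels = [1, 2, 3, 4, 5]

missing = train[train['field'].isnull()]
out = pd.cut(missing['field2'], bins=filter_values, labels=labels)

# fillna aligns on the index, so only the originally missing rows pick up a bin label
train['field'] = train['field'].fillna(out.astype(float))
print(train)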

Pandas scatterplot categorical and timeseries axes

I'm looking to create a chart much like nltk's lexical dispersion plot, but am drawing a blank on how to construct it. I was thinking that scatter would be my best geom, using '|' as markers and setting the alpha, but I am running into all sorts of issues setting the parameters. An example of this is below:
I have the dataframe arranged with a datetime index, freq='D', over a 5 year period, and each column represents the count of a particular word used that date.
For example:
tst = pd.DataFrame(index=pd.date_range(datetime.datetime(2010, 1, 1), end=datetime.datetime(2010, 2, 1), freq='D'),
                   data=[[randint(0, 5), randint(0, 1), randint(0, 2)] for x in range(32)])
Currently I'm trying something akin to the following:
plt.figure()
tst.plot(kind='scatter', x=tst.index, y=tst.columns, marker='|', color=sns.xkcd_rgb['dodger blue'], alpha=.05, legend=False)
yticks = plt.yticks()[0]
plt.yticks(yticks, top_words)
The above code yields a KeyError:
KeyError: "['2009-12-31T19:00:00.000000000-0500' '2010-01-01T19:00:00.000000000-0500'\n '2010-01-02T19:00:00.000000000-0500' '2010-01-03T19:00:00.000000000-0500'\n '2010-01-04T19:00:00.000000000-0500' '2010-01-05T19:00:00.000000000-0500'\n '2010-01-06T19:00:00.000000000-0500' '2010-01-07T19:00:00.000000000-0500'\n '2010-01-08T19:00:00.000000000-0500' '2010-01-09T19:00:00.000000000-0500'\n '2010-01-10T19:00:00.000000000-0500' '2010-01-11T19:00:00.000000000-0500'\n '2010-01-12T19:00:00.000000000-0500' '2010-01-13T19:00:00.000000000-0500'\n '2010-01-14T19:00:00.000000000-0500' '2010-01-15T19:00:00.000000000-0500'\n '2010-01-16T19:00:00.000000000-0500' '2010-01-17T19:00:00.000000000-0500'\n '2010-01-18T19:00:00.000000000-0500' '2010-01-19T19:00:00.000000000-0500'\n '2010-01-20T19:00:00.000000000-0500' '2010-01-21T19:00:00.000000000-0500'\n '2010-01-22T19:00:00.000000000-0500' '2010-01-23T19:00:00.000000000-0500'\n '2010-01-24T19:00:00.000000000-0500' '2010-01-25T19:00:00.000000000-0500'\n '2010-01-26T19:00:00.000000000-0500' '2010-01-27T19:00:00.000000000-0500'\n '2010-01-28T19:00:00.000000000-0500' '2010-01-29T19:00:00.000000000-0500'\n '2010-01-30T19:00:00.000000000-0500' '2010-01-31T19:00:00.000000000-0500'] not in index"
Any help would be appreciated.
With help, I was able to produce the following:
plt.plot(tst.index, tst, marker='|', color=sns.xkcd_rgb['dodger blue'], alpha=.25, ms=.5, lw=.5)
plt.ylim([-1, 20])
plt.yticks(range(20), top_words)
Unfortunately, it appears the upper bars only show up when there is a corresponding bar beneath them to build on top of. That's not how my data looks.
I am not sure you can do this with the .plot method. However, it is easy to do it directly in matplotlib:
plt.plot(tst.index, tst, marker='|', lw=0, ms=10)
plt.ylim([-0.5, 5.5])
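To get the dispersion-plot look the question describes, with one horizontal row per word and a tick wherever that word occurred, one option is to plot each column separately at its own y position. A sketch with hypothetical toy data, since I don't have the real word counts:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# toy data: one column per word, daily occurrence counts
rng = pd.date_range('2010-01-01', '2010-02-01', freq='D')
tst = pd.DataFrame(np.random.randint(0, 3, size=(len(rng), 3)),
                   index=rng, columns=['alpha', 'beta', 'gamma'])

fig, ax = plt.subplots()
for y, word in enumerate(tst.columns):
    dates = tst.index[tst[word] > 0]            # days on which the word occurred
    ax.plot(dates, [y] * len(dates), marker='|', lw=0, ms=10)
ax.set_yticks(range(len(tst.columns)))
ax.set_yticklabels(tst.columns)
ax.set_ylim(-1, len(tst.columns))
plt.show()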
If you can install seaborn, try stripplot():
import seaborn as sns
sns.stripplot(data=tst, orient='h', marker='|', edgecolor='blue');
Note that I changed your data to make it look a bit more interesting:
tst = pd.DataFrame(index=pd.date_range(datetime.datetime(2010, 1, 1), end=datetime.datetime(2010, 2, 1), freq='D'),
                   data=(150000 * np.random.rand(32, 3)).astype('int'))
More information on seaborn:
http://stanford.edu/~mwaskom/software/seaborn/tutorial/categorical.html