total_income_language = pd.DataFrame(df.groupby('language')['gross'].sum())
average_income_language = pd.DataFrame(df.groupby('language')['gross'].mean())
plt.bar(total_income_language.index, total_income_language["gross"],
label="Total Income of Language")
plt.bar(average_income_language.index, average_income_language["gross"],
label="Average Income of Language")
plt.xlabel("Language")
plt.ylabel("Log Dollar Values(Gross)")
I want to plot the sum and the average of gross for each language. I'm not sure my code does what I want, and I get an error when I try to plot it. I can't tell where I went wrong in the code, so I'd appreciate some help.
Here's the error message
You can use groupby with agg to compute both aggregates at once, rename the resulting columns with a dict, and plot with DataFrame.plot.bar.
Finally, set the axis labels with ax.set.
df = pd.DataFrame({'language':['en','de','en','de','sk','sk'],
'gross':[10,20,30,40,50,60]})
print (df)
gross language
0 10 en
1 20 de
2 30 en
3 40 de
4 50 sk
5 60 sk
d = {'mean':'Average Income of Language','sum':'Total Income of Language'}
df1 = df.groupby('language')['gross'].agg(['sum','mean']).rename(columns=d)
print (df1)
Total Income of Language Average Income of Language
language
de 60 30
en 40 20
sk 110 55
ax = df1.plot.bar()
ax.set(xlabel='Language', ylabel='Log Dollar Values(Gross)')
If you want to rotate the x-axis tick labels (rot=0 keeps them horizontal):
ax = df1.plot.bar(rot=0)
ax.set(xlabel='Language', ylabel='Log Dollar Values(Gross)')
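As a possible variation on the approach above, the two aggregates can also be named directly in the agg call (pandas "named aggregation"), which avoids the separate rename step. This is only a sketch on the same toy data; the column names total_income and average_income are made up for illustration.

```python
import pandas as pd

# Same toy data as in the answer above.
df = pd.DataFrame({'language': ['en', 'de', 'en', 'de', 'sk', 'sk'],
                   'gross': [10, 20, 30, 40, 50, 60]})

# Named aggregation: new_column=(source_column, function).
# 'total_income' and 'average_income' are illustrative names.
summary = df.groupby('language').agg(
    total_income=('gross', 'sum'),
    average_income=('gross', 'mean'),
)
print(summary)
```

Since the y label in the question mentions log dollar values, calling summary.plot.bar(rot=0, logy=True) would draw the same grouped bars on a logarithmic y axis, assuming that is what was intended.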
Instead of:
df.groupby('language')['gross'].sum()
Try this:
df.groupby('language').sum()
And similarly with mean(). That should get your code closer to running.
Calling the groupby() method of a DataFrame yields a groupby object upon which you then need to call an aggregation function, like sum, mean, or agg. The groupby documentation is really great: https://pandas.pydata.org/pandas-docs/stable/groupby.html
Also, you might be able to achieve your desired output in two lines:
df.groupby('language').sum().plot(kind='bar')
df.groupby('language').mean().plot(kind='bar')
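To make the difference between the two spellings concrete, here is a small sketch (toy data, not the asker's) showing what each one returns. Selecting a single column gives a Series, while selecting a list of columns, or no column at all, gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'language': ['en', 'de', 'en', 'de', 'sk', 'sk'],
                   'gross': [10, 20, 30, 40, 50, 60]})

grouped = df.groupby('language')       # a GroupBy object; nothing is computed yet
as_series = grouped['gross'].sum()     # one column selected -> Series
as_frame = grouped[['gross']].sum()    # list of columns selected -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Both forms can be plotted. If the real frame also has non-numeric columns, an unrestricted .mean() over all columns may raise in recent pandas versions, so selecting the column first is probably the safer habit.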
Related
I have a data frame with, among other things, a user id and an age. I need to produce a bar chart of the number of users that fall within ranges of ages. What's throwing me is that there is really no upper bound for the age range. The specific ranges I'm trying to plot are age <= 25, 25 < age <= 75 and age > 75.
I'm relatively new to Pandas and plotting, and I'm sure this is a simple thing for more experienced data wranglers. Any assistance would be greatly appreciated.
You can use the pandas.cut function for this; it accepts custom bins and labels!
from pandas import DataFrame, cut
from numpy.random import default_rng
from numpy import arange
from matplotlib.pyplot import show
# Make some dummy data
rng = default_rng(0)
df = DataFrame({'id': arange(100), 'age': rng.normal(50, scale=20, size=100).clip(min=0)})
print(df.head())
id age
0 0 52.514604
1 1 47.357903
2 2 62.808453
3 3 52.098002
4 4 39.286613
# Use pandas.cut to bin all of the ages & assign
# these bins to a new column to demonstrate how it works
## bins are right-closed by default: [0, 25], (25, 75], (75, inf)
## include_lowest=True keeps an age of exactly 0 in the first bin
df['bin'] = cut(df['age'], [0, 25, 75, float('inf')], labels=['25 or under', '25 up to 75', 'over 75'], include_lowest=True)
print(df.head())
id age bin
0 0 52.514604 25 up to 75
1 1 47.357903 25 up to 75
2 2 62.808453 25 up to 75
3 3 52.098002 25 up to 75
4 4 39.286613 25 up to 75
# Get the value_counts of those bins and plot!
df['bin'].value_counts().sort_index().plot.bar()
show()
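Because the question's ranges hinge on which side of each boundary is inclusive, a quick check of the edge values may help. This sketch uses a handful of hand-picked ages and illustrative label names; it relies on pandas.cut closing intervals on the right by default.

```python
import pandas as pd

# Hand-picked ages sitting on and around the bin edges.
ages = pd.Series([0, 25, 26, 75, 76, 120])
bins = [0, 25, 75, float('inf')]
labels = ['25 or under', '25 up to 75', 'over 75']  # illustrative names

# right=True (the default) gives (0, 25], (25, 75], (75, inf);
# include_lowest=True also places an age of exactly 0 in the first bin.
binned = pd.cut(ages, bins, labels=labels, include_lowest=True)
counts = binned.value_counts().sort_index()
print(counts)
```

Without include_lowest=True, an age of exactly 0 would fall outside every bin and come back as NaN. Passing right=False instead would flip the intervals to [0, 25), [25, 75), [75, inf), which does not match the ranges asked for.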
I have a sample of a large dataset as below
I would like to get the percentage of rows with a value above 30, which would give me an output as below
How would I go about achieving this with pandas? I have reached this last step of processing my data and am a bit stuck.
You can compare the values against 30 and aggregate the resulting booleans with mean, since the mean of True/False values is the fraction that are True:
df = (df.B > 30).groupby(df['A']).mean().mul(100).reset_index(name='C')
print (df)
A C
0 r 60.0
Or:
df = df.assign(C = df.B > 30).groupby('A')['C'].mean().mul(100).reset_index()
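The original sample table is not shown here, so the following sketch uses made-up values for columns A and B to show the intermediate boolean step explicitly.

```python
import pandas as pd

# Made-up data; only the column names A and B come from the question.
df = pd.DataFrame({'A': ['r', 'r', 'r', 'r', 'r', 's', 's'],
                   'B': [10, 40, 50, 20, 60, 5, 35]})

above = df['B'] > 30                      # boolean Series: True where B > 30
pct = (above.groupby(df['A'])             # group the booleans by A
            .mean()                       # fraction of True per group
            .mul(100)                     # convert to percent
            .reset_index(name='C'))
print(pct)
```

Group r has three of five values above 30 and group s has one of two, so the percentages come out at roughly 60 and 50.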
I have monthly data of 6 variables from 2014 until 2018 in one dataset.
I'm trying to draw 6 subplots (one for each variable) with monthly X axis (Jan, Feb....) and 5 series (one for each year) with their legend.
This is part of the data:
I created 5 series (one for each year) per variable (30 in total) and I'm getting the expected output but using MANY lines of code.
What is the best way to achieve this using fewer lines of code?
This is an example how I created the series:
CL2014 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2014)[0:12]
CL2015 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2015)[12:24]
This is an example of how I'm plotting the series:
axCL.plot(xvals, CL2014)
axCL.plot(xvals, CL2015)
axCL.plot(xvals, CL2016)
axCL.plot(xvals, CL2017)
axCL.plot(xvals, CL2018)
There's no need to litter your namespace with 30 variables. Seaborn makes the job very easy but you need to normalize your dataframe first. This is what "normalized" or "unpivoted" looks like (Seaborn calls this "long form"):
Date variable value
2014-01-01 Charity Lottery ...
2014-01-01 Racecourse ...
2014-04-01 Bingo Halls ...
2014-04-01 Casino ...
Your screenshot is a "pivoted" or "wide form" dataframe.
import pandas as pd
import seaborn as sns

# Unpivot: one row per (Date, variable) pair
df_plot = pd.melt(df, id_vars='Date')
df_plot['Year'] = df_plot['Date'].dt.year
df_plot['Month'] = df_plot['Date'].dt.strftime('%b')

# One panel per (Year, variable) combination
plot = sns.catplot(data=df_plot, x='Month', y='value',
                   row='Year', col='variable', kind='bar',
                   sharex=False)
plot.savefig('figure.png', dpi=300)
Result (all numbers are randomly generated):
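To make the reshaping step concrete, here is a small sketch of melt on a tiny wide-form frame. The numbers are made up and only two of the six variables are included; the column names are taken from the question.

```python
import pandas as pd

# Made-up wide-form data: one column per variable.
wide = pd.DataFrame({
    'Date': pd.to_datetime(['2014-01-01', '2014-02-01', '2015-01-01']),
    'Charity Lottery': [1.0, 2.0, 3.0],
    'Casino': [4.0, 5.0, 6.0],
})

# Long form: one row per (Date, variable) pair.
long = pd.melt(wide, id_vars='Date')
long['Year'] = long['Date'].dt.year
long['Month'] = long['Date'].dt.strftime('%b')
print(long)
```

If the goal is one panel per variable with one line per year, as the question describes, something like sns.relplot(data=long, x='Month', y='value', hue='Year', col='variable', kind='line') may be closer to the requested layout than a bar catplot.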
I would try using .groupby(); it is really powerful for paring down things like this:
for _, group in data_total.groupby([year, month])[[x_variable, y_variable]]:
    plt.plot(group[x_variable], group[y_variable])
Here groupby splits your data_total DataFrame into year/month subsets, and the [[...]] at the end pares each subset down to x_variable (assuming it is a column in data_total) and y_variable, which can be any of the features you are interested in.
I would decompose your datetime column into separate year and month columns, then use those new columns in the groupby as [year, month]. You might also be able to pass in dt.year and dt.month directly, like you had before... not sure, try it both ways!
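As a rough sketch of the second option, grouping directly on the year extracted from the datetime column does appear to work. The data below is made up (a simple counter over 60 months); only the names data_total, Date and Charity Lottery come from the question, and series_by_year is an illustrative name.

```python
import numpy as np
import pandas as pd

# Made-up monthly data for 2014 through 2018.
dates = pd.date_range('2014-01-01', '2018-12-01', freq='MS')
data_total = pd.DataFrame({'Date': dates,
                           'Charity Lottery': np.arange(len(dates), dtype=float)})

# Group on the year taken straight from the datetime column;
# each group then holds that year's 12 monthly values.
series_by_year = {
    year: group['Charity Lottery'].to_numpy()
    for year, group in data_total.groupby(data_total['Date'].dt.year)
}
print(sorted(series_by_year))
```

Each entry can then be drawn in a loop, for example ax.plot(range(1, 13), values, label=year) for every year and values pair in the dict, which replaces the five hand-written series per variable.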
I'm starting to learn about Python Pandas and want to generate a graph with the sum of arbitrary groupings of an ordinal value. It can be better explained with a simple example.
Suppose I have the following table of food consumption data:
And I have two groups of foods defined as two lists:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
Now I want to plot a graph with the evolution of consumption of junk and healthy food. I believe I must then process my data to get a DataFrame like:
Suppose the first table is already in a Dataframe called food, how do I transform it to get the second one?
I also welcome suggestions to reword my question to make it clearer, or for different approaches to generate the plot.
First create a dictionary from the lists, then swap its keys and values so that each food maps to its group.
Then group by the food column mapped through that dict together with year, aggregate with sum, and finally reshape with unstack:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
d1 = {'healthy':healthy, 'junk':junk}
##http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'brocolli': 'healthy', 'cheetos': 'junk', 'apple': 'healthy', 'coke': 'junk'}
df1 = df.groupby([df.food.map(d), 'year'])['amount'].sum().unstack(0)
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
Another solution with pivot_table:
df1 = df.pivot_table(index='year', columns=df.food.map(d), values='amount', aggfunc='sum')
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
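Since the input table is not shown, here is a self-contained sketch of the same idea from start to finish. The amounts are made up, and the names groups, food_to_group and totals are illustrative.

```python
import pandas as pd

# Made-up consumption data; column names follow the question.
food = pd.DataFrame({
    'year':   [2010] * 4 + [2011] * 4 + [2012] * 4,
    'food':   ['apple', 'brocolli', 'cheetos', 'coke'] * 3,
    'amount': [4, 6, 5, 6, 7, 10, 3, 7, 5, 8, 10, 14],
})

groups = {'healthy': ['apple', 'brocolli'], 'junk': ['cheetos', 'coke']}
# Invert the dict so that each food points to its group name.
food_to_group = {item: name for name, items in groups.items() for item in items}

totals = (food.assign(group=food['food'].map(food_to_group))
              .groupby(['year', 'group'])['amount'].sum()
              .unstack('group'))
print(totals)
```

Calling totals.plot() on the result would then draw one line per group over the years.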
I'm trying to make a scatter plot of a GroupBy() with Multiindex (http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-with-multiindex). That is, I want to plot one of the labels on the x-axis, another label on the y-axis, and the mean() as the size of each point.
df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean() returns:
Sigma_ang Epsilon_K
3.4 30 0.647000
40 0.602071
50 0.619786
3.6 30 0.646538
40 0.591833
50 0.607769
3.8 30 0.616833
40 0.590714
50 0.578364
Name: RMSD, dtype: float64
And I'd like to plot something like: plt.scatter(x=Sigma, y=Epsilon, s=RMSD)
What's the best way to do this? I'm having trouble getting the proper Sigma and Epsilon values for each RMSD value.
+1 to Vaishali Garg. Based on his comment, the following works:
df_mean = df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean().reset_index()
plt.scatter(df_mean['Sigma'], df_mean['Epsilon'], s=100.*df_mean['RMSD'])
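A possible alternative, sketched here on made-up values, is to ask groupby not to build the MultiIndex in the first place by passing as_index=False. The group keys then stay as ordinary columns and no reset_index call is needed.

```python
import pandas as pd

# Made-up values; column names follow the question.
df = pd.DataFrame({
    'Sigma':   [3.4, 3.4, 3.4, 3.6, 3.6, 3.6],
    'Epsilon': [30, 30, 40, 30, 40, 40],
    'RMSD':    [0.6, 0.8, 0.5, 0.4, 0.3, 0.5],
})

# as_index=False keeps Sigma and Epsilon as regular columns.
df_mean = df.groupby(['Sigma', 'Epsilon'], as_index=False)['RMSD'].mean()
print(df_mean)
```

The result has one row per (Sigma, Epsilon) pair, so the three columns can be passed straight to plt.scatter as x, y and a scaled s.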