Matplotlib plot bar chart with 2 columns relationship in dataframes - pandas

I'm stuck plotting a dataframe. This might be simple, but I can't figure it out!
I have a pandas dataframe with records like this:
Year occurrence Count
0 2011 0 306
1 2011 1 1838
2 2012 0 422
3 2012 1 1816
4 2013 0 423
5 2013 1 3471
6 2014 0 537
7 2014 1 3239
8 2015 0 993
9 2015 1 7668
10 2016 0 415
11 2016 1 2052
12 2017 0 511
13 2017 1 4750
14 2018 0 705
15 2018 1 2125
I want to plot this dataframe as a bar chart with Year on the x-axis and Count on the y-axis, split by occurrence value: for 2011 one bar has count=306 and a second bar has count=1838, and likewise for the remaining years.
If possible, I'd also like to display a stacked bar chart of the same data.
And how can I plot a line chart with two lines in it?
Does anyone have a workaround for this?
I have created a sample df based on my data (a list of tuples, since a Python set would not preserve row order):
df = spark.createDataFrame([
(2011, 0, 306),
(2011, 1, 1838),
(2012, 0, 422),
(2012, 1, 1816),
(2013, 0, 423),
(2013, 1, 3471),
(2014, 0, 537),
(2014, 1, 3239),
(2015, 0, 993),
(2015, 1, 7668),
(2016, 0, 415),
(2016, 1, 2052),
(2017, 0, 511),
(2017, 1, 4750),
(2018, 0, 705),
(2018, 1, 2125),
], ['Year', 'occurrence', 'Count'])
pdf_1 = df.toPandas()
I have tried with this:
pdf_1.plot(x='Year', y=['Count'], kind='bar')
but it does not give me exactly what I want.

You can use pivot to reshape:
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.bar(stacked=True)
Output: (stacked bar chart image omitted)

As per @BigBen's answer, I have figured out all three (using the occurrence column from the sample df):
# Question 1: grouped bars
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.bar()
# Question 2: stacked bars
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.bar(stacked=True)
# Question 3: two lines
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.line()
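For readers without a Spark session, the whole flow can be reproduced in pure pandas; this is a sketch (skipping the Spark step, since toPandas() yields the same frame) covering all three plots:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless

# Same sample data as the question, built directly in pandas
pdf_1 = pd.DataFrame(
    [(2011, 0, 306), (2011, 1, 1838), (2012, 0, 422), (2012, 1, 1816),
     (2013, 0, 423), (2013, 1, 3471), (2014, 0, 537), (2014, 1, 3239),
     (2015, 0, 993), (2015, 1, 7668), (2016, 0, 415), (2016, 1, 2052),
     (2017, 0, 511), (2017, 1, 4750), (2018, 0, 705), (2018, 1, 2125)],
    columns=['Year', 'occurrence', 'Count'])

# Reshape: Year becomes the index, each occurrence value becomes a column
wide = pdf_1.pivot(index='Year', columns='occurrence', values='Count')

wide.plot.bar()               # Q1: two bars per year, one per occurrence value
wide.plot.bar(stacked=True)   # Q2: stacked bars
wide.plot.line()              # Q3: two lines, one per occurrence value
```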


how to perform an operation similar to group by on the first index of a multi indexed dataframe

The code to generate a sample dataframe is as follows
fruits=pd.DataFrame()
fruits['month']=['jan','feb','feb','march','jan','april','april','june','march','march','june','april']
fruits['fruit']=['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry']
fruits['price']=[30,20,40,25,30 ,45,60,45,25,55,37,60]
ind=(fruits.index)
fruits_grp = fruits.set_index(['month', ind],drop=False)
The output dataframe should look something like this:
fruits_new1=pd.DataFrame()
fruits_new1['month']=['jan','jan','feb','feb','march','march','march','apr','apr','apr','jun','jun']
fruits_new1['fruit']=['apple','apple','orange','pear','orange','orange','cherry','pear','cherry','cherry','pear','apple']
fruits_new1['price']=[30,30,20,40,25,25,55,45,60,60,45,37]
ind1=fruits_new1.index
fruits_grp1 = fruits_new1.set_index(['month', ind1],drop=False)
fruits_grp1
Thank you
use:
d={'Jan': 0, 'Feb': 1, 'Mar': 2, 'Apr': 3, 'May': 4, 'Jun': 5, 'Jul': 6, 'Aug': 7, 'Sep': 8, 'Oct': 9, 'Nov': 10, 'Dec': 11}
idx=fruits_grp['month'].str.title().str[:3].map(d).sort_values().index
fruits_grp=fruits_grp.reindex(idx)
fruits_grp['s']=list(range(len(fruits_grp)))
fruits_grp=fruits_grp.set_index('s',append=True).droplevel(1).rename_axis(index=['month',None])
Update:
sample dataframe:
fruits=pd.DataFrame()
fruits['month']=[1,2,2,3,1,4,4,6,3,3,6,4]
fruits['fruit']=['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry']
fruits['price']=[30,20,40,25,30 ,45,60,45,25,55,37,60]
ind=(fruits.index)
fruits_grp = fruits.set_index(['month', ind],drop=False)
Then just simply use:
idx=fruits_grp['month'].sort_values().index
fruits_grp=fruits_grp.reindex(idx)
fruits_grp['s']=list(range(len(fruits_grp)))
fruits_grp=fruits_grp.set_index('s',append=True).droplevel(1).rename_axis(index=['month',None])
fruits = pd.DataFrame()
fruits['month'] = ['jan','feb','feb','mar','jan','apr','apr','jun','mar','mar','jun','apr']
fruits['fruit'] = ['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry']
fruits['price'] = [30,20,40,25,30,45,60,45,25,55,37,60]
fruits["month"] = fruits["month"].str.capitalize()
fruits["month"] = pd.to_datetime(fruits.month, format='%b',
errors='coerce').dt.month
fruits = fruits.sort_values(by="month")
fruits["month"] = pd.to_datetime(fruits['month'], format='%m').dt.strftime('%b')
ind1 = fruits.index
fruits_grp1 = fruits.set_index(['month', ind1],drop=False)
fruits_grp1
print(fruits_grp1)
Thank you so much for all the answers. I've figured out that the sort_values() can make this happen.
The reproducible code for the same is as follows:
fruit_grp_srt=fruits_grp.sort_values(by='month')
But this sorts the rows alphabetically rather than in the original order of the first index.
Still looking for a better solution. Thank you!
To me, this looks like a simple sort by month. First, you need to eliminate the month column (there is already a month level in the index), followed by a reset_index:
del fruits_grp['month']
df = fruits_grp.reset_index()
Then, it is important to set the months as an ordered categorical datatype, and define the custom order.
df.month = df.month.astype('category')
df.month = df.month.cat.reorder_categories(['jan', 'feb', 'march', 'april', 'june'])
Now it is simply a matter of sorting by month:
df.sort_values(by='month')
Output
month level_1 fruit price
0 jan 0 apple 30
4 jan 4 apple 30
1 feb 1 orange 20
2 feb 2 pear 40
3 march 3 orange 25
8 march 8 orange 25
9 march 9 cherry 55
5 april 5 pear 45
6 april 6 cherry 60
11 april 11 cherry 60
7 june 7 pear 45
10 june 10 apple 37
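The reorder step above can also be written in one go with pd.Categorical; this is an equivalent sketch of the same idea, using the question's lowercase month spellings as the assumed category order:

```python
import pandas as pd

fruits = pd.DataFrame({
    'month': ['jan','feb','feb','march','jan','april','april','june','march','march','june','april'],
    'fruit': ['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry'],
    'price': [30,20,40,25,30,45,60,45,25,55,37,60]})

# Declare month as an ordered categorical so sorting follows calendar order,
# not alphabetical order; a stable sort keeps the original row order within a month
order = ['jan', 'feb', 'march', 'april', 'june']
fruits['month'] = pd.Categorical(fruits['month'], categories=order, ordered=True)
out = fruits.sort_values('month', kind='stable')
```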

Calculate number of cases per year from pandas data frame

I have a data frame in the following format:
import pandas as pd
d = {'case_id': [1, 2, 3], 'begin': [2002, 1996, 2001], 'end': [2019, 2001, 2002]}
df = pd.DataFrame(data=d)
with about 1,000 cases.
I need to calculate how many cases are in force by year. This information can be derived from the 'begin' and 'end' columns.
For example, case 2 was in force between the years 1996 and 2001.
The resulting data frame should look as follows:
e = {'year': [1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
'cases': [1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df_ = pd.DataFrame(data=e)
Any idea how I can do this in a few lines for 1,000 cases?
Assign a new column holding each case's year range, then explode:
df['new'] = [range(x, y + 1) for x, y in zip(df.begin, df.end)]
df = df.explode('new')
Then do groupby + nunique:
out = df.groupby(['new']).case_id.nunique().reset_index()
Out[257]:
new case_id
0 1996 1
1 1997 1
2 1998 1
3 1999 1
4 2000 1
5 2001 2
6 2002 2
7 2003 1
8 2004 1
9 2005 1
10 2006 1
11 2007 1
12 2008 1
13 2009 1
14 2010 1
15 2011 1
16 2012 1
17 2013 1
18 2014 1
19 2015 1
20 2016 1
21 2017 1
22 2018 1
23 2019 1
Here is another way (note this needs import numpy as np):
(df.assign(year=df.apply(lambda x: np.arange(x['begin'], x['end'] + 1), axis=1))
   .explode('year')
   .groupby('year')['case_id'].count()
   .reset_index())
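Putting it together as a runnable check against the expected output in the question, using the range/explode approach from the first answer:

```python
import pandas as pd

d = {'case_id': [1, 2, 3], 'begin': [2002, 1996, 2001], 'end': [2019, 2001, 2002]}
df = pd.DataFrame(data=d)

# One row per (case, year) pair: expand each case to the years it covers
df['year'] = [range(b, e + 1) for b, e in zip(df['begin'], df['end'])]
cases = (df.explode('year')
           .groupby('year')['case_id'].nunique()
           .reset_index(name='cases'))
```

For 2001 this counts case 2 (1996-2001) and case 3 (2001-2002), giving 2, matching the expected frame.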

flatten a multi index in pandas [duplicate]

I need to set an index on my rows, but when I do, pandas automatically makes my column index hierarchical. I have tried every flattening method I could find, but once I reset_index, my row index is replaced with integers. And if I use df.columns = [my column index names], it doesn't flatten the column index at all.
I'll use the pandas official docs as an example:
df = pd.DataFrame({'month': [1, 4, 7, 10],
'year': [2012, 2014, 2013, 2014],
'sale': [55, 40, 84, 31]})
df.set_index('month')
and I get
year sale
month
1 2012 55
4 2014 40
7 2013 84
10 2014 31
Then I flatten the index by
df.reset_index()
Then it becomes
index month year sale
0 0 1 2012 55
1 1 4 2014 40
2 2 7 2013 84
3 3 10 2014 31
(The month row index disappeared...)
This is really killing me, so I'd appreciate it if someone could help make the dataframe look something like:
month year sale
1 2012 55
4 2014 40
7 2013 84
10 2014 31
Thanks!
You only need to
df.reset_index(drop=True)
which returns
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31
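The difference between the two calls is easy to see side by side; a minimal sketch, assuming the set_index result was actually assigned back:

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]}).set_index('month')

kept = df.reset_index()              # 'month' comes back as a regular column
dropped = df.reset_index(drop=True)  # 'month' is discarded entirely
```

If set_index was never assigned back (as in the question's snippet), the frame still has its original RangeIndex, which is why reset_index(drop=True) leaves month untouched as a column.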

Standard deviation with groupby(multiple columns) Pandas

I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
...
df = pd.concat([pd.read_csv(file, header=0) for file in f])  # merge all files into one dataframe
df.dropna(axis=0, how="all", subset=['start_hour', 'variable'],
          inplace=True)  # drop trailing rows with no data (all NaN) in these columns
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df = df.rename(columns={'value':'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
from pandas.tseries.offsets import MonthEnd

m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt has left me with the same error in the dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month too, which I am trying to group the data by. Here's what my resulting dataframe sd looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me!
[1]: https://i.stack.imgur.com/ueVrG.png
I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
[2014, 1, 1, 4.0],
[2014, 1, 2, 6.0],
[2014, 1, 2, 8.0],
[2014, 2, 1, 2.0],
[2014, 2, 1, 6.0],
[2014, 2, 2, 10.0],
[2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')
Outputs:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854
I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31
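A sketch of an alternative that avoids the concat/rename dance entirely: named aggregation computes the mean and standard deviation in one pass, so the grouping keys are never aggregated (illustrated with made-up daily averages in the same shape as d_avg):

```python
import pandas as pd

# Hypothetical daily averages standing in for d_avg
d_avg = pd.DataFrame({'year': [2014] * 4,
                      'month': [1, 1, 2, 2],
                      'day': [1, 2, 1, 2],
                      'conc': [3.0, 7.0, 4.0, 12.0]})

# One pass: year/month stay as group keys, only 'conc' is aggregated
m_stats = (d_avg.groupby(['year', 'month'])['conc']
                .agg(conc='mean', sd='std')
                .reset_index())
```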

Select Rows Where MultiIndex Is In Another DataFrame

I have one DataFrame (DF1) with a MultiIndex and many additional columns. In another DataFrame (DF2) I have 2 columns containing a set of values from the MultiIndex. I would like to select the rows from DF1 where the MultiIndex matches the values in DF2.
df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
'year': [2012, 2012, 2014, 2013, 2014],
'sale':[55, 17, 40, 84, 31]})
df1 = df1.set_index(['year','month'])
sale
year month
2012 1 55
2012 3 17
2014 4 40
2013 7 84
2014 10 31
df2 = pd.DataFrame({'year': [2012,2014],
'month': [1, 10]})
year month
0 2012 1
1 2014 10
I'd like to create a new DataFrame that would be:
sale
year month
2012 1 55
2014 10 31
I've tried many variations using .isin, .loc, slicing, but keep running into errors.
You could just set_index on df2 the same way and pass the index:
In[110]:
df1.loc[df2.set_index(['year','month']).index]
Out[110]:
sale
year month
2012 1 55
2014 10 31
more readable version:
In[111]:
idx = df2.set_index(['year','month']).index
df1.loc[idx]
Out[111]:
sale
year month
2012 1 55
2014 10 31
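An equivalent sketch using pd.MultiIndex.from_frame, which builds the lookup index straight from df2's columns; the column order must match df1's index level order:

```python
import pandas as pd

df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
                    'year': [2012, 2012, 2014, 2013, 2014],
                    'sale': [55, 17, 40, 84, 31]}).set_index(['year', 'month'])

df2 = pd.DataFrame({'year': [2012, 2014], 'month': [1, 10]})

# Build a MultiIndex from df2's columns (in df1's level order) and select with .loc
idx = pd.MultiIndex.from_frame(df2[['year', 'month']])
result = df1.loc[idx]
```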