Calculate number of cases per year from a pandas data frame

I have a data frame in the following format:
import pandas as pd
d = {'case_id': [1, 2, 3], 'begin': [2002, 1996, 2001], 'end': [2019, 2001, 2002]}
df = pd.DataFrame(data=d)
with about 1,000 cases.
I need to calculate how many cases are in force by year. This information can be derived from the 'begin' and 'end' columns.
For example, case 2 was in force between the years 1996 and 2001.
The resulting data frame should look as follows:
e = {'year': [1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
'cases': [1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df_ = pd.DataFrame(data=e)
Any idea how I can do this in a few lines for 1,000 cases?

Assign the new value with range, then explode:
df['new'] = [range(x, y + 1) for x, y in zip(df.begin, df.end)]
df = df.explode('new')
Then groupby + nunique counts the distinct cases per year:
out = df.groupby('new').case_id.nunique().reset_index()
Out[257]:
new case_id
0 1996 1
1 1997 1
2 1998 1
3 1999 1
4 2000 1
5 2001 2
6 2002 2
7 2003 1
8 2004 1
9 2005 1
10 2006 1
11 2007 1
12 2008 1
13 2009 1
14 2010 1
15 2011 1
16 2012 1
17 2013 1
18 2014 1
19 2015 1
20 2016 1
21 2017 1
22 2018 1
23 2019 1

Here is another way (note that this needs numpy, and that count assumes each case_id appears on a single row; use nunique to count distinct cases):
import numpy as np
df.assign(year=df.apply(lambda x: np.arange(x['begin'], x['end'] + 1), axis=1)).explode('year').groupby('year')['case_id'].count().reset_index()
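For much larger inputs, here is a minimal sweep-line sketch that avoids materializing one row per (case, year) pair; it is an alternative added for scale, assuming each case_id appears on exactly one row:
# +1 active case at begin, -1 the year after end
starts = df['begin'].value_counts()
ends = (df['end'] + 1).value_counts()
delta = starts.sub(ends, fill_value=0)
# cover every year in the overall span, then running-sum the net changes
delta = delta.reindex(range(df['begin'].min(), df['end'].max() + 1), fill_value=0)
out = delta.sort_index().cumsum().astype(int).rename('cases').rename_axis('year').reset_index()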

Related

Matplotlib plot bar chart with 2 columns relationship in dataframes

I am stuck plotting a dataframe.
This might be simple, but I can't figure it out!
I have pandas dataframe records like this:
Year occurrence Count
0 2011 0 306
1 2011 1 1838
2 2012 0 422
3 2012 1 1816
4 2013 0 423
5 2013 1 3471
6 2014 0 537
7 2014 1 3239
8 2015 0 993
9 2015 1 7668
10 2016 0 415
11 2016 1 2052
12 2017 0 511
13 2017 1 4750
14 2018 0 705
15 2018 1 2125
I want to plot this dataframe as a bar chart with Year on the x-axis and Count on the y-axis, split by the occurrence value: in 2011 one bar has count=306 and the second has count=1838, and likewise for the remaining years.
If possible, I would also like to display a stacked bar chart of the same data.
And how can I plot a line chart with two lines in it?
Does anyone have a workaround for this?
I have created sample df based on my result:
df = spark.createDataFrame([  # a list, not a set: a set would scramble the row order
(2011, 0, 306),
(2011, 1, 1838),
(2012, 0, 422),
(2012, 1, 1816),
(2013, 0, 423),
(2013, 1, 3471),
(2014, 0, 537),
(2014, 1, 3239),
(2015, 0, 993),
(2015, 1, 7668),
(2016, 0, 415),
(2016, 1, 2052),
(2017, 0, 511),
(2017, 1, 4750),
(2018, 0, 705),
(2018, 1, 2125),
], ['Year', 'occurrence', 'Count'])
pdf_1 = df.toPandas()
I have tried this:
pdf_1.plot(x='Year', y=['Count'], kind='bar')
but it does not give me exactly what I wanted.
You can use pivot to reshape (pivot's arguments are keyword-only in pandas 2.0+):
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.bar(stacked=True)
Output: (stacked bar chart image)
As per @BigBen,
I have figured out all three answers (column names match the sample df above):
# For Question 1: grouped bar chart
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.bar()
# For Question 2: stacked bar chart
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.bar(stacked=True)
# For Question 3: line chart
pdf_1.pivot(index='Year', columns='occurrence', values='Count').plot.line()
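For reference, a self-contained sketch (plain pandas and matplotlib, no Spark; the first four rows of the sample data stand in for the full set) covering all three plots:
import pandas as pd
import matplotlib.pyplot as plt
pdf_1 = pd.DataFrame([(2011, 0, 306), (2011, 1, 1838), (2012, 0, 422), (2012, 1, 1816)], columns=['Year', 'occurrence', 'Count'])
wide = pdf_1.pivot(index='Year', columns='occurrence', values='Count')
wide.plot.bar()              # grouped bars, one per occurrence value
wide.plot.bar(stacked=True)  # stacked bars
wide.plot.line()             # two lines, one per occurrence value
plt.show()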

Pandas Vlookup 2 DF Columns Different Lengths & Perform Calculation

I need to execute a vlookup-like calculation on two dataframes of different lengths that share the same column names. Suppose I have a df called df1 such as:
Y M P D
2020 11 Red 10
2020 11 Blue 9
2020 11 Green 12
2020 11 Tan 7
2020 11 White 5
2020 11 Cyan 17
and a second df called df2 such as:
Y M P D
2020 11 Blue 4
2020 11 Red 12
2020 11 White 6
2020 11 Tan 7
2020 11 Green 20
2020 11 Violet 10
2020 11 Black 7
2020 11 BlackII 3
2020 11 Cyan 14
2020 11 Copper 6
I need a new df, call it df3['Res','P'], with 2 columns showing the result of subtracting df2's values from df1's where the 'P' values match, such as:
Res P
Red -2
Blue 5
Green -8
Tan 0
White -1
Cyan 3
I have not been able to find anything with a lookup and then calculation on the web. I've tried merging df1 and df2 into one df but I do not see how to execute the calculation when the values in the "P" column match. I think that a merge of df1 and df2 is probably the first step though?
Based on the example, columns 'Y' and 'M' do not matter for the merge. If these columns are relevant, then use a list with the on parameter (e.g. on=['Y', 'M', 'P']).
Currently, only columns [['P', 'D']] are being used from df1 and df2.
The following code produces the desired output for the example, but it's difficult to say what will happen with larger dataframes or if there are repeated values in 'P'.
import pandas as pd
# setup the dataframes
df1 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11], 'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'], 'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11], 'P': ['Blue', 'Red', 'White', 'Tan', 'Green', 'Violet', 'Black', 'BlackII', 'Cyan', 'Copper'], 'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})
# merge the dataframes
df = pd.merge(df1[['P', 'D']], df2[['P', 'D']], on='P', suffixes=('_1', '_2')).rename(columns={'P': 'Res'})
# subtract the values
df['P'] = (df.D_1 - df.D_2)
# drop the unneeded columns
df = df.drop(columns=['D_1', 'D_2'])
# display(df)
Res P
0 Red -2
1 Blue 5
2 Green -8
3 Tan 0
4 White -1
5 Cyan 3
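For comparison, a sketch of the same lookup via index alignment instead of merge; it assumes the 'P' values are unique within each frame (the row order may differ from the merge result):
s1 = df1.set_index('P')['D']
s2 = df2.set_index('P')['D']
# subtraction aligns on 'P'; colors present in only one frame become NaN and are dropped
df3 = (s1 - s2).dropna().astype(int).rename('P').rename_axis('Res').reset_index()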

Standard deviation with groupby(multiple columns) Pandas

I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m3 ),0,y,Bombay Beach
...
df = pd.concat([pd.read_csv(file, header = 0) for file in f]) #merges all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
inplace = True) # drops trailing rows that have no data (NaN) in both start_hour and variable
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df = df.rename(columns={'value':'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
from pandas.tseries.offsets import MonthEnd  # needed for the MonthEnd offset
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt leaves me with the same problem in the resulting dataframe: the standard deviation seems to be computed over the year and month values too, even though those are the columns I am grouping by, so I cannot plot it. Here's what my resulting dataframe sd looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me!
[1]: https://i.stack.imgur.com/ueVrG.png
I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
[2014, 1, 1, 4.0],
[2014, 1, 2, 6.0],
[2014, 1, 2, 8.0],
[2014, 2, 1, 2.0],
[2014, 2, 1, 6.0],
[2014, 2, 2, 10.0],
[2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')
Outputs:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854
I decided to work around my issue since I couldn't figure out what was causing the problem. I concatenated the m_avg and sd dataframes and dropped the duplicated year and month columns that were causing me issues. See the code below, with lots of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31
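A sketch that sidesteps the column collision entirely: compute the mean and the standard deviation in a single groupby/agg, so year and month stay aligned by construction (the zfill is a small robustness fix for single-digit months):
from pandas.tseries.offsets import MonthEnd
# named aggregation keeps both statistics in one frame
m_stats = d_avg.groupby(['year', 'month'])['conc'].agg(conc='mean', sd='std').reset_index()
m_stats['datetime'] = pd.to_datetime(m_stats.year.astype(str) + m_stats.month.astype(str).str.zfill(2), format='%Y%m') + MonthEnd(1)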

Select Rows Where MultiIndex Is In Another DataFrame

I have one DataFrame (DF1) with a MultiIndex and many additional columns. In another DataFrame (DF2) I have 2 columns containing a set of values from the MultiIndex. I would like to select the rows from DF1 where the MultiIndex matches the values in DF2.
df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
'year': [2012, 2012, 2014, 2013, 2014],
'sale':[55, 17, 40, 84, 31]})
df1 = df1.set_index(['year','month'])
sale
year month
2012 1 55
2012 3 17
2014 4 40
2013 7 84
2014 10 31
df2 = pd.DataFrame({'year': [2012,2014],
'month': [1, 10]})
year month
0 2012 1
1 2014 10
I'd like to create a new DataFrame that would be:
sale
year month
2012 1 55
2014 10 31
I've tried many variations using .isin, .loc, slicing, but keep running into errors.
You could just set_index on df2 the same way and pass the index:
In[110]:
df1.loc[df2.set_index(['year','month']).index]
Out[110]:
sale
year month
2012 1 55
2014 10 31
more readable version:
In[111]:
idx = df2.set_index(['year','month']).index
df1.loc[idx]
Out[111]:
sale
year month
2012 1 55
2014 10 31
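An alternative sketch using isin with a MultiIndex built from df2 (pd.MultiIndex.from_frame needs pandas 0.24+); unlike .loc, this silently skips any (year, month) pair missing from df1 instead of raising a KeyError:
mask = df1.index.isin(pd.MultiIndex.from_frame(df2[['year', 'month']]))
result = df1[mask]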

Split pandas dataframe index

I have a pretty big dataframe whose columns are categories (foreign trade statistics), while the index is a string containing the country code AND the year: w2013 meaning World, year 2013; r2015 meaning Russian Federation, year 2015.
Index([u'w2011', u'c2011', u'g2011', u'i2011', u'r2011', u'w2012', u'c2012',
u'g2012', u'i2012', u'r2012', u'w2013', u'c2013', u'g2013', u'i2013',
u'r2013', u'w2014', u'c2014', u'g2014', u'i2014', u'r2014', u'w2015',
u'c2015', u'g2015', u'i2015', u'r2015'],
dtype='object')
What would be the easiest way to make a MultiIndex for plotting the various columns? I need a column plotted for each country and each year.
You can create a MultiIndex with from_tuples; to extract the letters and the years, use string indexing via .str.
import pandas as pd
li =[u'w2011', u'c2011', u'g2011', u'i2011', u'r2011', u'w2012', u'c2012',
u'g2012', u'i2012', u'r2012', u'w2013', u'c2013', u'g2013', u'i2013',
u'r2013', u'w2014', u'c2014', u'g2014', u'i2014', u'r2014', u'w2015',
u'c2015', u'g2015', u'i2015', u'r2015']
df = pd.DataFrame(range(25), index = li, columns=['a'])
print(df)
a
w2011 0
c2011 1
g2011 2
i2011 3
r2011 4
w2012 5
c2012 6
g2012 7
i2012 8
r2012 9
w2013 10
c2013 11
g2013 12
i2013 13
r2013 14
w2014 15
c2014 16
g2014 17
i2014 18
r2014 19
w2015 20
c2015 21
g2015 22
i2015 23
r2015 24
print(df.index.str[0])
Index([u'w', u'c', u'g', u'i', u'r', u'w', u'c', u'g', u'i', u'r', u'w', u'c',
u'g', u'i', u'r', u'w', u'c', u'g', u'i', u'r', u'w', u'c', u'g', u'i',
u'r'],
dtype='object')
print(df.index.str[1:])
Index([u'2011', u'2011', u'2011', u'2011', u'2011', u'2012', u'2012', u'2012',
u'2012', u'2012', u'2013', u'2013', u'2013', u'2013', u'2013', u'2014',
u'2014', u'2014', u'2014', u'2014', u'2015', u'2015', u'2015', u'2015',
u'2015'],
dtype='object')
df.index = pd.MultiIndex.from_tuples(list(zip(df.index.str[0], df.index.str[1:])))
print(df)
a
w 2011 0
c 2011 1
g 2011 2
i 2011 3
r 2011 4
w 2012 5
c 2012 6
g 2012 7
i 2012 8
r 2012 9
w 2013 10
c 2013 11
g 2013 12
i 2013 13
r 2013 14
w 2014 15
c 2014 16
g 2014 17
i 2014 18
r 2014 19
w 2015 20
c 2015 21
g 2015 22
i 2015 23
r 2015 24
If you need to convert the years to int, use astype:
df.index = pd.MultiIndex.from_tuples(list(zip(df.index.str[0], df.index.str[1:].astype(int))))
print(df.index)
MultiIndex(levels=[[u'c', u'g', u'i', u'r', u'w'], [2011, 2012, 2013, 2014, 2015]],
labels=[[4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]])
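On recent pandas, the same split can also be written in one step with from_arrays, which additionally lets you name the levels (the level names here are illustrative):
df.index = pd.MultiIndex.from_arrays([df.index.str[0], df.index.str[1:].astype(int)], names=['country', 'year'])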
If I understood well, you can:
reset your index:
df.reset_index(inplace=True)
create two other columns, one for the year and one for the country:
df["year"] = df.foo.apply(lambda x: x[1:])
df["country"] = df.foo.apply(lambda x: x[0])
assuming that the column holding your former index is called foo and that the country code is one character long. You can adapt otherwise.
Set those two columns as the index:
df.set_index(["year", "country"], inplace=True)
HTH
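Put together, a minimal runnable version of this approach; note that resetting an unnamed index produces a column called index (the foo placeholder above), so this sketch assumes that default name:
df = df.reset_index()
df['year'] = df['index'].str[1:].astype(int)
df['country'] = df['index'].str[0]
df = df.drop(columns='index').set_index(['year', 'country'])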