Select Rows Where MultiIndex Is In Another DataFrame - pandas

I have one DataFrame (DF1) with a MultiIndex and many additional columns. In another DataFrame (DF2) I have 2 columns containing a set of values from the MultiIndex. I would like to select the rows from DF1 where the MultiIndex matches the values in DF2.
df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
                    'year': [2012, 2012, 2014, 2013, 2014],
                    'sale': [55, 17, 40, 84, 31]})
df1 = df1.set_index(['year', 'month'])
            sale
year month
2012 1        55
     3        17
2014 4        40
2013 7        84
2014 10       31
df2 = pd.DataFrame({'year': [2012, 2014],
                    'month': [1, 10]})
year month
0 2012 1
1 2014 10
I'd like to create a new DataFrame that would be:
            sale
year month
2012 1        55
2014 10       31
I've tried many variations using .isin, .loc, slicing, but keep running into errors.

You could just set_index on df2 the same way and pass the resulting index to .loc:
In[110]:
df1.loc[df2.set_index(['year','month']).index]
Out[110]:
            sale
year month
2012 1        55
2014 10       31
A more readable version:
In[111]:
idx = df2.set_index(['year', 'month']).index
df1.loc[idx]
Out[111]:
            sale
year month
2012 1        55
2014 10       31
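If some (year, month) pairs in df2 are missing from df1, .loc with a list-like of labels raises a KeyError in recent pandas versions. A hedged alternative sketch using Index.isin, which simply keeps the intersection:
# Build a MultiIndex from df2's key columns; isin silently ignores
# keys that are absent from df1, unlike .loc.
keys = pd.MultiIndex.from_frame(df2[['year', 'month']])
result = df1[df1.index.isin(keys)]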


how to perform an operation similar to group by on the first index of a multi indexed dataframe

The code to generate a sample dataframe is as follows:
fruits = pd.DataFrame()
fruits['month'] = ['jan','feb','feb','march','jan','april','april','june','march','march','june','april']
fruits['fruit'] = ['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry']
fruits['price'] = [30, 20, 40, 25, 30, 45, 60, 45, 25, 55, 37, 60]
ind = fruits.index
fruits_grp = fruits.set_index(['month', ind], drop=False)
The output dataframe should look something like this:
fruits_new1 = pd.DataFrame()
fruits_new1['month'] = ['jan','jan','feb','feb','march','march','march','apr','apr','apr','jun','jun']
fruits_new1['fruit'] = ['apple','apple','orange','pear','orange','orange','cherry','pear','cherry','cherry','pear','apple']
fruits_new1['price'] = [30, 30, 20, 40, 25, 25, 55, 45, 60, 60, 45, 37]
ind1 = fruits_new1.index
fruits_grp1 = fruits_new1.set_index(['month', ind1], drop=False)
fruits_grp1
Thank you
Use a calendar-order mapping to sort the month level, then rebuild the positional level:
d = {'Jan': 0, 'Feb': 1, 'Mar': 2, 'Apr': 3, 'May': 4, 'Jun': 5,
     'Jul': 6, 'Aug': 7, 'Sep': 8, 'Oct': 9, 'Nov': 10, 'Dec': 11}
# 'march' -> 'March' -> 'Mar' -> 2: normalize each name to a
# three-letter key, map it to a month number, and sort by that
idx = fruits_grp['month'].str.title().str[:3].map(d).sort_values().index
fruits_grp = fruits_grp.reindex(idx)
fruits_grp['s'] = list(range(len(fruits_grp)))
fruits_grp = fruits_grp.set_index('s', append=True).droplevel(1).rename_axis(index=['month', None])
Update:
sample dataframe:
fruits = pd.DataFrame()
fruits['month'] = [1, 2, 2, 3, 1, 4, 4, 6, 3, 3, 6, 4]
fruits['fruit'] = ['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry']
fruits['price'] = [30, 20, 40, 25, 30, 45, 60, 45, 25, 55, 37, 60]
ind = fruits.index
fruits_grp = fruits.set_index(['month', ind], drop=False)
Then simply use:
idx = fruits_grp['month'].sort_values().index
fruits_grp = fruits_grp.reindex(idx)
fruits_grp['s'] = list(range(len(fruits_grp)))
fruits_grp = fruits_grp.set_index('s', append=True).droplevel(1).rename_axis(index=['month', None])
Another approach converts the month names to datetime month numbers, sorts, and converts back:
fruits = pd.DataFrame()
fruits['month'] = ['jan','feb','feb','mar','jan','apr','apr','jun','mar','mar','jun','apr']
fruits['fruit'] = ['apple','orange','pear','orange','apple','pear','cherry','pear','orange','cherry','apple','cherry']
fruits['price'] = [30, 20, 40, 25, 30, 45, 60, 45, 25, 55, 37, 60]

fruits['month'] = fruits['month'].str.capitalize()
fruits['month'] = pd.to_datetime(fruits.month, format='%b',
                                 errors='coerce').dt.month
fruits = fruits.sort_values(by='month')
fruits['month'] = pd.to_datetime(fruits['month'], format='%m').dt.strftime('%b')
ind1 = fruits.index
fruits_grp1 = fruits.set_index(['month', ind1], drop=False)
print(fruits_grp1)
Thank you so much for all the answers. I've figured out that sort_values() can make this happen.
The reproducible code is as follows:
fruit_grp_srt = fruits_grp.sort_values(by='month')
But this sorts the rows in alphabetical order, not in the original order of the first index.
Still looking for a better solution. Thank you!
To me, it looks like simple sorting by month. First, you need to eliminate the month column (there is already a month level in the index), followed by a reset_index:
del fruits_grp['month']
df = fruits_grp.reset_index()
Then, it is important to cast the months to a categorical dtype and define the custom category order:
df.month = df.month.astype('category')
df.month = df.month.cat.reorder_categories(['jan', 'feb', 'march', 'april', 'june'])
Now it is simply a matter of sorting by month:
df.sort_values(by='month')
Output
month level_1 fruit price
0 jan 0 apple 30
4 jan 4 apple 30
1 feb 1 orange 20
2 feb 2 pear 40
3 march 3 orange 25
8 march 8 orange 25
9 march 9 cherry 55
5 april 5 pear 45
6 april 6 cherry 60
11 april 11 cherry 60
7 june 7 pear 45
10 june 10 apple 37
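A possible follow-up sketch to restore the (month, position) MultiIndex after the categorical sort; level_1 is the column that reset_index created from the old positional level:
out = (df.sort_values(by='month')
         .set_index(['month', 'level_1'])
         .rename_axis(['month', None]))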

Calculate number of cases per year from pandas data frame

I have a data frame in the following format:
import pandas as pd
d = {'case_id': [1, 2, 3], 'begin': [2002, 1996, 2001], 'end': [2019, 2001, 2002]}
df = pd.DataFrame(data=d)
with about 1,000 cases.
I need to calculate how many cases are in force by year. This information can be derived from the 'begin' and 'end' columns.
For example, case 2 was in force between the years 1996 and 2001.
The resulting data frame should look as follows:
e = {'year': [1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
'cases': [1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df_ = pd.DataFrame(data=e)
Any idea how I can do this in a few lines for 1,000 cases?
Assign a new column holding each case's range of years, then explode it:
df['new'] = [range(x, y + 1) for x, y in zip(df.begin, df.end)]
df = df.explode('new')
Then do groupby + nunique:
out = df.groupby(['new']).case_id.nunique().reset_index()
Out[257]:
new case_id
0 1996 1
1 1997 1
2 1998 1
3 1999 1
4 2000 1
5 2001 2
6 2002 2
7 2003 1
8 2004 1
9 2005 1
10 2006 1
11 2007 1
12 2008 1
13 2009 1
14 2010 1
15 2011 1
16 2012 1
17 2013 1
18 2014 1
19 2015 1
20 2016 1
21 2017 1
22 2018 1
23 2019 1
Here is another way (requires import numpy as np):
out = (df.assign(year=df.apply(lambda x: np.arange(x['begin'], x['end'] + 1), axis=1))
         .explode('year')
         .groupby('year')['case_id'].count()
         .reset_index())
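For larger inputs, a hedged sketch that avoids materializing one row per case-year: place +1/-1 events at each case's begin and end + 1, then take a cumulative sum (assumes begin <= end for every case):
import pandas as pd

d = {'case_id': [1, 2, 3], 'begin': [2002, 1996, 2001], 'end': [2019, 2001, 2002]}
df = pd.DataFrame(data=d)

# +1 the year a case starts, -1 the year after it ends
events = pd.Series(0, index=range(df['begin'].min(), df['end'].max() + 2))
events = (events.add(df['begin'].value_counts(), fill_value=0)
                .sub((df['end'] + 1).value_counts(), fill_value=0))
# running total = cases in force per year; drop the sentinel year at the end
df_ = (events.cumsum().iloc[:-1].astype(int)
             .rename_axis('year').reset_index(name='cases'))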

Pandas Vlookup 2 DF Columns Different Lengths & Perform Calculation

I need to execute a vlookup-like calculation considering two DataFrames of different lengths with the same column names. Suppose I have a df called df1 such as:
Y M P D
2020 11 Red 10
2020 11 Blue 9
2020 11 Green 12
2020 11 Tan 7
2020 11 White 5
2020 11 Cyan 17
and a second df called df2 such as:
Y M P D
2020 11 Blue 4
2020 11 Red 12
2020 11 White 6
2020 11 Tan 7
2020 11 Green 20
2020 11 Violet 10
2020 11 Black 7
2020 11 BlackII 3
2020 11 Cyan 14
2020 11 Copper 6
I need a new df like df3[['Res', 'P']] with 2 columns showing the result of subtracting df2's values from df1's, such as:
Res P
Red -2
Blue 5
Green -8
Tan 0
White -1
Cyan 3
I have not been able to find anything with a lookup and then calculation on the web. I've tried merging df1 and df2 into one df but I do not see how to execute the calculation when the values in the "P" column match. I think that a merge of df1 and df2 is probably the first step though?
Based on the example, columns 'Y' and 'M' do not matter for the merge. If these columns are relevant, use a list with the on parameter (e.g. on=['Y', 'M', 'P']).
Currently, only columns [['P', 'D']] are being used from df1 and df2.
The following code produces the desired output for the example, but it's difficult to say what will happen with larger dataframes or if there are repeated values in 'P'.
import pandas as pd
# setup the dataframes
df1 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11], 'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'], 'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11], 'P': ['Blue', 'Red', 'White', 'Tan', 'Green', 'Violet', 'Black', 'BlackII', 'Cyan', 'Copper'], 'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})
# merge the dataframes
df = pd.merge(df1[['P', 'D']], df2[['P', 'D']], on='P', suffixes=('_1', '_2')).rename(columns={'P': 'Res'})
# subtract the values
df['P'] = (df.D_1 - df.D_2)
# drop the unneeded columns
df = df.drop(columns=['D_1', 'D_2'])
# display(df)
Res P
0 Red -2
1 Blue 5
2 Green -8
3 Tan 0
4 White -1
5 Cyan 3
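A possible alternative sketch that skips the merge by aligning the two 'D' series on 'P' via the index (this assumes the 'P' values are unique within each frame):
# Subtract after aligning on product name; reindexing by df1['P'] keeps
# df1's row order and drops products that appear only in df2.
diff = (df1.set_index('P')['D']
           .sub(df2.set_index('P')['D'])
           .reindex(df1['P'])
           .dropna())
df3 = diff.astype(int).rename('P').rename_axis('Res').reset_index()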

flatten a multi index in pandas [duplicate]

This question already has answers here:
Pandas index column title or name
(9 answers)
Closed 2 years ago.
I need to set an index on my rows, and when I do, pandas seems to make my index hierarchical. I then tried every flatten method I could find, but once I reset_index, my row index is replaced with integers. If I use df.columns = [my column names], it doesn't flatten the column index at all.
I'll use the pandas official docs example:
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
df.set_index('month')
and I get
       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31
Then I flatten the index by
df.reset_index()
Then it becomes
index month year sale
0 0 1 2012 55
1 1 4 2014 40
2 2 7 2013 84
3 3 10 2014 31
(The month row index disappeared...)
This really kills me, so I'd appreciate it if someone could help me turn the dataframe into something like:
month year sale
1 2012 55
4 2014 40
7 2013 84
10 2014 31
Thanks!
You only need to
df.reset_index(drop=True)
which returns
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31
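Worth noting, since it is the likely root cause here: set_index and reset_index return new DataFrames rather than modifying df in place, so the result has to be assigned back. A minimal sketch:
df = df.set_index('month')       # assign back, or the new index is lost
df = df.reset_index()            # moves 'month' back to a regular column
df = df.reset_index(drop=True)   # discards the current index instead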

Standard deviation with groupby(multiple columns) Pandas

I am working with data from the California Air Resources Board.
site,monitor,date,start_hour,value,variable,units,quality,prelim,name
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach
...
df = pd.concat([pd.read_csv(file, header=0) for file in f])  # merge all files (f is a list of CSV paths) into one dataframe
df.dropna(axis=0, how="all", subset=['start_hour', 'variable'],
          inplace=True)  # drop trailing rows that have no data (NaN) in these columns
df.start_hour = pd.to_timedelta(df['start_hour'], unit='h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace=True)
df = df.rename(columns={'value': 'conc'})
I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
year month day conc
0 2014 1 1 9.644583
1 2014 1 2 4.945652
2 2014 1 3 4.345238
3 2014 1 4 5.047917
4 2014 1 5 5.212857
5 2014 1 6 2.095714
After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).
from pandas.tseries.offsets import MonthEnd

m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
year month conc datetime
0 2014 1 4.330985 2014-01-31
1 2014 2 2.280096 2014-02-28
2 2014 3 4.464622 2014-03-31
3 2014 4 6.583759 2014-04-30
4 2014 5 9.069353 2014-05-31
5 2014 6 9.982330 2014-06-30
Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)
sd = d_avg['conc'].apply(lambda x: x.std())
However, each attempt has left me with the same problem in the dataframe. I am unable to plot the standard deviation because it appears to be computed over the year and month columns too, the very columns I am trying to group by. Here's what my resulting dataframe sd looks like:
year month sd
0 44.877611 1.000000 1.795868
1 44.877611 1.414214 2.355055
2 44.877611 1.732051 2.597531
3 44.877611 2.000000 2.538749
4 44.877611 2.236068 5.456785
5 44.877611 2.449490 3.315546
Please help me!
[1]: https://i.stack.imgur.com/ueVrG.png
I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?
import pandas as pd
columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
        [2014, 1, 1, 4.0],
        [2014, 1, 2, 6.0],
        [2014, 1, 2, 8.0],
        [2014, 2, 1, 2.0],
        [2014, 2, 1, 6.0],
        [2014, 2, 2, 10.0],
        [2014, 2, 2, 14.0]]
df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()
print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')
Outputs:
Concentrations:
year month day conc
0 2014 1 1 2.0
1 2014 1 1 4.0
2 2014 1 2 6.0
3 2014 1 2 8.0
4 2014 2 1 2.0
5 2014 2 1 6.0
6 2014 2 2 10.0
7 2014 2 2 14.0
Daily Average:
year month day conc
0 2014 1 1 3.0
1 2014 1 2 7.0
2 2014 2 1 4.0
3 2014 2 2 12.0
Monthly Average:
year month conc
0 2014 1 5.0
1 2014 2 8.0
Monthly Standard Deviation:
year month conc
0 2014 1 2.828427
1 2014 2 5.656854
I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0)
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
and here's the new dataframe:
m_avg_sd.head(5)
Out[2]:
year month conc sd datetime
0 2009 1 48.350105 18.394192 2009-01-31
1 2009 2 21.929383 16.293645 2009-02-28
2 2009 3 15.094729 6.821124 2009-03-31
3 2009 4 12.021009 4.391219 2009-04-30
4 2009 5 13.449100 4.081734 2009-05-31
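A tidier sketch of the same result using named aggregation, which computes the mean and standard deviation in one groupby call and avoids the concat/rename workaround (assumes pandas >= 0.25):
from pandas.tseries.offsets import MonthEnd

# one groupby, two aggregates: monthly mean and std of the daily averages
m_avg_sd = (d_avg.groupby(['year', 'month'])['conc']
                 .agg(conc='mean', sd='std')
                 .reset_index())
# zero-pad the month so the '%Y%m' format parses unambiguously
m_avg_sd['datetime'] = pd.to_datetime(
    m_avg_sd['year'].astype(str) + m_avg_sd['month'].astype(str).str.zfill(2),
    format='%Y%m') + MonthEnd(1)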