Iterating over two dfs and adding values - pandas

I have a dataframe along the lines of the below (df1):
  Country  Crisis
1 Italy    2008
2 Germany  2008, 2009
3 Mexico
4 US       2007
5 Greece   2010, 2007
I have another dataframe (df2) in panel data format:
Country Year
1 Italy 2007
2 Italy 2008
3 Italy 2009
4 Italy 2010
5 Germany 2007
6 Germany 2008
7 Germany 2009
8 Germany 2010
9 Mexico 2007
10 Mexico 2008
11 Mexico 2009
12 Mexico 2010
13 US 2007
14 US 2008
15 US 2009
16 US 2010
17 Greece 2007
18 Greece 2008
19 Greece 2009
20 Greece 2010
I wish to add a column to df2, "crisis", in which 1 will indicate a crisis, like so:
Country Year crisis
1 Italy 2007 0
2 Italy 2008 1
3 Italy 2009 0
4 Italy 2010 0
5 Germany 2007 0
6 Germany 2008 1
7 Germany 2009 1
8 Germany 2010 0
9 Mexico 2007 0
10 Mexico 2008 0
11 Mexico 2009 0
12 Mexico 2010 0
13 US 2007 1
14 US 2008 0
15 US 2009 0
16 US 2010 0
17 Greece 2007 1
18 Greece 2008 0
19 Greece 2009 0
20 Greece 2010 1
Any ideas?

Although not pretty, this works (assuming each Crisis cell in df1 holds a list of years, empty for countries with no crisis):
df2['crisis'] = (
    df2.groupby('Country', sort=False, as_index=False)
       .apply(lambda x: x['Year'].isin(
           # look up this group's country in df1 and take its list of crisis years
           df1.loc[df1['Country'] == x['Country'].iloc[0], 'Crisis'].iloc[0]))
       .droplevel(0)
)
Output:
>>> df2
Country Year crisis
1 Italy 2007 False
2 Italy 2008 True
3 Italy 2009 False
4 Italy 2010 False
5 Germany 2007 False
6 Germany 2008 True
7 Germany 2009 True
8 Germany 2010 False
9 Mexico 2007 False
10 Mexico 2008 False
11 Mexico 2009 False
12 Mexico 2010 False
13 US 2007 True
14 US 2008 False
15 US 2009 False
16 US 2010 False
17 Greece 2007 True
18 Greece 2008 False
19 Greece 2009 False
20 Greece 2010 True
>>> df2.assign(crisis=df2['crisis'].astype(int))
Country Year crisis
1 Italy 2007 0
2 Italy 2008 1
3 Italy 2009 0
4 Italy 2010 0
5 Germany 2007 0
6 Germany 2008 1
7 Germany 2009 1
8 Germany 2010 0
9 Mexico 2007 0
10 Mexico 2008 0
11 Mexico 2009 0
12 Mexico 2010 0
13 US 2007 1
14 US 2008 0
15 US 2009 0
16 US 2010 0
17 Greece 2007 1
18 Greece 2008 0
19 Greece 2009 0
20 Greece 2010 1
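A flatter alternative (just a sketch, again assuming each Crisis cell is a list of years, with an empty list where there is no crisis) is to explode df1 into one row per crisis year and left-merge:
# one row per (Country, crisis year); empty lists explode to NaN, which we drop
# note: the Year dtypes must match on both sides for the merge to line up
crises = (df1.explode('Crisis')
             .dropna(subset=['Crisis'])
             .rename(columns={'Crisis': 'Year'})
             .assign(crisis=1))

df2 = df2.merge(crises, on=['Country', 'Year'], how='left')
df2['crisis'] = df2['crisis'].fillna(0).astype(int)
Because the merge is a left join, countries with no crisis years (Mexico) simply get NaN, which the fillna turns into 0.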

Related

Pandas - creating new column based on data from other records

I have a pandas dataframe which has the following columns -
Day, Month, Year, City, Temperature.
I would like to have a new column that has the average (mean) temperature on the same date (day/month) across all previous years.
Can someone please assist?
Thanks :-)
Try:
import numpy as np
import pandas as pd

dti = pd.date_range('2000-1-1', '2021-12-1', freq='D')
temp = np.random.randint(10, 20, len(dti))
df = pd.DataFrame({'Day': dti.day, 'Month': dti.month, 'Year': dti.year,
                   'City': 'Nice', 'Temperature': temp})

# expanding (running) mean of Temperature per (City, Month, Day), ordered by Year
out = df.set_index('Year').groupby(['City', 'Month', 'Day']) \
        .expanding()['Temperature'].mean().reset_index()
Output:
>>> out
Day Month Year City Temperature
0 1 1 2000 Nice 12.000000
1 1 1 2001 Nice 12.000000
2 1 1 2002 Nice 11.333333
3 1 1 2003 Nice 12.250000
4 1 1 2004 Nice 11.800000
... ... ... ... ... ...
8001 31 12 2016 Nice 15.647059
8002 31 12 2017 Nice 15.555556
8003 31 12 2018 Nice 15.631579
8004 31 12 2019 Nice 15.750000
8005 31 12 2020 Nice 15.666667
[8006 rows x 5 columns]
Focus on 1st January of the dataset:
>>> df[df['Day'].eq(1) & df['Month'].eq(1)]
Day Month Year City Temperature # Mean
0 1 1 2000 Nice 12 # 12
366 1 1 2001 Nice 12 # 12
731 1 1 2002 Nice 10 # 11.33
1096 1 1 2003 Nice 15 # 12.25
1461 1 1 2004 Nice 10 # 11.80
1827 1 1 2005 Nice 12 # and so on
2192 1 1 2006 Nice 17
2557 1 1 2007 Nice 16
2922 1 1 2008 Nice 19
3288 1 1 2009 Nice 12
3653 1 1 2010 Nice 10
4018 1 1 2011 Nice 16
4383 1 1 2012 Nice 13
4749 1 1 2013 Nice 15
5114 1 1 2014 Nice 14
5479 1 1 2015 Nice 13
5844 1 1 2016 Nice 15
6210 1 1 2017 Nice 13
6575 1 1 2018 Nice 15
6940 1 1 2019 Nice 18
7305 1 1 2020 Nice 11
7671 1 1 2021 Nice 14
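One caveat: expanding() includes the current row, so the values above are running means up to and including each year. If "all previous years" must exclude the current one, a shifted variant would work (a sketch; PrevMean is a made-up column name):
# sort so each group's rows are in chronological order, then shift the
# running mean by one so year N only sees years < N (first year becomes NaN)
df = df.sort_values(['City', 'Month', 'Day', 'Year'])
df['PrevMean'] = (df.groupby(['City', 'Month', 'Day'])['Temperature']
                    .transform(lambda s: s.expanding().mean().shift()))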

Pandas Avoid Multidimensional Key Error Comparing 2 Dataframes

I am stuck on a multidimensional key ValueError. I have a dataframe that looks like this:
year RMSE index cyear Corr_to_CY
0 2000 0.279795 5 1997 0.997975
1 2011 0.299011 2 1994 0.997792
2 2003 0.368341 1 1993 0.977143
3 2013 0.377902 23 2015 0.824441
4 1999 0.41495 10 2002 0.804633
5 1997 0.435813 8 2000 0.752724
6 2018 0.491003 24 2016 0.703359
7 2002 0.505771 3 1995 0.684926
8 2009 0.529308 17 2009 0.580481
9 2015 0.584146 27 2019 0.556555
10 2004 0.620946 26 2018 0.500790
11 2016 0.659388 22 2014 0.443543
12 1993 0.700942 19 2011 0.431615
13 2006 0.748086 11 2003 0.375111
14 2007 0.766675 21 2013 0.323143
15 2020 0.827913 12 2004 0.149202
16 2014 0.884109 7 1999 0.002438
17 2012 0.900184 0 1992 -0.351615
18 1995 0.919482 28 2020 -0.448915
19 1992 0.930512 20 2012 -0.563762
20 2001 0.967834 18 2010 -0.613170
21 2019 1.00497 9 2001 -0.677590
22 2005 1.00885 13 2005 -0.695690
23 2010 1.159125 14 2006 -0.843122
24 2017 1.173262 15 2007 -0.931034
25 1994 1.179737 6 1998 -0.939697
26 2008 1.212915 25 2017 -0.981626
27 1996 1.308853 16 2008 -0.985893
28 1998 1.396771 4 1996 -0.999990
I have selected the rows where 'Corr_to_CY' >= 0.70 and put their 'cyear' values into a new df called 'cyears'. I now need to use those values to find the year and RMSE value for every row whose 'year' column appears in cyears. Below is my best attempt, which raises ValueError: Cannot index with multidimensional key. Do I need to convert "cyears" to something else - a Series, a list, etc. - for this to work? Thank you; here is the code that produces the error:
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
cyears = cyears.to_frame()
result = comp.loc[comp['year'] == cyears,'RMSE']
ValueError: Cannot index with multidimensional key
You can use the isin method:
import pandas as pd
# Sample creation
import io
comp = pd.read_csv(io.StringIO('year,RMSE,index,cyear,Corr_to_CY\n2000,0.279795,5,1997,0.997975\n2011,0.299011,2,1994,0.997792\n2003,0.368341,1,1993,0.977143\n2013,0.377902,23,2015,0.824441\n1999,0.41495,10,2002,0.804633\n1997,0.435813,8,2000,0.752724\n2018,0.491003,24,2016,0.703359\n2002,0.505771,3,1995,0.684926\n2009,0.529308,17,2009,0.580481\n2015,0.584146,27,2019,0.556555\n2004,0.620946,26,2018,0.500790\n2016,0.659388,22,2014,0.443543\n1993,0.700942,19,2011,0.431615\n2006,0.748086,11,2003,0.375111\n2007,0.766675,21,2013,0.323143\n2020,0.827913,12,2004,0.149202\n2014,0.884109,7,1999,0.002438\n2012,0.900184,0,1992,-0.351615\n1995,0.919482,28,2020,-0.448915\n1992,0.930512,20,2012,-0.563762\n2001,0.967834,18,2010,-0.613170\n2019,1.00497,9,2001,-0.677590\n2005,1.00885,13,2005,-0.695690\n2010,1.159125,14,2006,-0.843122\n2017,1.173262,15,2007,-0.931034\n1994,1.179737,6,1998,-0.939697\n2008,1.212915,25,2017,-0.981626\n1996,1.308853,16,2008,-0.985893\n1998,1.396771,4,1996,-0.999990\n'))
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
result = comp.loc[comp['year'].isin(cyears),'RMSE']
If you want to keep cyears as a pandas DataFrame instead of a Series, try the following:
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7, ['cyear']]
result = comp.loc[comp['year'].isin(cyears.cyear),'RMSE']
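An equivalent route (a sketch of an alternative using the DataFrame version of cyears from above, not required for the fix) is an inner merge; note that, unlike isin, a merge would duplicate rows if cyears contained repeated values:
# inner join keeps only the rows of comp whose 'year' appears among the selected cyear values
result = comp.merge(cyears.rename(columns={'cyear': 'year'}), on='year')['RMSE']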

Pandas removing rows with incomplete time series in panel data

I have a dataframe along the lines of the below:
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
4 Germany Italy 2000
5 Germany Italy 2002
6 Mexico Canada 2000
7 Mexico Canada 2001
8 Mexico Canada 2002
9 US France 2000
10 US France 2001
11 Greece Italy 2000
12 Greece Italy 2001
I want to keep only the rows in which there are observations for the entire time series (2000-2002). So, the end result would be:
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
4 Mexico Canada 2000
5 Mexico Canada 2001
6 Mexico Canada 2002
One idea is to reshape with crosstab and test which (Country1, Country2) pairs have no 0 counts, using DataFrame.ne with DataFrame.all; then convert the surviving index to a DataFrame with MultiIndex.to_frame and keep the matching rows with DataFrame.merge:
df1 = pd.crosstab([df['Country1'], df['Country2']], df['Year'])
df = df.merge(df1.index[df1.ne(0).all(axis=1)].to_frame(index=False))
print (df)
Country1 Country2 Year
0 Italy Greece 2000
1 Italy Greece 2001
2 Italy Greece 2002
3 Mexico Canada 2000
4 Mexico Canada 2001
5 Mexico Canada 2002
Or, if you need to test against a specific range, you can compare sets in GroupBy.transform:
r = set(range(2000, 2003))
df = df[df.groupby(['Country1', 'Country2'])['Year'].transform(lambda x: set(x) == r)]
print (df)
Country1 Country2 Year
1 Italy Greece 2000
2 Italy Greece 2001
3 Italy Greece 2002
6 Mexico Canada 2000
7 Mexico Canada 2001
8 Mexico Canada 2002
One option is to pivot the data, drop null rows and reshape back; this only works if each combination of Country1/Country2 and Year is unique (in the sample data it is):
(df.assign(dummy=1)
   .pivot(index=['Country1', 'Country2'], columns='Year')
   .dropna()
   .stack()
   .drop(columns='dummy')
   .reset_index()
)
Country1 Country2 Year
0 Italy Greece 2000
1 Italy Greece 2001
2 Italy Greece 2002
3 Mexico Canada 2000
4 Mexico Canada 2001
5 Mexico Canada 2002
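For completeness, the same filter can be written with GroupBy.filter, which, unlike the pivot route, does not require Country/Year combinations to be unique (a sketch, again assuming the 2000-2002 range):
r = set(range(2000, 2003))
# keep a country pair's rows only when its observed years are exactly the full range
out = df.groupby(['Country1', 'Country2']).filter(lambda g: set(g['Year']) == r)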

How to avoid null while doing diff using periods in pandas?

I have the below dataframe, and I am calculating the difference from the previous value using diff with periods, but that makes the first value in each group null. Is there any way to fill that value?
example:
df['cal_val'] = df.groupby('year')['val'].diff(periods=1)
current output:
date year val cal_val
1/3/10 2010 12 NaN
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 NaN
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3
expected output:
date year val cal_val
1/3/10 2010 12 12
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 10
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3
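Given the expected output, where the first cal_val of each year equals val itself, one sketch is to fill the NaN that diff leaves with the row's own val (and cast back to int if integer output is wanted):
# diff within each year, then replace each group's leading NaN with the original value
df['cal_val'] = (df.groupby('year')['val'].diff(periods=1)
                   .fillna(df['val'])
                   .astype(int))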

Pandas: how to preview several rows sorted under certain columns

If I have a DataFrame (df) such as:
Year Rate
2001 10
2001 3
2001 5
2001 3
2001 6
2002 2
2002 7
2002 4
2002 9
2002 8
... ...
2018 8
2018 6
2018 4
2018 6
2018 5
How do I get a DataFrame that shows only the first 2 rows of each year, like:
Year Rate
2001 10
2001 3
2002 2
2002 7
... ...
2018 8
2018 6
Thanks
Use GroupBy.head:
df1 = df.groupby('Year').head(2)
print (df1)
Year Rate
0 2001 10
1 2001 3
5 2002 2
6 2002 7
10 2018 8
11 2018 6
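Since the title mentions sorting, note that head takes rows in their existing order; if the top rows per year should instead be chosen by value (say, the two highest rates), a hedged variant is to sort before grouping:
# two highest rates per year: order within each year by Rate descending, then take the first two
df1 = df.sort_values(['Year', 'Rate'], ascending=[True, False]).groupby('Year').head(2)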