Pandas Avoid Multidimensional Key Error Comparing 2 Dataframes - pandas
I am stuck on a "multidimensional key" ValueError. I have a dataframe that looks like this:
year RMSE index cyear Corr_to_CY
0 2000 0.279795 5 1997 0.997975
1 2011 0.299011 2 1994 0.997792
2 2003 0.368341 1 1993 0.977143
3 2013 0.377902 23 2015 0.824441
4 1999 0.41495 10 2002 0.804633
5 1997 0.435813 8 2000 0.752724
6 2018 0.491003 24 2016 0.703359
7 2002 0.505771 3 1995 0.684926
8 2009 0.529308 17 2009 0.580481
9 2015 0.584146 27 2019 0.556555
10 2004 0.620946 26 2018 0.500790
11 2016 0.659388 22 2014 0.443543
12 1993 0.700942 19 2011 0.431615
13 2006 0.748086 11 2003 0.375111
14 2007 0.766675 21 2013 0.323143
15 2020 0.827913 12 2004 0.149202
16 2014 0.884109 7 1999 0.002438
17 2012 0.900184 0 1992 -0.351615
18 1995 0.919482 28 2020 -0.448915
19 1992 0.930512 20 2012 -0.563762
20 2001 0.967834 18 2010 -0.613170
21 2019 1.00497 9 2001 -0.677590
22 2005 1.00885 13 2005 -0.695690
23 2010 1.159125 14 2006 -0.843122
24 2017 1.173262 15 2007 -0.931034
25 1994 1.179737 6 1998 -0.939697
26 2008 1.212915 25 2017 -0.981626
27 1996 1.308853 16 2008 -0.985893
28 1998 1.396771 4 1996 -0.999990
I selected the rows where 'Corr_to_CY' >= 0.70 and stored their 'cyear' values in a new df called 'cyears'. I need to use this to look up the year and RMSE value of every row whose 'year' is in cyears. This is my best attempt, and it raises ValueError: Cannot index with multidimensional key. Do I need to change "cyears" to something else (a Series, a list, etc.) for this to work? Thank you. Here is the code that produces the error:
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
cyears = cyears.to_frame()
result = comp.loc[comp['year'] == cyears,'RMSE']
ValueError: Cannot index with multidimensional key
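You can use the isin method. Comparing comp['year'] to a DataFrame with == produces a two-dimensional result, which .loc cannot use as a row mask, whereas isin returns a one-dimensional boolean Series: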
import io
import pandas as pd
# Sample creation
comp = pd.read_csv(io.StringIO('year,RMSE,index,cyear,Corr_to_CY\n2000,0.279795,5,1997,0.997975\n2011,0.299011,2,1994,0.997792\n2003,0.368341,1,1993,0.977143\n2013,0.377902,23,2015,0.824441\n1999,0.41495,10,2002,0.804633\n1997,0.435813,8,2000,0.752724\n2018,0.491003,24,2016,0.703359\n2002,0.505771,3,1995,0.684926\n2009,0.529308,17,2009,0.580481\n2015,0.584146,27,2019,0.556555\n2004,0.620946,26,2018,0.500790\n2016,0.659388,22,2014,0.443543\n1993,0.700942,19,2011,0.431615\n2006,0.748086,11,2003,0.375111\n2007,0.766675,21,2013,0.323143\n2020,0.827913,12,2004,0.149202\n2014,0.884109,7,1999,0.002438\n2012,0.900184,0,1992,-0.351615\n1995,0.919482,28,2020,-0.448915\n1992,0.930512,20,2012,-0.563762\n2001,0.967834,18,2010,-0.613170\n2019,1.00497,9,2001,-0.677590\n2005,1.00885,13,2005,-0.695690\n2010,1.159125,14,2006,-0.843122\n2017,1.173262,15,2007,-0.931034\n1994,1.179737,6,1998,-0.939697\n2008,1.212915,25,2017,-0.981626\n1996,1.308853,16,2008,-0.985893\n1998,1.396771,4,1996,-0.999990\n'))
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
result = comp.loc[comp['year'].isin(cyears),'RMSE']
If you want to keep cyears as a pandas DataFrame instead of a Series, try the following:
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7, ['cyear']]
result = comp.loc[comp['year'].isin(cyears.cyear),'RMSE']
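Since the question asks for both the year and the RMSE value, the same isin mask can select two columns at once. A minimal sketch on a four-row frame built from the first and sixth rows of the question's data plus two others; the variable name mask is introduced here only for illustration:

```python
import pandas as pd

# Four-row sample using the question's column names (a subset of its data).
comp = pd.DataFrame({
    "year": [2000, 2011, 2003, 1997],
    "RMSE": [0.279795, 0.299011, 0.368341, 0.435813],
    "cyear": [1997, 1994, 1993, 2000],
    "Corr_to_CY": [0.997975, 0.997792, 0.977143, 0.752724],
})

# 1-D Series of the cyear values whose correlation passes the threshold.
cyears = comp.loc[comp["Corr_to_CY"] >= 0.7, "cyear"]

# isin() checks each year for membership in cyears and returns a
# 1-D boolean Series, which .loc accepts as a row mask.
mask = comp["year"].isin(cyears)

# Select both columns of interest for the matching rows.
result = comp.loc[mask, ["year", "RMSE"]]
print(result)
```

Passing a list of column labels returns a DataFrame; passing the single label 'RMSE', as in the answer, returns a Series.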
Related
Pandas - creating new column based on data from other records
I have a pandas dataframe which has the following columns: Day, Month, Year, City, Temperature. I would like to add a new column holding the average (mean) temperature on the same date (day/month) across all previous years. Can someone please assist? Thanks :-)
Try:
import numpy as np
import pandas as pd
dti = pd.date_range('2000-1-1', '2021-12-1', freq='D')
temp = np.random.randint(10, 20, len(dti))
df = pd.DataFrame({'Day': dti.day, 'Month': dti.month, 'Year': dti.year,
                   'City': 'Nice', 'Temperature': temp})
out = df.set_index('Year').groupby(['City', 'Month', 'Day']) \
        .expanding()['Temperature'].mean().reset_index()
Output:
>>> out
      Day  Month  Year  City  Temperature
0       1      1  2000  Nice    12.000000
1       1      1  2001  Nice    12.000000
2       1      1  2002  Nice    11.333333
3       1      1  2003  Nice    12.250000
4       1      1  2004  Nice    11.800000
...   ...    ...   ...   ...          ...
8001   31     12  2016  Nice    15.647059
8002   31     12  2017  Nice    15.555556
8003   31     12  2018  Nice    15.631579
8004   31     12  2019  Nice    15.750000
8005   31     12  2020  Nice    15.666667
[8006 rows x 5 columns]
Focus on 1st January of the dataset:
>>> df[df['Day'].eq(1) & df['Month'].eq(1)]
      Day  Month  Year  City  Temperature  # Mean
0       1      1  2000  Nice           12  # 12
366     1      1  2001  Nice           12  # 12
731     1      1  2002  Nice           10  # 11.33
1096    1      1  2003  Nice           15  # 12.25
1461    1      1  2004  Nice           10  # 11.80
1827    1      1  2005  Nice           12  # and so on
2192    1      1  2006  Nice           17
2557    1      1  2007  Nice           16
2922    1      1  2008  Nice           19
3288    1      1  2009  Nice           12
3653    1      1  2010  Nice           10
4018    1      1  2011  Nice           16
4383    1      1  2012  Nice           13
4749    1      1  2013  Nice           15
5114    1      1  2014  Nice           14
5479    1      1  2015  Nice           13
5844    1      1  2016  Nice           15
6210    1      1  2017  Nice           13
6575    1      1  2018  Nice           15
6940    1      1  2019  Nice           18
7305    1      1  2020  Nice           11
7671    1      1  2021  Nice           14
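The expanding mean in the answer includes the current year's temperature. If the average should cover only earlier years, as the question asks, one possible variation is to shift the expanding mean by one row within each group. A minimal sketch, assuming rows are ordered by Year within each (City, Month, Day) group; the PrevMean column name is made up for illustration and the four rows reuse the first 1st January values shown above:

```python
import pandas as pd

# Four years of readings for the same city and calendar date.
df = pd.DataFrame({
    "Day": [1, 1, 1, 1],
    "Month": [1, 1, 1, 1],
    "Year": [2000, 2001, 2002, 2003],
    "City": ["Nice"] * 4,
    "Temperature": [12, 12, 10, 15],
})

# Make sure years are in chronological order before taking a running mean.
df = df.sort_values("Year")

# Expanding mean per (City, Month, Day), shifted by one row so each year
# only sees the years before it. The first year of a group has no history
# and therefore gets NaN.
df["PrevMean"] = (
    df.groupby(["City", "Month", "Day"])["Temperature"]
      .transform(lambda s: s.expanding().mean().shift())
)
print(df)
```

Unlike the answer's reset_index approach, transform keeps the original row order and index, so the new column can be assigned directly.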
line plot by month and year using FacetGrid
I have this dataset:
year  month  victims
2016      7     4869
2016      8     4817
2016      9     4900
2016     10     4873
2016     11     4461
2016     12     4908
2017      1     4717
2017      2     4489
2017      3     4733
2017      4     4549
2017      5     4928
2017      6     4767
2017      7     4713
2017      8     4992
2017      9     4885
2017     10     5049
2017     11     4861
2017     12     4667
....
I want to plot the number of victims by year and month. I used this code:
sb.lineplot(data=number_victims, x='month', y='victims', hue='year')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title = 'month', title_fontsize = 12)
This is the result: (plot not included)
I tried to use FacetGrid to get a better view of each year. This is my code:
g = sb.FacetGrid(number_victims, col="year", margin_titles=True, col_wrap=3, height=3, aspect=2)
g.map_dataframe(sb.lineplot, x="month", y="victims")
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('The overall trend of victims number by month and year');
And this is the result: (plot not included)
So I have 2 questions:
How can I sort the months in the first graph (so they run from 1 to 12)?
Why do the month labels in the second graph run from 1 to 10? Also, 2016 is supposed to start from month 7, not month 1. How can I fix that?
Thank you so much.
unexpected result with pandas plot [duplicate]
I have a dataframe as below:
     year  hour  system_load
0    2013     0           23
1    2013     1           22
2    2013     2           22
3    2013     3           21
4    2013     4           19
..    ...   ...          ...
187  2020    19           30
188  2020    20           29
189  2020    21           28
190  2020    22           28
191  2020    23           28
I am trying to plot it with the following code:
dff.plot(x='hour',y='system_load')
The result is as follows: (plot not included)
What I want is to remove the lines that connect the end of one year's curve to the start of the next, label each curve with its year, and plot each curve in a different color.
You need to pivot your data, so that each year becomes its own column and is drawn as its own line:
dff.pivot(index='hour', columns='year', values='system_load').plot()
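To see what the pivot does before plotting, it can help to print the reshaped frame. A minimal sketch with a six-row sample in the question's layout; the load values for 2020 at hours 0 to 2 are made up for illustration, and the name wide is hypothetical:

```python
import pandas as pd

# Long format: one row per (year, hour) pair, as in the question.
dff = pd.DataFrame({
    "year": [2013, 2013, 2013, 2020, 2020, 2020],
    "hour": [0, 1, 2, 0, 1, 2],
    "system_load": [23, 22, 22, 30, 29, 28],  # 2020 values are illustrative
})

# Wide format: hours become the index and each year becomes a column.
wide = dff.pivot(index="hour", columns="year", values="system_load")
print(wide)
```

Calling wide.plot() on this frame draws one line per column, so each year gets its own color and legend entry, and no line joins the end of one year to the start of the next.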
pandas- return Month containing Max value for each year
I have a dataframe like:
Year  Month  Value
2017      1    100
2017      2      1
2017      4      2
2018      3     88
2018      4      8
2019      5     87
2019      6      1
I'd like the dataframe to return the Month and Value for each year where the value is the maximum:
year  month  value
2017      1    100
2018      3     88
2019      5     87
I've attempted something like
df = df.groupby(["Year", "Month"])["Value"].max()
however, it returns the full data set because each Year / Month pair is unique (I believe).
You can get the index where the top Value occurs with .groupby(...).idxmax() and use that to index into the original dataframe:
In [28]: df.loc[df.groupby("Year")["Value"].idxmax()]
Out[28]:
   Year  Month  Value
0  2017      1    100
3  2018      3     88
5  2019      5     87
Here is a solution that also handles the possibility of duplicates (several months sharing the maximum value within a year):
m = df.groupby('Year')['Value'].transform('max') == df['Value']
dfmax = df.loc[m]
Full example:
import io
import pandas as pd
data = '''\
Year Month Value
2017 1 100
2017 2 1
2017 4 2
2018 3 88
2018 4 88
2019 5 87
2019 6 1'''
# io.StringIO replaces pd.compat.StringIO, which is no longer available in current pandas
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+')
# True for every row whose Value equals the maximum of its Year
m = df.groupby('Year')['Value'].transform('max') == df['Value']
print(df[m])
   Year  Month  Value
0  2017      1    100
3  2018      3     88
4  2018      4     88
5  2019      5     87
how to avoid null while doing diff using period in pandas?
I have the dataframe below and I am calculating the difference from the previous value using diff with periods, but that makes the first value of each group null. Is there any way to fill that value?
Example:
df['cal_val'] = df.groupby('year')['val'].diff(periods=1)
Current output:
date     year  val  cal_val
1/3/10   2010   12      NaN
1/6/10   2010   15        3
1/9/10   2010   18        3
1/12/10  2010   20        2
1/3/11   2011   10      NaN
1/6/11   2011   12        2
1/9/11   2011   15        3
1/12/11  2011   18        3
Expected output:
date     year  val  cal_val
1/3/10   2010   12       12
1/6/10   2010   15        3
1/9/10   2010   18        3
1/12/10  2010   20        2
1/3/11   2011   10       10
1/6/11   2011   12        2
1/9/11   2011   15        3
1/12/11  2011   18        3
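In the expected output, the first row of each year carries the row's own val. One way to get that, offered as a sketch and not as the only option, is to fill the NaN left by diff with the matching val using fillna, which aligns on the index:

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "date": ["1/3/10", "1/6/10", "1/9/10", "1/12/10",
             "1/3/11", "1/6/11", "1/9/11", "1/12/11"],
    "year": [2010] * 4 + [2011] * 4,
    "val": [12, 15, 18, 20, 10, 12, 15, 18],
})

# diff() leaves NaN on the first row of each year; fillna(df["val"])
# replaces each NaN with the val from the same row.
df["cal_val"] = df.groupby("year")["val"].diff(periods=1).fillna(df["val"])
print(df)
```

The column comes back as float because diff introduces NaN before the fill; appending .astype(int) converts it back if integer output is wanted.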