Trouble with NaNs: set_index().reset_index() corrupts data - indexing

I read that NaNs are problematic, but the following causes an actual corruption of my data, rather than an error. Is this a bug? Have I missed something basic in the documentation?
I would like the second command to give an error or to give the same response as the first command:
ipdb> df
year PRuid QC data
18 2007 nonQC 0 8.014261
19 2008 nonQC 0 7.859152
20 2010 nonQC 0 7.468260
21 1985 10 NaN 0.861403
22 1985 11 NaN 0.878531
23 1985 12 NaN 0.842704
24 1985 13 NaN 0.785877
25 1985 24 1 0.730625
26 1985 35 NaN 0.816686
27 1985 46 NaN 0.819271
28 1985 47 NaN 0.807050
ipdb> df.set_index(['year','PRuid','QC']).reset_index()
year PRuid QC data
0 2007 nonQC 0 8.014261
1 2008 nonQC 0 7.859152
2 2010 nonQC 0 7.468260
3 1985 10 1 0.861403
4 1985 11 1 0.878531
5 1985 12 1 0.842704
6 1985 13 1 0.785877
7 1985 24 1 0.730625
8 1985 35 1 0.816686
9 1985 46 1 0.819271
10 1985 47 1 0.807050
The value of "QC" is actually changed to 1 from NaN where it should be NaN.
Btw, for symmetry I added the ".reset_index()", but the data corruption is introduced by set_index.
And in case this is interesting, the version is:
pd.version
<module 'pandas.version' from '/usr/lib/python2.6/site-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/version.pyc'>

So this was a bug. By the end of May 2013, pandas 0.11.1 should be released with the bug fix (see comments on this question).
In the mean time, I avoided using a value with NaNs in any multiindex, for instance by using some other flag value (-99) for the NaNs in the column 'QC'.

Related

Pandas - creating new column based on data from other records

I have a pandas dataframe which has the folowing columns -
Day, Month, Year, City, Temperature.
I would like to have a new column that has the average (mean) temperature in same date (day\month) of all previous years.
Can someone please assist?
Thanks :-)
Try:
dti = pd.date_range('2000-1-1', '2021-12-1', freq='D')
temp = np.random.randint(10, 20, len(dti))
df = pd.DataFrame({'Day': dti.day, 'Month': dti.month, 'Year': dti.year,
'City': 'Nice', 'Temperature': temp})
out = df.set_index('Year').groupby(['City', 'Month', 'Day']) \
.expanding()['Temperature'].mean().reset_index()
Output:
>>> out
Day Month Year City Temperature
0 1 1 2000 Nice 12.000000
1 1 1 2001 Nice 12.000000
2 1 1 2002 Nice 11.333333
3 1 1 2003 Nice 12.250000
4 1 1 2004 Nice 11.800000
... ... ... ... ... ...
8001 31 12 2016 Nice 15.647059
8002 31 12 2017 Nice 15.555556
8003 31 12 2018 Nice 15.631579
8004 31 12 2019 Nice 15.750000
8005 31 12 2020 Nice 15.666667
[8006 rows x 5 columns]
Focus on 1st January of the dataset:
>>> df[df['Day'].eq(1) & df['Month'].eq(1)]
Day Month Year City Temperature # Mean
0 1 1 2000 Nice 12 # 12
366 1 1 2001 Nice 12 # 12
731 1 1 2002 Nice 10 # 11.33
1096 1 1 2003 Nice 15 # 12.25
1461 1 1 2004 Nice 10 # 11.80
1827 1 1 2005 Nice 12 # and so on
2192 1 1 2006 Nice 17
2557 1 1 2007 Nice 16
2922 1 1 2008 Nice 19
3288 1 1 2009 Nice 12
3653 1 1 2010 Nice 10
4018 1 1 2011 Nice 16
4383 1 1 2012 Nice 13
4749 1 1 2013 Nice 15
5114 1 1 2014 Nice 14
5479 1 1 2015 Nice 13
5844 1 1 2016 Nice 15
6210 1 1 2017 Nice 13
6575 1 1 2018 Nice 15
6940 1 1 2019 Nice 18
7305 1 1 2020 Nice 11
7671 1 1 2021 Nice 14

Pandas Avoid Multidimensional Key Error Comparing 2 Dataframes

I am stuck on a multidimensional key value error. I have a datframe that looks like this:
year RMSE index cyear Corr_to_CY
0 2000 0.279795 5 1997 0.997975
1 2011 0.299011 2 1994 0.997792
2 2003 0.368341 1 1993 0.977143
3 2013 0.377902 23 2015 0.824441
4 1999 0.41495 10 2002 0.804633
5 1997 0.435813 8 2000 0.752724
6 2018 0.491003 24 2016 0.703359
7 2002 0.505771 3 1995 0.684926
8 2009 0.529308 17 2009 0.580481
9 2015 0.584146 27 2019 0.556555
10 2004 0.620946 26 2018 0.500790
11 2016 0.659388 22 2014 0.443543
12 1993 0.700942 19 2011 0.431615
13 2006 0.748086 11 2003 0.375111
14 2007 0.766675 21 2013 0.323143
15 2020 0.827913 12 2004 0.149202
16 2014 0.884109 7 1999 0.002438
17 2012 0.900184 0 1992 -0.351615
18 1995 0.919482 28 2020 -0.448915
19 1992 0.930512 20 2012 -0.563762
20 2001 0.967834 18 2010 -0.613170
21 2019 1.00497 9 2001 -0.677590
22 2005 1.00885 13 2005 -0.695690
23 2010 1.159125 14 2006 -0.843122
24 2017 1.173262 15 2007 -0.931034
25 1994 1.179737 6 1998 -0.939697
26 2008 1.212915 25 2017 -0.981626
27 1996 1.308853 16 2008 -0.985893
28 1998 1.396771 4 1996 -0.999990
I have selected the conditions for column values of 'Corr_to_CY' >= 0.70 and to return values of 'cyear' column into a new df called 'cyears'. I need to use this as an index to find the year and RMSE value where the 'year' column is in cyears df. This is my best attempt and I get the value error: cannot index with multidimensional key. Do I need to change the index df "cyears" to something else - series, list, etc for this to work? thank you and here is my code that produces the error:
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
cyears = cyears.to_frame()
result = comp.loc[comp['year'] == cyears,'RMSE']
ValueError: Cannot index with multidimensional key
You can use isin method:
import pandas as pd
# Sample creation
import io
comp = pd.read_csv(io.StringIO('year,RMSE,index,cyear,Corr_to_CY\n2000,0.279795,5,1997,0.997975\n2011,0.299011,2,1994,0.997792\n2003,0.368341,1,1993,0.977143\n2013,0.377902,23,2015,0.824441\n1999,0.41495,10,2002,0.804633\n1997,0.435813,8,2000,0.752724\n2018,0.491003,24,2016,0.703359\n2002,0.505771,3,1995,0.684926\n2009,0.529308,17,2009,0.580481\n2015,0.584146,27,2019,0.556555\n2004,0.620946,26,2018,0.500790\n2016,0.659388,22,2014,0.443543\n1993,0.700942,19,2011,0.431615\n2006,0.748086,11,2003,0.375111\n2007,0.766675,21,2013,0.323143\n2020,0.827913,12,2004,0.149202\n2014,0.884109,7,1999,0.002438\n2012,0.900184,0,1992,-0.351615\n1995,0.919482,28,2020,-0.448915\n1992,0.930512,20,2012,-0.563762\n2001,0.967834,18,2010,-0.613170\n2019,1.00497,9,2001,-0.677590\n2005,1.00885,13,2005,-0.695690\n2010,1.159125,14,2006,-0.843122\n2017,1.173262,15,2007,-0.931034\n1994,1.179737,6,1998,-0.939697\n2008,1.212915,25,2017,-0.981626\n1996,1.308853,16,2008,-0.985893\n1998,1.396771,4,1996,-0.999990\n'))
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
result = comp.loc[comp['year'].isin(cyears),'RMSE']
If you want to keep cyears as pandas DataFrame instead of Series, try the following:
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7, ['cyear']]
result = comp.loc[comp['year'].isin(cyears.cyear),'RMSE']

unexpected result with pandas plot [duplicate]

This question already has answers here:
Plotting Pandas DataFrame from pivot
(2 answers)
Closed 1 year ago.
I have a dataframe as below:
year hour system_load
0 2013 0 23
1 2013 1 22
2 2013 2 22
3 2013 3 21
4 2013 4 19
.. ... ... ...
187 2020 19 30
188 2020 20 29
189 2020 21 28
190 2020 22 28
191 2020 23 28
I am trying to plot by the following code:
dff.plot(x='hour',y='system_load')
The result as follows:
What I want is removing these lines between start and end of the curve each year and refers to each curve by its year and finally plot each curve with different color.
You need to pivot your data.
dff.pivot(index='hour', columns='year', values='system_load').plot()

How to create one variable conditional on other variables in R

I have a very big data-frame like the following:
Region year prate
1 2005 24
1 2006 17
1 2007 56
2 2005 13
2 2006 65
2 2007 43
3 2005 91
3 2006 65
3 2007 12
.....
I want to create a new variable called prate07 in which the variable for year 2007 is the value of prate in that year and the value for other years is 0. Something like the following:
Region year prate prate07
1 2005 24 0
1 2006 17 0
1 2007 56 56
2 2005 13 0
2 2006 65 0
2 2007 43 43
3 2005 91 0
3 2006 65 0
3 2007 12 12
.....
May someone please help me to find the code for it?
Thanks for the help in advance
I used the following code, but it does not work:
library(tidyverse)
dat2 <- dat %>%
mutate(group2 = str_c("p_rate", year), prate07 = prate) %>%
spread(group2, prate07, fill = 0)

how to avoid null while doing diff using period in pandas?

I have the below dataframe and I am calculating the different with the previous value using diff periods but that makes the first value as Null, is there any way to fill that value?
example:
df['cal_val'] = df.groupby('year')['val'].diff(periods=1)
current output:
date year val cal_val
1/3/10 2010 12 NaN
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 NaN
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3
expected output:
date year val cal_val
1/3/10 2010 12 12
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 10
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3