unexpected result with pandas plot [duplicate] - pandas

This question already has answers here:
Plotting Pandas DataFrame from pivot
I have a dataframe as below:
year hour system_load
0 2013 0 23
1 2013 1 22
2 2013 2 22
3 2013 3 21
4 2013 4 19
.. ... ... ...
187 2020 19 30
188 2020 20 29
189 2020 21 28
190 2020 22 28
191 2020 23 28
I am trying to plot it with the following code:
dff.plot(x='hour',y='system_load')
The result is as follows: all the years are drawn as one continuous line, with stray segments connecting the end of each year's curve to the start of the next.
What I want is to remove those lines between the end and start of each year's curve, refer to each curve by its year, and plot each year's curve in a different color.

You need to pivot your data.
dff.pivot(index='hour', columns='year', values='system_load').plot()
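Here is a fuller sketch of the same idea, with sample data standing in for dff (assumed to have the year, hour, and system_load columns shown above):
import matplotlib.pyplot as plt
import pandas as pd

# Sample data standing in for dff (a sketch, not the full dataset)
dff = pd.DataFrame({
    'year': [2013, 2013, 2013, 2020, 2020, 2020],
    'hour': [0, 1, 2, 0, 1, 2],
    'system_load': [23, 22, 22, 30, 29, 28],
})

# Pivoting makes 'hour' the shared x-axis and gives each year its own
# column, so .plot() draws one separately colored, legend-labeled line
# per year, with no segment joining one year to the next.
ax = dff.pivot(index='hour', columns='year', values='system_load').plot()
ax.set_ylabel('system_load')
plt.show()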

Related

Pandas Avoid Multidimensional Key Error Comparing 2 Dataframes

I am stuck on a multidimensional key value error. I have a dataframe that looks like this:
year RMSE index cyear Corr_to_CY
0 2000 0.279795 5 1997 0.997975
1 2011 0.299011 2 1994 0.997792
2 2003 0.368341 1 1993 0.977143
3 2013 0.377902 23 2015 0.824441
4 1999 0.41495 10 2002 0.804633
5 1997 0.435813 8 2000 0.752724
6 2018 0.491003 24 2016 0.703359
7 2002 0.505771 3 1995 0.684926
8 2009 0.529308 17 2009 0.580481
9 2015 0.584146 27 2019 0.556555
10 2004 0.620946 26 2018 0.500790
11 2016 0.659388 22 2014 0.443543
12 1993 0.700942 19 2011 0.431615
13 2006 0.748086 11 2003 0.375111
14 2007 0.766675 21 2013 0.323143
15 2020 0.827913 12 2004 0.149202
16 2014 0.884109 7 1999 0.002438
17 2012 0.900184 0 1992 -0.351615
18 1995 0.919482 28 2020 -0.448915
19 1992 0.930512 20 2012 -0.563762
20 2001 0.967834 18 2010 -0.613170
21 2019 1.00497 9 2001 -0.677590
22 2005 1.00885 13 2005 -0.695690
23 2010 1.159125 14 2006 -0.843122
24 2017 1.173262 15 2007 -0.931034
25 1994 1.179737 6 1998 -0.939697
26 2008 1.212915 25 2017 -0.981626
27 1996 1.308853 16 2008 -0.985893
28 1998 1.396771 4 1996 -0.999990
I have selected the rows where the 'Corr_to_CY' column value is >= 0.70 and returned the 'cyear' column values into a new df called 'cyears'. I need to use this as an index to find the year and RMSE value where the 'year' column is in the cyears df. Below is my best attempt, and it produces the error "cannot index with multidimensional key". Do I need to change the index df "cyears" to something else (a Series, a list, etc.) for this to work? Thank you; here is my code that produces the error:
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
cyears = cyears.to_frame()
result = comp.loc[comp['year'] == cyears,'RMSE']
ValueError: Cannot index with multidimensional key
You can use the isin method:
import pandas as pd
# Sample creation
import io
comp = pd.read_csv(io.StringIO('year,RMSE,index,cyear,Corr_to_CY\n2000,0.279795,5,1997,0.997975\n2011,0.299011,2,1994,0.997792\n2003,0.368341,1,1993,0.977143\n2013,0.377902,23,2015,0.824441\n1999,0.41495,10,2002,0.804633\n1997,0.435813,8,2000,0.752724\n2018,0.491003,24,2016,0.703359\n2002,0.505771,3,1995,0.684926\n2009,0.529308,17,2009,0.580481\n2015,0.584146,27,2019,0.556555\n2004,0.620946,26,2018,0.500790\n2016,0.659388,22,2014,0.443543\n1993,0.700942,19,2011,0.431615\n2006,0.748086,11,2003,0.375111\n2007,0.766675,21,2013,0.323143\n2020,0.827913,12,2004,0.149202\n2014,0.884109,7,1999,0.002438\n2012,0.900184,0,1992,-0.351615\n1995,0.919482,28,2020,-0.448915\n1992,0.930512,20,2012,-0.563762\n2001,0.967834,18,2010,-0.613170\n2019,1.00497,9,2001,-0.677590\n2005,1.00885,13,2005,-0.695690\n2010,1.159125,14,2006,-0.843122\n2017,1.173262,15,2007,-0.931034\n1994,1.179737,6,1998,-0.939697\n2008,1.212915,25,2017,-0.981626\n1996,1.308853,16,2008,-0.985893\n1998,1.396771,4,1996,-0.999990\n'))
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7,'cyear']
result = comp.loc[comp['year'].isin(cyears),'RMSE']
If you want to keep cyears as a pandas DataFrame instead of a Series, try the following:
# Operations
cyears = comp.loc[comp['Corr_to_CY']>= 0.7, ['cyear']]
result = comp.loc[comp['year'].isin(cyears.cyear),'RMSE']

pandas- return Month containing Max value for each year

I have a dataframe like:
Year Month Value
2017 1 100
2017 2 1
2017 4 2
2018 3 88
2018 4 8
2019 5 87
2019 6 1
I'd like the dataframe to return the Month and Value for each year where the Value is the maximum:
year month value
2017 1 100
2018 3 88
2019 5 87
I've attempted something like df = df.groupby(["Year", "Month"])['Value'].max(); however, it returns the full data set because each Year/Month pair is unique (I believe).
You can get the index where the top Value occurs with .groupby(...).idxmax() and use that to index into the original dataframe:
In [28]: df.loc[df.groupby("Year")["Value"].idxmax()]
Out[28]:
Year Month Value
0 2017 1 100
3 2018 3 88
5 2019 5 87
Here is a solution that also handles the possibility of ties for the maximum (note that 2018 has two months tied at 88 in the example below):
m = df.groupby('Year')['Value'].transform('max') == df['Value']
dfmax = df.loc[m]
Full example:
import io
import pandas as pd
data = '''\
Year Month Value
2017 1 100
2017 2 1
2017 4 2
2018 3 88
2018 4 88
2019 5 87
2019 6 1'''
fileobj = io.StringIO(data)  # pd.compat.StringIO was removed in pandas 1.0
df = pd.read_csv(fileobj, sep=r'\s+')
m = df.groupby('Year')['Value'].transform('max') == df['Value']
print(df[m])
Year Month Value
0 2017 1 100
3 2018 3 88
4 2018 4 88
5 2019 5 87

how to avoid null while doing diff using period in pandas?

I have the dataframe below, and I am calculating the difference from the previous value using diff with periods, but that leaves the first value of each group as null. Is there any way to fill that value?
example:
df['cal_val'] = df.groupby('year')['val'].diff(periods=1)
current output:
date year val cal_val
1/3/10 2010 12 NaN
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 NaN
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3
expected output:
date year val cal_val
1/3/10 2010 12 12
1/6/10 2010 15 3
1/9/10 2010 18 3
1/12/10 2010 20 2
1/3/11 2011 10 10
1/6/11 2011 12 2
1/9/11 2011 15 3
1/12/11 2011 18 3
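A minimal sketch of one way to get the expected output, assuming the desired fill is simply the row's own val (which is what the expected output above shows for each year's first row): take the grouped diff and fill its NaNs from the val column.
import pandas as pd

# Sample data mirroring the question
df = pd.DataFrame({
    'year': [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011],
    'val':  [12, 15, 18, 20, 10, 12, 15, 18],
})

# diff() leaves the first row of each year as NaN; fillna() replaces
# that NaN with the row's own val, matching the expected output
df['cal_val'] = df.groupby('year')['val'].diff(periods=1).fillna(df['val'])
print(df)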

yearly average from monthly daterange data

I have the following table in PostgreSQL:
Value period
1 [2017-01-01,2017-02-01)
2 [2017-02-01,2017-03-01)
3 [2017-03-01,2017-04-01)
4 [2017-04-01,2017-05-01)
5 [2017-05-01,2017-06-01)
6 [2017-06-01,2017-07-01)
7 [2017-07-01,2017-08-01)
8 [2017-08-01,2017-09-01)
9 [2017-09-01,2017-10-01)
10 [2017-10-01,2017-11-01)
11 [2017-11-01,2017-12-01)
12 [2017-12-01,2018-01-01)
13 [2018-01-01,2018-02-01)
14 [2018-02-01,2018-03-01)
15 [2018-03-01,2018-04-01)
16 [2018-04-01,2018-05-01)
17 [2018-05-01,2018-06-01)
18 [2018-06-01,2018-07-01)
19 [2018-07-01,2018-08-01)
20 [2018-08-01,2018-09-01)
21 [2018-09-01,2018-10-01)
22 [2018-10-01,2018-11-01)
23 [2018-11-01,2018-12-01)
24 [2018-12-01,2019-01-01)
25 [2019-01-01,2019-02-01)
26 [2019-02-01,2019-03-01)
27 [2019-03-01,2019-04-01)
28 [2019-04-01,2019-05-01)
29 [2019-05-01,2019-06-01)
30 [2019-06-01,2019-07-01)
31 [2019-07-01,2019-08-01)
32 [2019-08-01,2019-09-01)
33 [2019-09-01,2019-10-01)
34 [2019-10-01,2019-11-01)
35 [2019-11-01,2019-12-01)
36 [2019-12-01,2020-01-01)
37 [2020-01-01,2020-02-01)
38 [2020-02-01,2020-03-01)
39 [2020-03-01,2020-04-01)
40 [2020-04-01,2020-05-01)
41 [2020-05-01,2020-06-01)
42 [2020-06-01,2020-07-01)
How can I get the yearly average from this monthly data in PostgreSQL?
Note: Column Value is type integer and column period is type daterange.
The expected result should be
6.5 2017
18.5 2018
30.5 2019
39.5 2020
If your periods always span exactly one month, including the lower bound and excluding the upper, you could try this:
select
    avg(value * 1.0) as average,
    extract(year from lower(period)) as year
from my_table  -- "table" is a reserved word in PostgreSQL; use your actual table name
group by year
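For comparison, the same aggregation sketched in pandas (this page's main topic), assuming the lower bound of each period has been parsed into a datetime column; the column names here are hypothetical:
import pandas as pd

# Hypothetical frame: 'start' holds the lower bound of each monthly period
df = pd.DataFrame({
    'value': range(1, 43),
    'start': pd.date_range('2017-01-01', periods=42, freq='MS'),
})

# Group by the year of the lower bound and average, as in the SQL above
result = df.groupby(df['start'].dt.year)['value'].mean()
print(result)  # 2017: 6.5, 2018: 18.5, 2019: 30.5, 2020: 39.5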

Trouble with NaNs: set_index().reset_index() corrupts data

I read that NaNs are problematic, but the following causes an actual corruption of my data, rather than an error. Is this a bug? Have I missed something basic in the documentation?
I would like the second command to give an error or to give the same response as the first command:
ipdb> df
year PRuid QC data
18 2007 nonQC 0 8.014261
19 2008 nonQC 0 7.859152
20 2010 nonQC 0 7.468260
21 1985 10 NaN 0.861403
22 1985 11 NaN 0.878531
23 1985 12 NaN 0.842704
24 1985 13 NaN 0.785877
25 1985 24 1 0.730625
26 1985 35 NaN 0.816686
27 1985 46 NaN 0.819271
28 1985 47 NaN 0.807050
ipdb> df.set_index(['year','PRuid','QC']).reset_index()
year PRuid QC data
0 2007 nonQC 0 8.014261
1 2008 nonQC 0 7.859152
2 2010 nonQC 0 7.468260
3 1985 10 1 0.861403
4 1985 11 1 0.878531
5 1985 12 1 0.842704
6 1985 13 1 0.785877
7 1985 24 1 0.730625
8 1985 35 1 0.816686
9 1985 46 1 0.819271
10 1985 47 1 0.807050
The value of "QC" is actually changed to 1 from NaN where it should be NaN.
By the way, I added the ".reset_index()" for symmetry, but the data corruption is introduced by set_index.
And in case this is interesting, the version is:
pd.version
<module 'pandas.version' from '/usr/lib/python2.6/site-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/version.pyc'>
So this was a bug. By the end of May 2013, pandas 0.11.1 should be released with the bug fix (see comments on this question).
In the meantime, I avoided using a column with NaNs in any MultiIndex, for instance by substituting some other flag value (-99) for the NaNs in the column 'QC'.
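A minimal sketch of that workaround, with hypothetical sample rows mirroring the frame above: substitute the flag before building the MultiIndex, then restore NaN once the index has been round-tripped.
import numpy as np
import pandas as pd

# Hypothetical sample rows mirroring the question's dataframe
df = pd.DataFrame({
    'year':  [2007, 1985, 1985],
    'PRuid': ['nonQC', '10', '24'],
    'QC':    [0, np.nan, 1],
    'data':  [8.014261, 0.861403, 0.730625],
})

df['QC'] = df['QC'].fillna(-99)  # flag value stands in for NaN
df = df.set_index(['year', 'PRuid', 'QC']).reset_index()
df['QC'] = df['QC'].replace(-99, np.nan)  # restore NaN afterwards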