Pandas Shift Column & Remove Row - pandas

I have a dataframe 'df1' that has 2 columns, and I need to shift the 2nd column down a row and then remove the entire top row of df1.
My data looks like this:
year ER12
0 2017 -2.05
1 2018 1.05
2 2019 -0.04
3 2020 -0.60
4 2021 -99.99
And, I need it to look like this:
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60

We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
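For completeness, here is a self-contained version of the snippet above (a sketch, assuming the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({'year': [2017, 2018, 2019, 2020, 2021],
                   'ER12': [-2.05, 1.05, -0.04, -0.6, -99.99]})

# Shift ER12 down one row (top row becomes NaN), drop the NaN row,
# and renumber the index so it starts from 0 again.
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
```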

This works on your example (shifting year up one row is equivalent to shifting ER12 down and dropping the top row):
import pandas as pd
df = pd.DataFrame({'year': [2017, 2018, 2019, 2020, 2021],
                   'ER12': [-2.05, 1.05, -0.04, -0.6, -99.99]})
df['year'] = df['year'].shift(-1)        # shift years up instead of values down
df = df.dropna().reset_index(drop=True)  # drop the now-incomplete last row
df['year'] = df['year'].astype(int)      # shift() casts year to float; restore int

Related

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I have a DF that looks like the one below, and I need to add a new column to it containing the cumulative deviation. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, with the data in columns Z3:Z14 and AA3:AA14, the calculation for the first row would be =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1, for the next row =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1, and so on down the column, with the last row being =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1.
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
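Here is a self-contained version of that one-liner as a new column, assuming the sample data from the question (the running sums divided element-wise reproduce the Excel formula row by row):

```python
import pandas as pd

df = pd.DataFrame({
    "Curr Yr": [667590.5985, 701655.5967, 667260.5368, 795338.8914,
                516510.1103, 465717.9192, 366100.4456, 355089.157,
                468479.4396, 569234.4156, 719505.8569, 670304.78],
    "LT Avg": [594474.2003, 585753.1173, 575550.6112, 562312.5309,
               501330.4306, 418087.1358, 344854.2453, 351539.9371,
               496831.2979, 570767.1723, 594368.6991, 576495.7539],
})

# Running total of each column, divided element-wise, minus 1 --
# the same as Excel's =SUM(Z$3:Zn)/SUM(AA$3:AAn)-1 on each row.
df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1
```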

Rolling Rows in pandas.DataFrame

I have a dataframe that looks like this:
year  month  valueCounts
2019  1      73.411285
2019  2      53.589128
2019  3      71.103842
2019  4      79.528084
I want valueCounts column's values to be rolled like:
year  month  valueCounts
2019  1      53.589128
2019  2      71.103842
2019  3      79.528084
2019  4      NaN
I can do this by dropping first index of dataframe and assigning last index to NaN but it doesn't look efficient. Is there any simpler method to do this?
Thanks.
Assuming your dataframe is already sorted, use shift:
df['valueCounts'] = df['valueCounts'].shift(-1)
print(df)
# Output
year month valueCounts
0 2019 1 53.589128
1 2019 2 71.103842
2 2019 3 79.528084
3 2019 4 NaN
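If the data spans multiple years and the shift should not carry values across a year boundary, a groupby variant (a sketch, not part of the original answer) keeps the shift within each year:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "month": [1, 2, 1, 2],
    "valueCounts": [73.411285, 53.589128, 71.103842, 79.528084],
})

# shift(-1) within each year group: the last month of every year becomes NaN,
# and no value leaks from 2020 back into 2019.
df["valueCounts"] = df.groupby("year")["valueCounts"].shift(-1)
```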

np.where multi-conditional based on another column

I have two dataframes.
df_1:
Year  ID  Flag
2021  1   1
2020  1   0
2021  2   1
df_2:
Year  ID
2021  1
2020  2
I'm looking to add the flag from df_1 to df_2 based on ID and year. I think I need to use an np.where statement, but I'm having a hard time figuring it out. Any ideas?
You can use pandas.merge() to combine df1 and df2 with an outer join. (Use np.nan rather than pd.NaT as the placeholder; NaT is meant for datetime data, not a numeric flag.)
import numpy as np

df2["Flag"] = np.nan
df2["Flag"].update(df2.merge(df1, on=["Year", "ID"], how="outer")["Flag_y"])
print(df2)
   Year  ID  Flag
0  2021   1   1.0
1  2020   2   NaN
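A simpler route (a sketch, not the original answer) is a plain left join: every row of df_2 is kept, the flag comes across where (Year, ID) matches, and unmatched rows get NaN automatically:

```python
import pandas as pd

df_1 = pd.DataFrame({"Year": [2021, 2020, 2021],
                     "ID": [1, 1, 2],
                     "Flag": [1, 0, 1]})
df_2 = pd.DataFrame({"Year": [2021, 2020],
                     "ID": [1, 2]})

# Left join keeps every row of df_2; Flag is NaN where no (Year, ID) match exists
df_2 = df_2.merge(df_1, on=["Year", "ID"], how="left")
```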

Subsetting pandas dataframe based on two columnar values

I am trying to subset a large dataframe (5000+ rows and 15 columns) based on unique values from two columns (both are dtype = object). I want to exclude rows of data that meet the following criteria:
A column called 'Record' equals "MO" AND a column called 'Year' equals "2017" or "2018".
Here is an example of the dataframe:
df = pd.DataFrame({'A': [1001,2002,3003,4004,5005,6006,7007,8008,9009], 'Record' : ['MO','MO','I','I','MO','I','MO','I','I'], 'Year':[2017,2019,2018,2020,2018,2018,2020,2019,2017]})
print(df)
A Record Year
0 1001 MO 2017
1 2002 MO 2019
2 3003 I 2018
3 4004 I 2020
4 5005 MO 2018
5 6006 I 2018
6 7007 MO 2020
7 8008 I 2019
8 9009 I 2017
I would like any row with both "MO" and "2017", as well as both "MO" and "2018" taken out of the dataframe.
Example where the right rows (0 and 4 in dataframe above) are deleted:
df = pd.DataFrame({'A': [2002,3003,4004,6006,7007,8008,9009], 'Record' : ['MO','I','I','I','MO','I','I'], 'Year':[2019,2018,2020,2018,2020,2019,2017]})
print(df)
A Record Year
0 2002 MO 2019
1 3003 I 2018
2 4004 I 2020
3 6006 I 2018
4 7007 MO 2020
5 8008 I 2019
6 9009 I 2017
I have tried the following code, but it does not work (I tried at first for just one year):
df = df[(df['Record'] != "MO" & df['Year'] != "2017")]
I believe you're just missing some parentheses:
df = df[(df['Record'] != "MO") & (df['Year'] != 2017)]
(Note that Year holds integers in your example dataframe, so compare against 2017, not the string "2017".)
Edit:
After some clarification: to remove only the rows where Record is "MO" and Year is 2017 or 2018, negate the combined condition. Watch operator precedence here; & binds tighter than |, so an unparenthesized ... & ... | ... would also drop every 2018 row regardless of Record. Using isin avoids the problem:
df = df[~((df['Record'] == 'MO') &
          (df['Year'].isin([2017, 2018])))]
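As a check, applying the negated mask (with Year compared as an integer, matching the example data) drops exactly rows 0 and 4:

```python
import pandas as pd

df = pd.DataFrame({'A': [1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009],
                   'Record': ['MO', 'MO', 'I', 'I', 'MO', 'I', 'MO', 'I', 'I'],
                   'Year': [2017, 2019, 2018, 2020, 2018, 2018, 2020, 2019, 2017]})

# Drop rows where Record == "MO" AND Year is 2017 or 2018
mask = (df['Record'] == 'MO') & (df['Year'].isin([2017, 2018]))
df = df[~mask].reset_index(drop=True)
```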

Reindexing Multiindex dataframe

I have Multiindex dataframe and I want to reindex it. However, I get 'duplicate axis error'.
Product Date col1
A September 2019 5
October 2019 7
B September 2019 2
October 2019 4
How can I achieve output like this?
Product Date col1
A January 2019 0
February 2019 0
March 2019 0
April 2019 0
May 2019 0
June 2019 0
July 2019 0
August 2019 0
September 2019 5
October 2019 7
B January 2019 0
February 2019 0
March 2019 0
April 2019 0
May 2019 0
June 2019 0
July 2019 0
August 2019 0
September 2019 2
October 2019 4
First I tried this:
nested_df = nested_df.reindex(annual_date_range, level = 1, fill_value = 0)
Secondly,
nested_df = nested_df.reset_index().set_index('Date')
nested_df = nested_df.reindex(annual_date_range, fill_value = 0)
You should do the following for each month:
df.loc[('A', 'January 2019'), :] = (0)
df.loc[('B', 'January 2019'), :] = (0)
Let df1 be your first data frame with non-zero values. The approach is to create another data frame df with zero values and merge both data frames to obtain the result.
dates = ['{month}-2019'.format(month=month) for month in range(1,9)]*2
length = int(len(dates)/2)
products = ['A']*length + ['B']*length
Col1 = [0]*len(dates)
df = pd.DataFrame({'Dates': dates, 'Products': products, 'Col1':Col1}).set_index(['Products','Dates'])
Now the MultiIndex date level is converted to datetime (set_levels no longer supports inplace=True in recent pandas, so assign the result back to the index):
df.index = df.index.set_levels(
    pd.to_datetime(df.index.get_level_values(1)[:8]).strftime('%m-%Y'), level=1)
In df1 you have to do the same, i.e. change the datetime MultiIndex level to the same format:
df1.index = df1.index.set_levels(
    pd.to_datetime(df1.index.get_level_values(1)[:2]).strftime('%m-%Y'), level=1)
I did it because otherwise (for example if datetimes are formatted like %B %y) the sorting of the MultiIndex by months goes wrong. Now it is sufficient to merge both data frames:
result = pd.concat([df1,df]).sort_values(['Products','Dates'])
The final move is to change the datetime format back:
result.index = result.index.set_levels(
    pd.to_datetime(result.index.get_level_values(1)[:10]).strftime('%B %Y'), level=1)
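An alternative sketch (not the original answer) that sidesteps the duplicate-axis error entirely: build the full (Product, Date) grid with MultiIndex.from_product and reindex against it. The month range January–October here is an assumption about the desired output:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"col1": [5, 7, 2, 4]},
    index=pd.MultiIndex.from_tuples(
        [("A", "September 2019"), ("A", "October 2019"),
         ("B", "September 2019"), ("B", "October 2019")],
        names=["Product", "Date"]),
)

# Target months as "Month Year" strings, January through October 2019
months = [pd.Timestamp(f"2019-{m:02d}-01").strftime("%B %Y") for m in range(1, 11)]

# Cartesian product of the products and the month list gives the full target index
full_idx = pd.MultiIndex.from_product(
    [df1.index.get_level_values("Product").unique(), months],
    names=["Product", "Date"])

# Existing rows keep their values; every missing (Product, Date) pair becomes 0
result = df1.reindex(full_idx, fill_value=0)
```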