dataframe - timespan between timestamps based on value of other column - pandas

I have a pandas dataframe with the index containing years and a column containing dividend payouts. I now want to determine how many years the company has continuously paid out dividends (column dividend > 0).
As an example, for the following table I want the result to be 2 (2019+2018)
year dividend
2019 1.89
2018 1.70
2017 0
2016 1.5
And for this one 4
year dividend
2019 1.89
2018 1.70
2017 1.6
2016 1.58

Even though the answer below is roundabout, it can solve your problem.
Convert the data to a pd.DataFrame, slice from the first row up to idxmin(), and count the non-zero entries with ne(0) to find the run of continuous dividend payments.
year dividend
2019 1.89
2018 1.70
2017 0
2016 1.5
df = pd.DataFrame({'dividend' : [1.89,1.70,0,1.5]}, index=[2019,2018,2017,2016])
print('Continuous dividend value :', df.loc[ : df.dividend.idxmin(), 'dividend'].ne(0).sum())
Continuous dividend value : 2
year dividend
2019 1.89
2018 1.70
2017 1.6
2016 1.58
df = pd.DataFrame({'dividend' : [1.89,1.70,1.6,1.58]}, index=[2019,2018,2017,2016])
print('Continuous dividend value :', df.loc[ : df.dividend.idxmin(), 'dividend'].ne(0).sum())
Continuous dividend value : 4
Edit
Based on your comment, I just played with loops; I hope this satisfies your requirements. If not, let me know.
value = 0 if df['bool'].iloc[0] == 0 else (len(df) if len(df) == df.iloc[-1::]['bool'].values[0] else len(df.loc[: df[df['bool'].duplicated(keep = 'last')].index[0]]))
print('Continuous dividend value : ', value)
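One caveat on the idxmin() approach: it relies on the smallest dividend being either a zero or the oldest row, so a low but non-zero payout in a middle year would cut the slice short. A minimal alternative sketch, assuming the rows are sorted newest first as in the question, is to flag the paying years and take a cumulative product, which drops to 0 at the first non-paying year and stays there. The function name consecutive_dividend_years is made up for illustration.

```python
import pandas as pd


def consecutive_dividend_years(dividends: pd.Series) -> int:
    """Count leading rows with dividend > 0 (rows assumed newest first)."""
    # 1 for a paying year, 0 otherwise
    paid = dividends.gt(0).astype(int)
    # The running product stays 1 until the first 0, then is 0 for good
    return int(paid.cumprod().sum())


years = [2019, 2018, 2017, 2016]
with_gap = pd.Series([1.89, 1.70, 0, 1.5], index=years)
no_gap = pd.Series([1.89, 1.70, 1.6, 1.58], index=years)

print(consecutive_dividend_years(with_gap))
print(consecutive_dividend_years(no_gap))
```

If the index is sorted oldest first instead, calling sort_index(ascending=False) on the series beforehand should give the same count.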

Related

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I have a DF that looks like this below and I need to find the cumulative deviation as shown in a new column to the DF. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, the calculation would look like this with data in Excel columns Z3:Z14, AA3:AA14 for the first row: =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1 and for the next row: =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1 and for the next as follows with the last row looking like this in the Excel example: =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
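To store the result as the new column the question asks for, the same expression can be assigned back to the dataframe. A small sketch using only the first three rows of the question's data (the column name "Cum Dev" is taken from the desired output):

```python
import pandas as pd

# First three rows of the question's data, for brevity
df = pd.DataFrame({
    "year": [2022, 2022, 2022],
    "month": [1, 2, 3],
    "Curr Yr": [667590.5985, 701655.5967, 667260.5368],
    "LT Avg": [594474.2003, 585753.1173, 575550.6112],
})

# Ratio of the running totals, minus 1, mirrors =SUM(Z$3:Zn)/SUM(AA$3:AAn)-1
df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1
print(df)
```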

Pandas Shift Column & Remove Row

I have a dataframe 'df1' that has 2 columns and I need to shift the 2nd column down a row and then remove the entire top row of df1.
My data looks like this:
year ER12
0 2017 -2.05
1 2018 1.05
2 2019 -0.04
3 2020 -0.60
4 2021 -99.99
And, I need it to look like this:
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
We can try this:
df = df.assign(ER12=df.ER12.shift()).dropna().reset_index(drop=True)
print(df)
year ER12
0 2018 -2.05
1 2019 1.05
2 2020 -0.04
3 2021 -0.60
This works on your example:
import pandas as pd
df = pd.DataFrame({'year':[2017,2018,2019,2020,2021], 'ER12':[-2.05,1.05,-0.04,-0.6,-99.99]})
df['year'] = df['year'].shift(-1)
df = df.dropna()
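Both answers lean on dropna() to remove the row that shift() leaves empty, which would also drop any row that already contained a genuine NaN; shifting 'year' additionally turns that column into floats. A hedged variant that avoids both, assuming only the first row has to go, slices it off by position:

```python
import pandas as pd

df1 = pd.DataFrame({
    "year": [2017, 2018, 2019, 2020, 2021],
    "ER12": [-2.05, 1.05, -0.04, -0.60, -99.99],
})

shifted = (
    df1.assign(ER12=df1["ER12"].shift(1))  # move ER12 down one row
    .iloc[1:]                              # drop the top row by position
    .reset_index(drop=True)                # renumber from 0
)
print(shifted)
```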

Pandas Implement Equation & Groupby 2 Conditions

I have data that looks like this below and I'm trying to calculate the CRMSE (centered root mean squared error) by site_name and year. Maybe I need an agg function or a lambda function to do this for each set of groupby parameters (plant_name, year). The dataframe data for df3m1:
plant_name year month obsvals modelvals
0 ARIZONA I 2021 1 8.90 8.30
1 ARIZONA I 2021 2 7.98 7.41
2 CAETITE I 2021 1 9.10 7.78
3 CAETITE I 2021 2 6.05 6.02
The equation that I need to implement by plant_name and year looks like this:
crmse = df3m1.groupby(['plant_name','year'])(( (df3m1.obsvals - df3m1.obsvals.mean()) -
(df3m1.modelvals - df3m1.modelvals.mean()) ) ** 2).mean() ** .5
Integrating a groupby and a calculation at the same time is still a bit advanced for me. Thank you. The final dataframe would look like:
plant_name year crmse
0 ARIZONA I 2021 ?
1 CAETITE I 2021 ?
I have tried things like this with groupby -
crmse = df3m1.groupby(['plant_name','year'])(( (df3m1.obsvals -
df3m1.obsvals.mean()) - (df3m1.modelvals - df3m1.modelvals.mean()) )
** 2).mean() ** .5
but get errors like this:
TypeError: 'DataFrameGroupBy' object is not callable
Using groupby is correct. Normally we would follow it with .agg, but computing CRMSE involves multiple columns (obsvals and modelvals), so we use .apply instead: it passes each group's entire dataframe to the function, and we pick the columns we need inside it.
Code:
import numpy as np
import pandas as pd

def crmse(x, y):
    # Remove each series' own mean, then take the RMS of the difference
    return np.sqrt(np.mean(np.square( (x - x.mean()) - (y - y.mean()) )))

def f(df):
    return pd.Series(crmse(df['obsvals'], df['modelvals']), index=['crmse'])

crmse_series = (
    df3m1
    .groupby(['plant_name', 'year'])
    .apply(f)
)
crmse_series
crmse
plant_name year
ARIZONA I 2021 0.015
CAETITE I 2021 0.645
You can merge the series into the original dataframe with merge.
df = df3m1.merge(crmse_series, on=['plant_name', 'year'])
df
plant_name year month obsvals modelvals crmse
0 ARIZONA I 2021 1 8.90 8.30 0.015
1 ARIZONA I 2021 2 7.98 7.41 0.015
2 CAETITE I 2021 1 9.10 7.78 0.645
3 CAETITE I 2021 2 6.05 6.02 0.645
See Also:
Apply multiple functions to multiple groupby columns
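For the flat plant_name / year / crmse layout requested in the question, a self-contained sketch along the same lines is shown here. It selects the two value columns before calling apply, so the function never sees the grouping columns, and uses reset_index to turn the group keys back into ordinary columns. The helper name crmse_of_group is made up for illustration.

```python
import numpy as np
import pandas as pd

df3m1 = pd.DataFrame({
    "plant_name": ["ARIZONA I", "ARIZONA I", "CAETITE I", "CAETITE I"],
    "year": [2021, 2021, 2021, 2021],
    "month": [1, 2, 1, 2],
    "obsvals": [8.90, 7.98, 9.10, 6.05],
    "modelvals": [8.30, 7.41, 7.78, 6.02],
})


def crmse_of_group(group: pd.DataFrame) -> float:
    # Center both series on their own group mean
    obs_anom = group["obsvals"] - group["obsvals"].mean()
    mod_anom = group["modelvals"] - group["modelvals"].mean()
    # Root mean square of the difference between the centered series
    return float(np.sqrt(((obs_anom - mod_anom) ** 2).mean()))


result = (
    df3m1.groupby(["plant_name", "year"])[["obsvals", "modelvals"]]
    .apply(crmse_of_group)
    .reset_index(name="crmse")
)
print(result)
```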

Subsetting pandas dataframe based on two columnar values

I am trying to subset a large dataframe (5000+ rows and 15 columns) based on unique values from two columns (both are dtype = object). I want to exclude rows of data that meet the following criteria:
A column called 'Record' equals "MO" AND a column called 'Year' equals "2017" or "2018".
Here is an example of the dataframe:
df = pd.DataFrame({'A': [1001,2002,3003,4004,5005,6006,7007,8008,9009], 'Record' : ['MO','MO','I','I','MO','I','MO','I','I'], 'Year':[2017,2019,2018,2020,2018,2018,2020,2019,2017]})
print(df)
A Record Year
0 1001 MO 2017
1 2002 MO 2019
2 3003 I 2018
3 4004 I 2020
4 5005 MO 2018
5 6006 I 2018
6 7007 MO 2020
7 8008 I 2019
8 9009 I 2017
I would like any row with both "MO" and "2017", as well as both "MO" and "2018" taken out of the dataframe.
Example where the right rows (0 and 4 in dataframe above) are deleted:
df = pd.DataFrame({'A': [2002,3003,4004,6006,7007,8008,9009], 'Record' : ['MO','I','I','I','MO','I','I'], 'Year':[2019,2018,2020,2018,2020,2019,2017]})
print(df)
A Record Year
0 2002 MO 2019
1 3003 I 2018
2 4004 I 2020
3 6006 I 2018
4 7007 MO 2020
5 8008 I 2019
6 9009 I 2017
I have tried the following code, but it does not work (I tried at first for just one year):
df = df[(df['Record'] != "MO" & df['Year'] != "2017")]
I believe you're just missing some parentheses. The & operator binds more tightly than !=, so each comparison needs its own pair:
df = df[(df['Record'] != "MO") & (df['Year'] != "2017")]
Edit:
After some clarification:
df = df[~((df['Record']=='MO')&
((df['Year']=='2017')|
(df['Year']=='2018')))]
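The same filter can be written with isin(), which avoids having to get the & / | grouping right by hand. This is a sketch on the question's sample data; because the sample stores Year as integers while the question describes the column as dtype object, the years are compared as strings here (an assumption made so the mask works in either case).

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009],
    'Record': ['MO', 'MO', 'I', 'I', 'MO', 'I', 'MO', 'I', 'I'],
    'Year': [2017, 2019, 2018, 2020, 2018, 2018, 2020, 2019, 2017],
})

# Rows to exclude: Record is "MO" AND Year is 2017 or 2018
is_mo = df['Record'].eq('MO')
year_is_excluded = df['Year'].astype(str).isin(['2017', '2018'])

filtered = df[~(is_mo & year_is_excluded)].reset_index(drop=True)
print(filtered)
```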

Pandas 1.0 create column of months from year and date

I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01
So you can pad the month value in a series, and then reformat to get a datetime for all of the values:
month = df.Month.astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(pd.Series([int('%d%s' % (x,y)) for x,y in zip(df['Year'],month)], index=df.index),format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can reformat the date to be a string to match exactly your format:
df['MonthIdx'] = df['MonthIdx'].apply(lambda x: x.strftime('%Y/%m/%d'))
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01
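As an alternative sketch, pd.to_datetime can also assemble dates from a dataframe whose columns are named year, month and day, which skips the string padding and keeps the original row labels (38, 65, 92 in the question). The intermediate name parts and the extra column MonthIdxText are made up for illustration.

```python
import pandas as pd

# Same values and row labels as in the question
df = pd.DataFrame({"Year": [2020, 2021, 2022], "Month": [4, 4, 4]},
                  index=[38, 65, 92])

# to_datetime assembles a date from year/month/day columns;
# the constant day=1 is broadcast to every row
parts = pd.DataFrame({"year": df["Year"], "month": df["Month"], "day": 1})
df["MonthIdx"] = pd.to_datetime(parts)

# Optional: a string version in the exact YYYY/MM/DD layout
df["MonthIdxText"] = df["MonthIdx"].dt.strftime("%Y/%m/%d")
print(df)
```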