Cumulative product for specific groups of observations in pandas

I have a dataset of the following type
Date ID window var
0 1998-01-28 X -5 8.500e-03
1 1998-01-28 Y -5 1.518e-02
2 1998-01-29 X -4 8.005e-03
3 1998-01-29 Y -4 7.905e-03
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
My aim is to calculate the cumulative product of the variable var according to the variable window. The idea is that for every date I have identified a window of 5 days around that date (the variable window goes from -5 to 5). Now, I want to calculate the cumulative product within the window that belongs to a specific date. For example, the first date (1998-01-28) has a window value of -5 and thus represents the starting point for the calculation of the cumprod. I want a new variable called cumprod which is exactly var on the date where window is -5, then the cumulative product of the values of var at -5 and -4, and so on until window is equal to 5. This defines the value of cumprod for the first group of dates, where every group is defined by consecutive dates such that window starts at -5 and ends at 5. I then repeat this for every group of dates. I would therefore obtain something like
Date ID window var cumprod
0 1998-01-28 X -5 8.500e-03 8.500e-03
1 1998-01-28 Y -5 1.518e-02 1.518e-02
2 1998-01-29 X -4 8.005e-03 6.80425e-05
3 1998-01-29 Y -4 7.905e-03 0.00011999790000000002
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
where I gave an example of cumprod for the first 2 dates.
How could I achieve this? I was thinking of finding a way to attach an identifier to every group of dates and then running some sort of cumprod() method using .groupby(group_identifier). I can't work out how to do it, though. Would it be possible to simplify it by using a rolling function on window? Any other kind of approach is very welcome.

I suggest the following
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame({"Date": pd.date_range("1998-01-28", freq="d", periods=22),
"window": np.concatenate([np.arange(-5,6,1),np.arange(-5,6,1)]),
"var": np.random.randint(1,10,22)
})
My df is similar to yours:
Date window var
0 1998-01-28 -5 3
1 1998-01-29 -4 3
2 1998-01-30 -3 7
3 1998-01-31 -2 2
4 1998-02-01 -1 4
5 1998-02-02 0 7
6 1998-02-03 1 2
7 1998-02-04 2 1
8 1998-02-05 3 2
9 1998-02-06 4 1
10 1998-02-07 5 1
11 1998-02-08 -5 4
12 1998-02-09 -4 5
Then I create a grouping variable and transform var using cumprod:
df = df.sort_values("Date") # My df is already sorted by Date given the way
# I created it, but I add this to make sure yours is sorted by date
df["group"] = (df["window"] == -5).cumsum()
df = pd.concat([df, df.groupby("group")["var"].transform("cumprod")], axis=1)
And the result is:
Date window var group var
0 1998-01-28 -5 3 1 3
1 1998-01-29 -4 3 1 9
2 1998-01-30 -3 7 1 63
3 1998-01-31 -2 2 1 126
4 1998-02-01 -1 4 1 504
5 1998-02-02 0 7 1 3528
6 1998-02-03 1 2 1 7056
7 1998-02-04 2 1 1 7056
8 1998-02-05 3 2 1 14112
9 1998-02-06 4 1 1 14112
10 1998-02-07 5 1 1 14112
11 1998-02-08 -5 4 2 4
12 1998-02-09 -4 5 2 20
13 1998-02-10 -3 1 2 20
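Note that my example frame has a single row per date, while your data also has an ID column (X and Y) and you want the result in a column named cumprod. A minimal sketch of the same idea adapted to that case (untested against your data, assuming the column names from your question):
df = df.sort_values(["ID", "Date"])           # keep each ID's dates consecutive
df["group"] = (df["window"] == -5).cumsum()   # a new group starts whenever window == -5
df["cumprod"] = df.groupby(["ID", "group"])["var"].cumprod()
df = df.sort_index()                          # restore the original row order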

Related

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing an 'ID' and 'DAY' columns, which shows when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened 30 days apart, tops. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
mask = (df
        .sort_values(by=['ID', 'DAY'])
        .groupby('ID')['DAY']
        .diff().lt('30d')
        .sort_index()
        )
df[~mask]
NB: the potential drawback of this approach is that if the customer makes a new complaint within the 30 days, this restarts the threshold for the next complaint.
output:
ID DAY
0 1 2020-03-22
2 2 2020-10-05
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample the data per group to 30 days:
(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-10-05
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and diffing the DAY column in each group:
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
from datetime import timedelta
m = timedelta(days=30)
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert back to the original date format, you can use dt.strftime:
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
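If you need the stricter behaviour flagged in the note above, where a dropped complaint should not restart the 30-day window, one option is to track the last kept date per customer. A minimal, non-vectorized sketch (the helper keep_every_30_days is hypothetical, and DAY is assumed to already be datetime):
def keep_every_30_days(group):
    # keep a row only if it falls 30+ days after the last row we kept
    kept, last = [], None
    for day in group['DAY']:
        if last is None or (day - last) >= pd.Timedelta(days=30):
            kept.append(True)
            last = day
        else:
            kept.append(False)
    return group[kept]

out = (df.sort_values(['ID', 'DAY'])
         .groupby('ID', group_keys=False)
         .apply(keep_every_30_days))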

Find Max Gradient by Row in For Loop Pandas

I have a df of 15 x 4 and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row, using an "S" and "N" value for each min or max in the rows below. I'm not sure that this is the most pythonic way to do this. My df "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = (ms.maxSlats.iloc[i] - ms.minNlats.iloc[i]) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T  # need to transpose
print(grad)
I obtain the correct answer but I'm wondering if there is a cleaner way to do this to obtain the same answer below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,
Use np.where to compute the gradient and keep only the last duplicated index.
grad = np.where(ms.maxSlats > ms.maxNlats,
                (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms)+1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
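If you just want the gradient values as a column on the original frame, rather than as the index of a new one, a minimal variant of the same np.where idea (a sketch, assuming ms as above; its sample index is all zeros, so reset it first):
ms = ms.reset_index(drop=True)
ms['grad'] = np.where(ms.maxSlats > ms.maxNlats,
                      -(ms.maxSlats - ms.minNlats),
                      ms.maxNlats - ms.minSlats)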

Python/Pandas: Transformation of column within a list of columns

I'd like to select a subset of columns from a DataFrame while applying a transformation to some of those columns at the same time. Is it possible to transform a column when that column is selected as one in a list of columns?
For example, I have a column StartDate of type np.datetime64 that I'd like to extract the month from.
When dealing with that Series on its own, I'd do something like
print(df['StartDate'].transform(lambda x: x.month))
to see the transformed data. Can I accomplish the same thing when the above expression is part of a list of columns? Something like:
print(df[['ColumnA', 'ColumnB', 'StartDate'.transform(lambda x: x.month)]])
Of course the above gives the error
AttributeError: 'str' object has no attribute 'month'
So, if my data looks like:
Metadata | Metadata | 2020-01-01
Metadata | Metadata | 2020-02-06
Metadata | Metadata | 2020-02-25
I'd like to see:
Metadata | Metadata | 1
Metadata | Metadata | 2
Metadata | Metadata | 2
Without appending a new separate "Month" column to the DataFrame. Is this possible?
If you have some data like below:
df = pd.DataFrame({'col1': np.random.randint(10, size=366),
                   'col2': np.random.randint(10, size=366),
                   'StartDate': pd.date_range('2018', '2019')})
which looks like
col1 col2 StartDate
0 0 2 2018-01-01
1 8 0 2018-01-02
2 0 5 2018-01-03
3 3 4 2018-01-04
4 8 6 2018-01-05
... ... ... ...
361 8 8 2018-12-28
362 9 9 2018-12-29
363 4 1 2018-12-30
364 2 4 2018-12-31
365 0 9 2019-01-01
You could redefine the column, or you could use assign to create a temporary modified copy, like:
df.assign(StartDate = df['StartDate'].dt.month)
which outputs:
col1 col2 StartDate
0 0 2 1
1 8 0 1
2 0 5 1
3 3 4 1
4 8 6 1
... ... ... ...
361 8 8 12
362 9 9 12
363 4 1 12
364 2 4 12
365 0 9 1
This also doesn't change the original dataframe. If you want to create a permanent version, then just reassign.
df = df.assign(StartDate = df['StartDate'].dt.month)
You could also take this further, such as:
df.assign(StartDate = df['StartDate'].dt.month, col1 = df['col1'] + 100)[['col1', 'StartDate']]
You can apply whatever transform you need and then access any columns you want after assigning these transforms.
col1 StartDate
0 105 1
1 109 1
2 108 1
3 101 1
4 108 1
... ... ...
361 104 12
362 102 12
363 109 12
364 102 12
365 100 1
I guess you could use the name attribute of the Series.
Something like:
dt_to_month = lambda x: [d.month for d in x] if x.name == 'StartDate' else x
df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month)
will do the trick.
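A slightly more direct variant of the same idea, relying on the .dt accessor instead of a list comprehension (a sketch, assuming StartDate is a datetime column as above):
df[['ColumnA', 'ColumnB', 'StartDate']].apply(
    lambda s: s.dt.month if s.name == 'StartDate' else s)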

Pandas: Get a column value where another column is the minimum in a sub-grouping [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
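If you want method #1 to end up with the same clean 0..n-1 index as method #2, just reset it afterwards:
df.loc[df.groupby("item")["diff"].idxmin()].reset_index(drop=True)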
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If multiple minimal values per group are possible and you want all of the min rows, use boolean indexing with transform to get the minimal value per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answers work great if there is (or you want) only one min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
1. Sort items by the minimum value you want
2. Drop the duplicates of the column you want to sort with
3. Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated to flag every non-minimal row:
df_sorted = df.sort_values(by='diff')
df_sorted[~df_sorted.duplicated(subset='item', keep='first')]

How to repeat a dataframe

I have a simple csv dataframe as follow:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
I would like to repeat this dataframe over 10 years, with the year stamp changing accordingly, as follow:
Date,Data
2000-01-31,9
2000-02-28,8
2000-03-31,7
2000-04-30,6
2000-05-31,5
2000-06-30,4
2000-07-31,3
2000-08-31,2
2000-09-30,1
2000-10-31,0
2000-11-30,11
2000-12-31,12
2001-01-31,9
2001-02-28,8
2001-03-31,7
2001-04-30,6
2001-05-31,5
2001-06-30,4
2001-07-31,3
2001-08-31,2
2001-09-30,1
2001-10-31,0
2001-11-30,11
2001-12-31,12
....
How can I do that?
You can do this just using concat:
n = 2
Newdf = pd.concat([df] * n, keys=range(n))
Newdf.Date += pd.to_timedelta(Newdf.index.get_level_values(level=0), 'Y')
Newdf.reset_index(level=0, drop=True, inplace=True)
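Note that recent pandas versions deprecate (and newer ones reject) the 'Y' unit in pd.to_timedelta, and a fixed-length "year" timedelta does not shift calendar dates exactly. A sketch of an alternative using a calendar-aware DateOffset instead (assuming Date is already parsed as datetime):
n = 2
parts = [df.assign(Date=df["Date"] + pd.DateOffset(years=i)) for i in range(n)]
Newdf = pd.concat(parts, ignore_index=True)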
Try:
df1 = pd.concat([df] * 10)
date_fix = pd.date_range(start='2000-01-31', freq='M', periods=len(df1))
df1['Date'] = date_fix
df1
[out]
Date Data
0 2000-01-31 9
1 2000-02-29 8
2 2000-03-31 7
3 2000-04-30 6
4 2000-05-31 5
5 2000-06-30 4
6 2000-07-31 3
... ... ...
5 2009-06-30 4
6 2009-07-31 3
7 2009-08-31 2
8 2009-09-30 1
9 2009-10-31 0
10 2009-11-30 11
11 2009-12-31 12
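If you would rather not keep the repeated 0-11 index visible in the output above, pass ignore_index=True to concat (or reset the index afterwards), e.g.:
df1 = pd.concat([df] * 10, ignore_index=True)
df1['Date'] = pd.date_range(start='2000-01-31', freq='M', periods=len(df1))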