pandas filtering after a groupby with groupby-specific filter conditions?

I've seen a number of great solutions to "filter after a groupby" where the filter condition is fixed ("hey, group by name and then look for everyone over the age of 21", wherein 21 is fixed). I'm instead looking for a way to filter based on the results of a groupby.
example:
df = pd.DataFrame({'person': ['Sue', 'Sue', 'Sue', 'Bill', 'Alfonso'],
                   'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-02-01', '2019-03-01'],
                   'my_value': [5, 10, 20, 10, 5],
                   'my_other_value': [3, 2, 9, 6, 8]})
I want to be able to ask a question along the lines of:
"starting from the first time a person has a my_value of 10, tell me the mean of my_other_value for all successive records".
In the example, the first date that Sue has a my_value == 10 is 2019-01-02, so her mean for my_other_value is (2+9)/2 = 5.5, which comes from 2019-01-02 and 2019-01-03. Bill only has one entry, but it does have a my_value of 10, so his mean of my_other_value is 6. Alfonso, sadly, never has a my_value of 10, so he's not even included in the final tally.
So, I started off with
df2 = df.query('my_value == 10').groupby('person').first().reset_index()
which gets me the first time a person has a my_value of 10. From this I know the person and the date it happened. So in English, I want to now filter those results for that person, so that I can do a .mean() but only including rows for that person with a date >= the date I learned from the call to first(). I'm stuck, of course.
I kinda sorta was hoping something like this would work:
df3 = df.groupby('person').apply( lambda x: x['date'] >= df2['date']).mean()
but I know that can't really work because how does the lambda know to match up the correct person in the df.groupby() with the same person in the df2 grouping?
Another option was thinking "hey maybe there's a version of expanding() that can start with something other than the very first record"
Crossing my fingers that one of the above approaches is directionally correct and some hero shows up to say "oh, you are so close, just add in this little extra part!"

"oh, you are so close, just add in this little extra part!"
See below for the little extra part.
df = pd.DataFrame({'person': ['Sue', 'Sue', 'Sue', 'Bill', 'Alfonso'],
                   'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-02-01', '2019-03-01'],
                   'my_value': [5, 10, 20, 10, 5],
                   'my_other_value': [3, 2, 9, 6, 8]})
df = df.sort_values(['person', 'date']).reset_index(drop=True)
>>> df
person date my_value my_other_value
0 Alfonso 2019-03-01 5 8
1 Bill 2019-02-01 10 6
2 Sue 2019-01-01 5 3
3 Sue 2019-01-02 10 2
4 Sue 2019-01-03 20 9
Find first date of my_value == 10
df2 = df.query('my_value == 10').groupby('person').first()['date'].reset_index()
df2 = df2.rename(columns={'date': 'first_date'})
>>> df2
person first_date
0 Bill 2019-02-01
1 Sue 2019-01-02
Merge the DataFrames
df_merged = pd.merge(df, df2, how='left', on=['person'])
>>> df_merged
person date my_value my_other_value first_date
0 Alfonso 2019-03-01 5 8 NaN
1 Bill 2019-02-01 10 6 2019-02-01
2 Sue 2019-01-01 5 3 2019-01-02
3 Sue 2019-01-02 10 2 2019-01-02
4 Sue 2019-01-03 20 9 2019-01-02
Calculate mean of my_other_value
grouped = df_merged[df_merged['date'] >= df_merged['first_date']].groupby('person')
>>> grouped['my_other_value'].mean()
person
Bill 6.0
Sue 5.5
Name: my_other_value, dtype: float64
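As an aside (not part of the original answer, just a sketch of the same idea): once df is sorted by person and date, you can skip the merge entirely by building a cumulative flag that turns on at each person's first my_value == 10 and stays on for all later rows. The hit_10 and result names below are just for illustration.
# Sketch only, assuming df is already sorted by person and date as above.
# hit_10 is True from each person's first my_value == 10 onward; people who
# never hit 10 are simply filtered out.
hit_10 = df['my_value'].eq(10).astype(int).groupby(df['person']).cummax().astype(bool)
result = df[hit_10].groupby('person')['my_other_value'].mean()
print(result)  # Bill 6.0, Sue 5.5 -- same as the merge-based result above
This avoids the helper first_date column, at the cost of being a little less explicit about which date each person's window starts on.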

Related

Groupby count between multiple date ranges since last-contact date

I have customer data, and campaign data recording each time we have contacted them. We don't contact every customer in every campaign, so their last contacted (touched) date varies. How can I achieve a groupby count between two dates that vary for each cust_id?
import pandas as pd
import io
tempCusts=u"""cust_id, lastBookedDate
1, 10-02-2022
2, 20-04-2022
3, 25-07-2022
4, 10-06-2022
5, 10-05-2022
6, 10-08-2022
7, 01-01-2021
8, 02-06-2022
9, 11-12-2021
10, 10-05-2022
"""
tempCamps=u"""cust_id,campaign_id,campaignMonth,campaignYear,touch,campaignDate,campaignNum
1,CN2204,4,2022,1,01-04-2022,1
2,CN2204,4,2022,1,01-04-2022,1
3,CN2204,4,2022,1,01-04-2022,1
4,CN2204,4,2022,1,01-04-2022,1
5,CN2204,4,2022,1,01-04-2022,1
6,CN2204,4,2022,1,01-04-2022,1
7,CN2204,4,2022,1,01-04-2022,1
8,CN2204,4,2022,1,01-04-2022,1
9,CN2204,4,2022,1,01-04-2022,1
10,CN2204,4,2022,1,01-04-2022,1
1,CN2205,5,2022,1,01-05-2022,2
2,CN2205,5,2022,1,01-05-2022,2
3,CN2205,5,2022,1,01-05-2022,2
4,CN2205,5,2022,1,01-05-2022,2
5,CN2205,5,2022,1,01-05-2022,2
6,CN2206,6,2022,1,01-06-2022,3
7,CN2206,6,2022,1,01-06-2022,3
8,CN2206,6,2022,1,01-06-2022,3
9,CN2206,6,2022,1,01-06-2022,3
10,CN2206,6,2022,1,01-06-2022,3"""
campaignDets = pd.read_csv(io.StringIO(tempCamps), parse_dates=True)
customerDets = pd.read_csv(io.StringIO(tempCusts), parse_dates=True)
Campaign details (campaignDets) lists every customer who was part of a campaign; some (most) appear in multiple campaigns as they continue to be contacted, so cust_id is duplicated, but not within a single campaign. The customer details (customerDets) show if/when they last had an appointment.
cust_id 1: lastBookedDate 10-02-2022, so touchCount since then == 2
cust_id 2: lastBookedDate 20-04-2022, so touchCount since then == 1
...
This is what I'm attempting to achieve:
desired=u"""cust_id,lastBookedDate, touchesSinceBooked
1,10-02-2022,2
2,20-04-2022,1
3,25-07-2022,0
4,10-06-2022,0
5,10-05-2022,0
6,10-08-2022,0
7,01-01-2021,2
8,02-06-2022,0
9,11-12-2021,2
10,10-05-2022,1
"""
desiredDf = pd.read_csv(io.StringIO(desired), parse_dates=True)
>>> desiredDf
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2
1 2 20-04-2022 1
2 3 25-07-2022 0
3 4 10-06-2022 0
4 5 10-05-2022 0
5 6 10-08-2022 0
6 7 01-01-2021 2
7 8 02-06-2022 0
8 9 11-12-2021 2
9 10 10-05-2022 1
I've attempted to work around the guidance given on not-dissimilar problems, but these either rely on a fixed date to group on, or haven't worked within the constraints here (unless I'm missing something). I have not yet been able to cross-relate previous questions, and I'm sure that the simplicity of what I'm after can't require some awful groupby split by user into a list of DataFrames, pulling them back out and looping through a max() of each user_id's campaignDate. Surely not. Can I apply pd.merge_asof within this?
Those examples i've taken advice from that are along the same lines:
44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr
31772863/count-number-of-rows-between-two-dates-by-id-in-a-pandas-groupby-dataframe/31773404
Constraints?
None. I'm happy to use any available library and/or helper columns.
Neither data source/df is especially large (custDets is ~120k rows and campaignDets ~600k), and I have time, so optimised approaches, though welcome, are secondary to actual solutions.
First, convert to datetime (note the leading space in ' lastBookedDate': the tempCusts header has a space after the comma, so that is the column name read_csv produces):
customerDets['lastBookedDate'] = pd.to_datetime(customerDets[' lastBookedDate'], dayfirst=True)
campaignDets['campaignDate'] = pd.to_datetime(campaignDets['campaignDate'], dayfirst=True)
Then, keep the campaign rows whose campaignDate is later than that customer's last booked date:
df = campaignDets[campaignDets['campaignDate'] >
                  campaignDets['cust_id'].map(customerDets.set_index('cust_id')['lastBookedDate'])]
Finally, add your new column:
customerDets['touchesSinceBooked'] = customerDets['cust_id'].map(df.groupby('cust_id')['touch'].sum()).fillna(0)
You'll get
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2.0
1 2 20-04-2022 1.0
2 3 25-07-2022 0.0
3 4 10-06-2022 0.0
4 5 10-05-2022 0.0
5 6 10-08-2022 0.0
6 7 01-01-2021 2.0
7 8 02-06-2022 0.0
8 9 11-12-2021 2.0
9 10 10-05-2022 1.0
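For completeness, and since the question asks about pd.merge_asof: it isn't really needed here, because this is a plain inequality filter rather than a nearest-key join. A rough equivalent sketch using an ordinary merge instead of map(), assuming the datetime conversions above have already been applied (the merged and after_booking names are just for illustration):
# Sketch only: join each campaign row to its customer's lastBookedDate,
# keep touches strictly after that date, and count them per customer.
merged = campaignDets.merge(customerDets[['cust_id', 'lastBookedDate']],
                            on='cust_id', how='left')
after_booking = merged[merged['campaignDate'] > merged['lastBookedDate']]
customerDets['touchesSinceBooked'] = (customerDets['cust_id']
                                      .map(after_booking.groupby('cust_id')['touch'].sum())
                                      .fillna(0)
                                      .astype(int))
The trailing .astype(int) just keeps the counts as integers, matching the desired output, instead of the floats shown above.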

Drop Duplicates based on Nearest Datetime condition

import pandas as pd

def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

df = pd.read_csv("C:/Files/input.txt", dtype=str)
duplicatesDf = df[df.duplicated(subset=['CLASS_ID', 'START_TIME', 'TEACHER_ID'], keep=False)]
duplicatesDf['START_TIME'] = pd.to_datetime(duplicatesDf['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
print(duplicatesDf)
print(df['START_TIME'].dt.date)  # fails here: df['START_TIME'] is still a string column
df:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
2,123,2020/06/01 20:47:26.000,o1,2020/06/04 20:47:26.000
3,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:47:26.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
So, I've got a dataframe like the one mentioned above. As you can see, I have multiple records with the same CLASS_ID, START_TIME and TEACHER_ID. Whenever multiple records like these are present, I would like to retain only one record, based on the condition that the retained record should have its END_TIME nearest to its START_TIME (at minute-level precision).
In this case,
for CLASS_ID 123, the record with ID 1 will be retained, as its END_TIME 2020/06/02 00:00:00.000 is nearest to its START_TIME 2020/06/01 20:47:26.000, compared to the record with ID 2 whose END_TIME is 2020/06/04 20:47:26.000. Similarly, for CLASS_ID 789, the record with ID 4 will be retained.
Hence the expected output will be:
ID,CLASS_ID,START_TIME,TEACHER_ID,END_TIME
1,123,2020/06/01 20:47:26.000,o1,2020/06/02 00:00:00.000
4,789,2020/06/01 20:47:26.000,o3,2020/06/03 14:40:00.000
5,456,2020/06/01 20:47:26.000,o5,2020/06/08 20:00:26.000
I've been going through the following links,
https://stackoverflow.com/a/32237949,
https://stackoverflow.com/a/33043374
to find a solution but have unfortunately reached an impasse.
Hence, would some kind soul mind helping me out a bit? Many thanks.
IIUC, we can use .loc and idxmin(): after creating a helper column measuring the elapsed time between the start and the end (this assumes START_TIME and END_TIME have already been converted with pd.to_datetime), we apply idxmin() as a groupby operation on your CLASS_ID column.
df.loc[
    df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
      .groupby("CLASS_ID")["mins"]
      .idxmin()
]
ID CLASS_ID START_TIME TEACHER_ID END_TIME
0 1 123 2020-06-01 20:47:26 o1 2020-06-02 00:00:00
4 5 456 2020-06-01 20:47:26 o5 2020-06-08 20:00:26
3 4 789 2020-06-01 20:47:26 o3 2020-06-03 14:40:00
In steps.
Time delta:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))[['CLASS_ID','mins']])
CLASS_ID mins
0 123 0 days 03:12:34
1 123 3 days 00:00:00
2 789 1 days 18:00:00
3 789 1 days 17:52:34
4 456 6 days 23:13:00
Minimum index from the time delta column, whilst grouping by CLASS_ID:
print(df.assign(mins=(df["END_TIME"] - df["START_TIME"]))
        .groupby("CLASS_ID")["mins"]
        .idxmin())
CLASS_ID
123 0
456 4
789 3
Name: mins, dtype: int64
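One caveat worth making explicit: the timedelta arithmetic above only works once both time columns are datetimes. A minimal sketch of that conversion, assuming the raw columns are strings in the %Y/%m/%d %H:%M:%S.%f format shown in the question:
# Convert both columns before computing END_TIME - START_TIME.
df['START_TIME'] = pd.to_datetime(df['START_TIME'], format='%Y/%m/%d %H:%M:%S.%f')
df['END_TIME'] = pd.to_datetime(df['END_TIME'], format='%Y/%m/%d %H:%M:%S.%f')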

Time column interval filter

I have a dataframe with a "Fecha" column. I would like to reduce the DataFrame's size by filtering it, keeping only the rows that fall on a 10-minute multiple and discarding all the rows that do not.
Any ideas?
Thanks
I have to guess some variable names. But assuming your dataframe name is df, the solution should look similar to:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = df[df['Fecha'].dt.minute % 10 == 0]
The first line guarantees that your 'Fecha' column is in datetime format. The second line keeps only the rows whose minute is a multiple of 10, using the modulo operator % (note the .dt accessor, which is needed to get the minute of a datetime column).
Since I'm not sure if this solves your problem, here's a minimal example that runs by itself:
import pandas as pd
idx = pd.date_range(pd.Timestamp(2020, 1, 1), periods=60, freq='1T')
series = pd.Series(1, index=idx)
series = series[series.index.minute % 10 == 0]
series
The first three lines construct a series with a 1 minute index, which is filtered in the fourth line.
Output:
2020-01-01 00:00:00 1
2020-01-01 00:10:00 1
2020-01-01 00:20:00 1
2020-01-01 00:30:00 1
2020-01-01 00:40:00 1
2020-01-01 00:50:00 1
dtype: int64
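If 'Fecha' is an ordinary column of your DataFrame (as in the question) rather than the index, the same filter goes through the .dt accessor. A small self-contained sketch with a made-up frame (the 'valor' column is just a placeholder):
import pandas as pd

# Hypothetical frame: one row per minute plus a value column.
df = pd.DataFrame({'Fecha': pd.date_range('2020-01-01', periods=60, freq='1T'),
                   'valor': range(60)})
df = df[df['Fecha'].dt.minute % 10 == 0]   # keep only rows on 10-minute multiples
print(df)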

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name: if a name matches, add that row's count and time values to the first dataframe.
In [2]: df1
Out[2]:
     Name  count     time
0     Bob    123  4:12:10
1   Alice     99  1:01:12
2  Sergei     78  0:18:01
85 rows x 3 columns

In [3]: df2
Out[3]:
   Name  count     time
0  Rick      9  0:13:00
1  Jone      7  0:24:21
2   Bob     10  0:15:13
105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume the time columns in both df1 and df2 are already timedeltas. If they are strings, you need to convert them before running the above commands, as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
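Putting the steps together, a self-contained sketch using only the three overlapping rows shown in the question (an assumption on my part; the real frames have 85 and 105 rows):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'Alice', 'Sergei'],
                    'count': [123, 99, 78],
                    'time': ['4:12:10', '1:01:12', '0:18:01']})
df2 = pd.DataFrame({'Name': ['Rick', 'Jone', 'Bob'],
                    'count': [9, 7, 10],
                    'time': ['0:13:00', '0:24:21', '0:15:13']})

# Convert the time strings, align on Name, and add. update() only overwrites
# where the sum is not NaN, i.e. for names present in both frames (Bob here).
df1['time'] = pd.to_timedelta(df1['time'])
df2['time'] = pd.to_timedelta(df2['time'])

df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
print(df1)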

Why use to_frame before reset_index?

Using a data set like this one
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
we often see this pattern:
df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
But we get exactly the same result from
df.groupby(['user_id'])['module_id'].count().reset_index(name='count')
(N.B. we need the additional rename in the former because, after to_frame(), we are calling DataFrame.reset_index, which does not take a name parameter, whereas Series.reset_index does take one and returns a data frame.)
Is there any advantage in using to_frame first?
(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:
Series.reset_index was added in a commit on the 27th of January 2012.
Series.to_frame was added in a commit on the 13th of October 2013.
So Series.reset_index was available over a year before Series.to_frame.)
There is no noticeable advantage to using to_frame(); both approaches achieve the same result, and it is common in pandas for several approaches to solve the same problem. The only advantage I can think of is that for larger data sets, it may be more convenient to have a DataFrame view first before resetting the index. If we take your dataframe as an example, to_frame() displays a DataFrame view that may be useful for understanding the data as a neat table versus a bare count Series. Also, using to_frame() makes the intent clearer to a new user who looks at your code for the first time.
The example dataframe:
In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
In [8]: df.head()
Out[8]:
user_id module_id week
0 3 4 4
1 1 3 4
2 1 2 2
3 1 3 4
4 1 2 2
The count() function returns a Series:
In [18]: test1 = df.groupby(['user_id'])['module_id'].count()
In [19]: type(test1)
Out[19]: pandas.core.series.Series
In [20]: test1
Out[20]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')
Using to_frame makes it explicit that you intend to convert the Series to a Dataframe. The index here is user_id:
In [22]: test1.to_frame()
Out[22]:
module_id
user_id
0 2
1 7
2 4
3 6
4 1
And now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter, and therefore we have to rename the column explicitly.
In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
In [25]: testdf1
Out[25]:
user_id count
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
Now let's look at the other case. We will use the same count() Series but assign it to test2 to differentiate between the two approaches; in other words, test1 is equal to test2.
In [26]: test2 = df.groupby(['user_id'])['module_id'].count()
In [27]: test2
Out[27]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [28]: test2.reset_index()
Out[28]:
user_id module_id
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
In [30]: testdf2 = test2.reset_index(name='count')
In [31]: testdf1 == testdf2
Out[31]:
user_id count
0 True True
1 True True
2 True True
3 True True
4 True True
As you can see, both dataframes are equivalent; in the second approach we just had to use reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.
The second case has less code but is less readable for new eyes, so I'd prefer the first approach of using to_frame() because it makes the intent clear: "convert this count series to a dataframe and rename the column 'module_id' to 'count'".
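As a small aside (not part of the original answer), the element-wise comparison above can be collapsed into a single boolean check with DataFrame.equals:
# True only if both results have the same shape, values and element dtypes.
testdf1.equals(testdf2)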