Groupby count between multiple date ranges since last-contact date - pandas

I have customer data, and campaign data recording each time we have contacted them. We don't contact every customer in every campaign, so the last contacted (touched) date varies per customer. How can I achieve a groupby count between two dates that vary for each cust_id?
import pandas as pd
import io
tempCusts=u"""cust_id, lastBookedDate
1, 10-02-2022
2, 20-04-2022
3, 25-07-2022
4, 10-06-2022
5, 10-05-2022
6, 10-08-2022
7, 01-01-2021
8, 02-06-2022
9, 11-12-2021
10, 10-05-2022
"""
tempCamps=u"""cust_id,campaign_id,campaignMonth,campaignYear,touch,campaignDate,campaignNum
1,CN2204,4,2022,1,01-04-2022,1
2,CN2204,4,2022,1,01-04-2022,1
3,CN2204,4,2022,1,01-04-2022,1
4,CN2204,4,2022,1,01-04-2022,1
5,CN2204,4,2022,1,01-04-2022,1
6,CN2204,4,2022,1,01-04-2022,1
7,CN2204,4,2022,1,01-04-2022,1
8,CN2204,4,2022,1,01-04-2022,1
9,CN2204,4,2022,1,01-04-2022,1
10,CN2204,4,2022,1,01-04-2022,1
1,CN2205,5,2022,1,01-05-2022,2
2,CN2205,5,2022,1,01-05-2022,2
3,CN2205,5,2022,1,01-05-2022,2
4,CN2205,5,2022,1,01-05-2022,2
5,CN2205,5,2022,1,01-05-2022,2
6,CN2206,6,2022,1,01-06-2022,3
7,CN2206,6,2022,1,01-06-2022,3
8,CN2206,6,2022,1,01-06-2022,3
9,CN2206,6,2022,1,01-06-2022,3
10,CN2206,6,2022,1,01-06-2022,3"""
campaignDets = pd.read_csv(io.StringIO(tempCamps), parse_dates=True)
customerDets = pd.read_csv(io.StringIO(tempCusts), parse_dates=True)
The campaign details (campaignDets) cover every customer who was part of a campaign; some (most) appear in multiple campaigns as they continue to be contacted, so cust_id is duplicated across campaigns but never within one. The customer details (customerDets) show if/when each customer last had an appointment.
cust_id 1: lastBooked 10-02-2022, so touchCount since then == 2
cust_id 2: lastBooked 20-04-2022, so touchCount since then == 1
...
This is what I'm attempting to achieve:
desired=u"""cust_id,lastBookedDate, touchesSinceBooked
1,10-02-2022,2
2,20-04-2022,1
3,25-07-2022,0
4,10-06-2022,0
5,10-05-2022,0
6,10-08-2022,0
7,01-01-2021,2
8,02-06-2022,0
9,11-12-2021,2
10,10-05-2022,1
"""
desiredDf = pd.read_csv(io.StringIO(desired), parse_dates=True)
>>> desiredDf
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2
1 2 20-04-2022 1
2 3 25-07-2022 0
3 4 10-06-2022 0
4 5 10-05-2022 0
5 6 10-08-2022 0
6 7 01-01-2021 2
7 8 02-06-2022 0
8 9 11-12-2021 2
9 10 10-05-2022 1
I've attempted to adapt the guidance given on not-dissimilar problems, but those either rely on a fixed date to group on or haven't worked within the constraints here (unless I'm missing something). I haven't been able to map previous questions onto this one, and I'm sure that something this simple can't require an awful groupby split per user into a list of DataFrames, pulling them back out and looping through a max() of each user's campaignDate. Surely not. Can I apply pd.merge_asof here?
Examples I've taken advice from along the same lines:
44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr
31772863/count-number-of-rows-between-two-dates-by-id-in-a-pandas-groupby-dataframe/31773404
Constraints?
None. I'm happy to use any available library and/or helper columns.
Neither DataFrame is especially large (customerDets ~120k rows, campaignDets ~600k rows), and I have time, so optimised approaches, though welcome, are secondary to an actual solution.

First, convert the dates to datetime (note that the sample CSV header leaves a leading space in ' lastBookedDate'):
customerDets['lastBookedDate'] = pd.to_datetime(customerDets[' lastBookedDate'], dayfirst=True)
campaignDets['campaignDate'] = pd.to_datetime(campaignDets['campaignDate'], dayfirst=True)
Then, keep only the campaign rows whose campaignDate is later than that customer's last booked date:
df = campaignDets[campaignDets['campaignDate'] > campaignDets['cust_id'].map(customerDets.set_index('cust_id')['lastBookedDate'])]
Finally, add your new column (the fillna route leaves it as floats; chain .astype(int) if you want whole numbers):
customerDets['touchesSinceBooked'] = customerDets['cust_id'].map(df.groupby('cust_id')['touch'].sum()).fillna(0)
You'll get
cust_id lastBookedDate touchesSinceBooked
0 1 10-02-2022 2.0
1 2 20-04-2022 1.0
2 3 25-07-2022 0.0
3 4 10-06-2022 0.0
4 5 10-05-2022 0.0
5 6 10-08-2022 0.0
6 7 01-01-2021 2.0
7 8 02-06-2022 0.0
8 9 11-12-2021 2.0
9 10 10-05-2022 1.0
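On the merge_asof idea: an exact cust_id match plus a date comparison is enough here, so a plain left merge works too. A minimal sketch, assuming the frames prepared above (date columns already converted):
# Left-merge every campaign touch onto its customer, zero out touches on or
# before the last booked date, then sum what is left per customer.
merged = customerDets[['cust_id', 'lastBookedDate']].merge(
    campaignDets[['cust_id', 'campaignDate', 'touch']],
    on='cust_id', how='left')
merged['touch'] = merged['touch'].where(
    merged['campaignDate'] > merged['lastBookedDate'], 0)
result = (merged.groupby(['cust_id', 'lastBookedDate'], as_index=False)['touch']
                .sum()
                .rename(columns={'touch': 'touchesSinceBooked'}))
This gives the same touchesSinceBooked counts as the map-based version; at ~600k campaign rows either approach should be fine.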

Related

How can I create a dataframe column which counts the occurrence of each value in another column?

I am trying to add a column to my dataframe which will hold a value representing the number of times a unique value has appeared in another column.
For example, I have the following dataframe:
Date|Team|Goals|
22.08.20|Team1|4|
22.08.20|Team2|3|
22.08.20|Team3|1|
22.09.20|Team1|4|
22.09.20|Team3|5|
I would like to add a counter column, which counts how often each team appears:
Date|Team|Goals|Count|
22.08.20|Team1|4|1|
22.08.20|Team2|3|1|
22.08.20|Team3|1|1|
22.09.20|Team1|4|2|
22.09.20|Team3|5|2|
My DataFrame is ordered by date, so the teams should appear in the correct order.
Apologies, I'm very new to pandas and Stack Overflow, so please let me know if I can format this question differently. Thanks
TRY:
df['Count'] = df.groupby('Team').cumcount().add(1)
OUTPUT:
Date Team Goals Count
0 22.08.20 Team1 4 1
1 22.08.20 Team2 3 1
2 22.08.20 Team3 1 1
3 22.09.20 Team1 4 2
4 22.09.20 Team3 5 2
Another answer, building upon @Nk03's, with reproducible results:
import pandas as pd
import numpy as np

# Set numpy random seed for reproducibility
np.random.seed(42)
# Create dates array
dates = pd.date_range(start='2021-06-01', periods=10, freq='D')
# Create teams array
teams_names = ['Team 1', 'Team 2', 'Team 3']
teams = [teams_names[i] for i in np.random.randint(0, 3, 10)]
# Create goals array
goals = np.random.randint(1, 6, 10)
# Create DataFrame
data = pd.DataFrame({'Date': dates,
                     'Team': teams,
                     'Goals': goals})
# Cumulative count of teams
data['Count'] = data.groupby('Team').cumcount().add(1)
The output will be:
Date Team Goals Count
0 2021-06-01 Team 2 3 1
1 2021-06-02 Team 2 1 2
2 2021-06-03 Team 2 4 3
3 2021-06-04 Team 1 2 1
4 2021-06-05 Team 2 4 4
5 2021-06-06 Team 1 2 2
6 2021-06-07 Team 2 2 5
7 2021-06-08 Team 3 4 1
8 2021-06-09 Team 3 5 2
9 2021-06-10 Team 1 2 3
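If, instead of the running count, you want the total number of appearances of each team repeated on every row, a groupby transform does it; a small sketch using the same frame:
# Total appearances of each team, broadcast back to every row
# (contrast with groupby().cumcount(), which gives the running count so far)
data['TotalCount'] = data.groupby('Team')['Team'].transform('count')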

Combine two dataframes in Pandas to generate many to many relationship

I have two lists, say
customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
I want to generate a Pandas dataframe so:
All customers and accounts are used
There is a many to many relationship between customers and accounts (one customer 'may' have multiple accounts and an account 'may' be owned by multiple customers)
I want the many to many relationship to be random. That is, some customers will have one account and some will have more than one. Similarly, some accounts will be owned by just one customer and others by more than one.
Something like,
Customer  Account
a         1
a         2
b         2
c         3
a         4
b         4
c         4
b         5
b         6
b         7
b         8
a         9
Since I am generating random data, in the worst case scenario, I can generate way too many accounts and discard the unused ones if the code is easier (essentially relaxing the requirement 1 above).
I am using sample(n=20, replace=True) to generate 20 records in both dataframes and then merging them into one based on the index. Is there an out-of-the-box API or library to do this, or is my code the recommended way?
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

customers_df = pd.DataFrame(data=customers)
customers_df = customers_df.sample(n=20, replace=True)
customers_df['new_index'] = range(20)
customers_df.set_index('new_index', inplace=True)

accounts_df = pd.DataFrame(data=accounts)
accounts_df = accounts_df.sample(n=20, replace=True)
accounts_df['new_index'] = range(20)
accounts_df.set_index('new_index', inplace=True)

combined_df = pd.merge(customers_df, accounts_df, on='new_index')
print(combined_df)
Edit: Modified the question and added sample code I have tried.
One way to accomplish this is to collect the set of all possible relationships with a cartesian product, then select from that list before building your dataframe:
import itertools
import random
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# All possible customer/account pairs
possible_associations = list(itertools.product(customers, accounts))
# Pick 20 pairs at random (with replacement) and sort for readability
df = (pd.DataFrame.from_records(random.choices(possible_associations, k=20),
                                columns=['customers', 'accounts'])
        .sort_values(['customers', 'accounts']))
print(df)
Output
customers accounts
0 a 2
3 a 2
15 a 2
18 a 4
16 a 5
14 a 7
7 a 8
12 a 8
1 a 9
2 b 5
9 b 5
8 b 8
11 b 8
19 c 2
17 c 3
5 c 4
4 c 5
6 c 5
13 c 5
10 c 7
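Note that random.choices alone does not guarantee requirement 1 (every customer and every account used at least once). If that matters, one sketch under the same assumptions is to seed the sample with one pair per customer and one per account before topping it up randomly:
import itertools
import random
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Guarantee coverage: one random account per customer, one random customer per account
pairs = [(c, random.choice(accounts)) for c in customers]
pairs += [(random.choice(customers), a) for a in accounts]
# Top up with extra random pairs, then drop any duplicates that crept in
pairs += random.choices(list(itertools.product(customers, accounts)), k=8)

df = (pd.DataFrame(pairs, columns=['customers', 'accounts'])
        .drop_duplicates()
        .sort_values(['customers', 'accounts'])
        .reset_index(drop=True))
print(df)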
To have a repeatable test result, start with np.random.seed(1) (drop it in the target version).
Then proceed as follows:
Create a list of probabilities for how many customers (owners) an account can have, e.g.:
prob = [0.5, 0.25, 0.15, 0.09, 0.01]
Generate a Series stating how many owners each account shall have:
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts),
                                 p=prob), name='Customer')
Its name is Customer because it will be the source of the Customer column.
For my sample probabilities and generator seeding, the result is:
0 1
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: Customer, dtype: int32
(the left column is the index, the right the actual values).
Because your data sample contains only 9 accounts, the result does not include any of the larger owner counts. But in your target version, with more accounts, some accounts will have more owners.
Generate the result, a cust_acct DataFrame defining the assignment of customers to accounts:
cust_acct = cnt.apply(lambda x: np.random.choice(customers, x, replace=False))\
    .explode().to_frame().join(pd.Series(accounts, name='Account')).reset_index(drop=True)
The result, for your sample data and my seeding and probabilities, is:
Customer Account
0 b 1
1 a 2
2 b 2
3 b 3
4 b 4
5 c 5
6 b 6
7 c 7
8 a 8
9 b 9
Of course, you can assume different probabilities in prob.
You can also choose another "top" number of owners (the number of entries in prob).
In this case no change to the code is needed, because the range of values in the first np.random.choice is set to accommodate the length of prob.
Note: Because your sample data contains only 3 customers, a different generator seed can raise ValueError: Cannot take a larger sample than population when 'replace=False'. This happens whenever the drawn number of owners for some account is greater than 3. With your target data, which has far more customers, this error will not occur.
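For reference, here are the pieces above assembled into one runnable block (same prob, same seed):
import numpy as np
import pandas as pd

np.random.seed(1)  # drop this in the target version

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Probabilities that an account has 1, 2, 3, 4 or 5 owners
prob = [0.5, 0.25, 0.15, 0.09, 0.01]

# How many owners each account gets
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts), p=prob),
                name='Customer')

# Draw that many distinct customers per account and attach the account ids
cust_acct = (cnt.apply(lambda x: np.random.choice(customers, x, replace=False))
                .explode()
                .to_frame()
                .join(pd.Series(accounts, name='Account'))
                .reset_index(drop=True))
print(cust_acct)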

Check if list cell contains value

Having a dataframe like this:
month transactions_ids
0 1 [0, 5, 1]
1 2 [7, 4]
2 3 [8, 10, 9, 11]
3 6 [2]
4 9 [3]
For a given transaction_id, I would like to get the month when it took place. Notice that a transaction_id can only be related to one single month.
So for example, given transaction_id = 4, the month would be 2.
I know this can be done with a loop, checking month by month whether the related transactions_ids contain the given transaction_id, but I'm wondering if there is a more efficient way.
Cheers
The best way in my opinion is to explode your data frame and avoid having python lists in your cells.
df = df.explode('transactions_ids')
which outputs
month transactions_ids
0 1 0
0 1 5
0 1 1
1 2 7
1 2 4
2 3 8
2 3 10
2 3 9
2 3 11
3 6 2
4 9 3
Then, simply
id_to_find = 1 # example
df.loc[df.transactions_ids == id_to_find, 'month']
P.S.: be aware of the duplicated indexes that explode outputs. In general, it is better to do explode(...).reset_index(drop=True) in most cases to avoid unwanted behavior.
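If you need many such lookups, you can also build a reverse index once from the original (un-exploded) frame, assuming transactions_ids holds real Python lists:
# transaction_id -> month lookup built once; each id maps to exactly one month
lookup = (df.explode('transactions_ids')
            .set_index('transactions_ids')['month'])
lookup.loc[4]  # -> 2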
You can use pandas string methods to find the id in the "list" (it's really just a string as far as pandas is concerned when read in with read_csv):
import pandas as pd
from io import StringIO
data = StringIO("""
month transactions_ids
1 [0,5,1]
2 [7,4]
3 [8,10,9,11]
6 [2]
9 [3]
""")
df = pd.read_csv(data, delim_whitespace=True)
df.loc[df['transactions_ids'].str.contains('4'), 'month']
In case your transactions_ids are real lists, you can use map to check for membership:
df['transactions_ids'].map(lambda x: 3 in x)
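That boolean mask can then be used just like the string-based one to pull out the month:
# Month of the row whose transactions_ids list contains the id 4
df.loc[df['transactions_ids'].map(lambda x: 4 in x), 'month']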

How to write this SQL in Pandas?

I have this SQL code and I want to write it in Pandas. Every example I saw uses groupby and order by outside of the window function, and that is not what I want. I don't want my data to look grouped; instead I just need a cumulative sum of my new column (reg_sum), ordered by hour for each article_id.
SELECT
    *,
    SUM(registrations) OVER (PARTITION BY article_id ORDER BY time) AS cumulative_regs
FROM table
Data example of what I need to get (reg_sum column):
article_id time registrations reg_sum
A 7 6 6
A 9 5 11
B 10 1 1
C 10 2 2
C 11 4 6
If anyone can say what is the equivalent of this in Pandas, that would be great. Thanks!
Using groupby and cumsum, this should work:
import pandas as pd
import numpy as np

# generate data
df = pd.DataFrame({'article_id': np.array(['A', 'A', 'B', 'C', 'C']),
                   'time': np.array([7, 9, 10, 10, 11]),
                   'registrations': np.array([6, 5, 1, 2, 4])})
# compute cumulative sum of registrations sorted by time and grouped by article_id
df['reg_sum'] = df.sort_values('time').groupby('article_id').registrations.cumsum()
Output:
article_id time registrations reg_sum
0 A 7 6 6
1 A 9 5 11
2 B 10 1 1
3 C 10 2 2
4 C 11 4 6
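If the frame is already ordered by time within each article_id (as it is here), the sort step can be dropped, since a groupby cumsum follows the existing row order:
# Cumulative registrations per article_id, relying on the existing row order
df['reg_sum'] = df.groupby('article_id')['registrations'].cumsum()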

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
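For example, the cleanup mentioned above looks like this; a plain .reset_index() afterwards turns id back into a column:
# Drop the original-index level, keeping only the id group key
df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)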
Sometimes sorting the whole dataset ahead of time is very time consuming.
We can group by first and take the top k of each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here, sort_values with ascending=False behaves like nlargest and ascending=True like nsmallest.
The value passed to head is the same as the one given to nlargest: the number of rows to keep per group.
The reset_index is optional and not strictly necessary.
This works for duplicated values
If you have duplicated values among the top-n values and want only unique ones, you can do this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
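As a quick check against the question's df (the .nth slice syntax needs a recent pandas), taking the first two rows per id reproduces the desired output, i.e. rows 0, 1, 3, 4, 7 and 8:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# First two rows of each id; the original row index is kept
print(df.groupby('id', as_index=False).nth[:2])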