Pandas: share of values meeting a condition, and adding a new column

I'm new to pandas and got stuck a bit. Can you help me?
I have a dataframe storing orders:
| item | store_status | customer_status |
|------|--------------|-----------------|
| A | 'dispatched' | 'received' |
| A | 'dispatched' | 'pending' |
| B | 'pending' | 'pending' |
| B | 'dispatched' | 'received' |
| B | 'dispatched' | 'pending' |
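For reference, a minimal snippet to reproduce this frame (assuming the quoted cells are plain strings):
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B'],
    'store_status': ['dispatched', 'dispatched', 'pending', 'dispatched', 'dispatched'],
    'customer_status': ['received', 'pending', 'pending', 'received', 'pending'],
})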
I want to create a new dataframe that shows what portion of each item is 'dispatched' and 'received'. So the result would be:
| item | dispatched_and_received |
|------|-------------------------|
| A | 0.5 |
| B | 0.33 |
I'm also interested in the portion of each item that is 'dispatched', regardless of the customer status and want to add it as a new column to this dataframe:
| item | dispatched_and_received | dispatched |
|------|-------------------------|------------|
| A | 0.5 | 1.00 |
| B | 0.33 | 0.66 |
Thank you!

Create Boolean Series that check the conditions, then take the mean of those Series within each group.
(df.assign(dispatched=df.store_status.eq('dispatched'),
           dispatched_and_received=(df.store_status.eq('dispatched')
                                    & df.customer_status.eq('received')))
   .groupby('item')[['dispatched', 'dispatched_and_received']]
   .mean()
   .reset_index())

#   item  dispatched  dispatched_and_received
# 0    A    1.000000                 0.500000
# 1    B    0.666667                 0.333333
The assign just creates the two Boolean columns; you can split that step out if all of the chaining seems cluttered. It's equivalent to:
df['dispatched'] = df.store_status.eq('dispatched')
df['dispatched_and_received'] = df['dispatched'] & df.customer_status.eq('received')
This is the DataFrame after the assign:
  item store_status customer_status  dispatched  dispatched_and_received
0    A   dispatched        received        True                     True
1    A   dispatched         pending        True                    False
2    B      pending         pending       False                    False
3    B   dispatched        received        True                     True
4    B   dispatched         pending        True                    False

Spark SQL - How to create marketing campaign paths for each user until purchase

I am using Spark SQL to process the following dataset, so it can fit a marketing attribution model:
| user_ID    | timestamp        | activity | campaign          | event_name |
|------------|------------------|----------|-------------------|------------|
| akalsds124 | 2021-12-31 10:00 | click    | Holidays Campaign | NULL       |
| akalsds124 | 2022-03-01 16:32 | click    | Super Campaign    | NULL       |
| akalsds124 | 2022-04-27 20:55 | event    | NULL              | purchase   |
| akalsds124 | 2022-05-10 10:21 | event    | NULL              | purchase   |
| akalsds124 | 2022-06-25 09:22 | click    | IG 3 Campaign     | NULL       |
| akalsds124 | 2022-07-07 15:00 | event    | NULL              | purchase   |
| ijnbmshs33 | 2022-05-02 10:31 | click    | New Campaign      | NULL       |
| ijnbmshs33 | 2022-07-04 17:01 | click    | Mega Campaign     | NULL       |
A click activity is an ad click made by the user, and an event is an interaction inside the app (e.g. a purchase or a login).
To create the table above, you can use this code:
df = spark.createDataFrame(
    [('akalsds124', '2021-12-31 10:00', 'click', 'Holidays Campaign', 'NULL'),
     ('akalsds124', '2022-03-01 16:32', 'click', 'Super Campaign', 'NULL'),
     ('akalsds124', '2022-04-27 20:55', 'event', 'NULL', 'purchase'),
     ('akalsds124', '2022-05-10 10:21', 'event', 'NULL', 'purchase'),
     ('akalsds124', '2022-06-25 09:22', 'click', 'IG 3 Campaign', 'NULL'),
     ('akalsds124', '2022-07-07 15:00', 'event', 'NULL', 'purchase'),
     ('ijnbmshs33', '2022-05-02 10:31', 'click', 'New Campaign', 'NULL'),
     ('ijnbmshs33', '2022-07-04 17:01', 'click', 'Mega Campaign', 'NULL')],
    ['user_id', 'timestamp', 'activity', 'campaign', 'event_name']
)
I need to create a path with each user's campaign touchpoints inside a list. When a user purchases a product, a new path must be created for his/her next touchpoints.
I also need a column named 'converted' with Boolean results (1 if the path led to a purchase, 0 if it did not lead to a conversion), and another one (total_conversions) with the total number of purchases per path.
The expected output should be like this:
| user_ID    | path                                | converted | total_conversions |
|------------|-------------------------------------|-----------|-------------------|
| akalsds124 | [Holidays Campaign, Super Campaign] | 1         | 2                 |
| akalsds124 | [IG 3 Campaign]                     | 1         | 1                 |
| ijnbmshs33 | [New Campaign, Mega Campaign]       | 0         | 0                 |
Starting from the dataset you created, here is what I've done.
Data preparation:
from pyspark.sql import functions as F, Window as W

# Flag purchases as 1, everything else as 0.
df = df.withColumn(
    "event_name", F.when(F.col("event_name") == "purchase", 1).otherwise(0)
)

# Look at the previous row per user: a 1 there means the previous
# touchpoint was a purchase.
df = df.withColumn(
    "rnk", F.lag("event_name").over(W.partitionBy("user_id").orderBy("timestamp"))
)

# Mark the first non-purchase row after a purchase: it starts a new path.
df = df.withColumn(
    "rnk", F.when((F.col("rnk") == 1) & (F.col("event_name") != 1), 1).otherwise(0)
)

# A running sum of those markers numbers the paths per user.
df = df.withColumn(
    "rnk", F.sum("rnk").over(W.partitionBy("user_id").orderBy("timestamp"))
)
Aggregation:
df = df.groupBy("user_id", "rnk").agg(
    F.collect_set("campaign").alias("path"),
    F.max("event_name").alias("converted"),
    F.sum("event_name").alias("total_conversions"),
)
Result
df.show()
+----------+---+--------------------+---------+-----------------+
| user_id|rnk| path|converted|total_conversions|
+----------+---+--------------------+---------+-----------------+
|akalsds124| 0|[Super Campaign, ...| 1| 2|
|akalsds124| 1|[NULL, IG 3 Campa...| 1| 1|
|ijnbmshs33| 0|[Mega Campaign, N...| 0| 0|
+----------+---+--------------------+---------+-----------------+
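Note that collect_set neither preserves click order nor drops the 'NULL' placeholder strings, which is why the paths above don't quite match the expected output. A possible replacement for the aggregation step (a sketch, assuming the path should list campaigns in timestamp order without the placeholders):
from pyspark.sql import functions as F

df = (
    df.withColumn(
        "c", F.when(F.col("campaign") == "NULL", None).otherwise(F.col("campaign"))
    )
    .groupBy("user_id", "rnk")
    .agg(
        # collect (timestamp, campaign) pairs, then sort by timestamp
        F.sort_array(F.collect_list(F.struct("timestamp", "c"))).alias("tc"),
        F.max("event_name").alias("converted"),
        F.sum("event_name").alias("total_conversions"),
    )
    # keep only the campaign names, dropping the null placeholders
    .withColumn("path", F.expr("filter(transform(tc, x -> x.c), x -> x IS NOT NULL)"))
    .drop("tc")
)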

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note this is just a made-up sample df; the real one has lots of other columns, some with data types other than text. bar and foo may also have lots of empty cells/values, which are NaNs.
The actual df looks like this (the above is just a sample):
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, 'clear': 1, 'suspected': 2, 'caution': 3, 'rejected': 4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df

   bar  foo
0    4  0.0
1    1  3.0
2    3  NaN
However, if you wish to be on the safe side (in case a key is not in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
So the output for this df:
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To apply the same mapping across several columns:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Let's try stack, map the dict, and then unstack:
df.stack().to_frame()[0].map(result_map).unstack()
   bar  foo
0    4    0
1    1    3
2    3    2
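pandas' replace also accepts a nested dict keyed by column, which avoids the loop; a sketch using the same mapping (unlike map, values absent from the dict pass through unchanged instead of becoming NaN):
import numpy as np
import pandas as pd

result_map = {'unidentified': 0, 'clear': 1, 'suspected': 2, 'caution': 3, 'rejected': 4}
sample_df = pd.DataFrame({'bar': ['rejected', 'clear', 'caution'],
                          'foo': ['unidentified', 'caution', np.nan]})

# look in each named column and replace only the mapped values
out = sample_df.replace({'bar': result_map, 'foo': result_map})
print(out)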

SQL Query (Display All 'x' Where 'x' Is Not In Table '2' for field 'y' and has 'z' flag)

I need to return all 'contacts' that do not appear in the 'delegate' table for 'event name' but do have flags in the 'contacts' table that can be selected by the user for the search.
I know the query can be broken into 2 parts:
1. Are they already attending this event? (Does their email appear in the 'delegates' table, with the delegates.event field matching the 'event' on the user form?)

   WHERE (
   d.Event <> [Forms]![usf_FindCampaignContacts]![FCC_EventName]

2. Do they match the criteria? (Have they got the HR flag in the 'contacts' table?)

   AND (c.[HR-DEL] = [Forms]![usf_FindCampaignContacts]![FCC_HRD] OR IsNull([Forms]![usf_FindCampaignContacts]![FCC_HRD]));
Based on the 2 things that the query is required to do I have written the following code...
SELECT c.[First Name], c.[Last Name], c.Email, d.Event, c.Suppress, c.[HR-DEL]
FROM tbl_Contacts AS c
LEFT JOIN tbl_Delegates AS d ON c.Email = d.Email
WHERE (
    d.Event <> [Forms]![usf_FindCampaignContacts]![FCC_EventName]
    AND c.Suppress = False
)
AND (c.[HR-DEL] = [Forms]![usf_FindCampaignContacts]![FCC_HRD]
     OR IsNull([Forms]![usf_FindCampaignContacts]![FCC_HRD]));
[FCC_HRD] refers to the user-selected input on the form. I tried using <> to remove matching records, but I suspect that caused the compile error, so I changed these to AND/OR statements, and this part now returns results with the matching flags (success).
The other issue with this approach is that, even if it worked, it would remove anyone listed in the delegates/sponsor table for any event. That is why I added the <> condition on Event: contacts only need to be removed from the list for the named event. Again, this part works perfectly well (success).
The final issue is that the results are clearly being pulled from the 'delegates' table, not the 'contacts' table: both parts above work, but they only display the rows that match the criteria in the delegates table, not all qualifying contacts.
Here are the query/table relationships and the user form (this is not the final design); screenshots omitted.
Below are the 3 tables that are used in the query (2 direct, 1 linked)
Contacts (c)
+----+------------+---------------+-------------------------+--------+----------+
| ID | First Name | Last Name | Email | HR-DEL | Suppress |
+----+------------+---------------+-------------------------+--------+----------+
| 1 | A | Platt | a.platt#fake.com | TRUE | TRUE |
| 2 | D | Farr | d.farr#fake.com | TRUE | FALSE |
| 3 | Y | Helle | y.helle#fake.com | TRUE | FALSE |
| 4 | S | Oliphant | soliphant#fake.com | TRUE | FALSE |
| 5 | J | Bedell-Pearce | jbedell-pearce#fake.com | TRUE | FALSE |
| 6 | J | Walker | j.walker#fake.com | FALSE | FALSE |
| 7 | S | Rug | s.rug#fake.com | FALSE | FALSE |
| 8 | D | Brown | d.brown#fake.com | FALSE | FALSE |
| 9 | R | Cooper | r.cooper#fake.com | TRUE | FALSE |
| 10 | M | Morrall | m.morrall#fake.com | TRUE | FALSE |
+----+------------+---------------+-------------------------+--------+----------+
Delegates (d)
+----+-------------------------+-------+
| ID | Email | Event |
+----+-------------------------+-------+
| 1 | a.platt#fake.com | 2 |
| 2 | d.farr#fake.com | 1 |
| 3 | y.helle#fake.com | 4 |
| 4 | soliphant#fake.com | 3 |
| 6 | jbedell-pearce#fake.com | 2 |
+----+-------------------------+-------+
Events (not direct but used to check event name drop-down on user form vs event number in delegates)
+----+------------+
| ID | Event Name |
+----+------------+
| 1 | Test 1 |
| 2 | Test 2 |
| 3 | Test 3 |
| 4 | Test 4 |
+----+------------+
Based on form selection and this sample data I need to return the following:
All contacts who are flagged HR = TRUE, are not suppressed, and are not going to the event named 'Test 2'. This should return 5 contacts; instead I always return only the 3 'delegates' who are not going to the event.
Final results should be:
+----+------------+-----------+--------------------+--------+----------+
| ID | First Name | Last Name | Email | HR-DEL | Suppress |
+----+------------+-----------+--------------------+--------+----------+
| 2 | D | Farr | d.farr#fake.com | TRUE | FALSE |
| 3 | Y | Helle | y.helle#fake.com | TRUE | FALSE |
| 4 | S | Oliphant | soliphant#fake.com | TRUE | FALSE |
| 9 | R | Cooper | r.cooper#fake.com | TRUE | FALSE |
| 10 | M | Morrall | m.morrall#fake.com | TRUE | FALSE |
+----+------------+-----------+--------------------+--------+----------+
At the moment it appears to be pulling results from the wrong table (d, not c). I attempted to change to an OUTER join type, but that returned a FROM syntax error.
If I understand it correctly, basically you want to do this:
SELECT A.foo
FROM A
LEFT JOIN B
ON A.bar = B.bar
WHERE
<complex condition, partly involving B>
This cannot work. By including B in the global WHERE condition, you turn the LEFT JOIN into an INNER JOIN, and so you will only ever get records that match between A and B.
You can either move the filter on B into the JOIN condition:
SELECT A.foo
FROM A
LEFT JOIN B
ON (A.bar = B.bar)
AND (B.bamboozle = 42)
WHERE
A.columns = things
or LEFT JOIN a filtered subquery:
SELECT A.foo
FROM A
LEFT JOIN
(SELECT bar, columns FROM B
WHERE B.bamboozle = 42) AS B1
ON A.bar = B1.bar
WHERE
A.columns = things
So in your query, this is the bamboozle part you will need to move:
d.Event <> [Forms]![usf_FindCampaignContacts]![FCC_EventName]
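Putting that together with the second pattern (a sketch, untested in Access; note that the filter flips from <> to =, because "not attending the named event" becomes a LEFT JOIN against only that event's delegates plus an IS NULL check):
SELECT c.[First Name], c.[Last Name], c.Email, c.Suppress, c.[HR-DEL]
FROM tbl_Contacts AS c
LEFT JOIN
    (SELECT Email FROM tbl_Delegates
     WHERE Event = [Forms]![usf_FindCampaignContacts]![FCC_EventName]) AS d1
    ON c.Email = d1.Email
WHERE d1.Email IS NULL
    AND c.Suppress = False
    AND (c.[HR-DEL] = [Forms]![usf_FindCampaignContacts]![FCC_HRD]
         OR IsNull([Forms]![usf_FindCampaignContacts]![FCC_HRD]));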

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
| group | event date | count |
|-------|------------|-------|
| x123  | 2016-01-06 | 1     |
|       | 2016-01-08 | 10    |
|       | 2016-02-15 | 9     |
|       | 2016-05-22 | 6     |
|       | 2016-05-29 | 2     |
|       | 2016-05-31 | 6     |
|       | 2016-12-29 | 1     |
| x124  | 2016-01-01 | 1     |
...
and I also know t0, the beginning of time (let's say for x123 it's 2016-01-01), and tN, the end of the experiment (2017-05-25), both from another dataframe df_s. How can I create the dataframe df_new, which should look like this:
| group | obs. weekly | lifetime, week | status |
|-------|-------------|----------------|--------|
| x123  | 2016-01-01  | 1              | 1      |
|       | 2016-01-08  | 0              | 0      |
|       | 2016-01-15  | 0              | 0      |
|       | 2016-01-22  | 1              | 1      |
|       | 2016-01-29  | 2              | 1      |
...
|       | 2017-05-18  | 1              | 1      |
|       | 2017-05-25  | 1              | 1      |
...
| x124  | 2017-05-18  | 1              | 1      |
| x124  | 2017-05-25  | 1              | 1      |
Explanation: take t0 and generate a row per week until tN. For each row R, search whether any event date for that group falls within R's week; if so, count how long (in weeks) it lives there and set status = 1 (alive); otherwise set the lifetime and status columns for this R to 0, i.e. dead.
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd

# 1. generate dataframes per group to get the boundary within `t0` and `tN`
#    from the df_s dataframe, where each dataframe has "group, obs, lifetime, status"
#    columns x ((tN - t0) / week) rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])

def do_that(R):
    found_event_row = df_e.iloc[[R.group]]
    # check if found_event_row['date'] falls into R['obs'] week
    # if True, then find how long it's there

df_new = df_all.apply(do_that)
I'm not really sure if I get you, but group one is not related to group two, right? If that's the case, I think what you want is something like this:
import pandas as pd

df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index)  # convert the index to datetime so you can 'resample'
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lambda x: yourfunction(x))
df_group1 = df_group1.reset_index()
df_group1['status'] = df_group1.apply(lambda x: 1 if x['lifetime, week'] > 0 else 0, axis=1)
# do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.
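To get you started on the weekly grid itself, here is a rough sketch for a single group. It assumes (my reading of your spec) that 'lifetime, week' is the number of that group's events falling inside each weekly window, with t0/tN hard-coded where you would pull them from df_s:
import pandas as pd

t0, tN = pd.Timestamp('2016-01-01'), pd.Timestamp('2017-05-25')
weeks = pd.date_range(t0, tN, freq='7D')  # one row per week from t0 to tN

grid = pd.DataFrame({'group': 'x123', 'obs. weekly': weeks})
events = pd.to_datetime(df_e.loc[df_e['group'] == 'x123', 'event date'])

# count the events falling inside each weekly window [obs, obs + 7 days)
grid['lifetime, week'] = [
    ((events >= w) & (events < w + pd.Timedelta(days=7))).sum() for w in weeks
]
grid['status'] = (grid['lifetime, week'] > 0).astype(int)
# repeat per group and pd.concat the pieces into df_new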

Update the first and last records only amongst many duplicate records

I have a table in Access named Spells which holds patient spells (where a patient has a spell within a hospital). Its structure is as below:
| ID | SpellID | MultipleSpell | FirstSpell | LastSpell |
|----|---------|---------------|------------|-----------|
| 1 | 1 | False | | |
| 2 | 2 | True | | |
| 3 | 2 | True | | |
| 4 | 3 | False | | |
| 5 | 4 | False | | |
| 6 | 5 | True | | |
| 7 | 5 | True | | |
| 8 | 5 | True | | |
The MultipleSpell column indicates that there are multiple occurrences of the spell within the table.
I'd like to run query which would update the FirstSpell column to True for records with the IDs of 1,2,4,5,6. So basically, where a Spell is the first one in the table, it should be marked, in the FirstSpell column.
I would also then like to update the LastSpell column to True for records with the IDs of 1,3,4,5,8.
The reasoning for this (if you're interested) is that the table links to a separate table containing the name of wards. It would be useful to link to this other table and indicate whether the ward is the admitting ward (FirstSpell) or the discharging ward (LastSpell)
You can update the first using:
update spells
set firstspell = 1
where id = (select min(id)
            from spells as s2
            where spells.spellid = s2.spellid
           );
Similar logic (using max()) can be used for the last spell.
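For symmetry, that LastSpell update would be the same pattern with max():
update spells
set lastspell = 1
where id = (select max(id)
            from spells as s2
            where spells.spellid = s2.spellid
           );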