Vectorizing a variable length look-ahead loop in pandas

This is a very simplified version of my data:
+----+---------+---------------------+
| | user_id | seconds_since_start |
+----+---------+---------------------+
| 0 | 1 | 10 |
| 1 | 1 | 12 |
| 2 | 1 | 15 |
| 3 | 1 | 52 |
| 4 | 1 | 60 |
| 5 | 1 | 67 |
| 6 | 1 | 120 |
| 7 | 2 | 55 |
| 8 | 2 | 62 |
| 9 | 2 | 105 |
| 10 | 3 | 200 |
| 11 | 3 | 206 |
+----+---------+---------------------+
And this is the data I would like to produce:
+----+---------+---------------------+-----------------+------------------+
| | user_id | seconds_since_start | session_ordinal | session_duration |
+----+---------+---------------------+-----------------+------------------+
| 0 | 1 | 10 | 1 | 5 |
| 1 | 1 | 12 | 1 | 5 |
| 2 | 1 | 15 | 1 | 5 |
| 3 | 1 | 52 | 2 | 15 |
| 4 | 1 | 60 | 2 | 15 |
| 5 | 1 | 67 | 2 | 15 |
| 6 | 1 | 120 | 3 | 0 |
| 7 | 2 | 55 | 1 | 7 |
| 8 | 2 | 62 | 1 | 7 |
| 9 | 2 | 105 | 2 | 0 |
| 10 | 3 | 200 | 1 | 6 |
| 11 | 3 | 206 | 1 | 6 |
+----+---------+---------------------+-----------------+------------------+
My notion of a session is a group of events from a single user that occur no more than 10 seconds apart, and a session's duration is the time in seconds between the first and last event of the session (for example, user 1's second session runs from 52 to 67, giving a duration of 15).
I have written working Python that achieves what I want.
import pandas as pd

events_data = [[1, 10], [1, 12], [1, 15], [1, 52], [1, 60], [1, 67], [1, 120],
               [2, 55], [2, 62], [2, 105],
               [3, 200], [3, 206]]
events = pd.DataFrame(data=events_data, columns=['user_id', 'seconds_since_start'])

def record_session(index_range, ordinal, duration):
    for i in index_range:
        events.at[i, 'session_ordinal'] = ordinal
        events.at[i, 'session_duration'] = duration

session_indexes = []
current_user = previous_time = session_start = -1
session_num = 0
for i, row in events.iterrows():
    if row['user_id'] != current_user or (row['seconds_since_start'] - previous_time) > 10:
        record_session(session_indexes, session_num, previous_time - session_start)
        session_indexes = [i]
        session_num += 1
        session_start = row['seconds_since_start']
        if row['user_id'] != current_user:
            current_user = row['user_id']
            session_num = 1
    previous_time = row['seconds_since_start']
    session_indexes.append(i)
record_session(session_indexes, session_num, previous_time - session_start)
My problem is how long this takes to run. As I said, this is a very simplified version of my data; my actual data has 70,000,000 rows. Is there a way to vectorize (and thus speed up) algorithms like this, which derive additional columns based on variable-length look-aheads?

You could try:
# Create a helper boolean Series
s = df.groupby('user_id')['seconds_since_start'].diff().gt(10)

df['session_ordinal'] = s.groupby(df['user_id']).cumsum().add(1).astype(int)
df['session_duration'] = (df.groupby(['user_id', 'session_ordinal'])['seconds_since_start']
                            .transform(lambda x: x.max() - x.min()))

Output:
    user_id  seconds_since_start  session_ordinal  session_duration
0         1                   10                1                 5
1         1                   12                1                 5
2         1                   15                1                 5
3         1                   52                2                15
4         1                   60                2                15
5         1                   67                2                15
6         1                  120                3                 0
7         2                   55                1                 7
8         2                   62                1                 7
9         2                  105                2                 0
10        3                  200                1                 6
11        3                  206                1                 6

Chris A's answer here is great. It contains several techniques or calls I was unfamiliar with. This answer copies his and adds copious annotations.
We start by building a helper Boolean series, which records which events start additional sessions for a user. A Boolean series is fine here because, in numeric contexts, Booleans behave like the integers 0 and 1 (quoting from here). Let's put the series together bit by bit.
starts_session = events.groupby('user_id')['seconds_since_start'].diff().gt(10)
First we group events by user_id (documentation), then select the 'seconds_since_start' column and call diff (documentation) on it. The result of events.groupby('user_id')['seconds_since_start'].diff() is
+----+----------------------+
| | seconds_since_start |
+----+----------------------+
| 0 | NaN |
| 1 | 2.0 |
| 2 | 3.0 |
| 3 | 37.0 |
| 4 | 8.0 |
| 5 | 7.0 |
| 6 | 53.0 |
| 7 | NaN |
| 8 | 7.0 |
| 9 | 43.0 |
| 10 | NaN |
| 11 | 6.0 |
+----+----------------------+
We can see that the first row of each group correctly gets a NaN difference, as there is no previous event from that user to take a delta from.
Then using the element-wise greater than gt(10) (documentation) we get
+----+----------------------+
| | seconds_since_start |
+----+----------------------+
| 0 | False |
| 1 | False |
| 2 | False |
| 3 | True |
| 4 | False |
| 5 | False |
| 6 | True |
| 7 | False |
| 8 | False |
| 9 | True |
| 10 | False |
| 11 | False |
+----+----------------------+
(N.B. The column heading is odd, but it is not used and so does not matter.)
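As a side note (my own quick check, not part of Chris A's answer), NaN compares as False under gt, which is why the first event of each user never registers as the start of an additional session:
import numpy as np
import pandas as pd

# NaN > 10 evaluates to False, so the NaN rows stay False in starts_session.
print(pd.Series([np.nan, 2.0, 37.0]).gt(10).tolist())
# [False, False, True]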
events['session_ordinal'] = starts_session.groupby(events['user_id']).cumsum().add(1).astype(int)
We then re-group starts_session by the user_id values in events and take the cumulative sum, cumsum (documentation), over each group. The grouping does the work for us here, ensuring that the running count restarts at zero for each user. We need the session ordinal to start at 1, not zero, so we add one with add(1) (documentation), and cast to int with astype(int) (documentation) since none of the values are NaN. This gives the derived session_ordinal column I wanted.
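To make that concrete, here is the intermediate result for the sample data (an illustration I added; the values can be read off the tables above):
# Cumulative sum of starts_session within each user, plus one:
print(starts_session.groupby(events['user_id']).cumsum().add(1).astype(int).tolist())
# [1, 1, 1, 2, 2, 2, 3, 1, 1, 2, 1, 1]
which matches the session_ordinal column in the desired output.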
events['session_duration'] = events.groupby(['user_id', 'session_ordinal'])['seconds_since_start'].transform(lambda x: x.max() - x.min())
To derive each session's duration we first group events by both user_id and the new session_ordinal, i.e. we group the events into sessions. Using transform (documentation) we find the minimum and maximum seconds_since_start within each group (i.e. each session); the difference between them is the session's duration. This pattern, applying transform to grouped data, is used extensively in the split-apply-combine approach.
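The reason for transform rather than agg is that transform returns a result aligned with the original index, one value per row, so it can be assigned straight back as a new column; agg would collapse each session to a single row. A small sketch of the difference (my own illustration, using the events frame from above):
g = events.groupby(['user_id', 'session_ordinal'])['seconds_since_start']

# agg: one value per (user_id, session_ordinal) group
print(g.agg(lambda x: x.max() - x.min()).shape)        # (6,)

# transform: one value per original row, aligned with events.index
print(g.transform(lambda x: x.max() - x.min()).shape)  # (12,)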
Thanks Chris.

Related

Why does sorting a pandas column reorder the sub-groups? [duplicate]

The goal of my question is to understand why this happens and whether it is defined behaviour. I need to know in order to design my unit tests in a predictable way. I do not want or need to change the behaviour or work around it.
Here is the initial data: the complete frame on the left and, on the right, just the rows where ID.eq(1). The order is unchanged, as you can see from the index and the val column.
| | ID | val | | | ID | val |
|---:|-----:|:------| |---:|-----:|:------|
| 0 | 1 | A | | 0 | 1 | A |
| 1 | 2 | B | | 3 | 1 | x |
| 2 | 9 | C | | 4 | 1 | R |
| 3 | 1 | x | | 6 | 1 | G |
| 4 | 1 | R | | 9 | 1 | a |
| 5 | 4 | F | | 12 | 1 | d |
| 6 | 1 | G | | 13 | 1 | e |
| 7 | 9 | H |
| 8 | 4 | I |
| 9 | 1 | a |
| 10 | 2 | b |
| 11 | 9 | c |
| 12 | 1 | d |
| 13 | 1 | e |
| 14 | 4 | f |
| 15 | 2 | g |
| 16 | 9 | h |
| 17 | 9 | i |
| 18 | 4 | X |
| 19 | 5 | Y |
The right-hand table is also the result I would expect after doing the following sort. But when I sort by ID, the order of the rows inside the sub-groups (e.g. ID.eq(1)) is modified. Why is that?
This is the unexpected result:
| | ID | val |
|---:|-----:|:------|
| 0 | 1 | A |
| 13 | 1 | e |
| 12 | 1 | d |
| 6 | 1 | G |
| 9 | 1 | a |
| 3 | 1 | x |
| 4 | 1 | R |
Here is a full MWE:
#!/usr/bin/env python3
import pandas as pd

# initial data
df = pd.DataFrame(
    {
        'ID': [1, 2, 9, 1, 1, 4, 1, 9, 4, 1,
               2, 9, 1, 1, 4, 2, 9, 9, 4, 5],
        'val': list('ABCxRFGHIabcdefghiXY')
    }
)
print(df.to_markdown())

# only the group "1"
print(df.loc[df.ID.eq(1)].to_markdown())

# sort by 'ID'
df = df.sort_values('ID')

# only the group "1" (after sorting)
print(df.loc[df.ID.eq(1)].to_markdown())
As explained in the sort_values documentation, the stability of the sort is not always guaranteed depending on the chosen algorithm:
kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'
Choice of sorting algorithm. See also :func:`numpy.sort` for more
information. `mergesort` and `stable` are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
If you want to ensure a stable sort:
df.sort_values('ID', kind='stable')
output:
ID val
0 1 A
3 1 x
4 1 R
6 1 G
9 1 a
...
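If your unit tests must not depend on sort stability at all, another option (a sketch of my own, not from the documentation) is to make the tiebreak explicit by sorting on the original index as a secondary key:
# Deterministic regardless of the sorting algorithm: expose the original
# index as a column and use it as an explicit tiebreaker.
out = (df.reset_index()                     # old index becomes a column named 'index'
         .sort_values(['ID', 'index'])
         .set_index('index')
         .rename_axis(None))
print(out.loc[out.ID.eq(1)].to_markdown())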

pandas iterate over rows based on column values

I want to calculate the temperature difference between two cities at the same week. The data structure looks as follows:
import pandas as pd

dic = {'city': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
       'week': [1, 2, 3, 4, 5, 3, 4, 5, 6, 7],
       'temp': [20, 21, 23, 21, 25, 20, 21, 24, 21, 22]}
df = pd.DataFrame(dic)
df
+------+------+------+
| city | week | temp |
+------+------+------+
| a | 1 | 20 |
| a | 2 | 21 |
| a | 3 | 23 |
| a | 4 | 21 |
| a | 5 | 25 |
| b | 3 | 20 |
| b | 4 | 21 |
| b | 5 | 24 |
| b | 6 | 21 |
| b | 7 | 22 |
+------+------+------+
I would like to calculate the difference in temperature between cities a and b at weeks 3, 4, and 5. The final data structure should look as follows:
+--------+-------+------+------+
| city_1 | city2 | week | diff |
+--------+-------+------+------+
| a | b | 3 | 3 |
| a | b | 4 | 0 |
| a | b | 5 | 1 |
+--------+-------+------+------+
I would pivot your data, drop the NA values, and do the subtraction directly. This way you can keep the source temperatures associated with each city.
result = (
    df.pivot(index='week', columns='city', values='temp')
      .dropna(how='any', axis='index')
      .assign(diff=lambda df: df['a'] - df['b'])
)
print(result)

city     a     b  diff
week
3     23.0  20.0   3.0
4     21.0  21.0   0.0
5     25.0  24.0   1.0
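If you need the exact column layout from the question (city_1, city_2, week, diff), a self-merge is an alternative sketch (the column names below are my own choice):
# Sketch: align the two cities on week via a merge, then subtract.
a = df.loc[df['city'] == 'a']
b = df.loc[df['city'] == 'b']

result = (
    a.merge(b, on='week', suffixes=('_1', '_2'))
     .assign(diff=lambda d: d['temp_1'] - d['temp_2'])
     .loc[:, ['city_1', 'city_2', 'week', 'diff']]
)
print(result)
#   city_1 city_2  week  diff
# 0      a      b     3     3
# 1      a      b     4     0
# 2      a      b     5     1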

Running Count by Group and Flag in BigQuery?

I have a table that looks like the below:
Row | Fullvisitorid | Visitid | New_Session_Flag
1 | A | 111 | 1
2 | A | 120 | 0
3 | A | 128 | 0
4 | A | 133 | 0
5 | A | 745 | 1
6 | A | 777 | 0
7 | B | 388 | 1
8 | B | 401 | 0
9 | B | 420 | 0
10 | B | 777 | 1
11 | B | 784 | 0
12 | B | 791 | 0
13 | B | 900 | 1
14 | B | 904 | 0
What I want to do: if it's the first row for a fullvisitorid, mark the field as 1; otherwise use the value from the row above, except when new_session_flag = 1, in which case use the row above plus 1. An example of the output I'm looking for is below:
Row | Fullvisitorid | Visitid | New_Session_Flag | Rank_Session_Order
1 | A | 111 | 1 | 1
2 | A | 120 | 0 | 1
3 | A | 128 | 0 | 1
4 | A | 133 | 0 | 1
5 | A | 745 | 1 | 2
6 | A | 777 | 0 | 2
7 | B | 388 | 1 | 1
8 | B | 401 | 0 | 1
9 | B | 420 | 0 | 1
10 | B | 777 | 1 | 2
11 | B | 784 | 0 | 2
12 | B | 791 | 0 | 2
13 | B | 900 | 1 | 3
14 | B | 904 | 0 | 3
As you can see:
Row 1 is 1 because it's the first time fullvisitorid A appears
Row 2 is 1 because it's not the first time fullvisitorid A appears and new_session_flag <> 1 therefore it uses the above row (i.e. 1)
Row 5 is 2 because it's not the first time fullvisitorid A appears and new_session_Flag = 1 therefore it uses the above row (i.e 1) plus 1
Row 7 is 1 because it's the first time fullvisitorid B appears
etc.
I believe this can be done with a RETAIN statement in SAS, but is there an equivalent in Google BigQuery?
Hopefully the above makes sense, let me know if not.
Thanks in advance
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
COUNTIF(New_Session_Flag = 1) OVER(PARTITION BY Fullvisitorid ORDER BY Visitid) Rank_Session_Order
FROM `project.dataset.table`
The answer by Mikhail Berlyant using a conditional window count is correct and works. I am answering because I find that a window sum is even simpler (and possibly more efficient on a large dataset):
select
  t.*,
  sum(new_session_flag) over(partition by fullvisitorid order by visitid) rank_session_order
from mytable t
This works because new_session_flag contains only 0s and 1s, so counting the 1s is equivalent to summing all the values.

Deleting recursively in a function (ERROR: query has no destination for result data)

I have this table of relationships (only id_padre and id_hijo are interesting):
id | id_padre | id_hijo | cantidad | posicion
----+----------+---------+----------+----------
0 | | 1 | 1 | 0
1 | 1 | 2 | 1 | 0
2 | 1 | 3 | 1 | 1
3 | 3 | 4 | 1 | 0
4 | 4 | 5 | 0.5 | 0
5 | 4 | 6 | 0.5 | 1
6 | 4 | 7 | 24 | 2
7 | 4 | 8 | 0.11 | 3
8 | 8 | 6 | 0.12 | 0
9 | 8 | 9 | 0.05 | 1
10 | 8 | 10 | 0.3 | 2
11 | 8 | 11 | 0.02 | 3
12 | 3 | 12 | 250 | 1
13 | 12 | 5 | 0.8 | 0
14 | 12 | 6 | 0.8 | 1
15 | 12 | 13 | 26 | 2
16 | 12 | 8 | 0.15 | 3
This table stores the links between nodes (id_padre = parent node, id_hijo = child node).
I'm trying to write a function that deletes rows recursively, starting from a particular row. After deleting it, I check whether any rows remain whose id_hijo column has the same value as the row I just deleted.
If there are no such rows, I must then delete all the rows whose id_padre equals the id_hijo of the deleted row.
i.e. if I begin by deleting the row where id_padre = 3 and id_hijo = 4, I delete this row:
id | id_padre | id_hijo | cantidad | posicion
----+----------+---------+----------+----------
3 | 3 | 4 | 1 | 0
and the table remains like that:
id | id_padre | id_hijo | cantidad | posicion
----+----------+---------+----------+----------
0 | | 1 | 1 | 0
1 | 1 | 2 | 1 | 0
2 | 1 | 3 | 1 | 1
4 | 4 | 5 | 0.5 | 0
5 | 4 | 6 | 0.5 | 1
6 | 4 | 7 | 24 | 2
7 | 4 | 8 | 0.11 | 3
8 | 8 | 6 | 0.12 | 0
9 | 8 | 9 | 0.05 | 1
10 | 8 | 10 | 0.3 | 2
11 | 8 | 11 | 0.02 | 3
12 | 3 | 12 | 250 | 1
13 | 12 | 5 | 0.8 | 0
14 | 12 | 6 | 0.8 | 1
15 | 12 | 13 | 26 | 2
16 | 12 | 8 | 0.15 | 3
Because there is no longer any row with id_hijo = 4, I then delete the rows where id_padre = 4, and so on recursively (in this example the process ends here).
I have tried to write this function (it calls itself):
CREATE OR REPLACE FUNCTION borrar(integer, integer) RETURNS VOID AS
$BODY$
DECLARE
    padre ALIAS FOR $1;
    hijo ALIAS FOR $2;
    r copia_rel%rowtype;
BEGIN
    DELETE FROM copia_rel WHERE id_padre = padre AND id_hijo = hijo;
    IF NOT EXISTS (SELECT id_hijo FROM copia_rel WHERE id_hijo = hijo) THEN
        FOR r IN SELECT * FROM copia_rel WHERE id_padre = hijo LOOP
            RAISE NOTICE 'Selecciono: %,%', r.id_padre, r.id_hijo; -- for debugging
            SELECT borrar(r.id_padre, r.id_hijo);
        END LOOP;
    END IF;
END;
$BODY$
LANGUAGE plpgsql;
But I get this error:
ERROR: query has no destination for result data
I know that PostgreSQL has specific recursive constructs such as CTEs. I have used them to traverse my graph, but I don't know how I could use one in this case.
The error comes from the SELECT used to call the function recursively: PostgreSQL wants to put the result somewhere but is not told where.
If you want to run a function and discard its result, use PERFORM instead of SELECT in PL/pgSQL functions (e.g. PERFORM borrar(r.id_padre, r.id_hijo); in the loop above).

SQL Select and Group By clause

I have data as per the table below. I pass in a list of numbers and need the raceId values where all of the numbers appear in the data column for that race.
+-----+--------+------+
| Id | raceId | data |
+-----+--------+------+
| 14 | 1 | 1 |
| 12 | 1 | 2 |
| 13 | 1 | 3 |
| 16 | 1 | 8 |
| 47 | 2 | 1 |
| 43 | 2 | 2 |
| 46 | 2 | 6 |
| 40 | 2 | 7 |
| 42 | 2 | 8 |
| 68 | 3 | 3 |
| 69 | 3 | 6 |
| 65 | 3 | 7 |
| 90 | 4 | 1 |
| 89 | 4 | 2 |
| 95 | 4 | 6 |
| 92 | 4 | 7 |
| 93 | 4 | 8 |
| 114 | 5 | 1 |
| 116 | 5 | 2 |
| 117 | 5 | 3 |
| 118 | 5 | 8 |
| 138 | 6 | 2 |
| 139 | 6 | 6 |
| 140 | 6 | 7 |
| 137 | 6 | 8 |
+-----+--------+------+
For example, if I pass in 1, 2, 7, I would get the following raceIds:
2 and 4
I have tried the simple statement
SELECT * FROM table WHERE ((data = 1) or (data = 2) or (data = 7))
But I don't really understand the GROUP BY clause, or indeed whether it is the correct way of doing this.
select raceId
from yourtable
where data in (1,2,7)
group by raceId
having count(raceId) = 3 /* length(1,2,7) */
This assumes the (raceId, data) pair is unique. If it's not, you should use:
select raceId
from (select distinct raceId, data
      from yourtable
      where data in (1, 2, 7)) t
group by raceId
having count(raceId) = 3
This is an example of a "set-within-sets" query. I like to solve these with group by and having.
select raceid
from races
where data in (1, 2, 7)
group by raceid
having count(*) = 3;