How do I group entry timestamps into visitor-specific "sessions" in PostgreSQL?

Here's some mock data:
visitor_id,channel,timestamp,order_id,session
100,A,1,,1
100,B,2,,1
100,A,3,,1
100,B,4,1,1
100,B,5,,2
100,B,6,,2
100,B,7,2,2
100,A,8,,3
100,A,9,,3
A visitor comes into the site via channels and eventually orders (creating an order_id). Many visitors never order, but I still want to group their visits into sessions (to determine what their first channel was, for example). The last column shows one example of the desired result.
What's an efficient, declarative statement to create it in PostgreSQL? Are there better solutions than what I am proposing?
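For reference, the sample data can be loaded with something like the following minimal sketch (the table name mockdata is taken from the answers below; the schema itself is an assumption, with timestamp simplified to a plain integer):
create table mockdata (
    visitor_id int,
    channel    text,
    timestamp  int,   -- simplified to an integer, as in the mock data
    order_id   int    -- null until the visitor orders
);

insert into mockdata (visitor_id, channel, timestamp, order_id) values
    (100, 'A', 1, null),
    (100, 'B', 2, null),
    (100, 'A', 3, null),
    (100, 'B', 4, 1),
    (100, 'B', 5, null),
    (100, 'B', 6, null),
    (100, 'B', 7, 2),
    (100, 'A', 8, null),
    (100, 'A', 9, null);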

You want to combine the values up to and including an order. One method is to assign each row a grouping id, such as the number of orders before a given time (plus 1, so the session numbers start at 1 as in the example). This can be done with a correlated subquery:
select md.*,
       1 + (select count(md2.order_id)
            from mockdata md2
            where md2.visitor_id = md.visitor_id and
                  md2.timestamp < md.timestamp
           ) as session
from mockdata md;
This can also be done using a cumulative count:
select md.*,
       1 + count(order_id) over (partition by visitor_id
                                 order by timestamp
                                 rows between unbounded preceding and 1 preceding
                                ) as session
from mockdata md;
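As a usage sketch for the first-channel question: once the session column exists, DISTINCT ON can pick the earliest row per (visitor, session). This simply reuses the window query above as a CTE:
-- Sketch: first channel per (visitor_id, session).
with sessions as (
    select md.*,
           1 + count(order_id) over (partition by visitor_id
                                     order by timestamp
                                     rows between unbounded preceding and 1 preceding
                                    ) as session
    from mockdata md
)
select distinct on (visitor_id, session)
       visitor_id, session, channel as first_channel
from sessions
order by visitor_id, session, timestamp;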

Related

Stuck on what seems like a simple SQL dense_rank task

I've been stuck on this issue and could really use a suggestion or some help.
What I have in a table is the basic user flow on a website. For every Session ID, there's a page visited from start (lands on homepage) to finish (purchase). This has been ordered by timestamp to get a count of pages visited during this process. This 'page count' has also been partitioned by Session ID to go back to 1 every time the ID changes.
What I need to do now is assign a step count. This should assign a similar count but shouldn't keep counting at duplicate steps (i.e., someone visited multiple product pages - it's multiple pages but still only one 'product view' step).
You'd think this would be done using a dense rank, partitioned by session ID - but that's where I get stuck. You can't order on page count because that'll assign a unique number to each step. You can't order by Step because that orders it alphabetically.
What could I do to achieve this?
(Screenshot of desired outcome omitted.)
Many thanks!
Use lag() to see whether two adjacent values are the same, then take a cumulative sum:
select t.*,
       sum(case when prev_cs = custom_step then 0 else 1 end)
           over (partition by session_id order by timestamp) as steps_count
from (select t.*,
             lag(custom_step) over (partition by session_id order by timestamp) as prev_cs
      from t
     ) t;
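To see the mechanics on a toy input, here is a self-contained sketch (the data is hypothetical; session_id, timestamp, and custom_step are the column names assumed from the question). Note the first row of each session has prev_cs NULL, so the else-branch contributes the initial 1:
-- Hypothetical session: home -> product -> product -> checkout.
with t(session_id, timestamp, custom_step) as (
    values (1, 1, 'home'), (1, 2, 'product'), (1, 3, 'product'), (1, 4, 'checkout')
)
select t.*,
       sum(case when prev_cs = custom_step then 0 else 1 end)
           over (partition by session_id order by timestamp) as steps_count
from (select t.*,
             lag(custom_step) over (partition by session_id order by timestamp) as prev_cs
      from t
     ) t;
-- steps_count comes out 1, 2, 2, 3: the repeated 'product' rows share one step.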
Below is the same approach for BigQuery Standard SQL:
#standardSQL
SELECT * EXCEPT(flag),
       COUNTIF(IFNULL(flag, TRUE)) OVER(PARTITION BY session_id ORDER BY timestamp) AS steps_count
FROM (
    SELECT *,
           custom_step != LAG(custom_step) OVER(PARTITION BY session_id ORDER BY timestamp) AS flag
    FROM `project.dataset.table`
)
-- ORDER BY timestamp

How Can I Retrieve The Earliest Date and Status Per Each Distinct ID

I have been trying to write a query for this but can't seem to do the trick because I am still receiving duplicates. Hoping I can get help on how to fix this issue.
SELECT DISTINCT
    s.Client,
    s.ID,
    s.Thing,
    s.Status,
    MIN(s.StatusDate) as StatDate
FROM
    SAMPLE s
WHERE
    []
GROUP BY
    s.Client,
    s.ID,
    s.Thing,
    s.Status
My output is as follows
Client Id Thing Status Statdate
CompanyA 123 Thing1 Approved 12/9/2019
CompanyA 123 Thing1 Denied 12/6/2019
So although the query is doing what I asked and showing the minimum status date per status, I want only the first status date per ID. I have about 30k rows to filter through, so I need something that won't overload the query and keep it from running. Any help would be appreciated.
Use window functions:
SELECT s.*
FROM (SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StatusDate) as seqnum
      FROM SAMPLE s
      WHERE []
     ) s
WHERE seqnum = 1;
This returns the first row for each id. On the sample output above, it would keep only the Denied row dated 12/6/2019, since that is the earliest StatusDate for ID 123.
Use whichever of these you feel more comfortable with/understand:
SELECT
    *
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate) as rn
    FROM sample
    WHERE ...
) x
WHERE rn = 1
The way that one works is to number all rows sequentially in order of StatusDate, restarting the numbering from 1 every time ID changes. If you then collect all the number 1's together, you have your set of "first records".
Or you can coordinate a MIN:
SELECT
    *
FROM
    sample s
    INNER JOIN
    (SELECT ID, MIN(statusDate) as minDate FROM sample WHERE ... GROUP BY ID) mins
    ON s.ID = mins.ID and s.StatusDate = mins.minDate
WHERE
    ...
This one prepares a list of all the IDs with their min dates, then joins it back to the main table, restoring the data that was lost during the grouping operation. You cannot simultaneously "keep data" and "throw away data" during a group: if you group by more than just ID, you get more groups (as you have found), and if you only group by ID, you lose the other columns. There isn't any way to say "GROUP BY id, AND take the MIN date, AND also take all the other data from the same row as the min date" without doing "group by id, take min date, then join this data set back to the main dataset to get the other data for that min date". If you try to do it all in a single grouping you'll fail, because you either have to group by more columns or use aggregate functions for the other data in the SELECT, which mixes your data up; once the groups are formed, the concept of "other data from the same row" is gone.
Be aware that this can return duplicate rows if two records have identical min dates. The ROW_NUMBER form doesn't return duplicates, but if two records have the same minimum StatusDate then which one you'll get is arbitrary. To force a specific one, ORDER BY more columns so you can be sure which row ends up with 1.
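For example, a sketch of a deterministic tiebreak (Status here is just an assumed extra column; any column that distinguishes the rows works):
SELECT s.*
FROM (SELECT s.*,
             ROW_NUMBER() OVER (PARTITION BY ID
                                ORDER BY StatusDate, Status  -- extra column breaks ties
                               ) as seqnum
      FROM SAMPLE s
     ) s
WHERE seqnum = 1;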

Complex grouping case - using "delimiting tinyint flag" between groups

Below is an example of the page-views dataset in question.
Presented records are sorted in ASC order by a timestamp.
I need to calculate some per-session measures from the dataset.
The problem is that there is no clear identifier for a session. The only thing available is an is_a_new_session flag, which serves as a kind of delimiter between sessions. So, in the given example there are 5 separate sessions.
How could I generate some sort of session identifier and add it to the dataset, so that I can later use it for grouping per session?
The desired new column would look like a running session number (1 through 5 in the example).
Use a cumulative sum to define the groups and then aggregate:
select min(timestamp), max(timestamp), . . .  -- whatever columns you want
from (select t.*,
             sum(is_a_new_session) over (order by timestamp) as grp
      from t
     ) t
group by grp;
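If you just want the identifier attached to every row first (before aggregating), the inner query alone does that. A self-contained sketch with hypothetical data:
-- Sketch: derive a session_id per row; column names as in the question.
with t(timestamp, is_a_new_session) as (
    values (1, 1), (2, 0), (3, 0), (4, 1), (5, 0)
)
select t.*,
       sum(is_a_new_session) over (order by timestamp) as session_id
from t;
-- session_id comes out 1, 1, 1, 2, 2.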

postgres select aggregate timespans

I have a table with the following structure:
timestamp-start, timestamp-stop
1,5
6,10
25,30
31,35
...
I am only interested in continuous timespans, e.g. where the break between a timestamp-stop and the following timestamp-start is less than 3.
How could I get the aggregated covered timespans as a result:
timestamp-start,timestamp-stop
1,10
25,35
The reason I am considering this is that a user may request a timespan that would need to return several thousand rows. However, most records are continuous, and the above method could potentially reduce many thousands of rows down to just a dozen. Or is the added computation not worth the savings in bandwidth and latency?
You can group the time stamps in three steps:
1. Add a flag to determine where a new period starts (that is, a gap greater than 3).
2. Cumulatively sum the flag to assign groupings.
3. Re-aggregate with the new groupings.
The code looks like:
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
             sum(flag) over (order by ts_start) as grouping
      from (select t.*,
                   (coalesce(ts_start - lag(ts_end) over (order by ts_start), 0) > 3)::int as flag
            from t
           ) t
     ) t
group by grouping;
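A self-contained check against the sample data (ts_start and ts_end are assumed column names, since the hyphenated names from the question aren't valid identifiers):
-- Sketch: run the three-step merge on the question's sample rows.
with t(ts_start, ts_end) as (
    values (1, 5), (6, 10), (25, 30), (31, 35)
)
select min(ts_start) as ts_start, max(ts_end) as ts_end
from (select t.*,
             sum(flag) over (order by ts_start) as grp
      from (select t.*,
                   (coalesce(ts_start - lag(ts_end) over (order by ts_start), 0) > 3)::int as flag
            from t
           ) t
     ) t
group by grp
order by ts_start;
-- Expected: (1, 10) and (25, 35).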

Filtering by window function result in Postgresql

Ok, initially this was just a joke we had with a friend of mine, but it turned into an interesting technical question :)
I have the following stuff table:
CREATE TABLE stuff
(
    id serial PRIMARY KEY,
    volume integer NOT NULL DEFAULT 0,
    priority smallint NOT NULL DEFAULT 0
);
The table contains the records for all of my stuff, with respective volume and priority (how much I need it).
I have a bag with specified volume, say 1000. I want to select from the table all stuff I can put into a bag, packing the most important stuff first.
This seems like the case for using window functions, so here is the query I came up with:
select s.*, sum(volume) OVER previous_rows as total
from stuff s
where total < 1000
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
order by priority desc
The problem with it, however, is that Postgres complains:
ERROR: column "total" does not exist
LINE 3: where total < 1000
If I remove this filter, the total column gets calculated properly and the results are sorted properly, but all stuff gets selected, which is not what I want.
So, how do I do this? How do I select only items that can fit into the bag?
I don't know if this qualifies as "more elegant" but it is written in a different manner than Cybernate's solution (although it is essentially the same)
WITH window_table AS
(
    SELECT s.*,
           sum(volume) OVER previous_rows as total
    FROM stuff s
    WINDOW previous_rows as
        (ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
)
SELECT *
FROM window_table
WHERE total < 1000
ORDER BY priority DESC;
If by "more elegant" you mean something that avoids the sub-select, then the answer is "no".
I haven't worked with PostgreSQL. However, my best guess would be using an inline view.
SELECT a.*
FROM (
    SELECT s.*, sum(volume) OVER previous_rows AS total
    FROM stuff AS s
    WINDOW previous_rows AS (
        ORDER BY priority desc
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )
) AS a
WHERE a.total < 1000
ORDER BY a.priority DESC;
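A quick sanity check with hypothetical rows (the volumes and priorities are invented for illustration):
-- Hypothetical data to exercise the query above.
insert into stuff (volume, priority) values
    (400, 10), (500, 8), (300, 5), (200, 3);
-- Running totals in priority order: 400, 900, 1200, 1400.
-- WHERE total < 1000 therefore keeps the first two rows: the bag fits
-- the items with priorities 10 and 8 (total volume 900).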