Stuck on what seems like a simple SQL dense_rank task

Been stuck on this issue and could really use a suggestion or help.
What I have in a table is basic user flow on a website. For every Session ID, there's a page visited from start (lands on homepage) to finish (purchase). This has been ordered by timestamp to get a count of pages visited during this process. This 'page count' has also been partitioned by Session ID to go back to 1 every time the ID changes.
What I need to do now is assign a step count (the highlighted column is what I'm trying to achieve). This should assign a similar count, but it shouldn't keep counting across duplicate steps (i.e., someone visited multiple product pages: that's multiple pages but still only one 'product view' step).
You'd think this would be done with a dense rank partitioned by session ID, but that's where I get stuck. You can't order by page count, because that assigns a unique number to each step. You can't order by step, because that orders them alphabetically.
What could I do to achieve this?
Screenshot of desired outcome:
Many thanks!

Use lag() to see whether two consecutive values are the same, then take a cumulative sum:
select t.*,
       sum(case when prev_cs = custom_step then 0 else 1 end) over (partition by session_id order by timestamp) as steps_count
from (select t.*,
             lag(custom_step) over (partition by session_id order by timestamp) as prev_cs
      from t
     ) t
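This lag-plus-cumulative-sum pattern can be sanity-checked in SQLite (3.25+, which supports window functions) from Python. The table layout and sample rows below are invented for illustration, and the `timestamp` column is shortened to `ts`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (session_id TEXT, ts INTEGER, custom_step TEXT);
INSERT INTO t VALUES
  ('s1', 1, 'homepage'), ('s1', 2, 'product'), ('s1', 3, 'product'),
  ('s1', 4, 'checkout'), ('s2', 1, 'homepage'), ('s2', 2, 'product');
""")

# A step increments only when custom_step differs from the previous row's value.
rows = conn.execute("""
SELECT session_id, ts, custom_step,
       SUM(CASE WHEN prev_cs = custom_step THEN 0 ELSE 1 END)
           OVER (PARTITION BY session_id ORDER BY ts) AS steps_count
FROM (SELECT t.*,
             LAG(custom_step) OVER (PARTITION BY session_id ORDER BY ts) AS prev_cs
      FROM t) sub
ORDER BY session_id, ts
""").fetchall()
for row in rows:
    print(row)
```

The two consecutive 'product' rows in session s1 share a steps_count of 2, while the page count would have kept incrementing.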

Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(flag),
COUNTIF(IFNULL(flag, TRUE)) OVER(PARTITION BY session_id ORDER BY timestamp) AS steps_count
FROM (
SELECT *,
custom_step != LAG(custom_step) OVER(PARTITION BY session_id ORDER BY timestamp) AS flag
FROM `project.dataset.table`
)
-- ORDER BY timestamp

Related

Get first record based on time in PostgreSQL

Do we have a way to get the first record considering the time?
For example: get the first record today, the first record yesterday, the first record the day before yesterday, and so on.
Note: I want to get all such records, considering the time.
sample expected output should be
first_record_today,
first_record_yesterday,..
As I understand the question, the "first" record per day is the earliest one.
For that, we can use RANK and do the PARTITION BY the day only, truncating the time.
In the ORDER BY clause, we will sort by the time:
SELECT sub.yourdate FROM (
SELECT yourdate,
RANK() OVER
(PARTITION BY DATE_TRUNC('DAY',yourdate)
ORDER BY DATE_TRUNC('SECOND',yourdate)) rk
FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
In the main query, we will sort the data beginning with the latest date, meaning today's one, if available.
We can try out here: db<>fiddle
If this understanding of the question is incorrect, please let us know what to change by editing your question.
A note: a window function is not strictly necessary for what you describe. A shorter GROUP BY, as shown in the other answer, can produce the correct result too and might be absolutely fine. I like the window function approach because it makes it easy to add or change conditions that might not fit into a simple GROUP BY, which is why I chose this way.
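The same RANK-per-day idea can be tried outside PostgreSQL too. Here's a small sketch using SQLite from Python, where `date(...)` stands in for `DATE_TRUNC('DAY', ...)`; the table and timestamps are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourtable (yourdate TEXT);
INSERT INTO yourtable VALUES
  ('2021-06-01 08:15:00'), ('2021-06-01 17:30:00'),
  ('2021-06-02 07:05:00'), ('2021-06-02 09:00:00');
""")

# Rank rows within each calendar day; rk = 1 is the earliest row of that day.
first_per_day = conn.execute("""
SELECT sub.yourdate FROM (
  SELECT yourdate,
         RANK() OVER (PARTITION BY date(yourdate)   -- SQLite's stand-in for DATE_TRUNC('DAY', ...)
                      ORDER BY yourdate) AS rk
  FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC
""").fetchall()
print(first_per_day)  # earliest row of each day, newest day first
```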
EDIT because the question's author provided further information:
Here the query fetching also the first message:
SELECT sub.yourdate, sub.message FROM (
SELECT yourdate, message,
RANK() OVER (PARTITION BY DATE_TRUNC('DAY',yourdate)
ORDER BY DATE_TRUNC('SECOND',yourdate)) rk
FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
Or if only the message without the date should be selected:
SELECT sub.message FROM (
SELECT yourdate, message,
RANK() OVER (PARTITION BY DATE_TRUNC('DAY',yourdate)
ORDER BY DATE_TRUNC('SECOND',yourdate)) rk
FROM yourtable
) AS sub
WHERE sub.rk = 1
ORDER BY sub.yourdate DESC;
Updated fiddle here: db<>fiddle

Return Only Most Recent Instance of Item From Query (Where Multiple Instances Exist)

I have written the following subquery, which is returning instances of item counts from my application's log table.
The idea is that from this subquery I will be pulling information on item counts from a specific date, to be compared to the same information from a different date - info such as, for a given location on the system, what the latest quantity of all items counted within it was.
select
    LOCATION,
    ITEM,
    SUM(CASE
            WHEN ACTION = 'COUNT-OK'
            THEN QUANTITY
            ELSE QUANTITY * CHANGE --If ACTION <> 'OK', then we need to adjust the quantity
        END) AS QuantityCalc,
    DATE_TIME
from LOG_TABLE
where ACTION IN ('COUNT-ADJ','COUNT-OK')
  AND (CAST(DATE_TIME AS DATE) = #CountDate) --Declared elsewhere
group by LOCATION, ITEM, DATE_TIME
order by DATE_TIME desc
My issue is with the rows returned. Because these are application logs, there is a row for each count being done on the system, so only the most recent 'QuantityCalc' for a given item in a location would be accurate.
I need a way to return only the most recent instance of a count happening (where the LOCATION and ITEM values are the same). I am using a SUM in the main query which is pulling the QuantityCalc value from this subquery to find the total Quantity by Item and Location per specific count (to compare them side by side). This is currently being thrown off by instances such as the below.
I've attached an example image of what this query returns. My issue is with Item2 in Location B and Item3 in location C, and I'd be looking for the query to ONLY return rows 2, 3, 5 and 8 (including header).
Thank you
You can pre-filter the logs for the latest row per location/item tuple, then aggregate. We would typically use row_number() to enumerate the rows in a subquery:
select
    LOCATION,
    ITEM,
    sum(case when ACTION = 'COUNT-OK' then QUANTITY else QUANTITY * CHANGE end) AS QuantityCalc,
    DATE_TIME
from (
    select l.*,
           row_number() over(partition by LOCATION, ITEM order by DATE_TIME desc) AS RN
    from LOG_TABLE
    where ACTION IN ('COUNT-ADJ','COUNT-OK') and CAST(DATE_TIME AS DATE) = #CountDate
) l
where RN = 1
group by LOCATION, ITEM, DATE_TIME
order by DATE_TIME desc
Side note: the filtering on date_time can probably be optimized; rather than casting your column to a date, we can check it directly against a range derived from the date parameter. The syntax of date arithmetic varies widely across databases (and you did not tell us which one you are using), but in standard SQL that would be:
DATE_TIME >= #CountDate and DATE_TIME < #CountDate + interval '1' day
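The point of the half-open range is that no function is applied to the column, so an index on date_time stays usable. A quick illustration in SQLite from Python (table and rows invented; SQLite's `date(:d, '+1 day')` plays the role of `#CountDate + interval '1' day`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log_table (date_time TEXT);
INSERT INTO log_table VALUES
  ('2023-03-14 09:00:00'), ('2023-03-14 23:59:59'), ('2023-03-15 00:00:00');
""")

count_date = '2023-03-14'
# Half-open range: includes everything on count_date, excludes midnight of the next day.
rows = conn.execute("""
SELECT date_time FROM log_table
WHERE date_time >= :d AND date_time < date(:d, '+1 day')
""", {"d": count_date}).fetchall()
print(rows)
```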

How Can I Retrieve The Earliest Date and Status Per Each Distinct ID

I have been trying to write a query for this, but I can't seem to do the trick because I am still receiving duplicates. Hoping I can get help on how to fix this issue.
SELECT DISTINCT
    1.Client,
    1.ID,
    1.Thing,
    1.Status,
    MIN(1.StatusDate) as 'statdate'
FROM
    SAMPLE 1
WHERE
    []
GROUP BY
    1.Client,
    1.ID,
    1.Thing,
    1.Status
My output is as follows
Client Id Thing Status Statdate
CompanyA 123 Thing1 Approved 12/9/2019
CompanyA 123 Thing1 Denied 12/6/2019
So although the query is doing what I asked and showing the minimum status date per status, I want only the first status date overall. I have about 30k rows to filter through, so whatever I use shouldn't overload the query and stop it from running. Any help would be appreciated.
Use window functions:
SELECT s.*
FROM (SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY statdate) as seqnum
FROM SAMPLE s
WHERE []
) s
WHERE seqnum = 1;
This returns the first row for each id.
Use whichever of these you feel more comfortable with/understand:
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate) as rn
FROM sample
WHERE ...
) x
WHERE rn = 1
The way that one works is to number all rows sequentially in order of StatusDate, restarting the numbering from 1 every time ID changes. If you then collect all the number 1's together, you have your set of "first records".
Or you can use a MIN and join it back:
SELECT
*
FROM
sample s
INNER JOIN
(SELECT ID, MIN(statusDate) as minDate FROM sample WHERE ... GROUP BY ID) mins
ON s.ID = mins.ID and s.StatusDate = mins.MinDate
WHERE
...
This one prepares a list of all the IDs with their min dates, then joins it back to the main table. You thus get back the data that was lost during the grouping operation: you cannot simultaneously "keep data" and "throw away data" during a group. If you group by more than just ID, you get more groups (as you have found); if you group only by ID, you lose the other columns.
There isn't any way to say "GROUP BY id, AND take the MIN date, AND also take all the other data from the same row as the min date" without doing exactly that: group by ID, take the min date, then join this data set back to the main dataset to get the other data for that min date. If you try to do it all in a single grouping, you'll fail, because you either have to group by more columns or apply aggregate functions to the other data in the SELECT, which mixes your data up; once the groups are made, the concept of "other data from the same row" is gone.
Be aware that this can return duplicate rows if two records have identical min dates. The ROW_NUMBER form doesn't return duplicates, but if two records share the same minimum StatusDate, then which one you'll get is arbitrary. To force a specific one, ORDER BY more columns so you can be sure which row ends up as number 1.
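That tie-breaking advice can be demonstrated concretely. In the SQLite sketch below (table, `pk` column, and rows invented), two records share the minimum statusdate, and adding `pk` to the ORDER BY makes the ROW_NUMBER pick deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (id INTEGER, statusdate TEXT, status TEXT, pk INTEGER);
INSERT INTO sample VALUES
  (123, '2019-12-06', 'Denied',   1),
  (123, '2019-12-06', 'Approved', 2),  -- same min date: a tie
  (123, '2019-12-09', 'Approved', 3);
""")

# Without the pk tie-breaker, either 2019-12-06 row could be numbered 1.
row = conn.execute("""
SELECT id, statusdate, status FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id
                               ORDER BY statusdate, pk) AS rn  -- pk breaks ties deterministically
  FROM sample
) x
WHERE rn = 1
""").fetchone()
print(row)  # (123, '2019-12-06', 'Denied')
```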

aggregate multiple rows based on time ranges

I have a customer who uses different devices over a specific period of time, tracked with valid_from and valid_to dates. But every time something changes for a device, a new row is written without any visible change to the row-based data, besides a new valid from/to.
What I'm trying to do is aggregate the first two rows into one, and the same for rows 3 and 4, while leaving rows 5 and 6 as they are. All the solutions I've come up with so far only work for a usage history where the user doesn't switch back to device a; everything keeps failing.
I'd really appreciate some help, thanks in advance!
If you know that the previous valid_to is the same as the current valid_from, then you can use lag() to identify where a new grouping starts. Then use a cumulative sum to calculate the grouping and finally aggregation:
select cust, act_dev, min(valid_from), max(valid_to)
from (select t.*,
             sum(case when prev_valid_to = valid_from then 0 else 1 end) over (partition by cust order by valid_from) as grouping
      from (select t.*,
                   lag(valid_to) over (partition by cust, act_dev order by valid_from) as prev_valid_to
            from t
           ) t
     ) t
group by cust, act_dev, grouping;
Here is a db<>fiddle.
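The same islands logic can be exercised in SQLite from Python. The data below is invented to mirror the scenario, including a switch back to the first device, which correctly starts a new island (`grp` is used instead of `grouping`, since the latter is reserved in some databases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (cust TEXT, act_dev TEXT, valid_from TEXT, valid_to TEXT);
INSERT INTO t VALUES
  ('c1', 'dev_a', '2020-01-01', '2020-02-01'),
  ('c1', 'dev_a', '2020-02-01', '2020-03-01'),  -- contiguous: merges with row 1
  ('c1', 'dev_b', '2020-03-01', '2020-04-01'),
  ('c1', 'dev_a', '2020-04-01', '2020-05-01');  -- back to dev_a: a new island
""")

merged = conn.execute("""
SELECT cust, act_dev, MIN(valid_from), MAX(valid_to)
FROM (SELECT t.*,
             SUM(CASE WHEN prev_valid_to = valid_from THEN 0 ELSE 1 END)
                 OVER (PARTITION BY cust ORDER BY valid_from) AS grp
      FROM (SELECT t.*,
                   LAG(valid_to) OVER (PARTITION BY cust, act_dev
                                       ORDER BY valid_from) AS prev_valid_to
            FROM t) t) t
GROUP BY cust, act_dev, grp
ORDER BY MIN(valid_from)
""").fetchall()
for m in merged:
    print(m)
```

The four input rows collapse to three: the two contiguous dev_a rows merge, while the later return to dev_a stays separate.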

How to get rows with a value greater than the min value for that group plus a constant?

I have some pageview data where each row is a single pageview and am looking to find pageviews from each user's second (and any subsequent) visit; for simplicity's sake I'll use a full day as the session length. I assume the query should look something like,
SELECT date_time, url FROM pageviews WHERE date_time > date_add(min(date_time), 1)
Of course, the min function doesn't actually exist, and I need the min date_time for each visitor, not over the whole table.
I looked at some other questions and it looks like the windowing and analytic functions may be the right thing to use, but the documentation is sparse and I can't find a single example of how to do this anywhere.
The following query
SELECT user_id, date_time, rank() OVER(PARTITION BY user_id ORDER BY date_time) FROM pageviews
returns a list of pageviews ranked by time, so technically I could take the rows where the rank is greater than 1 for each user_id, but I can't figure out how to do that. It doesn't seem to be possible to use the OVER clause inside a WHERE.
Sample data:
date_time url user_id
12-21-2015 00:00:07 www.mywebsite.com 1234
12-13-2015 14:12:02 www.mywebsite.com 5678
12-16-2015 23:24:25 www.mywebsite.com 5678
Desired result
user_id
5678
(I need at least the user id; any extra info, e.g. the datetime of the second visit, would be great.)
Use a subquery:
SELECT user_id, date_time
FROM (SELECT user_id,
             date_time,
             rank() OVER (PARTITION BY user_id ORDER BY date_time) AS rnk
      FROM pageviews) t
WHERE rnk > 1;
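A runnable check of this subquery approach (rank each user's pageviews by time, then keep rnk > 1), written against SQLite from Python using the question's sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pageviews (date_time TEXT, url TEXT, user_id INTEGER);
INSERT INTO pageviews VALUES
  ('2015-12-21 00:00:07', 'www.mywebsite.com', 1234),
  ('2015-12-13 14:12:02', 'www.mywebsite.com', 5678),
  ('2015-12-16 23:24:25', 'www.mywebsite.com', 5678);
""")

# Rank pageviews per user by time; rnk > 1 keeps everything after the first visit.
repeat_visits = conn.execute("""
SELECT user_id, date_time
FROM (SELECT user_id, date_time,
             RANK() OVER (PARTITION BY user_id ORDER BY date_time) AS rnk
      FROM pageviews) t
WHERE rnk > 1
""").fetchall()
print(repeat_visits)  # [(5678, '2015-12-16 23:24:25')]
```

This reproduces the desired result from the question: only user 5678 has a second pageview, and we also get its datetime. Note that rnk > 1 alone ignores the one-day session window mentioned in the question; enforcing that would additionally need the per-user minimum date_time plus a day.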