Get range between FIRST_VALUE and LAST_VALUE - sql

timestamp                       id  scope
2021-01-23 12:52:34.159999 UTC  1   enter_page
2021-01-23 12:53:02.342 UTC     1   view_product
2021-01-23 12:53:02.675 UTC     1   checkout
2021-01-23 12:53:04.342 UTC     1   search_page
2021-01-23 12:53:24.513 UTC     1   checkout
I am trying to get all the values between the FIRST_VALUE and the LAST_VALUE in the column 'scope' using window/analytic functions.
I already get first_value() = enter_page and last_value() = checkout by using window functions in SQLite:
FIRST_VALUE(scope) OVER ( PARTITION BY id ORDER BY julianday(timestamp) ASC) first_page
FIRST_VALUE(scope) OVER ( PARTITION BY id ORDER BY julianday(timestamp) DESC ) last_page
I am trying to capture all the steps in between, excluding the edges: view_product, checkout, search_page[, N more], to later aggregate them into a string of unique values (STRING_AGG-style).
Once that is done, I will check whether the customer opened the checkout multiple times at some point during the purchase journey.
My result should look like:
id  first_page  last_page  inbetween_pages
1   enter_page  checkout   view_product, checkout, search_page
P.S. I am trying to avoid using Python to process this; I would like a 'clean' way of doing it in pure SQL.
Thanks a lot, guys.

You can do it with the GROUP_CONCAT() window function, which supports an ORDER BY clause, so the scopes in inbetween_pages come out in the correct order. The GROUP_CONCAT() aggregate function, by contrast, does not support ORDER BY, and the order of the results it returns is not guaranteed:
SELECT DISTINCT id, first_page, last_page,
GROUP_CONCAT(CASE WHEN timestamp NOT IN (min_timestamp, max_timestamp) THEN scope END)
OVER (PARTITION BY id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) inbetween_pages
FROM (
SELECT *,
FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY timestamp) first_page,
FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY timestamp DESC) last_page,
MIN(timestamp) OVER (PARTITION BY id) min_timestamp,
MAX(timestamp) OVER (PARTITION BY id) max_timestamp
FROM tablename
)
See the demo.
Results:
id  first_page  last_page  inbetween_pages
1   enter_page  checkout   view_product,checkout,search_page

Hmmm . . . I am thinking:
select id, group_concat(scope, ',')
from (select t.*,
row_number() over (partition by id order by timestamp) as seqnum_asc,
row_number() over (partition by id order by timestamp desc) as seqnum_desc
from t
order by id, timestamp
) t
where 1 not in (seqnum_asc, seqnum_desc)
group by id;
In SQLite, group_concat() doesn't accept an order by argument. My understanding is that it respects the ordering from the subquery, which is why the subquery has an order by.
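Applied to the sample data in the question (assuming the table is named t), this should return:
id  group_concat(scope, ',')
1   view_product,checkout,search_page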

Related

How to return max date per month for user

I have the following table:
I would like to return the maximum threshold date per month for every user, so my final result should look like this:
I wanted to use the analytic function ROW_NUMBER and return the maximum row number, but how do I do that per month for each user? Is there any simpler way to do it in BigQuery?
You can partition the row_number by the user and the month, and then take the first one for each:
SELECT user_id, threshold_date, net_deposits_usd
FROM (SELECT user_id, threshold_date, net_deposits_usd,
             ROW_NUMBER() OVER (PARTITION BY user_id, EXTRACT(MONTH FROM threshold_date)
                                ORDER BY threshold_date DESC, net_deposits_usd DESC) AS rk
      FROM mytable)
WHERE rk = 1
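One caveat: EXTRACT(MONTH FROM threshold_date) puts, say, January 2020 and January 2021 into the same partition. If the data spans multiple years, here is a sketch partitioning by DATE_TRUNC instead (same assumed table and column names as above):
SELECT user_id, threshold_date, net_deposits_usd
FROM (SELECT user_id, threshold_date, net_deposits_usd,
             ROW_NUMBER() OVER (PARTITION BY user_id, DATE_TRUNC(threshold_date, MONTH)
                                ORDER BY threshold_date DESC) AS rk
      FROM mytable)
WHERE rk = 1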
BigQuery now supports qualify, which does everything you want. For the month, just use date_trunc():
select t.*
from t
qualify row_number() over (partition by user_id, date_trunc(threshold_date, month)
                           order by threshold_date desc, net_deposits_usd desc
                          ) = 1;
A simple alternative uses arrays and group by:
select array_agg(t order by threshold_date desc, net_deposits_usd desc limit 1)[ordinal(1)].*
from t
group by user_id, date_trunc(threshold_date, month) ;
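Here array_agg keeps only the top row per group (the limit 1 is applied after the order by), and [ordinal(1)].* unpacks that single struct back into columns, so no outer subquery or filter is needed.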

Redshift - Group Table based on consecutive rows

I am working right now with this table:
What I want to do is clean up this table a little bit, grouping some consecutive rows together.
Is there any way to achieve this kind of result?
The first table is already working fine; I just want to get rid of some rows to free up disk space.
One method is to peek at the previous row to see when the value changes. Assuming that valid_to and valid_from are really dates:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
             sum(case when prev_valid_to >= valid_from - interval '1 day' then 0 else 1 end)
                 over (partition by id order by valid_to rows between unbounded preceding and current row) as grp
      from (select t.*,
                   lag(valid_to) over (partition by id, class order by valid_to) as prev_valid_to
            from t
           ) t
     ) t
group by id, class, grp;
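A hypothetical trace for a single id and class (made-up dates) shows how grp increments only when a gap appears between prev_valid_to and the next valid_from:
valid_from  valid_to    prev_valid_to  case  grp
2020-01-01  2020-01-31  (null)         1     1
2020-02-01  2020-02-29  2020-01-31     0     1
2020-05-01  2020-05-31  2020-02-29     1     2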
If they are not dates, then this gets trickier. You could convert them to dates, or you could use the difference of row_numbers:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
row_number() over (partition by id order by valid_from) as seqnum,
row_number() over (partition by id, class order by valid_from) as seqnum_2
from t
) t
group by id, class, (seqnum - seqnum_2)
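A minimal trace with made-up rows for one id shows why seqnum - seqnum_2 stays constant within each consecutive run of the same class; grouping by id, class, and the difference then collapses each run into a single row:
valid_from  class  seqnum  seqnum_2  seqnum - seqnum_2
day 1       A      1       1         0
day 2       A      2       2         0
day 3       B      3       1         2
day 4       A      4       3         1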

Difference between last and second last event in a table of events

I have the following table, which was created by:
create table events (
event_type integer not null,
value integer not null,
time timestamp not null,
unique (event_type, time)
);
Given the data in the pic, I want to write a query that, for each event_type registered more than once, returns the difference between the latest and the second-latest value.
Given the above data, the output should look like:
event_type  value
2           -5
3           4
I solved it using the following:
CREATE VIEW [max_date] AS
SELECT event_type, max(time) as time, value
FROM events
group by event_type
having count(event_type) >1
order by time desc;
select event_type, value
from
(
select event_type, value, max(time)
from(
Select E1.event_type, ([max_date].value - E1.value) as value, E1.time
From events E1, [max_date]
Where [max_date].event_type = E1.event_type
and [max_date].time > E1.time
)
group by event_type
)
but this seems like a very complicated query and I wonder if there is an easier way?
Use window functions:
select e.*,
(value - prev_value)
from (select e.*,
lag(value) over (partition by event_type order by time) as prev_value,
row_number() over (partition by event_type order by time desc) as seqnum
from events e
) e
where seqnum = 1 and prev_value is not null;
You could use lag() and row_number():
select event_type, val
from (
select
event_type,
value - lag(value) over(partition by event_type order by time desc) val,
row_number() over(partition by event_type order by time desc) rn
from events
) t
where rn = 1 and val is not null
The inner query ranks records having the same event_type by descending time, and computes the difference between each value and the previous one.
Then, the outer query just filters on the top record per group.
Here is a way to do this using a combination of analytic functions and aggregation. This approach is friendly in the event that your database does not support LEAD and LAG.
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY event_type ORDER BY time DESC) AS rn
FROM events
)
SELECT
event_type,
MAX(CASE WHEN rn = 1 THEN value END) - MAX(CASE WHEN rn = 2 THEN value END) AS value
FROM cte
GROUP BY
event_type
HAVING
COUNT(*) > 1;

Returning 5 Most Recent Trips Per ID

I have a table with the number of trips taken and a station_id, and I want to return the 5 most recent trips made per ID (a sample image of the table is below).
The query I made below aggregates the station IDs and the most recent trip, but I am having a difficult time returning the 5 most recent:
SELECT start_station_id, MAX(start_time)
FROM `bpd.shop.trips`
group by start_station_id, start_time
Trips:
https://imgur.com/Ebh9FeZ
Any help would be much appreciated, thanks!
You can use row_number():
SELECT t.*
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY start_station_id ORDER BY start_time DESC) as seqnum
FROM `bpd.shop.trips` t
) t
WHERE seqnum <= 5;
Below is for BigQuery Standard SQL
Option 1
#standardSQL
SELECT record.*
FROM (
SELECT ARRAY_AGG(t ORDER BY start_time DESC LIMIT 5) arr
FROM `bpd.shop.trips` t
GROUP BY start_station_id
), UNNEST(arr) record
Option 2
#standardSQL
SELECT * EXCEPT (pos) FROM (
SELECT *, ROW_NUMBER() OVER(win) AS pos
FROM `bpd.shop.trips`
WINDOW win AS (PARTITION BY start_station_id ORDER BY start_time DESC)
)
WHERE pos <= 5
I recommend using Option 1, as it is the more scalable option.

Tagging consecutive days

Supposedly I have data something like this:
ID,DATE
101,01jan2014
101,02jan2014
101,03jan2014
101,07jan2014
101,08jan2014
101,10jan2014
101,12jan2014
101,13jan2014
102,08jan2014
102,09jan2014
102,10jan2014
102,15jan2014
How could I efficiently code this in Greenplum SQL such that I can have a grouping of consecutive days similar to the one below:
ID,DATE,PERIOD
101,01jan2014,1
101,02jan2014,1
101,03jan2014,1
101,07jan2014,2
101,08jan2014,2
101,10jan2014,3
101,12jan2014,4
101,13jan2014,4
102,08jan2014,1
102,09jan2014,1
102,10jan2014,1
102,15jan2014,2
You can do this using row_number(). For a consecutive group, the difference between the date and the row_number() is a constant. Then, use dense_rank() to assign the period:
select id, date,
       dense_rank() over (partition by id order by grp) as period
from (select t.*,
             date - row_number() over (partition by id order by date) * interval '1 day' as grp
      from t
     ) t
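To see why this works, trace id 101 from the sample data: subtracting the row number (in days) from each date maps every run of consecutive dates to the same constant, and dense_rank() over that constant yields the period:
DATE       row_number  date - row_number  period
01jan2014  1           31dec2013          1
02jan2014  2           31dec2013          1
03jan2014  3           31dec2013          1
07jan2014  4           03jan2014          2
08jan2014  5           03jan2014          2
10jan2014  6           04jan2014          3
12jan2014  7           05jan2014          4
13jan2014  8           05jan2014          4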