Query for days since last value change - sql

I have a table with the following data that includes any change to coupon program (Rate & Status)
timestamp     | account_ID | active | rate
--------------+------------+--------+-----
1675894331538 | 1234       | true   | 5
1675386736152 | 1234       | false  | 0
1674778434298 | 1234       | true   | 7
1673500367524 | 1234       | true   | 5
1673309563251 | 1234       | true   | 8
I am trying to determine how to best write a query to have the output look like this:
account_ID | days_since_status_change | days_since_rate_change
-----------+--------------------------+-----------------------
1234       | 2                        | 4
I've been looking into using row_number and partitioning by account_id over timestamp DESC, but I can't wrap my head around how to narrow it down to two specific events and then count the days since each event happened.
If you can make suggestions, this n00b would really appreciate the help!
I'm using BigQuery if that helps too.

You might consider the query below.
SELECT account_ID,
       COUNTIF(active_change) OVER w1 AS days_since_status_change,
       COUNTIF(rate_change)   OVER w1 AS days_since_rate_change
FROM (
  SELECT *,
         active <> LAG(active) OVER w0 AS active_change,
         rate   <> LAG(rate)   OVER w0 AS rate_change
  FROM sample_table
  WINDOW w0 AS (PARTITION BY account_ID ORDER BY timestamp)
)
QUALIFY timestamp = MAX(timestamp) OVER (PARTITION BY account_ID)
WINDOW w1 AS (PARTITION BY account_ID ORDER BY timestamp);
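The core of this answer is LAG-based change detection: compare each row with the previous row per account and flag differences. That part is portable and can be sketched with Python's built-in sqlite3 (SQLite 3.25+ for window functions); the table and column names follow the question, while the boolean columns are stored as 0/1, which is an assumption about the schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_table (timestamp INTEGER, account_ID INTEGER,
                           active INTEGER, rate INTEGER);
INSERT INTO sample_table VALUES
  (1675894331538, 1234, 1, 5),
  (1675386736152, 1234, 0, 0),
  (1674778434298, 1234, 1, 7),
  (1673500367524, 1234, 1, 5),
  (1673309563251, 1234, 1, 8);
""")

# Flag rows where active or rate differs from the previous row per account.
rows = conn.execute("""
SELECT timestamp,
       active <> LAG(active) OVER (PARTITION BY account_ID ORDER BY timestamp)
           AS active_change,
       rate <> LAG(rate) OVER (PARTITION BY account_ID ORDER BY timestamp)
           AS rate_change
FROM sample_table
ORDER BY timestamp
""").fetchall()

status_changes = sum(1 for r in rows if r[1] == 1)
rate_changes = sum(1 for r in rows if r[2] == 1)
print(status_changes, rate_changes)  # 2 4
```

The change counts match the expected output for this sample; note COUNTIF only equals "days since" when there is at most one row per day, so for real data you would compute days from the timestamp of the latest change row instead.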

Related

Sum iteratively in sql based on what value next row has?

I want to aggregate a transaction table so that it keeps summing amounts while the next row has the same type, and starts a new row whenever the type changes. This might be confusing, so here is an example -
What I have -
ID | date      | type     | amount
---+-----------+----------+-------
a  | 1/1/2023  | incoming | 10
a  | 2/1/2023  | incoming | 10
a  | 3/1/2023  | incoming | 10
a  | 4/1/2023  | incoming | 10
a  | 5/1/2023  | outgoing | 20
a  | 6/1/2023  | outgoing | 10
a  | 7/1/2023  | incoming | 10
a  | 8/1/2023  | incoming | 10
a  | 9/1/2023  | outgoing | 30
a  | 10/1/2023 | incoming | 10
Summary Output I want -
ID | type     | min_date  | max_date  | amount
---+----------+-----------+-----------+-------
a  | incoming | 1/1/2023  | 4/1/2023  | 40
a  | outgoing | 5/1/2023  | 6/1/2023  | 30
a  | incoming | 7/1/2023  | 8/1/2023  | 20
a  | outgoing | 9/1/2023  | 9/1/2023  | 30
a  | incoming | 10/1/2023 | 10/1/2023 | 10
Basically keep summing until the next row has same transaction type (after sorting on date), if it changes create a new row and repeat same process.
Thanks!
I tried approaches like using window function (dense_rank) and sum() over (partition by) but not getting the output I am looking for.
Using window functions is the correct approach. You need to identify when the type changes (one way is to use LAG or LEAD) and then assign a group number to each set. See if the following gives your expected results:
with d as (
select *,
case when lag(type) over(partition by id order by date) = type then 0 else 1 end diff
from t
), grp as (
select *, Sum(diff) over(partition by id order by date) grp
from d
)
select Id, Type,
Min(date) Min_Date,
Max(Date) Max_Date,
Sum(Amount) Amount
from grp
group by Id, Type, grp
order by Min_Date;
See this example Fiddle
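The two-step trick above (a change flag, then a running sum of the flags to number each "island") can be verified with SQLite's window functions. In the sketch below the dates are rewritten as ISO strings so they sort correctly as text, which is an assumption about the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id TEXT, date TEXT, type TEXT, amount INTEGER);
INSERT INTO t VALUES
  ('a','2023-01-01','incoming',10), ('a','2023-01-02','incoming',10),
  ('a','2023-01-03','incoming',10), ('a','2023-01-04','incoming',10),
  ('a','2023-01-05','outgoing',20), ('a','2023-01-06','outgoing',10),
  ('a','2023-01-07','incoming',10), ('a','2023-01-08','incoming',10),
  ('a','2023-01-09','outgoing',30), ('a','2023-01-10','incoming',10);
""")

rows = conn.execute("""
WITH d AS (
    -- 1 whenever the type differs from the previous row's type
    SELECT *, CASE WHEN LAG(type) OVER (PARTITION BY id ORDER BY date) = type
                   THEN 0 ELSE 1 END AS diff
    FROM t
), g AS (
    -- running sum of the change flags numbers each island
    SELECT *, SUM(diff) OVER (PARTITION BY id ORDER BY date) AS grp
    FROM d
)
SELECT id, type, MIN(date) AS min_date, MAX(date) AS max_date,
       SUM(amount) AS amount
FROM g
GROUP BY id, type, grp
ORDER BY min_date
""").fetchall()

print([r[4] for r in rows])  # [40, 30, 20, 30, 10]
```

The grouped amounts match the "Summary Output" table in the question.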

Count number of visits before a user purchases. (Count should be reset after every purchase) in Presto SQL

I have a table of events which records actions the customer takes in our website. I want to find out how many times a customer visited before he finally purchases an item.
The above table will be aggregated as
In the first week customerid 1 made 3 visits (including the visit in which he made a purchase). Then he made another purchase in the same week, in another visit. So you see 3 in the first case and 1 in the second case; that is, every time the user makes a purchase the count should be reset.
The solution I came up with is very messy and slow (it involves multiple joins and 3 window functions), and it does not work in some cases; I am missing some data.
It would be great if someone can help me in the right direction on how to approach this scenario.
Thanks in advance.
Try this:
WITH
-- your input, don't use in final query
visits(wk,visit_id,cust_id,has_purchased) AS (
SELECT 1,1,1,FALSE
UNION ALL SELECT 1,2,1,FALSE
UNION ALL SELECT 1,3,1,TRUE
UNION ALL SELECT 1,2,1,TRUE
)
-- real query starts here, replace following comma with "WITH"
,
with_counter AS (
SELECT
*
, LAG(CASE WHEN has_purchased THEN 1 ELSE 0 END,1,0)
OVER(PARTITION BY wk,cust_id ORDER BY visit_id) AS grp_end
FROM visits
)
SELECT
wk
, cust_id
, grp_end
, COUNT(*) AS visits_before_purchase
FROM with_counter
GROUP BY
wk
, cust_id
, grp_end
;
-- out wk | cust_id | grp_end | visits_before_purchase
-- out ----+---------+---------+------------------------
-- out 1 | 1 | 0 | 3
-- out 1 | 1 | 1 | 1
I'm assuming that each time a customer visits, their visit id increases by 1. So I just took the difference between the visit ids for each customer to find out how many visits they made before purchasing something.
SELECT weeks, visit_id, customer_id, purchase_flag,
       CASE WHEN diff IS NULL THEN visit_id ELSE diff END
FROM (
    SELECT *, visit_id - LAG(visit_id) OVER (PARTITION BY customer_id
                                             ORDER BY customer_id, visit_id) AS diff
    FROM customer
    WHERE purchase_flag = 1
) AS t
ORDER BY customer_id, visit_id
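Both answers hinge on the same idea: the purchase flag of the *previous* visit marks a group boundary, so a running sum of lagged flags numbers the groups and a plain GROUP BY counts visits per group. A runnable sketch with sqlite3, using the first answer's sample week (with visit ids renumbered 1-4, an assumption, since the original sample repeats an id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (wk INTEGER, visit_id INTEGER, cust_id INTEGER,
                     has_purchased INTEGER);
INSERT INTO visits VALUES (1,1,1,0), (1,2,1,0), (1,3,1,1), (1,4,1,1);
""")

rows = conn.execute("""
WITH flagged AS (
    -- a visit starts a new group when the PREVIOUS visit was a purchase
    SELECT *, LAG(has_purchased, 1, 0)
              OVER (PARTITION BY wk, cust_id ORDER BY visit_id) AS prev_purchased
    FROM visits
), grouped AS (
    -- running sum of boundary flags assigns a group number per purchase cycle
    SELECT *, SUM(prev_purchased)
              OVER (PARTITION BY wk, cust_id ORDER BY visit_id) AS grp
    FROM flagged
)
SELECT wk, cust_id, grp, COUNT(*) AS visits_before_purchase
FROM grouped
GROUP BY wk, cust_id, grp
ORDER BY grp
""").fetchall()

print(rows)  # [(1, 1, 0, 3), (1, 1, 1, 1)]
```

The counts (3 visits up to the first purchase, then 1) match the expected aggregation described in the question; the same LAG/SUM pattern works in Presto since it supports the 3-argument lag with a default.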

Filter and rank (using row_partition) with a filter inside the row_partition

I have a table Jobs that stores a bunch of Jobs that every User from a Users table posts. Each Job has a status. My first goal is to identify the first completed (status = completed) job for each user. I was able to do so using:
SELECT
    user_id,
    starts_time,
    id AS job_id
FROM (
    SELECT
        user_id,
        starts_time,
        id,
        -- sort by starts_time, and rank ascending
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY starts_time ASC) AS rn
    FROM jobs
    WHERE
        -- status 2 is completed
        status = 2
    GROUP BY user_id, assignment_id, id
    ORDER BY user_id
) AS jobs
WHERE rn = 1
Here is what it returns:
user_id | starts_time | job_id |
-----------------------------------------------
123 | 2016-04-18 14:30:00+00 | 1292 |
124 | 2016-04-18 19:00:00+00 | 2389 |
128 | 2016-04-16 13:00:00+00 | 3201 |
Just as some context, there are a lot of cases where a User's first job isn't a job with the status "completed". For example, they'll post a list of jobs that have any one of the following statuses before they see a completed job: ("Unfilled", "Voided", "Cancelled")
For every user I want to establish which jobs came before that user saw their first completed job. I was hoping the query above would be a starting point, from which I could simply return, for every user, any job with a starts_time preceding that of their first completed job.
*Sorry if this is confusing, this is my first time posting for help on Stack Overflow, any constructive criticism is appreciated!
For every user I want to establish which jobs came before that user saw their first completed job.
For each user, you want all the records before the first status "2". You can use window functions:
select *
from (
select j.*,
bool_or(status = 2) over(partition by user_id order by starts_time) as flag
from jobs j
) t
where not flag
bool_or checks if the current row or any preceding row satisfies the condition.
If you want to retain the first status 2, then you can just change the over() clause of the window function to not consider the current row:
select *
from (
select j.*,
bool_or(status = 2) over(
partition by user_id
order by starts_time rows between unbounded preceding and 1 preceding
) as flag
from jobs j
) t
where flag is distinct from true
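bool_or is PostgreSQL-specific, but the same running flag can be emulated in any engine by taking MAX over a 0/1 boolean expression. A sketch in sqlite3, with made-up jobs data (the status codes other than 2 are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (user_id INTEGER, starts_time TEXT, status INTEGER);
INSERT INTO jobs VALUES
  (123, '2016-04-16 09:00', 1),   -- e.g. unfilled   (assumed code)
  (123, '2016-04-17 10:00', 3),   -- e.g. cancelled  (assumed code)
  (123, '2016-04-18 14:30', 2),   -- first completed job
  (123, '2016-04-19 08:00', 2);
""")

# Running MAX of a 0/1 expression plays the role of bool_or: it becomes 1
# at the first completed job and stays 1 for every later row.
rows = conn.execute("""
SELECT user_id, starts_time
FROM (
    SELECT j.*,
           MAX(status = 2) OVER (PARTITION BY user_id ORDER BY starts_time)
               AS flag
    FROM jobs j
)
WHERE flag = 0
ORDER BY starts_time
""").fetchall()

print(rows)  # only the two jobs before the first completed one
```

Pushing the window back one row (ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) would additionally keep the first completed job itself, exactly as in the second variant above.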

How to identify rows per group before a certain value gap?

I'd like to update a certain column in a table based on the difference in another column's value between neighboring rows in PostgreSQL.
Here is a test setup:
CREATE TABLE test(
main INTEGER,
sub_id INTEGER,
value_t INTEGER);
INSERT INTO test (main, sub_id, value_t)
VALUES
(1,1,8),
(1,2,7),
(1,3,3),
(1,4,85),
(1,5,40),
(2,1,3),
(2,2,1),
(2,3,1),
(2,4,8),
(2,5,41);
My goal is to determine, in each group main, starting from sub_id 1 and checking in ascending order of sub_id, which value in diff exceeds a certain threshold (e.g. < 10 or > -10). Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE, by filling column newval with a value, e.g. 1.
Should I use a loop or are there smarter solutions?
The task description in pseudocode:
FOR i in GROUP [PARTITION BY main ORDER BY sub_id]:
DO until diff > 10 OR diff <-10
SET newval = 1 AND LEAD(newval) = 1
Basic SELECT
As fast as possible:
SELECT *, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT *, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub;
Fine points
Your thought model evolves around the window function lead(). But its counterpart lag() is a bit more efficient for the purpose, since there is no off-by-one error when including the row before the big gap. Alternatively, use lead() with inverted sort order (ORDER BY sub_id DESC).
To avoid NULL for the first row in the partition, provide value_t as the default in the 3rd parameter, which makes the diff 0 instead of NULL. Both lead() and lag() have that capability.
diff BETWEEN -10 AND 10 is slightly faster than @ diff < 11 (clearer and more flexible, too). (@ being the "absolute value" operator, equivalent to the abs() function.)
bool_or() or bool_and() in the outer window function is probably cheapest to mark all rows up to the big gap.
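The basic SELECT translates almost verbatim to SQLite, with MIN over a 0/1 boolean standing in for bool_and (the running minimum drops to 0 at the first out-of-range diff and stays there). Using the test data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (main INTEGER, sub_id INTEGER, value_t INTEGER);
INSERT INTO test VALUES
  (1,1,8), (1,2,7), (1,3,3), (1,4,85), (1,5,40),
  (2,1,3), (2,2,1), (2,3,1), (2,4,8),  (2,5,41);
""")

rows = conn.execute("""
SELECT main, sub_id,
       -- running MIN of the in-range test emulates Postgres bool_and()
       MIN(diff BETWEEN -10 AND 10)
           OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
    -- lag(value_t, 1, value_t) makes the first diff 0 instead of NULL
    SELECT *, value_t - LAG(value_t, 1, value_t)
              OVER (PARTITION BY main ORDER BY sub_id) AS diff
    FROM test
) sub
ORDER BY main, sub_id
""").fetchall()

flags = [r[2] for r in rows]
print(flags)  # [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
```

In group 1 the flag drops at sub_id 4 (diff = 82) and in group 2 at sub_id 5 (diff = 33), matching the "big gap" logic of the Postgres query.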
Your UPDATE
Until the threshold is reached I would like to flag every passed row AND the one row where the condition is FALSE by filling column newval with a value e.g. 1.
Again, as fast as possible.
UPDATE test AS t
SET newval = 1
FROM (
SELECT main, sub_id
, bool_and(diff BETWEEN -10 AND 10) OVER (PARTITION BY main ORDER BY sub_id) AS flag
FROM (
SELECT main, sub_id
, value_t - lag(value_t, 1, value_t) OVER (PARTITION BY main ORDER BY sub_id) AS diff
FROM test
) sub
) u
WHERE (t.main, t.sub_id) = (u.main, u.sub_id)
AND u.flag;
Fine points
Computing all values in a single query is typically substantially faster than a correlated subquery.
The added WHERE condition AND u.flag makes sure we only update rows that actually need an update.
If some of the rows may already have the right value in newval, add another clause to avoid those empty updates, too: AND t.newval IS DISTINCT FROM 1
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
SET newval = 1 assigns a constant (even though we could use the actually calculated value in this case), that's a bit cheaper.
db<>fiddle here
Your question was hard to comprehend, the "value_t" column was irrelevant to the question, and you forgot to define the "diff" column in your SQL.
Anyhow, here's your solution:
WITH data AS (
SELECT main, sub_id, value_t
, abs(value_t
- lead(value_t) OVER (PARTITION BY main ORDER BY sub_id)) > 10 is_evil
FROM test
)
SELECT main, sub_id, value_t
, CASE max(is_evil::int)
OVER (PARTITION BY main ORDER BY sub_id
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
WHEN 1 THEN NULL ELSE 1 END newval
FROM data;
I'm using a CTE to prepare the data (computing whether a row is "evil"), and then the "max" window function is used to check if there were any "evil" rows before the current one, per partition.
EXISTS on an aggregating subquery:
UPDATE test u
SET value_t = NULL
WHERE EXISTS (
SELECT * FROM (
SELECT main,sub_id
, value_t , ABS(value_t - lag(value_t)
OVER (PARTITION BY main ORDER BY sub_id) ) AS absdiff
FROM test
) x
WHERE x.main = u.main
AND x.sub_id <= u.sub_id
AND x.absdiff >= 10
)
;
SELECT * FROM test
ORDER BY main, sub_id;
Result:
UPDATE 3
main | sub_id | value_t
------+--------+---------
1 | 1 | 8
1 | 2 | 7
1 | 3 | 3
1 | 4 |
1 | 5 |
2 | 1 | 3
2 | 2 | 1
2 | 3 | 1
2 | 4 | 8
2 | 5 |
(10 rows)

GROUP values separated by specific records

I want to make a specific counter which will increase by one each time a specific record is found in a row.
time  | event    | revenue | counter
------+----------+---------+--------
13.37 | START    | 20      | 1
13.38 | action A | 10      | 1
13.40 | action B | 5       | 1
13.42 | end      |         | 1
14.15 | START    | 20      | 2
14.16 | action B | 5       | 2
14.18 | end      |         | 2
15.10 | START    | 20      | 3
15.12 | end      |         | 3
I need to find out the total revenue for every visit (the actions between START and END). I was thinking the best way would be to set a counter like the one in the table above, so I could group events. But if you have a better solution, I would be grateful.
You can use a query similar to the following:
with StartTimes as
(
select time,
startRank = row_number() over (order by time)
from events
where event = 'START'
)
select e.*, counter = st.startRank
from events e
outer apply
(
select top 1 st.startRank
from StartTimes st
where e.time >= st.time
order by st.time desc
) st
SQL Fiddle with demo.
May need to be updated based on the particular characteristics of the actual data, things like duplicate times, missing events, etc. But it works for the sample data.
SQL Server 2012 supports an OVER clause for aggregates, so if you're up to date on version, this will give you the counter you want:
count(case when eventname='START' then 1 end) over (order by eventtime)
You could also use the latest START time instead of a counter to group by, like this:
with t as (
select
*,
max(case when eventname='START' then eventtime end)
over (order by eventtime) as timeStart
from YourTable
)
select
timeStart,
max(eventtime) as timeEnd,
sum(revenue) as totalRevenue
from t
group by timeStart;
Here's a SQL Fiddle demo using the schema Ian posted for his solution.
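The OVER-clause counter from the second answer works in any engine with window aggregates, not just SQL Server 2012. A sqlite3 sketch over the sample events, assuming NULL revenue for the end rows, which then groups by the counter to get revenue per visit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (eventtime TEXT, eventname TEXT, revenue INTEGER);
INSERT INTO events VALUES
  ('13.37','START',20), ('13.38','action A',10), ('13.40','action B',5),
  ('13.42','end',NULL),
  ('14.15','START',20), ('14.16','action B',5), ('14.18','end',NULL),
  ('15.10','START',20), ('15.12','end',NULL);
""")

rows = conn.execute("""
WITH t AS (
    -- running count of START rows: increases by one at each new visit
    SELECT *,
           COUNT(CASE WHEN eventname = 'START' THEN 1 END)
           OVER (ORDER BY eventtime) AS counter
    FROM events
)
SELECT counter, SUM(revenue) AS totalRevenue
FROM t
GROUP BY counter
ORDER BY counter
""").fetchall()

print(rows)  # [(1, 35), (2, 25), (3, 20)]
```

Grouping by the running count and grouping by the latest START time (as in the answer above) are equivalent here; both yield one row per visit.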