PostgreSQL LAG with condition - sql

My data looks like this:
id  user  data  date
1   1     1     2023-02-05
2   2     1     2023-02-05
3   1     2     2023-02-06
4   1     3     2023-02-07
5   2     5     2023-02-07
I want to get a difference between data of each row and a previous row for this user like this:
id  user  data  date        diff
1   1     1     2023-02-05
2   2     1     2023-02-05
3   1     2     2023-02-06  1
4   1     3     2023-02-07  1
5   2     5     2023-02-07  4
I can do this with the LAG function, but without the condition that both rows of the difference must belong to the same user. How can I add that condition in Postgres?

We can use LAG() as follows:
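-- "user" is a reserved word in PostgreSQL (unquoted, it returns the current user), so it must be double-quoted
SELECT id, "user", data, date,
       data - LAG(data) OVER (PARTITION BY "user" ORDER BY date) AS diff
FROM yourTable
ORDER BY date, "user";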

As per the comment: window functions let you partition your input, narrowing down the context of each window the way you want it:
select *,
       coalesce(data - (lag(data) over w1), 0) as data_diff
from test
window w1 as (partition by "user" order by date asc)
order by date,
         "user";
It's also handy to define the window separately to save space, and to use coalesce() to handle the null that lag() returns for the first row of each partition (or that lead() returns for the last row).
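To see the effect of PARTITION BY on the sample data, here is a minimal sketch that runs the same idea through Python's built-in sqlite3 module instead of PostgreSQL. The table name visits and the column names user_id and day are hypothetical stand-ins chosen to avoid the reserved word, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical names (visits, user_id, day); requires SQLite 3.25+ for window functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id INTEGER, user_id INTEGER, data INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2023-02-05"),
        (2, 2, 1, "2023-02-05"),
        (3, 1, 2, "2023-02-06"),
        (4, 1, 3, "2023-02-07"),
        (5, 2, 5, "2023-02-07"),
    ],
)

# PARTITION BY restarts LAG() for each user, so a user's first row has no predecessor.
diffs = conn.execute(
    """
    SELECT id, data - LAG(data) OVER (PARTITION BY user_id ORDER BY day) AS diff
    FROM visits
    ORDER BY id
    """
).fetchall()

for row in diffs:
    print(row)
```

The first row of each user comes back with a NULL (None) diff, which is where the coalesce() variant would substitute 0.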

Related

Snowflake SQL: trying to calculate time difference between subsets of subsequent rows

I have some data like the following in a Snowflake database
DEVICE_SERIAL  REASON_CODE  VERSION  MESSAGE_CREATED_AT   NEXT_REASON_CODE
BA1254862158   1            4        2022-06-23 02:06:03  4
BA1254862158   4            4        2022-06-23 02:07:07  1
BA1110001111   1            5        2022-06-16 16:19:04  4
BA1110001111   4            5        2022-06-16 17:43:04  1
BA1110001111   5            5        2022-06-20 14:37:45  4
BA1110001111   4            5        2022-06-20 17:31:12  1
That's the result of a previous query. I'm trying to get the difference between the message_created_at timestamps of subsequent rows where the device_serial is the same, the first row of the pair has a reason_code of 1 or 5, and the second row of the pair has a reason_code of 4.
For this example, my desired output would be
DEVICE_SERIAL  VERSION  DELTA_SECONDS
BA1254862158   4        64
BA1110001111   5        5040
BA1110001111   5        10407
It's easy to calculate the time difference between every pair of rows (just lead or lag + datediff). But I'm not sure how to structure a query to select only the desired rows so that I can get a datediff between them, without calculating spurious datediffs.
My ultimate goal is to see how these datediffs change between versions. I am but a lowly C programmer, my SQL-fu is weak.
with data as (
    select *,
           count(case when reason_code in (1, 5) then 1 end)
               over (partition by device_serial order by message_created_at) as grp
           /* or alternately bracket by the end code */
           -- count(case when reason_code = 4 then 1 end)
           --     over (partition by device_serial order by message_created_at desc) as grp
    from T
)
select device_serial, min(version) as version,
       datediff(second, min(message_created_at), max(message_created_at)) as delta_seconds
from data
group by device_serial, grp
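The running count in the query above starts a new group at every reason_code of 1 or 5, so each start row and the rows that follow it share a grp value. A minimal sketch of that idea, run through Python's sqlite3 module rather than Snowflake: SQLite has no datediff(), so the difference of strftime('%s', ...) values is used as a stand-in (an assumption of this sketch, not Snowflake syntax), and the table name t is hypothetical. It needs SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table name t; requires SQLite 3.25+ for window functions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (device_serial TEXT, reason_code INTEGER, "
    "version INTEGER, message_created_at TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("BA1254862158", 1, 4, "2022-06-23 02:06:03"),
        ("BA1254862158", 4, 4, "2022-06-23 02:07:07"),
        ("BA1110001111", 1, 5, "2022-06-16 16:19:04"),
        ("BA1110001111", 4, 5, "2022-06-16 17:43:04"),
        ("BA1110001111", 5, 5, "2022-06-20 14:37:45"),
        ("BA1110001111", 4, 5, "2022-06-20 17:31:12"),
    ],
)

# The running count increases on each start code (1 or 5), so a start row and the
# end row that follows it land in the same grp within a device.
deltas = conn.execute(
    """
    WITH data AS (
        SELECT *,
               COUNT(CASE WHEN reason_code IN (1, 5) THEN 1 END)
                   OVER (PARTITION BY device_serial ORDER BY message_created_at) AS grp
        FROM t
    )
    SELECT device_serial,
           MIN(version) AS version,
           CAST(strftime('%s', MAX(message_created_at)) AS INTEGER)
             - CAST(strftime('%s', MIN(message_created_at)) AS INTEGER) AS delta_seconds
    FROM data
    GROUP BY device_serial, grp
    ORDER BY MIN(message_created_at)
    """
).fetchall()

for row in deltas:
    print(row)
```

On the sample rows this yields the three deltas from the desired output (64, 5040 and 10407 seconds), ordered here by the start timestamp of each pair.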

SQLite - Rolling Average/Sum

I have the dataset shown below and am wondering how to compute a rolling average over the current record and the next two records. Example: consider the first record, whose total is 3, followed by 4 and 7. The rolling 3-day average for the first record would be 4.6, and so on.
Date Total
1 3
2 4
3 7
4 1
5 2
6 4
Expected output:
Date Total 3day_rolling_Avg
1 3 4.6
2 4 4
3 7 3.3
4 1 2.3
5 2 null
6 4 null
PS: Having the "null" values isn't important. This is just sample data; I need to look at more than 3 days (e.g. a 30-day rolling window).
I think that the simplest approach is a window avg() with the proper window frame:
select
t.*,
avg(total)
over(order by date rows between current row and 2 following) as "3d_rolling_avg"
from mytable t
If you want to return a null value when fewer than 2 rows follow the current one, as shown in your expected results, you can use rank() on top of it:
select
    t.*,
    case when rank() over(order by date desc) > 2
        then avg(total)
            over(order by date rows between current row and 2 following)
    end as "3d_rolling_avg"
from mytable t
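As a quick check of the forward-looking frame, here is a minimal sketch using Python's sqlite3 module (SQLite 3.25 or newer). Instead of the rank() test, it uses a swapped-in alternative: count(*) over the same frame, which is 3 only when two rows actually follow the current one. The named window w and the alias rolling_avg are choices made for this sketch.

```python
import sqlite3

# Requires SQLite 3.25+ for window functions and the WINDOW clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (date INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 3), (2, 4), (3, 7), (4, 1), (5, 2), (6, 4)],
)

# The frame covers the current row plus the next two; a frame holding fewer than
# three rows means the window is incomplete, so the CASE yields NULL there.
rolling = conn.execute(
    """
    SELECT date,
           CASE WHEN COUNT(*) OVER w = 3 THEN AVG(total) OVER w END AS rolling_avg
    FROM mytable
    WINDOW w AS (ORDER BY date ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
    ORDER BY date
    """
).fetchall()

for row in rolling:
    print(row)
```

For a 30-day window only the frame bound (29 following) and the count comparison (30) would change.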

GBQ SQL: How to find first instance of X value and pull a corresponding row

I have a table that records the history of each ID per LOCATION. This table is updated each day to keep track of the history of any change in a certain row(ID). Note: The date field is not in chronological order.
ID Count Date (datetime type)
1 20 2020-01-15T12:00:00.000
1 16 2020-03-15T12:00:00.000
1 13 2020-04-15T12:00:00.000
1 4 2020-05-15T12:00:00.000
1 0 2020-06-15T12:00:00.000
2 20 2020-01-15T12:00:00.000
2 10 2020-02-15T12:00:00.000
3 12 2020-01-15T12:00:00.000
3 10 2020-02-15T12:00:00.000
3 0 2020-03-15T12:00:00.000
For each unique ID, I need to pull the first instance (oldest date) where the Count value is zero. If a unique ID does not have an instance where its Count value is zero, I need to pull the most current Count value.
Here's what my results should look like below:
ID Count Date (datetime type)
1 0 2020-06-15T12:00:00.000
2 10 2020-02-15T12:00:00.000
3 0 2020-03-15T12:00:00.000
I can't seem to wrap my head around how to code this in Google BigQuery.
Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE
CASE COUNTIF(count = 0)
WHEN 0 THEN ARRAY_AGG(t ORDER BY date DESC LIMIT 1)[OFFSET(0)]
ELSE ARRAY_AGG(t ORDER BY count, date LIMIT 1)[OFFSET(0)]
END
FROM `project.dataset.table` t
GROUP BY id
When applied to the sample data in your question, the output is:
Row id count date
1 1 0 2020-05-15 12:00:00 UTC
2 2 10 2020-03-15 12:00:00 UTC
3 3 0 2020-06-15 12:00:00 UTC
Do you just want the last row for each id?
One method is row_number():
select t.*
from (select t.*,
row_number() over (partition by id
order by case when count = 0 then date end nulls last,
date desc
) as seqnum
from t
) t
where seqnum = 1;
But I also like using aggregation in BigQuery:
select (array_agg(t order by date desc limit 1))[ordinal(1)]
from t
group by id;
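The row_number() ordering above (earliest zero-count row first, otherwise the latest row) can be tried locally with a minimal sketch in Python's sqlite3 module. This is not BigQuery: the table name history and the column names cnt and dt are hypothetical, and the "nulls last" clause is replaced by an explicit leading sort key so the sketch does not depend on newer SQLite syntax. It needs SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical names (history, cnt, dt); requires SQLite 3.25+ for window functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id INTEGER, cnt INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        (1, 20, "2020-01-15T12:00:00.000"),
        (1, 16, "2020-03-15T12:00:00.000"),
        (1, 13, "2020-04-15T12:00:00.000"),
        (1, 4, "2020-05-15T12:00:00.000"),
        (1, 0, "2020-06-15T12:00:00.000"),
        (2, 20, "2020-01-15T12:00:00.000"),
        (2, 10, "2020-02-15T12:00:00.000"),
        (3, 12, "2020-01-15T12:00:00.000"),
        (3, 10, "2020-02-15T12:00:00.000"),
        (3, 0, "2020-03-15T12:00:00.000"),
    ],
)

# Sort keys per id: zero-count rows first, oldest zero first, then newest row first.
picked = conn.execute(
    """
    SELECT id, cnt, dt
    FROM (
        SELECT h.*,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY CASE WHEN cnt = 0 THEN 0 ELSE 1 END,
                            CASE WHEN cnt = 0 THEN dt END,
                            dt DESC
               ) AS seqnum
        FROM history AS h
    )
    WHERE seqnum = 1
    ORDER BY id
    """
).fetchall()

for row in picked:
    print(row)
```

With the question's rows, ids 1 and 3 return their zero-count row and id 2, which never reaches zero, returns its latest row.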

Calculate "position in run" in SQL

I have a table of consecutive ids (integers, 1 ... n), and values (integers), like this:
Input Table:
id value
-- -----
1 1
2 1
3 2
4 3
5 1
6 1
7 1
Going down the table i.e. in order of increasing id, I want to count how many times in a row the same value has been seen consecutively, i.e. the position in a run:
Output Table:
id value position in run
-- ----- ---------------
1 1 1
2 1 2
3 2 1
4 3 1
5 1 1
6 1 2
7 1 3
Any ideas? I've searched for a combination of windowing functions including lead and lag, but can't come up with it. Note that the same value can appear in the value column as part of different runs, so partitioning by value may not help solve this. I'm on Hive 1.2.
One way is to use a difference of row numbers approach to classify consecutive same values into one group. Then a row number function to get the desired positions in each group.
Query to assign groups (Running this will help you understand how the groups are assigned.)
select t.*
,row_number() over(order by id) - row_number() over(partition by value order by id) as rnum_diff
from tbl t
Final Query using row_number to get positions in each group assigned with the above query.
select id,value,row_number() over(partition by value,rnum_diff order by id) as pos_in_grp
from (select t.*
,row_number() over(order by id) - row_number() over(partition by value order by id) as rnum_diff
from tbl t
) t
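The difference-of-row-numbers approach above can be verified on the sample table with a minimal sketch in Python's sqlite3 module rather than Hive (SQLite 3.25 or newer). The column is named val here instead of value, which is a naming choice of this sketch only.

```python
import sqlite3

# Column named val instead of value (sketch-only choice); requires SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, val INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 3), (5, 1), (6, 1), (7, 1)],
)

# Inside one run both row numbers advance together, so their difference is constant;
# (val, rnum_diff) therefore identifies a run, and ROW_NUMBER() numbers rows within it.
runs = conn.execute(
    """
    SELECT id, val,
           ROW_NUMBER() OVER (PARTITION BY val, rnum_diff ORDER BY id) AS pos_in_grp
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (ORDER BY id)
                 - ROW_NUMBER() OVER (PARTITION BY val ORDER BY id) AS rnum_diff
        FROM tbl AS t
    )
    ORDER BY id
    """
).fetchall()

for row in runs:
    print(row)
```

The second run of value 1 (ids 5 to 7) gets a different rnum_diff than the first run (ids 1 and 2), which is why it restarts at position 1.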

Calculating number of trips without using a loop

I am currently working on postgres and below is the question that I have.
We have a customer ID and the date when the person visited a property. Based on this I need to calculate the number of trips. Consecutive dates are considered one trip. E.g.: if a person visits on a single date, that is trip one; if he later visits on three consecutive days, those count together as trip two.
Below is the input
ID Date
1 1-Jan
1 2-Jan
1 5-Jan
1 1-Jul
2 1-Jan
2 2-Feb
2 5-Feb
2 6-Feb
2 7-Feb
2 12-Feb
Expected output
ID Date Trip no
1 1-Jan 1
1 2-Jan 1
1 5-Jan 2
1 1-Jul 3
2 1-Jan 1
2 2-Feb 2
2 5-Feb 3
2 6-Feb 3
2 7-Feb 3
2 12-Feb 4
I was able to implement this successfully using a loop, but it runs very slowly given the volume of the data.
Can you please suggest an approach that does not use a loop?
Subtract a sequence from the dates -- these will be constant for a particular trip. Then you can use dense_rank() for the numbering:
select t.*,
dense_rank() over (partition by id order by grp) as trip_num
from (select t.*,
(date - row_number() over (partition by id order by date) * interval '1 day'
) as grp
from t
) t;
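The date-minus-sequence trick above can be tried on the sample input with a minimal sketch in Python's sqlite3 module rather than PostgreSQL (SQLite 3.25 or newer). The source gives only day and month, so the year 2023 is an assumption of this sketch, as are the table name visits and the column name day; julianday() stands in for PostgreSQL date arithmetic.

```python
import sqlite3

# Hypothetical names (visits, day) and an assumed year of 2023; requires SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        (1, "2023-01-01"), (1, "2023-01-02"), (1, "2023-01-05"), (1, "2023-07-01"),
        (2, "2023-01-01"), (2, "2023-02-02"), (2, "2023-02-05"),
        (2, "2023-02-06"), (2, "2023-02-07"), (2, "2023-02-12"),
    ],
)

# Day number minus row number stays constant across consecutive dates, so each
# distinct grp value per id is one trip; DENSE_RANK() numbers the trips in order.
trips = conn.execute(
    """
    SELECT id, day,
           DENSE_RANK() OVER (PARTITION BY id ORDER BY grp) AS trip_num
    FROM (
        SELECT v.*,
               CAST(julianday(day) AS INTEGER)
                 - ROW_NUMBER() OVER (PARTITION BY id ORDER BY day) AS grp
        FROM visits AS v
    )
    ORDER BY id, day
    """
).fetchall()

for row in trips:
    print(row)
```

This assumes at most one row per customer per date; duplicate dates would need to be removed first, since they would shift the row numbers.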