SQL LAG function over dates - sql

I have the following table (example):
+----+-------+-------------+----------------+
| id | value | last_update | ingestion_date |
+----+-------+-------------+----------------+
| 1 | 30 | 2021-02-03 | 2021-02-07 |
+----+-------+-------------+----------------+
| 1 | 29 | 2021-02-03 | 2021-02-06 |
+----+-------+-------------+----------------+
| 1 | 28 | 2021-01-25 | 2021-02-02 |
+----+-------+-------------+----------------+
| 1 | 25 | 2021-01-25 | 2021-02-01 |
+----+-------+-------------+----------------+
| 1 | 23 | 2021-01-20 | 2021-01-31 |
+----+-------+-------------+----------------+
| 1 | 20 | 2021-01-20 | 2021-01-30 |
+----+-------+-------------+----------------+
| 2 | 55 | 2021-02-03 | 2021-02-06 |
+----+-------+-------------+----------------+
| 2 | 50 | 2021-01-25 | 2021-02-02 |
+----+-------+-------------+----------------+
The result I need:
It should be the last updated value in the column value and the penult value (based in the last_update and ingestion_date) in the value2.
+----+-------+-------------+----------------+--------+
| id | value | last_update | ingestion_date | value2 |
+----+-------+-------------+----------------+--------+
| 1 | 30 | 2021-02-03 | 2021-02-07 | 28 |
+----+-------+-------------+----------------+--------+
| 2 | 55 | 2021-02-03 | 2021-02-06 | 50 |
+----+-------+-------------+----------------+--------+
The query I have right now is the following:
SELECT id, value, last_update, ingestion_date, value2
FROM
(SELECT *,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY last_update DESC, ingestion_date DESC) AS order,
LAG(value) OVER(PARTITION BY id ORDER BY last_update, ingestion_date) AS value2
FROM table)
WHERE ordem = 1
The result I am getting:
+----+-------+-------------+----------------+--------+
| ID | value | last_update | ingestion_date | value2 |
+----+-------+-------------+----------------+--------+
| 1 | 30 | 2021-02-03 | 2021-02-07 | 29 |
+----+-------+-------------+----------------+--------+
| 2 | 55 | 2021-02-03 | 2021-02-06 | 50 |
+----+-------+-------------+----------------+--------+
Obs: I am using Athena from AWS

Related

cumulative amount to current_date

base_table
month id sales cumulative_sales
2021-01-01 33205 10 10
2021-02-01 33205 15 25
Based on the base table above, I would like to add more rows up to the current month,
even if there is no sales.
Expected table
month id sales cumulative_sales
2021-01-01 33205 10 10
2021-02-01 33205 15 25
2021-03-01 33205 0 25
2021-04-01 33205 0 25
2021-05-01 33205 0 25
.........
2021-11-01 33205 0 25
My query stops at
select month, id, sales,
sum(sales) over (partition by id
order by month
rows between unbounded preceding and current row) as cumulative_sales
from base_table
This works. Assumes the month column is constrained to hold only "first of the month" dates. Use the desired hard-coded start date, or use another CTE to get the earliest date from base_table:
with base_table as (
select *
from (values
('2021-01-01'::date,33205,10)
,('2021-02-01' ,33205,15)
,('2021-01-01' ,12345,99)
,('2021-04-01' ,12345,88)
) dat("month",id,sales)
)
select cal.dt::date
,list.id
,coalesce(dat.sales,0) as sales
,coalesce(sum(dat.sales) over (partition by list.id order by cal.dt),0) as cumulative_sales
from generate_series('2020-06-01' /* use desired start date here */,current_date,'1 month') cal(dt)
cross join (select distinct id from base_table) list
left join base_table dat on dat."month" = cal.dt and dat.id = list.id
;
Results:
| dt | id | sales | cumulative_sales |
+------------+-------+-------+------------------+
| 2020-06-01 | 12345 | 0 | 0 |
| 2020-07-01 | 12345 | 0 | 0 |
| 2020-08-01 | 12345 | 0 | 0 |
| 2020-09-01 | 12345 | 0 | 0 |
| 2020-10-01 | 12345 | 0 | 0 |
| 2020-11-01 | 12345 | 0 | 0 |
| 2020-12-01 | 12345 | 0 | 0 |
| 2021-01-01 | 12345 | 99 | 99 |
| 2021-02-01 | 12345 | 0 | 99 |
| 2021-03-01 | 12345 | 0 | 99 |
| 2021-04-01 | 12345 | 88 | 187 |
| 2021-05-01 | 12345 | 0 | 187 |
| 2021-06-01 | 12345 | 0 | 187 |
| 2021-07-01 | 12345 | 0 | 187 |
| 2021-08-01 | 12345 | 0 | 187 |
| 2021-09-01 | 12345 | 0 | 187 |
| 2021-10-01 | 12345 | 0 | 187 |
| 2021-11-01 | 12345 | 0 | 187 |
| 2020-06-01 | 33205 | 0 | 0 |
| 2020-07-01 | 33205 | 0 | 0 |
| 2020-08-01 | 33205 | 0 | 0 |
| 2020-09-01 | 33205 | 0 | 0 |
| 2020-10-01 | 33205 | 0 | 0 |
| 2020-11-01 | 33205 | 0 | 0 |
| 2020-12-01 | 33205 | 0 | 0 |
| 2021-01-01 | 33205 | 10 | 10 |
| 2021-02-01 | 33205 | 15 | 25 |
| 2021-03-01 | 33205 | 0 | 25 |
| 2021-04-01 | 33205 | 0 | 25 |
| 2021-05-01 | 33205 | 0 | 25 |
| 2021-06-01 | 33205 | 0 | 25 |
| 2021-07-01 | 33205 | 0 | 25 |
| 2021-08-01 | 33205 | 0 | 25 |
| 2021-09-01 | 33205 | 0 | 25 |
| 2021-10-01 | 33205 | 0 | 25 |
| 2021-11-01 | 33205 | 0 | 25 |
The cross join pairs every date output by generate_series() with every id value from base_table.
The left join ensures that no dt+id pairs get dropped from the output when no such record exists in base_table.
The coalesce() functions ensure that the sales and cumulative_sales show 0 instead of null for dt+id combinations that don't exist in base_table. Remove them if you don't mind seeing nulls in those columns.

Redshift SQL - Count Sequences of Repeating Values Within Groups

I have a table that looks like this:
| id | date_start | gap_7_days |
| -- | ------------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 |
| 1 | 2021-06-13 00:00:00 | 0 |
| 1 | 2021-06-19 00:00:00 | 0 |
| 1 | 2021-06-27 00:00:00 | 0 |
| 2 | 2021-07-04 00:00:00 | 1 |
| 2 | 2021-07-11 00:00:00 | 1 |
| 2 | 2021-07-18 00:00:00 | 1 |
| 2 | 2021-07-25 00:00:00 | 1 |
| 2 | 2021-08-01 00:00:00 | 1 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-09 00:00:00 | 0 |
| 2 | 2021-08-16 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
| 2 | 2021-08-30 00:00:00 | 1 |
| 2 | 2021-08-31 00:00:00 | 0 |
| 2 | 2021-09-01 00:00:00 | 0 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-15 00:00:00 | 1 |
| 2 | 2021-08-22 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
For each ID, I check whether consecutive date_start values are 7 days apart, and put a 1 or 0 in gap_7_days accordingly.
I want to do the following (using Redshift SQL only):
Get the length of each sequence of consecutive 1s in gap_7_days for each ID
Expected output:
| id | date_start | gap_7_days | sequence_length |
| -- | ------------------- | --------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 | |
| 1 | 2021-06-13 00:00:00 | 0 | |
| 1 | 2021-06-19 00:00:00 | 0 | |
| 1 | 2021-06-27 00:00:00 | 0 | |
| 2 | 2021-07-04 00:00:00 | 1 | 6 |
| 2 | 2021-07-11 00:00:00 | 1 | 6 |
| 2 | 2021-07-18 00:00:00 | 1 | 6 |
| 2 | 2021-07-25 00:00:00 | 1 | 6 |
| 2 | 2021-08-01 00:00:00 | 1 | 6 |
| 2 | 2021-08-08 00:00:00 | 1 | 6 |
| 2 | 2021-08-09 00:00:00 | 0 | |
| 2 | 2021-08-16 00:00:00 | 1 | 3 |
| 2 | 2021-08-23 00:00:00 | 1 | 3 |
| 2 | 2021-08-30 00:00:00 | 1 | 3 |
| 2 | 2021-08-31 00:00:00 | 0 | |
| 2 | 2021-09-01 00:00:00 | 0 | |
| 2 | 2021-08-08 00:00:00 | 1 | 4 |
| 2 | 2021-08-15 00:00:00 | 1 | 4 |
| 2 | 2021-08-22 00:00:00 | 1 | 4 |
| 2 | 2021-08-23 00:00:00 | 1 | 4 |
Get the number of sequences for each ID
Expected output:
| id | num_sequences |
| -- | ------------------- |
| 1 | 0 |
| 2 | 3 |
How can I achieve this?
If you want the number of sequences, just look at the previous value. When the current value is "1" and the previous is NULL or 0, then you have a new sequence.
So:
select id,
sum( (gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0)::int ) as num_sequences
from (select t.*,
lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
from t
) t
group by id;
If you actually want the lengths of the sequences, as in the intermediate results, then ask a new question. That information is not needed for this question.

How I can I add a count to rank null values in SQL Hive?

This is what I have right now:
| time | car_id | order | in_order |
|-------|--------|-------|----------|
| 12:31 | 32 | null | 0 |
| 12:33 | 32 | null | 0 |
| 12:35 | 32 | null | 0 |
| 12:37 | 32 | 123 | 1 |
| 12:38 | 32 | 123 | 1 |
| 12:39 | 32 | 123 | 1 |
| 12:41 | 32 | 123 | 1 |
| 12:43 | 32 | 123 | 1 |
| 12:45 | 32 | null | 0 |
| 12:47 | 32 | null | 0 |
| 12:49 | 32 | 321 | 1 |
| 12:51 | 32 | 321 | 1 |
I'm trying to rank orders, including those who have null values, in this case by car_id.
This is the result I'm looking for:
| time | car_id | order | in_order | row |
|-------|--------|-------|----------|-----|
| 12:31 | 32 | null | 0 | 1 |
| 12:33 | 32 | null | 0 | 1 |
| 12:35 | 32 | null | 0 | 1 |
| 12:37 | 32 | 123 | 1 | 2 |
| 12:38 | 32 | 123 | 1 | 2 |
| 12:39 | 32 | 123 | 1 | 2 |
| 12:41 | 32 | 123 | 1 | 2 |
| 12:43 | 32 | 123 | 1 | 2 |
| 12:45 | 32 | null | 0 | 3 |
| 12:47 | 32 | null | 0 | 3 |
| 12:49 | 32 | 321 | 1 | 4 |
| 12:51 | 32 | 321 | 1 | 4 |
I just don't know how to manage a count for the null values.
Thanks!
You can count the number of non-NULL values before each row and then use dense_rank():
select t.*,
dense_rank() over (partition by car_id order by grp) as row
from (select t.*,
count(order) over (partition by car_id order by time) as grp
from t
) t;

SQL - Window Functions with dense_rank()

I have a dataset structured such as the one below stored in Hive, call it df:
+-----+-----+----------+--------+
| id1 | id2 | date | amount |
+-----+-----+----------+--------+
| 1 | 2 | 11-07-17 | 0.93 |
| 2 | 2 | 11-11-17 | 1.94 |
| 2 | 2 | 11-09-17 | 1.90 |
| 1 | 1 | 11-10-17 | 0.33 |
| 2 | 2 | 11-10-17 | 1.93 |
| 1 | 1 | 11-07-17 | 0.25 |
| 1 | 1 | 11-09-17 | 0.33 |
| 1 | 1 | 11-12-17 | 0.33 |
| 2 | 2 | 11-08-17 | 1.90 |
| 1 | 1 | 11-08-17 | 0.30 |
| 2 | 2 | 11-12-17 | 2.01 |
| 1 | 2 | 11-12-17 | 1.00 |
| 1 | 2 | 11-09-17 | 0.94 |
| 2 | 2 | 11-07-17 | 1.94 |
| 1 | 2 | 11-11-17 | 1.92 |
| 1 | 1 | 11-11-17 | 0.33 |
| 1 | 2 | 11-10-17 | 1.92 |
| 1 | 2 | 11-08-17 | 0.94 |
+-----+-----+----------+--------+
I wish to partition by id1 and id2, and then order by date descending within each grouping of id1 and id2, and then rank "amount" within that, where the same "amount" on consecutive days would receive the same rank. The ordered and ranked output I'd hope to see is shown here:
+-----+-----+------------+--------+------+
| id1 | id2 | date | amount | rank |
+-----+-----+------------+--------+------+
| 1 | 1 | 2017-11-12 | 0.33 | 1 |
| 1 | 1 | 2017-11-11 | 0.33 | 1 |
| 1 | 1 | 2017-11-10 | 0.33 | 1 |
| 1 | 1 | 2017-11-09 | 0.33 | 1 |
| 1 | 1 | 2017-11-08 | 0.30 | 2 |
| 1 | 1 | 2017-11-07 | 0.25 | 3 |
| 1 | 2 | 2017-11-12 | 1.00 | 1 |
| 1 | 2 | 2017-11-11 | 1.92 | 2 |
| 1 | 2 | 2017-11-10 | 1.92 | 2 |
| 1 | 2 | 2017-11-09 | 0.94 | 3 |
| 1 | 2 | 2017-11-08 | 0.94 | 3 |
| 1 | 2 | 2017-11-07 | 0.93 | 4 |
| 2 | 2 | 2017-11-12 | 2.01 | 1 |
| 2 | 2 | 2017-11-11 | 1.94 | 2 |
| 2 | 2 | 2017-11-10 | 1.93 | 3 |
| 2 | 2 | 2017-11-09 | 1.90 | 4 |
| 2 | 2 | 2017-11-08 | 1.90 | 4 |
| 2 | 2 | 2017-11-07 | 1.94 | 5 |
+-----+-----+------------+--------+------+
I attempted this with the following SQL query:
SELECT
id1,
id2,
date,
amount,
dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM
df
GROUP BY
id1,
id2,
date,
amount
But that query doesn't seem to be doing what I'd like it to as I'm not receiving the output I'm looking for.
It seems like a window function using dense_rank, partition by and order by is what I need but I can't quite seem to get it to give me that sample output that I desire. Any help would be much appreciated! Thanks!
This is quite tricky. I think you need to use lag() to see where the value changes and then do a cumulative sum:
select df.*,
sum(case when prev_amount = amount then 0 else 1 end) over
(partition by id1, id2 order by date desc) as rank
from (select df.*,
lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
from df
) df;

Count rows each month of a year - SQL Server

I have a table "Product" as :
| ProductId | ProductCatId | Price | Date | Deadline |
--------------------------------------------------------------------
| 1 | 1 | 10.00 | 2016-01-01 | 2016-01-27 |
| 2 | 2 | 10.00 | 2016-02-01 | 2016-02-27 |
| 3 | 3 | 10.00 | 2016-03-01 | 2016-03-27 |
| 4 | 1 | 10.00 | 2016-04-01 | 2016-04-27 |
| 5 | 3 | 10.00 | 2016-05-01 | 2016-05-27 |
| 6 | 3 | 10.00 | 2016-06-01 | 2016-06-27 |
| 7 | 1 | 20.00 | 2016-01-01 | 2016-01-27 |
| 8 | 2 | 30.00 | 2016-02-01 | 2016-02-27 |
| 9 | 1 | 40.00 | 2016-03-01 | 2016-03-27 |
| 10 | 4 | 15.00 | 2016-04-01 | 2016-04-27 |
| 11 | 1 | 25.00 | 2016-05-01 | 2016-05-27 |
| 12 | 5 | 55.00 | 2016-06-01 | 2016-06-27 |
| 13 | 5 | 55.00 | 2016-06-01 | 2016-01-27 |
| 14 | 5 | 55.00 | 2016-06-01 | 2016-02-27 |
| 15 | 5 | 55.00 | 2016-06-01 | 2016-03-27 |
I want to create SP count rows of Product each month with condition Year = CurrentYear , like :
| Month| SumProducts | SumExpiredProducts |
-------------------------------------------
| 1 | 3 | 3 |
| 2 | 3 | 3 |
| 3 | 3 | 3 |
| 4 | 2 | 2 |
| 5 | 2 | 2 |
| 6 | 2 | 2 |
What should i do ?
You can use a query like the following:
SELECT MONTH([Date]),
COUNT(*) AS SumProducts ,
COUNT(CASE WHEN [Date] > Deadline THEN 1 END) AS SumExpiredProducts
FROM mytable
WHERE YEAR([Date]) = YEAR(GETDATE())
GROUP BY MONTH([Date])