match recognize collect row data into single column - sql

I'm following the tutorial for match_recognize found here:
create or replace temporary table stock_price_history (company text, price_date date, price int);
insert into stock_price_history values
('ABCD', '2020-10-01', 50),
('ABCD', '2020-10-02', 50),
('ABCD', '2020-10-03', 51),
('ABCD', '2020-10-04', 51),
('ABCD', '2020-10-05', 51),
('ABCD', '2020-10-06', 52),
('ABCD', '2020-10-07', 71),
('ABCD', '2020-10-08', 80),
('ABCD', '2020-10-09', 90),
('ABCD', '2020-10-10', 63),
('XYZ' , '2020-10-01', 24),
('XYZ' , '2020-10-02', 24),
('XYZ' , '2020-10-03', 37),
('XYZ' , '2020-10-04', 63),
('XYZ' , '2020-10-05', 65),
('XYZ' , '2020-10-06', 66),
('XYZ' , '2020-10-07', 50),
('XYZ' , '2020-10-08', 54),
('XYZ' , '2020-10-09', 30),
('XYZ' , '2020-10-10', 32);
select * from stock_price_history
match_recognize(
partition by company
order by price_date
measures
match_number() as match_number,
price as all_price,
first(price_date) as start_date,
last(price_date) as end_date,
count(*) as rows_in_sequence,
count(row_with_price_stationary.*) as num_stationary,
count(row_with_price_increase.*) as num_increases
one row per match
after match skip to last row_with_price_increase
pattern(row_before_increase row_with_price_increase{1} row_with_price_stationary* row_with_price_increase{1})
define
row_with_price_increase as price > lag(price),
row_with_price_stationary as price = lag(price)
)
order by company, match_number;
The code above is my version of the tutorial code. Everything works fine except the price as all_price part in the measures clause. What I want to do is collect all prices in the pattern and return them as an array in a single column. I know I can use all rows per match to get all rows, but that's not what I want.
How would I go about doing that?

You have to specify all rows per match, or that information is lost once the match_recognize function returns. You can then use array_agg ... within group to collect the prices into a single array. Since this aggregation collapses the rows, you may want to do the same for the dates of each of these prices - something like this:
select COMPANY
,array_agg(PRICE) within group (order by PRICE_DATE) as ALL_PRICE
,array_agg(PRICE_DATE) within group (order by PRICE_DATE) as ALL_PRICE_DATE
from stock_price_history
match_recognize(
partition by company
order by price_date
measures
match_number() as match_number,
price as all_price,
first(price_date) as start_date,
last(price_date) as end_date,
count(*) as rows_in_sequence,
count(row_with_price_stationary.*) as num_stationary,
count(row_with_price_increase.*) as num_increases
all rows per match
after match skip to last row_with_price_increase
pattern(row_before_increase row_with_price_increase{1} row_with_price_stationary* row_with_price_increase{1})
define
row_with_price_increase as price > lag(price),
row_with_price_stationary as price = lag(price)
)
group by company
order by company
;
COMPANY | ALL_PRICE                          | ALL_PRICE_DATE
--------+------------------------------------+---------------
ABCD    | [ 50, 51, 51, 51, 52, 52, 71, 80 ] | [ "2020-10-02", "2020-10-03", "2020-10-04", "2020-10-05", "2020-10-06", "2020-10-06", "2020-10-07", "2020-10-08" ]
XYZ     | [ 24, 37, 63, 63, 65, 66 ]         | [ "2020-10-02", "2020-10-03", "2020-10-04", "2020-10-04", "2020-10-05", "2020-10-06" ]
If you want to keep all rows, you can use the window function version of array_agg:
select * exclude ALL_PRICE
,array_agg(PRICE) within group (order by PRICE_DATE)
over (partition by COMPANY) as ALL_PRICE
from stock_price_history
match_recognize(
partition by company
order by price_date
measures
match_number() as match_number,
price as all_price,
first(price_date) as start_date,
last(price_date) as end_date,
count(*) as rows_in_sequence,
count(row_with_price_stationary.*) as num_stationary,
count(row_with_price_increase.*) as num_increases
all rows per match
after match skip to last row_with_price_increase
pattern(row_before_increase row_with_price_increase{1} row_with_price_stationary* row_with_price_increase{1})
define
row_with_price_increase as price > lag(price),
row_with_price_stationary as price = lag(price)
)
order by company
;

Related

SQL: how to get a subset from a subset that is subject to an "only" condition

I have a table named orders in a Postgres database (see Fiddle at http://sqlfiddle.com/#!17/ac4f9).
CREATE TABLE orders
(
user_id INTEGER,
order_id INTEGER,
order_date DATE,
price FLOAT,
product VARCHAR(255)
);
INSERT INTO orders(user_id, order_id, order_date, price, product)
VALUES
(1, 2, '2021-03-05', 15, 'books'),
(1, 13, '2022-03-07', 3, 'music'),
(1, 14, '2022-06-15', 900, 'travel'),
(1, 11, '2021-11-17', 25, 'books'),
(1, 16, '2022-08-03', 32, 'books'),
(2, 4, '2021-04-12', 4, 'music'),
(2, 7, '2021-06-29', 9, 'music'),
(2, 20, '2022-11-03', 8, 'music'),
(2, 22, '2022-11-07', 575, 'travel'),
(2, 24, '2022-11-20', 95, 'food'),
(3, 3, '2021-03-17', 25, 'books'),
(3, 5, '2021-06-01', 650, 'travel'),
(3, 17, '2022-08-17', 1200, 'travel'),
(3, 19, '2022-10-02', 6, 'music'),
(3, 23, '2022-11-08', 7, 'food'),
(4, 9, '2021-08-20', 3200, 'travel'),
(4, 10, '2021-10-29', 2750, 'travel'),
(4, 15, '2022-07-15', 1820, 'travel'),
(4, 21, '2022-11-05', 8000, 'travel'),
(4, 25, '2022-11-29', 2300, 'travel'),
(5, 1, '2021-01-04', 3, 'music'),
(5, 6, '2021-06-09', 820, 'travel'),
(5, 8, '2021-07-30', 19, 'books'),
(5, 12, '2021-12-10', 22, 'music'),
(5, 18, '2022-09-19', 20, 'books'),
(6, 26, '2023-01-09', 700, 'travel'),
(6, 27, '2023-01-23', 1900, 'travel');
From the list of users who have placed an order for either the travel product OR the books product, I would like to get the subset of these users who have placed an order for ONLY the travel product.
The desired result set would be:
user_id  count_orders
---------------------
4        5
6        2
How would I do this?
Thank you.
select o.user_id, count(*) count_orders
from orders o
where not exists(select * from orders where product<>'travel' and user_id=o.user_id)
group by o.user_id
http://sqlfiddle.com/#!17/ac4f9/17
Count all orders and travel orders per user first, then keep the users where the two counts are equal.
With inline view: http://sqlfiddle.com/#!17/ac4f9/18/0
SELECT user_id, n_orders AS count_orders
FROM (
SELECT user_id
, COUNT(CASE WHEN product = 'travel' THEN 1 END) AS n_travels
, COUNT(*) AS n_orders
FROM orders
GROUP BY user_id
) v
WHERE v.n_travels = v.n_orders
Using HAVING clause 1: http://sqlfiddle.com/#!17/ac4f9/22/0
SELECT user_id
, COUNT(*) AS count_orders
FROM orders
GROUP BY user_id
HAVING COUNT(CASE WHEN product = 'travel' THEN 1 END) = COUNT(*)
Using HAVING clause 2: http://sqlfiddle.com/#!17/ac4f9/21/0
SELECT user_id
, COUNT(*) AS count_orders
FROM orders
GROUP BY user_id
HAVING COUNT(CASE WHEN product != 'travel' THEN 1 END) = 0
Using EXCEPT operation
select user_id, count(*) as count_orders
from orders
where user_id in (
select user_id from orders where product = 'travel'
except
select user_id from orders where product <> 'travel'
)
group by user_id
order by user_id
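As a side note, the counting logic is portable: the HAVING ... = 0 variant happens to run unchanged on SQLite, so it can be sanity-checked from Python (an illustration only; the question itself targets Postgres):

```python
import sqlite3

# Load the question's sample data into an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (user_id INT, order_id INT, order_date TEXT,"
            " price REAL, product TEXT)")
con.executemany("INSERT INTO orders VALUES (?,?,?,?,?)", [
    (1,2,'2021-03-05',15,'books'), (1,13,'2022-03-07',3,'music'), (1,14,'2022-06-15',900,'travel'),
    (1,11,'2021-11-17',25,'books'), (1,16,'2022-08-03',32,'books'), (2,4,'2021-04-12',4,'music'),
    (2,7,'2021-06-29',9,'music'), (2,20,'2022-11-03',8,'music'), (2,22,'2022-11-07',575,'travel'),
    (2,24,'2022-11-20',95,'food'), (3,3,'2021-03-17',25,'books'), (3,5,'2021-06-01',650,'travel'),
    (3,17,'2022-08-17',1200,'travel'), (3,19,'2022-10-02',6,'music'), (3,23,'2022-11-08',7,'food'),
    (4,9,'2021-08-20',3200,'travel'), (4,10,'2021-10-29',2750,'travel'), (4,15,'2022-07-15',1820,'travel'),
    (4,21,'2022-11-05',8000,'travel'), (4,25,'2022-11-29',2300,'travel'), (5,1,'2021-01-04',3,'music'),
    (5,6,'2021-06-09',820,'travel'), (5,8,'2021-07-30',19,'books'), (5,12,'2021-12-10',22,'music'),
    (5,18,'2022-09-19',20,'books'), (6,26,'2023-01-09',700,'travel'), (6,27,'2023-01-23',1900,'travel'),
])
# Users with zero non-travel orders.
rows = con.execute("""
    SELECT user_id, COUNT(*) AS count_orders
    FROM orders
    GROUP BY user_id
    HAVING COUNT(CASE WHEN product <> 'travel' THEN 1 END) = 0
    ORDER BY user_id
""").fetchall()
# rows → [(4, 5), (6, 2)]
```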

Assign Specific Value To All Rows in Partition If Condition Met

I have to build the Exceptions Report to catch Overlaps or Gaps. The dataset has clients and assigned supervisors with start and end dates of supervision.
CREATE TABLE Report
(Id INT, ClientId INT, ClientName VARCHAR(30), SupervisorId INT, SupervisorName
VARCHAR(30), SupervisionStartDate DATE, SupervisionEndDate DATE);
INSERT INTO Report
VALUES
(1, 22, 'Client A', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(2, 22, 'Client A', 44, 'Supervisor B', '2022-05-01', '2022-08-23'),
(3, 22, 'Client A', 55, 'Supervisor C', '2022-08-24', NULL),
(4, 23, 'Client B', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(5, 23, 'Client B', 44, 'Supervisor B', '2022-04-30', '2022-08-23'),
(6, 24, 'Client C', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(7, 24, 'Client C', 44, 'Supervisor B', '2022-05-01', '2022-08-23'),
(8, 24, 'Client C', 55, 'Supervisor C', '2022-07-22', '2022-10-25'),
(9, 25, 'Client D', 33, 'Supervisor A', '2022-01-01', '2022-04-30'),
(10, 25, 'Client D', 44, 'Supervisor B', '2022-07-23', NULL)
SELECT * FROM Report
'Valid' status should be assigned to all rows associated with Client if no Gaps or Overlaps present, for example:
Client A has 3 Supervisors - Supervisor A (01/01/2022 - 04/30/2022), Supervisor B (05/01/2022 - 08/23/2022) and Supervisor C (08/24/2022 - Present).
'Issue Found' status should be assigned to all rows associated with Client if any Gaps or Overlaps present, for example:
Client B has 2 Supervisors - Supervisor A (01/01/2022 - 04/30/2022) and Supervisor B (04/30/2022 - 08/23/2022).
Client C has 3 Supervisors - Supervisor A (01/01/2022 - 04/30/2022), Supervisor B (05/01/2022 - 08/23/2022) and Supervisor C (07/22/2022 - 10/25/2022).
These are examples of the Overlap.
Client D has 2 Supervisors - Supervisor A (01/01/2022 - 04/30/2022) and Supervisor B (07/23/2022 - Present).
This is the example of the Gap.
The Output I need:
I added some columns that might be helpful, but don't know how to accomplish the main goal.
However, I noticed that if the first record in the [Diff Between PreviousEndDate And SupervisionStartDate] column is NULL and all others = 1, then it will be Valid.
SELECT
Report.*,
ROW_NUMBER() OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS [ClientRecordNumber],
COUNT(*) OVER (PARTITION BY Report.ClientId) AS [TotalNumberOfClientRecords],
DATEDIFF(DAY, Report.SupervisionStartDate, Report.SupervisionEndDate) AS SupervisionAging,
LAG(Report.SupervisionStartDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS PreviousStartDate,
LAG(Report.SupervisionEndDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS PreviousEndDate,
LEAD(Report.SupervisionStartDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS NextStartDate,
LEAD(Report.SupervisionEndDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)) AS NextEndDate,
DATEDIFF(dd, LAG(Report.SupervisionEndDate) OVER (PARTITION BY Report.ClientId ORDER BY COALESCE(Report.SupervisionStartDate, Report.SupervisionEndDate)), Report.SupervisionStartDate) AS [Diff Between PreviousEndDate And SupervisionStartDate]
FROM Report
One approach:
Use the additional LAG parameters to provide a default value for when it's null, and make that default a valid value, i.e. 1 day before the StartDate.
Use a CTE to calculate the difference in days between the StartDate and previous EndDate.
Then use a second CTE to determine for any given client whether there is an issue.
Finally display your desired results.
WITH cte1 AS (
SELECT
R.*
, DATEDIFF(day, LAG(R.SupervisionEndDate,1,dateadd(day,-1,R.SupervisionStartDate)) OVER (PARTITION BY R.ClientId ORDER BY COALESCE(R.SupervisionStartDate, R.SupervisionEndDate)), R.SupervisionStartDate) AS Diff
FROM Report R
), cte2 AS (
SELECT *
, MAX(COALESCE(Diff,0)) OVER (PARTITION BY ClientId) MaxDiff
, MIN(COALESCE(Diff,0)) OVER (PARTITION BY ClientId) MinDiff
FROM cte1
)
SELECT Id, ClientId, ClientName, SupervisorId, SupervisorName, SupervisionStartDate, SupervisionEndDate
--, Diff, MaxDiff, MinDiff -- Debug
, CASE WHEN MaxDiff = 1 AND MinDiff = 1 THEN 'Valid' ELSE 'Issue Found' END [Status]
FROM cte2
ORDER BY Id;
Notes:
Use the full name of the datepart you are diff-ing - it's much clearer and easier to maintain.
Use short, relevant table aliases to reduce the code.
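The Valid/Issue rule itself is easy to sanity-check outside SQL. A small Python sketch of the same logic (each period must start exactly one day after the previous end, with the first period getting the same one-day-before default as the LAG call), using the question's sample data:

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample data from the question: (ClientName, SupervisionStartDate, SupervisionEndDate or None)
rows = [
    ('Client A', date(2022, 1, 1), date(2022, 4, 30)),
    ('Client A', date(2022, 5, 1), date(2022, 8, 23)),
    ('Client A', date(2022, 8, 24), None),
    ('Client B', date(2022, 1, 1), date(2022, 4, 30)),
    ('Client B', date(2022, 4, 30), date(2022, 8, 23)),
    ('Client C', date(2022, 1, 1), date(2022, 4, 30)),
    ('Client C', date(2022, 5, 1), date(2022, 8, 23)),
    ('Client C', date(2022, 7, 22), date(2022, 10, 25)),
    ('Client D', date(2022, 1, 1), date(2022, 4, 30)),
    ('Client D', date(2022, 7, 23), None),
]

def client_status(periods):
    # Valid only if every period starts exactly one day after the previous one ends.
    periods = sorted(periods, key=lambda p: p[0])
    prev_end = periods[0][0] - timedelta(days=1)  # default, like LAG's third argument
    for start, end in periods:
        if (start - prev_end).days != 1:
            return 'Issue Found'
        prev_end = end
    return 'Valid'

by_client = defaultdict(list)
for name, start, end in rows:
    by_client[name].append((start, end))
statuses = {name: client_status(p) for name, p in by_client.items()}
# statuses → {'Client A': 'Valid', 'Client B': 'Issue Found',
#             'Client C': 'Issue Found', 'Client D': 'Issue Found'}
```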

(SQL - Snowflake) How do I query a series of events to where it only pulls the first and last event?

I'm looking to query event history (project with lots of tasks and subtasks) to where it only pulls the first and last event on a given task's history and none of the ones in between.
Right now, I can't find another way to compare originating event based IDs (original task:subtask pair) and what the IDs currently are (sometimes they change based on what happens to the tasks when they're moving between queues).
This is the main chunk of query I'm working on.
I have trust issues with the current "completed_at" strings, so I want to switch to using work_time (basically an in-depth audit log), but I only want the MIN and MAX time.
Is this something I can do without it being an actual column?
select
pt._id as t_id,
ps._id as st_id,
pt.completed_at as t_completed_at,
pt.first_completed_at as t_first_completed_at,
ps.completed_at as ps_completed_at,
pt.times_redone,
ps.is_recalled_subtask as ps_recalled,
ps.status,
ps.review_level,
pt.customer_review_status,
pt.customer_review_comments,
array_size(ps.response:annotations),
ps.subtask_version,
vfwa.FIX_ATTEMPT,
vfwa.work_time as time_attempt_was_completed
You can use QUALIFY to remove rows that you don't want, while still having access to them for window processing (like SUM, MAX, etc.):
SELECT
t.order_val,
t.thing_a,
t.thing_b
FROM VALUES
(1,'this is first', 'extra details'),
(2,'this is second', 'hate to middle things'),
(3,'this is third', 'hate to middle things'),
(4,'this is firth', 'last this are as good as first')
t(order_val, thing_a, thing_b)
QUALIFY order_val = max(order_val)over() OR order_val = min(order_val)over();
gives:
ORDER_VAL | THING_A       | THING_B
----------+---------------+-------------------------------
1         | this is first | extra details
4         | this is firth | last this are as good as first
The classic QUALIFY pattern uses ROW_NUMBER and would look like:
QUALIFY row_number()over(order by order_val) = 1 OR row_number()over(order by order_val DESC) = 1
But if your values are calculated, you can use QUALIFY like a bonus WHERE clause that runs after all the grouping and window processing has been done.
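Engines without QUALIFY can express the same first/last-row filter with ROW_NUMBER in a derived table. A runnable sketch of that equivalence, using SQLite from Python (table and data mirror the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (order_val INT, thing_a TEXT, thing_b TEXT);
    INSERT INTO t VALUES
        (1, 'this is first',  'extra details'),
        (2, 'this is second', 'hate to middle things'),
        (3, 'this is third',  'hate to middle things'),
        (4, 'this is firth',  'last this are as good as first');
""")
# Compute both row numbers in a derived table, then filter - the same thing
# QUALIFY does in one step.
rows = con.execute("""
    SELECT order_val, thing_a, thing_b
    FROM (
        SELECT t.*,
               row_number() OVER (ORDER BY order_val)      AS rn_asc,
               row_number() OVER (ORDER BY order_val DESC) AS rn_desc
        FROM t
    ) s
    WHERE rn_asc = 1 OR rn_desc = 1
    ORDER BY order_val
""").fetchall()
# rows → [(1, 'this is first', 'extra details'),
#         (4, 'this is firth', 'last this are as good as first')]
```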
MATCH_RECOGNIZE was created for this:
select * from stock_price_history
match_recognize(
partition by company
order by price_date
measures
first(price_date) as start_date,
last(price_date) as end_date,
first(price) as start_price,
last(price) as end_price
one row per match
pattern(x*)
define
x as true
)
order by company;
Basically, partition by lets you divide by task, and first() and last() will return the ids or values you need from the first and last row of each series of events.
Setup:
create table stock_price_history (company text, price_date date, price int);
insert into stock_price_history values
('ABCD', '2020-10-01', 50),
('XYZ' , '2020-10-01', 89),
('ABCD', '2020-10-02', 36),
('XYZ' , '2020-10-02', 24),
('ABCD', '2020-10-03', 39),
('XYZ' , '2020-10-03', 37),
('ABCD', '2020-10-04', 42),
('XYZ' , '2020-10-04', 63),
('ABCD', '2020-10-05', 30),
('XYZ' , '2020-10-05', 65),
('ABCD', '2020-10-06', 47),
('XYZ' , '2020-10-06', 56),
('ABCD', '2020-10-07', 71),
('XYZ' , '2020-10-07', 50),
('ABCD', '2020-10-08', 80),
('XYZ' , '2020-10-08', 54),
('ABCD', '2020-10-09', 75),
('XYZ' , '2020-10-09', 30),
('ABCD', '2020-10-10', 63),
('XYZ' , '2020-10-10', 32);

Return latest values for each month filling empty values

In SQL Server 2017 I have a table that looks like this https://i.stack.imgur.com/Ry106.png and I would like to get the amount of members at the end of each month, filling out the blank months with the data from the previous month.
So having this table
Create table #tempCenters (
OperationId int identity (1,1) primary key,
CenterId int,
members int,
Change_date date,
Address varchar(100), --non relevant
Sales float --non relevant
)
with this data
INSERT INTO #tempCenters (CenterId, members, Change_date, Address, Sales) VALUES
(1, 100, '2020-02-20', 'non relevant column', 135135),
(1, 110, '2020-04-15', 'non relevant column', 231635),
(1, 130, '2020-04-25', 'non relevant column', 3565432),
(1, 180, '2020-09-01', 'non relevant column', 231651),
(2, 200, '2020-01-20', 'non relevant column', 321365),
(2, 106, '2020-03-20', 'non relevant column', 34534),
(2, 135, '2020-06-25', 'non relevant column', 3224),
(2, 154, '2020-06-20', 'non relevant column', 2453453)
I am expecting this result
CenterId, Members, EOM_Date
1, 100, '2020-02-29'
1, 100, '2020-03-31'
1, 130, '2020-04-30'
1, 130, '2020-05-31'
1, 130, '2020-06-30'
1, 130, '2020-07-31'
1, 130, '2020-08-31'
1, 180, '2020-09-30'
2, 200, '2020-01-31'
2, 200, '2020-02-29'
2, 106, '2020-03-31'
2, 106, '2020-04-30'
2, 106, '2020-05-31'
2, 135, '2020-06-30'
And this is what I've got so far
SELECT
t.centerId,
EOMONTH(t.Change_date) as endOfMonthDate,
t.members
FROM #tempCenters t
RIGHT JOIN (
SELECT
S.CenterId,
Year(S.Change_date) as dateYear,
Month(S.Change_date) as dateMonth,
Max(s.OperationId) as id
FROM #tempCenters S
GROUP BY CenterId, Year(Change_date), Month(Change_date)
) A
ON A.id = t.OperationId
which returns the values per month, but does not fill in the blank ones.
First I get the start date (min date) and finish date (max date) for each CenterId. Then I generate all end-of-month dates from start to finish for each CenterId. Finally I join my subquery (cte) with your table (on cte.CenterId = tc.CenterId AND cte.EOM_Date >= tc.Change_date) and get the last (previous or same date) members value for each end of month.
WITH cte AS (SELECT CenterId, EOMONTH(MIN(Change_date)) AS EOM_Date, EOMONTH(MAX(Change_date)) AS finish
FROM #tempCenters
GROUP BY CenterId
UNION ALL
SELECT CenterId, EOMONTH(DATEADD(MONTH, 1, EOM_Date)), finish
FROM cte
WHERE EOM_Date < finish)
SELECT DISTINCT cte.CenterId,
FIRST_VALUE(Members) OVER(PARTITION BY cte.CenterId, cte.EOM_Date ORDER BY tc.Change_date DESC) AS Members,
cte.EOM_Date
FROM cte
LEFT JOIN #tempCenters tc ON cte.CenterId = tc.CenterId AND cte.EOM_Date >= tc.Change_date
ORDER BY CenterId, EOM_Date;
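As a cross-check of the gap-filling logic, here is a small procedural Python sketch of the same idea (generate every month-end between each center's first and last change date, then take the latest members value on or before it); the helper names are mine:

```python
import calendar
from datetime import date

# Sample data from the question: (CenterId, members, Change_date)
rows = [
    (1, 100, date(2020, 2, 20)), (1, 110, date(2020, 4, 15)),
    (1, 130, date(2020, 4, 25)), (1, 180, date(2020, 9, 1)),
    (2, 200, date(2020, 1, 20)), (2, 106, date(2020, 3, 20)),
    (2, 135, date(2020, 6, 25)), (2, 154, date(2020, 6, 20)),
]

def end_of_month(y, m):
    return date(y, m, calendar.monthrange(y, m)[1])

def fill_months(rows):
    centers = {}
    for cid, members, d in rows:
        centers.setdefault(cid, []).append((d, members))
    out = []
    for cid, changes in sorted(centers.items()):
        changes.sort()
        first, last = changes[0][0], changes[-1][0]
        y, m = first.year, first.month
        while (y, m) <= (last.year, last.month):
            eom = end_of_month(y, m)
            # latest members value changed on or before this month end
            members = max(c for c in changes if c[0] <= eom)[1]
            out.append((cid, members, eom))
            y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return out

filled = fill_months(rows)
```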
I know it looks cumbersome and I'm sure there is a more elegant solution, but you can still use a combination of subqueries with union all and outer apply to get the desired result.
Select t.CenterId, Coalesce(t.members, tt.members), t.Change_date
From (
Select CenterId, Max(members) As members, Change_date
From
(Select t.CenterId, t.members, EOMONTH(t.Change_date) As Change_date
From #tempCenters As t Inner Join
(Select CenterId, Max(Change_date) As Change_date
From #tempCenters
Group by CenterId, Year(Change_date), Month(Change_date)
) As tt On (t.CenterId=tt.CenterId And
t.Change_date=tt.Change_date)
Union All
Select t.CenterId, Null As member, t.Change_date
From (
Select tt.CenterId, EOMONTH(datefromparts(tt.[YEAR], t.[MONTH], '1')) As Change_date,
Min_Change_date, Max_Change_date
From (Select [value] as [Month] From OPENJSON('[1,2,3,4,5,6,7,8,9,10,11,12]')) As t,
(Select CenterId, Year(Change_date) As [YEAR],
Min(Change_date) As Min_Change_date, Max(Change_date) As Max_Change_date
From #tempCenters Group by CenterId, Year(Change_date)) As tt) As t
Where Change_date Between Min_Change_date And Max_Change_date) As t
Group by CenterId, Change_date) As t Outer Apply
(Select members
From #tempCenters
Where CenterId=t.CenterId And
Change_date = (Select Max(Change_date)
From #tempCenters Where CenterId=t.CenterId And Change_date<t.Change_date Group by CenterId)) As tt
Order by t.CenterId, t.Change_date

How to group and collect data in PostgreSQL by date periods?

Here is a demo data:
create table Invoices (
id INT,
name VARCHAR,
customer_id INT,
total_amount FLOAT,
state VARCHAR,
invoice_date DATE
);
INSERT INTO Invoices
(id, name, customer_id, total_amount, state, invoice_date)
VALUES
(1, 'INV/2020/0001', 2, 100, 'posted', '2020-04-05'),
(2, 'INV/2020/0002', 1, 100, 'draft', '2020-04-05'),
(3, 'INV/2020/0003', 2, 100, 'draft', '2020-05-24'),
(4, 'INV/2020/0004', 1, 100, 'posted', '2020-05-25'),
(5, 'INV/2020/0005', 2, 100, 'posted', '2020-06-05'),
(6, 'INV/2020/0006', 1, 100, 'posted', '2020-07-05'),
(7, 'INV/2020/0007', 1, 100, 'draft', '2020-08-24'),
(8, 'INV/2020/0008', 1, 100, 'posted', '2020-08-25'),
(9, 'INV/2020/0009', 1, 100, 'posted', '2020-09-05'),
(10, 'INV/2020/0010', 1, 100, 'draft', '2020-09-05'),
(11, 'INV/2020/0011', 2, 100, 'draft', '2020-10-24'),
(12, 'INV/2020/0012', 1, 100, 'posted', '2020-10-25'),
(13, 'INV/2020/0013', 2, 100, 'posted', '2020-11-05'),
(14, 'INV/2020/0014', 1, 100, 'posted', '2020-11-05'),
(15, 'INV/2020/0015', 2, 100, 'draft', '2020-11-24'),
(16, 'INV/2020/0016', 1, 100, 'posted', '2020-11-25')
I have a query that computes a sum of all posted invoices for customer with id = 1
SELECT sum(total_amount), customer_id
FROM Invoices
WHERE state = 'posted' AND customer_id = 1
GROUP BY customer_id
I need to group the data (sum(total_amount)) by 3 time periods of 2 or 3 months each (the 2 or 3 needs to be changeable by changing a number in the query; I want to pass it as a parameter to the query from Python code).
Also I need to get the average sums of the period.
Can you help me please?
Expected output for period = 2 months is:
+--------------+--------------+--------------+--------+
| Period_1_sum | Period_2_sum | Period_3_sum | Avg |
+--------------+--------------+--------------+--------+
| 300 | 300 | 100 | 233.33 |
+--------------+--------------+--------------+--------+
You can use conditional aggregation for that:
SELECT customer_id,
sum(total_amount) as total_amount,
sum(total_amount) filter (where invoice_date >= date '2020-04-01' and invoice_date < date '2020-07-01') as period_1_sum,
sum(total_amount) filter (where invoice_date >= date '2020-07-01' and invoice_date < date '2020-10-01') as period_2_sum,
sum(total_amount) filter (where invoice_date >= date '2020-10-01' and invoice_date < date '2021-01-01') as period_3_sum
FROM Invoices
WHERE state = 'posted'
GROUP BY customer_id
By changing the filter condition you can control which rows are aggregated for each period.
Online example
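To drive the period length from Python, as the question asks, the boundary dates and one conditional sum per period can be generated in code. A sketch under assumptions (the helper names are mine, not from any library; the demo uses SQLite with CASE expressions, which play the same role as Postgres's FILTER above; only a subset of the sample rows is loaded):

```python
import sqlite3
from datetime import date

def add_months(d, n):
    y, m = divmod(d.year * 12 + d.month - 1 + n, 12)
    return date(y, m + 1, 1)

def period_query(start, months_per_period, n_periods):
    # One conditional sum per period; on Postgres each CASE could instead be
    # the FILTER clause shown above. Boundaries are generated dates, not user
    # input, so inlining them into the SQL string is safe here.
    cols, lower = [], start
    for i in range(1, n_periods + 1):
        upper = add_months(lower, months_per_period)
        cols.append(
            f"SUM(CASE WHEN invoice_date >= '{lower}' AND invoice_date < '{upper}' "
            f"THEN total_amount ELSE 0 END) AS period_{i}_sum"
        )
        lower = upper
    return ("SELECT customer_id, " + ", ".join(cols) +
            " FROM Invoices WHERE state = 'posted'"
            " GROUP BY customer_id ORDER BY customer_id")

# Demo on SQLite with part of the question's data (dates stored as ISO text).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Invoices (id INT, name TEXT, customer_id INT,"
            " total_amount REAL, state TEXT, invoice_date TEXT)")
con.executemany("INSERT INTO Invoices VALUES (?,?,?,?,?,?)", [
    (1, 'INV/2020/0001', 2, 100, 'posted', '2020-04-05'),
    (2, 'INV/2020/0002', 1, 100, 'draft',  '2020-04-05'),
    (4, 'INV/2020/0004', 1, 100, 'posted', '2020-05-25'),
    (5, 'INV/2020/0005', 2, 100, 'posted', '2020-06-05'),
    (6, 'INV/2020/0006', 1, 100, 'posted', '2020-07-05'),
    (8, 'INV/2020/0008', 1, 100, 'posted', '2020-08-25'),
    (9, 'INV/2020/0009', 1, 100, 'posted', '2020-09-05'),
    (12, 'INV/2020/0012', 1, 100, 'posted', '2020-10-25'),
    (13, 'INV/2020/0013', 2, 100, 'posted', '2020-11-05'),
])
# Three 2-month periods starting 2020-04-01.
rows = con.execute(period_query(date(2020, 4, 1), 2, 3)).fetchall()
# each row: (customer_id, period_1_sum, period_2_sum, period_3_sum)
```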