How to do an as-of-join in SQL (Snowflake)?

I am looking to join two time-ordered tables, such that the events in table1 are matched to the "next" event in table2 (within the same user). I am using SQL / Snowflake for this.
For argument's sake table1 is "notification_clicked" events and table2 is "purchases"
This is one way to do it:
WITH partial_result AS (
    SELECT
        table1.userId, notificationId, notificationTimeStamp,
        transactionId, transactionTimeStamp
    FROM table1 CROSS JOIN table2
    WHERE table1.userId = table2.userId
    AND notificationTimeStamp <= transactionTimeStamp
)
SELECT *
FROM partial_result
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY userId, notificationId
    ORDER BY transactionTimeStamp ASC
) = 1
It is not super readable, but is this "the" way to do this?

If you're doing an AsOf join against small tables, you can use a regular Venn diagram type of join. If you're running it against large tables, a regular join will lead to an intermediate cardinality explosion before the filter.
For large tables, this is the highest performance approach I have to date. Rather than treating an AsOf join like a regular Venn diagram join, we can treat it like a special type of union between two tables with a filter that uses the information from that union. The sample SQL does the following:
Unions the A and B tables so that the Entity and Time come from both tables and all other columns come from only one table. Rows from the other table specify NULL for these values (measures 1 and 2 in this case). It also projects a source column for the table. We'll use this later.
In the unioned table, it uses a LAG function on windows partitioned by the Entity and ordered by the Time. For each row sourced from the A table, it lags back to the most recent Time sourced from the B table, using IGNORE NULLS to skip over the intervening A rows.
with A as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M1" -- Measure (could be many)
from (values
(1, 7, 1, 'M1-1'),
(1, 8, 1, 'M1-2'),
(1, 41, 1, 'M1-3'),
(1, 89, 1, 'M1-4')
)
), B as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M2" -- Different measure (could be many)
from (values
(1, 6, 1, 'M2-1'),
(1, 12, 1, 'M2-2'),
(1, 20, 1, 'M2-3'),
(1, 35, 1, 'M2-4'),
(1, 57, 1, 'M2-5'),
(1, 85, 1, 'M2-6'),
(1, 92, 1, 'M2-7')
)
), UNIONED as -- Unify schemas and union all
(
select 'A' as SOURCE_TABLE -- Project the source table
,E as AB_E -- AB_ means it's unified
,T as AB_T
,M1 as A_M1 -- A_ means it's from A
,NULL::string as B_M2 -- Make columns from B null for A
from A
union all
select 'B' as SOURCE_TABLE
,E as AB_E
,T as AB_T
,NULL::string as A_M1 -- Make columns from A null for B
,M2 as B_M2
from B
)
select AB_E as ENTITY
,AB_T as A_TIME
,lag(iff(SOURCE_TABLE = 'A', null, AB_T)) -- Lag back to
ignore nulls over -- previous B row
(partition by AB_E order by AB_T) as B_TIME
,A_M1 as M1_FROM_A
,lag(B_M2) -- Lag back to the previous non-null row.
ignore nulls -- The A sourced rows will already be NULL.
over (partition by AB_E order by AB_T) as M2_FROM_B
from UNIONED
qualify SOURCE_TABLE = 'A'
;
This will perform orders of magnitude faster for large tables because the highest intermediate cardinality is guaranteed to be the cardinality of A + B.
To simplify this refactor, I wrote a stored procedure that generates the SQL given: the paths to tables A and B; the entity column in A and B (right now limited to one, but if you have more it will get the SQL started); the order-by (time) column in A and B; and the list of columns to "drag through" the as-of join. It's rather lengthy, so I posted it on GitHub and will work later to document and enhance it:
https://github.com/GregPavlik/AsOfJoin/blob/main/StoredProcedure.sql
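Since this answer was written, Snowflake has added a native ASOF JOIN with a MATCH_CONDITION clause that expresses this directly. A minimal sketch against the question's tables (column names assumed from the question; verify the syntax against the current Snowflake documentation):

-- For each notification, find the nearest purchase at or after it, per user.
SELECT
    t1.userId, t1.notificationId, t1.notificationTimeStamp,
    t2.transactionId, t2.transactionTimeStamp
FROM table1 t1
ASOF JOIN table2 t2
    MATCH_CONDITION (t1.notificationTimeStamp <= t2.transactionTimeStamp)
    ON t1.userId = t2.userId;

Note that, as I understand it, unmatched left rows come back with NULLs on the right (like a LEFT JOIN), whereas the QUALIFY version above drops notifications with no later purchase.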

Related

Programmatically assign NULL to certain columns for certain rows when unioning datasets

I'm trying to figure out a way to programmatically assign NULL to certain columns for certain rows when unioning 2 datasets together. This is most easily explained using an example. The rows in #stage2 need to display NULL in columns cost_center3, cost_center14 in the final dataset. The code below works but it is a manual approach and not dynamic if more cost_center columns need to be added.
select *
into #stage1
from
(
values
(42, 170, 44, 827),
(43, 170, 68, 880),
(44, 190, 31, 745)
) d (work_center, plant, cost_center3, cost_center14);
select *
into #stage2
from
(
values
(10, 200),
(11, 200),
(12, 200)
) d (work_center, plant);
--manual approach - need to find a programmatic way to do this
select * from #stage1
union
select *, NULL, NULL from #stage2;
In the actual business use case, there are several more cost_center columns than are shown in this example - thus the need to find a way to programmatically do this task.
I have experimented with CROSS APPLY like this
select s1.*, s2.*
from #stage1 s1
cross apply #stage2 s2;
but it is essentially cross joining the datasets and that is not the desired outcome.
Can this task be done programmatically and concisely?
Here's what I ended up using, even though the number of NULLs is not dynamic:
select * from #stage1
union
select * from #stage2 cross join (values (null, null)) d (cost_center3, cost_center14);
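A fully programmatic route is dynamic SQL: read both temp tables' column lists from tempdb's catalog and generate the NULL padding. A hedged sketch for SQL Server 2017+ (uses STRING_AGG), assuming the missing columns are trailing, since UNION matches columns by position:

DECLARE @nulls nvarchar(max), @sql nvarchar(max);

-- Columns present in #stage1 but missing from #stage2 become NULL literals.
SELECT @nulls = STRING_AGG(N'NULL AS ' + QUOTENAME(c1.name), N', ')
                WITHIN GROUP (ORDER BY c1.column_id)
FROM tempdb.sys.columns AS c1
WHERE c1.object_id = OBJECT_ID(N'tempdb..#stage1')
  AND NOT EXISTS (SELECT 1
                  FROM tempdb.sys.columns AS c2
                  WHERE c2.object_id = OBJECT_ID(N'tempdb..#stage2')
                    AND c2.name = c1.name);

SET @sql = N'select * from #stage1 union select *, ' + @nulls + N' from #stage2;';
EXEC sys.sp_executesql @sql;

If more cost_center columns are added later, the generated NULL list grows with them automatically.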

Postgresql - access/use joined subquery result in another joined subquery

I have a database with tables for
equipment we service (table e, field e_id)
contracts on the equipment (table c, fields c_id, e_id, c_start, c_end)
maintenance we have performed in the past (table m, e_id, m_id,
m_date)
I am trying to build a query that will show me all equipment records, if it is currently in contract with the start/end date, and a count of any maintenance performed since the start date of the contract.
I have a subquery to get the current contract (this table is large and has a new line for each contract revision), but I can't work out how to use the result of the contract subquery to return the maintenance visits since that date without returning multiple lines.
select
e.e_id,
c2.c_id,
c2.c_start,
c2.c_end,
m2.count
from e
left join (
select
c_id,
c_start,
c_end,
e_id
...other things and filtering by joining the table to itself
from c
) as c2 on c2.e_id = e.e_id
I would also like to be able to add this
m-subquery v1
left join (
select
count(*),
e_id
from m
where m.m_date >= c2.c_start
) as m2 on m2.e_id = e.e_id
But I'm unable to access c2.C_start from within the second subquery.
I am able to return this table by joining outside the subquery, but this returns multiple lines.
m-subquery v2
left join (
select
e_id,
m_date
from m
) as m2 on m2.e_id = e.e_id and m2.m_date >= c2.c_start
Is there a way to:
Get the subquery field c2.start into the m-subquery v1?
Aggregate the result of the m-subquery v2 without using group by (there are a lot of columns in the main select query)?
Do this differently?
I've seen LATERAL, which I think might be what I need, but putting the keyword in front of either subquery (or both) still didn't let me use c2.c_start inside.
I am a little averse to using group by, mainly as the BI analyst at work says "slap a group by on it" when there are duplicates in reports rather than trying to understand the business process/database properly. I feel like having a group by on the main query shouldn't be needed when I know for certain that the e table has one record per e_id, and the mess that having probably 59 out of 60 columns named in the group by would cause might make the query less maintainable.
Thanks,
Sam
Since not all RDBMS support lateral, I would like to present you the following general solution. You can make use of CTEs (WITH queries) to help structuring the query and reuse partial results. E.g. in the following code, you can think of current_contracts as a kind of virtual table existing only during query execution.
Part 1: DDLs and test data
DROP TABLE IF EXISTS e;
CREATE TABLE e
(
e_id INTEGER
);
DROP TABLE IF EXISTS c;
CREATE TABLE c
(
c_id INTEGER,
e_id INTEGER,
c_start DATE,
c_end DATE
);
DROP TABLE IF EXISTS m;
CREATE TABLE m
(
e_id INTEGER,
m_id INTEGER,
m_date DATE
);
INSERT INTO e VALUES (101),(102),(103);
INSERT INTO c VALUES (201, 101, DATE '2021-01-01', DATE '2021-12-31'), (202, 102, DATE '2021-03-01', DATE '2021-04-15'), (203, 102, DATE '2021-04-16', DATE '2021-04-30'), (204, 103, DATE '2003-01-01', DATE '2003-12-31'), (205, 103, DATE '2021-04-01', DATE '2021-04-30');
INSERT INTO m VALUES (101, 301, DATE '2021-01-01'), (101, 302, DATE '2021-02-01'), (101, 303, DATE '2021-03-01'), (102, 304, DATE '2021-04-02'), (102, 305, DATE '2021-04-03'), (103, 306, DATE '2021-04-03');
Part 2: the actual query
WITH
-- find currently active contracts per equipment:
-- we assume there is 0 or 1 contract active per equipment at any time
current_contracts AS
(
SELECT *
FROM c
WHERE c.c_start <= CURRENT_DATE -- only active contracts
AND c.c_end >= CURRENT_DATE -- only active contracts
),
-- count maintenance visits during the (single) active contract per equipment, if any:
current_maintenance AS
(
SELECT m.e_id, COUNT(*) AS count_m_per_e -- a count of maintenance visits per equipment
FROM m
INNER JOIN current_contracts cc
ON cc.e_id = m.e_id -- match maintenance to current contracts via equipment
AND cc.c_start <= m.m_date -- only maintenance that was done during the current contract
GROUP BY m.e_id
)
-- bring the parts together for our result:
-- we start with equipment and use LEFT JOINs to assure we retain all equipment
SELECT
e.*,
cc.c_start, cc.c_end,
CASE WHEN cc.e_id IS NOT NULL THEN 'yes' ELSE 'no' END AS has_contract,
COALESCE(cm.count_m_per_e, 0) AS count_m_per_e -- 0 when no maintenance was counted (e.g. no active contract)
FROM e
LEFT JOIN current_contracts cc
ON cc.e_id = e.e_id
LEFT JOIN current_maintenance cm
ON cm.e_id = e.e_id
ORDER BY e.e_id;
Please note that your real pre-processing logic for contracts and maintenance visits may be more complex, e.g. due to overlapping periods of active contracts per equipment.
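For the PostgreSQL-specific route the question asked about, LEFT JOIN LATERAL also works: a lateral subquery may reference columns of items earlier in the FROM list, which is exactly what v1 was missing. A hedged sketch against the DDL above (the LIMIT 1 is a guard for the 0-or-1-active-contract assumption):

SELECT e.e_id, c2.c_id, c2.c_start, c2.c_end, m2.visit_count
FROM e
LEFT JOIN LATERAL (
    SELECT c_id, c_start, c_end
    FROM c
    WHERE c.e_id = e.e_id
      AND c.c_start <= CURRENT_DATE
      AND c.c_end >= CURRENT_DATE
    LIMIT 1
) AS c2 ON true
LEFT JOIN LATERAL (
    -- c2.c_start is visible here because c2 appears earlier in the FROM list
    SELECT COUNT(*) AS visit_count
    FROM m
    WHERE m.e_id = e.e_id
      AND m.m_date >= c2.c_start
) AS m2 ON true
ORDER BY e.e_id;

Since the second subquery is a plain aggregate with no GROUP BY, it returns exactly one row per equipment, avoiding the duplicate-line problem of v2.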

sql join using recursive cte

Edit: Added another case scenario in the notes and updated the sample attachment.
I am trying to write a sql to get an output attached with this question along with sample data.
There are two table, one with distinct ID's (pk) with their current flag.
another with Active ID (fk to the pk from the first table) and Inactive ID (fk to the pk from the first table)
Final output should return two columns, first column consist of all distinct ID's from the first table and second column should contain Active ID from the 2nd table.
Below is the sql:
IF OBJECT_ID('tempdb..#main') IS NOT NULL DROP TABLE #main;
IF OBJECT_ID('tempdb..#merges') IS NOT NULL DROP TABLE #merges
IF OBJECT_ID('tempdb..#final') IS NOT NULL DROP TABLE #final
SELECT DISTINCT id,
[current]
INTO #main
FROM tb_ID t1
--get list of all active_id and inactive_id
SELECT DISTINCT active_id,
inactive_id,
Update_dt
INTO #merges
FROM tb_merges
-- Combine where the id from the main table matched to the inactive_id (should return all the rows from #main)
SELECT id,
active_id AS merged_to_id
INTO #final
FROM (SELECT t1.*,
t2.active_id,
Update_dt ,
Row_number()
OVER (
partition BY id, active_id
ORDER BY Update_dt DESC) AS rn
FROM #main t1
LEFT JOIN #merges t2
ON t1.id = t2.inactive_id) t3
WHERE rn = 1
SELECT *
FROM #final
This SQL partially works: it fails where an ID was once active and then became inactive.
Please note:
the active ID should return the last most active ID
the ID which doesn't have any active ID should either be null or the ID itself
ID where the current = 0, in those cases active ID should be the ID current in tb_ID
ID's may get interchanged. For example there are two ID's 6 and 7, when 6 is active 7 is inactive and vice versa. the only way to know the most current active state is by the update date
Attached sample might be easy to understand
It looks like I might have to use a recursive CTE to achieve these results. Can someone please help?
thank you for your time!
I think you're correct that a recursive CTE looks like a good solution for this. I'm not entirely certain that I've understood exactly what you're asking for, particularly with regard to the update_dt column, just because the data is a little abstract as-is, but I've taken a stab at it, and it does seem to work with your sample data. The comments explain what's going on.
declare #tb_id table (id bigint, [current] bit);
declare #tb_merges table (active_id bigint, inactive_id bigint, update_dt datetime2);
insert #tb_id values
-- Sample data from the question.
(1, 1),
(2, 1),
(3, 1),
(4, 1),
(5, 0),
-- A few additional data to illustrate a deeper search.
(6, 1),
(7, 1),
(8, 1),
(9, 1),
(10, 1);
insert #tb_merges values
-- Sample data from the question.
(3, 1, '2017-01-11T13:09:00'),
(1, 2, '2017-01-11T13:07:00'),
(5, 4, '2013-12-31T14:37:00'),
(4, 5, '2013-01-18T15:43:00'),
-- A few additional data to illustrate a deeper search.
(6, 7, getdate()),
(7, 8, getdate()),
(8, 9, getdate()),
(9, 10, getdate());
if object_id('tempdb..#ValidMerge') is not null
drop table #ValidMerge;
-- Get the subset of merge records whose active_id identifies a "current" id and
-- rank by date so we can consider only the latest merge record for each active_id.
with ValidMergeCTE as
(
select
M.active_id,
M.inactive_id,
[Priority] = row_number() over (partition by M.active_id order by M.update_dt desc)
from
#tb_merges M
inner join #tb_id I on M.active_id = I.id
where
I.[current] = 1
)
select
active_id,
inactive_id
into
#ValidMerge
from
ValidMergeCTE
where
[Priority] = 1;
-- Here's the recursive CTE, which draws on the subset of merges identified above.
with SearchCTE as
(
-- Base case: any record whose active_id is not used as an inactive_id is an endpoint.
select
M.active_id,
M.inactive_id,
Depth = 0
from
#ValidMerge M
where
not exists (select 1 from #ValidMerge M2 where M.active_id = M2.inactive_id)
-- Recursive case: look for records whose active_id matches the inactive_id of a previously
-- identified record.
union all
select
S.active_id,
M.inactive_id,
Depth = S.Depth + 1
from
#ValidMerge M
inner join SearchCTE S on M.active_id = S.inactive_id
)
select
I.id,
S.active_id
from
#tb_id I
left join SearchCTE S on I.id = S.inactive_id;
Results:
id active_id
------------------
1 3
2 3
3 NULL
4 NULL
5 4
6 NULL
7 6
8 6
9 6
10 6

Ordering parent rows by date descending with child rows ordered independently beneath each

This is a contrived version of my table schema to illustrate my problem:
QuoteID, Details, DateCreated, ModelQuoteID
Where QuoteID is the primary key and ModelQuoteID is a nullable foreign key back onto this table to represent a quote which has been modelled off another quote (and may have subsequently had its Details column etc changed).
I need to return a list of quotes ordered by DateCreated descending with the exception of modelled quotes, which should sit beneath their parent quote, ordered by date descending within any other sibling quotes (quotes can only be modelled one level deep).
So for example if I have these 4 quote rows:
1, 'Fix the roof', '01/01/2012', null
2, 'Clean the drains', '02/02/2012', null
3, 'Fix the roof and door', '03/03/2012', 1
4, 'Fix the roof, door and window', '04/04/2012', 1
5, 'Mow the lawn', '05/05/2012', null
Then I need to get the results back in this order:
5 - Mow the lawn
2 - Clean the drains
1 - Fix the roof
4 - -> Fix the roof, door and window
3 - -> Fix the roof and door
I'm also passing in search criteria such as keywords for Details, and I'm returning modelled quotes even if they don't contain the search term but their parent quote does. I've got that part working using a common table expression to get the original quotes, unioned with a join for modelled ones.
That works nicely but currently I'm having to do the rearrangement of the modelled quotes into the correct order in code. That's not ideal because my next step is to implement paging in the SQL, and if the rows are not grouped properly at that time then I won't have the children present in the current page to do the re-ordering in code. Generally speaking they will be naturally grouped together anyway, but not always. You could create a model quote today for a quote from a month back.
I've spent quite some time on this, can any SQL gurus help? Much appreciated.
EDIT: Here is a contrived version of my SQL to fit my contrived example :-)
;with originals as (
select
q.*
from
Quote q
where
Details like #details
)
select
*
from
(
select
o.*
from
originals o
union
select
q2.*
from
Quote q2
join
originals o on q2.ModelQuoteID = o.QuoteID
)
as combined
order by
combined.DateCreated desc
Watching the Olympics -- just skimmed your post -- looks like you want to control the sort at each level (root and one level in), and make sure the data is returned with the children directly beneath its parent (so you can page the data...). We do this all the time. You can add an order by to each inner query and create a sort column. I contrived a slightly different example that should be easy for you to apply to your circumstance. I sorted the root ascending and level one descending just to illustrate how you can control each part.
declare #tbl table (id int, parent int, name varchar(10))
insert into #tbl (id, parent, name)
values (1, null, 'def'), (2, 1, 'this'), (3, 1, 'is'), (4, 1, 'a'), (5, 1, 'test'),
(6, null, 'abc'), (7, 6, 'this'), (8, 6, 'is'), (9, 6, 'another'), (10, 6, 'test')
;with cte (id, parent, name, sort) as (
select id, parent, name, cast(right('0000' + cast(row_number() over (order by name) as varchar(4)), 4) as varchar(1024))
from #tbl
where parent is null
union all
select t.id, t.parent, t.name, cast(cte.sort + right('0000' + cast(row_number() over (order by t.name desc) as varchar(4)), 4) as varchar(1024))
from #tbl t inner join cte on t.parent = cte.id
)
select * from cte
order by sort
This produces these results:
id parent name sort
---- -------- ------- ----------
6 NULL abc 0001
7 6 this 00010001
10 6 test 00010002
8 6 is 00010003
9 6 another 00010004
1 NULL def 0002
2 1 this 00020001
5 1 test 00020002
3 1 is 00020003
4 1 a 00020004
You can see that the root nodes are sorted ascending and the inner nodes are sorted descending.
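Applied to the Quote schema from the question (a hedged sketch; column names taken from the question), both levels sort by DateCreated descending, with a PARTITION BY added so siblings are numbered per parent:

;with cte (QuoteID, Details, DateCreated, sort) as (
    -- Root quotes: number by date descending
    select QuoteID, Details, DateCreated,
           cast(right('0000' + cast(row_number() over (order by DateCreated desc) as varchar(4)), 4) as varchar(1024))
    from Quote
    where ModelQuoteID is null
    union all
    -- Modelled quotes: append a per-parent sequence to the parent's sort path
    select q.QuoteID, q.Details, q.DateCreated,
           cast(cte.sort + right('0000' + cast(row_number() over (partition by q.ModelQuoteID order by q.DateCreated desc) as varchar(4)), 4) as varchar(1024))
    from Quote q
    inner join cte on q.ModelQuoteID = cte.QuoteID
)
select QuoteID, Details, DateCreated
from cte
order by sort;

Because children sort directly beneath their parents on a single column, paging (e.g. OFFSET/FETCH over the sorted result) can then be layered on without splitting families across pages.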

How to select only one full row per group in a "group by" query?

In SQL Server, I have a table where a column A stores some data. This data can contain duplicates (ie. two or more rows will have the same value for the column A).
I can easily find the duplicates by doing:
select A, count(A) as CountDuplicates
from TableName
group by A having (count(A) > 1)
Now, I want to retrieve the values of other columns, let's say B and C. Of course, those B and C values can be different even for the rows sharing the same A value, but it doesn't matter for me. I just want any B value and any C one, the first, the last or the random one.
If I had a small table and one or two columns to retrieve, I would do something like:
select A, count(A) as CountDuplicates, (
select top 1 child.B from TableName as child where child.A = base.A) as B
)
from TableName as base group by A having (count(A) > 1)
The problem is that I have much more rows to get, and the table is quite big, so having several children selects will have a high performance cost.
So, is there a less ugly pure SQL solution to do this?
Not sure if my question is clear enough, so I give an example based on AdventureWorks database. Let's say I want to list available States, and for each State, get its code, a city (any city) and an address (any address). The easiest, and the most inefficient way to do it would be:
var q = from c in data.StateProvinces select new { c.StateProvinceCode, c.Addresses.First().City, c.Addresses.First().AddressLine1 };
in LINQ-to-SQL, and will do two selects for each of the 181 states, so 363 selects. In my case, I am searching for a way to have a maximum of 182 selects.
The ROW_NUMBER function in a CTE is the way to do this. For example:
DECLARE #mytab TABLE (A INT, B INT, C INT)
INSERT INTO #mytab ( A, B, C ) VALUES (1, 1, 1)
INSERT INTO #mytab ( A, B, C ) VALUES (1, 1, 2)
INSERT INTO #mytab ( A, B, C ) VALUES (1, 2, 1)
INSERT INTO #mytab ( A, B, C ) VALUES (1, 3, 1)
INSERT INTO #mytab ( A, B, C ) VALUES (2, 2, 2)
INSERT INTO #mytab ( A, B, C ) VALUES (3, 3, 1)
INSERT INTO #mytab ( A, B, C ) VALUES (3, 3, 2)
INSERT INTO #mytab ( A, B, C ) VALUES (3, 3, 3)
;WITH numbered AS
(
SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY A ORDER BY B, C)
FROM #mytab AS m
)
SELECT *
FROM numbered
WHERE rn=1
As I mentioned in my comment to HLGEM and Philip Kelley, their simple use of an aggregate function does not necessarily return one "solid" record for each A group; instead, it may return column values from many separate rows, all stitched together as if they were a single record. For example, if this were a PERSON table, with PersonID being the "A" column, and distinct contact records (say, Home and Work), you might wind up returning the person's home city but their office ZIP code -- and that's clearly asking for trouble.
The use of the ROW_NUMBER, in conjunction with a CTE here, is a little difficult to get used to at first because the syntax is awkward. But it's becoming a pretty common pattern, so it's good to get to know it.
In my sample I've defined a CTE that tacks an extra column rn (standing for "row number") onto the table, numbering rows within each group of the A column. A SELECT on that result, filtering to only those having a row number of 1 (i.e., the first record found for that value of A), returns a "solid" record for each A group -- in my example above, you'd be certain to get either the Work or Home address, but not elements of both mixed together.
It concerns me that you want any old value for fields b and c. If they are to be meaningless why are you returning them?
If it truly doesn't matter (and I honestly can't imagine a case where I would ever want this, but it's what you said) and the values for B and C don't even have to be from the same record, GROUP BY with MIN or MAX is the way to go. It's more complicated if you want the values from one particular record for all fields.
select A, count(A) as CountDuplicates, min(B) as B , min(C) as C
from TableName as base
group by A
having (count(A) > 1)
You can do something like this if you have id as a primary key in your table (picking the row with the smallest id per group):
select t.id, t.B, t.C
from TableName t
inner join
(
    select min(id) as id, A, count(A) as CountDuplicates
    from TableName
    group by A
    having (count(A) > 1)
) d on t.id = d.id