Join two tables on consecutive segments - sql

I want to join two tables, where the first table has more entries than the second, such that rows from each are matched up in order. A small example should make this clearer:
Table T:
| tid | sid | ron | val | seqno |
| --- | --- | --- | --- | --- |
| 1 | a | x1 | 15 | 1 |
| 2 | b | x2 | 10 | 3 |
| 2 | b | x3 | 20 | 4 |
| 3 | a | x5 | 10 | 5 |
| 4 | c | x9 | 15 | 7 |
| 4 | c | x9 | 15 | 8 |
| 4 | c | x9 | 20 | 10 |
| 4 | c | x9 | 15 | 11 |
| 6 | b | x11 | 22 | 12 |
| 7 | b | x12 | 10 | 14 |
| 7 | b | x13 | 10 | 16 |
| 7 | b | x13 | 10 | 17 |
| 7 | b | x14 | 10 | 19 |
The second table (Table C) is as follows (in reality, more columns):
| tid | sid | ron | val | fid |
| --- | --- | --- | --- | --- |
| 2 | b | x3 | 20 | 54 |
| 4 | c | x9 | 15 | 12 |
| 4 | c | x9 | 15 | 14 |
| 4 | c | x9 | 20 | 15 |
| 4 | c | x9 | 15 | 20 |
| 7 | b | x13 | 10 | 112 |
| 7 | b | x13 | 10 | 113 |
seqno (in Table T) and fid (in Table C) provide an ordering within the groups formed by (tid, sid, ron), and that is the ordering I'd like to maintain.
How can I get from these two tables to something like the following table?
| tid | sid | ron | val | fid | seqno |
| --- | --- | --- | --- | --- | --- |
| 2 | b | x3 | 20 | 54 | 4 |
| 4 | c | x9 | 15 | 12 | 7 |
| 4 | c | x9 | 15 | 14 | 8 |
| 4 | c | x9 | 20 | 15 | 10 |
| 4 | c | x9 | 15 | 20 | 11 |
| 7 | b | x13 | 10 | 112 | 16 |
| 7 | b | x13 | 10 | 113 | 17 |
I can't simply assign a rank to each element in a group and use that for matching inside a LEFT JOIN, since there are cases where the matching rows don't start at the beginning of the group (for example tid=7). Also, because val may have repeated values within the same group, I can't blindly match on it either, as that may blow up the number of rows.

This is what I managed to put together late last night; it seems to be working correctly:
WITH
table_t AS (
    SELECT *
    FROM (VALUES
        (1,'a','x1',15,1),
        (2,'b','x2',10,3),
        (2,'b','x3',20,4),
        (3,'a','x5',10,5),
        (4,'c','x9',15,7),
        (4,'c','x9',15,8),
        (4,'c','x9',20,10),
        (4,'c','x9',15,11),
        (6,'b','x11',22,12),
        (7,'b','x12',10,14),
        (7,'b','x13',10,16),
        (7,'b','x13',10,17),
        (7,'b','x14',10,19)
    ) AS t(tid, sid, ron, val, seq)
),
table_t_ranked AS (
    -- Rank the rows of T within each (tid, sid, ron) group, in seq order.
    SELECT *
         , DENSE_RANK() OVER (PARTITION BY tid, sid, ron ORDER BY seq ASC) AS ranking
    FROM table_t
),
table_c AS (
    SELECT *
    FROM (VALUES
        (2,'b','x3',20,54),
        (4,'c','x9',15,12),
        (4,'c','x9',15,14),
        (4,'c','x9',20,15),
        (4,'c','x9',15,20),
        (7,'b','x13',10,112),
        (7,'b','x13',10,113)
    ) AS c(tid, sid, ron, val, fid)
),
table_c_ranked AS (
    -- Rank the rows of C within each (tid, sid, ron) group, in fid order.
    SELECT *
         , DENSE_RANK() OVER (PARTITION BY tid, sid, ron ORDER BY fid ASC) AS ranking
    FROM table_c
),
foo AS (
    -- Pair every C row with the T rows of the same group and value,
    -- carrying both rankings along.
    SELECT c.*
         , t.seq
         , t.ranking AS ranking_t
    FROM table_c_ranked c
    LEFT JOIN table_t_ranked t
        ON  c.tid = t.tid
        AND c.sid = t.sid
        AND c.ron = t.ron
        AND c.val = t.val
)
SELECT tid, sid, ron, val, fid, seq
FROM foo
WHERE ranking = ranking_t   -- keep only the positionally aligned pairs
ORDER BY tid, seq
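For comparison, a shorter variant of the same idea numbers repeated values directly: take ROW_NUMBER() per (tid, sid, ron, val) in both tables (ordered by seq and fid respectively) and add that number to the join condition. This is only a sketch against the sample data above, and it assumes that within each (tid, sid, ron, val) group the rows of C correspond to the rows of T in the same relative order:
WITH table_t AS (
    SELECT * FROM (VALUES
        (1,'a','x1',15,1), (2,'b','x2',10,3), (2,'b','x3',20,4),
        (3,'a','x5',10,5), (4,'c','x9',15,7), (4,'c','x9',15,8),
        (4,'c','x9',20,10), (4,'c','x9',15,11), (6,'b','x11',22,12),
        (7,'b','x12',10,14), (7,'b','x13',10,16), (7,'b','x13',10,17),
        (7,'b','x14',10,19)
    ) AS t(tid, sid, ron, val, seq)
),
table_c AS (
    SELECT * FROM (VALUES
        (2,'b','x3',20,54), (4,'c','x9',15,12), (4,'c','x9',15,14),
        (4,'c','x9',20,15), (4,'c','x9',15,20), (7,'b','x13',10,112),
        (7,'b','x13',10,113)
    ) AS c(tid, sid, ron, val, fid)
),
t_numbered AS (
    -- n-th occurrence of each value within its (tid, sid, ron) group
    SELECT *, ROW_NUMBER() OVER (PARTITION BY tid, sid, ron, val ORDER BY seq) AS occ
    FROM table_t
),
c_numbered AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY tid, sid, ron, val ORDER BY fid) AS occ
    FROM table_c
)
SELECT t.tid, t.sid, t.ron, t.val, c.fid, t.seq
FROM c_numbered c
JOIN t_numbered t
  ON  t.tid = c.tid
  AND t.sid = c.sid
  AND t.ron = c.ron
  AND t.val = c.val
  AND t.occ = c.occ   -- match the n-th 15 in C to the n-th 15 in T
ORDER BY t.tid, t.seq;
It produces the same result on the sample data; the DENSE_RANK version above is the safer choice if whole groups, rather than individual repeated values, are meant to line up.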

Related

Get all records from Table A and all entries that don't already exist from Table B

I have an Oracle DB with 2 tables, Table A and Table B.
Table A has better data quality but only for a limited set of entries. Table A also contains multiple entries per number (because of history), and I only need the latest one per number.
So I need to get all entries from Table A and then the rest of the entries which are not in Table A from Table B.
I also need to get some data from Table B into the result of Table A because the info does not exist in Table A (val1, val2, val3).
So probably some sort of JOIN + GROUP BY?
Table A:
number | valid_from | valid_to | pos1 | pos2 | pos3 | factor | loc
100 | 2020-03-01 | 2020-03-10 | 7 | 80 | 18 | 19 | 1
100 | 2020-03-10 | 2020-03-13 | 7 | 80 | 18 | 19 | 1
100 | 2020-03-13 | 2020-03-16 | 8 | 80 | 18 | 20 | 1
200 | 2020-03-02 | 2020-03-03 | 6 | 90 | 19 | 30 | 1
200 | 2020-03-03 | 2020-03-04 | 6 | 90 | 19 | 29 | 1
200 | 2020-03-04 | 2020-03-10 | 9 | 90 | 19 | 30 | 1
300 | 2020-03-10 | 2020-03-12 | 13 | 100 | 10 | 41 | 2
300 | 2020-03-12 | 2020-03-14 | 13 | 100 | 10 | 40 | 2
300 | 2020-03-14 | 2020-03-20 | 10 | 100 | 10 | 40 | 2
Table B:
number | pos1 | pos2 | pos3 | val1 | val2 | val3 | top
100 | 7 | 70 | 18 | a | aa | aaa | 3
200 | 6 | 60 | 19 | b | bb | bbb | 4
300 | 5 | 50 | 10 | c | cc | ccc | 5
400 | 2 | 20 | 2 | d | dd | ddd | 16
500 | 3 | 30 | 3 | e | ee | eee | 28
End result should be:
number | pos1 | pos2 | pos3 | factor | loc | val1 | val2 | val3 | top
100 | 8 | 80 | 18 | 20 | 1 | a | aa | aaa | 3
200 | 9 | 90 | 19 | 30 | 1 | b | bb | bbb | 4
300 | 10 | 100 | 10 | 40 | 2 | c | cc | ccc | 5
400 | 2 | 20 | 2 | NULL | NULL | d | dd | ddd | 16
500 | 3 | 30 | 3 | NULL | NULL | e | ee | eee | 28
How can I achieve this? Do I need a FULL or LEFT JOIN and a GROUP BY on number? I'm not sure which columns to take or how to get the latest entries from Table A.
select b.number, b.pos1, b.pos2, b.pos3,
a.factor, a.loc, b.val1, b.val2, b.val3, b.top
from tableb b
left outer join tablea a
on b.number = a.number
You can also use NVL(b.pos1, a.pos1), which means: if b.pos1 is null, take a.pos1.
I would write this as a left join with filtering:
select b.number, b.pos1, b.pos2, b.pos3,
a.factor, a.loc,
b.val1, b.val2, b.val3, b.top
from b
left join (
select a.*, row_number() over(partition by number order by valid_from desc) rn
from a
) a on a.number = b.number and a.rn = 1
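Putting the two suggestions together, here is a sketch (untested, using the tablea/tableb names from the first snippet) that takes the latest Table A row per number and falls back to Table B's pos columns when a number has no Table A entry, which is what the expected result above shows:
select b.number,
       nvl(a.pos1, b.pos1) as pos1,
       nvl(a.pos2, b.pos2) as pos2,
       nvl(a.pos3, b.pos3) as pos3,
       a.factor, a.loc,
       b.val1, b.val2, b.val3, b.top
from tableb b
left join (
    select a.*,
           row_number() over (partition by number order by valid_from desc) as rn
    from tablea a
) a
  on a.number = b.number
 and a.rn = 1;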

Merge groups if they contain the same value

I have the following table:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 4 |
| 10 | A2 | 4 |
| 10 | H4 | 5 |
| 10 | K0 | 5 |
| 10 | Z3 | 5 |
| 10 | F1 | 5 |
| 10 | A1 | 5 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 7 |
| 10 | C | 7 |
| 10 | C | 8 |
| 10 | D | 8 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 2 |
| 20 | C | 2 |
| 20 | C | 3 |
| 20 | D | 3 |
+-----+----+---------+
Within every grp, my goal is to merge all the sub_grp sharing at least one id.
More than 2 sub_grp can be merged together.
The expected result should be:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 1 |
| 10 | A2 | 1 |
| 10 | H4 | 2 |
| 10 | K0 | 2 |
| 10 | Z3 | 2 |
| 10 | F1 | 2 |
| 10 | A1 | 2 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 6 |
| 10 | C | 6 |
| 10 | C | 6 |
| 10 | D | 6 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 1 |
| 20 | C | 1 |
| 20 | C | 1 |
| 20 | D | 1 |
+-----+----+---------+
Here is a SQL Fiddle with the test values: http://sqlfiddle.com/#!9/13666c/2
I am trying to solve this with either a stored procedure or plain queries.
This is an evolution from my previous problem: Merge rows containing same values
My understanding of the problem
Merge sub_grp (for a given grp) if any one of the IDs in one sub_grp match any one of the IDs in another sub_grp. A given sub_grp can be merged with only one other (the earliest in ascending order) sub_grp.
Disclaimer
This code may work. Not tested as OP did not provide DDLs and data scripts.
Solution
UPDATE t
SET sub_grp = m.new_sub_grp
FROM tbl AS t
-- For each grp, sub_grp combination find the matching new_sub_grp.
-- CROSS APPLY is used because the lookup is correlated with the outer row;
-- it also excludes rows for which there is no matching sub_grp and thus
-- nothing to update.
CROSS APPLY
    -- Find the earliest earlier sub-group that shares at least one ID with
    -- this row's sub-group; only looking backwards avoids the "double linking".
    ( SELECT TOP 1 b.sub_grp AS new_sub_grp
      FROM tbl AS a
      INNER JOIN tbl AS b
              ON b.grp = a.grp
             AND b.sub_grp < a.sub_grp
             AND b.id = a.id
      WHERE a.grp = t.grp
        AND a.sub_grp = t.sub_grp
      ORDER BY b.sub_grp ) AS m;
You can re-number sub groups afterwards as a separate update statement with the help of DENSE_RANK window function.
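A sketch of that renumbering step (untested, assuming SQL Server and the same tbl table as above), using DENSE_RANK through an updatable CTE:
WITH renumbered AS (
    -- Re-number the (possibly merged) sub_grp values to 1, 2, 3, ... per grp.
    SELECT sub_grp,
           DENSE_RANK() OVER (PARTITION BY grp ORDER BY sub_grp) AS new_sub_grp
    FROM tbl
)
UPDATE renumbered
SET sub_grp = new_sub_grp;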

Get the Id of the matched data from other table. No duplicates of ID from both tables

Here is my table A.
| Id | GroupId | StoreId | Amount |
| 1 | 20 | 7 | 15000 |
| 2 | 20 | 7 | 1230 |
| 3 | 20 | 7 | 14230 |
| 4 | 20 | 7 | 9540 |
| 5 | 20 | 7 | 24230 |
| 6 | 20 | 7 | 1230 |
| 7 | 20 | 7 | 1230 |
Here is my table B.
| Id | GroupId | StoreId | Credit |
| 12 | 20 | 7 | 1230 |
| 14 | 20 | 7 | 15000 |
| 15 | 20 | 7 | 14230 |
| 16 | 20 | 7 | 1230 |
| 17 | 20 | 7 | 7004 |
| 18 | 20 | 7 | 65523 |
I want to get this result without getting duplicate Id of both table.
I need to get the Id of table B and A where the Amount = Credit.
| A.ID | B.ID | Amount |
| 1 | 14 | 15000 |
| 2 | 12 | 1230 |
| 3 | 15 | 14230 |
| 4 | null | 9540 |
| 5 | null | 24230 |
| 6 | 16 | 1230 |
| 7 | null | 1230 |
My problem is that when table A has two or more rows with the same Amount, I get a duplicate ID from table B where the result should be null instead. Please help me. Thank you.
I think you want a left join. This is tricky because you have duplicate amounts but only want each one to match once. The solution is to use row_number():
select . . .
from (select a.*,
             row_number() over (partition by amount order by id) as seqnum
      from a
     ) a left join
     (select b.*,
             row_number() over (partition by credit order by id) as seqnum
      from b
     ) b
  on a.amount = b.credit and a.seqnum = b.seqnum;
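Spelled out against the desired output columns (a sketch, assuming the tables really are named a and b as in the snippet), this becomes:
select a.id as a_id, b.id as b_id, a.amount
from (select a.*,
             row_number() over (partition by amount order by id) as seqnum
      from a
     ) a
left join
     (select b.*,
             row_number() over (partition by credit order by id) as seqnum
      from b
     ) b
  on a.amount = b.credit
 and a.seqnum = b.seqnum   -- pair the n-th duplicate amount with the n-th duplicate credit
order by a.id;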
Another approach, which I think is simpler and shorter :)
select ID [A.ID],
(select top 1 ID from TABLE_B where Credit = A.Amount) [B.ID],
Amount
from TABLE_A [A]

How do I select columns whenever they change?

I'm trying to create a slowly changing dimension (type 2 dimension) and am a bit lost on how to logically write it out. Say that we have a source table with a grain of Person | Country | Department | Login Time. I want to create this dimension table with Person | Country | Department | Eff Start time | Eff End Time.
Data could look like this:
Person | Country | Department | Login Time
------------------------------------------
Bob | CANADA | Marketing | 2009-01-01
Bob | CANADA | Marketing | 2009-02-01
Bob | USA | Marketing | 2009-03-01
Bob | USA | Sales | 2009-04-01
Bob | MEX | Product | 2009-05-01
Bob | MEX | Product | 2009-06-01
Bob | MEX | Product | 2009-07-01
Bob | CANADA | Marketing | 2009-08-01
What I want in the Type 2 dimension would look like this:
Person | Country | Department | Eff Start time | Eff End Time
------------------------------------------------------------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 | NULL
Assume that Bob's name, Country and Department haven't been updated since 2009-08-01, so the Eff End Time is left as NULL.
What function would work best here? This is on Netezza, which uses a flavor of Postgres.
Obviously GROUP BY would not work here because the same grouping can recur later on (I added Bob | CANADA | Marketing as the last row to show this).
EDIT
Would including a hash column on Person, Country, and Department make sense? I'm thinking of using logic along the lines of
SELECT PERSON, COUNTRY, DEPARTMENT
FROM table t1
where
person = person
AND t1.hash <> hash_function(person, country, department)
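(A sketch of that change-detection idea without an explicit hash column, assuming LAG() is available on your Netezza version and using the so table created in the answer below: compare each row's attributes with the previous row's and keep only the rows where something changed. This only flags the candidate effective-start rows, not the full dimension.)
SELECT person, country, department, login_time
FROM (
    SELECT s.*,
           LAG(country)    OVER (PARTITION BY person ORDER BY login_time) AS prev_country,
           LAG(department) OVER (PARTITION BY person ORDER BY login_time) AS prev_department
    FROM so s
) x
WHERE prev_country IS NULL                 -- first login for the person
   OR prev_country <> country
   OR prev_department <> department;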
Answer
create table so (
person varchar(32)
,country varchar(32)
,department varchar(32)
,login_time date
) distribute on random;
insert into so values ('Bob','CANADA','Marketing','2009-01-01');
insert into so values ('Bob','CANADA','Marketing','2009-02-01');
insert into so values ('Bob','USA','Marketing','2009-03-01');
insert into so values ('Bob','USA','Sales','2009-04-01');
insert into so values ('Bob','MEX','Product','2009-05-01');
insert into so values ('Bob','MEX','Product','2009-06-01');
insert into so values ('Bob','MEX','Product','2009-07-01');
insert into so values ('Bob','CANADA','Marketing','2009-08-01');
/* ************************************************************************** */
with prm as ( --Create an ordinal primary key.
select
*
,row_number() over (
partition by person
order by login_time
) rwn
from
so
), chn as ( --Chain events to their previous and next event.
select
cur.rwn
,cur.person
,cur.country
,cur.department
,cur.login_time cur_login
,case
when
cur.country = prv.country
and cur.department = prv.department
then 1
else 0
end prv_equal
,case
when
(
cur.country = nxt.country
and cur.department = nxt.department
) or nxt.rwn is null --No next record should be equivalent to matching.
then 1
else 0
end nxt_equal
,case prv_equal
when 0 then cur_login
else null
end eff_login_start_sparse
,case
when eff_login_start_sparse is null
then max(eff_login_start_sparse) over (
partition by cur.person
order by rwn
rows unbounded preceding --The secret sauce.
)
else eff_login_start_sparse
end eff_login_start
,case nxt_equal
when 0 then cur_login
else null
end eff_login_end
from
prm cur
left outer join prm nxt on
cur.person = nxt.person
and cur.rwn + 1 = nxt.rwn
left outer join prm prv on
cur.person = prv.person
and cur.rwn - 1 = prv.rwn
), grp as ( --Group by login starts.
select
person
,country
,department
,eff_login_start
,max(eff_login_end) eff_login_end
from
chn
group by
person
,country
,department
,eff_login_start
), led as ( --Change the effective end to be the next start, if desired.
select
person
,country
,department
,eff_login_start
,case
when eff_login_end is null
then null
else
lead(eff_login_start) over (
partition by person
order by eff_login_start
)
end eff_login_end
from
grp
)
select * from led order by eff_login_start;
This code returns the following table.
PERSON | COUNTRY | DEPARTMENT | EFF_LOGIN_START | EFF_LOGIN_END
--------+---------+------------+-----------------+---------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 |
Explanation
I must have solved this four or five times in the past few years and keep neglecting to write it down formally. I'm glad to have the chance to do it, so this is a great question.
When attempting this, I like writing down the problem in matrix form. Here's the input, presuming that all values have the same key in the SCD.
Cv | Ce
----|----
A | 10
A | 11
B | 14
C | 16
D | 18
D | 25
D | 34
A | 40
Where Cv is the value that we'll need to compare against (again, presuming that the key value for the SCD is equal in this data; we'll be partitioning over the key value the entire time so it's irrelevant to the solution) and Ce is the event time.
First, we need an ordinal primary key. I've designated this Ck in the table. This will allow us to join the table to itself to get the previous and next events. I've called these columns Pk (previous key), Nk (next key), Pv, and Nv.
Cv | Ce | Ck | Pk | Pv | Nk | Nv |
----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A |
A | 11 | 2 | 1 | A | 3 | B |
B | 14 | 3 | 2 | A | 4 | C |
C | 16 | 4 | 3 | B | 5 | D |
D | 18 | 5 | 4 | C | 6 | D |
D | 25 | 6 | 5 | D | 7 | D |
D | 34 | 7 | 6 | D | 8 | A |
A | 40 | 8 | 7 | D | | |
Now we need some columns to see if we're at the beginning or end of a contiguous event block. I'll call these Pc and Nc, for contiguous. Pc is defined as Pv = Cv => true, where 1 represents true and 0 represents false. Nc is defined similarly, except that the null case defaults to true (we'll see why in a minute).
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc |
----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 |
A | 40 | 8 | 7 | D | | | 0 | 1 |
Now you can start to see how the 1,1 combination of Pc,Nc is a completely useless record. We know this intuitively, since Bob's Mex/Product combination on the 6th row is pretty much useless information when building an SCD.
So let's get rid of the useless information. I'll add two new columns here: an almost-complete effective start time called Sn and an actually-complete effective end time called Ee. Sn is populated with Ce when Pc is 0, and Ee is populated with Ce when Nc is 0.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee |
----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | |
This looks really close, but we still have the problem that we can't group by Cv (person/country/department). What we need is for Sn to fill all those nulls with the previous non-null value of Sn. You could join the table to itself on a smaller rwn and take the maximum, but I'm going to be lazy and use Netezza's analytic functions with the rows unbounded preceding clause; it's a shortcut to the method I just described. So we're going to create another column called Es, effective start, defined as follows.
case
when Sn is null
then max(Sn) over (
partition by k --key value of the SCD
order by Ck
rows unbounded preceding
)
else Sn
end Es
With that definition, we get this.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee | Es |
----|----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | | 10 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 | 10 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | | 18 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | | 18 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 | 18 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | | 40 |
The rest is trivial. Group by Es and grab the max of Ee to obtain this table.
Cv | Es | Ee |
----|----|----|
A | 10 | 11 |
B | 14 | 14 |
C | 16 | 16 |
D | 18 | 34 |
A | 40 | |
If you want to populate the effective end time with the next start, join the table again to itself or use the lead() window function to grab it.

How to aggregate column on changing criteria in SQL (multiple SUMIFS)

Consider the following simplified example:
Table JobTitles
| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A | A1 | 1 | 5 |
| A | A2 | 6 | 10 |
| A | A3 | 11 | 15 |
| B | B1 | 2 | 4 |
| B | B2 | 5 | 7 |
| B | B3 | 8 | 11 |
| C | C1 | 5 | 12 |
| C | C2 | 13 | 14 |
| C | C3 | 15 | 18 |
Table Transactions:
| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A | 2 | 5 |
| A | 3 | 10 |
| A | 12 | 5 |
| A | 12 | 10 |
| B | 3 | 5 |
| B | 3 | 10 |
| B | 10 | 5 |
| C | 16 | 10 |
| C | 17 | 5 |
| C | 17 | 10 |
| C | 17 | 5 |
Desired Output:
| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A | A1 | 1 | 5 | 15 |
| A | A2 | 6 | 10 | 0 |
| A | A3 | 11 | 15 | 15 |
| B | B1 | 2 | 4 | 15 |
| B | B2 | 5 | 7 | 0 |
| B | B3 | 8 | 11 | 5 |
| C | C1 | 5 | 12 | 0 |
| C | C2 | 13 | 14 | 0 |
| C | C3 | 15 | 18 | 30 |
To me this is JobTitles LEFT OUTER JOIN Transactions with some kind of moving criteria for the TransDate -- that is, I want to SUM Transactions.Amt if Transactions.TransDate is between JobTitles.StartDate and JobTitles.EndDate, for each PersonID.
Feels like some type of partition or window function, but my SQL skills are not strong enough to create an elegant solution. In Excel, this equates to:
SUMIFS(Transaction[Amt], JobTitles[PersonID], Results[#[PersonID]], Transactions[TransDate], ">" & Results[#[StartDate]], Transactions[TransDate], "<=" & Results[#[EndDate]])
Moreover, I want to be able to perform this same logic over several flavors of Transaction tables.
The basic query is:
select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
Transactions t
on jt.PersonId = t.PersonId and
t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
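If the same logic has to run over several flavors of transaction tables, one option (a sketch; the second table name OtherTransactions is hypothetical) is to stack them in a CTE first and keep the join and aggregation unchanged:
with all_trans as (
    select PersonID, TransDate, Amt from Transactions
    union all
    select PersonID, TransDate, Amt from OtherTransactions  -- hypothetical second table
)
select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate,
       coalesce(sum(t.Amt), 0) as amt
from JobTitles jt
left join all_trans t
       on jt.PersonID = t.PersonID
      and t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
If the totals per source table must stay separate, run the same query per table (or add a source label column to the CTE and include it in the GROUP BY).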