How to optimize mapping over several tables - sql

I'm trying to optimize my code. The solution described below works fine, but I'm pretty sure there are better ways to do it. Do you have any recommendations?
I have one table with business contracts and some characteristic attributes:
table_contracts
contract_number attribute_1 attribute_2 attribute_3
123 a e t
456 a f s
789 b g s
And a second table that maps each contract into a specific group. These groups have different priorities (higher number => higher priority). If an attribute column is empty, that attribute is not required (=> m3 is the catch-all mapping).
table_mappings
map_number priority attribute_1 attribute_2 attribute_3
m1         5        a           e           t
m2         4        a
m3         3
As a result I need the contract_number and the corresponding map_number with the highest priority.
This is how I did it. It works, but does anyone know how to optimize it?
with
first_selection as
(
select
table_contracts.contract_number
,table_mappings.priority
,row_number() over(partition by table_contracts.contract_number order by table_mappings.priority desc) as rn
from table_contracts
left join table_mappings
on (table_contracts.attribute_1 = table_mappings.attribute_1 or table_mappings.attribute_1 is null)
and (table_contracts.attribute_2 = table_mappings.attribute_2 or table_mappings.attribute_2 is null)
and (table_contracts.attribute_3 = table_mappings.attribute_3 or table_mappings.attribute_3 is null)
),
second_selection as
(
select
table_contracts.contract_number
,table_mappings.priority
,table_mappings.map_number
from table_contracts
left join table_mappings
on (table_contracts.attribute_1 = table_mappings.attribute_1 or table_mappings.attribute_1 is null)
and (table_contracts.attribute_2 = table_mappings.attribute_2 or table_mappings.attribute_2 is null)
and (table_contracts.attribute_3 = table_mappings.attribute_3 or table_mappings.attribute_3 is null)
)
select
first_selection.contract_number
,second_selection.map_number
from first_selection
join second_selection
on first_selection.contract_number = second_selection.contract_number and first_selection.priority = second_selection.priority
where first_selection.rn = 1
The output of this code would be:
Results
contract_number map_number
123 m1
456 m2
789 m3

I think you only need one of the selections:
with prioritized as(
select c.contract_number, c.attribute_1, c.attribute_2, c.attribute_3, m.map_number
,row_number() over(
partition by c.contract_number
order by m.priority desc
) as rn
from table_contracts c
left join table_mappings m on(
(c.attribute_1 = m.attribute_1 or m.attribute_1 is null)
and (c.attribute_2 = m.attribute_2 or m.attribute_2 is null)
and (c.attribute_3 = m.attribute_3 or m.attribute_3 is null)
)
)
select *
from prioritized
where rn = 1
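To check the single-CTE version against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database from Python. SQLite is only a stand-in for the real DBMS (an assumption; it needs SQLite 3.25+ for window functions), and empty mapping attributes are stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_contracts (
        contract_number INTEGER, attribute_1 TEXT, attribute_2 TEXT, attribute_3 TEXT);
    CREATE TABLE table_mappings (
        map_number TEXT, priority INTEGER, attribute_1 TEXT, attribute_2 TEXT, attribute_3 TEXT);
    INSERT INTO table_contracts VALUES
        (123, 'a', 'e', 't'), (456, 'a', 'f', 's'), (789, 'b', 'g', 's');
    -- An empty attribute in the mapping table is stored as NULL ("not required").
    INSERT INTO table_mappings VALUES
        ('m1', 5, 'a', 'e', 't'), ('m2', 4, 'a', NULL, NULL), ('m3', 3, NULL, NULL, NULL);
""")

QUERY = """
with prioritized as (
    select c.contract_number, m.map_number,
           row_number() over (
               partition by c.contract_number
               order by m.priority desc
           ) as rn
    from table_contracts c
    left join table_mappings m
      on (c.attribute_1 = m.attribute_1 or m.attribute_1 is null)
     and (c.attribute_2 = m.attribute_2 or m.attribute_2 is null)
     and (c.attribute_3 = m.attribute_3 or m.attribute_3 is null)
)
select contract_number, map_number
from prioritized
where rn = 1
order by contract_number
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [(123, 'm1'), (456, 'm2'), (789, 'm3')]
```

The join is evaluated once, and the window function picks the best mapping per contract, so the second selection and the self-join on priority are not needed.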

Try the logic below, which uses a CTE version similar to yours. Hope it helps!
WITH contracts AS
(SELECT 123 AS contract_number, 'a' AS attribute_1, 'e' AS attribute_2, 't' AS attribute_3 FROM dual
UNION
SELECT 456, 'a', 'f', 's' FROM dual
UNION SELECT 789, 'b', 'g', 's' FROM dual
),
mappings AS
(SELECT 'm1' AS map_number, 5 AS priority, 'a' AS attribute_1, 'e' AS attribute_2, 't' AS attribute_3 FROM dual
UNION
SELECT 'm2', 4, 'a', NULL, NULL FROM dual
UNION
SELECT 'm3', 3, NULL, NULL, NULL FROM dual
),
prioritymap AS
(SELECT contract_number,
map_number,
Rank() over(PARTITION BY contracts.contract_number ORDER BY mappings.priority DESC) AS rank
FROM contracts
JOIN mappings
ON ( contracts.attribute_1 = mappings.attribute_1 OR mappings.attribute_1 IS NULL )
AND ( contracts.attribute_2 = mappings.attribute_2 OR mappings.attribute_2 IS NULL )
AND ( contracts.attribute_3 = mappings.attribute_3 OR mappings.attribute_3 IS NULL )
)
SELECT contract_number, map_number
FROM prioritymap
WHERE prioritymap.rank = 1

You can simply join the tables on the condition given (an attribute is null in the mapping table or must match the attribute in the contracts table). Then aggregate per contract number to get the best map_number.
select
c.contract_number,
max(m.map_number) keep (dense_rank last order by m.priority) as map_number
from table_contracts c
join table_mappings m
on (m.attribute_1 is null or m.attribute_1 = c.attribute_1)
and (m.attribute_2 is null or m.attribute_2 = c.attribute_2)
and (m.attribute_3 is null or m.attribute_3 = c.attribute_3)
group by c.contract_number
order by c.contract_number;
Anyway, you are doing this for all contracts and a mapping may match on any combination of attributes, so this will lead to full table scans. The only way I can see to make this quicker is parallel execution. Maybe the DBMS is set to do this automatically; otherwise you can use a hint:
select /*+parallel(4)*/
...

Related

BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
|id_1|id_2|
|----|----|
|1|a|
|2|a|
|2|b|
|3|b|
|4|c|
|5|c|
|6|d|
|6|e|
|7|f|
I would like to be able to recursively join and then aggregate rows in order to find each disconnected sub-graph represented by these links, that is, each collection of IDs that are linked together.
The desired output for the example above would look something like this:
|id_1_coll|id_2_coll|
|---------|---------|
|1, 2, 3|a, b|
|4, 5|c|
|6|d, e|
|7|f|
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even though there is no explicit link row, because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationships between id_1 and id_2 so that we get all the links from id_1 to itself, then use a recursive common table expression to traverse all the possible paths between id_1 values, and finally aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are:
1. Remodel the relationship into a series of self-joins for id_1.
2. Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE.
3. Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column, grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function.
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
|id_1|linked_id|lowest_linked_id|
|----|---------|----------------|
|1|2|1|
|2|1|1|
|2|3|2|
|3|2|2|
|4|5|4|
|5|4|4|
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the n+1th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
)
Giving us the following results:
|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|------------|---|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|
which we can now link back to the original table for the id_2 values then aggregate (3.) as shown in the complete query below
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
|grp|id1_coll|id2_coll|
|---|--------|--------|
|1|[1,2,3]|[a,b]|
|4|[4,5]|[c]|
|6|[6]|[d,e]|
|7|[7]|[f]|
Limitations/Issues
Unfortunately this approach is inefficient (we have to traverse every single pathway before aggregating it back together) and fails with the real-world case where we have several million join rows. When trying to execute on this data, BigQuery runs up a huge "Slot time consumed" and eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if an id_1 value and its linked_id are both already in the list of linked_ids, we don't need to check it further).
Using ROW_NUMBER(), the query is as follows:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1 : Append with row numbers.
t2 : Extract rows matching either id_1 or id_2 by recursive query.
t3 : Make arrays from id_1 and id_2 with ARRAY_AGG().
However, it may not help with the limitations/issues you describe.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order/one or both id's move to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem as referenced in the other answers, and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates, as done by @RomanPekar here (which I also tested). The main consideration seems to be performance. I'd assume DBMSs have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.
I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.
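To make the graph-walking framing concrete, here is a minimal sketch of the grouping done outside SQL with a union-find (disjoint-set) structure. This is a swapped-in technique, not any of the SQL approaches above: it merges groups as it goes, so it never enumerates paths. The function name connected_groups and the ("id_1", ...) / ("id_2", ...) node tags are made up for illustration, and it assumes the link rows can be exported and fit in memory.

```python
from collections import defaultdict


def connected_groups(links):
    """Group (id_1, id_2) link rows into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Walk up to the root of this node's tree.
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag the nodes so an id_1 value can never collide with an id_2 value.
    for id_1, id_2 in links:
        union(("id_1", id_1), ("id_2", id_2))

    # Collect the id_1 and id_2 values that ended up under each root.
    groups = defaultdict(lambda: (set(), set()))
    for id_1, id_2 in links:
        ids_1, ids_2 = groups[find(("id_1", id_1))]
        ids_1.add(id_1)
        ids_2.add(id_2)

    return sorted((sorted(a), sorted(b)) for a, b in groups.values())


links = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
         (5, "c"), (6, "d"), (6, "e"), (7, "f")]
for id_1_coll, id_2_coll in connected_groups(links):
    print(id_1_coll, id_2_coll)
```

Each link row is processed once and the work per row is close to constant, which is the "merge as we go" behaviour the question asks for. Whether it is practical depends on being able to move the link table out of the warehouse.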

Eliminating null values in union

I'm doing a query across databases with an identical structure, to show a mapping from a source value to a target value.
Every one of my databases has a table with two columns: source and target
DB1
|Source|Target|
|------|------|
|A|X|
|A|Y|
|B|NULL|
|C|NULL|
DB2
|Source|Target|
|------|------|
|A|NULL|
|A|Y|
|B|Z|
So my query is
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
What I'm getting is
|Source|Target|
|------|------|
|A|X|
|A|Y|
|B|NULL|
|C|NULL|
|B|Z|
|A|NULL|
But I'm only interested in the target being NULL, if there is no other mapping present.
So I'm looking for the following result:
|Source|Target|
|------|------|
|A|X|
|A|Y|
|C|NULL|
|B|Z|
How can I easily eliminate the highlighted rows A | NULL and B | NULL from my results?
I've seen a few answers suggesting using MAX(Target), but that won't work for me since I can have multiple valid mappings for a single source (A | X and A | Y)
Something like this would work, just give a number based on NULL, and select the first:
SELECT TOP(1) WITH TIES UN.Source
, UN.Target
FROM (
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
) AS UN
ORDER BY DENSE_RANK()OVER(PARTITION BY UN.Source ORDER BY CASE WHEN UN.Target IS NOT NULL THEN 1 ELSE 2 END)
You might find it easier to think in terms of minimums:
with data as (
select Source, Target from DB1.<table> union
select Source, Target from DB2.<table>
), qualified as (
select *,
case when Target is not null or min(Target) over (partition by Source) is null
then 1 end as Keep
from data
)
select Source, Target from qualified where Keep = 1;
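As a quick check of the windowed-minimum idea, here is a minimal sketch that runs it on the sample rows from the question in an in-memory SQLite database. The table names db1_table and db2_table are hypothetical stand-ins for DB1.table and DB2.table, and SQLite (3.25+) is assumed only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for DB1.table and DB2.table.
    CREATE TABLE db1_table (Source TEXT, Target TEXT);
    CREATE TABLE db2_table (Source TEXT, Target TEXT);
    INSERT INTO db1_table VALUES ('A', 'X'), ('A', 'Y'), ('B', NULL), ('C', NULL);
    INSERT INTO db2_table VALUES ('A', NULL), ('A', 'Y'), ('B', 'Z');
""")

QUERY = """
with data as (
    select Source, Target from db1_table
    union
    select Source, Target from db2_table
), qualified as (
    select Source, Target,
           -- Keep a row if it has a target, or if its source has no target at all.
           case when Target is not null
                  or min(Target) over (partition by Source) is null
                then 1 end as Keep
    from data
)
select Source, Target
from qualified
where Keep = 1
order by Source, Target
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('A', 'X'), ('A', 'Y'), ('B', 'Z'), ('C', None)]
```

MIN() ignores NULLs, so the windowed minimum is NULL only when a source has no non-NULL target anywhere, which is exactly the case where the NULL row should survive.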
For completeness, here's the solution I went with, based on the answer of @HoneyBadger, including the suggestion made by @MartinSmith in the comments:
SELECT * FROM
(SELECT UN.Source
, UN.Target
, DENSE_RANK()OVER(PARTITION BY UN.Source ORDER BY CASE WHEN UN.Target IS NOT NULL THEN 1 ELSE 2 END) as ranking
FROM (
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
) AS UN
) UN2
WHERE UN2.ranking = 1
ORDER BY UN2.Source, UN2.Target
This solution selects only the records that have a DENSE_RANK of 1, avoiding the TOP(1) WITH TIES.

How to display null values in IN operator for SQL with two conditions in where

I have this query
select *
from dbo.EventLogs
where EntityID = 60181615
and EventTypeID in (1, 2, 3, 4, 5)
and NewValue = 'Received'
If 2 and 4 do not exist with NewValue 'Received', it shows this:
[image: current results]
[image: What I want]
Ideally you should maintain somewhere a table containing all possible EventTypeID values. Sans that, we can use a CTE in place along with a left join:
WITH EventTypes AS (
SELECT 1 AS ID UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5
)
SELECT et.ID AS EventTypeId, el.*
FROM EventTypes et
LEFT JOIN dbo.EventLogs el
ON el.EventTypeID = et.ID AND
el.EntityID = 60181615 AND
el.NewValue = 'Received'
WHERE
et.ID IN (1,2,3,4,5);
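Here is a minimal sketch of the same left-join idea, run against made-up rows in an in-memory SQLite database. The EventLogs rows are invented for illustration, SQLite has no dbo schema so the table is unqualified, and the join matches on EventTypeID so that each event type lines up only with its own log rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EventLogs (EntityID INTEGER, EventTypeID INTEGER, NewValue TEXT);
    -- Invented sample rows: types 2 and 4 have no 'Received' row for this entity.
    INSERT INTO EventLogs VALUES
        (60181615, 1, 'Received'),
        (60181615, 2, 'Pending'),
        (60181615, 3, 'Received'),
        (60181615, 5, 'Received'),
        (99999999, 4, 'Received');
""")

QUERY = """
WITH EventTypes AS (
    SELECT 1 AS ID UNION ALL
    SELECT 2 UNION ALL
    SELECT 3 UNION ALL
    SELECT 4 UNION ALL
    SELECT 5
)
SELECT et.ID AS EventTypeId, el.EntityID, el.NewValue
FROM EventTypes et
LEFT JOIN EventLogs el
  ON el.EventTypeID = et.ID
 AND el.EntityID = 60181615
 AND el.NewValue = 'Received'
ORDER BY et.ID
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The filters on EntityID and NewValue sit in the ON clause rather than in WHERE. Moving them to WHERE would discard the NULL-extended rows for types 2 and 4 and turn the query back into an inner join.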

Excluding records from table based on rules from another table

I'm using Oracle SQL and I have a product table with different attributes and sales volume for each product, and another table with certain exclusion rules for different levels of aggregation. Let's look at the example:
Here is our main table with sales data on which we want to perform some calculations:
And the other table contains different rules which are supposed to exclude certain rows from the table above:
When there is an "x", this column shouldn't be considered so our rules are:
1. exclude all rows with ATTR_3 = 'no'
2. exclude all rows with ATTR_1 = 'Europe' and ATTR_2 = 'snacks' and ATTR_3 = 'no'
3. exclude all rows with ATTR_1 = 'Africa'
And based on that our final output should be like that:
How could this be achieved in SQL? I was thinking about a join, but I have no idea how to handle different levels of aggregation for exclusions.
I think your expected output is wrong. None of the rules excludes the 2nd row (Europe - snacks - yes).
with
  -- sample data
  test (product_id, attr_1, attr_2, attr_3) as
    (select 81928 , 'Europe', 'beverages', 'yes' from dual union all
     select 16534 , 'Europe', 'snacks'   , 'yes' from dual union all
     select 56468 , 'USA'   , 'snacks'   , 'no'  from dual union all
     select 129921, 'Africa', 'drinks'   , 'yes' from dual union all
     select 123021, 'Africa', 'snacks'   , 'yes' from dual union all
     select 165132, 'USA'   , 'drinks'   , 'yes' from dual
    ),
  rules (attr_1, attr_2, attr_3) as
    (select 'x'     , 'x'     , 'no' from dual union all
     select 'Europe', 'snacks', 'no' from dual union all
     select 'Africa', 'x'     , 'x'  from dual
    )
-- query you need
select t.*
from test t
where (t.attr_1, t.attr_2, t.attr_3) not in
      (select
         decode(r.attr_1, 'x', t.attr_1, r.attr_1),
         decode(r.attr_2, 'x', t.attr_2, r.attr_2),
         decode(r.attr_3, 'x', t.attr_3, r.attr_3)
       from rules r
      );
PRODUCT_ID ATTR_1 ATTR_2    ATT
---------- ------ --------- ---
     81928 Europe beverages yes
     16534 Europe snacks    yes
    165132 USA    drinks    yes
You can use a join with OR conditions, written as an anti-join, as follows:
SELECT P.*
FROM PRODUCT P
LEFT JOIN RULES R ON
(R.ATTR_1 = 'x' OR P.ATTR_1 = R.ATTR_1)
AND (R.ATTR_2 = 'x' OR P.ATTR_2 = R.ATTR_2)
AND (R.ATTR_3 = 'x' OR P.ATTR_3 = R.ATTR_3)
WHERE R.ATTR_1 IS NULL
You can use NOT EXISTS
SELECT *
FROM sales s
WHERE NOT EXISTS (
SELECT 0
FROM attributes a
WHERE ( ( a.attr_1 = s.attr_1 AND a.attr_1 IS NOT NULL )
OR a.attr_1 IS NULL )
AND ( ( a.attr_2 = s.attr_2 AND a.attr_2 IS NOT NULL )
OR a.attr_2 IS NULL )
AND ( ( a.attr_3 = s.attr_3 AND a.attr_3 IS NOT NULL )
OR a.attr_3 IS NULL )
)
where I considered the x values within the attributes table as NULL. If you really have x characters, then you can use :
SELECT *
FROM sales s
WHERE NOT EXISTS (
SELECT 0
FROM attributes a
WHERE ( ( NVL(a.attr_1,'x') = s.attr_1 AND NVL(a.attr_1,'x')!='x' )
OR NVL(a.attr_1,'x')='x' )
AND ( ( NVL(a.attr_2,'x') = s.attr_2 AND NVL(a.attr_2,'x')!='x' )
OR NVL(a.attr_2,'x')='x' )
AND ( ( NVL(a.attr_3,'x') = s.attr_3 AND NVL(a.attr_3,'x')!='x' )
OR NVL(a.attr_3,'x')='x' )
)
instead.
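For a runnable check of the NOT EXISTS idea with literal 'x' wildcards, here is a minimal sketch using the sample rows from the earlier answer in this thread, loaded into an in-memory SQLite database. The table names sales and rules are assumptions (the question's tables were only shown as images), and the wildcard test is written directly as r.attr = 'x' rather than through NVL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed table names; the sample rows come from the earlier answer.
    CREATE TABLE sales (product_id INTEGER, attr_1 TEXT, attr_2 TEXT, attr_3 TEXT);
    CREATE TABLE rules (attr_1 TEXT, attr_2 TEXT, attr_3 TEXT);
    INSERT INTO sales VALUES
        (81928,  'Europe', 'beverages', 'yes'),
        (16534,  'Europe', 'snacks',    'yes'),
        (56468,  'USA',    'snacks',    'no'),
        (129921, 'Africa', 'drinks',    'yes'),
        (123021, 'Africa', 'snacks',    'yes'),
        (165132, 'USA',    'drinks',    'yes');
    INSERT INTO rules VALUES
        ('x',      'x',      'no'),
        ('Europe', 'snacks', 'no'),
        ('Africa', 'x',      'x');
""")

QUERY = """
SELECT s.product_id
FROM sales s
WHERE NOT EXISTS (
    SELECT 1
    FROM rules r
    -- A rule matches a row when every column is either a wildcard or equal.
    WHERE (r.attr_1 = 'x' OR r.attr_1 = s.attr_1)
      AND (r.attr_2 = 'x' OR r.attr_2 = s.attr_2)
      AND (r.attr_3 = 'x' OR r.attr_3 = s.attr_3)
)
ORDER BY s.product_id
"""

kept = [row[0] for row in conn.execute(QUERY)]
print(kept)  # [16534, 81928, 165132]
```

All three columns of a rule are tested together inside one subquery, so a combined rule such as Europe / snacks / no only excludes rows that match the whole combination.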
I would do this with three different not exists:
select p.*
from product p
where not exists (select 1
from rules r
where r.attr_1 = p.attr_1 and r.attr_1 <> 'x'
) and
not exists (select 1
from rules r
where r.attr_2 = p.attr_2 and r.attr_2 <> 'x'
) and
not exists (select 1
from rules r
where r.attr_3 = p.attr_3 and r.attr_3 <> 'x'
) ;
In particular, this can take advantage of indexes on (attr_1), (attr_2) and (attr_3) -- something that is quite handy if you have a moderate number of rules.

Stuck on this union / except

Trying to find the best way to proceed with this, for some reason it is really tripping me up.
I have data like this:
transaction_id(pk) decision_id(pk) accepted_ind
A 1 NULL
A 2 <blank>
A 4 Y
B 1 <blank>
B 2 Y
C 1 Y
D 1 N
D 2 O
D 3 Y
Each transaction is guaranteed to have decision 1
There can be multiple decision possibilities (what-if's) type of scenarios
Accepted can have multiple values or be blank or NULL but only one can be accepted_ind = Y
I am trying to write a query to:
Return one row for each transaction_id
Return the decision_id where the accepted_ind = Y or if the transaction has no rows accepted_ind = Y, then return the row with decision_id = 1 (regardless of value in the accepted_ind)
I have tried:
1. Using logical "or" to pull the records, kept getting duplicates.
2. Using a union and except but can not quite get the logic down correctly.
Any assistance is appreciated. I am not sure why this is tripping me up so much!
Adam
Try this. Basically the WHERE clause says:
Where Accepted = 'Y'
OR
There is no accepted row for this transaction and the decision_id = 1
SELECT Transaction_id, Decision_ID, Accepted_ind
FROM MyTable t
WHERE Accepted_ind = 'Y'
OR (NOT EXISTS (SELECT 1 FROM MyTable t2
WHERE t2.Accepted_ind = 'Y'
and t2.Transaction_id = t.transaction_id)
AND Decision_id = 1)
This approach uses ROW_NUMBER() and therefore will only work on SQL Server 2005 or later.
I have modified your sample data because, as it stands, every transaction_id has a Y indicator!
DECLARE @t TABLE (
transaction_id NCHAR(1),
decision_id INT,
accepted_ind NCHAR(1) NULL
)
INSERT @t VALUES
( 'A' , 1 , NULL ),
( 'A' , 2 , '' ),
( 'A' , 4 , 'Y' ),
( 'B' , 1 , '' ),
( 'B' , 2 , 'N' ), -- change from your sample data
( 'C' , 1 , 'Y' ),
( 'D' , 1 , 'N' ),
( 'D' , 2 , 'O' ),
( 'D' , 3 , 'Y' )
And here is the query itself:
SELECT transaction_id, decision_id, accepted_ind FROM (
SELECT transaction_id, decision_id, accepted_ind,
ROW_NUMBER() OVER (
PARTITION BY transaction_id
ORDER BY
CASE
WHEN accepted_ind = 'Y' THEN 1
WHEN decision_id = 1 THEN 2
ELSE 3
END
) rn
FROM @t
) Raw
WHERE rn = 1
Results:
transaction_id decision_id accepted_ind
-------------- ----------- ------------
A 4 Y
B 1
C 1 Y
D 3 Y
The CASE expression in the ORDER BY gives a 'priority' to each criterion you mention; ROW_NUMBER() then numbers the rows of each transaction by that priority, and we take the first row.
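The same prioritised ROW_NUMBER() pattern can be tried without SQL Server. Here is a minimal sketch in an in-memory SQLite database, assuming SQLite 3.25+ for window functions. The table name decisions is hypothetical, and the rows are the modified sample data above (B / 2 changed to 'N').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table name; rows match the modified sample data.
    CREATE TABLE decisions (transaction_id TEXT, decision_id INTEGER, accepted_ind TEXT);
    INSERT INTO decisions VALUES
        ('A', 1, NULL), ('A', 2, ''), ('A', 4, 'Y'),
        ('B', 1, ''),   ('B', 2, 'N'),
        ('C', 1, 'Y'),
        ('D', 1, 'N'),  ('D', 2, 'O'), ('D', 3, 'Y');
""")

QUERY = """
SELECT transaction_id, decision_id, accepted_ind
FROM (
    SELECT transaction_id, decision_id, accepted_ind,
           ROW_NUMBER() OVER (
               PARTITION BY transaction_id
               ORDER BY CASE
                            WHEN accepted_ind = 'Y' THEN 1  -- accepted row wins
                            WHEN decision_id = 1 THEN 2     -- else fall back to decision 1
                            ELSE 3
                        END
           ) AS rn
    FROM decisions
) ranked
WHERE rn = 1
ORDER BY transaction_id
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('A', 4, 'Y'), ('B', 1, ''), ('C', 1, 'Y'), ('D', 3, 'Y')]
```

Because at most one row per transaction can have accepted_ind = 'Y' and every transaction has a decision 1, the first-ranked row is unambiguous and no tie-breaker is needed.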
There's probably a neater/more efficient query, but I think this will get the job done. It assumes the table name is Decision:
SELECT CASE
WHEN accepteddecision.transaction_id IS NOT NULL THEN
accepteddecision.transaction_id
ELSE firstdecision.transaction_id
END AS transaction_id,
CASE
WHEN accepteddecision.decision_id IS NOT NULL THEN
accepteddecision.decision_id
ELSE firstdecision.decision_id
END AS decision_id,
CASE
WHEN accepteddecision.accepted_ind IS NOT NULL THEN
accepteddecision.accepted_ind
ELSE firstdecision.accepted_ind
END AS accepted_ind
FROM decision
LEFT OUTER JOIN (SELECT *
FROM decision AS accepteddecision
WHERE accepteddecision.accepted_ind = 'Y') AS
accepteddecision
ON accepteddecision.transaction_id = decision.transaction_id
LEFT OUTER JOIN (SELECT *
FROM decision AS firstdecision
WHERE firstdecision.decision_id = 1) AS firstdecision
ON firstdecision.transaction_id = decision.transaction_id
GROUP BY accepteddecision.transaction_id,
firstdecision.transaction_id,
accepteddecision.decision_id,
firstdecision.decision_id,
accepteddecision.accepted_ind,
firstdecision.accepted_ind
Out of interest, the following uses UNION and EXCEPT (plus a JOIN) as specified in the question title:
WITH T AS (SELECT * FROM (
VALUES ('A', 1, NULL),
('A', 2, ''),
('A', 4, 'Y'),
('B', 1, ''),
('B', 2, 'Y'),
('C', 1, 'Y'),
('D', 1, 'N'),
('D', 2, 'O'),
('D', 3, 'Y'),
('E', 2, 'O'), -- sample data extended
('E', 1, 'N') -- sample data extended
) AS T (transaction_id, decision_id, accepted_ind)
)
SELECT *
FROM T
WHERE accepted_ind = 'Y'
UNION
SELECT T.*
FROM (
SELECT transaction_id
FROM T
WHERE decision_id = 1
EXCEPT
SELECT transaction_id
FROM T
WHERE accepted_ind = 'Y'
) D
JOIN T
ON T.transaction_id = D.transaction_id
AND T.decision_id = 1;