removing duplicates based on a number of specific criteria in oracle - sql

I am working on a project where I must examine duplicate records and decide which of them to keep. There are general criteria a record must meet, based on the attributes we are looking at. The following tables illustrate the relationships between the criteria.
Table1
+--------+-----+-------+-------+------+--------+
| dup_id | idm | ucode | great | good | yo2005 |
+--------+-----+-------+-------+------+--------+
| a      | 1   | 6     | yes   | yes  | yes    |
| a      | 2   | 1     | no    | yes  | yes    |
| a      | 3   | 1     | no    | no   | yes    |
| b      | 4   | 1     | yes   | yes  | no     |
| b      | 5   | 1     | no    | no   | no     |
| c      | 6   | 7     | no    | no   | yes    |
| c      | 7   | 1     | yes   | no   | no     |
| d      | 8   | 6     | no    | yes  | no     |
| d      | 9   | 1     | yes   | no   | no     |
| e      | 10  | 3     | yes   | no   | yes    |
| e      | 11  | 4     | no    | yes  | no     |
| f      | 12  | 1     | yes   | yes  | yes    |
| f      | 13  | 1     | yes   | no   | yes    |
| g      | 14  | 1     | no    | no   | yes    |
| g      | 15  | 1     | yes   | no   | no     |
+--------+-----+-------+-------+------+--------+
Table 2
+-----+-------+
| ido | yo1998|
+-----+-------+
| 1 | yes |
| 2 | no |
| 3 | no |
| 4 | no |
| 5 | no |
| 6 | no |
| 7 | no |
| 8 | yes |
| 9 | yes |
| 10 | yes |
| 11 | yes |
| 12 | yes |
| 13 | no |
| 14 | yes |
| 15 | no |
+----+-------+
The tables have other records we would like to keep, but these are the main ones that fit the criteria.
Table1
• dup_id - the id of the collection of duplicates a record belongs to. A collection can have 2 or more records.
• idm - the id of the record in table 1; matches the ido in table 2
• ucode - a duplicate signifier from a previous classification. A value of 6 means the record is considered a duplicate (but for some reason the new algorithm accepted it as a non-duplicate)
• great - this field is preferred because it was verified at some point
• good - this field is preferred, but has not been verified
• yo2005 - data that was collected in 2005
Table2
• ido - the id of the record in table 2; matches the idm in table 1
• yo1998 - data that was collected in 1998
The issue is that we have so many records to sift through. What I have been attempting to do is develop a query for each criterion, to narrow down the data we need to look at manually.
The criteria
The order of importance of the criteria is as follows:
• ucode- if one of the records in a dupid has a ucode =6, that means it is already known as a duplicate record, so the other ucodes take precedence. For example, dupid d has 2 records, so we know that the correct one is idm=8. For example, if our table has 10,000 records, this may pick up 2000 of them, which leaves us with 8000 to be manually examined.
• great- this is the 2nd level of importance for us. If great = yes, then we would like this record to be selected from any records that were not resolved by the first query. For example, of the 8000 left from the query above, this might pick up another 1000, leaving us with 7000 to be manually examined.
• good-this is 3rd level of importance to us. If great = no, but good = yes, then this would be our choice for anything not earlier resolved. For example, of the 7000 left from the query above, this might pick up another 500, leaving us with 6500 to be manually examined.
• At this point we have 2 tables involved; our 4th priority is that both yo2005 and yo1998 = yes. For example, of the 6500 left in the query above, this might pick up another 1000, leaving us with 5500 to be manually examined.
• If both are not equal to yes, yo2005 is our 5th priority. For example, of the 5500 left in the query above, this might pick up another 2000, leaving us with 3500 to be manually examined.
• yo1998 = ‘yes’ is our final priority. For example, of the 3500 left in the query above, this might pick up another 1000, leaving us with 2500 to be manually examined.
As you can see, this would cut out a great deal of the manual examination of the records.
Ideally, there would be 2 output tables: one for all the records that fit the criteria (7500 records in the example above), perhaps with a new field recording which criterion justified the choice, and another for the records that did not meet any of the criteria, so that we can investigate those further to decide which is the duplicate. Unfortunately, I am not very well versed in SQL, so I don't even know if something like this is possible.
Thank you for your time.

You can write all of these in SQL. Below is the ucode one. It selects all the dupids that have exactly two records, one of which has ucode = 6, and then picks the other record:
SELECT *
FROM t1
WHERE ucode <> 6
AND dupid IN
(SELECT dupid
FROM t1
INNER JOIN t2
ON t1.idm = t2.ido
GROUP BY dupid
HAVING COUNT(*) = 2
AND EXISTS
(SELECT 1
FROM t1 sub
WHERE ucode = 6
AND sub.dupid = t1.dupid))
This one will give you all the records marked as great whose ucode is not 6:
SELECT *
FROM t1
WHERE great = 'yes'
AND ucode <> 6
This one will give you all the records marked as good that do not have a record in the same dupid marked as great, excluding those with ucode = 6:
SELECT *
FROM t1
WHERE good = 'yes'
AND ucode <> 6
AND NOT EXISTS
(SELECT 1
FROM t1 sub
WHERE great = 'yes'
AND sub.dupid = t1.dupid)
This one finds all records where yo2005 = yes and ucode is not 6, in dupids where no record is marked great or good:
SELECT *
FROM t1
WHERE yo2005 = 'yes'
AND ucode <> 6
AND NOT EXISTS
(SELECT 1
FROM t1 sub
WHERE (great = 'yes'
OR good = 'yes')
AND sub.dupid = t1.dupid)
Finally, this one shows the records where yo1998 = yes and all other conditions fail:
SELECT *
FROM t1
INNER JOIN t2
ON t1.idm = t2.ido
WHERE yo1998 = 'yes'
AND ucode <> 6
AND NOT EXISTS
(SELECT 1
FROM t1 sub
WHERE (great = 'yes'
OR good = 'yes'
OR yo2005 = 'yes')
AND sub.dupid = t1.dupid)
Hopefully these will be useful to you!

I am not sure how you are going to use this, but it may help. I believe it gives you the two tables in your last paragraph (combined in one); the "priority" is a number, corresponding to the six criteria you have.
with
table_1 ( dup_id, idm, ucode, great, good, yo2005 ) as (
select 'a', 1, 6, 'yes', 'yes', 'yes' from dual union all
select 'a', 2, 1, 'no' , 'yes', 'yes' from dual union all
select 'a', 3, 1, 'no' , 'no' , 'yes' from dual union all
select 'b', 4, 1, 'yes', 'yes', 'no' from dual union all
select 'b', 5, 1, 'no' , 'no' , 'no' from dual union all
select 'c', 6, 7, 'no' , 'no' , 'yes' from dual union all
select 'c', 7, 1, 'yes', 'no' , 'no' from dual union all
select 'd', 8, 6, 'no' , 'yes', 'no' from dual union all
select 'd', 9, 1, 'yes', 'no' , 'no' from dual union all
select 'e', 10, 3, 'yes', 'no' , 'yes' from dual union all
select 'e', 11, 4, 'no' , 'yes', 'no' from dual union all
select 'f', 12, 1, 'yes', 'yes', 'yes' from dual union all
select 'f', 13, 1, 'yes', 'no' , 'yes' from dual union all
select 'g', 14, 1, 'no' , 'no' , 'yes' from dual union all
select 'g', 15, 1, 'yes', 'no' , 'no' from dual
),
table_2 ( ido, yo1998 ) as (
select 1, 'yes' from dual union all
select 2, 'no' from dual union all
select 3, 'no' from dual union all
select 4, 'no' from dual union all
select 5, 'no' from dual union all
select 6, 'no' from dual union all
select 7, 'no' from dual union all
select 8, 'yes' from dual union all
select 9, 'yes' from dual union all
select 10, 'yes' from dual union all
select 11, 'yes' from dual union all
select 12, 'yes' from dual union all
select 13, 'no' from dual union all
select 14, 'yes' from dual union all
select 15, 'no' from dual
)
select t1.dup_id, t1.idm, t1.ucode, t1.great, t1.good, t1.yo2005, t2.yo1998,
case when ucode = 6 then 1
when great = 'yes' then 2
when good = 'yes' then 3
when yo2005 = 'yes' then case when yo1998 = 'yes' then 4
else 5
end
when yo1998 = 'yes' then 6
end as priority
from table_1 t1 left outer join table_2 t2 on t1.idm = t2.ido
order by dup_id, priority
;
Output:
DUP_ID IDM UCODE GREAT GOOD YO2005 YO1998 PRIORITY
------ ---- ----- ----- ---- ------ ------ --------
a 1 6 yes yes yes yes 1
a 2 1 no yes yes no 3
a 3 1 no no yes no 5
b 4 1 yes yes no no 2
b 5 1 no no no no
c 7 1 yes no no no 2
c 6 7 no no yes no 5
d 8 6 no yes no yes 1
d 9 1 yes no no yes 2
e 10 3 yes no yes yes 2
e 11 4 no yes no yes 3
f 12 1 yes yes yes yes 2
f 13 1 yes no yes no 2
g 15 1 yes no no no 2
g 14 1 no no yes yes 4
15 rows selected
ADDED: Here is one way to use this (as a subquery) to further analyze the results. See OP's comments below. DUP_ID = a and d don't appear in the output at all, since each has a row with UCODE=6; for every other DUP_ID the row with the highest priority (that is, the lowest PRIORITY number) is selected. If there are ties, one arbitrary row for that DUP_ID, among those with the highest priority, is selected.
with
table_1 ( dup_id, idm, ucode, great, good, yo2005 ) as (
....
),
table_2 ( ido, yo1998 ) as (
....
),
final ( dup_id, idm, ucode, great, good, yo2005, yo1998, priority ) as (
select t1.dup_id, t1.idm, t1.ucode, t1.great, t1.good, t1.yo2005, t2.yo1998,
case when ucode = 6 then 1
when great = 'yes' then 2
when good = 'yes' then 3
when yo2005 = 'yes' then case when yo1998 = 'yes' then 4
else 5
end
when yo1998 = 'yes' then 6
end as priority
from table_1 t1 left outer join table_2 t2 on t1.idm = t2.ido
),
o ( dup_id, idm, ucode, great, good, yo2005, yo1998, priority, rn ) as (
select dup_id, idm, ucode, great, good, yo2005, yo1998, priority,
row_number() over (partition by dup_id order by priority)
from final
)
select dup_id, idm, ucode, great, good, yo2005, yo1998, priority
from o
where rn = 1 and priority > 1;
DUP_ID IDM UCODE GREAT GOOD YO2005 YO1998 PRIORITY
------ --- ----- ----- ----- ------ ------ --------
b 4 1 yes yes no no 2
c 7 1 yes no no no 2
e 10 3 yes no yes yes 2
f 12 1 yes yes yes yes 2
g 15 1 yes no no no 2
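A minimal sketch of how ties could be surfaced for manual review instead of being broken arbitrarily. This is an illustration under assumptions, not the answer's method: it runs the same CASE ranking on SQLite through Python's sqlite3 module (not Oracle), and the names ranked, best, resolved and needs_review are made up here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (dup_id TEXT, idm INTEGER, ucode INTEGER,
                      great TEXT, good TEXT, yo2005 TEXT);
CREATE TABLE table_2 (ido INTEGER, yo1998 TEXT);
INSERT INTO table_1 VALUES
  ('a', 1, 6, 'yes', 'yes', 'yes'), ('a', 2, 1, 'no', 'yes', 'yes'),
  ('a', 3, 1, 'no', 'no', 'yes'),   ('b', 4, 1, 'yes', 'yes', 'no'),
  ('b', 5, 1, 'no', 'no', 'no'),    ('c', 6, 7, 'no', 'no', 'yes'),
  ('c', 7, 1, 'yes', 'no', 'no'),   ('d', 8, 6, 'no', 'yes', 'no'),
  ('d', 9, 1, 'yes', 'no', 'no'),   ('e', 10, 3, 'yes', 'no', 'yes'),
  ('e', 11, 4, 'no', 'yes', 'no'),  ('f', 12, 1, 'yes', 'yes', 'yes'),
  ('f', 13, 1, 'yes', 'no', 'yes'), ('g', 14, 1, 'no', 'no', 'yes'),
  ('g', 15, 1, 'yes', 'no', 'no');
INSERT INTO table_2 VALUES
  (1, 'yes'), (2, 'no'), (3, 'no'), (4, 'no'), (5, 'no'), (6, 'no'),
  (7, 'no'), (8, 'yes'), (9, 'yes'), (10, 'yes'), (11, 'yes'),
  (12, 'yes'), (13, 'no'), (14, 'yes'), (15, 'no');
""")

QUERY = """
WITH ranked AS (
    -- Same priority numbering as the CASE expression above.
    SELECT t1.dup_id, t1.idm,
           CASE WHEN t1.ucode = 6 THEN 1
                WHEN t1.great = 'yes' THEN 2
                WHEN t1.good = 'yes' THEN 3
                WHEN t1.yo2005 = 'yes' AND t2.yo1998 = 'yes' THEN 4
                WHEN t1.yo2005 = 'yes' THEN 5
                WHEN t2.yo1998 = 'yes' THEN 6
           END AS priority
    FROM table_1 t1 LEFT JOIN table_2 t2 ON t1.idm = t2.ido
),
best AS (
    -- MIN ignores NULLs; groups containing a ucode = 6 row (priority 1)
    -- and groups where every priority is NULL are dropped here.
    SELECT dup_id, MIN(priority) AS best_priority
    FROM ranked
    GROUP BY dup_id
    HAVING MIN(priority) > 1
)
SELECT r.dup_id, r.idm, r.priority,
       (SELECT COUNT(*) FROM ranked x
         WHERE x.dup_id = r.dup_id AND x.priority = r.priority) AS tied
FROM ranked r
JOIN best b ON b.dup_id = r.dup_id AND b.best_priority = r.priority
ORDER BY r.dup_id, r.idm
"""

rows = conn.execute(QUERY).fetchall()
# One clear winner per group vs. several rows sharing the best priority.
resolved = [(d, i, p) for d, i, p, tied in rows if tied == 1]
needs_review = [(d, i, p) for d, i, p, tied in rows if tied > 1]
print(resolved)
print(needs_review)
```

With the sample data, dup_id f ends up in needs_review because both of its rows have priority 2, while b, c, e and g each have a single best row.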

Related

Possible to use a column name in a UDF in SQL?

I have a query in which a series of steps is repeated constantly over different columns, for example:
SELECT DISTINCT
MAX (
CASE
WHEN table_2."GRP1_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP1_MINIMUM_DATE",
MAX (
CASE
WHEN table_2."GRP2_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP2_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
I was considering writing a function to accomplish this as doing so would save on space in my query. I have been reading a bit about UDF in SQL but don't yet understand if it is possible to pass a column name in as a parameter (i.e. simply switch out "GRP1_MINIMUM_DATE" for "GRP2_MINIMUM_DATE" etc.). What I would like is a query which looks like this
SELECT DISTINCT
FUNCTION(table_2."GRP1_MINIMUM_DATE") AS "GRP1_MINIMUM_DATE",
FUNCTION(table_2."GRP2_MINIMUM_DATE") AS "GRP2_MINIMUM_DATE",
FUNCTION(table_2."GRP3_MINIMUM_DATE") AS "GRP3_MINIMUM_DATE",
FUNCTION(table_2."GRP4_MINIMUM_DATE") AS "GRP4_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
Can anyone tell me if this is possible/point me to some resource that might help me out here?
Thanks!
There is no direct way to do this, as @Tejash already stated, but it looks like your data model is not ideal: it would be better to have a table with USER_ID and GRP_ID as keys and MINIMUM_DATE as a separate field.
Without changing the table structure, you can use an UNPIVOT query to mimic this design:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4))
Result:
| USER_ID | GRP_ID | MINIMUM_DATE |
|---------|--------|--------------|
| 1 | 1 | 09/09/19 |
| 1 | 2 | 09/09/19 |
| 1 | 3 | 09/09/19 |
| 1 | 4 | 09/09/19 |
| 2 | 1 | 09/08/19 |
| 2 | 2 | 09/07/19 |
| 2 | 3 | 09/06/19 |
| 2 | 4 | 09/05/19 |
With this you can write your query without further code duplication and, if needed, use the PIVOT syntax to get one row per USER_ID.
The final query could then look like this:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
, INPUT_COHORT(USER_ID, ANCHOR_DATE)
AS (SELECT 1, SYSDATE-1 FROM dual UNION ALL
SELECT 2, SYSDATE-2 FROM dual UNION ALL
SELECT 3, SYSDATE-3 FROM dual)
-- Above is sample data; the query starts here:
, unpiv AS (SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4)))
SELECT qcsj_c000000001000000 user_id, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE
FROM INPUT_COHORT cohort
LEFT JOIN unpiv table_2
ON cohort.USER_ID = table_2.USER_ID
pivot (MAX(CASE WHEN minimum_date <= cohort."ANCHOR_DATE" THEN 1 ELSE 0 END) AS MINIMUM_DATE
FOR grp_id IN (1 AS GRP1,2 AS GRP2,3 AS GRP3,4 AS GRP4))
Result:
| USER_ID | GRP1_MINIMUM_DATE | GRP2_MINIMUM_DATE | GRP3_MINIMUM_DATE | GRP4_MINIMUM_DATE |
|---------|-------------------|-------------------|-------------------|-------------------|
| 3 | | | | |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 1 |
This way you only have to write your calculation logic once (see line starting with pivot).
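For comparison, a minimal sketch of the normalized design suggested at the start of this answer (one row per USER_ID and GRP_ID), where the flag expression is written once and no pivoting is needed if one row per group is acceptable. Assumptions: it runs on SQLite via Python's sqlite3 rather than Oracle, the table name involve_long is hypothetical, and fixed ISO date strings stand in for SYSDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical long-format table: one row per (user_id, grp_id).
CREATE TABLE involve_long (user_id INTEGER, grp_id INTEGER, minimum_date TEXT);
CREATE TABLE input_cohort (user_id INTEGER, anchor_date TEXT);
INSERT INTO involve_long VALUES
  (1, 1, '2019-09-09'), (1, 2, '2019-09-09'),
  (1, 3, '2019-09-09'), (1, 4, '2019-09-09'),
  (2, 1, '2019-09-08'), (2, 2, '2019-09-07'),
  (2, 3, '2019-09-06'), (2, 4, '2019-09-05');
INSERT INTO input_cohort VALUES
  (1, '2019-09-08'), (2, '2019-09-07'), (3, '2019-09-06');
""")

# The comparison logic appears exactly once; ISO date strings compare
# correctly as text.
flags = conn.execute("""
    SELECT c.user_id, t.grp_id,
           MAX(CASE WHEN t.minimum_date <= c.anchor_date THEN 1 ELSE 0 END) AS flag
    FROM input_cohort c
    LEFT JOIN involve_long t ON t.user_id = c.user_id
    GROUP BY c.user_id, t.grp_id
    ORDER BY c.user_id, t.grp_id
""").fetchall()

for row in flags:
    print(row)
```

User 3 has no rows in involve_long, so it comes back as a single row with a NULL grp_id and a flag that falls through to ELSE 0, unlike the PIVOT result above, which leaves those cells empty.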

How to create a query with all of dependencies in hierarchical organization?

I've been trying hard to create a query that shows all dependencies in a hierarchical organization, but the only thing I have achieved is retrieving the parent dependency. I have attached an image to show what I need.
Thanks for any clue you can give me.
This is the code I have tried with the production table.
WITH CTE AS
(SELECT
H1.systemuserid,
H1.pes_aprobadorid,
H1.yomifullname,
H1.internalemailaddress
FROM [dbo].[ext_systemuser] H1
WHERE H1.pes_aprobadorid is null
UNION ALL
SELECT
H2.systemuserid,
H2.pes_aprobadorid,
H2.yomifullname,
H2.internalemailaddress
FROM [dbo].[ext_systemuser] H2
INNER JOIN CTE c ON h2.pes_aprobadorid=c.systemuserid)
SELECT *
FROM CTE
OPTION (MAXRECURSION 1000)
You are almost there with your query. You just have to include all rows as a starting point. Also, the join should be cte.parent_id = ext.user_id and not the other way round. I've done an example query in Postgres, but you should be able to adapt it easily to your DBMS.
with recursive st_units as (
select 0 as id, NULL as pid, 'Director' as nm
union all select 1, 0, 'Department 1'
union all select 2, 0, 'Department 2'
union all select 3, 1, 'Unit 1'
union all select 4, 3, 'Unit 1.1'
),
cte AS
(
SELECT id, pid, cast(nm as text) as path, 1 as lvl
FROM st_units
UNION ALL
SELECT c.id, u.pid, cast(path || '->' || u.nm as text), lvl + 1
FROM st_units as u
INNER JOIN cte as c on c.pid = u.id
)
SELECT id, pid, path, lvl
FROM cte
ORDER BY lvl, id
id | pid | path | lvl
-: | ---: | :--------------------------------------- | --:
0 | null | Director | 1
1 | 0 | Department 1 | 1
2 | 0 | Department 2 | 1
3 | 1 | Unit 1 | 1
4 | 3 | Unit 1.1 | 1
1 | null | Department 1->Director | 2
2 | null | Department 2->Director | 2
3 | 0 | Unit 1->Department 1 | 2
4 | 1 | Unit 1.1->Unit 1 | 2
3 | null | Unit 1->Department 1->Director | 3
4 | 0 | Unit 1.1->Unit 1->Department 1 | 3
4 | null | Unit 1.1->Unit 1->Department 1->Director | 4
I've arrived at this code, which works, but when I include a hierarchy table of more than 1800 rows the query never finishes.
With cte AS
(select systemuserid, systemuserid as pes_aprobadorid, internalemailaddress, yomifullname
from #TestTable
union all
SELECT c.systemuserid, u.pes_aprobadorid, u.internalemailaddress, u.yomifullname
FROM #TestTable as u
INNER JOIN cte as c on c.pes_aprobadorid = u.systemuserid
)
select distinct * from cte
where pes_aprobadorid is not null
OPTION (MAXRECURSION 0)
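A minimal sketch of the same "every row paired with all of its ancestors" expansion, written so that it also terminates on cyclic data. The thread does not confirm that a cycle is what makes the query above run endlessly; that is only one possible cause. Assumptions: this runs on SQLite via Python's sqlite3, where a recursive CTE may use UNION to discard rows it has already produced (SQL Server requires UNION ALL, so this does not port directly), and ancestor_pairs is a made-up helper name.

```python
import sqlite3

def ancestor_pairs(rows):
    """rows: iterable of (id, pid). Returns sorted (id, ancestor_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE st_units (id INTEGER, pid INTEGER)")
    conn.executemany("INSERT INTO st_units VALUES (?, ?)", rows)
    return conn.execute("""
        WITH RECURSIVE cte(id, ancestor) AS (
            -- Start from every row that has a parent.
            SELECT id, pid FROM st_units WHERE pid IS NOT NULL
            UNION  -- not UNION ALL: already-seen pairs are not re-queued
            -- Step from the current ancestor to that ancestor's parent.
            SELECT c.id, u.pid
            FROM st_units AS u
            JOIN cte AS c ON c.ancestor = u.id
            WHERE u.pid IS NOT NULL
        )
        SELECT id, ancestor FROM cte ORDER BY id, ancestor
    """).fetchall()

# Same sample hierarchy as the Postgres example above.
units = [(0, None), (1, 0), (2, 0), (3, 1), (4, 3)]
pairs = ancestor_pairs(units)
print(pairs)

# A two-node cycle still terminates because duplicates are dropped.
print(ancestor_pairs([(1, 2), (2, 1)]))
```

On a DBMS that only allows UNION ALL in recursive CTEs, a similar effect usually needs an explicit depth limit or a visited-path column; which one fits depends on the data.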

Oracle Sql: Obtain a Sum of a Group, if Subgroup condition met

I have a dataset for which I am trying to obtain a summed value for each group, provided a subgroup within each group meets a certain condition. I am not sure if this is possible, or if I am approaching this problem incorrectly.
My data is structured as following:
+----+-------------+---------+-------+
| ID | Transaction | Product | Value |
+----+-------------+---------+-------+
| 1 | A | 0 | 10 |
| 1 | A | 1 | 15 |
| 1 | A | 2 | 20 |
| 1 | B | 1 | 5 |
| 1 | B | 2 | 10 |
+----+-------------+---------+-------+
In this example I want to obtain the sum of values by the ID column, if a transaction does not contain any products labeled 0. In the above described scenario, all values related to Transaction A would be excluded because Product 0 was purchased. With the outcome being:
+----+-------------+
| ID | Sum of Value|
+----+-------------+
| 1 | 15 |
+----+-------------+
This process would repeat for multiple IDs with each ID only containing the sum of values if the transaction does not contain product 0.
Hmmm . . . one method is to use not exists for the filtering:
select id, sum(value)
from t
where not exists (select 1
from t t2
where t2.id = t.id and t2.transaction = t.transaction and
t2.product = 0
)
group by id;
You do not need a correlated subquery with NOT EXISTS.
Just use GROUP BY.
with s (id, transaction, product, value) as (
select 1, 'A', 0, 10 from dual union all
select 1, 'A', 1, 15 from dual union all
select 1, 'A', 2, 20 from dual union all
select 1, 'B', 1, 5 from dual union all
select 1, 'B', 2, 10 from dual)
select id, sum(sum_value) as sum_value
from
(select id, transaction,
sum(value) as sum_value
from s
group by id, transaction
having count(decode(product, 0, 1)) = 0
)
group by id;
ID SUM_VALUE
---------- ----------
1 15
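A small check that the two approaches above (NOT EXISTS and GROUP BY ... HAVING) agree. This is a sketch under assumptions: it runs on SQLite via Python's sqlite3, so DECODE is replaced by a CASE expression, the column is called txn instead of transaction to stay clear of SQLite keywords, and the two rows for id 2 are made up to show the per-ID behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, txn TEXT, product INTEGER, value INTEGER);
INSERT INTO t VALUES
  (1, 'A', 0, 10), (1, 'A', 1, 15), (1, 'A', 2, 20),
  (1, 'B', 1, 5),  (1, 'B', 2, 10),
  (2, 'C', 1, 7),  (2, 'D', 0, 3);   -- id 2 rows are invented for the demo
""")

# Filter out every row whose transaction contains a product 0.
NOT_EXISTS = """
    SELECT id, SUM(value)
    FROM t
    WHERE NOT EXISTS (SELECT 1 FROM t AS t2
                      WHERE t2.id = t.id AND t2.txn = t.txn AND t2.product = 0)
    GROUP BY id
    ORDER BY id
"""

# Sum per transaction, keep only transactions with no product 0, then sum per id.
GROUP_BY = """
    SELECT id, SUM(sum_value)
    FROM (SELECT id, txn, SUM(value) AS sum_value
          FROM t
          GROUP BY id, txn
          HAVING SUM(CASE WHEN product = 0 THEN 1 ELSE 0 END) = 0)
    GROUP BY id
    ORDER BY id
"""

via_not_exists = conn.execute(NOT_EXISTS).fetchall()
via_group_by = conn.execute(GROUP_BY).fetchall()
print(via_not_exists, via_group_by)
```

Both queries drop an ID entirely when all of its transactions contain product 0; if such IDs should appear with a zero or NULL sum, an outer join against the list of IDs would be needed.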

Selecting a record based on a series of criteria

I would like to run a query that will allow me to choose the best record for a particular username based on certain criteria. I have 2 columns (col01, col02) that are the criteria I am looking at.
• If one record has both columns as yes, I would like that one to take precedence (username a in the example below).
• If one record has col01 as yes, that takes 2nd precedence (username c in the example below).
• If one record has col01 as yes and the other has col02 as yes, then col01 takes precedence (username d in the example below).
• If one record has col02 as yes and the other records have no, then col02 takes 3rd precedence (username g in the example below).
• If both records are the same, then neither should be returned, as these records need to be investigated further (usernames b, e, f).
Below are sample data and the expected output. How can this be done with a SQL query?
+----------+-----+-------+-------+
| username | id | col01 | col02 |
+----------+-----+-------+-------+
| a | 1 | yes | yes |
| a | 2 | yes | no |
| b | 3 | no | no |
| b | 4 | no | no |
| c | 5 | yes | no |
| c | 6 | no | no |
| d | 7 | yes | no |
| d | 8 | no | yes |
| e | 9 | no | yes |
| e | 10 | no | yes |
| f | 11 | yes | yes |
| f | 12 | yes | yes |
| g | 13 | no | no |
| g | 14 | no | yes |
+----------+----+--------+-------+
output
+----------+-----+-------+------+
| username | id | col01 | col02|
+----------+-----+-------+------+
| a | 1 | yes | yes |
| c | 5 | yes | no |
| d | 7 | yes | no |
| g | 14 | no | yes |
+----------+----+--------+------+
Edit: I was asked to explain the conditions. Basically, the records come from the same area (username). col01 is the most recently updated information we have, while col02 is older. Both columns are important to us, which is why it is better if both are yes; col01, being more recent, is the more dependable data. Where all the records are exactly the same, we have to dig a little deeper to understand our data.
Use analytic functions and then you do not need any self-joins:
Query:
SELECT username,
id,
col01,
col02
FROM (
SELECT t.*,
c.col02,
MIN( t.col01 ) OVER ( PARTITION BY username ) AS mincol01,
MAX( t.col01 ) OVER ( PARTITION BY username ) AS maxcol01,
MIN( c.col02 ) OVER ( PARTITION BY username ) AS mincol02,
MAX( c.col02 ) OVER ( PARTITION BY username ) AS maxcol02,
ROW_NUMBER() OVER ( PARTITION BY username
ORDER BY t.col01 DESC, c.col02 DESC ) AS rn
FROM table_name t
INNER JOIN
col02_table c
ON ( t.id = c.id )
)
WHERE ( mincol01 < maxcol01 OR mincol02 < maxcol02 )
AND rn = 1;
Output:
USERNAME ID COL01 COL02
-------- -- ----- -----
a 1 yes yes
c 5 yes no
d 7 yes no
g 14 no yes
with
inputs ( username, id, col01 , col02 ) as (
select 'a', 1, 'yes', 'yes' from dual union all
select 'a', 2, 'yes', 'no' from dual union all
select 'b', 3, 'no' , 'no' from dual union all
select 'b', 4, 'no' , 'no' from dual union all
select 'c', 5, 'yes', 'no' from dual union all
select 'c', 6, 'no' , 'no' from dual union all
select 'd', 7, 'yes', 'no' from dual union all
select 'd', 8, 'no' , 'yes' from dual union all
select 'e', 9, 'no' , 'yes' from dual union all
select 'e', 10, 'no' , 'yes' from dual union all
select 'f', 11, 'yes', 'yes' from dual union all
select 'f', 12, 'yes', 'yes' from dual union all
select 'g', 13, 'no' , 'no' from dual union all
select 'g', 14, 'no' , 'yes' from dual
)
-- Query begins here
select username,
max(id) keep (dense_rank last order by col01, col02) as id,
max(col01) as col01,
max(col02) keep (dense_rank last order by col01) as col02
from inputs
group by username
having min(col01) != max(col01) or min(col02) != max(col02)
;
USERNAME ID COL COL
-------- --- --- ---
a 1 yes yes
c 5 yes no
d 7 yes no
g 14 no yes
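The precedence rules in the question can also be read as a numeric score (col01 counts 2, col02 counts 1), which may make the logic easier to follow. A minimal sketch under assumptions: it runs on SQLite via Python's sqlite3 rather than Oracle, and the names scored, bounds and picked are made up here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inputs (username TEXT, id INTEGER, col01 TEXT, col02 TEXT);
INSERT INTO inputs VALUES
  ('a', 1, 'yes', 'yes'), ('a', 2, 'yes', 'no'),
  ('b', 3, 'no', 'no'),   ('b', 4, 'no', 'no'),
  ('c', 5, 'yes', 'no'),  ('c', 6, 'no', 'no'),
  ('d', 7, 'yes', 'no'),  ('d', 8, 'no', 'yes'),
  ('e', 9, 'no', 'yes'),  ('e', 10, 'no', 'yes'),
  ('f', 11, 'yes', 'yes'), ('f', 12, 'yes', 'yes'),
  ('g', 13, 'no', 'no'),  ('g', 14, 'no', 'yes');
""")

picked = conn.execute("""
    WITH scored AS (
        -- yes/yes = 3, yes/no = 2, no/yes = 1, no/no = 0
        SELECT username, id, col01, col02,
               2 * (CASE WHEN col01 = 'yes' THEN 1 ELSE 0 END)
                 + (CASE WHEN col02 = 'yes' THEN 1 ELSE 0 END) AS score
        FROM inputs
    ),
    bounds AS (
        SELECT username, MAX(score) AS best, MIN(score) AS worst
        FROM scored
        GROUP BY username
    )
    SELECT s.username, s.id, s.col01, s.col02
    FROM scored s
    JOIN bounds b ON b.username = s.username
    WHERE b.best > b.worst      -- skip usernames whose rows are all identical
      AND s.score = b.best      -- keep the best-scoring row
    ORDER BY s.username, s.id
""").fetchall()

for row in picked:
    print(row)
```

With more than two records per username, several rows can share the best score; this sketch would return all of them rather than pick one.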
Use multiple outer self joins: one for records with both yes, one for records with only col01 = yes, and one for records with only col02 = yes. Then add predicates to select only records whose id is the id of the first record in that set (the id of the row with the same username that has both yes, the id of the row with the same username that has only col01 = yes, etc.).
To get rid of rows that are dupes, filter out any row where there is another row (with a different id) that has the same values for username, col01, and col02.
Select distinct a.username, a.id,
a.col01, a.col02
From table a
left join table b -- <- this is rows with both cols = yes
on b.username=a.username
and b.col01='yes'
and b.col02='yes'
left join table c1 -- <- this is rows with col1 = yes
on c1.username=a.username
and c1.col01='yes'
and c1.col02='no'
left join table c2 -- <- this is rows with col2 = yes
on c2.username=a.username
and c2.col01='no'
and c2.col02='yes'
Where a.id = coalesce(b.id, c1.Id, c2.Id)
and not exists -- <- This gets rid of f
(select * from table
where username = a.username
and id != a.id
and col01 = a.col01
and col02 = a.col02)
If col02 is in another table, then in each place where you use the table and need col02, you will need to add another join to this other table.
Select distinct a.username, a.id,
a.col01, ot.col02
From (table a join otherTable ot
on ot.id = a.Id)
left join (table b join otherTable ob -- <- this rows with both cols yes
on ob.id= b.id)
on b.username=a.username
and b.col01='yes'
and ob.col02='yes'
left join (table c1 join otherTable oc1 -- <- this rows with col1 yes
on oc1.id= c1.id)
on c1.username=a.username
and c1.col01='yes'
and oc1.col02='no'
left join (table c2 join otherTable oc2 -- <- this rows with col2 yes
on oc2.id= c2.id)
on c2.username=a.username
and c2.col01='no'
and oc2.col02='yes'
Where a.id = coalesce(b.id, c1.Id, c2.Id)
and not exists -- <- This gets rid of f
(select * from table e
join otherTable oe
on oe.id= e.id
where e.username = a.username
and e.id != a.id
and e.col01 = a.col01
and oe.col02 = ot.col02)

SQL:How to dynamically return error code for records which doesn't exist in table

I am trying to replicate a workplace scenario. SQL Fiddle for Oracle is not working, so I couldn't recreate the table there.
Say I have a table like below
Table1
+----+------+
| ID | Col1 |
+----+------+
| 1 | A |
| 2 | B |
| 3 | C |
+----+------+
Now we run a query with a WHERE condition. The IN list for the WHERE clause is passed by the user at run time and can change.
Suppose user inputs 1,2,4,5
So the SQL will be like
select t.* from Table1 t where t.id in (1,2,4,5);
The result of this query will be
+----+------+
| ID | Col1 |
+----+------+
| 1 | A |
| 2 | B |
+----+------+
Now output I am expecting should be something like below
+----+---------+------+
| ID | ErrCode | Col1 |
+----+---------+------+
| 1 | 0 | A |
| 2 | 0 | B |
| 4 | 404 | |
| 5 | 404 | |
+----+---------+------+
As 3 was not entered by the user, we will not return it. But for 4 and 5 there is no record in our table, so I want to create another dummy column that contains an error code. The data columns should be null.
It is not mandatory that the user input goes into an IN clause. We can use it anywhere in the query.
I am thinking of some way of splitting the input ids and using them as rows, then left joining them with Table1 to find which records exist and which don't, and using a CASE on that to choose between 0 and 404 as the error code.
I would appreciate any other way to do this in a query.
Here it goes
WITH table_filter AS
(SELECT regexp_substr(txt, '[^,]+', 1, LEVEL) id
FROM (SELECT '1,2,4,5' AS txt FROM dual) -- User input here
CONNECT BY regexp_substr(txt, '[^,]+', 1, LEVEL) IS NOT NULL),
table1 AS -- Sample data
(SELECT 1 id,
'A' col1
FROM dual
UNION ALL
SELECT 2,
'B'
FROM dual
UNION ALL
SELECT 3,
'C'
FROM dual)
SELECT f.id,
CASE
WHEN t.id IS NULL THEN
404
ELSE
0
END AS err_code,
t.col1
FROM table_filter f
LEFT OUTER JOIN table1 t
ON t.id = f.id;
ID ERR_CODE COL1
---------------------------- ---------- ----
1 0 A
2 0 B
5 404
4 404
Oracle Setup:
CREATE TABLE Table1 ( id, col1 ) AS
SELECT 1, 'A' FROM DUAL UNION ALL
SELECT 2, 'B' FROM DUAL;
Query:
SELECT i.COLUMN_VALUE AS id,
NVL2( t.col1, 0, 404 ) AS ErrCode,
t.col1
FROM TABLE( SYS.ODCINUMBERLIST( 1, 2, 4, 5 ) ) i
LEFT OUTER JOIN
Table1 t
ON ( i.COLUMN_VALUE = t.id );
Output:
ID ERRCODE COL1
-- ------- ----
1 0 A
2 0 B
4 404
5 404
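A minimal sketch of the "build the id collection in an external language and bind it" idea, using Python. Assumptions: it runs against SQLite through the stdlib sqlite3 module instead of Oracle, so the id list is bound as a VALUES list rather than as a collection type such as SYS.ODCINUMBERLIST, and lookup is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, col1 TEXT);
INSERT INTO table1 VALUES (1, 'A'), (2, 'B'), (3, 'C');
""")

def lookup(ids):
    """Return (id, err_code, col1) for each requested id; ids must be non-empty."""
    # One "(?)" row per id; the values themselves are bound, never concatenated.
    placeholders = ", ".join("(?)" for _ in ids)
    query = f"""
        WITH wanted(id) AS (VALUES {placeholders})
        SELECT w.id,
               CASE WHEN t.id IS NULL THEN 404 ELSE 0 END AS err_code,
               t.col1
        FROM wanted w
        LEFT JOIN table1 t ON t.id = w.id
        ORDER BY w.id
    """
    return conn.execute(query, list(ids)).fetchall()

result = lookup([1, 2, 4, 5])
for row in result:
    print(row)
```

Driving the join from the requested ids, rather than from Table1, is what lets the missing ids show up as rows with a 404 code and a NULL col1.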
The collection of ids can be built dynamically using PL/SQL or an external language and then passed as a bind variable.