SQL Duplicate With Nulls

For somewhat insane business reasons related to getting rather messy data from our customer, I have the following problem:
1) I have a table with 6 semi-unique identifier fields and one automatically incrementing unique ID. The table has other fields, but those aren't important to this discussion, nor is the exact type of data the fields hold.
2) I want a list of the unique IDs of all rows that participate in at least one duplicate relationship. (There's no additional value in identifying every row pair that constitutes a duplication, but if a solution provides that, it's fairly trivial to retrieve the set of duplicate rows from it, so that would also be fine.)
3) A duplicate is defined as follows:
3a) For each of the 6 fields, record A must either match record B, or at least one of the two values must be null.
3b) At least one field must match exactly (i.e. the values are equal and neither is null).
4) All of the fields of interest are strings and are never empty strings. Many rows have at least one of these fields null, but (at least assuming our ingest logic is working) no row has more than 3 of them null.
5) Exact string matching is fine. We don't need any regex-based, case-insensitive, or similar fuzzy matching.
6) Actual duplicates in the table are fairly rare.
7) We are running PostgreSQL 9. Using database-specific functionality is acceptable.
8) The table has 500,000 rows, so the naive query I started out with (provided below) takes far too long to be viable; presumably it runs in quadratic time on the self-join. Ideally, the results should return in less than a minute on a midrange server.
SELECT a.id
FROM myTable a
JOIN myTable b ON a.id < b.id
AND (a.field1 = b.field1 OR a.field1 IS NULL OR b.field1 IS NULL)
AND (a.field2 = b.field2 OR a.field2 IS NULL OR b.field2 IS NULL)
....
WHERE
a.field1 = b.field1 OR a.field2 = b.field2 ...
9) I also looked into using GROUP BY. But GROUP BY does not consider two rows to be equal if a grouped column is null in one and holds a value in the other. Unless there is a way to achieve that behavior, GROUP BY won't work for my "both equal or at least one is null" logic.
10) The set of values that can appear in each column can be assumed non-overlapping with the other columns; i.e., other than null, you would not expect a value from field 1 to appear in any row's field 2.
Update: Sorry for the lack of information. I'll provide as close an approximation of the table schema as I can. Unfortunately, the project in question is in the defense sector, and even just the field names of the table could reveal operationally sensitive information.
CREATE TABLE a (
id serial NOT NULL PRIMARY KEY,
f1 character varying,
f2 character varying,
f3 character varying,
f4 character varying,
f5 character varying,
f6 character varying,
...Other columns that aren't really relevant
)
CREATE INDEX f1_idx
ON public.a
USING btree
(f1 COLLATE pg_catalog."default");
...Same index for the other 5 fields.
For ease of reference, I'll copy Laurenz Albe's question and answer it here.
If you have the three rows
(1, 2, 3, 4, NULL, 6)
(1, 2, 3, NULL, 5, NULL)
(1, 2, 3, 4, 7, NULL)
which are duplicates?
(1, 2, 3, NULL, 5, NULL)
and
(1, 2, 3, 4, 7, NULL)
are not duplicates, because field 5 is non-null in both and the values are not equal. The other two pairs are duplicates.
I'll give a few more examples of my own for clarity. (Just for completeness, I'll provide my row examples as strings. But, like I said, their string-ness isn't really important because we require exact string matches.)
("1", "2", "3", "4", NULL, NULL)
AND
("1","2","3",NULL,"9",NULL)
are duplicates because columns 4, 5, and 6 are null in at least one of the two rows, and all other fields are equal.
("1", "2", "3", "4", NULL, "6")
AND
("1","2","3",NULL,"9","7")
are not duplicates because field 6 differs and neither value is null.
And two examples more typical of the actual data:
(NULL, NULL, "3", NULL, "5", "6")
and
("1", "2", NULL, "4", NULL, "6")
are duplicates because in every field where they differ, at least one side is null.
(NULL, NULL, "3", NULL, "5", "6")
and
("1", "2", NULL, "4", NULL, "6")
Yes, that does mean that
(NULL, NULL, NULL, "4", "5", "6")
and
("1", "2", "3", NULL, NULL, NULL)
would be duplicates if not for the requirement that at least one field match exactly. Which fields are null and which aren't is very nearly random. All that we require from our data provider is that at least 2 of the 6 fields be provided.
Another Update: I've updated point 2 to reflect the fact that I want all rows that participate in at least one duplicate pair. So, for the three rows
(1, 2, 3, 4, NULL, 6)
(1, 2, 3, NULL, 5, NULL)
(1, 2, 3, 4, 7, NULL)
all three would be returned: even though rows 2 and 3 would not be considered duplicates of each other, pairs (1,2) and (1,3) are duplicates, so all three rows participate in at least one duplicate relationship.

Use count() over(partition by ...) then filter that result for any counts greater than 1:
CREATE TABLE mytable(
ID INTEGER NOT NULL PRIMARY KEY
,col1 VARCHAR(2) NOT NULL
,col2 VARCHAR(2)
,col3 VARCHAR(2) NOT NULL
,col4 VARCHAR(2)
,col5 VARCHAR(2)
,col6 VARCHAR(2)
);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1001,'a1','b1','c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1002,'a1',NULL,'c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1003,'a1','b1','c1','d1','e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1004,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1005,'a1','b1','c1',NULL,'e1','f1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1006,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1007,'f1',NULL,'b1','c1','d1','e1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1008,'b1','c1','d1','e1','f1',NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1009,'c1','d1','e1','f1',NULL,NULL);
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1010,'c1','d1','e1','f1',NULL,'a1');
INSERT INTO mytable(ID,col1,col2,col3,col4,col5,col6) VALUES (1011,'a1','b1','c1','d1','e1','f1');
select
*
, count(*) over(partition by
coalesce(col1,'NULL')
, coalesce(col2,'NULL')
, coalesce(col3,'NULL')
, coalesce(col4,'NULL')
, coalesce(col5,'NULL')
, coalesce(col6,'NULL')
) cv
from mytable
id | col1 | col2 | col3 | col4 | col5 | col6 | cv
---: | :--- | :--- | :--- | :--- | :--- | :--- | -:
1001 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1003 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1011 | a1 | b1 | c1 | d1 | e1 | f1 | 3
1005 | a1 | b1 | c1 | null | e1 | f1 | 1
1002 | a1 | null | c1 | d1 | e1 | f1 | 1
1008 | b1 | c1 | d1 | e1 | f1 | null | 3
1004 | b1 | c1 | d1 | e1 | f1 | null | 3
1006 | b1 | c1 | d1 | e1 | f1 | null | 3
1010 | c1 | d1 | e1 | f1 | null | a1 | 1
1009 | c1 | d1 | e1 | f1 | null | null | 1
1007 | f1 | null | b1 | c1 | d1 | e1 | 1
Use the approach above as a subquery, and then use where cv > 1 to locate all rows that have "duplicates" in those 6 columns.
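For instance, a minimal sketch of that wrapping query (same table and column names as above; dups is just a subquery alias) could look like:
select id
from (
  select *
       , count(*) over(partition by
             coalesce(col1,'NULL')
           , coalesce(col2,'NULL')
           , coalesce(col3,'NULL')
           , coalesce(col4,'NULL')
           , coalesce(col5,'NULL')
           , coalesce(col6,'NULL')
         ) cv
  from mytable
) dups
where cv > 1;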
db<>fiddle here
Please note the power of having some sample data to work with. Really, it is your responsibility to provide sample data with your question (you already own that data anyway). Do NOT try to explain in words alone; use data to illustrate the "as is" and the "to be", and you will find your questions easier to prepare and faster to answer. See "Why should I provide an MCVE".

Related

What if there are no possible primary keys in a table?

I have to design a wide table for a database (TimescaleDB, which will create hypertables based on date), but it seems like there are no possible primary keys, even if we are talking about composite keys.
| id | attribute1 | attribute2 | attribute3 | attribute4 | date_time           |
| -- | ---------- | ---------- | ---------- | ---------- | ------------------- |
| P1 | A          | 20         | NULL       | NULL       | 2021-01-01 00:00:00 |
| P1 | B          | 10         | NULL       | NULL       | 2021-01-01 00:00:00 |
| P1 | NULL       | NULL       | 200        | 300        | 2021-01-01 00:00:00 |
| P2 | C          | 25         | NULL       | NULL       | 2021-01-01 00:00:00 |
| P2 | NULL       | NULL       | 150        | 400        | 2021-01-01 00:00:00 |
The problem is that we are scraping data that describes P1, P2, etc. as a whole, and also data that describes only a part of P1 (A and B are parts of P1), of P2 (C), etc.
Is there any way to make this work without splitting up the table?
You can follow the design below. The following structure does not store any null values in the database:
create table parenttable
(
id int identity,
Name nvarchar(10),
primary key(id)
)
create table childtable
(
id int identity,
parent_id int,
attribute nvarchar(50),
valueattribute nvarchar(50),
date_time datetime,
primary key(id),
foreign key(parent_id)references parenttable
);
insert into parenttable values
('P1'),
('P2')
insert into childtable values
(1,'attribute1','A','2021-01-01 00:00:00'),
(1,'attribute2','20','2021-01-01 00:00:00'),
(1,'attribute1','B','2021-01-01 00:00:00'),
(1,'attribute2','10','2021-01-01 00:00:00'),
(1,'attribute3','200','2021-01-01 00:00:00'),
(1,'attribute4','300','2021-01-01 00:00:00'),
(2,'attribute1','C','2021-01-01 00:00:00'),
(2,'attribute2','25','2021-01-01 00:00:00'),
(2,'attribute3','150','2021-01-01 00:00:00'),
(2,'attribute4','400','2021-01-01 00:00:00')
select *
from parenttable p join childtable c on p.id = c.parent_id
result in dbfiddle: https://dbfiddle.uk
Attribute1 and attribute2 will always be NULL for P1 rows; those describe A and B (and they belong to P1). Similarly, attribute3 and attribute4 are always going to be NULL if there is an A, B, C, etc., because those attributes describe P1 itself.
There is not enough information in your problem statement to answer your question.
I don't understand the above description, but it's enough to tell me you need to apply functional dependency analysis, and create as many tables as "those are describing" exist.
attribute3 and attribute4 ... are describing P1
That suggests you should have a table representing P1 things, with attribute3 and attribute4 as columns (preferably with meaningful names).
Organize your tables around the things you're modeling.
Look for columns that cannot be NULL for particular things. Those belong in the table depicting one kind of thing.
Then look for columns that might be NULL for a certain kind of thing. Those can be NULL-able columns, or a separate table sharing the same key, with optional cardinality.
There are no other kinds of columns.
Once you've grouped your columns into tables and distinguished what's necessary from what's not, you can look over the mandatory columns for a candidate key. There is always such a key, even if it includes all the non-NULL columns. Why? Because two identical rows are indistinguishable from each other. If you think you need two such rows, what you really need is one row, and a quantity column (not in the key) indicating how many such exist.
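To make that concrete, here is a rough sketch of such a split, under my reading of the sample data (the table names, column types, and the assumption that attribute1 names a part while attribute2 carries its value are all made up for illustration):
-- readings that describe a P as a whole (attribute3/attribute4 are mandatory here)
create table p_readings (
    p_id       text      not null,   -- P1, P2, ...
    date_time  timestamp not null,
    attribute3 integer   not null,
    attribute4 integer   not null,
    primary key (p_id, date_time)
);
-- readings that describe a part (A, B, C, ...) belonging to a P
create table part_readings (
    p_id       text      not null,   -- P1, P2, ...
    part       text      not null,   -- what attribute1 held: A, B, C, ...
    attribute2 integer   not null,   -- the value that accompanied the part
    date_time  timestamp not null,
    primary key (p_id, part, date_time)
);
With no all-NULL columns left in either table, the natural composite keys shown above become possible.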

Use IN to compare Array of Values against a table of data

I want to compare an array of values against the rows of a table and return only the rows in which the data are different.
Suppose I have myTable:
| ItemCode | ItemName | FrgnName |
|----------|----------|----------|
| CD1 | Apple | Mela |
| CD2 | Mirror | Specchio |
| CD3 | Bag | Borsa |
Now, using the SQL IN operator, I would like to compare the rows above against an array of values pasted from an Excel file, so in theory I would have to write something like:
WHERE NOT IN (
ARRAY[CD1, Apple, Mella],
ARRAY[CD2, Miror, Specchio],
ARRAY[CD3, Bag, Borsa]
)
The query should return rows 1 and 2, since "MELLA" and "MIROR" are in fact typos.
You could use a VALUES expression to emulate a table of those arrays, like so:
... myTable AS t
LEFT JOIN (
VALUES (1, 'CD1','Apple','Mella')
, (1, 'CD2', 'Miror', 'Specchio')
, (1, 'CD3', 'Bag', 'Borsa')
) AS v(rowPresence, a, b, c)
ON t.ItemCode = v.a AND t.ItemName = v.b AND t.FrgnName = v.c
WHERE v.rowPresence IS NULL
Technically, in your scenario, you can do without the "rowPresence" field I added: since none of the values in your arrays are NULL, testing any of the columns would do. I basically added it to point to the more general case.
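Spelled out as a full statement (a sketch using the table and column names from the question, and testing v.a for NULL instead of the extra field):
SELECT t.*
FROM myTable AS t
LEFT JOIN (
    VALUES ('CD1', 'Apple', 'Mella')
         , ('CD2', 'Miror', 'Specchio')
         , ('CD3', 'Bag',   'Borsa')
) AS v(a, b, c)
  ON t.ItemCode = v.a AND t.ItemName = v.b AND t.FrgnName = v.c
WHERE v.a IS NULL;
Row 3 finds an exact match and is filtered out; rows 1 and 2 come back because of the "Mella" and "Miror" typos.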

BINARY_CHECKSUM - different result depending on number of rows

I wonder why the BINARY_CHECKSUM function returns different results for the same rows:
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL),
(3, 1, 2)) s(id,a,b);
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL)) s(id,a,b);
Output:
+-----+----+------+-------------+
| id | a | b | bc |
+-----+----+------+-------------+
| 1 | | 100 | -109 |
| 2 | | | -2147483640 |
| 3 | 1 | 2 | 18 |
+-----+----+------+-------------+
-- -109 vs 100
+-----+----+------+------------+
| id | a | b | bc |
+-----+----+------+------------+
| 1 | | 100 | 100 |
| 2 | | | 2147483647 |
+-----+----+------+------------+
And for the second sample I get what I would anticipate:
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, 1, 100),
(2, 3, 4),
(3,1,1)) s(id,a,b);
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, 1, 100),
(2, 3, 4)) s(id,a,b);
Output for the first two rows in both cases:
+-----+----+------+-----+
| id | a | b | bc |
+-----+----+------+-----+
| 1 | 1 | 100 | 116 |
| 2 | 3 | 4 | 52 |
+-----+----+------+-----+
db<>fiddle demo
It has strange consequences when I want to compare two tables/queries:
WITH t AS (
SELECT 1 AS id, NULL AS a, 100 b
UNION ALL SELECT 2, NULL, NULL
UNION ALL SELECT 3, 1, 2 -- comment this out
), s AS (
SELECT 1 AS id ,100 AS a, NULL as b
UNION ALL SELECT 2, NULL, NULL
UNION ALL SELECT 3, 2, 1 -- comment this out
)
SELECT t.*,s.*
,BINARY_CHECKSUM(t.a, t.b) AS bc_t, BINARY_CHECKSUM(s.a, s.b) AS bc_s
FROM t
JOIN s
ON s.id = t.id
WHERE BINARY_CHECKSUM(t.a, t.b) = BINARY_CHECKSUM(s.a, s.b);
db<>fiddle demo2
For 3 rows I get a single result:
+-----+----+----+-----+----+----+--------------+-------------+
| id | a | b | id | a | b | bc_t | bc_s |
+-----+----+----+-----+----+----+--------------+-------------+
| 2 | | | 2 | | | -2147483640 | -2147483640 |
+-----+----+----+-----+----+----+--------------+-------------+
but for 2 rows I also get id = 1:
+-----+----+------+-----+------+----+-------------+------------+
| id | a | b | id | a | b | bc_t | bc_s |
+-----+----+------+-----+------+----+-------------+------------+
| 1 | | 100 | 1 | 100 | | 100 | 100 |
| 2 | | | 2 | | | 2147483647 | 2147483647 |
+-----+----+------+-----+------+----+-------------+------------+
Remarks:
I am not searching for alternatives like HASHBYTES/MD5/CHECKSUM.
I am aware that BINARY_CHECKSUM can lead to collisions (two different inputs produce the same output), but the scenario here is a bit different.
For this definition, we say that null values, of a specified type,
compare as equal values. If at least one of the values in the
expression list changes, the expression checksum can also change.
However, this is not guaranteed. Therefore, to detect whether values
have changed, we recommend use of BINARY_CHECKSUM only if your
application can tolerate an occasional missed change.
It is strange to me that a hash function returns different results for the same input arguments.
Is this behaviour by design, or is it some kind of glitch?
EDIT:
As @scsimon points out, it works for materialized tables but not for a CTE.
db<>fiddle actual table
Metadata for the CTE:
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL),
(3, 1, 2)) s(id,a,b)', NULL,0);
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL)) s(id,a,b)', NULL,0)
-- working workaround
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, cast(NULL as int), 100),
(2, NULL, NULL)) s(id,a,b)', NULL,0)
In all cases all columns are INT, but with an explicit CAST it behaves as it should.
db<>fiddle metadata
This has nothing to do with the number of rows. It is because the values in one of the columns of the 2-row version are always NULL. The default type of NULL is int and the default type of a numeric constant (of this length) is int, so these should be comparable. But from a values() derived table, these are (apparently) not exactly the same type.
In particular, a column with only typeless NULLs from a derived table is not comparable, so it is excluded from the binary checksum calculation. This does not occur in a real table, because all columns have types.
The rest of the answer illustrates what is happening.
The code behaves as expected with type conversions:
SELECT *, BINARY_CHECKSUM(a, b) AS bc
FROM (VALUES(1, cast(NULL as int), 100),
(2, NULL, NULL)
) s(id,a,b);
Here is a db<>fiddle.
Actually creating tables with the values suggests that columns with only NULL values have exactly the same type as columns with explicit numbers. That suggests that the original code should work. But an explicit cast also fixes the problem. Very strange.
This is really, really strange. Consider the following:
select v.*, checksum(a, b), checksum(c,b)
FROM (VALUES(1, NULL, 100, NULL),
(2, 1, 2, 1.0)
) v(id, a, b, c);
The change in type for "c" affects the checksum() for the second row, but not for the first.
This is my conclusion. When all the values in a column are NULL, then binary_checksum() is aware of this and the column falls into the category of "noncomparable data type". The checksum is then based on the remaining columns.
You can validate this by seeing the error when you run:
select v.*, binary_checksum(a)
FROM (VALUES(1, NULL, 100, NULL),
(2, NULL, 2, 1.0)
) v( id,a, b, c);
It complains:
Argument data type NULL is invalid for argument 1 of checksum function.
Ironically, this is not a problem if you save the results into a table and use binary_checksum(). The issue appears to be some interaction with values() and data types -- but something that is not immediately obvious in the information_schema.columns table.
The happyish news is that the code should work on tables, even if it does not work on values() generated derived tables -- as this SQL Fiddle demonstrates.
I also learned that a column filled with NULLs really is typeless. The assignment of the int data type in a select into seems to happen when the table is being defined. The "typeless" type is converted to an int.
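For instance, a sketch of that materialize-first workaround (same sample rows; #t is just a scratch temp table name):
-- persist the derived rows first, so the all-NULL column gets a concrete (int) type
SELECT id, a, b
INTO #t
FROM (VALUES(1, NULL, 100),
            (2, NULL, NULL)) s(id,a,b);
-- now both columns take part in the checksum
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM #t;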
For the literal NULL without the CAST (and without any typed values in the column) it entirely ignores it and just gives you the same result as BINARY_CHECKSUM(b).
This seems to happen very early on. The initial tree representation output from
SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
(2, NULL, NULL)) s(id,a,b)
OPTION (RECOMPILE, QUERYTRACEON 8605, QUERYTRACEON 3604);
already shows that it has decided to just use one column as input to the function:
ScaOp_Intrinsic binary_checksum
ScaOp_Identifier COL: Union1008
This compares with the following output for your first query
ScaOp_Intrinsic binary_checksum
ScaOp_Identifier COL: Union1011
ScaOp_Identifier COL: Union1010
If you try to get the BINARY_CHECKSUM with
SELECT *, BINARY_CHECKSUM(a) AS bc
FROM (VALUES(1, NULL, 100)) s(id,a,b)
It gives the error
Msg 8184, Level 16, State 1, Line 8 Error in binarychecksum. There are
no comparable columns in the binarychecksum input.
This is not the only place where an untyped NULL constant is treated differently from an explicitly typed one.
Another case is
SELECT COALESCE(CAST(NULL AS INT),CAST(NULL AS INT))
vs
SELECT COALESCE(NULL,NULL)
I'd err on the side of "glitch" in this case rather than "by design", though, as the columns from the derived table are supposed to be int before they get to the checksum function.
SELECT COALESCE(a,b)
FROM (VALUES(NULL, NULL)) s(a,b)
does work as expected, without this glitch.

PostgreSQL 9.3 - Compare two sets of data without duplicating values in first set

I have a group of tables that define some rules that need to be followed, for example:
CREATE TABLE foo.subrules (
subruleid SERIAL PRIMARY KEY,
ruleid INTEGER REFERENCES foo.rules(ruleid),
subrule INTEGER,
barid INTEGER REFERENCES foo.bars(barid)
);
INSERT INTO foo.subrules(ruleid,subrule,barid) VALUES
(1,1,1),
(1,1,2),
(1,2,2),
(1,2,3),
(1,2,4),
(1,3,3),
(1,3,4),
(1,3,5),
(1,3,6),
(1,3,7);
What this is defining is a set of "subrules" that need to be satisfied... if all "subrules" are satisfied then the rule is also satisfied.
In the above example, "subruleid" 1 can be satisfied with a "barid" value of 1 or 2.
Additionally, "subruleid" 2 can be satisfied with a "barid" value of 2, 3, or 4.
Likewise, "subruleid" 3 can be satisfied with a "barid" value of 3, 4, 5, 6, or 7.
I also have a data set that looks like this:
primarykey | resource | barid
------------|------------|------------
1 | A | 1
2 | B | 2
3 | C | 8
The tricky part is that once a "subrule" is satisfied with a "resource", that "resource" can't satisfy any other "subrule" (even if the same "barid" would satisfy the other "subrule")
So, what I need is to evaluate and return the following results:
ruleid | subrule | barid | primarykey | resource
------------|------------|------------|------------|------------
1 | 1 | 1 | 1 | A
1 | 1 | 2 | NULL | NULL
1 | 2 | 2 | 2 | B
1 | 2 | 3 | NULL | NULL
1 | 2 | 4 | NULL | NULL
1 | 3 | 3 | NULL | NULL
1 | 3 | 4 | NULL | NULL
1 | 3 | 5 | NULL | NULL
1 | 3 | 6 | NULL | NULL
1 | 3 | 7 | NULL | NULL
NULL | NULL | NULL | 3 | C
Interestingly, if "primarykey" 3 had a "barid" value of 2 (instead of 8) the results would be identical.
I have tried several methods, including a PL/pgSQL function that groups by "subruleid" with ARRAY_AGG(barid), builds an array from barid, and loops over it to check whether each element of the barid array is in the "subruleid" group, but it just doesn't feel right.
Is a more elegant or efficient option available?
The following fragment finds solutions, if there are any. The number of resources (three) is hardcoded. If only one solution is needed, some symmetry-breaker should be added.
If the number of resources is not bounded, I think there could be a solution by enumerating all possible tableaux (Hilbert? mixed-radix?) and selecting from them, after pruning the non-satisfying ones.
-- the data
CREATE TABLE subrules
( subruleid SERIAL PRIMARY KEY
, ruleid INTEGER -- REFERENCES foo.rules(ruleid),
, subrule INTEGER
, barid INTEGER -- REFERENCES foo.bars(barid)
);
INSERT INTO subrules(ruleid,subrule,barid) VALUES
(1,1,1), (1,1,2),
(1,2,2), (1,2,3), (1,2,4),
(1,3,3), (1,3,4), (1,3,5), (1,3,6), (1,3,7);
CREATE TABLE resources
( primarykey INTEGER NOT NULL PRIMARY KEY
, resrc varchar
, barid INTEGER NOT NULL
);
INSERT INTO resources(primarykey,resrc,barid) VALUES
(1, 'A', 1) ,(2, 'B', 2) ,(3, 'C', 8)
-- ################################
-- uncomment next line to find a (two!) solution(s)
-- ,(4, 'D', 7)
;
-- all matching pairs of subrules <--> resources
WITH pairs AS (
SELECT sr.subruleid, sr.ruleid, sr.subrule, sr.barid
, re.primarykey, re.resrc
FROM subrules sr
JOIN resources re ON re.barid = sr.barid
)
SELECT
p1.ruleid AS ru1 , p1.subrule AS sr1 , p1.resrc AS one
, p2.ruleid AS ru2 , p2.subrule AS sr2 , p2.resrc AS two
, p3.ruleid AS ru3 , p3.subrule AS sr3 , p3.resrc AS three
-- self-join the pairs, excluding the ones that
-- use the same subrule or resource
FROM pairs p1
JOIN pairs p2 ON p2.primarykey > p1.primarykey -- tie-breaker
JOIN pairs p3 ON p3.primarykey > p2.primarykey -- tie breaker
WHERE 1=1
AND p2.subruleid <> p1.subruleid
AND p2.subruleid <> p3.subruleid
AND p3.subruleid <> p1.subruleid
;
Result (after uncommenting the line with the missing resource):
ru1 | sr1 | one | ru2 | sr2 | two | ru3 | sr3 | three
-----+-----+-----+-----+-----+-----+-----+-----+-------
1 | 1 | A | 1 | 1 | B | 1 | 3 | D
1 | 1 | A | 1 | 2 | B | 1 | 3 | D
(2 rows)
The resources {A,B,C} could of course be hard-coded, but that would prevent the 'D' record (or any other) from serving as the missing link.
Since you are not clarifying the question, I am going with my own assumptions.
subrule numbers are ascending without gaps for each rule.
(subrule, barid) is UNIQUE in table subrules.
If there are multiple resources for the same barid, assignments are arbitrary among these peers.
As commented, the number of resources matches the number of subrules (which has no effect on my suggested solution).
The algorithm is as follows:
Pick the subrule with the smallest subrule number.
Assign a resource to the lowest barid possible (the first that has a matching resource), which consumes the resource.
After the first resource is matched, skip to the next higher subruleid and repeat 2.
Append all remaining resources after last subrule.
You can implement this with pure SQL using a recursive CTE:
WITH RECURSIVE cte AS ((
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN '{}'::int[]
ELSE ARRAY[r.resourceid] END AS consumed
FROM subrules s
LEFT JOIN resource r USING (barid)
WHERE s.ruleid = 1
ORDER BY s.subrule, r.barid, s.barid
LIMIT 1
)
UNION ALL (
SELECT s.*, r.resourceid, r.resource
, CASE WHEN r.resourceid IS NULL THEN c.consumed
ELSE c.consumed || r.resourceid END
FROM cte c
JOIN subrules s ON s.subrule = c.subrule + 1
LEFT JOIN resource r ON r.barid = s.barid
AND r.resourceid <> ALL (c.consumed)
ORDER BY r.barid, s.barid
LIMIT 1
))
SELECT ruleid, subrule, barid, resourceid, resource FROM cte
UNION ALL -- add unused rules
SELECT s.ruleid, s.subrule, s.barid, NULL, NULL
FROM subrules s
LEFT JOIN cte c USING (subruleid)
WHERE c.subruleid IS NULL
UNION ALL -- add unused resources
SELECT NULL, NULL, r.barid, r.resourceid, r.resource
FROM resource r
LEFT JOIN cte c USING (resourceid)
WHERE c.resourceid IS NULL
ORDER BY subrule, barid, resourceid;
Returns exactly the result you have been asking for.
SQL Fiddle.
Explain
It's basically an implementation of the algorithm laid out above.
Only take a single match on a single barid per subrule. Hence the LIMIT 1, which requires additional parentheses (see: "Sum results of a few queries and then find top 5 in SQL").
Collecting "consumed" resources in the array consumed and exclude them from repeated assignment with r.resourceid <> ALL (c.consumed). Note in particular how I avoid NULL values in the array, which would break the test.
The CTE only returns matched rows. Add rules and resources without match in the outer SELECT to get the complete result.
Or you open two cursors on the tables subrules and resource and implement the algorithm in any decent programming language (including PL/pgSQL).
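For illustration, a rough PL/pgSQL sketch of that procedural variant (untested; assign_resources is a hypothetical name, it assumes the subrules / resource tables and columns used in the recursive CTE above, and it only returns the matched pairs, like the CTE does):
CREATE OR REPLACE FUNCTION assign_resources(p_ruleid int)
  RETURNS TABLE (subruleid int, resourceid int) AS
$func$
DECLARE
   rec      record;
   consumed int[] := '{}';   -- resourceids already used up
BEGIN
   -- walk the subrules of the rule in ascending order of subrule number
   FOR rec IN
      SELECT DISTINCT subrule FROM subrules WHERE ruleid = p_ruleid ORDER BY subrule
   LOOP
      -- lowest barid of this subrule that still has a free matching resource
      SELECT s.subruleid, r.resourceid
      INTO   subruleid, resourceid
      FROM   subrules s
      JOIN   resource r ON r.barid = s.barid
      WHERE  s.ruleid  = p_ruleid
      AND    s.subrule = rec.subrule
      AND    r.resourceid <> ALL (consumed)
      ORDER  BY s.barid, r.resourceid
      LIMIT  1;

      IF FOUND THEN
         consumed := consumed || resourceid;   -- consume the resource
         RETURN NEXT;
      END IF;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
Call it as SELECT * FROM assign_resources(1); and append the unmatched subrules and resources afterwards, just like the outer UNION ALL does in the SQL version.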

Append a zero to value if necessary in SQL statement DB2

I have a complex SQL statement in which I need to match up two tables based on a join. The initial part of the complex query has a location number that is stored in one table as a SMALLINT, and the second table has the store number stored as a CHAR(4). I have been able to cast the SMALLINT to a CHAR(4) like this:
CAST(STR_NBR AS CHAR(4)) AND LOCN_NBR
The issue is that because the SMALLINT suppresses the leading '0', the join returns null values from the right-hand side of the LEFT OUTER JOIN.
Example:
Table set A (SMALLINT)              Table set B (CHAR(4))
|  96 |                             | 096 |
|  97 |                             | 097 |
|  99 |                             | 099 |
| 100 |  <- These return ->         | 100 |
| 101 |  <- These return ->         | 101 |
| 102 |  <- These return ->         | 102 |
I need to make it so that they all return, but since it is in a join statement, how do you append a zero to the beginning in certain conditions and not in others?
SELECT RIGHT('0000' || STR_NBR, 4)
FROM TABLE_A
Casting Table B's CHAR to SMALLINT would work as well:
SELECT ...
FROM TABLE_A A
JOIN TABLE_B B
ON A.num = CAST(B.txt AS SMALLINT)
Try LPAD function:
LPAD(col, 3, '0')
I was able to successfully match them and obtain a 3-digit location number at all times by doing the following:
STR_NBR was originally defined as a SmallINT(2)
LOCN_NO was originally defined as a Char(4)
SELECT ...
FROM TABLE_A AS A
JOIN TABLE_B AS B
ON CAST(SUBSTR(DIGITS(A.STR_NBR), 3, 3) AS CHAR(4)) = B.LOCN_NO
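For reference, DIGITS() on a SMALLINT yields a fixed-width, zero-padded 5-character string, so the SUBSTR(..., 3, 3) keeps the last three digits; a quick sanity check could look like:
-- DIGITS(96) -> '00096', SUBSTR('00096', 3, 3) -> '096'
SELECT DIGITS(CAST(96 AS SMALLINT))               AS padded,
       SUBSTR(DIGITS(CAST(96 AS SMALLINT)), 3, 3) AS three_digits
FROM SYSIBM.SYSDUMMY1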