Match and merge when comparing data in 4 tables - sql

I have a requirement to create a SQL Server user-defined function or stored procedure (either a normal or a table-valued function) that meets the requirements below:
The data across 4 tables (Table_A, Table_B, Table_C, Table_D) should be matched based on fixed attributes (Name in the example below).
If the data matches in all 4 tables, it gets the highest score and a UniqueID is created. For example, Match Type = ABCD.
If the data matches in a combination of 3 tables, then it gets some score and a different UniqueID. For example, Match Type = ABC, ABD, ACD, BCD.
If the data matches in a combination of 2 tables, then it gets some score and a different UniqueID. For example, Match Type = AB, AC, AD, BC, BD, CD.
Records that don't match get a 0 score and a separate UniqueID, and are stored in the same table.
Table_A
AID | Name | ZipCode
Table_B
BID | Name | ZipCode
Table_C
CID | Name | ZipCode
Table_D
DID | Name | ZipCode
Matching is on the Name and ZipCode attributes.
Final (match and merge) table:
UID | AID | BID | CID | DID | Match_Score
Please suggest how we can create a function/stored procedure for the above requirements. It would be better if the solution is robust and expandable, i.e. if one more table is added, the logic should work with minimal code changes.
I really appreciate your help in this case.
I can think of the approach below, but I'm not sure whether it can be coded:
ABCD (output where the records match in all 4 tables)
UNION ALL
ABC (this will run only on the records that are not part of the ABCD result)
UNION ALL
ACD (this will run only on the records that are not part of the above 2 results)
UNION ALL
and so on

Break it down into smaller sections using a temp table for each section, and then do your final merge.
In your final merge, rank them based on how many matches there are.
The typical MERGE syntax is as follows. Remember that MERGE can only have one target table, but multiple sources:
MERGE TOP (value) <target_table>
USING <table_source>
ON <merge_search_condition>
[ WHEN MATCHED [ AND <clause_search_condition> ]
THEN <merge_matched> ]
[ WHEN NOT MATCHED [ BY TARGET ] [ AND <clause_search_condition> ]
THEN <merge_not_matched> ]
[ WHEN NOT MATCHED BY SOURCE [ AND <clause_search_condition> ]
THEN <merge_matched> ]
[ <output_clause> ]
[ OPTION ( <query_hint> ) ]
;
As a simple sample, find names and join them... you can start with this and refine it to your needs, e.g. by using something like DECLARE @MyNumberOfMatchesVariable INT and updating it as you get matches:
select *
from (
    select someUniqueValueToMatch as Username from table1
    union
    select someUniqueValueToMatch from table2
    union
    -- ...
    select someUniqueValueToMatch from tableN
) distinct_usernames
left join table1 on table1.someUniqueValueToMatch = distinct_usernames.Username
left join table2 on table2.someUniqueValueToMatch = distinct_usernames.Username
-- ...
left join tableN on tableN.someUniqueValueToMatch = distinct_usernames.Username
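For what it's worth, here is a runnable sketch of that pattern using SQLite through Python's sqlite3 module, with two made-up tables (table1, table2); a NULL in the joined columns marks "no match in that table":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (someUniqueValueToMatch TEXT)")
cur.execute("CREATE TABLE table2 (someUniqueValueToMatch TEXT)")
cur.executemany("INSERT INTO table1 VALUES (?)", [("alice",), ("bob",)])
cur.executemany("INSERT INTO table2 VALUES (?)", [("alice",), ("carol",)])

# Build the distinct list of match keys, then LEFT JOIN each source
# table back so a NULL marks "no match in that table".
rows = cur.execute("""
    SELECT du.Username,
           table1.someUniqueValueToMatch AS in_t1,
           table2.someUniqueValueToMatch AS in_t2
    FROM (
        SELECT someUniqueValueToMatch AS Username FROM table1
        UNION
        SELECT someUniqueValueToMatch FROM table2
    ) AS du
    LEFT JOIN table1 ON table1.someUniqueValueToMatch = du.Username
    LEFT JOIN table2 ON table2.someUniqueValueToMatch = du.Username
    ORDER BY du.Username
""").fetchall()
print(rows)
```

Counting the non-NULL columns per row then gives the number of matches.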

Joining the 4 tables using FULL JOINs will give you all the various combinations:
SELECT AID, BID, CID, DID,
       CASE WHEN AID IS NULL THEN 0 ELSE 1 END
     + CASE WHEN BID IS NULL THEN 0 ELSE 1 END
     + CASE WHEN CID IS NULL THEN 0 ELSE 1 END
     + CASE WHEN DID IS NULL THEN 0 ELSE 1 END AS Match_Score /*,
       CASE WHEN AID IS NULL THEN '' ELSE 'A' END
     + CASE WHEN BID IS NULL THEN '' ELSE 'B' END
     + CASE WHEN CID IS NULL THEN '' ELSE 'C' END
     + CASE WHEN DID IS NULL THEN '' ELSE 'D' END AS Match_Type */
FROM Table_A a
FULL JOIN Table_B b ON a.Name = b.Name AND a.ZipCode = b.ZipCode
FULL JOIN Table_C c ON (a.Name = c.Name AND a.ZipCode = c.ZipCode) OR (b.Name = c.Name AND b.ZipCode = c.ZipCode)
FULL JOIN Table_D d ON (a.Name = d.Name AND a.ZipCode = d.ZipCode) OR (b.Name = d.Name AND b.ZipCode = d.ZipCode) OR (c.Name = d.Name AND c.ZipCode = d.ZipCode)
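As an aside on the "minimal code changes" requirement: instead of FULL JOINs, you can tag each row with its source table, UNION ALL everything, and group on the match key; adding a fifth table is then one more UNION ALL branch. A minimal sketch of that idea, using SQLite through Python's sqlite3 with hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample data: each table has (id, Name, ZipCode).
for t in ("Table_A", "Table_B", "Table_C", "Table_D"):
    cur.execute(f"CREATE TABLE {t} (id INTEGER, Name TEXT, ZipCode TEXT)")

cur.executemany("INSERT INTO Table_A VALUES (?,?,?)",
                [(1, "Alice", "10001"), (2, "Bob", "20002")])
cur.executemany("INSERT INTO Table_B VALUES (?,?,?)", [(1, "Alice", "10001")])
cur.executemany("INSERT INTO Table_C VALUES (?,?,?)",
                [(1, "Alice", "10001"), (2, "Bob", "20002")])
cur.executemany("INSERT INTO Table_D VALUES (?,?,?)", [(1, "Alice", "10001")])

# Tag every row with its source table, then group on the match key.
# A fifth table would mean adding just one more UNION ALL branch.
rows = cur.execute("""
    SELECT Name, ZipCode,
           COUNT(DISTINCT src)        AS match_score,
           GROUP_CONCAT(DISTINCT src) AS match_type
    FROM (
        SELECT 'A' AS src, Name, ZipCode FROM Table_A
        UNION ALL SELECT 'B', Name, ZipCode FROM Table_B
        UNION ALL SELECT 'C', Name, ZipCode FROM Table_C
        UNION ALL SELECT 'D', Name, ZipCode FROM Table_D
    ) AS u
    GROUP BY Name, ZipCode
    ORDER BY Name
""").fetchall()
print(rows)
```

Here "Alice" appears in all four tables (score 4, type ABCD) and "Bob" only in A and C (score 2).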


how to select columns by row and value in postgres?

I have a table like this; the values are all booleans, except for col1, which holds the row names (the primary key):
col1 | col2 | col3 | col4 | col5 ...
--------------------------------
row1 | f | t | t | t
row2 | f | f | f | t
row3 | t | f | t | f
:
And I want a query like this: select all columns for row3 where value=t, or, perhaps more precisely: select all column-names for row3 where value=t.
In this example the answer should be:
col2
col4
Because I know all the column names, I can do it by looping in the caller, e.g. by calling the Postgres client from bash and iterating over the columns for each row I'm interested in. But is there a solution in Postgres SQL itself?
That is not really how SQL works. SQL works on rows, not columns.
What this suggests is that your data structure is wrong. If, instead, you stored the values in rows like this:
col1 name value
row1 'col1' value
. . .
Then you would just do:
select name
from t
group by name
having count(*) = sum(case when value then 1 else 0 end);
With your structure, you need to do a separate subquery for each column. Something like this:
select 'col2'
from yourtable
having count(*) = sum(case when col2 then 1 else 0 end)
union all
select 'col3'
from yourtable
having count(*) = sum(case when col3 then 1 else 0 end)
union all
. . .
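To illustrate why the restructured (row-per-cell) layout makes the original question trivial, here is a sketch using SQLite through Python's sqlite3, loading the example table in long form; the table and column names are just illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Long ("row-per-cell") form of the example table: (row name, column name, value)
cur.execute("CREATE TABLE t (col1 TEXT, name TEXT, value INTEGER)")
cur.executemany("INSERT INTO t VALUES (?,?,?)", [
    ("row1", "col2", 0), ("row1", "col3", 1), ("row1", "col4", 1), ("row1", "col5", 1),
    ("row2", "col2", 0), ("row2", "col3", 0), ("row2", "col4", 0), ("row2", "col5", 1),
    ("row3", "col2", 1), ("row3", "col3", 0), ("row3", "col4", 1), ("row3", "col5", 0),
])
# With this layout, "column names for row3 where value = t" is a plain filter:
cols = [r[0] for r in cur.execute(
    "SELECT name FROM t WHERE col1 = 'row3' AND value ORDER BY name")]
print(cols)
```

This returns exactly the col2/col4 answer the question asks for, with no per-column subqueries.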
I'm not trying to answer your question here, but want to tell you what database structure would be appropriate for the task described.
You have a book table with a book id. Each record contains one book.
You have a word table with a word id. Each record contains one word.
Now you want to have a list of all existing book-word combinations.
The table you would create for this relation is called a bridge table. One book can contain many words; one word can be contained in many books; an n:m relation. The table has two columns: the book id and the word id. The two combined are the table's primary key (a composite key). Each record contains one existing combination of book and word.
Here are some examples how to use this table:
To find all words contained in a book:
select word
from words
where word_id in
(
select word_id
from book_word
where book_id =
(
select book_id
from books
where name = 'Peter Pan'
)
);
(That's just an example; the same can be got with joins instead of subqueries.)
To select words that occur in two particular books:
select word
from words
where word_id in
(
select word_id
from book_word
where book_id in
(
select book_id
from books
where name in ('Peter Pan', 'Treasure Island')
)
group by word_id
having count(*) = 2
);
To find words that occur in only one book:
select w.word, min(b.name) as book_name
from words w
join book_word bw on bw.word_id = w.word_id
join books b on b.book_id = bw.book_id
group by w.word_id
having count(*) = 1;
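A self-contained sketch of the bridge table and the two queries above, using SQLite through Python's sqlite3 with made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE books (book_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE book_word (
    book_id INTEGER REFERENCES books,
    word_id INTEGER REFERENCES words,
    PRIMARY KEY (book_id, word_id)   -- composite key: one row per combination
);
INSERT INTO books VALUES (1, 'Peter Pan'), (2, 'Treasure Island');
INSERT INTO words VALUES (1, 'pirate'), (2, 'fairy'), (3, 'island');
INSERT INTO book_word VALUES (1, 1), (1, 2), (2, 1), (2, 3);
""")

# Words occurring in both particular books (the HAVING COUNT(*) = 2 query):
both = [r[0] for r in cur.execute("""
    SELECT word FROM words WHERE word_id IN (
        SELECT word_id FROM book_word
        WHERE book_id IN (SELECT book_id FROM books
                          WHERE name IN ('Peter Pan', 'Treasure Island'))
        GROUP BY word_id HAVING COUNT(*) = 2)
""")]

# Words occurring in only one book, with that book's name:
only_one = [tuple(r) for r in cur.execute("""
    SELECT w.word, MIN(b.name) FROM words w
    JOIN book_word bw ON bw.word_id = w.word_id
    JOIN books b ON b.book_id = bw.book_id
    GROUP BY w.word_id HAVING COUNT(*) = 1
""")]
print(both, only_one)
```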

Determine source on COALESCE fields

I have two tables which are identical in structure but belong to different schemas (schemas A and B). All rows in question will always appear in A.table but may or may not appear in B.table. B.table is essentially an override for the defaults in A.table.
As such my query uses a COALESCE on each field similar to:
SELECT COALESCE(B.id, A.id) as id,
COALESCE(B.foo, A.foo) as foo,
COALESCE(B.bar, A.bar) as bar
FROM A.table LEFT JOIN B.table ON (A.id = B.id)
WHERE A.id in (1, 2, 3)
This works great, but I also want to add the source of the data. In the example above, assuming id=2 existed in B.table but not 1 or 3, I would want to include some indication that A is the source for 1 and 3 and B is the source for 2.
So the data might look like the following
+---------------------------------+
| id | foo | bar | source |
+---------------------------------+
| 1 | a | b | A |
| 2 | c | d | B |
| 3 | e | f | A |
+---------------------------------+
I don't really care what the value of source is as long as I can distinguish A from B.
I am no pgsql expert (not by a long shot). I have tinkered around with EXISTS and a subquery, but have had no luck so far.
As records showing the default value (from A.table) have NULLs for B.id, all you need is to add this column specification to your query:
CASE WHEN B.id IS NULL THEN 'A' ELSE 'B' END AS Source
The USING clause would simplify the query you have:
SELECT id
, COALESCE(B.foo, A.foo) AS foo
, COALESCE(B.bar, A.bar) AS bar
, CASE WHEN b.id IS NULL THEN 'A' ELSE 'B' END AS source -- like @Terje provided
FROM a
LEFT JOIN b USING (id)
WHERE a.id IN (1, 2, 3);
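To illustrate, here is the LEFT JOIN plus CASE variant run against the question's sample data (a sketch in SQLite through Python's sqlite3; SQLite has no schemas, so a and b are plain tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, foo TEXT, bar TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY, foo TEXT, bar TEXT);
INSERT INTO a VALUES (1, 'a', 'b'), (2, 'x', 'y'), (3, 'e', 'f');
INSERT INTO b VALUES (2, 'c', 'd');   -- override exists for id = 2 only
""")
# b.id is NULL exactly when no override row exists, so it doubles as the
# source indicator.
rows = cur.execute("""
    SELECT a.id,
           COALESCE(b.foo, a.foo) AS foo,
           COALESCE(b.bar, a.bar) AS bar,
           CASE WHEN b.id IS NULL THEN 'A' ELSE 'B' END AS source
    FROM a LEFT JOIN b ON b.id = a.id
    WHERE a.id IN (1, 2, 3)
    ORDER BY a.id
""").fetchall()
print(rows)
```

This reproduces the desired result table: ids 1 and 3 sourced from A, id 2 from B.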
But typically, this alternative query should serve you better:
SELECT x.* -- or list columns of your choice
FROM (VALUES (1), (2), (3)) t (id)
, LATERAL (
SELECT *, 'B' AS source FROM b WHERE id = t.id
UNION ALL
SELECT *, 'A' FROM a WHERE id = t.id
LIMIT 1
) x
ORDER BY x.id;
Advantages:
You don't have to add another COALESCE construct for every column you want to add to the result.
The same query works for any number of columns in a and b.
The query even works if the column names are not identical. Only number and data types of columns must match.
Of course, you can always list selected, compatible columns as well:
SELECT * -- or list columns of your choice
FROM (VALUES (1), (2), (3)) t (id)
, LATERAL (
SELECT foo, bar, 'B' AS source FROM b WHERE id = t.id
UNION ALL
SELECT foo2, bar17, 'A' FROM a WHERE id = t.id
LIMIT 1
) x
ORDER BY x.id;
The first SELECT determines names, data types and number of columns.
This query doesn't break if columns in b are not defined NOT NULL.
COALESCE cannot tell the difference between b.foo IS NULL and no row with matching id in b. So the source of any result column (except id) can still be 'A', even if the result row says 'B' - if any relevant column in b can be NULL.
My alternative returns all values from b if the row exists - including NULL values. So the result can be different if columns in b can be NULL. It depends on your requirements which behavior is desirable.
Either query assumes that id is defined as primary key (so exactly 1 or 0 rows per given id value).
Related:
Select first record if none match
What is the difference between LATERAL and a subquery in PostgreSQL?

Derive groups of records that match over multiple columns, but where some column values might be NULL

I would like an efficient means of deriving groups of matching records across multiple fields. Let's say I have the following table:
CREATE TABLE cust
(
id INT NOT NULL,
class VARCHAR(1) NULL,
cust_type VARCHAR(1) NULL,
terms VARCHAR(1) NULL
);
INSERT INTO cust
VALUES
(1,'A',NULL,'C'),
(2,NULL,'B','C'),
(3,'A','B',NULL),
(4,NULL,NULL,'C'),
(5,'D','E',NULL),
(6,'D',NULL,NULL);
What I am looking to get is the set of IDs for which matching values unify a set of records over the three fields (class, cust_type and terms), so that I can apply a unique ID to the group.
In the example, records 1-4 constitute one match group over the three fields, while records 5-6 form a separate match.
The following does the job:
SELECT
DISTINCT
a.id,
DENSE_RANK() OVER (ORDER BY max(b.class),max(b.cust_type),max(b.terms)) AS match_group
FROM cust AS a
INNER JOIN
cust AS b
ON
a.class = b.class
OR a.cust_type = b.cust_type
OR a.terms = b.terms
GROUP BY a.id
ORDER BY a.id
id match_group
-- -----------
1 1
2 1
3 1
4 1
5 2
6 2
**But, is there a better way?** Running this query on a table of over a million rows is painful...
As Graham pointed out in the comments, the above query doesn't satisfy the requirements if another record is added that would group all the records together.
The following values should be grouped together in one group:
INSERT INTO cust
VALUES
(1,'A',NULL,'C'),
(2,NULL,'B','C'),
(3,'A','B',NULL),
(4,NULL,NULL,'C'),
(5,'D','E',NULL),
(6,'D',NULL,NULL),
(7,'D','B','C');
Would yield:
id match_group
-- -----------
1 1
2 1
3 1
4 1
5 1
6 1
...because the class value of D groups records 5, 6 and 7. The terms value of C matches records 1, 2 and 4 to that group, and the cust_type value of B (or the class value of A) pulls in record 3.
Hopefully that all makes sense.
I don't think you can do this with a (recursive) Select.
I did something similar (trying to identify unique households) using a temporary table and repeated updates, with the following logic:
For each of class, cust_type and terms, get the minimum id per value and update the temp table:
update temp
from
(
SELECT
class, -- similar for cust_type & terms
min(id) as min_id
from temp
group by class
) x
set id = min_id
where temp.class = x.class
and temp.id <> x.min_id
;
Repeat all three updates until none of them updates a row.
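A sketch of that repeat-until-fixpoint logic, using SQLite through Python's sqlite3 on the seven-row example (the loop stops when a full pass over the three columns updates nothing; with record 7 present, everything collapses into a single group):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE temp (id INT, class TEXT, cust_type TEXT, terms TEXT);
INSERT INTO temp VALUES
 (1,'A',NULL,'C'), (2,NULL,'B','C'), (3,'A','B',NULL),
 (4,NULL,NULL,'C'), (5,'D','E',NULL), (6,'D',NULL,NULL), (7,'D','B','C');
""")

# Repeatedly pull each row's id down to the minimum id sharing any
# attribute value; ids only ever decrease, so this terminates at the
# transitive closure of the match relation.
while True:
    changed = 0
    for col in ("class", "cust_type", "terms"):
        cur.execute(f"""
            UPDATE temp
            SET id = (SELECT MIN(t2.id) FROM temp t2 WHERE t2.{col} = temp.{col})
            WHERE {col} IS NOT NULL
              AND id > (SELECT MIN(t2.id) FROM temp t2 WHERE t2.{col} = temp.{col})
        """)
        changed += cur.rowcount
    if changed == 0:
        break

groups = sorted({r[0] for r in cur.execute("SELECT id FROM temp")})
print(groups)
```

In a real run you would of course keep the original id in a separate column and update a group-id column instead.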

Comparing two tables and getting the values that don't match

I have two tables with articles:
table 1, article, and table 2, articlefm.
Both tables have a field artnr.
'Table 1' has 2192 artnr values and 'table 2' has 2195.
In my query I want to find the artnr of the 3 articles that are not matched.
Since 'table 2' has more articles than 'table 1', I need a list of those artnr.
How can I do this?
You can do this using a FULL JOIN:
SELECT COALESCE(t1.Artnr, t2.Artnr) AS Artnr,
CASE WHEN t1.Artnr IS NULL THEN 'Table1' ELSE 'Table2' END AS MissingFrom
FROM Table1 AS t1
FULL JOIN Table2 AS t2
ON t1.Artnr = t2.Artnr
WHERE t1.Artnr IS NULL
OR t2.Artnr IS NULL;
Note that just because there is a difference in count of 3, it does not necessarily mean that only 3 records in one table are missing from the other. Imagine the following:
Table1 Table2
------ -------
1 2
2 4
3 6
4
The difference in count is 1, but there are actually 2 records present in table1 that aren't in table2, and 1 in table2 that isn't in table1. Using the above full join method you would get a result like:
Artnr | MissingFrom
------+-------------
1 | Table1
3 | Table1
6 | Table2
In most databases you can use except (SQL standard) or minus (Oracle specific):
select artnr
from articlefm -- table 2
except
select artnr
from article -- table 1
Alternatively, you could try NOT IN:
select artnr
from articlefm -- table 2
where artnr not in
( select artnr
from article -- table 1
)
This will give you the article numbers that exist in table 2 but not in table 1. (Beware that NOT IN returns no rows at all if article.artnr contains a NULL.)
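Running EXCEPT in both directions on data like the count-difference example above (a sketch using SQLite through Python's sqlite3) shows both kinds of mismatch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE article (artnr INTEGER)")
cur.execute("CREATE TABLE articlefm (artnr INTEGER)")
cur.executemany("INSERT INTO article VALUES (?)", [(1,), (2,), (3,), (4,)])
cur.executemany("INSERT INTO articlefm VALUES (?)", [(2,), (4,), (6,)])

# EXCEPT in each direction lists what is missing from the other table.
missing_from_fm = [r[0] for r in cur.execute(
    "SELECT artnr FROM article EXCEPT SELECT artnr FROM articlefm ORDER BY artnr")]
missing_from_article = [r[0] for r in cur.execute(
    "SELECT artnr FROM articlefm EXCEPT SELECT artnr FROM article ORDER BY artnr")]
print(missing_from_fm, missing_from_article)
```

A count difference of 1 here hides three actual mismatches, which is exactly the caveat about the FULL JOIN example.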

Copy column from one database to another and insert data depending on a condition in SQL Server

I have two databases on the same SQL Server. The first is named Aa, with a table Models that has two columns: Id and Desc. The second is Bb, with a table ListOfModels that has three columns: Id, MachineTypeId, ModelName. I need to copy all Desc values from the Aa database into ModelName in Bb, inserting 1 into MachineTypeId if Desc starts with "K" and 2 otherwise.
Can you please help me write the script for this?
The question is unclear as to whether you want to insert new rows into the table or just update matching values.
If you actually want to insert records:
insert into bb..ListOfModels(MachineTypeId, ModelName)
select (case when m.[Desc] like 'K%' then 1 else 2 end), m.[Desc]
from aa..Models m;
If you want to update the records based on matching by id:
update lom
set ModelName = m.[Desc],
MachineTypeId = (case when m.[Desc] like 'K%' then 1 else 2 end)
from bb..ListOfModels lom join
aa..Models m
on lom.id = m.id;
By the way, desc is a lousy name for a column, because it is a reserved word in SQL.
Use a case statement: http://msdn.microsoft.com/en-us/library/ms181765.aspx
CASE SUBSTRING([Desc], 1, 1) WHEN 'K' THEN 1 ELSE 2 END
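An end-to-end sketch of the insert variant, using SQLite through Python's sqlite3 with ATTACH standing in for the second database (the column is named Descr here because, as noted above, desc is a reserved word):

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # plays the role of database Bb
cur = conn.cursor()
cur.execute("ATTACH ':memory:' AS aa")    # plays the role of database Aa
cur.executescript("""
CREATE TABLE aa.Models (Id INTEGER, Descr TEXT);
CREATE TABLE ListOfModels (Id INTEGER PRIMARY KEY,
                           MachineTypeId INTEGER, ModelName TEXT);
INSERT INTO aa.Models VALUES (1, 'K100'), (2, 'X200'), (3, 'K300');
""")

# Copy Descr into ModelName, deriving MachineTypeId from the first letter.
cur.execute("""
    INSERT INTO ListOfModels (MachineTypeId, ModelName)
    SELECT CASE WHEN Descr LIKE 'K%' THEN 1 ELSE 2 END, Descr
    FROM aa.Models
    ORDER BY Id
""")
rows = cur.execute(
    "SELECT MachineTypeId, ModelName FROM ListOfModels ORDER BY Id").fetchall()
print(rows)
```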