Combining four tables in SQL Server - sql

I have four tables: Table A, Table B, Table C and Table D. The schemas of all four tables are identical. I need to union these four tables in the following way:
If a record is present in Table A then that is considered in the output table.
If a record is present in Table B then it is considered in the output table ONLY if it is not present in Table A.
If a record is present in Table C then it is considered ONLY if it is not present in Table A and Table B.
If a record is present in Table D then it is considered ONLY if it is not present in Table A, Table B, and Table C.
Note -
Every table has a column which identifies the table itself for every record (I don't know if this is of any importance)
Records are identified based on a particular column - Column X which is not unique even within each table

You could do something like this (only two cases are shown, but you should see how to extend it):
WITH CTE1 AS
(
SELECT 't1' as Source, X, Y
FROM t1
UNION ALL
SELECT 't2' as Source, X, Y
FROM t2
), CTE2 AS
(
SELECT *,
RANK() OVER (PARTITION BY X
ORDER BY CASE Source
WHEN 't1' THEN 1
WHEN 't2' THEN 2
END) As RN
FROM CTE1
)
SELECT X,Y
FROM CTE2
WHERE RN=1
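To see how the ranking idea extends to all four tables, here is a minimal sketch run against an in-memory SQLite database from Python. The table names t1..t4, the sample rows, and the numeric Priority column (used instead of the CASE on a Source string) are assumptions made for illustration, and SQLite is only a stand-in for SQL Server. It needs an SQLite build with window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (X INTEGER, Y TEXT);
    CREATE TABLE t2 (X INTEGER, Y TEXT);
    CREATE TABLE t3 (X INTEGER, Y TEXT);
    CREATE TABLE t4 (X INTEGER, Y TEXT);
    -- X is deliberately not unique inside t1
    INSERT INTO t1 VALUES (1, 'a1'), (1, 'a1-dup');
    INSERT INTO t2 VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO t3 VALUES (2, 'c2'), (3, 'c3');
    INSERT INTO t4 VALUES (3, 'd3'), (4, 'd4');
""")

ranked_query = """
WITH CTE1 AS (
    -- Priority is a hypothetical column: lower number = higher precedence
    SELECT 1 AS Priority, X, Y FROM t1
    UNION ALL SELECT 2, X, Y FROM t2
    UNION ALL SELECT 3, X, Y FROM t3
    UNION ALL SELECT 4, X, Y FROM t4
), CTE2 AS (
    -- RANK (not ROW_NUMBER) keeps every row of the winning table for each X
    SELECT *, RANK() OVER (PARTITION BY X ORDER BY Priority) AS RN
    FROM CTE1
)
SELECT X, Y FROM CTE2 WHERE RN = 1 ORDER BY X, Y
"""
ranked_rows = conn.execute(ranked_query).fetchall()
print(ranked_rows)
```

Because RANK gives ties the same rank, both t1 rows for X = 1 survive, which matters here since Column X is not unique within a table.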

I would be inclined to do this using not exists:
select a.*
from a
union all
select b.*
from b
where not exists (select 1 from a where a.x = b.x)
union all
select c.*
from c
where not exists (select 1 from a where a.x = c.x) and
not exists (select 1 from b where b.x = c.x)
union all
select d.*
from d
where not exists (select 1 from a where a.x = d.x) and
not exists (select 1 from b where b.x = d.x) and
not exists (select 1 from c where c.x = d.x);
If you have an index on the x column in each table, then this should be the fastest method.
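As a quick sanity check of the not exists version, the sketch below runs the same query shape on a few made-up rows in an in-memory SQLite database. The column y, the index names, and the sample data are assumptions, and SQLite stands in for SQL Server here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, y TEXT);
    CREATE TABLE b (x INTEGER, y TEXT);
    CREATE TABLE c (x INTEGER, y TEXT);
    CREATE TABLE d (x INTEGER, y TEXT);
    -- indexes on x, as suggested above (names are illustrative)
    CREATE INDEX a_x ON a (x);
    CREATE INDEX b_x ON b (x);
    CREATE INDEX c_x ON c (x);
    INSERT INTO a VALUES (1, 'a1'), (1, 'a1-dup');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO c VALUES (2, 'c2'), (3, 'c3');
    INSERT INTO d VALUES (3, 'd3'), (4, 'd4');
""")

precedence_query = """
SELECT a.* FROM a
UNION ALL
SELECT b.* FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.x = b.x)
UNION ALL
SELECT c.* FROM c
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.x = c.x)
  AND NOT EXISTS (SELECT 1 FROM b WHERE b.x = c.x)
UNION ALL
SELECT d.* FROM d
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.x = d.x)
  AND NOT EXISTS (SELECT 1 FROM b WHERE b.x = d.x)
  AND NOT EXISTS (SELECT 1 FROM c WHERE c.x = d.x)
ORDER BY 1, 2
"""
precedence_rows = conn.execute(precedence_query).fetchall()
print(precedence_rows)
```

Each x value is taken from the highest-precedence table that contains it, and duplicates of x inside that table are kept.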

This will work as long as there are no NULL columns, or, where a column is NULL for a record in a table with higher precedence, you can assume the same column will also be NULL in the tables with lower precedence.
SELECT coalesce(a.column1, b.column1, c.column1, d.column1) column1
,coalesce(a.column2, b.column2, c.column2, d.column2) column2
,coalesce(a.column3, b.column3, c.column3, d.column3) column3
--...
,coalesce(a.columnN, b.columnN, c.columnN, d.columnN) columnN
FROM TableA a
FULL JOIN TableB b on b.ColumnX = a.ColumnX
FULL JOIN TableC c on c.ColumnX = a.ColumnX or c.ColumnX = b.ColumnX
FULL JOIN TableD d on d.ColumnX = a.ColumnX or d.ColumnX = b.ColumnX or d.ColumnX = c.ColumnX
If the NULL values matter, you can switch to a more-complicated (and likely slower) CASE version:
CASE WHEN a.columnX IS NOT NULL THEN a.column1
WHEN b.columnX IS NOT NULL THEN b.column1
WHEN c.columnX IS NOT NULL THEN c.column1
WHEN d.columnX IS NOT NULL THEN d.column1 END column1
Of course, you can mix and match, so columns that are not nullable can use the former syntax, and columns where NULL values matter use the latter.
Hopefully the purpose of this is to fix the broken schema and put this data all in the same table, where it belongs.

This might seem stupid, but if, by any chance, you can leave out the table-identifying column and you also want to eliminate duplicate records (from within one table) too then the most straightforward answer would be
select <all columns without table identifier> from tableA
union
select <all columns without table identifier> from tableB
union
select <all columns without table identifier> from tableC
...
This is exactly what union was designed to do: add rows only if they do not already exist.

Related

Hive join tables and keep only 1 column

I have below table join and noticed that Hive keeps two copies of the pk column - one from table b and one from table c. Is there a way to keep only 1 of those columns?
I can always replace select * with an explicit select column1, column2, etc., but that won't be too efficient.
with a as (
select
*
from table1 b left join table2 c
on b.pk = c.pk
)
select
*
from a;
Update 1:
Is it possible to alias many columns at once?
For example, the line below works:
select b.pk as duplicate_pk
but is there a way to do something like
select b.* as table2
to add the text table2 before all the column names of table b?
Not sure if you already tried this but you can choose what to select using either
b.* to select cols of only table1
c.* to select cols of only table2
Example:
with a as (
select
b.*
from table1 b left join table2 c
on b.pk = c.pk
)
select
*
from a;
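To illustrate the difference in the resulting column list, here is a small sketch using an in-memory SQLite database from Python as a stand-in (it is not Hive, so treat it only as an illustration of the idea). The columns col1 and col2 and the sample rows are assumptions. It also shows a variant that combines b.* with one explicitly named column from c, which keeps a single pk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (pk INTEGER, col1 TEXT);
    CREATE TABLE table2 (pk INTEGER, col2 TEXT);
    INSERT INTO table1 VALUES (1, 'x'), (2, 'y');
    INSERT INTO table2 VALUES (1, 'p');
""")

def column_names(sql):
    """Return the result column names of a query."""
    cur = conn.execute(sql)
    return [d[0] for d in cur.description]

# select * over the join carries pk from both sides
star_cols = column_names(
    "SELECT * FROM table1 b LEFT JOIN table2 c ON b.pk = c.pk")

# b.* plus the wanted columns of c keeps a single pk
picked_cols = column_names(
    "SELECT b.*, c.col2 FROM table1 b LEFT JOIN table2 c ON b.pk = c.pk")

print(star_cols)
print(picked_cols)
```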

How to join tables based on certain condition? SQL

There are three tables A,B,C
Table A has columns [ID], [flag], [many other columns]
Table B has columns [ID], [column subset of Table A]
Table C has columns [ID], [same column subset as Table B (thus also a subset of Table A), however with different values]
I want to join Table A & Table B if Flag = '1', and want to join Table A & Table C if Flag ='2'
Could you help me work out how to achieve this?
Many thanks!
You're looking for a UNION.
SELECT
<interesting columns>
FROM
A
JOIN
B
ON A.ID = B.ID
AND A.Flag = 1
UNION ALL
SELECT
<exactly the same interesting columns>
FROM
A
JOIN
C
ON A.ID = C.ID
AND A.Flag = 2
If the flag is really a string column, put the single quotes back. If it's numeric, leave them out.
Since the flag field in A should effectively eliminate duplicates between the result sets, I opted for UNION ALL, which is more efficient than UNION because UNION will run a DISTINCT under the covers, which in this case is likely unnecessary.
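Here is a minimal runnable sketch of that UNION ALL, using an in-memory SQLite database from Python. The column val (standing in for the shared column subset), the numeric Flag, and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER, Flag INTEGER, other TEXT);
    CREATE TABLE B (ID INTEGER, val TEXT);
    CREATE TABLE C (ID INTEGER, val TEXT);
    INSERT INTO A VALUES (1, 1, 'o1'), (2, 2, 'o2'), (3, 1, 'o3');
    INSERT INTO B VALUES (1, 'b1'), (2, 'b2'), (3, 'b3');
    INSERT INTO C VALUES (1, 'c1'), (2, 'c2'), (3, 'c3');
""")

flag_query = """
SELECT A.ID, A.Flag, B.val
FROM A JOIN B ON A.ID = B.ID AND A.Flag = 1
UNION ALL
SELECT A.ID, A.Flag, C.val
FROM A JOIN C ON A.ID = C.ID AND A.Flag = 2
ORDER BY 1
"""
flag_rows = conn.execute(flag_query).fetchall()
print(flag_rows)
```

Rows with Flag = 1 pick up their value from B and rows with Flag = 2 from C, so no row of A can appear in both halves.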

Fetching fields with specific criteria - Oracle

I am trying to extract particular data from 2 tables based on specific criteria. But the result is not as expected. Can someone please help?
Criteria:
Need to fetch id pairs whose type is A alone.
Tables:
Table A
ID1 ID2
579643307310619501 644543316683180704
296151129721950503 328945291791563504
Table B
ID TYPE
579643307310619501 A
579643307310619501 B
579643307310619501 C
644543316683180704 A
296151129721950503 A
328945291791563504 A
Expected Result:
ID1 ID2
296151129721950503 328945291791563504
(Since only this pair is of type A alone, individually)
Note: The IDs, ID1 and ID2 both must be present in ID field of Table B.
What I've tried:
SELECT id1, id2
FROM A
JOIN B ON A.id1 = B.id
WHERE A.id1 IN (SELECT id FROM B)
AND A.id2 IN (SELECT id FROM B)
AND B.type='A'
GROUP BY id1, id2
HAVING count(*)=1;
In the approach below, I use a CTE to first identify all ID values having exclusively the 'A' type. Then I join TableA to this CTE, twice, to filter off any records either of whose ID1 or ID2 values are not in the exclusively 'A' type list.
WITH cte (ID) AS (
SELECT ID
FROM TableB
GROUP BY ID
HAVING SUM(CASE WHEN TYPE <> 'A' THEN 1 ELSE 0 END) = 0
)
SELECT a.ID1, a.ID2
FROM TableA a
INNER JOIN cte t1
ON a.ID1 = t1.ID
INNER JOIN cte t2
ON a.ID2 = t2.ID;
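For anyone who wants to try the query without a database server, this sketch loads the sample data from the question into an in-memory SQLite database from Python and runs the same CTE. SQLite is only a stand-in here, not Oracle or SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID1 INTEGER, ID2 INTEGER);
    CREATE TABLE TableB (ID INTEGER, TYPE TEXT);
    INSERT INTO TableA VALUES
        (579643307310619501, 644543316683180704),
        (296151129721950503, 328945291791563504);
    INSERT INTO TableB VALUES
        (579643307310619501, 'A'),
        (579643307310619501, 'B'),
        (579643307310619501, 'C'),
        (644543316683180704, 'A'),
        (296151129721950503, 'A'),
        (328945291791563504, 'A');
""")

only_a_query = """
WITH cte (ID) AS (
    -- IDs that never appear with a type other than 'A'
    SELECT ID
    FROM TableB
    GROUP BY ID
    HAVING SUM(CASE WHEN TYPE <> 'A' THEN 1 ELSE 0 END) = 0
)
SELECT a.ID1, a.ID2
FROM TableA a
INNER JOIN cte t1 ON a.ID1 = t1.ID
INNER JOIN cte t2 ON a.ID2 = t2.ID
"""
only_a_pairs = conn.execute(only_a_query).fetchall()
print(only_a_pairs)
```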
Find below a working demo (for SQL Server - I can't get Oracle to work anywhere).
Demo
Here is an Oracle solution using the MINUS operator.
The top sub-query gets the set of records where both ID1 and ID2 are of type 'A'. The bottom sub-query gets the set of records where either ID1 or ID2 is not of type 'A'. The result is the set of records in the top set which are not also in the bottom set.
select a.id1, a.id2
from a
join b b1 on b1.id = a.id1
join b b2 on b2.id = a.id2
where b1.type = 'A'
and b2.type = 'A'
minus
select a.id1, a.id2
from a
join b b1 on b1.id = a.id1
join b b2 on b2.id = a.id2
where b1.type != 'A'
or b2.type != 'A'
/
This SQL Fiddle demo returns the right row but there's a bit of a problem with its display: for some reason the numbers are rounded down.
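The same set-difference idea can be checked locally with an in-memory SQLite database from Python. Note the swap: SQLite has no MINUS keyword, so this sketch uses EXCEPT, which plays the same role. The data is the sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id1 INTEGER, id2 INTEGER);
    CREATE TABLE b (id INTEGER, type TEXT);
    INSERT INTO a VALUES
        (579643307310619501, 644543316683180704),
        (296151129721950503, 328945291791563504);
    INSERT INTO b VALUES
        (579643307310619501, 'A'),
        (579643307310619501, 'B'),
        (579643307310619501, 'C'),
        (644543316683180704, 'A'),
        (296151129721950503, 'A'),
        (328945291791563504, 'A');
""")

except_query = """
SELECT a.id1, a.id2
FROM a
JOIN b b1 ON b1.id = a.id1
JOIN b b2 ON b2.id = a.id2
WHERE b1.type = 'A' AND b2.type = 'A'
EXCEPT  -- stands in for Oracle's MINUS
SELECT a.id1, a.id2
FROM a
JOIN b b1 ON b1.id = a.id1
JOIN b b2 ON b2.id = a.id2
WHERE b1.type != 'A' OR b2.type != 'A'
"""
except_pairs = conn.execute(except_query).fetchall()
print(except_pairs)
```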
Note on performance
This hits table A twice and table B four times. With small tables and a well-sized buffer cache this is not so important.
@TimBiegeleisen uses the WITH clause and that approach only hits each table once. However, Oracle will materialize the CTE as a temporary table. The overhead of doing this for such small amounts of data makes his solution consistently slower than mine. Including an /*+ inline */ hint in the CTE projection prevents Oracle from materializing the temporary table and the performance of the two queries becomes comparable.
However, if the tables become large enough there will be a point at which the WITH clause approach with a materialized temporary table is the more performant approach. As always with query tuning, the specifics matter greatly and benchmarking is the key to success.
Here is a sample for Oracle and my solution.
This is valid for any letter, A, B, C... If you want only for A, add an additional filter in the where of the main query.
create table a (id1 number,id2 number, constraint pk_a primary key(id1,id2));
create table b (id number, type char(1), constraint pk_b primary key(id,type));
insert into a values(57,64);
insert into a values(29,32);
insert into b values(57,'A');
insert into b values(57,'B');
insert into b values(57,'C');
insert into b values(64,'A');
insert into b values(29,'A');
insert into b values(32,'A');
commit;
select a.*
from a, b b1, b b2
where a.id1 = b1.id
and a.id2 = b2.id
and b1.type = b2.type
and not exists (select null
from b b1bis
where b1bis.id = b1.id
and b1.type <> b1bis.type)
and not exists (select null
from b b2bis
where b2bis.id = b2.id
and b2.type <> b2bis.type);

Value present in more than one table

I have 3 tables. All of them have a column - id. I want to find if there is any value that is common across the tables. Assuming that the tables are named a, b and c, if id value 3 is present in a and b, there is a problem. The query can/should exit at the first such occurrence. There is no need to probe further. What I have now is something like
( select id from a intersect select id from b )
union
( select id from b intersect select id from c )
union
( select id from a intersect select id from c )
Obviously, this is not very efficient. Database is PostgreSQL, version 9.0
id is not unique in the individual tables. It is OK to have duplicates in the same table. But if a value is present in just 2 of the 3 tables, that also needs to be flagged and there is no need to check for existence in the third table, or check if there are more such values. One value, present in more than one table, and I can stop.
Although id is not unique within any given table, it should be unique across the tables; a union of distinct id should be unique, so:
select id from (
select distinct id from a
union all
select distinct id from b
union all
select distinct id from c) x
group by id
having count(*) > 1
Note the use of union all, which preserves duplicates (plain union removes duplicates).
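A small sketch of this on made-up data, run through an in-memory SQLite database from Python (the sample ids are assumptions, and SQLite stands in for PostgreSQL). Id 1 is duplicated inside a but only lives in one table, while id 2 lives in both a and c.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (id INTEGER);
    INSERT INTO a VALUES (1), (1), (2);
    INSERT INTO b VALUES (3), (4);
    INSERT INTO c VALUES (2), (5);
""")

shared_ids = conn.execute("""
    SELECT id FROM (
        SELECT DISTINCT id FROM a
        UNION ALL
        SELECT DISTINCT id FROM b
        UNION ALL
        SELECT DISTINCT id FROM c
    ) x
    GROUP BY id
    HAVING COUNT(*) > 1
""").fetchall()
print(shared_ids)
```

The inner DISTINCT is what stops the in-table duplicate of id 1 from being counted as a cross-table clash.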
I would suggest a simple join:
select a.id
from a join
b
on a.id = b.id join
c
on a.id = c.id
limit 1;
If you have a query that uses union or group by (or order by, but that is not relevant here), then you need to process all the data before returning a single row. A join can start returning rows as soon as the first values are found.
An alternative, but similar method is:
select a.id
from a
where exists (select 1 from b where a.id = b.id) and
exists (select 1 from c where a.id = c.id);
If a is the smallest table and id is indexed in b and c, then this could be quite fast.
Try this
select id from
(
select distinct id, 1 as t from a
union all
select distinct id, 2 as t from b
union all
select distinct id, 3 as t from c
) as t
group by id having count(t)=3
It is OK to have duplicates in the same table.
The query can/should exit at the first such occurrence.
SELECT 'OMG!' AS danger_bill_robinson
WHERE EXISTS (SELECT 1
FROM a,b,c -- maybe there is a place for old-style joins ...
WHERE a.id = b.id
OR a.id = c.id
OR c.id = b.id
);
Update: it appears the optimiser does not like Cartesian joins with 3 OR conditions. The below query is a bit faster:
SELECT 'WTF!' AS danger_bill_robinson
WHERE exists (select 1 from a JOIN b USING (id))
OR exists (select 1 from a JOIN c USING (id))
OR exists (select 1 from c JOIN b USING (id))
;
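The pairwise EXISTS check can be tried out with the sketch below, which uses an in-memory SQLite database from Python. The helper has_shared_id and the sample ids are hypothetical, and the query selects the boolean directly instead of a literal with a WHERE clause, which is a small variation on the query above.

```python
import sqlite3

PAIRWISE_CHECK = """
SELECT EXISTS (SELECT 1 FROM a JOIN b USING (id))
    OR EXISTS (SELECT 1 FROM a JOIN c USING (id))
    OR EXISTS (SELECT 1 FROM c JOIN b USING (id))
"""

def has_shared_id(a_ids, b_ids, c_ids):
    """Return True if any id occurs in at least two of the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER);
        CREATE TABLE b (id INTEGER);
        CREATE TABLE c (id INTEGER);
    """)
    conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in a_ids])
    conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in b_ids])
    conn.executemany("INSERT INTO c VALUES (?)", [(i,) for i in c_ids])
    (flag,) = conn.execute(PAIRWISE_CHECK).fetchone()
    conn.close()
    return bool(flag)

overlap = has_shared_id([1, 1, 2], [3, 4], [2, 5])   # 2 is in a and c
no_overlap = has_shared_id([1, 1], [3, 4], [5])      # duplicates only within a
print(overlap, no_overlap)
```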

Issues with SQL Select utilizing Except and UNION All

Select *
From (
Select a
Except
Select b
) x
UNION ALL
Select *
From (
Select b
Except
Select a
) y
This sql statement returns an extremely wrong amount of data. If Select a returns a million, how does this entire statement return 100,000? In this instance, Select b contains mutually exclusive data, so there should be no elimination due to the except.
As already stated in the comments, EXCEPT does an implicit DISTINCT, and the ALL in your UNION ALL cannot re-create the duplicates. Hence you cannot use your approach if you want to keep duplicates.
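The row-count drop is easy to reproduce. In this sketch (in-memory SQLite from Python, with a made-up single column v), table a holds three rows, b shares nothing with it, and EXCEPT still returns only two rows because the duplicate is collapsed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (v INTEGER);
    CREATE TABLE b (v INTEGER);
    INSERT INTO a VALUES (1), (1), (2);  -- 1 appears twice
    INSERT INTO b VALUES (3);            -- nothing in common with a
""")

rows_in_a = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
except_rows = sorted(conn.execute(
    "SELECT v FROM a EXCEPT SELECT v FROM b").fetchall())

print(rows_in_a, except_rows)
```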
As you want to get the data that is contained in exactly one of the tables a and b, but not in both, a more efficient way to achieve that would be the following (I am just assuming the tables have columns id and c where id is the primary key, as you did not state any column names):
SELECT CASE WHEN a.id IS NULL THEN 'from b' ELSE 'from a' END as source_table
,coalesce(a.id, b.id) as id
,coalesce(a.c, b.c) as c
FROM a
FULL OUTER JOIN b ON a.id = b.id AND a.c = b.c -- use all columns of both tables here!
WHERE a.id IS NULL OR b.id IS NULL
This makes use of a FULL OUTER JOIN, excluding the matching records via the WHERE conditions, as the primary key cannot be null except if it comes from the OUTER side.
If your tables do not have primary keys - which is bad practice anyway - you would have to check across all columns for NULL, not just the one primary key column.
And if you have records completely consisting of NULLs, this method would not work.
Then you could use an approach similar to your original one, just using
SELECT ...
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE <join by all columns>)
UNION ALL
SELECT ...
FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE <join by all columns>)
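A minimal sketch of the NOT EXISTS variant, assuming two columns id and c, no NULL values, and some made-up rows, run in an in-memory SQLite database from Python. The src label is an addition for readability. Unlike EXCEPT, the duplicate row in a survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, c TEXT);
    CREATE TABLE b (id INTEGER, c TEXT);
    INSERT INTO a VALUES (1, 'x'), (1, 'x'), (2, 'y'), (9, 'z');
    INSERT INTO b VALUES (3, 'w'), (9, 'z');
""")

sym_diff_query = """
SELECT 'a' AS src, a.id, a.c
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id AND b.c = a.c)
UNION ALL
SELECT 'b', b.id, b.c
FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.id = b.id AND a.c = b.c)
ORDER BY 1, 2
"""
sym_diff_rows = conn.execute(sym_diff_query).fetchall()
print(sym_diff_rows)
```

The row (9, 'z') exists in both tables and is dropped from both halves. If the columns can be NULL, a plain = comparison will not match NULL to NULL, so the join condition would need a NULL-aware comparison.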
If you're trying to get any data that is in one table and not in the other regardless of which table, I would try something like the following:
select id, 'table a data not in b' from a where id not in (select id from b)
union
select id, 'table b data not in a' from b where id not in (select id from a)