I am using PostgreSQL and am having difficulty with getting a series of queries that combine the data from two tables (t1, t2)
t1 is
studyida | gender | age
:------- | :----- | --:
a | M | 1
a | M | 2
a | M | 3
b | F | 4
b | F | 5
b | F | 6
c | M | 13
c | M | 14
c | M | 15
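and t2 is
studyida | studyidb | gender | age
:------- | :------- | :----- | --:
a | z | M | 3
a | z | M | 4
a | z | M | 5
NULL | y | F | 7
NULL | y | F | 8
NULL | y | F | 9
c | x | M | 10
c | x | M | 11
c | x | M | 12
NULL | w | F | 7
NULL | w | F | 8
NULL | w | F | 9
NULL | u | M | 7
NULL | u | M | 8
NULL | u | M | 9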
t1 and t2 are related via StudyIDA and gender. What I need is a comprehensive listing from both tables, including the ages. Sometimes the age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3), but most of the time it does not.
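I am looking to create a table like this
StudyIDA | StudyIDB | gender | ageA | ageB
:------- | :------- | :----- | ---: | ---:
a | z | M | 1 |
a | z | M | 2 |
a | z | M | 3 | 3
a | z | M | | 4
a | z | M | | 5
b | NULL | F | 4 |
b | NULL | F | 5 |
b | NULL | F | 6 |
NULL | y | F | | 7
NULL | y | F | | 8
NULL | y | F | | 9
c | x | F | 13 |
c | x | F | 14 |
c | x | F | 15 |
c | x | F | | 10
c | x | F | | 11
c | x | F | | 12
NULL | w | F | | 7
NULL | w | F | | 8
NULL | w | F | | 9
NULL | u | M | | 7
NULL | u | M | | 8
NULL | u | M | | 9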
At first I thought a full outer join of t1 and t2 would give me what I want, but it does not.
Then I thought I needed a listing of all the individuals (let's call it t3), followed by a series of inserts (e.g. t1+t3 and also t2+t3) into a new table to 'construct' what I need. I am really stuck on the odd cases where the age in t1 equals the age in t2 (e.g. for StudyIDA=a, age=3).
I am still not getting what I need. Here is my code so far:
DROP TABLE IF EXISTS t1, t2, t3;
CREATE TEMPORARY TABLE t1 (StudyIDA VARCHAR, gender VARCHAR, age int);
INSERT INTO t1 VALUES
('a','M', 1),('a','M', 2),('a','M', 3),
('b','F', 4),('b','F', 5),('b','F', 6),
('c','M', 13),('c','M', 14),('c','M', 15);
SELECT * FROM t1;
CREATE TEMPORARY TABLE t2 (StudyIDA VARCHAR, StudyIDB varchar, gender VARCHAR, age int);
INSERT INTO t2 VALUES
('a','z','M', 3), ('a','z','M', 4), ('a','z','M', 5),
(NULL,'y','F', 7),(NULL,'y','F', 8),(NULL,'y','F', 9),
('c','x','M', 10),('c','x','M', 11),('c','x','M', 12),
(NULL,'w','F', 7),(NULL,'w','F', 8),(NULL,'w','F', 9),
(NULL,'u','M', 7),(NULL,'u','M', 8),(NULL,'u','M', 9);
SELECT * FROM t2;
CREATE TEMPORARY TABLE t3 (StudyIDA_t1 VARCHAR, gender_t1 VARCHAR, StudyIDA_t2 VARCHAR,StudyIDB varchar,
gender_t2 VARCHAR);
INSERT INTO t3
SELECT * FROM (SELECT DISTINCT StudyIDA, gender FROM t1) a FULL OUTER JOIN
(SELECT DISTINCT StudyIDA, StudyIDB, gender FROM t2) b
ON a.StudyIDA=b.StudyIDA AND a.gender=b.gender
ORDER BY a.StudyIDA;
SELECT * FROM t3 ORDER BY StudyIDA_t1;
SELECT 'IN t1', *
FROM t3 JOIN t1 on t1.StudyIDA=t3.StudyIDA_t1 AND t1.gender=t3.gender_t1
ORDER BY StudyIDA_t1, StudyIDB;
SELECT 'In t2',*
FROM t3 JOIN t2 on t3.StudyIDA_t1=t2.StudyIDA AND t3.gender_t1=t2.gender
ORDER BY StudyIDA_t1, t3.StudyIDB;
DROP TABLE IF EXISTS t1, t2, t3;
Maybe a full join that also matches on age?
Plus some COALESCE calls for the common fields.
SELECT DISTINCT
COALESCE(t1.StudyIDA, t2.StudyIDA) AS StudyIDA
, t2.StudyIDB
, COALESCE(t1.gender, t2.gender) AS gender
, t1.age as ageA
, t2.age as ageB
FROM t1
FULL JOIN t2
ON t2.StudyIDA is not distinct from t1.StudyIDA
AND t2.gender = t1.gender
AND t2.age = t1.age
ORDER BY StudyIDA, gender, ageA, ageB;
studyida | studyidb | gender | agea | ageb
:------- | :------- | :----- | ---: | ---:
a | null | M | 1 | null
a | null | M | 2 | null
a | z | M | 3 | 3
a | z | M | null | 4
a | z | M | null | 5
b | null | F | 4 | null
b | null | F | 5 | null
b | null | F | 6 | null
c | null | M | 13 | null
c | null | M | 14 | null
c | null | M | 15 | null
c | x | M | null | 10
c | x | M | null | 11
c | x | M | null | 12
null | w | F | null | 7
null | y | F | null | 7
null | w | F | null | 8
null | y | F | null | 8
null | w | F | null | 9
null | y | F | null | 9
null | u | M | null | 7
null | u | M | null | 8
null | u | M | null | 9
Your sample data indicates that only t2.studyida can be NULL and all other columns should really be declared as NOT NULL.
If so, I suggest this simpler query:
SELECT studyida, b.studyidb, gender, age
, CASE WHEN a.age IS NULL THEN 'b'
WHEN b.age IS NULL THEN 'a'
ELSE 'a and b' END as source
FROM t1 a
FULL JOIN t2 b USING (studyida, gender, age)
ORDER BY studyida, gender, age;
The USING clause is convenient for identically named join columns. Only a single instance of the joining column is in the result set, effectively what COALESCE(a.col, b.col) gives you otherwise. (You might just use SELECT *.)
You can still reference source columns with table-qualification, like a.age.
I reduced to a single age column and added source. You may or may not want that.
Either way, "age" is subject to bitrot, almost always the wrong choice for a table column, and should typically be replaced by "birthday" or similar.
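For anyone who wants to check the row counts without a PostgreSQL instance, here is a rough sketch using Python's built-in sqlite3 module and the sample data from the question. Because older SQLite builds have no FULL JOIN, the sketch swaps in a LEFT JOIN plus a UNION ALL of the unmatched t2 rows; that substitution is my assumption and not part of either answer, but it should give the same rows for this data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (studyida TEXT, gender TEXT, age INT);
INSERT INTO t1 VALUES
 ('a','M',1),('a','M',2),('a','M',3),
 ('b','F',4),('b','F',5),('b','F',6),
 ('c','M',13),('c','M',14),('c','M',15);
CREATE TABLE t2 (studyida TEXT, studyidb TEXT, gender TEXT, age INT);
INSERT INTO t2 VALUES
 ('a','z','M',3),('a','z','M',4),('a','z','M',5),
 (NULL,'y','F',7),(NULL,'y','F',8),(NULL,'y','F',9),
 ('c','x','M',10),('c','x','M',11),('c','x','M',12),
 (NULL,'w','F',7),(NULL,'w','F',8),(NULL,'w','F',9),
 (NULL,'u','M',7),(NULL,'u','M',8),(NULL,'u','M',9);
""")

# First branch: every t1 row, with its t2 partner when all three columns match.
# Second branch: the t2 rows that found no partner in t1.
rows = con.execute("""
SELECT t1.studyida, t2.studyidb, t1.gender, t1.age AS agea, t2.age AS ageb
FROM t1
LEFT JOIN t2
  ON t2.studyida = t1.studyida AND t2.gender = t1.gender AND t2.age = t1.age
UNION ALL
SELECT t2.studyida, t2.studyidb, t2.gender, NULL, t2.age
FROM t2
WHERE NOT EXISTS (
  SELECT 1 FROM t1
  WHERE t1.studyida = t2.studyida AND t1.gender = t2.gender AND t1.age = t2.age)
""").fetchall()

# Rows present in both tables have both age columns filled.
matched = [r for r in rows if r[3] is not None and r[4] is not None]
print(len(rows))
print(matched)
```

With the sample data this yields 23 rows, and the only row with both ages filled is the StudyIDA=a, age=3 case the question was stuck on.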
I need to compare the records from two tables: X and Y. Each record has two ids: ID1 and ID2. Either ID1 or ID2 can be null in either table, but both can’t be null at once. I need to produce a view with all the information from both tables:
Rows where X.ID1 = Y.ID1 and X.ID2 = Y.ID2
Rows where X.ID1 = Y.ID1 but X.ID2 <> Y.ID2
Rows where X.ID1 <> Y.ID1 but X.ID2 = Y.ID2
Rows where X.ID1 and Y.ID1 don’t have any matches at all
Rows where X.ID2 and Y.ID2 don’t have any matches at all
Example:
X: Y:
|---------------| |---------------|
| ID1 | ID2 | | ID1 | ID2 |
|---------------| |---------------|
| 1 | A | | 1 | A |
| 2 | B | | 2 | C |
| 3 | NULL | | NULL | B |
| NULL | D | | 5 | NULL |
|---------------| |---------------|
Output:
|---------------------------------------|
| XID1 | YID1 | XID2 | YID2 | SRC |
|---------------------------------------|
| 1 | 1 | A | A | X+Y |
| 2 | 2 | B | C | X+Y |
| 3 | NULL | NULL | NULL | X |
| NULL | 5 | NULL | NULL | Y |
| 2 | NULL | B | B | X+Y |
| NULL | 2 | NULL | C | Y |
| NULL | NULL | D | NULL | X |
|---------------------------------------|
My first obvious solution was to do a FULL OUTER JOIN:
SELECT … FROM X FULL OUTER JOIN Y ON X.ID1 = Y.ID1 OR X.ID2 = Y.ID2
This works, but a conditional within a join has terrible performance, and this view would take up to a minute to run. Removing the conditional takes the execution time down to less than a second, but then I lose matching by one of the IDs.
How can I elegantly achieve the above without using a conditional join? I’ve tried:
Joining by concatenation of the two IDs, but this only matches when both IDs match
Doing a CROSS JOIN and filtering by X.ID1=Y.ID1 OR X.ID2=Y.ID2, but this loses the cases without any matches. This is the most promising approach.
Doing a UNION ALL of X and Y and then grouping by ID1 and ID2, but this once again only matches when both IDs match
You can try decomposing this into multiple joins. I think the logic is:
SELECT …
FROM X JOIN
Y
ON X.ID1 = Y.ID1
UNION ALL
SELECT …
FROM X JOIN
Y
ON X.ID2 = Y.ID2 AND (X.ID1 <> Y.ID1 OR X.ID1 IS NULL OR Y.ID1 IS NULL)
UNION ALL
SELECT ...
FROM X
WHERE NOT EXISTS (SELECT 1 FROM Y WHERE Y.ID1 = X.ID1) AND
NOT EXISTS (SELECT 1 FROM Y WHERE Y.ID2 = X.ID2)
UNION ALL
SELECT ...
FROM Y
WHERE NOT EXISTS (SELECT 1 FROM X WHERE Y.ID1 = X.ID1) AND
NOT EXISTS (SELECT 1 FROM X WHERE Y.ID2 = X.ID2) ;
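One way to gain some confidence in a decomposition like this is to compare it against a brute-force evaluation of the original OR condition. The sketch below does that with Python's sqlite3 module and the X/Y sample rows from the question. The select lists and the explicit NULL checks in the second branch are my assumptions (the question says either id can be NULL, and a plain inequality test never succeeds against NULL).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE X (ID1 INT, ID2 TEXT);
CREATE TABLE Y (ID1 INT, ID2 TEXT);
INSERT INTO X VALUES (1,'A'),(2,'B'),(3,NULL),(NULL,'D');
INSERT INTO Y VALUES (1,'A'),(2,'C'),(NULL,'B'),(5,NULL);
""")

# The four branches: ID1 matches, ID2-only matches, unmatched X, unmatched Y.
decomposed = con.execute("""
SELECT X.ID1, Y.ID1, X.ID2, Y.ID2 FROM X JOIN Y ON X.ID1 = Y.ID1
UNION ALL
SELECT X.ID1, Y.ID1, X.ID2, Y.ID2 FROM X JOIN Y
  ON X.ID2 = Y.ID2 AND (X.ID1 <> Y.ID1 OR X.ID1 IS NULL OR Y.ID1 IS NULL)
UNION ALL
SELECT X.ID1, NULL, X.ID2, NULL FROM X
WHERE NOT EXISTS (SELECT 1 FROM Y WHERE Y.ID1 = X.ID1)
  AND NOT EXISTS (SELECT 1 FROM Y WHERE Y.ID2 = X.ID2)
UNION ALL
SELECT NULL, Y.ID1, NULL, Y.ID2 FROM Y
WHERE NOT EXISTS (SELECT 1 FROM X WHERE X.ID1 = Y.ID1)
  AND NOT EXISTS (SELECT 1 FROM X WHERE X.ID2 = Y.ID2)
""").fetchall()

xs = con.execute("SELECT ID1, ID2 FROM X").fetchall()
ys = con.execute("SELECT ID1, ID2 FROM Y").fetchall()

def eq(a, b):
    # SQL-style equality: NULL (None) never equals anything.
    return a is not None and b is not None and a == b

def matches(x, y):
    # The OR condition of the original FULL OUTER JOIN.
    return eq(x[0], y[0]) or eq(x[1], y[1])

# Brute-force full outer join: matching pairs, then leftovers from each side.
reference = [(x[0], y[0], x[1], y[1]) for x in xs for y in ys if matches(x, y)]
reference += [(x[0], None, x[1], None) for x in xs
              if not any(matches(x, y) for y in ys)]
reference += [(None, y[0], None, y[1]) for y in ys
              if not any(matches(x, y) for x in xs)]

same = sorted(decomposed, key=repr) == sorted(reference, key=repr)
print(len(decomposed), same)
```

On the sample data both approaches return the same six rows. That is only a spot check on four rows per table, not a proof that the two forms agree in general.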
If I read your conditions correctly, you could try something like this. Union the two left joins together and take a distinct of the two sets.
SELECT DISTINCT ... FROM (
SELECT … FROM X LEFT JOIN Y ON X.ID1 = Y.ID1
UNION ALL
SELECT … FROM X LEFT JOIN Y ON X.ID2 = Y.ID2
UNION ALL
SELECT … FROM Y LEFT JOIN X ON Y.ID1 = X.ID1 WHERE X.ID1 is null
UNION ALL
SELECT … FROM Y LEFT JOIN X ON Y.ID2 = X.ID2 WHERE X.ID2 is null
) combined
In situations where I have to choose between doing an OR in the join, or a union of two left joins, I find the union to be faster.
EDIT: Updated to include Y on the left as well.
I have the following (simplified) situation in two databases:
|---------|--------|---------|---------|
| ID | Prog | T | Qt |
|---------|--------|---------|---------|
| a | 1 | N | 100 |
| b | 1 | Y | 10 |
| b | 2 | N | 90 |
| c | 1 | N | 25 |
| c | 2 | Y | 25 |
| c | 3 | Y | 25 |
| c | 4 | Y | 25 |
|---------|--------|---------|---------|
|---------|--------|---------|---------|
| ID | Prog | T | Qt |
|---------|--------|---------|---------|
| 1 | 1 | Y | 10 |
| 1 | 2 | N | 90 |
| 2 | 1 | Y | 100 |
| 3 | 1 | Y | 100 |
| 4 | 1 | Y | 50 |
| 4 | 2 | Y | 25 |
| 4 | 3 | Y | 25 |
|---------|--------|---------|---------|
I need to compare groups of rows (primary keys are ID and Prog), to find out which groups of rows represent the same combination of factors (not considering ID).
In the example above, ID "b" in the first table and ID "1" in the second have the same combination of values for Prog, T and Qt, while no one else can be considered exactly the same between the 2 dbs (while ID "2" and "3" in the second table are equal, I'm not interested in comparing in the same db).
I hope I explained everything.
A join and aggregation should work for this purpose:
select t1.id, t2.id
from (select t1.*, count(*) over (partition by id) as cnt
from t1
) t1 join
(select t2.*, count(*) over (partition by id) as cnt
from t2
) t2
on t1.prog = t2.prog and t1.T = t2.T and t1.Qt = t2.Qt and t1.cnt = t2.cnt
group by t1.id, t2.id, t1.cnt
having count(*) = t1.cnt;
This is a little tricky. The subqueries count the number of rows for each id in each table. The on clause gets matches between the three columns -- and checks that the ids have the same count. The group by and having then get rows where number of matching rows is the total number of rows.
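To see the idea run against the sample rows, here is a small sketch using Python's sqlite3 module. The table names a and b are placeholders of mine (the answer calls them t1 and t2), and the per-id counts come from grouped subqueries rather than the window function, which is a substitution so the sketch also runs on older SQLite builds. The matching logic is otherwise meant to be the same.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id TEXT, prog INT, t TEXT, qt INT);
CREATE TABLE b (id INT,  prog INT, t TEXT, qt INT);
INSERT INTO a VALUES
 ('a',1,'N',100),('b',1,'Y',10),('b',2,'N',90),
 ('c',1,'N',25),('c',2,'Y',25),('c',3,'Y',25),('c',4,'Y',25);
INSERT INTO b VALUES
 (1,1,'Y',10),(1,2,'N',90),(2,1,'Y',100),(3,1,'Y',100),
 (4,1,'Y',50),(4,2,'Y',25),(4,3,'Y',25);
""")

# ac / bc hold the number of rows per id in each table.
# A pair of ids qualifies only if both groups have the same size
# and every row of one group finds its twin in the other.
matches = con.execute("""
SELECT a.id, b.id
FROM a
JOIN (SELECT id, COUNT(*) AS cnt FROM a GROUP BY id) ac ON ac.id = a.id
JOIN b ON b.prog = a.prog AND b.t = a.t AND b.qt = a.qt
JOIN (SELECT id, COUNT(*) AS cnt FROM b GROUP BY id) bc
  ON bc.id = b.id AND bc.cnt = ac.cnt
GROUP BY a.id, b.id, ac.cnt
HAVING COUNT(*) = ac.cnt
""").fetchall()

print(matches)
```

For the sample data the only pair reported is id "b" from the first table with id 1 from the second, which is the match the question describes.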
Join the two tables on the conditions you want to match. The results will be the values that match between them.
CREATE TABLE a (ID CHAR(1), Prog INT, T CHAR(1), Qt INT);
CREATE TABLE b (ID int, Prog INT, T CHAR(1), Qt INT);
INSERT INTO dbo.a
( ID ,Prog ,T ,Qt)
VALUES ('a',1,'N',100), ('b',1,'Y',10), ('b',2,'N',90),('c',1,'N',25),('c',2,'Y',25),('c',3,'Y',25),('c',4,'Y',25);
INSERT INTO dbo.b
( ID ,Prog ,T ,Qt)
VALUES (1,1,'Y',10),(1,2,'N',90),(2,1,'Y',100),(3,1,'Y',100),(4,1,'Y',50),(4,2,'Y',25),(4,3,'Y',25);
WITH CTEa
AS (SELECT ID,
Prog,
T,
Qt,
Cnt = COUNT(ID) OVER (PARTITION BY ID)
FROM dbo.a
),
CTEb
AS (SELECT ID,
Prog,
T,
Qt,
Cnt = COUNT(ID) OVER (PARTITION BY ID)
FROM dbo.b
)
SELECT ID_A = a.ID,
ID_B = b.ID,
b.Prog,
b.T,
b.Qt,
b.Cnt
FROM CTEa AS a
INNER JOIN CTEb AS b
ON a.Prog = b.Prog
AND a.T = b.T
AND a.Qt = b.Qt
AND a.Cnt = b.Cnt;
Results:
ID_A ID_B Prog T Qt Cnt
b 1 1 Y 10 2
b 1 2 N 90 2
I have a couple of tables like the ones below:
Table1:
A B C D <<Columns
1 2 3 4 <<single row
Table2:
W X Y Z << Columns
5 6 7 8 << Single row
I want to combine these 2 tables in such a way that I get the following result:
Result:
P Q R S << Column headers
1 2 3 4 << row from table1
5 6 7 8 << row from table2
Expected result will have column headers as P, Q, R, S and row from table1 and row from table2
How to achieve this using SQL?
UNION ALL will not eliminate duplicates
In set operations (UNION / INTERSECT / EXCEPT) the column aliases are taken from the first query. (The only exception I'm currently aware of is Hive, which requires the aliases to be the same in all queries; I consider this a bug.)
select A as P, B as Q, C as R, D as S
from table1
union all
select W,X,Y,Z
from table2
+---+---+---+---+
| p | q | r | s |
+---+---+---+---+
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
+---+---+---+---+
table2 with 3 Columns
select B as Q, C as R, D as S
from table1
union all
select X,Y,Z
from table2
+---+---+---+
| q | r | s |
+---+---+---+
| 2 | 3 | 4 |
| 6 | 7 | 8 |
+---+---+---+
or
select A as P, B as Q, C as R, D as S
from table1
union all
select null,X,Y,Z
from table2
+--------+---+---+---+
| p | q | r | s |
+--------+---+---+---+
| 1 | 2 | 3 | 4 |
| (null) | 6 | 7 | 8 |
+--------+---+---+---+
Updated to be more strict and more complete, thanks to #AntDC (and #Matt) and #Dudu Markovitz
Use UNION with aliases, like this:
SELECT A AS P, B AS Q, C AS R, D AS S
FROM table1
UNION
-- or UNION ALL if you want to keep duplicate rows
SELECT W, X, Y, Z
FROM table2
I create the following table on http://sqlfiddle.com in PostgreSQL 9.3.1 mode:
CREATE TABLE t
(
id serial primary key,
m varchar(1),
d varchar(1),
c int
);
INSERT INTO t
(m, d, c)
VALUES
('A', '1', 101),
('A', '2', 102),
('A', '3', 103),
('B', '1', 104),
('B', '3', 105);
table:
| ID | M | D | C |
|----|---|---|-----|
| 1 | A | 1 | 101 |
| 2 | A | 2 | 102 |
| 3 | A | 3 | 103 |
| 4 | B | 1 | 104 |
| 5 | B | 3 | 105 |
From this I want to generate such a table:
| M | D | ID | C |
|---|---|--------|--------|
| A | 1 | 1 | 101 |
| A | 2 | 2 | 102 |
| A | 3 | 3 | 103 |
| B | 1 | 4 | 104 |
| B | 2 | (null) | (null) |
| B | 3 | 5 | 105 |
but with my current statement
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
I only get the following
| M | D | ID | C |
|---|---|--------|--------|
| A | 1 | 1 | 101 |
| B | 1 | 4 | 104 |
| A | 2 | 2 | 102 |
| A | 3 | 3 | 103 |
| B | 3 | 5 | 105 |
| B | 2 | (null) | (null) |
Attempts to order it by m,d fail so far:
select * from
(select * from
(select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as kombi
full outer join
t
on kombi.d = t.d and kombi.m = t.m) as result)
order by result.m
Error message:
ERROR: subquery in FROM must have an alias: select * from (select * from (select * from (select * from (select distinct m from t) as dummy1, (select distinct d from t) as dummy2) as kombi full outer join t on kombi.d = t.d and kombi.m = t.m) as result) order by result.m
It would be cool if somebody could point out to me what I am doing wrong and perhaps show the correct statement.
select * from
(select kombi.m, kombi.d, t.id, t.c from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as kombi
full outer join t
on kombi.d = t.d and kombi.m = t.m) as result
order by result.m, result.d
I think your problem is the order. You can solve this problem with the order by clause:
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
order by combi.m, combi.d
You need to specify which data you would like to order. In this case you get back the row from the combi table, so you need to say that.
http://sqlfiddle.com/#!15/ddc0e/17
You could also use column numbers instead of names to do the ordering.
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
order by 1,2;
| M | D | ID | C |
|---|---|--------|--------|
| A | 1 | 1 | 101 |
| A | 2 | 2 | 102 |
| A | 3 | 3 | 103 |
| B | 1 | 4 | 104 |
| B | 2 | (null) | (null) |
| B | 3 | 5 | 105 |
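If you want to try the ordering fix outside of SQL Fiddle, here is a rough sketch with Python's sqlite3 module and the table from the question. It uses a LEFT JOIN from the combinations to t instead of the FULL OUTER JOIN; that is a substitution on my part, which should be equivalent here because every row of t already belongs to one of the generated (m, d) combinations.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY, m TEXT, d TEXT, c INT);
INSERT INTO t (m, d, c) VALUES
 ('A','1',101),('A','2',102),('A','3',103),('B','1',104),('B','3',105);
""")

# combi holds every (m, d) combination; rows of t are attached where they exist.
# Ordering by the combi columns keeps the gap row (B, 2) in its natural place.
rows = con.execute("""
SELECT combi.m, combi.d, t.id, t.c
FROM (SELECT m, d
      FROM (SELECT DISTINCT m FROM t) AS dummy1
      CROSS JOIN (SELECT DISTINCT d FROM t) AS dummy2) AS combi
LEFT JOIN t ON t.m = combi.m AND t.d = combi.d
ORDER BY combi.m, combi.d
""").fetchall()

for row in rows:
    print(row)
```

Ordering by combi.m and combi.d matters because the t columns are NULL for the missing (B, 2) combination, so sorting on them would push that row to one end of the result.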
You just need a pivot table.
The query is very simple:
select classes.M, p.i as D, t.ID, t.C
from (select M, max(D) MaxValue from t group by m) classes
inner join pivot p
on p.i <= classes.MaxValue
left join t
on t.M = classes.M
and t.D = p.i
The pivot table is just a helper table of sequential integers:
CREATE TABLE Pivot (
i INT,
PRIMARY KEY(i)
)
Populate it like this:
CREATE TABLE Foo(
i CHAR(1)
)
INSERT INTO Foo VALUES('0')
INSERT INTO Foo VALUES('1')
INSERT INTO Foo VALUES('2')
INSERT INTO Foo VALUES('3')
INSERT INTO Foo VALUES('4')
INSERT INTO Foo VALUES('5')
INSERT INTO Foo VALUES('6')
INSERT INTO Foo VALUES('7')
INSERT INTO Foo VALUES('8')
INSERT INTO Foo VALUES('9')
Using the 10 rows in the Foo table, you can easily populate the Pivot table with 1,000 rows. To get 1,000 rows from 10 rows, join Foo to itself three times to create a Cartesian product:
INSERT INTO Pivot
SELECT f1.i+f2.i+f3.i
FROM Foo f1, Foo F2, Foo f3
You can read about that in the Transact-SQL Cookbook by Jonathan Gennick and Ales Spetic.
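As a quick illustration of the digits-to-numbers trick, here is a sketch with Python's sqlite3 module. It stores the digits as integers and combines them arithmetically (100 * f1.i + 10 * f2.i + f3.i), which is an assumption of mine; the answer above stores CHAR digits and appears to rely on T-SQL string concatenation with an implicit conversion to INT.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Foo (i INT);
INSERT INTO Foo VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

CREATE TABLE Pivot (i INT PRIMARY KEY);
-- Cartesian product of three digit tables: hundreds, tens, units.
INSERT INTO Pivot
SELECT 100 * f1.i + 10 * f2.i + f3.i
FROM Foo f1, Foo f2, Foo f3;
""")

count, low, high = con.execute(
    "SELECT COUNT(*), MIN(i), MAX(i) FROM Pivot"
).fetchone()
print(count, low, high)
```

The three-way product gives each integer from 0 to 999 exactly once, so the primary key on Pivot is never violated.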
You just need to order by the final column definitions, t.m and t.d. So your final SQL would be...
SELECT *
FROM (SELECT *
FROM (SELECT DISTINCT m FROM t) AS dummy1,
(SELECT DISTINCT d FROM t) AS dummy2) AS combi
FULL OUTER JOIN t
ON combi.d = t.d
AND combi.m = t.m
ORDER BY t.m,
t.d;
Also, from a query optimization perspective, it is better not to have many layers of subqueries.
I think you need another correlation name - dummy3? - after 'as result )' before the order by.
[MS SQL 2008]
I have tables (all columns are string names):
A: two columns relating some datafield to an owning entity
B: three columns defining a hierarchy of entities
I need to create a single table of the whole hierarchy (including all rows not existing in both tables), but the key column in table A (shown as Acol2) can be in either column 1 or 2 of table B...
A: B:
Acol1 | Acol2 Bcol1 | Bcol2 | Bcol3
-------+------ --------+-------+------
A | B B | X | Y
C | D Q | X | Y
E | F H | D | Z
G | H W | V | U
The output should be
Hierarchy:
Acol1 | Bcol1 | Bcol2 | Bcol3
-------+-------+-------+------
A | B | X | Y
Null | Q | X | Y
C | Null | D | Z
G | H | D | Z
E | Null | Null | Null
Null | W | V | U
Logic (also added to original):
If A has no record in B, show A with all Null
If A has record in Bcol1, show A with full row B
If A has record in Bcol2, show A with Null, Bcol2, Bcol3
If B has no record in A, show B with Null for Acol1
I have tried all sorts of UNIONs of two separate JOINs, but can't seem to get rid of extraneous rows...
B LEFT JOIN A ON Acol2=Bcol1 UNION B LEFT JOIN A ON Acol2=Bcol2;
gives duplicate rows, as the second part of the union has to set Bcol1 to NULL
(perhaps one solution is a way to remove this duplicate NULL row?)
B INNER JOIN A ON Acol2=Bcol1 UNION B INNER JOIN A ON Acol2=Bcol2;
Obviously removes all the rows from A and B that have no shared keys
(solution as to easy way to regain just those rows?)
Any idea appreciated!
To play:
[SQL removed - see fiddle in reply comments]
SELECT
Table1.ACol1,
-- Hide BCol1 only when the match came through BCol2; unmatched Table2 rows keep their BCol1
CASE WHEN Table1.ACol2 = Table2.BCol2 THEN NULL ELSE Table2.BCol1 END AS BCol1,
Table2.BCol2,
Table2.BCol3
FROM
Table1
FULL OUTER JOIN
Table2
ON Table1.ACol2 IN (Table2.BCol1, Table2.BCol2)
When you say no duplicates, this is only possible if ACol2 only ever appears in one field of one row in Table2. If it appears in multiple places, you'll get duplication.
- If that's possible, how would you want to choose which record from Table2 to use?
In general, though, this is a SQL anti-pattern.
This is because the join would prefer an index on Table2, but since you never know which field you're joining on, no single index will ever satisfy the join condition.
EDIT:
What would make this significantly faster is to create a normalised TableB...
B_ID | B_Col | B_Val
------+-------+-------
1 | 1 | B
1 | 2 | X
1 | 3 | Y
2 | 1 | Q
2 | 2 | X
2 | 3 | Y
3 | 1 | H
3 | 2 | D
3 | 3 | Z
4 | 1 | W
4 | 2 | V
4 | 3 | U
Then index that table with (B_ID) and on (B_Val)...
Then include the B_ID field in the non_normalised table...
ID | Bcol1 | Bcol2 | Bcol3
------+-------+-------+-------
1 | B | X | Y
2 | Q | X | Y
3 | H | D | Z
4 | W | V | U
Then use the following query...
SELECT
Table1.ACol1,
-- Hide BCol1 only when the match came through BCol2; unmatched Table2 rows keep their BCol1
CASE WHEN Table1.ACol2 = Table2.BCol2 THEN NULL ELSE Table2.BCol1 END AS BCol1,
Table2.BCol2,
Table2.BCol3
FROM
(
Table1
LEFT JOIN
Table2Normalised
ON Table2Normalised.B_Val = Table1.ACol2
AND Table2Normalised.B_Col IN (1,2)
)
FULL OUTER JOIN
Table2
ON Table2Normalised.B_ID = Table2.ID
EDIT:
Without changing the schema, and instead having one index on BCol1 and a second index on Bcol2...
SELECT ACol1, BCol1, BCol2, BCol3 FROM Table1 a INNER JOIN Table2 b ON a.ACol2 = b.BCol1
UNION ALL
SELECT ACol1, NULL, BCol2, BCol3 FROM Table1 a INNER JOIN Table2 b ON a.ACol2 = b.BCol2
UNION ALL
SELECT ACol1, NULL, NULL, NULL FROM Table1 a WHERE NOT EXISTS (SELECT * FROM Table2 WHERE BCol1 = a.ACol2)
AND NOT EXISTS (SELECT * FROM Table2 WHERE BCol2 = a.ACol2)
UNION ALL
SELECT NULL, BCol1, BCol2, BCol3 FROM Table2 b WHERE NOT EXISTS (SELECT * FROM Table1 WHERE ACol2 = b.BCol1)
AND NOT EXISTS (SELECT * FROM Table1 WHERE ACol2 = b.BCol2)
But that's pretty messy...
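Messy or not, the four-branch UNION ALL can be checked against the sample data from the question. The sketch below runs it through Python's sqlite3 module (an assumption of convenience; the question is about MS SQL 2008, but this query uses nothing vendor-specific) and compares the rows with the expected Hierarchy table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Table1 (ACol1 TEXT, ACol2 TEXT);
CREATE TABLE Table2 (BCol1 TEXT, BCol2 TEXT, BCol3 TEXT);
INSERT INTO Table1 VALUES ('A','B'),('C','D'),('E','F'),('G','H');
INSERT INTO Table2 VALUES ('B','X','Y'),('Q','X','Y'),('H','D','Z'),('W','V','U');
""")

# Branches: match on BCol1, match on BCol2, unmatched Table1, unmatched Table2.
rows = con.execute("""
SELECT ACol1, BCol1, BCol2, BCol3
FROM Table1 a INNER JOIN Table2 b ON a.ACol2 = b.BCol1
UNION ALL
SELECT ACol1, NULL, BCol2, BCol3
FROM Table1 a INNER JOIN Table2 b ON a.ACol2 = b.BCol2
UNION ALL
SELECT ACol1, NULL, NULL, NULL FROM Table1 a
WHERE NOT EXISTS (SELECT * FROM Table2 WHERE BCol1 = a.ACol2)
  AND NOT EXISTS (SELECT * FROM Table2 WHERE BCol2 = a.ACol2)
UNION ALL
SELECT NULL, BCol1, BCol2, BCol3 FROM Table2 b
WHERE NOT EXISTS (SELECT * FROM Table1 WHERE ACol2 = b.BCol1)
  AND NOT EXISTS (SELECT * FROM Table1 WHERE ACol2 = b.BCol2)
""").fetchall()

# The Hierarchy table from the question, with Null written as None.
expected = {
    ('A', 'B', 'X', 'Y'),
    (None, 'Q', 'X', 'Y'),
    ('C', None, 'D', 'Z'),
    ('G', 'H', 'D', 'Z'),
    ('E', None, None, None),
    (None, 'W', 'V', 'U'),
}
print(len(rows), set(rows) == expected)
```

Each branch can use its own index (one on BCol1, one on BCol2), which is the reason for splitting the join up in the first place.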