SQL JOIN or UNION or what?

I have 3 tables named A, B and C. Table A has column a. Table B has columns a and b. Table C has columns a and c. These tables contain data like below:
I want to get all data from A,B,C where a=1
My desired output should look like below:
But I'm getting result from SSMS like below:
How should I refactor my SQL to get my desired output?
i.e. I don't want repeated values in the columns

You need to join these values, not only on the a value but also on position. SQL tables represent unordered sets. I am going to assume that the b and c columns represent the ordering.
select a.a, b.b, c.c
from (select a.*, row_number() over (order by a) as seqnum
      from a
     ) a full outer join
     (select b.*, row_number() over (partition by a order by b) as seqnum
      from b
     ) b
     on a.a = b.a and a.seqnum = b.seqnum full outer join
     (select c.*, row_number() over (partition by a order by c) as seqnum
      from c
     ) c
     on c.a = coalesce(a.a, b.a) and c.seqnum = coalesce(a.seqnum, b.seqnum)
where coalesce(a.a, b.a, c.a) = 1;
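Since the question's sample data wasn't reproduced here, a minimal runnable sketch of the row_number pairing idea, with made-up data and a plain inner join between b and c (the full query above also brings in table a via full outer joins). It runs the SQL in SQLite through Python's sqlite3 module (window functions need SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE b(a INT, b TEXT);
CREATE TABLE c(a INT, c TEXT);
INSERT INTO b VALUES (1, 'b1'), (1, 'b2');
INSERT INTO c VALUES (1, 'c1'), (1, 'c2');
""")

# Pair the nth b row with the nth c row per a value,
# instead of producing the 2x2 cross product per a.
rows = cur.execute("""
SELECT b.a, b.b, c.c
FROM (SELECT b.*, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS seqnum
      FROM b) b
JOIN (SELECT c.*, ROW_NUMBER() OVER (PARTITION BY a ORDER BY c) AS seqnum
      FROM c) c
  ON c.a = b.a AND c.seqnum = b.seqnum
WHERE b.a = 1
ORDER BY b.b
""").fetchall()
print(rows)  # [(1, 'b1', 'c1'), (1, 'b2', 'c2')]
```

A bare join on a alone would return 4 rows here (every b against every c); the seqnum condition is what collapses it to positional pairs.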

Related

Subqueries vs Multi Table Join

I've 3 tables A, B, C. I want to list the intersection count.
Way 1:-
select count(id) from A a join B b on a.id = b.id join C c on B.id = C.id;
Result Count - X
Way 2:-
SELECT count(id) FROM A WHERE id IN (SELECT id FROM B WHERE id IN (SELECT id FROM C));
Result Count - Y
The result count in each of the query is different. What exactly is wrong?
A JOIN can multiply the number of rows as well as filtering out rows.
In this case, the second count should be the correct one because nothing is double counted -- assuming id is unique in a. If not, it needs count(distinct a.id).
The equivalent using JOIN would use COUNT(DISTINCT):
select count(distinct a.id)
from A a join
     B b
     on a.id = b.id join
     C c
     on b.id = c.id;
I mention this for completeness but do not recommend this approach. Multiplying the number of rows just to remove them using distinct is inefficient.
In many databases, the most efficient method might be:
select count(*)
from a
where exists (select 1 from b where b.id = a.id) and
exists (select 1 from c where c.id = a.id);
Note: This assumes there are indexes on the id columns and that id is unique in a.
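A tiny sketch of why the two counts diverge, using made-up data in SQLite via Python: a duplicate id in B multiplies rows in the join, while EXISTS only asks whether at least one match exists:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE A(id INT);
CREATE TABLE B(id INT);
CREATE TABLE C(id INT);
INSERT INTO A VALUES (1), (2);
INSERT INTO B VALUES (1), (1), (2);  -- note the duplicate id = 1
INSERT INTO C VALUES (1), (2);
""")

# JOIN: id = 1 matches two B rows, so it is counted twice
x = cur.execute("""
SELECT COUNT(a.id)
FROM A a JOIN B b ON a.id = b.id JOIN C c ON b.id = c.id
""").fetchone()[0]

# EXISTS: each A row is counted at most once
y = cur.execute("""
SELECT COUNT(*)
FROM A a
WHERE EXISTS (SELECT 1 FROM B b WHERE b.id = a.id)
  AND EXISTS (SELECT 1 FROM C c WHERE c.id = a.id)
""").fetchone()[0]

print(x, y)  # 3 2
```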

Efficient way to get entire rows with MAX(id) when grouping by a foreign key

Consider tables A, B and C. B and C are related to A through a foreign key, and there are many Bs and Cs with the same A foreign key.
Suppose the following query:
SELECT
    A.pk AS pk_a,
    MAX(B.id) AS new_b,
    MAX(C.id) AS new_c
FROM A
INNER JOIN B ON B.fk_a = pk_a
INNER JOIN C ON C.fk_a = pk_a
GROUP BY pk_a
I would like to retrieve the entire new_b and new_c rows from B and C for each GROUP BY pk_a.
Surely I could wrap this as a subselect and JOIN B ON b.id = new_b, and the same for C, but B and C are huge and I would like to avoid this.
I could also use SELECT DISTINCT ON (A.pk) A.pk, B.*, C.* with ORDER BY A.pk, B.id, C.id, but that would only guarantee the latest B row, not the latest C row.
Is there any other way I'm missing?
For only a few rows (say, 2 to 5 on average, it depends) in B and C per row in A, DISTINCT ON is typically fastest.
For many rows per row in A, there are (much) more efficient solutions. And your information: "B and C are huge" indicates as much.
I suggest LATERAL subqueries with ORDER BY and LIMIT 1, backed by a matching index.
SELECT A.pk AS pk_a, B.*, C.*
FROM A
LEFT JOIN LATERAL (
   SELECT *
   FROM B
   WHERE B.fk_a = A.pk  -- lateral reference
   ORDER BY B.id DESC
   LIMIT 1
   ) B ON true
LEFT JOIN LATERAL (
   SELECT *
   FROM C
   WHERE C.fk_a = A.pk  -- lateral reference
   ORDER BY C.id DESC
   LIMIT 1
   ) C ON true;
Assuming B.id and C.id are NOT NULL.
You need at least indexes on the FK columns. Ideally, multi-column indexes on B (fk_a, id DESC) and C (fk_a, id DESC) though.
Use LEFT JOIN to avoid excluding rows from A that are not referenced in either B or C. It would be an evil trap to use [INNER] JOIN here, since you join to two unrelated tables.
Detailed explanation:
Optimize GROUP BY query to retrieve latest record per user
Related:
Select first row in each GROUP BY group?
What is the difference between LATERAL and a subquery in PostgreSQL?
Simpler syntax with smart naming convention
The result of above query has pk_a once and fk_a twice. Useless ballast - and the same column name twice may be an actual problem, depending on your client.
You can spell out a column list in the outer SELECT (instead of the syntax shortcut A.*, B.*) to avoid redundancies. You may have to do that either way if there are more duplicate names or if you don't want all columns.
But with a smart naming convention, the USING clause can fold the redundant PK and FK columns for you:
SELECT *
FROM A
LEFT JOIN LATERAL (
   SELECT * FROM B
   WHERE B.a_id = A.a_id
   ORDER BY B.id DESC
   LIMIT 1
   ) B USING (a_id)
LEFT JOIN LATERAL (
   SELECT * FROM C
   WHERE C.a_id = A.a_id
   ORDER BY C.id DESC
   LIMIT 1
   ) C USING (a_id);
Logically, USING (a_id) is redundant here, since WHERE B.a_id = A.a_id in the subquery already filters the same way. But the additional effect of USING is that joining columns are folded to one instance. So only one a_id remains in the result. The manual:
Furthermore, the output of JOIN USING suppresses redundant columns:
there is no need to print both of the matched columns, since they must
have equal values. While JOIN ON produces all columns from T1 followed
by all columns from T2, JOIN USING produces one output column for each
of the listed column pairs (in the listed order), followed by any
remaining columns from T1, followed by any remaining columns from T2.
It also typically makes a lot of sense to use the same name for the same data. So: a_id for PK and FK columns.
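The column-folding behavior of USING is easy to verify outside Postgres too; a minimal sketch in SQLite via Python (with a plain join rather than LATERAL, and made-up tables), inspecting the result column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE t1(a_id INT, x TEXT);
CREATE TABLE t2(a_id INT, y TEXT);
INSERT INTO t1 VALUES (1, 'x1');
INSERT INTO t2 VALUES (1, 'y1');
""")

# JOIN ... ON keeps both copies of the join column (4 result columns)
on_cols = [d[0] for d in cur.execute(
    "SELECT * FROM t1 JOIN t2 ON t1.a_id = t2.a_id").description]

# JOIN ... USING folds them into one (3 result columns)
using_cols = [d[0] for d in cur.execute(
    "SELECT * FROM t1 JOIN t2 USING (a_id)").description]

print(on_cols)
print(using_cols)  # ['a_id', 'x', 'y']
```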
Is this what you are asking for?
SELECT abc.*
FROM (SELECT A.pk AS pk_a, b.*, c.*,
             ROW_NUMBER() OVER (PARTITION BY a.pk ORDER BY b.id DESC) as seqnum_b,
             ROW_NUMBER() OVER (PARTITION BY a.pk ORDER BY c.id DESC) as seqnum_c
      FROM A INNER JOIN
           B
           ON B.fk_a = pk_a INNER JOIN
           C
           ON C.fk_a = pk_a
     ) abc
WHERE seqnum_b = 1 or seqnum_c = 1;
Actually, I think the above is on the right track, but you probably want:
SELECT a.pk, b.*, c.*
FROM A INNER JOIN
     (SELECT DISTINCT ON (b.fk_a) b.*
      FROM b
      ORDER BY b.fk_a, b.id DESC
     ) b
     ON b.fk_a = a.pk JOIN
     (SELECT DISTINCT ON (c.fk_a) c.*
      FROM c
      ORDER BY c.fk_a, c.id DESC
     ) c
     ON c.fk_a = a.pk;
In Postgres 9.5, you can also use lateral joins for a similar effect.
How about this:
SELECT DISTINCT
    A.pk AS pk_a,
    MAX(B.id) OVER (PARTITION BY pk_a) AS new_b,
    MAX(C.id) OVER (PARTITION BY pk_a) AS new_c
FROM A
INNER JOIN B ON B.fk_a = pk_a
INNER JOIN C ON C.fk_a = pk_a
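For databases that lack DISTINCT ON and LATERAL, the same latest-row-per-group effect can be had with ROW_NUMBER in derived tables. A sketch in SQLite via Python, with made-up data (the table and column names mirror the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE A(pk INT);
CREATE TABLE B(id INT, fk_a INT);
CREATE TABLE C(id INT, fk_a INT);
INSERT INTO A VALUES (1), (2);
INSERT INTO B VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO C VALUES (100, 1), (200, 2), (201, 2);
""")

# rn = 1 marks the row with the highest id per fk_a;
# LEFT JOIN keeps A rows even if B or C has no match.
rows = cur.execute("""
SELECT A.pk, b.id, c.id
FROM A
LEFT JOIN (SELECT *, ROW_NUMBER() OVER (PARTITION BY fk_a ORDER BY id DESC) AS rn
           FROM B) b
  ON b.fk_a = A.pk AND b.rn = 1
LEFT JOIN (SELECT *, ROW_NUMBER() OVER (PARTITION BY fk_a ORDER BY id DESC) AS rn
           FROM C) c
  ON c.fk_a = A.pk AND c.rn = 1
ORDER BY A.pk
""").fetchall()
print(rows)  # [(1, 11, 100), (2, 20, 201)]
```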

Redshift Join VS. Union with Group By

Let's say I would like to pull the fields dim, a, b, c, d from 2 tables, one of which contains a, b and the other c, d.
I'm wondering if there's a preferred way (between the following) to do it, performance-wise:
1:
select t1.dim, a, b, c, d
from (select dim, sum(a) as a, sum(b) as b from t1 group by dim) t1
join (select dim, sum(c) as c, sum(d) as d from t2 group by dim) t2
  on t1.dim = t2.dim;
2:
select dim, sum(a) as a, sum(b) as b, sum(c) as c, sum(d) as d
from (select dim, a, b, null as c, null as d from t1
      union
      select dim, null as a, null as b, c, d from t2
     ) a
group by dim
This is, of course, for handling a large amount of data (5-30M records in the final query).
Thanks!
The first method would filter out any dim values that are not in both tables. And union (as opposed to union all) is inefficient, because it incurs the overhead of removing duplicates. So, neither is appealing.
I would go for:
select dim, sum(a) as a, sum(b) as b, sum(c) as c, sum(d) as d
from (select dim, a, b, null as c, null as d from t1
union all
select dim, null as a, null as b, c, d from t2
) a
group by dim;
You could also pre-aggregate the values in each subquery. Or use full outer join for the first method.
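A quick runnable sketch of the union all + group by pattern, in SQLite via Python with made-up data. Note that a dim present in only one table survives, with NULL sums for the other table's columns (this is the behavior the join version would lose):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE t1(dim TEXT, a INT, b INT);
CREATE TABLE t2(dim TEXT, c INT, d INT);
INSERT INTO t1 VALUES ('x', 1, 2), ('x', 3, 4), ('y', 5, 6);
INSERT INTO t2 VALUES ('x', 10, 20);  -- no 'y' row in t2
""")

rows = cur.execute("""
SELECT dim, SUM(a) AS a, SUM(b) AS b, SUM(c) AS c, SUM(d) AS d
FROM (SELECT dim, a, b, NULL AS c, NULL AS d FROM t1
      UNION ALL
      SELECT dim, NULL, NULL, c, d FROM t2
     ) u
GROUP BY dim
ORDER BY dim
""").fetchall()
print(rows)  # [('x', 4, 6, 10, 20), ('y', 5, 6, None, None)]
```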

one to one distinct restriction on selection

I encountered a problem like this. There are two tables (the x values are ordered in an increasing trend):
Table A
id x
1 1
1 3
1 4
1 7
Table B
id x
1 2
1 5
I want to join these two tables:
1) on the condition that the ids are equal, and
2) each row of A should be matched to only one row of B, and vice versa (a one-to-one relationship), based on the absolute difference of the x values (a pair with a smaller difference has higher priority to match).
The description above alone is not sufficient: if two candidate pairs that share a common row in one of the tables have the same difference, there is no way to decide which one goes first. So define A as the "main" table: the row in table A with the smaller line number always goes first.
Expected result of demo:
id A.x B.x abs_diff
1 1 2 1
1 4 5 1
End of table (the two extra rows in A shouldn't be matched, because of the one-to-one rule)
I am using PostgreSQL, so what I have tried is DISTINCT ON, but it cannot solve this.
select distinct on (A.x) id,A.x,B.x,abs_diff
from
(A join B
on A.id=B.id)
order by A.x,greatest(A.x,B.x)-least(A.x,B.x)
Do you have any ideas? It seems to be tricky in plain SQL.
Try:
select a.id, a.x as ax, b.x as bx, x.min_abs_diff
from table_a a
join table_b b
  on a.id = b.id
join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
      from table_a a
      join table_b b
        on a.id = b.id
      group by a.id) x
  on x.id = a.id
 and abs(a.x - b.x) = x.min_abs_diff
fiddle: http://sqlfiddle.com/#!15/ab5ae/5/0
Although it doesn't match your expected output, I think the output is correct based on what you described, as you can see each pair has a difference with an absolute value of 1.
Edit - Try the following, based on order of a to b:
select *
from (select a.id,
             a.x as ax,
             b.x as bx,
             x.min_abs_diff,
             row_number() over (partition by a.id, b.x order by a.id, a.x) as rn
      from table_a a
      join table_b b
        on a.id = b.id
      join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
            from table_a a
            join table_b b
              on a.id = b.id
            group by a.id) x
        on x.id = a.id
       and abs(a.x - b.x) = x.min_abs_diff) x
where x.rn = 1
Fiddle: http://sqlfiddle.com/#!15/ab5ae/19/0
One possible solution for your currently ambiguous question:
SELECT *
FROM (
   SELECT id, x AS a, lead(x) OVER (PARTITION BY grp ORDER BY x) AS b
   FROM (
      SELECT *, count(tbl) OVER (PARTITION BY id ORDER BY x) AS grp
      FROM (
         SELECT TRUE AS tbl, * FROM table_a
         UNION ALL
         SELECT NULL, * FROM table_b
         ) x
      ) y
   ) z
WHERE b IS NOT NULL
ORDER BY 1, 2, 3;
This way, every a.x is assigned the next bigger (or same) b.x, unless there is another a.x that is still smaller than the next b.x (or the same).
Produces the requested result for the demo case. Not sure about various ambiguous cases.
SQL Fiddle.
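Outside SQL, the stated matching rule (smallest absolute difference first, ties broken by A's row order, each row used at most once) is a straightforward greedy procedure, which can help check what the correct output should be. A Python sketch, assuming the rule exactly as described in the question:

```python
def greedy_one_to_one(a_xs, b_xs):
    """Match each A row to at most one B row (and vice versa),
    smallest absolute difference first; ties go to the earlier A row."""
    # All candidate pairs, sorted by (difference, A position, B position)
    candidates = sorted(
        (abs(ax - bx), i, j)
        for i, ax in enumerate(a_xs)
        for j, bx in enumerate(b_xs)
    )
    used_a, used_b, matches = set(), set(), []
    for diff, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((a_xs[i], b_xs[j], diff))
    return sorted(matches)

# Demo data from the question (id = 1 throughout)
print(greedy_one_to_one([1, 3, 4, 7], [2, 5]))  # [(1, 2, 1), (4, 5, 1)]
```

This reproduces the expected result: A's 3 also has a difference of 1 to B's 2, but A's 1 appears earlier, so it wins that tie, and 3 then has no partner left.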

SQL: select multiple columns based on multiple groups of minimum values?

The following query gives me a single row because b.id is pinned. I would like a query which I can give a group of ids and get the minimum valued row for each of them.
The effect I want is as if I wrapped this query in a loop over a collection of ids and executed the query with each id as b.id = value but that will be (tens of?) thousands of queries.
select top 1 a.id, b.id, a.time_point, b.time_point
from orientation_momentum a, orientation_momentum b
where a.id = '00820001001' and b.id = '00825001001'
order by calculatedValue() asc
This is on sql-server but I would prefer a portable solution if it's possible.
SQL Server ranking function should do the trick.
select * from (
    select a.id, b.id, a.time_point, b.time_point,
           rank() over (partition by a.id, b.id
                        order by calculatedValue() asc) ranker
    from orientation_momentum a, orientation_momentum b
    where a.id = '00820001001' and b.id between 10 and 20
) Z where ranker = 1
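The ranking idea is portable to any database with window functions. A minimal sketch in SQLite via Python, using a single table, an ordinary column in place of the undefined calculatedValue(), and one id per partition (made-up data): it returns the minimum-valued row per id in one query, with no client-side loop over ids.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE orientation_momentum(id TEXT, time_point INT);
INSERT INTO orientation_momentum VALUES
    ('00820001001', 5), ('00820001001', 3),
    ('00825001001', 7), ('00825001001', 2);
""")

# ranker = 1 selects the row with the smallest time_point per id
rows = cur.execute("""
SELECT id, time_point FROM (
    SELECT id, time_point,
           RANK() OVER (PARTITION BY id ORDER BY time_point ASC) AS ranker
    FROM orientation_momentum
) z
WHERE ranker = 1
ORDER BY id
""").fetchall()
print(rows)  # [('00820001001', 3), ('00825001001', 2)]
```

Note that RANK() returns multiple rows per group on ties; use ROW_NUMBER() if exactly one row per group is required.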