one to one distinct restriction on selection - sql

I encountered a problem like that. There are two tables (x value is ordered so that
in a incremental trend !)
Table A
id x
1 1
1 3
1 4
1 7
Table B
id x
1 2
1 5
I want to join these two tables:
1) on the condition of the equality of id and
2) each row of A should be matched only to one row of B, vice verse (one to one relationship) based on the absolute difference of x value (small difference row has
more priority to match).
Only based on the description above it is not a clear description because if two pairs of row which share a common row in one of the table have the same difference, there is no way to decide which one goes first. So define A as "Main" table, the row in table A with smaller line number always go first
Expected result of demo:
id A.x B.x abs_diff
1 1 2 1
1 4 5 1
End of table(two extra rows in A shouldn't be considered, because one to one rule)
I am using PostgreSQL so the thing I have tried is DISTINCT ON, but it can not solve.
select distinct on (A.x) id,A.x,B.x,abs_diff
from
(A join B
on A.id=B.id)
order by A.x,greatest(A.x,B.x)-least(A.x,B.x)
Do you have any ideas, it seems to be tricky in plain SQL.

Try:
select a.id, a.x as ax, b.x as bx, x.min_abs_diff
from table_a a
join table_b b
on a.id = b.id
join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
from table_a a
join table_b b
on a.id = b.id
group by a.id) x
on x.id = a.id
and abs(a.x - b.x) = x.min_abs_diff
fiddle: http://sqlfiddle.com/#!15/ab5ae/5/0
Although it doesn't match your expected output, I think the output is correct based on what you described, as you can see each pair has a difference with an absolute value of 1.
Edit - Try the following, based on order of a to b:
select *
from (select a.id,
a.x as ax,
b.x as bx,
x.min_abs_diff,
row_number() over(partition by a.id, b.x order by a.id, a.x) as rn
from table_a a
join table_b b
on a.id = b.id
join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
from table_a a
join table_b b
on a.id = b.id
group by a.id) x
on x.id = a.id
and abs(a.x - b.x) = x.min_abs_diff) x
where x.rn = 1
Fiddle: http://sqlfiddle.com/#!15/ab5ae/19/0

One possible solution for your currently ambiguous question:
SELECT *
FROM (
SELECT id, x AS a, lead(x) OVER (PARTITION BY grp ORDER BY x) AS b
FROM (
SELECT *, count(tbl) OVER (PARTITION BY id ORDER BY x) AS grp
FROM (
SELECT TRUE AS tbl, * FROM table_a
UNION ALL
SELECT NULL, * FROM table_b
) x
) y
) z
WHERE b IS NOT NULL
ORDER BY 1,2,3;
This way, every a.x is assigned the next bigger (or same) b.x, unless there is another a.x that is still smaller than the next b.x (or the same).
Produces the requested result for the demo case. Not sure about various ambiguous cases.
SQL Fiddle.

Related

Subqueries vs Multi Table Join

I've 3 tables A, B, C. I want to list the intersection count.
Way 1:-
select count(id) from A a join B b on a.id = b.id join C c on B.id = C.id;
Result Count - X
Way 2:-
SELECT count(id) FROM A WHERE id IN (SELECT id FROM B WHERE id IN (SELECT id FROM C));
Result Count - Y
The result count in each of the query is different. What exactly is wrong?
A JOIN can multiply the number of rows as well as filtering out rows.
In this case, the second count should be the correct one because nothing is double counted -- assuming id is unique in a. If not, it needs count(distinct a.id).
The equivalent using JOIN would use COUNT(DISTINCT):
select count(distinct a.id)
from A a join
B b
on a.id = b.id join
C c
on B.id = C.id;
I mention this for completeness but do not recommend this approach. Multiplying the number of rows just to remove them using distinct is inefficient.
In many databases, the most efficient method might be:
select count(*)
from a
where exists (select 1 from b where b.id = a.id) and
exists (select 1 from c where c.id = a.id);
Note: This assumes there are indexes on the id columns and that id is unique in a.

SQL : matching two tables with two possibles conditions

is there a way in SQL while we join two tables table_A and table_B, if we can’t match the two tables on a criteria said criteria_X at all we will try the second criteria criteria_Y
Something like this:
select *
from table_A, table_B
where table_A.id = table_B.id2
and (if there is no row where table_B.criteria_X = X then try table_B.criteria_Y = Y)
The following query is not a solution:
..
and (table_B.criteria_X = X OR table_B.criteria_Y = Y)
Thanks
This is a find the best match query:
select *
from
(
select *,
row_number() -- based on priority, #1 criteria_X, #2 criteria_Y
over (partition by table_A.id
order by case when table_B.criteria_X = X then 1
else 2
end) as best_match
from table_A, table_B
where table_A.id = table_B.id2
and (table_B.criteria_X = X OR table_B.criteria_Y = Y)
) dt
where best_match = 1
If the ORed condition results in loosing indexed access you might try splitting it into two UNION ALL selects.
A typical method uses left join twice . . . once for each criterion. Then then uses coalesce() in the select. And, with indexes on the join keys, this also should have very good performance:
select a.*, coalesce(b1.colx, b2.colx)
from table_A a left join
table_B b1
on a.id = b1.id2 and b1.criteria_X = X left join
table_B b2
on a.id = b1.id2 and b2.criteria_Y = Y
where b1.id2 is not null or b2.id2 is not null;
The where clause ensures that at least one row matches.
This does not work under all circumstances -- in particular, each join needs to return only 0 or 1 matching rows. This is often the situation with this type of "priority" joins.
An alternative version uses row_number(). This is sort of similar to #dnoeth's approach, but the row number calculation is done before the join:
select a.*, coalesce(b1.colx, b2.colx)
from table_A a join
(select b.*,
row_number() over (partition by id2
order by (case when criteria_x = X then 1
when criteria_y = Y then 2
end)
) as seqnum
from table_B b
where criteria_x = X or criteria_y = Y
) b
on a.id = b.id2 and seqnum = 1

Select table as json array

I have three tables in PostgreSQL: A, B, C.
I want to get a row from table A with a specific id, plus all records from tables B and C with matching id as aggregated JSON.
For example:
Table A Table B Table C
---------------------------------------------------------------
id / colum1 / colum2 id/ colum 1 id / column1
1 someValue, somValue 1 someVal1 1 someVal1
1 someVal2 1 someVal2
The expected output for id = 1 would be:
a.column1 a.column2 ARRAY_JSON_B ARRAY_JSON_C
------------------------------------------------------------------------------
someValue someValue [{colum1:'someVal1'}, [{colum1:'someVal1'},
{colum1:'someVal2'}] {colum1:'someVal2'}]
This requires Postgres 9.3 or later.
Simple case
I suggest to use the simpler json_agg() that's meant for this purpose, in LATERAL joins:
SELECT *
FROM a
LEFT JOIN LATERAL (SELECT json_agg(b) AS array_json_b FROM b WHERE id = a.id) b ON true
LEFT JOIN LATERAL (SELECT json_agg(c) AS array_json_c FROM c WHERE id = a.id) c ON true
WHERE id = 1;
LEFT JOIN LATERAL ... ON true keeps rows in the result that have no match on the left side of the join. Details:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Subtle difference: This query returns NULL where no match is found in b or c, #stas' query with correlated subqueries returns an empty array instead. May or may not be important.
Actual answer
Your example in the question excludes the redundant id column in b and c from the result - which makes sense. To achieve this, you can't use #stas' simple correlated subquery. While it would still work for a single column instead of the whole row, it would lose the column name and produce a simple array. Also, it would not work for more than one column.
Use json_object_agg() for a single selected column (which also allows to chose the tag name freely):
SELECT *
FROM a
LEFT JOIN LATERAL (
SELECT json_object_agg('colum1', colum1) AS array_json_b
FROM b WHERE id = a.id
) b ON true
LEFT JOIN LATERAL (
SELECT json_object_agg('colum1', colum1) AS array_json_c
FROM c WHERE id = a.id
) c ON true
WHERE id = 1;
Or use a subselect for any selection (col1 and col2 in this example):
SELECT *
FROM a
LEFT JOIN LATERAL (
SELECT json_agg(x) AS array_json_b
FROM (SELECT col1, col2 FROM b WHERE id = a.id) x
) b ON true
LEFT JOIN LATERAL (
SELECT json_agg(x) AS array_json_c
FROM (SELECT col1, col2 FROM c WHERE id = a.id) x
) c ON true
WHERE id = 1;
Related:
Return multiple columns of the same row as JSON array of objects
How do I return a jsonb array and array of objects from my data?
select
a.*,
to_json(array(select b from b where b.id = a.id)) array_json_b,
to_json(array(select c from c where c.id = a.id)) array_json_c
from a
where
a.id = 1;
I hope your Postgresql version is 9.3 or higher. There is a clever function to_json which can convert anything to json. So we take an array of all related rows from b and convert it. Same with c.

Combining four tables in SQL Server

I have four tables Table A, Table B, Table C and Table D. The schema of all four tables are identical. I need to union these four tables in the following way:
If a record is present in Table A then that is considered in the output table.
If a record is present in Table B then it is considered in the output table ONLY if it is not present in Table A.
If a record is present in Table C then it is considered ONLY if it is not present in Table A and Table B.
If a record is present in Table D then it is considered ONLY if it is not present in Table A, Table B, and Table C.
Note -
Every table has a column which identifies the table itself for every record (I don't know if this is of any importance)
Records are identified based on a particular column - Column X which is not unique even within each table
You could do something like (only two cases shown but you should see how to extend this)
WITH CTE1 AS
(
SELECT 't1' as Source, X, Y
FROM t1
UNION ALL
SELECT 't2' as Source, X, Y
FROM t2
), CTE2 AS
(
SELECT *,
RANK() OVER (PARTITION BY X
ORDER BY CASE Source
WHEN 't1' THEN 1
WHEN 't2' THEN 2
END) As RN
FROM CTE1
)
SELECT X,Y
FROM CTE2
WHERE RN=1
I would be inclined to do this using not exists:
select a.*
from a
union all
select b.*
from b
where not exists (select 1 from a where a.x = b.x)
union all
select c.*
from c
where not exists (select 1 from a where a.x = c.x) and
not exists (select 1 from b where b.x = c.x)
union all
select d.*
from d
where not exists (select 1 from a where a.x = d.x) and
not exists (select 1 from b where b.x = d.x) and
not exists (select 1 from c where c.x = d.x);
If you have an index on the x column in each table, then this should be the fastest method.
This will work as long as there are no NULL columns, or if columns for a record that exists in table with higher precedence are NULL you can assume the same column will NULL in tables with lower precedence.
SELECT coalesce(a.column1, b.column1, c.column1, d.column1) column1
,coalesce(a.column2, b.column2, c.column2, d.column2) column2
,coalesce(a.column3, b.column3, c.column3, d.column3) column3
--...
,coalesce(a.columnN, b.columnN, c.columnN, d.columnN) columnN
FROM TableA a
FULL JOIN TableB b on b.ColumnX = a.ColumnX
FULL JOIN TableC c on c.ColumnX = a.ColumnX or c.ColumnX = b.ColumnX
FULL JOIN TableD d on d.ColumnX = a.ColumnX or d.ColumnX = b.ColumnX or d.ColumnX = c.ColumnX
If the NULL values matter, you can switch to a more-complicated (and likely slower) CASE version:
CASE WHEN a.columnX IS NOT NULL THEN a.column1
WHEN b.columnX IS NOT NULL THEN b.column1
WHEN c.columnX IS NOT NULL THEN c.column1
WHEN d.columnX IS NOT NULL THEN d.column1 END column1
Of course, you can mix and match, so columns that are not nullable can use the former syntax, and columns where NULL values matter use the latter.
Hopefully the purpose of this is to fix the broken schema and put this data all in the same table, where it belongs.
This might seem stupid, but if, by any chance, you can leave out the table-identifying column and you also want to eliminate duplicate records (from within one table) too then the most straightforward answer would be
select <all columns without table identifier> from tableA
union
select <all columns without table identifier> from tableB
union
select <all columns without table identifier> from tableC
...
This is exactly, what union was designed to do: add rows only if they do not already exist before.

SQL JOIN or UNION or what?

I have 3 tables named A, B and C. Table A have column a. Table B has column a, b. Table C has column a, c. These tables contains data like below:
I want to get all data from A,B,C where a=1
My desired output should look like below:
But I'm getting result from SSMS like below:
How should I refactor my SQL to get my desired output?
e.g. I don't want repeated values in columns
You need to join these values, but not only on the a value but on the position. SQL tables represent unordered sets. I am going to assume that the b and c columns represent the ordering.
select a.a, b.b, c.c
from (select a.*, row_number() over (order by a) as seqnum
from a
) a full outer join
(select b.*, row_number() over (partition by a order by b) as seqnum
from b
) b
on a.a = b.a and a.seqnum = b.seqnum full outer join
(select c.*, row_number() over (partition by a order by c) as seqnum
from c
) c
on c.a = coalesce(a.a, b.a) and c.seqnum = coalesce(a.seqnum, b.seqnum)
where coalesce(a.a, b.a, c.a) = 1;