SQL Server: grouping by 2 variables, keeping all distinct

I made several attempts at googling a solution to this, but I'm having a hard time coming up with keywords for an accurate search.
Say I have a table with the following information.
A, B
1, 1
1, 2
2, 1
If I perform a group by operation on both columns A, B, I'll get a table indexed by the same set, but I'm interested in something of the form:
A, B, nRecords
1, 1, 1
1, 2, 1
2, 1, 1
2, 2, 0
Query:
SELECT
A, B, COUNT(*) nRecords
FROM
table
GROUP BY
A, B
will not include a row for the A = 2, B = 2 case. Any thoughts on moving forward? This needs to generalize to a large number of distinct values in both columns.

select a.A, b.B, count(t.A) as nRecords -- count(t.A) ignores the NULLs of unmatched pairs, so they get 0
from
(select distinct A from T) as a
cross join
(select distinct B from T) as b
left outer join T as t on t.A = a.A and t.B = b.B
group by a.A, b.B
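To check the idea against the sample data from the question, here is a minimal sketch that runs the same cross join / left join pattern in SQLite through Python's sqlite3 module. SQLite stands in for SQL Server here, and the table name T and the aliases da, db and rec are assumptions for illustration. The point it shows is that counting a column of the left-joined table (rather than `*`) yields 0 for combinations that have no rows.

```python
import sqlite3

# Hypothetical table T holding the three sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER, B INTEGER)")
conn.executemany("INSERT INTO T VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

query = """
SELECT da.A, db.B, COUNT(rec.A) AS nRecords  -- NULLs from unmatched pairs are not counted
FROM (SELECT DISTINCT A FROM T) AS da
CROSS JOIN (SELECT DISTINCT B FROM T) AS db
LEFT JOIN T AS rec ON rec.A = da.A AND rec.B = db.B
GROUP BY da.A, db.B
ORDER BY da.A, db.B
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With COUNT(*) instead, the (2, 2) pair would be reported as 1, because the left join still produces one (all-NULL) row for it.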

Related

Redshift Join VS. Union with Group By

Let's say I would like to pull the fields dim,a,b,c,d from 2 tables, one of which contains a,b and the other c,d.
I'm wondering if there's a preferred way (between the following) to do it, performance-wise:
1:
select t1.dim,a,b,c,d
from
(select dim,sum(a) as a,sum(b)as b from t1 group by dim)t1
join
(select dim,sum(c) as c,sum(d) as d from t2 group by dim)t2
on t1.dim=t2.dim;
2:
select dim,sum(a) as a,sum(b) as b,sum(c) as c,sum(d) as d
from
(
select dim,a,b,null as c, null as d from t1
union
select dim,null as a, null as b, c, d from t2
)a
group by dim
Of course when handling a large amount of data (5-30M records at the final query).
Thanks!
The first method filters out any dim values that are not in both tables. union is inefficient because it has to remove duplicates (and removing duplicate rows would also change the sums). So, neither is appealing.
I would go for:
select dim, sum(a) as a, sum(b) as b, sum(c) as c, sum(d) as d
from (select dim, a, b, null as c, null as d from t1
union all
select dim, null as a, null as b, c, d from t2
) a
group by dim;
You could also pre-aggregate the values in each subquery. Or use full outer join for the first method.
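The difference between the variants is easy to see on a tiny data set. The following is only a sketch: it uses SQLite via Python's sqlite3 rather than Redshift, the sample rows are made up, and it says nothing about performance on 5-30M records. It only shows which rows each form returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (dim TEXT, a INTEGER, b INTEGER);
CREATE TABLE t2 (dim TEXT, c INTEGER, d INTEGER);
-- made-up rows: 'x' is in both tables (twice in t1), 'y' only in t1, 'z' only in t2
INSERT INTO t1 VALUES ('x', 1, 2), ('x', 1, 2), ('y', 5, 6);
INSERT INTO t2 VALUES ('x', 10, 20), ('z', 30, 40);
""")

union_all = """
SELECT dim, SUM(a) AS a, SUM(b) AS b, SUM(c) AS c, SUM(d) AS d
FROM (SELECT dim, a, b, NULL AS c, NULL AS d FROM t1
      UNION ALL
      SELECT dim, NULL AS a, NULL AS b, c, d FROM t2) AS u
GROUP BY dim
ORDER BY dim
"""
inner_join = """
SELECT s1.dim, a, b, c, d
FROM (SELECT dim, SUM(a) AS a, SUM(b) AS b FROM t1 GROUP BY dim) AS s1
JOIN (SELECT dim, SUM(c) AS c, SUM(d) AS d FROM t2 GROUP BY dim) AS s2
  ON s1.dim = s2.dim
ORDER BY s1.dim
"""
union_all_rows = conn.execute(union_all).fetchall()
inner_join_rows = conn.execute(inner_join).fetchall()
# Plain UNION collapses the two identical ('x', 1, 2) rows before summing.
union_rows = conn.execute(union_all.replace("UNION ALL", "UNION")).fetchall()

print(union_all_rows)   # every dim from either table
print(inner_join_rows)  # only dims present in both tables
print(union_rows[0])    # sums for 'x' after duplicate removal
```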

2 equals select expression within select

I have to process my data with a Levenshtein function.
For this I'm using nested selects:
SELECT levenshtein(a.param, b.param), *
FROM (
SELECT 5 fields
FROM table t
JOIN x
JOIN y
GROUP BY 1, 2, 3
) a,
(
SELECT 5 fields
FROM table t
JOIN x
JOIN y
GROUP BY 1, 2, 3
) b
Is there a way to avoid duplicating the inner SELECT?
The solution is pretty simple, thanks to @Nikarus for suggesting a WITH expression:
WITH subtable AS (
SELECT 5 fields
FROM table t
JOIN x
JOIN y
GROUP BY 1, 2, 3
)
SELECT levenshtein(a.param, b.param), *
FROM subtable a, subtable b
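As a runnable illustration of the WITH approach, here is a small sketch using SQLite through Python's sqlite3. Several things are assumptions: the table words and its single column param are hypothetical stand-ins for the elided inner query, and since SQLite has no built-in levenshtein, a plain Python implementation is registered as a SQL function under that name.

```python
import sqlite3

def levenshtein(s, t):
    """Edit distance via the classic row-by-row dynamic programme."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)  # expose it to SQL
conn.execute("CREATE TABLE words (param TEXT)")      # hypothetical source table
conn.executemany("INSERT INTO words VALUES (?)",
                 [("kitten",), ("sitting",), ("kitten",)])

query = """
WITH subtable AS (              -- the inner SELECT is written only once
    SELECT param FROM words GROUP BY param
)
SELECT a.param, b.param, levenshtein(a.param, b.param) AS dist
FROM subtable a CROSS JOIN subtable b
ORDER BY a.param, b.param
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```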

one to one distinct restriction on selection

I ran into the following problem. There are two tables (the x values are sorted
in increasing order):
Table A
id x
1 1
1 3
1 4
1 7
Table B
id x
1 2
1 5
I want to join these two tables:
1) on the condition of the equality of id and
2) each row of A should be matched to only one row of B, and vice versa (a one-to-one relationship), based on the absolute difference of the x values (a pair with a smaller difference has
higher priority to match).
On its own this description is ambiguous: if two candidate pairs share a row in one of the tables and have the same difference, there is no way to decide which one goes first. So define A as the "main" table: the row in table A with the smaller line number always goes first.
Expected result of demo:
id A.x B.x abs_diff
1 1 2 1
1 4 5 1
End of table (the two extra rows in A shouldn't be considered, because of the one-to-one rule).
I am using PostgreSQL, so I tried DISTINCT ON, but it does not solve the problem.
select distinct on (A.x) id,A.x,B.x,abs_diff
from
(A join B
on A.id=B.id)
order by A.x,greatest(A.x,B.x)-least(A.x,B.x)
Do you have any ideas? It seems to be tricky in plain SQL.
Try:
select a.id, a.x as ax, b.x as bx, x.min_abs_diff
from table_a a
join table_b b
on a.id = b.id
join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
from table_a a
join table_b b
on a.id = b.id
group by a.id) x
on x.id = a.id
and abs(a.x - b.x) = x.min_abs_diff
fiddle: http://sqlfiddle.com/#!15/ab5ae/5/0
Although it doesn't match your expected output, I think the output is correct based on what you described: as you can see, each pair has an absolute difference of 1.
Edit - Try the following, based on order of a to b:
select *
from (select a.id,
a.x as ax,
b.x as bx,
x.min_abs_diff,
row_number() over(partition by a.id, b.x order by a.id, a.x) as rn
from table_a a
join table_b b
on a.id = b.id
join (select a.id, min(abs(a.x - b.x)) as min_abs_diff
from table_a a
join table_b b
on a.id = b.id
group by a.id) x
on x.id = a.id
and abs(a.x - b.x) = x.min_abs_diff) x
where x.rn = 1
Fiddle: http://sqlfiddle.com/#!15/ab5ae/19/0
One possible solution for your currently ambiguous question:
SELECT *
FROM (
SELECT id, x AS a, lead(x) OVER (PARTITION BY id, grp ORDER BY x) AS b
FROM (
SELECT *, count(tbl) OVER (PARTITION BY id ORDER BY x) AS grp
FROM (
SELECT TRUE AS tbl, * FROM table_a
UNION ALL
SELECT NULL, * FROM table_b
) x
) y
) z
WHERE b IS NOT NULL
ORDER BY 1,2,3;
This way, every a.x is assigned the next bigger (or same) b.x, unless there is another a.x that is still smaller than the next b.x (or the same).
Produces the requested result for the demo case. Not sure about various ambiguous cases.
SQL Fiddle.
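For anyone who wants to replay the window-function approach locally, here is a sketch that runs it on the demo tables in SQLite via Python's sqlite3 (window functions need SQLite 3.25 or newer). Two details are assumptions that differ from the PostgreSQL text above: the literal 1 replaces TRUE, and lead() is partitioned by id as well as grp so that groups from different ids cannot mix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER, x INTEGER);
CREATE TABLE table_b (id INTEGER, x INTEGER);
INSERT INTO table_a VALUES (1, 1), (1, 3), (1, 4), (1, 7);
INSERT INTO table_b VALUES (1, 2), (1, 5);
""")

query = """
SELECT *
FROM (
    -- within a group, the a-row comes first; lead() picks the next x after it
    SELECT id, x AS a, lead(x) OVER (PARTITION BY id, grp ORDER BY x) AS b
    FROM (
        -- count(tbl) only counts rows from table_a, so each a-row starts a new group
        SELECT *, count(tbl) OVER (PARTITION BY id ORDER BY x) AS grp
        FROM (
            SELECT 1 AS tbl, id, x FROM table_a
            UNION ALL
            SELECT NULL, id, x FROM table_b
        ) AS u
    ) AS g
) AS m
WHERE b IS NOT NULL
ORDER BY 1, 2, 3
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the demo data the groups are {1, 2}, {3}, {4, 5} and {7}, so only a = 1 and a = 4 find a following b value. As noted above, inputs with ties or with several consecutive b values are not handled reliably.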

Where one or another column exists in a sub select

I'm looking to do something like this:
SELECT a, b, c, d FROM someTable
WHERE a in (SELECT testA FROM otherTable);
Only I want to test whether either of 2 columns exists in a sub select that returns 2 columns.
SELECT a, b, c, d FROM someTable
WHERE a OR b in (SELECT testA, testB FROM otherTable);
We are using MS SQL Server 2012.
Try this:
SELECT a, b, c, d
FROM someTable
WHERE a IN (SELECT testA FROM otherTable)
OR b IN (SELECT testB FROM otherTable)
or
SELECT a, b, c, d
FROM someTable
WHERE EXISTS
(SELECT NULL
FROM otherTable
WHERE testA = a OR testB = a
OR testA = b OR testB = b)
UPDATE:
If performance is poor, you may need to add an index on the testB column.
Another option on MS SQL is CROSS APPLY:
SELECT a, b, c, d
FROM someTable ST
CROSS APPLY (
SELECT TOP (1) 1 AS matched -- TOP (1) keeps an outer row from repeating once per match
FROM otherTable OT
WHERE OT.testA = ST.a OR OT.testB = ST.b
) AS m -- the applied subquery needs an alias and a named column
If none of this works, try using UNION. UNION often performs better than OR:
SELECT a, b, c, d
FROM someTable
WHERE a IN (SELECT testA FROM otherTable)
UNION
SELECT a, b, c, d
FROM someTable
WHERE b IN (SELECT testB FROM otherTable)
UPDATE 2:
For further reading on OR and UNION differences
Why is UNION faster than an OR statement
Try this..
SELECT a, b, c, d
FROM someTable
WHERE Exists
(
SELECT 1
FROM otherTable
Where a = testA OR b = testB
)
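To compare the EXISTS form with the UNION form on concrete data, here is a small sketch run in SQLite via Python's sqlite3 (not SQL Server, so it shows results only, not performance). The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE someTable (a INTEGER, b INTEGER, c TEXT, d TEXT);
CREATE TABLE otherTable (testA INTEGER, testB INTEGER);
-- made-up rows: row 1 matches on a, row 2 on b, row 3 on both, row 4 on neither
INSERT INTO someTable VALUES (1, 10, 'c1', 'd1'), (2, 20, 'c2', 'd2'),
                             (3, 30, 'c3', 'd3'), (4, 40, 'c4', 'd4');
INSERT INTO otherTable VALUES (1, 99), (98, 20), (3, 30);
""")

exists_query = """
SELECT a, b, c, d FROM someTable
WHERE EXISTS (SELECT 1 FROM otherTable WHERE testA = a OR testB = b)
ORDER BY a
"""
union_query = """
SELECT a, b, c, d FROM someTable WHERE a IN (SELECT testA FROM otherTable)
UNION
SELECT a, b, c, d FROM someTable WHERE b IN (SELECT testB FROM otherTable)
ORDER BY a
"""
exists_rows = conn.execute(exists_query).fetchall()
union_rows = conn.execute(union_query).fetchall()
print(exists_rows)
print(union_rows)
```

Both return rows 1 to 3 here. One difference to keep in mind: UNION removes duplicate result rows, so if someTable itself contained identical rows, the UNION form would return them once while the EXISTS form would keep every copy.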
If I'm understanding your question correctly, LEFT JOIN is probably the way to go here:
SELECT a, b, c, d
FROM TableA ta
LEFT JOIN TableB tb
ON ta.a = tb.a
AND ta.b = tb.b
WHERE tb.a IS NOT NULL
AND tb.c IS NOT NULL
You could also use UNION and INNER JOIN:
SELECT a, b, c, d
FROM someTable
INNER JOIN OtherTable OT on someTable.B = OT.testB
UNION
SELECT a, b, c, d
FROM someTable
INNER JOIN OtherTable OT ON someTable.A= OT.testA
Note that the JOIN approach can be much faster if the joined columns are indexed.
Joins seem to be one option; have you thought about using them with a UNION?
SELECT a, b, c, d
FROM someTable
INNER JOIN OtherTable OT on someTable.B = OT.testB
UNION
SELECT a, b, c, d
FROM someTable
INNER JOIN OtherTable OT ON someTable.A= OT.testA

Multi columns using union query

I would like to create 2 columns of data out of a single union query of 2 tables with same field. I have 2 tables with "Utilizations" field in each table.
I tried the following query but I got an error.
select Utilizations as "Utilizations A", Utilizations as "Utilizations B" from
(select Utilizations as A, 0 as B from TableA
union all
select 0 as A, Utilizations as B from TableB)
First off, you need to alias your subquery, and second, you need to refer to the columns in your outer query as A and B, not Utilizations:
select A as "Utilizations A",
B as "Utilizations B"
from
(select Utilizations as A,
0 as B
from TableA
union all
select 0 as A,
Utilizations as B
from TableB
) AS t
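A quick way to see what the corrected query returns is to run it on a couple of made-up rows. This sketch uses SQLite via Python's sqlite3; the sample values are assumptions, and the original question does not say which database is in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (Utilizations INTEGER);
CREATE TABLE TableB (Utilizations INTEGER);
INSERT INTO TableA VALUES (5), (7);   -- made-up values
INSERT INTO TableB VALUES (3);
""")

query = """
SELECT A AS "Utilizations A", B AS "Utilizations B"
FROM (SELECT Utilizations AS A, 0 AS B FROM TableA
      UNION ALL
      SELECT 0 AS A, Utilizations AS B FROM TableB) AS t
"""
cursor = conn.execute(query)
columns = [col[0] for col in cursor.description]
rows = sorted(cursor.fetchall())  # sorted because the query has no ORDER BY
print(columns)
print(rows)
```

Each source row ends up in its own output row, with 0 in the column that belongs to the other table.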