PostgreSQL aggregate union, intersection and set differences - sql

I have a table of pairs to aggregate as follows:
+---------+----------+
| left_id | right_id |
+---------+----------+
| a       | b        |
| a       | c        |
+---------+----------+
And a table of values like so:
+----+-------+
| id | value |
+----+-------+
| a  | 1     |
| a  | 2     |
| a  | 3     |
| b  | 1     |
| b  | 4     |
| b  | 5     |
| c  | 1     |
| c  | 2     |
| c  | 3     |
| c  | 4     |
+----+-------+
For each pair, I would like to calculate the size of the union, of the intersection, and of both set differences of their value sets, so that the output would look like this:
+---------+----------+-------+--------------+-----------+------------+
| left_id | right_id | union | intersection | left_diff | right_diff |
+---------+----------+-------+--------------+-----------+------------+
| a       | b        | 5     | 1            | 2         | 2          |
| a       | c        | 4     | 3            | 0         | 1          |
+---------+----------+-------+--------------+-----------+------------+
What would be the best way to approach this using PostgreSQL?
UPDATE: here is a rextester link with data https://rextester.com/RWID9864
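For anyone who wants to reproduce this locally, here is a minimal sketch of matching schema and data; the table names pairs and "values" are assumptions taken from the answers below, not necessarily what the rextester link uses:
CREATE TABLE pairs (left_id text, right_id text);
CREATE TABLE "values" (id text, value int);  -- quoted because VALUES is a reserved word

INSERT INTO pairs (left_id, right_id) VALUES
  ('a', 'b'),
  ('a', 'c');

INSERT INTO "values" (id, value) VALUES
  ('a', 1), ('a', 2), ('a', 3),
  ('b', 1), ('b', 4), ('b', 5),
  ('c', 1), ('c', 2), ('c', 3), ('c', 4);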

You can do this with scalar sub-queries.
The UNION can also be expressed as an OR (here written as an IN over both ids), which makes that sub-query somewhat shorter to write; the intersection needs a slightly longer one.
To calculate the "diff" columns, use the EXCEPT operator:
select p.*,
       (select count(distinct value)
        from "values"
        where id in (p.left_id, p.right_id)) as "union",
       (select count(*)
        from (
          select v.value from "values" v where v.id = p.left_id
          intersect
          select v.value from "values" v where v.id = p.right_id
        ) t) as intersection,
       (select count(*)
        from (
          select v.value from "values" v where v.id = p.left_id
          except
          select v.value from "values" v where v.id = p.right_id
        ) t) as left_diff,
       (select count(*)
        from (
          select v.value from "values" v where v.id = p.right_id
          except
          select v.value from "values" v where v.id = p.left_id
        ) t) as right_diff
from pairs p
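If the values table is large, each of these scalar sub-queries probes it by id, so a composite index usually helps; a minimal sketch, using the quoted "values" table name and a made-up index name:
-- lets each scalar sub-query be answered from an index on (id, value)
CREATE INDEX values_id_value_idx ON "values" (id, value);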

I don't know what causes your slowness, as I cannot see your table sizes or execution plans. Presuming both tables are large enough that nested loops become inefficient (and that a straight self-join of values is out of the question), I'd try rewriting the query without scalar subqueries, like this:
select p.*,
       coalesce(stats."union", 0) "union",
       coalesce(stats.intersection, 0) intersection,
       coalesce(stats.left_cnt - stats.intersection, 0) left_diff,
       coalesce(stats.right_cnt - stats.intersection, 0) right_diff
from pairs p
left join (
    select left_id,
           right_id,
           count(*) "union",
           count(has_left and has_right) intersection,
           count(has_left) left_cnt,
           count(has_right) right_cnt
    from (
        select p.*,
               v."value" the_value,
               true has_left
        from pairs p
        join "values" v on v.id = p.left_id
    ) l
    full join (
        select p.*,
               v."value" the_value,
               true has_right
        from pairs p
        join "values" v on v.id = p.right_id
    ) r using (left_id, right_id, the_value)
    group by left_id,
             right_id
) stats on p.left_id = stats.left_id
       and p.right_id = stats.right_id;
Each join condition here allows hash and/or merge join, so the planner will have a chance to avoid nested loops.
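If you want to check what the planner actually picks, EXPLAIN the statement; you can also temporarily discourage nested loops to see whether a hash or merge plan is available at all. A sketch using standard planner settings (SELECT 1 is only a placeholder for the full query above):
SET enable_nestloop = off;   -- session-local, just for testing
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1;                    -- placeholder: run the query above here instead
RESET enable_nestloop;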

Related

SQL select all rows that are not equal to an id, and replace the id column with the value - without cross join

Say I have a table like this:
+----+-------+
| id | value |
+----+-------+
| 1  | a     |
| 1  | b     |
| 2  | c     |
| 2  | d     |
| 3  | e     |
| 3  | f     |
+----+-------+
And I want to select all rows whose id is not 1 and change their id to 1; select all rows whose id is not 2 and change their id to 2; and select all rows whose id is not 3 and change their id to 3.
Here is the output I want:
+----+-------+
| id | value |
+----+-------+
| 1  | c     |
| 1  | d     |
| 1  | e     |
| 1  | f     |
| 2  | a     |
| 2  | b     |
| 2  | e     |
| 2  | f     |
| 3  | a     |
| 3  | b     |
| 3  | c     |
| 3  | d     |
+----+-------+
The only solution I can think of is through cross join and distinct:
select distinct a.id, b.value
from table a
cross join table b
where a.id != b.id
Is there any other way to avoid such an expensive operation?
I think the typical way to write this is to generate all pairs of id and value and then remove the ones that exist:
select i.id, v.value
from (select distinct id from t) i cross join
     (select distinct value from t) v left join
     t
     on t.id = i.id and t.value = v.value
where t.id is null;
First, I don't think this is what your query does. But this is what you seem to be describing.
From a performance perspective, you might have other sources for i and v that don't require subqueries. If so, use those instead.
Finally, I don't think you can do much to improve the performance of this, apart from using explicit tables and perhaps having appropriate indexes on all the tables.
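A sketch of such an index, assuming the table is literally named t with the id and value columns used above (the index name is made up):
-- lets the anti-join probe t by (id, value)
CREATE INDEX t_id_value_idx ON t (id, value);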

Join two tables with no relation postgres?

I have this statement:
SELECT t1.*, t2.* FROM
(SELECT m.* FROM microposts AS m) AS t1
FULL JOIN
(SELECT r.* FROM ratings AS r) AS t2
ON true
I am using Rails and running this query raw against the database. The output drops duplicate-named columns (e.g. user_id) from the second table, and it still lines the second table's rows up against the first table's rows even though there is no relation between them. E.g.:
+------+-----------+------+--------+
| m.id | m.content | r.id | rating |
+------+-----------+------+--------+
| 1    | "hello"   | 10   | 5      |
+------+-----------+------+--------+
There is no relation between tables m and r.
I would like an output something like this:
+------+-----------+------+--------+
| m.id | m.content | r.id | rating |
+------+-----------+------+--------+
| 1    | "hello"   | null | null   |
| null | null      | 5    | 4      |
| 2    | "gday"    | null | null   |
+------+-----------+------+--------+
... etc.
This is a rather exotic way to say UNION ALL:
SELECT t1.*, t2.*
FROM
(SELECT m.* FROM microposts AS m) AS t1
FULL JOIN
(SELECT r.* FROM ratings AS r) AS t2
ON false
By contrast, ON true will create a Cartesian product.
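For comparison, spelled out as an actual UNION ALL the same result would look roughly like this; a sketch that only keeps the columns shown in the question (content, rating and the m_id / r_id output aliases are assumptions):
SELECT m.id AS m_id, m.content, NULL AS r_id, NULL AS rating
FROM microposts m
UNION ALL
SELECT NULL, NULL, r.id, r.rating
FROM ratings r;
The untyped NULLs take their column types from the other branch of the UNION ALL, so no explicit casts should be needed.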

SQL Query - Check for Two Distinct Values

Given the below data set I want to run a query to highlight any 'pairs' that do not consist of a 'left' and 'right'.
+---------+-----------+---------------+----------------------+
| Pair_Id | Pair_Name | Individual_Id | Individual_Direction |
+---------+-----------+---------------+----------------------+
| 1       | A         | A1            | Left                 |
| 1       | A         | A2            | Right                |
| 2       | B         | B1            | Right                |
| 2       | B         | B2            | Left                 |
| 3       | C         | C1            | Left                 |
| 3       | C         | C2            | Left                 |
| 4       | D         | D1            | Right                |
| 4       | D         | D2            | Left                 |
| 5       | E         | E1            | Left                 |
| 5       | E         | E2            | Right                |
+---------+-----------+---------------+----------------------+
In this instance Pair 3 'C' has two lefts. Therefore, I would look to display the following:
+---------+-----------+---------------+----------------------+
| Pair_Id | Pair_Name | Individual_Id | Individual_Direction |
+---------+-----------+---------------+----------------------+
| 3       | C         | C1            | Left                 |
| 3       | C         | C2            | Left                 |
+---------+-----------+---------------+----------------------+
You can simply use not exists:
select t.*
from t
where not exists (select 1
                  from t t2
                  where t2.pair_id = t.pair_id and
                        t2.Individual_Direction <> t.Individual_Direction
                 );
With an index on (pair_id, Individual_Direction), this should not only be the most concise solution but also the fastest.
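A sketch of that index, assuming the table really is named t as in the query (the index name is made up):
CREATE INDEX t_pair_id_direction_idx ON t (pair_id, Individual_Direction);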
If you want to be sure that there are pairs (the above returns singletons):
select t.*
from t
where not exists (select 1
                  from t t2
                  where t2.pair_id = t.pair_id and
                        t2.Individual_Direction <> t.Individual_Direction
                 ) and
      exists (select 1
              from t t2
              where t2.pair_id = t.pair_id and
                    t2.Individual_ID <> t.Individual_ID
             );
You can also do this using window functions:
select t.*
from (select t.*,
             count(*) over (partition by pair_id) as cnt,
             min(Individual_Direction) over (partition by pair_id) as min_direction,
             max(Individual_Direction) over (partition by pair_id) as max_direction
      from t
     ) t
where cnt > 1 and min_direction = max_direction;
One option uses aggregation:
WITH cte AS (
SELECT Pair_Name
FROM yourTable
WHERE Individual_Direction IN ('Left', 'Right')
GROUP BY Pair_Name
HAVING MIN(Individual_Direction) = MAX(Individual_Direction)
)
SELECT *
FROM yourTable
WHERE Pair_Name IN (SELECT Pair_Name FROM cte);
The HAVING clause used above asserts that a matching pair has both a minimum and maximum direction which are the same. This implies that such a pair only has one direction.
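To see why, here is a sketch of what the aggregation computes per pair with the sample data above:
SELECT Pair_Name,
       MIN(Individual_Direction) AS min_dir,
       MAX(Individual_Direction) AS max_dir
FROM yourTable
WHERE Individual_Direction IN ('Left', 'Right')
GROUP BY Pair_Name;
-- pair C yields min_dir = max_dir = 'Left', so it survives the HAVING filter;
-- pair A yields 'Left' / 'Right', so it is filtered out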
As is the case with Gordon's answer, an index on (Pair_Name, Individual_Direction) might help performance:
CREATE INDEX idx ON yourTable (Pair_Name, Individual_Direction);
There should be a more elegant way of using window functions than what I wrote:
WITH ranked AS (
    SELECT *, RANK() OVER (ORDER BY Pair_Id, Pair_Name, Individual_Direction) AS r
    FROM pairs
),
counted AS (
    SELECT Pair_Id, Pair_Name, Individual_Direction, r, COUNT(r) AS times
    FROM ranked
    GROUP BY Pair_Id, Pair_Name, Individual_Direction, r
    HAVING COUNT(r) > 1
)
SELECT ranked.Pair_Id, ranked.Pair_Name, ranked.Individual_Id, ranked.Individual_Direction
FROM ranked
RIGHT JOIN counted
    ON ranked.Pair_Id = counted.Pair_Id
    AND ranked.Pair_Name = counted.Pair_Name
    AND ranked.Individual_Direction = counted.Individual_Direction

Filter array depending on other table

I'm trying to filter values from an array. The information about which values should be kept is in another table.
table_a                 table_b
+----+-----------+      +---------+
| id | values    |      | keyword |
+----+-----------+      +---------+
| 1  | [a, b, c] |      | b       |
| 2  | [d, e, f] |      | e       |
| 3  | [a, g]    |      | f       |
+----+-----------+      +---------+
I expect the following output:
output
+----+-----------------+
| id | filtered_values |
+----+-----------------+
| 1  | [b]             |
| 2  | [e, f]          |
| 3  | []              |
+----+-----------------+
At the moment, I am using the following query:
SELECT
    id,
    array_intersect(ta.values, tb.filter_keywords) AS filtered_values -- brickhouse UDF
FROM
    table_a ta
CROSS JOIN (
    SELECT
        collect_set(keyword) AS filter_keywords
    FROM (
        SELECT
            "dummy" AS grouping_dummy,
            keyword
        FROM
            table_b
    ) tmp
    GROUP BY
        grouping_dummy
) tb
table_a has a couple of million rows; table_b contains fewer than 1,000 rows.
I guess the cross join is the bottleneck, because it uses only one reducer.
Is there any way to optimize this query?
Thanks!
I have a different assumption.
The reducer is needed in order to generate filter_keywords, not for the CROSS JOIN, which is a map-side operation. So no problem there.
My guess is that the performance penalty comes from the use of array_intersect with an array of 1,000 elements, therefore the solution would be to avoid it.
P.S.
There is no need for grouping_dummy.
You don't need to use GROUP BY in order to use aggregate functions.
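For example, the keyword set can be built with a single aggregate over the whole table; a minimal sketch of that simplification:
-- one global collect_set, no grouping_dummy and no GROUP BY needed
SELECT collect_set(keyword) AS filter_keywords
FROM table_b;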
select      a.id
           ,collect_list(case when b.keyword is not null then a.val end) as vals
from       (select  a.id
                   ,e.val
            from    table_a a
                    lateral view outer
                    explode(a.vals) e as val
           ) a
            left join table_b b
            on b.keyword = a.val
group by    a.id
+----+-----------+
| id | vals      |
+----+-----------+
| 1  | ["b"]     |
| 2  | ["e","f"] |
| 3  | []        |
+----+-----------+

order by after full outer join

I created the following table on http://sqlfiddle.com in PostgreSQL 9.3.1 mode:
CREATE TABLE t
(
id serial primary key,
m varchar(1),
d varchar(1),
c int
);
INSERT INTO t
(m, d, c)
VALUES
('A', '1', 101),
('A', '2', 102),
('A', '3', 103),
('B', '1', 104),
('B', '3', 105);
table:
| ID | M | D | C   |
|----|---|---|-----|
| 1  | A | 1 | 101 |
| 2  | A | 2 | 102 |
| 3  | A | 3 | 103 |
| 4  | B | 1 | 104 |
| 5  | B | 3 | 105 |
From this I want to generate such a table:
| M | D | ID     | C      |
|---|---|--------|--------|
| A | 1 | 1      | 101    |
| A | 2 | 2      | 102    |
| A | 3 | 3      | 103    |
| B | 1 | 4      | 104    |
| B | 2 | (null) | (null) |
| B | 3 | 5      | 105    |
but with my current statement
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
I only get the following
| M | D | ID     | C      |
|---|---|--------|--------|
| A | 1 | 1      | 101    |
| B | 1 | 4      | 104    |
| A | 2 | 2      | 102    |
| A | 3 | 3      | 103    |
| B | 3 | 5      | 105    |
| B | 2 | (null) | (null) |
Attempts to order it by m,d fail so far:
select * from
(select * from
(select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as kombi
full outer join
t
on kombi.d = t.d and kombi.m = t.m) as result)
order by result.m
Error message:
ERROR: subquery in FROM must have an alias: select * from (select * from (select * from (select * from (select distinct m from t) as dummy1, (select distinct d from t) as dummy2) as kombi full outer join t on kombi.d = t.d and kombi.m = t.m) as result) order by result.m
It would be cool if somebody could point out to me what I am doing wrong and perhaps show the correct statement.
select * from
(select kombi.m, kombi.d, t.id, t.c from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as kombi
full outer join t
on kombi.d = t.d and kombi.m = t.m) as result
order by result.m, result.d
I think your problem is just the ordering. You can solve it with an ORDER BY clause:
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
order by combi.m, combi.d
You need to specify which columns you want to order by. In this case the full (m, d) grid comes back from the combi derived table, so those are the columns to order by.
http://sqlfiddle.com/#!15/ddc0e/17
You could also use column numbers instead of names to do the ordering.
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
order by 1,2;
| M | D | ID     | C      |
|---|---|--------|--------|
| A | 1 | 1      | 101    |
| A | 2 | 2      | 102    |
| A | 3 | 3      | 103    |
| B | 1 | 4      | 104    |
| B | 2 | (null) | (null) |
| B | 3 | 5      | 105    |
You just need a pivot (numbers) table.
The query is very simple:
select classes.M, p.i as D, t.ID, t.C
from (select M, max(D) MaxValue from t group by m) classes
inner join pivot p
        on p.i <= classes.MaxValue::int  -- <= (not =<); cast because d is stored as varchar
left join t
        on t.M = classes.M
       and t.D::int = p.i
The pivot table is just a dummy numbers table:
CREATE TABLE Pivot (
  i INT,
  PRIMARY KEY(i)
);
It can be populated like this:
CREATE TABLE Foo(
  i CHAR(1)
);
INSERT INTO Foo VALUES('0');
INSERT INTO Foo VALUES('1');
INSERT INTO Foo VALUES('2');
INSERT INTO Foo VALUES('3');
INSERT INTO Foo VALUES('4');
INSERT INTO Foo VALUES('5');
INSERT INTO Foo VALUES('6');
INSERT INTO Foo VALUES('7');
INSERT INTO Foo VALUES('8');
INSERT INTO Foo VALUES('9');
Using the 10 rows in the Foo table, you can easily populate the Pivot table with 1,000 rows. To get 1,000 rows from 10 rows, join Foo to itself three times to create a Cartesian product:
INSERT INTO Pivot
SELECT (f1.i || f2.i || f3.i)::int  -- string concatenation ('000'..'999') cast to int; the original T-SQL used + for the concatenation
FROM Foo f1, Foo f2, Foo f3;
You can read about that in the Transact-SQL Cookbook by Jonathan Gennick and Ales Spetic.
You just need to order by the columns that are never NULL in the joined result, i.e. the m and d coming from combi. So your final SQL would be...
SELECT *
FROM (SELECT *
      FROM (SELECT DISTINCT m FROM t) AS dummy1,
           (SELECT DISTINCT d FROM t) AS dummy2) AS combi
     FULL OUTER JOIN t
       ON combi.d = t.d
      AND combi.m = t.m
ORDER BY combi.m,
         combi.d;
Also, from a query optimization perspective, it is better not to have many layers of subqueries.
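For example, the same result can be produced with a single query level; a sketch that swaps the FULL OUTER JOIN for a LEFT JOIN, which is equivalent here because the cross product already contains every (m, d) combination present in t:
SELECT dummy1.m, dummy2.d, t.id, t.c
FROM (SELECT DISTINCT m FROM t) AS dummy1
CROSS JOIN (SELECT DISTINCT d FROM t) AS dummy2
LEFT JOIN t
       ON t.m = dummy1.m
      AND t.d = dummy2.d
ORDER BY dummy1.m, dummy2.d;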
I think you need another correlation name - dummy3? - after 'as result )' before the order by.
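A sketch of the failing statement with that alias added; note that the wrapped result then contains an m and a d column from both kombi and t, so the simplest unambiguous ORDER BY is by column position rather than by result.m:
select * from
(select * from
(select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as kombi
full outer join
t
on kombi.d = t.d and kombi.m = t.m) as result) as dummy3
order by 1, 2;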