Filter array depending on other table - sql

I'm trying to filter values from an array. The information about which values should be kept is in another table.
table_a
___________________
| id | values     |
-------------------
| 1  | [a, b, c]  |
| 2  | [d, e, f]  |
| 3  | [a, g]     |
-------------------

table_b
___________
| keyword |
-----------
| b       |
| e       |
| f       |
-----------
I expect the following output:
output
________________________
| id | filtered_values |
------------------------
| 1  | [b]             |
| 2  | [e, f]          |
| 3  | []              |
------------------------
At the moment, I am using following query:
SELECT
    id,
    array_intersect(ta.values, tb.filter_keywords) AS filtered_values -- brickhouse UDF
FROM
    table_a ta
CROSS JOIN (
    SELECT
        collect_set(keyword) AS filter_keywords
    FROM (
        SELECT
            "dummy" AS grouping_dummy,
            keyword
        FROM
            table_b
    ) tmp
    GROUP BY
        grouping_dummy
) tb
table_a has a couple million rows; table_b contains fewer than 1000 rows.
I guess the cross join is the bottleneck, because it uses only one reducer.
Is there any way to optimize this query?
Thanks!

I have a different assumption.
The reducer is needed in order to generate filter_keywords, not for the CROSS JOIN, which is a map-side operation.
So no problem there.
My guess is that the performance penalty comes from calling array_intersect with an array of up to 1000 elements for every row, therefore the solution would be to avoid it.
P.s.
There is no need for grouping_dummy.
You don't need to use GROUP BY in order to use aggregate functions.
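For instance, this alone (a minimal sketch against your table_b) already returns a single row holding the full keyword set, no dummy grouping needed:
SELECT collect_set(keyword) AS filter_keywords
FROM table_b
The rewrite below avoids array_intersect entirely by exploding the array and joining table_b directly: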
select    a.id
         ,collect_list(case when b.keyword is not null then a.val end) as vals
from     (select  a.id
                 ,e.val
          from    table_a a
          lateral view outer explode(a.vals) e as val
         ) a
         left join table_b b
             on b.keyword = a.val
group by  a.id
+----+-----------+
| id | vals |
+----+-----------+
| 1 | ["b"] |
| 2 | ["e","f"] |
| 3 | [] |
+----+-----------+

Related

SQL select all rows that are not equal to an id, and replace the id column with the value - without cross join

Say I have a table like this:
+----+-------+
| id | value |
+----+-------+
| 1 | a |
| 1 | b |
| 2 | c |
| 2 | d |
| 3 | e |
| 3 | f |
+----+-------+
And I want to take, for each id, all the values that do not belong to that id: select all rows whose id is not 1 and relabel them with id 1; select all rows whose id is not 2 and relabel them with id 2; and select all rows whose id is not 3 and relabel them with id 3.
Here is the output I want:
+----+-------+
| id | value |
+----+-------+
| 1 | c |
| 1 | d |
| 1 | e |
| 1 | f |
| 2 | a |
| 2 | b |
| 2 | e |
| 2 | f |
| 3 | a |
| 3 | b |
| 3 | c |
| 3 | d |
+----+-------+
The only solution I can think of is through cross join and distinct:
select distinct a.id, b.value
from table a
cross join table b
where a.id != b.id
Is there any other way to avoid such an expensive operation?
I think the typical way to write this is to generate all pairs of id and value and then remove the ones that already exist:
select i.id, v.value
from (select distinct id from t) i cross join
     (select distinct value from t) v left join
     t
     on t.id = i.id and t.value = v.value
where t.id is null;
First, I don't think this is what your query does. But this is what you seem to be describing.
From a performance perspective, you might have other sources for i and v that don't require subqueries. If so, use those for performance.
Finally, I don't think you can do much to improve the performance of this, apart from using explicit tables -- and perhaps having appropriate indexes on all the tables.
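For instance, if you already maintain separate lookup tables for the ids and for the values (hypothetical names ids and vals below), the derived tables disappear:
select i.id, v.value
from ids i cross join    -- hypothetical table with one row per id
     vals v left join    -- hypothetical table with one row per value
     t
     on t.id = i.id and t.value = v.value
where t.id is null;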

Iterate over the rows of a second table to return resultset with cumulative sum

Yesterday, with the help of a SO user (# Iterate over the rows of a second table to return resultset), I was able to build the combinations of rows with a self join.
After some modifications to adapt it to my implementation, I faced a new challenge that I'm stuck on: how to compute an aggregate sum over a third column?
My issue is better explained in the image below:
Based on the code
SELECT
    b1.table_a_id,
    b1.label_x,
    b2.label_y
FROM table_a a
INNER JOIN table_b b1
    ON b1.table_a_id = a.table_a_id
INNER JOIN table_b b2
    ON b2.table_a_id = b1.table_a_id AND
       b2.label_y > b1.label_x
ORDER BY
    b1.table_a_id,
    b1.label_x,
    b2.label_y;
I was able to acquire the combinations.
What should be the next step to get the cumulative sum based on a third column?
I couldn't think of a solution that doesn't involve a second tool, such as Python with pandas and its cumsum function.
To generate the expected resultset, you would need to join the table with itself with an inequality condition on the order column. Then, you can do a window sum:
select
    t1.table_a_id,
    t1.label_x,
    t2.label_y,
    sum(t2.value) over(
        partition by t1.table_a_id, t1.label_x
        order by t1."order", t2."order"
    ) agg_value
from
    table_b t1
    inner join table_b t2
        on  t1.table_a_id = t2.table_a_id
        and t2."order" >= t1."order"
order by t1."order", t2."order"
Note: order is a reserved word, so it needs to be quoted; if your actual database column has a different name, you can remove the double quotes.
Demo on DB Fiddle:
TABLE_A_ID | LABEL_X | LABEL_Y | AGG_VALUE
---------: | :------ | :------ | --------:
         1 | A       | B       |         1
         1 | A       | C       |         3
         1 | A       | D       |         6
         1 | A       | E       |        10
         1 | A       | F       |        15
         1 | B       | C       |         2
         1 | B       | D       |         5
         1 | B       | E       |         9
         1 | B       | F       |        14
         1 | C       | D       |         3
         1 | C       | E       |         7
         1 | C       | F       |        12
         1 | D       | E       |         4
         1 | D       | F       |         9
         1 | E       | F       |         5
You seem to want a cumulative sum:
SELECT b1.table_a_id, b1.label_x, b2.label_y,
SUM(b1.value) OVER (PARTITION BY b1.table_a_id, b1.label_x
ORDER BY b2."order"
) as AGG_VALUE
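For completeness, here is a sketch of how that select list could be attached to the FROM/JOIN clauses of the question's query (it assumes table_b also carries the value and order columns used in the previous answer):
SELECT b1.table_a_id, b1.label_x, b2.label_y,
       SUM(b1.value) OVER (PARTITION BY b1.table_a_id, b1.label_x
                           ORDER BY b2."order") AS agg_value
FROM table_a a
INNER JOIN table_b b1
    ON b1.table_a_id = a.table_a_id
INNER JOIN table_b b2
    ON b2.table_a_id = b1.table_a_id AND
       b2.label_y > b1.label_x
ORDER BY b1.table_a_id, b1.label_x, b2.label_y;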

select query joining two tables on a range

I have two tables:
Table A with columns
name | tag | price | ref
and Table B with columns:
id | time | min_ref | max_ref
I want to write the following query: take all columns from table A and the columns id and time from table B, combining rows in such a way that a particular row from A is merged with a row from B if the value ref from A is in the range (min_ref, max_ref). Example:
A
name | tag | price | ref
A | aaa | 78 | 456
B | bbb | 19 | 123
C | ccc | 5 | 789
B
id | time | min_ref | max_ref
0 | 26-01-2019 | 100 | 150
1 | 27-01-2019 | 450 | 525
2 | 25-01-2019 | 785 | 800
the query should return:
name | tag | price | ref | id | time
A | aaa | 78 | 456 | 1 | 27-01-2019
B | bbb | 19 | 123 | 0 | 26-01-2019
C | ccc | 5 | 789 | 2 | 25-01-2019
The notation (min_ref, max_ref) for ranges signifies exclusive bounds; it would be [min_ref, max_ref] for inclusive bounds.
So:
select a.*, b.id, b.time
from a
join b on a.ref > b.min_ref
and a.ref < b.max_ref;
The BETWEEN predicate treats all bounds as inclusive.
I think this is just a join:
select a.*, b.id, b.time
from a join
b
on a.ref between b.min_ref and b.max_ref;
You want a JOIN which combines rows from the two tables with an appropriate criteria. For instance:
SELECT a.name, a.tag, a.price, a.ref, b.id, b.time
FROM a
INNER JOIN b ON b.min_ref <= a.ref AND b.max_ref >= a.ref
The INNER JOIN finds matching rows from the two tables, ON a specified criterion. In this case, the criterion is that a.ref is between b.min_ref and b.max_ref.
You can also use the SQL BETWEEN operator to simplify the conditions:
SELECT ...
FROM a
INNER JOIN b ON a.ref BETWEEN b.min_ref AND b.max_ref

PostgreSQL aggregate union, intersection and set differences

I have a table of pairs to aggregate as follows:
+---------+----------+
| left_id | right_id |
+---------+----------+
| a | b |
+---------+----------+
| a | c |
+---------+----------+
And a table of values as so:
+----+-------+
| id | value |
+----+-------+
| a | 1 |
+----+-------+
| a | 2 |
+----+-------+
| a | 3 |
+----+-------+
| b | 1 |
+----+-------+
| b | 4 |
+----+-------+
| b | 5 |
+----+-------+
| c | 1 |
+----+-------+
| c | 2 |
+----+-------+
| c | 3 |
+----+-------+
| c | 4 |
+----+-------+
For each pair, I would like to calculate the size of the union, the intersection and the set differences (each way) of the values, so that the output would look like this:
+---------+----------+-------+--------------+-----------+------------+
| left_id | right_id | union | intersection | left_diff | right_diff |
+---------+----------+-------+--------------+-----------+------------+
| a | b | 5 | 1 | 2 | 2 |
+---------+----------+-------+--------------+-----------+------------+
| a | c | 4 | 3 | 0 | 1 |
+---------+----------+-------+--------------+-----------+------------+
What would be the best way to approach this using PostgreSQL?
UPDATE: here is a rextester link with data https://rextester.com/RWID9864
You need scalar sub-queries for that.
The UNION can also be expressed with an OR (or an IN), which makes that part somewhat shorter to write; for the intersection you need a slightly longer query.
To calculate the "diff" counts, use the except operator:
SELECT p.*,
       (select count(distinct value)
        from "values"
        where id in (p.left_id, p.right_id)) as "union",
       (select count(*)
        from (
            select v.value from "values" v where v.id = p.left_id
            intersect
            select v.value from "values" v where v.id = p.right_id
        ) t) as intersection,
       (select count(*)
        from (
            select v.value from "values" v where v.id = p.left_id
            except
            select v.value from "values" v where v.id = p.right_id
        ) t) as left_diff,
       (select count(*)
        from (
            select v.value from "values" v where v.id = p.right_id
            except
            select v.value from "values" v where v.id = p.left_id
        ) t) as right_diff
from pairs p
I don't know what causes your slowness, as I cannot see table sizes and/or explain plans. Presuming both tables are large enough to make nested loops inefficient, and too large to even think about joining values to itself, I'd try to rewrite it without scalar subqueries, like this:
select p.*,
coalesce(stats."union", 0) "union",
coalesce(stats.intersection, 0) intersection,
coalesce(stats.left_cnt - stats.intersection, 0) left_diff,
coalesce(stats.right_cnt - stats.intersection, 0) right_diff
from pairs p
left join (
select left_id,
right_id,
count(*) "union",
count(has_left and has_right) intersection,
count(has_left) left_cnt,
count(has_right) right_cnt
from (
select p.*,
v."value" the_value,
true has_left
from pairs p
join "values" v on v.id = p.left_id
) l
full join (
select p.*,
v."value" the_value,
true has_right
from pairs p
join "values" v on v.id = p.right_id
) r using(left_id, right_id, the_value)
group by left_id,
right_id
) stats on p.left_id = stats.left_id
and p.right_id = stats.right_id;
Each join condition here allows hash and/or merge join, so the planner will have a chance to avoid nested loops.
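If you want to confirm which strategy the planner actually picks, you can inspect the plan with EXPLAIN; the sketch below runs it on a simplified version of the same joins (the real plan depends on your table sizes and statistics):
-- On large tables the join should appear as a Hash Join or Merge Join
-- rather than a Nested Loop; add ANALYZE to execute and get actual timings.
explain (costs off)
select p.left_id, p.right_id, v."value"
from pairs p
join "values" v on v.id = p.left_id;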

SQL JOIN two table & show all rows for table A

I have a question about JOIN.
TABLE A
------------
| PK | div |
------------
| A  | a   |
| B  | b   |
| C  | c   |
------------

TABLE B
-------------------
| PK | div | val  |
-------------------
| 1  | a   | 10   |
| 2  | a   | 100  |
| 3  | c   | 9    |
| 4  | c   | 99   |
-------------------
There are two tables something like the above, and I have been trying to join them, but I want to see all rows from TABLE A.
Something like
SELECT T1.PK, T1.div, T2.val
FROM A T1
LEFT OUTER JOIN B T2
ON T1.div = T2.div
and I want the result to look like this:
PK | div  | val  |
------------------
A  | a    | 10   |
A  | a    | 100  |
B  | null | null |
C  | c    | 9    |
C  | c    | 99   |
I have tried all the JOINs I know, but B doesn't appear because it doesn't exist in TABLE B. Is it possible to show all rows of TABLE A and just show null where there is no match in TABLE B?
Thanks in advance!
If you change your query to
SELECT T1.PK, T2.div, T2.val
FROM A T1
LEFT OUTER JOIN B T2
ON T1.div = T2.div
(note that div comes from T2 here), you'll get exactly the result posted (but maybe in a different order; add an ORDER BY clause if you want a specific order).
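For example (the ORDER BY columns are just one possible choice):
SELECT T1.PK, T2.div, T2.val
FROM A T1
LEFT OUTER JOIN B T2
ON T1.div = T2.div
ORDER BY T1.PK, T2.val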
Your query as it stands will get you:
PK | div | val  |
-----------------
A  | a   | 10   |
A  | a   | 100  |
B  | b   | null |
C  | c   | 9    |
C  | c   | 99   |
(Note that div is b for the row with the PK of B, not null.)
To get to your resultset, all you need to do is select T2.div, since that is the value that is null when there is no match in the second table:
SELECT T1.PK, T2.div, T2.val
FROM A T1
LEFT OUTER JOIN B T2
ON T1.div = T2.div