Any tips and tricks to avoid or reduce cost of one-to-many joins and non-equi joins when dataset is large? - sql

I am wondering how people grapple with large one-to-many joins, and in particular non-equi joins, when they have large data. If the keys of the two tables A and B are sufficiently repetitive, the output of the join between the two can be nearly the size of |A| * |B|. This must come up frequently in analytics at large companies, so I am wondering what ways there are to reduce the computation time of these joins.
Sometimes a window function such as LAG() can sidestep the join entirely (I show this below), but many times A and B are different tables, and in those cases I do not think LAG() can be used.
Example of a non-equi, one-to-many join
As a simplified example of a situation where a non-equi, one-to-many join might be warranted, I have tables A and B, each having a numeric id column, a date field date_created and some field group. For each row in table A, I want the id column of A and all data of the corresponding row in table B where B.date_created is the largest possible value such that A.date_created > B.date_created and A.group = B.group. In other words, I want the most recent row of table B with respect to the date_created and group fields of each row in table A.
Code when using a window function
In most use cases where these non-equi-joins come up, A and B are the same table and the date_created fields in fact correspond to the same column. In this situation, I could use the LAG() window function:
WITH id_tuples AS (
    SELECT A.id,
           LAG(A.id, 1) OVER (PARTITION BY A.group ORDER BY A.date_created) AS lagged_id
    FROM A
)
SELECT id_t.id,
       A.*
FROM id_tuples id_t
INNER JOIN A
    ON A.id = id_t.lagged_id
which I believe is more efficient than a self-join. However, this approach is not possible when the columns being compared are different, or belong to different tables.
Code when window function is not feasible
I use the following code to compute the most recent row of table B for each row in table A.
SELECT *
FROM
(
    SELECT A.id,
           B.*,
           DENSE_RANK() OVER (PARTITION BY A.id ORDER BY B.date_created DESC) AS date_rank
    FROM A
    INNER JOIN B
        ON B.group = A.group
        AND B.date_created < A.date_created
) ranked
WHERE date_rank = 1
The problem here is that the grouping variables A.group and B.group may have only a few distinct values. The join then becomes nearly a Cartesian product, and the number of rows produced by the subquery can be many orders of magnitude greater than the combined row counts of A and B. This is wasteful, since the outer query proceeds to throw out the majority of those rows by filtering for date_rank = 1.
Is there a better way of structuring the query to reduce the cost of these joins, or to avoid them entirely in these situations? I am asking in the abstract, but I've found that neither my relational database nor my Spark cluster (once I move the data there) has enough memory to handle such a join. Even on smaller datasets, this operation takes a long time to run. And I don't believe my dataset is particularly large relative to what others are doing.

Your first query can simply be written as:
SELECT A.id,
       LAG(A.id, 1) OVER (PARTITION BY A.group ORDER BY A.date_created) AS lagged_id
FROM A;
There is no need for the JOIN.
For the second query, one method is a lateral join:
SELECT A.id, B.*
FROM A LEFT JOIN LATERAL
     (SELECT B.*
      FROM B
      WHERE B.group = A.group AND
            B.date_created < A.date_created
      ORDER BY B.date_created DESC
      FETCH FIRST 1 ROW ONLY
     ) B ON TRUE;
This should use an index on B(GROUP, date_created).
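As a minimal sketch of that index (assuming Postgres syntax and the column names from the question; the index name is made up, and group is quoted because it is a reserved word):

CREATE INDEX b_group_date_idx
    ON B ("group", date_created);

With this index in place, the lateral subquery can seek to the matching group, scan date_created backwards from A.date_created, and stop after one row, instead of producing the near-Cartesian intermediate result.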

Related

APPLY operator - how does it decide which rows are a match between the two sets?

When using a JOIN, it is clear exactly what decides whether or not rows match, e.g. ON a.SomeID1 = b.SomeID1. The only rows returned will be ones where there is a matching 'SomeID1' in the tables referenced by aliases a and b.
My initial thought was that, when using APPLY, a WHERE clause is typically placed within the right-hand query, to provide similar functionality to the ON clause of a JOIN.
However, I see many SQL queries that do not include a WHERE in the right-hand query when using APPLY. So won't this mean that the resulting rows will just be the product of the number of rows from both tables?
What logic determines which rows will match between the left and right queries when using APPLY?
I have tried many blog posts, answers on here and even YouTube videos, but none of the explanations have 'clicked' with me.
The apply operator (in databases that support it) implements a type of join called a lateral join.
For me, the best way of understanding it starts with a correlated subquery. For instance:
select a.*,
       (select count(*)
        from b
        where b.a_id = a.a_id
        --------------^
       ) as b_count
from a;
The subquery is counting the number of matching rows in b for each row in a. How does it do this? The correlation clause is the condition that maps the subquery to the outer query.
Apply works the same way:
select a.*, b.b_count
from a outer apply
     (select count(*) as b_count
      from b
      where b.a_id = a.a_id
      ------------^
     ) b;
In other words, the correlation clause is the answer to your question.
What is the difference between a lateral join and a correlated subquery? There are three differences (a short sketch follows this list):
A lateral join can return more than one row.
A lateral join can return more than one column.
A lateral join is in the FROM clause so the returned columns can be referenced multiple times in the query.
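For instance, here is a minimal sketch that uses all three points (SQL Server style APPLY; b.amount and b.created_at are made-up columns): the lateral query returns up to three rows and two columns per row of a, and the columns of recent can be reused anywhere in the outer query.

select a.*, recent.amount, recent.created_at
from a
outer apply
     (select top (3) b.amount, b.created_at  -- more than one row and more than one column per outer row
      from b
      where b.a_id = a.a_id                  -- the correlation clause, same as above
      order by b.created_at desc
     ) recent;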
Further to Gordon's excellent answer:
APPLY does not need to be correlated (i.e. it does not have to use columns from the outer query); the key is that it is lateral (it returns a new result set for each row).
So starting with a base query:
select c.*
from customer c;
Example result:

Id | Name
---+-----
1  | John
2  | Jack
The idea is to apply a new result set to this. In this case, we only want a single row (a grouped-up count) to apply to each existing row.
Note the WHERE correlation: we use an outer reference.
select c.*, o.Orders
from customer c
outer apply
     (select count(*) as Orders
      from [order] o
      where o.c_id = c.id
     ) o;
Id | Name | Orders
---+------+-------
1  | John |      2
2  | Jack |      0
We can, however, return multiple rows. In fact, we can return anything we like, and place arbitrary filters on the result:
select c.*, t.*
from customer c
outer apply
     (select 'Thing1' thing
      union all
      select 'Thing2'
      where c.Name = 'Jack'
     ) t;
Id | Name | thing
---+------+-------
1  | John | Thing1
2  | Jack | Thing1
2  | Jack | Thing2

Note how the row for Jack got doubled up, based on the filter, while John only picked up Thing1. Note also that the first half of the union has no outer reference.
See also this answer for further APPLY tricks.

How can I join 3 tables and calculate the correct sum of fields from 2 tables, without duplicate rows?

I have tables A, B, C. Table A is linked to B, and table A is linked to C. I want to join the 3 tables and find the sum of B.cost and the sum of C.clicks. However, it is not giving me the expected value, and when I select everything without the group by, it is showing duplicate rows. I am expecting the row values from B to roll up into a single sum, and the row values from C to roll up into a single sum.
My query looks like
select A.*, sum(B.cost), sum(C.clicks) from A
join B
left join C
group by A.id
having sum(cost) > 10
I tried to group by B.a_id and C.another_field_in_a also, but that didn't work.
Here is a DB fiddle with all of the data and the full query:
http://sqlfiddle.com/#!9/768745/13
Notice how the sum columns are greater than the true sums over the individual tables? I'm expecting the sums to be equal, counting each row of tables B and C only once. I also tried adding distinct but that didn't help.
I'm using Postgres. (The fiddle is set to MySQL though.) Ultimately I will want to use a having clause to select the rows according to their sums. This query will be for millions of rows.
If I understand the logic correctly, the problem is the Cartesian product caused by the two joins. Your query is a bit hard to follow, but I think the intent is better handled with correlated subqueries:
select k.*,
       (select sum(cost)
        from ad_group_keyword_network n
        where n.event_date >= '2015-12-27' and
              n.ad_group_keyword_id = 1210802 and
              k.id = n.ad_group_keyword_id
       ) as cost,
       (select sum(clicks)
        from keyword_click c
        where (c.date is null or c.date >= '2015-12-27') and
              k.keyword_id = c.keyword_id
       ) as clicks
from ad_group_keyword k
where k.status = 2;
Here is the corresponding SQL Fiddle.
EDIT:
The subselects should be faster than the group by on the unaggregated data. However, you need the right indexes: ad_group_keyword_network(ad_group_keyword_id, event_date, cost) and keyword_click(keyword_id, date, clicks).
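A sketch of those indexes (the index names are made up; syntax assumes MySQL or Postgres and the column names above):

CREATE INDEX idx_agkn_cost
    ON ad_group_keyword_network (ad_group_keyword_id, event_date, cost);

CREATE INDEX idx_kc_clicks
    ON keyword_click (keyword_id, date, clicks);

Because each index covers the correlation column, the date filter, and the summed column, each correlated subquery can be answered from the index alone.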
I found this (MySQL joining tables group by sum issue) and created a query like this
select *
from A
join (select B.a_id, sum(B.cost) as cost
      from B
      group by B.a_id
     ) B on A.id = B.a_id
left join (select C.keyword_id, sum(C.clicks) as clicks
           from C
           group by C.keyword_id
          ) C on A.keyword_id = C.keyword_id
group by A.id
having sum(cost) > 10
I don't know how efficient it is, or whether it's more or less efficient than Gordon's. When I ran both queries, this one seemed faster: 27s vs. 2m35s. Here is a fiddle: http://sqlfiddle.com/#!15/c61c74/10
Simply split the aggregate of the second table into a subquery as follows:
http://sqlfiddle.com/#!9/768745/27
select ad_group_keyword.*, SumCost, sum(keyword_click.clicks)
from ad_group_keyword
left join keyword_click
    on ad_group_keyword.keyword_id = keyword_click.keyword_id
left join (select ad_group_keyword.id, sum(cost) SumCost
           from ad_group_keyword
           join ad_group_keyword_network
               on ad_group_keyword.id = ad_group_keyword_network.ad_group_keyword_id
           where event_date >= '2015-12-27'
           group by ad_group_keyword.id
           having sum(cost) > 20
          ) Cost on Cost.id = ad_group_keyword.id
where (keyword_click.date is null or keyword_click.date >= '2015-12-27')
  and status = 2
group by ad_group_keyword.id

Difference between Two Queries - Join vs IN

I have the following two queries. Query1 returns 1000 as the row count, whereas Query2 returns 4000. Can someone please explain the difference between the two queries? I was hoping both would return the same count.
Query1:
SELECT COUNT(*)
FROM TableA A
WHERE A.VIN IN (SELECT VIN
                FROM TableB B, TableC C
                WHERE B.MODEL_YEAR = '2014' AND B.VIN_NBR = C.VIN)
Query2:
SELECT COUNT(*)
FROM TABLEA A, TableB B, TableC C
WHERE B.MODEL_YEAR = '2014' AND B.VIN_NBR = C.VIN AND A.VIN = C.VIN
In many cases, they will return the same answer, but not necessarily. The first counts the number of rows in A that match the conditions -- each row is counted only once, regardless of the number of matches. The second does a join, which can multiply the number of rows.
The second query would be equivalent in results if it used count(distinct A.id), where id is unique or a primary key.
That said, although they are similar in functionality, how they are executed can be quite different. Different SQL engines might do a better job of optimizing one version or the other.
By the way, you should avoid the archaic join syntax that you are using. Since 1992, explicit joins have been part of SQL syntax.
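For example, a sketch of Query2 rewritten with explicit joins, and with the COUNT(DISTINCT ...) mentioned above so it lines up with Query1 (A.id here is a stand-in for TableA's primary key, which the question does not name):

SELECT COUNT(DISTINCT A.id)   -- counts each row of A at most once, like Query1
FROM TableA A
JOIN TableC C ON C.VIN = A.VIN
JOIN TableB B ON B.VIN_NBR = C.VIN
WHERE B.MODEL_YEAR = '2014';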

Count the overlapping values between two tables?

I have two tables with the same structure, each with a sequence column, and I am trying to count the number of sequences that show up in both tables.
I am using this right now:
SELECT A.sequence
FROM p2.pool A
WHERE EXISTS (SELECT *
              FROM p1.pool B
              WHERE B.sequence = A.sequence)
And then I was going to count the number of results.
Is there an easier way to do this using COUNT so I don't have to get all of the results first?
Yes, there is an easier way using COUNT:
SELECT COUNT(*)
FROM p2.pool A
WHERE EXISTS (SELECT *
              FROM p1.pool B
              WHERE B.sequence = A.sequence)
You could also use a join instead of a subquery, but the speed is unlikely to change:
SELECT COUNT(*)
FROM p2.pool A
JOIN p1.pool B ON A.sequence = B.sequence

optimize insert max dates from 1 million row table

I need to get the max dates from a detail table, for rows that meet the following condition.
This transaction table has nearly 1 million rows.
Is there a better query than this?
insert into SCH1.maxDATES
select a.ID, a.STATUS, max(detail.REGISTER_DATE) max_DATE
from SCH1.User a
inner join SCH1.Transaction detail on detail.ID = a.ID
where a.STATUS = 3 and detail.REGISTER_DATE is not null
group by a.ID, a.STATUS
Determine what the indexes are for those tables and join on them if possible (a sketch follows). Also, being more specific in your filters, without cutting out data you actually need, is always better.
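A minimal sketch of indexes that would support this query (the index names are made up; syntax assumes a typical RDBMS and the tables from the question):

CREATE INDEX ix_user_status_id
    ON SCH1.User (STATUS, ID);

CREATE INDEX ix_transaction_id_regdate
    ON SCH1.Transaction (ID, REGISTER_DATE);

-- The second index lets the join and MAX(REGISTER_DATE) be resolved per ID
-- without scanning the whole million-row transaction table.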
Here is a helpful site I commonly look at for optimization advice:
http://beginner-sql-tutorial.com/sql-query-tuning.htm