LIMIT and JOIN order of actions - sql

I have a query that includes a LIMIT on the main table and a JOIN.
My question is: which happens first? Does the query find the x rows of the LIMIT and then JOIN those rows, or does it first perform the JOIN on all rows and only then apply the LIMIT?

LIMIT applies to the query in which it appears. It is applied AFTER the JOINs in that query, but if the resulting derived table is JOINed to other tables, that/those JOIN(s) come after.
e.g.
SELECT ..
FROM (SELECT ..
      FROM TABLE1 T1
      JOIN TABLE2 T2 ON ..
      LIMIT 10) X
JOIN OTHERTABLE Y ON ..
LIMIT 20;
The JOIN between T1 and T2 occurs first
LIMIT 10 is applied to the result of the previous step, so only 10 rows from this derived table are used in the outer query
LIMIT 20 is applied to the result of the JOIN between X and Y
Although LIMIT is a keyword specific to PostgreSQL, MySQL, and SQLite, SQL Server's TOP keyword is processed the same way.

So: the JOIN is performed on all rows first, and only after that is the LIMIT applied.
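This can be verified with a quick experiment. Here is a minimal sketch using SQLite via Python's sqlite3 module (the tables and data are invented for illustration):

```python
import sqlite3

# Invented tables: only id 3 exists in both.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (3), (4), (5);
""")

# LIMIT 1 is applied AFTER the join: the join produces exactly one row
# (id = 3), and that is what comes back.
rows = con.execute("""
    SELECT t1.id
    FROM t1
    JOIN t2 ON t1.id = t2.id
    LIMIT 1
""").fetchall()
print(rows)  # [(3,)]
```

Had the LIMIT been applied to t1 before the join, picking t1's first row (id 1) would find no match in t2 and the result would be empty.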


SQL takes very long to execute

This SQL statement left joins two tables, both with approx. 10,000 rows (table1 = 20 columns, table2 = 50+ columns), and it takes 60+ seconds to execute. Is there a way to make it faster?
SELECT t.*, k.*
FROM table1 AS t
LEFT JOIN table2 AS k ON t.key_Table1 = k.Key_Table2
WHERE t.Time = (SELECT MAX(t2.Time)
                FROM table1 AS t2
                WHERE t2.key2_Table1 = t.key2_Table1)
ORDER BY t.Time;
The ideal execution time would be < 5 seconds, since an Excel query does it in 8 seconds, and it is surprising that an Excel query would be faster than a SQL Server Express query.
You can also rewrite your query better:
select *
from table2 as k
join (
    select *, row_number() over (partition by key2_Table1 order by Time desc) as rn
    from table1
) t on t.rn = 1
  and t.key_Table1 = k.Key_Table2
but you need indexes on the Key_Table2, Time, and key_Table1 columns if you don't already have them.
Another improvement would be to select only the columns you need instead of SELECT *.
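The equivalence of the correlated-MAX query and the ROW_NUMBER rewrite can be sketched with a small, self-contained example (SQLite 3.25+ for window functions, via Python's sqlite3; the readings table and its columns are invented for illustration):

```python
import sqlite3  # window functions need SQLite 3.25+

# Invented sample data: a few timestamped rows per group.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE readings (grp INTEGER, time INTEGER, val TEXT);
    INSERT INTO readings VALUES
        (1, 10, 'a'), (1, 20, 'b'),
        (2,  5, 'c'), (2, 15, 'd'), (2, 25, 'e');
""")

# Original pattern: correlated MAX() subquery per group.
q_max = """
    SELECT grp, time, val
    FROM readings r
    WHERE r.time = (SELECT MAX(r2.time) FROM readings r2 WHERE r2.grp = r.grp)
    ORDER BY grp
"""

# Rewrite: rank rows inside each group, keep rank 1.
q_rn = """
    SELECT grp, time, val
    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY grp
                                       ORDER BY time DESC) AS rn
          FROM readings) x
    WHERE rn = 1
    ORDER BY grp
"""

print(con.execute(q_max).fetchall())  # [(1, 20, 'b'), (2, 25, 'e')]
print(con.execute(q_rn).fetchall())   # [(1, 20, 'b'), (2, 25, 'e')]
```

One caveat: if two rows tie on Time within a group, the MAX() form returns all tied rows while rn = 1 returns exactly one.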
The optimizer is choosing a merge join, but if both tables have 10,000 rows and they aren't joined on indexed columns, forcing the optimizer out of the way and telling it to hash join may improve performance.
The syntax is to change LEFT JOIN to LEFT HASH JOIN.
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/ms191426(v=sql.100)
https://learn.microsoft.com/en-us/sql/relational-databases/performance/joins?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-join?view=sql-server-ver15
I would recommend rewriting the query using outer apply:
SELECT t.*, k.*
FROM table1 t OUTER APPLY
     (SELECT TOP (1) k.*
      FROM table2 k
      WHERE t.key_Table1 = k.Key_Table2
      ORDER BY k.Time DESC
     ) k
ORDER BY t.Time;
And for this query, you want an index on table2(Key_Table2, time desc).

Is there a better way to prioritize a sub query instead of using TOP?

Currently we are using SQL Server 2019 and from time to time we tend to use TOP (max INT) to prioritize the execution of a sub-query.
The main reason to do this is to make the starting result set as small as possible and thus avoid excessive reads when joining with other tables.
The most common scenario in which it helps:
t1: the main table we are querying, with about 200k rows
t2, t3: just some other tables with at most 5k rows each
pres: a view with basically all the fields we use for presentation (e.g. of a product), built from about 30 JOINs, which also contains table t1 plus a LanguageID
SELECT t1.Id, "+30 Fields from tables t1,t2,t3, pres"
FROM t1
INNER JOIN pres ON pres.LanguageId=1 AND t1.Id=pres.Id
INNER JOIN t2 ON t1.vtype=t2.Id
LEFT JOIN t3 ON t1.color=t3.Id
WHERE 1=1
  AND t1.f1=0
  AND t1.f2<>76
  AND t1.f3=2
we only expect about 300 rows, but it takes about 12 seconds to run
SELECT t.Id, "10 Fields from tables t1,t2,t3 + 20 fields from pres"
FROM (
    SELECT TOP 9223372036854775807 t1.Id, "about 10 fields from table t1,t2,t3"
    FROM t1
    INNER JOIN t2 ON t1.vtype=t2.Id
    LEFT JOIN t3 ON t1.color=t3.Id
    WHERE 1=1
      AND t1.f1=0
      AND t1.f2<>76
      AND t1.f3=2
) t
INNER JOIN pres ON pres.LanguageId=1 AND t.Id=pres.Id
we only expect about 300 rows, but it takes about 2 seconds to run

Different way of writing this SQL query with partition

Hi, I have the below query in Teradata. I have a row-number partition, and from it I want the rows with rn=1. Teradata doesn't let me use the row number as a filter in the same query. I know that I can put the below into a subquery with WHERE rn=1 and it gives me what I need, but the snippet needs to go into a larger query and I want to simplify it if possible.
Is there a different way of doing this so I get a table with 2 columns - one row per customer with the corresponding fc_id for the latest eff_to_dt?
select cust_grp_id, fc_id, row_number() over (partition by cust_grp_id order by eff_to_dt desc) as rn
from table1
Have you considered using the QUALIFY clause in your query?
SELECT cust_grp_id,
       fc_id
FROM table1
QUALIFY ROW_NUMBER()
        OVER (PARTITION BY cust_grp_id
              ORDER BY eff_to_dt DESC) = 1;
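QUALIFY is not portable: on engines without it (e.g. SQLite, shown here from Python), the equivalent is to wrap the ROW_NUMBER in a derived table and filter it in the outer query, which is the subquery form the asker mentioned. The sample data is invented for illustration:

```python
import sqlite3  # window functions need SQLite 3.25+

# Invented sample data: two customer groups, several fc_ids each.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (cust_grp_id INTEGER, fc_id INTEGER, eff_to_dt TEXT);
    INSERT INTO table1 VALUES
        (100, 1, '2023-01-01'),
        (100, 2, '2024-06-30'),
        (200, 3, '2024-01-15');
""")

# No QUALIFY here, so the ROW_NUMBER() goes into a derived table and the
# filter moves to the outer WHERE clause.
rows = con.execute("""
    SELECT cust_grp_id, fc_id
    FROM (SELECT cust_grp_id, fc_id,
                 ROW_NUMBER() OVER (PARTITION BY cust_grp_id
                                    ORDER BY eff_to_dt DESC) AS rn
          FROM table1) x
    WHERE rn = 1
    ORDER BY cust_grp_id
""").fetchall()
print(rows)  # one row per customer: [(100, 2), (200, 3)]
```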
Calculate the MAX eff_to_dt for each cust_grp_id and then join the result to the main table.
SELECT T1.cust_grp_id,
       T1.fc_id,
       T1.eff_to_dt
FROM Table1 AS T1
JOIN (SELECT cust_grp_id,
             MAX(eff_to_dt) AS max_eff_to_dt
      FROM Table1
      GROUP BY cust_grp_id) AS T2
  ON T2.cust_grp_id = T1.cust_grp_id
 AND T2.max_eff_to_dt = T1.eff_to_dt
You can use a pair of JOINs to accomplish the same thing:
INNER JOIN My_Table T1 ON <some criteria>
LEFT OUTER JOIN My_Table T2 ON <some criteria> AND T2.eff_to_date > T1.eff_to_date
WHERE
T2.my_id IS NULL
You'll need to sort out the specific criteria for your larger query, but this is effectively JOINing all of the rows (T1), but then excluding any where a later row exists. In the WHERE clause you eliminate these by checking for a NULL value in a column that is NOT NULL (in this case I just assumed some ID value). The only way that would happen is if the LEFT OUTER JOIN on T2 failed to find a match - i.e. no rows later than the one that you want exist.
Also, whether or not the JOIN to T1 is LEFT OUTER or INNER is up to your specific requirements.
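The JOIN-pair technique can be sketched with a tiny runnable example (SQLite via Python; the table and column names are invented): each row survives only if no later row exists for the same customer.

```python
import sqlite3

# Invented sample data: several rows per customer, one "latest" each.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (my_id INTEGER, cust INTEGER, eff_to_date TEXT);
    INSERT INTO t VALUES
        (1, 100, '2023-01-01'),
        (2, 100, '2024-06-30'),
        (3, 200, '2024-01-15');
""")

# T1 keeps a row only when the LEFT JOIN finds NO later row (T2) for the
# same customer; the failed join leaves T2.my_id NULL, which the WHERE
# clause turns into "latest row per customer".
rows = con.execute("""
    SELECT T1.cust, T1.eff_to_date
    FROM t T1
    LEFT OUTER JOIN t T2
      ON T2.cust = T1.cust
     AND T2.eff_to_date > T1.eff_to_date
    WHERE T2.my_id IS NULL
    ORDER BY T1.cust
""").fetchall()
print(rows)  # [(100, '2024-06-30'), (200, '2024-01-15')]
```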

Oracle 11g Performance of Join

I was doing some random testing on Oracle 11g and noticed a strange performance difference between different JOINs using SQL Developer.
I inserted 200,000 records into RANDOM_TABLE and about 300 into unrelated_table, then ran each query 10 times; the times stated are averages.
The two tables are totally unrelated, so each pair of queries should give the same result, and indeed the row counts are the same.
1. SELECT * FROM some_random_table t1 LEFT JOIN unrelated_table t2 ON 1=1;
~0.005 seconds to fetch the first 50 row.
2. SELECT * FROM some_random_table t1 RIGHT JOIN unrelated_table t2 ON 1=1;
>0.05 seconds to fetch the first 50 row.
1. SELECT * FROM some_random_table t1 FULL JOIN unrelated_table t2 ON 1=1;
~0.005 seconds to fetch the first 50 row.
2. SELECT * FROM some_random_table t1 CROSS JOIN unrelated_table t2;
>0.05 seconds to fetch the first 50 row.
Can anyone explain the difference between these queries? Why are some faster and some slower by an order of magnitude?
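One piece of this is easy to confirm: with ON 1=1 every pair of rows matches, so when both tables are non-empty an outer join degenerates into the same Cartesian product a CROSS JOIN produces, and the row counts are identical. A minimal check in SQLite via Python (RIGHT and FULL JOIN are omitted because older SQLite versions lack them; the tables are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (10), (20);
""")

# ON 1=1 matches every pair, so the outer join keeps no unmatched rows
# and collapses to the Cartesian product: 3 * 2 = 6 rows either way.
n_left  = con.execute("SELECT COUNT(*) FROM a LEFT JOIN b ON 1=1").fetchone()[0]
n_cross = con.execute("SELECT COUNT(*) FROM a CROSS JOIN b").fetchone()[0]
print(n_left, n_cross)  # 6 6
```

Since the result sets are identical, the timing differences in the question most likely come from the execution plan (e.g. which table drives the loop when fetching only the first 50 rows), not from the data returned.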

SQL query to limit number of rows having distinct values

Is there a way in SQL to use a query that is equivalent to the following:
select * from table1, table2 where some_join_condition
and some_other_condition and count(distinct(table1.id)) < some_number;
Let us say table1 is an employee table. Then a join will cause data about a single employee to be spread across multiple rows. I want to limit the number of distinct employees returned to some number. A condition on row number or something similar will not be sufficient in this case.
So what is the best way to get the same output as intended by the above query?
select *
from (select * from employee where rownum < some_number and some_id_filter), table2
where some_join_condition and some_other_condition;
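As a runnable sketch of this approach (SQLite via Python; ROWNUM is Oracle-specific, so LIMIT plays its role here, and the employee/task tables are invented): limit the employees in a derived table first, then join, so the join fan-out cannot add new employees.

```python
import sqlite3

# Invented schema: employees and a joined table that fans out rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (id INTEGER, name TEXT);
    CREATE TABLE task (emp_id INTEGER, title TEXT);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO task VALUES (1, 't1'), (1, 't2'), (2, 't3'), (3, 't4');
""")

# Restrict the set of employees BEFORE the join (here: the 2 lowest ids),
# so however much the join fans out, at most 2 distinct employees appear.
rows = con.execute("""
    SELECT e.id, e.name, t.title
    FROM (SELECT * FROM employee ORDER BY id LIMIT 2) e
    JOIN task t ON t.emp_id = e.id
    ORDER BY e.id, t.title
""").fetchall()
print(rows)  # [(1, 'Ann', 't1'), (1, 'Ann', 't2'), (2, 'Bob', 't3')]
```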
This will work for nearly all DBs:
SELECT *
FROM table1 t1
INNER JOIN table2 t2
    ON some_join_condition
   AND some_other_condition
INNER JOIN (
    SELECT id
    FROM table1
    GROUP BY id
    HAVING COUNT(id) > someNumber
) ids ON ids.id = t1.id
Some DBs have special syntax to make this a little bit easier.
I may not have a full understanding of what you're trying to accomplish, but let's say you're trying to get down to one row per employee, while each join causes multiple rows per employee, and grouping by employee name and other fields is still not unique enough to produce a single row. In that case you can use ranking with partitioning, and then select the rank you prefer within each employee partition.
See this example: http://msdn.microsoft.com/en-us/library/ms176102.aspx