Why does the optimizer decide to self-join a table? - sql

I'm analyzing my query that looks like this:
WITH Project_UnfinishedCount AS (
SELECT P.Title AS title, COUNT(T.ID) AS cnt
FROM PROJECT P LEFT JOIN TASK T on P.ID = T.ID_PROJECT AND T.ACTUAL_FINISH_DATE IS NULL
GROUP BY P.ID, P.TITLE
)
SELECT Title
FROM Project_UnfinishedCount
WHERE cnt = (
SELECT MAX(cnt)
FROM Project_UnfinishedCount
);
It returns a title of a project that has the biggest number of unfinished tasks in it.
Here is its execution plan:
I wonder why it has steps 6-8, which look like a self-join of the project table? It then stores the result of the join as a view, but according to the rows and bytes columns the view is the same as the project table. Why does the optimizer do this?
I'd also like to know what steps 2 and 1 stand for. My guess is that step 2 stores the result of my CTE for use in steps 10-14, and step 1 removes the rows from the view that don't have the 'cnt' value returned by the subquery. Is this a correct guess?

In addition to the comments above, when you reference a CTE more than once, there is a heuristic that tells the optimizer to materialize the CTE, which is why you see the temp table transformation.
A few other comments/questions regarding this query. I'm assuming that a PROJECT can have 0 or more tasks, and each TASK belongs to one and only one PROJECT. In that case, I wonder why you have an outer join? Moreover, you put the ACTUAL_FINISH_DATE condition in the join's ON clause. This means that for a project where all the tasks are complete, the outer join still materializes a non-matching row with NULLs on the TASK side. COUNT(T.ID) skips that NULL, so the project is reported with 0 unfinished tasks; COUNT(*) would count the row and make it look like there was 1 unfinished task. If you don't need the zero-count projects to find the maximum, your CTE could look more like:
SELECT P.Title AS title, COUNT(T.ID) AS cnt
FROM PROJECT P
JOIN TASK T on P.ID = T.ID_PROJECT
WHERE T.ACTUAL_FINISH_DATE IS NULL
GROUP BY P.ID, P.TITLE
With all that being said, these "find the match (count, max etc) within a group" type of queries are often more efficient when written as a window function. That way you can eliminate the self join. This can make a big performance difference when you have millions or billions of rows. So for example, your query could be re-written as:
SELECT TITLE, CNT
FROM (
SELECT P.Title AS title, COUNT(T.ID) AS cnt
, RANK() OVER( ORDER BY COUNT(*) DESC ) RNK
FROM PROJECT P
JOIN TASK T on P.ID = T.ID_PROJECT
WHERE T.ACTUAL_FINISH_DATE IS NULL
GROUP BY P.ID, P.TITLE
)
WHERE RNK=1
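To make the comparison concrete, here is a minimal sketch that runs both forms against an in-memory SQLite database from Python. The table contents are invented for illustration, SQLite 3.25 or later is assumed for window function support, and the ranking is applied in an extra derived table rather than directly over the aggregate. It also shows what COUNT(T.ID) and COUNT(*) return under the LEFT JOIN for a project whose tasks are all finished.

```python
import sqlite3

# Hypothetical miniature of the PROJECT/TASK schema; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE task (id INTEGER PRIMARY KEY, id_project INTEGER,
                   actual_finish_date TEXT);
INSERT INTO project VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
INSERT INTO task VALUES
  (1, 1, NULL), (2, 1, NULL), (3, 1, '2020-01-01'),
  (4, 2, NULL),
  (5, 3, '2020-02-01');  -- Gamma has no unfinished tasks
""")

# Original shape: the CTE is referenced twice (once for MAX, once for the filter).
cte_max = conn.execute("""
WITH puc AS (
  SELECT p.title AS title, COUNT(t.id) AS cnt
  FROM project p
  LEFT JOIN task t ON p.id = t.id_project AND t.actual_finish_date IS NULL
  GROUP BY p.id, p.title
)
SELECT title FROM puc WHERE cnt = (SELECT MAX(cnt) FROM puc)
""").fetchall()

# Window-function shape: aggregate once, rank the groups, keep rank 1.
ranked = conn.execute("""
SELECT title, cnt
FROM (
  SELECT title, cnt, RANK() OVER (ORDER BY cnt DESC) AS rnk
  FROM (
    SELECT p.title AS title, COUNT(t.id) AS cnt
    FROM project p
    JOIN task t ON p.id = t.id_project
    WHERE t.actual_finish_date IS NULL
    GROUP BY p.id, p.title
  ) AS counted
) AS ranked_counts
WHERE rnk = 1
""").fetchall()

# COUNT(t.id) versus COUNT(*) per project under the LEFT JOIN.
left_counts = {
    title: (cnt_id, cnt_star)
    for title, cnt_id, cnt_star in conn.execute("""
        SELECT p.title, COUNT(t.id), COUNT(*)
        FROM project p
        LEFT JOIN task t ON p.id = t.id_project AND t.actual_finish_date IS NULL
        GROUP BY p.id, p.title
    """)
}

print(cte_max, ranked, left_counts)
```

Both queries pick the same project here. RANK() keeps ties, so several projects sharing the maximum count would all be returned, which matches the behaviour of the cnt = MAX(cnt) filter.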

Related

How to avoid duplicates between two tables on using join?

I have two tables work_table and progress_table.
work_table has following columns:
id[primary key],
department,
dept_name,
dept_code,
created_time,
updated_time
progress_table has following columns:
id[primary key],
project_id,
progress,
progress_date
I need only the most recently updated progress value to appear in the result, but right now I am getting duplicates.
Here is the tried code:
select
row_number() over (order by a.dept_code asc) AS sno,
a.dept_name,
b.project_id,
p.physical_progress,
DATE(b.updated_time) as updated_date,
b.created_time
from
masters.dept_users as a,
work_table as b
LEFT JOIN
progress as p on b.id = p.project_id
order by
a.dept_name asc
It shows duplicate values for progress with the same id. How do I resolve it? [The progress values are integers that are fed in through a form.]
Having reformatted your query, some things become clear...
You've mixed , and JOIN syntax (why!?)
You start with the masters.dept_users table, but don't mention it in your description
You have no join predicate between dept_users and work_table
You calculate an sno, but have no partition by and never use it
Your query includes columns not mentioned in the table descriptions above
And to top it off, you use meaningless aliases like a and b? Please, for the love of others and your future self (who will try to read this one day), make the aliases meaningful in some way.
You possibly want something like...
WITH
sorted_progress AS
(
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY project_id
ORDER BY progress_date DESC -- This may need to be updated_time, your question is very unclear
)
AS seq_num
FROM
progress
)
SELECT
<whatever>
FROM
masters.dept_users AS u
INNER JOIN
work_table AS w
ON w.user_id = u.id -- This is a GUESS, but you need to do SOMETHING here
LEFT JOIN
sorted_progress AS p
ON p.project_id = w.id -- Even this looks suspect, are you SURE that w.id is the project_id?
AND p.seq_num = 1
That at least shows how to get that latest progress record (p.seq_num = 1), but whether the other joins are correct is something you'll have to double (and triple) check for yourself.
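As a rough check of the seq_num = 1 idea, the sketch below runs it on an in-memory SQLite database from Python. The rows are invented, the dept_users join is left out because its key is unknown, and joining progress.project_id to work_table.id is the same guess as in the query above. SQLite 3.25 or later is assumed for ROW_NUMBER().

```python
import sqlite3

# Hypothetical data: project 1 has three progress rows, project 2 has one,
# project 3 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work_table (id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE progress (id INTEGER PRIMARY KEY, project_id INTEGER,
                       progress INTEGER, progress_date TEXT);
INSERT INTO work_table VALUES (1, 'Roads'), (2, 'Water'), (3, 'Parks');
INSERT INTO progress VALUES
  (1, 1, 10, '2021-01-01'),
  (2, 1, 40, '2021-02-01'),
  (3, 1, 75, '2021-03-01'),
  (4, 2, 20, '2021-01-15');
""")

rows = conn.execute("""
WITH sorted_progress AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY project_id
           ORDER BY progress_date DESC
         ) AS seq_num
  FROM progress
)
SELECT w.id, w.dept_name, p.progress, p.progress_date
FROM work_table AS w
LEFT JOIN sorted_progress AS p
       ON p.project_id = w.id
      AND p.seq_num = 1      -- keep only the newest progress row per project
ORDER BY w.id
""").fetchall()

for row in rows:
    print(row)
```

Each work_table row appears exactly once: with its newest progress value when one exists, and with NULLs otherwise because of the LEFT JOIN.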

SQL for a query with several input IDs, how to get the first 5 results for each ID

I have a query that accepts several IDs as filters in a WHERE clause.
It's formatted something like this:
SELECT a.ID, a.VOLUMETRY, b.ANNOY_DISTANCE
FROM PRODUCT a
JOIN RECOMMENDATIONS b on a.ID = b.ID
WHERE a.ID in ('0001','0002', ...., '0099')
ORDER BY b.ANNOY_DISTANCE
Now this query can return several thousand results for each ID, but I only need the first 5 for each ID after ordering them by the ANNOY_DISTANCE column. The rest aren't needed and would only slow post-processing of the data.
How can I change this so that the query result only gives the first 5 rows for each ID?
Use window functions, which you can filter using a QUALIFY clause:
SELECT p.ID, p.VOLUMETRY, r.ANNOY_DISTANCE
FROM PRODUCT p JOIN
RECOMMENDATIONS r
ON p.ID = r.ID
WHERE p.ID in ('0001','0002', ...., '0099')
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.ID ORDER BY r.ANNOY_DISTANCE) <= 5
ORDER BY r.ANNOY_DISTANCE;
Notice that I changed your table aliases to be meaningful abbreviations for the table names. That is a best practice.
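QUALIFY only exists in some engines. Where it is missing, the same filter can be written by computing ROW_NUMBER() in a derived table and filtering on it outside. Here is a minimal sketch of that variant, run on in-memory SQLite from Python (3.25 or later assumed); the rows are invented and the limit is 2 instead of 5 to keep the data small.

```python
import sqlite3

# Hypothetical sample rows for PRODUCT and RECOMMENDATIONS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id TEXT PRIMARY KEY, volumetry REAL);
CREATE TABLE recommendations (id TEXT, annoy_distance REAL);
INSERT INTO product VALUES ('0001', 1.5), ('0002', 2.5);
INSERT INTO recommendations VALUES
  ('0001', 0.9), ('0001', 0.1), ('0001', 0.5),
  ('0002', 0.3), ('0002', 0.7), ('0002', 0.2);
""")

TOP_N = 2  # the question asks for 5; 2 keeps the example short

rows = conn.execute("""
SELECT id, volumetry, annoy_distance
FROM (
  SELECT p.id AS id,
         p.volumetry AS volumetry,
         r.annoy_distance AS annoy_distance,
         ROW_NUMBER() OVER (
           PARTITION BY p.id
           ORDER BY r.annoy_distance
         ) AS rn
  FROM product p
  JOIN recommendations r ON p.id = r.id
  WHERE p.id IN ('0001', '0002')
) AS ranked
WHERE rn <= ?                 -- plays the role of the QUALIFY filter
ORDER BY id, annoy_distance
""", (TOP_N,)).fetchall()

for row in rows:
    print(row)
```

The outer WHERE runs after the window function has numbered the rows, which is the step QUALIFY lets you express without the extra nesting.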

Want to know if one can simplify this SQL query further to determine highest stackoverflow post on the last date of dataset

I'm looking over BigQuery's stackoverflow public dataset, and my goal is to query the highest scored posts on the latest day of the dataset. I want to print the date, score, view count, user name, etc.
SELECT display_name name, score, view_count, title, date
FROM `bigquery-public-data.stackoverflow.users` as u
INNER JOIN (
SELECT owner_user_id, date(creation_date) date, view_count, score, title
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts`
WHERE date(creation_date) = (
SELECT max(date(creation_date))
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts`
)
) as p
ON u.id = p.owner_user_id
WHERE view_count IS NOT NULL and owner_user_id IS NOT NULL and title IS NOT NULL
ORDER by score DESC
LIMIT 50
While this works, it does require me to use 2 subqueries. I was wondering if there was a way to simplify this using just a join.
I find my first obstacle is being unable to use the max() function anywhere outside of SELECT and it can only be used with other aggregated columns.
I was wondering if there was a way to simplify this using just a join.
Your query is already good enough performance- and readability-wise, but if you wish to use JOIN instead of WHERE, the version below should produce the same result and may be slightly faster
#standardSQL
SELECT display_name name, score, view_count, title, DATE
FROM `bigquery-public-data.stackoverflow.users` AS u
INNER JOIN (
SELECT owner_user_id, DATE(creation_date) DATE, view_count, score, title
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts` a
JOIN (
SELECT MAX(DATE(creation_date)) max_date
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts`
) b
ON DATE(creation_date) = max_date
WHERE view_count IS NOT NULL AND owner_user_id IS NOT NULL AND title IS NOT NULL
) AS p
ON u.id = p.owner_user_id
ORDER BY score DESC
LIMIT 50
Note: there are two adjustments
Most inner WHERE transformed into JOIN
Most outer WHERE moved inside
You need a query to select the columns you want, and a query to get the latest day which is a minimum of two subqueries if you're not counting JOINs as queries.
I think what you have is going to be essentially equivalent to if not better than other options. The 2nd nested query to get the latest date will be cached, it will not re-execute that for each row in the outer query. Compared to hardcoding the latest date instead of looking it up on the fly, there is no notable runtime or read size difference.
You can sort of 'flatten' the query by using WITH to construct a resultset of the filter values first and then INNER JOIN them with the original outer queries, which behaves like a WHERE clause. For this specific case I don't see any improvement in runtime or data read size when doing this. It is also a bit less readable in my personal opinion. Depending on the tables you're joining, using the JOIN method instead of filtering before the join might result in slower queries because it has to read more data, I'm not entirely sure how BigQuery handles that.
WITH max_creation_date as (
SELECT max(date(creation_date)) as date
FROM `bigquery-public-data.stackoverflow.stackoverflow_posts`)
SELECT display_name name, score, view_count, title, date(p.creation_date) as date
FROM `bigquery-public-data.stackoverflow.users` as u
INNER JOIN `bigquery-public-data.stackoverflow.stackoverflow_posts` as p
ON u.id = p.owner_user_id
INNER JOIN max_creation_date
ON max_creation_date.date = date(p.creation_date)
WHERE view_count IS NOT NULL
AND owner_user_id IS NOT NULL
AND title IS NOT NULL
ORDER by score DESC
LIMIT 50
You could technically turn the other 3 WHERE clauses into INNER JOIN clauses as well but that would probably be less readable and potentially slower than what you have.
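The equivalence of the two filter styles can be checked on a toy table. The sketch below uses in-memory SQLite from Python rather than BigQuery, with an invented posts table, so it only illustrates that the WHERE form and the JOIN form select the same rows; it says nothing about BigQuery's runtime or bytes read.

```python
import sqlite3

# Hypothetical stand-in for the posts table; three distinct days.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, creation_date TEXT, score INTEGER);
INSERT INTO posts VALUES
  (1, '2016-06-11 08:00:00', 5),
  (2, '2016-06-12 09:30:00', 3),
  (3, '2016-06-12 17:45:00', 8),
  (4, '2016-06-10 12:00:00', 9);
""")

# Filter with a scalar subquery in WHERE.
where_form = conn.execute("""
SELECT id, score
FROM posts
WHERE date(creation_date) = (SELECT MAX(date(creation_date)) FROM posts)
ORDER BY score DESC
""").fetchall()

# Same filter expressed as an INNER JOIN against a one-row CTE.
join_form = conn.execute("""
WITH max_creation_date AS (
  SELECT MAX(date(creation_date)) AS max_date FROM posts
)
SELECT p.id, p.score
FROM posts p
JOIN max_creation_date m ON m.max_date = date(p.creation_date)
ORDER BY p.score DESC
""").fetchall()

print(where_form)
print(join_form)
```

Because the CTE returns exactly one row, the inner join neither duplicates nor drops posts beyond what the WHERE filter already does.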

SQLServer get top 1 row from subquery

In a huge products query, I'm trying to get the price of the last buy for every product. In my first approach, I added a sub-query on the buys table, ordering by date descending and taking only the first row, to make sure I got the latest row. But it didn't show any values.
Now that I look at it again, that's logical: the sub-query has no restriction on the product, so it lists all the buys and takes the latest one overall, which doesn't have to be for the product currently being processed by the main query. So it returns nothing.
That's the very simplified code:
SELECT P.ID, P.Description, P... blah, blah
FROM Products P
LEFT JOIN (
SELECT TOP 1 B.Product,B.Date,B.Price --Can't take out TOP 1 if ORDER BY
FROM Buys B
--WHERE... Can't select by P.Product because doesn't exist in that context
ORDER BY B.Date DESC, B.ID DESC
) BUY ON BUY.Product=P.ID
WHERE (Some product family and kind restrictions, etc, so it processes a big amount of products)
I thought about an embedded query in the main select stmt, but as I need several values it would imply doing a query for each, and that's ugly and bad.
Is there a way to do this and avoid the infamous LOOP? Does anyone know a good one?
You are going down the path of using outer apply, so let's continue:
SELECT P.ID, P.Description, P... blah, blah
FROM Products p OUTER APPLY
(SELECT TOP 1 B.Product,B.Date,B.Price
FROM Buys b
WHERE b.Product = P.ID -- inside APPLY the subquery may reference the outer row
ORDER BY B.Date DESC, B.ID DESC
) buy
WHERE (Some product family and kind restrictions, etc, so it processes a big amount of products)
In this context, you can think of apply as being a correlated subquery that can return multiple columns. In other databases, this is called a "lateral join".
Seems like a good candidate for OUTER APPLY. You need something along these lines..
SELECT P.ID, P.Description, P... blah, blah
FROM Products P
OUTER APPLY (
SELECT TOP 1 B.Product,B.Date,B.Price
FROM Buys B
WHERE B.Product = P.ID
ORDER BY B.Date DESC, B.ID DESC
) a
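For engines that have neither APPLY nor lateral joins, a different technique gives the same "latest buy per product" result: number the buys per product with ROW_NUMBER() and left join on row 1. This is a swapped-in alternative, not OUTER APPLY itself. A minimal sketch on in-memory SQLite from Python (3.25 or later assumed), with invented rows:

```python
import sqlite3

# Hypothetical data: product 1 has two buys on the same latest date,
# product 3 has no buys at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE buys (id INTEGER PRIMARY KEY, product INTEGER,
                   date TEXT, price REAL);
INSERT INTO products VALUES (1, 'Bolt'), (2, 'Nut'), (3, 'Washer');
INSERT INTO buys VALUES
  (1, 1, '2022-01-01', 1.00),
  (2, 1, '2022-03-01', 1.20),
  (3, 1, '2022-03-01', 1.25),
  (4, 2, '2022-02-01', 0.50);
""")

rows = conn.execute("""
SELECT p.id, p.description, b.date, b.price
FROM products p
LEFT JOIN (
  SELECT product, date, price,
         ROW_NUMBER() OVER (
           PARTITION BY product
           ORDER BY date DESC, id DESC   -- same tie-break as the TOP 1 query
         ) AS rn
  FROM buys
) AS b
  ON b.product = p.id
 AND b.rn = 1
ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

The LEFT JOIN keeps products without any buys, mirroring what OUTER APPLY does when the inner query returns no row.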

Distinct on multi-columns in sql

I have this query in sql
select cartlines.id,cartlines.pageId,cartlines.quantity,cartlines.price
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
I want to get rows distinct by pageId, so that in the end I will not have rows with the same pageId more than once (duplicates).
Any ideas?
Thanks
Baaroz
Going by what you're expecting in the output and your comment that says "...if there rows in output that contain same pageid only one will be shown...," it sounds like you're trying to get the top record for each page ID. This can be achieved with ROW_NUMBER() and PARTITION BY:
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.pageID) rowNumber,
c.id,
c.pageId,
c.quantity,
c.price
FROM orders o
INNER JOIN cartlines c ON c.orderId = o.id
WHERE userId = 5
) a
WHERE a.rowNumber = 1
You can also use ROW_NUMBER() OVER(PARTITION BY ... along with TOP 1 WITH TIES, but it runs a little slower (despite being WAY cleaner):
SELECT TOP 1 WITH TIES c.id, c.pageId, c.quantity, c.price
FROM orders o
INNER JOIN cartlines c ON c.orderId = o.id
WHERE userId = 5
ORDER BY ROW_NUMBER() OVER(PARTITION BY c.pageId ORDER BY c.pageID)
If you wish to remove rows with all columns duplicated this is solved by simply adding a distinct in your query.
select distinct cartlines.id,cartlines.pageId,cartlines.quantity,cartlines.price
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
If, however, this makes no difference, it means the other columns have different values, so the combinations of column values create distinct (unique) rows.
As Michael Berkowski stated in comments:
DISTINCT - does operate over all columns in the SELECT list, so we
need to understand your special case better.
In the case that simply adding distinct does not cover you, you need to also remove the columns that are different from row to row, or use aggregate functions to get aggregate values per cartlines.
Example - total quantity per distinct pageId:
select cartlines.pageId, sum(cartlines.quantity) as total_quantity
from orders
INNER JOIN
cartlines on(cartlines.orderId=orders.id)where userId=5
group by cartlines.pageId
If this is still not what you wish, you need to give us data and specify better what it is you want.
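To see the difference between DISTINCT over all selected columns and one aggregated row per pageId, here is a small sketch on in-memory SQLite from Python. The orders and cartlines rows are invented for illustration.

```python
import sqlite3

# Hypothetical data: user 5 has two orders; pageId 100 appears in both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, userId INTEGER);
CREATE TABLE cartlines (id INTEGER PRIMARY KEY, orderId INTEGER,
                        pageId INTEGER, quantity INTEGER, price REAL);
INSERT INTO orders VALUES (1, 5), (2, 5), (3, 9);
INSERT INTO cartlines VALUES
  (10, 1, 100, 2, 9.99),
  (11, 2, 100, 3, 9.99),
  (12, 2, 200, 1, 4.50),
  (13, 3, 100, 7, 9.99);
""")

# DISTINCT looks at every selected column, and cartlines.id differs per row,
# so pageId 100 still shows up twice.
distinct_rows = conn.execute("""
SELECT DISTINCT c.id, c.pageId, c.quantity, c.price
FROM orders o
JOIN cartlines c ON c.orderId = o.id
WHERE o.userId = 5
ORDER BY c.id
""").fetchall()

# Grouping by pageId collapses the rows to one per page.
totals = conn.execute("""
SELECT c.pageId, SUM(c.quantity) AS total_quantity
FROM orders o
JOIN cartlines c ON c.orderId = o.id
WHERE o.userId = 5
GROUP BY c.pageId
ORDER BY c.pageId
""").fetchall()

print(distinct_rows)
print(totals)
```

Dropping the per-row columns (id, price) from the select list is what lets GROUP BY return a single row per pageId.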