Using analytics with left join and partition by - sql

I have two different queries which produce the same results. I wonder which one is more efficient. The second one uses one fewer select clause, but moves the where to the outer select. Which one is executed first: the left join or the where clause?
Using 3 "selects":
select * from
(
  select * from
  (
    select
      max(t.PRICE_DATETIME) over (partition by t.PRODUCT_ID) as LATEST_SNAPSHOT,
      t.*
    from
      PRICE_TABLE t
  ) a
  where
    a.PRICE_DATETIME = a.LATEST_SNAPSHOT
) r
left join
  PRODUCT_TABLE l on (r.PRODUCT_ID = l.PRODUCT_ID and r.PRICE_DATETIME = l.PRICE_DATETIME)
Using 2 "selects":
select * from
(
  select
    max(t.PRICE_DATETIME) over (partition by t.PRODUCT_ID) as LATEST_SNAPSHOT,
    t.*
  from
    PRICE_TABLE t
) r
left join
  PRODUCT_TABLE l on (r.PRODUCT_ID = l.PRODUCT_ID and r.PRICE_DATETIME = l.PRICE_DATETIME)
where
  r.PRICE_DATETIME = r.LATEST_SNAPSHOT;
PS: I know, I know, "select star" is evil, but I'm writing it this way here only to keep the example short.

"I wonder which one is more efficent"
You can answer this question yourself pretty easily by turning on statistics.
set statistics io on
set statistics time on
-- query goes here
set statistics io off
set statistics time off
Do this for each of your two queries and compare the results. You'll get some useful output about how many reads SQL Server is doing, how many milliseconds each takes to complete, etc.
You can also see the execution plan SQL Server generates by viewing the estimated execution plan (Ctrl+L, or right-click and choose that option), or by enabling "Display Actual Execution Plan" (Ctrl+M) and running the queries. That could help answer the question about order of execution; I couldn't tell you off the top of my head.
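If you'd rather capture the plan from a script than via the SSMS shortcuts, here is a minimal sketch using SET STATISTICS XML (the query runs, and the actual plan comes back as an extra XML result set):
set statistics xml on
-- query goes here: it executes, and the actual execution plan is returned as XML
set statistics xml off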

Related

Tuning Oracle Query for slow select

I'm working on an Oracle query that does a select on a huge table; however, the joins with other tables seem to cost a lot of processing time.
I'm looking for tips on how to improve this query.
I'm attaching a version of the query and its explain plan.
Query
SELECT
l.gl_date,
l.REST_OF_TABLES,
(
SELECT
MAX(tt.task_id)
FROM
bbb.jeg_pa_tasks tt
WHERE
l.project_id = tt.project_id
AND l.task_number = tt.task_number
) task_id
FROM
aaa.jeg_labor_history l,
bbb.jeg_pa_projects_all p
WHERE
p.org_id = 2165
AND l.project_id = p.project_id
AND p.project_status_code = '1000'
Something to mention:
This query takes data from Oracle to send it to a SQL Server database, so I need it to be this big; I can't narrow the scope of the query.
The purpose is to set it up as a SQL Server job with SSIS so it runs periodically.
One obvious suggestion is not to use a subquery in the SELECT clause.
Instead, you can try joining the tables:
SELECT
l.gl_date,
l.REST_OF_TABLES,
t.task_id
FROM
aaa.jeg_labor_history l
Join bbb.jeg_pa_projects_all p
On (l.project_id = p.project_id)
Left join (SELECT
tt.project_id,
tt.task_number,
MAX(tt.task_id) task_id
FROM
bbb.jeg_pa_tasks tt
Group by tt.project_id, tt.task_number) t
On (l.project_id = t.project_id
AND l.task_number = t.task_number)
WHERE
p.org_id = 2165
AND p.project_status_code = '1000';
Cheers!!
As I don't know exactly how many rows this query returns or how many rows this table/view has, I can only offer a few simple tips which might help you get better query performance:
Check indexes. There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement (see the sketch after this list).
Limit the size of your working data set.
Only select columns you need.
Remove unnecessary tables.
Remove calculated columns in JOIN and WHERE clauses.
Use an inner join instead of an outer join if possible.
Your view contains a lot of data, so you can also break it down and pull only the information you need from it.
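For the first tip, here is a sketch of what indexes for the joins and filters in the query above might look like, assuming they don't already exist (the index names are made up):
-- supports the MAX(task_id) lookup / join on jeg_pa_tasks
CREATE INDEX jeg_pa_tasks_proj_task_ix
  ON bbb.jeg_pa_tasks (project_id, task_number);

-- supports the join from jeg_labor_history on project_id / task_number
CREATE INDEX jeg_labor_hist_proj_task_ix
  ON aaa.jeg_labor_history (project_id, task_number);

-- supports the org_id / project_status_code filter and the join on project_id
CREATE INDEX jeg_pa_projects_org_status_ix
  ON bbb.jeg_pa_projects_all (org_id, project_status_code, project_id);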

mystery on BigQuery views

Here is my mystery. On the console, when I compute this query, it works perfectly well:
SELECT rd.ds_id AS ds_id
FROM (SELECT ds_id, 1 AS dummy FROM bq_000010.table) rd
INNER JOIN EACH (SELECT 1 AS dummy) cal ON (cal.dummy = rd.dummy);
Then I save it as a view called dataset.myview, and run:
SELECT * FROM dataset.myview LIMIT 1000
But this raises the following error:
SELECT query which references non constant fields or uses aggregation
functions or has one or more of WHERE, OMIT IF, GROUP BY, ORDER BY
clauses must have FROM clause.
Nevertheless, when I try SELECT * FROM dataset.myview, i.e. without the LIMIT, it works!
And in fact, when I run my full query with the LIMIT at the bottom, it also raises the error:
SELECT rd.ds_id AS ds_id
FROM (SELECT ds_id, 1 AS dummy FROM bq_000010.table) rd
INNER JOIN EACH (SELECT 1 AS dummy) cal ON (cal.dummy = rd.dummy) LIMIT 1000;
Nevertheless, when I add an internal ORDER BY, it runs fine again:
SELECT rd.ds_id AS ds_id
FROM (SELECT ds_id,
1 AS dummy
FROM bq_000010.000010_flux_visites_ds
ORDER BY ds_id) rd
INNER JOIN EACH (SELECT 1 AS dummy) cal ON (cal.dummy = rd.dummy) LIMIT 1000
What happens if you apply an order by to your select on the view? Or do you require random results?
A query with a LIMIT clause may still be non-deterministic if there is no operator in the query that guarantees the ordering of the output result set. This is because BigQuery executes using a large number of parallel workers. The order in which parallel jobs return is not guaranteed.
I'm not sure why the order by here would make a difference. However, it's generally odd to see a limit without any order by, which is why I asked about order. A complete SWAG is that perhaps the parallel workers are completing the outer join and limit before the inner select is complete, causing an internal error, and by applying an order by the system is forced to materialize the records before executing the inner join.
But I really have NO CLUE.
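For reference, a sketch of what "apply an order by to your select on the view" would look like (untested; ds_id is the only column the view exposes):
SELECT *
FROM dataset.myview
ORDER BY ds_id
LIMIT 1000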

Why does the Oracle optimiser treat joins by JOIN and WHERE differently?

I have a query on which I used a query optimiser:
SELECT res.studentid,
res.examid,
r.percentcorrect,
MAX(attempt) AS attempt
FROM tbl res
JOIN (SELECT studentid,
examid,
MAX(percentcorrect) AS percentcorrect
FROM tbl
GROUP BY studentid, examid) r
ON r.studentid = res.studentid
AND r.examid = res.examid
AND r.percentcorrect = res.percentcorrect
GROUP BY res.studentid, res.examid, r.percentcorrect
ORDER BY res.examid
What surprised me was that the optimiser returned the following as over 40% faster:
SELECT /*+ NO_CPU_COSTING */ res.studentid,
res.examid,
r.percentcorrect,
MAX(attempt) AS attempt
FROM tbl res,
(SELECT studentid,
examid,
MAX(percentcorrect) AS percentcorrect
FROM tbl
GROUP BY studentid, examid) r
WHERE r.studentid = res.studentid
AND r.examid = res.examid
AND r.percentcorrect = res.percentcorrect
GROUP BY res.studentid, res.examid, r.percentcorrect
ORDER BY res.examid
Here are the execution plans for both:
How is that possible? I always thought the optimiser treats a JOIN exactly the same as the equivalent WHERE clause in the optimised query...
From here:
In general you should find that the cost of a table scan will increase when you enable CPU Costing (also known as "System Statistics"). This means that your improved run time is likely to be due to changes in execution path that have started to favour execution plans. There are a few articles about system statistics on my blog that might give you more background, and a couple of links from there to other relevant articles:
http://jonathanlewis.wordpress.com/category/oracle/statistics/system-stats/
In other words, your statistics might be stale, but since you have "turned them off" for this query, you avoid using an inefficient path: hence the (temporary?) improvement.
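If you want to check whether system statistics (CPU costing) are actually in place and how stale they are, here is a rough sketch (sys.aux_stats$ holds the gathered values; gathering fresh ones needs DBA-level privileges):
-- what system statistics does the optimizer currently have?
SELECT sname, pname, pval1, pval2
FROM sys.aux_stats$;

-- gather fresh workload statistics over the next 60 minutes
EXEC DBMS_STATS.GATHER_SYSTEM_STATS('interval', interval => 60);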

Does Sql JOIN order affect performance?

I was just tidying up some sql when I came across this query:
SELECT
jm.IMEI ,
jm.MaxSpeedKM ,
jm.MaxAccel ,
jm.MaxDeccel ,
jm.JourneyMaxLeft ,
jm.JourneyMaxRight ,
jm.DistanceKM ,
jm.IdleTimeSeconds ,
jm.WebUserJourneyId ,
jm.lifetime_odo_metres ,
jm.[Descriptor]
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE ( wu.isActive = 1 )
AND ( j.JourneyDuration > 2 )
AND ( j.JourneyDuration < 1000 )
AND ( j.JourneyDistance > 0 )
My question is: does the order of the joins make any performance difference? For the above query I would have started with
FROM dbo.Reporting_JourneyMaster90 AS jm
and then joined the other two tables to that one.
Join order in SQL Server 2008 R2 does unquestionably affect query performance, particularly in queries where there are a large number of table joins with where clauses applied against multiple tables.
Although the join order is changed in optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution, as the very act of optimisation uses precious resources.
We have seen queries that were performing like dogs (1 min+ execution time) come down to sub-second performance just by changing the order of the join expressions. Please note, however, that these are queries with 12 to 20 joins and where clauses on several of the tables.
The trick is to set your order to help the query optimiser figure out what makes sense. You can use FORCE ORDER, but that can be too rigid. Try to make sure that your join order starts with the tables that will reduce the data most through their where clauses.
No, the JOIN order is changed during optimization.
The only caveat is the Option FORCE ORDER which will force joins to happen in the exact order you have them specified.
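For reference, the hint goes at the end of the statement; applied to the query from the question it would look roughly like this (column list trimmed, and as noted above, forcing the order is usually not the best idea):
SELECT jm.IMEI, jm.MaxSpeedKM, jm.DistanceKM -- etc.
FROM dbo.Reporting_WebUsers AS wu WITH (NOLOCK)
INNER JOIN dbo.Reporting_JourneyMaster90 AS jm WITH (NOLOCK) ON wu.WebUsersId = jm.WebUsersId
INNER JOIN dbo.Reporting_Journeys AS j WITH (NOLOCK) ON jm.WebUserJourneyId = j.WebUserJourneyId
WHERE wu.isActive = 1
AND j.JourneyDuration > 2
AND j.JourneyDuration < 1000
AND j.JourneyDistance > 0
OPTION (FORCE ORDER); -- joins are evaluated in exactly the written order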
I have a clear example of inner join affecting performance. It is a simple join between two tables. One had 50+ million records, the other has 2,000. If I select from the smaller table and join the larger it takes 5+ minutes.
If I select from the larger table and join the smaller it takes 2 min 30 seconds.
This is with SQL Server 2012.
To me this is counterintuitive, since I am using the largest dataset for the initial query.
Usually not. I'm not 100% sure this applies verbatim to SQL Server, but in Postgres the query planner reserves the right to reorder the inner joins as it sees fit. The exception is when you reach a threshold beyond which it's too expensive to investigate changing their order.
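In Postgres that threshold is the join_collapse_limit setting (plus geqo_threshold for the genetic optimizer); a quick sketch of how to inspect or pin it for a session:
-- Postgres only: how many FROM items the planner will freely reorder (default 8)
SHOW join_collapse_limit;

-- make the planner take explicit JOINs in exactly the written order, for this session
SET join_collapse_limit = 1;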
JOIN order doesn't matter, the query engine will reorganize their order based on statistics for indexes and other stuff.
To test, do the following:
select "show actual execution plan" and run the first query
change the JOIN order and run the query again
compare the execution plans
They should be identical as the query engine will reorganize them according to other factors.
As commented in the other answer, you could use OPTION (FORCE ORDER) to force exactly the order you want, but it might not be the most efficient one.
As a general rule of thumb, the JOIN order should start with the table with the fewest records and end with the one with the most records, since in some DBMS engines the order can make a difference, particularly if the FORCE ORDER command is used to help limit the results.
Wrong. In SQL Server 2005 it definitely matters, since you are limiting the dataset from the beginning of the FROM clause. If you start with 2,000 records instead of 2 million, your query is faster.

Limit the number of rows being processed in this query

I cannot post the actual query here, so I am posting the basic outline of the query, which should suffice. The query is used to page and return a set of users ranked according to the output of a function, say F. F takes parameters from the User table and other tables which are joined. The query is something like the following:
Select TOP (20) *
from (select row_number() OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where DATEDIFF(dd, LastLogin, GetDate()) > 200 and Y.bar > FUBAR) as temp
where rownum > 0
According to the execution plan, 91% of the cost is in the Sort. Since the sort is based on F, I cannot add an index to speed it up. The inner query fetches all the records, filters them, and then sorts. Most of the time users only look at results in pages 1 - 5 (one page has 20 records, hence the TOP (20)), so I was wondering whether there is any way to limit the rows being processed and sorted, and make the query faster and less CPU intensive most of the time.
EDIT: When I say that tables are joined to calculate F, what I mean is this: F takes in parameters such as X.blah, Y.foo and Y.bar. That's it. All these parameters also need to be returned as part of the result set; e.g. the latitude and longitude of the user's last location is stored in X.
At least you could try not to call DATEDIFF on every row:
declare @target_date datetime
set @target_date = DATEADD(dd, -200, GetDate())
Select TOP (20) *
from (select row_number() OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where LastLogin < @target_date and Y.bar > FUBAR) as temp
where rownum > 0
Perhaps do the same thing with FUBAR and F?
The example above doesn't give you much of a performance gain, but it provides a general idea of how to reduce function calls.
Not sure if and how much it'll help - but two things:
can you make sure all the foreign key columns and columns in the WHERE clause (user.blah, X.blah, user.foo, Y.foo, Y.bar) are indeed indexed? This will significantly help JOIN performance.
If those columns are not indexed, there might also be a sort operation in the execution plan that SQL Server uses so it can then use a Merge Join for the data. So your sort might not even really come from the OVER (ORDER BY F DESC) that you think causes it.
you're combining TOP (20) with row numbers, but you're not defining any real ORDER BY for the complete result set, so your results will be random at best. Also, if you already define the rownum, couldn't you just use:
SELECT (columns)
FROM (.......) as temp
WHERE rownum BETWEEN 0 AND 20
Some thoughts:
What kind of function is F? Can it be rewritten as an inline table-valued function? That would give the optimizer an opportunity to expand the function into a reusable execution plan.
You're doing a LEFT OUTER JOIN on Y, but then include a column from Y in your WHERE clause, effectively rendering it as an INNER JOIN. Although the optimizer probably renders the execution plan in the same way, I would clean that up so that it's easier to troubleshoot in the future.
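On that second point, here is a sketch of the cleanup using the same placeholder names as the question's outline (and the @target_date variable from the earlier answer):
Select TOP (20) *
from (select row_number() OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
-- keep the outer join but filter Y inside the ON clause, so rows without a Y match survive:
left outer join Y on user.foo = Y.foo and Y.bar > FUBAR
where LastLogin < @target_date) as temp
where rownum > 0
-- alternatively, if rows without a Y match should be dropped anyway, just write INNER JOIN Y
-- and leave "Y.bar > FUBAR" in the WHERE clause as before.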