Using ROW_NUMBER() OVER (PARTITION BY ...) in a view runs endlessly vs. running it as a query - SQL

I have been tasked with improving the performance of a SQL view whose pseudo-code is below. It has ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...), which seems to be causing this view to run indefinitely until I kill the query.
I.e. when I run select * from view_name where Date = '2015-01-31', it runs forever. But it runs fine if I run the whole view as a query (e.g. remove the ALTER VIEW statement on top and add the WHERE clause at the end of the code).
I am using SQL Server 2005. It may be that the SQL Server 2005 engine generates execution plans differently for views vs. normal queries, because, as I mentioned, the entire code in the view runs fine when executed as a query. How can I make the view itself run faster so it can return results? One of the tables my view queries (table1 in this pseudo-code) is very large and is partitioned by date, with each month's data in one partition.
PSEUDO-CODE:
CREATE VIEW Sample
AS
WITH Dataset1 AS (
    SELECT table1.DATE
        ,column1
        ,column2
        ,column3
        ,column4
    FROM table1
    INNER JOIN table2 ON table1.DATE = table2.DATE
)
,Dataset2 AS (
    SELECT Dataset1.DATE
        ,column1
        ,column2
        ,column3
        ,column4
    FROM table3
    INNER JOIN Dataset1 ON table3.column1 = Dataset1.column1
)
SELECT ROW_NUMBER() OVER (
        PARTITION BY column1 ORDER BY column1 ASC
        ) AS RowNumber
    ,*
FROM Dataset2
GO

My first steps towards improving this query would be:
Reducing the code complexity: why use two CTEs? From the example code, this could be rewritten as a single query joining table1 to table2 and then to table3, with the ROW_NUMBER() directly in the SELECT clause (see the sketch below). This may not affect performance directly, but a simple query is much easier to analyse than a complex one.
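A minimal sketch of that single-query form, reusing the question's pseudo-code names; which tables column1-column4 actually live on is an assumption here, so qualify them against your real schema:

CREATE VIEW Sample
AS
SELECT ROW_NUMBER() OVER (
        PARTITION BY t3.column1 ORDER BY t3.column1 ASC
        ) AS RowNumber
    ,t1.DATE
    ,t3.column1
    ,column2
    ,column3
    ,column4
FROM table1 t1
INNER JOIN table2 t2 ON t1.DATE = t2.DATE
INNER JOIN table3 t3 ON t3.column1 = t1.column1
GO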
Reconsidering the intended behaviour of the ROW_NUMBER(): you are partitioning and ordering by the same column. For each distinct value in column1, SQL Server will attempt to order the rows by column1; within a partition those values are all identical, so the ordering is effectively arbitrary and any processing time devoted to it is wasted. (This depends hugely on other factors, e.g. any clustered indexes on these tables.)
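For example, if the intent is to number each column1 group's rows by recency, ordering on a different column (DATE here, purely as an illustration) gives the sort real work to do:

ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY DATE DESC) AS RowNumber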
Retrieving the execution plan for this query and examining it for further ideas. The execution plan may include tips for indexes that can be applied - which you should consider, but don't take SQL Server's word as gospel.
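For reference, on SQL Server 2005 you can capture the estimated plan without executing the view; a sketch, using the WHERE value from the question (SET SHOWPLAN_XML must be alone in its batch, hence the GO separators):

SET SHOWPLAN_XML ON;
GO
SELECT * FROM Sample WHERE DATE = '2015-01-31';
GO
SET SHOWPLAN_XML OFF;
GO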
I might have further suggestions if I could see an execution plan, have a bit more insight into the structures of these tables (including indexes and cardinality of the relationships), and know how big "very large" means to you :)

Related

Query process steps of select query

I am confused about the query processing steps of a SELECT query. I read some docs that say a SELECT query runs like this:
1. Getting Data (From, Join)
2. Row Filter (Where)
3. Grouping (Group by)
4. Group Filter (Having)
5. Return Expressions (Select)
6. Order & Paging (Order by & Limit / Offset)
I re-tested by running a query joining table A (70M records) and table B (75M records):
select *
from A join B on A.code = B.box_code
where B.box_code = '123'
compared with:
select *
from A join (select * from B where box_code = '123') B on A.code = B.box_code
I assumed the first query would run slower than the second, because the first has to map the large tables before filtering, while the second filters on box_code before mapping. But the two queries run in the same time. Why did that happen?
I searched Google; it may be related to the clustered index, but I am not sure.
One more question: why can the clustered index apply the WHERE condition to filter data before the join? I thought the query would run the join before the WHERE.
Where did I get it wrong?
[The original post attached execution plan screenshots for the first and second queries.]
Thanks
This part is wrong...
select query will run like this
Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Group Filter (Having)
Return Expressions (Select)
Order & Paging (Order by & Limit / Offset)
Oracle has a number of operations that it can perform to satisfy a query. Some operations may require child operations to be completed first. Operations include things like TABLE ACCESS BY INDEX ROWID, INDEX RANGE SCAN, and NESTED LOOPS.
Oracle's optimizer decides which operations are necessary and in what order. It very often will, for example, apply WHERE conditions to a row source before joining that row source to another one. It does that for exactly the reason you imply in your post: because it is probably faster to filter a million rows down to 10 before doing a join.
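You can watch it do this by comparing the plans for your two queries; in most cases they come out identical. A sketch, using the question's tables:

EXPLAIN PLAN FOR
SELECT *
FROM A JOIN B ON A.code = B.box_code
WHERE B.box_code = '123';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);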
Oracle maintains an elaborate set of statistics on each table and column so that it can estimate when you submit your query what is likely to work well.
Theoretically, your job when writing SQL is to describe what you want and leave the how part to Oracle. In practice, the how part is still important, so your question is a very good one. Read Oracle's documentation on the subject, titled "Oracle Database SQL Tuning Guide". There is a version for each release of the database and they're available for free online (see: https://docs.oracle.com).

Materialized View Performance of Exists vs In

I did some googling and couldn't find a clear answer to an Oracle performance question; maybe we can document it here. I am building an MV that is pretty simple, but on fairly large tables. Like many things, the query can be written in more than one way. In my case the two solutions have similar costs and execution plans when written as SELECT statements, but when placed inside CREATE MATERIALIZED VIEW the execution time changes drastically. Any insight into why?
Tab1 is approx 40M records.
Tab2 is approx 8M records.
field1 is the primary key on Tab1; it is not a PK or unique on Tab2, but Tab2 does have an index on this field.
field2 is not a key nor is it indexed on either table (boo)
Queries are:
Q1:
SELECT
    T1.Several_Fields
FROM
    SCHEMA1.tab1 T1
WHERE T1.field2 LIKE 'EXAMPLE%'
AND T1.field1 NOT IN (
    SELECT T2.field1
    FROM SCHEMA1.tab2 T2
)
;
Q2:
SELECT
    T1.Several_Fields
FROM
    SCHEMA1.tab1 T1
WHERE T1.field2 LIKE 'EXAMPLE%'
AND NOT EXISTS (
    SELECT 1
    FROM SCHEMA1.tab2 T2
    WHERE T1.field1 = T2.field1
)
;
The two queries, as SELECT statements, run similarly in time, and the explain plan has them both utilizing the index scan rather than full table scans, as I would expect. What is unexpected is that Q2 runs vastly faster (47 seconds vs. 81 days, per v$session_longops) when run in an MV creation like:
CREATE MATERIALIZED VIEW SCHEMA1.mv_blah as
(
Q1 or Q2
);
Does anyone have any insight? Is there a rule here to avoid IN, where possible, for MViews only? I know of the tricks between IN and EXISTS when indexes do not exist between the tables, but this one had me baffled. This is running against an Oracle 11g database.
This looks like a known bug. If you have access to My Oracle Support, look at Slow Create/Refresh of Materialized View Based on NOT IN Definition Query (Doc ID 1591851.1); if you don't, a summary of the problem is available, though it's less useful.
The contents of the MOS version can't be reproduced here, of course, but suffice it to say that the only workaround is what you're already doing with NOT EXISTS. It's fixed in 12c, which doesn't help you much.
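In other words, on 11g the MV definition itself has to carry the NOT EXISTS form; a sketch built from the question's Q2 (Several_Fields is the question's placeholder column list):

CREATE MATERIALIZED VIEW SCHEMA1.mv_blah AS
SELECT T1.Several_Fields
FROM SCHEMA1.tab1 T1
WHERE T1.field2 LIKE 'EXAMPLE%'
AND NOT EXISTS (
    SELECT 1
    FROM SCHEMA1.tab2 T2
    WHERE T1.field1 = T2.field1
);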

MS Access 2010 SQL Top N query by group performance issue (continued)

I have significant performance issues (up to time-out) in MS Access 2010 with the query below. The table TempTableAnalysis contains between 10,000 and 15,000 records. I have already received input from this forum to work with a temporary table in the top 10 query (MS Access 2010 SQL Top N query by group performance issue).
Can anyone explain how to implement the temporary table in the subquery and how to join it? I can't get it to work.
Any other suggestions to improve performance are highly appreciated.
Here is my query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
Optimizing Access Query Performance For Large Data Sets
Based on your posted SQL Query, you have some options available to optimize and speed up the performance.
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
...
This is the first part, where TempTableAnalysis is the multi-thousand-record subquery. If you want to squeeze a little more performance out of this "temp" table, don't use it as a dynamic query (i.e., calculated on demand each time the query is opened); instead, construct a macro that pushes the output to a static table:
Appending Subquery Data to a Static Table:
Create a QUERY object and change its type to DELETE. Design it to delete the contents of your "temporary" table object. If you prefer using SQL, the command will look like:
DELETE My_Table.*
FROM My_Table;
Create a QUERY object and change its type to APPEND. Design it to query all fields from your query defined by the SQL statement of this OP. Again, the SQL version of this task has the following syntax:
INSERT INTO StaticAnalysisTable ( ID, Loc, Item, AvgOfScaledError )
SELECT t1.ID, t1.Loc, t1.Item, t1.AvgOfScaledError
FROM TempTableAnalysis as t1;
The next step, which is optional, is to automate the population of this static table. It's simple, however, and makes it less likely that you will forget to "refresh" and access your static table while it has stale data, causing inaccuracies in your results.
Create a macro with two steps, each using the OPEN QUERY action. When prompted for the query to open, reference the objects you created in the previous two steps in the following order (important): (1) DELETE query: (your delete query name), then (2) APPEND query: (your append query name).
SQL Query Comments and Suggestions
The following part of the posted SQL query could use some help:
...
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
There is a join between the subquery that generates the TOP-10 data and the outermost query that correlates those results with the rest of the MASE data. This isn't necessary if TempTableAnalysis.MASE represents a key value.
The ORDER BY in the innermost query isn't necessary unless it is intended to force some sort of selection criteria (as when using SQL analytic functions), and this doesn't look like one of those cases. Ordering records from large data sets is also a wasteful CPU and memory sink.
EDIT: As a counter-point, the ORDER BY clause used alongside a TOP N query actually does have a purpose: it defines which N rows are kept. I am still not clear whether it is necessary here. To round out the discussion, another SO thread talks about How to Select Top 10 in an Access Query.
WHERE t2.MASE IN (...
You may be experiencing performance blocks with very large in-list set operations. On an Oracle database server, other developers and I have discovered a limit on the number of discrete elements in an in-list operator; that value was in the thousands, and may be further limited by server and database resources.
Consider using a SQL JOIN instead. The place where you define TABLE objects can also be populated with aliased SQL queries, known as INLINE VIEWS. Since you're using Access, if an inline view does not work directly, just define another Access QUERY object and reference it in your final query as if it were a table.
A possible rewrite to the ending part of the original query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
...
FROM TempTableAnalysis AS t2
INNER JOIN (
    SELECT TOP 10 t1.MASE, t1.ABCByPick
    FROM TempTableAnalysis AS t1
    ORDER BY t1.MASE DESC
) AS ttop
    ON t2.MASE = ttop.MASE
    AND t2.ABCByPick = ttop.ABCByPick
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
You will definitely need to run through these recommendations and validate the output data for accuracy. These represent approaches to capturing some of the "low-hanging fruit" (the easy items) you can pursue to speed up your query and reporting operations.
Conclusions and Closing Comments
As background for other readers: the database object TempTableAnalysis is not a static table. It is the result of a subquery presented in another SO post requesting help with an Access TOP N query. The query comes from multiple tables approaching 10,000 records in size (each?).
Tip: a query result in Access ALSO has potential table-like behaviors. You can push the output to a table for joining (as described above) or just join to the query object itself (be careful, though, especially when you get to "chaining" multiple query operations).
The strategy of this solution was:
To minimize the number of trips through one or more instances of this very large table.
To pre-process and index optimize any data that would otherwise be "static" for the duration of its analysis.
To audit and review the SQL code used to obtain the final results.
Definitely look into Access MACROS. Coupled with identifying static data in your data sets, you can offload processing of your complex background analytic queries to improve the user experience when they view and query through the final results. Good Luck!

SQL Query Speed

I'm building a report that collates a huge amount of data. The data for the report has taken shape as a view, which runs in about 2 to 9 seconds (which is acceptable). I also have a function that returns a set of IDs which needs to filter the view:
select *
from vw_report
where employee_id in (select id from dbo.fnc_security(@personRanAsID))
The security function on its own runs in less than a second. However when I combine the two as I have above the query takes over 15 minutes.
Both the view and the security function do quite a lot of work, so originally I thought it might be down to locking. I've tried NOLOCK on the security function, but it made no difference.
Any tips or tricks as to where I may be going wrong?
It may be worth noting that when I copy the result of the function into the in part of the statement:
select *
from vw_report
where employee_id in (123, 456, 789)
The speed increases back to 2 to 9 seconds.
Firstly, any extra background will help here...
- Do you have the code for the view and the function?
- Can you specify the schema and indexes used for the tables being referenced?
Without these, advice becomes difficult, but I'll have a stab...
1). You could change the IN clause to a Join.
2). You could specify WITH (NOEXPAND) on the view.
SELECT
*
FROM
vw_report WITH (NOEXPAND)
INNER JOIN
(select id from dbo.fnc_security(@personRanAsID)) AS security
ON security.id = vw_report.employee_id
Note: I'd try without NOEXPAND first.
The other option is that the combination of the indexes and the formulation of the view makes it very hard for the optimiser to create a good execution plan. With the extra info I asked for above, this may be improvable.
It takes so much time because the sub-select is executed for each row from vw_report, while in the second query it isn't. You should use something like:
select *
from vw_report r, (select id from dbo.fnc_security(@personRanAsID)) v
where r.employee_id = v.id
I ended up dumping the result from the security function into a temporary table and using the temporary table in my main query. Proved to be the fastest method.
e.g.:
create table #tempTable (id bigint)

insert into #tempTable (id)
select id
from dbo.fnc_security(@personRanAsID)

select *
from vw_report
where employee_id in (select id from #tempTable)

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT

WITH LargeData AS (
    SELECT * -- This is a MASSIVE table indexed on dt field
    FROM mydata
    WHERE dt=@DT
), Ordered AS (
    SELECT TOP 10 *
        , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.

SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;

WITH Ordered AS (
    SELECT TOP 10 *
        , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values from a list based on a fields data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc.) the bottom query executes MUCH faster than the top one, sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format of this that involves a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "Ranking" or secondary queries is just fluff for this example, the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE @DT int; SET @DT=20110717;

WITH LargeData AS (
    SELECT * -- This is a MASSIVE table indexed on dt field
    FROM mydata
    WHERE dt=@DT
    **OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
    SELECT TOP 10 *
        , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to the temp table version in the previous query. I found the execution times improved in almost exactly the same way, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
does NOT perform in the same fashion (you must use the static-number style, TOP 999999999).
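For concreteness, here is the first query again with the TOP trick applied (same pseudo-tables as above; 999999999 just needs to exceed any possible row count):

DECLARE @DT int; SET @DT=20110717;

WITH LargeData AS (
    SELECT TOP 999999999 * -- the static TOP forces this subquery to be evaluated separately
    FROM mydata
    WHERE dt=@DT
), Ordered AS (
    SELECT TOP 10 *
        , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered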
Explanation:
From what I can tell from the actual execution plans of the query in both formats (the original with normal CTEs, and the one where each subquery has TOP 999999999):
The normal query joins everything together as if all the tables were in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the subqueries from the main query in order to apply the TOP, forcing the creation of an in-memory "Bitmap" of each subquery that is then joined to the main query. This appears to do exactly what I wanted, and it may even be more efficient, since servers with large amounts of RAM can execute the query entirely in memory without any disk IO. In my case we have 280 GB of RAM, so far more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and hints. I can find no reference in the documentation to CTEs being able to use statistics, and it says specifically that you can't use hints.
Temp tables are often the most performant way to go with a large data set when the choice is between temp tables and table variables, even when you don't use indexes (possibly because temp tables have statistics from which to develop the plan), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a good query plan. In the second query a good plan can't be generated, because you're inserting the values into a new temporary table; my guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table as you already do, and then create a non-clustered index on the valuefield column. This might help to improve performance.
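A sketch of that suggestion against the second query's temp table (the index name is illustrative):

SELECT *
INTO #LargeData
FROM mydata
WHERE dt = @DT;

CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
    ON #LargeData (valuefield DESC);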
It is quite possible that SQL Server is optimizing for the wrong value of the parameter.
There are a couple of options
Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The problem with this is that you might pick the wrong value.
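Either hint goes in an OPTION clause at the end of the statement; for example, appended to the first query above (20110717 is just the sample value from the question):

SELECT * FROM Ordered
OPTION (RECOMPILE);

-- or:
SELECT * FROM Ordered
OPTION (OPTIMIZE FOR (@DT = 20110717));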
See "I Smell a Parameter!" from the SQL Server Query Optimization Team blog.