How to avoid Rownum? - sql

The query below is limited to 1000 rows because of a batch constraint, so it should fetch no more than 1000 rows.
Without ROWNUM it takes only 5 seconds to fetch more than 1000 records, but with ROWNUM it takes 20 seconds.
SELECT E.INFO_ID
FROM TAB1 E LEFT OUTER JOIN TAB2 D
ON E.INFO_ID= D.INFO_ID
WHERE D.INFO_ID IS NULL AND ROWNUM < 1000;
Please help me tune the query without affecting its functionality.

Look at the execution plans. The optimizer probably thinks it can reach the first 1000 results more quickly by following a different path, whereas for the complete data it uses a hash join or similar, which surprisingly turns out to be quick on the first records as well.
Once you know the execution plans, you can use hints to make the optimizer follow the path that you know from experience to be better.
Anyhow, you are asking for tab1 records that don't exist in tab2, but rather than saying so with NOT EXISTS, NOT IN or MINUS, you kind of hide this by using a left join. This can be faster sometimes, but it's a trick after all. Why not rewrite the query in a more straightforward way and see how it performs? I think such a statement might be more stable under slight alterations such as adding a ROWNUM limit. It's worth a try.
EDIT: Some clarification. You are asking for IDs that exist in tab1 but not in tab2. This would be:
SELECT INFO_ID FROM TAB1
MINUS
SELECT INFO_ID FROM TAB2;
You can also word the task differently, such as: I want all IDs from tab1 that don't exist in tab2:
SELECT INFO_ID FROM TAB1
WHERE NOT EXISTS
(
SELECT * FROM TAB2
WHERE TAB2.INFO_ID = TAB1.INFO_ID
);
Or: I want all IDs from tab1 that are not in tab2:
SELECT INFO_ID FROM TAB1
WHERE INFO_ID NOT IN
(
SELECT INFO_ID FROM TAB2
);
What you are saying instead is: for every ID in tab1, find all matching IDs in tab2 and combine them. For IDs in tab1 that have no match in tab2, give me a result record, too. Then from that (probably huge) set of results remove all matches, so that I am left with those IDs that have no match.
Many words to describe the same task. Accordingly, the query is not easy to read for people not familiar with this trick. The query certainly produces a large intermediate result. So why do people use it anyway? Database systems grew up with joins, so this is something they are really good at. For example, they use hash mechanisms to join the records rather than looping record by record. So in spite of suggesting a rather complicated access path, the left join technique may result in good performance.
However, the queries above are more straightforward. Let's look at the first; a possible execution plan would be: order tab1 IDs, order tab2 IDs, then loop once to keep tab1 IDs without a tab2 match. Very simple. Sorting takes time, but then you go sequentially through both results. If this happens to give you the first thousand matches rather quickly, it is likely to do so as well when you limit the results with ROWNUM < 1000. And the second query? Loop through tab1 and, with the ID given, look for a match in tab2; if there is none, keep that record. This may be fast with an index, and adding ROWNUM < 1000 will probably not change the speed of getting the first records, because the execution path stays the same. Third query: it can be interpreted like the second, or the tab2 IDs are put in some kind of array with fast access. Either way, ROWNUM < 1000 is not likely to change much in the access path.
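To make the "order both, then loop once" plan for the MINUS query concrete, here is a minimal sketch in Python. The function name and the sample IDs are made up for illustration, and a real database engine will of course implement this differently; it only shows the idea of a single pass over two sorted inputs.

```python
def sorted_anti_join(tab1_ids, tab2_ids):
    """Return tab1 IDs with no match in tab2 using one pass over two sorted lists."""
    left = sorted(set(tab1_ids))    # MINUS also removes duplicates
    right = sorted(set(tab2_ids))
    result = []
    j = 0  # cursor into the sorted tab2 IDs
    for value in left:
        # Advance the tab2 cursor past everything smaller than the current tab1 ID.
        while j < len(right) and right[j] < value:
            j += 1
        # Keep the tab1 ID only if tab2 has no equal value at the cursor.
        if j == len(right) or right[j] != value:
            result.append(value)
    return result


missing = sorted_anti_join([5, 1, 9, 3, 7], [3, 4, 9])
print(missing)
```

Because the result comes out in sorted order as the loop advances, stopping after the first N rows does not require a different strategy, which is the point made above about the row limit.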
With your query, however, it is difficult to say. When all records must be considered, a hash join might be fastest. But if only some records suffice, why join everything? Maybe the optimizer then decides to go through tab1 record by record and look for a match in tab2. This would alter the execution plan drastically and can be much faster for the first 1000 records. It's just not guaranteed to be so, and with bad luck, as in your case, it can even get slower.
Well, after all, Oracle has a great optimizer. Queries get rewritten, and your query might get turned into a NOT EXISTS query or vice versa. And even without rewriting: in spite of dealing with different queries, the optimizer can still decide on the same execution plan. So you never know. But it's always worth a try.
My advice: write straightforward SQL. Quite often an SQL statement can resemble the way one would formulate the task in words, just as shown above. Only when facing performance problems should you think about how to rewrite the query to deal with them.
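If you want to convince yourself that the four formulations really ask for the same IDs, a quick check along these lines may help. This is only a sketch: it uses an in-memory SQLite database as a stand-in for Oracle (so MINUS becomes EXCEPT and the row limit becomes LIMIT), and the sample data is invented. It says nothing about which form is fastest on your system.

```python
import sqlite3

# In-memory SQLite stands in for Oracle here (an assumption for illustration);
# SQLite spells MINUS as EXCEPT and limits rows with LIMIT instead of ROWNUM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab1 (info_id INTEGER PRIMARY KEY);
    CREATE TABLE tab2 (info_id INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO tab1 VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany("INSERT INTO tab2 VALUES (?)", [(i,) for i in range(2, 11, 2)])

queries = {
    "left join": """
        SELECT e.info_id FROM tab1 e
        LEFT OUTER JOIN tab2 d ON e.info_id = d.info_id
        WHERE d.info_id IS NULL""",
    "except": "SELECT info_id FROM tab1 EXCEPT SELECT info_id FROM tab2",
    "not exists": """
        SELECT info_id FROM tab1
        WHERE NOT EXISTS (SELECT 1 FROM tab2 WHERE tab2.info_id = tab1.info_id)""",
    "not in": """
        SELECT info_id FROM tab1
        WHERE info_id NOT IN (SELECT info_id FROM tab2)""",
}

# Run every formulation and sort the IDs so the results are comparable.
results = {name: sorted(row[0] for row in conn.execute(sql))
           for name, sql in queries.items()}
for name, ids in results.items():
    print(name, ids)

# The same statement with a row limit, analogous to the ROWNUM restriction.
first_three = [row[0] for row in conn.execute(queries["not exists"] + " LIMIT 3")]
print(first_three)
```

With this sample data all four queries return the odd IDs. Whether they also share an execution plan is something only the plans on your own database can tell you.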

Related

SQL - Make Join Query Faster

I've made a query that selects 2 values from 2 tables. I need to run this query about 32 times when a visitor visits my website. This makes the page quite slow (it takes over 5 seconds to fully load).
The query looks like this:
SELECT tmdb.name, patch.sfo_title
FROM tmdb
RIGHT JOIN patch
ON tmdb.titleid = CONCAT(patch.cusa, '_00')
WHERE cusa = :titleid
LIMIT 1
Is there any way to make this query faster? The query doesn't look like a big operation to me, so I'm not really sure why it's so slow.
I would write the query as a left join (out of preference, not performance):
SELECT t.name, p.sfo_title
FROM patch p LEFT JOIN
tmdb t
ON t.titleid = CONCAT(p.cusa, '_00')
WHERE p.cusa = :titleid
LIMIT 1;
Then for performance, I would recommend indexes on patch(cusa, sfo_title) and tmdb(titleid).
Note that the use of LIMIT without ORDER BY is suspicious, although you might have reasons for it.
You are joining two tables on a CALCULATED field? No wonder it is slow. I have no idea how your tables are maintained, but you need to get that concatenated value of CUSA into the database as a separate field and get it indexed, as Gordon Linoff suggested. You could even maintain it through On Insert and On Update triggers. Personally, I would examine why you have similar but different keys in your two tables and try to rationalize them down to one. That Concat(CUSA, '_00') looks suspiciously like an opportunity to simplify the application.
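As a rough sketch of the "store the concatenated value and keep it in sync with triggers" idea, the following uses an in-memory SQLite database rather than MySQL. The column name titleid_key, the trigger names and the sample rows are all hypothetical, and MySQL trigger syntax differs from what is shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tmdb  (titleid TEXT PRIMARY KEY, name TEXT);
    -- titleid_key is a hypothetical extra column holding cusa || '_00'
    CREATE TABLE patch (cusa TEXT, sfo_title TEXT, titleid_key TEXT);
    CREATE INDEX patch_cusa ON patch (cusa);

    -- Keep the stored key in sync when a row is inserted
    CREATE TRIGGER patch_key_ins AFTER INSERT ON patch
    BEGIN
        UPDATE patch SET titleid_key = NEW.cusa || '_00' WHERE rowid = NEW.rowid;
    END;
    -- ... and when cusa is changed later
    CREATE TRIGGER patch_key_upd AFTER UPDATE OF cusa ON patch
    BEGIN
        UPDATE patch SET titleid_key = NEW.cusa || '_00' WHERE rowid = NEW.rowid;
    END;
""")

# Made-up sample data
conn.execute("INSERT INTO tmdb VALUES ('CUSA00001_00', 'Game One')")
conn.execute("INSERT INTO patch (cusa, sfo_title) VALUES ('CUSA00001', 'Game One Patch')")

# The join now compares two stored columns instead of a computed expression.
row = conn.execute("""
    SELECT t.name, p.sfo_title
    FROM patch p
    LEFT JOIN tmdb t ON t.titleid = p.titleid_key
    WHERE p.cusa = ?
    LIMIT 1""", ("CUSA00001",)).fetchone()
print(row)

# Changing cusa updates the stored key through the update trigger.
conn.execute("UPDATE patch SET cusa = 'CUSA00002' WHERE cusa = 'CUSA00001'")
key = conn.execute("SELECT titleid_key FROM patch").fetchone()[0]
print(key)
```

Whether this actually speeds up the original query depends on the real table sizes and indexes, so it is worth comparing the plans before and after.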

Is the EXCEPT operator computationally expensive?

I have a table with 30 records and a smaller table with 10 records; both tables have the same schema. All I want to do is return the records that are in the big table but not in the small table. The solution I found is to use the EXCEPT operator. However, when I ran the query, it took about 30 minutes. So I am wondering whether EXCEPT is computationally expensive and uses a lot of resources.
Are there any functions that can replace EXCEPT? Thanks for any help!
EXCEPT is a set operator and it should be reasonably optimized. It does remove duplicate values, so there is a bit more overhead than one might expect.
It is not so unoptimized that it would take 30 seconds on such small tables, unless you have columns whose size measures in many megabytes. Something else might be going on -- such as network or server contention.
EXCEPT is a very reasonable approach. NOT IN has a problem with NULL values and only works with one column. NOT EXISTS is going to work best when you have an appropriate index. Under some circumstances, EXCEPT is faster than NOT EXISTS.
In this case, you should be using NOT EXISTS. It is one of the most performant operations in SQL Server.
SELECT *
FROM big_table b
WHERE NOT EXISTS (
SELECT 1
FROM small_table s
WHERE s.id = b.id)
There is no need to make things complicated for something so simple.
SELECT * FROM Table1 WHERE ID NOT IN (SELECT ID FROM Table2)
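One caveat with the NOT IN form, as mentioned in an earlier answer, is its behaviour with NULL values. The sketch below illustrates it with an in-memory SQLite database and invented data; SQL Server applies the same three-valued logic to NOT IN, but this example is not a SQL Server test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big_table   (id INTEGER);
    CREATE TABLE small_table (id INTEGER);
    INSERT INTO big_table   VALUES (1), (2), (3);
    INSERT INTO small_table VALUES (2), (NULL);
""")


def ids(sql):
    """Run a query and return the first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


# With a NULL in the subquery, "id NOT IN (...)" is never true for any row.
not_in = ids("SELECT id FROM big_table WHERE id NOT IN (SELECT id FROM small_table)")

# NOT EXISTS and EXCEPT are not affected by the NULL in small_table.
not_exists = ids("""
    SELECT id FROM big_table b
    WHERE NOT EXISTS (SELECT 1 FROM small_table s WHERE s.id = b.id)""")
except_ = ids("SELECT id FROM big_table EXCEPT SELECT id FROM small_table")

print("NOT IN:    ", not_in)
print("NOT EXISTS:", not_exists)
print("EXCEPT:    ", except_)
```

So if the ID column in the small table is nullable, NOT IN can silently return nothing, whereas NOT EXISTS and EXCEPT still return the expected rows.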

Inconsistent results from BigQuery: same query, different number of rows

I noticed today that one of my queries returns inconsistent results: every time I run it, a different number of rows is returned (cache deactivated).
Basically the query looks like this:
SELECT *
FROM mydataset.table1 AS t1
LEFT JOIN EACH mydataset.table2 AS t2
ON t1.deviceId=t2.deviceId
LEFT JOIN EACH mydataset.table3 AS t3
ON t2.email=t3.email
WHERE t3.email IS NOT NULL
AND (t3.date IS NULL OR DATE_ADD(t3.date, 5000, 'MINUTE')<TIMESTAMP('2016-07-27 15:20:11') )
The tables are not updated between each query. So I'm wondering if you also have noticed that kind of behaviour.
I usually run queries that return a lot of rows (>1000), so a few missing rows here and there are hardly noticeable. But this query returns only a few rows, and the count varies every time between 10 and 20 rows :-/
If a Google engineer is reading this, here are two Job ID of the same query with different results:
picta-int:bquijob_400dd739_1562d7e2410
picta-int:bquijob_304f4208_1562d7df8a2
Unless I'm missing something, the query that you provide is completely deterministic and so should give the same result every time you execute it. But you say it's "basically" the same as your real query, so this may be due to something you changed.
There are a couple of things you can do to try to find the cause:
replace select * by an explicit selection of fields from your tables (a combination of fields that uniquely determine each row)
order the table by these fields, so that the order becomes the same each time you execute the query
simplify your query. In the above query, you can remove the first condition and turn the two left outer joins into inner joins and get the same result. After that, you could start removing tables and conditions one by one.
After each step, check if you still get different result sets. Then when you have found the critical step, try to understand why it causes your problem. (Or ask here.)
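The simplification step can be checked on a small scale. The following sketch uses an in-memory SQLite database with invented rows (not BigQuery, so LEFT JOIN EACH becomes LEFT JOIN and DATE_ADD is approximated with SQLite's datetime modifier). It selects explicit columns, orders by them, and compares the left-join form with the inner-join form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (deviceId TEXT);
    CREATE TABLE table2 (deviceId TEXT, email TEXT);
    CREATE TABLE table3 (email TEXT, date TEXT);
    INSERT INTO table1 VALUES ('d1'), ('d2'), ('d3'), ('d4');
    INSERT INTO table2 VALUES ('d1', 'a@example.com'), ('d2', 'b@example.com'),
                              ('d4', 'c@example.com');
    INSERT INTO table3 VALUES ('a@example.com', NULL),
                              ('b@example.com', '2016-07-01 00:00:00'),
                              ('c@example.com', '2016-07-27 00:00:00');
""")

# Shared date filter; datetime(..., '+5000 minutes') approximates DATE_ADD.
date_filter = """(t3.date IS NULL
    OR datetime(t3.date, '+5000 minutes') < '2016-07-27 15:20:11')"""

left_sql = f"""
    SELECT t1.deviceId, t2.email, t3.date
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.deviceId = t2.deviceId
    LEFT JOIN table3 t3 ON t2.email = t3.email
    WHERE t3.email IS NOT NULL AND {date_filter}
    ORDER BY t1.deviceId, t2.email"""

inner_sql = f"""
    SELECT t1.deviceId, t2.email, t3.date
    FROM table1 t1
    JOIN table2 t2 ON t1.deviceId = t2.deviceId
    JOIN table3 t3 ON t2.email = t3.email
    WHERE {date_filter}
    ORDER BY t1.deviceId, t2.email"""

left_rows = conn.execute(left_sql).fetchall()
inner_rows = conn.execute(inner_sql).fetchall()
print(left_rows)
print(inner_rows)
```

With the explicit column list and the ORDER BY, two runs can be compared row by row, which makes it easier to spot where the real query starts to differ.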

order of tables in FROM clause

For an SQL query like this:
Select * from TABLE_A a
JOIN TABLE_B b
ON a.propertyA = b.propertyA
JOIN TABLE_C c
ON b.propertyB = c.propertyB
Does the sequence of the tables matter? It won't change the results, but does it affect performance?
One can assume that table C holds much more data than A or B.
For each SQL statement, the engine creates a query plan. So no matter how you order the tables, the engine will choose a suitable path to execute the query.
More on plans: http://en.wikipedia.org/wiki/Query_plan
Depending on which RDBMS you are using, there are ways to enforce the join order and plan with hints, if you feel that the engine does not choose the correct path.
Sometimes the order of the tables does make a difference (when you are using different join types).
Conceptually, joins work like a cross product.
If you write a query like A JOIN B JOIN C,
it is treated as ((A*B)*C).
That means A and B are joined first, and the intermediate result is then joined with C.
So if inner joining A (100 records) and B (200 records) gives 100 records,
those 100 records are then compared with the 1000 records of C.
No.
Well, there is a very, very tiny chance of this happening, see this article by Jonathan Lewis. Basically, the number of possible join orders grows very quickly, and there's not enough time for the Optimizer to check them all. The sequence of the tables may be used as a tie-breaker in some very rare cases. But I've never seen this happen, or even heard about it happening, to anybody in real life. You don't need to worry about it.

Why can a SELECT TOP clause lead to a long execution time?

The following query takes forever to finish, but if I remove the top 10 clause it finishes rather quickly. big_table_1 and big_table_2 are two tables with 10^5 records.
I used to believe that a top clause would reduce the time cost, but apparently that is not the case here. Why?
select top 10 ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
There are other Stack Overflow discussions on this same topic (links at bottom). As noted in the comments above, it might have something to do with indexes and the optimizer getting confused and using the wrong one.
My first thought is that you are doing a select top ServiceRequestID from (select * ...), and the optimizer may have difficulty pushing the query down into the inner queries and making use of the index.
Consider rewriting it as:
select top 10 ServiceRequestID
from big_table_1 cap1
inner join big_table_2 cap2
on cap1.servicerequestid = cap2.customerreferencenumber
and cap1.statusid = 2
In your query, the database is probably trying to merge the results and return them and THEN limit it to the top 10 in the outer query. In the above query the database will only have to gather the first 10 results as results are being merged, saving loads of time. And if servicerequestID is indexed, it will be sure to use it. In your example, the query is looking for the servicerequestid column in a result set that has already been returned in a virtual, unindexed format.
Hope that makes sense. While hypothetically the optimizer is supposed to take whatever format we put SQL in and figure out the best way to return values every time, the truth is that the way we put our SQL together can really impact the order in which certain steps are done on the DB.
SELECT TOP is slow, regardless of ORDER BY
Why is doing a top(1) on an indexed column in SQL Server slow?
I had a similar problem with a query like yours. The ordered query without the top clause took 1 second; the same query with top 3 took 1 minute.
I found that using a variable for the top made it work as expected.
The code for your case:
declare @top int = 10;
select top (@top) ServiceRequestID
from
(
(select *
from big_table_1
where big_table_1.StatusId=2
) cap1
inner join
big_table_2 cap2
on cap1.ServiceRequestID = cap2.CustomerReferenceNumber
)
I can't explain why, but I can give you an idea:
try adding SET ROWCOUNT 10 before your query. It helped me in some cases. Bear in mind that this setting stays in effect until you change it, so you have to set it back after running your query (SET ROWCOUNT 0 turns the limit off).
Explanation:
SET ROWCOUNT: Causes SQL Server to stop processing the query after the specified number of rows are returned.
This can also depend on what you mean by "finished". If "finished" means you start seeing some output in a GUI, that does not necessarily mean the query has completed executing. It can mean that the results are beginning to stream in, not that the streaming is complete. When you wrap this into a subquery, the outer query can't really do its processing until all the results of the inner query are available:
the outer query depends on the time it takes to return the last row of the inner query before it can "finish"
running the inner query independently may only require waiting until the first row is returned before seeing any results
In Oracle, there were "first_rows" and "all_rows" hints that were somewhat related to manipulating this kind of behaviour. AskTom discussion.
If the inner query takes a long time between generating the first row and generating the last row, then this could be an indicator of what is going on. As part of the investigation, I would take the inner query and modify it to have a grouping function (or an ordering) to force processing of all rows before a result can be returned. I would use this as a measure of how long the inner query really takes, for comparison with the time the outer query takes.
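The difference between "first row available" and "last row available" can be illustrated without a database. In this sketch a Python generator plays the role of a streaming inner query (the function name and row count are made up), and a counter records how many rows had to be produced before the consumer got its 10 rows.

```python
from itertools import islice

produced = 0  # counts how many rows the "inner query" has generated


def inner_query(total_rows):
    """Hypothetical row source that yields rows one at a time, like a streaming plan."""
    global produced
    for row_id in range(total_rows):
        produced += 1
        yield row_id


# Streaming consumer: stops pulling as soon as it has 10 rows.
produced = 0
first_ten = list(islice(inner_query(100_000), 10))
streamed_cost = produced

# Blocking consumer: a sort needs every row before it can emit the first one.
produced = 0
top_ten_sorted = sorted(inner_query(100_000), reverse=True)[:10]
blocking_cost = produced

print("rows produced when streaming:", streamed_cost)
print("rows produced when sorting first:", blocking_cost)
```

This is the same effect that adding a grouping or ordering to the inner query has: it forces all rows to be produced, so the measured time reflects the full cost rather than the time to the first row.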
Drifting off topic a bit, it might be interesting to try simulating something like this in Oracle: create a Pipelined function to stream back numbers; stream back a few (say 15), then spin for a while before streaming back more.
Use a JDBC client to executeQuery against the pipelined function. The Oracle Statement fetchSize is 10 by default. Loop and print the results with a timestamp. See if the results stagger. I could not test this with PostgreSQL (RETURN NEXT), since Postgres does not stream the results from the function.
Oracle Pipelined Function
A pipelined table function returns a row to its invoker immediately
after processing that row and continues to process rows. Response time
improves because the entire collection need not be constructed and
returned to the server before the query can return a single result
row. (Also, the function needs less memory, because the object cache
need not materialize the entire collection.)
Postgresql RETURN NEXT
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation.
JDBC Default Fetch Sizes
statement.setFetchSize(100);
When debugging things like this I find that the quickest way to figure out how SQL Server "sees" the two queries is to look at their query plans. Hit CTRL-L in SSMS in the query view and the results will show what logic it will use to build your results when the query is actually executed.
SQL Server maintains statistics about the data in your tables, e.g. histograms of the number of rows with data in certain ranges. It gathers and uses these statistics to try to predict the "best" way to run queries against those tables. For example, it might have data that suggests that for some inputs a particular subquery might be expected to return 1M rows, while for other inputs the same subquery might return 1000 rows. This can lead it to choose different strategies for building the results, say using a table scan (exhaustively searching the table) instead of an index seek (jumping right to the desired data). If the statistics don't adequately represent the data, the "wrong" strategy can be chosen, with results similar to what you're experiencing. I don't know if that's the problem here, but that's the kind of thing I would look for.
If you want to compare the performance of your two queries, you have to run them under the same conditions (with clean memory buffers) and collect numeric statistics.
Run this batch for each query to compare execution times and statistics
(Do not run it on a production environment) :
DBCC FREEPROCCACHE
GO
CHECKPOINT
GO
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS IO ON
GO
SET STATISTICS TIME ON
GO
-- your query here
GO
SET STATISTICS TIME OFF
GO
SET STATISTICS IO OFF
GO
I've just had to investigate a very similar issue.
SELECT TOP 5 *
FROM t1 JOIN t2 ON t2.t1id = t1.id
WHERE t1.Code = 'MyCode'
ORDER BY t2.id DESC
t1 has 100K rows and t2 has 20M rows. The average number of joined rows for a t1.Code is about 35K. The actual result set is only 3 rows because t1.Code = 'MyCode' matches only 2 rows, which have only 3 corresponding rows in t2. Stats are up to date.
With the TOP 5 as above the query takes minutes, with the TOP 5 removed the query returns immediately.
The plans with and without the TOP are completely different.
The plan without the TOP uses an index seek on t1.Code, finds 2 rows, then nested loop joins 3 rows via an index seek on t2. Very quick.
The plan with the TOP uses an index scan on t2 giving 20M rows, then nested loop joins 2 rows via an index seek on t1.Code, then applies the top operator.
What I think makes my TOP plan so bad is that the rows being picked from t1 and t2 are some of the newest rows (largest values for t1.id and t2.id). The query optimiser has assumed that picking the first 5 rows from an evenly distributed average resultset will be quicker than the non-TOP approach. I tested this theory by using a t1.code from the very earliest rows and the response is sub-second using the same plan.
So the conclusion, in my case at least, is that the problem is a result of uneven data distribution that is not reflected in the stats.
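To get a feel for why a "scan and stop after N matches" plan can look cheap to the optimizer and still be expensive, here is a small simulation. It is only a sketch: the table is scaled down to 20,000 rows, the match positions are invented, and it models nothing more than the number of rows a scan has to read before it can stop.

```python
def rows_scanned_until_top(match_flags, n):
    """Scan rows in order and stop after n matches; return how many rows were read."""
    found = 0
    for scanned, is_match in enumerate(match_flags, start=1):
        if is_match:
            found += 1
            if found == n:
                return scanned
    return len(match_flags)  # fewer than n matches: the whole input was read


TABLE_ROWS = 20_000  # scaled-down stand-in for the 20M-row table

# What an "average" estimate assumes: matches spread evenly, one per 500 rows.
evenly_spread = [i % 500 == 499 for i in range(TABLE_ROWS)]

# What the data looks like for this code: only 3 matching rows in total.
rare = [i in (19_990, 19_995, 19_999) for i in range(TABLE_ROWS)]

even_cost = rows_scanned_until_top(evenly_spread, 5)
rare_cost = rows_scanned_until_top(rare, 5)
print("evenly spread matches:", even_cost, "rows read")
print("only 3 matches:       ", rare_cost, "rows read")
```

With evenly spread matches the scan stops early, but when fewer than 5 rows match at all, the scan can never stop and has to read the whole table, which fits the behaviour described above.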
TOP does not sort the results to my knowledge unless you use order by.
So my guess would be, as someone had already suggested, that the query isn't taking longer to execute. You simply start seeing the results faster when you don't have TOP in the query.
Try using @sql_mommy's query, but make sure you have the following:
To get your query to run faster, you could create an index on servicerequestid and statusid in big_table_1 and an index on customerreferencenumber in big_table_2. If you create nonclustered indexes, you should get an index-only plan with very fast results.
If I remember correctly, the TOP results will be in the same order as the index you use on big_table_1, but I'm not sure.
Gísli
It might be a good idea to compare the execution plans of the two queries. Your statistics might be out of date. If you see a difference between the actual execution plans, there is your difference in performance.
In most cases, you would expect better performance with the top 10. In your case, performance is worse. If so, you will not only see a difference between the execution plans, but also a difference between the number of rows in the estimated execution plan and in the actual execution plan, which leads to the poor decision by the SQL engine.
Try again after recomputing your statistics (and, while you're at it, rebuilding indices).
Also check whether it helps to take out the where big_table_1.StatusId=2 and instead go for:
select top 10 ServiceRequestID
from big_table_1 as cap1 INNER JOIN
big_table_2 as cap2
ON cap1.ServiceRequestID = cap2.CustomerReferenceNumber
WHERE cap1.StatusId=2
I find this format much more readable, though it should (remotely possibly it doesn't) optimise to the same execution plan. The returned end result will be identical regardless.