Test Oracle SQL queries in subset of data in big table

Test Oracle SQL queries in subset of data in big table - sql

I have some big tables in an Oracle database, representing transactions etc.
Is there a way in Oracle/SQL to try a query on a subset of the data? Just to speed up the time it takes to try different queries and feel confident that you’ve got the logic right? Once I feel confident that my query is correct I want to run it on the full dataset.

You can limit your result set by the pseudo column ROWNUM.
In your where clause, add:
and rownum < 100 -- or any number you want
And this will put a hard stop at that number of rows.

Related

Optimising a sequence of SQL calculations - Nested or Seperate Queries?

Short Intro:
When it is required to have a dozen nested calculating queries, is it more optimal to
A) Perform each operation separately (saving into a table for each result and then reading that table for the next query)
B) Have a large set of nested selects
Full Description:
I am trying to calculate some advanced forecasts from a series of input tables in SQL.
I am building around a dozen 'modules' that are separated into their own schema and each module typically includes 4-10 input tables and 6-10 calculation steps. All outputs from each module is dumped into the same output table once completed.
Queries range from 7k-200k rows.
A single schema's/module's tables might look like this:
Input Table 1
Input Table 2
Input Table 3
Input Table 4
Calculation Query 1 Result Table
Calculation Query 2 Result Table
Calculation Query 3 Result Table
Calculation Query 4 Result Table
Calculation Query 5 Result Table
Calculation Query 6 Result Table
Final Output
Each calculation query uses the results of the previous (for the most part). The final output is the result of the final calculation query. Calculations are not very complex: partitioned max, basic formula (+,-,*,/) or SUM etcetera. Normally only 1-3 of these per calculation step and always on the same column.
The main reason this is split into multiple calculation queries (instead of one super-formula) is because each calculation joins the outputs in a different way and uses different input tables; also because some are based on previous row results. (Such as max partitions or Lag)
My requirements are as follows:
A procedure that calculates final output from step 1 and merges into Final Output.
A procedure that calculates up to the selected calculation query and merges into its respective results table (and stop). Consider this the 'overriding final'
I DONT need to store the calculation results of intermediate queries - only the final output or the 'overriding final' if selected.
My Problem:
I am trying to optimise the entire process - at this point it looks like it will take around 10-15 seconds. I want it to be 1 second - however I appreciate this is probably not possible.
What I have tried:
Firstly, I created a single procedure for each calculation query that Merged the results into its respective output table. Using this method, each calculation query must read from the database and then merge into its output.
I tried temp tables however I don't see why this would be optimal because I have existing tables for the calculation steps already - which are indexed with the next step in mind.
I then made an assumption that it would be faster to simply nest all the queries into one super-procedure or maybe even have a sequence of Table-Functions.
My Question:
However I ran into a thought that I could not find an answer for - which is the following:
Inserting results into a table on every calculation step might slow the process (especially as they are indexed with 2-4 columns); but at least the data will be indexed for the next step.
Nesting selects would save the effort of inserting data but these results wouldn't be indexed? Right? Or Wrong?
Are select results intelligently indexed? And given my scenario what advise would you give on how I approach this. Maybe I am missing something really simple.
Additional Info:
Most of my larger query results (150-200K) have 4 columns that need to be indexed.
All of my tables only have one column that needs calculating - the rest are indexed.
For Example:
ForecastID, Group, Year, Type, Sub-Type, Value
So I have to index Group, Year, Type and Sub-Type to Join multiple input tables and then calculate on the Value column.
I am telling you this in case having index-heavy tables influences your advice - I wont ask for help on optimizing indexes here due to the overwhelming quantity of advice already available and because it's a different question!

Query optimization is often more art than science, there are few hard and fast rules because there are so many possible influences on the outcome. With that big caveat out of the way, Time to hit the high points.
Indexes effects on loading tables - Indexes have a similar performance impact on inserts as triggers. Unless you have a filtered index each insert will have to update every index on the table, so at three indexes you are looking at quadrupling the number of updates per insert. At one read per insert and a small table size of 200k (very doable for a table scan), for three indexes you are probably outside the butter zone for cost vs. benefit of having those indexes on your work tables.
Nesting results - Like CTEs, nested results work best when the entire result set can fit in memory. When part is in memory and part is on disk it will generally perform worse than a similarly sized temp table without an index. At 5 or so columns for 200k rows with smallish datatypes and a modern server you should be ok performance wise with nesting queries, so long as your only doing one result set at a time. Once again this varies based on your setup, if you are strapped for ram drop them into a temp table.
Joins - Another possible good reason to use temp tables/nested queries is to avoid excessively large joins. The first step in a join process is a full Cartesian join between the tables, which is then filtered based on the on and where clauses. The Join process is heavily optimized in all RDMS, so most of the time you are not aware of how much heavy lifting is occurring behind the scenes, however when tables reach large sizes this can be a major performance pain point. So instead you select the subset of data you require from both tables, and join the two much smaller sets. Once again the butter zone between subsets and full table joins depends on a number of factors, so you'll have to play around with your queries to find where it is for your situation.
Unfortunately I can't really give specific advice without some sample inputs and outputs and/or an execution plan, but I hope this is some food for thought. Good luck.

It sounds like your datasets from the subqueries are more than a few thousand rows, so I would start off with approach A, persist some of these intermediate result sets to #temptables, check the execution plan for scans on these tables, and index the #temptables if needed.
If you want to use approach B, or mix A and B, I suggest CTEs instead of nested queries where possible. They are more readable, and it is easier to switch to #temptables when you are testing/designing the query.

Does using the TOP X * format in SQL speed up queries significantly?

So lately when I run queries on huge tables I'll use the the top 10 * notation like so:
select top 10 * from BI_Sessions (nolock)
where SessionSID like 'b6d%'
and CreateDate between '03-15-2012' AND '05-18-2012'
I thought that it let's it run faster, but it doesn't seem so , this one took 4 minutes(or is that OK time)?
I guess I'm curious about whether the top functionality happens after it pulls all the data anyway(which would seem like it's inefficient).
thanks

It entirely depends on the query, with the exceptino of "Top 0". "Top 0" does return much faster.
In your case, the query has to look through the rows in a huge table to find rows that match the WHERE clause. If no rows are found, the number of rows being returned doesn't help. If the rows are at the end of the table scan, then the number of rows being returned doesn't help.
There are certain cases with more complicated queries where the "top" could affect performance. There is a difference between optimizing overall and for the first row returned. I'm not sure if SQL Server's optimizer recognizes this difference.

Well, it depends. If you do not have a covering index on BI_sessions and its a large database then the answer is probably. A good covering index may be something like: CreateDate, SessionSIS, and all the columns you actually need to return. If you do have a coveing index, then SQL will not even read the table, it will get all the data it needs from the covering index. Possibly if you specified the columns you actually need to return, 10 rows should come back in a fraction of a second.
for more useful info
http://www.mssqltips.com/sqlservertip/1078/improve-sql-server-performance-with-covering-index-enhancements/
and a bit more technical:
http://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/
also
http://www.sqlserverinternals.com/
and
http://www.insidesqlserver.com/thebooks.html

Querying large dataset for statistics in SQL Server?

Say I have a sample for which 5 million data objects are stored as rows in SQL Server. If I need to run some stats on the data, would it be better to have a table for each sample, or one giant table, where I would select by sample id and then run the stats?
There may eventually be hundreds or even thousands of samples- which seems like one massive table.
But I'm not a SQL Server expert so I can't say whether one would be faster than the other...
Or maybe a better way to deal with such a large data set? I was hoping to use SQL CLR with C# to do my heavy lifting...

If you need to deal with such a large dataset, my gut feeling tells me T-SQL and working in sets will be significantly faster than anything you can do in SQL-CLR and a RBAR (row-by-agonizing-row) approach... dealing with large sets of data, summing up and selecting, that's what T-SQL is always been made for and what it's good at.
5 million rows isn't really an awful lot of data - it's a nice size dataset. But if you have the proper indices in place, e.g. on the columns you use in your JOIN conditions, in your WHERE clause and your ORDER BY clause, you should be just fine.
If you need more and more detailed advice - try to post your table structure, explain how you will query that table (what criteria you use for WHERE and ORDER BY) and we should be able to provide some more feedback.

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly want to fetch all of them, that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint.
Case 2, however is a bit more tricky with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMS. So we nest our "business" query in a technical wrapper as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated to know how many pages we will have even without fetching all data. Now this paging query is usually quite satisfying. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), this runs for 2-3minutes.
EDIT: Please note, that a potential bottleneck is not so easy to circumvent, because of sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?

The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query they will be called less.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?

You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation. Oracle has to read all the data, even if you only fetch one row, this is a common problem that all RDBMS face with this type of query. No implementation of TOTAL_ROWS will get around that.
The radical way to speed up this type of query is to forego the TOTAL_ROWS calculation. Just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimation may be sufficient. That's another solution, implemented by google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.

A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.

It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting or joining to all the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance important statements, we write a separate query that we know will return the correct count as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.

is there something faster than "having count" for large tables?

Here is my query:
select word_id, count(sentence_id)
from sentence_word
group by word_id
having count(sentence_id) > 100;
The table sentenceword contains 3 fields, wordid, sentenceid and a primary key id.
It has 350k+ rows.
This query takes a whopping 85 seconds and I'm wondering (hoping, praying?) there is a faster way to find all the wordids that have more than 100 sentenceids.
I've tried taking out the select count part, and just doing 'having count(1)' but neither speeds it up.
I'd appreciate any help you can lend. Thanks!

If you don't already have one, create a composite index on sentence_id, word_id.

having count(sentence_id) > 100;
There's a problem with this... Either the table has duplicate word/sentence pairs, or it doesn't.
If it does have duplicate word/sentence pairs, you should be using this code to get the correct answer:
HAVING COUNT(DISTINCT Sentence_ID) > 100
If the table does not have duplicate word/sentence pairs... then you shouldn't count sentence_ids, you should just count rows.
HAVING COUNT(*) > 100
In which case, you can create an index on word_id only, for optimum performance.

If that query is often performed, and the table rarely updated, you could keep an auxiliary table with word ids and corresponding sentence counts -- hard to think of any further optimization beyond that!

Your query is fine, but it needs a bit of help (indexes) to get faster results.
I don't have my resources at hand (or access to SQL), but I'll try to help you from memory.
Conceptually, the only way to answer that query is to count all the records that share the same word_id. That means that the query engine needs a fast way to find those records. Without an index on word_id, the only thing the database can do is go through the table one record at a time and keep running totals of every single distinct word_id it finds. That would usually require a temporary table and no results can be dispatched until the whole table is scanned. Not good.
With an index on word_id, it still has to go through the table, so you would think it wouldn't help much. However, the SQL engine can now compute the count for each word_id without waiting until the end of the table: it can dispatch the row and the count for that value of word_id (if it passes your where clause), or discard the row (if it doesn't); that will result in lower memory load on the server, possibly partial responses, and the temporary table is no longer needed. A second aspect is parallelism; with an index on word_id, SQL can split the job in chunks and use separate processor cores to run the query in parallel (depending on hardware capabilities and existing workload).
That might be enough to help your query; but you will have to try to see:
CREATE INDEX someindexname ON sentence_word (word_id)
(T-SQL syntax; you didn't specify which SQL product you are using)
If that's not enough (or doesn't help at all), there are two other solutions.
First, SQL allows you to precompute the COUNT(*) by using indexed views and other mechanisms. I don't have the details at hand (and I don't do this often). If your data doesn't change often, that would give you faster results but with a cost in complexity and a bit of storage.
Also, you might want to consider storing the results of the query in a separate table. That is practical only if the data never changes, or changes on a precise schedule (say, during a data refresh at 2 in the morning), or if it changes very little and you can live with non perfect results for a few hours (you would have to schedule a periodic data refresh); that's the moral equivalent of a poor-man's data warehouse.
The best way to find out for sure what works for you is to run the query and look at the query plan with and without some candidate indexes like the one above.

There is, surprisingly, an even faster way to accomplish that on large data sets:
SELECT totals.word_id, totals.num
FROM (SELECT word_id, COUNT(*) AS num FROM sentence_word GROUP BY word_id) AS totals
WHERE num > 1000;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas