select first rows in oracle without full table scan - sql

I'm trying to perform the following query in Oracle.
select * from (select rownum r, account from fooTable) where r<5001;
It selects the first 5000 rows. The problem I'm running into is that fooTable contains a lot of data (35 million+ rows), which is really slowing down the query. According to the query analyzer it's performing a full table scan.
My question is: is there a way to speed up this statement? Since I'm only fetching the first N rows, is the full table scan necessary?

I have found the /*+ FIRST_ROWS(n) */ hint to be very helpful in cases like this (such as for limiting pagination results). You replace n with whatever value you want.
select /*+ FIRST_ROWS(5000) */
account
from fooTable
where rownum <= 5000;
You still need the rownum predicate to limit rows, but the hint lets the optimizer know you only need a lazy fetch of n rows.
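As a side note, if you are on Oracle 12c or later, the ANSI row-limiting clause expresses the same intent without a hint. A minimal sketch against the question's table:
select account
from fooTable
fetch first 5000 rows only;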

Related

Query process steps of select query

I'm confused about the processing steps of a SELECT query. From the docs I've read, a SELECT query runs like this:
1. Getting Data (From, Join)
2. Row Filter (Where)
3. Grouping (Group by)
4. Group Filter (Having)
5. Return Expressions (Select)
6. Order & Paging (Order by & Limit / Offset)
I tested this with a query joining table A (70M records) and table B (75M records):
select *
from A join B on A.code = B.box_code
where B.box_code = '123'
compare with
select *
from A join (select * from B where box_code = '123') B on A.code = B.box_code
I assumed the first query would run slower than the second, because the first query spends time joining the large tables before filtering, while the second filters on box_code before the join. But the two queries run in the same time. Why did that happen?
I searched Google; it may be related to a clustered index, but I am not sure.
One more question: why can a clustered index apply the WHERE condition to filter data before the join? I thought the query would run the join before the WHERE.
Where did I get it wrong?
[Execution plan screenshots: first query, second query]
Thanks
This part is wrong...
select query will run like this
Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Group Filter (Having)
Return Expressions (Select)
Order & Paging (Order by & Limit / Offset)
Oracle has a number of operations that it can perform to satisfy a query. Some operations may require child operations to be completed first. Operations include things like TABLE ACCESS BY INDEX ROWID, INDEX RANGE SCAN, and NESTED LOOPS.
Oracle's optimizer decides which operations are necessary and in what order. It very often will, for example, apply WHERE conditions to a row source before joining that row source to another one. It does that for exactly the reason you imply in your post: because it is probably faster to filter a million rows down to 10 before doing a join.
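You can check this yourself with an execution plan. A minimal sketch using the tables from the question (the plan output should show the box_code filter applied at the table access step, before the join):
explain plan for
select *
from A join B on A.code = B.box_code
where B.box_code = '123';

select * from table(dbms_xplan.display);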
Oracle maintains an elaborate set of statistics on each table and column so that it can estimate when you submit your query what is likely to work well.
Theoretically, your job when writing SQL is to describe what you want and leave the how part to Oracle. In practice, the how part is still important, so your question is a very good one. Read Oracle's documentation on the subject, titled "Oracle Database SQL Tuning Guide". There is a version for each release of the database and they're available for free online (see: https://docs.oracle.com).

Is there a way to select different rows each time avoiding ORDER BY clause?

I have a table with approximately 100 million rows (TABLE_A). I need to select 6 million different rows in each query; once the entire table has been selected, the process ends. TABLE_A has no index or primary key, and ORDER BY is very expensive in terms of time. I also don't need any order here, just different rows. I have tried ordering by ROWID, since according to this,
They are the fastest way to access a single row.
This query works but takes about 5 minutes (I would like to avoid this order by)
SELECT * FROM TABLE_A ORDER BY ROWID
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
This query runs faster but makes no sense, since ROWNUM, according to this,
returns a number indicating the order in which Oracle selects the row
from a table
SELECT * FROM TABLE_A ORDER BY ROWNUM asc
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
As expected, the same query returns different results each time.
This query seems to be conceptually better.
SELECT * FROM TABLE_A WHERE ROWID >= 6000000 AND ROWID <12000000;
But it can't be done this way: ROWID (UROWID datatype) has values like AAAZIUAHWAAC6XcAAI.
So, is there a way to select different rows while avoiding a sort, just addressing the rows by some kind of internal ID, maybe a position in storage or a default order? The whole approach may well be wrong, so I'm open to radical changes.
I've also tried something like this:
SELECT * FROM TABLE_A
WHERE dbms_rowid.rowid_block_number(rowid)
BETWEEN 2049 AND 261281;
It's surprisingly fast, but unfortunately a row could have more than one block number.
Based on your last comment, some things to look at:
DBMS_PARALLEL_EXECUTE
If you are going through 100 million rows, the best place to process them is on the database itself. If your processing is done with PL/SQL, then dbms_parallel_execute can manage most of the parallelisation for you, and carve up the rows.
ROWID ranges
Even if you don't process the rows on the database, you can use DBMS_PARALLEL_EXECUTE to produce the rowid ranges for you. Then use those start/end pairs as inputs to whatever app you are using to do the processing, for example:
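A minimal sketch of generating the ranges (the task name is illustrative; with by_row => true, chunk_size is rows per chunk):
begin
  dbms_parallel_execute.create_task('ta_chunks');
  dbms_parallel_execute.create_chunks_by_rowid(
    task_name   => 'ta_chunks',
    table_owner => user,
    table_name  => 'TABLE_A',
    by_row      => true,
    chunk_size  => 6000000);
end;
/

-- each row here is one start/end rowid pair to hand to an app instance
select chunk_id, start_rowid, end_rowid
from user_parallel_execute_chunks
where task_name = 'ta_chunks';

-- an instance then reads its slice with:
-- select * from TABLE_A where rowid between :start_rowid and :end_rowid;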
simple MOD
Each instance of your app gets an ID from 0 to 'n-1' and each issues a query
select *
from (
      select rownum r, m.* from my_table m
     )
where mod(r, :n) = :x
where :x is that app's ID and :n is the number of app instances. If you already have a numeric sequence column of some sort that is reasonably distributed, you can substitute it for the rownum.
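An alternative sketch (my own suggestion, not part of the answer above): hashing the rowid avoids numbering every row first, assuming a roughly even distribution is acceptable:
-- ora_hash(rowid, :n - 1) assigns each row a bucket 0..:n-1;
-- each app instance x reads only its own bucket
select *
from my_table
where ora_hash(rowid, :n - 1) = :x;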

Fast query in PostgreSQL

I have a very large database (~1TB), so running even a very simple query can take a very long time. E.g. for:
EXPLAIN select count(*) from users;
the cost is 44661683.87 disk page fetches, making it very expensive to execute.
When I try to put a limit on the query like:
EXPLAIN select count(*) from users limit 10;
the cost of executing the query remains the same, i.e. 44661683.87 disk page fetches.
So (1) is it possible to execute a query on a subset of the data and then extrapolate to the rest of the table? An approximate row count can be found quickly using something like:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'users';
Moreover, (2) is it possible to select a randomly distributed subset of rows?
is it possible to execute a query on subset of data and then extrapolate to the rest of the table
You could use the tablesample option:
select count(*) * 10
from the_table tablesample system (10);
tablesample system (10) will only scan 10 percent of the blocks of the table, which should be quite fast. If you multiply the resulting row count by 10, you'll have an approximation(!) of the total number of rows. The smaller the sample size, the faster this will be - but also the less accurate.
The accuracy of the number depends on how much free space your table has, because the 10% (or whatever sample size you choose) is based on the total number of blocks in the table. If there are many free (or half-free) blocks, the number will be less reliable.
select count(*) . . . is an aggregation query with no group by. It returns 1 row, so the limit has no impact.
You seem to want:
select count(*)
from (select u.*
      from users u
      limit 10
     ) t;
As for your second question, Postgres introduced tablesample in version 9.5. You can investigate that.
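For a randomly distributed subset specifically, a minimal sketch (BERNOULLI samples individual rows rather than whole blocks, so it is more uniform than SYSTEM but slower; REPEATABLE fixes the seed):
-- roughly 1% of rows, sampled row by row
select * from users tablesample bernoulli (1);

-- the same sample on every run, thanks to the fixed seed
select * from users tablesample bernoulli (1) repeatable (42);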
If you have a primary key index on the users table (or an index on another column), you can get an index-only scan, which should result in a much better execution plan. But, strangely, it won't work with a plain COUNT, so you can do a SELECT DISTINCT in a subquery and then COUNT in an outer query to force it to use the index:
EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT id FROM users) u;

processing large table - how do i select the records page by page?

I need to run a process on all the records in a table. The table could be very big, so I'd rather process the records page by page. I need to remember which records have already been processed so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is: how do I write such a select statement?
I'm using Oracle, btw, but it would be nice if it could run on other databases too.
I also don't want to use a stored procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go - unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break up the table into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this will change a bit due to the clustering factor, but it will still take longer than just processing it in one go.
This all depends on how long it takes to process one row from the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective, you cannot beat a full table scan.
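On the partition point above, a minimal sketch using partition-extended naming (p1 is a hypothetical partition name):
-- read exactly one partition at a time
select * from my_table partition (p1);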
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from (
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
    where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint (note that ROWNUM must still be assigned in a query block that wraps the ORDER BY, otherwise the row numbers are assigned before the sort):
select /*+ FIRST_ROWS(200) */ *
from (
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
)
where rn between 101 and 200;
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which, to be accurate, would need to represent the end value and not the number of rows actually returned to the page). That way you can set up the start and end values as bind variables, so you avoid hard parsing.
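A sketch of the bind-variable form of the first query, with :start_row and :end_row as the page bounds:
select *
from (
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
    where rownum <= :end_row
)
where rn >= :start_row;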
For more details, you can check out this post

Selecting data effectively sql

I have a very large table with over 1000 records and 200 columns. When I try to retrieve records matching some criteria in the WHERE clause using a SELECT statement, it takes a lot of time. But most of the time I just want to select a single record that matches the criteria in the WHERE clause rather than all the records.
I guess there should be a way to select just a single record and exit, which would minimize the retrieval time. I tried ROWNUM=1 in the WHERE clause but it didn't really help, because I guess the engine still checks all the records even after finding the first record matching the WHERE criteria. Is there a way to optimize this if I want to select just a few records?
Thanks in advance.
Edit:
I am using Oracle 10g.
The query looks like:
Select *
from Really_Big_table
where column1 is NOT NULL
and column2 is NOT NULL
and rownum=1;
This seems to run slower than the version without rownum=1.
rownum is what you want, but you need to perform your main query as a subquery.
For example, if your original query is:
SELECT col1, col2
FROM table
WHERE condition
then you should try
SELECT *
FROM (
    SELECT col1, col2
    FROM table
    WHERE condition
) WHERE rownum <= 1
See http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html for details on how rownum works in Oracle.
1,000 records isn't a lot of data in a table. 200 columns is a reasonably wide table. For this reason, I'd suggest you aren't dealing with a really big table - I've performed queries against millions of rows with no problems.
Here is a little experiment... how long does it take to run this compared to the "SELECT *" query?
SELECT
Really_Big_table.Id
FROM
Really_Big_table
WHERE
column1 IS NOT NULL
AND
column2 IS NOT NULL
AND
rownum=1;
An example:
SELECT ename, sal
FROM ( SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
FROM emp )
WHERE sal_rank <= 1;
You may also want to index the columns used in the WHERE clause.
In SQL, most of the optimization comes in the form of indexes on the table (as a rough guide, index the columns that appear in the WHERE and ORDER BY clauses).
You did not specify what SQL database you are using, so I can't point to a good resource.
Here is an introduction to indexes on Oracle.
Here is another tutorial.
As for queries - you should always specify the columns you are returning and not use a blanket *.
It shouldn't take a lot of time to query a 1000-row table. There are exceptions, however; check whether you are in one of the following cases:
1. Lots of rows were deleted
The table had a massive number of rows in the past. Since the High Water Mark (HWM) is still high (DELETE won't lower it) and a FULL TABLE SCAN reads all the data up to the high water mark, it may take a lot of time to return results even if the table is now nearly empty.
Analyse your table (dbms_stats.gather_table_stats('<owner>','<table>')) and compare the space actually used by the rows (space on disk) with the effective space (data), for example:
SELECT t.avg_row_len * t.num_rows data_bytes,
(t.blocks - t.empty_blocks) * ts.block_size bytes_used
FROM user_tables t
JOIN user_tablespaces ts ON t.tablespace_name = ts.tablespace_name
WHERE t.table_name = '<your_table>';
You will have to take into account the overhead of the rows and blocks, as well as the space reserved for updates (PCTFREE). If you see that you are using a lot more space than required (typical overhead is below 30%, YMMV), you may want to reset the HWM, either:
ALTER TABLE <your_table> MOVE; and then rebuild the indexes (ALTER INDEX <index> REBUILD) - don't forget to collect stats afterwards (a sketch of this sequence follows below), or
use DBMS_REDEFINITION.
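A minimal sketch of the first option (table and index names are placeholders; EXEC is SQL*Plus shorthand):
ALTER TABLE my_table MOVE;
ALTER INDEX my_table_idx REBUILD;  -- MOVE marks the table's indexes UNUSABLE
EXEC dbms_stats.gather_table_stats(user, 'MY_TABLE');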
2. The table has very large columns
Check if you have columns of datatype LOB, CLOB, LONG (irk), etc. Data over 4000 bytes in any of these columns is stored out of line (in a separate segment), which means that if you don't select these columns you will only query the other smaller columns.
If you are in this case, don't use SELECT *. Either you don't need the data in the large columns, or use SELECT rowid and then do a second query: SELECT * FROM <your_table> WHERE rowid = <rowid>.
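A sketch of that two-step pattern against the question's table (column names are taken from the question; :picked_rowid is a bind holding a rowid from the first pass):
-- first pass: fetch only the rowids of matching rows; no large columns are read
SELECT rowid
FROM really_big_table
WHERE column1 IS NOT NULL
AND column2 IS NOT NULL;

-- second pass: fetch the full row, large columns included, only when needed
SELECT *
FROM really_big_table
WHERE rowid = :picked_rowid;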