Is there a way to select different rows each time avoiding ORDER BY clause? - sql

I have a table with approximately 100 million rows (TABLE_A). I need to select 6 million different rows in each query; once the entire table has been selected, the process ends. TABLE_A has no index or primary key, and ORDER BY is very expensive in terms of time. I also don't need any particular order here, just different rows. I have tried ordering by ROWID since, according to this,
They are the fastest way to access a single row.
This query works but takes about 5 minutes (I would like to avoid this order by)
SELECT * FROM TABLE_A ORDER BY ROWID
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
This query works faster but makes no sense, since ROWNUM, according to this
returns a number indicating the order in which Oracle selects the row
from a table
SELECT * FROM TABLE_A ORDER BY ROWNUM asc
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
As expected, same query returns different results each time.
This query seems to be conceptually better.
SELECT * FROM TABLE_A WHERE ROWID >= 6000000 AND ROWID <12000000;
But it can't be done this way: ROWID (UROWID datatype) has values like AAAZIUAHWAAC6XcAAI
So, is there a way to select different rows without ordering, just fetching the rows by some kind of internal ID, perhaps an address in storage or a default order? The whole approach is likely wrong, so I'm open to radical changes.
I've also tried something like this
SELECT * FROM TABLE_A
WHERE dbms_rowid.rowid_block_number(rowid)
BETWEEN 2049 AND 261281;
It's surprisingly fast, but unfortunately a row could have more than one block number.

Based on your last comment, some things to look at:
DBMS_PARALLEL_EXECUTE
If you are going through 100 million rows, the best place to process them is on the database itself. If your processing is done with PL/SQL, then dbms_parallel_execute can manage most of the parallelisation for you, and carve up the rows.
ROWID ranges
Even if you don't process the rows on the database, you can use DBMS_PARALLEL_EXECUTE to produce the rowid ranges for you. Then use those start-end pairs as inputs to whatever app you are using to do the processing.
simple MOD
Each instance of your app gets an ID from 0 to n-1, and each issues a query
select *
from (
select rownum r, m.* from my_table m
)
where mod(r, :n) = :x
where n is the number of app instances and x is that app's ID. If you already have a numeric sequence column of some sort that is reasonably distributed, you can substitute that in for the rownum.
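To make the MOD idea concrete, here is a minimal Python sketch of how the rows get divided between workers. The names `rows_for_worker` and `table` are made up for illustration, and the list stands in for my_table. It assumes every worker sees the rows in the same order, which the rownum version does not guarantee without a stable scan order.

```python
def rows_for_worker(rows, n, x):
    """Mimic: select * from (select rownum r, m.* from my_table m) where mod(r, :n) = :x"""
    # rownum starts at 1, so number the rows from 1 as well.
    return [row for r, row in enumerate(rows, start=1) if r % n == x]


table = [f"row{i}" for i in range(1, 11)]  # hypothetical stand-in for my_table
n = 3                                       # number of app instances (assumption)

# One slice per worker ID 0 .. n-1.
slices = [rows_for_worker(table, n, x) for x in range(n)]
for x, part in enumerate(slices):
    print(x, part)
```

Each row lands in exactly one slice, so the workers together cover the table once with no overlap, as long as the numbering is identical for all of them.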

Related

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 or will it get only specific records i.e 5

I am using select top 5 from (select * from tbl2) tbl. Will it get all records from tbl2 and then return only 5 of them, or does it only read 5 records internally? Suppose we have 1000 records in tbl2.
For future reference, SQL actually has a huge amount of amazing documentation/resources online. For instance: This google search.
As to your question: it will pull the top five results matching your criteria, so it depends. It goes through the rows until it finds the first five that match. If those happen to be the last rows it examines, it will still have to do the comparisons and filtering on all rows; the only difference is that it has to send fewer rows to your computer.
For example, let's say we have a table my_table with two columns: Customer_ID (which is unique) and First_Purchase_Date (which is a date value between 2015-01-01 and 2017-07-26). If we simply do SELECT TOP 5 * FROM my_table then it will go through and pull the first five rows it finds, without looking at the rest of the rows. On the other hand, if we do SELECT TOP 5 * FROM my_table WHERE First_Purchase_Date = '2017-05-17' then it will have to go through the rows until it finds five with a First_Purchase_Date of 2017-05-17. If First_Purchase_Date is indexed, this should not be very expensive, as it will more or less know where to look. If it's not, then it depends on how your database engine has decided to structure your table, and whether it has created any useful statistics. In the worst case, there are no statistics and the desired rows are the last five in the table, in which case it will have to complete the comparison on all the rows in the table.
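The difference between the two cases can be sketched in Python by counting how many rows a plain scan has to examine before it has five matches. This is only a model of an unindexed scan; `top_n` and `table` are hypothetical names, and a real optimizer may do better with an index.

```python
from itertools import islice


def top_n(rows, n, predicate=lambda row: True):
    """Return (first n matching rows, number of rows examined)."""
    examined = 0

    def scan():
        nonlocal examined
        for row in rows:
            examined += 1          # every row read counts, match or not
            if predicate(row):
                yield row

    result = list(islice(scan(), n))  # stop pulling rows once n are found
    return result, examined


table = list(range(1000))  # hypothetical 1000-row table, in scan order

first_five, cost_plain = top_n(table, 5)
last_five, cost_filtered = top_n(table, 5, lambda row: row >= 995)
print(cost_plain, cost_filtered)
```

Without a filter the scan stops after five rows; when the only matches are the last five rows, all 1000 rows are examined.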
By the way, this is a somewhat poor idea, as the rows returned will not necessarily stay consistent over time. It may be a good idea to throw in an ORDER BY clause, to ensure you get the same records every time.
The SELECT TOP clause is used to specify the number of records to return.
It limits the rows returned in a query result set to a specified number of rows or percentage of rows. When TOP is used in conjunction with the ORDER BY clause, the result set is limited to the first N number of ordered rows; otherwise, it returns the first N number of rows in an undefined order.
SELECT TOP number|percent column_name(s)
FROM table_name
WHERE condition;

Numbering rows in a view

I am connecting to an SQL database using a PLC, and need to return a list of values. Unfortunately, the PLC has limited memory, and can only retrieve approximately 5,000 values at any one time, however the database may contain up to 10,000 values.
As such I need a way of retrieving these values in 2 operations. Unfortunately the PLC is limited in the query it can perform, and is limited to only SELECT and WHERE commands, so I cannot use LIMIT or TOP or anything like that.
Is there a way in which I can create a view, and auto number every field in that view? I could then query all records < 5,000, followed by a second query of < 10,000 etc?
Unfortunately it seems that views do not support the identity column, so this would need to be done manually.
Anyone any suggestions? My only realistic option at the moment seems to be to create 2 views, one with the first 5,000 and 1 with the next 5,000...
I am using SQL Server 2000 if that makes a difference...
There are 2 solutions. The easiest is to modify your SQL table and add an IDENTITY column. If that is not a possibility, then you'll have to do something like the query below. For 10000 rows, it shouldn't be too slow, but as the table grows, it will perform worse and worse.
SELECT Col1, Col2, (SELECT COUNT(i.Col1)
FROM yourtable i
WHERE i.Col1 <= o.Col1) AS RowID
FROM yourtable o
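A rough Python model of why this query degrades as the table grows: for every outer row, the subquery counts over all rows again. The sketch assumes Col1 is unique and that no index helps the subquery; `number_rows_by_count` is a made-up name.

```python
def number_rows_by_count(values):
    """Row number = count of values <= this value, like the correlated subquery."""
    comparisons = 0
    numbered = []
    for outer in values:
        count = 0
        for inner in values:       # the subquery rescans the table per outer row
            comparisons += 1
            if inner <= outer:
                count += 1
        numbered.append((outer, count))
    return numbered, comparisons


numbered, comparisons = number_rows_by_count([30, 10, 20])
print(numbered, comparisons)

_, comparisons_100 = number_rows_by_count(list(range(100)))
print(comparisons_100)
```

The work grows with the square of the row count under these assumptions, which is why it is tolerable for small tables only.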
While the code provided by Derek does what I asked - i.e numbers each row in the view, the performance for this is really poor - approximately 20 seconds to number 100 rows. As such it is not a workable solution. An alternative is to number the first 5,000 records with a 1, and the next 5,000 with a 2. This can be done with 3 simple queries, and is far quicker to execute.
The code to do so is as follows:
SELECT TOP(5000) BCode, SAPCode, 1 as GroupNo FROM dbo.DB
UNION
SELECT TOP (10000) BCode, SAPCode, 2 as GroupNo FROM dbo.DB p
WHERE ID NOT IN (SELECT TOP(5000) ID FROM dbo.DB)
Although, as pointed out by Andriy M, you should also specify an explicit sort, to ensure that you don't miss any records.
One possibility might be to use a table-valued function with a table variable, such as
CREATE FUNCTION dbo.OrderedBCodeData()
RETURNS @Data TABLE (RowNumber int IDENTITY(1,1), BCode int, SAPCode int)
AS
BEGIN
INSERT INTO @Data (BCode, SAPCode)
SELECT BCode, SAPCode FROM dbo.DB ORDER BY BCode
RETURN
END
And select from this function such as
SELECT * FROM dbo.OrderedBCodeData() WHERE RowNumber BETWEEN 5000 AND 10000
I have never used this in production; it was just a quick idea this morning, but it may be worth exploring as a neater alternative.

processing large table - how do i select the records page by page?

I need to run a process on all the records in a table. The table could be very big, so I'd rather process the records page by page. I need to remember the records that have already been processed so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is how do I write such select statement?
I'm using oracle btw, but would be nice if I can run on any other db too.
I also don't want to use store procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go, unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break up the table into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this will change a bit due to the clustering factor, but it will still take longer than processing everything in one go.
This all depends on how long it takes to process one row from the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective, you cannot beat a full table scan.
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from
(
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
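select /*+ FIRST_ROWS(200) */ *
from (
-- ROWNUM is assigned before ORDER BY within the same query block,
-- so number the rows outside the sorted subquery
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
)
where rn between 101 and 200;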
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which would need to represent the end value and not the number of rows actually returned to the page to be accurate). You can set up the start and end values as bind variables that way, so you avoid hard parsing.
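If the start and end values are passed as bind variables, the application only has to compute them per page. A minimal sketch, assuming 1-based page numbers and a fixed page size; `page_bounds` is a hypothetical helper, not part of any Oracle driver.

```python
def page_bounds(page, page_size):
    """Map a 1-based page number to inclusive (start, end) row numbers."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size + 1
    end = page * page_size
    return start, end


# Page 2 with 100 rows per page corresponds to the 101..200 window used above.
start, end = page_bounds(2, 100)
print(start, end)
```

The query text stays the same for every page; only the two bound values change.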
For more details, you can check out this post

Selecting data effectively sql

I have a very large table with over 1000 records and 200 columns. When I try to retrieve records matching some criteria in the WHERE clause using a SELECT statement, it takes a lot of time. But most of the time I just want to select a single record that matches the criteria in the WHERE clause rather than all the records.
I guess there should be a way to select just a single record and exit which would minimize the retrieval time. I tried ROWNUM=1 in the WHERE clause but it didn't really work because I guess the engine still checks all the records even after finding the first record matching the WHERE criteria. Is there a way to optimize in case if I want to select just a few records?
Thanks in advance.
Edit:
I am using oracle 10g.
The Query looks like,
Select *
from Really_Big_table
where column1 is NOT NULL
and column2 is NOT NULL
and rownum=1;
This seems to work slower than the version without rownum=1;
rownum is what you want, but you need to perform your main query as a subquery.
For example, if your original query is:
SELECT col1, col2
FROM table
WHERE condition
then you should try
SELECT *
FROM (
SELECT col1, col2
FROM table
WHERE condition
) WHERE rownum <= 1
See http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html for details on how rownum works in Oracle.
1,000 records isn't a lot of data in a table. 200 columns is a reasonably wide table. For this reason, I'd suggest you aren't dealing with a really big table - I've performed queries against millions of rows with no problems.
Here is a little experiment... how long does it take to run this compared to the "SELECT *" query?
SELECT
Really_Big_table.Id
FROM
Really_Big_table
WHERE
column1 IS NOT NULL
AND
column2 IS NOT NULL
AND
rownum=1;
An example is here: You can view more here
SELECT ename, sal
FROM ( SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
FROM emp )
WHERE sal_rank <= 1;
You should also index the columns used in the WHERE clause.
In SQL, most of the optimization comes in the form of indexes on the table (as a rough guide, you would index the columns that appear in the WHERE and ORDER BY clauses).
You did not specify what SQL database you are using, so I can't point to a good resource.
Here is an introduction to indexes on Oracle.
Here another tutorial.
As for queries - you should always specify the columns you are returning and not use a blanket *.
It shouldn't take a lot of time to query a 1000-row table. There are exceptions, however; check whether you are in one of the following cases:
1. Lots of rows were deleted
The table had a massive number of rows in the past. Since the High Water Mark (HWM) is still high (a delete won't lower it) and a FULL TABLE SCAN reads all the data up to the high water mark, it may take a lot of time to return results even if the table is now nearly empty.
Analyse your table (dbms_stats.gather_table_stats('<owner>','<table>')) and compare the space actually used by the rows (space on disk) with the effective space (data), for example:
SELECT t.avg_row_len * t.num_rows data_bytes,
(t.blocks - t.empty_blocks) * ts.block_size bytes_used
FROM user_tables t
JOIN user_tablespaces ts ON t.tablespace_name = ts.tablespace_name
WHERE t.table_name = '<your_table>';
You will have to take into account the overhead of the rows and blocks as well as the space reserved for update (PCT_FREE). If you see that you use a lot more space than required (typical overhead is below 30%, YMMV) you may want to reset the HWM, either:
ALTER TABLE <your_table> MOVE; and then rebuild INDEX (ALTER INDEX <index> REBUILD), don't forget to collect stats afterwards.
use DBMS_REDEFINITION
2. The table has very large columns
Check if you have columns of datatype LOB, CLOB, LONG (irk), etc. Data over 4000 bytes in any of these columns is stored out of line (in a separate segment), which means that if you don't select these columns you will only query the other smaller columns.
If you are in this case, don't use SELECT *. Either you don't need the data in the large columns, or you can SELECT rowid first and then do a second query: SELECT * FROM <your_table> WHERE rowid = <rowid>.

How can I speed up row_number in Oracle?

I have a SQL query that looks something like this:
SELECT * FROM(
SELECT
...,
row_number() OVER(ORDER BY ID) rn
FROM
...
) WHERE rn between :start and :end
Essentially, it's the ORDER BY part that's slowing things down. If I were to remove it, the EXPLAIN cost goes down by an order of magnitude (over 1000x). I've tried this:
SELECT
...
FROM
...
WHERE
rownum between :start and :end
But this doesn't give correct results. Is there any easy way to speed this up? Or will I have to spend some more time with the EXPLAIN tool?
ROW_NUMBER is quite inefficient in Oracle.
See the article in my blog for performance details:
Oracle: ROW_NUMBER vs ROWNUM
For your specific query, I'd recommend replacing it with ROWNUM and making sure that the index is used:
SELECT *
FROM (
SELECT /*+ INDEX_ASC(t index_on_column) NOPARALLEL_INDEX(t index_on_column) */
t.*, ROWNUM AS rn
FROM table t
ORDER BY
column
)
WHERE rn >= :start
AND rownum <= :end - :start + 1
This query will use COUNT STOPKEY
Also, either make sure your column is not nullable, or add a WHERE column IS NOT NULL condition.
Otherwise the index cannot be used to retrieve all values.
Note that you cannot use ROWNUM BETWEEN :start and :end without a subquery.
ROWNUM is always assigned last and checked last; that's why ROWNUMs always come in order without gaps.
If you use ROWNUM BETWEEN 10 AND 20, the first row that satisfies all other conditions will become a candidate for returning, be temporarily assigned ROWNUM = 1, and fail the test of ROWNUM BETWEEN 10 AND 20.
Then the next row will be a candidate, assigned with ROWNUM = 1 and fail, etc., so, finally, no rows will be returned at all.
This should be worked around by putting ROWNUM's into the subquery.
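The behaviour described above can be modelled in a few lines of Python. This is a simulation of the rule, not Oracle's implementation; `select_where_rownum` is a made-up name.

```python
def select_where_rownum(rows, rownum_predicate):
    """Simulate a ROWNUM condition that is checked as each candidate row is numbered."""
    result = []
    rownum = 1                      # number the next candidate would receive
    for row in rows:
        if rownum_predicate(rownum):
            result.append((rownum, row))
            rownum += 1             # advances only when a row is accepted
        # a rejected row is dropped and the next candidate gets the same number
    return result


rows = [f"r{i}" for i in range(100)]

# WHERE rownum BETWEEN 10 AND 20: every candidate is numbered 1 and rejected.
direct = select_where_rownum(rows, lambda rn: 10 <= rn <= 20)

# Subquery workaround: number all rows first, then filter on the stored number.
numbered = select_where_rownum(rows, lambda rn: True)
page = [(rn, row) for rn, row in numbered if 10 <= rn <= 20]

print(len(direct), len(page))
```

The direct filter returns nothing, while filtering on the number materialised by the inner query returns rows 10 through 20.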
Looks like a pagination query to me.
From this ASKTOM article (about 90% down the page):
You need to order by something unique for these pagination queries, so that ROW_NUMBER is assigned deterministically to the rows each and every time.
Also, your queries are nowhere near the same, so I'm not sure what the benefit of comparing the cost of one to the other is.
Is your ORDER BY column indexed? If not that's a good place to start.
Part of the problem is how big is the 'start' to 'end' span and where they 'live'.
Say you have a million rows in the table, and you want rows 567,890 to 567,900 then you are going to have to live with the fact that it is going to need to go through the entire table, sort pretty much all of that by id, and work out what rows fall into that range.
In short, that's a lot of work, which is why the optimizer gives it a high cost.
It is also not something an index can help with much. An index would give the order, but at best, that gives you somewhere to start and then you keep reading on until you get to the 567,900th entry.
If you are showing your end user 10 items at a time, it may be worth actually grabbing the top 100 from the DB, then having the app break that 100 into ten chunks.
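Splitting one larger fetch into pages on the application side might look like the following sketch. `chunk` is a hypothetical helper, and the list stands in for the 100 rows fetched from the database.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


top_100 = list(range(1, 101))   # stand-in for the top 100 rows fetched once
pages = chunk(top_100, 10)      # ten pages of ten items each
print(len(pages), pages[0], pages[-1])
```

The database is then hit once per hundred rows rather than once per page shown.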
Spend more time with the EXPLAIN PLAN tool. If you see a TABLE SCAN you need to change your query.
Your query makes little sense to me. Querying over a ROWID seems like asking for trouble. There's no relational info in that query. Is it the real query that you're having trouble with or an example that you made up to illustrate your problem?