Make SELECT with LIMIT and OFFSET on big table fast - sql

I have more than 10 million records in a table.
SELECT * FROM tbl ORDER BY datecol DESC
LIMIT 10
OFFSET 999990
Output of EXPLAIN ANALYZE on explain.depesz.com.
Executing the above query takes about 10 seconds. How can I make this faster?
Update
The execution time is reduced by about half by using a subquery:
SELECT * FROM tbl WHERE id IN
(SELECT id FROM tbl ORDER BY datecol DESC LIMIT 10 OFFSET 999990)
Output of EXPLAIN ANALYZE on explain.depesz.com.

You need to create an index on the column used in ORDER BY. Ideally in the same sort order, but PostgreSQL can scan indexes backwards at almost the same speed.
CREATE INDEX tbl_datecol_idx ON tbl (datecol DESC);
More about indexes and CREATE INDEX in the current manual.
Test with EXPLAIN ANALYZE to get actual times in addition to the query plan.
Of course all the usual advice for performance optimization applies, too.
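Once the index exists, a quick check along these lines should confirm it is actually being used (a sketch - the plan should show an index scan on datecol instead of a sort over the whole table):
-- Sketch: re-run the slow query under EXPLAIN ANALYZE after creating tbl_datecol_idx.
EXPLAIN ANALYZE
SELECT * FROM tbl
ORDER BY datecol DESC
LIMIT 10 OFFSET 999990;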

I was trying to do something similar myself with a very large table (>100m records) and found that using OFFSET/LIMIT was killing performance.
An offset into the first 10m records took (with LIMIT 1) about 1.5 minutes to retrieve, and the time kept growing rapidly as the offset increased.
By record 50m I was up to 3 minutes per SELECT, even using subqueries.
I came across a post here which details useful alternatives.
I modified this slightly to suit my needs and came up with a method that gave me pretty quick results.
CREATE TEMPORARY TABLE just_index AS
SELECT ROW_NUMBER() OVER (ORDER BY [VALUE-You-need]) AS row_number,
       [VALUE-You-need]
FROM [your-table-name];
This was a one-off - it took about 4 minutes, but I then had all the values I wanted.
Next I created a function that would loop over the "offsets" I needed:
create or replace function GetOffsets()
returns void as $$
declare
    -- For this part of the function I only wanted values after 90 million up to 120 million
    counter bigint := 90000000;
    maxRows bigint := 120000000;
begin
    drop table if exists OffsetValues;
    create temp table OffsetValues
    (
        offset_myValue bigint
    );
    while counter <= maxRows loop
        insert into OffsetValues (offset_myValue)
        select [VALUE-You-need]
        from just_index
        where row_number > counter
        order by row_number
        limit 1;
        -- here I'm looping every 500,000 records - this is my 'Offset'
        counter := counter + 500000;
    end loop;
end;
$$ LANGUAGE plpgsql;
Then run the function:
select GetOffsets();
Again, this was a one-off cost (fetching one of my offset values went from ~3 minutes down to about 3 milliseconds).
Then select from the temp-table:
select * from OffsetValues;
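As an aside, here is a sketch of how one of those stored boundary values can then replace a large OFFSET; my_value stands in for [VALUE-You-need] and the names are hypothetical:
-- Sketch: seek to a stored boundary instead of scanning past millions of rows.
-- :boundary is one value taken from OffsetValues.
select t.*
from [your-table-name] t
where t.my_value >= :boundary
order by t.my_value
limit 500000;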
This worked really well for me in terms of performance - I don't think I'll be using OFFSET going forward if I can help it.
Hope this improves performance for any of your larger tables.

Related

Is there a way to select different rows each time avoiding ORDER BY clause?

I have a table with approximately 100 million rows (TABLE_A), and I need to select 6 million different rows per query; once the entire table has been selected, the process ends. TABLE_A does not have an index or primary key, and ORDER BY is very expensive in terms of time; I also don't need any particular order here, just different rows. I have tried to order using ROWID, since, according to this,
They are the fastest way to access a single row.
This query works but takes about 5 minutes (I would like to avoid this order by)
SELECT * FROM TABLE_A ORDER BY ROWID
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
This query runs faster but makes no sense, since ROWNUM, according to this,
returns a number indicating the order in which Oracle selects the row
from a table
SELECT * FROM TABLE_A ORDER BY ROWNUM asc
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
As expected, the same query returns different results each time.
This query seems to be conceptually better.
SELECT * FROM TABLE_A WHERE ROWID >= 6000000 AND ROWID <12000000;
But it can't be done this way; ROWID (UROWID datatype) has values like AAAZIUAHWAAC6XcAAI.
So, is there a way to select different rows while avoiding ORDER BY, and just address the rows using some kind of internal ID - maybe a location in storage, or a default order? The whole approach may well be wrong, so I'm open to radical changes.
I've also tried something like this:
SELECT * FROM TABLE_A
WHERE dbms_rowid.rowid_block_number(rowid)
BETWEEN 2049 AND 261281;
it's surprisingly fast but unfortunately a row could have more than one block number.
Based on your last comment, some things to look at:
DBMS_PARALLEL_EXECUTE
If you are going through 100 million rows, the best place to process them is on the database itself. If your processing is done with PL/SQL, then dbms_parallel_execute can manage most of the parallelisation for you, and carve up the rows.
ROWID ranges
Even if you don't process the rows on the database, you can use DBMS_PARALLEL_EXECUTE to produce the rowid ranges for you. Then use those start/end pairs as inputs to whatever app you are using to do the processing.
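A minimal sketch of generating those rowid ranges with DBMS_PARALLEL_EXECUTE (the task name and chunk size here are arbitrary):
-- Sketch: let Oracle carve TABLE_A into rowid chunks.
begin
    dbms_parallel_execute.create_task('chunk_table_a');
    dbms_parallel_execute.create_chunks_by_rowid(
        task_name   => 'chunk_table_a',
        table_owner => user,
        table_name  => 'TABLE_A',
        by_row      => true,
        chunk_size  => 100000);
end;
/
-- Each row is a start/end rowid pair your app can use as a
-- WHERE rowid BETWEEN :start_rowid AND :end_rowid range.
select chunk_id, start_rowid, end_rowid
from user_parallel_execute_chunks
where task_name = 'chunk_table_a'
order by chunk_id;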
simple MOD
Each instance of your app gets an ID from 0 to n-1, and each issues a query:
select *
from (
    select rownum r, m.*
    from my_table m
)
where mod(r, :n) = :x
where :x is that app's ID and :n is the number of app instances. If you already have a numeric sequence column of some sort that is reasonably distributed, you can substitute it for the rownum, as sketched below.
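For example, with a hypothetical, reasonably distributed numeric column seq_id, the wrapping query disappears entirely:
-- Sketch assuming a hypothetical numeric column seq_id.
select *
from my_table
where mod(seq_id, :n) = :x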

Fast query in PostgreSQL

I have a very large database (~1TB), so running even a very simple query can take a very long time. E.g. for:
EXPLAIN select count(*) from users;
the estimated cost is 44661683.87 disk page fetches, making it very expensive to execute.
When I try to put a limit on the query like:
EXPLAIN select count(*) from users limit 10;
the cost of executing the query remains the same, i.e. 44661683.87 disk page fetches.
So (1) is it possible to execute a query on a subset of the data and then extrapolate to the rest of the table? The approximate row count can be found quickly using something like:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'users';
Moreover, (2) is it possible to select a randomly distributed subset of rows?
is it possible to execute a query on subset of data and then extrapolate to the rest of the table
You could use the tablesample option:
select count(*) * 10
from the_table tablesample system (10);
tablesample system (10) will only scan 10 percent of the blocks of the table, which should be quite fast. If you multiply the resulting row count by 10 you'll have an approximation(!) of the total number of rows. The smaller the sample size, the faster this will be - but also the less accurate.
The accuracy of the number depends on how much free space your table has, because the 10% (or whatever sample size you choose) is based on the total number of blocks in the table. If there are many free (or half-free) blocks, the number will be less reliable.
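If block-level sampling turns out to be too inaccurate, a row-level sample is an alternative - it still visits every block, so it is slower, but it is less sensitive to free space (a sketch using the same multiply-up idea):
-- Sketch: bernoulli sampling picks roughly 1% of the rows individually,
-- so empty or half-empty blocks skew the estimate less.
select count(*) * 100
from the_table tablesample bernoulli (1);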
select count(*) ... is an aggregation query with no GROUP BY. It returns one row, so the LIMIT has no impact.
You seem to want:
select count(*)
from (select u.*
      from users u
      limit 10
     ) t;
As for your second question, Postgres introduced tablesample in version 9.5. You can investigate that.
If you have a primary key index on the users table (or an index on another column), you can get it to use that index for an index-only scan, which should result in a much better execution plan. But, strangely, it won't work with COUNT alone, so you can do a SELECT DISTINCT in a subquery and then COUNT in an outer query to force it to use the index:
EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT id FROM users) u;

Numbering rows in a view

I am connecting to an SQL database using a PLC, and need to return a list of values. Unfortunately, the PLC has limited memory and can only retrieve approximately 5,000 values at any one time; however, the database may contain up to 10,000 values.
As such I need a way of retrieving these values in 2 operations. Unfortunately the PLC is limited in the query it can perform, and is limited to only SELECT and WHERE commands, so I cannot use LIMIT or TOP or anything like that.
Is there a way in which I can create a view, and auto-number every field in that view? I could then query all records numbered below 5,000, followed by a second query for those from 5,000 up to 10,000, and so on?
Unfortunately it seems that views do not support the identity column, so this would need to be done manually.
Anyone have any suggestions? My only realistic option at the moment seems to be to create 2 views, one with the first 5,000 and one with the next 5,000...
I am using SQL Server 2000 if that makes a difference...
There are 2 solutions. The easiest is to modify your SQL table and add an IDENTITY column. If that is not a possibility, then you'll have to do something like the query below. For 10,000 rows it shouldn't be too slow, but as the table grows it will perform worse and worse.
SELECT Col1, Col2,
       (SELECT COUNT(i.Col1)
        FROM yourtable i
        WHERE i.Col1 <= o.Col1) AS RowID
FROM yourtable o
While the code provided by Derek does what I asked - i.e. it numbers each row in the view - the performance is really poor: approximately 20 seconds to number 100 rows. As such, it is not a workable solution. An alternative is to number the first 5,000 records with a 1, and the next 5,000 with a 2. This can be done with 3 simple queries, and is far quicker to execute.
The code to do so is as follows:
SELECT TOP(5000) BCode, SAPCode, 1 as GroupNo FROM dbo.DB
UNION
SELECT TOP (10000) BCode, SAPCode, 2 as GroupNo FROM dbo.DB p
WHERE ID NOT IN (SELECT TOP(5000) ID FROM dbo.DB)
Although, as pointed out by Andriy M, you should also specify an explicit sort, to ensure that you don't miss any records - see the sketch below.
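A version of the same query with an explicit sort might look like this (a sketch - it assumes ID is the column you want to order the blocks by):
-- Sketch: same approach with a consistent ORDER BY in each block.
SELECT BCode, SAPCode, GroupNo
FROM (SELECT TOP(5000) BCode, SAPCode, 1 AS GroupNo
      FROM dbo.DB
      ORDER BY ID) AS FirstBlock
UNION ALL
SELECT BCode, SAPCode, GroupNo
FROM (SELECT TOP(5000) BCode, SAPCode, 2 AS GroupNo
      FROM dbo.DB
      WHERE ID NOT IN (SELECT TOP(5000) ID FROM dbo.DB ORDER BY ID)
      ORDER BY ID) AS SecondBlock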
One possibility might be to use a function with a table variable, such as:
CREATE FUNCTION dbo.OrderedBCodeData()
RETURNS @Data TABLE (RowNumber int IDENTITY(1,1), BCode int, SAPCode int)
AS
BEGIN
    INSERT INTO @Data (BCode, SAPCode)
    SELECT BCode, SAPCode FROM dbo.DB ORDER BY BCode
    RETURN
END
And select from this function such as
SELECT * FROM dbo.OrderedBCodeData() WHERE RowNumber BETWEEN 5000 AND 10000
I haven't ever used this in production - in fact it was just a quick idea this morning - but it might be worth exploring as a neater alternative.

Processing a large table - how do I select the records page by page?

I need to do a process on all the records in a table. The table could be very big, so I'd rather process the records page by page. I need to remember the records that have already been processed, so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is: how do I write such a select statement?
I'm using Oracle, by the way, but it would be nice if it could run on any other DB too.
I also don't want to use a stored procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go - unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break up the table into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this will change a bit due to the clustering factor, but it will still take longer than just processing it in one go.
This all depends on how long it takes to process one row from the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective you cannot beat a full table scan.
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from
(
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
select /*+ FIRST_ROWS(200) */ *
from (
select rownum rn, t.*
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rn between 101 and 200;
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which, to be accurate, would need to be the end value rather than the number of rows actually returned to the page). You can set up the start and end values as bind variables that way, so you avoid hard parsing.
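With hypothetical bind variables for the page boundaries, the rownum version would look like this:
-- Sketch: same rownum approach, parameterised with binds :start_row and :end_row (e.g. 101 and 200).
select *
from (
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
    where rownum <= :end_row
)
where rn >= :start_row;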
For more details, you can check out this post

Selecting a subset of records from a very large record set in Oracle runs out of memory

I have a process that is converting dates from GMT to Australian Eastern Standard Time. To do this, I need to select the records from the database, process them and then save them back.
To select the records, I have the following query:
SELECT id,
user_id,
event_date,
event,
resource_id,
resource_name
FROM
(SELECT rowid id,
rownum r,
user_id,
event_date,
event,
resource_id,
resource_name
FROM user_activity
ORDER BY rowid)
WHERE r BETWEEN 0 AND 50000
to select a block of 50000 rows from a total of approx. 60 million rows. I am splitting them up because a) Java (what the update process is written in) runs out of memory with too many rows (I have a bean object for each row) and b) I only have 4 gig of Oracle temp space to play with.
In the process, I use the rowid to update the record (so I have a unique value) and the rownum to select the blocks. I then call this query in iterations, selecting the next 50000 records until none remain (the java program controls this).
The problem I'm getting is that I'm still running out of Oracle temp space with this query. My DBA has told me that more temp space cannot be granted, so another method must be found.
I've tried substituting the subquery (which I presume is using all the temp space for the sort) with a view, but the explain plan using the view is identical to that of the original query.
Is there a different/better way to achieve this without running into the memory/temp space problems? I'm assuming an update query to update the dates (as opposed to a Java program) would suffer from the same problem with the temp space available?
Your assistance on this is greatly appreciated.
Update
I went down the path of the pl/sql block as suggested below:
declare
    cursor c is select event_date from user_activity for update;
begin
    for t_row in c loop
        update user_activity
        set event_date = t_row.event_date + 10/24
        where current of c;
        commit;
    end loop;
end;
However, I'm running out of undo space. I was under the impression that if the commit was made after each update, then the need for undo space is minimal. Am I incorrect in this assumption?
A single update probably would not suffer from the same issue, and would likely be orders of magnitude faster. The large amount of temp tablespace is only needed because of the sorting. Although if your DBA is that stingy with the temp tablespace, you may end up running out of UNDO space or something else. (Take a look at ALL_SEGMENTS - how large is your table?)
But if you really must use this method, maybe you can use a filter instead of an order by. Create 1200 buckets and process them one at a time:
where ora_hash(rowid, 1200) = 1
where ora_hash(rowid, 1200) = 2
...
But this will be horribly, horribly slow. And what happens if a value changes halfway through the process? A single SQL statement is almost certainly the best way to do this.
Why not just one update or merge?
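A sketch of what that single statement could look like for this GMT-to-AEST shift - it assumes a flat +10 hour offset for every row and ignores daylight saving:
-- Sketch: one pass over the table, no sort, no paging.
update user_activity
set event_date = event_date + 10/24;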
Or you can write an anonymous PL/SQL block that processes the data with a cursor.
For example
declare
    cursor c is select * from aa for update;
begin
    for t_row in c loop
        update aa
        set val = t_row.val || ' new value'
        where current of c;
    end loop;
    commit;
end;
How about not updating it at all?
rename user_activity to user_activity_gmt;

create view user_activity as
select id,
       user_id,
       event_date + 10/24 as event_date,
       event,
       resource_id,
       resource_name
from user_activity_gmt;