SQL query: How to select the first 100000-200000 rows in a huge table - sql

I know the normal way should be:
SELECT *
FROM mytable
ORDER BY date
OFFSET 100000 ROWS
FETCH FIRST 100000 ROWS
However, when mytable has 80 Million rows, the "ORDER BY" command will takes a long time to run. To me, the order doesn't matter, I just want to download 100,000 rows of data one at each time. Is there any good way to achieve it?

The order by only takes a long time because you use a column without an index on it. Use an indexed column like an id column in your order by.
Or add an index on date

If the order doesn't matter, just don't use it. But the correct point is to follow #juergen instructions. Try to order always on indexed columns.
I'm not sure if you fetch 100k lines at once, the system will load 100k rows to memory.
But when you process this, the loop will work over the 100k rows and will end.
¿Could you explain why order does not matter to you?
Without that, you can get repeated rows within different fetches.
If you use order, you ensure to not get repeated rows.
If th query is slow, create an index.
In this kind of select, I'm not sure if pruning would help, as you have no where clause.

Related

Strange behavior when doing where and order by query in postgres

Background: A large table, 50M+, all column in query is indexed.
when I do a query like this:
select * from table where A=? order by id DESC limit 10;
In statement, A, id are both indexed.
Now confusing things happen:
the more rows where returned, the less time whole sql cost
the less rows where returned, the more time whole sql cost
I have a guess here: postgres do the order by first, and then where , so it cost more time to find 10 row in the orderd index when target rowset is small(like find 10 particular sand on beach); opposite, if target rowset is large, it's easy to find the first 10.
Is it right? Or there are some other reason for this?
Final question: How to optimize this situation?
It can either use the index on A to apply the selectivity, then sort on "id" and apply the limit. Or it can read them already in order using the index on "id", then filter out the ones that meet the A condition until it finds 10 of them. It will choose the one it thinks is faster, and sometimes it makes the wrong choice.
If you had a multi-column index, on (A,id) it could use that one index to do both things, get the selectivity on A and still fetch the already in order by "id", at the same time.
Do you know PGAdmin? With "explain verbose" before your statement, you can check how the query is executed (meaning the order of the operators). Usually first happens the filter and only afterwards the sorting...

Different result size between SELECT * and SELECT COUNT(*) on Oracle

I have an strange behavior on an oracle database. We make a huge insert of around 3.1 million records. Everything fine so far.
Shortly after the insert finished (around 1 too 10 minutes) I execute two statements.
SELECT COUNT(*) FROM TABLE
SELECT * FROM TABLE
The result from the first statement is fine it gives me the exact number of rows that was inserted.
The result from the second statement is now the problem. Depending on the time, the number of rows that are returned is for example around 500K lower than the result from the first statement. The difference of the two results is decreasing with time.
So I have to wait 15 to 30 minutes before both statements return the same number of rows.
I already talked with the oracle dba about this issue but he has no idea how this could happen.
Any ideas, questions or suggestions?
Update
When I select only an index column I get the correct row count.
When I instead select an non index column I get again the wrong row count.
That doesn't sounds like a bug to me, if I understood you correctly, it just takes time for Oracle to fetch the entire table . After all, 3 Mil is not a small amount.
As opposed to count, which brings 1 record with the total number of rows.
If after some waiting, the number of records being output equals to the number that the count query returns, then everything is fine.
Have you already verified with these things:
1- Count single column instead of * ALL to verify both result
2- You can verify both queries result by adding where clause and gradually select more rows by removing conditions so that you can get the issue where it is returning different value from both.
I think you should check Execution plan to identify missing indexes to improve performance.
Add missing Indexes and check the result.
Why missing Indexes are impotent:
To count row, Oracle engine no need to go throw paging operation. But while fetching all the details from a table, it requires to go through paging.
And paging process depends on indexes created on a table to fetch the data effectively and fast.
So to decrease time for your second statement, you should find missing indexes and create those indexes.
How to Find Missing Indexes:
You can start with DBA_HIST_ACTIVE_SESS_HISTORY, and look at all statements that contain that type of hint.
From there, you can pull the index name coming from that hint, and then do a lookup on dba_indexes to see if the index exists, is valid, etc.

Is it possible to query a table without order columns by page

I've a big table which contains more than 100K records, in oracle. I want to get all of the records and save each row to a file with JDBC.
In order to make it faster, I want to create 100 threads to read the data from the table concurrently. I will get the total count of the records in the first sql, then split it to 100 pages, then get one page in a thread with a new connection.
But I've a problem, that there is no any column can be used to order. There is no column with sequence, no accurate timestamp. I can't use a sql query without order by clause to query, since there is no guarantee it will return the data with the same order every time (per this question).
So is it possible to solve it?
Finally, I used rowid to order:
select * from mytable order by rowid
It seems work well.

Selecting 'highest' X rows without sorting

I've got a table with huge amount of data. Lets say 10GB of lines, containing bunch of crap. I need to select for example X rows (X is usually below 10) with highest amount column.
Is there any way how to do it without sorting the whole table? Sorting this amount of data is extremely time-expensive, I'd be OK with one scan through the whole table and selecting X highest values, and letting the rest untouched. I'm using SQL Server.
Create an index on amount then SQL Server can select the top 10 from that and do bookmark lookups to retrieve the missing columns.
SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
if it is indexed, the query optimizer should use the index.
If not, I do no see how one could avoid scanning the whole thing...
Wether an index is usefull or not depends on how often you do that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ration.

Query Optimisation (ORDER BY)

I have an SQL query similar to below:
SELECT NAME,
MY_FUNCTION(NAME) -- carries out some string manipulation
FROM TITLES
ORDER BY NAME; -- has an index.
The TITLES table has approximately 12,000 records. At the moment the query takes over 5 minutes to execute but if I remove the ORDER BY clause then it executes within a couple of seconds.
Does anyone have any suggestions on how to be speed up this query.
If MY_FUNCTION is deterministic (i.e. always returns the same result for the same input value) then you could create an index on (NAME, MY_FUNCTION(NAME)) and it may help (or may not!)
In comments under the question, you say that it takes 2 seconds "to return N rows without the ORDER BY". That makes sense: without the ORDER BY you will just get the first N rows encountered, as soon as they are encountered. With the ORDER BY, the first N rows are returned only after the results have been sorted into the correct order.
If the query is being used in a situation where getting the first N rows fast is important (e.g. an online report with pagination) then you could try adding a FIRST_ROWS or FIRST_ROWS_n hint to the query, to try to persuade it to use the index. See Choosing an Optimizer Goal
Use the EXPLAIN statement to see where the issue is
EXPLAIN SELECT NAME, MY_FUNCTION(NAME) FROM TITLES ORDER BY NAME;
Sounds weird. What's name column type?
Have you checked for defective hardware errors? Maybe (just maybe) your query with the order by clause is using your index, and your index is located in a defective disk (it could be in a different disk from the table if they are located in different tablespaces).