How to select first 'N' records from a database containing million records?


I have an Oracle database populated with a million records. I am trying to write a SQL query that returns the first 'N' sorted records (say 100 records) from the database based on a certain condition.
SELECT *
FROM myTable
Where SIZE > 2000
ORDER BY NAME DESC
Then programmatically select first N records.
The problem with this approach is:
The query returns about half a million records, and "ORDER BY NAME" causes all of them to be sorted on NAME in descending order. This sort takes a long time (nearly 30-40 seconds; if I omit the ORDER BY, the query takes only 1 second).
After the sort I am interested in only the first N (100) records, so sorting the complete result set is wasted work.
My questions are:
Is it possible to specify 'N' in the query itself (so that the sort applies to only N records and the query becomes faster)?
Is there a better way in SQL to sort only N elements and return them quickly?

If your purpose is to find 100 random rows and sort them afterwards, then Lasse's solution is correct. If, as I think, you want the first 100 rows sorted by name while discarding the others, you would build a query like this:
SELECT *
FROM (SELECT *
FROM myTable
WHERE SIZE > 2000 ORDER BY NAME DESC)
WHERE ROWNUM <= 100
The optimizer will understand that it is a TOP-N query and will be able to use an index on NAME. It won't have to sort the entire result set: it will start at the end of the index, read it backwards, and stop after 100 rows.
You could also add a hint to your original query to let the optimizer understand that you are interested in the first rows only. This will probably generate a similar access path:
SELECT /*+ FIRST_ROWS */ * FROM myTable WHERE SIZE > 2000 ORDER BY NAME DESC
Edit: just adding AND rownum <= 100 to the query won't work, because in Oracle rownum is assigned before sorting; this is why you have to use a subquery. Without the subquery, Oracle will select 100 arbitrary rows and then sort them.
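For anyone without an Oracle instance at hand, the difference between sorting before limiting and limiting before sorting can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in, so LIMIT plays the role of ROWNUM here. The sample data and the function names top_n_sorted and limit_then_sort are made up for illustration, and the sketch says nothing about Oracle's execution plan.

```python
import sqlite3

# In-memory stand-in for myTable; names are zero-padded so text order is predictable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (name TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [("row%03d" % i, 1000 + i * 10) for i in range(500)],
)


def top_n_sorted(n):
    # SQLite applies LIMIT after ORDER BY, which is the effect the Oracle
    # "ORDER BY in a subquery, ROWNUM outside" form is meant to achieve.
    sql = "SELECT name FROM myTable WHERE size > 2000 ORDER BY name DESC LIMIT ?"
    return [r[0] for r in conn.execute(sql, (n,))]


def limit_then_sort(n):
    # Limit first, sort afterwards: n arbitrary matching rows, then sorted.
    # This mirrors what "AND rownum <= n" does without the subquery.
    sql = """
        SELECT name FROM (
            SELECT name FROM myTable WHERE size > 2000 LIMIT ?
        ) ORDER BY name DESC
    """
    return [r[0] for r in conn.execute(sql, (n,))]


print(top_n_sorted(3))     # the three largest names among matching rows
print(limit_then_sort(3))  # three matching rows chosen by the engine, sorted
```

Only the first function is guaranteed to return the rows with the largest names; which rows the second one returns is up to the engine.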

This shows how to pick the top N rows depending on your version of Oracle.
From Oracle 9i onwards, the RANK() and DENSE_RANK() functions can be used to determine the TOP N rows. Examples:
Get the top 10 employees based on their salary
SELECT ename, sal
FROM (SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
FROM emp)
WHERE sal_rank <= 10;
Select the employees making the top 10 salaries
SELECT ename, sal
FROM (SELECT ename, sal, DENSE_RANK() OVER (ORDER BY sal DESC) sal_dense_rank
FROM emp)
WHERE sal_dense_rank <= 10;
The difference between the two is explained here
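The difference also shows up on a tiny data set with tied salaries. The following is a minimal sketch using Python's sqlite3 module (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added). The six-row emp table and the helper top_by are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?)",
    [("A", 5000), ("B", 5000), ("C", 4000),
     ("D", 3000), ("E", 3000), ("F", 2000)],
)


def top_by(rank_fn, n):
    # rank_fn is interpolated into the SQL text, so only accept known names.
    if rank_fn not in ("RANK", "DENSE_RANK"):
        raise ValueError(rank_fn)
    sql = f"""
        SELECT ename, sal FROM (
            SELECT ename, sal, {rank_fn}() OVER (ORDER BY sal DESC) AS rnk
            FROM emp
        ) WHERE rnk <= ?
        ORDER BY sal DESC, ename
    """
    return conn.execute(sql, (n,)).fetchall()


# RANK leaves gaps after ties (1, 1, 3, 4, 4, 6), so "rank <= 3" keeps 3 rows.
print(top_by("RANK", 3))
# DENSE_RANK has no gaps (1, 1, 2, 3, 3, 4), so it keeps the top 3 distinct salaries.
print(top_by("DENSE_RANK", 3))
```

With ties, RANK answers "the top 3 employees" while DENSE_RANK answers "everyone earning one of the top 3 salaries".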

Add this:
AND rownum <= 100
to your WHERE-clause.
However, this won't do what you're asking.
If you want to pick 100 random rows, sort those, and then return them, you'll have to formulate a query without the ORDER BY first, then limit that to 100 rows, then select from that and sort.
This could work, but unfortunately I don't have an Oracle server available to test:
SELECT *
FROM (
SELECT *
FROM myTable
WHERE SIZE > 2000
AND rownum <= 100
) x
ORDER BY NAME DESC
But note the "random" part there, you're saying "give me 100 rows with SIZE > 2000, I don't care which 100".
Is that really what you want?
And no, you won't actually get a random result, in the sense that it'll change each time you query the server, but you are at the mercy of the query optimizer. If the data load and index statistics for that table change over time, at some point you might get different data than you did on the previous query.

Your problem is that the sort is being done every time the query is run. You can eliminate the sort operation by using an index - the optimiser can use an index to eliminate a sort operation - if the sorted column is declared NOT NULL.
(If the column is nullable, it is still possible, by either (a) adding a NOT NULL predicate to the query, or (b) adding a function-based index and modifying the ORDER BY clause accordingly).

Just for reference, in Oracle 12c and later this task can be done using the FETCH clause (for example, FETCH FIRST 100 ROWS ONLY after the ORDER BY). You can see here for examples and additional reference links regarding this matter.

Related

How can I get the total result count, and a given subset ('page' of results) with the same SQL Query with Oracle

I would like to display a table of results. The data is sourced from a SQL query on an Oracle database. I would like to show the results one page (say, 10 records) at a time, minimising the actual data being sent to the front-end.
At the same time, I would like to show the total number of possible results (say, showing 1-10 of 123), and to allow for pagination (say, to calculate that 10 per page, 123 results, therefore 13 pages).
I can get the total number of results with a single count query.
SELECT count(*) AS NUM_RESULTS FROM ... etc.
and I can get the desired subset with another query
SELECT * FROM ... etc. WHERE ? <= ROWNUM AND ROWNUM < ?
But, is there a way to get all the relevant details in one single query?
Update
Actually, the above query using ROWNUM seems to work for 0 - 10, but not for 10 - 20, so how can I do that too?

ROWNUM is a bit tricky to use.
The ROWNUM pseudocolumn always starts with 1 for the first result that actually gets fetched. If you filter for ROWNUM>10, you will never fetch any result and therefore will not get any.
If you want to use it for paging (not that you really should), it requires nested subqueries:
select * from
(select rownum n, x.* from
(select * from mytable order by name) x
)
where n between 3 and 5;
Note that you need another nested subquery to get the order by right; if you put the order by one level higher
select * from
(select rownum n, x.* from mytable x order by name)
where n between 3 and 5;
it will pick 3 random(*) rows and sort them, but that is usually not what you want.
(*) not really random, but probably not what you expect.
See http://use-the-index-luke.com/sql/partial-results/window-functions for more efficient ways to implement pagination.

You can use an inner join on your table and fetch the total number of results in a subquery. An example of such a query is as follows:
SELECT E.emp_name, E.emp_age, E.emp_sal, T.emp_count
FROM EMP E
INNER JOIN (SELECT emp_name, COUNT(*) AS emp_count
FROM EMP GROUP BY emp_name) T
ON E.emp_name = T.emp_name WHERE E.emp_age < 35;

I'm not sure exactly what you're after based on the wording of your question, but it seems you want every record whose row number lies between two values and, in an adjacent field of each record, the total count of records. If so, you can select everything from your table and join it to a subquery that returns the COUNT value, using ON 1=1 (i.e. matching every row) to tack that field onto each record. Example:
SELECT *
FROM table_name LEFT JOIN (SELECT COUNT(*) AS NUM_RESULTS FROM table_name) ON 1=1
WHERE ? <= ROWNUM AND ROWNUM < ?
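The same idea of attaching the total to every row of a page can be sketched with an analytic count, COUNT(*) OVER (), which is computed over the whole filtered set before the page is cut out. The sketch below runs on Python's sqlite3 module (assuming SQLite 3.25 or newer) with LIMIT/OFFSET standing in for the ROWNUM paging. The results table, its 123 rows, and the helper fetch_page are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO results (label) VALUES (?)",
    [("item%03d" % i,) for i in range(1, 124)],  # 123 rows
)


def fetch_page(page, page_size=10):
    # COUNT(*) OVER () sees every row that passes the WHERE clause,
    # so each returned row carries the grand total alongside its data.
    sql = """
        SELECT label, COUNT(*) OVER () AS num_results
        FROM results
        ORDER BY label
        LIMIT ? OFFSET ?
    """
    rows = conn.execute(sql, (page_size, (page - 1) * page_size)).fetchall()
    total = rows[0][1] if rows else 0  # no rows means no total either
    return [r[0] for r in rows], total


labels, total = fetch_page(2)
page_count = -(-total // 10)  # ceiling division
print(labels[0], labels[-1], total, page_count)
```

One limitation of this shape: a page past the end returns no rows, and therefore no total, so a caller may still need a separate count in that case.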

total number of rows of a query

I have a very large query that is supposed to return only the top 10 results:
select top 10 ProductId from .....
The problem is that I also want the total number of results that match the criteria without that 'top 10', but at the same time it's considered unacceptable to return all rows (we are talking of roughly 100 thousand results).
Is there a way to get the total number of rows affected by the previous query, either in it or afterwards, without running it again?
PS: please no temp tables of 100 000 rows :))

Dump the count in a variable and return that:
declare @count int
select @count = count(*) from ..... --same where clause as your query
--now you add that to your query..of course it will be the same for every row..
select top 10 ProductId, @count as TotalCount from .....

Assuming that you're using an ORDER BY clause already (to properly define which the "TOP 10" results are), then you could add a call of ROW_NUMBER also, with the opposite sort order, and pick the highest value returned.
E.g., the following:
select top 10 *,ROW_NUMBER() OVER (order by id desc) from sysobjects order by ID
Has a final column with values 2001, 2000, 1999, etc, descending. And the following:
select COUNT(*) from sysobjects
Confirms that there are 2001 rows in sysobjects.
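The reversed ROW_NUMBER trick can be tried out on a small scale. This is a sketch only, using Python's sqlite3 module (SQLite 3.25 or newer assumed) with LIMIT 10 standing in for TOP 10. The products table with 2001 rows is invented to match the numbers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO products (id) VALUES (?)",
    [(i,) for i in range(1, 2002)],  # ids 1..2001
)

# Number the rows in the opposite order to the one used for the top 10.
sql = """
    SELECT id, ROW_NUMBER() OVER (ORDER BY id DESC) AS reverse_rn
    FROM products
    ORDER BY id
    LIMIT 10
"""
top10 = conn.execute(sql).fetchall()

# The largest reverse row number among the returned rows is the total count.
total = max(r[1] for r in top10)
print(top10[0], total)
```

This relies on the sort key being unique. With ties in the ORDER BY column, the row holding the highest reverse number is not guaranteed to be among the ten returned.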

I suppose you could hack it with a union select
select top 10 ... from ... where ...
union
select count(*) from ... where ...
For you to get away with this type of hack you will need to add fake columns to the count query so it returns the same number of columns as the main query. For example:
select top 10 id, first_name from people
union
select count(*), '' as first_name from people
I don't recommend using this solution. Using two separate queries is how it should be done.

Generally speaking no - reasoning is as follows:
If(!) the query planner can make use of TOP 10 to return only 10 rows then RDBMS will not even know the exact number of rows that satisfy the full criteria, it just gets the TOP 10.
Therefore, when you want to find out count of all rows satisfying the criteria you are not running it the second time, but the first time.
Having said that proper indexes might make both queries execute pretty fast.
Edit
MySQL has SQL_CALC_FOUND_ROWS which returns the number of rows that query would return if there was no LIMIT applied - googling for an equivalent in MS SQL points to analytical SQL and CTE variant, see this forum (even though not sure that either would qualify as running it only once, but feel free to check - and let us know).

Efficient SQL to count an occurrence in the latest X rows

For example I have:
create table a (i int);
Assume there are 10k rows.
I want to count 0's in the last 20 rows.
Something like:
select count(*) from (select i from a limit 20) where i = 0;
Is that possible to make it more efficient? Like a single SQL statement or something?
PS. DB is SQLite3 if that matters at all...
UPDATE
PPS. No need to group by anything in this instance, assume the table that is literally 1 column (and presumably the internal DB row_ID or something). I'm just curious if this is possible to do without the nested selects?

You'll need to order by something in order to determine the last 20 rows. When you say last, do you mean by date, by ID, ...?
Something like this should work:
select count(*)
from (
select i
from a
order by j desc
limit 20
) where i = 0;

If you do not remove rows from the table, you may try the following hacky query:
SELECT COUNT(*) as cnt
FROM A
WHERE
ROWID > (SELECT MAX(ROWID)-20 FROM A)
AND i=0;
It operates with ROWIDs only. As the documentation says: Rows are stored in rowid order.
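Since the question is about SQLite, both approaches can be compared directly. The sketch below uses Python's sqlite3 module, orders the nested query by the implicit rowid (an assumption, since the table has no other ordering column), and fills the table with made-up values. It also shows what happens to the ROWID shortcut once rows have been deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (i INT)")
# 10k rows; every fourth row holds a 0.
conn.executemany("INSERT INTO a VALUES (?)", [(n % 4,) for n in range(10000)])

NESTED = """
    SELECT COUNT(*) FROM (
        SELECT i FROM a ORDER BY rowid DESC LIMIT 20
    ) WHERE i = 0
"""
ROWID_HACK = """
    SELECT COUNT(*) FROM a
    WHERE rowid > (SELECT MAX(rowid) - 20 FROM a) AND i = 0
"""


def count_zeros(sql):
    return conn.execute(sql).fetchone()[0]


# With no gaps in rowid, both queries look at the same 20 rows.
before = (count_zeros(NESTED), count_zeros(ROWID_HACK))

# Deleting rows leaves gaps, so the rowid window now covers fewer than 20 rows.
conn.execute("DELETE FROM a WHERE rowid BETWEEN 9985 AND 9995")
after = (count_zeros(NESTED), count_zeros(ROWID_HACK))

print(before, after)
```

The two queries agree only while the rowids are contiguous, which is the "if you do not remove rows" condition stated above.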

You need to remember to order by when you use limit, otherwise the result is indeterminate. To get the latest rows added, you need to include a column with the insertion date, then you can use that. Without this column you cannot guarantee that you will get the latest rows.
To make it efficient you should ensure that there is an index on the column you order by, possibly even a clustered index.

I'm afraid that you need a nested select to be able to count and restrict to last X rows at a time, because something like this
SELECT count(*) FROM a GROUP BY i HAVING i = 0
will count 0's, but in ALL table records, because a LIMIT in this query will basically have no effect.
However, you can optimize making COUNT(i) as it is faster to COUNT only one field than 2 or more (in this case your table will have 2 fields, i and rowid, that is automatically created by SQLite in PKless tables)

Are the results deterministic, if I partition SQL SELECT query without ORDER BY?

I have a SQL SELECT query which returns a lot of rows, and I have to split it into several partitions, i.e. set max results to 10000 and iterate over the rows, calling the query each time with an increasing first result (0, 10000, 20000). All the queries are done in the same transaction, and the data that my queries fetch does not change during the process (other data in those tables can change, though).
Is it ok to use just plain select:
select a from b where...
Or do I have to use order by with the select:
select a from b where ... order by c
In order to be sure that I will get all the rows? In other words, is it guaranteed that a query without order by will always return the rows in the same order?
Adding order by to the query drops performance of the query dramatically.
I'm using Oracle, if that matters.
EDIT: Unfortunately I cannot take advantage of scrollable cursor.

Order is definitely not guaranteed without an order by clause, but whether or not your results will be deterministic (aside from the order) would depend on the where clause. For example, if you have a unique ID column and your where clause included a different filter range each time you access it, then you would have non-ordered deterministic results, i.e.:
select a from b where ID between 1 and 100
select a from b where ID between 101 and 200
select a from b where ID between 201 and 300
would all return distinct result sets, but the order within each would not be guaranteed in any way.
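The range-based partitioning can be sketched as a loop over ID windows. This is an illustration only, run against Python's sqlite3 module rather than Oracle. The table contents and the helper fetch_in_ranges are assumptions, and it presumes a unique numeric id column as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, a TEXT)")
conn.executemany(
    "INSERT INTO b VALUES (?, ?)",
    [(i, "v%d" % i) for i in range(1, 251)],  # ids 1..250
)


def fetch_in_ranges(chunk):
    """Return the rows of b as a list of chunks, one per ID window."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM b").fetchone()
    if lo is None:  # empty table
        return []
    chunks = []
    start = lo
    while start <= hi:
        # No ORDER BY: each window is a disjoint set of rows, so every row
        # is seen exactly once even though the order inside a window is
        # whatever the engine chooses.
        rows = conn.execute(
            "SELECT a FROM b WHERE id BETWEEN ? AND ?",
            (start, start + chunk - 1),
        ).fetchall()
        chunks.append([r[0] for r in rows])
        start += chunk
    return chunks


chunks = fetch_in_ranges(100)
print([len(c) for c in chunks])
```

Windows can come back unevenly filled if ids have gaps, which is the price paid for not sorting.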

No, without order by it is not guaranteed that query will ALWAYS return the rows in the same order.

No guarantees unless you have an order by on the outermost query.
Bad SQL Server example, but same rules apply. Not guaranteed order even with inner query
SELECT
*
FROM
(
SELECT
*
FROM
Mytable
ORDER BY SomeCol
) foo

Use Limit
So you would do:
SELECT * FROM table ORDER BY id LIMIT 0,100
SELECT * FROM table ORDER BY id LIMIT 100,100
SELECT * FROM table ORDER BY id LIMIT 200,100
The first LIMIT value is the zero-based position you want to start from, and the second is how many results you want to see.
It's a good pagination trick.

Paging with Oracle and sql server and generic paging method

I want to implement paging in a gridview or in an HTML table which I will fill using ajax. How should I write queries to support paging? For example, if the page size is 20 and the user clicks page 3, rows 41 to 60 must be shown in the table. I could get all records up front and put them into a cache, but I think this is the wrong way, because the data can be very large and can be changed by other sessions. So how can I implement this? Is there any generic way (for all databases)?

As others have suggested, you can use rownum in Oracle. It's a little tricky though and you have to nest your query twice.
For example, to paginate the query
select first_name from some_table order by first_name
you need to nest it like this
select first_name from
(select rownum as rn, first_name from
(select first_name from some_table order by first_name)
) where rn > 100 and rn <= 200
The reason for this is that rownum is determined after the where clause and before the order by clause. To see what I mean, you can query
select rownum,first_name from some_table order by first_name
and you might get
4 Diane
2 Norm
3 Sam
1 Woody
That's because oracle evaluates the where clause (none in this case), then assigns rownums, then sorts the results by first_name. You have to nest the query so it uses the rownum assigned after the rows have been sorted.
The second nesting has to do with how rownum is treated in a where condition. Basically, if you query "where rownum > 100" then you get no results. It's a chicken and egg thing where it can't return any rows until it finds rownum > 100, but since it's not returning any rows it never increments rownum, so it never counts to 100. Ugh. The second level of nesting solves this. Note it must alias the rownum column at this point.
Lastly, your order by clause must make the query deterministic. For example, if you have John Doe and John Smith, and you order by first name only, then the two can switch places from one execution of the query to the next.
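The tie-breaking point can be illustrated with a short sketch: adding a unique column to the ORDER BY makes every page well defined. It runs on Python's sqlite3 module with LIMIT/OFFSET instead of nested rownum queries. The sample names, the id column, and the helper page are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE some_table (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO some_table (first_name, last_name) VALUES (?, ?)",
    [("John", "Smith"), ("Diane", "Jones"), ("John", "Doe"),
     ("Norm", "Brown"), ("John", "Adams")],
)


def page(number, size=2):
    # first_name alone leaves the three Johns in an unspecified order;
    # the unique id as a second sort key pins each row to one position.
    sql = "SELECT id FROM some_table ORDER BY first_name, id LIMIT ? OFFSET ?"
    return [r[0] for r in conn.execute(sql, (size, (number - 1) * size))]


pages = [page(n) for n in (1, 2, 3)]
print(pages)
```

Without the second sort key, a row could in principle appear on two pages or on none, depending on how the engine orders the ties on each execution.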
There are articles here http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html
and here http://www.oracle.com/technology/oramag/oracle/07-jan/o17asktom.html. Now that I see how long my post is, I probably should have just posted those links...

Unfortunately, the methods for restricting the range of rows returned by a query vary from one DBMS to another: Oracle uses ROWNUM (see ocdecio's answer), but ROWNUM won't work in SQL Server.
Perhaps you can encapsulate these differences with a function that takes a given SQL statement and first and last row numbers and generates the appropriate paginated SQL for the target DBMS - i.e. something like:
sql = paginated ('select empno, ename from emp where job = ?', 101, 150)
which would return
'select * from (select v.*, ROWNUM rn from ('
+ theSql
+ ') v where rownum <= 150) where rn >= 101'
for Oracle and something else for SQL Server.
However, note that the Oracle solution is adding a new column RN to the results that you'll need to deal with.
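A wrapper of this kind might look like the following sketch. The function names paginate_oracle and paginate_row_number are hypothetical, and the ROW_NUMBER variant is one possible shape for a DBMS that supports that function rather than an established API. The last few lines run the ROW_NUMBER variant against Python's sqlite3 module (SQLite 3.25 or newer assumed) simply to check that the generated SQL is well formed.

```python
import sqlite3


def paginate_oracle(sql, first, last):
    """Wrap sql (which should carry its own ORDER BY) in the ROWNUM pattern."""
    return (
        "select * from (select v.*, ROWNUM rn from ("
        + sql
        + f") v where ROWNUM <= {int(last)}) where rn >= {int(first)}"
    )


def paginate_row_number(sql, order_by, first, last):
    """Variant for engines with ROW_NUMBER(); sql must not contain ORDER BY.

    order_by is pasted into the statement, so it must come from trusted code.
    """
    return (
        "select * from (select v.*, "
        f"ROW_NUMBER() OVER (ORDER BY {order_by}) AS rn from ({sql}) v) p "
        f"where rn between {int(first)} and {int(last)}"
    )


oracle_sql = paginate_oracle(
    "select empno, ename from emp where job = ? order by empno", 101, 150
)
generic_sql = paginate_row_number(
    "select empno, ename from emp where job = ?", "empno", 101, 150
)

# Sanity check of the ROW_NUMBER variant on an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, ename TEXT, job TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(i, "e%d" % i, "CLERK") for i in range(1, 201)],
)
rows = sorted(conn.execute(generic_sql, ("CLERK",)).fetchall())
print(oracle_sql)
print(rows[0], rows[-1])
```

Both variants add an rn column to the result, so the caller still has to drop or ignore it.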

I believe that both have a ROW_NUMBER analytic function. Use that and your queries will be identical.
In Oracle it is documented under ROW_NUMBER.
Yep, just verified that ROW_NUMBER is the same function in both.

"Because...data can be change from other sessions."
What do you want to happen in this case?
For example, the user gets the 'latest' ten rows at 10:30.
At 10:31, 3 new rows are added (so those ten being viewed by the user are no longer the latest).
At 10:32, the user requests the 'next' ten entries.
Do you want that new set to include those three that have been bumped from 8/9/10 down to 11/12/13 ?
If not, in Oracle you can select the data as it was at 10:30
SELECT * FROM table_1 as of timestamp (timestamp '2009-01-29 10:30:00');
You still need the row_number logic, eg
select * from
(SELECT a.*, row_number() over (order by hire_date) rn
FROM hr.employees as of timestamp (timestamp '2009-01-29 10:30:00') a)
where rn between 10 and 19

select *
from ( select /*+ FIRST_ROWS(n) */ a.*, ROWNUM rnum
from ( your_query_goes_here, with order by ) a
where ROWNUM <= :MAX_ROW_TO_FETCH )
where rnum >= :MIN_ROW_TO_FETCH;
Step 1: your query with order by
Step 2: select a.*, ROWNUM rnum from ()a where ROWNUM <=:MAX_ROW_TO_FETCH
Step 3: select * from ( ) where rnum >= :MIN_ROW_TO_FETCH;
put 1 in 2 and 2 in 3

If the expected data set is huge, I'd recommend creating a temp table, a view or a snapshot (materialized view) to store the query results plus a row number obtained with either ROWNUM or the ROW_NUMBER analytic function. After that you can simply query this temp storage using row number ranges.
Basically, you need to separate the actual data fetch from the paging.

There is no uniform way to implement paging across the various RDBMS products. Oracle gives you rownum, which you can use in a where clause like:
where rownum < 1000
SQL Server gives you the ROW_NUMBER() function, which can be used in a similar way to Oracle's rownum. However, ROW_NUMBER() isn't available before SQL Server 2005.
