SQL real limit/top function

I have a question about LIMIT/TOP. As I understand it, the whole table is processed before the limit is applied.
So if I write Select * from TABLE limit 2, the whole table is processed first and the result is cut down afterwards.
Is there a way to cut it before it gets processed, for example "take 2 random rows", so that I query only part of the table rather than all of it?
I hope this question makes sense. I would appreciate your help!

In the execution plan tree a LIMIT node will stop processing the child nodes as soon as it's complete; i.e., when it receives the maximum number of rows from the child nodes (in your case 2 rows).
This will be very effective in terms of performance and response time if the child nodes are pipelined, reducing the cost drastically. For example:
select * from t limit 2;
If the child nodes are materialized, the whole subbranch is processed before the limit is applied, so the LIMIT does little to reduce the cost or the response time. For example:
select * from t order by rand() limit 2;
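As a rough analogy only (this is not how any particular database is implemented), the pipelined and materialized cases can be sketched with Python iterators. The names scan_table and rows_read are made up for this illustration; the generator stands in for a child node that hands rows to LIMIT one at a time.

```python
import itertools
import random

def scan_table(n_rows, counter):
    """Hypothetical table scan: yields one row at a time and counts rows read."""
    for i in range(n_rows):
        counter["rows_read"] += 1
        yield i

# Pipelined child: LIMIT pulls rows one by one and stops after 2.
pipelined = {"rows_read": 0}
first_two = list(itertools.islice(scan_table(1000, pipelined), 2))

# Materialized child: the sort (ORDER BY rand()) has to consume every row
# before LIMIT sees the first one.
materialized = {"rows_read": 0}
shuffled = sorted(scan_table(1000, materialized), key=lambda _row: random.random())
random_two = shuffled[:2]

print(pipelined["rows_read"])     # rows read for the plain LIMIT
print(materialized["rows_read"])  # rows read for ORDER BY rand() LIMIT
```

In the first case the scan is abandoned after two rows; in the second the full scan happens regardless of the limit.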

In MySQL, the LIMIT clause of a SELECT statement restricts the number of rows returned from the result set, rather than fetching the whole set from the table.
If you use Select * from TABLE limit 2, the two rows come back in no guaranteed order. It is better to combine the LIMIT clause with filter criteria, which can also improve performance on the table.
For example:
SELECT * FROM TABLE
WHERE column_name >30
ORDER BY column_name DESC
LIMIT 5;

Related

Does Snowflake preserve retrieval order?

Posting two questions:
1.
Let's say there is a query:
SELECT C1, C2, C3 from TABLE;
When this query is fired for the first time, it retrieves all the values in a certain order.
Next time, when the same query is fired, will the previous order be retained?
2.
There are 2 tables, TABLE1 and TABLE2, and both of them have identical data.
Will (SELECT * from TABLE1) and (SELECT * from TABLE2) retrieve the same order of rows?
SQL tables represent unordered sets. Period. There is no ordering in a result set unless you explicitly include an ORDER BY.
It is that simple. If you want data in a particular order, then you need to use ORDER BY. That is how relational databases work.
The same query can return results in different orders each time the query is executed. There are no guarantees about the order -- unless the query has an ORDER BY for the outermost SELECT.
No, unless you are fetching data from result cache!
No, unless they are very small tables and your query runs with low parallelism.
Sorry for the extra answer, but I see Tim claims that the query will return the same result as long as the underlying table(s) are not modified and the query has the same execution plan.
Snowflake executes the queries in parallel, therefore the order of data is not predictable unless ORDER BY is used.
Let's create a table (big enough to be processed in parallel), and run a simple test case:
-- running on medium warehouse
create or replace table my_test_table ( id number, name varchar ) as
select seq4(), 'gokhan' || seq4() from table(generator(rowcount=>1000000000));
alter session set USE_CACHED_RESULT = false;
select * from my_test_table limit 10;
You will see that it returns different rows every time you run it.
To answer both questions in short: No.
If your query has no ORDER BY clause, the SELECT statement always returns an unordered set. This means that even if you query the same table twice and the data didn't change, a SELECT without ORDER BY can return the rows in a different order.
https://docs.snowflake.com/en/sql-reference/sql/select.html

In a SQL table with many rows, how can I quickly determine if a query might return more than 1000 rows

NOTE: This is a re-posting of a question from a Stack Overflow Teams site to attract a wider audience
I have a transaction log table that has many millions of records. Many of the data items that are linked to these logs might have more than 100K rows each.
I have a requirement to display a warning if a user tries to delete an item when more than 1000 rows exist for it in the log table.
We have determined that 1000 logs means the item is in use.
If I simply query the table to look up the total number of log rows, the query takes too long to execute:
SELECT COUNT(1)
FROM History
WHERE SensorID IN (SELECT Id FROM Sensor WHERE DeviceId = 96)
Is there a faster way to determine if the entity has more than 1000 log records?
NOTE: history table has an index on the SensorId column.
You are right to use COUNT instead of returning all the rows and checking the record count, but we are still asking the database engine to seek across all matching rows.
If the requirement is not to return the total number of rows, but just to determine whether there are more than X rows, then the first improvement I would make is to return the count of just the first X rows from the table.
So if X is 1000, your application logic does not need to change; you will still be able to tell the difference between an item with 999 logs and one with 1000+ logs.
We simply change the existing query to select the TOP(X) rows instead of the count, and then return the count of that result set. Select only the primary key or a unique indexed column so that we are only inspecting the index and not the underlying table store.
SELECT COUNT(Id) FROM (
    SELECT TOP(1000)    -- limit the seek that the DB engine does to the limit
        Id              -- further constrain the seek to just the indexed column
    FROM History
    WHERE SensorId IN ( -- this is the same filter condition as before, just re-formatted
        SELECT Id
        FROM Sensor
        WHERE DeviceId = 96)
) AS trunk
Changing this query to top 10,000 still provides sub-second response, however with X = 100,000 the query took almost as long as the original query
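The capped-count idea can be tried out in a self-contained way with a minimal sketch like the one below. It uses Python's built-in sqlite3 module as a stand-in, so LIMIT replaces SQL Server's TOP; the index name, the sample data and the helper capped_log_count are assumptions made for this illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sensor  (Id INTEGER PRIMARY KEY, DeviceId INTEGER);
    CREATE TABLE History (Id INTEGER PRIMARY KEY, SensorId INTEGER);
    CREATE INDEX IX_History_SensorId ON History (SensorId);
""")
# Made-up data: device 96 has 5000 log rows, device 97 has 10.
conn.executemany("INSERT INTO Sensor (Id, DeviceId) VALUES (?, ?)", [(1, 96), (2, 97)])
conn.executemany("INSERT INTO History (SensorId) VALUES (?)", [(1,)] * 5000 + [(2,)] * 10)

def capped_log_count(conn, device_id, cap):
    """Count matching History rows, but never return more than `cap`."""
    sql = """
        SELECT COUNT(*) FROM (
            SELECT Id FROM History
            WHERE SensorId IN (SELECT Id FROM Sensor WHERE DeviceId = ?)
            LIMIT ?
        ) AS capped
    """
    return conn.execute(sql, (device_id, cap)).fetchone()[0]

busy = capped_log_count(conn, 96, 1000)   # item with far more than 1000 logs
quiet = capped_log_count(conn, 97, 1000)  # item with only a few logs
print(busy, quiet)
```

A result equal to the cap means "at least that many rows"; anything lower is the exact count. If the rule really is "strictly more than 1000", passing a cap of 1001 distinguishes exactly 1000 rows from more.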
There is another seemingly 'silver bullet' approach to this type of issue if the table in question has a high transaction rate and the main reason for the execution time is waiting caused by lock contention.
If you suspect that locks are the issue, and you can accept a count that includes uncommitted rows, then you can use the WITH (NOLOCK) table hint to allow the query to run effectively at the READ UNCOMMITTED transaction isolation level.
There is a good discussion about the effect of the NOLOCK table hint on select queries here
SELECT COUNT(1) FROM History WITH (NOLOCK)
WHERE SensorId IN (SELECT Id FROM Sensor WHERE DeviceId = 96)
Although NOLOCK is strongly discouraged in general, this is a good example of a scenario where it can easily be permitted. It even makes sense, as your count before delete will take into account another user or operation that is actively adding to the log count.
After many trials, when querying for 1000 or 10K rows the select with count solution is still faster than using the NOLOCK table hint. NOLOCK, however, presents an opportunity to execute the same query with minimal change while still returning in a timely manner.
The execution time of a select with NOLOCK will still increase as the number of rows in the underlying result set increases, whereas the execution time of the select that has a TOP with no ORDER BY clause should remain constant once the top limit has been exceeded.

Fast query in PostgreSQL

I have a very large database (~1TB), so running even a very simple query can take a very long time. E.g. for:
EXPLAIN select count(*) from users;
the cost is 44661683.87 disk page fetches, making it very expensive to execute.
When I try to put a limit on the query like:
EXPLAIN select count(*) from users limit 10;
the cost of executing the query remains the same, i.e. 44661683.87 disk page fetches.
So (1) is it possible to execute a query on a subset of the data and then extrapolate to the rest of the table? The row count can be quickly found using something like:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'users';
Moreover, (2) is it possible to select a randomly distributed subset of rows?
is it possible to execute a query on subset of data and then extrapolate to the rest of the table
You could use the tablesample option:
select count(*) * 10
from the_table tablesample system (10);
tablesample system (10) will only scan 10 percent of the blocks of the table, which should be quite fast. If you multiply the resulting row count by 10 you'll have an approximation(!) of the total number of rows. The smaller the sample size, the faster this will be, but also the less accurate.
The accuracy of the number depends on how much free space your table has, because the 10% (or whatever sample size you choose) is based on the total number of blocks in the table. If there are many free (or half free) blocks, then the number will be less reliable.
select count(*) . . . is an aggregation query with no group by. It returns 1 row, so the limit has no impact.
You seem to want:
select count(*)
from (select u.*
      from users u
      limit 10
     ) u10;
As for your second question, Postgres introduced tablesample in version 9.5. You can investigate that.
If you have a primary key index on the users table (or an index on another column), you can get it to use that index for an index-only scan which should result in a much better execution plan. But, strangely, it won't work with COUNT so you can do a SELECT DISTINCT in a subquery and then COUNT on an outer query to force it to use the index:
EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT id FROM users) u;

select first rows in oracle without full table scan

I'm trying to perform the following query in Oracle.
select * from (select rownum r, account from fooTable) where r<5001;
It selects the 1st 5000 rows. I'm running into a problem that fooTable has a lot of data inside of it and this is really slowing down the query (35 million+ rows). According to the query analyzer it's performing a full table scan.
My question is, is there a way to speed up this statement? Since I'm only fetching the 1st N rows, is the full table scan necessary?
I have found the /*+ FIRST_ROWS(n) */ hint to be very helpful in cases like this (such as for limiting pagination results). You replace n with whatever value you want.
select /*+ FIRST_ROWS(5000) */
       account
from fooTable
where rownum <= 5000;
You still need the rownum predicate to limit rows, but the hint lets the optimizer know you only need a lazy fetch of n rows.

processing large table - how do i select the records page by page?

I need to run a process on all the records in a table. The table could be very big, so I would rather process the records page by page. I need to remember the records that have already been processed so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is: how do I write such a select statement?
I'm using Oracle, by the way, but it would be nice if it could run on other databases too.
I also don't want to use a stored procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go, unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break up the table into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this will change a bit due to the clustering factor. But it will still take longer than just processing it in one go.
This all depends on how long it takes to process one row from the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective, you cannot beat a full table scan.
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from (
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
    where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
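select /*+ FIRST_ROWS(200) */ *
from (
    -- ROWNUM is assigned before ORDER BY within the same query block,
    -- so the sort goes in an inner view and rn is numbered outside it
    select rownum rn, v1.*
    from (
        select *
        from table t
        where filter_columns = 'where clause'
        order by columns_to_order_by
    ) v1
)
where rn between 101 and 200;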
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which would need to represent the end value and not the number of rows actually returned to the page to be accurate). You can set up the start and end values as bind variables that way, so you avoid hard parsing.
For more details, you can check out this post
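For readers who want to try the "start and end values as bind variables" idea outside Oracle, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. SQLite has no ROWNUM, so the window is expressed with ORDER BY plus LIMIT/OFFSET instead; the payload column, the sample data and the helpers fetch_page and process_all are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO MyTable (id, payload) VALUES (?, ?)",
                 [(i, f"row {i}") for i in range(1, 251)])

def fetch_page(conn, start, end):
    """Return rows number `start`..`end` (1-based, inclusive) in id order."""
    # The statement text never changes; only the bound values do.
    sql = "SELECT id, payload FROM MyTable ORDER BY id LIMIT :size OFFSET :skip"
    params = {"size": end - start + 1, "skip": start - 1}
    return conn.execute(sql, params).fetchall()

def process_all(conn, page_size=100):
    """Walk the table page by page and collect the ids that were processed."""
    processed = []
    start = 1
    while True:
        page = fetch_page(conn, start, start + page_size - 1)
        if not page:
            break
        processed.extend(row[0] for row in page)
        start += page_size
    return processed

ids = process_all(conn)
second_page = fetch_page(conn, 101, 200)
print(len(ids), second_page[0][0], second_page[-1][0])
```

A stable ORDER BY is what keeps consecutive pages from overlapping or skipping rows, which matches the ordered inner query in the Oracle examples above.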