Does a query goes through all data when you only select the last N? - sql

I have a query that select the last 5($new) items from my database.
SELECT OvenRunData.dataId AS id, OvenRunData.data AS data
FROM ovenRuns INNER JOIN OvenRunData ON OvenRuns.id = OvenRunData.ovenRunId
WHERE OvenRunData.ovenRunId = (SELECT MAX(id) FROM OvenRuns)
ORDER BY id DESC LIMIT '$new'
I want to execute this query every 5 seconds with an AJAX request so I can update my table.
I know this query select the last 5 records but I want to know if the query runs through all records and then selects the last 5 or does it select only the last 5 without checking all the data?
I'm really worried that I'll have lag.

You need two indexes to make it fast enough:
create index ix_OvenRuns_id on OvenRuns(id)
create index ix_OvenRunData_ovenRunId on OvenRunData(ovenRunId)
you can even put OvenRunData.dataId OvenRunData.data into the second one, or create clustered index, however, these indexes definitely avoid full data scan.

That depends on the indexes.
In your case, you should have one on OverRuns(id).
More here: http://use-the-index-luke.com/sql/partial-results/top-n-queries

The LIMIT is applied after the ORDER BY, and the ORDER BY is applied to the entire result-set. So the answer to your question is, yes it must go through all of the records in your result-set determined by your WHERE clause before applying the LIMIT.

Related

Does Snowflake preserve retrieval order?

Posting two questions:
1.
Let's say there is a query:
SELECT C1, C2, C3 from TABLE;
When this query is fired for the first time,it retrieves all the values in a certain order.
Next time, when the same query is fired, will the previous order be retained?
There are 2 tables, TABLE1 and TABLE2, both of them have identical data.
Will (SELECT * from TABLE1) and (SELECT * from TABLE1) retrieve the same order of rows?
SQL tables represent unordered sets. Period. There is no ordering in a result set unless you explicitly include an ORDER BY.
It is that simple. If you want data in a particular order, then you need to use ORDER BY. That is how relational databases work.
The same query can return results in different orders each time the query is executed. There are no guarantees about the order -- unless the query has an ORDER BY for the outermost SELECT.
No, unless you are fetching data from result cache!
No, unless they are very small tables and your query runs with low parallelism.
Sorry for extra answer, but I see Tim claims that the query will return same result as long as the underlying table(s) is not modified, and the query has same execution plan.
Snowflake executes the queries in parallel, therefore the order of data is not predictable unless ORDER BY is used.
Let's create a table (big enough to be processed in parallel), and run a simple test case:
-- running on medium warehouse
create or replace table my_test_table ( id number, name varchar ) as
select seq4(), 'gokhan' || seq4() from table(generator(rowcount=>1000000000));
alter session set USE_CACHED_RESULT = false;
select * from my_test_table limit 10;
You will see that it will return different rows every time you run.
To answer both questions short: No.
If your query has no ORDER BY-clause, the SELECT statement always returns an unordered set. This means: Even if you query the same table twice and the data didnt change, SELECT without ORDER BY can retrieve different row-orders.
https://docs.snowflake.com/en/sql-reference/sql/select.html

Postgres Skipping Duplicate Field in Select

I want to select a limited number of items but only keeping ones with a distinct value for a specific field. I have tried using SELECT DISTINCT ON(field) as well as GROUP BY but they are both extremely slow because the table is very large. I assume this is because using DISTINCT will actually sort the table into distinct values before selecting at all.
SELECT DISTINCT ON(parent) id FROM posts WHERE sub = ? LIMIT 25
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all. Similar to selecting a value with a condition, which (without an index) will scan each row and check if it meets the condition before continuing, how can I use not having duplicate fields as a condition?
Another way to think about it is how do I do this:
SELECT DISTINCT ON (parent) post.id FROM
(SELECT id FROM posts WHERE sub = ? ORDER BY id LIMIT 25) AS post
While guaranteeing that there are 25 results. Here the result is very fast but it will usually have less results than required because multiple rows can have the same parent.
The way you're thinking may seem to make sense, but if you think a little deeper, you'll find that it cannot work that way. You want 25 unique result. To give you that, first it needs to go through the records and find the unique ones then return the first 25.
What you actually want is for it to go through the records one by one and check, do I already have similar value? If yes, discard it and continue, if no, add it to the results. Now check, do I already have 25 results? If no, continue, if yes, stop and return the results.
This is not a trivial task to do in a query. Your best bet is to do it in a stored procedure with a cursor. That will be much easier as you are in full control of the flow, just follow the steps as per the description above.
For my purposes this is unnecessary because I am using a LIMIT and can guarantee the limit will be met without scanning much of the table at all.
If you really know that your first 25 results will be found in the first xx records (say first 100), and that's all you care to achieve, then you can use a somewhat dumb query:
SELECT DISTINCT ON (parent) post.id
FROM (SELECT id FROM posts WHERE sub = ? ORDER BY id LIMIT 100) AS post
LIMIT 25
Change the 100 to whatever suits your needs.
When you use distinct on, you should use order by:
SELECT DISTINCT ON (parent) id
FROM posts
WHERE sub = ?
ORDER BY parent
LIMIT 25;
To optimize this query, you want an index on posts(sub, parent, id).

SQL Select, different than the last 10 records

I have a table called "dutyroster". I want to make a random selection from this table's "names" column, but, I want the selection be different than the last 10 records so that the same guy is not given a second duty in 10 days. Is that possible ?
Create a temporary table with only one column called oldnames which will have no records initially. For each select, execute a query like
select names from dutyroster where dutyroster.names not in (select oldnamesfrom temporarytable) limit 10
and when execution is done add the resultset to the temporary table
The other answer already here is addressing the portion of the question on how to avoid duplicating selections.
To accomplish the random part of the selection, leverage newid() directly within your select statement. I've made this sqlfiddle as an example.
SELECT TOP 10
newid() AS [RandomSortColumn],
*
FROM
dutyroster
ORDER BY
[RandomSortColumn] ASC
Keep executing the query, and you'll keep getting different results. Use the technique in the other answer for avoiding doubling a guy up.
The basic idea is to use a subquery to get all but users from the last ten days, then sort the rest randomly:
select dr.*
from dutyroster dr
where dr.name not in (select dr2.name
from dutyroster dr2
where dr2.datetimecol >= date_sub(curdate(), interval 10 day)
)
order by rand()
limit 1;
Different databases may have different syntax for limit, rand(), and for the date/time functions. The above gives the structure of the query, but the functions may differ.
If you have a large amount of data and performance is a concern, there are other (more complicated) ways to take a random sample.
you could use TOP function for SQL Server
and for MYSQL you could use LIMIT function
Maybe this would help...
SELECT TOP number|percent column_name(s)
FROM table_name;
Source: http://www.w3schools.com/sql/sql_top.asp

Is it possible to query a table without order columns by page

I've a big table which contains more than 100K records, in oracle. I want to get all of the records and save each row to a file with JDBC.
In order to make it faster, I want to create 100 threads to read the data from the table concurrently. I will get the total count of the records in the first sql, then split it to 100 pages, then get one page in a thread with a new connection.
But I've a problem, that there is no any column can be used to order. There is no column with sequence, no accurate timestamp. I can't use a sql query without order by clause to query, since there is no guarantee it will return the data with the same order every time (per this question).
So is it possible to solve it?
Finally, I used rowid to order:
select * from mytable order by rowid
It seems work well.

processing large table - how do i select the records page by page?

I need to do a process on all the records in a table. The table could be very big so I rather process the records page by page. I need to remember the records that have already been processed so there are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is how do I write such select statement?
I'm using oracle btw, but would be nice if I can run on any other db too.
I also don't want to use store procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks, will end up taking more time than just processing everything in one go. Unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break up the table into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this will change a bit due to clustering factor. But it will anyway take longer than just processing it in one go.
This all depends on how long it takes to process one row from the table of course. You could chose to reduce the load on the server by processing chunks of data, but from a performance perspective, you cannot beat a full table scan.
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full tablescan when you don't want one. There are a couple ways to do this. The first way is a little longer to write, but let's Oracle automatically figure out the number of rows involved:
select *
from
(
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
select /*+ FIRST_ROWS(200) */ *
from (
select rownum rn, t.*
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rn between 101 and 200;
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which would need to represent the end value and not the number of rows actually returned to the page to be accurate). You can set up the start and end values as bind variables that way, so you avoid hard parsing.
For more details, you can check out this post