What is the minimum time complexity (order) of a query in SQL databases? - sql

I want to know the minimum query time in a given SQL (especially SQLite) database with n records.
I know that a full table scan is O(n) and that a lookup on an indexed column (or on RowId) is O(log(n)).
1st question: is there any situation in which the time is smaller than O(log(n))?
2nd question: why is querying on RowId (SELECT * FROM table_01 WHERE rowid='234') also O(log(n))? If RowId is ordered from 1 to n, I would logically expect that SQL can immediately find the row with a given RowId.

Finding a specific row requires a search. (Not every rowid is necessarily present, so the database needs to look.) The optimistic case, or even the average case, should be much faster than log(n), but the worst case cannot be, since it requires searching a list.

If you want to retrieve the smallest or largest value from an indexed column (SELECT MIN(x) FROM table), the database can simply read the first or last value, and the time is in O(1).
Indexes are stored as a B-tree, with the indexed columns as the key.
Tables are stored as a B-tree, with the rowid as the key, so searching for the rowid is just as fast as searching for a value in an index.
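To see the difference between a rowid lookup and a full scan, one option is to ask SQLite for its query plan. The following is a minimal sketch using Python's standard sqlite3 module; the table contents and the val column are made up for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one unindexed column, rowid assigned implicitly.
conn.execute("CREATE TABLE table_01 (val INTEGER)")
conn.executemany(
    "INSERT INTO table_01 (val) VALUES (?)",
    [(i * 7 % 1000,) for i in range(1000)],
)

def plan(sql):
    """Return the 'detail' text of EXPLAIN QUERY PLAN (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Lookup by rowid: a B-tree search on the table itself.
rowid_plan = plan("SELECT * FROM table_01 WHERE rowid = 234")
# Lookup by an unindexed column: every row has to be visited.
scan_plan = plan("SELECT * FROM table_01 WHERE val = 234")

print(rowid_plan)
print(scan_plan)
```

The first plan should report a SEARCH using the integer primary key, and the second a SCAN of the whole table.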

Related

Strange behavior when doing where and order by query in postgres

Background: a large table (50M+ rows); every column used in the query is indexed.
When I run a query like this:
select * from table where A=? order by id DESC limit 10;
both A and id are indexed.
Now something confusing happens:
the more rows the WHERE clause matches, the less time the whole query takes
the fewer rows the WHERE clause matches, the more time the whole query takes
My guess: Postgres applies the ORDER BY first and the WHERE afterwards, so it takes longer to find 10 rows in the ordered index when the target row set is small (like finding 10 particular grains of sand on a beach); conversely, if the target row set is large, it is easy to find the first 10.
Is that right, or is there some other reason for this?
Final question: how can I optimize this situation?
The planner can either use the index on A to apply the selectivity, then sort on "id" and apply the limit, or it can read the rows already in order using the index on "id" and filter them on the A condition until it finds 10 matches. It will choose the one it thinks is faster, and sometimes it makes the wrong choice.
If you had a multi-column index on (A, id), it could use that one index to do both things at the same time: get the selectivity on A and still fetch the rows already ordered by "id".
Do you know PGAdmin? With "explain verbose" before your statement, you can check how the query is executed (meaning the order of the operators). Usually first happens the filter and only afterwards the sorting...
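The effect of a multi-column index on (A, id) can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite through Python's sqlite3 module as a stand-in rather than Postgres, and the events table, its columns and the index names are hypothetical. Postgres plans will look different, but the idea (one index serving both the filter and the ordering) is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table. "id" is deliberately a plain column (not the rowid alias)
# so that the extra sort step is visible in the plan.
conn.execute("CREATE TABLE events (id INTEGER, a INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, i % 50, "x") for i in range(5000)],
)

QUERY = "SELECT * FROM events WHERE a = 7 ORDER BY id DESC LIMIT 10"

def plan(sql):
    """Return the 'detail' text of EXPLAIN QUERY PLAN (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Single-column index: the filter uses the index, but the result must be sorted.
conn.execute("CREATE INDEX idx_a ON events (a)")
plan_single = plan(QUERY)

# Composite index: filter on a, then read ids already in order (backwards for DESC).
conn.execute("DROP INDEX idx_a")
conn.execute("CREATE INDEX idx_a_id ON events (a, id)")
plan_composite = plan(QUERY)

top_ids = [row[0] for row in conn.execute(QUERY)]

print(plan_single)
print(plan_composite)
print(top_ids)
```

With the single-column index the plan includes a temporary B-tree for the ORDER BY; with the composite index that step disappears.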

SQL Server : OFFSET FETCH performs scan while TOP WHERE performs seek?

I got the following two queries. One is fast the other is slow.
The table has a clustered index on the Id column.
-- Slow, uses clustered index scan reading 100100 rows
SELECT *
FROM [dbo].[Foo]
ORDER BY Id
OFFSET 100000 ROWS FETCH FIRST 100 ROWS ONLY
-- Fast, uses clustered index seek reading 100 rows
SELECT TOP 100 *
FROM [dbo].[Foo]
WHERE Id > 100000
ORDER BY Id
The plans are identical except that one uses a scan and the other a seek.
Can anyone explain why or is this simply how OFFSET works?
The table is very wide with a few NVARCHAR(100-200) and a single NVARCHAR(2500) column.
The two queries are not equivalent. Although you might assume that the ids have no gaps and start at 1, the database engine does not know that.
Indexes are organized to find particular values quickly. They generally do this by traversing a tree structure, and one which is generally balanced. You can read more about this in the documentation.
However, they are not organized to quickly get to the nth row in the table. Hence, the query needs to scan the table to count the number of rows.
That said, the index could do what you want if it kept the number of rows in each child. Do realize that this would complicate modifications to the table, because the entire hierarchy would need to be updated for each update, insert, and delete.
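A small experiment shows why the two queries are not equivalent once the ids have gaps. This is a sketch using SQLite through Python's sqlite3 module, so it uses LIMIT/OFFSET instead of SQL Server's OFFSET FETCH and TOP; the foo table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 21)],
)
# Introduce gaps in the id sequence.
conn.execute("DELETE FROM foo WHERE id IN (3, 4, 5)")

# "Skip the first 10 rows": the engine has to count rows to know where to start.
offset_ids = [r[0] for r in conn.execute(
    "SELECT id FROM foo ORDER BY id LIMIT 5 OFFSET 10")]

# "Start after id 10": the engine can seek straight to the key.
seek_ids = [r[0] for r in conn.execute(
    "SELECT id FROM foo WHERE id > 10 ORDER BY id LIMIT 5")]

print(offset_ids)  # [14, 15, 16, 17, 18]
print(seek_ids)    # [11, 12, 13, 14, 15]
```

Because three rows were deleted, skipping 10 rows lands on id 14, while seeking past id 10 lands on id 11.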

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been led to believe by several sources (most notably this paper from some folks at Florida State University) that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
select *
from T
where Z >= [origin's Z value]
order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query is returning both columns. This means that the actual table rows must be accessed, since neither index includes both columns. This might push the optimizer towards preferring a full table scan for larger samples. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.
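The idea of satisfying the query from an index on (Z, K) alone can be sketched like this. It is an illustration only: it runs on SQLite through Python's sqlite3 module rather than Oracle, the payload column and the index name are hypothetical, and the Z values are generated arbitrarily. Oracle's optimizer and plan output differ, but the concept of an index that covers the query is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: K and Z as in the question, plus an extra column
# that the (Z, K) index does not contain.
conn.execute("CREATE TABLE T (K INTEGER, Z REAL, payload TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [(k, (k * 0.618034) % 1.0, "x") for k in range(1, 2001)],
)
conn.execute("CREATE INDEX idx_z_k ON T (Z, K)")

def plan(sql):
    """Return the 'detail' text of EXPLAIN QUERY PLAN (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Only Z and K are requested, so the index alone can answer the query.
narrow = plan("SELECT Z, K FROM T WHERE Z >= 0.5 ORDER BY Z LIMIT 25")
# SELECT * also needs payload, so the table rows must be visited as well.
wide = plan("SELECT * FROM T WHERE Z >= 0.5 ORDER BY Z LIMIT 25")

print(narrow)
print(wide)
```

The narrow query should be reported as using a covering index, while the SELECT * variant cannot be covered because payload is not part of the index.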

Does clustered index sort order have impact on performance

If a PK of a table is a standard auto-increment int (Id) and the retrieved and updated records are almost always the ones closer to the max Id will it make any difference performance-wise whether the PK clustered index is sorted as ascending or descending?
When such PK is created, SSMS by default sets the sort order of the index as ascending and since the rows most accessed are always the ones closer to the current max Id, I'm wondering if changing the sorting to descending would speed up the retrieval since the records will be sorted top-down instead of bottom-up and the records close to the top are accessed most frequently.
I don't think there will be any performance difference. The engine searches the index tree for the key and then reads the specific data block with that key. Either way, that search has O(log N) complexity. For a separate index that would be O(log N) plus one extra block access; since this is a clustered index, the table records themselves are physically ordered by the key instead of living behind a separate index page/block, so the lookup is simply O(log N).
Indexes use a B-tree structure, so no. But if you have an index that is based on multiple columns, you want the most distinct columns on the outer level, and the least distinct on the inner levels. For example, if you had 2 columns (gender and age), you would want age on the outer and gender on the inner, because there are only 2 possible genders, whereas there are many more ages. This will impact performance.

What is the most efficient way to count rows in a table in SQLite?

I've always just used "SELECT COUNT(1) FROM X" but perhaps this is not the most efficient. Any thoughts? Other options include SELECT COUNT(*) or perhaps getting the last inserted id if it is auto-incremented (and never deleted).
How about if I just want to know if there is anything in the table at all? (e.g., count > 0?)
The best way is to make sure that you run SELECT COUNT on a single column (SELECT COUNT(*) is slower) - but SELECT COUNT will always be the fastest way to get a count of things (the database optimizes the query internally).
If you check out the comments below, you can see arguments for why SELECT COUNT(1) is probably your best option.
To follow up on girasquid's answer, as a data point, I have a sqlite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried using SELECT rowid FROM table (thinking that rowid is a default primary indexed key), but that was no faster. Then I made an index on one of the fields in the database (just an arbitrary field, but I chose an integer field because I knew from past experience that indexes on short fields can be very fast, I think because the index stores a copy of the value in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.
If you are sure (really sure) that you've never deleted any row from that table and your table has not been defined with the WITHOUT ROWID optimization you can have the number of rows by calling:
select max(RowId) from table;
Or if your table is a circular queue you could use something like
select MaxRowId - MinRowId + 1 from
(select max(RowId) as MaxRowId from table) JOIN
(select min(RowId) as MinRowId from table);
This is really really fast (milliseconds), but you must pay attention because sqlite says that row id is unique among all rows in the same table. SQLite does not declare that the row ids are and will be always consecutive numbers.
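The caveat about deleted rows is easy to reproduce. The following is a minimal sketch using Python's sqlite3 module with a made-up items table; it shows max(RowId) and COUNT(*) disagreeing after a single delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item{i}",) for i in range(10)],
)
# Rowids 1..10 were assigned; remove one from the middle.
conn.execute("DELETE FROM items WHERE rowid = 5")

row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
max_rowid = conn.execute("SELECT MAX(rowid) FROM items").fetchone()[0]

print(row_count)  # 9
print(max_rowid)  # 10
```

So max(RowId) is only a valid row count as long as the rowids really are 1 to n with no gaps.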
The fastest way to get row counts is directly from the table metadata, if any. Unfortunately, I can't find a reference for this kind of data being available in SQLite.
Failing that, any query of the type
SELECT COUNT(non-NULL constant value) FROM table
should optimize to avoid the need for a table, or even an index, scan. Ideally the engine will simply return the current number of rows known to be in the table from internal metadata. Failing that, it simply needs to know the number of entries in the index of any non-NULL column (the primary key index being the first place to look).
As soon as you introduce a column into the SELECT COUNT you are asking the engine to perform at least an index scan and possibly a table scan, and that will be slower.
I do not believe you will find a special method for this. However, you could do your select count on the primary key to be a little bit faster.
sp_spaceused 'table_name' (exclude single quote)
In SQL Server, this will return the number of rows in the above table; it is the most efficient way I have come across yet.
It's more efficient than select Count(1) from 'table_name' (exclude single quote)
sp_spaceused can be used for any table. It's very helpful when the table is exceptionally big (hundreds of millions of rows): it returns the number of rows right away, whereas 'select Count(1)' might take more than 10 seconds. Moreover, it does not need any column names or key fields.