SQL Server: WHERE clause performance theory?

If TableA has 1,000,000 billion rows (although that is impossible in a real environment) and the primary key is an increasing identity, then the maximum key value is 1 million billion. Now I use a random number to match it in the WHERE clause, like below:
Select *
from TableA a
where a.PrimaryKey = [Random Number from 1 to 1 million Billion]
Even if the SELECT statement executes 100 times, I found it is still very fast in SQL Server.
I had assumed that if the random number is large, then a query like SELECT * FROM TableA WHERE PrimaryKey = 1000000 would have to compare against all the previous records to see whether that number matches the primary key column, so the performance would be very poor.

It's fast because the primary key is indexed, which means the lookup of a row based on the primary key takes O(log N) time ( https://en.wikipedia.org/wiki/Big_O_notation ) if you're using a B-tree based index, which is the default in most database systems.
It's further helped by the fact that the primary key is usually the clustered index too, which is the index that defines the physical order of rows on disk (this is a gross over-simplification, but you get the idea).
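To make this concrete, here is a minimal sketch (the table and column names are illustrative, not taken from the question) showing that the lookup is resolved as a seek rather than a scan:
CREATE TABLE TableA
(
    PrimaryKey bigint IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_TableA PRIMARY KEY CLUSTERED,
    Payload    varchar(100) NULL
);

-- The execution plan for this query shows a Clustered Index Seek:
-- SQL Server walks the few levels of the B-tree instead of comparing
-- every earlier row against the search value.
SELECT *
FROM TableA AS a
WHERE a.PrimaryKey = 123456789;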

The primary key is a key. That is, the database maintains a unique index that locates every record by its key value.
The index is usually some variation of a B-tree, which for such a large dataset would probably have a depth of about 10 to 20 levels. So accessing any record via the key involves at most that many IOs, and probably far fewer, as large parts of the B-tree will be cached in memory.

Related

Is it advised to index the field if I envision retrieving all records corresponding to positive values in that field?

I have a table with definition somewhat like the following:
create table offset_table (
id serial primary key,
"offset" numeric NOT NULL,
... other fields...
);
The table has about 70 million rows in it.
I envision doing the following query many times
select * from offset_table where "offset" > 0;
For speed issues, I am wondering whether it would be advised to create an index like:
create index on offset_table("offset");
I am trying to avoid creation of unnecessary indices on this table as it is pretty big already.
As you mentioned in the comments, roughly 70% of the rows match the offset > 0 predicate.
In that case the index would not be beneficial, since PostgreSQL (and basically every other DBMS) would prefer a full table scan instead. That happens because a full scan is faster than constantly jumping between reading the index sequentially and reading the table in random order.
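If in doubt, you can let the planner tell you instead of guessing. A quick sketch, assuming the table above (with the reserved word offset quoted):
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM offset_table WHERE "offset" > 0;
-- With ~70% of rows matching, the plan will normally show a Seq Scan even if
-- an index on "offset" exists; a low-selectivity predicate does not benefit
-- from a plain B-tree index.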

Does clustered index sort order have an impact on performance?

If a PK of a table is a standard auto-increment int (Id) and the retrieved and updated records are almost always the ones closer to the max Id will it make any difference performance-wise whether the PK clustered index is sorted as ascending or descending?
When such PK is created, SSMS by default sets the sort order of the index as ascending and since the rows most accessed are always the ones closer to the current max Id, I'm wondering if changing the sorting to descending would speed up the retrieval since the records will be sorted top-down instead of bottom-up and the records close to the top are accessed most frequently.
I don't think there will be any performance hit. Either way, SQL Server performs a search down the index for the key and then reads the specific data block with that key, and that search has O(log N) complexity. So in total it is O(log N) + 1, and since it is a clustered index it is effectively just O(log N), because the table records are physically ordered rather than reached through a separate index page/block.
Indexes use a B-tree structure, so no. But if you have an index based on multiple columns, you want the most distinct column at the outer (leading) level and the least distinct at the inner levels. For example, if you had two columns (gender and age), you would want age on the outer and gender on the inner, because there are only two possible genders, whereas there are many more ages. This will impact performance.
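As an illustration of that ordering rule (the table and index names here are hypothetical):
-- Put the column with many distinct values (Age) first and the
-- low-cardinality column (Gender) last in the composite key.
CREATE NONCLUSTERED INDEX IX_Person_Age_Gender
    ON Person (Age, Gender);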

Fast search in a 10 million record table with a unique index column in SQL Server 2008 R2 on Windows 7

I need to do a fast search on a column of floating-point numbers in a table in SQL Server 2008 R2 on Windows 7.
The table has 10 million records.
e.g.
Id value
532 937598.32421
873 501223.3452
741 9797327.231
ID is the primary key. I need to do a search on the "value" column for a given value, such that I can find the 5 closest points to the given point in the table.
Closeness is defined as the absolute value of the difference between the given value and the column value.
The smaller the difference, the closer.
I would like to use binary search.
I want to set a unique index on the value column.
But I am not sure whether the table will be sorted every time I search for a given value in that column?
Or does it only sort the table once because I have set a unique index on the value column?
Are there better ways to do this search?
Will a sort have to be done whenever I do a search? I need to search the table many times. I know sorting takes O(n lg n). Can an index really do the sorting for me, or is the index associated with a sorted tree that holds the column values?
Once an index is set up, are the values already sorted, so I do not need to sort them every time I do a search?
Any help would be appreciated.
thanks
Sorry for my initial response. No, I would not even create an index; SQL Server won't be able to use it, because you're searching not on a given value but on the difference between that given value and the value column of the table. You could create a function-based index, but you would have to specify the number you're searching on, which is not constant.
Given that, I would look at getting enough RAM to swallow the whole table, i.e. if the table is 10 GB, try to get 10 GB of RAM allocated for caching. And if possible run it on a machine with an SSD, or get an SSD.
The sql itself is not complicated, it's really just an issue of performance.
select top 5 id, abs(99 - val) as diff
from tbl
order by 2
If you don't mind some trial and error, you could create an index on the value column, and then search as follows -
select top 5 id, abs(99 - val) as diff
from tbl
where val between 99-30 and 99+30
order by 2
The above query WOULD utilize the index on the value column, because it is searching on a range of values in the value column, not the differences between the values in that column and X (2 very different things)
However, there is no guarantee it would return 5 rows, it would only return 5 rows if there actually existed 5 rows within 30 of 99 (69 to 129). If it returned 2, 3, etc. but not 5, you would have to run the query again and expand the range, and keep doing so until you determine your top 5. However, these queries should run quite a bit faster than having no index and firing against the table blind. So you could give it a shot. The index may take a while to create though, so you might want to do that part overnight.
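A rough sketch of that retry loop, using the same hypothetical tbl/val names and a starting range of 30 (the index name and sample values are assumptions for illustration):
CREATE INDEX IX_tbl_val ON tbl (val);

DECLARE @search float = 99, @range float = 30, @found int = 0;

-- Keep doubling the window until at least 5 rows fall inside it.
WHILE @found < 5
BEGIN
    SELECT @found = COUNT(*)
    FROM tbl
    WHERE val BETWEEN @search - @range AND @search + @range;

    IF @found < 5
        SET @range = @range * 2;
END;

-- The final ranged query can use the index on val.
SELECT TOP 5 id, ABS(@search - val) AS diff
FROM tbl
WHERE val BETWEEN @search - @range AND @search + @range
ORDER BY diff;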
You mention SQL Server and binary search. SQL Server does not work that way, but SQL Server (or another database) is a good solution for this problem.
Just to be concrete, I will assume
create table mytable
(
    id int not null,
    value float not null,
    constraint mytable_pk primary key(id)
)
And you need an index on the value field.
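For completeness, here is a sketch of that index plus a sample search parameter (the index name and the sample value are illustrative):
CREATE INDEX IX_mytable_value ON mytable (value);

DECLARE @searchval float = 937598.5;   -- the value whose 5 nearest neighbours we want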
Now get ten rows, 5 above and 5 below the search value, with these 2 selects:
SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
FROM mytable
WHERE value >= @searchval
ORDER BY value ASC
-- and
SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
FROM mytable
WHERE value < @searchval
ORDER BY value DESC
To combine the 2 selects into 1 result set, use a union:
SELECT *
FROM (SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
      FROM mytable
      WHERE value >= @searchval
      ORDER BY value ASC) AS bigger
UNION ALL
SELECT *
FROM (SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
      FROM mytable
      WHERE value < @searchval
      ORDER BY value DESC) AS smaller
But since you only want the smallest 5 differences, wrap with one more layer as
SELECT TOP 5 *
FROM
(
    SELECT *
    FROM (SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
          FROM mytable
          WHERE value >= @searchval
          ORDER BY value ASC) AS bigger
    UNION ALL
    SELECT *
    FROM (SELECT TOP 5 id, value, ABS(value - @searchval) AS diff
          FROM mytable
          WHERE value < @searchval
          ORDER BY value DESC) AS smaller
) AS candidates
ORDER BY diff ASC
I have not tested any of this.
Creating the table's clustered index upon [value] will cause [value]'s values to be stored on disk in sorted order. The table's primary key (perhaps already defined on [Id]) might already be defined as the table's clustered index. There can only be one clustered index on a table. If a primary key on [Id] is already clustered, the primary key will need to be dropped, the clustered index on [value] will need to be created, and then the primary key on [Id] can be recreated (as a nonclustered primary key). A clustered index upon [value] should improve performance of this specific statement, but you must ultimately test all varieties of T-SQL that will reference this table before making the final choice about this table's most useful clustered index column(s).
Because the FLOAT data type is imprecise (subject to your system's FPU and its floating-point rounding and truncation errors, while still in accordance with IEEE 754's specifications), it can be a fatal mistake to assume every [value] will be unique, even when the decimal number (being inserted into FLOAT) appears (in decimal) to be unique. Irrational numbers must always be truncated and rounded: in decimal, PI is an example of an irrational value, which can be truncated and rounded to an imprecise 3.142. Similarly, the decimal number 0.1 has no exact finite representation in binary, which means FLOAT will not store decimal 0.1 as a precise binary value. You might want to consider whether the domain of acceptable values offered by the NUMERIC data type can accommodate [value] (thus gaining more precise answers when compared to a use of FLOAT).
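A small illustration of that rounding behaviour (a sketch; the literals are arbitrary):
-- 0.1 cannot be represented exactly as a binary FLOAT, so accumulated
-- arithmetic drifts away from the exact decimal result, while NUMERIC stays exact.
SELECT CASE WHEN CAST(0.1 AS float) * 3 = CAST(0.3 AS float)
            THEN 'equal' ELSE 'not equal' END AS float_comparison,    -- 'not equal'
       CASE WHEN CAST(0.1 AS numeric(10, 1)) * 3 = CAST(0.3 AS numeric(10, 1))
            THEN 'equal' ELSE 'not equal' END AS numeric_comparison;  -- 'equal'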
While a NUMERIC data type might require more storage space than FLOAT, the performance of a given query is often controlled by the number of levels in the (perhaps clustered) index's B-tree (assuming an index seek can be harnessed by the query, which for your specific need is a safe assumption). A NUMERIC data type with a precision greater than 28 requires 17 bytes of storage per value. The payload of SQL Server's 8KB page is approximately 8,000 bytes, so such a NUMERIC column stores roughly 470 values per page. The leaf level of an index on 10,000,000 such values therefore needs about 10,000,000 / 470 ≈ 21,300 pages, and because each level above the leaf also fans out by roughly 470 entries per page, the whole tree needs only about ceil(log470(10,000,000)) ≈ 3 levels (back-of-napkin math, but close enough). Thus searching for a specific value in a 10,000,000-row table requires only ~3 page reads, i.e. ~24 KB of I/O. If a clustered index is created, its leaf-level page will contain the other NUMERIC values that are "close" to the one being sought. Since that leaf-level page (and the couple of index pages above it) are now cached in SQL Server's buffer pool and are "hot", the next search (for values that are "close" to the value being sought) is likely to be constrained by memory access speed rather than disk access speed. This is why a clustered index can enhance performance for your desired statement.
If [value]'s values are not unique (perhaps due to floating-point truncation and rounding errors), and if [value] has been defined as the table's clustered index, SQL Server will (under the covers) add a 4-byte "uniquifier" to each duplicate value. A uniquifier adds overhead (per the above math, less overhead than might be thought when an index can be harnessed). That overhead is another (albeit less important) reason to test. If values can instead be stored as NUMERIC, and if a use of NUMERIC would more precisely ensure the persisted decimal values are indeed unique (just the way they look, in decimal), that 4-byte overhead can be eliminated by also declaring the clustered index as unique (assuming value uniqueness is a business need). Using similar math, I am certain you will discover the index levels for a FLOAT data type are not all that different from NUMERIC: an index B-tree's exponential fan-out is "the great leveler" :). Choosing FLOAT because it has a smaller storage footprint than NUMERIC may not be as beneficial as initially thought (even though the table as a whole does need considerably more storage space with NUMERIC).
You should also consider/test whether a Columnstore index would enhance performance and suit your business needs.
This is a common request coming from my clients.
It's better if you transform your float column into two integer columns (one for each part of the floating point number), and put the appropriate index on them for fast searching. For example: 12345.678 will become two columns 12345 and 678.

Why is there a non-clustered index scan when counting all rows in a table?

As far as I understand it, each transaction sees its own version of the database, so the system cannot get the total number of rows from some counter and thus needs to scan an index. But I thought it would be the clustered index on the primary key, not the additional indexes. If I had more than one additional index, which one will be chosen, anyway?
When digging into the matter, I've noticed another strange thing. Suppose there are two identical tables, Articles and Articles2, each with three columns: Id, View_Count, and Title. The first has only a clustered PK-based index, while the second one has an additional non-clustered, non-unique index on view_count. The query SELECT COUNT(1) FROM Articles runs 2 times faster for the table with the additional index.
SQL Server will optimize your query - if it needs to count the rows in a table, it will choose the smallest possible set of data to do so.
So if you consider your clustered index - it contains the actual data pages - possibly several thousand bytes per row. To load all those bytes just to count the rows would be wasteful - even just in terms of disk I/O.
Therefore, if there is a non-clustered index that's not filtered or restricted in any way, SQL Server will pick that data structure to count from, since the non-clustered index basically contains just the columns you've put into the NC index (plus the clustered index key) - much less data to load just to count the number of rows.
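A quick way to see this for yourself (the index name is hypothetical; Articles is the table from the question):
CREATE NONCLUSTERED INDEX IX_Articles_View_Count ON Articles (View_Count);

SET STATISTICS IO ON;
SELECT COUNT(1) FROM Articles;
-- The plan typically shows a scan of IX_Articles_View_Count (the narrowest
-- unfiltered index), and STATISTICS IO reports far fewer logical reads than
-- a clustered index scan would need.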

Should I create a unique clustered index, or non-unique clustered index on this SQL 2005 table?

I have a table storing millions of rows. It looks something like this:
Table_Docs
ID, Bigint (Identity col)
OutputFileID, int
Sequence, int
…(many other fields)
We find ourselves in a situation where the developer who designed it made the OutputFileID the clustered index. It is not unique. There can be thousands of records with this ID. It has no benefit to any processes using this table, so we plan to remove it.
The question is what to change it to… I have two candidates: the ID identity column is a natural choice. However, we have a process which does a lot of update commands on this table, and it uses the Sequence to do so. The Sequence is non-unique; most records only contain one, but about 20% can have two or more records with the same Sequence.
The INSERT app is a VB6 piece of crud throwing thousands of insert commands at the table. The inserted values are never in any particular order, so the Sequence of one insert may be 12345, and the next could be 12245. I know that this could cause SQL to move a lot of data to keep the clustered index in order. However, the Sequences of the inserts are generally close to being in order; all inserts would take place at the end of the clustered table. E.g.: I have 5 million records with Sequence spanning 1 to 5 million, and the INSERT app will be inserting Sequences at the end of that range at any given time. Reordering of the data should be minimal (tens of thousands of records at most).
Now, the UPDATE app is our .NET star. It does all UPDATEs on the Sequence column: "Update Table_Docs Set Field1=This, Field2=That…WHERE Sequence=12345" - hundreds of thousands of these a day. The UPDATEs are completely and totally random, touching all points of the table.
All other processes are simply doing SELECT’s on this (Web pages). Regular indexes cover those.
So my question is, what’s better….a unique clustered index on the ID column, benefiting the INSERT app, or a non-unique clustered index on the Sequence, benefiting the UPDATE app?
First off, I would definitely recommend to have a clustered index!
Secondly, your clustered index should be:
narrow
static (never or hardly ever change)
unique
ever-increasing
so an INT IDENTITY is a very well thought out choice.
When your clustering key is not unique, SQL Server will add a 4-byte uniqueifier to those column values - thus making your clustering key and with it all non-clustered indices on that table larger and less optimal.
So in your case, I would pick the ID - it's narrow, static, unique and ever-increasing - can't be more optimal than that! Since the Sequence is used heavily in UPDATE statements, definitely put a non-clustered index on it, too!
See Kimberly Tripp's excellent blog posts on choosing the right clustering key for great background info on the topic.
As a general rule, you want your clustered index to be unique. If it is not, SQL Server will in fact add a hidden "uniquifier" to it to force it to be unique, and this adds overhead.
So, you are probably best using the ID column as your index.
Just as a side note, using an identity column as your primary key is normally referred to as a surrogate key, since it is not inherent in your data. When you have a unique natural key available, that is probably a better choice. In this case it looks like you do not, so using the unique surrogate key makes sense.
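A sketch of what that change could look like (the existing clustered index name and the new constraint/index names are assumptions):
-- Drop the existing non-unique clustered index on OutputFileID (name assumed).
DROP INDEX IX_Table_Docs_OutputFileID ON Table_Docs;

-- Recreate the primary key as the clustered index on the identity column.
ALTER TABLE Table_Docs
    ADD CONSTRAINT PK_Table_Docs PRIMARY KEY CLUSTERED (ID);

-- Support the UPDATE ... WHERE Sequence = ... workload with a non-clustered index.
CREATE NONCLUSTERED INDEX IX_Table_Docs_Sequence ON Table_Docs (Sequence);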
The worst thing about the inserts out of order is page splits.
When SQL Server needs to insert a new record into an existing index page and finds no place there, it takes half the records from the page and moves them into a new one.
Say, you have these records filling the whole page:
1 2 3 4 5 6 7 8 9
and need to insert a 10. In this case, SQL Server will just start the new page.
However, if you have this:
1 2 3 4 5 6 7 8 11
, 10 should go before 11. In this case, SQL Server will move roughly half of the records into a new page and insert 10 there:
6 7 8 10 11
The old page, as can easily be seen, will remain only about half filled (only records 1 to 5 stay there).
This will increase the index size.
Let's create two sample tables:
CREATE TABLE perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
CREATE TABLE almost_perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
;
WITH q(num) AS
(
SELECT 1
UNION ALL
SELECT num + 1
FROM q
WHERE num < 200000
)
INSERT
INTO perfect
SELECT num, REPLICATE('*', 300)
FROM q
OPTION (MAXRECURSION 0)
;
WITH q(num) AS
(
SELECT 1
UNION ALL
SELECT num + 1
FROM q
WHERE num < 200000
)
INSERT
INTO almost_perfect
SELECT num + CASE num % 5 WHEN 0 THEN 2 WHEN 1 THEN 0 ELSE 1 END, REPLICATE('*', 300)
FROM q
OPTION (MAXRECURSION 0)
EXEC sp_spaceused N'perfect'
EXEC sp_spaceused N'almost_perfect'
name            rows     reserved    data        index_size   unused
perfect         200000   66960 KB    66672 KB    264 KB       24 KB
almost_perfect  200000   128528 KB   128000 KB   496 KB       32 KB
Even with only 20% probability of the records being out of order, the table becomes twice as large.
On the other hand, having the clustered key on Sequence would roughly halve the I/O for the updates (since each update can then be done with a single clustered index seek rather than a nonclustered index seek followed by a key lookup).
So I'd take a sample subset of your data, insert it into the test table with a clustered index on Sequence and measure the resulting table size.
If it is less than twice the size of the same table with an index on ID, I'd go for the clustered index on Sequence (since the total resulting I/O will be less).
If you decide to create a clustered index on Sequence, make ID a nonclustered PRIMARY KEY and make the clustered index UNIQUE on (Sequence, ID). This will use the meaningful ID instead of an opaque uniquifier.
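A sketch of that alternative, under the same naming assumptions as above:
-- ID stays the primary key, but non-clustered.
ALTER TABLE Table_Docs
    ADD CONSTRAINT PK_Table_Docs PRIMARY KEY NONCLUSTERED (ID);

-- The clustered index is made unique by appending ID to Sequence,
-- which avoids the hidden 4-byte uniquifier.
CREATE UNIQUE CLUSTERED INDEX IX_Table_Docs_Sequence_ID
    ON Table_Docs (Sequence, ID);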