Optimization for ever increasing key in SQL


The problem itself is quite simple:
I have a large number of timestamp and value pairs.
The timestamps, with a few exceptions (<1%), are ever increasing and unique.
I use the timestamp as the clustered index.
How can I make the DB system try to insert each value at the end first, and only if that fails (the value does not belong at the rightmost, largest-key side of the B-tree, which can be checked in constant time) do the binary search for the correct placement?
Target system: MSSQL 2016 or 2017

Either you want timestamp to be a clustered index or you do not. There is no "half-way" clustered index.
So, if you want it clustered, then leave extra space on each page in case a new value gets inserted later. You can control this with the fill factor (the FILLFACTOR index option). This allows a clustered index to insert values that are not at the end more efficiently.
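As a rough illustration of the fill factor approach, the following T-SQL sketch creates a clustered index that leaves some free space on each page. The table name, column names, data types and the value 90 are all assumptions made for the example, not taken from the question.

```sql
-- Hypothetical table: names and types are illustrative only.
CREATE TABLE dbo.Measurement
(
    SampleTime  datetime2(7) NOT NULL,
    SampleValue float        NOT NULL
);

-- Non-unique clustered index, because a small share of timestamps may repeat.
-- FILLFACTOR = 90 leaves roughly 10% free space on each leaf page when the
-- index is created or rebuilt; PAD_INDEX = ON applies the same setting to
-- the intermediate levels of the B-tree.
CREATE CLUSTERED INDEX CIX_Measurement_SampleTime
    ON dbo.Measurement (SampleTime)
    WITH (FILLFACTOR = 90, PAD_INDEX = ON);
```

Note that the fill factor is only applied when the index is created or rebuilt; it is not maintained as rows are inserted, so the free space is gradually used up until the next rebuild.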
If you don't want a clustered index on timestamp, then use an identity column to identify each row. This will ensure that rows are only inserted at the "end" (i.e. last page) of the table, making inserts more efficient. You can still have a regular index on timestamp for efficient access.
Actually, I prefer the second method. I would be concerned about duplicates in timestamp, and I prefer having a clustered index that uniquely identifies each row.
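A minimal sketch of this second method might look like the following. It is an alternative layout for a hypothetical measurement table; the names and data types are assumptions for illustration.

```sql
-- Hypothetical table: the identity column is the clustered primary key,
-- so new rows always go to the last page of the table.
CREATE TABLE dbo.Measurement
(
    MeasurementId bigint IDENTITY(1, 1) NOT NULL,
    SampleTime    datetime2(7)          NOT NULL,
    SampleValue   float                 NOT NULL,
    CONSTRAINT PK_Measurement PRIMARY KEY CLUSTERED (MeasurementId)
);

-- A regular (non-clustered, non-unique) index still gives efficient
-- access by timestamp and tolerates the occasional duplicate value.
CREATE NONCLUSTERED INDEX IX_Measurement_SampleTime
    ON dbo.Measurement (SampleTime);
```

The out-of-order timestamps then only cause mid-index inserts in the smaller non-clustered index, not in the table itself.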

Related

SQL Server - can GUID be a good choice as part of a clustered index?

I have a large domain set of tables in a database - over 100 tables. Every single one uses a uniqueidentifier as a PK.
I'm realizing now that my mistake is that these are also, by default, the clustered index.
Consider a table with this type of structure:
Orders
Id (uniqueidentifier) Primary Key
UserId (uniqueidentifier)
.
.
.
.
Other columns
Most queries are going to be something like "Get top 10 orders for user X sorted by OrderDate".
In this case, would it make sense to create a clustered index on (UserId, Id), so that the data is physically stored sorted by UserId?
I'm not too concerned about Inserts and Updates - those will be few enough that performance loss there isn't a big deal. I'm mostly concerned with READs.

A clustered index means that data is physically stored in the order of the values. By default, the primary key is used for the clustered index.
The problem with GUIDs is that they are generated in (essentially) random order. That means that inserts happen "in the middle" of the table, and such inserts result in fragmentation.
Without getting into database internals, this is a little hard to explain. But what it means is that inserts require much more work than just inserting the values "at the end" of the table, because new rows go in the middle of a data page so the other rows have to be moved around.
SQL Server offers a solution for this, newsequentialid(). On a given server, this returns a sequential value which is inserted at the end. Often, this is an excellent compromise if you have to use GUIDs.
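To make the newsequentialid() suggestion concrete, here is a small sketch. The Orders columns follow the question; the data types, constraint names and the sample insert are assumptions for illustration.

```sql
-- NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a
-- uniqueidentifier column, so the key is generated by the server on insert.
CREATE TABLE dbo.Orders
(
    Id        uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_Id DEFAULT NEWSEQUENTIALID(),
    UserId    uniqueidentifier NOT NULL,
    OrderDate datetime2(0)     NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (Id)
);

-- Id is omitted so that the default supplies the next sequential GUID.
-- NEWID() here only stands in for a real user id.
INSERT INTO dbo.Orders (UserId, OrderDate)
VALUES (NEWID(), SYSUTCDATETIME());
```

This only helps if the application lets the database generate the key; GUIDs created on the client are still effectively random.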
That said, I have a preference for just plain old ints as ids -- identity columns. These are smaller, so they take up less space. This is particularly true for indexes. Inserts work well because new values go at the "end" of the table. I also find integers easier to work with visually.
Using identity columns for primary keys and foreign key references still allows you to have unique GUID columns for each identity, if that is a requirement for the database (say for interfacing to other applications).
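One possible shape for this, assuming hypothetical column and constraint names, is an int identity as the clustered primary key with the GUID kept as a separate unique column:

```sql
-- Hypothetical variant of the Orders table: a compact int identity is the
-- clustered primary key, and the GUID survives as a unique, non-clustered
-- column for other applications to reference.
CREATE TABLE dbo.Orders
(
    OrderId   int IDENTITY(1, 1) NOT NULL,
    PublicId  uniqueidentifier   NOT NULL
        CONSTRAINT DF_Orders_PublicId DEFAULT NEWID(),
    OrderDate datetime2(0)       NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId),
    CONSTRAINT UQ_Orders_PublicId UNIQUE NONCLUSTERED (PublicId)
);
```

Foreign keys in other tables would then reference the 4-byte OrderId rather than the 16-byte GUID.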

A clustered index is for when you want to retrieve rows for a range of values of a given column. As the data is physically arranged in that order, the rows can be extracted very efficiently.
A GUID, while excellent for a primary key, could be positively detrimental to performance as a clustering key, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on a GUID.

Clustered index on foreign key or primary key?

I have a table Item with autoinc int primary key Id and a foreign key UserId.
And I have a table User with autoinc int primary key Id.
Default is that the index for Item.Id gets clustered.
I will mostly query items on user-id so my question is: Would it be better to set the UserId foreign key index to be clustered instead?

Having the clustered index on the identity field has the advantage that the records will be stored in the order that they are created. New records are added at the end of the table.
If you use the foreign key as clustered index, the records will be stored in that order instead. When you create new records the data will be fragmented as records are inserted in the middle, which can reduce performance.
If you want an index on the foreign key, then just add a non-clustered index for it.
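A minimal sketch of that layout, using the table and column names from the question (constraint and index names are assumptions):

```sql
CREATE TABLE dbo.[User]
(
    Id int IDENTITY(1, 1) NOT NULL,
    CONSTRAINT PK_User PRIMARY KEY CLUSTERED (Id)
);

-- Item stays clustered on its identity column, so new rows are appended.
CREATE TABLE dbo.Item
(
    Id     int IDENTITY(1, 1) NOT NULL,
    UserId int                NOT NULL,
    CONSTRAINT PK_Item PRIMARY KEY CLUSTERED (Id),
    CONSTRAINT FK_Item_User FOREIGN KEY (UserId) REFERENCES dbo.[User] (Id)
);

-- Separate non-clustered index to support lookups by user.
CREATE NONCLUSTERED INDEX IX_Item_UserId
    ON dbo.Item (UserId);
```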

The answer depends only on the usage scenario. For example, Guffa says that the data will be fragmented. That's wrong. If your queries depend mostly on UserId, then data clustered by ItemId is what is fragmented for you, because the items for the same user may be spread over a lot of pages.
Of course, compared to a sequential ItemId (if it is sequential in your schema), using UserId as the clustered key can cause page splits while inserting. That is two additional page writes at most. But when you're selecting by some user, his items may be fragmented over tens of pages (depending on items per user, item size, insertion strategy, etc.), and therefore require a lot of page reads. If you have a lot of such selects per single insert (a very common web/OLAP scenario), you can face hundreds of I/O operations compared to the few spent on page splitting. That is what the clustered index was created for, not only for clustering by surrogate IDs.
So there is no clear answer as to whether clustering on UserId is good or bad in your case, because it depends heavily on context. What is the ratio between select and insert operations? How fragmented are a user's items if clustered by ItemId? How many additional indexes are on the table? The last matters because there is a pitfall (below) in SQL Server.
As you might know, a clustered index needs unique key values (if the key is not declared unique, SQL Server adds a hidden uniquifier to duplicate keys). This is not a big problem, because you can create the index on the pair (UserId, ItemId). The clustered index isn't stored on disk separately from the data, so it doesn't matter how many fields it has. But non-clustered indexes store the clustered index key values in their leaves. So if you have a clustered index on UserId+ItemId (let's imagine both are of type [int], so the key is 8 bytes) and a non-clustered index on ItemId, then this index will be twice the size (8 bytes per B-tree leaf entry) compared to having just ItemId as the clustered index (4 bytes per leaf entry).
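The (UserId, ItemId) clustering described above could be sketched as follows. Here the question's Item.Id plays the role of ItemId, and the constraint and index names are assumptions.

```sql
-- The primary key stays on Id but is declared NONCLUSTERED, which frees
-- the single clustered index slot for the (UserId, Id) pair.
CREATE TABLE dbo.Item
(
    Id     int IDENTITY(1, 1) NOT NULL,
    UserId int                NOT NULL,
    CONSTRAINT PK_Item PRIMARY KEY NONCLUSTERED (Id)
);

-- Rows for one user are now stored together. Adding Id makes the key
-- unique, so SQL Server does not need to add a uniquifier.
CREATE UNIQUE CLUSTERED INDEX CIX_Item_UserId_Id
    ON dbo.Item (UserId, Id);
```

The non-clustered primary key index on Id is the one that carries the wider clustering key in its leaves, which is the size pitfall mentioned above.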

In general, you want to cluster on the most frequently accessed index. But you're not required to have a clustering index at all. You (or your DBAs) need to evaluate things and weigh the advantages and disadvantages so as to choose the most appropriate indexing strategy.
If you cluster on a monotonic counter like an identity column, all new rows are going to be inserted at the end of the table: that means a "hot spot" is created that is likely to cause lock contention on inserts, since every SPID doing an insert is hitting the same data page.
Tables without a clustering index have their data pages organized as a heap, pretty much just a linked list of data pages.
SQL Server indexes are B-trees. For non-clustered indexes, the leaf nodes of the B-tree are pointers to the appropriate data page. That means if the index is used and doesn't cover the query's columns, an additional look aside has to be done to fetch the data page. That means additional I/O and paging.
Clustered indices are different: their leaf nodes are the data pages themselves, meaning the heap essentially goes away: a table scan means a traversal of the clustering index's B-tree. The advantage is that once you've found what you need in the clustered index, you already have the data page you need, thus avoiding the additional I/O that a seek on a non-clustered index is likely to require. The disadvantage, of course, is that the clustered index is larger, since it carries the entire table with it, so traversals of the clustered index are more expensive.
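One way to avoid the extra lookup from a non-clustered index is to make that index cover the query. The sketch below assumes a hypothetical Item table with Name and CreatedAt columns, which are not part of the question.

```sql
-- Hypothetical table for illustration.
CREATE TABLE dbo.Item
(
    Id        int IDENTITY(1, 1) NOT NULL,
    UserId    int                NOT NULL,
    Name      nvarchar(100)      NOT NULL,
    CreatedAt datetime2(0)       NOT NULL,
    CONSTRAINT PK_Item PRIMARY KEY CLUSTERED (Id)
);

-- INCLUDE stores Name and CreatedAt in the leaf level of the index, so a
-- query that touches only these columns never has to visit the data pages.
CREATE NONCLUSTERED INDEX IX_Item_UserId_Covering
    ON dbo.Item (UserId)
    INCLUDE (Name, CreatedAt);

-- This query can be answered from the non-clustered index alone.
SELECT Name, CreatedAt
FROM dbo.Item
WHERE UserId = 42;
```

The trade-off is a larger index and more work on every insert and update of the included columns.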

A clustered index is created on the primary key by default, so you can leave that as clustered and create a non-clustered index on the UserId column of Item. This will still be very fast, as the User.Id column will be the clustered index of the User table.

Possibly.
Is the item.user-id column a unique column within your item table? If not, you'd need to make this a clustered primary key by adding a second (possibly more) column to the key to make it unique; this may add overhead that you had not anticipated.
Are there any relationships with the item.id column? If so those may be important to the performance of your application so should be taken into account.
How often is the item.user-id value likely to change? If not at all that counts in its favour; the more often it's likely to be updated the worse, since that leads to fragmentation.
My recommendation would be to build your app with the regular item.id as the clustered key, then later, once you've got some data, try (in a test system using a copy of your production data) switching the clustered index and testing its impact; that way you can easily see real results rather than trying to guess among the multitude of possibilities. This avoids premature optimisation and ensures you make the correct choice.

SQL Server Indexing Doubts

Indexing is used to improve the performance of SQL queries, but I have always found it a little difficult to decide in which situations I should use an index and in which not. I want to clarify some of my doubts regarding non-clustered indexes.
1) What is the non-clustered index key? As the book says, each index row of a non-clustered index contains a non-clustered key value. Does that mean it is the column on which we created the non-clustered index? I.e., if I created an index on empname varchar(50), will the non-clustered key be that empname?
2) Why is it preferable to create an index on a column with a small width? Is it because comparisons on a wider column take the SQL Server engine more time, or because, the page size being fixed, a wider column means fewer index rows per page or node and so more levels of intermediate nodes?
3) If a table contains multiple non-clustered indexed columns, will the non-clustered key be a combination of all these columns, or is some unique id generated internally by SQL Server, with a locator that points to the actual data row? If possible, please clarify this with a real example and graphs.
4) Why is it said that a column with non-repeating values is a good candidate for an index? Even if it contains repeated values, an index should still improve performance, since once the search reaches a certain key value, its locator will immediately find the actual row.
5) If the column used in the index is not unique, how does it find the actual data row in the table?
Please recommend any book or tutorial that would be useful to clear up my doubts.

First I think we need to cover what an actual index is. Usually in an RDBMS, indexes are implemented using a variant of B-trees (the B+ variant is most common). To put it shortly: think of a binary search tree optimized for being stored on a disk. The result of looking up a key in the B-tree is usually the primary key of the table. That means if a lookup in the index completes and we need more data than what is present in the index, we can do a seek in the table using the primary key.
Please remember that when we think of performance for an RDBMS we usually measure this in disk accesses (I decide to ignore locking and other issues here) and not so much CPU time.
Having the index being non-clustered means that the actual way the data in the table is stored has no relation to the index key - whereas a clustered index specifies that the data in the table will be sorted (or clustered by) the index key - this is why there can only be one clustered index per table.
2) Back to our model of measuring performance - if the index key has a small width (fits into a small number of bytes), it means that per block of disk data we retrieve we can fit more keys - and as such perform lookups in the B-tree much faster if you measure disk I/O.
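The effect of key width can be estimated with some back-of-the-envelope arithmetic, sketched here in T-SQL. The 8096 usable bytes per page is what remains of an 8 KB SQL Server page after the 96-byte header; the per-row overhead of 11 bytes and the row count are illustrative assumptions, not exact engine figures.

```sql
DECLARE @page_bytes   int    = 8096;       -- 8192-byte page minus 96-byte header
DECLARE @row_overhead int    = 11;         -- ASSUMPTION: illustrative per-entry overhead
DECLARE @rows         bigint = 100000000;  -- ASSUMPTION: hypothetical table size

-- For each key width, estimate how many index entries fit on one page and
-- how many B-tree levels are needed to address @rows entries.
SELECT k.key_bytes,
       f.keys_per_page,
       CEILING(LOG(@rows) / LOG(f.keys_per_page)) AS approx_levels
FROM (VALUES (4), (8), (50)) AS k (key_bytes)
CROSS APPLY (SELECT @page_bytes / (k.key_bytes + @row_overhead) AS keys_per_page) AS f;
```

A narrower key yields more entries per page, and therefore a shallower tree and fewer page reads per lookup.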
3) I tried explaining this further up - unfortunately I don't really have any graphs or drawings to indicate this - hopefully someone else can come along and share these.
4) If you're running a query like so:
SELECT something, something_else FROM sometable t1 WHERE akey = 'some value'
On a table with an index defined like so:
CREATE INDEX idx_sometable_akey ON sometable(akey)
If sometable has a lot of rows where akey is equal to 'some value', this means a lot of lookups, both in the index and in the actual table, to retrieve the values of something and something_else. Whereas if there's a good chance that this filtering returns few rows, then it also means fewer disk accesses.
5) See earlier explanation
Hope this helps :)

SQL Db index recommendation

I am trying to see if using a custom index for a specific type of data might reduce fragmentation in my database.
[Edit: we are using MS SQL Server 2008 R2]
I have an SQL database containing timestamped measurement data. Lots of data is inserted all the time, but once inserted it practically never needs to be updated. These timestamps are, however, not unique, as several devices (around 50 of them) measure the data at the same time.
This means that every 50 rows in the table contain equal timestamp values. This data is received more or less simultaneously, although I could take additional care to ensure that rows are written as sequentially as possible (if that would help), perhaps by keeping them in memory for some time and then writing only when I get the data from all the devices for a single timestamp.
We are using NHibernate with Guid.Comb to avoid index lookups we would have with plain bigint IDs. As opposed to plain GUIDs, this should reduce fragmentation, but for so many inserts, fragmentation nevertheless happens very soon.
Since my data is timestamped, and data is inserted almost sequentially (increasing timestamps), I am wondering if there is a more clever way to create a primary key with a unique clustered index for this table. Timestamp column is basically a bigint number (.NET DateTime ticks).
I have also noticed that a non-clustered index over that same timestamp column also gets pretty fragmented. So what index strategy would you recommend to reduce heap fragmentation in this case?

Maybe take a look at this answer, HiLo looks interesting.
Also, maybe your fragmentation is not the result of a discrepancy between the ordering of the index values and the order in which they are added, but of the natural file growth effect (as explained here)?

A separate column for a key doesn't make a lot of sense for this table since you won't be updating any of the data. I imagine you'll be doing a lot of queries though, probably based on that timestamp column.
You could try making the primary key a combination of the timestamp column and a device id column. You could try making that clustered. That should allow you to write nearly as fast as possible. If you query by device however, you may need another index on device id and timestamp (the reverse). I wouldn't make the reverse the clustered one though, as that will make the writes happen all over the table rather than on the trailing pages. And if most queries involve a date range and more than one device, clustering on timestamp first should give you the best performance.
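A sketch of that suggestion, with assumed table, column and constraint names (the question only says the timestamp is a bigint holding .NET DateTime ticks and that there are around 50 devices):

```sql
-- Hypothetical measurement table. The clustered primary key leads with the
-- timestamp, so inserts land on the trailing pages of the table.
CREATE TABLE dbo.Measurement
(
    TimestampTicks bigint   NOT NULL,  -- .NET DateTime ticks, as in the question
    DeviceId       smallint NOT NULL,
    MeasuredValue  float    NOT NULL,
    CONSTRAINT PK_Measurement PRIMARY KEY CLUSTERED (TimestampTicks, DeviceId)
);

-- The "reverse" index, non-clustered, for queries that start from a device.
CREATE NONCLUSTERED INDEX IX_Measurement_Device_Time
    ON dbo.Measurement (DeviceId, TimestampTicks);
```

The composite key is unique as long as each device reports at most one row per timestamp, which is an assumption about the data.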

Is ROWID internally indexed unique by an SQL DBMS?

It's my understanding that the quickest way to access a particular row is by its ROWID. In INFORMIX-SE 7.3, when I do a SELECT ROWID FROM table I notice that its values are type SERIAL[INT]. In Oracle, they are SERIAL[HEX]. Has anyone ever used ROWID for any practical use? If I wanted to locate the most recent row added to a table, would SELECT MAX(ROWID) FROM table be quicker and more reliable than say SELECT MAX(pk_id) FROM table, where pk_id is a user-created SERIAL column? What other practical use have you ever put ROWID to work for you?

Your understanding is not necessarily correct. The ROWID property in SQL Server (strictly speaking, the ROWGUIDCOL property) is primarily intended for replication as a way to guarantee that the table has a single-field unique index value. This way the replication system does not have to account for any specific primary key semantics that your design might employ, while still being able to identify every row by a single column. No table is required to have such a column unless it is part of a merge replication publication, so it's not something that every table has, unlike Oracle. It also doesn't serve the same purpose (the values are GUIDs, or uniqueidentifier in T-SQL parlance, on SQL Server and are random, not sequential integers like they are on Oracle).
The quickest way to retrieve a row from a table is by accessing the row via the clustered index. A table can only have one clustered index, as it's what determines the physical ordering of the rows on the disk. Furthermore, if the table has a primary key, the primary key is by default the clustered index. While it's possible to declare a table without a primary key and assign the clustered index to something else, I can't (off the top of my head) fathom a reason why you'd want to do this (or, for practical purposes, how you can justify having a table without a primary key).
In short, that means that the quickest way to retrieve a row is by using the primary key of the table. Unless the ROWID column is the primary key (which is certainly possible to do), then it isn't the fastest way.

Well, I can only really tell how it works in Oracle, using it for 19+ years :-)
Put simply, ROWID is an internal identifier that acts like a physical address. It can be split into database file no., block no., and row no. So obtaining the ROWID lets the db engine look the data up in a single direct I/O.
In an index, the B*-tree will have ROWIDs on the leaf nodes pointing directly to the location of the data, e.g. in a primary index.
Being a physical address, it is subject to change on relocation on disk, which can happen after restoring a backup, rebuilding a table, or export/import of data.
The db engine can do some tricks, e.g. when moving a pluggable tablespace from one instance to another, to avoid rebuilding indexes; however, this is strictly db engine internals.
So to keep out of trouble leave the ROWID for internal use for the db engine. Storing the ROWID for your own usage will eventually lead to inconsistency.

In Informix-SE, the ROWID is basically the record number within the C-ISAM file that is used to hold the table. SE only deals in fixed size records, of course (no VARCHAR data).
In Informix Dynamic Server, the ROWID is (a) more complex (page number plus slot number) and (b) not always present (fragmented tables do not expose the ROWID, unless the table was created WITH ROWIDS, in which case the ROWID is a physical column that is indexed after all) - be aware!
If no data is ever deleted and you are using SE, then selecting the row with the maximum ROWID will give you the most recently added row. If a row is deleted, then that space will eventually be reused, and then the most recently added row ceases to be the one with the maximum ROWID. (IDS does not make that promise for a variety of complex reasons.)
The SE implementation of ROWID does not store the ROWID in the table, and does not create an index on it, but it does not need an index because it knows the formula for where to go to find the data (offset in data file = ROWID * RowSize), give or take a plus one on RowSize or a minus one ROWID or both.
As to practical use for ROWID, the style that was used before fragmentation was added to IDS was to select a list of ROWID values for the records of interest in the table, maintaining that list in memory:
SELECT ROWID
FROM InterestingTable
WHERE SomeColumn = xxx
AND AnotherColumn < yyy;
Then, the program could present these rows one at a time, fetching the current data via the stored ROWID. The ROWID for a record would not change while a program was running. This ensured that the current data - whether edits from the current user or someone else - was shown when the record was displayed.
There's a program you're familiar with, ISQL Perform, that behaves like this. And it does not work with fragmented tables (necessarily in IDS; SE does not support fragmented tables) unless they are created with a physical ROWID column with the WITH ROWIDS clause.
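The second half of that pattern, re-fetching one row by a ROWID saved from the earlier SELECT, might look roughly like this in Informix SQL. The value 257 is a made-up placeholder for whichever ROWID the program is currently showing.

```sql
-- Re-read the current contents of one previously listed row.
-- 257 is a hypothetical ROWID kept in the program's in-memory list.
SELECT *
FROM InterestingTable
WHERE ROWID = 257;
```

As noted above, this relies on the ROWID staying stable while the program runs, and it does not apply to fragmented tables unless they were created WITH ROWIDS.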

Perhaps the term "RDBMS" rather than "an SQL server"?
Attaching any purpose to a ROWID is a bad idea. Particularly if you're in the habit of dropping and recreating tables. If your table needs a SERIAL PK, then that's what it should have. No good can come of using ROWIDs within your application.
