I'm not getting everything I want from Google Analytics, so I'm making my own simple tracking system to fill in some of the gaps.
I have a session key that I send to the client as a cookie. This is a GUID.
I also have a surrogate IDENTITY int column.
I will frequently have to access the session row to make updates to it during the life of the client. Finding this session row to make updates is where my concern lies.
I only send the GUID to the client browser because:
a) I don't want my technically savvy 'hacker' users to be able to gauge what 'user id' they are - i.e. to know how many visitors we have had to the site in total
b) I want to make sure no one messes with the data maliciously - nobody can guess a GUID
I know GUID indexes are inefficient, but I'm not sure exactly how inefficient. I'm also not clear on how to maximize the efficiency of multiple updates to the same row.
I don't know which of the following I should do :
Index the GUID column and always use that to find the row
Do a table scan to find the row based on the GUID (assuming recent sessions are easy to find). Do this in reverse date order (if that's even possible!).
Avoid a GUID index and keep a hashtable of active sessions in my application tier - an IDictionary<Guid, int> - to allow the 'secret' IDENTITY surrogate key to be found from the 'non-secret' GUID key.
There may be several thousand sessions a day.
PS. I am just trying to better understand the SQL aspects of this. I know I can do other clever things like only writing to the table on session expiration, but please keep answers SQL/index related.
In this case, I'd just create an index on the GUID. Thousands of sessions a day is a completely trivial load for a modern database.
Some notes:
If you create the GUID index as non-clustered, the index will be small and will probably be cached in memory. By default, most databases cluster on the primary key.
A GUID column is larger than an integer. But this is hardly a big issue nowadays. And you need a GUID for the application.
An index on a GUID is just like an index on a string, for example Last Name. That works efficiently.
The B-tree of an index on a GUID is harder to keep balanced than an index on an identity column. (But not harder than an index on Last Name.) This effect can be countered by starting with a low fill factor and reorganizing the index in a weekly job. This is a micro-optimization for databases that handle a million inserts an hour or more.
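To make that concrete, here is a minimal sketch (table and column names are my assumptions, not taken from the question): keep the clustered index on the int surrogate and give the GUID a small non-clustered index for the lookups.

create table Sessions(
    SessionId int identity(1,1) not null primary key clustered,
    SessionGuid uniqueidentifier not null default newid(),
    LastSeen datetime not null default getdate()
);

create unique nonclustered index IX_Sessions_Guid on Sessions(SessionGuid);

-- Each update seeks the small non-clustered index, then follows the
-- clustering key to the row:
declare @cookie uniqueidentifier;
set @cookie = '0F8FAD5B-D9CB-469F-A165-70867728950E'; -- value read from the client cookie
update Sessions set LastSeen = getdate() where SessionGuid = @cookie;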
Assuming you are using SQL Server 2005 or above, your scenario might benefit from NEWSEQUENTIALID(), the function that gives you ordered GUIDs.
Consider this quote from the article Performance Comparison - Identity() x NewId() x NewSequentialId
"The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance, and globally unique values."
Declare your table as
create table MyTable(
    id uniqueidentifier default newsequentialid() not null primary key clustered
);
However, keep in mind, as Andomar noted, that the sequentiality of the GUIDs produced also makes them easy to predict. There are ways to make this harder, but none that would make this better than applying the same techniques to sequential integer keys.
Like the other authors, I seriously doubt that the overhead of using straight newid() GUIDs would be significant enough for your application to notice. You would be better off focusing on minimizing round trips to your DB than on implementing custom caching scenarios such as the dictionary you propose.
If I understand what you're asking, you're worried that indexing and looking up your users by their GUID might slow your application down? I'm with Andomar: this is unlikely to matter unless you're inserting rows so fast that updating the index slows things down. Only on something like a logging table might that happen, and then only for complicated indices.
More importantly, did you profile it first? You don't have to guess why your program is slow, you can find out which bits are slow with a profiler. Otherwise you'll waste hours optimizing bits of code that are either A) never used or B) already fast enough.
Related
I have a large (2B + records) DynamoDB table.
I want to implement a distributed locking process by adding a new field, 'index_due_at' when an item is created or updated. After the create/update, I will do some further processing on the item and then remove the 'index_due_at' field.
I'd like to create a sweeper job which will periodically extract any records with an outstanding 'index_due_at' field (on the assumption that something about the above process failed) to give those records further treatment. I would anticipate at most 100s of records in this state at any one time, more likely 10s.
To optimise the performance of the sweeper, I want to create a GSI including the new field (and project the key data into it).
It seems that using a timestamp (in millis) as the GSI HASH key ought to give a good distribution. And I don't need to query on this field's value, just on its presence. Can anyone identify any drawbacks in this approach and if so, suggest an alternative?
Issues I can anticipate include:
* Non-uniqueness in timestamps at milli level.
* Possible hash key problems with numeric values, especially values that don't vary much in the most significant digits.
This is less of a problem than you might think. GSI hash keys don't actually have to be unique, so you're fine on that front.
You probably already know this, but your GSI will only contain items with GSI keys, so your GSI should be pretty small (100s of items).
One thought I have is that index_due_at might actually work better as a GSI sort key rather than a hash key. Data is sorted within a partition by the sort key, so you could have a GSI hash key of index_due_at_flag (set to Y when present) and a sort key of index_due_at. That would mean all your data is sorted naturally, so you could process it in date order.
That said, you are probably never going to Query this GSI, so I suspect your choice of keys hardly matters at all. Presumably you will just do a Scan, get all the items, and try to process them all, in which case you would never even use the keys. Just having a key attribute present puts the item in the GSI.
Another thought is that you need to handle the fact that GSIs are not perfectly synchronous with the base table. It's possible (admittedly unlikely) that an item in your GSI has actually just been processed. Therefore, if your sweeper script picks up an item from the GSI, you should handle the possibility that it has already been updated in the base table (e.g. by checking the base table item before attempting to process it).
Good luck with it. I answered because I liked your bio! Hope staying on the right side of barrel shaped is working out :)
This should be a perfect scenario for a DynamoDB sparse index.
Use 'index_due_at' as the sort key in the GSI; only the items you are interested in will be in the index, greatly reducing the space needed and improving performance.
I am trying to see if using a custom index for a specific type of data might reduce fragmentation in my database.
[Edit: we are using MS SQL Server 2008 R2]
I have an SQL database containing timestamped measurement data. Lots of data is inserted all the time, but once inserted it practically never needs to be updated. These timestamps are, however, not unique, as several devices (around 50 of them) measure the data at the same time.
This means that every 50 rows in the table contain equal timestamp values. This data is received more or less simultaneously, although I could take additional care to ensure that rows are written as sequentially as possible (if that would help), perhaps by keeping them in memory for some time and then writing only when I get the data from all the devices for a single timestamp.
We are using NHibernate with Guid.Comb to avoid index lookups we would have with plain bigint IDs. As opposed to plain GUIDs, this should reduce fragmentation, but for so many inserts, fragmentation nevertheless happens very soon.
Since my data is timestamped and inserted almost sequentially (increasing timestamps), I am wondering if there is a cleverer way to create a primary key with a unique clustered index for this table. The timestamp column is basically a bigint number (.NET DateTime ticks).
I have also noticed that a non-clustered index over that same timestamp column also gets pretty fragmented. So what index strategy would you recommend to reduce heap fragmentation in this case?
Maybe take a look at this answer, HiLo looks interesting.
Also, maybe your fragmentation is not a result of the discrepancy between the ordering of the index values and the order in which they are added, but a natural file-growth effect (as explained here)?
A separate column for a key doesn't make a lot of sense for this table, since you won't be updating any of the data. I imagine you'll be doing a lot of queries, though, probably based on that timestamp column.
You could try making the primary key a combination of the timestamp column and a device id column, and making that the clustered index. That should allow you to write nearly as fast as possible. If you query by device, however, you may need another index on device id and timestamp (the reverse); see the sketch below. I wouldn't make the reverse one clustered, though, as that would make the writes happen all over the table rather than on the trailing pages. And if most queries involve a date range and more than one device, clustering on timestamp first should give you the best performance.
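As a sketch (column names are my assumptions, not from the question), that could look like the following; the reverse index is optional and only needed for per-device queries:

create table Measurements(
    MeasuredAt bigint not null, -- .NET DateTime ticks
    DeviceId int not null,
    Value float not null,
    constraint PK_Measurements primary key clustered (MeasuredAt, DeviceId)
);

-- Optional reverse index for queries that filter by device first:
create nonclustered index IX_Measurements_Device on Measurements(DeviceId, MeasuredAt);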
Whenever I design a database, I automatically start with an auto-generating GUID primary key for each of my tables (excepting look-up tables).
I know I'll never lose sleep over duplicate keys, merging tables, etc. To me it just makes sense philosophically that any given record should be unique across all domains, and that that uniqueness should be represented in a consistent way from table to table.
I realize it will never be the most performant option, but putting performance aside, I'd like to know if there are philosophical arguments against this practice?
Based on the responses let me clarify:
I'm talking about consistently using a GUID surrogate key as a primary key- irrespective of whether and how any natural or sequential keys are designed on a table. These are my assumptions:
Data integrity based on natural keys can be designed for, but not assumed.
A primary key's function is referential integrity, irrespective of performance, sequencing, or data.
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table.
What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default unless you specifically tell it not to. The main reason is indeed performance, which will come back to bite you down the road (it will, trust me - just a matter of time), plus it's a waste of resources (disk space and RAM in your SQL Server machine) that really isn't necessary.
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary/clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
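A minimal sketch of that separation (table and column names are mine, not from any particular schema):

create table Customers(
    CustomerGuid uniqueidentifier not null default newid(),
    ClusterId int identity(1,1) not null,
    CustomerName nvarchar(100) not null
);

-- Cluster on the narrow, ever-increasing INT...
create unique clustered index CIX_Customers on Customers(ClusterId);

-- ...and enforce the GUID as the (nonclustered) primary key.
alter table Customers
    add constraint PK_Customers primary key nonclustered (CustomerGuid);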
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry of each and every non-clustered index on your table as well - so you really want to make sure it's as small as possible. Typically, an INT, with its range of 2+ billion values, is sufficient for the vast majority of tables - and compared to a GUID as the clustering key, it can save you hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
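If you want to check the equivalent numbers on your own tables, a query along these lines (a sketch using the standard DMVs; the table name is a placeholder) reports the size of each index:

-- Approximate size of every index on a table, in MB
select i.name as index_name,
       sum(ps.used_page_count) * 8 / 1024.0 as size_mb
from sys.dm_db_partition_stats ps
join sys.indexes i
    on i.object_id = ps.object_id and i.index_id = ps.index_id
where ps.object_id = object_id('dbo.Customers') -- placeholder table name
group by i.name;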
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
Jeff Atwood talks about this in great detail:
http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
Guid Pros:
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
Guid Cons:
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
Adding to ewwwn's answer:
Pros
It makes it nearly impossible for developers to "accidentally" expose the surrogate key to users (unlike integers where it happens almost all the time).
Makes merging databases several orders of magnitude simpler than dealing with identity columns.
Cons
Fatter. The real problem with it being fatter is that it eats up more space per page and more space in your indexes, making them slower. The additional storage space of GUIDs is frankly irrelevant in today's world.
You absolutely must be careful about how new values are created. Truly random values do not index well. You are compelled to use a COMB guid or some variant that adds a sequential element to the guid.
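For illustration, here is the classic T-SQL formulation of a COMB-style GUID (the technique NHibernate's Guid.Comb is modeled on, not anyone's specific code here): the last six bytes, which SQL Server weighs first when ordering uniqueidentifiers, are overwritten with the current time, so new values sort roughly in insert order.

-- COMB GUID: random bytes up front, current datetime in the last 6 bytes
select cast(
    cast(newid() as binary(10)) + cast(getdate() as binary(6))
    as uniqueidentifier) as CombGuid;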
You still implement the natural key of each table as well, don't you? GUID keys alone obviously won't prevent duplicate data, redundancy, and the consequent loss of data integrity.
Assuming that you do enforce other keys then adding GUIDs to every table without exception is probably just adding unnecessary complexity and overhead. It doesn't really make it easier to merge data in different tables because you still have to modify/de-duplicate the other key(s) of the table anyway. I suggest you should evaluate the use of a GUID surrogate on a case-by-case basis. Having a blanket rule for every table isn't necessary or helpful because every table models a different thing after all.
Simple answer: it's not relational.
The record (as defined by the GUID) may be unique, but none of the associated attributes can be said to occur uniquely with that record.
Using a GUID (or any purely surrogate key) is no more relational than declaring a flat file to be relational, on the basis that each record can be identified by its row number.
A potentially big reason, but one often not thought of, is if you might have to provide compatibility with an Oracle database in the future.
Since Oracle doesn't have a uniqueidentifier column data type, it can lead to a bit of a nightmare when you have two different data types for the same primary key across two different databases, especially when an ORM is involved.
I wonder why there's no standard "miniGUID" type? It would seem that performing a decent hash on a GUID should yield a 64-bit number which would have a trivial probability of duplication in any universe which doesn't have a billion or more things in it. Since the universe in which most GUID/miniGUID identifiers are used will never grow beyond a million things, much less a billion, I would think a smaller 8-byte miniGuid would be very useful.
That would not, of course, suggest that it should be used as a clustered index; that would greatly impede performance. Nonetheless, an 8-byte miniGUID would waste only a third of the space a full GUID wastes (when compared to a 4-byte index).
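A quick sketch of that idea, assuming HASHBYTES with MD5 (available since SQL Server 2005) counts as a 'decent hash'; the first 8 bytes of the digest become the hypothetical miniGUID:

-- Hypothetical miniGUID: hash a full GUID and keep 8 bytes as a bigint
declare @g uniqueidentifier;
set @g = newid();
select cast(substring(hashbytes('MD5', cast(@g as binary(16))), 1, 8) as bigint) as MiniGuid;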
I can see the case for a given application's or enterprise's own identifiers to be unique and to be represented in a consistent way across all its own domains (i.e. because they may span more than one database), but a GUID is overkill for these purposes. I guess they are popular because they are available out of the box, while designing and implementing an 'enterprise key' takes time and effort. The rule when designing an artificial identifier is to make it as simple as possible but no simpler. IDENTITY is too simple; a GUID isn't simple enough.
Entities that exist outside of the application/enterprise usually have their own identifiers (e.g. a car has a VIN, a book has an ISBN, etc.) maintained by an external trusted source, and in such cases a GUID adds nothing. So I guess the philosophical argument against the practice I'm getting at here is that using an artificial identifier on every table is unnecessary.
I am creating a new database for a web site using SQL Server 2005 (possibly SQL Server 2008 in the near future). As an application developer, I've seen many databases that use an integer (or bigint, etc.) for an ID field of a table that will be used for relationships. But lately I've also seen databases that use the unique identifier (GUID) for an ID field.
My question is whether one has an advantage over the other? Will integer fields be faster for querying and joining, etc.?
UPDATE: To make it clear, this is for a primary key in the tables.
GUIDs are problematic as clustered keys because of the high randomness. This issue was addressed by Paul Randal in the last Technet Magazine Q&A column: I'd like to use a GUID as the clustered index key, but the others are arguing that it can lead to performance issues with indexes. Is this true and, if so, can you explain why?
Now bear in mind that the discussion is specifically about clustered indexes. You say you want to use the column as an 'ID'; it is unclear whether you mean it as the clustered key or just the primary key. Typically the two overlap, so I'll assume you want to use it as the clustered index. The reasons why that is a poor choice are explained in the article linked above.
For non clustered indexes GUIDs still have some issues, but not nearly as big as when they are the leftmost clustered key of the table. Again, the randomness of GUIDs introduces page splits and fragmentation, be it at the non-clustered index level only (a much smaller problem).
There are many urban legends surrounding GUID usage that condemn them based on their size (16 bytes) compared to an int (4 bytes) and promise horrible performance doom if they are used. This is slightly exaggerated. A key of size 16 can still be a very performant key on a properly designed data model. While it is true that being 4 times as big as an int results in lower-density non-leaf pages in indexes, this is not a real concern for the vast majority of tables. The b-tree structure is a naturally well-balanced tree, and the depth of tree traversal is seldom an issue, so seeking a value based on a GUID key as opposed to an INT key is similar in performance.
A leaf-page traversal (i.e. a table scan) does not look at the non-leaf pages at all, and the impact of the GUID size on the page size is typically quite small, as the record itself is usually significantly larger than the extra 12 bytes introduced by the GUID. So I'd take the hearsay advice based on '16 bytes vs. 4' with a rather large grain of salt. Analyze each case individually and decide whether the size impact makes a real difference: how many other columns are in the table (i.e. how much impact the GUID size has on the leaf pages), and how many references use it (i.e. how many other tables will grow because they need to store a larger foreign key).
I'm calling out all these details as a sort of makeshift defense of GUIDs, because they've been getting a lot of bad press lately and some of it is undeserved. They have their merits and are indispensable in any distributed system (the moment you're talking data movement, be it via replication or the sync framework or whatever). I've seen bad decisions made because of the GUID's bad reputation, where they were shunned without proper consideration. But it is true: if you have to use a GUID as the clustered key, make sure you address the randomness issue - use sequential GUIDs where possible.
And finally, to answer your question: if you don't have a specific reason to use GUIDs, use INTs.
A GUID is going to take up more space and be slower than an int - even if you use the newsequentialid() function. If you are going to do replication or use the sync framework, you pretty much have to use a GUID.
INTs are 4 bytes, BIGINTs are 8 bytes, and GUIDs are 16 bytes. The more space required to represent the data, the more resources are required to process it - disk space, memory, etc. So (a) GUIDs are slower, but (b) this probably only matters if volume is an issue (millions of rows, or thousands of transactions in very, very little time).
The advantage of GUIDs is that they are (pretty much) globally unique. Generate a GUID using the proper algorithm (and SQL Server xxxx will use the proper algorithm), and no two GUIDs will ever be alike - no matter how many computers you have generating them, no matter how frequently. (This does not apply after 72 years of usage - I forget the details.)
If you need unique identifiers generated across multiple servers, GUIDs may be useful. If you need mondo performance and under 2 billion values, ints are probably fine. Lastly, and perhaps most importantly, if your data has natural keys, stick with them and forget the surrogate values.
If you positively, absolutely have to have a unique ID, then GUID. Meaning if you're ever gonna merge, sync, or replicate, you probably should use a GUID.
For less robust things, an int should suffice, depending on how large the table will grow.
As in most cases, the proper answer is, it depends.
Use them for replication etc, not as primary keys.
Kimberly L Tripp article
Against: Space, not strictly monotonic, page splits, bookmark/RIDs etc
For: er...
Fully agreed with JBrooks.
I want to add that when your table is large and you use SELECTs with JOINs, especially with derived tables, using GUIDs can significantly decrease performance.
I'm upsizing a backend MS Access database to SQL Server. The front-end client will remain an Access application for the time being (has about 30k lines of code).
The aim is to eventually allow synchronisation of the database across multiple servers (not using replication but probably the sync framework).
Currently, all Primary Keys in the Access tables are autoincrement integer surrogates.
I'm not asking about the process of upsizing, but about whether I should use a GUID or another codification for the PK (I know I could split the number range across servers, but I don't want to do that; I want to allow the PK to be created on the client if necessary, for instance in offline mode).
GUID
Pro:
standardised format.
uniqueness assured (practically anyway)
Cons:
not easy to manipulate in Access, especially when using them as filters for subforms or in controls.
degrade INSERT performance due to their randomness.
has more than one representation (string, canonical form, binary) that needs to be converted.
Custom codification scheme
I thought that maybe a scheme using a more uniform code as PK would avoid the performance penalty and, most importantly, ensure that the PK remains usable in form controls where necessary (and not require these conversions to/from string).
My idea for a codification scheme would be along the lines of a 14-character code split into:
8 characters for a timestamp
4 characters for a unique client ID
2 characters as a random number to avoid potential collisions
Each character would be drawn from a base-32 alphabet composed of the letters A-Z and the digits 2-9, avoiding O, 0, 1 and I because of their visual similarity (in case we need to manually handle these PKs, for instance during debugging).
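For illustration only, here is one way the timestamp part of such a code could be generated in T-SQL; the epoch, layout, and names are my assumptions, not part of the scheme itself:

-- Build the 8-character timestamp part of the code from seconds since
-- an arbitrary epoch, using the reduced 32-symbol alphabet.
declare @alphabet char(32);
declare @code varchar(14);
declare @val bigint;
declare @len int;

set @alphabet = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';
set @code = '';
set @val = datediff(second, '2000-01-01', getutcdate());
set @len = 8;

while @len > 0
begin
    set @code = substring(@alphabet, @val % 32 + 1, 1) + @code;
    set @val = @val / 32;
    set @len = @len - 1;
end

-- The 4-character client id and the 2-character random suffix would be
-- encoded the same way, e.g. seeding @val with abs(checksum(newid()))
-- for the random part.
select @code as SessionCode;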
Pro:
easier to handle manually when the case arises.
doesn't require conversion between different representations since it's basically a string (so less existing code to adapt).
uniqueness assured (practically)
Cons:
performance in JOINs hasn't been proven
performance of INSERTs should be better than with GUIDs, but that's not proven either
each server/client machine must be assigned its own UID, although that should not be too much of an issue
So, should I use GUID or another scheme for my PK?
not easy to manipulate in Access, especially when using them as filters for subforms or in controls.
-> Access has a GUID type: Number -> Replication ID. We have an application in Access with every PK as a GUID, and we haven't had any problems with filters (including filters for subforms).
degrade INSERT performance due to their randomness.
-> If you have performance problems because of this, you can put the clustered index on another column (a timestamp, for example). Also, MS SQL Server has two functions for generating new GUID values: newid() and newsequentialid(). The second, as the name says, generates new values in sequence, so the INSERT performance issue should not occur.
has more than one representation (string, canonical form, binary) that needs to be converted.
-> That's a 'pro' in my sight :). For developer users and admin users, it is represented and consumed as a string in both Access and MS SQL.
At its core, the GUID is 'only' a 128-bit number. I don't think you should worry about the efficiency of JOINs on GUID columns. Joining on GUID columns is much more efficient than conditions on text columns, for example.
I don't think the custom codification scheme is a good idea, because you would have to solve many problems yourself. On the other hand, the GUID is a standard, and the tools are ready to use it.
How many records are you planning on? Is bigint not big enough? It allows up to 9,223,372,036,854,775,807 records (if you don't include the negatives).
If it is only for inserts, with no selects on the data, go for whatever scheme (though I would still say bigint or GUID/uniqueidentifier). If you need to do selects, an int or bigint is much faster than a GUID or any other custom PK.