Using custom codification scheme instead of GUID as Primary Key - sql

I'm upsizing a backend MS Access database to SQL Server. The front-end client will remain an Access application for the time being (has about 30k lines of code).
The aim is to eventually allow synchronisation of the database across multiple servers (not using replication, but probably the Sync Framework).
Currently, all Primary Keys in the Access tables are autoincrement integer surrogates.
I'm not asking about the process of upsizing but about whether I should use GUIDs or another codification for the PK (I know I could split the number range across servers, but I don't want to do that; I want to allow the PK to be created on the client when necessary, for instance in offline mode).
GUID
Pro:
standardised format.
uniqueness assured (practically anyway)
Cons:
not easy to manipulate in Access, especially when using them as filters for subforms or in controls.
degrade INSERT performance due to their randomness.
have more than one representation (string, canonical form, binary) that needs to be converted back and forth.
Custom codification scheme
I thought that maybe a scheme using a more uniform code as PK would avoid the performance penalty and, most importantly, ensure that the PK remains usable in form controls where necessary (and would not require conversions to/from string).
My idea for a codification scheme would be along the lines of a 14-character code split into:
8 digits for a timestamp
4 digits for a unique client ID
2 digits as a random number for potential collisions
Each digit would be drawn from the letters A-Z and the digits 2-9, excluding O and I because they look too much like 0 and 1 (in case we need to handle these PKs manually, for instance during debugging). A sketch of a possible generator follows the pros and cons below.
Pro:
easier to handle manually when the case arises.
doesn't require conversion between different representations since it's basically a string (so less existing code to adapt).
uniqueness assured (practically)
Cons:
JOIN performance hasn't been proven.
INSERT performance should be better than with GUIDs, but that isn't proven either.
Each server/client machine must be assigned its own UID, although that should not be too much of an issue.
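For illustration, here is a minimal T-SQL sketch of such a generator. It is only one possible reading of the scheme: the 32-character alphabet (A-Z without O and I, plus 2-9), the year-2000 epoch, the digit widths and the hard-coded client ID are all assumptions made for the example.

-- Hypothetical encoder for one part of the key: @value rendered as @digits base-32 characters
CREATE FUNCTION dbo.EncodeKeyPart (@value BIGINT, @digits INT)
RETURNS VARCHAR(20)
AS
BEGIN
    DECLARE @alphabet CHAR(32);
    DECLARE @result VARCHAR(20);
    SET @alphabet = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';  -- A-Z without O and I, plus 2-9
    SET @result = '';
    WHILE @digits > 0
    BEGIN
        SET @result = SUBSTRING(@alphabet, CAST(@value % 32 AS INT) + 1, 1) + @result;
        SET @value = @value / 32;
        SET @digits = @digits - 1;
    END;
    RETURN @result;
END;
GO

-- 8 digits of timestamp + 4 digits of client ID + 2 random digits against collisions
SELECT dbo.EncodeKeyPart(DATEDIFF(SECOND, '2000-01-01', GETUTCDATE()), 8)   -- timestamp part
     + dbo.EncodeKeyPart(42, 4)                                             -- per-machine client ID (assumed value)
     + dbo.EncodeKeyPart(ABS(CHECKSUM(NEWID()) % 1024), 2) AS NewKey;       -- 2 digits of randomness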
So, should I use GUID or another scheme for my PK?

not easy to manipulate in Access, especially when using them as filters for subforms or in controls.
-> Access supports GUIDs as the Number -> Replication ID field type. We have an Access application with every PK as a GUID and we haven't had any problems with filters (including filters for subforms).
degrade INSERT performance due to their randomness.
-> If you have a performance problem because of this, you can put the clustered index on another column (a timestamp, for example). Also, SQL Server has two functions for generating new GUID values: NEWID() and NEWSEQUENTIALID(). The second, as the name says, generates new values in sequence, so the INSERT performance issue should not happen.
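For what it's worth, here is a minimal T-SQL sketch of those two mitigations; the table and column names are invented for the example.

-- Option 1: sequential GUIDs as the clustered primary key
CREATE TABLE dbo.OrdersSeq (
    OrderId   uniqueidentifier NOT NULL
              CONSTRAINT DF_OrdersSeq_OrderId DEFAULT NEWSEQUENTIALID()
              CONSTRAINT PK_OrdersSeq PRIMARY KEY CLUSTERED,
    CreatedAt datetime NOT NULL DEFAULT GETUTCDATE()
);

-- Option 2: random GUIDs, with the clustered index on an ever-increasing column instead
CREATE TABLE dbo.OrdersRnd (
    OrderId   uniqueidentifier NOT NULL DEFAULT NEWID()
              CONSTRAINT PK_OrdersRnd PRIMARY KEY NONCLUSTERED,
    CreatedAt datetime NOT NULL DEFAULT GETUTCDATE()
);
CREATE CLUSTERED INDEX IX_OrdersRnd_CreatedAt ON dbo.OrdersRnd (CreatedAt);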
has more than one representation: string, canonical form, binary that need to be converted.
-> that's a 'pro' in my view :). But for developer and admin users it is represented and consumed as a string in both Access and SQL Server.
At its core, a GUID is 'only' a 128-bit number. I don't think you should worry about the efficiency of JOINs on GUID columns. Joining on GUID columns is much more efficient than conditions on text columns, for example.
I don't think the custom codification scheme is a good idea, because you would have to solve many problems yourself. On the other hand, GUIDs are standard and the tools are ready to use them.

How many records are you planning on? Is bigint not big enough? It goes up to 9,223,372,036,854,775,807 records (if you don't include the negatives).
If it is only for inserts, with no selects on the data, go for whatever scheme you like (I would still say bigint or GUID/uniqueidentifier). If you need to do selects, an int or bigint is much faster than a GUID or any other custom PK.

Related

Are there any downsides to using nanoid for primary key?

I know that UUIDs and incrementing integers are often used for primary keys.
I'm thinking of nanoids instead because those are URL friendly without being guessable / brute-force scrapeable (like incrementing integers).
Would there be any reason not to use nanoids as primary keys in a database like Postgres? (For example: Maybe they drastically increase query time since they aren't ... aligned or something?)
https://github.com/ai/nanoid
Most databases use incrementing IDs because it's more efficient to insert a new value onto the end of a B-tree based index.
If you insert a new value into a random place in the middle of a B-tree, it may have to split the B-tree nonterminal node, and that could cause the node at the next higher level to split, and so on up to the top of the B-tree.
This also has a greater risk of causing fragmentation, which means the index takes more space for the same number of values.
Read https://www.percona.com/blog/2015/04/03/illustrating-primary-key-models-in-innodb-and-their-impact-on-disk-usage/ for a great visualization about the tradeoff between using an auto-increment versus UUID in a primary key.
That blog is about MySQL, but the same issue applies to any B-tree based data structure.
I'm not sure if there is a disadvantage to using nanoids, but they are often unnecessary. While UUIDs are long, they can be translated to a shorter format without losing entropy.
See the NPM package (https://www.npmjs.com/package/short-uuid).
UUIDs are standardized by the Open Software Foundation (OSF) and described in RFC 4122. That means there is far more chance that other tools will give you some perks around them.
Some examples:
MongoDB has a special type to optimize the storage of UUIDs. Not only does a Nano ID string take more space, even its binary form takes more bits (126 for Nano ID vs 122 for a UUID).
I once saw a logging tool extracting the timestamp from the UUIDs; I can't remember which one, but the information is available.
Also, the long, non-reduced form of a UUID is very easy to identify visually. When the end user is a developer, that might help them understand the nature/source of the ID (it is clearly not a database auto-increment key).

Identifying Differences Efficiently

Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.
The problem is that these vendors will send the full dump of their data and not just the updates. We have some applications where we need only send the updates (that is, the changed records only). What we do currently is to load the data into a staging table and then compare it against previous data. This is painfully slow as the data set is huge and we are occasionally missing SLAs.
Is there a quicker way to resolve this issue? Any suggestions or help greatly appreciated. Our programmers are running out of ideas..
There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.
One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
How to hash varies by database product, but all of the major providers have some sort of hashing available in them.
The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field by field analysis. Even pretty long hashes can be analyzed pretty fast.
It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.
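As an illustration, here is a minimal sketch of that pattern in SQL Server 2012+ syntax (HASHBYTES, CONCAT); the staging/target table names, the business key and the data columns are invented for the example, and the same persisted RowHash column would be added to both tables.

-- Persisted hash over the non-key columns (repeat on dbo.Target)
ALTER TABLE dbo.Staging ADD RowHash AS
    HASHBYTES('SHA2_256', CONCAT(Col1, '|', Col2, '|', Col3)) PERSISTED;

-- New rows: key present in the import but not in the existing data
SELECT s.*
FROM dbo.Staging s
LEFT JOIN dbo.Target t ON t.BusinessKey = s.BusinessKey
WHERE t.BusinessKey IS NULL;

-- Updated rows: key matches but the hash does not
SELECT s.*
FROM dbo.Staging s
JOIN dbo.Target t ON t.BusinessKey = s.BusinessKey
WHERE t.RowHash <> s.RowHash;

-- Deleted rows: key present in the existing data but missing from the import
SELECT t.*
FROM dbo.Target t
LEFT JOIN dbo.Staging s ON s.BusinessKey = t.BusinessKey
WHERE s.BusinessKey IS NULL;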
The standard solution to this is hash functions. What you do is have the ability to take each row, and calculate an identifier + a hash of its contents. Now you compare hashes, and if the hashes are the same then you assume that the row is the same. This is imperfect - it is theoretically possible that different values will give the same hash value. But in practice you have more to worry about from cosmic rays causing random bit flips in your computer than you do about hash functions failing to work as promised.
Both rsync and git are examples of widely used software that use hashes in this way.
In general calculating a hash before you put it in the database is faster than performing a series of comparisons inside of the database. Furthermore it allows processing to be spread out across multiple machines, rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or out.
There are many hash functions that you can use. Depending on your application, you might want to use a cryptographic hash though you probably don't have to. More bits is better than fewer, but a 64 bit hash should be fine for the application that you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.

Store an integer for bitwise compare in a permission model using JPA 2

I am using a permission model where I have a table user_permissions. This table will hold one or more columns of type bigint. I will use the bits of each number to compare against certain permission rules (the bit position identifies a permission rule and the bit value indicates whether the rule is active or not).
The problem with this approach is that I have a limited number of bits to work with when using a type such as bigint.
What is the best column type I can use in this case that works in a cross-database environment?
The tags represent the technologies I am aiming for, so any other solution related to those technologies is appreciated.
I was thinking of using the @Lob annotation to store large data; is that the best practice?
UPDATE:
The user_permission table extends the user with a 1:1 relationship and has bigint fields like bin_create, bin_read, bin_update and bin_delete that hold the binary data as decimal numbers.
To clarify the question:
I am considering comparing the permissions using bitwise operators. So let's assume I have a user with the permission value 10 (1010), and an action requiring 13 (1101). Then 10 & 13 == 8 (1000), so the user has one permission matching the permissions required for the action, and I can allow or deny (it is up to the application rules to define which).
But with this approach I have a limited number of bits to work with (let's say the number of permissions to be considered grows, so the numbers grow too). The max bigint value is 9223372036854775807, which in binary is a run of 63 ones, so roughly 63 permission flags per field.
So, what is the best column type I can use in this case that works in a cross-database environment, can store a large number of binary flags, and can be manipulated with bitwise operators in Java?
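As a side note to the question, the same bitwise check can be expressed directly in SQL (SQL Server syntax shown; the variable names are only illustrative), which is handy for ad-hoc queries against such permission columns.

DECLARE @userPermissions bigint;
DECLARE @requiredForAction bigint;
SET @userPermissions = 10;   -- binary 1010
SET @requiredForAction = 13; -- binary 1101

-- 10 & 13 = 8 (binary 1000): at least one required permission is present
SELECT CASE WHEN (@userPermissions & @requiredForAction) <> 0
            THEN 'allow' ELSE 'deny' END AS Decision;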
If you want to store your data in an optimal way, you have to name the target you want to optimize for.
This is an optimal solution for MySQL (defining the column as BINARY(32)); you can try something similar on your favorite database:
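// 32 bytes of raw binary = 256 individually addressable permission bits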
@Column(columnDefinition = "BINARY(32)", length = 32, nullable = false)
private byte[] bits;
Sometimes, with some JPA providers and databases, the column definition ends up as a Lob. That's not the best solution, because reading a Lob is an external (very expensive) operation. Try changing either the provider or the database (if you're working with pure JPA, you can try that).
Options for replacing Lobs are, for example, numeric columns (you can use e.g. 4 columns of 64-bit width, or similar). If you want a nicer solution, these container columns can even be @Embedded into your main class. But it all depends on your database.
This way you will have 256 bits (32 bytes) without any conversion or further calculation, and you have the possibility to extend the range if you want. You have to be careful when changing the column definition, though.
If it's the amount of data you can fit in a field that you're concerned about, why not store the number as a varchar? To my knowledge, pretty much any database will let you go up to at least a varchar(255). If you need more than 255 digits in the number, you could encode it in base 64 to squeeze it down more. If my mental arithmetic is right, that gives you 255 characters * 6 bits per character = 1530 different bits to use. If you need more than that, I might suggest your permissions model is a little excessive.
That's assuming that you're trying to crowd the data into a smallish space in the database. Your question isn't entirely clear on what you're trying to solve for. On the other end of the spectrum, you could unpack the bits and save each bit to its own field or its own row. For example, user_permissions could be a table with two columns: user and permission, where each row is one permission granted to one user.
There are two different approaches:
1) Pretty data model
One row (a user, in your example) can have values (here a user permission, which is one bit or a boolean value) where you don't know how many values are possible, i.e. the number of values is in principle unlimited. The normal approach in SQL to handle this is a child table:
You create a table (and a Java class for the mapping / annotations) UserPermission, which contains the user id as a foreign key, a permission id and the boolean value. User id and permission id together form a unique key for this table (you can add a separate id as the primary key if you like). You can even add columns for the user who granted the permission, the date it was granted, etc., if you want some auditing, but that is not necessary.
If you want to make it prettier, you also create a table (and Java class) Permission, which contains the permission id, a name for the permission and perhaps other information.
This solution needs more database space than your idea with the bits in an integer, but bear in mind you won't have that many users compared to other data in the database, so the extra space doesn't matter.
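A minimal DDL sketch of that model (generic SQL; the table and column names, including the referenced app_user table, are invented for the example):

CREATE TABLE permission (
    permission_id  INTEGER      NOT NULL PRIMARY KEY,
    name           VARCHAR(100) NOT NULL
);

CREATE TABLE user_permission (
    user_id        BIGINT  NOT NULL REFERENCES app_user (user_id),
    permission_id  INTEGER NOT NULL REFERENCES permission (permission_id),
    granted        BOOLEAN NOT NULL,  -- the flag itself; use BIT/NUMBER(1) on databases without BOOLEAN
    PRIMARY KEY (user_id, permission_id)
);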
2) Fast solution:
If the solution with extra tables carries too much overhead, because your permissions are not really important, and you worry that an integer may be too short, then you can use the Java type BigInteger (which allows bit manipulation) and map it with a @Column annotation to a NUMBER or DECIMAL in the database.
Bear in mind that the size of a database NUMBER is also limited (for example, Oracle allows at most around 10^40). If that might be a problem, then you must use solution 1).
One more disadvantage of solution 2) is that you can never use an index for the permissions. (A query for all users having a certain permission set will never be able to use an index.)
I always would use solution 1).

INT vs Unique-Identifier for ID field in database

I am creating a new database for a web site using SQL Server 2005 (possibly SQL Server 2008 in the near future). As an application developer, I've seen many databases that use an integer (or bigint, etc.) for an ID field of a table that will be used for relationships. But lately I've also seen databases that use the unique identifier (GUID) for an ID field.
My question is whether one has an advantage over the other? Will integer fields be faster for querying and joining, etc.?
UPDATE: To make it clear, this is for a primary key in the tables.
GUIDs are problematic as clustered keys because of the high randomness. This issue was addressed by Paul Randal in the last Technet Magazine Q&A column: I'd like to use a GUID as the clustered index key, but the others are arguing that it can lead to performance issues with indexes. Is this true and, if so, can you explain why?
Now bear in mind that the discussion is specifically about clustered indexes. You say you want to use the column as an 'ID'; it is unclear whether you mean it as the clustered key or just the primary key. Typically the two overlap, so I'll assume you want to use it as the clustered index. The reasons why that is a poor choice are explained in the article linked above.
For non clustered indexes GUIDs still have some issues, but not nearly as big as when they are the leftmost clustered key of the table. Again, the randomness of GUIDs introduces page splits and fragmentation, be it at the non-clustered index level only (a much smaller problem).
There are many urban legends surrounding GUID usage that condemn them based on their size (16 bytes) compared to an int (4 bytes) and promise horrible performance doom if they are used. This is slightly exaggerated. A key of size 16 can still be a very performant key on a properly designed data model. While it is true that being 4 times as big as an int results in lower-density non-leaf pages in indexes, this is not a real concern for the vast majority of tables. The b-tree structure is a naturally well-balanced tree and the depth of tree traversal is seldom an issue, so seeking a value based on a GUID key as opposed to an INT key is similar in performance. A leaf-page traversal (i.e. a table scan) does not look at the non-leaf pages, and the impact of the GUID size on the page size is typically quite small, as the record itself is usually significantly larger than the extra 12 bytes introduced by the GUID. So I'd take the hearsay advice based on '16 bytes vs. 4 bytes' with a rather large grain of salt. Analyze each case individually and decide whether the size impact makes a real difference: how many other columns are in the table (i.e. how much impact the GUID size has on the leaf pages) and how many references use it (i.e. how many other tables will grow because they need to store a larger foreign key).
I'm calling out all these details as a sort of makeshift defense of GUIDs, because they have been getting a lot of bad press lately and some of it is undeserved. They have their merits and are indispensable in any distributed system (the moment you're talking data movement, be it via replication or the sync framework or whatever). I've seen bad decisions being made based on the GUID's bad reputation, when they were shunned without proper consideration. But it is true: if you have to use a GUID as the clustered key, make sure you address the randomness issue: use sequential GUIDs when possible.
And finally, to answer your question: if you don't have a specific reason to use GUIDs, use INTs.
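Putting that advice together, one common compromise looks roughly like the following sketch (invented names): an INT identity as the clustered primary key for joins, plus a GUID column kept for distributed scenarios, generated sequentially and enforced unique with a non-clustered constraint.

CREATE TABLE dbo.Customer (
    CustomerId   int IDENTITY(1,1) NOT NULL
                 CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
    CustomerGuid uniqueidentifier NOT NULL
                 CONSTRAINT DF_Customer_Guid DEFAULT NEWSEQUENTIALID()
                 CONSTRAINT UQ_Customer_Guid UNIQUE NONCLUSTERED,
    Name         nvarchar(200) NOT NULL
);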
The GUID is going to take up more space and be slower than an int - even if you use the newsequentialid() function. If you are going to do replication or use the sync framework you pretty much have to use a guid.
INTs are 4 bytes, BIGINTs are 8 bytes, and GUIDs are 16 bytes. The more space required to represent the data, the more resources are required to process it: disk space, memory, etc. So (a) they're slower, but (b) this probably only matters if volume is an issue (millions of rows, or thousands of transactions in very, very little time).
The advantage of GUIDs is that they are (pretty much) globally unique. Generate a GUID using the proper algorithm (and SQL Server xxxx will use the proper algorithm), and no two GUIDs will ever be alike, no matter how many computers you have generating them, no matter how frequently. (This does not hold after some 72 years of usage; I forget the details.)
If you need unique identifiers generated across multiple servers, GUIDs may be useful. If you need serious performance with fewer than 2 billion values, ints are probably fine. Lastly and perhaps most importantly, if your data has natural keys, stick with them and forget the surrogate values.
If you positively, absolutely have to have a unique ID, then use a GUID. Meaning, if you're ever going to merge, sync or replicate, you probably should use a GUID.
For less robust needs, an int should suffice, depending upon how large the table will grow.
As in most cases, the proper answer is, it depends.
Use them for replication etc, not as primary keys.
Kimberly L Tripp article
Against: Space, not strictly monotonic, page splits, bookmark/RIDs etc
For: er...
Fully agreed with JBrooks.
I want to say that when your table is large and you use SELECTs with JOINs, especially with derived tables, using GUIDs can significantly decrease performance.

Indexing a 'non guessable' key for quick retrieval?

I'm not getting everything I want from Google Analytics, so I'm making my own simple tracking system to fill in some of the gaps.
I have a session key that I send to the client as a cookie. This is a GUID.
I also have a surrogate IDENTITY int column.
I will frequently have to access the session row to make updates to it during the life of the client. Finding this session row to make updates is where my concern lies.
I only send the GUID to the client browser:
a) I don't want my technical 'hacker' users being able to gauge what 'user id' they are, i.e. know how many visitors we have had to the site in total
b) I want to make sure no one messes with data maliciously; nobody can guess a GUID
I know GUID indexes are inefficient, but I'm not sure exactly how inefficient. I'm also not clear on how to maximize the efficiency of multiple updates to the same row.
I don't know which of the following I should do :
Index the GUID column and always use that to find the row
Do a table scan to find the row based on the GUID (assuming recent sessions are easy to find), scanning in reverse date order (if that's even possible!)
Avoid a GUID index and keep a hashtable of active sessions in my application tier, IDictionary<GUID, int>, to allow the 'secret' IDENTITY surrogate key to be found from the 'non-secret' GUID key.
There may be several thousand sessions a day.
PS. I am just trying to better understand the SQL aspects of this. I know I can do other clever things like only writing to the table on session expiration etc., but please keep answers SQL/index related.
In this case, I'd just create an index on the GUID. Thousands of sessions a day is a completely trivial load for a modern database.
Some notes:
If you create the GUID index as non-clustered, the index will be small and probably be cached in memory. By default most databases cluster on primary key.
A GUID column is larger than an integer. But this is hardly a big issue nowadays. And you need a GUID for the application.
An index on a GUID is just like an index on a string, for example Last Name. That works efficiently.
The B-tree of an index on a GUID is harder to balance than an index on an identity column (but not harder than an index on Last Name). This effect can be countered by starting with a low fill factor and reorganizing the index in a weekly job. That is a micro-optimization for databases that handle a million inserts an hour or more.
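A minimal sketch of the layout these notes describe (SQL Server syntax, invented names): the surrogate IDENTITY stays the clustered primary key, and the GUID gets a small non-clustered unique index for the lookups coming from the cookie.

CREATE TABLE dbo.WebSession (
    SessionId   int IDENTITY(1,1) NOT NULL
                CONSTRAINT PK_WebSession PRIMARY KEY CLUSTERED,
    SessionGuid uniqueidentifier NOT NULL DEFAULT NEWID(),
    LastSeenAt  datetime NOT NULL DEFAULT GETUTCDATE()
);

-- Non-clustered unique index used to find the row from the cookie value
CREATE UNIQUE NONCLUSTERED INDEX IX_WebSession_SessionGuid
    ON dbo.WebSession (SessionGuid);

-- Typical update during the life of the client
DECLARE @SessionGuid uniqueidentifier;  -- in practice, the value read from the cookie
SET @SessionGuid = NEWID();
UPDATE dbo.WebSession
SET LastSeenAt = GETUTCDATE()
WHERE SessionGuid = @SessionGuid;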
Assuming you are using SQL Server 2005 or above, your scenario might benefit from NEWSEQUENTIALID(), the function that gives you ordered GUIDs.
Consider this quote from the article Performance Comparison - Identity() x NewId() x NewSequentialId
"The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance, and globally unique values."
Declare your table as
create table MyTable(
id uniqueidentifier default newsequentialid() not null primary key clustered
);
However, keep in mind, as Andomar noted, that the sequentiality of the GUIDs produced also makes them easy to predict. There are ways to make this harder, but none that would make it better than applying the same techniques to sequential integer keys.
Like the other authors, I seriously doubt that the overhead of using straight NEWID() GUIDs would be significant enough for your application to notice. You would be better off focusing on minimizing round trips to your DB than on implementing custom caching scenarios such as the dictionary you propose.
If I understand what you're asking, you're worried that indexing and looking up your users by their hashed GUID might slow your application down? I'm with Andomar, this is unlikely to matter unless you're inserting rows so fast that updating the index slows things down. Only on something like a logging table might that happen, and then only for complicated indexes.
More importantly, did you profile it first? You don't have to guess why your program is slow, you can find out which bits are slow with a profiler. Otherwise you'll waste hours optimizing bits of code that are either A) never used or B) already fast enough.