Is mixing uuidv4 and uuidv5 collision safe? - cryptography

I have a key value store where rows are indexed by uuid. For some of the keys I generate the key with uuidv4. For some of the other keys I generate the key from two other uuids from that same key value store.
e.g.
key                                   value
2efd5459-fa72-4b28-a801-160e84fa049d  alice
99a975d8-cadf-460a-b9ab-ce8352414d89  bob
2fc821c5-fa09-5a89-b355-e6d3b5b90fc8  alice_bob
The alice_bob row was generated via v5 of alice and bob's uuids.
// alice bob
u.v5('2efd5459-fa72-4b28-a801-160e84fa049d', '99a975d8-cadf-460a-b9ab-ce8352414d89')
Am I more likely to get a collision by mixing v4 and v5 in the same KV store than if I just used v4 or just used v5?

Quoting from https://github.com/uuidjs/uuid/issues/579
RFC4122 UUIDs have the version encoded in them, so [properly formed] v5 UUIDs will never collide with a v4 UUID, and vice-versa.
Outside of that, the odds of collision depend on the behavior of the respective UUID versions. The odds of v4 UUIDs is pretty well documented elsewhere. (tl;dr "vanishingly small"). v5 ids are deterministic hashes, so it mostly depends on the odds of you having the same input names, which isn't something we have control over. (There is a theoretical chance two different names will result in the same hash but, again, the odds of that are vanishingly small.)
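Both claims in the quoted answer can be checked with a short sketch. It uses Python's standard uuid module instead of the uuidjs library from the question (a swapped-in tool, chosen only because it needs no install). The argument order is an assumption that mirrors `u.v5(name, namespace)`, and `collision_probability` is a hypothetical helper that applies the usual birthday approximation to the 122 random bits of a v4 UUID.

```python
import uuid

ALICE = "2efd5459-fa72-4b28-a801-160e84fa049d"
BOB = "99a975d8-cadf-460a-b9ab-ce8352414d89"

# uuidjs v5(name, namespace) corresponds to uuid.uuid5(namespace, name) here.
derived = uuid.uuid5(uuid.UUID(BOB), ALICE)
random_key = uuid.uuid4()

# The version nibble (first hex digit of the third group) differs by
# construction, so a well-formed v5 value never equals a well-formed v4 value.
print(derived.version, random_key.version)  # 5 4

# v5 is deterministic: the same namespace and name always give the same key.
assert derived == uuid.uuid5(uuid.UUID(BOB), ALICE)


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Birthday approximation p ~ n^2 / 2^(bits + 1), valid while p is small."""
    return n * n / 2 ** (random_bits + 1)


# Roughly 1e-19 for a billion v4 keys in the same store.
print(collision_probability(10**9))
```

Because the inputs are ordered, deriving the key from (alice, bob) and from (bob, alice) gives two different v5 values, so the store needs a fixed convention for which UUID is the namespace.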

Related

Using string vs. integer identifiers in RESTful URLs

How does one decide whether to use string or integer identifiers in RESTful URLs? For example, I see that the GitHub API uses strings in some cases, e.g.
GET https://api.github.com/repos/nareshbhatia/git-explorer
=> get a repository whose id is "git-explorer"
whereas integers in others
GET https://api.github.com/repos/octocat/Hello-World/issues/1347
=> get an issue whose id is 1347
I understand that it is more natural to identify a repository by its name and an issue by its number, but from an implementation perspective a string identifier poses several issues. Should the primary key for the repository table be the name? But a string is generally a bad choice for a primary key. Ok so how about an integer surrogate key and make the name a unique column. But that means whenever I have a reference to a repository (an integer) and I need to construct a URL for it, I am forced to make a join to the repository table - just to get its name.
To clarify, suppose I am creating the JSON for an issue and I need to include a link to the repository, I need a join to the repository table to create a link like /repos/nareshbhatia/git-explorer instead of a simple link with just an integer reference like /repos/nareshbhatia/10
I'm not sure I agree with the statement that strings are a bad choice for a PK. Certainly if you are using a natural key, odds are good you're going to be using strings. Many people prefer integers because they are more compact and they improve performance in indices, but performance shouldn't be your only consideration, and there are advantages to natural keys. Enough about the database.
When you start talking about URIs, there really is no appreciable performance advantage to shorter URLs, so integers don't have performance as an edge over strings. Strings also have an SEO advantage over integers in a URI. These considerations might lead you to design your database with integer primary keys, but use strings as URIs in your RESTful endpoints.
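A minimal sketch of that split, using Python's built-in sqlite3 module: integer surrogate keys inside the database, string identifiers in the URL. The table and column names and the `issue_url` helper are hypothetical, and the sample row reuses values from the question purely for illustration. The join the question worries about is a single primary-key lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repositories (
        id    INTEGER PRIMARY KEY,          -- compact surrogate key
        owner TEXT NOT NULL,
        name  TEXT NOT NULL,
        UNIQUE (owner, name)                -- natural identifier used in URLs
    );
    CREATE TABLE issues (
        id            INTEGER PRIMARY KEY,
        repository_id INTEGER NOT NULL REFERENCES repositories(id),
        number        INTEGER NOT NULL,
        UNIQUE (repository_id, number)
    );
    INSERT INTO repositories (id, owner, name)
        VALUES (10, 'nareshbhatia', 'git-explorer');
    INSERT INTO issues (repository_id, number) VALUES (10, 1347);
""")


def issue_url(issue_id: int) -> str:
    """Build the string-based URL for an issue via one join on the surrogate key."""
    owner, name, number = conn.execute(
        """
        SELECT r.owner, r.name, i.number
        FROM issues AS i
        JOIN repositories AS r ON r.id = i.repository_id
        WHERE i.id = ?
        """,
        (issue_id,),
    ).fetchone()
    return f"/repos/{owner}/{name}/issues/{number}"


print(issue_url(1))  # /repos/nareshbhatia/git-explorer/issues/1347
```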

Redis Indexes: Storing full key vs. ID

Given this example:
user:1 email bob#bob.com
user:1 name bob
Based on my research, all the examples create an "index" similar to the following:
user:bob#bob.com 1
My question is: wouldn't it be better to store it as "user:1"? That would eliminate the need to concatenate the string in code. Is there some other reason not to store the whole string? Memory maybe?
The question was specifically about storing the full key in the index or just a numeric ID which is part of this key.
Redis has a number of memory optimizations that you may want to leverage to decrease general memory consumption. One of these optimizations is the intset (an efficient structure to represent sets of integers).
Very often, sets are used as index entries, and in that case, it is much better to store a numeric ID rather than an alphanumeric key, to benefit from the intset optimization.
Your example is slightly different because a given email address should be associated to only one user. A unique hash object is fine to store the whole index. I would still use numeric ID here since it is more compact, and may benefit from future Redis optimizations.
Based on what you've conveyed so far, I'd use Redis hashes. For example, I'd denormalize the data a bit and store it as 'hmset users:1 email bob#bob.com name Bob' and 'hset users:lookup:email bob#bob.com 1'.
This way, I can retrieve the user by either email address or user ID. You could create more lookup hashes depending on your needs.
For more useful patterns, look at The Little Redis Book by Karl Seguin.
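The lookup-hash pattern from the last answer can be sketched without a Redis server. Here plain dictionaries stand in for Redis hashes, and `hset`, `hget`, `add_user` and `find_by_email` are hypothetical helpers that only imitate the commands named above; a real client would issue the equivalent HSET and HGETALL calls.

```python
from typing import Dict, Optional

# In-memory stand-in for Redis: each top-level key maps to a hash (field -> value).
store: Dict[str, Dict[str, str]] = {}


def hset(key: str, field: str, value: str) -> None:
    store.setdefault(key, {})[field] = value


def hget(key: str, field: str) -> Optional[str]:
    return store.get(key, {}).get(field)


def add_user(user_id: int, email: str, name: str) -> None:
    hset(f"users:{user_id}", "email", email)
    hset(f"users:{user_id}", "name", name)
    # The index stores only the numeric ID, not the full "users:1" key.
    hset("users:lookup:email", email, str(user_id))


def find_by_email(email: str) -> Optional[Dict[str, str]]:
    user_id = hget("users:lookup:email", email)
    if user_id is None:
        return None
    # The key prefix is concatenated in code; this is the cost the question asks about.
    return store.get(f"users:{user_id}")


add_user(1, "bob#bob.com", "Bob")
print(find_by_email("bob#bob.com"))  # {'email': 'bob#bob.com', 'name': 'Bob'}
```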

How to choose my primary key?

I found this reading material on choosing a primary key.
Is there a guide / blog post on how to choose the primary key for a given table?
Should I use an auto-incremented/generated key, or should I base the primary key on the data being modeled (assuming it has a truly unique field)?
Should the primary key always be long for performance's sake, or can I take an external unique id as primary key, even if it's a string?
I believe that in practice using a natural key is rarely better than a surrogate key.
The following are the main disadvantages of using a natural key as the primary key:
You might have an incorrect key value, or you may simply want to rename a key value. To edit it, you would have to update all the tables that would be using it as a foreign key.
It is often difficult to have a truly unique natural key.
Natural keys are often strings. An index on a numeric field will be much more compact than one on a string field.
There is no hard rule on what the data type of the primary key should be. A numeric key normally performs better, but you could use a string, especially if the table is not big, and the tables that reference it are not big either.
A key is a set of attributes with two fundamental features: uniqueness and minimality. Minimality means the key has only the minimum number of attributes required to ensure uniqueness.
There are three criteria commonly applied as a guide to choosing a good key:
Familiarity - keys should be meaningful and familiar to the people who use them
Simplicity - keys should be as simple and concise as possible
Stability - key values should change infrequently
These are good guidelines but are not absolute requirements. In all cases functional requirements and the needs of data integrity should determine what keys to use.
I use surrogate keys, often referred to as non-sensical keys, made up of an autogenerated int/bigint datatype.
Here are some of the reasons I like using these keys.
When deleting several items from a list (such as old email) you can supply a comma separated list of integers instead of guids or natural keys
I find it makes writing your own cascade deletes easier
I think inner-joins are faster on integer fields
It can make a new system without documentation easier to learn.
Here are a couple of blog posts about primary keys:
http://www.mysqlperformanceblog.com/2006/10/03/long-primary-key-for-innodb-tables/
http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
I have worked with a lot of different data models in professional systems (mostly bank software) and there were different solutions. I have seen the GUID solution, and it did not seem to hurt performance much. I have seen the "number provided by a service as a system-wide unique number". I have seen algorithms providing something like a GUID "but shorter". I have also seen the business key used (like the account number), which is poor design and caused problems, and I would not recommend it. I have seen the auto-incremented key for each table.
What did I like the most? The number provided by a service as a system-wide number. It works well. And with a simple key translation table, one can use a user key (like an account number) to find the unique number and the kind of data object it refers to (not necessarily the table, because the same unique key may apply to several tables if a data object is split across tables depending on its type).
So is there a blog or something? Well, I have a book to recommend called "Data Modeling Essentials" by Graeme Simsion and Graham Witt. They might not suggest my preferred solution, but they give many real-life examples and show the different kinds of solutions that are possible.
I always choose uuid as a primary key. In comparison to int/long key, there is a slight overhead, but there are a lot of benefits: you cannot run into type overflow, you can shard database later on without changing primary keys, you can integrate with other systems and be sure that your primary keys are always unique, uuid cannot be guessed etc.

Database Design Question: GUID + Natural Numbers

For a database I'm building, I've decided to use natural numbers as the primary key. I'm aware of the advantages that GUIDs offer, but looking at the data, the bulk of each row's data would have been GUID keys.
I want to generate XML records from the database data, and one problem with natural numbers is that I don't want to expose my database keys to the outside world and allow users to guess "keys." I believe GUIDs solve this problem.
So, I think the solution is to generate a sparse, unique ID derived from the natural ID (hopefully it would be 2-way), or just add an extra column in the database and store a GUID (or some other multibyte ID).
The derived value is nicer because there is no storage penalty, but it would be easier to reverse and guess compared to a GUID.
I'm curious as to what others on SO have done, and what insights they have.
What you can do to compute a "GUID" is to calculate an MD5 hash of the ID with some salt (the table name, for instance), load this into a GUID and set a few bits so that it is a valid version 3 (MD5) GUID.
This is almost 2-way, since you can have a SQL computed column (which can also be indexed in certain cases) holding the GUID without persisting it in the table, and you can always re-compute a GUID from the correct ID and salt. That should be harder for users, since they know neither the salt nor the actual ID.
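The recipe in this answer (MD5 of salt plus ID, loaded into a GUID with the version bits set) can be sketched with Python's hashlib and uuid modules. The `public_guid` name, the `"orders"` salt and the `salt:id` input layout are assumptions for illustration, not something the answer specifies.

```python
import hashlib
import uuid

SALT = "orders"  # hypothetical salt, e.g. the table name; keep it private


def public_guid(row_id: int, salt: str = SALT) -> uuid.UUID:
    """Derive a stable, GUID-shaped public identifier from an integer key."""
    digest = hashlib.md5(f"{salt}:{row_id}".encode("utf-8")).digest()  # 16 bytes
    # version=3 overwrites the version and variant bits, so the result is a
    # well-formed RFC 4122 name-based (MD5) UUID.
    return uuid.UUID(bytes=digest, version=3)


print(public_guid(42))
print(public_guid(42) == public_guid(42))  # True: same ID and salt, same GUID
```

The mapping only runs from ID to GUID. Going the other way needs the indexed computed column the answer mentions, or a lookup over candidate IDs.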

What should I consider when selecting a data type for my primary key?

When I am creating a new database table, what factors should I take into account for selecting the primary key's data type?
Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...
You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.
To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...
In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:
each time I create a table, shall I lose time
identifying its primary key and its physical characteristics (type, size)
remembering these characteristics each time I want to refer to it in my code?
explaining my PK choice to other developers in the team?
My answer is no to all of these questions:
I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
I do not want to have to remember, when I write the code, that the Primary Key of my Table_whatever is a 10-character string.
I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_myTable', which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field as a uniqueIdentifier, just to stick to the rule, and because, in the end, it does not hurt.
The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as
Tbl_whatever
id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
.....
Where id_ is the prefix for the primary key, and code_ is used for the "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code.
I am not sure that my rule is the best one. But it is a very efficient one! If everyone applied it, we would, for example, avoid the time lost answering this kind of question!
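The id_/code_ convention described in this answer might look as follows in SQLite, driven from Python. SQLite has no uniqueIdentifier type, so the surrogate key is stored as text here; the `Tbl_invoice` table, its sample code value and the `add_invoice` helper are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Tbl_invoice (
        id_invoice   TEXT PRIMARY KEY,      -- surrogate key, a UUID stored as text
        code_invoice TEXT NOT NULL UNIQUE   -- "natural" key, unique and indexed
    )
""")


def add_invoice(code: str) -> str:
    """Insert a row with a freshly generated surrogate key and return that key."""
    new_id = str(uuid.uuid4())
    conn.execute("INSERT INTO Tbl_invoice VALUES (?, ?)", (new_id, code))
    return new_id


first = add_invoice("INV-0001")
try:
    add_invoice("INV-0001")  # same natural key again
except sqlite3.IntegrityError as exc:
    # The unique constraint on code_invoice rejects the duplicate.
    print("rejected:", exc)
```

Foreign keys in other tables would reference id_invoice, so a later change to the format or length of code_invoice touches only this one column.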
If using a numeric key, make sure the datatype is going to be large enough to hold the number of rows you might expect the table to grow to.
If using a guid, does the extra space needed to store the guid need to be considered? Will coding against guid PKs be a pain for developers or users of the application?
If using composite keys, are you sure that the combined columns will always be unique?
I don't really like what they teach in school, that is, using a 'natural key' (for example the ISBN in a book database) or even having a primary key made up of 2 or more fields. I would never do that. So here's my little advice:
Always have one dedicated column in every table for your primary key.
They should all have the same column name across all tables, i.e. "ID" or "GUID"
Use GUIDs when you can (if you don't need performance), otherwise incrementing INTs
EDIT:
Okay, I think I need to explain my choices a little bit.
Having a dedicated column named the same across all tables for your primary key just makes your SQL statements a lot easier to construct, and easier for someone else (who might not be familiar with your database layout) to understand. Especially when you're doing lots of JOINs and things like that. You won't need to look up what the primary key is for a specific table; you already know, because it's the same everywhere.
GUIDs vs. INTs doesn't really matter that much most of the time. Unless you hit the performance cap of GUIDs or are doing database merges, you won't have major issues with one or the other. BUT there's a reason I prefer GUIDs. The global uniqueness of GUIDs might always come in handy some day. Maybe you don't see a need for it now, but things like synchronizing parts of the database to a laptop / cell phone, or even finding data records without needing to know which table they're in, are great examples of the advantages GUIDs can provide. An integer only identifies a record within the context of one table, whereas a GUID identifies a record everywhere.
In most cases I use an identity int primary key, unless the scenario requires a lot of replication, in which case I may opt for a GUID.
I (almost) never used meaningful keys.
Unless you have an ultra-convenient natural key available, always use a synthetic (a.k.a. surrogate) key of a numeric type. Even if you do have a natural key available, you might want to consider using a synthetic key anyway and placing an additional unique index on your natural key. Consider what happened to higher-ed databases that used social security numbers as PKs when federal law changed: the costs of changing over to synthetic keys were enormous.
Also, I have to disagree with the practice of naming every primary key the same, e.g. "id". This makes queries harder to understand, not easier. Primary keys should be named after the table. For example employee.employee_id, affiliate.affiliate_id, user.user_id, and so on.
Do not use a floating point numeric type, since floating point numbers cannot be properly compared for equality.
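A two-line illustration of why floating-point values make poor keys. This is generic IEEE 754 behavior shown in Python, not tied to any particular database engine.

```python
import math

total = 0.1 + 0.2

# Exact equality fails because neither operand is exactly representable in binary.
print(total == 0.3)              # False
# A tolerance-based comparison succeeds, but a key lookup cannot use a tolerance.
print(math.isclose(total, 0.3))  # True
```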
Where do you generate it? Incrementing numbers don't fit well for keys generated by the client.
Do you want a data-dependent or independent key (sometimes you could use an ID from business data, can't say if this is always useful or not)?
How well can this type be indexed by your DB?
I have used uniqueidentifiers (GUIDs) or incrementing integers so far.
Cheers
Matthias
Numbers that have meaning in the real world are usually a bad idea, because every so often the real world changes the rules about how those numbers are used, in particular to allow duplicates, and then you've got a real mess on your hands.
I'm partial to using a generated integer key. If you expect the database to grow very large, you can go with bigint.
Some people like to use guids. The pro there is that you can merge multiple instances of the database without altering any keys but the con is that performance can be affected.
For a "natural" key, whatever datatype suits the column(s). Artificial (surrogate) keys are usually integers.
It all depends.
a) Are you fine having unique sequential numbers as your primary key? If yes, then an auto-incrementing integer (identity) column as your primary key will suffice.
b) If your business demands an alphanumeric primary key, then you have to go with varchar or nvarchar.
These are the two options I could think of.
A great factor is how much data you're going to store. I work for a web analytics company, and we have LOADS of data. So a GUID primary key on our pageviews table would kill us, due to the size.
A rule of thumb: For high performance, you should be able to store your entire index in memory. Guids could easily break this!
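A back-of-the-envelope sketch of that rule of thumb. It counts only the raw key bytes, assuming 4-byte int, 8-byte bigint and 16-byte GUID storage, and ignores per-row and per-page index overhead, which varies by engine. The row count and the `raw_key_gib` helper are hypothetical.

```python
# Storage size of one key value, in bytes.
KEY_BYTES = {"int": 4, "bigint": 8, "guid": 16}


def raw_key_gib(rows: int, key_type: str) -> float:
    """Raw bytes occupied by the key values alone, expressed in GiB."""
    return rows * KEY_BYTES[key_type] / 2**30


ROWS = 10**9  # hypothetical pageviews table with a billion rows
for key_type in KEY_BYTES:
    print(f"{key_type:>6}: {raw_key_gib(ROWS, key_type):.1f} GiB")
```

The GUID key alone takes four times the space of an int key, and the same factor repeats in every secondary index and foreign key that carries it.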
Use natural keys when they can be trusted. Some sources of natural keys can't be trusted. Years ago, the Social Security Administration used to occasionally mess up and assign the same SSN to two different people. They've probably fixed that by now.
You can probably trust VINs for vehicles, and ISBNs for books (but not for pamphlets, which may not have an ISBN).
If you use natural keys, the natural key will determine the datatype.
If you can't trust any natural keys, create a synthetic key. I prefer integers for this purpose. Leave enough room for reasonable expansion.
I usually go with a GUID column primary key for all tables (rowguid in MSSQL). What could be natural keys I make unique constraints. A typical example would be a product identification number that the user has to make up and ensure is unique. If I need a sequence, as in an invoice, I build a table to keep the last number and a stored procedure to ensure serialized access. Or a Sequence in Oracle :-) I hate the "social security number" example for natural keys, as that number will not always be available in a registration process, resulting in a need for a scheme to generate dummy numbers.
I usually use an integer, but here's an interesting perspective.
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Whenever possible, try to use a primary key that is a natural key. For instance, if I had a table where I logged one record every day, the logdate would be a good primary key. Otherwise, if there is no natural key, just use int. If you think you will have more than 2 billion rows, use a bigint. Some people like to use GUIDs, which work well, as they are unique and you will never run out of space. However, they are needlessly long, and hard to type in if you are just doing ad hoc queries.