I have a database used by several clients. I don't really want surrogate incremental key values to bleed between clients. I want the numbering to start from 1 and be client specific.
I'll use a two-part composite key of the tenant_id as well as an incremental id.
What is the best way to create an incremental key per tenant?
I am using SQL Azure. I'm concerned about locking tables, duplicate keys, etc. I'd typically set the primary key to IDENTITY and move on.
Thanks
Are you planning on using SQL Azure Federations in the future? If so, the current version of SQL Azure Federations does not support the use of IDENTITY as part of a clustered index. See the question What alternatives exist to using guid as clustered index on tables in SQL Azure (Federations) for more details.
If you haven't looked at Federations yet, you might want to check it out as it provides an interesting way to both shard the database and for tenant isolation within the database.
Depending upon your end goal, using Federations you might be able to use a GUID as the primary clustered index on the table and also use an incremental INT IDENTITY field on the table. This INT IDENTITY field could be shown to end users. If you are federating on the TenantID, each "Tenant table" effectively becomes a silo (as I understand it, at least), so the use of IDENTITY on a field within that table would effectively be an ever-increasing auto-generated value which increments within a given Tenant.
When/if data is merged together (combining data from multiple Tenants), you would wind up with collisions on this INT IDENTITY field (hence why IDENTITY isn't supported as a primary key in Federations), but as long as you aren't using this field as a unique identifier within the system at large, you should be OK.
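A minimal sketch of that layout (table and column names are hypothetical; I'm assuming the Federations rule that the distribution column must be part of every unique index):

```sql
-- Sketch with hypothetical names: GUID clustered primary key for Federations,
-- plus an INT IDENTITY column that can be shown to end users.
CREATE TABLE Orders (
    TenantId  bigint NOT NULL,              -- federation distribution column
    OrderGuid uniqueidentifier NOT NULL DEFAULT NEWID(),
    DisplayId int IDENTITY(1,1) NOT NULL,   -- unique only within this tenant silo
    Amount    money NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (TenantId, OrderGuid)
);
```

DisplayId here is purely cosmetic: safe to show to a tenant's users, but not safe to treat as globally unique once tenants are merged.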
If you're looking to duplicate the convenience of having an automatically assigned unique INT key upon insert, you could add an INSTEAD OF INSERT trigger that uses MAX of the existing column +1 to determine the next value.
If the column with the identity value is the first key in an index, the MAX query will be a simple index seek, very efficient.
Transactions will ensure that unique values are assigned but this approach will have different locking semantics than the standard identity column. IIRC, SQL Server can allocate a different identity value for each transaction that requests it in parallel and if a transaction is rolled back, the value(s) allocated to it are discarded. The MAX approach would only allow one transaction to insert rows into the table at a time.
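A sketch of the MAX()+1 trigger approach, with hypothetical table and column names. The UPDLOCK/HOLDLOCK hints are one way to get the serialized-per-tenant locking behavior described above:

```sql
-- Sketch with hypothetical names: per-tenant ids assigned by an
-- INSTEAD OF INSERT trigger using MAX(existing OrderId) + 1.
CREATE TABLE Orders (
    TenantId int   NOT NULL,
    OrderId  int   NOT NULL,   -- per-tenant incremental value
    Amount   money NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (TenantId, OrderId)  -- makes MAX() an index seek
);
GO
CREATE TRIGGER trg_Orders_Insert ON Orders
INSTEAD OF INSERT
AS
BEGIN
    -- UPDLOCK/HOLDLOCK serializes concurrent inserts for the same tenant,
    -- which is the locking trade-off described above.
    INSERT INTO Orders (TenantId, OrderId, Amount)
    SELECT i.TenantId,
           ISNULL(m.MaxId, 0)
             + ROW_NUMBER() OVER (PARTITION BY i.TenantId ORDER BY (SELECT NULL)),
           i.Amount
    FROM inserted AS i
    CROSS APPLY (SELECT MAX(o.OrderId) AS MaxId
                 FROM Orders AS o WITH (UPDLOCK, HOLDLOCK)
                 WHERE o.TenantId = i.TenantId) AS m;
END;
```

The ROW_NUMBER() term handles multi-row inserts, assigning consecutive values per tenant within the batch.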
A related approach could be to have a dedicated key value table, keyed by table name and tenant ID and storing the current identity value. It would require the same INSTEAD OF INSERT trigger and more boilerplate to query and keep that key table updated. It wouldn't improve parallel operations, though; the lock would just be on a different table's record.
One possibility to fix the locking bottleneck would be to include the current SPID in the key's value (now the identity key is a combination of sequential int and whatever SPID happened to allocate it and not simply sequential), use the dedicated identity value table and insert records there per SPID as necessary; the identity table PK would be (table name, tenant, SPID) and have a non-key column with the current sequential value. That way, each SPID would have its own dynamically allocated identity pool and would only ever have its own SPID specific records locked.
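A sketch of that SPID-partitioned key table (all names are hypothetical). The single-statement UPDATE both increments and reads the counter atomically, and only locks the session's own row:

```sql
-- Sketch with hypothetical names: a dedicated key table keyed by
-- (table name, tenant, SPID) so each session locks only its own row.
CREATE TABLE IdentityPool (
    TableName sysname NOT NULL,
    TenantId  int     NOT NULL,
    Spid      int     NOT NULL,
    NextValue int     NOT NULL,
    CONSTRAINT PK_IdentityPool PRIMARY KEY (TableName, TenantId, Spid)
);
GO
-- Inside the INSTEAD OF INSERT trigger, a session would claim its next value
-- roughly like this (after first inserting its (table, tenant, SPID) row if missing):
DECLARE @next int;
UPDATE IdentityPool
SET @next = NextValue = NextValue + 1   -- increment and capture in one statement
WHERE TableName = N'Orders' AND TenantId = 42 AND Spid = @@SPID;
-- The row's key then becomes the pair (@next, @@SPID), not @next alone.
```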
Another downside is maintaining triggers that have to be updated whenever you change the columns in any of the special identity tables.
I'm using SQL Server 2017 with SSMS. I have created a few tables whose primary key is an int with Is Identity enabled, Identity Increment = 1, and Identity Seed = 1. I used the same method for all the tables. But when I added one record to a table, say Lead, its ID was 2; then I added a value to another table, say Followup, and its ID was 3.
(Screenshots of the Lead and Followup tables omitted.)
Is there any option available to avoid this? Can we keep the identity individual for each table?
The documentation is quite specific about what identity does not guarantee:
The identity property on a column does not guarantee the following:
Uniqueness of the value . . .
Consecutive values within a transaction . . .
Consecutive values after server restart or other failures . . .
Reuse of values
In general, the "uniqueness" property is a non-issue, because identity columns are usually the primary key (or routinely declared at least unique), which does guarantee uniqueness.
The purpose of an identity column is to provide a unique numeric identifier different from other rows, so it can be readily used as a primary key. There are no other guarantees. And for performance SQL Server has lots of short-cuts that result in gaps.
If you want no gaps, then the simplest way is to assign a value when querying:
row_number() over (order by <identity column>)
That is not 100% satisfying, because deletions can affect the value. I also think that on parallel systems, inserts can as well (because identities might be cached on individual nodes).
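As a concrete example, assuming a hypothetical Lead table with an identity column LeadId, the gap-free number is computed at query time:

```sql
-- Sketch: gap-free numbering computed when querying, not when inserting
SELECT ROW_NUMBER() OVER (ORDER BY LeadId) AS SeqNo,
       LeadId,
       LeadName
FROM Lead;
```

SeqNo is always 1, 2, 3, ... for the current contents of the table, regardless of the gaps in LeadId.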
If you do not care about performance, you can use a sequence for assigning a value. This is less performant than using an identity, because it basically requires serializing all the inserts to guarantee the properties you want.
I should note that even with a sequence, a failed insert can still produce gaps, so it still might not do what you want.
I am new to SQL, I am coming from NoSQL.
I have seen that you need to create a unique id for your rows if you want to use unique ids. They are not automatically made by the database as they were in MongoDB. One way to do so is to create auto-incrementing ids.
Are PostgreSQL auto-incrementing ids scalable? Does the DB have to insert one row at a time? How does it work?
-----EDIT-----
What I am actually wondering is in a distributed environment is there a risk that two rows may have the same id?
In Postgres, autoincrement is atomic and scalable. If some inserts fail, some ids can be missing from the sequence, but the inserted ids are guaranteed to be unique.
Also, all primary keys don't have to be generated. See my answer to your first question.
Autoincrementing columns that are defined as
id bigint PRIMARY KEY DEFAULT nextval('tab_id_seq')
or, more standard compliant, as
id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY
use a sequence to generate unique values.
A sequence is a special database object that can very efficiently supply unique integers to concurrent database sessions. I doubt that any identity generator, be it in MongoDB or elsewhere, can be more efficient.
Since getting a new sequence value accesses shared state, you can optimize sequences for high concurrency by defining them with a CACHE value higher than 1. Then each database session that uses the sequence keeps a cache of unique values and doesn't have to access the shared state each time it needs a value.
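As a sketch (table and column names are made up), the sequence options, including CACHE, can be set directly on an identity column in PostgreSQL:

```sql
-- Sketch: an identity column whose backing sequence hands each session
-- a batch of 20 values, reducing contention on the shared sequence state
CREATE TABLE tab (
    id      bigint GENERATED ALWAYS AS IDENTITY (CACHE 20) PRIMARY KEY,
    payload text
);
```

The trade-off is that ids allocated to different sessions can be handed out out of order, and unused cached values are lost when a session ends, leaving larger gaps.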
I have seen tables in SAP database and TFS database (both configuration and collection) that don't have primary keys defined. Why is that?
In TFS, a number of tables have neither a primary key nor foreign keys, due to very specific performance constraints. Plus, these databases are not supposed to be updated manually; TFS handles all changes to these tables through its own APIs. That's one of the reasons why Microsoft doesn't support direct querying against these tables.
Another reason, in the case of TFS, is that its cloud counterpart, Visual Studio Team Services, doesn't store all of its data in SQL Azure, but in Table Storage, Blob storage, or DocumentDB.
Nobody has attempted an answer to this yet, so here goes...
Some tables don't need a PRIMARY KEY, because they are never going to be updated, and might only have a tiny set of data in them (e.g. lookup tables). If these tables have no indexes at all then they are essentially heaps, which isn't always a bad thing.
Why should every table have a PRIMARY KEY defined anyway? If the table has a UNIQUE CLUSTERED INDEX in place, then this does just about everything that a PRIMARY KEY does, with the added bonus that you can allow NULL values to be stored. Depending on the implementation (e.g. SQL Server allows only one "unique" NULL value, other RDBMSs allow multiple), this might be a much better match for your application.
For example, let's say you want a table with two columns, account number and account name. Let's assume you make account number your PRIMARY KEY, because you want to ensure it is unique. Now you want to allow NULL account numbers, because these aren't always supplied at the point where you create an account; you have some weird two-part process where you create a record with just a name, then backfill the account number. If you stick to the PRIMARY KEY design, then you would need to do something like add an IDENTITY column, make that the PRIMARY KEY, then add a UNIQUE CONSTRAINT to prevent duplicate account numbers.
Now you are left with a surrogate key that is going to be of little use to any queries, so you would probably end up with a performance index anyway, even if you don't care about uniqueness.
If you had no PRIMARY KEY, but instead a UNIQUE CLUSTERED INDEX then you would be able to do this without changing your table, with only one customer ever allowed to have a NULL account number at the same time (if it's SQL Server).
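A sketch of that design in SQL Server (names are hypothetical):

```sql
-- Sketch with hypothetical names: no PRIMARY KEY, just a unique clustered index,
-- so AccountNumber can be NULL (one NULL row at a time in SQL Server)
CREATE TABLE Account (
    AccountNumber int          NULL,
    AccountName   varchar(100) NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX UX_Account_AccountNumber
    ON Account (AccountNumber);
```

The table gets the same clustered storage and uniqueness guarantee a PRIMARY KEY would give, without forbidding the NULL placeholder row.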
I designed a database for a customer a couple of years ago that had over 200 tables, and not a single PRIMARY KEY. Although this was more about me "making a point", it's not a stretch to assume that the same is true for other database developers out there.
I found a t-sql question and its answer. It is too confusing. I could use a little help.
The question is:
You develop a database application. You create four tables. Each table stores different categories of products. You create a Primary Key field on each table.
You need to ensure that the following requirements are met:
The fields must use the minimum amount of space.
The fields must be an incrementing series of values.
The values must be unique among the four tables.
What should you do?
A. Create a ROWVERSION column.
B. Create a SEQUENCE object that uses the INTEGER data type.
C. Use the INTEGER data type along with IDENTITY
D. Use the UNIQUEIDENTIFIER data type along with NEWSEQUENTIALID()
E. Create a TIMESTAMP column.
The stated answer is D, but I think the more suitable answer is B, because a sequence will use less space than a GUID and it satisfies all the requirements.
D is a wrong answer, because NEWSEQUENTIALID doesn't guarantee "an incrementing series of values" (second requirement).
NEWSEQUENTIALID()
Creates a GUID that is greater than any GUID previously generated by this function on a specified computer since Windows was started. After restarting Windows, the GUID can start again from a lower range, but is still globally unique.
I'd say that B (sequence) is the correct answer. At least, you can use a sequence to fulfil all three requirements, if you don't restart/recycle it manually. I think it is the easiest way to meet all three requirements.
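As a sketch of option B (table and sequence names are hypothetical), a single sequence shared by all four tables satisfies minimal size, incrementing values, and cross-table uniqueness:

```sql
-- Sketch: one sequence shared by all four category tables yields unique,
-- incrementing int values across them (names are hypothetical)
CREATE SEQUENCE dbo.ProductId AS int START WITH 1 INCREMENT BY 1 NO CYCLE;
GO
CREATE TABLE CategoryA (
    Id   int NOT NULL DEFAULT (NEXT VALUE FOR dbo.ProductId) PRIMARY KEY,
    Name varchar(100) NOT NULL
);
-- CategoryB, CategoryC, and CategoryD are defined the same way,
-- each with the same DEFAULT drawing from dbo.ProductId.
```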
Between the choices provided, B is the correct answer, since it meets all requirements:
ROWVERSION is a bad choice for a primary key, as stated in MSDN:
Every time that a row with a rowversion column is modified or inserted, the incremented database rowversion value is inserted in the rowversion column. This property makes a rowversion column a poor candidate for keys, especially primary keys. Any update made to the row changes the rowversion value and, therefore, changes the key value. If the column is in a primary key, the old key value is no longer valid, and foreign keys referencing the old value are no longer valid.
TIMESTAMP is deprecated, as stated in that same page:
The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
An IDENTITY column does not guarantee uniqueness, unless all its values are only ever generated automatically (you can use SET IDENTITY_INSERT to insert values manually), nor does it guarantee uniqueness between tables for any value.
A GUID is practically guaranteed to be unique per system, so if a GUID is the primary key for all 4 tables, it ensures uniqueness across all of them. The one requirement it doesn't fulfill is storage size: its storage is quadruple that of an int (16 bytes instead of 4).
A SEQUENCE, when not declared with CYCLE, guarantees uniqueness and has the lowest storage size.
The sequence of numeric values is generated in an ascending or descending order at a defined interval and can be configured to restart (cycle) when exhausted.
However,
I would actually probably choose a different option altogether: create a base table with a single identity column and link it with a 1:1 relationship to each of the category tables. Then use an INSTEAD OF INSERT trigger on each category table that first inserts a record into the base table and then uses SCOPE_IDENTITY() to get the value and insert it as the primary key of the category table.
This will enforce uniqueness as well as make it possible to use a single foreign key reference between the categories and products.
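A minimal sketch of that base-table approach for one category (all names are hypothetical; as written it handles single-row inserts only):

```sql
-- Sketch with hypothetical names: a shared identity base table feeding
-- the category tables via an INSTEAD OF INSERT trigger.
CREATE TABLE ProductBase (
    ProductId int IDENTITY(1,1) PRIMARY KEY
);
GO
CREATE TABLE CategoryA (
    ProductId int NOT NULL PRIMARY KEY
        REFERENCES ProductBase (ProductId),
    Name      varchar(100) NOT NULL
);
GO
CREATE TRIGGER trg_CategoryA_Insert ON CategoryA
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO ProductBase DEFAULT VALUES;   -- allocate the shared id
    -- NOTE: single-row inserts only; a multi-row batch would reuse one id
    INSERT INTO CategoryA (ProductId, Name)
    SELECT SCOPE_IDENTITY(), Name FROM inserted;
END;
```

Foreign keys from a products table can then reference ProductBase.ProductId regardless of which category a row lives in.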
The issue has been discussed extensively in the past, in general:
http://blog.codinghorror.com/primary-keys-ids-versus-guids/
Constraint #3 is why a SEQUENCE could run into issues, as there is a higher risk of collision / a lower number of possible rows in each table.
We have 8 million row table and we need to add a sequential id column to it. It is used for data warehousing.
From testing, we know that if we remove all the indexes, including the primary key index, adding a new sequential id column is about 10x faster. I still haven't figured out why dropping the indexes helps when adding an identity column.
Here is the SQL that add identity column:
ALTER TABLE MyTable ADD MyTableSeqId BIGINT IDENTITY(1,1)
However, the table in question has dependencies, so I cannot drop the primary key index unless I remove all the FK constraints. As a result, adding the identity column is slow.
Are there other ways to improve the speed when adding an identity column, so that client downtime is minimal?
or
Is there a way to add an identity column without locking the table, so that the table can still be accessed, or at least queried?
The database is SQL Server 2005 Standard Edition.
Adding a new column to a table will acquire a Sch-M (schema modification) lock, which prevents all access to the table for the duration of the operation.
You may get some benefit from switching the database into bulk-logged or simple mode for the duration of the operation, but of course, do so only if you're aware of the effects this will have on your backup / restore strategy.
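A sketch of that sequence of steps (database and table names are hypothetical; whether you see a real speedup depends on how much of the operation is minimally logged under the changed recovery model):

```sql
-- Sketch with hypothetical names: change the recovery model only for
-- the duration of the schema change, then restore it.
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;
GO
ALTER TABLE MyTable ADD MyTableSeqId bigint IDENTITY(1,1);
GO
ALTER DATABASE MyDb SET RECOVERY FULL;
-- Take a full or log backup here to re-establish the log backup chain.
```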