Using Identity or Sequence in a data warehouse - SQL

I'm new to data warehousing, so I'm trying to follow best practice by mimicking some implementation details from the Microsoft demo database WideWorldImportersDW. One of the things I noticed is the use of a Sequence, rather than Identity, as the default value for primary keys.
Is it preferable to use Sequence over Identity in a data warehouse in general, and which one is more convenient, especially during the ETL process?

A sequence has more guarantees than an identity column. In particular, each call to a sequence is guaranteed to produce the next value for that sequence.
An identity column, on the other hand, can have gaps and other inconsistencies. This is all documented here.
Because of the additional guarantees on sequences, I suspect that they are slower. In particular, I suspect the database cannot preallocate values in batches, which means that in a multi-threaded environment sequences would impose serialization on transactions, slowing things down.
In general, I see identity used for identifying columns in tables. Although there is probably a performance comparison somewhere, I haven't seen one; I just suspect that sequences are a wee bit slower in some circumstances.

Both Sequence and Identity are designed for OLTP tables, to enable effective assignment of unique keys in a multi-session environment.
The important thing to realize is that in a data warehouse environment you often have a different setup, where only one job populates a specific table.
In a single-user environment you do not need the above features at all: you can simply assign the keys manually, starting with max(id) + 1 and incrementing by one for each row.
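A minimal sketch of that manual assignment, with hypothetical dimension and staging table names:

    -- Single-writer ETL job: assign surrogate keys manually from MAX(key) + 1.
    DECLARE @base bigint = (SELECT ISNULL(MAX(customer_key), 0) FROM dbo.DimCustomer);

    INSERT INTO dbo.DimCustomer (customer_key, customer_name)
    SELECT @base + ROW_NUMBER() OVER (ORDER BY s.customer_name),
           s.customer_name
    FROM dbo.StgCustomer AS s;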
The general rule of data warehousing is that you should not search for a silver-bullet recommendation but check the functionality and performance in your own tests.
If you research SQL Server Identity vs Sequence (e.g. here or here) you get varied results, partly preferring the former and partly the latter.
My recommendation is therefore to perform a test with manually assigned IDs (i.e. with no overhead), simply to get a baseline for your expectations.
Then repeat it with both identity and sequence - compare and choose.
Sequence in SQL Server was added later and is based on the Oracle sequence, so I would not expect it to have any fundamental problems.
Experience with Oracle tells us that you need a large enough cache on the sequence to support efficient bulk inserts.
In the meantime, identity can also be cached (IDENTITY_CACHE = { ON | OFF }), so once again: try all three possibilities (sequence, identity, nothing) and choose the best one.
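For example, a sketch of the two cache settings mentioned (the sequence name and cache size are illustrative):

    -- SQL Server 2017+: turn identity caching off for the current database.
    ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;

    -- A sequence with an explicit cache sized for bulk loads.
    CREATE SEQUENCE dbo.DimLoadSeq AS bigint START WITH 1 INCREMENT BY 1 CACHE 1000;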

Identity is scoped to a single table, is part of the table definition (DDL) and is reset on a truncate. Identity is unique within the table. Each table has its own identity value when configured and cannot be shared across tables. In general usage, the "next" value is consumed by SQL Server when an Insert occurs on the table.+
Sequence is a first class object, scoped to the database. The "next" value is consumed when the Sequence is used (NEXT VALUE FOR).
Sequences are most effectively used when you need a human-readable unique identifier stored across multiple tables. For example, a ticketing system that stores ticket types in different tables may use a sequence to ensure no two tickets receive the same number, regardless of the table in which they are stored, and so that a person can reasonably refer to the number (unlike a GUID).
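A minimal sketch of that ticketing scenario (all object names are hypothetical):

    CREATE SEQUENCE dbo.TicketNumber AS int START WITH 1000 INCREMENT BY 1;

    -- Two ticket types in separate tables, both drawing numbers from one sequence,
    -- so no two tickets ever share a number.
    CREATE TABLE dbo.IncidentTicket (
        TicketNo int NOT NULL DEFAULT (NEXT VALUE FOR dbo.TicketNumber) PRIMARY KEY,
        Summary  nvarchar(200) NOT NULL
    );

    CREATE TABLE dbo.ServiceTicket (
        TicketNo int NOT NULL DEFAULT (NEXT VALUE FOR dbo.TicketNumber) PRIMARY KEY,
        Summary  nvarchar(200) NOT NULL
    );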
In data warehousing, the dimension table needs a row identifier unique within the table. In general, the OLTP primary key is not sufficient as it may be duplicated within the dimension table depending on the type of dimension, and you don't want to risk assigning additional context to the OLTP PK as that can cause challenges when the source data changes. The dimension row identifier should only have meaning to the non-measure fact columns associated with it. Fact columns are not joined across different dimensions.++
Since the scope of the dimension table identifier is limited to the dimension table, an identity key is the ideal row identifier. It is simple to create, compact to store, and is meaningless outside the dimension. You won't use the dimension identity on a report. (Really, please don't be that developer.)
+ It's rare that you'll need to know the next value without assigning it to a row; it might be a red flag if you're trying to manipulate the identity value prior to assignment.
++ a dimension view may union different tables to feed the OLAP cube, in which case a persistent repeatable key should be generated from the underlying data, usually by concatenating a string literal with each table key in a normalized format.
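A hypothetical sketch of that union approach, reusing the ticket tables from the earlier sketch, with a string literal marking each source table:

    -- A persistent, repeatable key built from a per-table prefix plus the table's own key.
    CREATE VIEW dbo.DimTicket AS
    SELECT CONCAT('INC-', TicketNo) AS ticket_key, Summary FROM dbo.IncidentTicket
    UNION ALL
    SELECT CONCAT('SRV-', TicketNo), Summary FROM dbo.ServiceTicket;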

Related

Are SQL Server sequences guaranteed to always generate unique values even if called simultaneously from multiple connections?

This is a follow-up question to: Are SQL Server sequences thread safe?
I have two separate stored procedures that call the same sequence. The stored procedures are launched "in parallel" from an SSIS package. There is no synchronization of any kind between the two stored procedures (other than that I guarantee they'll never update the same rows, even though they update the same table). That said, there's no particular reason the sequence couldn't be called more or less simultaneously by the two stored procedures. My question is about exactly what would happen in that case.
In the case of the linked question, the OP had several producer applications simultaneously inserting into the table and wanted to know whether the IDs could be counted on to be sequential between the processes (i.e. that if producer 2 called the sequence before producer 3, its ID would be smaller). (This turned out not to be the case: generating the IDs and storing them were separate steps, which created a race condition.)
The same logic would presumably apply to my case (I can't count on the values being in any particular "order", because I also produce and store them in separate steps). In my case, however, I don't particularly care whether they're sequential, just that they're unique.
Can I count on that being the case? Are SQL Server sequences guaranteed to always produce unique values (even if called more or less simultaneously from different connections), or could there be some race condition that would make this no longer true?
Edit: The same sequence number could ultimately be added to multiple rows, if that matters (although it will always be added to at least one). I fetch the number from the sequence and then run an update query to add it to the rows that I want it to be part of.
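For reference, the fetch-then-update pattern described above looks roughly like this (a sketch; the sequence, table, and column names are hypothetical):

    DECLARE @batch_key int = 7;  -- hypothetical identifier for the rows this procedure owns
    DECLARE @group_no bigint = NEXT VALUE FOR dbo.GroupNumber;

    -- Stamp the single fetched number onto every row in the group.
    UPDATE dbo.StagingRows
    SET group_no = @group_no
    WHERE batch_key = @batch_key;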
If I read correctly, you just need to ensure they are unique (for example, you want them to be a primary key?). If so, that is correct. As far as guaranteed order goes, you are right that there are conditions, especially under load, where they will not be in a particular order. That doesn't sound like a big problem for you. As long as you are pulling the next value correctly, you are safe.
When I look at created sequences, I think of them like autonumbering in Oracle, where you have to pull the value and then utilize it, rather than like IDENTITY in SQL Server (although there are ways to increment IDENTITY to "fill in the hole" later, so it can be utilized in the same or a similar manner).
I have not examined the internals, but I would imagine the basic sequence concept is used for IDENTITY under the hood, as the ideas are essentially the same, except that IDENTITY is attached to a column in a table.
Yes, they do, and that's one of their most important features, as described in the documentation (emphasis mine):
Identity columns can be used for generating key values. The identity property on a column guarantees the following:
- Each new value is generated based on the current seed & increment.
- Each new value for a particular transaction is different from other concurrent transactions on the table.
Disclaimer: there is no guarantee that values are sequential.

Is there a term to describe tables where only the last value should be used?

In my database there are tables where each row represents an entity, and there are other tables where the same entity can appear multiple times, but only the latest entry is the valid one.
For example, I have a table customer where each row represents a customer, and another table customer_membership_status where the same customer can be referenced multiple times, but only the last record for each customer is supposed to be used. Data is never updated in customer_membership_status, only inserted.
Is there a term to describe this pattern? I'm asking because I would like to quickly and easily explain the intended use of the table to others.
Probably the best terms would be CQRS and event sourcing:
Using the stream of events as the write store, rather than the actual data at a point in time, avoids update conflicts on a single aggregate and maximizes performance and scalability
Event Sourcing pattern
Instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data. The store acts as the system of record and can be used to materialize the domain objects. This can simplify tasks in complex domains, by avoiding the need to synchronize the data model and the business domain, while improving performance, scalability, and responsiveness. It can also provide consistency for transactional data, and maintain full audit trails and history that can enable compensating actions.
EDIT:
After a closer look, you may also want to read about SCD (Slowly Changing Dimension) Type 2:
This method tracks historical data by creating multiple records for a given natural key in the dimension tables, with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.
Temporal table. It's a table where a timestamp / version attribute is part of the key. The temporal / version attribute allows you to identify which row is the latest for each customer.
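A minimal sketch of that "latest row" lookup, assuming customer_membership_status has a valid_from timestamp column:

    -- One row per customer: the record with the most recent valid_from wins.
    SELECT s.customer_id, s.status, s.valid_from
    FROM customer_membership_status AS s
    WHERE s.valid_from = (SELECT MAX(s2.valid_from)
                          FROM customer_membership_status AS s2
                          WHERE s2.customer_id = s.customer_id);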

Sequence vs identity

SQL Server 2012 introduced Sequence as a new feature, the same as in Oracle and Postgres. Where are sequences preferred over identities? And why do we need sequences?
I think you will find your answer here:
Using the identity attribute for a column, you can easily generate auto-incrementing numbers (which are often used as a primary key). With Sequence, it will be a different object which you can attach to a table column while inserting. Unlike identity, the next number for the column value will be retrieved from memory rather than from the disk – this makes Sequence significantly faster than Identity. We will see this in coming examples.
And here:
Sequences: Sequences have been requested by the SQL Server community for years, and it's included in this release. Sequence is a user-defined object that generates a sequence of numbers. Here is an example using Sequence.
and here as well:
A SQL Server sequence object generates a sequence of numbers just like an identity column in SQL tables. But the advantage of sequence numbers is that the sequence number object is not limited to a single SQL table.
and on MSDN you can also read more about usage and why we need it (here):
A sequence is a user-defined schema-bound object that generates a sequence of numeric values according to the specification with which the sequence was created. The sequence of numeric values is generated in an ascending or descending order at a defined interval and may cycle (repeat) as requested. Sequences, unlike identity columns, are not associated with tables. An application refers to a sequence object to receive its next value. The relationship between sequences and tables is controlled by the application. User applications can reference a sequence object and coordinate the values across multiple rows and tables.
A sequence is created independently of the tables by using the CREATE SEQUENCE statement. Options enable you to control the increment, maximum and minimum values, starting point, automatic restarting capability, and caching to improve performance. For information about the options, see CREATE SEQUENCE.
Unlike identity column values, which are generated when rows are inserted, an application can obtain the next sequence number before inserting the row by calling the NEXT VALUE FOR function. The sequence number is allocated when NEXT VALUE FOR is called even if the number is never inserted into a table. The NEXT VALUE FOR function can be used as the default value for a column in a table definition. Use sp_sequence_get_range to get a range of multiple sequence numbers at once.
A sequence can be defined as any integer data type. If the data type is not specified, a sequence defaults to bigint.
Sequence and identity are both used to generate auto-numbers, but the major difference is that identity is table-dependent while a sequence is independent of any table.
If you have a scenario where you need to maintain an auto-number globally (across multiple tables), need to restart your interval after a particular number, and need it cached for performance, that is the place for a sequence rather than identity.
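For example, a hypothetical sequence covering those three requirements (shared globally, restarting after a limit, cached):

    CREATE SEQUENCE dbo.GlobalNo AS int
        START WITH 1
        INCREMENT BY 1
        MINVALUE 1
        MAXVALUE 99999   -- restart the interval after this number...
        CYCLE            -- ...by cycling back to MINVALUE
        CACHE 50;        -- preallocate values in memory for performance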
Although sequences provide more flexibility than identity columns, I didn't find that they had any performance benefit.
I found performance using identity was consistently 3x faster than using sequence for batch inserts.
I inserted approx. 1.5M rows, and performance was:
14 seconds for identity
45 seconds for sequence
I inserted the rows into a table which used the sequence object via a table default:
NEXT VALUE FOR <seq> FOR <col_name>
and also tried specifying the sequence value in the select statement:
SELECT NEXT VALUE FOR <seq>, <other columns> FROM <table>
Both were the same factor slower than the identity method. I used the default cache option for the sequence.
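Those two approaches look roughly like this (a sketch with placeholder object names):

    CREATE SEQUENCE dbo.TestSeq AS bigint START WITH 1 INCREMENT BY 1;  -- default CACHE

    -- 1) Sequence consumed via a table default:
    CREATE TABLE dbo.TestRows (
        id      bigint NOT NULL DEFAULT (NEXT VALUE FOR dbo.TestSeq),
        payload varchar(50) NOT NULL
    );
    INSERT INTO dbo.TestRows (payload)
    SELECT payload FROM dbo.SourceRows;

    -- 2) Sequence consumed in the SELECT feeding the insert:
    INSERT INTO dbo.TestRows (id, payload)
    SELECT NEXT VALUE FOR dbo.TestSeq, payload FROM dbo.SourceRows;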
The article referenced in Arion's first link shows performance for row-by-row inserts: the difference between identity and sequence was 16.6 seconds versus 14.3 seconds for 10,000 inserts.
The caching option has a big impact on performance, but identity is faster for higher volumes (1M+ rows).
See this link for an in-depth analysis, as per utly4life's comment.
I know this is a little old, but I wanted to add an observation that bit me.
I switched from identity to sequence to have my indexes in order. I later found out that sequences don't transfer with replication. I started getting key violations after I set up replication between two databases, since the sequences were not in sync. Just something to watch out for before you make a decision.
I find the best use of sequences is not to replace an identity column but to create an "Order Number" type of field.
In other words, an order number is exposed to the end user and may have business rules attached to it. You want it to be unique, but just using an identity column isn't really correct either.
For example, different order types might require different sequences, so you might have a sequence for internet orders as opposed to in-house orders.
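For instance (hypothetical names and starting values):

    -- Separate, independently numbered ranges per order type.
    CREATE SEQUENCE dbo.InternetOrderNo AS int START WITH 100000 INCREMENT BY 1;
    CREATE SEQUENCE dbo.InHouseOrderNo  AS int START WITH 500000 INCREMENT BY 1;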
In other words, don't think of a sequence as simply a replacement for identity; think of it as being useful in cases where an identity does not fit the business requirements.
Recently I was bitten by something to consider for identity vs sequence. It seems MSFT now suggests a sequence if you want to keep values without gaps. We had an issue where there were huge gaps in the identity values, and the statement highlighted below explains it: SQL Server cached the identity values, and after a reboot we lost those numbers.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-2017
Consecutive values after server restart or other failures – SQL Server might cache identity values for performance reasons and some of the assigned values can be lost during a database failure or server restart. This can result in gaps in the identity value upon insert. If gaps are not acceptable then the application should use its own mechanism to generate key values. Using a sequence generator with the NOCACHE option can limit the gaps to transactions that are never committed.
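Following that guidance, a gap-averse sequence would be declared with caching disabled (a sketch; note the quoted text says this still only limits gaps to uncommitted transactions):

    CREATE SEQUENCE dbo.InvoiceNo AS bigint
        START WITH 1
        INCREMENT BY 1
        NO CACHE;   -- values are persisted as they are handed out, at a throughput cost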

Some sort of “different auto-increment indexes” per primary key value

I have a table with an id (primary key with auto-increment), a uid (a key referring to a user's id, for example) and something else which doesn't matter for this question.
I want to make, let's call it, a different auto-increment key on id for each uid entry.
So, if I add an entry with uid 10, the id field for this entry will be 1, because there were no previous entries with a value of 10 in uid. If I then add an entry with uid 4, its id will be 3, because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such functionality natively? (non-Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in the section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
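A condensed sketch of that manual example (MySQL syntax):

    CREATE TABLE animals (
        grp ENUM('fish','mammal','bird') NOT NULL,
        id MEDIUMINT NOT NULL AUTO_INCREMENT,
        name CHAR(30) NOT NULL,
        PRIMARY KEY (grp, id)
    ) ENGINE = MyISAM;

    INSERT INTO animals (grp, name) VALUES
        ('mammal','dog'), ('mammal','cat'),
        ('bird','penguin'), ('fish','lax');
    -- id restarts per group: dog=1, cat=2, penguin=1, lax=1.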
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts a transaction and inserts a new row for user 4.
Bill starts a transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id) + 1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id) + 1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This prevents concurrent INSERTs from creating a race condition like the one in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERTs, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table makes access to it serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that isn't hidden from two concurrent transactions. (Incidentally, this is what AUTO_INCREMENT does: two concurrent sessions each get a unique id value, regardless of their order of execution or order of commit.) But tracking the last generated id per userid requires access to the database, or a duplicate data store - for example, a memcached key per userid, which can be incremented atomically.
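One hypothetical way to do that inside the database (MySQL syntax) is a per-userid counter table, bumped with an autocommitted statement using the documented LAST_INSERT_ID(expr) trick:

    CREATE TABLE uid_counters (
        uid     INT NOT NULL PRIMARY KEY,
        last_id INT NOT NULL DEFAULT 0
    );

    -- Run outside the inserting transaction (autocommit), one row seeded per uid:
    UPDATE uid_counters SET last_id = LAST_INSERT_ID(last_id + 1) WHERE uid = 4;
    SELECT LAST_INSERT_ID();  -- the per-uid id reserved for this session's insert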
It's relatively easy to ensure that inserts get unique values, but it's hard to ensure they get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction; if I then allocate value 4 and you roll back while I commit, there's now a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? And what does that do to your memcached entries, if you use that solution?
SQL Server should allow you to do this. If you can't implement it using a computed column (probably not - there are some restrictions), you can surely implement it in a trigger.
MySQL would also allow you to implement this via triggers.
In a comment you asked about efficiency. Unless you are dealing with extreme volumes, storing an 8-byte DATETIME isn't much of an overhead compared to using, for example, a 4-byte INT.
It also massively simplifies your data inserts, as well as letting you cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table and uid to refer to something else. Perhaps use the field names property_id and amendment_id instead.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than not MS/Oracle), the general logic is simple (a sketch follows the list below)...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id already populated
- updates altering existing records
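A hypothetical T-SQL version of that trigger logic (it assumes a surrogate identity PK named id, and it ignores the caveats above, including the concurrency problem discussed in the earlier answer):

    CREATE TRIGGER dbo.trg_number_amendments ON dbo.PropertyAmendments
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        WITH numbered AS (
            SELECT pa.amendment_id,
                   ROW_NUMBER() OVER (PARTITION BY pa.property_id
                                      ORDER BY pa.id) AS rn,
                   (SELECT ISNULL(MAX(p2.amendment_id), 0)
                    FROM dbo.PropertyAmendments AS p2
                    WHERE p2.property_id = pa.property_id
                      AND p2.id NOT IN (SELECT id FROM inserted)) AS base
            FROM dbo.PropertyAmendments AS pa
            WHERE pa.id IN (SELECT id FROM inserted)
        )
        UPDATE numbered SET amendment_id = base + rn;  -- MAX + 1, 2, ... per property
    END;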
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
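A hypothetical sketch of that stored-procedure route, serializing the MAX + 1 read per property with update locks:

    CREATE PROCEDURE dbo.AddAmendment
        @property_id int,
        @note        nvarchar(200)
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRAN;
            DECLARE @next int;
            -- UPDLOCK/HOLDLOCK block a concurrent caller from reading the same MAX.
            SELECT @next = ISNULL(MAX(amendment_id), 0) + 1
            FROM dbo.PropertyAmendments WITH (UPDLOCK, HOLDLOCK)
            WHERE property_id = @property_id;

            INSERT INTO dbo.PropertyAmendments (property_id, amendment_id, note)
            VALUES (@property_id, @next, @note);
        COMMIT;
    END;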
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key, so use a multi-part key. Don't try to encode everything into a magic integer; you will poison the rest of your code.
If a record is identified by (entity_id, version_number), then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number, but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

Fast inserts; BulkCopy with relational data

I have a large amount of constantly incoming data (roughly 10,000 rows a minute, and growing) that I want to insert into a database as efficiently as possible. At the moment I'm using prepared insert statements, but am thinking of using the SqlBulkCopy class to import the data in larger chunks.
The problem is that I'm not inserting into a single table - elements of each data item are inserted into numerous tables, and their identity columns are used as foreign keys in other rows that are inserted at the same time. I understand that bulk copies aren't meant to allow for more complex inserts like this, but I wonder if it is worth exchanging my identity columns (bigints in this case) for uniqueidentifier columns. This would allow me to do a couple of bulk copies for each table, and since I can determine the IDs before the insert, I wouldn't need to check for anything like SCOPE_IDENTITY, which is what is preventing me from using bulk copy.
Does this sound like a viable solution, or are there other potential issues I might face? Or, is there another way I can insert data quickly, but retain my use of bigint identity columns?
Thanks.
uniqueidentifier will probably make things worse: page splits and wider rows. See this.
If your load is (or can be) batched, one option is to:
load a staging table
load the real tables in one go, from a stored procedure
use a uniqueidentifier in the staging table for each batch
We deal with peaks of around 50k rows per second (and increasing). We actually use a separate staging database to avoid double transaction-log writes.
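A rough sketch of that batching pattern (schema and object names are hypothetical; it assumes a stage schema exists):

    -- The staging table is bulk-loaded (e.g. via SqlBulkCopy), tagged per batch.
    CREATE TABLE stage.Events (
        batch_id uniqueidentifier NOT NULL,
        payload  nvarchar(400)    NOT NULL
    );

    -- One set-based move into the real table per batch.
    CREATE PROCEDURE dbo.FlushBatch @batch_id uniqueidentifier
    AS
    BEGIN
        INSERT INTO dbo.Events (payload)
        SELECT payload FROM stage.Events WHERE batch_id = @batch_id;

        DELETE FROM stage.Events WHERE batch_id = @batch_id;
    END;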
It sounds like you are planning on exchanging a "SQL assigns a [bigint identity() column] surrogate key" methodology for a "data prep routine assigns a GUID surrogate key" one. In other words, the key will not be assigned within SQL, but from outside SQL. Given your volumes, if the data-generating process can assign the surrogate key, I'd definitely go with that.
The question then becomes: must you use GUIDs, or can your data-generation process produce auto-incrementing integers? Creating such a process that works consistently and infallibly is hard (one reason why you pay $$$ for SQL Server), but the trade-off of smaller and more human-legible keys within the database might be worth it.
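If you'd rather keep bigint keys, one hypothetical middle ground is reserving a block of sequence values server-side and numbering the rows client-side before the bulk copy, via sp_sequence_get_range (the sequence name is illustrative):

    DECLARE @first sql_variant;
    EXEC sys.sp_sequence_get_range
         @sequence_name     = N'dbo.EventIdSeq',
         @range_size        = 10000,
         @range_first_value = @first OUTPUT;
    -- The caller may now assign ids @first .. @first + 9999 to rows before SqlBulkCopy.
    SELECT CAST(@first AS bigint) AS first_reserved_id;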