Is there a term to describe tables where only the last value should be used? - sql

In my database there are tables where each row represents an entity, and there are other tables where the same entity can appear multiple times, but only the latest entry is the valid one.
For example, I have a table customer where each row represents a customer, and another table customer_membership_status where the same customer can be referenced multiple times, but only the last record for each customer is supposed to be used. Data is never updated in customer_membership_status, only inserted.
Is there a term to describe this pattern? I'm asking because I would like to quickly and easily explain the intended use of the table to others.

Probably the best terms to describe this are CQRS and event sourcing:
Using the stream of events as the write store, rather than the actual data at a point in time, avoids update conflicts on a single aggregate and maximizes performance and scalability
Event Sourcing pattern
Instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data. The store acts as the system of record and can be used to materialize the domain objects. This can simplify tasks in complex domains, by avoiding the need to synchronize the data model and the business domain, while improving performance, scalability, and responsiveness. It can also provide consistency for transactional data, and maintain full audit trails and history that can enable compensating actions.
EDIT:
After a closer look, you may also want to read about SCD (Slowly Changing Dimension) Type 2.
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.
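As a rough illustration of the SCD Type 2 shape (table and column names are invented for the example):

-- One row per version of the same natural key; the surrogate key is unique per version.
CREATE TABLE dim_customer (
    customer_sk INT         NOT NULL PRIMARY KEY,  -- surrogate key, one per version
    customer_id INT         NOT NULL,              -- natural key, repeats across versions
    membership  VARCHAR(20) NOT NULL,
    version_no  INT         NOT NULL,
    valid_from  DATE        NOT NULL,
    valid_to    DATE        NULL                   -- NULL = current version
);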

Temporal table. It's a table where a timestamp / version attribute is part of a key. The temporal / version attribute allows you to identify which is the latest row for each customer.
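Whichever term you pick, the read side usually looks the same. Here is a minimal sketch of the "latest row per customer" query, assuming customer_membership_status carries a created_at timestamp (any monotonically increasing version column works the same way; the other column names are illustrative):

SELECT customer_id, status, created_at
FROM (
    SELECT customer_id,
           status,
           created_at,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY created_at DESC) AS rn
    FROM customer_membership_status
) AS latest
WHERE rn = 1;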

Related

Using Identity or sequence in data warehouse

I'm new to data warehousing, so I'm trying to follow best practice and mimic some implementation details from the Microsoft demo database WideWorldImportersDW. One of the things I noticed is that it uses a Sequence as the default value for the PK instead of Identity.
Could I ask if it's preferable to use Sequence over Identity in a data warehouse in general, and which one is more convenient, especially during the ETL process?
A sequence has more guarantees than an identity column. In particular, each call to a sequence is guaranteed to produce the next value for the sequence.
However, an identity column can have gaps and other inconsistencies. This is all documented here.
Because of the additional guarantees on sequences, I suspect that they are slower. In particular, I suspect that the database cannot preallocate values in batches. That means that in a multi-threaded environment, sequences would impose serialization on transactions, slowing things down.
In general, I see identity used for identifying columns in tables. And although there is probably a performance comparison, I haven't seen one. But I suspect that sequences are a wee bit slower in some circumstances.
Both Sequence and Identity are designed for OLTP tables, to enable effective assignment of unique keys in a multi-session environment.
The important thing to realize is that in a data warehouse environment you often have a different setup, where only one job populates a specific table.
In a single-user environment you do not need the above features at all; you can simply assign the keys manually, starting with max(id) + 1 and incrementing by one for each row.
The general rule in data warehousing is that you should not search for a silver-bullet recommendation but check the functionality and performance in your own tests.
If you do some research on SQL Server Identity vs Sequence (e.g. here or here) you get varying results, partly preferring the former and partly the latter feature.
My recommendation is therefore to first perform a test with manually assigned IDs (i.e. with no overhead), simply to get a baseline for your expectations.
Then repeat it with both identity and sequence, compare, and choose.
The sequence in SQL Server was added later and is based on the Oracle sequence, so I would not expect it to have any fundamental problems.
Experience from Oracle tells us that you need a large enough cache on the sequence to support efficient bulk inserts.
Meanwhile, identity can also be defined as cached (IDENTITY_CACHE = { ON | OFF }), so once again: try all three possibilities (sequence, identity, nothing) and choose the best one.
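To make that comparison concrete, here is a hedged T-SQL sketch of the three variants being tested (table, sequence, and staging-table names are invented for the example):

-- 1) IDENTITY column (IDENTITY_CACHE can be toggled at database scope on newer versions)
CREATE TABLE dim_customer_identity (
    customer_key  INT IDENTITY(1,1) PRIMARY KEY,
    customer_name NVARCHAR(100)
);

-- 2) SEQUENCE as the key default, with a cache to help bulk loads
CREATE SEQUENCE dim_customer_seq AS INT START WITH 1 INCREMENT BY 1 CACHE 1000;
CREATE TABLE dim_customer_sequence (
    customer_key  INT PRIMARY KEY DEFAULT (NEXT VALUE FOR dim_customer_seq),
    customer_name NVARCHAR(100)
);

-- 3) Manual assignment in a single-writer ETL job: max(key) + row_number
CREATE TABLE dim_customer_manual (
    customer_key  INT PRIMARY KEY,
    customer_name NVARCHAR(100)
);
INSERT INTO dim_customer_manual (customer_key, customer_name)
SELECT COALESCE((SELECT MAX(customer_key) FROM dim_customer_manual), 0)
       + ROW_NUMBER() OVER (ORDER BY s.customer_name),
       s.customer_name
FROM staging_customer AS s;  -- staging_customer is a hypothetical staging table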
Identity is scoped to a single table, is part of the table definition (DDL) and is reset on a truncate. Identity is unique within the table. Each table has its own identity value when configured and cannot be shared across tables. In general usage, the "next" value is consumed by SQL Server when an Insert occurs on the table.+
Sequence is a first class object, scoped to the database. The "next" value is consumed when the Sequence is used (NEXT VALUE FOR).
Sequences are most effectively used when you need a person-readable unique identifier stored across multiple tables. For example, a ticketing system that stores ticket types in different tables may use a sequence to ensure no two tickets receive the same number, regardless of the table in which they are stored, and that a person can reasonably refer to the number (rather than a GUID).
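A small T-SQL sketch of that ticketing idea (table names are invented; both tables default their key from the same sequence):

CREATE SEQUENCE ticket_number_seq AS INT START WITH 1000 INCREMENT BY 1;

CREATE TABLE incident_ticket (
    ticket_number INT PRIMARY KEY DEFAULT (NEXT VALUE FOR ticket_number_seq),
    description   NVARCHAR(400)
);

CREATE TABLE service_request_ticket (
    ticket_number INT PRIMARY KEY DEFAULT (NEXT VALUE FOR ticket_number_seq),
    description   NVARCHAR(400)
);

-- Both tables draw from the same sequence, so a given ticket number is never
-- issued twice, regardless of which table the ticket lands in.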
In data warehousing, the dimension table needs a row identifier unique within the table. In general, the OLTP primary key is not sufficient as it may be duplicated within the dimension table depending on the type of dimension, and you don't want to risk assigning additional context to the OLTP PK as that can cause challenges when the source data changes. The dimension row identifier should only have meaning to the non-measure fact columns associated with it. Fact columns are not joined across different dimensions.++
Since the scope of the dimension table identifier is limited to the dimension table, an identity key is the ideal row identifier. It is simple to create, compact to store, and is meaningless outside the dimension. You won't use the dimension identity on a report. (Really, please don't be that developer.)
+ It's rare that you'll need to know the next value without assigning it to a row. It might be a red flag if you are trying to manipulate the identity value prior to assignment.
++ A dimension view may union different tables to feed the OLAP cube, in which case a persistent, repeatable key should be generated from the underlying data, usually by concatenating a string literal with each table key in a normalized format.

SQL Server Auditing Data in the Same Table

A project I'm working on requires that a record be digitally "signed", and after that any modification would create a new "version" of the row. The "signed" record can't be modified for regulatory reasons, and new versions shouldn't be modified very often. In the past, I've done this by creating a separate logging table with the same schema as the main table, plus some extra columns for tracking who modified it and when.
However, after doing some work with SharePoint where ALL data (including different versions) is put into the same table I thought of a different approach which I can't find any examples of people doing: I could put the new version of the row right in the same table and increment the version number. Then add the version number to the PK.
PROS:
Implementation is easy: just create an INSTEAD OF UPDATE trigger which performs an insert instead of an update if the row is "signed" (see the sketch after this list). I could easily add an IsCurrentVersion column to be updated in the trigger.
Querying for older versions is easy: just get all the records with the ID I want and let the user choose from the list.
A trigger is nice because it guarantees that a row CAN'T be updated if signed (for regulatory and audit purposes).
Schema changes to the table don't have to be replicated to the mirror "logging" table.
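For illustration, here is a rough T-SQL sketch of what that trigger could look like, assuming a hypothetical dbo.Document table keyed on (DocumentId, Version) with Title, Body, IsSigned, and IsCurrentVersion columns:

CREATE TRIGGER trg_document_versioning
ON dbo.Document
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Rows that are not yet signed can be updated in place.
    UPDATE d
    SET    d.Title    = i.Title,
           d.Body     = i.Body,
           d.IsSigned = i.IsSigned
    FROM   dbo.Document AS d
    JOIN   inserted     AS i
      ON   i.DocumentId = d.DocumentId AND i.Version = d.Version
    WHERE  d.IsSigned = 0;

    -- Signed rows are never touched; the old row just loses its "current" flag...
    UPDATE d
    SET    d.IsCurrentVersion = 0
    FROM   dbo.Document AS d
    JOIN   inserted     AS i
      ON   i.DocumentId = d.DocumentId AND i.Version = d.Version
    WHERE  d.IsSigned = 1;

    -- ...and a new, unsigned version row is inserted instead.
    INSERT INTO dbo.Document (DocumentId, Version, Title, Body, IsSigned, IsCurrentVersion)
    SELECT i.DocumentId, i.Version + 1, i.Title, i.Body, 0, 1
    FROM   inserted AS i
    JOIN   deleted  AS dlt
      ON   dlt.DocumentId = i.DocumentId AND dlt.Version = i.Version
    WHERE  dlt.IsSigned = 1;
END;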
CONS:
The table could get a bit larger but most of the time the record won't be changed after "signing" it. The client estimated around 100,000 rows/year max at current usage levels. SQL Server can handle hundreds of millions of rows so this doesn't seem too bad.
Indexing and performance could be an issue. SharePoint adds a tp_CalculatedVersion int to the PK where the calculated number is always 0 for the latest version. I could do the same and calculate it based off the Version number. Would that help performance?
There is an extra step in querying the data to make sure you get the latest version but that could be handled in a SP.
What other cons are there in this scenario? Am I missing anything?
I've seen this pattern used in an enterprise system before, and IMO it wasn't successful.
You are mixing two different concerns here, viz. storage of live and audit data. Queries against this table will always need to keep in mind whether they are seeking leaf or audit data (e.g. reports); new team members may find this non-intuitive. You would likely need to encapsulate this complexity with views etc.
As you mentioned, performance will always be a concern. Inserting a new record will also need to update the previous record to mark it as inactive. You may also need to consider changing your clustered index to keep all versions on the same page.
Foreign keys to this table are going to be problematic. Do you reference an exact version record? Do you then fix up the foreign keys to point to the new live leaf record?
The one benefit I can think of in doing this is that the audit table DDL will always be in sync with the live table - often, with the two-table strategy, changes are made to the live table and the audit and trigger DDL isn't updated accordingly.
Overall, I would still recommend keeping your audit table separate.
If the requirement is that the signed data not be changed, then you should move it to another table. In fact, I might suggest moving it to another database/schema, where the only operation allowed on the table is inserting and reading records. You can use both permissions and triggers, if you really want to prevent updates.
You don't want to mess around with regulatory requirements. A complex schema that uses a combination of primary key with version, along with triggers, is a sign that there might be a simpler way.
The historical records can affect performance of the current records. If you end up in a situation where every record has changed 100 times, then keeping them in the same table is just going to slow down queries. Of course, you can embark on more complexity, in the form of partitioning the data. In the end, the solution is simpler: keep the data that cannot be changed in another table where it cannot be changed. You don't want to have to upgrade the hardware just because lots of history has accumulated.
I would also suggest including an effective and end date in the history records. This will allow you to reconstruct all the data as of a particular date, something that users might find useful in the future.
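For illustration, a hedged T-SQL sketch of that separation (schema, table, column, and role names are all invented; the application role only ever gets INSERT and SELECT):

-- Assumes an "audit" schema already exists; signed history is insert-only.
CREATE TABLE audit.ContractHistory (
    ContractId    INT           NOT NULL,
    EffectiveDate DATETIME2     NOT NULL,
    EndDate       DATETIME2     NULL,          -- NULL = still the current version
    SignedBy      NVARCHAR(100) NOT NULL,
    Payload       NVARCHAR(MAX) NOT NULL,
    CONSTRAINT PK_ContractHistory PRIMARY KEY (ContractId, EffectiveDate)
);

DENY  UPDATE, DELETE ON audit.ContractHistory TO app_role;
GRANT INSERT, SELECT ON audit.ContractHistory TO app_role;

-- Reconstruct the data as of a particular date:
DECLARE @AsOfDate DATETIME2 = '2020-01-01';
SELECT *
FROM   audit.ContractHistory
WHERE  EffectiveDate <= @AsOfDate
  AND  (EndDate IS NULL OR EndDate > @AsOfDate);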
That's right. Audit trails can stay in an application for internal reporting/audit, but infosec best practice mandates getting audit logs off the system where they are generated and into your log management / SIEM solution.

Why does Wordpress have separate 'usersmeta' and 'users' SQL tables. Why not combine them?

Alongside the users table, Wordpress has a usersmeta table with the following columns:
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usersmeta table, regardless of whether or not the rows have a filled-in meta_value. That said, would it not be more efficient to add the always-present meta rows to the users table?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?
Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that wordpress sets (20 stated in the question), but there can be any number. If all users have one thousand meta data entries - there are simply one thousand entries in the usermeta table for each user - it doesn't have (in terms of the database structure) a limit to the number of meta data entries a user can have. It also permits one user to have one thousand meta data entries whilst all others have 20, and still stores the data efficiently - or any permutation thereof.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first point: it is not a good idea to grant blanket access to alter the schema of your database tables, and this is a (sane) roadblock for many if not most wordpress installs (hosted on wordpress.com or on a shared host where the db user has no ALTER permissions). MySQL also has a hard limit of 4096 columns and 65,535 bytes per row. Attempting to store a large number of columns in a single table will eventually fail, and along the way create a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a meta-data value.
Wordpress is quite tied to MySQL, and therefore changing the datastore isn't a realistic option.
Further WP info
If you aren't using any/many plugins it's possible you will have a constant number of rows in the usermeta table for each user, but typically each plugin you add may need to add meta-data for users; the number added may not be trivial and this data is stored in the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without the need to modify the schema of the users table.
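For example, a query along these lines (assuming the standard wp_ table prefix) finds those users without any schema change:

SELECT u.ID, u.user_login
FROM wp_users AS u
JOIN wp_usermeta AS um
  ON um.user_id = u.ID
WHERE um.meta_key = 'favorite_color'
  AND um.meta_value = 'blue';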
This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer
Since there is a huge literature about this, and there are a lot of differences, I will just give some examples of why this might happen. It boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since wordpress does a lot of different things, it may have various reasons to keep them separate - space could be an issue, speed of queries may depend on this, it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading
This is a deep topic; you may want to learn more - there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at this previous SO question about designing a database: Database design: one huge table or separate tables?, or this one: First-time database design: am I overengineering?, or Database Normalization Basics on About.com.

Intelligent ways to maintain versioned copies of tables in a relational database (programming considerations?)

My second consideration was to simply make a copy of the table data associated with the active document or folder of documents that the baseline was applied to, and store that for safekeeping. The code complexity savings would be enormous and the schema would be kept simpler, although there is some impact in terms of storage usage (though I do not believe it will be that big).
This is something I'm faced with as well on a current project. Some good answers here.
I'm personally using option #1, adding a version column to the pkey of each "versioned" table. FKeys use the version column as well. My main consideration was ease of querying, with a consequence of high storage requirements. I have the leeway to strictly limit the number of versions kept around, which mitigates that issue.
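For concreteness, a minimal sketch of that layout, with invented table and column names:

CREATE TABLE document (
    document_id INT NOT NULL,
    version     INT NOT NULL,
    title       VARCHAR(200),
    PRIMARY KEY (document_id, version)
);

-- Child rows reference a specific version of the parent.
CREATE TABLE document_attachment (
    attachment_id    INT NOT NULL PRIMARY KEY,
    document_id      INT NOT NULL,
    document_version INT NOT NULL,
    file_name        VARCHAR(260),
    FOREIGN KEY (document_id, document_version)
        REFERENCES document (document_id, version)
);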
I really think you are looking for a temporal database. Our own implementation added a pair of audit log identifiers to a table, the "create ID" and "delete ID" (and every transaction that modified the database was summarized in an "audit log" and given a unique sequential identifier). For each transaction that added a record to the table, the audit log identifier was used as the create ID and the delete ID was zero. When a record changed, the new version had the new create ID and the same identifier was used as the delete ID for the old version.
It was quite easy to create a view of a table where the delete ID was zero, creating a virtual table of just current data. You could also find the content of the tables just before or just after a specific transaction was processed, get a list of the transactions that modified a specific row of a table, etc.
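A simplified sketch of the idea (table and column names are invented, generic SQL types; the real implementation also maintained the audit log table itself, which is omitted here):

-- Every versioned table carries the two audit log ids.
CREATE TABLE account_balance (
    account_id INTEGER        NOT NULL,
    balance    DECIMAL(18, 2) NOT NULL,
    create_id  INTEGER        NOT NULL,           -- audit log id of the transaction that created this version
    delete_id  INTEGER DEFAULT 0 NOT NULL         -- 0 = current; otherwise the transaction that superseded it
);

-- Virtual table of just the current data.
CREATE VIEW current_account_balance AS
SELECT account_id, balance
FROM account_balance
WHERE delete_id = 0;

-- Contents of the table just after audit log entry 12345 was processed.
SELECT account_id, balance
FROM account_balance
WHERE create_id <= 12345
  AND (delete_id = 0 OR delete_id > 12345);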
There are time, space, and complexity penalties for all this. But for us, dealing with large sums of money, it was worth it.

Database history for client usage

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point, my best idea is to create a few history tables that reflect the output I want to show to the users. Whenever a change is made to the relevant tables, I would update these history tables with the data as well.
I'm trying to figure out what the best way to go about this would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET
I have very successfully used a model where every table has an audit copy - the same table with a few additional fields (timestamp, user id, operation type) - and 3 triggers on the original table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
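A minimal Oracle-flavoured sketch of one such audit copy, collapsing the three triggers into a single row-level trigger for brevity (the customer table and its columns are invented):

-- Audit copy of a hypothetical CUSTOMER table: same columns plus audit metadata.
CREATE TABLE customer_aud (
    customer_id   NUMBER,
    customer_name VARCHAR2(100),
    audit_ts      TIMESTAMP    DEFAULT SYSTIMESTAMP,
    audit_user    VARCHAR2(30) DEFAULT USER,
    audit_op      VARCHAR2(1)             -- 'I', 'U' or 'D'
);

CREATE OR REPLACE TRIGGER customer_audit_trg
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
    IF DELETING THEN
        INSERT INTO customer_aud (customer_id, customer_name, audit_op)
        VALUES (:OLD.customer_id, :OLD.customer_name, 'D');
    ELSE
        INSERT INTO customer_aud (customer_id, customer_name, audit_op)
        VALUES (:NEW.customer_id, :NEW.customer_name,
                CASE WHEN INSERTING THEN 'I' ELSE 'U' END);
    END IF;
END;
/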
We've got that requirement in our systems. We added two tables, one header and one detail, called AuditRow and AuditField. The AuditRow table contains one row per row changed in any other table, and AuditField contains one row per changed column, with the old value and new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed column) on each insert/update/delete. This system does rely on the fact that we have a guid on every table that can uniquely represent the row. It doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.
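Roughly, the two audit tables look something like this (column names are approximations, Oracle types, ids assumed to come from sequences):

CREATE TABLE audit_row (
    audit_row_id NUMBER        NOT NULL PRIMARY KEY,  -- populated from a sequence
    table_name   VARCHAR2(128) NOT NULL,
    row_guid     RAW(16)       NOT NULL,              -- the per-row guid every audited table carries
    operation    VARCHAR2(1)   NOT NULL,              -- 'I', 'U' or 'D'
    changed_by   VARCHAR2(128) NOT NULL,
    changed_at   TIMESTAMP     DEFAULT SYSTIMESTAMP NOT NULL
);

CREATE TABLE audit_field (
    audit_field_id NUMBER        NOT NULL PRIMARY KEY,  -- populated from a sequence
    audit_row_id   NUMBER        NOT NULL REFERENCES audit_row (audit_row_id),
    column_name    VARCHAR2(128) NOT NULL,
    old_value      VARCHAR2(4000),
    new_value      VARCHAR2(4000)
);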
If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive
Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using a PL/SQL API to do the INSERTs/UPDATEs/DELETEs, you could manage this with a simple shift in design, without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, the DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by making DATE_THRU the current date/time (SYSDATE). Showing the history is as simple as selecting from the table, the one record where DATE_THRU is NULL will be your current or active record.
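A hedged sketch of what one such API procedure might look like (procedure, table, and column names are all illustrative):

-- "End date" the current row, then insert the replacement; the current row is the one with DATE_THRU IS NULL.
CREATE OR REPLACE PROCEDURE set_customer_status (
    p_customer_id IN NUMBER,
    p_status      IN VARCHAR2
) AS
BEGIN
    UPDATE customer_status
    SET    date_thru = SYSDATE
    WHERE  customer_id = p_customer_id
    AND    date_thru IS NULL;

    INSERT INTO customer_status (customer_id, status, date_from, date_thru)
    VALUES (p_customer_id, p_status, SYSDATE, NULL);
END set_customer_status;
/

-- Current/active record for a given customer:
SELECT * FROM customer_status WHERE customer_id = 42 AND date_thru IS NULL;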
Now if you expect a high volume of changes, writing off the old record to a history table would be preferable, but I still wouldn't manage it with triggers, I'd do it with the API.
Hope that helps.