How do I enforce data integrity rules in my database? - sql

I'm designing this collection of classes and abstract (MustInherit) classes…
This is the database table where I'm going to store all this…
As far as the Microsoft SQL Server database knows, those are all nullable ("Allow Nulls") columns.
But really, that depends on the class stored there: LinkNode, HtmlPageNode, or CodePageNode.
Rules might look like this...
How do I enforce such data integrity rules within my database?
UPDATE: Regarding this single-table design...
I'm still trying to zero in on a final architecture.
I initially started with many small tables with almost zero nullable fields.
Which is the best database schema for my navigation?
And I learned about the LINQ to SQL IsDiscriminator property.
What’s the best way to handle one-to-one relationships in SQL?
But then I learned that LINQ to SQL only supports single table inheritance.
Can a LINQ to SQL IsDiscriminator column NOT inherit?
Now I'm trying to handle it with a collection of classes and abstract classes.
Please help me with my .NET abstract classes.

Use CHECK constraints on the table. These allow you to use any kind of boolean logic (including on other values in the table) to allow/reject the data.
From the Books Online site:
"You can create a CHECK constraint with any logical (Boolean) expression that returns TRUE or FALSE based on the logical operators. For the previous example, the logical expression is: salary >= 15000 AND salary <= 100000."
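As a rough illustration of how a CHECK constraint could make nullability depend on the node type, here is a minimal sketch. It runs against SQLite through Python's sqlite3 module only so that it is self-contained; the table name Node and the columns NodeType, Url and Body are hypothetical, since the question's actual columns are not shown. The CHECK expression itself is plain Boolean SQL and should carry over to SQL Server with little change, but treat that as an assumption to verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Node (
        NodeId   INTEGER PRIMARY KEY,
        NodeType TEXT NOT NULL
                 CHECK (NodeType IN ('LinkNode', 'HtmlPageNode', 'CodePageNode')),
        Url      TEXT,   -- hypothetical column: only LinkNode rows carry a URL
        Body     TEXT,   -- hypothetical column: only page nodes carry content
        CONSTRAINT CK_Node_Columns_Match_Type CHECK (
            (NodeType = 'LinkNode' AND Url IS NOT NULL AND Body IS NULL)
            OR
            (NodeType IN ('HtmlPageNode', 'CodePageNode')
                 AND Url IS NULL AND Body IS NOT NULL)
        )
    )
""")


def try_insert(node_type, url, body):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Node (NodeType, Url, Body) VALUES (?, ?, ?)",
                (node_type, url, body),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("LinkNode", "http://example.com", None))  # a valid link row
print(try_insert("LinkNode", None, None))                  # a link row without a URL
```

The columns stay nullable as far as the column definitions go; the table-level constraint is what ties each column's nullability to the row's type.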

It looks like you are attempting the Single Table Inheritance pattern, which is covered in the Object-Relational Structural Patterns section of the book Patterns of Enterprise Application Architecture.
I would recommend the Class Table Inheritance or Concrete Table Inheritance patterns if you wish to enforce data integrity via SQL table constraints.
Though it wouldn't be my first suggestion, you could still use Single Table Inheritance and just enforce the constraints via a Stored Procedure.

You can set up some insert/update triggers. Just check whether these fields are null or not null, and reject the insert/update operation if needed. This is a good solution if you want to store all the data in the same table.
You can also create a separate table for each class.
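The trigger approach might look roughly like the sketch below. It uses SQLite via Python's sqlite3 module so that it can be run as-is; SQL Server trigger syntax differs (for example, it works against the inserted pseudo-table rather than NEW), so this only illustrates the idea. The Node table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Node (
        NodeId   INTEGER PRIMARY KEY,
        NodeType TEXT NOT NULL,
        Url      TEXT              -- hypothetical column
    );

    -- Reject new LinkNode rows that have no Url.
    CREATE TRIGGER trg_Node_insert BEFORE INSERT ON Node
    WHEN NEW.NodeType = 'LinkNode' AND NEW.Url IS NULL
    BEGIN
        SELECT RAISE(ABORT, 'LinkNode requires a Url');
    END;

    -- Apply the same rule to updates, so a valid row cannot be made invalid later.
    CREATE TRIGGER trg_Node_update BEFORE UPDATE ON Node
    WHEN NEW.NodeType = 'LinkNode' AND NEW.Url IS NULL
    BEGIN
        SELECT RAISE(ABORT, 'LinkNode requires a Url');
    END;
""")


def attempt(sql, params=()):
    """Run one statement; return True if it succeeds, False if it is rejected."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.DatabaseError:
        return False


ok = attempt("INSERT INTO Node (NodeId, NodeType, Url) VALUES (1, 'LinkNode', 'http://example.com')")
missing_url = attempt("INSERT INTO Node (NodeId, NodeType) VALUES (2, 'LinkNode')")
nulled_out = attempt("UPDATE Node SET Url = NULL WHERE NodeId = 1")
print(ok, missing_url, nulled_out)
```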

Have a unique table for each type of node.
Why not just make the class you're building enforce the data integrity for its own type?
EDIT
In that case, you can either a) use logical constraints (see below) or b) stored procedures to do inserts/edits (a good idea regardless) or c) again, just make the class enforce data integrity.
A mixture of C & B would be the course of events I take. I would have unique stored procedures for add/edits for each node type (i.e. Insert_Update_NodeType) as well as make the class perform data validation before saving data.

Personally I always insist on putting data integrity code on the table itself, either via a trigger or a check constraint. The reason is that you cannot guarantee that only the user interface will update, insert, or delete records. Nor can you guarantee that someone will not write a second stored procedure that gets around the constraints in the original one, either without understanding the actual data integrity rules or simply because he or she is unaware that the procedure with the rules exists. Tables are often affected by DTS or SSIS packages, dynamic queries from the user interface, ad hoc queries through Query Analyzer or the query window, or even scheduled jobs that run code. If you do not put the data integrity code at the table level, sooner or later your data will not have integrity.

It's probably not the answer you want to hear, but the best way to avoid logical inconsistencies is to look at database normalisation.

Stephen's answer is the best. But if you MUST, you could add a check constraint on the HtmlOrCode column and the other columns which need to change.

I am not that familiar with SQL Server, but I know with Oracle you can specify Constraints that you could use to do what you are looking for. I am pretty sure you can define constraints in SQL server also though.
EDIT: I found this link that seems to have a lot information, kind of long but may be worth a read.

Enforcing Data Integrity in Databases
Basically, there are four primary types of data integrity: entity, domain, referential and user-defined.
Entity integrity applies at the row level; domain integrity applies at the column level, and referential integrity applies at the table level.
Entity Integrity ensures a table does not have any duplicate rows and is uniquely identified.
Domain Integrity requires that a set of data values fall within a specific range (domain) in order to be valid. In other words, domain integrity defines the permissible entries for a given column by restricting the data type, format, or range of possible values.
Referential Integrity is concerned with keeping the relationships between tables synchronized.
@Zack: You can also check out this blog to read more details about data integrity enforcement: https://www.bugraptors.com/what-is-data-integrity/

SQL Server doesn't know anything about your classes. I think that you'll have to enforce this by using a Factory class that constructs/deconstructs all these for you and makes sure that you're passing the right values depending upon the type.
Technically this is not "enforcing the rules in the database" but I don't think that this can be done in a single table. Fields either accept nulls or they don't.
Another idea could be to explore SQL Functions and Stored Procedures that do the same thing. But you cannot enforce a field to be NOT NULL for one record and NULL for the next one. That's your Business Layer / Factory job.

Have you tried NHibernate? It's a much more mature product than Entity Framework. It's free.

Related

SQL: Best way to conditionally relate multiple tables to a single table based on row value in the main table

In a nutshell here's the situation, we have a database that is used to build a hierarchy of "locations". (Example: Street Address > Building 1 > First Floor > Room).
Each of the locations is stored in a table. Each location can be of a different "type". The types are defined in another table. (We use the types of locations to restrict what locations can be added to a location).
Here's the quandary we are facing: We need to be able to store different types of information for different types of locations. For example, a location type of "building" may need to have the address stored, whereas a location type of "room" may need to have dimensions or paint color stored.
Obviously we could create a table for each location type we define to hold the properties required for the particular location type and then use application logic to query the appropriate table to pull in a particular location's additional information. Is there a more elegant or practical way to accomplish this relationally in the database without having to rely on application logic?
Thanks!
The relationally pure way to do this would be to implement your initial suggestion i.e. have a separate table such as BUILDING_LOCATIONS that holds the attributes that are only applicable to this type of location. A different [TYPE]_LOCATIONS table would be created for each type of location that has its own attributes. In this way you could use standard database constraint functionality to ensure the integrity of the data in these tables.
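A minimal sketch of the separate-table layout follows, using SQLite through Python's sqlite3 module so it runs standalone. Only LOCATIONS and BUILDING_LOCATIONS are named in the text above; ROOM_LOCATIONS and all column names are assumptions made for illustration. Note that this simple version does not stop a "room" location from getting a row in BUILDING_LOCATIONS; that would need an extra constraint tying the subtype row to the location's type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE LOCATIONS (
        LOCATION_ID   INTEGER PRIMARY KEY,
        LOCATION_TYPE TEXT NOT NULL,
        NAME          TEXT NOT NULL
    );
    -- One table per location type, sharing the parent's key.
    CREATE TABLE BUILDING_LOCATIONS (
        LOCATION_ID INTEGER PRIMARY KEY REFERENCES LOCATIONS (LOCATION_ID),
        ADDRESS     TEXT NOT NULL      -- mandatory for every building
    );
    CREATE TABLE ROOM_LOCATIONS (
        LOCATION_ID INTEGER PRIMARY KEY REFERENCES LOCATIONS (LOCATION_ID),
        PAINT_COLOR TEXT NOT NULL      -- mandatory for every room
    );
    INSERT INTO LOCATIONS VALUES (1, 'building', 'Building 1');
    INSERT INTO LOCATIONS VALUES (2, 'room', 'Room 101');
    INSERT INTO LOCATIONS VALUES (3, 'building', 'Building 2');
""")


def attempt(sql, params=()):
    """Run one statement; return True if it succeeds, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


accepted = attempt("INSERT INTO BUILDING_LOCATIONS VALUES (1, '12 Example Street')")
no_address = attempt("INSERT INTO BUILDING_LOCATIONS (LOCATION_ID) VALUES (3)")
orphan = attempt("INSERT INTO ROOM_LOCATIONS VALUES (99, 'blue')")

# Pulling a building together with its type-specific attributes is a plain join.
buildings = conn.execute("""
    SELECT l.NAME, b.ADDRESS
    FROM LOCATIONS AS l
    JOIN BUILDING_LOCATIONS AS b ON b.LOCATION_ID = l.LOCATION_ID
""").fetchall()
print(accepted, no_address, orphan, buildings)
```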
Another method would be to add a series of nullable columns to the LOCATIONS table such as BUILDING_ADDRESS and ROOM_DIMENSIONS. This is not relationally pure as it means null values can exist in this table. However, you can still use standard database functionality to ensure the integrity of the data. It can be a bit more convoluted if certain values are mandatory in certain situations. Also, if there are many types of location with many differing attributes the number of columns in the table can become unwieldy.
Another method is the Entity-Attribute-Value model. Generally, this is to be avoided if at all possible. It is not relationally pure, as your column values are now no longer defined over domains, and it is extremely difficult, if not impossible, to ensure the integrity of the data. Any real attempt to do so will require a lot of bespoke coding (which needs to be carefully implemented to cater for things like concurrency control that database constraints give you for free) as you cannot use standard database constraints. However, if you are just interested in storing values for information and not doing anything with them you could use this method.
The EAV method does have a danger that because it appears so easy to add attributes to an entity, it becomes the default way of doing so. It is then used to add attributes for which vital processing is dependent and, because you cannot ensure the integrity of the data using this method, you find the values being used are meaningless and the whole logical basis for the processing is destroyed.

How do I represent this model in tables?

I have a table of warehouses and a table of clients to manage several warehouses belonging to different clients
warehouse
=====
id
address
capacity
owner_client
client
=====
id
name
My issue is, I have an ACME client, and ACME has an "ACME safety rating" attribute only applicable to their warehouses. Currently we just have this as a field of warehouses and it's null for non-ACME warehouses. But this feels wrong and has required some workarounds and special cases.
What's the best way to represent this? I've thought of making an "Acme safety ratings" table with the number and FK to the warehouse, but now I've made a table specific for one client? What if we need to start tracking "is_foobar_accesible" for the baz client?
The relationally pure way to do this would be to implement your initial suggestion i.e. have a separate table such as ACME_WAREHOUSES that holds the attributes such as SAFETY_RATING that are only applicable to this client. A different CLIENT_WAREHOUSES table would be created for each client that has its own attributes. In this way you could use standard database constraint functionality to ensure the integrity of the data in these tables.
Another method would be to add a series of nullable columns to the WAREHOUSES table such as ACME_SAFETY_RATING and BAZ_FOOBAR_ACCESSIBLE. This is not relationally pure as it means null values can exist in this table. However, you can still use standard database functionality to ensure the integrity of the data. It can be a bit more convoluted if certain values are mandatory in certain situations. Also, if there are many clients with many differing attributes the number of columns in the table can become unwieldy.
Another method is the Entity-Attribute-Value model. Generally, this is to be avoided if at all possible. It is not relationally pure, as your column values are now no longer defined over domains, and it is extremely difficult, if not impossible, to ensure the integrity of the data. Any real attempt to do so will require a lot of bespoke coding (which needs to be carefully implemented to cater for things like concurrency control that database constraints give you for free) as you cannot use standard database constraints. However, if you are just interested in storing values for information and not doing anything with them you could use this method.
The EAV method does have a danger that because it appears so easy to add attributes to an entity, it becomes the default way of doing so. It is then used to add attributes for which vital processing is dependent and, because you cannot ensure the integrity of the data using this method, you find the values being used are meaningless and the whole logical basis for the processing is destroyed.
I would create a ClientProperty and ClientWarehousePropertyValue table so that you can store these Client owned properties and their values for each warehouse:
ClientProperty
===============
ID
ClientID
Name
ClientWarehousePropertyValue
============================
WarehouseID
ClientPropertyID
Value
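Here is one way the two tables above might be wired up and queried, as a sketch only. It uses SQLite via Python's sqlite3 module to stay self-contained. The table and column names come from the layout above and from the question; the data types, the keys, the UNIQUE constraint and the sample rows are assumptions. As written, nothing prevents an ACME property from being attached to another client's warehouse, so that rule would still need a constraint or trigger of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE warehouse (
        id           INTEGER PRIMARY KEY,
        address      TEXT,
        capacity     INTEGER,
        owner_client INTEGER NOT NULL REFERENCES client (id)
    );
    CREATE TABLE ClientProperty (
        ID       INTEGER PRIMARY KEY,
        ClientID INTEGER NOT NULL REFERENCES client (id),
        Name     TEXT NOT NULL,
        UNIQUE (ClientID, Name)
    );
    CREATE TABLE ClientWarehousePropertyValue (
        WarehouseID      INTEGER NOT NULL REFERENCES warehouse (id),
        ClientPropertyID INTEGER NOT NULL REFERENCES ClientProperty (ID),
        Value            TEXT,
        PRIMARY KEY (WarehouseID, ClientPropertyID)
    );
    -- Sample rows, invented for the example.
    INSERT INTO client VALUES (1, 'ACME'), (2, 'baz');
    INSERT INTO warehouse VALUES (10, '1 Main St', 500, 1), (20, '2 Side St', 300, 2);
    INSERT INTO ClientProperty VALUES (100, 1, 'safety_rating');
    INSERT INTO ClientWarehousePropertyValue VALUES (10, 100, '4');
""")


def properties_for(warehouse_id):
    """Return the client-specific properties of one warehouse as a dict."""
    rows = conn.execute("""
        SELECT p.Name, v.Value
        FROM ClientWarehousePropertyValue AS v
        JOIN ClientProperty AS p ON p.ID = v.ClientPropertyID
        WHERE v.WarehouseID = ?
    """, (warehouse_id,)).fetchall()
    return dict(rows)


print(properties_for(10))
print(properties_for(20))
```

Because every value lands in one Value column, its type and range cannot be constrained per property, which is the trade-off the Entity-Attribute-Value discussion above describes.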

SQL: Advantages of an ENUM vs. a one-to-many relationship?

I very rarely see ENUM datatypes used in the wild; a developer almost always just uses a secondary table that looks like this:
CREATE TABLE officer_ranks (
id int PRIMARY KEY
,title varchar NOT NULL UNIQUE);
INSERT INTO officer_ranks VALUES (1,'2LT'),(2,'1LT'),(3,'CPT'),(4,'MAJ'),(5,'LTC'),(6,'COL'),(7,'BG'),(8,'MG'),(9,'LTG'),(10,'GEN');
CREATE TABLE officers (
solider_name varchar NOT NULL
,rank int NOT NULL REFERENCES officer_ranks(id) ON DELETE RESTRICT
,serial_num varchar PRIMARY KEY);
But the same thing can also be shown using a user-defined type / ENUM:
CREATE TYPE officer_rank AS ENUM ('2LT', '1LT','CPT','MAJ','LTC','COL','BG','MG','LTG','GEN');
CREATE TABLE officers (
solider_name varchar NOT NULL
,rank officer_rank NOT NULL
,serial_num varchar PRIMARY KEY);
(Example shown using PostgreSQL, but other RDBMS's have similar syntax)
The biggest disadvantage I see to using an ENUM is that it's more difficult to update from within an application. And it might also confuse an inexperienced developer who's used to using a SQL DB simply as a bit bucket.
Assuming that the information is mostly static (weekday names, month names, US Army ranks, etc) is there any advantage to using a ENUM?
Example shown using PostgreSQL, but other RDBMS's have similar syntax
That's incorrect. It is not an ISO/IEC/ANSI SQL requirement, so the commercial databases do not provide it (you are supposed to provide Lookup tables). The small end of town implement various "extras", but do not implement the stricter requirements, or the grunt, of the big end of town.
We do not have ENUMs as part of a DataType either, that is absurd.
The first disadvantage of ENUMs is that it is non-standard and therefore not portable.
The second big disadvantage of ENUMs is, that the database is Closed. The hundreds of Report Tools that can be used on a database (independent of the app), cannot find them, and therefore cannot project the names/meanings. If you had a normal Standard SQL Lookup table, that problem is eliminated.
The third is, when you change the values, you have to change DDL. In a Normal Standard SQL database, you simply Insert/Update/Delete a row in the Lookup table.
Last, you cannot easily get a list of the content of the ENUM; you can with a Lookup table. More important, you have a vector to perform any Dimension-Fact queries with, eliminating the need for selecting from the large Fact table and GROUP BY.
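The lookup-table points above (values are ordinary rows, and the list can be read or changed with plain DML) can be sketched as follows. The example uses SQLite via Python's sqlite3 module so that it runs standalone; the tables are trimmed versions of the ones in the question, and the column name rank_id and the sample serial number are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE officer_ranks (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL UNIQUE
    );
    INSERT INTO officer_ranks VALUES (1, '2LT'), (2, '1LT'), (3, 'CPT');

    CREATE TABLE officers (
        serial_num TEXT PRIMARY KEY,
        rank_id    INTEGER NOT NULL REFERENCES officer_ranks (id) ON DELETE RESTRICT
    );
""")


def list_ranks():
    """All allowed ranks, whether or not any officer currently holds them."""
    return [title for (title,) in conn.execute(
        "SELECT title FROM officer_ranks ORDER BY id")]


def add_officer(serial_num, rank_id):
    """Return True if the officer was stored, False if the rank does not exist."""
    try:
        with conn:
            conn.execute("INSERT INTO officers VALUES (?, ?)", (serial_num, rank_id))
        return True
    except sqlite3.IntegrityError:
        return False


before = list_ranks()
with conn:
    # Adding a rank is an INSERT, not a DDL change.
    conn.execute("INSERT INTO officer_ranks VALUES (4, 'MAJ')")
after = list_ranks()
unknown_rank = add_officer("A-001", 99)
print(before, after, unknown_rank)
```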
I don't see any advantage in using ENUMS.
They are harder to maintain and don't offer anything that a regular lookup table with proper foreign keys wouldn't allow you to do.
A disadvantage of using something like an ENUM is that you can't get a list of all the available values if they don't happen to exist in your data table, unless you hard-code the list of available values somewhere. For example, if in your OFFICERS table you don't happen to have an MG on post there's no way to know the rank exists. Thus, when BG Blowhard is relieved by MG Marjorie-Banks you'll have no way to enter the new officer's rank - which is a shame, as he is the very model of a modern Major General. :-) And what happens when a General of the Army (five-star general) shows up?
For simple types which will not change I've used domains successfully. For example, in one of my databases I've got a yes_no_dom domain defined as follows:
CREATE DOMAIN yes_no_dom
AS character(1)
DEFAULT 'N'::bpchar
NOT NULL
CONSTRAINT yes_no_dom_check
CHECK ((VALUE = ANY (ARRAY['Y'::bpchar, 'N'::bpchar])));
Share and enjoy.
ENUMS are very-very-very useful! You just have to know how to use them:
An ENUM uses very little storage: 1 or 2 bytes in MySQL, 4 bytes in PostgreSQL.
No need for additional constraint (as replacement for FK).
Cheaper changes of Values compared to natural values in FKs.
No need for additional JOIN
ENUMs are ordered, e.g. you can compare if Monday < Friday, or January is < June or Project Initiation is < Payroll.
Thus if you have a fixed list of string values which you want to use, an ENUM is a better solution compared to a lookup table. Let's say you need to list amino acids in your products, with their respective weight. Today there are ~20 amino acids. If you stored their full names, you'd need much more space each time than an ENUM value takes. The other option is to use artificial keys and to link to a foreign table. But what would the foreign table look like? Would it have 2 columns: ID and Amino Acid Name? And you would join that table every time? What if your main table has >40 such fields? Querying that table would involve >40 joins.
If your database hosts 1600 Tables, 400 of which are lookup tables which just replace ENUMs, your devs will waste lots of time navigating through them (in addition to the JOINs). Yes, you can work with prefixes, schemas and such.... but why not just kick those tables out?
ENUMS are Enumerated lists / ordered. That means that if you have values which are ordered, you are actually saving the hassle of maintaining a 3 columns lookup table.
The question is rather: why do I need lookup tables then?
Well, the answer is easy:
When your values are changing often
When you need to store more additional attributes --> The lookup table corresponds to a full fledged data object, and not a lookup list.
When you need it quick and dirty
And now the funny thing:
Lookup Tables and ENUMS are not complete replacements for each other!!!!
If you have a list whose PK is a single-column natural key, and the list can grow or the values can be renamed (for some reason), then you could define an ENUM and use it for both: the PK in the lookup table and the FK in the main tables!
Example benefit:
you have to change the name of a lookup key. Without using the ENUM the DBMS will have to cascade the changes to all tables, where you use this value and not just your lookup table. If you are using ENUM, then you just change the value of ENUM, and there are no changes to the data.
A small advantage may lie in the fact, that you have a sort of UDT when creating an ENUM. A user defined type can be reused formally in many other database objects, e.g. in views, other tables, other types, stored procedures (in other RDBMS), etc.
Another advantage is for documentation of the allowed values of a field. Examples:
A yes/no field
A male/female field
A mr/mrs/ms/dr field
Probably a matter of taste. I prefer ENUMs for these kinds of fields, rather than foreign keys to lookup tables for such simple concepts.
Yet another advantage may be that when you use code generation or ORMs like jOOQ in Java, you can use that ENUM to generate a Java enum class from it, instead of joining the lookup table, or working with the ENUM literal's ID
It's a fact, though, that only a few RDBMS support a formal ENUM type. I only know of Postgres and MySQL. Oracle or DB2 don't have it.
Advantages:
Type safety for stored procedures: will raise a type error if argument can not be coerced into the type. Like: select court_martial('3LT') would raise a type error automatically.
Custom collation order: In your example, officers could be sorted without a ranking id.
Generally speaking, enum is better for things that don't change much, and it uses slightly fewer resources, since there's no FK checks or anything like to execute on insert etc.
Using a lookup table is more elegant and/or traditional, and it's much easier to add and remove options than with an enum. It's also easier to mass change the values than with an enum.
Well, you don't see them because developers usually use enums in programming languages such as Java, and they don't have their counterparts in database design.
In the database, such enums are usually text or integer fields with no constraints. Database enums will not be translated into Java/C#/etc. enums, so the developers see no gain in this.
There are very many very good database features which are rarely used because most ORM tools are too primitive to support them.
Another benefit of enums over a lookup table is that when you write SQL functions you get type checking.

SQL Referential Integrity Between a Column and (One of Many Possible) Tables

This is more of a curiosity at the moment, but let's picture an environment where I bill on a staunch nickel-and-dime basis. I have many operations that my system does and they're all billable. All these operations are recorded across various tables (these tables need to be separate because they record very different kinds of information). I also want to micromanage my accounts receivable. (Forgive me if you find inconsistencies here, as this example is not a real situation)
Is there a somewhat standard way of substituting a foreign key with something that can verify that the identifier in column X on my billing table is an existing identifier within one of many operations record tables?
One idea is that when journalizing account activity, I could reference the operation's identifier as well as the operation (specifically, the table that it's in) and use a CHECK constraint. This is probably the best way to go so that my journal is not ambiguous.
Are there other ways to solve this problem, de-facto or proprietary?
Do non-relational databases solve this problem?
EDIT:
To rephrase my initial question,
Is there a somewhat standard way of substituting a foreign key with something that can verify that the identifier in column X on my billing table is an existing identifier within one of many (but not necessarily all) operations record tables?
No, there's no way to achieve this with a single foreign key column.
You can do basically one of two things:
in your table which potentially references any of the other x tables, have x foreign key reference fields (ideally: ID's of type INT), only one of which will ever be non-NULL at any given time. Each FK reference key references exactly one of your other data tables
or:
have one "child" table per master table with a proper and enforced reference, and pull together the data from those n child tables into a view (instead of a table) for your reporting / billing.
Or just totally forget about referential integrity - which I would definitely not recommend!
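The first option (one nullable foreign key column per referenced table, with exactly one of them set) can be sketched as below. It uses SQLite via Python's sqlite3 module so that it runs standalone; the tables shipping_ops, storage_ops and billing and all of their columns are hypothetical. With more than two referenced tables the CHECK expression grows accordingly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE shipping_ops (id INTEGER PRIMARY KEY);
    CREATE TABLE storage_ops  (id INTEGER PRIMARY KEY);

    CREATE TABLE billing (
        id             INTEGER PRIMARY KEY,
        amount_cents   INTEGER NOT NULL,
        shipping_op_id INTEGER REFERENCES shipping_ops (id),
        storage_op_id  INTEGER REFERENCES storage_ops (id),
        -- Exactly one of the two references must be filled in.
        CHECK (
            (shipping_op_id IS NOT NULL AND storage_op_id IS NULL)
            OR
            (shipping_op_id IS NULL AND storage_op_id IS NOT NULL)
        )
    );

    INSERT INTO shipping_ops VALUES (1);
    INSERT INTO storage_ops VALUES (7);
""")


def bill(amount_cents, shipping_op_id=None, storage_op_id=None):
    """Return True if the billing row is accepted, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO billing (amount_cents, shipping_op_id, storage_op_id) "
                "VALUES (?, ?, ?)",
                (amount_cents, shipping_op_id, storage_op_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


one_reference = bill(500, shipping_op_id=1)
both_references = bill(500, shipping_op_id=1, storage_op_id=7)
no_reference = bill(500)
dangling_reference = bill(500, storage_op_id=99)
print(one_reference, both_references, no_reference, dangling_reference)
```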
You can implement table inheritance; see this article:
http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server
An alternative is to enforce complex referential integrity rules via a trigger. However, not knowing exactly what your design is, I'd note that questions like this usually come up as a way to work around a bad design. Look at the design first and see if you can change it so that this can be handled through FKs; they are much more manageable than doing this sort of thing through triggers.
If you do go the trigger route, don't forget to enforce updates as well as inserts and make sure your trigger will work properly with a set-based multi-row insert and update.
A design alternative is to have a master table that is the parent of all your tables with the different details, and use the FK against that.

Inheritance in Database Design

I am designing a new laboratory database with MANY types of my main entities.
The table for each entity will hold fields common to ALL types of that entity (entity_id, created_on, created_by, etc). I will then use concrete inheritance (separate table for each unique set of attributes) to store all remaining fields.
I believe that this is the best design for the standard types of data which come through the laboratory daily. However, we often have special samples, which are often accompanied by specific values the originator wants stored.
Question: How should I model special (non-standard) types of entities?
Option 1: Use entity-value for special fields
One table (entity_id, attribute_name, numerical_value) would hold all data for any special entity.
+ Fewer tables.
- Cannot enforce requiring a particular attribute.
- Must convert (pivot) rows to columns which is inefficient.
Option 2: Strict concrete inheritance.
Create separate table for each separate special case.
+ Follows in accordance with all other rules
- Overhead of many tables with only a few rows.
Option 3: Concrete inheritance with special tables under a different user.
Put all special tables under a different user.
+ Keeps all special and standard tables separate.
+ Easier to search for common standard table in a list without searching through all special tables.
- Overhead of many tables with only a few rows.
Actually the design you described (common table plus subtype-specific tables) is called Class Table Inheritance.
Concrete Table Inheritance would have all the common attributes duplicated in the subtype tables, and you'd have no supertype table as you do now.
I'm strongly against EAV. I consider it an SQL antipattern. It may seem like an elegant solution because it requires fewer tables, but you're setting yourself up for a lot of headache later. You identified a couple of the disadvantages, but there are many others. IMHO, EAV is used appropriately only if you absolutely must not create a new table when you introduce a new subtype, or if you have an unbounded number of subtypes (e.g. users can define new attributes ad hoc).
You have many subtypes, but still a finite number of them, so if I were doing this project I'd stick with Class Table Inheritance. You may have few rows of each subtype, but at least you have some assurance that all rows in each subtype have the same columns, you can use NOT NULL if you need to, you can use SQL data types, you can use referential integrity constraints, etc. From a relational perspective, it's a better design than EAV.
One more option that you didn't mention is called Serialized LOB. That is, add a BLOB column for a semi-structured collection of custom attributes. Store XML, YAML, JSON, or your own DSL in that column. You won't be able to parse individual attributes out of that BLOB easily with SQL; you'll have to fetch the whole BLOB back into your application and extract individual attributes in code. So in some ways it's less convenient. But if that satisfies your usage of the data, then there's nothing wrong with that.
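A minimal sketch of the Serialized LOB idea, assuming JSON as the serialization format and SQLite (through Python's sqlite3 module) as the store. The table name samples, the column custom_attributes and the sample attribute names are hypothetical; entity_id and created_by are taken from the question.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        entity_id         INTEGER PRIMARY KEY,
        created_by        TEXT NOT NULL,
        custom_attributes TEXT NOT NULL DEFAULT '{}'   -- serialized JSON document
    )
""")


def save_sample(entity_id, created_by, attributes):
    """Serialize the special attributes into the single LOB column."""
    with conn:
        conn.execute(
            "INSERT INTO samples (entity_id, created_by, custom_attributes) "
            "VALUES (?, ?, ?)",
            (entity_id, created_by, json.dumps(attributes)),
        )


def load_attributes(entity_id):
    """Fetch the whole LOB and decode it in application code."""
    row = conn.execute(
        "SELECT custom_attributes FROM samples WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None


save_sample(1, "lab_tech", {"originator_ref": "X-17", "storage_temp_c": -80})
print(load_attributes(1))
```

The database sees only an opaque string here, so required attributes, types and ranges have to be validated in the application before saving.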
I think it depends mostly on how you want to use this data.
First of all, I don't really see the benefit of option 3 over option 2. I think separating the special tables in another schema will make your application harder to maintain, especially if later on commonalities are found between 'special values'.
As another option I would say:
- Store the special values in an XML fragment (or blob). Most databases have ability to query on XML structures these days, so without the need for many extra tables, you would keep your flexibility for a small performance hit.
If you put all the special values in one table, you get a very sparse table. Most normal DBMSes cannot handle this very well, but there are some implementations that specialize in this. You could benefit from that.
Do you often need to query the key-value pairs? If you basically access that table through its entry_id, I think having a key-value table is not a bad design. An extra index on the key column might even help you when you do need to query for special values. If you build an application layer on top of your database, the key-value table will map onto a Map or Hash structure, which can also easily be used.
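To make the key-value mapping concrete, here is a small sketch using SQLite via Python's sqlite3 module. The columns entity_id, attribute_name and numerical_value follow Option 1 in the question; the table name special_values, the index name and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE special_values (
        entity_id       INTEGER NOT NULL,
        attribute_name  TEXT    NOT NULL,
        numerical_value REAL    NOT NULL,
        PRIMARY KEY (entity_id, attribute_name)
    );
    -- Extra index on the key column, for lookups by attribute rather than by entity.
    CREATE INDEX idx_special_values_attribute ON special_values (attribute_name);

    INSERT INTO special_values VALUES (1, 'ph', 7.2), (1, 'turbidity', 0.4), (2, 'ph', 6.8);
""")


def attributes_of(entity_id):
    """Load one entity's key-value rows into a dict (the Map/Hash structure)."""
    return dict(conn.execute(
        "SELECT attribute_name, numerical_value FROM special_values WHERE entity_id = ?",
        (entity_id,),
    ).fetchall())


def entities_with(attribute_name):
    """Find the entities that carry a given special attribute."""
    return [entity_id for (entity_id,) in conn.execute(
        "SELECT entity_id FROM special_values WHERE attribute_name = ? ORDER BY entity_id",
        (attribute_name,),
    )]


print(attributes_of(1))
print(entities_with("ph"))
```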
It also depends on the different types of values you want to store. If there are many different types, that need to be easily accessed (instead of being serialized/deserialized to XML/Character-String) you might want to store the type in a separate column, but that will usually lead to a very complicated design.
Hope this helps (a little bit).
-Maarten
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
Suggest you read about the problems with entity value tables before deciding to use them.
Oracle can deal with sparsely filled tables quite well. I think you can use an approach similar to the one Salesforce uses: tables with a lot of columns, with columns created when needed. You can index those columns much better than an EAV model.
So it is flexible but it performs better than an EAV model.
Read: Ask Tom 1, Ask Tom 2, High Scalability and SalesForce.
The "Option 1" pattern is also called the "Universal Relation". At first look it seems like a shortcut that avoids potentially difficult data modeling. It trades effortless data modeling for not being able to do a simple select, update, or delete without dramatically more effort than it would take on a more conventional data model with multiple tables.