SQL - How to limit data entry of one attribute depending on another attribute?

SQL - How to limit data entry of one attribute depending on another attribute? - sql

Below is the DDL for the table I want to create. However, I want the attribute 'Expertise_breed' to be derived from 'Expertise_animal'. For example, if 'Dog' is entered into 'Expertise_animal' I don't want to be able to enter in a breed of cat. How would I go about achieving this?
I'm working with SQL Server Management Studio 2012
CREATE TABLE tExpertise
(
Expertise_ID int NOT NULL PRIMARY KEY, --E.G Data '001'
Expertise_type varchar(8) NOT NULL, --E.G Data 'Domestic'
Expertise_animal varchar(30) NOT NULL, --E.G Data 'Dog'
Expertise_breed varchar(30) NOT NULL --E.G Data 'Poodle'
)

This is a relation data situation, you should use relational tables.
I would have three
AnimalClassification - (domestic,wild,other)
AnimalSpecies (dog,cat,goat)
AnimalBreed (Poodle, Beagle)
Animal species would have a foreign key to animal classification i.e.
Dog - domestic
Animal breed would have a foreign key to animal species i.e.
Beagle - dog

You can create a trigger on insert and/or update and compare those two columns for each row. You can refer to the inserted entries via 'inserted' alias.
If you know what mappings are allowed (eg dog x poodle) you could store it in some table and join to it in the insert to filter out the wrong ones.

Theoretically, what you want can be achieved using table level constraints, a generic way of doing this being the following (not tested):
CREATE FUNCTION dbo.validateExpertise(
#expertise_type varchar(8),
#Expertise_animal varchar(30),
#Expertise_breed varchar(30)
)
RETURNS BIT
AS
BEGIN
IF (#Expertise_animal == 'dog' AND #Expertise_breed != 'dog')
RETURN 0;
-- other validations can come here
RETURN 1;
END
GO
-- add a table level constraint
-- WITH NOCHECK can be used to not check existing data
ALTER TABLE detailTable ADD CONSTRAINT chkExpertise
CHECK (dbo.validateExpertise(expertise_type, Expertise_animal, Expertise_breed) = 1)
While this may help you, it is not recommended to put such complex validation at database level. Complex validations are meant to be (at least) implemented in the business layer of your application, which is typically within the Logic tier (usually ASP.NET MVC, WCF service, Web service etc.) (some validations are also put in the presentation layer to avoid round-trips time delays.
Database is meant primarily for data persistence and fetch. Of course, simple constraints such as FKs, unique constraints, column level constraints etc. are welcomed, as they act as a good safety net.
Also, keep in mind that constraints like the one mentioned above will trigger for every INSERT or UPDATE in the table and might seriously degrade the performance for queries involving a large number of records.

Related

Strategies to store extra information about models without too many column names (alternatives to DB normalization and model subclassing)

Say you had a Model called Forest. Each object represents a forest on your continent. There is a set of data that is common to all these forests, like forest type, area etc., and these can be easily represented by columns on the SQL table, forest.
However, imagine that these forests had additional data about them that might not always be repeatable. For example the 20 coniferous forests have a pine-fir split ratio number, whereas the deciduous forests have a autumn-duration number. One way would be to store all these columns on the main table itself, but there will be too many columns on each row, with many columns remaining un-filled by definition.
The most obvious way around this is to make sub-classes of the Forest model and have separate table for each subclass. I feel that's a heavy handed approach that I would rather not follow. If I need some data about the generic forest I'll have to consult another table.
Is there a pattern to solve this problem? What solution do you usually prefer?
NOTE: I have seen the other questions about this. The solutions proposed were:
Subtyping, same as I proposed above.
Have all the columns on the same table.
Have separate tables for each kind of forest, with duplicated data like area and rainfall... duplicated.
Is there an inventive solution that I don't know of?
UPDATE: I have run into the EAV model, and also a modified version where the unpredictable fields are stored out in a NoSQL/JSON store, and the id for that is held in the RDB. I like both, but welcome suggestions in this direction.

On the database side, the best approach is often to store attributes common to all forests in one table, and to store unique attributes in other tables. Build updatable views for clients to use.
create table forests (
forest_id integer primary key,
-- Assumes forest names are not unique on a continent.
forest_name varchar(45) not null,
forest_type char(1) not null
check (forest_type in ('c', 'd')),
area_sq_km integer not null
check (area_sq_km > 0),
-- Other columns common to all forests go here.
--
-- This constraint lets foreign keys target the pair
-- of columns, guaranteeing that a row in each subtype
-- table references a row here having the same subtype.
unique (forest_id, forest_type)
);
create table coniferous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'c'
check (forest_type = 'c'),
pine_fir_ratio float not null
check (pine_fir_ratio >= 0),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
create table deciduous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'd'
check (forest_type = 'd'),
autumn_duration_days integer not null
check (autumn_duration_days between 20 and 100),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
Clients usually use updatable views, one for each subtype, instead of using the base tables. (You can revoke privileges on the base subtype tables to guarantee this.) You might want to omit the "forest_type" column.
create view coniferous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.pine_fir_ratio
from forests t1
inner join coniferous_forests_subtype t2
on t1.forest_id = t2.forest_id;
create view deciduous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.autumn_duration_days
from forests t1
inner join deciduous_forests_subtype t2
on t1.forest_id = t2.forest_id;
What you have to do to make these views updatable varies a little with the dbms, but expect to write some triggers (not shown). You'll need triggers to handle all the DML actions--insert, update, and delete.
If you need to report only on columns that appear in "forests", then just query the table "forests".

Well, the easiest way is putting all the columns into one table and then having a "type" field to decide which columns to use. This works for smaller tables, but for more complicated cases it can lead to a big messy table and issues with database constraints (such as NULLs).
My preferred method would be something like this:
A generic "Forests" table with: id, type, [generic_columns, ...]
"Coniferous_Forests" table with: id, forest_id (FK to Forests), ...
So, in order to get all the data for a Coniferous Forest with id of 1, you'd have a query like so:
SELECT * FROM Coniferous_Forests INNER JOIN Forests
ON Coniferous_Forests.forest_id = Forests.id
AND Coniferous_Forests.id = 1
As for inventive solutions, there is such a thing as an OODBMS (Object Oriented Database Management Sytem).
The most popular alternative to Relational SQL databases are Document-Oriented NoSQL databases like MongoDB. This is comparable to using JSON objects to store your data, and allows you to be more flexible with your database fields.

t-sql single default value for multiple records

I have an MSSQL Server 2008 table that associates multiple photos to houses, as follows:
HouseID - with foreign key to House table
PhotoID - with foreign key to Photo table
It's all working great, with an unique constraint on PhotoID so that a photo cannot be associated with multiple houses.
I would like to specify a default photo for the house records. The table is updated as such
HouseID
PhotoID
isDefault
The issue is that there can only be a single isDefault = 1 for a set of photos for a house.
In MSSQL Server 2008, how do I ensure that there is only a single isDefault = 1 for a given House ID, and the other records are isDefault = 0? Is it better to use a trigger, or is there a better way? If a trigger, any suggestions on the syntax to ensure optimization?
Lastly, I need this to work on the Insert and on the Update events.
Update:
The following worked like a charm. Comments?
CREATE VIEW HousePhoto_isDefault AS
SELECT yourSchema.HousePhoto.houseID, yourSchema.HousePhoto.isDefault
FROM yourSchema.HousePhoto WHERE isDefault = 1
GO
CREATE UNIQUE CLUSTERED INDEX idx_HousePhoto_isDefault
ON HousePhoto_isDefault (houseID)
GO

As you describe it, you would need to use triggers.
However, if you make a small change to the data structure, you can do it with regular constraints, I think. Instead of storing isDefault at the photo-level, store DefaultPhotoId at the house-level. That way, you can never have more than one default photo, no matter what you do.
If you want to ensure that there is a default, then set it up as NOT NULL.

In MSSQL Server 2008, how do I ensure that there is only a single isDefault = 1 for a given House ID, and the other records are isDefault = 0?
Why yes as Yves Samèr pointed out in this answer you can use use a filtered index
CREATE UNIQUE INDEX photo_isDefault
ON Photos(HouseID) WHERE isDefault = 1
DEMO
The error that is produced is
Cannot insert duplicate key row in object 'dbo.Photos' with unique
index 'photo_isDefault'.: INSERT INTO Photos (houseID, isDefault)
VALUES (1,1)
You could also opt to use a INDEXED VIEW or as you noted a trigger will do as well.

I actually think that triggers might be overkill. Just use CASE statements. Here's an example of the UPDATE statement (replacing the variables with whatever scripting language you're using):
UPDATE HousePhoto
SET isDefault =
(
CASE
WHEN
(PhotoID = #PhotoID)
THEN
1
ELSE
0
END
)
WHERE HouseID = #HouseId
Of course, you could always just use two queries.
UPDATE HousePhoto SET isDefault = 0 WHERE HouseID = #HouseID
UPDATE HousePhoto SET isDefault = 1 WHERE HouseID = #HouseID AND PhotoID = #PhotoID

There another approach: use a "reverse" FK:
[SQL Fiddle]
It is crucial to note the usage of the identifying relationship and the resulting composite PK in Photo: {HouseId, PhotoNo}. This serves two purposes:
Ensures that if a photo is the default of some house, it must belong to the same house.
Makes the FK in House composite. Since one of the FK fields is also in the PK (HouseId), it cannot be NULL, so it is crucial to have another field that can be NULL (DefaultPictureNo). If any of the FK fields is NULL, the FK is not enforced which allows us to break the chicken-and-egg problem when inserting new data1 in the presence of such circular FKs.
Compared to using isDefault flag and the associated filtered index, this approach makes the following tradeoffs:
PRO: Avoids the overhead of the extra field. This is potentially important if the total number of pictures is very high compared to the number of default pictures.
CON: Makes it impractical to use auto-increment for Photo PK.
CON: MS SQL Server will complain if you try to use declarative referential actions (such as ON DELETE CASCADE) in the presence of such circular FKs. You'll need to implement referential actions using triggers. This is an MS SQL Server quirk not generally applicable to other DBMSes.
TIE: Interferes less with clustering in Picture, at the price of interfering more with clustering in House.2
PRO: It's applicable to DBMSes that don't support filtered indexes (not really important here, but worth a mention).
1 I.e. enables us to insert a house without setting the default picture right away, which would be impossible since this is a new house and has no pictures yet.
2 Secondary indexes can be expensive in clustered tables, and the FK in House will need a supporting index.

Auto increment with a Unit Of Work

Context
I'm building a persistence layer to abstract different types of databases that I'll be needing. On the relational part I have mySQL, Oracle and PostgreSQL.
Let's take the following simplified MySQL tables:
CREATE TABLE Contact (
ID varchar(15),
NAME varchar(30)
);
CREATE TABLE Address (
ID varchar(15),
CONTACT_ID varchar(15),
NAME varchar(50)
);
I use code to generate system specific alpha numeric unique ID's fitting 15 chars in this case. Thus, if I insert a Contact record with it's Addresses I have my generated Contact.ID and Address.CONTACT_IDs before committing.
I've created a Unit of Work (amongst others) as per Martin Fowler's patterns to add transaction support. I'm using a key based Identity Map in the UoW to track the changed records in memory. It works like a charm for the scenario above, all pretty standard stuff so far.
The question scenario comes in when I have a database that is not under my control and the ID fields are auto-increment (or in Oracle sequences). In this case I do not have the db generated Contact.ID beforehand, so when I create my Address I do not have a value for Address.CONTACT_ID. The transaction has not been started on the DB session since all is kept in the Identity Map in memory.
Question: What is a good approach to address this? (Avoiding unnecessary db round trips)
Some ideas:
Retrieve the last ID: I can do a call to the database to retrieve the last Id like:
SELECT Auto_increment FROM information_schema.tables WHERE table_name='Contact';
But this is MySQL specific and probably something similar can be done for the other databases. If do this then would need to do the 1st insert, get the ID and then update the children (Address.CONTACT_IDs) – all in the current transaction context.

Avoid explicitly referencing the CONTACT_ID entirely. Assuming that Contact.NAME has a UNIQUE constraint and that the CONTACT_ID column REFERENCES Contact(ID):
INSERT INTO Contact (NAME) VALUES ('Joe Bloggs'); -- Contact.ID auto-generated
INSERT INTO Address (CONTACT_ID, NAME)
VALUES ((SELECT ID FROM Contact WHERE NAME = 'Joe Bloggs'),
'123 Apple Lane');
Now Address.CONTACT_ID is correct without your code knowing the key's value or even its type.

Records linked to any table?

Hi Im struggling a bit with this and could use some ideas...
Say my database has the following tables ;
Customers
Supplers
SalesInvoices
PurchaseInvoices
Currencies
etc etc
I would like to be able to add a "Notes" record to ANY type of record
The Notes table would like this
NoteID Int (PK)
NoteFK Int
NoteFKType Varchar(3)
NoteText varchar(100)
NoteDate Datetime
Where NoteFK is the PK of a customer or supplier etc and NoteFKType says what type of record the note is against
Now i realise that I cannot add a FK which references multiple tables without NoteFK needing to be present in all tables.
So how would you design the above ?
The note FK needs to be in any of the above tables
Cheers,
Daniel

You have to accept the limitation that you cannot teach the database about this foreign key constraint. So you will have to do without the integrity checking (and cascading deletes).
Your design is fine.
It is easily extensible to extra tables, you can have multiple notes per entity, and the target tables do not even need to be aware of the notes feature.
An advantage that this design has over using a separate notes table per entity table is that you can easily run queries across all notes, for example "most recent notes", or "all notes created by a given user".
As for the argument of that table growing too big, splitting it into say five table will shrink the table to about a fifth of its size, but this will not make any difference for index-based access. Databases are built to handle big tables (as long as they are properly indexed).

I think your design is ok, if you can accept the fact, that the db system will not check whether a note is referencing an existing entity in other table or not. It's the only design I can think of that doesn't require duplication and is scalable to more tables.
The way you designed it, when you add another entity type that you'd like to have notes for, you won't have to change your model. Also, you don't have to include any additional columns in your existing model, or additional tables.
To ensure data integrity, you can create set of triggers or some software solution that will clean notes table once in a while.

I would think twice before doing what you suggest. It might seem simple and elegant in the short term, but if you are truly interested in data integrity and performance, then having separate notes tables for each parent table is the way to go. Over the years, I've approached this problem using the solutions found in the other answers (triggers, GUIDs, etc.). I've come to the conclusion that the added complexity and loss of performance isn't worth it. By having separate note tables for each parent table, with an appropriate foreign key constraints, lookups and joins will be simple and fast. When combining the related items into one table, join syntax becomes ugly and your notes table will grow to be huge and slow.

I agree with Michael McLosky, to a degree.
The question in my mind is: What is the technical cost of having multiple notes tables?
In my mind, it Is preferable to consolidate the same functionality into a single table. It aso makes reporting and other further development simpler. Not to mention keeping the list of tables smaller and easier to manage.
It's a balancing act, you need to try to predetermine both the benefits And the costs of doing something like this. My -personal- preference is database referential integrity. Application management of integrity should, in my opinion, be limitted ot business logic. The database should ensure the data is always consistent and valid...
To actually answer your question...
The option I would use is a check constraint using a User Defined Function to check the values. This works in M$ SQL Server...
CREATE TABLE Test_Table_1 (id INT IDENTITY(1,1), val INT)
GO
CREATE TABLE Test_Table_2 (id INT IDENTITY(1,1), val INT)
GO
CREATE TABLE Test_Table_3 (fk_id INT, table_name VARCHAR(64))
GO
CREATE FUNCTION id_exists (#id INT, #table_name VARCHAR(64))
RETURNS INT
AS
BEGIN
IF (#table_name = 'Test_Table_1')
IF EXISTS(SELECT * FROM Test_Table_1 WHERE id = #id)
RETURN 1
ELSE
IF (#table_name = 'Test_Table_2')
IF EXISTS(SELECT * FROM Test_Table_2 WHERE id = #id)
RETURN 1
RETURN 0
END
GO
ALTER TABLE Test_Table_3 WITH CHECK ADD CONSTRAINT
CK_Test_Table_3 CHECK ((dbo.id_exists(fk_id,table_name)=(1)))
GO
ALTER TABLE [dbo].[Test_Table_3] CHECK CONSTRAINT [CK_Test_Table_3]
GO
INSERT INTO Test_Table_1 SELECT 1
GO
INSERT INTO Test_Table_1 SELECT 2
GO
INSERT INTO Test_Table_1 SELECT 3
GO
INSERT INTO Test_Table_2 SELECT 1
GO
INSERT INTO Test_Table_2 SELECT 2
GO
INSERT INTO Test_Table_3 SELECT 3, 'Test_Table_1'
GO
INSERT INTO Test_Table_3 SELECT 3, 'Test_Table_2'
GO
In that example, the final insert statement would fail.

You can get the FK referential integrity, at the costing of having one column in the notes table for each other table.
create table Notes (
id int PRIMARY KEY,
note varchar (whatever),
customer_id int NULL REFERENCES Customer (id),
product_id int NULL REFERENCES Product (id)
)
Then you'll need a constraint to make sure that you have only one of the columns set.
Or maybe not, maybe you might want a note to be able to be associated with both a customer and a product. Up to you.
This design would require adding a new column to Notes if you want to add another referencing table.

You could add a GUID field to the Customers, Suppliers, etc. tables. Then in the Notes table, change the foreign key to reference that GUID.
This does not help for data integrity. But it makes M-to-N relationships easily possible to any number of tables and it saves you from having to define a NoteFKType column in the Notes table.

You can easily implement "multi"-foreign key with triggers. Triggers will give you very flexible mechanism and you can do any integrity checks you wish.

Why dont you do it the other way around and have a foreign key in other tables (Customer, Supplier etc etc) to NotesID. This way you have one to one mapping.

Merging databases how to handle duplicate PK's

We have three databases that are physically separated by region, one in LA, SF and NY. All the databases share the same schema but contain data specific to their region. We're looking to merge these databases into one and mirror it. We need to preserve the data for each region but merge them into one db. This presents quite a few issues for us, for example we will certainly have duplicate Primary Keys, and Foreign Keys will be potentially invalid.
I'm hoping to find someone who has had experience with a task like this who could provide some tips, strategies and words of experience on how we can accomplish the merge.
For example, one idea was to create composite keys and then change our code and sprocs to find the data via the composite key (region/original pk). But this requires us to change all of our code and sprocs.
Another idea was to just import the data and let it generate new PK's and then update all the FK references to the new PK. This way we potentially don't have to change any code.
Any experience is welcome!

I have no first-hand experience with this, but it seems to me like you ought to be able to uniquely map PK -> New PK for each server. For instance, generate new PKs such that data from LA server has PK % 3 == 2, SF has PK % 3 == 1, and NY has PK % 3 == 0. And since, as I understood your question anyway, each server only stores FK relationships to its own data, you can update the FKs in identical fashion.
NewLA = OldLA*3-1
NewSF = OldLA*3-2
NewNY = OldLA*3
You can then merge those and have no duplicate PKs. This is essentially, as you already said, just generating new PKs, but structuring it this way allows you to trivially update your FKs (assuming, as I did, that the data on each server is isolated). Good luck.

BEST: add a column for RegionCode, and include it on your PKs, but you don't want to do all the leg work.
HACK: if your IDs are INTs, a quick fix would be to add a fixed value based on region to each key on import. INTs can be as large as: 2,147,483,647
local server data:
LA IDs: 1,2,3,4,5,6
SF IDs: 1,2,3,4,5
NY IDs: 1,2,3,4,5,6,7,9
add 100000000 to LA's IDs
add 200000000 to SF's IDs
add 300000000 to NY's IDs
combined server data:
LA IDs: 100000001,100000002,100000003,100000004,100000005,100000006
SF IDs: 200000001,200000002,200000003,200000004,200000005
NY IDs: 300000001,300000002,300000003,300000004,300000005,300000006,300000007,300000009

I have done this and I say change your keys (pick a method) rather than changing your code. Invariably you will either miss a stored procedure or introduce a bug. With data changes, it is pretty easy to write tests to look for orphaned records or to verify that things were matched up correctly. With code changes, especially code that is working correctly, it is too easy to miss something.

One thing you could do is set up the tables with regional data to use GUID's. That way, the primary keys in each region are unique, and you can mix and match data (import data from one region to another). For the tables which have shared data (like type tables), you can keep the primary keys the way they are (since they should be the same everywhere).
Here is some information about GUID's:
http://www.sqlteam.com/article/uniqueidentifier-vs-identity
Maybe SQL Server Management Studio lets you convert columns to use GUID's easily. I hope so!
Best of luck.

what i have done in a situation like this is this:
create a new db with the same schema
but only tables. no pk fk, checks
etc.
transfer data from DB1 to this
source db
for each table in target database
find the top number for the PK
for each table in the source
database update their pk, fk etc
starting with the (top number + 1)
from the target db
for each table in target database
set identity insert to on
import data from source db to target
db
for each table in target database
set identity insert to off
clear source db
repeat for DB2

As Jon mentioned, I would use GUIDs to solve the merge task. And I see two different solutions that required GUIDs:
1) Permanently change your database schema to use GUIDs instead of INTEGER (IDENTITY) as primary key.
This is a good solution in general, but if you have a lot of non SQL code that is somehow bound to the way your identifiers work, it could require quite some code changes. Probably since you merge databases, you may anyways need to update your application so that it is working with one region data only based on the user logged in etc.
2) Temporarily add GUIDs for migration purposes only, and after the data is migrated, drop them:
This one is kind-of more tricky, but once you write this migration script, you can (re-)run it multiple times to merge databases again in case you screw it the first time. Here is an example:
Table: PERSON (ID INT PRIMARY KEY, Name VARCHAR(100) NOT NULL)
Table: ADDRESS (ID INT PRIMARY KEY, City VARCHAR(100) NOT NULL, PERSON_ID INT)
Your alter scripts are (note that for all PK we automatically generate the GUID):
ALTER TABLE PERSON ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD PERSON_UID UNIQUEIDENTIFIER NULL
Then you update the FKs to be consistent with INTEGER ones:
--// set ADDRESS.PERSON_UID
UPDATE ADDRESS
SET ADDRESS.PERSON_UID = PERSON.UID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_ID = PERSON.ID
You do this for all PKs (automatically generate GUID) and FKs (update as shown above).
Now you create your target database. In this target database you also add the UID columns for all the PKs and FKs. Also disable all FK constraints.
Now you insert from each of your source databases to the target one (note: we do not insert PKs and integer FKs):
INSERT INTO TARGET_DB.dbo.PERSON (UID, NAME)
SELECT UID, NAME FROM SOURCE_DB1.dbo.PERSON
INSERT INTO TARGET_DB.dbo.ADDRESS (UID, CITY, PERSON_UID)
SELECT UID, CITY, PERSON_UID FROM SOURCE_DB1.dbo.ADDRESS
Once you inserted data from all the databases, you run the code opposite to the original to make integer FKs consistent with GUIDs on the target database:
--// set ADDRESS.PERSON_ID
UPDATE ADDRESS
SET ADDRESS.PERSON_ID = PERSON.ID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_UID = PERSON.UID
Now you may drop all the UID columns:
ALTER TABLE PERSON DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN PERSON_UID
So at the end you should get a rather long migration script, that should do the job for you. The point is - IT IS DOABLE
NOTE: all written here is not tested.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas