Postgres - How to find id's that are not used in different multiple tables (inactive id's) - badly written query - sql

I have table towns which is main table. This table contains so many rows and it became so 'dirty' (someone inserted 5 milions rows) that I would like to get rid of unused towns.
There are 3 referent table that are using my town_id as reference to towns.
And I know there are many towns that are not used in this tables, and only if town_id is not found in neither of these 3 tables I am considering it as inactive and I would like to remove that town (because it's not used).
as you can see towns is used in this 2 different tables:
employees
offices
and for table * vendors there is vendor_id in table towns since one vendor can have multiple towns.
so if vendor_id in towns is null and town_id is not found in any of these 2 tables it is safe to remove it :)
I created a query which might work but it is taking tooooo much time to execute, and it looks something like this:
select count(*)
from towns
where vendor_id is null
and id not in (select town_id from banks)
and id not in (select town_id from employees)
So basically I said, if vendor_is is null it means this town is definately not related to vendors and in the same time if same town is not in banks and employees, than it will be safe to remove it.. but query took too long, and never executed successfully...since towns has 5 milions rows and that is reason why it is so dirty..
In face I'm not able to execute given query since server terminated abnormally..
Here is full error message:
ERROR: server closed the connection unexpectedly This probably means
the server terminated abnormally before or while processing the
request.
Any kind of help would be awesome
Thanks!

You can join the tables using LEFT JOIN so that to identify the town_id for which there is no row in tables banks and employee in the WHERE clause :
WITH list AS
( SELECT t.town_id
FROM towns AS t
LEFT JOIN tbl.banks AS b ON b.town_id = t.town_id
LEFT JOIN tbl.employees AS e ON e.town_id = t.town_id
WHERE t.vendor_id IS NULL
AND b.town_id IS NULL
AND e.town_id IS NULL
LIMIT 1000
)
DELETE FROM tbl.towns AS t
USING list AS l
WHERE t.town_id = l.town_id ;
Before launching the DELETE, you can check the indexes on your tables.
Adding an index as follow can be usefull :
CREATE INDEX town_id_nulls ON towns (town_id NULLS FIRST) ;
Last but not least you can add a LIMIT clause in the cte so that to limit the number of rows you detele when you execute the DELETE and avoid the unexpected termination. As a consequence, you will have to relaunch the DELETE several times until there is no more row to delete.

You can try an JOIN on big tables it would be faster then two IN
you could also try UNION ALL and live with the duplicates, as it is faster as UNION
Finally you can use a combined Index on id and vendor_id, to speed up the query
CREATE TABLe towns (id int , vendor_id int)
CREATE TABLE
CREATE tABLE banks (town_id int)
CREATE TABLE
CREATE tABLE employees (town_id int)
CREATE TABLE
select count(*)
from towns t1 JOIN (select town_id from banks UNION select town_id from employees) t2 on t1.id <> t2.town_id
where vendor_id is null
count
0
SELECT 1
fiddle

The trick is to first make a list of all the town_id's you want to keep and then start removing those that are not there.
By looking in 2 tables you're making life harder for the server so let's just create 1 single list first.
-- build empty temp-table
CREATE TEMPORARY TABLE TEMP_must_keep
AS
SELECT town_id
FROM tbl.towns
WHERE 1 = 2;
-- get id's from first table
INSERT TEMP_must_keep (town_id)
SELECT DISTINCT town_id
FROM tbl.banks;
-- add index to speed up the EXCEPT below
CREATE UNIQUE INDEX idx_uq_must_keep_town_id ON TEMP_must_keep (town_id);
-- add new ones from second table
INSERT TEMP_must_keep (town_id)
SELECT town_id
FROM tbl.employees
EXCEPT -- auto-distincts
SELECT town_id
FROM TEMP_must_keep;
-- rebuild index simply to ensure little fragmentation
REINDEX TABLE TEMP_must_keep;
-- optional, but might help: create a temporary index on the towns table to speed up the delete
CREATE INDEX idx_towns_town_id_where_vendor_null ON tbl.towns (town_id) WHERE vendor IS NULL;
-- Now do actual delete
-- You can do a `SELECT COUNT(*)` rather than a `DELETE` first if you feel like it, both will probably take some time depending on your hardware.
DELETE
FROM tbl.towns as del
WHERE vendor_id is null
AND NOT EXISTS ( SELECT *
FROM TEMP_must_keep mk
WHERE mk.town_id = del.town_id);
-- cleanup
DROP INDEX tbl.idx_towns_town_id_where_vendor_null;
DROP TABLE TEMP_must_keep;
The idx_towns_town_id_where_vendor_null is optional and I'm not sure if it will actaully lower the total time but IMHO it will help out with the DELETE operation if only because the index should give the Query Optimizer a better view on what volumes to expect.

Related

Append Query Doesn't Append Missing Items

I have 2 tables. Table 1 has data from the bank account. Table 2 aggregates data from multiple other tables; to keep things simple, we will just have 2 tables. I need to append the data from table 1 into table 2.
I have a field in table2, "SrceFk". The concept is that when a record from Table1 appends, it will fill the table2.SrceFk with the table1 primary key and the table name. So record 302 will look like "BANK/302" after it appends. This way, when I run the append query, I can avoid duplicates.
The query is not working. I deleted the record from table2, but when I run the query, it just says "0 records appended". Even though the foreign key is not present.
I am new to SQL, Access, and programming in general. I understand basic concepts. I have googled this issue and looked on stackOverflow, but no luck.
This is my full statement:
INSERT INTO Main ( SrceFK, InvoDate, Descrip, AMT, Ac1, Ac2 )
SELECT Bank.ID &"/"& "BANK", Bank.TransDate, Bank.Descrip, Bank.TtlAmt, Bank.Ac1, Bank.Ac2
FROM Bank
WHERE NOT EXISTS
(
SELECT * FROM Main
WHERE Main.SrceFK = Bank.ID &"/"& "BANK"
);
I expect the query to add records that aren't present in the table, as needed.

SQL Queries instead of Cursors

I'm creating a database for a hypothetical video rental store.
All I need to do is a procedure that check the availabilty of a specific movie (obviously the movie can have several copies). So I have to check if there is a copy available for the rent, and take the number of the copy (because it'll affect other trigger later..).
I already did everything with the cursors and it works very well actually, but I need (i.e. "must") to do it without using cursors but just using "pure sql" (i.e. queries).
I'll explain briefly the scheme of my DB:
The tables that this procedure is going to use are 3: 'Copia Film' (Movie Copy) , 'Include' (Includes) , 'Noleggio' (Rent).
Copia Film Table has this attributes:
idCopia
Genere (FK references to Film)
Titolo (FK references to Film)
dataUscita (FK references to Film)
Include Table:
idNoleggio (FK references to Noleggio. Means idRent)
idCopia (FK references to Copia film. Means idCopy)
Noleggio Table:
idNoleggio (PK)
dataNoleggio (dateOfRent)
dataRestituzione (dateReturn)
dateRestituito (dateReturned)
CF (FK to Person)
Prezzo (price)
Every movie can have more than one copy.
Every copy can be available in two cases:
The copy ID is not present in the Include Table (that means that the specific copy has ever been rented)
The copy ID is present in the Include Table and the dataRestituito (dateReturned) is not null (that means that the specific copy has been rented but has already returned)
The query I've tried to do is the following and is not working at all:
SELECT COUNT(*)
FROM NOLEGGIO
WHERE dataNoleggio IS NOT NULL AND dataRestituito IS NOT NULL AND idNoleggio IN (
SELECT N.idNoleggio
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio=I.idNoleggio
WHERE idCopia IN (
SELECT idCopia
FROM COPIA_FILM
WHERE titolo='Pulp Fiction')) -- Of course the title is just an example
Well, from the query above I can't figure if a copy of the movie selected is available or not AND I can't take the copy ID if a copy of the movie were available.
(If you want, I can paste the cursors lines that work properly)
------ USING THE 'WITH SOLUTION' ----
I modified a little bit your code to this
WITH film
as
(
SELECT idCopia,titolo
FROM COPIA_FILM
WHERE titolo = 'Pulp Fiction'
),
copy_info as
(
SELECT N.idNoleggio, N.dataNoleggio, N.dataRestituito, I.idCopia
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio = I.idNoleggio
),
avl as
(
SELECT film.titolo, copy_info.idNoleggio, copy_info.dataNoleggio,
copy_film.dataRestituito,film.idCopia
FROM film LEFT OUTER JOIN copy_info
ON film.idCopia = copy_info.idCopia
)
SELECT COUNT(*),idCopia FROM avl
WHERE(dataRestituito IS NOT NULL OR idNoleggio IS NULL)
GROUP BY idCopia
As I said in the comment, this code works properly if I use it just in a query, but once I try to make a procedure from this, I got errors.
The problem is the final SELECT:
SELECT COUNT(*), idCopia INTO CNT,COPYFILM
FROM avl
WHERE (dataRestituito IS NOT NULL OR idNoleggio IS NULL)
GROUP BY idCopia
The error is:
ORA-01422: exact fetch returns more than requested number of rows
ORA-06512: at "VIDEO.PR_AVAILABILITY", line 9.
So it seems the Into clause is wrong because obviously the query returns more rows. What can I do ? I need to take the Copy ID (even just the first one on the list of rows) without using cursors.
You can try this -
WITH film
as
(
SELECT idCopia, titolo
FROM COPIA_FILM
WHERE titolo='Pulp Fiction'
),
copy_info as
(
select N.idNoleggio, I.dataNoleggio , I.dataRestituito , I.idCopia
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio=I.idNoleggio
),
avl as
(
select film.titolo, copy_info.idNoleggio, copy_info.dataNoleggio,
copy_info.dataRestituito
from film LEFT OUTER JOIN copy_info
ON film.idCopia = copy_info.idCopia
)
select * from avl
where (dataRestituito IS NOT NULL OR idNoleggio IS NULL);
You should think in terms of sets, rather than records.
If you find the set of all the films that are out, you can exclude them from your stock, and the rest is rentable.
select copiafilm.* from #f copiafilm
left join
(
select idCopia from #r Noleggio
inner join #i include on Noleggio.idNoleggio = include.idNoleggio
where dateRestituito is null
) out
on copiafilm.idCopia = out.idCopia
where out.idCopia is null
I solved the problem editing the last query into this one:
SELECT COUNT(*),idCopia INTO CNT,idCopiaFilm
FROM avl
WHERE (dataRestituito IS NOT NULL OR idNoleggio IS NULL) AND rownum = 1
GROUP BY idCopia;
IF CNT > 0 THEN
-- FOUND AVAILABLE COPY
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
-- NOT FOUND AVAILABLE COPY
Thank you #Aditya Kakirde ! Your suggestion almost solved the problem.

How to design the Tables / Query for (m:n relation?)

I am sorry if the term m:n is not correct, If you know a better term i will correct. I have the following situation, this is my original data:
gameID
participID
result
the data itself looks like that
1 5 10
1 4 -10
2 5 150
2 2 -100
2 1 -50
when i would extract this table it will easily have some 100mio rows and around 1mio participIDs ore more.
i will need:
show me all results of all games from participant x, where participant y was present
luckily only for a very limited amount of participants, but those are subject to change so I need a complete table and can reduce in a second step.
my idea is the following, it just looks very unoptimized
1) get the list of games where the "point of view participant" is included"
insert into consolidatedtable (gameid, participid, result)
select gameID,participID,sum(result) from mastertable where participID=x and result<>0
2) get all games where other participant is included
insert into consolidatedtable (gameid, participid, result)
where gameID in (select gameID from consolidatedtable)
AND participID=y and result<>0
3) delete all games from consolidate table where count<2
delete from consolidatedDB where gameID in (select gameid from consolidatedtable where count(distinct(participID)<2 group by gameid)
the whole thing looks like a childrens solution to me
I need a consolidated table for each player
I insert way to many games into this table and delete them later on
the whole thing needs to be run participant by participant over
the whole master table, it would not work if i do this for several
participants at the same time
any better ideas, must be, this ones just so bad. the master table will be postgreSQL on the DW server, the consolidated view will be mySQL (but the number crunching will be done in postgreSQL)
my problems
1) how do i build the consolidated table(s - do i need more than one), without having to run a single query for each player over the whole master table (i need to data for players x,y,z and no matter who else is playing) - this is the consolidation task for the DW server, it should create the table for webserver (which is condensed)
2) how can i then extract the at the webserver fast (so the table design of (1) should take this into consideration. we are not talking about a lot of players here i need this info, maybe 100? (so i could then either partition by player ID, or just create single table)
Datawarehouse: postgreSQL 9.2 (48GB, SSD)
Webserver: mySQL 5.5 (4GB Ram, SSD)
master table: gameid BIGINT, participID, Result INT, foreign key on particiP ID (to participants table)
the DW server will hold the master table, the DW server should also prepare the consolidated/extracted Tables (processing power, ssd space is not
an issue)
the webserver should hold the consoldiated tables (only for the 100
players where i need the info) and query this data in a very
efficient manner
so efficient query at webserver >> workload of DW server)
i think this is important, sorry that i didnt include it at the beginning.
the data at the DW server updates daily, but i do not need to query the whole "master table" completely every day. the setup allows me to consolidate only never values. eg: yesterday consolidation was up to ID 500, current ID=550, so today i only consolidate 501-550.
Here is another idea that might work, depending on your database (and my understanding of the question):
SELECT *
FROM table a
WHERE participID = 'x'
AND EXISTS (
SELECT 1 FROM table b
WHERE b.participID = 'y'
AND b.gameID=a.gameID
);
Assuming you have indexes on the two columns (participID and gameID), the performance should be good.
I'd compare it to this and see which runs faster:
SELECT *
FROM table a
JOIN (
SELECT gameID
FROM table
WHERE participID = 'y'
GROUP BY gameID
) b
ON a.gameID=b.gameID
WHERE a.participID = 'x';
Sounds like you just want a self join:
For all participants:
SELECT x.gameID, x.participID, x.results, y.participID, y.results
FROM table as x
JOIN table as y
ON T1.gameID = T2.gameID
WHERE x.participID <> y.participID
The downside of that is you'd get each participant on each side of each game.
For 2 specific particpants:
SELECT x.gameID, x.results, y.results
FROM (SELECT gameID, participID, results
FROM table
WHERE t1.participID = 'x'
and results <> 0)
as x
JOIN (SELECT gameID, participID, results
FROM table
WHERE t1.participID = 'y'
and results <> 0)
as y
ON T1.gameID = T2.gameID
You might not need to select participID in your query, depending on what you're doing with the results.

DB design: Should I use constraints within a table or a new table

I inherited a large existing DB and I'd like to know if I should refactor it because 95% of my queries require joining at least 4 tables.
The DB has a 5 tables that only have an ID and Name column with less than 20 rows. I assume the author did this so he could change the names there and not change them in the other tables, but many of those tables are only referenced in one other table. Should I refactor these small 2 column tables into the a larger table and add a constraint to the column so users can't input incorrect names instead of having seperate tables?
Resist that urge. From your description I can deduce that the existing design is solid and probably well normalized. Your refactoring may actually undo a good db structure.
If you are bothered by writing a lot of joins in your queries I would suggest creating views to mitigate the boilerplate.
...the author did this so he could change the names there not change
them in the other tables...
That is evidence of good design and exactly what you should strive for in a normalized database.
no.
your db is normalized and proper.
and you save space, lookup time, indexing for storing an int rather then a varchar name
small tables are optimized away if they are properly keyed.
Sounds like what you have are lookup tables. Let me tell you waht happens when you decide to put all lookups in one table with an additonal column to specify which type it is. Fisrt instead of joining to 4 different tables in one query, you have to join to the same table 4 times. There ends up being more contention for the resources in the "one table to rule them all". Further, you lose FK constraints. That means you eventually lose data integrity. So if one lookup is state, nothing wil prevent you from putting the id values for a different lookup for customer type in the stateid column in the customeraddress table. When the lookups are separate you con enforce that relationship.
Suppose instead of one big table you decide to have a constraint on the column for customer type. Constraints are now enforced but you have a problem when they need to change. Now you have to alter the database in order to add a new type. Again usually this is a very bad idea espcially when the table gets large.
Short story: Replacing strings with ID numbers has nothing to do with normalization. Using natural keys in your case might improve performance. In my tests, queries using natural keys were faster by 1 or 2 orders of magnitude.
You might have accepted an answer too quickly.
The DB has a 5 tables that only have an ID and Name column with less
than 20 rows.
I'm assuming these tables have a structure something like this.
create table a (
a_id integer primary key,
a_name varchar(30) not null unique
);
create table b (...
-- Just like a
create table your_data (
yet_another_id integer primary key,
a_id integer not null references a (a_id),
b_id integer not null references b (b_id),
c_id integer not null references c (c_id),
d_id integer not null references d (d_id),
unique (a_id, b_id, c_id, d_id),
-- other columns go here
);
And it's obvious that your_data will require four joins (at least) to get usable information from it.
But the names in table a, b, c, and d are unique (ahem), so you can use the unique names as targets for foreign key references. You could rewrite the table your_data like this.
create table your_data (
yet_another_id integer primary key,
a_name varchar(30) not null references a (a_name),
b_name varchar(30) not null references b (b_name),
c_name varchar(30) not null references c (c_name),
d_name varchar(30) not null references d (d_name),
unique (a_name, b_name, c_name, d_name),
-- other columns go here
);
Replacing id numbers with strings doesn't change the normal form. (And replacing strings with id numbers doesn't have anything to do with normalization.) If the original table were in 5NF, then this rewrite will be in 5NF, too.
But what about performance? Aren't id numbers plus joins supposed to be faster than strings?
I tested that by inserting 20 rows into each of the four tables a, b, c, and d. Then I generated a Cartesian product to fill one test table written with id numbers, and another using the names. (So, 160K rows in each.) I updated the statistics, and ran a couple of queries.
explain analyze
select a.a_name, b.b_name, c.c_name, d.d_name
from your_data_id
inner join a on (a.a_id = your_data_id.a_id)
inner join b on (b.b_id = your_data_id.b_id)
inner join c on (c.c_id = your_data_id.c_id)
inner join d on (d.d_id = your_data_id.d_id)
...
Total runtime: 808.472 ms
explain analyze
select a_name, b_name, c_name, d_name
from your_data
Total runtime: 132.098 ms
The query using id numbers takes a lot longer to execute. I used a WHERE clause on all four columns, which returns a single row.
explain analyze
select a.a_name, b.b_name, c.c_name, d.d_name
from your_data_id
inner join a on (a.a_id = your_data_id.a_id and a.a_name = 'a one')
inner join b on (b.b_id = your_data_id.b_id and b.b_name = 'b one')
inner join c on (c.c_id = your_data_id.c_id and c.c_name = 'c one')
inner join d on (d.d_id = your_data_id.d_id and d.d_name = 'd one)
...
Total runtime: 14.671 ms
explain analyze
select a_name, b_name, c_name, d_name
from your_data
where a_name = 'a one' and b_name = 'b one' and c_name = 'c one' and d_name = 'd one';
...
Total runtime: 0.133 ms
The tables using id numbers took about 100 times longer to query.
Tests used PostgreSQL 9.something.
My advice: Try before you buy. I mean, test before you invest. Try rewriting your data table to use natural keys. Think carefully about ON UPDATE CASCADE and ON DELETE CASCADE. Test performance with representative sample data. Edit your original question and let us know what you found.

TSQL foreign keys on views?

I have a SQL-Server 2008 database and a schema which uses foreign key constraints to enforce referential integrity. Works as intended. Now the user creates views on the original tables to work on subsets of the data only. My problem is that filtering certain datasets in some tables but not in others will violate the foreign key constraints.
Imagine two tables "one" and "two". "one" contains just an id column with values 1,2,3. "Two" references "one". Now you create views on both tables. The view for table "two" doesn't filter anything while the view for table "one" removes all rows but the first. You'll end up with entries in the second view that point nowhere.
Is there any way to avoid this? Can you have foreign key constraints between views?
Some Clarification in response to some of the comments:
I'm aware that the underlying constraints will ensure integrity of the data even when inserting through the views. My problem lies with the statements consuming the views. Those statements have been written with the original tables in mind and assume certain joins cannot fail. This assumption is always valid when working with the tables - but views potentially break it.
Joining/checking all constraints when creating the views in the first place is annyoing because of the large number of referencing tables. Thus I was hoping to avoid that.
I love your question. It screams of familiarity with the Query Optimizer, and how it can see that some joins are redundant if they serve no purpose, or if it can simplify something knowing that there is at most one hit on the other side of a join.
So, the big question is around whether you can make a FK against the CIX of an Indexed View. And the answer is no.
create table dbo.testtable (id int identity(1,1) primary key, val int not null);
go
create view dbo.testview with schemabinding as
select id, val
from dbo.testtable
where val >= 50
;
go
insert dbo.testtable
select 20 union all
select 30 union all
select 40 union all
select 50 union all
select 60 union all
select 70
go
create unique clustered index ixV on dbo.testview(id);
go
create table dbo.secondtable (id int references dbo.testview(id));
go
All this works except for the last statement, which errors with:
Msg 1768, Level 16, State 0, Line 1
Foreign key 'FK__secondtable__id__6A325CF7' references object 'dbo.testview' which is not a user table.
So the Foreign key must reference a user table.
But... the next question is about whether you could reference a unique index that is filtered in SQL 2008, to achieve a view-like FK.
And still the answer is no.
create unique index ixUV on dbo.testtable(val) where val >= 50;
go
This succeeded.
But now if I try to create a table that references the val column
create table dbo.thirdtable (id int identity(1,1) primary key, val int not null check (val >= 50) references dbo.testtable(val));
(I was hoping that the check constraint that matched the filter in the filtered index might help the system understand that the FK should hold)
But I get an error saying:
There are no primary or candidate keys in the referenced table 'dbo.testtable' that matching the referencing column list in the foreign key 'FK__thirdtable__val__0EA330E9'.
If I drop the filtered index and create a non-filtered unique non-clustered index, then I can create dbo.thirdtable without any problems.
So I'm afraid the answer still seems to be No.
It took me some time to figure out the misunderstaning here -- not sure if I still understand completely, but here it is.
I will use an example, close to yours, but with some data -- easier for me to think in these terms.
So first two tables; A = Department B = Employee
CREATE TABLE Department
(
DepartmentID int PRIMARY KEY
,DepartmentName varchar(20)
,DepartmentColor varchar(10)
)
GO
CREATE TABLE Employee
(
EmployeeID int PRIMARY KEY
,EmployeeName varchar(20)
,DepartmentID int FOREIGN KEY REFERENCES Department ( DepartmentID )
)
GO
Now I'll toss some data in
INSERT INTO Department
( DepartmentID, DepartmentName, DepartmentColor )
SELECT 1, 'Accounting', 'RED' UNION
SELECT 2, 'Engineering', 'BLUE' UNION
SELECT 3, 'Sales', 'YELLOW' UNION
SELECT 4, 'Marketing', 'GREEN' ;
INSERT INTO Employee
( EmployeeID, EmployeeName, DepartmentID )
SELECT 1, 'Lyne', 1 UNION
SELECT 2, 'Damir', 2 UNION
SELECT 3, 'Sandy', 2 UNION
SELECT 4, 'Steve', 3 UNION
SELECT 5, 'Brian', 3 UNION
SELECT 6, 'Susan', 3 UNION
SELECT 7, 'Joe', 4 ;
So, now I'll create a view on the first table to filter some departments out.
CREATE VIEW dbo.BlueDepartments
AS
SELECT * FROM dbo.Department
WHERE DepartmentColor = 'BLUE'
GO
This returns
DepartmentID DepartmentName DepartmentColor
------------ -------------------- ---------------
2 Engineering BLUE
And per your example, I'll add a view for the second table which does not filter anything.
CREATE VIEW dbo.AllEmployees
AS
SELECT * FROM dbo.Employee
GO
This returns
EmployeeID EmployeeName DepartmentID
----------- -------------------- ------------
1 Lyne 1
2 Damir 2
3 Sandy 2
4 Steve 3
5 Brian 3
6 Susan 3
7 Joe 4
It seems to me that you think that Employee No 5, DepartmentID = 3 points to nowhere?
"You'll end up with entries in the
second view that point nowhere."
Well, it points to the Department table DepartmentID = 3, as specified with the foreign key. Even if you try to join view on view nothing is broken:
SELECT e.EmployeeID
,e.EmployeeName
,d.DepartmentID
,d.DepartmentName
,d.DepartmentColor
FROM dbo.AllEmployees AS e
JOIN dbo.BlueDepartments AS d ON d.DepartmentID = e.DepartmentID
ORDER BY e.EmployeeID
Returns
EmployeeID EmployeeName DepartmentID DepartmentName DepartmentColor
----------- -------------------- ------------ -------------------- ---------------
2 Damir 2 Engineering BLUE
3 Sandy 2 Engineering BLUE
So nothing is broken here, the join simply did not find matching records for DepartmentID <> 2 This is actually the same as if I join tables and then include filter as in the first view:
SELECT e.EmployeeID
,e.EmployeeName
,d.DepartmentID
,d.DepartmentName
,d.DepartmentColor
FROM dbo.Employee AS e
JOIN dbo.Department AS d ON d.DepartmentID = e.DepartmentID
WHERE d.DepartmentColor = 'BLUE'
ORDER BY e.EmployeeID
Returns again:
EmployeeID EmployeeName DepartmentID DepartmentName DepartmentColor
----------- -------------------- ------------ -------------------- ---------------
2 Damir 2 Engineering BLUE
3 Sandy 2 Engineering BLUE
In both cases joins do not fail, they simply do as expected.
Now I will try to break the referential integrity through a view (there is no DepartmentID= 127)
INSERT INTO dbo.AllEmployees
( EmployeeID, EmployeeName, DepartmentID )
VALUES( 10, 'Bob', 127 )
And this results in:
Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the FOREIGN KEY constraint "FK__Employee__Depart__0519C6AF". The conflict occurred in database "Tinker_2", table "dbo.Department", column 'DepartmentID'.
If I try to delete a department through the view
DELETE FROM dbo.BlueDepartments
WHERE DepartmentID = 2
Which results in:
Msg 547, Level 16, State 0, Line 1
The DELETE statement conflicted with the REFERENCE constraint "FK__Employee__Depart__0519C6AF". The conflict occurred in database "Tinker_2", table "dbo.Employee", column 'DepartmentID'.
So constraints on underlying tables still apply.
Hope this helps, but then maybe I misunderstood your problem.
Peter already hit on this, but the best solution is to:
Create the "main" logic (that filtering the referenced table) once.
Have all views on related tables join to the view created for (1), not the original table.
I.e.,
CREATE VIEW v1 AS SELECT * FROM table1 WHERE blah
CREATE VIEW v2 AS SELECT * FROM table2 WHERE EXISTS
(SELECT NULL FROM v1 WHERE v1.id = table2.FKtoTable1)
Sure, syntactic sugar for propagating filters for views on one table to views on subordinate tables would be handy, but alas, it's not part of the SQL standard. That said, this solution is still good enough -- efficient, straightforward, maintainable, and guarantees the desired state for the consuming code.
If you try to insert, update or delete data through a view, the underlying table constraints still apply.
Something like this in View2 is probably your best bet:
CREATE VIEW View2
AS
SELECT
T2.col1,
T2.col2,
...
FROM
Table2 T2
INNER JOIN Table1 T1 ON
T1.pk = T2.t1_fk
If rolling over tables so that Identity columns will not clash, one possibility would be to use a lookup table that referenced the different data tables by Identity and a table reference.
Foreign keys on this table would work down the line for referencing tables.
This would be expensive in a number of ways
Referential integrity on the lookup table would have to be be enforced using triggers.
Additional storage of the lookup table and indexing in addition to the data tables.
Data reading would almost certainly involve a Stored Procedure or three to execute a filtered UNION.
Query plan evaluation would also have a development cost.
The list goes on but it might work on some scenarios.
Using Rob Farley's schema:
CREATE TABLE dbo.testtable(
id int IDENTITY(1,1) PRIMARY KEY,
val int NOT NULL);
go
INSERT dbo.testtable(val)
VALUES(20),(30),(40),(50),(60),(70);
go
CREATE TABLE dbo.secondtable(
id int NOT NULL,
CONSTRAINT FK_SecondTable FOREIGN KEY(id) REFERENCES dbo.TestTable(id));
go
CREATE TABLE z(n tinyint PRIMARY KEY);
INSERT z(n)
VALUES(0),(1);
go
CREATE VIEW dbo.SecondTableCheck WITH SCHEMABINDING AS
SELECT 1 n
FROM dbo.TestTable AS t JOIN dbo.SecondTable AS s ON t.Id = s.Id
CROSS JOIN dbo.z
WHERE t.Val < 50;
go
CREATE UNIQUE CLUSTERED INDEX NoSmallIds ON dbo.SecondTableCheck(n);
go
I had to create a tiny helper table (dbo.z) in order to make this work, because indexed views cannot have self joins, outer joins, subqueries, or derived tables (and TVCs count as derived tables).
Another approach, depending on your requirements, would be to use a stored procedure to return two recordsets. You pass it filtering criteria and it uses the filtering criteria to query table 1, and then those results can be used to filter the query to table 2 so that it's results are also consistent. Then you return both results.
You could stage the filtered table 1 data to another table. The contents of this staging table are your view 1, and then you build view 2 via a join of the staging table and table 2. This way the proccessing for filtering table 1 is done once and reused for both views.
Really what it boils down to is that view 2 has no idea what kind of filtering you performed in view 1, unless you tell view 2 the filtering criteria, or make it somehow dependent on the results of view 1, which means emulating the same filtering that occurs on view1.
Constraints don't perform any kind of filtering, they only prevent invalid data, or cascade key changes and deletes.
No, you can't create foreign keys on views.
Even if you could, where would that leave you? You would still have to declare the FK after creating the view. Who would declare the FK, you or the user? If the user is sophisticated enough to declare a FK, why couldn't he add an inner join to the referenced view? eg:
create view1 as select a, b, c, d from table1 where a in (1, 2, 3)
go
create view2 as select a, m, n, o from table2 where a in (select a from view1)
go
vs:
create view1 as select a, b, c, d from table1 where a in (1, 2, 3)
go
create view2 as select a, m, n, o from table2
--# pseudo-syntax for fk:
alter view2 add foreign key (a) references view1 (a)
go
I don't see how the foreign key would simplify your job.
Alternatively:
Copy the subset of data into another schema or database. Same tables, same keys, less data, faster analysis, less contention.
If you need a subset of all the tables, use another database. If you only need a subset of some tables, use a schema in the same database. That way your new tables can still reference the non-copied tables.
Then use the existing views to copy the data over. Any FK violations will raise an error and identify which views require editing. Create a job and schedule it daily, if necessary.
From a purely data integrity perspective (and nothing to do with the Query Optimizer), I had considered an Indexed View. I figured you could make a unique index on it, which could be broken when you try to have broken integrity in your underlying tables.
But... I don't think you can get around the restrictions of indexed views well enough.
For example:
You can't use outer joins, or sub-queries. That makes it very hard to find the rows that don't exist in the view. If you use aggregates, you can't use HAVING, so that cuts out some options you could use there too. You can't even have constants in an indexed view if you have grouping (whether or not you use a GROUP BY clause), so you can't even try putting an index on a constant field so that a second row will fall over. You can't use UNION ALL, so the idea of having a count which will break a unique index when it hits a second zero won't work.
I feel like there should be an answer, but I'm afraid you're going to have to take a good look at your actual design and work out what you really need. Perhaps triggers (and good indexes) on the tables involved, so that any changes that might break something can roll it all that.
But I was really hoping to be able to suggest something that the Query Optimizer might be able to leverage to help the performance of your system, but I don't think I can.