Implementing an Aliases table (self-referencing many-to-many) - sql

I am trying to model an Alias relationship. That is, several records in my person table may represent the same actual person. I don't care who the "Primary" person is. All Person records would carry equal weight.
I have implemented this in the past with the two tables you see below.
------------- ------------
| Person | | Alias |
|-----------| |----------|
| PersonID | | AliasID |
| LastName | | PersonID |
| FirstName | ------------
-------------
Here is some sample data:
Person (1, 'Joseph', 'Smith')
Person (2, 'Jane', 'Doe')
Person (3, 'Joe', 'Smith')
Person (4, 'Joey', 'Smith')
Alias(1, 1)
Alias(1, 3)
Alias(1, 4)
I suppose I could move the AliasID to the Person table since there is a 1-to-1 relationship between the PersonID fields. However, I may want to add additional fields to the Alias table (like Sequence number, etc.) at some point in the future.
Is there a better way to model this than what I have here?

This is how I would do it.
--DROP TABLE [dbo].[Alias]
GO
--DROP TABLE [dbo].[RealPerson]
GO
IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = object_id(N'[dbo].[RealPerson]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
BEGIN
DROP TABLE [dbo].[RealPerson]
END
GO
CREATE TABLE [dbo].[RealPerson]
(
RealPersonUUID [UNIQUEIDENTIFIER] NOT NULL DEFAULT NEWSEQUENTIALID()
, CreateDate smalldatetime default CURRENT_TIMESTAMP
, MyCompanyFriendlyUniqueIdentifier varchar(128) not null
)
GO
ALTER TABLE dbo.RealPerson ADD CONSTRAINT PK_RealPerson
PRIMARY KEY NONCLUSTERED (RealPersonUUID)
GO
ALTER TABLE [dbo].[RealPerson]
ADD CONSTRAINT CK_MyCompanyFriendlyUniqueIdentifier_Unique UNIQUE (MyCompanyFriendlyUniqueIdentifier)
GO
GRANT SELECT , INSERT, UPDATE, DELETE ON [dbo].[RealPerson] TO public
GO
IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = object_id(N'[dbo].[Alias]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
BEGIN
DROP TABLE [dbo].[Alias]
END
GO
CREATE TABLE [dbo].[Alias]
(
AliasUUID [UNIQUEIDENTIFIER] NOT NULL DEFAULT NEWSEQUENTIALID()
, RealPersonUUID [UNIQUEIDENTIFIER] NOT NULL
, CreateDate smalldatetime default CURRENT_TIMESTAMP
, LastName varchar(128) not null
, FirstName varchar(128) not null
, PriorityRank smallint not null
)
GO
ALTER TABLE dbo.Alias ADD CONSTRAINT PK_Alias
PRIMARY KEY NONCLUSTERED (AliasUUID)
GO
ALTER TABLE [dbo].[Alias]
ADD CONSTRAINT FK_AliasToRealPerson
FOREIGN KEY (RealPersonUUID) REFERENCES dbo.RealPerson (RealPersonUUID)
GO
ALTER TABLE [dbo].[Alias]
ADD CONSTRAINT CK_RealPersonUUID_PriorityRank_Unique UNIQUE (RealPersonUUID,PriorityRank)
GO
ALTER TABLE [dbo].[Alias]
ADD CONSTRAINT CK_PriorityRank_Range CHECK (PriorityRank >= 0 AND PriorityRank < 33)
GO
if exists (select * from dbo.sysindexes where name = N'IX_Alias_RealPersonUUID' and id = object_id(N'[dbo].[Alias]'))
DROP INDEX [dbo].[Alias].[IX_Alias_RealPersonUUID]
GO
CREATE INDEX [IX_Alias_RealPersonUUID] ON [dbo].[Alias]([RealPersonUUID])
GO
GRANT SELECT , INSERT, UPDATE, DELETE ON [dbo].[Alias] TO public
GO
INSERT INTO dbo.RealPerson ( RealPersonUUID , MyCompanyFriendlyUniqueIdentifier )
select '11111111-1111-1111-1111-111111111111' , 'ABC'
union all select '22222222-2222-2222-2222-222222222222' , 'DEF'
INSERT INTO dbo.[Alias] ( RealPersonUUID , LastName, FirstName , PriorityRank)
select '11111111-1111-1111-1111-111111111111' , 'Smith' , 'Joseph' , 0
union all select '11111111-1111-1111-1111-111111111111' , 'Smith' , 'Joey' , 1
union all select '11111111-1111-1111-1111-111111111111' , 'Smith' , 'Joe' , 2
union all select '11111111-1111-1111-1111-111111111111' , 'Smith' , 'Jo' , 3
union all select '22222222-2222-2222-2222-222222222222' , 'Doe' , 'Jane' , 0
select 'Main Identity' as X, * from dbo.RealPerson rp join dbo.[Alias] al on rp.RealPersonUUID = al.RealPersonUUID where al.PriorityRank = 0
select 'All Identities' as X, * from dbo.RealPerson rp join dbo.[Alias] al on rp.RealPersonUUID = al.RealPersonUUID
select 'Aliai Only' as X, * from dbo.RealPerson rp join dbo.[Alias] al on rp.RealPersonUUID = al.RealPersonUUID where al.PriorityRank > 0

First, you should identify your entities. Clearly you have a person and each person will have their own identity. They are unique and should allways be kept as such. Then you have Alias's They should be in their own table with a one to many relationship. This should be enfrced with primary keys, forgien keys, indexes for quick lookup where appropriate. Each table need a clustered index also for performance. You should then use stored procedures to return or update the tables. I've intentionally used certain word, because if you google them, you will get lots of good information on what you need to do.

Related

How to I get distinct combinations of one XRef column related to any value in the other XRef column

I need to select the count of unique value combinations of column B in an XRef table which is grouped by column A.
Consider the following schema and data, which represents a simple family structure. Each child has a father and mother:
TABLE Father
FatherID
Name
1
Alex
2
Bob
TABLE Mother
MotherID
Name
1
Alice
2
Barbara
TABLE Child
ChildID
FatherID
MotherID
Name
1
1 (Alex)
1 (Alice)
Adam
2
1 (Alex)
1 (Alice)
Billy
3
1 (Alex)
2 (Barbara)
Celine
4
2 (Bob)
2 (Barbara)
Derek
The distinct combinations of mothers for each father are:
Alex (Alice, Barbara)
Bob (Barbara)
In all there are two distinct combinations of mothers:
Alice, Barbara
Barbara
The query I want to write would return the count of those distinct combinations of mother, regardless of which father they are associated with:
UniqueMotherGroups
2
I was able to do this successfully using the STRING_AGG function, but it feels clunky. It also needs to operate over millions of rows and is quite slow at the moment. Is there a more idiomatic way to do this with set operations instead?
Here is my working example:
-- Drop pre-existing tables
DROP TABLE IF EXISTS dbo.Child;
DROP TABLE IF EXISTS dbo.Father;
DROP TABLE IF EXISTS dbo.Mother;
-- Create family tables.
CREATE TABLE dbo.Father
(
FatherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
PRIMARY KEY CLUSTERED (FatherID);
ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Mother
(
MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
PRIMARY KEY CLUSTERED (MotherID);
ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Child
(
ChildID INT NOT NULL
, FatherID INT NOT NULL
, MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
PRIMARY KEY CLUSTERED (ChildID);
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
FOREIGN KEY (FatherID)
REFERENCES dbo.Father (FatherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
FOREIGN KEY (MotherID)
REFERENCES dbo.Mother (MotherID);
-- Insert two children with the same parents
INSERT INTO dbo.Father
(
FatherID
, Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')
INSERT INTO dbo.Mother
(
MotherID
, Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');
INSERT INTO dbo.Child
(
ChildID
, FatherID
, MotherID
, Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');
-- CTE Gets distinct combinations of parents
WITH distinctParentCombinations (FatherID, MotherID)
AS (SELECT children.FatherID
, children.MotherID
FROM dbo.Child as children
GROUP BY children.FatherID
, children.MotherID
)
-- CTE Gets uses STRING_AGG to get unique combinations of mothers.
, motherGroups (Mothers)
AS (SELECT STRING_AGG(CONVERT(VARCHAR(MAX), distinctParentCombinations.MotherID), '-') WITHIN GROUP (ORDER BY distinctParentCombinations.MotherID) AS Mothers
FROM distinctParentCombinations
GROUP BY distinctParentCombinations.FatherID
)
-- Remove the COUNT function to see the actual combinations
SELECT COUNT(motherGroups.Mothers) AS UniqueMotherGroups
FROM motherGroups
-- Clean up the example
DROP TABLE IF EXISTS dbo.Child;
DROP TABLE IF EXISTS dbo.Father;
DROP TABLE IF EXISTS dbo.Mother;
You have a great explanation and setup of your "problem case".
Your setup runs great in (for example) tempdb.
You have solved the problem in a nice way, and I don't think you can optimize it much further if you are going to calculate the mother groups every time you run the query.
There is one small mistake though; You must do a COUNT(DISTINCT motherGroups.Mothers) in your final count.
Since you mention milions of rows, I would suggest a slightly different approach.
If you aggregate the mother groups as soon as there is a change in the Child table, your query can run fast every time - even with millions of rows.
The kind of queries you want to run is seldom run only once, so it would be nice if the heavy work is already done.
Usually I prefer not to use triggers, because you get extra logic in a place where it could be hard to find and debug.
But sometimes triggers are nice to have, especially when you are not able to change the source code running on the clients.
So, my solution is to add a new column to the Father table and to create a trigger which (re)generates the mother group each time there is a change in the Child table.
This way, the hard aggregation work for each father is done as soon there is a change, and you don't have to aggregate when you run your query.
Since you already have millions of rows, we also have to update these existing rows.
I have used SQL Server 2019 for this solution.
*** The solution ***
Add 1 or 2 new columns to the Father table.
If you should add 1 or 2, it depends on what your preferences are:
"Do I want to see the aggregated mother groups for debugging purpose, or do I just trust the hashed values?"
Column 1: Hashed value of the aggregated mother group for each Father row.
The hashed value is VARBINARY and is at least 32 bytes, but we will use VARBINARY(1600):
1600 is less than 1700 which is the max nonclustered index size, so we will not have any problems indexing the column.
Since the hash value is in blocks of 32 bytes, a value of 1600 will cover a really, really, really long aggreated mother group.
-- Column 1: Hashed value of the aggregated mother group for each Father row.
alter table Father add MotherHash varbinary(1600)
create index IX_MotherHash on Father(MotherHash)
Column 2: This column is more optional, and depends on your preferences.
The column could be nice to have for debugging purpose if any questions are made about the result.
Which VARCHAR-length you should use depends on your real data.
MAX? Then you have no problems storing the mother groups, but you might have problems indexing it, since 1700 is the max for an unclustered index. But maybe you don't need to index it?
1700? Then you are able to index the column, but depending on your real data, will this cover the biggest mother group?
Why indexing? If you want to list the aggregated mother groups, it could be faster to read the index than the whole table.
As said; this depends on you (and your data). If we have no need to see the aggregated mother groups, then we don't need this column at all.
For this demo/solution we will add the column for debugging purpose, without any indexing.
-- Column 2: This column is more optional, and depends on your preferences.
alter table Father add MotherGroup varchar(MAX)
go
Create a trigger on the Child table.
It will handle all inserts, updates and deletes in the Child table.
create or alter trigger trIUD_Child on Child
after insert, update, delete
as
begin
set nocount on
-- Get all FatherIDs from the Inserted and Deleted table.
-- An ordinary Temp table is created with a clustered index to get SEEK performance later.
-- The table might also have more than 100 rows, where table variables are not recommended.
declare #numRowsInInsertedDeleted int
create table #rowsInInsertedDeleted(rowId int identity(1, 1), FatherID int)
create unique clustered index ix on #rowsInInsertedDeleted(rowId)
insert #rowsInInsertedDeleted(FatherID)
select distinct f.FatherID
from
(
select i.FatherID from inserted i
union all
select i.FatherID from deleted i
) f
select #numRowsInInsertedDeleted = max(rowId) from #rowsInInsertedDeleted
-- We have to loop each of the FatherIDs, since we might have several rows in the Inserted and Deleted tables.
declare #rowId int = 0
while (#rowId < #numRowsInInsertedDeleted)
begin
-- Get the father for the next row.
select #rowId += 1
declare #fatherId int
select #fatherId = r.FatherID
from #rowsInInsertedDeleted r
where r.rowId = #rowId
-- Aggregate the mothers for this father.
declare #motherGroup varchar(max) = ''
select #motherGroup += ',' + cast(c.MotherID as varchar)
from Child c
where c.FatherID = #fatherId
group by c.MotherID
order by c.MotherID
-- Update the father record.
-- Any empty strings are handled automatically, skip the leading ','.
update Father
set MotherGroup = substring(#motherGroup, 2, 2147483647),
MotherHash = HASHBYTES('SHA2_256', #motherGroup)
where FatherID = #fatherId
end
end
go
Updating existing rows
Since you already have millions of rows, we must aggregate the mother groups for these existing rows.
If you don't have the disk space for logging the update of the whole table, maybe you should take your database out of AG and switch to Simple recovery model for this task?
In that case you should also modify the update with a WHERE clause to update only parts of the table, and run the update for each part until the whole table is updated.
Example: update Child set FatherID = FatherID where FatherID between 1 and 1000000
Note: This update statement could block access to the Child table for other users.
-- Aggregate the mother groups for the existing rows.
-- This could takes minutes to complete, depending on the number of rows.
-- NOTE: This update statement could block access to the Child table for other users.
update Child set FatherID = FatherID
That's it!
You should now be able to quickly get the mother groups on existing rows, and also after future changes in the Child table.
-- Voila - now you can get the unique mother groups any time at a fast speed.
select count(distinct MotherHash) from Father
Thank you for posting such a comprehensive setup for the test data. However, I'm not running any CREATE/DROP statements against my DB so I converted those tables into table variables. Using your data, I came up with the following query. Just change the table names back to your dbo. names and you should be able to test in your environment. I basically concatenate every father/mother combo into a text string using FOR XML PATH. Then I count up all the distinct combos. If you find error in my logic, let me know. I'm just tossing this in the ring of possible solutions.
WITH distinctCombos AS (
SELECT DISTINCT
c.FatherID, c.MotherID
FROM #Child as c
) , motherComboCount AS (
SELECT
f.FatherID
, f.[Name]
, STUFF((
SELECT
',' + CAST(dc.MotherID as nvarchar)
FROM distinctCombos as dc
WHERE dc.FatherID = f.FatherID
ORDER BY dc.MotherID ASC
FOR XML PATH('')
),1,1,'') as motherList
FROM #Father as f
)
SELECT
COUNT(DISTINCT motherList) as UniqueMotherGroups
FROM motherComboCount as mcc
To save a bit of compute power, remove the STUFF function as it's not necessary for the comparison... it just makes the list nicer to look at if displaying... and I'm in the habit of using it.
It looks like the main differences between our methods is the use of FOR XML PATH vs STRING_AGG (I'm still on older SQL.) And I use DISTINCT twice instead of GROUP BY. If you have a larger dataset to test against, let me know how the 2 methods compare. I'm trying to think of a completely set-based method but I can't see it at the moment.
Update: Method 2.
Here's an idea I had using recursive CTEs to build the distinct mother combinations. In your example data, there are only 2 mothers per father. So there would be a total of 4 set-based queries performed (first CTE, 2 queries in the recursive CTE and the final SELECT).
WITH uniqueCombo as (
SELECT DISTINCT
c.FatherID
, c.MotherID
, ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.MotherID) as row_num
FROM #Child as c
), combos as (
SELECT
uc.FatherID
, uc.MotherID
, CAST(uc.MotherID as nvarchar(max)) as [path]
, row_num
, 0 as hierarchy_num
FROM uniqueCombo as uc
WHERE uc.row_num = 1
UNION ALL
SELECT
uc.FatherID
, uc.MotherID
, co.[path] + ',' + CAST(uc.MotherID as nvarchar(max))
, uc.row_num
, co.hierarchy_num + 1 as heirarchy_num
FROM uniqueCombo as uc
INNER JOIN combos as co
ON co.FatherID = uc.FatherID
--AND co.MotherID <> uc.MotherID
AND co.row_num + 1 = uc.row_num
), rankedCombos as (
SELECT
c.[path]
, ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.hierarchy_num DESC) as row_num
FROM combos as c
)
SELECT COUNT(DISTINCT rc.[path]) as UniqueMotherGroups
FROM rankedCombos as rc
WHERE rc.row_num = 1
Update 2:
I had another idea to use a PIVOT to transpose the records so that the FatherID would be in the left-most column with the MotherIDs as the column headers. To make that work with a dynamic list of MotherIDs, you have to use a dynamic PIVOT/dynamic SQL. (FatherID isn't really needed in the PIVOT so it's not included in the PIVOT query. I just had to describe what the goal is.) After the pivot, you can SELECT DISTINCT to get the unique mother combinations. Then the last SELECT is to get the COUNT. This one I ran in SQL Fiddle:
SQL Fiddle
MS SQL Server 2017 Schema Setup:
-- Create family tables.
CREATE TABLE dbo.Father
(
FatherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
PRIMARY KEY CLUSTERED (FatherID);
ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Mother
(
MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
PRIMARY KEY CLUSTERED (MotherID);
ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Child
(
ChildID INT NOT NULL
, FatherID INT NOT NULL
, MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
PRIMARY KEY CLUSTERED (ChildID);
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
FOREIGN KEY (FatherID)
REFERENCES dbo.Father (FatherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
FOREIGN KEY (MotherID)
REFERENCES dbo.Mother (MotherID);
-- Insert two children with the same parents
INSERT INTO dbo.Father
(
FatherID
, Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')
INSERT INTO dbo.Mother
(
MotherID
, Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');
INSERT INTO dbo.Child
(
ChildID
, FatherID
, MotherID
, Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');
Query 1:
DECLARE #cols AS nvarchar(MAX)
DECLARE #query AS nvarchar(MAX)
SET #cols = STUFF((
SELECT DISTINCT ',' + QUOTENAME(m.MotherID)
FROM Mother as m
FOR XML PATH(''))
,1,1,'')
SET #query = 'SELECT COUNT(mCount) as UniqueMotherGroups FROM (
SELECT DISTINCT ' + #cols + ', 1 as mCount FROM (
SELECT ' + #cols + '
FROM (
SELECT
c.FatherID
, c.MotherID
, 1 as mID
FROM child as c
) x
PIVOT
(
MAX(mID)
FOR MotherID in (' + #cols + ')
) p
) as m
) as mg'
--SELECT #query
Exec(#query)
Results:
| UniqueMotherGroups |
|--------------------|
| 3 |
UPDATE 3: Here's one other idea... create a results table with a unique constraint and with IGNORE_DUP_KEY=ON. You could use this in a function or stored procedure, or, setup a trigger to put the mother combinations into a unique-combo-holding-table. With IGNORE_DUP_KEY=ON, you can insert every combo and only the unique combos will remain. Then just do a count of all the rows.
--Create a table to hold the results:
CREATE TABLE results (
ChildID int not null
, UniqueCombos nvarchar(50) not null
PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)
);
--Insert all combos into the results table. The unique constraint will cause only unique entries to remain.
INSERT INTO results (ChildID, UniqueCombos)
SELECT DISTINCT
c.ChildID
, (
SELECT ',' + CAST(MotherID as nvarchar(500))
FROM Child as c2
WHERE c2.ChildID = c.ChildID
ORDER BY c2.MotherID
FOR XML PATH('')
) as mother_combos
FROM Child as c
;
--Count up all the rows in the results table. Since these are all unique combinations, it should be fast to sum.
SELECT COUNT(*)
FROM results;
If you accept to define a maximum number of mothers per father (here 7) you may try:
select count(*) as UniqueMotherGroups from (
select distinct m1, m2, m3, m4, m5, m6, m7 from (
select FatherID, row_number() over(partition by FatherID order by motherid) as rn, motherid
from (
select distinct FatherID, MotherID
from t_Child
)
)
pivot (
max(motherid) for rn in (1 as m1,2 as m2,3 as m3,4 as m4,5 as m5,6 as m6,7 as m7)
)
)
;
UNIQUEMOTHERGROUPS
------------------
3
Here is one idea. Instead of using precise STRING_AGG you can calculate a hash / checksum of the group. You don't need to know the exact composition of the group, you just need to distinguish between different groups. Calculating of the hash may be faster than concatenating strings.
SQL Server has a function CHECKSUM_AGG
You can write your own hashing function with CLR.
Sample data
CREATE TABLE #Child
(
ChildID INT NOT NULL IDENTITY PRIMARY KEY
,FatherID INT NOT NULL
,MotherID INT NOT NULL
,Name VARCHAR(50) NOT NULL
);
INSERT INTO #Child
(
FatherID
,MotherID
,Name
)
VALUES
(1, 1, 'Adam')
,(1, 1, 'Billy')
,(1, 2, 'Celine')
,(2, 2, 'Derek')
,(3, 1, 'Eric')
,(4, 1, 'A')
,(4, 1, 'B')
,(4, 2, 'C')
,(4, 2, 'D')
,(4, 2, 'E')
,(5, 2, 'F')
,(6, 2, 'G')
;
Query
WITH
distinctParentCombinations
AS
(
SELECT
FatherID
,MotherID
FROM #Child
GROUP BY
FatherID
,MotherID
)
,motherGroups
AS
(
SELECT
FatherID
,CHECKSUM_AGG(MotherID) AS MotherGroup
FROM distinctParentCombinations
GROUP BY
FatherID
)
SELECT COUNT(DISTINCT MotherGroup) AS UniqueMotherGroups
FROM motherGroups
;
Result
+--------------------+
| UniqueMotherGroups |
+--------------------+
| 3 |
+--------------------+
You need to compare performance of all methods on your actual data.
Obviously, with CHECKSUM_AGG it is possible that some of the groups will be missed. There is a chance that two different groups will generate the same checksum.
You know better if this is acceptable.
General way to speed up calculations is to have some of the results already pre-calculated. In your case, for the first part you can create indexed view as follows:
CREATE OR ALTER VIEW vw_distinctParentCombinations WITH SCHEMABINDING AS
SELECT children.FatherID
, children.MotherID
,COUNT_BIG(*) AS [wifes_count]
FROM dbo.Child as children
GROUP BY children.FatherID
, children.MotherID
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_distinctParentCombinations ON vw_distinctParentCombinations
(
FatherID,MotherID
);
Then in your initial query, you can avoid the first CTE:
-- CTE Gets distinct combinations of parents
WITH motherGroups (Mothers)
AS
(SELECT STRING_AGG(CONVERT(VARCHAR(MAX), distinctParentCombinations.MotherID), '-') WITHIN GROUP (ORDER BY distinctParentCombinations.MotherID) AS Mothers
FROM vw_distinctParentCombinations distinctParentCombinations WITH(NOEXPAND)
GROUP BY distinctParentCombinations.FatherID
)
-- Remove the COUNT function to see the actual combinations
SELECT COUNT(motherGroups.Mothers) AS UniqueMotherGroups
FROM motherGroups;
This will avoid the initial read of the large table and depending the distinct combinations of the pairs (father - mother) it can reduce the view size significantly.
Unfortunately, there are a lot of limitations in order to create an indexed view, and you are not able to create such for the second CTE.
If we change our mind and look this issue in different view, simply we can get the count of mothers with this query:
SELECT Count(distinct ConcatMothers) UniqueMothersCount from(
SELECT FatherID, concat(FatherID,'-',SUM(MotherID)) ConcatMothers
FROM dbo.Child
GROUP BY FatherID) t;
Or even you can use Dense_Rank() like this:
SELECT Max(RankMothers) UniqueMothersCount from(
SELECT FatherID, DENSE_RANK() over (order by concat(FatherID,'-',SUM(MotherID))) RankMothers
FROM dbo.Child
GROUP BY FatherID) t;
For the performance it is hard to measure because dataset is small but since we have one column in the group by and the motherId is in the select maybe we can change index as below:
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID) Include(MotherID);
but you need to check it on your dataset.

Create combination sql table

I'm trying to create a sql table in data base in VS that has room and userid column, but the sql will only accept your input if the userid exists in users table and room exists in rooms tables
Allows:
Users table:
Userid
1
2
3
RoomUsers table:
Room ----- User
1 1
2. 1
1. 2
1. 3
2. 3
Won't allow:
Users table:
Userid
1
2
RoomUsers table:
Room ----- User
1 4
Normal foreign key wont work because it only allows one of each index and not multiple, how can I allow what I need to occur,to happen?
(This would be a mess in comments)
Probably we are having an XY problem here. The thing you describe is simply solved with a foreign key. ie:
CREATE TABLE users (id INT IDENTITY NOT NULL PRIMARY KEY, ad VARCHAR(100));
CREATE TABLE rooms (id INT IDENTITY NOT NULL PRIMARY KEY, ad VARCHAR(100));
CREATE TABLE room_user
(
RoomId INT NOT NULL
, UserId INT NOT NULL
, CONSTRAINT PK_roomuser
PRIMARY KEY(RoomId, UserId)
, CONSTRAINT fk_room
FOREIGN KEY(RoomId)
REFERENCES dbo.rooms(id)
, CONSTRAINT fk_user
FOREIGN KEY(UserId)
REFERENCES dbo.users(id)
);
INSERT INTO dbo.users(ad)
OUTPUT
Inserted.id, Inserted.ad
VALUES('RayBoy')
, ('John')
, ('Frank');
INSERT INTO dbo.rooms(ad)
OUTPUT
Inserted.id, Inserted.ad
VALUES('Room1')
, ('Room2')
, ('Room3');
INSERT INTO dbo.room_user(RoomId, UserId)VALUES(1, 1), (1, 2), (2, 3);
-- won't allow
INSERT INTO dbo.room_user(RoomId, UserId)VALUES(999, 888);

How to enforce an "ALL-TO-ALL" relationship?

Suppose I have the following structure (SQL Server syntax):
CREATE TABLE A (
key_A int NOT NULL PRIMARY KEY CLUSTERED,
info_A nvarchar(50) NULL
);
CREATE TABLE B(
key_B int NOT NULL PRIMARY KEY CLUSTERED,
info_B nvarchar(50) NULL
);
CREATE TABLE C(
key_C int NOT NULL PRIMARY KEY CLUSTERED,
key_A int NOT NULL,
key_B int NOT NULL,
info1 nvarchar(50) NULL,
info2 nvarchar(50) NULL,
info3 nvarchar(50) NULL
);
ALTER TABLE C WITH CHECK ADD CONSTRAINT FK_C_A FOREIGN KEY(key_A) REFERENCES A (key_A);
ALTER TABLE C WITH CHECK ADD CONSTRAINT FK_C_B FOREIGN KEY(key_B) REFERENCES B (key_B);
So, table C has two primary keys to table A and table B.
Table C has to have the cartesian product of table A and table B. That means all combinations, so when a new record is inserted in table A, we have to insert in table C several rows with the new reference in A by all the rows in B. And viceversa, in the case of insertion in B.
The question is, how can you enforce the integrity of such relationship in SQL Server, in which table C has to have all combinations of A and B? Or, if you consider such a structure a bad practice, what alternative tables do you recommend that do not add the hassle of having to do DISTINCT selects and such?
Thanks!
Link to Fiddle:
Fiddle
You need to have inserted into table A / B before you can reference this new entry in table C. The only way I know of would be to create a trigger on tables A and B to populate table C when a new entry was made to either of those tables. The issue there is what do you then put into the other fields? Since these are nullable I assume you're happy defaulting them to null? If not (i.e. you want the user to input valid values) the only way to do this would be in the logic of your application, rather than at database level (or by using stored procedures to populate these tables where those procs had suitable logic to create the appropriate entries in C in addition to A / B).
Trigger Code Example:
use StackOverflowDemos
go
if OBJECT_ID('TRG_C_DELETE','TR') is not null drop trigger TRG_C_DELETE
if OBJECT_ID('TRG_A_INSERT','TR') is not null drop trigger TRG_A_INSERT
if OBJECT_ID('TRG_B_INSERT','TR') is not null drop trigger TRG_B_INSERT
if OBJECT_ID('C','U') is not null drop table C
if OBJECT_ID('A','U') is not null drop table A
if OBJECT_ID('B','U') is not null drop table B
go
CREATE TABLE A
(
key_A int NOT NULL IDENTITY(1,1) CONSTRAINT PK_A PRIMARY KEY CLUSTERED,
info_A nvarchar(50) NULL
);
go
CREATE TABLE B
(
key_B int NOT NULL IDENTITY(1,1) CONSTRAINT PK_B PRIMARY KEY CLUSTERED,
info_B nvarchar(50) NULL
);
go
CREATE TABLE C
(
key_C int NOT NULL IDENTITY(1,1) CONSTRAINT PK_C PRIMARY KEY CLUSTERED,
key_A int NOT NULL CONSTRAINT FK_C_A FOREIGN KEY(key_A) REFERENCES A (key_A),
key_B int NOT NULL CONSTRAINT FK_C_B FOREIGN KEY(key_B) REFERENCES B (key_B),
info1 nvarchar(50) NULL,
info2 nvarchar(50) NULL,
info3 nvarchar(50) NULL
);
go
CREATE TRIGGER TRG_A_INSERT
ON A
AFTER INSERT
AS
BEGIN
--add new As to C
insert C (key_A, key_B)
select key_A, key_B
from inserted
cross join B
END;
go
CREATE TRIGGER TRG_B_INSERT
ON B
AFTER INSERT
AS
BEGIN
--add new As to C
insert C (key_A, key_B)
select key_A, key_B
from inserted
cross join A
END;
go
CREATE TRIGGER TRG_C_DELETE
ON C
AFTER DELETE
AS
BEGIN
DELETE
FROM B
WHERE key_B IN
(
SELECT key_B
FROM DELETED d
--ths row onwards are here to cover the circumstance that the record being deleted isn't the only instance of B in this table
WHERE key_B NOT IN
(
SELECT key_B
FROM C
WHERE C.key_C NOT IN
(
SELECT key_C
FROM deleted
)
)
)
DELETE
FROM A
WHERE key_A IN
(
SELECT key_A
FROM DELETED d
--ths row onwards are here to cover the circumstance that the record being deleted isn't the only instance of A in this table
WHERE key_A NOT IN
(
SELECT key_A
FROM C
WHERE C.key_C NOT IN
(
SELECT key_C
FROM deleted
)
)
)
END;
go
insert A select 'X'
select * from C --no results as no Bs yet
insert A select 'Y'
select * from C --no results as no Bs yet
insert B select '1'
select * from C --2 results; (X,1) and (Y,1)
insert A select 'Z'
select * from C --3 results; the above and (Z,1)
delete from A where info_A = 'Y'
select * from C --3 results; as above since the previous statement should fail due to enforced referential integrity
insert C (key_A, key_B, info1)
select a.key_A, b.key_B, 'second entry for (Y,1)'
from A
cross join B
where a.info_A = 'Y'
and b.info_B = '1'
select * from C --4 results; as above but with a second (Y,1), this time with data in info1
delete from C where info1 = 'second entry for (Y,1)'
select * from C --3 results; as above but without the new(Y,1)
select * from A --3 results
select * from B --1 result
delete from C where key_A in (select key_A from A where info_A = 'Y')
select * from C --2 results; (X,1) and (Z,1)
select * from A --2 results; X and Z
select * from B --1 result
delete from C where key_B in (select key_B from B where info_B = '1')
select * from C --0 results
select * from A --0 results
select * from B --0 result
SQL Fiddle demo here (NB: only SQL Fiddle only shows the outputs from table C; also the error demo has been commented out since this errors the whole thing rather than just the one error line).
http://sqlfiddle.com/#!3/34d2f/4

Split string, copy data into multiple tables

I have the following tables
Table A
RID | Name |Phone |Email |CreatedOn
------------------------------------------------------------
1 | John Smith | 2143556789 |t1#gmail.com |2012-09-01 09:30:00
2 | Jason K Crull | 2347896543 |t2#gmail.com |2012-08-02 10:34:00
Table B
CID| FirstName |LastName |Phone |Email |CreatedOn |Title|Address|City|State
---------------------------------------------------------------------------------------------------
11 | John | Smith |2143556789 |t1#gmail.com |2012-09-01 09:30:00|NULL|NULL|NULL|NULL
12 | Jason | K Crull |2347896543 |t2#gmail.com |2012-08-02 10:34:00|NULL|NULL|NULL|NULL
Table C
RID | CID |IsAuthor|CreatedOn
-----------------------------------------
1 | 11 | 0 |2012-09-01 09:30:00
2 | 12 | 0 |2012-08-02 10:34:00
For every row in "Table A" need to create a row in "Table B" splitting the name into First and Last Name as shown and after creating a row, insert new row into Table C with RID from Table A, CID from Table B, IsAuthor bit Default to 0 and CreatedOn from Table A.The CID is auto incremented. Can anyone help me in achieving this. I am very new to SQL. Thanks!
I believe you're looking for something like this (I left off some fields, but this should get the point across). Main thing to see is the substring and charindex functions which are used to split the name into first name and last names:
insert into tableb (firstname,lastname,phone,email)
select
left(name, charindex(' ',name)-1),
substring(name, charindex(' ', name)+1, len(name)),
phone, email
from tablea ;
insert into tablec
select a.rid, b.cid, 0, a.createdon
from tablea a
inner join tableb b on a.name = b.firstname + ' ' + b.lastname
and a.phone = b.phone and a.email = b.email ;
SQL Fiddle Demo
If there is a concern for the same names, emails, etc, then you're probably going to need to look into using a dreaded cursor and scope_identity(). Hopefully you won't have to go down that route.
Stop thinking about "for every row" and think of it as "a set." Very rarely is it efficient to process anything row-by-row in SQL Server, and very rarely is it beneficial to think in those terms.
--INSERT dbo.TableC(RID, CID, IsAuthor, CreatedOn)
SELECT a.RID, b.CID, IsAuthor = 0, a.CreatedOn
FROM dbo.TableA AS a
INNER JOIN dbo.TableB AS b
ON a.Name = b.FirstName + ' ' b.LastName;
When you believe it is returning the right results, uncomment the INSERT.
To split the name I'd use CharIndex to find the position of the space, then Substring to break the word apart.
For keeping track of which row in TableA the data in TableB came from, I'd just stick a column onto B to record this data, then drop it when you come to inset into table C.
An alternative would be to make CID an identity column on C instead of B, populate C first, then feed that data into TableB when you come to populate that.
if OBJECT_ID('TableA','U') is not null drop table TableA
create table TableA
(
rid int not null identity(1,1) primary key clustered
, Name nvarchar(64)
, Phone nvarchar(16)
, Email nvarchar(256)
, CreatedOn datetime default (getutcdate())
)
if OBJECT_ID('TableB','U') is not null drop table TableB
create table TableB
(
cid int not null identity(1,1) primary key clustered
, FirstName nvarchar(64)
, LastName nvarchar(64)
, Phone nvarchar(16)
, Email nvarchar(256)
, CreatedOn datetime default (getutcdate())
, Title nvarchar(16)
, [Address] nvarchar(256)
, City nvarchar(64)
, [State] nvarchar(64)
)
if OBJECT_ID('TableC','U') is not null drop table TableC
create table TableC
(
rid int primary key clustered
, cid int unique
, IsAuthor bit default(0)
, CreatedOn datetime default (getutcdate())
)
insert TableA (Name, Phone, Email) select 'John Smith', '2143556789', 't1#gmail.com'
insert TableA (Name, Phone, Email) select 'Jason K Crull', '2347896543', 't2#gmail.com'
alter table TableB
add TempRid int
insert TableB(FirstName, LastName, Phone, Email, TempRid)
select case when CHARINDEX(' ', Name) > 0 then SUBSTRING(Name, 1, CHARINDEX(' ', Name)-1) else Name end
, case when CHARINDEX(' ', Name) > 0 then SUBSTRING(Name, CHARINDEX(' ', Name)+1, LEN(Name)) else '' end
, Phone
, Email
, Rid
from TableA
insert TableC (rid, cid)
select TempRid, cid
from TableB
alter table TableB
drop column TempRid
select * from TableB
select * from TableC
Try it here: http://sqlfiddle.com/#!3/aaaed/1
Or the alternate method (inserting to C before B) here: http://sqlfiddle.com/#!3/99592/1

How to prevent an entry in one table from existing in another table

I have 3 tables in SQL Server 2008
Clubs with a PK of ID and a Name.
Products which has a PK of ID, a FK of ClubID , a Name, a ShortCode and a Keyword.
There is a UK to enforce that there are no duplicate keywords for combinations of ShortCode/Keyword.
ProductAdditionalShortCodes. This has a PK of ID, a FK of ProductID and a Keyword
The idea is to prevent any shortcode/keyword combination of products to point to different clubs and also to prevent the creation of duplicate short/code keyword combinations
I have a solution that works, but feels clunky, and could under certain circumstances fail if multiple users happened to update multiple entries at the same time. (Hypothetically)
How can I add some form of constraint to the DB to prevent the Keyword in the Main table from being the same as in the Additional table and the other way round?
Following is a sample script to create the scenario and some of the examples I want to prevent. I am not opposed to changing the DB design if the impact of the change would not disrupt too many other aspects of the solution. (I realize this is subjective)
use Tinker
if exists (SELECT * FROM dbo.sysobjects WHERE id = OBJECT_ID(N'dbo.ProductAdditionalKeywords') AND type in (N'U'))
drop table dbo.ProductAdditionalKeywords
go
if exists (SELECT * FROM dbo.sysobjects WHERE id = OBJECT_ID(N'dbo.Products') AND type in (N'U'))
drop table dbo.Products
go
if exists (SELECT * FROM dbo.sysobjects WHERE id = OBJECT_ID(N'dbo.Clubs') AND type in (N'U'))
drop table dbo.Clubs
go
create table dbo.Clubs (
ID int not null identity(1,1)
,Name varchar(50) not null
,constraint PK_Clubs primary key clustered ( ID )
)
go
alter table dbo.Clubs add constraint UK_Clubs__Name unique ( Name )
go
create table dbo.Products (
ID int not null identity(1,1)
,ClubID int not null
,Name varchar(50) not null
,ShortCode varchar(50) not null
,Keyword varchar(50) not null
,constraint PK_Products primary key clustered ( ID )
)
go
alter table dbo.Products add constraint UK_Products__ShortCode_Keyword unique ( ShortCode , Keyword )
go
alter table dbo.Products add constraint UK_Products__Name unique ( Name )
go
alter table dbo.Products add constraint FK_Products_ClubID foreign key ( ClubID ) references dbo.Clubs ( ID )
go
create table dbo.ProductAdditionalKeywords (
ID int not null identity(1,1)
,ProductID int not null
,Keyword varchar(50) not null
,constraint PK_ProductAdditionalKeywords primary key clustered ( ID )
)
go
alter table dbo.ProductAdditionalKeywords add constraint FK_ProductAdditionalKeywords_ProductID foreign key ( ProductID ) references dbo.Products ( ID )
go
alter table dbo.ProductAdditionalKeywords add constraint UK_ProductAdditionalKeywords__Keyword unique ( Keyword )
go
insert into dbo.Clubs ( Name )
select 'Club 1'
union all select 'Club 2'
insert into dbo.Products (ClubID,Name,Shortcode,Keyword)
select 1,'Product 1','001','P1'
union all select 1,'Product 2','001','P2'
union all select 1,'Product 3','001','P3'
union all select 2,'Product 4','002','P4'
union all select 2,'Product 5','002','P5'
union all select 2,'Product 6','002','P6'
insert into dbo.ProductAdditionalKeywords (ProductID,Keyword)
select 1,'P1A'
union all select 1,'P1B'
union all select 2,'P2A'
union all select 2,'P2B'
/*
What can be done to prevent the following statements from beeing allowed based on the reason in the comments?
*/
--insert into dbo.ProductAdditionalKeywords (ProductID,Keyword) values ( 1 , 'P2' ) -- Main keyword for product 2
--update dbo.Products set Keyword = 'P1A' where ID = 2 -- Additional keyword for product 1
--insert into dbo.ProductAdditionalKeywords (ProductID,Keyword) values ( 3 , 'P1' ) -- Main ShortCode/Keyword combination for product 1
/*
At the moment I look at the following view to see if the proposed(new/updated) Keyword/Shortcode combination already exists
If it already exists I pevent the insert/update
Is there any way to do it in the DB via constraints rather than in the BLL?
*/
select ShortCode,Keyword,count([ClubID]) as ClubCount from
(
select p.ClubID,p.ShortCode,p.Keyword,p.ID
from dbo.Products p
union all
select p.ClubID,p.ShortCode,PAK.Keyword,PAK.ID * -1
from dbo.ProductAdditionalKeywords as PAK
inner join dbo.Products P on PAK.ProductID = P.ID
) as FullList
group by Shortcode,Keyword
order by Shortcode,Keyword
How I would normally do this would be to place all of the keywords in a separate table (e.g. what is currently your additional table). If all keywords must be distinct within a ShortCode, then I'd include ShortCode in this table also, so that a unique constraint can be applied across both columns.
If all keywords for a product must be in the same ShortCode, then I'd keep ShortCode in Products also. I'd apply a unique constraint on (ID,ShortCode) in that table, and an additional foreign key from the keywords table, referencing both columns on both sides.
What we're left with now are two potential issues not included in your original design, but I don't know if they're a concern in practice:
1) Is the Keyword in Products more important, or special, than the additional keywords? If so, we need to add a column to keywords table to mark which one is important. To ensure only one is set, you can search for plenty of other SO questions which involve unique constraints with additional conditions. (Let me know if you can't find one and need it, I'm sure I can add a link if necessary)
2) Should a Product be allowed to have no keywords? If not, then I'd create a view that mimics your original Products table. In this circumstance, it would be easier if 1) above is true, in which case we always join to the "important" keyword. Otherwise, we need to have some way to limit it to a single row per product. We deny insert/update/delete on the table, and only allow them through the view. 3 relatively simple triggers will then maintain the underlying table structure.
on your design, I do not understand the the use of productAdditionalShortCodes having no field of ShortCode.
However, you can add Unique key constraint with ShortCode & Keyword (composite key). This will eliminate duplicate entry in product table.