Masking or hiding inaccurately entered data in SQL Server 2008

OK, so my subject line isn't very descriptive, but here's the scenario:
An end-user has a legal obligation to submit transaction data to a government agency. The transactions contain the name and address of various individuals and organizations. HOWEVER, end users frequently misspell the names of the reported individuals and organizations, or they badly mangle the address, etc.
The information submitted by the end user is a legal 'document', so it cannot be altered by the agency that received it. Also, the transactions can be viewed and searched by the public. When the government agency notices an obvious misspelling or bad address, they would like to 'hide' or 'mask' that bad value with a known good value. For example, if an end user entered 'Arnie Schwarzeger', the agency could replace that name with 'Arnold Schwarzenegger'. The public that viewed the data would see (and search for) the correct spelling, but could view the original data as entered by the end user after they found the data record in question.
Hopefully that explains the business case well enough...on to the SQL part! So to address this problem, we have tables that look like this:
CREATE TABLE [dbo].[SomeUserEnteredData](
[Id] [uniqueidentifier] NOT NULL,
[LastOrOrganizationName] [nvarchar](350) NOT NULL, -- data as entered by end-user
[FirstName] [nvarchar](50) NULL, -- data as entered by end-user
[FullName] AS ([dbo].[FullNameValue]([FirstName],[LastOrOrganizationName])) PERSISTED, -- data as entered by end-user
[MappedName] AS ([dbo].[MappedNameValue]([FirstName],[LastOrOrganizationName])) -- this is the 'override' data from the agency
)
CREATE TABLE [dbo].[CorrectionsByAgency](
[Id] [uniqueidentifier] NOT NULL,
[ReplaceName] [nvarchar](400) NOT NULL,
[KeepName] [nvarchar](400) NOT NULL)
CREATE FUNCTION [dbo].[FullNameValue]
(
@FirstName as NVARCHAR(40),
@LastOrOrganizationName as NVARCHAR(350)
)
RETURNS NVARCHAR(400)
WITH SCHEMABINDING
AS
BEGIN
DECLARE @result NVARCHAR(400)
IF @FirstName = '' OR @FirstName is NULL
SET @result = @LastOrOrganizationName
ELSE
SET @result = @LastOrOrganizationName + ', ' + @FirstName
RETURN @result
END
CREATE FUNCTION [dbo].[MappedNameValue]
(
@FirstName as NVARCHAR(50),
@LastOrOrganizationName as NVARCHAR(350)
)
RETURNS NVARCHAR(400)
AS
BEGIN
DECLARE @result NVARCHAR(400)
DECLARE @FullName NVARCHAR(400)
SET @FullName = dbo.FullNameValue(@FirstName, @LastOrOrganizationName)
SELECT top 1 @result = KeepName from CorrectionsByAgency where ReplaceName = @FullName
if @result is null
SET @result = @FullName
RETURN @result
END
Hopefully, if my sample isn't TOO convoluted, you can see that if the agency enters a name correction, it will replace all occurrences of the misspelled name. From a business logic perspective, this works exactly right: the agency staff only enters a few corrections and the corrections can override everywhere there are misspelled names.
From a server performance standpoint, this solution STINKS. The calculated SomeUserEnteredData.MappedName column can't be indexed, and no view that reads from that column can be indexed either! There's no way this can work for our needs if we can't index the MappedName values.
The only alternative I've been able to see as a possibility is to create an additional linking table between the end-user created data and the agency created data -- when the agency enters a correction record, a record is created in the linking table for every occurrence of the bad column value. The down side to this seems to be the very real likelihood of creating/destroying many (hundreds of thousands) of those linking records for every correction entered by an agency user...
Do any of you SQL geniuses out there have great ideas about how to address this problem?

I'm not sure if this is answering your question directly, but I would try to simplify the whole thing: stop using functions, persist "calculated" values and use application logic (possibly in a stored procedure) to manage the data.
Assuming that one agency correction can be applied to many user-entered names, then you could have something like this:
-- dbo.CorrectedNames must exist before the foreign key below can reference it
create table dbo.CorrectedNames (
CorrectedNameId uniqueidentifier not null primary key,
CorrectedName nvarchar(1000) not null
)
create table dbo.UserEnteredData (
DocumentId uniqueidentifier not null primary key,
UserEnteredName nvarchar(1000) not null,
CorrectedNameId uniqueidentifier null,
constraint FK_CorrectedNames foreign key (CorrectedNameId)
references dbo.CorrectedNames (CorrectedNameId)
)
Now, you need to make sure your application logic can do something like this:
1. External user enters dirty data
2. Agency user reviews the dirty data and identifies both the incorrect name and the corrected name
3. Application checks if the corrected name already exists
4. If no, create a new row in dbo.CorrectedNames
5. Create a new row in dbo.UserEnteredData, with the CorrectedNameId
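The steps above could be sketched as a stored procedure against the dbo.CorrectedNames and dbo.UserEnteredData tables from this answer. The procedure name dbo.ApplyNameCorrection and its parameters are hypothetical. Two things here are assumptions rather than part of the answer: matching on the exact user-entered string, and updating rows that already exist instead of creating new ones.

```sql
CREATE PROCEDURE dbo.ApplyNameCorrection  -- hypothetical name
    @BadName  nvarchar(1000),   -- value as entered by end users
    @GoodName nvarchar(1000)    -- agency's corrected value
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @CorrectedNameId uniqueidentifier;

    BEGIN TRANSACTION;

    -- Reuse the corrected name if the agency has already entered it.
    SELECT @CorrectedNameId = CorrectedNameId
    FROM dbo.CorrectedNames
    WHERE CorrectedName = @GoodName;

    IF @CorrectedNameId IS NULL
    BEGIN
        SET @CorrectedNameId = NEWID();
        INSERT INTO dbo.CorrectedNames (CorrectedNameId, CorrectedName)
        VALUES (@CorrectedNameId, @GoodName);
    END

    -- Point every matching document at the correction;
    -- the user-entered text itself is never modified.
    UPDATE dbo.UserEnteredData
    SET CorrectedNameId = @CorrectedNameId
    WHERE UserEnteredName = @BadName;

    COMMIT TRANSACTION;
END
GO

-- Public display: show the correction when one exists, keep the original available.
SELECT d.DocumentId,
       COALESCE(c.CorrectedName, d.UserEnteredName) AS DisplayName,
       d.UserEnteredName AS OriginalName
FROM dbo.UserEnteredData AS d
LEFT JOIN dbo.CorrectedNames AS c
    ON c.CorrectedNameId = d.CorrectedNameId;
```

Because both names are now plain stored columns, they can be indexed, provided the column lengths fit within the index key size limit (900 bytes on SQL Server 2008, so nvarchar(450) or shorter rather than the nvarchar(1000) used here).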
I'm assuming that things are rather more complicated in reality and corrections are made based on addresses and other data as well as just names, but the basic relationship you describe seems simple enough. As you said, the functions add a lot of overhead and it's not clear (to me) what benefit they provide over just storing the data you need directly.
Finally, I don't understand your comment about creating/destroying linking records; it's up to your application logic to handle data changes correctly.

SQL custom logic in constraints

I need to create a custom constraint that prevents duplicate combinations of these columns within the same period of time:
CREATE FUNCTION [dbo].[CheckPriceListDuplicates](
@priceListId uniqueidentifier,
@supplierId uniqueidentifier,
@transportModeId uniqueidentifier,
@currencyId uniqueidentifier,
@departmentTypeId uniqueidentifier,
@consolidationModeId uniqueidentifier,
@importerId uniqueidentifier,
@exporterId uniqueidentifier,
@validFrom datetimeoffset(7),
@validTo datetimeoffset(7))
RETURNS int
AS
BEGIN
DECLARE @result int
IF EXISTS (SELECT * FROM [dbo].[PriceListEntries] AS [Extent1]
WHERE ([Extent1].[Id] <> @priceListId) AND
((([Extent1].[SupplierAddressBook_Id] IS NULL) AND (@supplierId IS NULL)) OR ([Extent1].[SupplierAddressBook_Id] = @supplierId)) AND
([Extent1].[TransportMode_Id] = @transportModeId) AND
([Extent1].[Currency_Id] = @currencyId) AND
([Extent1].[DepartmentType_Id] = @departmentTypeId) AND
((([Extent1].[ConsolidationMode_Id] IS NULL) AND (@consolidationModeId IS NULL)) OR ([Extent1].[ConsolidationMode_Id] = @consolidationModeId)) AND
((([Extent1].[Importer_Id] IS NULL) AND (@importerId IS NULL)) OR ([Extent1].[Importer_Id] = @importerId)) AND
((([Extent1].[Exporter_Id] IS NULL) AND (@exporterId IS NULL)) OR ([Extent1].[Exporter_Id] = @exporterId)) AND
((@validFrom >= [Extent1].[ValidFrom]) OR (@validTo <= [Extent1].[ValidTo]))
)
BEGIN
SET @result = 0
END
ELSE
BEGIN
SET @result = 1
END
RETURN @result
END
ALTER TABLE [dbo].[PriceListEntries]
ADD CONSTRAINT UniquCombinations CHECK ([dbo].[CheckPriceListDuplicates](
Id,
SupplierAddressBook_Id,
TransportMode_Id,
Currency_Id,
DepartmentType_Id,
ConsolidationMode_Id,
Importer_Id,
Exporter_Id,
ValidFrom,
ValidTo) = 1)
Any idea how to do this without a function?
It's a generally accepted concept that business rules should not be enforced in the DB. This is also generally difficult to strictly enforce as there is a large amount of overlap between business rules and data integrity rules. A data integrity constraint may limit a field to an integer value between 5 and 20, but that is because some business rule somewhere stipulates those are the only valid values.
So the difference between a business rule and a constraint is usually de facto defined as: a business rule is something that can't be easily enforced with the built-in checks available in the database and a constraint can be.
But I would further narrow the definition to state that a business rule is liable to change and a constraint is more static. For example, the rule "a patron may have no more than 5 library items checked out at any one time" could well be easily enforced using database constraints. But the limit of 5 is arbitrary and could change at a moment's notice. Therefore it should be defined as a business rule that should not be enforced at the database level.
If a structural or modeling change or enhancement/addition of a database feature makes a "business rule" easily enforceable in the database where it wasn't before, you still have to consider if the rule is rigidly defined such that it is not expected to change. The database should be the bedrock, the foundation of your data edifice. You don't want it shifting around a lot.
One way to query multiple columns in a table and see whether they are unique in combination is to concatenate the tested data into a single string.
SELECT CONCAT(Col1, Col2, Col3) AS ConcateString
FROM [TABLE_NAME]
WHERE CONCAT(Col1, Col2, Col3) = 'All_of_your_data_in_one_string';
https://msdn.microsoft.com/en-us/library/hh231515.aspx
If the query returns more than one row, the combination of the data is not unique.
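The same check can also be run without concatenation by grouping on the columns directly. This avoids false matches when different values concatenate to the same string (for example 'ab' + 'c' and 'a' + 'bc'). A minimal sketch, using the same placeholder table and column names as above:

```sql
-- Lists every combination of Col1, Col2, Col3 that occurs more than once.
SELECT Col1, Col2, Col3, COUNT(*) AS Occurrences
FROM [TABLE_NAME]
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1;
```

An empty result means every combination is unique. Note that GROUP BY puts NULLs in the same group, so rows that are NULL in the same columns count as duplicates of each other.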

Conditionally set values in UPDATE statement

I would like to have a stored procedure that will update values in a table row depending on whether or not the parameters are provided. For example, I have a situation where I want to update all the values, but also a situation where I'm only required to update two values. I was hoping to be able to do this with only one procedure, rather than writing two, which doesn't particularly appeal to me. The best I have managed to come up with myself is something like the following:
CREATE PROCEDURE dbo.UpdatePerson
@PersonId INT,
@Firstname VARCHAR(50) = NULL,
@Lastname VARCHAR(50) = NULL,
@Email VARCHAR(50) = NULL
AS
BEGIN
SET NOCOUNT ON
UPDATE Person
Set
Firstname = COALESCE(@Firstname, Firstname),
Lastname = COALESCE(@Lastname, Lastname),
Email = COALESCE(@Email, Email)
WHERE PersonId = @PersonId
END
I realize that the values will be updated each time anyway, which isn't ideal. Is this an effective way of achieving this, or could it be done a better way?
I think your code is fine. The only thing I would add is a check for the case when all three params are NULL, in which case no update should be done.
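That extra check might look like the following sketch, which is the procedure from the question with an early return added (the dbo schema on Person is an assumption):

```sql
CREATE PROCEDURE dbo.UpdatePerson
    @PersonId  INT,
    @Firstname VARCHAR(50) = NULL,
    @Lastname  VARCHAR(50) = NULL,
    @Email     VARCHAR(50) = NULL
AS
BEGIN
    SET NOCOUNT ON;

    -- Nothing was supplied, so skip the UPDATE entirely.
    IF @Firstname IS NULL AND @Lastname IS NULL AND @Email IS NULL
        RETURN;

    UPDATE dbo.Person
    SET Firstname = COALESCE(@Firstname, Firstname),
        Lastname  = COALESCE(@Lastname,  Lastname),
        Email     = COALESCE(@Email,     Email)
    WHERE PersonId = @PersonId;
END
```

One limitation of the COALESCE pattern is that a caller cannot use it to set a column back to NULL, because NULL is taken to mean "leave unchanged".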
SQL Server does actually have some logic to deal with non updating updates.
More details than you probably wanted to know!

Replication Custom resolver changes empty strings to NULLs

We have a C# application which posts to a database which is replicated to another database (using merge replication) and has one custom resolver, which is a stored procedure.
This was working fine under SQL Server 2000, but when testing under SQL Server 2005 the custom resolver is attempting to change any empty varchar columns to nulls (and failing because this particular column does not allow nulls).
Note that these varchar fields are not the ones which cause the conflict, as they are currently empty on both databases and are not being changed, and the stored procedure does not change them (all it is doing is attempting to set the value of another money column).
Has anyone come across this problem, or has an example of a stored procedure which will leave empty strings as they are?
The actual stored procedure is fairly simple and recalculates the customer balance in the event of a conflict.
ALTER procedure [dbo].[ReCalculateCustomerBalance]
@tableowner sysname,
@tablename sysname,
@rowguid varchar(36),
@subscriber sysname,
@subscriber_db sysname,
@log_conflict INT OUTPUT,
@conflict_message nvarchar(512) OUTPUT
AS
set nocount on
DECLARE
@CustomerID bigint,
@SysBalance money,
@CurBalance money,
@SQL_TEXT nvarchar(2000)
Select @CustomerID = customer.id from customer where rowguid= @rowguid
Select @SysBalance = Sum(SystemTotal), @CurBalance = Sum(CurrencyTotal) From CustomerTransaction Where CustomerTransaction.CustomerID = @CustomerID
Update Customer Set SystemBalance = IsNull(@SysBalance, 0), CurrencyBalance = IsNull(@CurBalance, 0) Where id = @CustomerID
Select * From Customer Where rowguid= @rowguid
Select @log_conflict =0
Select @conflict_message ='successful'
Return(0)
You have a few options here; each is a bit of a workaround for what my research suggests is an issue with SQL Server.
1- Alter this statement: Select * From Customer Where rowguid= @rowguid to explicitly list each of the columns, and use "isNull" for the offending fields
2- Alter the column in the table to add a default constraint for ''. If an insert does not supply a value for the column, the empty string is used instead of 'null' (a default does not replace a 'null' that is supplied explicitly)
3- Add an 'instead of insert' trigger (SQL Server has no 'before insert' trigger) which alters the data before the insert so that it no longer contains a 'null'
PS: Are you positive that the replication system has that column marked as "required"? I think if it is not required, it will insert 'null' if no data exists.
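Option 1 could look roughly like the sketch below. Only id, rowguid, SystemBalance and CurrencyBalance appear in the question; Notes is a hypothetical stand-in for the NOT NULL varchar column that is failing. The real statement would need to list every column of Customer, presumably in the table's own column order. Whether this actually stops the resolver from turning empty strings into NULLs is untested.

```sql
-- Replaces "Select * From Customer Where rowguid= @rowguid" inside the resolver procedure.
SELECT
    id,
    rowguid,
    SystemBalance,
    CurrencyBalance,
    ISNULL(Notes, '') AS Notes   -- hypothetical column: return '' rather than NULL
FROM Customer
WHERE rowguid = @rowguid;
```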

Updating a Table from a Stored Procedure

I am trying to learn database on my own; all of your comments are appreciated.
I have the following table.
CREATE TABLE AccountTable
(
AccountId INT IDENTITY(100,1) PRIMARY KEY,
FirstName NVARCHAR(50) NULL,
LastName NVARCHAR(50) NULL,
Street NVARCHAR(50) NULL,
StateId INT REFERENCES STATETABLE(StateId) NOT NULL
)
I would like to write a Stored procedure that updates the row. I imagine that the stored procedure would look something like this:
CREATE PROCEDURE AccountTable_Update
@Id INT,
@FirstName NVARCHAR(20),
@LastName NVARCHAR(20),
@StreetName NVARCHAR(20),
@StateId INT
AS
BEGIN
UPDATE AccountTable
Set FirstName = @FirstName
Set LastName = @LastName
Set Street = @StreetName
Set StateId = @StateId
WHERE AccountId = @Id
END
the caller provides the new information that he wants the row to have. I know that some of the fields are not entirely accurate or precise; I am doing this mostly for learning.
I am having a syntax error with the SET commands in the UPDATE portion, and I don't know how to fix it.
Is the stored procedure I am writing a procedure that you would write in real life? Is this an antipattern?
Are there any grave errors I have made that just makes you cringe when you read the above TSQL?
Are there any grave errors I have made that just makes you cringe when you read the above TSQL?
Not really "grave," but I noticed your table's string fields are set up as the datatype of NVARCHAR(50) yet your stored procedure parameters are NVARCHAR(20). This may be cause for concern. Usually your stored procedure parameters will match the corresponding field's datatype and precision.
#1: You need commas between your columns:
UPDATE AccountTable SET
FirstName = @FirstName,
LastName = @LastName,
Street = @StreetName,
StateId = @StateId
WHERE
AccountId = @Id
SET is only called once, at the very start of the UPDATE list. Every column after that is in a comma separated list. Check out the MSDN docs on it.
#2: This isn't an antipattern, per se, especially given user input. You want parameterized queries to avoid SQL injection. If you were to build the query as a string from user input, you would be very susceptible to SQL injection. By using parameters, the values are passed to the query as data rather than being spliced into the SQL text, which closes off that vulnerability. There are a lot of opponents of stored procedures, but you're using this one as a way to beat SQL injection, so it's not an antipattern.
#3: The only grave error I saw was the repeated SET instead of commas. Also, as ckittel pointed out, the lengths of your nvarchar parameters don't match the lengths of the columns.
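Putting both points together, the procedure from the question might end up like this sketch. The parameter lengths are raised to NVARCHAR(50) to match the table definition, and SET NOCOUNT ON is an optional addition:

```sql
CREATE PROCEDURE AccountTable_Update
    @Id         INT,
    @FirstName  NVARCHAR(50),
    @LastName   NVARCHAR(50),
    @StreetName NVARCHAR(50),
    @StateId    INT
AS
BEGIN
    SET NOCOUNT ON;

    -- One SET keyword, then a comma-separated list of assignments.
    UPDATE AccountTable
    SET FirstName = @FirstName,
        LastName  = @LastName,
        Street    = @StreetName,
        StateId   = @StateId
    WHERE AccountId = @Id;
END
```

A call such as EXEC AccountTable_Update @Id = 100, @FirstName = N'Ann', @LastName = N'Lee', @StreetName = N'Main St', @StateId = 1; would then update the row with AccountId 100, assuming a StateId of 1 exists in STATETABLE.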

Adding a constraint to prevent duplicates in SQL Update Trigger

We have a user table in which every user has a unique email and username. We try to enforce this within our code, but we want to be sure users are never inserted (or updated) in the database with the same username or email.
I've added an INSTEAD OF INSERT trigger which prevents the insertion of duplicate users.
CREATE TRIGGER [dbo].[BeforeUpdateUser]
ON [dbo].[Users]
INSTEAD OF INSERT
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
DECLARE @Email nvarchar(MAX)
DECLARE @UserName nvarchar(MAX)
DECLARE @UserId int
DECLARE @DoInsert bit
SET @DoInsert = 1
SELECT @Email = Email, @UserName = UserName FROM INSERTED
SELECT @UserId = UserId FROM Users WHERE Email = @Email
IF (@UserId IS NOT NULL)
BEGIN
SET @DoInsert = 0
END
SELECT @UserId = UserId FROM Users WHERE UserName = @UserName
IF (@UserId IS NOT NULL)
BEGIN
SET @DoInsert = 0
END
IF (@DoInsert = 1)
BEGIN
INSERT INTO Users
SELECT
FirstName,
LastName,
Email,
Password,
UserName,
LanguageId,
Data,
IsDeleted
FROM INSERTED
END
ELSE
BEGIN
DECLARE @ErrorMessage nvarchar(MAX)
SET @ErrorMessage =
'The username and email address of a user must be unique!'
RAISERROR 50001 @ErrorMessage
END
END
But for the update trigger I have no idea how to do this.
I found this example via Google:
http://www.devarticles.com/c/a/SQL-Server/Using-Triggers-In-MS-SQL-Server/2/
But I don't know if it applies when you update multiple columns at once.
EDIT:
I've tried to add a unique constraint on these columns but it doesn't work:
Msg 1919, Level 16, State 1, Line 1
Column 'Email' in table 'Users' is of a type
that is invalid for use as a key column in an index.
You can add a unique constraint on the table; this will raise an error if an insert or update would create duplicates.
ALTER TABLE [Users] ADD CONSTRAINT [IX_UniqueUserEmail] UNIQUE NONCLUSTERED
(
[Email] ASC
)
ALTER TABLE [Users] ADD CONSTRAINT [IX_UniqueUserName] UNIQUE NONCLUSTERED
(
[UserName] ASC
)
EDIT: OK, I've just read your comments on another post and seen that you're using NVARCHAR(MAX) as your data type. Is there a reason why you might want more than 4000 characters for an email address or username? This is where your problem lies. If you reduce this to NVARCHAR(250) or thereabouts then you can use a unique index.
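Shrinking the columns could be sketched as follows. NOT NULL is an assumption about the existing columns, and the statements fail if any stored value is longer than 250 characters, so it is worth checking the data first:

```sql
-- Optional pre-check: any rows that would not fit in 250 characters?
SELECT COUNT(*) AS TooLong
FROM [Users]
WHERE LEN([Email]) > 250 OR LEN([UserName]) > 250;

-- NOT NULL is an assumption; use NULL instead if the columns are optional.
ALTER TABLE [Users] ALTER COLUMN [Email] NVARCHAR(250) NOT NULL;
ALTER TABLE [Users] ALTER COLUMN [UserName] NVARCHAR(250) NOT NULL;
```

Once the columns are no longer NVARCHAR(MAX), the two ADD CONSTRAINT statements above should no longer hit error 1919.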
Sounds like a lot of work instead of just using one or more unique indexes. Is there a reason you haven't gone the index route?
Why not just use the UNIQUE attribute on the column in your database? Setting that will make the SQL server enforce that and throw an error if you try to insert a dupe.
You should use a SQL UNIQUE constraint on each of these columns for that.
You can create a UNIQUE INDEX on an NVARCHAR column as long as it is NVARCHAR(450) or shorter.
Do you really need a UNIQUE column to be so large?
In general, I would avoid triggers wherever possible, as they can make the behaviour very hard to understand unless you know that the trigger exists. As other commentators have said, a unique constraint is the way to go (once you have amended your column definitions to allow it).
If you ever find yourself needing to use a trigger, it may be a sign that your design is flawed. Think hard about why you need it and whether it is performing logic that belongs elsewhere.
Be aware that if you use the UNIQUE constraint/index solution with SQL Server, only one null value will be permitted in that column. So, for example, if you wanted the email address to be optional, it wouldn't work, because only one user could have a null email address. In that case, you would have to resort to another approach like a trigger or a filtered index.
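The filtered-index approach could be sketched like this. It requires SQL Server 2008 or later, the index name is hypothetical, and the Email column is assumed to have an indexable length (not NVARCHAR(MAX)):

```sql
-- Uniqueness is enforced only for rows that actually have an email address,
-- so any number of users may have a NULL Email.
CREATE UNIQUE NONCLUSTERED INDEX IX_UniqueUserEmail_NotNull  -- hypothetical name
ON dbo.Users (Email)
WHERE Email IS NOT NULL;
```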