Pre-process validation of an INSERT INTO statement

I have a stored procedure that creates many INSERT INTO statements using Dynamic SQL. I have no control over the data that I am attempting to insert and the values being inserted are derived from a
SELECT * FROM sourceTable
Occasionally, when inserting, a foreign key constraint violation is triggered (the data is taken from environment A and inserted into environment B, so it is possible certain other tables have not been kept up to date).
My question is: is there a way to pre-validate all my INSERT statements for any errors (foreign key constraint violations or otherwise) before executing them? Or do I need to create checkpoints and use the rollback functionality?
--- OVERVIEW OF PROCESS ---
We create tables on environment A (source) containing subsets of data based on a selection criterion.
Using the SQL Export Wizard, these tables are copied across to environment B (target).
A stored procedure is run to import the data from these tables into the corresponding tables on environment B. This stored procedure uses an INSERT INTO tableA SELECT * FROM tableAFromSource command within a dynamic SQL loop containing all the table names.
This approach is used due to external factors we cannot control (servers cannot be linked, data structures, permissions on the servers, etc.).

I don't know of any kind of pre-processing that checks what the result of a given query would be, but it is often easy to check for such conditions using plain T-SQL. If you just want to avoid primary key constraint violations, you can use the following variation of the query you provided (assuming Id is the primary key):
INSERT INTO tableA
SELECT *
FROM tableAFromSource src
WHERE src.Id NOT IN ( SELECT tba.Id FROM tableA tba );
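For the foreign key case in the original question, the same pre-filtering idea can be applied. A minimal sketch, assuming a hypothetical column ParentId in the source that references ParentTable(Id):
INSERT INTO tableA
SELECT *
FROM tableAFromSource src
WHERE src.ParentId IS NULL  -- nothing to validate
   OR EXISTS ( SELECT 1 FROM ParentTable p WHERE p.Id = src.ParentId );
Rows that fail the check can be diverted into a holding table by inverting the WHERE clause, so nothing is silently dropped.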

Related

Remove Records from Tables Selectively Based on Variable

Scenario:
I need to write an SQL script which will remove all records from some or all tables in a database with around 100 tables.
Some tables are 'data' tables, some are 'lookup' tables. There is nothing in their names to indicate which they are.
Sometimes I will want the script to only remove records from the 'data' tables, on other occasions I will want to use it to remove data from all tables.
The records have to be removed from the tables in a very specific order to prevent foreign key constraint violations.
My original idea was to create a variable at the start of the script - something like @EmptyLookupTables - which I could set to true or false, and then wrap the DELETE statements in an IF... statement so that they were only executed if the value of the variable was true.
However, due to the foreign key constraints I need to include the GO command after just about every DELETE statement, and variables are not persisted across these batches.
How can I write a script which deletes records from my tables in the correct order but skips over certain tables based on the value of a single variable? The database is in Microsoft SQL Server 2016.
The only way I know of doing this without writing a parser for DDL in TSQL is to turn it on its head.
Create a new database with the same schema; populate the lookup tables, but without the records you don't want. Then populate the data tables, but again leave out the records you don't want. Finally, rename or delete the old database, and rename the new database to the original name.
It's still hard, though.
Create a #temp table and store your variable in it; the temp table will persist across GO-separated batches. Then just read it back inside every batch.
DECLARE @EmptyLookupTables BIT = 1;   -- set once at the top of the script
SELECT @EmptyLookupTables AS EmptyLookupTables INTO #tmp;
GO
DECLARE @EmptyLookupTables BIT;
SELECT @EmptyLookupTables = EmptyLookupTables FROM #tmp;
DELETE FROM YourLookupTable WHERE @EmptyLookupTables = 1;
GO
Or you can even join directly on the #temp table in the DELETE command:
DELETE l FROM YourLookupTable l
INNER JOIN #tmp t ON t.EmptyLookupTables = 1
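A usage sketch putting the pieces together; DataTableChild and LookupTableParent are placeholder table names:
DECLARE @EmptyLookupTables BIT = 1;            -- 0 = keep lookup data
SELECT @EmptyLookupTables AS EmptyLookupTables INTO #tmp;
GO
DELETE FROM DataTableChild;                    -- data tables are always cleared
GO
DELETE l
FROM LookupTableParent l
INNER JOIN #tmp t ON t.EmptyLookupTables = 1;  -- only runs when the flag is 1
GO
DROP TABLE #tmp;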

Limit Rows in ETL Without Date Column for Cue

We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third-party software; we used to assume that we could identify the rows updated since the previous night by using the "UpdateDate" column in each, but we recently identified some rows that were not touched by the ETL because the "UpdateDate" column was not behaving as we had thought. The software company does not see this as a bug, so we have to live with it.
As a result, we now take all source rows, transform them into a temp staging table, and then MERGE that into the data warehouse, using the MERGE to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows the ETL looks at, as we believe the hold-up is principally caused by the sheer volume of data that is examined and stored in the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!):
CREATE TABLE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns]);
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER TABLE #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY Target THEN INSERT [All_TargetColumns] VALUES [All_SourceColumns]
OUTPUT $Action INTO #Changes;
RETURN COUNT(*) FROM #Changes;
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table and, of those, about 40 can have changed values each night. Most columns are VARCHAR and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way (they have already been indexed by a predecessor). The source tables and the target table are on the same server, but in different databases.
Edit: the target table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...
Before you try the following suggestion, one thing to check is that Target_PeopleTable has an index or primary key on the Id column. It probably does, but without schema information I am making no assumptions; that alone might speed up the MERGE stage.
As you've identified, if you could somehow limit the records in the temp table to just the changed rows, this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to try to identify only the changed records. The T-SQL CHECKSUM function can calculate a checksum across the required columns (pass the columns to it as a comma-separated list), or there are related functions such as BINARY_CHECKSUM.
Since you cannot change the source schema, you would have to maintain a list of record ids and their associated checksums in your target database, so that you can readily compare the source checksums to the checksums from the last run and identify a difference.
You can then insert into the temp table only where there is a checksum difference between target and source, or where the id does not exist in the target db.
This might just be moving the performance problem to the temp insert part but I think it's worth a try.
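A minimal sketch of the comparison, assuming a tracking table dbo.SourceChecksums (ClientOrContact, Id, RowChecksum) maintained in the target database and two illustrative source columns:
-- Stage only new or changed client rows
INSERT INTO #TempTable (ClientOrContact, Id, FirstName, LastName)
SELECT 1, c.Id, c.FirstName, c.LastName
FROM Source_ClientsTable c
LEFT JOIN dbo.SourceChecksums k
       ON k.ClientOrContact = 1 AND k.Id = c.Id
WHERE k.Id IS NULL                                               -- new row
   OR k.RowChecksum <> BINARY_CHECKSUM(c.FirstName, c.LastName); -- changed row

-- After the MERGE succeeds, refresh the stored checksums for the next run
MERGE dbo.SourceChecksums AS tgt
USING (SELECT 1 AS ClientOrContact, Id,
              BINARY_CHECKSUM(FirstName, LastName) AS RowChecksum
       FROM Source_ClientsTable) AS src
   ON tgt.ClientOrContact = src.ClientOrContact AND tgt.Id = src.Id
WHEN MATCHED THEN UPDATE SET tgt.RowChecksum = src.RowChecksum
WHEN NOT MATCHED BY TARGET THEN
     INSERT (ClientOrContact, Id, RowChecksum)
     VALUES (src.ClientOrContact, src.Id, src.RowChecksum);
Bear in mind that CHECKSUM and BINARY_CHECKSUM can occasionally collide, so an occasional full comparison is a sensible safety net.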
Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it so that when rows are added or updated, the trigger logs the IDs of those rows in an audit table (which you'll have to create), containing the ID, the type of change (update or insert - and delete, if you have to worry about those) and when the change was made. When you run the ETL, join this list of "to be merged" items with the source tables. When you're done, clear the audit table so it's reset for the next run. (Use the "added on" datetime column to make sure you don't delete rows that may have been added while the ETL was running.)
There’s lots of details behind proper use and implementation, but overall this idea should do what you need.

Incremental load for Updates into Warehouse

I am planning for an incremental load into warehouse (especially for updates of source tables in RDBMS).
I am capturing the updated rows in staging tables from the RDBMS based on the update datetime. But how do I determine which columns of a particular row need to be updated in the target warehouse tables?
Or do I just delete a particular row in the warehouse table (based on the primary key of the row in staging table) and insert the new updated row?
Which is the best way to implement the incremental load between the RDBMS and Warehouse using PL/SQL and SQL coding?
In my opinion, the easiest way to accomplish this is as follows:
Create a stage table identical to your host table. When you do your incremental/net-change load, load all changed records into this table (based on whatever your "last updated" field is).
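A minimal sketch of that step in Oracle SQL, assuming the source has a last_updated column and :last_run_time holds the previous run's timestamp (names are illustrative):
-- Empty copy of the main table to act as the stage
create table stage_table as
select * from main_table where 1 = 0;

-- Load only the rows changed since the last run
insert into stage_table
select *
from   source_table
where  last_updated > :last_run_time;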
Delete the records from your actual table based on the primary key. For example, if your primary key is customer, part, the query might look like this:
delete from main_table m
where exists (
select null
from stage_table s
where
m.customer = s.customer and
m.part = s.part
);
Insert the records from the stage to the main table.
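Continuing the same sketch, that final step is simply:
insert into main_table
select * from stage_table;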
You could also do an update of existing records / insert of new records, but either way that's two steps. The advantage of the method I listed is that it will work even if your tables have partitions and the newly updated data violates one of the original partition rules, whereas an update would not accomplish that. Also, the syntax is much simpler, as your update would have to list every single field, whereas the delete from / insert into allows you to list only the primary key fields.
Oracle also has a MERGE statement that will update a row if it exists or insert it if it does not. I honestly don't know how that would be impacted if you had partitions.
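A sketch of that MERGE for the same customer/part key, with descr and qty standing in for the non-key columns:
merge into main_table m
using stage_table s
   on (m.customer = s.customer and m.part = s.part)
when matched then
    update set m.descr = s.descr, m.qty = s.qty
when not matched then
    insert (customer, part, descr, qty)
    values (s.customer, s.part, s.descr, s.qty);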
One major caveat: if your updates include deletes (records that need to be removed from the main table), none of these approaches will handle that, and you will need some other way to deal with it. It may not be necessary, depending on your circumstances, but it's something to consider.

Foreign key Constraints - Writing to Error Table - SQL Server 2008

I am new to SQL Server. I have a batch process that loads data into my stage tables. I have some foreign keys on that table. I want to dump all the foreign key errors encountered while loading into an error table. How do I do that?
Use SSIS to load the data. Records which fail validation can be sent off to an exception table.
One approach would be to load the data into a temporary table which has no FK constraints, remove the bad records (those that violate the FK constraints), then move the data from the temp table into the stage table. If you have lots of FKs on your table this will probably be a bit tedious, so you may want to automate the process.
Here's some pseudo-code to show what I mean...
-- First put the raw data into MyTempTable
-- Find the records that are "bad" -- you can SELECT INTO a "bad records" table
-- for later inspection if you want...
SELECT *
INTO #BadRecords
FROM MyTempTable
WHERE ForeignKeyIDColumn NOT IN
(
SELECT ID FROM ForeignKeyTable
)
-- Remove the bad records now
DELETE
FROM MyTempTable
WHERE ForeignKeyIDColumn NOT IN
(
SELECT ID FROM ForeignKeyTable
)
-- Now the data is "clean" (won't violate the FK) so you can insert it
-- from MyTempTable into the stage table
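-- A sketch of that final step; MyStageTable stands in for the real stage table
INSERT INTO MyStageTable
SELECT *
FROM MyTempTable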

SQL Server creating a temporary table from another table

I am looking to create a temporary table which is used as an intermediate table while compiling a report.
For a bit of background I am porting a VB 6 app to .net
To create the table I can use...
SELECT TOP 0 * INTO #temp_copy FROM temp;
This creates an empty copy of temp, but it doesn't create a primary key.
Is there a way to create a temp table plus the constraints?
Should I create the constraints afterwards?
Or am I better off just creating the table using create table, I didn't want to do this because there are 45 columns in the table and it would fill the procedure with a lot of unnecessary cruft.
The table is required because a lot of people may be generating reports at the same time so I can't use a single intermediary table
Do you actually need a primary key? If you are filtering and selecting only the data needed by the report, won't you have to visit every row in the temp table anyway?
By design, SELECT INTO does not carry over constraints (PK, FK, Unique), Defaults, Checks, etc. This is because a SELECT INTO can actually pull from numerous tables at once (via joins in the FROM clause). Since SELECT INTO creates a new table from the table(s) you specify, SQL really has no way of determining which constraints you want to keep, and which ones you don't want to keep.
You could write a procedure/script to create the constraint automatically, but it's probably too much effort for minimal gain.
You'd have to do one or the other:
add the PK/indexes afterwards
explicitly declare the temp table with constraints.
I'd also do this rather than TOP 0:
SELECT * INTO #temp_copy FROM temp WHERE 1 = 0;
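For example, a sketch of the first option, assuming the source table's key column is named Id and is NOT NULL:
SELECT * INTO #temp_copy FROM temp WHERE 1 = 0;
ALTER TABLE #temp_copy ADD CONSTRAINT PK_temp_copy PRIMARY KEY (Id);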