How can I block bad rows from a delete query - SQL

I have a query that moves year-old rows from one table to an identical "archive" table.
Sometimes, invalid dates get entered into the dateprocessed column (used to evaluate whether the row is more than a year old), and the query errors out. I want to essentially "screen out" the bad rows -- i.e. those where isdate(dateprocessed) does not equal 1 -- so that the query does not try to archive them.
I have a few ideas about how to do this, but I want the absolute simplest approach possible. If I select the good data into a temp table in my stored procedure, inner join it with the live table, and then run the delete from live with output to archive -- will it delete from the underlying live table or from the new joined result?
Is there a better way to do this? Thanks for the help. I am a .NET programmer playing DBA, but really want to do this properly.
Here is the query that errors when some of the dateprocessed column values are invalid:
delete from live
output deleted.* into archive
where isdate(dateprocessed) = 1
and cast (dateprocessed as datetime) < dateadd(year, -1, getdate())
and not exists (select * from archive where live.id = archive.id)

The simplest thing to do is:
1. Select the correct records into a temp table. One of the fields you copy into the temp table should be a unique identifier, like an "ID" column.
2. Do any additional processing in the temp table.
3. Archive from the temp table to the archive table.
4. Delete from the live table with a join to the temp table on the "ID" column. This will ensure no mistakes are made. (A minimal sketch follows.)
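A sketch of that flow under the question's own names (live, archive, id, dateprocessed); the #screened temp table name is mine, and this is an outline rather than a drop-in script:

-- 1. Stage the rows that parse as valid dates; no CAST yet, so bad values cannot error
select id, dateprocessed
into #screened
from live
where isdate(dateprocessed) = 1;

-- 2. Keep only rows more than a year old; the CAST is safe now because every staged value passed isdate
delete from #screened
where cast(dateprocessed as datetime) >= dateadd(year, -1, getdate());

-- 3. Archive from the temp table, then delete from live by joining on the id column
insert into archive
select l.*
from live l
join #screened s on s.id = l.id
where not exists (select * from archive a where a.id = l.id);

delete l
from live l
join #screened s on s.id = l.id;

drop table #screened;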

If you are a .NET guy, you could bring the data down and do a DateTime.TryParse. Better yet, just do it once to populate a real DateTime column. For the dates that don't parse, you could assign a fixed date or NULL. And there are some date strings that .NET will parse that SQL will not (e.g. "November 2010").
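If the SQL Server version is 2012 or later (an assumption; the question doesn't say), a purely T-SQL alternative is TRY_CONVERT, which returns NULL instead of raising an error for values that won't parse, so bad rows simply fail the comparison and are skipped. A sketch of the original delete rewritten that way:

delete from live
output deleted.* into archive
where try_convert(datetime, dateprocessed) < dateadd(year, -1, getdate())
and not exists (select * from archive where live.id = archive.id)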


Querying a SQL table and only transferring updated rows to a different database

I have a database table which constantly gets updated. I am looking to query only the changes/additions that have been made to rows with a specific attribute in a column -- e.g. get the rows which have been changed or added whose 'description' column is "xyz". My end goal is to copy these rows to another table in another database. Is this even possible? The reason for not just querying and overwriting the rows in the other database is to avoid the inefficiency.
What have I tried so far?
I am able to run a select query on the table to get the rows, but it gives me all the rows, not just the ones that have been changed or recently added. If I add these rows to the table in the other database, the only option I have is to overwrite them.
The log table logs the changes in a table, but I can't add filters in SQL that tell me which of those changes are associated with a 'description' value of 'xyz'.
Write your update statements to make use of OUTPUT to capture the before and after values and log them to a table of your choice.
Here is a really simple update example that uses OUTPUT to store the RowID and the before and after values for the ActivityType column:
DECLARE @MyTableVar table (
    SummaryBefore nvarchar(max),   -- ActivityType before the update
    SummaryAfter nvarchar(max),    -- ActivityType after the update
    RowID int
);
UPDATE DBA.dbo.dtest SET ActivityType = 3
OUTPUT deleted.ActivityType,   -- before value
       inserted.ActivityType,  -- after value
       inserted.RowID
INTO @MyTableVar;
SELECT * FROM @MyTableVar;
You can do it in two ways:
1. Add new date columns like update_time and/or create_time (they can be defaulted if needed). These fields indicate the status of the record. You save your previous_run_time, and your select query then looks for records with update_time/create_time greater than previous_run_time; those are the records you move to the new DB (see the sketch after this list).
2. Turn on Change Data Capture (CDC) on the source table, which is built into SQL Server, and then move only the records that have been impacted.
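A sketch of the first option, using assumed names (dbo.SourceTable, a dbo.EtlWatermark table holding previous_run_time, and a target OtherDb.dbo.TargetTable); adjust to the real schema:

-- Watermark table holding the time of the last successful run
CREATE TABLE dbo.EtlWatermark (previous_run_time datetime NOT NULL);

DECLARE @previous_run_time datetime = (SELECT MAX(previous_run_time) FROM dbo.EtlWatermark);
DECLARE @this_run_time datetime = GETDATE();

-- Copy only rows changed or added since the last run, filtered on description = 'xyz'
INSERT INTO OtherDb.dbo.TargetTable (Id, [description], SomeColumn)
SELECT s.Id, s.[description], s.SomeColumn
FROM dbo.SourceTable s
WHERE s.[description] = 'xyz'
  AND (s.update_time > @previous_run_time OR s.create_time > @previous_run_time);

-- Advance the watermark for the next run
UPDATE dbo.EtlWatermark SET previous_run_time = @this_run_time;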

Insert into third table where 2 tables have same Id

I have a huge database and want to process it in smaller chunks, so I'm trying to write scripts to copy rows into a temporary table, process them, and then copy them back.
Now I've copied around 1000 rows into PersonMeta from the old database and want to insert the corresponding rows into the People table.
So basically I want to insert data from olddb.People into newdb.People where newdb.PersonMeta and newdb.People have the same code.
I've created this script but for some reason it doesn't copy all the rows. For example it copies 960 rows when it should copy 1000.
INSERT INTO [newdb].[dbo].[People] ([Id]
,[Name]
,[PersonId])
SELECT fp.[Id]
,fp.[Name]
,fp.[PersonId]
FROM [olddb].[dbo].[People] fp
INNER JOIN [newdb].[dbo].[PersonMeta] pm on
pm.PersonId = fp.PersonId
edit:
I originally wrote 100 rows where it should have been 1000 rows. So the query is selecting 960 (40 fewer).
edit 2
The People table has some duplicate values in the PersonId column. I removed them, and now after I run the query it copies 956 rows (4 fewer than before).
edit 3:
I created this fiddle and it seems to be working just fine.
However, I did some queries on the database. It turns out that when I query with a RIGHT JOIN, the values for the records which are not copied are all NULL. So when I run the following query:
Select fp.*, fp.personid, pm.personid
From [olddb].[dbo].[People] fp
right join [newdb].[dbo].[PersonMeta] pm on
fp.personid = pm.personid
It returns this:
Is there another approach I could try to copy the data?
There may be a NULL value in the PersonID field on either table. If so, remove/update the NULL record and try again.
First, you need to check separately what records your query is generating.
Check the output of this query. It should return the rows you are expecting.
SELECT fp.[Id]
,fp.[Name]
,fp.[PersonId]
FROM [olddb].[dbo].[People] fp
INNER JOIN [newdb].[dbo].[PersonMeta] pm on
pm.PersonId = fp.PersonId
But if it still returns more rows than you expect, you may try some filters to test the result, such as isnull(pm.PersonId,0)<>0 and isnull(fp.PersonId,0)<>0.
This filters out records whose PersonId is NULL, which may be duplicating your records.
So the final query to test is:
SELECT fp.[Id]
,fp.[Name]
,fp.[PersonId]
FROM [olddb].[dbo].[People] fp
INNER JOIN [newdb].[dbo].[PersonMeta] pm on
pm.PersonId = fp.PersonId and isnull(pm.PersonId,0)<>0 and isnull(fp.PersonId,0)<>0
If you still can't figure out the issue, please share your table structures, which might help in understanding the problem.
OK, now I feel silly, but the problem was simply that not all of the rows in the PersonMeta table had a corresponding value in the old People table. I thought they did because I had used Id in the query rather than PersonId.
In short, the posted query was in fact correct.
Considering you want to keep distinct, unique records in the new table:
The query below will create the same schema as the old table and copy all the data present in the old table to the new table.
select * into [newdb].[dbo].[People] from [olddb].[dbo].[People]
Now, if you want to keep the data in the new table in sync with the unique records present in [newdb].[dbo].[PersonMeta], you can simply do:
delete from [newdb].[dbo].[People] where personid not in (select personid from [newdb].[dbo].[PersonMeta] )
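One caveat worth noting: if any personid in [newdb].[dbo].[PersonMeta] is NULL, NOT IN will never evaluate to true and the delete will remove nothing. A NOT EXISTS version of the same cleanup avoids that trap:

delete p
from [newdb].[dbo].[People] p
where not exists (
    select 1
    from [newdb].[dbo].[PersonMeta] pm
    where pm.personid = p.personid
)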

Limit Rows in ETL Without Date Column for Cue

We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third party software; we used to assume that we could identify the rows that had been updated since last night by using the "UpdateDate" column in each, but more recently identified some rows that were not touched by the ETL, as the "UpdateDate" column was not behaving as we had thought; the software company do not see this as a bug, so we have to live with this fact.
As a result, we now take all source rows, transform them into a temp staging table and then MERGE that into the data warehouse, using the MERGE to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows that the ETL process looks at, as we believe that the reason for the hold-up is principally the sheer volume of data that is examined and stored in the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is a simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!)
CREATE TABLE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns])
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER TABLE #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY Target THEN INSERT [All_TargetColumns] VALUES [All_SourceColumns]
OUTPUT $Action INTO #Changes;
RETURN (SELECT COUNT(*) FROM #Changes);
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table; of those, about 40 can have changed values each night. Most columns are VARCHAR, and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way (they have already been indexed by a predecessor). The source tables and the target table are on the same server, but in different databases. Edit: The target table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...
Before you try the following suggestion, one thing to check is that the Target_PeopleTable has an index or primary key on the Id column. It probably does, but without schema information I can't verify it, and if it is missing, adding one might speed up the merge stage.
As you've identified if you could somehow limit the records in TempTable to just the changed rows then this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to try and identify the changed records only. The T-SQL CHECKSUM function can be used to calculate a checksum across the required columns by passing the columns as a comma-separated list, and there are related functions such as BINARY_CHECKSUM.
Since you cannot change the source schema you would have to maintain a list of record ids and associated checksums in your target database so that you can readily compare the source checksums to the target checksums from the last run in order to identify a difference.
You can then insert into the temp table only where there is a checksum difference between the target and source, or where the id does not exist in the target db.
This might just be moving the performance problem to the temp insert part but I think it's worth a try.
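A sketch of that idea, using assumed names (a dbo.SourceChecksums tracking table in the target database) and BINARY_CHECKSUM over placeholder columns; the real column list would be the ~40 volatile columns:

-- Tracking table kept in the target database
CREATE TABLE dbo.SourceChecksums (
    ClientOrContact BIT NOT NULL,
    Id INT NOT NULL,
    RowChecksum INT NOT NULL,
    PRIMARY KEY (ClientOrContact, Id)
);

-- Stage only new or changed client rows into the temp table
INSERT INTO #TempTable (ClientOrContact, Id /*, [Some_Other_Columns] */)
SELECT 1, C.Id /*, [SomeColumns] */
FROM Source_ClientsTable C
LEFT JOIN dbo.SourceChecksums K
       ON K.ClientOrContact = 1 AND K.Id = C.Id
WHERE K.Id IS NULL                                        -- new row
   OR K.RowChecksum <> BINARY_CHECKSUM(C.Col1, C.Col2);   -- changed row

-- After a successful MERGE, upsert tonight's checksums into dbo.SourceChecksums for the next run

Note that CHECKSUM and BINARY_CHECKSUM can collide, so a rare change could be missed; HASHBYTES over a concatenation of the columns is a slower but safer alternative.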
Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it so that when rows are added or updated, the trigger logs the IDs of those rows in an audit table (which you'll have to create), where that table contains the ID, the type of change (update or insert -- and delete, if you have to worry about those) and when the change was made. When you run the ETL, join this list of "to be merged" items with the source tables. When you're done, clear the audit table and it's reset for the next run. (Use the "added on" datetime column to make sure you don't delete rows that may have been added while the ETL was running.)
There are lots of details behind proper use and implementation, but overall this idea should do what you need.
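A sketch of one such trigger, using assumed names (dbo.Source_ClientsTable with an Id column, and a dbo.EtlChangeLog audit table); the contacts table would need the same treatment:

-- Audit table the triggers write into
CREATE TABLE dbo.EtlChangeLog (
    SourceId   INT       NOT NULL,
    ChangeType CHAR(1)   NOT NULL,   -- 'I' = insert, 'U' = update
    AddedOn    DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);
GO

CREATE TRIGGER trg_Clients_LogChange
ON dbo.Source_ClientsTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.EtlChangeLog (SourceId, ChangeType)
    SELECT i.Id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.Id = i.Id)
                THEN 'U' ELSE 'I' END
    FROM inserted i;
END;
GO

The ETL would then join dbo.EtlChangeLog to the source tables to build the staging set, and afterwards delete only the log rows whose AddedOn is earlier than the time the ETL started.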

How to add dates to database records for trending analysis

I have a SQL Server database table that contains a few thousand records. These records are populated by PowerShell scripts on a weekly basis. The scripts basically overwrite last week's data, so the table only has information pertaining to the previous week. I would like to take a copy of that table's data each week and add a date column with that day's date beside each record. I need this so I can do trend analysis in the future.
Unfortunately, I don't have access to the PowerShell scripts to edit them. Is there any way I can accomplish this using MS SQL server or some other way?
You can do the following: create a table that will contain the clone plus a date column, then insert the results from your original table, along with the date, into the clone table. From your description you don't need a WHERE clause, because the original table is wiped each week and only holds new data. After the initial table creation there is no need to create it again; you'll simply run the insert piece each week. Obviously the below is very basic and is just to provide you with the framework.
CREATE TABLE yourTableClone
(
col1 int,
col2 varchar(5)...
col5 date
)
insert into yourTableClone
select *, getdate()
from yourOriginalTable
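To have this happen weekly without touching the PowerShell scripts, the insert can be wrapped in a stored procedure and scheduled (for example with a weekly SQL Server Agent job). A minimal sketch, reusing the clone table above:

CREATE PROCEDURE dbo.usp_SnapshotWeekly
AS
BEGIN
    SET NOCOUNT ON;
    -- Append this week's rows, stamped with today's date
    INSERT INTO yourTableClone
    SELECT *, CAST(GETDATE() AS date)
    FROM yourOriginalTable;
END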

Select on Row Version

Can I select rows on row version?
I am querying a database table periodically for new rows.
I want to store the last row version and then read all rows from the previously stored row version.
I cannot add anything to the table, the PK is not generated sequentially, and there is no date field.
Is there any other way to get all the rows that are new since the last query?
I am creating a new table that contains all the primary keys of the rows that have been processed and will join on that table to get new rows, but I would like to know if there is a better way.
EDIT
This is the table structure:
Everything except product_id and stock_code are fields describing the product.
You can cast the rowversion to a bigint; then, when you read the rows again, you cast the column to bigint and compare it against your previously stored value. The problem with this approach is the table scan each time you select based on the cast of the rowversion -- this could be slow if your source table is large.
I haven't tried a persisted computed column of this, I'd be interested to know if it works well.
Sample code (Tested in SQL Server 2008R2):
DECLARE @TABLE TABLE
(
    Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Data VARCHAR(10) NOT NULL,
    LastChanged ROWVERSION NOT NULL
)
INSERT INTO @TABLE(Data)
VALUES('Hello'), ('World')
SELECT
    Id,
    Data,
    LastChanged,
    CAST(LastChanged AS BIGINT) AS LastChangedBigInt
FROM
    @TABLE
DECLARE @Latest BIGINT = (SELECT MAX(CAST(LastChanged AS BIGINT)) FROM @TABLE)
SELECT * FROM @TABLE WHERE CAST(LastChanged AS BIGINT) >= @Latest
EDIT: It seems I've misunderstood, and you don't actually have a ROWVERSION column, you just mentioned row version as a concept. In that case, SQL Server Change Data Capture would be the only thing left I could think of that fits the bill: http://technet.microsoft.com/en-us/library/bb500353(v=sql.105).aspx
Not sure if that fits your needs, as you'd need to be able to store the LSN of "the last time you looked" so you can query the CDC tables properly. It lends itself more to data loads than to typical queries.
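For completeness, a sketch of what the CDC route might look like, assuming a hypothetical dbo.Products source table that has already had CDC enabled (capture instance dbo_Products) and a dbo.CdcWatermark table where you persist the last LSN you processed:

-- One-time setup (requires appropriate permissions and edition support)
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Products',
     @role_name     = NULL;

-- Per run: read everything between the stored LSN and the current maximum
DECLARE @from_lsn binary(10) =
    sys.fn_cdc_increment_lsn((SELECT last_lsn FROM dbo.CdcWatermark));
DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Products(@from_lsn, @to_lsn, N'all');

-- Afterwards, save @to_lsn back into dbo.CdcWatermark for the next query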
Assuming you can create a temporary table, the EXCEPT command seems to be what you need:
1. Copy your table into a temporary table.
2. The next time you look, select everything from your table EXCEPT everything from the temporary table, and extract the keys you need from this.
3. Make sure your temporary table is up to date again.
Note that your temporary table only needs to contain the keys you need. If this is just one column, you can go for a NOT IN rather than EXCEPT.
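A sketch of that approach, using assumed names (a dbo.Products source and a dbo.Products_Snapshot holding table keyed on the product_id and stock_code columns mentioned above); the snapshot must survive between runs, so it is created as a regular table here rather than a #temp table:

-- Snapshot of the keys as of the last run
CREATE TABLE dbo.Products_Snapshot (
    product_id INT         NOT NULL,
    stock_code VARCHAR(50) NOT NULL   -- type assumed
);

-- New rows = current keys EXCEPT the keys we had last time
SELECT p.product_id, p.stock_code
FROM dbo.Products p
EXCEPT
SELECT s.product_id, s.stock_code
FROM dbo.Products_Snapshot s;

-- Bring the snapshot up to date for the next query
TRUNCATE TABLE dbo.Products_Snapshot;
INSERT INTO dbo.Products_Snapshot (product_id, stock_code)
SELECT product_id, stock_code FROM dbo.Products;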