SQL Replication error: "The row was not found at the Subscriber" but it points to a table of another publication

I get the following error in Replication Monitor:
The row was not found at the Subscriber when applying the replicated UPDATE command for Table '[dgv].[POSCustomer]' with Primary Key =
The issue is not really the missing row, but the fact that the error points to the dgv schema.
The publication that generated the error is supposed to only replicate to [ppv].[POSCustomer], and should not even be aware of [dgv].[POSCustomer]. And only rows created AFTER the initial snapshot is delivered are affected.
The background:
I'm setting up transactional replication for 3 on-premises databases PPV, DGV, and PAC to a single Azure SQL database.
The three databases belong to different legal entities, on two separate servers (PPV on one, DGV and PAC on another), and have identical schemas.
Tables with the same names from each database are set up to be replicated.
To differentiate them in the target database, I put them in three different schemas named after their source databases, i.e. ppv.POSCustomer, dgv.POSCustomer, pac.POSCustomer.
This is done by changing the setting in Publication properties -> Articles -> Article properties -> Destination object owner.
The initial snapshots are delivered without problems; however, after some time, the "row was not found" error started showing up in Replication Monitor.
I tried re-initializing the subscriptions several times, but the error keeps showing up after the snapshot is delivered.
All rows created after the snapshots are delivered are affected.
The databases are totally isolated from each other: there are no cross-database queries, no stored procedures, and no triggers that say a record from PPV.dbo.POSCustomer should be updated in DGV.dbo.POSCustomer, so I'm at a loss as to why this error happened.
I used sp_browsereplcmd to trace the command that generated the error, which leads me to:
{CALL [sp_MSupd_dboPOSCustomer] (,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019-05-14 00:00:00.000,,27280000.0000,10,,,,,,,,,,,,2019-05-14 18:30:04.000,,,,,,,,,,,,,,,,,,,,N'vinhn4-00001395',0x00000000d000080000)}
which I don't understand, and the sp is not part of our POS app.
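For anyone unfamiliar, sp_browsereplcmd runs in the distribution database and can be filtered by the transaction sequence number and command ID shown in the error details; the values below are placeholders, not the real ones from my environment:
USE distribution;

-- Placeholders: use the xact_seqno and command ID reported for the failing command.
EXEC sp_browsereplcmd
    @xact_seqno_start = '0x0004A8B500002F4B0001',
    @xact_seqno_end   = '0x0004A8B500002F4B0001',
    @command_id       = 1;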
How can I make this error go away? Manually inserting missing rows will not work, as all new rows are affected. Turning on -skiperrors is not an option. Replicating to different target databases has been done successfully before, but setting up cross-database queries is such a pain with Azure SQL that I'd prefer to avoid it if possible.

Related

Azure SQL data sync initial sync not working

I need to setup a constant sync between two databases on Azure on the same SQL Server. The database is about 2TB with 2000 tables and about 20 million rows.
I cannot set up an Azure Data Sync because each time it freezes on "refresh schema" in the Azure portal. I know there is a limitation of 500 tables, but the schema takes too long to become visible, so I cannot even select a subset of the 2000 tables that need to be synced.
Another thing we have tried is to initialize the second database with the tables we want from the first database. Those tables are empty, so we can "refresh schema" on the empty tables and set up a sync from member to hub. However, when doing this, the initial sync does not work: the second database remains empty, while in the portal the sync appears to run OK.
Is there another possibility to setup a sync with such a large database?
Will it help to create a data sync between empty tables, run the initial sync and then insert all the rows into the empty tables? This way the initial sync will work (because there is no data) and all the other data will be synced like other data that is appended in the future.
EDIT: According to the following blog (https://azure.microsoft.com/sv-se/blog/sync-sql-data-in-large-scale-using-azure-sql-data-sync/), you should explicitly deny permissions on the tables you don't want to sync. I have done this, but I now get an error while trying to retrieve the schema because there are tables with '.', '[' or ']' in their names. Even though I deny the permissions on these tables, Azure gives an error (it probably executes some query to get the schema, and the tables that the user has no access to still show up in the results).
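For reference, the kind of statement the blog describes is a plain DENY against the credential that Data Sync connects with; the user, table, and schema names below are placeholders, not objects from my database:
-- Hide objects from Data Sync's schema refresh by denying SELECT to the sync user.
DENY SELECT ON OBJECT::dbo.[Table.With.Dots] TO datasync_user;
DENY SELECT ON SCHEMA::staging TO datasync_user;   -- or deny a whole schema at once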

SSIS: Huge Data Transfer from Source (SQL Server) to Destination (SQL Server)

Requirements:
Transfer millions of records from source (SQL Server) to destination (SQL Server).
Structure of source tables is different from destination tables.
Refresh data once per week in destination server.
Minimum amount of time for the processing.
I am looking for an optimized approach using SSIS.
I was thinking of these options:
Create a SQL dump from the source server and import that dump into the destination server.
Directly copy the tables from the source server to the destination server.
There are lots of issues to consider here, such as whether the servers are in the same domain, on the same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records but in smaller amounts. An SSIS package handles that batching logic for you, but you can always recreate it yourself; SSIS just makes iterating over the changes easier. Sometimes this is a reason to push changes more often rather than wait an entire week, as smaller syncs are easier to manage with less downtime.
Another consideration is to be sure you understand your deltas and to ensure that you have ALL of the changes. For this reason I would generally suggest using a staging table at the destination server. By moving changes to staging and then loading to the final table you can more easily ensure that changes are applied correctly. Think of the scenario of an increment being out of order (identity insert), datetimes ordered incorrectly, or one chunk failing. When using a staging table you don't have to rely solely on the id/date and can actually do joins on primary keys to look for changes.
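A rough sketch of that staging comparison, with invented table and column names, just to illustrate the primary-key join:
-- New rows: present in staging but not yet in the final table.
INSERT INTO dbo.FinalTable (Id, Col1, Col2)
SELECT s.Id, s.Col1, s.Col2
FROM stg.FinalTable AS s
LEFT JOIN dbo.FinalTable AS t ON t.Id = s.Id
WHERE t.Id IS NULL;

-- Changed rows: same key, different values.
UPDATE t
SET t.Col1 = s.Col1,
    t.Col2 = s.Col2
FROM dbo.FinalTable AS t
JOIN stg.FinalTable AS s ON s.Id = t.Id
WHERE t.Col1 <> s.Col1
   OR t.Col2 <> s.Col2;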
Linked Servers, proposed by Alex K., can be a great fit, but you will need to pay close attention to a couple of things. Always do it from the destination server so that it is a PULL, not a push. Linked servers are fast at querying data but horrible at updating/inserting in bulk. An XML column cannot be in the table at all. You may need to set some specific properties for distributed transactions.
I have done this task both ways, and I would say that SSIS does give a bit of an advantage over a Linked Server just because of its robust error handling, threading logic, and ability to use different adapters (OLE DB, ODBC, etc.; they have different performance characteristics, and a quick search will turn up comparisons). But the key to your fourth requirement (minimum processing time) is to do it in smaller chunks from a staging table, and if you can do it more often it is less likely to have an impact. E.g. daily means each load would already be ~1/7th of the size of a weekly one, assuming an even daily distribution of changes.
Take 10,000,000 records changed per week:
Once weekly = 10 million records
Once daily = ~1.4 million records
Once hourly = ~59K records
Once every 5 minutes = less than 5K records
And if it has to be once a week, still think about doing it in small chunks so that each insert has less effect on your transaction logs, actual lock time on the production table, etc. Be sure that you never allow loading of partially staged/transferred data, otherwise identifying deltas could get messed up and you could end up missing changes.
One other thought, if this is a scenario like a reporting instance and you have enough server resources: you could bring your entire table over from production into a staging table, or update a copy of the table at the destination, and then simply drop the current table and rename the staging table. This is an extreme scenario and not one I generally like, but it is possible, and the actual impact to the user would be very nominal.
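A minimal sketch of that swap, assuming the staged copy is called dbo.MyTable_Staging (all names here are made up):
-- Replace the live table with the fully loaded staging copy in one short step.
-- sp_rename takes the new name without the schema prefix.
BEGIN TRAN;
    DROP TABLE dbo.MyTable;
    EXEC sp_rename 'dbo.MyTable_Staging', 'MyTable';
COMMIT;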
I think SSIS is good at transferring data; my approach here:
1. Create a package with one Data Flow Task to transfer the data. If the structure of the two tables is different, that's okay, just map the columns.
2. Create a SQL Server Agent job to run your package every weekend.
Also, the Track Data Changes (SQL Server) feature is worth a look. You can configure when you want to sync data, and it performs well too.
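As a rough illustration of that feature, Change Tracking (one of the options under Track Data Changes) can be switched on with a couple of statements; the database and table names below are placeholders, and the retention should match how often you sync:
-- Enable Change Tracking at the database level, keeping 14 days of history.
ALTER DATABASE SourceDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 14 DAYS, AUTO_CLEANUP = ON);

-- Track the individual table so only changed rows need to be pulled each run.
ALTER TABLE dbo.SourceTable
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);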
With SQL Server versions newer than 2005, it has been my experience that dumping to a file and importing it is equal to or slower than transferring data directly from table to table with SSIS.
That said, and in addition to the excellent points @Matt makes, this is the usual pattern I follow for this sort of transfer.
Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
I typically put these into their own database schema so their purpose is clear.
I also typically use the SSIS OLE DB Destination package's "New" button to create the tables.
Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
This can be one package or many, depending on how many tables you're pulling over.
Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables.
Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
Use SSIS Execute SQL tasks to call the stored procedures.
Use parallel processing via Sequence Containers where possible to save time.
This can be one package or many, depending on how many tables you're transforming.
(Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want to use the stored procedures to land the data in your ultimate destination tables, or if you want to have the procedures write to intermediate tables, and then move the transformed data directly into the final tables. Using intermediate tables minimizes down time on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).
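A minimal sketch of one such transformation stored procedure, called from an SSIS Execute SQL Task; the schema, table, and column names are invented for illustration:
CREATE PROCEDURE etl.LoadCustomer
AS
BEGIN
    SET NOCOUNT ON;

    -- Transform rows landed in the replica schema and load only new customers.
    INSERT INTO dbo.Customer (CustomerId, FullName, LoadedAt)
    SELECT r.Id,
           r.FirstName + ' ' + r.LastName,
           SYSDATETIME()
    FROM replica.Customer AS r
    LEFT JOIN dbo.Customer AS c
        ON c.CustomerId = r.Id
    WHERE c.CustomerId IS NULL;
END;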

Merge SQL databases keeping one index

I have to recover a SQL 2008 R2 database for a POS system that broke down without proper backups in place. The .BAK file has been recovered, but was corrupted. However, I was able to retrieve most of the data and get it back into usable shape.
My problem now is as following:
I have database A, which is a fresh installation for the POS system, and database B, which is the recovered .BAK file.
Most of the tables in B are missing their index values, while A has an intact structure, but is (obviously) lacking all the valuable data.
How would I go about merging the two, so that I get a fully-indexed database with the correct structure?
One simple way is to use the built-in command line tool tablediff.exe. It can compare two tables/views and print out the differences.
The tablediff utility is used to compare the data in two tables for non-convergence, and is particularly useful for troubleshooting non-convergence in a replication topology. This utility can be used from the command prompt or in a batch file to perform the following tasks:
Perform a row-by-row comparison between a source table in an instance of Microsoft SQL Server acting as a replication Publisher and the destination table at one or more instances of SQL Server acting as replication Subscribers.
Perform a fast comparison by only comparing row counts and schema.
Perform column-level comparisons.
Generate a Transact-SQL script to fix discrepancies at the destination server to bring the source and destination tables into convergence.
Log results to an output file or into a table in the destination database.
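A hedged example of running the utility from a command prompt; the server, database, table, and file names are placeholders, -f writes a T-SQL fix script, and -o writes the results to a file:
tablediff -sourceserver "SRCSERVER" -sourcedatabase "DatabaseA" -sourcetable "SomeTable" ^
          -destinationserver "DESTSERVER" -destinationdatabase "DatabaseB" -destinationtable "SomeTable" ^
          -f "C:\temp\fix_SomeTable.sql" -o "C:\temp\diff_SomeTable.txt"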

Exporting a table containing a large number of records to another database on a linked server

I have a task in a project that requires the results of a process, which could be anywhere from 1,000 up to 10,000,000 records (approximate upper limit), to be inserted into a table with the same structure in another database across a linked server. The requirement is to be able to transfer in chunks to avoid any timeouts.
In doing some testing I set up a linked server and, using the following code as a test, transferred approx. 18,000 records:
DECLARE @BatchSize INT = 1000

WHILE 1 = 1
BEGIN
    -- Insert the next batch of rows that do not yet exist at the destination
    INSERT INTO [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2] WITH (TABLOCK)
    (
        id
        ,title
        ,Initials
        ,[Last Name]
        ,[Address1]
    )
    SELECT TOP(@BatchSize)
        s.id
        ,s.title
        ,s.Initials
        ,s.[Last Name]
        ,s.[Address1]
    FROM [DBNAME1].[dbo].[TABLENAME1] s
    WHERE NOT EXISTS (
        SELECT 1
        FROM [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2]
        WHERE id = s.id
    )

    -- Stop once a partial batch indicates everything has been copied
    IF @@ROWCOUNT < @BatchSize BREAK
END
This works fine; however, it took 5 minutes to transfer the data.
I would like to implement this using SSIS and am looking for any advice in how to do this and speed up the process.
Open Visual Studio/Business Intelligence Designer Studio (BIDS)/SQL Server Data Tools-BI edition(SSDT)
Under the Templates tab, select Business Intelligence, Integration Services Project. Give it a valid name and click OK.
In Package.dtsx which will open by default, in the Connection Managers section, right click - "New OLE DB Connection". In the Configure OLE DB Connection Manager section, Click "New..." and then select your server and database for your source data. Click OK, OK.
Repeat the above process but use this for your destination server (linked server).
Rename the above connection managers from server\instance.databasename to something better. If databasename does not change across the environments then just use the database name. Otherwise, go with the common name of it. i.e. if it's SLSDEVDB -> SLSTESTDB -> SLSPRODDB as you migrate through your environments, make it SLSDB. Otherwise, you end up with people talking about the connection manager whose name is "sales dev database" but it's actually pointing at production.
Add a Data Flow to your package. Call it something useful besides Data Flow Task. DFT Load Table2 would be my preference but your mileage may vary.
Double click the Data Flow Task. Here you will add an OLE DB Source, a Lookup Transformation and an OLE DB Destination. Probably; as always, it will depend.
OLE DB Source - use the first connection manager we defined and a query
SELECT
s.id
,s.title
,s.Initials
,s.[Last Name]
,s.[Address1]
FROM [dbo].[TABLENAME1] s
Only pull in the columns you need. Your query currently filters out any duplicates that already exist in the destination. Doing that can be challenging. Instead, we'll bring the entirety of TABLENAME1 into the pipeline and filter out what we don't need. For very large volumes in your source table, this may be an untenable approach and we'd need to do something different.
From the Source we need to use a Lookup Transformation. This will allow us to detect the duplicates. Use the second connection manager we defined, one that points to the destination. Change the NoMatch from "Fail Component" to "Redirect Unmatched rows" (name approximate)
Use your query to pull back the key value(s)
SELECT T2.id
FROM [dbo].[TABLENAME2] AS T2;
Map T2.id to the id column.
When the package starts, it will issue the above query against the target table and cache all the values of T2.id into memory. Since this is only a single column, that shouldn't be too expensive but again, for very large tables, this approach may not work.
There are 3 outputs now available from the Lookup: Match, NoMatch and Error. Match will be anything that exists in both the source and the destination. You don't care about those, as you are only interested in what exists in the source and not the destination. When you might care is if you have to determine whether there is a change between the values in the source and the destination. NoMatch are the rows that exist in the Source but don't exist in the Destination. That's the stream you want. For completeness, Error would capture things that went very wrong, but I've not experienced it "in the wild" with a lookup.
Connect the NoMatch stream to the OLE DB Destination. Select your Table Name there and ensure the words Fast Load are in the destination. Click on the Columns tab and make sure everything is routed up.
Whether you need to fiddle with the knobs on the OLE DB Destination is highly variable. I would test it, especially with your larger sets of data and see whether the timeout conditions are a factor.
Design considerations for larger sets
It depends.
Really, it does. But, I would look at identifying where the pain point lies.
If my source table is very large and pulling all that data into the pipeline just to filter it back out, then I'd look at something like a Data Flow to first bring all the rows in my Lookup over to the Source database (use the T2 query) and write it into a staging table and make the one column your clustered key. Then modify your source query to reference your staging table.
Depending on how active the destination table is (whether any other process could load it), I might keep that lookup in the data flow to ensure I don't load duplicates. If this process is the only one that loads it, then drop the Lookup.
If the lookup is at fault (it can't pull in all the IDs), then either go with the first alternative listed above or look at changing your caching mode from Full to Partial. Do realize that this will issue a query to the target system for potentially every row that comes out of the source database.
If the destination is giving issues - I'd determine what the issue is. If it's network latency for the loading of data, drop the value of MaximumCommitInsertSize from 2147483647 to something reasonable, like your batch size from above (although 1k might be a bit low). If you're still encountering blocking, then perhaps staging the data to a different table on the remote server and then doing an insert locally might be an approach.
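A minimal sketch of the first alternative mentioned above (staging the destination IDs in the source database so the source query can filter against them locally); all object names are invented:
-- One-column staging table for the destination keys, clustered on id.
CREATE TABLE stg.Table2Ids (id INT NOT NULL PRIMARY KEY CLUSTERED);

-- A small data flow would populate it from:
--   SELECT T2.id FROM [dbo].[TABLENAME2] AS T2;

-- The source query then filters here instead of in the SSIS pipeline.
SELECT s.id, s.title, s.Initials, s.[Last Name], s.[Address1]
FROM [dbo].[TABLENAME1] AS s
WHERE NOT EXISTS (SELECT 1 FROM stg.Table2Ids AS t WHERE t.id = s.id);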

How to resume data migration from the point where the error happened in SSIS?

I am migrating data from an Oracle database to a SQL Server 2008 R2 database using SSIS. My problem is that at a certain point the package fails, say at some 40,000 rows out of 100,000 rows. What can I do so that the next time I run the package, after correcting the errors, it restarts from the 40,001st row, i.e. the row where the error occurred?
I have tried using checkpoint in SSIS, but the problem is that they work only between different control flow tasks. I want something that can work on the rows that are being transferred.
There's no native magic I'm aware of that is going to "know" that it failed on row 40,000 and when it restarts, it should start streaming row 40,001. You are correct that checkpoints are not the answer and have plenty of their own issues (can't serialize Object types, loops restart, etc).
How you can address the issue is through good design. If your package is created with the expectation that it's going to fail, then you should be able to handle these scenarios.
There are two approaches I'm familiar with. The first approach is to add a Lookup Transformation in the Data Flow between your source and your destination. The goal of this is to identify what records exist in the target system. If no match is found, then only those rows will be sent on to destination. This is a very common pattern and will allow you to also detect changes between source and destination (if that is a need). The downside is that you will always be transferring the full data set out of the source system and then filtering rows in the data flow. If it failed on row 99,999 out of 1,000,000 you will still need to stream all 1,000,000 rows back to SSIS for it to find the 1 that hasn't been sent.
The other approach is to use a dynamic filter in the WHERE clause of your source. If you can make assumptions such as the rows being inserted in order, then you can structure your SSIS package with an Execute SQL Task that runs a query like SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint FROM dbo.MyTable against the Destination database and assigns the result to an SSIS variable (@[User::StartingId]). You then use an expression on your select statement from the Source, something like "SELECT * FROM dbo.MyTable T WHERE T.SomeId > " + (DT_WSTR, 10) @[User::StartingId]. Now when the data flow begins, it will start where it last loaded data. The challenge with this approach is finding those scenarios where you know data hasn't been inserted out of order.
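A small sketch of the two pieces, assuming a monotonically increasing key named SomeId (all names are the placeholder ones from the example above):
-- Execute SQL Task, run against the destination; the single-row result is
-- mapped to the SSIS variable @[User::StartingId].
SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint
FROM dbo.MyTable;

-- With @[User::StartingId] = 40001, the expression on the source evaluates to:
-- SELECT * FROM dbo.MyTable T WHERE T.SomeId > 40001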
Let me know if you have questions, need things better explained, pictures, etc. Also, above code is freehanded so there could be syntax errors but the logic should be correct.