We are working building a new data pipeline for our project and we have to move incremental updates that happen throughout the day on our SQL servers into Azure synapse for some number crunching.
We have to get updates which occur across 60+ tables ( 1-2 million updates a day ) into synapse to crunch some aggregates and statistics as they happen throughout the day.
One of the requirements is being near real time and doing a bulk import into synapse is not ideal because it takes more than 10 mins to do full compute on all data.
I have been reading about CDC feed into synapse https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-data-capture-feature-portal and it is one possible solution.
Wondering if there are other alternatives to this or suggestions for achieving the end goal of data crunching near real time for DB updates.
Change Data Capture (CDC) is the suited way to capture the changes and add to the destination location (storage/database).
Apart from that, you can also use watermark column to capture the changes in multiple tables in SQL Server.
Select one column for each table in the source data store, which you
can identify the new or updated records for every run. Normally, the
data in this selected column (for example, last_modify_time or ID)
keeps increasing when rows are created or updated. The maximum value
in this column is used as a watermark.
Here is the high-level solution diagram for this approach:
Step-by-Step approach is given in this official document Incrementally load data from multiple tables in SQL Server to Azure SQL Database using PowerShell.
I need to merge with 2 different tables from 2 different azure SQL databases where as these two azure sql database are from same azure sql server.
also for performance imporvement purpose, what I need to do is bulk insert and/or bulk update. also, this will be continous activity. for very first time I have to merge all data which is huge. and then whenever respective topic recivies message, I need to add/update that single record only.
what are the different options to do the same. for both processes.
please help. thanks.
You can use Azure SQL Data Sync to merge those tables located on 2 different databases into a third and new database. You just need to create the table with no records, then use Azure SQL Data Sync with one-way sync from those 2 databases (member databases) to the newly created table on the new database (hub database). On the first sync data will be merged on the new table located on the hub database. Every time a record gets updated, deleted or new record arrive on the member databases then that data change is replicated to the hub database and to the merged table.
To know more about the free Azure SQL Data Sync please read here.
I get the following error in Replication Monitor:
The row was not found at the Subscriber when applying the replicated UPDATE command for Table '[dgv].[POSCustomer]' with Primary Key =
The error is actually not about the missing row, but that the table's schema says dgv.
The publication that generated the error is supposed to only replicate to [ppv].[POSCustomer], and should not even be aware of [dgv].[POSCustomer]. And only rows created AFTER the initial snapshot is delivered are affected.
The background:
I'm setting up transactional replication for 3 on-premises databases PPV, DGV, and PAC to a single Azure SQL database.
The three databases belong to different legal entities, on two separate servers (PPV on one, DGV and PAC on another), and have identical schemas.
Tables with the same names from each dbs are set up to be replicated.
To differentiate them in the target db, I put them in three different schemas using the name of their source dbs, i.e ppv.POSCustomer, dgv.POSCustomer, pac.POSCustomer.
This is done by changing the setting in Publication properties -> Articles -> Article properties -> Destination object owner.
The initial snapshots are delivered without problems; however, after some time, the row was not found started showing up in the replication monitor.
I tried re-initializing the subscriptions several times, but the error keeps showing up after the snapshot is delivered.
All rows created after the snapshots are delivered are affected.
The databases are totally isolated from each other, there are no cross database queries, no stored procedures, no triggers that says a record from PPV.dbo.POSCustomer should be updated in DGV.dbo.POSCustomer, so I'm at a loss as why this error happened.
I used sp_browsereplcmd to trace the command that generated the error, which leads me to:
{CALL [sp_MSupd_dboPOSCustomer] (,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019-05-14 00:00:00.000,,27280000.0000,10,,,,,,,,,,,,2019-05-14 18:30:04.000,,,,,,,,,,,,,,,,,,,,N'vinhn4-00001395',0x00000000d000080000)}
which I don't understand, and the sp is not part of our POS app.
How can I make this error go away? Manually inserting missing rows will not work, as all new rows are affected. Turning on -skiperrors is not an option. Replicating to different target databases have been done successfully before, but setting up cross database query is such a pain with Azure SQL that I'd prefer to avoid \if possible.
I am using Azure Machine Learning to clustering data.
The input data is from an Azure SQL Database, and it works fine.
At the end of everything I want to write the output to a table in the same Azure SQL Database, but I get this error:
Error: Error 1000: AFx Library library exception:
Sql encountered an error: Login failed for user
Anyone any idea?
Thank you very much!
Please follow the instructions and examine the examples provided here to properly use the Export Data module to save the data of ML to Azure SQL Database.
How to Export Data to an Azure SQL Database
Add the Export Data module to your experiment. You can find this module in the Data Input and Output group in the experiment items list in Azure Machine Learning Studio.
Connect it to the module that produces the data that you want to export to Azure SQL DB.
For Data destination, select Azure SQL Database. This option supports Azure SQL Data Warehouse as well.
Set the following options specific to Azure SQL Database or Azure SQL Data Warehouse.
Database server name
Type the server name that is generated by Azure. Typically it has the form <generated_identifier>.database.windows.net.
Database name
Type the name of a database on the server you just specified.The database must already exist; the Export Data cannot create it.
Server user account name
Type the user name of an account that has access permissions for the database.
Server user account password
Provide the password for the specified user account.
Comma-separated list of columns to be saved
Type the names of the columns in the experiment that you want to write to the database.
Data table name
Type the name of the table where data will be stored.
For Azure SQL Database, if the table does not exist, it will be created. For Azure SQL Data Warehouse, the table must already exist and have the correct schema, so be sure to create it in advance.
Comma-separated list of datatable columns
Type the names of the columns as you wish them to appear in the destination table. The columns should correspond in order with the column names that you list in Comma-separated list of columns to be saved.
if you are writing to Azure SQL Data Warehouse, the columns names must match those already in the destination table schema.
Number of rows written per SQL Azure operation
Indicate how many rows should be written to the destination table in each batch. By default, the value is set to 50, which is the default batch size for Azure SQL Database. However, you should increase this value if you have a large number of rows to write.
TIP:
For Azure SQL Data Warehouse, we recommend that you set this value to 1. If you use a larger batch size, the size of the command string that is sent to Azure SQL Data Warehouse can exceed the allowed string length, causing an error.
If you don't want to write new results each time you run the experiment, select the Use cached results option. If there are no other changes to module parameters, the experiment will write the data the first time the module is run, and thereafter not perform writes.
However, a write will always be performed if any parameters have been changed in Export Data that would change the results.
Run the experiment.
Find the issue!
I needed to create an specific user with this SQL code:
CREATE USER AMLApplicationUser WITH PASSWORD = '************';
and then add the user to these roles on the database I want to write.
ALTER ROLE db_datareader ADD MEMBER AMLApplicationUser;
ALTER ROLE db_datawriter ADD MEMBER AMLApplicationUser;
I guess only the datawriter role is enough, but I needed datareader too.
So in conclusion, seems that database admin role can be used to read data, but not to write data from AML.
Thank you for your help!
How can I sync two databases and do a manual refresh on the entities on either of the database whenever I want?
Let's say I have two databases DB1(prod) and DB2(dev). I want to update/insert only a few tables from prod DB to dev DB. How could I achieve this? Is this possible instead of DBlink since I do not have privileges to create a database link?
If you only want to do a manual refresh set up an import/export/datapump script to copy the data across if there is not too much data involved. If there is a large amount of data you could write some pl/sql as described above to only move the new/changed rows. This will be easier if your data has fields such as created/updated_on