Create an Azure Data Factory pipeline to copy new records from DocumentDB to Azure SQL

I am trying to find the best way to copy yesterday's data from DocumentDB to Azure SQL.
I have a working DocumentDB database that is recording data gathered via a web service. I would like to routinely (daily) copy all new records from the DocumentDB to an Azure SQL DB table. In order to do so I have created and successfully executed an Azure Data Factory Pipeline that copies records with a datetime > '2018-01-01', but I've only ever been able to get it to work with an arbitrary date - never getting the date from a variable.
My research on DocumentDB SQL querying shows that it has Mathematical, Type checking, String, Array, and Geospatial functions but no date-time functions equivalent to SQL Server's getdate() function.
I understand that Data Factory Pipelines have some system variables that are accessible, including utcnow(). I cannot figure out, though, how to actually use those by editing the JSON successfully. If I try just including utcnow() within the query I get an error from DocumentDB that "'utcnow' is not a recognized built-in function name".
"query": "SELECT * FROM c where c.StartTimestamp > utcnow()",
If I try instead to build the string within the JSON using utcnow() I can't even save it because of a syntax error:
"query": "SELECT * FROM c where c.StartTimestamp > " + utcnow(),
I am willing to try a different technology than a Data Factory Pipeline, but I have a lot of data in our DocumentDB so I'm not interested in abandoning that, and I have much greater familiarity with SQL programming and need to move the data there for joining and other analysis.
What is the easiest and best way to copy those new entries over every day into the staging table in Azure SQL?

Are you using ADF V2 or V1?
For ADF V2.
I think you can follow the incremental approach that they recommend. For example, you could have a watermark table (it could be in your target Azure SQL database) and two Lookup activities: one lookup obtains the previous run's watermark value (a date, an integer, whatever your audit value is), and the other obtains the MAX of the watermark column (e.g. the date) in your source documents. A Copy activity then gets all the records where c.StartTimestamp <= MaxWatermarkValueFromSource AND c.StartTimestamp > LastWatermarkValue.
I followed this example using the Python SDK and it worked for me.
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-powershell
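Following that pattern, the Copy activity's DocumentDB source query could look roughly like this (the Lookup activity names and output columns are assumptions, not copied from the tutorial):
"query": "SELECT * FROM c WHERE c.StartTimestamp > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}' AND c.StartTimestamp <= '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}'"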

Related

Problem importing data from azure sql db using azure data factory into Dynamics 365 createdon and modifiedon fields

I want to copy account records from an azure sql db to D365. I am using Azure data factory to achieve this.
I am having problems with a couple of D365 fields, createdon and modifiedon. When I try to populate these with the values from my db they are ignored and D365 instead uses the datetime when I actually insert them into D365.
I've done some reading up and it seems these are locked fields. There are a number of ways to get round this from other tools but I haven't seen a way to get around this issue when importing records from azure data factory.
I've tried making the source field in the db that I want to use nvarchar, date, and datetime, but all are ignored.
Does anyone know if this is possible and if so how I can achieve it? Can I do it through ADF alone or do I need to configure D365 to allow these fields to be created with a custom value?
Thanks
Macca

Azure Data Factory - Rerun Failed Pipeline Against Azure SQL Table With Differential Date Filter

I am using ADF to keep an Azure SQL DB in sync with an on-prem DB. The on-prem DB is read only and the direction is one-way, from the Azure SQL DB to the on-prem DB.
My source table in the Azure SQL Cloud DB is quite large (tens of millions of rows), so I have the pipeline set to use an UPSERT (merge, trying to create a differential merge). I am using a filter on the Source table, and the Filter Query has a WHERE condition that looks like this:
[HistoryDate] >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
AND [HistoryDate] < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'
The HistoryDate column is auto-maintained in the source table with a getUTCDate() type approach. New records will always get a higher value and be included in the WHERE condition.
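(For illustration only; the question doesn't show the table definition, but such a column is typically declared with a UTC default along these lines:)
-- hypothetical names; the real source table is not named in the question
ALTER TABLE dbo.SourceTable
    ADD HistoryDate datetime2 NOT NULL
        CONSTRAINT DF_SourceTable_HistoryDate DEFAULT (GETUTCDATE());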
This works well, but here is my question: I am testing on my local machine before deploying to the client. When I am not working, my laptop hibernates and the pipeline rightfully fails because my local SQL instance is "offline" during that run. When I move this to production this should not be an issue (computer hibernating), but what happens if the client's connection is temporarily lost (i.e., the client loses internet for a time)? Because my pipeline has a WHERE condition on the source to keep the upsert to a practical number of rows, any failure would result in the loss of any data created during that 5-minute window.
A failed pipeline can be rerun, but the run time would then be different, and I would effectively miss the block of records that would have been picked up if the pipeline had run on time: pipeline().parameters.windowStart and pipeline().parameters.windowEnd will now be different.
As an FYI, I have this running every 5 minutes to keep the local copy in sync as close to real-time as possible.
Am I approaching this correctly? I'm sure others have this scenario and it's likely I am missing something obvious. :-)
Thanks...
Sorry to answer my own question, but to potentially help others in the future, it seems there was a better way to deal with this.
ADF offers a "Metadata-driven Copy Task" utility/wizard on the home screen that creates a pipeline. When I used it, it offered a "Delta Load" option for tables, which takes a "Watermark". The watermark is a column such as an incrementing IDENTITY column, an increasing date, or a timestamp. At the end of the wizard, it lets you download a script that builds a control table and a corresponding stored procedure that maintains the value of each watermark after each run. For example, if I wanted my delta load to be based on an IDENTITY column, it stores the max value of that column from a particular pipeline run. The next time a run happens (trigger), it uses this as the MIN value (minus 1) together with the current MAX value of the IDENTITY column to get the records added since the last run.
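Conceptually, the script's control objects boil down to something like the following sketch (names are illustrative; the wizard generates its own table, procedure, and column names):
-- illustrative watermark control table; the downloaded script creates its own version of this
CREATE TABLE dbo.DeltaLoadWatermark
(
    TableName nvarchar(255) NOT NULL,
    WatermarkValue bigint NOT NULL  -- e.g. the max IDENTITY value seen in the last successful run
);

-- illustrative procedure the pipeline calls after each run to advance the watermark
CREATE PROCEDURE dbo.usp_UpdateWatermark
    @TableName nvarchar(255),
    @NewWatermarkValue bigint
AS
BEGIN
    UPDATE dbo.DeltaLoadWatermark
    SET WatermarkValue = @NewWatermarkValue
    WHERE TableName = @TableName;
END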
I was going to approach things this way, but it seems like ADF already does this heavy lifting for us. :-)

Azure Synapse pipeline: How to move incremental updates from SQL Server into synapse for crunching numbers

We are working on building a new data pipeline for our project, and we have to move incremental updates that happen throughout the day on our SQL Servers into Azure Synapse for some number crunching.
We have to get updates which occur across 60+ tables (1-2 million updates a day) into Synapse to crunch some aggregates and statistics as they happen throughout the day.
One of the requirements is being near real time, and doing a bulk import into Synapse is not ideal because it takes more than 10 minutes to do a full compute on all the data.
I have been reading about a CDC feed into Synapse https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-data-capture-feature-portal and it is one possible solution.
Wondering if there are other alternatives to this or suggestions for achieving the end goal of data crunching near real time for DB updates.
Change Data Capture (CDC) is a well-suited way to capture the changes and load them into the destination (storage/database).
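For reference, enabling CDC on a SQL Server source comes down to a couple of system procedure calls; a minimal sketch (dbo.Orders is only an illustrative table name):
-- run once in the source database (requires appropriate permissions)
EXEC sys.sp_cdc_enable_db;

-- enable CDC for one table; repeat for each table you want to track
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',  -- illustrative table name
    @role_name     = NULL;       -- NULL = do not gate access behind a database role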
Apart from that, you can also use a watermark column to capture the changes in multiple tables in SQL Server.
Select one column for each table in the source data store with which you can identify the new or updated records for every run. Normally, the data in this selected column (for example, last_modify_time or ID) keeps increasing when rows are created or updated. The maximum value in this column is used as a watermark.
The step-by-step approach (with a high-level solution diagram) is given in this official document: Incrementally load data from multiple tables in SQL Server to Azure SQL Database using PowerShell.

ADF - How should I copy table data from source Azure SQL Database to 6 other Azure SQL Databases?

We curate data in the "Dev" Azure SQL Database and then currently use RedGate's Data Compare tool to push it up to 6 higher Azure SQL Databases. I am trying to migrate that manual process to ADFv2 and would like to avoid copy/pasting the 10+ copy data activities for each database (x6) to keep it more maintainable for future changes. The static tables have some customization in the copy data activity, but the basic idea follows this post to perform an upsert.
How can the implementation described above be done in Azure Data Factory?
I was imagining something like the following:
Using one parameterized linked service with the server name & database name configurable, to generate a dynamic connection to Azure SQL Database.
Creating a pipeline for each table's copy data activity.
Creating a master pipeline to then nest each table's pipeline in.
Using variables to loop over the different connections and passing those to the sub-pipelines' parameters.
Not sure if that is the most efficient plan or even works yet. Other ideas/suggestions?
We can't tell you whether that's the most efficient plan, but I think so. Just make it work.
As you said in the comment:
We can use Dynamic Pipelines - Copy multiple tables in Bulk with 'Lookup' & 'ForEach'. We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the list of data or the first row of data. ForEach - @activity('Azure SQL Table lists').output.value ; @concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv'). This is efficient and cost optimized since we are using fewer activities and datasets.
Usually we would also choose the same solution as you: dynamic parameters/pipelines with Lookup + ForEach activities to achieve the scenario. In a word, give the pipeline strong logic and keep it simple and efficient.
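As a rough sketch of that Lookup + ForEach wiring (the activity name and query are illustrative assumptions):
-- Lookup activity ('Azure SQL Table lists'): source query against the Dev database
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE';

-- ForEach activity, Items setting:
--   @activity('Azure SQL Table lists').output.value
-- Inside the ForEach, each Copy/Execute Pipeline activity receives per-table values such as
--   @item().TABLE_SCHEMA and @item().TABLE_NAME,
-- alongside the parameterized linked service's server name and database name for each target.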

Get the Last Modified date for all BigQuery tables in a BigQuery Project

I have several databases within a BigQuery project which are populated by various job engines and applications. I would like to maintain a dashboard of all of the Last Modified dates for every table within our project to monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?
For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM `dataset.__TABLES__` where table_id = 'table_id'
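To list every table in a dataset rather than filtering to a single table_id, a variant along these lines should also work:
#standardSQL
SELECT table_id, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `dataset.__TABLES__`
ORDER BY last_modified DESC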
I recommend, though, that you see if you can log these errors at the application level. By doing so you can also understand why something didn't work as expected.
If you are already using GCP you can make use of Stackdriver (it works on AWS as well). We started using it in our projects and I recommend giving it a try (we tested it with Python applications, though; not sure how the tool performs with other clients, but it might be quite similar).
I've just queried stacked GA4 data using the following code:
SELECT table_id, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM analytics_#########.__TABLES__
WHERE table_id LIKE 'events_2%'
I have kept the 2 on the events to ensure my intraday tables do not pull through also.