How to perform scheduled data transfers with AdlCopy?

I'm working on backup and recovery for Data Lake Store. In a nutshell, we need to back up one Data Lake Store to another. I've chosen AdlCopy for that purpose (if you want to know why, check out my previous post: Backup of Data Lake Store).
According to https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-best-practices#resiliency-considerations, AdlCopy supports orchestration through either Azure Automation or Windows Task Scheduler. I'm more keen on using Azure Automation, however. Can someone clarify how I'm supposed to use Azure Automation to run AdlCopy on a schedule? Do I need a VM? AdlCopy only supports Windows 10, and I can't figure out how Azure Automation will help me achieve a serverless approach (without Data Factory, if possible).

If you are going to run scheduled copies, it is best to do them using Azure Data Factory (ADF). AdlCopy works great for quick one-off transfers of data, but for scheduled copies that need full monitoring support, built-in retries, etc., ADF is the better choice. If there are reasons you cannot use ADF, please do let us know.
Thanks,
Sachin Sheth,
Program Manager, Azure Data Lake.
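
To make the trade-off concrete: Azure Automation on its own will not give you a serverless AdlCopy run, because AdlCopy.exe still has to execute on a Windows machine. The usual pattern for the Azure Automation route is a Hybrid Runbook Worker, i.e. a Windows VM you manage with AdlCopy installed, plus a runbook that simply shells out to the tool (Azure Automation also supports Python runbooks). Below is a minimal sketch of such a runbook body; the install path, account names, and swebhdfs URIs are placeholders, and the exact AdlCopy switches should be verified against its documentation.

```python
# Hypothetical runbook body for a Hybrid Runbook Worker (a Windows VM you
# manage) with AdlCopy installed. Paths, account names and URIs below are
# placeholders -- verify the switches against the AdlCopy documentation.
import subprocess

ADLCOPY = r"C:\Tools\AdlCopy\AdlCopy.exe"  # assumed install location

def backup_store():
    cmd = [
        ADLCOPY,
        "/Source", "swebhdfs://primarystore.azuredatalakestore.net/data/",
        "/Dest",   "swebhdfs://backupstore.azuredatalakestore.net/data/",
    ]
    # AdlCopy may prompt for Azure AD credentials on first run; unattended,
    # scheduled runs are exactly where ADF is the smoother option.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"AdlCopy failed: {result.stderr.strip()}")

if __name__ == "__main__":
    backup_store()
```

In other words, the Azure Automation path still needs a VM; if the goal is truly serverless, ADF with a scheduled copy is the closer fit.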

Related

What is the fastest way to load data into Azure Hyperscale?

I have a need to load data into Azure Hyperscale incrementally.
The source data is in an Azure VM that has SQL Server installed.
The source database is about 6 TB in size and has about 370 tables.
We need a way to get the incremental changes from the last X hours and sync them into the same database in Hyperscale.
Ideally, we would extend our database with an availability group setup, but since Hyperscale does not support that, we need to find another way to keep the databases in sync.
The source database does have change data capture (CDC) enabled.
The best online migration option is to use the Azure Database Migration Service (link), which supports the Online (continuous sync) migration scenario (link) you need:
The sync essentially runs in the background until it completes, and the data that has already been migrated remains accessible. I believe this is a continuous-copy scenario rather than an incremental one. With PaaS database services, you do not have access to perform snapshot replication operations against external data sources; the Hyperscale instance is built on snapshot replication, but that currently only serves the hosted database functionality.
Regards,
Mike
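
If the DMS continuous-sync route turns out not to fit and you end up building the incremental sync yourself on top of the CDC you already have enabled, a rough sketch of pulling one table's changes for a time window might look like the following. The connection string, the capture-instance name, and the apply step are assumptions for illustration, not part of the answer above.

```python
# Hypothetical sketch: pull the last N hours of CDC changes for one table
# on the source SQL Server, so they can be replayed into Hyperscale.
# The connection string and capture-instance name are placeholders.
import pyodbc

SOURCE = ("DRIVER={ODBC Driver 18 for SQL Server};SERVER=source-vm;"
          "DATABASE=SrcDb;UID=reader;PWD=<secret>;Encrypt=yes;")

def pull_changes(hours_back, capture_instance="dbo_Orders"):
    with pyodbc.connect(SOURCE) as src:
        cur = src.cursor()
        # Map the time window onto the LSN range CDC has recorded.
        cur.execute(
            "SELECT sys.fn_cdc_map_time_to_lsn("
            "'smallest greater than or equal', DATEADD(hour, ?, GETDATE())), "
            "sys.fn_cdc_get_max_lsn()", -hours_back)
        from_lsn, to_lsn = cur.fetchone()
        if from_lsn is None:
            return []  # nothing captured in the window
        # The capture instance is part of the function name, so it cannot
        # be passed as a query parameter.
        cur.execute(
            f"SELECT * FROM cdc.fn_cdc_get_all_changes_{capture_instance}(?, ?, 'all')",
            from_lsn, to_lsn)
        return cur.fetchall()

# Each row carries __$operation (1=delete, 2=insert, 4=update) plus the
# table columns; applying them is typically a per-table MERGE on the target.
```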

How to schedule a query in Azure Synapse on-demand

How do I schedule a query in Azure Synapse on-demand and save the result to Azure Storage every hour?
My idea is to materialize the results into separate storage and use Power BI to access them.
Besides the fact that Power BI can directly access your Synapse instance, if you want to go this route you have several options:
This can be done using a pipeline in the new Synapse Workspace. You should be aware that this technology is still in preview.
Use PolyBase and stored procedures on a job scheduler to INSERT into a Blob Storage location. There is a lot of configuration involved in this option.
At present, I would recommend Azure Data Factory (ADF) on a Schedule Trigger. This is the simplest and most reliable of the current options. Based on the scenario you described, a single Copy activity could easily perform this task.
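
For the ADF option, the moving parts are an hourly Schedule Trigger attached to a pipeline whose Copy activity runs the query and writes the result to storage. As a rough sketch using the azure-mgmt-datafactory Python SDK (the subscription, resource group, factory, and pipeline names are placeholders, and the pipeline with its Copy activity is assumed to exist already):

```python
# Sketch: attach an hourly ScheduleTrigger to an existing ADF pipeline.
# Subscription, resource group, factory and pipeline names are placeholders.
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Hour", interval=1,
        start_time=datetime(2021, 1, 1, tzinfo=timezone.utc), time_zone="UTC"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="ExportQueryToStorage"))],
)

client.triggers.create_or_update(
    "my-resource-group", "my-data-factory", "HourlyExport",
    TriggerResource(properties=trigger))

# Triggers are created stopped; start it explicitly
# (older SDK versions use .start() instead of .begin_start()).
client.triggers.begin_start("my-resource-group", "my-data-factory", "HourlyExport").result()
```

The same trigger can, of course, be created through the ADF portal UI; the SDK version is only meant to make the shape of the configuration explicit.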

How to choose between Azure data lake analytics and Azure Databricks

Azure Data Lake Analytics and Azure Databricks can both be used for batch processing. Could anyone please help me understand when to choose one over the other?
In my humble opinion, a lot of it comes down to existing skillsets. If you have a team experienced in Spark, Java, Python, R, or Scala, then Databricks is a natural fit. If, on the other hand, you have a team with existing SQL and C# skills, then the learning curve for U-SQL will be less steep.
That aside, there are other questions which can drive out differences:
Do you require real-time interaction (Databricks) or batch-mode analytics (both)? There is a feedback item for real-time interactivity for U-SQL; please vote for it.
Do you want a pay-as-you-go model (U-SQL) or clusters with auto-terminate after a certain period (Databricks)?
Do you prefer working in a notebook (Databricks) or in Visual Studio / VS Code / PowerShell / the .NET SDK (U-SQL)?
Do you want to use Spark libraries like GraphX (Databricks)?
Do you want the ability to run and scale any runtime (U-SQL)? See here for more details.
Do you want a local development emulator (U-SQL)?
The U-SQL emulator in Visual Studio is seamless, i.e. you develop your code against your local drives in the same structure as your lake (for free), then simply click the drop-down in Visual Studio to run in the cloud. Although I think you can have a local Spark environment, I'm not sure what the local (and disconnected) development experience is like for Databricks.
Are you using ADLS Gen 2 (only Databricks)? See here.
UPDATE October 2018:
As far as I am aware, U-SQL does not currently support ADLS Gen 2, which would count against it (happy to be corrected). I will update the post if and when that support is added.
UPDATE January 2019:
U-SQL has not had any meaningful updates since Spring 2018.
HTH
Databricks has more language options, which allows professionals with different skills to work on the data. Also, with Databricks you can run jobs on high-performance, in-memory clusters.
In one project, we use the data lake more as storage and do all the jobs (ETL, analytics) via Databricks notebooks. Storing data in the data lake is also cheaper.
Back to your question: if you have complex batch jobs and different types of professionals will be working on the data, you may choose an Azure Data Lake + Databricks architecture. Otherwise, Azure Data Lake alone would satisfy your needs.
Taking a look at these two articles should help:
https://databricks.com/glossary/data-lake
https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/

Best Approach for syncing Azure SQL Database

Right now, our application has only one Web Site instance, along with a SQL Database, deployed in an Azure US datacenter. We are looking at deploying more Web Site instances in other datacenters, such as APAC and Europe. There would still be a local SQL Database for each of those Web Site instances. We would like end users to be able to fail over to another instance if their registered instance is not available; for example, if the US Web Site instance is down, we could fail the user over to the Europe instance. For this, we would need to synchronize the local SQL Databases across all datacenters: US, Europe, and APAC.
So we are looking for the best approach to implement database synchronization for Azure SQL Database. Here is what we have found so far:
Azure Data Sync: it looks like the perfect choice, since it is available right away in the Azure Management Portal and would be up and running with some simple configuration. However, there seem to be a couple of catches. The feature has been in preview for about 2 years now (see this link, with the following quote from a comment):
SQL Data Sync has been in preview for over 2 years and the last update was December 2012. Has this been abandoned? Is this a technology we should encourage our clients to use? There absolutely needs to be an ability to synchronize data between a local SQL DB and Azure but Microsoft seems to have dropped this and I'm leery of putting a client on this only to find that the plug has been pulled. You owe it to your users to give us some information
I also saw the post "Azure data sync not syncing all databases" on SO; it seems this is a second-class feature in Azure and MS doesn't really pay sufficient attention to it, so I am worried about how good it is.
Microsoft Sync Framework: it seems to be a more generic sync framework, more suited to client-to-server sync than to sync among server databases. Plus, it is not as simple as SQL Data Sync above, which is available just by configuration in Azure.
Any other suggestions on SQL Database sync in Azure? It would be really appreciated if you could share your experience here.
Thanks very much in advance for your insight.
Update:
Azure Data Sync is built on the Microsoft Sync Framework technologies (see link); to quote:
Microsoft SQL Data Sync is a cloud-based data synchronization service built on the Microsoft Sync Framework technologies.
Since no one is answering this question, I am going to do it myself. Based on the latest information, Azure Data Sync is buggy and cannot be used in production at this point. I guess that's the reason it has never moved out of preview, even after around 2 years. There is no other good approach for handling Azure SQL Database sync at this point unless you want to build something yourself.
You can use RedGate Data Compare to sync your Azure SQL DB with your local DB.

Periodically update data in Sql Azure Database

I have a SQL Azure Database instance which provides data to a Windows 8 app. The data in my database should be updated periodically (weekly). Is there any way to do this? I'm thinking of writing an app which will run weekly and update the database, but I still don't know how to make it run on Windows Azure. Please help!
Thank you,
There are a number of ways to achieve this. Does the data need to come from a different source, however, or can it be calculated?
Either way, seeing as you're already knee-deep in SQL Azure, I would suggest putting your logic into a worker role that can be scheduled to run your updates once a week. This would give you a great opportunity to do calculations and/or fetch data externally.
Azure also gives you the flexibility to scale this worker role out to numerous instances, depending on the workload.
Here is a nice intro tutorial on creating a worker role on Azure: link
Write the application and set it to run through your cron job manager on a weekly time schedule. The cron job manager should be provided by your host.
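
Whichever host launches it (a worker role on a timer or the host's cron manager, as suggested above), the weekly job itself can stay very small. A minimal sketch of the update logic, with the connection string and the refresh statement as placeholder assumptions:

```python
# Minimal sketch of the weekly update job itself; how it gets launched
# (worker role timer or the host's cron manager) is up to the chosen host.
# The connection string and the refresh procedure are placeholders.
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 18 for SQL Server};"
            "SERVER=myserver.database.windows.net;DATABASE=mydb;"
            "UID=updater;PWD=<secret>;Encrypt=yes;")

def run_weekly_update():
    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        # Replace with your own refresh logic: fetch external data first,
        # or run a set-based UPDATE / stored procedure like the one below.
        cur.execute("EXEC dbo.RefreshWeeklyData")  # hypothetical procedure
        conn.commit()

if __name__ == "__main__":
    run_weekly_update()
```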