Azure Data Lake Analytics and Azure Databricks can both be used for batch processing. Could anyone please help me understand when to choose one over the other?
In my humble opinion, a lot of it comes down to existing skillsets. If you have a team experienced in Spark, Java, Python, R or Scala, then Databricks is a natural fit. If, on the other hand, you have a team with existing SQL and C# skills, then the learning curve for them with U-SQL will be less steep.
That aside, there are other questions which can drive out differences:
Do you require real-time interaction (Databricks) or batch-mode analytics (both)? There is a feedback item for real-time interactivity for U-SQL; please vote for it.
Do you want a pay-as-you-go model (U-SQL) or clusters that auto-terminate after a set idle period (Databricks)?
Do you prefer working in a notebook (Databricks) or in Visual Studio / VS Code / PowerShell / the .NET SDK (U-SQL)?
Do you want to use Spark libraries like GraphX (Databricks)?
Do you want the ability to run and scale any runtime (U-SQL)? See here for more details.
Do you want a local development emulator (U-SQL)?
The U-SQL emulator in Visual Studio is seamless, i.e. you develop your code against your local drives, in the same folder structure as your lake (for free), then simply click the drop-down in Visual Studio to run in the cloud. Although I think you can have a local Spark environment, I'm not sure what the local (and disconnected) development experience is like for Databricks (a minimal local-Spark sketch follows this list).
Are you using ADLS Gen 2 (only Databricks)? See here.
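On the local Spark point above: you can spin up a local PySpark session for disconnected development without any cluster. A minimal sketch, assuming PySpark is installed locally (e.g. via pip install pyspark); the sample file path is a placeholder:

```python
# Minimal local Spark session for offline development (plain Spark, not the Databricks runtime).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # use all local cores; no cluster required
    .appName("local-dev")
    .getOrCreate()
)

# Develop against local sample files laid out like your lake's folder structure.
df = spark.read.csv("sample_data/events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()

spark.stop()
```

This gives you offline iteration against local sample data, though it is vanilla Spark rather than the full Databricks notebook experience.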
UPDATE October 2018:
As far as I am aware, U-SQL does not currently support ADLS Gen 2, which would count against it (happy to be corrected). I will update the post if and when that support is added.
UPDATE January 2019:
U-SQL has not had any meaningful updates since Spring 2018.
HTH
Databricks offers more language options, which allows professionals with different skills to work on the data. Also, with Databricks you can run jobs on high-performance, in-memory clusters.
In one project, we use the data lake mostly as storage and do all the jobs (ETL, analytics) via Databricks notebooks. Storing data in the data lake is cheaper.
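As an illustration of that pattern, here is a minimal PySpark sketch of the kind of ETL cell we run in a Databricks notebook; the abfss:// paths, container and column names are placeholders, and the storage account is assumed to be accessible to the cluster:

```python
# Sketch of a Databricks notebook ETL cell: read raw data from the lake,
# apply a simple aggregation, and write the result back as Parquet.
# The abfss:// paths below are placeholders for your own storage account/containers.
from pyspark.sql import functions as F

raw_path = "abfss://raw@yourlakestorage.dfs.core.windows.net/sales/"
curated_path = "abfss://curated@yourlakestorage.dfs.core.windows.net/sales_daily/"

raw = spark.read.json(raw_path)          # `spark` is provided by the Databricks runtime

daily = (
    raw.withColumn("order_date", F.to_date("order_timestamp"))
       .groupBy("order_date", "region")
       .agg(F.sum("amount").alias("total_amount"))
)

daily.write.mode("overwrite").partitionBy("order_date").parquet(curated_path)
```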
Back to your question: if you have complex batch jobs and different types of professionals will be working on the data, you may choose an Azure Data Lake + Databricks architecture. Otherwise, Azure Data Lake alone would satisfy your needs.
Taking a look at these two articles would help:
https://databricks.com/glossary/data-lake
https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/
Related
I can create an Azure data lake database with pre-built tables using Azure Synapse database templates from the Synapse Studio UI, but is there a way to use these templates programmatically? So far, from my research, I have not found a command, API, or SDK for this. Perhaps I could create the database and tables via the UI and then generate the associated Spark SQL creation scripts, but I don't see a way to do that either. Does anyone have any ideas on how to do either of these?
You can create the data lake storage, tables and data inserts programmatically using the Azure SDKs. However, these templates were made available precisely to remove that series of manual tasks: using them saves you the time and effort of creating an environment and sample data for your development.
Therefore, asking to deploy these templates programmatically rather defeats the whole concept of templates. If you want to deploy these resources programmatically, you can use the Azure SDKs.
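For completeness, a minimal sketch of the SDK route using the azure-storage-file-datalake package: it creates an ADLS Gen2 file system and a folder layout and uploads a sample file. The account URL, credential, container and folder names are placeholders, and defining actual lake-database tables on top of this data would still be a separate step (e.g. in Spark SQL):

```python
# Sketch: create an ADLS Gen2 file system and a folder layout with the Azure SDK.
# Package: azure-storage-file-datalake; account URL and credential are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://yourlakestorage.dfs.core.windows.net",
    credential="<storage-account-key-or-token>",
)

fs = service.create_file_system(file_system="lakedb")   # container backing the lake database
fs.create_directory("tables/Customer")                   # one folder per table, for example

# Upload a sample data file into the table folder.
file_client = fs.get_file_client("tables/Customer/customers.csv")
file_client.upload_data(b"id,name\n1,Contoso\n", overwrite=True)
```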
Can anyone please help me understand what components/services Azure Synapse Analytics includes?
From what I have read on both the Microsoft website and in other reviews, it is said to be the new SQL Data Warehouse; however, it is also said to bring together data ingestion (like Azure Data Factory), data warehousing, and big data analytics (like a data lake).
So what components exactly does Azure Synapse Analytics include when you purchase it?
Thanks.
The Azure Synapse Analytics service currently (as of 6 May 2020) refers to Azure SQL Data Warehouse, more specifically to the "gen2" version of it. Microsoft announced the new name "Azure Synapse Analytics" and the upcoming features for the service at the Ignite 2019 event in November 2019. The new features are currently available only in private preview, but I would assume they will be released in public preview soon. Access to the private preview is already closed for new users, even though some Microsoft material still hints that you could apply for it.
You can already find information about the new features in the documentation and other material. The confusing part is that you cannot find them in the portal yet if you are not part of the private preview. This makes it really hard for new users to understand what is actually available right now and what is not.
A good starting point for information on the current situation and the features of both versions can be found here:
Blog post Azure SQL Data Warehouse is now Azure Synapse Analytics
SQL DW documentation
Synapse new features documentation
Microsoft has made the release of this update very confusing. I assume they wanted to communicate early, at Ignite 2019, that they would have a competitive offering coming. Compared to some other cloud-native data warehousing solutions, the old version of Azure DW was clearly behind in many areas, e.g. in flexible scalability. The new Synapse Analytics capabilities look good and could bring Microsoft back into the lead in this area.
I'm working on backup and recovery for Data Lake Store. In a nutshell, we need to back up one Data Lake Store to another. I've chosen AdlCopy for that purpose (if you want to know why, check out my previous post: Backup of Data Lake Store).
According to https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-best-practices#resiliency-considerations, AdlCopy supports orchestration through either Azure Automation or Windows Task Scheduler. I'm more keen on using Azure Automation, however. Can someone help clarify how I'm supposed to use Azure Automation to run AdlCopy on a schedule? Do I need a VM? AdlCopy only supports Windows 10, and I can't figure out how Azure Automation will help me achieve a serverless approach (without Data Factory, if possible).
If you are going to have scheduled copies, it is best to do them using Azure Data Factory (ADF). AdlCopy works great for quick one-off transfers of data, but for scheduled copies that need full monitoring support, built-in retries, etc., ADF will be best. If there are reasons you cannot use ADF, please do let us know.
Thanks,
Sachin Sheth,
Program Manager, Azure Data Lake.
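If you do stay with AdlCopy on a Windows machine (for example driven by Task Scheduler or an Azure Automation Hybrid Runbook Worker on a VM), a minimal wrapper script might look like the sketch below. The executable path, account URIs, ADLA account and flags are placeholders; check the AdlCopy documentation for the exact syntax of your version.

```python
# Sketch: invoke AdlCopy from a scheduled script to copy one Data Lake Store to another.
# The executable path, adl:// URIs and the /Account value (ADLA account used to scale
# the copy) are placeholders.
import subprocess

ADLCOPY = r"C:\Users\backup\Documents\AdlCopy\AdlCopy.exe"

cmd = [
    ADLCOPY,
    "/Source", "adl://sourcelake.azuredatalakestore.net/data/",
    "/Dest", "adl://backuplake.azuredatalakestore.net/data/",
    "/Account", "myadlaaccount",   # run the copy as an ADLA job for parallelism
    "/Units", "2",
]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
result.check_returncode()   # fail the scheduled job if AdlCopy returned an error
```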
My requirement is to migrate data from a Teradata database to a Google BigQuery database, where the table structure and schema remain unchanged. Later, using the BigQuery database, I want to generate reports.
Can anyone suggest how I can achieve this?
I think you should try TDCH to export the data to Google Cloud Storage in Avro format. TDCH runs on top of Hadoop and exports data in parallel. You can then import the data from the Avro files into BigQuery.
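The BigQuery side of that flow is a standard load job from Cloud Storage. A minimal sketch with the google-cloud-bigquery Python client; the project, dataset, table and bucket names are placeholders:

```python
# Sketch: load the Avro files exported to GCS into a BigQuery table.
# Avro files carry their own schema, so no explicit schema definition is needed here.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")           # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-migration-bucket/teradata_export/orders/*.avro",   # placeholder GCS path
    "my-project.staging.orders",                                 # placeholder destination table
    job_config=job_config,
)
load_job.result()   # wait for the load job to complete
print(client.get_table("my-project.staging.orders").num_rows, "rows loaded")
```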
I was part of a team that addressed this issue in a whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try to use Cloud Composer (Apache Airflow) or install Apache Airflow on an instance.
If you can open the ports from the Teradata DB, you can run the 'gsutil' command from there and schedule it via Airflow/Composer to run the jobs on a daily basis. It's quick, and you can leverage the scheduling capabilities of Airflow.
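A minimal Composer/Airflow sketch of that idea: a daily DAG whose task shells out to gsutil to push exported files to Cloud Storage. The local export path and bucket are placeholders, and the Teradata export step itself is assumed to have produced those files already:

```python
# Sketch: daily Airflow DAG that copies exported Teradata files to GCS with gsutil.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="teradata_export_to_gcs",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    upload_to_gcs = BashOperator(
        task_id="upload_to_gcs",
        bash_command=(
            "gsutil -m cp /data/teradata_export/{{ ds }}/*.csv "   # placeholder export path
            "gs://my-migration-bucket/teradata/{{ ds }}/"           # placeholder bucket
        ),
    )
```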
BigQuery introduced the Migration Service, which is a comprehensive solution for migrating a data warehouse to BigQuery. It includes free-to-use tools that help with each phase of the migration, from assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro
We are using Microsoft Azure U-SQL for database testing.
Can anyone please provide the ODBC/JDBC connection details for U-SQL?
U-SQL is currently only available in Azure Data Lake in a batch-job form factor. This means that there is currently no ODBC/JDBC connectivity available, since U-SQL does not give you the ability to pass results directly to such providers.
So the programmatic way to submit U-SQL jobs is to use any of the available SDKs (Java, C#, PowerShell, Node.js, Python once available) to submit the job and then download the generated files as results.
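As an illustration of that pattern, here is a minimal sketch using the (now legacy) azure-mgmt-datalake-analytics Python SDK to submit a U-SQL script and wait for it to end; the service principal details and ADLA account name are placeholders, and downloading the output file from the store afterwards (e.g. with azure-datalake-store) is a separate step:

```python
# Sketch: submit a U-SQL job via the Python SDK and poll until it finishes.
# Packages: azure-mgmt-datalake-analytics; credentials and account name are placeholders.
import time
import uuid

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, JobState, USqlJobProperties

credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<secret>", tenant="<tenant-id>")
job_client = DataLakeAnalyticsJobManagementClient(credentials, "azuredatalakeanalytics.net")

script = ('@rows = SELECT * FROM (VALUES ("Contoso", 1500.0)) AS D(customer, amount); '
          'OUTPUT @rows TO "/output/result.csv" USING Outputters.Csv();')

job_id = str(uuid.uuid4())
job_client.job.create(
    "myadlaaccount", job_id,
    JobInformation(name="sample-job", type="USql",
                   properties=USqlJobProperties(script=script)))

# Poll for completion; afterwards, download /output/result.csv from the store as the result.
while job_client.job.get("myadlaaccount", job_id).state != JobState.ended:
    time.sleep(5)
```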
We are working on an interactive form factor for U-SQL as well, but that is still a bit out.