We are using Microsoft Azure U-SQL for database testing.
Can anyone please provide the ODBC/JDBC connection details for U-SQL?
U-SQL is currently only available in Azure Data Lake in a batch-job form factor. This means that there is currently no ODBC/JDBC connectivity, since a batch job cannot pass results directly to such providers.
So the programmatic way to submit U-SQL jobs is to use one of the available SDKs (Java, C#, PowerShell, Node.js, Python once available) to submit the job and then download the generated files as results.
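For example, a rough sketch of the Python route might look like the following (using the azure-mgmt-datalake-analytics package; the account name, credentials, and script are placeholders, and exact class names can vary between SDK versions):

```python
# Sketch: submit a U-SQL batch job with the Python SDK
# (azure-mgmt-datalake-analytics). Account, tenant, and app values
# are placeholders; class/method names may differ by SDK version.
import uuid
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<app-secret>", tenant="<tenant-id>")

# Job management client for <account>.azuredatalakeanalytics.net
job_client = DataLakeAnalyticsJobManagementClient(
    credentials, "azuredatalakeanalytics.net")

script = """
@rows =
    EXTRACT name string, value int
    FROM "/input/data.csv"
    USING Extractors.Csv();
OUTPUT @rows TO "/output/result.csv" USING Outputters.Csv();
"""

job_info = JobInformation(
    name="sample-usql-job",
    type="USql",
    properties=USqlJobProperties(script=script))

# Submit the batch job; poll its state, then download /output/result.csv
# from the store (e.g. with the Data Lake Store filesystem SDK).
job_client.job.create("myadlaaccount", str(uuid.uuid4()), job_info)
```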
We are working on an interactive form factor for U-SQL as well, but that is still a bit out.
We would like to use the Podio API Key to directly connect to Tableau and have the data refreshed at a cadence set in Tableau. Is this possible?
Yes, connecting to data via an API is possible and there are a few ways to do it:
Option #1: Web Data Connector
A WDC is a hosted web application built with JavaScript that connects to an API, converts the data to JSON, and passes the data to Tableau. You'll need a web server to host your WDC and JavaScript skills to write it. Once set up, anyone in your org can just grab the link and use it in Desktop. With WDCs, since the data connection is made when the end user requests the WDC, you can build in customizations for your users (e.g., users can add filter parameters or authenticate with their own username/password to only get the data they have access to). WDC connections are extracts and can be refreshed on Tableau Server and Tableau Online. If you're using Tableau Online, you'll need to use Bridge to auto-refresh.
Option #2: Hyper API
The Hyper API allows you to create, modify, and update extract (.hyper) files that you can then publish to Tableau Server/Online on a regular cadence. It's available for Python, Java, .NET, and C++ so you will need skills in one of those languages. I suggest Python as we have the most samples for it. You'll also need a server where you can run the extract refresh and publish scripts on a schedule. With the Hyper API you are creating a single extract for everyone to connect to. Once published to Tableau Server/Online, end users can just connect to this data source directly without having to do any input but this also means the connection can't be customized per user or use case.
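If you go the Hyper API route, the basic flow might look like the sketch below (Python, using the tableauhyperapi package). The fetch_podio_items() helper and the column layout stand in for your actual Podio API call; publishing the finished .hyper file is typically done with the separate tableauserverclient package.

```python
# Sketch: build a .hyper extract with the Tableau Hyper API (Python).
# fetch_podio_items() and the columns below are hypothetical placeholders
# for your real Podio API call and field layout.
from tableauhyperapi import (Connection, CreateMode, HyperProcess, Inserter,
                             SqlType, TableDefinition, TableName, Telemetry)

def fetch_podio_items():
    # Placeholder: call the Podio API here and return rows as tuples.
    return [(1, "First item"), (2, "Second item")]

table = TableDefinition(
    table_name=TableName("Extract", "Extract"),
    columns=[
        TableDefinition.Column("item_id", SqlType.big_int()),
        TableDefinition.Column("title", SqlType.text()),
    ])

with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint,
                    database="podio_extract.hyper",
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        connection.catalog.create_schema(schema=table.table_name.schema_name)
        connection.catalog.create_table(table_definition=table)
        with Inserter(connection, table) as inserter:
            inserter.add_rows(rows=fetch_podio_items())
            inserter.execute()

# podio_extract.hyper can now be published to Tableau Server/Online on a
# schedule, e.g. via tableauserverclient's datasources.publish().
```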
Option #3: Use a 3rd-party connector
If building your own connector doesn't appeal to you, there are also plenty of services out there you can pay for that can bring your data into Tableau, e.g. tray.io, Dataddo, and Skyvia are a few I found after a quick Google search.
There is an option to connect to a Cloud SQL MySQL instance from BigQuery. I just wanted to know how we can connect a Cloud SQL instance running SQL Server to BigQuery.
SQL Server:
There are a number of third-party extensions/tools that provide this service. One of them is SSIS Data Flow Source & Destination for Google BigQuery, a Visual Studio extension that connects SQL Server with Google BigQuery data through SSIS workflows:
https://www.cdata.com/drivers/bigquery/ssis/
https://marketplace.visualstudio.com/items?itemName=CDATASOFTWARE.SSISDataFlowSourceDestinationforGoogleBigQuery
Regarding using SQL Server Integration Services to load the data from the on-premises SQL Server to BigQuery, you can take a look at this site. You can also perform ETL from a relational database into BigQuery using Cloud Dataflow; the official documentation details how it can be done, and you might need to use Cloud Storage as an intermediate data sink.
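For the Cloud Dataflow route, a pipeline can pick up the exported files from Cloud Storage and write them into BigQuery. A rough sketch with the Apache Beam Python SDK follows; the bucket, table, schema, and parse logic are placeholders for your own export format:

```python
# Sketch: a Cloud Dataflow (Apache Beam Python SDK) pipeline that loads a
# CSV export staged in Cloud Storage into BigQuery. Bucket, project, dataset,
# table, and schema names are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(line):
    # Placeholder parser; adapt to the columns of your real export.
    order_id, amount = line.split(",")
    return {"order_id": int(order_id), "amount": float(amount)}

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "ReadCsv" >> beam.io.ReadFromText("gs://my-bucket/sqlserver_export.csv",
                                         skip_header_lines=1)
     | "Parse" >> beam.Map(parse_csv)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.orders",
           schema="order_id:INTEGER,amount:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```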
Cloud SQL:
BigQuery allows you to query data from Cloud SQL by using federated queries. The connection must be created within the same project where your Cloud SQL instance is located. If you want to query data stored in your Cloud SQL instance from BigQuery in another project, please follow the steps listed below:
Enable the BigQuery API and the BigQuery connection API within your project.
Create a connection to your Cloud SQL instance within the project by following this documentation.
Once you have created the connection, please locate and select it within BigQuery.
Click on the SHARE CONNECTION button and grant permissions to the users that will use that connection. Please note that the BigQuery Connection User role is the only role needed to use a shared connection.
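Once shared, the connection is queried with EXTERNAL_QUERY. Here is a minimal sketch using the google-cloud-bigquery Python client; the project, location, connection ID, and table/column names are placeholders to replace with your own:

```python
# Sketch: run a federated query against a shared Cloud SQL connection with
# the google-cloud-bigquery Python client. Project, location, connection ID,
# and table/column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="consumer-project")

sql = """
SELECT id, name
FROM EXTERNAL_QUERY(
  'connection-project.us.my_cloudsql_connection',
  'SELECT id, name FROM customers;')
"""

for row in client.query(sql).result():
    print(row["id"], row["name"])
```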
Additionally, please note that the "Cloud SQL federated queries" feature is in a Beta stage and might change or have limited support (it is not available in certain regions, in which case you need to use one of the supported options mentioned here). Please remember that to use Cloud SQL federated queries in BigQuery, the instances need to have a public IP.
If you are limited, e.g. by region, one good option might be exporting the data from Cloud SQL to Cloud Storage as a CSV and then loading it into BigQuery. If needed, you can automate this process using Cloud Composer; refer to this article.
Another approach is to extract information from Cloud SQL (with exports) and import it into BigQuery through load jobs or streaming inserts.
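For the export-and-load route, you can export from Cloud SQL to Cloud Storage (for example with gcloud sql export csv) and then run a BigQuery load job. A rough sketch with the google-cloud-bigquery Python client; bucket, dataset, table, and column names are placeholders:

```python
# Sketch: load a Cloud SQL CSV export from Cloud Storage into BigQuery via a
# load job (google-cloud-bigquery Python client). Bucket, dataset, table, and
# column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)

load_job = client.load_table_from_uri(
    "gs://my-bucket/cloudsql_export.csv",
    "my-project.my_dataset.customers",
    job_config=job_config)

load_job.result()  # wait for the load to complete
table = client.get_table("my-project.my_dataset.customers")
print(f"Loaded {table.num_rows} rows")
```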
I hope you find the above pieces of information useful.
It is possible, but be warned that the feature is currently in Beta:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
Azure Data Lake Analytics and Azure Databricks can both be used for batch processing. Could anyone please help me understand when to choose one over the other?
In my humble opinion, a lot of it comes down to existing skill sets. If you have a team experienced in Spark, Java, Python, R, or Scala, then Databricks is a natural fit. If, on the other hand, you have a team with existing SQL and C# skills, then the learning curve for them with U-SQL will be less steep.
That aside, there are other questions which can drive out differences:
Do you require real-time interaction (Databricks) or batch-mode analytics (both)? There is a feedback item for real-time interactivity for U-SQL; please vote for it.
Do you want a pay-as-you-go model (U-SQL) or clusters with auto-terminate after a certain period (Databricks)?
Do you prefer working in a notebook (Databricks) or in Visual Studio / VS Code / PowerShell / the .NET SDK (U-SQL)?
Do you want to use Spark libraries like GraphX (Databricks)?
Do you want the ability to run and scale any runtime (U-SQL)? See here for more details.
Do you want a local development emulator (U-SQL)?
The U-SQL emulator in Visual Studio is seamless, i.e. you develop your code against your local drives, in the same structure as your lake (for free), then simply click the drop-down in Visual Studio to run in the cloud. Although I think you can have a local Spark environment, I'm not sure what the local (and disconnected) development experience is like for Databricks.
Are you using ADLS Gen 2 (only Databricks)? See here.
UPDATE October 2018:
As far as I am aware, U-SQL does not currently support ADLS Gen 2, which would count against it (happy to be corrected). I will update the post if and when that support is added.
UPDATE January 2019:
U-SQL has not had any meaningful updates since Spring 2018.
HTH
Databricks has more language options, which allows professionals with different skills to work on the data. Also, with Databricks you can run jobs on high-performance, in-memory clusters.
In one project, we use the data lake more as storage and do all the jobs (ETL, analytics) via Databricks notebooks. Storing data in the data lake is cheaper.
Back to your question: if you have complex batch jobs and different types of professionals will work on the data, you may choose an Azure Data Lake + Databricks architecture. Otherwise, Azure Data Lake alone would satisfy your needs.
Taking a look at these two articles would help:
https://databricks.com/glossary/data-lake
https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/
I'm working on backup and recovery for Data Lake Store. In a nutshell, we need to back up one Data Lake Store to another. I've chosen AdlCopy for that purpose (if you want to know why, check out my previous post: Backup of Data Lake Store).
According to https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-best-practices#resiliency-considerations, AdlCopy supports orchestration through either Azure Automation or Windows Task Scheduler. However, I'm more keen on using Azure Automation. Can someone help clarify how I'm supposed to use Azure Automation to run AdlCopy on a schedule? Do I need a VM? AdlCopy only supports Windows 10, and I can't figure out how Azure Automation will help me achieve a serverless approach (without Data Factory, if possible).
If you are going to have scheduled copies, it will be best to do it using Azure Data Factory (ADF). AdlCopy works great for quick one-off transfers of data, but for scheduled ones that need full monitoring support, built-in retries, etc., ADF will be best. If there are reasons you cannot use ADF, please do let us know.
Thanks,
Sachin Sheth,
Program Manager, Azure Data Lake.
Currently, for data integration jobs and transformations, users use the Kettle client to perform ETL operations through the Spoon GUI.
My question is whether there is browser-based functionality available for Kettle where users can design transformations and do job integration in the browser itself, instead of using the desktop PDI application?
No, there is no such functionality available right now.