How can I get download table from Databricks on my local machine if I don't have select privileges - blob

I have a client Databricks workspace where there are some Databricks SQL tables, I need to get these tables to my blob storage. But because of restricted privileges in the databricks workspace I can't write the tables from my spark data-frames to blob, and I also don't have access to DBFS on databricks.
I can display and download the small tables, but there are some tables that have 200 million records, how can I download these tables to my local machine?

Related

Presto query engine with Azure Data Lake

I have a requirement to deploy a presto server which can help me query data stored in ADLS in Avro file formats.
I have gone through this tutorial and it seems that the Hive is used as a catalogue/connector in presto to query from ADLS. Can I bypass Hive and have any connector to extract data from ADLS?
Can I bypass Hive and have any connector to extract data from ADLS?
No.
Hive here plays two roles here:
storage for metadata. It contains information like:
schema and table name
columns
data format
data location
execution
it is capable to read data from (HDFS) distributed file systems (like HDFS, S3, ADLS)
it tells how execution can be distributed.

Azure SQL External table of Azure Table storage data

Is it possible to create an external table in Azure SQL of the data residing in Azure Table storage?
Answer is no.
I am currently facing similiar issue and this is my research so far:
Azure SQL Database doesn't allow Azure Table Storage as a external data source.
Sources:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=sql-server-2017
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
Reason:
The possible data source scenarios are to copy from Hadoop (DataLake/Hive,..), Blob (Text files,csv) or RDBMS (another sql server). The Azure Table Storage is not listed.
The possible external data formats are only variations of text files/hadoop: Delimited Text, Hive RCFile, Hive ORC,Parquet.
Note - even copying from blob in JSON format requires implementing custom data format.
Workaround:
Create a copy pipeline with Azure Data Factory.
Create a copy
function/script with Azure Functions using C# and manually transfer
the data
Yes, there are a couple options. Please see the following:
CREATE EXTERNAL TABLE (Transact-SQL)
APPLIES TO: SQL Server (starting with 2016) Azure SQL Database Azure SQL Data Warehouse Parallel Data Warehouse
Creates an external table for PolyBase, or Elastic Database queries. Depending on the scenario, the syntax differs significantly. An external table created for PolyBase cannot be used for Elastic Database queries. Similarly, an external table created for Elastic Database queries cannot be used for PolyBase, etc.
CREATE EXTERNAL DATA SOURCE (Transact-SQL)
APPLIES TO: SQL Server (starting with 2016) Azure SQL Database Azure SQL Data Warehouse Parallel Data Warehouse
Creates an external data source for PolyBase, or Elastic Database queries. Depending on the scenario, the syntax differs significantly. An external data source created for PolyBase cannot be used for Elastic Database queries. Similarly, an external data source created for Elastic Database queries cannot be used for PolyBase, etc.
What is your use case?

Querying ORC data in Hive copied from separate environment

I'm using Azure HDInsights, Azure Data Lake and Hive via Ambari.
I'm setting up a test environment. The original environment's data is stored on Azure Data Lake, in the form of ORC files loaded via Hive. I copied all the data from the original Data Lake to the test Data Lake via Data Factory successfully.
When I try to create my Hive ORC tables in the test environment and then query them no records are returned. Schema/Folder locations on the respective data lakes are the same, am I missing something related to the metastore since it's a different one on test?
Edit: I want to add that I set up an external table to the Test environment's Data Lake in SQL Datawarehouse using Polybase and that is able to read the data just fine.
As chemikadze mentioned, running MSCK REPAIR TABLE <your-table> fixed it. My tables were partitioned and so the metastore didn't know to look in certain sub-folders for locating the data.
The following pattern now helps me accomplish an environment duplication:
Create Data Factory Pipeline to copy Data Lake folders from Dev -> Test.
Run Hive DDL on Test environment.
Run repair table command on each of the partitioned tables created in Test environment.

Tools/utility to migrate mainframe db2 data into aws rds-postgres

Can you please let me know the tools/document which I can use/refer to migrate data residing on mainframe db2 system to aws rds (postgres).
Example -
I have a table DB2 Employee with columns Id as Integer and Name as Char (30) with 100 records so how do migrate/convert DDL into postgres format and load data into postgres table.
I just browse through 'AWS Database Migration Service' but I don't see it support to migrate source db2 data to target aws rds or do we need to upload data on s3 and then load to rds.
Thanks!
To my knowledge AWS's Data Migration Service does not currently offer migration from a DB2 source RDBMS platform.
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html

How to ensure faster response time using transact-SQL in Azure SQL DW when I combine data from SQL and non-relational data in Azure blob storage?

What should I do to ensure optimal query performance using transact-SQL in Azure SQL Data Warehouse while combining data sets from SQL and non-relational data in Azure Blob storage? Any inputs would be greatly appreciated.
The best practice is to load data from Azure Blob Storage into SQL Data Warehouse instead of attempting interactive queries over that data.
The reason is that when you run a query against your data residing in Azure Blob Storage (via an external table), SQL Data Warehouse (under-the-covers) imports all the data from Azure Blob Storage into SQL Data Warehouse temporary tables to process the query. So even if you a run SELECT TOP 1 query on your external table, the entire dataset for that table will be imported temporarily to process the query.
As a result, if you know that you will querying the external data frequently, it is recommended that you explicitly load the data into SQL Data Warehouse permanently using a CREATE TABLE AS SELECT command as shown in the document: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-load-with-polybase/.
As a best practice, break your Azure Storage data into no more than 1GB files when possible for parallel processing with SQL Data Warehouse. More information about how to configure Polybase in SQL Data Warehouse to load data from Azure Storage Blob is here: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-load-with-polybase/
Let me know if that helps!