Azure Databricks storage or data lake

I'm creating a structured streaming job that stores its data in a Databricks Delta database. I'm confronted with the option of storing the checkpoint location and the data from the Delta database in either:
1. a normal DBFS location like "/delta/mycheckpointlocation" and "/delta/mydatabase"
2. a mounted directory from a data lake like "/mnt/mydatalake/delta/mycheckpointlocation" and "/mnt/mydatalake/delta/mydatabase"
If I understand correctly, the data in option 1 will be persisted in blob storage, while the data in option 2 will be stored in the data lake (assuming it's mounted on /mnt/mydatalake).
What considerations are there when deciding whether to store things like the checkpoint location and the Delta database in option 1 or option 2?

The DBFS location is part of your workspace, so if you drop the workspace you lose it.
The lake is shared, so many things can connect to it, including other Databricks workspaces or other services (like ADF).
There is no right or wrong to this; it is purely preference.
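For illustration, here is a minimal PySpark sketch of a streaming write that keeps both the checkpoint and the Delta data on the mounted lake path (option 2). It assumes a Databricks notebook where spark is already defined and the lake is mounted at /mnt/mydatalake; the rate source is just a stand-in for your real input stream.

# stand-in streaming source; replace with your real input stream
stream_df = spark.readStream.format("rate").load()

query = (stream_df.writeStream
    .format("delta")
    .outputMode("append")
    # both paths live on the mounted data lake, so they outlive the workspace
    .option("checkpointLocation", "/mnt/mydatalake/delta/mycheckpointlocation")
    .start("/mnt/mydatalake/delta/mydatabase/mytable"))

Pointing the same two paths at "/delta/..." instead would keep everything in the workspace-managed DBFS root.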


Is it possible to export a single table from Azure sql database, then directly save into azure blob storage without downloading it to local disk

I've tried using SSMS, but it requires a local temporary location for the BACPAC file. I don't want to download anything locally; I would like to export a single table directly to Azure Blob storage.
In my experience, we can import table data from a CSV file stored in Blob Storage, but I didn't find a way to export table data to Blob Storage as a CSV file directly.
You could consider using Data Factory.
It can achieve that; please reference the tutorials below:
Copy and transform data in Azure SQL Database by using Azure Data Factory
Copy and transform data in Azure Blob storage by using Azure Data Factory
Use Azure SQL Database as the source and choose the table as the source dataset.
Use Blob storage as the sink and choose DelimitedText as the sink format.
Run the pipeline and you will get the CSV file in Blob Storage.
Also, thanks to Hemant Halwai for the tutorial he provided.
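If you prefer a scripted alternative to the Data Factory route described above (this is not what the answer uses, just an illustration), a rough sketch is to read the table into memory with pandas and upload the CSV straight to Blob Storage. The server, credentials, table, container, and blob names below are placeholders, and the data still passes through the memory of the machine running the script, so this only suits tables that fit in memory.

import pyodbc
import pandas as pd
from io import StringIO
from azure.storage.blob import BlobClient

# placeholder connection details
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myserver.database.windows.net;Database=mydb;Uid=myuser;Pwd=mypassword"
)

# read the single table into memory (no temporary file on disk)
df = pd.read_sql("SELECT * FROM dbo.MyTable", conn)

# serialize to CSV in memory and upload directly to Blob Storage
buf = StringIO()
df.to_csv(buf, index=False)
blob = BlobClient.from_connection_string(
    "<storage-account-connection-string>",
    container_name="exports",
    blob_name="MyTable.csv",
)
blob.upload_blob(buf.getvalue(), overwrite=True)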

Databricks - save changes back to Data Lake (ADLS Gen2)

I have legacy data stored as CSV in an Azure Data Lake Gen2 storage account. I'm able to connect to this and interrogate it using Databricks. I have a requirement to remove certain records once their retention period expires, or if a GDPR "right to be forgotten" needs applying to the data.
Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet files to the Data Lake, but it would be nice to keep it in the original CSV format.
Any advice appreciated.
I'd be careful here. Right to be forgotten means you need to delete the data, and Delta doesn't actually delete it from the original file (initially at least); that only happens once the data is vacuumed.
The safest way to delete data is to read all the data into a dataframe, filter out the records you do not want, and then write it back using overwrite. This ensures the data is removed and the same structure is rewritten.
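A rough sketch of that read-filter-overwrite approach (the paths and the filter predicate are hypothetical; spark and dbutils are the ones available in a Databricks notebook):

src = "/mnt/mydatalake/legacy/customers.csv"     # hypothetical source folder
tmp = "/mnt/mydatalake/legacy/customers_tmp"     # hypothetical staging folder

df = spark.read.option("header", "true").csv(src)
kept = df.filter(df["customer_id"] != "12345")   # hypothetical "forget this person" predicate

# write to a staging path first: overwriting the path you are still reading from
# can lose data because Spark evaluates lazily
kept.write.mode("overwrite").option("header", "true").csv(tmp)

# swap the cleaned output in place of the original (recursive rm/mv)
dbutils.fs.rm(src, True)
dbutils.fs.mv(tmp, src, True)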
Convert Parquet to CSV in ADF
The versioned parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.
So you could read the CSV data into a Delta table (with the location pointing to a Data Lake folder), perform the required changes using SQL, and then convert the Parquet files back to CSV format using ADF.
I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file into Delta. You could read it into a dataframe and create a Delta table from it, as sketched below.
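A minimal sketch of that flow, with hypothetical paths and table name: read the CSV with headers into a dataframe, persist it as a Delta table at a lake location, then delete with SQL.

csv_path = "/mnt/mydatalake/legacy/customers.csv"   # hypothetical CSV location
delta_path = "/mnt/mydatalake/delta/customers"      # hypothetical Delta location

# header detection happens here, on the dataframe read
df = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_path)
df.write.format("delta").mode("overwrite").save(delta_path)

# register the location as a table so it can be changed with SQL
spark.sql(f"CREATE TABLE IF NOT EXISTS customers USING DELTA LOCATION '{delta_path}'")
spark.sql("DELETE FROM customers WHERE customer_id = '12345'")

The ADF Copy Data task can then convert the resulting Parquet files back to CSV when required.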
If you are running the delete operations periodically, it is costly to keep the data in CSV: every time you would read the file, transform the dataframe to Delta, query it, and after filtering the records save it back to CSV and delete the Delta table.
So my suggestion here would be to transform the CSV to Delta once, perform the deletes periodically, and generate a CSV only when it's needed.
The advantage is that Delta internally stores data in Parquet, a binary format that allows better compression and encoding/decoding of the data.
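In other words (a sketch with hypothetical names; the Delta table and its location are assumed to be registered as in the previous answer, and retention_expiry is an assumed column):

# periodic retention / right-to-be-forgotten clean-up runs against Delta only
spark.sql("DELETE FROM customers WHERE retention_expiry < current_date()")

# generate a CSV copy only when one is actually requested
(spark.read.format("delta").load("/mnt/mydatalake/delta/customers")
    .write.mode("overwrite").option("header", "true")
    .csv("/mnt/mydatalake/export/customers_csv"))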

Presto query engine with Azure Data Lake

I have a requirement to deploy a presto server which can help me query data stored in ADLS in Avro file formats.
I have gone through this tutorial and it seems that Hive is used as a catalogue/connector in Presto to query from ADLS. Can I bypass Hive and have any connector to extract data from ADLS?
Can I bypass Hive and have any connector to extract data from ADLS?
No.
Hive plays two roles here:
1. Metadata storage. The Hive metastore contains information like the schema and table name, the columns, the data format, and the data location.
2. Execution. The Hive connector is capable of reading data from distributed file systems (like HDFS, S3, ADLS) and it tells Presto how execution can be distributed.
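To make the "Hive as catalog, Presto as engine" split concrete, queries are still issued against the Hive catalog even though Presto does the reading and execution. A minimal sketch using the presto-python-client package, assuming the catalog is configured under the name hive; the host, schema, and table names are placeholders:

import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",   # placeholder coordinator host
    port=8080,
    user="analyst",
    catalog="hive",       # the Hive connector, backed by the metastore
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT * FROM my_avro_table LIMIT 10")   # placeholder Avro-backed table
print(cur.fetchall())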

How to take the backup of a dataset in BigQuery?

We want to create a backup copy of a BigQuery dataset in case a table is accidentally dropped, as it is only recoverable within 7 days.
Is there a way to extend the duration of the recovery period? If not, how can we create a backup of a dataset with a retention period of 30 days in BigQuery?
It is currently not possible to extend the duration of the recovery period. A feature request for this ability has already been created, as Katayoon commented.
Here is a public link to monitor the progress on that issue: https://issuetracker.google.com/120038872
To back up datasets in BigQuery you could either make copies of your dataset or, as a more workable solution, export the data to Cloud Storage so you can import it back at a later time. Cloud Storage lets you set a retention period and a lifecycle policy, which together ensure that the data stays undisturbed for the desired amount of time and that it removes itself after a given time, should you wish to save on storage costs.
How to export from BigQuery:
You can export tables as Avro, JSON, or CSV files to Cloud Storage via the web UI, the command line, the API, or client libraries for languages like C#, Go, Python, and Java, as long as the dataset and the Cloud Storage bucket are in the same location. There are other limitations to exporting a table, such as file size, integer encoding, and data compression.
Link to table export and limitations:
https://cloud.google.com/bigquery/docs/exporting-data
You can find the instructions on the procedures here:
Retention Policies and Bucket Lock: https://cloud.google.com/storage/docs/using-bucket-lock#lock-bucket
Object Lifecycle Management: https://cloud.google.com/storage/docs/managing-lifecycles
Loading data into BigQuery can be done from various file formats, such as CSV, JSON, Avro, Parquet, or ORC. At the moment you can load directly only from local storage or from Google Cloud Storage. More on loading data, file formats, data sources, and limitations: https://cloud.google.com/bigquery/docs/loading-data
More information:
Exporting tables: https://cloud.google.com/bigquery/docs/exporting-data
Export limitations: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Loading data into BigQuery: https://cloud.google.com/bigquery/docs/loading-data
Wildcards: https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames
Merging the file: https://cloud.google.com/storage/docs/gsutil/commands/compose
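For example, the table export described above can be scripted with the google-cloud-bigquery client library; the project, dataset, table, and bucket names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO
)
# the table and the bucket must be in the same location for the export to succeed
extract_job = client.extract_table(
    "myproject.mydataset.mytable",
    "gs://my-backup-bucket/mytable-*.avro",
    job_config=job_config,
)
extract_job.result()   # wait for the export job to finish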
You can take a snapshot of a table using either SQL or CLI:
SQL
CREATE SNAPSHOT TABLE
myproject.library_backup.books
CLONE myproject.library.books
OPTIONS(expiration_timestamp = TIMESTAMP "2022-04-27 12:30:00.00-08:00")
CLI
bq cp --snapshot --no_clobber --expiration=86400 library.books library_backup.books
You can backup and restore using the tools in https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/bigquery_backup:
Backup a table to GCS
./bq_backup.py --input dataset.tablename --output gs://BUCKET/backup
This saves a schema.json, a tabledef.json, and extracted data in AVRO format to GCS.
You can also back up all the tables in a dataset:
./bq_backup.py --input dataset --output gs://BUCKET/backup
Restore tables one by one by specifying a destination dataset:
./bq_restore.py --input gs://BUCKET/backup/fromdataset/fromtable --output destdataset
For views, the backup stores the view definition and the restore creates a view.

Querying ORC data in Hive copied from separate environment

I'm using Azure HDInsights, Azure Data Lake and Hive via Ambari.
I'm setting up a test environment. The original environment's data is stored on Azure Data Lake, in the form of ORC files loaded via Hive. I copied all the data from the original Data Lake to the test Data Lake via Data Factory successfully.
When I try to create my Hive ORC tables in the test environment and then query them, no records are returned. The schema and folder locations on the respective data lakes are the same. Am I missing something related to the metastore, since it's a different one on test?
Edit: I want to add that I set up an external table pointing to the test environment's Data Lake in SQL Data Warehouse using PolyBase, and that is able to read the data just fine.
As chemikadze mentioned, running MSCK REPAIR TABLE <your-table> fixed it. My tables were partitioned and so the metastore didn't know to look in certain sub-folders for locating the data.
The following pattern now helps me duplicate an environment:
1. Create a Data Factory pipeline to copy the Data Lake folders from Dev to Test.
2. Run the Hive DDL on the Test environment.
3. Run the repair table command on each of the partitioned tables created in the Test environment (a short sketch follows below).
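A small sketch of step 3, assuming the Hive tables are reachable from a Spark session on the HDInsight cluster (the same MSCK statement also works from Beeline or the Hive view in Ambari); the table names are placeholders.

# after the DDL exists on Test, register the copied partition folders in the metastore
partitioned_tables = ["mydb.sales", "mydb.events"]   # placeholder table names

for table in partitioned_tables:
    spark.sql(f"MSCK REPAIR TABLE {table}")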