How to get the number of records in a Data Lake file with the Logic Apps Data Lake connector? - azure-data-lake

I need to read the data from an Azure database table and upload it to a Data Lake file using a logic app. Once the upload is complete, I need to get the number of records in the Data Lake file. Does Logic Apps have any expressions or built-in methods to get the number of records in a Data Lake file?

At the Data Lake Store level, there is no notion of records. You can query information about the file, such as its length in bytes. Records are a concept interpreted by the application that reads the file, based on the data format (CSV, JSON, etc.) and the delimiter that applies.
You will need to count the records as a separate step, before or after saving the file.
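As an illustration of such a separate step, here is a minimal sketch that counts records in CSV text with Python's standard csv module. The function name count_csv_records and the sample data are hypothetical. Where the step runs (for example, in a function called from the logic app) is an assumption, not something the connector provides.

```python
import csv
import io


def count_csv_records(text, has_header=True):
    """Count data records in CSV text, ignoring blank lines.

    The count is an interpretation of the bytes: it depends on the
    delimiter and quoting rules, not on anything the store knows.
    """
    # newline="" lets the csv module handle line breaks inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""))
    # csv.reader yields an empty list for a blank line; skip those.
    rows = [row for row in reader if row]
    if has_header and rows:
        return len(rows) - 1
    return len(rows)


# Hypothetical sample: one header, three records, one blank line,
# and one quoted field that spans two physical lines.
sample = 'id,name\n1,Alice\n2,"multi\nline"\n\n3,Carol\n'
print(count_csv_records(sample))
```

Counting physical lines instead would over-count here, because of the blank line and the quoted field that contains a line break.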

Related

Azure Data Factory Copy Activity for JSON to Table in Azure SQL DB

I have a copy activity that takes a bunch of JSON files and merges them into a single JSON file.
I would now like to copy the merged JSON to Azure SQL DB. Is that possible?
OK, it appears to be working; however, the output in SQL contains only countryCode and CompanyId.
I need to retrieve all the financial information in the JSON as well.
I reproduced this; the steps are below.
Two JSON files are taken as the source.
Those files are merged into a single file using a copy activity.
The merged JSON data is then taken as the source dataset in another copy activity.
In the sink, a dataset for Azure SQL DB is created and the Auto create table option is selected.
In the sink dataset, the Edit checkbox is selected and the sink table name is given.
Once the pipeline has run, the data is copied to the table.

Automatic ETL data before loading to Bigquery

I have CSV files added to a GCS bucket daily or weekly; each file name contains a date and a specific parameter.
The files have the schema (id, name), and we need to automatically load these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't pass the date and specific parameter from the file name to Dataflow.
We also tried a Cloud Function (where we can get the date and specific parameter from the file name), but couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article on this kind of problem using Cloud Workflows, for cases where you want to extract parts of the filename to use later in the table definition.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) Then, for each file, we will use parts of the filename in BigQuery’s generated table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
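Separately from the workflow approach, the filename-parsing step that the question describes can be sketched in plain Python. This is not the article's code. The naming scheme `<YYYY-MM-DD>_<parameter>.csv`, the column names, and both function names are assumptions made for illustration. The result is CSV text with the two extra columns already appended, which could then be loaded as an ordinary four-column file.

```python
import csv
import io
import re

# Hypothetical naming scheme: <YYYY-MM-DD>_<parameter>.csv
FILENAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<param>[^/]+)\.csv$"
)


def parse_filename(object_name):
    """Return (date, parameter) taken from the last path segment."""
    base = object_name.rsplit("/", 1)[-1]
    match = FILENAME_PATTERN.match(base)
    if match is None:
        raise ValueError(f"unexpected file name: {object_name!r}")
    return match.group("date"), match.group("param")


def add_filename_columns(object_name, csv_text):
    """Append the date and parameter from the file name to every row."""
    file_date, param = parse_filename(object_name)
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for row in reader:
        if not row:
            continue  # skip blank lines
        if not header_written:
            # First non-blank row is the header (id, name).
            writer.writerow(row + ["date", "specific_parameter"])
            header_written = True
        else:
            writer.writerow(row + [file_date, param])
    return out.getvalue()


# Hypothetical object name and contents.
enriched = add_filename_columns(
    "incoming/2023-01-15_eu.csv", "id,name\n1,Alice\n2,Bob\n"
)
print(enriched)
```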

Loading 50GB CSV File Azure Blob to Azure SQL DB in Less time- Performance

I am loading a 50 GB CSV file from Azure Blob Storage to Azure SQL DB using OPENROWSET.
It takes 7 hours to load this file.
Can you please help me with possible ways to reduce this time?
The easiest option IMHO is to just use BULK INSERT. Move the CSV file into a Blob Store and then import it directly using BULK INSERT from Azure SQL. Make sure Azure Blob Storage and Azure SQL are in the same Azure region.
To make it as fast as possible:
Split the CSV into more than one file (for example with a CSV splitter; this one looks nice: https://www.erdconcepts.com/dbtoolbox.html. I have never tried it and found it with a simple search, but it looks good).
Run several BULK INSERT statements in parallel using the TABLOCK option (https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-2017#arguments). If the target table is empty, this allows multiple concurrent bulk operations to run in parallel.
Make sure you are using a higher SKU for the duration of the operation. Depending on the SLO (Service Level Objective) you're using (S4? P1? vCore?), you will get a different amount of log throughput, up to close to 100 MB/s. That's the maximum speed you can actually achieve. (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-resource-limits-database-server)
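The split-and-parallelize idea above can be sketched as follows. This is only an illustration: the table name dbo.Prices, the external data source name MyBlobSource, and the part file names are hypothetical, and the sketch assumes an external data source for the Blob container already exists. The splitter works on lines, so it also assumes no quoted field contains a line break.

```python
import itertools


def split_csv_lines(lines, rows_per_chunk):
    """Yield lists of lines; every chunk starts with the header line.

    Assumes one record per physical line (no embedded line breaks).
    """
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return
    while True:
        chunk = list(itertools.islice(it, rows_per_chunk))
        if not chunk:
            break
        yield [header] + chunk


def bulk_insert_statement(table, blob_path, data_source):
    """Build one BULK INSERT statement for one part file.

    FIRSTROW = 2 skips the header repeated in each part; TABLOCK is the
    option the answer recommends for concurrent loads.
    """
    return (
        f"BULK INSERT {table} FROM '{blob_path}' "
        f"WITH (DATA_SOURCE = '{data_source}', FORMAT = 'CSV', "
        f"FIRSTROW = 2, TABLOCK);"
    )


# Hypothetical tiny input standing in for the 50 GB file.
lines = ["id,price\n", "1,10\n", "2,20\n", "3,30\n"]
chunks = list(split_csv_lines(lines, rows_per_chunk=2))
statements = [
    bulk_insert_statement("dbo.Prices", f"part_{i:03d}.csv", "MyBlobSource")
    for i in range(len(chunks))
]
for statement in statements:
    print(statement)
```

Each generated statement would then be run on its own database connection so the loads proceed concurrently; uploading the parts and running the statements is left out of the sketch.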
Please try using Azure Data Factory.
First create the destination table on Azure SQL Database, let's call it USDJPY. After that upload the CSV to an Azure Storage Account. Now create your Azure Data Factory instance and choose Copy Data.
Next, choose "Run once now" to copy your CSV files.
Choose "Azure Blob Storage" as your "source data store" and specify the Azure Storage account where you stored the CSV files.
Provide information about Azure Storage account.
Choose your CSV files from your Azure Storage.
Choose "Comma" as the delimiter for your CSV files, and enter a "Skip line count" value if your CSV files have header lines.
Choose "Azure SQL Database" as your "destination data store".
Type your Azure SQL Database information.
Select your table from your SQL Database instance.
Verify the data mapping.
Run the copy from the CSV files to SQL Database by confirming the remaining wizard pages.

DataBricks - save changes back to DataLake (ADLS Gen2)

I have legacy data stored as CSV in an Azure DataLake Gen2 storage account. I'm able to connect to this and interrogate it using DataBricks. I have a requirement to remove certain records once their retention period expires, or if a GDPR "right to be forgotten" needs applying to the data.
Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet format files to the DataLake but it would be nice to keep it in the original CSV format.
Any advice appreciated.
I'd be careful here. Right to be forgotten means you need to delete the data. Delta doesn't actually delete it from the original file (initially at least) - this will only happen once the data is vacuumed.
The safest way to delete data is to read all the data into a dataframe, filter out the records you do not want, and then write it back using overwrite. This ensures the data is removed and the same structure is rewritten.
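The read, filter, overwrite pattern can be sketched with Python's standard csv module standing in for the dataframe, so the example runs without a Spark cluster. This is not Databricks code. The function name remove_records and the predicate are hypothetical, and the sketch assumes the CSV has a header row and fits a row-by-row rewrite.

```python
import csv
import os
import tempfile


def remove_records(path, should_delete):
    """Rewrite the CSV at `path` without the rows to be forgotten.

    `should_delete` receives each row as a dict and returns True for
    rows that must be removed. Returns the number of rows removed.
    """
    directory = os.path.dirname(os.path.abspath(path))
    removed = 0
    with open(path, newline="", encoding="utf-8") as src, \
            tempfile.NamedTemporaryFile(
                "w", newline="", encoding="utf-8",
                dir=directory, suffix=".tmp", delete=False,
            ) as dst:
        reader = csv.DictReader(src)
        # Reuse the original header so the structure is unchanged.
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if should_delete(row):
                removed += 1
            else:
                writer.writerow(row)
        tmp_name = dst.name
    # Swap the filtered copy over the original only after it is complete.
    os.replace(tmp_name, path)
    return removed
```

Writing to a temporary file first and replacing the original only at the end means a failure part-way through does not leave a half-written file in place of the original.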
Convert Parquet to CSV in ADF
The versioned parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.
So, you could read the CSV data into a Delta table (with its location pointing to a Data Lake folder), perform the required changes using SQL, and then convert the Parquet files to CSV format using ADF.
I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file to Delta. You could read it to a dataframe and create a Delta table from it.
If you are running the delete operations periodically, then saving the file as CSV each time is costly: every run reads the file, transforms the dataframe to Delta, queries it, and, after filtering the records, saves it to CSV again and deletes the Delta table.
So my suggestion would be to transform the CSV to Delta once, perform deletes periodically, and generate the CSV only when it's needed.
The advantage is that Delta internally stores data in Parquet, a binary format that allows better compression and more efficient encoding and decoding of data.

Formatting data ingested into Azure SQL Database

Currently I'm importing a CSV file into an Azure SQL database automatically each morning at 3 am, but the file contains several blank lines that are imported as rows, which are cleaned up after the data is ingested.
There isn't a way to correct the file prior to ingestion, so I need to transform the data once it's been ingested and would like to avoid having to do this manually.
Is using something like Azure Data Factory the best approach to doing this? Or is there a less expensive / simpler way to simply remove blank lines via something akin to a stored procedure for Azure SQL Database?