Azure PowerShell command to get the count of records in an Azure Data Lake file - azure-powershell

I have a set of files in an Azure Data Lake Store folder. Is there a simple PowerShell command to get the count of records in a file? I would like to do this without using the Get-AzureRmDataLakeStoreItemContent command on the file item, as the files are gigabytes in size. Using this command on big files gives the error below.
Error:
Get-AzureRmDataLakeStoreItemContent : The remaining data to preview is greater than 1048576 bytes. Please specify a
length or use the Force parameter to preview the entire file. The length of the file that would have been previewed:
749319688

Azure Data Lake operates at the file/folder level. The concept of a record really depends on how an application interprets it. For instance, in one case a file may contain CSV lines, in another a set of JSON objects, and in some cases binary data. Therefore, there is no way at the file system level to get the count of records.
The best way to get this information is to submit a job, such as a U-SQL job in Azure Data Lake Analytics. The script is really simple: an EXTRACT statement followed by a COUNT aggregation and an OUTPUT statement.
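A minimal sketch of such a script, assuming the file is a CSV with a known schema (the path and column names below are placeholders):

// Placeholder path and columns - adjust to your file's layout
@data =
    EXTRACT col1 string,
            col2 string
    FROM "/mydata/bigfile.csv"
    USING Extractors.Csv();

// Count the extracted rows
@count =
    SELECT COUNT(*) AS RecordCount
    FROM @data;

// Write the single-row result to a small output file
OUTPUT @count
TO "/output/recordcount.csv"
USING Outputters.Csv();

If you want to stay in PowerShell, you can submit the script with Submit-AzureRmDataLakeAnalyticsJob and then read the tiny output file with Get-AzureRmDataLakeStoreItemContent, which is well under the preview limit.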
If you prefer Spark or Hadoop, here is a Stack Overflow question that discusses that approach: Finding total number of lines in hdfs distributed file using command line

Related

How do you query data from only the last file uploaded to Cloud Storage with BigQuery

Every day I upload a new file to a Cloud Storage bucket. The file is stored in JSON-NL (newline-delimited JSON) format. I have a BigQuery table (set up as an external table) connected to this bucket. Each file is named with the date of its upload. If I want to query only the most recent file, the best option I have found so far is to parse the _FILE_NAME in my SQL query and match it against the current date. However, the parsing is a bit messy, so I'm wondering whether there is a better solution.
What are other options to query only the most recent file? Should I set this up differently?
There isn't a better solution. Use a script to parse the _FILE_NAME pseudo-column, get the latest file name, and then query it (with an EXECUTE IMMEDIATE). No other solution so far.
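A minimal sketch of that approach with BigQuery scripting (project, dataset, and table names are placeholders):

DECLARE latest_file STRING;

-- Grab the name of the latest file; this works when the file name embeds the upload date,
-- so the lexicographically largest name is also the most recent one
SET latest_file = (
  SELECT MAX(_FILE_NAME)
  FROM `my_project.my_dataset.my_external_table`
);

-- Query only the rows that came from that file
EXECUTE IMMEDIATE FORMAT("""
  SELECT *
  FROM `my_project.my_dataset.my_external_table`
  WHERE _FILE_NAME = '%s'
""", latest_file);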

Loading a 50 GB CSV file from Azure Blob to Azure SQL DB in less time - performance

I am loading a 50 GB CSV file from Azure Blob to Azure SQL DB using OPENROWSET.
It takes 7 hours to load this file.
Can you please help me with possible ways to reduce this time?
The easiest option IMHO is to just use BULK INSERT. Move the CSV file into Blob Storage and then import it directly using BULK INSERT from Azure SQL. Make sure Azure Blob Storage and Azure SQL are in the same Azure region.
To make it as fast as possible:
split the CSV into more than one file (for example, using something like a CSV splitter; https://www.erdconcepts.com/dbtoolbox.html looks nice. I've never tried it and it just came up in a quick search, but it looks good)
run multiple BULK INSERTs in parallel using the TABLOCK option (https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-2017#arguments). If the target table is empty, this allows multiple concurrent bulk operations in parallel; see the sketch after this list.
make sure you are using a higher SKU for the duration of the operation. Depending on the SLO (Service Level Objective) you're using (S4? P1? vCore?), you will get a different amount of log throughput, up to close to 100 MB/sec. That is the maximum speed you can actually achieve. (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-resource-limits-database-server)
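A rough sketch of one such parallel load, assuming an external data source named MyAzureBlobStorage already points at the blob container (all names and paths are placeholders):

-- Run one of these per CSV chunk, each in its own session, to load in parallel
BULK INSERT dbo.TargetTable
FROM 'csv/chunk-001.csv'
WITH (
    DATA_SOURCE = 'MyAzureBlobStorage', -- external data source for the container (assumed to exist)
    FORMAT = 'CSV',
    FIRSTROW = 2,                       -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0a',
    TABLOCK                             -- allows concurrent bulk loads into an empty heap
);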
Please try using Azure Data Factory.
First, create the destination table in Azure SQL Database; let's call it USDJPY. After that, upload the CSV to an Azure Storage account. Now create your Azure Data Factory instance and choose Copy Data.
Next, choose "Run once now" to copy your CSV files.
Choose "Azure Blob Storage" as your "source data store", specify your Azure Storage which you stored CSV files.
Provide information about Azure Storage account.
Choose your CSV files from your Azure Storage.
Choose "Comma" as your CSV files delimiter and input "Skip line count" number if your CSV file has headers
Choose "Azure SQL Database" as your "destination data store".
Type your Azure SQL Database information.
Select your table from your SQL Database instance.
Verify the data mapping.
Execute the data copy from the CSV files to the SQL Database by confirming the remaining wizard steps.

Merge files from Data Lake Store

I have a package that imports a file to Data Lake Store daily, so it is the same file with different values (same columns, etc.). My idea is to merge those files into a single file on Data Lake for a monthly report. I want to investigate U-SQL, so my question is:
Is that possible to do with U-SQL?
If it's not possible, are there any other options to do that?
It is very easily possible to merge records from two files and write a new file. Here are the steps (a sketch of the full script follows the list):
Read all the records of the new file using EXTRACT
Read all the records of the current master file using EXTRACT
Use UNION ALL to merge the records: https://msdn.microsoft.com/en-us/library/azure/mt621340.aspx
Write output to master file using OUTPUT statement
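A sketch of the full script, with placeholder paths and columns:

// Placeholder paths and columns - adjust to your daily file and master file
@daily =
    EXTRACT Col1 string,
            Col2 string
    FROM "/input/daily/current.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@master =
    EXTRACT Col1 string,
            Col2 string
    FROM "/output/master.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Merge the existing master records with the new daily records
@merged =
    SELECT * FROM @master
    UNION ALL
    SELECT * FROM @daily;

// Write the combined set to a new master file
OUTPUT @merged
TO "/output/master_merged.csv"
USING Outputters.Csv(outputHeader: true);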
For a quick U-SQL tutorial go here: https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-get-started

How to load multiple CSV files into multiple tables

I have multiple CSV files in a folder.
Example:
Member.csv
Leader.csv
I need to load them into database tables.
I have worked on it using a ForEach Loop Container, Data Flow Task, Excel Source, and OLE DB Destination.
We can do it by using expressions and precedence constraints, but how can I do it using a Script Task if I have more than 10 files? I got stuck on this one.
We have a similar issue; our solution is a mixture of the suggestions above.
We have a number of file types sent from our client on a daily basis.
These have a specific filename pattern (e.g. SalesTransaction20160218.csv, Product20160218.csv)
Each of these file types has a staging "landing" table with the structure you expect.
We then have a .NET script task that takes the filename pattern and loads the data into the landing table (a rough sketch is shown below).
Various checks are also done within the CSV parser (matching the number of columns, some basic data validation) before loading into the landing table.
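A very rough sketch of that kind of script task, assuming the folder, file pattern, landing table, and connection string are passed in as SSIS variables (all names below are illustrative, and the CSV parsing is deliberately naive, with no support for quoted fields):

// Inside the Script Task's Main() method
string folder = Dts.Variables["User::SourceFolder"].Value.ToString();
string pattern = Dts.Variables["User::FilePattern"].Value.ToString();       // e.g. "SalesTransaction*.csv"
string landingTable = Dts.Variables["User::LandingTable"].Value.ToString(); // e.g. "stg.SalesTransaction"
string connectionString = Dts.Variables["User::SqlConnectionString"].Value.ToString();

foreach (string file in System.IO.Directory.GetFiles(folder, pattern))
{
    var table = new System.Data.DataTable();
    using (var reader = new System.IO.StreamReader(file))
    {
        // Build the columns from the header row
        string[] headers = reader.ReadLine().Split(',');
        foreach (string header in headers)
            table.Columns.Add(header);

        // Basic validation: every data row must match the header's column count
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(',');
            if (fields.Length != headers.Length)
                throw new Exception("Column count mismatch in " + file);
            table.Rows.Add(fields);
        }
    }

    // Bulk load the parsed rows into the matching landing table
    using (var connection = new System.Data.SqlClient.SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulkCopy = new System.Data.SqlClient.SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = landingTable;
            bulkCopy.WriteToServer(table);
        }
    }
}

Dts.TaskResult = (int)ScriptResults.Success;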
We are not good enough .NET programmers to be able to dynamically parse an unknown file structure, create a SQL table, and then load the data in. I expect it is feasible; after all, that is what the SSIS Import/Export Wizard does (with some manual intervention).
As an alternative to this (the process is quite delicate), we are experimenting with an HDFS data landing area, which allows us to use analytic tools like R to parse the data within HDFS, and after that use Pig to load the data into SQL.

Exporting query results as JSON via Google BigQuery API

I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as JSON to a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.
1) The steps you mention are good. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; also check the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the files from GCS to your local machine.
With this approach you first need to export to GCS and then transfer to the local machine. If you have a message queue system (like Beanstalkd) in place to drive all this, it's easy to do a chain of operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temp table.
Also note that you can update a table via the API and set the expirationTime property; with this approach you don't need to delete it.
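For reference, a sketch of that chain with the bq CLI (dataset, table, and bucket names are placeholders):

# 1) Run the query into a temporary destination table, allowing large results
bq query --allow_large_results --destination_table=mydataset.tmp_results "SELECT * FROM publicdata:samples.shakespeare"
# 2) Export the temporary table to GCS as newline-delimited JSON
bq extract --destination_format=NEWLINE_DELIMITED_JSON mydataset.tmp_results gs://mybucket/results/shakespeare-*.json
# 3) Remove the temporary table (or rely on expirationTime instead)
bq rm -f -t mydataset.tmp_results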
2) If you use the bq CLI tool, you can set the output format to JSON and redirect the output to a file. This way you can achieve some export locally, but it has certain other limits.
This exports the first 1000 lines as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json