I have a package that imports a file to Data Lake Store daily, so it is the same file with different values (same columns, etc.). My idea is to merge those files into a single file on Data Lake for a monthly report. I want to investigate U-SQL, so my question is:
Is it possible to do that with U-SQL?
If it's not possible, are there any other options?
It is easily possible to merge records from two files and write a new file. Here are the steps:
Read all records of the new file using EXTRACT.
Read all records of the current master file using EXTRACT.
Use UNION ALL to merge the records: https://msdn.microsoft.com/en-us/library/azure/mt621340.aspx
Write the output to the master file using the OUTPUT statement.
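A minimal sketch of those steps might look like this (the file paths and the two-column schema are hypothetical; the result is written to a new file here so the previous master stays intact):

// Hypothetical paths and schema; adjust the columns to match your daily file.
@daily =
    EXTRACT Id int,
            Value string
    FROM "/input/daily/2017-01-15.csv"
    USING Extractors.Csv();

@master =
    EXTRACT Id int,
            Value string
    FROM "/output/master.csv"
    USING Extractors.Csv();

// Combine the two rowsets and write them out as the new master file.
@merged =
    SELECT Id, Value FROM @daily
    UNION ALL
    SELECT Id, Value FROM @master;

OUTPUT @merged
TO "/output/master_new.csv"
USING Outputters.Csv();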
For a quick U-SQL tutorial go here: https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-get-started
Currently I'm importing a CSV file into an Azure SQL database automatically each morning at 3 am, but the file has several blank lines that get imported as rows and have to be cleaned up after the data is ingested.
There isn't a way to correct the file prior to ingestion, so I need to transform the data once it's been ingested and would like to avoid having to do this manually.
Is using something like Azure Data Factory the best approach to doing this? Or is there a less expensive / simpler way to simply remove blank lines via something akin to a stored procedure for Azure SQL Database?
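For example, something along these lines run after the import is what I have in mind (the table and column names here are made up; the real table has more columns):

-- Hypothetical table/columns: remove rows where every imported column is NULL or empty.
DELETE FROM dbo.DailyImport
WHERE (Col1 IS NULL OR Col1 = '')
  AND (Col2 IS NULL OR Col2 = '')
  AND (Col3 IS NULL OR Col3 = '');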
I have an external application that creates CSV files. I would like to write these files into SQL automatically, but incrementally.
I was looking into BULK INSERT, but I do not think it is incremental. The CSV files can get huge, so incremental loading is the way to go.
Thank you.
The usual way to handle this is to bulk insert the entire CSV into a staging table, and then do the incremental merge of the data in the staging table into the final destination table with a stored procedure.
If you are still concerned that the CSV files are too big for this, the next step is to write a program that reads the CSV and produces a smaller file containing only the new/changed data you want to import, then bulk insert that smaller CSV instead of the original one.
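A rough sketch of that pattern in T-SQL (the table names, file path, key column, and columns are hypothetical):

-- Load the full CSV into a staging table first.
BULK INSERT dbo.StagingOrders
FROM 'C:\data\orders.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- Then merge only new/changed rows into the destination table.
MERGE dbo.Orders AS target
USING dbo.StagingOrders AS source
    ON target.OrderId = source.OrderId
WHEN MATCHED AND target.Amount <> source.Amount THEN
    UPDATE SET target.Amount = source.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderId, Amount) VALUES (source.OrderId, source.Amount);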
Create a text or CSV file with the names of all the CSV files you want to load into the table. You can include the file path if it is not repeated. You can do this using shell scripting.
Then create a temporary table that loads all the CSV file names that need to be inserted, using a procedure.
Using the above temporary table, loop over its rows and load each file into the target table (without truncating inside the loop). If a truncate is required, do it before the loop. You can load the data into a staging table if any transformation is required (use a procedure for the transformation).
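As a sketch, assuming SQL Server and a temp table #CsvFileList already filled with the file names (all names here are hypothetical), the loop could look like this:

-- Loop over the file-name list and BULK INSERT each file into the target table.
DECLARE @fileName nvarchar(260), @sql nvarchar(max);

DECLARE file_cursor CURSOR FOR
    SELECT FileName FROM #CsvFileList;

OPEN file_cursor;
FETCH NEXT FROM file_cursor INTO @fileName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'BULK INSERT dbo.TargetTable FROM ''' + @fileName +
               N''' WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);';
    EXEC sp_executesql @sql;
    FETCH NEXT FROM file_cursor INTO @fileName;
END

CLOSE file_cursor;
DEALLOCATE file_cursor;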
We had the same problem and used this method. Recently, we switched to Python, which does all the work and loads the data into a staging table. After transformations, it is finally loaded into the target table.
I have a set of files in an Azure Data Lake Store folder. Is there a simple PowerShell command to get the count of records in a file? I would like to do this without using the Get-AzureRmDataLakeStoreItemContent command on the file item, as the files are gigabytes in size. Using this command on big files gives the error below.
Error:
Get-AzureRmDataLakeStoreItemContent : The remaining data to preview is greater than 1048576 bytes. Please specify a
length or use the Force parameter to preview the entire file. The length of the file that would have been previewed:
749319688
Azure Data Lake operates at the file/folder level. The concept of a record really depends on how an application interprets it. For instance, in one case the file may contain CSV lines, and in another a set of JSON objects. In some cases files contain binary data. Therefore, there is no way at the file-system level to get the count of records.
The best way to get this information is to submit a job, such as a U-SQL job in Azure Data Lake Analytics. The script is really simple: an EXTRACT statement followed by a COUNT aggregation and an OUTPUT statement.
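A sketch of such a script (the path and column list are hypothetical; the schema only needs to match the file well enough for the extractor to read it):

// Hypothetical path and schema; adjust the columns to match the file.
@data =
    EXTRACT col1 string,
            col2 string
    FROM "/mydata/bigfile.csv"
    USING Extractors.Csv();

// Count the records and write the single-row result to a small output file.
@count =
    SELECT COUNT(*) AS RecordCount
    FROM @data;

OUTPUT @count
TO "/mydata/bigfile_count.csv"
USING Outputters.Csv();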
If you prefer Spark or Hadoop, here is a Stack Overflow question that discusses that: Finding total number of lines in hdfs distributed file using command line
Can we append data to an existing file in U-SQL?
I have created a CSV file as output in U-SQL. I am writing another U-SQL query and I want to append its output to the existing file.
Is it possible?
It's not supported, and it would go against the design of a robust, distributed, idempotent big data system (although you could implement that behaviour by reading the previous output as a rowset and doing a UNION ALL).
The best way to deal with this is to use partitions properly; for example, create one or more new partitions for each of your executions: https://msdn.microsoft.com/en-us/library/azure/mt621324.aspx
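A rough sketch of the partitioned-table approach, assuming one partition per daily execution (all names are hypothetical and the exact DDL may need adjusting to the current U-SQL syntax):

// Hypothetical table partitioned by a date slice; one partition is added per run.
CREATE TABLE IF NOT EXISTS dbo.DailyOutput
(
    SliceDate DateTime,
    Id int,
    Value string,
    INDEX cidx CLUSTERED (Id ASC)
    PARTITIONED BY (SliceDate)
    DISTRIBUTED BY HASH (Id)
);

ALTER TABLE dbo.DailyOutput ADD IF NOT EXISTS PARTITION (new DateTime(2017, 1, 15));

@rows =
    EXTRACT Id int,
            Value string
    FROM "/input/2017-01-15.csv"
    USING Extractors.Csv();

INSERT INTO dbo.DailyOutput (SliceDate, Id, Value)
SELECT new DateTime(2017, 1, 15) AS SliceDate, Id, Value
FROM @rows;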
I know that by doing:
COPY test FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;
I can import CSV data into PostgreSQL.
However, I do not have a static CSV file. My CSV file is downloaded several times a day, and it includes data that has previously been imported into the database. So, to keep the database consistent, I would have to leave out this old data.
My best-case idea to realize this would be something like the above. The worst case would be a Java program that manually checks each entry of the database against the CSV file. Any recommendations for the implementation?
I really appreciate your answer!
You can dump the latest data into a temp table using the COPY command and then merge the temp table with the live table.
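A sketch of that approach (the table name test is from the question; the key column id is hypothetical):

-- Stage the fresh CSV, then insert only the rows that are not already in the live table.
CREATE TEMP TABLE test_staging (LIKE test);

COPY test_staging FROM '/path/to/csv/example.txt' DELIMITER ',' CSV;

INSERT INTO test
SELECT s.*
FROM test_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM test t WHERE t.id = s.id
);

DROP TABLE test_staging;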
If you are using a Java program to execute the COPY command, try the CopyManager API.