How to write a U-SQL query to output to multiple files - azure-data-lake

I want to group a data set based on unique values in a column and save each group to its own file.
My problem is the same as the one already described here:
U-SQL Output in Azure Data Lake
As I am new to the U-SQL language, I am unable to implement the second step from the answer: I cannot figure out how to write a U-SQL query that runs the generated U-SQL script from the first part of the answer.

If the number of groups is known in advance, you could write a U-SQL stored procedure that takes two parameters: 1) the value of the group and 2) the name of the file.
In the pseudo-code below, the name of the final file is driven by the underlying value of the group. The data to be split is sourced from a U-SQL table (referred to in the pseudo-code as <MyTable>).
DROP PROCEDURE IF EXISTS splitByGroups;
CREATE PROCEDURE splitByGroups(@groupValue string, @file_name_prefix string = "extract")
AS
BEGIN
    DECLARE @OUTPUT string = "/output/" + @file_name_prefix + "_" + @groupValue + ".csv";
    OUTPUT
    (
        SELECT *
        FROM <MyTable>
        WHERE <MyGroup> == @groupValue
    )
    TO @OUTPUT
    USING Outputters.Csv(outputHeader : true);
END;
You would then execute the stored procedure as many times as you have groups:
splitByGroups("group1", DEFAULT);
splitByGroups("group2", DEFAULT);
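Since the sticking point in the question was generating and running that second "driver" script, here is a minimal sketch of producing it with a shell loop (the group list is a hypothetical placeholder; in practice you would build it from a distinct-values extract of the table):

```shell
# Write one splitByGroups(...) call per known group into a driver script.
groups="group1 group2 group3"   # hypothetical: your distinct group values
: > runSplit.usql
for g in $groups; do
  printf 'splitByGroups("%s", DEFAULT);\n' "$g" >> runSplit.usql
done
cat runSplit.usql
```

You would then submit runSplit.usql as an ordinary U-SQL job.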
Alternatively, if you wish to analyse the multiple files offline, I would download the full file and use the shell (PowerShell or a Linux shell) to split it.
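For the offline route, a one-liner sketch: assuming a comma-separated extract with a header row where the group value sits in column 2 (the file name, column position and sample data below are hypothetical), awk can fan the rows out into one file per group:

```shell
# Hypothetical stand-in for the downloaded extract (header + data rows).
printf 'id,grp\n1,a\n2,b\n3,a\n' > extract.csv

# Skip the header, then write each row to group_<value>.csv keyed on column 2.
tail -n +2 extract.csv | awk -F',' '{ print > ("group_" $2 ".csv") }'
```

This assumes the grouping column contains values that are safe to use in file names.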

Related

Need to read from XML with read only DB access

I'm trying to build a process where I can dump a list of IDs to an XML document, which can then be used in a SQL statement to return information for only those IDs. I feel as though I'm very close, but the BFILENAME function I need to use to open the XML file requires the CREATE DIRECTORY statement, which fails because I have read-only access and cannot create objects. Is there something else I can do to be able to create the directory used in the BFILENAME function?
I'm experienced in building SQL statements, but have never had to pull data from an external source in this way.
Here is the script I'm trying to run as a proof-of-concept test. Ultimately this will be joining into another table and spooling output to a CSV file.
CREATE DIRECTORY temp_dir AS 'C:\Users\MyDude\Desktop\';

DECLARE
    acct_doc xmltype := xmltype( bfilename('temp_dir', 'TestXML.xml'), nls_charset_id('AL32UTF8') );
    vis_PersonID varchar(100);
BEGIN
    SELECT PersonID
    INTO vis_PersonID
    FROM xmltable(
             '/Root/E'
             passing acct_doc
             columns PersonID VARCHAR2(400) PATH '.'
         );
END;
This fails on line 1 because I have read-only access. There are only two IDs in the file; if this were working properly, I'd expect to see those two IDs output.
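One possible workaround sketch, assuming the ID list can be handed to the session as a string instead of being read from disk: build the XMLTYPE from a literal, which needs no directory object at all (the IDs below are hypothetical placeholders, and the loop simply prints what the XMLTABLE sees):

```sql
-- Sketch only: constructs the XML in-session, so no CREATE DIRECTORY
-- (and no write access) is needed. IDs here are hypothetical.
DECLARE
    acct_doc xmltype := xmltype('<Root><E>1001</E><E>1002</E></Root>');
BEGIN
    FOR r IN ( SELECT PersonID
               FROM xmltable('/Root/E'
                             passing acct_doc
                             columns PersonID VARCHAR2(400) PATH '.') )
    LOOP
        dbms_output.put_line(r.PersonID);
    END LOOP;
END;
```

The same XMLTABLE call can be joined to other tables once the document is in a variable, which sidesteps the read-only restriction entirely.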

How to correct 'Operating system error code 12007' when accessing Azure blob storage in a SQL stored procedure

I'm trying to create a stored procedure that will access a file in an Azure blob storage container, store the first line of the file in a temporary table, use this data to create a table (effectively using the header fields in the file as the column titles), and then populate the table with the rest of the data.
I've tried the basic process in a local SQL database, using a local source file on my machine, and the procedure itself works as I want it to, creating a new table from the supplied file.
However, when I set it up within an Azure SQL database and amend the procedure to use a 'datasource' rather than pointing it at a local file, it produces the following error:
Cannot bulk load because the file "my_example_file" could not be opened. Operating system error code 12007(failed to retrieve text for this error. Reason: 317).
My stored procedure contains the following:
CREATE TABLE [TempColumnTitleTable] ([ColumnTitles] [nvarchar](max) NULL);

DECLARE @Sql NVARCHAR(MAX) = 'BULK INSERT [dbo].[TempColumnTitleTable]
    FROM ''' + @fileName + '''
    WITH (DATA_SOURCE = ''Source_File_Blob'',
          DATAFILETYPE = ''char'',
          FIRSTROW = 1, LASTROW = 1,
          ROWTERMINATOR = ''0x0a'')';
EXEC(@Sql);
The above should create a single-column table containing all the text for the headers, which I can then interrogate and use for the column titles in my permanent table.
I've set up the DataSource as follows:
CREATE EXTERNAL DATA SOURCE Source_File_Blob
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'location_url',
CREDENTIAL = AzureBlobCredential
);
with an appropriate credential in place!
I'm expecting it to populate my temporary column title table (and then go on to do the other populating that I haven't shown code for above), but it just returns the error code mentioned.
I've had a Google, but the error code seems to be related to other 'path'-type issues that I don't think apply here.
We've got similar processes that use blob storage with the same credentials, and they all seem to work OK, but the problem is that the person who wrote them is no longer at our company, so I can't actually consult them!
So basically, what would be causing that error? I don't think it's access, since I am able to run similar processes on other blobs, and as far as I can tell the access levels are the same on these.
Yep - I used the wrong URL as the prefix. It was only when I finally got access to the blob storage that I realised.
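For anyone hitting the same thing, a hedged sketch of the documented shape (account, container and file names below are hypothetical): the LOCATION should carry the https URL of the storage account and container only, and the file name passed to BULK INSERT is then relative to that container:

```sql
-- Hypothetical names; the key point is that LOCATION stops at the container
-- and the BULK INSERT file path is relative to it.
CREATE EXTERNAL DATA SOURCE Source_File_Blob
WITH (
    TYPE = BLOB_STORAGE,
    LOCATION = 'https://myaccount.blob.core.windows.net/mycontainer',
    CREDENTIAL = AzureBlobCredential
);

BULK INSERT [dbo].[TempColumnTitleTable]
FROM 'my_example_file.csv'   -- relative to the container above
WITH (DATA_SOURCE = 'Source_File_Blob', FIRSTROW = 1, LASTROW = 1);
```

A full URL or a path including the container name in the FROM clause produces exactly the 12007-style "could not be opened" error described here.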

Select rows to extract from a CSV file in USQL

I'm trying to extract a few columns from a CSV file.
This file is replaced every day, and columns can be added to it.
My problem is that every time the number of columns changes, I need to update the U-SQL code. Any help?
@billing =
    EXTRACT
        id string,
        company string
    FROM @companydatafile
    USING Extractors.Csv(skipFirstNRows : 1);
That works on this CSV file:
1, company1
2, company2
But if the file is updated to
1, company1, address1
2, company2, address1
it will return an error.
Many Thanks!
Another hint, in case you do not want to use a custom extractor but would like to use built-in extractors:
If you know that you evolve your CSV schema over time, use a way to differentiate between the different versions in the path name. Then you can use the following pattern:
#s1 = EXTRACT ... FROM "/data/v1/{*}.csv" USING Extractors.Csv();
#s2 = EXTRACT ... FROM "/data/v2/{*}.csv" USING Extractors.Csv();
....
#data = SELECT * FROM #s1 OUTER UNION ALL BY NAME(*) SELECT * FROM #s2 ...;
You can also wrap it into a table-valued function to abstract it. So you only have to update the function definition and using scripts will automatically get the latest version.
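As a sketch of that table-valued-function wrapper (the function name and column list are hypothetical, and the paths follow the versioned layout above), one way to pad the older schema with a null column so the two versions union cleanly:

```sql
// Hypothetical U-SQL table-valued function hiding the schema versions.
DROP FUNCTION IF EXISTS GetCompanyData;
CREATE FUNCTION GetCompanyData()
RETURNS @data TABLE(id string, company string, address string)
AS
BEGIN
    @s1 =
        EXTRACT id string, company string
        FROM "/data/v1/{*}.csv"
        USING Extractors.Csv(skipFirstNRows : 1);
    @s2 =
        EXTRACT id string, company string, address string
        FROM "/data/v2/{*}.csv"
        USING Extractors.Csv(skipFirstNRows : 1);
    @data =
        SELECT id, company, (string)null AS address FROM @s1
        UNION ALL
        SELECT id, company, address FROM @s2;
END;
```

Consuming scripts then just call GetCompanyData(), and only the function definition needs updating when a new schema version appears.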
David is correct - if you would like to run the same job for variable columns with no changes to the script, you should create a custom extractor. You can also automatically create an EXTRACT statement from a file using ADL Tools for VS (blog here), which means you can avoid delving through the file each time to get the new columns.
You can also vote or create a new feature request here to help increase the priority for developing this. Hope this helps, and let me know if you have other questions.
Have you seen How to deal with files containing rows with different column counts in U-SQL: Introducing a Flexible Schema Extractor?

Declare variable in template table

I am writing an ETL to extract data from a HANA table and load it into SQL Server using BODS.
My job needs to create a new table on SQL Server every time it runs, named with that day's date. I know we can do that for flat files by using a global variable, but I'm not sure how to declare a similar variable in a template table to get the desired result.
Why do you want to use template tables? You can do the same as below:
Load the data into a standard staging table using BODS
Using the DS scripting mechanism, generate a query to create a table
Execute the query using a SQL transform
Generate another query to copy the data from the staging table to the table created above
There are several other ways as well; for example, you can write a DB procedure to create a table with the desired name and copy the data over from the staging table. You can call this procedure from DS.
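As a hedged sketch of what the generated query could look like on the SQL Server side (staging table and name prefix are hypothetical), dynamic SQL with SELECT ... INTO both creates the date-named table and copies the staged rows in one statement:

```sql
-- Hypothetical names; style 112 formats the date as yyyymmdd.
DECLARE @tableName sysname = 'EXTRACT_' + CONVERT(char(8), GETDATE(), 112);
DECLARE @sql nvarchar(max) =
    N'SELECT * INTO dbo.' + QUOTENAME(@tableName) + N' FROM dbo.StagingTable;';
EXEC (@sql);
```

The DS script step would emit (or parameterise) this text and hand it to the SQL transform.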
Hope this helps.
Cheers.
Shaz

Using SSIS to extract a XML representation of table data to a file

I'm trying to use SSIS to extract an XML representation of a query result set to a text file. My query currently produces exactly the XML output I need when I run it in SSMS. I've tried every trick I can find to use this result set in an SSIS package to create a file.
Using a data flow to port an OLE DB Source to a Flat File destination doesn't work, because the output of an XML query is treated as TEXT and SSIS can't push TEXT, NTEXT or IMAGE to a file destination.
I've then tried an Execute SQL Task to fill a user variable, followed by a Script Task (written in C#) to write the contents of this user variable to a file, but the user variable is always empty. I don't know for sure, but I suspect this is, again, because the XML is treated as TEXT or IMAGE and the user variable doesn't handle this.
The query is in this form:
SELECT *
FROM dataTable
WHERE dataTable.FIELD = 'Value'
FOR XML AUTO, ROOT('RootVal')
The resulting dataset is well formed XML, but I can't figure out how to get it from result set to file.
It's a relatively easy task for me to write a console app to do this in C# 4.0, but restrictions require me to at least prove it CAN'T be done with SSIS before I write the console app and a scheduler.
Sorry to spoil, but there's an SSIS option for you: Export Column Transformation.
I defined an OLE DB query with
SELECT
*
FROM
(
SELECT * FROM dbo.spt_values FOR XML AUTO, ROOT('RootVal')
) D (xml_node)
CROSS APPLY
(
SELECT 'C:\ssisdata\so_xmlExtract.xml'
) F (fileName)
This results in 1 row and 2 columns in the dataflow. I then attached the Export Column Transformation and wired it up with xml_node as Extract Column and fileName as the File Path Column
Mostly truncated results follow
<RootVal>
<dbo.spt_values name="rpc" number="1" type="A " status="0"/>
<dbo.spt_values name="dist" number="8" type="A " status="0"/>
<dbo.spt_values name="deferred" number="8192" type="V " low="0" high="1" status="0"/>
</RootVal>
A more detailed answer, with pictures, is available on this Q&A Export Varbinary(max) column with ssis
BillInKC's answer is the best I've ever seen, but the SQL can be simplified (no need for CROSS APPLY):
SELECT X.*, 'output.xml' AS filename
FROM (SELECT * FROM #t FOR XML PATH('item'), ROOT('itemList')) AS X (xml_node)
It will output the same structure:
xml_node filename
-------------------------------------------------- ----------
<itemList><item><num>1000</num></item></itemList>  output.xml
(1 row(s) affected)