Azure Data Lake Loop - azure-data-lake

Do Azure Data Lake Analytics and U-SQL support While or For loops for creating multiple outputs?
I want to write to multiple files in a single U-SQL execution.
This is what I want:
Foreach @day in @days
    @dataToSave =
        SELECT day AS day,
               company AS Company
        FROM @data
        WHERE day == @day
    @out = @day + ".txt"
    OUTPUT @dataToSave
    TO @out
    USING Outputters.Text();
Next
I know I can use PowerShell, but I think preparing the execution that way will cost performance.

U-SQL does not support While or For loops. You can use WHERE clauses to filter extracted data, and virtual columns to filter based on file paths/names (example).
To output to multiple files, you can write a separate rowset with its own WHERE clause for each output, provided the number of files is reasonable.
As you said, you could also generate the script with PowerShell or U-SQL (example).
Dynamic output to multiple files is currently in a limited private preview. Please send an email to usql at microsoft dot com with your scenario if you're interested in this feature, as it could work for your scenario based on your description.
Hope this helps, and let me know if you have more questions about implementing any of these solutions.
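As a rough illustration of the "one rowset and WHERE clause per output" approach, the repetitive U-SQL can be generated by a small script before submission. This is only a sketch in Python rather than PowerShell: the @data rowset, the string-typed day column, and the /output/ path are assumptions for illustration, not part of the original question.

```python
from datetime import date, timedelta


def build_usql(days):
    """Return U-SQL text with one filtered rowset and one OUTPUT per day."""
    parts = []
    for i, day in enumerate(days):
        stamp = day.isoformat()  # e.g. "2017-07-11"
        rowset = f"@dataToSave{i}"  # hypothetical rowset name, one per day
        parts.append(
            f"{rowset} =\n"
            f"    SELECT day, company\n"
            f"    FROM @data\n"
            f'    WHERE day == "{stamp}";\n'
            f"\n"
            f"OUTPUT {rowset}\n"
            f'TO "/output/{stamp}.txt"\n'  # hypothetical output folder
            f"USING Outputters.Text();\n"
        )
    return "\n".join(parts)


if __name__ == "__main__":
    start = date(2017, 7, 10)
    print(build_usql([start + timedelta(days=n) for n in range(3)]))
```

The generated text still has to be appended to the part of the script that defines @data and then submitted as a single job.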

You can try creating a custom outputter that ignores the output file and writes to your own file instead:
public override void Output (IRow row, IUnstructuredWriter output)

Try this, using outputter too:
public override void Output(IRow input, IUnstructuredWriter output)
{
    // address and _file are members of the outputter class (not shown here).
    using (var streamWriter = new System.IO.StreamWriter(address + _file, true))
    {
        // Save to the file here.
    }
}

Related

read specific files names in adf pipeline

I have a requirement: blob storage contains multiple files named file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these I have to read only the files numbered 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I reproduced this in my lab; please see the steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Pass the child items from the Get Metadata output to a ForEach activity.
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach and set its expression so that only the required files are copied to the sink. With names such as file_5.csv, the digit sits at zero-based index 5:
@and(greater(int(substring(item().name,5,1)),4),lessOrEquals(int(substring(item().name,5,1)),7))
When the If Condition is true, add a Copy Data activity to copy the current item (file) to the sink.
I took a slightly different approach using a Filter activity and the endsWith function.
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results, it depends what you need.
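To make the two filtering ideas concrete outside the pipeline designer, here is a small Python sketch of the same logic. It is not ADF code: the function names are made up for illustration, and the file names are the ones from the question.

```python
def filter_by_suffix(names, suffixes=("_5.csv", "_6.csv", "_7.csv")):
    """Keep names ending in one of the suffixes (mirrors the nested endsWith calls)."""
    return [n for n in names if n.endswith(tuple(suffixes))]


def filter_by_number(names, low=5, high=7):
    """Keep names of the form file_<n>.csv where low <= n <= high."""
    kept = []
    for n in names:
        stem = n.rsplit(".", 1)[0]        # "file_5"
        digits = stem.rsplit("_", 1)[-1]  # "5"
        if digits.isdigit() and low <= int(digits) <= high:
            kept.append(n)
    return kept


names = [f"file_{i}.csv" for i in range(1, 8)]
print(filter_by_suffix(names))
print(filter_by_number(names))
```

Parsing the whole number after the underscore also copes with multi-digit file numbers, which a fixed one-character substring would not.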
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest:
Declare two parameters for the lower and upper limits of the range.
Create a ForEach loop and build its items from the two parameters with the range function. Note that range takes a start index and a count, not an upper limit.
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to build a dynamic expression like
@concat('file_',string(item()),'.csv')
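The arithmetic behind the range-based approach can be sketched in Python. The adf_range helper below only imitates the start-index-plus-count behaviour of the ADF range function; it and file_names are hypothetical names, and the file_<n>.csv pattern is taken from the question.

```python
def adf_range(start_index, count):
    """Imitate range(startIndex, count): count integers starting at start_index."""
    return list(range(start_index, start_index + count))


def file_names(lower, upper):
    """Build the file names for every number from lower to upper, inclusive."""
    count = upper - lower + 1  # convert the upper limit into a count
    return [f"file_{n}.csv" for n in adf_range(lower, count)]


print(file_names(5, 7))
```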

How to Parameterize USQL input files without using ADF

I have an input folder in ADLS in the format year/month/date, e.g. 2017/07/11. I want to pass this input folder as a parameter to my U-SQL script. I am not using ADF. I don't want to generate the current date within the U-SQL script, because I am not sure that the input folder is for the current date. How can I do this effectively?
One way I thought of was to upload a "done" file after the whole input folder has been uploaded to the ADLS account; that "done" file would contain the date. But I am not able to use that date to form my input data path. Please help.
Let's assume you have several CSV files in your folder structure (structured as yyyy/MM/dd) and you want to extract all the files in the folder for a specific date. You can do it in two ways (depending on whether you need exact datetime semantics or are fine with path concatenation).
First the path concat example:
DECLARE EXTERNAL @folder = "2017/07/11"; // Script parameter with default value.
// You can also specify the value with a constant-foldable expression on DateTime.Now.
DECLARE @path = "/constantpath/" + @folder + "/{*.csv}";
@data = EXTRACT I int, s string // or whatever your schema is...
        FROM @path
        USING Extractors.Csv();
...
And here is the example with a file set virtual column:
DECLARE EXTERNAL @date = "2017/07/11"; // Script parameter with default value.
// You can also specify the value with a constant-foldable expression on DateTime.Now and string serialization (I am not sure if the ADF parameter model supports DateTime values).
DECLARE @path = "/constantpath/{date:yyyy}/{date:MM}/{date:dd}/{*.csv}";
@data = EXTRACT I int, s string // or whatever your schema is...
        , date DateTime // virtual column for the date pattern
        FROM @path
        USING Extractors.Csv();
// Now apply the requested filter to reduce the files to the requested set
@data = SELECT * FROM @data WHERE date == DateTime.Parse(@date);
...
In both cases, you pass the parameter via the ADF parameterization model and you can decide to wrap the code into a U-SQL stored procedure or TVF as suggested by Bob.
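Without ADF, one possible way to supply the value is to prepend a plain DECLARE to the script text before submitting it; as I understand it, a DECLARE EXTERNAL is ignored when the variable has already been declared. The Python sketch below assumes that behaviour, and it assumes the "done" file from the question contains the date as yyyy/MM/dd. Both helper names are hypothetical, and submitting the job is left to whatever tool you already use.

```python
import re


def read_done_date(done_text):
    """Extract a yyyy/MM/dd folder name from the text of the 'done' file."""
    match = re.search(r"\d{4}/\d{2}/\d{2}", done_text)
    if match is None:
        raise ValueError("no yyyy/MM/dd date found in done file")
    return match.group(0)


def with_parameter(script_text, name, value):
    """Prepend a DECLARE so the script's DECLARE EXTERNAL default is not used."""
    return f'DECLARE @{name} string = "{value}";\n{script_text}'


script = 'DECLARE EXTERNAL @folder = "2017/07/11";\n// rest of the script\n'
print(with_parameter(script, "folder", read_done_date("done 2017/07/12\n")))
```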

Extract incident details from Service Now in Excel

I am trying to extract ticket details from ServiceNow. Is there a way to extract the details without ODBC? I have also tried the solution described at https://community.servicenow.com/docs/DOC-3844, but I am receiving error 9, "subscript out of range".
Is there a better way to extract the details efficiently? I tried asking this in the ServiceNow forum, but I thought I might get other opinions here.
It's been a while since this question was asked. Hopefully the following is still useful.
I am extracting change data (not incidents), but the process should be the same. You will need to gather the incident table and column information. Then there are a couple of ways to approach the problem.
1) If the data you are extracting has fixed parameters, such as a fixed period, fixed columns, or a fixed group, you can create a report within ServiceNow and then use the REST/SOAP API to get the data in text/CSV format. You can use various Python modules to convert from CSV to xls or xlsx depending on your needs. I used openpyxl, csv, xlsreader, xlswriter, etc.
See here for an example:
ServiceNow - How to use SOAP to download reports
2) If the data has dynamic parameters, where you need to change columns, dates, filters, etc., you can still use the SOAP/REST API, but build the query within a Python script instead of relying on a static report. This way you can change it on the fly based on your requirements.
Here is an example query against the database table. You can reuse the example linked above; just replace the URL with the following.
table_name = 'u_change_table_name'  # ServiceNow table holding the change/incident info
table_limit = 800
table_query = 'active=true&sysparm_display_value=true&planned_start_date=today'  # alternative query, not used in the URL below
date_query = 'chg_start_date>=javascript:gs.daysAgoStart(1)^active=true^chg_type=normal'
table_fields = 'chg_number,chg_start_date,chg_duration,chg_end_date'  # Actual column names from the table, not from the ServiceNow report.
url = (
    'https://yourcompany.service-now.com/api/now/table/' + table_name
    + '?sysparm_query=' + date_query
    + '&sysparm_fields=' + table_fields
    + '&sysparm_limit=' + str(table_limit)
)
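Because the query contains characters such as >, = and ^, it may be safer to let the standard library do the URL encoding. The following is a minimal sketch of that idea, not the original author's code: build_table_url is a hypothetical helper, and the instance, table and field names are the placeholders used above.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_table_url(instance, table, query, fields, limit):
    """Build a Table API URL with properly encoded query parameters."""
    params = {
        "sysparm_query": query,
        "sysparm_fields": ",".join(fields),
        "sysparm_limit": limit,
    }
    return f"https://{instance}/api/now/table/{table}?{urlencode(params)}"


url = build_table_url(
    "yourcompany.service-now.com",
    "u_change_table_name",
    "chg_start_date>=javascript:gs.daysAgoStart(1)^active=true^chg_type=normal",
    ["chg_number", "chg_start_date", "chg_duration", "chg_end_date"],
    800,
)
print(url)
# Decoding the query string again shows that the values survive intact.
print(parse_qs(urlparse(url).query))
```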

How to reorder the columns of a CSV input into a fixed column order in Pentaho

Scenario:
I have created a transformation to load data into a table from a CSV file, and the CSV file has the following columns:
Customer_Id
Company_Id
Employee_Name
But the user may supply an input file with the columns in a different (random) order, such as:
Employee_Name
Company_Id
Customer_Id
So, if I try to load a file with a random column order, will Kettle load the correct column values according to the column names?
Using ETL Metadata Injection you can use a transformation like this, to either normalize the data, or to store it to your database:
Then you just need to send the correct data to that transformation. You can read the header line from the CSV, and use Row Normaliser to convert to the format used by ETL Metadata Injection.
I have included a quick example here: csv_inject on Dropbox, if you make something like this and run it from something that runs it per csv file it should work.
Ooh, that's some nasty JavaScript!
The way to do this is with metadata injection. Look at the samples, but basically you need a template transformation which reads the file and writes it back out. You then use another, parent transformation to figure out the headings, configure that template, and execute it.
There are samples in the PDI samples folder; also take a look at the "figuring out file format" example in Matt Casters' blueprints project on GitHub.
You could try something like this as your JavaScript:
//Script here
var seen;
trans_Status = CONTINUE_TRANSFORMATION;
var col_names = ['Customer_Id','Company_Id','Employee_Name'];
var col_pos;
if (!seen) {
    // First line
    trans_Status = SKIP_TRANSFORMATION;
    seen = 1;
    col_pos = [-1,-1,-1];
    for (var i = 0; i < col_names.length; i++) {
        for (var j = 0; j < row.length; j++) {
            if (row[j] == col_names[i]) {
                col_pos[i] = j;
                break;
            }
        }
        if (col_pos[i] === -1) {
            writeToLog("e", "Cannot find " + col_names[i]);
            trans_Status = ERROR_TRANSFORMATION;
            break;
        }
    }
}
var Customer_Id = row[col_pos[0]];
var Company_Id = row[col_pos[1]];
var Employee_Name = row[col_pos[2]];
Here is the .ktr I tried: csv_reorder.ktr
(edit, here are the test csv files)
1.csv:
Customer_Id,Company_Id,Employee_Name
cust1,comp1,emp1
2.csv:
Employee_Name,Company_Id,Customer_Id
emp2,comp2,cust2
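The header-mapping idea behind the script can also be illustrated outside of PDI. The sketch below is plain Python rather than a Kettle step, and the reorder function is a made-up name; it uses the two test files above to show that looking columns up by header name yields the same output order regardless of the input order.

```python
import csv
import io

COLUMNS = ["Customer_Id", "Company_Id", "Employee_Name"]


def reorder(csv_text, columns=COLUMNS):
    """Return the data rows with their values arranged in the fixed column order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Cannot find {missing}")
    return [[row[c] for c in columns] for row in reader]


print(reorder("Customer_Id,Company_Id,Employee_Name\ncust1,comp1,emp1\n"))
print(reorder("Employee_Name,Company_Id,Customer_Id\nemp2,comp2,cust2\n"))
```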
Assuming rejecting the input file is not an option, you basically have 4 solutions:
1. Reorder the fields in an external editor (don't use Excel if the file contains dates).
2. Use code within your transformation to detect the column headers and reorder the file.
3. Use metadata injection, as proposed by bolav.
4. Create a job. This needs to:
   a. load the file into a temporary database.
   b. use an SQL statement to retrieve the fields (use a SELECT with an ORDER BY clause).
   c. output the file in the correct order.

How to evenly distribute data in apache pig output files?

I've got a pig-latin script that takes in some xml, uses the XPath UDF to pull out some fields and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
This above script works fine and produces the data that I'm looking for. However, it generates over 1,900 partfiles with only a few mbs in each file. I learned about the default_parallel option, so I set that to 128 to try and get 128 partfiles. I ended up having to add a piece to force a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data. Also, I now get 128 part-files. My problem now is that the data is not evenly distributed among the part-files. Some have 8 gigs, others have 100 mb. I should have expected this when grouping them by RANDOM() :).
My question is what would be the preferred way to limit the number of part-files yet still have them evenly-sized? I'm new to pig/pig latin and assume I'm going about this in the completely wrong way.
p.s. the reason I care about the number of part-files is because I'd like to process the output with spark and our spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the pig script but for now my "solution" is to repartition the data within the spark process that works on the output of the pig script. I use the RDD.coalesce function to rebalance the data.
From the first code snippet, I assume it is a map-only job, since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB(given size in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
Setting pig.maxCombinedSplitSize makes each mapper read around 1 GB of data, and the code above works as an identity mapper job, which writes the data in part files of roughly 1 GB each.
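As a rough back-of-the-envelope check, the number of part files produced by the identity job is approximately the total output size divided by the split size. The Python sketch below only does that arithmetic; the 128 GB total is a made-up figure, and the real count can differ because splits follow file and block boundaries.

```python
import math


def estimated_part_files(total_bytes, max_combined_split_size):
    """Approximate the number of mappers (and so part files) for a map-only job."""
    return max(1, math.ceil(total_bytes / max_combined_split_size))


# Hypothetical total of 128 GB with the 1 GB split size used above.
print(estimated_part_files(128 * 10**9, 1000000000))
```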