Filter data from extractor using other table - azure-data-lake

I'm trying to extract data from multiple files using a custom CSV extractor, with a filter based on the content of another file.
Example:
Files.txt content:
file1
file4
Directory structure:
/file1/file.txt
/file2/file.txt
/file3/file.txt
/file4/file.txt
I've extracted the Files.txt content to the @files rowset and the files in the directories to the @filesDirectory rowset.
My problem is that if I join @filesDirectory with @files, all files are read no matter which files are listed in Files.txt... I just want to read the files specified in it.
But if I specify the file directly (without joining the two rowsets) it works!
Any help?
Here is the query:
DECLARE @input string = @"/{dirname}/file.txt";
DECLARE @filterFile string = @"/fileFilter.txt";

@inputData =
    EXTRACT dirname string,
            content string
    FROM @input
    USING Extractors.Text(delimiter : '\n', quoting : false);

@inputFilter =
    EXTRACT directories string
    FROM @filterFile
    USING Extractors.Text();

@result =
    SELECT *
    FROM @inputData AS id
         LEFT JOIN @inputFilter AS if ON (id.dirname = if.directories);

I used INNER JOIN and the U-SQL join syntax, which is two equals signs (==), and this worked for me. NB: all the files are still read; the non-matching rows are just filtered out of the results:
DECLARE @inputFile string = "/input/{dirName}/file.txt";

@input =
    EXTRACT dirName string,
            content string
    FROM @inputFile
    USING Extractors.Csv();

@inputFilter =
    EXTRACT directories string
    FROM "/input/files.txt"
    USING Extractors.Csv();

@output =
    SELECT *
    FROM @input
         INNER JOIN @inputFilter ON dirName == directories
    WHERE dirName LIKE "file%";

OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv();
My results with a similar folder structure confirmed this (screenshot omitted).

Did you consider using a file list in your EXTRACT expression? This cannot be a dynamic expression or parameter, so you'll have to generate the U-SQL script before each run based on the data in /input/files.txt, but it avoids reading all of the files and filtering them at runtime (a generator sketch follows the example):
DECLARE @input string = @"/{dirname}/file.txt";
DECLARE @filterFile string = @"/fileFilter.txt";

@inputData =
    EXTRACT dirname string,
            content string
    FROM "/file1/file.txt",
         "/file4/file.txt"
    USING Extractors.Text(delimiter : '\n', quoting : false);
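A minimal sketch of that pre-generation step, assuming C#; the local paths fileFilter.txt and extract.usql, and the script template itself, are placeholders of mine rather than part of the original answer:

// Reads the filter file and emits a U-SQL script with the file list inlined in the FROM clause.
using System;
using System.IO;
using System.Linq;

class GenerateUSqlScript
{
    static void Main()
    {
        // One directory name per line, e.g. "file1", "file4".
        var dirs = File.ReadAllLines("fileFilter.txt")
                       .Where(l => !string.IsNullOrWhiteSpace(l))
                       .ToList();

        // Build the literal file list for the FROM clause.
        var fileList = string.Join(",\n         ", dirs.Select(d => $"\"/{d}/file.txt\""));

        var script =
            "@inputData =\n" +
            "    EXTRACT dirname string,\n" +
            "            content string\n" +
            "    FROM " + fileList + "\n" +
            "    USING Extractors.Text(delimiter : '\\n', quoting : false);\n";

        File.WriteAllText("extract.usql", script);
    }
}

Submit the generated extract.usql before each run; only the listed files are ever opened by the job.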

Related

Logic for adding \ before every character in ADF parameter

I have a requirement in ADF wherein I have a parameter that can contain the following values: AB, A, or B.
I want to update these values by appending \ in front of each character.
So the output should be as follows:
For AB - \A\B
For A - \A
How do I append \ in front of every character, irrespective of the number of characters in the parameter?
Please note that we don't want to use Data Flow for this.
You cannot do this with the native expressions in ADF. Either call an Azure Function to perform the work, or use SQL and a Lookup activity to perform the task. You can put the following into a dynamic expression, referencing the pipeline parameter as shown in the comment:
/* note: every @ is doubled (@@) because this text is pasted into an ADF dynamic expression, where @@ escapes a literal @ */
declare @@str nvarchar(max) = 'helloworld'; /* get the value in the dynamic expression: '@{pipeline().parameters.PARAMETER_NAME}' */
declare @@strOut nvarchar(max) = '';
declare @@strlen int = len(@@str);
declare @@i int = 1;
while @@i <= @@strlen
begin
    set @@strOut += concat('\', SUBSTRING(@@str, @@i, 1));
    set @@i += 1;
end
select @@strOut [result];
Then fetch the result from the activity, e.g. @activity('Lookup1').output.firstRow.result
As seen below, note the UI is just adding an extra \ in the preview (an ADF rendering bug).
The dynamic expression above goes in the Lookup's Query parameter (screenshot omitted).

SQL Server column serialization without key

I have column A with value hello.
I need to migrate it into a new column AJson with value ["hello"].
I have to do this with a SQL Server command.
There are commands like FOR JSON, but they serialize the value with the column name.
This is the same value that the C# call JsonConvert.SerializeObject(new List<string>(){ "hello" }) would produce.
I can't simply attach [" at the beginning and end, because the string value may contain characters which, without proper serialization, will break the JSON string.
My advice is to do it yourself with a set of nested REPLACEs.
FOR JSON is intended for entire JSON documents, and is therefore not valid without keys.
Here is a simple example that replaces the newline with \n:
print replace('ab
c','
','\n')
Backspace to be replaced with \b.
Form feed to be replaced with \f.
Newline to be replaced with \n.
Carriage return to be replaced with \r.
Tab to be replaced with \t.
Double quote to be replaced with \".
Backslash to be replaced with \\.
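Putting the whole list together, a minimal sketch of the nested-REPLACE approach (my own assembly of the rules above, assuming SQL Server; the backslash must be handled first so the escapes added later are not doubled):

declare @s nvarchar(max) = 'he said: "a\b"' + char(10) + 'line two';
select '["' +
       replace(replace(replace(replace(replace(replace(replace(
           @s,
           '\',      '\\'),   -- backslash first
           char(8),  '\b'),   -- backspace
           char(12), '\f'),   -- form feed
           char(10), '\n'),   -- newline
           char(13), '\r'),   -- carriage return
           char(9),  '\t'),   -- tab
           '"',      '\"')    -- double quote
       + '"]' as AJson;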
My approach was to use these 3 commands:
UPDATE Offers
SET [DetailsJson] =
    (SELECT TOP 1 [Details] AS A
     FROM Offers AS B
     WHERE B.Id = Offers.Id
     FOR JSON PATH, WITHOUT_ARRAY_WRAPPER);   -- produces {"A":"..."}

UPDATE Offers
SET [DetailsJson] = SUBSTRING([DetailsJson], 6, LEN([DetailsJson]) - 6);   -- strips the leading {"A": and the trailing }

UPDATE Offers
SET [DetailsJson] = '[' + [DetailsJson] + ']';
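To sanity-check the migrated column afterwards, ISJSON should return 1 for every row (a quick check I'd add, assuming SQL Server 2016+):

SELECT TOP 10 [DetailsJson], ISJSON([DetailsJson]) AS IsValidJson
FROM Offers;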
For the OP's table, the same migration in one statement:
UPDATE Offers
SET [DetailsJson] = concat(N'["', string_escape([Details], 'json'), N'"]');

declare @col nvarchar(100) = N'
a b c " : [ ] ]
x
y
z';
select concat(N'["', string_escape(@col, 'json'), N'"]'),
       isjson(concat(N'["', string_escape(@col, 'json'), N'"]'));

replace comma(,) only if its inside quotes("") in Pig

I have data like this:
1,234,"john, lee", john#xyz.com
I want to remove , inside "" with space using pig script. So that my data will look like:
1,234,john lee, john#xyz.com
I tried using CSVExcelStorage to load this data but i need to use '-tagFile' option as well which is not supported in CSVExcelStorage . So i am planning to use PigStorage only and then replace any comma (,) inside quotes.
I am stuck on this. Any help is highly appreciated. Thanks
The commands below will help:
csvFile = load '/path/to/file' using PigStorage(',');
-- $2 and $3 hold the two halves of the quoted name; strip the quotes and concatenate them
result = foreach csvFile generate $0 as (field1:chararray), $1 as (field2:chararray), CONCAT(REPLACE($2, '\\"', ''), REPLACE($3, '\\"', '')) as field3, $4 as (field4:chararray);
Output:
(1,234,john lee, john@xyz.com)
Load it into a single field and then use STRSPLIT and REPLACE:
A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\"',3));   -- split on the quotes into 3 parts
C = FOREACH B GENERATE $0, REPLACE($1,',',''), $2;        -- drop the comma inside the quoted part
D = FOREACH C GENERATE CONCAT(CONCAT($0,$1),$2);          -- stitch the line back together
E = FOREACH D GENERATE STRSPLIT($0,',',4);                -- or use STRSPLIT again to get individual fields
DUMP E;
A
1,234,"john, lee", john@xyz.com
B
(1,234,)(john, lee)(, john@xyz.com)
C
(1,234,)(john lee)(, john@xyz.com)
D
(1,234,john lee, john@xyz.com)
E
(1),(234),(john lee),(john@xyz.com)
I found a clean way to do this. A very generic solution is below:
data = LOAD 'data.csv' USING PigStorage(',','-tagFile') AS (filename:chararray, record:chararray);
/* remove any comma(,) inside quoted column content; the negative lookahead rejects commas followed by an even number of quotes, so only commas inside quotes match */
replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '');
/* then remove the quotes("") around the column, which the CSV format adds when a field contains a comma */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE($1, '"', '') AS record;
A detailed use case is available at my blog.

SSIS - Using flat file as a Parameter/Variable

I would like to know how to use a flat file (with only one value, say a datetime) as a parameter/variable. Instead of feeding a SQL query value from an Execute SQL Task into a variable, I want to save it as a flat file and then load it again as a parameter/variable.
This can be done using a Script Task:
1. Set ReadOnlyVariables = the variable holding the file name.
2. Set ReadWriteVariables = the name of the variable you have to populate.
3. In the script, write the logic to find the value (read the file and get the value), then set the variable (a sketch follows this list):
Dts.Variables["sFileContent"].Value = streamText;
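A minimal sketch of that Main method, assuming C# inside the Script Task's generated ScriptMain class, with a ReadOnly variable User::sFilePath holding the path and a ReadWrite variable User::sFileContent to populate (both variable names are hypothetical):

public void Main()
{
    // Path comes from the ReadOnly variable configured in step 1 (name is an assumption).
    string filePath = Dts.Variables["User::sFilePath"].Value.ToString();

    // The flat file holds a single value, e.g. a datetime string.
    string streamText = System.IO.File.ReadAllText(filePath).Trim();

    // Populate the ReadWrite variable configured in step 2.
    Dts.Variables["User::sFileContent"].Value = streamText;

    Dts.TaskResult = (int)ScriptResults.Success;
}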

how to define a constant array and check if a value is in the array for Pig Latin

I want to define an array of user IDs in Pig and then filter the data if the userId from the input is NOT in that array.
How do I do this in Pig Latin? Below is an example of what I intend to do.
Thanks
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null and useriD in ('2be2df06-f4ba-4d87-8938-09d867d3f2fe','ac1ac6bf-d151-49fc-8c7c-2b52d2efbb58','f00aec16-36e5-46ae-b7cb-a0f1eeefe609','258890f9-102a-4f8e-a001-ae24d2e25269','cf221779-a077-472c-b377-cca4a9230e1b');
Thanks Murali. I tried the approach of declaring a variable and then using FLATTEN and STRSPLIT to join. However, I get the following error (FLATTEN is only valid inside a FOREACH ... GENERATE, hence the syntax error on the standalone assignment):
Syntax error, unexpected symbol at or near 'flatteneduserids'
%declare REQUIRED_USER_IDS 'xxxxx,yyyyy,sssss' ;
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null;
flatteneduserids = FLATTEN(STRSPLIT('$REQUIRED_USER_IDS',',')) AS (uid:chararray);
useridfilter = JOIN filteredInput BY useriD, flatteneduserids BY uid USING 'replicated';
So now I tried another way of declaring flatteneduserids, which results in the error "Undefined alias: IN" (there is no relation named IN for the FOREACH to iterate over):
flatteneduserids = FOREACH IN GENERATE FLATTEN(STRSPLIT('$REQUIREDUSERIDS',',')) AS (uid:chararray);
Had a similar use case. Tried the approach of declaring the constant value with %define and accessing it inside an IN clause, but was not able to achieve the objective. (Refer: Declare a comma seperated string constant)
A thought worth contemplating...
If the condition inside the IN clause is static/reference/meta data, then I would suggest declaring it in a static file. We can then read the data at run time and do an inner join with the input data to retrieve the matching records:
input_data = LOAD '$INPUT' USING PigStorage('|') AS (user_id:chararray ...);
static_data = LOAD ... AS (req_user_id:chararray ...);
required_data = JOIN input_data BY user_id, static_data BY req_user_id USING 'replicated';
required_data_fmt = -- project the required fields.
I was not able to figure out how to do this in memory.
So, as per Murali's suggestion, I added the user IDs to a file, loaded the file, and then did a join. That worked as expected for me.
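For completeness, a minimal end-to-end sketch of that file-based approach (the /input/userids.txt path and the projected fields are my assumptions):

-- the required user ids sit one per line in the static file
inputData = LOAD '$INPUT' USING PigStorage('|')
    AS (useriD:chararray, controllerAction:chararray, url:chararray, browserName:chararray,
        IsMobile:chararray, exceptionDetails:chararray, renderTime:int,
        serviceHostId:int, auditEventTime:chararray);
userIds = LOAD '/input/userids.txt' AS (uid:chararray);
-- 'replicated' keeps the small id list in memory on every map task
filteredInput = JOIN inputData BY useriD, userIds BY uid USING 'replicated';
result = FOREACH filteredInput GENERATE inputData::useriD, inputData::controllerAction, inputData::auditEventTime;
DUMP result;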