Pig : parsing line with blank delimiter - apache-pig

I'm using Hadoop Pig (0.10.0) to process log files, where a log line looks like:
2012-08-01 INFO (User:irim) getListedStocksByMarkets completed in 7041 ms
I would like to get a relation with tokens split by blanks, that is :
(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
Loading that data with statement :
records = LOAD 'myapp.log' using PigStorage(' ');
did not achieve that, because my tokens can be separated by several whitespace characters, which leads to several empty tokens.
PigStorage does not seem to support a regexp delimiter (or at least I haven't succeeded in configuring it that way).
So my question: what would be the best way to get those tokens?
If I could remove empty elements from a relation I would be happy. Is it possible to do that with Pig?
For example starting from :
(2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms)
To get
(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
I'm trying another approach with TextLoader then TOKENIZE but I'm not sure it's the best strategy.
Maybe a User Load Function is a more natural choice ...
Regards,
Joel

You can use the built-in function STRSPLIT with a regular expression to break a line into a tuple. Here is a script for your particular example, with comma as the separator:
inpt = load '~/data/regex.txt' as (line : chararray);
dump inpt;
-- 2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms
splt = foreach inpt generate flatten(STRSPLIT(line, ',+'));
dump splt;
-- (2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
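The script above splits on runs of commas, while the original log lines are separated by runs of blanks. As far as I know, STRSPLIT delegates to Java's String.split, so the same idea should carry over with a whitespace pattern, written in Pig as something like STRSPLIT(line, '\\s+'). Here is a minimal Java sketch of that behaviour. The class and method names are made up for illustration, and the trim() call is an extra step added here so that leading blanks do not produce an empty first token:

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceSplitSketch {
    // Hypothetical helper: splits a log line on runs of whitespace,
    // the way a '\\s+' pattern passed to STRSPLIT is expected to behave.
    static List<String> tokens(String line) {
        // trim() is an addition of this sketch; without it, leading blanks
        // would yield an empty first token.
        return Arrays.asList(line.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String line = "2012-08-01   INFO   (User:irim)  getListedStocksByMarkets completed in 7041 ms";
        // prints [2012-08-01, INFO, (User:irim), getListedStocksByMarkets, completed, in, 7041, ms]
        System.out.println(tokens(line));
    }
}
```

In a Pig script the line would have to be loaded as a single chararray first (for example with TextLoader, as mentioned in the question), so that the loader itself does not split it.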

Related

Does pig support load with no delimiter?

I'd like to load a lot of small files from HDFS with Pig and process them as tuples (filename, filecontent).
a=LOAD 'mydir' USING PigStorage('','-tagPath') AS (filepath:chararray, filecontents:chararray);
However it seems like I cannot omit specifying the delimiter. Is there some sort of a "NULL" in Pig or is there any other way to make sure the content of the file will not be split?
You will have to write your own custom loader by extending LoadFunc.
The short answer to your question is no. To make sure the content is not split, use a delimiter that does not occur in the content. That way, the whole content is loaded into the field filecontents:chararray. So, assuming your input files do not contain the special character '~':
a=LOAD 'mydir' USING PigStorage('~','-tagPath') AS (filepath:chararray, filecontents:chararray);

Not getting the output according to defined schema in Apache Pig

I am new to Pig Latin and I tried this schema on my data:
A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:int, B:bag{T:tuple(t1:int,t2:int)});
My sample data is
10,1,{(2,4),(5,6)}
10,3,{(1,3),(6,9)}
On running \d A, the output on my terminal is:
(10,1,)
(10,3,)
Please tell me what I am doing wrong.
The sample data you have is not in the correct format. Your load statement uses ',' as the field separator. However, the tuples in the bag are also separated by ',', and hence the data is not loaded correctly.
One way to fix this is to choose a different delimiter for the fields, for example tab, pipe, or semicolon.
Using tabs as the field separator and comma as the tuple separator:
10 1 {(2,4),(5,6)}
10 3 {(1,3),(6,9)}
Script for tab delimited fields with the schema
A = LOAD 'test8.txt' using PigStorage('\t') AS (f1:int, f2:int, B:bag{T:tuple(t1:int,t2:int)});
DUMP A;
Alternatively, you can load the sample data without specifying the fields
10,1,{(2,4),(5,6)}
10,3,{(1,3),(6,9)}
Script for load without schema but with ',' as field separator
A = LOAD '/test8.txt' USING PigStorage(',');
DUMP A;

How to escape delimiter found in value - pig script?

In a Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with the value "foo:bar", I want that string interpreted as a single column without having the loader pick up the colon in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
It seems Pig treats the input as a plain string; it is not intelligent enough to tell which characters are data and which are delimiters.
PigStorage works like a string tokenizer. So if you do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
it does not seem to solve your problem. But if we write our own loader in place of PigStorage(), we could possibly come up with a solution.
I will try posting the code to resolve this.
You can use STRSPLIT(string, regex, limit) to split the column on the delimiter.
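To illustrate what the limit argument buys you: as far as I know, STRSPLIT passes its arguments through to Java's String.split(regex, limit), where a positive limit caps the number of pieces and leaves the rest of the string untouched in the last one. This is not real escaping. It only helps when the column that may contain the delimiter is the last one. A minimal sketch, with a made-up class name and made-up sample data:

```java
import java.util.Arrays;

public class SplitLimitSketch {
    // Hypothetical helper mirroring STRSPLIT(string, regex, limit).
    static String[] splitWithLimit(String value, String regex, int limit) {
        return value.split(regex, limit);
    }

    public static void main(String[] args) {
        // Two logical columns; the second one contains the delimiter itself.
        String row = "id42:foo:bar";
        // prints [id42, foo:bar]
        System.out.println(Arrays.toString(splitWithLimit(row, ":", 2)));
    }
}
```

For this to apply in Pig, the row would need to be loaded as a single chararray (as in the REGEX_EXTRACT_ALL answer above) rather than with PigStorage(':'), which would already have split on every colon.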

Unable to Remove Special Characters In Pig

I have a text file that I want to load into Pig.
The text file has names in it, one per row, but the data has errors in it (special characters), something like this:
Ja##$s000on
J##a%^ke
T!!ina
Mel#ani
I want to remove the special characters from all the names using a regex, so that the final output in Pig is:
Jason
Jake
Tina
Melani
Can someone please tell me the regex that will do this job in Pig?
Please also write the command that will do it, as I am unable to use the REGEX_EXTRACT and REGEX_EXTRACT_ALL functions.
Also, can someone explain the significance of the number 1 that we pass to this function as an argument after the regex?
Any help would be highly appreciated.
You can use REPLACE with RegEx to solve this problem.
input.txt
Ja##$s000on
J##a%^ke T!!ina Mel#ani
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','');
dump B;
Output:
(Jason)
(Jake Tina Melani)
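To see why this pattern works: the character class [^a-zA-Z\s] matches anything that is neither a letter nor whitespace, and each run of such characters is replaced with the empty string. As far as I know, Pig's REPLACE is backed by Java's String.replaceAll, so the behaviour can be sketched in plain Java (class and method names are made up for illustration):

```java
public class StripSpecialCharsSketch {
    // Hypothetical helper: removes every run of characters that is
    // neither an ASCII letter nor whitespace.
    static String clean(String name) {
        return name.replaceAll("[^a-zA-Z\\s]+", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Ja##$s000on"));             // prints Jason
        System.out.println(clean("J##a%^ke T!!ina Mel#ani")); // prints Jake Tina Melani
    }
}
```

On the other part of the question: in REGEX_EXTRACT(string, regex, index), the trailing number is the index of the capture group to return, so 1 means the text matched by the first pair of parentheses in the regex.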
There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters in when part of a string. Just specify that field as type chararray

extract only certain tags in xml file using pig latin

I want to extract only the states from the below xml file.
<.Table>
<State>Florida</State>
<id>123</id>
<./Table>
<.Table>
<State>Texas</State>
<id>456</id>
<./Table>
Expected output :
(Florida)
(Texas)
But with the below Pig statements I get
()
()
as output:
A = LOAD 'hdfs:/user.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Table')
AS (x:chararray);
B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(x,
'<Table>\\n\\s*<State>(.*)</State>\\n\\s*\\n\\s*</Table>'))
as (state:chararray);
Please help me understand where I have gone wrong or how do I eliminate a certain tag line?
That looks like a buggy regex. After the closing </State> you are using \\n\\s*\\n\\s*</Table>, which leaves no room for the <id>...</id> element, so the pattern never matches the whole record. Have you looked at using an XML parsing library in a UDF? It might be easier than trying to build a bunch of regexes by hand.
EDIT: One other suggestion. Are you sure that the line separators in your file are just \n? You may have \r\n as the separator, in which case [\r\n]+ should help.
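Both points can be checked outside Pig. REGEX_EXTRACT_ALL, as far as I know, requires the pattern to match the entire input (Java's Matcher.matches()), so a pattern that does not account for the <id> line yields nothing. The sketch below assumes XMLLoader hands over each record as the full <Table>...</Table> text. The class name and the adjusted pattern are illustrative suggestions, not taken from the original posts. Note that the patterns are written as Java string literals here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StateRegexSketch {
    // Pattern from the question: nothing in it can consume the <id> line.
    static final Pattern ORIGINAL =
        Pattern.compile("<Table>\\n\\s*<State>(.*)</State>\\n\\s*\\n\\s*</Table>");

    // Adjusted pattern (an assumption of this sketch): \s* tolerates both
    // \n and \r\n, and [\s\S]* skips whatever follows </State>.
    static final Pattern ADJUSTED =
        Pattern.compile("<Table>\\s*<State>(.*?)</State>[\\s\\S]*</Table>");

    // Returns capture group 1 on a full match, or null otherwise.
    static String extractState(Pattern p, String record) {
        Matcher m = p.matcher(record);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String unix = "<Table>\n<State>Florida</State>\n<id>123</id>\n</Table>";
        String windows = "<Table>\r\n<State>Texas</State>\r\n<id>456</id>\r\n</Table>";
        System.out.println(extractState(ORIGINAL, unix));    // prints null
        System.out.println(extractState(ADJUSTED, unix));    // prints Florida
        System.out.println(extractState(ADJUSTED, windows)); // prints Texas
    }
}
```

If the records carry extra text before or after the <Table> element, the adjusted pattern would still need a leading or trailing [\s\S]* to get a full match.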