Unable to Remove Special Characters In Pig - apache-pig

I have a text file that I want to load into Pig.
The text file has names in separate rows, but the data has errors in it: special characters. Something like this:
Ja##$s000on
J##a%^ke
T!!ina
Mel#ani
I want to remove the special characters from all the names using a regex in Pig, so that the final output is:
Jason
Jake
Tina
Melani
Can someone please tell me the regex that will do this job in Pig?
Please also write the command that will do it, as I am unable to use the REGEX_EXTRACT and REGEX_EXTRACT_ALL functions.
Also, can someone explain the significance of the number 1 that we pass to this function as an argument after the regex?
Any help would be highly appreciated.

You can use REPLACE with a regex to solve this problem.
input.txt
Ja##$s000on
J##a%^ke T!!ina Mel#ani
Pig script:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','');
dump B;
Output:
(Jason)
(Jake Tina Melani)
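On the question about the number 1: as far as I know, the third argument of REGEX_EXTRACT is the index of the capture group to return, so 1 means the first parenthesised group in the regex. Since Pig is awkward to run in isolation, here is a rough sketch in Python (Python's re module, not Pig) that checks both the REPLACE pattern above and the group-index idea. The helper names clean_name and extract_group and the host:port sample string are made up for illustration; Pig itself uses Java regexes, which behave the same for patterns this simple.

```python
import re

# Same character class as in the REPLACE call: anything that is
# neither an ASCII letter nor whitespace.
NON_LETTERS = re.compile(r"[^a-zA-Z\s]+")

def clean_name(line):
    """Remove every run of characters that are not letters or whitespace."""
    return NON_LETTERS.sub("", line)

def extract_group(text, pattern, index):
    """Return capture group `index` of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(index) if match else None

print(clean_name("Ja##$s000on"))
print(clean_name("J##a%^ke T!!ina Mel#ani"))
# Index 1 selects the first group, index 2 the second.
print(extract_group("192.168.1.5:8020", r"(.*):(.*)", 1))
print(extract_group("192.168.1.5:8020", r"(.*):(.*)", 2))
```

Note that the character class also removes digits, which is why the 000 in the first name disappears.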

There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters when they are part of a string. Just specify that field as type chararray.
Please have a look here.

Related

Apache Pig filtering out carriage returns

I'm fairly new to Apache Pig and trying to work with some fixed-width text. In Pig, I'm reading every line in as a chararray (I know I can use FixedWidthLoader, but I am not in this instance). One of the fields I'm working with is an email field, and one entry has a carriage return that generates extra lines of output in the finished data dump (I get 12 rows instead of the 9 I'm expecting). I know which entry has the error, but I'm unable to filter it out using Pig.
Thus far I've tried to use Pig's REPLACE to replace \r or \uFFFD, and even tried a Python UDF, which works on the command line but not when I run it as a UDF through Pig. Does anyone have any suggestions? Please let me know if more details are required.
My original edit with a solution turned out to only work part of the time. This time I had to clean the data before I ran it through pig. On the raw data file I did a perl -i -pe 's/\r//g' filename to remove the rogue carriage return.
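For anyone who would rather do the same pre-cleaning step without perl, here is a minimal sketch in Python, assuming the file is small enough to read into memory. The function names strip_carriage_returns and clean_file and the sample bytes are made up for illustration. It works on raw bytes so that odd encodings in the data do not get in the way, and like the perl one-liner it removes every \r, including those in \r\n line endings.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Remove every carriage return byte, like perl's s/\\r//g."""
    return data.replace(b"\r", b"")

def clean_file(path: str) -> None:
    """Rewrite the file at `path` in place without carriage returns."""
    with open(path, "rb") as handle:
        raw = handle.read()
    with open(path, "wb") as handle:
        handle.write(strip_carriage_returns(raw))

# A made-up fixed-width style sample with a stray \r inside the email field.
sample = b"0001 user@exam\rple.com\n0002 other@example.com\n"
print(strip_carriage_returns(sample))
```

Calling clean_file('filename') on the raw data before the LOAD would then play the role of the perl command.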

Escape all commas in line except first and last

I have a CSV file which I'm trying to import to a SQL Server table. The file contains lines of 3 columns each, separated by a comma. The only problem is that some of the data in the second column contains an arbitrary number of commas. For example:
1281,I enjoy hunting, fishing, and boating,smith317
I would like to escape all occurrences of commas in each line except the first and the last, such that the result of this line would be:
1281,I enjoy hunting\, fishing\, and boating,smith317
I know I will need some type of regular expression to accomplish this task, but my knowledge of regular expressions is very limited. Currently, I'm trying to use Notepad++ find/replace with regex, but I am open to other ideas.
Any help would be greatly appreciated :-)
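Okay, this can be done manually in three passes. Do this:
Normal find: find every , and replace it with \,. This escapes everything.
Regex find ^(.*)(\\,) and replace it with $1,. Because .* is greedy, this restores the last comma on each line.
Regex find (\\,)(.*)$ and replace it with ,$2. This restores the first comma on each line.
Worked for me in Sublime Text 2.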
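If a script is easier than an editor, the same three passes can be sketched in Python. This is only an illustration of the find/replace steps above, assuming one record per line and no backslash-comma sequences already present in the data; the function name escape_inner_commas is made up.

```python
import re

def escape_inner_commas(line):
    """Escape every comma except the first and the last one on the line."""
    # Pass 1: escape all commas.
    escaped = line.replace(",", "\\,")
    # Pass 2: greedy (.*) runs up to the LAST escaped comma and restores it.
    escaped = re.sub(r"^(.*)\\,", r"\1,", escaped)
    # Pass 3: the leftmost match is the FIRST escaped comma; restore it.
    escaped = re.sub(r"\\,(.*)$", r",\1", escaped, count=1)
    return escaped

print(escape_inner_commas("1281,I enjoy hunting, fishing, and boating,smith317"))
```

Lines that only have the two separator commas come out unchanged, since passes 2 and 3 undo everything pass 1 did.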

How to escape delimiter found in value - pig script?

In a Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with value "foo:bar", I want that string interpreted as a single column without having the loader pick up the colon in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
It seems Pig takes the input as a plain string; it is not intelligent enough to identify what is data and what is not.
PigStorage works like a string tokenizer. So if you want to do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
it does not seem to solve your problem. But if we write our own PigStorage()-style load method, we could possibly come up with a solution.
I will try posting the code to resolve this.
You can use STRSPLIT(string, regex, limit); to split the column based on the delimiter.
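To show what the limit argument buys you, here is a small Python sketch. It assumes STRSPLIT follows Java's String.split(regex, limit) semantics, where a positive limit caps the number of pieces, so the last piece keeps any remaining delimiters. The helper name split_with_limit and the sample value are made up. Note that this only helps when the column containing the delimiter is the last one.

```python
import re

def split_with_limit(text, delimiter_regex, limit):
    """Split `text` into at most `limit` pieces (Java-style positive limit)."""
    if limit < 1:
        raise ValueError("this sketch only handles a positive limit")
    if limit == 1:
        # One piece means no split at all (maxsplit=0 would mean 'unlimited' in Python).
        return [text]
    # Python counts splits, Java counts pieces, hence the limit - 1.
    return re.split(delimiter_regex, text, maxsplit=limit - 1)

# Two columns wanted: the second keeps its embedded colon.
print(split_with_limit("id42:foo:bar", ":", 2))
```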

extract only certain tags in xml file using pig latin

I want to extract only the states from the below xml file.
<Table>
<State>Florida</State>
<id>123</id>
</Table>
<Table>
<State>Texas</State>
<id>456</id>
</Table>
Expected output :
(Florida)
(Texas)
But with the Pig statements below I get this output:
()
()
A = LOAD 'hdfs:/user.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Table')
AS (x:chararray);
B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(x,
'<Table>\\n\\s*<State>(.*)</State>\\n\\s*\\n\\s*</Table>'))
as (state:chararray);
Please help me understand where I have gone wrong or how do I eliminate a certain tag line?
That looks like a buggy regex: after the closing </State> you are using \\n\\s*\\n\\s*</Table>, which seems to ignore the <id>...</id> element. Have you looked at using an XML parsing library in a UDF? It might be easier than trying to build a bunch of regexes by hand.
EDIT: One other suggestion. Are you sure that the line separators in your file are just \n? You may have \r\n as the separator, in which case [\r\n]+ should help; see this post for more details.
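To make the point about the missing <id> element concrete, here is a Python sketch that compares the regex from the question with one possible adjusted pattern. It assumes each record arrives as a single string including the <Table> tags and that the whole string must match, which is what re.fullmatch does here. The ADJUSTED pattern and the helper name extract_state are my own suggestion, not something tested in Pig. In a Pig script the backslashes would need doubling, as in the question.

```python
import re

# The pattern from the question, written as a plain regex.
ORIGINAL = r"<Table>\n\s*<State>(.*)</State>\n\s*\n\s*</Table>"
# Hypothetical adjustment: allow any whitespace and account for <id>...</id>.
ADJUSTED = r"<Table>\s*<State>(.*)</State>\s*<id>.*</id>\s*</Table>"

def extract_state(record, pattern):
    """Return the captured state if the whole record matches, else None."""
    match = re.fullmatch(pattern, record)
    return match.group(1) if match else None

record_lf = "<Table>\n<State>Florida</State>\n<id>123</id>\n</Table>"
record_crlf = record_lf.replace("\n", "\r\n")

print(extract_state(record_lf, ORIGINAL))    # no match, like the empty tuples
print(extract_state(record_lf, ADJUSTED))
print(extract_state(record_crlf, ADJUSTED))  # \s* also absorbs \r\n
```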

Pig : parsing line with blank delimiter

I'm using Hadoop Pig (0.10.0) to process log files. A log line looks like this:
2012-08-01 INFO (User:irim) getListedStocksByMarkets completed in 7041 ms
I would like to get a relation with tokens split by blanks, that is:
(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
Loading that data with the statement:
records = LOAD 'myapp.log' using PigStorage(' ');
did not achieve that, because my tokens can be separated by several whitespace characters, which leads to several empty tokens.
PigStorage does not seem to support a regexp delimiter (or at least I haven't succeeded in configuring it that way).
So my question: what would be the best way to get those tokens?
If I could remove empty elements from a relation I would be happy. Is it possible to do that with Pig?
For example, starting from:
(2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms)
To get
(2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
I'm trying another approach with TextLoader then TOKENIZE but I'm not sure it's the best strategy.
Maybe a User Load Function is a more natural choice ...
Regards,
Joel
You can use the built-in function STRSPLIT with a regular expression to break a line into a tuple. Here is a script for your particular example, with comma as the separator:
inpt = load '~/data/regex.txt' as (line : chararray);
dump inpt;
-- 2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms
splt = foreach inpt generate flatten(STRSPLIT(line, ',+'));
dump splt;
-- (2012-08-01,INFO,(User:irim),getListedStocksByMarkets,completed,in,7041,ms)
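The script above splits on runs of commas; for the original log line the same idea should work with a whitespace pattern. Here is a quick Python sketch of both patterns, just to check the regexes (Python's re module, not Pig; the helper name split_on_runs and the exact spacing of the sample line are made up). In a Pig script the whitespace pattern would presumably be written '\\s+'.

```python
import re

def split_on_runs(line, delimiter_regex):
    """Split on one or more consecutive delimiters so no empty tokens appear in between."""
    return re.split(delimiter_regex, line)

# Log line with several blanks between some tokens.
log_line = "2012-08-01   INFO   (User:irim)  getListedStocksByMarkets completed in 7041 ms"
print(split_on_runs(log_line, r"\s+"))

# The comma-separated variant used in the answer.
comma_line = "2012-08-01,,,INFO,,,(User:irim),,getListedStocksByMarkets,completed,in,7041,ms"
print(split_on_runs(comma_line, r",+"))
```

Leading or trailing delimiters would still produce an empty token at the start of the list, so trim the line first if that can happen.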