How to escape delimiter found in value - pig script? - apache-pig

In pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with value "foo:bar" I want that string interpreted as a single column without having the loader pick up the comma in the middle.

You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.

It seems Pig takes the Input as the string its not so intelligent to identify how what is data or what is not.
The pig Storage works on the Strong Tokenizer. So if u want to do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
It doesn't seems to be solving your problem. But if we can write our own PigStorage() Method possibly we could come across some solution.
I will try posting the Code to resolve this.

you can use STRSPLIT(string, regex, limit); for the column split based on the delimiter.

Related

How do I print the first occurence of a string after a special character in Hive using reg_extract or split?

I am having a deep dilemma in hive. My data set in Hive looks like this:
##214628##564#7576#7876
#12771#242###256823
###3264###7236473####3
In each instance, I want to print only the first string after the #. So the output should be something like this:
214628
12771
3264
I tried using the reg_extract function, but alas I am getting only NULL values. Since hive doesn't support reg_substr, the following synatax doesn't work:
to_number(trim(regexp_substr(col_name,'[^#]+',1,1)))
Any suggestions are wecome!
You can use regexp_replace and then substr combination.
First remove all multiple occurrences of # from the string using regexp_replace().
regexp_replace(col,'#+','#') -- for data '#####123##' this will produce '#123#'
Then remove first # using substr. And then use instr to fetch everything starting from first till #.
substr(substr(str,2),1, instr(substr(str,2),'#')-1) this will produce '123'
You can see whole sql below.
select substr(substr(str,2),1, instr(substr(str,2),'#')-1) as result
from (
SELECT regexp_replace('#####123##','#+','#') as str) a
I assumed you always have # in the beginning. if you just add if left(str,1)='#'... and handle according to the data.

Extract characters between a string and the first occurrence of something in BigQuery

I want to extract a set of characters between "u1=" and the first semi-colon using a regex. For instance, given the following string: id=1w54;name=nick;u1=blue;u2=male;u3=ohio;u5=
The desired regex output should be just blue.
I tested (?<=u1=)[^;]* on https://regex101.com and it works. However, when I run this in BigQuery, using regexp_extract(string, '(?<=u1=)[^;]*') , I get an error that reads "Cannot parse regular expression: invalid perl operator: (?<"
I'm confused why this isn't working in BQ. Any help would be appreciated.
You can use regexp_extract() like this:
regexp_extract(string, 'u1=([^;]+)')

Hive convert a string to an array of characters

How can I convert a string to an array of characters, for example
"abcd" -> ["a","b","c","d"]
I know the split methd:
SELECT split("abcd","");
#["a","b","c","d",""]
is a bug for the last whitespace? or any other ideas?
This is not actually a bug. Hive split function simply calls the underlying Java String#split(String regexp, int limit) method with limit parameter set to -1, which causes trailing whitespace(s) to be returned.
I'm not going to dig into implementation details on why it's happening since there is already a brilliant answer that describes the issue. Note that str.split("", -1) will return different results depending on the version of Java you use.
A few alternatives:
Use "(?!\A|\z)" as a separator regexp, e.g. split("abcd", "(?!\\A|\\z)"). This will make the regexp matcher skip zero-width matches at the start and at the end positions of the string.
Create a custom UDF that uses either String#toCharArray(), or accepts limit as an argument of the UDF so you can use it as: SPLIT("", 0)
I don't know if it is a bug or that's how it works. As an alternative, you could use explode and collect_list to exclude blanks from a where clause
SELECT collect_list(l)
FROM ( SELECT EXPLODE(split('abcd','') ) as l ) t
WHERE t.l <> '';

Unable to Remove Special Characters In Pig

I have a text file that I want to Load onto my Pig Engine,
The text file have names in it in separate rows, and the data but has errors in it.....special characters....Something like this:
Ja##$s000on
J##a%^ke
T!!ina
Mel#ani
I want to remove the special characters from all the names using REGEX ....One way i found to do the job in pig and finally have the output as...
Jason
Jake
Tina
Melani
Can someone please tell me the regex that will do this job in Pig.
Also write the command that will do it as I unable to use the REGEX_EXTRACT and REGEX_EXTRACT_ALL function.
Also can someone explain what is the Significance of the number 1 that we pass to this function as Argument after defining the Regex.
Any help would be highly appreciated.
You can use REPLACE with RegEx to solve this problem.
input.txt
Ja##$s000on
J##a%^ke T!!ina Mel#ani
PigScript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','');
dump B;
Output:
(Jason)
(Jake Tina Melani)
There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters in when part of a string. Just specify that field as type chararray
Please Have a look here

WordCount with custom word delimiters in Pig?

I'm new to Pig, and I'm trying to write a word count program.
One way of getting words from text is to use the TOKENIZE function:
WORDS = foreach INPUT generate flatten(TOKENIZE(text)) AS word;
But I only want to split on whitespace, whereas TOKENIZE splits on things like commas, too. How would I do this? I tried using STRSPLIT(text, ' '), but STRSPLIT seems to return a tuple whereas TOKENIZE returns a bag, so I'm not sure how to use STRSPLIT for this.
It depends on what your input data looks like, but the following could work for you:
Use MyRegExLoader (in PiggyBank) with a regex to load your data.
Use STREAM with Perl, sed, or your favorite scripting language to munge your input data into a format that TOKENIZE will then handle the way you want.
Also, it's possible to convert tuples to a bag with ToBag (also in PiggyBank).
We actually can't directly transform a tuple into a bag (and vice-versa). I suggest you to do this :
Load your data
Use STRSPLIT to split your value into a tuple
Convert your tuples into a bag with an UDF
Flatten you bag