WordCount with custom word delimiters in Pig? - apache-pig

I'm new to Pig, and I'm trying to write a word count program.
One way of getting words from text is to use the TOKENIZE function:
WORDS = foreach INPUT generate flatten(TOKENIZE(text)) AS word;
But I only want to split on whitespace, whereas TOKENIZE splits on things like commas, too. How would I do this? I tried using STRSPLIT(text, ' '), but STRSPLIT seems to return a tuple whereas TOKENIZE returns a bag, so I'm not sure how to use STRSPLIT for this.

It depends on what your input data looks like, but the following could work for you:
Use MyRegExLoader (in PiggyBank) with a regex to load your data.
Use STREAM with Perl, sed, or your favorite scripting language to munge your input data into a format that TOKENIZE will then handle the way you want; a small example follows below.
Also, it's possible to convert tuples to a bag with ToBag (also in PiggyBank).
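For the STREAM route, the filter can be a few lines of Python. A sketch, assuming commas are the only characters you need to protect; the script name munge.py is illustrative:
# munge.py -- a stdin/stdout filter for Pig's STREAM operator, wired up
# along these lines in the Pig script:
#   DEFINE munge `python munge.py` SHIP('munge.py');
#   CLEANED = STREAM INPUT THROUGH munge AS (text:chararray);
# TOKENIZE splits on commas as well as whitespace, so hide commas behind
# a sentinel character first; a REPLACE after TOKENIZE can restore them.
import sys

SENTINEL = '\x01'  # any character TOKENIZE will not split on

for line in sys.stdin:
    sys.stdout.write(line.replace(',', SENTINEL))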

We actually can't directly transform a tuple into a bag (or vice versa). I suggest you do this:
Load your data
Use STRSPLIT to split your value into a tuple
Convert your tuple into a bag with a UDF (a sketch follows below)
Flatten your bag
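Pig can run Python UDFs through Jython, so the conversion step can be a few lines. A minimal sketch, assuming Pig 0.8+ with Jython support; the file name wordbag.py and the schema are illustrative:
# wordbag.py -- register it in the Pig script with:
#   REGISTER 'wordbag.py' USING jython AS wordbag;
# Pig's Jython engine injects the outputSchema decorator when the
# script is registered.
@outputSchema('words:bag{t:(word:chararray)}')
def tuple_to_bag(t):
    # STRSPLIT produced a tuple of words; re-emit it as a bag of
    # single-field tuples so FLATTEN yields one word per row.
    if t is None:
        return []
    return [(word,) for word in t if word is not None]
After registering, FLATTEN(wordbag.tuple_to_bag(...)) in a FOREACH gives one word per row.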

Related

What's the best way to 'normalize' a string in Redshift?

Since my texts are in Portuguese, there are many words with accents and other special characters, like "coração", "hambúrguer", and "São Paulo".
Normally, I treat these names in Python with the following function:
from unicodedata import normalize
def string_normalizer(text):
    result = normalize("NFKD", text.lower()).encode("ASCII", "ignore").decode("ASCII")
    return result.replace(" ", "-")
This replaces the blank spaces with '-', strips special characters, and applies a lowercase conversion. The word "coração" would become "coracao", "São Paulo" would become "sao-paulo", and so on. Now, I'm not sure what's the best way to do this in Redshift. My solution would be to apply multiple replaces, something like this:
replace(replace(replace(lower(column), 'á', 'a'), 'ç', 'c')...
Even though this works, it doesn't look like the best solution. Is there an easy way to normalize my string?
In Redshift, you can use the translate function to normalize a string. The translate function takes three arguments: the source string, the characters to replace, and the replacement characters. You can use this function to replace all the special characters in your string with their ASCII equivalent.
For example, the following query uses the translate function to replace all the special characters in a string with their ASCII equivalent. Additionally, spaces are replaced with "-" characters.
SELECT translate('São Paulo', ' áàãâäéèêëíìîïóòõôöúùûüçÁÀÃÂÄÉÈÊËÍÌÎÏÓÒÕÖÔÚÙÛÜÇ', '-aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC')
This query would return the string "Sao-Paulo" (the space is already replaced by the translate call). You can use the lower function to convert the string to lowercase.
Here's an example of how you could use these functions together to normalize a string:
SELECT lower(translate('São Paulo', ' áàãâäéèêëíìîïóòõôöúùûüçÁÀÃÂÄÉÈÊËÍÌÎÏÓÒÕÖÔÚÙÛÜÇ', '-aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC'))
This query would return the string "sao-paulo".
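Note that translate maps characters strictly by position, so the characters to replace and the replacement characters must line up one for one (this is why the Â above matters). The mapping is easy to sanity-check offline with Python's equivalent str.translate; a quick sketch, not Redshift code:
# Check the positional mapping used by the translate() calls above.
SRC = ' áàãâäéèêëíìîïóòõôöúùûüçÁÀÃÂÄÉÈÊËÍÌÎÏÓÒÕÖÔÚÙÛÜÇ'
DST = '-aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC'
assert len(SRC) == len(DST), 'source and replacement must line up'
print('São Paulo'.translate(str.maketrans(SRC, DST)).lower())  # sao-paulo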

How to escape delimiter found in value - pig script?

In Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with value "foo:bar", I want that string interpreted as a single column without having the loader pick up the colon in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
It seems Pig treats the input as a plain string; it is not intelligent enough to identify what is data and what is a delimiter.
PigStorage works on a simple tokenizer, so if you do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
it doesn't solve your problem. But if we write our own PigStorage() method, we could possibly come across a solution.
I will try posting the code to resolve this.
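In the meantime, the field-splitting logic such a loader needs is easy to prototype outside Pig. A plain-Python sketch, assuming backslash is used to escape the delimiter in the data (the function name is illustrative):
import re

def split_escaped(line, delim=':'):
    # Split on the delimiter only when it is not preceded by a backslash,
    # then strip the escape characters from the resulting fields.
    parts = re.split(r'(?<!\\)' + re.escape(delim), line)
    return [p.replace('\\' + delim, delim) for p in parts]

print(split_escaped(r'id\:42:foo\:bar:baz'))  # ['id:42', 'foo:bar', 'baz']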
You can use STRSPLIT(string, regex, limit) to split the column based on the delimiter.

Unable to Remove Special Characters In Pig

I have a text file that I want to load into my Pig engine.
The text file has names in separate rows, but the data has errors in it: special characters. Something like this:
Ja##$s000on
J##a%^ke
T!!ina
Mel#ani
I want to remove the special characters from all the names using a regex in Pig, so that the output finally looks like this:
Jason
Jake
Tina
Melani
Can someone please tell me the regex that will do this job in Pig?
Also, write the command that will do it, as I am unable to use the REGEX_EXTRACT and REGEX_EXTRACT_ALL functions.
Also, can someone explain the significance of the number 1 that we pass to this function as an argument after defining the regex?
Any help would be highly appreciated.
You can use REPLACE with RegEx to solve this problem.
input.txt
Ja##$s000on
J##a%^ke T!!ina Mel#ani
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','');
dump B;
Output:
(Jason)
(Jake Tina Melani)
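The character class is easy to sanity-check in plain Python before wiring it into Pig; note that [^a-zA-Z\s] also removes the digits in "Ja##$s000on", which is what turns it into "Jason":
# Verify the REPLACE regex from the Pig script above (plain Python).
import re

for name in ['Ja##$s000on', 'J##a%^ke', 'T!!ina', 'Mel#ani']:
    print(re.sub(r'[^a-zA-Z\s]+', '', name))
# Jason / Jake / Tina / Melani
As for the number 1 passed to REGEX_EXTRACT after the regex: it is the index of the parenthesized capture group whose match the function returns, with 1 meaning the first group.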
There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters as part of a string. Just specify that field as type chararray.

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters, not words. The file also includes punctuation and Western characters.
I am reading in an example file of traditional Chinese characters. The file contains a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary I will add it to that list, if it does exist in my list or dictionary, I will increase the counter for that specific character. I will probably use two lists, a list of characters, and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
You want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
to successfully split a line of traditional Chinese characters. I just had to know the proper syntax to handle encoded characters; pretty basic.
my_new_list = list(LINES[0].decode('utf8'))
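For completeness: in Python 3 strings are Unicode from the start, so the splitting and the counting strategy described in the question collapse into a few lines. A sketch using collections.Counter (the file name is the one from the question):
from collections import Counter

counts = Counter()
with open('Chinese_example.txt', encoding='utf-8') as wordfile:
    for line in wordfile:
        # Iterating a str yields one character at a time; skip whitespace.
        counts.update(ch for ch in line if not ch.isspace())

# Punctuation and Western characters are still counted; filter them
# out here if they should be excluded.
for ch, n in counts.most_common(10):
    print(ch, n)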

How to remove strings contained in a list in VB.NET?

How can I find words like "and", "or", "to", "a", "no", "with", "for", etc. in a sentence using VB.NET and remove them? Also, where can I find a list of all such words?
Note that unless you use Regex word boundaries you risk falling afoul of the Scunthorpe (Sfannythorpe) problem.
string pattern = @"\band\b";
Regex re = new Regex(pattern);
string input = "a band loves and its fans";
string output = re.Replace(input, ""); // a band loves its fans
Notice the 'and' in 'band' is untouched.
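The same word-boundary idea extends to a whole stop-word list. A Python sketch of the pattern-building step (the stop-word list here is a small illustrative subset):
import re

STOP_WORDS = ['and', 'or', 'to', 'a', 'no', 'with', 'for']  # illustrative subset
# \b on both sides keeps words like 'band' and 'form' intact.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, STOP_WORDS)) + r')\b',
                     re.IGNORECASE)

text = 'a band loves and its fans'
cleaned = re.sub(r'\s+', ' ', pattern.sub('', text)).strip()
print(cleaned)  # band loves its fans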
You can indeed replace your list of words using the .Replace function (as colithium described) ...
myString.Replace("and", "")
Edit:
... but indeed, a nicer way is to use Regular Expressions (as edg suggested) to avoid replacing parts of words.
As your question suggests that you would like to clean up a sentence to keep meaningful words, you have to do more than just remove two- and three-letter words.
What you need is a list of stop-words:
http://en.wikipedia.org/wiki/Stop_word
A comma-separated list of stop-words for the English language can be found here:
http://www.textfixer.com/resources/common-english-words.txt
The easiest way is:
myString.Replace("and", "")
You'd loop over your word list and have a statement like the above. Google for a list of common English words?
List of English 2 Letter Words
List of English 3 Letter Words
You can match the words and remove them using regular expressions.