Issue in Loading data from Movielens into pig - apache-pig

I'm trying to load some data into Pig:
Record:
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
Script Used:
loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
USING PigStorage(':')
AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);
Output
11,,American President, The (1995),,Comedy|Drama|Romance
12,,Dracula,, Dead and Loving It (1995)
How to tackle the colon(:) after Dracula.-?
due to the colon, the second column is getting split into 2 columns and since we have in total of 3 columns, the last column of movieid 12 comedy|horror doesn't get loaded.

You can achieve this using REGEX_EXTRACT_ALL.
Following is the piece of code, which achieves this:
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
AS (f1:chrarray);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);
DUMP D;
I got the following output (which is a tuple). ":" after "Dracula" is intact.
(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)

Related

Apache Pig floating number SUM error in precision

I have rows with a double values.
The sum of them however has additional floating digits which I dont want in the output. Any idea to avoid this problem ?
A = LOAD 'test.csv' Using PigStorage(',') AS (
ID: chararray,
COST:double
);
B = GROUP A BY (ID);
C = FOREACH B GENERATE SUM(A.COST);
STORE C INTO 'out.txt' USING PigStorage(',');
INPUT FILE
A,0.51
A,0.51
B,4.81
B,4.81
EXPECTED OUTPUT FILE
A,1.02
B,9.62
ACTUAL INVALID OUTPUT FILE
10.020000457763672
9.619999885559082
Try C = FOREACH B GENERATE ROUND(SUM(A.COST)*100.0)/100.0;
EDIT
It works, see below the output

Pig and Parsing issue

I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here i get (k1=v1 k2=v2 k3=v3) for B
Is there any way i can further parse this by "" so as to get 3 fields k1=v1,k2=v2 and K3=v3 which can then be further split into k1,v1,k2,v2,k3,v3 using Strsplit and Flatten on "=".
Thanks for the help!
San
If you know beforehand how many key=value pair are in each record, try this:
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'=',6); -- 6= no. of key=value pairs
D = FOREACH C GENERATE FLATTEN($0);
DUMP D
output:
(k1,v1, k2,v2, k3,v3)
If you dont know the # of key=value pair, use ' ' as delimiter and remove the unwanted prefix from $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)

Merge two lines in Pig

I would like to write a pig script for below query.
Input is:
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
Output should be:
ABC,DEF,GHI,JKL
MNO,PQR,STU,VWX
Could anyone please help me?
It will be difficult to solve this problem using native pig. One option could be download the datafu-1.2.0.jar library and try the below approach.
input.txt
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
PigScript:
REGISTER /tmp/datafu-1.2.0.jar;
DEFINE BagSplit datafu.pig.bags.BagSplit();
A = LOAD 'input.txt' USING PigStorage(',') AS(f1,f2,f3,f4);
B = GROUP A ALL;
C = FOREACH B GENERATE FLATTEN(BagSplit(2,$1)) AS mybag;
D = FOREACH C GENERATE FLATTEN(STRSPLIT(REPLACE(BagToString(mybag),'_null_null_null_null',''),'_',4));
E = FOREACH D GENERATE $2,$3,$0,$1;
DUMP E;
Output:
(MNO,PQR,STU,VWX)
(ABC,DEF,GHI,JKL)
Note:
Based on the above input format, my assumption will be 1st row last two cols will be null, 2nd row first two cols will be null, similarly for 3rd and 4th row also

How to get the number of words per line in pig?

I'm trying to figure out how many words their are per line in a file in pig. I've gotten as far as loading and splitting:
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);
which gets me a bag of tulples each containing a word. Then I go to count these items I get an error:
counts = FOREACH words GENERATE COUNT(*);
I get an error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException
is that because some of the lines have an empty bag? or is there something else I'm doing wrong?
if it is the problem with an empty bag then you can try something like this: (Not tested)
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*) as tokenized_words;
counts = FOREACH words GENERATE ((tokenized_words IS null or TRIM(tokenized_words) == '') ? 0 : COUNT(*)) as total_count;
here we are writing if-else condition to check if the tokenized_words is null or empty, if yes then we are assigning zero to it else the total count.
Can you try like this?
input
Hi hello how are you
this is apache pig
works
like a charm
Pigscript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;
Output:
(5)
(4)
(1)
()
(3)

PIG - Defining the delimiter used for a bag after a GROUP function

In Pig, I'm loading and grouping two files. I end up with a something like this:
A = LOAD 'File1' Using PigStorage('\t');
B = LOAD 'File2' Using PigStorage('\t');
C = COGROUP A BY $0, B BY $0;
STORE C INTO 'Output' USING PigStorage('\t');
Output:
123 {(123,XYZ,456)} {(123,QRS,889,QWER)}
Where the first field is the group key, the first bag is from File1, and the next bag is from File2. These three sections are delimited from each other using whatever I identified in the PigStorage('\t') clause.
Question: How do I force Pig to delimit the bags by something other than a comma? In my real data, there are commas present and so I need to delimit by tabs instead.
Desired output:
123 {(123\tXYZ\t456)} {(123\tQRS\t889\tQWER)}
This seems to be an open issue (as of June 2013) in Pig. See the corresponding JIRA for more details. Until the issue is fixed, you can change your input data.