ERROR 1200: Pig script failed to parse - apache-pig

I am getting an error when executing the script below:
data1 = load '/user/pig/join2_genchanA.txt' using PigStorage(',') as (showname:chararray, channelname:chararray);
data2 = load '/user/pig/join2_gennumA.txt' using PigStorage(',') as (showname:chararray, showviewer:long);
joindata = join data1 by showname, data2 by showname;
bat = filter joindata by channelname=='BAT';
foreachviewer = FOREACH bat GENERATE channelname, showviewer;
foreachgroupall = GROUP foreachviewer all;
batsum = FOREACH foreachgroupall GENERATE SUM(bat.showviewers);
Now I am getting the below error:
"2017-09-15 04:01:03,517 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: **Pig script failed to parse**: <line 28, column 46> Invalid scalar projection: bat Details at logfile: /home/cloudera/pig_1504878875671.log"
Please help me with this.

It's saying the field bat is not available in the alias foreachgroupall.
Do a DESCRIBE on foreachgroupall; it will show the field names.
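For reference, DESCRIBE foreachgroupall should print something close to this (a sketch; the exact :: prefixes come from the join, since the FOREACH above did not rename its fields with AS):
foreachgroupall: {group: chararray,foreachviewer: {(data1::channelname: chararray,data2::showviewer: long)}}
There is no field named bat in that schema, which is why the scalar projection fails and why the fix goes through the foreachviewer bag instead.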
Replace the last line in your code snippet with the one below:
batsum = FOREACH foreachgroupall GENERATE SUM(foreachviewer.data2::showviewer);
I would suggest filtering for channel name BAT right after loading the data sets rather than after the join. See the code snippet below for details:
data1 = load '/user/pig/join2_genchanA.txt' using PigStorage(',') as (showname:chararray, channelname:chararray);
data2 = load '/user/pig/join2_gennumA.txt' using PigStorage(',') as (showname:chararray, showviewer:long);
bat = FILTER data1 BY channelname=='BAT';
joindata = join bat by showname, data2 by showname;
foreachviewer = FOREACH joindata GENERATE bat::channelname AS channelname, data2::showviewer AS showviewer;
req_stats = FOREACH(GROUP foreachviewer ALL) GENERATE SUM(foreachviewer.showviewer);
DUMP req_stats;

Related

Pig scalar is bigger than 0

I have the following code:
Data1 = LOAD '/user/cloudera/Class Ex 2/Data 1' USING PigStorage(',') as (Name:chararray,ID:chararray,text_1:chararray,Grade_1:int,Grade_2:int,Grade_3:int,Grade_4:int);
Data2 = LOAD '/user/cloudera/Class Ex 2/Data 2' USING PigStorage(',') as (Name:chararray,ID:chararray,text_2:chararray,Grade_5:int,Grade_6:int,Grade_7:int,Grade_8:int);
Data_3 = JOIN Data1 BY Data1.ID,Data2 BY Data2.ID;
Data_4 = FOREACH Data_3 GENERATE $0,$1,$2,$3,$4,$5,$6,$9,$10,$11,$12,$13;
Data_5 = FOREACH Data_4 GENERATE
Name,
ID,
text_1,
SIZE(text_1),
REPLACE(text_1,'or',''),
SIZE(REPLACE(text_1,'or','')),
SIZE(text_1)-SIZE(REPLACE(text_1,'or','')),
text_2,
SIZE(text_2),
REPLACE(text_2,'or',''),
SIZE(REPLACE(text_2,'or','')),
SIZE(text_2)-SIZE(REPLACE(text_2,'or','')),
($3+$4+$5+$6+$8+$9+$10+$11)/8;
DESCRIBE Data_5;
STORE Data_5 Into '/user/cloudera/Class Ex 2/Data_output' USING PigStorage(',');
Essentially I have to load 2 sets of data, and then make some basic text statistics and manipulation.
Everything works fine until the last statement, STORE.
When I add it I receive the scalar error.
What am I doing wrong here?
Thanks guys!
First of all, Pig only evaluates the aliases that finally lead to a STORE or a DUMP (this is called lazy evaluation). Hence, your error was always there; it just got caught once you added the STORE statement. Since you have not pasted the full trace, I would think that your error is in the third statement, where you try to access the field ID using the dot (.) operator. You need to change it to one of the following:
1) Refer to the field ID directly, since there is only one field called ID in each of Data1 and Data2:
Data_3 = JOIN Data1 BY ID, Data2 BY ID;
2) Use :: instead of . if you do need to disambiguate:
Data_3 = JOIN Data1 BY Data1::ID, Data2 BY Data2::ID;
If you want to know why the dot (.) operator caused an error, it might help to look at the following question: Getting exception while trying to execute a Pig Latin Script
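As a minimal illustration of the lazy-evaluation point (a sketch with made-up aliases and input path, only to show when the error surfaces):
A = LOAD 'some_input' USING PigStorage(',') AS (x:int, y:int);
B = FOREACH A GENERATE A.x;   -- scalar projection; fails at run time if A has more than one row
C = FOREACH A GENERATE x + y;
DUMP C;                       -- only A and C are evaluated here, so B's problem stays hidden
STORE B INTO 'some_output';   -- adding this is what finally triggers the error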

I am getting a perfect result by doing illustrate on the result of a join, but dump is not working on the same

I am using the below lines of code:
student = load 'test/Pig/student' using PigStorage(' ') as (name:chararray,roll:int);
result = load 'test/Pig/results' using PigStorage('\t') as (id:int,status:chararray);
passroll = FILTER result by status == 'pass';
store passroll into 'test/Pig/passroll';
pass = load 'test/Pig/passroll/part-m-00000' using PigStorage(',') as (id:int,status:chararray);
stupass = JOIN pass by id, student by roll;
studentname = FOREACH stupass GENERATE student::name;
illustrate studentname gives a perfect result: the names of the students who have passed,
but dump studentname gives: Encountered Warning ACCESSING_NON_EXISTENT_FIELD 19 time(s).
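No answer was posted for this one, but one likely cause (an assumption based only on the snippet): store passroll into 'test/Pig/passroll'; writes with PigStorage's default tab delimiter, while the reload uses PigStorage(','), so each line comes back as a single field, the cast of id to int yields null, and the join matches nothing, hence the ACCESSING_NON_EXISTENT_FIELD warnings. ILLUSTRATE can still look fine because it synthesizes sample records rather than strictly replaying the real part file. A minimal sketch that keeps the delimiters consistent:
store passroll into 'test/Pig/passroll' using PigStorage(',');
pass = load 'test/Pig/passroll/part-m-00000' using PigStorage(',') as (id:int,status:chararray);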

How to read a list of values in Pig as a bag and compare it to a specific value?

Input:
ids:
1111,2222,3333,4444
employee:
{"name":"abc","id":"1111"} {"name":"xyz","id":"10"}
{"name":"z","id":"100"} {"name":"m","id":"99"}
{"name":"pqr","id":"3333"}
I want to filter out employees whose id exists in the given list, keeping only those whose id is not in it.
Expected Output:
{"name":"xyz","id":"10"} {"name":"z","id":"100"}
{"name":"m","id":"99"}
Existing Code:
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
empl = LOAD 'pathToFile' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (data:map[]);
output = FILTER empl BY data#'id' in (idList);
-- not working, states: A column needs to be projected from a relation for it to be used as a scalar
output = FILTER empl BY data#'id' in (idList#id);
-- not working, states: mismatched input 'id' expecting set null
JsonLoader() is native in Pig > 0.10, and you can specify the schema:
empl = LOAD 'pathToFile' USING JsonLoader('name:chararray, id:chararray');
DUMP empl;
(abc,1111)
(xyz,10)
(z,100)
(m,99)
(pqr,3333)
You're loading idList as a one-column table of type chararray, but you want a list.
Loading it as a one-column table (this implies modifying your file so there is only one record per line):
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Or, keeping it as a one-line file, change the separator so the line doesn't split into columns (otherwise only the first column would be loaded), then tokenize:
idList = LOAD 'pathToFile' USING PigStorage(' ') AS (id:chararray);
idList = FOREACH idList GENERATE FLATTEN(TOKENIZE(id, '[,]')) AS id;
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Now we can do a LEFT JOIN to see which ids are not present in idList, and then a FILTER to keep only those. Also, output is a reserved keyword in Pig, so you shouldn't use it as an alias:
res = JOIN empl BY id LEFT, idList BY id;
res = FILTER res BY idList::id IS NULL;
DUMP res;
(xyz,10,)
(m,99,)
(z,100,)
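If the trailing empty column (the null idList::id left over from the outer join) is unwanted, it can be projected away; a small follow-up sketch using the field names from the loads above:
res_clean = FOREACH res GENERATE empl::name AS name, empl::id AS id;
DUMP res_clean;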

Convert one line into multiple lines in Pig

I would like to write a Pig script for the below query.
Input is:
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,BBB,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3
Output should be:
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3
ABC,DEF,GHI,JKL,AAA,bbb,1,2,3
ABC,DEF,GHI,JKL,AAA,ccc,1,2,3
ABC,DEF,GHI,JKL,BBB,aaa,1,2,3
ABC,DEF,GHI,JKL,BBB,bbb,1,2,3
ABC,DEF,GHI,JKL,BBB,ccc,1,2,3
Could anyone please help me?
You can write your own custom UDF, or try the below approach.
input.txt
ABC,DEF,GHI,JKL,AAA,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3,BBB,aaa,1,2,3,bbb,1,2,3,ccc,1,2,3
PigScript:
A = LOAD 'input.txt' USING PigStorage(',');
B = FOREACH A GENERATE $0,$1,$2,$3,
FLATTEN(TOTUPLE($4)),
FLATTEN(TOBAG(
TOTUPLE($5..$8),
TOTUPLE($9..$12),
TOTUPLE($13..$16)
)
);
C = FOREACH A GENERATE $0,$1,$2,$3,
FLATTEN(TOTUPLE($17)),
FLATTEN(TOBAG(
TOTUPLE($18..$21),
TOTUPLE($22..$25),
TOTUPLE($26..$29)
)
);
D = UNION B,C;
DUMP D;
Output:
(ABC,DEF,GHI,JKL,AAA,aaa,1,2,3)
(ABC,DEF,GHI,JKL,AAA,bbb,1,2,3)
(ABC,DEF,GHI,JKL,AAA,ccc,1,2,3)
(ABC,DEF,GHI,JKL,BBB,aaa,1,2,3)
(ABC,DEF,GHI,JKL,BBB,bbb,1,2,3)
(ABC,DEF,GHI,JKL,BBB,ccc,1,2,3)
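The trick is that TOBAG builds a bag of three 4-field tuples per input line, and FLATTEN expands that bag into three output rows, each repeating the leading fields. A toy sketch of the same idea (hypothetical file toy.txt containing the single line k,a,1,b,2):
X = LOAD 'toy.txt' USING PigStorage(',');
Y = FOREACH X GENERATE $0, FLATTEN(TOBAG(TOTUPLE($1,$2), TOTUPLE($3,$4)));
DUMP Y;
-- (k,a,1)
-- (k,b,2)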

Pig Error 1066 after iterating through a joined set

Trying to join one set, which has the number of days in each month, with a data set on the year-month key. After I join them and try to do a FOREACH over the result, I get an ERROR 1066 ... Backend error : Scalar has more than one row in the output.
Here is an abbreviated set with the same problem:
$ hadoop fs -cat DIM/*
2011,01,31
2011,02,28
2011,03,31
2011,04,30
2011,05,31
2011,06,30
2011,07,31
2011,08,31
2011,09,30
2011,10,31
2011,11,30
2011,12,31
$ hadoop fs -cat ACCT/*
2011,7,26,key1,23.25,2470.0
2011,7,26,key2,10.416666666666668,232274.08333333334
2011,7,26,key3,82.83333333333333,541377.25
2011,7,26,key4,78.5,492823.33333333326
2011,7,26,key5,110.83333333333334,729811.9166666667
2011,7,26,key6,102.16666666666666,675941.25
2011,7,26,key7,118.91666666666666,770896.75
Then in grunt:
grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int, days:int);
grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, day: int, account:chararray, metric1:double, metric2:double);
grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated';
grunt> dump AjD;
...
(2011,7,26,key1,23.25,2470.0,2011,7,31)
(2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
(2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
(2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
(2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
(2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
(2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
grunt> describe AjD;
AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: int,DIM::month: int,DIM::days: int}
grunt> FINAL = FOREACH AjD
>> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
grunt> dump FINAL;
...
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias FINAL. Backend error : Scalar has more than one row in the output. 1st : (2011,7,26,key1,23.25,2470.0), 2nd :(2011,7,26,key2,10.416666666666668,232274.08333333334)
However, if I store it and reload it to shed the "join" schema, it works:
grunt> STORE AjD INTO 'AjD' using PigStorage(',');
grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int, day:int, account:chararray, metric1:double, metric2:double, year2:int, month2:int, days:int);
grunt> FINAL = FOREACH AjD2
>> GENERATE year, month, account, (metric2 /days);
grunt> dump FINAL;
...
(2011,7,key1,79.6774193548387)
(2011,7,key2,7492.712365591398)
(2011,7,key3,17463.782258064515)
(2011,7,key4,15897.526881720427)
(2011,7,key5,23542.319892473122)
(2011,7,key6,21804.5564516129)
(2011,7,key7,24867.637096774193)
Is there a way to iterate (FOREACH) over the joined set without storing and reloading?
Have you tried the :: operator, which specifies which relation a column comes from?
Replace (ACCT.metric2 / DIM.days) with (ACCT::metric2 / DIM::days), and use :: for the other projected fields as well.
e.g.
...
FINAL = FOREACH AjD
GENERATE
ACCT::year, ACCT::month, ACCT::account, (ACCT::metric2 / DIM::days);
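An equivalent way (a sketch reusing the aliases above) is to project and rename with :: right after the join, which achieves what the store-and-reload workaround does without the extra I/O:
AjD_flat = FOREACH AjD GENERATE ACCT::year AS year, ACCT::month AS month, ACCT::account AS account, ACCT::metric2 AS metric2, DIM::days AS days;
FINAL = FOREACH AjD_flat GENERATE year, month, account, (metric2 / days);
dump FINAL;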