I am getting perfect result by doing illustrate on the result of join, but dump not working on the same - apache-pig

I am using the below lines of code :
student = load 'test/Pig/student' using PigStorage(' ') as (name:chararray,roll:int);
result = load 'test/Pig/results' using PigStorage('\t') as (id:int,status:chararray);
passroll = FILTER result by status == 'pass';
store passroll into 'test/Pig/passroll';
pass = load 'test/Pig/passroll/part-m-00000' using PigStorage(',') as (id:int,status:chararray);
stupass = JOIN pass by id, student by roll;
studentname = FOREACH stupass GENERATE student::name;
illustrate studentname is giving perfect result of student names those who have pass.
but dump studentname is giving Encountered Warning ACCESSING_NON_EXISTENT_FI
ELD 19 time(s).

Related

How to read a list of values in Pig as a bag and compare it to a specific value?

Input:
ids:
1111,2222,3333,4444
employee:
{"name":"abc","id":"1111"} {"name":"xyz","id":"10"}
{"name":"z","id":"100"} {"name":"m","id":"99"}
{"name":"pqr","id":"3333"}
I want to filter employees whose id exists in the given list.
Expected Output:
{"name":"xyz","id":"10"} {"name":"z","id":"100"}
{"name":"m","id":"99"}
Existing Code:
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
empl = LOAD 'pathToFile' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (data:map[]);
output = FILTER empl BY data#'id' in (idList);
-- not working, states: A column needs to be projected from a relation for it to be used as a scalar
output = FILTER empl BY data#'id' in (idList#id);
-- not working, states: mismatched input 'id' expecting set null
JsonLoad() is native in pig > 0.10, and you can specify the schema:
empl = LOAD 'pathToFile' USING JsonLoader('name:chararray, id:chararray');
DUMP empl;
(abc,1111)
(xyz,10)
(z,100)
(m,99)
(pqr,3333)
You're loading idList as a one column table of type chararray but you want a list.
Loading it as a one column table (implies modifying you file so there is only one record per line):
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
DUMP idList;
(1111)
(2222)
(3333)
(4444)
or as a one-line file, we'll change the separator so it doesn't split into columns (otherwise it will lead to loading only the first column):
idList = LOAD 'pathToFile' USING PigStorage(' ') AS (id:chararray);
idList = FOREACH idList GENERATE FLATTEN(TOKENIZE(id, '[,]')) AS id;
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Now we can do a LEFT JOIN to see which id are not present in idList and then a FILTER to keep only those. output is a reserved keyword, you shouldn't use it:
res = JOIN empl BY id LEFT, idList BY id;
res = FILTER res BY idList::id IS NULL;
DUMP res;
(xyz,10,)
(m,99,)
(z,100,)

Pig Latin fetching first field

I have below pig code for a sample file.
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the above file using PIG load command & then loop thru it and get 2,3 fields as follows.
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
output of each1:
((001,Rajive,reddy)) etc.
Now i wanna get 1st field of each1 i.e.ID how to get it. I tried below code but showing error
each2 = foreach each1 generate(students.id)
Need to get the first filed from each2 relation.
The extra parenthesis are added to the each1 relation, simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
And you will get something like :
(001,Rajive,reddy)
For the each2 relation, you can get any filed of each1 without using the qualifier students, use filed name or filed position like this :
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
And you will get something like :
(001)
each2 = foreach each1 generate $0;

Pig error in local mode

I'm trying to find out the salary in descending order but the output is not correct. I'm running pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
As I needed email and salary(in desc) so here is what I did.
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I don't mention B = foreach A generate $1,$2; and proceed,output is as expected.
Any suggestion on this?
Cast the bytearray into int and then order :
Try this code :
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)

how to define a constant array and check if a value is in the array for Pig Latin

I want to define an array of user Ids in Pig and then filter data if the userId from the input is NOT in that array,
How do I do this in pig latin? Below is the example of what I intend to do
Thanks
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null and useriD in ('2be2df06-f4ba-4d87-8938-09d867d3f2fe','ac1ac6bf-d151-49fc-8c7c-2b52d2efbb58','f00aec16-36e5-46ae-b7cb-a0f1eeefe609','258890f9-102a-4f8e-a001-ae24d2e25269','cf221779-a077-472c-b377-cca4a9230e1b');
Thanks Murali..I tried the approach of declaring a variable and then using Flatten and stringSplit to join..However I get the following error
Syntax error, unexpected symbol at or near 'flatteneduserids'
%declare REQUIRED_USER_IDS 'xxxxx,yyyyy,sssss' ;
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null;
flatteneduserids = FLATTEN(STRSPLIT('$REQUIRED_USER_IDS',',')) AS (uid:chararray);
useridfilter = JOIN filteredInput BY useriD, flatteneduserids BY uid USING 'replicated';
so Now I tried another way of declaring flatteneduserids which results in the error Undefined alias: IN
flatteneduserids = FOREACH IN GENERATE FLATTEN(STRSPLIT('$REQUIREDUSERIDS',',')) AS (uid:chararray);
Had a similar use case. Tried the approach by declaring the constant value in %define and accessing the same inside IN clause, was not able to achieve the objective. (Refer : Declare a comma seperated string constant)
A thought worth contemplating ....
If the condition inside IN clause is a static/ reference/ meta kind of data, then would suggest to declare this in a static file. We can then read the data at run time and do an inner join with input data to retrieve the matching records.
input_data = LOAD '$INPUT' USING PigStorage('|') AS (user_id:chararray ...)
static_data = LOAD ... AS (req_user_id:chararray
required_data = JOIN input_data BY useriD, static_data BY req_user_id USING 'replicated';
required_data_fmt = -- project required fields.
I was not able to figure out how to do this in memory
So as per Murali's suggestion I added the user ids in a file..load the file and then do a join...that worked as expected for mr

PIG script for creating IDxCITY matrix from given csv file

I have an input file includes ID,CITY and Count information as below and I want to create a csv file which includes ID and count numbers for each CITY. Count will be written as '0' if ID doesnt watched with the CITY. I tried to generate a pig script using group by, cogroup and flatten but couldnt make it to give this sample output.
How can I write a pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use below script for creating matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');