I am using the following lines of code:
student = load 'test/Pig/student' using PigStorage(' ') as (name:chararray,roll:int);
result = load 'test/Pig/results' using PigStorage('\t') as (id:int,status:chararray);
passroll = FILTER result by status == 'pass';
store passroll into 'test/Pig/passroll';
pass = load 'test/Pig/passroll/part-m-00000' using PigStorage(',') as (id:int,status:chararray);
stupass = JOIN pass by id, student by roll;
studentname = FOREACH stupass GENERATE student::name;
illustrate studentname gives the expected result: the names of the students who have passed.
But dump studentname gives: Encountered Warning ACCESSING_NON_EXISTENT_FIELD 19 time(s).
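One possible cause, which the post does not confirm: STORE without a USING clause writes tab-delimited files through the default PigStorage, so reloading that output with PigStorage(',') would leave each row as a single field, and status would then be a non-existent field. A minimal sketch that avoids the reload altogether by joining the filtered relation directly, assuming the two input files really are delimited as the LOAD statements say:

```pig
student = LOAD 'test/Pig/student' USING PigStorage(' ') AS (name:chararray, roll:int);
result = LOAD 'test/Pig/results' USING PigStorage('\t') AS (id:int, status:chararray);

-- Keep only the rows marked as passed.
passroll = FILTER result BY status == 'pass';

-- Join the filtered relation directly instead of storing and reloading it.
stupass = JOIN passroll BY id, student BY roll;
studentname = FOREACH stupass GENERATE student::name;
DUMP studentname;
```

If the intermediate file is still needed, loading it with the same delimiter it was stored with (tab by default) should keep both columns intact.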
Input:
ids:
1111,2222,3333,4444
employee:
{"name":"abc","id":"1111"} {"name":"xyz","id":"10"}
{"name":"z","id":"100"} {"name":"m","id":"99"}
{"name":"pqr","id":"3333"}
I want to filter out the employees whose id exists in the given list, keeping only those whose id is not in it.
Expected Output:
{"name":"xyz","id":"10"} {"name":"z","id":"100"}
{"name":"m","id":"99"}
Existing Code:
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
empl = LOAD 'pathToFile' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (data:map[]);
output = FILTER empl BY data#'id' in (idList);
-- not working, states: A column needs to be projected from a relation for it to be used as a scalar
output = FILTER empl BY data#'id' in (idList#id);
-- not working, states: mismatched input 'id' expecting set null
JsonLoader is built into Pig (0.10 and later), and you can specify the schema:
empl = LOAD 'pathToFile' USING JsonLoader('name:chararray, id:chararray');
DUMP empl;
(abc,1111)
(xyz,10)
(z,100)
(m,99)
(pqr,3333)
You're loading idList as a one-column table of type chararray, but your file holds a comma-separated list on a single line.
Either load it as a one-column table (this implies modifying your file so there is only one record per line):
idList = LOAD 'pathToFile' USING PigStorage(',') AS (id:chararray);
DUMP idList;
(1111)
(2222)
(3333)
(4444)
or keep it as a one-line file. In that case, change the separator so the line isn't split into columns (otherwise only the first column would be loaded), then tokenize it:
idList = LOAD 'pathToFile' USING PigStorage(' ') AS (id:chararray);
idList = FOREACH idList GENERATE FLATTEN(TOKENIZE(id, '[,]')) AS id;
DUMP idList;
(1111)
(2222)
(3333)
(4444)
Now we can do a LEFT JOIN to see which ids are not present in idList, and then a FILTER to keep only those. Note that output is a reserved keyword, so you shouldn't use it as an alias:
res = JOIN empl BY id LEFT, idList BY id;
res = FILTER res BY idList::id IS NULL;
DUMP res;
(xyz,10,)
(m,99,)
(z,100,)
I have the Pig code below for this sample file.
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the above file using the Pig LOAD command, then loop through it and get the first three fields as follows.
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
Output of each1:
((001,Rajiv,Reddy)) etc.
Now I want to get the first field of each1, i.e. the id. How do I get it? I tried the code below, but it shows an error:
each2 = foreach each1 generate(students.id)
I need to get the first field of the each1 relation.
The extra parentheses in the each1 statement wrap the three fields in a tuple. Simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
And you will get something like:
(001,Rajiv,Reddy)
For the each2 relation, you can get any field of each1 without the students qualifier. Use the field name or the field position, like this:
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
And you will get something like:
(001)
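If you would rather keep the parentheses, each row of each1 holds a single tuple field, and the id can still be reached by dereferencing that tuple. A minimal sketch, reusing the LOAD statement from the question:

```pig
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') AS (id:int, fname:chararray, lname:chararray, age:int, mob:chararray, city:chararray);

-- The parentheses build one tuple field per row: ((id,fname,lname))
each1 = FOREACH students GENERATE (id, fname, lname);

-- $0 is the tuple itself; $0.$0 is the first field inside it (the id).
each2 = FOREACH each1 GENERATE $0.$0;
```

Dropping the parentheses is usually simpler, since the fields then stay at the top level of the relation.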
I'm trying to sort by salary in descending order, but the output is not correct. I'm running Pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
I need the email and the salary (in descending order), so here is what I did:
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I leave out B = foreach A generate $1,$2; and proceed, the output is as expected.
Any suggestions?
Cast the bytearray to int and then order.
Try this code:
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)
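Put together, the typed load might look like the sketch below. The field names letter, email and salary come from the schema suggested above; the path is the one from the question.

```pig
-- Declaring salary as int makes ORDER compare numbers rather than raw bytes.
A = LOAD '/local_input_path' USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int);
B = FOREACH A GENERATE email, salary;
C = ORDER B BY salary DESC;
DUMP C;
```

Referring to fields by name instead of $1 and $2 also keeps the script readable if the column order ever changes.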
I want to define an array of user ids in Pig and then filter out data if the userId from the input is NOT in that array.
How do I do this in Pig Latin? Below is an example of what I intend to do.
Thanks
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null and useriD in ('2be2df06-f4ba-4d87-8938-09d867d3f2fe','ac1ac6bf-d151-49fc-8c7c-2b52d2efbb58','f00aec16-36e5-46ae-b7cb-a0f1eeefe609','258890f9-102a-4f8e-a001-ae24d2e25269','cf221779-a077-472c-b377-cca4a9230e1b');
Thanks Murali. I tried the approach of declaring a variable and then using FLATTEN and STRSPLIT to join. However, I get the following error:
Syntax error, unexpected symbol at or near 'flatteneduserids'
%declare REQUIRED_USER_IDS 'xxxxx,yyyyy,sssss' ;
inputData = load '$INPUT' USING PigStorage('|') AS (useriD:chararray,controllerAction:chararray,url:chararray,browserName:chararray,IsMobile:chararray,exceptionDetails:chararray,renderTime:int,serviceHostId:int,auditEventTime:chararray);
filteredInput = filter inputData by controllerAction is not null and auditEventTime is not null and serviceHostId is not null and renderTime is not null;
flatteneduserids = FLATTEN(STRSPLIT('$REQUIRED_USER_IDS',',')) AS (uid:chararray);
useridfilter = JOIN filteredInput BY useriD, flatteneduserids BY uid USING 'replicated';
So I tried another way of declaring flatteneduserids, which results in the error Undefined alias: IN
flatteneduserids = FOREACH IN GENERATE FLATTEN(STRSPLIT('$REQUIREDUSERIDS',',')) AS (uid:chararray);
I had a similar use case. I tried declaring the constant value with %declare and accessing it inside the IN clause, but was not able to achieve the objective. (Refer to: Declare a comma separated string constant)
A thought worth contemplating:
If the values inside the IN clause are static, reference, or metadata-like, I would suggest putting them in a static file. We can then read that file at run time and do an inner join with the input data to retrieve the matching records.
input_data = LOAD '$INPUT' USING PigStorage('|') AS (user_id:chararray ...);
static_data = LOAD ... AS (req_user_id:chararray);
required_data = JOIN input_data BY user_id, static_data BY req_user_id USING 'replicated';
required_data_fmt = -- project required fields.
I was not able to figure out how to do this in memory.
So, as per Murali's suggestion, I added the user ids to a file, loaded the file and then did a join. That worked as expected for me.
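For the NOT-in-the-list case from the original question, the same file-based approach can be turned into an anti-join: a left outer join followed by a null check. The sketch below rests on assumptions: user_ids.txt is a hypothetical file with one user id per line, and the final projection keeps only three of the fields for illustration.

```pig
-- Assumption: 'user_ids.txt' is a hypothetical file holding one user id per line.
user_ids = LOAD 'user_ids.txt' AS (uid:chararray);

inputData = LOAD '$INPUT' USING PigStorage('|') AS (useriD:chararray, controllerAction:chararray, url:chararray, browserName:chararray, IsMobile:chararray, exceptionDetails:chararray, renderTime:int, serviceHostId:int, auditEventTime:chararray);

-- Keep every input row; uid is null where the user id is not in the file.
joined = JOIN inputData BY useriD LEFT OUTER, user_ids BY uid USING 'replicated';

-- Rows whose useriD is NOT in the list (rows with a null useriD also land here).
notListed = FILTER joined BY user_ids::uid IS NULL;

-- Drop the join column and restore plain field names (only three fields shown).
result = FOREACH notListed GENERATE inputData::useriD AS useriD, inputData::controllerAction AS controllerAction, inputData::renderTime AS renderTime;
```

The 'replicated' hint loads the right-hand relation into memory, so it suits a short id list like this one. For the matching rows instead, a plain inner join as in the answer above is enough.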
I have an input file that includes ID, CITY and COUNT information as below, and I want to create a CSV file that lists each ID with its count for each CITY. The count should be written as '0' if the ID has no record for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN, but couldn't get it to produce this sample output.
How can I write a pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the script below to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate
    FLATTEN($1),
    FLATTEN((IsEmpty($2.$2) ? {('0')} : $2.$2)),
    FLATTEN((IsEmpty($3.$2) ? {('0')} : $3.$2)),
    FLATTEN((IsEmpty($4.$2) ? {('0')} : $4.$2)),
    FLATTEN((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');