I have data files delimited by "|", so I'm using the code below.
RAW_LOG = LOAD 'logs.log' USING TextLoader as (line:chararray);
splt = foreach RAW_LOG generate FLATTEN(STRSPLIT($0, '\\|'));
id_vals = foreach splt generate $4 as uid, $8 as site_id , $9 as dsid , $6 as amt;
I want to SUM(amt) for each site_id. I have tried GROUP BY, but it didn't work.
I am assuming you want the end result to be two columns: site_id and the sum of amt for that site_id.
You can load a pipe-separated file directly using PigStorage; there is no need to load it and then split. It is good practice to provide a schema definition, though you can also access elements positionally with $.
Here is the code:
RAW_LOG = LOAD 'logs.log' USING PigStorage('|') as (//YOUR SCHEMA DEFINITION);
SITE_GRP = group RAW_LOG by site_id;
SITE_SUM = foreach SITE_GRP generate group, SUM(RAW_LOG.amt);
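For example, a minimal sketch assuming the file has ten pipe-delimited fields, with uid, amt, site_id, and dsid at the positions used in your STRSPLIT code (the other field names are placeholders):
RAW_LOG  = LOAD 'logs.log' USING PigStorage('|')
           AS (f0:chararray, f1:chararray, f2:chararray, f3:chararray,
               uid:chararray, f5:chararray, amt:double, f7:chararray,
               site_id:chararray, dsid:chararray);
SITE_GRP = GROUP RAW_LOG BY site_id;
SITE_SUM = FOREACH SITE_GRP GENERATE group AS site_id, SUM(RAW_LOG.amt) AS total_amt;
DUMP SITE_SUM;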
Hope this helps.
I have the below Pig code for a sample file.
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the above file using the Pig LOAD command and then loop through it to get a few fields, as follows.
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
Output of each1:
((001,Rajiv,Reddy)) etc.
Now I want to get the first field of each1, i.e. id. How do I get it? I tried the below code, but it shows an error:
each2 = foreach each1 generate(students.id)
I need to get the first field from the each2 relation.
The extra parentheses come from the parentheses in the GENERATE clause of the each1 relation; simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
And you will get something like:
(001,Rajiv,Reddy)
For the each2 relation, you can get any field of each1 without using the students qualifier; use the field name or the field position, like this:
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
And you will get something like:
(001)
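To see why the extra parentheses appeared in the first place, you can compare the two forms with DESCRIBE (a quick sketch; the relation names are just illustrative):
each1_tuple = foreach students generate (id, fname, lname); -- the parentheses wrap the three fields into a single tuple column
each1_flat  = foreach students generate id, fname, lname;   -- three separate columns
describe each1_tuple;
describe each1_flat;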
Let's say we have this scenario:
dataset1.csv :
datefield
field11, field12, field13
field21, field22, field23
field31, field32, field33
What is the best way to get this:
field11, field12, field13, datefield
field21, field22, field23, datefield
field31, field32, field33, datefield
I tried to generate one dataset (relation1) with just these columns (after loading and generating):
field11, field12, field13
field21, field22, field23
field31, field32, field33
and another one (relation2) with just this column (after loading and generating):
datefield
and then doing this:
finalResult = FOREACH dataset1 GENERATE UDFFunction1(relation1::f1) as firstFields, UDFFunction2(relation2::f2) as lastField
But I'm getting 'A column needs to be projected from a relation for it to be used as a scalar'
The problem is with the second field (the one with the datefield).
I would like to avoid joins, since that would be a messy workaround.
Any suggestions?
Please ignore my UDFs; they just format the input tuples accordingly.
Adding the Pig script:
register 's3://bucketName/lib/MyJar.jar';
define ParseOutFilesUDF packageName.ParseOutFiles;
define FormatTimestartedUDF packageName.FormatTimestarted;
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
--This UDF just reformats each tuple, adding a String to it and returning a new one.
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
--load the same csv again to get the TIMESTARTED field
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
--filter to get only one record, which is something like TIMESTARTED=20160101
filetered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
timestarted = foreach filetered GENERATE $0 as fechaStarted;
-- the FormatTimestartedUDF just gets rid of 'TIMESTARTED=' in order to get the date '20160101'
-- this FOREACH statement is where it fails with the 'A column needs to be projected...' error
finalResult = FOREACH outFile GENERATE f1, FormatTimestartedUDF(timestarted) as f2;
STORE finalResult INTO 's3://bucketName/output/';
You are getting the error because you are referencing f1, which doesn't exist in outFile, and because timestarted is a relation, not a field. A relation can only be used as a scalar by projecting a column from it, and it must contain exactly one row. Also, you should be using the fields of resultAll and filtered:
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
filtered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
finalResult = FOREACH resultAll GENERATE initial, FormatTimestartedUDF(filtered.$0);
STORE finalResult INTO 's3://bucketName/output/';
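If you prefer to skip the custom UDFs, a minimal sketch of the same scalar-projection idea (assuming the file contains one line like TIMESTARTED=20160101 plus pipe-delimited data lines) could look like this:
outFile  = LOAD 's3://bucketName/input/' USING PigStorage('|');
tsLines  = LOAD 's3://bucketName/input/' USING PigStorage('|') AS (f1:chararray);
tsOnly   = FILTER tsLines BY f1 matches '.*TIMESTARTED.*';
ts       = FOREACH tsOnly GENERATE REPLACE(f1, 'TIMESTARTED=', '') AS datefield;
-- keep only the data lines, then append the single datefield value to every row;
-- ts is used as a scalar here, so it must contain exactly one row
dataOnly = FILTER outFile BY NOT ((chararray)$0 matches '.*TIMESTARTED.*');
result   = FOREACH dataOnly GENERATE *, ts.datefield;
STORE result INTO 's3://bucketName/output/';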
I'm trying to sort by salary in descending order, but the output is not correct. I'm running Pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
As I need the email and salary (in descending order), here is what I did:
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I skip B = foreach A generate $1,$2; and proceed, the output is as expected.
Any suggestion on this?
Cast the bytearray to int and then order.
Try this code:
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
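With the cast in place, the dump should come out in numeric descending order, i.e. (f1#xyz.com,110000) first and (e#xyz.com,1000) last.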
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)
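For example, a sketch of the full script with that schema applied (the field names are assumed):
A = LOAD '/local_input_path' USING PigStorage(',')
    AS (letter:chararray, email:chararray, salary:int);
B = FOREACH A GENERATE email, salary;
C = ORDER B BY salary DESC;
DUMP C;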
I have an input file that includes ID, CITY, and COUNT information as below, and I want to create a CSV file with the ID and the count for each CITY. The count should be written as '0' if the ID has no count for that CITY. I tried to write a Pig script using GROUP BY, COGROUP, and FLATTEN, but couldn't get it to produce the sample output.
How can I write a Pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the below script to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate
    FLATTEN($1),
    FLATTEN((IsEmpty($2.$2) ? {('0')} : $2.$2)),
    FLATTEN((IsEmpty($3.$2) ? {('0')} : $3.$2)),
    FLATTEN((IsEmpty($4.$2) ? {('0')} : $4.$2)),
    FLATTEN((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
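This produces one row per ID with the counts in the order ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI; note that PigStorage does not write the header line itself, so add it separately if the output file needs one.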
I have a large file with a lot of columns, which I am loading like this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2.. ;
C = FOREACH B FILTER BY (name is NOT NULL);
I get an error that the projected field [name] does not exist. I don't want to address columns by doing $0, $1 and all that. How can I give them identifiers?
That Pig script doesn't run for me, but changing it to this:
A = LOAD '/path/to/file' USING PigStorage(',');
B = FOREACH A GENERATE $0 AS name, $1 as address, $2 as another;
C = FILTER B BY (name is NOT NULL);
does work.
A nested FOREACH is another option, but it operates on bags, so the relation has to be grouped first, for example:
grouped = GROUP B ALL;
C = FOREACH grouped {
    filtered_rec = FILTER B BY name IS NOT NULL;
    GENERATE FLATTEN(filtered_rec);
}