Optimizing a Pig script

I want to execute a Pig script from an embedded Java program. For the moment, I am running Pig in local mode. My data file is only around 15 MB, but the script takes a very long time to execute, so I think it needs optimizing.
My script:
A = LOAD 'data' USING PigPrismeLoader('data.xml');
filter_response_time_less_than_1_s = FILTER A BY (response_time < 1000.0);
filter_response_time_between_1_s_and_2_s = FILTER A BY (response_time >= 1000.0 AND response_time < 2000.0);
filter_response_time_between_greater_than_2_s = FILTER A BY (response_time >= 2000.0);
star__zne_asfo_access_log = FOREACH ( COGROUP A BY (date_day,url,date_minute,ret_code,serveur), filter_response_time_between_greater_than_2_s BY (date_day,url,date_minute,ret_code,serveur), filter_response_time_less_than_1_s BY (date_day,url,date_minute,ret_code,serveur), filter_response_time_between_1_s_and_2_s BY (date_day,url,date_minute,ret_code,serveur) )
{
GENERATE
FLATTEN(group) AS (date_day,zne_asfo_url,date_minute,zne_http_code,zne_asfo_server),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymd = FOREACH ( COGROUP A BY (date_day,date_year,date_month), filter_response_time_between_greater_than_2_s BY (date_day,date_year,date_month), filter_response_time_less_than_1_s BY (date_day,date_year,date_month), filter_response_time_between_1_s_and_2_s BY (date_day,date_year,date_month) )
{
GENERATE
FLATTEN(group) AS (date_day,date_year,date_month),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymd_ret_url = FOREACH ( COGROUP A BY (date_day,url,date_year,date_month), filter_response_time_between_greater_than_2_s BY (date_day,url,date_year,date_month), filter_response_time_less_than_1_s BY (date_day,url,date_year,date_month), filter_response_time_between_1_s_and_2_s BY (date_day,url,date_year,date_month) )
{
GENERATE
FLATTEN(group) AS (date_day,zne_asfo_url,date_year,date_month),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymd_ret_code = FOREACH ( COGROUP A BY (date_day,ret_code,date_year,date_month), filter_response_time_between_greater_than_2_s BY (date_day,ret_code,date_year,date_month), filter_response_time_less_than_1_s BY (date_day,ret_code,date_year,date_month), filter_response_time_between_1_s_and_2_s BY (date_day,ret_code,date_year,date_month) )
{
GENERATE
FLATTEN(group) AS (date_day,zne_http_code,date_year,date_month),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymd_ret_url_server = FOREACH ( COGROUP A BY (date_day,url,date_year,date_month,serveur), filter_response_time_between_greater_than_2_s BY (date_day,url,date_year,date_month,serveur), filter_response_time_less_than_1_s BY (date_day,url,date_year,date_month,serveur), filter_response_time_between_1_s_and_2_s BY (date_day,url,date_year,date_month,serveur) )
{
GENERATE
FLATTEN(group) AS (date_day,zne_asfo_url,date_year,date_month,zne_asfo_server),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymd_ret_code_server = FOREACH ( COGROUP A BY (date_day,ret_code,date_year,date_month,serveur), filter_response_time_between_greater_than_2_s BY (date_day,ret_code,date_year,date_month,serveur), filter_response_time_less_than_1_s BY (date_day,ret_code,date_year,date_month,serveur), filter_response_time_between_1_s_and_2_s BY (date_day,ret_code,date_year,date_month,serveur) )
{
GENERATE
FLATTEN(group) AS (date_day,zne_http_code,date_year,date_month,zne_asfo_server),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymdi_server = FOREACH ( COGROUP A BY (date_day,date_minute,date_year,date_month,serveur), filter_response_time_between_greater_than_2_s BY (date_day,date_minute,date_year,date_month,serveur), filter_response_time_less_than_1_s BY (date_day,date_minute,date_year,date_month,serveur), filter_response_time_between_1_s_and_2_s BY (date_day,date_minute,date_year,date_month,serveur) )
{
GENERATE
FLATTEN(group) AS (date_day,date_minute,date_year,date_month,zne_asfo_server),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymdhi_url = FOREACH ( COGROUP A BY (date_day,url,date_minute,date_year,date_month), filter_response_time_between_greater_than_2_s BY (date_day,url,date_minute,date_year,date_month), filter_response_time_less_than_1_s BY (date_day,url,date_minute,date_year,date_month), filter_response_time_between_1_s_and_2_s BY (date_day,url,date_minute,date_year,date_month) )
{
GENERATE
FLATTEN(group) AS (date_day,zne_asfo_url,date_minute,date_year,date_month),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
agg__zne_asfo_access_log_ymdhi = FOREACH ( COGROUP A BY (date_day,date_minute,date_year,date_month), filter_response_time_between_greater_than_2_s BY (date_day,date_minute,date_year,date_month), filter_response_time_less_than_1_s BY (date_day,date_minute,date_year,date_month), filter_response_time_between_1_s_and_2_s BY (date_day,date_minute,date_year,date_month) )
{
GENERATE
FLATTEN(group) AS (date_day,date_minute,date_year,date_month),
(long)SUM((bag{tuple(long)})A.response_time) AS response_time,
COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
COUNT(A) AS nb_hit;
};
STORE star__zne_asfo_access_log INTO 'star__zne_asfo_access_log' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymd INTO 'agg__zne_asfo_access_log_ymd' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymd_ret_url INTO 'agg__zne_asfo_access_log_ymd_ret_url' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymd_ret_code INTO 'agg__zne_asfo_access_log_ymd_ret_code' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymd_ret_url_server INTO 'agg__zne_asfo_access_log_ymd_ret_url_server' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymd_ret_code_server INTO 'agg__zne_asfo_access_log_ymd_ret_code_server' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymdi_server INTO 'agg__zne_asfo_access_log_ymdi_server' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymdhi_url INTO 'agg__zne_asfo_access_log_ymdhi_url' USING PigStorage('\t', '-schema');
STORE agg__zne_asfo_access_log_ymdhi INTO 'agg__zne_asfo_access_log_ymdhi' USING PigStorage('\t', '-schema');
Any ideas?

Your script might need optimization, but as said in the comments, this is a tiny speck of data for Hadoop.
Hadoop does not perform well on such small data sets (even up to gigabytes).
This is because Hadoop is designed to process massive amounts of data, and its processing framework is complex and takes time to set up. On a large data set this setup time is negligible, but if you're working with 15 MB of data, setting up the framework can take much longer than actually processing the data.
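Separately from the framework overhead, the script filters A three times and then co-groups four relations for every output. As a rough illustration only (plain Python rather than Pig, with made-up records and a shortened key of just date_day and url), all of the counters for one grouping can be accumulated in a single pass. The field values and the 2000.0 upper bound of the middle bucket are assumptions for the sketch.

```python
from collections import defaultdict

# Hypothetical records: (date_day, url, response_time in ms).
records = [
    ("01", "/home", 350.0),
    ("01", "/home", 1500.0),
    ("01", "/home", 2400.0),
    ("02", "/home", 120.0),
]


def aggregate(rows):
    """Accumulate every counter for a (date_day, url) key in one pass."""
    totals = defaultdict(lambda: {
        "response_time": 0.0,
        "lt_1s": 0,
        "between_1s_2s": 0,
        "ge_2s": 0,
        "nb_hit": 0,
    })
    for date_day, url, response_time in rows:
        agg = totals[(date_day, url)]
        agg["response_time"] += response_time
        agg["nb_hit"] += 1
        # Each row falls into exactly one response-time bucket.
        if response_time < 1000.0:
            agg["lt_1s"] += 1
        elif response_time < 2000.0:
            agg["between_1s_2s"] += 1
        else:
            agg["ge_2s"] += 1
    return dict(totals)


result = aggregate(records)
print(result[("01", "/home")])
```

In Pig terms this would probably correspond to computing three 0/1 flag columns in one FOREACH and summing them after a single GROUP, instead of co-grouping four relations. Whether that helps on this data set would need to be measured.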

Related

Pig: how to parse tuple with variable number of elements?

This is my output file, which I wrote out with another Pig script:
1 3,5
2 4,6,7
I'm trying to parse each line as (chararray, tuple)
data = load 'test45' as (x:chararray, y:tuple());
But when I try to dump the tuples, they're empty:
rows = foreach data generate y;
()
()
Try this:
X = LOAD 'pigtuple.txt' AS (str:chararray);
X1 = FOREACH X GENERATE FLATTEN(STRSPLIT(str, '\\s+')) AS (id:int, attr:chararray);
X3 = FOREACH X1 GENERATE id, STRSPLIT(attr, ',') AS (y:tuple());
X4 = foreach X3 GENERATE id,y;
dump X4;
If you want to access each element in the tuple:
X4 = foreach X3 GENERATE y.$0,y.$1;

How to calculate the sum using pig script

I am getting an error while running the command below:
Y = FOREACH X GENERATE ('entry1',(chararray)($0 matches '.*entry1.*'? 1:0)) as t1,('entry2',(chararray)($0 matches '.*entry2.*'?1:0)) as t2,('entry3', (chararray)($0 matches '.*entry3.*'?1:0)) as t3,('entry4',(chararray)($0 matches '.*entry4.*'?1:0)) as t4;
UPDATE: full code
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER(line))) as word;
C = FOREACH B GENERATE ((word matches '.*entry1.*'? 1:0)) as t1,((word matches '.*entry2.*'?1:0)) as t2,((word matches '.*entry3.*'?1:0)) as t3,((word matches '.*entry4.*'?1:0)) as t4;
D = GROUP C ALL;
E = FOREACH D GENERATE FLATTEN(TOBAG(CONCAT('entry1',' ',(chararray)SUM(C.t1)),CONCAT('entry2',' ',(chararray)SUM(C.t2)),CONCAT('entry3',' ',(chararray)SUM(C.t3)),CONCAT('entry4',' ',(chararray)SUM(C.t4))));
DUMP E;
Output:
(entry1 2)
(entry2 0)
(entry3 2)
(entry4 1)
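The idea in the script above is to turn each word into one 0/1 flag per pattern and then sum the flags. A minimal Python model of that logic is sketched here; the input lines are invented (the original input file is not shown), and splitting on whitespace is a simplification of what TOKENIZE does.

```python
import re

# Hypothetical input lines; the original input file is not shown.
lines = ["Entry1 foo entry3", "entry1 entry4 bar", "xx-entry3-yy"]
patterns = ["entry1", "entry2", "entry3", "entry4"]


def count_matches(lines, patterns):
    # Lowercase and split on whitespace (a simplification of TOKENIZE).
    words = [w for line in lines for w in line.lower().split()]
    # One 0/1 flag per word and pattern, summed per pattern.
    # fullmatch mirrors Pig's `matches`, which must match the whole string.
    return {
        p: sum(1 if re.fullmatch(".*" + re.escape(p) + ".*", w) else 0
               for w in words)
        for p in patterns
    }


counts = count_matches(lines, patterns)
for name, total in counts.items():
    print(name, total)
```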

How to group by index and index+1 in Pig

I have data input like this:
(index,x,y)
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
...
How can I group by [index] and [index + 1], like this:
{(1, 0.0, 0.0), (2, -0.1, -0.1)}
{(2, -0.1, -0.1), (3, 1.0, -2.2)}
...
Please help me through this. Thanks.
The below approach will work for your case.
input:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(index:int,x:double,y:double);
B = FILTER A BY index>=1;
C = FILTER A BY index>1;
D = FOREACH C GENERATE ($0-1) AS dindex,index,x,y;
E = JOIN B BY index, D BY dindex;
F = FOREACH E GENERATE TOBAG(TOTUPLE(B::index,B::x,B::y),TOTUPLE(D::index,D::x,D::y));
DUMP F;
Output:
({(1,0.0,0.0),(2,-0.1,-0.1)})
({(2,-0.1,-0.1),(3,1.0,-2.2)})
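To check the expected result on a small sample, the pairing can be modelled in plain Python. This is only a sketch of what the join computes, and it assumes that each index occurs at most once.

```python
# Same sample rows as the input above: (index, x, y).
rows = [(1, 0.0, 0.0), (2, -0.1, -0.1), (3, 1.0, -2.2)]


def consecutive_pairs(rows):
    """Pair each row with the row whose index is one higher, if any."""
    by_index = {row[0]: row for row in rows}  # assumes unique indexes
    pairs = []
    for index in sorted(by_index):
        following = by_index.get(index + 1)
        if following is not None:
            pairs.append((by_index[index], following))
    return pairs


pairs = consecutive_pairs(rows)
for pair in pairs:
    print(pair)
```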
You can use the following query (explanation in comments).
-- load relation
R = LOAD 'data.txt' USING PigStorage(',') AS (index,x,y);
-- project each tuple to 2 different keys
-- one with index and one with index+1
R1 = FOREACH R GENERATE index+0, index, x, y;
R2 = FOREACH R GENERATE index+1, index, x, y;
-- group
result = COGROUP R1 by $0, R2 by $0;
-- clean out wrong combinations
result2 = filter result by NOT(IsEmpty(R1)) and NOT(IsEmpty(R2));
-- flatten the results
result3 = FOREACH result2 GENERATE FLATTEN(R1), FLATTEN(R2);
result4 = FOREACH result3 GENERATE (R1::index,R1::x,R1::y), (R2::index,R2::x,R2::y);
The file I used to test contains the following:
1,0.0,0.0
2,-0.1,-0.1
3,1.0,-2.2
Note that the file contains no parentheses; if your input has them, you can strip them with a simple preprocessing script.
The dumps of intermediate results are:
DUMP R;
(1,0.0,0.0)
(2,-0.1,-0.1)
(3,1.0,-2.2)
DUMP R1;
(1,1,0.0,0.0)
(2,2,-0.1,-0.1)
(3,3,1.0,-2.2)
DUMP R2;
(2,1,0.0,0.0)
(3,2,-0.1,-0.1)
(4,3,1.0,-2.2)
DUMP result;
(1,{(1,1,0.0,0.0)},{})
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
(4,{},{(4,3,1.0,-2.2)})
DUMP result2;
(2,{(2,2,-0.1,-0.1)},{(2,1,0.0,0.0)})
(3,{(3,3,1.0,-2.2)},{(3,2,-0.1,-0.1)})
DUMP result3;
(2,2,-0.1,-0.1,2,1,0.0,0.0)
(3,3,1.0,-2.2,3,2,-0.1,-0.1)
DUMP result4;
((2,-0.1,-0.1),(1,0.0,0.0))
((3,1.0,-2.2),(2,-0.1,-0.1))

Limit but not Order in PIG

I have run into a problem while using LIMIT in Pig.
The result of LIMIT comes back sorted, but I don't want the result to be sorted.
From the example on the website:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
Using Limit
X = LIMIT A 3;
DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Is it possible to show the top 3 lines without sorting the result?
(1,2,3)
(4,2,1)
(8,3,4)
My code is below:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = FOREACH C {
topnresult = LIMIT B $lines;
GENERATE FLATTEN(topnresult);
}
dump D;
Thank you very much.
LIMIT makes no guarantee about which tuples it returns or in what order, so the result can look sorted and need not be the first lines of the input. There are several ways to get the first lines instead; one option is:
input.txt
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
PigScript:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = RANK A;
C = FILTER B BY rank_A<=3;
D = FOREACH C GENERATE a1,a2,a3;
DUMP D;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
Option 2:
A = LOAD 'input.txt' AS (a1:int,a2:int,a3:int);
B = GROUP A ALL;
C = FOREACH B {
top3list = LIMIT A 3;
GENERATE FLATTEN(top3list);
}
DUMP C;
Output:
(1,2,3)
(4,2,1)
(8,3,4)
UPDATE: Solution 1:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = GROUP C ALL;
E = FOREACH D {
topnresult = LIMIT C $lines;
GENERATE FLATTEN(topnresult);
}
DUMP E;
Solution 2:
A = LOAD '$input';
B = foreach A generate $s_field;
C = FILTER B BY $pattern;
D = RANK C;
E = FILTER D BY rank_C<=$lines;
F = FOREACH E GENERATE $1..;
DUMP F;
I have tested the solution using the command line below and it works fine:
>pig -x local -param input='input.txt' -param s_field='$0,$1,$2' -param pattern='$0<10' -param lines=3 myscript.pig
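Both solutions aim for the same thing: the first $lines rows that pass the filter, in file order. A small Python model of that expected result can be handy for checking the Pig output on a sample. The rows are the ones from input.txt, and the predicate stands in for the pattern parameter used on the command line.

```python
from itertools import islice

# Rows of input.txt, in file order.
rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]


def first_n(rows, keep, n):
    """Return the first n rows that satisfy keep, preserving input order."""
    return list(islice((row for row in rows if keep(row)), n))


# Equivalent of pattern='$0<10' and lines=3.
top = first_n(rows, lambda row: row[0] < 10, 3)
print(top)
```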

Random selection in pig after doing group BY

I have a question. I have data in the format id:int, name:chararray
1, abc
1, def
2, ghi
2, mno
2, pqr
After that I do a GROUP BY id and my data becomes
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, and I want output like
1, abc
2, mno
This is the case where we happened to pick the first tuple for 1 and the second tuple for 2.
Any idea how this can be done?
Specifically, I have grouped data B:
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
C = FOREACH B GENERATE FLATTEN($1)
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
rand =
FOREACH B {
shuf_ = FOREACH C GENERATE RANDOM() AS r, *; -- line L
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1);
};
I get an error at line L: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
Use a nested foreach. Assign each item in the bag a random value, order by that value, and choose the first one to keep. You can make it more compact than this, but this shows you each idea.
Script:
data = LOAD 'tmp/data.txt' AS (f1:int, f2:chararray);
grpd = GROUP data BY f1;
rand =
FOREACH grpd {
shuf_ = FOREACH data GENERATE f2, RANDOM() AS r;
shuf = ORDER shuf_ BY r;
pick1 = LIMIT shuf 1;
GENERATE
group,
FLATTEN(pick1.f2);
};
DUMP rand;
Output:
(1,abc)
(2,ghi)
Running it again:
(1,abc)
(2,pqr)
And again:
(1,def)
(2,pqr)
One more time!
(1,abc)
(2,ghi)
Whee!
(1,def)
(2,mno)
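The same three steps (tag each item with a random value, order by it, keep the first) can be modelled in plain Python to see what kind of result to expect. This is only a sketch on the sample data from the question; the function name and the injected random generator are illustrative choices.

```python
import random
from collections import defaultdict

# Sample data from the question: (id, name).
data = [(1, "abc"), (1, "def"), (2, "ghi"), (2, "mno"), (2, "pqr")]


def pick_one_per_group(rows, rng):
    """Pick one random value per key by ordering on a random tag."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    picks = {}
    for key, values in groups.items():
        # Tag each value with a random number, order by it, keep the first.
        shuffled = sorted(values, key=lambda _: rng.random())
        picks[key] = shuffled[0]
    return picks


picks = pick_one_per_group(data, random.Random())
for key, value in sorted(picks.items()):
    print(key, value)
```

Passing a seeded generator such as random.Random(42) makes a run repeatable, which is convenient in tests.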