computing average in pig - apache-pig

I have data in the format
1,1.2
2,1.3
and so on.
Each row is an id, val pair, where id is unique.
I want to calculate the average of all the values, so here avg(1.2, 1.3).
I was going through the documentation, but most of the aggregation examples group by some id and then use AVG. Since the id is unique, how do I group the rows?
The outcome should be a single float.
Any suggestions will be greatly appreciated.
Thanks

GROUP X ALL should solve your problem :)
A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:float);
B = GROUP A ALL;
AV = FOREACH B GENERATE AVG(A.f2);
DUMP AV;
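To make it clearer what GROUP ... ALL does, here is a small Python sketch of the same idea (not Pig itself). The sample rows are the two from the question, and the variable names are just for illustration.

```python
from statistics import mean

# The two sample rows from the question, as (id, val) pairs.
rows = [(1, 1.2), (2, 1.3)]

# GROUP ... ALL puts every row into one single group keyed by "all",
# so no grouping column is needed.
grouped = {"all": rows}

# AVG over the val column of that one group yields a single number.
average = mean(val for _id, val in grouped["all"])
print(average)
```

Because there is only one group, the result is one row holding one float, which is what the question asks for.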

Related

SQL Getting first value from a column and duplicate it in a new column

Hi guys, first thank you for reading and for your potential help.
I'm a beginner in Standard SQL and I'm trying to do something, but I'm stuck.
As you can see in the picture, I have some products with the same item_group_id.
For these products, I want to take the FIRST declinaison value and give it to the other products having the same item_group_id, in a new column.
To be clearer, I will give the example for the products I circled.
This is what I'm trying to get:
sku | Declinaison | item_group_id | NEW_COLUMN
195810 | ...multi dimensional sophistiqué_10 | P195800 | ...multi dimensional sophistiqué_10
195820 | ...multi dimensional sophistiqué_20 | P195800 | ...multi dimensional sophistiqué_10
Thank you so much for your help
A way to achieve this could be using a JOIN clause to reference the same table twice. However, this approach is not recommended as it computes many more rows than needed.
Using an analytic function such as FIRST_VALUE is the recommended approach:
SELECT
sku, declinaison, item_group_id,
FIRST_VALUE(declinaison) OVER (PARTITION BY item_group_id ORDER BY sku) AS NEW_COLUMN
FROM
TABLE_NAME
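If it helps to see what the window function computes, here is a rough Python emulation of FIRST_VALUE(declinaison) OVER (PARTITION BY item_group_id ORDER BY sku). The rows and the helper name with_first_value are made up for illustration; only the column names come from the question.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (sku, declinaison, item_group_id).
rows = [
    (195820, "decl_20", "P195800"),
    (195810, "decl_10", "P195800"),
    (200010, "decl_A", "P200000"),
]

def with_first_value(rows):
    """Append, to each row, the declinaison of the lowest sku in its group."""
    result = []
    # Sorting by (item_group_id, sku) mirrors PARTITION BY ... ORDER BY sku.
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))
    for _group_id, members in groupby(ordered, key=itemgetter(2)):
        members = list(members)
        first = members[0][1]  # declinaison of the first row in the partition
        result.extend(row + (first,) for row in members)
    return result

for row in with_first_value(rows):
    print(row)
```

Each row is visited once per partition, which is why this scales better than joining the table to itself.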

Converting apache pig to hive

I'm trying to figure out what the FLATTEN calls in this grouped FOREACH are doing. I have been working on the code below off and on for a few days, trying to convert it to Hive, and I just don't get it. Normally, FLATTEN is used to create multiple rows for two or more columns that should have the same name in the output, but in this case I'm not sure what it does or how to replicate it in Hive. Any assistance would be greatly appreciated, as I don't have much time to work on this and I'm expected to complete and test it in the next couple of weeks. Thanks.
Change_pop = GROUP IPChange_pop BY (acct_num,strategy_code);
Oldest_GLChange = FOREACH Change_pop {
    OList = ORDER IPChange_pop BY process_date ASC, new_loc DESC;
    Oldest = LIMIT OList 1;
    GENERATE
        FLATTEN(IPChange_pop) as (email,acct_num,acct_nm,cust_num,type,strategy_code,process_date,last_5,cmGroup,current_loc,new_loc,update_ts),
        FLATTEN(group.strategy_code) as grp_strategy_code,
        FLATTEN(Oldest.process_date) as early_process_date, FLATTEN(Oldest.new_loc) as early_new_loc;
};
FLATTEN is being used here to un-nest the bags and tuples produced by the GROUP. Off the top of my head, the Hive equivalent would be the EXPLODE() function used together with LATERAL VIEW.
https://pig.apache.org/docs/latest/basic.html#flatten
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
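As a rough illustration of what the nested FOREACH in the question appears to compute (this is my reading of the script, not a verified translation), here is a small Python sketch. The sample rows and values are made up, and only the fields the nested block touches are included.

```python
from collections import defaultdict

# Hypothetical rows; the real relation has more columns.
rows = [
    {"acct_num": "A1", "strategy_code": "S1", "process_date": "2020-02-01", "new_loc": "X"},
    {"acct_num": "A1", "strategy_code": "S1", "process_date": "2020-01-01", "new_loc": "Y"},
    {"acct_num": "A1", "strategy_code": "S1", "process_date": "2020-01-01", "new_loc": "Z"},
]

# GROUP ... BY (acct_num, strategy_code)
groups = defaultdict(list)
for row in rows:
    groups[(row["acct_num"], row["strategy_code"])].append(row)

flattened = []
for (_acct_num, strategy_code), members in groups.items():
    # ORDER ... BY process_date ASC, new_loc DESC, done as two stable sorts.
    ordered = sorted(members, key=lambda r: r["new_loc"], reverse=True)
    ordered = sorted(ordered, key=lambda r: r["process_date"])
    oldest = ordered[0]  # LIMIT 1
    # FLATTEN of the grouped bag emits one output row per input row;
    # the single-value FLATTENs attach the same values to each of them.
    for member in members:
        flattened.append({
            **member,
            "grp_strategy_code": strategy_code,
            "early_process_date": oldest["process_date"],
            "early_new_loc": oldest["new_loc"],
        })

for out_row in flattened:
    print(out_row)
```

Every input row comes back out, with the group's strategy code and the oldest row's process_date and new_loc attached. If that reading is right, a Hive window function such as FIRST_VALUE over the same partition and ordering may be worth trying alongside EXPLODE; treat that as an assumption to check.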

How to get the first tuple inside a bag when Grouping

I do not understand how to deal with duplicates when generating my output. I end up with several duplicates, but I want only one.
I've tried using LIMIT, but I suppose that only applies when selecting rows. I also tried DISTINCT, but I guess that is the wrong scenario.
grouped = GROUP wantedTails BY tail_number;
smmd = FOREACH grouped GENERATE wantedTails.tail_number as Tails, SUM(wantedTails.distance) AS totaldistance;
So for my grouped relation, I got something like this (not the whole output):
({(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB)},44550)
but I expect (N983JB,44550). How can I delete those duplicates generated during grouping? Thank you!
The way I see it, there are two ways to de-duplicate data in Pig.
A less flexible but convenient way is to use the group key itself, which is already unique after a GROUP BY, and to apply MAX to any other columns that need to be de-duplicated. Apply SUM only if you want to add up values across the duplicates:
dataWithDuplicates = LOAD '<path_to_data>';
grouped = GROUP dataWithDuplicates BY tail_number;
dedupedData = FOREACH grouped GENERATE
    -- Since you have grouped on tailNumber, it is already de-duped
    group AS tailNumber,
    MAX(dataWithDuplicates.distance) AS dedupedDistance,
    SUM(dataWithDuplicates.distance) AS totalDistance;
If you want more flexibility while de-duping, you can use a nested FOREACH in Pig. This question captures the gist of its usage: how to delete the rows of data which is repeating in Pig. Another reference for nested FOREACH: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html
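To show why projecting the group key gives one row per tail number, here is a small Python sketch of the group-and-sum step. The flight rows are made up; only the field names follow the question.

```python
from collections import defaultdict

# Hypothetical (tail_number, distance) rows with repeated tail numbers.
flights = [("N983JB", 1000), ("N983JB", 2500), ("N12345", 300)]

# Grouping by tail_number: each key appears once, however many rows it had.
totals = defaultdict(int)
for tail_number, distance in flights:
    totals[tail_number] += distance

# One (tail_number, totaldistance) pair per group, like (group, SUM(...)).
result = sorted(totals.items())
print(result)
```

The duplicates in the question come from projecting the bag column (wantedTails.tail_number) instead of the key; the key is already a single value per group.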

Apache Pig - Store/Flatten Bag so it can output as CSV

Not a great question title, I admit.
Here's my problem: I have the following output from a query, where each row looks like this:
{(570349476329862),(570349476329862),(570349476329862)} {(66638102521614253348753),(66638102521614253348753),(66638102521614253348753)} 3
The schema of the above is:
{{(ID1:chararray)},{(ID2:chararray)},COUNT:long}
What I'm trying to do is generate output in a CSV format so that it can be easily ingested into a database, e.g. turn the above into:
570349476329862,66638102521614253348753,3
I think I want to flatten the bags, but although my attempt 'compiles', it doesn't run.
Any ideas welcome.
Thanks
If every tuple in a bag holds the same value, e.g. as the result of a group, you can do two things:
Include the given fields in the grouping, then you won't need to deal with the bags at all
...
B = FOREACH (GROUP A BY (COUNT, ID1, ID2))
GENERATE FLATTEN(group) AS (COUNT, ID1, ID2),
...
Or use an inbuilt function e.g. MAX
...
B = FOREACH (GROUP A BY COUNT) GENERATE
FLATTEN(group) AS COUNT,
MAX(A.ID1) AS ID1,
MAX(A.ID2) AS ID2,
...
The benefit of this, compared with the suggested datafu function, is that you can do it with a built-in function.
I hope this helps
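Here is a small Python sketch of the MAX idea, using the row from the question. It is only an emulation of what the Pig script would produce, and the helper name to_csv_line is made up.

```python
# The sample row from the question: a bag of ID1, a bag of ID2, and COUNT.
row = (
    [("570349476329862",)] * 3,
    [("66638102521614253348753",)] * 3,
    3,
)

def to_csv_line(row):
    """Collapse each bag to a single value and join the fields with commas."""
    id1_bag, id2_bag, count = row
    # MAX over a bag whose tuples all hold the same value returns that value.
    id1 = max(t[0] for t in id1_bag)
    id2 = max(t[0] for t in id2_bag)
    return ",".join([id1, id2, str(count)])

print(to_csv_line(row))
```

This only gives a sensible result when every tuple in a bag holds the same value, which is the case for the output shown in the question.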

Projecting Grouped Tuples in Pig

I have a collection of tuples of the form (t,a,b) that I want to group by b in Pig. Once grouped, I want to filter out b from the tuples in each group and generate a bag of filtered tuples per group.
As an example, assume we have
(1,2,1)
(2,0,1)
(3,4,2)
(4,1,2)
(5,2,3)
The pig script would produce
{(1,2),(2,0)}
{(3,4),(4,1)}
{(5,2)}
The question is: how do I go about producing this result? I'm used to seeing examples where aggregation operations follow a group by operation. It's less clear to me how to filter the tuples and return them in a bag. Thanks for your assistance!
Turns out what I was looking for is the syntax for nested projection in Pig.
If one has tuples of the form (t,a,b) and wants to drop b after the group by, it is done this way.
grouped = GROUP tups BY b;
result = FOREACH grouped GENERATE tups.(t,a);
See the "Nested Projection" section on the PigLatin page. http://wiki.apache.org/pig/PigLatin
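For anyone who wants to check the expected result, this Python sketch emulates the group followed by the nested projection on the sample tuples from the question. It is not Pig, just the same logic.

```python
from collections import defaultdict

# The (t, a, b) tuples from the question.
tups = [(1, 2, 1), (2, 0, 1), (3, 4, 2), (4, 1, 2), (5, 2, 3)]

# GROUP tups BY b, keeping only (t, a) from each tuple,
# which is what the projection tups.(t,a) does inside each bag.
grouped = defaultdict(list)
for t, a, b in tups:
    grouped[b].append((t, a))

# One bag of projected tuples per group, ordered by the group key.
result = [grouped[b] for b in sorted(grouped)]
for bag in result:
    print(bag)
```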