Pig SUM of a field from a bag does not work - apache-pig

Updated
The input is a JSON Lines text file (one JSON object per line).
{"store":"079","items":[{"name":"早晨全餐","unit_price":18,"quantity":1,"total":18},{"name":"麦趣鸡盒","unit_price":78,"quantity":5,"total":390},{"name":"巨无霸","unit_price":17,"quantity":5,"total":85},{"name":"香骨鸡腿","unit_price":12,"quantity":2,"total":24},{"name":"小薯条","unit_price":7,"quantity":5,"total":35}],"date":"\/Date(1483256820000)\/","oId":"27841ef9-f88e-478f-8f20-17c3ad090ebc"}
{"store":"041","items":[{"name":"小薯条","unit_price":7,"quantity":2,"total":14},{"name":"巨无霸","unit_price":17,"quantity":4,"total":68}],"date":"\/Date(1483221780000)\/","oId":"afee2e6d-0f81-4780-82e9-2169bf3c43f3"}
{"store":"008","items":[{"name":"奶昔香草","unit_price":9,"quantity":5,"total":45},{"name":"小薯条","unit_price":7,"quantity":2,"total":14}],"date":"\/Date(1483248600000)\/","oId":"802ea077-1eef-4cc9-af89-af7398e56792"}
I expect to group by store and calculate the sum of total over each store's items, for example:
store_name total_amount
_________________________
001 2212.26
002 3245.46
003 888888
My Pig script:
store_table = LOAD '/example/store-data/2017-store-sales-data.json'
USING JsonLoader('
store_name:chararray,
items: {(
name:chararray,
unit_price:Bigdecimal,
quantity:int,
total:Bigdecimal)
},
date:Datetime,
oId:chararray'
);
platten_table = foreach store_table generate flatten(items), store_name;
store_group = group platten_table by store_name;
result = foreach store_group {
total_sum = sum(platten_table.items::total);
Generate group,total_sum;
}
The Pig error is:
2017-11-28 08:53:54,357 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'Generate' expecting SEMI_COLON

Eval function names are case sensitive, so you need to write the eval function SUM in upper case.
Code snippet:
result = foreach store_group {
total_sum = SUM(platten_table.items::total);
Generate group,total_sum;
}
Reference: https://pig.apache.org/docs/r0.10.0/basic.html
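As a cross-check on what the corrected script should produce, here is a minimal Python sketch of the same flatten, group, SUM logic, run on the three sample lines from the question. The name `total_per_store` is hypothetical, and the sample lines are trimmed to the `store` and `total` fields as an assumption for brevity.

```python
import json
from collections import defaultdict

# Sample orders from the question, trimmed to the fields the sum needs.
LINES = [
    '{"store":"079","items":[{"total":18},{"total":390},{"total":85},{"total":24},{"total":35}]}',
    '{"store":"041","items":[{"total":14},{"total":68}]}',
    '{"store":"008","items":[{"total":45},{"total":14}]}',
]

def total_per_store(lines):
    """Flatten each order's items, group by store, and sum `total`."""
    sums = defaultdict(int)
    for line in lines:
        order = json.loads(line)
        for item in order["items"]:                 # plays the role of FLATTEN(items)
            sums[order["store"]] += item["total"]   # GROUP BY store, then SUM(total)
    return dict(sums)

print(total_per_store(LINES))
# {'079': 552, '041': 82, '008': 59}
```

If the Pig result for these three lines differs from these sums, the schema passed to JsonLoader is the first thing worth checking.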

Related

How to automate a field mapping using a table in snowflake

I have a one-column table in my Snowflake database that contains a JSON mapping structure like the following:
ColumnMappings : {"Field Mapping": "blank=Blank,E=East,N=North,"}
How can I write a query so that feeding the Field Mapping a value of E returns East, a value of N returns North, and so on, without hard-coding the values in the query the way a CASE statement does?
You really want your mapping in this JSON form:
{
"blank" : "Blank",
"E" : "East",
"N" : "North"
}
You can achieve that in Snowflake e.g. with a simple JS UDF:
create or replace table x(cm variant) as
select parse_json(*) from values('{"fm": "blank=Blank,E=East,N=North,"}');
create or replace function mysplit(s string)
returns variant
language javascript
as $$
res = S
.split(",")
.reduce(
(acc,val) => {
var vals = val.split("=");
acc[vals[0]] = vals[1];
return acc;
},
{});
return res;
$$;
select cm:fm, mysplit(cm:fm) from x;
-------------------------------+--------------------+
CM:FM | MYSPLIT(CM:FM) |
-------------------------------+--------------------+
"blank=Blank,E=East,N=North," | { |
| "E": "East", |
| "N": "North", |
| "blank": "Blank" |
| } |
-------------------------------+--------------------+
And then you can simply extract values by key with GET, e.g.
select cm:fm, get(mysplit(cm:fm), 'E') from x;
-------------------------------+--------------------------+
CM:FM | GET(MYSPLIT(CM:FM), 'E') |
-------------------------------+--------------------------+
"blank=Blank,E=East,N=North," | "East" |
-------------------------------+--------------------------+
For performance, you might want to make sure you call mysplit only once per value in your mapping table, or even pre-materialize it.
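The split-and-reduce step inside the UDF can be sketched outside Snowflake as well. The Python below is only an illustration of that parsing logic, not Snowflake code; `parse_mapping` is a hypothetical name, and skipping the empty segment left by the trailing comma is a choice made for this sketch.

```python
def parse_mapping(spec):
    """Turn 'k1=v1,k2=v2,' into a dict, skipping empty segments."""
    mapping = {}
    for pair in spec.split(","):
        if not pair:              # the trailing comma leaves an empty segment
            continue
        key, _, value = pair.partition("=")
        mapping[key] = value
    return mapping

mapping = parse_mapping("blank=Blank,E=East,N=North,")
print(mapping["E"])       # East
print(mapping.get("X"))   # None (unknown keys have no mapping)
```

Parsing once and reusing the resulting dictionary mirrors the advice to call mysplit only once per mapping value.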

Need help in converting SQL query to LINQ

I am new to LINQ and am stuck converting a SQL query to LINQ.
My SQL query is:
select COUNT(DISTINCT PAYER) as count,
PPD_COL FROM BL_REV
where BL_NO_UID = 1084
GROUP BY PPD_COL
The desired output is:
Count PPD_COL
12 P
20 C
I have written something like below in LINQ:
var PayerCount = from a in LstBlRev where a.DelFlg == "N"
group a by new { a.PpdCol} into grouping
select new
{
Count = grouping.First().PayerCustCode.Distinct().Count(),
PPdCol = (grouping.Key.PpdCol == "P") ? "Prepaid" : "Collect"
};
But it does not give me the desired output: the same count is returned for both PPD_COL values, P and C. What am I missing here?
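Change the group by as follows. In the group clause, select only the property you need (PayerCustCode), and in the by clause there is no need to create an anonymous object; use just the one property you are grouping by. The original called grouping.First(), so it only ever looked at the first row of each group.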
var PayerCount = from a in LstBlRev
where a.DelFlg == "N"
group a.PayerCustCode by a.PpdCol into grouping
select new
{
Count = grouping.Distinct().Count(),
PPdCol = grouping.Key == "P" ? "Prepaid" : "Collect"
};
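To make the intended semantics concrete (a distinct count of payers per PPD_COL value), here is a small Python sketch of the same filter, group, distinct-count pipeline. The rows and the name `distinct_payers_per_group` are hypothetical and used only for illustration.

```python
from collections import defaultdict

def distinct_payers_per_group(rows):
    """Count distinct payers per ppd_col among rows that are not deleted."""
    groups = defaultdict(set)
    for row in rows:
        if row["del_flg"] == "N":                     # where a.DelFlg == "N"
            groups[row["ppd_col"]].add(row["payer"])  # group payer by ppd_col
    return {
        ("Prepaid" if key == "P" else "Collect"): len(payers)
        for key, payers in groups.items()
    }

# Hypothetical sample rows.
rows = [
    {"payer": "A", "ppd_col": "P", "del_flg": "N"},
    {"payer": "A", "ppd_col": "P", "del_flg": "N"},
    {"payer": "B", "ppd_col": "P", "del_flg": "N"},
    {"payer": "A", "ppd_col": "C", "del_flg": "N"},
    {"payer": "C", "ppd_col": "C", "del_flg": "N"},
    {"payer": "D", "ppd_col": "C", "del_flg": "N"},
    {"payer": "E", "ppd_col": "C", "del_flg": "Y"},   # filtered out
]

print(distinct_payers_per_group(rows))
# {'Prepaid': 2, 'Collect': 3}
```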

Apache Pig: Combine multiple records within a bag

Any help in this would be greatly appreciated! Best way is with an example:
Input:
Schema:
Name|phone_type|phone_num
Example data:
Kyle|Cell|555-222-3333
Kyle|Home|453-444-5555
Tom|Home|555-555-5555
Tom|Pager|555-555-4344
Desired output:
Schema:
Name|Home_num|Cell_num|Pager_num
Example:
Kyle|453-444-5555|555-222-3333|null
Tom|555-555-5555|null|555-555-4344
Code:
data=Load 'test.txt' using PigStorage('|');
grpd= Group data by $0;
Foreach grpd{
???
}
Following the comment from #Murali lao, I rewrote the solution.
I now use FILTER. Because FLATTEN of an empty bag would drop the whole record, the trick is to substitute a bag containing an empty string whenever the filtered bag is empty.
Here are my test data:
tom,home,555
tom,pager,666
tom,cell,777
bob,home,111
bob,cell,222
Here is my solution:
data = LOAD 'phone' USING PigStorage(',') AS (name:chararray, phone_type: chararray, phone_num: chararray);
user = FOREACH (GROUP data BY name) {
home = FILTER $1 BY phone_type == 'home';
-- substitute a bag holding an empty string if the bag is empty
homenum = (IsEmpty(home) ? {('')} : home.phone_num);
pager = FILTER $1 BY phone_type == 'pager';
pagernum = (IsEmpty(pager) ? {('')} : pager.phone_num);
cell = FILTER $1 BY phone_type == 'cell';
cellnum = (IsEmpty(cell) ? {('')} : cell.phone_num);
GENERATE group as name, FLATTEN(homenum) as home, FLATTEN(pagernum) as pager, FLATTEN(cellnum) as cell;
};
After a dump, I obtain the following result:
(bob,111,,222)
(tom,555,666,777)
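The same pivot can be sketched in Python to check the expected rows. This is an illustration only: `pivot_phones` is a hypothetical name, and the sketch assumes at most one number per person and phone type (a later duplicate would overwrite an earlier one here).

```python
def pivot_phones(rows, types=("home", "pager", "cell")):
    """Collect each person's numbers into one row, '' where a type is missing."""
    by_name = {}
    for name, phone_type, number in rows:
        by_name.setdefault(name, {})[phone_type] = number
    return [
        (name, *(phones.get(t, "") for t in types))
        for name, phones in sorted(by_name.items())
    ]

# Test data from the answer above.
rows = [
    ("tom", "home", "555"),
    ("tom", "pager", "666"),
    ("tom", "cell", "777"),
    ("bob", "home", "111"),
    ("bob", "cell", "222"),
]

for row in pivot_phones(rows):
    print(row)
# ('bob', '111', '', '222')
# ('tom', '555', '666', '777')
```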

How to get intersection of bag and tuple in Pig?

I have one bag with the schema (url:chararray, mal:float) and another with the schema (url:chararray, links:chararray).
I want to parse the links field and intersect the bag with parsed links:
src = LOAD 'hbase://$collection' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:url anchors:links', '-loadKey true') AS (id:bytearray, url:chararray, links:chararray);
mals = LOAD '/tmp/prepare' as (url:chararray, mal:float);
urls = FILTER src BY (links IS NOT null);
urls2 = FOREACH urls GENERATE TOKENIZE(links, '\t') as links, id, url;
processed = FOREACH urls2 {
grouped = COGROUP links BY $0, mals BY url;
intersected = FILTER grouped BY NOT IsEmpty(urls) AND NOT IsEmpty(links4);
weights = FOREACH intersected GENERATE mal;
GENERATE id, AVG(weights) as mal;
};
This code isn't working; the parser fails with:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file ./Rank.pig, line 11, column 19> [query, statement, foreach_statement, foreach_complex_statement, foreach_clause_complex, foreach_plan_complex, nested_blk, nested_command_list, nested_command, expr, add_expr, multi_expr, cast_expr, unary_expr, expr_eval, var_expr, projectable_expr, func_eval, recoverFromMismatchedToken] mismatched input 'links' expecting LEFT_PAREN
I'm using Pig 0.11.0.
As far as I understand, links is a tuple and mals is a bag, so they cannot be cogrouped. How can I create a bag from links so that I can do the cogroup?
Update:
Example dataset:
/tmp/prepare:
http://1 1.0
http://2 0.9
http://3 0.8
http://4 0.0
HBase:
id: ID
url: http://4
links: http://1 http://2 http://3
Expected output:
{(id: ID, mal: 0.9)}
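To pin down the expected result, the computation the question describes (split links, keep the URLs that have a score in /tmp/prepare, average their mal values) can be sketched in Python. This only illustrates the target semantics, not a Pig solution; `average_mal` is a hypothetical name, and the tab separator is taken from the TOKENIZE call in the script.

```python
def average_mal(links, mals):
    """Average the mal score of every tab-separated link that appears in mals."""
    scores = [mals[url] for url in links.split("\t") if url in mals]
    return sum(scores) / len(scores) if scores else None

# Example dataset from the question.
mals = {"http://1": 1.0, "http://2": 0.9, "http://3": 0.8, "http://4": 0.0}
row = {"id": "ID", "url": "http://4", "links": "http://1\thttp://2\thttp://3"}

print(row["id"], round(average_mal(row["links"], mals), 2))
# ID 0.9
```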

Can I generate nested bags using nested FOREACH statements in Pig Latin?

Let's say I have a data set of restaurant reviews:
User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
I want to produce a list of the average rating by user and city, i.e. this output:
User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
I could write a Pig script as follows:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float
);
PerUserCity = GROUP Data BY (user, city);
ResultSet = FOREACH PerUserCity {
GENERATE group.user, group.city, AVG(Data.rating);
}
However, I'm curious whether I can first group at the higher level (the users) and then subgroup at the next level (the cities) later, i.e.
PerUser = GROUP Data BY user;
Intermediate = FOREACH PerUser {
B = GROUP Data BY city;
GENERATE group AS user, B;
}
I get:
Error during parsing.
Invalid alias: GROUP in {
group: chararray,
Data: {
user: chararray,
city: chararray,
restaurant: chararray,
rating: float
}
}
Has anyone tried this with success? Is it simply not possible to GROUP within a FOREACH?
My goal is to do something like:
ResultSet = FOREACH PerUser {
FOREACH City {
GENERATE user, city, AVG(City.rating)
}
}
Currently the allowed operations are DISTINCT, FILTER, LIMIT, and ORDER BY inside a FOREACH.
For now, grouping directly by (user, city), as you said, is the right way to do it.
Release notes for Pig version 0.10 suggest that nested FOREACH operations are now supported.
Try this:
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate group, AVG(Records.rating) as average;
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;
rawdata = load 'data' using PigStorage(',') as (user:chararray , city:chararray , restaurant:chararray , rating:float);
data = filter rawdata by user != 'User';
groupbyusercity = group data by (user,city);
--describe groupbyusercity;
--groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
average = foreach groupbyusercity {
generate group.user,group.city,AVG(data.rating);
}
dump average;
Grouping by two keys and then flattening the structure leads to the same result:
Loading Data like you did
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float);
Group by user and city
ByUserByCity = GROUP Data BY (user, city);
Add the average rating of each group (you can add more aggregates, like COUNT(Data) as count_res),
then flatten the group structure back to the original columns.
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) as user_city_avg;
Results in:
Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,
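The grouping by (user, city) used in the answers above can be cross-checked with a short Python sketch over the sample reviews from the question. The name `average_by_user_city` is hypothetical, and the header row is left out of the sample data here as an assumption.

```python
from collections import defaultdict

# Sample reviews from the question (header row omitted).
REVIEWS = [
    ("Jim", "New York", "Mecurials", 3.0),
    ("Jim", "New York", "Whapme", 4.5),
    ("Jim", "London", "Pint Size", 2.0),
    ("Lisa", "London", "Pint Size", 4.0),
    ("Lisa", "London", "Rabbit Whole", 3.5),
]

def average_by_user_city(reviews):
    """Group ratings by the composite key (user, city) and average them."""
    ratings = defaultdict(list)
    for user, city, _restaurant, rating in reviews:
        ratings[(user, city)].append(rating)
    return {key: sum(vals) / len(vals) for key, vals in ratings.items()}

for (user, city), avg in sorted(average_by_user_city(REVIEWS).items()):
    print(user, city, avg)
# Jim London 2.0
# Jim New York 3.75
# Lisa London 3.75
```

Keying on the (user, city) tuple is the direct analogue of GROUP Data BY (user, city), which is why no nested grouping is needed.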