How to get intersection of bag and tuple in Pig? - apache-pig

I have a bag like this (url:chararray mal:float) and like this (url:chararray links:chararray).
I want to parse the links field and intersect the bag with parsed links:
src = LOAD 'hbase://$collection' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:url anchors:links', '-loadKey true') AS (id:bytearray, url:chararray, links:chararray);
mals = LOAD '/tmp/prepare' as (url:chararray, mal:float);
urls = FILTER src BY (links IS NOT null);
urls2 = FOREACH urls GENERATE TOKENIZE(links, '\t') as links, id, url;
processed = FOREACH urls2 {
grouped = COGROUP links BY $0, mals BY url;
intersected = FILTER grouped BY NOT IsEmpty(urls) AND NOT IsEmpty(links4);
weights = FOREACH intersected GENERATE mal;
GENERATE id, AVG(weights) as mal;
};
This code isn't working: parser fails with:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file ./Rank.pig, line 11, column 19> [query, statement, foreach_statement, foreach_complex_statement, foreach_clause_complex, foreach_plan_complex, nested_blk, nested_command_list, nested_command, expr, add_expr, multi_expr, cast_expr, unary_expr, expr_eval, var_expr, projectable_expr, func_eval, recoverFromMismatchedToken] mismatched input 'links' expecting LEFT_PAREN
I'm using Pig 0.11.0.
As far as I understand links is tuple, and mals is bag, so they cannot be cogrouped. How can I create a bag with links to do cogroup?
UPD:
Example dataset:
/tmp/prepare:
http://1 1.0
http://2 0.9
http://3 0.8
http://4 0.0
HBase:
id: ID
url: http://4
links: http://1 http://2 http://3
As output:
{(id: ID, mal: 0.9)}

Related

Not able to put tag column value on monday item

I am trying to use python to automation common Monday tasks. I am able to create an item in the board but the column (type=tag) is not updating.
I used this tutorial:
https://support.monday.com/hc/en-us/articles/360013483119-API-Quickstart-Tutorial-Python#
Here is my graphql code that I am executing:
query = 'mutation ($device: String!, $columnVals: JSON!) { create_item (board_id:<myboardid>, item_name:$device, column_values:$columnVals) { id } }'
vars = {'device': device,
'columnVals': json.dumps({
'cloud_name6': {'text': cloudname} # this is where i want to add a tag. cloud_name6 is id of the column.
})
}
data = {'query' : query, 'variables' : vars}
r = requests.post(url=apiUrl, json=data, headers=headers) print(r.json())
I have tried changing id to title as key in the Json string but no luck. I fetched the existing item and tried to add exact json string but still no luck. I also tried below json data without any luck
'columnVals': json.dumps({
'cloud_name6': cloudname
})
Any idea what's wrong with the query?
When creating or mutating tag columns via item queries, you need to send an array of ids of the tags ("tag_ids") that are relating to this item. You don't set or alter tag names via an item query.
Corrected Code
'columnVals': json.dumps({
'cloud_name6': {'tag_ids': [295026,295064]}
})
https://developer.monday.com/api-reference/docs/tags

Pig Sum field from bag not works

Updated
The input is a json line text file.
{"store":"079","items":[{"name":"早晨全餐","unit_price":18,"quantity":1,"total":18},{"name":"麦趣鸡盒","unit_price":78,"quantity":5,"total":390},{"name":"巨无霸","unit_price":17,"quantity":5,"total":85},{"name":"香骨鸡腿","unit_price":12,"quantity":2,"total":24},{"name":"小薯条","unit_price":7,"quantity":5,"total":35}],"date":"\/Date(1483256820000)\/","oId":"27841ef9-f88e-478f-8f20-17c3ad090ebc"}
{"store":"041","items":[{"name":"小薯条","unit_price":7,"quantity":2,"total":14},{"name":"巨无霸","unit_price":17,"quantity":4,"total":68}],"date":"\/Date(1483221780000)\/","oId":"afee2e6d-0f81-4780-82e9-2169bf3c43f3"}
{"store":"008","items":[{"name":"奶昔香草","unit_price":9,"quantity":5,"total":45},{"name":"小薯条","unit_price":7,"quantity":2,"total":14}],"date":"\/Date(1483248600000)\/","oId":"802ea077-1eef-4cc9-af89-af7398e56792"}
Expect to group by all store and calculate the sum of total in each items,for example:
store_name total_amount
_________________________
001 2212.26
002 3245.46
003 888888
My Pig script:
store_table = LOAD '/example/store-data/2017-store-sales-data.json'
USING JsonLoader('
store_name:chararray,
items: {(
name:chararray,
unit_price:Bigdecimal,
quantity:int,
total:Bigdecimal)
},
date:Datetime,
oId:chararray'
);
platten_table = foreach store_table generate flatten(items), store_name;
store_group = group platten_table by store_name;
result = foreach store_group {
total_sum = sum(platten_table.items::total);
Generate group,total_sum;
}
Pig error is :
2017-11-28 08:53:54,357 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'Generate' expecting SEMI_COLON
Eval Functions are case sensitive, you need to use the eval function SUM in upper case.
Code snippet -
result = foreach store_group {
total_sum = SUM(platten_table.items::total);
Generate group,total_sum;
}
Refer : https://pig.apache.org/docs/r0.10.0/basic.html

Apache Pig: Combine multiple records within a bag

Any help in this would be greatly appreciated! Best way is with an example:
Input:
Schema:
Name|phone_type|phone_num
Example data:
Kyle|Cell|555-222-3333
Kyle|Home|453-444-5555
Tom|Home|555-555-5555
Tom|Pager|555-555-4344
Desired output:
Schema:
Name|Home_num|Cell_num|Pager_num
Example:
Kyle|453-444-5555|555-222-3333|null
Tom|555-555-5555|null|555-555-4344
Code:
data=Load 'test.txt' using PigStorage('|');
grpd= Group data by $0;
Foreach grpd{
???
}
After the comment of #Murali lao, I rewrite the solution.
I now use FILTER, and then the trick to not filter empty bag with FLATTEN is to add an empty string when the bag is empty.
Here are my test data:
tom,home,555
tom,pager,666
tom,cell,777
bob,home,111
bob,cell,222
Here is my solution:
data = LOAD 'phone' USING PigStorage(',') AS (name:chararray, phone_type: chararray, phone_num: chararray);
user = FOREACH (GROUP data BY name) {
home = FILTER $1 BY phone_type == 'home';
-- you add an empty string if the the bag is empty
homenum = (IsEmpty(home) ? {('')} : home.phone_num);
pager = FILTER $1 BY phone_type == 'pager';
pagernum = (IsEmpty(pager) ? {('')} : pager.phone_num);
cell = FILTER $1 BY phone_type == 'cell';
cellnum = (IsEmpty(cell) ? {('')} : cell.phone_num);
GENERATE group as name, FLATTEN(homenum) as home, FLATTEN(pagernum) as pager, FLATTEN(cellnum) as cell;
};
After a dump, I obtain the following result :
(bob,111,,222)
(tom,555,666,777)

Raven query returns 0 results for collection contains

I have a basic schema
Post {
Labels: [
{ Text: "Mine" }
{ Text: "Incomplete" }
]
}
And I am querying raven, to ask for all posts with BOTH "Mine" and "Incomplete" labels.
queryable.Where(candidate => candidate.Labels.Any(label => label.Text == "Mine"))
.Where(candidate => candidate.Labels.Any(label => label.Text == "Incomplete"));
This results in a raven query (from Raven server console)
Query: (Labels,Text:Incomplete) AND (Labels,Text:Mine)
Time: 3 ms
Index: Temp/XWrlnFBeq8ENRd2SCCVqUQ==
Results: 0 returned out of 0 total.
Why is this? If I query for JUST containing "Incomplete", I get 1 result.
If I query for JUST containing "Mine", I get the same result - so WHY where I query for them both, I get 0 results?
EDIT:
Ok - so I got a little further. The 'automatically generated index' looks like this
from doc in docs.FeedAnnouncements
from docLabelsItem in ((IEnumerable<dynamic>)doc.Labels).DefaultIfEmpty()
select new { CreationDate = doc.CreationDate, Labels_Text = docLabelsItem.Text }
So, I THINK the query was basically testing the SAME label for 2 different values. Bad.
I changed it to this:
from doc in docs.FeedAnnouncements
from docLabelsItem1 in ((IEnumerable<dynamic>)doc.Labels).DefaultIfEmpty()
from docLabelsItem2 in ((IEnumerable<dynamic>)doc.Labels).DefaultIfEmpty()
select new { CreationDate = doc.CreationDate, Labels1_Text = docLabelsItem1.Text, Labels2_Text = docLabelsItem2.Text }
Now my query (in Raven Studio) Labels1_Text:Mine AND Labels2_Text:Incomplete WORKS!
But, how do I address these phantom fields (Labels1_Text and Labels2_Text) when querying from Linq?
Adam,
You got the reason right. The default index would generate 2 index entries, and your query is executing on a single index entry.
What you want is to either use intersection, or create your own index like this:
from doc in docs.FeedAnnouncements
select new { Labels_Text = doc.Labels.Select(x=>x.Text)}
And that would give you all the label's text in a single index entry, which you can execute a query on.

Can I generate nested bags using nested FOREACH statements in Pig Latin?

Let's say I have a data set of restaurant reviews:
User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
And I want to produce a list by user and city of average review. I.e. output:
User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
I could write a Pig script as follows:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:charray, rating:float
);
PerUserCity = GROUP Data BY (user, city);
ResultSet = FOREACH PerUserCity {
GENERATE group.user, group.city, AVG(Data.rating);
}
However I'm curious whether I can first group the higher level group (the users) and then sub group the next level (the cities) later: i.e.
PerUser = GROUP Data BY user;
Intermediate = FOREACH PerUser {
B = GROUP Data BY city;
GENERATE group AS user, B;
}
I get:
Error during parsing.
Invalid alias: GROUP in {
group: chararray,
Data: {
user: chararray,
city: chararray,
restaurant: chararray,
rating: float
}
}
Has anyone tried this with success? Is it simply not possible to GROUP within a FOREACH?
My goal is to do something like:
ResultSet = FOREACH PerUser {
FOREACH City {
GENERATE user, city, AVG(City.rating)
}
}
Currently the allowed operations are DISTINCT, FILTER, LIMIT, and ORDER BY inside a FOREACH.
For now grouping directly by (user, city) is the good way to do as you said.
Release notes for Pig version 0.10 suggest that nested FOREACH operations are now supported.
Try this:
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate AVG(Records.rating) as average;
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;
awdata = load 'data' using PigStorage(',') as (user:chararray , city:chararray , restaurant:chararray , rating:float);
data = filter rawdata by user != 'User';
groupbyusercity = group data by (user,city);
--describe groupbyusercity;
--groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
average = foreach groupbyusercity {
generate group.user,group.city,AVG(data.rating);
}
dump average;
Grouping by two keys and then flattening the structure leads to the same result:
Loading Data like you did
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:charray, rating:float);
Group by user and city
ByUserByCity = GROUP Data BY (user, city);
Add Rating average of the groups (you can add more, like COUNT(Data) as count_res)
Then flatten the group structure to the original one.
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) as user_city_avg;
Results in:
Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,