Sort a bag with pig - apache-pig

I have the following structure:
GROUPED_ANSWERS_PARENT_ID: {group: chararray,ANSWERS: {(id: chararray,score: long,parentId: chararray)}}
Here is my data, where the scores are 27, 287, 35, 37, 46 and 48 respectively:
((4,{(305467,27,4),(7,287,4),(2791,35,4),(594436,37,4),(110198,46,4),(7263,48,4)}))
I want the bag to be ordered by score DESC, returning the following:
((4,{(7,287,4),(7263,48,4),(110198,46,4),(594436,37,4),(2791,35,4),(305467,27,4)}))
I have tried the following, but the result is still incorrect:
SORTED_GROUPED_ANSWERS_PARENT_ID = FOREACH GROUPED_ANSWERS_PARENT_ID {
ORDER_BY_SCORE = ORDER $1 BY score;
GENERATE (group,ORDER_BY_SCORE);
};
Any help would be greatly appreciated.
PS: I have looked at this post, but it did not help me

You are missing the DESC keyword. ORDER sorts in ascending order by default:
SORTED_GROUPED_ANSWERS_PARENT_ID = FOREACH GROUPED_ANSWERS_PARENT_ID {
    -- sort the nested bag by score, highest first
    ORDER_BY_SCORE = ORDER ANSWERS BY score DESC;
    GENERATE (group,ORDER_BY_SCORE);
};

Related

Kotlin: Efficient way of sorting list using another list and alphabetical order

I want to sort the students list by popularity (the popularity list is always sorted by points), and then sort the students who aren't on that list in alphabetical order.
The two lists look like:
val students = listOf<Student>(
Student(id = 3, name ="mike"),
Student(id = 2,name ="mathew"),
Student(id = 1,name ="john"),
Student(id = 4,name ="alan")
)
val popularity = listOf<Ranking>(
Ranking(id= 2, points = 30),
Ranking(id= 3, points = 15)
)
The result I'm looking for would be:
[
Student(id=2,name="mathew"), <- first, because of popularity
Student(id=3,name="mike"),
Student(id=4,name="alan"), <- starting here by alphabetical order
Student(id=1,name="john")
]
If anyone knows an efficient way of doing this, I would appreciate the help.

Having the rankings as a list is suboptimal: every lookup has to scan the list, which is slow if you have many rankings. If you do have a lot of them, I suggest converting them to a map first. You can do that easily from the list using associate.
Then you can create a custom comparator that compares by popularity (descending, with 0 for students that have no ranking) and then by name, and use it in sortedWith:
val studentIdToPopularity = popularity.associate { it.id to it.points }
val studentComparator = compareByDescending<Student> { studentIdToPopularity[it.id] ?: 0 }.thenBy { it.name }
val sortedStudents = students.sortedWith(studentComparator)
You can see the result here: https://pl.kotl.in/a1GZ736jZ
Student(id=2, name=mathew)
Student(id=3, name=mike)
Student(id=4, name=alan)
Student(id=1, name=john)

Lodash choose which duplicates to reject

I have an array of objects with properties "a", "b" and "c".
I need to keep only one object for each unique combination of "a" and "b".
However, "c" determines which object to keep and which ones to reject.
If I do uniqBy on properties a and b, I will be rejecting objects blindly: it keeps the first object it encounters in the array and rejects all other duplicates.
Let me know if you need more clarification on the question.
This is how I found the unique objects based on two properties:
var uniqArray= _.uniqBy(obj, function(elem) { return [elem.a, elem.b].join(); })
Related question: lodash uniq - choose which duplicate object to keep in array of objects
Is there a better solution?
Here is an example of input and expected output
Input: var arrayOfObj = [{a:1, b:1, c:2}, {a:1,b:1,c:1}, {a:2, b:2,c:2}];
Output: arrayOfObj = [{a:1,b:1,c:1}, {a:2, b:2,c:2}]; there should not be any duplicate a1,b1 combination

According to the Lodash documentation, the order of the result values is determined by the order in which they occur in the array. Therefore you need to order the array by the 'c' property first in order to get the expected result.
To do so, you can use _.sortBy (or _.orderBy, as in the code here), which orders a collection in ascending order based on a property or an iteratee. I think your problem can be solved using a property directly, though you could pass a function if you need a more precise ordering.
After that, you just perform the uniqBy step and retrieve the result:
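var res = _(arrayOfObj)
    .orderBy('c')
    // uniqBy accepts a single iteratee, so combine a and b into one key
    .uniqBy(function (elem) { return [elem.a, elem.b].join(); })
    .value();
console.log(res);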
Here's the fiddle.
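As I read in the comments, your 'c' property is a timestamp. If you want to order from newest to oldest, you can pass an iteratee that reverses the natural order of 'c':
var res = _(arrayOfObj)
    .orderBy(function(obj) {
        // negate the numeric value so larger (newer) timestamps sort first
        return -(+obj.c);
    })
    // uniqBy accepts a single iteratee, so combine a and b into one key
    .uniqBy(function (elem) { return [elem.a, elem.b].join(); })
    .value();
console.log(res);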
You can check it out in this update of the previous fiddle. Hope it helps.
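The sort-then-deduplicate idea can also be sketched without Lodash. This is only an illustration: uniqByPairKeepingFirst is a hypothetical helper name, and it assumes "c" is numeric and that the array sort is stable (as in current JavaScript engines).

```javascript
// Hypothetical helper (not a Lodash function): sort a copy of the array,
// then keep the first object seen for each (a, b) pair.
function uniqByPairKeepingFirst(items, compareC) {
  const seen = new Set();
  return [...items]                 // copy, so the caller's array is not reordered
    .sort(compareC)
    .filter((item) => {
      const key = JSON.stringify([item.a, item.b]); // one key for both properties
      if (seen.has(key)) return false;              // a better-ranked duplicate was kept already
      seen.add(key);
      return true;
    });
}

const sample = [{a: 1, b: 1, c: 2}, {a: 1, b: 1, c: 1}, {a: 2, b: 2, c: 2}];

// Ascending c: the smallest c wins for each pair.
const lowestFirst = uniqByPairKeepingFirst(sample, (x, y) => x.c - y.c);
// Descending c: the largest c (for example the newest timestamp) wins instead.
const highestFirst = uniqByPairKeepingFirst(sample, (x, y) => y.c - x.c);

console.log(lowestFirst);
console.log(highestFirst);
```

The comparator decides which duplicate survives, so switching between "keep the oldest" and "keep the newest" only changes the second argument.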

It would help to know what the actual data is and what you want to achieve with it. Here's an idea: group your objects based on a and b, then from each group, select one item based on c.
var arrayOfObj = [
    {a:1, b:1, c:2},
    {a:1, b:1, c:1},
    {a:2, b:2, c:2}
];
var result = _.chain(arrayOfObj)
    .groupBy(function (obj) {
        // the separator keeps a=1,b=11 and a=11,b=1 in different groups
        return obj.a.toString() + '|' + obj.b.toString();
    })
    .map(function (objects) {
        // placeholder: replace with code to select the right item based on c
        return _.head(objects);
    }).value();
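As a possible way to fill in the selection step, here is a single-pass sketch in plain JavaScript that avoids sorting altogether. pickPerPair and the "smaller c wins" rule are assumptions made for illustration; the rule is taken from the expected output in the question.

```javascript
// Hypothetical helper: for each (a, b) pair keep the object that the
// `better` callback prefers over the one kept so far.
function pickPerPair(items, better) {
  const best = new Map();           // composite key -> best object seen so far
  for (const item of items) {
    const key = JSON.stringify([item.a, item.b]);
    const current = best.get(key);
    if (current === undefined || better(item, current)) {
      best.set(key, item);          // replacing a value keeps the key's original position
    }
  }
  return [...best.values()];        // Map iterates keys in first-insertion order
}

const candidates = [{a: 1, b: 1, c: 2}, {a: 1, b: 1, c: 1}, {a: 2, b: 2, c: 2}];

// Assumed rule: the object with the smaller c is the one to keep.
const picked = pickPerPair(candidates, (candidate, current) => candidate.c < current.c);
console.log(picked);
```

Any other preference (largest c, newest timestamp) can be expressed by changing the callback.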

Apache Pig: Combine multiple records within a bag

Any help in this would be greatly appreciated! Best way is with an example:
Input:
Schema:
Name|phone_type|phone_num
Example data:
Kyle|Cell|555-222-3333
Kyle|Home|453-444-5555
Tom|Home|555-555-5555
Tom|Pager|555-555-4344
Desired output:
Schema:
Name|Home_num|Cell_num|Pager_num
Example:
Kyle|453-444-5555|555-222-3333|null
Tom|555-555-5555|null|555-555-4344
Code:
data=Load 'test.txt' using PigStorage('|');
grpd= Group data by $0;
Foreach grpd{
???
}

Following the comment from @Murali lao, I rewrote the solution.
I now use FILTER. FLATTEN of an empty bag would drop the whole record, so the trick is to substitute a bag containing an empty string whenever the filtered bag is empty.
Here are my test data:
tom,home,555
tom,pager,666
tom,cell,777
bob,home,111
bob,cell,222
Here is my solution:
data = LOAD 'phone' USING PigStorage(',') AS (name:chararray, phone_type: chararray, phone_num: chararray);
user = FOREACH (GROUP data BY name) {
home = FILTER $1 BY phone_type == 'home';
-- substitute an empty string if the bag is empty
homenum = (IsEmpty(home) ? {('')} : home.phone_num);
pager = FILTER $1 BY phone_type == 'pager';
pagernum = (IsEmpty(pager) ? {('')} : pager.phone_num);
cell = FILTER $1 BY phone_type == 'cell';
cellnum = (IsEmpty(cell) ? {('')} : cell.phone_num);
GENERATE group as name, FLATTEN(homenum) as home, FLATTEN(pagernum) as pager, FLATTEN(cellnum) as cell;
};
After a dump, I obtain the following result:
(bob,111,,222)
(tom,555,666,777)
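For readers who want to check the expected shape of the result outside Pig, the same pivot can be sketched in plain JavaScript. This is an illustration only, not a translation of the Pig script: pivotPhones is a hypothetical name, missing numbers become null rather than an empty string, and a repeated phone type for the same person overwrites the earlier value instead of producing extra rows.

```javascript
// Phone types that become columns in the output.
const PHONE_TYPES = ['home', 'pager', 'cell'];

// Hypothetical helper: turn (name, phoneType, phoneNum) rows into one record per name.
function pivotPhones(rows) {
  const byName = new Map();
  for (const [name, phoneType, phoneNum] of rows) {
    if (!byName.has(name)) {
      // Start every column at null so a missing phone type stays visible.
      byName.set(name, { name, home: null, pager: null, cell: null });
    }
    if (PHONE_TYPES.includes(phoneType)) {
      byName.get(name)[phoneType] = phoneNum; // later duplicates overwrite earlier ones
    }
  }
  return [...byName.values()];      // one record per name, in first-seen order
}

const phoneRows = [
  ['tom', 'home', '555'],
  ['tom', 'pager', '666'],
  ['tom', 'cell', '777'],
  ['bob', 'home', '111'],
  ['bob', 'cell', '222'],
];

const pivoted = pivotPhones(phoneRows);
console.log(pivoted);
```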

How to get count of all items in a criteria GORM query

I have this criteria query that gets 10 feature articles whose itemChannel objects are of type 4 and belong to the channel with id 1, i.e. "get me the top 10 articles that are of type feature and in channel x".
def criteria = Feature.createCriteria()
list = criteria.list {
maxResults(params.max)
itemChannels {
eq ('itemType.id',(long)4)
eq ('channel.id',(long)1)
}
}
How do I get the total count efficiently? I have the articles for page 1, but I need the total number for pagination.
Thanks

I think I've sorted this out, using a countDistinct projection:
criteria = Feature.createCriteria()
count = criteria.get{
projections {
countDistinct('id')
}
itemChannels {
eq ('itemType.id',(long)4)
eq ('channel.id',(long)2)
}
}

Can I generate nested bags using nested FOREACH statements in Pig Latin?

Let's say I have a data set of restaurant reviews:
User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
I want to produce the average rating per user and city, i.e. this output:
User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
I could write a Pig script as follows:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float
);
PerUserCity = GROUP Data BY (user, city);
ResultSet = FOREACH PerUserCity {
GENERATE group.user, group.city, AVG(Data.rating);
}
However, I'm curious whether I can first group at the higher level (the users) and then sub-group at the next level (the cities) later, i.e.
PerUser = GROUP Data BY user;
Intermediate = FOREACH PerUser {
B = GROUP Data BY city;
GENERATE group AS user, B;
}
I get:
Error during parsing.
Invalid alias: GROUP in {
group: chararray,
Data: {
user: chararray,
city: chararray,
restaurant: chararray,
rating: float
}
}
Has anyone tried this with success? Is it simply not possible to GROUP within a FOREACH?
My goal is to do something like:
ResultSet = FOREACH PerUser {
FOREACH City {
GENERATE user, city, AVG(City.rating)
}
}

Currently the only operations allowed inside a FOREACH are DISTINCT, FILTER, LIMIT, and ORDER BY.
For now, grouping directly by (user, city), as you did, is the right way to do it.

Release notes for Pig version 0.10 suggest that nested FOREACH operations are now supported.

Try this:
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
-- keep group in the projection so that it can be flattened in the next step
avgRating_Byuser_perCity = foreach grpRecs generate group, AVG(Records.rating) as average;
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;

rawdata = load 'data' using PigStorage(',') as (user:chararray , city:chararray , restaurant:chararray , rating:float);
-- drop the header row
data = filter rawdata by user != 'User';
groupbyusercity = group data by (user,city);
--describe groupbyusercity;
--groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
average = foreach groupbyusercity {
generate group.user,group.city,AVG(data.rating);
}
dump average;

Grouping by two keys and then flattening the group leads to the same result.
Load the data as you did:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float);
Group by user and city:
ByUserByCity = GROUP Data BY (user, city);
Add the average rating of each group (you can add more, like COUNT(Data) AS count_res), then flatten the group tuple back into its original fields:
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) AS user_city_avg;
This results in the following (the last row is the header line of the input, whose rating cannot be parsed as a float, so its average is empty):
Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,
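To double-check the averages outside Pig, the composite-key grouping can be sketched in plain JavaScript. This is only an illustration of the same idea: averageByUserCity is a hypothetical name, and the sample rows are the review data from the question with the header row left out.

```javascript
// Hypothetical helper: average rating per (user, city) pair, grouped in one pass.
function averageByUserCity(reviews) {
  const groups = new Map();         // composite key -> running sum and count
  for (const { user, city, rating } of reviews) {
    const key = JSON.stringify([user, city]);
    let group = groups.get(key);
    if (!group) {
      group = { user, city, sum: 0, count: 0 };
      groups.set(key, group);
    }
    group.sum += rating;
    group.count += 1;
  }
  // Flatten each group back into (user, city, average), like FLATTEN(group) in Pig.
  return [...groups.values()].map(({ user, city, sum, count }) => ({
    user,
    city,
    average: sum / count,
  }));
}

const reviews = [
  { user: 'Jim', city: 'New York', restaurant: 'Mecurials', rating: 3 },
  { user: 'Jim', city: 'New York', restaurant: 'Whapme', rating: 4.5 },
  { user: 'Jim', city: 'London', restaurant: 'Pint Size', rating: 2 },
  { user: 'Lisa', city: 'London', restaurant: 'Pint Size', rating: 4 },
  { user: 'Lisa', city: 'London', restaurant: 'Rabbit Whole', rating: 3.5 },
];

const averages = averageByUserCity(reviews);
console.log(averages);
```

Unlike the Pig output, the rows come back in first-seen order rather than sorted by key.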
