Pig Latin LIMIT operator applied to each group attribute - apache-pig

I am trying to return only the five largest places, by population, in each state. I also want the result sorted by state name, with the places in each state listed in declining order of population. What I have currently returns only the first five places overall, not the five largest places for each state.
-- Groups places by state name.
group_by_state_name_populated_place_name =
    GROUP project_using_state_name
    BY (state::name, place::name);
-- Counts population for each place in every state.
count_population_for_each_place_in_every_state =
    FOREACH group_by_state_name_populated_place_name
    GENERATE group.state::name AS state_name,
             group.place::name AS name,
             COUNT(project_using_state_name.population) AS population;
-- Orders population in each group found above to enable the use of limit.
order_groups_of_states_and_population =
    ORDER count_population_for_each_place_in_every_state
    BY state_name ASC, population DESC, name ASC;
-- Limit the top 5 population for each state BUT currently returning just the first 5 tuples of the previous one and not 5 of each state.
limit_population =
    LIMIT order_groups_of_states_and_population 5;

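The code snippet below may help. A top-level LIMIT applies to the whole relation, so to keep five rows per state, do the ORDER and LIMIT inside a nested FOREACH over the grouped data: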
inp_data = LOAD 'input_data.csv' USING PigStorage(',') AS (state:chararray, place:chararray, population:long);
req_stats = FOREACH (GROUP inp_data BY state) {
    ordered = ORDER inp_data BY population DESC;
    required = LIMIT ordered 5;
    GENERATE FLATTEN(required);
};
req_stats_ordered = ORDER req_stats BY state, population DESC;
DUMP req_stats_ordered;
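To make the per-group logic easier to follow, here is a minimal plain-Python sketch of what the nested FOREACH appears to do: group by state, sort each group by population descending, keep the first n rows, and flatten. The function name and the sample rows are made up for illustration; this is not Pig code and only mirrors the semantics.

```python
from collections import defaultdict

def top_places_per_state(rows, n=5):
    """rows: iterable of (state, place, population) tuples (hypothetical data)."""
    by_state = defaultdict(list)                 # GROUP inp_data BY state
    for state, place, population in rows:
        by_state[state].append((state, place, population))

    result = []
    for state in sorted(by_state):               # final ORDER ... BY state
        ordered = sorted(by_state[state],
                         key=lambda r: r[2],
                         reverse=True)           # ORDER ... BY population DESC
        result.extend(ordered[:n])               # LIMIT ordered n, then FLATTEN
    return result

# Made-up sample rows, not taken from the question.
sample = [
    ("Ohio", "A", 10), ("Ohio", "B", 30), ("Ohio", "C", 20),
    ("Iowa", "D", 5), ("Iowa", "E", 50),
]

for row in top_places_per_state(sample, n=2):
    print(row)
```

The key point is that the slice happens once per state inside the loop, whereas the original script applied LIMIT once to the whole sorted relation.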

Related

In SOQL, is there a way to get a total record count from a query before an offset?

If I have a query, is there a way to calculate the total number of records that meet the criteria before the OFFSET is applied, without having to run a separate query? My code is below. How would I get the total record count from this query? Thanks.
if (String.isBlank(userId)){
listRec = [SELECT Id, Name,
At_Fault_Party__c, At_Fault_Party__r.Name,
Accident__r.Name, Accident__c, Booking__c,
Vehicle__r.Vehicle_Type__c,
Accident__r.At_Fault_Party_Claim_Numer__c, OwnerId, Owner.Name,
Insurer__c, Insurer__r.Name,
Vehicle__r.Make__r.Name, Vehicle__r.Model__r.Name,
Accident__r.Accident_Date_and_Time__c,
AF_Vehicle__c, AF_Vehicle__r.Name,
Vehicle__r.Name, Customer__c, Customer__r.Name,
Accident__r.Repairer__r.Name,
Last_Contact_Date__c, Recovery_Type__c,
Customer_Vehicle__r.Insurer__c, Customer_Vehicle__r.Insurer__r.Name,
Customer_Vehicle__r.Claim_Number__c,
Customer_Vehicle__r.Make__r.Name,
Customer_Vehicle__r.Model__r.Name,
Customer_Vehicle__c, Customer_Vehicle__r.Name,
(SELECT Id, Name, Booking_of_Days__c, Booking_Start__c,
Booking_Finish_Return_Date_and_Time__c, Drop_Off_Address__c from Bookings__r order by Booking_Start__c desc)
FROM Recovery__c where
Name LIKE :whereClause and
(Recovery_Status__c = :selectedValues )
order by Last_Contact_Date__c desc limit 15 OFFSET :offsetval ];

Sorting a list of objects by another list of objects VB.NET LINQ

So, here's a simplified version of the situation:
I have Gyms in a database:
Gym: GymID, Name
GymAmenities: GymID, AmenityID
Amenities: AmenityID
So a Gym can have 0 to many "Amenities". Along comes a user who prioritizes the amenities that are important to him or her:
UserPrefAmenities: UserID, AmenityID, Ranking
Now when searching for a gym in a zip code, I want the search results to be in order of the user preferred amenities in order of rank...
gyms = (From g In db.Gyms Where g.Zip = thisRequest.Zip Order By g.GymAmenitys.Contains(From upa In thisUser.UserPrefAmenitys Order By upa.Rank)).ToList
Or something like that...
*note that running the above results in:
Unable to create a constant value of type 'UserPrefAmenity'. Only
primitive types or enumeration types are supported in this context.
I think this should do it:
gyms = (
From g In db.Gyms
Where g.Zip = thisRequest.Zip
Order By If((From ga in g.GymAmenitys
Join upa In db.UserPrefAmenitys On ga.AmenityID
Equals upa.AmenityID
Where upa.UserID = thisUser.UserID
Order By upa.Rank Descending
Select CType(upa.Rank, integer?)).FirstOrDefault(), 0)
).ToList
The idea is that, for each gym, you find the user's highest-ranked UserPrefAmenity whose AmenityID matches one of the gym's amenities. If the user has no matching UserPrefAmenity, the value 0 is used as the rank.
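The same sort key can be sketched outside LINQ. The following plain-Python illustration uses hypothetical gym names, amenity IDs and ranks: each gym's key is the highest rank among the user's preferred amenities that the gym offers, or 0 when none match.

```python
def best_rank(gym_amenity_ids, user_prefs):
    """Highest Rank among the gym's amenities that the user has ranked, else 0.

    user_prefs maps AmenityID -> Rank for a single user (hypothetical shape).
    """
    ranks = [user_prefs[a] for a in gym_amenity_ids if a in user_prefs]
    return max(ranks, default=0)

# Made-up data: gym name -> set of AmenityIDs, and one user's preferences.
gyms = {"Gym A": {1, 2}, "Gym B": {3}, "Gym C": set()}
prefs = {1: 5, 3: 9}

# Ascending order, like a plain Order By on the computed key.
ordered = sorted(gyms, key=lambda name: best_rank(gyms[name], prefs))
print(ordered)
```

Whether the best match should come first depends on how Rank is defined; passing reverse=True to sorted (or adding Descending to the outer Order By) flips the order.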

Pig: Summing Fields

I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (e.g., 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the data as follows (filtered_data has the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):
grouped_data = GROUP filtered_data BY county;
So grouped_data is of the schema
{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}
Now I would like to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form
pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);
but I've been unable to get this to work. Thanks in advance!
I don't know if this is helpful, but the following is a representative entry of grouped_data:
(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})
Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.
Can you try the below approach?
Sample input:
147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (county:chararray,pop1:int,pop2:int,pop3:int,pop4:int,pop5:int,pop6:int,pop7:int,pop8:int);
B = GROUP A BY county;
C = FOREACH B GENERATE group,(SUM(A.pop1)+SUM(A.pop2)+SUM(A.pop3)+SUM(A.pop4)+SUM(A.pop5)+SUM(A.pop6)+SUM(A.pop7)+SUM(A.pop8)) AS totalPopulation;
DUMP C;
Output:
(145,40)
(147,24)
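As a cross-check of the arithmetic, here is a small plain-Python sketch that computes the same per-county totals from the sample input above. The function name is hypothetical, and this only mirrors the logic of the Pig script; it does not run on Pig.

```python
from collections import defaultdict

def total_population_by_county(rows):
    """rows: (county, pop1, ..., pop8) tuples; returns {county: total}."""
    totals = defaultdict(int)
    for county, *pops in rows:
        # Equivalent to SUM(A.pop1) + ... + SUM(A.pop8) for that county.
        totals[county] += sum(pops)
    return dict(totals)

# The three sample input lines from the answer.
rows = [
    ("147", 1, 1, 1, 1, 1, 1, 1, 1),
    ("147", 2, 2, 2, 2, 2, 2, 2, 2),
    ("145", 5, 5, 5, 5, 5, 5, 5, 5),
]

print(total_population_by_county(rows))  # {'147': 24, '145': 40}
```

County 147 has two rows contributing 8 and 16, and county 145 has one row contributing 40, which matches the output shown.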

Pig script to get top 3 data in a single record

I have sample data with the fields
user_id, date, accessed url, session time
and I want the top 3 interests of each user, ranked by session time.
I got this far using the code:
top3 = FOREACH DataSet {
    sorted = ORDER DataSet BY sessiontime DESC;
    lim = LIMIT sorted 3;
    GENERATE flatten(group), flatten(lim);
};
Output:
(1,20,url1,2484)
(1,20,url2,1863)
(1,20,url3,1242)
(2,22,url4,484)
(2,22,url5,63)
(2,22,url6,42)
(3,25,url7,500)
(3,25,url8,350)
(3,25,url9,242)
But I want my output to be like this:
(1,20,url1,url2,url3)
(2,22,url4,url5,url6)
(3,25,url7,url8,url9)
Please help.
You are close. The problem is that you FLATTEN the bag of URLs when you really want to keep them all in one record. So do this instead:
top3 = FOREACH DataSet {
    sorted = ORDER DataSet BY sessiontime DESC;
    lim = LIMIT sorted 3;
    GENERATE flatten(group), lim.url;
};
Based on the output you got, you will now get
(1,20,{(url1),(url2),(url3)})
(2,22,{(url4),(url5),(url6)})
(3,25,{(url7),(url8),(url9)})
Note that the URLs are contained inside a bag. If you want to have them as three top-level fields, you will need to use a UDF to convert the bag into a tuple (newer Pig releases include a built-in BagToTuple for this), and then FLATTEN that.
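To illustrate the target shape, here is a plain-Python sketch of "top 3 URLs per group, emitted as one record". It assumes the group key is (user_id, date), which is what the flattened group in the output suggests; the function name and the extra low-ranked rows are made up.

```python
from collections import defaultdict

def top_urls_per_user(rows, n=3):
    """rows: (user_id, date, url, sessiontime) tuples (assumed layout)."""
    groups = defaultdict(list)
    for user_id, date, url, sessiontime in rows:
        groups[(user_id, date)].append((sessiontime, url))

    records = []
    for (user_id, date), visits in sorted(groups.items()):
        # ORDER BY sessiontime DESC, then LIMIT n.
        top = sorted(visits, key=lambda v: v[0], reverse=True)[:n]
        # Spread the URLs into top-level fields (bag -> tuple -> FLATTEN).
        records.append((user_id, date, *[url for _, url in top]))
    return records

rows = [
    (1, 20, "url1", 2484), (1, 20, "url2", 1863), (1, 20, "url3", 1242),
    (1, 20, "urlX", 10),   # made-up extra row that should be dropped
    (2, 22, "url5", 63), (2, 22, "url4", 484), (2, 22, "url6", 42),
]

for record in top_urls_per_user(rows):
    print(record)
```

Groups with fewer than n URLs produce shorter records here, so downstream code should not assume a fixed width.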

Pig Latin query using group by and MAX function

Given the table:
Place(name, province, population, mayorid)
How would you write in Pig Latin the following query?
Return for each province the place(s) with the largest population. Your result set should have the province name, the place name and the population of that place.
Haven't tested this, but something like
-- population is typed so that ORDER sorts it numerically rather than as a bytearray
places = LOAD 'placesInput' AS (name:chararray, province:chararray, population:long, mayorid);
placesProjected = FOREACH places GENERATE name, province, population;
placesGrouped = GROUP placesProjected BY province;
biggestPlaces = FOREACH placesGrouped {
    sorted = ORDER placesProjected BY population DESC;
    maxPopulation = LIMIT sorted 1;
    GENERATE group AS province, FLATTEN(maxPopulation.name) AS name, FLATTEN(maxPopulation.population) AS population;
};
oughta work.
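One caveat: the question asks for the place(s) with the largest population, and LIMIT 1 keeps a single row per province, so a tie for first place would drop one of the tied places. The plain-Python sketch below shows tie-preserving logic: find each province's maximum, then keep every place that reaches it. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

def largest_places(places):
    """places: (name, province, population, mayorid) tuples.

    Returns (province, name, population) for every place whose population
    equals the maximum in its province, so ties are kept.
    """
    by_province = defaultdict(list)
    for name, province, population, _mayorid in places:
        by_province[province].append((name, population))

    result = []
    for province in sorted(by_province):
        top = max(pop for _, pop in by_province[province])
        for name, pop in by_province[province]:
            if pop == top:
                result.append((province, name, pop))
    return result

# Made-up rows; P1 has a tie for the largest population.
places = [
    ("Alpha", "P1", 100, 1), ("Beta", "P1", 100, 2), ("Gamma", "P1", 40, 3),
    ("Delta", "P2", 70, 4), ("Epsilon", "P2", 90, 5),
]

for row in largest_places(places):
    print(row)
```

In Pig, one way to get comparable behaviour would be to compute MAX(population) per province and join it back to the places, though that variant is not tested here.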