Pig Latin query using group by and MAX function - apache-pig

Given the table:
Place(name, province, population, mayorid)
How would you write in Pig Latin the following query?
Return for each province the place(s) with the largest population. Your result set should have the province name, the place name and the population of that place.

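I haven't tested this, but something like the following ought to work:
places = LOAD 'placesInput' AS (name, province, population:long, mayorid);
placesProjected = FOREACH places GENERATE name, province, population;
placesGrouped = GROUP placesProjected BY province;
biggestPlaces = FOREACH placesGrouped {
    sorted = ORDER placesProjected BY population DESC;
    maxPopulation = LIMIT sorted 1;
    GENERATE group AS province, FLATTEN(maxPopulation.name) AS name, FLATTEN(maxPopulation.population) AS population;
};
Declaring population as long makes ORDER compare it as a number; a field with no declared type defaults to bytearray and may not sort numerically. Note that LIMIT 1 keeps a single place per province, so if several places tie for the largest population only one of them is returned.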

Related

SQL in R: HAVING condition with only the condition of one row?

I am learning to use SQL in R.
I want to select cities that are more northern than Berlin from a dataset.
For this, I have tried:
sql4 = "SELECT Name, Latitude FROM city HAVING Latitude > Latitude(Berlin)"
result4 = dbSendQuery(con, sql4)
df4 = dbFetch(result4)
head(df4)
and
sql4 = "SELECT Name, Latitude FROM city HAVING Latitude > (Name IN 'Berlin')"
result4 = dbSendQuery(con, sql4)
df4 = dbFetch(result4)
head(df4)
Neither syntax works, unfortunately.
So my question is: How do I select all cities "north from Berlin", i.e. latitude value higher than that of the Name row 'Berlin'? Is there a different, better approach?
Assuming Berlin occurs at most once in the city table, you may use:
SELECT Name, Latitude
FROM city
WHERE Latitude > (SELECT Latitude FROM city WHERE Name = 'Berlin');
You want to be using WHERE here to filter, rather than HAVING. HAVING is for filtering aggregates when using GROUP BY, which you are not using.
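To try the WHERE-with-subquery pattern in isolation, here is a minimal sketch using Python's built-in sqlite3 module rather than R. The table contents are made-up sample rows (approximate latitudes, for illustration only), and the ORDER BY is added just to make the output deterministic.

```python
import sqlite3

# In-memory database with a hypothetical `city` table; the rows are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE city (Name TEXT, Latitude REAL)")
con.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("Berlin", 52.52), ("Hamburg", 53.55), ("Munich", 48.14), ("Copenhagen", 55.68)],
)

# The scalar subquery yields Berlin's latitude; WHERE compares every row against it.
query = """
    SELECT Name, Latitude
    FROM city
    WHERE Latitude > (SELECT Latitude FROM city WHERE Name = 'Berlin')
    ORDER BY Latitude
"""
north_of_berlin = con.execute(query).fetchall()
print(north_of_berlin)  # [('Hamburg', 53.55), ('Copenhagen', 55.68)]
```

The same query string should work unchanged when passed to dbSendQuery in R, assuming the backend accepts scalar subqueries.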
You cannot actually use Latitude(Berlin), I think. I typically use a subquery like this:
SELECT Name, Latitude FROM city WHERE Latitude > (SELECT Latitude FROM city WHERE Name = 'Berlin')
Hope this helps.
-Suhas

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course. It told me I had to use an explicit cast. I have the filtering etc. figured out; I just can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
weights::id AS id,
weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
MAX(just_ids_weights.weight) AS wgt;
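As a rough illustration of why MAX needs a bag, the following Python sketch mimics the two groupings on the sample rows from the question. The group_by helper is a hypothetical stand-in for Pig's GROUP, not part of any Pig API.

```python
from collections import defaultdict

# (id, weight) pairs taken from the sample data in the question.
rows = [
    ("bayja01", 210),
    ("crawfca02", 225),
    ("damonjo01", 205),
    ("dejesda01", 190),
    ("eckstda01", 170),
]

def group_by(rows, key_index):
    """Hypothetical analogue of GROUP ... BY: map each key to a bag (list) of rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return dict(bags)

# GROUP ... BY id, then MAX over the weight column of each bag.
max_per_id = {key: max(r[1] for r in bag) for key, bag in group_by(rows, 0).items()}

# GROUP ... ALL: every row lands in one bag, so a single MAX covers the whole dataset.
max_overall = max(r[1] for r in rows)

print(max_per_id["crawfca02"], max_overall)  # 225 225
```

Because each id appears once in this sample, the per-id maximum is simply that player's weight; the GROUP ALL variant is the one that answers "which weight is the largest overall".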

Pig latin Limit operator applied to each group attribute

I am trying to return only the five largest places based on population in each state. I am also trying to sort the result by state name, with places in each state listed in declining order of population. What I have currently gives me only the first five places with states and not the five largest places for each state.
-- Groups places by state name.
group_by_state_name_populated_place_name =
GROUP project_using_state_name
BY (state::name, place::name);
-- Counts population for each place in every state.
count_population_for_each_place_in_every_state =
FOREACH group_by_state_name_populated_place_name
GENERATE group.state::name AS state_name,
group.place::name AS name,
COUNT(project_using_state_name.population) AS population;
-- Orders population in each group found above to enable the use of limit.
order_groups_of_states_and_population =
ORDER count_population_for_each_place_in_every_state
BY state_name ASC, population DESC, name ASC;
-- Limit the top 5 population for each state BUT currently returning just the first 5 tuples of the previous one and not 5 of each state.
limit_population =
LIMIT order_groups_of_states_and_population 5;
The code snippet below may help:
inp_data = load 'input_data.csv' using PigStorage(',') AS (state:chararray,place:chararray,population:long);
req_stats = FOREACH(GROUP inp_data BY state) {
ordered = ORDER inp_data BY population DESC;
required = LIMIT ordered 5;
GENERATE FLATTEN(required);
};
req_stats_ordered = ORDER req_stats BY state, population DESC;
DUMP req_stats_ordered;
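The nested ORDER and LIMIT run once per group, which is what makes this a per-state top five. A minimal Python sketch of the same idea follows; the function name and the sample rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def top_n_per_group(rows, n):
    """rows are (state, place, population) tuples; return the n most populous places per state."""
    result = []
    # Sort by state first so groupby sees each state as one contiguous run.
    by_state = sorted(rows, key=itemgetter(0))
    for _state, group in groupby(by_state, key=itemgetter(0)):
        ordered = sorted(group, key=itemgetter(2), reverse=True)  # ORDER ... BY population DESC
        result.extend(ordered[:n])                                # LIMIT n, applied inside the group
    return result

# Hypothetical sample rows.
rows = [
    ("A", "a1", 10), ("A", "a2", 30), ("A", "a3", 20),
    ("B", "b1", 5), ("B", "b2", 7),
]
top_two = top_n_per_group(rows, 2)
print(top_two)
# [('A', 'a2', 30), ('A', 'a3', 20), ('B', 'b2', 7), ('B', 'b1', 5)]
```

Applying the limit to the whole sorted relation instead, as in the question, would return only the first n rows overall.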

postgresql get the per-row number of keys of hstore data if key is in List "foo", "bar"

I am trying to count, per row, how many keys are in an hstore column.
array_length(akeys(tags), 1) as num_keys
This works fine for counting all tags.
The nodes table contains many more tags, though; I only want the per-row count of the tags that appear in my SELECT.
"name"=>"Campus", "amenity"=>"restaurant", "wheelchair"=>"yes"
Counting only "name" should give the result 1.
SELECT
id,
st_x(ST_Transform(geom,4326)) AS lon,
st_y(ST_Transform(geom,4326)) AS lat,
array_length(akeys(tags), 1) as num_keys,
tags->'name' AS name,
tags->'amenity' AS amenity,
tags->'addr:street' AS street,
tags->'addr:housenumber' AS housenumber,
tags->'addr:postcode' AS postcode,
tags->'addr:city' AS city,
tags->'cuisine' AS cuisine,
tags->'beer_garden' AS beer_garden,
tags->'outdoor_seating' AS outdoor_seating,
tags->'smoking' AS smoking,
tags->'brewery' AS brewery,
tags->'website' AS website,
tags->'internet_access' AS wlan,
tags->'phone' AS phone,
tags->'email' AS email,
tags->'opening_hours' AS opening_hours
FROM
nodes
WHERE
tags->'amenity' IN ('pub','bar','nightclub','biergarten','cafe','restaurant')
AND
tags->'name' IS NOT NULL;
The solution is:
SELECT
CASE WHEN 'email' = ANY (akeys(tags)) THEN 1 ELSE 0 END +
CASE WHEN 'opening_hours' = ANY (akeys(tags)) THEN 1 ELSE 0 END
AS count_tags
FROM
nodes
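The sum of CASE expressions adds 1 for each listed key that is present in the row. The same logic can be sketched in Python with a dict standing in for one row's hstore value; count_present_keys is a hypothetical helper name.

```python
def count_present_keys(tags, wanted):
    """Count how many of the wanted keys exist in one row's tag mapping."""
    return sum(1 for key in wanted if key in tags)

# One row's tags from the question, modelled as a plain dict.
tags = {"name": "Campus", "amenity": "restaurant", "wheelchair": "yes"}

print(count_present_keys(tags, ["name"]))                      # 1
print(count_present_keys(tags, ["email", "opening_hours"]))    # 0
print(count_present_keys(tags, ["name", "amenity", "phone"]))  # 2
```

Extending the SQL to more keys means adding one CASE term per key, just as the list passed to the helper grows here.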

how to find the maximum and its associated values from a grouped relation in pig?

Below is my input
$ cat people.csv
Steve,US,M,football,6.5
Alex,US,M,football,5.5
Ted,UK,M,football,6.0
Mary,UK,F,baseball,5.5
Ellen,UK,F,football,5.0
I need to group my data by country.
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray,country:chararray,gender:chararray, sport:chararray,height:float);
grouped = GROUP people BY country;
Now I have to find the tallest person in each group, along with that person's details.
So I tried the following:
a = FOREACH grouped GENERATE group AS country, MAX(people.height) as height, people.name as name;
which gives the output as
(UK,6.0,{(Ellen),(Mary),(Ted)})
(US,6.5,{(Alex),(Steve)})
But I need my output to be
(UK,6.0,Ted)
(US,6.5,Steve)
Could someone please help me achieve this?
This code will help you.
With this code, if two players share the maximum height in the same country, you will get the details of both players.
records = LOAD '/home/user/footbal.txt' USING PigStorage(',') AS(name:chararray,country:chararray,gender:chararray,sport:chararray,height:double);
records_grp = GROUP records BY (country);
records_each = foreach records_grp generate group as temp_country, MAX(records.height) as max_height;
records_join = join records by (country,height), records_each by (temp_country,max_height);
records_output = foreach records_join generate country, max_height, name;
dump records_output;
Output:
(UK,6.0,Ted)
(US,6.5,Steve)
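The two-step approach (compute the maximum per country, then join it back to the original rows) can be sketched in Python on the people.csv rows from the question. This is only an illustration of the logic, not a translation of how Pig executes the join.

```python
# Rows from people.csv in the question: (name, country, gender, sport, height).
people = [
    ("Steve", "US", "M", "football", 6.5),
    ("Alex", "US", "M", "football", 5.5),
    ("Ted", "UK", "M", "football", 6.0),
    ("Mary", "UK", "F", "baseball", 5.5),
    ("Ellen", "UK", "F", "football", 5.0),
]

# Step 1 (GROUP + MAX): the largest height seen for each country.
max_height = {}
for _name, country, _gender, _sport, height in people:
    max_height[country] = max(height, max_height.get(country, height))

# Step 2 (JOIN on country and height): keep every row that matches its country's maximum,
# so tied players would all be kept.
tallest = sorted(
    (country, height, name)
    for name, country, _gender, _sport, height in people
    if height == max_height[country]
)
print(tallest)  # [('UK', 6.0, 'Ted'), ('US', 6.5, 'Steve')]
```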