Join two relations (candidates) and display difference between votes? - apache-pig

I first split the relation into those who won and those who lost.
I have trouble with joining the all the candidates back together and making tuples with the last names of both candidates (elected and defeated) and the difference between their vote totals (only tuples where the difference is less than 10).
--load the data
raw = LOAD '.../data2.csv' USING PigStorage(',') AS (
date, type:chararray, parl:int, prov:chararray, riding:chararray,
lastname:chararray, firstname:chararray, gender:chararray,
occupation:chararray, party:chararray, votes:int,
percent:double, elected:int);
fltrd = FILTER raw by votes > 100 ;
spltrd = SPLIT fltrd INTO won IF elected > 0, lost IF elected == 0;
jnd = JOIN won BY lastname AS lastname_won, lost BY lastname AS lastname_lost;
For displaying the difference btw the votes this is the idea I had but it's not working:
jnd = JOIN won BY lastname AS lastname_won, vote AS vote_won, lost BY lastname AS lastname_lost, vote AS vote_lost;
gen = foreach jnd generate lastname_won, lastname_lost,(vote_won - vote_lost) as diffVotes;

What exactly is is that is not working?
At first glance, your Join will probably not work because of the AS you are using. You cannot rename the columns in the join, you'll need a foreach for this. Use :: to distinguish between the fields from the two relations:
jnd2 = FOREACH jnd GENERATE won::lastname AS lastname_won, won::vote AS vote_won etc.
If you join on multiple columns change the syntax as follows:
jnd = JOIN won BY (lastname, vote), lost BY (lastname, vote);

Related

PIG How do I combine 2 files based one not equal condition

I am trying to find the player that played on the most teams in one year. I have one file wit the the schema of PlayerID, yearID, teamID. I brought the file in twice to try to join where the PlayerID and yearID are equal but the teamID is not. How do I do in in PIG? Can I do a <> in a join statement? Do I need to group them and them compare? I know sql i could join based on the PlayerID and yearID being equal and the teamID not being equal but not sure how to do that in PIG.
I tried this but it is no the right syntax"
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray,yearid:int, teamid:chararray);
batters1 = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS ` (id:chararray,yearid:int, teamid:chararray);
batter_fltr = FILTER batters BY (yearid > 0) AND (teamid> ' ');
batter1_fltr = FILTER batters1 BY (yearid>0) AND (teamid> ' ');
multi_playr = JOIN batter_fltr BY (yearid,id), batter1_fltr BY(yearid,id) ,LEFT OUTER BY(teamid);
You wanted to find the player that played on the most teams in one year. Therefore, you should group by player & year, then you can count the number of teams per player per year. Finally, order the data by the count descending - the first result will be your answer. There's no need to load the data twice or do a join.
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray, yearid:int, teamid:chararray);
-- Apply filtering as needed here
teams_per_year = FOREACH (GROUP batters BY (id, yearid))
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(batters.teamid) AS num_teams;
ordered_results = ORDER teams_per_year BY num_teams DESC;
DUMP ordered_results;
If you need the distinct number of teams, add a nested DISTINCT:
teams_per_year = FOREACH (GROUP batters BY (id, yearid)) {
dist_teams = DISTINCT batters.teamid;
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(dist_teams) AS num_teams;
}

Is count() being used correctly for my query?

Find all teams that only won 1 game in tourney #3 (1 column, 4 rows)
My thought process to create this query is that I need to count (WonGame)s for each Team. And that that number cannot be has to equal 1. But When I run my query I get no results (I should get 4 teams).
Experimenting with my query I changed the equals to a greater than and that returned 8 results. So I don't understand why equals 1 returns no results.
Also I checked my Data and there is indeed 4 teams that one only one game during Tournament #3.
select Teams.TeamName
from Teams
join Bowlers on Teams.TeamID = Bowlers.TeamID
join Bowler_Scores on Bowlers.BowlerID = Bowler_Scores.BowlerID
join Match_Games on Bowler_Scores.GameNumber = Match_Games.GameNumber
join Tourney_Matches on Match_Games.MatchID = Tourney_Matches.MatchID
where Tourney_Matches.TourneyID = 3
group by Teams.TeamName
having count(Bowler_Scores.WonGame) = 1;
Bowling League DB Structure
Bowling League Data
The diagram seems to indicate that the relationship between Match_Games and Bowler_Scores is on BOTH of MatchID and GameNumber
If you change your JOIN conditions to be both columns
join Match_Games on Bowler_Scores.GameNumber = Match_Games.GameNumber and Bowler_Scores.MatchID = Match_Games.MatchID
Then you might get the required answer.
I can only speculate on what your data really looks like. However, it is doubtful that this expression:
having count(Bowler_Scores.WonGame) = 1;
does what you want. This counts the number of non-NULL values. Presumably, WonGame as some value such as "1" or "W" for the winner. If the value were 1, then the correct expression would be:
having sum(Bowler_Scores.WonGame) = 1
This is just speculation though without a better description of your data.
EDIT:
Based on the comment:
having sum(convert(int, Bowler_Scores.WonGame)) = 1

Sql Left or Right Join One To Many Pagination

I have one main table and join other tables via left outer or right outer outer join.One row of main table have over 30 row in join query as result. And I try pagination. But the problem is I can not know how many rows will it return for one main table row result.
Example :
Main table first row result is in my query 40 rows.
Main table second row result is 120 row.
Problem(Question) UPDATE:
For pagination I need give the pagesize the count of select result. But I can not know the right count for my select result. Example I give page no 1 and pagesize 50, because of this I cant get the right result.I need give the right pagesize for my main table top 10 result. Maybe for top 10 row will the result row count 200 but my page size is 50 this is the problem.
I am using Sql 2014. I need it for my ASP.NET project but is not important.
Sample UPDATE :
it is like searching an hotel for booking. Your main table is hotel table. And the another things are (mediatable)images, (mediatable)videos, (placetable)location and maybe (commenttable)comments they are more than one rows and have one to many relationship for the hotel. For one hotel the result will like 100, 50 or 10 rows for this all info. And I am trying to paginate this hotels result. I need get always 20 or 30 or 50 hotels for performance in my project.
Sample Query UPDATE :
SELECT
*
FROM
KisiselCoach KC
JOIN WorkPlace WP
ON KC.KisiselCoachId = WP.WorkPlaceOwnerId
JOIN Album A
ON KC.KisiselCoachId = A.AlbumId
JOIN Media M
ON A.AlbumId = M.AlbumId
LEFT JOIN Rating R
ON KC.KisiselCoachId = R.OylananId
JOIN FrUser Fr
ON KC.CoachId = Fr.UserId
JOIN UserJob UJ
ON KC.KisiselCoachId = UJ.UserJobOwnerId
JOIN Job J
ON UJ.JobId = J.JobId
JOIN UserExpertise UserEx
ON KC.KisiselCoachId = UserEx.UserExpertiseOwnerId
JOIN Expertise Ex
ON UserEx.ExpertiseId = Ex.ExpertiseId
Hotel Table :
HotelId HotelName
1 Barcelona
2 Berlin
Media Table :
MediaID MediaUrl HotelId
1 www.xxx.com 1
2 www.xxx.com 1
3 www.xxx.com 1
4 www.xxx.com 1
Location Table :
LocationId Adress HotelId
1 xyz, Berlin 1
2 xyz, Nice 1
3 xyz, Sevilla 1
4 xyz, Barcelona 1
Comment Table :
CommentId Comment HotelId
1 you are cool 1
2 you are great 1
3 you are bad 1
4 hmm you are okey 1
This is only sample! I have 9999999 hotels in my database. Imagine a hotel maybe it has 100 images maybe zero. I can not know this. And I need get 20 hotels in my result(pagination). But 20 hotels means 1000 rows maybe or 100 rows.
First, your query is poorly written for readability flow / relationship of tables. I have updated and indented to try and show how/where tables related in hierarchical relativity.
You also want to paginate, lets get back to that. Are you intending to show every record as a possible item, or did you intend to show a "parent" level set of data... Ex so you have only one instance per Media, Per User, or whatever, then once that entry is selected you would show details for that one entity? if so, I would do a query of DISTINCT at the top-level, or at least grab the few columns with a count(*) of child records it has to show at the next level.
Also, mixing inner, left and right joins can be confusing. Typically a right-join means you want the records from the right-table of the join. Could this be rewritten to have all required tables to the left, and non-required being left-join TO the secondary table?
Clarification of all these relationships would definitely help along with the context you are trying to get out of the pagination. I'll check for comments, but if lengthy, I would edit your original post question with additional details vs a long comment.
Here is my SOMEWHAT clarified query rewritten to what I THINK the relationships are within your database. Notice my indentations showing where table A -> B -> C -> D for readability. All of these are (INNER) JOINs indicating they all must have a match between all respective tables. If some things are NOT always there, they would be changed to LEFT JOINs
SELECT
*
FROM
KisiselCoach KC
JOIN WorkPlace WP
ON KC.KisiselCoachId = WP.WorkPlaceOwnerId
JOIN Album A
ON KC.KisiselCoachId = A.AlbumId
JOIN Media M
ON A.AlbumId = M.AlbumId
LEFT JOIN Rating R
ON KC.KisiselCoachId = R.OylananId
JOIN FrUser Fr
ON KC.CoachId = Fr.UserId
JOIN UserJob UJ
ON KC.KisiselCoachId = UJ.UserJobOwnerId
JOIN Job J
ON UJ.JobId = J.JobId
JOIN UserExpertise UserEx
ON KC.KisiselCoachId = UserEx.UserExpertiseOwnerId
JOIN Expertise Ex
ON UserEx.ExpertiseId = Ex.ExpertiseId
Readability of a query is a BIG help for yourself, and/or anyone assisting or following you. By not having the "on" clauses near the corresponding joins can be very confusing to follow.
Also, which is your PRIMARY table where the rest are lookup reference tables.
ADDITION PER COMMENT
Ok, so I updated a query which appears to have no context to the sample data and what you want in your post. That said, I would start with a list of hotels only and a count(*) of things per hotel so you can give SOME indication of how much stuff you have in detail. Something like
select
H.HotelID,
H.HotelName,
coalesce( MedSum.recs, 0 ) as MediaItems,
coalesce( LocSum.recs, 0 ) as NumberOfLocations,
coalesce( ComSum.recs, 0 ) as NumberOfLocations
from
Hotel H
LEFT JOIN
( select M.HotelID,
count(*) recs
from Media M
group by M.HotelID ) MedSum
on H.HotelID = MedSum.HotelID
LEFT JOIN
( select L.HotelID,
count(*) recs
from Location L
group by L.HotelID ) LocSum
on H.HotelID = LocSum.HotelID
LEFT JOIN
( select C.HotelID,
count(*) recs
from Comment C
group by C.HotelID ) ComSum
on H.HotelID = ComSum.HotelID
order by
H.HotelName
--- apply any limit per pagination
Now this will return every hotel at a top-level and the total count of things per the hotel per the individual counts which may or not exist hence each sub-check is a LEFT-JOIN. Expose a page of 20 different hotels. Now, as soon as one person picks a single hotel, you can then drill-into the locations, media and comments per that one hotel.
Now, although this COULD work, having to do these counts on an every-time query might get very time consuming. You might want to add counter columns to your main hotel table representing such counts as being performed here. Then, via some nightly process, you could re-update the counts ONCE to get them primed across all history, then update counts only for those hotels that have new activity since entered the date prior. Not like you are going to have 1,000,000 posts of new images, new locations, new comments in a day, but of 22,000, then those are the only hotel records you would re-update counts for. Each incremental cycle would be short based on only the newest entries added. For the web, having some pre-aggregate counts, sums, etc is a big time saver where practical.

How to combine multiple rows in a relation into a tuple to perform calculations in PIG Latin

I have the following code:
pitcher_res = UNION pitcher_total_salary,pitcher_total_appearances;
dump pitcher_res;
The output is:
(8965000.0)
(22.0)
However, I want to calculate 8965000.0/22.0, so I need something like:
res = FOREACH some_relation GENERATE $0/$1;
Therefore I need to have some_relation = (8965000.0,22.0). How can I perform such a conversion?
You can do a CROSS.
Computes the cross product of two or more relations.
https://pig.apache.org/docs/r0.11.1/basic.html#cross
Ideally you would have a unique identifier for each entry in your source relations. Then you can perform a join based on this identifier which results in the kind of relation you want to have.
Salary relation
salaries: pitcher_id, pitcher_total_salary
Total appearances relation
appearances: pitcher_id, pitcher_total_appearances
Join
pitcher_relation = join salaries by pitcher_id, appearances by pitcher_id;
Calculation
res = FOREACH pitcher_relation GENERATE pitcher_total_salary/pitcher_total_apperances;
The below pig latin scripts will surely come to your rescue:
load the salary file
salary = load '/home/abhishek/Work/pigInput/pitcher_total_salary' as (salary:long);
load the appearances file
appearances = load '/home/abhishek/Work/pigInput/pitcher_total_appearances' as (appearances:long);
Now, use the CROSS command
C = cross salary, appearances
Then, the final output
res = foreach C generate salary/appearances;
Output
dump res
407500
Hope this helps

Applying two separate filters on a Rails database query

I am trying to apply two filters to a database query in Rails 3. The first filter shows only media of type images. The second filter shows the highest saluted stories. On their own the filters work ok, but when I try to combine both filters, I get errors.
There are 3 tables involved. Stories, memories, and salutes. The salutes table keeps track of how many times someone 'salutes' a memory. Each story is composed of multiple memories. A story's total salutes is the sum of the salutes of that story's memories. I want to retrieve records of image-only stories in the order of highest to lowest salutes.
models/story.rb
def self.where_contains_image()
joins(
'INNER JOIN memories AS wci_memories ON wci_memories.story_id = stories.id'
)
.where(
'wci_memories.media_type_cd = ?', Memory.image
).uniq
end
controllers/stories_controller.rb
if params[:filter_content] == 'image'
stories = stories.where_contains_image
end
if (params[:filter_trends] == 'most_saluted')
stories = stories.order("(SELECT COUNT(1) FROM salutes
LEFT JOIN memories AS ms_memories ON salutes.content_id = ms_memories.id
LEFT JOIN stories AS ms_stories ON ms_stories.id = ms_memories.story_id
WHERE ms_stories.id = stories.id AND salutes.content_type = 'Memory')
DESC");
end
On its own, when the 'most_saluted' param is set, the query works as expected. When both the 'most_saluted' param and the 'image' param are set, I get an error:
for SELECT DISTINCT, ORDER BY expressions must appear in select list
I understand what the error is, but I cannot figure out how to rewrite the queries so that it can return only images in the order of most saluted.
When I run this SQL query on the database, it returns the records I'm looking for. But I cannot figure out how to make rails return the same records. Furthermore, this query combines the two filters (only images and highest salutes). I want to keep them separate so that I can apply one filter individually, or both together.
SELECT DISTINCT stories.*, (SELECT COUNT(1) FROM salutes
LEFT JOIN memories AS ms_memories ON salutes.content_id = ms_memories.id
LEFT JOIN stories AS ms_stories ON ms_stories.id = ms_memories.story_id
WHERE ms_stories.id = stories.id AND salutes.content_type = 'Memory')
AS total_salutes FROM stories INNER JOIN memories AS wci_memories
ON wci_memories.story_id = stories.id WHERE wci_memories.media_type_cd = 0
ORDER BY total_salutes DESC
Any thoughts on how I can resolve this?
You can use scope to achieve this, actually the Activerecord scope are the more cleaner/moduler way to chain conditions
read here about scopes
HTH