Rank within Groups in Pig 11 - jython

Pig question,
I have my data setup the following way.
Function Group Home Name
Rent MX 1 John
Rent MX 1 Jake
Rent MX 1 Pat
Rent DG 2 Jason
Rent DG 6 Patrick
Rent DG 6 Smith
Rent DG 6 Joe
What I want to do is group by Function, Group and Home, and then rank within each group.
Function Group Home Name Rank
Rent MX 1 John 1
Rent MX 1 Jake 2
Rent MX 1 Pat 3
Rent DG 6 Patrick 1
Rent DG 6 Smith 2
Rent DG 6 Joe 3
The RANK function in Pig does not let me rank within a group. Any suggestions? A Jython UDF?

Check out the Enumerate UDF in DataFu, it does this for you. http://datafu.incubator.apache.org/docs/datafu/1.1.0/datafu/pig/bags/Enumerate.html

Here are some pointers.
In the Cascading API, I used a Buffer, which lets you iterate over the values of each group.
I have read that Cascading also has an API for Jython developers; you may want to explore that.

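OK, this worked:
def num_bag(input):
    # Prepend a rank to every tuple in the bag.
    # enumerate(input, 1) yields (rank, tuple) pairs with ranks starting at 1.
    output = []
    for rank, item in enumerate(input, 1):
        output.append(tuple([rank] + list(item)))
    return output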
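For anyone who wants to check the expected ranks outside Pig, here is a minimal plain-Python sketch of ranking within groups. The function name rank_within_groups and the key_len parameter are made up for this illustration; the rows are the sample data from the question, and rows are ranked in input order.

```python
from collections import defaultdict


def rank_within_groups(rows, key_len=3):
    """Append a 1-based rank to each row, counted separately per group.

    The first `key_len` fields of a row (Function, Group, Home) form its
    group key; rows are ranked in the order they arrive.
    """
    counters = defaultdict(int)  # group key -> number of rows seen so far
    ranked = []
    for row in rows:
        key = tuple(row[:key_len])
        counters[key] += 1
        ranked.append(tuple(row) + (counters[key],))
    return ranked


rows = [
    ("Rent", "MX", 1, "John"),
    ("Rent", "MX", 1, "Jake"),
    ("Rent", "MX", 1, "Pat"),
    ("Rent", "DG", 2, "Jason"),
    ("Rent", "DG", 6, "Patrick"),
    ("Rent", "DG", 6, "Smith"),
    ("Rent", "DG", 6, "Joe"),
]

ranked = rank_within_groups(rows)
for row in ranked:
    print(*row)
```

Note that Jason is alone in his (Rent, DG, 2) group, so he gets rank 1.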

Related

Create a new column for group based on condition

I want to create a new column (Group ID) based on the following condition:
If the DOB and the first three letters of Name are the same, the rows must fall in the same Group ID.
Name | DOB | Group ID
Anny | 18-01-1922 | 0
Anny Scott | 01-01-1950 | 1
Annie | 01-01-1950 | 1
David | 14-02-1950 | 2
David Kern | 15-02-1951 | 3
William Perry | 15-02-1953 | 4
Kenneth Field | 15-02-1953 | 5
This is how I want to create the groups.
I have used the following code to create the group ID for the name (if the first three letters match):
df['Group ID Name']=df.groupby(df['name'].str[:3]).ngroup()
The following code is used to create the group ID for DOB (If two records have the same DOB)
df['Group ID DOB']=df.groupby('Date of Birth').ngroup()
I want to use both conditions to create the Group ID. Please help me with this.
Pass both keys to groupby as a list, and add sort=False so the groups are numbered in the order they first appear:
df['Group ID Name'] = df.groupby(['DOB',df['Name'].str[:3]], sort=False).ngroup()
print (df)
            Name         DOB  Group ID  Group ID Name
0           Anny  18-01-1922         0              0
1     Anny Scott  01-01-1950         1              1
2          Annie  01-01-1950         1              1
3          David  14-02-1950         2              2
4     David Kern  15-02-1951         3              3
5  William Perry  15-02-1953         4              4
6  Kenneth Field  15-02-1953         5              5
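To make the numbering explicit, here is a plain-Python sketch of what grouping on (DOB, first three letters of Name) with sort=False and ngroup() works out to: each distinct key gets the next integer the first time it is seen. The helper name ngroup_first_seen is made up for this illustration and is not part of pandas.

```python
def ngroup_first_seen(keys):
    """Number each key by the order in which its group first appears."""
    ids = {}  # group key -> group number
    numbers = []
    for key in keys:
        # A new key gets the current number of known groups as its id.
        numbers.append(ids.setdefault(key, len(ids)))
    return numbers


records = [
    ("Anny", "18-01-1922"),
    ("Anny Scott", "01-01-1950"),
    ("Annie", "01-01-1950"),
    ("David", "14-02-1950"),
    ("David Kern", "15-02-1951"),
    ("William Perry", "15-02-1953"),
    ("Kenneth Field", "15-02-1953"),
]

# Same composite key as in the groupby call: DOB plus the name prefix.
keys = [(dob, name[:3]) for name, dob in records]
group_ids = ngroup_first_seen(keys)
print(group_ids)
```

Without sort=False the groups would be numbered in sorted key order instead, which is why the answer adds it.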

Sql: add a column with integers in a loop for duplicates

I have a sql table like:
ID Name Balance
1 Peter 324.5
2 Michael 122.7
3 Peter 788.3
4 Mark 45.7
5 Ralph 333.5
6 Thomas 563.2
7 Ralph 9685.1
8 Peter 2444.5
9 Susi 35.2
10 Andrew 442.5
11 Susi 2424.8
Is it possible to write a while loop in SQL that adds a whole new column of integers (for example 1...3), with one number for each duplicated name (3 times Peter, 2 times Susi, 2 times Ralph)? For the non-duplicated names the value should be 0.
So the final table should look like this:
ID Name Balance Value
1 Peter 324.5 1
2 Michael 122.7 0
3 Peter 788.3 1
4 Mark 45.7 0
5 Ralph 333.5 2
6 Thomas 563.2 0
7 Ralph 9685.1 2
8 Peter 2444.5 1
9 Susi 35.2 3
10 Andrew 442.5 0
11 Susi 2424.8 3
You wouldn't want to use a while loop for this. Just use window functions:
select t.*, count(*) over (partition by name) as cnt
from t;
This provides the total count for each name. If you want an incremental value, you can use row_number():
select t.*, row_number() over (partition by name order by id) as seqnum
from t;
This would enumerate the rows for each name, so every name would have a "1" value, some would have "2" and so on.
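As a rough cross-check of what the two window functions return, here is a plain-Python sketch that computes the same two columns for the sample table. It only mirrors the semantics of count(*) over (partition by name) and row_number() over (partition by name order by id); the variable names are chosen for this illustration.

```python
from collections import Counter, defaultdict

rows = [
    (1, "Peter", 324.5), (2, "Michael", 122.7), (3, "Peter", 788.3),
    (4, "Mark", 45.7), (5, "Ralph", 333.5), (6, "Thomas", 563.2),
    (7, "Ralph", 9685.1), (8, "Peter", 2444.5), (9, "Susi", 35.2),
    (10, "Andrew", 442.5), (11, "Susi", 2424.8),
]

# count(*) over (partition by name): total rows per name.
cnt = Counter(name for _, name, _ in rows)

# row_number() over (partition by name order by id): running index per name.
seen = defaultdict(int)
result = []
for row_id, name, balance in sorted(rows):  # tuples sort by id first
    seen[name] += 1
    result.append((row_id, name, balance, cnt[name], seen[name]))

for row in result:
    print(*row)
```

Names that occur once end up with cnt = 1 and seqnum = 1, so a 0 for non-duplicates would still need an extra condition on cnt.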

Joining player and game tables to get player points

I have the following SQL tables and I'm basically trying to pull a table of every game that Ralph played in for 2018, and the amount of points scored.
Ralph has a unique_id, but may play on multiple teams, or in different positions. Each year that he plays has a new record entered into the player info table for each of those teams and/or positions.
The games data table's player ID may use both of Ralph's player info records, so for instance, records 1 and 2 of game data are both for Ralph, and his actual total points scored is 18 (12 + 6). I don't need those points to be added together, as that can be done easier in PHP, but I do need both records pulled.
------------------------------
Player Info as pi
------------------------------
id | unique_id | year | name | team | pos
1 5000 2018 Ralph 5 F
2 5000 2018 Ralph 5 C
3 5600 2018 Bill 5 G
4 5000 2017 Ralph 4 F
5 2688 2016 Mike 6 G
------------------------------
Game Info as gi
------------------------------
id | team 1 | team 2
1 5 6
2 6 5
3 8 3
4 6 2
------------------------------
Game Data as gd
------------------------------
id | game_info_id | player_id | Points
1 1 1 12
2 1 2 6
3 2 1 4
4 4 5 6
The table should show pi.id, pi.unique_id, gi.id, gd.* WHERE gd.player_id = Any of Ralph's pi.id's AND pi.year=2018
Any help here is appreciated, this seems a bit out of my wheelhouse.
Join the tables like this:
select
    pi.id, pi.unique_id, gi.id, gd.*
from playerinfo pi
inner join gameinfo gi on pi.team in (gi.team1, gi.team2)
inner join gamedata gd on gd.game_info_id = gi.id and gd.player_id = pi.id
where pi.name = 'Ralph' and pi.year = 2018
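Here is a small sketch that loads the sample rows into an in-memory SQLite database and runs the same join, in case you want to try it. The table and column names (playerinfo, gameinfo, gamedata, team1, team2, points) follow the query above and are assumptions about the real schema; the aliases p, g, d and the ORDER BY are added only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE playerinfo (id INTEGER PRIMARY KEY, unique_id INTEGER,
                             year INTEGER, name TEXT, team INTEGER, pos TEXT);
    CREATE TABLE gameinfo (id INTEGER PRIMARY KEY, team1 INTEGER, team2 INTEGER);
    CREATE TABLE gamedata (id INTEGER PRIMARY KEY, game_info_id INTEGER,
                           player_id INTEGER, points INTEGER);
    INSERT INTO playerinfo VALUES
        (1, 5000, 2018, 'Ralph', 5, 'F'), (2, 5000, 2018, 'Ralph', 5, 'C'),
        (3, 5600, 2018, 'Bill', 5, 'G'), (4, 5000, 2017, 'Ralph', 4, 'F'),
        (5, 2688, 2016, 'Mike', 6, 'G');
    INSERT INTO gameinfo VALUES (1, 5, 6), (2, 6, 5), (3, 8, 3), (4, 6, 2);
    INSERT INTO gamedata VALUES (1, 1, 1, 12), (2, 1, 2, 6), (3, 2, 1, 4), (4, 4, 5, 6);
""")

# Same joins as the answer; ORDER BY makes the row order deterministic.
query = """
    SELECT p.id, p.unique_id, g.id, d.points
    FROM playerinfo p
    INNER JOIN gameinfo g ON p.team IN (g.team1, g.team2)
    INNER JOIN gamedata d ON d.game_info_id = g.id AND d.player_id = p.id
    WHERE p.name = ? AND p.year = ?
    ORDER BY d.id
"""
ralph_rows = conn.execute(query, ("Ralph", 2018)).fetchall()
for row in ralph_rows:
    print(row)
```

Both of Ralph's 2018 player records show up for game 1 (12 and 6 points), which is the case the question was worried about.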

SQLQuery COUNT number of games per team

I have 4 tables:
Teams
codTeam | year | codYears | codType | name | points
1 | 1995 | 1 | 1 | FCP | 3
2 | 1990 | 1 | 1 | SLB | 3
3 | 1995 | 3 | 2 | BCP | 0
Trainers (people who train a team)
codTrainer | name | street | cellphone | birthdayDate | BI | number
1 | Peter | Ghost street | 252666337 | 1995-02-01 | 11111111 | 121212121
2 | Pan | Ghost street Remade | 253999666 | 1995-01-01 | 22222222 | 212121212
TeamsTrainers (which team each trainer is training)
codTeamTrainer | codTeam | codTrainer | dataInicio
1 | 1 | 2 | 1998-05-05
2 | 2 | 2 | 1998-06-07
3 | 2 | 1 | 1999-09-09
Games
codGame | date | codTeamHome | codTeamAgainst | goalsHome | goalsAgainst | codTypeGame
1 | 2015-02-12 13:00:00 | 1 | 2 | 3 | 2 | 1
2 | 2015-02-12 15:00:00 | 2 | 1 | 1 | 2 | 3
So basically I want to:
Get the table Games and show:
Team Name | Trainer Name | Goals Home | Goals Against | Points | Amount of Games from the Home Team
I have the following code for that in SQLQuery:
SELECT Teams.name, Trainers.name, Games.goalsHome,
Games.goalsAgainst, Teams.points, COUNT(*)
FROM Teams, Trainers, Games, TeamsTrainers
WHERE Games.codTeamHome = Teams.codTeam AND
TeamsTrainers.codTeam = Teams.codTeam AND
TeamsTrainers.codTrainer = Trainers.codTrainer
GROUP BY Teams.name, Trainers.name, Games.goalsHome,
Games.goalsAgainst, Teams.points
(May have some errors as I translated)
Yet the COUNT only shows 1 (probably because the WHERE clause matches on the home team, so it only counts one game). If that is the reason, how do I fix it?
Result:
FCP | Pan | 3 | 2 | 3 | 1 (Count)
SLB | Peter | 1 | 2 | 3 | 1 (Count)
SLB | Pan | 1 | 2 | 3 | 1 (Count)
The count should be 2 for each row.
Any ideas?
You get the wrong result because of how the tables are joined. Use explicit joins (LEFT, RIGHT or INNER JOIN, as appropriate) instead of joining the tables through the WHERE clause. Your data model has 1-to-N relationships, so you need to choose the join type deliberately.
See: http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
EDIT
SELECT Te.name, Tr.name, Ga.goalsHome, Ga.goalsAgainst, Te.points,
    (SELECT COUNT(*)
     FROM Games
     WHERE codTeamHome = Te.codTeam OR codTeamAgainst = Te.codTeam)
    AS CountOfGames
FROM TeamsTrainers AS Tt
    LEFT JOIN Teams AS Te ON Tt.codTeam = Te.codTeam
    LEFT JOIN Trainers AS Tr ON Tt.codTrainer = Tr.codTrainer
    LEFT JOIN Games AS Ga ON Ga.codTeamHome = Te.codTeam
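If you want to try the query, here is a sketch that runs it against the sample data in an in-memory SQLite database. Only the columns the query touches are created, and Pan is given codTrainer 2, which is an assumption (the question lists both trainers with codTrainer 1, but TeamsTrainers refers to trainer 2). The ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Teams (codTeam INTEGER PRIMARY KEY, name TEXT, points INTEGER);
    CREATE TABLE Trainers (codTrainer INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE TeamsTrainers (codTeamTrainer INTEGER PRIMARY KEY,
                                codTeam INTEGER, codTrainer INTEGER);
    CREATE TABLE Games (codGame INTEGER PRIMARY KEY, codTeamHome INTEGER,
                        codTeamAgainst INTEGER, goalsHome INTEGER,
                        goalsAgainst INTEGER);
    INSERT INTO Teams VALUES (1, 'FCP', 3), (2, 'SLB', 3), (3, 'BCP', 0);
    -- Assumption: Pan's codTrainer is 2, not 1 as printed in the question.
    INSERT INTO Trainers VALUES (1, 'Peter'), (2, 'Pan');
    INSERT INTO TeamsTrainers VALUES (1, 1, 2), (2, 2, 2), (3, 2, 1);
    INSERT INTO Games VALUES (1, 1, 2, 3, 2), (2, 2, 1, 1, 2);
""")

# The correlated subquery counts every game the team played, home or away.
query = """
    SELECT Te.name, Tr.name, Ga.goalsHome, Ga.goalsAgainst, Te.points,
           (SELECT COUNT(*)
            FROM Games
            WHERE codTeamHome = Te.codTeam OR codTeamAgainst = Te.codTeam)
           AS CountOfGames
    FROM TeamsTrainers AS Tt
    LEFT JOIN Teams AS Te ON Tt.codTeam = Te.codTeam
    LEFT JOIN Trainers AS Tr ON Tt.codTrainer = Tr.codTrainer
    LEFT JOIN Games AS Ga ON Ga.codTeamHome = Te.codTeam
    ORDER BY Tt.codTeamTrainer
"""
game_rows = conn.execute(query).fetchall()
for row in game_rows:
    print(row)
```

With this data each row reports a count of 2, since both teams played one home game and one away game.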
You can change your WHERE clause by saying
[what you have] OR (Games.codTeamAgainst = Teams.codTeam AND ...)
However, this probably causes other problems because you probably care about whether a particular team scores the goals, not whether the home team scores the goals in games that team plays on either side.
You might not notice the other problems for a while because your GROUP BY clause is probably pretty far from what you want, and you might want to be selecting aggregate functions for a much simpler grouping.

Row aggregation of count-distinct measure

I have a fairly simple project set up to demonstrate what I want here. Here's the data:
Group
ID Name
1 Group 1
2 Group 2
3 Group 3
Person
ID GroupID Age Name
1 1 18 John
2 1 21 Stephen
3 1 18 Kate
4 2 18 Mary
5 2 19 Joseph
6 2 19 Michael
7 3 21 David
8 3 22 Kevin
9 3 21 Julian
I have 1 measure in my cube called Person Count which is a Distinct count on Person ID
I have set up each non-ID column in the dimensions as attributes (Age, Person Name, Group).
When I process and browse the cube in Business Intelligence Development Studio, I get the following result set:
But what I actually want is for the Age rows to aggregate the Person Count, so that here there is only one row for age 18, showing a count of 2.
Is this possible (and how)?
Turns out this was a problem with the way I set up the Age attribute for the dimension.
I had:
KeyColumns = Person.ID
ValueColumn = Person.Age.
I don't know why I did this, but the solution is to delete the content of ValueColumn and set the KeyColumns to Person.Age again.
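To illustrate why the key matters, here is a rough plain-Python analogy (not how Analysis Services works internally): when the attribute is keyed on Person.ID, every person becomes a separate member, so each Age row counts one person; keyed on Person.Age, people of the same age fall under one member. The function person_count and its member_key argument are made up for this sketch.

```python
from collections import defaultdict

# (ID, GroupID, Age, Name) rows from the Person table in the question.
people = [
    (1, 1, 18, "John"), (2, 1, 21, "Stephen"), (3, 1, 18, "Kate"),
    (4, 2, 18, "Mary"), (5, 2, 19, "Joseph"), (6, 2, 19, "Michael"),
    (7, 3, 21, "David"), (8, 3, 22, "Kevin"), (9, 3, 21, "Julian"),
]


def person_count(rows, member_key):
    """Distinct count of person IDs per (group, attribute member)."""
    members = defaultdict(set)
    for person_id, group_id, age, _name in rows:
        members[(group_id, member_key(person_id, age))].add(person_id)
    return {key: len(ids) for key, ids in members.items()}


# Keyed on the person ID: one member per person, every count is 1.
by_id = person_count(people, lambda person_id, age: person_id)

# Keyed on the age: people of the same age share a member.
by_age = person_count(people, lambda person_id, age: age)

print(by_age[(1, 18)])  # John and Kate in Group 1
```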
I now get the following result:
Everything else is the same for the project; this was the only change and is exactly what I wanted. If I get any issues with it I will keep this post updated for anyone else who may run into this in the future.