Join on multiple fields in Pig - apache-pig

I'm learning Pig and not sure how to do the following. I have one file that stores a series of metadata about chat messages:
12345 13579
23456 24680
19350 20283
28394 20384
10384 29475
...
The first column is the ID of the sender and the second column is the ID of the receiver. What I want to do is count how many messages are sent from men to women, men to men, women to men, and women to women. So I have another file which stores user IDs and genders:
12345 M
23456 F
34567 M
45678 M
...
So the Pig script might start out as follows:
messages = load 'messages.txt' as (from:int, to:int);
users = load 'users.txt' as (id:int,sex:chararray);
From there I'm really not sure what the next step should be. I was able to join messages to users on one column at a time, but I'm not sure how to join both columns and then do the subsequent grouping.
Any advice/tips would be super helpful.

I guess what you want is to join, then group and count your data.
joinedSenderRaw = JOIN users BY id, messages BY from;
joinedSender = FOREACH joinedSenderRaw GENERATE
    messages::from AS sender_id,
    users::sex AS sender_sex,
    messages::to AS receiver_id;
joinedAllRaw = JOIN joinedSender BY receiver_id, users BY id;
joinedAll = FOREACH joinedAllRaw GENERATE
    joinedSender::sender_id AS sender_id,
    joinedSender::sender_sex AS sender_sex,
    joinedSender::receiver_id AS receiver_id,
    users::sex AS receiver_sex;
grouped = GROUP joinedAll BY (sender_sex, receiver_sex);
result = FOREACH grouped GENERATE
    $0.sender_sex AS sender_sex,
    $0.receiver_sex AS receiver_sex,
    COUNT($1) AS your_stat;
I have not tested it, but something like this should work.
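An equivalent final step, if you prefer FLATTEN over projecting the group tuple positionally (equally untested, and consistent with how FLATTEN(group) is used in the answers below):

```pig
-- FLATTEN pulls the (sender_sex, receiver_sex) key out of the group tuple
result = FOREACH grouped GENERATE
    FLATTEN(group) AS (sender_sex, receiver_sex),
    COUNT(joinedAll) AS your_stat;
```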

Related

Sql Left or Right Join One To Many Pagination

I have one main table and join other tables via left outer or right outer joins. One row of the main table can yield over 30 rows in the join result. I am trying to paginate, but the problem is that I cannot know how many rows will be returned for each main-table row.
Example:
The first row of the main table yields 40 rows in my query.
The second row yields 120 rows.
Problem(Question) UPDATE:
For pagination I need to set the page size based on the count of the select result, but I cannot know the right count in advance. For example, if I request page 1 with a page size of 50, I do not get the right result: the top 10 main-table rows might expand to 200 result rows while my page size is only 50.
I am using SQL Server 2014. I need this for my ASP.NET project, but that part is not important.
Sample UPDATE :
It is like searching for a hotel when booking. The main table is the hotel table, and the other tables (images and videos in a media table, location in a place table, and maybe comments in a comment table) each have more than one row and a one-to-many relationship with the hotel. For one hotel, the result of all this info might be 100, 50, or 10 rows, and I am trying to paginate the hotel results. For performance I always need to get 20, 30, or 50 hotels per page in my project.
Sample Query UPDATE :
SELECT
*
FROM
KisiselCoach KC
JOIN WorkPlace WP
ON KC.KisiselCoachId = WP.WorkPlaceOwnerId
JOIN Album A
ON KC.KisiselCoachId = A.AlbumId
JOIN Media M
ON A.AlbumId = M.AlbumId
LEFT JOIN Rating R
ON KC.KisiselCoachId = R.OylananId
JOIN FrUser Fr
ON KC.CoachId = Fr.UserId
JOIN UserJob UJ
ON KC.KisiselCoachId = UJ.UserJobOwnerId
JOIN Job J
ON UJ.JobId = J.JobId
JOIN UserExpertise UserEx
ON KC.KisiselCoachId = UserEx.UserExpertiseOwnerId
JOIN Expertise Ex
ON UserEx.ExpertiseId = Ex.ExpertiseId
Hotel Table :
HotelId HotelName
1 Barcelona
2 Berlin
Media Table :
MediaID MediaUrl HotelId
1 www.xxx.com 1
2 www.xxx.com 1
3 www.xxx.com 1
4 www.xxx.com 1
Location Table :
LocationId Adress HotelId
1 xyz, Berlin 1
2 xyz, Nice 1
3 xyz, Sevilla 1
4 xyz, Barcelona 1
Comment Table :
CommentId Comment HotelId
1 you are cool 1
2 you are great 1
3 you are bad 1
4 hmm you are okey 1
This is only a sample! I have 9,999,999 hotels in my database. A hotel may have 100 images or zero; I cannot know. And I need to get 20 hotels per page in my result (pagination), but 20 hotels might mean 1000 rows, or might mean 100 rows.
First, your query is written poorly for readability and does not show the flow or relationships of the tables. I have updated and indented it to show how and where the tables relate hierarchically.
You also want to paginate; let's get back to that. Do you intend to show every record as a possible item, or a "parent" level set of data? E.g., you have only one instance per media, per user, or whatever, and once that entry is selected you would show details for that one entity? If so, I would do a DISTINCT query at the top level, or at least grab the few parent columns along with a count(*) of the child records each one has to show at the next level.
Also, mixing inner, left, and right joins can be confusing. A right join typically means you want all the records from the right-hand table of the join. Could this be rewritten so that all required tables are joined normally and the non-required ones are LEFT JOINed to the secondary table?
Clarifying all these relationships would definitely help, along with the context you are trying to get out of the pagination. I'll check for comments, but if the details are lengthy, please edit your original question with them rather than writing a long comment.
Here is my SOMEWHAT clarified query, rewritten to what I THINK the relationships are within your database. Notice my indentations showing where table A -> B -> C -> D for readability. All of these are (INNER) JOINs, indicating they must all have a match between the respective tables. If some things are NOT always there, they would be changed to LEFT JOINs.
SELECT
      *
   FROM
      KisiselCoach KC
         JOIN WorkPlace WP
            ON KC.KisiselCoachId = WP.WorkPlaceOwnerId
         JOIN Album A
            ON KC.KisiselCoachId = A.AlbumId
            JOIN Media M
               ON A.AlbumId = M.AlbumId
         LEFT JOIN Rating R
            ON KC.KisiselCoachId = R.OylananId
         JOIN FrUser Fr
            ON KC.CoachId = Fr.UserId
         JOIN UserJob UJ
            ON KC.KisiselCoachId = UJ.UserJobOwnerId
            JOIN Job J
               ON UJ.JobId = J.JobId
         JOIN UserExpertise UserEx
            ON KC.KisiselCoachId = UserEx.UserExpertiseOwnerId
            JOIN Expertise Ex
               ON UserEx.ExpertiseId = Ex.ExpertiseId
Readability of a query is a BIG help for yourself and for anyone assisting or following you. Not having the ON clauses near their corresponding JOINs can be very confusing to follow.
Also, identify which is your PRIMARY table and which are the lookup/reference tables.
ADDITION PER COMMENT
OK, so I updated a query that turns out to have no connection to the sample data and goal in your post. That said, I would start with a list of hotels only, plus a count(*) of things per hotel, so you can give SOME indication of how much detail each one has. Something like:
select
      H.HotelID,
      H.HotelName,
      coalesce( MedSum.recs, 0 ) as MediaItems,
      coalesce( LocSum.recs, 0 ) as NumberOfLocations,
      coalesce( ComSum.recs, 0 ) as NumberOfComments
   from
      Hotel H
         LEFT JOIN ( select M.HotelID,
                            count(*) recs
                        from Media M
                        group by M.HotelID ) MedSum
            on H.HotelID = MedSum.HotelID
         LEFT JOIN ( select L.HotelID,
                            count(*) recs
                        from Location L
                        group by L.HotelID ) LocSum
            on H.HotelID = LocSum.HotelID
         LEFT JOIN ( select C.HotelID,
                            count(*) recs
                        from Comment C
                        group by C.HotelID ) ComSum
            on H.HotelID = ComSum.HotelID
   order by
      H.HotelName
-- apply any limit per pagination here
Now this will return every hotel at the top level plus the total count of things per hotel; any of the counts may or may not exist, hence each sub-query is a LEFT JOIN. Expose a page of 20 different hotels. Then, as soon as someone picks a single hotel, you can drill into the locations, media, and comments for that one hotel.
Now, although this COULD work, computing these counts on every query might get very time-consuming. You might want to add counter columns to your main hotel table representing the counts being computed here. Then, via some nightly process, you could prime the counts ONCE across all history, and afterwards update counts only for those hotels with new activity since the prior day. You are not going to have 1,000,000 new images, locations, and comments posted in a day; if it is more like 22,000, those are the only hotel records whose counts you re-update, so each incremental cycle stays short, based only on the newest entries. For the web, having pre-aggregated counts, sums, etc. is a big time saver where practical.
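To actually page the hotel-level summary, SQL Server 2014 (which the asker is on) supports OFFSET/FETCH on an ordered query. An untested sketch, with @PageNumber and @PageSize as assumed parameters, capping each page at exactly 20 hotels no matter how many child rows each hotel has:

```sql
DECLARE @PageNumber int = 1, @PageSize int = 20;

-- Page over hotels only, so child-row counts cannot inflate the page size
SELECT
   H.HotelID,
   H.HotelName
FROM Hotel H
ORDER BY H.HotelName
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
-- Then join or query Media / Location / Comment only for these 20 hotel ids.
```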

Weave rows representing email messages into send & reply conversation threads

I have (two) tables of SENT and RECEIVED email messages exchanged between patients and their doctors within an app. I need to group these rows into conversation threads exactly the way you would expect to see them in your email inbox, but with the following difference:
Here, “thread” encompasses all back-and-forth exchanges between the same 2 users. Thus, each single unique pair of communicating users constitutes 1 and only 1 thread.
The following proof-of-concept code successfully creates a notion of “thread” for a single instance where I know the specific patient and doctor user IDs. The parts I can’t figure out are:
(1) how to accomplish this when I'm pulling multiple patients and doctors from the tables, and
(2) how to sort the resulting threads by initiating date
SELECT send.MessageContent, send.SentDatetime, rec.ReadDatetime, other_stuff
FROM MessageSend send
INNER JOIN MessageReceive rec
ON send.MessageId = rec.MessageId
WHERE
( send.UserIdSender = 123
OR rec.UserIdReceiver = 123 )
AND
(send.UserIdSender = 456
OR rec.UserIdReceiver = 456)
If MessageId is unique per conversation, you can order the messages using the sent and read datetimes.
If you want to filter for a particular doctor or patient, you can include it in the WHERE clause.
SELECT send.MessageContent, send.SentDatetime, rec.ReadDatetime, other_stuff
FROM MessageSend send
INNER JOIN MessageReceive rec
ON send.MessageId = rec.MessageId
ORDER BY send.MessageId,send.SentDatetime, rec.ReadDatetime
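The query above orders messages, but the question also asks how to generalize across all patient/doctor pairs and sort threads by initiating date. One untested sketch (column and table names taken from the question; SQL Server has no LEAST/GREATEST before 2022, hence the CASE expressions): build a canonical (low, high) user-id pair as the thread key, then use a window MIN for each thread's first send time:

```sql
SELECT
   send.MessageContent, send.SentDatetime, rec.ReadDatetime,
   -- canonical (low, high) user-id pair identifies the thread
   CASE WHEN send.UserIdSender < rec.UserIdReceiver
        THEN send.UserIdSender ELSE rec.UserIdReceiver END AS ThreadUserLow,
   CASE WHEN send.UserIdSender < rec.UserIdReceiver
        THEN rec.UserIdReceiver ELSE send.UserIdSender END AS ThreadUserHigh,
   -- initiating date of the whole thread, for thread-level sorting
   MIN(send.SentDatetime) OVER (PARTITION BY
      CASE WHEN send.UserIdSender < rec.UserIdReceiver
           THEN send.UserIdSender ELSE rec.UserIdReceiver END,
      CASE WHEN send.UserIdSender < rec.UserIdReceiver
           THEN rec.UserIdReceiver ELSE send.UserIdSender END
   ) AS ThreadStarted
FROM MessageSend send
INNER JOIN MessageReceive rec
   ON send.MessageId = rec.MessageId
ORDER BY ThreadStarted, ThreadUserLow, ThreadUserHigh, send.SentDatetime;
```

This keeps each unique pair of communicating users in one thread, orders threads by when they started, and orders messages within a thread by send time.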

Count the grouped records in pig query

Below is my test data.
John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong
I want to find something like below:
John wrong 4
John correct 1
Jack wrong 3
Jack correct 2
My Code:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
Now the output looks like this:
((John,wrong),{(John,q5,wrong),(John,q4,wrong),(John,q2,wrong),(John,q1,wrong)})
((John,Correct),{(John,q3,Correct)})
((Jack,wrong),{(Jack,q5,wrong),(Jack,q4,wrong),(Jack,q3,wrong)})
((Jack,Correct),{(Jack,q2,Correct),(Jack,q1,Correct)})
How can I count the grouped records?
The COUNT function gives you the number of elements in a bag, which is exactly what you want. After grouping by name and result, each group carries a bag of the records for that combination, and counting that bag gives the number of times the combination appeared.
Therefore, you only have to add one line:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
DUMP C;
(Jack,wrong,4)
(Jack,Correct,1)
(John,wrong,3)
(John,Correct,2)
The FLATTEN(group) is needed because grouping generates a tuple containing the fields you grouped by, and judging from your desired output you don't want them nested inside a tuple: without FLATTEN the output would look like ((Jack,wrong),4).
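To see the difference directly, an untested side-by-side sketch on the same grouped relation B:

```pig
-- without FLATTEN: the group key stays nested in a tuple, e.g. ((Jack,wrong),4)
C1 = FOREACH B GENERATE group, COUNT(data);
-- with FLATTEN: the key fields become top-level columns, e.g. (Jack,wrong,4)
C2 = FOREACH B GENERATE FLATTEN(group) AS (name, result), COUNT(data) AS count;
```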

Pig: Summing Fields

I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (eg, 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the labeled data as follows (labeled_data is of the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):
grouped_data = GROUP filtered_data BY county;
So grouped_data is of the schema
{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}
Now I would like to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form
pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);
but I've been unable to get this to work. Thanks in advance!
I don't know if this is helpful, but the following is a representative entry of grouped_data:
(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})
Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.
Can you try the below approach?
Sample input:
147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (county:chararray, pop1:int, pop2:int, pop3:int, pop4:int, pop5:int, pop6:int, pop7:int, pop8:int);
B = GROUP A BY county;
C = FOREACH B GENERATE group, (SUM(A.pop1) + SUM(A.pop2) + SUM(A.pop3) + SUM(A.pop4) + SUM(A.pop5) + SUM(A.pop6) + SUM(A.pop7) + SUM(A.pop8)) AS totalPopulation;
DUMP C;
Output:
(145,40)
(147,24)

Pig matching with an external file

I have a file (In Relation A) with all tweets
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
...
I have another file (In Relation B) with words to be filtered
sick
viral fever
feeling
...
My Code
-- loads all the tweets
A = load 'tweets' as tweets;
-- loads all the words to be filtered
B = load 'filter_list' as filter_list;
Expected Output
(sick,1)
(viral fever,2)
(feeling,1)
...
How do I achieve this in Pig using a join?
EDITED SOLUTION
The basic concept I supplied earlier will work, but it requires adding a UDF to generate NGram pairs from the tweets. You then UNION the NGram pairs with the tokenized tweets, and perform the wordcount on that dataset.
I've tested the code below, and it works fine against the data provided. If records in your filter_list have more than two words in a string (i.e., "I feel bad"), you'll need to recompile the ngram-udf with the appropriate count (or, ideally, turn it into a variable and set the ngram size on the fly).
You can get the source code for the NGramGenerator UDF here: Github
ngrams.pig
REGISTER ngram-udf.jar
DEFINE NGGen org.apache.pig.tutorial.NGramGenerator;
--Load the initial data
A = LOAD 'tweets.txt' as (tweet:chararray);
--Create NGram tuple with a size limit of 2 from the tweets
B = FOREACH A GENERATE FLATTEN(NGGen(tweet)) as ngram;
--Tokenize the tweets into single word tuples
C = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)tweet)) as ngram;
--Union the Ngram and word tuples
D = UNION B,C;
--Group similar tuples together
E = GROUP D BY ngram;
--For each unique ngram, generate the ngram name and a count
F = FOREACH E GENERATE group, COUNT(D);
--Load the wordlist for joining
Z = LOAD 'wordlist.txt' as (word:chararray);
--Perform the innerjoin of the ngrams and the wordlist
Y = JOIN F BY group, Z BY word;
--For each intersecting record, store the ngram and count
X = FOREACH Y GENERATE $0,$1;
DUMP X;
RESULTS/OUTPUT
(feeling,1)
(viral fever,2)
tweets.txt
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
wordlist.txt
sick
viral fever
feeling
Original Solution
I don't have access to my Hadoop system at the moment to test this answer, so the code may be off slightly. The logic should be sound, however. An easy solution should be:
Perform the classic wordcount program against the tweets dataset
Perform an inner join of the wordlist and tweets
Generate the data again to get rid of the duplicate word in the tuple
Dump/Store the join results
Example code:
A = LOAD 'tweets.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
Z = LOAD 'wordlist.txt' as (word:chararray);
Y = JOIN D BY group, Z BY word;
X = FOREACH Y GENERATE $0, $1;
DUMP X;
As far as I know, this is not possible using a join.
You could do a CROSS followed by a FILTER with a regex match.
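An untested sketch of that CROSS-plus-FILTER approach (relation and field names follow the question; CROSS is expensive, so this is only practical when the filter list is small, and it assumes the filter words contain no regex metacharacters):

```pig
A = LOAD 'tweets' AS (tweet:chararray);
B = LOAD 'filter_list' AS (word:chararray);
-- pair every tweet with every filter word (quadratic, hence small lists only)
crossed = CROSS A, B;
-- keep pairs where the tweet contains the word or phrase
-- (nested CONCAT because older Pig's CONCAT takes exactly two arguments)
matched = FILTER crossed BY (tweet MATCHES CONCAT(CONCAT('.*', word), '.*'));
-- count matches per filter word
grouped = GROUP matched BY word;
result = FOREACH grouped GENERATE group AS word, COUNT(matched) AS cnt;
DUMP result;
```

Unlike the tokenize-and-join answer above, this handles multi-word phrases like "viral fever" without an NGram UDF, at the cost of the CROSS.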