pig script: join tables with null values - apache-pig

I'd like to join 2 tables, and I'm a bit lost with different kinds of joins
A(a_name:chararray, a_number:int)
a 1
b 2
c
d 3
e
B(b_id:int, b_name:chararray)
1 one
2 two
3 three
I know that I need to do some sort of join, but with
AB = JOIN A by a_number, B by b_id;
FOREACH AB GENERATE
a_name,
b_name as a_number;
I get
a one
b two
d three
Instead of
a one
b two
c
d three
e
which I actually want.
How should I do this?
edit:
Ok, I tried left join but it doesn't keep the row order and instead returns
a one
b two
d three
c
e
Any workaround?

You are looking for a left JOIN.
This will keep all values on the left side of the relationship even if they don't appear in the right. Pig defaults to an inner JOIN, so it only keeps values that are in both sides.
This will now generate what you expect.
AB = JOIN A by a_number LEFT, B by b_id;
C = FOREACH AB GENERATE a_name, b_name AS a_number;
Also, you should be able to compact those two relations into:
AB = FOREACH (JOIN A by a_number LEFT, B by b_id)
GENERATE a_name, b_name AS a_number;
As far as I know, there is no option in JOIN to preserve the order of the left relation. However, you can RANK A beforehand, then ORDER on the number that RANK creates after the JOIN.
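The rank-then-order idea can be sketched outside Pig. The snippet below is a plain-Python stand-in, not Pig code: it tags each row of A with its position, performs a left join against B, deliberately shuffles the result to mimic the arbitrary order of a distributed join, and then sorts on the tag. The variable names (`ranked`, `joined`, `shuffled`) are illustrative assumptions.

```python
# Plain-Python stand-in for RANK -> LEFT JOIN -> ORDER (illustrative, not Pig).
A = [("a", 1), ("b", 2), ("c", None), ("d", 3), ("e", None)]
B = {1: "one", 2: "two", 3: "three"}  # b_id -> b_name

# RANK: tag every row of A with its original position.
ranked = [(rank, name, number) for rank, (name, number) in enumerate(A, start=1)]

# LEFT JOIN: keep every row of A; a null or unmatched key yields None.
joined = [
    (rank, name, B.get(number) if number is not None else None)
    for rank, name, number in ranked
]

# Mimic the arbitrary row order that a distributed join may return.
shuffled = joined[::2] + joined[1::2]

# ORDER on the rank restores the original order of A, then drop the rank.
result = [(name, b_name) for _, name, b_name in sorted(shuffled)]
print(result)
```

The ranks are unique, so the sort never has to compare the remaining fields, and the null rows for c and e end up back in their original positions.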

Related

why is my sql inner join return much more data than table 1?

I need to join three tables to get all the info I need. Table a has 70 million rows; after joining a with b, I get 40 million rows. But after I join table c, which has only 1.7 million rows, the result grows to 300 million rows.
In table c, the same pt_id and fi_id pair appears more than once. One pt_id can connect to many different fi_id values, but each fi_id connects to only one pt_id.
I'm wondering if there is any way to get rid of the duplicate rows, because I join table c only to get the pt_id.
Thanks for any help!
select c.pt_id,b.fi_id,a.zq_id
from a
inner join (select zq_id, fi_id from b) b
on a.zq_id = b.zq_id
inner join (select fi_id,pt_id from c) c
on b.fi_id = c.fi_id
You can use GROUP BY
select c.pt_id,b.fi_id,a.zq_id
from a
inner join (select zq_id, fi_id from b) b
on a.zq_id = b.zq_id
inner join (select fi_id,pt_id from c) c
on b.fi_id = c.fi_id
group by c.pt_id,b.fi_id,a.zq_id
to remove all duplicate rows, as described in the question below:
How do I (or can I) SELECT DISTINCT on multiple columns?
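To see why the row count grows and how deduplication helps, here is a small sketch using Python's built-in sqlite3 module. The sample data is invented for illustration. It also shows an alternative that is not part of the answer above: deduplicating c in the subquery before the join, which avoids producing the extra rows in the first place.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (zq_id INTEGER);
    CREATE TABLE b (zq_id INTEGER, fi_id INTEGER);
    CREATE TABLE c (fi_id INTEGER, pt_id INTEGER);
    INSERT INTO a VALUES (1);
    INSERT INTO b VALUES (1, 10);
    -- c repeats the same (fi_id, pt_id) pair three times
    INSERT INTO c VALUES (10, 100), (10, 100), (10, 100);
""")

join_sql = """
    SELECT c.pt_id, b.fi_id, a.zq_id
    FROM a
    INNER JOIN b ON a.zq_id = b.zq_id
    INNER JOIN c ON b.fi_id = c.fi_id
"""
# Each matching row in c produces one output row, so duplicates multiply.
raw_rows = con.execute(join_sql).fetchall()

# GROUP BY on every selected column collapses the duplicates afterwards.
grouped_rows = con.execute(
    join_sql + " GROUP BY c.pt_id, b.fi_id, a.zq_id"
).fetchall()

# Alternative (an assumption, not from the answer): deduplicate c first.
dedup_rows = con.execute("""
    SELECT c.pt_id, b.fi_id, a.zq_id
    FROM a
    INNER JOIN b ON a.zq_id = b.zq_id
    INNER JOIN (SELECT DISTINCT fi_id, pt_id FROM c) c ON b.fi_id = c.fi_id
""").fetchall()

print(len(raw_rows), grouped_rows, dedup_rows)
```

Both approaches return the same single row here; which one performs better on tables of the size described would need to be measured on the actual database.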

Generate all combinations that do not already exist between two tables and union them with the ones that already exist SQL

I have 3 tables:
A
B
C
C is an association class between A and B. Meaning that there is a many to many relationship between A and B. C also has fields of its own that are not the primary keys of A/B.
I want to return all the fields in C for a given A.ID (PK). Now this part might return 0 to * results. But I always want to return the same number of results as there are records in B. That is I want to fill in the missing combinations between A.ID and B (that do not exist in C) with Null Values.
SAMPLE:
I am trying to do this within Access.
In case it helps, here is an image with the specific tables and their fields that I am trying to do this with.
Where A is ASCs, B is Flights, and C is FlightHistory.
You will need 2 queries, one that selects all IDs of B together with the desired AID, and one query that selects all these combinations, outer-joined to the existing combinations in C. This can be written in a single query (with a subquery) like this:
SELECT AB.AID, AB.BID, C.Desc
FROM (SELECT A.AID, B.BID FROM A, B WHERE (((A.AID)=1))) AB
LEFT JOIN C ON (AB.BID = C.BID) AND (AB.AID = C.AID);
You could union two result sets together, one which gets the primary records from C with another that gets the missing entries from B...
SELECT * FROM C WHERE C.AID = 1
UNION
SELECT 1 as AID, ID as BID, '' as Desc FROM B WHERE ID NOT IN (SELECT BID FROM C WHERE C.AID = 1)
ORDER BY BID;
http://sqlfiddle.com/#!9/02d110/11/0
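The cross-join-then-left-join approach can be tried with a small sketch in Python's built-in sqlite3 module. This runs on SQLite rather than Access, the sample rows are invented, and the column is named `Note` instead of `Desc` because DESC is a reserved word in SQLite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (AID INTEGER PRIMARY KEY);
    CREATE TABLE B (BID INTEGER PRIMARY KEY);
    CREATE TABLE C (AID INTEGER, BID INTEGER, Note TEXT);
    INSERT INTO A VALUES (1), (2);
    INSERT INTO B VALUES (10), (20), (30);
    INSERT INTO C VALUES (1, 10, 'first'), (1, 30, 'third'), (2, 20, 'other');
""")

# Build every (AID, BID) pair for the chosen A row, then attach C where it exists.
rows = con.execute("""
    SELECT AB.AID, AB.BID, C.Note
    FROM (SELECT A.AID, B.BID FROM A CROSS JOIN B WHERE A.AID = ?) AS AB
    LEFT JOIN C ON AB.BID = C.BID AND AB.AID = C.AID
    ORDER BY AB.BID
""", (1,)).fetchall()

for row in rows:
    print(row)
```

The result has one row per record in B, and the combination that is missing from C (BID 20 for AID 1) comes back with a NULL in the C column.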

Additional inner join modifying results of previous calculations

I am having issues with using the count() function in an SQL*Plus query. Say I run
SELECT B.ID, COUNT(S.BRANCH_ID) FROM BRANCH B
INNER JOIN STAFF S ON S.BRANCH_ID = B.ID
GROUP BY B.ID;
from doing this I'll get the results
b.id count
1 6
2 6
3 6
4 7
5 6
which is fine. However, if I add even one extra inner join, I get completely different and wrong results. For example, if I run
SELECT B.ID, COUNT(S.BRANCH_ID) FROM BRANCH B
INNER JOIN STAFF S ON S.BRANCH_ID = B.ID
INNER JOIN TOOL_STOCK TS ON TS.BRANCH_ID = B.ID
GROUP BY B.ID;
Now the results I get will be...
b.id count
1 96
2 96
3 96
4 112
5 96
Why is this and how do I stop it? Cheers!
Try
SELECT B.ID, COUNT(DISTINCT S.STAFF_ID) FROM BRANCH B
INNER JOIN STAFF S ON S.BRANCH_ID = B.ID
INNER JOIN TOOL_STOCK TS ON TS.BRANCH_ID = B.ID
GROUP BY B.ID;
replacing S.STAFF_ID with the primary key field from the STAFF table.
Your problem is that the COUNT function counts the rows in each GROUP BY group after all the joins have been applied.
In your initial query you are finding the number of employees for each branch. In the second, the number of employees is multiplied by the number of stock items.
When you add the second join, each STAFF row is paired with every TOOL_STOCK row at the same branch, so you are counting STAFF × TOOL_STOCK combinations. Your results (96 = 6 × 16, 112 = 7 × 16) are consistent with 16 TOOL_STOCK rows per branch.
You will likely need to add a subquery if you want all the data returned, but only counts of one record type.
I think the key to your issue is, which are you actually trying to count?
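The multiplication effect is easy to reproduce on a tiny data set. The sketch below uses Python's built-in sqlite3 module with invented data (one branch, 2 staff, 3 tools); `STAFF_ID` is the placeholder key name from the answer above and `TOOL_ID` is a hypothetical key name. It compares the inflated count, the COUNT(DISTINCT ...) fix, and the subquery approach that aggregates each table before joining.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE BRANCH (ID INTEGER PRIMARY KEY);
    CREATE TABLE STAFF (STAFF_ID INTEGER PRIMARY KEY, BRANCH_ID INTEGER);
    CREATE TABLE TOOL_STOCK (TOOL_ID INTEGER PRIMARY KEY, BRANCH_ID INTEGER);
    INSERT INTO BRANCH VALUES (1);
    INSERT INTO STAFF VALUES (1, 1), (2, 1);
    INSERT INTO TOOL_STOCK VALUES (1, 1), (2, 1), (3, 1);
""")

base = """
    SELECT B.ID, {expr}
    FROM BRANCH B
    INNER JOIN STAFF S ON S.BRANCH_ID = B.ID
    INNER JOIN TOOL_STOCK TS ON TS.BRANCH_ID = B.ID
    GROUP BY B.ID
"""
# 2 staff rows x 3 tool rows = 6 joined rows for the branch.
inflated = con.execute(base.format(expr="COUNT(S.BRANCH_ID)")).fetchall()

# Counting distinct staff keys ignores the repetition caused by the join.
distinct = con.execute(base.format(expr="COUNT(DISTINCT S.STAFF_ID)")).fetchall()

# Subquery approach: aggregate each table on its own, then join the totals.
per_table = con.execute("""
    SELECT B.ID, S.staff_count, T.tool_count
    FROM BRANCH B
    INNER JOIN (SELECT BRANCH_ID, COUNT(*) AS staff_count
                FROM STAFF GROUP BY BRANCH_ID) S ON S.BRANCH_ID = B.ID
    INNER JOIN (SELECT BRANCH_ID, COUNT(*) AS tool_count
                FROM TOOL_STOCK GROUP BY BRANCH_ID) T ON T.BRANCH_ID = B.ID
""").fetchall()

print(inflated, distinct, per_table)
```

The subquery form is worth considering when counts from both tables are needed in the same result, since each count is computed before any fan-out can occur.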

SQL One-To-Many join issue

Let's say I have two tables in Access, TheLetters and TheNumbers. TheLetters has one column, TheLetter, and 4 records: A, B, C, and D. TheNumbers has many records for each TheLetters record. Say TheNumbers has two columns, [TheLetter] and [TheNumber]. See below:
TheLetters
[TheLetter]
A
B
C
D
TheNumbers
[TheLetter][TheNumber]
A 1
A 2
A 3
B 1
B 2
How do I write a query that returns one record for each "TheLetters" record and the MAX "TheNumber" from TheNumbers table or blank if there's no match for TheLetter in TheNumbers table? So I want my result set to be:
[TheLetters.TheLetter][TheNumbers.TheNumber]
A 3
B 2
C <NULL>
D <NULL>
I can get A,3 - B,2 but it cuts out C & D because there's not a match in TheNumbers. I've tried switching my joins all around. I've tried putting an IF in the WHERE clause that says if we have a match return the record from TheNumbers or else give me blank. I can't seem to get the syntax right. Thanks for any help!
The key is to use a LEFT JOIN:
SELECT l.TheLetter, MAX(n.TheNumber)
FROM TheLetters l
LEFT JOIN TheNumbers n ON l.TheLetter = n.TheLetter
GROUP BY l.TheLetter
A left outer join returns all rows in the left table, returning data for any correlated rows in the right table, or a single row with the right table's columns set to NULL if there are no correlated rows.
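As a quick check of that behaviour, the following sketch loads the sample data from the question into an in-memory SQLite database through Python's built-in sqlite3 module (SQLite rather than Access, so treat it as an approximation) and runs the query with both join types.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TheLetters (TheLetter TEXT);
    CREATE TABLE TheNumbers (TheLetter TEXT, TheNumber INTEGER);
    INSERT INTO TheLetters VALUES ('A'), ('B'), ('C'), ('D');
    INSERT INTO TheNumbers VALUES ('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2);
""")

query = """
    SELECT l.TheLetter, MAX(n.TheNumber)
    FROM TheLetters l
    {join} JOIN TheNumbers n ON l.TheLetter = n.TheLetter
    GROUP BY l.TheLetter
    ORDER BY l.TheLetter
"""
# LEFT JOIN keeps C and D; MAX over their all-NULL groups is NULL.
left_rows = con.execute(query.format(join="LEFT")).fetchall()

# INNER JOIN drops letters that have no matching numbers.
inner_rows = con.execute(query.format(join="INNER")).fetchall()

print(left_rows)
print(inner_rows)
```

NULL comes back as `None` in Python, which corresponds to the `<NULL>` entries in the desired result set.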

SQL Server Need Some Help Joins , please

I have 3 tables: A, B, and C. There is a relationship between tables A and C, and there is a relationship between tables B and C. There is no relationship between A and B.
What I would really like to do is get a list of all the records from B that have related records in C, given a value from A.
Please let me know if this is not clear enough.
Thanks
You can write a query something like this:
SELECT B.* FROM B
INNER JOIN C ON C.aa = B.aa
INNER JOIN A ON A.bb = C.bb
WHERE A.cc = @yourvalue
@yourvalue is the value on which you base the selection of records from table B. If you need to match multiple values from A, change the query to something like this:
WHERE A.cc IN (@val1,@val2,@val3....,@valNth)
This query uses INNER JOIN, so it returns only the records that are common to the joined tables. If you join only B with C, you get the records that are common to B and C; when you then join A with C, you keep only those that are also common to A and C.
So suppose B contains the records 1,2,3, C contains 2,3,4,5, and A contains 1,3,4,5.
Then the output of the above query (without the WHERE clause) is 3 only, because it is the only value common to all three tables A, B, and C.
You can find more information about joins in SQL Server by referring to these links:
http://blog.sqlauthority.com/2009/04/13/sql-server-introduction-to-joins-basic-of-joins/
http://www.dotnet-tricks.com/Tutorial/sqlserver/W1aI140312-Different-Types-of-SQL-Joins.html
http://www.aspdotnet-suresh.com/2011/12/different-types-of-joins-in-sql-server.html
Simple math dictates if there is a relationship between A and C, and a relationship between B and C, there is, albeit by association, a relationship between A and B (through C).
Thus you will need to join all three together, going from A, through C, to B:
SELECT B.*
FROM A
JOIN C ON A.x = C.x
JOIN B ON B.y = C.y
WHERE A.z = @z
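A minimal sketch of that A-through-C-to-B join, run on SQLite through Python's built-in sqlite3 module rather than on SQL Server. The column names x, y and z follow the placeholder names in the query above; the `label` column and all sample rows are invented for illustration, and the value is passed as a bound parameter (`?`).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (x INTEGER, z TEXT);
    CREATE TABLE B (y INTEGER, label TEXT);
    CREATE TABLE C (x INTEGER, y INTEGER);
    INSERT INTO A VALUES (1, 'red'), (2, 'blue');
    INSERT INTO B VALUES (10, 'b10'), (20, 'b20'), (30, 'b30');
    -- C links A row 1 to B rows 10 and 20, and A row 2 to B row 30
    INSERT INTO C VALUES (1, 10), (1, 20), (2, 30);
""")

# Start from A, follow C to reach the related rows of B.
rows = con.execute("""
    SELECT B.*
    FROM A
    JOIN C ON A.x = C.x
    JOIN B ON B.y = C.y
    WHERE A.z = ?
    ORDER BY B.y
""", ("red",)).fetchall()

print(rows)
```

Only the B rows linked through C to the A row with z = 'red' are returned; the B row linked to the 'blue' row is filtered out.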