Converting SQL query to Hive query - sql

I am having some trouble converting my SQL queries into Hive queries.
Relational schema:
Suppliers(sid, sname, address)
Parts(pid, pname, color)
Catalog(sid, pid, cost)
Query 1: Find the pnames of parts for which there is some supplier.
I have attempted one of the query conversions for query 1 and I think it is correct If someone can let me know if it is correct or incorrect I would really appreciate it. They seem to be the same to me based on the info I have looked up for Hive.
Query 1: SQL
SELECT pname
FROM Parts, Catalog
WHERE Parts.pid = Catalog.pid
Query 1: Converted to Hive
SELECT pname
FROM Parts, Catalog
WHERE Parts.pid = Catalog.pid;
Query 2: Find the sids of suppliers who supply only red parts.
For the second query I am having trouble. Mainly I am having trouble with the "not exists" part and the defining what color we want part. Can someone help me figure this out? I need to put the SQL into a Hive query.
Query 2: SQL
SELECT DISTINCT C.sid
FROM Catalog C
WHERE NOT EXISTS ( SELECT *
FROM Parts P
WHERE P.pid = C.pid AND P.color <> ‘Red’)
If someone can help me get these into the correct Hive format I would really appreciate it.
Thank you.

Although I have never used HiveQL in looking up some of its documentation it appears to support outer joins written in plain sql. In that case this should work: (an outer join where there is no match)
select distinct c.id
from catalog c
left outer join parts p
on (c.pid = p.pid
and p.color <> 'Red')
where p.pid is null
Edit -- enclosed on clause in () , this is not normally required of any major databases but seems to be needed in hiveql -- (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins).
Regarding your first query I don't think it wouldn't work in hive based on example queries in those docs however as another commenter mentioned it is best practice to use explicit joins via the join clause rather than implicit joins in the where clause. Think of the where clause as where you filter based on various conditions (except for join conditions), and the join clause as where you put all join conditions. It helps organize your query's logic. In addition, you can only imply inner joins (in the where clause). The join clause is needed any time you want to work with outer joins (as in the case of your second query, above).
This is the same as your first query but with explicit join syntax:
select pname
from parts p
join catalog c
on (p.pid = c.pid)

Related

SQL Question: Does the order of the WHERE/INNER JOIN clause when interlinking table matter?

Exam Question (AQA A-level Computer Science):
[Primary keys shown by asterisks]
Athlete(*AthleteID*, Surname, Forename, DateOfBirth, Gender, TeamName)
EventType(*EventTypeID*, Gender, Distance, AgeGroup)
Fixture(*FixtureID*, FixtureDate, LocationName)
EventAtFixture(*FixtureID*, *EventTypeID*)
EventEntry(*FixtureID*, *EventTypeID*, *AthleteID*)
A list is to be produced of the names of all athletes who are competing in the fixture
that is taking place on 17/09/18. The list must include the Surname, Forename and
DateOfBirth of these athletes and no other details. The list should be presented in
alphabetical order by Surname.
Write an SQL query to produce the list.
I understand that you could do this two ways, one using a WHERE clause and the other using the INNER JOIN clause. However, I am wondering if the order matters when linking the tables.
First exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete, EventEntry, Fixture
WHERE FixtureDate = "17/09/2018"
AND Athlete.AthleteID = EventEntry.AthleteID
AND EventEntry.FixtureID = Fixture.FixtureID
ORDER BY Surname
Here is the first exemplar solution, would it still be correct if I was to switch the order of the tables in the WHERE clause, for example:
WHERE FixtureDate = "17/09/2018"
AND EventEntry.AthleteID = Athlete.AthleteID
AND Fixture.FixtureID = EventEntry.FixtureID
I have the same question for the INNER JOIN clause to, here is the second exemplar solution:
SELECT Surname, Forename, DateOfBirth
FROM Athlete
INNER JOIN EventEntry ON Athlete.AthleteID = EventEntry.AthleteID
INNER JOIN Fixture ON EventEntry.FixtureID = Fixture.FixtureID
WHERE FixtureDate = "17/09/2018"
ORDER BY Surname
Again, would it be correct if I used this order instead:
INNER JOIN EventEntry ON Fixture.FixtureID = EventEntry.FixtureID
If the order does matter, could somebody explain to me why it is in the order shown in the examples?
Some advice:
Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
Use table aliases that are abbreviations for the table names.
Use standard date formats!
Qualify all column names.
Then, the order of the comparisons doesn't matter for equality. I would recommend using a canonical ordering.
So, the query should look more like:
SELECT a.Surname, a.Forename, a.DateOfBirth
FROM Athlete a INNER JOIN
EventEntry ee
ON a.AthleteID = ee.AthleteID INNER JOIN
Fixture f
ON ee.FixtureID = f.FixtureID
WHERE a.FixtureDate = '2018-09-17'
ORDER BY a.Surname;
I am guessing that all the columns in the SELECT come from Athlete. If that is not true, then adjust the table aliases.
There are lots of stylistic conventions for SQL and #gordonlinoff's answer mentions some of the perennial ones.
There are a few answers to your question.
The most important is that (notionally) SQL is a declarative language - you tell it what you want it to do, not how to do it. In a procedural language (like C, or Java, or PHP), the order of execution really matters - the sequence of instructions is part of the procedure. In a declarative language, the order doesn't really matter.
This wasn't always totally true - older query optimizers seemed to like the more selective where clauses earlier on in the statement for performance reasons. I haven't seen that for a couple of decades now, so assume that's not really a thing.
Because order doesn't matter, but correctly understanding the intent of a query does, many SQL developers emphasize readability. That's why we like explicit join syntax, and meaningful aliases. And for readability, the sequence of instructions can help. I favour starting with the "most important" table, usually the one from which you're selecting most columns, and then follow a logical chain of joins from one table to the next. This makes it easier to follow the logic.
When you use inner joins order does not matter as long as the prerequisite table is above/before. At your example both joins start from table Athlete so order doesn't matter. If however this very query is found starting from EventEntry (for any reason), then you must join at Athlete at the first inner else you cannot join to Fixture. As recommended, it is best to use standard join syntax and preferable place all inner joins before all lefts. If you cant then you need to review because the left you need to put inside the group of inner joins will probably behave like an inner join. That is because an inner below uses the left table else you could place it below the inner block. So when it comes to null the left will be ok but the inner below will cut the record.
When however the above cases do not exist/affect order and all inner joins can be placed at any order, only performance matters. Usually table with high cardinality on top perform better while there are cases where the opposite works better. So if the order is free you may try higher to lower cardinality tables ordering or the opposite - whatever works faster.
Clarifying: As prerequisite table i call the table needed by the joined table by condition: ... join B on [whatever] join C on c.id=b.cid - here table B is prerequisite for table C.
I mention left joins because while the question is about inner order, when joins are mixed (inners and lefts)then order of joins alone is important (to be all above) as may affect query logic:
... join B on [whatever] left join C on c.id=b.cid join D on D.id = C.did
At the above example the left join sneaks into the inner joins order. We cannot order it after D because it is prerequisite for D. For records however where condition c.id=b.cid is not true the entire B table row turns null and then the entire result row (B+C+D) turns off the results because of D.id = C.did condition of the following inner join. This example needs review as the purpose of left join evaporates by the following (next on order) inner join. Concluding, the order of inner joins when mixed with lefts is better to be on top without any left joins interfering.

Designing query to select count from different tables

This is the schema for my questions
Hi, I don't have experience in SQL Developer and I'm trying to build a query for the following question:
I need that for each DVD in the catalog, display the title, length, release_date, and how many times it has been checked out by all customers across all libraries.
Also I want to include those that have not been checked out yet displaying 0, and sort results by title.
So far I have this in the query but I'm stock here:
--Question C. ************* VERIFY
Select
Catalog_Item.Title,
DVD.Length,
Catalog_Item.Release_Date,
(
Select
Count(Transaction.Transaction_ID)
From Transaction
Where
DVD.Catalog_Item_ID = Physical_Item.Catalog_Item_ID
And Physical_Item.Physical_Item_ID = Transaction.Physical_Item_ID
) as "Total_DVD"
From
Catalog_Item,DVD,
Physical_Item
Group by
Catalog_Item.Title,
DVD.Length,
Catalog_Item.Release_Date
If I run this exact query I get error
Not a Group By Expression
And if I exclude the GROUP BY, I get results by doesn't look like the correct outputs.
Any suggestions on what syntax I can use to achieve the desired output? Thanks!
You put three tables to the query but you missed to link them. If you don't link them, you will see too much-duplicated rows.
Also, your sub-query links were wrong, I assume you tried to put the links here that you missed in the main query.
I believe you need something like that:
Select
CI.Title
,DVD.Length
,CI.Release_Date
,NVL(TR.TotalTransactions,0) TotalTransactions
From Catalog_Item CI
INNER JOIN DVD ON DVD.Catalog_Item_ID = CI.Catalog_Item_ID
LEFT JOIN Physical_Item PHI ON CI.Catalog_Item_ID = PHI.Catalog_Item_ID
LEFT JOIN (SELECT Physical_Item_ID
, Count(Transaction_ID) TotalTransactions
FROM Transaction
GROUP BY Physical_Item_ID
) TR ON PHI.Physical_Item_ID = TR.Physical_Item_ID
For a start, join Catalog_Item, Physical_Item and DVD together. Without appropriate join conditions, these three tables will join using a cartesian product join - which is probably one of the reasons why you are seeing unexpected results.

Unknown column on 'on clause' in sql joins?

Can you guys help me please in understanding the SQL specifications on join. I don't understand it. I kept getting errors called Unknown column list on on clause.
I got this error over my SQL syntax, I almost rubbed it in my face I just can't understand why it is not working, I have read some article regarding that it is because of precedence etc but I am really confused on what I have done wrong here.
select product.name , product.price from product inner join product_category on
(product_category.product_no = product.product_no ) where product_category.sub_category =
"COFFIN";
I know this question have been ask a hudred and million times here, but the ones I saw are complicated nowhere close in this very basic sql syntax.
THanks for helping me out.
EDIT:
I just had realize that I have product_category not a direct child to my product table so
I just have typed
select * from product
join specifications
join product_category on ( specifications.product_no = product_category.product_no);
But this still gave me an error, unknown column product_category.
I've read and followed some instruction similarly to this sites:
MYSQL unknown clause join column in next join
Unknown column {0} in on clause
MySQL "Unknown Column in On Clause"
I am really frustrated. I really can't get it to work.
For each new table you join in your query, each table must have at least one ON clause. It's hard to know exactly what you're trying to do without knowing the schema (table names, columns, etc), but here's an example
select *
from product p
join specifications s
on p.product_no = s.product_no
join product_category pc
on pc.spec_no = p.spec_no
Check out this link on table aliases as well. Gives a good example on joins + really useful info on how to increase the readability of your SQL
http://msdn.microsoft.com/en-us/library/ms187455(v=sql.90).aspx
I found this article useful as well as it visually displays the different types of joins
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
You are missing the part where you specify the joining conditions between product and specifcations.
select * from product
join specifications YOU NEED SOMETHING HERE
join product_category on etc
I modified the SQL syntax to look like this, I have overlook a key that connects product_category onto specification so I made necessary link and it worked!!!
SELECT *
FROM product
JOIN specifications ON ( product.product_no = specifications.product_no )
JOIN product_category ON ( specifications.spec_no = product_category.spec_no )
WHERE product_category.sub_category = "COFFIN"
LIMIT 0 , 30
Also thanks for the heads up on missing joining condition on specifications. Heck this carelessness cost so much time.
Thank you so much!
The default join type is an inner join. So if you write join, the database reads inner join, and insist that you include an on clause.
If you'd like to join without a condition, specify the cross join explicitly:
select *
from product p
cross join
specifications s
inner join
product_category pc
on pc.product_no = p.product_no
left join
some_other_table sot
on 1=1
The last join, with the on 1=1 condition, is another way to do a cross join. It's subtly different in that it will return rows from the left table even if the right table is empty.
Example at SQL Fiddle.

SQL UNION/JOIN?

I'm trying to show the Student's First Name, Last Name, SID, Major, and Minor. As well as the Courses, CID's and Meeting Times for each student. The Bridge table is confusing me though. Idk whether to do a Union or Multi-Table Join or what. Please Help!
Since this is homework, instead of answering your question directly I will provide you some tips.
You should really consider reviewing UNION and JOINs.
A UNION is used when you multiple queries that you want returned in the same dataset. They have to have similar datatypes in each field and the same number of fields in each query. So for this particular question, does it make sense to use a UNION? Probably not, since you want the student and their course in a single row.
For this type of query you want a JOIN. There is a really helpful visual explanation of JOINs that is available online. I suggest keep it handy to help you while you are studying SQL.
A Visual Explanation of SQL Joins
Since you need the Student data and their courses info, the only way you can get the two is by using the tblRegistration. If you use this table, in your query you should get the results you are looking for.
Edit:
Since a commenter has pointed out that the second drawing is a UNION not a JOIN, which I disagree on here is an example to prove it.
A UNION as stated above combines two queries into a single result set. Here is a sqlfiddle with a UNION demo:
Fiddle 1
A FULL OUTER JOIN Specifies that a row from either the left or right table that does not meet the join condition is included in the result set, and output columns that correspond to the other table are set to NULL. This is in addition to all rows typically returned by the INNER JOIN. Here is a sqlfiddle with a FULL OUTER JOIN demo:
Fiddle2
As you can see the products from both queries are not the same so a FULL OUTER JOIN in the diagram is not a UNION.
SELECT *
FROM tblStudent AS s
LEFT OUTER JOIN tblRegistration AS r
ON s.SID = r.SID
LEFT OUTER JOIN tblCourse AS c
ON c.CID = r.CID
Oracle Syntax:
SELECT stu.FirstName,
stu.LastName,
stu.Major,
stu.Minor,
reg.SID,
cou.CID,
cou.MeetingTime
FROM
tblStudent as stu,
tblRegistration as reg,
tblCourse as cou
WHERE
stu.SID(+) = reg.SID
and reg.CID = cou.CID

Queries that implicit SQL joins can't do?

I've never learned how joins work but just using select and the where clause has been sufficient for all the queries I've done. Are there cases where I can't get the right results using the WHERE clause and I have to use a JOIN? If so, could someone please provide examples? Thanks.
Implicit joins are more than 20 years out-of-date. Why would you even consider writing code with them?
Yes, they can create problems that explicit joins don't have. Speaking about SQL Server, the left and right join implicit syntaxes are not guaranteed to return the correct results. Sometimes, they return a cross join instead of an outer join. This is a bad thing. This was true even back to SQL Server 2000 at least, and they are being phased out, so using them is an all around poor practice.
The other problem with implicit joins is that it is easy to accidentally do a cross join by forgetting one of the where conditions, especially when you are joining too many tables. By using explicit joins, you will get a syntax error if you forget to put in a join condition and a cross join must be explicitly specified as such. Again, this results in queries that return incorrect values or are fixed by using distinct to get rid of the cross join which is inefficient at best.
Moreover, if you have a cross join, the maintenance developer who comes along in a year to make a change doesn't know if it was intended or not when you use implicit joins.
I believe some ORMs also now require explicit joins.
Further, if you are using implied joins because you don't understand how joins operate, chances are high that you are writing code that, in fact, does not return the correct result because you don't know how to evaluate what the correct result would be since you don't understand what a join is meant to do.
If you write SQL code of any flavor, there is no excuse for not thoroughly understanding joins.
Yes. When doing outer joins. You can read this simple article on joins. Joins are not hard to understand at all so you should start learning (and using them where appropriate) right away.
Are there cases where I can't get the right results using the WHERE clause and I have to use a JOIN?
Any time your query involves two or more tables, a join is being used. This link is great for showing the differences in joins with pictures as well as sample result sets.
If the join criteria is in the WHERE clause, then the ANSI-89 JOIN syntax is being used. The reason for the newer JOIN syntax in the ANSI-92 format, is that it made LEFT JOIN more consistent across various databases. For example, Oracle used (+) on the side that was optional while in SQL Server you had to use =*.
Implicit join syntax by default uses Inner joins. It is sometimes possible to modify the implicit join syntax to specify outer joins, but it is vendor dependent in my experience (i know oracle has the (-) and (+) notation, and I believe sqlserver uses *= ). So, I believe your question can be boiled down to understanding the differences between inner and outer joins.
We can look at a simple example for an inner vs outer join using a simple query..........
The implicit INNER join:
select a.*, b.*
from table a, table b
where a.id = b.id;
The above query will bring back ONLY rows where the 'a' row has a matching row in 'b' for it's 'id' field.
The explicit OUTER JOIN:
select * from
table a LEFT OUTER JOIN table b
on a.id = b.id;
The above query will bring back EVERY row in a, whether or not it has a matching row in 'b'. If no match exists for 'b', the 'b' fields will be null.
In this case, if you wanted to bring back EVERY row in 'a' regardless of whether it had a corresponding 'b' row, you would need to use the outer join.
Like I said, depending on your database vendor, you may still be able to use the implicit join syntax and specify an outer join type. However, this ties you to that vendor. Also, any developers not familiar wit that specialized syntax may have difficulty understanding your query.
Any time you want to combine the results of two tables you'll need to join them. Take for example:
Users table:
ID
FirstName
LastName
UserName
Password
and Addresses table:
ID
UserID
AddressType (residential, business, shipping, billing, etc)
Line1
Line2
City
State
Zip
where a single user could have his home AND his business address listed (or a shipping AND a billing address), or no address at all. Using a simple WHERE clause won't fetch a user with no addresses because the addresses are in a different table. In order to fetch a user's addresses now, you'll need to do a join as:
SELECT *
FROM Users
LEFT OUTER JOIN Addresses
ON Users.ID = Addresses.UserID
WHERE Users.UserName = "foo"
See http://www.w3schools.com/Sql/sql_join.asp for a little more in depth definition of the different joins and how they work.
Using Joins :
SELECT a.MainID, b.SubValue AS SubValue1, b.SubDesc AS SubDesc1, c.SubValue AS SubValue2, c.SubDesc AS SubDesc2
FROM MainTable AS a
LEFT JOIN SubValues AS b ON a.MainID = b.MainID AND b.SubTypeID = 1
LEFT JOIN SubValues AS c ON a.MainID = c.MainID AND b.SubTypeID = 2
Off-hand, I can't see a way of getting the same results as that by using a simple WHERE clause to join the tables.
Also, the syntax commonly used in WHERE clauses to do left and right joins (*= and =*) is being phased out,
Oracle supports LEFT JOIN and RIGHT JOIN using their special join operator (+) (and SQL Server used to support *= and =* on join predicates, but no longer does). But a simple FULL JOIN can't be done with implicit joins alone:
SELECT f.title, a.first_name, a.last_name
FROM film f
FULL JOIN film_actor fa ON f.film_id = fa.film_id
FULL JOIN actor a ON fa.actor_id = a.actor_id
This produces all films and their actors including all the films without actor, as well as the actors without films. To emulate this with implicit joins only, you'd need unions.
-- Inner join part
SELECT f.title, a.first_name, a.last_name
FROM film f, film_actor fa, actor a
WHERE f.film_id = fa.film_id
AND fa.actor_id = a.actor_id
-- Left join part
UNION ALL
SELECT f.title, null, null
FROM film f
WHERE NOT EXISTS (
SELECT 1
FROM film_actor fa
WHERE fa.film_id = f.film_id
)
-- Right join part
UNION ALL
SELECT null, a.first_name, a.last_name
FROM actor a
WHERE NOT EXISTS (
SELECT 1
FROM film_actor fa
WHERE fa.actor_id = a.actor_id
)
This will quickly become very inefficient both syntactically as well as from a performance perspective.