I have looked all over for an explanation of how a subquery in a select statement works, and I still cannot grasp the concept because the explanations are very vague.
I would like to know how to use a subquery in a select statement in Oracle and what exactly it outputs.
For example, suppose I had a query that should display the names of employees and the number of profiles they manage, from these tables:
Employee(EmpName, EmpId)
Profile(ProfileId, ..., EmpId)
How do I use the subquery?
I was thinking a subquery is needed in the select statement to implement the group by and count the number of profiles managed by each employee, but I am not too sure.
It's simple:
SELECT empname,
empid,
(SELECT COUNT (profileid)
FROM profile
WHERE profile.empid = employee.empid)
AS number_of_profiles
FROM employee;
It is even simpler when you use a table join like this:
SELECT e.empname, e.empid, COUNT (p.profileid) AS number_of_profiles
FROM employee e LEFT JOIN profile p ON e.empid = p.empid
GROUP BY e.empname, e.empid;
Explanation for the subquery:
Essentially, a subquery in a select returns a scalar value and passes it to the main query. A subquery in a select is not allowed to return more than one row or more than one column, which is a restriction. Here, we are passing a count to the main query, which, as we know, is always a single number: a scalar value. If the subquery finds no row, it returns null to the main query (an aggregate such as COUNT still returns a row, so this particular subquery yields 0 rather than null for an employee with no profiles). Moreover, a subquery can access columns from the from clause of the main query, as shown in my query, where employee.empid is passed from the outer query to the inner query.
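To see the scalar behaviour without an Oracle instance, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module. The sample rows, the first_profile column, and the LIMIT 1 clause (SQLite syntax, not Oracle) are assumptions made for illustration only.

```python
import sqlite3

# In-memory SQLite stand-in for the Oracle tables; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empid INTEGER PRIMARY KEY, empname TEXT);
    CREATE TABLE profile  (profileid INTEGER PRIMARY KEY, empid INTEGER);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO profile  VALUES (10, 1), (11, 1);
""")

rows = conn.execute("""
    SELECT empname,
           -- aggregate subquery: always exactly one row, so 0 when nothing matches
           (SELECT COUNT(profileid)
              FROM profile
             WHERE profile.empid = employee.empid) AS number_of_profiles,
           -- non-aggregate subquery: no matching row means NULL
           (SELECT profileid
              FROM profile
             WHERE profile.empid = employee.empid
             ORDER BY profileid
             LIMIT 1) AS first_profile
      FROM employee
     ORDER BY empid
""").fetchall()

for row in rows:
    print(row)
```

Bob manages no profiles, so his count comes back as 0 while the non-aggregate subquery comes back as NULL (None in Python).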
Edit:
When you use a subquery in a select clause, Oracle essentially treats it as a left join (you can see this in the explain plan for your query), with the cardinality of the rows being just one on the right for every row in the left.
Explanation for the left join
A left join is very handy, especially when you want to replace the select subquery due to its restrictions. There are no restrictions here on the number of rows of the tables in either side of the LEFT JOIN keyword.
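As a rough check that the two forms agree, the sketch below runs the subquery version and the LEFT JOIN version against the same made-up data using Python's sqlite3 module (SQLite rather than Oracle, which is an assumption of the sketch). It also shows the ungrouped join, where one employee can appear on several rows.

```python
import sqlite3

# In-memory SQLite stand-in; table contents are invented for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empid INTEGER PRIMARY KEY, empname TEXT);
    CREATE TABLE profile  (profileid INTEGER PRIMARY KEY, empid INTEGER);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO profile  VALUES (10, 1), (11, 1);
""")

subquery_rows = conn.execute("""
    SELECT empname, empid,
           (SELECT COUNT(profileid)
              FROM profile
             WHERE profile.empid = employee.empid) AS number_of_profiles
      FROM employee
     ORDER BY empid
""").fetchall()

join_rows = conn.execute("""
    SELECT e.empname, e.empid, COUNT(p.profileid) AS number_of_profiles
      FROM employee e LEFT JOIN profile p ON e.empid = p.empid
     GROUP BY e.empname, e.empid
     ORDER BY e.empid
""").fetchall()

# Without GROUP BY the join is free to return several rows per employee.
raw_join_rows = conn.execute("""
    SELECT e.empname, p.profileid
      FROM employee e LEFT JOIN profile p ON e.empid = p.empid
     ORDER BY e.empid, p.profileid
""").fetchall()

print(subquery_rows == join_rows)
print(raw_join_rows)
```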
For more information read Oracle Docs on subqueries and left join or left outer join.
In the Oracle RDBMS, it is possible to use a multi-row subquery in the select clause as long as the (sub-)output is encapsulated as a collection. In particular, a multi-row select clause subquery can output each of its rows as an xmlelement that is encapsulated in an xmlforest.
SQL Masters,
I don't understand part of this query. In the select statement there are what look like independent 'select statements', almost like a function. This code is vendor-written (Blackbaud CRM). As independent code, there is no join in the from clause for the info they bring into the data set, as you can see. One last odd item is that in the column aliased Spouse_id, the column SPOUSE.RECIPROCALCONSTITUENTID does not even exist in the table referred to. Any BBCRM people out there who can explain this?
Thanks
select
CONSTITUENT.ID,
CONSTITUENT.ISORGANIZATION,
CONSTITUENT.KEYNAME,
CONSTITUENT.FIRSTNAME,
CONSTITUENT.MIDDLENAME,
CONSTITUENT.MAIDENNAME,
CONSTITUENT.NICKNAME,
(select SPOUSE.RECIPROCALCONSTITUENTID
from dbo.RELATIONSHIP as SPOUSE
where SPOUSE.RELATIONSHIPCONSTITUENTID = CONSTITUENT.ID
and SPOUSE.ISSPOUSE = 1) as [SPOUSE_ID],
(select MARITALSTATUSCODE.DESCRIPTION
from dbo.MARITALSTATUSCODE
where MARITALSTATUSCODE.ID = CONSTITUENT.MARITALSTATUSCODEID) as [MARITALSTATUSCODEID_TRANSLATION]
From
dbo.constituent
left join
dbo.ORGANIZATIONDATA on ORGANIZATIONDATA.ID = CONSTITUENT.ID
where
(CONSTITUENT.ISCONSTITUENT = 1)
These are correlated subqueries. Although there is no explicit JOIN, there is a link to the outer table which behaves like a join (although more constrained than explicit JOINs):
(select SPOUSE.RECIPROCALCONSTITUENTID
from dbo.RELATIONSHIP as SPOUSE
where SPOUSE.RELATIONSHIPCONSTITUENTID = CONSTITUENT.ID AND
-- the line above is the correlation clause connecting to the outer table
SPOUSE.ISSPOUSE = 1
) as [SPOUSE_ID],
This behaves like a LEFT JOIN. If no rows match, then the result is NULL.
Note that in this context, the correlated subquery is also a scalar subquery. That means that it returns exactly one column and at most one row.
If the query returned more than one column, you would get a compile-time error on the query. If the query returns more than one row, you will get a run-time error on the query.
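A small sketch of these rules, using Python's sqlite3 module with cut-down, hypothetical versions of the CONSTITUENT and RELATIONSHIP tables (only the columns the subquery touches). One caveat is an assumption worth stating loudly: SQLite is not SQL Server. It does reject a two-column subquery in the select list, but when a scalar subquery matches several rows it silently keeps the first one instead of raising a run-time error, so that part is not reproduced here.

```python
import sqlite3

# Cut-down, made-up versions of the vendor tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE constituent (id INTEGER PRIMARY KEY, keyname TEXT);
    CREATE TABLE relationship (
        relationshipconstituentid INTEGER,
        reciprocalconstituentid   INTEGER,
        isspouse                  INTEGER
    );
    INSERT INTO constituent  VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO relationship VALUES (1, 7, 1);
""")

# Correlated scalar subquery: no matching row gives NULL, like a LEFT JOIN.
rows = conn.execute("""
    SELECT c.id,
           (SELECT s.reciprocalconstituentid
              FROM relationship AS s
             WHERE s.relationshipconstituentid = c.id
               AND s.isspouse = 1) AS spouse_id
      FROM constituent AS c
     ORDER BY c.id
""").fetchall()

# Two columns in a scalar subquery are rejected when the statement is prepared.
try:
    conn.execute("""
        SELECT c.id,
               (SELECT s.reciprocalconstituentid, s.isspouse
                  FROM relationship AS s
                 WHERE s.relationshipconstituentid = c.id)
          FROM constituent AS c
    """)
    two_column_error = None
except sqlite3.Error as exc:
    two_column_error = exc

print(rows)
print(two_column_error)
```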
What is the difference between semi-joins and a subquery? I am currently taking a course on this on DataCamp and I'm having a hard time making a distinction between the two.
Thanks in advance.
A join or a semi-join is required whenever you want to combine records from two or more entities based on some common attributes.
By contrast, a subquery is required whenever you want a lookup or a reference on the same table or on other tables.
In short, when you need additional reference columns added to the existing table's attributes, go for a join; when you want a lookup on records from the same table or other tables while keeping the same existing columns as output, go for a subquery.
Also, a semi-join can be written as a subquery, because most of the time we don't actually join the right table; instead we maintain a check via a subquery to limit the records in the existing table. Hence it is a semi-join, even though it is not a subquery by itself.
I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select * -- this is often called the "outer" query
from (
select columnA -- this is the subquery inside the parentheses
from mytable
where columnB = 'Y'
)
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join class c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID, p.FirstName, p.LastName, p.DOB
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from class
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is often preferred, although many optimizers produce the same plan for in and exists:
select FirstName, LastName, DOB
from people p
where exists (select pID
from class c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.
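The duplicate-row point is easy to see on a tiny data set. This is a minimal sketch using Python's sqlite3 module; the people and class rows are invented, the DOB column is left out for brevity, and one person is deliberately enrolled twice in 'SQL 101' so that the plain join shows the duplication.

```python
import sqlite3

# Made-up sample data; Ada appears twice in class for 'SQL 101'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (pID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE class  (pID INTEGER, ClassName TEXT);
    INSERT INTO people VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing'),
                              (3, 'Grace', 'Hopper');
    INSERT INTO class  VALUES (1, 'SQL 101'), (1, 'SQL 101'),
                              (2, 'Algebra'), (3, 'SQL 101');
""")

def first_names(sql):
    """Run a query and return just the FirstName values, in order."""
    return [row[0] for row in conn.execute(sql)]

# A true join repeats Ada once per matching class row.
plain_join = first_names("""
    select p.FirstName from people p
    inner join class c on c.pID = p.pID
    where c.ClassName = 'SQL 101'
    order by p.pID
""")

# Semi-join via IN: each person at most once.
via_in = first_names("""
    select FirstName from people
    where pID in (select pID from class where ClassName = 'SQL 101')
    order by pID
""")

# Semi-join via EXISTS: same rows as IN.
via_exists = first_names("""
    select FirstName from people p
    where exists (select 1 from class c
                  where c.pID = p.pID and c.ClassName = 'SQL 101')
    order by p.pID
""")

print(plain_join, via_in, via_exists)
```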
I have my query and I need to get the same output with a correlated subquery. I'm new in the correlated subqueries, so please help.
The original query:
SELECT Sales.SalesOrderHeader.CustomerID, SUM(Sales.SalesOrderDetail.LineTotal)
FROM Sales.SalesOrderDetail
INNER JOIN Sales.SalesOrderHeader
ON Sales.SalesOrderDetail.SalesOrderID = Sales.SalesOrderHeader.SalesOrderID
GROUP BY Sales.SalesOrderHeader.CustomerID;
Sorry for all the back and forth in the comments. Using a correlated subquery in the SELECT portion of the query, you could write this, also, as:
SELECT customerID, sum(sumOfLines)
FROM
(
SELECT header.CustomerID,
(SELECT sum(Detail.LineTotal) FROM Sales.SalesOrderDetail as Detail WHERE Detail.SalesOrderID = header.SalesOrderID) as sumOfLines
FROM Sales.SalesOrderHeader as header
) sub
GROUP BY customerID
This is pretty ugly and is not going to perform faster. There is a fairly good chance your DBMS will choose the same execution path for both versions.
Update: I updated the above sql to aggregate again by using a subquery so that only unique customerID's come through, since we can't aggregate on a correlated subquery within the query that utilizes the correlated subquery.
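To check that the join form and the correlated-subquery form give the same totals, here is a minimal sketch using Python's sqlite3 module. It assumes simplified stand-ins for the two tables: the Sales schema prefix is dropped (SQLite has no such schema here), only the columns used by the queries exist, and the rows are invented with integer totals.

```python
import sqlite3

# Simplified, made-up stand-ins for Sales.SalesOrderHeader / Sales.SalesOrderDetail.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesOrderHeader (SalesOrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    CREATE TABLE SalesOrderDetail (SalesOrderID INTEGER, LineTotal INTEGER);
    INSERT INTO SalesOrderHeader VALUES (1, 100), (2, 100), (3, 200);
    INSERT INTO SalesOrderDetail VALUES (1, 10), (1, 5), (2, 20), (3, 7);
""")

join_rows = conn.execute("""
    SELECT h.CustomerID, SUM(d.LineTotal)
      FROM SalesOrderDetail AS d
      JOIN SalesOrderHeader AS h ON d.SalesOrderID = h.SalesOrderID
     GROUP BY h.CustomerID
     ORDER BY h.CustomerID
""").fetchall()

correlated_rows = conn.execute("""
    SELECT CustomerID, SUM(sumOfLines)
      FROM (SELECT header.CustomerID,
                   -- correlated on header.SalesOrderID: one total per order
                   (SELECT SUM(Detail.LineTotal)
                      FROM SalesOrderDetail AS Detail
                     WHERE Detail.SalesOrderID = header.SalesOrderID) AS sumOfLines
              FROM SalesOrderHeader AS header) AS sub
     GROUP BY CustomerID
     ORDER BY CustomerID
""").fetchall()

print(join_rows)
print(correlated_rows)
```

One difference to keep in mind: a customer whose orders have no detail rows is dropped by the inner join but would still appear, with a NULL total, in the correlated version.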
I'm getting really frustrated about SQL Server. I'm just trying to join 3 tables, very simple and easily done in mysql. But in SQL Server it keeps telling me to contain tbl_department.deptname in an aggregate function. But what aggregate function could I possibly use in a simple string?
SELECT
COUNT(tblStudent_Department.student_id) AS Expr2,
tbl_department.deptname AS Expr1
FROM
tblStudent_Department
LEFT OUTER JOIN
tbl_department ON tblStudent_Department.deptcode = tbl_department.deptcode
LEFT OUTER JOIN
tblStudent ON tblStudent_Department.student_id = tblStudent.studentid
GROUP BY
tblStudent_Department.deptcode
Please help.
The database doesn't know that if you group on deptcode, you're implicitly grouping on deptname. You must tell SQL Server this by adding the column to the group by:
GROUP BY tblStudent_Department.deptcode, tbl_department.deptname
MySQL is special in that, unless ONLY_FULL_GROUP_BY is enabled (it is by default since MySQL 5.7.5), it basically picks an arbitrary row if you don't specify an aggregate. This can be misleading and lead to wrong results. As in many other things, MySQL has the more pragmatic solution, and SQL Server the more correct one.
The problem is because your GROUP BY and SELECT terms don't match up.
The simplest way to fix this is to add tbl_department.deptname into your GROUP BY, like so:
GROUP BY tblStudent_Department.deptcode, tbl_department.deptname
You're grouping by deptcode but selecting deptname - if you don't want to aggregate the department (which sounds like it makes sense) then you need to have the deptname in the "group by" statement:
SELECT COUNT(tblStudent_Department.student_id) AS Expr2, tbl_department.deptname AS Expr1
FROM tblStudent_Department
LEFT OUTER JOIN tbl_department ON tblStudent_Department.deptcode = tbl_department.deptcode
LEFT OUTER JOIN tblStudent ON tblStudent_Department.student_id = tblStudent.studentid
GROUP BY tbl_department.deptname
Note I've removed the deptcode because I don't think you need it.
If you're using aggregate functions (sum, count etc) ALL fields returned in your select statement need to either be aggregated OR in the group by clause.
Use First, Last, or put it into the group by.
The rules are:
If you use a group by, every field is either one of the grouping fields or one of the aggregated fields.
If you select tbl_department.deptname, then you have to either group by that too, or say which one is taken.
Some aggregate functions fake that nicely: First and Last (take the first or last occurrence) where the database offers them. SQL Server has no First or Last aggregate, so MIN or MAX would play that role there.
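The two ways of satisfying the rule can be compared side by side. This is a minimal sketch using Python's sqlite3 module with invented rows; the unused join to tblStudent is left out. Note the assumption: SQLite, like MySQL without ONLY_FULL_GROUP_BY, tolerates a bare non-grouped column, so it cannot reproduce the SQL Server error itself. The sketch only shows that both corrected forms return the same counts.

```python
import sqlite3

# Made-up rows for the two tables used by the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_department (deptcode TEXT PRIMARY KEY, deptname TEXT);
    CREATE TABLE tblStudent_Department (student_id INTEGER, deptcode TEXT);
    INSERT INTO tbl_department VALUES ('CS', 'Computer Science'), ('MA', 'Mathematics');
    INSERT INTO tblStudent_Department VALUES (1, 'CS'), (2, 'CS'), (3, 'MA');
""")

# Option 1: put deptname into the GROUP BY.
grouped = conn.execute("""
    SELECT COUNT(sd.student_id), d.deptname
      FROM tblStudent_Department sd
      LEFT OUTER JOIN tbl_department d ON sd.deptcode = d.deptcode
     GROUP BY sd.deptcode, d.deptname
     ORDER BY sd.deptcode
""").fetchall()

# Option 2: say which deptname to take by wrapping it in an aggregate.
aggregated = conn.execute("""
    SELECT COUNT(sd.student_id), MIN(d.deptname)
      FROM tblStudent_Department sd
      LEFT OUTER JOIN tbl_department d ON sd.deptcode = d.deptcode
     GROUP BY sd.deptcode
     ORDER BY sd.deptcode
""").fetchall()

print(grouped)
print(aggregated)
```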
Can anyone give me a good example of a subquery using TSQL 2008?
Maximilian Mayer believes that, due to referencing MS documentation, my assertion that there is a difference between a subquery and a subSelect is incorrect. Frankly, I'd consider MSDN's "Subquery Fundamentals" a better choice. Quote:
You are making distinctions between terms that actually mean the same.
O RLY?
A subQUERY...
IE:
WHERE id IN (SELECT n.id FROM TABLE n)
OR id = (SELECT MAX(m.id) FROM TABLE m)
OR EXISTS(SELECT 1/0 FROM TABLE) --won't return a math error for division by zero
...affects the WHERE or HAVING clauses -- the filtering of data -- for a SELECT, INSERT, UPDATE or DELETE statement. The value from a subquery is never directly visible in the SELECT clause.
A subSELECT...
IE:
SELECT t.column,
(SELECT x.col FROM TABLE x) AS col2
FROM TABLE t
...does not affect the filtering of data in the main query, and the value is exposed directly in the SELECT clause. But it's only one value - you can't return two or more columns into a single column in the outer query.
A subselect is a consistent means of performing a LEFT JOIN in ANSI-89 join syntax - if there is no supporting row, the column will be null. Additionally, a non-correlated subselect will return the same value for every row of the main query.
Correlation
If a subquery or subselect is correlated, that query runs once for every record of the main query returned -- which doesn't scale well as the number of rows in the result set increases.
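The difference in results between the two kinds of subselect can be seen in a minimal sketch using Python's sqlite3 module. The tables t and x, their columns, and the rows are hypothetical. The sketch only shows what each form returns; it says nothing about how a given optimizer actually executes it.

```python
import sqlite3

# Hypothetical tables t and x with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, col TEXT);
    CREATE TABLE x (id INTEGER PRIMARY KEY, val INTEGER);
    INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO x VALUES (1, 10), (2, 20), (3, 30);
""")

rows = conn.execute("""
    SELECT t.id,
           -- non-correlated: no reference to t, same value on every row
           (SELECT MAX(x.val) FROM x) AS same_for_all,
           -- correlated: depends on the current row of t
           (SELECT x.val FROM x WHERE x.id = t.id) AS per_row
      FROM t
     ORDER BY t.id
""").fetchall()

for row in rows:
    print(row)
```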
Derived Table/Inline View
IE:
SELECT x.*,
y.max_date,
y.num
FROM TABLE x
JOIN (SELECT t.id,
t.num,
MAX(t.date) AS max_date
FROM TABLE t
GROUP BY t.id, t.num) y ON y.id = x.id
...is a JOIN to a derived table (AKA inline view).
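Here is a minimal sketch of such a join, run through Python's sqlite3 module. The events table, its columns, and the latest view are hypothetical names chosen for the example. It joins to the derived table, then joins to a view defined with the same SELECT, and both return the same rows, with more than one column coming from the derived side.

```python
import sqlite3

# Hypothetical table and view; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER, num INTEGER, event_date TEXT);
    INSERT INTO events VALUES (1, 5, '2020-01-01'),
                              (1, 6, '2020-03-01'),
                              (2, 7, '2020-02-01');
    CREATE VIEW latest AS
        SELECT t.id, MAX(t.event_date) AS max_date, COUNT(*) AS num_rows
          FROM events t
         GROUP BY t.id;
""")

# Join to a derived table (inline view) written directly in the FROM clause.
inline_rows = conn.execute("""
    SELECT x.id, x.num, y.max_date, y.num_rows
      FROM events x
      JOIN (SELECT t.id, MAX(t.event_date) AS max_date, COUNT(*) AS num_rows
              FROM events t
             GROUP BY t.id) y ON y.id = x.id
     ORDER BY x.id, x.num
""").fetchall()

# Same join, referencing the stored view by name instead.
view_rows = conn.execute("""
    SELECT x.id, x.num, y.max_date, y.num_rows
      FROM events x
      JOIN latest y ON y.id = x.id
     ORDER BY x.id, x.num
""").fetchall()

print(inline_rows)
print(inline_rows == view_rows)
```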
"Inline view" is a better term, because that is all that happens when you reference a non-materialized view -- a view is just a prepared SQL statement. There's no performance or efficiency difference if you create a view with a query like the one in the example, and reference the view name in place of the SELECT statement within the brackets of the JOIN. The example has the same information as a correlated subquery, but the performance benefit of using a join and none of the subquery detriments. And you can return more than one column, because it is a view/derived table.
Conclusion
It should be obvious why I and others make distinctions. The concept of relying on the word "subquery" to categorize any SELECT statement that isn't the main clause is fatally flawed, because it's also a specific case under a categorization of the same word (IE: subquery-subselect, subquery-subquery, subquery-join...). Now think of helping someone who says "I've got a problem with a subquery..."
Maximilian Mayer's idea of "official" documentation was written by technical writers, who often have no experience in the subject and are only summarizing what they've been told to from knowledgeable people who have simplified things. Ultimately, it's just text on a page or screen -- like what you're reading now -- and the decision is up to you if the details I've laid out make sense to you.
For variety's sake, here's one in the where clause:
select
a.firstname,
a.lastname
from
employee a
where
a.companyid in (
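select top 10
c.companyid
from
company c
where
c.num_employees > 1000
order by
c.num_employees desc
)
...returns all employees in the ten largest companies with over 1000 employees. Without the order by, top 10 would pick ten such companies arbitrarily.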
SELECT
*,
(SELECT TOP 1 SomeColumn FROM dbo.SomeOtherTable)
FROM
dbo.MyTable
SELECT a.*, b.*
FROM TableA AS a
INNER JOIN
(
SELECT *
FROM TableB
) as b
ON a.id = b.id
That's a normal subquery, running once for the whole result set.
On the other hand
SELECT a.*, (SELECT b.somecolumn FROM TableB AS b WHERE b.id = a.id)
FROM TableA AS a
is a correlated subquery, running once for every row in the result set.