Using a SELECT statement within a WHERE clause - sql

SELECT * FROM ScoresTable WHERE Score =
(SELECT MAX(Score) FROM ScoresTable AS st WHERE st.Date = ScoresTable.Date)
Is there a name to describe using a SELECT statement within a WHERE clause? Is this good/bad practice?
Would this be a better alternative?
SELECT ScoresTable.*
FROM ScoresTable INNER JOIN
(SELECT Date, MAX(Score) AS MaxScore
FROM ScoresTable GROUP BY Date) SubQuery
ON ScoresTable.Date = SubQuery.Date
AND ScoresTable.Score = SubQuery.MaxScore
It is far less elegant, but appears to run more quickly than my previous version. I dislike it because it is not displayed very clearly in the GUI (and it needs to be understood by SQL beginners). I could split it into two separate queries, but then things begin to get cluttered...
N.B. I need more than just Date and Score (e.g. name)

It's called a correlated subquery. It has its uses.

It's not bad practice at all. They are usually referred to as SUBQUERY, SUBSELECT or NESTED QUERY.
It's a relatively expensive operation, but it's quite common to encounter a lot of subqueries when dealing with databases, since it's the only way to perform certain kinds of operations on data.

There's a much better way to achieve your desired result, using SQL Server's analytic (or windowing) functions.
SELECT DISTINCT Date, MAX(Score) OVER(PARTITION BY Date) FROM ScoresTable
If you need more than just the date and max score combinations, you can use ranking functions, e.g.:
SELECT t.*
FROM ScoresTable t
JOIN (
SELECT
ScoreId,
ROW_NUMBER() OVER (PARTITION BY Date ORDER BY Score DESC) AS [Rank]
FROM ScoresTable
) ranked ON ranked.ScoreId = t.ScoreId AND ranked.[Rank] = 1
You may want to use RANK() instead of ROW_NUMBER() if you want every record that ties for MAX(Score) on a given date to be returned.
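To make the difference between ROW_NUMBER() and RANK() on ties concrete, here is a minimal sketch that runs the ranking query against an in-memory SQLite database via Python's sqlite3 module. SQLite (3.25 or later, for window functions) is only a stand-in for SQL Server here, and the sample rows and the helper name top_rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ScoresTable (ScoreId INTEGER PRIMARY KEY, Name TEXT, Date TEXT, Score INTEGER);
    INSERT INTO ScoresTable (Name, Date, Score) VALUES
        ('Ann', '2024-01-01', 90),
        ('Bob', '2024-01-01', 90),   -- ties with Ann for the top score on this date
        ('Cid', '2024-01-01', 70),
        ('Dee', '2024-01-02', 55);
""")

def top_rows(ranking_function):
    """Return the rows ranked first per Date (hypothetical helper).

    ranking_function is spliced into the SQL text, so pass only trusted
    literals such as "ROW_NUMBER" or "RANK".
    """
    sql = f"""
        SELECT Name, Date, Score
        FROM (
            SELECT Name, Date, Score,
                   {ranking_function}() OVER (PARTITION BY Date ORDER BY Score DESC) AS rnk
            FROM ScoresTable
        ) AS ranked
        WHERE rnk = 1
        ORDER BY Date, Name
    """
    return conn.execute(sql).fetchall()

with_row_number = top_rows("ROW_NUMBER")  # one row per date; the tie is broken arbitrarily
with_rank = top_rows("RANK")              # every row that ties for first place
```

With ROW_NUMBER() only one of the two tied rows survives, and which one is not specified; RANK() keeps both.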

The principle of subqueries is not at all bad, but I don't think that you should use it in your example. If I understand correctly you want to get the maximum score for each date. In this case you should use a GROUP BY.

This is a correlated sub-query.
(It is a "nested" query, although that is a very non-technical term.)
The inner query takes values from the outer query (WHERE st.Date = ScoresTable.Date), thus it is conceptually evaluated once for each row in the outer query.
There is also a non-correlated form, in which the inner query is independent and as such is only executed once.
e.g.
SELECT * FROM ScoresTable WHERE Score =
(SELECT MAX(Score) FROM ScoresTable)
There is nothing wrong with using subqueries, except where they are not needed :)
Your statement may be rewritable as an aggregate function depending on what columns you require in your select statement.
SELECT Max(score), Date FROM ScoresTable
Group By Date
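The three forms discussed above can be compared side by side. The following is a small sketch using Python's sqlite3 module with an in-memory database; the sample rows and the Name column are assumptions added for illustration, and SQLite stands in for whichever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ScoresTable (Name TEXT, Date TEXT, Score INTEGER);
    INSERT INTO ScoresTable VALUES
        ('Ann', '2024-01-01', 90),
        ('Bob', '2024-01-01', 70),
        ('Cid', '2024-01-02', 55),
        ('Dee', '2024-01-02', 80);
""")

# Correlated: the inner query refers to the outer row's Date,
# so the result is the top-scoring row for each date.
correlated = conn.execute("""
    SELECT Name, Date, Score FROM ScoresTable
    WHERE Score = (SELECT MAX(Score) FROM ScoresTable AS st
                   WHERE st.Date = ScoresTable.Date)
    ORDER BY Date
""").fetchall()

# Non-correlated: the inner query is independent of the outer row,
# so the result is the top-scoring row overall.
non_correlated = conn.execute("""
    SELECT Name, Date, Score FROM ScoresTable
    WHERE Score = (SELECT MAX(Score) FROM ScoresTable)
""").fetchall()

# Plain aggregate: gives the max per date, but no other columns such as Name.
grouped = conn.execute("""
    SELECT MAX(Score), Date FROM ScoresTable
    GROUP BY Date
    ORDER BY Date
""").fetchall()
```

The GROUP BY form is the shortest, but it cannot return Name, which is why the question needs either the correlated subquery or a join back to the table.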

In your scenario, why not use GROUP BY and a HAVING clause instead of joining the table to itself? You may also use other useful functions.

Subquery is the name.
At times it's required, but good/bad depends on how it's applied.

Related

SQL aggregate function when count(*)=1 so there can be only one value

Sometimes you write a grouped query where each group is a single row, because of having count(*) = 1. This means that the usual aggregate functions like min, max, sum and so on are a bit pointless: the min equals the max, equals the sum, equals the average, since there's exactly one value to aggregate.
I usually end up picking min arbitrarily. If we take the familiar example of a table mapping a book to its author(s), I might want to query just books that have a single author:
-- For books that have a single author, pull back that author's id.
select book_id,
min(author_id) as author_id
-- I could equally well use max(author_id) or even sum(author_id)...
from book_authors
group by book_id
having count(*) = 1
That works, but it seems it could be expressed better. I'm not actually interested in the 'minimum' per se, but just to get the single value which I know exists. Some column types (such as bit in Microsoft SQL Server) do not support the min aggregate function so you have to do workarounds like convert(bit, min(convert(int, mycol))).
So, I expect the answer will be no, but is there some better way to specify my intent?
select book_id,
there_must_be_one_value_so_just_return_it(author_id) as author_id
from book_authors
group by book_id
having count(*) = 1
Clearly, if you're not requiring count(*)=1 then you no longer guarantee a single value and the special aggregate function could not be used. That error could be caught when the SQL is compiled.
The desired result would be equivalent to the min query above.
I'm using Microsoft SQL Server (2016) but as this is a fairly "blue sky" kind of question, I would be interested in replies about other SQL dialects too.
You could, instead, use a windowed COUNT and then filter based on that:
WITH CTE AS(
SELECT ba.book_id,
ba.author_id,
COUNT(ba.book_id) OVER (PARTITION BY ba.book_id) AS Authors
FROM dbo.book_authors ba)
SELECT c.book_id,
c.author_id
FROM CTE c
WHERE c.Authors = 1;
An alternative method would be to use a correlated subquery:
SELECT ba.book_id,
ba.author_id
FROM dbo.book_authors ba
WHERE EXISTS (SELECT 1
FROM dbo.book_authors e
WHERE e.book_id = ba.book_id
GROUP BY e.book_id
HAVING COUNT(*) = 1);
I have not tested the performance of either with a decent amount of data. However, I would hope that, on a well-indexed table, the correlated subquery gives better performance.
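As a quick sanity check that the three formulations agree, here is a sketch that runs them against an in-memory SQLite database from Python. SQLite (3.25+ for the windowed COUNT) is an assumption standing in for SQL Server, the dbo schema prefix is dropped, and the sample rows are invented. It says nothing about relative performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER);
    -- book 2 has two authors; books 1 and 3 have exactly one each
    INSERT INTO book_authors VALUES (1, 10), (2, 20), (2, 21), (3, 30);
""")

queries = {
    "min_aggregate": """
        SELECT book_id, MIN(author_id) AS author_id
        FROM book_authors GROUP BY book_id HAVING COUNT(*) = 1
    """,
    "windowed_count": """
        WITH cte AS (
            SELECT book_id, author_id,
                   COUNT(book_id) OVER (PARTITION BY book_id) AS authors
            FROM book_authors)
        SELECT book_id, author_id FROM cte WHERE authors = 1
    """,
    "correlated_exists": """
        SELECT ba.book_id, ba.author_id
        FROM book_authors ba
        WHERE EXISTS (SELECT 1 FROM book_authors e
                      WHERE e.book_id = ba.book_id
                      GROUP BY e.book_id HAVING COUNT(*) = 1)
    """,
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}
```

The windowed and EXISTS versions select author_id directly, so they avoid wrapping it in an arbitrary aggregate.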

How to select the row with the lowest value- oracle

I have a table where I save authors and songs, with other columns. The same song can appear multiple times, and it obviously always comes from the same author. I would like to select the author that has the least songs, including the repeated ones, aka the one that is listened to the least.
The final table should show only one author name.
Clearly, one step is to find the count for every author. This can be done with an elementary aggregate query. Then, if you order by count, you can just select the first row, and this would solve your problem. One approach is to use ROWNUM in an outer query. This is a very elementary approach, quite efficient, and it works in all versions of Oracle (it doesn't use any advanced features).
select author
from (
select author
from your_table
group by author
order by count(*)
)
where rownum = 1
;
Note that in the subquery we don't need to select the count (since we don't need it in the output). We can still use it in order by in the subquery, which is all we need it for.
The only tricky part here is to remember that you need to order the rows in the subquery, and then apply the ROWNUM filter in the outer query. This is because ORDER BY is the very last thing that is processed in any query - it comes after ROWNUM is assigned to rows in the output. So, moving the WHERE clause into the subquery (and doing everything in a single query, instead of a subquery and an outer query) does not work.
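The ordering issue can be modelled without a database. The sketch below is plain Python, not Oracle: it imitates "number the rows, then sort" versus "sort, then number the rows" on made-up data, purely to show why the filter has to sit outside the ordered subquery.

```python
from collections import Counter

# Hypothetical (author, song) rows; a repeated song counts as another listen.
rows = [("Ann", "a1"), ("Ann", "a1"), ("Ann", "a2"),
        ("Bob", "b1"),
        ("Cid", "c1"), ("Cid", "c2")]

# GROUP BY author with COUNT(*); Counter keeps first-seen order: Ann, Bob, Cid.
counts = Counter(author for author, _ in rows)
grouped = list(counts.items())

# Subquery sorts first, outer query numbers the rows and keeps number 1.
ordered = sorted(grouped, key=lambda pair: pair[1])
correct = [author for rownum, (author, _) in enumerate(ordered, start=1) if rownum == 1]

# Single-level query: rows are numbered before sorting, the filter keeps
# whichever row happened to come first, and the sort then has nothing to do.
numbered = [(rownum, author, n) for rownum, (author, n) in enumerate(grouped, start=1)]
kept = [row for row in numbered if row[0] == 1]
wrong = [author for _, author, _ in sorted(kept, key=lambda row: row[2])]
```

Here correct holds the least-played author, while wrong holds whichever author the grouping step happened to produce first.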
You can use analytic functions as follows:
Select author from
(Select c.*,
Row_number() over (order by cnt_author) as rn
From
(Select t.*,
Count(*) over (partition by author) as cnt_author
From your_table t) c ) r
Where rn = 1

Hive query optimization for aggregated columns appear once in a select statement

If the same aggregate expression appears more than once in a single select, will it be evaluated only once? For example:
select
date,
count(userid) as uv,
sum(isclick) as clickcnt,
count(userid) / sum(isclick) as ctr
from
user_access_log
group by
1
Here both count(userid) and sum(isclick) are used twice. Would they be evaluated twice or only once? Will Hive do any query optimization?
This is too long for a comment.
It doesn't make a difference. The expense of running an aggregation query is almost entirely in bringing the rows for groups together. For the most part, the aggregations themselves are not expensive.
The one exception is count(distinct) (well, distinct with any form). This requires a bunch more overhead.
If you really want to run the aggregations only once, you can use a subquery:
select ual.*, (uv / clickcnt) as ctr
from (select date, count(userid) as uv, sum(isclick) as clickcnt
from user_access_log
group by 1
) ual;
To be honest, I suspect that you actually want count(distinct userid), so this might give a small improvement in performance.
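A small sketch of the two query shapes, run on SQLite through Python rather than on Hive, with made-up log rows. The 1.0 * factor is an addition here so that SQLite does not perform integer division; it is not part of the original queries. The point is only that both shapes return the same rows, not how either engine plans them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_access_log (date TEXT, userid INTEGER, isclick INTEGER);
    INSERT INTO user_access_log VALUES
        ('2024-01-01', 1, 1), ('2024-01-01', 2, 0), ('2024-01-01', 3, 1),
        ('2024-01-02', 1, 1), ('2024-01-02', 1, 0);
""")

# Aggregates repeated in the select list.
repeated = conn.execute("""
    SELECT date, COUNT(userid) AS uv, SUM(isclick) AS clickcnt,
           1.0 * COUNT(userid) / SUM(isclick) AS ctr
    FROM user_access_log
    GROUP BY date
    ORDER BY date
""").fetchall()

# Aggregates computed once in a derived table, then reused by name.
derived = conn.execute("""
    SELECT ual.*, 1.0 * uv / clickcnt AS ctr
    FROM (SELECT date, COUNT(userid) AS uv, SUM(isclick) AS clickcnt
          FROM user_access_log
          GROUP BY date) ual
    ORDER BY date
""").fetchall()
```

The derived-table form reads more cleanly when many expressions reuse the same aggregates, whether or not it changes the execution cost.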

SQL random aggregate

Say I have a simple table with 3 fields: 'place', 'user' and 'bytes'. Let's say, that under some filter, I want to group by 'place', and for each 'place', to sum all the bytes for that place, and randomly select a user for that place (uniformly from all the users that fit the 'where' filter and the relevant 'place'). If there was a "select randomly from" aggregate function, I would do:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
...but I couldn't find such an aggregate function. Am I missing something? What could be a good way to achieve this?
If your RDBMS supports analytic functions, you can do this:
WITH T
AS (SELECT place,
Sum(bytes) OVER (PARTITION BY place) AS Sum_bytes,
user,
Row_number() OVER (PARTITION BY place ORDER BY random_function()) AS RN
FROM YourTable
WHERE .... )
SELECT place,
Sum_bytes,
user
FROM T
WHERE RN = 1;
For SQL Server, Crypt_gen_random(4) or NEWID() would be examples of something that could be substituted in for random_function().
I think your question is DBMS-specific. If your DBMS is MySQL, you can use a solution like this:
SELECT place_rand.place, SUM(place_rand.bytes), place_rand.user as random_user
FROM
(SELECT place, bytes, user
FROM place
WHERE ...
ORDER BY rand()) place_rand
GROUP BY
place_rand.place;
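The subquery orders records in random order. The outer query groups by place, sums bytes, and returns a user for each place, since user is neither in an aggregate function nor in the GROUP BY clause. Note that this relies on MySQL's non-standard handling of non-aggregated columns (the query is rejected when ONLY_FULL_GROUP_BY is enabled), and MySQL does not guarantee which row's value is returned, so the subquery's ordering may not be honored.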
With a custom aggregate function, you could write expressions as simple as:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
SELECT_AT_RANDOM would be the custom aggregate function.
Here is precisely an implementation in PostgreSQL.
I would do a bit of a variation on Martin's solution:
select place, sum(bytes), max(case when seqnum = 1 then user end) as random_user
from (select place, bytes, user,
row_number() over (partition by place order by newid()) as seqnum
from t
) t
group by place
(Where newid() is just one way to get a random number, depending on the database.)
For some reason, I prefer this approach, because it still has the aggregation function in the outer query. If you are summarizing a bunch of fields, then this seems cleaner to me.
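Here is a runnable sketch of the row_number-plus-aggregate approach, using SQLite from Python as a stand-in. random() replaces newid(), window functions need SQLite 3.25 or later, and the table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (place TEXT, user TEXT, bytes INTEGER);
    INSERT INTO t VALUES
        ('cafe', 'ann', 100), ('cafe', 'bob', 50), ('cafe', 'cid', 25),
        ('lab', 'dee', 10), ('lab', 'eve', 5);
""")

rows = conn.execute("""
    SELECT place,
           SUM(bytes) AS total_bytes,
           -- only the row numbered 1 in each place contributes a non-NULL user
           MAX(CASE WHEN seqnum = 1 THEN user END) AS random_user
    FROM (SELECT place, bytes, user,
                 ROW_NUMBER() OVER (PARTITION BY place ORDER BY random()) AS seqnum
          FROM t) AS numbered
    GROUP BY place
    ORDER BY place
""").fetchall()

totals = {place: total for place, total, _ in rows}
picked = {place: user for place, _, user in rows}
```

Note that the pick is uniform over rows, so a user with several rows in a place is proportionally more likely to be chosen than a user with one.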

max(), group by and order by

I have following SQL statement.
SELECT t.client_id,max(t.points) AS "max" FROM sessions t GROUP BY t.client_id;
It simply lists client IDs with the maximum number of points they've achieved. Now I want to sort the results by max(t.points). Normally I would use ORDER BY, but I have no idea how to use it with groups. I know that using a value from the SELECT list is prohibited in the following clauses, so adding ORDER BY max at the end of the query won't work.
How can I sort those results after grouping, then?
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY t.client_id
order by max(t.points) desc
It is not quite correct that values from the SELECT list are prohibited in following clauses. In fact, ORDER BY is logically processed after the SELECT list and can refer to SELECT list result names (in contrast with GROUP BY). So the normal way to write your query would be
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY t.client_id
ORDER BY max;
This way of expressing it is SQL-92 and should be very portable. The other way to do it is by column number, e.g.,
ORDER BY 2;
These are the only two ways to do this in SQL-92.
SQL:1999 and later also allow referring to arbitrary expressions in the sort list, so you could just do ORDER BY max(t.points), but that's clearly more cumbersome, and possibly less portable. The ordering by column number was removed in SQL:1999, so it's technically no longer standard, but probably still widely supported.
Since you have tagged as Postgres: Postgres allows a non-standard GROUP BY and ORDER BY column number. So you could have
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY 1
order by 2 desc
After parsing, this is identical to RedFilter’s solution.
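The three ways of writing the sort can be checked against each other. This sketch uses SQLite from Python rather than Postgres, renames the alias to max_points (an assumption, to avoid any clash with the max function name), and uses invented session rows with distinct maxima so that the order is unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (client_id INTEGER, points INTEGER);
    INSERT INTO sessions VALUES (1, 10), (1, 40), (2, 70), (2, 5), (3, 20);
""")

base = ("SELECT t.client_id, MAX(t.points) AS max_points "
        "FROM sessions t GROUP BY t.client_id ")

order_clauses = [
    "ORDER BY max_points DESC",     # by select-list alias
    "ORDER BY 2 DESC",              # by column position
    "ORDER BY MAX(t.points) DESC",  # by repeating the expression
]

results = [conn.execute(base + clause).fetchall() for clause in order_clauses]
```

All three clauses yield the same ordering here; which one to prefer is mostly a matter of readability and portability, as discussed above.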