How do I do a group by with a random aggregation function in SQL Server?

I have the following SQL:
select Username,
COUNT(DISTINCT Message) as "Count",
AVG(WordCount) as "Average",
RAND(Message) -- Essentially what I want to do
from Messages
group by Username
order by "Count" desc
My two aggregation functions as columns are Count and Average, which are obvious. What I want to do is to also return a random row from each grouping from the 'Message' column.
I've written this query in Linq2SQL, however it doesn't support random numbers.
I think I need to create a custom aggregation function but they seem pretty over-the-top, and I want to know if there's an easier way before I try that. I'd try a CLR aggregation function, but then the database wouldn't be as easily portable between instances due to their dll nature.
I also know that using per-row random numbers in SQL is a bit verbose as well, but I can't find a way to use them in my group by query.
I've seen Marc Gravell's idea for random rows here:
Random row from Linq to Sql,
however his solution pulls in every row which I don't want to do; only the grouping (which is orders of magnitude smaller.)

select m.Username,
COUNT(DISTINCT m.Message) as "Count",
AVG(m.WordCount) as "Average",
FOO.Message
from
Messages m
cross apply
(select TOP 1 Message, Username
from Messages m2
WHERE m2.Username = m.Username
order by newid()
) FOO
group by m.Username, FOO.Message
order by "Count" desc
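A minimal sketch of a related approach: aggregate per user first, then pick one random message per group with a correlated subquery. It runs against an in-memory SQLite database from Python, so random() and LIMIT 1 stand in for SQL Server's NEWID() and TOP 1. The sample rows and the name random_message_per_user are made up for illustration.

```python
import sqlite3

def random_message_per_user(conn):
    """Return (Username, Count, Average, one randomly chosen Message) per user."""
    sql = """
        SELECT g.Username, g."Count", g."Average",
               (SELECT m2.Message
                FROM Messages m2
                WHERE m2.Username = g.Username
                ORDER BY random()      -- stand-in for NEWID() in SQL Server
                LIMIT 1) AS RandomMessage
        FROM (SELECT Username,
                     COUNT(DISTINCT Message) AS "Count",
                     AVG(WordCount) AS "Average"
              FROM Messages
              GROUP BY Username) AS g
        ORDER BY g."Count" DESC
    """
    return conn.execute(sql).fetchall()

# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (Username TEXT, Message TEXT, WordCount INTEGER)")
conn.executemany("INSERT INTO Messages VALUES (?, ?, ?)", [
    ("alice", "hello there", 2), ("alice", "good morning all", 3), ("alice", "bye", 1),
    ("bob", "hi", 1), ("bob", "see you later", 3),
])
rows = random_message_per_user(conn)
```

Because the random pick happens once per group rather than once per source row, each user appears exactly once in the result.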

Related

SQL aggregate function when count(*)=1 so there can be only one value

Sometimes you write a grouped query where each group is a single row, as having count(*) = 1. This means that the usual aggregate functions like min, max, sum and so on are a bit pointless: the min equals the max, equals the sum, equals the average, since there's exactly one value to aggregate.
I usually end up picking min arbitrarily. If we take the familiar example of a table mapping a book to its author(s), I might want to query just books that have a single author:
-- For books that have a single author, pull back that author's id.
select book_id,
min(author_id) as author_id
-- I could equally well use max(author_id) or even sum(author_id)...
from book_authors
group by book_id
having count(*) = 1
That works, but it seems it could be expressed better. I'm not actually interested in the 'minimum' per se, but just to get the single value which I know exists. Some column types (such as bit in Microsoft SQL Server) do not support the min aggregate function so you have to do workarounds like convert(bit, min(convert(int, mycol))).
So, I expect the answer will be no, but is there some better way to specify my intent?
select book_id,
there_must_be_one_value_so_just_return_it(author_id) as author_id
from book_authors
group by book_id
having count(*) = 1
Clearly, if you're not requiring count(*)=1 then you no longer guarantee a single value and the special aggregate function could not be used. That error could be caught when the SQL is compiled.
The desired result would be equivalent to the min query above.
I'm using Microsoft SQL Server (2016) but as this is a fairly "blue sky" kind of question, I would be interested in replies about other SQL dialects too.
You could, instead, use a windowed COUNT and then filter based on that:
WITH CTE AS(
SELECT ba.book_id,
ba.author_id,
COUNT(ba.book_id) OVER (PARTITION BY ba.book_id) AS Authors
FROM dbo.book_authors ba)
SELECT c.book_id,
c.author_id
FROM CTE c
WHERE c.Authors = 1;
An alternative method would be to use a correlated subquery:
SELECT ba.book_id,
ba.author_id
FROM dbo.book_authors ba
WHERE EXISTS (SELECT 1
FROM dbo.book_authors e
WHERE e.book_id = ba.book_id
GROUP BY e.book_id
HAVING COUNT(*) = 1);
I have not tested the performance of either with a decent amount of data. However, I would hope that the correlated subquery performs better on a well-indexed table.
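As a rough check that the grouped form and the correlated-subquery form agree, here is a small sketch using Python's sqlite3 module with an in-memory database. The sample rows are invented, and SQLite is only a stand-in for SQL Server here, so this says nothing about relative performance.

```python
import sqlite3

# Hypothetical sample data: book 2 has two authors, books 1 and 3 have one each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER)")
conn.executemany("INSERT INTO book_authors VALUES (?, ?)",
                 [(1, 10), (2, 20), (2, 21), (3, 30)])

# Grouped form from the question: MIN() is arbitrary, there is only one value.
grouped = conn.execute("""
    SELECT book_id, MIN(author_id) AS author_id
    FROM book_authors
    GROUP BY book_id
    HAVING COUNT(*) = 1
    ORDER BY book_id
""").fetchall()

# Correlated-subquery form: author_id is returned without any aggregate.
correlated = conn.execute("""
    SELECT ba.book_id, ba.author_id
    FROM book_authors ba
    WHERE EXISTS (SELECT 1
                  FROM book_authors e
                  WHERE e.book_id = ba.book_id
                  GROUP BY e.book_id
                  HAVING COUNT(*) = 1)
    ORDER BY ba.book_id
""").fetchall()
```

Both queries return only the single-author books, and the second one works for column types that have no min aggregate.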

In SQL, does groupby on an ordered query behave the same as doing both in the same query?

Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last (whatever those mean in a setting where the data is unordered by definition). And in fact this is what MySQL will try to do for you: it will try to guess at your meaning. But this response is really inappropriate. You specified an inexact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (SQL Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help them out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful, the two aggregate functions may not match up: you end up showing the value from one row in the group while ordering by a completely different row from that group, which gives misleading results.
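To make the aggregate-in-ORDER-BY idea concrete, here is a small sketch using Python's sqlite3 module. The column is named grp because group is a reserved word, and the sample rows, SUM and MIN are arbitrary choices standing in for some_agg_func and some_other_agg_func.

```python
import sqlite3

# Hypothetical sample data; "grp" replaces the reserved word "group".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (grp TEXT, some_value INTEGER, some_other_value INTEGER)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", [
    ("a", 1, 30), ("a", 2, 40),
    ("b", 5, 10), ("b", 6, 20),
    ("c", 9, 25),
])

# Each group is sorted by the smallest some_other_value it contains,
# so the database knows exactly which value to use per group.
rows = conn.execute("""
    SELECT grp, SUM(some_value) AS total
    FROM my_table
    GROUP BY grp
    ORDER BY MIN(some_other_value)
""").fetchall()
# rows == [("b", 11), ("c", 9), ("a", 3)]
```

Swapping MIN for MAX would change the sort key per group (40, 20, 25) and therefore the order, which is exactly the choice the database cannot make on its own.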
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect, or at least no guaranteed one.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is not part of your results. You'll get an error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggregation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.
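A quick way to see this, sketched with Python's sqlite3 module and made-up data: for an order-insensitive aggregate such as SUM, the ordered subquery and the direct query produce the same set of rows. The column name grp replaces the reserved word group, and both results are sorted in Python before comparing because neither query guarantees an output order.

```python
import sqlite3

# Hypothetical sample data; "grp" replaces the reserved word "group".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (grp TEXT, some_value INTEGER, some_other_value INTEGER)"
)
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", [
    ("a", 1, 30), ("a", 2, 40),
    ("b", 5, 10), ("b", 6, 20),
    ("c", 9, 25),
])

# Variant with ORDER BY inside the subquery.
with_subquery_order = conn.execute("""
    SELECT grp, SUM(some_value)
    FROM (SELECT grp, some_value
          FROM my_table
          ORDER BY some_other_value) AS alias
    GROUP BY grp
""").fetchall()

# Variant selecting from the table directly.
direct = conn.execute("""
    SELECT grp, SUM(some_value)
    FROM my_table AS alias
    GROUP BY grp
""").fetchall()

# Compare as sets of rows; the row order of either result is not guaranteed.
same_rows = sorted(with_subquery_order) == sorted(direct)
```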

SQL random aggregate

Say I have a simple table with 3 fields: 'place', 'user' and 'bytes'. Let's say, that under some filter, I want to group by 'place', and for each 'place', to sum all the bytes for that place, and randomly select a user for that place (uniformly from all the users that fit the 'where' filter and the relevant 'place'). If there was a "select randomly from" aggregate function, I would do:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
...but I couldn't find such an aggregate function. Am I missing something? What could be a good way to achieve this?
If your RDBMS supports analytic functions, you can use something like this:
WITH T
AS (SELECT place,
Sum(bytes) OVER (PARTITION BY place) AS Sum_bytes,
user,
Row_number() OVER (PARTITION BY place ORDER BY random_function()) AS RN
FROM YourTable
WHERE .... )
SELECT place,
Sum_bytes,
user
FROM T
WHERE RN = 1;
For SQL Server, CRYPT_GEN_RANDOM(4) or NEWID() would be examples of something that could be substituted in for random_function().
I think your question is DBMS specific. If your DBMS is MySQL, you can use a solution like this:
SELECT place_rand.place, SUM(place_rand.bytes), place_rand.user as random_user
FROM
(SELECT place, bytes, user
FROM place
WHERE ...
ORDER BY rand()) place_rand
GROUP BY
place_rand.place;
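The subquery orders records in random order. The outer query groups by place, sums bytes, and returns the first random user, since user is neither in an aggregate function nor in the group by clause. Note that this relies on MySQL's non-standard handling of non-aggregated columns, and the query is rejected when ONLY_FULL_GROUP_BY is enabled.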
With a custom aggregate function, you could write expressions as simple as:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
SELECT_AT_RANDOM would be the custom aggregate function.
Here is exactly such an implementation in PostgreSQL.
I would do a bit of a variation on Martin's solution:
select place, sum(bytes), max(case when seqnum = 1 then user end) as random_user
from (select place, bytes, user,
row_number() over (partition by place order by newid()) as seqnum
from t
) t
group by place
(Where newid() is just one way to get a random number, depending on the database.)
For some reason, I prefer this approach, because it still has the aggregation function in the outer query. If you are summarizing a bunch of fields, then this seems cleaner to me.
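Here is a runnable sketch of that row_number-plus-conditional-aggregate pattern, using Python's sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later), uses random() in place of newid(), and the table contents are invented for illustration.

```python
import sqlite3

# Hypothetical sample data: three users at place "x", one at place "y".
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t (place TEXT, "user" TEXT, bytes INTEGER)')
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("x", "u1", 100), ("x", "u2", 200), ("x", "u3", 50),
    ("y", "u4", 10),
])

# The inner query numbers the rows of each place in random order;
# the outer query keeps the user whose row happened to get number 1.
rows = conn.execute("""
    SELECT place,
           SUM(bytes) AS sum_bytes,
           MAX(CASE WHEN seqnum = 1 THEN "user" END) AS random_user
    FROM (SELECT place, bytes, "user",
                 ROW_NUMBER() OVER (PARTITION BY place ORDER BY random()) AS seqnum
          FROM t) AS ranked
    GROUP BY place
    ORDER BY place
""").fetchall()
```

The sums are deterministic, while random_user can differ between runs for any place that has more than one user.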

Using a SELECT statement within a WHERE clause

SELECT * FROM ScoresTable WHERE Score =
(SELECT MAX(Score) FROM ScoresTable AS st WHERE st.Date = ScoresTable.Date)
Is there a name to describe using a SELECT statement within a WHERE clause? Is this good/bad practice?
Would this be a better alternative?
SELECT ScoresTable.*
FROM ScoresTable INNER JOIN
(SELECT Date, MAX(Score) AS MaxScore
FROM ScoresTable GROUP BY Date) SubQuery
ON ScoresTable.Date = SubQuery.Date
AND ScoresTable.Score = SubQuery.MaxScore
It is far less elegant, but appears to run more quickly than my previous version. I dislike it because it is not displayed very clearly in the GUI (and it needs to be understood by SQL beginners). I could split it into two separate queries, but then things begin to get cluttered...
N.B. I need more than just Date and Score (e.g. name)
It's called a correlated subquery. It has its uses.
It's not bad practice at all. They are usually referred to as SUBQUERY, SUBSELECT or NESTED QUERY.
It's a relatively expensive operation, but it's quite common to encounter a lot of subqueries when dealing with databases since it's the only way to perform certain kinds of operations on data.
There's a much better way to achieve your desired result, using SQL Server's analytic (or windowing) functions.
SELECT DISTINCT Date, MAX(Score) OVER(PARTITION BY Date) FROM ScoresTable
If you need more than just the date and max score combinations, you can use ranking functions, eg:
SELECT *
FROM ScoresTable t
JOIN (
SELECT
ScoreId,
ROW_NUMBER() OVER (PARTITION BY Date ORDER BY Score DESC) AS [Rank]
FROM ScoresTable
) window ON window.ScoreId = t.ScoreId AND window.[Rank] = 1
You may want to use RANK() instead of ROW_NUMBER() if you want multiple records to be returned if they both share the same MAX(Score).
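The difference between the two ranking functions shows up only when there are ties. The sketch below uses Python's sqlite3 module (assuming SQLite 3.25 or later for window functions) with a made-up ScoresTable that includes a Name column; top_scores is a hypothetical helper, not part of any library.

```python
import sqlite3

# Hypothetical sample data: ben and cat tie for the top score on 2020-01-01.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ScoresTable (ScoreId INTEGER PRIMARY KEY, Name TEXT, Date TEXT, Score INTEGER)"
)
conn.executemany("INSERT INTO ScoresTable VALUES (?, ?, ?, ?)", [
    (1, "ann", "2020-01-01", 50), (2, "ben", "2020-01-01", 70),
    (3, "cat", "2020-01-01", 70), (4, "dan", "2020-01-02", 40),
])

def top_scores(conn, ranking_function):
    """Return the top-ranked rows per date using ROW_NUMBER or RANK."""
    # The function name is interpolated into the SQL, so only accept known names.
    if ranking_function not in ("ROW_NUMBER", "RANK"):
        raise ValueError(ranking_function)
    sql = f"""
        SELECT Name, Date, Score
        FROM (SELECT Name, Date, Score,
                     {ranking_function}() OVER (PARTITION BY Date ORDER BY Score DESC) AS rnk
              FROM ScoresTable) AS ranked
        WHERE rnk = 1
        ORDER BY Date, Name
    """
    return conn.execute(sql).fetchall()

with_row_number = top_scores(conn, "ROW_NUMBER")  # one row per date, tie broken arbitrarily
with_rank = top_scores(conn, "RANK")              # every row that shares the top score
```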
The principle of subqueries is not at all bad, but I don't think that you should use it in your example. If I understand correctly you want to get the maximum score for each date. In this case you should use a GROUP BY.
This is a correlated sub-query.
(It is a "nested" query - this is very non-technical term though)
The inner query takes values from the outer-query (WHERE st.Date = ScoresTable.Date) thus it is evaluated once for each row in the outer query.
There is also a non-correlated form in which the inner query is independent and as such is only executed once.
e.g.
SELECT * FROM ScoresTable WHERE Score =
(SELECT MAX(Score) FROM ScoresTable)
There is nothing wrong with using subqueries, except where they are not needed :)
Your statement may be rewritable as an aggregate function depending on what columns you require in your select statement.
SELECT Max(score), Date FROM ScoresTable
Group By Date
In your scenario, why not use GROUP BY with a HAVING clause instead of joining the table to itself? You may also find other functions useful.
Subquery is the name.
At times it's required, but good/bad depends on how it's applied.

max(), group by and order by

I have the following SQL statement.
SELECT t.client_id,max(t.points) AS "max" FROM sessions t GROUP BY t.client_id;
It simply lists client IDs with the maximum number of points they've achieved. Now I want to sort the results by max(t.points). Normally I would use ORDER BY, but I have no idea how to use it with groups. I know using a value from the SELECT list is prohibited in the following clauses, so adding ORDER BY max at the end of the query won't work.
How can I sort those results after grouping, then?
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY t.client_id
order by max(t.points) desc
It is not quite correct that values from the SELECT list are prohibited in following clauses. In fact, ORDER BY is logically processed after the SELECT list and can refer to SELECT list result names (in contrast with GROUP BY). So the normal way to write your query would be
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY t.client_id
ORDER BY max;
This way of expressing it is SQL-92 and should be very portable. The other way to do it is by column number, e.g.,
ORDER BY 2;
These are the only two ways to do this in SQL-92.
SQL:1999 and later also allow referring to arbitrary expressions in the sort list, so you could just do ORDER BY max(t.points), but that's clearly more cumbersome, and possibly less portable. The ordering by column number was removed in SQL:1999, so it's technically no longer standard, but probably still widely supported.
Since you have tagged as Postgres: Postgres allows a non-standard GROUP BY and ORDER BY column number. So you could have
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY 1
order by 2 desc
After parsing, this is identical to RedFilter’s solution.
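The three ways of sorting by the aggregate discussed above (by expression, by result name, and by column position) can be compared with a short sketch using Python's sqlite3 module. SQLite stands in for Postgres here, and the sessions rows are invented for illustration.

```python
import sqlite3

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (client_id INTEGER, points INTEGER)")
conn.executemany("INSERT INTO sessions VALUES (?, ?)",
                 [(1, 10), (1, 40), (2, 25), (3, 5), (3, 60)])

base = 'SELECT t.client_id, MAX(t.points) AS "max" FROM sessions t GROUP BY t.client_id '

by_expression = conn.execute(base + "ORDER BY MAX(t.points) DESC").fetchall()
by_alias = conn.execute(base + 'ORDER BY "max" DESC').fetchall()
by_position = conn.execute(base + "ORDER BY 2 DESC").fetchall()
# All three return [(3, 60), (1, 40), (2, 25)]
```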