SQL random aggregate - sql

Say I have a simple table with 3 fields: 'place', 'user' and 'bytes'. Let's say, that under some filter, I want to group by 'place', and for each 'place', to sum all the bytes for that place, and randomly select a user for that place (uniformly from all the users that fit the 'where' filter and the relevant 'place'). If there was a "select randomly from" aggregate function, I would do:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
...but I couldn't find such an aggregate function. Am I missing something? What could be a good way to achieve this?

If your RDBMS supports analytic (window) functions, you can number the rows of each place in random order and keep the first one:
WITH T
AS (SELECT place,
Sum(bytes) OVER (PARTITION BY place) AS Sum_bytes,
user,
Row_number() OVER (PARTITION BY place ORDER BY random_function()) AS RN
FROM YourTable
WHERE .... )
SELECT place,
Sum_bytes,
user
FROM T
WHERE RN = 1;
For SQL Server, CRYPT_GEN_RANDOM(4) or NEWID() are examples of what could be substituted for random_function().
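To try the ROW_NUMBER() pattern without a database server, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. It assumes a SQLite build with window functions (3.25 or later). The table name traffic, the sample rows and the bytes > 0 filter are made up for illustration, and SQLite's random() stands in for random_function().

```python
import sqlite3

# Hypothetical sample data: (place, user, bytes).
ROWS = [
    ("paris", "alice", 100),
    ("paris", "bob", 50),
    ("paris", "carol", 25),
    ("oslo", "dave", 10),
    ("oslo", "erin", 5),
]

QUERY = """
WITH t AS (
    SELECT place,
           SUM(bytes) OVER (PARTITION BY place) AS sum_bytes,
           user,
           ROW_NUMBER() OVER (PARTITION BY place ORDER BY random()) AS rn
    FROM traffic
    WHERE bytes > 0   -- placeholder for the elided filter
)
SELECT place, sum_bytes, user
FROM t
WHERE rn = 1
ORDER BY place;
"""


def random_user_per_place(rows):
    """Return one (place, total_bytes, random_user) tuple per place."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE traffic (place TEXT, user TEXT, bytes INTEGER)")
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in random_user_per_place(ROWS):
        print(row)
```

Note that the pick is uniform over the filtered rows of each place, so a user who appears in several rows is proportionally more likely to be chosen.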

I think your question is DBMS-specific. If your DBMS is MySQL, you can use a solution like this:
SELECT place_rand.place, SUM(place_rand.bytes), place_rand.user as random_user
FROM
(SELECT place, bytes, user
FROM place
WHERE ...
ORDER BY rand()) place_rand
GROUP BY
place_rand.place;
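The subquery sorts the records in random order. The outer query groups by place, sums bytes, and returns the first user it encounters in each group, because user is neither inside an aggregate function nor listed in the GROUP BY clause. Note that this relies on MySQL's non-standard GROUP BY extension: the query is rejected when ONLY_FULL_GROUP_BY is enabled, and MySQL does not guarantee which row's value is returned for such a column.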

With a custom aggregate function, you could write expressions as simple as:
SELECT place, SUM(bytes), SELECT_AT_RANDOM(user) WHERE .... GROUP BY place;
SELECT_AT_RANDOM would be the custom aggregate function.
Here is an implementation of exactly that in PostgreSQL.
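To illustrate the custom-aggregate idea in a form that can be run locally, here is a minimal sketch using SQLite through Python's sqlite3 module rather than PostgreSQL. The class SelectAtRandom, the function name select_at_random, the table traffic and the sample rows are all assumptions for this example. The aggregate keeps one value per group using reservoir sampling with a reservoir of size one, so it never has to hold the whole group in memory.

```python
import random
import sqlite3


class SelectAtRandom:
    """Aggregate that returns one uniformly chosen input value per group."""

    def __init__(self):
        self.count = 0
        self.choice = None

    def step(self, value):
        self.count += 1
        # Replace the kept value with probability 1/count; the first value
        # is always kept because randrange(1) can only return 0.
        if random.randrange(self.count) == 0:
            self.choice = value

    def finalize(self):
        return self.choice


def random_user_per_place(rows):
    """Return (place, total_bytes, random_user) per place."""
    conn = sqlite3.connect(":memory:")
    # Register the Python class as a one-argument SQL aggregate.
    conn.create_aggregate("select_at_random", 1, SelectAtRandom)
    conn.execute("CREATE TABLE traffic (place TEXT, user TEXT, bytes INTEGER)")
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT place, SUM(bytes), select_at_random(user) "
        "FROM traffic GROUP BY place ORDER BY place"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("paris", "alice", 100), ("paris", "bob", 50), ("oslo", "dave", 10)]
    print(random_user_per_place(sample))
```

With this in place the query reads exactly like the wished-for SELECT_AT_RANDOM form, at the cost of registering the function on every connection.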

I would do a bit of a variation on Martin's solution:
select place, sum(bytes), max(case when seqnum = 1 then user end) as random_user
from (select place, bytes, user,
row_number() over (partition by place order by newid()) as seqnum
from t
) t
group by place
(Where newid() is just one way to get a random number, depending on the database.)
For some reason, I prefer this approach, because it still has the aggregation function in the outer query. If you are summarizing a bunch of fields, then this seems cleaner to me.

Related

How to select the row with the lowest value- oracle

I have a table where I save authors and songs, with other columns. The same song can appear multiple times, and it obviously always comes from the same author. I would like to select the author that has the least songs, including the repeated ones, aka the one that is listened to the least.
The final table should show only one author name.
Clearly, one step is to find the count for every author. This can be done with an elementary aggregate query. If you then order by the count and select just the first row, that solves your problem. One approach is to use ROWNUM in an outer query. This is a very elementary approach, quite efficient, and it works in all versions of Oracle (it doesn't use any advanced features).
select author
from (
select author
from your_table
group by author
order by count(*)
)
where rownum = 1
;
Note that in the subquery we don't need to select the count (since we don't need it in the output). We can still use it in order by in the subquery, which is all we need it for.
The only tricky part here is to remember that you need to order the rows in the subquery, and then apply the ROWNUM filter in the outer query. This is because ORDER BY is the very last thing that is processed in any query - it comes after ROWNUM is assigned to rows in the output. So, moving the WHERE clause into the subquery (and doing everything in a single query, instead of a subquery and an outer query) does not work.
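The count-then-take-the-first-row logic can be checked outside Oracle with a small sketch. This one uses an in-memory SQLite database from Python, and because SQLite has no ROWNUM it uses LIMIT 1 in place of the outer ROWNUM filter, so it illustrates the grouping and ordering only, not Oracle's ROWNUM semantics. The column names and sample rows are assumptions.

```python
import sqlite3

# Hypothetical listening log: one row per play, repeats included.
PLAYS = [
    ("Ann", "Song A"),
    ("Ann", "Song A"),
    ("Ann", "Song B"),
    ("Bo", "Song C"),
    ("Bo", "Song C"),
    ("Cy", "Song D"),
]


def least_played_author(plays):
    """Return the author with the fewest rows, or None for an empty table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (author TEXT, song TEXT)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?)", plays)
    row = conn.execute(
        """
        SELECT author
        FROM your_table
        GROUP BY author
        ORDER BY COUNT(*)   -- the count is used for sorting only
        LIMIT 1             -- stands in for the outer ROWNUM = 1 filter
        """
    ).fetchone()
    conn.close()
    return row[0] if row else None


if __name__ == "__main__":
    print(least_played_author(PLAYS))
```

If two authors tie for the lowest count, which one is returned is not defined unless a tie-breaker is added to the ORDER BY.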
You can use analytical functions as follows:
Select author from
(Select t.*,
Row_number() over (order by cnt_author) as rn
From
(Select t.*,
Count(*) over (partition by author) as cnt_author
From your_table t) t ) t
Where rn = 1

SQL simple GROUP BY query

Is there a way to make a simple GROUP BY query with SQL and not use COUNT,AVG or SUM? I want to show all columns and group it with a single column.
SELECT * FROM [SPC].[dbo].[BoardSFC] GROUP BY boardsn
The query above works on MySQL but not on SQL Server. Is there a way to achieve this? Any suggestions would be great.
UPDATE: Here is my data. I just need to group the rows by boardsn and get those where imulti equals 1.
I think you just understand 'group data' in a different way than it is implemented in SQL Server. You simply want rows that have the same value to appear together in the result, and that is ordering, not grouping. So maybe what you need is:
SELECT *
FROM [SPC].[dbo].[BoardSFC]
WHERE imulti = 1
ORDER BY boardsn
"The query above works on MySQL but not on SQL Server. Is there a way to achieve this?"
No, there is not. MySQL only lets you do this because it violates the various SQL standards quite egregiously.
You need to name each column you want in the result-set whenever you use GROUP BY. The SELECT * feature is only provided as a convenience when working with data interactively - in production code you should never use SELECT *.
You could use TOP 1 WITH TIES combined with an ORDER BY ROW_NUMBER().
SELECT TOP 1 WITH TIES *
FROM [SPC].[dbo].[BoardSFC]
ORDER BY ROW_NUMBER() OVER (PARTITION BY boardsn ORDER BY imulti)
Or more explicitly, use ROW_NUMBER in a sub-query
SELECT *
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY boardsn ORDER BY imulti) as RN
FROM [SPC].[dbo].[BoardSFC]
) q
where RN = 1
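A small runnable sketch of the ROW_NUMBER() sub-query form, assuming an in-memory SQLite database with window-function support rather than SQL Server. The station column and the sample rows are hypothetical; only boardsn and imulti come from the question.

```python
import sqlite3

# Hypothetical rows: (boardsn, station, imulti).
BOARDS = [
    ("B1", "s1", 2),
    ("B1", "s2", 1),
    ("B2", "s1", 1),
    ("B2", "s3", 3),
]


def first_row_per_board(rows):
    """Return the row with the lowest imulti for each boardsn."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE BoardSFC (boardsn TEXT, station TEXT, imulti INTEGER)")
    conn.executemany("INSERT INTO BoardSFC VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT boardsn, station, imulti
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY boardsn ORDER BY imulti) AS rn
            FROM BoardSFC
        ) q
        WHERE rn = 1
        ORDER BY boardsn
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in first_row_per_board(BOARDS):
        print(row)
```

Every column of the chosen row is available in the result, which is what a plain GROUP BY boardsn cannot provide.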

What order is used by First() function?

Why do the following two queries return identical results?
SELECT FIRST(score) FROM (SELECT score FROM scores ORDER BY score ASC)
SELECT FIRST(score) FROM (SELECT score FROM scores ORDER BY score DESC)
It's confusing, considering that I manually specify the order of subqueries.
The order of the results in the subquery is irrelevant, unless you use TOP within the subquery, which you don't here. Most SQL variants won't allow this syntax -- using an ORDER BY in a subquery throws an error in SQL Server, for example.
Your top-level query has no ORDER BY, thus the concepts of FIRST or TOP 1 are undefined in the context of that query.
In the reference docs, Microsoft states (emphasis mine):
Because records are usually returned in no particular order (unless
the query includes an ORDER BY clause), the records returned by these
functions will be arbitrary.
To answer the question directly:
Access ignores the ORDER BY clause in most subqueries. I believe (but can't prove) this is due to bugs or limitations in the query optimiser, although it's not documented anywhere that I could find. I've tested lots of SQL using Access 2007 and Access 2016 to come to this conclusion.
To make the examples work as expected:
Add TOP 100 PERCENT to the subqueries:
SELECT FIRST(score) FROM (SELECT TOP 100 PERCENT score FROM scores ORDER BY score ASC)
SELECT FIRST(score) FROM (SELECT TOP 100 PERCENT score FROM scores ORDER BY score DESC)
When to use First/Last instead of Max/Min:
A good example of when you'd want to use this approach instead of the simpler Min and Max aggregate functions is when there's another field that you want from the same record, e.g if the underlying scores table also held the names of players and the rounds of the game, you can get the name and score of the best and worst player in each round like this:
SELECT
round, FIRST(name) AS best, FIRST(score) AS highscore, LAST(name) AS worst, LAST(score) AS lowscore
FROM
(SELECT TOP 100 PERCENT * FROM scores ORDER BY score DESC)
GROUP BY
round
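For databases without First/Last, the same best-and-worst-per-round result can be obtained without depending on subquery ordering at all. The sketch below is a different technique from the Access query above: it ranks rows with ROW_NUMBER() in both directions and then picks rank 1 with conditional aggregation. It assumes SQLite with window functions, run from Python, and the sample rows are made up; name is used as a tie-breaker so the result is deterministic.

```python
import sqlite3

# Hypothetical rows: (round, name, score).
SCORES = [
    (1, "ann", 10),
    (1, "bob", 30),
    (1, "cy", 20),
    (2, "ann", 8),
    (2, "bob", 3),
]


def best_and_worst_per_round(rows):
    """Return (round, best, highscore, worst, lowscore) per round."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (round INTEGER, name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT round,
               MAX(CASE WHEN best_rank = 1 THEN name END)  AS best,
               MAX(score)                                  AS highscore,
               MAX(CASE WHEN worst_rank = 1 THEN name END) AS worst,
               MIN(score)                                  AS lowscore
        FROM (
            SELECT round, name, score,
                   ROW_NUMBER() OVER (PARTITION BY round
                                      ORDER BY score DESC, name) AS best_rank,
                   ROW_NUMBER() OVER (PARTITION BY round
                                      ORDER BY score ASC, name)  AS worst_rank
            FROM scores
        ) ranked
        GROUP BY round
        ORDER BY round
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in best_and_worst_per_round(SCORES):
        print(row)
```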
Your statements are perfect functional equivalents of
SELECT Min(Score) FROM Scores and
SELECT Max(Score) FROM Scores.
If you really want to retrieve the first and last score, you will need an AutoNumber or a DateTime field to indicate the input order. You could then query:
SELECT First(Score), Last(Score) FROM Scores ORDER BY MySortKey
If you persist with your question, the correct syntax would be
SELECT FIRST(score) FROM (SELECT score FROM scores) ORDER BY score ASC,
or, simplified,
SELECT FIRST(score) FROM scores ORDER BY score ASC

Using a SELECT statement within a WHERE clause

SELECT * FROM ScoresTable WHERE Score =
(SELECT MAX(Score) FROM ScoresTable AS st WHERE st.Date = ScoresTable.Date)
Is there a name to describe using a SELECT statement within a WHERE clause? Is this good/bad practice?
Would this be a better alternative?
SELECT ScoresTable.*
FROM ScoresTable INNER JOIN
(SELECT Date, MAX(Score) AS MaxScore
FROM ScoresTable GROUP BY Date) SubQuery
ON ScoresTable.Date = SubQuery.Date
AND ScoresTable.Score = SubQuery.MaxScore
It is far less elegant, but appears to run more quickly than my previous version. I dislike it because it is not displayed very clearly in the GUI (and it needs to be understood by SQL beginners). I could split it into two separate queries, but then things begin to get cluttered...
N.B. I need more than just Date and Score (e.g. name)
It's called a correlated subquery. It has its uses.
It's not bad practice at all. They are usually referred to as SUBQUERY, SUBSELECT or NESTED QUERY.
It's a relatively expensive operation, but it's quite common to encounter a lot of subqueries when dealing with databases, since it's the only way to perform certain kinds of operations on data.
There's a much better way to achieve your desired result, using SQL Server's analytic (or windowing) functions.
SELECT DISTINCT Date, MAX(Score) OVER(PARTITION BY Date) FROM ScoresTable
If you need more than just the date and max score combinations, you can use ranking functions, eg:
SELECT *
FROM ScoresTable t
JOIN (
SELECT
ScoreId,
ROW_NUMBER() OVER (PARTITION BY Date ORDER BY Score DESC) AS [Rank]
FROM ScoresTable
) ranked ON ranked.ScoreId = t.ScoreId AND ranked.[Rank] = 1
You may want to use RANK() instead of ROW_NUMBER() if you want multiple records to be returned if they both share the same MAX(Score).
The principle of subqueries is not at all bad, but I don't think that you should use it in your example. If I understand correctly you want to get the maximum score for each date. In this case you should use a GROUP BY.
This is a correlated sub-query.
(It is a "nested" query - this is very non-technical term though)
The inner query takes values from the outer-query (WHERE st.Date = ScoresTable.Date) thus it is evaluated once for each row in the outer query.
There is also a non-correlated form, in which the inner query is independent and as such is only executed once.
e.g.
SELECT * FROM ScoresTable WHERE Score =
(SELECT MAX(Score) FROM ScoresTable)
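To see that the correlated subquery and the join against a grouped derived table from the question select the same rows, here is a minimal sketch against an in-memory SQLite database from Python. The Name column and the sample data are assumptions, and both tables are given explicit aliases to keep the correlation unambiguous.

```python
import sqlite3

# Hypothetical rows: (Name, Date, Score).
SCORES = [
    ("ann", "2024-01-01", 10),
    ("bob", "2024-01-01", 30),
    ("cy", "2024-01-01", 30),
    ("dee", "2024-01-02", 5),
    ("eli", "2024-01-02", 7),
]

CORRELATED = """
SELECT o.Name, o.Date, o.Score
FROM ScoresTable AS o
WHERE o.Score = (SELECT MAX(st.Score)
                 FROM ScoresTable AS st
                 WHERE st.Date = o.Date)   -- refers to the outer row
ORDER BY o.Date, o.Name
"""

JOINED = """
SELECT s.Name, s.Date, s.Score
FROM ScoresTable AS s
INNER JOIN (SELECT Date, MAX(Score) AS MaxScore
            FROM ScoresTable
            GROUP BY Date) AS m
        ON s.Date = m.Date AND s.Score = m.MaxScore
ORDER BY s.Date, s.Name
"""


def top_scores(rows, query):
    """Run one of the two queries over the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ScoresTable (Name TEXT, Date TEXT, Score INTEGER)")
    conn.executemany("INSERT INTO ScoresTable VALUES (?, ?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(top_scores(SCORES, CORRELATED))
    print(top_scores(SCORES, JOINED))
```

Both forms return every row that ties for the daily maximum, so a date with two equal top scores yields two rows.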
There is nothing wrong with using subqueries, except where they are not needed :)
Your statement may be rewritable as an aggregate function depending on what columns you require in your select statement.
SELECT Max(score), Date FROM ScoresTable
Group By Date
In your scenario, why not use GROUP BY and a HAVING clause instead of joining the table to itself? You may also use other useful functions.
Subquery is the name.
At times it's required, but good/bad depends on how it's applied.

How do I do a group by with a random aggregation function in SQL Server?

I have the following SQL:
select Username,
COUNT(DISTINCT Message) as "Count",
AVG(WordCount) as "Average",
RAND(Message) -- Essentially what I want to do
from Messages
group by Username
order by "Count" desc
My two aggregation functions as columns are Count and Average, which are obvious. What I want to do is to also return a random row from each grouping from the 'Message' column.
I've written this query in Linq2SQL, however it doesn't support random numbers.
I think I need to create a custom aggregation function but they seem pretty over-the-top, and I want to know if there's an easier way before I try that. I'd try a CLR aggregation function, but then the database wouldn't be as easily portable between instances due to their dll nature.
I also know that using per-row random numbers in SQL is a bit verbose as well, but I can't find a way to use them in my group by query.
I've seen Marc Gravell's idea for random rows here:
Random row from Linq to Sql,
however his solution pulls in every row which I don't want to do; only the grouping (which is orders of magnitude smaller.)
select m.Username,
COUNT(DISTINCT m.Message) as "Count",
AVG(m.WordCount) as "Average",
FOO.Message
from
Messages m
cross apply
(select TOP 1 Message, Username
from Messages m2
WHERE m2.Username = m.Username
order by newid()
) FOO
group by m.Username, FOO.Message
order by "Count" desc
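One thing to watch in the CROSS APPLY form is that the random pick is made once per row of Messages, so grouping by FOO.Message may produce several rows per user. A variation that picks once per group is to put the TOP 1 ... ORDER BY NEWID() idea into a correlated scalar subquery in the select list. The sketch below shows that variation on an in-memory SQLite database from Python, with random() and LIMIT 1 standing in for NEWID() and TOP 1; the sample rows and the RandomMessage alias are assumptions, and the SQL Server equivalent is not tested here.

```python
import sqlite3

# Hypothetical rows: (Username, Message, WordCount).
MESSAGES = [
    ("ann", "hi", 1),
    ("ann", "hello there", 2),
    ("ann", "hi", 1),
    ("bob", "yo", 1),
]

QUERY = """
SELECT m.Username,
       COUNT(DISTINCT m.Message) AS "Count",
       AVG(m.WordCount)          AS "Average",
       (SELECT m2.Message
        FROM Messages AS m2
        WHERE m2.Username = m.Username
        ORDER BY random()
        LIMIT 1)                 AS RandomMessage
FROM Messages AS m
GROUP BY m.Username
ORDER BY "Count" DESC, m.Username
"""


def summarize(rows):
    """Return (username, distinct_count, avg_words, random_message) per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Messages (Username TEXT, Message TEXT, WordCount INTEGER)")
    conn.executemany("INSERT INTO Messages VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in summarize(MESSAGES):
        print(row)
```

Because the subquery is correlated only on the grouping column, it yields exactly one message per Username and leaves the COUNT and AVG columns unaffected.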