I have a table of unique users, each with a "rating" column (the average of all the ratings that user has given, computed from a separate table of reviews). I want to add another column, which I call "bias", that says whether a user's rating is above, below, or at the average of all users' ratings (hence the AVG() function). In other words, I want to see whether each user gives, on average, higher or lower ratings than the overall average. I understand the limitations of this query; ideally I would include an interval (i.e. within 0.5 points below or above the average still counts as average), but I can't seem to make even the simplest query work.
I've been using the Yelp dataset from a Coursera course, but I created a small sample that reproduces the unwanted result: just one row. I want this categorization for each row, so the example should return 3 rows, with "below average" in the first two and "above average" in the third. However, the code below produces just one row. I have mostly worked with R, and I suspect I am using incorrect syntax, but after 30 minutes of searching the web I cannot find a solution.
I am working in SQLite and want to use SQLite syntax, as required by the Coursera course.
CREATE TABLE test
(
id integer primary key,
rating integer
);
INSERT INTO test
(id, rating)
VALUES
(1, 1);
INSERT INTO test
(id, rating)
VALUES
(2, 3);
INSERT INTO test
(id, rating)
VALUES
(3, 8);
SELECT id,
rating,
CASE
WHEN rating > AVG(rating) THEN "above average"
WHEN rating < AVG(rating) THEN "below average"
ELSE "no bias"
END AS "bias"
FROM test
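You can't use the aggregate function AVG() like this: without a GROUP BY it collapses the whole table into a single row.
But you can do it with the AVG() window function (available in SQLite 3.25 and later):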
SELECT id,
rating,
CASE
WHEN rating > AVG(rating) OVER () THEN 'above average'
WHEN rating < AVG(rating) OVER () THEN 'below average'
ELSE 'no bias'
END AS bias
FROM test
Results:
| id | rating | bias |
| --- | ------ | ------------- |
| 1 | 1 | below average |
| 2 | 3 | below average |
| 3 | 8 | above average |
SELECT id,
rating,
CASE
WHEN rating > (SELECT AVG(rating) FROM test) THEN 'above average'
WHEN rating < (SELECT AVG(rating) FROM test) THEN 'below average'
ELSE 'no bias'
END AS bias
FROM test
AVG is an aggregate function and works in conjunction with GROUP BY.
When you do not specify a GROUP BY clause, it aggregates over the whole table, reducing it to one row.
In your query you select both aggregated and non-aggregated columns without listing the non-aggregated columns in the GROUP BY list. I am not a big fan of DBMSs that allow this behavior (SQLite seems to be an offender).
What I did in the query above is calculate the average of the whole table with a subquery and then compare each row against that average.
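As a quick check of the explanation above, here is a minimal sketch that runs both forms against the question's sample table using Python's built-in sqlite3 module. The 0.5-point band is the interval the question mentions as a wish; the TOLERANCE name and the :tol parameter are my own and not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id INTEGER PRIMARY KEY, rating INTEGER);
    INSERT INTO test (id, rating) VALUES (1, 1), (2, 3), (3, 8);
""")

# Bare aggregate without GROUP BY: SQLite collapses the table into one row.
collapsed = conn.execute("""
    SELECT id, rating,
           CASE WHEN rating > AVG(rating) THEN 'above average'
                WHEN rating < AVG(rating) THEN 'below average'
                ELSE 'no bias'
           END AS bias
    FROM test
""").fetchall()

# Scalar subquery: the overall average is computed once and compared per row.
# TOLERANCE (hypothetical name) widens "no bias" to a band around the average.
TOLERANCE = 0.5
per_row = conn.execute("""
    SELECT id, rating,
           CASE WHEN rating > (SELECT AVG(rating) FROM test) + :tol THEN 'above average'
                WHEN rating < (SELECT AVG(rating) FROM test) - :tol THEN 'below average'
                ELSE 'no bias'
           END AS bias
    FROM test
    ORDER BY id
""", {"tol": TOLERANCE}).fetchall()

print(len(collapsed))  # number of rows from the bare-aggregate query
for row in per_row:
    print(row)
```

Setting TOLERANCE to 0 gives back the strict above/below comparison from the answer.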
Or, as others have pointed out, you can use window functions, which apply a function over the part of the data defined by your window. They look like their regular aggregate counterparts, but you will notice the OVER keyword, which specifies that they are applied over a window. In the OVER clause you can partition your data or use it as a whole. For example, if you had multiple stores and a sales amount per day for each store, you could partition by store to compute the per-store average.
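The store example could look roughly like this. It is only a sketch: the sales table, its columns and the figures are made up for illustration, and it assumes the SQLite library behind Python's sqlite3 module is version 3.25 or newer (needed for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: one row per store per day.
    CREATE TABLE sales (store TEXT, day TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('A', '2024-01-01', 10.0),
        ('A', '2024-01-02', 20.0),
        ('B', '2024-01-01', 100.0),
        ('B', '2024-01-02', 300.0);
""")

rows = conn.execute("""
    SELECT store, day, amount,
           AVG(amount) OVER (PARTITION BY store) AS store_avg,   -- window = one store
           AVG(amount) OVER ()                   AS overall_avg  -- window = whole table
    FROM sales
    ORDER BY store, day
""").fetchall()

for row in rows:
    print(row)
```

Every input row is kept; the averages are simply repeated on each row of the window they belong to.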
A SQL question: I have a table game with columns user_id (unique per user), game_id (unique per game), game_start_timestamp_utc (the UTC timestamp for when the game starts), and game_status, which can either be ‘pass’, ‘in progress’ or ‘fail’.
The question is to write a query that returns the game with the lowest pass rate (users who passed / users enrolled).
The table looks like this:
user_id game_id game_start_timestamp_utc game_status
-----------------------------------------------------
1 111 10/22/2019 pass
2 111 10/21/2018 fail
...
I know how to do it in Python pandas (just group by game_id and calculate the pass rate), but I don't have much idea how to do it in SQL. Thanks in advance.
Use conditional aggregation. avg() comes in handy for this:
select game_id,
avg(case when game_status = 'pass' then 1.0 else 0 end) as pass_rate
from game
group by game_id
order by pass_rate
This gives you the pass rate of each game, as a value between 0 and 1, ordered by increasing rate - so the first row is the result you want.
You can keep that one row only with a row-limiting clause. The syntax varies across databases: limit 1, top (1), fetch first row, ...
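Here is a small runnable sketch of that query with LIMIT 1 as the row-limiting clause, using SQLite through Python's sqlite3 module. Only the first two rows come from the question; the rows for game 222 are invented to have something to compare against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE game (
        user_id INTEGER,
        game_id INTEGER,
        game_start_timestamp_utc TEXT,
        game_status TEXT
    );
    INSERT INTO game VALUES
        (1, 111, '10/22/2019', 'pass'),
        (2, 111, '10/21/2018', 'fail'),
        -- made-up rows for a second game
        (3, 222, '10/20/2019', 'pass'),
        (4, 222, '10/19/2019', 'pass'),
        (5, 222, '10/18/2019', 'in progress');
""")

# 1.0 for a pass, 0 otherwise; the average of those flags is the pass rate.
lowest = conn.execute("""
    SELECT game_id,
           AVG(CASE WHEN game_status = 'pass' THEN 1.0 ELSE 0 END) AS pass_rate
    FROM game
    GROUP BY game_id
    ORDER BY pass_rate
    LIMIT 1
""").fetchone()

print(lowest)
```

If two games tie for the lowest rate, LIMIT 1 returns only one of them, and which one is not defined unless a second sort key is added.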
I need to "group by" my data to distinguish between tests (each test has a specific id, name and temperature) and to calculate their count, standard deviation, etc. But I also need access to every raw data value in each group, for further index calculations that I do in a Python script.
I have found two solutions to this problem, but both seem non-optimal/flawed:
1) Using listagg to store every raw value of a group in a single string. It does the job but it is not optimized: I concatenate multiple float values into a giant string that I will immediately split and convert back to floats. That seems unnecessary and costly.
2) Removing the group by entirely and computing the count and standard deviation through partitioning. But that seems even worse to me. I don't know whether PL/SQL/Oracle optimizes this; it could be calculating the same count and standard deviation for every row (I don't know how to check this). The query result also becomes messy: since there is no 'group by' anymore, I have to add multiple checks in my Python file in order to tell each test's data apart (specific id, name and temperature).
I think that my first solution can be improved but I don't know how. How can I use a group by but still access every numeric value cleverly?
A function similar to listagg but with a collection/array output type instead of a string output type could maybe do the trick (a sort of 'array_agg' compatible with Oracle), but I don't know of any.
EDIT:
The sample data is complex and probably restricted to company-internal viewing, but I can show you my simplified query for my 1):
SELECT
rav.rav_testid as test_id,
tte.tte_testname as test_name,
tsc.tsc_temperature as temperature,
listagg(rav.rav_value, ' ') WITHIN GROUP (ORDER BY rav.rav_value) as all_specific_test_values,
COUNT(rav.rav_value) as n,
STDDEV(rav.rav_value) as sigma
FROM
...
(8 inner joins)
GROUP BY
rav.rav_testid, tte.tte_testname, tsc.tsc_temperature
ORDER BY
rav.RAV_testid, tte.tte_testname, spd.SPD_SPLITNAMEINTERNAL,tsc.tsc_temperature
The result looks like :
test_id | test_name | temperature | all_specific_test_values | n | sigma
-------------------------------------------------------------------------
6001 |VADC_A(...) | -40 | ,8094034194946289 ,8(...)| 58 | 0,54
6001 |VADC_A(...) | 25 | ,5054857852946545 ,6(...)| 56 | 0,24
6001 |VADC_A(...) | 150 | ,8625754277452524 ,4(...)| 56 | 0,26
6002 |VADC_B(...) | -40 | ,9874514651548454 ,5(...)| 57 | 0,44
I think you want analytic functions:
select t.*,
count(*) over (partition by test) as cnt,
avg(value) over (partition by test) as avg_value,
stddev(value) over (partition by test) as stddev_value
from t;
This adds additional columns on each row.
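On the Python side, the per-row output need not mean extra checks: rows can be regrouped by their key columns in a few lines. The sketch below assumes the rows arrive as tuples of (test_id, test_name, temperature, value, n, sigma); that column order and all the numbers are placeholders, not real output from the query.

```python
from itertools import groupby
from operator import itemgetter

# Placeholder rows shaped like the analytic query's output:
# (test_id, test_name, temperature, value, n, sigma)
rows = [
    (6001, "VADC_A", -40, 0.81, 2, 0.03),
    (6001, "VADC_A", 25, 0.51, 1, 0.0),
    (6001, "VADC_A", -40, 0.85, 2, 0.03),
    (6002, "VADC_B", -40, 0.99, 1, 0.0),
]

test_key = itemgetter(0, 1, 2)  # (test_id, test_name, temperature)

tests = {}
# groupby only merges adjacent rows, so sort by the same key first.
for key, group in groupby(sorted(rows, key=test_key), key=test_key):
    group = list(group)
    tests[key] = {
        "n": group[0][4],          # aggregates repeat on every row of a group
        "sigma": group[0][5],
        "values": [r[3] for r in group],  # raw values, still as floats
    }

for key, info in tests.items():
    print(key, info)
```

Adding an ORDER BY on the key columns to the SQL query would make the sorted() call unnecessary.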
I would suggest going with @Gordon Linoff's solution. That is likely the most standard solution.
If you want to go with a less standard solution, you can have a group by that returns a collection as one of the columns. Presumably, your script could iterate through that collection though it might take a bit of work in the script to do that.
create type num_tbl as table of number;
/
create table foo (
grp integer,
val number
);
insert into foo values( 1, 1.1 );
insert into foo values( 2, 1.2 );
insert into foo values( 1, 1.3 );
insert into foo values( 2, 1.4 );
select grp, avg(val), cast( collect( val ) as num_tbl )
from foo
group by grp
I'm trying to add a column which calculates percentages of different products in MS Access Query. Basically, this is the structure of the query that I'm trying to reach:
Product | Total | Percentage
Prod1 | 15 | 21.13%
Prod2 | 23 | 32.39%
Prod3 | 33 | 46.48%
Product | 71 | 100%
The formula I use for the percentage is ([Total Q of a Product]/[Totals of all Products])*100, but when I try to calculate it with the expression builder in MS Access (since my SQL skills are basic):
= [CountOfProducts] / Sum([CountOfProducts])
I receive the error message "Cannot have aggregate function in GROUP BY clause" (followed by the expression). I also tried the option with two queries: one that calculates only the totals and another that uses the first one to calculate the percentages, but the result was the same.
I'll be grateful if someone can help me with this.
You can get all but the last row of your desired output with this query.
SELECT
y.Product,
y.Total,
Format((y.Total/sub.SumOfTotal),'#.##%') AS Percentage
FROM
YourTable AS y,
(
SELECT Sum(Total) AS SumOfTotal
FROM YourTable
) AS sub;
Since that query does not include a JOIN or WHERE condition, it returns a cross join between the table and the single row of the subquery.
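To see the cross join at work outside Access, here is a sketch of the same idea in SQLite via Python's sqlite3 module. This is a substitution, not the Access query itself: SQLite has no Format(), so ROUND() stands in for it and the percentage comes back as a number rather than a formatted string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (Product TEXT, Total INTEGER);
    INSERT INTO YourTable VALUES ('Prod1', 15), ('Prod2', 23), ('Prod3', 33);
""")

# The subquery returns exactly one row, so the comma join pairs it with
# every row of YourTable without needing a JOIN or WHERE condition.
rows = conn.execute("""
    SELECT y.Product,
           y.Total,
           ROUND(100.0 * y.Total / sub.SumOfTotal, 2) AS Percentage
    FROM YourTable AS y,
         (SELECT SUM(Total) AS SumOfTotal FROM YourTable) AS sub
    ORDER BY y.Product
""").fetchall()

for product, total, pct in rows:
    print(f"{product} | {total} | {pct:.2f}%")
```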
If you need the last row from your question example, you can UNION the query with another which returns the fabricated row you want. In this example, I used a custom Dual table which is designed to always contain one and only one row. But you could substitute another table or query which returns a single row.
SELECT
y.Product,
y.Total,
Format((y.Total/sub.SumOfTotal),'#.##%') AS Percentage
FROM
YourTable AS y,
(
SELECT Sum(Total) AS SumOfTotal
FROM YourTable
) AS sub
UNION ALL
SELECT
'Product',
DSum('Total', 'YourTable'),
'100%'
FROM Dual;
I'm somewhat new to SQL queries, and I'm struggling with this particular problem.
Let's say I have a query that returns the following 3 records (kept to one column for simplicity):
Tom
Jack
Tom
And I want to have those results grouped by the name and also include the fraction (ratio) of the occurrence of that name out of the total records returned.
So, the desired result would be (as two columns):
Tom | 2/3
Jack | 1/3
How would I go about it? Determining the numerator is pretty easy (I can just use COUNT() and GROUP BY name), but I'm having trouble translating that into a ratio out of the total rows returned.
SELECT name, COUNT(name)/(SELECT COUNT(1) FROM names) FROM names GROUP BY name;
Since the denominator is fixed, the "ratio" is directly proportional to the numerator. Unless you really need to show the denominator, it'll be a lot easier to just use something like:
select name, count(*) from your_table_name
group by name
order by count(*) desc
and you'll get the right data in the right order, but the number that's shown will be the count instead of the ratio.
If you really want that denominator, you'd do a count(*) on a non-grouped version of the same select -- but depending on how long the select takes, that could be pretty slow.
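One thing to watch with the subquery version: the question does not say which database is used, and in engines that divide integers as integers (SQLite is one) the ratio comes out as 0. The sketch below, run in SQLite through Python's sqlite3 module, shows that effect and two possible ways around it; the ratio and fraction column aliases are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (name TEXT);
    INSERT INTO names VALUES ('Tom'), ('Jack'), ('Tom');
""")

# Integer / integer is integer division in SQLite, so both ratios truncate.
integer_ratio = conn.execute("""
    SELECT name, COUNT(name) / (SELECT COUNT(1) FROM names)
    FROM names
    GROUP BY name
    ORDER BY name
""").fetchall()

# Multiplying by 1.0 forces real division; || builds a 'count/total' label.
real_ratio = conn.execute("""
    SELECT name,
           COUNT(*) * 1.0 / (SELECT COUNT(*) FROM names)   AS ratio,
           COUNT(*) || '/' || (SELECT COUNT(*) FROM names) AS fraction
    FROM names
    GROUP BY name
    ORDER BY ratio DESC
""").fetchall()

print(integer_ratio)
for row in real_ratio:
    print(row)
```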
I have the following table (highscores),
id gameid userid name score date
1 38 2345 A 100 2009-07-23 16:45:01
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
5 38 2345 A 50 2009-07-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
7 32 2345 A 100 2009-07-20 16:45:01
Now, in the above structure a user can play a game multiple times, but I want to display the "Games Played" by a specific user, so each game should appear only once in that section. If a user played a game 3 times, only the attempt with the highest score should be displayed.
I want result data like:
id gameid userid name score date
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
I tried the following query but it's not giving me the correct result:
SELECT id,
gameid,
userid,
date,
MAX(score) AS score
FROM highscores
WHERE userid='2345'
GROUP BY gameid
Please tell me what will be the query for this?
Thanks
Requirement is a bit vague/confusing but would something like this satisfy the need ?
(purposely added various aggregates that may be of interest).
SELECT gameid,
MIN(date) AS FirstTime,
MAX(date) AS LastTime,
MAX(score) AS TOPscore,
COUNT(*) AS NbOfTimesPlayed
FROM highscores
WHERE userid='2345'
GROUP BY gameid
-- ORDER BY COUNT(*) DESC -- for ex. to have games played most at top
Edit: New question about adding the id column to the SELECT list
The short answer is: "No, id cannot be added, not within this particular construct". (Read further to see why.) However, if the intent is to have the id of the game with the highest score, the query can be modified, using a sub-query, to achieve that.
As explained by Alex M on this page, all column names referenced in the SELECT list that are not used inside an aggregate function (MAX, MIN, AVG, COUNT and the like) MUST be included in the GROUP BY clause. The reason for this rule is simply that, in gathering the info for the results list, SQL may encounter multiple values for such a column (listed in SELECT but not in GROUP BY) and would not know how to deal with them; rather than doing something possibly useful but possibly silly with these extra rows/values, the SQL standard dictates an error message, so that the user can modify the query and state his/her goals explicitly.
In our specific case, we could add the id to the SELECT and also to the GROUP BY list, but in doing so the grouping upon which the aggregation takes place would be different: the results list would include as many rows as there are id + gameid combinations, and the aggregate values for each of these rows would be based only on the records where id and gameid have the corresponding values (assuming id is the PK of the table, we'd get a single record per group, making MAX() and such quite meaningless).
The way to include the id (and possibly other columns) corresponding to the game with the top score is with a sub-query. The idea is that the sub-query selects the top score (within a given group by), and the main query SELECTs any column of the matching rows, even when the fields weren't (couldn't be) in the sub-query's group-by construct. BTW, do give credit on this page to rexem for showing this type of query first.
SELECT H.id,
H.gameid,
H.userid,
H.name,
H.score,
H.date
FROM highscores H
JOIN (
SELECT H2.gameid, H2.userid, MAX(H2.score) AS MaxScoreByGameUser
FROM highscores H2
GROUP BY H2.gameid, H2.userid
) AS M
ON M.gameid = H.gameid
AND M.userid = H.userid
AND M.MaxScoreByGameUser = H.score
WHERE H.userid='2345'
A few important remarks about the query above:
Duplicates: if the user played a game several times and reached the same hi-score more than once, the query will produce that many rows for that game.
GROUP BY of the sub-query may need to change for different uses of the query. If, rather than searching for the game's hi-score on a per-user basis, we wanted the absolute hi-score, we would need to exclude userid from the GROUP BY (that's why I gave the MAX alias a long, explicit name).
The userid = '2345' condition may be added in a WHERE clause of the sub-query (currently absent there), for efficiency purposes (unless MySQL's optimizer is very smart, currently the hi-scores for all game+user combinations get calculated, whereas we only need those for user '2345'). The downside is that the user id is then duplicated in the query; a possible solution is to use a variable.
There are several ways to deal with the issues mentioned above, but these seem to be out of scope for a [now rather lengthy] explanation about the GROUP BY constructs.
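For what it's worth, the duplicates remark is easy to reproduce. The sketch below uses SQLite through Python's sqlite3 module with a cut-down, made-up highscores table containing a tie, and shows one possible tie-break (keeping the lowest id per game); this is just one option among several, not necessarily the one the answer had in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down table; rows 1 and 2 tie for the top score of game 38.
    CREATE TABLE highscores (id INTEGER PRIMARY KEY, gameid INTEGER,
                             userid INTEGER, score INTEGER);
    INSERT INTO highscores VALUES
        (1, 38, 2345, 200),
        (2, 38, 2345, 200),
        (3, 39, 2345, 500);
""")

# Shared FROM/JOIN part, with the user filter moved into the sub-query.
TOP_SCORE_JOIN = """
    FROM highscores H
    JOIN (SELECT gameid, userid, MAX(score) AS MaxScoreByGameUser
          FROM highscores
          WHERE userid = 2345
          GROUP BY gameid, userid) AS M
      ON M.gameid = H.gameid
     AND M.userid = H.userid
     AND M.MaxScoreByGameUser = H.score
"""

# Plain join: every row that matches the top score is returned.
with_ties = conn.execute(
    "SELECT H.id, H.gameid, H.score" + TOP_SCORE_JOIN + "ORDER BY H.id"
).fetchall()

# Possible tie-break: keep only the lowest id among the tied rows.
one_per_game = conn.execute(
    "SELECT MIN(H.id), H.gameid, H.score" + TOP_SCORE_JOIN
    + "GROUP BY H.gameid, H.score ORDER BY H.gameid"
).fetchall()

print(with_ties)
print(one_per_game)
```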
Every field you have in your SELECT (when a GROUP BY clause is present) must be either one of the fields in the GROUP BY clause, or else a group function such as MAX, SUM, AVG, etc. In your code, userid is technically violating that but in a pretty harmless fashion (you could make your code technically SQL-standard compliant with a GROUP BY gameid, userid); fields id and date are in more serious violation: there will be many ids and dates within one GROUP BY set, and you're not telling the engine how to make a single value out of that set (MySQL picks more-or-less random ones; stricter SQL engines might more helpfully give you an error).
I know you want the id and date corresponding to the maximum score for a given grouping, but that's not explicit in your code. You'll need a subselect or a self-join to make it explicit!
Use:
SELECT t.id,
t.gameid,
t.userid,
t.name,
t.score,
t.date
FROM HIGHSCORES t
JOIN (SELECT hs.gameid,
hs.userid,
MAX(hs.score) AS max_score
FROM HIGHSCORES hs
GROUP BY hs.gameid, hs.userid) mhs ON mhs.gameid = t.gameid
AND mhs.userid = t.userid
AND mhs.max_score = t.score
WHERE t.userid = '2345'
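To check the join against the question's sample rows, here is a minimal sketch using SQLite through Python's sqlite3 module. The query is the one above, with two small changes of mine: the user filter is passed as a parameter and applied inside the sub-query, and an ORDER BY is added so the output order is stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE highscores (id INTEGER PRIMARY KEY, gameid INTEGER,
                             userid INTEGER, name TEXT, score INTEGER, date TEXT);
    INSERT INTO highscores VALUES
        (1, 38, 2345, 'A', 100, '2009-07-23 16:45:01'),
        (2, 39, 2345, 'A', 500, '2009-07-20 16:45:01'),
        (3, 31, 2345, 'A', 100, '2009-07-20 16:45:01'),
        (4, 38, 2345, 'A', 200, '2009-10-20 16:45:01'),
        (5, 38, 2345, 'A',  50, '2009-07-20 16:45:01'),
        (6, 32, 2345, 'A', 120, '2009-07-20 16:45:01'),
        (7, 32, 2345, 'A', 100, '2009-07-20 16:45:01');
""")

rows = conn.execute("""
    SELECT t.id, t.gameid, t.userid, t.name, t.score, t.date
    FROM highscores t
    JOIN (SELECT hs.gameid,
                 hs.userid,
                 MAX(hs.score) AS max_score   -- top score per game and user
          FROM highscores hs
          WHERE hs.userid = :uid
          GROUP BY hs.gameid, hs.userid) mhs
      ON mhs.gameid = t.gameid
     AND mhs.userid = t.userid
     AND mhs.max_score = t.score
    ORDER BY t.id
""", {"uid": 2345}).fetchall()

for row in rows:
    print(row)
```

The ids returned match the desired result listed in the question (2, 3, 4 and 6).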