How to correctly use AVG in a query? - sql

In a PostgreSQL database I have a table called answers, which looks like this:
| EMPLOYEE | QUESTION_ID | QUESTION_TEXT | OPTION_ID | OPTION_TEXT |
|----------|-------------|------------------------|-----------|--------------|
| Bob | 1 | Do you like soup? | 1 | 1 |
| Alex | 1 | Do you like soup? | 9 | 9 |
| Oliver | 1 | Do you like soup? | 6 | 6 |
| Bob | 2 | Do you like ice cream? | 3 | 3 |
| Alex | 2 | Do you like ice cream? | 9 | 9 |
| Oliver | 2 | Do you like ice cream? | 8 | 8 |
| Bob | 3 | Do you like summer? | 2 | 2 |
| Alex | 3 | Do you like summer? | 9 | 9 |
| Oliver | 3 | Do you like summer? | 8 | 8 |
In this table you can see that I have 3 questions and the users' answers to them. Users answer questions on a scale of one to ten. I'm trying to find the number of users whose average answer to questions 1, 2 and 3 is greater than 5, without a deep subquery. For example, only 2 users have an avg(option_text) greater than 5 across the three questions: Alex and Oliver.
I tried to use this script, but it doesn't work as I expected:
SELECT
SUM(CASE WHEN (AVG(OPTION_ID) FILTER(WHERE QUESTION_ID IN(61, 62))) > 5 THEN 1 ELSE 0 END) AS COUNT
FROM
ANSWERS;
ERROR:
SQL Error [42803]: ERROR: aggregate function calls cannot be nested

You can select all employees that have an average response greater than 5 for questions 1, 2 and 3 with a GROUP BY query:
select employee, avg(option_id)
from answers
where question_id in (1,2,3)
group by employee
having avg(option_id) > 5
and count(distinct question_id) = 3
-- the last part is only needed if you only want employees that answered all questions
To count the number of users whose average is greater than 5, wrap that query in a subquery:
select count(*) from (
select employee
from answers
where question_id in (1,2,3)
group by employee
having avg(option_id) > 5
and count(distinct question_id) = 3
) as t -- PostgreSQL before version 16 requires an alias for a subquery in FROM
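To check the counting query against the sample rows from the question, here is a minimal sketch. It assumes Python's built-in sqlite3 module with an in-memory SQLite database standing in for PostgreSQL, and a trimmed-down answers table (only the three columns the query touches); the variable names are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (employee TEXT, question_id INTEGER, option_id INTEGER)"
)
sample_rows = [
    ("Bob", 1, 1), ("Alex", 1, 9), ("Oliver", 1, 6),
    ("Bob", 2, 3), ("Alex", 2, 9), ("Oliver", 2, 8),
    ("Bob", 3, 2), ("Alex", 3, 9), ("Oliver", 3, 8),
]
conn.executemany("INSERT INTO answers VALUES (?, ?, ?)", sample_rows)

# Inner query: one row per employee who answered all three questions
# with an average above 5. Outer query: count those rows.
query = """
SELECT COUNT(*) FROM (
    SELECT employee
    FROM answers
    WHERE question_id IN (1, 2, 3)
    GROUP BY employee
    HAVING AVG(option_id) > 5
       AND COUNT(DISTINCT question_id) = 3
) AS t
"""
high_scorers = conn.execute(query).fetchone()[0]
print(high_scorers)
```

With the sample data this counts Alex and Oliver, so it should print 2.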

The following query should work:
SELECT
DISTINCT COUNT(*) OVER () AS CNT
FROM ANSWERS
WHERE QUESTION_ID IN (1, 2, 3)
GROUP BY EMPLOYEE
HAVING AVG(OPTION_ID) > 5

Related

How do you use two aggregate functions for separate tables in a join?

Sorry if this is a noob question!
I have two tables - a movie and a comment table.
I am trying to return output of the movie name and each comment for that movie as long as that movie has more than 1 comment associated to it.
Here are my tables
test_movies=# SELECT * FROM movie;
id | name | rating | release_date | original_copy_location
----+------------------------------------+--------+--------------+------------------------
1 | Cruella | 9 | 2021-05-28 | 4
7 | Shutter Island | 9 | 2010-02-19 | 4
9 | Grown Ups | 7 | 2010-06-25 | 4
11 | Guardians of the Galaxy: Volume 1 | 8 | 2014-09-01 | 4
14 | The RIng | 8 | 2002-10-18 | 4
17 | Digimon: The Movie | 6 | 2000-01-10 | 4
19 | Star Wars Episode 1 | 5 | 1999-06-21 | 4
20 | Ghosts Of Mars | 5 | 1998-09-15 | 4
5 | Interstellar | 8 | 2014-11-07 | 1
10 | Mean Girls | 8 | 2004-04-30 | 1
12 | Captain America: The First Avenger | 7 | 2011-07-22 | 1
15 | Get Out | 6 | 2017-02-24 | 1
6 | The Dark Knight | 10 | 2008-07-18 | 2
16 | Pokemon: The First Movie | 5 | 1998-11-10 | 2
18 | The Last Dance | 8 | 2020-05-01 | 2
8 | Just Go With It | 8 | 2011-02-11 | 3
13 | The Blair Witch Project | 8 | 1999-08-29 | 3
(17 rows)
test_movies=# SELECT * FROM comments;
c_id | c_comment | c_movie | c_user
------+--------------------------------------+---------+--------
1 | testing comment 1 | 16 | 4
2 | testing comment 1 | 1 | 1
3 | testing comment 1 | 1 | 2
4 | testing comment 1 | 8 | 5
5 | testing comment 1 | 6 | 3
6 | testing comment 1 | 12 | 2
7 | testing comment 1 | 20 | 3
8 | testing comment 1 | 16 | 5
9 | testing comment 1 | 17 | 4
10 | testing comment 1 | 12 | 2
(10 rows)
The output I'm trying to get is this:
name | c_comment
------------------------+-------------------------------------
Cruella | testing comment 1
Cruella | testing comment 1
Pokemon:The First Movie | testing comment 1
Pokemon:The First Movie | testing comment 1
Captain America | testing comment 1
Captain America | testing comment 1
The problem with my queries is that I can't figure out how to return both the movie name and comment associated with it using aggregate functions.
If I use the count in the first select statement it returns all rows:
SELECT m.name, c.c_comment FROM movie m, comments c WHERE m.id = c.c_movie GROUP BY m.name, c.c_comment HAVING COUNT(m.name) >= 1;
If I try the below subquery I get the error - ERROR: subquery must return only one column
SELECT m.name, c.c_comment FROM movie m, comments c WHERE m.id = c.c_movie AND(SELECT m.name, COUNT(c.c_movie) FROM movie m, comments c WHERE m.id =c.c_movie GROUP BY name HAVING COUNT(c.c_movie) > 1);
Still a bit new to SQL as I'm a student and having a tough time figuring this query out lol.
Thanks in advance!
Something like this could work
select m.name, c.c_comment
from movie m
join comments c
on c.c_movie = m.id
where exists (select 1 from comments cc where cc.c_movie=m.id group by c_movie having count(*)>1)
It's standard sql, but you cannot work with mysql and postgresql at the same time... 🤔
Use window functions!
select m.name, c.c_comment
from movie m join
(select c.*, count(*) over (partition by c_movie) as cnt
from comments c
) c
on c.c_movie = m.id
where cnt > 1;
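To try the window-function approach on the data from the question, here is a rough sketch. It assumes Python's sqlite3 module with an in-memory SQLite database (version 3.25 or newer, for window functions) standing in for PostgreSQL, and it loads only the movie and comment columns the query needs, for the movies that actually have comments.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE comments (c_id INTEGER, c_comment TEXT, c_movie INTEGER)")

# Subset of the question's movie rows: only movies referenced by a comment.
conn.executemany("INSERT INTO movie VALUES (?, ?)", [
    (1, "Cruella"), (6, "The Dark Knight"), (8, "Just Go With It"),
    (12, "Captain America: The First Avenger"),
    (16, "Pokemon: The First Movie"), (17, "Digimon: The Movie"),
    (20, "Ghosts Of Mars"),
])
comment_movies = [16, 1, 1, 8, 6, 12, 20, 16, 17, 12]
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [(i + 1, "testing comment 1", m) for i, m in enumerate(comment_movies)],
)

# COUNT(*) OVER (PARTITION BY c_movie) attaches the per-movie comment count
# to every comment row, so the outer WHERE can filter without collapsing rows.
query = """
SELECT m.name, c.c_comment
FROM movie m
JOIN (
    SELECT c_movie, c_comment,
           COUNT(*) OVER (PARTITION BY c_movie) AS cnt
    FROM comments
) c ON c.c_movie = m.id
WHERE c.cnt > 1
ORDER BY m.name
"""
rows_out = conn.execute(query).fetchall()
for name, comment in rows_out:
    print(name, "|", comment)
```

Only the three movies with two comments each (ids 1, 12 and 16) should come back, giving six rows in total.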

Generate 'average' column from sub query and ROW_NUMBER window function in SQL SELECT

I have the following SQL Server tables (with sample data):
Questionnaire
id | coachNodeId | youngPersonNodeId | complete
1 | 12 | 678 | 1
2 | 12 | 52 | 1
3 | 30 | 99 | 1
4 | 12 | 678 | 1
5 | 12 | 678 | 1
6 | 30 | 99 | 1
7 | 12 | 52 | 1
8 | 30 | 102 | 1
Answer
id | questionnaireId | score
1 | 1 | 1
2 | 2 | 3
3 | 2 | 2
4 | 2 | 5
5 | 3 | 5
6 | 4 | 5
7 | 4 | 3
8 | 5 | 4
9 | 6 | 1
10 | 6 | 3
11 | 7 | 5
12 | 8 | 5
ContentNode
id | text
12 | Zak
30 | Phil
52 | Jane
99 | Ali
102 | Ed
678 | Chris
I have the following T-SQL query:
SELECT
Questionnaire.id AS questionnaireId,
coachNodeId AS coachNodeId,
coachNode.[text] AS coachName,
youngPersonNodeId AS youngPersonNodeId,
youngPersonNode.[text] AS youngPersonName,
ROW_NUMBER() OVER (PARTITION BY Questionnaire.coachNodeId, Questionnaire.youngPersonNodeId ORDER BY Questionnaire.id) AS questionnaireNumber,
score = (SELECT AVG(score) FROM Answer WHERE Answer.questionnaireId = Questionnaire.id)
FROM
Questionnaire
LEFT JOIN
ContentNode AS coachNode ON Questionnaire.coachNodeId = coachNode.id
LEFT JOIN
ContentNode AS youngPersonNode ON Questionnaire.youngPersonNodeId = youngPersonNode.id
WHERE
(complete = 1)
ORDER BY
coachNodeId, youngPersonNodeId
This query outputs the following example data:
questionnaireId | coachNodeId | coachName | youngPersonNodeId | youngPersonName | questionnaireNumber | score
1 | 12 | Zak | 678 | Chris | 1 | 1
2 | 12 | Zak | 52 | Jane | 1 | 3
3 | 30 | Phil | 99 | Ali | 1 | 5
4 | 12 | Zak | 678 | Chris | 2 | 4
5 | 12 | Zak | 678 | Chris | 3 | 4
6 | 30 | Phil | 99 | Ali | 2 | 2
7 | 12 | Zak | 52 | Jane | 2 | 5
8 | 30 | Phil | 102 | Ed | 1 | 5
To explain what's happening here… There are various coaches whose job is to undertake questionnaires with various young people, and log the scores. A coach might, at a later date, repeat the questionnaire with the same young person several times, hoping that they get a better score. The ultimate goal of what I'm trying to achieve is that the managers of the coaches want to see how well the coaches are performing, so they'd like to see whether the scores for the questionnaires tend to go up or not. The window function represents a way to establish how many times the questionnaire has been undertaken by the same coach/young person combo.
I need to be able to determine the average score based on the questionnaire number. So for example, the coach 'Zak' logged scores of '1' and '3' for his first questionnaires (where questionnaireNumber = 1) so the average would be 2. For his second questionnaires (where questionnaireNumber = 2) the scores were '3' and '5' so the average would be 4. So in analysing this data we know that over time Zak's questionnaire scores have improved from an average of '2' the first time to an average of '4' the second time.
I feel like the query needs to be grouped by the coachNodeId and questionnaireNumber values, so it would output something like this (I've omitted the questionnaireId, youngPersonNodeId, youngPersonName and score columns as they aren't crucial for the output; they're only used to derive the averageScore and wouldn't be useful the way the results are grouped):
coachNodeId | coachName | questionnaireNumber | averageScore
12 | Zak | 1 | 2 (calculation: (1 + 3) / 2)
12 | Zak | 2 | 4 (calculation: (3 + 5) / 2)
12 | Zak | 3 | 4 (only one value: 4)
30 | Phil | 1 | 5 (calculation: (5 + 5) / 2)
30 | Phil | 2 | 2 (only one value: 2)
Could anyone suggest how I can modify my query to output the average scores based on the score from the sub-query and the ROW_NUMBER window function? I've hit the limits of my SQL skills!
Many thanks.
It is a bit hard to tell without sample data, but I think you are describing aggregation:
SELECT q.coachNodeId AS coachNodeId,
cn.[text] AS coachName,
q.youngPersonNodeId AS youngPersonNodeId,
ypn.[text] AS youngPersonName,
AVG(score)
FROM Questionnaire q JOIN
ContentNode cn
ON q.coachNodeId = cn.id JOIN
ContentNode ypn
ON q.youngPersonNodeId = ypn.id LEFT JOIN
Answer a
ON a.questionnaireId = q.id
WHERE complete = 1
GROUP BY q.coachNodeId, cn.[text],
q.youngPersonNodeId, ypn.[text]
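If the average really has to be grouped by coach and questionnaire number, one possible shape is to compute ROW_NUMBER and the per-questionnaire score in a CTE first, then aggregate over that. The sketch below is only an illustration under assumptions: it runs on Python's sqlite3 module (SQLite 3.25+) rather than SQL Server, and it uses CAST(... AS INTEGER) to imitate the integer result that SQL Server's AVG returns for an int column. The CTE name scored is made up.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Questionnaire (id INTEGER, coachNodeId INTEGER,
                            youngPersonNodeId INTEGER, complete INTEGER);
CREATE TABLE Answer (id INTEGER, questionnaireId INTEGER, score INTEGER);
CREATE TABLE ContentNode (id INTEGER, "text" TEXT);
""")
conn.executemany("INSERT INTO Questionnaire VALUES (?, ?, ?, 1)", [
    (1, 12, 678), (2, 12, 52), (3, 30, 99), (4, 12, 678),
    (5, 12, 678), (6, 30, 99), (7, 12, 52), (8, 30, 102),
])
conn.executemany("INSERT INTO Answer VALUES (?, ?, ?)", [
    (1, 1, 1), (2, 2, 3), (3, 2, 2), (4, 2, 5), (5, 3, 5), (6, 4, 5),
    (7, 4, 3), (8, 5, 4), (9, 6, 1), (10, 6, 3), (11, 7, 5), (12, 8, 5),
])
conn.executemany("INSERT INTO ContentNode VALUES (?, ?)", [
    (12, "Zak"), (30, "Phil"), (52, "Jane"), (99, "Ali"), (102, "Ed"), (678, "Chris"),
])

# Step 1 (CTE): number each questionnaire per coach/young person pair and
# attach its average score. Step 2: average those scores per coach and number.
query = """
WITH scored AS (
    SELECT q.coachNodeId,
           ROW_NUMBER() OVER (PARTITION BY q.coachNodeId, q.youngPersonNodeId
                              ORDER BY q.id) AS questionnaireNumber,
           (SELECT CAST(AVG(a.score) AS INTEGER)
            FROM Answer a
            WHERE a.questionnaireId = q.id) AS score
    FROM Questionnaire q
    WHERE q.complete = 1
)
SELECT s.coachNodeId, cn."text" AS coachName, s.questionnaireNumber,
       CAST(AVG(s.score) AS INTEGER) AS averageScore
FROM scored s
JOIN ContentNode cn ON cn.id = s.coachNodeId
GROUP BY s.coachNodeId, cn."text", s.questionnaireNumber
ORDER BY s.coachNodeId, s.questionnaireNumber
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

On the sample data this should reproduce the desired table: Zak with 2, 4 and 4 for questionnaire numbers 1 to 3, and Phil with 5 and 2. In T-SQL the same CTE structure should carry over, with the casts dropped since AVG over an int column already returns an int there.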

Aggregate function calls cannot be nested?

In a PostgreSQL database I have a table called answers. This table stores information about how users answered questions. There are only 4 questions in the table. At the same time, the number of users who answered the questions can vary, and a user may answer only some of the questions.
Table answers:
| EMPLOYEE | QUESTION_ID | QUESTION_TEXT | OPTION_ID | OPTION_TEXT |
|----------|-------------|------------------------|-----------|--------------|
| Bob | 1 | Do you like soup? | 1 | Yes |
| Alex | 1 | Do you like soup? | 2 | No |
| Kate | 1 | Do you like soup? | 3 | I don't know |
| Bob | 2 | Do you like ice cream? | 1 | Yes |
| Alex | 2 | Do you like ice cream? | 3 | I don't know |
| Oliver | 2 | Do you like ice cream? | 1 | Yes |
| Bob | 3 | Do you like summer? | 2 | No |
| Alex | 3 | Do you like summer? | 1 | Yes |
| Jack | 3 | Do you like summer? | 2 | No |
| Bob | 4 | Do you like winter? | 3 | I don't know |
| Alex | 4 | Do you like winter? | 1 | Yes |
| Oliver | 4 | Do you like winter? | 3 | I don't know |
For example, with the following code I can find the average of the answers to questions 1 and 2 for each person who answered them.
select
employee,
avg(
case when question_id in (1, 2) then option_id else null end
) as average_score
from
answers
group by
employee
Result:
| EMPLOYEE | AVERAGE_SCORE |
|----------|---------------|
| Bob | 1 |
| Alex | 2,5 |
| Kate | 3 |
| Oliver | 1 |
Now, I want to know the number of users whose average of the answers to questions 1 and 2 is >= 2. I tried the following code, but it raises an error:
select
count(
avg(
case when question_id in (1, 2) then option_id else null end
)
) as average_score
from
answers
where
average_score >= 2
group by
answers.employee
ERROR:
SQL Error [42803]: ERROR: aggregate function calls cannot be nested
You need to filter after aggregation. That uses a having clause. In Postgres, you can also use filter:
select employee,
avg(option_id) filter (where question_id in (1, 2)) as average_score
from answers
group by employee
having avg(option_id) filter (where question_id in (1, 2)) >= 2;
If you want the count, then use this as a subquery: select count(*) from <the above query>.
It is strange that you are equating "option_id" with "score", but that is how your question is phrased.
You have to use a HAVING clause. Note that PostgreSQL does not let HAVING refer to a column alias from the select list, so the expression has to be repeated:
select employee,
avg(case when question_id in (1, 2) then option_id else null end) as average_score
from answers
group by employee
having avg(case when question_id in (1, 2) then option_id else null end) >= 2;

Select max value from column for every value in other two columns

I'm working on a webapp that tracks TV shows, and I need to get the ids of all episodes that are season finales, that is, the episode with the highest episode number in each season of each TV show.
This is a simplified version of my "episodes" table.
id tvshow_id season epnum
---|-----------|--------|-------
1 | 1 | 1 | 1
2 | 1 | 1 | 2
3 | 1 | 1 | 3
4 | 1 | 2 | 1
5 | 1 | 2 | 2
6 | 2 | 1 | 1
7 | 2 | 1 | 2
8 | 2 | 1 | 3
9 | 2 | 1 | 4
10 | 2 | 2 | 1
11 | 2 | 2 | 2
The expected output:
id
---|
3 |
5 |
9 |
11 |
I've managed to get this working for the latest season but I can't make it work for all seasons.
I've also tried to take some ideas from this but I can't seem to find a way to add the tvshow_id in there.
I'm using Postgres v10
SELECT Id from
(Select *, Row_number() over (partition by tvshow_id,season order by epnum desc) as ranking from tbl)c
Where ranking=1
You can use the SQL below to get your result, using GROUP BY in a subquery:
select id from tab_x
where (tvshow_id,season,epnum) in (
select tvshow_id,season,max(epnum)
from tab_x
group by tvshow_id,season)
Below is a simple query that returns the desired result. It should also perform well, thanks to the DISTINCT ON clause:
select
distinct on (tvshow_id,season)
id
from your_table
order by tvshow_id,season ,epnum desc
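For a quick check of the GROUP BY idea on the sample table, here is a minimal sketch. It assumes Python's sqlite3 module with an in-memory SQLite database instead of Postgres, and it joins back to the grouped maximum rather than using a row-value IN or DISTINCT ON (the latter is Postgres-specific).

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes (id INTEGER, tvshow_id INTEGER, season INTEGER, epnum INTEGER)"
)
conn.executemany("INSERT INTO episodes VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 1), (2, 1, 1, 2), (3, 1, 1, 3),
    (4, 1, 2, 1), (5, 1, 2, 2),
    (6, 2, 1, 1), (7, 2, 1, 2), (8, 2, 1, 3), (9, 2, 1, 4),
    (10, 2, 2, 1), (11, 2, 2, 2),
])

# Find the highest episode number per (show, season), then join back
# to the table to recover the id of that episode.
query = """
SELECT e.id
FROM episodes e
JOIN (
    SELECT tvshow_id, season, MAX(epnum) AS last_epnum
    FROM episodes
    GROUP BY tvshow_id, season
) f ON f.tvshow_id = e.tvshow_id
   AND f.season = e.season
   AND f.last_epnum = e.epnum
ORDER BY e.id
"""
finale_ids = [row[0] for row in conn.execute(query)]
print(finale_ids)
```

This should list ids 3, 5, 9 and 11, matching the expected output in the question.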

Strange window function behaviour

I have the following set of data:
player | score | day
--------+-------+------------
John | 3 | 02-01-2014
John | 5 | 02-02-2014
John | 7 | 02-03-2014
John | 9 | 02-04-2014
John | 11 | 02-05-2014
John | 13 | 02-06-2014
Mark | 2 | 02-01-2014
Mark | 4 | 02-02-2014
Mark | 6 | 02-03-2014
Mark | 8 | 02-04-2014
Mark | 10 | 02-05-2014
Mark | 12 | 02-06-2014
Given two time ranges:
02-01-2014..02-03-2014
02-04-2014..02-06-2014
I need to get average score for each player within a given time range. Ultimate result I'm trying to achieve is this:
player | period_1_score | period_2_score
--------+----------------+----------------
John | 5 | 11
Mark | 4 | 10
The original algorithm I came up with was:
perform SELECT with two values, derived by partitioning the set of scores into two for each time period
over the first SELECT, perform another one, grouping the set by player name.
I'm stuck on step 1: running the following query:
SELECT
player,
AVG(score) OVER (PARTITION BY day BETWEEN '02-01-2014' AND '02-03-2014') AS period_1,
AVG(score) OVER (PARTITION BY day BETWEEN '02-04-2014' AND '02-06-2014') AS period_2;
This gets me an incorrect result (note how the period 1 and period 2 average scores are the same):
player | period_1_score | period_2_score
--------+----------------+----------------
John | 5 | 5
John | 5 | 5
John | 5 | 5
John | 5 | 5
John | 5 | 5
John | 5 | 5
Mark | 4 | 4
Mark | 4 | 4
Mark | 4 | 4
Mark | 4 | 4
Mark | 4 | 4
Mark | 4 | 4
I think I don't fully understand how window functions work... I have 2 questions:
What is wrong with my query?
How do I do it right?
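You don't need a window function for this. PARTITION BY on a boolean expression only splits the rows into those inside the range and those outside it; since your two ranges are complementary, each row ends up in the same group for both columns, which is why the two averages match.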
Try:
select
player
,avg(case when day BETWEEN '02-01-2014' AND '02-03-2014' then score else null end) as period_1_score
,avg(case when day BETWEEN '02-04-2014' AND '02-06-2014' then score else null end) as period_2_score
from <your data>
group by player
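Here is a small sketch of that conditional aggregation run against the sample scores. It assumes Python's sqlite3 module with an in-memory SQLite database in place of PostgreSQL, a made-up table name scores, and dates stored as ISO strings (YYYY-MM-DD) so that BETWEEN compares them correctly as text.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER, day TEXT)")

# John scores 3, 5, ..., 13 and Mark scores 2, 4, ..., 12 on Feb 1 to Feb 6, 2014.
data = []
for offset in range(6):
    day = f"2014-02-{offset + 1:02d}"
    data.append(("John", 3 + 2 * offset, day))
    data.append(("Mark", 2 + 2 * offset, day))
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", data)

# Each CASE yields NULL outside its period, and AVG skips NULLs,
# so one GROUP BY produces both period averages side by side.
query = """
SELECT player,
       AVG(CASE WHEN day BETWEEN '2014-02-01' AND '2014-02-03' THEN score END) AS period_1_score,
       AVG(CASE WHEN day BETWEEN '2014-02-04' AND '2014-02-06' THEN score END) AS period_2_score
FROM scores
GROUP BY player
ORDER BY player
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

John should come out with 5 and 11, and Mark with 4 and 10, which is the result the question is after.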