Count(*) and Sum in the same row - sql

I'm banging my head against the wall, here. I've looked at dozens of StackOverflow questions that are similar, and they get me close, but I haven't found one yet that does what I need.
I have thousands of questions in a database with answers from multiple users to each question. I need to aggregate the answers to show the count of distinct answers per question. That's the easy part; where I'm stumbling is in adding a Sum column to show the total number of answers given for each question. I can do it if I restrict the Where clause to specific questions, but I'm trying to get this all into one query if possible.
Here's the Query:
select c.ID, a.userID, c.question, a.answer, count(a.answer) as cnt
from NotableAnswers a, categories b, questions c
where c.fkCategory = b.ID and a.questionID = c.ID and b.ID = 18
Group By a.answer, c.ID, c.question
Order By c.ID, answer asc
What I need is a result set that looks like this
ID | userID | Question | Answer | cnt | totcnt
------------------------------------------------------------------
175 | 10318 |Favorite... |Dropbox | 15 | 35
175 | 10354 |Favorite... |Box | 2 | 35
175 | 10323 |Favorite... |Google Drive | 15 | 35
175 | 103111 |Favorite... |Cubby | 3 | 35
186 | 10318 |Best IDE... |IntelliJ | 4 | 12
186 | 103613 |Best IDE... |Android Studio| 6 | 12
186 | 103117 |Best IDE... |Eclipse | 2 | 12
This set shows the Answer as an aggregate and the count of that specific answer along with the sum of the number of answers provided to each distinct question.
Any and all help greatly appreciated.

First, learn to use proper join syntax. Simple rule: Never use commas in the FROM clause. Always use proper explicit JOIN syntax.
Second, the answer is window functions:
-- userID is not in the GROUP BY, so one representative value is taken per group
select q.ID, min(a.userID) as userID, q.question, a.answer, count(a.answer) as cnt,
sum(count(a.answer)) over (partition by q.ID) as total_cnt
from NotableAnswers a join
questions q
on a.questionID = q.ID join
categories c
on q.fkCategory = c.ID
where c.ID = 18
Group By a.answer, q.ID, q.question
Order By q.ID, a.answer asc;
In addition, it is better to use table aliases that are abbreviations for the table names rather than arbitrary letters.
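To see the nested aggregate in action, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the asker's database. The `answers` table and its rows are made up for illustration, and the sketch assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# In-memory database with a hypothetical, simplified answers table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE answers (questionID INTEGER, userID INTEGER, answer TEXT);
INSERT INTO answers VALUES
  (175, 1, 'Dropbox'), (175, 2, 'Dropbox'), (175, 3, 'Box'),
  (186, 1, 'Eclipse'), (186, 2, 'IntelliJ'),
  (186, 3, 'IntelliJ'), (186, 4, 'IntelliJ');
""")

# COUNT runs per (question, answer) group; the window SUM then adds up
# those group counts across all groups that share the same question.
rows = conn.execute("""
SELECT questionID, answer,
       COUNT(answer) AS cnt,
       SUM(COUNT(answer)) OVER (PARTITION BY questionID) AS total_cnt
FROM answers
GROUP BY questionID, answer
ORDER BY questionID, answer
""").fetchall()

for row in rows:
    print(row)
```

Each output row carries both the per-answer count and the per-question total, which is the cnt / totcnt pair the question asks for.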

Related

How to compute overlap percentage of agreement between people in Hive table

Suppose I have a survey where each question has 4 possible answers, and surveyed people can choose at least one answer (multiple answers allowed). I want to compute per question per answer, how many people chose that answer. For example, if I have the hive table:
question_id | answer_id | person_id
-------------------------------------
1 | A | 1
1 | B | 1
1 | C | 1
1 | D | 1
1 | A | 2
1 | B | 2
1 | C | 2
2 | D | 1
2 | A | 1
Then the resulting table would be:
question_id | answer_id | Percentage
-------------------------------------
1 | A | 100
1 | B | 100
1 | C | 100
1 | D | 50
2 | D | 50
2 | A | 50
For question 1, both people put A,B,C giving 100% for all three, but one person put D as well, giving 50%. For question 2, one person put D and one person put A, giving 50% and 50%.
I've been really stuck and I haven't been able to find anything online that accomplishes what I'm looking for. Any help would be amazing!
Hmmm . . . If I understand correctly, you want the number of people who chose one particular question/answer combination divided by the number of people who answered the question. If so, I think this does it:
select qa.*, qa.num_persons * 100.0 / q.num_persons
from (select question_id, answer_id, count(*) as num_persons
from t
group by question_id, answer_id
) qa join
(select question_id, count(distinct person_id) as num_persons
from t
group by question_id
) q
on qa.question_id = q.question_id;
You can also use analytic functions, with size(collect_set(...)) for the distinct count. This eliminates the join and works fine as long as the number of distinct persons per question is not too big (the array produced by collect_set has to fit in memory):
select distinct qa.question_id, qa.answer_id,
qa.num_persons * 100.0 / size(qa.question_persons) as Percentage
from (select question_id, answer_id,
count(*) over (partition by question_id, answer_id) as num_persons,
collect_set(person_id) over(partition by question_id) as question_persons
from t
) qa;
I'm not familiar with prestoDB but below is a SQL script that will have the same result as what you posted.
The 2.0 is the number of persons. You might want to select that first and store it in a variable.
select
question_id, answer_id, (count(answer_id)/2.0) * 100.0
from Sample
group by question_id, answer_id
order by question_Id, answer_id
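The answers above divide by different denominators, which matters on the question's sample rows. The following is a rough sketch, run through Python's sqlite3 module as a stand-in for Hive, that compares the per-question denominator from the join answer with an overall person count computed by a scalar subquery instead of the hard-coded 2.0. The table name `t` follows the first answer; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (question_id INTEGER, answer_id TEXT, person_id INTEGER);
INSERT INTO t VALUES
  (1,'A',1),(1,'B',1),(1,'C',1),(1,'D',1),
  (1,'A',2),(1,'B',2),(1,'C',2),
  (2,'D',1),(2,'A',1);
""")

# Denominator: distinct persons who answered that particular question.
per_question = conn.execute("""
SELECT qa.question_id, qa.answer_id, qa.num_persons * 100.0 / q.num_persons
FROM (SELECT question_id, answer_id, COUNT(*) AS num_persons
      FROM t GROUP BY question_id, answer_id) qa
JOIN (SELECT question_id, COUNT(DISTINCT person_id) AS num_persons
      FROM t GROUP BY question_id) q
  ON qa.question_id = q.question_id
ORDER BY qa.question_id, qa.answer_id
""").fetchall()

# Denominator: distinct persons in the whole table (replaces the literal 2.0).
overall = conn.execute("""
SELECT question_id, answer_id,
       COUNT(*) * 100.0 / (SELECT COUNT(DISTINCT person_id) FROM t)
FROM t
GROUP BY question_id, answer_id
ORDER BY question_id, answer_id
""").fetchall()

print(per_question)
print(overall)
```

In the sample rows only person 1 answered question 2, so the per-question version reports 100 for both of its answers, while the overall version reproduces the 50s shown in the expected table. Both sketches assume a person appears at most once per question/answer pair.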

Reorder columns and rows in StackExchange query

I want to do data analysis on some Stack Overflow posts and need to get a query output in the right format. My goal is to input post IDs and get my answers in the following format:
ID|Title|Question|Answer1|Answer2|Answer3|Answer4|Answer5|Answer...
__________________________________________________________________
1 |Tit 1|Quest 1 |1.Answ |2.Answ |3.Answ |4.Answ |5.Answ |Answer...
2 |Tit 2|Quest 2 |1.Answ |2.Answ |3.Answ | | |
3 |Tit 3|Quest 3 |1.Answ |2.Answ |3.Answ |4.Answ | |
I am not familiar with writing queries on StackExchange, but I managed to write a query that gets almost the right output. My results look like this:
ID|Title|Question|Answer|
_________________________
1 |Tit 1|Quest 1 |1.Answ |
1 |Tit 1|Quest 1 |2.Answ |
1 |Tit 1|Quest 1 |3.Answ |
2 |Tit 2|Quest 2 |2.Answ |
2 |Tit 2|Quest 2 |2.Answ |
2 |Tit 2|Quest 2 |2.Answ |
As you can see, I duplicate the Id, Title and Question for each answer, and the answers are in a column and not side by side.
This is the query I managed to write. Can somebody help me with that or point me in the right direction?
select
p.Id, p.Title, p.Body, k.Body
from
Posts as p inner join
Posts as k on
p.id = k.parentid
where
p.Id in (##id##) and k.posttypeid=2
You'll want to PIVOT your table to turn the rows into columns.
Check out this article about Pivots. The downside is that you need to hard-code each possible answer column (Answer1, Answer2, ...), even though you don't know in advance how many answers there will be.
Alternatively, use STUFF with FOR XML PATH to put all answers in one column, something like:
SELECT p.Id, p.Title, p.Body AS Q,
STUFF(
(
-- concatenate the bodies of all answers (PostTypeId = 2) to this question
SELECT '|' + k.Body FROM Posts AS k
WHERE k.ParentId = p.Id AND k.PostTypeId = 2
FOR XML PATH('') ), 1, 1, '') AS Answers
FROM Posts AS p
WHERE p.Id IN (##id##)
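For the side-by-side layout, one option that does not need the PIVOT keyword is to number each question's answers with ROW_NUMBER and spread them out with conditional aggregation. This is a swapped-in technique, not the PIVOT syntax mentioned above, and it shares the same limitation: the answer columns are hard-coded. A minimal sketch with Python's sqlite3 module (SQLite 3.25+ assumed) and a made-up, cut-down `Posts` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts (Id INTEGER, ParentId INTEGER, PostTypeId INTEGER,
                    Title TEXT, Body TEXT);
INSERT INTO Posts VALUES
  (1, NULL, 1, 'Tit 1', 'Quest 1'),
  (2, NULL, 1, 'Tit 2', 'Quest 2'),
  (10, 1, 2, NULL, '1.Answ'),
  (11, 1, 2, NULL, '2.Answ'),
  (12, 1, 2, NULL, '3.Answ'),
  (13, 2, 2, NULL, '1.Answ');
""")

# Number the answers of each question, then pick answer n into column n.
rows = conn.execute("""
SELECT q.Id, q.Title, q.Body,
       MAX(CASE WHEN a.rn = 1 THEN a.Body END) AS Answer1,
       MAX(CASE WHEN a.rn = 2 THEN a.Body END) AS Answer2,
       MAX(CASE WHEN a.rn = 3 THEN a.Body END) AS Answer3
FROM Posts q
LEFT JOIN (SELECT ParentId, Body,
                  ROW_NUMBER() OVER (PARTITION BY ParentId ORDER BY Id) AS rn
           FROM Posts
           WHERE PostTypeId = 2) a
       ON a.ParentId = q.Id
WHERE q.PostTypeId = 1
GROUP BY q.Id, q.Title, q.Body
ORDER BY q.Id
""").fetchall()

for row in rows:
    print(row)
```

Questions with fewer answers than columns simply get NULL (None in Python) in the unused columns.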

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself stuck on the last query. The database models a reservation system for an opera house. The query is about associating each spectator with the other spectators who attend the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of each spectator and the date of the show they are attending. The query looks like this:
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand the problem better and give me a hint towards a solution? Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the STUFF function. The only drawback here is that, since your ids are ints, placing a comma between values requires a workaround (the values need to be cast to strings). Below is the method I can think of using that workaround.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
@> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
@> is the "contains" operator for arrays (used in query 1), so we get all spectators that have seen at least the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
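As a small runnable illustration of the counting idea behind query 2, here is a sketch using Python's sqlite3 module in place of PostgreSQL. SQLite has no array_agg, so the sketch only returns the matching spectator ids as rows. The sample reservations are invented, and the sketch relies on the same assumption as above: (id_spectator, id_show) is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reservations (
    id_spectator INTEGER,
    id_show INTEGER,
    UNIQUE (id_spectator, id_show)
);
INSERT INTO reservations VALUES
  (1, 1), (1, 14),            -- prime spectator: shows 1 and 14
  (2, 1), (2, 14), (2, 20),   -- saw both, plus one more: qualifies
  (3, 1), (3, 14),            -- saw exactly the same shows: qualifies
  (4, 1),                     -- missed show 14
  (5, 14), (5, 20);           -- missed show 1
""")

prime = 1  # hypothetical prime spectator id

# Keep only reservations for the prime spectator's shows, then require
# that each other spectator matches every one of those shows.
others = [row[0] for row in conn.execute("""
WITH shows AS (
    SELECT id_show FROM reservations WHERE id_spectator = :prime
)
SELECT r.id_spectator
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> :prime
GROUP BY r.id_spectator
HAVING COUNT(*) = (SELECT COUNT(*) FROM shows)
ORDER BY r.id_spectator
""", {"prime": prime})]

print(others)
```

Spectators 4 and 5 each match only one of the two shows, so the HAVING clause drops them.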
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I arrived at looks like this:
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

How can I get the MAX COUNT for multiple users?

I'm sorry if this happens to be a re-post; however, looking through all of the previous questions I could find with similar wording, I have not been able to find a working answer.
I have a trainingHistory table that has a record for every new training. The training can be done by multiple trainers. Clients can have multiple trainers.
What I am trying to accomplish is to COUNT the number of clients that were last trained by each trainer.
Example:
clientID | trainDate | trainerID
101 | 2012-03-13 10:58:11| 10
101 | 2012-03-12 10:58:11| 11
102 | 2012-03-15 10:58:11| 10
102 | 2012-03-09 10:58:11| 12
103 | 2012-03-08 10:58:11| 7
So the end result I am looking for would be:
Results
trainerID | count
10 | 2
7 | 1
I've tried quite a few different queries and looked over quite a few answers, including this one here Using sub-queries in SQL to find max(count()) but have so far been unable to get the desired result.
What I keep getting is like this:
Results
trainerID | count
10 | 5
7 | 5
How can I get an accurate count per trainer as opposed to an overall total?
The closest I've gotten is this:
SELECT t.trainerName,
t.trainerID,
(
SELECT COUNT(lastTrainerCount)
FROM (
SELECT MAX(th.clientID) AS lastTrainerCount
FROM trainingHistory th
GROUP BY th.clientID
) AS lastTrainerCount
)
FROM trainers t
INNER JOIN trainingHistory th ON (th.trainerID = t.trainerID)
WHERE th.trainingDate BETWEEN '12/14/14' AND '02/07/15'
GROUP BY t.trainerName, t.trainerID
Which results in:
Results
trainerID | count
10 | 1072
7 | 1072
Using SQL Server 2012
Appreciate any help you can provide.
First number each client's rows by trainDate, newest first, in a sub-select, so that the most recent training gets Rn = 1. Then count the trainerID values among those rows in the outer query. Try this:
select trainerID,count(trainerID) [Count]
From
(
select clientID,trainDate,trainerID,
row_number()over(partition by clientID order by trainDate Desc) Rn
From yourtable
) A
where Rn=1
Group by trainerID
SQLFIDDLE DEMO
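For a quick local check, the same ROW_NUMBER approach can be run against the question's sample rows. This is a minimal sketch using Python's sqlite3 module rather than SQL Server 2012 (SQLite 3.25+ assumed); the table name `trainingHistory` comes from the question, and the dates are stored as ISO text so they sort chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trainingHistory (clientID INTEGER, trainDate TEXT, trainerID INTEGER);
INSERT INTO trainingHistory VALUES
  (101, '2012-03-13 10:58:11', 10),
  (101, '2012-03-12 10:58:11', 11),
  (102, '2012-03-15 10:58:11', 10),
  (102, '2012-03-09 10:58:11', 12),
  (103, '2012-03-08 10:58:11', 7);
""")

# rn = 1 marks the most recent training of each client;
# the outer query counts those rows per trainer.
rows = conn.execute("""
SELECT trainerID, COUNT(*) AS cnt
FROM (SELECT clientID, trainerID,
             ROW_NUMBER() OVER (PARTITION BY clientID
                                ORDER BY trainDate DESC) AS rn
      FROM trainingHistory) latest
WHERE rn = 1
GROUP BY trainerID
ORDER BY trainerID
""").fetchall()

print(rows)
```

Trainers 11 and 12 do not appear because neither was the last trainer of any client.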

Accessing column alias in postgresql

Having a little bit of trouble understanding how a query alias works in postgresql.
I have the following:
SELECT DISTINCT robber.robberid,
nickname,
Count(accomplices.robberid) AS count1
FROM robber
INNER JOIN accomplices
ON accomplices.robberid = robber.robberid
GROUP BY robber.robberid,
robber.nickname
ORDER BY Count(accomplices.robberid) DESC;
robberid | nickname | count1
----------+--------------------------------+--------
14 | Boo Boo Hoff | 7
15 | King Solomon | 7
16 | Bugsy Siegel | 7
23 | Sonny Genovese | 6
1 | Al Capone | 5
...
I can rename the "count1" column using the as command but I can't seem to be able to refer to this again in the query? I am trying to include a HAVING command at the end of this query to query only objects who have a count less than half of the max.
This is homework but I am not asking for the answer only a pointer to how I can include the count1 column in another clause.
Can anyone help?
In general, you can't refer to an aggregate column's alias in WHERE or HAVING (PostgreSQL accepts an output alias only in ORDER BY and GROUP BY), so you have to repeat the aggregate there.
If you really want to use its name, you could wrap your query as a subquery:
SELECT *
FROM
(
SELECT DISTINCT robber.robberid, nickname, count(accomplices.robberid)
AS count1 FROM robber
INNER JOIN accomplices
ON accomplices.robberid = robber.robberid
GROUP BY robber.robberid, robber.nickname
) v
ORDER BY count1 desc
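Once the aggregate lives in a subquery or CTE, its alias is an ordinary column, so it can also be used in a filter. The sketch below shows that pattern with Python's sqlite3 module and invented `accomplices` rows. SQLite is only a stand-in here and is more lenient about aliases than PostgreSQL, but the wrapped form is portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accomplices (robberid INTEGER)")
# Hypothetical data: robber 14 has 7 rows, 23 has 6, 1 has 3, 5 has 2.
conn.executemany(
    "INSERT INTO accomplices VALUES (?)",
    [(14,)] * 7 + [(23,)] * 6 + [(1,)] * 3 + [(5,)] * 2,
)

# The CTE names the aggregate once; the outer query can then refer
# to count1 freely, including in a comparison against its maximum.
rows = conn.execute("""
WITH counts AS (
    SELECT robberid, COUNT(*) AS count1
    FROM accomplices
    GROUP BY robberid
)
SELECT robberid, count1
FROM counts
WHERE count1 < (SELECT MAX(count1) FROM counts) / 2.0
ORDER BY robberid
""").fetchall()

print(rows)
```

With a maximum of 7, only robbers whose count is below 3.5 remain.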