How do you determine the average total of a column in Postgresql? - sql

Consider the following Postgresql database table:
id | book_id | author_id
---------------------------
1 | 1 | 1
2 | 2 | 1
3 | 3 | 2
4 | 4 | 2
5 | 5 | 2
6 | 6 | 3
7 | 7 | 2
In this example, Author 1 has written 2 books, Author 2 has written 4 books, and Author 3 has written 1 book. How would I determine the average number of books written by an author using SQL? In other words, I'm trying to get, "An author has written an average of 2.3 books".
Thus far, attempts with AVG and COUNT have failed me. Any thoughts?

select avg(totalbooks) from
(select count(1) totalbooks from books group by author_id) bookcount
I think your example data actually only has 3 books for author id 2, so this would not return 2.3
http://sqlfiddle.com/#!15/3e36e/1
With the 4th book:
http://sqlfiddle.com/#!15/67eac/1

You'll need a subquery. The inner query will count the books with GROUP BY author; the outer query will scan the results of the inner query and avg them.
You can use a subquery in the FROM clause for this, or you can use a CTE (WITH expression).

For an average number of books per author you can do simply:
SELECT 1.0*COUNT(DISTINCT book_id)/count(DISTINCT author_id) FROM tbl;
For number of books per author:
SELECT 1.0*COUNT(DISTINCT book_id)/count(DISTINCT author_id)
FROM tbl GROUP BY author_id;
We need 1.0 factor to make the result not integer.
You can remove DISTINCT depending of result you want (it matters only if one book have many authors).
As Craig Ringer rightly pointed out 2 distincts may be expensive. For test performance I have generated 50 000 rows and I got followng results:
My query with 2 DISTINCTS: ~70ms
My query with 1 DISTINCT: ~40ms
Martin Booth's approach: ~30ms
Then added 1 milion rows and tested again:
My query with 2 DISTINCTS: ~1520ms
My query with 1 DISTINCT: ~820ms
Martin Booth's approach: ~1060ms
Then added another 9 milion rows and tested again:
My query with 2 DISTINCTS: ~17s
My query with 1 DISTINCT: ~11s
Martin Booth's approach: ~19s
So there is no universal solution.

This should work:
SELECT AVG(cnt) FROM (
SELECT COUNT(*) cnt FROM t
GROUP BY author_id
) s

Related

count different column values after grouping by

Consider this table:
id name department email
1 Alex IT blah#gmail.com
1 Alex IT blah#gmail.com
2 Jay HR jay#gmail.com
2 Jay Marketing zou#gmail.com
If I group byid,name and count I get:
id name count(*)
1 Alex 2
2 Jay 2
With this query:
select id,name,count(*) from tb group by id,name;
However I would like to count only records that diverge from department,email, so as to have:
id name count(*)
1 Alex 0
2 Jay 1
This time the count for the first group 1,Alex is 0 because department,email have the same values (duplicated) , on the other hand 2,Jay is one because department,email has one different value.
If you meant "two different values" for "Jay", you can use distinct:
select id,name,count(*) from (SELECT distinct * FROM tb) group by id,name;
You can use count(*) - 1 to get similar results in your question.

Identifying Records Where a String Appears More Than Once

I have a following dataset that looks like:
ID Medication Dose
1 Aspirin 4
1 Tylenol 7
1 Aspirin 2
1 Ibuprofen 1
2 Aspirin 6
2 Aspirin 2
2 Ibuprofen 6
2 Tylenol 4
3 Tylenol 3
3 Tylenol 7
3 Tylenol 2
I would like to develop a code that would identify patients who have been administered a medication more than once. So for example, ID 1 had Aspirin twice, ID 2 had Aspirin twice and ID 3 had Tylenol three times.
I could be wrong but I think the easiest way to do this would be to concatenate each ID based on Medication using a code similar to the one below; but I'm not quite sure what to do after that - is it possible to count if a string appears twice within a cell?
SELECT DISTINCT ST2.[ID],
SUBSTRING(
(
SELECT ','+ST1.Medication AS [text()]
FROM ED_NOTES_MASTER ST1
WHERE ST1.[ID] = ST2.[ID]
Order BY [ID]
FOR XML PATH ('')
), 1, 200000) [Result]
FROM ED_NOTES_MASTER ST2
I would like the output to look like the following:
ID MEDICATION Aspirin2x Tylenol2x Ibuprofen2x
1 Aspirin, Tylenol , Aspirin YES NO NO
2 Ibuprofen, Aspirin, Aspirin YES NO NO
3 Tylenol, Tylenol ,Tylenol NO YES NO
For the first part of your question (identify patients that have had a particular medication more than once), you can do this using GROUP BY to group by the ID and medication, and then using COUNT to get how many times each medication was given to each patient. For example:
SELECT ID, Medication, COUNT(*) AS amount
FROM ST2
GROUP BY ID, Medication
This will give you a list of all ID - Medication combinations that appear in the table and a count of how many times each combo appears. To limit these results down to just those that are greater than 2, you can add a condition to the COUNTed field using HAVING:
SELECT ID, Medication, COUNT(*) AS amount
FROM ST2
GROUP BY ID, Medication
HAVING amount >= 2
The problem now is formatting the results in the way you want. What you will get from the query above is a list of all patient - medication combinations that came up in the table more than once, like this:
ID | Medication | Count
------+---------------+-------
1 | Aspirin | 2
2 | Aspirin | 2
3 | Tylenol | 3
I'd suggest that you try and work with this format if possible, because as you have found, to get multiple values returned in a comma delimited list as you have in your Medication column you have to resort to some hacks to get it to work (although a recent version of SQL Server does implement some sort of proper group concatenation functionality.). If you really need the Aspirin2x etc. columns, take a look at the PIVOT operation in SQL Server.

SQL - Distribution Count

Hi I have the following table:
crm_id | customer_id
jon 12345
jon 12346
ben 12347
sam 12348
I would like to show the following:
Crm_ID count | Number of customer_ids
1 2
2 1
Basically I want to count the number crm_ids that have 1,2,3,4,5+ customer_ids.
Thanks
One approach is to aggregate twice. First, aggregate over crm_id and generate counts. Then, aggregate over those counts themselves and generate a count of counts.
SELECT
cnt AS crm_id_cnt,
COUNT(*) AS num_customer_ids
FROM
(
SELECT crm_id, COUNT(DISTINCT customer_id) AS cnt
FROM yourTable
GROUP BY crm_id
) t
GROUP BY cnt;
Have a look at a demo below, given in MySQL as you did not specify a particular database (though my answer should run on most databases I think).
Demo

SQL Sum count based on an identifier

Assume I have a table with the following data:
Name TransID Cost
---------------------------------------
Susan 1 10
Johnny 2 10
Johnny 3 9
Dave 4 10
I want to find a way to sum the Costs per name (assume the Names are unique) so that I get a table like this:
Name Cost
---------------------------------------
Susan 10
Johnny 19
Dave 10
Any help is appreciated.
This is relatively straightforward: you need to use a GROUP BY clause in your query:
SELECT Name,SUM(Cost)
FROM MyTable
GROUP BY Name

selecting top N rows for each group in a table

I am facing a very common issue regarding "Selecting top N rows for each group in a table".
Consider a table with id, name, hair_colour, score columns.
I want a resultset such that, for each hair colour, get me top 3 scorer names.
To solve this i got exactly what i need on Rick Osborne's blogpost "sql-getting-top-n-rows-for-a-grouped-query"
That solution doesn't work as expected when my scores are equal.
In above example the result as follow.
id name hair score ranknum
---------------------------------
12 Kit Blonde 10 1
9 Becca Blonde 9 2
8 Katie Blonde 8 3
3 Sarah Brunette 10 1
4 Deborah Brunette 9 2 - ------- - - > if
1 Kim Brunette 8 3
Consider the row 4 Deborah Brunette 9 2. If this also has same score (10) same as Sarah, then ranknum will be 2,2,3 for "Brunette" type of hair.
What's the solution to this?
If you're using SQL Server 2005 or newer, you can use the ranking functions and a CTE to achieve this:
;WITH HairColors AS
(SELECT id, name, hair, score,
ROW_NUMBER() OVER(PARTITION BY hair ORDER BY score DESC) as 'RowNum'
)
SELECT id, name, hair, score
FROM HairColors
WHERE RowNum <= 3
This CTE will "partition" your data by the value of the hair column, and each partition is then order by score (descending) and gets a row number; the highest score for each partition is 1, then 2 etc.
So if you want to the TOP 3 of each group, select only those rows from the CTE that have a RowNum of 3 or less (1, 2, 3) --> there you go!
The way the algorithm comes up with the rank, is to count the number of rows in the cross-product with a score equal to or greater than the girl in question, in order to generate rank. Hence in the problem case you're talking about, Sarah's grid would look like
a.name | a.score | b.name | b.score
-------+---------+---------+--------
Sarah | 9 | Sarah | 9
Sarah | 9 | Deborah | 9
and similarly for Deborah, which is why both girls get a rank of 2 here.
The problem is that when there's a tie, all girls take the lowest value in the tied range due to this count, when you'd want them to take the highest value instead. I think a simple change can fix this:
Instead of a greater-than-or-equal comparison, use a strict greater-than comparison to count the number of girls who are strictly better. Then, add one to that and you have your rank (which will deal with ties as appropriate). So the inner select would be:
SELECT a.id, COUNT(*) + 1 AS ranknum
FROM girl AS a
INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score < b.score)
GROUP BY a.id
HAVING COUNT(*) <= 3
Can anyone see any problems with this approach that have escaped my notice?
Use this compound select which handles OP problem properly
SELECT g.* FROM girls as g
WHERE g.score > IFNULL( (SELECT g2.score FROM girls as g2
WHERE g.hair=g2.hair ORDER BY g2.score DESC LIMIT 3,1), 0)
Note that you need to use IFNULL here to handle case when table girls has less rows for some type of hair then we want to see in sql answer (in OP case it is 3 items).