Identifying Records Where a String Appears More Than Once - sql

I have a following dataset that looks like:
ID Medication Dose
1 Aspirin 4
1 Tylenol 7
1 Aspirin 2
1 Ibuprofen 1
2 Aspirin 6
2 Aspirin 2
2 Ibuprofen 6
2 Tylenol 4
3 Tylenol 3
3 Tylenol 7
3 Tylenol 2
I would like to develop a code that would identify patients who have been administered a medication more than once. So for example, ID 1 had Aspirin twice, ID 2 had Aspirin twice and ID 3 had Tylenol three times.
I could be wrong but I think the easiest way to do this would be to concatenate each ID based on Medication using a code similar to the one below; but I'm not quite sure what to do after that - is it possible to count if a string appears twice within a cell?
SELECT DISTINCT ST2.[ID],
SUBSTRING(
(
SELECT ','+ST1.Medication AS [text()]
FROM ED_NOTES_MASTER ST1
WHERE ST1.[ID] = ST2.[ID]
Order BY [ID]
FOR XML PATH ('')
), 1, 200000) [Result]
FROM ED_NOTES_MASTER ST2
I would like the output to look like the following:
ID MEDICATION Aspirin2x Tylenol2x Ibuprofen2x
1 Aspirin, Tylenol , Aspirin YES NO NO
2 Ibuprofen, Aspirin, Aspirin YES NO NO
3 Tylenol, Tylenol ,Tylenol NO YES NO

For the first part of your question (identify patients that have had a particular medication more than once), you can do this using GROUP BY to group by the ID and medication, and then using COUNT to get how many times each medication was given to each patient. For example:
SELECT ID, Medication, COUNT(*) AS amount
FROM ST2
GROUP BY ID, Medication
This will give you a list of all ID - Medication combinations that appear in the table and a count of how many times each combo appears. To limit these results down to just those that are greater than 2, you can add a condition to the COUNTed field using HAVING:
SELECT ID, Medication, COUNT(*) AS amount
FROM ST2
GROUP BY ID, Medication
HAVING amount >= 2
The problem now is formatting the results in the way you want. What you will get from the query above is a list of all patient - medication combinations that came up in the table more than once, like this:
ID | Medication | Count
------+---------------+-------
1 | Aspirin | 2
2 | Aspirin | 2
3 | Tylenol | 3
I'd suggest that you try and work with this format if possible, because as you have found, to get multiple values returned in a comma delimited list as you have in your Medication column you have to resort to some hacks to get it to work (although a recent version of SQL Server does implement some sort of proper group concatenation functionality.). If you really need the Aspirin2x etc. columns, take a look at the PIVOT operation in SQL Server.

Related

count different column values after grouping by

Consider this table:
id name department email
1 Alex IT blah#gmail.com
1 Alex IT blah#gmail.com
2 Jay HR jay#gmail.com
2 Jay Marketing zou#gmail.com
If I group byid,name and count I get:
id name count(*)
1 Alex 2
2 Jay 2
With this query:
select id,name,count(*) from tb group by id,name;
However I would like to count only records that diverge from department,email, so as to have:
id name count(*)
1 Alex 0
2 Jay 1
This time the count for the first group 1,Alex is 0 because department,email have the same values (duplicated) , on the other hand 2,Jay is one because department,email has one different value.
If you meant "two different values" for "Jay", you can use distinct:
select id,name,count(*) from (SELECT distinct * FROM tb) group by id,name;
You can use count(*) - 1 to get similar results in your question.

Select records distinct in one column in Postgresql database

I got the following records where different people have the same name:
id name category_id birthdate family_id
1 joe 2 2014-05-01 1
2 jack 3 2013-04-01 2
3 joe 2 1964-05-01 1
4 jack 5 1982-05-01 2
5 emma 1 2014-05-01 1
6 joe 3 2003-07-06 3
Now I need a query which results to the following. I want only each name once per family_id. I need all values of each record afterwards including the id. In case the table gets further rows down the road I need them too. So the result should include all values.
id name category_id birthdate family_id
1 joe 2 2014-05-01 1
2 jack 3 2013-04-01 2
5 emma 1 2014-05-01 1
6 joe 3 2003-07-06 3
I tried it with several approaches (GROUP BY, DISTINCT, DISTINCT ON etc.) but none was working out for me.
When I use a GROUP BY clause (GROUP BY name) I get a ERROR: column "deals.id" must appear in the GROUP BY clause or be used in an aggregate function. But when I put id inside the clause I get all records back.
Same with distinct. There I have to choose all fields on which the result set should be distinct. But I need all values of the record. And because of the primary each record is distinct when i include the id.
I tried it with a sub clause which has filtered all distinct names. I used them in a where clause. But I got all values back including (of course) the not distinct name/family_id records.
Has anybody a helping hand for me?
You might not of specified all of the fields in your group by and if you included id, then that would make the rows unique.
Try something like:
SELECT
name, category_id, birthdate, family_id
FROM family
GROUP BY
name, category_id, birthdate, family_id;
It worked with DISTINCT ON.
The following worked quite well:
SELECT DISTINCT ON (table.name, table.family_id) table.* FROM table;
The only thing I have to check is the ordering. But I wanted to share my solution so far.

How to get the count of distinct values until a time period Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there way to easily do this in a single query?
Not 100% sure if will work on impala
But if you have a table days. Or if you have a way of create a derivated table on the fly on impala.
CREATE TABLE days ("DayC" int);
INSERT INTO days
("DayC")
VALUES (1), (2), (3);
OR
CREATE TABLE days AS
SELECT DISTINCT "Day"
FROM sales
You can use this query
SqlFiddleDemo in Postgresql
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
OUTPUT
| DayC | count |
|------|-------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
UPDATE VERSION
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN (SELECT DISTINCT "Day" as "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"
try this one:
select day, count(distinct(id)) from yourtable group by day

How do I create a frequency distribution?

I'm trying to create a frequency distribution to show how many customers have transacted 1x, 2x, 3x, etc.
I have a database transactions and column user_id. Each row indicates a transaction, and if a user_id shows up in multiple rows, that user has done multiple transactions.
Now I'd like to get a list that looks something like this:
Tra. | Freq.
0 | 345
1 | 543
2 | 45
3 | 20
4 | 0
5 | 3
etc
Currently I have this, but it just shows a list of users and how many transactions they have had.
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
ORDER BY number_of_transactions DESC;
I did some digging and was suggested that generate_series might help, but I'm stuck and don't know how to move forward.
Use the first result as input to an outer query where you apply the count again, but this time grouping on number_of_transactions:
SELECT number_of_transactions, COUNT(*) AS freq
FROM (
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
) A
GROUP BY number_of_transactions;
This would transform a result like:
user_id number_of_transactions
----------- ----------------------
1 2
2 1
3 2
4 4
to this:
number_of_transactions freq
---------------------- -----------
1 1
2 2
4 1

How do you determine the average total of a column in Postgresql?

Consider the following Postgresql database table:
id | book_id | author_id
---------------------------
1 | 1 | 1
2 | 2 | 1
3 | 3 | 2
4 | 4 | 2
5 | 5 | 2
6 | 6 | 3
7 | 7 | 2
In this example, Author 1 has written 2 books, Author 2 has written 4 books, and Author 3 has written 1 book. How would I determine the average number of books written by an author using SQL? In other words, I'm trying to get, "An author has written an average of 2.3 books".
Thus far, attempts with AVG and COUNT have failed me. Any thoughts?
select avg(totalbooks) from
(select count(1) totalbooks from books group by author_id) bookcount
I think your example data actually only has 3 books for author id 2, so this would not return 2.3
http://sqlfiddle.com/#!15/3e36e/1
With the 4th book:
http://sqlfiddle.com/#!15/67eac/1
You'll need a subquery. The inner query will count the books with GROUP BY author; the outer query will scan the results of the inner query and avg them.
You can use a subquery in the FROM clause for this, or you can use a CTE (WITH expression).
For an average number of books per author you can do simply:
SELECT 1.0*COUNT(DISTINCT book_id)/count(DISTINCT author_id) FROM tbl;
For number of books per author:
SELECT 1.0*COUNT(DISTINCT book_id)/count(DISTINCT author_id)
FROM tbl GROUP BY author_id;
We need 1.0 factor to make the result not integer.
You can remove DISTINCT depending of result you want (it matters only if one book have many authors).
As Craig Ringer rightly pointed out 2 distincts may be expensive. For test performance I have generated 50 000 rows and I got followng results:
My query with 2 DISTINCTS: ~70ms
My query with 1 DISTINCT: ~40ms
Martin Booth's approach: ~30ms
Then added 1 milion rows and tested again:
My query with 2 DISTINCTS: ~1520ms
My query with 1 DISTINCT: ~820ms
Martin Booth's approach: ~1060ms
Then added another 9 milion rows and tested again:
My query with 2 DISTINCTS: ~17s
My query with 1 DISTINCT: ~11s
Martin Booth's approach: ~19s
So there is no universal solution.
This should work:
SELECT AVG(cnt) FROM (
SELECT COUNT(*) cnt FROM t
GROUP BY author_id
) s