Using COUNT and GROUP BY in Spark SQL - sql

I'm trying to get pretty basic output that pulls unique NDC Codes for medications and counts the number of unique patients that take each drug. My dataset basically looks like this:
patient_id | drug_ndc
---------------------
01 | 250
02 | 725
03 | 1075
04 | 1075
05 | 250
06 | 250
I want the output to look something like this:
NDC | Patients
--------------
250 | 3
1075 | 2
725 | 1
I tried using some queries like this:
select distinct drug_ndc as NDC, count patient_id as Patients
from table 1
group by 1
order by 1
But I keep getting errors. I've tried with and without using an alias, but to no avail.

The correct syntax should be:
select drug_ndc as NDC, count(*) as Patients
from table 1
group by drug_ndc
order by 1;
SELECT DISTINCT is almost never appropriate with GROUP BY. And you can use COUNT(*) unless the patient id can be NULL.

To get the number of unique patients, you should do:
select drug_ndc as NDC, count(distinct patient_id) as Patients
from table 1
group by drug_ndc;
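The difference between the two counts only shows up when a patient has more than one row for the same drug. Here is a minimal sketch of that, using Python's built-in sqlite3 module as a stand-in for Spark SQL (this part of the syntax is the same in both). The table name prescriptions and the duplicate row for patient 06 are assumptions added for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prescriptions (patient_id TEXT, drug_ndc INTEGER)")
rows = [
    ("01", 250), ("02", 725), ("03", 1075), ("04", 1075),
    ("05", 250), ("06", 250),
    ("06", 250),  # assumed duplicate: patient 06 has a second row for drug 250
]
conn.executemany("INSERT INTO prescriptions VALUES (?, ?)", rows)


def counts(count_expr):
    """Run the GROUP BY query with the given aggregate expression."""
    sql = f"""
        SELECT drug_ndc AS NDC, {count_expr} AS Patients
        FROM prescriptions
        GROUP BY drug_ndc
        ORDER BY Patients DESC, NDC
    """
    return conn.execute(sql).fetchall()


row_counts = counts("COUNT(*)")                        # counts rows
patient_counts = counts("COUNT(DISTINCT patient_id)")  # counts unique patients
print(row_counts)      # [(250, 4), (1075, 2), (725, 1)]
print(patient_counts)  # [(250, 3), (1075, 2), (725, 1)]
```

With the original six rows (no duplicate), both expressions return the same result.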

Related

In PostgreSQL, in a table with multiple rows per unique ID, can you select one row by a condition per unique ID and other values?

Specifically, I have a table of test data in which I am trying to pick one row of data per student and session (i.e. fall, winter, spring). The problem is that there are some students who re-took the test within the same session, and I would like my query to handle these occurrences.
So let's say a student (with studentid = 12345) took a test twice in the fall: once on September 23rd with a score of 85/100, and then again on October 3rd with a 75/100. I want to know two different queries, one to handle each of the following:
Return the row of their most recent test (i.e. the test from Oct 3rd)
Return the row of their highest scoring test (i.e. the test from Sept 23rd)
Here is an example of a table similar to the one I am working with:
| studentid | session | testdate | score | schoolyear |
-----------------------------------------------------------
| ...
| 42532 | Fall | '2020-10-01' | 68 | '2020-2021'
| 42532 | Winter | '2021-02-02' | 70 | '2020-2021'
| 12345 | Fall | '2020-09-23' | 85 | '2020-2021' <--- (this student has two records for the fall)
| 12345 | Fall | '2020-10-03' | 75 | '2020-2021' <---
| 12345 | Winter | '2021-01-10' | 79 | '2020-2021'
| 83456 | Fall | '2020-09-08' | 90 | '2019-2020'
| 83456 | Winter | '2021-01-18' | 83 | '2019-2020'
| ...
So I want to run a query similar to the following:
SELECT studentid, session, testdate, score
FROM exam_result
WHERE schoolyear = '2020-2021'
-- (something to filter out the multiples)
Where it returns 1 row per student AND session, for all students
Any help would be greatly appreciated!
If you want just one row, use fetch or limit:
select er.*
from exam_result er
where er.studentid = 12345
order by testdate desc
limit 1;
Just adjust the order by for the row you want.
For one row per student and session, you would use distinct on:
select distinct on (er.studentid, er.session) er.*
from exam_result er
where . . . -- whatever other conditions you have
order by er.studentid, er.session, er.testdate desc
Which score do you prefer to have: best, worst, or something else? This query gives you the best score. I dropped testdate because it isn't unique per student and session. If you need the test date, it makes the query a bit more complicated. And which date do you want if a student gets the same score twice?
SELECT studentid, session, MAX(score)
FROM exam_result
WHERE schoolyear = '2020-2021'
GROUP BY studentid, session
If you need the exam date, this kind of query gives you the first exam date on which a student got their highest score. This isn't tested, but you get the idea. Make a separate subquery that gets the min date for each student/session/score combination and join it to the best-score query.
SELECT
a.studentid,
a.session,
b.testdate,
a.score
FROM
(SELECT
studentid, session, MAX(score) AS score
FROM exam_result
WHERE schoolyear = '2020-2021'
GROUP BY studentid, session) a JOIN
(SELECT
studentid, session, score, MIN(testdate) AS testdate
FROM exam_result
WHERE schoolyear = '2020-2021'
GROUP BY studentid, session, score) b
ON a.studentid = b.studentid AND a.session = b.session AND a.score = b.score
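Both variants the question asks for (most recent test, highest-scoring test) can also be written with ROW_NUMBER(), changing only the ordering inside the window. This is a different technique from DISTINCT ON. The sketch below runs it on the sample rows through Python's sqlite3 module as a stand-in for PostgreSQL, assuming the bundled SQLite is 3.25 or newer (needed for window functions). The helper name one_row_per_session is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exam_result "
    "(studentid INTEGER, session TEXT, testdate TEXT, score INTEGER, schoolyear TEXT)"
)
conn.executemany("INSERT INTO exam_result VALUES (?, ?, ?, ?, ?)", [
    (42532, "Fall", "2020-10-01", 68, "2020-2021"),
    (42532, "Winter", "2021-02-02", 70, "2020-2021"),
    (12345, "Fall", "2020-09-23", 85, "2020-2021"),
    (12345, "Fall", "2020-10-03", 75, "2020-2021"),
    (12345, "Winter", "2021-01-10", 79, "2020-2021"),
    (83456, "Fall", "2020-09-08", 90, "2019-2020"),
    (83456, "Winter", "2021-01-18", 83, "2019-2020"),
])


def one_row_per_session(order_by):
    """Keep the first row of each (studentid, session) group under order_by."""
    sql = f"""
        SELECT studentid, session, testdate, score
        FROM (
            SELECT er.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY studentid, session
                       ORDER BY {order_by}) AS rn
            FROM exam_result er
            WHERE schoolyear = '2020-2021'
        ) AS ranked
        WHERE rn = 1
        ORDER BY studentid, session
    """
    return conn.execute(sql).fetchall()


most_recent = one_row_per_session("testdate DESC")
# Ties on score are broken by the earlier test date.
highest = one_row_per_session("score DESC, testdate")
print(most_recent[0])  # (12345, 'Fall', '2020-10-03', 75)
print(highest[0])      # (12345, 'Fall', '2020-09-23', 85)
```

The same inner query should carry over to PostgreSQL unchanged. There, DISTINCT ON is usually the shorter way to write it.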

SQL Finding sum of rows and returning count of keys

For a database table looking something like this:
id | year | stint | sv
----+------+-------+---
mk1 | 2001 | 1 | 30
mk1 | 2001 | 2 | 20
ml0 | 1999 | 1 | 43
ml0 | 2000 | 1 | 44
hj2 | 1993 | 1 | 70
I want to get the following output:
count
-------
3
The condition is: count the ids that have sv > 40 in a single year after 1994. If there is more than one stint for the same year, add the sv points and check whether the sum is > 40.
This is what I have written so far but it is obviously not right:
SELECT COUNT(DISTINCT id),
SUM(sv) as SV
FROM public.pitching
WHERE (year > 1994 AND sv >40);
I know the syntax is completely wrong and some of the conditions' information is missing but I'm not familiar enough with SQL and don't know how to properly do the summing of two rows in the same table with a condition (maybe with a subquery?). Any help would be appreciated! (using postgres)
You could use a nested query to get the aggregations, and wrap that for getting the count. Note that the condition on the sum must be in a having clause:
SELECT COUNT(id)
FROM (
SELECT id,
year,
SUM(sv) as SV
FROM public.pitching
WHERE year > 1994
GROUP BY id,
year
HAVING SUM(sv) > 40 ) years
If an id should only count once even if it fulfils the condition in more than one year, then do COUNT(distinct id) instead of COUNT(id)
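To see the two counts side by side on the sample rows, here is a small sketch that runs the nested query through Python's sqlite3 module as a stand-in for Postgres. The table is named pitching without the public schema, since SQLite has no such schema, and the helper name count_qualifying is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pitching (id TEXT, year INTEGER, stint INTEGER, sv INTEGER)")
conn.executemany("INSERT INTO pitching VALUES (?, ?, ?, ?)", [
    ("mk1", 2001, 1, 30),
    ("mk1", 2001, 2, 20),
    ("ml0", 1999, 1, 43),
    ("ml0", 2000, 1, 44),
    ("hj2", 1993, 1, 70),
])


def count_qualifying(count_expr):
    """Apply count_expr over the (id, year) groups whose summed sv exceeds 40."""
    sql = f"""
        SELECT {count_expr}
        FROM (
            SELECT id, year, SUM(sv) AS sv
            FROM pitching
            WHERE year > 1994
            GROUP BY id, year
            HAVING SUM(sv) > 40
        ) AS years
    """
    return conn.execute(sql).fetchone()[0]


id_years = count_qualifying("COUNT(id)")              # one per qualifying (id, year)
distinct_ids = count_qualifying("COUNT(DISTINCT id)") # each id at most once
print(id_years, distinct_ids)  # 3 2
```

The expected output of 3 in the question matches COUNT(id): mk1 qualifies once (30 + 20 = 50 in 2001) and ml0 qualifies in both 1999 and 2000.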
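You can also try the following, using SUM as a window function partitioned by id and year. Because the window keeps one row per stint, count the distinct (id, year) pairs:
select count(distinct (id, year)) from
(
select id, year, sum(sv) over (partition by id, year) s
from public.pitching
where year > 1994
) t where s>40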

Calculate time span over a number of records

I have a table that has the following schema:
ID | FirstName | Surname | TransmissionID | CaptureDateTime
1 | Billy | Goat | ABCDEF | 2018-09-20 13:45:01.098
2 | Jonny | Cash | ABCDEF | 2018-09-20 13:45:01.108
3 | Sally | Sue | ABCDEF | 2018-09-20 13:45:01.298
4 | Jermaine | Cole | PQRSTU | 2018-09-20 13:45:01.398
5 | Mike | Smith | PQRSTU | 2018-09-20 13:45:01.498
There are well over 70,000 records and they store logs of transmissions to a web-service. What I'd like to know is how would I go about writing a script that would select the distinct TransmissionID values and also show the timespan between the earliest CaptureDateTime record and the latest record? Essentially I'd like to see what the rate of records the web-service is reading & writing.
Is it even possible to do so in a single SELECT statement or should I just create a stored procedure or report in code? I don't know where to start aside from SELECT DISTINCT TransmissionID for this sort of query.
Here's what I have so far (I'm stuck on the time calculation)
SELECT DISTINCT [TransmissionID],
COUNT(*) as 'Number of records'
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
I'm not sure how to get the difference between the first and last record with the same TransmissionID. I would like to get a result set like:
TransmissionID | TimeToCompletion | Number of records |
ABCDEF | 2.001 | 5000 |
Simply GROUP BY and use MIN / MAX function to find min/max date in each group and subtract them:
SELECT
TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime))
FROM yourdata
GROUP BY TransmissionID
HAVING COUNT(*) > 1
Use min and max to calculate timespan
SELECT [TransmissionID],
COUNT(*) as 'Number of records',datediff(s,min(CaptureDateTime),max(CaptureDateTime)) as timespan
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
A method that returns the average time for all transmissionids, even those with only 1 record:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime)) * 1.0 / NULLIF(COUNT(*) - 1, 0)
FROM yourdata
GROUP BY TransmissionID;
Note that you may not actually want the maximum of the capture date for a given transmissionId. You might want the overall maximum in the table -- so you can consider the final period after the most recent record.
If so, this looks like:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second,
MIN(CaptureDateTime),
MAX(MAX(CaptureDateTime)) OVER ()
) * 1.0 / COUNT(*)
FROM yourdata
GROUP BY TransmissionID;
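One thing to keep in mind with the sample data: all timestamps fall within the same second, and on SQL Server DATEDIFF(second, ...) counts whole second boundaries crossed, so it would return 0 there. Using DATEDIFF(millisecond, ...) / 1000.0 is one way to get fractional seconds. The sketch below shows the same MIN / MAX idea on the sample rows, with Python's sqlite3 module as a stand-in. SQLite has no DATEDIFF, so julianday() differences scaled to seconds are used instead; that substitution is an assumption of this example, not part of the answers above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_table (ID INTEGER, TransmissionID TEXT, CaptureDateTime TEXT)"
)
conn.executemany("INSERT INTO log_table VALUES (?, ?, ?)", [
    (1, "ABCDEF", "2018-09-20 13:45:01.098"),
    (2, "ABCDEF", "2018-09-20 13:45:01.108"),
    (3, "ABCDEF", "2018-09-20 13:45:01.298"),
    (4, "PQRSTU", "2018-09-20 13:45:01.398"),
    (5, "PQRSTU", "2018-09-20 13:45:01.498"),
])

# julianday() returns fractional days, so multiply by 86400 to get seconds.
sql = """
    SELECT TransmissionID,
           COUNT(*) AS records,
           (julianday(MAX(CaptureDateTime)) - julianday(MIN(CaptureDateTime)))
               * 86400.0 AS seconds
    FROM log_table
    GROUP BY TransmissionID
    HAVING COUNT(*) > 1
    ORDER BY TransmissionID
"""
# Round to milliseconds; the floating-point day arithmetic is not exact.
spans = [(tid, n, round(sec, 3)) for tid, n, sec in conn.execute(sql)]
print(spans)
```

Dividing the record count by the span then gives the records-per-second rate the question is after.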

Counting distinct stores SQL

I am fairly new to SQL and was wondering if anyone could help with my code.
I am trying to count the distinct number of stores that are tied to a certain Warehouse which is tied to a purchase order.
Example: If there are 100 stores with this PO that came from Warehouse #2 or #5 or etc... then I would like:
| COUNT_STORE | WH_LOCATION |
1 | 100 | 2 |
2 | 25 | 5 |
3 | 56 | 1 |
My Code:
select count(distinct Store_ID) as Count_Store, WH_Location
from alc_Loc
where alloc_PO = 11345
group by Store_ID, WH_Location
When I run this I get a 1 for "count_store" and it shows me the WH_Location multiple times. I feel as if something is not tying in correctly.
Any help is appreciated!
Just remove store_id from the group by:
select count(distinct Store_ID) as Count_Store, WH_Location
from alc_Loc
where alloc_PO = 11345
group by WH_Location;
When you include Store_ID in the group by, you are getting a separate row for each Store_ID. The distinct count is then obviously 1 (or 0 if the store id is NULL).
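The effect of the extra grouping column is easy to reproduce on a few rows. This is a minimal sketch using Python's sqlite3 module; the rows are invented for illustration, and only the table and column names come from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alc_Loc (Store_ID INTEGER, WH_Location INTEGER, alloc_PO INTEGER)")
conn.executemany("INSERT INTO alc_Loc VALUES (?, ?, ?)", [
    (1, 2, 11345), (1, 2, 11345),  # store 1 appears twice for warehouse 2
    (2, 2, 11345), (3, 2, 11345),
    (4, 5, 11345), (5, 5, 11345),
    (6, 5, 99999),                 # different PO, filtered out
])


def store_counts(group_by):
    """Count distinct stores for PO 11345 with the given GROUP BY columns."""
    sql = f"""
        SELECT COUNT(DISTINCT Store_ID) AS Count_Store, WH_Location
        FROM alc_Loc
        WHERE alloc_PO = 11345
        GROUP BY {group_by}
        ORDER BY WH_Location, Count_Store
    """
    return conn.execute(sql).fetchall()


per_store = store_counts("Store_ID, WH_Location")  # the question's grouping
per_warehouse = store_counts("WH_Location")        # the corrected grouping
print(per_store)      # [(1, 2), (1, 2), (1, 2), (1, 5), (1, 5)]
print(per_warehouse)  # [(3, 2), (2, 5)]
```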

How can I get the MAX COUNT for multiple users?

I'm sorry if this happens to be a re-post however looking through all of the previous questions I could find with similar wording I have not been able to find a working answer.
I have a trainingHistory table that has a record for every new training. The training can be done by multiple trainers. Clients can have multiple trainers.
What I am trying to accomplish is to COUNT the number of clients that were last trained by each trainer.
Example:
clientID | trainDate | trainerID
101 | 2012-03-13 10:58:11| 10
101 | 2012-03-12 10:58:11| 11
102 | 2012-03-15 10:58:11| 10
102 | 2012-03-09 10:58:11| 12
103 | 2012-03-08 10:58:11| 7
So the end result I am looking for would be:
Results
trainerID | count
10 | 2
7 | 1
I've tried quite a few different queries and looked over quite a few answers, including this one here Using sub-queries in SQL to find max(count()) but have so far been unable to get the desired result.
What I keep getting is like this:
Results
trainerID | count
10 | 5
7 | 5
How can I get an accurate count per trainer as opposed to an overall total?
The closest I've gotten is this:
SELECT t.trainerName,
t.trainerID,
(
SELECT COUNT(lastTrainerCount)
FROM (
SELECT MAX(th.clientID) AS lastTrainerCount
FROM trainingHistory th
GROUP BY th.clientID
) AS lastTrainerCount
)
FROM trainers t
INNER JOIN trainingHistory th ON (th.trainerID = t.trainerID)
WHERE th.trainingDate BETWEEN '12/14/14' AND '02/07/15'
GROUP BY t.trainerName, t.trainerID
Which results in:
Results
trainerID | count
10 | 1072
7 | 1072
Using SQL Server 2012
Appreciate any help you can provide.
First find the max trainDate per clientID in sub-select. Then count the trainerID in outer query. Try this.
select trainerID,count(trainerID) [Count]
From
(
select clientID,trainDate,trainerID,
row_number()over(partition by clientID order by trainDate Desc) Rn
From yourtable
) A
where Rn=1
Group by trainerID
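As a quick check against the sample rows from the question, the same ROW_NUMBER() approach can be run through Python's sqlite3 module. SQLite stands in for SQL Server 2012 here, and SQLite 3.25 or newer is assumed for window function support:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trainingHistory (clientID INTEGER, trainDate TEXT, trainerID INTEGER)")
conn.executemany("INSERT INTO trainingHistory VALUES (?, ?, ?)", [
    (101, "2012-03-13 10:58:11", 10),
    (101, "2012-03-12 10:58:11", 11),
    (102, "2012-03-15 10:58:11", 10),
    (102, "2012-03-09 10:58:11", 12),
    (103, "2012-03-08 10:58:11", 7),
])

# rn = 1 marks the most recent training row for each client.
sql = """
    SELECT trainerID, COUNT(*) AS clients
    FROM (
        SELECT clientID, trainerID,
               ROW_NUMBER() OVER (
                   PARTITION BY clientID ORDER BY trainDate DESC) AS rn
        FROM trainingHistory
    ) AS latest
    WHERE rn = 1
    GROUP BY trainerID
    ORDER BY clients DESC, trainerID
"""
last_trained = conn.execute(sql).fetchall()
print(last_trained)  # [(10, 2), (7, 1)]
```

Trainers 11 and 12 drop out because they were never the most recent trainer for any client.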