distinct count with group by - sql

I have already searched SO but found no answer to my question. If I run the query below, I get the correct count, which is 90:
select count(distinct account_id)
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
-- 90
But when I group by CalendarYear as below, I seem to be missing 12 counts. The query and its output are below:
select CalendarYear,count(distinct account_id) as accountCount
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
group by CalendarYear
output:
CalendarYear accountCount
2005 10
2006 26
2007 49
2008 63
2009 65
2010 78
I am not sure why I am missing 12 counts. To debug, I ran the following query to check whether any date_id in FactCustomerAccount is missing from DimDate, but found no missing keys:
select distinct f.date_id from FactCustomerAccount f
where f.date_id not in
(select DateKey from dimdate d)
I am using SQL Server 2008 R2.
Can anyone please suggest what could be the reason for missing 12 counts?
Thanks in advance.
EDIT ONE:
I did not quite understand the reason given in the 2 replies, so I would like to add 2 queries below against AdventureWorksDW2008R2 where no count is missing:
select count (distinct EmployeeKey)
from FactSalesQuota f
join dimdate d on f.DateKey = d.DateKey
-- out: 17
select d.CalendarYear, count (distinct EmployeeKey) as Employecount
from FactSalesQuota f
join dimdate d on f.DateKey = d.DateKey
group by d.CalendarYear
-- out:
-- CalendarYear Employecount
-- 2005 10
-- 2006 14
-- 2007 17
-- 2008 17
So please correct me: what am I missing?

Your queries are very different:
The first:
select count(distinct account_id)
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
It returns a count of distinct accounts over all years, so an account_id that is present in two years is counted only once.
The second:
It is grouped by CalendarYear, so an account_id that is present in two different years is counted in two different rows.
select CalendarYear,count(distinct account_id) as accountCount
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey
group by CalendarYear
EDIT
Let me try to explain better.
Suppose this data set of ordered pairs (year, account_id):
`2008 10`
`2009 10`
`2010 10`
`2010 12`
If you run the two queries above, you get:
`2`
and
`2008 1`
`2009 1`
`2010 2`
because there are two distinct account_id values (10 and 12), and only the last year (2010) has rows for both of them.
But if you have this data set:
`2008 10`
`2009 10`
`2009 12`
`2010 12`
You'll have:
First query result:
2
Second query result:
2008 1
2009 2
2010 1
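The second data set can be checked with a minimal sketch. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for SQL Server, and the table name `fact` is hypothetical:

```python
import sqlite3

# Hypothetical table `fact` holding the (year, account_id) pairs from the second data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (calendar_year INTEGER, account_id INTEGER)")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?)",
    [(2008, 10), (2009, 10), (2009, 12), (2010, 12)],
)

# Distinct accounts over all years: each account is counted once.
overall = conn.execute("SELECT COUNT(DISTINCT account_id) FROM fact").fetchone()[0]

# Distinct accounts per year: an account active in two years appears in both rows.
per_year = conn.execute(
    "SELECT calendar_year, COUNT(DISTINCT account_id) "
    "FROM fact GROUP BY calendar_year ORDER BY calendar_year"
).fetchall()

print(overall)   # 2
print(per_year)  # [(2008, 1), (2009, 2), (2010, 1)]
```

The per-year counts neither have to add up to the overall count nor does any single year have to reach it.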

You aren't missing 12. It could be that some accounts didn't have activities in the final years.

To analyze this, I would check the number of rows and the CalendarYear column: are there any rows with NULL in CalendarYear? You could also try ROW_NUMBER to see which (CalendarYear, account_id) pairs repeat, though I am not sure it will help:
select *,
ROW_NUMBER() over (partition by CalendarYear, account_id order by CalendarYear) as rn
from FactCustomerAccount f
join DimDate d on f.date_id = d.datekey

Related

how to sql query for patients between dates

I'd like to query for patients who received their first diagnosis of x between 2019 and the present, excluding patients who received a diagnosis of x prior to 2019.
When I use the query below, I get the same number of patients with or without the condition: AND d.[DOS] !< '2019'
Can someone help?
Thanks!
SELECT [id]
,[DiagnosisCD]
,[DOS]
FROM [diags] d
WHERE [DiagnosisCD] IN ('H91.2', 'H91.20', 'H91.21', 'H91.22', 'H91.23')
AND d.[DOS] >= '2019'
AND d.[DOS] !< '2019'
You can use aggregation:
SELECT [id]
FROM [diags] d
WHERE [DiagnosisCD] IN ('H91.2', 'H91.20', 'H91.21', 'H91.22', 'H91.23')
GROUP BY id
HAVING MIN(DOS) >= 2019
If DOS is really a date then use:
HAVING MIN(DOS) >= '2019-01-01'
If you want all rows related to these diagnoses -- even if there is more than one per patient -- then you can use exists:
SELECT d.*
FROM [diags] d
WHERE d.DiagnosisCD IN ('H91.2', 'H91.20', 'H91.21', 'H91.22', 'H91.23') AND
NOT EXISTS (SELECT 1
FROM diags d2
WHERE d2.id = d.id AND
d2.DiagnosisCD IN ('H91.2', 'H91.20', 'H91.21', 'H91.22', 'H91.23') AND
d2.dos < 2019
);
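A minimal sketch of the aggregation approach, assuming SQLite via Python's sqlite3 module instead of SQL Server, DOS stored as ISO date text, and made-up sample rows (the `J10` code is only there as an unrelated diagnosis):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diags (id INTEGER, DiagnosisCD TEXT, DOS TEXT)")
conn.executemany(
    "INSERT INTO diags VALUES (?, ?, ?)",
    [
        (1, "H91.20", "2018-06-01"),  # diagnosed before 2019 -> patient 1 is excluded
        (1, "H91.20", "2019-03-01"),
        (2, "H91.21", "2019-05-10"),  # first diagnosis in 2019 -> kept
        (2, "H91.21", "2020-01-15"),
        (3, "J10", "2017-01-01"),     # unrelated diagnosis, ignored by the WHERE clause
        (3, "H91.2", "2021-02-02"),   # first diagnosis of x in 2021 -> kept
    ],
)

# Keep a patient only if the earliest matching diagnosis is on or after 2019-01-01.
first_since_2019 = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM diags "
        "WHERE DiagnosisCD IN ('H91.2', 'H91.20', 'H91.21', 'H91.22', 'H91.23') "
        "GROUP BY id HAVING MIN(DOS) >= '2019-01-01' ORDER BY id"
    )
]
print(first_since_2019)  # [2, 3]
```

A row-level condition such as `DOS >= '2019'` cannot do this, because it only drops the old rows and leaves the patient's later rows in the result.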

Having clause being ignored

In this query, I am attempting to get a count that gives me a count of patients for each practice under given conditions.
The issue is that I have to show patients who have had >=3 office visits in the past year.
Count(D.PID)
in the select list is ignoring
HAVING count(admitdatetime)>=3
Here is my query
select distinct D.PracticeAbbrevName, D.ProviderLastName, count(D.pid) AS Count
from PersonDetail AS D
left join Visit AS V on D.PID = V.PID
where D.A1C >=7.5 and V.admitdatetime >= (getdate()-365) and D.A1CDays <180 and D.Diabetes = 1
group by D.PracticeAbbrevName, D.ProviderLastName
having count(admitdatetime)>=3
order by PracticeAbbrevName
If I get rid of the count function for D.pid, and just display each PID individually, my having phrase works properly.
There is something about count and having that does not work properly together.
Revised answer:
SELECT DISTINCT
D.PracticeAbbrevName,
D.ProviderLastName,
COUNT(D.pid) AS PIDCount,
COUNT(admitdatetime) AS AdmitCount
FROM
PersonDetail AS D
LEFT JOIN Visit AS V
ON D.PID = V.PID
WHERE
D.A1C >= 7.5
AND V.admitdatetime >= ( GETDATE() - 365 )
AND D.A1CDays < 180
AND D.Diabetes = 1
GROUP BY
D.PracticeAbbrevName,
D.ProviderLastName
HAVING
COUNT(admitdatetime) >= 3
ORDER BY
PracticeAbbrevName
You're trying to do too much at once. Split the logic in 2 steps:
Query grouping by PID to filter out patients that don't meet your criteria.
Query grouping by practice to get a patient count.
Your query would look like this:
;with EligiblePatients as (
select d.pid,
d.PracticeAbbrevName,
d.ProviderLastName
from PersonDetail d
left join Visit v
on v.pid = d.pid
and v.admitdatetime >= (getdate()-365)
where d.A1C >= 7.5
and d.A1CDays < 180
and d.Diabetes = 1
group by d.pid,
d.PracticeAbbrevName,
d.ProviderLastName
having count(v.pid) >= 3
)
select PracticeAbbrevName,
ProviderLastName,
COUNT(*) as PatientCount
from EligiblePatients
group by PracticeAbbrevName,
ProviderLastName
order by PracticeAbbrevName
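To see why the two steps matter, here is a small sketch. It assumes SQLite via Python's sqlite3 module as a stand-in for SQL Server, a fixed cutoff date instead of getdate()-365, simplified tables without the A1C columns, and invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PersonDetail (pid INTEGER, PracticeAbbrevName TEXT);
CREATE TABLE Visit (pid INTEGER, admitdatetime TEXT);
INSERT INTO PersonDetail VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'B'), (5, 'B');
""")
# Recent visits per patient: 1 -> 3, 2 -> 1, 3 -> 4, 4 -> 3, 5 -> none.
recent = {1: 3, 2: 1, 3: 4, 4: 3}
conn.executemany(
    "INSERT INTO Visit VALUES (?, '2013-03-01')",
    [(pid,) for pid, n in recent.items() for _ in range(n)],
)
# Two old visits for patient 2 that fall outside the one-year window.
conn.executemany("INSERT INTO Visit VALUES (?, '2011-03-01')", [(2,), (2,)])

join_sql = (
    "FROM PersonDetail d LEFT JOIN Visit v "
    "ON v.pid = d.pid AND v.admitdatetime >= '2013-01-01' "
)

# One step: HAVING tests the visits of the whole practice, and COUNT counts joined rows.
one_step = conn.execute(
    "SELECT d.PracticeAbbrevName, COUNT(d.pid) " + join_sql +
    "GROUP BY d.PracticeAbbrevName HAVING COUNT(v.pid) >= 3 ORDER BY 1"
).fetchall()

# Two steps: filter per patient first, then count the surviving patients per practice.
two_step = conn.execute(
    "WITH EligiblePatients AS ("
    "  SELECT d.pid, d.PracticeAbbrevName " + join_sql +
    "  GROUP BY d.pid, d.PracticeAbbrevName HAVING COUNT(v.pid) >= 3) "
    "SELECT PracticeAbbrevName, COUNT(*) FROM EligiblePatients "
    "GROUP BY PracticeAbbrevName ORDER BY PracticeAbbrevName"
).fetchall()

print(one_step)  # [('A', 4), ('B', 8)]  joined rows, not patients
print(two_step)  # [('A', 1), ('B', 2)]  patients with at least 3 recent visits
```

The HAVING clause is not ignored in the one-step query; it is simply applied to the practice-level group, which is a different question from "does this patient have 3 visits".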

Aggregate totals and create data for weeks that no data exists

I have a table like such:
Region Date Cases
France 1-1-2014 5
Spain 2-5-2014 6
France 3-5-2014 7
...
What I would like to do is run an aggregated function like so, to group the total number of cases in weeks for each region.
select region, datepart(week, date) weeknbr, sum(cases) cases
from <table>
group by region, datepart(week, date)
order by region, datepart(week, date)
Using this aggregated function, is there a way to insert a zero value for each region when data does not exist for that week?
so the final result would look like:
region weeknbr cases
France 1 5
France 2 0
France 3 0
.....
Spain 1 0
Spain 2 0
Spain 3 0
....
Spain 8 6
I have tried creating a table of week numbers and joining it to my data, but without success: the result has a null or zero value for the region and cases. I can always use the isnull function to make the cases 0, but I need a row for each region for each week, and that is what is blocking me right now. Is this possible? If not, where should I start looking, and how should I modify the underlying tables?
Any help would be greatly appreciated. Thank you
If I understand your meaning correctly, you could always generate artificial rows, cross join on grouped regions for completeness of your 0's, then left join your aggregate table on region and week. So:
select r.region, w.RowID as weeknbr, isnull(c.cases, 0) as cases
from (
-- numbers table: one row per candidate week number
select row_number() over (order by name) as RowID
from master..spt_values
) w
cross join (
select region
from <table>
group by region
) r
left join (
select region, datepart(week, date) weeknbr, sum(cases) cases
from <table>
group by region, datepart(week, date)
) c on w.RowID = c.weeknbr and r.region = c.region
where w.RowID <= 53 -- keep only valid week numbers
order by r.region, w.RowID
You need a date_list table and a region_list table. Cross join the dimension tables to get all date-region combinations and then left join against your fact table.
SELECT
d.date,
r.region,
t.cases
FROM date_list d
CROSS JOIN region_list r
LEFT JOIN date_region t ON d.date = t.date AND r.region = t.region
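The cross-join-then-left-join pattern can be sketched as follows. This assumes SQLite via Python's sqlite3 module rather than SQL Server, a hypothetical pre-aggregated table `weekly_cases`, and a recursive CTE standing in for the week-number table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with cases already assigned to a week number.
conn.execute("CREATE TABLE weekly_cases (region TEXT, weeknbr INTEGER, cases INTEGER)")
conn.executemany(
    "INSERT INTO weekly_cases VALUES (?, ?, ?)",
    [("France", 1, 5), ("Spain", 3, 6), ("France", 3, 7)],
)

rows = conn.execute("""
WITH RECURSIVE weeks(weeknbr) AS (
    SELECT 1
    UNION ALL
    SELECT weeknbr + 1 FROM weeks WHERE weeknbr < 3   -- weeks 1..3 for the demo
)
SELECT r.region, w.weeknbr, COALESCE(SUM(c.cases), 0) AS cases
FROM weeks w
CROSS JOIN (SELECT DISTINCT region FROM weekly_cases) r   -- every region/week pair
LEFT JOIN weekly_cases c
       ON c.region = r.region AND c.weeknbr = w.weeknbr   -- attach data where it exists
GROUP BY r.region, w.weeknbr
ORDER BY r.region, w.weeknbr
""").fetchall()

for row in rows:
    print(row)
# ('France', 1, 5)
# ('France', 2, 0)
# ('France', 3, 7)
# ('Spain', 1, 0)
# ('Spain', 2, 0)
# ('Spain', 3, 6)
```

The region comes from the cross-joined list, not from the fact table, which is why it is never null even for weeks without data.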

Missing a single day

My database has two tables, a car table and a wheel table.
I'm trying to find the number of wheels that meet a certain condition over a range of days, but some days are not included in the output.
Here is the query:
USE CarDB
SELECT MONTH(c.DateTime1) 'Month',
DAY(c.DateTime1) 'Day',
COUNT(w.ID) 'Wheels'
FROM tblCar c
INNER JOIN tblWheel w
ON c.ID = w.CarID
WHERE c.DateTime1 BETWEEN '05/01/2013' AND '06/04/2013'
AND w.Measurement < 18
GROUP BY MONTH(c.DateTime1), DAY(c.DateTime1)
ORDER BY [Month], [Day]
GO
The output results seem to be correct, but days with 0 wheels do not show up. For example:
Sample Current Output:
Month Day Wheels
2 1 7
2 2 4
2 3 2 -- 2/4 is missing
2 5 9
Sample Desired Output:
Month Day Wheels
2 1 7
2 2 4
2 3 2
2 4 0
2 5 9
I also tried a left join but it didn't seem to work.
You were on the right track with a LEFT JOIN
Try running your query with this kind of outer join but without your WHERE clause. Notice anything?
What's happening is that the join is applied and then the where clause removes the values that don't match the criteria. All this happens before the group by, meaning the cars are excluded.
Here's one method for you:
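SELECT Year(cars.datetime1) As the_year
, Month(cars.datetime1) As the_month
, Day(cars.datetime1) As the_day
, Count(wheels.id) As wheels
FROM (
SELECT id
, datetime1
FROM tblCar
WHERE datetime1 BETWEEN '2013-01-05' AND '2013-04-06'
) As cars
LEFT
JOIN tblWheel As wheels
ON wheels.carid = cars.id
AND wheels.measurement < 18 -- filter in the ON clause so cars without matching wheels are kept
GROUP
BY Year(cars.datetime1)
, Month(cars.datetime1)
, Day(cars.datetime1)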
What's different this time round is that we're limiting the results of the car table before we join to the wheels table.
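The effect of filtering in WHERE versus in the join's ON clause can be reproduced with a small sketch. It assumes SQLite via Python's sqlite3 module rather than SQL Server, and the simplified tables `cars` and `wheels` with their sample rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cars (id INTEGER, day TEXT);
CREATE TABLE wheels (car_id INTEGER, measurement INTEGER);
INSERT INTO cars VALUES (1, '2013-02-03'), (2, '2013-02-04'), (3, '2013-02-05');
-- The car from 2013-02-04 has wheels, but none with measurement < 18.
INSERT INTO wheels VALUES (1, 17), (1, 16), (2, 19), (2, 20), (3, 15);
""")

# Filter in WHERE: rows of car 2 are removed after the join, so its day vanishes.
where_filter = conn.execute(
    "SELECT c.day, COUNT(w.car_id) FROM cars c "
    "LEFT JOIN wheels w ON w.car_id = c.id "
    "WHERE w.measurement < 18 "
    "GROUP BY c.day ORDER BY c.day"
).fetchall()

# Filter in ON: car 2 survives with a NULL wheel row, which COUNT(w.car_id) counts as 0.
on_filter = conn.execute(
    "SELECT c.day, COUNT(w.car_id) FROM cars c "
    "LEFT JOIN wheels w ON w.car_id = c.id AND w.measurement < 18 "
    "GROUP BY c.day ORDER BY c.day"
).fetchall()

print(where_filter)  # [('2013-02-03', 2), ('2013-02-05', 1)]
print(on_filter)     # [('2013-02-03', 2), ('2013-02-04', 0), ('2013-02-05', 1)]
```

A day on which no car exists at all would still be absent from both results; covering those days needs a calendar or numbers table to join from.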
You probably want to use a LEFT OUTER JOIN:
USE CarDB
SELECT MONTH (c.DateTime1) 'Month', DAY (c.DateTime1) 'Day', COUNT (w.ID) 'Wheels'
FROM tblCar c LEFT OUTER JOIN tblWheel w ON c.ID = w.CarID
WHERE c.DateTime1 BETWEEN '05/01/2013' AND '06/04/2013'
AND (w.Measurement IS NULL OR w.Measurement < 18)
GROUP BY MONTH (c.DateTime1), DAY (c.DateTime1)
ORDER BY [Month], [Day]
GO
And then you need to adapt the WHERE condition, as you want to keep the rows where w.Measurement is NULL because of the OUTER join.
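Remove the join and count the matching wheels with a correlated subquery instead. Note that the subquery is evaluated per car, so this returns one row per car rather than one per day:
SELECT MONTH(c.DateTime1) 'Month', DAY(c.DateTime1) 'Day', (SELECT COUNT(*) FROM tblWheel w WHERE w.CarID = c.ID AND w.Measurement < 18) 'Wheels'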

Time series querying in Postgres

This is a follow-on question from @Erwin's answer to Efficient time series querying in Postgres.
In order to keep things simple I'll use the same table structure as that question
id | widget_id | for_date | score |
The original question was to get score for each of the widgets for every date in a range. If there was no entry for a widget on a date then show the score from the previous entry for that widget. The solution using a cross join and a window function worked well if all the data was contained in the range you were querying for. My problem is I want the previous score even if it lies outside the date range we are looking at.
Example data:
INSERT INTO score (id, widget_id, for_date, score) values
(1, 1337, '2012-04-07', 52),
(2, 2222, '2012-05-05', 99),
(3, 1337, '2012-05-07', 112),
(4, 2222, '2012-05-07', 101);
When I query for the range May 5th to May 10th 2012 (ie generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')) I would like to get the following:
DAY WIDGET_ID SCORE
May, 05 2012 1337 52
May, 05 2012 2222 99
May, 06 2012 1337 52
May, 06 2012 2222 99
May, 07 2012 1337 112
May, 07 2012 2222 101
May, 08 2012 1337 112
May, 08 2012 2222 101
May, 09 2012 1337 112
May, 09 2012 2222 101
May, 10 2012 1337 112
May, 10 2012 2222 101
The best solution so far (also by @Erwin) is:
SELECT a.day, a.widget_id, s.score
FROM (
SELECT d.day, w.widget_id
,max(s.for_date) OVER (PARTITION BY w.widget_id ORDER BY d.day) AS effective_date
FROM (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
) a
LEFT JOIN score s ON s.for_date = a.effective_date AND s.widget_id = a.widget_id
ORDER BY a.day, a.widget_id;
But as you can see in this SQL Fiddle it produces null scores for widget 1337 on the first two days. I would like to see the earlier score of 52 from row 1 in its place.
Is it possible to do this in an efficient way?
As @Roman mentioned, DISTINCT ON can solve this. Details in this related answer:
Select first row in each GROUP BY group?
Subqueries are generally a bit faster than CTEs, though:
SELECT DISTINCT ON (d.day, w.widget_id)
d.day, w.widget_id, s.score
FROM generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER BY d.day, w.widget_id, s.for_date DESC;
You can use a set returning function like a table in the FROM list.
One multicolumn index should be the key to performance:
CREATE INDEX score_multi_idx ON score (widget_id, for_date, score)
The third column score is only included to make it a covering index in Postgres 9.2 or later. You would not include it in earlier versions.
Of course, if you have many widgets and a wide range of days, the CROSS JOIN produces a lot of rows, which has a price-tag. Only select the widgets and days you actually need.
As you wrote, you should look for the matching score first, and if there is a gap, fill it with the nearest earlier score. In SQL that is:
SELECT d.day, w.widget_id,
coalesce(s.score, (select s2.score from score s2
where s2.for_date<d.day and s2.widget_id=w.widget_id order by s2.for_date desc limit 1)) as score
from (select distinct widget_id FROM score) AS w
cross join (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
left join score s ON (s.for_date = d.day AND s.widget_id = w.widget_id)
order by d.day, w.widget_id;
Coalesce in this case means "if there is a gap".
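The gap-filling idea can be tried on the question's sample data with a minimal sketch. It assumes SQLite via Python's sqlite3 module rather than Postgres, so generate_series is replaced by a recursive CTE and the latest score on or before each day is fetched by a single correlated subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE score (id INTEGER, widget_id INTEGER, for_date TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO score VALUES (?, ?, ?, ?)",
    [
        (1, 1337, "2012-04-07", 52),
        (2, 2222, "2012-05-05", 99),
        (3, 1337, "2012-05-07", 112),
        (4, 2222, "2012-05-07", 101),
    ],
)

rows = conn.execute("""
WITH RECURSIVE days(day) AS (
    SELECT '2012-05-05'
    UNION ALL
    SELECT date(day, '+1 day') FROM days WHERE day < '2012-05-10'
)
SELECT d.day, w.widget_id,
       (SELECT s.score FROM score s
         WHERE s.widget_id = w.widget_id AND s.for_date <= d.day
         ORDER BY s.for_date DESC LIMIT 1) AS score   -- latest score up to this day
FROM days d
CROSS JOIN (SELECT DISTINCT widget_id FROM score) w
ORDER BY d.day, w.widget_id
""").fetchall()

print(rows[0])  # ('2012-05-05', 1337, 52)  taken from the row dated 2012-04-07
print(rows[4])  # ('2012-05-07', 1337, 112)
```

Because the lookup is not restricted to the queried range, widget 1337 picks up its April score on the first two days instead of null.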
You can use distinct on syntax in PostgreSQL
with cte_d as (
select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
select distinct widget_id from score
)
select distinct on (d.day, w.widget_id)
d.day, w.widget_id, s.score
from cte_d as d
cross join cte_w as w
left outer join score as s on s.widget_id = w.widget_id and s.for_date <= d.day
order by d.day, w.widget_id, s.for_date desc;
or get max date by subquery:
with cte_d as (
select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
select distinct widget_id from score
)
select
d.day, w.widget_id, s.score
from cte_d as d
cross join cte_w as w
left outer join score as s on s.widget_id = w.widget_id
where
exists (
select 1
from score as tt
where tt.widget_id = w.widget_id and tt.for_date <= d.day
having max(tt.for_date) = s.for_date
)
order by d.day, w.widget_id;
The performance really depends on indexes you have on your table (unique widget_id, for_date if possible). I think if you have many rows for each widget_id then second one would be more efficient, but you have to test it on your data.