SQL "Count (Distinct...)" returns 1 less than actual data shows? - sql

I have some data that doesn't appear to be counting correctly. When I look at the raw data I see 5 distinct values in a given column, but when I run a "COUNT (DISTINCT ColA)" it reports 4. This is true for all of the categories I am grouping by, too, not just one. E.g. a 2nd category reports 2 when there are 3, a 3rd reports 1 when there are 2, etc.
Table A: ID, Type
Table B: ID_FK, WorkID, Date
Here is my query that summarizes:
SELECT COUNT (DISTINCT B.ID_FK), A.Type
FROM A INNER JOIN B ON B.ID_FK = A.ID
WHERE Date > 5/1/2013 and Date < 5/2/2013
GROUP BY Type
ORDER BY Type
And a snippet of the results:
4|Business
2|Design
2|Developer
Here is a sample of my data, non-summarized. Pipe is the separator; I just removed the 'COUNT...' and 'GROUP BY...' parts of the query above to get this:
4507|Business
4515|Business
7882|Business
7889|Business
7889|Business
8004|Business
4761|Design
5594|Design
5594|Design
5594|Design
7736|Design
7736|Design
7736|Design
3132|Developer
3132|Developer
3132|Developer
4826|Developer
5403|Developer
As you can see from the data, Business should be 5, not 4, etc. At least that is what my eyes tell me. :)
I am running this inside a FileMaker 12 solution using its internal ExecuteSQL call. Don't be concerned by that too much, though: the code should be the same as nearly anything else. :)
Any help would be appreciated.
Thanks,
J

Try using a subquery:
SELECT COUNT(*), Type
FROM (SELECT DISTINCT B.ID_FK, A.Type Type
FROM A
INNER JOIN B ON B.ID_FK = A.ID
WHERE Date > 5/1/2013 and Date < 5/2/2013) x
GROUP BY Type
ORDER BY Type

This could be a FileMaker issue; have you seen this post on the FileMaker forum? It describes the same issue (a count distinct smaller by 1) with 11v3 back in 03/2012 with a plug-in, then was updated with the same issue in 12v3 in 11/2012 with ExecuteSQL. It didn't seem to be resolved in either case.
Other considerations might be whether there are any referential integrity constraints on the joined tables; or, if you can get a query execution plan, you might find it is executing the query differently than expected (not sure if FileMaker can do this).
I like Barmar's suggestion, though it would sort twice.
If you are dealing with a bug, structuring the query so that the COUNT DISTINCT, join and/or GROUP BY happen at different times might work around it:
SELECT COUNT (DISTINCT x.ID), x.Type
FROM (SELECT A.ID ID, A.Type Type
FROM A
INNER JOIN B ON B.ID_FK = A.ID
WHERE B.Date > 5/1/2013 and B.Date < 5/2/2013) x
GROUP BY Type
ORDER BY Type
You might also try replacing B.ID_FK with A.ID; who knows in what context the bug applies. For example:
SELECT COUNT (DISTINCT A.ID), A.Type
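For completeness, a sketch of the full query with that substitution (mirroring the structure of the original query, so purely an untested variation) would be:
SELECT COUNT (DISTINCT A.ID), A.Type
FROM A
INNER JOIN B ON B.ID_FK = A.ID
WHERE B.Date > 5/1/2013 and B.Date < 5/2/2013
GROUP BY A.Type
ORDER BY A.Type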

Related

How can I access a selected column from my first select-statement in my third-level subselect?

I have a table "Bed" and a table "Component". Between those two I have a m:n relation and the table "BedComponent", where I store the Bed-ID and the Component-ID.
Every Component has a price. And now I want to write a select-statement that gives me the sum of prices for a certain bed.
This is what I have:
SELECT Bed.idBed, Bed.name, SUM(src.price) AS summe, Bed.idCustomer
FROM Bed,
(SELECT price
FROM dbo.Component AS C
WHERE (C.idComponent IN
(SELECT idComponent
FROM dbo.BedComponent AS BC
WHERE 1 = BC.idBed))) AS src
GROUP BY dbo.Bed.idBed, dbo.Bed.name, dbo.Bed.idCustomer;
This statement works. But of course I don't want to hard-code the bed-ID into my select, as it will always calculate the price for bed 1. Instead of the "1" I want to have the current bed-ID.
I work with MS SQL Server
Thanks for your help.
I think you want:
select b.idBed, b.name, SUM(c.price) AS summe, b.idCustomer
from bed b join
bedcomponent bc
on b.idBed = bc.idBed join
component c
on c.idComponent = bc.idComponent
group by b.idBed, b.name, b.idCustomer;
The idCustomer looks strange to me in the select and group by, but I don't know what you are trying to achieve.
Also note the use of table aliases, which make the query easier to write and to read.
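If you only need the total for one specific bed rather than all of them, a hedged variation is to add a filter on the bed id; @bedId here is just a placeholder for however you pass the current bed's id from your application:
select b.idBed, b.name, SUM(c.price) AS summe, b.idCustomer
from bed b join
     bedcomponent bc
     on b.idBed = bc.idBed join
     component c
     on c.idComponent = bc.idComponent
where b.idBed = @bedId   -- supply the current bed id here
group by b.idBed, b.name, b.idCustomer;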

Include missing years in Group By query

I am fairly new to Access and SQL programming. I am trying to do the following:
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
and group by year even when there is no amount in some of the years. I would like to have these years listed as well for a report with charts. I'm not certain if this is possible, but every bit of help is appreciated.
My code so far is as follows:
SELECT
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
FROM
Base_CustomerT
INNER JOIN (
SO_SalesOrderPaymentHistoryLineT
INNER JOIN SO_SalesOrderT
ON SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId
) ON Base_CustomerT.CustomerId = SO_SalesOrderT.CustomerId
GROUP BY
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
SO_SalesOrderPaymentHistoryLineT.PaymentType,
Base_CustomerT.IsActive
HAVING
(((SO_SalesOrderPaymentHistoryLineT.PaymentType)=1)
AND ((Base_CustomerT.IsActive)=Yes))
ORDER BY
Base_CustomerT.SalesRep,
Base_CustomerT.Customer;
You need another table with all years listed -- you can create this on the fly or have one in the db... join from that. So if you had a table called alltheyears with a column called y that just listed the years then you could use code like this:
WITH minmax as
(
select min(year(SO_SalesOrderPaymentHistoryLineT.DatePaid)) as minyear,
max(year(SO_SalesOrderPaymentHistoryLineT.DatePaid)) as maxyear
from SO_SalesOrderPaymentHistoryLineT
), yearsused as
(
select y
from alltheyears, minmax
where alltheyears.y >= minyear and alltheyears.y <= maxyear
)
select *
from yearsused
join ( /* your query above goes here! */ ) T
ON year(T.DatePaid) = yearsused.y
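If you don't have a table of years yet, one hypothetical way to set it up (using the alltheyears/y names from above) is to create and populate it once:
CREATE TABLE alltheyears (y INTEGER);
INSERT INTO alltheyears (y) VALUES (2000);
INSERT INTO alltheyears (y) VALUES (2001);
INSERT INTO alltheyears (y) VALUES (2002);
-- ...and so on, for every year you might ever need to report on.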
You need a data source that will provide the year numbers. You cannot manufacture them out of thin air. Supposing you had a table Interesting_year with a single column year, populated, say, with every distinct integer between 2000 and 2050, you could do something like this:
SELECT
base.SalesRep,
base.CustomerId,
base.Customer,
base.year,
Sum(NZ(data.Amount)) AS [Sum Of PaymentPerYear]
FROM
(SELECT * FROM Base_CustomerT INNER JOIN Interesting_year) AS base
LEFT JOIN
(SELECT * FROM
SO_SalesOrderT
INNER JOIN SO_SalesOrderPaymentHistoryLineT
ON (SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId)
) AS data
ON ((base.CustomerId = data.CustomerId)
AND (base.year = Year(data.DatePaid)))
WHERE
(data.PaymentType = 1)
AND (base.IsActive = Yes)
AND (base.year BETWEEN
(SELECT Min(Year(DatePaid)) FROM SO_SalesOrderPaymentHistoryLineT)
AND (SELECT Max(Year(DatePaid)) FROM SO_SalesOrderPaymentHistoryLineT))
GROUP BY
base.SalesRep,
base.CustomerId,
base.Customer,
base.year
ORDER BY
base.SalesRep,
base.Customer;
Note the following:
The revised query first forms the Cartesian product of Base_CustomerT with Interesting_year in order to have base customer data associated with each year (this is sometimes called a CROSS JOIN, but it's the same thing as an INNER JOIN with no join predicate, which is what Access requires).
In order to have result rows for years with no payments, you must perform an outer join (in this case a LEFT JOIN). Where a (base customer, year) combination has no associated orders, the rest of the columns of the join result will be NULL.
I'm selecting the CustomerId from Base_CustomerT because you would sometimes get a NULL if you selected from SO_SalesOrderT as in the starting query
I'm using the Access Nz() function to convert NULL payment amounts to 0 (from rows corresponding to years with no payments)
I converted your HAVING clause to a WHERE clause. That's semantically equivalent in this particular case, and it will be more efficient because the WHERE filter is applied before groups are formed, and because it allows some columns to be omitted from the GROUP BY clause.
Following Hogan's example, I filter out data for years outside the overall range covered by your data. Alternatively, you could achieve the same effect without that filter condition and its subqueries by ensuring that table Interesting_year contains only the year numbers for which you want results.
Update: modified the query to a different, but logically equivalent "something like this" that I hope Access will like better. Aside from adding a bunch of parentheses, the main difference is making both the left and the right operand of the LEFT JOIN into a subquery. That's consistent with the consensus recommendation for resolving Access "ambiguous outer join" errors.
Thank you John for your help. I found a solution which works for me. It looks quite different, but I learned a lot out of it. If you are interested, here is how it looks now.
SELECT DISTINCTROW
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
FROM
Base_Customer_RevenueYearQ
LEFT JOIN CustomerPaymentPerYearQ
ON (Base_Customer_RevenueYearQ.RevenueYear = CustomerPaymentPerYearQ.[RevenueYear])
AND (Base_Customer_RevenueYearQ.CustomerId = CustomerPaymentPerYearQ.CustomerId)
GROUP BY
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
;

SQL query gets data very slowly from different tables

I am writing an SQL query to get data from several different tables, but it is retrieving the data very slowly, taking over 2 minutes to complete.
What I am doing here is:
1. I am getting date differences, and based on the date difference I am getting account numbers
2. I am comparing tables to get the exact data I need.
Here is my query:
select T.accountno,
MAX(T.datetxn) as MxDt,
datediff(MM,MAX(T.datetxn), '2011-6-30') as Diffs,
max(P.Name) as POName
from Account_skd A,
AccountTxn_skd T,
POName P
where A.AccountNo = T.AccountNo and
GPOCode = A.OfficeCode and
Code = A.POCode and
A.servicecode = T.ServiceCode
group by T.AccountNo
order by len(T.AccountNo) DESC
Please help with how I can use joins, or any other way, to get the data in much less time, say 5-10 seconds.
Since it appears you are getting EVERY ACCOUNT, and performance is slow, I would try creating a prequery by just account, then doing a single join to the other tables, something like:
select
T.Accountno,
T.MxDt,
datediff(MM, T.MxDt, '2011-6-30') as Diffs,
P.Name as POName
from
( select T1.AccountNo,
Max( T1.DateTxn ) MxDt
from AccountTxn_skd T1
group by T1.AccountNo ) T
JOIN Account_skd A
on T.AccountNo = A.AccountNo
JOIN POName P
on A.POCode = P.Code             -- GUESSING, as you didn't qualify alias.field
AND A.OfficeCode = P.GPOCode     -- in your query for these two fields
order by
len(T.AccountNo) DESC
You had other elements based on the T.ServiceCode matching, but since you are only grouping on the account number anyhow, did it matter which service code was used? Otherwise, you would need to group by both the account AND service code, in which case I would add the service code into the prequery and add it as a join condition to the account table too; a sketch of that follows below.
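If the service code does matter, a hypothetical version of that variant (column names assumed from the question's query) would group the prequery by both columns and join on both:
select T.AccountNo,
       T.ServiceCode,
       T.MxDt,
       datediff(MM, T.MxDt, '2011-6-30') as Diffs,
       P.Name as POName
from ( select T1.AccountNo,
              T1.ServiceCode,
              Max( T1.DateTxn ) MxDt
       from AccountTxn_skd T1
       group by T1.AccountNo, T1.ServiceCode ) T
JOIN Account_skd A
  on T.AccountNo = A.AccountNo
 and T.ServiceCode = A.ServiceCode
JOIN POName P
  on A.POCode = P.Code             -- still guessing these two mappings,
 and A.OfficeCode = P.GPOCode      -- as in the query above
order by len(T.AccountNo) DESC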

SQL SUM function doubling the amount it should using multiple tables

My query below is doubling the amount on the last record it returns. I have 3 tables - activities, bookings and tempbookings. The query needs to list the activities and attached information, pull the total number (using SUM) of places booked (as BookingTotal) from the bookings table for each activity, and then calculate the same for tempbookings (as tempPlacesReserved), provided the reservedate field inside that table is in the future.
However, the first issue is that if there are no records for an activity in the tempbookings table, it does not return any records for that activity at all. To get around this I created dummy records dated in the past so that it still returns the record, but if I can make it so I don't have to do this I would prefer it!
The main issue I have is that on the final record of the returned results it doubles the booking total and the places reserved, which of course makes the whole query useless.
I know that I am doing something wrong I just haven't been able to sort it, I have searched similar issues online but am unable to apply them to my situation correctly.
Any help would be appreciated.
P.S. I'm aware that normally you wouldn't need to fully qualify all the paths to the databases, tables and fields as I have, but for the program I am planning to use this in I have to do it this way.
Code:
SELECT [LeisureActivities].[dbo].[activities].[activityID],
[LeisureActivities].[dbo].[activities].[activityName],
[LeisureActivities].[dbo].[activities].[activityDate],
[LeisureActivities].[dbo].[activities].[activityPlaces],
[LeisureActivities].[dbo].[activities].[activityPrice],
SUM([LeisureActivities].[dbo].[bookings].[bookingPlaces]) AS 'bookingTotal',
SUM (CASE WHEN[LeisureActivities].[dbo].[tempbookings].[tempReserveDate] > GetDate() THEN [LeisureActivities].[dbo].[tempbookings].[tempPlaces] ELSE 0 end) AS 'tempPlacesReserved'
FROM [LeisureActivities].[dbo].[activities],
[LeisureActivities].[dbo].[bookings],
[LeisureActivities].[dbo].[tempbookings]
WHERE ([LeisureActivities].[dbo].[activities].[activityID]=[LeisureActivities].[dbo].[bookings].[activityID]
AND [LeisureActivities].[dbo].[activities].[activityID]=[LeisureActivities].[dbo].[tempbookings].[tempActivityID])
AND [LeisureActivities].[dbo].[activities].[activityDate] > GetDate ()
GROUP BY [LeisureActivities].[dbo].[activities].[activityID],
[LeisureActivities].[dbo].[activities].[activityName],
[LeisureActivities].[dbo].[activities].[activityDate],
[LeisureActivities].[dbo].[activities].[activityPlaces],
[LeisureActivities].[dbo].[activities].[activityPrice];
Your current query is using an INNER JOIN between each of the tables, so if the tempbookings table has no records, you will not return anything.
I would advise that you start to use JOIN syntax. You might also need to use subqueries to get the totals.
SELECT a.[activityID],
a.[activityName],
a.[activityDate],
a.[activityPlaces],
a.[activityPrice],
coalesce(b.bookingTotal, 0) bookingTotal,
coalesce(t.tempPlacesReserved, 0) tempPlacesReserved
FROM [LeisureActivities].[dbo].[activities] a
LEFT JOIN
(
select activityID,
SUM([bookingPlaces]) AS bookingTotal
from [LeisureActivities].[dbo].[bookings]
group by activityID
) b
ON a.[activityID]=b.[activityID]
LEFT JOIN
(
select tempActivityID,
SUM(CASE WHEN [tempReserveDate] > GetDate() THEN [tempPlaces] ELSE 0 end) AS tempPlacesReserved
from [LeisureActivities].[dbo].[tempbookings]
group by tempActivityID
) t
ON a.[activityID]=t.[tempActivityID]
WHERE a.[activityDate] > GetDate();
Note: I am using aliases because it is easier to read
Use the newer SQL-92 join syntax, and make the join to tempbookings an outer join. Also clean up your SQL with table aliases; it makes it easier to read. As to why the last row has doubled values, I don't know, but on the off chance that it is caused by the extra dummy records you entered, get rid of them; that problem is fixed by using an outer join to tempbookings. The other possibility is that the join condition you had to the tempbookings table (t.tempActivityID = a.activityID) is insufficient to guarantee that it will match only one record in the activities table. If, for example, it matches two records in activities, then the rows from tempbookings would be repeated twice in the output, causing the sum to be doubled.
SELECT a.activityID, a.activityName, a.activityDate,
a.activityPlaces, a.activityPrice,
SUM(b.bookingPlaces) bookingTotal,
SUM (CASE WHEN t.tempReserveDate > GetDate()
THEN t.tempPlaces ELSE 0 end) tempPlacesReserved
FROM LeisureActivities.dbo.activities a
Join LeisureActivities.dbo.bookings b
On b.activityID = a.activityID
Left Join LeisureActivities.dbo.tempbookings t
On t.tempActivityID = a.activityID
WHERE a.activityDate > GetDate ()
GROUP BY a.activityID, a.activityName,
a.activityDate, a.activityPlaces,
a.activityPrice;

MySQL to PostgreSQL: GROUP BY issues

So I decided to try out PostgreSQL instead of MySQL, but I am having some slight conversion problems. This was a query of mine that samples data from four tables and spits them out all in one result.
I am at a loss as to how to convey this in PostgreSQL, and specifically in Django, but I am leaving that for another question, so bonus points if you can Django-fy it, but no worries if you just pure-SQL it.
SELECT links.id, links.created, links.url, links.title, user.username, category.title, SUM(votes.karma_delta) AS karma, SUM(IF(votes.user_id = 1, votes.karma_delta, 0)) AS user_vote
FROM links
LEFT OUTER JOIN `users` `user` ON (`links`.`user_id`=`user`.`id`)
LEFT OUTER JOIN `categories` `category` ON (`links`.`category_id`=`category`.`id`)
LEFT OUTER JOIN `votes` `votes` ON (`votes`.`link_id`=`links`.`id`)
WHERE (links.id = votes.link_id)
GROUP BY votes.link_id
ORDER BY (SUM(votes.karma_delta) - 1) / POW((TIMESTAMPDIFF(HOUR, links.created, NOW()) + 2), 1.5) DESC
LIMIT 20
The IF in the select was where my first troubles began. It seems it's an IF true/false THEN stuff ELSE other stuff END IF, yet I can't get the syntax right. I tried to use Navicat's SQL builder but it constantly wanted me to place everything I had selected into the GROUP BY, and I think that is all kinds of wrong.
What I am looking for, in summary, is to make this MySQL query work in PostgreSQL. Thank you.
Current Progress
Just want to thank everybody for their help. This is what I have so far:
SELECT links_link.id, links_link.created, links_link.url, links_link.title, links_category.title, SUM(links_vote.karma_delta) AS karma, SUM(CASE WHEN links_vote.user_id = 1 THEN links_vote.karma_delta ELSE 0 END) AS user_vote
FROM links_link
LEFT OUTER JOIN auth_user ON (links_link.user_id = auth_user.id)
LEFT OUTER JOIN links_category ON (links_link.category_id = links_category.id)
LEFT OUTER JOIN links_vote ON (links_vote.link_id = links_link.id)
WHERE (links_link.id = links_vote.link_id)
GROUP BY links_link.id, links_link.created, links_link.url, links_link.title, links_category.title
ORDER BY links_link.created DESC
LIMIT 20
I had to make some table name changes and I am still working on my ORDER BY so till then we're just gonna cop out. Thanks again!
Have a look at this link: GROUP BY
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.
You need to include in the GROUP BY all of the selected columns that are not part of aggregate functions.
A few things:
Drop the backticks
Use a CASE expression instead of IF(): CASE WHEN votes.user_id = 1 THEN votes.karma_delta ELSE 0 END
Change your TIMESTAMPDIFF to DATE_TRUNC('hour', now()) - DATE_TRUNC('hour', links.created) (you will then need to count the number of hours in the resulting interval; it would be much easier to just compare timestamps)
Fix your GROUP BY and ORDER BY (a sketch follows below)
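A minimal sketch of how the hourly decay in the ORDER BY might look in PostgreSQL, using EXTRACT(EPOCH FROM ...) to turn the age of each link into hours; the table and column names are taken from the "Current Progress" query above, so treat it as an untested starting point:
SELECT links_link.id,
       SUM(links_vote.karma_delta) AS karma
FROM links_link
LEFT OUTER JOIN links_vote ON links_vote.link_id = links_link.id
GROUP BY links_link.id, links_link.created
ORDER BY (SUM(links_vote.karma_delta) - 1)
         / POWER(EXTRACT(EPOCH FROM (now() - links_link.created)) / 3600 + 2, 1.5) DESC
LIMIT 20;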
Try to replace the IF with a CASE:
SUM(CASE WHEN votes.user_id = 1 THEN votes.karma_delta ELSE 0 END)
You also have to explicitly name, in the GROUP BY clause, every column or calculated column that you select outside of aggregate functions.