Does COUNT(*) skew desired results when performing an additional join? - sql

My query works fine; however, I need to join another dataset to my query, and I expect that the count(f.*) will break.
Here's the query I start with:
SELECT
MIN(received_date) AS FirstVisit
, patient_id AS PatientID
INTO #LookupTable
FROM F_ACCESSION_DAILY
SELECT
f.doctor AS Doctor
, COUNT(f.*) AS CountNewPatients
, MONTH(firstvisit) AS Month
, YEAR(firstvisit) AS Year
FROM F_ACCESSION_DAILY f
INNER JOIN #LookupTable l ON f.received_date = l.FirstVisit
AND f.patient_id = l.PatientID
GROUP BY f.doctor
, MONTH(firstvisit)
, YEAR(firstvisit)
DROP TABLE #LookupTable
I would like to join the above query to another table.
The question is: will my count(f.*) stay the same, or will it change because I've added a new dataset?
How do I make sure that the count(f.*) will remain the same?
Thank you so much for your guidance.

Will my count(f.*) stay the same or will it change because I've added a new dataset?
COUNT(*) counts rows. If you join another table and the number of rows increases, the result of COUNT(*) will increase.
How do I make sure that the count(f.*) will remain the same?
Use COUNT (DISTINCT f.Id).
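As a rough illustration of the difference, here is a minimal sketch using Python's built-in sqlite3 module. The visits and notes tables and their columns are hypothetical stand-ins, not the tables from the question; the only point is that a one-to-many join raises COUNT(*) while COUNT(DISTINCT id) stays put.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for illustration only.
    CREATE TABLE visits (id INTEGER PRIMARY KEY, doctor TEXT);
    CREATE TABLE notes  (visit_id INTEGER, note TEXT);
    INSERT INTO visits VALUES (1, 'Smith'), (2, 'Smith');
    -- Visit 1 has two notes, so the join emits that visit twice.
    INSERT INTO notes VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# COUNT(*) counts joined rows; COUNT(DISTINCT v.id) counts visits.
row_count, distinct_count = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT v.id)
    FROM visits v
    JOIN notes n ON n.visit_id = v.id
""").fetchone()

print(row_count, distinct_count)  # 3 joined rows, 2 distinct visits
```

This only works if the counted column really is unique per row of the original table.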

If there's exactly one row per patient_id in the new table (and you're doing an INNER JOIN), then the count won't change. Otherwise, it will.
You could use an OUTER APPLY (SELECT TOP 1 ....) instead of a JOIN to guarantee that the count won't change.
By the way, it looks like you're missing a GROUP BY patient_id in your first SELECT.

Joins do not "skew" the COUNT(*). The count does exactly what it is advertised to do. The problem is that you may be multiplying the number of rows, without really realizing it.
One way to solve the problem is to do the aggregations at the appropriate level. Sometimes, you have to do it this way -- for instance, when SUMs and AVGs are involved.
For the count, though, you can replace it with:
count(distinct AccessionDailyID)
Even if the rows get multiplied, this will still give you the right count. This assumes that your table has a unique id for each row.
By the way, you may want to use LEFT OUTER JOIN rather than INNER JOIN to be sure that you don't lose any rows in the joining process.
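To make "aggregating at the appropriate level" concrete for SUMs, here is a small sketch with Python's sqlite3 module. The orders and shipments tables are invented for the example (they do not come from the question); the idea is to collapse the many-side to one row per key before joining, so amounts are not summed more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for illustration only.
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE shipments (order_id INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 50);
    -- Order 1 has two shipments.
    INSERT INTO shipments VALUES (1), (1), (2);
""")

# Joining first duplicates order 1, so its amount is summed twice.
inflated = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.id
""").fetchone()[0]

# Aggregating shipments to one row per order before the join avoids that.
correct = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN (SELECT order_id, COUNT(*) AS n
          FROM shipments
          GROUP BY order_id) s ON s.order_id = o.id
""").fetchone()[0]

print(inflated, correct)  # 250 150
```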

Related

SQL Query - Joining and Aggregating

I need to run a query every hour against a table that joins and aggregates data from another table with millions of rows.
select f.master_con,
s.containers
from
(
select master_con
from shipped
where start_time >= a and start_time <= a+1
) f,
(
select master_con,
count(distinct container) as containers
from picked
) s
where f.master_con = s.master_con
This query above sorta works; the exact syntax may not be correct because I wrote it from memory.
In the subquery 's' I only want to count container for each master_con in the 'f' query, and I think my query runs for a long time because I'm counting container for all master_con but then joining only to master_con from 'f'.
Is there a better, more efficient way to write this type of query?
(In the end, I'll sum(containers) from this query above to get total containers shipped during that hour)
Most likely, there is. Can you provide some simplified sample table structures? Additionally, the join method being used has been moving towards deprecation for some time. You should declare your joins explicitly. The below should be an improvement. Left outer join was used so that you get all of the shipped records that meet your criteria and keep them even if they aren't in the picked table. Change that to inner join if you want them gone.
SELECT shipped.master_con,
COUNT(DISTINCT picked.container) AS containers
FROM shipped LEFT OUTER JOIN
picked ON picked.master_con = shipped.master_con
WHERE shipped.start_time BETWEEN a AND a+1
GROUP BY shipped.master_con
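The query above can be tried out on toy data with Python's sqlite3 module. This is only a sketch: the column types, the sample rows, and the treatment of start_time as an integer hour (with a = 10) are assumptions made so the example runs, not details from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column types and sample rows are assumptions for illustration.
    CREATE TABLE shipped (master_con TEXT, start_time INTEGER);
    CREATE TABLE picked  (master_con TEXT, container TEXT);
    INSERT INTO shipped VALUES ('M1', 10), ('M2', 10), ('M3', 99);
    INSERT INTO picked VALUES
        ('M1', 'C1'), ('M1', 'C1'), ('M1', 'C2'), ('M3', 'C9');
""")

a = 10  # hypothetical start of the hour being reported
rows = conn.execute("""
    SELECT shipped.master_con,
           COUNT(DISTINCT picked.container) AS containers
    FROM shipped
    LEFT OUTER JOIN picked ON picked.master_con = shipped.master_con
    WHERE shipped.start_time BETWEEN ? AND ? + 1
    GROUP BY shipped.master_con
    ORDER BY shipped.master_con
""", (a, a)).fetchall()

print(rows)  # [('M1', 2), ('M2', 0)]
```

M2 is kept with a count of 0 because of the left outer join, and M3 is filtered out by the time window before any of its picked rows are counted.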

Compare value of one field in one table to the total sum of one column in another table

I'm having trouble with executing a query to compare where the value of a column in one table is not equal to the sum of another column in a different table. Below is the query I have been trying to execute:
select id.invoice_no,sum(id.bank_charges),
from db2apps.invoice_d id
inner join db2apps.invoice_h ih on (id.invoice_no = ih.invoice_no)
group by id.invoice_no
having coalesce(sum(id.bank_charges), 0) != ih.tax_value
with ur;
I tried with joining on the tables, the group by having format, etc and have had no luck. I really want to select id.invoice_no, ih.tax_value, and sum(id.bank_charges) in the result set, and also grab the data where the sum(id.bank_charges) is not equal to the value of ih.tax_value. Any help would be appreciated.
Perhaps this solves your problem:
select ih.invoice_no, ih.tax_value, sum(id.bank_charges)
from db2apps.invoice_h ih left join
db2apps.invoice_d id
on id.invoice_no = ih.invoice_no
group by ih.invoice_no, ih.tax_value
having coalesce(sum(id.bank_charges), 0) <> ih.tax_value;
The most logical way is probably to SUM the invoice detail first.
SELECT IH.INVOICE_NO
, IH.TAX_VALUE
FROM
DB2APPS.INVOICE_H IH
JOIN
( SELECT INVOICE_NO
, COALESCE(SUM(BANK_CHARGES),0) AS BANK_CHARGES
FROM
DB2APPS.INVOICE_D
GROUP BY
INVOICE_NO
) ID
ON
ID.INVOICE_NO = IH.INVOICE_NO
WHERE
ID.BANK_CHARGES <> IH.TAX_VALUE
Generally, you never need to use HAVING in SQL, and often your code will be clearer and easier to follow if you avoid using it (even if it is sometimes a bit longer).
P.S. you can remove the COALESCE if BANK_CHARGES is NOT NULL.
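Here is a small sketch of the "sum the detail first" approach, run against toy data with Python's sqlite3 module. It borrows the LEFT JOIN from the first answer so that invoices with no detail rows are compared against 0; the db2apps schema prefix is dropped, and the column types and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column types and sample rows are assumptions for illustration.
    CREATE TABLE invoice_h (invoice_no INTEGER PRIMARY KEY, tax_value INTEGER);
    CREATE TABLE invoice_d (invoice_no INTEGER, bank_charges INTEGER);
    INSERT INTO invoice_h VALUES (1, 30), (2, 20), (3, 5);
    -- Invoice 1 sums to 30 (matches), invoice 2 to 15, invoice 3 has no detail.
    INSERT INTO invoice_d VALUES (1, 10), (1, 20), (2, 15);
""")

rows = conn.execute("""
    SELECT ih.invoice_no,
           ih.tax_value,
           COALESCE(d.bank_charges, 0) AS bank_charges
    FROM invoice_h ih
    LEFT JOIN (SELECT invoice_no, SUM(bank_charges) AS bank_charges
               FROM invoice_d
               GROUP BY invoice_no) d
           ON d.invoice_no = ih.invoice_no
    WHERE COALESCE(d.bank_charges, 0) <> ih.tax_value
    ORDER BY ih.invoice_no
""").fetchall()

print(rows)  # [(2, 20, 15), (3, 5, 0)]
```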

Getting way more results than expected in SQL left join query

My code is such:
SELECT COUNT(*)
FROM earned_dollars a
LEFT JOIN product_reference b ON a.product_code = b.product_code
WHERE a.activity_year = '2015'
I'm trying to match two tables based on their product codes. I would expect the same number of results back from this as total records in table a (with a year of 2015). But for some reason I'm getting close to 3 million.
Table a has about 40,000,000 records and table b has 2000. When I run this statement without the join I get 2,500,000 results, so I would expect this even with the left join, but somehow I'm getting 300,000,000. Any ideas? I even referred to the diagram in this post.
It means either your left join is using only part of the foreign key, which causes row multiplication, or there are simply duplicate rows in the joined table.
Use COUNT(DISTINCT a.product_code)
What is the question you are trying to answer with the T-SQL?
Instead of select count(*), try select a.product_code, b.product_code. That will show you which records match and which don't.
You should also add a where b.product_code is not null. That should exclude the records that don't match.
Is b the parent table and a the child table? Try a right join instead.
Or use the table's unique identifier, i.e.
SELECT COUNT(a.earned_dollars_id)
I'm not sure what your data model looks like and how it is structured, but I'm guessing you only care about earned_dollars?
SELECT COUNT(*)
FROM earned_dollars a
WHERE a.activity_year = '2015'
and exists (select 1 from product_reference b where a.product_code = b.product_code)
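To see the row multiplication on a small scale, the sketch below uses Python's sqlite3 module with made-up sample rows in which product code 'A' appears twice in product_reference. The column types and data are assumptions; the point is that the LEFT JOIN count grows, the EXISTS count does not, and a GROUP BY ... HAVING query can locate the duplicated codes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column types and sample rows are assumptions for illustration.
    CREATE TABLE earned_dollars    (product_code TEXT, activity_year TEXT);
    CREATE TABLE product_reference (product_code TEXT);
    INSERT INTO earned_dollars VALUES ('A', '2015'), ('B', '2015'), ('A', '2014');
    -- 'A' is duplicated in the reference table.
    INSERT INTO product_reference VALUES ('A'), ('A'), ('B');
""")

# Each 2015 row with code 'A' matches two reference rows.
joined = conn.execute("""
    SELECT COUNT(*)
    FROM earned_dollars a
    LEFT JOIN product_reference b ON a.product_code = b.product_code
    WHERE a.activity_year = '2015'
""").fetchone()[0]

# EXISTS only tests for a match, so no rows are multiplied.
with_exists = conn.execute("""
    SELECT COUNT(*)
    FROM earned_dollars a
    WHERE a.activity_year = '2015'
      AND EXISTS (SELECT 1 FROM product_reference b
                  WHERE a.product_code = b.product_code)
""").fetchone()[0]

# Diagnostic: which product codes occur more than once in the reference table?
duplicates = conn.execute("""
    SELECT product_code, COUNT(*)
    FROM product_reference
    GROUP BY product_code
    HAVING COUNT(*) > 1
""").fetchall()

print(joined, with_exists, duplicates)  # 3 2 [('A', 2)]
```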

Cumulative Summing Values in SQLite

I am trying to perform a cumulative sum of values in SQLite. I initially only needed to sum a single column and had the code
SELECT
t.MyColumn,
(SELECT Sum(r.KeyColumn1) FROM MyTable as r WHERE r.Date < t.Date)
FROM MyTable as t
Group By t.Date;
which worked fine.
Now I wanted to extend this to more columns KeyColumn2 and KeyColumn3 say. Instead of adding more SELECT statements I thought it would be better to use a join and wrote the following
SELECT
t.MyColumn,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM MyTable as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
However this does not give me the correct answer (instead it gives values that are much larger than expected). Why is this and how could I correct the JOIN to give me the correct answer?
You are likely getting what I would call mini-Cartesian products: your Date values are probably not unique and, as a result of the self-join, you are getting matches for each of the non-unique values. After grouping by Date the results are just multiplied accordingly.
To solve this, the left side of the join must be rid of duplicate dates. One way is to derive a table of unique dates from your table:
SELECT DISTINCT Date
FROM MyTable
and use it as the left side of the join:
SELECT
t.Date,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM (SELECT DISTINCT Date FROM MyTable) as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
I noticed that you used t.MyColumn in the SELECT clause, while your grouping was by t.Date. If that was intentional, you may be relying on undefined behaviour there, because the t.MyColumn value would probably be chosen arbitrarily among the (potentially) many in the same t.Date group.
For the purpose of this example, I assumed that you actually meant t.Date, so, I replaced the column accordingly, as you can see above. If my assumption was incorrect, please clarify.
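The effect of duplicate dates can be reproduced on a tiny table with Python's sqlite3 module. This is a sketch with invented sample rows and only one key column; it compares the original self-join with the version that uses a derived table of distinct dates on the left side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sample rows are assumptions; note the duplicated date 2020-01-02.
    CREATE TABLE MyTable (Date TEXT, KeyColumn1 INTEGER);
    INSERT INTO MyTable VALUES
        ('2020-01-01', 1),
        ('2020-01-02', 2),
        ('2020-01-02', 3),
        ('2020-01-03', 4);
""")

# Original self-join: both 2020-01-02 rows match the earlier row,
# so the running total for that date is doubled.
naive = conn.execute("""
    SELECT t.Date, SUM(r.KeyColumn1)
    FROM MyTable AS t
    LEFT JOIN MyTable AS r ON r.Date < t.Date
    GROUP BY t.Date
    ORDER BY t.Date
""").fetchall()

# Distinct dates on the left: each date is matched exactly once.
fixed = conn.execute("""
    SELECT t.Date, SUM(r.KeyColumn1)
    FROM (SELECT DISTINCT Date FROM MyTable) AS t
    LEFT JOIN MyTable AS r ON r.Date < t.Date
    GROUP BY t.Date
    ORDER BY t.Date
""").fetchall()

print(naive)  # [('2020-01-01', None), ('2020-01-02', 2), ('2020-01-03', 6)]
print(fixed)  # [('2020-01-01', None), ('2020-01-02', 1), ('2020-01-03', 6)]
```

The earliest date has no earlier rows, so its sum is NULL (None in Python) in both versions.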
Your join is not working because it finds far more matches than your subselect does.
The join is exploding your table.
The subselect computes a sum over all records whose date is earlier than that of the current record.
The join matches every row once for each record with an earlier date, so a single record can be joined as many times as there are records with an earlier date. This produces multiple rows and, in the end, a higher SUM.
If you want the sums of multiple columns, you will have to use three subqueries or define a unique join.

Does the order of columns in a SQL select matter?

My question is regarding a left join. I've tried to count how many people are tracking a certain project
(there can be zero followers).
Now the only way I can get it to work is by adding
group by idproject
My question is whether there is a way to avoid using this and only select, implicitly
setting that group option.
SQL:
select `project_view`.`idproject` AS `idproject`,
count(`track`.`iduser`) AS `c`,`name`
from `project_view` left join `track` using(idproject)
I expected it to count null as zero, but the project doesn't appear at all; if I leave out the count, it shows as null where there are no followers.
If you have a WHERE clause to specify a certain project then you don't need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
WHERE idproject = 4
If you want a count for each project then you do need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
GROUP BY idproject
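The behaviour described in the question can be reproduced with Python's sqlite3 module. This sketch uses made-up rows (project 2 has no followers) and assumed column types. Without GROUP BY, the aggregate collapses the whole join into a single row, which is why the zero-follower project seems to vanish; with GROUP BY, it appears with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column types and sample rows are assumptions for illustration.
    CREATE TABLE project_view (idproject INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE track        (idproject INTEGER, iduser INTEGER);
    INSERT INTO project_view VALUES (1, 'alpha'), (2, 'beta');
    -- Only project 1 has followers.
    INSERT INTO track VALUES (1, 10), (1, 11);
""")

# No GROUP BY: one aggregate row for the entire join.
collapsed = conn.execute("""
    SELECT COUNT(track.iduser)
    FROM project_view
    LEFT JOIN track USING (idproject)
""").fetchall()

# GROUP BY: one row per project; COUNT(column) ignores the NULL iduser.
per_project = conn.execute("""
    SELECT project_view.idproject, COUNT(track.iduser) AS c, name
    FROM project_view
    LEFT JOIN track USING (idproject)
    GROUP BY project_view.idproject, name
    ORDER BY project_view.idproject
""").fetchall()

print(collapsed)    # [(2,)]
print(per_project)  # [(1, 2, 'alpha'), (2, 0, 'beta')]
```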
Yes, the order of selecting matters. For performance reasons you (typically) want your most limiting select first to narrow your data set. This makes every subsequent query operate on a smaller dataset.