sql sum and count functions with inner join - sql

select *
from Table1 t1
inner join Table2 t2 on t1.id=t2.tid
returns 102 rows
select sum(t1.val), count(t1.val)
from Table1 t1
inner join Table2 t2 on t1.id=t2.tid
returns 29000 103
That means the second query doesn't work correctly. What is the problem?

It looks like one of your 103 values has NULL in the val column.
select sum(t1.val), count(*)
from Table1 t1
inner join Table2 t2 on t1.id=t2.tid
This should return 103 for count, at least in MS SQL Server. I think this behaviour is part of ANSI SQL, so it should work in any ANSI-compliant database engine.

As you haven't specified a DBMS, I'm answering based on just SQL, as you tagged it. Anyway, this should apply to any DBMS.
You have 2 different queries that have the same join. The join will generate the same number of rows in both cases. It is clear from the first one that there are 102 rows after the join.
If you then want to count those rows, there is no way that you get more rows than there really are. What can happen is that you get fewer, because the count(field) aggregate function counts only non-null values of field.
However, you stated that you got more and that is absolutely not possible.
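To make the NULL behaviour concrete, here is a minimal sketch using Python's built-in sqlite3 module. The rows are invented for the demonstration and are not the asker's data; only the names Table1, Table2, id, tid and val come from the question.

```python
import sqlite3

# Hypothetical data: three joined rows, one of which has NULL in val.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER, val INTEGER);
    CREATE TABLE Table2 (tid INTEGER);
    INSERT INTO Table1 VALUES (1, 10), (2, NULL), (3, 30);
    INSERT INTO Table2 VALUES (1), (2), (3);
""")

join = "FROM Table1 t1 INNER JOIN Table2 t2 ON t1.id = t2.tid"

# COUNT(*) counts every joined row; COUNT(t1.val) skips rows where val is NULL.
row_count = conn.execute(f"SELECT COUNT(*) {join}").fetchone()[0]
val_count = conn.execute(f"SELECT COUNT(t1.val) {join}").fetchone()[0]
# SUM also ignores NULLs, so it adds up only the non-null values.
val_sum = conn.execute(f"SELECT SUM(t1.val) {join}").fetchone()[0]

print(row_count, val_count, val_sum)
```

With this data count(t1.val) comes out lower than count(*), never higher, which is the direction the answer above describes.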

Related

Why would LEFT JOIN on a field to then later filter it out in WHERE clause?

Query
SELECT ID, Name, Phone
FROM Table1
LEFT JOIN Table2 ON Table1.ID = Table2.ID
WHERE Table2.ID IS NULL
Problem
Finding it hard to understand why someone would left join on an ID
and then set it to NULL in the where clause?
Am I missing something here? Is there any significance to this?
Could we just omit the Table2 altogether? As in not join at all?
Any help would be much appreciated.
The query you have in the question is basically equivalent to the following query:
SELECT ID, Name, Phone
FROM Table1
WHERE NOT EXISTS
(
SELECT 1
FROM Table2
WHERE Table1.ID = Table2.ID
)
Meaning it selects all the records in Table1 that do not have a correlated record in Table2.
The execution plan for both queries will most likely be the same (Personally, I've never seen a case when they produce a different execution plan, but I don't rule that out), so both queries should be equally efficient, and it's up to you to decide whether the left join or the exists syntax is more readable to you.
I think you should have an alias for each of your tables and specify which table each column is coming from.
Assuming Name is from table one, Phone is from table two, and ID is common to both, then the left join mentioned above may help get all users that do not have phone numbers.
Table 1
Id  Name
1   John Smith
2   Jane Doe
Table 2
Id  Phone
2   071 555 0863
Left Join without the where clause
ID  Name        Phone
1   John Smith  NULL
2   Jane Doe    071 555 0863
Left Join with the where clause
ID  Name        Phone
1   John Smith  NULL
This is one of the ways to implement the relational database operation of antijoin, called anti semi join in SQL Server's terminology. This is essentially "bring rows from one table that are not in another table".
The ways I can think of doing this are:
select cols from t1 left join t2 on t1.key=t2.key where t2.key is null
select cols from t1 where key not in (select key from t2)
select cols from t1 where not exists (select 1 from t2 where t1.key=t2.key)
and even
select * from t1 where key in (select key from t1 except select key from t2)
There are some differences between these methods (most notably, the danger of null handling in the case of not in), but they generally do the same.
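The null-handling difference is easy to see on a tiny example. The following is a sketch only, run through Python's sqlite3 module, with invented data and the join column renamed to k (an assumption, chosen to avoid using key as an identifier). t2 deliberately contains a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (k INTEGER);
    CREATE TABLE t2 (k INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (2), (NULL);  -- the NULL is what trips up NOT IN
""")

def keys(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

left_join = keys("SELECT t1.k FROM t1 LEFT JOIN t2 ON t1.k = t2.k "
                 "WHERE t2.k IS NULL ORDER BY t1.k")
not_exists = keys("SELECT k FROM t1 WHERE NOT EXISTS "
                  "(SELECT 1 FROM t2 WHERE t1.k = t2.k) ORDER BY k")
except_form = keys("SELECT k FROM t1 WHERE k IN "
                   "(SELECT k FROM t1 EXCEPT SELECT k FROM t2) ORDER BY k")
# k NOT IN (2, NULL) is never true: it is either false or unknown.
not_in = keys("SELECT k FROM t1 WHERE k NOT IN (SELECT k FROM t2) ORDER BY k")

print(left_join, not_exists, except_form, not_in)
```

Three of the forms agree on the rows of t1 that are missing from t2, while NOT IN returns nothing as soon as the subquery yields a NULL.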
To address your points:
Finding it hard to understand why someone would left join on an ID and
then set it to NULL in the where clause?
As mentioned, in order to exclude results from t1 that are present in t2
Could we just omit the Table2 altogether? As in not join at all?
If you don't use the join (or any of its equivalent alternatives), you will get more results, as the rows in table1 that have the same id as any row in table2 will be returned, too.
If the column used in the join condition, the ID in particular, can itself contain NULL values, then that is bad database design as far as I understand.
Regarding your query below, here are the possible scenarios in which the where clause makes sense.
I am assuming that Name and Phone come from Table2 and that you are trying to find the names and phone numbers whose ID is NULL.
If Name and Phone come from Table1, and Table2 is joined only on ID with nothing selected from it, then the where clause is a total waste.
SELECT
ID,
Name,
Phone
FROM
Table1
LEFT JOIN
Table2
ON
Table1.ID = Table2.ID
WHERE
Table2.ID IS NULL
Essentially, in a common business scenario like the one above, developers add a filter on the right-hand table to the where clause of a left join when the rows carrying values from the right side are not relevant and should not be part of the result set.

SQL Inner Join with no WHERE clause

I was wondering, how does an inner join work when no WHERE clause is specified? For example,
SELECT table1.letter, table2.letter, table1.number, table2.number
FROM tbl AS table1, tbl AS table2;
tbl:
text, integer
a , 1
b , 2
c , 3
Tried finding some examples online but I couldn't seem to find any :-/
Thanks!
The current implicit join syntax you are using:
FROM tbl AS table1, tbl AS table2;
will result in a cross join if no restrictions are present in the WHERE clause. But really you should use modern ANSI-92 syntax when writing your queries, e.g.
SELECT
table1.letter,
table2.letter,
table1.number,
table2.number
FROM tbl AS table1
INNER JOIN tbl AS table2
-- ON <some conditions>
One obvious reason to use this syntax is that it makes it much easier to see the logic of your query. In this case, if your updated query were missing an ON clause, then we would know right away that it is doing a cross join, which is usually not what you want.
The comma operator generates a Cartesian product -- every row in the first table combined with every row of the second.
This is more properly written using the explicit cross join:
SELECT table1.letter, table2.letter, table1.number, table2.number
FROM tbl table1 CROSS JOIN
tbl table2;
If you have conditions for combining the two tables, then you would normally use JOIN with an ON clause.
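Here is a small sketch of this, using Python's sqlite3 module and the three-row tbl from the question. The ON condition in the last query (matching on number) is only an assumed example of a join condition; the question does not specify one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (letter TEXT, number INTEGER);
    INSERT INTO tbl VALUES ('a', 1), ('b', 2), ('c', 3);
""")

cols = "SELECT table1.letter, table2.letter, table1.number, table2.number"

# Comma syntax with no WHERE clause: every row paired with every row.
comma_rows = conn.execute(
    f"{cols} FROM tbl AS table1, tbl AS table2").fetchall()

# The same thing written as an explicit cross join.
cross_rows = conn.execute(
    f"{cols} FROM tbl AS table1 CROSS JOIN tbl AS table2").fetchall()

# With a join condition (hypothetical here) only matching pairs remain.
joined_rows = conn.execute(
    f"{cols} FROM tbl AS table1 "
    "INNER JOIN tbl AS table2 ON table1.number = table2.number").fetchall()

print(len(comma_rows), len(cross_rows), len(joined_rows))
```

Three rows joined to three rows give nine combinations in both the comma and the CROSS JOIN form; the ON condition cuts that down to the three matching pairs.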
You can use cross join
select * from table1 cross join table2
Here is a link to understand more about the use of cross join.
https://www.w3resource.com/sql/joins/cross-join.php

SQL aggregate function returning inflated values on joined table

I'm breaking my head here where I'm going wrong.
The following query:
SELECT SUM(table1.col1) FROM table1
returns value x.
And the following query:
SELECT SUM(table1.col1) FROM table2 RIGHT OUTER JOIN table1 ON table2.ID = table1.ID
returns value y. (I need the Join for the other data of table2). Why is the 2nd example returning a different value than in the first?
Make life easier on yourself, your colleagues that will support your code, and your clients by temporarily ignoring the existence of RIGHT OUTER JOIN. Use Table1 as the "from table" instead of table2.
Then, if aggregating, you will often find it necessary to do this BEFORE joining, so that the numbers are accurate, e.g.
SELECT T1.SUMCOL1
FROM (
SELECT id, SUM(col1) as SUMCOL1 FROM Table1 GROUP BY id
) T1
LEFT OUTER JOIN table2 T2 on T1.id = T2.ID
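The obvious answer is that table2 is many to table1's one. That is, there are multiple rows in table2 for one id in table1, so each such table1 row is repeated in the join and summed more than once. (With an inner join you could also lose table1 rows whose id isn't present in table2; the RIGHT OUTER JOIN here keeps them.)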
Compare:
SELECT COUNT(*) FROM table1
To:
SELECT COUNT(*) FROM table2 RIGHT OUTER JOIN table1 ON table2.ID = table1.ID
If you get different results, you're aggregating duplicates (or, with an inner join, eliminating rows from table1).
If you want to avoid this, you'll need to use a subquery.
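A minimal sketch of the inflation and of one subquery-based way around it, run through Python's sqlite3 module. The data and the note column are invented. The join is written as table1 LEFT OUTER JOIN table2, which returns the same rows as the RIGHT OUTER JOIN in the question and also runs on older SQLite versions. Collapsing table2 to one row per ID before joining is an assumption about what "use a subquery" means here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER, col1 INTEGER);
    CREATE TABLE table2 (ID INTEGER, note TEXT);
    INSERT INTO table1 VALUES (1, 100), (2, 50);
    INSERT INTO table2 VALUES (1, 'x'), (1, 'y');  -- two matches for ID 1
""")

def scalar(sql):
    """Return the single value produced by an aggregate query."""
    return conn.execute(sql).fetchone()[0]

plain_sum = scalar("SELECT SUM(col1) FROM table1")

# ID 1 appears twice after the join, so its col1 is added twice.
joined_sum = scalar(
    "SELECT SUM(table1.col1) FROM table1 "
    "LEFT OUTER JOIN table2 ON table2.ID = table1.ID")

# Reduce table2 to one row per ID first; the join can no longer duplicate.
fixed_sum = scalar(
    "SELECT SUM(t1.col1) FROM table1 t1 "
    "LEFT OUTER JOIN (SELECT ID, COUNT(*) AS n FROM table2 GROUP BY ID) t2 "
    "ON t2.ID = t1.ID")

print(plain_sum, joined_sum, fixed_sum)
```

The direct join counts the 100 twice, while the pre-aggregated join gives the same total as the query on table1 alone.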

SQL query to limit number of rows having distinct values

Is there a way in SQL to use a query that is equivalent to the following:
select * from table1, table2 where some_join_condition
and some_other_condition and count(distinct(table1.id)) < some_number;
Let us say table1 is an employee table. Then a join will cause data about a single employee to be spread across multiple rows. I want to limit the number of distinct employees returned to some number. A condition on row number or something similar will not be sufficient in this case.
So what is the best way to get the same effect the same output as intended by the above query?
select *
from (select * from employee where rownum < some_number and some_id_filter), table2
where some_join_condition and some_other_condition;
This will work for nearly all DBs
SELECT *
FROM table1 t1
INNER JOIN table2 t2
ON some_join_condition
AND some_other_condition
INNER JOIN (
-- one row per id, so the HAVING filter is applied per id
SELECT id
FROM table1
GROUP BY id
HAVING
count(id) > someNumber
) ids ON t1.id = ids.id
Some DBs have special syntax to make this a little bit easier.
I may not have a full understanding of what you're trying to accomplish, but let's say you're trying to get it down to 1 row per employee. If each join is causing multiple rows per employee, and grouping by employee name and other fields is still not unique enough to get it down to a single row, then you can try ranking and partitioning, and select the rank you prefer within each employee partition.
See example : http://msdn.microsoft.com/en-us/library/ms176102.aspx
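The idea from the first answer, limiting the employee rows in a subquery before the join, can be sketched as follows with Python's sqlite3 module. Everything here is hypothetical: the assignment table, its columns, the data, and the use of LIMIT (SQLite, MySQL and PostgreSQL syntax, standing in for rownum on Oracle or TOP on SQL Server).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER, name TEXT);
    CREATE TABLE assignment (emp_id INTEGER, project TEXT);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO assignment VALUES
        (1, 'A'), (1, 'B'), (2, 'C'), (3, 'D'), (3, 'E');
""")

max_employees = 2

# Pick the employees first, then join: the limit applies to distinct
# employees, not to the rows produced by the join.
rows = conn.execute(
    """
    SELECT e.id, e.name, a.project
    FROM (SELECT id, name FROM employee ORDER BY id LIMIT ?) AS e
    INNER JOIN assignment AS a ON a.emp_id = e.id
    ORDER BY e.id, a.project
    """,
    (max_employees,),
).fetchall()

distinct_ids = {row[0] for row in rows}
print(rows, distinct_ids)
```

Note that any condition that decides which employees qualify has to go inside the subquery; otherwise the limit may select employees that the outer filters then discard.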

Getting distinct rows from a left outer join

I am building an application which dynamically generates sql to search for rows of a particular Table (this is the main domain class, like an Employee).
There are three tables Table1, Table2 and Table1Table2Map.
Table1 has a many to many relationship with Table2, and is mapped through Table1Table2Map table. But since Table1 is my main table the relationship is virtually like a one to many.
My app generates SQL that returns a result set containing rows from all these tables. The select clause and joins don't change, whereas the where clause is generated based on user interaction. In any case I don't want duplicate rows of Table1 in my result set, as it is the main table for result display. Right now the generated query looks like this:
select distinct Table1.Id as Id, Table1.Name, Table2.Description from Table1
left outer join Table1Table2Map on (Table1Table2Map.Table1Id = Table1.Id)
left outer join Table2 on (Table2.Id = Table1Table2Map.Table2Id)
For simplicity I have excluded the where clause. The problem is that when there are multiple rows in Table2 for a row in Table1, the result set has duplicate rows of Table1 even though I have said distinct on Table1.Id, as it has to select all the matching rows in Table2.
To elaborate more, consider that for a row in Table1 with Id = 1 there are two rows in Table1Table2Map (1, 1) and (1, 2) mapping Table1 to two rows in Table2 with ids 1, 2. The above mentioned query returns duplicate rows for this case. Now I want the query to return Table1 row with Id 1 only once. This is because there is only one row in Table2 that is like an active value for the corresponding entry in Table1 (this information is in Mapping table).
Is there a way I can avoid getting duplicate rows of Table1?
I think there is some basic problem in the way I am trying to solve the problem, but I am not able to find out what it is. Thanks in advance.
Try:
left outer join (select distinct YOUR_COLUMNS_HERE ...) SUBQUERY_ALIAS on ...
In other words, don't join directly against the table, join against a sub-query that limits the rows you join against.
You can use GROUP BY on Table1.Id, and that will get rid of the extra rows. You wouldn't need to worry about any mechanics on the join side.
I came up with this solution in a huge query, and it didn't affect the query time much.
NOTE: I'm answering this question 3 years after it was asked, but I believe this may help someone.
You can re-write your left joins to be outer applies, so that you can use a top 1 and an order by as follows:
select Table1.Id as Id, Table1.Name, t2.Description
from Table1
outer apply (
select top 1 *
from Table1Table2Map
where (Table1Table2Map.Table1Id = Table1.Id) and Table1Table2Map.IsActive = 1
order by somethingCol
) t1t2
outer apply (
select top 1 *
from Table2
where (Table2.Id = t1t2.Table2Id)
) t2;
Note that an outer apply without a "top" or an "order by" is exactly equivalent to a left outer join, it just gives you a little more control. (cross apply is equivalent to an inner join).
You can also do something similar using the row_number() function:
select * from (
select distinct Table1.Id as Id, Table1.Name, Table2.Description,
rowNum = row_number() over ( partition by table1.id order by something )
from Table1
left outer join Table1Table2Map on (Table1Table2Map.Table1Id = Table1.Id)
left outer join Table2 on (Table2.Id = Table1Table2Map.Table2Id)
) x
where rowNum = 1;
Most of this doesn't apply if the IsActive flag can narrow down your other tables to one row, but they might come in useful for you.
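The row_number() approach can be tried out with Python's sqlite3 module (window functions need SQLite 3.25 or later). This is a sketch under assumptions: the data is invented, ORDER BY Table2.Id stands in for the unspecified "something", and the alias is written with AS instead of the T-SQL rowNum = form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INTEGER, Name TEXT);
    CREATE TABLE Table2 (Id INTEGER, Description TEXT);
    CREATE TABLE Table1Table2Map (Table1Id INTEGER, Table2Id INTEGER);
    INSERT INTO Table1 VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Table2 VALUES (1, 'first'), (2, 'second');
    INSERT INTO Table1Table2Map VALUES (1, 1), (1, 2);  -- Ann maps twice
""")

# Number the joined rows within each Table1.Id and keep only the first.
rows = conn.execute("""
    SELECT Id, Name, Description FROM (
        SELECT Table1.Id AS Id, Table1.Name AS Name,
               Table2.Description AS Description,
               ROW_NUMBER() OVER (
                   PARTITION BY Table1.Id ORDER BY Table2.Id
               ) AS rowNum
        FROM Table1
        LEFT OUTER JOIN Table1Table2Map
            ON Table1Table2Map.Table1Id = Table1.Id
        LEFT OUTER JOIN Table2
            ON Table2.Id = Table1Table2Map.Table2Id
    ) x
    WHERE rowNum = 1
    ORDER BY Id
""").fetchall()

print(rows)
```

Each Table1 row comes back exactly once, including the one without any mapping; which Table2 row survives depends entirely on the ORDER BY inside the window.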
To elaborate on one point: you said that there is only one "active" row in Table2 per row in Table1. Is that row not marked as active such that you could put it in the where clause? Or is there some magic in the dynamic conditions supplied by the user that determines what's active and what isn't.
If you don't need to select anything from Table2, the solution is relatively simple in that you can use EXISTS, but since you've put Table2.Description in the select clause I'll assume that's not the case.
Basically what separates the relevant rows in Table2 from the irrelevant ones? Is it an active flag or a dynamic condition? The first row? That's really how you should be removing duplicates.
DISTINCT clauses tend to be overused. That may not be the case here but it sounds like it's possible that you're trying to hack out the results you want with DISTINCT rather than solving the real problem, which is a fairly common problem.
You have to include the activity condition in your join (and there is no need for distinct):
select Table1.Id as Id, Table1.Name, Table2.Description from Table1
left outer join Table1Table2Map on (Table1Table2Map.Table1Id = Table1.Id) and Table1Table2Map.IsActive = 1
left outer join Table2 on (Table2.Id = Table1Table2Map.Table2Id)
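Here is a small sketch of why the condition belongs in the join rather than in the where clause, run through Python's sqlite3 module. The data is invented, and IsActive is assumed to be a 0/1 column on the mapping table, as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INTEGER, Name TEXT);
    CREATE TABLE Table2 (Id INTEGER, Description TEXT);
    CREATE TABLE Table1Table2Map
        (Table1Id INTEGER, Table2Id INTEGER, IsActive INTEGER);
    INSERT INTO Table1 VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Table2 VALUES (1, 'first'), (2, 'second');
    INSERT INTO Table1Table2Map VALUES (1, 1, 0), (1, 2, 1);
""")

base = """
    SELECT Table1.Id AS Id, Table1.Name, Table2.Description
    FROM Table1
    LEFT OUTER JOIN Table1Table2Map
        ON Table1Table2Map.Table1Id = Table1.Id {on_extra}
    LEFT OUTER JOIN Table2
        ON Table2.Id = Table1Table2Map.Table2Id
    {where}
    ORDER BY Table1.Id
"""

# Condition in the ON clause: unmapped Table1 rows are kept with NULLs.
on_rows = conn.execute(base.format(
    on_extra="AND Table1Table2Map.IsActive = 1", where="")).fetchall()

# Same condition in WHERE: rows without an active mapping are dropped.
where_rows = conn.execute(base.format(
    on_extra="", where="WHERE Table1Table2Map.IsActive = 1")).fetchall()

print(on_rows, where_rows)
```

With the condition in the ON clause every Table1 row appears once; moving it to the where clause silently turns the left join into an inner join for this purpose.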
If you want to display multiple rows from table2, you will have duplicate data from table1 displayed. If you wanted to, you could use an aggregate function (e.g. MAX, MIN) on table2. This would eliminate the duplicate rows from table1, but would also hide some of the data from table2.
See also my answer on question #70161 for additional explanation