SQL Server - getting duplicate rows with different queries

I'm trying to get a count of child records (addresses) for each customer. I have 2 queries and I'm wondering if they're the same:
SELECT
a.AddressId, c.CustomerID, COUNT(*) AS NumDuplicates
FROM
Customers C
INNER JOIN
Addresses a ON c.AddressID = a.AddressID
GROUP BY
c.CustomerID, a.AddressId
ORDER BY
NumDuplicates DESC
SELECT
c.CustomerID,
(SELECT COUNT(*)
FROM Addresses a
WHERE a.AddressID = c.AddressID) AS AddressCount
FROM
Customers c
ORDER BY
AddressCount desc
If they're not, what's the difference? If they are, which is more efficient?

The two queries are different: the first returns only customers that have at least one match in the Addresses table, while the second returns all customers, including those with no match and those whose AddressID is NULL.
A version of the first query that is equivalent to the second is:
SELECT c.CustomerID, COUNT(a.AddressId) AS NumDuplicates
FROM Customers C LEFT JOIN
Addresses a
ON c.AddressID = a.AddressID
GROUP BY c.CustomerID
ORDER BY NumDuplicates DESC;
As for performance, you should try them out on your own data. Either might be faster: the second avoids the aggregation but runs a correlated subquery, while SQL Server has some tricks for speeding up joins and aggregation. I would guess that the correlated subquery version is faster, but I might be wrong for your data and server.
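To see the difference concretely, here is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in for SQL Server. The table and column names come from the question; the sample rows are made up, and the first query is reduced to CustomerID plus the count so the three results are easy to compare.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Addresses (AddressID INTEGER PRIMARY KEY);
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, AddressID INTEGER);
    INSERT INTO Addresses VALUES (10), (20);
    -- Hypothetical sample data: customer 3 has no address at all.
    INSERT INTO Customers VALUES (1, 10), (2, 20), (3, NULL);
""")

# First query from the question: INNER JOIN drops customers without a match.
inner_join_rows = conn.execute("""
    SELECT c.CustomerID, COUNT(*) AS NumDuplicates
    FROM Customers c
    INNER JOIN Addresses a ON c.AddressID = a.AddressID
    GROUP BY c.CustomerID, a.AddressID
    ORDER BY c.CustomerID
""").fetchall()

# Second query from the question: one output row per customer, always.
subquery_rows = conn.execute("""
    SELECT c.CustomerID,
           (SELECT COUNT(*) FROM Addresses a
            WHERE a.AddressID = c.AddressID) AS AddressCount
    FROM Customers c
    ORDER BY c.CustomerID
""").fetchall()

# LEFT JOIN rewrite from the answer: COUNT(a.AddressID) ignores the NULLs
# produced for unmatched customers, so they are reported with a count of 0.
left_join_rows = conn.execute("""
    SELECT c.CustomerID, COUNT(a.AddressID) AS NumDuplicates
    FROM Customers c
    LEFT JOIN Addresses a ON c.AddressID = a.AddressID
    GROUP BY c.CustomerID
    ORDER BY c.CustomerID
""").fetchall()

print(inner_join_rows)
print(subquery_rows)
print(left_join_rows)
```

With this sample data the INNER JOIN version omits customer 3, while the correlated subquery and the LEFT JOIN version both report it with a count of 0.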

Related

Join with a groupby operation in SQL

I am playing with the W3Schools SQL environment, where a pre-defined database is set up.
Tables to be used: Customers and Orders.
To get all the info from Customer we can do:
SELECT * FROM [Customers]
To get customers who have fewer than 3 orders, we do:
SELECT CustomerID, count(*) as num_orders FROM [Orders] group by customerID having num_orders<3
To get the Customers we have in London, we do:
SELECT * FROM [Customers] where city="London"
Question: For every customer in London with fewer than 3 orders, how can I get the number of orders they have?
I know it has to be a left join, since I want to keep all customers even if they have no orders (so, no records in "Orders"), but I am having a hard time making it work.
I tried:
SELECT * FROM [Customers] where city="London"
left join (SELECT CustomerID, count(*) as num_orders FROM [Orders] group by customerID having num_orders<3) as data
on customers.CustomerID= data.CustomerID
But the environment gives no meaningful information about the error.
The proper syntax is:
SELECT c.*, o.num_orders
FROM [Customers] c LEFT JOIN
(SELECT o.CustomerID, COUNT(*) as num_orders
FROM [Orders] o
GROUP BY o.customerID
) o
ON c.CustomerID = o.CustomerID
WHERE c.city = 'London';
Notes:
The most important difference is the order of the clauses: WHERE comes after the entire FROM clause, including its joins.
The HAVING clause is removed, because the question is for all customers in London.
Single quotes are used to quote London. Single quotes are the standard string delimiter.
The query uses table aliases, and these are specifically chosen to be very short and related to the table names.
All columns are qualified.
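If the "fewer than 3 orders" condition is still wanted, one possible variation is to filter on the joined count in the outer WHERE clause, treating a missing count as 0. The sketch below runs it with Python's sqlite3 module rather than the W3Schools environment. The sample rows, the CustomerName column, and the alias oc are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY,
                            CustomerName TEXT, City TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    -- Hypothetical data: Ann has 3 orders, Bob has 1, Cy has none.
    INSERT INTO Customers VALUES (1, 'Ann', 'London'), (2, 'Bob', 'London'),
                                 (3, 'Cy', 'London'), (4, 'Di', 'Berlin');
    INSERT INTO Orders (CustomerID) VALUES (1), (1), (1), (2), (4);
""")

london_rows = conn.execute("""
    SELECT c.CustomerID, c.CustomerName,
           COALESCE(oc.num_orders, 0) AS num_orders
    FROM Customers c
    LEFT JOIN (SELECT o.CustomerID, COUNT(*) AS num_orders
               FROM Orders o
               GROUP BY o.CustomerID) oc
           ON c.CustomerID = oc.CustomerID
    WHERE c.City = 'London'
      AND COALESCE(oc.num_orders, 0) < 3   -- customers with no orders count as 0
    ORDER BY c.CustomerID
""").fetchall()

print(london_rows)
```

Putting the condition in the outer WHERE, rather than in a HAVING clause inside the derived table, keeps customers with no orders in the result, because their count is NULL after the LEFT JOIN and COALESCE turns it into 0.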

Order of Execution of Subqueries in SQL

SELECT customerid,
(SELECT COUNT(*)
FROM orders
WHERE customers.customerid = orders.customerid) as total_orders
FROM customers
Can anyone explain how this SQL code works? It seems to me that the subquery should return the same number every time, because the total number of rows where customers.customerid = orders.customerid is always the same. But it displays each customer and the total_orders made by him/her. What is the order of execution that results in this?
Please find the database here:
https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_distinct
Your query is:
SELECT c.customerid,
(SELECT COUNT(*)
FROM orders o
WHERE c.customerid = o.customerid
) as total_orders
FROM customers c;
(Note that I added table aliases and qualified all column names.)
This is a scalar, correlated subquery. It is a scalar subquery because it returns a single value (rather than a table).
It is correlated because the subquery is linked to the outer query. This is the part that confuses you.
Basically, the outer query says that the result set will have one row for each customer.
The subquery then says that, for each of those rows, the value is the number of rows in orders that match the customer in that row.
Although writing the query with a subquery is totally fine, this would often be written as:
SELECT c.customerid, COUNT(o.customerid) as total_orders
FROM customers c LEFT JOIN
orders o
ON c.customerid = o.customerid
GROUP BY c.customerId
You are using a correlated subquery, which means the inner query is logically evaluated once for each row of the outer query.
In your case, the inner query is evaluated for every customer, and the WHERE clause customers.customerid = orders.customerid restricts it to that customer's orders. So the aggregate function COUNT(*) returns the number of orders for each customer. Since your outer query selects customerid and total_orders, you get 2 columns.
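The behaviour described in these answers can be checked with a small sketch using Python's sqlite3 module. The sample rows are hypothetical; the queries are the ones discussed above, with an ORDER BY added so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customerid INTEGER PRIMARY KEY);
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, customerid INTEGER);
    -- Hypothetical data: customer 1 has 2 orders, 2 has 1, 3 has none.
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders (customerid) VALUES (1), (1), (2);
""")

# Without correlation the count is a single number for the whole table.
total_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Correlated: c.customerid takes a different value for each outer row,
# so the subquery yields a different count per customer.
per_customer = conn.execute("""
    SELECT c.customerid,
           (SELECT COUNT(*) FROM orders o
            WHERE c.customerid = o.customerid) AS total_orders
    FROM customers c
    ORDER BY c.customerid
""").fetchall()

# The LEFT JOIN / GROUP BY form produces the same result.
via_join = conn.execute("""
    SELECT c.customerid, COUNT(o.customerid) AS total_orders
    FROM customers c
    LEFT JOIN orders o ON c.customerid = o.customerid
    GROUP BY c.customerid
    ORDER BY c.customerid
""").fetchall()

print(total_rows, per_customer, via_join)
```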

Getting count of number of rows of data retrieved from several SQL Tables on an SQL Server

I am working on a SQL query that returns several rows of data from tables on SQL Server using joins. But I only want the query to return the number of rows that the following SQL query returns:
SELECT C.ContactID,
C.FirstName,
C.LastName,
SP.SalesPersonID,
SP.CommissionPct,
SP.SalesYTD,
SP.SalesLastYear,
SP.Bonus,
ST.TerritoryID,
ST.Name,
ST.[Group],
ST.SalesYTD
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
ORDER BY ST.TerritoryID, C.LastName
How can I get the number of rows the above query returns? I could do it easily with a SQL view, but I don't want to create a view on the server as I only have read permissions on the database.
Is there a better way to solve this given the restrictions I have on the database?
I would just do this as two queries. One like you listed, and the other with a COUNT(*):
SELECT COUNT(*)
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
-- ORDER BY omitted: a single-row count needs no ordering, and SQL Server rejects ordering by non-aggregated columns here
This will return a scalar result, so you don't waste bandwidth at any point. But it really depends on how you expect to use it. Dave's answer is appropriate if you need to pull all the records back no matter what, but if that's the case I would just check your List<>.Count or [].Length properties.
You could also add in the column COUNT(*) OVER() AS [ResultCount], but remember that that will return the same value for every row. Again, it just depends how you want to do this.
Use @@ROWCOUNT (MSDN). It returns the number of rows affected (selected, updated, deleted, etc.) by the previous statement.
SELECT C.ContactID,
C.FirstName,
C.LastName,
SP.SalesPersonID,
SP.CommissionPct,
SP.SalesYTD,
SP.SalesLastYear,
SP.Bonus,
ST.TerritoryID,
ST.Name,
ST.[Group],
ST.SalesYTD
FROM Person.Contact C
INNER JOIN Sales.SalesPerson SP
ON C.ContactID = SP.SalesPersonID
FULL OUTER JOIN Sales.SalesTerritory ST
ON ST.TerritoryID = SP.TerritoryID
ORDER BY ST.TerritoryID, C.LastName
SELECT @@ROWCOUNT
One way is to add "SELECT @@ROWCOUNT" as the next command after your query. Alternatively, you could group by all your fields and add a COUNT(*) to the select list.
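Another read-only option, not suggested in the answers above, is to wrap the original query in a derived table and count its rows. The sketch below shows the idea with Python's sqlite3 module and a cut-down, hypothetical version of the tables (no schemas, fewer columns, no FULL OUTER JOIN). In SQL Server the inner query would also need its ORDER BY removed and its two SalesYTD columns given distinct aliases, since a derived table requires unique column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contact (ContactID INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE SalesPerson (SalesPersonID INTEGER PRIMARY KEY,
                              TerritoryID INTEGER);
    -- Hypothetical data: only contacts 1 and 2 are salespeople.
    INSERT INTO Contact VALUES (1, 'Adams'), (2, 'Baker'), (3, 'Clark');
    INSERT INTO SalesPerson VALUES (1, 100), (2, 200);
""")

# Stand-in for the detail query whose rows we want to count.
detail_query = """
    SELECT c.ContactID, c.LastName, sp.TerritoryID
    FROM Contact c
    INNER JOIN SalesPerson sp ON c.ContactID = sp.SalesPersonID
"""

# Wrap the detail query in a derived table (alias q) and count its rows.
row_count = conn.execute(
    f"SELECT COUNT(*) FROM ({detail_query}) AS q"
).fetchone()[0]

# Cross-check against actually fetching the rows.
rows = conn.execute(detail_query).fetchall()
print(row_count, len(rows))
```

This needs no view and no write permissions, and only the single count travels back to the client.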

Give all rows appear in another table at least specific number of times

I am using the sample database and I want to write a query on the tables Customers and Orders that returns all customers who have made more than 2 orders. I achieve that with the query:
Select Customers.*
From Customers
Where Customers.CustomerID IN(
Select Orders.CustomerID
From Orders
Group by Orders.CustomerID
Having count(*)>2
);
I cannot understand why the query:
SELECT Customers.*
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID
GROUP BY Customers.CustomerID
HAVING COUNT(*)>2;
cannot give the same results. The message from the database is:
"Cannot group on fields selected with '*' (Customers)."
I was under the impression that it should work, since Customers.CustomerID is included in the columns requested by the SELECT statement. What is the problem, and how could I modify the second query to make it work, even though it probably executes superfluous statements?
From SQL GROUP BY Statement
The GROUP BY statement is used in conjunction with the aggregate
functions to group the result-set by one or more columns.
SQL GROUP BY Syntax
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name;
So when aggregating, you need to specify which columns you are grouping by, and for that you cannot use *.
You have to list the columns explicitly in both the SELECT and GROUP BY clauses.
Specify the columns you need in the SELECT statement:
SELECT Customers.CustomerID, Customers.CustomerName
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID
GROUP BY Customers.CustomerID, Customers.CustomerName
HAVING COUNT(*)>2;
Your first solution actually consists of two queries.
The second part determines the returning customers by listing their CustomerID:
Select o.CustomerID
From Sales.SalesOrderHeader o
Group by o.CustomerID
Having count(*)>2
And the first part displays the customer details, using the list of returning customers produced by the second query:
Select c.*
From Sales.Customer c
Where
c.CustomerID IN(
...
);
It is not possible to return all customer data while finding the repeated customer IDs with GROUP BY on the order table alone.
But instead of GROUP BY, a SQL aggregate function (COUNT below) with a PARTITION BY clause can be used here.
Please check the tutorial http://www.kodyaz.com/t-sql/sql-count-function-with-partition-by-clause.aspx and have a look at the following query:
select * from (
SELECT distinct
c.*,
COUNT(o.SalesOrderID) over (partition by c.CustomerId) cnt
FROM Sales.SalesOrderHeader o
INNER JOIN Sales.Customer c ON o.CustomerID = c.CustomerID
) t where cnt > 2
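As a quick check that the IN subquery and the explicit-column GROUP BY return the same customers, here is a sketch with Python's sqlite3 module. The sample rows are hypothetical, and only CustomerID and CustomerName are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    -- Hypothetical data: Ann has 3 orders, Bob has 2.
    INSERT INTO Customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders (CustomerID) VALUES (1), (1), (1), (2), (2);
""")

# Original working query: filter customers by a grouped subquery.
in_rows = conn.execute("""
    SELECT c.CustomerID, c.CustomerName
    FROM Customers c
    WHERE c.CustomerID IN (SELECT o.CustomerID
                           FROM Orders o
                           GROUP BY o.CustomerID
                           HAVING COUNT(*) > 2)
    ORDER BY c.CustomerID
""").fetchall()

# Join version: every selected column is also listed in GROUP BY.
group_rows = conn.execute("""
    SELECT c.CustomerID, c.CustomerName
    FROM Orders o
    INNER JOIN Customers c ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.CustomerName
    HAVING COUNT(*) > 2
    ORDER BY c.CustomerID
""").fetchall()

print(in_rows, group_rows)
```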

Difference between putting a condition in the JOIN clause versus the WHERE clause

Suppose I have 3 tables.
Sales Rep
Rep Code
First Name
Last Name
Phone
Email
Sales Team
Orders
Order Number
Rep Code
Customer Number
Order Date
Order Status
Customer
Customer Number
Name
Address
Phone Number
I want to get a detailed report of sales for 2010, so I would be doing a join. I am interested in knowing which of the following is more efficient, and why:
SELECT
O.OrderNum, R.Name, C.Name
FROM
Order O INNER JOIN Rep R ON O.RepCode = R.RepCode
INNER JOIN Customer C ON O.CustomerNumber = C.CustomerNumber
WHERE
O.OrderDate >= '01/01/2010'
OR
SELECT
O.OrderNum, R.Name, C.Name
FROM
Order O INNER JOIN Rep R ON (O.RepCode = R.RepCode AND O.OrderDate >= '01/01/2010')
INNER JOIN Customer C ON O.CustomerNumber = C.CustomerNumber
JOINs should reflect the relationships between your tables; the WHERE clause is where you filter records. I prefer the first one.
Make it readable first, so that table relationships are obvious (by using JOINs), then profile.
Efficiency-wise, the only way to know is to profile it; different databases have different planners for executing the query.
Some databases might apply the filter first and then do the join; others might join the tables first and apply the WHERE clause afterwards. To profile, use EXPLAIN SELECT ... on Postgres and MySQL; in SQL Server use the execution plan (Ctrl+K), where you can see the cost of the two queries relative to each other.
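Performance aside, it may help to see when the two placements change the result. For an INNER JOIN the filter gives the same rows in either place; with a LEFT JOIN it does not. The sketch below uses Python's sqlite3 module with hypothetical data, a table named Orders (Order is a reserved word), and ISO-format date strings; it says nothing about which form is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Rep (RepCode INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderNum INTEGER PRIMARY KEY, RepCode INTEGER,
                         OrderDate TEXT);
    -- Hypothetical data: Bob's only order predates 2010.
    INSERT INTO Rep VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders VALUES (100, 1, '2010-03-01'), (101, 2, '2009-06-01');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# INNER JOIN: filter in WHERE versus filter in ON.
inner_where = run("""
    SELECT o.OrderNum, r.Name
    FROM Orders o INNER JOIN Rep r ON o.RepCode = r.RepCode
    WHERE o.OrderDate >= '2010-01-01'
    ORDER BY o.OrderNum
""")
inner_on = run("""
    SELECT o.OrderNum, r.Name
    FROM Orders o INNER JOIN Rep r
         ON o.RepCode = r.RepCode AND o.OrderDate >= '2010-01-01'
    ORDER BY o.OrderNum
""")

# LEFT JOIN from Rep: the ON filter keeps reps without qualifying orders,
# the WHERE filter removes them after the join.
left_on = run("""
    SELECT r.Name, o.OrderNum
    FROM Rep r LEFT JOIN Orders o
         ON o.RepCode = r.RepCode AND o.OrderDate >= '2010-01-01'
    ORDER BY r.RepCode
""")
left_where = run("""
    SELECT r.Name, o.OrderNum
    FROM Rep r LEFT JOIN Orders o ON o.RepCode = r.RepCode
    WHERE o.OrderDate >= '2010-01-01'
    ORDER BY r.RepCode
""")

print(inner_where, inner_on, left_on, left_where)
```

So for the inner joins in the question the choice is about readability and, possibly, the plan the optimizer picks; the returned rows are the same.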