SQL: Why are DISTINCT and MAX not removing duplicates?

Shouldn't the following query remove duplicates?
SELECT DISTINCT Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType
FROM DimAccount AS ACC RIGHT OUTER JOIN
(SELECT DISTINCT PropertyID, MAX(TenancyStartDate) AS Tenancystart
FROM DimAccount
WHERE (AccountStatus = 'Current')
GROUP BY PropertyID, TenancyStartDate) AS Relevant ON ACC.PropertyID = Relevant.PropertyID AND ACC.TenancyStartDate = Relevant.Tenancystart
GROUP BY Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType, ACC.TenancyType
My understanding (and what I want to happen) is that the query in brackets selects each property ID and, among the accounts with a status of 'Current', returns the highest tenancy start date (albeit several times). This is then joined back to the original table on start date and property ID, to get the most recent tenancy type.
Why is it still returning duplicate lines!?
(By the way, this relates to another question I asked yesterday, but apparently replies are not supposed to descend into conversation, so I thought I'd separate this off. I hope that is the right thing to do. I have searched, but clearly there is something missing in my understanding!)

First, you almost never need select distinct when using group by.
The problem with your query is the group by clause in the subquery.
SELECT Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType
FROM DimAccount ACC RIGHT OUTER JOIN
(SELECT PropertyID, MAX(TenancyStartDate) AS Tenancystart
FROM DimAccount
WHERE (AccountStatus = 'Current')
GROUP BY PropertyID
) Relevant
ON ACC.PropertyID = Relevant.PropertyID AND
ACC.TenancyStartDate = Relevant.Tenancystart
GROUP BY Relevant.PropertyID, ACC.TenancyStartDate, ACC.AccountID, ACC.TenancyType;
It should not have TenancyStartDate. Also, your outer query had ACC.TenancyType twice in the group by.
That said, it is easier to write the query using analytic functions:
select a.*
from (select a.*,
max(tenancystartdate) over (partition by propertyid) as max_tsd
from dimaccount a
where accountstatus = 'Current'
) a
where tenancystartdate = max_tsd;
This is not exactly the same as your query, because your query will take non-current records into account. I am guessing that this might be the intention, however.
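To see what the analytic version returns, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The sample rows are invented for illustration, and it assumes a SQLite build with window functions (3.25 or later); the asker's real DBMS may behave differently in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: property 200 has a newer, non-current account.
conn.executescript("""
    CREATE TABLE DimAccount (
        AccountID INTEGER PRIMARY KEY,
        PropertyID INTEGER,
        TenancyStartDate TEXT,
        AccountStatus TEXT
    );
    INSERT INTO DimAccount VALUES
        (1, 100, '2019-01-01', 'Current'),
        (2, 100, '2021-06-01', 'Current'),
        (3, 200, '2020-03-15', 'Current'),
        (4, 200, '2022-09-01', 'Former');
""")

# Same shape as the analytic query above: compute the latest 'Current'
# start date per property, then keep only the rows that carry that date.
latest_current = conn.execute("""
    SELECT AccountID, PropertyID, TenancyStartDate
    FROM (SELECT a.*,
                 MAX(TenancyStartDate) OVER (PARTITION BY PropertyID) AS max_tsd
          FROM DimAccount a
          WHERE AccountStatus = 'Current') a
    WHERE TenancyStartDate = max_tsd
    ORDER BY PropertyID
""").fetchall()

print(latest_current)  # [(2, 100, '2021-06-01'), (3, 200, '2020-03-15')]
```

Each property comes back once, and the 'Former' account on property 200 is ignored because the filter is applied before the window function runs.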

To answer your question: Yes, you are right, there can be no duplicates. And I am pretty sure there are none. I am also pretty sure that your query does not do what you think it does.
This is your derived table:
SELECT DISTINCT PropertyID, MAX(TenancyStartDate) AS Tenancystart
FROM DimAccount
WHERE (AccountStatus = 'Current')
GROUP BY PropertyID, TenancyStartDate
As you group by PropertyID and TenancyStartDate, you get one line per PropertyID and TenancyStartDate. For each such line you want the MAX(TenancyStartDate), which is the TenancyStartDate itself of course. There is no other field you aggregate, so you don't aggregate at all, but only make the rows distinct, for which one would use DISTINCT. Then you do use DISTINCT to get unique result records, but your records are already unique, by your obfuscated way of doing it. So you say: select the distinct records of distinct records. Your subquery can be re-written as:
SELECT DISTINCT PropertyID, TenancyStartDate
FROM DimAccount
WHERE AccountStatus = 'Current'
Then you outer-join the DimAccount table. So you would keep your found records, even in case there is no matching DimAccount record. But: You've selected from DimAccount, so of course there is always at least the one record you already found. Your outer join is actually an inner join. Then the only field from the derived query shown is PropertyID which always equals ACC.PropertyID. This means: You are only selecting records from ACC and the derived table is just to make sure a 'Current' record exists for PropertyID and TenancyStartDate. Your query could thus be re-written as:
SELECT DISTINCT
PropertyID, TenancyStartDate, AccountID, TenancyType
FROM DimAccount AS ACC
WHERE EXISTS
(
SELECT *
FROM DimAccount CurrentAccount
WHERE CurrentAccount.AccountStatus = 'Current'
AND CurrentAccount.PropertyID = ACC.PropertyID
AND CurrentAccount.TenancyStartDate = ACC.TenancyStartDate
);
In case PropertyID + TenancyStartDate + AccountID + TenancyType are unique (is AccountID the table's ID?) then you can even remove DISTINCT.
This query gets all 'Current' DimAccount records first and then gives you all records with the same PropertyID and TenancyStartDate. However, from your explanation it seems you want to select the latest 'Current' DimAccount record per PropertyID. That is something else entirely. There are different solutions to such a task depending on the DBMS you are using (you haven't specified yours in your tags).
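The claim that grouping by both PropertyID and TenancyStartDate makes MAX a no-op can be checked on a few rows. This is only a sketch on made-up data, run through Python's sqlite3 module; the table contents are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: property 100 has two 'Current' tenancies.
conn.executescript("""
    CREATE TABLE DimAccount (
        AccountID INTEGER PRIMARY KEY,
        PropertyID INTEGER,
        TenancyStartDate TEXT,
        AccountStatus TEXT
    );
    INSERT INTO DimAccount VALUES
        (1, 100, '2019-01-01', 'Current'),
        (2, 100, '2021-06-01', 'Current'),
        (3, 200, '2020-03-15', 'Current');
""")

# The asker's derived table: every group holds exactly one date,
# so MAX() just returns that date.
grouped_both = conn.execute("""
    SELECT PropertyID, MAX(TenancyStartDate)
    FROM DimAccount
    WHERE AccountStatus = 'Current'
    GROUP BY PropertyID, TenancyStartDate
    ORDER BY 1, 2
""").fetchall()

# The equivalent rewrite with DISTINCT only.
distinct_pairs = conn.execute("""
    SELECT DISTINCT PropertyID, TenancyStartDate
    FROM DimAccount
    WHERE AccountStatus = 'Current'
    ORDER BY 1, 2
""").fetchall()

# Grouping by PropertyID alone is what actually picks the latest date.
grouped_by_property = conn.execute("""
    SELECT PropertyID, MAX(TenancyStartDate)
    FROM DimAccount
    WHERE AccountStatus = 'Current'
    GROUP BY PropertyID
    ORDER BY 1
""").fetchall()

print(grouped_both == distinct_pairs)  # True
print(grouped_by_property)  # [(100, '2021-06-01'), (200, '2020-03-15')]
```

The first two queries return three rows each (both dates for property 100), while the third returns one row per property.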

Related

Get row from one table, plus COUNT from a related table

I'm trying to build an SQL query where I grab one table's information (WHERE shops.shop_domain = X) along with the COUNT of the customers table WHERE customers.product_id = 4242451.
The shops table DOES NOT have product.id in it, but the customers table DOES HAVE the shop_domain in it, hence my attempt to do some sort of join.
I essentially want to return the following:
shops.id
shops.name
shops.shop_domain
COUNT OF CUSTOMERS WHERE customers.product_id = '4242451'
Here is my not so lovely attempt at the query.
I think I have the idea right (maybe...) but I can't wrap my head around building this query.
SELECT shops.id, shops.name, shops.shop_domain, COUNT(customers.customer_id)
FROM shops
LEFT JOIN customers ON shops.shop_domain = customers.shop_domain
WHERE shops.shop_domain = 'myshop.com' AND
customers.product_id = '4242451'
GROUP BY shops.shop_id
Relevant database schemas:
shops:
id, name, shop_domain
customers:
id, name, product_id, shop_domain
You are close. The condition on customers needs to go in the ON clause, because this is a LEFT JOIN and customers is the second table:
SELECT s.id, s.name, s.shop_domain, COUNT(c.customer_id)
FROM shops s LEFT JOIN
customers c
ON s.shop_domain = c.shop_domain AND c.product_id = '4242451'
WHERE s.shop_domain = 'myshop.com'
GROUP BY s.id, s.name, s.shop_domain;
I am also inclined to include all three columns in the GROUP BY, although Postgres (and ANSI/ISO standards) are happy with just id if it is declared as the primary key in the table.
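To illustrate why the filter has to move into the ON clause, here is a small sketch using Python's sqlite3 module with invented rows. It counts customers.id, following the schema listed in the question; that column choice is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: the shop has customers, but none bought product 4242451.
conn.executescript("""
    CREATE TABLE shops (id INTEGER PRIMARY KEY, name TEXT, shop_domain TEXT);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT,
                            product_id INTEGER, shop_domain TEXT);
    INSERT INTO shops VALUES (1, 'My Shop', 'myshop.com');
    INSERT INTO customers VALUES
        (1, 'Ann', 111, 'myshop.com'),
        (2, 'Bob', 222, 'myshop.com');
""")

# Filter in WHERE: rows where c.product_id is NULL or different are
# discarded after the join, so the shop itself disappears.
filter_in_where = conn.execute("""
    SELECT s.id, s.name, s.shop_domain, COUNT(c.id)
    FROM shops s LEFT JOIN customers c
         ON s.shop_domain = c.shop_domain
    WHERE s.shop_domain = 'myshop.com' AND c.product_id = 4242451
    GROUP BY s.id, s.name, s.shop_domain
""").fetchall()

# Filter in ON: the shop row survives and the count is simply 0.
filter_in_on = conn.execute("""
    SELECT s.id, s.name, s.shop_domain, COUNT(c.id)
    FROM shops s LEFT JOIN customers c
         ON s.shop_domain = c.shop_domain AND c.product_id = 4242451
    WHERE s.shop_domain = 'myshop.com'
    GROUP BY s.id, s.name, s.shop_domain
""").fetchall()

print(filter_in_where)  # []
print(filter_in_on)     # [(1, 'My Shop', 'myshop.com', 0)]
```

With the condition in WHERE the LEFT JOIN effectively becomes an inner join, which is why a shop with no matching customers returns no row at all.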
A correlated subquery should be substantially cheaper (and simpler) for the purpose:
SELECT id, name, shop_domain
, (SELECT count(*)
FROM customers
WHERE shop_domain = s.shop_domain
AND product_id = 4242451) AS special_count
FROM shops s
WHERE shop_domain = 'myshop.com';
This way you only need to aggregate in the subquery, and need not worry about undesired effects on the outer query.
This assumes product_id is a numeric data type, so I use a numeric literal (4242451) instead of the string literal '4242451', which might cause problems otherwise.

I expect these 2 sql statements to return same number of rows

In my mind these 2 sql statements are equivalent.
My understanding is:
With the first one I am pulling all rows from tmpPerson and keeping those that do not have a matching person id. This query returns 211 records.
The second one says: give me all tmpPerson rows whose id isn't in person. This returns nothing.
Obviously they are not equivalent, or they'd have the same results. So what am I missing? Thanks.
select p.id, bp.id
From person p
right join(
select distinct id
from tmpPerson
) bp
on p.id= bp.id
where p.id is null
select id
from tmpPerson
where id not in (select id from person)
I pulled some ids from the first result set and found no matching records for them in person, so I'm guessing the first one is accurate, but I'm still surprised they're different.
I much prefer left joins to right joins, so let's write the first query as:
select p.id, bp.id
From (select distinct id
from tmpPerson
) bp left join
person p
on p.id = bp.id
where p.id is null;
(The preference is because the result set keeps all the rows in the first table rather than the last table. When reading the from clause, I immediately know what the first table is.)
The second is:
select id
from tmpPerson
where id not in (select id from person);
These are not equivalent for two reasons. The most likely reason in your case is that you have duplicate ids in tmpPerson. The first version removes the duplicates. The second doesn't. This is easily fixed by putting distincts in the right place.
The more subtle reason has to do with the semantics of not in. If any person.id has a NULL value, then all rows will be filtered out. I don't think that is the case with your query, but it is a difference.
I strongly recommend using not exists instead of not in for the reason just described:
select tp.id
from tmpPerson tp
where not exists (select 1 from person p where p.id = tp.id);
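Both differences can be reproduced on a handful of rows. The sketch below uses Python's sqlite3 module with invented data: tmpPerson contains a duplicate id, and person contains one NULL id. An empty NOT IN result, like the one reported for the second query, is what a NULL in person.id would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: id 2 is missing from person and duplicated in tmpPerson;
# person also holds a NULL id.
conn.executescript("""
    CREATE TABLE person (id INTEGER);
    CREATE TABLE tmpPerson (id INTEGER);
    INSERT INTO person VALUES (1), (NULL);
    INSERT INTO tmpPerson VALUES (1), (2), (2);
""")

# Anti-join with DISTINCT: the duplicate collapses to one row.
left_join = conn.execute("""
    SELECT bp.id
    FROM (SELECT DISTINCT id FROM tmpPerson) bp
    LEFT JOIN person p ON p.id = bp.id
    WHERE p.id IS NULL
""").fetchall()

# NOT IN: "2 NOT IN (1, NULL)" evaluates to unknown, so nothing passes.
not_in = conn.execute("""
    SELECT id FROM tmpPerson
    WHERE id NOT IN (SELECT id FROM person)
""").fetchall()

# NOT EXISTS: unaffected by the NULL, and keeps the duplicate rows.
not_exists = conn.execute("""
    SELECT tp.id FROM tmpPerson tp
    WHERE NOT EXISTS (SELECT 1 FROM person p WHERE p.id = tp.id)
    ORDER BY tp.id
""").fetchall()

print(left_join)   # [(2,)]
print(not_in)      # []
print(not_exists)  # [(2,), (2,)]
```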
select id
from tmpPerson
where id not in (select id from person)
If there is a NULL id in tmpPerson, that row will not be returned by this query, but it will be returned by your first query. Using ISNULL resolves the difference:
where isnull(id, 'N') not in (select id from person)

SQL Server 2016 Sub Query Guidance

I am currently working on an assignment for my SQL class and I am stuck. I'm not looking for full code to answer the question, just a little nudge in the right direction. If you do provide full code would you mind a small explanation as to why you did it that way (so I can actually learn something.)
Here is the question:
Write a SELECT statement that returns three columns: EmailAddress, ShipmentId, and the order total for each Client. To do this, you can group the result set by the EmailAddress and ShipmentId columns. In addition, you must calculate the order total from the columns in the ShipItems table.
Write a second SELECT statement that uses the first SELECT statement in its FROM clause. The main query should return two columns: the Client’s email address and the largest order for that Client. To do this, you can group the result set by the EmailAddress column.
I am confused on how to pull in the EmailAddress column from the Clients table, as in order to join it I have to bring in other tables that aren't being used. I am assuming there is an easier way to do this using sub Queries as that is what we are working on at the time.
Think of SQL as working with sets of data as opposed to just tables. Tables are merely a set of data. So when you view data this way you immediately see that the query below returns a set of data consisting of the entirety of another set, being a table:
SELECT * FROM MyTable1
Now, if you were to only get the first two columns from MyTable1 you would return a different set that consisted only of columns 1 and 2:
SELECT col1, col2 FROM MyTable1
Now you can treat this second set, a subset of data as a "table" as well and query it like this:
SELECT
*
FROM (
SELECT
col1,
col2
FROM
MyTable1
) AS t -- SQL Server requires an alias on a derived table
This returns every column of the inner set, which here means just col1 and col2.
So, your inner query, which I won't write for you since you appear to be a student, and that wouldn't be right for me to give you the entire answer, would be a query consisting of a GROUP BY clause and a SUM of the order value field. But the key thing you need to understand is this set thinking: you can just wrap the ENTIRE query inside brackets and treat it as a table the way I have done above. Hopefully this helps.
You need a subquery, like this:
select emailaddress, max(OrderTotal) as MaxOrder
from
( -- Open the subquery
select Cl.emailaddress,
Sh.ShipmentID,
sum(SI.Value) as OrderTotal -- Use the line item value column in here
from Clients Cl -- First table
inner join Shipments Sh -- Join the shipments
on Sh.ClientID = Cl.ClientID
inner join ShipItems SI -- Now the items
on SI.ShipmentID = Sh.ShipmentID
group by Cl.emailaddress, Sh.ShipmentID -- here's your grouping for the sum() aggregation
) Totals -- Close subquery (SQL Server requires an alias here)
group by emailaddress -- group for the max()
For the first query you can join the Clients to Shipments (on ClientId).
And Shipments to the ShipItems table (on ShipmentId).
Then group the results, and count or sum the total you need.
Using aliases for the tables is useful, especially when you select fields from joined tables that have the same column name.
select
c.EmailAddress,
i.ShipmentId,
SUM((i.ShipItemPrice - i.ShipItemDiscountAmount) * i.Quantity) as TotalPriceDiscounted
from ShipItems i
join Shipments s on (s.ShipmentId = i.ShipmentId)
left join Clients c on (c.ClientId = s.ClientId)
group by i.ShipmentId, c.EmailAddress
order by i.ShipmentId, c.EmailAddress;
Using that grouped query in a subquery, you can get the Maximum total per EmailAddress.
select EmailAddress,
-- max(TotalShipItems) as MaxTotalShipItems,
max(TotalPriceDiscounted) as MaxTotalPriceDiscounted
from (
select
c.EmailAddress,
-- i.ShipmentId,
-- count(*) as TotalShipItems,
SUM((i.ShipItemPrice - i.ShipItemDiscountAmount) * i.Quantity) as TotalPriceDiscounted
from ShipItems i
join Shipments s on (s.ShipmentId = i.ShipmentId)
left join Clients c on (c.ClientId = s.ClientId)
group by i.ShipmentId, c.EmailAddress
) q
group by EmailAddress
order by EmailAddress
Note that an ORDER BY inside a subquery is mostly meaningless; SQL Server rejects it there unless TOP (or OFFSET) is also specified.
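As a rough check of the two-step approach (total per shipment in a derived table, then the maximum per client), here is a sketch run on SQLite through Python. The table and column names follow the queries above, but the rows are invented and the real assignment schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows; prices are whole numbers to keep the arithmetic obvious.
conn.executescript("""
    CREATE TABLE Clients (ClientId INTEGER PRIMARY KEY, EmailAddress TEXT);
    CREATE TABLE Shipments (ShipmentId INTEGER PRIMARY KEY, ClientId INTEGER);
    CREATE TABLE ShipItems (ShipmentId INTEGER, ShipItemPrice INTEGER,
                            ShipItemDiscountAmount INTEGER, Quantity INTEGER);
    INSERT INTO Clients VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO Shipments VALUES (10, 1), (11, 1), (20, 2);
    INSERT INTO ShipItems VALUES
        (10, 10, 0, 2),   -- shipment 10: 20
        (10, 5, 1, 1),    -- shipment 10: 20 + 4 = 24
        (11, 50, 10, 1),  -- shipment 11: 40
        (20, 8, 0, 3);    -- shipment 20: 24
""")

# Inner query: one total per (client, shipment).
# Outer query: the largest of those totals per client.
max_per_client = conn.execute("""
    SELECT EmailAddress, MAX(OrderTotal) AS MaxOrder
    FROM (
        SELECT c.EmailAddress,
               s.ShipmentId,
               SUM((i.ShipItemPrice - i.ShipItemDiscountAmount) * i.Quantity)
                   AS OrderTotal
        FROM ShipItems i
        JOIN Shipments s ON s.ShipmentId = i.ShipmentId
        JOIN Clients c ON c.ClientId = s.ClientId
        GROUP BY c.EmailAddress, s.ShipmentId
    ) q
    GROUP BY EmailAddress
    ORDER BY EmailAddress
""").fetchall()

print(max_per_client)  # [('a@example.com', 40), ('b@example.com', 24)]
```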

NTILE Function and Using Inner Join in Oracle

I am supposed to use the given database (it's pretty huge, so I used codeshare) to list the last names and customer numbers of the top 5% of customers for each branch. To find the top 5% of customers, I decided to use the NTILE function (100/5 = 20, hence NTILE(20)). The columns are pulled from two separate tables, so I used inner joins. For the life of me, I honestly cannot figure out where I am going wrong. I keep getting "missing expression" errors but do not know what exactly I am missing. Here is the database:
Database: https://codeshare.io/5XKKBj
ERD: https://drive.google.com/file/d/0Bzum6VJXi9lUX1d2ZkhudTE3QXc/view?usp=sharing
Here is my SQL Query so far.
SELECT
Ntile(20) over
(partition by Employee.Branch_no
order by sum(ORDERS.SUBTOTAL) desc
) As Top_5,
CUSTOMER.CUSTOMER_NO,
CUSTOMER.LNAME
FROM
CUSTOMER
INNER JOIN ORDERS
ON
CUSTOMER.CUSTOMER_NO = ORDERS.CUSTOMER_NO
GROUP BY
ORDERS.SUBTOTAL,
CUSTOMER.CUSTOMER_NO,
CUSTOMER.LNAME;
You need to join Employee and the GROUP BY must include all non-aggregated expressions. You can use a subquery to generate the subtotals and get the NTILE in the outer query, e.g.:
SELECT
Ntile(20) over
(partition by BRANCH_NO
order by sum_subtotal desc
) As Top_5,
CUSTOMER_NO,
LNAME
FROM (
SELECT
EMPLOYEE.BRANCH_NO,
CUSTOMER.CUSTOMER_NO,
CUSTOMER.LNAME,
sum(ORDERS.SUBTOTAL) as sum_subtotal
FROM CUSTOMER
JOIN ORDERS
ON CUSTOMER.CUSTOMER_NO = ORDERS.CUSTOMER_NO
JOIN EMPLOYEE
ON ORDERS.EMPLOYEE_NO = EMPLOYEE.EMPLOYEE_NO
GROUP BY
EMPLOYEE.BRANCH_NO,
CUSTOMER.CUSTOMER_NO,
CUSTOMER.LNAME
);
Note: you might want to include BRANCH_NO in the select list as well, otherwise the output will look confusing with duplicate customers (if a customer has ordered from employees in multiple branches).
Now, if you want to filter the above query to just get the top 5%, you can put the whole thing in another subquery and add a predicate on the Top_5 column, e.g.:
SELECT CUSTOMER_NO, LNAME
FROM (... the query above...)
WHERE Top_5 = 1;
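The nesting (aggregate first, then NTILE, then filter on the bucket) can be tried on a toy data set. This sketch runs on SQLite via Python rather than Oracle, uses NTILE(2) instead of NTILE(20) so that four invented customers are enough to see the split, and assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: one branch, one employee, four customers.
conn.executescript("""
    CREATE TABLE CUSTOMER (CUSTOMER_NO INTEGER PRIMARY KEY, LNAME TEXT);
    CREATE TABLE EMPLOYEE (EMPLOYEE_NO INTEGER PRIMARY KEY, BRANCH_NO INTEGER);
    CREATE TABLE ORDERS (CUSTOMER_NO INTEGER, EMPLOYEE_NO INTEGER,
                         SUBTOTAL INTEGER);
    INSERT INTO CUSTOMER VALUES (1, 'Adams'), (2, 'Baker'),
                                (3, 'Clark'), (4, 'Davis');
    INSERT INTO EMPLOYEE VALUES (7, 1);
    INSERT INTO ORDERS VALUES
        (1, 7, 250), (1, 7, 150),  -- Adams: 400
        (2, 7, 300),               -- Baker: 300
        (3, 7, 200),               -- Clark: 200
        (4, 7, 100);               -- Davis: 100
""")

# Innermost: total per branch and customer. Middle: bucket the totals
# within each branch. Outermost: keep only the top bucket.
top_bucket = conn.execute("""
    SELECT CUSTOMER_NO, LNAME
    FROM (
        SELECT NTILE(2) OVER (PARTITION BY BRANCH_NO
                              ORDER BY sum_subtotal DESC) AS bucket,
               CUSTOMER_NO, LNAME
        FROM (
            SELECT e.BRANCH_NO, c.CUSTOMER_NO, c.LNAME,
                   SUM(o.SUBTOTAL) AS sum_subtotal
            FROM CUSTOMER c
            JOIN ORDERS o ON c.CUSTOMER_NO = o.CUSTOMER_NO
            JOIN EMPLOYEE e ON o.EMPLOYEE_NO = e.EMPLOYEE_NO
            GROUP BY e.BRANCH_NO, c.CUSTOMER_NO, c.LNAME
        )
    )
    WHERE bucket = 1
    ORDER BY CUSTOMER_NO
""").fetchall()

print(top_bucket)  # [(1, 'Adams'), (2, 'Baker')]
```

One thing to keep in mind with NTILE(20): a branch with fewer than 20 customers gets one customer per bucket, so bucket 1 is then a single customer rather than exactly 5%.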

Update using Distinct SUM

I have found a few good resources that show I should be able to merge a select query with an update, but I just can't get my head around the correct formatting.
I have a select statement that is getting info for me, and I want to pretty much use those results to Update an account table that matches the accountID in the select query.
Here is the select statement:
SELECT DISTINCT SUM(b.workers)*tt.mealTax as MealCost,b.townID,b.accountID
FROM buildings AS b
INNER JOIN town_tax AS tt ON tt.townID = b.townID
GROUP BY b.townID,b.accountID
So in short I want the above query to be merged with:
UPDATE accounts AS a
SET a.wealth = a.wealth - MealCost
Where MealCost is the result from the select query. I am sure there is a way to put this into one, I just haven't quite been able to connect the dots to get it to run consistently without separating into two queries.
First, you don't need the distinct when you have a group by.
Second, how do you intend to link the two results? The SELECT query is returning multiple rows per account (one for each town). Presumably, the accounts table has only one row. Let's say that you wanted the average MealCost for the update.
The select query to get this is:
SELECT accountID, avg(MealCost) as avg_Mealcost
FROM (SELECT SUM(b.workers)*tt.mealTax as MealCost, b.townID, b.accountID
FROM buildings AS b INNER JOIN
town_tax AS tt
ON tt.townID = b.townID
GROUP BY b.townID,b.accountID
) a
GROUP BY accountID
Now, to put this into an update, you can use syntax like the following:
UPDATE accounts
set accounts.wealth = accounts.wealth - asum.avg_mealcost
from (SELECT accountID, avg(MealCost) as avg_Mealcost
FROM (SELECT SUM(b.workers)*tt.mealTax as MealCost, b.townID, b.accountID
FROM buildings AS b INNER JOIN
town_tax AS tt
ON tt.townID = b.townID
GROUP BY b.townID,b.accountID
) a
GROUP BY accountID
) asum
where accounts.accountid = asum.accountid
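This uses SQL Server's UPDATE ... FROM syntax, which PostgreSQL also accepts. It is not universal, though: as far as I know Oracle has traditionally not supported it (a MERGE or a correlated subquery is the usual route there), and MySQL puts the joined tables before the "set" (UPDATE accounts a JOIN (...) asum ON ... SET ...), which also allows an alias on "update accounts".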
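For databases without UPDATE ... FROM, a correlated subquery is a more portable alternative. The sketch below runs on SQLite through Python with invented rows. Note that it subtracts the total meal cost across towns (SUM) rather than the average used above; that choice is an assumption about what the asker wants, and AVG can be swapped in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: account 1 has buildings in two towns, account 2 has none.
conn.executescript("""
    CREATE TABLE accounts (accountID INTEGER PRIMARY KEY, wealth INTEGER);
    CREATE TABLE town_tax (townID INTEGER PRIMARY KEY, mealTax INTEGER);
    CREATE TABLE buildings (accountID INTEGER, townID INTEGER, workers INTEGER);
    INSERT INTO accounts VALUES (1, 100), (2, 50);
    INSERT INTO town_tax VALUES (1, 2), (2, 3);
    INSERT INTO buildings VALUES
        (1, 1, 5), (1, 1, 5),  -- town 1: (5 + 5) * 2 = 20
        (1, 2, 4);             -- town 2: 4 * 3 = 12
""")

# The inner SELECT is the asker's per-town meal cost; the middle SELECT
# sums it per account; the outer WHERE skips accounts with no buildings
# so their wealth is not set to NULL.
conn.execute("""
    UPDATE accounts
    SET wealth = wealth - (
        SELECT SUM(per_town.MealCost)
        FROM (SELECT SUM(b.workers) * tt.mealTax AS MealCost, b.accountID
              FROM buildings b
              JOIN town_tax tt ON tt.townID = b.townID
              GROUP BY b.townID, b.accountID, tt.mealTax) per_town
        WHERE per_town.accountID = accounts.accountID
    )
    WHERE accountID IN (SELECT accountID FROM buildings)
""")

balances = conn.execute(
    "SELECT accountID, wealth FROM accounts ORDER BY accountID"
).fetchall()
print(balances)  # [(1, 68), (2, 50)]
```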