Interview Question about SQL group by and having

Interview Question about SQL group by and having - sql

This problem is from the following
https://www.programmerinterview.com/index.php/database-sql/advanced-sql-interview-questions-and-answers/
Assume we have two tables:
Salesperson
ID Name Age Salary
Orders
Number order_date cust_id salesperson_id Amount
The question is following:
We want to retrieve the names of all salespeople that have more than 1 order from the tables above. You can assume that each salesperson only has one ID. I would probably also assume that names are all distinct.
My answer was this.
select Name from
salesperson S inner join Orders O
on S.ID=O.salesperson_id
group by Name
having count(number) >=2
However, the given answer is following:
SELECT Name
FROM Orders inner join Salesperson
On Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id, NAME
Having count(salesperson_id) > 1
If name and salesperson_id is one to one, is there any reason we have to add salesperson_id into the group by statement here? Also, if name and salesperson_id relationship is just one to one, wouldn't count(salesperson_id) be always 1 if we group by salesperson_id, name?
I was a bit confused about this, and I was wondering if anybody encountered this problem before and found this weird as well.

Both your solution and the accepted one are functionally identical, except for the GROUP BY clause.
The likely reason why the accepted solution is aggregating both by Name and salesperson_id is that it could be the case that two or more salespeople happen to have the same name. Should this occur, your query would report only a single name, but with aggregate results from more than one salesperson. But, the combination of salesperson_id and Name should always be unique.
Other than this, I actually prefer your version, and I would start joining from the salesperson table out to the Orders table.

Related

SQL COUNT(*) Confusion

This is from a book Murachs SQL Server for Developers:
I have this summary query that calculates the number of invoices and the average invoice amount for the vendors in each state and city group. I understand the majority of the code but one part that confuses me is the COUNT(*) aggregate. According to the book, this aggregate will get the number of invoices a State, City Group has.
I cannot seem to follow the logic to me it looks like the COUNT(*) in the SELECT statement will give a total of how many times a vendor state/city group will appear in the vendor's table not in the invoices table.
SELECT VendorState, VendorCity, COUNT(*) AS 'Invoice QTY',
AVG(InvoiceTotal) AS 'InvoiceAvg'
FROM Invoices JOIN Vendors
ON Invoices.VendorID = Vendors.VendorID
GROUP BY VendorState, VendorCity
HAVING COUNT(*) >= 2
ORDER BY VendorState, VendorCity;`

This is how I would parse your query:
SELECT VendorState, VendorCity, COUNT(*) AS 'Invoice QTY',
AVG(InvoiceTotal) AS 'InvoiceAvg'
Okay, you select some blah blah. I'd come back here after parsing to see whether the chosen columns are really available, but here you have no errors so I'll assume the columns are good.
FROM Invoices
You start with all the invoices (perhaps this is the point that confuses you).
JOIN Vendors
ON Invoices.VendorID = Vendors.VendorID
Each Invoice is joined to a single Vendor (VendorID is a primary key), so the cardinality does not change (assuming all vendors are still in place of course; an invoice with no matching VendorId will "disappear". Usually this is not the case when invoices and vendors are involved; you might have a flag to exclude "terminated" vendors, but you wouldn't remove invoices from the database). The important thing is, if you had 1,000 invoices, you now have 1,000 rows after the JOIN, not 2,000 or any other number. So, you're still working with invoices.
GROUP BY VendorState, VendorCity
Okay, so the COUNT(*) refers to the invoices of each single city in each single state. The HAVING clause restricts the results to those cities where at least two invoices are present.

Simply put, the HAVING clause is evaluated after the JOIN is performed. So the number of rows counted will be the number of invoices (less any invoices that are missing a valid vendorID, and thus fail to Join)

First, I would suggest writing the query as:
SELECT v.VendorState, v.VendorCity,
COUNT(*) AS InvoiceQTY, AVG(i.InvoiceTotal) AS InvoiceAvg
FROM Invoices i JOIN
Vendors v
ON i.VendorID = v.VendorID
GROUP BY v.VendorState, v.VendorCity
HAVING COUNT(*) >= 2
ORDER BY v.VendorState, v.VendorCity;
The changes are:
Use table aliases to make the query easier to write and read.
Qualify all column references.
Only use single quotes for strings, not column names.
Avoid column names that need to be escaped.
The COUNT(*) is counting neither the number of "vendors" nor "invoices" -- well, not directly. It is counting the number of rows that match after the JOIN takes place.
Based on your naming convention, each invoice matches exactly one vendor. So, when you use COUNT(*) you are counting invoices, not vendors.

Selecting fields that are not in GROUP BY when nested SELECTS aren't allowed

I have the tables:
Product(code (PK), pname, (....), sid (FK)),
Supplier(sid(PK), sname, (....))
The assignment is:
Find Suppliers that supply only one product. Display their name (sname) and product name (pname).
It seem to me like a GROUP BY problem, so I used:
SELECT sid FROM
Product GROUP BY sid
HAVING CAST(COUNT(*) AS INTEGER) = 1;
This query have found me the list of sid's that supply one product only, but now I have encountered a problem:
The assignment forbids any form of nested SELECT queries.
The result of the query I have written has only one column. (The sid column)
Thus, I am unable to access the product name as it is not in the query result table, and if I would have added it to the GROUP BY statement, then the grouping will based on product name as well, which is an unwanted behavior.
How should I approach the problem, then?
Note: I use PostgreSQL

You can phrase the query as:
SELECT s.sid, s.sname, MAX(p.pname) as pname
FROM Product p JOIN
Supplier s
ON p.sid = s.sid
GROUP BY s.sid, s.sname
HAVING COUNT(*) = 1;
You don't need to convert COUNT(*) to an integer. It is already an integer.

You could put
max(pname)
in the SELECT list. That's an aggregate, so it would be fine.

calculates between These two columns in SQL server [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 5 years ago.
Improve this question
I want to add a column to the query that calculates between These two columns In the same query ....................................................................
,isnull(sum(ORDERS.Net_Amount) + 0,0) as orders
,isnull(sum (convert(float,(RECEIPTS.Amount))) + 0,0) as recepts
IN SQL SERVER
SELECT CUSTOMERS.[ID_CUSTOMER]
,[FIRST_NAME]
,[TEL]
,[EMAIL]
,isnull(sum(ORDERS.Net_Amount) + 0,0) as orders
,isnull(sum (convert(float,(RECEIPTS.Amount))) + 0,0) as recepts
,[CRIDIT_LIMIT]
,[CUSTOMER_SINCE]
,[ADRESS]
,CUSTOMERS.[state]
FROM [CUSTOMERS]
LEFT JOIN ORDERS on CUSTOMERS.ID_CUSTOMER = ORDERS.CUSTOMER_ID
LEFT JOIN RECEIPTS on CUSTOMERS.ID_CUSTOMER = RECEIPTS.ID_CUSTOMER
GROUP BY CUSTOMERS.[ID_CUSTOMER]
,[FIRST_NAME]
,[TEL]
,[EMAIL]
,[CRIDIT_LIMIT]
,[CUSTOMER_SINCE]
,[ADRESS]
,CUSTOMERS.[state]
CUSTOMERS table
SELECT [ID_CUSTOMER]
,[FIRST_NAME]
,[TEL]
,[EMAIL]
,[IMAGE_CUSTOMER]
,[CRIDIT_LIMIT]
,[CUSTOMER_SINCE]
,[ADRESS]
,[Balance]
,[state]
FROM [CUSTOMERS]
ORDERS table
SELECT [ID_ORDER]
,[DATE_ORDER]
,[CUSTOMER_ID]
,[DESCRIPTION_ORDERS]
,[SALEMAN]
,[ORDER_TOTAL]
,[Discount_Of_Total]
,[Total_After_Discount]
,[Paid_Up]
,[Net_Amount]
,[state]
FROM [ORDERS]
RECEIPTS table
SELECT [image_state]
,[ID_RECEIPT]
,[ID_CUSTOMER]
,[Date]
,[Ref]
,[Amount]
,[Memo]
,[User_Name]
,[state]
,[Payment_Method]
,[Account_ID]
FROM [RECEIPTS]

Orders and receipts are not really related to each other (the receipt doesn't refer to a specific order), so don't join the two. What you want to do instead is find the order amount and the receipt amount per customer and show them. So aggregate the two tables per customer and outer-join the results to the customer table.
select
c.id_customer,
c.first_name,
c.tel,
c.email,
coalesce(o.sum_net_amount, 0) as order_amount,
coalesce(r.sum_amount, 0) as receipt_amount,
c.cridit_limit,
c.customer_since,
c.adress,
c.balance,
c.state
from customers c
left join
(
select customer_id, sum(net_amount) as sum_net_amount
from orders
group by customer_id
) o on c.id_customer = o.customer_id
left join
(
select id_customer, sum(amount) as sum_amount
from receipts
group by id_customer
) r on c.id_customer = r.id_customer;
I see you have updated your request now asking also for the difference of the sums. Well, the operator for subtraction in SQL is - little surprising - the minus sign:
coalesce(o.sum_net_amount, 0) - coalesce(r.sum_amount, 0) as diff

Please try the following...
SELECT Customers.ID_Customer,
first_name,
tel,
email,
SUM( net_amount ) AS OrdersTotal,
COALESCE( sumAmount, 0 ) AS PaymentsTotal,
SUM( net_amount ) - COALESCE( sumAmount, 0 ) AS DifferenceInTotals,
cridit_limit,
customer_since,
adress,
balance,
state
FROM Customers
INNER JOIN Orders ON Customers.ID_Customer = Orders.Customer_ID
LEFT JOIN ( SELECT ID_Customer,
SUM( amount ) AS sumAmount
FROM Receipts
GROUP BY ID_Customer
) AS sumAmountFinder ON Customers.ID_Customer = sumAmountFinder.ID_Customer
GROUP BY Customers.ID_Customer,
first_name,
tel,
email,
cridit_limit,
customer_since,
adress,
balance,
state;
This Answer is based on the assumption that every Customer will have at least one Order, but possibly no Receipts.
This statement is essentially the one that you supplied with the following modifications...
I have changed the JOIN to Orders to an INNER JOIN, since I am assuming that each Customer will have at least one Order. The LEFT JOIN is only necessary where you wish to retain all records from the left table that do not have at least one matching record from the right table as defined by the ON clause. (Note : If you wish to retain all the records from the right table where there is no matching records from the left table, use a RIGHT JOIN).
I have replaced the JOIN to the Receipts table with a subquery that calculates the total of the amount field for each Customer in the Receipts table. A LEFT JOIN is necessary between Customers and the results of this subquery as not all Customers will have a Receipt. In such situations the LEFT JOIN will set each of the fields from the subquery in the joined dataset to NULL.
Where the SUM() function encounters only NULL values it returns NULL, not 0. So that PaymentsTotal will be set to 0 for records where the Customer has no Receipts, I have used the COALESCE() function. This function will return the first non-NULL argument it encounters. Here I have set it to return the total of amount where it encounters one, and 0 where it encounters no total amount.
I have removed all of the square brackets from your field and table names. They are only required where you have used an otherwise disallowed name, such as names with spaces (use [Full Name] instead of Full Name) or names that are also reserved by SQL-Server (if you had decided to call PaymentsTotal Sum, then you would have had to use AS [Sum]). Many programmers consider giving fields such names to be bad practice, even when it is possible with []'s, but fortunately you have not used any otherwise names.
I have removed the table names from your SUM() calculations. Since only one table has a field called net_amount, then it will be a unique field name in the joined dataset, and you will be able to refer to them without specifying the name of the source table as well. Specifying the source table is still necessary in the case of Customers.ID_Customer as the joined dataset will have more than one field called ID_Customer. Also, you will need to specify the source tables names when creating the joined dataset using the JOIN's.
I have also taken the liberty of changing your capitalisation scheme. Having just about everything in constant upper-case is monotonous to the eye. Using different casing for SQL terms, table names and field names makes recognising each type of statement part much easier, and thus makes debugging code much easier.
Finally, and relatively trivially, cridit is actually spelt credit and adress is actually spelt address.
If you have any questions or comments, then please feel free to post a Comment accordingly.

SQL Aggregation AVG statement

Ok, so I have real difficulty with the following question.
Table 1: Schema for the bookworm database. Primary keys are underlined. There are some foreign key references to link the tables together; you can make use of these with natural joins.
For each publisher, show the publisher’s name and the average price per page of books published by the publisher. Average price per page here means the total price divided by the total number of pages for the set of books; it is not the average of (price/number of pages). Present the results sorted by average price per page in ascending order.
Author(aid, alastname, afirstname, acountry, aborn, adied).
Book(bid, btitle, pid, bdate, bpages, bprice).
City(cid, cname, cstate, ccountry).
Publisher(pid, pname).
Author_Book(aid, bid).
Publisher_City(pid, cid).
So far I have tried:
SELECT
pname,
bpages,
AVG(bprice)
FROM book NATURAL JOIN publisher
GROUP BY AVG(bpages) ASC;
and receive
ERROR: syntax error at or near "asc"
LINE 3: group by avg(bpages) asc;

You can't group by an aggregate, at least not like that. Also don't use natural join, it's bad habit to get into because most of the time you'll have to specify join conditions. It's one of those things you see in text books but almost never in real life.
OK with that out of the way, and this being homework so I don't want to just give you an answer without an explanation, aggregate functions (sum in this case) affect all values for a column within a group as limited by the where clause and join conditions, so unless your doing every row you have to specify what column contains the values you are grouping by. In this case our group is Publisher name, they want to know per publisher, what the price per page is. Lets work out a quick select statement for that:
select Pname as Publisher
, Sum(bpages) as PublishersTotalPages
, sum(bprice) as PublishersTotalPrice
, sum(bprice)/Sum(bpages) as PublishersPricePerPage
Next up we have to determine where to get the information and how the tables relate to eachother, we will use books as the base (though due to the nature of left or right joins it's less important than you think). We know there is a foreign key relation between the column PID in the book table and the column PID in the Publisher table:
From Book B
Join Publisher P on P.PID = B.PID
That's what is called an explicit join, we are explicitly stating equivalence between the two columns in the two tables (vs. implying equivalence if it's done in the where clause). This gives us a many to one relation ship, because each publisher has many books published. To see that just run the below:
select b.*, p.*
From Book B
Join Publisher P on P.PID = B.PID
Now we get to the part that seems to have stumped you, how to get the many to one relationship between books and the publishers down to one row per publisher and perform an aggregation (sum in this case) on the page count per book and price per book. The aggregation portion was already done in our selection section, so now we just have to state what column the values our group will come from, since they want to know a per publisher aggregate we'll use the publisher name to group on:
Group by Pname
Order by PublishersPricePerPage Asc
There is a little gotcha in that last part, publisherpriceperpage is a column alias for the formula sum(bprice)/Sum(bpages). Because order by is done after all other parts of the query it's unique in that we can use a column alias no other part of a query allows that, without nesting the original query. so now that you have patiently waded through my explanation, here is the final product:
select Pname as Publisher
, Sum(bpages) as PublishersTotalPages
, sum(bprice) as PublishersTotalPrice
, sum(bprice)/Sum(bpages) as PublishersPricePerPage
From Book B
Join Publisher P on P.PID = B.PID
Group by Pname
Order by PublishersPricePerPage Asc
Good luck and hope the explanation helped you get the concept.

You need ORDER BY clause and not GROUP BY to sort record. So change your query to:
SELECT pname, AVG(bprice)
FROM book NATURAL JOIN publisher
GROUP by pname
ORDER BY AVG(bpages) ASC;

You need Order By for sorting, which was missing:
SELECT
pname,
bpages,
AVG(bprice)
FROM book NATURAL JOIN publisher
GROUP BY pname, bpages
order by AVG(bpages) ASC;

Base on what you're trying to achieve. You can try my query below. I used the stated formula in a CASE statement to catch the error when a bprice is divided by zero(0). Also I added ORDER BY clause in your query and there's no need for the AVG aggregates.
SELECT
pname,
CASE WHEN SUM(bpages)=0 THEN '' ELSE SUM(bprice)/SUM(bpages) END price
FROM book NATURAL JOIN publisher
GROUP BY pname
ORDER BY pname ASC;

The ASC is part of the ORDER BY clause. You are missing the ORDER BY here.
Reference: http://www.tutorialspoint.com/sql/sql-group-by.htm

Check this
SELECT pname, AVG(bprice)
FROM book NATURAL JOIN publisher
GROUP by pname
ORDER BY AVG(bpages)

Select records for MySQL only once based on a column value

I have a table that stores transaction information. Each transaction is has a unique (auto incremented) id column, a column with the customer's id number, a column called bill_paid which indicates if the transaction has been paid for by the customer with a yes or no, and a few other columns which hold other information not relevant to my question.
I want to select all customer ids from the transaction table for which the bill has not been paid, but if the customer has had multiple transactions where the bill has not been paid I DO NOT want to select them more than once. This way I can generate that customer one bill with all the transactions they owe for instead of a separate bill for each transaction. How would I build a query that did that for me?

Returns exactly one customer_id for each customer with bill_paid equal to 'no':
SELECT
t.customer_id
FROM
transactions t
WHERE
t.bill_paid = 'no'
GROUP BY
t.customer_id
Edit:
GROUP BY summarises your resultset.
Caveat: Every column selected must be either 'grouped by' or aggregated in some fashion. As shown by nikic you could use SUM to get the total amount owed, e.g.:
SELECT
t.customer_id
, SUM(t.amount) AS TOTAL_OWED
FROM
transactions AS t
WHERE
t.bill_paid = 'no'
GROUP BY
t.customer_id
t is simply an alias.
So instead of typing transactions everywhere you can now simply type t. The alias is not necessary here since you query only one table, but I find them invaluable for larger queries. You can optionally type AS to make it more clear that you're using an alias.

You might try the Group By operator, eg group by the customer.

SELECT customer, SUM(toPay) FROM .. GROUP BY customer

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Interview Question about SQL group by and having - sql

Related

SQL COUNT(*) Confusion

Selecting fields that are not in GROUP BY when nested SELECTS aren't allowed

calculates between These two columns in SQL server [closed]

SQL Aggregation AVG statement

Select records for MySQL only once based on a column value

Categories

Resources