When do you give up set operations in SQL and go procedural?

I was once given this task to do in an RDBMS:
Given tables customer, order, orderlines and product. Everything done with the usual fields and relationships, with a comment memo field on the orderline table.
For one customer retrieve a list of all products that customer has ever ordered with product name, year of first purchase, dates of three last purchases, comment of the latest order, sum of total income for that product-customer combination last 12 months.
After a couple of days I gave up doing it as a Query and opted to just fetch every orderline for a customer, and every product and run through the data procedurally to build the required table clientside.
I regard this as a symptom of one or more of the following:
I'm a lazy idiot and should have seen how to do it in SQL
Set operations are not as expressive as procedural operations
SQL is not as expressive as it should be
Did I do the right thing? Did I have other options?

You definitely should be able to do this exercise without doing the work equivalent to a JOIN in application code, i.e. by fetching all rows from both orderlines and products and iterating through them. You don't have to be an SQL wizard to do that one. JOIN is to SQL what a loop is to a procedural language -- in that both are fundamental language features that you should know how to use.
One trap people fall into is thinking that the whole report has to be produced in a single SQL query. Not true! Most reports don't fit into a rectangle, as Tony Andrews points out. There are lots of rollups, summaries, special cases, etc., so it's both simpler and more efficient to fetch parts of the report in separate queries. Likewise, in a procedural language you wouldn't try to do all your computation in a single line of code, or even in a single function (hopefully).
Some reporting tools insist that a report is generated from a single query, and you have no opportunity to merge in multiple queries. If so, then you need to produce multiple reports (and if the boss wants it on one page, then you need to do some paste-up manually).
To get a list of all products ordered (with product name), dates of last three purchases, and comment on latest order is straightforward:
SELECT o.*, l.*, p.*
FROM Orders o
JOIN OrderLines l USING (order_id)
JOIN Products p USING (product_id)
WHERE o.customer_id = ?
ORDER BY o.order_date;
It's fine to iterate over the result row-by-row to extract the dates and comments on the latest orders, since you're fetching those rows anyway. But make it easy on yourself by asking the database to return the results sorted by date.
Year of first purchase is available from the previous query: if you sort by order_date and fetch the result row-by-row, you'll have access to the first order. Otherwise, you can do it this way:
SELECT YEAR(MIN(o.order_date)) FROM Orders o WHERE o.customer_id = ?;
Sum of product purchases for the last 12 months is best calculated by a separate query:
SELECT SUM(l.quantity * p.price)
FROM Orders o
JOIN OrderLines l USING (order_id)
JOIN Products p USING (product_id)
WHERE o.customer_id = ?
AND o.order_date > CURDATE() - INTERVAL 1 YEAR;
edit: You said in another comment that you'd like to see how to get the dates of the last three purchases in standard SQL:
SELECT o1.order_date
FROM Orders o1
LEFT OUTER JOIN Orders o2
ON (o1.customer_id = o2.customer_id AND (o1.order_date < o2.order_date
OR (o1.order_date = o2.order_date AND o1.order_id < o2.order_id)))
WHERE o1.customer_id = ?
GROUP BY o1.order_id, o1.order_date
HAVING COUNT(o2.order_id) < 3;
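To try the self-join approach on a small data set, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and sample rows are assumptions made up for illustration. It counts the later orders through a column of the joined alias (o2), so the newest order, which has no later order, counts as zero.

```python
import sqlite3

# Hypothetical sample data: customer 1 has five orders, two on the same date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        (1, 1, "2008-01-10"),
        (2, 1, "2008-03-05"),
        (3, 1, "2008-03-05"),  # same date as order 2; order_id breaks the tie
        (4, 1, "2008-06-20"),
        (5, 1, "2008-09-01"),
        (6, 2, "2008-12-31"),  # another customer, must be ignored
    ],
)

last_three = conn.execute(
    """
    SELECT o1.order_id, o1.order_date
    FROM Orders o1
    LEFT OUTER JOIN Orders o2
      ON o1.customer_id = o2.customer_id
     AND (o1.order_date < o2.order_date
          OR (o1.order_date = o2.order_date AND o1.order_id < o2.order_id))
    WHERE o1.customer_id = ?
    GROUP BY o1.order_id, o1.order_date
    HAVING COUNT(o2.order_id) < 3   -- keep rows with fewer than three later orders
    ORDER BY o1.order_date DESC, o1.order_id DESC
    """,
    (1,),
).fetchall()

print(last_three)
```

Counting o2.order_id rather than rows matters because the LEFT OUTER JOIN still produces one row (with NULLs) for the newest order.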
If you can use a wee bit of vendor-specific SQL features, you can use Microsoft/Sybase TOP n, or MySQL/PostgreSQL LIMIT:
SELECT TOP 3 order_date
FROM Orders
WHERE customer_id = ?
ORDER BY order_date DESC;
SELECT order_date
FROM Orders
WHERE customer_id = ?
ORDER BY order_date DESC
LIMIT 3;

Set operations are not as expressive as procedural operations
Perhaps more like: "Set operations are not as familiar as procedural operations to a developer used to procedural languages" ;-)
Doing it iteratively as you have done now is fine for small sets of data, but simply doesn't scale the same way. The answer to whether you did the right thing depends on whether you are satisfied with the performance right now and/or don't expect the amount of data to increase much.
If you could provide some sample code, we might be able to help you find a set-based solution, which will be faster to begin with and scale far, far better. As GalacticCowboy mentioned, techniques such as temporary tables can help make the statements far more readable while largely retaining the performance benefits.

In most RDBMS you have the option of temporary tables or local table variables that you can use to break up a task like this into manageable chunks.
I don't see any way to easily do this as a single query (without some nasty subqueries), but it still should be doable without dropping out to procedural code, if you use temp tables.

This problem may not have been solvable by one query. I see several distinct parts...
For one customer
Get a list of all products ordered (with product name)
Get year of first purchase
Get dates of last three purchases
Get comment on latest order
Get sum of product purchases for the last 12 months
Your procedure is steps 1 - 5 and SQL gets you the data.

Sounds like a data warehouse project to me. If you need things like "three most recent things" and "sum of something over the last 12 months" then store them i.e. denormalize.

EDIT: This is a completely new take on the solution, using no temp tables or strange sub-sub-sub queries. However, it will ONLY work on SQL 2005 or newer, as it uses the "pivot" command that is new in that version.
The fundamental problem is the desired pivot from a set of rows (in the data) into columns in the output. While noodling on the issue, I recalled that SQL Server now has a "pivot" operator to deal with this.
This works on SQL Server 2005 or newer, using the Northwind sample data.
-- This could be a parameter to a stored procedure
-- I picked this one because he has products that he ordered 4 or more times
declare @customerId nchar(5)
set @customerId = 'ERNSH'
select c.CustomerID, p.ProductName, products_ordered_by_cust.FirstOrderYear,
latest_order_dates_pivot.LatestOrder1 as LatestOrderDate,
latest_order_dates_pivot.LatestOrder2 as SecondLatestOrderDate,
latest_order_dates_pivot.LatestOrder3 as ThirdLatestOrderDate,
'If I had a comment field it would go here' as LatestOrderComment,
isnull(last_year_revenue_sum.ItemGrandTotal, 0) as LastYearIncome
from
-- Find all products ordered by customer, along with first year product was ordered
(
select c.CustomerID, od.ProductID,
datepart(year, min(o.OrderDate)) as FirstOrderYear
from Customers c
join Orders o on o.CustomerID = c.CustomerID
join [Order Details] od on od.OrderID = o.OrderID
group by c.CustomerID, od.ProductID
) products_ordered_by_cust
-- Find the grand total for product purchased within last year - note fudged date below (Northwind)
join (
select o.CustomerID, od.ProductID,
sum(cast(round((od.UnitPrice * od.Quantity) - ((od.UnitPrice * od.Quantity) * od.Discount), 2) as money)) as ItemGrandTotal
from
Orders o
join [Order Details] od on od.OrderID = o.OrderID
-- The Northwind database only contains orders from 1998 and earlier, otherwise I would just use getdate()
where datediff(yy, o.OrderDate, dateadd(year, -10, getdate())) = 0
group by o.CustomerID, od.ProductID
) last_year_revenue_sum on last_year_revenue_sum.CustomerID = products_ordered_by_cust.CustomerID
and last_year_revenue_sum.ProductID = products_ordered_by_cust.ProductID
-- THIS is where the magic happens. I will walk through the individual pieces for you
join (
select CustomerID, ProductID,
max([1]) as LatestOrder1,
max([2]) as LatestOrder2,
max([3]) as LatestOrder3
from
(
-- For all orders matching the customer and product, assign them a row number based on the order date, descending
-- So, the most recent is row # 1, next is row # 2, etc.
select o.CustomerID, od.ProductID, o.OrderID, o.OrderDate,
row_number() over (partition by o.CustomerID, od.ProductID order by o.OrderDate desc) as RowNumber
from Orders o join [Order Details] od on o.OrderID = od.OrderID
) src
-- Now, produce a pivot table that contains the first three row #s from our result table,
-- pivoted into columns by customer and product
pivot
(
max(OrderDate)
for RowNumber in ([1], [2], [3])
) as pvt
group by CustomerID, ProductID
) latest_order_dates_pivot on products_ordered_by_cust.CustomerID = latest_order_dates_pivot.CustomerID
and products_ordered_by_cust.ProductID = latest_order_dates_pivot.ProductID
-- Finally, join back to our other tables to get more details
join Customers c on c.CustomerID = products_ordered_by_cust.CustomerID
join Orders o on o.CustomerID = products_ordered_by_cust.CustomerID and o.OrderDate = latest_order_dates_pivot.LatestOrder1
join [Order Details] od on od.OrderID = o.OrderID and od.ProductID = products_ordered_by_cust.ProductID
join Products p on p.ProductID = products_ordered_by_cust.ProductID
where c.CustomerID = @customerId
order by CustomerID, p.ProductID
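For databases without a PIVOT operator, the same rows-to-columns step can be sketched with ROW_NUMBER() plus conditional aggregation. The example below is only an illustration under assumptions: it uses Python's sqlite3 (window functions need SQLite 3.25 or newer) and a made-up, simplified Orders/OrderLines schema rather than Northwind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    CREATE TABLE OrderLines (order_id INTEGER, product_id INTEGER);
    """
)
# Hypothetical data: customer 1 bought product 10 four times and product 20 twice.
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, 1, "2006-02-01"), (2, 1, "2007-05-10"), (3, 1, "2008-01-15"),
     (4, 1, "2008-07-04"), (5, 2, "2008-08-08")],
)
conn.executemany(
    "INSERT INTO OrderLines VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 10), (4, 10), (2, 20), (4, 20), (5, 10)],
)

per_product = conn.execute(
    """
    WITH ranked AS (
        -- Number each product's orders from newest (1) to oldest
        SELECT l.product_id, o.order_date,
               ROW_NUMBER() OVER (PARTITION BY l.product_id
                                  ORDER BY o.order_date DESC, o.order_id DESC) AS rn
        FROM Orders o
        JOIN OrderLines l ON l.order_id = o.order_id
        WHERE o.customer_id = ?
    )
    SELECT product_id,
           CAST(strftime('%Y', MIN(order_date)) AS INTEGER) AS first_order_year,
           MAX(CASE WHEN rn = 1 THEN order_date END) AS latest_1,
           MAX(CASE WHEN rn = 2 THEN order_date END) AS latest_2,
           MAX(CASE WHEN rn = 3 THEN order_date END) AS latest_3
    FROM ranked
    GROUP BY product_id
    ORDER BY product_id
    """,
    (1,),
).fetchall()

for row in per_product:
    print(row)
```

A product with fewer than three purchases simply gets NULL (None in Python) in the unused date columns.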

SQL queries return results in the form of a single "flat" table of rows and columns. Reporting requirements are often more complex than this, demanding a "jagged" set of results like your example. There is nothing wrong with "going procedural" to solve such requirements, or using a reporting tool that sits on top of the database. However, you should use SQL as far as possible to get the best performance from the database.

Related

Student Seeking Advice for a CSC Exam

I'm a student taking a course on SQL and DB. My question is this: how does one get good at hand writing queries? Our final exam will consist of many of these questions, and I want to do well. We aren't allowed any sort of reference sheet either, just fyi.
I suppose what I'm asking is: how would you approach this?
In short, you need practice, i.e. hands-on SQL.
You will probably get many opinions on this from others. Aside from practice and reading, try to ensure you understand the absolute basics and sequence of query.
Always use table.column or alias.column to help prevent any ambiguity of where something is coming from.
Know the overall basic segments of writing a query such as
select
[all your alias.columns comma separated]
from
[your primary and/or JOIN/LEFT JOIN/etc tables]
where
[what is the criteria you are looking for]
AND [use proper parenthesis to prevent ambiguity if so needed]
group by
[any columns if doing aggregates such as count, min, max, avg, etc]
[you need to list all NON-AGGREGATE alias.columns]
having
[if any, such as count(*) > someValue]
order by
[any specific columns and ascending or descending order]
[such as orderDate DESC to put most recent order at top]
In my opinion, getting your FROM clause right is one of the most important steps, and I try to always write my JOIN conditions as first table/alias = second table/alias. Indentation helps here so you can see how you get from one table to the next. At this point, do not think of your filtering (YET), just HOW the tables are related. Then you can add "AND" criteria for something you are specifically looking for from that source.
An example of orders. Looking for customers who ordered in the last 30 days. Start with that source as your first FROM table, everything else off of that. So I start with the orders because I care about WHEN something was ordered. I can then join to customers to get their name.
select
c.LastName,
c.FirstName,
o.OrderDate
from
Orders o
JOIN Customers c
on o.CustomerID = c.CustomerID
where
o.OrderDate > [sql-specific current date - 30 days]
order by
c.LastName,
c.FirstName
Another example: customers who ordered a specific item in the last 30 days. In this case I could reverse the order of the tables, since the specific item being ordered may be a smaller set than all orders. So, altering the above:
select
c.LastName,
c.FirstName,
o.OrderDate
from
Items i
JOIN OrderDetails od
on i.ItemID = od.ItemID
JOIN Orders o
on od.OrderID = o.OrderID
AND o.OrderDate > [sql-specific current date - 30 days]
JOIN Customers c
on o.CustomerID = c.CustomerID
where
i.ItemDescription = 'SomeThing'
order by
c.LastName,
c.FirstName
Notice my indentation nesting. It is a personal style preference, but at least you can see how alias i relates to od, od to o, and o to c. I find it easier to see the trail of tables and how each is directly related. I also added the "AND" clause directly in the JOIN to the orders table, so that only orders within the last 30 days are kept.
For LEFT JOINs I do the same and keep the criteria directly at the JOIN level. If you put a criterion on the left-joined table into the WHERE clause (without explicitly handling NULL, i.e. [column] IS NULL OR [condition]), it turns the left join into an [INNER] join.
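The difference between filtering in the ON clause and in the WHERE clause of a LEFT JOIN can be seen in a small sketch. It uses Python's sqlite3 with made-up Customers and Orders tables, and a fixed cutoff date stands in for "current date - 30 days"; all names and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT);
    INSERT INTO Customers VALUES (1, 'Adams'), (2, 'Baker'), (3, 'Clark');
    INSERT INTO Orders VALUES (1, 1, '2024-01-05'),   -- recent order
                              (2, 2, '2023-01-01');   -- old order; Clark has none
    """
)
cutoff = "2024-01-01"  # hypothetical stand-in for "today minus 30 days"

# Criterion at the JOIN level: every customer is kept, non-matching orders become NULL.
in_on = conn.execute(
    """
    SELECT c.LastName, o.OrderID
    FROM Customers c
        LEFT JOIN Orders o
            ON c.CustomerID = o.CustomerID
            AND o.OrderDate >= ?
    ORDER BY c.CustomerID
    """,
    (cutoff,),
).fetchall()

# Same criterion in WHERE: rows with NULL OrderDate fail the test, so the
# LEFT JOIN behaves like an INNER JOIN.
in_where = conn.execute(
    """
    SELECT c.LastName, o.OrderID
    FROM Customers c
        LEFT JOIN Orders o
            ON c.CustomerID = o.CustomerID
    WHERE o.OrderDate >= ?
    ORDER BY c.CustomerID
    """,
    (cutoff,),
).fetchall()

print(in_on)
print(in_where)
```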
Hope this basic guidance helps you get more comfortable as you get more into writing your own queries and course/test preparation.

My question is about SQL, using a TOP function inside a sub-query in MS Access

Overall what I'm trying to achieve is a query that shows the most ordered item from a customer in a database. To achieve this I've tried making a query showing how many times a customer has ordered an item, and now I am trying to create a sub-query in it using TOP1 to discern the most bought items.
With the SQL from the first query (looking weird because I made it with the Access automatic creator):
SELECT
Customers.CustomerFirstName,
Customers.CustomerLastName,
Products.ProductName,
COUNT(SalesQuantity.ProductCode) AS CountOfProductCode
FROM (Employees
INNER JOIN (Customers
INNER JOIN Sales
ON Customers.CustomerCode = Sales.CustomerCode)
ON Employees.EmployeeCode = Sales.EmployeeCode)
INNER JOIN (Products
INNER JOIN SalesQuantity
ON Products.ProductCode = SalesQuantity.ProductCode)
ON Sales.SalesCode = SalesQuantity.SalesCode
GROUP BY
Customers.CustomerFirstName,
Customers.CustomerLastName,
Products.ProductName
ORDER BY
COUNT(SalesQuantity.ProductCode) DESC;
I have tried putting in a subquery after FROM line:
(SELECT TOP1 CountOfProduct(s)
FROM (.....)
ORDER by Count(SalesQuantity.ProductCode) DESC)
I'm just not sure what to put in for the "from"-every other tutorial has the data from an already created table, however this is from a query that is being made at the same time. Just messing around I've put "FROM" and then listed every table, as well as
FROM Count(SalesQuantity.ProductCode)
just because I've seen that in the order by from the other code, and assume that the query is discerning from this count. Both tries have ended with an error in the syntax of the "FROM" line.
I'm new to SQL so sorry if it's blatantly obvious, but any help would be greatly appreciated.
Thanks
As I understand, you want the most purchased product for each customer.
So, begin by building aggregate query that counts product purchases by customer (appears to be done in the posted image). Including customer ID in the query would simplify the next step which is to build another query with TOP N nested query.
Part of what complicates this is unique record identifier is lost because of aggregation. Have to use other fields from the aggregate query to provide unique identifier. Consider:
SELECT * FROM Query1 WHERE CustomerID & ProductName IN
(SELECT TOP 1 CustomerID & ProductName FROM Query1 AS Dupe
WHERE Dupe.CustomerID = Query1.CustomerID
ORDER BY Dupe.CustomerID, Dupe.CountOfProductCode DESC);
Overall what I'm trying to achieve is a query that shows the most ordered item from a customer in a database.
This answers your question. It does not modify your query which is only tangentially related.
SELECT s.CustomerCode, sq.ProductCode, SUM(sq.quantity) as qty
FROM Sales as s INNER JOIN
SalesQuantity as sq
ON s.SalesCode = sq.SalesCode
GROUP BY s.CustomerCode, sq.ProductCode;
To get the most ordered items, you can use this twice:
SELECT s.CustomerCode, sq.ProductCode, SUM(sq.quantity) as qty
FROM Sales as s INNER JOIN
SalesQuantity as sq
ON s.SalesCode = sq.SalesCode
GROUP BY s.CustomerCode, sq.ProductCode
HAVING sq.ProductCode IN (SELECT TOP 1 sq2.ProductCode
FROM Sales as s2 INNER JOIN
SalesQuantity as sq2
ON s2.SalesCode = sq2.SalesCode
WHERE s2.CustomerCode = s.CustomerCode
GROUP BY sq2.ProductCode
ORDER BY SUM(sq2.quantity) DESC
);
In almost any other database, this would be simpler, because you would be able to use window functions.
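As a rough illustration of the window-function version, here is a sketch run through Python's sqlite3 (SQLite 3.25 or newer). The table and column names follow the question, but the sample rows are invented, and RANK() is one possible choice: it returns every tied product for a customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Sales (SalesCode INTEGER PRIMARY KEY, CustomerCode TEXT);
    CREATE TABLE SalesQuantity (SalesCode INTEGER, ProductCode TEXT, quantity INTEGER);
    INSERT INTO Sales VALUES (1, 'C1'), (2, 'C1'), (3, 'C2');
    INSERT INTO SalesQuantity VALUES
        (1, 'P1', 2), (1, 'P2', 5), (2, 'P1', 4),   -- C1: P1 = 6, P2 = 5
        (3, 'P2', 1), (3, 'P3', 1);                 -- C2: tie between P2 and P3
    """
)

top_products = conn.execute(
    """
    WITH totals AS (
        SELECT s.CustomerCode, sq.ProductCode, SUM(sq.quantity) AS qty
        FROM Sales s
        INNER JOIN SalesQuantity sq ON s.SalesCode = sq.SalesCode
        GROUP BY s.CustomerCode, sq.ProductCode
    ),
    ranked AS (
        -- Rank each customer's products by total quantity, largest first
        SELECT CustomerCode, ProductCode, qty,
               RANK() OVER (PARTITION BY CustomerCode ORDER BY qty DESC) AS rnk
        FROM totals
    )
    SELECT CustomerCode, ProductCode, qty
    FROM ranked
    WHERE rnk = 1
    ORDER BY CustomerCode, ProductCode
    """
).fetchall()

print(top_products)
```

Swapping RANK() for ROW_NUMBER() would return exactly one product per customer, picking arbitrarily among ties.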

SQL query on Northwind multiple tables

From the Northwind database I want to get the total revenue generated by each employee's sales
Employee -> Orders -> "Order Details"
I am not sure if my solution gives the right data (it was partly guessing)
SELECT
Employees.FirstName, Employees.LastName,
SUM(CONVERT(MONEY, ("Order Details".UnitPrice * Quantity * (1 - Discount) / 100)) * 100) AS ExtendedPrice
FROM
((Orders
INNER JOIN
"Order Details" ON Orders.OrderID = "Order Details".OrderID)
INNER JOIN
Employees ON Orders.EmployeeID = Employees.EmployeeID)
GROUP BY
LastName, FirstName;
Northwind database structure can be found here
Thank you in advance. It would be great to have a nice explanation as well
Chris, your effort is a pretty good first attempt, but there are a few things to change.
You don't need to divide by 100 and then multiply by 100. The discount is already stored as a fraction (e.g. 0.05 for 5%), so that operation only reduces the precision of each value to two decimal places. I would avoid doing this too early in a process, as it introduces rounding errors. It is better to keep numbers raw and preserve their precision for as long as you can. It is OK to display numbers as money in the GUI, i.e. to 2 decimals, but not in intermediate calculations, because of the error this introduces.
Table names and field names with spaces should be handled using [] rather than quotes. That makes it easier to find misspelling so use [Order Details]
When grouping and summing, make sure you group by the keys. A name is not a key, so use EmployeeID if you are trying to group individual employees. In real datasets you may have two employees with the same name, and with your code their sales would be incorrectly grouped together.
Try this course/book, it is a good intro to querying databases. https://www.microsoft.com/en-au/learning/exam-70-461.aspx
Why does this work? The SELECT syntax is: SELECT [fieldlist] FROM [table] INNER JOIN [jointable] ON [join fields] GROUP BY [grouping fields]. The fieldlist can contain calculations as well as actual field names to display. "inner join" means you want only those orders, order details and employees where there is actual matching data, which is correct in your scenario. [table] and [jointable] are the actual tables that contain your data in a relational sense.
There is obviously a lot here to learn in one go. I would work through some of the different SQL Server querying courses that you can google.
Here's a revised version of the code:
SELECT Employees.EmployeeID, Employees.FirstName, Employees.LastName, Sum([Order Details].UnitPrice * Quantity * (1 - Discount)) AS ExtendedPrice
FROM Orders
INNER JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
INNER JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID
group by Employees.EmployeeID, Employees.FirstName, Employees.LastName
order by Employees.FirstName, Employees.LastName;

I am getting too many solutions when I need only one

I use a query that returns the name of the city with the highest number of orders placed.
This is what I have:
SELECT MAX(o.OrderID) AS [Number of Orders], od.ShipCity
FROM Orders o, [Order Details] od
GROUP BY o.ShipCity
ORDER BY [Number of Orders] DESC
I got all of the cities and their orders instead of just the one city with the most orders.
What happened?
Yeah, there's a couple of things wrong with your query. First, you're getting the max order id , which is presumably some autoincrement column. It's like Karl's answer is first, mine is second, SELECT MAX(answerid) FROM this_discussion = 2.... but that doesn't mean I have more answers than he does.
Rnofx5 is also right... you need to tell your table what to join ON, cause right now it's creating a Cartesian Product. If you're not sure what that is, for now accept that it's a horrible, evil, wicked thing to do and then Google it after we're done fixing the query.
So, we have orders and order details. Presumably orders does not contain City, so we need order details
SELECT count(o.OrderID), od.ShipCity
FROM orders AS o
INNER JOIN [Order Details] AS od
ON o.{a variable that both Orders and Order Details have in common} = od.{a variable that both Orders and Order Details have in common}
GROUP BY od.ShipCity
ORDER BY count(o.OrderID) DESC
LIMIT 1;
Okay, so we're joining Orders with Order Details. In order to do that, we need to associate every order with something in Order Details. I don't know your schema, but from the sounds of it probably each order has a corresponding record in Order Details. In that case, you join these two tables using their ID. Something like
ON o.OrderID = od.OrderID
Now, we are counting all of the orders associated with a particular city, and we are sorting them by our count, in descending order. Then we keep only the very first record that's returned (LIMIT 1).
Depending on your SQL implementation, you may need TOP 1 instead of LIMIT 1. You tagged mysqli, so presumably this is MySQL and in that case you'd want LIMIT not TOP. But be aware that that's a syntax variation you may encounter at some point
You are getting the highest OrderID (MAX) per ship city rather than the count.
You instead need COUNT(o.OrderID)
And in MySQL you need to use LIMIT 1 on the end to get only the top most result.

Uses of unequal joins

Of all the thousands of queries I've written, I can probably count on one hand the number of times I've used a non-equijoin. e.g.:
SELECT * FROM tbl1 INNER JOIN tbl2 ON tbl1.date > tbl2.date
And most of those instances were probably better solved using another method. Are there any good/clever real-world uses for non-equijoins that you've come across?
Bitmasks come to mind. In one of my jobs, we had permissions for a particular user or group on an "object" (usually corresponding to a form or class in the code) stored in the database. Rather than including a row or column for each particular permission (read, write, read others, write others, etc.), we would typically assign a bit value to each one. From there, we could then join using bitwise operators to get objects with a particular permission.
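A bitmask join of that kind might look roughly like the sketch below. The schema is entirely hypothetical (a permissions lookup table with one bit per permission, and a grants table storing a combined mask per user and object), and it runs through Python's sqlite3 only so it can be tried out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: one bit value per permission
    CREATE TABLE permissions (perm_bit INTEGER PRIMARY KEY, perm_name TEXT);
    INSERT INTO permissions VALUES (1, 'read'), (2, 'write'), (4, 'read_others');

    -- Each grant stores the OR-ed bits for one user on one object
    CREATE TABLE grants (user_name TEXT, object_name TEXT, mask INTEGER);
    INSERT INTO grants VALUES ('alice', 'invoice_form', 3),   -- read + write
                              ('bob',   'invoice_form', 1);   -- read only
    """
)

# Non-equijoin: a grant matches every permission whose bit is set in its mask.
granted = conn.execute(
    """
    SELECT g.user_name, p.perm_name
    FROM grants g
    JOIN permissions p ON (g.mask & p.perm_bit) <> 0
    WHERE g.object_name = 'invoice_form'
    ORDER BY g.user_name, p.perm_bit
    """
).fetchall()

print(granted)
```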
How about for checking for overlaps?
select ...
from employee_assignments ea1
, employee_assignments ea2
where ea1.emp_id = ea2.emp_id
and ea1.end_date >= ea2.start_date
and ea1.start_date <= ea2.end_date
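Two intervals overlap when each one starts on or before the other ends. The sketch below applies that test with Python's sqlite3 to made-up rows. The assignment_id column is an assumption added here so that a row is not reported as overlapping itself and each pair is listed only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee_assignments (
        assignment_id INTEGER PRIMARY KEY,   -- hypothetical surrogate key
        emp_id INTEGER,
        start_date TEXT,
        end_date TEXT
    );
    INSERT INTO employee_assignments VALUES
        (1, 100, '2024-01-01', '2024-03-31'),
        (2, 100, '2024-03-15', '2024-06-30'),   -- overlaps assignment 1
        (3, 100, '2024-07-01', '2024-12-31'),   -- starts after 2 ends
        (4, 200, '2024-02-01', '2024-02-28');   -- different employee
    """
)

overlaps = conn.execute(
    """
    SELECT ea1.assignment_id, ea2.assignment_id
    FROM employee_assignments ea1
    JOIN employee_assignments ea2
      ON ea1.emp_id = ea2.emp_id
     AND ea1.assignment_id < ea2.assignment_id   -- skip self-pairs and mirrored pairs
     AND ea1.end_date >= ea2.start_date
     AND ea1.start_date <= ea2.end_date
    ORDER BY ea1.assignment_id, ea2.assignment_id
    """
).fetchall()

print(overlaps)
```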
Whole-day intervals in date_time fields:
date_time_field >= begin_date and date_time_field < end_date_plus_1
Just found another interesting use of an unequal join on the MCTS 70-433 (SQL Server 2008 Database Development) Training Kit book. Verbatim below.
By combining derived tables with unequal joins, you can calculate a variety of cumulative aggregates. The following query returns a running aggregate of orders for each salesperson (my note - with reference to the ubiquitous AdventureWorks sample db):
select
SH3.SalesPersonID,
SH3.OrderDate,
SH3.DailyTotal,
SUM(SH4.DailyTotal) RunningTotal
from
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH3
join
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH4
on SH3.SalesPersonID = SH4.SalesPersonID AND SH3.OrderDate >= SH4.OrderDate
group by SH3.SalesPersonID, SH3.OrderDate, SH3.DailyTotal
order by SH3.SalesPersonID, SH3.OrderDate
The derived tables are used to combine all orders for salespeople who have more than one order on a single day. The join on SalesPersonID ensures that you are accumulating rows for only a single salesperson. The unequal join allows the aggregate to consider only the rows for a salesperson where the order date is earlier than the order date currently being considered within the result set.
In this particular example, the unequal join is creating a "sliding window" kind of sum on the daily total column in SH4.
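The running-total technique can be tried on a tiny data set. This sketch uses Python's sqlite3 and a made-up daily_totals table (standing in for the derived tables above, not AdventureWorks). It also computes the same figure with a window SUM, which needs SQLite 3.25 or newer, as a cross-check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table: one row per salesperson per day, already aggregated
    CREATE TABLE daily_totals (sales_person_id INTEGER, order_date TEXT, daily_total INTEGER);
    INSERT INTO daily_totals VALUES
        (1, '2024-01-01', 100), (1, '2024-01-02', 50), (1, '2024-01-05', 25),
        (2, '2024-01-01', 10),  (2, '2024-01-03', 20);
    """
)

# Unequal join: each row is paired with itself and all earlier rows of the same salesperson.
via_join = conn.execute(
    """
    SELECT a.sales_person_id, a.order_date, a.daily_total,
           SUM(b.daily_total) AS running_total
    FROM daily_totals a
    JOIN daily_totals b
      ON a.sales_person_id = b.sales_person_id
     AND a.order_date >= b.order_date
    GROUP BY a.sales_person_id, a.order_date, a.daily_total
    ORDER BY a.sales_person_id, a.order_date
    """
).fetchall()

# Same result with a window function, for comparison.
via_window = conn.execute(
    """
    SELECT sales_person_id, order_date, daily_total,
           SUM(daily_total) OVER (PARTITION BY sales_person_id
                                  ORDER BY order_date) AS running_total
    FROM daily_totals
    ORDER BY sales_person_id, order_date
    """
).fetchall()

print(via_join)
```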
Duplicates:
SELECT
*
FROM
table a, (
SELECT
id,
min(rowid) AS min_rowid
FROM
table
GROUP BY
id
) b
WHERE
a.id = b.id
and a.rowid > b.min_rowid;
If you wanted to get all of the products to offer to a customer and don't want to offer them products that they already have:
SELECT
C.customer_id,
P.product_id
FROM
Customers C
INNER JOIN Products P ON
P.product_id NOT IN
(
SELECT
O.product_id
FROM
Orders O
WHERE
O.customer_id = C.customer_id
)
Most often though, when I use a non-equijoin it's because I'm doing some kind of manual fix to data. For example, the business tells me that a person in a user table should be given all access roles that they don't already have, etc.
If you want to do a dirty join of two not really related tables, you can join with a <>.
For example, you could have a Product table and a Customer table. Hypothetically, if you want to show a list of every product with every customer, you could do something like this:
SELECT *
FROM Product p
JOIN Customer c on p.SKU <> c.SSN
It can be useful. Be careful, though, because it can create ginormous result sets.