SQL "GROUP BY" issue

I'm designing a shopping cart. To circumvent the problem of old invoices showing inaccurate pricing after a product's price gets changed, I moved the price field from the Product table into a ProductPrice table that consists of 3 fields, pid, date and price. pid and date form the primary key for the table. Here's an example of what the table looks like:
pid date price
1 1/1/09 50
1 2/1/09 55
1 3/1/09 54
Using SELECT and GROUP BY to find the latest price of each product, I came up with:
SELECT pid, price, max(date) FROM ProductPrice GROUP BY pid
The date and pid returned were accurate. I received exactly 1 entry for every unique pid and the date that accompanied it was the latest date for that pid. However, what came as a surprise was the price returned. It returned the price of the first row matching the pid, which in this case was 50.
After reworking my statement, I came up with this:
SELECT pp.pid, pp.price, pp.date FROM ProductPrice AS pp
INNER JOIN (
SELECT pid AS lastPid, max(date) AS lastDate FROM ProductPrice GROUP BY pid
) AS m
ON pp.pid = lastPid AND pp.date = lastDate
While the reworked statement now yields the correct price (54), it seems incredible that such a simple-sounding query would require an inner join to execute. My question is, is my second statement the easiest way to accomplish what I need to do? Or am I missing something here? Thanks in advance!
James

The reason you get an arbitrary price is that MySQL cannot know which row's value to return for a column that is neither grouped nor aggregated. It knows it needs a price and a date per pid, and it can fetch the latest date as you requested with max(date), but it returns whichever price is most efficient for it to retrieve, because you didn't provide an aggregate function for that column. (Your first query is not valid standard SQL, actually.)
Your second query looks OK, but here is a shorter alternative:
SELECT pid, price, date
FROM ProductPrice p
WHERE date = (SELECT MAX(date) FROM ProductPrice tmp WHERE tmp.pid = p.pid)
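To check the correlated-subquery form against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only standing in for MySQL here, the dates are rewritten as ISO strings (an assumption, so that MAX() compares them chronologically), and the rows for pid 2 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProductPrice ("
    "pid INTEGER, date TEXT, price INTEGER, PRIMARY KEY (pid, date))"
)
conn.executemany(
    "INSERT INTO ProductPrice VALUES (?, ?, ?)",
    [
        (1, "2009-01-01", 50),  # sample rows from the question, ISO dates assumed
        (1, "2009-02-01", 55),
        (1, "2009-03-01", 54),
        (2, "2009-01-15", 20),  # hypothetical second product
        (2, "2009-02-15", 18),
    ],
)

# For each row, keep it only if its date is the maximum date for that pid.
latest = conn.execute(
    """
    SELECT pid, price, date
    FROM ProductPrice p
    WHERE date = (SELECT MAX(date) FROM ProductPrice tmp WHERE tmp.pid = p.pid)
    ORDER BY pid
    """
).fetchall()

print(latest)  # [(1, 54, '2009-03-01'), (2, 18, '2009-02-15')]
```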
But if you access the latest price a lot (which I think you do), I would recommend adding the old column back to your original table to hold the newest value, if you have the option of altering the database structure again.

I think you broke your database schema.
To circumvent the problem of old invoices showing inaccurate pricing after a product's price gets changed, I moved the price field from the Product table into a ProductPrice table that consists of 3 fields, pid, date and price. pid and date form the primary key for the table.
As you have pointed out you need to keep a change history of prices. But you can still keep the current price in the products table in addition to that new table. That would make your life much easier (and your queries faster).

You cannot solve your problem with the GROUP BY clause, because for each group of pid MySQL will simply fetch the first pid, the maximum date and the first price found (which is not what you need).
You may either use a subquery (which can be inefficient):
SELECT pid, date, price
FROM ProductPrice p1
WHERE date = ( SELECT MAX(p2.date)
FROM ProductPrice p2
WHERE p1.pid = p2.pid)
or you can simply join the table with itself:
SELECT p1.pid, p1.date, p1.price
FROM ProductPrice p1
LEFT JOIN ProductPrice p2 ON p1.pid = p2.pid
AND p1.date < p2.date
WHERE p2.pid IS NULL
Take a look at this section of MySQL docs.
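The self-join above works as an anti-join: a row from p1 survives only when no row p2 with a later date exists for the same pid. A small sketch with Python's sqlite3 module shows this on the question's sample rows (SQLite stands in for MySQL, and the ISO date strings are an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProductPrice ("
    "pid INTEGER, date TEXT, price INTEGER, PRIMARY KEY (pid, date))"
)
conn.executemany(
    "INSERT INTO ProductPrice VALUES (?, ?, ?)",
    [(1, "2009-01-01", 50), (1, "2009-02-01", 55), (1, "2009-03-01", 54)],
)

# p2 matches every later price row for the same product; the newest row
# has no such match, so its p2 columns come back NULL and it passes the filter.
newest = conn.execute(
    """
    SELECT p1.pid, p1.date, p1.price
    FROM ProductPrice p1
    LEFT JOIN ProductPrice p2 ON p1.pid = p2.pid
                             AND p1.date < p2.date
    WHERE p2.pid IS NULL
    ORDER BY p1.pid
    """
).fetchall()

print(newest)  # [(1, '2009-03-01', 54)]
```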

You might wanna try this:
SELECT pid, price, date FROM ProductPrice GROUP BY pid ORDER BY date DESC
GROUP BY has some obscure functionality. I, too, am always unsure whether it picks the right field... but it should be the first in the result set.

Here is another -possibly inefficient- one:
SELECT pid, substring_index( group_concat( price order by date desc ), ',', 1 ) , max(date)
FROM ProductPrice
GROUP BY pid

I think that the key here is simple sounding query - you can see what you want but computers ain't human and so to produce the desired result from set based operations you have to be explicit as in the second query.
The inner query identifies the last price for each product, then the outer query lets you get the value for the last price - that's about as simple as it can get.
As an aside, if you have an invoicing system, you really ought to store the price for the product (and the tax rates as well as the "codes") with the invoice, i.e. the invoice tables should contain all the necessary financial information to reproduce the invoice. In general, you do not want to rely on being able to look up a price (or a tax rate) in a mutable table, even allowing for the system introduced above. Regardless of this, having the pricing history has its own merits.
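The idea of storing the price with the invoice could be sketched like this with Python's sqlite3 module. The table and column names (Product, InvoiceLine, unit_price) and the invoice number are hypothetical, chosen only for illustration: the price is copied onto the invoice line at sale time, so a later price change does not alter the invoice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Product (pid INTEGER PRIMARY KEY, price INTEGER);
    -- unit_price is a snapshot taken when the invoice is written
    CREATE TABLE InvoiceLine (
        invoice_id INTEGER, pid INTEGER, quantity INTEGER, unit_price INTEGER
    );
    INSERT INTO Product VALUES (1, 50);
    """
)

# Sell 2 units of product 1, copying the current price onto the invoice line.
conn.execute(
    "INSERT INTO InvoiceLine SELECT 1001, pid, 2, price FROM Product WHERE pid = 1"
)

# The product's price changes afterwards.
conn.execute("UPDATE Product SET price = 54 WHERE pid = 1")

invoice_total = conn.execute(
    "SELECT SUM(quantity * unit_price) FROM InvoiceLine WHERE invoice_id = 1001"
).fetchone()[0]
current_price = conn.execute(
    "SELECT price FROM Product WHERE pid = 1"
).fetchone()[0]

print(invoice_total, current_price)  # 100 54
```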

I faced the same problem in one of my projects. I used a subquery to fetch the date and then compared against it, but that made the system slow as the data grew. So it's better to store the latest price in your Products table, in addition to the new table you created to keep the history of price changes.
You can always use any of the queries people suggested to get the latest price of a product on a particular date. But you can also add an "is latest" flag field to the same table, so that the flag is true on only one row per product. Then you can always find a product's latest price with one simple query.

Related

What's the use of this WHERE clause

this is an answer to the question : We need a list of customer IDs with the total amount they have ordered. Write a SQL statement to return customer ID (cust_id in the Orders table) and total_ordered using a subquery to return the total of orders for each customer. Sort the results by amount spent from greatest to the least. Hint: you’ve used the SUM() to calculate order totals previously.
SELECT prod_name,
(SELECT Sum(quantity)
FROM OrderItems
WHERE Products.prod_id=OrderItems.prod_id) AS quant_sold
FROM Products
;
So there is this simple code up here, and I know that this WHERE clause is comparing two columns in two different tables. But since we are calculating the SUM of that quantity, why do we need that WHERE clause exactly? I really couldn't get it. Why the prod_id exactly and not any other column? (P.S.: the only shared column between those two tables is the prod_id column.) I am still a beginner. Thank you!
First you would want to know the sum for each product - so need to adjust the subquery similar to this:
(SELECT prod_id, Sum(quantity) qty
FROM OrderItems
group by prod_id
) AS quant_sold
then once you know how much for each product, then you can link that
SELECT p.prod_name, quant_sold.qty
FROM Products p
INNER JOIN (SELECT prod_id, Sum(quantity) qty
FROM OrderItems
group by prod_id
) AS quant_sold
ON p.prod_id = quant_sold.prod_id
Run it without the where clause and compare the results. You'll learn a lot that way. specifically focus on two different product Ids ensuring they both have order items and quantities.
You have two different tables involved. There are multiple products. You don't want the sum of all orders on each product; which is what you would get without the where clause. So the where clause correlates the two tables ensuring you only SUM the quantity of each order item for each product between the tables. Personally, I'd use a join, sum, and a group by as I find it easier to read and I'm not a fan of sub selects in the select of another query; but that's me.
SELECT prod_name,
(SELECT Sum(quantity)
FROM OrderItems
WHERE Products.prod_id=OrderItems.prod_id) AS quant_sold
FROM Products
Should be the same as:
SELECT prod_name, Sum(coalesce(OI.quantity,0))
FROM Products P
LEFT JOIN orderItems OI
on P.prod_id=OI.prod_id
GROUP BY Prod_Name
Notes:
The above is untested.
A left join is needed because all products should be listed, and if a product doesn't have an order, the quantity would be zero.
If we use an inner join, the product would be excluded.
We use coalesce because you'd have a "Null" quantity instead of zero for such lines without an order item.
As to which is "right", well, it depends and varies from case to case. Each has its own merits: in some cases one will perform better than the other, and in other cases vice versa. See --> Join vs. sub-query
As an example:
Say you have Products A & B
"A" has Order Item Quantities of 1 & 2
"B" has order item Quantities of 10 & 20
If we don't have the where clause every result record would have qty 33
If we have the where product "A" would have 3
product "B" would have qty 30.
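The A and B example can be reproduced with a short sketch using Python's sqlite3 module. The product names and the TEXT ids are assumptions made for illustration. The first query keeps the correlating WHERE clause and the second drops it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Products (prod_id TEXT PRIMARY KEY, prod_name TEXT);
    CREATE TABLE OrderItems (prod_id TEXT, quantity INTEGER);
    INSERT INTO Products VALUES ('A', 'Product A'), ('B', 'Product B');
    INSERT INTO OrderItems VALUES ('A', 1), ('A', 2), ('B', 10), ('B', 20);
    """
)

# With the WHERE clause: the subquery is re-evaluated per product row.
with_where = conn.execute(
    """
    SELECT prod_name,
           (SELECT SUM(quantity)
            FROM OrderItems
            WHERE Products.prod_id = OrderItems.prod_id) AS quant_sold
    FROM Products
    ORDER BY prod_name
    """
).fetchall()

# Without it: every product row gets the grand total of all order items.
without_where = conn.execute(
    """
    SELECT prod_name,
           (SELECT SUM(quantity) FROM OrderItems) AS quant_sold
    FROM Products
    ORDER BY prod_name
    """
).fetchall()

print(with_where)     # [('Product A', 3), ('Product B', 30)]
print(without_where)  # [('Product A', 33), ('Product B', 33)]
```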

Joining 2 tables with a timestamp of YEAR for the first item purchased by customer

I have 2 tables "items" and "customers".
In both tables, "customer_id" is present and a single customer can have more than 1 item. In the "items" table there is also a timestamp field called "date_created" when an item was purchased.
I want to construct a query that can return each customer_id and item_id associated with the first item each customer bought in a specific year, let's say 2020.
My approach was
SELECT customer_id, items
INNER JOIN items ON items.customer_id=customers.customer_id
and then try to use the EXTRACT function to take care of the first item each customer bought in 2020 but I can't seem to extract the first item only for the specific year. I would really appreciate some help.
I am using PostgreSQL. Thank you.
Just use distinct on:
select distinct on (customer_id) i.*
from items i
where date_created >= date '2020-01-01' and
date_created < date '2021-01-01'
order by customer_id, date_created;
First, note the use of direct date comparisons. This makes it easier for the optimizer to choose the best execution plan.
distinct on is a handy Postgres extension that returns the first row encountered for the keys in parentheses, which must be the first keys in the order by. "First" is based on the subsequent order by keys.
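For readers without Postgres at hand, the same "first row per customer within the year" logic can be sketched with ROW_NUMBER(), run here through Python's sqlite3 module. Note that this swaps in a window function for distinct on (SQLite has no distinct on), it assumes SQLite 3.25 or newer, and the sample rows and item_id values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (item_id INTEGER, customer_id INTEGER, date_created TEXT)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (10, 1, "2019-12-30 10:00:00"),  # before 2020, filtered out
        (11, 1, "2020-03-05 09:30:00"),
        (12, 1, "2020-01-20 14:00:00"),  # customer 1's first 2020 purchase
        (20, 2, "2020-07-01 08:00:00"),  # customer 2's first 2020 purchase
        (21, 2, "2021-01-01 00:00:00"),  # excluded by the half-open range
        (30, 3, "2019-06-01 12:00:00"),  # customer 3 bought nothing in 2020
    ],
)

# Number each customer's 2020 purchases by date and keep the earliest one.
first_2020 = conn.execute(
    """
    SELECT customer_id, item_id
    FROM (
        SELECT i.*,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY date_created) AS rn
        FROM items i
        WHERE date_created >= '2020-01-01'
          AND date_created < '2021-01-01'
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
    """
).fetchall()

print(first_2020)  # [(1, 12), (2, 20)]
```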

Price comparison database - put price data in main table, in one separate table or in many product tables?

I'm trying to build a price comparison database with n products and a definitive but changing number of vendors that sell these products.
For my price comparison database, I need to store both current prices for a product across different vendors and historical prices (one lowest price).
As I see it, I have 2 options to design the database tables:
1. Put all vendor prices into the main table.
I know how many vendors there will be and if I add or remove a vendor I can add or remove a column.
Historical prices (lowest price on certain date across all vendors), goes into a separate table with a product name, a price and a date.
2. Have one table for products and one table for prices
I will have only the static attribute data in the main table such as categories, attributes etc and then add prices to a separate product table where I store price, vendor, date in it and I can store the lowest price as a pseudo-vendor in that table for each date or I can store it in a separate table as well.
Which method would you suggest and am I missing something?
You should store the base data in a normalized format that contains all the history. This means that you have tables for:
products, with one row per product and the static information about the products.
vendors, with one row per vendor and the static information about the vendor.
prices, with one row per price along with the date and product and vendor.
You can get the current and lowest prices using a query, such as:
select pr.*
from (select pr.*, min(price) over (partition by product) as min_price,
row_number() over (partition by product, vendor order by price_datetime desc) as seqnum
from prices pr
where pr.product_id = XXX
) pr
where seqnum = 1;
For performance, you want an index on prices(product, vendor, price_datetime desc).
Eventually, you may find that this query runs too slowly. In that case, you can consider optimizations. One would be to store the latest price date for each product/vendor combination, along with the minimum price in the products table, presumably maintained by triggers.
Another would be maintaining a summary table for each product and vendor using triggers. However, that is probably not how you should start the endeavor.
However, you might be surprised at how well the above query can perform on your data.
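As a rough illustration of the normalized design, the sketch below builds a small prices table with Python's sqlite3 module and runs a query of the same shape as the one above. The column names (product_id, vendor_id), the integer prices in cents, and the sample rows are all assumptions, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices ("
    "product_id INTEGER, vendor_id INTEGER, price INTEGER, price_datetime TEXT)"
)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1000, "2023-01-01"),
        (1, 1, 1200, "2023-02-01"),  # vendor 1's current price
        (1, 2, 1100, "2023-01-15"),
        (1, 2, 900, "2023-01-20"),   # historical low for product 1
        (1, 2, 1300, "2023-02-10"),  # vendor 2's current price
    ],
)

# min_price spans the whole history of the product; seqnum = 1 marks the
# most recent row for each vendor.
current = conn.execute(
    """
    SELECT vendor_id, price, min_price
    FROM (
        SELECT pr.*,
               MIN(price) OVER (PARTITION BY product_id) AS min_price,
               ROW_NUMBER() OVER (PARTITION BY product_id, vendor_id
                                  ORDER BY price_datetime DESC) AS seqnum
        FROM prices pr
        WHERE pr.product_id = 1
    ) AS ranked
    WHERE seqnum = 1
    ORDER BY vendor_id
    """
).fetchall()

print(current)  # [(1, 1200, 900), (2, 1300, 900)]
```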

SQL using result from select from clause to join another table

I am working on a legacy app, I am still learning SQL and would consider my SQL knowledge as beginner.
I have 2 tables, one is a receipt type structure containing receipt no, a docket number (plus other info regarding total etc) and a car rego number.
There is the potential for multiple receipts for a car, i.e. multiple matches on rego number.
The second table, the "registerhistory", has a listing of the items related to that receipt (description, partno, time). Each of the items is related by docketnumber.
Multiple items appear as multiple rows (with the same docketnumber) in the "registerhistory". Items of the same type are not stored as a qty but as duplicated rows in the table with the same docket number, each with a price stored.
I am trying to generate a report based upon a search match on rego number and create a join to the matching tableregister items and list them (with hopefully an end goal of grouping any duplicate items into a qty and subtotal)
This is an access database if that changes the syntax
I am unclear on how I can take the results of one select query and use these results to create a join or there might be a better approach
So I need to firstly locate all receipts with a matching rego number, with those receipts, find the associated items (by docket number) hopefully group the items like so
receipt no 1
Item1 with multiples as qty with subtotal
Item2
Item3
receipt no 2
Item1
Item2 with multiples as qty with subtotal
Item3
Any help greatly appreciated,
(SELECT * from tblreceipts
where vehicle = 'abc123')
join tblregisterhistory on
tblreceipts.docketnum = tblregisterhistory.docketnum
I can't even get to linking the results from the select query to a join, let alone get to my desired end result.
Are you trying to simply do a JOIN and GROUP BY? Something like this:
select partno, count(*) as qty
from tblreceipts as r inner join
tblregisterhistory as rh
on rh.docketnum = r.docketnum
where r.vehicle = 'abc123'
group by partno;
Ok with a bit of study and some helpful hints above. I have this (also apologies for the code formatting, still familiarising myself with the stack posting techniques)
SELECT vehicle, tblregisterhistory.date, partnumber, count(partnumber) as qty,
description, sum(price) as subtotal
FROM tblreceipts INNER JOIN
tblregisterhistory ON tblreceipts.docketnumber = tblregisterhistory.docketnumber
WHERE tblreceipts.vehicle = 'abc123'
GROUP BY tblregisterhistory.date, vehicle, partnumber, description, price
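To see how the JOIN plus GROUP BY collapses duplicated item rows into a quantity and subtotal per receipt, here is a small sketch using Python's sqlite3 module. SQLite stands in for Access, so the syntax may need adjusting there, and the receiptno column and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tblreceipts (receiptno INTEGER, docketnumber INTEGER, vehicle TEXT);
    CREATE TABLE tblregisterhistory (docketnumber INTEGER, partnumber TEXT, price INTEGER);
    INSERT INTO tblreceipts VALUES (1, 100, 'abc123'), (2, 200, 'abc123'), (3, 300, 'xyz789');
    -- the same part sold twice on one docket is stored as two rows
    INSERT INTO tblregisterhistory VALUES
        (100, 'OIL', 15), (100, 'OIL', 15), (100, 'FILTER', 8),
        (200, 'OIL', 15),
        (300, 'TYRE', 90);
    """
)

# One output row per receipt and part: duplicates become qty and subtotal.
report = conn.execute(
    """
    SELECT r.receiptno, rh.partnumber, COUNT(*) AS qty, SUM(rh.price) AS subtotal
    FROM tblreceipts AS r
    INNER JOIN tblregisterhistory AS rh
        ON rh.docketnumber = r.docketnumber
    WHERE r.vehicle = 'abc123'
    GROUP BY r.receiptno, rh.partnumber
    ORDER BY r.receiptno, rh.partnumber
    """
).fetchall()

print(report)  # [(1, 'FILTER', 1, 8), (1, 'OIL', 2, 30), (2, 'OIL', 1, 15)]
```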

Returning data from a single Child record (sorted by date) with Parent data also

As a SQL noob I have a, what I am assuming, basic question about 1 to many children records.
I have an order table and an Order_Status child table.
Order table
ID Order_Number Status Order_Date etc
Order_Status table
StatusTo StatusFrom Order_ID StatusChange_Date
The child table can have many entries for the status changing for a single parent order.
How do I pull back the following information as a single record with the child table's (os) most recent record for that parent (p)? (p.Order_Number, p.Status, p.Order_Date, os.StatusTo, os.StatusChange_Date).
I need to know because I am concerned the final os.statusto does not match the p.status.
Thanks in advance!
Steve
you can join on to a sub query which gets the most recent order status
e.g.
SELECT p.Order_Number, p.Status, p.Order_Date, os.StatusTo, os.StatusChange_Date
FROM ORDER p
LEFT JOIN (
SELECT Order_ID, MAX(StatusChange_Date) as StatusChange_Date
FROM Order_Status
GROUP BY Order_ID
) latest ON latest.Order_ID = p.ID
LEFT JOIN Order_Status os ON os.Order_ID = latest.Order_ID
AND os.StatusChange_Date = latest.StatusChange_Date
I believe this should work. Assuming that you only care about those orders that have changes, and where the change is different than what is recorded (should be trivial to modify).
WITH Most_Recent_Change (order_id, statusTo, changedAt, rownum) as
(SELECT order_id, statusTo, statusChange_date,
ROW_NUMBER() OVER(PARTITION BY order_id
ORDER BY statusChange_date DESC)
FROM Order_Status)
SELECT Order.order_number, Order.status, Order.order_date,
Most_Recent_Change.statusTo, Most_Recent_Change.changedAt
FROM Order
JOIN Most_Recent_Change
ON Most_Recent_Change.order_id = Order.id
AND Most_Recent_Change.rownum = 1
AND Most_Recent_Change.statusTo <> Order.status
(would have an SQLFiddle example, but it's acting weird at the moment)
Please note you should be careful of the commit level you run this at, as otherwise you may get false positives from rows being concurrently updated.
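In place of the missing SQLFiddle, here is a minimal sketch of the same approach run through Python's sqlite3 module. The table is named orders to sidestep the reserved word, the snake_case column names and sample rows are hypothetical, and window functions need SQLite 3.25 or newer. Only orders whose latest recorded change disagrees with the parent status are returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, order_number TEXT, status TEXT);
    CREATE TABLE order_status (
        order_id INTEGER, status_from TEXT, status_to TEXT, status_change_date TEXT
    );
    INSERT INTO orders VALUES (1, 'ORD-1', 'SHIPPED'), (2, 'ORD-2', 'PAID');
    INSERT INTO order_status VALUES
        (1, 'NEW',  'PAID',      '2024-01-01'),
        (1, 'PAID', 'SHIPPED',   '2024-01-03'),  -- agrees with orders.status
        (2, 'NEW',  'PAID',      '2024-01-02'),
        (2, 'PAID', 'CANCELLED', '2024-01-05');  -- disagrees with orders.status
    """
)

# Rank each order's status changes newest-first, then keep rank 1 and
# compare it against the status stored on the parent row.
mismatches = conn.execute(
    """
    WITH most_recent_change AS (
        SELECT order_id, status_to, status_change_date,
               ROW_NUMBER() OVER (PARTITION BY order_id
                                  ORDER BY status_change_date DESC) AS rn
        FROM order_status
    )
    SELECT o.order_number, o.status, m.status_to, m.status_change_date
    FROM orders o
    JOIN most_recent_change m
      ON m.order_id = o.id
     AND m.rn = 1
     AND m.status_to <> o.status
    """
).fetchall()

print(mismatches)  # [('ORD-2', 'PAID', 'CANCELLED', '2024-01-05')]
```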
Other notes:
Don't use reserved words (like ORDER) for identifiers. It's just a hassle in general
Don't suffix columns with their datatypes, especially if those types may change in the future. I'm aware that order_date isn't being strictly named in this fashion, but it's dangerously close. It should probably be something like orderedOn (if of a strict 'solar day' type) or the better orderedAt (timestamp, in UTC or with timezone).