performance: joining tables vs. large table with redundant data - sql

Let's say I have a bunch of products. Each product has an id, a price, and a long description made up of multiple paragraphs. Each product also has multiple SKU numbers that represent different sizes and colors.
To clarify: product_id 1 has 3 skus, product_id 2 has 5 skus. All of the skus in product 1 share the same price and description. product 2 has a different price and description than product 1. All of product 2's skus share product 2's price and description.
I could have a large table with different records for each sku. The records would have redundant fields like the long description and price.
Or I could have two tables. One named "products" with product_id, price, and description. And one named "skus" with product_id, sku, color, and size. I would then join the tables on the product_id column.
$query = "SELECT * FROM skus LEFT OUTER JOIN products ON skus.product_id=products.product_id WHERE color='green'";
or
$query = "SELECT * FROM master_table WHERE color='green'";
This is a dumbed down version of my setup. In the end there will be a lot more columns and a lot of products. Which method would have better performance?
So to be more specific: Let's say I want to LIKE search on the long_description column for all of the skus. I am trying to compare having one table that has 5000 long_description and 5000 skus vs OUTER JOINing two tables, one has 1000 long_description records and the other has 5000 skus.

It depends on the usage of those tables - in order to get a definitive answer you should do both and compare using representative data sets / system usage.
The normal approach is to only denormalise data in order to combat specific performance problems that you are having, so in this case my advice would be to default to joining across two tables and only denormalise to using a single table if you have a performance problem and find that denormalisation fixes it.
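A minimal sketch of how such a comparison could be set up, using Python's built-in sqlite3 module rather than the asker's actual database. The table layout follows the question; the generated data, the `timed` helper and the 'waterproof' search term are made up for illustration, and timings on SQLite will not transfer directly to another engine.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, price REAL,
                           long_description TEXT);
    CREATE TABLE skus (sku TEXT PRIMARY KEY, product_id INTEGER,
                       color TEXT, size TEXT);
    CREATE TABLE master_table (sku TEXT PRIMARY KEY, product_id INTEGER,
                               price REAL, long_description TEXT,
                               color TEXT, size TEXT);
""")

# Hypothetical data set: 1000 products with 5 SKUs each (5000 SKUs total).
for pid in range(1, 1001):
    desc = ("waterproof " if pid % 10 == 0 else "plain ") + "description text " * 50
    conn.execute("INSERT INTO products VALUES (?, ?, ?)", (pid, 9.99, desc))
    for n in range(5):
        sku = f"{pid}-{n}"
        color = "green" if n == 0 else "blue"
        conn.execute("INSERT INTO skus VALUES (?, ?, ?, ?)", (sku, pid, color, "M"))
        conn.execute("INSERT INTO master_table VALUES (?, ?, ?, ?, ?, ?)",
                     (sku, pid, 9.99, desc, color, "M"))

def timed(sql):
    """Run a query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start

# Normalised layout: LIKE scans 1000 descriptions, then joins to the SKUs.
joined_rows, joined_secs = timed("""
    SELECT skus.sku
    FROM skus
    JOIN products ON skus.product_id = products.product_id
    WHERE products.long_description LIKE '%waterproof%'
    ORDER BY skus.sku
""")

# Denormalised layout: LIKE scans 5000 redundant descriptions.
flat_rows, flat_secs = timed("""
    SELECT sku
    FROM master_table
    WHERE long_description LIKE '%waterproof%'
    ORDER BY sku
""")

print(f"join: {len(joined_rows)} rows in {joined_secs:.4f}s")
print(f"flat: {len(flat_rows)} rows in {flat_secs:.4f}s")
```

Both queries should return the same SKUs; only the timings differ, and those are worth trusting only when measured on your real engine with representative data.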

For OLTP, normalized tables are better:
Join them at query time. This gives easier data manipulation and good response times for short queries.
For OLAP, denormalized tables are better:
The tables mostly don't change, which suits long-running queries.

Related

Price comparison database - put price data in main table, in one separate table or in many product tables?

I'm trying to build a price comparison database with n products and a definitive but changing number of vendors that sell these products.
For my price comparison database, I need to store both current prices for a product across different vendors and historical prices (one lowest price).
As I see it, I have 2 options to design the database tables:
1. Put all vendor prices into the main table.
I know how many vendors there will be and if I add or remove a vendor I can add or remove a column.
Historical prices (the lowest price on a certain date across all vendors) go into a separate table with a product name, a price and a date.
2. Have one table for products and one table for prices
I will have only the static attribute data in the main table such as categories, attributes etc and then add prices to a separate product table where I store price, vendor, date in it and I can store the lowest price as a pseudo-vendor in that table for each date or I can store it in a separate table as well.
Which method would you suggest and am I missing something?
You should store the base data in a normalized format that contains all the history. This means that you have tables for:
products, with one row per product and the static information about the products.
vendors, with one row per vendor and the static information about the vendor.
prices, with one row per price along with the date and product and vendor.
You can get the current and lowest prices using a query, such as:
select pr.*
from (select pr.*,
             min(price) over (partition by product_id) as min_price,
             row_number() over (partition by product_id, vendor order by price_datetime desc) as seqnum
      from prices pr
      where pr.product_id = XXX
     ) pr
where seqnum = 1;
For performance, you want an index on prices(product_id, vendor, price_datetime desc).
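To see the query's shape in action, here is a small sketch run against SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The column names `product_id` and `vendor_id`, the index name and the sample rows are assumptions made for the example, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (
        product_id INTEGER, vendor_id INTEGER, price REAL, price_datetime TEXT
    );
    -- Index matching the partition and ordering used by the query.
    CREATE INDEX idx_prices ON prices (product_id, vendor_id, price_datetime DESC);
    -- Hypothetical price history for one product sold by two vendors.
    INSERT INTO prices VALUES
        (1, 10, 19.99, '2024-01-01'),
        (1, 10, 17.49, '2024-02-01'),
        (1, 20, 15.00, '2024-01-15'),
        (1, 20, 18.25, '2024-02-10');
""")

# Latest price per vendor, plus the lowest price ever seen for the product.
latest = conn.execute("""
    SELECT vendor_id, price, min_price
    FROM (SELECT pr.*,
                 MIN(price) OVER (PARTITION BY product_id) AS min_price,
                 ROW_NUMBER() OVER (PARTITION BY product_id, vendor_id
                                    ORDER BY price_datetime DESC) AS seqnum
          FROM prices pr
          WHERE pr.product_id = ?) AS t
    WHERE seqnum = 1
    ORDER BY vendor_id
""", (1,)).fetchall()

for vendor_id, price, min_price in latest:
    print(vendor_id, price, min_price)
```

Each vendor contributes exactly one row (its most recent price), and every row carries the historical minimum for the product.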
Eventually, you may find that this query runs too slowly. In that case, you can consider optimizations. One optimization would be to use triggers to store the latest date for each product/vendor combination, along with the minimum price in the products table.
Another would be maintaining a summary table for each product and vendor using triggers. However, that is probably not how you should start the endeavor.
However, you might be surprised at how well the above query can perform on your data.
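If you do end up needing the summary-table optimization, a trigger along these lines is one way it could look. This is only a sketch in SQLite via Python's sqlite3 module; the `current_prices` table, the trigger name and the column names are hypothetical, and trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (
        product_id INTEGER, vendor_id INTEGER, price REAL, price_datetime TEXT
    );
    -- Hypothetical summary table: one row per product/vendor pair.
    CREATE TABLE current_prices (
        product_id INTEGER, vendor_id INTEGER, price REAL, price_datetime TEXT,
        PRIMARY KEY (product_id, vendor_id)
    );
    -- Keep the summary row in sync, ignoring late-arriving older prices.
    CREATE TRIGGER prices_after_insert AFTER INSERT ON prices
    BEGIN
        INSERT OR REPLACE INTO current_prices
            (product_id, vendor_id, price, price_datetime)
        SELECT NEW.product_id, NEW.vendor_id, NEW.price, NEW.price_datetime
        WHERE NOT EXISTS (
            SELECT 1 FROM current_prices c
            WHERE c.product_id = NEW.product_id
              AND c.vendor_id = NEW.vendor_id
              AND c.price_datetime > NEW.price_datetime
        );
    END;
""")

conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?)",
    [
        (1, 10, 19.99, "2024-01-01"),
        (1, 10, 17.49, "2024-02-01"),
        (1, 10, 18.00, "2024-01-15"),  # older than the stored row, so skipped
    ],
)

current = conn.execute(
    "SELECT product_id, vendor_id, price, price_datetime FROM current_prices"
).fetchall()
print(current)
```

The full history stays in `prices`; the summary table is purely derived data and can always be rebuilt from it.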

SQL query - select only products ids which was sorted top in another table

OK, I have a situation where I need to write a SQL query that returns ids from table1 (products), ordered within table2 (category) and limited to 10 per category.
So, what I need: select the ids of the products that appear in the "top 10" results of each category after ordering those products. Each product has some columns, and I order by those columns. The same product can appear in the top 10 of several categories, so I need to use DISTINCT to get unique results.
Is there any relationship between Product and Category? What are the Product columns you're ordering by? Is it OK for there to be duplication between different lists of products? You should really post your models/SQL tables and explain more clearly what you're trying to do if you want real help.
Assuming they're many-to-many, the relationships are set up in Rails, and having the same products in multiple lists is OK, I would do something like this:
top_products = {}
Category.all.each do |cat|
top_products[cat.name] = cat.products.order("some_product_column DESC").limit(10).map{|p| p.id}
end
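If you would rather get the distinct ids in a single SQL query instead of one query per category, a window function can rank products inside each category. The sketch below uses SQLite through Python's sqlite3 module (SQLite 3.25+); the `category_products` join table, the `score` column and the sample rows are assumptions, since the question did not show its tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, score INTEGER);
    -- Hypothetical many-to-many link between categories and products.
    CREATE TABLE category_products (category_id INTEGER, product_id INTEGER);
    INSERT INTO products VALUES (1, 90), (2, 80), (3, 70), (4, 60);
    INSERT INTO category_products VALUES
        (100, 1), (100, 2), (100, 3),
        (200, 1), (200, 3), (200, 4);
""")

def top_product_ids(conn, per_category):
    """Return distinct ids of products ranked in the top N of any category."""
    rows = conn.execute("""
        SELECT DISTINCT product_id
        FROM (SELECT cp.product_id,
                     ROW_NUMBER() OVER (PARTITION BY cp.category_id
                                        ORDER BY p.score DESC, p.id) AS rn
              FROM category_products cp
              JOIN products p ON p.id = cp.product_id) AS ranked
        WHERE rn <= ?
        ORDER BY product_id
    """, (per_category,)).fetchall()
    return [row[0] for row in rows]

# Top 2 per category here; the question would pass 10.
top_ids = top_product_ids(conn, 2)
print(top_ids)
```

Product 1 is in the top 2 of both categories but appears only once in the result because of the DISTINCT.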

Access query/SQL - duplicates in one field with distinct multiple 2nd field

I am working on a database with products and lot numbers. Each entry in the Lots table has a Lot Number and a Product description.
Sometimes there are multiple records of the same lot number, for example when an item is repacked a new record is created, but with the same Lot Number and same product description - this is fine. But other times there are problem cases, namely when two different products share the same Lot Number. I am trying to find those.
In other words, there are 3 possibilities:
Lot numbers for which there is only one record in the table.
Lot numbers for which there are multiple records, but the Product description is the same for all of them
Lot numbers for which there are multiple records, and the product descriptions are not all the same.
I need to return only #3, with a separate record for each instance of that Lot Number and product description.
Any help would be greatly appreciated.
Thanks Juan for the sample data. Using this example, I want to return the data contained in Id 2-8, but not 1, 9, 10, 11.
This wasn't easy because I haven't used Access in a long time.
First, select the unique lot number/product pairs using DISTINCT.
Then count how many different products appear for each lot number using GROUP BY.
Last, join both results and show only the lots with more than one description (where total > 1).
SELECT id, Product.lotnumber, Product.Product, total
FROM
Product Inner join
(
SELECT lotnumber, count(*) as total
FROM
(SELECT distinct lotnumber, product
FROM Product)
GROUP BY lotnumber
) SubT On Product.lotnumber = SubT.lotnumber
WHERE total > 1
ORDER BY id
As you can see:
lot 2 has two products (yy and zz)
lot 3 has three products (aa, bb, cc)
I include my product table:
Sorry for the Spanish. The field types are AutoNumber, Short Text, and Number.
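For anyone who wants to try the three steps without Access at hand, here is the same query shape run on SQLite through Python's sqlite3 module. The rows are invented to match the description in the question (ids 2 to 8 should come back, ids 1, 9, 10 and 11 should not); they are not the original sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (id INTEGER PRIMARY KEY, lotnumber INTEGER, product TEXT);
    INSERT INTO Product (id, lotnumber, product) VALUES
        (1, 1, 'xx'),                                        -- single record
        (2, 2, 'yy'), (3, 2, 'zz'), (4, 2, 'yy'),            -- two products
        (5, 3, 'aa'), (6, 3, 'bb'), (7, 3, 'cc'), (8, 3, 'aa'),
        (9, 4, 'dd'),                                        -- single record
        (10, 5, 'ee'), (11, 5, 'ee');                        -- repack, same product
""")

problem_rows = conn.execute("""
    SELECT p.id, p.lotnumber, p.product, SubT.total
    FROM Product AS p
    INNER JOIN (
        -- Count the distinct products per lot number.
        SELECT lotnumber, COUNT(*) AS total
        FROM (SELECT DISTINCT lotnumber, product FROM Product) AS pairs
        GROUP BY lotnumber
    ) AS SubT ON p.lotnumber = SubT.lotnumber
    WHERE SubT.total > 1
    ORDER BY p.id
""").fetchall()

for row in problem_rows:
    print(row)
```

The nested SELECT DISTINCT is there because Access SQL has no COUNT(DISTINCT ...); engines that do support it can collapse the two inner levels into one.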

MS Access 2010 query pulls same records multiple times, sql challenge

I'm currently working on a program that keeps track of my company's stock inventory, using ms Access 2010. I'm having a hard time getting the query, intended to show inventory, to display the information I want. The problem seems to be that the query pulls the same record multiple times, inflating the sums of reserved and sold product.
Background:
My company stocks steel bars. We offer to cut the bars into pieces. From an inventory side, we want to track the length of each bar from the moment it comes into the warehouse, through its time in the warehouse (where it might get cut into smaller pieces), until the entire bar is sold and gone.
Database:
The query giving problems consults 3 tables:
Barstock (with the following fields)
- BatchNumber (all the bars received that belong to the same production heat)
- BarNo (the individual bar)
- Original Length (the length of the bar when received into stock)
- (BatchNumber and BarNo combined are the primary key)
Sales
- ID (primary key)
- BatchNumber
- BarNo
- Quantity Sold
Reservation (a seller can reserve some material when a customer signals interest but needs time to decide)
- ID (primary key)
- BatchNumber
- BarNo
- Quantity Reserved
I'd like to pull information from the three tables into one list, that displays:
- Barstock.Original Length As Received
- Sales.Quantity Sold As Sold
- Received - Sold As On Stock
- reservation.Quantity Reserved As Reserved
- On Stock - Reserved As Available.
The problem is that I suck at sql. I've looked into union and inner join to the best of my ability, but my efforts have been in vain. I usually rely on the design view to produce the Sql statements I need. With design view, I've come up with the following Sql:
SELECT
BarStock.BatchNo
, BarStock.BarNo
, First(BarStock.OrgLength) AS Recieved
, Sum(Sales.QtySold) AS SumAvQtySold
, [Recieved]-[SumAvQtySold] AS [On Stock]
, Sum(Reservation.QtyReserved) AS Reserved
, ([On Stock]-[Reserved])*[Skjemaer]![Inventory]![unitvalg] AS Available
FROM
(BarStock
INNER JOIN Reservation ON (BarStock.BarNo = Reservation.BarNo) AND (BarStock.BatchNo = Reservation.BatchNo)
)
INNER JOIN Sales ON (BarStock.BarNo = Sales.BarNo) AND (BarStock.BatchNo = Sales.BatchNo)
GROUP BY
BarStock.BatchNo
, BarStock.BarNo
I know that the query is pulling the same record multiple times because:
- when I remove the GROUP BY clause, I get several records that are exactly the same.
- There is, however, only one instance of each of these records in the corresponding tables.
I hope I've been able to explain myself properly, please ask if I need to elaborate on anything.
Thank you for taking the time to look at my problem!
Checking some assumptions
From your database schema, it seems that:
There could be multiple Sales records for a given BatchNumber/BarNo (for instance, I can imagine that multiple customers may have bought subsections of the same bar).
There could be multiple Reservation records for a given BatchNumber/BarNo (for instance, multiple sections of the same bar could be 'reserved')
To check if you do indeed have multiple records in those tables, try something like:
SELECT CountOfDuplicates
FROM (SELECT COUNT(*) AS CountOfDuplicates
FROM Sales
GROUP BY BatchNumber & "," & BarNo)
WHERE CountOfDuplicates > 1
If the query returns some records, then there are duplicates and it's probably why your query is returning incorrect values.
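To make the effect concrete, here is a small reproduction of how joining two "many" tables to the same bar multiplies the rows. It runs on SQLite through Python's sqlite3 module rather than Access, and the figures (one bar, two sales, two reservations) are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BarStock (BatchNumber TEXT, BarNo INTEGER, OrgLength INTEGER,
                           PRIMARY KEY (BatchNumber, BarNo));
    CREATE TABLE Sales (ID INTEGER PRIMARY KEY, BatchNumber TEXT,
                        BarNo INTEGER, QtySold INTEGER);
    CREATE TABLE Reservation (ID INTEGER PRIMARY KEY, BatchNumber TEXT,
                              BarNo INTEGER, QtyReserved INTEGER);
    -- One bar with two sales and two reservations (made-up numbers).
    INSERT INTO BarStock VALUES ('B1', 1, 6000);
    INSERT INTO Sales (BatchNumber, BarNo, QtySold)
        VALUES ('B1', 1, 1000), ('B1', 1, 500);
    INSERT INTO Reservation (BatchNumber, BarNo, QtyReserved)
        VALUES ('B1', 1, 200), ('B1', 1, 300);
""")

# Same join shape as the query in the question: every sale row is paired
# with every reservation row of the same bar.
row_count, inflated_sold, inflated_reserved = conn.execute("""
    SELECT COUNT(*), SUM(s.QtySold), SUM(r.QtyReserved)
    FROM BarStock b
    INNER JOIN Reservation r
        ON b.BarNo = r.BarNo AND b.BatchNumber = r.BatchNumber
    INNER JOIN Sales s
        ON b.BarNo = s.BarNo AND b.BatchNumber = s.BatchNumber
""").fetchone()

true_sold = conn.execute("SELECT SUM(QtySold) FROM Sales").fetchone()[0]

print(row_count, inflated_sold, inflated_reserved, true_sold)
```

With two sales and two reservations the join yields four rows, so each sum comes out doubled compared with the real totals.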
Starting from scratch
Now, the trick to making your query work is to really think about what main data you want to show, and start from that:
You basically want a list of all bars in the stock.
Some of these bars may have been sold, or they may be reserved, but if they are not, you should show the Quantity available in Stock. Your current query would never show you that.
For each bar in stock, you want to list the quantity sold and the quantity reserved, and combine them to find out the quantity remaining available.
So it's clear, your central data is the list of bars in stock.
Rather than try to pull everything into a single large query straight away, it's best to create simple queries for each of those goals and make sure we get the proper data in each case.
Just the Bars
From what you explain, each individual bar is recorded in the BarStock table.
As I said in my comment, from what I understand, all bars that are delivered have a single record in the BarStock table, without duplicates. So your main list against which your inventory should be measured is the BarStock table:
SELECT BatchNumber,
BarNo,
OrgLength
FROM BarStock
Just the Sales
Again, this should be pretty straightforward: we just need to find out how much total length was sold for each BatchNumber/BarNo pair:
SELECT BatchNumber,
BarNo,
Sum(QtySold) AS SumAvQtySold
FROM Sales
GROUP BY BatchNumber, BarNo
Just the Reservations
Same as for Sales:
SELECT BatchNumber,
BarNo,
SUM(QtyReserved) AS Reserved
FROM Reservation
GROUP BY BatchNumber, BarNo
Original Stock against Sales
Now, we should be able to combine the first 2 queries into one. I'm not trying to optimise, just to make the data work together:
SELECT BarStock.BatchNumber,
BarStock.BarNo,
BarStock.OrgLength,
S.SumAvQtySold,
(BarStock.OrgLength - Nz(S.SumAvQtySold)) AS OnStock
FROM BarStock
LEFT JOIN (SELECT BatchNumber,
BarNo,
Sum(QtySold) AS SumAvQtySold
FROM Sales
GROUP BY BatchNumber, BarNo) AS S
ON (BarStock.BatchNumber = S.BatchNumber) AND (BarStock.BarNo = S.BarNo)
We do a LEFT JOIN because there might be bars in stock that have not yet been sold.
If we did an INNER JOIN, we would have missed these in the final report, leading us to believe that these bars were never there in the first place.
All together
We can now wrap the whole query in another LEFT JOIN against the reserved bars to get our final result:
SELECT BS.BatchNumber,
BS.BarNo,
BS.OrgLength,
BS.SumAvQtySold,
BS.OnStock,
R.Reserved,
(OnStock - Nz(Reserved)) AS Available
FROM (SELECT BarStock.BatchNumber,
BarStock.BarNo,
BarStock.OrgLength,
S.SumAvQtySold,
(BarStock.OrgLength - Nz(S.SumAvQtySold)) AS OnStock
FROM BarStock
LEFT JOIN (SELECT BatchNumber,
BarNo,
SUM(QtySold) AS SumAvQtySold
FROM Sales
GROUP BY BatchNumber,
BarNo) AS S
ON (BarStock.BatchNumber = S.BatchNumber) AND (BarStock.BarNo = S.BarNo)) AS BS
LEFT JOIN (SELECT BatchNumber,
BarNo,
SUM(QtyReserved) AS Reserved
FROM Reservation
GROUP BY BatchNumber,
BarNo) AS R
ON (BS.BatchNumber = R.BatchNumber) AND (BS.BarNo = R.BarNo)
Note the use of Nz() for items that are on the right side of the join: if there is no Sales or Reservation data for a given BatchNumber/BarNo pair, the values for SumAvQtySold and Reserved will be Null and will render OnStock and Available null as well, regardless of the actual quantity in stock, which would not be the result we expect.
Using the Query designer in Access, you would have had to create the 3 queries separately and then combine them.
Note though that the Query Designer isn't very good at dealing with multiple LEFT and RIGHT joins, so I don't think you could have written the whole thing in one go.
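The aggregate-first, then LEFT JOIN approach can be checked outside Access too. The sketch below runs it on SQLite through Python's sqlite3 module, with COALESCE(x, 0) standing in for Access's Nz(x); the three sample bars and their quantities are made up for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BarStock (BatchNumber TEXT, BarNo INTEGER, OrgLength INTEGER,
                           PRIMARY KEY (BatchNumber, BarNo));
    CREATE TABLE Sales (ID INTEGER PRIMARY KEY, BatchNumber TEXT,
                        BarNo INTEGER, QtySold INTEGER);
    CREATE TABLE Reservation (ID INTEGER PRIMARY KEY, BatchNumber TEXT,
                              BarNo INTEGER, QtyReserved INTEGER);
    -- Bar 1: sold and reserved. Bar 2: untouched. Bar 3: sold only.
    INSERT INTO BarStock VALUES ('B1', 1, 6000), ('B1', 2, 6000), ('B1', 3, 4000);
    INSERT INTO Sales (BatchNumber, BarNo, QtySold)
        VALUES ('B1', 1, 1000), ('B1', 1, 500), ('B1', 3, 1000);
    INSERT INTO Reservation (BatchNumber, BarNo, QtyReserved)
        VALUES ('B1', 1, 200), ('B1', 1, 300);
""")

inventory = conn.execute("""
    SELECT bs.BatchNumber,
           bs.BarNo,
           bs.OrgLength,
           COALESCE(s.SumQtySold, 0)                AS Sold,
           bs.OrgLength - COALESCE(s.SumQtySold, 0) AS OnStock,
           COALESCE(r.Reserved, 0)                  AS Reserved,
           bs.OrgLength - COALESCE(s.SumQtySold, 0)
                        - COALESCE(r.Reserved, 0)   AS Available
    FROM BarStock bs
    -- Each child table is collapsed to one row per bar before joining.
    LEFT JOIN (SELECT BatchNumber, BarNo, SUM(QtySold) AS SumQtySold
               FROM Sales
               GROUP BY BatchNumber, BarNo) AS s
        ON bs.BatchNumber = s.BatchNumber AND bs.BarNo = s.BarNo
    LEFT JOIN (SELECT BatchNumber, BarNo, SUM(QtyReserved) AS Reserved
               FROM Reservation
               GROUP BY BatchNumber, BarNo) AS r
        ON bs.BatchNumber = r.BatchNumber AND bs.BarNo = r.BarNo
    ORDER BY bs.BatchNumber, bs.BarNo
""").fetchall()

for row in inventory:
    print(row)
```

Because Sales and Reservation are each reduced to one row per bar before the join, no row multiplication can occur, and bars without sales or reservations still appear with zeros.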
Some comments
I believe you should read the information that @Remou gave you in his comments.
To me, there are some unfortunate design choices for this database: getting basic stock data should be as easy as a simple SUM() on the column that holds inventory records.
Usually, a simple way to track inventory is to keep track of each stock transaction:
Incoming stock records have a + Quantity
Outgoing stock records have a - Quantity
The record should also keep track of the part/item/bar reference (or ID), the date and time of the transaction, and -if you want to manage multiple warehouses- which warehouse ID is involved.
So if you need to know the complete stock at hand for all items, all you need to do is something like:
SELECT BarID,
Sum(Quantity)
FROM StockTransaction
GROUP BY BarID
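A quick sketch of that transaction-ledger idea, run on SQLite through Python's sqlite3 module. The `StockTransaction` columns beyond BarID and Quantity, and all the sample figures, are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StockTransaction (
        ID INTEGER PRIMARY KEY,
        BarID INTEGER,
        Quantity INTEGER,        -- positive = incoming, negative = outgoing
        TransactionDate TEXT
    );
    INSERT INTO StockTransaction (BarID, Quantity, TransactionDate) VALUES
        (1,  6000, '2024-01-05'),   -- bar 1 received
        (1, -1000, '2024-01-12'),   -- piece sold
        (1,  -500, '2024-01-20'),   -- piece sold
        (2,  4000, '2024-01-06');   -- bar 2 received
""")

# Stock on hand is just the running sum of signed quantities per bar.
stock_on_hand = dict(conn.execute("""
    SELECT BarID, SUM(Quantity)
    FROM StockTransaction
    GROUP BY BarID
""").fetchall())

print(stock_on_hand)
```

Every movement is a new row, so nothing is ever updated in place and the history of each bar remains available.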
In your case, while BatchNumber/BarNo is your natural key, keeping them in a separate Bar table would have some advantages:
You can use Bar.ID to get back the Bar.BatchNumber and Bar.BarNo anywhere you need it.
You can use BarID as a foreign key in your BarStock, Sales and Reservation tables. It makes joins easier without having to mess with the complexities of compound keys.
There are things that Access allows that are not really good practice, such as spaces in table and field names. These end up making things less readable (at least because you need to keep them between []), less consistent with the VBA variable names that represent these fields, and incompatible with other databases that don't accept anything other than alphanumerical characters for table and field names (should you wish to up-size later or interface your database with other apps).
Also, help yourself by sticking to a single naming convention, and keep it consistent:
Do not mix upper and lower case inconsistently: either use CamelCase, or lower case or UPPER case for everything, but always keep to that rule.
Name tables in the singular -or the plural-, but stay consistent. I prefer to use the singular, like table Part instead of Parts, but it's just a convention (that has its own reasons).
Spell correctly: it's Received not Recieved. That mistake alone may cost you when debugging why some query or VBA code doesn't work, just because someone made a typo.
Each table should/must have an ID column. Usually, this will be an auto-increment that guarantees uniqueness of each record in the table. If you keep that convention, then foreign keys become easy to guess and to read and you never have to worry about some business requirement changing the fact that you could suddenly find yourself with 2 identical BatchNumbers, for some reason you can't fathom right now.
There are lots of debates about database design, but there are certain 'rules' that everyone agrees with, so my recommendation would be to strive for:
Simplicity: make sure that each table records one kind of data, and does not contain redundant data from other tables (normalisation).
Consistency: naming conventions are important. Whatever you choose, stick to it throughout your project.
Clarity: make sure that you-in-3-years and other people can easily read the table names and fields and understand what they are without having to read a 300 page specification. It's not always possible to be that clear, but it's something to strive for.

Can items with most common descendants be found using SQL / MySQL?

Can SQL be used to find all the brands that have the most categories in common?
For example, the brand "Dove" can have the categories Soap, Skin Care, Shampoo.
The goal is to find the brands with the most matching categories, in other words, the most similar brands.
It can be done programmatically using Ruby or PHP: just take a brand, and loop through all the other brands, and see how many matching categories there are, and sort by it. But if there are 2000 brands, then there needs to be 2000 queries per brand. (unless we pre-cache all the 2000 query results, so for all 2000 brands, we re-use those results)
Can it be done by SQL / MySQL by 1 query?
Say, the table has:
entities
--------
id
type = brand or category or product
name
entities_parent_child
--------------------
parent_id
child_id
The table above has an entry for each parent = brand and child = product, and also an entry for each parent = category and child = product, so a brand can only be related to a category through its products.
I think the hard part for SQL is: find all the maximum matching counts, and sort by those numbers.
I agree with wuputah's comment. For this problem an "entities" table is not the answer. You've given yourself a hint the design is wrong when you say you cannot form a query to get the answers you want.
Create a proper hierarchy for your data, with separate tables for separate real-world entities. Yours will be:
[Brands]
[Categories]
[Products]
If you need help with defining trees and hierarchies in SQL I suggest you pick up a copy of Celko's Trees and Hierarchies in SQL for Smarties.
SQL has no concept of polymorphism so don't try to design your database to fit your programming language. Databases work with sets, so think in sets.
To find similar brands join your tables and use grouping:
SELECT Brands.brand_name, COUNT(Categories.category_name) as category_count
FROM Brands INNER JOIN Categories
ON Brands.brand_name = Categories.brand_name
GROUP BY Brands.brand_name
ORDER BY Brands.brand_name, COUNT(Categories.category_name) -- add DESC if you want largest count at the top
That gives you the basic idea. If you can expand on the requirement
"...find all the maximum matching counts, and sort by those numbers"
then I can help redesign the query and, if necessary, the schema design.
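In the meantime, one possible reading of "most similar brands" is "pairs of brands that share the most categories", which a self-join can count in a single query. The sketch below is a guess at that requirement, run on SQLite through Python's sqlite3 module; the `brand_categories` table and all its rows are hypothetical and would have to be derived from your real brand, category and product tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: one row per (brand, category) pair.
    CREATE TABLE brand_categories (brand_name TEXT, category_name TEXT,
                                   PRIMARY KEY (brand_name, category_name));
    INSERT INTO brand_categories VALUES
        ('Dove', 'Soap'), ('Dove', 'Skin Care'), ('Dove', 'Shampoo'),
        ('Olay', 'Soap'), ('Olay', 'Skin Care'),
        ('Pantene', 'Shampoo'),
        ('Colgate', 'Toothpaste');
""")

# Pair every brand with every other brand on matching categories,
# then count the matches per pair. a < b keeps each pair only once.
similar_brands = conn.execute("""
    SELECT a.brand_name, b.brand_name, COUNT(*) AS shared_categories
    FROM brand_categories a
    JOIN brand_categories b
        ON a.category_name = b.category_name
       AND a.brand_name < b.brand_name
    GROUP BY a.brand_name, b.brand_name
    ORDER BY shared_categories DESC, a.brand_name, b.brand_name
""").fetchall()

for left, right, shared in similar_brands:
    print(left, right, shared)
```

Brand pairs with no category in common simply do not appear in the result, so with 2000 brands this stays one query rather than 2000 per brand.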