Can items with most common descendants be found using SQL / MySQL? - sql

Can SQL be used to find all the brands that have the most categories in common?
For example, the brand "Dove" can have the categories Soap, Skin Care, and Shampoo.
The goal is to find all the brands that have the most matching categories, in other words, the most similar brands.
It can be done programmatically using Ruby or PHP: just take a brand, loop through all the other brands, count how many matching categories there are, and sort by that count. But if there are 2000 brands, that takes 2000 queries per brand (unless we pre-cache all 2000 query results and re-use them for every brand).
Can it be done in SQL / MySQL with one query?
Say, the table has:
entities
--------
id
type = brand or category or product
name
entities_parent_child
--------------------
parent_id
child_id
The table above has an entry for each parent = brand and child = product, and also an entry for each parent = category and child = product, so a brand can only be related to a category through the products they share.
I think the hard part for SQL is: find all the maximum matching counts, and sort by those numbers.

I agree with wuputah's comment. For this problem an "entities" table is not the answer. You've given yourself a hint the design is wrong when you say you cannot form a query to get the answers you want.
Create a proper hierarchy for your data, with separate tables for separate real-world entities. Yours will be:
[Brands]
[Categories]
[Products]
If you need help with defining trees and hierarchies in SQL I suggest you pick up a copy of Celko's Trees and Hierarchies in SQL for Smarties.
SQL has no concept of polymorphism so don't try to design your database to fit your programming language. Databases work with sets, so think in sets.
To find similar brands join your tables and use grouping:
SELECT Brands.brand_name, COUNT(Categories.category_name) AS category_count
FROM Brands INNER JOIN Categories
ON Brands.brand_name = Categories.brand_name
GROUP BY Brands.brand_name
ORDER BY Brands.brand_name, COUNT(Categories.category_name) -- add DESC if you want largest count at the top
That gives you the basic idea. If you can expand on the requirement:
"...find all the maximum matching counts, and sort by those numbers"
then I can help redesign the query and, if necessary, the schema design.
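For what it's worth, a single query does seem workable even against the two-table layout from the question. The following is only a sketch run through Python's sqlite3 module, not the answerer's method: it derives distinct brand/category pairs through the shared products, self-joins those pairs on category, and counts the matches per pair of brands. The sample rows, the brand names other than "Dove", and the function names are all made up for illustration.

```python
import sqlite3

def make_demo_db():
    """Build the question's two tables in memory with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entities (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
        CREATE TABLE entities_parent_child (parent_id INTEGER, child_id INTEGER);
    """)
    conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
        (1, "brand", "Dove"), (2, "brand", "BrandB"), (3, "brand", "BrandC"),
        (10, "category", "Soap"), (11, "category", "Skin Care"),
        (12, "category", "Shampoo"),
        (100, "product", "p100"), (101, "product", "p101"),
        (102, "product", "p102"), (103, "product", "p103"),
        (104, "product", "p104"), (105, "product", "p105"),
    ])
    conn.executemany("INSERT INTO entities_parent_child VALUES (?, ?)", [
        # brand -> product
        (1, 100), (1, 101), (1, 102), (2, 103), (2, 104), (3, 105),
        # category -> product
        (10, 100), (11, 101), (12, 102), (10, 103), (11, 104), (12, 105),
    ])
    return conn

def similar_brands(conn):
    """Return (brand, other_brand, shared_categories), most similar first."""
    sql = """
    WITH brand_category AS (
        -- A brand is linked to a category through any product they share.
        SELECT DISTINCT b.parent_id AS brand_id, c.parent_id AS category_id
        FROM entities_parent_child AS b
        JOIN entities_parent_child AS c ON c.child_id = b.child_id
        JOIN entities AS be ON be.id = b.parent_id AND be.type = 'brand'
        JOIN entities AS ce ON ce.id = c.parent_id AND ce.type = 'category'
    )
    SELECT e1.name, e2.name, COUNT(*) AS shared
    FROM brand_category AS x
    JOIN brand_category AS y
      ON y.category_id = x.category_id AND y.brand_id <> x.brand_id
    JOIN entities AS e1 ON e1.id = x.brand_id
    JOIN entities AS e2 ON e2.id = y.brand_id
    GROUP BY x.brand_id, y.brand_id
    ORDER BY shared DESC, e1.name, e2.name
    """
    return conn.execute(sql).fetchall()

if __name__ == "__main__":
    for row in similar_brands(make_demo_db()):
        print(row)
```

Each pair shows up twice (once from each brand's side), which is convenient if you later filter on one brand with a WHERE clause. Brands with no category in common simply produce no row.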

Related

Count the number of unique rows after a group_concat aggregate in SQLite

I ask this on StackOverflow after carefully reading this answer about StackOverflow vs dba.se—I’m a non-expert database novice, and in my possibly-misguided estimation, a fellow non-DBA coder can help me just as well as a database expert. SQLite is also a “lite” database.
My SQLite table is for, say, a recipes scenario. It has two columns: each row holds a meal and one ingredient required by that meal. Since most meals take more than one ingredient, there are many rows with the same meal but different ingredients.
I need to know how many meals the exact set of ingredients can make—actually I need a sorted list of all the ingredients and how many meals can be made with exactly those ingredients. I hope the code will explain this completely:
CREATE TABLE recipes (
meal TEXT,
ingredient TEXT);
INSERT INTO recipes VALUES
("tandoori chicken","chicken"), ("tandoori chicken","spices"),
("mom's chicken","chicken"), ("mom's chicken","spices"),
("spicy chicken","chicken"), ("spicy chicken","spices"),
("parmesan chicken","chicken"), ("parmesan chicken","cheese"), ("parmesan chicken","bread"),
("breaded chicken","chicken"), ("breaded chicken","cheese"), ("breaded chicken","bread"),
("plain chicken","chicken");
Here, we have
one set of three meals that use exactly the same ingredients (tandoori chicken, mom’s chicken, and spicy chicken),
another set of two meals using a different set of ingredients, and
one other meal that needs exactly its one ingredient.
I want something like the following:
chicken,,,spices|3
chicken,,,cheese,,,bread|2
chicken|1
That is, a string containing the exact set of ingredients and how many meals can be made using exactly these ingredients. (Don’t worry about collating/sorting the ingredients, I can ensure that for each meal, rows will be inserted in the same order all the time. Also, don’t worry about pathological cases where the same meal-ingredient row is repeated—I can prevent that from happening.)
I can get the above output like this:
WITH t
AS (SELECT group_concat(recipes.ingredient, ",,,") AS ingredients
FROM recipes
GROUP BY recipes.meal)
SELECT t.ingredients,
count(t.ingredients) AS cnt
FROM t
GROUP BY t.ingredients
ORDER BY cnt DESC;
There’s a couple of reasons I’m not happy with this: first, it creates a sub-view and I’m really curious if there’s a way to achieve this without a sub-view—that would likely be faster and clearer. And second, inside the sub-view, I create a string via group_concat to represent the vector of ingredients—I feel like there ought to be a row-based, or data structure-like, way to get the same information out of SQL.
My question: can I get the above output, or some equivalent, without using sub-views and/or without string concatenation?
This simplification seems to work:
SELECT distinct group_concat(recipes.ingredient, ",,,")
, count(*) AS cnt
FROM recipes recipes
GROUP BY recipes.meal
ORDER BY cnt DESC;
It's really just a re-formulation of what you have already though, without the nested query or common table expression.
Since a recipe can have an arbitrary number of ingredients, doing repeated joins isn't feasible (without recursion), so I think this is a great example of how handy the GROUP_CONCAT() function is.
Edit:
Whoops, you are right, sorry about that. Looking at the problem again, I think that a separate result set is required. There are 2 levels of aggregation: one to 'pivot' the data to the recipe grain, with a list of ingredients for each meal, and then another to count the number of recipes with the same ingredients list. Below is a simple way to look at it, with the use of 'order by' in the GROUP_CONCAT to control the ordering, so that identical lists of ingredients are grouped together.
select ingredients_list, count(*)
from (
SELECT meal,
group_concat(recipes.ingredient, ",,," order by recipes.ingredient) as ingredients_list
FROM recipes recipes
GROUP BY recipes.meal
) meal_ingredients
group by ingredients_list;
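On the "without string concatenation" part of the question: if the rows are being read from a host language anyway, the second level of aggregation can be done there with real sets instead of a joined string. Here is a minimal sketch in Python, assuming the recipes table from the question; the function names are hypothetical. It collects each meal's ingredients into a frozenset and counts identical sets, so insertion order no longer matters.

```python
import sqlite3
from collections import Counter, defaultdict

def make_recipes_db():
    """Load the sample rows from the question into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE recipes (meal TEXT, ingredient TEXT)")
    conn.executemany("INSERT INTO recipes VALUES (?, ?)", [
        ("tandoori chicken", "chicken"), ("tandoori chicken", "spices"),
        ("mom's chicken", "chicken"), ("mom's chicken", "spices"),
        ("spicy chicken", "chicken"), ("spicy chicken", "spices"),
        ("parmesan chicken", "chicken"), ("parmesan chicken", "cheese"),
        ("parmesan chicken", "bread"),
        ("breaded chicken", "chicken"), ("breaded chicken", "cheese"),
        ("breaded chicken", "bread"),
        ("plain chicken", "chicken"),
    ])
    return conn

def count_ingredient_sets(conn):
    """Map each distinct set of ingredients to the number of meals using it."""
    by_meal = defaultdict(set)
    for meal, ingredient in conn.execute("SELECT meal, ingredient FROM recipes"):
        by_meal[meal].add(ingredient)
    # frozenset is hashable, so identical ingredient sets collapse to one key.
    return Counter(frozenset(items) for items in by_meal.values())

if __name__ == "__main__":
    for ingredients, cnt in count_ingredient_sets(make_recipes_db()).most_common():
        print(sorted(ingredients), cnt)
```

This trades the pure-SQL answer for one pass over the table in application code, which may or may not be acceptable depending on how large the table is.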

SQL query - select only products ids which was sorted top in another table

OK, I have a situation where I need to create an SQL query that returns ids from table1 (products), ordered within table2 (category) and limited to 10 for each category.
So, what I need: select the ids of products that appear in the "top 10" (limited to 10) results of each category after those products are ordered. Each product has some columns, and I order by those columns. The same product can appear in the top 10 of different categories, for example, so I need to use DISTINCT to get unique results.
Is there any relationship between Product and Category? What are the Product columns you're ordering by? Is it OK for there to be duplication between different lists of products? You should really post your models/SQL tables and explain more clearly what you're trying to do if you want real help.
Assuming they're many-to-many, the relationships are set up in Rails, and having the same products in multiple lists is OK, I would do something like this:
top_products = {}
Category.all.each do |cat|
top_products[cat.name] = cat.products.order("some_product_column DESC").limit(10).map{|p| p.id}
end
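If this has to happen in one SQL statement rather than one query per category, a correlated subquery is one way to express "top N per category". The sketch below runs through Python's sqlite3 module and rests on assumptions the question never confirms: a products table with a single numeric score column to order by, and a category_products link table. All of those names are hypothetical. A row is kept when fewer than N products in the same category have a higher score, and DISTINCT collapses products that make the cut in several categories.

```python
import sqlite3

TOP_N_SQL = """
SELECT DISTINCT cp.product_id
FROM category_products AS cp
JOIN products AS p ON p.id = cp.product_id
WHERE (
    -- How many products in the same category outrank this one?
    SELECT COUNT(*)
    FROM category_products AS cp2
    JOIN products AS p2 ON p2.id = cp2.product_id
    WHERE cp2.category_id = cp.category_id
      AND p2.score > p.score
) < :n
ORDER BY cp.product_id
"""

def make_demo_db():
    """Hypothetical schema and rows, only for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (id INTEGER PRIMARY KEY, score INTEGER);
        CREATE TABLE category_products (category_id TEXT, product_id INTEGER);
    """)
    conn.executemany("INSERT INTO products VALUES (?, ?)",
                     [(1, 50), (2, 40), (3, 30), (4, 20), (5, 10)])
    conn.executemany("INSERT INTO category_products VALUES (?, ?)",
                     [("A", 1), ("A", 2), ("A", 3),
                      ("B", 2), ("B", 4), ("B", 5)])
    return conn

def top_product_ids(conn, n):
    """Ids of products ranked in the top n of at least one category."""
    return [row[0] for row in conn.execute(TOP_N_SQL, {"n": n})]

if __name__ == "__main__":
    print(top_product_ids(make_demo_db(), 2))
```

Note that tied scores can let more than N rows through for a category, so a tie-breaker column would be needed if an exact limit matters.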

Linking Three Tables together

I'm creating an archive for Academic Papers. Each paper may have one author, or multiple authors. I've created the tables in the following manner:
Table 1: PaperInfo - Each row contains information on the paper
Table 2: PaperAuthor - Only Two Columns: contains PaperID, and AuthorID
Table 3: AuthorList - Contains Author Information.
There is also a Table 4, which is linked to Table 3 and contains a list of the universities the authors belong to, but I'm going to leave it out for now in case it gets too complicated.
I wish to have a Query which will link all three tables together, and display Paper Information of the recordset in a table, with columns such as these:
Paper Title
Paper Authors
The column "Paper Authors" is going to contain more than one author in some cases.
I've written the following query:
SELECT a.*,b.*,c.*
FROM PaperInfo a, PaperAuthor b, AuthorList c
WHERE a.PaperID = b.PaperID AND b.AuthorID = c.AuthorID
So far, the results I've been getting contain one author per row. I wish to have all of a paper's authors in one column. Can this be done in any way?
Note: I'm using Access 2010 as my database.
In straight SQL the answer unfortunately is that it isn't possible. You would need to use a processing language in order to get the result you are after.
Since you mention you are using Access 2010 please refer to this question: is there a group_concat function in ms-access?
Particularly, read the post which points to http://www.rogersaccesslibrary.com/forum/generic-function-to-concatenate-child-records_topic16&SID=453fabc6-b3z9-34z6zb14-a78f832z-19z89a2c.html
You probably need to implement a custom function but the 2nd url does what you are looking for.
This functionality is not part of the SQL standard, but different vendors have solutions for it, see for instance Pivot Table with many to many table, MySQL pivot table.
If you know the maximum number of authors per paper (for example 3 or 4), you could get away with a triple or quadruple left join.
What you are after is an inner join.
An SQL JOIN clause is used to combine rows from two or more tables, based on a common field between them.
The most common type of join is: SQL INNER JOIN (simple join). An SQL INNER JOIN returns all rows from multiple tables where the join condition is met.
http://www.w3schools.com/sql/sql_join.asp
You may want to combine the inner join with a group to give you 1 paper to many authors in your results.
The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns.
http://www.w3schools.com/sql/sql_groupby.asp
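To illustrate the "use a processing language" route mentioned above: run the three-table join as-is, then collapse the one-author-per-row result in application code. This is only a sketch, and it uses Python's sqlite3 module as a stand-in because Access cannot be run here; the Title and AuthorName columns and the sample rows are assumptions, since the question does not list the actual columns.

```python
import sqlite3

JOIN_SQL = """
SELECT a.PaperID, a.Title, c.AuthorName
FROM PaperInfo AS a
INNER JOIN PaperAuthor AS b ON a.PaperID = b.PaperID
INNER JOIN AuthorList AS c ON b.AuthorID = c.AuthorID
ORDER BY a.PaperID, c.AuthorName
"""

def make_demo_db():
    """Hypothetical columns and rows standing in for the Access tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PaperInfo (PaperID INTEGER PRIMARY KEY, Title TEXT);
        CREATE TABLE AuthorList (AuthorID INTEGER PRIMARY KEY, AuthorName TEXT);
        CREATE TABLE PaperAuthor (PaperID INTEGER, AuthorID INTEGER);
    """)
    conn.executemany("INSERT INTO PaperInfo VALUES (?, ?)",
                     [(1, "Paper One"), (2, "Paper Two")])
    conn.executemany("INSERT INTO AuthorList VALUES (?, ?)",
                     [(1, "Ada"), (2, "Bob"), (3, "Cy")])
    conn.executemany("INSERT INTO PaperAuthor VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 3)])
    return conn

def papers_with_authors(conn):
    """Return one (title, 'author, author, ...') pair per paper."""
    grouped = {}
    for paper_id, title, author in conn.execute(JOIN_SQL):
        # Key on the id as well, in case two papers share a title.
        grouped.setdefault((paper_id, title), []).append(author)
    return [(title, ", ".join(authors))
            for (_, title), authors in grouped.items()]

if __name__ == "__main__":
    for title, authors in papers_with_authors(make_demo_db()):
        print(title, "|", authors)
```

The same loop could be written in VBA inside Access, which is roughly what the linked concatenation function does.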

performance: joining tables vs. large table with redundant data

Let's say I have a bunch of products. Each product has an id, a price, and a long description made up of multiple paragraphs. Each product would also have multiple sku numbers that represent different sizes and colors.
To clarify: product_id 1 has 3 skus, product_id 2 has 5 skus. All of the skus in product 1 share the same price and description. product 2 has a different price and description than product 1. All of product 2's skus share product 2's price and description.
I could have a large table with different records for each sku. The records would have redundant fields like the long description and price.
Or I could have two tables. One named "products" with product_id, price, and description. And one named "skus" with product_id, sku, color, and size. I would then join the tables on the product_id column.
$query = "SELECT * FROM skus LEFT OUTER JOIN products ON skus.product_id=products.product_id WHERE color='green'";
or
$query = "SELECT * FROM master_table WHERE color='green'";
This is a dumbed down version of my setup. In the end there will be a lot more columns and a lot of products. Which method would have better performance?
So to be more specific: Let's say I want to LIKE search on the long_description column for all of the skus. I am trying to compare having one table that has 5000 long_description and 5000 skus vs OUTER JOINing two tables, one has 1000 long_description records and the other has 5000 skus.
It depends on the usage of those tables - in order to get a definitive answer you should do both and compare using representative data sets / system usage.
The normal approach is to only denormalise data in order to combat specific performance problems that you are having, so in this case my advice would be to default to joining across two tables and only denormalise to using a single table if you have a performance problem and find that denormalisation fixes it.
OLTP: normalized tables are better.
Join them at query time; data manipulation is easier and short queries respond well.
OLAP: denormalized tables are better.
The tables mostly don't change, and this layout suits long-running queries.
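The "do both and compare" advice can be tried out cheaply before touching the real system. Below is a rough sketch using Python's sqlite3 module: it builds both layouts from generated data, runs the same LIKE search against each, and reports the average time per run. The table and column names follow the question; the generated descriptions, the helper names, and the choice of SQLite are assumptions, so the numbers say nothing about MySQL or about real data.

```python
import sqlite3
import time

JOINED = """
SELECT skus.sku
FROM skus
LEFT OUTER JOIN products ON skus.product_id = products.product_id
WHERE products.long_description LIKE ?
"""
FLAT = "SELECT sku FROM master_table WHERE long_description LIKE ?"

def build_db(n_products=1000, skus_per_product=5):
    """Create both layouts side by side, filled with generated rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (product_id INTEGER PRIMARY KEY, price REAL,
                               long_description TEXT);
        CREATE TABLE skus (sku TEXT PRIMARY KEY, product_id INTEGER,
                           color TEXT, size TEXT);
        CREATE TABLE master_table (sku TEXT PRIMARY KEY, product_id INTEGER,
                                   price REAL, long_description TEXT,
                                   color TEXT, size TEXT);
    """)
    for pid in range(1, n_products + 1):
        desc = f"Long description of product {pid}. " * 20
        conn.execute("INSERT INTO products VALUES (?, ?, ?)", (pid, 9.99, desc))
        for n in range(skus_per_product):
            sku = f"{pid}-{n}"
            color = "green" if n == 0 else "blue"
            conn.execute("INSERT INTO skus VALUES (?, ?, ?, ?)",
                         (sku, pid, color, "M"))
            conn.execute("INSERT INTO master_table VALUES (?, ?, ?, ?, ?, ?)",
                         (sku, pid, 9.99, desc, color, "M"))
    return conn

def timed(conn, sql, pattern, repeats=5):
    """Run the query several times; return (sorted skus, seconds per run)."""
    start = time.perf_counter()
    for _ in range(repeats):
        rows = conn.execute(sql, (pattern,)).fetchall()
    elapsed = (time.perf_counter() - start) / repeats
    return sorted(r[0] for r in rows), elapsed

if __name__ == "__main__":
    db = build_db()
    for label, sql in (("joined", JOINED), ("flat", FLAT)):
        skus, secs = timed(db, sql, "%product 42.%")
        print(f"{label}: {len(skus)} skus, {secs * 1000:.2f} ms per run")
```

Swapping in representative descriptions, row counts, and the real database engine is what makes the comparison meaningful.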

What is the best way to implement this SQL query?

I have a PRODUCTS table, and each product can have multiple attributes so I have an ATTRIBUTES table, and another table called ATTRIBPRODUCTS which sits in the middle. The attributes are grouped into classes (type, brand, material, colour, etc), so people might want a product of a particular type, from a certain brand.
PRODUCTS
product_id
product_name
ATTRIBUTES
attribute_id
attribute_name
attribute_class
ATTRIBPRODUCTS
attribute_id
product_id
When someone is looking for a product they can select one or many of the attributes. The problem I'm having is returning a single product that has multiple attributes. This should be really simple I know but SQL really isn't my thing and past a certain point I get a bit lost in the logic. The problem is I'm trying to check each attribute class separately so I want to end up with something like:
SELECT DISTINCT products.product_id
FROM attribproducts
INNER JOIN products ON attribproducts.product_id = products.product_id
WHERE (attribproducts.attribute_id IN (9,10,11)
AND attribproducts.attribute_id IN (60,61))
I've used IN to separate the blocks of attributes of different classes, so I end up with the products which are of certain types, but also of certain brands. From the results I've had it seems to be that AND between the IN statements that's causing the problem.
Can anyone help a little? I don't have the luxury of completely refactoring the database unfortunately, there is a lot more to it than this bit, so any suggestions how to work with what I have will be gratefully received.
Take a look at the answers to the question SQL: Many-To-Many table AND query. It's the exact same problem. Cletus gave 2 possible solutions there, neither of which is trivial (but then again, there simply is no trivial solution).
SELECT DISTINCT p.product_id
FROM products p
INNER JOIN attribproducts ptype on p.product_id = ptype.product_id
INNER JOIN attribproducts pbrand on p.product_id = pbrand.product_id
WHERE ptype.attribute_id IN (9,10,11)
AND pbrand.attribute_id IN (60,61)
Try this:
select * from products p, attribproducts a1, attribproducts a2
where p.product_id = a1.product_id
and p.product_id = a2.product_id
and a1.attribute_id in (9,10,11)
and a2.attribute_id in (60,61);
Your original query will return no rows, because each row is tested for an attribute_id that is (either 9, 10, or 11) AND (either 60 or 61).
Because those sets don't intersect, you'll get no rows.
If you use OR instead, it'll give products with attributes that are in the set 9, 10, 11, 60, 61, which isn't what you want either, although you'll then get multiple rows for each product.
You could use that OR-based select as a subquery, group its rows by product, and order the groups by the number of matching attributes. That will give you the best matches first.
Alternatively (as another answer shows), you could join with a new copy of the table for each attribute set, giving you only those products that match all attribute sets.
It sounds like you have a data schema that is GREAT for storage but terrible for selecting/reporting. When you have a data structure of OBJECT, ATTRIBUTE, OBJECT-ATTRIBUTE and OBJECT-ATTRIBUTE-VALUE, you can store many objects with many different attributes per object. This is sometimes referred to as "Vertical Storage".
However, when you want to retrieve a list of objects with all of their attribute values, you have to make a variable number of joins. It is much easier to retrieve data when it is stored horizontally (defined columns of data).
I have run into this scenario several times. Since you cannot change the existing data structure, my suggestion would be to write a "layer" of tables on top. Dynamically create a table for each object/product you have. Then dynamically create static columns in those new tables for each attribute. In short, you need to "flatten" your vertically stored attributes/values into static columns, converting from a vertical architecture into a horizontal one.
Use the "flattened" tables for reporting, and use the vertical tables for storage.
If you need sample code or more details, just ask me.
I hope this is clear. I have not had much coffee yet :)
Thanks,
- Mark
You can use multiple inner joins -- I think this would work:
select distinct p.product_id
from products p
inner join attribproducts a1 on a1.product_id=p.product_id
inner join attribproducts a2 on a2.product_id=p.product_id
where a1.attribute_id in (9,10,11)
and a2.attribute_id in (60,61)
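To check the behaviour on a small data set, here is a sketch that runs the multiple-join approach next to a GROUP BY / HAVING variant through Python's sqlite3 module. The table names and attribute ids come from the question; the sample rows and function names are made up, and the HAVING form is offered as one possible alternative rather than something taken from the answers above. Both should return only products that have at least one attribute from each IN list.

```python
import sqlite3

JOIN_SQL = """
SELECT DISTINCT p.product_id
FROM products AS p
INNER JOIN attribproducts AS ptype ON ptype.product_id = p.product_id
INNER JOIN attribproducts AS pbrand ON pbrand.product_id = p.product_id
WHERE ptype.attribute_id IN (9, 10, 11)
  AND pbrand.attribute_id IN (60, 61)
ORDER BY p.product_id
"""

HAVING_SQL = """
SELECT product_id
FROM attribproducts
GROUP BY product_id
HAVING MAX(CASE WHEN attribute_id IN (9, 10, 11) THEN 1 ELSE 0 END) = 1
   AND MAX(CASE WHEN attribute_id IN (60, 61) THEN 1 ELSE 0 END) = 1
ORDER BY product_id
"""

def make_demo_db():
    """Made-up rows: products 1 and 4 match both lists, 2 and 3 only one."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE attribproducts (attribute_id INTEGER, product_id INTEGER);
    """)
    conn.executemany("INSERT INTO products VALUES (?, ?)",
                     [(1, "one"), (2, "two"), (3, "three"), (4, "four")])
    conn.executemany("INSERT INTO attribproducts VALUES (?, ?)", [
        (9, 1), (60, 1),             # a type and a brand
        (10, 2),                     # a type only
        (61, 3),                     # a brand only
        (11, 4), (60, 4), (61, 4),   # a type and two brands
    ])
    return conn

def matching_ids(conn, sql):
    """Run one of the queries above and return the product ids as a list."""
    return [row[0] for row in conn.execute(sql)]

if __name__ == "__main__":
    db = make_demo_db()
    print(matching_ids(db, JOIN_SQL))
    print(matching_ids(db, HAVING_SQL))
```

The join form needs one extra join per attribute class, while the HAVING form needs one extra condition per class, so the latter may be easier to generate when the number of selected classes varies.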