SQL Filter by attributes (Design)

I have a database of products which I'd like to filter by arbitrary categories. Let's say for the sake of an example that I run a garage. I have a section of products which are cars.
Each car should have a collection of attributes, all cars have the same number and type of attributes; for instance colour:red, doors:2, make:ford ; with those same attributes set to various values on all the cars.
Gut feeling tells me that it would be best to add "colour", "doors" and "make" columns to the product table.
HOWEVER: Not ALL the products in the table are cars. Perhaps I would like to list tyres on the page of cars. Obviously, "colour" and "doors" won't apply to tires. Even so, if a user selects colour=red as a filter, I would still like the tires to be shown as they lack the colour attribute.
Mulling it over (and I'm really not a database guy so I apologise if this approach is horrible) I considered having a single "attributes" column which I could fill with an arbitrary number of arbitrarily named attributes, then use SQL's string functions to do the filtering. I guess you could even use a bit field here if you planned carefully. This seems hackish to me though, and I'd be interested to know how some of the larger sites such as Amazon do this.
What are the issues with these approaches? Can anyone recommend any alternatives or shed any light on the subject for me?
Thanks in advance

You should read about database normalization. It is generally not a good idea to use concatenated strings as values in a single column. I made a very small sqlfiddle for you to start playing around. This does not really solve all your problems, but it may lead you in the right direction.
Schema:
CREATE TABLE product (id int, name varchar(200), info varchar(200));
INSERT INTO product (id, name, info) VALUES (100, 'Porsche', 'cool');
...
INSERT INTO product (id, name, info) VALUES (103, 'Tires', 'you need them!');
CREATE TABLE attr (id int, product_id int, a_name varchar(200), a_value varchar(200));
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 100, 'color', 'black');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (2, 100, 'doors', '2');
...
A Query:
SELECT * FROM product INNER JOIN attr ON attr.product_id = product.id
WHERE attr.a_name = 'doors' AND attr.a_value = '2'

Anyone reading this in the future, I managed to get the results I wanted thanks to luksch taking the time to help me out!!! Thanks!!!
Using this layout:
CREATE TABLE product (id int, name varchar(200));
INSERT INTO product (id, name) VALUES (100, 'Red Porsche');
INSERT INTO product (id, name) VALUES (101, 'Red Ferrari V8');
INSERT INTO product (id, name) VALUES (102, 'Red Ferrari V12');
INSERT INTO product (id, name) VALUES (103, 'Blue Porsche');
INSERT INTO product (id, name) VALUES (104, 'Blue Ferrari V8');
INSERT INTO product (id, name) VALUES (105, 'Blue Ferrari V12');
INSERT INTO product (id, name) VALUES (106, 'Snow Tires');
INSERT INTO product (id, name) VALUES (107, 'Fluffy Dice');
CREATE TABLE attr (id int, product_id int, a_name varchar(200), a_value varchar(200));
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 100, 'colour', 'red');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (2, 101, 'colour', 'red');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (3, 101, 'cylinders', '8');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (4, 102, 'colour', 'red');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (5, 102, 'cylinders', '12');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (6, 103, 'colour', 'blue');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (7, 104, 'colour', 'blue');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (8, 104, 'cylinders', '8');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (9, 105, 'colour', 'blue');
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (10, 105, 'cylinders', '12');
I achieved the result I wanted; which was two things:
Firstly I wanted to be able to select products by attribute, say by colour and cylinders, but also show any products which have neither the colour nor cylinders attribute, which I achieved with this query:
SELECT DISTINCT product.id, name, a_value
FROM product
LEFT JOIN attr
    ON product_id = product.id
WHERE
(
    (a_name = 'colour' AND a_value = 'blue')
    OR
    (a_name IS NULL)
)
AND product.id IN
(
    SELECT product.id
    FROM product
    LEFT JOIN attr
        ON product_id = product.id
    WHERE
        (a_name = 'cylinders' AND a_value = '12')
        OR
        (a_name IS NULL)
)
This lists all the blue cars with 12 cylinders, and also lists the tires and fluffy dice, since they have neither a colour nor a cylinder count. This can easily be adapted to filter on one attribute, or you can add more AND / IN clauses to add more filters.
And I also wanted to be able to list all relevant attributes (I use WHERE 1 in this example, but in practice this would be WHERE idfolders=? to list all attributes relevant to a specific folder):
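As a rough sketch of the "add more AND / IN clauses" idea, the snippet below builds one IN clause per filter and runs it against an in-memory SQLite database loaded with the layout above. The helper name `filter_products`, the constant `FILTER_CLAUSE`, and the use of Python's `sqlite3` module are assumptions made for illustration; they are not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INT, name VARCHAR(200));
CREATE TABLE attr (id INT, product_id INT, a_name VARCHAR(200), a_value VARCHAR(200));
""")
conn.executemany("INSERT INTO product VALUES (?, ?)", [
    (100, "Red Porsche"), (101, "Red Ferrari V8"), (102, "Red Ferrari V12"),
    (103, "Blue Porsche"), (104, "Blue Ferrari V8"), (105, "Blue Ferrari V12"),
    (106, "Snow Tires"), (107, "Fluffy Dice"),
])
conn.executemany("INSERT INTO attr VALUES (?, ?, ?, ?)", [
    (1, 100, "colour", "red"), (2, 101, "colour", "red"), (3, 101, "cylinders", "8"),
    (4, 102, "colour", "red"), (5, 102, "cylinders", "12"), (6, 103, "colour", "blue"),
    (7, 104, "colour", "blue"), (8, 104, "cylinders", "8"),
    (9, 105, "colour", "blue"), (10, 105, "cylinders", "12"),
])

# One IN clause per filter, shaped like the subquery in the query above.
# (Hypothetical constant name; values are bound as parameters.)
FILTER_CLAUSE = """product.id IN (
    SELECT p.id FROM product p
    LEFT JOIN attr a ON a.product_id = p.id
    WHERE (a.a_name = ? AND a.a_value = ?) OR a.a_name IS NULL)"""


def filter_products(conn, filters):
    """Return ids of products passing every filter, plus products with no attributes."""
    sql = "SELECT product.id FROM product"
    params = []
    if filters:
        sql += " WHERE " + " AND ".join([FILTER_CLAUSE] * len(filters))
        for name, value in filters.items():
            params.extend([name, value])
    return [row[0] for row in conn.execute(sql + " ORDER BY product.id", params)]


print(filter_products(conn, {"colour": "blue", "cylinders": "12"}))  # [105, 106, 107]
```

Note that this keeps the behaviour of the query above: the Blue Porsche (103) is left out of the blue / 12-cylinder result because it has attributes but no cylinders row, whereas the tires and dice have no attribute rows at all.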
SELECT DISTINCT a_value, a_name
FROM product
INNER JOIN attr
ON product_id=product.id
WHERE 1

Related

Run mode() function of each value in INT ARRAY

I have a table that holds an INT ARRAY data type representing some features (this is done instead of having a separate boolean column for each feature). The column is called feature_ids. If a record has a specific feature, the ID of the feature will be present in the feature_ids column. For context, the feature IDs map as follows:
1: Fast
2: Expensive
3: Colorful
4: Deadly
So in other words, I could instead have had 4 columns called is_fast, is_expensive, is_colorful and is_deadly, but I don't, since my real application has 100+ features, and they change quite a bit.
Now back to the question: I wanna do an aggregate mode() on the records in the table, returning the most "frequent" features (e.g. whether it's more common to be "fast" than not, etc.). I want it returned in the same format as the original feature_ids column, but where the ID of a feature is ONLY represented if it's more common to be there than not, within each group:
CREATE TABLE test (
id INT,
feature_ids integer[] DEFAULT '{}'::integer[],
age INT,
type character varying(255)
);
INSERT INTO test (id, age, feature_ids, type) VALUES (1, 10, '{1,2}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (2, 2, '{1}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (3, 9, '{1,2,4}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (4, 11, '{1,2,3}', 'wine');
INSERT INTO test (id, age, feature_ids, type) VALUES (5, 12, '{1,2,4}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (6, 12, '{1,2,3}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (7, 8, '{1,4}', 'hat');
I wanna do a query something like this:
SELECT
type, avg(age) as avg_age, mode() within group (order by feature_ids) as most_frequent_features
from test group by "type"
The result I expect is:
type   avg_age  most_frequent_features
hat    10.6     [1,2,4]
movie  7.0      [1,2]
wine   11.0     [1,2,3]
I have made an example here: https://www.db-fiddle.com/f/rTP4w7264vDC5rqjef6Nai/1
I find this quite tricky. The following is a rather brute-force approach -- calculating the "mode" explicitly and then bringing in the other aggregates:
select tf.type, t.avg_age,
array_agg(feature_id) as features
from (select t.type, feature_id, count(*) as cnt,
dense_rank() over (partition by t.type order by count(*) desc) as seqnum
from test t cross join
unnest(feature_ids) feature_id
group by t.type, feature_id
) tf join
(select t.type, avg(age) as avg_age
from test t
group by t.type
) t
on tf.type = t.type
where seqnum <= 2
group by tf.type, t.avg_age;
Here is a db<>fiddle.
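For comparison, here is a small sketch of the literal "more common than not" rule from the question: a feature is kept for a type when it appears in more than half of that type's rows. It is not the dense_rank query above. It assumes a hypothetical child table `test_feature` in place of the `feature_ids` array, because the example runs on SQLite through Python's `sqlite3` module, which has no array type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (id INT, age INT, type TEXT);
-- Hypothetical child table standing in for the feature_ids array column.
CREATE TABLE test_feature (test_id INT, feature_id INT);
""")
rows = [
    (1, 10, [1, 2], "movie"), (2, 2, [1], "movie"), (3, 9, [1, 2, 4], "movie"),
    (4, 11, [1, 2, 3], "wine"), (5, 12, [1, 2, 4], "hat"),
    (6, 12, [1, 2, 3], "hat"), (7, 8, [1, 4], "hat"),
]
for id_, age, features, type_ in rows:
    conn.execute("INSERT INTO test VALUES (?, ?, ?)", (id_, age, type_))
    conn.executemany("INSERT INTO test_feature VALUES (?, ?)",
                     [(id_, f) for f in features])

# Keep a feature when it occurs in more than half of the rows of its type.
majority_sql = """
SELECT t.type, f.feature_id
FROM test t
JOIN test_feature f ON f.test_id = t.id
GROUP BY t.type, f.feature_id
HAVING 2 * COUNT(*) > (SELECT COUNT(*) FROM test t2 WHERE t2.type = t.type)
ORDER BY t.type, f.feature_id
"""
majority = {}
for type_, feature_id in conn.execute(majority_sql):
    majority.setdefault(type_, []).append(feature_id)

print(majority)  # {'hat': [1, 2, 4], 'movie': [1, 2], 'wine': [1, 2, 3]}
```

On the sample data this gives the same feature lists as the expected result in the question. The two rules can diverge on other data, since ranking by frequency and requiring a strict majority are different criteria.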

oracle correlated subquery using distinct listagg

I have an interesting query I'm trying to figure out. I have a view which is getting a column added to it. This column is pivoted data coming from other tables, to form into a single row. Now, I need to wipe out duplicate entries in this pivoted data. Listagg is great for getting the data to a single row, but I need to make it unique. While I know how to make it unique, I'm tripping up on the fact that correlated sub-queries only go 1 level deep. So... not really sure how to get a distinct list of values. I can get it to work if I don't do the distinct just fine. Anyone out there able to work some SQL magic?
Sample data:
drop table test;
drop table test_widget;
create table test (id number, description Varchar2(20));
create table test_widget (widget_id number, test_fk number, widget_type varchar2(20));
insert into test values(1, 'cog');
insert into test values(2, 'wheel');
insert into test values(3, 'spring');
insert into test_widget values(1, 1, 'A');
insert into test_widget values(2, 1, 'A');
insert into test_widget values(3, 1, 'B');
insert into test_widget values(4, 1, 'A');
insert into test_widget values(5, 2, 'C');
insert into test_widget values(6, 2, 'C');
insert into test_widget values(7, 2, 'B');
insert into test_widget values(8, 3, 'A');
insert into test_widget values(9, 3, 'C');
insert into test_widget values(10, 3, 'B');
insert into test_widget values(11, 3, 'B');
insert into test_widget values(12, 3, 'A');
commit;
Here is an example of the query that works, but shows duplicate data:
SELECT A.ID
, A.DESCRIPTION
, (SELECT LISTAGG (WIDGET_TYPE, ', ') WITHIN GROUP (ORDER BY WIDGET_TYPE)
FROM TEST_WIDGET
WHERE TEST_FK = A.ID) widget_types
FROM TEST A
Here is an example of what does NOT work due to the depth of where I try to reference the ID:
SELECT A.ID
, A.DESCRIPTION
, (SELECT LISTAGG (WIDGET_TYPE, ', ') WITHIN GROUP (ORDER BY WIDGET_TYPE)
FROM (SELECT DISTINCT WIDGET_TYPE
FROM TEST_WIDGET
WHERE TEST_FK = A.ID))
WIDGET_TYPES
FROM TEST A
Here is what I want displayed:
1 cog A, B
2 wheel B, C
3 spring A, B, C
If anyone knows off the top of their head, that would be fantastic! Otherwise, I can post up some sample create statements to help you with dummy data to figure out the query.
You can apply the distinct in a subquery, which also has the join - avoiding the level issue:
SELECT ID
, DESCRIPTION
, LISTAGG (WIDGET_TYPE, ', ')
WITHIN GROUP (ORDER BY WIDGET_TYPE) AS widget_types
FROM (
SELECT DISTINCT A.ID, A.DESCRIPTION, B.WIDGET_TYPE
FROM TEST A
JOIN TEST_WIDGET B
ON B.TEST_FK = A.ID
)
GROUP BY ID, DESCRIPTION
ORDER BY ID;
ID DESCRIPTION WIDGET_TYPES
---------- -------------------- --------------------
1 cog A, B
2 wheel B, C
3 spring A, B, C
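The same "deduplicate inside the joined subquery, then aggregate" shape can be tried outside Oracle. The sketch below is an assumption-laden port to SQLite via Python's `sqlite3`, with `group_concat` standing in for LISTAGG. SQLite's documentation leaves the concatenation order unspecified, so the values are sorted in Python rather than relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (id INT, description TEXT);
CREATE TABLE test_widget (widget_id INT, test_fk INT, widget_type TEXT);
INSERT INTO test VALUES (1, 'cog'), (2, 'wheel'), (3, 'spring');
INSERT INTO test_widget VALUES
  (1, 1, 'A'), (2, 1, 'A'), (3, 1, 'B'), (4, 1, 'A'),
  (5, 2, 'C'), (6, 2, 'C'), (7, 2, 'B'),
  (8, 3, 'A'), (9, 3, 'C'), (10, 3, 'B'), (11, 3, 'B'), (12, 3, 'A');
""")

# DISTINCT is applied in the subquery that also performs the join,
# so the outer aggregate never needs a nested correlated reference.
sql = """
SELECT id, description, group_concat(widget_type, ',') AS widget_types
FROM (SELECT DISTINCT a.id, a.description, b.widget_type
      FROM test a
      JOIN test_widget b ON b.test_fk = a.id)
GROUP BY id, description
ORDER BY id
"""
result = {row[0]: (row[1], sorted(row[2].split(","))) for row in conn.execute(sql)}

for id_, (description, widget_types) in result.items():
    print(id_, description, ", ".join(widget_types))
```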
I was in a unique situation using the Pentaho reports writer and some inconsistent data. The Pentaho writer uses Oracle to query data, but has limitations. The data pieces were unique but not classified in a consistent manner, so I created a nested listagg inside of a left join to present the data the way I wanted to:
left join
(
select staff_id, listagg(thisThing, ' --- '||chr(10) ) within group (order by thisThing) as SCHED_1 from
(
SELECT
staff_id, RPT_STAFF_SHIFTS.ORGANIZATION||': '||listagg(
RPT_STAFF_SHIFTS.DAYS_OF_WEEK
, ',' ) within group (order by BEGIN_DATE desc)
as thisThing
FROM "RPT_STAFF_SHIFTS" where "RPT_STAFF_SHIFTS"."END_DATE" is null
group by staff_id, organization)
group by staff_id
) schedule_1 on schedule_1.staff_id = "RPT_STAFF"."STAFF_ID"
where "RPT_STAFF"."STAFF_ID" ='555555'
This is a different approach from using the nested query, but in some situations it might work better: you take the level issue into account when developing the query and add an extra step to fully concatenate the results.

Naming Category and SubCategory Tables

I'm trying to create a bunch of lookup tables in a database but am stuck when it comes to naming them. The tables are like this:
1. dbo.AccountType (this is the highest level category)
2. dbo.AccountSubType (this is a 2nd level category)
3. dbo.AccountSubSubType (this is a 3rd level category)
The above naming convention breaks easily. So perhaps this is better:
1. dbo.AccountType1 (highest level)
2. dbo.AccountType2 (second level)
3. dbo.AccountType3 (third level)
4. dbo.AccountType-N (and so on...)
I know naming conventions are opinion based, but surely there has to be some logical way to do this that is scalable and not confusing to developers.
Example of how the data looks in the dbo.AccountType2 table using the second solution:
AccountTypeID (FK) | AccountType1ID (FK) | AccountType2ID (PK) | AccountType2
=============================================================================
1                  | 4                   | 1                   | Credit Card
1                  | 5                   | 2                   | Savings
Is there any better way to store hierarchical data in a database and name the tables correctly?
This would probably be better represented as a single table with a hierarchical relationship:
E.g.
CREATE TABLE [dbo].[AccountType] (
    Id int NOT NULL
    ,ParentId int NULL
        CONSTRAINT [FK_AccountType_AccountType_Parent] REFERENCES [dbo].[AccountType] (Id)
    ,Name nvarchar(200) NOT NULL
    ,CONSTRAINT [PK_AccountType] PRIMARY KEY CLUSTERED ([Id])
)
Then populate it with data as follows:
INSERT INTO dbo.AccountType (Id, ParentId, Name) VALUES (1, NULL, 'Credit Card')
INSERT INTO dbo.AccountType (Id, ParentId, Name) VALUES (2, 1, 'Credit Card Sub-Type')
INSERT INTO dbo.AccountType (Id, ParentId, Name) VALUES (3, 2, 'Credit Card Sub-Sub-Type')
INSERT INTO dbo.AccountType (Id, ParentId, Name) VALUES (4, NULL, 'Savings')
INSERT INTO dbo.AccountType (Id, ParentId, Name) VALUES (5, 4, 'Savings Sub-Type')
INSERT INTO dbo.AccountType (Id, ParentId, Name) VALUES (6, 5, 'Savings Sub-Sub-Type')
Anything with a ParentId of NULL is a root value, otherwise it is a child of the specified parent...
Edit: To query you'd use a CTE. E.g.
WITH ParentAccountType ( Id, ParentId, Name, ParentName )
AS
(
SELECT Id, ParentId, Name, CAST('N/A' AS nvarchar(200)) AS ParentName
FROM AccountType
WHERE ParentId IS NULL
UNION ALL
SELECT c.Id, c.ParentId, c.Name, p.Name AS ParentName
FROM
AccountType c
INNER JOIN ParentAccountType p ON c.ParentId = p.Id
)
SELECT ParentName, Name
FROM ParentAccountType
GO
SQL Fiddler here
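To illustrate what else a recursive CTE over this single table can give you, here is a minimal sketch that also tracks each row's depth and full path from the root. It runs on SQLite through Python's `sqlite3` module rather than SQL Server, so the `WITH RECURSIVE` keyword and the `||` concatenation operator are SQLite syntax; the `Depth` and `Path` columns are names invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AccountType (
    Id INTEGER PRIMARY KEY,
    ParentId INTEGER REFERENCES AccountType (Id),
    Name TEXT NOT NULL
);
INSERT INTO AccountType (Id, ParentId, Name) VALUES
  (1, NULL, 'Credit Card'),
  (2, 1, 'Credit Card Sub-Type'),
  (3, 2, 'Credit Card Sub-Sub-Type'),
  (4, NULL, 'Savings'),
  (5, 4, 'Savings Sub-Type'),
  (6, 5, 'Savings Sub-Sub-Type');
""")

# Anchor member: root rows (ParentId IS NULL) at depth 0.
# Recursive member: each child extends its parent's depth and path.
sql = """
WITH RECURSIVE tree (Id, Depth, Path) AS (
    SELECT Id, 0, Name FROM AccountType WHERE ParentId IS NULL
    UNION ALL
    SELECT c.Id, p.Depth + 1, p.Path || ' > ' || c.Name
    FROM AccountType c
    JOIN tree p ON c.ParentId = p.Id
)
SELECT Id, Depth, Path FROM tree ORDER BY Id
"""
paths = list(conn.execute(sql))

for id_, depth, path in paths:
    print(id_, depth, path)
```

Because the depth is computed rather than encoded in a table name, a fourth or fifth level needs no schema change, only more rows.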

Oracle SQL: count frequencies and convert to columns

I want to count how frequently values appear in certain columns and create a new table, with the values as columns and the frequencies as data. Example:
create table users
(id number primary key,
name varchar2(255));
insert into users values (1, 'John');
insert into users values (2, 'Joe');
insert into users values (3, 'Max');
create table meals
(id number primary key,
user_id number,
food varchar2(255));
insert into meals values (1, 1, 'Apple');
insert into meals values (2, 1, 'Apple');
insert into meals values (3, 1, 'Orange');
insert into meals values (4, 1, 'Bread');
insert into meals values (5, 1, 'Apple');
insert into meals values (6, 2, 'Apple');
insert into meals values (7, 2, 'Bread');
insert into meals values (8, 2, 'Bread');
insert into meals values (9, 2, 'Apple');
insert into meals values (10, 3, 'Orange');
insert into meals values (11, 3, 'Bread');
insert into meals values (12, 3, 'Bread');
So I have different users and their meals (here Bread, Apple and Orange). For every user I want to know how often they ate each food. The following query does exactly what I want:
select
(select count(id) from meals where meals.user_id = users.id and meals.food = 'Apple') as count_apple,
(select count(id) from meals where meals.user_id = users.id and meals.food = 'Orange') as count_orange,
(select count(id) from meals where meals.user_id = users.id and meals.food = 'Bread') as count_bread
from users;
The problem is, this is REALLY slow, especially when I have more than 100,000 users and dozens of different foods. I am sure that there is a faster way, but I am not experienced enough in SQL to solve this problem.
If you're using 11g, then you can use the pivot operator, like so:
select * from (
select user_id, food from meals
)
pivot (count(*) as count for (food) in ('Apple', 'Orange', 'Bread'));
Otherwise you'll have to do a manual pivot:
select user_id,
sum(case when food = 'Apple' then 1 else 0 end) count_apple,
sum(case when food = 'Orange' then 1 else 0 end) count_orange,
sum(case when food = 'Bread' then 1 else 0 end) count_bread
from meals
group by user_id
In either case, these should be faster than your original approach as you're only accessing the meals table once.
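With dozens of foods, writing the CASE columns by hand gets tedious. The sketch below generates the manual pivot from a list of food names. It is only an illustration on SQLite via Python's `sqlite3`, not Oracle; the helper name `pivot_counts` is made up, and the food values are bound as parameters rather than pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meals (id INTEGER PRIMARY KEY, user_id INT, food TEXT);
INSERT INTO meals VALUES
  (1, 1, 'Apple'), (2, 1, 'Apple'), (3, 1, 'Orange'), (4, 1, 'Bread'),
  (5, 1, 'Apple'), (6, 2, 'Apple'), (7, 2, 'Bread'), (8, 2, 'Bread'),
  (9, 2, 'Apple'), (10, 3, 'Orange'), (11, 3, 'Bread'), (12, 3, 'Bread');
""")


def pivot_counts(conn, foods):
    """One row per user: (user_id, count for foods[0], count for foods[1], ...)."""
    # One conditional sum per food; the meals table is scanned a single time.
    columns = ", ".join(
        "SUM(CASE WHEN food = ? THEN 1 ELSE 0 END)" for _ in foods)
    sql = f"SELECT user_id, {columns} FROM meals GROUP BY user_id ORDER BY user_id"
    return list(conn.execute(sql, foods))


# Discover the pivot columns from the data itself.
foods = [row[0] for row in
         conn.execute("SELECT DISTINCT food FROM meals ORDER BY food")]

print(foods)                      # ['Apple', 'Bread', 'Orange']
print(pivot_counts(conn, foods))  # [(1, 3, 1, 1), (2, 2, 2, 0), (3, 0, 2, 1)]
```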

Data cube design: hard-to-aggregate measure

I'm in the process of designing the fact table for a data cube, and I have a measure which I don't really know how to correctly aggregate. The following SQL code will create a small sample fact table and dimension table:
create table FactTable (
ID int,
Color varchar(20),
Flag int)
insert into FactTable (ID, Color, Flag) values (1, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (1, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (1, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (2, 'RED', 0)
insert into FactTable (ID, Color, Flag) values (2, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (2, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (3, 'WHITE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'BLUE', 1)
create table ColorDim (
CID int,
Color varchar(20))
insert into ColorDim (CID, Color) values (1, 'RED')
insert into ColorDim (CID, Color) values (2, 'WHITE')
insert into ColorDim (CID, Color) values (3, 'BLUE')
FactTable and ColorDim are joined on FactTable.Color = ColorDim.Color. In the cube, there should be a measure called 'Patriotic' which counts object IDs including the colors red, white, or blue (at least one of the colors). The desired output is as follows:
When browsing the cube, if the user pulls in the Patriotic measure (pulling no dimensions), the total shown should be 2, since there are 2 IDs (namely, 1 and 3) which include at least one of the three colors. Notice that ID 1 should contribute 1 to the total Patriotic value, even though it has two of the colors.
If the user browses the Patriotic measure by the Color dimension, they should see a table like the following. Note that ID 1 contributes 1 to the RED count and 1 to the BLUE count.
+--------+-----------+
| Color | Patriotic |
+--------+-----------+
| RED | 2 |
| WHITE | 1 |
| BLUE | 2 |
+--------+-----------+
(I tried to create a table using this web app, but the spacing doesn't appear to be correct. Hopefully it's readable enough to understand.)
I'm sure this is a very basic cube design situation, but I haven't worked with cubes much before, and the measures I've used are usually simple SUMs of columns, or products of SUMs of columns, etc. Any help would be much appreciated.
(If it's relevant, I'm running the SQL queries which build the fact/dimension tables in MS SQL Server 2008, and I'll be designing the cube itself in MS Visual Studio 2008.)
I'll give it a try, although I'm not 100% sure I understand the questions. Also I don't want to post queries into comments to verify if they are valid. If I'm way off and this is not helpful, I'll delete the answer.
When browsing the cube, if the user pulls in the Patriotic measure (pulling no dimensions), the total shown should be 2, since there are 2 IDs (namely, 1 and 3) which include at least one of the three colors. Notice that ID 1 should contribute 1 to the total Patriotic value, even though it has two of the colors.
WITH MyCTE (id, Count)
AS
(
select id, count(flag) as count
from FactTable
where Flag=1
group by id
having COUNT(flag) >=2
)
select COUNT(*) from MyCTE
If the user browses the Patriotic measure by the Color dimension, they should see a table like the following. Note that ID 1 contributes 1 to the RED count and 1 to the BLUE count.
select a.Color, COUNT(*)
from FactTable a
join ColorDim b
on a.Color = b.Color
where Flag = 1
group by a.Color
Not entirely sure why your fact table needs to be a cross join between "ID" and "Color". You can simply eliminate all Flag=0 rows and use a simple count of the ID column as your Patriotic measure; a distinct count will give you the total of Patriotic rows.
You also do not need a Color dimension as there is no extra information being provided by the ColorDim table.
However, if more colours were added to the rows, you would be able to add the 'Patriotic' flag to the ColorDim table. Any queries would then be able to filter by the 'Patriotic' flag and still get accurate counts for Patriotic rows.
create table FactTable (
ID int,
Color varchar(20)
)
insert into FactTable (ID, Color) values (1, 'RED')
insert into FactTable (ID, Color) values (1, 'BLUE')
insert into FactTable (ID, Color) values (2, 'BLUE')
insert into FactTable (ID, Color) values (3, 'RED')
insert into FactTable (ID, Color) values (3, 'WHITE')
insert into FactTable (ID, Color) values (3, 'BLUE')
create table ColorDim (
CID int,
Color varchar(20),
PatrioticFlag int
)
insert into ColorDim (CID, Color, PatrioticFlag) values (1, 'RED', 1)
insert into ColorDim (CID, Color, PatrioticFlag) values (2, 'WHITE', 1)
insert into ColorDim (CID, Color, PatrioticFlag) values (3, 'BLUE', 1)
insert into ColorDim (CID, Color, PatrioticFlag) values (4, 'BEIGE', 0)
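A quick way to see how the flag on the dimension behaves is to query the reduced tables directly. The following is a sketch only, run on SQLite through Python's `sqlite3` instead of SQL Server or a cube. The fact row `(4, 'BEIGE')` is a hypothetical addition, not in the answer's data, included to show a non-patriotic colour being filtered out. The counts come from this answer's sample rows, in which ID 2 has a BLUE row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FactTable (ID INT, Color TEXT);
CREATE TABLE ColorDim (CID INT, Color TEXT, PatrioticFlag INT);
INSERT INTO FactTable VALUES
  (1, 'RED'), (1, 'BLUE'), (2, 'BLUE'),
  (3, 'RED'), (3, 'WHITE'), (3, 'BLUE'),
  (4, 'BEIGE');  -- hypothetical row, not in the answer's data
INSERT INTO ColorDim VALUES
  (1, 'RED', 1), (2, 'WHITE', 1), (3, 'BLUE', 1), (4, 'BEIGE', 0);
""")

# Patriotic measure browsed by colour: a plain count of fact rows per colour.
by_color = list(conn.execute("""
    SELECT d.Color, COUNT(f.ID)
    FROM FactTable f
    JOIN ColorDim d ON d.Color = f.Color
    WHERE d.PatrioticFlag = 1
    GROUP BY d.CID, d.Color
    ORDER BY d.CID
"""))

# Patriotic measure with no dimension: a distinct count of IDs.
total = conn.execute("""
    SELECT COUNT(DISTINCT f.ID)
    FROM FactTable f
    JOIN ColorDim d ON d.Color = f.Color
    WHERE d.PatrioticFlag = 1
""").fetchone()[0]

print(by_color)  # [('RED', 2), ('WHITE', 1), ('BLUE', 3)]
print(total)     # 3
```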
I finally figured it out. First, I added one row per ID to the fact table containing pre-aggregated data for that ID, so the fact table becomes:
create table FactTable (
ID int,
Color varchar(20),
Flag int)
insert into FactTable (ID, Color, Flag) values (1, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (1, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (1, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (1, 'PATRIOTIC', 1)
insert into FactTable (ID, Color, Flag) values (2, 'RED', 0)
insert into FactTable (ID, Color, Flag) values (2, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (2, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (2, 'PATRIOTIC', 1)
insert into FactTable (ID, Color, Flag) values (3, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (3, 'WHITE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'PATRIOTIC', 1)
Similarly, add a row to the color dimension table:
create table ColorDim (
CID int,
Color varchar(20))
insert into ColorDim (CID, Color) values (1, 'RED')
insert into ColorDim (CID, Color) values (2, 'WHITE')
insert into ColorDim (CID, Color) values (3, 'BLUE')
insert into ColorDim (CID, Color) values (4, 'PATRIOTIC')
Then, in MS Visual Studio, edit the DefaultMember property of the Color attribute in the Color Dimension as:
[Color Dimension].[ColorDim].&[PATRIOTIC]
The DefaultMember property tells MS Visual Studio that rows of the fact table which have Color 'PATRIOTIC' are already aggregates of the other rows with the same ID and other Color values.