Data cube design: hard-to-aggregate measure

I'm in the process of designing the fact table for a data cube, and I have a measure which I don't really know how to correctly aggregate. The following SQL code will create a small sample fact table and dimension table:
create table FactTable (
ID int,
Color varchar(10),
Flag int)
insert into FactTable (ID, Color, Flag) values (1, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (1, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (1, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (2, 'RED', 0)
insert into FactTable (ID, Color, Flag) values (2, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (2, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (3, 'WHITE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'BLUE', 1)
create table ColorDim (
CID int,
Color varchar(10))
insert into ColorDim (CID, Color) values (1, 'RED')
insert into ColorDim (CID, Color) values (2, 'WHITE')
insert into ColorDim (CID, Color) values (3, 'BLUE')
FactTable and ColorDim are joined on FactTable.Color = ColorDim.Color. In the cube, there should be a measure called 'Patriotic' which counts object IDs including the colors red, white, or blue (at least one of the colors). The desired output is as follows:
When browsing the cube, if the user pulls in the Patriotic measure (pulling no dimensions), the total shown should be 2, since there are 2 IDs (namely, 1 and 3) which include at least one of the three colors. Notice that ID 1 should contribute 1 to the total Patriotic value, even though it has two of the colors.
If the user browses the Patriotic measure by the Color dimension, they should see a table like the following. Note that ID 1 contributes 1 to the RED count and 1 to the BLUE count.
+-------+-----------+
| Color | Patriotic |
+-------+-----------+
| RED   |         2 |
| WHITE |         1 |
| BLUE  |         2 |
+-------+-----------+
I'm sure this is a very basic cube design situation, but I haven't worked with cubes much before, and the measures I've used are usually simple SUMs of columns, or products of SUMs of columns, etc. Any help would be much appreciated.
(If it's relevant, I'm running the SQL queries which build the fact/dimension tables in MS SQL Server 2008, and I'll be designing the cube itself in MS Visual Studio 2008.)

I'll give it a try, although I'm not 100% sure I understand the question. I also didn't want to post queries in comments just to check whether they are valid. If I'm way off and this isn't helpful, I'll delete the answer.
When browsing the cube, if the user pulls in the Patriotic measure (pulling no dimensions), the total shown should be 2, since there are 2 IDs (namely, 1 and 3) which include at least one of the three colors. Notice that ID 1 should contribute 1 to the total Patriotic value, even though it has two of the colors.
WITH MyCTE (id, Count)
AS
(
select id, count(flag) as count
from FactTable
where Flag=1
group by id
having COUNT(flag) >=2
)
select COUNT(*) from MyCTE
If the user browses the Patriotic measure by the Color dimension, they should see a table like the following. Note that ID 1 contributes 1 to the RED count and 1 to the BLUE count.
select a.Color, COUNT(*)
from FactTable a
join ColorDim b
on a.Color = b.Color
where Flag = 1
group by a.Color

I'm not entirely sure why your fact table needs to be a cross join between "ID" and "Color". You can simply eliminate all Flag=0 rows and use a simple count of the ID column as your Patriotic measure; a distinct count of ID will then give you the total number of Patriotic IDs.
You also do not need a Color dimension as there is no extra information being provided by the ColorDim table.
However, if more colours were added to the rows, you would be able to add the 'Patriotic' flag to the ColorDim table. Any queries would then be able to filter by the 'Patriotic' flag and still get accurate counts for Patriotic rows.
create table FactTable (
ID int,
Color varchar(10)
)
insert into FactTable (ID, Color) values (1, 'RED')
insert into FactTable (ID, Color) values (1, 'BLUE')
insert into FactTable (ID, Color) values (2, 'BLUE')
insert into FactTable (ID, Color) values (3, 'RED')
insert into FactTable (ID, Color) values (3, 'WHITE')
insert into FactTable (ID, Color) values (3, 'BLUE')
create table ColorDim (
CID int,
Color varchar(10),
PatrioticFlag int
)
insert into ColorDim (CID, Color, PatrioticFlag) values (1, 'RED', 1)
insert into ColorDim (CID, Color, PatrioticFlag) values (2, 'WHITE', 1)
insert into ColorDim (CID, Color, PatrioticFlag) values (3, 'BLUE', 1)
insert into ColorDim (CID, Color, PatrioticFlag) values (4, 'BEIGE', 0)
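A minimal sketch of the counting described in this answer, run through Python's built-in sqlite3 module as a stand-in for SQL Server and the cube. It uses the reduced fact table from above plus one hypothetical BEIGE row for ID 4 (not in the original data), so that the PatrioticFlag filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FactTable (ID INTEGER, Color TEXT);
    INSERT INTO FactTable (ID, Color) VALUES
        (1, 'RED'), (1, 'BLUE'),
        (2, 'BLUE'),
        (3, 'RED'), (3, 'WHITE'), (3, 'BLUE'),
        (4, 'BEIGE');  -- hypothetical non-patriotic row, added for illustration

    CREATE TABLE ColorDim (CID INTEGER, Color TEXT, PatrioticFlag INTEGER);
    INSERT INTO ColorDim (CID, Color, PatrioticFlag) VALUES
        (1, 'RED', 1), (2, 'WHITE', 1), (3, 'BLUE', 1), (4, 'BEIGE', 0);
""")

# Grand total: each ID counts once, however many patriotic colors it has.
patriotic_total = conn.execute("""
    SELECT COUNT(DISTINCT f.ID)
    FROM FactTable f
    JOIN ColorDim c ON c.Color = f.Color
    WHERE c.PatrioticFlag = 1
""").fetchone()[0]

# By color: an ID contributes 1 to every patriotic color it has.
patriotic_by_color = dict(conn.execute("""
    SELECT f.Color, COUNT(DISTINCT f.ID)
    FROM FactTable f
    JOIN ColorDim c ON c.Color = f.Color
    WHERE c.PatrioticFlag = 1
    GROUP BY f.Color
""").fetchall())

print(patriotic_total)
print(patriotic_by_color)
```

In a cube, the grand total would correspond to a distinct-count measure on ID rather than a sum, which is why it does not equal the sum of the per-color counts.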

I finally figured it out. First, I added one row per ID to the fact table containing pre-aggregated data for that ID, so the fact table becomes:
create table FactTable (
ID int,
Color varchar(10),
Flag int)
insert into FactTable (ID, Color, Flag) values (1, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (1, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (1, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (1, 'PATRIOTIC', 1)
insert into FactTable (ID, Color, Flag) values (2, 'RED', 0)
insert into FactTable (ID, Color, Flag) values (2, 'WHITE', 0)
insert into FactTable (ID, Color, Flag) values (2, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (2, 'PATRIOTIC', 1)
insert into FactTable (ID, Color, Flag) values (3, 'RED', 1)
insert into FactTable (ID, Color, Flag) values (3, 'WHITE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'BLUE', 1)
insert into FactTable (ID, Color, Flag) values (3, 'PATRIOTIC', 1)
Similarly, add a row to the color dimension table:
create table ColorDim (
CID int,
Color varchar(10))
insert into ColorDim (CID, Color) values (1, 'RED')
insert into ColorDim (CID, Color) values (2, 'WHITE')
insert into ColorDim (CID, Color) values (3, 'BLUE')
insert into ColorDim (CID, Color) values (4, 'PATRIOTIC')
Then, in MS Visual Studio, edit the DefaultMember property of the Color attribute in the Color Dimension as:
[Color Dimension].[ColorDim].&[PATRIOTIC]
The DefaultMember property tells MS Visual Studio that rows of the fact table which have Color 'PATRIOTIC' are already aggregates of the other rows with the same ID and other Color values.
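The pre-aggregated rows do not have to be typed by hand. Below is a sketch, using Python's sqlite3 module as a stand-in for SQL Server, that derives one 'PATRIOTIC' row per ID from the detail rows. Taking MAX(Flag) as the aggregate is an assumption based on the "at least one of the colors" wording in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FactTable (ID INTEGER, Color TEXT, Flag INTEGER);
    INSERT INTO FactTable (ID, Color, Flag) VALUES
        (1, 'RED', 1), (1, 'WHITE', 0), (1, 'BLUE', 1),
        (2, 'RED', 0), (2, 'WHITE', 0), (2, 'BLUE', 1),
        (3, 'RED', 1), (3, 'WHITE', 1), (3, 'BLUE', 1);

    -- One summary row per ID: Flag is 1 if any of its colors is flagged.
    INSERT INTO FactTable (ID, Color, Flag)
    SELECT ID, 'PATRIOTIC', MAX(Flag)
    FROM FactTable
    GROUP BY ID;
""")

summary_rows = conn.execute("""
    SELECT ID, Flag FROM FactTable
    WHERE Color = 'PATRIOTIC'
    ORDER BY ID
""").fetchall()

# The value a SUM(Flag) measure shows while Color sits on its default member.
default_total = conn.execute(
    "SELECT SUM(Flag) FROM FactTable WHERE Color = 'PATRIOTIC'"
).fetchone()[0]

print(summary_rows, default_total)
```

The generated rows match the three 'PATRIOTIC' inserts listed above, so the same statement could be run as part of the fact table load.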

Related

How can I delete records that have the same values in two non-primary key columns while factoring in a third?

Here is a link to a sqlfiddle with sample data: http://sqlfiddle.com/#!18/d8d552/3
I'm trying to delete one row from each set of rows that have the same values in both name and color. If they have the same score, then it doesn't matter which one is deleted. If those rows have different scores, then I want to delete the row with the lower score.
I'm getting it down to the correct rows (but with an extra row with the same score that I would want to keep one of) using an inner join, but I'm having difficulty differentiating between scores.
Another route I've tried is with the 2nd query that gets me the exact rows I want to delete, but then I can't seem to get the ItemID as it's not part of the aggregate function or GROUP BY. Using the function ROW_NUMBER() helps get the correct rows, but makes it difficult to get the IDs to delete.
I'm looking to delete ItemIDs 1 and 9, and then one of (2, 4) and one of (6, 8).
ItemID  name  color   score
1       asdf  green   5
2       asdf  blue    4
3       asdf  green   6
4       asdf  blue    4
5       asdf  yellow  0
6       qwer  green   3
7       qwer  blue    3
8       qwer  green   3
9       qwer  blue    2
10      qwer  yellow  0
Here is some SQL that I've tried out which gets me close:
SELECT I.*
FROM Items I
INNER JOIN
(SELECT name, color, MIN(score) AS ms
FROM Items
GROUP BY name, color
HAVING COUNT(*) > 1) s ON I.name = s.name
AND I.color = s.color
AND I.score = ms
ORDER BY I.ItemID
GO
SELECT
name, color, MIN(score) AS minScore,
ROW_NUMBER() OVER (PARTITION BY name, color
ORDER BY MIN(score)) AS rn
FROM Items
GROUP BY name, color
HAVING COUNT(*) > 1
ORDER BY name
EDIT
I believe @Susang's answer works better than mine. Here's a link to a fiddle with it working: http://sqlfiddle.com/#!18/d8873/5
You can use CTE as below:
CREATE TABLE #test(ItemID INT, name varchar(50), color varchar(50), score INT)
INSERT INTO #test(ItemID, name, color, score) VALUES
(1, 'asdf', 'green', 5),
(2, 'asdf', 'blue', 4),
(3, 'asdf', 'green', 6),
(4, 'asdf', 'blue', 4),
(5, 'asdf', 'yellow', 0),
(6, 'qwer', 'green', 3),
(7, 'qwer', 'blue', 3),
(8, 'qwer', 'green', 3),
(9, 'qwer', 'blue', 2),
(10, 'qwer', 'yellow', 0)
;WITH recDelete AS (
SELECT
rnk = ROW_NUMBER() OVER (PARTITION BY name, color ORDER BY score DESC),
ItemID
FROM #test
)
DELETE FROM recDelete WHERE rnk > 1
Use an updatable CTE! This is simple:
WITH todelete AS (
SELECT i.*,
ROW_NUMBER() OVER (PARTITION BY name, color ORDER BY score DESC) as seqnum
FROM items i
)
DELETE todelete
WHERE seqnum > 1;
No JOIN is needed.
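For engines that cannot delete through a CTE, the same ranking can drive an IN subquery instead. The sketch below uses Python's sqlite3 module rather than SQL Server and assumes SQLite 3.25 or later for window functions. The extra ItemID tie-breaker is an addition, not part of the answers above; it only makes the choice between equal scores deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, name TEXT, color TEXT, score INTEGER);
    INSERT INTO Items (ItemID, name, color, score) VALUES
        (1, 'asdf', 'green', 5), (2, 'asdf', 'blue', 4),
        (3, 'asdf', 'green', 6), (4, 'asdf', 'blue', 4),
        (5, 'asdf', 'yellow', 0), (6, 'qwer', 'green', 3),
        (7, 'qwer', 'blue', 3), (8, 'qwer', 'green', 3),
        (9, 'qwer', 'blue', 2), (10, 'qwer', 'yellow', 0);
""")

# Rank rows within each (name, color) group, best score first; ItemID breaks
# ties. Every row ranked below first place is a duplicate to delete.
conn.execute("""
    DELETE FROM Items
    WHERE ItemID IN (
        SELECT ItemID
        FROM (
            SELECT ItemID,
                   ROW_NUMBER() OVER (
                       PARTITION BY name, color
                       ORDER BY score DESC, ItemID
                   ) AS seqnum
            FROM Items
        )
        WHERE seqnum > 1
    )
""")

remaining = [row[0] for row in conn.execute("SELECT ItemID FROM Items ORDER BY ItemID")]
print(remaining)
```

With this tie-breaker the rows removed are ItemIDs 1, 4, 8 and 9, which is one valid outcome of the "1 and 9, one of (2, 4), one of (6, 8)" requirement.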

Query to detect changes in item location from one day before (RFID tracking)

I have some RFID tags on items, which generate data in a table.
Unfortunately, the reports on this system range from poor to nonexistent, and I want to make one.
Consider my data to be something like this. Every time we "query" the system, it scans and inserts all item data into the same table:
My desired output, when using 02/03/2020 day as base, is this (based on the location):
I've colored the first picture (based on the address column) to better illustrate this.
In this example, you can see all possible status:
No change (ITEM_ID stayed in the same place as the day before)
Item left forever (it was sold, so it is not in any other place; example: ITEM_ID 00006)
Item was moved from address A to B (example: ITEM_ID 0005 was on A104 and is now on A110)
New item added on an unused address (example: ITEM_ID 0008, another bag)
New item on an address used yesterday (example: ITEM_ID 0009, which was put on A104, used the day before by ITEM_ID 0005)
Someone told me that this can be accomplished by using ROLLUP or CUBE, but I'm not sure if it is the best approach, or how to use it.
The totals are a plus. I can export the data to Excel and do a count based on the STATUS column (or even run another select).
In summary, it is a tracking report.
ANY tips will be kindly appreciated.
SQL SERVER is Microsoft SQL Server 2016 Standard
Basically, you want to join the table with itself to track where things are going. I don't think I've addressed all of your concerns, but this should be a good place to start.
CREATE TABLE rfid
(
item_id INT,
address VARCHAR(50),
description VARCHAR(50),
qty INT,
[date] DATE
)
INSERT INTO rfid (item_id, address, description, qty, [date])
VALUES (1, 'a100', 'cable', 100, '2020-01-03'),
(2, 'a101', 'charger', 100, '2020-01-03'),
(3, 'a102', 'laptop', 100, '2020-01-03'),
(4, 'a103', 'chair', 100, '2020-01-03'),
(5, 'a104', 'basket', 100, '2020-01-03'),
(6, 'a105', 'bag', 100, '2020-01-03'),
(1, 'a100', 'cable', 100, '2020-02-03'),
(2, 'a101', 'charger', 100, '2020-02-03'),
(3, 'a102', 'laptop', 100, '2020-02-03'),
(4, 'a103', 'chair', 100, '2020-02-03'),
(5, 'a110', 'basket', 100, '2020-02-03'),
(8, 'a200', 'bag', 100, '2020-02-03'),
(9, 'a104', 'keyboard', 100, '2020-02-03');
WITH inventory_new (item_id, address, description)
AS (SELECT item_id,
address,
description
FROM rfid
WHERE [date] = '2020-02-03'),
inventory_old (item_id, address, description)
AS (SELECT item_id,
address,
description
FROM rfid
WHERE [date] = '2020-01-03')
SELECT COALESCE(o.item_id, n.item_id) item_id,
COALESCE(o.description, n.description) description,
CASE
WHEN o.address = n.address THEN 'no change'
WHEN o.address IS NULL THEN 'in'
WHEN n.address IS NULL THEN 'out'
ELSE 'moved'
END outcome
FROM inventory_old o
FULL OUTER JOIN inventory_new n
ON ( n.item_id = o.item_id )
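A sketch of the full status classification from the question, including the move and address-reuse cases, written in plain Python rather than T-SQL. The function name `classify` and the status labels are hypothetical; the two dictionaries hold the same item-to-address data as the rfid rows above for the two dates.

```python
def classify(old, new):
    """Compare two snapshots that map item_id -> address.

    `old` is the previous day, `new` is the base day. Returns a dict
    mapping item_id -> status string.
    """
    old_addresses = set(old.values())
    report = {}
    for item_id in sorted(old.keys() | new.keys()):
        before, after = old.get(item_id), new.get(item_id)
        if after is None:
            status = "out"  # present yesterday, gone today
        elif before is None:
            # New item: distinguish a fresh address from one used yesterday.
            if after in old_addresses:
                status = "in (reused address)"
            else:
                status = "in (new address)"
        elif before == after:
            status = "no change"
        else:
            status = f"moved {before} -> {after}"
        report[item_id] = status
    return report


old = {1: "a100", 2: "a101", 3: "a102", 4: "a103", 5: "a104", 6: "a105"}
new = {1: "a100", 2: "a101", 3: "a102", 4: "a103", 5: "a110", 8: "a200", 9: "a104"}
report = classify(old, new)
for item_id, status in report.items():
    print(item_id, status)
```

The same branching could be expressed in SQL by adding an ELSE branch for moves and an EXISTS check against the previous day's addresses for new items.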

Run mode() function of each value in INT ARRAY

I have a table that holds an INT ARRAY column representing some features (this is done instead of having a separate boolean column for each feature). The column is called feature_ids. If a record has a specific feature, the ID of that feature is present in the feature_ids column. For context, the feature IDs map as follows:
1: Fast
2: Expensive
3: Colorfull
4: Deadly
In other words, I could have had 4 columns called is_fast, is_expensive, is_colorfull and is_deadly, but I don't, since my real application has 100+ features and they change quite a bit.
Now back to the question: I want to run an aggregate mode() over the records in the table, returning the most "frequent" features (e.g. whether it's more common to be "fast" than not). I want the result in the same format as the original feature_ids column, but with a feature's ID present ONLY if it's more common for it to be there than not within each group:
CREATE TABLE test (
id INT,
feature_ids integer[] DEFAULT '{}'::integer[],
age INT,
type character varying(255)
);
INSERT INTO test (id, age, feature_ids, type) VALUES (1, 10, '{1,2}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (2, 2, '{1}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (3, 9, '{1,2,4}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (4, 11, '{1,2,3}', 'wine');
INSERT INTO test (id, age, feature_ids, type) VALUES (5, 12, '{1,2,4}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (6, 12, '{1,2,3}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (7, 8, '{1,4}', 'hat');
I want to run a query something like this:
SELECT
type, avg(age) as avg_age, mode() within group (order by feature_ids) as most_frequent_features
from test group by "type"
The result I expect is:
type   avg_age  most_frequent_features
hat    10.6     [1,2,4]
movie  7.0      [1,2]
wine   11.0     [1,2,3]
I have made an example here: https://www.db-fiddle.com/f/rTP4w7264vDC5rqjef6Nai/1
I find this quite tricky. The following is a rather brute-force approach -- calculating the "mode" explicitly and then bringing in the other aggregates:
select tf.type, t.avg_age,
array_agg(feature_id) as features
from (select t.type, feature_id, count(*) as cnt,
dense_rank() over (partition by t.type order by count(*) desc) as seqnum
from test t cross join
unnest(feature_ids) feature_id
group by t.type, feature_id
) tf join
(select t.type, avg(age) as avg_age
from test t
group by t.type
) t
on tf.type = t.type
where seqnum <= 2
group by tf.type, t.avg_age;
Here is a db<>fiddle.
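For comparison, here is a sketch of the "more common than not" rule from the question as a strict majority test, in plain Python rather than PostgreSQL. The function name `majority_features` is hypothetical, and this is a different rule from the `seqnum <= 2` ranking in the query above, although both give the same arrays on the sample rows.

```python
from collections import Counter, defaultdict

# (id, age, feature_ids, type), copied from the INSERT statements in the question.
rows = [
    (1, 10, [1, 2], "movie"),
    (2, 2, [1], "movie"),
    (3, 9, [1, 2, 4], "movie"),
    (4, 11, [1, 2, 3], "wine"),
    (5, 12, [1, 2, 4], "hat"),
    (6, 12, [1, 2, 3], "hat"),
    (7, 8, [1, 4], "hat"),
]


def majority_features(rows):
    """Per type: average age and the features present in more than half the rows."""
    groups = defaultdict(list)
    for _id, age, feature_ids, type_ in rows:
        groups[type_].append((age, feature_ids))

    result = {}
    for type_, members in groups.items():
        n = len(members)
        # set() guards against a feature id repeated inside one array.
        counts = Counter(f for _, feats in members for f in set(feats))
        frequent = sorted(f for f, c in counts.items() if 2 * c > n)
        avg_age = sum(age for age, _ in members) / n
        result[type_] = (avg_age, frequent)
    return result


result = majority_features(rows)
for type_ in sorted(result):
    print(type_, result[type_])
```

In SQL, the equivalent would be to keep a feature when its per-type count exceeds half of the per-type row count, instead of filtering on a rank.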

SQL Filter by attributes (Design)

I have a database of products which I'd like to filter by arbitrary categories. Let's say for the sake of an example that I run a garage. I have a section of products which are cars.
Each car should have a collection of attributes, all cars have the same number and type of attributes; for instance colour:red, doors:2, make:ford ; with those same attributes set to various values on all the cars.
Gut feeling tells me that it would be best to add "colour", "doors" and "make" columns to the product table.
HOWEVER: Not ALL the products in the table are cars. Perhaps I would like to list tyres on the page of cars. Obviously, "colour" and "doors" won't apply to tires. Even so, if a user selects colour=red as a filter, I would still like the tires to be shown as they lack the colour attribute.
Mulling it over (and I'm really not a database guy, so I apologise if this approach is horrible), I considered having a single "attributes" column which I could fill with an arbitrary number of arbitrarily named attributes, then using SQL's string functions to do the filtering. I guess you could even use a bit field here if you planned carefully. This seems hackish to me though; I'd be interested to know how some of the larger sites such as Amazon do this.
What are the issues with these approaches? Can anyone recommend alternatives or shed any light on the subject for me?
Thanks in advance
You should read about database normalization. It is generally not a good idea to use concatenated strings as values in a single column. I made a very small sqlfiddle for you to start playing around. This does not really solve all your problems, but it may lead you in the right direction.
Schema:
CREATE TABLE product (id int, name varchar(200), info varchar(200));
INSERT INTO product (id, name, info) VALUES (100, "Porsche", "cool");
...
INSERT INTO product (id, name, info) VALUES (103, "Tires", "you need them!");
CREATE TABLE attr (id int, product_id int, a_name varchar(200), a_value varchar(200));
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 100, "color", "black");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (2, 100, "doors", "2");
...
A Query:
SELECT * FROM product INNER JOIN attr ON attr.product_id=product.id
WHERE attr.a_name="doors" AND attr.a_value = "2"
Anyone reading this in the future, I managed to get the results I wanted thanks to luksch taking the time to help me out!!! Thanks!!!
Using this layout:
CREATE TABLE product (id int, name varchar(200));
INSERT INTO product (id, name) VALUES (100, "Red Porsche");
INSERT INTO product (id, name) VALUES (101, "Red Ferrari V8");
INSERT INTO product (id, name) VALUES (102, "Red Ferrari V12");
INSERT INTO product (id, name) VALUES (103, "Blue Porsche");
INSERT INTO product (id, name) VALUES (104, "Blue Ferrari V8");
INSERT INTO product (id, name) VALUES (105, "Blue Ferrari V12");
INSERT INTO product (id, name) VALUES (106, "Snow Tires");
INSERT INTO product (id, name) VALUES (107, "Fluffy Dice");
CREATE TABLE attr (id int, product_id int, a_name varchar(200), a_value varchar(200));
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 100, "colour", "red");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (2, 101, "colour", "red");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (3, 101, "cylinders", "8");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (4, 102, "colour", "red");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (5, 102, "cylinders", "12");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (6, 103, "colour", "blue");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (7, 104, "colour", "blue");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (8, 104, "cylinders", "8");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (9, 105, "colour", "blue");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (10, 105, "cylinders", "12");
I achieved the result I wanted; which was two things:
Firstly I wanted to be able to select products by attribute, say by colour and cylinders, but also show any products which have neither the colour nor cylinders attribute, which I achieved with this query:
SELECT DISTINCT product.id, name, a_value
FROM product
LEFT JOIN attr
ON product_id=product.id
WHERE
(
(a_name="colour" AND a_value="blue")
OR
(a_name IS NULL)
)
AND product.id IN
(
SELECT product.id
FROM product
LEFT JOIN attr
ON product_id=product.id
WHERE
(a_name="cylinders" AND a_value="12")
OR
(a_name IS NULL)
)
This lists all the blue cars with 12 cylinders, and also lists the tires and fluffy dice, since they have neither a colour nor a cylinder count. This can easily be adapted to filter on one attribute, or you can add more AND / IN clauses to add more filters.
I also wanted to be able to list all relevant attributes (I use WHERE 1 in this example, but in practice this would be WHERE idfolders=? to list all attributes relevant to a specific folder):
SELECT DISTINCT a_value, a_name
FROM product
INNER JOIN attr
ON product_id=product.id
WHERE 1
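One thing to be aware of: the a_name IS NULL test in the first query only matches products with no attributes at all. A product that has a colour but no cylinders row (the Blue Porsche, say) is dropped by a cylinders filter. The sketch below shows a per-attribute variant, where a product passes each filter if it either matches the value or lacks that attribute entirely. It runs on Python's sqlite3 module, the helper name `filter_products` is hypothetical, and the attr.id column is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER, name TEXT);
    INSERT INTO product (id, name) VALUES
        (100, 'Red Porsche'), (101, 'Red Ferrari V8'), (102, 'Red Ferrari V12'),
        (103, 'Blue Porsche'), (104, 'Blue Ferrari V8'), (105, 'Blue Ferrari V12'),
        (106, 'Snow Tires'), (107, 'Fluffy Dice');
    CREATE TABLE attr (product_id INTEGER, a_name TEXT, a_value TEXT);
    INSERT INTO attr (product_id, a_name, a_value) VALUES
        (100, 'colour', 'red'),
        (101, 'colour', 'red'), (101, 'cylinders', '8'),
        (102, 'colour', 'red'), (102, 'cylinders', '12'),
        (103, 'colour', 'blue'),
        (104, 'colour', 'blue'), (104, 'cylinders', '8'),
        (105, 'colour', 'blue'), (105, 'cylinders', '12');
""")


def filter_products(conn, filters):
    """Return ids of products that match every filter or lack that attribute."""
    clauses, params = [], []
    for a_name, a_value in filters.items():
        clauses.append("""(
            EXISTS (SELECT 1 FROM attr a
                    WHERE a.product_id = p.id AND a.a_name = ? AND a.a_value = ?)
            OR NOT EXISTS (SELECT 1 FROM attr a
                           WHERE a.product_id = p.id AND a.a_name = ?)
        )""")
        params += [a_name, a_value, a_name]
    where = " AND ".join(clauses) or "1"  # no filters: keep every product
    sql = f"SELECT p.id FROM product p WHERE {where} ORDER BY p.id"
    return [row[0] for row in conn.execute(sql, params)]


blue_v12 = filter_products(conn, {"colour": "blue", "cylinders": "12"})
print(blue_v12)
```

Whether the Blue Porsche should appear under a cylinders filter is a product decision; the variant above includes it, while the original query excludes it.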

Oracle SQL: count frequencies and convert to columns

I want to count how frequently values appear in certain columns and create a new table, with the values as columns and the frequencies as data. Example:
create table users
(id number primary key,
name varchar2(255));
insert into users values (1, 'John');
insert into users values (2, 'Joe');
insert into users values (3, 'Max');
create table meals
(id number primary key,
user_id number,
food varchar2(255));
insert into meals values (1, 1, 'Apple');
insert into meals values (2, 1, 'Apple');
insert into meals values (3, 1, 'Orange');
insert into meals values (4, 1, 'Bread');
insert into meals values (5, 1, 'Apple');
insert into meals values (6, 2, 'Apple');
insert into meals values (7, 2, 'Bread');
insert into meals values (8, 2, 'Bread');
insert into meals values (9, 2, 'Apple');
insert into meals values (10, 3, 'Orange');
insert into meals values (11, 3, 'Bread');
insert into meals values (12, 3, 'Bread');
So I have different users and their meals (here Bread, Apple and Orange). For every user, I want to know how often they ate each food. The following query does exactly what I want:
select
(select count(id) from meals where meals.user_id = users.id and meals.food = 'Apple') as count_apple,
(select count(id) from meals where meals.user_id = users.id and meals.food = 'Orange') as count_orange,
(select count(id) from meals where meals.user_id = users.id and meals.food = 'Bread') as count_bread
from users;
The problem is that this is REALLY slow, especially with more than 100,000 users and dozens of different foods. I am sure there is a faster way, but I am not experienced enough in SQL to find it.
If you're using 11g, then you can use the pivot operator, like so:
select * from (
select user_id, food from meals
)
pivot (count(*) as count for (food) in ('Apple', 'Orange', 'Bread'));
Otherwise you'll have to do a manual pivot:
select user_id,
sum(case when food = 'Apple' then 1 else 0 end) count_apple,
sum(case when food = 'Orange' then 1 else 0 end) count_orange,
sum(case when food = 'Bread' then 1 else 0 end) count_bread
from meals
group by user_id
In either case, these should be faster than your original approach as you're only accessing the meals table once.
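With dozens of foods, the manual pivot gets tedious to write by hand. Here is a sketch of generating the SUM(CASE ...) columns from a list, using Python's sqlite3 module as a stand-in for Oracle. The helper name `pivot_counts` is hypothetical, and it assumes a non-empty list of foods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meals (id INTEGER PRIMARY KEY, user_id INTEGER, food TEXT);
    INSERT INTO meals (id, user_id, food) VALUES
        (1, 1, 'Apple'), (2, 1, 'Apple'), (3, 1, 'Orange'), (4, 1, 'Bread'),
        (5, 1, 'Apple'), (6, 2, 'Apple'), (7, 2, 'Bread'), (8, 2, 'Bread'),
        (9, 2, 'Apple'), (10, 3, 'Orange'), (11, 3, 'Bread'), (12, 3, 'Bread');
""")


def pivot_counts(conn, foods):
    """Count each food per user in a single pass over the meals table.

    Returns rows of (user_id, count_for_foods[0], count_for_foods[1], ...).
    Food names are bound as parameters, not spliced into the SQL text.
    """
    columns = ", ".join(
        "SUM(CASE WHEN food = ? THEN 1 ELSE 0 END)" for _ in foods
    )
    sql = f"SELECT user_id, {columns} FROM meals GROUP BY user_id ORDER BY user_id"
    return conn.execute(sql, list(foods)).fetchall()


counts = pivot_counts(conn, ["Apple", "Orange", "Bread"])
for row in counts:
    print(row)
```

Users with no meals at all do not appear in this result; an outer join from users would be needed to list them with zero counts.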