Remove duplicates from fact table to calculate measure correctly - sql

I'm very new to data warehousing and dimensional modelling. For a uni project I started out with a database that I need to turn into a data warehouse for analysis. To end up with a clean star schema, I had to denormalize a few tables into one fact table. The downside to this is the amount of redundancy.
Below is a part of the data from the fact table:

voyage_id | shipment_id | item_id | container_start | container_end
1         | 1           | 3       | 1               | 2000
1         | 1           | 1       | 2001            | 5000
A voyage consists of multiple shipments, and a shipment can consist of multiple different items. In this example, containers 1-2000 of shipment 1 contain item 3, and containers 2001-5000 contain item 1. The total number of containers for this shipment is 5000, obviously. However, for data analysis purposes I need to calculate a measure for the total number of containers per voyage. This presents a problem with the current fact table, because I have a record for each different item. For voyage 1 the actual total should be 9200, but because of the duplication I end up with 19400, leading to an incorrect measure.
I need to find a way to get rid of the duplicates in my calculation, but I can't find a way to do so. Any help would be much appreciated.

What you'll need to do is group by your shipments (CTE, inner query, temp table, etc) to get the number of containers per shipment, then group by your voyages to get the number of containers per voyage.
Here's an example with an inner query:
SELECT voyage_id, SUM(num_ship_containers) AS num_voyage_containers
FROM (
    SELECT voyage_id, shipment_id, MAX(container_end) AS num_ship_containers
    FROM ShippingWarehouse
    GROUP BY voyage_id, shipment_id
) AS ship_data
GROUP BY voyage_id;
voyage_id | num_voyage_containers
1         | 9200
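The same logic can also be written with a CTE, as mentioned above (this is equivalent to the inner-query version, just a sketch in that alternative form):

WITH ship_data AS (
    -- One row per shipment: the highest container number is the shipment's count
    SELECT voyage_id, shipment_id, MAX(container_end) AS num_ship_containers
    FROM ShippingWarehouse
    GROUP BY voyage_id, shipment_id
)
SELECT voyage_id, SUM(num_ship_containers) AS num_voyage_containers
FROM ship_data
GROUP BY voyage_id;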

Different detail levels in one table

I'm having a problem with a table, where I need to store measures on different detail levels. My default table is:
Id | TotalQuantity | Amount
1  | 75            | 1000
Where TotalQuantity is the sum of the quantities from every month.
Now I need to add to my default table the information about what quantity I have in each month. These monthly quantities should be in one column, so I used UNION.
The problem is that when I sum up values from these columns in some reports, TotalQuantity (and other values that are the same for both rows) will be counted twice and displayed wrong. How can I store all that information?
You need a fact table at the (id, month) grain, like FactMonthlyTotals(id, month, amount). If you have other data that is not for a particular month it would go on a separate fact table, or perhaps a dimension table.
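A minimal sketch of that design (the table name is from the answer; the column types and the derived-total query are my assumptions):

-- One row per id per month: the monthly grain lives in its own fact table
CREATE TABLE FactMonthlyTotals (
    id     INT  NOT NULL,
    month  DATE NOT NULL,   -- e.g. the first day of the month
    amount INT  NOT NULL,
    PRIMARY KEY (id, month)
);

-- The total is then derived at query time instead of being stored twice:
SELECT id, SUM(amount) AS TotalQuantity
FROM FactMonthlyTotals
GROUP BY id;

Storing the total only at one grain and deriving the other avoids the double counting described in the question.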

How to create an aggregate table (data mart) that will improve chart performance?

I created a table named user_preferences where user preferences have been grouped by user_id and month.
The user_preferences table has the columns user_id, month, city, district, rooms, and price_max.
Each month I collect all user_ids and assign all preferences:
city
district
number of rooms
the maximum price they can spend
The plan assumes displaying a graph of users' shopping intentions: a line showing the number of interested users for the values selected in the filters. The graph should enable filtering by those parameters (month, city, district, rooms, maximum price).
What you see above is a simplified form for clarifying the subject. In fact, there are many more users, and every month the table grows by several hundred thousand records. The SQL query that feeds the chart takes up to 50 seconds - far too long, and I can't afford that.
So I need to create a table (an aggregation table / data mart) into which I can insert the previously calculated number of interested users for all combinations. Thanks to this, the end user will not have to wait for the data to be counted.
Now the question is - how to create such a table in PostgreSQL?
I know how to write a SQL query that will calculate a specific example.
SELECT
    month,
    COUNT(DISTINCT user_id) AS interested_users
FROM user_preferences
WHERE month BETWEEN '2020-01' AND '2020-03'
  AND city = 'Madrid'
  AND district = 'Latina'
  AND rooms IN (1, 2)
  AND price_max BETWEEN 400001 AND 500000
GROUP BY month;
The question is - how do I calculate all possible combinations? Can I write multiple nested loops in SQL?
The topic is extremely important to me, and I think it will also be useful to others in the future.
I will be extremely grateful for any tips.
Well, based on your query, you have the following filters:
month
city
district
rooms
price_max
You can try creating a view with the following structure:
SELECT month
     , city
     , district
     , rooms
     , price_max
     , COUNT(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month
       , city
       , district
       , rooms
       , price_max
You can make this view materialized, so the query behind the view will not be executed each time it is queried - it will behave like a table.
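For example (the view name user_preferences_agg is my own; everything else comes from the query above):

-- Precompute the per-group counts once, instead of on every chart request
CREATE MATERIALIZED VIEW user_preferences_agg AS
SELECT month
     , city
     , district
     , rooms
     , price_max
     , COUNT(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month, city, district, rooms, price_max;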
When you add new records to the base table, you will need to refresh the view (unfortunately, PostgreSQL does not support auto-refresh the way some other databases do):
REFRESH MATERIALIZED VIEW my_view;
or you can schedule a task to do it.
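A sketch of scheduling that refresh, assuming the pg_cron extension is available (the view name comes from the sketch above):

-- Refresh the aggregate every night at 03:00
SELECT cron.schedule('0 3 * * *', 'REFRESH MATERIALIZED VIEW user_preferences_agg');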
If you only ever use exact matches on each field, this will work. But in your example, you have criteria like:
month BETWEEN '2020-01' AND '2020-03'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
In such cases, I usually write the same query but SUM the data from the materialized view. In your case, you are using DISTINCT and this may lead to counting a user multiple times.
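For example, rolling the pre-aggregated rows up from the materialized view sketched above (with exactly that caveat: a user who appears in several of the aggregated groups is counted once per group):

SELECT month
     , SUM(interested_users) AS interested_users
FROM user_preferences_agg
WHERE month BETWEEN '2020-01' AND '2020-03'
  AND city = 'Madrid'
  AND district = 'Latina'
  AND rooms IN (1, 2)
  AND price_max BETWEEN 400001 AND 500000
GROUP BY month;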
If this is an issue, you would need to precalculate too many combinations, and I doubt this is the answer. Alternatively, you can try to normalize your data - this may improve the performance of the aggregations.

MDX : combining data from different tables

How can I combine data coming from different tables?
Let's assume I have 2 tables:
First with sales:
id shop
id product
date
amount
Second with stocks:
actually, with the same structure
id shop
id product
date
amount
I need to analyze how many days' worth of stock there is in the shop now. For that I need to calculate the average sale per shop per day over the last 20 weeks (140 days), and then divide the remaining stock by that average daily sales rate.
How can I achieve this?
This is not an actual problem in MDX, as you can combine dimensions over different fact tables.
You need to create your 3 dimensions (using reference tables or similar):
id_shop -> [Shop]
id_product -> [Product]
date -> [Time]
Now we need to add the two tables as 'fact' tables. Recall that Fact Tables are the ones defining measures.
In icCube create a default Cube, e.g. [Cube], and for each table create a 'measure group' (just click the '+' ).
Bind your tables to the dimensions; the 'magic wand' will do the work and create a measure for each table (e.g. [Stock] & [Sales]).
Once the schema is defined and deployed, you can use both measures without even noticing they come from different tables:
[Measures].[Sales] / [Measures].[Stocks]

SSAS - relationship/granularity

I have 2 fact tables, each with a measure group: Production and Production Orders. Production has production information at a lower granularity (the component level); Production Orders has information at a higher level (the order level, with header quantities etc.).
I have created a surrogate key link between the two tables on ProductionOrderID. As soon as I add Prod ID (from the production details dimension) to the pivot table, it blats out the actual qty (from the prod order measure group) and I cannot combine the quantities from the two measure groups.
How can I design the correct relationship between the two? Please see my dimension usage diagram. Production Details is the dimension that links the two fact tables; at the moment DimProductionDetails is in a fact relationship with Production. I'm not sure what the relationship with Production Order should be (it is currently many-to-many).
Please see the example data between the two tables; I have to be able to duplicate this behaviour.
Do you want the full actual qty from prod order measure group to repeat next to each product? If so a many-to-many relationship is right. I suspect once I explain how that many-to-many works you will spot the problem.
When you slice full actual qty from the prod order measure group by product from the Production Details dimension, it does a runtime join between the two measure groups on the common dimensions. So, for example, if order 245295 has a date of 1/1/2015 while the production details for order 245295 have dates of 1/8/2015, then the runtime join will lose rows for that order and actual qty will show as null. So compare all the dimensions used on both measure groups and ensure all rows for the same order have the same dimension keys for those common dimensions.
If, for example, the dates differ, then create a named query in the DSV that selects just the dimension columns from the production fact table which match the order fact table. Then create a new measure group off that named query and use it as the intermediate measure group in your many-to-many dimension. (The current many-to-many cell in the Dimension Usage tab should name the new measure group, not the existing Production measure group.)
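A sketch of such a named query (the fact table name FactProduction is hypothetical; ProductionDetails_SK and ProductionOrder_SK stand for the surrogate keys discussed in this thread - keep only the dimension keys whose values are consistent with the order grain):

-- Intermediate fact for the many-to-many: just the dimension keys that the
-- production rows share with the order measure group
SELECT DISTINCT ProductionDetails_SK
              , ProductionOrder_SK
FROM FactProduction;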
Edit: if you want the actual qty measure to show only at the order level and be null at the product level, then try the following. Change the many-to-many relationship to a regular relationship, and in the dialog where you choose how the fact table joins to the dimension, change the dimension attribute to ProductionOrder_SK (which is not the key of the dimension) and choose the corresponding column in the fact table. Then left-click the Production Order measure group, go to the Properties window, and set IgnoreUnrelatedDimensions to false. That way, slicing actual qty by work center or by any attribute that is below the grain in the Production Details dimension will show null.

SQL Normalization with multiple "measures" tables

I'm currently trying to redesign a Point of Sale database to make it more normalized, which will help tremendously with managing the data, etc. I'm a little unsure about the best design practices for the data I have to deal with. First of all, there are basically two sets of measures which share common keys: inventory data (units and dollars) and point of sale data (units and dollars). Each of these is at the customer, store, item and date level.
What I've done (mostly in theory at this point) is to create separate tables for:
Item level information
Item_ID,
Customer_ID
itemnumber
(and a few other item-specific columns).
Stores
Store_ID,
Customer_ID,
Store Number,
(and essentially address information)
Customer
Customer_ID,
Customer Number
(other customer specific information like name).
So in addition to those "support" tables, I have the
Main Inventory Data
Store_ID
Item_ID
I also have a POS Data table with the exact same IDs.
Basically my questions are:
Should I include the Customer_ID in the POS Data and Inventory Data tables, even though it is already part of both the Stores and Items tables?
My second question is: if I do add the Customer_ID, and I join all of these tables together,
would I join the Customer_ID from all of the tables (POS Data, Stores and Items, OR Inventory Data, Stores and Items) to the Customers table, or
would joining from just the POS Data table be sufficient?
Let me give a few additional details, regarding the data. As an example, we have two Customers, CustomerA and CustomerB. CustomerA has several stores whose store numbers are 1000,1025, 1036 and 1037. CustomerB also has several stores, whose store numbers are 1025, 1030, and 1037. Store numbers 1025 and 1037 happen to be the same between customers, but the stores themselves are unique and completely different.
CustomerA's Store Number 1000 sells three of our items (this is a wholesale perspective), which are Items ABC, DEF and EFG. CustomerA's Store Number 1025 also sells three of our items, which are ABC, HIJ and XYZ.
Each of these items has two important pieces of data with regard to its relationship to its specific customer and store number: Point of Sale data and Inventory data. Point of Sale data comes as PosUnits, the quantity of an item that was sold, and PosDollars, the total dollars of the item sold in that store (essentially the number of units times the price it was sold for). Inventory data comes as InventoryUnits, the quantity of an item that is in stock at a store. [One thing to note: I separated inventory and POS data into separate tables because we don't always receive both pieces of data from every customer. Also, inventory and POS data are generally analyzed separately.]
So, back to my example, CustomerA's Store Number 1000, item ABC may have sold 100 units, which is $1245.00. CustomerA's Store Number 1025, may have sold only 10 units of the same item for $124.50.
Now if we go back to CustomerB, it just so happens this Customer also has an item named ABC that it sells in many of their stores. CustomerA's item ABC is a completely different product from CustomerB's item ABC. It's purely coincidental that they named them the same thing.
Let me add one last point of clarification, which I probably should have stated earlier. My perspective is as a wholesaler. When I say item, I mean the customer's item number, not the wholesaler's item number. There is a cross-reference involved in getting to the wholesaler's item, and a customer may have more than one of their item numbers referencing the same wholesaler item number. I don't think it's necessary to delve into that, though.
Question #1: As part of the normalization rules, you should avoid including redundant data in any table unless there are performance issues that require denormalization. There are thousands of articles explaining why to avoid redundancy.
As for Question #2: the rule is to pick only the columns that you need in your queries; if you need the Customer_ID, pick it from wherever is cheapest for the database.
Allow me to raise one more question:
Why do you have Customer_ID repeated in both Stores and Items when you can join them through the Main Inventory Data table? This is another redundancy.
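For example, the customer can be reached through a single join path instead of repeating Customer_ID (a sketch; InventoryData, Stores, Items and Customers are my shorthand for the tables described in the question):

-- Inventory by customer, store and item, picking the customer up via Stores
SELECT c.CustomerNumber
     , s.StoreNumber
     , i.ItemNumber
     , inv.InventoryUnits
FROM InventoryData AS inv
JOIN Stores    AS s ON s.Store_ID    = inv.Store_ID
JOIN Items     AS i ON i.Item_ID     = inv.Item_ID
JOIN Customers AS c ON c.Customer_ID = s.Customer_ID;

With this design, one join path to Customers is enough; joining it from several tables would only repeat the same lookup.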