CUBE dimension reduction with sequential slices - SQL

I have data in a cube, organized across 5 axes:
Source (data provider)
GEO (country)
Product (A or B)
Item (Sales, Production, Sales_national)
Date
In short, I have multiple data providers for different Product, Item, GEO and Date, i.e. for different slices of the cube.
Not all "sources" cover all dates, products and countries. Some will have more up-to-date information, but it will be preliminary.
The core of the problem is to have a synthesis of what all sources say.
Importantly, the choice of data provider for each "slice" of the cube is made by the user/analyst and needs to stay that way (business knowledge of provider methodology, quality, etc.).
What I am looking for is a way to create a 'central dictionary' with all the calculation types.
Such a dictionary would be organized like this:
Operation | Source | GEO | Item | Product | Date_start | Date_end
Assign | Source3 | ITA | Sales | Product_A | 01/01/2016 | 01/01/2017
Assign | Source1 | ITA | Sales | Product_A | 01/01/2017 | last
Assign with %delta | Source2 | ITA | Sales | Product_A | 01/01/2018 | last
This means:
From Jan 2016 to Jan 2017, for Product A Sales in Italy, take Source 3.
From Jan 2017 to the last available date, take Source 1.
From Jan 2018 to the last available date, take the existing values and apply the % change over time from Source 2.
The data and calculations are examples; there are others more complex, but the gist of it is putting slices of the 5-dimensional "Source" cube into a 4-dimensional "Target" cube, with a set of sequential calculations.
In SQL, it is the equivalent of a bunch of filtered SELECTs + INSERTs, but the complexity of the calculations will probably lead to lots of nested JOINs.
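To make the SQL analogy concrete, here is a minimal sketch of the 'Assign' rules applied as a filtered SELECT + INSERT. The table names SourceCube, TargetCube and Rules are just placeholders, and 'last' is modelled as a NULL Date_end:

-- Hedged sketch only; SourceCube, TargetCube and Rules are assumed names.
INSERT INTO TargetCube (GEO, Product, Item, Date, Value)
SELECT s.GEO, s.Product, s.Item, s.Date, s.Value
FROM SourceCube s
INNER JOIN Rules r
    ON  r.Operation = 'Assign'
    AND s.Source    = r.Source
    AND s.GEO       = r.GEO
    AND s.Item      = r.Item
    AND s.Product   = r.Product
    AND s.Date     >= r.Date_start
    AND (r.Date_end IS NULL OR s.Date <= r.Date_end);  -- NULL Date_end = "last"

The 'Assign with %delta' rules would then be follow-up UPDATEs joining the target back to the chosen source's period-over-period changes, which is exactly where the nested JOINs start to pile up.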
The solution will most likely be custom functions, but I was wondering if anyone is aware of a language or software other than DAX/MDX that would allow doing this with minimal customization?
Many thanks

Related

How to use normalization to set levels of confidence between a rating and the number of ratings in Python or SQL?

I have a list of about 800 sales items that have a rating (from 1 to 5) and the number of ratings. I'd like to list the items that are most likely to have a "good" rating in an unbiased way, meaning that 1 person voting 5.0 isn't nearly as good as 50 people having voted and the item's rating being 4.5.
Initially I thought about getting the smallest number of votes (which will be zero 99% of the time) and the highest number of votes for an item on the list and factoring that into the ratings, giving me a confidence level of 0 to 100%; however, I'm thinking that this approach would be too simplistic.
I've heard about Bayesian probability but I have no idea on how to implement it. My list of items, ratings and number of ratings is on a MySQL view, but I'm parsing the code using Python, so I can make the calculations on either side (but preferably at the SQL view).
Is there any practical way that I can normalize this voting with SQL, considering the rating and number of votes as parameters?
|----------|--------|--------------|
| itemCode | rating | numOfRatings |
|----------|--------|--------------|
| 12330 | 5.00 | 2 |
| 85763 | 4.65 | 36 |
| 85333 | 3.11 | 9 |
|----------|--------|--------------|
I've started off trying to assign percentiles to the rating and numOfRatings, this way I'd be able to do normalization (sum them with an initial 50/50 weight). Here's the code I've attempted:
SELECT p.itemCode AS itemCode,
       (p.rating - MIN(p.rating)) / (MAX(p.rating) - MIN(p.rating)) AS percentil_rating,
       (p.numOfRatings - MIN(p.numOfRatings)) / (MAX(p.numOfRatings) - MIN(p.numOfRatings)) AS percentil_qtd_ratings
FROM products p
WHERE p.available = 1
GROUP BY p.itemCode
However that's only bringing me a result for the first itemCode on the list, not all of them.
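(A sketch of what I'm actually trying to compute, with the MIN/MAX taken once over the whole table rather than per itemCode group, would be something like the following, assuming the same MySQL products table:)

-- Sketch: global min-max scaling; NULLIF guards against a zero range.
SELECT p.itemCode,
       (p.rating - s.min_rating) / NULLIF(s.max_rating - s.min_rating, 0) AS percentil_rating,
       (p.numOfRatings - s.min_qty) / NULLIF(s.max_qty - s.min_qty, 0) AS percentil_qtd_ratings
FROM products p
CROSS JOIN (
    SELECT MIN(rating) AS min_rating, MAX(rating) AS max_rating,
           MIN(numOfRatings) AS min_qty, MAX(numOfRatings) AS max_qty
    FROM products
    WHERE available = 1
) s
WHERE p.available = 1;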
Clearly the issue here is the low number of observations your data has. Implementing a Bayesian method is the way to go because it provides a sound probability distribution for applications involving ratings, especially when there are few observations, and it lets you weigh an item's own rating against the overall average using a few tuneable parameters (this article provides an excellent explanation of Bayesian probability for beginners).
I would suggest storing your data in CSV files so it becomes easier to manipulate in Python. Denormalizing the data via joins is the first task to do before analyzing your ratings.
This is Bayesian's simplified formula to use in your python code:
R – confidence level, a.k.a. the number of observations
v – number of votes for a single product
C – average vote for all products
m – tuneable parameter, a.k.a. the cutoff number of votes required for an item to be considered (how many votes you want an item to have before it is displayed)
Since this is the simplified formula, this article explains how it has been derived from its original form. This article is also helpful in explaining the parameters.
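As a rough sketch of what such a weighted rating could look like directly in SQL, assuming the commonly used form WR = (v / (v + m)) * R + (m / (v + m)) * C with R taken as the item's own average rating and an arbitrary cutoff of m = 10:

-- Hedged sketch (MySQL), using the question's products(itemCode, rating, numOfRatings, available).
-- The formula shape and m = 10 are assumptions for illustration.
SELECT p.itemCode,
       (p.numOfRatings / (p.numOfRatings + 10)) * p.rating
     + (10 / (p.numOfRatings + 10)) * c.avg_rating AS weighted_rating
FROM products p
CROSS JOIN (SELECT AVG(rating) AS avg_rating FROM products WHERE available = 1) c
WHERE p.available = 1
ORDER BY weighted_rating DESC;

Items with only a handful of votes get pulled towards the overall average, while items with many votes keep a rating close to their own.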
Knowing the formula pretty much gets 50% of your work done; the rest is just importing your data and working with it. I've provided examples similar to your problem below in case you need a full demonstration:
Github example 1
Github example 2

Choosing a value based on a ranking of another column

I have decent Google-Fu skills, but they've utterly failed me on this.
I'm working in PowerPivot and I'm trying to match up a product with a price point in another table. Sounds easy, right? Well, each product has several price points, based on the source of the price, with a hierarchy of importance.
For Example:
Product 1 has three prices living in a Pricing Ledger:
Price 1 has an account # of A22011
Price 2 has an account # of B22011
Price 3 has an account # of C22011
Price 1 (account A22011) overrides Price 2 (B22011), which overrides Price 3 (C22011).
What I want to do is be able to pull the most relevant price (i.e., the one with the highest rank in the hierarchy) when not all price points are being used.
I'd originally used a series of IF statements, but that's when there were only four points. We now have ten points, and that might grow, so the IF statements are an untenable solution.
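In SQL terms (not PowerPivot/DAX), the logic I'm after would be something like the sketch below, assuming a hypothetical PricingLedger(ProductId, AccountNo, Price) table and a PriceSourceRank(AccountPrefix, Priority) lookup where a lower Priority number means higher importance:

-- Hedged sketch with assumed table names: keep, per product, the price whose
-- source has the best (lowest) priority actually present in the ledger.
SELECT pl.ProductId, pl.Price
FROM PricingLedger pl
INNER JOIN PriceSourceRank r
    ON r.AccountPrefix = LEFT(pl.AccountNo, 1)
WHERE r.Priority = (
    SELECT MIN(r2.Priority)
    FROM PricingLedger pl2
    INNER JOIN PriceSourceRank r2
        ON r2.AccountPrefix = LEFT(pl2.AccountNo, 1)
    WHERE pl2.ProductId = pl.ProductId
);

The idea is that a rank column plus "take the best rank present" replaces the chain of IF statements and keeps working as the number of price points grows.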
I'd appreciate any help.
-Thanks-

How to create Deltas in bigquery

I have a table in BQ which I refresh on daily basis. It's a full snapshot every day.
I have a business requirement to create deltas of that feed.
Table Details :
Table contains 10 columns
Out of the 10 columns, 5 change on a daily basis. How do I identify which columns changed and create a snapshot of only the changed data?
For example, here are the columns in tableA (the columns which will frequently change are in bold):
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - ebook
last_product_purchase_date - 2018-05-01
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - null
third_product_purchase_date - null
fourth_product - null
fourth_product_purchase_date - null
After more purchase Data will look like this:
Custid - ABC
first_product - toy
first_product_purchase_date - 2015-01-01
last_product - Hardbook
last_product_purchase_date - 2018-05-17
second_product - Magazine
second_product_purchase_date - 2016-01-01
third_product - CD
third_product_purchase_date - 2017-01-01
fourth_product - null
fourth_product_purchase_date - null
first_product = first product ever purchased
last_product = most recent product purchased
This is just one row of records for one customer. I have millions of customers with all these columns, and let's say half a million rows will be updated on a daily basis.
In my delta, I just want the rows where any of the column values changed.
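(One hedged way of expressing that in BigQuery standard SQL, assuming the current and previous snapshots sit in two tables with identical schemas; the table names are made up:)

-- Rows present in today's snapshot that do not match any row in yesterday's,
-- i.e. new or changed rows.
SELECT * FROM `myproject.mydataset.snapshot_today`
EXCEPT DISTINCT
SELECT * FROM `myproject.mydataset.snapshot_yesterday`;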
It seems like you have a column for each product bought and its repetitions; perhaps this comes from a de-normalized dimensional model. To query the last "update" you would have to compare each column with the previous row using the LEAD function. That would use a lot of computation and might not be optimal.
I recommend using repeated fields. The product and product_purchase_date would be repeated fields, and you could simply query using WHERE product_purchase_date = CURRENT_DATE(), which would use much less computation.
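A hedged sketch of what querying such a repeated field could look like in standard SQL (the table and field names are assumptions):

-- Assumed schema: customers(custid STRING, products ARRAY<STRUCT<product STRING, purchase_date DATE>>)
SELECT c.custid, p.product, p.purchase_date
FROM `myproject.mydataset.customers` AS c,
     UNNEST(c.products) AS p
WHERE p.purchase_date = CURRENT_DATE();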
De-normalized dimensional models are meant to reduce computation on traditional data warehouses. BigQuery, being a fast, highly scalable enterprise data warehouse, has a lot of computing power.
To get a better understanding of how BigQuery works under the hood, I recommend reviewing this document.

Optimal selection for ordering multiple items (parts) from multiple suppliers (vendors)

The task here is to define the optimal (as detailed below) way of ordering items (parts) from suppliers.
The relevant parts of the table schema (with some sample data) are
Items
ID NUMBER
1 Item0001
2 Item0002
3 Item0003
Suppliers
ID NAME DELIVERY DISCOUNT
1 Supplier0001 0 0
2 Supplier0002 0 0.025
3 Supplier0003 20 0
DELIVERY is the delivery charge (in dollars) levied by that supplier on each delivery. DISCOUNT is the settlement discount (as a percentage i.e. 2.5% for ID=2 above) allowed by that supplier for on time payment.
SupplierItems
SUPPLIER_ID ITEM_ID PRICE
1 2 21.67
1 5 45.54
1 7 32.97
This is the many-to-many join between suppliers and items with the price that supplier charges for that item (in dollars). Every item has at least 1 supplier but some have more than one. A supplier may have no items.
PartsRequests
ID ITEM_ID QUANTITY LOCATION_ID ORDER_ID
1 59 4 2 (null)
2 89 5 2 (null)
3 42 4 2 (null)
This table is a request from a field site for parts to be ordered and delivered by the supplier to that site. A delivery of any number of items to a site attracts a delivery charge. When the parts are ordered, the ORDER_ID is inserted into the table, so we are only concerned with rows where ORDER_ID IS NULL.
The question is: what is the optimal way to order these parts for each LOCATION, where there are 3 optimal solutions that need to be presented to the user for selection?
The combination of orders with the least number of suppliers
The combination of orders with the lowest total cost, i.e. the sum of QUANTITY * PRICE for each item plus the DELIVERY charge for each order, summed over all orders, ignoring DISCOUNT
As item 2, but accounting for DISCOUNT
Clearly I need to determine the combinations of orders that are available, and then determining the optimal ones becomes trivial, but I am a bit stuck on an efficient way to build the combinations.
I have built some SQL fiddles in SQL Server 2008 with random data. This one has 100 items, 10 suppliers and 100 requests. This one has 1000 items, 50 suppliers and 250 requests. The table schema is the same.
Update
I reasoned that the solution had to be recursive and built a nice table-valued function, but I ran into the hard limit of 32 levels of recursion in SQL Server. I was uncomfortable with it anyway because it hinted more at a procedural-language solution than an RDBMS one.
So I am now playing with CTE recursion.
The root query is:
SELECT DISTINCT
    '' AS SOLUTION_ID,
    LOCATION_ID,
    SUPPLIER_ID,
    (subquery I haven't quite worked out) AS SOLE_SUPPLIER
FROM PartsRequests pr
INNER JOIN SupplierItems si ON pr.ITEM_ID = si.ITEM_ID
WHERE pr.ORDER_ID IS NULL
This gets all the suppliers that can supply the required items and is certainly a solution, probably not optimal. The subquery sets a flag if the supplier is the sole supplier of any product required for that location; if so they must be part of any solution.
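One possible shape for that subquery, untested against the fiddles, is sketched below:

-- Sketch: SOLE_SUPPLIER = 1 if this supplier is the only possible source for at
-- least one still-unordered item requested at this location.
SELECT DISTINCT
    '' AS SOLUTION_ID,
    pr.LOCATION_ID,
    si.SUPPLIER_ID,
    CASE WHEN EXISTS (
        SELECT 1
        FROM PartsRequests pr2
        WHERE pr2.LOCATION_ID = pr.LOCATION_ID
          AND pr2.ORDER_ID IS NULL
          AND EXISTS (SELECT 1 FROM SupplierItems si2
                      WHERE si2.ITEM_ID = pr2.ITEM_ID
                        AND si2.SUPPLIER_ID = si.SUPPLIER_ID)
          AND (SELECT COUNT(*) FROM SupplierItems si3
               WHERE si3.ITEM_ID = pr2.ITEM_ID) = 1
    ) THEN 1 ELSE 0 END AS SOLE_SUPPLIER
FROM PartsRequests pr
INNER JOIN SupplierItems si ON pr.ITEM_ID = si.ITEM_ID
WHERE pr.ORDER_ID IS NULL;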
The recursive part is to remove suppliers one by one by means of CTE.SUPPLIER_ID <> CTE.SUPPLIER_ID and add them if the remaining suppliers still cover all the items. The SOLUTION_ID will be a CSV list of the suppliers removed, partly to uniquely identify each solution and partly to check against so I get combinations instead of permutations.
Still working on the details, the purpose of this update was to allow the Community to say "Yay, looks like that will work" or, alternatively "You moron, that won't work because ..."
Thanks
This is a more general answer (as in, not SQL), as I think solving this problem will require something more powerful. Your first scenario is to select a minimum number of suppliers. This can be seen as a set cover problem, as you are trying to cover all demands per site with the suppliers. That problem is already NP-complete.
Your third scenario seems to be basically the same as the second. You just have to take the discount into account in the prices, assuming you pay on time for every order.
The second scenario is at least NP-hard as I see a lot of resemblance with the facility location problem. You are trying to decide which suppliers (facilities) to use (open) to cover your orders (demands) based on their prices and delivery costs (opening costs).
Enumerating your possible solutions seems infeasible, as with 10 suppliers you already have 2^10 possible subsets of them, further complicated by how the demands are distributed internally.
I would suggest some dynamic programming to first select the suppliers that you have to use (i.e. they are the only ones that deliver a specific item), eliminating some possibilities (e.g. if the cost for supplier A + delivery cost for A < the cost for supplier B), and then trying to expand your set of possible solutions. Linear programming is also a valid train of thought.

How should you separate dimension tables from fact tables if you are not building a data warehouse?

I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse the categorization that I use in this post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments; however, there are others with similar problems. First, I need to store the available values for these tables. This will provide the available values in the application when managing an employee, and allow management of these values when adding/deleting departments and such. For instance, the Locations table may look like this:
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally find the use of start/end dates perfectly easy for people to understand. It allows Agent and Location fact tables, and then time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have issues worth taking note of.
If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August?
Dealing with overlapping date periods can give messy queries and, more importantly, slow ones.
The first one seems simply a matter of convention, but it can have certain implications when dealing with other data. For example, consider that an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (such as when they actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If you consider that the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact you don't need to know whether the date is DAY bound, or whether it can include a time component or not. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
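As a hedged illustration (the table and column names below are assumed, modelled on the examples in the question):

-- With an EXCLUSIVE EndDate the event-level join never needs a "+ 1 day" adjustment.
SELECT el.EmployeeId, el.LocationId, d.EventTimeStamp
FROM EmployeeLocations el
INNER JOIN AgentDailyData d
    ON  d.EmployeeId = el.EmployeeId          -- assumed join key
    AND d.EventTimeStamp >= el.EffectiveDate  -- inclusive start
    AND d.EventTimeStamp <  el.EndDate;       -- exclusive end, no edge cases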
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
    CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
    CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
    TableA
INNER JOIN
    TableB
        ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
        AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index. But it could give a massive range of dates. And then, for each of those records, the ExclusiveFrom dates could be just about anything, and certainly not in an order that could help quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But instead we've limited the number of rows that can be returned by searching on TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2007-05-01
1 | 2 | 2007-05-01 | 2007-06-01
1 | 2 | 2007-06-01 | 2007-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade-off: using disk space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively I initially rebelled against this. But in subsequent ETL, Warehousing, Reporting, etc, I actually found it Very powerful, adaptable, and maintainable. I actually saw people making fewer coding mistakes, writing code in less time, the code ending up running faster, and being much more able to adapt to clients' changing needs.
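As a hedged sketch of why (again with assumed names), reporting joins collapse to a plain equality on the day, with no range or overlap logic at all:

-- One row per employee per day means no date-range comparisons anywhere.
SELECT m.EmployeeId, m.LocationId, d.EventTimeStamp
FROM EmployeeLocationByDay m
INNER JOIN AgentDailyData d
    ON  d.EmployeeId = m.EmployeeId
    AND CAST(d.EventTimeStamp AS DATE) = m.EffectiveDate;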
The only two downsides were:
1. More disk space taken (but trivial compared to the size of the fact tables)
2. Inserts and updates to this mapping were slower
The actual slowdown for inserts and updates only really mattered once, where this model was being used to represent a constantly changing process net; the app wanted to change the mapping about 30 times a second. Even then it worked, it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity; if you make corrections to that entity, that's fine, but don't re-use an ID for a new Location. Set it up so that instead of deleting a Location you mark it as deleted with a bit flag and hide it from the interface; that way, when it's referenced historically, it's still there.
Create a history table that includes the current value, or no record if a value isn't currently set. Have foreign keys tying it back to the employee and to the location.
Create a column in the employee table that points to the currently active row in the history table. When you need to get the employee's location, you join to the history table based on this ID. When you need to get all of the history for an employee, you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word history goes, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps the old items around. As such you can name it something like EmployeeLocations.
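A minimal DDL sketch of that layout (SQL Server flavour; every name other than EmployeeLocations, including the Employees table, is an assumption):

-- Lookup table: soft-delete with a bit flag instead of removing rows.
CREATE TABLE Locations (
    LocationId   INT IDENTITY(1,1) PRIMARY KEY,
    LocationName VARCHAR(100) NOT NULL,
    IsDeleted    BIT NOT NULL DEFAULT 0
);

-- Junction/history table: one row per employee-location assignment.
CREATE TABLE EmployeeLocations (
    EmployeeLocationId INT IDENTITY(1,1) PRIMARY KEY,
    EmployeeId         INT NOT NULL REFERENCES Employees(EmployeeId),
    LocationId         INT NOT NULL REFERENCES Locations(LocationId),
    EffectiveDate      DATE NOT NULL
);

-- The employee row points at its current history row, so "current location"
-- is a plain join with no date comparison.
ALTER TABLE Employees
    ADD CurrentEmployeeLocationId INT NULL
        REFERENCES EmployeeLocations(EmployeeLocationId);

SELECT e.EmployeeId, l.LocationName
FROM Employees e
INNER JOIN EmployeeLocations el ON el.EmployeeLocationId = e.CurrentEmployeeLocationId
INNER JOIN Locations l ON l.LocationId = el.LocationId;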