Factless Fact Table, but with Facts? - sql

Problem: I am working with a SaaS company that provides monthly services. We are trying to create a data model to track customer-related metrics such as customer counts, signups, cancellations, and reactivations. I've done extensive research online, but the closest I've found is accumulating snapshots with start/end dates, which doesn't fit a SaaS company where a customer can reactivate an account.
My initial thought is to create a factless fact table for customers; however, this factless table would also have keys to event dimension tables (e.g. DimSignupType, DimCancellationType, DimReactivationType, etc.) and boolean measures for isSignup, isCancellation, and isReactivation. This feels counterintuitive because a factless fact table shouldn't have facts, but I need to track those, and I feel multiple fact tables would be worse because I would have to join them together in the view.
Is there a better approach to this problem?
Edit based on feedback: The main goal is to create a dimensional model that is maintainable, but also something I can build a view over, together with other dimension tables, that allows less technical users to discover insights with tools like Tableau. At the end of the day I need to provide a large flat view with multiple measures and dimensions that allows for easy analytical discovery. Common questions may be: "How many signups do we have MTD for this customer type vs last MTD?", "How many cancellations did we have due to Non-Payment this month compared to last?", "How many reactivations from Non-Payment did we have this month compared to last?", etc. A lot of this metadata comes from dimension tables I would join to the factless fact table on keys; however, it still requires a focus on Signups, Cancellations, and Reactivations being tracked as facts for reporting purposes. So I don't know the best modelling approach for it that abides by traditional standards. It almost seems like a snapshot fact table that contains keys to dimension tables describing events to be aggregated; I just don't know what that would be called.
I feel the most flexible solution in terms of data management and ease of use would be a factless fact table modeled in a daily snapshot manner, with "facts" for signups, cancellations, and reactivations that link to their type dimensions.
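A minimal sketch of that shape (all table and column names here are hypothetical, not taken from any specific source): an event-grain fact table whose 0/1 flags sum up like ordinary measures, while the type keys join out to the dimensions that describe each event.

    -- One row per customer event; the flags aggregate, the keys describe.
    CREATE TABLE FactCustomerEvent (
        DateKey             INT NOT NULL,  -- FK to DimDate
        CustomerKey         INT NOT NULL,  -- FK to DimCustomer
        SignupTypeKey       INT NOT NULL,  -- FK to DimSignupType (0 = not applicable)
        CancellationTypeKey INT NOT NULL,  -- FK to DimCancellationType (0 = not applicable)
        ReactivationTypeKey INT NOT NULL,  -- FK to DimReactivationType (0 = not applicable)
        IsSignup            TINYINT NOT NULL DEFAULT 0,
        IsCancellation      TINYINT NOT NULL DEFAULT 0,
        IsReactivation      TINYINT NOT NULL DEFAULT 0
    );

    -- Signups by month and customer type, the kind of slice the flat Tableau view needs:
    SELECT d.CalendarMonth, c.CustomerType, SUM(f.IsSignup) AS Signups
    FROM FactCustomerEvent f
    JOIN DimDate d     ON d.DateKey = f.DateKey
    JOIN DimCustomer c ON c.CustomerKey = f.CustomerKey
    GROUP BY d.CalendarMonth, c.CustomerType;

Whether this ends up as a transaction-grain event table or a true daily snapshot with the same flags mostly comes down to whether you also want one row per customer per day for counting active customers.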

Related

Data model guidance, database choice for aggregations on changing filter criteria

Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
It is an offline use case: there are no inserts, deletes, or updates; instead, every X weeks a new complete dump is imported to replace the old data.
We would like an upper bound of 10 seconds to answer our queries; the faster the better, and a maximum of 2 seconds per query would be great.
The actual data has around 100-200 million entries, and the growth rate is linear.
The system has to serve a limited number of concurrent users, at most 100.
Questions:
What would be the right database technology or mixture of technologies to solve our problem?
What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
(Bonus) What would be the estimated hardware requirements given a specific technology?
What we tried so far:
Setting up a document store with denormalized entries. Problem: It doesn't perform well on general queries because it has to scan too many entries for aggregations.
Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy-to-configure graphs and filters). BI tools are often used together with a data warehouse: the data is modeled, cleaned, and stored inside the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However, many companies (particularly smaller companies) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely a BI tool, you can de-normalize the data into a combined dataset, which would allow you a variety of filtering & calculation options, while still maintaining high performance and speed.
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the number of interactions, grouped and/or filtered by any of those columns.
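As a rough illustration (the table and column names are made up, and it assumes the flat dataset also carries person identifier columns), the second question could then be answered with a simple filtered aggregation:

    -- "Which are the friends that person X most interacted with on topic Y?"
    SELECT responding_person_id,
           COUNT(*) AS interaction_count
    FROM interactions
    WHERE initiating_person_id = 12345   -- person X
      AND topic = 'Y'
    GROUP BY responding_person_id
    ORDER BY interaction_count DESC;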
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)

Best way to track sales/inventory history for a POS system?

So, I'm writing a POS system, and I want it to be able to keep track of an inventory and generate reports based on past sales.
I'm pretty familiar with database design and that sort of thing, but I'm not quite sure how to approach this particular problem. The first thing I thought was to have tables that track item sales by day, week, month, and year, and then have the program keep track of how much time has elapsed so it knows when to reset these particular records. But now I'm thinking there's got to be a much simpler approach to it than that.
Another thing I thought of doing is to query the sales transaction table based on time stamps, but I'm not sure if that's a step in the right direction either.
I know that there are simpler ways of doing this for things like orders and order history with customers, but what about for the store itself, if they want to track how much product they've sold over the course of a week, month, year, etc? Is it a similar approach? Different? I can't really find anything that speaks to this particular problem.
I would go with your second thought - create a table for transactions with a timestamp, and use the timestamp to do reports (and partitions if necessary). If you know you will be querying by the timestamp very frequently, you can create an index on it to improve performance.
Whether you are tracking customer orders or store sales shouldn't make a difference in the design unless there is some major requirement difference.
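A minimal sketch of that suggestion (names are illustrative only): one sales transaction table keyed by a timestamp, an index on that timestamp, and reports that simply group by whatever period is wanted.

    CREATE TABLE SalesTransaction (
        TransactionId BIGINT        PRIMARY KEY,
        ItemId        INT           NOT NULL,
        Quantity      INT           NOT NULL,
        UnitPrice     DECIMAL(10,2) NOT NULL,
        SoldAt        DATETIME2     NOT NULL   -- partition on this column if volumes demand it
    );

    CREATE INDEX IX_SalesTransaction_SoldAt ON SalesTransaction (SoldAt);

    -- Units sold per item for a given month, derived on the fly from the timestamps:
    SELECT ItemId, SUM(Quantity) AS UnitsSold
    FROM SalesTransaction
    WHERE SoldAt >= '2024-01-01' AND SoldAt < '2024-02-01'
    GROUP BY ItemId;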
Will this be a system where store owners are autonomous or will it be a system with a load of POS terminals that report back to a central hub?
If this is for autonomous store owners you have to start worrying about things like backups and data archiving. Stuff that store owners don't really care about. If you look online you'll probably find some cloud providers that do all this POS stuff for store owners.
On the other hand the general design pattern for larger businesses I have seen is as follows:
On your POS terminals hold minimum required data that is needed at the POS terminal. Minimal reporting is required at the terminal.
Replicate all POS data to a central database server that keeps and merges the data from all the different POS terminals. This is your detailed operational reporting. Once data is replicated here, it can be deleted from the terminal.
Often the store guys aren't too interested in the longer trends but it depends on the business.
Now you can run a report by month or year off the central database server (as can your store owners) and just summarise up to month/year in place. At this point there is no need to create summary tables.
Eventually you'll run into performance issues as data size increases.
The answer to this is not to build summary tables, because then your user / reporting system gets complicated by having to pick the correct table.
The answer is to apply standard performance tuning techniques such as:
Improving server hardware (just adding RAM is often the most cost-effective option)
Adding Indexes (including indexed views)
Implementing partitioning
Consider using cubes for reporting
If this is not sufficient you might then want to consider the overhead of batch jobs that populate summary tables. But again Indexed Views can cover this off to a limited extent without requiring summary tables.
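For example (SQL Server syntax, illustrative names, reusing the sales-transaction sketch from the earlier answer), an indexed view can maintain a per-day summary automatically without introducing a separate summary table:

    CREATE VIEW dbo.vSalesByDay
    WITH SCHEMABINDING
    AS
    SELECT CAST(SoldAt AS DATE) AS SaleDate,
           ItemId,
           SUM(Quantity) AS UnitsSold,
           COUNT_BIG(*)  AS RowCnt    -- required when an indexed view uses GROUP BY
    FROM dbo.SalesTransaction
    GROUP BY CAST(SoldAt AS DATE), ItemId;

    -- Materialises the view; daily and monthly rollups can now be answered from it.
    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay ON dbo.vSalesByDay (SaleDate, ItemId);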
You need to understand data sizes, growth and report requirements before considering any design options such as summary tables.

Dimensional and cube development data models

I have created a dimensional model which is similar in structure to the financial reporting design in the AdventureworksDW environment, where the value of each account is held as a single value column in the fact table and the dimensions give the data its semantic meaning.
There are over a thousand columns in this model, so it works well for adding or deleting columns. Here is a really good blog on this design: http://garrettedmondson.wordpress.com/2011/10/26/dimensional-modeling-financial-data-in-ssas/
Although this model works well for querying the dimensional model, and there are examples supporting this model for dimensional analysis, I'm concerned that this model is not standard for cube development or data mining which seem to prefer wider tables.
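To make that structure concrete, a stripped-down sketch of the design (hypothetical names, loosely mirroring AdventureWorksDW's FactFinance); adding a new account/measure is just another row in DimAccount plus new fact rows, with no schema change:

    CREATE TABLE FactFinance (
        DateKey         INT NOT NULL,           -- FK to DimDate
        OrganizationKey INT NOT NULL,           -- FK to DimOrganization
        AccountKey      INT NOT NULL,           -- FK to DimAccount: gives the value its meaning
        Amount          DECIMAL(19,4) NOT NULL  -- the single value column
    );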
Questions:
Is this design categorized as Entity-Attribute-Value (EAV)?
Would a design using multiple fact tables be better? So many wide fact tables (up to 10) with up to 200-300 columns each, but fewer rows.
Should I expect more performance issues with the much wider tables?
You are right, that specific design is considered an EAV model.
With such a design you can easily add new accounts, hierarchies, etc. You don't need to update your model.
I would not recommend the one-column-per-measure approach. Most accounts will be null in most of the rows. Also, with such a design you need to read all of your measures even if you only need to retrieve one of them.
We heavily use account dimension in our cubes. Unfortunately things like shared members are not easy to handle in SSAS like in Essbase.
You need to create an Account dimension which is parent-child and also you need to have the key of this account dimension in the fact table as usual.
By using the account dimension, you get nice support for time balance functionality. The time balance functionality of SSAS is supposed to be faster than custom MDX code.
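A minimal sketch of that setup (column names are illustrative): a parent-child account dimension, with the fact table carrying the leaf-level account key as usual.

    CREATE TABLE DimAccount (
        AccountKey       INT PRIMARY KEY,
        ParentAccountKey INT NULL,            -- self-reference makes the dimension parent-child
        AccountName      NVARCHAR(100) NOT NULL,
        UnaryOperator    CHAR(1) NULL         -- e.g. '+' or '-' to control rollup behaviour
    );

    -- FactFinance(DateKey, AccountKey, Amount, ...) references DimAccount(AccountKey).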
We are converting unary operators and parent-child relationships to formulas at the moment.
So basically we have normal formulas, and parents in hierarchies also work as formulas.
At the end we are flattening the hierarchy. So it is not possible to drill down in account dimension. We are using account dimension as a calculation engine only.
It is possible to have proper hierarchies as well, but we decided not to mix custom rollup members and unary operators at the same time.
Shared members and all of our formulas are implemented as custom rollup members.

Thoughts on dimension measures for BI

I am working with a consultant who recommends creating a measure dimension and then adding the measure dimension key to our fact table.
I can see how this can make adding new measures easier, by just adding rows instead of physically creating columns in the fact table. I can also see how this adds work to the ETL process, adds another join to the star schema, requires one generic column in the fact table to hold all measure data, etc.
I'm interested in how others have dealt with this situation. We currently have close to twenty measures.
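For reference, the two shapes being compared (all names hypothetical): the consultant's proposal stores one row per measure value, described by a measure dimension, while the conventional design keeps one column per measure.

    -- Measure-dimension (EAV-style) proposal:
    CREATE TABLE DimMeasure (
        MeasureKey  INT PRIMARY KEY,
        MeasureName VARCHAR(100) NOT NULL
    );

    CREATE TABLE FactSales_MeasureDim (
        DateKey    INT NOT NULL,
        ProductKey INT NOT NULL,
        MeasureKey INT NOT NULL,            -- FK to DimMeasure
        Value      DECIMAL(19,4) NOT NULL   -- one generic column holds every measure
    );

    -- Conventional alternative: one column per measure.
    CREATE TABLE FactSales_Wide (
        DateKey        INT NOT NULL,
        ProductKey     INT NOT NULL,
        SalesAmount    DECIMAL(19,4) NOT NULL,
        QuantitySold   INT NOT NULL,
        DiscountAmount DECIMAL(19,4) NOT NULL
    );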
Instinctively, I don't like it: it's the EAV model, which is not very popular (you can Google the reasons why).
The EAV model is generally considered to be a headache to query and maintain
Different measures go together with different dimensions; this approach could easily turn into "one giant fact table for everything" instead of multiple smaller fact tables for specific reporting areas
I suspect you would end up creating views to give the appearance of multiple fact tables anyway (a sketch of such a view follows below)
You will multiply the number of rows in your fact table by the number of measures, resulting in a much bigger physical table
Even with a good indexing/partitioning scheme, queries that include more than one measure will have to read a lot more rows to get the data
What about measures with different data types?
Is this easily supported in your reporting tool?
I'm sure there are other issues, but those are the ones that come to mind immediately. As a rule of thumb, if someone suggests an EAV implementation in any context, you should be very wary and ask them exactly what advantages it offers and how it will be managed as the data and complexity increase. But I think you've already identified some key areas of concern.
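To illustrate the point about views: a sketch (continuing the hypothetical names above) of a view that pivots the generic Value rows back into something that looks like a conventional fact table.

    CREATE VIEW vFactSales AS
    SELECT f.DateKey,
           f.ProductKey,
           SUM(CASE WHEN m.MeasureName = 'SalesAmount'    THEN f.Value END) AS SalesAmount,
           SUM(CASE WHEN m.MeasureName = 'QuantitySold'   THEN f.Value END) AS QuantitySold,
           SUM(CASE WHEN m.MeasureName = 'DiscountAmount' THEN f.Value END) AS DiscountAmount
    FROM FactSales_MeasureDim f
    JOIN DimMeasure m ON m.MeasureKey = f.MeasureKey
    GROUP BY f.DateKey, f.ProductKey;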
SSAS will do this, and I know of a major vendor of insurance policy administration software that provided a M.I. solution for their system that works like this. You do get some flexibility from the approach in that you can add measures without having to deploy a build of the cube, although for 20 measures I don't think you need to worry about that.
'Measures' is essentially another dimension (and often referred to as such in the documentation). I believe SSAS uses a largely column-oriented structure behind the scenes.
However, a naive application of this approach does have some issues that could come and bite you to a greater or lesser extent.
You only have one measure, [Value], [Amount] or whatever it's called. If your tool won't let you inject calculated measures at the front-end then you can't sort the whole data set on the value of one of your attribute types. ProClarity and report builder >=2.0 will do this but Excel won't.
You can't do ratios or other calculated measures in this way. You will have to either embed them in the cube script (meaning you need to deploy a build to add them) or use a tool that lets you define them in the client.
Although it doesn't make a lot of difference to the cube, it will be slow to query on the database side and will increase storage requirements. It's also fiddly to query directly against the database.

Efficient Ad-hoc SQL OLAP Structure

Over the years I have read a lot of people's opinions on how to get better performance out of their SQL (Microsoft SQL Server, just so we are all on the same page...) queries. However, they all seem to be tightly tied to either a high-performance OLTP setup or a data warehouse OLAP setup (cubes-galore...). However, my situation today is kind of in the middle of the 2, hence my indecision.
I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating to just the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating to just the sites. Basically it's a pretty big flattened table or view. My problem is that a good portion of these columns are available to be used in ad-hoc queries by the user, and the queries need to run as quickly as possible because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast. This data is read-only; the data doesn't change at all during the day and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation as an OLAP database with the read requirements of an OLTP database.
I see 3 options: 1. break the table into smaller divisible units and sub-query everything; 2. make one flat table and really go to town on the indexing; 3. create an OLAP cube and sub-query the rest based on whatever filter values I don't put as the cube dimensions. I have not done much with OLAP cubes, so I frankly don't even know if that is an option, but from what I've done with them in the past I think it might be. Also, just to clarify: what I mean when I say "sub-query everything" is that instead of having a WHERE clause on the outer select, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed, to eliminate a really large Cartesian product. As for the second option of one large table, I have heard and seen conflicting results with that approach: it saves on joins, but at the same time a table scan takes much longer.
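To illustrate the "sub-query everything" idea (the filter column names are just placeholders): each side is filtered in its own derived table first, and only the already-reduced sets are joined.

    SELECT s.SiteId, c.ContactId
    FROM (SELECT ContactId FROM Contacts WHERE State = 'TX') AS c
    INNER JOIN SiteContacts sc ON sc.ContactId = c.ContactId
    INNER JOIN (SELECT SiteId FROM Sites WHERE Region = 'South') AS s
        ON s.SiteId = sc.SiteId;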
Ideas anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea if that’s the case, I’m new to that stuff too.
Thanks in advance to any and all opinions and help with this dilemma I’ve found myself in.
You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or, a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.
In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though. The fact may simply be the association of sites to contacts.
The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
It wouldn't be a very big problem to add numerous indexes to your architecture. The database is mostly read only, except for the refresh time. You won't have to worry about read performance while indexes are being updated. So, the architecture can accommodate all indexes that are needed (as long as you can dedicate enough downtime to refresh the data).
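A rough sketch of what that star could look like here (names invented, attributes trimmed; the "fact" is just the site-contact association, so counting rows is the measure):

    CREATE TABLE DimSite (
        SiteKey INT PRIMARY KEY,
        Region  VARCHAR(50)
        -- plus the ~150 descriptive site attributes
    );

    CREATE TABLE DimContact (
        ContactKey INT PRIMARY KEY,
        State      CHAR(3)
        -- plus the ~50 descriptive contact attributes
    );

    CREATE TABLE FactSiteContact (
        SiteKey    INT NOT NULL REFERENCES DimSite (SiteKey),
        ContactKey INT NOT NULL REFERENCES DimContact (ContactKey)
        -- no numeric measures: the row itself is the fact
    );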
I agree with bobs' answer: throw an OLAP front end on it and query through the cube. The reason this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates by multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.
The relational data underneath the cube will be great for detail drill-ins to find the individual facts that make up a given aggregate value. But querying the relational data directly will always be slow, because the aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.
OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.
What are the most common queries? For single pieces of data or aggregates?
If the granularity of SiteContacts is pretty close to that of Contacts (i.e. circa 3 million records, with most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).
On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.
OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.
Star schemas consist of fact tables with dimensions hanging off them - depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions with a factless fact table (sounds like an oxymoron, but is covered in Kimball's methodology) linking them.