I have a network of data acquisition systems deployed around the country. Each system is measuring the performance of a different building. Each system makes temperature, flow, speed, energy, and other types of measurements. The sampling rate varies from system to system; it can be as low as 5 seconds and as high as 15 minutes.
My objective is to store all the data in an SSAS database for analysis purposes. However, I'm struggling to identify the appropriate grain and the associated fact table and SSAS measure/measure group design.
My first question is: do any of the following have the same grain?
Data values with different units, such as temperature and energy values
Data values with different sampling rates, such as temperature values sampled every minute and temperatures sampled every 15 minutes
Data values recorded from different buildings, such as the indoor temperature at Building #1 and the indoor temperature recorded at Building #2
Additive data values (such as energy) and non-additive data values (such as temperature)
Two different measurements with the same units at a particular building, such as first floor temperature and second floor temperature.
If most or all of these examples do not have the same grain, does this mean that each measurement at each building should have a separate fact table in the relational data warehouse and a separate measure in the SSAS database? If yes, then I'd be looking at 1,000 measures if we assume that I am studying 10 buildings with 100 measurements each. This doesn't seem right, but it also seems that I'm dealing with a data model that has many different measure grains.
Almost all of the examples I have found are related to finance or retail, which involve measures that are obviously additive or countable, such as dollar amounts or item counts. Therefore, the examples haven't helped much.
Data values with different units can be at the same grain if they are collected at the same intersection of all dimensions (time, building, etc.). In your example, the temperature and energy values would be separate columns in the fact table. Using a single fact column to represent measurements with different units is not recommended.
Data values with different sampling rates would not be at the same grain if you need to be able to report on both intervals simultaneously. However, if you can standardize your reports on the finer-grained interval, you could simply replicate the coarser-grained values into the finer-grained rows. For example, you could have a TemperaturePerMinute column and a TemperaturePer15Minute column, with the 15-minute interval value duplicated across all of the minute values within that interval. You would need to be careful when using aggregations (count, etc.) when reporting on the coarse-grained column, though. In addition, if you need to drill down to the sample level, you might need to use a separate fact table for each interval type.
Data values from different buildings could certainly be recorded in a single fact table at the same grain. You would simply have a Building dimension that you link to your fact table for each measurement sample.
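As a minimal sketch of this (the table and column names are hypothetical, not from the question), the fact table holds one row per building per sample time, with each unit of measure in its own column:

```sql
-- Hypothetical fact table: one row per building per sample time.
-- Different units live in separate columns; buildings share the table via a dimension key.
CREATE TABLE FactBuildingMeasurement (
    DateKey      INT NOT NULL REFERENCES DimDate (DateKey),
    TimeKey      INT NOT NULL REFERENCES DimTime (TimeKey),
    BuildingKey  INT NOT NULL REFERENCES DimBuilding (BuildingKey),
    TemperatureC DECIMAL(9,3)  NULL,  -- non-additive: aggregate with AVG/MIN/MAX in SSAS
    EnergyKWh    DECIMAL(18,3) NULL   -- additive: aggregate with SUM
);
```

In SSAS, TemperatureC and EnergyKWh would then become two measures in the same measure group, each with its own aggregation function.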
In an OLAP model, measures are usually independent of each other, so additive and non-additive values can be represented at the same grain simply as separate columns.
In this case, you would extend your Building dimension to include a Floor hierarchy and link your fact table records at the Floor grain. However, if you need to be able to link some samples to a Building and some samples to a Floor, then you may need to use a parent-child dimension (perhaps a Location dimension abstracting Buildings and Floors), to which you could link your fact table to either a Building or a Floor.
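A rough sketch of that parent-child Location idea (again, hypothetical names): a Floor row points at its Building row, so a fact record can reference either level through the same key.

```sql
-- Hypothetical parent-child Location dimension abstracting Buildings and Floors.
CREATE TABLE DimLocation (
    LocationKey       INT IDENTITY(1,1) PRIMARY KEY,
    ParentLocationKey INT NULL REFERENCES DimLocation (LocationKey),  -- NULL for a Building
    LocationType      VARCHAR(20)  NOT NULL,  -- 'Building' or 'Floor'
    LocationName      VARCHAR(100) NOT NULL
);
```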
I have found that it helps to take a look at how you want to report on the data when the organization of the star schema is unclear. This should help to reduce the number of cubes that you have to create. In general, though, it's best not to try to normalize the OLAP model or make it too generic (i.e. using a single column to represent temperature and energy).
Related
I am trying to design a schema for a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events such that users can query the dataset with ease, possibly using a star schema. Almost all of the entries, such as diagnoses, procedures, notes, charts, etc., are related to a single admission. I have a few ideas in mind, but I have no experience, so I am finding it difficult to figure out the best way to do this.
Create multiple fact tables; for example, I would have one fact table for diagnoses, one for procedures, and one for lab notes, but this just seems like too many fact tables with little to gain. I could have a fact table with an entry for each diagnosis per patient per admission, but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission and multiple columns/dimensions for diagnoses, procedures, etc. But the issue with this is that for most admissions there are multiple diagnoses, so I would have to link to a bridge table in most fact tables, and then it would look like the image below. The issue with this is the joins required for the queries.
The third way I read about is using something like an entity-attribute-value (EAV) fact table, where each row is one fact. For example, one row could be something like (patientid - admissionid - (Attribute) Heart Rate Reading - (VALUE) 120bpm). This would create a single fact table with almost everything inside and fewer joins required, but it would require the end user to pivot the table after querying, due to the nature of EAV models, thus making it more complicated for the end user. It would look like the i2b2 star schema.
The last way I thought of was doing an entry per event into the fact table but having many columns in the fact table to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.), in which patientid and admissionid would be in all rows but the rest would depend on the entry, so one entry could have just patientid, admissionid, and a single procedure. Due to my lack of experience, I don't know what the end result of this would look like in terms of querying. I also don't know whether rows where almost every column is irrelevant are the way to go.
Any help would be greatly appreciated; I'm trying to have this implemented in BigQuery.
Some tests have concluded that reducing the number of joins can enhance BigQuery's performance. In other words, if you denormalize most of your data into a single table, you will probably get better results since you won't need to perform as many joins.
There are some points that you should think about when deciding on the data model:
Is it easy to write queries in your model?
If you need many joins, it will probably be difficult to build queries against your model. If you use nested fields to reduce the number of joins, you can simplify the queries you write. However, if you create very complex nested structures, you will start having problems querying the model again.
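As an illustration of the nested-field approach (the table and field names here are assumptions, not from the question), diagnoses stored as a repeated record inside each admission row can be listed with UNNEST instead of a join to a bridge table:

```sql
-- Sketch only: assumes an `admissions` table with a repeated STRUCT column `diagnoses`.
SELECT
  a.admission_id,
  d.icd9_code,
  d.seq_num
FROM `project.dataset.admissions` AS a,
     UNNEST(a.diagnoses) AS d
WHERE a.patient_id = 12345;
```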
Is it easy to join data in your model?
If you have many tables that need to be joined, your data will be difficult to work with. The more denormalized the data is, the easier it is to work with, because fewer joins are needed.
Is it easy to update data in your model?
If you need to update your data, consider that denormalization can help: if you reduce the number of tables, you will need to update fewer of them. It's important to say that if you create very complex nested fields, they will also be difficult to update.
Make data understandable
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get better performance by denormalizing your data, as long as the data remains understandable.
Use nested fields to denormalize the model, but try not to create very complex structures (more than 2 levels of nesting).
Keep in mind that your model will need more disk space when you denormalize it, so your storage billing in BigQuery will probably be higher.
I hope it helps
Initial Reading
Try reading The Data Warehouse Toolkit; chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
What you should try to figure out is which data is important and high value, and which is less important and low value. Only model and load the high value data into the database. If you try to build the perfect data model, you are never going to deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind, e.g. Blood Pressure Normal, Heart Rate High, etc.
Prescription drugs are another example of low value information. In the database you may set a flag if multiple prescriptions are present, or something similar. Freeform notes are another. Once a cohort has been decided on based on many other factors, you can enable a data scientist to process the notes with some machine learning, but doing this for all notes is not very useful.
Group/Junk Dimensions
A number of these measurement determinations could be lumped together inside a group/junk dimension as a way of maintaining the grain of the fact and keeping the fact table from exploding with low value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long-running executions for specific types of data extracts. You can then design a junk dimension to serve these higher value measurement determinations, enabling additional aggregation or slicing before extracting the insightful data from the data lake.
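A hedged sketch of what such a junk dimension might look like (column names and status values are illustrative only):

```sql
-- Hypothetical junk dimension: one row per observed combination of
-- low-cardinality measurement determinations, referenced by a single key in the fact.
CREATE TABLE DimMeasurementProfile (
    MeasurementProfileKey INT PRIMARY KEY,
    BloodPressureStatus   VARCHAR(10) NOT NULL,  -- e.g. 'Normal', 'High', 'Low'
    HeartRateStatus       VARCHAR(10) NOT NULL,
    MultiplePrescriptions CHAR(1)     NOT NULL   -- 'Y' or 'N'
);
```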
File Structure
In your data lake I would have many file schemas for the low value data. These could be JSON, parquet, csv, or whatever you prefer. You would include the data needed to connect it back to the fact along with the data specific to the file type.
PatientId
AdmissionId
MeasureType
MeasureValue
Date
Time
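Since the question targets BigQuery, one possible (assumed, not prescribed) way to reach those lake files on demand is an external table over the bucket, so the occasional high value lookup never requires loading everything. The bucket path, dataset name, and file format below are assumptions matching the field list above:

```sql
-- Sketch only: names, format, and path are illustrative.
CREATE EXTERNAL TABLE `project.dataset.measurement_lake` (
    PatientId    INT64,
    AdmissionId  INT64,
    MeasureType  STRING,
    MeasureValue STRING,
    MeasureDate  DATE,
    MeasureTime  TIME
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://example-bucket/measurements/*.json']
);
```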
The point is most of this data will never be looked at, but occasionally a few records are high value. You just don't know which they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language to create these files: Python, C#, an Azure Function App, AWS Lambda, whatever. This will depend on your own skills and the resources available.
We have data of different dimensions, for example:
Name by Company
Stock prices by Date, Company
Commodity prices by Date & Commodity
Production volumes by Date, Commodity & Company
We're thinking of the best way of storing these in BigQuery. One potential method is to put them all in the same table, and nest the extra dimensions.
That would mean:
Almost all the data would be nested - e.g. there would be a single 'row' for each Company, and then its prices would be nested by Date.
Data would have to share at least one dimension - I don't think there would be a way of representing Commodity prices in a table whose first column was the company's Name
Are there disadvantages? Are there performance implications? Is it sensible to nest 5000 dates + associated values within each company's row?
It's common to have nested/repeated columns in BigQuery schemas since it makes reasoning about the data easier. Firebase produces schemas with repetition at many levels, for instance. If you flatten everything, the downside is you need some kind of unique ID for each row in order to associate events with each other, and then you'll need aggregations (using the ID as a key) rather than simple filters if you want to do any kind of counting.
As for downsides of nested/repeated schemas, one is that you may find yourself performing complicated transformations of the structure with ARRAY subqueries or STRUCT operators, for instance. These are generally fast, but they do have some overhead relative to queries without any structure imposed on the result at all.
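For instance, a query over a nested prices array (the table and column names below are assumptions) would use UNNEST to get back to a flat, date-level view:

```sql
-- Sketch only: assumes one row per company with a repeated STRUCT column `prices`.
SELECT
  c.name,
  EXTRACT(YEAR FROM p.price_date) AS price_year,
  AVG(p.close_price) AS avg_close
FROM `project.dataset.companies` AS c,
     UNNEST(c.prices) AS p
GROUP BY c.name, price_year;
```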
My best suggestion would be to load some data and run some experiments. Storage and querying both are relatively cheap, so you can try a few different schema shapes and see which works better for your purposes.
Updating in BigQuery is pretty new, but based on the publicly available info (BigQuery DML), it is currently limited to only 48 UPDATE statements per table per day.
Quotas
DML statements are significantly more expensive to process than SELECT statements.
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
Processing nested data is also very expensive since all of the data from that column is loaded on every query. It is also slow if you are doing a lot of operations on nested data.
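One common way to live with those limits (sketched here under the assumption of a staging table holding the incoming changes) is to fold many row-level changes into a single MERGE statement, so only one DML statement counts against the daily quota:

```sql
-- Sketch only: table and column names are illustrative.
MERGE `project.dataset.admissions` AS t
USING `project.dataset.admissions_updates` AS s
ON t.admission_id = s.admission_id
WHEN MATCHED THEN
  UPDATE SET discharge_time = s.discharge_time
WHEN NOT MATCHED THEN
  INSERT (admission_id, patient_id, discharge_time)
  VALUES (s.admission_id, s.patient_id, s.discharge_time);
```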
I've got a simple cube with a fact table which has a date field among others, and I have connected it to a time dimension which has 2 hierarchies.
What I want to do is create one measure that will be filtered only by the first time hierarchy, and a second measure for the second time hierarchy.
Basically this:
Measure1 ----> Cannot be affected by filtering of time_hierarchy2 and gets filtered only by time_hierarchy1
And the same for Measure2.
With what I've tried so far I can't do this because whenever I add a time hierarchy in the cube browser filter area, it affects both measures while I want them to be independent.
Is this possible?
The idea is to create two instances (i.e. Cube Dimensions) of your Database Dimension and put one hierarchy in each of them. This concept is also known as a Role-Playing Dimension.
You can then add filters using these role-playing dimensions to filter your Measures.
With your current data model as described, this is not possible. Within Analysis Services, if you review the Dimension Usage tab, you will notice the dimension-to-measure-group usage. For a single measure-to-dimension relationship, the measure will be affected by all attributes/hierarchies of the related dimension when browsing the cube.
If having a separate TimeKey in your fact is a viable option, you may establish a Role-Playing Dimension and have multiple relationships from your fact to the Time dimension.
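A rough sketch of the two-key fact that a role-playing Time dimension would hang off (table and column names are hypothetical):

```sql
-- Hypothetical fact with two foreign keys to the same Time dimension;
-- each key relates the fact to the Time dimension in a different role.
CREATE TABLE FactExample (
    TimeKey1 INT NOT NULL REFERENCES DimTime (TimeKey),  -- exposed via one cube dimension
    TimeKey2 INT NOT NULL REFERENCES DimTime (TimeKey),  -- exposed via the other
    Measure1 DECIMAL(18,2) NULL,
    Measure2 DECIMAL(18,2) NULL
);
```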
Another option would be similar to a setup I recently split into multiple facts, each with a single reference to the Time dimension, so that I could plot separate measures on the same graph with the same time axis: How to avoid Role Playing Dimension
I'm building a dimensional data warehouse and learning how to model my various business processes from my source system in my warehouse.
I'm currently modelling a "Bid" (bid for work) from our source system in our data warehouse as a fact table which contains information such as:
Bid amount
Projected revenue
Sales employee
Bid status (active, pending, rejected, etc)
etc.
The problem is that the bid (or most any other process I'm trying to model) can go through various states and have its information updated at any given moment in the source system. According to Ralph Kimball, fact tables should only be updated if they are considered "accumulating snapshot" and I'm sure that not all of these processes would be considered an "accumulating snapshot" by the definition below.
How should these types of processes be modeled in the data warehouse according to the Kimball Group's recommendations? Furthermore, what type of fact table would work for a bid (given the facts I've outlined above)?
Excerpt from http://www.kimballgroup.com/2008/11/fact-tables/
The transaction grain corresponds to a measurement taken at a single instant. The grocery store beep is a transaction grain. The measured facts are valid only for that instant and for that event. The next measurement event could happen one millisecond later or next month or never. Thus, transaction grain fact tables are unpredictably sparse or dense. We have no guarantee that all the possible foreign keys will be represented. Transaction grain fact tables can be enormous, with the largest containing many billions of records.
The periodic snapshot grain corresponds to a predefined span of time, often a financial reporting period. Figure 1 illustrates a monthly account periodic snapshot. The measured facts summarize activity during or at the end of the time span. The periodic snapshot grain carries a powerful guarantee that all of the reporting entities (such as the bank account in Figure 1) will appear in each snapshot, even if there is no activity. The periodic snapshot is predictably dense, and applications can rely on combinations of keys always being present. Periodic snapshot fact tables can also get large. A bank with 20 million accounts and a 10-year history would have 2.4 billion records in the monthly account periodic snapshot!
The accumulating snapshot fact table corresponds to a predictable process that has a well-defined beginning and end. Order processing, claims processing, service call resolution and college admissions are typical candidates. The grain of an accumulating snapshot for order processing, for example, is usually the line item on the order. Notice in Figure 1 that there are multiple dates representing the standard scenario that an order undergoes. Accumulating snapshot records are revisited and overwritten as the process progresses through its steps from beginning to end. Accumulating snapshot fact tables generally are much smaller than the other two types because of this overwriting strategy.
As one of the comments mentions, Change Data Capture is a fairly generic term for "how do I handle changes to data entities over time", and there are entire books on it (and a gazillion posts and articles).
Regardless of any statements that seem to suggest a clear black-and-white or always-do-it-like-this answer, the real answer, as usual, is "it depends" - in your case, on what grain you need for your particular fact table.
If your data changes in unpredictable ways or very often, it can become challenging to implement Kimball's version of an accumulating snapshot (picture how many "milestone" date columns, etc. you might end up needing).
So, if you prefer, you can decide to make your fact table a transactional fact table rather than a snapshot, where the fact key would be (Bid Key, Timestamp). Then, in your application layer (whether a view, materialized view, actual app, or whatever), you can ensure that a given query only gets the latest version of each Bid (note that this can be thought of as a kind of virtual accumulating snapshot). If you find that you don't need the previous versions (the history of each Bid), you can have a routine that prunes them (i.e. deletes them or moves them somewhere else).
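A hedged sketch of that "latest version per Bid" layer (all names here are invented for illustration), using a view over the transactional fact:

```sql
-- Sketch only: keep every state change as a row in FactBidTransaction,
-- and expose just the most recent row per bid through a view.
CREATE VIEW dbo.CurrentBid AS
SELECT BidKey, BidAmount, ProjectedRevenue, SalesEmployeeKey, BidStatusKey, LoadTimestamp
FROM (
    SELECT f.*,
           ROW_NUMBER() OVER (PARTITION BY f.BidKey
                              ORDER BY f.LoadTimestamp DESC) AS rn
    FROM dbo.FactBidTransaction AS f
) AS latest
WHERE latest.rn = 1;
```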
Alternatively, you can only allow the fact (Bid) to be added when it is in its final state, but then you will likely have a significant lag where a new (updateable) Bid doesn't make it to the fact table for some time.
Either way, there are several solid and proven techniques for handling this - you just have to clearly identify the business requirements and design accordingly.
Good luck!
So I'm looking into data warehousing and partitioning and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using SQL Server 2012. But I'm curious as to how others view this question in relation to the core concept rather than the implementation (unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? This is really important; there are different strategies for Oracle vs. SQL Server vs. IBM, etc.
Sample queries and run times. Partitions usage depends on the conditions in your where clause, what are you filtering on?
Does it make sense to create/use aggregate tables? It seems like a monthly aggregate would save you some time (see the sketch after this list).
Lots of options based on the hardware and storage options available to you, need more details to make a more specific recommendation.
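For the aggregate-table point above, here is a minimal sketch (the detail fact FactSales and its columns are assumptions) of building a monthly rollup that trend queries can hit instead of the detail rows:

```sql
-- Sketch only: table and column names are illustrative.
SELECT
    DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1) AS SaleMonth,
    StoreKey,
    ItemKey,
    SUM(Quantity)   AS TotalQuantity,
    SUM(SaleAmount) AS TotalSaleAmount
INTO dbo.FactSalesMonthly
FROM dbo.FactSales
GROUP BY DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1), StoreKey, ItemKey;
```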
Here is an MS SQL 2012 database with 7 million records a day, with an ambition to grow the database to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of 320 partitions, which stays comfortably below the maximum of 1,000 partitions per scheme. The maximum size of one partition in one table is now approximately 10 GB, which makes it much easier to handle than the 3 TB size of the total.
A new file in the partition scheme is used for each new year. The 500 GB data files are a manageable size for backup and deletion.
When calculating data for one month, the four processors work in parallel, each handling one partition.
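For what it's worth, here is a rough sketch of that YearWeek partitioning in SQL Server 2012 syntax (boundary values, filegroups, and table columns are illustrative, not the actual setup):

```sql
-- Sketch only: one boundary value per week; in practice these would be generated for 6 years.
CREATE PARTITION FUNCTION pfYearWeek (INT)
AS RANGE RIGHT FOR VALUES (201401, 201402, 201403 /* ... */);

CREATE PARTITION SCHEME psYearWeek
AS PARTITION pfYearWeek ALL TO ([PRIMARY]);  -- or map each year's partitions to its own filegroup/file

CREATE TABLE dbo.FactTransaction (
    YearWeek INT NOT NULL,            -- e.g. 201453, then 201501
    SaleDate DATE NOT NULL,
    Amount   DECIMAL(18,2) NOT NULL
) ON psYearWeek (YearWeek);
```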