I have a large dataset of incoming messages, and what I want to do is provide aggregated statistics for each message owner, such as the rate of messages per day, week, last two weeks, and year. The aggregations can be simple, such as a word count, or more complex, such as keywords used. Either way, I want to precalculate and store these aggregations in an organized fashion, so that when I do something like:
Person.word_count_last_10_days
the query isn't run against the entire message archive database, but instead pulls from a table of precalculated aggregations, something like:
SELECT SUM(value)
FROM aggregations
WHERE
  category = 'word_count' AND
  timeframe = 'day' AND date > '2013-05-18' AND date < '2013-05-28' AND
  person_id = 42
GROUP BY person_id;
And aggregations over larger timeframes, such as "year", would simply sum up all the daily rows that make up that year.
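For example, the 2013 yearly figure could be rolled up from the daily rows with something like this (a sketch against the same hypothetical aggregations table as above):
-- Sum all of 2013's daily word_count rows into one yearly figure per person
SELECT person_id, SUM(value)
FROM aggregations
WHERE
  category = 'word_count' AND
  timeframe = 'day' AND
  date >= '2013-01-01' AND date < '2014-01-01'
GROUP BY person_id;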
The overall objective is to decouple the analytics browsing from the massive message archive. For the most part, there's no reason for the analytics system to actually query the message archive, if the aggregations table contains all the data needed.
This strikes me as a very common use case...it doesn't matter whether it's done through Hadoop or through non-parallel processing...so I was wondering if there is already a framework/wrapper/design methodology that provides some convention for this, rather than writing one completely from scratch? I'm using Ruby, but language doesn't matter; I'm just interested in whatever frameworks/wrappers (in any language) have already been conceived.
I would look into OLAP/cubes for this kind of work.
Here is an open source OLAP server: http://mondrian.pentaho.com/
The idea is that with a cube you can define pre-processed aggregations, run them ahead of time, and then query the results quickly.
The MDX language is the equivalent of SQL for cubes - and it has a pretty steep learning curve - but some of the basic stuff should be easy to handle out of the box.
It takes a bit of reading to get up to speed on cubes in general. Check out: http://en.wikipedia.org/wiki/OLAP_cube.
It is well worth it for pre-processed aggregations.
Related
I am new to database design and I am trying to practice with available government statistics for a small country. I have found almost 100 tables that store information collected for given years and months from a specific region. Some tables are updated monthly, while others are updated annually. I believe this means that in each table, there will be a natural composite PK made up of the year and month, or simply a PK made up of the year.
ER Diagram
In the above image, each parent attribute of Trip Survey represents one of the many data tables I have collected from public databanks specific to the region being researched (e.g. satisfaction_level, motivation_level, amount_spent all represent different surveys on the same population). Does it make sense to combine all of the tables into one table (e.g. Trip Survey)?
I'm not sure if my relationships are accurate (total and partial participation). My goal is to be able to query the data to find points of correlation and make predictions about the future. I want to try to connect all of the tables over time.
The surveys collected can cover nearly any topic, but the common thread is they represent a moment in time, either monthly or annually. I want to eventually add a table of significant political events that may reflect outliers from trends.
Example Result: When motivation levels were low in 2018, spending was also down and length of stay was shorter relative to 'n' period.
As a newbie, any and all help is greatly appreciated.
Thank you
Simplify simplify simplify.
Start with one table, with at least some columns you comprehend. Load it into some DBMS (pick one with geospatial capabilities and windowing functions, you may want them later: recent versions of MariaDB, MySQL and PostgreSQL are fine choices). Import your table. This can be a pain in the neck to get right, but do your best to get it right anyhow.
Don't worry about primary keys or unique indexes when starting out. You're just exploring the data, not building it. Don't worry about buying or renting a server: most laptops can handle this kind of exploration just fine.
Pick a client program that keeps a history of the queries you put into it. HeidiSQL is a good choice. The relatively new DataGrip from JetBrains is worth a look. Avoid Microsoft's SQL Server Management Studio: no history feature. (You'll often want to go back to something you tried a few hours or days ago when you're exploring, so the query history feature is vital.)
Then fiddle around with queries, especially aggregates ... e.g.
SELECT COUNT(*), year, origin, destination
FROM trip
GROUP BY year, origin, destination;
Look for interesting stuff you can glean from the one table. Get the hang of it. Then add another table that can be JOINed easily to the first table. Repeat your exploration.
That should get you started. Once you begin to understand your dataset, you can start ranking stuff, working out quintiles, and all that.
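Once you get there, a windowing function like NTILE makes the quintile work straightforward - a rough sketch against the same trip table (column names are illustrative):
-- Rank each origin's yearly trip count into quintiles
SELECT year,
       origin,
       COUNT(*) AS trips,
       NTILE(5) OVER (PARTITION BY year ORDER BY COUNT(*) DESC) AS quintile
FROM trip
GROUP BY year, origin;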
And, when you have to update or augment your data without reloading it, you'll need various primary / unique keys. That's in the future for you.
I am trying to design a schema for a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events such that users can query the dataset with ease, possibly using a star schema. Almost all of the entries, such as diagnoses, procedures, notes, charts etc., are related to a single admission. I had a few things in mind, but I have no experience, so I am finding it difficult to figure out the best way to do this.
Create multiple fact tables. For example, I would have one fact table for diagnoses, one for procedures, one for lab notes, but this just seems like too many fact tables with little to gain. I could have a fact table with an entry for each diagnosis per user per admission, but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission, with multiple columns/dimensions like diagnoses, procedures etc. The issue with this is that there are usually multiple diagnoses per admission, so I would have to link to a bridge table in most fact tables, and then it would look like the image below. The downside of this is the joins required for the queries.
The third way that I read about is using something like an entity-attribute-value (EAV) fact table, where each row is one fact. For example, one row could be something like (patientid - admissionid - (Attribute) Heart Rate Reading - (VALUE) 120bpm). This would create a single fact table with almost everything inside and fewer joins required, but it would require the end user to pivot the table after querying, due to the nature of EAVs, thus making it more complicated for the end user. It would look like the i2b2 star schema.
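A rough sketch of what I mean (table and column names are just for illustration):
-- One row per observation; the attribute/value pair keeps the table generic,
-- but the consumer has to pivot the results afterwards.
CREATE TABLE fact_event (
  patient_id   INT          NOT NULL,
  admission_id INT          NOT NULL,
  attribute    VARCHAR(100) NOT NULL,  -- e.g. 'Heart Rate Reading'
  value        VARCHAR(100) NOT NULL,  -- e.g. '120bpm'
  recorded_at  TIMESTAMP    NOT NULL
);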
The last way I thought of was doing an entry per event into the fact table, but having many columns in the fact table to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.), in which patientid and admissionid will be in all rows but the rest will depend on the entry, so one entry could have just patientid, admissionid and a single procedure. I don't know what the end result of this will look like in terms of querying, due to my lack of experience. I also don't know whether rows where almost every column is irrelevant are the way to go.
Any help would be greatly appreciated, I'm trying to have this implemented into BigQuery.
Some tests have concluded that reducing the number of joins can enhance BigQuery's performance. In other words, if you denormalize most of your data into a single table you will probably get better results, since you won't need to perform as many joins.
There are some points that you should think about when deciding the data modeling:
Is it easy to write queries in your model?
If your model needs many joins, it will probably be difficult to build queries against it. If you use nested fields to reduce the number of joins, you can simplify the queries you write. However, if you create very complex nested structures, you will start having problems querying the model again.
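A rough sketch of nested (repeated) fields in BigQuery Standard SQL, with hypothetical dataset, table and column names:
-- Diagnoses and lab events live inside the admission row as repeated records,
-- so no join to separate tables is needed.
CREATE TABLE clinical.admissions (
  patient_id   INT64,
  admission_id INT64,
  admit_time   TIMESTAMP,
  diagnoses    ARRAY<STRUCT<icd9_code STRING, seq_num INT64>>,
  lab_events   ARRAY<STRUCT<item_id INT64, value FLOAT64, charted_at TIMESTAMP>>
);

-- Count admissions per diagnosis code by unnesting the repeated field.
SELECT d.icd9_code, COUNT(DISTINCT admission_id) AS admissions
FROM clinical.admissions, UNNEST(diagnoses) AS d
GROUP BY d.icd9_code;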
Is it easy to join data in your model?
If your model has many tables to be joined, querying across them will be difficult. The more denormalized the data is, the easier it is to join.
Is it easy to update data in your model?
If you need to update your data, consider that denormalization can help: if you reduce the number of tables, you will have fewer tables to update. It's important to say that very complex nested fields will also be difficult to update.
Is your data understandable in your model?
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get better performance by denormalizing your data, as long as the data stays understandable.
Use nested fields to denormalize the model, but try not to create very complex structures (more than two levels of nesting).
Keep in mind that your model will need more disk space when you denormalize it so your billing for storage in BigQuery will probably be higher.
I hope it helps
Initial Reading
Try reading The Data Warehouse Toolkit; chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
What you should try to figure out is which data is important and high value, and which is less important and low value. Only model and load into the database the high value data. If you try to build the perfect data model, you are never going to deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind, e.g. Blood Pressure Normal, Heart Rate High, etc.
Prescription drugs are another example of low value information. In the database you may set a flag if multiple prescriptions are present or something similar. Freeform notes are another. Once some cohort has been decided on based on many other factors, you enable a data scientist to process the note with some machine learning but doing this for all notes is not very useful.
Group/Junk Dimensions
A number of these measurement determinations could be lumped together inside a group/junk dimension as a way of maintaining the grain of the fact and keeping the fact table from exploding with low value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long executions for specific types of data extracts. You can then design a junk dimension to serve these higher value measurement determinations, enabling additional aggregation or slicing prior to extracting the insightful data from the data lake.
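A rough sketch of such a junk dimension (names are illustrative):
-- One row per observed combination of low-cardinality determinations;
-- the fact row carries a single key instead of many flag columns.
CREATE TABLE dim_measurement_profile (
  measurement_profile_key INT PRIMARY KEY,
  blood_pressure_status   VARCHAR(20),  -- e.g. 'Normal', 'High'
  heart_rate_status       VARCHAR(20),
  multiple_prescriptions  CHAR(1)       -- 'Y' / 'N'
);
-- The admission fact then references measurement_profile_key instead of
-- carrying each determination as its own column.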
File Structure
In your data lake I would have many file schemas for the low value data. These could be JSON, parquet, csv, or whatever you prefer. You would include the data needed to connect it back to the fact along with the data specific to the file type.
PatientId
AdmissionId
MeasureType
MeasureValue
Date
Time
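If you later need to query a slice of these files directly, BigQuery can expose them as an external table. A rough sketch with hypothetical bucket, dataset and column names:
-- Reads the CSV files in place; nothing is loaded until you actually query it.
CREATE EXTERNAL TABLE lake.low_value_measurements (
  patient_id    INT64,
  admission_id  INT64,
  measure_type  STRING,
  measure_value STRING,
  measure_date  DATE,
  measure_time  TIME
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://your-lake-bucket/measurements/*.csv']
);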
The point is most of this data will never be looked at, but occasionally a few records are high value. You just don't know which they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language to create these files: Python, C#, Azure Function App, AWS Lambda, whatever. This will depend on your own skills and the resources available to you.
So I'm looking into data warehousing and partitioning and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using SQL Server 2012. But I'm curious as to how others view this question in relation to the core concept rather than the implementation (unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? Really important, different strategies for Oracle vs SQL Server vs IBM, etc.
Sample queries and run times. Partition usage depends on the conditions in your where clause, what are you filtering on?
Does it make sense to create/use aggregate tables? Seems like a monthly aggregate would save you some time.
Lots of options based on the hardware and storage options available to you, need more details to make a more specific recommendation.
Here is an MS SQL Server 2012 database with 7 million records a day, with an ambition to grow the database to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of 320 partitions, which is comfortably below the maximum of 1,000 partitions within a scheme. The maximum size for one partition in one table is now approx. 10 GB, which makes it much easier to handle than the 3 TB size of the total.
A new file in the partition scheme is used for each new year. The 500 GB data files are suitable for backup and deletion.
When calculating data for one month the 4 processors are working in parallel to handle one partition each.
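A rough sketch of such a setup in T-SQL (boundary values and names are illustrative):
-- Partition function on the integer YearWeek column; each boundary starts a new week.
CREATE PARTITION FUNCTION pf_yearweek (int)
AS RANGE RIGHT FOR VALUES (201501, 201502, 201503 /* ...one boundary per week... */);

-- Map the partitions to filegroups (a new filegroup/file can be added per year).
CREATE PARTITION SCHEME ps_yearweek
AS PARTITION pf_yearweek ALL TO ([PRIMARY]);

-- The transaction table is then created on the scheme, partitioned by YearWeek.
CREATE TABLE dbo.transactions (
  transaction_id BIGINT NOT NULL,
  yearweek       INT    NOT NULL,
  amount         DECIMAL(18, 2) NOT NULL
) ON ps_yearweek (yearweek);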
Working on a dashboard page which does a lot of analytics to display BOTH graphical and tabular data to users.
When the dashboard is filtered by a given year, I have to display analytics for the selected year, another year chosen for comparison, and historical averages from all time.
For the selected and comparison years, I create start/end DateTime objects that are set to the beginning_of_year and end_of_year.
year = Model.where("closed_at >= ?", start).where("closed_at <= ?", end).all
comp = Model.where("closed_at >= ?", comp_start).where("closed_at <= ?", comp_end).all
These queries are essentially the same, just different date filters. I don't really see any way to optimize this besides trying to only "select(...)" the fields I need, which will probably be all of them.
Since there will be an average of 250-1000 records in a given year, they aren't "horrible" (in my not-very-skilled opinion).
However, the historical averages are causing me a lot of pain. In order to adequately show the averages, I have to query ALL the records for all time and perform calculations on them. This is a bad idea, but I don't know how to get around it.
all_for_average = Model.all
Surely people have run into these kinds of problems before and have some means of optimizing them? Returning somewhere in the ballpark of 2,000 - 50,000 records for historical average analysis can't be very efficient. However, I don't see another way to perform the analysis unless I first retrieve the records.
Option 1: Grab everything and filter using Ruby
Since I'm already grabbing everything via Model.all, I "could" remove the two year-scoped queries and simply pull the desired records out of that full result set instead. But this seems wrong...I'm literally "downloading" my DB (so to speak) and then querying it with Ruby code instead of SQL. Seems very inefficient. Has anyone tried this before and seen any performance gains?
Option 2: Using multiple SQL DB calls to get select information
This would mean instead of grabbing all records for a given time period, I would make several DB queries to get the "answers" from the DB instead of analyzing the data in Ruby.
Instead of running something like this,
year = Model.where("closed_at >= ?", start).where("closed_at <= ?", end).all
I would perform multiple queries:
year_total_count = Model.where(DATE RANGE).size
year_amount_sum = Model.where(DATE RANGE).sum("amount")
year_count_per_month = Model.where(DATE RANGE).group("EXTRACT(MONTH FROM closed_at)").count
...other queries to extract selected info...
Again, this seems very inefficient, but I'm not knowledgeable enough about SQL and Ruby code efficiencies to know which would lead to obvious downsides.
I "can" code both routes and then compare them with each other, but it will take a few days to code/run them since there's a lot of information on the dashboard page I'm leaving out. Certainly these situations have been run into multiple times for dashboard/analytics pages; is there a general principle for these types of situations?
I'm using PostgreSQL on Rails 4. I've been looking into DB-specific solutions as well, as being "database agnostic" really is irrelevant for most applications.
Dan, I would look into using a materialized view (MV) for the all-time historical average. This would definitely fall under the "DB-specific" solutions category, as MVs are implemented differently in different databases (or sometimes not at all). Here is the basic PG documentation.
A materialized view is essentially a physical table, except its data is based on a query of other tables. In this case, you could create an MV that is based on a query that averages the historical data. This query only gets run once if the underlying data does not change. Then the dashboard could just do a simple read query on this MV instead of running the costly query on the underlying table.
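A rough sketch of what this could look like in PostgreSQL (the table and column names are assumed from the question):
-- Pre-compute per-month historical aggregates once; refresh when the data changes.
CREATE MATERIALIZED VIEW historical_monthly_stats AS
SELECT date_trunc('month', closed_at) AS month,
       COUNT(*)                       AS record_count,
       SUM(amount)                    AS total_amount,
       AVG(amount)                    AS avg_amount
FROM models
GROUP BY date_trunc('month', closed_at);

-- Re-run the underlying query when needed (e.g. after a data load):
REFRESH MATERIALIZED VIEW historical_monthly_stats;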
After discussing the issue with other more experienced DBAs and developers, I decided I was trying to optimize a problem that didn't need any optimization yet.
For my particular use case, I would have a few hundred users a day running these queries anywhere from 5-20 times each, so I wasn't really having major performance issues (ie, I'm not a Google or Amazon servicing billions of requests a day).
I am actually just having the PostgreSQL DB execute the queries each time and I haven't noticed any major performance issues for my users; the page loads very quickly and the queries/graphs have no noticeable delay.
For others trying to solve similar issues, I recommend running it for a while in a staging environment to see if you really have a problem that needs solving in the first place.
If I hit performance hiccups, my first step will be specifically indexing data that I query on, and my 2nd step will be creating DB views that "pre-load" the queries more efficiently than querying them over live data each time.
Thanks to the incredible advances in DB speed and technology, however, I don't have to worry about this problem.
I'm answering my own question so others can spend time resolving more profitable questions.
I have a postgres database with several million rows, which drives a web app. The data is static: users don't write to it.
I would like to be able to offer users query-able aggregates (e.g. the sum of all rows with a certain foreign key value), but the size of the database now means it takes 10-15 minutes to calculate such aggregates.
Should I:
1. start pre-calculating aggregates in the database (since the data is static)
2. move away from postgres and use something else?
The only problem with 1. is that I don't necessarily know which aggregates users will want, and it will obviously increase the size of the database even further.
If there was a better solution than postgres for such problems, then I'd be very grateful for any suggestions.
You are trying to solve an OLAP (On-Line Analytical Processing) database structure problem with an OLTP (On-Line Transaction Processing) database structure.
You should build another set of tables that store just the aggregates and update those tables in the middle of the night. That way your customers can query the aggregate set of tables and it won't interfere with the online transaction processing system at all.
The only caveat is that the aggregate data will always be one day behind.
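A rough sketch of such a nightly refresh (table and column names are assumed):
-- Rebuild the per-key aggregates from the large source table; schedule this
-- overnight with cron, pg_cron, or similar.
BEGIN;
TRUNCATE daily_aggregates;
INSERT INTO daily_aggregates (fk_id, day, row_count, total_value)
SELECT fk_id,
       date_trunc('day', created_at) AS day,
       COUNT(*),
       SUM(value)
FROM big_table
GROUP BY fk_id, date_trunc('day', created_at);
COMMIT;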
1. Yes.
2. Possibly. Presumably there are a whole heap of things you would need to consider before changing your RDBMS. If you moved to SQL Server, you would use indexed views to accomplish this: Improving Performance with SQL Server 2008 Indexed Views
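For reference, a rough sketch of an indexed view (table and column names are hypothetical):
-- The view must be schema-bound and contain COUNT_BIG(*) to be indexable.
CREATE VIEW dbo.v_rows_per_key
WITH SCHEMABINDING
AS
SELECT fk_id, COUNT_BIG(*) AS row_count
FROM dbo.big_table
GROUP BY fk_id;
GO
-- The unique clustered index materializes the view. Enterprise edition can use
-- it automatically; other editions need the NOEXPAND hint when querying the view.
CREATE UNIQUE CLUSTERED INDEX ix_v_rows_per_key ON dbo.v_rows_per_key (fk_id);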
If you store the aggregates in an intermediate object (something like MyAggregatedResult), you could consider a caching proxy:
class ResultsProxy {
    Result calculateResult(Param param1, Param param2) {
        // retrieve from cache
        // if not found, calculate, store in cache, and return
    }
}
There are quite a few caching frameworks for Java, and most likely for other languages/environments such as .NET as well. These solutions can take care of invalidation (how long a result should be stored in memory) and memory management (removing old cache items when reaching a memory limit, etc.).
If you have a set of commonly-queried aggregates, it might be best to create an aggregate table that is maintained by triggers (or an observer pattern tied to your OR/M).
Example: say you're writing an accounting system. You keep all the debits and credits in a General Ledger table (GL). Such a table can quickly accumulate tens of millions of rows in a busy organization. To find the balance of a particular account on the balance sheet as of a given day, you would normally have to calculate the sum of all debits and credits to that account up to that date, a calculation that could take several seconds even with a properly indexed table. Calculating all figures of a balance sheet could take minutes.
Instead, you could define an account_balance table. For each account and each date or date range of interest (usually each month's end), you maintain a balance figure by using a trigger on the GL table that adds each new delta to all applicable balances. This spreads the cost of aggregating these figures over each individual write to the database, which will likely reduce it to a negligible performance hit when saving, and turns reading the data from a massive linear operation into a near-constant one.
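A rough sketch in PostgreSQL-flavoured SQL (names are illustrative, and it assumes balance rows already exist for the relevant period ends):
-- Month-end balance per account, maintained by a trigger on GL inserts.
CREATE TABLE account_balance (
  account_id INT            NOT NULL,
  period_end DATE           NOT NULL,
  balance    NUMERIC(18, 2) NOT NULL DEFAULT 0,
  PRIMARY KEY (account_id, period_end)
);

CREATE OR REPLACE FUNCTION apply_gl_delta() RETURNS trigger AS $$
BEGIN
  -- Add the posted amount to every period-end balance on or after the posting date.
  UPDATE account_balance
  SET balance = balance + NEW.amount
  WHERE account_id = NEW.account_id
    AND period_end >= NEW.posted_on;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gl_after_insert
AFTER INSERT ON gl
FOR EACH ROW EXECUTE PROCEDURE apply_gl_delta();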
For that data volume you shouldn't have to move off Postgres.
I'd look to tuning first - 10-15 minutes seems pretty excessive for 'a few million rows'. This ought to take just a few seconds. Note that the out-of-the-box config settings for Postgres don't (or at least didn't) allocate much disk buffer memory. You might look at that also.
More complex solutions involve implementing some sort of data mart or an OLAP front-end such as Mondrian over the database. The latter does pre-calculate aggregates and caches them.
If you have a set of common aggregates you can calculate them beforehand (say, once a week) in a separate table and/or extra columns, and users get them fast.
But I'd look into the tuning route too - revise your indexing strategy. Since your database is read only, you don't need to worry about index update overhead.
Revise your database configuration as well; maybe you can squeeze some performance out of it - default configurations are normally targeted at easing the life of first-time users and quickly fall short for large databases.
Maybe even some denormalization can speed things up after you have revised your indexing and database configuration and you find yourself still needing more performance - but try it as a last resort.
Oracle supports a concept called Query Rewrite. The idea is this:
When you want a lookup (WHERE ID = val) to go faster, you add an index. You don't have to tell the optimizer to use the index - it just does. You don't have to change the query to read FROM the index... you hit the same table as you always did but now instead of reading every block in the table, it reads a few index blocks and knows where to go in the table.
Imagine if you could add something like that for aggregation. Something that the optimizer would just 'use' without being told to change. Let's say you have a table called DAILY_SALES for the last ten years. Some sales managers want monthly sales, some want quarterly, some want yearly.
You could maintain a bunch of extra tables that hold those aggregations and then tell the users to change their queries to use a different table. In Oracle, you'd instead build those as materialized views. You do no work except defining the MV and an MV log on the source table. Then if a user queries DAILY_SALES for a sum by month, Oracle will rewrite the query to use the appropriate level of aggregation. The key is that this happens WITHOUT changing the query at all.
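A rough sketch of the Oracle side (the DAILY_SALES columns are assumed; query rewrite also requires the appropriate privileges and parameters to be enabled):
-- Monthly rollup that the optimizer may use to answer queries against DAILY_SALES.
CREATE MATERIALIZED VIEW monthly_sales_mv
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)            AS total_amount
FROM daily_sales
GROUP BY TRUNC(sale_date, 'MM');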
Maybe other DBs support that... but this is clearly what you are looking for.