I have a model that looks like the below, where we have sales transactions, then products, stores, and sales people. In each case there's a many-to-many relationship: many sales involve many types of products, many sales involve many stores, etc. Since the volumes are pretty high (38 million, 178 million, etc.), performance is our main issue. How could we better design this model to get faster performance? We've tried combining the dimensions into the bridge tables, which has helped. But using DAX Studio, we still see that the SQL joining out to the other tables is very costly. Thanks for any advice.
Better performance, but queries still take 5-10 seconds when applying filters on all three tables.
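For context, here is a minimal sketch of the kind of layout being described (generic SQL; all table and column names are illustrative, not the actual model): a sales fact table joined to each dimension through a many-to-many bridge.

    -- Illustrative only: one of the three dimensions (products); stores and
    -- sales people would follow the same pattern.
    CREATE TABLE DimProduct (
        ProductKey  INT PRIMARY KEY,
        ProductName VARCHAR(100)
    );

    CREATE TABLE FactSales (
        SalesKey    INT PRIMARY KEY,
        SalesAmount DECIMAL(18,2)
    );

    -- Bridge resolving the many-to-many between sales and products; the
    -- dimension attributes can be folded into this table to drop one join.
    CREATE TABLE BridgeSalesProduct (
        SalesKey   INT NOT NULL REFERENCES FactSales(SalesKey),
        ProductKey INT NOT NULL REFERENCES DimProduct(ProductKey),
        PRIMARY KEY (SalesKey, ProductKey)
    );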
Q1: What is the maximum number of tables that can be stored in a database?
Q2: What is the maximum number of tables that can be unioned in a view?
Q1: There's no explicit limit in the docs. In practice, some operations are O(n) in the number of tables; expect planning times to increase, and problems with things like autovacuum, as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion it will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables or giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this; it's mismodelling the data and will make it harder, not easier, to work with. Indexes, WHERE clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.
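As a rough sketch of what that looks like in practice (PostgreSQL syntax; the table and column names below are hypothetical, since the real schema isn't shown), one table with an index replaces all the per-number tables and views:

    -- One table for every phone number, rather than one table per number.
    CREATE TABLE calls (
        id           bigserial   PRIMARY KEY,
        phone_number text        NOT NULL,
        called_at    timestamptz NOT NULL,
        duration_s   integer
    );

    -- The index replaces the per-number tables and views.
    CREATE INDEX calls_phone_number_idx ON calls (phone_number, called_at);

    -- Every "per-number" question becomes an ordinary WHERE clause.
    SELECT called_at, duration_s
    FROM   calls
    WHERE  phone_number = '555-0100'
    ORDER  BY called_at;

If the table eventually grows too large, declarative partitioning (for example by month of called_at) can be added later without changing this logical model.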
Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, having a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. The same goes for unions: if you have to union more than a handful of tables, you should probably look at your table structure.
One practical scenario where this can happen is with PostGIS: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of PostGIS IMHO), but that would typically be handled on the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?
Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
It is an offline use case, meaning there are no inserts, deletes, or updates happening; instead, every X weeks a new complete dump is imported to replace the old data.
We would like an upper bound of 10 seconds to answer our queries; the faster the better, and a maximum of 2 seconds per query would be great.
The actual data has around 100-200 million entries; the growth rate is linear.
The system has to serve a limited number of concurrent users, max 100.
Questions:
What would be the right database technology or mixture of technologies to solve our problem?
What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
(Bonus) What would be the estimated hardware requirements given a specific technology?
What we tried so far:
Setting up a document store with denormalized entries. Problem: It doesn't perform well on general queries because it has to scan too many entries for aggregations.
Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations, which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and a database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy-to-configure graphs and filters). BI tools are often used together with a data warehouse: the data is modeled, cleaned, and stored inside the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However, many companies (particularly smaller companies) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely a BI tool, you can de-normalize the data into a combined dataset, which would allow you a variety of filtering & calculation options, while still maintaining high performance and speed.
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the number of interactions, grouped and/or filtered by any of those columns.
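To make that concrete, here is a hedged sketch (generic SQL; the table mirrors the column list above and all names are illustrative): a single filtered GROUP BY answers this kind of question without any joins.

    CREATE TABLE interactions (
        interaction_date   DATE,
        initiator_country  VARCHAR(50),
        responder_country  VARCHAR(50),
        topic              VARCHAR(100),
        interaction_type   VARCHAR(50),
        initiator_interest VARCHAR(100),
        responder_interest VARCHAR(100)
    );

    -- Count 2018 interactions between Americans and Germans, broken down
    -- by topic; swap the WHERE/GROUP BY columns for other filter criteria.
    SELECT topic, COUNT(*) AS interaction_count
    FROM   interactions
    WHERE  interaction_date >= '2018-01-01'
    AND    interaction_date <  '2019-01-01'
    AND    initiator_country = 'USA'
    AND    responder_country = 'Germany'
    GROUP  BY topic
    ORDER  BY interaction_count DESC;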
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)
I am building a data warehouse. I need to get data from different sources and put it together so that I can generate reports. I will do lots of joining of tables. I am talking about maybe 20 tables total, and each table is going to be anywhere from 100 MB to 5 GB.
I would like to know if I should be creating different databases for each table since each table might have an entirely different TYPE of dataset.
For example, I might have one table that has 1 GB of data about the design of cars. And I will have another table with 3 GB of sales data on these cars.
Would it be appropriate to separate these into different databases?
Please let me know what additional information is needed to advise me on this situation.
If there's a logical or business separation, by all means put them in different databases. That's just clean data application development. However, if you're going to be joining or merging the different data sets, then you can save some overhead and admin costs by having a single database. 20 tables total isn't a lot (I'm working on a system that has about 3700 tables, though ~1600 are audits). Keep in mind SQL Server is meant to scale up to terabytes of data, provided you have a decent model, indexes, etc.
If you're concerned with performance of the warehouse, you can jam that server full of RAM and hard drives. To leverage the hard drives properly you'd want to look at using multiple files / filegroups and doling the tables out appropriately.
Splitting into different databases would normally be done in order to spread I/O load. In SQL Server you can have different filegroups within the database itself if you want to spread I/O across multiple disk groups/disks. In warehousing scenarios you often deal with SAN solutions for database storage, and depending on your scenario, some of these won't really care one way or the other performance-wise, while others might give you additional performance if planned properly.
You also have table partitioning, which you can look at for your growing database, but in my opinion, just make sure you have plenty of good old memory; it will benefit you more than spending time and effort worrying about databases and files.
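For what it's worth, here is a minimal T-SQL sketch of the filegroup and partitioning ideas above (the database name, file path, and boundary dates are made up for illustration):

    -- Add a filegroup on a separate disk to spread I/O.
    ALTER DATABASE Warehouse ADD FILEGROUP SalesFG2;
    ALTER DATABASE Warehouse ADD FILE
        (NAME = 'WarehouseSales2', FILENAME = 'E:\Data\WarehouseSales2.ndf')
        TO FILEGROUP SalesFG2;

    -- Partition a growing fact table by year across filegroups.
    CREATE PARTITION FUNCTION pfSaleYear (date)
        AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01');

    CREATE PARTITION SCHEME psSaleYear
        AS PARTITION pfSaleYear TO ([PRIMARY], [PRIMARY], SalesFG2);

    CREATE TABLE dbo.CarSales (
        SaleDate date          NOT NULL,
        ModelId  int           NOT NULL,
        Amount   decimal(18,2) NOT NULL
    ) ON psSaleYear (SaleDate);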
We are running 100 GB databases in a single database file and the performance is stellar. Much of the frequently accessed data resides in memory, though. With a decent table structure and logical indexes you'll have a responsive warehouse in no time.
If you're planning on having foreign key relationships between these tables (and it sounds like you would), then I would keep it all in one database. Typically I use separate databases for totally separate bodies of data.
If you do separate them then you will run into some interesting challenges when you try to query both at the same time.
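To illustrate one of those challenges (T-SQL; the database, table, and column names are hypothetical): cross-database queries need three-part names, and foreign keys can't span databases, so the relationship is no longer enforced.

    -- Works on one server, but no foreign key can protect ModelId across
    -- the two databases, and three-part names are required everywhere.
    SELECT d.ModelName, s.Amount
    FROM   CarDesignDB.dbo.CarDesigns AS d
    JOIN   CarSalesDB.dbo.CarSales    AS s
           ON s.ModelId = d.ModelId;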
Over the years I have read a lot of people's opinions on how to get better performance out of their SQL (Microsoft SQL Server, just so we are all on the same page...) queries. However, they all seem to be tightly tied to either a high-performance OLTP setup or a data warehouse OLAP setup (cubes-galore...). However, my situation today is kind of in the middle of the 2, hence my indecision.
I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating to just the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating to just the sites. Basically it's a pretty big flattened table or view… Most of the columns are int, bit, char(3), or short varchar(s). My problem is that a good portion of these columns are available to be used in ad-hoc queries by the user, and the queries need to run as quickly as possible because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast… This data is read-only; the data doesn't change at all during the day and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation like an OLAP database with the read requirements of an OLTP database.
I see three options: 1) break the table into smaller divisible units and sub-query everything; 2) make one flat table and really go to town on the indexing; 3) create an OLAP cube and sub-query the rest based on whatever filter values I don't put as the cube dimensions. I have not done much with OLAP cubes, so I frankly don't even know if that is an option, but from what I've done with them in the past I think it might be. Also, just to clarify, what I mean when I say "sub-query everything" is that instead of having a WHERE clause on the outer select, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed, to eliminate a really large Cartesian product. As for the second option of one large table, I have heard and seen conflicting results with that approach, as it saves on joins but at the same time a table scan takes much longer.
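(For clarity, the "sub-query everything" idea would look roughly like this; only the table names come from the schema above, and the column names are made up.)

    -- Filter each table first, then INNER JOIN the already-reduced sets,
    -- instead of one big WHERE on the outer SELECT (illustrative columns).
    SELECT c.ContactId, s.SiteId
    FROM  (SELECT ContactId FROM Contacts WHERE State = 'TX') AS c
    INNER JOIN
          (SELECT ContactId, SiteId FROM SiteContacts) AS sc
           ON sc.ContactId = c.ContactId
    INNER JOIN
          (SELECT SiteId FROM Sites WHERE Segment = 'Retail') AS s
           ON s.SiteId = sc.SiteId;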
Ideas anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea if that’s the case, I’m new to that stuff too.
Thanks in advance to any and all opinions and help with this dilemma I’ve found myself in.
You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or, a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.
In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though. The fact may simply be the association of sites to contacts.
The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
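A minimal sketch of that logical structure (generic SQL; key and attribute names are illustrative), with the fact being nothing more than the site-contact association:

    CREATE TABLE DimContact (
        ContactKey  INT PRIMARY KEY,
        ContactName VARCHAR(100),
        ContactCity VARCHAR(100)
        -- ...the other contact attributes
    );

    CREATE TABLE DimSite (
        SiteKey  INT PRIMARY KEY,
        SiteName VARCHAR(100)
        -- ...the other site attributes
    );

    -- A "factless" fact table: each row simply records that a contact is
    -- associated with a site; measures can be added if transactions appear.
    CREATE TABLE FactSiteContact (
        SiteKey    INT NOT NULL REFERENCES DimSite(SiteKey),
        ContactKey INT NOT NULL REFERENCES DimContact(ContactKey),
        PRIMARY KEY (SiteKey, ContactKey)
    );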
It wouldn't be a very big problem to add numerous indexes to your architecture. The database is mostly read only, except for the refresh time. You won't have to worry about read performance while indexes are being updated. So, the architecture can accommodate all indexes that are needed (as long as you can dedicate enough downtime to refresh the data).
I agree with bobs' answer: throw an OLAP front end on top and query through the cube. The reason why this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates by multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.
The relational data underneath the cube will be great for detail drill-ins to find the individual facts that give a certain aggregate value. But querying directly the relational data will always be slow, because those aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.
OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.
What are the most common queries? For single pieces of data or aggregates?
If the granularity of SiteContacts is pretty close to that of Contacts (i.e. circa 3 million records, with most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).
On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.
OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.
Star schemas consist of fact tables with dimensions hanging off them - depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions with a factless fact table (sounds like an oxymoron, but is covered in Kimball's methodology) linking them.
I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).
The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.
Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.
My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.
Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?
I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.
The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.
I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.
Relational database systems can definitely handle this kind of load, and if one server or disk starts to strain then most high-end databases have partitioning solutions.
You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.
The very first thing you should always be doing when you see a slow query is pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for your query, or a query that's written so as to be unable to take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.
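As a purely illustrative example of the covering-index point (T-SQL-style syntax; the table and columns are hypothetical): if an aggregate query touches only a few columns, an index that covers them lets the plan seek instead of scanning the whole table.

    -- A hypothetical aggregate over a large financial transactions table.
    SELECT Region, SUM(Amount) AS Total
    FROM   dbo.Transactions
    WHERE  TxnDate >= '2008-01-01'
    GROUP  BY Region;

    -- A covering index for that query: the filtered column leads, the
    -- grouped column follows, and the aggregated column is included so
    -- no lookups back to the base table are needed.
    CREATE INDEX IX_Transactions_TxnDate_Region
        ON dbo.Transactions (TxnDate, Region) INCLUDE (Amount);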
90 million rows should be about 90 GB, thus your bottleneck is disk.
If you need these queries rarely, run them as is.
If you need these queries often, you have to split your data and precompute your grouping, summing, and averaging on the part of your data that doesn't change (or hasn't changed since last time).
For example, if you process historical data for the last N years up to and including today, you could process it one month (or week, or day) at a time and store the totals and averages somewhere. Then at query time you only need to reprocess the period that includes today.
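A rough sketch of that incremental approach (generic SQL; all names are hypothetical): keep a small summary table of closed periods, refreshed by the off-peak batch, and aggregate only the current period at query time.

    -- Totals for closed months, filled in by the off-peak batch job.
    CREATE TABLE monthly_totals (
        month_start DATE PRIMARY KEY,
        txn_count   BIGINT,
        txn_total   DECIMAL(18,2)
    );

    INSERT INTO monthly_totals (month_start, txn_count, txn_total)
    SELECT '2010-05-01', COUNT(*), SUM(amount)
    FROM   transactions
    WHERE  txn_date >= '2010-05-01'
    AND    txn_date <  '2010-06-01';

    -- At query time, combine the precomputed months with only the
    -- current, still-changing period.
    SELECT SUM(txn_total) AS grand_total
    FROM (
        SELECT txn_total FROM monthly_totals
        UNION ALL
        SELECT SUM(amount) AS txn_total
        FROM   transactions
        WHERE  txn_date >= '2010-06-01'
    ) AS combined;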
Some RDBMSs give you some control over when views are updated (at select, at source change, offline); if your complicated grouping, summing, and averaging is in fact simple enough for the database to understand correctly, it could, in theory, update a few rows in the view at every insert/update/delete in your source tables in reasonable time.
It looks like you're calculating the same data over and over again from normalized data. One way to speed up processing in cases like this is to keep SQL, with its nice reporting and relationships and consistency and such, and use an OLAP cube which is recalculated every X minutes. Basically you build a big table of denormalized data on a regular basis, which allows quick lookups. The relational data is treated as the master, but the cube allows quick precalculated values to be retrieved from the database at any one point.
If that is only 1/20th of your data, you almost surely need to look into more scalable and efficient solutions, such as Google's BigTable. Have a look at NoSQL.
I personally think that MongoDB is an awesome in-between of NoSQL and RDBMS. It isn't relational, but it provides a lot more features than a simple document store.
In dimensional (Kimball methodology) models in our data warehouse on SQL Server 2005, we regularly have fact tables with that many rows just in a single month partition.
Some things are instant and some things take a while, it depends on the operation and how many stars are being combined and what's going on.
The same models perform poorly on Teradata, but it is my understanding that if we re-model in 3NF, Teradata parallelization will work a lot better. The Teradata installation is many times more expensive than the SQL Server installation, so it just goes to show how much of a difference modeling, and matching your data and processes to the underlying feature set, can make.
Without knowing more about your data, and how it's currently modeled and what indexing choices you've made it's hard to say anything more.