When to separate tables into multiple databases?

I am building a data warehouse. I need to get data from different sources and put it together so that I can generate reports. I will do lots of joining of tables. I am talking about maybe 20 tables total, and each table is going to be anywhere from 100 MB to 5 GB.
I would like to know if I should be creating different databases for each table since each table might have an entirely different TYPE of dataset.
For example, I might have one table that has 1 GB of data about the design of cars. And I will have another table with 3 GB of sales data on these cars.
Would it be appropriate to separate these into different databases?
Please let me know what additional information is needed to advise me on this situation.

If there's a logical or business separation, by all means put them in different databases. That's just clean data application development. However, if you're going to be joining or merging the different data sets, then you can save some overhead and admin costs by having a single database. 20 tables total isn't a lot (I'm working on a system that has about 3700 tables, though ~1600 are audits). Keep in mind SQL Server is meant to scale up to terabytes of data, provided you have a decent model, indexes, etc.
If you're concerned with performance of the warehouse, you can jam that server full of RAM and hard drives. To use the hard drives properly you'd want to look at multiple files/filegroups and doling the tables out appropriately.
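For illustration, a minimal sketch of that filegroup approach (the database, path, and table names here are all hypothetical):

    -- Add a filegroup backed by a file on a separate drive, then place
    -- a large table on it so its I/O lands on that spindle:
    ALTER DATABASE Warehouse ADD FILEGROUP SalesFG;
    ALTER DATABASE Warehouse ADD FILE
        (NAME = SalesData1, FILENAME = 'E:\Data\SalesData1.ndf', SIZE = 10GB)
    TO FILEGROUP SalesFG;

    CREATE TABLE dbo.CarSales
    (
        SaleID   INT IDENTITY PRIMARY KEY,
        CarID    INT NOT NULL,
        SaleDate DATE NOT NULL,
        Amount   MONEY NOT NULL
    ) ON SalesFG;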

Splitting into different databases would normally be done to spread I/O load. In SQL Server you can have different filegroups within the database itself if you want to spread I/O across multiple disk groups/disks. In warehousing scenarios you often deal with SAN solutions for database storage; depending on your scenario, some of these won't care one way or the other performance-wise, while others might give you additional performance if planned properly.
You also have table partitioning, which you can look at for your growing database, but in my opinion, just make sure you have plenty of good old memory; it will benefit you more than spending time and effort worrying about databases and files.
We are running 100 GB databases in a single database file and the performance is stellar. Much of the frequently accessed data resides in memory, though; with a decent table structure and logical indexes you'll have a responsive warehouse in no time.

If you're planning on having foreign key relationships between these tables (and it sounds like you would), then I would keep it all in one database. Typically I use separate databases for totally separate bodies of data.
If you do separate them then you will run into some interesting challenges when you try to query both at the same time.
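For example (hypothetical names), cross-database queries need three-part names, and you lose declared foreign keys across the boundary:

    -- Joining tables that live in two databases on the same server:
    SELECT d.CarModel, s.Amount
    FROM   DesignDB.dbo.CarDesigns AS d
    JOIN   SalesDB.dbo.CarSales    AS s
           ON s.CarID = d.CarID;
    -- Note: FOREIGN KEY constraints cannot reference a table in another
    -- database, so integrity between the two must be policed by hand.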

Potential problems with having a lot of DB tables

Q1: What is the maximum number of tables that can be stored in a database?
Q2: What is the maximum number of tables that can be unioned in a view?
Q1: There's no explicit limit in the docs. In practice, some operations are O(n) in the number of tables; expect planning times to increase, and problems with things like autovacuum, as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion it will result in greatly increased planning times.
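A minimal sketch of that inheritance approach (table names are made up):

    -- Parent plus one child per range; the CHECK constraints let the
    -- planner exclude children, but it must inspect every child's
    -- constraints at plan time, hence the planning cost:
    CREATE TABLE measurements (
        logdate date NOT NULL,
        reading numeric
    );
    CREATE TABLE measurements_2012 (
        CHECK (logdate >= DATE '2012-01-01' AND logdate < DATE '2013-01-01')
    ) INHERITS (measurements);

    SET constraint_exclusion = partition;
    SELECT * FROM measurements WHERE logdate = DATE '2012-06-01';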
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables, or giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this; it's mismodelling the data and will make it harder, not easier, to work with. Indexes, WHERE clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
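As a sketch of what that looks like (columns are invented for illustration), one table holding all numbers replaces the table-per-number design:

    -- One table, with the phone number as a plain indexed column:
    CREATE TABLE calls (
        phone_number text        NOT NULL,
        called_at    timestamptz NOT NULL,
        duration_s   integer
    );
    CREATE INDEX calls_number_idx ON calls (phone_number, called_at);

    -- Per-number access stays cheap via the index, with no per-number tables:
    SELECT * FROM calls
    WHERE  phone_number = '555-0100'
    ORDER  BY called_at DESC;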
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.
Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. The same goes for unions: if you have to union more than a handful of tables, you should probably look at your table structure.
One practical scenario where this can happen is with PostGIS: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of PostGIS, IMHO), but that would typically be handled on the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?

SQL Database Best Practices - Use of Archive tables?

I'm not a trained DBA, but perform some SQL tasks and have this question:
In SQL databases I've noticed the use of archive tables that mimic another table, with the exact same fields, and which are used to accept rows from the original table when that data is deemed ready for archiving. Since I've seen examples where those tables reside in the same database and on the same drive, my assumption is that this was done to increase performance. Such tables didn't have more than about 10 million rows in them...
Why would this be done instead of using a column to designate the status of the row, such as a boolean for an in/active flag?
At what point would this improve performance?
What would be the best pattern to structure this correctly, given that the data may still need to be queried (or unioned with current data)?
What else is there to say about this?
The notion of archiving is a physical, not logical, one. Logically the archive table contains the exact same entity and ought to be the same table.
Physical concerns tend to be pragmatic. The overarching notion is that "the database is getting too big/slow". Archiving records makes it easier to do things like:
Optimize the index structure differently. Archive tables can have more indexes without affecting insert/update performance on the working table. In addition, the indexes can be rebuilt with full pages, while the working table will generally want to have pages that are 50% full and balanced.
Optimize storage media differently. You can put the archive table on slower/less expensive disk drives that maybe have more capacity.
Optimize backup strategies differently. Working tables may require hot backups or log shipping while archive tables can use snapshots.
Optimize replication differently, if you are using it. If an archive table is only updated once per day via nightly batch, you can use snapshot as opposed to transactional replication.
Different levels of access. Perhaps you want different security access levels for the archive table.
Lock contention. If your working table is very hot, you'd rather have your MIS developers hit the archive table, where they are less likely to halt your operations when they run something and forget to specify dirty-read semantics.
The best practice would be not to use archive tables, but to move the data from the OLTP database to an MIS database, data warehouse, or data marts with denormalized data. But some organizations will have trouble justifying the cost of an additional DB system (they aren't cheap). There are far fewer hurdles to adding an additional table to an existing DB.
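If you do end up with an archive table anyway, here is a hedged sketch of the move (table and column names are hypothetical); SQL Server's OUTPUT clause makes it a single atomic statement, and a UNION ALL view covers the "still need to query both" case:

    -- Move rows older than two years into the archive in one statement
    -- (assumes dbo.OrdersArchive already exists with the same columns):
    DELETE FROM dbo.Orders
    OUTPUT deleted.* INTO dbo.OrdersArchive
    WHERE  OrderDate < DATEADD(YEAR, -2, GETDATE());
    GO
    -- Reports that need current plus archived data query the view:
    CREATE VIEW dbo.OrdersAll AS
    SELECT * FROM dbo.Orders
    UNION ALL
    SELECT * FROM dbo.OrdersArchive;
    GO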
I say this frequently, but...
Having multiple tables of identical structure almost never makes sense.
A status flag is a much better idea. There are proper ways to increase performance (partitioning/indexing) without denormalizing data or otherwise creating redundancies. 10 million records is pretty small in the world of modern RDBMSs, so what you're seeing is the product of poor planning or a misunderstanding of databases.
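A sketch of the flag-plus-index alternative (SQL Server 2008+ filtered index; names are hypothetical):

    -- Mark rows instead of moving them, and index only the active ones:
    ALTER TABLE dbo.Orders ADD IsActive bit NOT NULL DEFAULT 1;

    CREATE INDEX IX_Orders_Active
        ON dbo.Orders (OrderDate)
        WHERE IsActive = 1;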

Operational database schema to data mart schema, table reduction?

I'm starting to study SQL Server Analysis Services and I'm working my way through the training book, as well as the Developer Training Kit. In both, I find suggestions that the number of tables used in an OLAP database (ideally, star schema) is greatly reduced from the production OLTP database.
From the training kit:
We followed the data dimensional methodology to architect the data mart schema. From some 200 tables in the operational database, the data mart schema contained about 10 dimension tables and 2 fact tables.
From what I understand, the operational databases are usually (somewhat) normalised and the data mart schemas are heavily denormalised. I also believe that denormalising data usually involves adding more tables, not fewer.
I can't see how you can go from 200 tables to 12, unless you only need to report on a subset of data. And if you do only need to report on a subset of data, why can't you just use the appropriate tables in the operational database (unless there are significant performance gains to be made by using a denormalised star schema)?
Denormalizing is exactly the opposite of normalizing a database. In a normalized database, everything is split apart into different tables to support concurrent writes to the data. This also has the side effect of storing any given piece of data exactly once (in an ideal third-normal-form data structure). A drawback of normalizing is that reads take a lot longer, because the data is scattered and we need to join tables to make sense of it again (joins are pretty expensive operations).
When we denormalize, we take the data from multiple tables and merge it into one table. So now we have repeating data in these tables. The repeating data is useful because we no longer have to join to any other table to get it. Writing to such a data store is normally a bad idea, because a single change could mean a lot of writes to update all of the repeated data in a table, whereas it would take only one write in a normalized database.
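A small illustration of that trade-off (all table and column names invented):

    -- Normalized: the report is assembled from three tables at read time.
    SELECT c.CustomerName, p.ProductName, o.Quantity
    FROM   Orders    AS o
    JOIN   Customers AS c ON c.CustomerID = o.CustomerID
    JOIN   Products  AS p ON p.ProductID  = o.ProductID;

    -- Denormalized: one scan of one wide table, but customer and product
    -- values repeat on every row and each change touches many rows.
    SELECT CustomerName, ProductName, Quantity
    FROM   OrdersFlat;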
OLTP stands for Online Transactional Processing; notice the word Transactional. Transactions are write operations, and the OLTP model is optimized for them. OLAP stands for Online Analytical Processing, Analysis being the keyword, meaning lots of reads.
Going from 200 tables to 12 in an OLTP-to-OLAP process will, surprisingly, retain nearly all of the data in the OLTP database, plus more. The OLTP database is unable to record all of the changes over time, but OLAP specializes in this, so you get all of your historical data as well as current data.
The star schema is probably the most common for OLAP data stores, the snowflake schema is also pretty common. You should learn about both and how to properly use them. It's just another great tool in your arsenal.
These two books from IBM will answer your questions much more thoroughly, and they are free PDFs.
http://www.redbooks.ibm.com/abstracts/sg247138.html
http://www.redbooks.ibm.com/abstracts/sg242238.html

Efficient Ad-hoc SQL OLAP Structure

Over the years I have read a lot of people's opinions on how to get better performance out of their SQL (Microsoft SQL Server, just so we are all on the same page...) queries. However, they all seem to be tightly tied to either a high-performance OLTP setup or a data-warehouse OLAP setup (cubes galore...). My situation today, however, is kind of in the middle of the two, hence my indecision.
I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating to just the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating to just the sites. Basically it's a pretty big flattened table or view… Most of the columns are int, bit, char(3), or short varchar(s). My problem is that a good portion of these columns are available to be used in ad-hoc queries by the user, and those queries need to run as quickly as possible because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast… This data is read-only; the data doesn't change at all during the day and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation as an OLAP database with the read requirements of an OLTP database.
I see three options:
1. Break the table into smaller divisible units and sub-query everything.
2. Make one flat table and really go to town on the indexing.
3. Create an OLAP cube and sub-query the rest based on whatever filter values I don't put in as cube dimensions.
I have not done much with OLAP cubes, so I frankly don't even know if the third is an option, but from what I've done with them in the past I think it might be. Also, just to clarify: what I mean when I say "sub-query everything" is that instead of having a WHERE clause on the outer SELECT, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed, to eliminate a really large Cartesian product. As for the second option of the one large table, I have heard and seen conflicting results with that approach, as it will save on joins, but at the same time a table scan takes much longer.
Ideas anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea if that’s the case, I’m new to that stuff too.
Thanks in advance to any and all opinions and help with this dilemma I’ve found myself in.
You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.
In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though; the fact may simply be the association of sites to contacts.
The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
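As a rough sketch under those assumptions (all names invented), the schema might look like:

    -- Two dimensions and a fact table recording their association:
    CREATE TABLE DimContact
    (
        ContactKey  INT PRIMARY KEY,
        ContactName VARCHAR(100)
        -- ...the other ~50 contact attributes...
    );
    CREATE TABLE DimSite
    (
        SiteKey  INT PRIMARY KEY,
        SiteName VARCHAR(100)
        -- ...the other ~150 site attributes...
    );
    CREATE TABLE FactSiteContact
    (
        SiteKey    INT NOT NULL REFERENCES DimSite (SiteKey),
        ContactKey INT NOT NULL REFERENCES DimContact (ContactKey),
        PRIMARY KEY (SiteKey, ContactKey)
    );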
It wouldn't be a very big problem to add numerous indexes to this architecture. The database is mostly read-only, except at refresh time, so you won't have to worry about read performance while indexes are being updated. The architecture can therefore accommodate all the indexes that are needed (as long as you can dedicate enough downtime to refreshing the data).
I agree with bobs' answer: throw an OLAP front end on it and query through the cube. The reason this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates across multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.
The relational data underneath the cube will be great for detail drill-ins to find the individual facts behind a given aggregate value. But querying the relational data directly will always be slow, because the aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.
OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.
What are the most common queries? For single pieces of data or aggregates?
If the granularity of SiteContacts is pretty close to that of Contacts (i.e. circa 3 million records, with most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).
On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.
OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.
Star schemas consist of fact tables with dimensions hanging off them. Depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions linked by a factless fact table (sounds like an oxymoron, but it is covered in Kimball's methodology).

What is the best way to partition large tables in SQL Server?

In a recent project the "lead" developer designed a database schema where "larger" tables would be split across two separate databases, with a view on the main database unioning the two separate database tables together. The main database is what the application was driven off of, so these tables looked and felt like ordinary tables (except for some quirky things around updating). This seemed like a HUGE performance problem. We do see performance problems around these tables, but nothing that makes him change his mind about his design. I'm just wondering what the best way to do this is, or whether it is even worth doing.
I don't think that you are really going to gain anything by partitioning the table across multiple databases on a single server. All you have essentially done is increase the overhead of working with the "table" in the first place, by having several instances of it (i.e. open in two different DBs) under a single SQL Server instance.
How large of a dataset do you have? I have a client with a 6 million row table in SQL Server that contains 2 years' worth of sales data. They use it transactionally and for reporting without any noticeable speed problems.
Tuning the indexes and choosing the correct clustered index is crucial to performance of course.
If your dataset is really large and you are looking to partition, you will get more bang for your buck partitioning the table across physical servers.
Partitioning is not something to be undertaken lightly as there can be many subtle performance implications.
My first question is are you referring simply to placing larger table objects in separate filegroups (on separate spindles) or are you referring to data partitioning inside of a table object?
I suspect that the situation described is an attempt to have the physical storage of certain large tables on different spindles from the rest of the tables. In this case, adding the extra overhead of separate databases, losing any ability to enforce referential integrity across databases, and taking on the security implications of enabling cross-database ownership chaining provide no benefit over using multiple filegroups within a single database. If, as is quite possible, the separate databases you refer to are not even stored on separate spindles but all sit on the same spindle, then you negate even the slight performance benefit you could have gained by physically separating your disk activity, and you receive absolutely no benefit.
I would suggest that, instead of using additional databases to hold large tables, you look into the Filegroup topic in SQL Server Books Online.
If you are interested in data partitioning (including partitioning into multiple filegroups), then I recommend reading articles by Kimberly Tripp, who gave an excellent presentation when SQL Server 2005 came out about the improvements available there. A good place to start is her whitepaper on the subject.
Which version of SQL Server are you using? SQL Server 2005 has partitioned tables, but in 2000 (or 7.0) you needed to use partition views.
Also, what was the reasoning for putting the table partitions in a separate database?
When I've had to partition tables in the past (pre-2005), it was usually by a date column or something similar, with a view over the various partitions. Books Online has a section that talks about how to do this and all of the rules around it. You need to follow the rules to make it work the way it's supposed to.
The key thing to remember is that your partitioning column must be part of the primary key and you want to try to always use that column in any access against the table so that the optimizer can ignore partitions that shouldn't be affected by the query.
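A minimal sketch of such a partitioned view (hypothetical tables; note the partitioning column in each primary key):

    CREATE TABLE dbo.Sales2007
    (
        SaleYear INT NOT NULL CHECK (SaleYear = 2007),
        SaleID   INT NOT NULL,
        Amount   MONEY,
        PRIMARY KEY (SaleYear, SaleID)
    );
    CREATE TABLE dbo.Sales2008
    (
        SaleYear INT NOT NULL CHECK (SaleYear = 2008),
        SaleID   INT NOT NULL,
        Amount   MONEY,
        PRIMARY KEY (SaleYear, SaleID)
    );
    GO
    CREATE VIEW dbo.Sales AS
        SELECT * FROM dbo.Sales2007
        UNION ALL
        SELECT * FROM dbo.Sales2008;
    GO
    -- The CHECK constraints let the optimizer touch only dbo.Sales2008:
    SELECT * FROM dbo.Sales WHERE SaleYear = 2008;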
Look up "partitioned table" in MSDN and you should be able to find a more complete tutorial for SQL Server 2005 partitioned tables as well as advice on how to set them up for maximum performance.
Are you asking about best practices in terms of database design, or convincing your lead to change his mind? :)
In terms of design... Back in the goode olde days, vertical partitioning was sometimes needed to work around database engine limitations, where the number of columns in a table was a hard limit, like 255 columns. These days the main benefits are purely for performance: putting rarely used columns, or blobs on a separate disk array. But if you're regularly pulling things from both tables it will likely be a loss. It sounds like your lead is suffering from a case of premature optimisation.
In terms of telling your lead he's wrong... that requires diplomacy. If he's aware of mutterings of discontent in terms of performance, a benchmark is probably the best way to show the difference.
Create a new physical table somewhere with 'SELECT * INTO t1 FROM view1' (SQL Server has no CREATE TABLE AS SELECT) and then run some lengthy batch against both the vertically partitioned table and your new table. If it's as bad as you say, the difference should be evident.
But this too may be premature optimisation. Find out what the end-users think of the performance. If the performance is good enough, for some definition of good, then don't fix what ain't broke.
There is a definite benefit to table partitioning (regardless of whether it's on the same or different filegroups/disks). If the partition column is correctly selected, you'll find that your queries hit only the required partition. So imagine you have 100 million records (I've partitioned tables much bigger than that, about 20+ billion rows): if, for the most part, more than 70% of your data access is only a certain category, timeline, or type of data, then it helps to keep the most accessed data in a separate partition. Plus you can align the partitions with separate filegroups on various types of disk (SATA, Fibre Channel, SSDs) so that the most accessed/busy data is on the fastest storage and the rarely accessed data is on slower disks.
SQL Server's partitioning ability is limited compared to Oracle's, though: you can choose only one column for partitioning (even in SQL 2008). So you have to choose that column wisely, so that it is also part of most of your frequent queries. For the most part, people find it easy to partition by a date column. However, although it seems logical to partition that way, if your queries do not have that column as part of the condition, you won't gain sufficient benefit from partitioning (in other words, your query will hit all the partitions regardless).
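A hedged sketch of 2005-style partitioning by a date column (names, boundaries, and filegroups are all invented):

    -- One partition function, a scheme mapping ranges to filegroups,
    -- and a table created on that scheme:
    CREATE PARTITION FUNCTION pfOrderDate (DATETIME)
        AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01');

    CREATE PARTITION SCHEME psOrderDate
        AS PARTITION pfOrderDate TO (ArchiveFG, ArchiveFG, CurrentFG);

    CREATE TABLE dbo.Orders
    (
        OrderID   INT      NOT NULL,
        OrderDate DATETIME NOT NULL,
        Amount    MONEY,
        PRIMARY KEY (OrderDate, OrderID)  -- partitioning column included
    ) ON psOrderDate (OrderDate);

    -- Queries filtering on OrderDate hit only the relevant partition:
    SELECT SUM(Amount) FROM dbo.Orders
    WHERE OrderDate >= '2008-01-01';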
It's much easier to partition for data warehouse/data mining type databases than OLTP as most DW database queries are limited by time period.
That's why, these days, given the volume of data being handled by databases, it's wise to design the application in such a way that every query is limited by some broader group, such as time or geographical location, so that when such columns are chosen for partitioning you'll gain maximum benefit.
I would disagree with the assumption that nothing can be gained by partitioning.
If the partition data is physically and logically aligned, then the potential IO of queries should be dramatically reduced.
For example, we have a table which has a batch field stored as an INT.
If we partition the data by this field and then re-run a query for a particular batch, we should be able to run SET STATISTICS IO ON before and after partitioning and see a reduction in IO.
If we have a million rows per partition, and each partition is written to a separate device, the query should be able to eliminate the nonessential partitions.
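Something like the following would show it (the table name and batch value are hypothetical):

    SET STATISTICS IO ON;
    -- Compare the logical reads reported in the Messages pane before
    -- and after partitioning; with elimination, only the pages of
    -- batch 42's partition should be read.
    SELECT COUNT(*) FROM dbo.BatchData WHERE BatchID = 42;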
I've not done a lot of partitioning on SQL Server, but I do have experience of partitioning on Sybase ASE, where this is known as partition elimination. When I have time I'm going to test out the scenario on a SQL Server 2005 machine.