Why are joins bad when considering scalability? - sql

Why are joins bad or 'slow'? I know I've heard this more than once. I found this quote:
The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again.
source
I always thought they were fast, especially when looking up a PK. Why are they 'slow'?

Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and the things you actually do you make sure are done as efficiently as possible.
In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.
But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. There are a number of scalability-friendly ways to take raw joins and make them much faster:
Join on a surrogate key (autonumber/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation.
Indexes
Materialized/indexed views (think of this as a pre-computed join or managed de-normalization; see the sketch after this list)
Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as generic denormalization)
Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.
Use of a caching layer like Redis to avoid re-running queries which need complex joins.
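To make the materialized/indexed view idea concrete, here is a minimal sketch in SQL Server's indexed-view syntax; the Orders/Customers tables and all names here are hypothetical, and other engines (e.g. Oracle's materialized views) spell this differently:
-- The view defines the join; SCHEMABINDING and COUNT_BIG(*) are required
-- by SQL Server before a view can be indexed
CREATE VIEW dbo.OrderSummary
WITH SCHEMABINDING
AS
SELECT o.CustomerId, c.CustomerName, COUNT_BIG(*) AS OrderCount
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId
GROUP BY o.CustomerId, c.CustomerName;

-- The unique clustered index is what persists the join results on disk,
-- so later queries read them back without re-joining
CREATE UNIQUE CLUSTERED INDEX IX_OrderSummary
ON dbo.OrderSummary (CustomerId, CustomerName);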
I would go as far as saying the main reason relational databases exist at all is to allow you to do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query - just as if you had denormalized the data (admittedly at the cost of slower write operations).
If you have a slow join, you're probably not using your database correctly.
De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.
* That is, exist as entities distinct from mere collections of tables. An additional reason for a real rdbms is safe concurrent access.

Joins can be slower than avoiding them through de-normalisation, but if used correctly (joining on columns with appropriate indexes and so on) they are not inherently slow.
De-normalisation is one of many optimisation techniques you can consider if your well designed database schema exhibits performance problems.

The article says that joins are slow compared to the absence of joins, which can be achieved through denormalization. So there is a trade-off between speed and normalization. Don't forget about premature optimization, also :)

First of all, a relational database's raison d'etre (reason for being) is to be able to model relationships between entities. Joins are simply the mechanisms by which we traverse those relationships. They certainly do come at a nominal cost, but without joins, there really is no reason to have a relational database.
In the academic world we learn of things like the various normal forms (1st, 2nd, 3rd, Boyce-Codd, etc.), and we learn about different types of keys (primary, foreign, alternate, unique, etc.) and how these things fit together to design a database. And we learn the rudiments of SQL as well as manipulating both structure and data (DDL & DML).
In the corporate world, many of the academic constructs turn out to be substantially less viable than we had been led to believe. A perfect example is the notion of a primary key. Academically it is that attribute (or collection of attributes) that uniquely identifies one row in the table. So in many problem domains, the proper academic primary key is a composite of 3 or 4 attributes. However, almost everyone in the modern corporate world uses an auto-generated, sequential integer as a table's primary key. Why? Two reasons. The first is because it makes the model much cleaner when you're migrating FKs all over the place. The second, and most germane to this question, is that retrieving data through joins is faster and more efficient on a single integer than it is on 4 varchar columns (as already mentioned by a few folks).
Let's dig a little deeper now into two specific subtypes of real world databases. The first type is a transactional database. This is the basis for many e-commerce or content management applications driving modern sites. With a transaction DB, you're optimizing heavily toward "transaction throughput". Most commerce or content apps have to balance query performance (from certain tables) with insert performance (in other tables), though each app will have its own unique business driven issues to solve.
The second type of real world database is a reporting database. These are used almost exclusively to aggregate business data and to generate meaningful business reports. They are typically shaped differently than the transaction databases where the data is generated and they are highly optimized for speed of bulk data loading (ETLs) and query performance with large or complex data sets.
In each case, the developer or DBA needs to carefully balance both the functionality and performance curves, and there are lots of performance enhancing tricks on both sides of the equation. In Oracle you can do what's called an "explain plan" so you can see specifically how a query gets parsed and executed. You're looking to maximize the DB's proper use of indexes. One really nasty no-no is to put a function in the where clause of a query. Whenever you do that, you guarantee that Oracle will not use any indexes on that particular column and you'll likely see a full or partial table scan in the explain plan. That's just one specific example of how a query could be written that ends up being slow, and it doesn't have anything to do with joins.
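To illustrate that no-no (this is a hypothetical employees table, not something from the original answer): the first form hides hire_date inside a function, so an index on that column cannot be used; the second expresses the same filter as a range the index can serve.
-- An index on hire_date is ignored: the function must run against every row
SELECT employee_id FROM employees WHERE TRUNC(hire_date) = DATE '2000-01-15';

-- Rewritten as a range on the bare column, the same index becomes usable
SELECT employee_id FROM employees
WHERE hire_date >= DATE '2000-01-15'
  AND hire_date <  DATE '2000-01-16';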
And while we're talking about table scans, they obviously impact the query speed proportionally to the size of the table. A full table scan of 100 rows isn't even noticeable. Run that same query on a table with 100 million rows, and you'll need to come back next week for the return.
Let's talk about normalization for a minute. This is another largely positive academic topic that can get over-stressed. Most of the time when we talk about normalization we really mean the elimination of duplicate data by putting it into its own table and migrating an FK. Folks usually skip over the whole dependence thing described by 2NF and 3NF. And yet in an extreme case, it's certainly possible to have a perfect BCNF database that's enormous and a complete beast to write code against because it's so normalized.
So where do we balance? There is no single best answer. All of the better answers tend to be some compromise between ease of structure maintenance, ease of data maintenance and ease of code creation/maintenance. In general, the less duplication of data, the better.
So why are joins sometimes slow? Sometimes it's bad relational design. Sometimes it's ineffective indexing. Sometimes it's a data volume issue. Sometimes it's a horribly written query.
Sorry for such a long-winded answer, but I felt compelled to provide a meatier context around my comments rather than just rattle off a 4-bullet response.

People with terabyte-sized databases still use joins; if they can get them to work performance-wise, then so can you.
There are many reasons not to denormalize. First, speed of select queries is not the only or even the main concern with databases. Integrity of the data is the first concern. If you denormalize, then you have to put techniques into place to keep the denormalized data in sync as the parent data changes. So suppose you take to storing the client name in all tables instead of joining to the client table on the client_Id. Now when the name of the client changes (100% chance some of the names of clients will change over time), you need to update all the child records to reflect that change. If you do this with a cascading update and you have a million child records, how fast do you suppose that is going to be, and how many users are going to suffer locking issues and delays in their work while it happens? Further, most people who denormalize because "joins are slow" don't know enough about databases to properly make sure their data integrity is protected, and they often end up with databases that have unusable data because the integrity is so bad.
Denormalization is a complex process that requires a thorough understanding of database performance and integrity if it is to be done correctly. Do not attempt to denormalize unless you have such expertise on staff.
Joins are quite fast enough if you do several things. First, use a surrogate key; an int join is almost always the fastest join. Second, always index the foreign key. Use derived tables or join conditions to create a smaller dataset to filter on. If you have a large, very complex database, then hire a professional database person with experience in partitioning and managing huge databases. There are plenty of techniques to improve performance without getting rid of joins.
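A minimal sketch of those first two suggestions, in SQL Server-flavoured DDL with made-up table names:
-- Surrogate integer key on the parent table
CREATE TABLE clients (
    client_id INT IDENTITY(1,1) PRIMARY KEY,
    client_name VARCHAR(200) NOT NULL
);

-- Child table referencing it; the FK column gets an explicit index,
-- since most databases do not index foreign keys automatically
CREATE TABLE orders (
    order_id INT IDENTITY(1,1) PRIMARY KEY,
    client_id INT NOT NULL REFERENCES clients (client_id)
);
CREATE INDEX ix_orders_client_id ON orders (client_id);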
If you just need query capability, then yes, you can design a data warehouse, which can be denormalized and is populated through an ETL tool (optimized for speed), not user data entry.

Joins are slow if
the data is improperly indexed
the results are poorly filtered
the joining query is poorly written
the data sets are very large and complex
So, true, the bigger your data sets the more processing you'll need for a query, but checking and working on the first three options above will often yield great results.
Your source gives denormalization as an option. This is fine only as long as you've exhausted better alternatives.

Joins can be slow if large portions of the records from each side need to be scanned.
Like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
Even if an index is defined on account_customer, all records from the latter still need to be scanned.
For a query like this, a decent optimizer probably won't even consider the index access path, doing a HASH JOIN or a MERGE JOIN instead.
Note that for a query like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
WHERE customer_last_name = 'Stellphlug'
the join will most probably be fast: first, an index on customer_last_name will be used to filter all the Stellphlugs (who are, of course, not very numerous), then an index scan on account_customer will be issued for each Stellphlug to find his transactions.
Even though there can be billions of records in accounts and customers, only a few will actually need to be scanned.
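For the optimizer to take that path, the two supporting indexes would look roughly like this (the index names are illustrative):
CREATE INDEX ix_customers_last_name ON customers (customer_last_name);
CREATE INDEX ix_accounts_customer ON accounts (account_customer);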

Joins are fast. Joins should be considered standard practice with a properly normalized database schema. Joins allow you to join disparate groups of data in a meaningful way. Don't fear the join.
The caveat is that you must understand normalization, joining, and the proper use of indexes.
Beware premature optimization, as the number one failing of all development projects is failing to meet the deadline. Once you've completed the project, and you understand the trade-offs, you can break the rules if you can justify it.
It's true that join performance degrades non-linearly as the size of the data set increases. Therefore, it doesn't scale as nicely as single table queries, but it still does scale.
It's also true that a bird flies faster without any wings, but only straight down.

Joins do require extra processing since they have to look in more files and more indexes to "join" the data together. However, "very large data sets" is all relative. What is the definition of large? In the case of JOINs, I think it's a reference to a large result set, not the overall dataset.
Most databases can very quickly process a query that selects 5 records from a primary table and joins 5 records from a related table for each record (assuming the correct indexes are in place). These tables can have hundreds of millions of records each, or even billions.
Once your result set starts growing, things are going to slow down. Using the same example, if the primary table results in 100K records, then there will be 500K "joined" records that need to be found. Just pulling that much data out of the database will add delays.
Don't avoid JOINs, just know you may need to optimize/denormalize when datasets get "very large".

Also from the article you cited:
What many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme and some are even advocating denormalization as the best strategy for architecting the data tier.
and
And unless you are a really large website you probably don't need to worry about this level of complexity.
and
It's more error prone than having the database do all this work, but you are able to scale past what even the highest end databases can handle.
The article is discussing mega-sites like Ebay. At that level of usage you are likely going to have to consider something other than plain vanilla relational database management. But in the "normal" course of business (applications with thousands of users and millions of records) those more expensive, more error prone approaches are overkill.

Joins are considered an opposing force to scalability because they're typically the bottleneck and they cannot be easily distributed or parallelized.

Properly designed tables with the proper indices and correctly written queries are not always slow. Wherever you heard that:
Why are joins bad or 'slow'
they had no idea what they were talking about!!! Most joins will be very fast. If you have to join many, many rows at one time, you might take a hit compared to a denormalized table, but that goes back to properly designed tables: know when to denormalize and when not to. In a heavy reporting system, break out the data into denormalized tables for reports, or even create a data warehouse. In a transaction-heavy system, normalize the tables.

The amount of temporary data that is generated can be huge, depending on the joins.
For an example, one database here at work had a generic search function where all of the fields were optional. The search routine did a join on every table before the search began. This worked well in the beginning. But, now that the main table has over 10 million rows... not so much. Searches now take 30 minutes or more.
I was tasked with optimizing the search stored procedure.
The first thing I did was: if any of the fields of the main table were being searched, I did a select into a temp table on those fields only. THEN, I joined all the tables with that temp table before doing the rest of the search. Searches where one of the main table fields is specified now take less than 10 seconds.
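A rough T-SQL sketch of that approach; every object name here is a hypothetical stand-in for the real search procedure, and @search_value stands for one of its parameters:
-- Step 1: reduce the 10-million-row main table to just the matching keys
SELECT m.main_id
INTO #filtered
FROM main_table m
WHERE m.search_field = @search_value;

-- Step 2: run the joins against the small temp table instead of the full table
SELECT m.main_id, a.detail_a, b.detail_b
FROM #filtered f
JOIN main_table m ON m.main_id = f.main_id
JOIN table_a a ON a.main_id = m.main_id
JOIN table_b b ON b.main_id = m.main_id;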
If none of the main table fields are being searched, I do similar optimizations for other tables. When I was done, no search takes longer than 30 seconds, with most under 10.
CPU utilization of the SQL server also went WAY DOWN.

While joins (presumably due to a normalized design) can obviously be slower for data retrieval than a read from a single table, a denormalized database can be slow for data creation/update operations since the footprint of the overall transaction will not be minimal.
In a normalized database, a piece of data will live in only one place, so the footprint for an update will be as minimal as possible. In a denormalized database, it's possible that the same column in multiple rows or across tables will have to be updated, meaning the footprint would be larger and chance of locks and deadlocks can increase.

Well, yeah, selecting rows from one denormalized table (assuming decent indexes for your query) might be faster than selecting rows constructed from joining several tables, particularly if the joins don't have efficient indexes available.
The examples cited in the article - Flickr and eBay - are exceptional cases IMO, so have (and deserve) exceptional responses. The author specifically calls out the lack of RI and the extent of data duplication in the article.
Most applications - again, IMO - benefit from the validation & reduced duplication provided by RDBMSs.

They can be slow if done sloppily. For example, if you do a 'select *' on a join, you will probably take a while to get stuff back. However, if you carefully choose what columns to return from each table, and with the proper indexes in place, there should be no problem.

Related

What aspects of a SQL query are relatively costly to one another? Joins? Number of records? Columns selected?

How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
If you have a SQL query that has two or three tables joined together and is retrieving 100 rows of data, does performance suggest I should select only the columns I need? Or should I write a query that just yanks all the columns..
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Would 1 record vs 10 record vs 100 record matter?
As an extremely generalized version of ranking those factors you mention in terms of performance penalty and occurrence in the queries you write, I would say:
Joins - Especially when joining on tables with no indexes for the fields you're joining on and/or with tables that have a very large amount of data.
# of Rows / Amount of Data - Again, indexes mitigate this quite a bit, just make sure you have the right ones.
# of Fields - I would say the # of fields in the SELECT clause impact performance the least in most situations.
I would say any performance-driving property is always coupled with how much data you have - sure a join might be fast when your tables have 100 rows each, but when millions of rows are in the tables, you have to start thinking about more efficient design.
Several things impact the cost of a query.
First, are there appropriate indexes for it to use? Fields that are used in a join should almost always be indexed, and foreign keys are not indexed by default; the designer of the database must create those indexes. Fields used in the where clauses often need indexes as well.
Next, is the where clause sargable, in other words can it use the indexes even if you have the correct ones? A bad where clause can hurt a query far more than joins or extra columns. You can't get anything but a table scan if you use syntax that prevents the use of an index such as:
LIKE '%test'
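For contrast, a minimal sketch with a hypothetical products table; only the second form lets an index on product_name seek to the matching rows:
-- Not sargable: the leading wildcard defeats any index on product_name
SELECT product_id FROM products WHERE product_name LIKE '%test';

-- Sargable: an index on product_name can seek straight to the 'test' prefix
SELECT product_id FROM products WHERE product_name LIKE 'test%';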
Next, are you returning more data than you need? You should never return more columns than you need, and you should not be using select * in production code, as it does additional work to look up the columns as well as being very fragile and liable to create bad bugs as the structure changes over time.
Are you joining to tables you don't need to be joining to? If a table returns no columns in the select, is not used in the where, and doesn't filter out any records if the join is removed, then you have an unnecessary join and it can be eliminated. Unnecessary joins are particularly prevalent when you use a lot of views, especially if you make the mistake of calling views from other views (which is a big performance killer for many reasons). Sometimes if you trace through these views that call other views, you will see the same table joined multiple times when it would not have been necessary if the query was written from scratch instead of using a view.
Not only does returning more data than you need cause the SQL Server to work harder, it causes the query to use up more of the network resources and more of the memory of the web server if you are holding the results in memory. It is an all-around poor choice.
Finally, are you using known poorly performing techniques when a better one is available? This would include the use of cursors when a set-based alternative is better, the use of correlated subqueries when a join would be better, the use of scalar user-defined functions, and the use of views that call other views (especially if you nest more than one level). Most of these poor techniques involve processing row-by-agonizing-row, which is generally the worst choice in a database. To properly query databases you need to think in terms of data sets, not processing one row at a time.
There are plenty more things that affect performance of queries and the database; to truly get a grip on this subject you need to read some books on the subject. This is too complex a subject to fully discuss in a message board.
Or should I write a query that just yanks all the columns..
No. Just today there was another question about that.
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Any useless join or data retrieval costs you time and should be avoided. Retrieving rows from a datastore is costly. Joins can be more or less costly depending on the context and the indexes defined... you can examine the query plan of each query to see the estimated cost for each step.
Selecting more columns/rows will have some performance impacts, but honestly why would you want to select more data than you are going to use anyway?
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another?
Build the query you need, THEN worry about optimizing it if the performance doesn't meet your expectations. You are putting the cart before the horse.
To answer the following:
How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
This is not a matter of the select performance but the amount of time it takes to fetch the data. Select * from Table and Select ID from Table perform the same, but the fetch of the data will take longer. This goes hand in hand with the number of rows returned from a query.
As for understanding performance, here is a good link:
http://www.dotnetheaven.com/UploadFile/skrishnasamy/SQLPerformanceTunning03112005044423AM/SQLPerformanceTunning.aspx
Or google tsql Performance
Joins have the potential to be expensive. In the worst case scenario, when no indexes can be used, they require O(M*N) time, where M and N are the number of records in the tables. To speed things up, you can CREATE INDEX on columns that are part of the join condition.
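For example (hypothetical customers/orders tables), an index on the join column lets the engine do an index lookup per outer row instead of the worst-case O(M*N) nested scan:
-- Index the column the join condition uses
CREATE INDEX ix_orders_customer_id ON orders (customer_id);

-- Each customer row now triggers an index seek, not a scan of all orders
SELECT c.customer_name, o.order_total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;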
The number of columns has little effect on the time required to find rows, but slows things down by requiring more data to be sent.
What others are saying is all true.
But typically, if you are working with tables that already have good indexes, what's most important for performance is what goes into the WHERE clause. There you have to worry more about using a field that has no index or using a statement that can't be optimized.
The difference between SELECT One, Two, Three FROM ... and SELECT One,...,N FROM ... could be like the difference between day and night. To understand the problem, you need to understand the concept of a covering index:
A covering index is a special case where the index itself contains the required data field(s) and can return the data.
As you add more unnecessary columns to the projection list you are forcing the query optimizer to look up the newly added columns in the 'table' (really in the clustered index or in the heap). This can change an execution plan from an efficient narrow index range scan or seek into a bloated clustered index scan, which can result in differences of time from sub-second to hours, depending on your data. So projecting unnecessary columns is often the most impactful factor of a query.
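As a sketch of such a covering index, using the column names from the question and SQL Server's INCLUDE syntax (the table name t is made up):
-- Answers SELECT One, Two, Three FROM t WHERE One = @x entirely from the
-- index: no lookup into the clustered index or heap is needed
CREATE NONCLUSTERED INDEX ix_t_one_covering
ON t (One)
INCLUDE (Two, Three);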
The number of records pulled is a more subtle issue. With a large number, a query can hit the index tipping point and choose, again, a clustered index scan over a narrower index range scan and lookup. Now the fact that lookups into the clustered index are necessary to start with means the narrow index is not covering, which ultimately may be caused by projecting unnecessary columns.
And finally, joins. The question here is joins, as opposed to what else? If a join is required, there is no alternative, and that's all there is to say about this.
Ultimately, query performance is driven by one factor alone: amount of IO. And the amount of IO is driven ultimately by the access paths available to satisfy the query. In other words, by the indexing of your data. It is impossible to write efficient queries on bad indexes. It is possible to write bad queries on good indexes, but more often than not the optimizer can compensate and come up with a good plan. You should spend all your effort in better understanding index design:
Designing Indexes
SQL Server Optimization
Short answer: Don't select more fields than you need - search for "*" in both your source code and your stored procedures ;)
You always have to consider which parts of the query will cause which costs.
If you have a good DB design, joining a few tables is usually not expensive. (Make sure you have correct indices).
The main issue with "select *" is that it will cause unpredictable behavior in your results. If you write a query like that, AND access the fields by column index, you will be locked into the DB schema forever.
Another thing to consider is the amount of data you return. You might think it's trivial, but version 2.0 of your application suddenly adds a ProfilePicture to the User table. And now the query that selects 100 users will suddenly use up several megabytes of bandwidth.
The second thing you should consider is the number of rows you return. SQL is very powerful at sorting and grouping, so let SQL do its job, and don't move it to the client. Limit the amount of records you return; in most applications it makes no sense to return more than 100 rows to a user at once. You might let the user choose to load more, but make it a choice he has to make.
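For example (TOP is SQL Server syntax; other engines use LIMIT, and the Users table here is hypothetical):
-- Return only the first 100 users; the database does the sorting and limiting
SELECT TOP (100) user_id, user_name
FROM Users
ORDER BY user_name;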
Finally, monitor your SQL Server. Run a profiler against it, and try to find your worst queries. A SQL query should not take longer than half a second; if it does, something is most likely messed up (yes... there are operations that can take much longer, but those should have a reason).
Edit:
Once you've found the slow query, look at the execution plan... You will see which parts of the query are expensive, and which parts work well... The optimizer is also a tool that can be used.
I suggest you consider your queries in terms of I/O first. Disk I/O on my SATA II system is 6Gb/sec. My DDR3 memory bandwidth is 12GB/sec. I can move items in memory 16 times faster than I can retrieve from disk. (Ref Wikipedia and Tom's hardware)
The difference between getting a few columns and all the columns for your 100 rows could be the difference between getting a single 8K page from disk and getting two or more pages from disk. When the pages are finally in memory, moving two columns or all columns to a hash table is faster than any measuring tool I have.
I value the advice of the others on this topic related to database design. The design of narrow indexes, using included columns to make covering indexes, avoiding table or index scans in favor of seeks by using an appropriate WHERE clause, narrow primary keys, etc. is the difference between having a DBA title and being a DBA.

Which is more efficient: 2 single-table queries or 1 join query

Say tableA has 1 row to be returned but will have 100 columns returned while tableB has 100 rows to be returned but only one column from each. TableB has a foreign key for table A.
Will a left join of tableA to tableB return 100*100 cells of data, while 2 separate queries return 100 + 100 cells of data (50 times less)? Or is that a misunderstanding of how it works?
Is it ever more efficient to use many simple queries rather than fewer more complex ones?
First and foremost, I would question a table with 100 columns, and suggest that there is possibly a better design for your schema. In the real world, this number of columns is less common, so typically the difference in the amount of data returned with one query vs. two becomes less significant. 100 columns in a table is not necessarily bad, just a flag that it should be considered.
However, assuming your numbers are what they are to make clear the question, there are a few important variables to consider:
1 - What is the speed of the link between the db server and the application server? If it is very slow, then you are probably better off minimizing the amount of data returned vs. the number of queries you run. If it is not slow, then you will likely expend more time in the execution of two queries than you would returning the increased payload. Which is better can only be determined by testing in your own environment.
2 - How efficient is the transport protocol itself? Perhaps there is some kind of compression of the data, or an even more clever algorithm that knows columns 2 through 101 are duplicated for every row, so it only passes them once. Strategies like this in the transport protocol would mitigate any of your concerns. Again, this is why you need to test in your own environment to know for sure.
As others have pointed out, you also need to consider what will be done with the data once you get it (e.g., JOINs, GROUPing, etc), but I am limiting my response to the specifics of your question around query count vs. payload size.
What is best at joining? A database engine or client code? That said, I use both techniques: it depends on the client and how the data will be used.
Where the data requires some processing to, say, render on a web page, I'd probably split header and detail recordsets. We use this because we have some business logic between the DB and the HTML.
Where it's consumed simply and linearly, I'd join in the database to avoid unnecessary processing. For example, simple reports or exports.
It depends. If you only take into account the SQL efficiency, obviously several simpler queries with smaller results will be more efficient.
But you need to take the whole process into account: if the join would otherwise be made on the client, or you need to filter results after the join, then the DBMS will probably be more efficient than doing it in your code.
Coding is always a trade-off between different systems, DB vs Client, RAM vs CPU... you need to be conscious of this and try to find the perfect solution.
In this case, 2 queries probably outperform 1, but that is not a general solution.
I think that your question basically is about database normalization. In general, it is advisable to normalize a database into multiple tables (using primary and foreign keys) and to join them as needed upon queries. This is better for insert/update performance and for keeping the data consistent, and usually results in smaller database sizes as well.
As for the row numbers returned, only a cross join would actually return 100*100 rows; any inner or outer join will not create all combinations, but rather tie together rows on the given conditions, and for outer joins preserve rows which could not be matched. Wikipedia has some samples in its JOIN article.
For very query-intense applications, the performance may be better when using less normalized tables. However, as always with optimizations, I'd only consider going in that direction after seeing real, measurable problems (e.g. with a profiling tool).
In general, try to keep the number of roundtrips to the database low; a large number of single simple queries will suffer from the overhead of talking to the DB engine (network etc.). If you need to execute complex series of statements, consider using stored procedures.
Generally fewer queries makes for better performance, as long as the queries return data that is actually related. There is no point in trying to put unrelated data into the same query just to reduce the number of queries.
There are of course exceptions, and your example may be one of them. However, it depends on more than the number of fields returned, like what the fields actually return, i.e. the actual amount of data.
As an example of how the number of queries affects performance, I can mention a solution that I have (sadly enough) seen many times. In that solution the programmer would first get a number of records from one table, then loop through the records and run another query for each record to get the related records from another table. This clearly results in a lot of queries, and a solution having either one or two queries would be much more efficient.
β€œIs it ever more efficient to use many simple queries rather than fewer more complex ones?”
The query that requires the least amount of data to traverse, and gives you no more than what you need is the more efficient one. Beyond this, there can be RDBMS specific conditions that can be more efficient on one RDBMS system than another. At the very low level, when you deal with less data, then your results can be retrieved much quicker, so efficient queries are queries that only work with the least amount of data needed to get you the result you are looking for.

Are there any performance issues if a SQL query contains a lot of joins?

Are there any performance issues if a SQL query contains a lot of joins?
There can be -- but query performance is a sensitive thing affected by lots of factors:
Number of joins
Structure of tables
Size of the database
Presence of, and data types of, indexes
Data types of values being joined
etc etc
You can get into all sorts of details. But generally the best approach is to write a query that works and then profile your app to see if you actually have a problem. Then, start looking at optimizing your queries.
Yes.
But the biggest issue is: HOW are the tables joined. Suppose you had a query like:
select book.title, chapter.page_count
from chapter
join book on book.bookid=chapter.bookid
where chapter.subject='penguins'
The query would probably read the Chapter table first looking for matches on 'penguins', then join to Book. If Bookid is the primary key of book, or at least is indexed, this would be very fast. But if not, then we would have to do a full-file sequential read of Book. Depending on the engine and other factors, we might have to re-read the entire Book table for each chapter record found. That could take a long long time.
If you join three tables and both the joins require full file reads, you could be in a world of hurt.
Joins always cost you something. But joins that require full-file reads, especially multiple full-file reads, cost a lot. Some database engines mitigate this cost by recognizing it's happening and can load a table into memory and re-use it, generally doing some kind of hash search against it. This is still expensive, but not quite as bad.
Learn to read an Explain plan. These can help a lot in analyzing your queries, figuring out where they're bad, and cleaning them up. Personally, unless a query is obviously simple, like "select whatever from table where primary_key=whatever", I check out the explain plan just to be sure.
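The exact incantation varies by engine (EXPLAIN in MySQL and PostgreSQL, EXPLAIN PLAN FOR in Oracle, SHOWPLAN options in SQL Server); applied to the query above it might look like:
-- MySQL / PostgreSQL style: prefix the query to see the plan, not the rows
EXPLAIN
SELECT book.title, chapter.page_count
FROM chapter
JOIN book ON book.bookid = chapter.bookid
WHERE chapter.subject = 'penguins';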
One of the best ways to boost JOIN performance is to limit how many rows need to be JOINed.
Read more in this article
Performance Tuning SQL Server Joins
Using lots of joins can slow down retrieval performance (though with proper indexing, the penalty is often a lot less than people think -- measure first).
However, people tend to forget that removing the joins often means 'denormalizing' the data, which then incurs costs when the data must be modified. In particular, enforcing the constraints which a fully normalized schema enforces automatically in a schema which is denormalized can be hard. Because it is hard, it is often not done. But when the constraints are not enforced, the data becomes unreliable, and there's one thing worse than (slightly) slow select operations that return the correct answer, and that is fast select operations that return wrong or confusing answers.
If the DBMS is read-mainly - that is, data is written once and seldom if ever modified, then you can consider whether the performance benefit from denormalization makes the risks of inaccurate data creeping into the database acceptable. If the data is mission critical and often updated, then the risk of inaccurate data is usually too serious to be acceptable.
But, as they say, YMMV.
Yes, if you use a lot of joins in SQL it can affect your performance.

Is it better for faster access to split tables and JOIN in a SQL database or leave a few monolithic tables?

I know it's probably not the right way to structure a database, but does the database perform faster if the data is put in one huge table instead of breaking it up logically into other tables?
I want to design and create the database properly, using keys to create relational integrity across tables, but when querying, is JOINing slower than reading the required data from one table? I want to make the database queries as fast as possible.
So many other facets affect the answer to your question. What is the size of the table? Width? How many rows? What is the usage pattern? Are there different usage patterns for different subsets of the columns in the table? (i.e., are two columns hit 1000 times per second, and the other 50 columns only hit once or twice a day?) This scenario would be a prime candidate to split (partition) the table vertically (two columns in one table, the rest in another).
In general, normalize the schema to the maximum degree possible, then run performance testing with typical or predicted loads and usage patterns, and denormalize and partition to the point where the performance becomes acceptable, and no more...
It depends on the dbms flavor and your actual data, of course. But generally more smaller (narrower) tables are faster than fewer larger (wider) tables.
Access is a little slower when joins must be performed. How much slower depends greatly on the features offered by your particular DBMS, and how the physical database design exploits those features, and on the most frequent access patterns. There are a few access patterns where storing a lot of data in one row wastes time, because the entire row is retrieved, but only a little of the row is used. It depends.
When data is stored in a single table and the normalization rules are deviated from, updates are typically slower. How important speed of update is versus speed of query depends on the particular way you use this database.
In general, a lot of newbie database designers tend to put more weight on speed issues than those issues deserve. If your data model is inflexible and incomprehensible, but you gain a 10% speed improvement, you have probably done more harm than good.
Are you building a "read-only" database like a data warehouse? If so, storing data "pre-joined" may make sense. For everyday OLTP databases you need to take into account the performance and ease of inserts, updates and deletes as well. Also, what about queries that only want the data that would have been in one or two of the smaller tables? Now they have to grind through a big fat table full of stuff they don't care about.
It's worth remembering that joining tables is bread-and-butter stuff to a decent DBMS - they are very good at it.
It is often true that querying a single table is faster than querying multiple joined tables. But a normalized design allows you to query the data in multiple ways, with adequate performance across many types of queries.
If you denormalize the tables, you may improve performance of one specific query, while sacrificing performance of other queries against that data. And of course you'll have to manage referential integrity and redundancy manually.
What you're asking about is denormalization - it can speed up reads if done in the right way, and if you are able to ensure that you're not introducing anomalies into your database because of it.
Remember also that there is a hard limit to the amount of data that can be stored in one record (not knowing which database you have, I can't say what it is). Too many columns and you will hit that limit. Also, if you are having columns like phone1, phone2, phone3, then you need to normalize. If you would need to add a column whenever the number of items stored about a record changes (if you started needing 4 instead of 3 phone numbers, for instance), you need to normalize instead.
What's true for optimising SELECTS is often not so great at optimising INSERTS, UPDATES and DELETES, and thus it is with this approach. Breaking out the data into properly normalised tables reduces the overhead of changing the data.
While it's true that in a data warehouse or decision support system we'd often store pre-joined data (as Tony says), it usually only happens in the context of a precomputed summary (e.g. a materialized view) and not for data at the atomic level of granularity. The reason for this is that pushing repeated longer character strings (e.g. "Supplier Name") into a dimension table reduces the total required storage space and the number of physical reads required to retrieve the data. The joins are usually equijoins, and these are performed at almost no cost for large data sets.

How do you optimize tables for specific queries?

What are the patterns you use to determine the frequent queries?
How do you select the optimization factors?
What are the types of changes one can make?
This is a nice question, if rather broad (and none the worse for that).
If I understand you, then you're asking how to attack the problem of optimisation starting from scratch.
The first question to ask is: "is there a performance problem?"
If there is no problem, then you're done. This is often the case. Nice.
On the other hand...
Determine Frequent Queries
Logging will get you your frequent queries.
If you're using some kind of data access layer, then it might be simple to add code to log all queries.
It is also a good idea to log when the query was executed and how long each query takes. This can give you an idea of where the problems are.
Also, ask the users which bits annoy them. If a slow response doesn't annoy the user, then it doesn't matter.
Select the optimization factors?
(I may be misunderstanding this part of the question)
You're looking for any patterns in the queries / response times.
These will typically be queries over large tables or queries which join many tables in a single query. ... but if you log response times, you can be guided by those.
Types of changes one can make?
You're specifically asking about optimising tables.
Here are some of the things you can look for:
Denormalisation. This brings several tables together into one wider table, so instead of your query joining several tables together, you can just read one table. This is a very common and powerful technique. NB. I advise keeping the original normalised tables and building the denormalised table in addition - this way, you're not throwing anything away. How you keep it up to date is another question. You might use triggers on the underlying tables, or run a refresh process periodically.
Normalisation. This is not often considered to be an optimisation process, but it is in 2 cases:
updates. Normalisation makes updates much faster because each update is the smallest it can be (you are updating the smallest possible table, in terms of columns and rows). This is almost the very definition of normalisation.
Querying a denormalised table to get information which exists on a much smaller (fewer rows) table may be causing a problem. In this case, store the normalised table as well as the denormalised one (see above).
Horizontal partitioning. This means making tables smaller by putting some rows in another, identical table. A common use case is to have all of this month's rows in table ThisMonthSales, and all older rows in table OldSales, where both tables have an identical schema. If most queries are for recent data, this strategy can mean that 99% of all queries are only looking at 1% of the data - a huge performance win (see the sketch after this list).
Vertical partitioning. This is chopping fields off a table and putting them in a new table which is joined back to the main table by the primary key. This can be useful for very wide tables (e.g. with dozens of fields), and may possibly help if tables are sparsely populated.
Indices. I'm not sure if your question covers these, but there are plenty of other answers on SO concerning the use of indices. A good way to find a case for an index is: find a slow query, look at the query plan and find a table scan, then index fields on that table so as to remove the table scan. I can write more on this if required - leave a comment.
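As a rough sketch of the horizontal partitioning idea above, reusing the ThisMonthSales/OldSales names from that example; a UNION ALL view is one common way to keep queries over the full history working:
-- Both tables share an identical schema; a view stitches them back together
-- for the rare queries that need all history
CREATE VIEW AllSales AS
SELECT * FROM ThisMonthSales
UNION ALL
SELECT * FROM OldSales;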
You might also like my post on this.
That's difficult to answer without knowing which system you're talking about.
In Oracle, for example, the Enterprise Manager lets you see which queries took up the most time, lets you compare different execution profiles, and lets you analyze queries over a block of time so that you don't add an index that's going to help one query at the expense of every other one you run.
Your question is a bit vague. Which DB platform?
If we are talking about SQL Server:
Use the Dynamic Management Views. Use SQL Profiler. Install the SP2 and the performance dashboard reports.
After determining the most costly queries (i.e. number of times run x cost of one query), examine their execution plans, and look at the sizes of the tables involved, and whether they are predominantly Read or Write, or a mixture of both.
If the system is under your full control (apps and DB) you can often re-write queries that are badly formed (quite a common occurrence), such as deep correlated sub-queries which can often be re-written as derived table joins with a little thought. Otherwise, your options are to create covering non-clustered indexes and ensure that statistics are kept up to date.
For MySQL there is a feature called the slow query log
The rest is based on what kind of data you have and how it is setup.
In SQL Server you can use a trace to find out how your query is performing. Use Ctrl+K or Ctrl+L.
For example, if you see a full table scan happening on a table with a large number of records, then it probably is not a good query.
A more specific question will definitely fetch you better answers.
If your table is predominantly read, place a clustered index on the table.
My experience is with mainly DB2 and a smattering of Oracle in the early days.
If your DBMS is any good, it will have the ability to collect stats on specific queries and explain the plan it used for extracting the data.
For example, if you have a table (x) with two columns (date and diskusage) and only have an index on date, the query:
select diskusage from x where date = '2008-01-01'
will be very efficient since it can use the index. On the other hand, the query
select date from x where diskusage > 90
would not be so efficient. In the former case, the "explain plan" would tell you that it could use the index. In the latter, it would have said that it had to do a table scan to get the rows (that's basically looking at every row to see if it matches).
Really intelligent DBMS' may also explain what you should do to improve the performance (add an index on diskusage in this case).
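Using the example table above, that fix is a one-liner (the index name is arbitrary):
-- Now "where diskusage > 90" can use an index range scan instead of a table scan
CREATE INDEX ix_x_diskusage ON x (diskusage);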
As to how to see what queries are being run, you can either collect that from the DBMS (if it allows it) or force everyone to do their queries through stored procedures so that the DBA control what the queries are - that's their job, keeping the DB running efficiently.
Indices on PKs and FKs, and one thing that always helps: PARTITIONING...
1. What are the patterns you use to determine the frequent queries?
Depends on what level you are dealing with the database at. If you're a DBA or have access to the tools, DBs like Oracle allow you to run jobs and generate stats/reports over a specified period of time. If you're a developer writing an application against a db, you can just do performance profiling within your app.
2. How do you select the optimization factors?
I try to get a general feel for how the table is being used and the data it contains. I go about it with the following questions.
Is it going to be updated a ton and on what fields do updates occur?
Does it have columns with low cardinality?
Is it worth indexing? (tables that are very small can be slowed down if accessed by an index)
How much maintenance/headache is it worth to have it run faster?
Ratio of updates/inserts vs queries?
etc.
3. What are the types of changes one can make?
-- If using Oracle, keep statistics up to date! =)
-- Normalization/De-Normalization: either one can improve performance depending on the usage of the table. I almost always normalize, and then only if I can in no other practical way make the query faster will I de-normalize. A nice way to denormalize for queries, when your situation allows it, is to keep the real tables normalized and create a denormalized "table" with a materialized view.
-- Index judiciously. Too many can be bad on many levels. Bitmap indexes are great in Oracle as long as you're not updating the column frequently and that column has a low cardinality.
-- Using Index organized tables.
-- Partitioned and sub-partitioned tables and indexes
-- Use stored procedures to reduce round trips by applications, increase security, and enable query optimization without affecting users.
-- Pin tables in memory if appropriate (accessed a lot and fairly small)
-- Device partitioning between index and table database files.
..... the list goes on. =)
Hope this is helpful for you.