lookup table or something else? - sql

I have 8 tables;
employees
employee_subjects
outlet
outlet_subjects
subjects
geography
outlet_geography
employee_geography
Now, I need to be able to search for outlets and employees within a range of different geographies and based on a range of subjects.
My question is: is there a good strategy for this, and is it a good idea to create a somewhat static lookup table into which I insert all the data I need for my range?
The table would potentially grow to 50+ million rows, but I would be able to say
SELECT ... FROM lookup WHERE subId = 1 OR subId = 2 OR geoId = 1 OR geoId = 2 ... etc.
So I get to keep the joins out.
Vague, yes, but I need guidance on this!

That question cannot be answered in general. In some contexts you have to keep redundant, denormalized data for performance reasons (in particular for data warehouses). However, you should not introduce redundancies or potential inconsistencies lightly.
I suggest first measuring the query performance and checking your execution plans. Make sure that you create all the indexes you need. If the query still turns out to be too slow, you might consider using a materialized view (called an indexed view in SQL Server). A materialized view is quite like the table that you suggest, but the DBMS keeps it in sync with your data automatically.
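As a rough illustration of the indexed, normalized approach, here is a minimal sketch using Python's sqlite3 module with an in-memory SQLite database rather than SQL Server. The table names follow the question; the column names (outlet_id, sub_id, geo_id), the index names, the sample rows, and the find_outlets helper are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outlet (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE outlet_subjects (
    outlet_id INTEGER NOT NULL, sub_id INTEGER NOT NULL,
    PRIMARY KEY (outlet_id, sub_id));
CREATE TABLE outlet_geography (
    outlet_id INTEGER NOT NULL, geo_id INTEGER NOT NULL,
    PRIMARY KEY (outlet_id, geo_id));

-- Assumed indexes that lead with the filtered column, so a search by
-- subject or geography can seek instead of scanning the junction table.
CREATE INDEX ix_outlet_subjects_sub ON outlet_subjects (sub_id, outlet_id);
CREATE INDEX ix_outlet_geography_geo ON outlet_geography (geo_id, outlet_id);

INSERT INTO outlet VALUES (1, 'North Daily'), (2, 'South Weekly'), (3, 'East Radio');
INSERT INTO outlet_subjects VALUES (1, 1), (1, 3), (2, 2), (3, 4);
INSERT INTO outlet_geography VALUES (1, 1), (2, 5), (3, 2);
""")


def find_outlets(conn, sub_ids, geo_ids):
    """Return outlets matching any given subject AND any given geography."""
    sub_marks = ",".join("?" * len(sub_ids))
    geo_marks = ",".join("?" * len(geo_ids))
    sql = f"""
        SELECT o.id, o.name
        FROM outlet AS o
        WHERE EXISTS (SELECT 1 FROM outlet_subjects AS s
                      WHERE s.outlet_id = o.id AND s.sub_id IN ({sub_marks}))
          AND EXISTS (SELECT 1 FROM outlet_geography AS g
                      WHERE g.outlet_id = o.id AND g.geo_id IN ({geo_marks}))
        ORDER BY o.id
    """
    return conn.execute(sql, [*sub_ids, *geo_ids]).fetchall()


matches = find_outlets(conn, [1, 2], [1, 2])
print(matches)
```

The same shape would apply to employees with employee_subjects and employee_geography. Whether this is fast enough at 50 million rows is exactly what measuring and reading the execution plan should tell you.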

In a data warehouse context, for analytics queries (pulling numbers and statistics out of your system), that could make sense. For an OLTP system regularly updated by users, however, a big lookup table is a very bad design: it is hard to maintain (lots of unneeded data, since not all columns are needed for all records), it invites bad data, and so on.
Keeping joins out just for querying the system does not sound like a good idea either, as it could work against the SQL Server optimizer and is more likely to lead to table scans (which hurt on a big table).
There is an interesting article by Joe Celko on big lookup tables that sounds related to your problem; it is not exactly the same, but it could give you some insights.
As general advice: keep a normalized design, especially for an OLTP system.

Related

100 Join SQL query

I'm looking to essentially have a centralized table with a number of lookup tables that surround it. The central table is going to be used to store 'Users' and the lookup tables will be user attributes, like 'Religion'. The central table will store an Id, like ReligionId, and the lookup table would contain a list of religions.
Now, I've done a lot of digging into this and I've seen many people comment saying that a UserAttribute table might be the best way to go, essentially using an EAV pattern. I'm not looking to do this. I realize that my strategy will be join-heavy and that's why I ask this question here. I'm looking for a way to optimize those joins.
If the table has 100 lookup tables, how could it be optimized to be faster than just doing a massive 100 table inner join? Some ideas come to mind like using many smaller joins, sub-selects and views. I'm open to anything, including a combination of these strategies. Again, just to note, I'm not looking to do anything that's EAV-related. I need the lookup tables for other reasons and I like normalized data.
All suggestions considered!
Here's a visual look:
Edit: Is this insane?
Optimization techniques will likely depend on the size of the center table and intended query patterns. This is very similar to what you get in data warehousing star schemas, so approaches from that paradigm may help.
For one, ensure that each row is as small as possible. Disk space may be cheap, but disk throughput, memory, and CPU resources are potential bottlenecks. You want small rows so that the server can read them quickly and cache as many as possible in memory.
A materialized/indexed view with the joins already performed allows the joins to essentially be precomputed. This may not work well if you are dealing with a center table that is being written to a lot or is very large.
Anything you can do to optimize a single join should be done for all 100. Appropriate indexes based on the selectivity of the column, etc.
Depending on what kind of queries you are performing, other techniques from data warehousing or OLAP may apply. If you are doing lots of GROUP BYs, then this is likely an area to look into. Data warehousing techniques can be applied within SQL Server with no additional tooling.
Ask yourself why so many attributes are being queried and how they are being presented? For most analysis it is not necessary to join with lookup tables until the final step where you materialize a report, at which time you may only have grouped by on a subset of columns and thus only need some of the lookup tables.
GROUP BYs should generally be able to group on the lookup IDs without needing the text/description from the lookup table, so a join is not necessary. If your lookups have other information relevant to the query at hand, then consider denormalizing it into the central table to eliminate the join and/or making that discrete value its own lookup, essentially splitting the existing lookup ID into another ID.
You could implement a master code table that combines the code tables into a single table with a CodeType column. This is not the same as EAV because you'd still have a column in the center table for each code type and a join for each, whereas EAV is usually used to normalize out an arbitrary number of attributes. (Note: I personally hate master code tables.)
Lastly, consider normalizing the center table if you are not doing data warehousing.
Are there lots of null values in certain lookupId columns? Is the table sparse? This is an indication that you can pull some columns out into a 1 to 1/0 relationships to reduce the size of the center table. For example, a Person table that includes address information can have a PersonAddress table pulled out of it.
Partitioning the table may improve performance if there's a large number of rows and you can determine that certain rows, perhaps with a certain old datetime from couple years in the past, would rarely be queried.
Update: See "Ask yourself why so many attributes are being queried and how they are being presented?" above. Consider a user wants to know number of sales grouped by year, department, and product. You should have id's for each of these so you can just group by those IDs on the center table and in an outer query join lookups for only what columns remain. This ensures the aggregation doesn't need to pull in unnecessary information from lookups that aren't needed anyway.
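A minimal sketch of that "aggregate on IDs first, join lookups last" pattern, run through Python's sqlite3 module on an in-memory SQLite database. The sales, department, and product tables, their columns, and the sample rows are hypothetical, chosen only to mirror the year/department/product example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    sale_year INTEGER NOT NULL,
    department_id INTEGER NOT NULL REFERENCES department(id),
    product_id INTEGER NOT NULL REFERENCES product(id));

INSERT INTO department VALUES (1, 'Toys'), (2, 'Garden');
INSERT INTO product VALUES (10, 'Kite'), (20, 'Rake');
INSERT INTO sales (sale_year, department_id, product_id) VALUES
    (2023, 1, 10), (2023, 1, 10), (2023, 2, 20), (2024, 1, 10);
""")

rows = conn.execute("""
    SELECT g.sale_year, d.name, p.name, g.sale_count
    FROM (
        -- Inner query: aggregate on the center table using IDs only.
        SELECT sale_year, department_id, product_id, COUNT(*) AS sale_count
        FROM sales
        GROUP BY sale_year, department_id, product_id
    ) AS g
    -- Outer query: join only the lookups the report actually displays.
    JOIN department AS d ON d.id = g.department_id
    JOIN product AS p ON p.id = g.product_id
    ORDER BY g.sale_year, d.name, p.name
""").fetchall()

for row in rows:
    print(row)
```

The lookup joins here touch one row per group rather than one row per sale, which is the point of deferring them.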
If you aren't doing aggregations, then you probably aren't querying large numbers of records at a time, so join performance is less of a concern and should be taken care of with appropriate indexes.
If you're querying large numbers of records at a time pulling in all information, I'd look hard at the business case for this. No one sits down at their desk and opens a report with a million rows and 100 columns in it and does anything meaningful with all of that data, that couldn't be accomplished in a better way.
The only case for such a query would be a dump of all data intended for export to another system, in which case performance shouldn't be as much of a concern, as it can be scheduled overnight.
Since you are set on your way, you can consider duplicating data in order to join fewer times, similar to what is done in an OLAP database.
http://en.wikipedia.org/wiki/OLAP_cube
With that said I don't think this is the best way to do it if you have 100 properties.
Have you tried exporting it to Microsoft Excel Power Pivot with Power Query? You can do fast data analysis, and Power View offers some pretty awesome ways to display it.

What is the most scalable design for this table structure

DataColumn, DataColumn, DateColumn
Every so often we put data into the table via date.
So everything seems great at first, but then I thought: What happens when there are a million or billion rows in the table? Should I be breaking up the tables by date? This way the query performance will never degrade? How do people deal with this sort of thing?
You can use partitioned tables, available from SQL Server 2005 onward.
This way you gain the benefits of keeping the logical design pure while being able to move old data into a different file group.
You should not break your tables because of data. Instead, you should worry about your indexes, normalization and so on.
Update
A little deeper explanation. Let's suppose you have a table with a million records. If you have different dates on [DateColumn], your greatest ally will be the indexes that work with the [DateColumn]. Then you make sure your queries always filter by at least [DateColumn].
This way, you will be fine.
This easily qualifies as premature optimization, which is tough to achieve in db design IMHO, because optimization is/should be closer to the surface in data modeling.
But all you need to do is create an index on the DateColumn field. An index is actually a much better performance solution than any kind of table splitting/breaking up, and it keeps your design, and therefore all of your programming, much simpler. (And you can decide to use partitioning in the future, without affecting your design, if it helps.)
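To make the index suggestion concrete, here is a small sketch using Python's sqlite3 module with an in-memory SQLite database (not SQL Server). The readings table, its column names, the index name, and the generated rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER PRIMARY KEY, data_a TEXT, data_b TEXT, date_col TEXT NOT NULL)"
)
# The one structural change suggested above: an index on the date column.
conn.execute("CREATE INDEX ix_readings_date ON readings (date_col)")

# 1200 sample rows, 100 per month of 2024 (ISO dates sort correctly as text).
conn.executemany(
    "INSERT INTO readings (data_a, data_b, date_col) VALUES (?, ?, ?)",
    [(f"a{i}", f"b{i}", f"2024-{(i % 12) + 1:02d}-15") for i in range(1200)],
)

query = "SELECT data_a, data_b FROM readings WHERE date_col >= ? AND date_col < ?"
params = ("2024-03-01", "2024-04-01")

# Ask the engine how it intends to run the query; the plan text is in column 4.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params))
march_rows = conn.execute(query, params).fetchall()

print(plan)
print(len(march_rows))
```

The table stays a single logical table; only the queries need to filter on the date column for the index to help.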
Sounds like you could use a history table. If you are mostly going to query the current date's data, then migrate the old data to the history table and your main table will not grow so much.
If I understand your question correctly, you have a table with some data and a date. Your question is: will I see improved performance if I make a new table, say, every year? This way the queries would never have to look at more than one year's worth of data.
This is wrong. Instead, what you should do is create an index on the date field. The server will be able to give you the performance gain you need if the field is indexed.
If you don't do this your program's logic will get crazy and ultimately slow down your system.
Keep it simple.
(NB - There are some advanced partitioning features you can make use of, but these can be layered in later if needed -- it is unlikely you will need these features but the simple design should be able to migrate to them if needed.)
When tables and indexes become very large, partitioning can help by partitioning the data into smaller, more manageable sections. Microsoft SQL Server 2005 allows you to partition your tables based on specific data usage patterns using defined ranges or lists. SQL Server 2005 also offers numerous options for the long-term management of partitioned tables and indexes by the addition of features designed around the new table and index structure.
Furthermore, if a large table exists on a system with multiple CPUs, partitioning the table can lead to better performance through parallel operations.
You might also need to consider the following: In SQL Server 2005, related tables (such as Order and OrderDetails tables) that are partitioned to the same partitioning key and the same partitioning function are said to be aligned. When the optimizer detects that two partitioned and aligned tables are joined, SQL Server 2005 can join the data that resides on the same partitions first and then combine the results. This allows SQL Server 2005 to more effectively use multiple-CPU computers.
Read about Partitioned Tables and Indexes in SQL Server 2005

Why are joins bad when considering scalability?

Why are joins bad or 'slow'? I know I have heard this more than once. I found this quote:
The problem is joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again.
source
I always thought they were fast especially when looking up a PK. Why are they 'slow'?
Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and the things you actually do you make sure are done as efficiently as possible.
In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.
But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. There are a number of scalability-friendly ways to take raw joins and make them much faster:
Join on a surrogate key (autonumber/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation
Indexes
Materialized/indexed views (think of this as a pre-computed join or managed de-normalization)
Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as generic denormalization)
Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.
Use of a caching layer like Redis to avoid re-running queries which need complex joins.
I would go as far as saying the main reason relational databases exist at all is to allow you to do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query, just as if you had denormalized the data (admittedly at the cost of slower write operations).
If you have a slow join, you're probably not using your database correctly.
De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.
* That is, exist as entities distinct from mere collections of tables. An additional reason for a real rdbms is safe concurrent access.
Joins can be slower than avoiding them through de-normalisation, but if used correctly (joining on columns with appropriate indexes and so on) they are not inherently slow.
De-normalisation is one of many optimisation techniques you can consider if your well designed database schema exhibits performance problems.
The article says that joins are slow when compared to the absence of joins, which can be achieved with denormalization. So there is a trade-off between speed and normalization. Don't forget about premature optimization, either :)
First of all, a relational database's raison d'etre (reason for being) is to be able to model relationships between entities. Joins are simply the mechanisms by which we traverse those relationships. They certainly do come at a nominal cost, but without joins, there really is no reason to have a relational database.
In the academic world we learn of things like the various normal forms (1st, 2nd, 3rd, Boyce-Codd, etc.), and we learn about different types of keys (primary, foreign, alternate, unique, etc.) and how these things fit together to design a database. And we learn the rudiments of SQL as well as manipulating both structure and data (DDL & DML).
In the corporate world, many of the academic constructs turn out to be substantially less viable than we had been led to believe. A perfect example is the notion of a primary key. Academically it is that attribute (or collection of attributes) that uniquely identifies one row in the table. So in many problem domains, the proper academic primary key is a composite of 3 or 4 attributes. However, almost everyone in the modern corporate world uses an auto-generated, sequential integer as a table's primary key. Why? Two reasons. The first is because it makes the model much cleaner when you're migrating FKs all over the place. The second, and most germane to this question, is that retrieving data through joins is faster and more efficient on a single integer than it is on 4 varchar columns (as already mentioned by a few folks).
Let's dig a little deeper now into two specific subtypes of real world databases. The first type is a transactional database. This is the basis for many e-commerce or content management applications driving modern sites. With a transaction DB, you're optimizing heavily toward "transaction throughput". Most commerce or content apps have to balance query performance (from certain tables) with insert performance (in other tables), though each app will have its own unique business driven issues to solve.
The second type of real world database is a reporting database. These are used almost exclusively to aggregate business data and to generate meaningful business reports. They are typically shaped differently than the transaction databases where the data is generated and they are highly optimized for speed of bulk data loading (ETLs) and query performance with large or complex data sets.
In each case, the developer or DBA needs to carefully balance both the functionality and performance curves, and there are lots of performance enhancing tricks on both sides of the equation. In Oracle you can do what's called an "explain plan" so you can see specifically how a query gets parsed and executed. You're looking to maximize the DB's proper use of indexes. One really nasty no-no is to wrap a column in a function in the where clause of a query. Whenever you do that, Oracle generally cannot use an ordinary index on that particular column (a matching function-based index is the exception), and you'll likely see a full or partial table scan in the explain plan. That's just one specific example of how a query could be written that ends up being slow, and it doesn't have anything to do with joins.
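The effect of wrapping an indexed column in a function can be seen in most engines. The sketch below uses SQLite through Python's sqlite3 module, not Oracle, so the plan wording differs from an Oracle explain plan; the users table, its columns, the index name, and the plan_for helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, notes TEXT)")
conn.execute("CREATE INDEX ix_users_email ON users (email)")
conn.executemany(
    "INSERT INTO users (email, notes) VALUES (?, ?)",
    [(f"user{i}@example.com", "n") for i in range(500)],
)


def plan_for(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[3] for row in rows)


# Bare column in the predicate: the planner can seek on the index.
plain_plan = plan_for(
    "SELECT id, email, notes FROM users WHERE email = ?",
    ("user7@example.com",),
)

# Column wrapped in a function: the plain index on email no longer applies.
wrapped_plan = plan_for(
    "SELECT id, email, notes FROM users WHERE upper(email) = ?",
    ("USER7@EXAMPLE.COM",),
)

print(plain_plan)
print(wrapped_plan)
```

In SQLite the first plan is reported as a SEARCH using the index and the second as a SCAN of the table, which is the same index-versus-scan distinction the paragraph above describes.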
And while we're talking about table scans, they obviously impact the query speed proportionally to the size of the table. A full table scan of 100 rows isn't even noticeable. Run that same query on a table with 100 million rows, and you'll need to come back next week for the return.
Let's talk about normalization for a minute. This is another largely positive academic topic that can get over-stressed. Most of the time when we talk about normalization we really mean the elimination of duplicate data by putting it into its own table and migrating an FK. Folks usually skip over the whole dependence thing described by 2NF and 3NF. And yet in an extreme case, it's certainly possible to have a perfect BCNF database that's enormous and a complete beast to write code against because it's so normalized.
So where do we balance? There is no single best answer. All of the better answers tend to be some compromise between ease of structure maintenance, ease of data maintenance and ease of code creation/maintenance. In general, the less duplication of data, the better.
So why are joins sometimes slow? Sometimes it's bad relational design. Sometimes it's ineffective indexing. Sometimes it's a data volume issue. Sometimes it's a horribly written query.
Sorry for such a long-winded answer, but I felt compelled to provide a meatier context around my comments rather than just rattle off a 4-bullet response.
People with terabyte-sized databases still use joins; if they can get them to work performance-wise, then so can you.
There are many reasons not to denormalize. First, speed of select queries is not the only or even main concern with databases. Integrity of the data is the first concern. If you denormalize, then you have to put techniques in place to keep the denormalized data in sync as the parent data changes. So suppose you take to storing the client name in all tables instead of joining to the client table on the client_Id. Now when the name of a client changes (there is a 100% chance that some client names will change over time), you need to update all the child records to reflect that change. If you do this with a cascade update and you have a million child records, how fast do you suppose that is going to be, and how many users are going to suffer locking issues and delays in their work while it happens? Further, most people who denormalize because "joins are slow" don't know enough about databases to make sure their data integrity is properly protected, and they often end up with databases that have unusable data because the integrity is so bad.
Denormalization is a complex process that requires a thorough understanding of database performance and integrity if it is to be done correctly. Do not attempt to denormalize unless you have such expertise on staff.
Joins are quite fast enough if you do several things. First, use a surrogate key; an int join is almost always the fastest join. Second, always index the foreign key. Use derived tables or join conditions to create a smaller dataset to filter on. If you have a large, very complex database, then hire a professional database person with experience in partitioning and managing huge databases. There are plenty of techniques to improve performance without getting rid of joins.
If you just need query capability, then yes, you can design a data warehouse, which can be denormalized and is populated through an ETL tool (optimized for speed) rather than user data entry.
Joins are slow if
the data is improperly indexed
results poorly filtered
joining query poorly written
data sets very large and complex
So, true, the bigger your data sets, the more processing you'll need for a query, but checking and working on the first three items above will often yield great results.
Your source gives denormalization as an option. This is fine only as long as you've exhausted better alternatives.
The joins can be slow if large portions of records from each side need to be scanned.
Like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
Even if an index is defined on account_customer, all records from the latter still need to be scanned.
For a query like this, decent optimizers probably won't even consider the index access path, doing a HASH JOIN or a MERGE JOIN instead.
Note that for a query like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
WHERE customer_last_name = 'Stellphlug'
the join will most probably be fast: first, an index on customer_last_name will be used to filter all Stellphlugs (who are, of course, not very numerous), then an index scan on account_customer will be issued for each Stellphlug to find his transactions.
Despite the fact that there can be billions of records in accounts and customers, only a few will actually need to be scanned.
Joins are fast. Joins should be considered standard practice with a properly normalized database schema. Joins allow you to join disparate groups of data in a meaningful way. Don't fear the join.
The caveat is that you must understand normalization, joining, and the proper use of indexes.
Beware premature optimization, as the number one failing of all development projects is meeting the deadline. Once you've completed the project, and you understand the trade offs, you can break the rules if you can justify it.
It's true that join performance degrades non-linearly as the size of the data set increases. Therefore, it doesn't scale as nicely as single table queries, but it still does scale.
It's also true that a bird flies faster without any wings, but only straight down.
Joins do require extra processing since they have to look in more files and more indexes to "join" the data together. However, "very large data sets" is all relative. What is the definition of large? In the case of JOINs, I think it's a reference to a large result set, not the overall dataset.
Most databases can very quickly process a query that selects 5 records from a primary table and joins 5 records from a related table for each record (assuming the correct indexes are in place). These tables can have hundreds of millions of records each, or even billions.
Once your result set starts growing, things are going to slow down. Using the same example, if the primary table results in 100K records, then there will be 500K "joined" records that need to be found. Just pulling that much data out of the database will add delays.
Don't avoid JOINs, just know you may need to optimize/denormalize when datasets get "very large".
Also from the article you cited:
Many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme and some are even advocating denormalization as the best strategy for architecting the data tier.
and
And unless you are a really large website you probably don't need to worry about this level of complexity.
and
It's more error prone than having the database do all this work, but you are able to do scale past what even the highest end databases can handle.
The article is discussing mega-sites like eBay. At that level of usage you are likely going to have to consider something other than plain vanilla relational database management. But in the "normal" course of business (applications with thousands of users and millions of records) those more expensive, more error prone approaches are overkill.
Joins are considered an opposing force to scalability because they're typically the bottleneck and they cannot be easily distributed or parallelized.
Joins on properly designed tables with the proper indexes and correctly written queries are not always slow. Wherever you heard that:
Why are joins bad or 'slow'
whoever said it has no idea what they are talking about! Most joins will be very fast. If you have to join a great many rows at one time, you might take a hit compared to a denormalized table, but that goes back to properly designed tables: know when to denormalize and when not to. In a reporting-heavy system, break the data out into denormalized tables for reports, or even create a data warehouse. In a transaction-heavy system, normalize the tables.
The amount of temporary data that is generated could be huge based on the joins.
For example, one database here at work had a generic search function where all of the fields were optional. The search routine did a join on every table before the search began. This worked well in the beginning. But now that the main table has over 10 million rows... not so much. Searches now take 30 minutes or more.
I was tasked with optimizing the search stored procedure.
The first thing I did was this: if any of the fields of the main table were being searched, I selected into a temp table on those fields only. THEN I joined all the tables with that temp table before doing the rest of the search. Searches that filter on one of the main table fields now take less than 10 seconds.
If none of the main table fields are being searched, I do similar optimizations for other tables. When I was done, no search took longer than 30 seconds, with most under 10.
CPU utilization of the SQL server also went WAY DOWN.
While joins (presumably due to a normalized design) can obviously be slower for data retrieval than a read from a single table, a denormalized database can be slow for data creation/update operations since the footprint of the overall transaction will not be minimal.
In a normalized database, a piece of data will live in only one place, so the footprint for an update will be as minimal as possible. In a denormalized database, it's possible that the same column in multiple rows or across tables will have to be updated, meaning the footprint would be larger and chance of locks and deadlocks can increase.
Well, yeah, selecting rows from one denormalized table (assuming decent indexes for your query) might be faster than selecting rows constructed from joining several tables, particularly if the joins don't have efficient indexes available.
The examples cited in the article - Flickr and eBay - are exceptional cases IMO, so have (and deserve) exceptional responses. The author specifically calls out the lack of RI and the extent of data duplication in the article.
Most applications - again, IMO - benefit from the validation & reduced duplication provided by RDBMSs.
They can be slow if done sloppily. For example, if you do a 'select *' on a join, you will probably take a while to get stuff back. However, if you carefully choose what columns to return from each table, and the proper indexes are in place, there should be no problem.

How do Views work in a DBM?

Say that I have two tables like these:
Employers (id, name, .... , deptId).
Depts(id, deptName, ...).
This data is not going to be modified very often, and I want a query like this
SELECT name, deptName FROM Employers, Depts
WHERE deptId = Depts.id AND Employers.id="ID"
to be as fast as it can be.
Two possible solutions come to mind:
Denormalize the table:
With this solution I would lose some of the great advantages of having a normalized database, but here the performance is a MUST.
Create a view for that denormalized data:
I would keep the data normalized, and (here is my question) would a query over that view be faster than the same query without the view?
Or, to ask the same question another way: is the view "interpreted" every time you query it, or how do views work in a DBMS?
Generally, unless you "materialize" a view, which is an option in some software like MS SQL Server, the view is just translated into queries against the base tables, and is therefore no faster or slower than the original (minus the minuscule amount of time it takes to translate the query, which is nothing compared to actually executing the query).
How do you know you've got performance problems? Are you profiling it under load? Have you verified that the performance bottleneck is these two tables? Generally, until you've got hard data, don't assume you know where performance problems come from, and don't spend any time optimizing until you know you're optimizing the right thing - 80% of the performance issues come from 20% of the code.
If Depts.ID is the primary key of that table, and you index the Employers.DeptID field, then this query should remain very fast even over millions of records.
Denormalizing doesn't make sense to me in that scenario.
Generally speaking, performance of a view will be almost exactly the same as performance when running the query itself. The advantage of a view is simply to abstract that query away, so you don't have to think about it.
You could use a Materialized View (or "snapshot" as some say), but then your data is only going to be as recent as your last refresh.
In a comment to one of the replies, the author of the question explains that he is looking for a way to create a materialized view in MySQL.
MySQL does not wrap the concept of the materialized view in a nice package for you like other DBMSes, but it does have all the tools you need to create one.
What you need to do is this:
Create the initial materialization of the result of your query.
Create a trigger on insert into the employers table that inserts into the materialized table all rows that match the newly inserted employer.
Create a trigger on delete in the employers table that deletes the corresponding rows from the materialized table.
Create a trigger on update in the employers table that updates the corresponding rows in the materialized table.
Same for the departments table.
This may work ok if your underlying tables are not frequently updated; but you need to be aware of the added cost of create/update/delete operations once you do this.
Also, you'll want to make sure that some DBA who doesn't know about your trickery doesn't migrate the database without migrating the triggers when the time comes. So document it well.
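The trigger steps above can be sketched roughly as follows. This is only an illustration, not the asker's real schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names (depts, employers), the materialized table employer_dept and all trigger names are assumptions made up for the example. MySQL's CREATE TRIGGER syntax differs in details (for example, it requires FOR EACH ROW), so treat this as the shape of the solution rather than something to paste in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical base tables standing in for the ones in the question.
CREATE TABLE depts (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept_id INTEGER NOT NULL REFERENCES depts(id)
);

-- Initial materialization of the join result.
CREATE TABLE employer_dept AS
    SELECT e.id AS employer_id, e.name AS employer_name, d.name AS dept_name
    FROM employers e JOIN depts d ON d.id = e.dept_id;

-- Insert into employers: add the matching materialized row.
CREATE TRIGGER employers_ai AFTER INSERT ON employers
BEGIN
    INSERT INTO employer_dept (employer_id, employer_name, dept_name)
    SELECT NEW.id, NEW.name, d.name FROM depts d WHERE d.id = NEW.dept_id;
END;

-- Delete from employers: remove the materialized row.
CREATE TRIGGER employers_ad AFTER DELETE ON employers
BEGIN
    DELETE FROM employer_dept WHERE employer_id = OLD.id;
END;

-- Update of employers: refresh the materialized row.
CREATE TRIGGER employers_au AFTER UPDATE ON employers
BEGIN
    UPDATE employer_dept
    SET employer_id = NEW.id,
        employer_name = NEW.name,
        dept_name = (SELECT name FROM depts WHERE id = NEW.dept_id)
    WHERE employer_id = OLD.id;
END;

-- Same for departments: a rename must reach every materialized row.
CREATE TRIGGER depts_au AFTER UPDATE OF name ON depts
BEGIN
    UPDATE employer_dept SET dept_name = NEW.name
    WHERE employer_id IN (SELECT id FROM employers WHERE dept_id = NEW.id);
END;
""")

# Exercise the triggers: insert, rename a department, delete an employer.
conn.execute("INSERT INTO depts VALUES (1, 'Sales'), (2, 'Support')")
conn.execute("INSERT INTO employers VALUES (1, 'Ann', 1), (2, 'Bob', 2)")
conn.execute("UPDATE depts SET name = 'Helpdesk' WHERE id = 2")
conn.execute("DELETE FROM employers WHERE id = 1")

rows = conn.execute(
    "SELECT employer_id, employer_name, dept_name "
    "FROM employer_dept ORDER BY employer_id"
).fetchall()
print(rows)
```

Every write to the base tables now does extra work inside a trigger, which is exactly the added cost mentioned above.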
Sounds like premature optimisation unless you know it is a clear and present problem.
MySQL does not materialise views, so they are no faster than queries against the base tables. Moreover, in some cases they are slower because they get optimised less well.
Views also "hide" complexity from the developers who maintain the code later, leading them to imagine that a query is simpler than it actually is.

How do you optimize tables for specific queries?

What are the patterns you use to determine the frequent queries?
How do you select the optimization factors?
What are the types of changes one can make?
This is a nice question, if rather broad (and none the worse for that).
If I understand you, then you're asking how to attack the problem of optimisation starting from scratch.
The first question to ask is: "is there a performance problem?"
If there is no problem, then you're done. This is often the case. Nice.
On the other hand...
Determine Frequent Queries
Logging will get you your frequent queries.
If you're using some kind of data access layer, then it might be simple to add code to log all queries.
It is also a good idea to log when the query was executed and how long each query takes. This can give you an idea of where the problems are.
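As a rough sketch of that kind of logging, assuming a very thin data access layer: the LoggingConnection class and its method names below are hypothetical, and sqlite3 is used only so the example runs on its own. It records when each query ran and how long it took, then totals the time per query text.

```python
import sqlite3
import time
from collections import defaultdict


class LoggingConnection:
    """Hypothetical data-access wrapper that records every query it runs."""

    def __init__(self, conn):
        self.conn = conn
        self.log = []  # entries of (started_at, sql, seconds)

    def execute(self, sql, params=()):
        started_at = time.time()          # wall-clock time of execution
        t0 = time.perf_counter()          # high-resolution timer for duration
        rows = self.conn.execute(sql, params).fetchall()
        self.log.append((started_at, sql, time.perf_counter() - t0))
        return rows

    def summary(self):
        """Return (sql, run_count, total_seconds), most expensive first."""
        stats = defaultdict(lambda: [0, 0.0])
        for _, sql, seconds in self.log:
            stats[sql][0] += 1
            stats[sql][1] += seconds
        return sorted(
            ((sql, count, total) for sql, (count, total) in stats.items()),
            key=lambda entry: entry[2],
            reverse=True,
        )


db = LoggingConnection(sqlite3.connect(":memory:"))
db.execute("CREATE TABLE t (n INTEGER)")
for i in range(3):
    db.execute("INSERT INTO t VALUES (?)", (i,))
db.execute("SELECT COUNT(*) FROM t")

for sql, count, total in db.summary():
    print(f"{count:3d} runs  {total:.6f}s  {sql}")
```

Grouping on the parameterised SQL text (with ? placeholders) is what makes "the same query" countable; if your layer builds SQL by string concatenation, every call looks like a different query.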
Also, ask the users which bits annoy them. If a slow response doesn't annoy the user, then it doesn't matter.
Select the optimization factors?
(I may be misunderstanding this part of the question)
You're looking for any patterns in the queries / response times.
These will typically be queries over large tables or queries which join many tables in a single query. ... but if you log response times, you can be guided by those.
Types of changes one can make?
You're specifically asking about optimising tables.
Here are some of the things you can look for:
Denormalisation. This brings several tables together into one wider table, so instead of your query joining several tables together, you can just read one table. This is a very common and powerful technique. NB. I advise keeping the original normalised tables and building the denormalised table in addition - this way, you're not throwing anything away. How you keep it up to date is another question. You might use triggers on the underlying tables, or run a refresh process periodically.
Normalisation. This is not often considered to be an optimisation process, but it is in 2 cases:
Updates. Normalisation makes updates much faster because each update is the smallest it can be (you are updating the smallest possible table, in terms of columns and rows). This is almost the very definition of normalisation.
Querying a denormalised table to get information which exists on a much smaller (fewer rows) table may be causing a problem. In this case, store the normalised table as well as the denormalised one (see above).
Horizontal partitioning. This means making tables smaller by putting some rows in another, identical table. A common use case is to have all of this month's rows in table ThisMonthSales, and all older rows in table OldSales, where both tables have an identical schema. If most queries are for recent data, this strategy can mean that 99% of all queries are only looking at 1% of the data - a huge performance win.
Vertical partitioning. This is chopping fields off a table and putting them in a new table which is joined back to the main table by the primary key. This can be useful for very wide tables (e.g. with dozens of fields), and may possibly help if tables are sparsely populated.
Indices. I'm not sure if your question covers these, but there are plenty of other answers on SO concerning the use of indices. A good way to find a case for an index is: find a slow query, look at the query plan and find a table scan, then index fields on that table so as to remove the table scan. I can write more on this if required - leave a comment.
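To make the horizontal partitioning item a bit more concrete, here is a minimal sketch using the ThisMonthSales / OldSales example. The column names, the AllSales view and the archive_before function are assumptions invented for the illustration, and sqlite3 is used only so that it runs standalone; a DBMS with native partitioning would do this for you.

```python
import sqlite3

# Both tables share one (hypothetical) schema.
SCHEMA = "(id INTEGER PRIMARY KEY, sold_on TEXT NOT NULL, amount REAL NOT NULL)"

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE ThisMonthSales {SCHEMA}")
conn.execute(f"CREATE TABLE OldSales {SCHEMA}")
# A view over both halves, for the rare query that needs all history.
conn.execute(
    "CREATE VIEW AllSales AS "
    "SELECT * FROM ThisMonthSales UNION ALL SELECT * FROM OldSales"
)


def archive_before(conn, cutoff):
    """Move rows sold before the cutoff date into OldSales, atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO OldSales SELECT * FROM ThisMonthSales WHERE sold_on < ?",
            (cutoff,),
        )
        conn.execute("DELETE FROM ThisMonthSales WHERE sold_on < ?", (cutoff,))


conn.executemany(
    "INSERT INTO ThisMonthSales (sold_on, amount) VALUES (?, ?)",
    [("2008-12-30", 10.0), ("2009-01-02", 20.0), ("2009-01-15", 5.0)],
)
conn.commit()

# The periodic job: everything before the first of the month is "old".
archive_before(conn, "2009-01-01")

recent = conn.execute("SELECT COUNT(*) FROM ThisMonthSales").fetchone()[0]
old = conn.execute("SELECT COUNT(*) FROM OldSales").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM AllSales").fetchone()[0]
print(recent, old, total)
```

Dates are stored as ISO strings here so that a plain string comparison orders them correctly; queries for recent data then only ever touch the small table.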
You might also like my post on this.
That's difficult to answer without knowing which system you're talking about.
In Oracle, for example, the Enterprise Manager lets you see which queries took up the most time, lets you compare different execution profiles, and lets you analyze queries over a block of time so that you don't add an index that's going to help one query at the expense of every other one you run.
Your question is a bit vague. Which DB platform?
If we are talking about SQL Server:
Use the Dynamic Management Views. Use SQL Profiler. Install the SP2 and the performance dashboard reports.
After determining the most costly queries (i.e. number of times run x cost of one query), examine their execution plans, and look at the sizes of the tables involved, and whether they are predominantly read or write, or a mixture of both.
If the system is under your full control (apps and DB) you can often rewrite queries that are badly formed (quite a common occurrence), such as deep correlated sub-queries, which can often be rewritten as derived table joins with a little thought. Otherwise, your options are to create covering non-clustered indexes and ensure that statistics are kept up to date.
For MySQL there is a feature called the slow query log.
The rest is based on what kind of data you have and how it is setup.
In SQL Server you can use a trace to find out how your query is performing. Use Ctrl+K or Ctrl+L to show the execution plan.
For example, if you see a full table scan happening on a table with a large number of records, then it is probably not a good query.
A more specific question will definitely fetch you better answers.
If your table is predominantly read, place a clustered index on the table.
My experience is with mainly DB2 and a smattering of Oracle in the early days.
If your DBMS is any good, it will have the ability to collect stats on specific queries and explain the plan it used for extracting the data.
For example, if you have a table (x) with two columns (date and diskusage) and only have an index on date, the query:
select diskusage from x where date = '2008-01-01'
will be very efficient since it can use the index. On the other hand, the query
select date from x where diskusage > 90
would not be so efficient. In the former case, the "explain plan" would tell you that it could use the index. In the latter, it would say that it had to do a table scan to get the rows (that's basically looking at every row to see if it matches).
Really intelligent DBMSs may also explain what you should do to improve the performance (add an index on diskusage in this case).
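You can try the same experiment yourself. The sketch below uses SQLite (through Python's sqlite3 module) rather than DB2 or Oracle, simply because it needs no setup; the index names x_date and x_diskusage and the plan helper are made up for the example. EXPLAIN QUERY PLAN is SQLite's version of "explain plan", and its exact wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (date TEXT, diskusage INTEGER)")
conn.execute("CREATE INDEX x_date ON x (date)")  # index on date only


def plan(sql):
    """Return SQLite's plan for a query as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Filter on the indexed column: the plan mentions the index.
by_date = plan("SELECT diskusage FROM x WHERE date = '2008-01-01'")
# Filter on the unindexed column: the plan is a scan of the whole table.
by_usage = plan("SELECT date FROM x WHERE diskusage > 90")

# Add the missing index and ask again.
conn.execute("CREATE INDEX x_diskusage ON x (diskusage)")
by_usage_indexed = plan("SELECT date FROM x WHERE diskusage > 90")

print(by_date)
print(by_usage)
print(by_usage_indexed)
```

The first and third plans should name an index (a SEARCH), while the second reports a SCAN of x, which is the table scan described above.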
As to how to see what queries are being run, you can either collect that from the DBMS (if it allows it) or force everyone to do their queries through stored procedures so that the DBAs control what the queries are - that's their job, keeping the DB running efficiently.
Indices on PKs and FKs, and one thing that always helps: partitioning.
1. What are the patterns you use to determine the frequent queries?
Depends on what level you are dealing with the database. If you're a DBA or have access to the tools, DBs like Oracle allow you to run jobs and generate stats/reports over a specified period of time. If you're a developer writing an application against a DB, you can just do performance profiling within your app.
2. How do you select the optimization factors?
I try and get a general feel for how the table is being used and the data it contains. I go about with the following questions.
Is it going to be updated a ton and on what fields do updates occur?
Does it have columns with low cardinality?
Is it worth indexing? (tables that are very small can be slowed down if accessed by an index)
How much maintenance/headache is it worth to have it run faster?
Ratio of updates/inserts vs queries?
etc.
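A couple of those questions (cardinality, and whether the table is big enough to bother) can be answered with one query per column. The helper below is only a sketch: column_profile, the orders table and its columns are hypothetical, sqlite3 is used just to make it runnable, and the resulting ratio is a rough guide rather than a rule.

```python
import sqlite3


def column_profile(conn, table, column):
    """Return (row_count, distinct_values, selectivity) for one column.

    table and column are assumed to be trusted identifiers, not user input.
    """
    total, distinct = conn.execute(
        f'SELECT COUNT(*), COUNT(DISTINCT "{column}") FROM "{table}"'
    ).fetchone()
    selectivity = distinct / total if total else 0.0
    return total, distinct, selectivity


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
statuses = ["new", "paid", "shipped"]
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [(statuses[i % 3],) for i in range(1000)],
)

id_profile = column_profile(conn, "orders", "id")
status_profile = column_profile(conn, "orders", "status")
print(id_profile)
print(status_profile)
```

A selectivity near 1 (like id) means an ordinary index can narrow a lookup to very few rows; a value near 0 (like status) is the low-cardinality case, where an ordinary index usually helps little.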
3. What are the types of changes one can make?
-- If using Oracle, keep statistics up to date! =)
-- Normalization/De-Normalization: either one can improve performance depending on the usage of the table. I almost always normalize, and will de-normalize only if there is no other practical way to make the query faster. A nice way to denormalize for queries, when your situation allows it, is to keep the real tables normalized and create a denormalized "table" with a materialized view.
-- Index judiciously. Too many can be bad on many levels. Bitmap indexes are great in Oracle as long as you're not updating the column frequently and that column has a low cardinality.
-- Using Index organized tables.
-- Partitioned and sub-partitioned tables and indexes
-- Use stored procedures to reduce round trips by applications, increase security, and enable query optimization without affecting users.
-- Pin tables in memory if appropriate (accessed a lot and fairly small)
-- Device partitioning between index and table database files.
..... the list goes on. =)
Hope this is helpful for you.