Three SQL tables or one? - sql

I have a choice of creating three tables with identical structure but different content or one table with all of the data and one additional column that distinguishes the data. Each table will have about 10,000 rows in it, and it will be used exclusively for looking up data. The key design criteria is speed of lookup, so which is faster: three tables with 10K rows each or one table with 30K rows, or is there no substantive difference? Note: all columns that will be used as query parameters will have indices.

There should be no substantial difference between 10k or 30k rows in any modern RDBMS in terms of lookup time. In any case not enough difference to warrant the de-normalization. Indexed qualifier column is a common approach for such a design.
The only time you may consider de-normalizing if your update pattern affects a limited set of data that you can put in a "short" table (say, today's messages in social network) with few(er) indexes for fast inserts/updates and there is a background process transferring the stabilized updates to a large, fully indexed table. The case were you really win during write operations will be a dramatic one though, with very particular and unfortunate requirements. RDBMS engines are sophisticated enough to handle most of the simple scenarios in very efficient way. 30k or rows does not sound like a candidate.
If still in doubt, it is very easy to write a test to check on your particular database / system setup. I think if you post your findings here with real data, it will be a useful info for everyone in your steps.

Apart from the speed issue, which the other posters have covered and I agree with, you should also take into consideration the business model that your are replicating in your database, as this may affect the maintenance cost of your solution.
If is it possible that the 3 'things' may turn into 4, and you have chosen the separate table path, then you will have to add another table. Whereas if you choose the discriminator path then it is as simple as coming up with a new discriminator.
However, if you choose the discriminator path and then new requirements dictate that one of 'things' has more data to store then you are going to have to add extra columns to your table which have no relevance to the other 'things'.
I cannot say which is the right way to go, as only you know your business model.

Related

SQL Server column design

I always tried to make my sql database as simple and as understandable as possible.
Until now I always used a limited number of columns, I think I never had more than 20. Now, there is one thing, that would make my life easier, if I had much more columns. Let´s say 200 columns. (not rows). What do you think about it?
I just want to know, if it is a bad idea, not why i´m doing this or if there are other possibilities, just if somebody has already experienced something like that and if it is a bad idea to do such a table.
Fewer, smaller width columns is better than lots of columns and/or large width columns.
Why? Because the narrower the row size, the more rows you fit on a 8K page. That means you do less I/O and use less memory to buffer pages. That is always a good thing.
In those (hopefully) rare cases, where the domain requires many attributes on an object (with the assumption of 1-1 object-table mapping), you should consider splitting into two tables ina 1-1 relationship, one containing the frequently used columns.
I don't think it is black and white. Having a large row size (implied by the large number of columns) will hurt performance (i.e., more I/O) -- but there are cases where taking a small hit in performance in one place will be offset by increased performance in others.
I'd say it depends on how many rows you expect this table to have, how often will it will be queried, how many of those additional columns will really be accessed, and how it would compare to your alternative design in terms of efficiency and complexity.
Luke--
It really depends on the type of the system you are working with. Example in transactional systems, most tables have at most 50 columns or so with almost no redundant data attrributes ( If you have a process date, you would not need the Process Month or the process year as a seperate column). This of course is because the records are updated/inserted frequently and you'll need to update all the redundant attributes everytime you update one row.
In Data Warehouse/reporting environments, for Dimension tables (which have the attributes for an entity) it is typical to have 100+ columns as there are could be various ways you want to categorize a given entity.The Updates here are not so much a problem as data is typically loaded once during off-peak hours and then is used mostly in selects.
Take a look at these links to know more..
http://en.wikipedia.org/wiki/Database_normalization
http://en.wikipedia.org/wiki/Star_schema
So the answer is it depends... If you want a perfectly relational system, then may be 200+ columns is kind of a red flag indicating you should look at normalize your data (May be not). Updates and Indexes are two things that you should be concerned with in such a system.
You are using SQL Server, which I think defaults to row-oriented storage (all fields in a row are stored together in a page), which can be a problem with large number of columns. However, if you use column-oriented storage, the number of columns per table does not matter because each column is stored together. I don't know if this is possible with SQL Server.

Why are joins bad when considering scalability?

Why are joins bad or 'slow'. I know i heard this more then once. I found this quote
The problem is joins are relatively
slow, especially over very large data
sets, and if they are slow your
website is slow. It takes a long time
to get all those separate bits of
information off disk and put them all
together again.
source
I always thought they were fast especially when looking up a PK. Why are they 'slow'?
Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and the things you actually do you make sure are done as efficiently as possible.
In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.
But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. There are a number of scalability-friendly ways to take raw joins and make them much faster:
Join on a surrogate key (autonumer/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation
Indexes
Materialized/indexed views (think of this as a pre-computed join or managed de-normalization)
Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as generic denormalization)
Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.
Use of a caching layer like Redis to avoid re-running queries which need complex joins.
I would go as far as saying the main reason relational databases exist at all is to allow you do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query — just as if you had denormalized the data (admittedly at the cost of slower write operations).
If you have a slow join, you're probably not using your database correctly.
De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.
* That is, exist as entities distinct from mere collections of tables. An additional reason for a real rdbms is safe concurrent access.
Joins can be slower than avoiding them through de-normalisation but if used correctly (joining on columns with appropriate indexes an so on) they are not inherently slow.
De-normalisation is one of many optimisation techniques you can consider if your well designed database schema exhibits performance problems.
article says that they are slow when compared to absence of joins. this can be achieved with denormalization. so there is a trade off between speed and normalization. don't forget about premature optimization also :)
First of all, a relational database's raison d'etre (reason for being) is to be able to model relationships between entities. Joins are simply the mechanisms by which we traverse those relationships. They certainly do come at a nominal cost, but without joins, there really is no reason to have a relational database.
In the academic world we learn of things like the various normal forms (1st, 2nd, 3rd, Boyce-Codd, etc.), and we learn about different types of keys (primary, foreign, alternate, unique, etc.) and how these things fit together to design a database. And we learn the rudiments of SQL as well as manipulating both structure and data (DDL & DML).
In the corporate world, many of the academic constructs turn out to be substantially less viable than we had been led to believe. A perfect example is the notion of a primary key. Academically it is that attribute (or collection of attributes) that uniquely identifies one row in the table. So in many problem domains, the proper academic primary key is a composite of 3 or 4 attributes. However, almost everyone in the modern corporate world uses an auto-generated, sequential integer as a table's primary key. Why? Two reasons. The first is because it makes the model much cleaner when you're migrating FKs all over the place. The second, and most germane to this question, is that retrieving data through joins is faster and more efficient on a single integer than it is on 4 varchar columns (as already mentioned by a few folks).
Let's dig a little deeper now into two specific subtypes of real world databases. The first type is a transactional database. This is the basis for many e-commerce or content management applications driving modern sites. With a transaction DB, you're optimizing heavily toward "transaction throughput". Most commerce or content apps have to balance query performance (from certain tables) with insert performance (in other tables), though each app will have its own unique business driven issues to solve.
The second type of real world database is a reporting database. These are used almost exclusively to aggregate business data and to generate meaningful business reports. They are typically shaped differently than the transaction databases where the data is generated and they are highly optimized for speed of bulk data loading (ETLs) and query performance with large or complex data sets.
In each case, the developer or DBA needs to carefully balance both the functionality and performance curves, and there are lots of performance enhancing tricks on both sides of the equation. In Oracle you can do what's called an "explain plan" so you can see specifically how a query gets parsed and executed. You're looking to maximize the DB's proper use of indexes. One really nasty no-no is to put a function in the where clause of a query. Whenever you do that, you guarantee that Oracle will not use any indexes on that particular column and you'll likely see a full or partial table scan in the explain plan. That's just one specific example of how a query could be written that ends up being slow, and it doesn't have anything to do with joins.
And while we're talking about table scans, they obviously impact the query speed proportionally to the size of the table. A full table scan of 100 rows isn't even noticeable. Run that same query on a table with 100 million rows, and you'll need to come back next week for the return.
Let's talk about normalization for a minute. This is another largely positive academic topic that can get over-stressed. Most of the time when we talk about normalization we really mean the elimination of duplicate data by putting it into its own table and migrating an FK. Folks usually skip over the whole dependence thing described by 2NF and 3NF. And yet in an extreme case, it's certainly possible to have a perfect BCNF database that's enormous and a complete beast to write code against because it's so normalized.
So where do we balance? There is no single best answer. All of the better answers tend to be some compromise between ease of structure maintenance, ease of data maintenance and ease of code creation/maintenance. In general, the less duplication of data, the better.
So why are joins sometimes slow? Sometimes it's bad relational design. Sometimes it's ineffective indexing. Sometimes it's a data volume issue. Sometimes it's a horribly written query.
Sorry for such a long-winded answer, but I felt compelled to provide a meatier context around my comments rather than just rattle off a 4-bullet response.
People with terrabyte sized databases still use joins, if they can get them to work performance-wise then so can you.
There are many reasons not to denomalize. First, speed of select queries is not the only or even main concern with databases. Integrity of the data is the first concern. If you denormalize then you have to put into place techniques to keep the data denormalized as the parent data changes. So suppose you take to storing the client name in all tables instead of joining to the client table on the client_Id. Now when the name of the client changes (100% chance some of the names of clients will change over time), now you need to update all the child records to reflect that change. If you do this wil a cascade update and you have a million child records, how fast do you suppose that is going to be and how many users are going to suffer locking issues and delays in their work while it happens? Further most people who denormalize because "joins are slow" don't know enough about databases to properly make sure their data integrity is protected and often end up with databases that have unuseable data becasue the integrity is so bad.
Denormalization is a complex process that requires an thorough understanding of database performance and integrity if it is to be done correctly. Do not attempt to denormalize unless you have such expertise on staff.
Joins are quite fast enough if you do several things. First use a suggorgate key, an int join is almost alawys the fastest join. Second always index the foreign key. Use derived tables or join conditions to create a smaller dataset to filter on. If you have a large very complex database, then hire a professional database person with experience in partioning and managing huge databases. There are plenty of techniques to improve performance without getting rid of joins.
If you just need query capability, then yes you can design a datawarehouse which can be denormalized and is populated through an ETL tool (optimized for speed) not user data entry.
Joins are slow if
the data is improperly indexed
results poorly filtered
joining query poorly written
data sets very large and complex
So, true, the bigger your data sets the the more processing you'll need for a query but checking and working on the first three options of the above will often yield great results.
Your source gives denormalization as an option. This is fine only as long as you've exhausted better alternatives.
The joins can be slow if large portions of records from each side need to be scanned.
Like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
Even if an index is defined on account_customer, all records from the latter still need to be scanned.
For the query list this, the decent optimizers won't probably even consider the index access path, doing a HASH JOIN or a MERGE JOIN instead.
Note that for a query like this:
SELECT SUM(transaction)
FROM customers
JOIN accounts
ON account_customer = customer_id
WHERE customer_last_name = 'Stellphlug'
the join will most probably will be fast: first, an index on customer_last_name will be used to filter all Stellphlug's (which are of course, not very numerous), then an index scan on account_customer will be issued for each Stellphlug to find his transactions.
Despite the fact that these can be billions of records in accounts and customers, only few will actually need to be scanned.
Joins are fast. Joins should be considered standard practice with a properly normalized database schema. Joins allow you to join disparate groups of data in a meaningful way. Don't fear the join.
The caveat is that you must understand normalization, joining, and the proper use of indexes.
Beware premature optimization, as the number one failing of all development projects is meeting the deadline. Once you've completed the project, and you understand the trade offs, you can break the rules if you can justify it.
It's true that join performance degrades non-linearly as the size of the data set increases. Therefore, it doesn't scale as nicely as single table queries, but it still does scale.
It's also true that a bird flies faster without any wings, but only straight down.
Joins do require extra processing since they have to look in more files and more indexes to "join" the data together. However, "very large data sets" is all relative. What is the definition of large? I the case of JOINs, I think its a reference to a large result set, not that overall dataset.
Most databases can very quickly process a query that selects 5 records from a primary table and joins 5 records from a related table for each record (assuming the correct indexes are in place). These tables can have hundreds of millions of records each, or even billions.
Once your result set starts growing, things are going to slow down. Using the same example, if the primary table results in 100K records, then there will be 500K "joined" records that need to be found. Just pulling that much data out of the database with add delays.
Don't avoid JOINs, just know you may need to optimize/denormalize when datasets get "very large".
Also from the article you cited:
Many mega-scale websites with billions
of records, petabytes of data, many
thousands of simultaneous users, and
millions of queries a day are doing is
using a sharding scheme and some are
even advocating denormalization as the
best strategy for architecting the
data tier.
and
And unless you are a really large
website you probably don't need to
worry about this level of complexity.
and
It's more error prone than having the
database do all this work, but you are
able to do scale past what even the
highest end databases can handle.
The article is discussing mega-sites like Ebay. At that level of usage you are likely going to have to consider something other than plain vanilla relational database management. But in the "normal" course of business (applications with thousands of users and millions of records) those more expensive, more error prone approaches are overkill.
Joins are considered an opposing force to scalability because they're typically the bottleneck and they cannot be easily distributed or paralleled.
Properly designed tables containing with the proper indicies and correctly written queries not always slow. Where ever you heard that:
Why are joins bad or 'slow'
has no idea what they are talking about!!! Most joins will be very fast. If you have to join many many rows at one time you might take a hit as compared to a denormalized table, but that goes back to Properly designed tables, know when to denormalize and when not to. in a heavy reporting system, break out the data in denormalized tables for reports, or even create a data warehouse. In a transactional heavy system normalize the tables.
The amount of temporary data that is generated could be huge based on the joins.
For an example, one database here at work had a generic search function where all of the fields were optional. The search routine did a join on every table before the search began. This worked well in the beginning. But, now that the main table has over 10 million rows... not so much. Searches now take 30 minutes or more.
I was tasked with optimizing the search stored procedure.
The first thing I did was if any of the fields of the main table were being searched, I did a select to a temp table on those fields only. THEN, I joined all the tables with that temp table before doing the rest of the search. Searches where one of the main table fields now take less than 10 seconds.
If none of the main table fields are begin searched, I do similar optimizations for other tables. When I was done, no search takes longer than 30 seconds with most under 10.
CPU utilization of the SQL server also went WAY DOWN.
While joins (presumably due to a normalized design) can obviously be slower for data retrieval than a read from a single table, a denormalized database can be slow for data creation/update operations since the footprint of the overall transaction will not be minimal.
In a normalized database, a piece of data will live in only one place, so the footprint for an update will be as minimal as possible. In a denormalized database, it's possible that the same column in multiple rows or across tables will have to be updated, meaning the footprint would be larger and chance of locks and deadlocks can increase.
Well, yeah, selecting rows from one denormalized table (assuming decent indexes for your query) might be faster that selecting rows constructed from joining several tables, particularly if the joins don't have efficient indexes available.
The examples cited in the article - Flickr and eBay - are exceptional cases IMO, so have (and deserve) exceptional responses. The author specifically calls out the lack of RI and the extent of data duplication in the article.
Most applications - again, IMO - benefit from the validation & reduced duplication provided by RDBMSs.
They can be slow if done sloppily. For example, if you do a 'select *' on a join you will probaby take a while to get stuff back. However, if you carefully choose what columns to return from each table, and with the proper indexes in place, there should be no problem.

mysql - how many columns is too many?

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several table could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex select queries) to disk. This is bad, as disk i/o can be a big bottle-neck. This occurs if you have binary data (text or blob) in the query.
Wider table can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
Here is a link to help you along.
That's not a problem unless all attributes belong to the same entity and do not depend on each other.
To make life easier you can have one text column with JSON array stored in it. Obviously, if you don't have a problem with getting all the attributes every time. Although this would entirely defeat the purpose of storing it in an RDBMS and would greatly complicate every database transaction. So its not recommended approach to be followed throughout the database.
Having too many columns in the same table can cause huge problems in the replication as well. You should know that the changes that happen in the master will replicate to the slave.. for example, if you update one field in the table, the whole row will be w

Sql optimization Techniques

I want to know optimization techniques for databases that has nearly 80,000 records,
list of possibilities for optimizing
i am using for my mobile project in android platform
i use sqlite,i takes lot of time to retreive the data
Thanks
Well, with only 80,000 records and assuming your database is well designed and normalized, just adding indexes on the columns that you frequently use in your WHERE or ORDER BY clauses should be sufficient.
There are other more sophisticated techniques you can use (such as denormalizing certain tables, partitioning, etc.) but those normally only start to come into play when you have millions of records to deal with.
ETA:
I see you updated the question to mention that this is on a mobile platform - that could change things a bit.
Assuming you can't pare down the data set at all, one thing you might be able to do would be to try to partition the database a bit. The idea here is to take your one large table and split it into several smaller identical tables that each hold a subset of the data.
Which of those tables a given row would go into would depend on how you choose to partition it. For example, if you had a "customer_id" field that could range from 0 to 10,000 you might put customers 0 - 2500 in table1, 2,500 - 5,000 in table2, etc. splitting the one large table into 4 smaller ones. You would then have logic in your app that would figure out which table (or tables) to query to retrieve a given record.
You would want to partition your data in such a way that you generally only need to query one of the partitions at a time. Exactly how you would partition the data would depend on what fields you have and how you are using them, but the general idea is the same.
Create indexes
Delete indexes
Normalize
DeNormalize
80k rows isn't many rows these days. Clever index(es) with queries that utlise these indexes will serve you right.
Learn how to display query execution maps, then learn to understand what they mean, then optimize your indices, tables, queries accordingly.
Such a wide topic, which does depend on what you want to optimise for. But the basics:
indexes. A good indexing strategy is important, indexing the right columns that are frequently queried on/ordered by is important. However, the more indexes you add, the slower your INSERTs and UPDATEs will be so there is a trade-off.
maintenance. Keep indexes defragged and statistics up to date
optimised queries. Identify queries that are slow (using profiler/built-in information available from SQL 2005 onwards) and see if they could be written more efficiently (e.g. avoid CURSORs, used set-based operations where possible
parameterisation/SPs. Use parameterised SQL to query the db instead of adhoc SQL with hardcoded search values. This will allow better execution plan caching and reuse.
start with a normalised database schema, and then de-normalise if appropriate to improve performance
80,000 records is not much so I'll stop there (large dbs, with millions of data rows, I'd have suggested partitioning the data)
You really have to be more specific with respect to what you want to do. What is your mix of operations? What is your table structure? The generic advice is to use indices as appropriate but you aren't going to get much help with such a generic question.
Also, 80,000 records is nothing. It is a moderate-sized table and should not make any decent database break a sweat.
First of all, indexes are really a necessity if you want a well-performing database.
Besides that, though, the techniques depend on what you need to optimize for: Size, speed, memory, etc?
One thing that is worth knowing is that using a function in the where statement on the indexed field will cause the index not to be used.
Example (Oracle):
SELECT indexed_text FROM your_table WHERE upper(indexed_text) = 'UPPERCASE TEXT';

Is it better for faster access to split tables and JOIN in a SQL database or leave a few monolithic tables?

I know it's probably not the right way to structure a database but does the database perform faster if the data is put in one huge table instead of breaking it up logically in other tables?
I want to design and create the database properly using keys to create relational integrity across tables but when quering, is JOIN'ing slower than reading the required data from one table? I want to make the database queries as fast as possible.
So many other facets affect the answer to your question. What is the size of the table? width? how many rows? What is usage pattern? Are there different usage patterns for different subsets of the columns in the table? (i.e., are two columns hit 1000 times per second, and the other 50 columns only hit once or twice a day? ) this scenario would be a prime candidate to split (partition) the table vertically (two columns in one table, the rest on another)
In general, normalize the schema to the maximum degree possible, then run performance testing with typical or predicted loads and usage patterns, and denormalize and partition to the point where the performance becomes acceptable, and no more...
It depends on the dbms flavor and your actual data, of course. But generally more smaller (narrower) tables are faster than fewer larger (wider) tables.
Access is a little slower when joins must be performed. How much slower depends greatly on the features offered by your particular DBMS, and how the physical database design exploits those features, and on the most frequent access patterns. There are a few access patterns where storing a lot of data in one row wastes time, because the entire row is retrieved, but only a little of the row is used. It depends.
When data is stored in a single table and the normalization rules are deviated from, update is typically slower. How important speed of of update is versus speed of query is dependant on the particular way you use this database.
In general, a lot of newbie database designers tend to put more weight on speed issues than those issues deserve. If your data model is inflexible and incomprehensible, but you gain a 10% speed improvement, you have probably done more harm than good.
Are you building a "read-only" database like a data warehouse? If so, storing data "pre-joined" may make sense. For everyday OLTP databases you need to take into account the performance and ease of inserts, updates and deletes as well. Also, what about queries that only want the data that would have been in one or two of the smaller tables? Now they have to grind through a big fat table full of stuff they don't care about.
It's worth remembering that joining tables is bread-and-butter stuff to a decent DBMS - they are very good at it.
It is often true that querying a single table is faster than querying multiple joined tables. But a normalized design allows you to query the data in multiple ways, with adequate performance across many types of queries.
If you denormalize the tables, you may improve performance of one specific query, while sacrificing performance of other queries against that data. And of course you'll have to manage referential integrity and redundancy manually.
What you're asking about is denormalization - it can speed up reads if done in the right way, and if you are able to ensure that you're not introducing anomalies into your database because of it.
Remember also that there is a hard limit to the amount of data that can be stored in one record. (not knowing which database you have, I can't say what it is.) Too many columns and you will hit that limit. Also if you are having columns like phone1, phone2, phone3 then you need to normalize. If you would need to add a column if the number of items to be inserted about a record changes (if you statred needing 4 instead of 3 phone numbers for instance), you need to normalize instead.
What's true for optimising SELECTS is often not so great at optimising INSERTS, UPDATES and DELETES, and thus it is with this approach. Breaking out the data into properly normalised tables reduces the overhead of changing the data.
While it's tru that in a data warehouse or decision suport system we'd often store pre-joined data (as Tony says), it usually only happens in the context of a precomputed summary (eg. a materialized view) and not for data at the atomic level of granularity. The reason for this is that pushing repeated longer character strings (eg. "Supplier Name") into a dimension table reduces total required storage space and number of physical reads required to retrieve the data. The joins are usually equijoins, and these are performed at almost no cost for large data sets.