Can SQL index make search longer? - sql

I heard this question during a job interview, and the interviewer said the answer is yes. My question is why, and could someone give an example of an index that makes a search longer instead of shorter?

Yes, it can.
An additional index adds possible execution plans for a query if applicable. The Postgres query planner estimates costs for a variety of possible plans and the fastest estimate wins. Since those are estimates, actual query plans can always deviate. A chosen query plan using your new index can turn out to be slower than another plan without.
If your server is configured properly (cost and resource settings, current column statistics, ...) this outcome is unlikely, but still possible. This can happen for almost every query, and is more likely for more complex queries. Some types of queries are notoriously hard to estimate.
Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Also, indexes always add write cost, so if your database is write-heavy and the machine is already saturated, more indexes can bring overall performance down.
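To make the write cost a little more concrete, here is a minimal sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in (it is not PostgreSQL), and the table, column and index names are made up for illustration. It loads the same rows with and without a secondary index and compares the database size in pages, as a rough proxy for the extra data every write has to maintain. It is not a timing benchmark.

```python
import sqlite3

def page_count(with_index: bool, rows: int = 10_000) -> int:
    """Load `rows` rows into a fresh in-memory database and return its size in pages."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, note TEXT)")
    if with_index:
        # Hypothetical secondary index; every INSERT now has to update it as well.
        conn.execute("CREATE INDEX events_user_idx ON events (user_id)")
    conn.executemany(
        "INSERT INTO events (user_id, note) VALUES (?, ?)",
        ((i % 500, "x" * 20) for i in range(rows)),
    )
    conn.commit()
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    conn.close()
    return pages

plain = page_count(with_index=False)
indexed = page_count(with_index=True)
print(f"pages without index: {plain}, with index: {indexed}")
```

The indexed variant ends up with more pages because the index is a second structure that has to be written alongside the table.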

A trivial example would be on a table with very few rows.
An index search has to load the index into memory and then look up the original data. If a table has only a few rows, they probably fit onto one data page. So, a full table scan requires loading one page.
Any index search (on a cold cache) requires loading two data pages -- one for the index and one for the data. That can take (significantly) longer than just scanning the rows on a single page.
On a large table, if the "search" returns a significant proportion of the rows in the table, then an index search ends up fetching the rows in an order different from how they are stored. If the data pages do not fit in memory, then you have a situation called thrashing, which means that there is a high probability that each new row would be a cache miss.
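The thrashing effect can be sketched with a toy simulation. Everything here is an assumption for illustration: the table size, the rows per page, the cache size, the LRU eviction policy, and the idea that index order is unrelated to storage order (modelled by a seeded shuffle). It only counts page reads and does not model a real database.

```python
import random
from collections import OrderedDict

def count_page_reads(page_sequence, cache_pages):
    """Count cache misses for a sequence of page accesses with an LRU cache."""
    cache = OrderedDict()
    misses = 0
    for page in page_sequence:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            misses += 1                    # page has to be read from disk
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return misses

ROWS, ROWS_PER_PAGE, CACHE_PAGES = 100_000, 100, 50   # illustrative numbers only
table_pages = ROWS // ROWS_PER_PAGE                   # 1000 pages

# Sequential scan: touch every page once, in storage order.
seq_reads = count_page_reads(range(table_pages), CACHE_PAGES)

# Index scan returning every row: rows arrive in index order, assumed here
# to be unrelated to storage order.
rng = random.Random(0)
row_ids = list(range(ROWS))
rng.shuffle(row_ids)
index_reads = count_page_reads((r // ROWS_PER_PAGE for r in row_ids), CACHE_PAGES)

print("sequential scan page reads:", seq_reads)
print("index-order page reads:    ", index_reads)
```

With a cache much smaller than the table, nearly every row fetched in index order is a miss, while the sequential scan reads each page exactly once.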

Related

Does PostgreSQL need VACUUM or ANALYZE schedule to make use of indexes?

When creating indexes on PostgreSQL tables, EXPLAIN ANALYZE followed by an SQL command shows which indexes are used.
For example:
EXPLAIN ANALYZE SELECT A,B,C FROM MY_TABLE WHERE C=123;
Returns:
Seq Scan on public.my_table (cost=...) <- No index, BAD
And, after creating the index, it would return:
Index Scan using my_index_name on public.my_table (cost=...) <- Index, GOOD
However, for some queries that used the same index with a few hundred records, it didn't make any difference. Reading through the documentation, it is recommended to either run ANALYZE or have the "Autovacuum" daemon on. This way the database would know the size of tables and decide on query plans properly.
Is this absolutely necessary in a production environment? In other words, will PostgreSQL use the index when it's time to use it, without needing to analyse or vacuum as an extra task?
Short answer: "just run autovacuum." Long answer... yes, because statistics can get out of date.
Let's talk about indexes and how/when PostgreSQL decides to use them.
PostgreSQL gets a query in, parses it, and then begins the planning process. How are we going to scan the tables? How are we going to join them and in what order? These are not trivial decisions and trying to find the generally best ways to do things typically means that PostgreSQL needs to know something about the tables.
The first thing to note is that indexes are not always a win. No plan ever beats a sequential scan through a one-page table, and even a 5 page table will almost always be faster with a sequential scan than an index scan. So PostgreSQL cannot safely decide to "use all available indexes."
So the way PostgreSQL decides whether to use an index is to check statistics. Now, these go out of date, which is why you want autovacuum to be updating them. You say your table has a few hundred records, and the statistics were probably out of date. If PostgreSQL cannot say that the index is a win, it won't use it. A few hundred records is going to be approaching "an index might help" territory, depending on how selective the index is in weeding out records.
In your large table, there was probably no question based on existing statistics that the index would help. In your smaller table, there probably was a question and it got answered one way based on the stats it had, and a different way based on newer stats.
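As a rough illustration of "the planner only knows what the statistics tell it", here is a sketch using SQLite through Python's sqlite3 module. This is a stand-in, not PostgreSQL: SQLite has no autovacuum-driven statistics, but its ANALYZE command plays a similar role and stores its results in a table called sqlite_stat1. The table and index names are borrowed from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX my_index_name ON my_table (c)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    ((i, i * 2, i % 10) for i in range(1000)),
)

def stats_rows(connection):
    """Return the planner statistics SQLite has stored, or [] if none exist yet."""
    has_table = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()
    if not has_table:
        return []
    return connection.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()

before = stats_rows(conn)   # nothing has been gathered yet
conn.execute("ANALYZE")
after = stats_rows(conn)    # statistics now exist for the index

print("before ANALYZE:", before)
print("after ANALYZE: ", after)
```

Until statistics are gathered the planner has to fall back on built-in guesses, which is the situation the answer above describes for a freshly loaded table.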

How can I measure the cost of a database index?

Is there a good method for judging whether the costs of creating a database index in Postgres (slower INSERTS, time to build an index, time to re-index) are worth the performance gains (faster SELECTS)?
I am actually going to disagree with Hexist. PostgreSQL's planner is pretty good, and it supports good sequential access to table files based on physical order scans, so indexes are not necessarily going to help. Additionally there are many cases where the planner has to pick an index. Additionally you are already creating indexes for unique constraints and primary keys.
I think one of the good default positions with PostgreSQL (MySQL btw is totally different!) is to wait until you need an index to add one and then only add the indexes you most clearly need. This is, however, just a starting point and it assumes either a general lack of experience in looking at query plans or a lack of understanding of where the application is likely to go. Having experience in these areas matters.
In general, where you have tables likely to span more than 10 pages (that's about 80 kB of data and headers at PostgreSQL's default 8 kB page size), it's a good idea to index foreign keys. These can be assumed to be clearly needed. Small lookup tables spanning 1 page should never have non-unique indexes because these indexes are never going to be used for selects (no query plan beats a sequential scan over a single page).
Beyond that point you also need to look at data distribution. Indexing boolean columns is usually a bad idea and there are better ways to index things relating to boolean searches (partial indexes being a good example). Similarly indexing commonly used function output may seem like a good idea sometimes, but that isn't always the case. Consider:
CREATE INDEX gj_transdate_year_idx ON general_journal (extract('YEAR' FROM transdate));
This will not do much. However an index on transdate might be useful if paired with a sparse index scan via a recursive CTE.
Once the basic indexes are in place, then the question becomes what other indexes do you need to add. This is often better left to later use case review than it is designed in at first. It isn't uncommon for people to find that performance significantly benefits from having fewer indexes on PostgreSQL.
Another major thing to consider is what sort of indexes you create, and these are often use-case specific. A b-tree index on an array record, for example, might make sense if ordinality is important to the domain and if you are frequently searching based on initial elements, but if ordinality is unimportant, I would recommend a GIN index, because a b-tree will do very little good (of course that is an atomicity red flag, but sometimes that makes sense in Pg). Even when ordinality is important, sometimes you need GIN or GiST indexes anyway because you need to be able to do commutative scans as if ordinality was not. This is true if using ip4r, for example, to store cidr blocks and using an EXCLUDE constraint (which in practice means a GiST index) to ensure that no block contains any other block (the actual scan requires using an overlap operator rather than a contain operator since you don't know which side of the operator the violation will be found on).
Again this is somewhat database-specific. On MySQL, Hexist's recommendations would be correct, for example. On PostgreSQL, though, it's good to watch for problems.
As far as measuring, the best tool is EXPLAIN ANALYZE
Generally speaking, unless you have a log or archive table that you won't be doing selects on very frequently (or it's ok if they take a while to run), you should index anything your select/update/delete statements will be using in a where clause.
This however is not always as simple as it seems: just because a column is used in a where clause and is indexed doesn't mean the SQL engine will be able to use the index. Using the EXPLAIN and EXPLAIN ANALYZE capabilities of PostgreSQL you can examine which indexes were used in selects, which helps you figure out whether having an index on a column will even help you.
Indexing is generally worth it because without an index your select speed goes from some O(log n)-looking operation down to O(n), while your insert speed only improves from cO(log n) to dO(log n) where d is usually less than c. That is, you may speed up your inserts a little by not having an index, but you're going to kill your select speed if they're not indexed, so it's almost always worth it to have an index on your data if you're going to be selecting against it.
Now, if you have some small table that you do a lot of inserts and updates on, and frequently remove all the entries, and only periodically do some selects, it could turn out to be faster to not have any indexes.. however that would be a fairly special case scenario, so you'd have to do some benchmarking and decide if it made sense in your specific case.
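For a quick feel of the "look at the plan before and after adding the index" workflow, here is a small sketch. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in for PostgreSQL's EXPLAIN / EXPLAIN ANALYZE, so the output format differs from what Postgres prints. The table, column and index names are reused from the earlier question for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    ((i, i * 2, i) for i in range(5000)),
)

QUERY = "SELECT a, b, c FROM my_table WHERE c = 123"

def plan(connection, sql):
    """Return the human-readable plan steps reported by EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]   # the last column holds the description

plan_before = plan(conn, QUERY)
conn.execute("CREATE INDEX my_index_name ON my_table (c)")
plan_after = plan(conn, QUERY)

print("without index:", plan_before)
print("with index:   ", plan_after)
```

The first plan is a full scan of the table and the second one names the index. Whether the indexed plan is actually faster is a separate question that only a measurement on realistic data can answer.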
Nice question. I'd like to add a bit to what @hexist has already mentioned and to the info provided by @ypercube's link.
By design, a database doesn't know in which part of the table it will find data that satisfies the provided predicates. Therefore, the DB will perform a full or sequential scan of all the table's data, filtering the needed rows.
An index is a special data structure that, for a given key, can precisely specify in which rows of the table such values will be found. The main differences when an index is involved:
there is a cost for the index scan itself, i.e. the DB has to find a value in the index first;
there's an extra cost of reading specific data from the table itself.
Working with index will lead to a random IO pattern, compared to a sequential one used in the full scan. You can google for the comparison figures of random and sequential disk access, but it might differ up to an order of magnitude (random being slower of course).
Still, it's clear that in some cases index access will be cheaper and in others a full scan should be preferred. This depends on how many rows (out of all) will be returned by the specified predicate, i.e. on its selectivity:
1. if the predicate returns a relatively small number of rows, say, less than 10% of the total, then it seems valuable to pick those directly via the index. This is a typical case for Primary/Unique keys or queries like: I need address information for customer with internal number = XXX;
2. if the predicate has no big impact on the selectivity, i.e. if 30% (or more) of the rows are returned, then it's cheaper to do a full scan, because sequential disk access will beat random and data will be delivered faster. All reports covering big areas (like a month, or all customers) fall here;
3. if there's a need to obtain an ordered list of values and there's an index, then doing an index scan is the fastest option. This is a special case of #2, when you need report data ordered by some column;
4. if the number of distinct values in the column is relatively small compared to the total number of values, then an index will be a good choice. This is a case called Loose Index Scan, and typical queries will be like: I need the 20 most recent purchases for each of the top 5 categories by number of goods.
How DB decides what to do, Index or Full scan? This is a runtime decision and it is based on the statistics, so make sure to keep those up to date. In fact, numbers provided above have no real life value, you have to evaluate each query independently.
All this is a very rough description of what happens. I would very much recommend looking into How PostgreSQL Planner Uses Statistics; it is the best I've seen on the subject.
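To show why the percentages above are only rules of thumb, here is a deliberately crude cost sketch. The two constants are PostgreSQL's documented defaults for seq_page_cost and random_page_cost. Everything else is an assumption for illustration: it ignores CPU cost, index pages, caching and physical row order, and it pessimistically charges one random page read per matching row. It is not the planner's real formula.

```python
SEQ_PAGE_COST = 1.0      # PostgreSQL's default seq_page_cost
RANDOM_PAGE_COST = 4.0   # PostgreSQL's default random_page_cost

def seq_scan_cost(table_pages: int) -> float:
    """Read every page of the table once, in order."""
    return table_pages * SEQ_PAGE_COST

def index_scan_cost(total_rows: int, selectivity: float) -> float:
    """Pessimistic toy estimate: one random page read per matching row."""
    return total_rows * selectivity * RANDOM_PAGE_COST

def cheaper_plan(total_rows: int, table_pages: int, selectivity: float) -> str:
    """Pick whichever toy estimate is lower."""
    if index_scan_cost(total_rows, selectivity) < seq_scan_cost(table_pages):
        return "index scan"
    return "seq scan"

# Hypothetical table: 1,000,000 rows at 100 rows per page -> 10,000 pages.
for selectivity in (0.0001, 0.01, 0.30):
    print(f"{selectivity:>7.2%} of rows -> {cheaper_plan(1_000_000, 10_000, selectivity)}")
```

In this toy model the break-even point sits well below 10% because each page holds 100 rows. Change the rows per page or the cost settings and it moves, which is why each query has to be evaluated on its own.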

Can Indices actually decrease SELECT performance?

After reading some stuff about indices on SQL Server and their performance advantages for selects and disadvantages for updates / inserts, I was wondering if badly used indices could actually also hurt performance for selects.
What conditions would have to be fulfilled to have an index decrease performance of a pure select query? Do such situations exist?
Thanks!
(although I always try to include code examples, i can't think of anything that would support this question...)
Yes, albeit very slightly - so slightly that it would be justified to also answer "No".
If you have an index which might be considered for a query, but is not useable, the optimizer will waste a short time pondering whether and how to use it (in rare cases with REALLY complicated indexes and views, and more frequently when index performance hints are wrong, you might end up choosing a suboptimal query plan).
Some cases would be:
a table without indexes
a table with a badly chosen index, which gets discarded
a table where TWO indexes exist, and for some reason (e.g. obsolete statistics), the existence of the second index makes the optimizer choose it, while it would have been more convenient to use the first.
a table where the existing index (usually also thanks to obsolete statistics) tricks the optimizer into reading from the index an amount of data comparable to what could have been, more efficiently, retrieved with a full table scan; to make things worse, the index is fragmented and hashed differently than the table. What was essentially a full table scan becomes a slowed down full table scan with lots of disk thrashing.
In the first two cases the query time is the same (and entails a full scan), but in the second, you also have to analyze and discard the index. In the fourth, unlikely but possible, case an execution time which is likely very large increases and becomes huge (update 2021-10-20: I have just done this to myself. Yay me).
Where an index is likelier to hurt you - where ALL indexes hurt you - is in inserts, deletes and updates. Then, any index not used by the update query, yet affected by same, will require a write to the index itself.
So you will want to have indexes, but as few as you can without sacrificing SELECT performance. Actually, you might decide against indexing for a rarely used SELECT query in order to avoid having the needed index constantly updated by all other UPDATE queries.
Edit: after reading Heinzi's answer, I'd also like to add that most DB servers have maintenance tools which analyze the tables and indexes (and sometimes query performance counters too), and properly update the hints of which Heinzi spoke. So it's also important to periodically "maintain" the database to keep the optimizer supplied with up-to-date information on which indexes to choose from.
Update (MySQL)
There is a very nifty MySQL analysis tool that can actually suggest improvements to the existing indexing (remove unused keys, add useful keys): common_schema. It's really worth a look.
Yes, but it's very unlikely and it should not influence your decision to use indexes.
Sometimes, the SQL Server query analyzer chooses an execution plan that's not optimal. Since the number of possible execution plans is much larger than it might seem at first sight (a simple join of n tables already produces n! possible execution plans), SQL Server has to make an educated guess. It's in the nature of guesses that they are sometimes wrong.
It's a rare occurrence, but I've seen it happen a couple of times in the past years. In that case (and only in that case), a better plan would have been chosen if the index had not been there. However, removing the index is not the correct way to solve this problem, since the index usually exists for a reason. The correct way is to add a hint to this query (and only to this query), to help the optimizer choose the right plan.
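To put a number on the n! remark above, here is a tiny sketch. It only counts left-to-right orderings of the tables and ignores join algorithms, index choices and other plan shapes, so a real search space is larger still. The function name is made up for illustration.

```python
from math import factorial

def join_orders(table_count: int) -> int:
    """Number of left-to-right orderings of `table_count` tables (n!)."""
    return factorial(table_count)

for n in (2, 4, 8, 12):
    print(f"{n:>2} tables -> {join_orders(n):,} join orders")
```

Twelve tables already give 479,001,600 orderings, which is why an optimizer has to prune and estimate rather than try everything.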
Yes, indexes can hurt performance for SELECTs. It is important to understand how database engines operate. Data is stored on disk(s) in "pages". Indexes make it possible to access the specific page that has a specific value in one or more columns in the table.
This is great if you are looking for specific values.
However, consider a query that needs to look at every row in a table. If you go through the table, you read the pages in order and -- critically -- you get every row on the page with a single read. The number of reads is the number of pages in the table. In addition, the page cache can optimize the reads with look-ahead reads and pages no longer being used are simply overwritten.
Using an index for the same reads goes through the table one record at a time rather than one page at a time. This results in random reads through the pages. In the worst case, there is one read per record in the table -- potentially a very significant hit to performance. In addition, the index itself occupies some of the page cache, reducing memory for other operations.
In general, the optimizer component of a SQL engine does a good job distinguishing between these two situations. One of the key metrics is the selectivity of the query. How many rows is the query returning (which the optimizer looks at with respect to the number of pages)? If the number of rows is about the same as the number of pages, the optimizer would consider a full table scan rather than an index scan.
There are definitely other considerations, but in general, an index can hurt performance of even a simple select query. In general, optimizers do a good job, but there are sometimes unusual cases that trick even the best optimizers.
My guess would be if you create indices that confuse the query plan optimiser, and that ends up choosing an inefficient index for the query at hand.
This is potentially implementation-dependent, but in principle indexes should not slow down SELECT.
Obviously they can slow down INSERT and UPDATE.

Are SQL Execution Plans based on Schema or Data or both?

I hope this question is not too obvious...I have already found lots of good information on interpreting execution plans but there is one question I haven't found the answer to.
Is the plan (and more specifically the relative CPU cost) based on the schema only, or also the actual data currently in the database?
I am trying to do some analysis of where indexes are needed in my product's database, but am working with my own test system, which does not have close to the amount of data a product in the field would have. I am seeing some odd things like the estimated CPU cost actually going slightly UP after adding an index, and am wondering if this is because my data set is so small.
I am using SQL Server 2005 and Management Studio to do the plans
It will be based on both Schema and Data. The Schema tells it what indexes are available, the Data tells it which is better.
The answer can change in small degrees depending on the DBMS you are using (you have not stated), but they all maintain statistics against indexes to know whether an index will help. If an index breaks 1000 rows into 900 distinct values, it is a good index to use. If an index only results in 3 different values for 1000 rows, it is not really selective so it is not very useful.
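The 900-versus-3 comparison can be written down as a one-line measure. This is only a sketch of the idea: real statistics also track things like value distribution, and the function name is made up for illustration.

```python
def distinct_ratio(values) -> float:
    """Fraction of distinct values: close to 1.0 is highly selective, close to 0 is not."""
    values = list(values)
    return len(set(values)) / len(values)

selective = [i % 900 for i in range(1000)]    # 900 distinct values in 1000 rows
unselective = [i % 3 for i in range(1000)]    # 3 distinct values in 1000 rows

print(distinct_ratio(selective))     # 0.9
print(distinct_ratio(unselective))   # 0.003
```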
SQL Server's optimizer is 100% cost-based. Other RDBMS optimizers are usually a mix of cost-based and rules-based, but SQL Server, for better or worse, is entirely cost driven. A rules-based optimizer would be one that can say, for example, that the order of the tables in the FROM clause determines the driving table in a join. There are no such rules in SQL Server. See SQL Statement Processing:
The SQL Server query optimizer is a cost-based optimizer. Each possible execution plan has an associated cost in terms of the amount of computing resources used. The query optimizer must analyze the possible plans and choose the one with the lowest estimated cost. Some complex SELECT statements have thousands of possible execution plans. In these cases, the query optimizer does not analyze all possible combinations. Instead, it uses complex algorithms to find an execution plan that has a cost reasonably close to the minimum possible cost.
The SQL Server query optimizer does not choose only the execution plan with the lowest resource cost; it chooses the plan that returns results to the user with a reasonable cost in resources and that returns the results the fastest. For example, processing a query in parallel typically uses more resources than processing it serially, but completes the query faster. The SQL Server optimizer will use a parallel execution plan to return results if the load on the server will not be adversely affected.
The query optimizer relies on distribution statistics when it estimates the resource costs of different methods for extracting information from a table or index. Distribution statistics are kept for columns and indexes. They indicate the selectivity of the values in a particular index or column. For example, in a table representing cars, many cars have the same manufacturer, but each car has a unique vehicle identification number (VIN). An index on the VIN is more selective than an index on the manufacturer. If the index statistics are not current, the query optimizer may not make the best choice for the current state of the table. For more information about keeping index statistics current, see Using Statistics to Improve Query Performance.
Both schema and data.
It takes the statistics into account when building a query plan, using them to approximate the number of rows returned by each step in the query (as this can have an effect on the performance of different types of joins, etc).
A good example of this is the fact that it doesn't bother to use indexes on very small tables, as performing a table scan is faster in this situation.
I can't speak for all RDBMS systems, but Postgres specifically uses estimated table sizes as part of its efforts to construct query plans. As an example, if a table has two rows, it may choose a sequential table scan for the portion of the JOIN that uses that table, whereas if it has 10000+ rows, it may choose to use an index or hash scan (if either of those are available.) Incidentally, it used to be possible to trigger poor query plans in Postgres by joining VIEWs instead of actual tables, since there were no estimated sizes for VIEWs.
Part of how Postgres constructs its query plans depends on tunable parameters in its configuration file. More information on how Postgres constructs its query plans can be found on the Postgres website.
For SQL Server, there are many factors that contribute to the final execution plan. On a basic level, Statistics play a very large role, but they are based on the data, and not always on all of the data. Statistics are also not always up to date. When creating or rebuilding an Index, the statistics should be based on a FULL / 100% sample of the data. However, the sample rate for automatic statistics refreshing is much lower than 100%, so it is possible to sample a range that is in fact not representative of much of the data. The estimated number of rows for the operation also plays a role, and it can be based on the number of rows in the table or the statistics on a filtered operation. So out-of-date (or incomplete) Statistics can lead the optimizer to choose a less-than-optimal plan, just as a few rows in a table can cause it to ignore indexes entirely (which can be more efficient).
As mentioned in another answer, the more unique (i.e. Selective) the data is, the more useful the index will be. But keep in mind that the only column guaranteed to have statistics is the leading (or "left-most" or "first") column of the Index. SQL Server can, and does, collect statistics for other columns, even some not in any Indexes, but only if the AUTO_CREATE_STATISTICS DB option is set (and it is by default).
Also, the existence of Foreign Keys can help the optimizer when those fields are in a query.
But one area not considered in the question is that of the Query itself. A query, slightly changed but still returning the same results, can have a radically different Execution Plan. It is also possible to invalidate the use of an Index by using:
LIKE '%' + field
or wrapping the field in a function, such as:
WHERE DATEADD(DAY, -1, field) < GETDATE()
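A quick way to see the effect of wrapping the column in a function is to compare plans. The sketch below uses SQLite via Python's sqlite3 module as a stand-in for SQL Server, so the date functions and plan output are SQLite's, not T-SQL's. The table and index names are made up for illustration, and the two predicates are assumed equivalent only for ISO-8601 date strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT, payload TEXT)")
conn.execute("CREATE INDEX orders_created_idx ON orders (created)")

def plan(sql):
    """Return the EXPLAIN QUERY PLAN description for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Column wrapped in a function: the index on `created` cannot drive the filter.
wrapped = plan(
    "SELECT id, payload FROM orders WHERE date(created, '-1 day') < date('now')"
)
# Same condition with the date arithmetic moved to the constant side.
sargable = plan(
    "SELECT id, payload FROM orders WHERE created < date('now', '+1 day')"
)

print("wrapped column:", wrapped)
print("bare column:   ", sargable)
```

Moving the arithmetic to the constant side leaves the bare column on one side of the comparison, which is what lets the index be considered at all.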
Now, keep in mind that read operations are (ideally) faster with Indexes but DML operations (INSERT, UPDATE, and DELETE) are slower (taking more CPU and Disk I/O) as the Indexes need to be maintained.
Lastly, the "estimated" CPU, etc. values for cost are not always to be relied upon. A better test is to do:
SET STATISTICS IO ON
run query
SET STATISTICS IO OFF
and focus on "logical reads". If you reduce Logical Reads then you should be improving performance.
You will, in the end, need a set of data that comes somewhat close to what you have in Production in order to performance tune with regards to both Indexes and the Queries themselves.
Oracle specifics:
The stated cost is actually an estimated execution time, but it is given in a somewhat arcane unit of measure that has to do with estimated time for block reads. It's important to realize that the calculated cost doesn't say much about the runtime anyway, unless each and every estimate made by the optimizer was 100% perfect (which is never the case).
The optimizer uses the schema for a lot of things when deciding what transformations/heuristics can be applied to the query. Some examples of schema things that matter a lot when evaluating xplans:
Foreign key constraints (can be used for table elimination)
Partitioning (exclude entire ranges of data)
Unique constraints (index unique vs range scans for example)
Not null constraints (anti-joins are not available with not in() on nullable columns)
Data types (type conversions, specialized date arithmetics)
Materialized views (for rewriting a query against an aggregate)
Dimension Hierarchies (to determine functional dependencies)
Check constraints (the constraint is injected if it lowers cost)
Index types (b-tree(?), bitmap, joined, function based)
Column order in index (a = 1 on {a,b} = range scan, {b,a} = skip scan or FFS)
The core of the estimates comes from using the statistics gathered on actual data (or cooked). Statistics are gathered for tables, columns, indexes, partitions and probably something else too.
The following information is gathered:
Nr of rows in table/partition
Average row/col length (important for costing full scans, hash joins, sorts, temp tables)
Number of nulls in col (is_president = 'Y' is pretty much unique)
Distinct values in col (last_name is not very unique)
Min/max value in col (helps unbounded range conditions like date > x)
...to help estimate the nr of expected rows/bytes returned when filtering data. This information is used to determine what access paths and join mechanisms are available and suitable given the actual values from the SQL query compared to the statistics.
On top of all that, there is also the physical row order, which affects how "good" or attractive an index becomes vs a full table scan. For indexes this is called "clustering factor" and is a measure of how much the row order matches the order of the index entries.
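A simplified version of the clustering factor idea can be sketched as "walk the index in key order and count how often the table block changes". The function and the sizes below are assumptions for illustration and are not Oracle's actual implementation.

```python
def clustering_factor(block_of_row_in_key_order):
    """Count how often the table block changes while walking the index in key order."""
    changes = 0
    previous = None
    for block in block_of_row_in_key_order:
        if block != previous:
            changes += 1
            previous = block
    return changes

ROWS, ROWS_PER_BLOCK = 1000, 10   # illustrative sizes -> 100 table blocks

# Rows stored in key order: consecutive index entries share a block.
well_clustered = [row // ROWS_PER_BLOCK for row in range(ROWS)]
# Rows scattered round-robin: consecutive index entries never share a block.
scattered = [row % (ROWS // ROWS_PER_BLOCK) for row in range(ROWS)]

print(clustering_factor(well_clustered))   # 100
print(clustering_factor(scattered))        # 1000
```

A value near the number of blocks means an index range scan touches each block about once. A value near the number of rows means almost every row fetched lands on a different block than the previous one.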

What aspects of a sql query are relatively costly to one another? Joins? Num of records? columns selected?

How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
If you have a sql query that has two or three tables joined together and is retrieving 100 rows of data, does performance have anything to say whether I should be selecting only the number of columns I need? Or should I write a query that just yanks all the columns..
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Would 1 record vs 10 record vs 100 record matter?
As an extremely generalized version of ranking those factors you mention in terms of performance penalty and occurrence in the queries you write, I would say:
Joins - Especially when joining on tables with no indexes for the fields you're joining on and/or with tables that have a very large amount of data.
# of Rows / Amount of Data - Again, indexes mitigate this quite a bit, just make sure you have the right ones.
# of Fields - I would say the # of fields in the SELECT clause impact performance the least in most situations.
I would say any performance-driving property is always coupled with how much data you have - sure a join might be fast when your tables have 100 rows each, but when millions of rows are in the tables, you have to start thinking about more efficient design.
Several things impact the cost of a query.
First, are there appropriate indexes for it to use? Fields that are used in a join should almost always be indexed, and foreign keys are not indexed by default; the designer of the database must create those indexes. Fields used in the where clauses often need indexes as well.
Next, is the where clause sargable, in other words can it use the indexes even if you have the correct ones? A bad where clause can hurt a query far more than joins or extra columns. You can't get anything but a table scan if you use syntax that prevents the use of an index such as:
LIKE '%test'
Next, are you returning more data than you need? You should never return more columns than you need and you should not be using select * in production code as it has additional work to look up the columns as well as being very fragile and subject to create bad bugs as the structure changes with time.
Are you joining to tables you don't need to be joining to? If a table returns no columns in the select, is not used in the where and doesn't filter out any records if the join is removed, then you have an unnecessary join and it can be eliminated. Unnecessary joins are particularly prevalent when you use a lot of views, especially if you make the mistake of calling views from other views (which is a big performance killer for many reasons). Sometimes if you trace through these views that call other views, you will see the same table joined to multiple times when it would not have been necessary if the query was written from scratch instead of using a view.
Not only does returning more data than you need cause the SQL Server to work harder, it causes the query to use up more of the network resources and more of the memory of the web server if you are holding the results in memory. It is an all-around poor choice.
Finally, are you using known poorly performing techniques when a better one is available? This would include the use of cursors when a set-based alternative is better, the use of correlated subqueries when a join would be better, the use of scalar user-defined functions, and the use of views that call other views (especially if you nest more than one level). Most of these poor techniques involve processing row-by-agonizing-row, which is generally the worst choice in a database. To properly query databases you need to think in terms of data sets, not processing one row at a time.
There are plenty more things that affect performance of queries and the database; to truly get a grip on this subject you need to read some books on the subject. This is too complex a subject to fully discuss in a message board.
Or should I write a query that just yanks all the columns..
No. Just today there was another question about that.
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Any useless join or data retrieval costs you time and should be avoided. Retrieving rows from a datastore is costly. Joins can be more or less costly depending on the context, amount of indexes defined... you can examine the query plan of each query to see the estimated cost for each step.
Selecting more columns/rows will have some performance impacts, but honestly why would you want to select more data than you are going to use anyway?
If possible, could you help me
understand what aspects of a query
would be relatively costly compared to
one another?
Build the query you need, THEN worry about optimizing it if the performance doesn't meet your expectations. You are putting the cart before the horse.
To answer the following:
How costly would SELECT One, Two,
Three be compared to SELECT One, Two,
Three, ..... N-Column
This is not a matter of the select performance but of the time it takes to fetch the data. Select * from Table and Select ID from Table perform the same, but fetching the data takes longer for the former. This goes hand in hand with the number of rows returned from a query.
As for understanding performance, here is a good link
http://www.dotnetheaven.com/UploadFile/skrishnasamy/SQLPerformanceTunning03112005044423AM/SQLPerformanceTunning.aspx
Or google tsql Performance
Joins have the potential to be expensive. In the worst case scenario, when no indexes can be used, they require O(M*N) time, where M and N are the number of records in the tables. To speed things up, you can CREATE INDEX on columns that are part of the join condition.
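A rough way to see the O(M*N) cost is to count the work by hand. The sketch below is plain Python, not what a database engine actually runs: the dictionary stands in for an index on the join column (real indexes are usually B-trees, so each probe costs about log N rather than constant time), and the function names are made up for this example.

```python
def nested_loop_join(left, right):
    """Compare every left row with every right row: M * N comparisons."""
    comparisons = 0
    matches = []
    for l_key, l_val in left:
        for r_key, r_val in right:
            comparisons += 1
            if l_key == r_key:
                matches.append((l_key, l_val, r_val))
    return matches, comparisons


def indexed_join(left, right):
    """Build a lookup on the join key once, then probe it once per left row."""
    index = {}
    for r_key, r_val in right:
        index.setdefault(r_key, []).append(r_val)
    probes = 0
    matches = []
    for l_key, l_val in left:
        probes += 1
        for r_val in index.get(l_key, []):
            matches.append((l_key, l_val, r_val))
    return matches, probes


customers = [(i, f"customer{i}") for i in range(100)]    # M = 100
orders = [(i % 100, f"order{i}") for i in range(1000)]   # N = 1000

slow_rows, slow_work = nested_loop_join(customers, orders)
fast_rows, fast_work = indexed_join(customers, orders)

print(slow_work, fast_work)  # 100000 comparisons versus 100 probes
```

Both functions return the same 1000 matched rows; only the amount of work to find them differs.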
The number of columns has little effect on the time required to find rows, but slows things down by requiring more data to be sent.
What others are saying is all true.
But typically, if you are working with tables that already have good indexes, what's most important for performance is what goes into the WHERE clause. There you have to worry more about using a field that has no index or using an expression that can't be optimized.
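As an illustration of a WHERE expression that "can't be optimized", the sketch below wraps an indexed column in a function. It uses SQLite through Python's sqlite3 module as a stand-in (the users table and index name are hypothetical), and SQL Server's plans look different, but the same principle applies: the plain comparison can seek the index, while the wrapped column forces a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and index, only for this illustration.
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, bio TEXT);
    CREATE INDEX idx_users_email ON users (email);
""")


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


# The column is compared directly, so the index can be searched.
seekable = plan("SELECT bio FROM users WHERE email = 'a@example.com'")

# The column is wrapped in a function, so the plain index on email
# cannot be used to locate the matching rows.
wrapped = plan("SELECT bio FROM users WHERE lower(email) = 'a@example.com'")

print(seekable)
print(wrapped)
```

The first plan should report a SEARCH using idx_users_email, the second a SCAN of the whole table.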
The difference between SELECT One, Two, Three FROM ... and SELECT One,...,N FROM ... could be like the difference between day and night. To understand the problem, you need to understand the concept of a covering index:
A covering index is a special case
where the index itself contains the
required data field(s) and can return
the data.
As you add more unnecessary columns to the projection list, you force the query optimizer to look up the newly added columns in the 'table' (really in the clustered index or in the heap). This can change an execution plan from an efficient narrow index range scan or seek into a bloated clustered index scan, which can turn a sub-second query into one that runs for hours, depending on your data. So projecting unnecessary columns is often the factor with the largest impact on a query.
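The effect of one extra column on the plan can be sketched with SQLite through Python's sqlite3 module. This is a stand-in for SQL Server, the people table and index are made up for the example, and SQLite has no clustered index scan tipping point on an empty table, so only the covering versus non-covering difference is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a narrow two-column index.
conn.executescript("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        last_name TEXT,
        first_name TEXT,
        notes TEXT
    );
    CREATE INDEX idx_people_last_first ON people (last_name, first_name);
""")


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


# Every requested column lives in the index: no lookup into the table.
narrow = plan("SELECT first_name FROM people WHERE last_name = 'Smith'")

# 'notes' is not in the index, so each match needs a lookup into the table.
wide = plan("SELECT first_name, notes FROM people WHERE last_name = 'Smith'")

print(narrow)
print(wide)
```

The narrow query should be reported as using a COVERING INDEX, while the wide one uses the same index without the covering label, which means an extra table lookup per matching row.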
The number of records pulled is a more subtle issue. With a large number, a query can hit the index tipping point and choose, again, a clustered index scan over a narrower index range scan and lookup. The fact that lookups into the clustered index are necessary in the first place means the narrow index is not covering, which ultimately may be caused by projecting unnecessary columns.
And finally, joins. The question here is joins, as opposed to what else? If a join is required, there is no alternative, and that's all there is to say about this.
Ultimately, query performance is driven by one factor alone: amount of IO. And the amount of IO is driven ultimately by the access paths available to satisfy the query. In other words, by the indexing of your data. It is impossible to write efficient queries on bad indexes. It is possible to write bad queries on good indexes, but more often than not the optimizer can compensate and come up with a good plan. You should spend all your effort in better understanding index design:
Designing Indexes
SQL Server Optimization
Short answer: Don't select more fields than you need - search for "*" in both your source code and your stored procedures ;)
You always have to consider which parts of the query cause which costs.
If you have a good DB design, joining a few tables is usually not expensive. (Make sure you have correct indices).
The main issue with "select *" is that it causes unpredictable behavior in your results. If you write a query like that AND access the fields by column index, you will be locked into the DB schema forever.
Another thing to consider is the amount of data you have to transfer. You might think it's trivial, but then version 2.0 of your application adds a ProfilePicture to the User table, and now the query that selects 100 users suddenly uses up several megabytes of bandwidth.
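Here is a small sketch of that scenario, using Python's sqlite3 module as a stand-in database. The table, the 50,000-byte picture size, and the payload_bytes helper are all assumptions made for the example; the helper only approximates the bytes a client would receive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(f"user{i}",) for i in range(100)],
)


def payload_bytes(sql):
    """Rough size of a result set: sum of the byte length of every value."""
    total = 0
    for row in conn.execute(sql):
        for value in row:
            if isinstance(value, bytes):
                total += len(value)
            else:
                total += len(str(value).encode("utf-8"))
    return total


before = payload_bytes("SELECT * FROM users")

# "Version 2.0": a picture column is added (size chosen for illustration).
conn.execute("ALTER TABLE users ADD COLUMN profile_picture BLOB")
conn.execute("UPDATE users SET profile_picture = zeroblob(50000)")

star_after = payload_bytes("SELECT * FROM users")
named_after = payload_bytes("SELECT id, name FROM users")

print(before, star_after, named_after)
```

The query with named columns returns exactly what it did before the schema change, while the unchanged "select *" now drags 5,000,000 extra bytes along for the same 100 users.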
The second thing you should consider is the number of rows you return. SQL is very powerful at sorting and grouping, so let SQL do its job, and don't move that work to the client. Limit the number of records you return. In most applications it makes no sense to return more than 100 rows to a user at once. You might let users choose to load more, but make it a choice they have to make.
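A minimal sketch of letting the database sort and cap the result, again with Python's sqlite3 module as a stand-in. The scores table is made up, and the syntax is SQLite's LIMIT; SQL Server spells the same idea with TOP or OFFSET ... FETCH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, score INTEGER)")
# 1000 rows with distinct scores 0..999 in shuffled order (hypothetical data).
conn.executemany(
    "INSERT INTO scores (score) VALUES (?)",
    [((i * 37) % 1000,) for i in range(1000)],
)

PAGE_SIZE = 100  # the cap suggested above

# The database sorts and returns only the first page; the client never
# sees the other 900 rows.
page = conn.execute(
    "SELECT id, score FROM scores ORDER BY score DESC LIMIT ?",
    (PAGE_SIZE,),
).fetchall()

print(len(page), page[0][1], page[-1][1])
```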
Finally, monitor your SQL Server. Run a profiler against it, and try to find your worst queries. A SQL query should not take longer than half a second; if it does, something is most likely messed up. (Yes, there are operations that can take much longer, but those should have a reason.)
Edit:
Once you have found the slow query, look at the execution plan. You will see which parts of the query are expensive and which parts work well. The Optimizer is also a tool that can be used.
I suggest you consider your queries in terms of I/O first. Disk I/O on my SATA III system tops out at 6 Gb/s, which is about 0.75 GB/s. My DDR3 memory bandwidth is 12 GB/s. I can move items in memory 16 times faster than I can retrieve them from disk. (Ref Wikipedia and Tom's hardware)
The difference between getting a few columns and getting all the columns for your 100 rows could be the difference between reading a single 8K page from disk and reading two or more pages. Once the pages are in memory, moving two columns or all columns to a hash table is faster than any measuring tool I have can detect.
I value the advice of the others on this topic related to database design. Designing narrow indexes, using included columns to make covering indexes, avoiding table or index scans in favor of seeks by using an appropriate WHERE clause, keeping primary keys narrow, and so on is the difference between having a DBA title and being a DBA.