Is using COUNT(*) or SELECT * a good idea? - sql

I've heard several times that you shouldn't perform COUNT(*) or SELECT * for performance reasons, but wasn't able to dig up some further information about it.
I can imagine that the database is then using all columns for the action, which can be an impressive performance loss, but I'm not sure about that. Does somebody have further information about the topic?

1. On count(*) vs. count(something else)
SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrites your query to a less costly alternative (if such a plan is available).
Given the following table:
table(
pk not null
,color not null
,nullable null
,unique(pk)
,index(color)
);
...all of the following are functionally equivalent (due to the mechanics of count and nulls):
1) select count(*) from table;
2) select count(1) from table;
3) select count(pk) from table;
4) select count(color) from table;
Regardless of which form you use, the optimizer is free to rewrite the query to another form if it is more efficient. (Again, not all optimizers are sophisticated enough to do this). The unique index(pk) would be smaller (bytes occupied) than the entire table. Therefore it would be more efficient to count the number of index entries rather than scanning through the entire table. In Oracle we have bitmap indexes, which also compress repeating strings. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression which in some cases makes the physical table smaller than a composite index.
1. TL;DR;
Your specific dbms will have its own set of tools that enables different rewriting rules and in turn execution plans. That renders the question somewhat useless (unless we talk about a specific release of a specific dbms). I recommend COUNT(*) in all cases because it requires the least cognitive effort to grasp.
2. On select a,b,c vs. select *
There are very few valid uses of SELECT * in code you write and put into production. Imagine a table which contains Bluray movies (yes, the movies is stored as a blob in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies where id = ? in the getMovies(movie_id) method. I will refrain myself from explaining why SELECT name FROM movies will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticable impact.
One last point on performance is that when all the referenced columns (selected, filtered) in your query exists as an index (called a covering index), the database need not touch the table at all. It can be fully resolved from scanning the index only. By selecting all columns you remove this option from the optimizer.
Another thing about SELECT * which is far more serious than anything, is that it creates an implicit dependency on a specific physical layout of the table. Let me explain. Consider the following tables:
table T1(name, id)
table T2(name, id)
The following statement...
insert into t1 select * from t2;
... will break or produce a different result if any of the following happens:
Any of the tables columns are rearranged for example T1(id, name)
T1 gets an additional not-null column
T2 gets another column
2. TL;DR; When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns are faster than selecting more columns. A possitive side-effect on explicit selects is that it gives greater freedom to the optimizer.

COUNT(*) is different from COUNT(column1) !
COUNT(*) returns the number of records, and does NOT use more resources, while COUNT(column1) counts the number of records where column1 is non null.
For SELECT, it is different. SELECT * will of course request more data.

When using count(*) the * doesn't mean "all fields". Using count(field) will count all non-null values in the field, but count(*) will always count all records even if all fields in all records are null, so it doesn't need to check the data in the fields at all.
Using select * means that you almost always return more data than you are going to use, which of course is a waste. However, perhaps more serious is the maintainence problem; if you add fields to a table your query will return these too. That might mean that the record becomes too large to fit in the buffer, resulting in an error message.

Don't confuse the * in "COUNT(*)" with the * in "SELECT * ". They are completely unrelated but sometimes confused because it's such an odd syntax. There is nothing wrong with using COUNT(*), which just means "count rows".
SELECT * on the other hand means "select all columns". That's generally poor practice because it tightly couples your code to the database schema. That means when you change the table you probably have to change the code even if it should have been unaffected. It increases the impact of any schema change.
SELECT * may also cause a sub-optimal query plan. Either because you didn't really need all columns or because it forces the DBMS to do an extra lookup at runtime to get the list of columns.

It's absolutely true that "*" is "all columns". And you're right in the point of if you've a table with an incredible number of columns (say 100+), these kind of queries can be bad in terms of efficiency.
I believe that the best solution is creating database views previously filtering the amount of records evolved in the count operation, so, the performance impact isn't a big problem, because views can be cached.
In the other hand, it seems that "*" operator should be avoided when returning records, and it's brutally better to select the fields you really need to use in some business.

When using SELECT * it can have a performance hit. Applications which use the SELECT * syntax when they actually only need a handful of columns are transferring more data across the network than they need to consume, which is wasteful.
Also, in Microsoft SQL Server at least, there's a strange problem when you use SELECT * in a view and then add a column to the underlying table. The column headings and data returned by the view don't match each other following certain changes! See my blog post for further details of this particular problem.

Depending on the size of the database depends on how inefficient it becomes, the simnplest way to describe would be like so:
when you specifically do:
SELECT column1,column2,column3 FROM table1
Mysql knows exactly exactly what columns it looking for, but when you do
SELECT * FROM table1
Mysql does not know the columns you want, it knows you want all of them but not the names, so it has to perform extra tasks that analyse the table to discover the columns, thus resulting in using resources.

In case of COUNT(*) it depends on database and its version. For example in modern versions of MS SQL it doesn't matter [source needed].
So the best approach in case of COUNT(*) is to measure it.
Using SELECT * is really bad idea. * means read all columns which can be heavy IO and network operation (especially for various type of CHAR columns). Moreover -- rather rarely you need all columns.

Related

Most memory efficient for Cluster Tables: "select count(*)" or "select .. endselect"?

I would like to know which statement (see below) will be more efficient for determining the size of a Cluster Table. Or at least determine, whether the table size reaches a certain threshhold {n}.
Efficiency meaning using less PSAPTEMP tablespace.
The problem with Cluster Tables is, that in order to get an entry for a table the fields of one entry need to be looked up in several tables of the Cluster where they are dispersed. Thus, more than just the counted table need to be looked at. So for every entry several entries need to be looked up. This makes it inefficient for reads and this can make it dump because the COUNT uses an INT datatype that can overflow.
SELECT COUNT(*)
...
UP TO {n} rows.
SELECT *
...
UP TO {n} ROWS.
ENDSELECT. `and then determine the size of the result. `
To me they seem equivalent, but maybe they are not when using a threshold. Maybe the limitation makes a difference depending how the data is read. EDIT: Of course, SELECT .. ENDSELECT is a loop and thus less efficient principally.
But I would like to know how it actually works under the hood and understand the difference better. So far it seems like I will have to try it out.
I assume the database will differ but will most often be Oracle.
We could not really create the test environment we needed. So no final answer. But some learnings:
Reading the data from cluster tables should be done based on a full primary key sequence (Should be accessed via primary key - very fast retrieval otherwise very slow)
There are no secondary indexes
Select * is Ok because all columns are retrieved anyways. Performing an operation on multiple rows is more efficient than single row operations. -> Therefore you still want to select into an internal table.
If many rows are being selected into the internal table, you might still like to retrieve specific columns to cut down on the memory required.
There is a way to convert cluster to transparent but with downtime and this no way for us
Aggreate SQL functions (SUM, AVG, MIN, MAX, etc) are not supported
Basically Select Endselect will run a loop and there will be multiple trips to DB Server.
Technically select SELECT COUNT(*) will perform all the data on the DB server itself and in one shot.
After which you can simply put the data in an internal table and work on the same.
As per the standards, this is not at all recommended even for normal transparent tables leave aside Cluster tables.
Access to Cluster tables is very expensive. Also, to make the matter worse you cannot use any indexes on Cluster tables. Its always better to provide as much data in the where clause as possible.
The priority is always given to fetch the data in one shot from the Database Server using
select * from table into table where ....
and then loop on it on the local server.
Specifically in your use case It will be fastest if you will be using count(*) and not select endselect.
Certified SAP ABAP Consultant
Using native SQL with COUNT BIG instead of COUNT can make it not memory efficient but prevent it from dumping due to a counter overflow.

Performance of select * vs select colname in oracle

Why performance of select * from table is not as good as select col_1,col_2 from table? So far as I understand, it is the locating of the row that takes up time, not how many columns are returned.
Selecting unnecessary columns can cause query plan changes that have a massive impact on performance. For example, if there is an index on col_1, col2 but there are other columns in the table, the select * query has to do a full table scan while the select col_1, col_2 query can simply scan the index which is likely to be much smaller and, thus, much less expensive to query. If you start dealing with queries that involve more than one table or that involve queries against views, selecting subsets of columns can also sometimes change query plans by allowing Oracle to eliminate unnecessary joins or function evaluations. Now, to be fair, it's not particularly common that the query plan will change based on what columns are selected, but when it does, the change is often significant.
If you are issuing the SQL statement from an application outside the database, selecting additional columns forces Oracle to send additional data over the network so your application will spend more time waiting on network I/O to send over data that it is not interested in. That can be very inefficient particularly if your application ever gets deployed on a WAN.
Selecting unnecessary columns can also force Oracle to do additional I/O without changing the plan. If one of the columns in the table that you don't need is a LOB, for example, Oracle would have to do additional work to fetch that LOB data. If the data is stored in a chained block on disk but the columns you are interested in happen to be in the first row piece, Oracle doesn't have to fetch the additional row pieces for the query that specifies a subset of columns. The query that does a select *, on the other hand, has to fetch every row piece.
Of course, that is before considering the maintenance aspects. If you are writing an application outside of PL/SQL, doing a SELECT * means that either your code will break when someone adds a new column to the table in the future or that your application has to dynamically determine at runtime the set of columns that are being returned in order to accommodate the new column automatically. While that is certainly possible, it is likely to lead to code that is more complex and thus more difficult to debug and maintain. If you are writing PL/SQL and fetching the data into a %ROWTYPE variable, it can be perfectly reasonable to do a SELECT * in production code; in other languages, you're generally setting yourself up for a maintenance nightmare if you do a SELECT *.
There is the issue of looking up the definition from the data dictionary for the table when you do a SELECT *.
There is also the issue of the database doing a little more work than is necessary when the only columns you require are col_1 and col_2. This is particularly an issue with large tables.
And there is the issue of network bandwidth being unnecessarily swallowed by a larger than required dataset.
It's not best practice to do an a SELECT *.
It also makes embedded SQL code harder to read.

Why is SELECT * considered harmful?

Why is SELECT * bad practice? Wouldn't it mean less code to change if you added a new column you wanted?
I understand that SELECT COUNT(*) is a performance problem on some DBs, but what if you really wanted every column?
There are really three major reasons:
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
But it's not all bad for SELECT *. I use it liberally for these use cases:
Ad-hoc queries. When trying to debug something, especially off a narrow table I might not be familiar with, SELECT * is often my best friend. It helps me just see what's going on without having to do a boatload of research as to what the underlying column names are. This gets to be a bigger "plus" the longer the column names get.
When * means "a row". In the following use cases, SELECT * is just fine, and rumors that it's a performance killer are just urban legends which may have had some validity many years ago, but don't now:
SELECT COUNT(*) FROM table;
in this case, * means "count the rows". If you were to use a column name instead of * , it would count the rows where that column's value was not null. COUNT(*), to me, really drives home the concept that you're counting rows, and you avoid strange edge-cases caused by NULLs being eliminated from your aggregates.
Same goes with this type of query:
SELECT a.ID FROM TableA a
WHERE EXISTS (
SELECT *
FROM TableB b
WHERE b.ID = a.B_ID);
in any database worth its salt, * just means "a row". It doesn't matter what you put in the subquery. Some people use b's ID in the SELECT list, or they'll use the number 1, but IMO those conventions are pretty much nonsensical. What you mean is "count the row", and that's what * signifies. Most query optimizers out there are smart enough to know this. (Though to be honest, I only know this to be true with SQL Server and Oracle.)
The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.
Performance
The * shorthand can be slower because:
Not all the fields are indexed, forcing a full table scan - less efficient
What you save to send SELECT * over the wire risks a full table scan
Returning more data than is needed
Returning trailing columns using variable length data type can result in search overhead
Maintenance
When using SELECT *:
Someone unfamiliar with the codebase would be forced to consult documentation to know what columns are being returned before being able to make competent changes. Making code more readable, minimizing the ambiguity and work necessary for people unfamiliar with the code saves more time and effort in the long run.
If code depends on column order, SELECT * will hide an error waiting to happen if a table had its column order changed.
Even if you need every column at the time the query is written, that might not be the case in the future
the usage complicates profiling
Design
SELECT * is an anti-pattern:
The purpose of the query is less obvious; the columns used by the application is opaque
It breaks the modularity rule about using strict typing whenever possible. Explicit is almost universally better.
When Should "SELECT *" Be Used?
It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.
Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.
Even if you wanted to select every column now, you might not want to select every column after someone adds one or more new columns. If you write the query with SELECT * you are taking the risk that at some point someone might add a column of text which makes your query run more slowly even though you don't actually need that column.
Wouldn't it mean less code to change if you added a new column you wanted?
The chances are that if you actually want to use the new column then you will have to make quite a lot other changes to your code anyway. You're only saving , new_column - just a few characters of typing.
If you really want every column, I haven't seen a performance difference between select (*) and naming the columns. The driver to name the columns might be simply to be explicit about what columns you expect to see in your code.
Often though, you don't want every column and the select(*) can result in unnecessary work for the database server and unnecessary information having to be passed over the network. It's unlikely to cause a noticeable problem unless the system is heavily utilised or the network connectivity is slow.
If you name the columns in a SELECT statement, they will be returned in the order specified, and may thus safely be referenced by numerical index. If you use "SELECT *", you may end up receiving the columns in arbitrary sequence, and thus can only safely use the columns by name. Unless you know in advance what you'll be wanting to do with any new column that gets added to the database, the most probable correct action is to ignore it. If you're going to be ignoring any new columns that get added to the database, there is no benefit whatsoever to retrieving them.
In a lot of situations, SELECT * will cause errors at run time in your application, rather than at design time. It hides the knowledge of column changes, or bad references in your applications.
Think of it as reducing the coupling between the app and the database.
To summarize the 'code smell' aspect:
SELECT * creates a dynamic dependency between the app and the schema. Restricting its use is one way of making the dependency more defined, otherwise a change to the database has a greater likelihood of crashing your application.
If you add fields to the table, they will automatically be included in all your queries where you use select *. This may seem convenient, but it will make your application slower as you are fetching more data than you need, and it will actually crash your application at some point.
There is a limit for how much data you can fetch in each row of a result. If you add fields to your tables so that a result ends up being over that limit, you get an error message when you try to run the query.
This is the kind of errors that are hard to find. You make a change in one place, and it blows up in some other place that doesn't actually use the new data at all. It may even be a less frequently used query so that it takes a while before someone uses it, which makes it even harder to connect the error to the change.
If you specify which fields you want in the result, you are safe from this kind of overhead overflow.
I don't think that there can really be a blanket rule for this. In many cases, I have avoided SELECT *, but I have also worked with data frameworks where SELECT * was very beneficial.
As with all things, there are benefits and costs. I think that part of the benefit vs. cost equation is just how much control you have over the datastructures. In cases where the SELECT * worked well, the data structures were tightly controlled (it was retail software), so there wasn't much risk that someone was going to sneek a huge BLOB field into a table.
Reference taken from this article.
Never go with "SELECT *",
I have found only one reason to use "SELECT *"
If you have special requirements and created dynamic environment when add or delete column automatically handle by application code. In this special case you don’t require to change application and database code and this will automatically affect on production environment. In this case you can use “SELECT *”.
Generally you have to fit the results of your SELECT * ... into data structures of various types. Without specifying which order the results are arriving in, it can be tricky to line everything up properly (and more obscure fields are much easier to miss).
This way you can add fields to your tables (even in the middle of them) for various reasons without breaking sql access code all over the application.
Using SELECT * when you only need a couple of columns means a lot more data transferred than you need. This adds processing on the database, and increase latency on getting the data to the client. Add on to this that it will use more memory when loaded, in some cases significantly more, such as large BLOB files, it's mostly about efficiency.
In addition to this, however, it's easier to see when looking at the query what columns are being loaded, without having to look up what's in the table.
Yes, if you do add an extra column, it would be faster, but in most cases, you'd want/need to change your code using the query to accept the new columns anyways, and there's the potential that getting ones you don't want/expect can cause issues. For example, if you grab all the columns, then rely on the order in a loop to assign variables, then adding one in, or if the column orders change (seen it happen when restoring from a backup) it can throw everything off.
This is also the same sort of reasoning why if you're doing an INSERT you should always specify the columns.
Selecting with column name raises the probability that database engine can access the data from indexes rather than querying the table data.
SELECT * exposes your system to unexpected performance and functionality changes in the case when your database schema changes because you are going to get any new columns added to the table, even though, your code is not prepared to use or present that new data.
There is also more pragmatic reason: money. When you use cloud database and you have to pay for data processed there is no explanation to read data that you will immediately discard.
For example: BigQuery:
Query pricing
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed.
and Control projection - Avoid SELECT *:
Best practice: Control projection - Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.
Understand your requirements prior to designing the schema (if possible).
Learn about the data,
1)indexing
2)type of storage used,
3)vendor engine or features; ie...caching, in-memory capabilities
4)datatypes
5)size of table
6)frequency of query
7)related workloads if the resource is shared
8)Test
A) Requirements will vary. If the hardware can not support the expected workload, you should re-evaluate how to provide the requirements in the workload. Regarding the addition column to the table. If the database supports views, you can create an indexed(?) view of the specific data with the specific named columns (vs. select '*'). Periodically review your data and schema to ensure you never run into the "Garbage-in" -> "Garbage-out" syndrome.
Assuming there is no other solution; you can take the following into account. There are always multiple solutions to a problem.
1) Indexing: The select * will execute a tablescan. Depending on various factors, this may involve a disk seek and/or contention with other queries. If the table is multi-purpose, ensure all queries are performant and execute below you're target times. If there is a large amount of data, and your network or other resource isn't tuned; you need to take this into account. The database is a shared environment.
2) type of storage. Ie: if you're using SSD's, disk, or memory. I/O times and the load on the system/cpu will vary.
3) Can the DBA tune the database/tables for higher performance? Assumming for whatever reason, the teams have decided the select '*' is the best solution to the problem; can the DB or table be loaded into memory. (Or other method...maybe the response was designed to respond with a 2-3 second delay? --- while an advertisement plays to earn the company revenue...)
4) Start at the baseline. Understand your data types, and how results will be presented. Smaller datatypes, number of fields reduces the amount of data returned in the result set. This leaves resources available for other system needs. The system resources are usually have a limit; 'always' work below these limits to ensure stability, and predictable behaviour.
5) size of table/data. select '*' is common with tiny tables. They typically fit in memory, and response times are quick. Again....review your requirements. Plan for feature creep; always plan for the current and possible future needs.
6) Frequency of query / queries. Be aware of other workloads on the system. If this query fires off every second, and the table is tiny. The result set can be designed to stay in cache/memory. However, if the query is a frequent batch process with Gigabytes/Terabytes of data...you may be better off to dedicate additional resources to ensure other workloads aren't affected.
7) Related workloads. Understand how the resources are used. Is the network/system/database/table/application dedicated, or shared? Who are the stakeholders? Is this for production, development, or QA? Is this a temporary "quick fix". Have you tested the scenario? You'll be surprised how many problems can exist on current hardware today. (Yes, performance is fast...but the design/performance is still degraded.) Does the system need to performance 10K queries per second vs. 5-10 queries per second. Is the database server dedicated, or do other applications, monitoring execute on the shared resource. Some applications/languages; O/S's will consume 100% of the memory causing various symptoms/problems.
8) Test: Test out your theories, and understand as much as you can about. Your select '*' issue may be a big deal, or it may be something you don't even need to worry about.
There's an important distinction here that I think most answers are missing.
SELECT * isn't an issue. Returning the results of SELECT * is the issue.
An OK example, in my opinion:
WITH data_from_several_tables AS (
SELECT * FROM table1_2020
UNION ALL
SELECT * FROM table1_2021
...
)
SELECT id, name, ...
FROM data_from_several_tables
WHERE ...
GROUP BY ...
...
This avoids all the "problems" of using SELECT * mentioned in most answers:
Reading more data than expected? Optimisers in modern databases will be aware that you don't actually need all columns
Column ordering of the source tables affects output? We still select and
return data explicitly.
Consumers can't see what columns they receive from the SQL? The columns you're acting on are explicit in code.
Indexes may not be used? Again, modern optimisers should handle this the same as if we didn't SELECT *
There's a readability/refactorability win here - no need to duplicate long lists of columns or other common query clauses such as filters. I'd be surprised if there are any differences in the query plan when using SELECT * like this compared with SELECT <columns> (in the vast majority of cases - obviously always profile running code if it's critical).

select * vs select column

If I just need 2/3 columns and I query SELECT * instead of providing those columns in select query, is there any performance degradation regarding more/less I/O or memory?
The network overhead might be present if I do select * without a need.
But in a select operation, does the database engine always pull atomic tuple from the disk, or does it pull only those columns requested in the select operation?
If it always pulls a tuple then I/O overhead is the same.
At the same time, there might be a memory consumption for stripping out the requested columns from the tuple, if it pulls a tuple.
So if that's the case, select someColumn will have more memory overhead than that of select *
There are several reasons you should never (never ever) use SELECT * in production code:
since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time
if you need only 2/3 of the columns, you're selecting 1/3 too much data which needs to be retrieving from disk and sent across the network
if you start to rely on certain aspects of the data, e.g. the order of the columns returned, you could get a nasty surprise once the table is reorganized and new columns are added (or existing ones removed)
in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns) and thus disk I/O and memory overhead would be much less compared to doing a SELECT *.... query.
Yes, it takes a bit more typing initially (tools like SQL Prompt for SQL Server will even help you there) - but this is really one case where there's a rule without any exception: do not ever use SELECT * in your production code. EVER.
It always pulls a tuple (except in cases where the table has been vertically segmented - broken up into columns pieces), so, to answer the question you asked, it doesn't matter from a performance perspective. However, for many other reasons, (below) you should always select specifically those columns you want, by name.
It always pulls a tuple, because (in every vendors RDBMS I am familiar with), the underlying on-disk storage structure for everything (including table data) is based on defined I/O Pages (in SQL Server for e.g., each Page is 8 kilobytes). And every I/O read or write is by Page.. I.e., every write or read is a complete Page of data.
Because of this underlying structural constraint, a consequence is that Each row of data in a database must always be on one and only one page. It cannot span multiple Pages of data (except for special things like blobs, where the actual blob data is stored in separate Page-chunks, and the actual table row column then only gets a pointer...). But these exceptions are just that, exceptions, and generally do not apply except in special cases ( for special types of data, or certain optimizations for special circumstances)
Even in these special cases, generally, the actual table row of data itself (which contains the pointer to the actual data for the Blob, or whatever), it must be stored on a single IO Page...
EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2
Where column = t1.colA)
EDIT: To address #Mike Sherer comment, Yes it is true, both technically, with a bit of definition for your special case, and aesthetically. First, even when the set of columns requested are a subset of those stored in some index, the query processor must fetch every column stored in that index, not just the ones requested, for the same reasons - ALL I/O must be done in pages, and index data is stored in IO Pages just like table data. So if you define "tuple" for an index page as the set of columns stored in the index, the statement is still true.
and the statement is true aesthetically because the point is that it fetches data based on what is stored in the I/O page, not on what you ask for, and this true whether you are accessing the base table I/O Page or an index I/O Page.
For other reasons not to use Select *, see Why is SELECT * considered harmful? :
You should always only select the columns that you actually need. It is never less efficient to select less instead of more, and you also run into fewer unexpected side effects - like accessing your result columns on client side by index, then having those indexes become incorrect by adding a new column to the table.
[edit]: Meant accessing. Stupid brain still waking up.
Unless you're storing large blobs, performance isn't a concern. The big reason not to use SELECT * is that if you're using returned rows as tuples, the columns come back in whatever order the schema happens to specify, and if that changes you will have to fix all your code.
On the other hand, if you use dictionary-style access then it doesn't matter what order the columns come back in because you are always accessing them by name.
This immediately makes me think of a table I was using which contained a column of type blob; it usually contained a JPEG image, a few Mbs in size.
Needless to say I didn't SELECT that column unless I really needed it. Having that data floating around - especially when I selected mulitple rows - was just a hassle.
However, I will admit that I otherwise usually query for all the columns in a table.
During a SQL select, the DB is always going to refer to the metadata for the table, regardless of whether it's SELECT * for SELECT a, b, c... Why? Becuase that's where the information on the structure and layout of the table on the system is.
It has to read this information for two reasons. One, to simply compile the statement. It needs to make sure you specify an existing table at the very least. Also, the database structure may have changed since the last time a statement was executed.
Now, obviously, DB metadata is cached in the system, but it's still processing that needs to be done.
Next, the metadata is used to generate the query plan. This happens each time a statement is compiled as well. Again, this runs against cached metadata, but it's always done.
The only time this processing is not done is when the DB is using a pre-compiled query, or has cached a previous query. This is the argument for using binding parameters rather than literal SQL. "SELECT * FROM TABLE WHERE key = 1" is a different query than "SELECT * FROM TABLE WHERE key = ?" and the "1" is bound on the call.
DBs rely heavily on page caching for there work. Many modern DBs are small enough to fit completely in memory (or, perhaps I should say, modern memory is large enough to fit many DBs). Then your primary I/O cost on the back end is logging and page flushes.
However, if you're still hitting the disk for your DB, a primary optimization done by many systems is to rely on the data in indexes, rather than the tables themselves.
If you have:
CREATE TABLE customer (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(150) NOT NULL,
city VARCHAR(30),
state VARCHAR(30),
zip VARCHAR(10));
CREATE INDEX k1_customer ON customer(id, name);
Then if you do "SELECT id, name FROM customer WHERE id = 1", it is very likely that you DB will pull this data from the index, rather than from the tables.
Why? It will likely use the index anyway to satisfy the query (vs a table scan), and even though 'name' isn't used in the where clause, that index will still be the best option for the query.
Now the database has all of the data it needs to satisfy the query, so there's no reason to hit the table pages themselves. Using the index results in less disk traffic since you have a higher density of rows in the index vs the table in general.
This is a hand wavy explanation of a specific optimization technique used by some databases. Many have several optimization and tuning techniques.
In the end, SELECT * is useful for dynamic queries you have to type by hand, I'd never use it for "real code". Identification of individual columns gives the DB more information that it can use to optimize the query, and gives you better control in your code against schema changes, etc.
I think there is no exact answer for your question, because you have pondering performance and facility of maintain your apps. Select column is more performatic of select *, but if you is developing an oriented object system, then you will like use object.properties and you can need a properties in any part of apps, then you will need write more methods to get properties in special situations if you don't use select * and populate all properties. Your apps need have a good performance using select * and in some case you will need use select column to improve performance. Then you will have the better of two worlds, facility to write and maintain apps and performance when you need performance.
The accepted answer here is wrong. I came across this when another question was closed as a duplicate of this (while I was still writing my answer - grr - hence the SQL below references the other question).
You should always use SELECT attribute, attribute.... NOT SELECT *
It's primarily for performance issues.
SELECT name FROM users WHERE name='John';
Is not a very useful example. Consider instead:
SELECT telephone FROM users WHERE name='John';
If there's an index on (name, telephone) then the query can be resolved without having to look up the relevant values from the table - there is a covering index.
Further, suppose the table has a BLOB containing a picture of the user, and an uploaded CV, and a spreadsheet...
using SELECT * will willpull all this information back into the DBMS buffers (forcing out other useful information from the cache). Then it will all be sent to client using up time on the network and memory on the client for data which is redundant.
It can also cause functional issues if the client retrieves the data as an enumerated array (such as PHP's mysql_fetch_array($x, MYSQL_NUM)). Maybe when the code was written 'telephone' was the third column to be returned by SELECT *, but then someone comes along and decides to add an email address to the table, positioned before 'telephone'. The desired field is now shifted to the 4th column.
There are reasons for doing things either way. I use SELECT * a lot on PostgreSQL because there are a lot of things you can do with SELECT * in PostgreSQL that you can't do with an explicit column list, particularly when in stored procedures. Similarly in Informix, SELECT * over an inherited table tree can give you jagged rows while an explicit column list cannot because additional columns in child tables are returned as well.
The main reason why I do this in PostgreSQL is that it ensures that I get a well-formed type specific to a table. This allows me to take the results and use them as the table type in PostgreSQL. This also allows for many more options in the query than a rigid column list would.
On the other hand, a rigid column list gives you an application-level check that db schemas haven't changed in certain ways and this can be helpful. (I do such checks on another level.)
As for performance, I tend to use VIEWs and stored procedures returning types (and then a column list inside the stored procedure). This gives me control over what types are returned.
But keep in mind I am using SELECT * usually against an abstraction layer rather than base tables.
Reference taken from this article:
Without SELECT *:
When you are using ” SELECT * ” at that time you are selecting more columns from the database and some of this column might not be used by your application.
This will create extra cost and load on database system and more data travel across the network.
With SELECT *:
If you have special requirements and created dynamic environment when add or delete column automatically handle by application code. In this special case you don’t require to change application and database code and this will automatically affect on production environment. In this case you can use “SELECT *”.
Just to add a nuance to the discussion which I don't see here: In terms of I/O, if you're using a database with column-oriented storage you can do A LOT less I/O if you only query for certain columns. As we move to SSDs the benefits may be a bit smaller vs. row-oriented storage but there's a) only reading the blocks that contain columns you care about b) compression, which generally greatly reduces the size of the data on disk and therefore the volume of data read from disk.
If you're not familiar with column-oriented storage, one implementation for Postgres comes from Citus Data, another is Greenplum, another Paraccel, another (loosely speaking) is Amazon Redshift. For MySQL there's Infobright, the now-nigh-defunct InfiniDB. Other commercial offerings include Vertica from HP, Sybase IQ, Teradata...
select * from table1 INTERSECT select * from table2
equal
select distinct t1 from table1 where Exists (select t2 from table2 where table1.t1 = t2 )

SQL query - Select * from view or Select col1, col2, ... colN from view [duplicate]

This question already has answers here:
What is the reason not to use select *?
(20 answers)
Closed 3 years ago.
We are using SQL Server 2005, but this question can be for any RDBMS.
Which of the following is more efficient, when selecting all columns from a view?
Select * from view
or
Select col1, col2, ..., colN from view
NEVER, EVER USE "SELECT *"!!!!
This is the cardinal rule of query design!
There are multiple reasons for this. One of which is, that if your table only has three fields on it and you use all three fields in the code that calls the query, there's a great possibility that you will be adding more fields to that table as the application grows, and if your select * query was only meant to return those 3 fields for the calling code, then you're pulling much more data from the database than you need.
Another reason is performance. In query design, don't think about reusability as much as this mantra:
TAKE ALL YOU CAN EAT, BUT EAT ALL YOU TAKE.
It is best practice to select each column by name. In the future your DB schema might change to add columns that you would then not need for a particular query. I would recommend selecting each column by name.
Just to clarify a point that several people have already made, the reason Select * is inefficient is because there has to be an initial call to the DB to find out exactly what fields are available, and then a second call where the query is made using explicit columns.
Feel free to use Select * when you are debugging, running casual queries or are in the early stages of developing a query, but as soon as you know your required columns, state them explicitly.
Select * is a poor programming practice. It is as likely to cause things to break as it is to save things from breaking. If you are only querying one table or view, then the efficiency gain may not be there (although it is possible if you are not intending to actually use every field). If you have an inner join, then you have at least two fields returning the same data (the join fields) and thus you are wasting network resources to send redundant data back to the application. You won't notice this at first, but as the result sets get larger and larger, you will soon have a network pipeline that is full and doesn't need to be. I can think of no instance where select * gains you anything. If a new column is added and you don't need to go to the code to do something with it, then the column shouldn't be returned by your query by definition. If someone drops and recreates the table with the columns in a different order, then all your queries will have information displaying wrong or will be giving bad results, such as putting the price into the part number field in a new record.
Plus it is quick to drag the column names over from the object browser, so that is just pure laziness not efficiency in coding.
It depends. Inheritance of views can be a handy thing and easy to maintain (SQL Anywhere):
create view v_fruit as select F.id, S.strain from F key join S;
create view v_apples as select v_fruit.*, C.colour from v_fruit key join C;
If you're really selecting all columns, it shouldn't make any noticeable difference whether you ask for * or if you are explicit. The SQL server will parse the request the same way in pretty much the same amount of time.
Always do select col1, col2 etc from view. There's no efficieny difference between the two methods that I know of, but using "select *" can be dangerous. If you modify your view definition adding new columns, you can break a program using "select *", whereas selecting a predefined set of columns (even all of them, named), will still work.
I guess it all depends on what the query optimizer does.
If I want to get every record in the row, I will generally use the "SELECT *..." option, since I then don't have to worry should I change the underlying table structure. As well, for someone maintaining the code, seeing "SELECT *" tells them that this query is intended to return every column, whereas listing the columns individually does not convey the same intention.
For performance - look at the query plan (should be no difference).
For maintainability. - always supply a fieldlist (that goes for INSERT INTO too).
select
column1
,column2
,column3
.
.
.
from Your-View
this one is more optimizer than Using the
select *
from Your View