Performance of retrieving all columns vs. multiple columns - sql

I am learning SQL by following "SQL in 10 Minutes".
Regarding the use of wildcards to retrieve all columns, it states:
As a rule, you are better off not using the * wildcard unless you really do need every column in the table. Even though use of wildcards may save you the time and effort needed to list the desired columns explicitly, retrieving unnecessary columns usually slows down the performance of your retrieval and your application.
However, in my test it takes less time to retrieve all columns than to retrieve several named columns:
As the results indicate, the wildcard query took 0.02 seconds vs. 0.1 seconds for the named columns.
I tested several times, and the wildcard query was consistently faster than the query with specified columns, even though the time consumed varied every time.

Kudos to you for attempting to validate advice you get in a book! A single test neither invalidates the advice nor invalidates the test. It is worthwhile to dive further.
The advice provided in SQL In 10 Minutes is sound advice -- and it explicitly states that the purpose is related to performance. (Another consideration is that select * makes the code unstable when the database changes.) As a note: I regularly use select t.* for ad-hoc queries.
Why are the results different? There can be multiple reasons for this:
Databases do not have deterministic performance, so other considerations -- such as other processes running on the machine or resource contention -- can affect the performance.
As mentioned in a comment, caching can be the reason. Specifically, running the first query may require loading the data from disk, and it is already in memory for the second.
Another form of caching is for the execution plan, so perhaps the first execution plan is cached but not the second.
You don't mention the database, but perhaps your database has a really, really slow compiler and compiling the first takes longer than the second.
Fundamentally, the advice is sound from a common-sense perspective. Moving less data around should be more efficient. That is really what the advice is saying.
In any case, the difference between 10 milliseconds and 2 milliseconds is very short. I would not generalize this performance to larger data and say that the second is 5 times faster than the first in general. For whatever reason, it is 8 milliseconds shorter on a very small data set, one so small that performance would not be a consideration anyway.

Is this for manually inspecting the data that's in a table or tables?
Then it doesn't matter much whether you use a * or the column names.
Sure, if the table has something like 100 columns and you are only interested in a few, then explicitly listing the column names will give you a less convoluted result.
Plus, you can choose the order they appear in the result.
And using a * in a sub-query drags all the fields into the result set, whereas selecting only the columns you need could improve performance.
For manual testing, that normally doesn't matter much.
Whether a test SQL runs for 1 second or 2 seconds, if it's a test or an ad-hoc query then it wouldn't bother you.
What the suggestion is really intended for is SQL that is to be used in a production environment.
When you use * in a SQL, a change to the tables used in the query can affect the output of that query, possibly leading to errors. Your boss would frown upon that!
For example, a SQL with a select * from tableA union select * from tableB that you coded a year ago suddenly starts crashing because a column was added to tableB. Ouch.
But by explicitly putting the column names, adding a column to 1 of the tables wouldn't make any difference to that SQL.
In other words: in production, stability and performance matter much more than golf-coding.
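The UNION failure described above can be reproduced in a small sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in, so the exact error text will differ from other databases; the table layouts and the added created_on column are made up for illustration.

```python
import sqlite3

# Hypothetical stand-ins for tableA and tableB from the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, name TEXT);
    CREATE TABLE tableB (id INTEGER, name TEXT);
    INSERT INTO tableA VALUES (1, 'a');
    INSERT INTO tableB VALUES (2, 'b');
""")

STAR_UNION = "SELECT * FROM tableA UNION SELECT * FROM tableB"
NAMED_UNION = "SELECT id, name FROM tableA UNION SELECT id, name FROM tableB"


def try_query(sql):
    """Return the sorted rows, or an error string if the query fails."""
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


before_star = try_query(STAR_UNION)
before_named = try_query(NAMED_UNION)

# Later, someone adds a column to only one of the two tables.
conn.execute("ALTER TABLE tableB ADD COLUMN created_on TEXT")

after_star = try_query(STAR_UNION)    # column counts no longer match
after_named = try_query(NAMED_UNION)  # unaffected by the new column

print(before_star, before_named)
print(after_star)
print(after_named)
```

Both queries return the same rows before the schema change; afterwards only the one with explicit column names still runs.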
Another thing to keep in mind is the effect of caching.
Some databases can temporarily store metadata or even data in memory, which can speed up a query that returns the same results as a query that just ran before it.
So try running the following SQL's, which are in a different order than in the question, and check if there's still a speed difference.
select * from products;
select prod_id, prod_name, prod_price from products;
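To take the order effect out of such a comparison, the two statements can be run several times in alternating order, keeping the best time for each. The following is only a rough sketch using Python's sqlite3 module and an in-memory products table with made-up rows (the prod_desc column and the row count are assumptions); the numbers it prints say nothing about your database, it only shows the shape of a fairer test.

```python
import sqlite3
import time

# In-memory stand-in for the products table; contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "prod_id INTEGER PRIMARY KEY, prod_name TEXT, prod_price REAL, prod_desc TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [(i, f"name{i}", i * 1.5, "x" * 200) for i in range(2000)],
)

QUERIES = {
    "star": "SELECT * FROM products",
    "named": "SELECT prod_id, prod_name, prod_price FROM products",
}


def time_query(sql):
    """Run one query, fetch every row, and return the elapsed seconds."""
    start = time.perf_counter()
    conn.execute(sql).fetchall()
    return time.perf_counter() - start


def benchmark(rounds=6):
    """Alternate the execution order each round; keep the best time per query."""
    timings = {name: [] for name in QUERIES}
    for i in range(rounds):
        order = list(QUERIES) if i % 2 == 0 else list(QUERIES)[::-1]
        for name in order:
            timings[name].append(time_query(QUERIES[name]))
    return {name: min(values) for name, values in timings.items()}


best = benchmark()
for name, seconds in best.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

Taking the minimum over several rounds is one common way to reduce noise from caching and from other processes; an average or median would be a reasonable choice too.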

Related

Select * from table vs Select col1,col2,col3 from table [duplicate]

I was just having a discussion with one of my colleagues about SQL Server performance when specifying the query command in a stored procedure.
So I want to know which one is preferred over the other and what the concrete reason behind that is.
Suppose we have one table called
Employees(EmpName,EmpAddress)
And we want to select all the records from the table. So we can write the query in two ways,
Select * from Employees
Select EmpName, EmpAddress from Employees
So I would like to know whether there is any specific difference or performance issue between the above queries, or whether they are just equal to the SQL Server engine.
UPDATE:
Let's say the table schema won't change anymore, so future maintenance is not a concern.
Performance-wise, let's say the usage is very, very high, i.e. millions of hits per second on the database server. I want a clear and precise performance rating for both approaches.
There are no indexes on the table.
The specific difference would show its ugly head if you add a column to the table.
Suddenly, the query you expected to return two columns now returns three. If you coded specifically for the two columns, the rest of your code is now broken.
Performance-wise, there shouldn't be a difference.
I always take the approach that being as specific as possible is the best when dealing with databases. If the table has two columns and you only need those two columns, be specific. Specify those two columns. It'll save you headaches in the future.
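Here is a minimal sketch of that breakage, using Python's sqlite3 module as a stand-in for the real database. The consumer function names_and_addresses and the added EmpPhone column are hypothetical; the point is only that code written for two columns fails once * starts returning three.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmpName TEXT, EmpAddress TEXT)")
conn.execute("INSERT INTO Employees VALUES ('Ann', '1 Main St')")


def names_and_addresses(sql):
    """Hypothetical consumer code written for exactly two columns."""
    result = []
    for emp_name, emp_address in conn.execute(sql):
        result.append(f"{emp_name}: {emp_address}")
    return result


star_before = names_and_addresses("SELECT * FROM Employees")

# The table gains a third column (EmpPhone is an invented name).
conn.execute("ALTER TABLE Employees ADD COLUMN EmpPhone TEXT")

named_after = names_and_addresses("SELECT EmpName, EmpAddress FROM Employees")
try:
    names_and_addresses("SELECT * FROM Employees")
    star_after = "ok"
except ValueError as exc:
    # Unpacking three values into two names fails.
    star_after = f"broken: {exc}"

print(star_before)
print(named_after)
print(star_after)
```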
I am an avid advocate of the "be as specific as possible" rule, too. Not following it will hurt you in the long run. However, your question seems to be coming from a different background, so let me attempt to answer it.
When you submit a query to SQL Server it goes through several stages:
1. transmitting the query string over the network
2. parsing the query string, producing a parse tree
3. linking the referenced objects in the parse tree to existing objects
4. optimizing based on statistics and row count/size estimates
5. executing
6. transmitting the result data over the network
Let's look at each one:
1. The * query is a few bytes shorter, so this step will be faster.
2. The * query contains fewer "tokens", so this should(!) be faster.
3. During linking, the list of columns needs to be pulled and compared to the query string. Here the "*" gets resolved to the actual column references. Without access to the code it is impossible to say which version takes fewer cycles; however, the amount of data accessed is about the same, so this should be similar.
4-6. In these stages there is no difference between the two example queries, as they will both get compiled to the same execution plan.
Taking all this into account, you will probably save a few nanoseconds when using the * notation. However, your example is very simplistic. In a more complex example it is possible that specifying a subset of the columns of a table in a multi-table join will lead to a different plan than using a *. If that happens we can be pretty certain that the explicit query will be faster.
The above comparison also assumes that the SQL Server process is running alone on a single processor and no other queries are submitted at the same time. If the process has to yield during the compilation those extra cycles will be far more than the ones we are trying to save.
So, the amount of saving we are talking about is very minute compared to the actual execution time and should not be used as an excuse for a "bad" coding practice.
I hope this answers your question.
You should always reference columns explicitly. This way, if the table structure changes (and such changes are made in an intelligent, backward-compatible way), your queries will continue to work and can be modified over time.
Also, unless you actually need all of the columns from the table (not typical), using SELECT * is bringing more data to your application than is necessary, and potentially forcing a clustered index scan instead of what might have been satisfied by a narrower covering index.
Bad habits to kick : using SELECT * / omitting the column list
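The covering-index point can be illustrated with a query plan. The sketch below uses SQLite through Python's sqlite3 module, not SQL Server, so the plan wording and the optimizer are different; the EmpId and Notes columns and the index name ix_emp_name are assumptions added for the example. It shows that the narrow query can be answered from the index alone, while SELECT * has to go back to the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (
        EmpId      INTEGER PRIMARY KEY,
        EmpName    TEXT,
        EmpAddress TEXT,
        Notes      TEXT
    );
    -- Index that contains every column the narrow query needs.
    CREATE INDEX ix_emp_name ON Employees (EmpName, EmpAddress);
""")


def plan(sql):
    """Return SQLite's query plan text for a query with one parameter."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("Ann",)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


named_plan = plan("SELECT EmpName, EmpAddress FROM Employees WHERE EmpName = ?")
star_plan = plan("SELECT * FROM Employees WHERE EmpName = ?")

print("named:", named_plan)
print("star: ", star_plan)
```

In SQLite the first plan mentions a covering index and the second does not, because the Notes column is only stored in the table itself.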
Performance-wise, I think there is no difference between the two. But they are used in different cases, and that may be the difference.
Consider a slightly larger table. If your table (Employees) contains 10 columns, then the first query will return all of the information in the table, whereas with the second query you can specify which columns' information you need. So when you need all of the information about employees, the first one is better than spelling out all of the column names.
Of course, when you need to ALTER the table, the two would no longer be equal.

How to improve query performance

I have a lot of records in table. When I execute the following query it takes a lot of time. How can I improve the performance?
SET ROWCOUNT 10
SELECT StxnID
,Sprovider.description as SProvider
,txnID
,Request
,Raw
,Status
,txnBal
,Stxn.CreatedBy
,Stxn.CreatedOn
,Stxn.ModifiedBy
,Stxn.ModifiedOn
,Stxn.isDeleted
FROM Stxn,Sprovider
WHERE Stxn.SproviderID = SProvider.Sproviderid
AND Stxn.SProviderid = ISNULL(@pSProviderID,Stxn.SProviderid)
AND Stxn.status = ISNULL(@pStatus,Stxn.status)
AND Stxn.CreatedOn BETWEEN ISNULL(@pStartDate,getdate()-1) and ISNULL(@pEndDate,getdate())
AND Stxn.CreatedBy = ISNULL(@pSellerId,Stxn.CreatedBy)
ORDER BY StxnID DESC
The stxn table has more than 100,000 records.
The query is run from a report viewer in asp.net c#.
This is my go-to article when I'm trying to do a search query that has several search conditions which might be optional.
http://www.sommarskog.se/dyn-search-2008.html
The biggest problem with your query is the column = ISNULL(@column, column) syntax. MSSQL won't use an index for that. Consider changing it to (column = @column OR @column IS NULL), which keeps the "NULL means no filter" behaviour.
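One alternative for optional search conditions is to build the WHERE clause only from the filters that were actually supplied, so every predicate is a plain column = value comparison. The sketch below is an illustration only: it uses Python's sqlite3 module with ? placeholders rather than T-SQL parameters, a cut-down version of the Stxn table with invented rows, and a hypothetical helper named build_search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Stxn (
        StxnID      INTEGER PRIMARY KEY,
        SproviderID INTEGER,
        Status      TEXT,
        CreatedBy   INTEGER
    );
    INSERT INTO Stxn VALUES (1, 10, 'OK', 7), (2, 10, 'FAIL', 7), (3, 20, 'OK', 8);
""")


def build_search(provider_id=None, status=None, seller_id=None):
    """Build SQL and parameters, adding a predicate only for supplied filters."""
    clauses, params = [], []
    # Column names are fixed here; only the values come from the caller.
    filters = (
        ("SproviderID", provider_id),
        ("Status", status),
        ("CreatedBy", seller_id),
    )
    for column, value in filters:
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    return f"SELECT StxnID FROM Stxn{where} ORDER BY StxnID DESC", params


def search(**filters):
    sql, params = build_search(**filters)
    return [row[0] for row in conn.execute(sql, params)]


all_ids = search()
ok_ids = search(status="OK")
ok_for_provider_10 = search(provider_id=10, status="OK")
print(all_ids, ok_ids, ok_for_provider_10)
```

Because values are always passed as parameters and never concatenated into the SQL text, this keeps the usual protection against SQL injection while leaving out the predicates that would otherwise compare a column with itself.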
You should consider using the execution plan and look for missing indexes. Also, how long does it take to execute? What is slow for you?
Maybe you could also not return so many rows, but that is just a guess. Actually we need to see your table and indexes plus the execution plan.
Check sql-tuning-tutorial
For one, use SELECT TOP () instead of SET ROWCOUNT - the optimizer will have a much better chance that way. Another suggestion is to use a proper inner join instead of potentially ending up with a cartesian product using the old style table,table join syntax (this is not the case here but it can happen much easier with the old syntax). Should be:
...
FROM Stxn INNER JOIN Sprovider
ON Stxn.SproviderID = SProvider.Sproviderid
...
And if you think 100K rows is a lot, or that this volume is a reason for slowness, you're sorely mistaken. Most likely you have really poor indexing strategies in place, possibly some parameter sniffing, possibly some implicit conversions... hard to tell without understanding the data types, indexes and seeing the plan.
There are a lot of things that could impact the performance of a query, although 100k records really isn't all that many.
Items to consider (in no particular order)
Hardware:
Is SQL Server memory constrained? In other words, does it have enough RAM to do its job? If it is swapping memory to disk, then this is a sure sign that you need an upgrade.
Is the machine disk constrained? In other words, are the drives fast enough to keep up with the queries you need to run? If it's memory constrained, then disk speed becomes a larger factor.
Is the machine processor constrained? For example, when you execute the query does the processor spike for long periods of time? Or, are there already lots of other queries running that are taking resources away from yours...
Database Structure:
Do you have indexes on the columns used in your where clause? If the tables do not have indexes then it will have to do a full scan of both tables to determine which records match.
Eliminate the ISNULL function calls. If this is a direct query, have the calling code validate the parameters and set default values before executing. If it is in a stored procedure, do the checks at the top of the procedure. Unless you are executing this with RECOMPILE, which does parameter sniffing, those functions will have to be evaluated for each row.
Network:
Is the network slow between you and the server? Depending on the amount of data pulled you could be pulling GB's of data across the wire. I'm not sure what is stored in the "raw" column. The first question you need to ask here is "how much data is going back to the client?" For example, if each record is 1MB+ in size, then you'll probably have disk and network constraints at play.
General:
I'm not sure what "slow" means in your question. Does it mean that the query is taking around 1 second to process or does it mean it's taking 5 minutes? Everything is relative here.
Basically, it is going to be impossible to give a hard answer without a lot of questions being answered by you. All of these will bear out if you profile the queries, understand what and how much is going back to the client, and watch the interactions amongst the various parts.
Finally depending on the amount of data going back to the client there might not be a way to improve performance short of hardware changes.
Make sure Stxn.SproviderID, Stxn.status, Stxn.CreatedOn, Stxn.CreatedBy, Stxn.StxnID and SProvider.Sproviderid all have indexes defined.
(NB -- you might not need all, but it can't hurt.)
I don't see much that can be done on the query itself, but I can see things being done on the schema :
Create an index / PK on Stxn.SproviderID
Create an index / PK on SProvider.Sproviderid
Create indexes on status, CreatedOn, CreatedBy, StxnID
Something to consider: When ROWCOUNT or TOP are used with an ORDER BY clause, the entire result set is created and sorted first and then the top 10 results are returned.
How does this run without the Order By clause?

Hypothetical performance yield to not using SELECT *

To preface, I'm aware (as should you!) that using SELECT * in production is bad, but I was maintaining a script written by someone else. And, I'm also aware that this question is low on specifics... But hypothetical scenario.
Let's say I have a script that selects everything from a table of 20 fields. Let's say typical customer information.
Then let's say being the good developer I am, I shorten the SELECT * to a SELECT of the 13 specific fields I'm actually using on the display end.
What type of performance benefit, if any, could I expect by explicitly listing the fields versus SELECT *?
I will say this, both queries take advantage of the same exact indexes. The more specific query does not have access to a covering index that the other query could not use, in case you were wondering.
I'm not expecting miracles, like adding an index that targets the more specific query. I'm just wondering.
It depends on three things: the underlying storage and retrieval mechanism used by your database, the nature of the 7 columns you're leaving out, and the number of rows returned in the result set.
If the 7 (or whatever number) columns you're leaving out are "cheap to retrieve" columns, and the number of rows returned is low, I would expect very little benefit. If the columns are "expensive" (for instance, they're large, or they're BLOBs requiring reference to another file that is never cached) and / or you're retrieving a lot of rows then you could expect a significant improvement. Just how much depends on how expensive it is in your particular database to retrieve that information and assemble in memory.
Incidentally, there are other reasons besides speed to use named columns when retrieving information: you know for certain which columns are contained in the result set, and that they arrive in the order in which you want to use them.
The main difference I would expect to see is reduced network traffic. If any of the columns are large, they could take time to transfer, which is of course a complete waste if you're not displaying them.
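A rough way to see the traffic difference is to add up the size of the values each query hands back. The sketch below uses Python's sqlite3 module and an invented customers table with a 100,000-byte photo BLOB per row; the table, the column names, and the size estimate in value_size are all assumptions for illustration, and a real driver's wire format will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, email TEXT, photo BLOB)"
)
conn.executemany(
    "INSERT INTO customers (name, email, photo) VALUES (?, ?, ?)",
    [(f"name{i}", f"user{i}@example.com", bytes(100_000)) for i in range(20)],
)


def value_size(value):
    """Very rough size estimate of one returned value, in bytes."""
    if value is None:
        return 0
    if isinstance(value, bytes):
        return len(value)
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    return 8  # treat numbers as 8 bytes each


def payload_bytes(sql):
    """Sum the estimated size of everything the query returns."""
    return sum(value_size(v) for row in conn.execute(sql) for v in row)


star_bytes = payload_bytes("SELECT * FROM customers")
named_bytes = payload_bytes("SELECT id, name, email FROM customers")
print(star_bytes, named_bytes, star_bytes - named_bytes)
```

Here the whole difference comes from the one large column the display code never uses.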
It's also fairly critical if your database library references columns by index (instead of name), because if the column order changes in the database, it'll break the code.
Coding-style wise, it allows you to see which columns the rest of the code will be using, without having to read it.
Hmm, in one simple experiment, I was surprised at how much difference it made.
I just did a simple query with three variations:
select *
select the field that is the primary key. (It might get this directly from the index without actually reading the record.)
select a non-key field.
I used a table with a pretty large number of fields -- 72 of them -- including one CLOB. The query was just a select with one condition in the where clause.
Results:
Run    *      Key    Non-key
1      .647   .020   .028
2      .599   .041   .014
3      .321   .019   .027
avg    .522   .027   .023
Key vs non-key didn't seem to matter. (Which surprises me.) But retrieving just one field versus select * saved 95% of the runtime!
Of course this is one tiny experiment with one table. There could be many many relevant factors. I'm certainly not claiming that you will always reduce runtime by 95% by not using select *! But it's far more impressive than I expected.
When comparing 13 vs 20 fields, if the 7 fields that are left out are not fields such as CLOBs/BLOBs, I would expect to see no noticeable performance gain.
I/O is the main DB bottleneck (most DB systems are I/O bound), so you might think that you would bring execution time down to 13/20 of the original query's execution time (since you need that much less data). But normal fields are stored within the same physical structure (usually fields are arranged consecutively) and the file system reads whole blocks, so your disk heads will read the same amount of data (assuming all 20 fields together are smaller than a block; the situation can change if the size of a record is bigger than a block of your filesystem).
The principle that SELECT * is bad has a different cause - stability of the system.
If you use SELECT * in the wrong places, then changes to the underlying table(s) might break your system unexpectedly (mostly later, and if things break it is usually better if they break sooner). This can get especially interesting if you normalize data (move columns from one table to another, while keeping the same name). In such a case, if you chain SELECT * in views and chain your views, then you might actually not get any errors, but have (essentially) different end results.
Why don't you try it yourself and let us know?
It's all going to be dependent on how many columns and how wide they are.
Better still, do you have an actual performance problem? Tell us what your actual problem is and show us the code, and then we can suggest potential improvements. Chances are there are other improvements to be made that are much better than worrying about SELECT * vs. SELECT field list.
Select * means the database has to take time to look up the fields. If you don't need all those fields (and anytime you have an inner join you don't, as the join field is repeated!) then you are wasting both server resources to get the data and network resources to transport the data. You may also be wasting memory to hold the recordset to work with it. And while the performance improvement may be tiny for one query, how many times is that query run? And people who use this abysmally poor technique tend to use it everywhere, so fixing all of them can be a major improvement for not that much effort. And how hard is it to specify the fields? I don't know about every database, but in SQL Server I can drag and drop what I want from the object browser in seconds. So using select * is trading less than a minute of development time for worse performance every single time the query is run, and creating code that is fragile and subject to very bad problems as the schema changes. I see no reason to ever use select * in production code.

Why is SELECT * considered harmful?

Why is SELECT * bad practice? Wouldn't it mean less code to change if you added a new column you wanted?
I understand that SELECT COUNT(*) is a performance problem on some DBs, but what if you really wanted every column?
There are really three major reasons:
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions of SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
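The binding problem is easy to reproduce. The sketch below uses Python's sqlite3 module with two small invented tables that both have an ID column; how duplicate names are reported varies by database and driver, so treat the details as specific to this setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID INTEGER PRIMARY KEY, B_ID INTEGER);
    CREATE TABLE TableB (ID INTEGER PRIMARY KEY, Label TEXT);
    INSERT INTO TableB VALUES (100, 'widget');
    INSERT INTO TableA VALUES (1, 100);
""")

# SELECT * over a join returns two columns that are both called ID.
cur = conn.execute("SELECT * FROM TableA a JOIN TableB b ON b.ID = a.B_ID")
star_names = [col[0] for col in cur.description]
star_row = cur.fetchone()
# A consumer that keys values by column name silently loses one ID.
star_dict = dict(zip(star_names, star_row))

# Explicit columns with aliases leave no ambiguity.
cur = conn.execute(
    "SELECT a.ID AS a_id, b.ID AS b_id, b.Label AS label "
    "FROM TableA a JOIN TableB b ON b.ID = a.B_ID"
)
named_names = [col[0] for col in cur.description]
named_dict = dict(zip(named_names, cur.fetchone()))

print(star_names, star_dict)
print(named_dict)
```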
But it's not all bad for SELECT *. I use it liberally for these use cases:
Ad-hoc queries. When trying to debug something, especially off a narrow table I might not be familiar with, SELECT * is often my best friend. It helps me just see what's going on without having to do a boatload of research as to what the underlying column names are. This gets to be a bigger "plus" the longer the column names get.
When * means "a row". In the following use cases, SELECT * is just fine, and rumors that it's a performance killer are just urban legends which may have had some validity many years ago, but don't now:
SELECT COUNT(*) FROM table;
in this case, * means "count the rows". If you were to use a column name instead of * , it would count the rows where that column's value was not null. COUNT(*), to me, really drives home the concept that you're counting rows, and you avoid strange edge-cases caused by NULLs being eliminated from your aggregates.
Same goes with this type of query:
SELECT a.ID FROM TableA a
WHERE EXISTS (
SELECT *
FROM TableB b
WHERE b.ID = a.B_ID);
in any database worth its salt, * just means "a row". It doesn't matter what you put in the subquery. Some people use b's ID in the SELECT list, or they'll use the number 1, but IMO those conventions are pretty much nonsensical. What you mean is "count the row", and that's what * signifies. Most query optimizers out there are smart enough to know this. (Though to be honest, I only know this to be true with SQL Server and Oracle.)
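Both uses of * as "a row" can be checked with a few rows of sample data. This is a small sketch with Python's sqlite3 module; the Note column and the inserted values are invented, and the claim about optimizers above is not something this snippet measures, it only compares results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableB (ID INTEGER PRIMARY KEY, Note TEXT);
    CREATE TABLE TableA (ID INTEGER PRIMARY KEY, B_ID INTEGER);
    INSERT INTO TableB VALUES (1, 'x'), (2, NULL), (3, NULL);
    INSERT INTO TableA VALUES (10, 1), (11, 2), (12, 99);
""")

# COUNT(*) counts rows; COUNT(column) skips rows where the column is NULL.
count_rows = conn.execute("SELECT COUNT(*) FROM TableB").fetchone()[0]
count_notes = conn.execute("SELECT COUNT(Note) FROM TableB").fetchone()[0]

EXISTS_TEMPLATE = """
    SELECT a.ID FROM TableA a
    WHERE EXISTS (SELECT {select_list} FROM TableB b WHERE b.ID = a.B_ID)
    ORDER BY a.ID
"""


def exists_ids(select_list):
    """Run the EXISTS query with the given sub-query select list."""
    sql = EXISTS_TEMPLATE.format(select_list=select_list)
    return [row[0] for row in conn.execute(sql)]


with_star = exists_ids("*")
with_one = exists_ids("1")
with_id = exists_ids("b.ID")

print(count_rows, count_notes)
print(with_star, with_one, with_id)
```

The three EXISTS variants return the same IDs, while the two counts differ because of the NULL notes.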
The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.
Performance
The * shorthand can be slower because:
Not all the fields are indexed, forcing a full table scan - less efficient
What you save to send SELECT * over the wire risks a full table scan
Returning more data than is needed
Returning trailing columns using variable length data type can result in search overhead
Maintenance
When using SELECT *:
Someone unfamiliar with the codebase would be forced to consult documentation to know what columns are being returned before being able to make competent changes. Making code more readable, minimizing the ambiguity and work necessary for people unfamiliar with the code saves more time and effort in the long run.
If code depends on column order, SELECT * will hide an error waiting to happen if a table had its column order changed.
Even if you need every column at the time the query is written, that might not be the case in the future
the usage complicates profiling
Design
SELECT * is an anti-pattern:
The purpose of the query is less obvious; the columns used by the application are opaque
It breaks the modularity rule about using strict typing whenever possible. Explicit is almost universally better.
When Should "SELECT *" Be Used?
It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.
Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.
Even if you wanted to select every column now, you might not want to select every column after someone adds one or more new columns. If you write the query with SELECT * you are taking the risk that at some point someone might add a column of text which makes your query run more slowly even though you don't actually need that column.
Wouldn't it mean less code to change if you added a new column you wanted?
The chances are that if you actually want to use the new column then you will have to make quite a lot of other changes to your code anyway. You're only saving , new_column - just a few characters of typing.
If you really want every column, I haven't seen a performance difference between select (*) and naming the columns. The driver to name the columns might be simply to be explicit about what columns you expect to see in your code.
Often though, you don't want every column and the select(*) can result in unnecessary work for the database server and unnecessary information having to be passed over the network. It's unlikely to cause a noticeable problem unless the system is heavily utilised or the network connectivity is slow.
If you name the columns in a SELECT statement, they will be returned in the order specified, and may thus safely be referenced by numerical index. If you use "SELECT *", you may end up receiving the columns in arbitrary sequence, and thus can only safely use the columns by name. Unless you know in advance what you'll be wanting to do with any new column that gets added to the database, the most probable correct action is to ignore it. If you're going to be ignoring any new columns that get added to the database, there is no benefit whatsoever to retrieving them.
In a lot of situations, SELECT * will cause errors at run time in your application, rather than at design time. It hides the knowledge of column changes, or bad references in your applications.
Think of it as reducing the coupling between the app and the database.
To summarize the 'code smell' aspect:
SELECT * creates a dynamic dependency between the app and the schema. Restricting its use is one way of making the dependency more defined, otherwise a change to the database has a greater likelihood of crashing your application.
If you add fields to the table, they will automatically be included in all your queries where you use select *. This may seem convenient, but it will make your application slower as you are fetching more data than you need, and it will actually crash your application at some point.
There is a limit for how much data you can fetch in each row of a result. If you add fields to your tables so that a result ends up being over that limit, you get an error message when you try to run the query.
This is the kind of errors that are hard to find. You make a change in one place, and it blows up in some other place that doesn't actually use the new data at all. It may even be a less frequently used query so that it takes a while before someone uses it, which makes it even harder to connect the error to the change.
If you specify which fields you want in the result, you are safe from this kind of overhead overflow.
I don't think that there can really be a blanket rule for this. In many cases, I have avoided SELECT *, but I have also worked with data frameworks where SELECT * was very beneficial.
As with all things, there are benefits and costs. I think that part of the benefit vs. cost equation is just how much control you have over the data structures. In cases where the SELECT * worked well, the data structures were tightly controlled (it was retail software), so there wasn't much risk that someone was going to sneak a huge BLOB field into a table.
Reference taken from this article.
Never go with "SELECT *",
I have found only one reason to use "SELECT *"
If you have special requirements and have created a dynamic environment in which added or deleted columns are handled automatically by the application code. In this special case you don't need to change the application or database code, and the change takes effect in the production environment automatically. In this case you can use "SELECT *".
Generally you have to fit the results of your SELECT * ... into data structures of various types. Without specifying which order the results are arriving in, it can be tricky to line everything up properly (and more obscure fields are much easier to miss).
This way you can add fields to your tables (even in the middle of them) for various reasons without breaking sql access code all over the application.
Using SELECT * when you only need a couple of columns means a lot more data is transferred than you need. This adds processing on the database and increases latency in getting the data to the client. Add to this that it will use more memory when loaded, in some cases significantly more (such as with large BLOBs). It's mostly about efficiency.
In addition to this, however, it's easier to see when looking at the query what columns are being loaded, without having to look up what's in the table.
Yes, if you do add an extra column, it would be faster, but in most cases, you'd want/need to change your code using the query to accept the new columns anyways, and there's the potential that getting ones you don't want/expect can cause issues. For example, if you grab all the columns, then rely on the order in a loop to assign variables, then adding one in, or if the column orders change (seen it happen when restoring from a backup) it can throw everything off.
This is also the same sort of reasoning why if you're doing an INSERT you should always specify the columns.
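The INSERT case fails in much the same way. The following sketch uses Python's sqlite3 module with a two-column Employees table and an invented EmpPhone column; other databases report the error differently, but an INSERT without a column list generally depends on the table's current shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmpName TEXT, EmpAddress TEXT)")

# Works today: two values for a two-column table.
conn.execute("INSERT INTO Employees VALUES ('Ann', '1 Main St')")

# The table gains a column (EmpPhone is a made-up name).
conn.execute("ALTER TABLE Employees ADD COLUMN EmpPhone TEXT")

try:
    conn.execute("INSERT INTO Employees VALUES ('Bob', '2 Side St')")
    positional_result = "ok"
except sqlite3.OperationalError as exc:
    # The value count no longer matches the column count.
    positional_result = f"error: {exc}"

# With an explicit column list the statement keeps working;
# the new column simply gets its default (NULL here).
conn.execute(
    "INSERT INTO Employees (EmpName, EmpAddress) VALUES ('Bob', '2 Side St')"
)

rows = conn.execute(
    "SELECT EmpName, EmpAddress, EmpPhone FROM Employees ORDER BY EmpName"
).fetchall()
print(positional_result)
print(rows)
```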
Selecting with column name raises the probability that database engine can access the data from indexes rather than querying the table data.
SELECT * exposes your system to unexpected performance and functionality changes in the case when your database schema changes because you are going to get any new columns added to the table, even though, your code is not prepared to use or present that new data.
There is also a more pragmatic reason: money. When you use a cloud database and have to pay for the data processed, there is no justification for reading data that you will immediately discard.
For example: BigQuery:
Query pricing
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed.
and Control projection - Avoid SELECT *:
Best practice: Control projection - Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.
Understand your requirements prior to designing the schema (if possible).
Learn about the data:
1) indexing
2) type of storage used
3) vendor engine or features, i.e. caching, in-memory capabilities
4) datatypes
5) size of table
6) frequency of query
7) related workloads if the resource is shared
8) test
A) Requirements will vary. If the hardware cannot support the expected workload, you should re-evaluate how to meet the requirements within the workload. Regarding adding a column to the table: if the database supports views, you can create an indexed(?) view of the specific data with the specific named columns (vs. select '*'). Periodically review your data and schema to ensure you never run into the "Garbage-in" -> "Garbage-out" syndrome.
Assuming there is no other solution; you can take the following into account. There are always multiple solutions to a problem.
1) Indexing: The select * will execute a table scan. Depending on various factors, this may involve a disk seek and/or contention with other queries. If the table is multi-purpose, ensure all queries are performant and execute below your target times. If there is a large amount of data and your network or other resources aren't tuned, you need to take this into account. The database is a shared environment.
2) Type of storage, i.e. whether you're using SSDs, disk, or memory. I/O times and the load on the system/CPU will vary.
3) Can the DBA tune the database/tables for higher performance? Assuming that, for whatever reason, the teams have decided the select '*' is the best solution to the problem, can the DB or table be loaded into memory? (Or another method... maybe the response was designed to respond with a 2-3 second delay, while an advertisement plays to earn the company revenue...)
4) Start at the baseline. Understand your data types and how results will be presented. Smaller datatypes and fewer fields reduce the amount of data returned in the result set. This leaves resources available for other system needs. System resources usually have a limit; 'always' work below these limits to ensure stability and predictable behaviour.
5) Size of table/data. select '*' is common with tiny tables. They typically fit in memory, and response times are quick. Again, review your requirements. Plan for feature creep; always plan for the current and possible future needs.
6) Frequency of query/queries. Be aware of other workloads on the system. If this query fires off every second and the table is tiny, the result set can be designed to stay in cache/memory. However, if the query is a frequent batch process with gigabytes/terabytes of data, you may be better off dedicating additional resources to ensure other workloads aren't affected.
7) Related workloads. Understand how the resources are used. Is the network/system/database/table/application dedicated or shared? Who are the stakeholders? Is this for production, development, or QA? Is this a temporary "quick fix"? Have you tested the scenario? You'll be surprised how many problems can exist on current hardware today. (Yes, performance is fast... but the design/performance is still degraded.) Does the system need to perform 10K queries per second or 5-10 queries per second? Is the database server dedicated, or do other applications and monitoring run on the shared resource? Some applications, languages, and operating systems will consume 100% of the memory, causing various symptoms/problems.
8) Test: Test out your theories, and understand as much as you can about your system. Your select '*' issue may be a big deal, or it may be something you don't even need to worry about.
There's an important distinction here that I think most answers are missing.
SELECT * isn't an issue. Returning the results of SELECT * is the issue.
An OK example, in my opinion:
WITH data_from_several_tables AS (
SELECT * FROM table1_2020
UNION ALL
SELECT * FROM table1_2021
...
)
SELECT id, name, ...
FROM data_from_several_tables
WHERE ...
GROUP BY ...
...
This avoids all the "problems" of using SELECT * mentioned in most answers:
Reading more data than expected? Optimisers in modern databases will be aware that you don't actually need all columns.
Column ordering of the source tables affects output? We still select and return data explicitly.
Consumers can't see what columns they receive from the SQL? The columns you're acting on are explicit in code.
Indexes may not be used? Again, modern optimisers should handle this the same as if we didn't SELECT *.
There's a readability/refactorability win here - no need to duplicate long lists of columns or other common query clauses such as filters. I'd be surprised if there are any differences in the query plan when using SELECT * like this compared with SELECT <columns> (in the vast majority of cases - obviously always profile running code if it's critical).
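To make the pattern concrete, here is a runnable sketch in SQLite via Python's sqlite3 module. The two yearly tables reuse the names from the example above, but their columns (id, name, notes) and rows are invented. It only checks that both spellings return the same rows; whether the optimiser really skips the unused notes column depends on the engine, so profile on yours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical yearly tables with an extra column the final query never uses.
conn.executescript("""
CREATE TABLE table1_2020 (id INTEGER, name TEXT, notes TEXT);
CREATE TABLE table1_2021 (id INTEGER, name TEXT, notes TEXT);
INSERT INTO table1_2020 VALUES (1, 'a', 'x'), (2, 'b', 'y');
INSERT INTO table1_2021 VALUES (3, 'c', 'z');
""")

# SELECT * only inside the CTE; the statement's output is still explicit.
star_inside = conn.execute("""
WITH data_from_several_tables AS (
    SELECT * FROM table1_2020
    UNION ALL
    SELECT * FROM table1_2021
)
SELECT id, name
FROM data_from_several_tables
ORDER BY id
""").fetchall()

# The same query with the column list repeated in every branch.
explicit_inside = conn.execute("""
WITH data_from_several_tables AS (
    SELECT id, name FROM table1_2020
    UNION ALL
    SELECT id, name FROM table1_2021
)
SELECT id, name
FROM data_from_several_tables
ORDER BY id
""").fetchall()

print(star_inside == explicit_inside)
```

Note that the UNION ALL form does require the yearly tables to share the same column layout, which is an assumption of this sketch.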

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
    other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. Databases are very fast within their problem space, and joining tables is part of it; you'll see more performance degradation from the second query than from the join. Unless your tables hold hundreds of millions of records, the two-query approach is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing are the only things that will tell you that. MySQL with moderately large tables seems to be susceptible to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL statement would be better
There's a danger of treating your SQL DBMS as if it were an ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
SELECT a.blah1, a.blah2, b.something
FROM foo a
LEFT JOIN foo2 b ON a.foreign_key = b.key
WHERE a.key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
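For anyone who wants to try both approaches side by side, here is a minimal sketch using SQLite through Python's sqlite3 module. The table and column names follow the question's pseudocode; the sample rows and the functions two_queries and one_query are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample data: row 1 has a related foo2 row, row 2 (foreign_key = 0) does not.
conn.executescript("""
CREATE TABLE foo2 (key INTEGER PRIMARY KEY, something TEXT);
CREATE TABLE foo (key INTEGER PRIMARY KEY, blah1 TEXT, blah2 TEXT,
                  foreign_key INTEGER);
INSERT INTO foo2 VALUES (10, 'related');
INSERT INTO foo VALUES (1, 'a', 'b', 10), (2, 'c', 'd', 0);
""")

def two_queries(bar):
    """The pattern from the question: a second round trip when a key is set."""
    blah1, blah2, fk = conn.execute(
        "SELECT blah1, blah2, foreign_key FROM foo WHERE key = ?", (bar,)
    ).fetchone()
    something = None
    if fk > 0:
        row = conn.execute(
            "SELECT something FROM foo2 WHERE key = ?", (fk,)
        ).fetchone()
        something = row[0] if row else None
    return (blah1, blah2, something)

def one_query(bar):
    """Single round trip: the LEFT JOIN yields NULL when nothing matches."""
    return conn.execute(
        """SELECT a.blah1, a.blah2, b.something
           FROM foo a
           LEFT JOIN foo2 b ON a.foreign_key = b.key
           WHERE a.key = ?""",
        (bar,),
    ).fetchone()

print(one_query(1), one_query(2))
```

The application still branches, but on whether something is None rather than by issuing another query.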
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that one database hit gets you all the data you need, a single SQL statement would perform better 99% of the time. I'm not sure whether the connection is created dynamically in this case, but if it is, doing so is expensive. Even if the process is reusing existing connections, the DBMS doesn't get the chance to optimize the queries in the best way and isn't really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is large and only needed in some cases. But the sample you describe just grabs it if it exists, so that is not the case here and there is no performance gain.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I inherited was a single query with so many joins in it that it would take the SQL engine a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically fixed the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objection's sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
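A toy version of that staging approach, assuming SQLite via Python's sqlite3 module. The orders and customers tables, the big_orders temporary table, and the data are all hypothetical; with two tiny tables there is nothing to gain, so this shows only the shape of the technique, not the speedup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows for the illustration.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 300.0);
""")

# Step 1: a simple "one shot" that stages only the rows of interest.
conn.execute("""
CREATE TEMP TABLE big_orders AS
SELECT id, customer_id, total FROM orders WHERE total >= 100
""")

# Step 2: build the final result from the small staged table.
staged = conn.execute("""
SELECT c.name, b.total
FROM big_orders b
JOIN customers c ON c.id = b.customer_id
ORDER BY b.id
""").fetchall()

# For comparison: the same result as one joined query.
single = conn.execute("""
SELECT c.name, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.total >= 100
ORDER BY o.id
""").fetchall()

print(staged == single)
```

Whether staging beats the single statement depends on the engine and the size of the join, so it is worth timing both on real data before restructuring anything.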
A single SQL query would give better performance, as the SQL server (which sometimes isn't in the same location) needs to handle only one request. If you use multiple SQL queries, you introduce a lot of overhead:
executing more CPU instructions,
sending a second query to the server,
creating a second thread on the server,
possibly executing more CPU instructions on the server,
destroying the second thread on the server,
sending the second result back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two-table join is usually the best way to approach this problem. However, depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements. I typically haven't run into this until approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables so that you aren't scanning the disk for all records; that is the biggest performance hit a database gets (in my limited experience).
You should always try to minimize the number of queries to the database when you can. Your example is perfect for a single query. That way you will later be able to cache more easily, or handle more requests at the same time, because instead of always using 2-3 queries that each require a connection, you will have only 1 each time.
There are many cases that will require different solutions, and it isn't possible to explain them all here.
A join scans both tables and loops to match each record of the first table against the second table. A simple select query will work faster in many cases, as it only needs the primary/unique key (if one exists) to look up the data internally.