Determine if a SQL Insert/Update statement affects the result from a stored Select Statement - sql

Thought this would be a good place to ask for some "brainstorming." Apologies if it's a little broad/off subject.
I was wondering if anyone here had any ideas on how to approach the following problem:
First assume that I have a select statement stored somewhere as an object (this can be the tree form of the query). For example (for simplicity):
SELECT A, B FROM table_A WHERE A > 10;
It's easy to determine the below would change the result of the above query:
INSERT INTO table_A (A,B) VALUES (12,15);
But, given any possible Insert/Update/Whatever statement, as well as any possible starting Select (but we know the Selects and can analyze them all day) I'd like to determine if it would affect the result of the Select Statement.
It's fine to assume that there won't be any "outside" queries, and that we know about all the queries being sent to the DB. It is also assumed we know the DB schema.
No, this isn't for homework. Just a brain teaser I've been thinking about and started to get stuck on (obviously, SQL can get very complicated.)

Based on the reply to the comment, I'd say that without additional criteria, this ranges between very hard and impossible.
Very hard (leastways, it would be for me) because you'd have to write something to parse and interpret your SQL statements into a workable frame of reference for your goals. Doable, but can it be worth the effort?
Impossible because some queries transcend phrases like "Byzantinely complex". (Think nested queries, correlated subqueries, views, common table expressions, triggers, outer joins, and who knows what all.) Without setting criteria such as "no subqueries, no views or triggers, no more than X joins" and so forth, the problem becomes open-ended enough to warrant an NP Complete answer.

My first thought would be to put a trigger on table_A, where if any of the columns you're affecting (col A in this case) changes to meet (or no longer meet) the condition (> 10 here), then the trigger records that an "affecting" change has taken place.
E.g. have another little table to record a "last update timestamp", which the trigger could pop a getdate() into when it detects such a change.
Then, you could check that table to see if the timestamp has changed since the last time you ran the select query - if it has, then you know you need to re-run it, if it hasn't, then you know the results would be the same.
The table could hold many such timestamps (one per row, perhaps with the table/trigger name as a key value in another column) to service many such triggers.
Advantage? Being done in a trigger on the table means no risk of a change that could affect the select statement being missed.
Disadvantage? I guess depending on how your select statements come into existence, you might have an undesirable/unmanageable overhead in creating the trigger(s).

Related

Oracle 12c Subquery Factoring Inline View now has bad plan?

Update 11/2
After some additional troubleshooting, my team was able to tie this Oracle bug directly to a parameter change that was made on the 12c database the night before the query stopped working. After experiencing some performance issues from an application tied to this database, my team had our DBA change the OPTIMIZER_FEATURES_ENABLE parameter from 12.1.02 to 11.2.0.4. This fixed the performance issue for the problem application but caused the bug I have described above. To verify, I've been able to replicate this same issue in a separate environment by changing this parameter. My DBA has filed a ticket with Oracle to have this looked at.
As a workaround, I was able to make a slight change to my query in order to retrieve the expected results. Specifically, I combined Subquery1 with Subquery2 and I moved a few predicates in Subquery1 from the WHERE clause to the JOIN (where they more properly belonged). This change edited my execution plan (it is slightly less efficient than what was listed before) but was enough to address the original issue.
Original Post
Firstly, let me apologize for any vagueness in this question but I'm dealing with a confidential financial system so I am forced to hide certain implementation details.
Background
I have an Oracle query that I put into production a long time ago that has recently stopped producing expected results coincidentally after an upgrade from 11g to 12c. To my (and my production support team's) knowledge this query had been working fine for well over a year before that.
Details
The query is overly complicated and not very efficient but this is in large part because I am dealing with non-normalized tables (historically modeled after a Mainframe) and poor data input from upstream systems. In order to deal with a complicated business situation I leveraged multiple levels of Subquery Factoring (the WITH statement) and then my final statement joins together two Inline Views. The basic structure of the query without all of the complicated predicates is as follows:
I have 3 tables Table1, Table2, Table3. Table1 is a processing table made up of records from Table2.
--This grabs a subset from Table1
WITH Subquery1 as (
SELECT FROM Table1),
--This eliminates certain records from the first subset based on sister records
--from the original source table
Subquery2 as (
SELECT FROM Subquery1
WHERE NOT EXISTS FROM (SELECT from Table2)),
--This ties the records from Subquery2 to Table3
Subquery3 as (
SELECT FROM Table3
JOIN (SELECT Max(Date) FROM Table3)
JOIN Subquery2)
--This final query evaluates subquery3 in two different ways and
--only takes those records which fit the criteria items from both sets
SELECT FROM
(SELECT FROM Subquery3) -- Call this Inline View A
JOIN (SELECT FROM Subquery3) -- Call this Inline View B
The final query is pretty basic:
SELECT A.Group_No, B.Sub_Group, B.Key, B.Lob
FROM (SELECT Group_No, Lob, COUNT(Sub_Group)
FROM Subquery3
GROUP BY Group_No, Lob
HAVING COUNT(Sub_Group) = 1) A
JOIN (SELECT Group_No, Sub_Group, Key, Lob
FROM Subquery3
WHERE Sub_Group LIKE '0000%') B
ON A.Group_No = B.Group_No
AND A.Lob = B.Lob
Problem
If I edit the final query to remove the second Inline View and evaluate the output of the A inline view, I come away with 0 returned rows. I've manually evaluated the records for each individual subquery and can confirm this is an expected result.
Likewise, if I edit the final query to produce the output of only the 'B' inline view, I come away with 6 returned rows. Again, I've manually evaluated the data and this is exactly as expected.
Now when joining these two subsets (Inline View A and Inline View B) together, I would expect that the final query result would be 0 rows (since an inner join between a full set and an empty set produces no matches). However, when I run the entire query with the inner join as described above, I am getting back 1158 rows!
I have reviewed the Execution Plan but nothing jumps out at me:
Questions
Clearly I have done something to confuse the Oracle Optimizer and the updated query plan is pulling back a much different query than the one I have submitted. My best guess is that with all of these temporary views floating around within the same query, I have confused Oracle into evaluating some set before one that it depends upon.
To this day I've been unable to locate the official Oracle documentation around the WITH statement so I've never been completely confident about the order that subqueries are evaluated. I did notice in searching SO (can't find it now) someone mentioned that a factored subquery cannot refer to another factored query. I've never known this to be true before but the bizarre output above is making me wonder if I had only been lucky before with this query?
Can anyone explain the behavior I am seeing? Am I attempting to do something obviously incorrect with this query plan? Or alternatively, is there any chance that something changed between 11g and 12c that could explain why the behavior of this query might have changed?
This sounds like a "wrong results" bug in Oracle. These bugs are usually extremely specific to the version and the features you are using. There's nothing obviously wrong with the queries or execution plan you posted.
You have two ways of handling this:
Try to find the precise bug. What you're doing with common table expressions looks fine. There are some rare times when your query is technically invalid, you get "lucky" in one version and it works, and when you upgrade it fails. But when that happens the new version usually throws an error, not return wrong results. There's probably some extremely weird, specific combination of features you're using that's causing the issue. To find the real issue you need to massively simplify the query until you can make the smallest possible change and see the problem appear and disappear. You'll also want to remove all objects and only use DUAL. This process can take hours. At the end, when you're left with only a few lines of code, either post them here, look on Oracle Support, or create a Service Request.
Avoid the bug. Even if you go through the above steps there may not be a fix anyway. Sometimes the best work-around is to do something differently. It's nice to get to the bottom of every problem but you don't always have time. Instead, try re-writing the query in syntactically different but logically equivalent ways. Remove some or all of the common table expressions, maybe even repeat some SQL. But be sure to leave a comment warning future programmers of why you're doing things in a weird way.

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
select foo.a,bar.b,baz.c
from foo,bar,baz
-- updated for clarity sake
where foo.a=b.bar
and b.bar=baz.c
)
group by a,b,c
vice
create table results as
select foo.a,bar.b,baz.c
from foo,bar,baz
where foo.a=b.bar
and b.bar=baz.c ;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So in case it isn't clear. The top query performs the group by outright against the multi-table select thus preventing me from using an index. The second query allows me to create a new table that stores the results of the query, proceeding to create a spanning index, then finishing the group by query to utilize the index.
What is the complexity difference of these two approaches, i.e. how do they scale and which is preferable in the case of large quantities of data. Also, the main issue is the performance of the overall select so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three
columns indexed in their own right? How often do you want to run the
query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on commit, fast refreshable views, and they will add to your commit time processing slightly, but they will always give the same result as running the base query. On demand MVs will become stale as the underlying data changes until these are refreshed. You'll need to determine whether this is acceptable or not.
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally onlya apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place. The query as you posted it doesn't do any aggregation so the only reason for doing GROUP BY woudl be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you doing some form of cartesian join and one tuning the query would be to fix the WHERE clause so that it only returns discrete records.

Select * from table vs Select col1,col2,col3 from table [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
select * vs select column
I was just having a discussion with one of my colleague on the SQL Server performance on specifying the query command in the stored procedure.
So I want to know which one is preferred over another and whats the concrete reason behind that.
Suppose, We do have one table called
Employees(EmpName,EmpAddress)
And we want to select all the records from the table. So we can write the query in two ways,
Select * from Employees
Select EmpName, EmpAddress from Employees
So I would like to know is there any specific difference or performance issue in the above queries or are they just equal to the SQL Server Engine.
UPDATE:
Lets say the table schema won't change anymore. So no point for future maintenance.
Performance wise, lets say, the usage is very very high i.e. millions of hits per seconds on the database server. I want a clear and precise performance rating on both approaches.
No Indexing is done on the entire table.
The specific difference would show its ugly head if you add a column to the table.
Suddenly, the query you expected to return two columns now returns three. If you coded specifically for the two columns, the rest of your code is now broken.
Performance-wise, there shouldn't be a difference.
I always take the approach that being as specific as possible is the best when dealing with databases. If the table has two columns and you only need those two columns, be specific. Specify those two columns. It'll save you headaches in the future.
I am an avid avokat of the "be as specific as possible" rule, too. Not following it will hurt you in the long run. However, your question seems to be coming from a different background, so let me attempt to answer it.
When you submit a query to SQL Server it goes through several stages:
transmitting of query string over the network.
parsing of query string, producing a parse-tree
linking the referenced objects in the parse tree to existing objects
optimizing based on statistics and row count/size estimates
executing
transmitting of result data over the network
Let's look at each one:
The * query is a few bytes shorter, so step this will be faster
The * query contains fewer "tokens" so this should(!) be faster
During linking the list of columns need to be puled and compared to the query string. Here the "*" gets resolved to the actual column reference. Without access to the code it is impossible to say which version takes less cycles, however the amount of data accessed is about the same so this should be similar.
-6. In these stages there is no difference between the two example queries, as they will both get compiled to the same execution plan.
Taking all this into account, you will probably save a few nanoseconds when using the * notation. However, you example is very simplistic. In a more complex example it is possible that specifying as subset of columns of a table in a multi table join will lead to a different plan than using a *. If that happens we can be pretty certain that the explicit query will be faster.
The above comparison also assumes that the SQL Server process is running alone on a single processor and no other queries are submitted at the same time. If the process has to yield during the compilation those extra cycles will be far more than the ones we are trying to save.
So, the amont of saving we are talking about is very minute compared to the actual execution time and should not be used as an excuse for a "bad" coding practice.
I hope this answers your question.
You should always reference columns explicitly. This way, if the table structure changes (and such changes are made in an intelligent, backward-compatible way), your queries will continue to work and can be modified over time.
Also, unless you actually need all of the columns from the table (not typical), using SELECT * is bringing more data to your application than is necessary, and potentially forcing a clustered index scan instead of what might have been satisfied by a narrower covering index.
Bad habits to kick : using SELECT * / omitting the column list
Performance wise there are no difference between those 2 i think.But those 2 are used in different cases what may be the difference.
Consider a slightly larger table.If your table(Employees) contains 10 columns,then the 1st query will retain all of the information of the table.But for 2nd query,you may specify which columns information you need.So when you need all of the information of employees no.1 is the best one rather than specifying all of the column names.
Ofcourse,when you need to ALTER a table then those 2 would not be equal.

What is wrong with using SELECT * FROM sometable [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, colum2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of database.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen however, this still seems like a bad idea and SELECT * should never be used for a lot more technical reasons that I ever though about.
One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The real reason select queries are bad index coverage
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
Read from database
Sent across the network
Marshalled into your process
(for ADO-type technologies) Saved in a data-table in-memory
Ignored and discarded / garbage-collected
Item #1 has many hidden costs including eliminating some potential covering index, causing data-page loads (and server cache thrashing), incurring row / page / table locks that might be otherwise avoided.
Balance this against the potential savings of specifying the columns versus an * and the only potential savings are:
Programmer doesn't need to revisit the SQL to add columns
The network-transport of the SQL is smaller / faster
SQL Server query parse / validation time
SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there is NO savings as the expansion of the * has to happen anyway, which means consulting the table(s) schema anyway. Realistically, listing the columns will incur the same cost because they have to be validated against the schema. In other words this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, this all comes down, because of the way you specified the question, to the issue resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without code changing, think in terms of audit logging, effective/expiration dates and other similar things that get added by DBAs for systemically for compliance issues. Another source of underhanded changes is denormalizations for performance elsewhere in the system or user-defined fields.
You should only select the columns that you need. Even if you need all columns it's still better to list column names so that the sql server does not have to query system table for columns.
Also, your application might break if someone adds columns to the table. Your program will get columns it didn't expect too and it might not know how to process them.
Apart from this if the table has a binary column then the query will be much more slower and use more network resources.
There are four big reasons that select * is a bad thing:
The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
Listing the column names makes your code far more self-documented, and so probably more readable.
If you're transferring over a network (or even if you aren't), columns you don't need are just waste.
Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.
Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.
SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.
Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.
Performance wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit parameters is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance to prevent problems due to this.
definitely defining the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, then SQL can skip that step.
It's always better to specify the columns you need, if you think about it one time, SQL doesn't have to think "wtf is *" every time you query. On top of that, someone later may add columns to the table that you actually do not need in your query and you'll be better off in that case by specifying all of your columns.
The problem with "select *" is the possibility of bringing data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for you query to return.
Even if you do use all the columns brought from a "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bring those in your selects even if you don't need them.
Another point in which a "select *" statement is bad is on view creation. If you create a view using "select *" and later add columns to your table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work again.
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields on my queries, but when your system start to evolve, you'll see that it's worth to spend this extra time/effort in specifying the fields rather than spending much more time and effort removing bugs on your views or optimizing your app.
While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... query may get awful). Then - measure. Compare with query with column names listed explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.
You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.
It is NOT faster to use explicit field names versus *, if and only if, you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that's a nonsense too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.
The result is too huge. It is slow to generate and send the result from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (e.g. the WHERE clause, ORDER clause), as the number of rows can be huge (e.g. tens of millions of rows).
Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).
Performance wise I have seen comments that both are equal. but usability aspect there are some +'s and -'s
When you use a (select *) in a query and if some one alter the table and add new fields which do not need for the previous query it is an unnecessary overhead. And what if the newly added field is a blob or an image field??? your query response time is going to be really slow then.
In other hand if you use a (select col1,col2,..) and if the table get altered and added new fields and if those fields are needed in the result set, you always need to edit your select query after table alteration.
But I suggest always to use select col1,col2,... in your queries and alter the query if the table get altered later...
This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using Select *, it takes an average of 2869ms.
When I execute the query using Select , it takes an average of 1513ms.
Total rows returned is 13,949.
There is no doubt selecting column names means faster performance over Select *
Select is equally efficient (in terms of velocity) if you use * or columns.
The difference is about memory, not velocity. When you select several columns SQL Server must allocate memory space to serve you the query, including all data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the excecution plan which in turn depends heavily on your WHERE clause and the number of JOIN, OUTER JOIN, etc ...
For your question just use SELECT *. If you need all the columns there's no performance difference.
It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.
One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.
As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is a defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use "Select *" or you'll get a whole bunch of data you don't want and your performance could suffer.
To add on to what everyone else has said, if all of your columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up additional data from SQL.
SELECT * is necessary if one wants to obtain metadata such as the number of columns.
Gonna get slammed for this, but I do a select * because almost all my data is retrived from SQL Server Views that precombine needed values from multiple tables into a single easy to access View.
I do then want all the columns from the view which won't change when new fields are added to underlying tables. This has the added benefit of allowing me to change where data comes from. FieldA in the View may at one time be calculated and then I may change it to be static. Either way the View supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)
What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.
And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQl server is hard or even time-consuming. You just drag them over from the object browser (you can get all in one go by dragging from the word columns). To put a permanent performance hit on your system (becasue this can reduce the use of indexes and becasue sending unneeded data over the network is costly) and make it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see for instance) just to save less than a minute of development time is short-sighted and unprofessional.
Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"
If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are parsing and validating named columns is equal to the processing time if not more. Don't prematurely optimize ;-)

SQL Query theory question - single-statement vs multi-statement queries

When I write SQL queries, I find myself often thinking that "there's no way to do this with a single query". When that happens I often turn to stored procedures or multi-statement table-valued functions that use temp tables (of one sort or another) and end up simply combining the results and returning the result table.
I'm wondering if anyone knows, simply as a matter of theory, whether it should be possible to write ANY query that returns a single result set as a single query (not multiple statements). Obviously, I'm ignoring relevant points such as code readability and maintainability, maybe even query performance/efficiency. This is more about theory - can it be done... and don't worry, I certainly don't plan to start forcing myself to write a single-statement query when multi-statement will better suit my purpose in all cases, but it might make me think twice or a little bit longer on whether there is a viable way to get the result from a single query.
I guess a few parameters are in order - I'm thinking of a relational database (such as MS SQL) with tables that follow common best practices (such as all tables having a primary key and so forth).
Note: in order to win 'Accepted Answer' on this, you'll need to provide a definitive proof (reference to web material or something similar.)
I believe it is possible. I've worked with very difficult queries, very long queries, and often, it is possible to do it with a single query. But most of the time, it's harder to mantain, so if you do it with a single query, make sure you comment your query carefully.
I've never encountered something that could not be done in a single query.
But sometimes it's best to do it in more than one query.
At least with the a recent version of Oracle is absolutely possible. It has a 'model clause' which makes sql turing complete. ( http://blog.schauderhaft.de/2009/06/18/building-a-turing-engine-in-oracle-sql-using-the-model-clause/ ). Of course this is all with the usual limitation that we don't really have unlimited time and memory.
For a normal sql dialect without these abdominations I don't think it is possible.
A task that I can't see how to implement in 'normal sql' would be:
Assume a table with a single column of type integer
For every row
'take the value at the current row and go that many rows back, fetch that value, go that many rows back, and continue until you fetch the same value twice consecutively and return that as the result.'
I can't prove it, but I believe the answer is a cautious yes - provided your database design is done properly. Usually being forced to write multiple statements to get a certain result is a sign that your schema may need some improvements.
I'd say "yes" but can't prove it. However, my main thought process:
Any select should be a set based operation
Your assumption is that you are dealing with mathematically correct sets (ie normalised correctly)
Set theory should guarantee it's possible
Other thoughts:
Multiple SELECT statement often load temp tables/table variables. These can be derived or separated in CTEs.
Any RBAR processing (for good or bad) now be dealt with CROSS/OUTER APPLY onto derived tables
UDFs would be classed as "cheating" in this context I feel, because it allows you to put a SELECT into another module rather than in your single one
No writes allowed in your "before" sequence of DML: this changes state from SELECT to SELECT
Have you seen some of the code in our shop?
Edit, glossary
RBAR = Row By Agonising Row
CTE = Common Table Expression
UDF = User Defined Function
Edit: APPLY: cheating?
SELECT
*
FROM
MyTable1 t1
CROSS APPLY
(
SELECT * FROM MyTable2 t2
WHERE t1.something = t2.something
) t2
In theory yes, if you use functions or a torturous maze of OUTER APPLYs or sub-queries; however, for readability and performance, we have always ended up going with temp tables and multi-statement stored procedures.
As someone above commented, this is usually a sign that your data structure is starting to smell; not that it's bad, but that maybe it's time to denormalise for performance reasons (happens to the best of us), or maybe put a denormalised querying layer in front of your normalised "real" data.