SQL or LINQ for this functionality?

SQL or LINQ for this functionality? - sql

I'm writing an SQL script to update the database of a deployed application (add a few tables and copy/update some data) to accommodate new functionality being added to the application (.NET using C#). I've written some code in the application to handle such updates (whenever a new update is available, run its .sql) which reusable for all DB updates (simply provide a new .sql).
The problem I'm running into now is that I am fairly new to SQL, and have a pretty short timeline to get this update ready. The more complex part of the update could easily be written in the application, but for consistency's sake I'd like to do it in SQL.
What I need to do, I've summarized as the following algorithm/pseudocode:
for each category in allCategories
Select * from unrankedEntities
where Category = category
order by (score)
take top Category.cutoff
insert each of these into rankedEntities with rank = 0
Select * from unrankedEntities
where Category = category
order by (score) DESC
take top 16 - Category.cutoff
insert each of these into rankedEntities with rank = null, belowcutoff=true
This is the general algorithm I am trying to implement (with a couple more tweaks to be made after this part is complete). Would it be more efficient to write this in SQL or just write the LINQ to do it (resulting in changing the application's database update code for each DB update)? If SQL is the best choice, where can I find a good reference/tutorial covering the topics I would need to do this (I've looked through few different ones online, and some seem pretty simple, so I would appreciate a recommendation of a good one).
Additionally, if you have any tips on improving the algorithm, I'd be happy to hear your advice.
EDIT:
I forgot to mention, but (score) is not a value stored in unrankedEntities. It needs to be calclulated for each entity by counting rows in another table that match a specific condition, and mulitplying by a constant and by numbers pulled from other tables. Additionally, it needs to be stored into rankedEntities to avoid all this calculation every time it is needed (often).
EDIT:
It has been pointed out to me that parts of the new additions to the application already contain the code necessary to do all of the data manipulation, so I will go with the simple solution of recycling that code, although the result is a less tidy update-via-sql class, as it now has an update-via-sql section and update-via-code section.

When using LINQ to SQL in debug mode you can see the SQL it's generating, which you could copy into a script, if you need to hurry and not learn all about SQL right at this moment. See http://weblogs.asp.net/scottgu/archive/2007/07/31/linq-to-sql-debug-visualizer.aspx
EDIT: I didn't realize you were using Entity Framework. See this response to a similar question: How do I get the raw SQL underlying a LINQ query when using Entity Framework CTP 5 "code only"?.

LINQ is perfectly fine for this, but inserts and updates each entity in a singular fashion, meaning in SQL you could update in bulk with one query, but you cannot in LINQ... it generates separate statements for each entity. So SQL has some efficiency gains. If you are more comfortable with LINQ, do it in LINQ. If you are talking thousands of objects, you may want to turn off change tracking (turned off in EF by changing the merge option) as the internal memory of changed objects will grow rather large.

Related

Keeping dynamic out of SQL while using specifications with stored procedures

A specification essentially is a text string representing a "where" clause created by an end user.
I have stored procedures that copy a set of related tables and records to other places. The operation is always the same, but dependent on some crazy user requirements like "products that are frozen and blue and on sale on Tuesday".
What if we fed the user specification (or string parameter) to a scalar function that returned true/false which executed the specification as dynamic SQL or just exec (#variable).
It could tell us whether those records exist. We could add the result of the function to our copy products where clause.
It would keep us from recompiling the copy script each time our where clauses changed. Plus it would isolate the product selection in to a single function.
Anyone ever do anything like this or have examples? What bad things could come of it?
EDIT:
This is the specification I simply added to the end of each insert/select statement:
and exists (
select null as nothing
from SameTableAsOutsideTable inside
where inside.ID = outside.id and -- Join operations to outside table
inside.page in (6, 7) and -- Criteria 1
inside.dept in (7, 6, 2, 4) -- Criteria 2
)
It would be great to feed a parameter into a function that produces records based on the user criteria, so all that above could be something like:
and dbo.UserCriteria( #page="6,7", #dept="7,6,2,4")

Dynamic Search Conditions in T-SQL
When optimizing SQL the important thing is optimizing the access path to data (ie. index usage). This trumps code reuse, maintainability, nice formatting and just about every other development perk you can think of. This is because a bad access path will cause the query to perform hundreds of times slower than it should. The article linked sums up very well all the options you have, and your envisioned function is nowhere on the radar. Your options will gravitate around dynamic SQL or very complicated static queries. I'm afraid there is no free lunch on this topic.

It doesn't sound like a very good idea to me. Even supposing that you had proper defensive coding to avoid SQL injection attacks it's not going to really buy you anything. The code still needs to be "compiled" each time.
Also, it's pretty much always a bad idea to let users create free-form WHERE clauses. Users are pretty good at finding new and innovative ways to bring a server to a grinding halt.
If you or your users or someone else in the business can't come up with some concrete search requirements then it's likely that someone isn't thinking about it hard enough and doesn't really know what they want. You can have pretty versatile search capabilities without letting the users completely loose on the system. Alternatively, look at some of the BI tools out there and consider creating a data mart where they can do these kinds of ad hoc searches.

How about this:
You create another store procedure (instead of function) and pass the right condition to it.
Based on that condition it dumps the record ids to a temp table.
Next you move procedure will read ids from that table and do the needful things?
Or you could create a user function that returns a table which is nothing but the ids of the records that matches your criteria (dynamic)
If I am totally off, then please clarify me.
Hope this helps.

If you are forced to use dynamic queries and you don't have any solid and predefined search requirements, it is strongly recommended to use sp_executesql instead of EXEC . It provides parametrized queries to prevent SQL Injection attacks (to some extent) and It makes use of execution plans to speed up performance. (More info)

Can select * usage ever be justified?

I've always preached to my developers that SELECT * is evil and should be avoided like the plague.
Are there any cases where it can be justified?
I'm not talking about COUNT(*) - which most optimizers can figure out.
Edit
I'm talking about production code.
And one great example I saw of this bad practice was a legacy asp application that used select * in a stored procedure, and used ADO to loop through the returned records, but got the columns by index. You can imagine what happened when a new field was added somewhere other than the end of the field list.

I'm quite happy using * in audit triggers.
In that case it can actually prove a benefit because it will ensure that if additional columns are added to the base table it will raise an error so it cannot be forgotten to deal with this in the audit trigger and/or audit table structure.
(Like dotjoe) I am also happy using it in derived tables and column table expressions. Though I habitually do it the other way round.
WITH t
AS (SELECT *,
ROW_NUMBER() OVER (ORDER BY a) AS RN
FROM foo)
SELECT a,
b,
c,
RN
FROM t;
I'm mostly familiar with SQL Server and there at least the optimiser has no problem recognising that only columns a,b,c will be required and the use of * in the inner table expression does not cause any unnecessary overhead retrieving and discarding unneeded columns.
In principle SELECT * ought to be fine in a view as well as it is the final SELECT from the view where it ought to be avoided however in SQL Server this can cause problems as it stores column metadata for views which is not automatically updated when the underlying tables change and the use of * can lead to confusing and incorrect results unless sp_refreshview is run to update this metadata.

There are many scenarios where SELECT * is the optimal solution. Running ad-hoc queries in Management Studio just to get a sense of the data you're working with. Querying tables where you don't know the column names yet because it's the first time you've worked with a new schema. Building disposable quick'n'dirty tools to do a one-time migration or data export.
I'd agree that in "proper" development, you should avoid it - but there's lots of scenarios where "proper" development isn't necessarily the optimum solution to a business problem. Rules and best practices are great, as long as you know when to break them. :)

I'll use it in production when working with CTEs. But, in this case it's not really select *, because I already specified the columns in the CTE. I just don't want to respecify in the final select.
with t as (
select a, b, c from foo
)
select t.* from t;

None that I can think of, if you are talking about live code.
People saying that it makes adding columns easier to develop (so they automatically get returned and can be used without changing the Stored procedure) have no idea about writing optimal code/sql.
I only ever use it when writing ad-hoc queries that will not get reused (finding out the structure of a table, getting some data when I am not sure what the column names are).

I think using select * in an exists clause is appropriate:
select some_field from some_table
where exists
(select * from related_table [join condition...])
Some people like to use select 1 in this case, but it's not elegant, and it doesn't buy any performance improvements (early optimization strikes again).

In production code, I'd tend to agree 100% with you.
However, I think that the * more than justifies its existence when performing ad-hoc queries.

You've gotten a number of answers to your question, but you seem to be dismissing everything that isn't parroting back what you want to hear. Still, here it is for the third (so far) time: sometimes there is no bottleneck. Sometimes performance is way better than fine. Sometimes the tables are in flux, and amending every SELECT query is just one more bit of possible inconsistency to manage. Sometimes you've got to deliver on an impossible schedule and this is the last thing you need to think about.
If you live in bullet time, sure, type in all the column names. But why stop there? Re-write your app in a schema-less dbms. Hell, write your own dbms in assembly. That'd really show 'em.

And remember if you use select * and you have a join at least one field will be sent twice (the join field). This wastes database resources and network resources for no reason.

As a tool I use it to quickly refresh my memory as to what I can possibly get back from a query. As a production level query itself .. no way.

When creating an application that deals with the database, like phpmyadmin, and you are in a page where to display a full table, in that case using SELECT * can be justified, I guess.

About the only thing that I can think of would be when developing a utility or SQL tool application that is being written to run against any database. Even here though, I would tend to query the system tables to get the table structure and then build any necessary query from that.
There was one recent place where my team used SELECT * and I think that it was ok... we have a database that exists as a facade against another database (call it DB_Data), so it is primarily made up of views against the tables in the other database. When we generate the views we actually generate the column lists, but there is one set of views in the DB_Data database that are automatically generated as rows are added to a generic look-up table (this design was in place before I got here). We wrote a DDL trigger so that when a view is created in DB_Data by this process then another view is automatically created in the facade. Since the view is always generated to exactly match the view in DB_Data and is always refreshed and kept in sync, we just used SELECT * for simplicity.
I wouldn't be surprised if most developers went their entire career without having a legitimate use for SELECT * in production code though.

I've used select * to query tables optimized for reading (denormalized, flat data). Very advantageous since the purpose of the tables were simply to support various views in the application.

How else do the developers of phpmyadmin ensure they are displaying all the fields of your DB tables?

It is conceivable you'd want to design your DB and application so that you can add a column to a table without needing to rewrite your application. If your application at least checks column names it can safely use SELECT * and treat additional columns with some appropriate default action. Sure the app could consult system catalogs (or app-specific catalogs) for column information, but in some circumstances SELECT * is syntactic sugar for doing that.
There are obvious risks to this, however, and adding the required logic to the app to make it reliable could well simply mean replicating the DB's query checks in a less suitable medium. I am not going to speculate on how the costs and benefits trade off in real life.
In practice, I stick to SELECT * for 3 cases (some mentioned in other answers:
As an ad-hoc query, entered in a SQL GUI or command line.
As the contents of an EXISTS predicate.
In an application that dealt with generic tables without needing to know what they mean (e.g. a dumper, or differ).

Yes, but only in situations where the intention is to actually get all the columns from a table not because you want all the columns that a table currently has.
For example, in one system that I worked on we had UDFs (User Defined Fields) where the user could pick the fields they wanted on the report, the order as well as filtering. When building a result set it made more sense to simply "select *" from the temporary tables that I was building instead of having to keep track of which columns were active.

I have several times needed to display data from a table whose column names were unknown. So I did SELECT * and got the column names at run time.
I was handed a legacy app where a table had 200 columns and a view had 300. The risk exposure from SELECT * would have been no worse than from listing all 300 columns explicitly.

Depends on the context of the production software.
If you are writing a simple data access layer for a table management tool where the user will be selecting tables and viewing results in a grid, then it would seem *SELECT ** is fine.
In other words, if you choose to handle "selection of fields" through some other means (as in automatic or user-specified filters after retrieving the resultset) then it seems just fine.
If on the other hand we are talking about some sort of enterprise software with business rules, a defined schema, etc. ... then I agree that *SELECT ** is a bad idea.
EDIT: Oh and when the source table is a stored procedure for a trigger or view, "*SELECT **" should be fine because you're managing the resultset through other means (the view's definition or the stored proc's resultset).

Select * in production code is justifiable any time that:
it isn't a performance bottleneck
development time is critical
Why would I want the overhead of going back and having to worry about changing the relevant stored procedures, every time I add a field to the table?
Why would I even want to have to think about whether or not I've selected the right fields, when the vast majority of the time I want most of them anyway, and the vast majority of the few times I don't, something else is the bottleneck?
If I have a specific performance issue then I'll go back and fix that. Otherwise in my environment, it's just premature (and expensive) optimisation that I can do without.
Edit.. following the discussion, I guess I'd add to this:
... and where people haven't done other undesirable things like tried to access columns(i), which could break in other situations anyway :)

I know I'm very late to the party but I'll chip in that I use select * whenever I know that I'll always want all columns regardless of the column names. This may be a rather fringe case but in data warehousing, I might want to stage an entire table from a 3rd party app. My standard process for this is to drop the staging table and run
select *
into staging.aTable
from remotedb.dbo.aTable
Yes, if the schema on the remote table changes, downstream dependencies may throw errors but that's going to happen regardless.

If you want to find all the columns and want order, you can do the following (at least if you use MySQL):
SHOW COLUMNS FROM mytable FROM mydb; (1)
You can see every relevant information about all your fields. You can prevent problems with types and you can know for sure all the column names. This command is very quick, because you just ask for the structure of the table. From the results you will select all the name and will build a string like this:
"select " + fieldNames[0] + ", fieldNames[1]" + ", fieldNames[2] from mytable". (2)
If you don't want to run two separate MySQL commands because a MySQL command is expensive, you can include (1) and (2) into a stored procedure which will have the results as an OUT parameter, that way you will just call a stored procedure and every command and data generation will happen at the database server.

Has anyone written a higher level query langage (than sql) that generates sql for common tasks, on limited schemas

Sql is the standard in query languages, however it is sometime a bit verbose. I am currently writing limited query language that will make my common queries quicker to write and with a bit less mental overhead.
If you write a query over a good database schema, essentially you will be always joining over the primary key, foreign key fields so I think it should be unnecessary to have to state them each time.
So a query could look like.
select s.name, region.description from shop s
where monthly_sales.amount > 4000 and s.staff < 10
The relations would be
shop -- many to one -- region,
shop -- one to many -- monthly_sales
The sql that would be eqivilent to would be
select distinct s.name, r.description
from shop s
join region r on shop.region_id = region.region_id
join monthly_sales ms on ms.shop_id = s.shop_id
where ms.sales.amount > 4000 and s.staff < 10
(the distinct is there as you are joining to a one to many table (monthly_sales) and you are not selecting off fields from that table)
I understand that original query above may be ambiguous for certain schemas i.e if there the two relationship routes between two of the tables. However there are ways around (most) of these especially if you limit the schema allowed. Most possible schema's are not worth considering anyway.
I was just wondering if there any attempts to do something like this?
(I have seen most orm solutions to making some queries easier)
EDIT: I actually really like sql. I have used orm solutions and looked at linq. The best I have seen so far is SQLalchemy (for python). However, as far as I have seen they do not offer what I am after.

Hibernate and LinqToSQL do exactly what you want

I think you'd be better off spending your time just writing more SQL and becoming more comfortable with it. Most developers I know have gone through just this progression, where their initial exposure to SQL inspires them to bypass it entirely by writing their own ORM or set of helper classes that auto-generates the SQL for them. Usually they continue adding to it and refining it until it's just as complex (if not more so) than SQL. The results are sometimes fairly comical - I inherited one application that had classes named "And.cs" and "Or.cs", whose main functions were to add the words " AND " and " OR ", respectively, to a string.
SQL is designed to handle a wide variety of complexity. If your application's data design is simple, then the SQL to manipulate that data will be simple as well. It doesn't make much sense to use a different sort of query language for simple things, and then use SQL for the complex things, when SQL can handle both kinds of thing well.

I believe that any (decent) ORM would be of help here..

Entity SQL is slightly higher level (in places) than Transact SQL. Other than that, HQL, etc. For object-model approaches, LINQ (IQueryable<T>) is much higher level, allowing simple navigation:
var qry = from cust in db.Customers
select cust.Orders.Sum(o => o.OrderValue);
etc

Martin Fowler plumbed a whole load of energy into this and produced the Active Record pattern. I think this is what you're looking for?

Not sure if this falls in what you are looking for but I've been generating SQL dynamically from the definition of the Data Access Objects; the idea is to reflect on the class and by default assume that its name is the table name and all properties are columns. I also have search criteria objects to build the where part. The DAOs may contain lists of other DAO classes and that directs the joins.
Since you asked for something to take care of most of the repetitive SQL, this approach does it. And when it doesn't, I just fall back on handwritten SQL or stored procedures.

LINQ to SQL updates

Does any one have some idea on how to run the following statement with LINQ?
UPDATE FileEntity SET DateDeleted = GETDATE() WHERE ID IN (1,2,3)
I've come to both love and hate LINQ, but so far there's been little that hasn't worked out well. The obvious solution which I want to avoid is to enumerate all file entities and set them manually.
foreach (var file in db.FileEntities.Where(x => ids.Contains(x.ID)))
{
file.DateDeleted = DateTime.Now;
}
db.SubmitChanges();
There problem with the above code, except for the sizable overhead is that each entity has a Data field which can be rather large, so for a large update a lot of data runs cross the database connection for no particular reason. (The LINQ solution is to delay load the Data property, but this wouldn't be necessary if there was some way to just update the field with LINQ to SQL).
I'm thinking some query expression provider thing that would result in the above T-SQL...

LINQ cannot perform in store updates - it is language integrated query, not update. Most (maybe even all) OR mappers will generate a select statement to fetch the data, modify it in memory, and perform the update using a separate update statement. Smart OR mappers will only fetch the primary key until additional data is required, but then they will usually fetch the whole rest because it would be much to expensive to fetch only a single attribute at once.
If you really care about this optimization, use a stored procedure or hand-written SQL statement. If you want compacter code, you can use the following.
db.FileEntities.
Where(x => ids.Contains(x.ID)).
Select(x => x.DateDeleted = DateTime.Now; return x; );
db.SubmitChanges();
I don't like this because I find it less readable, but some prefer such an solution.

LINQ to SQL is an ORM, like any other, and as such, it was not designed to handle bulk updates/inserts/deleted. The general idea with L2S, EF, NHibernate, LLBLGen and the rest is to handle the mapping of relational data to your object graphs for you, eliminating the need to manage a large library of stored procs which, ultimately, limit your flexability and adaptability.
When it comes to bulk updates, those are best left to the thing that does them best...the database server. L2S and EF both provide the ability to map stored procedures to your model, which allows your stored procs to be somewhat entity oriented. Since your using L2S, just write a proc that takes the set of identities as input, and executes the SQL statement at the beginning of your question. Drag that stored proc onto your L2S model, and call it.
Its the best solution for the problem at hand, which is a bulk update. Like with reporting, object graphs and object-relational mapping are not the best solution for bulk processes.

Why is ORM considered good but "select *" considered bad?

Doesn't an ORM usually involve doing something like a select *?
If I have a table, MyThing, with column A, B, C, D, etc, then there typically would be an object, MyThing with properties A, B, C, D.
It would be evil if that object were incompletely instantiated by a select statement that looked like this, only fetching the A, B, not the C, D:
select A, B from MyThing /* don't get C and D, because we don't need them */
but it would also be evil to always do this:
select A, B, C, D /* get all the columns so that we can completely instantiate the MyThing object */
Does ORM make an assumption that database access is so fast now you don't have to worry about it and so you can always fetch all the columns?
Or, do you have different MyThing objects, one for each combo of columns that might happen to be in a select statement?
EDIT: Before you answer the question, please read Nicholas Piasecki's and Bill Karwin's answers. I guess I asked my question poorly because many misunderstood it, but Nicholas understood it 100%. Like him, I'm interested in other answers.
EDIT #2: Links that relate to this question:
Why do we need entity objects?
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx, especially the section "The Partial-Object Problem and the Load-Time Paradox"
http://groups.google.com/group/comp.object/browse_thread/thread/853fca22ded31c00/99f41d57f195f48b?
http://www.martinfowler.com/bliki/AnemicDomainModel.html
http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html

In my limited experience, things are as you describe--it's a messy situation and the usual cop-out "it depends" answer applies.
A good example would be the online store that I work for. It has a Brand object, and on the main page of the Web site, all of the brands that the store sells are listed on the left side. To display this menu of brands, all the site needs is the integer BrandId and the string BrandName. But the Brand object contains a whole boatload of other properties, most notably a Description property that can contain a substantially large amount of text about the Brand. No two ways about it, loading all of that extra information about the brand just to spit out its name in an unordered list is (1) measurably and significantly slow, usually because of the large text fields and (2) pretty inefficient when it comes to memory usage, building up large strings and not even looking at them before throwing them away.
One option provided by many ORMs is to lazy load a property. So we could have a Brand object returned to us, but that time-consuming and memory-wasting Description field is not until we try to invoke its get accessor. At that point, the proxy object will intercept our call and suck down the description from the database just in time. This is sometimes good enough but has burned me enough times that I personally don't recommend it:
It's easy to forget that the property is lazy-loaded, introducing a SELECT N+1 problem just by writing a foreach loop. Who knows what happens when LINQ gets involved.
What if the just-in-time database call fails because the transport got flummoxed or the network went out? I can almost guarantee that any code that is doing something as innocuous as string desc = brand.Description was not expecting that simple call to toss a DataAccessException. Now you've just crashed in a nasty and unexpected way. (Yes, I've watched my app go down hard because of just that. Learned the hard way!)
So what I've ended up doing is that in scenarios that require performance or are prone to database deadlocks, I create a separate interface that the Web site or any other program can call to get access to specific chunks of data that have had their query plans carefully examined. The architecture ends up looking kind of like this (forgive the ASCII art):
Web Site: Controller Classes
|
|---------------------------------+
| |
App Server: IDocumentService IOrderService, IInventoryService, etc
(Arrays, DataSets) (Regular OO objects, like Brand)
| |
| |
| |
Data Layer: (Raw ADO.NET returning arrays, ("Full cream" ORM like NHibernate)
DataSets, simple classes)
I used to think that this was cheating, subverting the OO object model. But in a practical sense, as long as you do this shortcut for displaying data, I think it's all right. The updates/inserts and what have you still go through the fully-hydrated, ORM-filled domain model, and that's something that happens far less frequently (in most of my cases) than displaying particular subsets of the data. ORMs like NHibernate will let you do projections, but by that point I just don't see the point of the ORM. This will probably be a stored procedure anyway, writing the ADO.NET takes two seconds.
This is just my two cents. I look forward to reading some of the other responses.

People use ORM's for greater development productivity, not for runtime performance optimization. It depends on the project whether it's more important to maximize development efficiency or runtime efficiency.
In practice, one could use the ORM for greatest productivity, and then profile the application to identify bottlenecks once you're finished. Replace ORM code with custom SQL queries only where you get the greatest bang for the buck.
SELECT * isn't bad if you typically need all the columns in a table. We can't generalize that the wildcard is always good or always bad.
edit: Re: doofledorfer's comment... Personally, I always name the columns in a query explicitly; I never use the wildcard in production code (though I use it when doing ad hoc queries). The original question is about ORMs -- in fact it's not uncommon that ORM frameworks issue a SELECT * uniformly, to populate all the fields in the corresponding object model.
Executing a SELECT * query may not necessarily indicate that you need all those columns, and it doesn't necessarily mean that you are neglectful about your code. It could be that the ORM framework is generating SQL queries to make sure all the fields are available in case you need them.

Linq to Sql, or any implementation of IQueryable, uses a syntax which ultimately puts you in control of the selected data. The definition of a query is also the definition of its result set.
This neatly avoids the select * issue by removing data shape responsibilities from the ORM.
For example, to select all columns:
from c in data.Customers
select c
To select a subset:
from c in data.Customers
select new
{
c.FirstName,
c.LastName,
c.Email
}
To select a combination:
from c in data.Customers
join o in data.Orders on c.CustomerId equals o.CustomerId
select new
{
Name = c.FirstName + " " + c.LastName,
Email = c.Email,
Date = o.DateSubmitted
}

There are two separate issues to consider.
To begin, it is quite common when using an ORM for the table and the object to have quite different "shapes", this is one reason why many ORM tools support quite complex mappings.
A good example is when a table is partially denormalised, with columns containing redundant information (often, this is done to improve query or reporting performance). When this occurs, it is more efficient for the ORM to request just the columns it requires, than to have all the extra columns brought back and ignored.
The question of why "Select *" is evil is separate, and the answer falls into two halves.
When executing "select *" the database server has no obligation to return the columns in any particular order, and in fact could reasonably return the columns in a different order every time, though almost no databases do this.
Problem is, when a typical developer observes that the columns returned seem to be in a consistent order, the assumption is made that the columns will always be in that order, and then you have code making unwarranted assumptions, just waiting to fail. Worse, that failure may not be fatal, but may simply involve, say, using Year of Birth in place of Account Balance.
The other issue with "Select *" revolves around table ownership - in many large companies, the DBA controls the schema, and makes changes as required by major systems. If your tool is executing "select *" then you only get the current columns - if the DBA has removed a redundant column that you need, you get no error, and your code may blunder ahead causing all sorts of damage. By explicitly requesting the fields you require, you ensure that your system will break rather than process the wrong information.

I am not sure why you would want a partially hydrated object. Given a class of Customer with properties of Name, Address, Id. I would want them all to create a fully populated Customer object.
The list hanging off of Customers called Orders can be lazily loaded when accessed though most ORMs. And NHibernate anyway allows you to do projections into other objects. So if you had say a simply customer list where you displayed the ID and Name, you can create an object of type CustomerListDisplay and project your HQL query into that object set and only obtain the columns you need from the database.
Friends don't let friends premature optimize. Fully hydrate your object, lazy load it's associations. And then profile your application looking for problems and optimize the problem areas.

Even ORMs need to avoid SELECT * to be effective, by using lazy loading etc.
And yes, SELECT * is generally a bad idea if you aren't consuming all the data.
So, do you have different kinds of MyThing objects, one for each column combo? – Corey Trager (Nov 15 at 0:37)
No, I have read-only digest objects (which only contain important information) for things like lookups and massive collections and convert these to fully hydrated objects on demand. – Cade Roux (Nov 15 at 1:22)

The case you describe is a great example of how ORM is not a panacea. Databases offer flexible, needs-based access to their data primarily through SQL. As a developer, I can easily and simply get all the data (SELECT *) or some of the data (SELECT COL1, COL2) as needed. My mechanism for doing this will be easily understood by any other developer taking over the project.
In order to get the same flexibility from ORM, a lot more work has to be done (either by you or the ORM developers) just to get you back to the place under the hood where you're either getting all or some of the columns from the database as needed (see the excellent answers above to get a sense of some of the problems). And all this extra stuff is just more stuff that can fail, making an ORM system intrinsically less reliable than straight SQL calls.
This is not to say that you shouldn't use ORM (my standard disclaimer is that all design choices have costs and benefits, and the choice of one or the other just depends) - knock yourself out if it works for you. I will say that I truly don't understand the popularity of ORM, given the amount of extra un-fun work it seems to create for its users. I'll stick with using SELECT * when (wait for it) I need to get every column from a table.

ORMs in general do not rely on SELECT *, but rely on better methods to find columns like defined data map files (Hibernate, variants of Hibernate, and Apache iBATIS do this). Something a bit more automatic could be set up by querying the database schema to get a list of columns and their data types for a table. How the data gets populated is specific to the particular ORM you are using, and it should be well-documented there.
It is never a good idea to select data that you do not use at all, as it can create a needless code dependency that can be obnoxious to maintain later. For dealing with data internal to the class, things are a bit more complicated.
A short rule would be to always fetch all the data that the class stores by default. In most cases, a small amount of overhead won't make a huge difference, so your main goal is to reduce maintenance overhead. Later, when you performance profiling of the code, and have reason to believe that it may benefit from adjusting the behavior, that is the time to do it.
If I saw an ORM make SELECT * statements, either visibly or under its covers, then I would look elsewhere to fulfill my database integration needs.

SELECT * is not bad. Did you ask whoever considered it to be bad "why?".

SELECT * is a strong indication you don't have design control over the scope of your application and its modules. One of the major difficulties in cleaning up someone else's work is when there is stuff in there that is for no purpose, but no indication what is needed and used, and what isn't.
Every piece of data and code in your application should be there for a purpose, and the purpose should be specified, or easily detected.
We all know, and despise, programmers who don't worry too much about why things work, they just like to try stuff until the expected things happen and close it up for the next guy. SELECT * is a really good way to do that.

If you feel the need to encapsulate everything within an object, but need something with a small subset of what is contained within a table - define your own class. Write straight sql (within or without the ORM - most allow straight sql to circumvent limitations) and populate your object with the results.
However, I'd just use the ORMs representation of a table in most situations unless profiling told me not to.

If you're using query caching select * can be good. If you're selecting a different assortment of columns every time you hit a table, it could just be getting the cached select * for all of those queries.
I think you're confusing the purpose of ORM. ORM is meant to map a domain model or similar to a table in a database or some data storage convention. It's not meant to make your application more computationally efficient or even expected to.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas