Has anyone written a higher level query langage (than sql) that generates sql for common tasks, on limited schemas - sql

Sql is the standard in query languages, however it is sometime a bit verbose. I am currently writing limited query language that will make my common queries quicker to write and with a bit less mental overhead.
If you write a query over a good database schema, essentially you will be always joining over the primary key, foreign key fields so I think it should be unnecessary to have to state them each time.
So a query could look like.
select s.name, region.description from shop s
where monthly_sales.amount > 4000 and s.staff < 10
The relations would be
shop -- many to one -- region,
shop -- one to many -- monthly_sales
The sql that would be eqivilent to would be
select distinct s.name, r.description
from shop s
join region r on shop.region_id = region.region_id
join monthly_sales ms on ms.shop_id = s.shop_id
where ms.sales.amount > 4000 and s.staff < 10
(the distinct is there as you are joining to a one to many table (monthly_sales) and you are not selecting off fields from that table)
I understand that original query above may be ambiguous for certain schemas i.e if there the two relationship routes between two of the tables. However there are ways around (most) of these especially if you limit the schema allowed. Most possible schema's are not worth considering anyway.
I was just wondering if there any attempts to do something like this?
(I have seen most orm solutions to making some queries easier)
EDIT: I actually really like sql. I have used orm solutions and looked at linq. The best I have seen so far is SQLalchemy (for python). However, as far as I have seen they do not offer what I am after.

Hibernate and LinqToSQL do exactly what you want

I think you'd be better off spending your time just writing more SQL and becoming more comfortable with it. Most developers I know have gone through just this progression, where their initial exposure to SQL inspires them to bypass it entirely by writing their own ORM or set of helper classes that auto-generates the SQL for them. Usually they continue adding to it and refining it until it's just as complex (if not more so) than SQL. The results are sometimes fairly comical - I inherited one application that had classes named "And.cs" and "Or.cs", whose main functions were to add the words " AND " and " OR ", respectively, to a string.
SQL is designed to handle a wide variety of complexity. If your application's data design is simple, then the SQL to manipulate that data will be simple as well. It doesn't make much sense to use a different sort of query language for simple things, and then use SQL for the complex things, when SQL can handle both kinds of thing well.

I believe that any (decent) ORM would be of help here..

Entity SQL is slightly higher level (in places) than Transact SQL. Other than that, HQL, etc. For object-model approaches, LINQ (IQueryable<T>) is much higher level, allowing simple navigation:
var qry = from cust in db.Customers
select cust.Orders.Sum(o => o.OrderValue);
etc

Martin Fowler plumbed a whole load of energy into this and produced the Active Record pattern. I think this is what you're looking for?

Not sure if this falls in what you are looking for but I've been generating SQL dynamically from the definition of the Data Access Objects; the idea is to reflect on the class and by default assume that its name is the table name and all properties are columns. I also have search criteria objects to build the where part. The DAOs may contain lists of other DAO classes and that directs the joins.
Since you asked for something to take care of most of the repetitive SQL, this approach does it. And when it doesn't, I just fall back on handwritten SQL or stored procedures.

Related

How to represent an SQL query graphically

I have a SQL Select query with many joins between tables, I want to know which kind of diagram could represent it graphically in order to visualise the joins between tables and their types (differentiate between INNERs and LEFTs) ?
I did this simple schema to represent my query but I'm searching for a known and better type of diagram :
I believe what you're looking for is a variation of an Entity Relationship Diagram, where the different line-ends indicate the relationship type. When structuring a database, I believe this type of model is most common and easily understood.
You can use crow's foot notation for Enitity Relationship Diagram where you can specify the relationship(one to many, one to one) as well as the optionality(enitityA has exactly one of enitityB OR enitityA has 0 or 1 of enitityB). In your case optionality might represent the type of join you need to do. Why exactly you need to specify the join types in an ERD?
I'm baffled why this isn't something that exists already, and even more baffled why more developers aren't screaming their heads off demanding this. There is an undeniable need for dynamically created visual representations of sql queries. Trying to understand a single 300 line sql procudure is difficult enough without column names obfuscating their business representation using names like "AsecRemFortsKilnNumAttr".
When there are hundreds of columns in a procedure with names like this, but subtly different, like "AsecRemAgsKilnNumAttr" and "AsecRemFortsKilnNumV" as well then the task is impossible.
Anyway, here are a couple I've found: -
https://sqldep.com/sql-parser/
http://queryviz.com/
https://sourceforge.net/projects/revj/
The last one contains a Python utility that you can use to improve and extend if you know how.
Other than that the only other tool that may allow what you're looking for is Informatica Developer 10 which allows you to create a mapping from imported sql queries.
SQL queries are usually represented in tree form, structured based on how the query is parsed (logical plan) or how the query engine will execute the query (physical plan)
See abstract syntax tree (AST) https://en.wikipedia.org/wiki/Abstract_syntax_tree as an example

Natural Join -- Relational theory and SQL

This question comes from my readings of C.J Date's SQL and Relational Theory: How to Write Accurate SQL Code and looking up about joins on the internet (which includes coming across multiple posts here on NATURAL JOINs (and about SQL Server's lack of support for it))
So here is my problem...
On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred).
On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction).
Is the reconciliation of these that:
Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS.
and / or
Good/Better table design should remove/minimise the problems that natural join creates.
?
a number of points regarding your question (even if I'm afraid I'm not really answering anything you asked),
"On one hand, in relational theory, natural joins are the only joins that should happen (or at least are highly preferred)."
This seems to suggest that you interpret theory as if it proscribes against "other kinds" of joins ... That is not really true. Relational theory does not say "you cannot have antijoins", or "you should never use antijoins", or anything like that. What it DOES say, is that in the relational algebra, a set of primitive operators can be identified, in which natural join is the only "join-like" operator. All other "join-like" operators, can always be expressed equivalently in terms of the primitive operators defined. Cartesian product, for example, is a special case of a natural join (where the set of common attributes is empty), and if you want the cartesian product of two tables that do have an attribute name in common, you can address this using RENAME. Semijoin, for example, is the natural join of the first table with some projection on the second. Antijoin, for example (SEMIMINUS or NOT MATCHING in Date's book), is the relational difference between the first table and a SEMIJOIN of the two. etc. etc.
"On the other hand, in SQL it is advised against using NATURAL JOIN and instead use alternate means (e.g inner join with restriction)."
Where are such things advised ? In the SQL standard ? I don't really think so. It is important to distinguish between the SQL language per se, which is defined by an ISO standard, and some (/any) particular implementation of that language, which is built by some particular vendor. If Microsoft advises its customers to not use NJ in SQL Server 200x, then that advice has a completely different meaning than an advice by someone to not ever use NJ in SQL altogether.
"Natural joins work in true RDBMS. SQL however, fails at completely reproducing the relational model and none of the popular SQL DBMSs are true RDBMS."
While it is true that SQL per se fails to faithfully comply with relational theory, that actually has very little to do with the question of NJ.
Whether an implementation gives good performance for invocations of NJ, is a characteristic of that implementation, not of the language, or of the "degree of trueness" of the 'R' in 'RDBMS'. It is very easy to build a TRDBMS that doesn't use SQL, and that gives ridiculous execution times for NJ. The SQL language per se has everything that is needed to support NJ. If an implementation supports NJ, then NJ will work in that implementation too. Whether it gives good performance, is a characteristic of that implementation, and poor performance of some particular implementation should not be "extrapolated" to other implementations, or be seen as a characteristic of the SQL language per se.
"Good/Better table design should remove/minimise the problems that natural join creates."
Problems that natural join creates ? Controlling the columns that appear in the arguments to a join is easily done by adding explicit projections (and renames if needed) on the columns you want. Much like you also want to avoid SELECT * as much as possible, for basically the same reason ...
First, the choice between theory and being practical is a fallacy. To quote Chris Date: "the truth is that theory--at least the theory I'm talking about here, which is relational theory--is most definitely very practical indeed".
Second, consider that natural join relies on attribute naming. Please (re)read the following sections of the Accurate SQL Code book:
6.12. The Reliance on Attribute Names. Salient quote:
The operators of the relational algebra… all rely heavily on attribute
naming.
3.9. Column Naming in SQL. Salient quote:
Strong recommendation: …if two columns in SQL represent "the same kind
of information," give them the same name wherever possible. (That's
why, for example, the two supplier number columns in the
suppliers-and-parts database are both called SNO and not, say, SNO in
one table and SNUM in the other.) Conversely, if two columns represent
different kinds of information, it's usually a good idea to give them
different names.
I'd like to address #kuru kuru pa's point (a good one too) about columns being added to a table over which you have no control, such as a "web service you're consuming." It seems to me that this problem is effectively mitigated using the strategy suggested by Date in section 3.9 (referenced above): quote:
For every base table, define a view identical to that base table except possibly for some column renaming.
Make sure the set of views so defined abides by the column naming discipline described above.
Operate in terms of those views instead of the underlying base tables.
Personally, I find the "natural join considered dangerous" attitude frustrating. Not wishing to sound self-righteous but my own naming convention, which follows the guidance of ISO 11179-5 Naming and identification principles, results in schema highly suited to natural join.
Sadly, natural join perhaps won't be supported anytime soon in the DBMS product I use professionally (SQL Server): the relevant feature request on Microsoft Connect
is currently closed as "won't fix" despite currently having a respectable +38 / -2 score
has been reopened and gained a respectable 46 / -2 score
(go vote for it now :)
The main problem with the NATURAL JOIN syntax in SQL is that it is typically too verbose.
In Tutorial D syntax I can very simply write a natural join as:
R{a,b,c} JOIN S{a,c,d};
But in SQL the SELECT statement needs either derived table subqueries or a WHERE clause and aliases to achieve the same thing. That's because a single "SELECT statement" is really a non-relational, compound operator in which the component operations always happen in a predetermined order. Projection comes after joins and columns in the result of a join don't necessarily have unique names.
E.g. the above query can be written in SQL as:
SELECT DISTINCT a, b, c, d
FROM
(SELECT a,b,c FROM R) R
NATURAL JOIN
(SELECT a,c,d FROM S) S;
or:
SELECT DISTINCT R.a, R.b, R.c, S.d
FROM R,S
WHERE R.a = S.a AND R.c = S.c;
People will likely prefer the latter version because it is shorter and "simpler".
Theory versus reality...
Natural joins are not practical.
There is no such thing as a pure (i.e. practice is idetical to theory) RDBMS, as far as I know.
I think Oracle and a few others actually support support natural joins -- TSQL doesn't.
Consider the world we live in -- chances of two tables each having a column with the same name is pretty high (like maybe [name] or [id] or [date], etc.). Maybe those chances are narrowed down a bit by grouping only those tables you might actually want to join. But regardless, without a careful examination of the table structure, you won't know if a "natural join" is a good idea or not. And even if it is, at that moment, it might not be in another year when the application gets an upgrade which adds columns to certain tables, etc., or the web service you're consuming adds fields you didn't know about, etc.
I think a "pure" system would have to be one you had 100% control over at a minimum, and then also, one that would have some good validation in the alter table / create table process that would warn / prevent you from creating a new column in some table that could be "naturally" joined to some other table you might not be intending it to be join-able to.
I guess bottom-line for me would be, valuing my sanity, wanting my applications to have maximum up-time, valuing quick/clean maintenance and upgrades, etc. -- good table design in this context means not using natural joins (ever).

Strategy for avoiding a common sql development error (misleading result on join bug)

Sometimes when i'm writing moderately complex SELECT statements with a few JOINs, wrong key columns are sometimes used in the JOIN statement that still return valid-looking results.
Because the auto numbering values (especially early in development) all tend to fall in similar ranges (sub 100s or so) the SELECT sill produces some results. These results often look valid at first glance and a problem is not detected until much, much later making debugging much more difficult because familiarity with the data structures and code has staled. (Gone stale in the dev's mind.)
i just spent several hours tracking down yet another of this issue that i've run into a too many times before. i name my tables and columns carefully, write my SQL statements methodically but this is an issue i can't seem to competely avoid. It comes back and bites me for hours of productivity about twice a year on average.
My question is: Has anyone come up with a clever method for avoiding this; what i assume is probably a common SQL bug/mistake?
i have thought of trying to auto-number starting with different start values but this feels cludgy and would get ugly trying to keep such a scheme straight for data models with dozens of tables... Any better ideas?
P.S.
i am very careful and methodical in naming my tables and columns. Patient table gets PatientId column, Facility get a FacilityId etc. This issues tends to arise when there are join tables involved where the linkage takes on extra meaning such as: RelatedPatientId, ReferingPatientId, FavoriteItemId etc.
When writing long complex SELECT statements try to limit the result to one record.
For instance, assume you have this gigantic enormous awesome CMS system and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables. Your select statement joins 30 of these tables. Your result should limit your row count by using a WHERE clause.
My advice is to rather then get all this code written and generalized for all cases, break the problem up and use WHERE and limit the row count to only say a record. Check all fields, if they look ok, break it up and let your code return more rows. Only after further checking should you generalize.
It bites a lot of us who keep adding more and more joins until it seems to look ok, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!
One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.
If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.
Your column names should take care of this unless you named them all "ID". Are you writing multiple select statement using the same tables? You may want to create views for the more common ones.
If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.
You could use GUIDs as your primary keys, but it has its pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!
For IDs use TableNameID, for example for table Person, use PersonID
Use db model and look at the drawing when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.

How to create dynamic and safe queries

A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the substitution api to avoid sql injection since the query elements will depend on what the user decided to include in the query. I can't see how else to build this query other than using string append.
Secondly, the query could potentially span multiple tables. For example, if SO allows users to filter based on Users and Tags, and these probably live in two different tables, building the query gets a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
Criteria in which a column value belongs in a set of values whose cardinality is arbitrary does not need to be dynamic. Consider using either the instr function or the use of a special filtering table in which you join against. This approach can be easily extended to multiple columns as long as the number of columns is known. Filtering on users and tags could easily be handled with this approach.
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility.
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task. This is mostly about escaping SQL string delimiters in the values to filter for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
If you were using python to access your database, I would suggest you use the Django model system. There are many similar apis both for python and for other languages (notably in ruby on rails). I am saving so much time by avoiding the need to talk directly to the database with SQL.
From the example link:
#Model definition
class Blog(models.Model):
name = models.CharField(max_length=100)
tagline = models.TextField()
def __unicode__(self):
return self.name
Model usage (this is effectively an insert statement)
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries get much more complex - you pass around a query object and you can add filters / sort elements to it. When you finally are ready to use the query, Django creates an SQL statment that reflects all the ways you adjusted the query object. I think that it is very cute.
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (Postgresql, Mysql, sql lite, etc)
DJango analyses your templates and creates an automatic admin site out of them.
Well the options have to map to something.
A SQL query string CONCAT isn't a problem if you still use parameters for the options.

Why is ORM considered good but "select *" considered bad?

Doesn't an ORM usually involve doing something like a select *?
If I have a table, MyThing, with column A, B, C, D, etc, then there typically would be an object, MyThing with properties A, B, C, D.
It would be evil if that object were incompletely instantiated by a select statement that looked like this, only fetching the A, B, not the C, D:
select A, B from MyThing /* don't get C and D, because we don't need them */
but it would also be evil to always do this:
select A, B, C, D /* get all the columns so that we can completely instantiate the MyThing object */
Does ORM make an assumption that database access is so fast now you don't have to worry about it and so you can always fetch all the columns?
Or, do you have different MyThing objects, one for each combo of columns that might happen to be in a select statement?
EDIT: Before you answer the question, please read Nicholas Piasecki's and Bill Karwin's answers. I guess I asked my question poorly because many misunderstood it, but Nicholas understood it 100%. Like him, I'm interested in other answers.
EDIT #2: Links that relate to this question:
Why do we need entity objects?
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx, especially the section "The Partial-Object Problem and the Load-Time Paradox"
http://groups.google.com/group/comp.object/browse_thread/thread/853fca22ded31c00/99f41d57f195f48b?
http://www.martinfowler.com/bliki/AnemicDomainModel.html
http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html
In my limited experience, things are as you describe--it's a messy situation and the usual cop-out "it depends" answer applies.
A good example would be the online store that I work for. It has a Brand object, and on the main page of the Web site, all of the brands that the store sells are listed on the left side. To display this menu of brands, all the site needs is the integer BrandId and the string BrandName. But the Brand object contains a whole boatload of other properties, most notably a Description property that can contain a substantially large amount of text about the Brand. No two ways about it, loading all of that extra information about the brand just to spit out its name in an unordered list is (1) measurably and significantly slow, usually because of the large text fields and (2) pretty inefficient when it comes to memory usage, building up large strings and not even looking at them before throwing them away.
One option provided by many ORMs is to lazy load a property. So we could have a Brand object returned to us, but that time-consuming and memory-wasting Description field is not until we try to invoke its get accessor. At that point, the proxy object will intercept our call and suck down the description from the database just in time. This is sometimes good enough but has burned me enough times that I personally don't recommend it:
It's easy to forget that the property is lazy-loaded, introducing a SELECT N+1 problem just by writing a foreach loop. Who knows what happens when LINQ gets involved.
What if the just-in-time database call fails because the transport got flummoxed or the network went out? I can almost guarantee that any code that is doing something as innocuous as string desc = brand.Description was not expecting that simple call to toss a DataAccessException. Now you've just crashed in a nasty and unexpected way. (Yes, I've watched my app go down hard because of just that. Learned the hard way!)
So what I've ended up doing is that in scenarios that require performance or are prone to database deadlocks, I create a separate interface that the Web site or any other program can call to get access to specific chunks of data that have had their query plans carefully examined. The architecture ends up looking kind of like this (forgive the ASCII art):
Web Site: Controller Classes
|
|---------------------------------+
| |
App Server: IDocumentService IOrderService, IInventoryService, etc
(Arrays, DataSets) (Regular OO objects, like Brand)
| |
| |
| |
Data Layer: (Raw ADO.NET returning arrays, ("Full cream" ORM like NHibernate)
DataSets, simple classes)
I used to think that this was cheating, subverting the OO object model. But in a practical sense, as long as you do this shortcut for displaying data, I think it's all right. The updates/inserts and what have you still go through the fully-hydrated, ORM-filled domain model, and that's something that happens far less frequently (in most of my cases) than displaying particular subsets of the data. ORMs like NHibernate will let you do projections, but by that point I just don't see the point of the ORM. This will probably be a stored procedure anyway, writing the ADO.NET takes two seconds.
This is just my two cents. I look forward to reading some of the other responses.
People use ORM's for greater development productivity, not for runtime performance optimization. It depends on the project whether it's more important to maximize development efficiency or runtime efficiency.
In practice, one could use the ORM for greatest productivity, and then profile the application to identify bottlenecks once you're finished. Replace ORM code with custom SQL queries only where you get the greatest bang for the buck.
SELECT * isn't bad if you typically need all the columns in a table. We can't generalize that the wildcard is always good or always bad.
edit: Re: doofledorfer's comment... Personally, I always name the columns in a query explicitly; I never use the wildcard in production code (though I use it when doing ad hoc queries). The original question is about ORMs -- in fact it's not uncommon that ORM frameworks issue a SELECT * uniformly, to populate all the fields in the corresponding object model.
Executing a SELECT * query may not necessarily indicate that you need all those columns, and it doesn't necessarily mean that you are neglectful about your code. It could be that the ORM framework is generating SQL queries to make sure all the fields are available in case you need them.
Linq to Sql, or any implementation of IQueryable, uses a syntax which ultimately puts you in control of the selected data. The definition of a query is also the definition of its result set.
This neatly avoids the select * issue by removing data shape responsibilities from the ORM.
For example, to select all columns:
from c in data.Customers
select c
To select a subset:
from c in data.Customers
select new
{
c.FirstName,
c.LastName,
c.Email
}
To select a combination:
from c in data.Customers
join o in data.Orders on c.CustomerId equals o.CustomerId
select new
{
Name = c.FirstName + " " + c.LastName,
Email = c.Email,
Date = o.DateSubmitted
}
There are two separate issues to consider.
To begin, it is quite common when using an ORM for the table and the object to have quite different "shapes", this is one reason why many ORM tools support quite complex mappings.
A good example is when a table is partially denormalised, with columns containing redundant information (often, this is done to improve query or reporting performance). When this occurs, it is more efficient for the ORM to request just the columns it requires, than to have all the extra columns brought back and ignored.
The question of why "Select *" is evil is separate, and the answer falls into two halves.
When executing "select *" the database server has no obligation to return the columns in any particular order, and in fact could reasonably return the columns in a different order every time, though almost no databases do this.
Problem is, when a typical developer observes that the columns returned seem to be in a consistent order, the assumption is made that the columns will always be in that order, and then you have code making unwarranted assumptions, just waiting to fail. Worse, that failure may not be fatal, but may simply involve, say, using Year of Birth in place of Account Balance.
The other issue with "Select *" revolves around table ownership - in many large companies, the DBA controls the schema, and makes changes as required by major systems. If your tool is executing "select *" then you only get the current columns - if the DBA has removed a redundant column that you need, you get no error, and your code may blunder ahead causing all sorts of damage. By explicitly requesting the fields you require, you ensure that your system will break rather than process the wrong information.
I am not sure why you would want a partially hydrated object. Given a class of Customer with properties of Name, Address, Id. I would want them all to create a fully populated Customer object.
The list hanging off of Customers called Orders can be lazily loaded when accessed though most ORMs. And NHibernate anyway allows you to do projections into other objects. So if you had say a simply customer list where you displayed the ID and Name, you can create an object of type CustomerListDisplay and project your HQL query into that object set and only obtain the columns you need from the database.
Friends don't let friends premature optimize. Fully hydrate your object, lazy load it's associations. And then profile your application looking for problems and optimize the problem areas.
Even ORMs need to avoid SELECT * to be effective, by using lazy loading etc.
And yes, SELECT * is generally a bad idea if you aren't consuming all the data.
So, do you have different kinds of MyThing objects, one for each column combo? – Corey Trager (Nov 15 at 0:37)
No, I have read-only digest objects (which only contain important information) for things like lookups and massive collections and convert these to fully hydrated objects on demand. – Cade Roux (Nov 15 at 1:22)
The case you describe is a great example of how ORM is not a panacea. Databases offer flexible, needs-based access to their data primarily through SQL. As a developer, I can easily and simply get all the data (SELECT *) or some of the data (SELECT COL1, COL2) as needed. My mechanism for doing this will be easily understood by any other developer taking over the project.
In order to get the same flexibility from ORM, a lot more work has to be done (either by you or the ORM developers) just to get you back to the place under the hood where you're either getting all or some of the columns from the database as needed (see the excellent answers above to get a sense of some of the problems). And all this extra stuff is just more stuff that can fail, making an ORM system intrinsically less reliable than straight SQL calls.
This is not to say that you shouldn't use ORM (my standard disclaimer is that all design choices have costs and benefits, and the choice of one or the other just depends) - knock yourself out if it works for you. I will say that I truly don't understand the popularity of ORM, given the amount of extra un-fun work it seems to create for its users. I'll stick with using SELECT * when (wait for it) I need to get every column from a table.
ORMs in general do not rely on SELECT *, but rely on better methods to find columns like defined data map files (Hibernate, variants of Hibernate, and Apache iBATIS do this). Something a bit more automatic could be set up by querying the database schema to get a list of columns and their data types for a table. How the data gets populated is specific to the particular ORM you are using, and it should be well-documented there.
It is never a good idea to select data that you do not use at all, as it can create a needless code dependency that can be obnoxious to maintain later. For dealing with data internal to the class, things are a bit more complicated.
A short rule would be to always fetch all the data that the class stores by default. In most cases, a small amount of overhead won't make a huge difference, so your main goal is to reduce maintenance overhead. Later, when you performance profiling of the code, and have reason to believe that it may benefit from adjusting the behavior, that is the time to do it.
If I saw an ORM make SELECT * statements, either visibly or under its covers, then I would look elsewhere to fulfill my database integration needs.
SELECT * is not bad. Did you ask whoever considered it to be bad "why?".
SELECT * is a strong indication you don't have design control over the scope of your application and its modules. One of the major difficulties in cleaning up someone else's work is when there is stuff in there that is for no purpose, but no indication what is needed and used, and what isn't.
Every piece of data and code in your application should be there for a purpose, and the purpose should be specified, or easily detected.
We all know, and despise, programmers who don't worry too much about why things work, they just like to try stuff until the expected things happen and close it up for the next guy. SELECT * is a really good way to do that.
If you feel the need to encapsulate everything within an object, but need something with a small subset of what is contained within a table - define your own class. Write straight sql (within or without the ORM - most allow straight sql to circumvent limitations) and populate your object with the results.
However, I'd just use the ORMs representation of a table in most situations unless profiling told me not to.
If you're using query caching select * can be good. If you're selecting a different assortment of columns every time you hit a table, it could just be getting the cached select * for all of those queries.
I think you're confusing the purpose of ORM. ORM is meant to map a domain model or similar to a table in a database or some data storage convention. It's not meant to make your application more computationally efficient or even expected to.