SELECT * vs specific columns & loading object properties - SQL

I always thought SELECT * was bad and that you should return only the columns you are actually going to use. One of the reasons for this is that the DB can satisfy the query from an index alone, without hitting the table at all, if the index covers every column the query needs.
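For example, here is a minimal T-SQL sketch of that covering-index behaviour (table, column, and index names are hypothetical):

-- Hypothetical product table; names are illustrative only.
CREATE TABLE dbo.Product (
    Id              INT IDENTITY PRIMARY KEY,
    CategoryId      INT            NOT NULL,
    Name            NVARCHAR(100)  NOT NULL,
    Price           DECIMAL(10, 2) NOT NULL,
    LongDescription NVARCHAR(MAX)  NULL
);

-- The INCLUDE clause makes this index "cover" the query below.
CREATE INDEX IX_Product_CategoryId
    ON dbo.Product (CategoryId)
    INCLUDE (Name, Price);

-- Served entirely from IX_Product_CategoryId (an index seek, no table access);
-- SELECT * would force a lookup back into the base table for every row.
SELECT Name, Price
FROM dbo.Product
WHERE CategoryId = 42;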
I have a factory class that loads the properties of a Product object. It loads all of the properties every time GetProduct is called.
Many of the pages won't use all of the Product properties, even though they will all be loaded from the database because of the SELECT *.
Is there any design advice/guidelines on this?

The trade-off here is between squeaking out every last bit of potential performance versus code maintainability. There is no question that bringing back columns you won't use wastes some CPU cycles. The question becomes: how many? Then you have to consider what is more expensive, your wasted CPU cycles or your programmers' time for building and maintaining the code?
If you are working on a system with huge performance requirements then it may very well pay to optimize your ORM / factory code. On the other hand, if you're building a departmental line-of-business app and you've got scores or hundreds of ORM classes, maybe you are better off keeping it simple for the programmers (and the people who have to pay for them) and not worrying about a few cycles. This becomes even more the case if you use a framework that scaffolds up most of your ORM code for you with code generation - like Entity Framework (or many others)...
If you are building your system without the use of any kind of code-generating framework, and if your data access layer is pretty close to bare-metal SQL, then only bringing back what you need is good advice. If you are building an app that is going to be used by thousands or millions of people simultaneously, then by all means tune your SQL from the outset. If, on the other hand, you work in a shop that uses ORM frameworks and RAD or agile, then hand-writing dozens of SQL statements is counterproductive.

I'd definitely avoid SELECT *. Just retrieve the data you know you'll need. I'd prefer to write a dozen queries to the same table, where each one refers to just the few columns I need for a particular purpose, rather than write one query that retrieves all the columns and just use that everywhere.
Even if you know you need every column currently in a table, list each one explicitly. That way, if someone adds half a dozen more columns to the table in the future, all your old queries won't suddenly be retrieving more data than is needed.
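A quick illustration (hypothetical table and columns):

-- Explicit list: adding columns to dbo.Customer later changes nothing here.
SELECT Id, FirstName, LastName, Email
FROM dbo.Customer;

-- SELECT * silently grows as the table grows, retrieving data nobody asked for.
SELECT *
FROM dbo.Customer;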

Related

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types: the existing MySQL database, which is tightly integrated with the current systems, and a new MongoDB database for the extended functionality. The new functionality will also rely pretty heavily on the MySQL database for environmental variables such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make the construction of queries much simpler (mainly for legibility while building; once it's finished, I can convert back to hard-coded queries): an encapsulation object that would contain the following (each item maps onto a clause of a standard SELECT, as sketched after this list):
what data is being selected (including functionally derived data)
source (including joined data; I know that joins are not a good idea for non-relational DBs, but it would be nice to have the facility just in case, and it can be rewritten into two queries later for performance)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whichever DB is executing it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning the results as an array, object, or whatever is desired.
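To make that mapping concrete, here is a sketch (hypothetical users and orders tables) showing how each piece of the proposed object corresponds to a clause of a standard SELECT:

SELECT u.id, u.name, COUNT(o.id) AS order_count   -- data being selected, incl. derived data
FROM users u
LEFT JOIN orders o ON o.user_id = u.id            -- source, incl. joined data
WHERE u.active = 1                                -- where conditions
GROUP BY u.id, u.name                             -- groupings
HAVING COUNT(o.id) > 0                            -- having conditions
ORDER BY order_count DESC                         -- orders
LIMIT 10;                                         -- limits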
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point me to it? If not, are there any resources on similar projects that might allow me to continue the work and build a basic version?
I know this is a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times, which has allowed mistakes to creep in.
I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html - it gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
As you already know, this is a hard problem, and it will be a big challenge to make it run efficiently. Maybe consider moving all data from MySQL and Mongo to a third data store that holds a copy of everything, and then running queries against that? Replicate all writes to something like Redis or Elasticsearch and then run your queries there?
Either way, good luck!

Should I be concerned that ORMs, by default, return all columns?

In my limited experience working with ORMs (so far LLBLGen Pro and Entity Framework 4), I've noticed that, inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure whether this applies to it as well, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First of all: great question, and about time someone asked this! :-)
Yes, the fact that an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main thing for me is to be aware that this is what happens: either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you only load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent table doesn't drag all those large blobs of data along (a DDL sketch follows this list)
maybe put groups of columns that are optional and might only show up in certain situations into separate tables and link them - again, just to keep the base tables lean'n'mean
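A minimal DDL sketch of the first idea (all names invented for illustration):

-- Lean base table: everyday selects never touch the blob.
CREATE TABLE dbo.Document (
    Id        INT           IDENTITY PRIMARY KEY,
    Title     NVARCHAR(200) NOT NULL,
    CreatedAt DATETIME2     NOT NULL
);

-- 1:1 side table holding the heavy column.
CREATE TABLE dbo.DocumentContent (
    DocumentId INT            PRIMARY KEY REFERENCES dbo.Document (Id),
    Content    VARBINARY(MAX) NOT NULL
);

-- The parent table stays lean'n'mean:
SELECT Id, Title, CreatedAt FROM dbo.Document;

-- The blob is only fetched when somebody actually asks for it:
SELECT Content FROM dbo.DocumentContent WHERE DocumentId = @id;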
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not its strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away though: analyse then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if the table has LOTS of columns (more than 30) or if a column holds a lot of data, for example over 5,000 characters in a field.
The approach I have used is to map another object to the existing table, but with only the fields I need. So for a search that populates a grid with 100 rows I would have a MyObjectLite, but when I click to view the details of a row I would call GetById and return a MyObject that has all the columns.
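In SQL terms, the two mappings boil down to something like this (table and column names are invented for illustration):

-- Lite query backing MyObjectLite for the 100-row search grid:
SELECT Id, Name, Status
FROM dbo.MyTable
WHERE Name LIKE @search + '%';

-- Full query backing MyObject when the user opens the details view:
SELECT Id, Name, Status, Description, Notes, CreatedAt, ModifiedAt
FROM dbo.MyTable
WHERE Id = @id;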
Another approach is to use custom SQL or stored procs, but I only think you should go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a Projection, Transformers.AliasToBean, and a DTO. Here is how it looks in the Criteria API (assuming a mapped Package entity and an open session):
var dtos = session.CreateCriteria<Package>()
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.Property("Id"), "Id")
        .Add(Projections.Property("PackageName"), "Caption"))
    .SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)))
    .List<PackageNameDTO>();
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.

New Core Data entity identical to existing one: separate entity or other solution?

Overview: I'm designing a restaurant management application, and I have an entity called Order which has Items. Since a restaurant can operate for many years, with many thousands of completed orders, and in the interest of making the networking side of my application easier and keeping the database fast, I want to introduce the concept of a ClosedOrder, which is essentially an Order that has been paid for.
I have a few options for doing this: I could add an isClosed attribute to my Order entity and perform all fetch requests for 'open' orders with a predicate, but that keeps the problem of there being lots of records for the DB to go through every time a fetch is needed, which happens often with the Order entity because of the workflow of my app. If I understand correctly, creating a 'ClosedOrder' subentity would also have the same problem, as Core Data stores all subentities in the same table in the database.
Is creating a completely separate entity foolish in this case? Or is it exactly what I need to do? I apologize for my overall lack of knowledge about database performance. Core Data is beautiful in that it abstracts away the need to learn about that, but at the same time it can't actually make it unimportant, especially in a case like this where performance could severely degrade if my users push the app far.
If you're using the SQLite store type and the "isClosed" attribute is "Indexed" (a setting in your entity editor panel), you could have hundreds of thousands of orders and still have a nice, quick fetch time when filtering for "isClosed == YES".
Creating a separate entity won't really buy you much performance, but it will make things more of a pain to maintain when they change (two migration steps for the price of one, for example). You're still storing all those items, and SQLite is a competent DB library when it's set up correctly. Use an indexed attribute here, then generate some test data, then measure the performance. I'm sure you'll be satisfied.

Basic object-relational mapping question asked by a noob

I understand that, in the interest of efficiency, when you query the database you should only return the columns that are needed and no more.
But given that I like to use objects to store the query result, this leaves me in a dilemma:
If I only retrieve the column values that I need in a particular situation, I can only partially populate the object. This, I feel, leaves my object in a non-ideal state where only some of the properties and methods are available. Later, if a situation arises where I would like to reuse the object but find that the new situation requires a different but overlapping set of columns to be returned, I am faced with a choice.
Should I reuse the existing SQL and add the additional fields required by the new situation to the list of selected columns, so that the same query and object-mapping logic can be reused for both? Or should I create another method that executes a slightly different SQL statement and populates only those object properties returned by the second query?
I strongly suspect that there is no magic answer and that it really "depends" on the situation, but I guess I am looking for general advice. In general, my approach has been either to return all columns from the queried table or to add the additional columns to the query as they are needed, but to reuse the same SQL (and mapping code) - that is, until performance becomes a concern. In general, I find that unless you are retrieving a large number of rows - and I usually am not - the cost of adding additional columns to the output does not have a noticeable effect on performance, and that the savings in development time and the simplified API that result are a good trade-off.
But how do you deal with this situation when performance does become a factor? Do you create methods like
Employees.GetPersonalInfo
Employees.GetLittleMorePersonalInfoButMinusSalary
etc, etc etc
Or do you somehow end up creating an API where the user of your API has to specify which columns/properties he wants populated/returned, thereby adding complexity and making your API less friendly/easy to use?
Let's say you want to get Employee info. How many objects would typically be involved?
1) an Employee object
2) An Employees collection object containing one Employee object for each Employee row returned
3) An object, such as EmployeeQueries, that contains methods such as "GetHiredThisWeek", which returns an Employees collection of 0 or more records.
I realize all of this is very subjective, but I am looking for suggestions on what you have found works best for you.
I would say make your application correct first, then worry about performance in this case.
You could be optimizing away your queries only to realize you won't use that query anyway. Create the most generalized queries that your entire app can use, and then as you are confident things are working properly, look for problem areas if needed.
It is likely that you won't have a great need for huge performance up front. Some people say the lazy programmers are the best programmers. Don't over-complicate things up front, make a single Employee object.
If you find a need to optimize, you'll create a method/class, or however your ORM library does it. This should be an exception to the rule; only do it if you have reason to do so.
...the cost of adding additional columns to the output does not have a noticeable effect on performance...
Right. I don't quite understand what "new situation" could arise, but either way, it would be a much better idea (IMO) to get all the columns rather than run multiple queries. There isn't much of a performance penalty for getting more columns than you need (the queries will take more RAM, but that shouldn't be a big issue; besides, hardware is cheap). Also, you'd save yourself quite a bit of development time.
As for the second part of your question, it's really up to you. As an example, Rails takes more of a "usability first, performance last" approach, but that may not be what you want. It just depends on your needs. If you're willing to sacrifice a little usability for performance, by all means, go for it. I would.
If you are using your objects in a "row at a time" CRUD-type application then, by all means, copy all the columns into your object; the extra overhead is minimal, and your object becomes truly reusable for any program wanting row access to the table.
However, if your SQL is doing a complex join or returning a large set of rows, then request precisely and only what you want. You pay two performance penalties here: one, handling each column each time eats up CPU for no benefit; and two, most DBMSs have a bag of tricks for optimising queries (such as index-only access) which can only be used if you specify precisely which columns you want.
There is no reuse issue in most of these cases, as scan/search processes tend to be very specific to a particular use case.

SQL Server Views, blessing or curse? [closed]

I once worked with an architect who banned the use of SQL views. His main reason was that views made it too easy for a thoughtless coder to needlessly involve joined tables which, if that coder tried harder, could be avoided altogether. Implicitly he was encouraging code reuse via copy-and-paste instead of encapsulation in views.
The database had nearly 600 tables and was highly normalised, so most of the useful SQL was necessarily verbose.
Several years later I can see at least one bad outcome from the ban - we have many hundreds of dense, lengthy stored procs that verge on unmaintainable.
In hindsight I would say it was a bad decision, but what are your experiences with SQL views? Have you found them bad for performance? Any other thoughts on when they are or are not appropriate?
There are some very good uses for views; I have used them a lot for tuning and for exposing less normalized sets of information, or for UNION-ing results from multiple selects into a single result set.
Obviously any programming tool can be used incorrectly, but I can't think of any times in my experience where a poorly tuned view has caused any kind of drawbacks from a performance standpoint, and the value they can provide by providing explicitly tuned selects and avoiding duplication of complex SQL code can be significant.
Incidentally, I have never been a fan of architectural "rules" that are based on keeping developers from hurting themselves. These rules often have unintended side effects: the last place I worked didn't allow using NULLs in the database, because developers might forget to check for null. This ended up forcing us to work around "1/1/1900" dates and integers defaulted to "0" in all the software built against the databases, and it introduced a litany of bugs caused by devs working around places where NULL was the appropriate value.
You've answered your own question:
he was encouraging code reuse via copy-and-paste
Reuse the code by creating a view. If the view performs poorly, it will be much easier to track down than if you have the same poorly performing code in several places.
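For instance (illustrative schema), the join then lives in exactly one place:

-- Encapsulate the join once; every caller reuses it.
CREATE VIEW dbo.OrderSummary
AS
SELECT o.Id,
       o.OrderDate,
       c.Name AS CustomerName,
       o.Total
FROM dbo.Orders o
JOIN dbo.Customers c ON c.Id = o.CustomerId;
GO

-- Consumers no longer copy-and-paste the join:
SELECT Id, CustomerName, Total
FROM dbo.OrderSummary
WHERE OrderDate >= '20100101';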
Not a big fan of views (can't remember the last time I wrote one), but I wouldn't ban them entirely either. If your database allows you to put indexes on views and not just on tables, you can often improve performance a good bit, which makes them better. If you are using views, make sure to look into indexing them.
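In SQL Server that looks roughly like the following sketch (names invented; indexed views also require SCHEMABINDING, two-part table names, COUNT_BIG for grouped views, and carry further restrictions):

-- The view must be schema-bound before it can be indexed.
CREATE VIEW dbo.SalesByProduct
WITH SCHEMABINDING
AS
SELECT ProductId,
       SUM(Quantity) AS TotalQuantity,
       COUNT_BIG(*)  AS LineCount
FROM dbo.OrderLines
GROUP BY ProductId;
GO

-- The unique clustered index is what materialises the view's result set.
CREATE UNIQUE CLUSTERED INDEX IX_SalesByProduct
    ON dbo.SalesByProduct (ProductId);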
I really only see the need for views for partitioning data and for extremely complex joins that are really critical to the application (thinking of financial reports here where starting from the same dataset for everything might be critical). I do know some reporting tools seem to prefer views over stored procs.
I am a big proponent of never returning more records or fields than you need in a specific instance, and the overuse of views tends to make people return more fields (and, in way too many cases, use too many joins) than they need, which wastes system resources.
I also tend to see that people who rely on views (not the developer of the view - the people who only use them) often don't understand the database very well (so they would get the joins wrong if not using the view) and that to me is critical to writing good code against the database. I want people to understand what they are asking the db to do, not rely on some magic black box of a view. That is all personal opinion of course, your mileage may vary.
Like BlaM I personally haven't found them easier to maintain than stored procs.
Edited in Oct 2010 to add:
Since I originally wrote this, I have had occasion to work with a couple of databases designed by people who were addicted to using views. Even worse, they used views that called views that called views (to the point where eventually we hit the limit on the number of tables that can be referenced). This was a performance nightmare. It took 8 minutes to get a simple count(*) of the records in one view, and much longer to get data. If you use views, be very wary of using views that call other views. You will be building a system that will very probably not work under normal production load. In SQL Server you can only index views that do not call other views, so what ends up happening when you use views in a chain is that the entire record set has to be built for each view, and it is not until you get to the last one that the where clause criteria are applied. You may need to generate millions of records just to see three. You may join to the same table 6 times when you really only need to join to it once, and you may return many, many more columns than you need in the final result set.
My current database was completely awash with countless small tables of no more than 5 rows each. Well, I could count them, but it was cluttered. These tables simply held constant-type values (think enum) and could very easily be combined into one table. I then made views that simulated each of the tables I deleted, to ensure backward compatibility. It worked great.
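The pattern looks something like this (names invented for illustration):

-- One combined lookup table replaces dozens of tiny constant tables.
CREATE TABLE dbo.LookupValue (
    Category NVARCHAR(50)  NOT NULL,   -- e.g. 'OrderStatus', 'UserRole'
    Code     INT           NOT NULL,
    Label    NVARCHAR(100) NOT NULL,
    PRIMARY KEY (Category, Code)
);
GO

-- One view per deleted table keeps the old code working unchanged.
CREATE VIEW dbo.OrderStatus
AS
SELECT Code AS Id, Label AS Name
FROM dbo.LookupValue
WHERE Category = 'OrderStatus';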
One thing that hasn't been mentioned thus far is use of views to provide a logical picture of the data to end users for ad hoc reporting or similar.
This has two merits:
To allow the user to see single "tables" containing the data they expect, rather than requiring relatively non-technical users to work out potentially complex joins (because the database is normalised)
It provides a means to allow some degree of ad hoc access without exposing the data or the structure to the end users.
Even with non-ad-hoc reporting, it's sometimes significantly easier to provide the reporting system with a view containing the relevant data, neatly separating production of data from presentation of same.
Like all powerful tools, views have their own dark side. However, you cannot blame views for somebody writing badly performing code. Moreover, views can limit the exposure of some columns and provide extra security.
Views are good for ad-hoc queries, the kind that a DBA does behind the scenes when he/she needs quick access to data to see what's going on with the system.
But they can be bad for production code. Part of the reason is that it's somewhat unpredictable what indexes you will need with a view, since the where clause can differ, and the view is therefore hard to tune. Also, you are generally returning a lot more data than is actually necessary for the individual queries that use the view. Each of these queries could be tightened up and tuned individually.
There are specific uses of views in cases of data partitioning that can be extremely useful, so I'm not saying they should be avoided altogether. I'm just saying that if a view can be replaced by a few stored procedures, you will be better off without the view.
We use views for all of our simple data exports to CSV files. This is simpler than writing a package and embedding the SQL within the package, which becomes cumbersome and hard to debug.
Using views, we can execute a view and see exactly what was exported - no cruft or unknowns. It greatly helps in troubleshooting problems with improper data exports and hides any complex joins behind the view. Granted, we use a very old legacy TERMS-based system that exports to SQL, so the joins are a little more complex than usual.
Some time ago I tried to maintain code that used views built from views built from views... That was a pain in the a**, so I got a little allergic to views :)
I usually prefer working with tables directly, especially for web applications where speed is a main concern. When accessing tables directly you have the chance to tweak your SQL queries to achieve the best performance. "Precompiled"/cached working plans might be one advantage of views, but in many cases just-in-time compilation, with all given parameters and where clauses taken into consideration, will result in faster processing overall.
However, that does not rule out views totally, if used adequately. For example, you can use a view that joins the "users" table with the "users_status" table to get a textual explanation for each status - if you need it. However, if you don't need the explanation: use the "users" table, not the view. As always: use your brain!
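Using that example, the view might look like this (column names assumed):

-- Use this view only when the textual status is actually needed;
-- otherwise query the users table directly.
CREATE VIEW users_with_status AS
SELECT u.id,
       u.name,
       u.status_id,
       s.description AS status_text
FROM users u
JOIN users_status s ON s.id = u.status_id;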
Views have been helpful to us in serving public web-based applications that draw from a production database. Simplified security is the primary advantage we see, since the table design in the database may combine sensitive and non-sensitive data within the same table. A stored procedure shares much of this advantage, but a view is read-only, has potential interop advantages, and is a less complex thing for junior people to implement.
This security abstraction advantage also applies when views are used for end-user ad-hoc queries; this would be less of an advantage if we had a proper, flattened, data warehouse representation of our data.
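A minimal sketch of that security pattern (hypothetical table and role names):

-- The base table mixes sensitive and non-sensitive data.
CREATE TABLE dbo.Employee (
    Id       INT            IDENTITY PRIMARY KEY,
    FullName NVARCHAR(100)  NOT NULL,
    Title    NVARCHAR(100)  NOT NULL,
    Salary   DECIMAL(10, 2) NOT NULL,  -- sensitive
    SSN      CHAR(11)       NULL       -- sensitive
);
GO

-- The public web application only ever sees the read-only view.
CREATE VIEW dbo.EmployeeDirectory
AS
SELECT Id, FullName, Title
FROM dbo.Employee;
GO

GRANT SELECT ON dbo.EmployeeDirectory TO web_app_role;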
From an application standpoint, when using an ORM it's a lot harder to execute a custom query than to do a select on a discretely mapped type (e.g., the view).
For example, if you need just 5 fields of a table that has many (say 30 or 40), an ORM framework will create an entity to represent the whole table.
That means that even though you only need a few properties of the entity, the select query generated by the ORM framework will bring back the entire entity in its full glory. A view, on the other hand, although also mapped to an entity by the ORM framework, will only bring back the data you need.
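Concretely (an illustrative sketch): map the ORM's read-only list entity to a narrow view instead of the wide table.

-- The wide table might have 30 or 40 columns; the grid needs 5.
CREATE VIEW dbo.CustomerListItem
AS
SELECT Id, Name, City, Email, CreatedAt
FROM dbo.Customer;

An entity mapped to this view hydrates just those five properties.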
Second, since ORM frameworks map entities to tables, relationships between entities are generated (and hydrated) on the client side, meaning that the query has to execute and return to the app before linking of those entities can happen at runtime within the app.
Some frameworks bypass that by returning the data from multiple linked entities in a giant select (with multiple joins), bringing in the columns of all related tables in one call. Internally the framework disassembles the giant result set and structures the logical presentation of the linked entities before returning those entities to the caller app.
The point is that views are a lifesaver for apps using an ORM. The alternative is to manually make DB calls and then manually map the resulting recordsets into usable entities/models.
While this approach is good and definitely produces results, it has lots of negative facets. Manual code... is manual: hard to maintain and cumbersome to implement, and it causes devs to worry more about the specifics of the DB provider API than about the logical domain model. Not to mention that it increases time to production (it's a lot more laborious), development and maintenance costs, the surface area for bugs, etc.
So for anyone saying views are bad, please consider the other side of things: the stuff the high-and-mighty DBAs most often have no clue about.
Let's see if I can come up with a lame analogy ...
"I don't need a phillips screwdriver. I carry a flat head and a grinder!"
Dismissing views out of hand will cause pain long term. For one, it's easier to debug and modify a single view definition than it is to ship modified code.
Views can also reduce the size of complex queries (in the same way stored procs can).
This can reduce network bandwidth for very busy databases.