ElasticSearch Mappings & Related Objects - lucene

Excuse the potential n00bness of this question - still trying to get my head around this non-relational NoSQL stuff.
I've been super impressed with the performance and simplicity of ElasicSearch, but I've got a mapping (borderline NoSQL theroy) question to answer before I dive too deeply into the implementation.
Lets continue to use the Twitter examples ElasticSearch have in their documentation.
Basically, we know a tweet belongs to an user, and a user has many tweets.
The objects look something like this:
user = {'screen_name':'d2kagw', 'id_str':'1234567890', 'favourites_count':'15', ...}
tweet = {'message':'lorem lipsum...', 'user_id_str':'1234567890', ...}
What I'm wondering is, can the tweet object have a reference to the user object?
Since I want to be able to write queries like:
{'query': {
'term':{'message':'lipsum'},
'range':{'user.favourites_count':{'from':10, 'to':30'}}
}}
Which I would like to return the tweets matching with the user objects as part of the response (vs. having to lazy load them later).
Am I asking too much of it?
Should I be expected to throw all the user data into the tweet object if I want to query the data in that way?
In my implementation (doesn't use twitter, this was just an elegant example) I need to have the two datasets as different indexes due to the various ways I have to query the data, so I'm not sure if I can use an object type AND have the index structure I require.
Thanks in advance for your help.

ElasticSearch doesn't really support table joins that we are so used to in SQL world. The closest it gets to it is Has Child Query that allows limiting results in one table based on a persence of a record in another table and even here it's limited to 1-to-many (parent-children) relationship.
So, a common approach in this world would be to denormalize everything and query one index at a time.

Related

Should I use Eloquent ORM or create big joins by Fluent?

Well, I'm using Eloquent ORM for a project that I'm developing, but it is bugging me with the performance issue. When I use only its own methods, I can see by its query log that it creates a lot of queries.
I'm trying to fetch data from a main table with 4 other tables, one related to it one-to-one and the others many-to-many. Eloquent creates about 6-7 queries for it, and that makes me afraid of performance issues. Then, I remove Eloquent's methods and create jumbo queries with Fluent, using lots of joins, which makes me lose code readability and practicity.
What I really need to know is: Does Eloquent sacrifice performance? Should I stick to it, or use just Fluent? And what is better, a few big joined queries or more small ones?
I'm going to extend Sebastian's answer.
I too have many to many relationships or even one to many relationships.
I have actually melded Eloquent's style of programming (its easier on the eyes) with a bit of a joint hack with Fluent. Please be reminded that Eloquent is an extension of Fluent so your not sacrificing unless you are doing bat queries.
If you do a User and then Phone model with a One to one or one to many (a user can have many phone number)
and you simply where()->get() and then $users->phone - this will make eloquent run a select * for each ID. This is where Eager Loading (as referenced by Sebastian but too short to actually explain) is used where it prefetch all the IDs required and eager load the IDs (you can verify this by running a query log profiler).
The added bonus of this is that you can eagerload many relationships like this.
So its not cut dry solution of "is Eloquent providing a performance hit" if you dont use it the right way.
Now here is a small example to how I put both eloquent and fluent to use:
Within Book Model - I have defined a Scope function which is a relationship function:
public function scopeLicensorStatus($query, $licensor_status)
{
$query->select('book.*')
->leftJoin('licensors as l', 'l.id', '=', 'book.licensor_id')
->where('l.status','=',$licensor_status);
}
$bookData = Book::
->LicensorStatus('active')
->where('book.status','=', 'active')
->whereIN('book.id',$recommendedIds)
->take($limit)
->skip($offset)
->get();
what does this do is do the Join for me as a function and let me chain up the commands fro the outside. In the end (if you do toSQL() instead of get()) you will achieve a single query that will match raw SQL, however as you can see a) the code is reusable if you forsee to reuse the join with other constraints, b) your not sacrificing speed since the end game query is a single one (just need to write it properly), c) looks nicer and readable which is why we like eloquent.
Hope this answer helps you to dive a bit more into eloquent

Most efficient method for persisting complex types with variable schemas in SQL

What I'm doing
I am creating an SQL table that will provide the back-end storage mechanism for complex-typed objects. I am trying to determine how to accomplish this with the best performance. I need to be able to query on each individual simple type value of the complex type (e.g. the String value of a City in an Address complex type).
I was originally thinking that I could store the complex type values in one record as an XML, but now I am concerned about the search performance of this design. I need to be able to create variable schemas on the fly without changing anything about the database access layer.
Where I'm at now
Right now I am thinking to create the following tables.
TABLE: Schemas
COLUMN NAME DATA TYPE
SchemaId uniqueidentifier
Xsd xml //contains the schema for the document of the given complex type
DeserializeType varchar(200) //The Full Type name of the C# class to which the document deserializes.
TABLE: Documents
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
SchemaId uniqueidentifier
TABLE: Values //The DocumentId+ValueXPath function as a PK
COLUMN NAME DATA TYPE
DocumentId uniqueidentifier
ValueXPath varchar(250)
Value text
from these tables, when performing queries I would do a series of self-joins on the value table. When I want to get the entire object by the DocumentId, I would have a generic script for creating a view mimics a denormalized datatable of the complex-type.
What I want to know
I believe there are better ways to accomplish what I am trying to, but I am a little too ignorant about the relative performance benefits of different SQL techniques. Specifically I don't know the performance cost of:
1 - comparing the value of a text field versus of a varchar field.
2 - different kind of joins versus nested queries
3 - getting a view versus an xml document from the sql db
4 - doing some other things that I don't even know I don't know would be affecting my query but, I am experienced enough to know exist
I would appreciate any information or resources about these performance issues in sql as well as a recommendation for how to approach this general issue in a more efficient way.
For Example,
Here's an example of what I am currently planning on doing.
I have a C# class Address which looks like
public class Address{
string Line1 {get;set;}
string Line2 {get;set;}
string City {get;set;}
string State {get;set;}
string Zip {get;set;
}
An instance is constructed from new Address{Line1="17 Mulberry Street", Line2="Apt C", City="New York", State="NY", Zip="10001"}
its XML value would be look like.
<Address>
<Line1>17 Mulberry Street</Line1>
<Line2>Apt C</Line2>
<City>New York</City>
<State>NY</State>
<Zip>10001</Zip>
</Address>
Using the db-schema from above I would have a single record in the Schemas table with an XSD definition of the address xml schema. This instance would have a uniqueidentifier (PK of the Documents table) which is assigned to the SchemaId of the Address record in the Schemas table. There would then be five records in the Values table to represent this Address.
They would look like:
DocumentId ValueXPath Value
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line1 17 Mulberry Street
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Line2 Apt C
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/City New York
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/State NY
82415E8A-8D95-4bb3-9E5C-AA4365850C70 /Address/Zip 10001
Just Added a Bounty...
My objective is to obtain the resources I need in order to give my application a data access layer that is fully searchable and has a data-schema generated from the application layer that does not require direct database configuration (i.e. creating a new SQL table) in order to add a new aggregate root to the domain model.
I am open to the possibility of using .NET compatible technologies other than SQL, but I will require that any such suggestions be adequately substantiated in order to be considered.
How about looking for a solution at the architectural level? I was also breaking my head on complex graphs and performance until I discovered CQRS.
[start evangelist mode]
You can go document-based or relational as storage. Even both! (Event Sourcing)
Nice separation of concerns: Read Model vs Write Model
Have your cake and eat it too!
Ok, there is an initial learning / technical curve to get over ;)
[end evangelist mode]
As you stated: "I need to be able to create variable schemas on the fly without changing anything about the database access layer." The key benefit is that your read model can be very fast since it's made for reading. If you add Event Sourcing to the mix, you can drop and rebuild your Read Model to whatever schema you want... even "online".
There are some nice opensource frameworks out there like nServiceBus which saves lots of time and technical challenges. All depends on how far you want to take these concepts what you're willing/can spend time on. You can even start with just basics if you follow Greg Young's approach. See the info in the links below.
See
CQRS Examples and Screencasts
CQRS Questions
Intro (Also see the video)
Somehow what you want sounds like a painful thing to do in SQL. Basically, you should treat the inside of a text field as opaque as when querying an SQL database. Text fields were not made for efficient queries.
If you just want to store serialized objects in a text field, that is fine. But do not try to build queries that look inside the text field to find objects.
Your idea sounds like you want to perform some joins, XML parsing, and XPath application to get to a value. This doesn't strike me as the most efficient thing to do.
So, my advise:
Either just store serialized objects in the db, and do nothing more than load them and perform all other operations in memory
Or, if you need to query complex data structures, you may really want to look into document stores/databases like CouchDB or MongoDB; you can also check Wikipedia on the subject. There are even databases specifically designed for storing XML, even though I personally don't like them very much.
Addendum, per your explanations above
Simply put, don't go over the top with this thing:
If you just want to persist C#/.NET objects, just use the XML Serialization already built into the framework, a single table and be done with it.
If you, for some reason, need to store complex XML, use a dedicated XML store
If you have a fixed database schema, but it is too complex for efficient queries, use a Document Store in memory where you keep a denormalized version of your data for faster queries (or just simplify your database schema)
If you don't really need a fixed schema, use just a Document Store, and forget about having any "schema definition" at all
As for your solution, yes, it could work somehow. As could a plain SQL schema if you set it up right. But for applying an XPath, you'll probably parse the whole XML document each time you access a record, which wouldn't be very efficient to begin with.
If you want to check out Document databases, there are .NET drivers for CouchDB and MongoDB. The eXist XML database offers a number of Web protocols, and you can probably create a client class easily with VisualStudio's point-and-shoot interface. Or just google for someone who already did.
I need to be able to create variable
schemas on the fly without changing
anything about the database access
layer.
You are re-implementing the RDBMS within an RDBMS. The DB can do this already - that is what the DDL statements like create table and create schema are for....
I suggest you look into "schemas" and SQL security. There is no reason with the correct security setup you cannot allow your users to create their own tables to store document attributes in, or even generate them automatically.
Edit:
Slightly longer answer, if you don't have full requirements immediately, I would store the data as XML data type, and query them using XPath queries. This will be OK for occasional queries over smallish numbers of rows (fewer than a few thousand, certainly).
Also, your RDBMS may support indexes over XML, which may be another way of solving your problem. CREATE XML INDEX in SqlServer 2008 for example.
However for frequent queries, you can use triggers or materialized views to create copies of relevant data in table format, so more intensive reports can be speeded up by querying the breakout tables.
I don't know your requirements, but if you are responsible for creating the reports/queries yourself, this may be an approach to use. If you need to enable users to create their own reports that's a bigger mountain to climb.
I guess what i am saying is "are you sure you need to do this and XML can't just do the job".
In part, it will depend of your DB Engine. You're using SQL Server, don't you?
Answering your topics:
1 - Comparing the value of a text field versus of a varchar field: if you're comparing two db fields, varchar fields are smarter. Nvarchar(max) stores data in unicode with 2*l+2 bytes, where "l" is the lengh. For performance issues, you will need consider how much larger tables will be, for selecting the best way to index (or not) your table fields. See the topic.
2 - Sometimes nested queries are easily created and executed, also serving as a way to reduce query time. But, depending of the complexity, would be better to use different kind of joins. The best way is try to do in both ways. Execute two or more times each query, for the DB engine "compiles" a query on first executing, then the subsequent are quite faster. Measure the times for different parameters and choose the best option.
"Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row." - link
3- There's no much information in this question, but if you will get the xml document directly from the table, would be a good idea insted a view. Again, it will depends of the view and the document.
4- Other issues is about the total records expected for your table; the indexing of the columns, in wich you need to consider sorting, joining, filtering, PK's and FK's. Each situation could demmand different aproaches. My sugestion is to invest some time reading about your database engine and queries functioning and relating to your system.
I hope I've helped.
Interesting question.
I think you may be asking the wrong question here. Broadly speaking, as long as you have a FULLTEXT index on your text field, queries will be fast. Much faster than varchar if you have to use wild cards, for instance.
However, if I were you, I'd concentrate on the actual queries you're going to be running. Do you need boolean operators? Wildcards? Numerical comparisons? That's where I think you will encounter the real performance worries.
I would imagine you would need queries like:
"find all addresses in the states of New York, New Jersey and Pennsylvania"
"find all addresses between house numbers 1 and 100 on Mulberry Street"
"find all addresses where the zipcode is missing, and the city is New York"
At a high level, the solution you propose is to store your XML somewhere, and then de-normalize that XML into name/value pairs for querying.
Name/value pairs have a long and proud history, but become unwieldy in complex query situations, because you're not using the built-in optimizations and concepts of the relational database model.
Some refinements I'd recommend is to look at the domain model, and at least see if you can factor out separate data types into the "value" column; you might end up with "textValue", "moneyValue", "integerValue" and "dateValue". In the example you give, you might factor "address 1" into "housenumber" (as an integer) and "streetname".
Having said all this - I don't think there's a better solution other than completely changing tack to a document-focused database.

Schema less SQL database table - practical compromise

This question is an attempt to find a practical solution for this question.
I need a semi-schema less design for my SQL database. However, I can limit the flexibility to shoehorn it into the entire SQL paradigm. Moving to schema less databases might be an option in the future but right now, I' stuck with SQL.
I have a table in a SQL database (let's call it Foo). When an row is added to this, it needs to be able to store an arbitrary number of "meta" fields with this. An example would be the ability to attach arbitrary metadata like tags, collaborators etc. All the fields are optional but the problem is that they're of different types. Some might be numeric, some might be textual etc.
A simple design linking Foo to a table of OptionalValues with fields like name, value_type, value_string, value_int, value_date etc. seems direct although it descends into the whole EAV model which Alex mentions on that last answer and it looks quite wasteful. Also, I imagine queries out of this when it grows will be quite slow. I don't expect to search or sort by anything in this table though. All I need is that when I get a row out of Foo, these extra attributes should be obtainable as well.
Are there any best practices for implementing this kind of a setup in a SQL database or am I simply looking at the whole thing wrongly?
Add a string column "Metafields" to your table "Foo" and store your metadata there as an XML or JSON string.

DB Design question: Tree (one table) vs. Two tables for tweets and retweets?

I've heard that on stackoverflow questions and answers are stored in the same DB table.
If you were to build a twitter like service that only would allow 1 level of commenting. ie 1 tweet and then comments/replies to that tweet but no re-comments or re-replies.
would you use two tables for tweets and retweets? or just one table where the field parent_tweet_id is optional?
I know this is an open question, but what are some advantages of either solutions?
Retweets are still normal tweets as well. So one table. You wouldn't want to have to load from two tables to include the retweets.
Advantages of one table:
You can search through all tweets and comments in a simple way.
You can use one identity column easily for all posts.
Every post has the same set of columns.
Advantages of two tables:
If it's more common to search or display only top-level tweets instead of tweets + comments, the table of tweets is that much smaller without comments.
Two tables can have different sets of columns, so if there are columns meaningful for one type of post but not the other, you can put these columns in the respective table without having to leave them null when not applicable.
Indexes can also be different on two tables, so if you have the need to search comments in different ways, you can make indexes specialized to that task.
In short, it depends on how you use the data, not only how it's structured. You haven't said much about the operations you need to do with the data.
Like all design questions, it depends.
I don't normally like to mix concepts in a single table. I find it can quickly damage the conceptual integrity of the database schema. For example, I would not put posts and replies in the same table because they are different entities.

Why is ORM considered good but "select *" considered bad?

Doesn't an ORM usually involve doing something like a select *?
If I have a table, MyThing, with column A, B, C, D, etc, then there typically would be an object, MyThing with properties A, B, C, D.
It would be evil if that object were incompletely instantiated by a select statement that looked like this, only fetching the A, B, not the C, D:
select A, B from MyThing /* don't get C and D, because we don't need them */
but it would also be evil to always do this:
select A, B, C, D /* get all the columns so that we can completely instantiate the MyThing object */
Does ORM make an assumption that database access is so fast now you don't have to worry about it and so you can always fetch all the columns?
Or, do you have different MyThing objects, one for each combo of columns that might happen to be in a select statement?
EDIT: Before you answer the question, please read Nicholas Piasecki's and Bill Karwin's answers. I guess I asked my question poorly because many misunderstood it, but Nicholas understood it 100%. Like him, I'm interested in other answers.
EDIT #2: Links that relate to this question:
Why do we need entity objects?
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx, especially the section "The Partial-Object Problem and the Load-Time Paradox"
http://groups.google.com/group/comp.object/browse_thread/thread/853fca22ded31c00/99f41d57f195f48b?
http://www.martinfowler.com/bliki/AnemicDomainModel.html
http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html
In my limited experience, things are as you describe--it's a messy situation and the usual cop-out "it depends" answer applies.
A good example would be the online store that I work for. It has a Brand object, and on the main page of the Web site, all of the brands that the store sells are listed on the left side. To display this menu of brands, all the site needs is the integer BrandId and the string BrandName. But the Brand object contains a whole boatload of other properties, most notably a Description property that can contain a substantially large amount of text about the Brand. No two ways about it, loading all of that extra information about the brand just to spit out its name in an unordered list is (1) measurably and significantly slow, usually because of the large text fields and (2) pretty inefficient when it comes to memory usage, building up large strings and not even looking at them before throwing them away.
One option provided by many ORMs is to lazy load a property. So we could have a Brand object returned to us, but that time-consuming and memory-wasting Description field is not until we try to invoke its get accessor. At that point, the proxy object will intercept our call and suck down the description from the database just in time. This is sometimes good enough but has burned me enough times that I personally don't recommend it:
It's easy to forget that the property is lazy-loaded, introducing a SELECT N+1 problem just by writing a foreach loop. Who knows what happens when LINQ gets involved.
What if the just-in-time database call fails because the transport got flummoxed or the network went out? I can almost guarantee that any code that is doing something as innocuous as string desc = brand.Description was not expecting that simple call to toss a DataAccessException. Now you've just crashed in a nasty and unexpected way. (Yes, I've watched my app go down hard because of just that. Learned the hard way!)
So what I've ended up doing is that in scenarios that require performance or are prone to database deadlocks, I create a separate interface that the Web site or any other program can call to get access to specific chunks of data that have had their query plans carefully examined. The architecture ends up looking kind of like this (forgive the ASCII art):
Web Site: Controller Classes
|
|---------------------------------+
| |
App Server: IDocumentService IOrderService, IInventoryService, etc
(Arrays, DataSets) (Regular OO objects, like Brand)
| |
| |
| |
Data Layer: (Raw ADO.NET returning arrays, ("Full cream" ORM like NHibernate)
DataSets, simple classes)
I used to think that this was cheating, subverting the OO object model. But in a practical sense, as long as you do this shortcut for displaying data, I think it's all right. The updates/inserts and what have you still go through the fully-hydrated, ORM-filled domain model, and that's something that happens far less frequently (in most of my cases) than displaying particular subsets of the data. ORMs like NHibernate will let you do projections, but by that point I just don't see the point of the ORM. This will probably be a stored procedure anyway, writing the ADO.NET takes two seconds.
This is just my two cents. I look forward to reading some of the other responses.
People use ORM's for greater development productivity, not for runtime performance optimization. It depends on the project whether it's more important to maximize development efficiency or runtime efficiency.
In practice, one could use the ORM for greatest productivity, and then profile the application to identify bottlenecks once you're finished. Replace ORM code with custom SQL queries only where you get the greatest bang for the buck.
SELECT * isn't bad if you typically need all the columns in a table. We can't generalize that the wildcard is always good or always bad.
edit: Re: doofledorfer's comment... Personally, I always name the columns in a query explicitly; I never use the wildcard in production code (though I use it when doing ad hoc queries). The original question is about ORMs -- in fact it's not uncommon that ORM frameworks issue a SELECT * uniformly, to populate all the fields in the corresponding object model.
Executing a SELECT * query may not necessarily indicate that you need all those columns, and it doesn't necessarily mean that you are neglectful about your code. It could be that the ORM framework is generating SQL queries to make sure all the fields are available in case you need them.
Linq to Sql, or any implementation of IQueryable, uses a syntax which ultimately puts you in control of the selected data. The definition of a query is also the definition of its result set.
This neatly avoids the select * issue by removing data shape responsibilities from the ORM.
For example, to select all columns:
from c in data.Customers
select c
To select a subset:
from c in data.Customers
select new
{
c.FirstName,
c.LastName,
c.Email
}
To select a combination:
from c in data.Customers
join o in data.Orders on c.CustomerId equals o.CustomerId
select new
{
Name = c.FirstName + " " + c.LastName,
Email = c.Email,
Date = o.DateSubmitted
}
There are two separate issues to consider.
To begin, it is quite common when using an ORM for the table and the object to have quite different "shapes", this is one reason why many ORM tools support quite complex mappings.
A good example is when a table is partially denormalised, with columns containing redundant information (often, this is done to improve query or reporting performance). When this occurs, it is more efficient for the ORM to request just the columns it requires, than to have all the extra columns brought back and ignored.
The question of why "Select *" is evil is separate, and the answer falls into two halves.
When executing "select *" the database server has no obligation to return the columns in any particular order, and in fact could reasonably return the columns in a different order every time, though almost no databases do this.
Problem is, when a typical developer observes that the columns returned seem to be in a consistent order, the assumption is made that the columns will always be in that order, and then you have code making unwarranted assumptions, just waiting to fail. Worse, that failure may not be fatal, but may simply involve, say, using Year of Birth in place of Account Balance.
The other issue with "Select *" revolves around table ownership - in many large companies, the DBA controls the schema, and makes changes as required by major systems. If your tool is executing "select *" then you only get the current columns - if the DBA has removed a redundant column that you need, you get no error, and your code may blunder ahead causing all sorts of damage. By explicitly requesting the fields you require, you ensure that your system will break rather than process the wrong information.
I am not sure why you would want a partially hydrated object. Given a class of Customer with properties of Name, Address, Id. I would want them all to create a fully populated Customer object.
The list hanging off of Customers called Orders can be lazily loaded when accessed though most ORMs. And NHibernate anyway allows you to do projections into other objects. So if you had say a simply customer list where you displayed the ID and Name, you can create an object of type CustomerListDisplay and project your HQL query into that object set and only obtain the columns you need from the database.
Friends don't let friends premature optimize. Fully hydrate your object, lazy load it's associations. And then profile your application looking for problems and optimize the problem areas.
Even ORMs need to avoid SELECT * to be effective, by using lazy loading etc.
And yes, SELECT * is generally a bad idea if you aren't consuming all the data.
So, do you have different kinds of MyThing objects, one for each column combo? – Corey Trager (Nov 15 at 0:37)
No, I have read-only digest objects (which only contain important information) for things like lookups and massive collections and convert these to fully hydrated objects on demand. – Cade Roux (Nov 15 at 1:22)
The case you describe is a great example of how ORM is not a panacea. Databases offer flexible, needs-based access to their data primarily through SQL. As a developer, I can easily and simply get all the data (SELECT *) or some of the data (SELECT COL1, COL2) as needed. My mechanism for doing this will be easily understood by any other developer taking over the project.
In order to get the same flexibility from ORM, a lot more work has to be done (either by you or the ORM developers) just to get you back to the place under the hood where you're either getting all or some of the columns from the database as needed (see the excellent answers above to get a sense of some of the problems). And all this extra stuff is just more stuff that can fail, making an ORM system intrinsically less reliable than straight SQL calls.
This is not to say that you shouldn't use ORM (my standard disclaimer is that all design choices have costs and benefits, and the choice of one or the other just depends) - knock yourself out if it works for you. I will say that I truly don't understand the popularity of ORM, given the amount of extra un-fun work it seems to create for its users. I'll stick with using SELECT * when (wait for it) I need to get every column from a table.
ORMs in general do not rely on SELECT *, but rely on better methods to find columns like defined data map files (Hibernate, variants of Hibernate, and Apache iBATIS do this). Something a bit more automatic could be set up by querying the database schema to get a list of columns and their data types for a table. How the data gets populated is specific to the particular ORM you are using, and it should be well-documented there.
It is never a good idea to select data that you do not use at all, as it can create a needless code dependency that can be obnoxious to maintain later. For dealing with data internal to the class, things are a bit more complicated.
A short rule would be to always fetch all the data that the class stores by default. In most cases, a small amount of overhead won't make a huge difference, so your main goal is to reduce maintenance overhead. Later, when you performance profiling of the code, and have reason to believe that it may benefit from adjusting the behavior, that is the time to do it.
If I saw an ORM make SELECT * statements, either visibly or under its covers, then I would look elsewhere to fulfill my database integration needs.
SELECT * is not bad. Did you ask whoever considered it to be bad "why?".
SELECT * is a strong indication you don't have design control over the scope of your application and its modules. One of the major difficulties in cleaning up someone else's work is when there is stuff in there that is for no purpose, but no indication what is needed and used, and what isn't.
Every piece of data and code in your application should be there for a purpose, and the purpose should be specified, or easily detected.
We all know, and despise, programmers who don't worry too much about why things work, they just like to try stuff until the expected things happen and close it up for the next guy. SELECT * is a really good way to do that.
If you feel the need to encapsulate everything within an object, but need something with a small subset of what is contained within a table - define your own class. Write straight sql (within or without the ORM - most allow straight sql to circumvent limitations) and populate your object with the results.
However, I'd just use the ORMs representation of a table in most situations unless profiling told me not to.
If you're using query caching select * can be good. If you're selecting a different assortment of columns every time you hit a table, it could just be getting the cached select * for all of those queries.
I think you're confusing the purpose of ORM. ORM is meant to map a domain model or similar to a table in a database or some data storage convention. It's not meant to make your application more computationally efficient or even expected to.