Hibernate Search, Entities, and SQL VIEWs

I have a table that maintains rows of products that are for sale (tbl_products) using PostgreSQL 9.1. There are also several other tables that maintain ratings on the items, comments, etc. We're using JPA/Hibernate for ORM in a Seam application, and have the appropriate entities wired up properly. In an effort to provide better listings of these items, I've created a SQL VIEW (v_product_summary) that aggregates some of the basic product data (name, description, price, etc.) with data from the other tables (number of comments, average rating, etc.). This provides a nice concise view of the data, and I've created a corresponding JPA entity object that provides read-only access to the view data.
Everything is working fine with respect to running JPQL queries on either the Product (tbl_products) or the ProductSummary (v_product_summary) objects. However, we'd like to provide a richer search experience using Hibernate Search and Lucene. The issue we're running into, though, is how to query the ProductSummary objects using Hibernate Search. They're not indexed upon creation, because they're never really "created": they're obtained as read-only objects from the v_product_summary VIEW. An index entry is only created for Product when it's persisted to the database, and never for ProductSummary, since it's never persisted.
Our thought is that we should be able to:
Persist our Product object to the database
Immediately query the corresponding ProductSummary object using the product's ID
Manually update the Hibernate Search index for the ProductSummary object (sketched below)
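In rough Java terms, the flow we have in mind looks something like this (just a sketch; the entity and variable names are ours, and we're assuming Hibernate Search's FullTextSession API):

import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

public void saveAndIndex(Session session, Product product) {
    // 1. Persist the Product as usual
    session.save(product);
    session.flush(); // so the row (and therefore the view) is visible

    // 2. Immediately load the corresponding read-only view entity
    ProductSummary summary =
            (ProductSummary) session.get(ProductSummary.class, product.getId());

    // 3. Manually push the view entity into the Lucene index
    FullTextSession fullTextSession = Search.getFullTextSession(session);
    fullTextSession.index(summary);
}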
Is this possible? Is this even a good idea? I can see there will be a performance impact since we're executing a query for the ProductSummary object every time a new Product is persisted. However, products are not added to the database at a high volume, so I don't think this will be a huge issue.
We'd really like to find a better, more efficient way of accomplishing this. Can anyone provide any tips or recommendations? If we do go the route of updating the search index manually, is that even doable? Can anyone provide a resource explaining how we can add a single ProductSummary to the index?
Any help you can provide is GREATLY appreciated.

If I understand the question correctly, you're trying to do the normal thing of persisting an object and indexing it at that point, but you're dealing with 2 separate objects.
I find myself doing kludgey things in Hibernate all the time; it feels like it almost demands it of you. Yes, there'd be a performance impact, and as you say, it's probably not a big deal, but it might be worth profiling.
Part of me remembers that there's a way to refresh an object upon write, and wonders whether you could wrap the Product and the ProductSummary together and tweak the mapping so that you read one part and write the other (I'm waving my hands on the exact syntax and mapping). Or you could create a Hibernate-facing object with read-only fields that can be split and merged into your two objects. I don't know whether your design allows Hibernate-only objects; it's a common idiom in my system.
Either approach could be useful if you had a lot of objects in this situation; if this is the only object you're searching this way, your three steps look much clearer.
As for the syntax for adding an object manually, I think you're looking for something like this, after your fetch:
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

FullTextSession textSession = Search.getFullTextSession(session);
textSession.index(myProductSummary);
Was that all you wanted?

Since you are using PostgreSQL, you could insert into the view and use a rule to redirect the insert to the appropriate table.
A PostgreSQL rule is a way to change a query just before it gets executed. I used one in an application that needed a schema change but required the old queries to keep working for a little while.
You can check out the documentation about rules on insert queries on the PostgreSQL site.
Since you'll be inserting into and updating the view, Hibernate Search will work as usual.
EDIT
An easier strategy: insert and update ProductSummary whenever you do so on Product, and tell PostgreSQL to ignore the resulting inserts, updates, and deletes on the view.
On the database side:
CREATE RULE dontinsert AS ON INSERT TO v_product_summary DO INSTEAD NOTHING;
CREATE RULE dontupdate AS ON UPDATE TO v_product_summary DO INSTEAD NOTHING;
CREATE RULE dontdelete AS ON DELETE TO v_product_summary DO INSTEAD NOTHING;
But I guess you will need to hack a little, since the JDBC call executeUpdate will return 0, and Hibernate will probably freak out.
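One way to keep Hibernate calm (a sketch on our part; the column lists are hypothetical and have to match the entity's mapped properties) is to override the generated statements and tell Hibernate not to verify the JDBC row count:

import javax.persistence.Entity;
import javax.persistence.Table;
import org.hibernate.annotations.ResultCheckStyle;
import org.hibernate.annotations.SQLInsert;
import org.hibernate.annotations.SQLUpdate;

// ResultCheckStyle.NONE stops Hibernate from complaining when the
// DO INSTEAD NOTHING rules make executeUpdate() return 0.
@Entity
@Table(name = "v_product_summary")
@SQLInsert(sql = "insert into v_product_summary (name, price, id) values (?, ?, ?)",
           check = ResultCheckStyle.NONE)
@SQLUpdate(sql = "update v_product_summary set name = ?, price = ? where id = ?",
           check = ResultCheckStyle.NONE)
public class ProductSummary { /* ... */ }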

Technically I think this would be possible, but your efficiency dilemma might be better solved with something like memcached, which would make performance less of an issue and perhaps improve maintainability, depending on how you currently have it implemented at the statement level. By updating the search index manually, do you mean the database index? That is not recommended, and I'm not sure it's even doable. Why not index the objects on creation?

Related

Database Type Agnostic Select Query Encapsulation class

I am upgrading a webapp that will be using two different database types: the existing MySQL database, which is tightly integrated with the current systems, and a MongoDB database for the extended functionality. The new functionality will also rely pretty heavily on the MySQL database for environmental data such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make the construction of queries much simpler (only for easier legibility while building; once it's finished, I'd convert back to hard-coded queries): an encapsulation object that would contain:
what data is being selected (including functionally derived data)
source (including joined data; I know joins are not a good idea for non-relational DBs, but it would be nice to have the facility just in case, and it could be rewritten into two queries later for performance)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever DB is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that can build and execute the query, returning it in an array, or object or whatever is desired.
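As a rough sketch of what I mean (in Java, with purely illustrative names):

import java.util.ArrayList;
import java.util.List;

// Descriptor for a single select query, independent of the database type.
class SelectQuery {
    List<String> fields = new ArrayList<String>();   // what is being selected
    String source;                                   // table/collection, plus joins
    List<Condition> where = new ArrayList<Condition>();
    List<Condition> having = new ArrayList<Condition>();
    List<String> orderBy = new ArrayList<String>();
    List<String> groupBy = new ArrayList<String>();
    Integer limit;
}

// Conditions are kept as objects so each adapter can interpret them later.
class Condition {
    String field;
    String operator;
    Object value;        // could itself be another SelectQuery
}

// One adapter per database type builds and executes the actual query.
interface QueryAdapter {
    List<Object[]> execute(SelectQuery query);
}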
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me? If not, are there any resources on similar projects that might allow me to continue the work and build a basic version?
I know this is a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing mistakes to occur.
I would study things like the H2 SQL grammar: http://www.h2database.com/html/grammar.html
It gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
Like you already know, this is a hard problem, and it will be a big challenge to make it run efficiently. Maybe consider moving all the data from MySQL and Mongo into a third data store that holds a copy of everything, and then run your queries against that - for example, replicating all writes to something like Redis or Elasticsearch and querying there.
Either way, good luck!

MongoDB - helps with cascading hierarchies?

It is becoming more and more common that, in my app, I have to maintain a cascading tree structure.
The tree defines the permissions and any objects associated with them. These cascade down, so that changing one node modifies (or re-derives the data for) the whole subtree.
With ActiveRecord and SQL this is pretty complicated to deal with. One option is to denormalise it and use "paths" such as "root/child1/child2" together with SQL LIKEs to work with them.
But I am starting to wonder whether the database itself is right for this, and am considering MongoDB.
There are other parts that might not be a good fit for MongoDB: transactions (during payments, reservations).
So my question is, would switching to MongoDB help to solve the "cascading hierarchy" problem?
So my question is, would switching to MongoDB help to solve the "cascading hierarchy" problem?
The primary benefit is honestly the ability to store the entire hierarchy in one document. Then you can basically "annotate" portions of the document with their permissions. Obviously, such "annotations" don't actually exist; they would just be fields in the document, so this will involve you building a layer for all of this.
So in theory, it's quite possible to do what you want, but it does have some limitations.
In particular, when MongoDB queries a document, it queries the whole document. If your whole hierarchy fits in a single document then you get the whole hierarchy with one query. If not, then you start devolving back to the SQL method of handling this.
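For illustration, a hierarchy-in-one-document might be shaped like this (a sketch using the BasicDBObject API of the older MongoDB Java driver; the field names are made up):

import java.util.Arrays;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

// One document holds the whole tree; each node carries the permissions
// that apply from that point down, so a single fetch returns everything.
DBObject child = new BasicDBObject("name", "child1")
        .append("permissions", Arrays.asList("read"))
        .append("children", Arrays.asList());
DBObject root = new BasicDBObject("name", "root")
        .append("permissions", Arrays.asList("read", "write"))
        .append("children", Arrays.asList(child));
// collection.save(root); // one write persists the whole subtree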

Should I be concerned that ORMs, by default, return all columns?

In my limited experience working with ORMs (so far LLBLGen Pro and Entity Framework 4), I've noticed that, inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure whether this applies to it or not, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First of all: great question, and about time someone asked this! :-)
Yes, the fact that an ORM typically returns all columns of a database table is something you need to take into consideration when designing your systems. But as you've mentioned, there are ways around this.
The main thing for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent table doesn't grab all those large blobs of data (see the sketch after this list)
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
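For the BLOB case, the mapping might look roughly like this (a JPA sketch with made-up names; the lazy one-to-one keeps the wide column out of everyday selects):

import javax.persistence.*;

@Entity
class Document {
    @Id @GeneratedValue Long id;
    String title;                        // the lean, frequently-read columns

    @OneToOne(fetch = FetchType.LAZY)    // body row is fetched only on access
    @JoinColumn(name = "body_id")
    DocumentBody body;
}

@Entity
class DocumentBody {
    @Id @GeneratedValue Long id;
    @Lob byte[] content;                 // the big blob lives in its own table
}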
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away though: analyse then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if the table has LOTS of columns (more than 30, say) or if a column holds a lot of data, for example over 5,000 characters in a field.
The approach I have used is to map another object to the existing table, but with only the fields I need. So for a search that populates a table with 100 rows I would have a MyObjectLite, but when I click to view the details of a row I would call a GetById and return a MyObject that has all the columns.
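In JPA/Hibernate terms that dual mapping could look like this (a sketch; be careful not to load and modify both entities in the same session, since they write to the same rows):

import javax.persistence.*;

// Slim entity for list screens - same table, only the cheap columns.
@Entity
@Table(name = "my_object")
class MyObjectLite {
    @Id Long id;
    String name;
}

// Full entity for the detail view, loaded via a GetById-style lookup.
@Entity
@Table(name = "my_object")
class MyObject {
    @Id Long id;
    String name;
    @Lob String longDescription;   // plus the rest of the wide columns
}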
Another approach is to use custom SQL or stored procs, but I only think you should go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a Projection and Transformers.AliasToBean with a DTO. Here is how it looks in the Criteria API:
// assuming a mapped entity with Id and PackageName properties
var dtos = session.CreateCriteria(typeof(Package))
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.Property("Id"), "Id")
        .Add(Projections.Property("PackageName"), "Caption"))
    .SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)))
    .List<PackageNameDTO>();
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.

Coldfusion ORM Large Tables

Say I have a large dataset: the table has well over a million records, and the database is normalized, so there are foreign keys and such. I've set up the relations properly, and I get a list of the first object with applications = EntityLoad("entityName"), but because of the relations the page takes about 24 seconds to load; even when I limit the number of records shown to 5, it takes an awfully long time.
My solution was to create another object that just gets the list, and then, when the user wants it, use the object with all the relations and show that to the user. Is this the right way to approach it, or am I missing a big ORM concept?
Are you counting just the time to get the data, or are you perhaps doing a CFDUMP on it, or something else visual that could be slow? In other words, have you wrapped the EntityLoad by itself in a cftimer tag to be sure that it is the culprit?
The first thing I would do is enable SQL logging in your Application.cfc. Add logSQL=true to This.ormSettings.
That should allow you to grab the SQL that ORM generates. Run it in an analyzer. See if the ORM SQL is doing something crazy. See if there is an index that you missed, or something similar.
Also are you doing paging as Ray talks about here: http://www.coldfusionjedi.com/index.cfm/2009/8/14/Simple-ColdFusion-9-ORM-Paging-Demo?
If not, have you tried using ORMExecuteQuery and HQL to enable paging?
Those are my thoughts.
When defining complex domain models with Hibernate - you will sometimes need to tweak the mapping to improve performance. This is especially true if you are dealing with inheritance (not sure how much inheritance is in your model). The ultimate goal is to have your query pulling from as few tables as possible while still preserving your domain model. This might require using the advanced inheritance mappings (more on that in a sec).
LOGGING SQL
As Terry mentioned, you will want to be sure you can log the actual SQL that is being passed to your database (yeah, you don't totally get away from SQL with ORM). Here is a great article on setting up logging for Hibernate in CF9 from Rupesh:
http://www.rupeshk.org/blog/index.php/2009/07/coldfusion-orm-how-to-log-sql/
HIBERNATE MAPPING FILES
Anytime you want to do something beyond the basics, you want to be sure that you are looking at the actual Hibernate mapping files that are generated for your CFCs. Be sure to set the following along with all of your Hibernate options in Application.cfc:
savemapping = true
While the cfproperty attributes allow you to define many aspects of the mapping, there are actually some things that can only be done in the Hibernate mapping files (and there are tons of community resources on this).
INHERITANCE MAPPING
As I mentioned earlier, Hibernate provides different inheritance strategies for mapping. They are Table per Hierarchy, Table per subclass, Table per concrete class, and implicit polymorphism. You can read more about these types in the CF9 docs under Advanced Mapping > Inheritance Mapping or in the Hibernate documentation (as it would take forever to explain each of these).
Knowing how your tables are mapped is very important with inheritance (and it is also where Hibernate can generate some HUGE queries if you don't tweak your setup).
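For reference, in plain JPA/Hibernate annotation terms (CF9's ORM settings map onto these same Hibernate strategies; the classes below are made up), the strategies look like:

import javax.persistence.*;

// Table per hierarchy: one table, with a discriminator column per subclass.
// Table per subclass would use InheritanceType.JOINED on the same root, and
// table per concrete class InheritanceType.TABLE_PER_CLASS.
@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
class Payment {
    @Id @GeneratedValue Long id;
}

@Entity
class CreditCardPayment extends Payment {
    String cardNumber;
}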
Those are the things I can think of - if you can give some additional information about your domain model - we can look to see what other things might be done to tweak it.
There is a good chance Hibernate is doing its caching thing. A fair comparison in my mind (everyone please feel free to add) is:
EntityLoad("entity_name") is the same as doing a select * from TABLE
So, in this case, what Hibernate might be doing is instantiating the memory and caching it a certain way; your database server might do something similar when you send such a broad SQL instruction.
I have been extremely interested in ORM the past few weeks and it looks to be a very rewarding undertaking.
For this reason, is there a time you would ever load all 500,000 records as a result? I assume not.
I have one large logging table that I will be attacking, and I am finding that the SQL good stuff must be there. For example, mark the fields that are indexed as such; this will speed up searching incredibly. I am sure the ORM can handle this.
Beyond this:
Find some excellent Hibernate forums, resources, and tutorials so you can learn Hibernate. This isn't really as much a ColdFusion-to-ORM issue as a question of what Hibernate might do on its own. I have ordered a few Hibernate books that I'm waiting on, to see how they are.
Likewise, there seems to be an incredible number of Hibernate resources out there, so you can bring Hibernate's performance-enhancement solutions into the ColdFusion sphere. I might be making it too simple, but I see the CF-ORM implementation as a wrapper with some code generation to save us time.
Take a look at implementing filters to cut down your data in the EntityLoad() call.
As recommended in other threads, turn on sql logging and see what sql is being generated. Chances are it might not be what you need. Check out HQL to see if you can form a better statement.
Most importantly, share what you find. I'll volunteer to do the same on this as you've tempted me to go try this out in my spare time a bit sooner than planned.
Faisal, we ran into this with LINQ (the C# ORM).
Our solution was to create simple objects not holding the relational data. For instance, along with Users we had a SimpleUsers object which held little or no relation to any other object and had a limited set of columns.
There could be other ways of handling this but this approach helped tremendously with the query speed.

Upgrade strategies for bad DB schema designs

I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including:
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field can be NULL, even though many of them shouldn't be
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue, "It works", and it does. But moving forward, it's a total pain to manage all of this in code, and it opens us up to bugs, IMO. Basically, the DB is being used as a flat file, since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?
Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings
Views are commonly used to transition between changing data models because of the encapsulation they provide. A view looks like a table, but does not exist as a physical object in the database - you can change which column is returned for a given column alias as desired. This allows you to set up your codebase to use the view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to keep returning the data in the existing format. For example, your current data model has:
SELECT t.column -- a list of concatenated strings, assumed comma-separated
  FROM TABLE t
...so the first version of the view would be the query above, but once you created the new table that uses 3NF, the query for the view would use:
SELECT GROUP_CONCAT(t.column SEPARATOR ',')
  FROM NEW_TABLE t
 GROUP BY t.parent_id -- hypothetical key, to keep one row per original row
...and the application code would never know that anything changed.
The problem with MySQL is that its view support is limited - you can't use variables within a view, nor can views contain subqueries in the FROM clause.
The reality of the changes you wish to make is that you are effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets its data. Model-View-Controller (MVC) is ideal to implement alongside changes like these, to minimize the cost of future ones.
I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.
Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract Transform Load) process (I can recommend SSIS (SQL Server Integration Services) but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;). You will need to put some thought into how you will consolidate duplicate records - this will probably need to be on a case by case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible.
Incrementally modify any clients to use the new schema making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
First, see how bad the code related to the DB is. If it is all mixed together with no DAO layer, you shouldn't think about a rewrite; but if there is a DAO layer, then it would be time to rewrite that layer and the DB along with it. If possible, build the migration tool on top of the two DAOs.
But my guess is there is no DAO, so you need to find which areas of the code you are going to be changing and which parts of the DB they relate to; hopefully you can cut it up into smaller parts that can be updated as you maintain the system. The biggest deal is to get FKs in there and to start checking for proper indexes; there is a good chance they aren't being done correctly.
I wouldn't worry too much about naming until the rest of the DB is under control. As for the NULLs: if the program chokes on a value being NULL, don't let it be NULL; but if the program can handle it, I wouldn't worry about it at this point. In the future, if the code applies a default value, move that into the DB, but that is way down the line from the sound of things.
Do something about the varchars sooner rather than later. If anything, make that the first pure background fix to the program.
The other thing to do is estimate the effort of each area's change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.