MongoDB - helps with cascading hierarchies?

It's becoming more and more common for my app to have to maintain a cascading tree structure.
The tree defines permissions and any objects associated with them. These cascade down, so that changing one node modifies (or derives the data for) its whole subtree.
With ActiveRecord and SQL this is pretty complicated to deal with. One option is to denormalise it and use "paths" like "root/child1/child2" with SQL LIKE queries to work with them.
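For instance, selecting a whole subtree from a hypothetical nodes table with such a path column:

SELECT *
FROM nodes
WHERE path LIKE 'root/child1/%';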
But I've started wondering whether the database itself is the right fit for this, and whether I should consider using MongoDB.
There are other parts of the app that might not be a good fit for MongoDB: transactions (during payments, reservations).
So my question is, would switching to MongoDB help to solve the "cascading hierarchy" problem?

So my question is, would switching to MongoDB help to solve the "cascading hierarchy" problem?
The primary benefit is honestly the ability to store the entire hierarchy in one document. Then you can basically "annotate" portions of the document with their permissions. Obviously, such "annotations" don't actually exist as a feature - they would just be fields in the document - so this will involve building a layer to manage all of this.
So in theory, it's quite possible to do what you want, but it does have some limitations.
In particular, when MongoDB queries a document, it queries the whole document. If your whole hierarchy fits in a single document then you get the whole hierarchy with one query. If not, then you start devolving back to the SQL method of handling this.
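To make this concrete, a single document holding a small tree might look something like the following sketch (all field names are invented; the permissions arrays are the "annotations" described above):

{
  _id: "root",
  permissions: ["admin"],
  children: [
    { name: "child1",
      permissions: ["read", "write"],
      children: [
        { name: "child2", permissions: ["read"], children: [] }
      ] }
  ]
}

One findOne() call retrieves the whole tree, and your application layer walks the children arrays to apply the cascading permissions.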

Related

MongoDB embedding vs SQL foreign keys?

Are there any particular advantages to MongoDB's ability to embed objects within a document, compared to SQL's use of foreign keys for the same logic?
It seems to me that the only advantage is ease of use (and perhaps performance?), and even that seems like it could be easily abstracted away (e.g. Django seems to handle SQL's foreign keys pretty intuitively).
This boils down to a classic question of whether to embed or not.
Here are a few links to get started before I explain some more:
Where should I put activities timeline in mongodb, embedded in user or separately?
MongoDB schema design -- Choose two collection approach or embedded document
MongoDB schema for storing user location history
Now to answer more specifically.
You must remember the server-side usage of foreign keys in SQL: JOINs. Embedding gets you all the data you need in a single document with a single round trip, whereas JOINs do not: they are in fact two selections based upon a range, which are then merged to omit duplicates (with significant overhead on some data sets).
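For contrast, here is a sketch of what the foreign-key version costs you - a server-side merge over two tables (table and column names are made up):

SELECT p.title, c.body
FROM posts p
LEFT JOIN comments c ON c.post_id = p.id
WHERE p.id = 42;

The embedded equivalent is a single fetch of one posts document that already contains its comments array.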
So the use of foreign keys is not totally app-dependent; it is also server- and database-dependent.
That being said, some people misunderstand embedding in MongoDB and try to make all their data fit into one document. Unfortunately this is reinforced by the oft-repeated advice that you should always try to embed everything. The links above (and more) provide some useful guidance on this.
Now that we've cleared some things up, the main pros of embedding over JOINs are:
Single round trip
Easy to update the document in a lot of cases, unless you embed many levels deep
Can keep entity data with the entity it is related to
However, embedding has a few flaws:
The document must be paged in to get its values; this can be problematic on larger documents
Subdocuments are designed to be unique to their parent entity and not to require advanced querying, so you normally would not embed two separate entities that are queried independently; e.g. a post could embed its comments, but a user probably wouldn't embed their posts, due to the query needs
Nesting more than 3 levels deep could affect your ability to use things such as atomic updates
So when used correctly, MongoDB's embedding can be a huge advantage over SQL JOINs, but you must understand when to use it.
The core strength of Mongo is in its document-oriented view of data, and naturally this can be extended to a "POCO" view of data. Mongo clients like the NoRM Project in .NET will seem astonishingly familiar to experienced Fluent NHibernate users, and this is no accident - your POCO data models are simply serialized to BSON and saved in Mongo 1:1. No mappings required.
Overall, the biggest difference between these two technologies is the model and how developers have to think about their data. Mongo is better suited to rapid application development.

Hibernate Search, Entities, and SQL VIEWs

I have a table that maintains rows of products that are for sale (tbl_products) using PostgreSQL 9.1. There are also several other tables that maintain ratings on the items, comments, etc. We're using JPA/Hibernate for ORM in a Seam application, and have the appropriate entities wired up properly. In an effort to provide better listings of these items, I've created a SQL VIEW (v_product_summary) that aggregates some of the basic product data (name, description, price, etc.) with data from the other tables (number of comments, average rating, etc.). This provides a nice concise view of the data, and I've created a corresponding JPA entity object that provides read-only access to the view data.
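For illustration, such a view might be defined along these lines (the supporting table and column names here are guesses based on the description):

CREATE VIEW v_product_summary AS
SELECT p.id,
       p.name,
       p.description,
       p.price,
       (SELECT COUNT(*)      FROM tbl_comments c WHERE c.product_id = p.id) AS comment_count,
       (SELECT AVG(r.rating) FROM tbl_ratings  r WHERE r.product_id = p.id) AS avg_rating
FROM tbl_products p;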
Everything is working fine with respect to running JPQL queries on either the Product object (tbl_products) or the ProductSummary (v_product_summary) objects. However, we'd like to provide a richer search experience using Hibernate Search and Lucene. The issue we're running into, though, is how do we query the ProductSummary objects using Hibernate Search? They're not indexed upon creation, because they're never really "created". They're obtained as read-only objects from the v_product_summary VIEW. An index entry is only created on Product when it's persisted to the database, and not for ProductSummary since it's never persisted.
Our thought is that we should be able to:
Persist our Product object to the database
Immediately query the corresponding ProductSummary object using the product's ID
Manually update the Hibernate Search index for the ProductSummary object
Is this possible? Is this even a good idea? I can see there will be a performance impact since we're executing a query for the ProductSummary object every time a new Product is persisted. However, products are not added to the database at a high volume, so I don't think this will be a huge issue.
We'd really like to find a better, more efficient way of accomplishing this. Can anyone provide any tips or recommendations? If we do go the route of updating the search index manually, is that even doable? Can anyone provide a resource explaining how we can add a single ProductSummary to the index?
Any help you can provide is GREATLY appreciated.
If I understand the question correctly, you're trying to do the normal thing of persisting an object and indexing it at that point, but you're dealing with 2 separate objects.
I find myself doing kludgey things in Hibernate all the time; it feels like it almost demands it of you. Yes, there'd be a performance impact, and as you say it's probably not a big deal, but it might be worth profiling.
A part of me remembers there's a way you can refresh the object upon write, and wonders if there's a way you can wrap the Product and the ProductSummary and tweak the mapping so that you read part and write part of it (waves hands on syntax and mapping). Or create a Hibernate-facing object with read-only fields that can be split and merged into your two objects. I don't know if your design allows Hibernate-only objects; it's a common idiom in my system.
Either way could be useful if you had a lot of objects in this situation; if this is the only object you're searching this way, your 3 steps look much clearer.
As for the syntax for adding an object manually, I think you're looking for something like this, after your fetch:
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

FullTextSession textSession = Search.getFullTextSession(session);
textSession.index(myProductSummary); // ProductSummary must be mapped with @Indexed
Was that all you wanted?
Since you are using PostgreSQL, you could insert into the view and use a rule to redirect the insert to the appropriate table.
A PostgreSQL rule is a way to change a query just before it gets executed. I used one in an application which needed a change in schema but required the old queries to still work for a little while.
You can check out the documentation about rules on insert queries on the PostgreSQL site.
Since you'll be inserting and updating to the view, hibernate search will work as usual.
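A redirect rule might look something like this (the column list is invented for the example):

CREATE RULE v_product_summary_ins AS
ON INSERT TO v_product_summary
DO INSTEAD
  INSERT INTO tbl_products (name, description, price)
  VALUES (NEW.name, NEW.description, NEW.price);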
EDIT
An easier strategy. You could insert and update ProductSummary when doing so on Product and tell PostgreSQL to ignore the inserts, updates and deletes on the view.
On the database side"
CREATE RULE dontinsert AS ON INSERT TO v_product_summary DO INSTEAD NOTHING;
CREATE RULE dontupdate AS ON UPDATE TO v_product_summary DO INSTEAD NOTHING;
CREATE RULE dontdelete AS ON DELETE TO v_product_summary DO INSTEAD NOTHING;
But I guess you will need to hack a little, since the JDBC call executeUpdate will return 0, and Hibernate will probably freak.
Technically I think this would be possible, but I think your entire efficiency dilemma might be better solved using something like memcached, making performance less of an issue and perhaps improving code maintainability, depending on how you currently have it implemented at the statement level. By updating the search index manually, do you mean the database index? That is not recommended, and I'm not sure it's even doable. Why not index them on creation?

Are modifiable join views a reasonable design choice?

To be clear, by modifiable join view I mean a view constructed from the joining of two or more tables that allows insert/update/delete actions that modify any/all of the component tables.
This may be a Postgres-specific question, not sure. I am also interested in whether other DBMSs have idiosyncratic features for modifiable join views, since as far as I can tell they are not possible in standard SQL.
I'm working on a postgres schema, and some of my recent reading has suggested that it is possible to construct modifiable join views using instead rules (CREATE RULE ... DO INSTEAD ...). Modifiable join views seem desirable since it would allow for hiding strong normalization behind an interface, providing a mechanism for classic abstraction. Rules are the only option for implementation, since currently triggers cannot be set on views.
However, the first modifiable view I tried to design ran into problems, and I found out that many consider non-trivial rules to be harmful (see links in comments to this SO answer). Also, I can't find any examples of modifiable join views on the web.
Questions (Edit to put finer points on the questions):
Do you have any experience with modifiable join views and can you provide a concrete example with select/insert/delete/update ability?
Are they practical, i.e. can they be treated transparently without having to tiptoe around mines/black holes?
Are they ever a good design choice, in terms of functionality/effort ratio and maintainability?
Would greatly appreciate links to any examples/discussions on this topic. Thanks.
Yes, I have some experience with updatable views in general. I think they're practical in PostgreSQL. Like all design choices, they can be a good choice, and they can be a bad choice.
I find them particularly useful in dealing with supertype/subtype tables. I create one view for each subtype; the view joins the subtype to the supertype. Revoke permissions on the base tables, write rules for the view, and give client code access only to the views. All data manipulation done by client code then goes through the view and the rules defined on them.
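A concrete sketch of that pattern, with invented names (person is the supertype, employee the subtype):

CREATE VIEW employee_v AS
SELECT p.id, p.name, e.salary
FROM person p
JOIN employee e ON e.id = p.id;

CREATE RULE employee_v_ins AS
ON INSERT TO employee_v
DO INSTEAD (
  INSERT INTO person (id, name) VALUES (NEW.id, NEW.name);
  INSERT INTO employee (id, salary) VALUES (NEW.id, NEW.salary)
);

REVOKE ALL ON person, employee FROM client_role;
GRANT SELECT, INSERT ON employee_v TO client_role;

UPDATE and DELETE rules follow the same shape, splitting the change across the two base tables.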
I don't think rules are really different from any other feature in any other environment. And by environment, I mean C, C++, Java, Ruby, Python, Erlang, and BASIC, not just dbms environments.
Use the good features of a language. Avoid the bad ones.
"Don't use malloc()" is bad advice. "Always check the return value of malloc()" is good advice. "Never use rules" is bad advice. "Avoid using rules in ways that are known to have questionable behavior" is good advice. The rules you need for views on supertype/subtype tables are simple and easy to understand. They don't misbehave.
At the theoretical level, views provide logical data independence. But that's only possible if the views are updatable. (And many views should be updatable directly by the database engine, without any need of rules or triggers.)
I use them as a replacement for ORMs. I think that as long as you do not run amok sprinkling them everywhere through the database, they can be easy enough to understand. I define a schema for an application, and then whatever views are in that schema are the methods and operations of that app. The client code can be mostly automated after that, since the views give the abstraction I need to write generic client code.
People point out that the rule rewrite is not a real table (though it is posing as one), which makes it possible to write things that will break. This is possible, but I have yet to come across it. The idea is to hide the complexity in the rewrite and then only do simple deletes and updates with no joins. If it turns out that a join is needed, it is time to rewrite the rule, not the top-level query.
In the end, I find it a very compact way to write the database. All the ways of interfacing with it are written as rules. No connection should have access to a real table. Your business logic is very explicit: if a view does not have an UPDATE rule, it cannot be updated, period. Since you have written all this at the database level instead of the client level, it is not tied to a web framework or a particular language. This leads to a lot of flexibility in how you connect to the database. Imagine you used a web framework, but as time goes on you need direct access to the database from another source. Direct access would also bypass all the ORM business rules you worked so hard on. With a rule-based interface, you can expose the database without fear that the new connection will corrupt the data.
If people say you can really F UP a database with them - sure, of course you can. But you can with everything else too. If people say you cannot use them at all without mucking things up, then I would disagree.
Two quick links:
Why using rules is bad idea
Triggers on views
My personal preference is to use views only for reading data, (virtually) never for inserting or updating. By essentially re-normalizing data (which sounds like what you are doing) in your database, you are likely creating a system that will be very difficult to test and maintain in the long term.
If at all possible, look at mapping your denormalized data back to a normal schema somewhere in your application code, and providing it to the database that way (to individual tables IMHO) in a single transaction.
I know that in SQL Server, if you update a view, you must limit the change to only one base table, which in my mind makes views useless for updating, as you have to know which fields go with which tables anyway.
If you want to abstract the information out and not have to worry about the database structure for inserts and updates, an ORM might do a better job for you than views.
I have never used modifiable views of any sort, but as you are asking whether they are a "reasonable design choice", can I suggest an alternative design choice with many benefits, where modifiable views are not needed: a Transactional API.
Basically what this amounts to is:
Users have no access to tables and cannot issue insert, update, delete statements at all
Users have access to functions that represent well-defined transactions - at the simplest level these may just do a single DML statement, but often they would not. The important thing is that they map to transactions in the 'business' sense rather than in the 'database' sense (see the sketch after this list)
For querying, users have access to (non-modifiable) views
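A minimal sketch of such a function in PostgreSQL (all names invented):

CREATE FUNCTION register_customer(p_name text, p_email text)
RETURNS integer AS $$
DECLARE
  new_id integer;
BEGIN
  -- one 'business' transaction: create the customer and its first email address
  INSERT INTO customer (name) VALUES (p_name) RETURNING id INTO new_id;
  INSERT INTO customer_email (customer_id, email) VALUES (new_id, p_email);
  RETURN new_id;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

SECURITY DEFINER lets the function touch tables the caller has no direct rights on, which is what makes the 'no table access' rule workable.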
I usually do views in the form of "last valid record", just hiding and tracking modifications (like a wiki).
The only drawback that I see is this: you end up using your view as a table - you join it with anything, you use it in WHERE clauses, you insert records into it, and so on - but behind the scenes you have lost a lot of performance compared to the same actions against a real table (bigger and more complex). I think it depends on how many people must understand the schema. It's true that some DBMSs also allow indexing views, but I think you lose a significant amount of performance anyway.

Coldfusion ORM Large Tables

Say I have a large dataset: the table has well over a million records, and the database is normalized, so there are foreign keys and such. I've set up the relations properly, and I get a list of the first object with applications = EntityLoad("entityName"), but because of the relations and such the page takes something like 24 seconds to load; even when I limit the number of records shown to 5 it takes an awfully long time to load.
My solution to this was to create another object that just gets the list, and then, when the user wants the detail, use the object with all the relations and show that to the user. Is this the right way to approach it, or am I missing a big ORM concept?
Are you counting just the time to get the data, or are you perhaps doing a CFDUMP on it or something else visual that could be slow? In other words, have you wrapped the EntityLoad by itself in a cftimer tag to be sure that it is the culprit?
The first thing I would do is enable SQL logging in your Application.cfc. Add logSQL=true to This.ormSettings.
That should allow you to grab the SQL that ORM generates. Run it in an analyzer. See if the ORM SQL is doing something crazy. See if there is an index that you missed, or something.
Also are you doing paging as Ray talks about here: http://www.coldfusionjedi.com/index.cfm/2009/8/14/Simple-ColdFusion-9-ORM-Paging-Demo?
If not, have you tried using ORMExecuteQuery and HQL to enable paging? Something like ORMExecuteQuery("from entityName", {}, false, {offset=0, maxresults=5}) - the parameter values here are just an illustration.
Those are my thoughts.
When defining complex domain models with Hibernate - you will sometimes need to tweak the mapping to improve performance. This is especially true if you are dealing with inheritance (not sure how much inheritance is in your model). The ultimate goal is to have your query pulling from as few tables as possible while still preserving your domain model. This might require using the advanced inheritance mappings (more on that in a sec).
LOGGING SQL
As Terry mentioned, you will want to be sure you can log the actual SQL that is being passed to your database (yeah, you don't totally get away from SQL with ORM). Here is a great article on setting up logging for Hibernate in CF9 from Rupesh:
http://www.rupeshk.org/blog/index.php/2009/07/coldfusion-orm-how-to-log-sql/
HIBERNATE MAPPING FILES
Anytime you want to do something beyond the basics, you want to be sure that you are looking at the actual Hibernate mapping files that are generated for your CFCs. Be sure to set the following along with all of your other Hibernate options in Application.cfc:
savemapping = true
While the cfproperty properties allow you to define many aspects of the mapping, there are actually some things that can only be done in the Hibernate mapping files (and there are tons of community resources on this).
INHERITANCE MAPPING
As I mentioned earlier, Hibernate provides different inheritance strategies for mapping. They are Table per Hierarchy, Table per subclass, Table per concrete class, and implicit polymorphism. You can read more about these types in the CF9 docs under Advanced Mapping > Inheritance Mapping or in the Hibernate documentation (as it would take forever to explain each of these).
Knowing how your tables are mapped is very important with inheritance (and it is also where Hibernate can generate some HUGE queries if you don't tweak your setup).
Those are the things I can think of - if you can give some additional information about your domain model - we can look to see what other things might be done to tweak it.
There is a good chance Hibernate is doing its caching thing. A fair comparison in my mind (everyone please feel free to add to this):
EntityLoad("entity_name") is the same as doing a SELECT * FROM table
So, in this case, Hibernate might be instantiating everything in memory and caching it in a certain way, much as your database server might when you send it such a broad SQL instruction.
I have been extremely interested in ORM the past few weeks and it looks to be a very rewarding undertaking.
For this reason, is there a time you would ever load all 500,000 records as a result? I assume not.
I have one large logging table that I will be attacking, and I am finding that the good SQL stuff must still be there. For example, mark the fields that are indexed as such; this will speed searching up incredibly. I am sure the ORM can handle this.
Beyond this:
Find some excellent Hibernate forums, resources, and tutorials so you can learn Hibernate. This isn't really as much a ColdFusion-to-ORM issue as a question of what Hibernate might do on its own. I have ordered a few Hibernate books that I'm waiting on, to see how they are.
Likewise there seems to be an incredible amount of Hibernate resources out there where you can bring the Performance enhancement solutions of Hibernate into the Coldfusion sphere. I might be making it too simple, but I see the CF-ORM implementation as a wrapper with some code generation to save us time.
Take a look at implementing filters to cut down your data in the EntityLoad() call - e.g. something like EntityLoad("entityName", {someProperty=someValue}), with names made up for illustration.
As recommended in other threads, turn on sql logging and see what sql is being generated. Chances are it might not be what you need. Check out HQL to see if you can form a better statement.
Most importantly, share what you find. I'll volunteer to do the same on this as you've tempted me to go try this out in my spare time a bit sooner than planned.
Faisal, we ran into this with LINQ (a C# ORM).
Our solution was to create simple objects not holding the relational data. For instance, along with Users we had a SimpleUsers object which held little or no relation to any other object and had a limited set of columns.
There could be other ways of handling this but this approach helped tremendously with the query speed.

Upgrade strategies for bad DB schema designs

I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including:
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field can be NULL, even though for most of them that isn't really true of the data
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue "it works", and it does. But moving forward, it's a total pain to manage all of this in code, and it opens us up to bugs IMO. Basically, the DB is being used as a flat file, since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?
Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
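Retrofitting one looks like this (names invented):

ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customers (id);

Expect some of these to fail at first; every failure is an orphaned row that the code-managed 'relationship' let through.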
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings
Views are commonly used to transition between changing data models because of the encapsulation they provide. A view looks like a table, but does not exist as a concrete object in the database - you can change what column is returned for a given column alias as desired. This allows you to set up your codebase to use a view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to return the data in the existing format. For example - your current data model has:
SELECT t.some_column -- a list of concatenated strings, assuming comma-separated
FROM old_table t
...so the first version of the view would be the query above, but once you created the new table that uses 3NF, the query for the view would use:
SELECT GROUP_CONCAT(t.some_value SEPARATOR ',') AS some_column
FROM new_table t
GROUP BY t.parent_id -- one row per row of the old table
...and the application code would never know that anything changed.
The problem with MySQL is that its view support is limited - you can't use variables within a view, nor can views contain subqueries.
The reality of the changes you wish to make is that you are effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets its data. Model-View-Controller (MVC) is an ideal pattern to implement alongside changes like these, to minimize the cost of future ones.
I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.
Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract, Transform, Load) process (I can recommend SSIS (SQL Server Integration Services), but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;). You will need to put some thought into how you will consolidate duplicate records - this will probably need to be done on a case-by-case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible (see the sketch after this list).
Incrementally modify any clients to use the new schema making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
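A sketch of the 'Unknown' record and of a compatibility view presenting the new schema in the old shape, with invented names:

-- a catch-all parent for orphaned rows found during the ETL
INSERT INTO customers (customer_id, customer_name) VALUES (0, 'Unknown');

-- the old table name lives on as a view over the new schema
CREATE VIEW tbl_customer AS
SELECT customer_id AS cust_id, customer_name AS cust_nm
FROM customers;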
First, see how bad the code's coupling to the DB is. If it is all mixed in, with no DAO layer, you shouldn't think about a rewrite; but if there is a DAO layer, then it would be time to rewrite that layer and the DB along with it. If possible, build the migration tool on top of the two DAOs.
But my guess is there is no DAO, so you need to find what areas of the code you are going to be changing and what parts of the DB those relate to; hopefully you can cut it up into smaller parts that can be updated as you do maintenance. The biggest deal is to get FKs in there and to start checking for proper indexes; there is a good chance those aren't being done correctly.
I wouldn't worry too much about naming until the rest of the DB is under control. As for the NULLs: if the program chokes on a value being NULL, don't let it be NULL; but if the program can handle it, I wouldn't worry about it at this point. If, down the road, the program is applying a default value, move that into the DB, but that is way down the line from the sound of things.
Do something about the varchars sooner rather than later. If anything, make that the first pure background fix to the program.
The other thing to do is estimate the effort of each area's change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.