Complex taxonomy ORM mapping - looking for suggestions - orm

In my project (ASP.NET MVC + NHibernate) I have all my entities, lets say Documents, described by set of custom metadata. Metadata is contained in a structure that can have multiple tags, categories etc. These terms have the most importance for users seeking the document they want, so it has an impact on views as well as underlying data structures, database querying etc.
From view side of application, what interests me the most are the string values for the terms. Ideally I would like to operate directly on the collections of strings like that:
class MetadataAsSeenInViews
{
public IList<string> Categories;
public IList<string> Tags;
// etc.
}
From model perspective, I could use the same structure, do the simplest-possible ORM mapping and use it in queries like "fetch all documents with metadata exactly like this".
But that kind of structure could turn out useless if the application needs to perform complex database queries like "fetch all documents, for which at least one of categories is IN (cat1, cat2, ..., catN) OR at least one of tags is IN (tag1, ..., tagN)". In that case, for performance reasons, we would probably use numeric keys for categories and tags.
So one can imagine a structure opposite to MetadataAsSeenInViews that operates on numeric keys and provide complex mappings of integers to strings and other way round. But that solution doesn't really satisfy me for several reasons:
it smells like single responsibility violation, as we're dealing with database-specific issues when just wanting to describe Document business object
database keys are leaking through all layers
it adds unnecessary complexity in views
and I believe it doesn't take advantage of what can good ORM do
Ideally I would like to have:
single, as simple as possible metadata structure (ideally like the one at the top) in my whole application
complex querying issues addressed only in the database layer (meaning DB + ORM + at less as possible additional code for data layer)
Do you have any ideas how to structure the code and do the ORM mappings to be as elegant, as effective and as performant as it is possible?

I have found that it is problematic to use domain entities directly in the views. To help decouple things I apply two different techniques.
Most importantly I'm using separate ViewModel classes to pass data to views. When the data corresponds nicely with a domain model entity, AutoMapper can ease the pain of copying data between them, but otherwise a bit of manual wiring is needed. Seems like a lot of work in the beginning but really helps out once the project starts growing, and is especially important if you haven't just designed the database from scratch. I'm also using an intermediate service layer to obtain ViewModels in order to keep the controllers lean and to be able to reuse the logic.
The second option is mostly for performance reasons, but I usually end up creating custom repositories for fetching data that spans entities. That is, I create a custom class to hold the data I'm interested in, and then write custom LINQ (or whatever) to project the result into that. This can often dramatically increase performance over just fetching entities and applying the projection after the data has been retrieved.
Let me know if I haven't been elaborate enough.

The solution I've finally implemented don't fully satisfy me, but it'll do by now.
I've divided my Tags/Categories into "real entities", mapped in NHibernate as separate entities and "references", mapped as components depending from entities they describe.
So in my C# code I have two separate classes - TagEntity and TagReference which both carry the same information, looking from domain perspective. TagEntity knows database id and is managed by NHibernate sessions, whereas TagReference carries only the tag name as string so it is quite handy to use in the whole application and if needed it is still easily convertible to TagEntity using static lookup dictionary.
That entity/reference separation allows me to query the database in more efficient way, joining two tables only, like select from articles join articles_tags ... where articles_tags.tag_id = X without joining the tags table, which will be joined too when doing simple fully-object-oriented NHibernate queries.

Related

What is the best way to represent a form with hundreds of questions in a model

I am trying to design an income return tax software.
What is the best way to represent/store a form with hundreds of questions in a model?
Just for this example, I need at least 6 models (T4, T4A(OAS), T4A(P), T1032, UCCB, T4E) which possibly contain hundreds of fields.
Is it by creating hundred of fields? Storing values in a map? An Array?
One very generic approach could be XML
XML allows you to
nest your data to any degree
combine values and meta information (attributes and elements)
describe your data in detail with XSD
store it externally
maintain it easily
even combine it with additional information (look at processing instructions)
and (last but not least) store the real data in almost the same format as the modell...
and (laster but even not leaster :-) ) there is XSLT to transform your XML data into any other format (such as HTML for nice presentation)
There is high support for XML in all major languages and database systems.
Another way could be a typical parts list (or bill of materials/BOM)
This tree structure is - typically - implemented as a table with a self-referenced parentID. Working with such a table needs a lot of recursion...
It is very highly recommended to store your data type-safe. Either use a character storage format and a type identifier (that means you have to cast all your values here and there), or you use different type-safe side tables via reference.
Further more - if your data is to be filled from lists - you should define a datasource to load a selection list dynamically.
Conclusio
What is best for you mainly depends on your needs: How often will the modell change? How many rules are there to guarantee data's integrity? Are you using a RDBMS? Which language/tools are you using?
With a case like this, the monolithic aggregate is probably unavoidable (unless you can deduce common fields). I'm going to exclude RDBMS since the topic seems to focus more on lower-level data structures and a more proprietary-style solution, though that could be a very valid option that can manage all these fields.
In this case, I think it ceases to become so much about formalities as just daily practicalities.
Probably worst from that standpoint in this case is a formal object aggregating fields, like a class or struct with a boatload of data members. Those tend to be the most awkward and the most unattractive as monoliths, since they tend to have a static nature about them. Depending on the language, declaration/definition/initialization could be separate which means 2-3 lines of code to maintain per field. If you want to read/write these fields from a file, you have to write a separate line of code for each and every field, and maintain and update all that code if new fields added or existing ones removed. If you start approaching anything resembling polymorphic needs in this case, you might have to write a boatload of branching code for each and every field, and that too has to be maintained.
So I'd say hundreds of fields in a static kind of aggregate is, by far, the most unmaintainable.
Arrays and maps are effectively the same thing to me here in a very language-agnostic sense provided that you need those key/value pairs, with only potential differences in where you store the keys and what kind of algorithmic complexity is involved. Whatever you do, probably a key search in this monolith should be logarithmic time or better. 'Maps/associative arrays' in most languages tend to inherently have this quality.
Those can be far more suitable, and you can achieve the kind of runtime flexibility that you like on top of those (like being able to manage these from a file and add the fields on the fly with no pre-existing knowledge). They'll be far more forgiving here.
So if the choice is between a bunch of fields in a class and something resembling a map, I'd suggest going for a map. The dynamic nature of it will be far more forgiving for these kinds of cases and will typically far outweigh the compile-time benefits of, say, checking to make sure a field actually exists and producing a syntax error otherwise. That kind of checking is easy to add back in and more if we just accept that it will occur at runtime.
An exception that might make the field solution more appealing is if you involve reflection and more dynamic techniques to generate an object with the appropriate fields on the fly. Then you get back those dynamic benefits and flexibility at runtime. But that might be more unwieldy to initialize the structure, could involve leaning a lot more heavily on heavy-duty (and possibly very computationally-expensive) introspection and type manipulation and code generation mechanisms, and also end up with more funky code that's hard to maintain.
So I think the safest bet is the map or associative array, and a language that lets you easily add new fields, inspect existing ones, etc. with very fast turnaround. If the language doesn't inherently have that quality, you could look to an external file to dynamically add fields, and just maintain the file.

Problems in reusing a single POJO class for different databases

I am using POJO (Plain Old Java Object) classes for mapping the relational database and using Apache Solr to index the database.
I don't know whether I can re-use pojo classes for Apache Solr or not.
Since mapping classes are too specific and are designed with foreign key relationship in mind, it is very difficult to use the classes with Solr (a single schema search server), but creating new POJO classes for Apache Solr is also difficult.
So I want to know which is the better design approach for reusing.
Also I would like to know the pitfalls of reusing the same POJO class.
SOLR is very different from a relational database...basically it is something like a big table (with several differences like multivalued columns).
Now, I see your problem a step behind the concrete implementation (POJO)...
First, you have to de-normalize your table(s)...that's the real hard thing you need to do when working with SOLR. I mean passing from your ER to SOLR schema. Once did that, you can use Solrj to map entity with pojo, but this is only the last part of the story.
Still about denormalization: doesn't make sense do a raw translation of a set of POJO mapped on top of a relational database. Relational databases are general-purpose: their design approach is data-centric. I mean, first decide how to store your data and after that SQL clients will be able to get what they need.
SOLR works in a different way: in order to determine moreless exactly your schema you should know your search requirements (i.e. queries). The schema is not general-purpose (like a database) but is tiered on top of search requirements. That's the reason why you could index an atrtribute or not, decide what kind of analysis needs a particular field, multivalued, monovalued, stemming, etc etc etc
So basically, it's all about denormalization and query requirements.

What design patterns for marshalling JSON APIs to/from SQL

I'm working on a first JSON-RPC/JSON-REST API. One of the conveniences of JSON is that it can easily represent structured data (a user may have multiple email addresses, multiple addresses), etc...
For example, the Facebook Graph API nicely represents the kind of thing that's handy to return as JSON objects:
https://fbcdn-dragon-a.akamaihd.net/hphotos-ak-ash3/851559_339008529558010_1864655268_n.png
However, in implementing an API such as this with a relational database, we end up shattering structured objects into very many tables (at least one for each list in the JSON object), and un-shattering them when responding to requests. So:
requires a lot of modelling (separate models for JSON object and SQL tables).
inconsistencies creep in between the models: e.g. user_id (in SQL) vs. userID (in JSON)
marshaling stuff between one model and the other is very time consuming (tedious, error-prone and pointless boilerplate).
What design-patterns exist to help in this situation?
I'm not sure you are looking for design patterns. I would look for tools that handle this better.
I assume that you want to be able to query these objects, and not just store them in TEXT fields. Many databases support XML fairly well, so I would convert the JSON to XML (with a library) and then store that in the database.
You may also want to consider a JSON document based database. That will definitely get you where you want to go.
If you don't need to be able to query these, or only need to query a very small subset of fields, just store the objects as text, and extract those query-able fields into actual columns. This way you don't need to touch the majority of the data, but you can still query the few fields you care about. (Plus you can index them for speedier lookup.)
I have always chosen to implement this functionality in a facade pattern. Since the point of the facade is to simplify (abstract) an underlying complexity as a boundary between two or more systems, it seemed like the perfect place to handle this.
I realize however that this does not quite answer the question. I am talking about the container for the marshalling while the question is about how to better manage the contents (the code that does the job).
My approach here is somewhat old fashioned, but since this an old question maybe that’s okay. I employ (as much as possible) stored procedures in the dB. This promotes better encapsulation than one typically finds with a code layer outside of the dB that has to “know about” dB structure. What inevitably happens in the latter case is that more than one system will be written to do this (one large company I worked at had at least 6 competing ESBs) and there will be conflicts. Also, usually the stored procedure scripting will benefit from some sort of IDE that will helps maintain contextual awareness of the dB structure.
So this approach - even though it is not a pattern per se - makes managing the ORM a lot easier.

MongoDB embedding vs SQL foreign keys?

Are there any particular advantages to MongoDB's ability to embed objects within a document, compared to SQL's use of foreign keys for the same logic?
It seems to me that the only advantage is ease of use (and perhaps performance?), and even that seems like it could be easily abstracted away (e.g. Django seems to handle SQL's foreign keys pretty intuitively).
This boils down to a classic question of whether to embed or not.
Here are a few links to get started before I explain some more:
Where should I put activities timeline in mongodb, embedded in user or separately?
MongoDB schema design -- Choose two collection approach or embedded document
MongoDB schema for storing user location history
Now to answer more specifically.
You must remember the server-side usage of foreign keys in SQL: JOINs. Embedding is a single round trip to get all the data you need in a single document however Joins are not, they are infact two selections based upon a range and then merged to omit duplicates (with significant overhead on some data sets).
So the use of foreign keys is not totally app dependant, it is also server and database dependant.
That being said some people misunderstand embedding in MongoDB and try and make all their data fit into one document. Unfortunately this is re-inforced by the common knowledge that you should always try to embed everything. The links and more will provide some useful guides on this.
Now that we cleared some things up the main pros of embedding over JOINs are:
Single round trip
Easy to update the document in a lot of cases, unless you embed many levels deep
Can keep entity data with the entity it is related to
However embedding has a few flaws:
The document must be paged in to get it's values, this can be problematic on larger documents
Subdocuments are designed to be unique to that entity that do not require advanced querying so you normally would not get two separate entities that are related together, i.e. a post could embed comments but a user probably wouldn't embed posts due to the query needs.
Nesting more than 3 levels deep could effect your ability to use things such as the atomic lock.
So when used right MongoDBs embedding can become a huge power over SQL Joins but you must understand when to use it right.
The core strength of Mongo is in its document-view of data, and naturally this can be extended to a "POCO" view of data. Mongo clients like the NoRM Project in .NET will seem astonishingly similar to experienced Fluent NHibernate users, and this is no accident - your POCO data models are simply serialized to BSON and saved in Mongo 1:1. No mappings required.
Overall, the biggest difference between these two technologies is the model and how developers have to think about their data. Mongo is better suited to rapid application development.

Avoid loading unnecessary data from db into objects (web pages)

Really newbie question coming up. Is there a standard (or good) way to deal with not needing all of the information that a database table contains loaded into every associated object. I'm thinking in the context of web pages where you're only going to use the objects to build a single page rather than an application with longer lived objects.
For example, lets say you have an Article table containing id, title, author, date, summary and fullContents fields. You don't need the fullContents to be loaded into the associated objects if you're just showing a page containing a list of articles with their summaries. On the other hand if you're displaying a specific article you might want every field loaded for that one article and maybe just the titles for the other articles (e.g. for display in a recent articles sidebar).
Some techniques I can think of:
Don't worry about it, just load everything from the database every time.
Have several different, possibly inherited, classes for each table and create the appropriate one for the situation (e.g. SummaryArticle, FullArticle).
Use one class but set unused properties to null at creation if that field is not needed and be careful.
Give the objects access to the database so they can load some fields on demand.
Something else?
All of the above seem to have fairly major disadvantages.
I'm fairly new to programming, very new to OOP and totally new to databases so I might be completely missing the obvious answer here. :)
(1) Loading the whole object is, unfortunately what ORMs do, by default. That is why hand tuned SQL performs better. But most objects don't need this optimization, and you can always delay optimization until later. Don't optimize prematurely (but do write good SQL/HQL and use good DB design with indexes). But by and large, the ORM projects I've seen resultin a lot of lazy approaches, pulling or updating way more data than needed.
2) Different Models (Entities), depending on operation. I prefer this one. May add more classes to the object domain, but to me, is cleanest and results in better performance and security (especially if you are serializing to AJAX). I sometimes use one model for serializing an object to a client, and another for internal operations. If you use inheritance, you can do this well. For example CustomerBase -> Customer. CustomerBase might have an ID, name and address. Customer can extend it to add other info, even stuff like passwords. For list operations (list all customers) you can return CustomerBase with a custom query but for individual CRUD operations (Create/Retrieve/Update/Delete), use the full Customer object. Even then, be careful about what you serialize. Most frameworks have whitelists of attributes they will and won't serialize. Use them.
3) Dangerous, special cases will cause bugs in your system.
4) Bad for performance. Hit the database once, not for each field (Except for BLOBs).
You have a number of methods to solve your issue.
Use Stored Procedures in your database to remove the rows or columns you don't want. This can work great but takes up some space.
Use an ORM of some kind. For .NET you can use Entity Framework, NHibernate, or Subsonic. There are many other ORM tools for .NET. Ruby has it built in with Rails. Java uses Hibernate.
Write embedded queries in your website. Don't forget to parametrize them or you will open yourself up to hackers. This option is usually frowned upon because of the mingling of SQL and code. Also, it is the easiest to break.
From you list, options 1, 2 and 4 are probably the most commonly used ones.
1. Don't worry about it, just load everything from the database every time: Well, unless your application is under heavy load or you have some extremely heavy fields in your tables, use this option and save yourself the hassle of figuring out something better.
2. Have several different, possibly inherited, classes for each table and create the appropriate one for the situation (e.g. SummaryArticle, FullArticle): Such classes would often be called "view models" or something similar, and depending on your data access strategy, you might be able to get hold of such objects without actually declaring any new class. Eg, using Linq-2-Sql the expression data.Articles.Select(a => new { a .Title, a.Author }) will give you a collection of anonymously typed objects with the properties Title and Author. The generated SQL will be similar to select Title, Author from Article.
4. Give the objects access to the database so they can load some fields on demand: The objects you describe here would usaly be called "proxy objects" and/or their properties reffered to as being "lazy loaded". Again, depending on your data access strategy, creating proxies might be hard or easy. Eg. with NHibernate, you can have lazy properties, by simply throwing in lazy=true in your mapping, and proxies are automatically created.
Your question does not mention how you are actually mapping data from your database to objects now, but if you are not using any ORM framework at the moment, do have a look at NHibernate and Entity Framework - they are both pretty solid solutions.