What is the best way to represent a form with hundreds of questions in a model - oop

I am trying to design income tax return software.
What is the best way to represent/store a form with hundreds of questions in a model?
Just for this example, I need at least 6 models (T4, T4A(OAS), T4A(P), T1032, UCCB, T4E) which possibly contain hundreds of fields.
Is it by creating hundreds of fields? Storing values in a map? An array?

One very generic approach could be XML
XML allows you to
nest your data to any degree
combine values and meta information (attributes and elements)
describe your data in detail with XSD
store it externally
maintain it easily
even combine it with additional information (look at processing instructions)
and (last but not least) store the real data in almost the same format as the model...
and (laster but even not leaster :-) ) there is XSLT to transform your XML data into any other format (such as HTML for nice presentation)
XML is well supported in all major languages and database systems.
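As a purely illustrative sketch of that idea (the form code, box numbers and labels are just examples, not a prescribed schema), here is what one slip could look like as nested XML, read with Python's standard library:
import xml.etree.ElementTree as ET

# Illustrative only: meta information lives in attributes, values in elements.
doc = ET.fromstring("""
<return year="2023">
  <form code="T4">
    <field name="box_14" type="decimal" label="Employment income">52000.00</field>
    <field name="box_22" type="decimal" label="Income tax deducted">9100.00</field>
  </form>
</return>
""")

# The data carries its own model: the same document tells you each field's
# type and label, and simple XPath-style queries pull the values back out.
for field in doc.findall("./form[@code='T4']/field"):
    print(field.get("name"), field.get("type"), field.text)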
Another way could be a typical parts list (or bill of materials/BOM)
This tree structure is - typically - implemented as a table with a self-referenced parentID. Working with such a table needs a lot of recursion...
It is highly recommended to store your data type-safely. Either use a character storage format plus a type identifier (which means you have to cast your values here and there), or use separate type-safe side tables referenced from the main table.
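A minimal sketch of that parts-list idea in Python/sqlite3 (table and column names are my own invention, not a prescription):
import sqlite3

# A self-referencing "parts list" of form items with separate, type-safe value
# columns instead of one untyped text column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE form_item (
    id         INTEGER PRIMARY KEY,
    parent_id  INTEGER REFERENCES form_item(id),  -- NULL for the form root
    code       TEXT NOT NULL,                     -- e.g. 'T4' or 'box_14'
    label      TEXT,
    value_text TEXT,                              -- exactly one of the three
    value_num  NUMERIC,                           -- value columns is filled,
    value_date TEXT                               -- depending on the field type
);
""")

# Walking the tree needs recursion; SQLite can push it into a recursive CTE.
rows = conn.execute("""
WITH RECURSIVE subtree(id, code, level) AS (
    SELECT id, code, 0 FROM form_item WHERE parent_id IS NULL
    UNION ALL
    SELECT f.id, f.code, s.level + 1
    FROM form_item f JOIN subtree s ON f.parent_id = s.id
)
SELECT code, level FROM subtree
""").fetchall()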
Furthermore, if some of your data is to be filled from lists, you should define a datasource so the selection list can be loaded dynamically.
Conclusion
What is best for you mainly depends on your needs: How often will the model change? How many rules are there to guarantee data integrity? Are you using an RDBMS? Which language/tools are you using?

With a case like this, the monolithic aggregate is probably unavoidable (unless you can deduce common fields). I'm going to exclude RDBMS since the topic seems to focus more on lower-level data structures and a more proprietary-style solution, though that could be a very valid option that can manage all these fields.
In this case, I think it becomes less about formalities and more about daily practicalities.
Probably the worst from that standpoint in this case is a formal object aggregating fields, like a class or struct with a boatload of data members. Those tend to be the most awkward and the most unattractive as monoliths, since they have a static nature about them. Depending on the language, declaration/definition/initialization could be separate, which means 2-3 lines of code to maintain per field. If you want to read/write these fields from a file, you have to write a separate line of code for each and every field, and maintain and update all that code if new fields are added or existing ones removed. If you start approaching anything resembling polymorphic needs in this case, you might have to write a boatload of branching code for each and every field, and that too has to be maintained.
So I'd say hundreds of fields in a static kind of aggregate is, by far, the most unmaintainable.
Arrays and maps are effectively the same thing to me here in a very language-agnostic sense provided that you need those key/value pairs, with only potential differences in where you store the keys and what kind of algorithmic complexity is involved. Whatever you do, probably a key search in this monolith should be logarithmic time or better. 'Maps/associative arrays' in most languages tend to inherently have this quality.
Those can be far more suitable, and you can achieve the kind of runtime flexibility that you like on top of those (like being able to manage these from a file and add the fields on the fly with no pre-existing knowledge). They'll be far more forgiving here.
So if the choice is between a bunch of fields in a class and something resembling a map, I'd suggest going for a map. The dynamic nature of it will be far more forgiving for these kinds of cases and will typically far outweigh the compile-time benefits of, say, checking to make sure a field actually exists and producing a syntax error otherwise. That kind of checking is easy to add back in and more if we just accept that it will occur at runtime.
An exception that might make the field solution more appealing is if you involve reflection and more dynamic techniques to generate an object with the appropriate fields on the fly. Then you get back those dynamic benefits and flexibility at runtime. But that might be more unwieldy to initialize the structure, could involve leaning a lot more heavily on heavy-duty (and possibly very computationally-expensive) introspection and type manipulation and code generation mechanisms, and also end up with more funky code that's hard to maintain.
So I think the safest bet is the map or associative array, and a language that lets you easily add new fields, inspect existing ones, etc. with very fast turnaround. If the language doesn't inherently have that quality, you could look to an external file to dynamically add fields, and just maintain the file.
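As a rough illustration of that map-based approach in Python (the field definitions, box numbers and labels below are all hypothetical and would live in an external file in practice):
import json

# Hypothetical field definitions kept outside the code, so adding a field
# means editing data, not redeclaring class members.
FIELD_DEFS = json.loads("""
{
  "T4":  [{"key": "box_14", "label": "Employment income", "type": "decimal"},
          {"key": "box_22", "label": "Income tax deducted", "type": "decimal"}],
  "T4E": [{"key": "box_14", "label": "Total benefits paid", "type": "decimal"}]
}
""")

class Form:
    """One form instance: a map of field key -> value, checked against the defs."""
    def __init__(self, form_code):
        self.code = form_code
        self.defs = {d["key"]: d for d in FIELD_DEFS[form_code]}
        self.values = {}

    def set(self, key, value):
        if key not in self.defs:   # the "does this field exist" check, at runtime
            raise KeyError(f"{self.code} has no field {key}")
        self.values[key] = value

t4 = Form("T4")
t4.set("box_14", 52000.00)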

Related

Design patterns for translating dynamic content?

I want to store translations for user provided content.
I've seen some apps store separate records for translations in each locale, but
I want to store all the translations in one record by serializing them, i.e.
$post_title = serialize(['en_US' => $enUS['title'], 'fr_FR' => $frFR['title']]);
$post_content = serialize(['en_US' => $enUS['content'], 'fr_FR' => $frFR['content']]);
$sql = "INSERT INTO `posts` (`title`, `content`) VALUES (:post_title, :post_content)";
Is it a bad practice?
Assuming (from your INSERT query) that you are using a relational DB, it can be considered a bad practice because:
Time and memory overhead: To load a post in any language, you have to query for $post_title and $post_content, which hold the serialized data for all languages, so your DB query results have a larger memory footprint. You then have to unserialize the fields to get at the content in the desired language, which means extra cycles and unnecessary overhead whenever the front end does not need it, i.e. you are not showing the post in all available languages at the same time.
Concurrency: What if two translators update the translation of the same post into their respective languages at exactly the same time? With everything serialized into one column, each one reads, modifies and rewrites the whole value, so the last write wins and one translation is silently lost.
Breaking opacity: As mentioned in one of the comments, another problem of serialization in this case is breaking the intended abstraction and encapsulation provided by your DB interface. This might not be a big concern when dealing with localization data, but if your DBMS offers you column-level privileges, then serializing everything in one column prevents you from using that feature.
Inconsistency in data modeling: By using serialization in such a case, the conceptual data model (website with post content in multiple languages) ends up being incoherent with the logical data model ($post_content being something other than post content in a desired language). Simply put, by looking at the data model one cannot tell that these tables are used by a multilingual website.
i18n issues: Mixing various languages as a serialized value in one column limits the possibility of using multiple encodings for your content; e.g. keeping Japanese in Shift-JIS, German in Latin1, and everything else in UTF8. Also, it makes checking for mojibake or other encoding issues harder without unserializing the data.
It can be considered a good workaround if:
Showing all languages together: ...you never show the translations separately and they are always displayed and edited together.
Keep legacy data model: ...you want to keep a legacy data model intact. You do not have to change the existing data model to make way for multilingual content, which is helpful when internationalizing legacy software without messing too much with the data model or dealing with a DB migration.
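For contrast, here is a minimal sketch of the per-locale-row alternative mentioned in the question ("separate records for translations in each locale"), shown with Python/sqlite3 purely to illustrate the schema; the table and column names are hypothetical:
import sqlite3

# One row per post per locale, instead of serializing every translation
# into the posts row itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    id INTEGER PRIMARY KEY
);
CREATE TABLE post_translations (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    locale  TEXT    NOT NULL,        -- 'en_US', 'fr_FR', ...
    title   TEXT,
    content TEXT,
    PRIMARY KEY (post_id, locale)    -- one translation per locale
);
""")

# Loading a post in one language touches exactly one small row.
conn.execute("INSERT INTO posts (id) VALUES (1)")
conn.execute("INSERT INTO post_translations VALUES (1, 'fr_FR', 'Titre', 'Contenu')")
row = conn.execute(
    "SELECT title, content FROM post_translations WHERE post_id = ? AND locale = ?",
    (1, "fr_FR"),
).fetchone()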

What design patterns for marshalling JSON APIs to/from SQL

I'm working on my first JSON-RPC/JSON-REST API. One of the conveniences of JSON is that it can easily represent structured data (a user may have multiple email addresses, multiple addresses, etc.).
For example, the Facebook Graph API nicely represents the kind of thing that's handy to return as JSON objects:
https://fbcdn-dragon-a.akamaihd.net/hphotos-ak-ash3/851559_339008529558010_1864655268_n.png
However, in implementing an API such as this with a relational database, we end up shattering structured objects into very many tables (at least one for each list in the JSON object), and un-shattering them when responding to requests. So:
requires a lot of modelling (separate models for JSON object and SQL tables).
inconsistencies creep in between the models: e.g. user_id (in SQL) vs. userID (in JSON)
marshaling stuff between one model and the other is very time consuming (tedious, error-prone and pointless boilerplate).
What design-patterns exist to help in this situation?
I'm not sure you are looking for design patterns. I would look for tools that handle this better.
I assume that you want to be able to query these objects, and not just store them in TEXT fields. Many databases support XML fairly well, so I would convert the JSON to XML (with a library) and then store that in the database.
You may also want to consider a JSON document based database. That will definitely get you where you want to go.
If you don't need to be able to query these, or only need to query a very small subset of fields, just store the objects as text, and extract those query-able fields into actual columns. This way you don't need to touch the majority of the data, but you can still query the few fields you care about. (Plus you can index them for speedier lookup.)
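A small sketch of that last suggestion, assuming a hypothetical users table (the JSON document goes in verbatim, and only the fields you actually query get their own indexed columns):
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    id        INTEGER PRIMARY KEY,
    email     TEXT,                  -- extracted and indexable
    last_name TEXT,                  -- extracted and indexable
    document  TEXT                   -- the full JSON object, stored verbatim
)
""")
conn.execute("CREATE INDEX idx_users_email ON users(email)")

user = {"firstName": "Ada", "lastName": "Lovelace",
        "emails": ["ada@example.org"], "addresses": [{"city": "London"}]}
conn.execute(
    "INSERT INTO users (email, last_name, document) VALUES (?, ?, ?)",
    (user["emails"][0], user["lastName"], json.dumps(user)),
)

# Queries hit the extracted columns; the full object comes back untouched.
doc = json.loads(conn.execute(
    "SELECT document FROM users WHERE email = ?", ("ada@example.org",)
).fetchone()[0])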
I have always chosen to implement this functionality in a facade pattern. Since the point of the facade is to simplify (abstract) an underlying complexity as a boundary between two or more systems, it seemed like the perfect place to handle this.
I realize however that this does not quite answer the question. I am talking about the container for the marshalling while the question is about how to better manage the contents (the code that does the job).
My approach here is somewhat old fashioned, but since this is an old question maybe that's okay. I employ (as much as possible) stored procedures in the DB. This promotes better encapsulation than one typically finds with a code layer outside of the DB that has to "know about" the DB structure. What inevitably happens in the latter case is that more than one system will be written to do this (one large company I worked at had at least 6 competing ESBs) and there will be conflicts. Also, the stored procedure scripting will usually benefit from some sort of IDE that helps maintain contextual awareness of the DB structure.
So this approach - even though it is not a pattern per se - makes managing the ORM a lot easier.

How should I (if I should at all) implement Generic DB Tables without falling into the Inner-platform effect?

I have a db model like this:
tb_Computer (N - N) tb_Computer_Peripheral (N - 1) tb_Peripheral
Each computer has N peripherals. But each peripheral is different in nature and will have different fields. A keyboard will have a model, language, etc., and a network card has specifications about speed and such.
But I don't think it's viable to create as many tables as there are peripherals. Because one day someone will come up with a very specific peripheral and I don't want him to be unable to add it just because it is neither a keyboard nor a network card.
Is it bad practice to create a data field inside tb_Peripheral which contains JSON data about a specific peripheral?
I could even create a tb_PeripheralType with specific information about which data a specific type of peripheral has.
I read about this in many places and found everywhere that this is a bad practice, but I can't think of any other way to implement this the way I want, completely dynamic.
What is the best way to achieve what I want? Is the current model wrong? What would you do?
It's not a question of "good practices" or "bad practices". Making things completely dynamic has an upside and a downside. You have outlined the upside fairly well.
The downside of a completely dynamic design is that the process of turning the data into useful information is not nearly as routine as it is with a database that pins down the semantics of the data within the scope of the design.
Can you build a report and a report generating process that will adapt itself to the new structure of the data when you begin to add data about a new kind of peripheral? If you end up stuck with doing maintenance on the application when requirements change, what have you gained by making the database design completely dynamic?
PS: If the changes to the database design consist only of adding new tables, the "ripple effect" on your existing applications will be negligible.
I can think of four options.
The first is to create a table peripherals that would have all the information you could want about peripherals. This would have NULLs in the columns where the field is not appropriate to the type. When a new peripheral is added, you would have to add the descriptive columns.
The second is to create a separate table for each peripheral.
The third is to encode the information in something like JSON.
The fourth is to store the data as pairs. So each peripheral would have many different rows.
There are also hybrids of these approaches. For instance, you could store common fields in a single table (as in (1)) and then have key/value pairs for the other values.
The question is how this information is going to be used. I do most of my work directly in SQL, so the worst option for me is (3). I don't want to parse strange information formats to get something potentially useful to a SQL query.
Option (4) is the most flexible, but it also requires more work to get a complete picture of all the possible attributes.
If I were starting from scratch, and I had a pretty good idea of what fields I wanted, then I would start with (1), a single table for peripherals. If I had requirements where peripherals and attributes would be changing fairly regularly, then I would seriously consider (4). If the tables are only being used by applications, then I might consider (3), but I would probably reject it anyway.
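To make option (4) concrete, here is a sketch of the key/value (attribute) layout, keeping the question's tb_ naming but otherwise inventing the details:
import sqlite3

# Common fields stay in tb_Peripheral; everything type-specific becomes
# key/value rows, so a brand-new kind of peripheral needs no schema change.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tb_Peripheral (
    id    INTEGER PRIMARY KEY,
    type  TEXT NOT NULL,             -- 'keyboard', 'network_card', ...
    model TEXT
);
CREATE TABLE tb_Peripheral_Attribute (
    peripheral_id INTEGER NOT NULL REFERENCES tb_Peripheral(id),
    name          TEXT    NOT NULL,  -- 'language', 'speed_mbps', ...
    value         TEXT,
    PRIMARY KEY (peripheral_id, name)
);
""")

conn.execute("INSERT INTO tb_Peripheral VALUES (1, 'keyboard', 'K120')")
conn.execute("INSERT INTO tb_Peripheral_Attribute VALUES (1, 'language', 'pt-BR')")

# Reassembling "the complete picture" means joining the attribute rows back on.
rows = conn.execute("""
SELECT p.model, a.name, a.value
FROM tb_Peripheral p JOIN tb_Peripheral_Attribute a ON a.peripheral_id = p.id
WHERE p.type = 'keyboard'
""").fetchall()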
There is only one question to answer when you do this sort of design. JSON, a serialized object, XML, or (heaven forbid) a CSV: it doesn't really matter.
Do you want to consume them outside of the API that knows the structure?
Say, for example, you want to use SQL to get all peripherals of type keyboard with a number-of-keys property >= 102.
If you do, it gets messy, much messier than extra tables.
It's no different from, say, having a table of PDFs or docs and trying to find all the ones which have more than 10 pages.
Gets even funnier if you want to version the content as your application evolves.
Have a look at a NoSQL back end; it's designed for stuff like this, and a relational database is not.

Avoid loading unnecessary data from db into objects (web pages)

Really newbie question coming up. Is there a standard (or good) way to deal with not needing all of the information that a database table contains loaded into every associated object? I'm thinking in the context of web pages, where you're only going to use the objects to build a single page rather than an application with longer-lived objects.
For example, let's say you have an Article table containing id, title, author, date, summary and fullContents fields. You don't need the fullContents to be loaded into the associated objects if you're just showing a page containing a list of articles with their summaries. On the other hand, if you're displaying a specific article you might want every field loaded for that one article and maybe just the titles for the other articles (e.g. for display in a recent-articles sidebar).
Some techniques I can think of:
Don't worry about it, just load everything from the database every time.
Have several different, possibly inherited, classes for each table and create the appropriate one for the situation (e.g. SummaryArticle, FullArticle).
Use one class but set unused properties to null at creation if that field is not needed and be careful.
Give the objects access to the database so they can load some fields on demand.
Something else?
All of the above seem to have fairly major disadvantages.
I'm fairly new to programming, very new to OOP and totally new to databases so I might be completely missing the obvious answer here. :)
(1) Loading the whole object is, unfortunately, what ORMs do by default. That is why hand-tuned SQL performs better. But most objects don't need this optimization, and you can always delay optimization until later. Don't optimize prematurely (but do write good SQL/HQL and use good DB design with indexes). By and large, though, the ORM projects I've seen result in a lot of lazy approaches, pulling or updating way more data than needed.
(2) Different models (entities), depending on operation. I prefer this one (see the sketch after this list). It may add more classes to the object domain, but to me it is cleanest and results in better performance and security (especially if you are serializing to AJAX). I sometimes use one model for serializing an object to a client, and another for internal operations. If you use inheritance, you can do this well. For example CustomerBase -> Customer. CustomerBase might have an ID, name and address. Customer can extend it to add other info, even stuff like passwords. For list operations (list all customers) you can return CustomerBase with a custom query, but for individual CRUD operations (Create/Retrieve/Update/Delete), use the full Customer object. Even then, be careful about what you serialize. Most frameworks have whitelists of attributes they will and won't serialize. Use them.
(3) Dangerous; special cases will cause bugs in your system.
(4) Bad for performance. Hit the database once, not once per field (except for BLOBs).
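A sketch of option (2) in Python, loosely mirroring the CustomerBase -> Customer idea above but using the Article example from the question (the class names, connection object and column list are assumptions, not a prescribed API):
import sqlite3
from dataclasses import dataclass

@dataclass
class ArticleSummary:
    id: int
    title: str
    author: str
    summary: str

@dataclass
class FullArticle(ArticleSummary):
    full_contents: str

def load_summaries(conn: sqlite3.Connection):
    # Only the cheap columns for list pages.
    return [ArticleSummary(*row) for row in
            conn.execute("SELECT id, title, author, summary FROM Article")]

def load_article(conn: sqlite3.Connection, article_id: int):
    # The full row only when a single article is actually displayed.
    row = conn.execute(
        "SELECT id, title, author, summary, fullContents FROM Article WHERE id = ?",
        (article_id,),
    ).fetchone()
    return FullArticle(*row)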
You have a number of methods to solve your issue.
Use Stored Procedures in your database to remove the rows or columns you don't want. This can work great but takes up some space.
Use an ORM of some kind. For .NET you can use Entity Framework, NHibernate, or Subsonic. There are many other ORM tools for .NET. Ruby has one built in with Rails. Java uses Hibernate.
Write embedded queries in your website. Don't forget to parameterize them or you will open yourself up to SQL injection. This option is usually frowned upon because of the mingling of SQL and code. Also, it is the easiest to break.
From your list, options 1, 2 and 4 are probably the most commonly used ones.
1. Don't worry about it, just load everything from the database every time: Well, unless your application is under heavy load or you have some extremely heavy fields in your tables, use this option and save yourself the hassle of figuring out something better.
2. Have several different, possibly inherited, classes for each table and create the appropriate one for the situation (e.g. SummaryArticle, FullArticle): Such classes would often be called "view models" or something similar, and depending on your data access strategy, you might be able to get hold of such objects without actually declaring any new class. E.g., using Linq-2-Sql the expression data.Articles.Select(a => new { a.Title, a.Author }) will give you a collection of anonymously typed objects with the properties Title and Author. The generated SQL will be similar to select Title, Author from Article.
4. Give the objects access to the database so they can load some fields on demand: The objects you describe here would usually be called "proxy objects" and/or their properties referred to as being "lazy loaded". Again, depending on your data access strategy, creating proxies might be hard or easy. E.g. with NHibernate, you can have lazy properties by simply throwing in lazy="true" in your mapping, and proxies are automatically created.
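A hand-rolled sketch of such a proxy in Python, just to show the shape of the idea (class, table and column names are made up; an ORM would generate the equivalent for you):
class ArticleProxy:
    """Cheap fields up front; fullContents fetched only when first accessed."""

    def __init__(self, conn, article_id, title, author, summary):
        self._conn = conn
        self.id = article_id
        self.title = title
        self.author = author
        self.summary = summary
        self._full_contents = None

    @property
    def full_contents(self):
        if self._full_contents is None:            # lazy load on first access
            self._full_contents = self._conn.execute(
                "SELECT fullContents FROM Article WHERE id = ?", (self.id,)
            ).fetchone()[0]
        return self._full_contents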
Your question does not mention how you are actually mapping data from your database to objects now, but if you are not using any ORM framework at the moment, do have a look at NHibernate and Entity Framework - they are both pretty solid solutions.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our users would like to query across. There are two ways (we see) of implementing this: one, create a single database and a backend in Python that appends tables from each source database to the master database; and two, query across the separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
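If consolidation is the route, here is a rough sketch of appending each run's tables into a master SQLite file with ATTACH; the function, the table names and the source_file bookkeeping column are assumptions for illustration, not part of the question:
import sqlite3

def append_run(master_path, run_path, table):
    # Copy one table from a black-box output file into the master database,
    # remembering which file each row came from.
    conn = sqlite3.connect(master_path)
    conn.execute("ATTACH DATABASE ? AS run", (run_path,))
    # Create the master table from the source schema on first use (WHERE 0
    # copies the column layout without copying any rows).
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} AS "
        f"SELECT *, '' AS source_file FROM run.{table} WHERE 0"
    )
    conn.execute(f"INSERT INTO {table} SELECT *, ? FROM run.{table}", (run_path,))
    conn.commit()
    conn.execute("DETACH DATABASE run")
    conn.close()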
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column is going to have to be classified as either a fact or a dimensional attribute, and then, if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution, not a 'generic' solution, because generic solutions like the entity-attribute-value model often perform badly. Don't do client-side joining (unifying the results) inside your Python code, because that is very slow. Use SQL for joining; it is designed for that purpose. Users can also make their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your app; just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes, you can define views for backward compatibility that keep your app functioning.
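For instance, a sketch with made-up names, assuming a new release of the black box ships the data as measurements_v2 with renamed columns:
import sqlite3

# The view recreates the old shape so existing queries keep working.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurements_v2 (       -- the shape the new release produces
    sample_id     INTEGER,
    reading_value REAL,
    recorded_at   TEXT
);
CREATE VIEW measurements AS          -- the old shape your app still queries
SELECT
    sample_id,
    reading_value AS reading,        -- renamed columns mapped back
    recorded_at   AS measured_at
FROM measurements_v2;
""")
rows = conn.execute("SELECT reading, measured_at FROM measurements").fetchall()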
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features? That is possible, but then you will have to change your application anyway to take advantage of those new features.
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the consumer of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it were up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe a design like a star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.