How important is it that models be consistent across project components?

I have a project with two components, a server-side component and a client-side component. For various reasons, the client-side device doesn't carry a full copy of the database around.
How important is it that my models have a 1:1 correlation between the two sides? And, to extend the question to my bigger concern, are there any time bombs I'm going to run into down the line if they don't? I'm not talking about having different information on each side, but rather about the way the information is encapsulated varying between them. (Obviously, the storage mechanisms will also vary.) The server side will store each user, each review, and each 'item' in separate tables, and create links between them to gather data as necessary. The client side shouldn't have a complete user database, however, so rather than linking against the user to gather things like 'name', I'd store that on the review. In other words...
--- Server Side ---

Item:
  +id
  // Store stuff about the item

User:
  +id
  +Name
  -Password

Review:
  +id
  +itemId
  +rating
  +text
  +userId

--- Device Side ---

Item:
  +id
  +AverageRating

Review:
  +id
  +rating
  +text
  +userId
  +name

User:
  +id
  +Name
  // Stuff
The basic idea is that certain 'critical' information gets moved one level 'up'. A user gets the list of 'items' relevant to their query, with certain review-oriented info moved up (i.e. the average rating). If they want more info, they query the detail view for the item, and the actual reviews get queried and added to the dataset (and displayed). If they query the actual review, the review gets queried and they pick up some additional user info along the way (maybe; I'm not sure the user would have any use for the additional user information).
My basic concern is that I don't want to glut the user's bandwidth or local storage with a huge variety of information they just don't need, even if proper database normalization suggests that information REALLY should be stored at a 'lower' level.
I've phrased this as a fairly low-level conceptual issue because that's the level I'm trying to think it through at, but if it matters, I'm creating a PHP / MySQL server that provides data for an iOS / Core Data client.

In a situation where you're moving data across a network like this, or where you only store some but not all of the data on the client, it's fine not to have exactly the same structures on both sides of the divide.
However, you will want to use the same terminology and naming as much as possible to avoid confusion about what is what. (Using the same terminology is also something that is touched on in the Domain-Driven Design book.)
It is not a hugely problematic thing, but having the same structure on both ends will make things easier to understand. So, if you want to pursue this, you might want to consider using some encapsulation on the data that's being transferred rather than simplifying the client's model itself.
The pattern is typically known as Data Transfer Object, or DTO. Essentially you would use it to give structure to the data you're moving over the wire, which you could then deconstruct into the same model structure on your client. In PHP you could just use associative arrays to represent this structure, but using actual classes can help make sure the data assigned to the DTOs conforms to the expected format.
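For example, a device-side review DTO in PHP might look something like the sketch below. This is purely illustrative (the class, field, and sample values are made up to match the structures in the question, not taken from any real project), but it shows the idea of shaping the wire format explicitly while keeping the server's own models normalized:

<?php
// Illustrative sketch only: a DTO that carries a review plus the denormalized
// user name the device needs, regardless of how the server stores the data.
class ReviewDTO
{
    public function __construct(
        public int $id,
        public int $itemId,
        public int $rating,
        public string $text,
        public int $userId,
        public string $userName  // pulled up from the User table for the client
    ) {}

    // Shape the data exactly as the client expects to receive it.
    public function toArray(): array
    {
        return [
            'id'     => $this->id,
            'itemId' => $this->itemId,
            'rating' => $this->rating,
            'text'   => $this->text,
            'userId' => $this->userId,
            'name'   => $this->userName,
        ];
    }
}

// Hypothetical usage: build the DTO from the normalized server-side rows,
// then serialize it for the device.
$dto = new ReviewDTO(1, 42, 5, 'Great item', 19395, 'Alice');
echo json_encode($dto->toArray());

On the client, the same structure can then be deconstructed into the Core Data entities shown above.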

Related

Data modeling Issue for Moqui custom application

We are working on a custom project management application on top of the Moqui framework. Our requirement is that we need to notify the developers associated with a project by email about any changes to a ticket.
Currently we are using the WorkEffortParty entity to store all parties associated with the project, and then the PartyContactMech entity to store their email addresses. We need to iterate through WorkEffortParty and PartyContactMech every time to fetch all of the email addresses to which the ticket-change notifications should be sent.
To avoid these iterations, we are now thinking of adding a feature to store comma-separated email addresses at the project level. The project admin can add the email addresses of the associated parties, or a mailing-list address, to which ticket-change notifications should be sent.
For this requirement we studied the data model, but we couldn't find the right place to store this information. Do we need to extend an entity for this, or is there a best practice? This requirement is useful in any project management application. We appreciate any help on this data modeling problem.
The best practice is to use the existing data model elements where they are available. Having a normalized data model involves more work when querying data, but also gives more flexibility in addressing a wide variety of requirements without changes to the data structures.
In this case, with a joined query you can get a list of email addresses in a single query based on the project's workEffortId. If you are dealing with massive data and message volumes there are better solutions than denormalizing source data, but I doubt that's the case... unless you're dealing with more than thousands of projects and millions of messages per day, the basic query-and-iterate approach will work just fine.
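As a rough illustration of that joined query (sketched here as plain SQL through PDO rather than through Moqui's entity facade; the table and column names are guesses derived from the entity names in the question, so treat them as placeholders):

<?php
// Sketch only: Moqui would normally express this as a view-entity or an
// entity-find with member entities. The table/column names below are guesses
// based on the entity names in the question and may not match the real schema.
$pdo = new PDO('mysql:host=localhost;dbname=moqui', 'user', 'pass');

$projectWorkEffortId = 'DEMO_PROJECT'; // hypothetical project id

$stmt = $pdo->prepare(
    "SELECT cm.INFO_STRING AS email
     FROM WORK_EFFORT_PARTY wep
     JOIN PARTY_CONTACT_MECH pcm ON pcm.PARTY_ID = wep.PARTY_ID
     JOIN CONTACT_MECH cm ON cm.CONTACT_MECH_ID = pcm.CONTACT_MECH_ID
     WHERE wep.WORK_EFFORT_ID = :workEffortId"
);
$stmt->execute(['workEffortId' => $projectWorkEffortId]);

$emails = $stmt->fetchAll(PDO::FETCH_COLUMN); // one email address per row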
If you need to go beyond that, the easiest approach with Moqui is to use a DataDocument and DataFeed to send updates on the fly to ElasticSearch, and then use it for your high-volume queries and filtering (including arbitrarily complex filtering requirements).
Your question is too open-ended to answer directly; data modeling is a complex topic, and without a good understanding of the context and intended usage there are no good answers. In general it's best to start with a data model based on decades of experience and used in a large number of production systems. The Mantle UDM is one such model.

Relational or document database for storing instant messages? Maybe something else?

I started playing around with RavenDB a few days ago. I like it so far, but I am pretty new to the whole NoSQL world. I am trying to think of patterns for when to prefer it (or any other document DB, or any other NoSQL-style data store) over a traditional RDBMS. I do understand that "when you need to store documents or unstructured/dynamically structured data, opt for a document DB", but that just feels way too general to grasp.
Why? Because from what I've read, people have been writing examples of "documents" such as the order details in an e-commerce application or the form details of a workflow management application. But these have been built with RDBMSs for ages without too much trouble - for example, the details of an order, such as quantity, total price, discount, etc., are perfectly structured.
So I think there's an overlap here. But I am not asking for general advice on when to use what, because I believe the best approach for me is to figure it out myself through experimenting; so I am just going to ask about a concrete case along with my concerns.
So let's say I develop an instant messenger application which stores messages going back ages, like Facebook's messaging system does. I think using an RDBMS here is not suitable. My reason for this is that most people use instant messaging systems like this:
A: hi
B: hey
A: how r u?
B: good, u?
A: me 2
...
The thing to note is that most messages are very short, so storing each in a single row with this structure:
Messages(fromUserId, toUserId, sent, content)
feels very inefficient, because the "actual useful information" (the content) is very small, whereas the table would contain an incredible number of rows and therefore the indexes would grow huge. Add to this the fact that messages are sent very frequently, and the size of the indexes would have a huge impact on performance. So a very large number of rows must be managed and stored while every row contains a minimal amount of actual information.
In RavenDB, I would use a structure such as this:
// a Conversation object
{
  "FirstUserId": "users/19395",
  "SecondUserId": "users/19396",
  "Messages": [
    {
      "Order": 0,
      "Sender": "Second",
      "Sent": "2016-04-02T19:27:35.8140061",
      "Content": "lijhuttj t bdjiqzu "
    },
    {
      "Order": 1,
      "Sender": "Second",
      "Sent": "2016-04-02T19:27:35.8200960",
      "Content": "pekuon eul co"
    }
  ]
}
With this structure, I only need to find the conversation I am looking for: the one between User A and User B. Any message between User A and User B is stored in this object, regardless of which of them was the sender. So once I find the conversation between them - and there are far fewer conversations than actual messages - I can just grab all of the messages associated with it.
However, if the two participants talk a lot (and assuming that messages are stored for, let's say, 3 years), there can be tens of thousands of messages in a single conversation, causing the object to grow very large.
But there is one thing I don't know about how it works (specifically) in RavenDB. Does its internal storage and query mechanism allow the DB engine (not the client) to grab just, for example, the 50 most recent messages without reading the whole object? After all, it uses indexing on the properties of objects, but I haven't found any information about whether reading parts of an object is possible DB-side (that is, without the DB engine reading the whole object from disk, parsing it, and then sending back just the required parts to the client).
If it is possible, I think using Raven is the better option in this scenario; if not, then I am not sure. So please help me clear this up by answering the issue mentioned in the previous paragraph, along with any advice on which DB model would suit this scenario best. RDBMSs? Document DBs? Maybe something else?
Thanks.
I would say the primary distinctions will be:
Does your application consume the data in JSON? -- Then store it as JSON (in a document DB) and avoid serializing/deserializing it.
Do you need to run analytical workloads on the data? -- Then use SQL
What consistency levels do you need? -- SQL is made for high consistency; docDBs are optimized for lower consistency levels
Does your schema change much? -- then use a (schema-less) docDB
What scale are you anticipating? -- docDBs are usually easier to scale out
Note also that many modern cloud document databases (like Azure DocDB) can give you the best of both worlds as they support geo-replication, schema-less documents, auto-indexing, guaranteed latencies, and SQL queries. SQL Databases (like AWS Aurora) can handle massive throughput rates, but usually still require more hand-holding from a DBA.

web app OO concept confusion

This is a concept question, regarding "best practice" and "efficient use" of resources.
Specifically, it deals with large data sets in a DB and online web applications, and with moving from a procedural processing approach to a more object-oriented approach.
Take a "list" page, found in almost all CRUD aspects of the application. The list displays a company, address and contact. For the sake of argument, and a "proper" RDBMS, assume we've normalized the data such that a company can have multiple addresses and contacts.
- For our scenario, let's say I have a list of 200 companies, each with 2-10 addresses, and each address has a contact (i.e. any franchise where the 'store' is named 'McDonalds', but there may be multiple addresses under that 'name').
TABLES
companies
addresses
contacts
To this point, I'd make a single DB call and use joins to pull back ALL my data, loop over the data, and output each line... Some grouping would be done at the application layer to display things in a friendly manner. (This seems like the most efficient way, as the RDBMS did the heavy lifting - there was a minimum of network calls: one to the DB, one from the DB, one HTTP request, one HTTP response.)
Another way of doing this, if you couldn't group at the application layer, is to query for the company list, loop over that, and inside the loop make separate DB call(s) for the addresses and contacts. This is less efficient, because you're making multiple DB calls.
Now - the question, or sticking point.... Conceptually...
If I have a company object, an address object and a contact object, it seems that in order to achieve the same result you would call a 'getCompanies' method that would return a list, and you'd loop over the list and call 'getAddress' for each, and likewise a 'getContact', passing in the company ID etc.
In a web app this means A LOT more traffic from the application layer to the DB for the data, and a lot of smaller DB calls, etc. - it seems SERIOUSLY less efficient.
If you then move a fair amount of this logic to the client side, for an AJAX application, you're incurring network traffic ON TOP of the increased internal network overhead.
Can someone please comment on the best ways to approach this? Maybe it's a conceptual thing.
Someone suggested that a 'gateway' is for when you access these large data sets, as opposed to smaller, more granular object data - but this doesn't really help my understanding, and I'm not sure it's accurate.
Of course getting everything you need at once from the database is the most efficient. You don't need to give that up just because you want to write your code as an OO model. Basically, you get all the results from the database first, then translate the tabular data into a hierarchical form to fill objects with. "getCompanies" could make a single database call joining addresses and contacts, and return "company" objects that contain populated lists of "addresses" and "contacts". See Object-relational mapping.
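A rough PHP sketch of that approach (the table, column, and class names are invented for the example; this is not meant as a finished ORM):

<?php
// Illustrative sketch: fetch companies with their addresses and contacts in one
// joined query, then hydrate plain objects from the tabular result.
class Company
{
    public array $addresses = [];
    public array $contacts = [];

    public function __construct(public int $id, public string $name) {}
}

function getCompanies(PDO $pdo): array
{
    // One joined query pulls companies, addresses and contacts together.
    $sql = "SELECT c.id AS company_id, c.name AS company_name,
                   a.id AS address_id, a.street,
                   ct.id AS contact_id, ct.full_name
            FROM companies c
            LEFT JOIN addresses a ON a.company_id = c.id
            LEFT JOIN contacts ct ON ct.address_id = a.id";

    $companies = [];
    foreach ($pdo->query($sql) as $row) {
        $id = (int) $row['company_id'];
        $companies[$id] ??= new Company($id, $row['company_name']);
        if ($row['address_id'] !== null) {
            $companies[$id]->addresses[$row['address_id']] = $row['street'];
        }
        if ($row['contact_id'] !== null) {
            $companies[$id]->contacts[$row['contact_id']] = $row['full_name'];
        }
    }

    return array_values($companies); // one DB round trip, hierarchical objects out
}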
I've dealt with exactly this issue many times. The first and MOST important thing to remember is: don't optimize prematurely. Optimize your code for readability, the DRY principle, etc., then come back and fix things that are "slow".
However, specific to this case, rather than iteratively getting the addresses for each company one at a time, pass a list of all the company IDs to the fetcher, get all the addresses for all of those company IDs, and then cache that list of addresses in a map. When you need to fetch an address by addressID, fetch it from that local cache. This is called an IdentityMap. However, like I said, I don't recommend recoding the flow for this optimization until needed. Most often there are 10 things on a page, not 100, so you are only saving a few milliseconds by changing the "normal" flow to the optimized flow.
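For instance (a minimal sketch with invented table and column names), the batched fetch plus local map might look like this:

<?php
// Illustrative sketch: instead of one query per company, fetch all addresses for
// a batch of company IDs at once and keep them in a local map keyed by company id.
function fetchAddressesForCompanies(PDO $pdo, array $companyIds): array
{
    if ($companyIds === []) {
        return [];
    }

    $placeholders = implode(',', array_fill(0, count($companyIds), '?'));
    $stmt = $pdo->prepare(
        "SELECT company_id, id, street FROM addresses WHERE company_id IN ($placeholders)"
    );
    $stmt->execute(array_values($companyIds));

    $byCompany = []; // company_id => list of address rows
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $byCompany[$row['company_id']][] = $row;
    }
    return $byCompany;
}

// Hypothetical usage: one query for the whole page, then cheap in-memory lookups.
// $addressMap = fetchAddressesForCompanies($pdo, [1, 2, 3]);
// $addressesForCompany2 = $addressMap[2] ?? [];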
Of course, once you've done this 20 times, writing code in the "optimized flow" becomes more natural, but you also have the experience of when to do it and when not to.

Observing social web behavior: to log or populate databases?

When considering social web app architecture, is it a better approach to document user social patterns in a database or in logs? I thought for sure that behaviors, actions, and events would be stored strictly in a database, but I noticed that some of the larger social sites out there also track a lot by logging what happens.
Is it good practice to store prominent data about users in a database, and, since thousands of user actions can be spawned easily, should those actions simply be logged?
Remember that Facebook, for example, doesn't update user information per se; it just inserts your new information and uses the most recent version, keeping the old one. If you plan to take this approach it is HIGHLY recommended, if not mandatory, to use a NoSQL DB like Cassandra; you'll need speed over integrity.
Information = money. Update = lose information = lose money.
Obviously, it depends on what you want to do with it (and what you mean by "logging").
I'd recommend a flexible database storage. That way you can query it reasonably easily, and also make it flexible to changes later on.
Also, from a privacy point of view, it's appropriate to be able to easily associate items with certain entities so they can be removed, if so requested.
You're making an artificial distinction between "logging" and "database".
Whenever practical, I log to a database, even though this data will effectively be static and never updated. This is because the data analysis is much easier if you can cross-reference the log table with other, non-static data.
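As a small sketch of what that looks like in practice (the table and column names are invented for the example):

<?php
// Sketch only: an append-style log table that can be joined against regular
// application tables for analysis.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Record an event (the "log to a database" part).
$pdo->prepare("INSERT INTO user_events (user_id, event_type, created_at) VALUES (?, ?, NOW())")
    ->execute([19395, 'post_liked']);

// Cross-reference the log with non-static data, e.g. event counts per country.
$rows = $pdo->query(
    "SELECT u.country, e.event_type, COUNT(*) AS how_many
     FROM user_events e
     JOIN users u ON u.id = e.user_id
     GROUP BY u.country, e.event_type"
)->fetchAll(PDO::FETCH_ASSOC);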
Of course, if you have a high volume of things to track, logging to a SQL data table may not be practical, but in that case you should probably be considering some other kind of database for the application.

How to manage multiple versions of the same record

I am doing short-term contract work for a company that is trying to implement a check-in/check-out type of workflow for their database records.
Here's how it should work...
A user creates a new entity within the application. There are about 20 related tables that will be populated in addition to the main entity table.
Once the entity is created the user will mark it as the master.
Another user can make changes to the master only by "checking out" the entity. Multiple users can checkout the entity at the same time.
Once the user has made all the necessary changes to the entity, they put it in a "needs approval" status.
After an authorized user reviews the entity, they can promote it to master which will put the original record in a tombstoned status.
The way they are currently accomplishing the "check out" is by duplicating the entity records in all the tables. The primary keys include EntityID + EntityDate, so they duplicate the entity records in all related tables with the same EntityID and an updated EntityDate and give it a status of "checked out". When the record is put into the next state (needs approval), the duplication occurs again. Eventually it will be promoted to master at which time the final record is marked as master and the original master is marked as dead.
This design seems hideous to me, but I understand why they've done it. When someone looks up an entity from within the application, they need to see all current versions of that entity, and this was a very straightforward way of making that happen. But the fact that they are representing the same entity multiple times within the same table(s) doesn't sit well with me, nor does the fact that they are duplicating EVERY piece of data rather than only storing deltas.
I would be interested in hearing your reaction to the design, whether positive or negative.
I would also be grateful for any resources you can point me to that might be useful for seeing how someone else has implemented such a mechanism.
Thanks!
Darvis
I've worked on a system like this which supported the static data for trading at a very large bank. The static data in this case was things like the details of counterparties, standard settlement instructions, currencies (not FX rates), etc. Every entity in the database was versioned, and changing an entity involved creating a new version, changing that version, and getting the version approved. They did not, however, let multiple people create versions at the same time.
This led to a horribly complex database, with every join having to take version and approval state into account. In fact, the software I wrote for them was middleware that abstracted this complex, versioned data into something that end-user applications could actually use.
The only thing that could have made it any worse was to store deltas instead of complete versioned objects. So the point of this answer is - don't try to implement deltas!
This looks like an example of a temporal database schema. Often, in cases like this, a distinction is made between an entity's key (EntityID, in your case) and the row's primary key in the database (in your case {EntityID, date}, but often a simple integer). You have to accept that the same entity is represented multiple times in the database, at different points in its history. Every database row still has a unique ID; it's just that your database is tracking versions, rather than entities.
You can manage data like that, and it can be very good at tracking changes to data and providing accountability, if that is required, but it makes all of your queries quite a bit more complex.
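For example (a sketch with an invented table name, loosely mirroring the EntityID + EntityDate + status scheme described in the question), even "fetch the current master version of one entity" stops being a simple primary-key lookup:

<?php
// Illustrative only: columns mirror the EntityID + EntityDate scheme from the
// question, with a status column distinguishing master / checked-out / dead rows.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Fetch the current master version of one entity.
$stmt = $pdo->prepare(
    "SELECT *
     FROM entity_versions
     WHERE entity_id = :entityId
       AND status = 'master'
     ORDER BY entity_date DESC
     LIMIT 1"
);
$stmt->execute(['entityId' => 42]);
$current = $stmt->fetch(PDO::FETCH_ASSOC);

// Every join against this table needs the same version/status filtering,
// which is where the extra query complexity creeps in.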
You can read about the rationale behind, and the design of, temporal databases on Wikipedia.
You are describing a homebrew content management system which was probably hacked together over time, is (for the reasons you state) redundant and inefficient, and, given the nature of such systems in firms, is unlikely to be displaced without massive organizational effort.