Relational or document database for storing instant messages? Maybe something else? - ravendb

I started playing around with RavenDB a few days ago. I like it so far, but I am pretty new to the whole NoSQL world. I am trying to work out patterns for when to prefer it (or any other document DB, or any other NoSQL kind of data store) over traditional RDBMSs. I do understand that "when you need to store documents or unstructured/dynamically structured data, opt for a document DB", but that just feels way too general to grasp.
Why? Because from what I've read, people have been giving examples of "documents" such as order details in an e-commerce application or form details in a workflow management application. But these have been handled by RDBMSs for ages without too much trouble - for example, the details of an order, such as quantity, total price, discount, etc. are perfectly structured.
So I think there's an overlap here. But now, I am not asking for general advice on when to use what, because I believe the best thing for me would be to figure it out by myself through experimenting; so I am just going to ask about a concrete case along with my concerns.
So let's say I develop an instant messenger application which stores messages going back ages, like Facebook's messaging system does. I think using an RDBMS here is not suitable. My reason for this is that most people use instant messaging systems like this:
A: hi
B: hey
A: how r u?
B: good, u?
A: me 2
...
The thing to note is that most messages are very short, so storing each in a single row with this structure:
Messages(fromUserId, toUserId, sent, content)
feels very inefficient, because the "actual useful information (content)" is very small, whereas the table would contain an incredible number of rows and therefore the indexes would grow huge. Add to this the fact that messages are sent very frequently, and the size of the indexes would have a huge impact on performance. So a very large number of rows must be managed and stored while every row contains a minimal amount of actual information.
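For reference, here is a minimal sketch of that relational layout, using SQLite via Python purely for illustration; the column types and the composite index are my own guesses and not part of the question.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Messages (
    id         INTEGER PRIMARY KEY,
    fromUserId INTEGER NOT NULL,
    toUserId   INTEGER NOT NULL,
    sent       TEXT    NOT NULL,  -- ISO-8601 timestamp
    content    TEXT    NOT NULL
);
-- One index entry per message; with very many short messages this index
-- is what the question expects to grow huge.
CREATE INDEX ix_messages_pair_sent ON Messages (fromUserId, toUserId, sent);
""")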
In RavenDB, I would use a structure such as this:
// a Conversation object
{
    "FirstUserId": "users/19395",
    "SecondUserId": "users/19396",
    "Messages": [
        {
            "Order": 0,
            "Sender": "Second",
            "Sent": "2016-04-02T19:27:35.8140061",
            "Content": "lijhuttj t bdjiqzu "
        },
        {
            "Order": 1,
            "Sender": "Second",
            "Sent": "2016-04-02T19:27:35.8200960",
            "Content": "pekuon eul co"
        }
    ]
}
With this structure, I only need to find out which conversation I am looking for: the one between User A and User B. Any message between User A and User B is stored in this object, regardless of whether User A or User B was the sender. So once I find the conversation between them - and there are far fewer conversations than actual messages - I can just grab all of the messages associated with it.
However, if the two participants talk a lot (and assuming that messages are stored for, let's say, 3 years) there can be tens of thousands of messages in a single conversation causing the object to grow very large.
But there is one thing I don't know about how RavenDB specifically works. Does its internal storage and query mechanism allow the DB engine (not the client) to grab just, for example, the 50 most recent messages without reading the whole object? After all, it uses indexing on the properties of objects, but I haven't found any information about whether reading parts of an object is possible DB-side. (That is, without the DB engine reading the whole object from disk, parsing it and then sending back just the required parts to the client.)
If it is possible, I think using Raven is the better option in this scenario; if not, then I am not sure. So please help me clear this up by answering the issue mentioned in the previous paragraph, along with any advice on what DB model would suit this particular scenario best. RDBMSs? DocDBs? Maybe something else?
Thanks.

I would say the primary distinctions will be:
Does your application consume the data in JSON? -- Then store it as JSON (in a document DB) and avoid serializing/deserializing it.
Do you need to run analytical workloads on the data? -- Then use SQL
What consistency levels do you need? -- SQL is made for high consistency; docDBs are optimized for lower consistency levels
Does your schema change much? -- Then use a (schema-less) docDB
What scale are you anticipating? -- docDBs are usually easier to scale out
Note also that many modern cloud document databases (like Azure DocDB) can give you the best of both worlds as they support geo-replication, schema-less documents, auto-indexing, guaranteed latencies, and SQL queries. SQL Databases (like AWS Aurora) can handle massive throughput rates, but usually still require more hand-holding from a DBA.
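Whichever way you lean, the "50 most recent messages" requirement from the question is usually handled in a document store by keeping each message as its own small document keyed by a conversation id and paging with a query, instead of growing one giant conversation object. A rough sketch of that pattern, using MongoDB/pymongo purely as an example document store (collection and field names are invented, and this is not RavenDB-specific advice):

from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
messages = client["chat_demo"]["messages"]

# Compound index so "latest N messages of a conversation" is a cheap index scan.
messages.create_index([("conversationId", 1), ("sent", DESCENDING)])

def send_message(conversation_id, sender_id, content):
    # Each message is a tiny standalone document; the conversation never
    # becomes one huge object that must be read as a whole.
    messages.insert_one({
        "conversationId": conversation_id,
        "senderId": sender_id,
        "sent": datetime.now(timezone.utc),
        "content": content,
    })

def latest_messages(conversation_id, limit=50):
    # Only the requested page is pulled from the server.
    cursor = (messages.find({"conversationId": conversation_id})
                      .sort("sent", DESCENDING)
                      .limit(limit))
    return list(cursor)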

Related

Is it a good idea to store lists in document-based databases?

I’m trying to build a mobile app that involves users following each other. I’ve seen posts (here) that say it is a cardinal sin to store a user’s followees and followers as a list in a SQL database, as each “cell” should only store one discrete value.
However, is this the case for noSQL, document-based databases? What are the pros and cons of storing followers and followees as a list in the user document, vs storing it in a separate collection?
The only ones I can see now are that retrieving the follower/followee data could (?) be faster with the former method, as you don’t have to index the entire follower/followee collection, unlike the latter method (or is the time difference negligible?). On the other hand, it would require 2 writes every time someone follows/unfollows another user, which may be disadvantageous for billing in cloud databases, but might not be a problem if the database is hosted locally (?)
I’m very new to working with databases so I’m hoping to get some insight from more experienced people about long term/large scale effects of this choice. Thanks!
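For concreteness, here is a rough sketch of the two models being compared, using MongoDB/pymongo purely as an example document store (collection and field names are invented):

from pymongo import MongoClient

db = MongoClient()["social_demo"]

# Model A: embed follower/followee ids as arrays on the user documents.
# Reads are a single document fetch, but every follow costs two writes.
def follow_embedded(follower_id, followee_id):
    db.users.update_one({"_id": follower_id},
                        {"$addToSet": {"following": followee_id}})
    db.users.update_one({"_id": followee_id},
                        {"$addToSet": {"followers": follower_id}})

# Model B: one small document per follow edge in a separate collection.
# A single write per follow, but reads need an index on the edge collection.
db.follows.create_index("followee")

def follow_separate(follower_id, followee_id):
    db.follows.insert_one({"follower": follower_id, "followee": followee_id})

def followers_of(user_id):
    return [doc["follower"] for doc in db.follows.find({"followee": user_id})]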

web app OO concept confusion

This is a concept question, regarding "best practice" and "efficient use" of resources.
Specifically dealing with large data sets in a db and on-line web applications, and moving from a procedural processing approach to a more Object Oriented approach.
Take a "list" page, found in almost all CRUD aspects of the application. The list displays a company, address and contact. For the sake of argument, and "proper" RDBM, assume we've normalized the data such that a company can have multiple addresses, and contacts.
- for our scenario, lets say I have a list of 200 companies, each with 2-10 addresses, each address has a contact. i.e. any franchise where the 'store' is named 'McDonalds', but there may be multiple addresses by that 'name').
TABLES
companies
addresses
contacts
Up to this point, I'd make a single DB call and use joins to pull back ALL my data, loop over the data and output each line... Some grouping would be done at the application layer to display things in a friendly manner. (This seems like the most efficient way, as the RDBMS did the heavy lifting and there was a minimum of network calls: one to the db, one from the db, one HTTP request, one HTTP response.)
Another way of doing this, if you couldn't group at the application layer, is to query for the company list, loop over that, and inside the loop make separate DB call(s) for the address and contact. This is less efficient, because you're making multiple DB calls.
Now - the question, or sticking point.... Conceptually...
If I have a company object, an address object and a contact object - it seems that in order to achieve the same result - you would call a 'getCompanies' method that would return a list, and you'd loop over the list, calling 'getAddress' for each, and likewise 'getContact' - passing in the company ID etc.
In a web app - this means A LOT more traffic from the application layer to the DB for the data, and a lot of smaller DB calls, etc. - it seems SERIOUSLY less efficient.
If you then move a fair amount of this logic to the client side, for an AJAX application, you're incurring network traffic ON TOP of the increased internal network overhead.
Can someone please comment on the best ways to approach this. Maybe its a conceptual thing.
Someone suggested using a 'gateway' when you access these large data sets, as opposed to smaller, more granular object data - but this doesn't really help my understanding, and I'm not sure it's accurate.
Of course getting everything you need at once from the database is the most efficient. You don't need to give that up just because you want to write your code as an OO model. Basically, you get all the results from the database first, then translate the tabular data into a hierarchical form to fill objects with. "getCompanies" could make a single database call joining addresses and contacts, and return "company" objects that contain populated lists of "addresses" and "contacts". See Object-relational mapping.
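As a rough sketch of what that looks like (SQLite via Python purely for illustration; table and column names are invented): one joined query, and the flat rows are then folded into company objects with nested addresses and contacts.

import sqlite3

def get_companies(conn: sqlite3.Connection):
    rows = conn.execute("""
        SELECT c.id, c.name, a.id, a.street, p.id, p.name
        FROM   companies c
        JOIN   addresses a ON a.company_id = c.id
        JOIN   contacts  p ON p.address_id = a.id
        ORDER  BY c.id, a.id
    """)
    # Fold the tabular result into a hierarchical, object-like shape.
    companies = {}
    for c_id, c_name, a_id, street, p_id, p_name in rows:
        company = companies.setdefault(
            c_id, {"id": c_id, "name": c_name, "addresses": []})
        company["addresses"].append({"id": a_id, "street": street,
                                     "contact": {"id": p_id, "name": p_name}})
    return list(companies.values())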
I've dealt with exactly this issue many times. The first and MOST important thing to remember is : don't optimize prematurely. Optimize your code for readability, the DRY principle, etc., then come back and fix things that are "slow".
However, specific to this case, rather than iteratively getting the addresses for each company one at a time, pass a list of all the company IDs to the fetcher, get all the addresses for all those company IDs, then cache that list of addresses in a map. When you need to fetch an address by addressID, fetch it from that local cache. This is called an IdentityMap. However, like I said, I don't recommend recoding the flow for this optimization until needed. Most often there are 10 things on a page, not 100, so you are saving only a few milliseconds by changing the "normal" flow to the optimized flow.
Of course, once you've done this 20 times, writing code in the "optimized flow" becomes more natural, but you also have the experience of when to do it and when not to.
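A small sketch of the IdentityMap idea described above (SQLite via Python purely for illustration; names are invented):

import sqlite3

class AddressIdentityMap:
    """Batch-load the addresses for a set of companies once, then serve
    later lookups from memory instead of issuing a query per company."""

    def __init__(self, conn: sqlite3.Connection, company_ids):
        placeholders = ",".join("?" * len(company_ids))
        rows = conn.execute(
            f"SELECT id, company_id, street FROM addresses "
            f"WHERE company_id IN ({placeholders})", list(company_ids))
        self._by_id = {}
        self._by_company = {}
        for addr_id, company_id, street in rows:
            addr = {"id": addr_id, "company_id": company_id, "street": street}
            self._by_id[addr_id] = addr
            self._by_company.setdefault(company_id, []).append(addr)

    def by_id(self, addr_id):
        return self._by_id.get(addr_id)

    def for_company(self, company_id):
        return self._by_company.get(company_id, [])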

What's the best way to get a 'lot' of small pieces of data synced between a Mac App and the Web?

I'm considering MongoDB right now. Just so the goal is clear here is what needs to happen:
In my app, Finch (finchformac.com for details), I have thousands and thousands of entries per day for each user recording which window they had open, the time they opened it, the time they closed it, and a tag if they chose one for it. I need this data to be backed up online so it can sync to their other Mac computers, etc. I also need to be able to draw charts online from their data, which means some complex queries hitting hundreds of thousands of records.
Right now I have tried using Ruby/Rails/Mongoid with a JSON parser on the app side, sending up data in increments of 10,000 records at a time; the data is then processed into other collections with a background MapReduce job. But this all seems to block and is ultimately too slow. What recommendations does anyone have for how to go about this?
You've got a complex problem, which means you need to break it down into smaller, more easily solvable issues.
Problems (as I see it):
1. You've got an application which is collecting data. You just need to store that data somewhere locally until it gets sync'd to the server.
2. You've received the data on the server and now you need to shove it into the database fast enough so that it doesn't slow down.
3. You've got to report on that data, and this sounds hard and complex.
You probably want to write this as some sort of API. For simplicity (and since you've got loads of spare processing cycles on the clients) you'll want these chunks of data processed on the client side into JSON, ready to import into the database. Once you've got JSON you don't need Mongoid (you just throw the JSON into the database directly). Also, you probably don't need Rails since you're just creating a simple API, so stick with just Rack or Sinatra (possibly using something like Grape).
Now you need to solve the whole "this all seems to block and is ultimately too slow" issue. We've already removed Mongoid (so no need to convert from JSON -> Ruby objects -> JSON) and Rails. Before we get on to doing a MapReduce on this data you need to ensure it's getting loaded into the database quickly enough. Chances are you should architect the whole thing so that your MapReduce supports your reporting functionality. For sync'ing of data you shouldn't need to do anything but pass the JSON around. If your data isn't writing into your DB fast enough you should consider sharding your dataset. This will probably be done using some user-based key, but you know your data schema better than I do. You need to choose your sharding key so that when multiple users are sync'ing at the same time they will probably be using different servers.
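A minimal sketch of the "throw the JSON into the database directly" step (pymongo shown here instead of the Ruby stack the question uses; collection and field names are invented):

import json
from pymongo import MongoClient

events = MongoClient()["finch_demo"]["window_events"]

def ingest_chunk(raw_json: str):
    """Accept one uploaded chunk (a JSON array of records) and bulk-insert
    it as-is, without mapping the records into ORM objects first."""
    records = json.loads(raw_json)
    # A user id on every record can double as the shard key, so concurrent
    # syncs from different users land on different servers.
    events.insert_many(records, ordered=False)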
Once you've solved Problems 1 and 2 you need to work on your Reporting. This is probably supported by your MapReduce functions inside Mongo. My first comment on this part is to make sure you're running at least Mongo 2.0. In that release 10gen sped up MapReduce (my tests indicate that it is substantially faster than 1.8). Other than this you can achieve further increases by sharding and directing reads to the secondary servers in your replica set (you are using a replica set?). If this still isn't working, consider structuring your schema to support your reporting functionality. This lets you use more cycles on your clients to do work rather than loading your servers. But this optimisation should be left until after you've proven that conventional approaches won't work.
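The MapReduce advice above is specific to the Mongo 2.0 era; purely as a sketch, the same kind of per-user rollup on a current server would normally be written as an aggregation pipeline (field names are invented and assume timestamps stored as ISO-8601 strings):

from pymongo import MongoClient

events = MongoClient()["finch_demo"]["window_events"]

def daily_tag_report(user_id):
    pipeline = [
        {"$match": {"user_id": user_id}},
        # Group window entries per tag per calendar day.
        {"$group": {
            "_id": {"tag": "$tag", "day": {"$substr": ["$opened_at", 0, 10]}},
            "windows": {"$sum": 1},
        }},
        {"$sort": {"_id.day": 1}},
    ]
    return list(events.aggregate(pipeline))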
I hope that wall of text helps somewhat. Good luck!

Observing social web behavior: to log or populate databases?

When considering social web app architecture, is it a better approach to document user social patterns in a database or in logs? I thought for sure that behavior, actions, events would be strictly database stored but I noticed that some of the larger social sites out there also track a lot by logging what happens.
Is it good practice to store prominent data about users in a database, and since thousands of user actions can be spawned easily, should those actions simply be logged?
Remember that Facebook, for example, doesn't update user information per se; they just insert your new information and use the most recent version, keeping the old one. If you plan to take this approach, it is HIGHLY recommended, if not mandatory, to use a NoSQL DB like Cassandra; you'll need speed over integrity.
Information = money. Update = lose information = lose money.
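A rough sketch of that insert-only pattern (pymongo is used here purely for illustration, even though the answer suggests Cassandra; collection and field names are invented):

from datetime import datetime, timezone
from pymongo import MongoClient, DESCENDING

profiles = MongoClient()["social_demo"]["profile_versions"]
profiles.create_index([("user_id", 1), ("recorded_at", DESCENDING)])

def record_profile(user_id, data):
    # Never update in place: every change is a new row, so history is kept.
    profiles.insert_one({"user_id": user_id,
                         "recorded_at": datetime.now(timezone.utc),
                         **data})

def current_profile(user_id):
    # "The user's information" is simply the most recent version.
    return profiles.find_one({"user_id": user_id},
                             sort=[("recorded_at", DESCENDING)])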
Obviously, it depends on what you want to do with it (and what you mean by "logging").
I'd recommend a flexible database storage. That way you can query it reasonably easily, and also make it flexible to changes later on.
Also, from a privacy point of view, it's appropriate to be able to easily associate items with certain entities so they can be removed, if so requested.
You're making an artificial distinction between "logging" and "database".
Whenever practical, I log to a database, even though this data will effectively be static and never updated. This is because the data analysis is much easier if you can cross-reference the log table with other, non-static data.
Of course, if you have a high volume of things to track, logging to a SQL data table may not be practical, but in that case you should probably be considering some other kind of database for the application.

Multi-tenancy with SQL/WCF/Silverlight

We're building a Silverlight application which will be offered as SaaS. The end product is a Silverlight client that connects to a WCF service. As the number of clients is potentially large, updating needs to be easy, preferably so that all instances can be updated in one go.
Not having implemented multi tenancy before, I'm looking for opinions on how to achieve
Easy upgrades
Data security
Scalability
Three different models to consider are listed on msdn
Separate databases. This is not easy to maintain as all schema changes will have to be applied to each customer's database individually. Are there other drawbacks? A pro is data separation and security. This also allows for slight modifications per customer (which might be more hassle than it's worth!)
Shared Database, Shared Schema. A TenantID column is added to each table. Ensuring that each customer gets the correct data is potentially dangerous. Easy to maintain and scales well (?).
Shared Database, Separate Schemas. Similar to the first model, but each customer has its own set of tables in the database. Hard to restore backups for a single customer. Maintainability otherwise similar to model 1 (?).
Any recommendations on articles on the subject? Has anybody explored something similar with a Silverlight SaaS app? What do I need to consider on the client side?
Depends on the type of application and scale of data. Each one has downfalls.
1a) Separate databases + single instance of WCF/client. Keeping everything in sync will be a challenge. How do you upgrade X number of DB servers at the same time, what if one fails and is now out of sync and not compatible with the client/WCF layer?
1b) "Silos", separate DB/WCF/Client for each customer. You don't have the sync issue but you do have the overhead of managing many different instances of each layer. Also you will have to look at SQL licensing, I can't remember if separate instances of SQL are licensed separately ($$$). Even if you can install as many instances as you want, the overhead of multiple instances will not be trivial after a certain point.
3) Basically same issues as 1a/b except for licensing.
2) Best upgrade/management scenario. You are right that maintaining data isolation is a huge concern (1a technically shares this issue at a higher level). The other issue is that if your application is data intensive you have to worry about data scalability. For example, if every customer is expected to have tens or hundreds of millions of rows of data, then you will start to run into query performance issues for individual customers due to total customer-base volumes. Clients are more forgiving of slowdowns caused by their own data volume; being told it's slow because the other 99 clients' data is large is generally a no-go.
Unless you know for a fact you will be dealing with huge data volumes from the start I would probably go with #2 for now, and begin looking at clustering or moving to 1a/b setup if needed in the future.
We also have a SaaS product and we use solution #2 (Shared DB/Shared Schema with TenantId). Some things to consider for shared DB / same schema for all:
As mentioned above, a high volume of data for one tenant may affect performance for the other tenants if you're not careful; for starters, index your tables properly/carefully and never ever do queries that force a table scan. Monitor query performance and at least plan/design to be able to partition your DB later on, based on some criteria that makes sense for your domain.
Data separation is very, very important; you don't want to end up showing a piece of data to one tenant that belongs to another tenant. Every query must have a WHERE TenantId = ... in it, and you should be able to verify/enforce this during dev (see the sketch after this list).
Extensibility of the schema is something that solutions 1 and 3 may give you more easily, but you can work around it by designing a way to extend the fields associated with the documents/tables in your domain where that makes sense (i.e. metadata for tables, as the msdn article mentions).
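As an illustration of the "every query carries the TenantId" rule, here is a small guard helper, sketched with SQLite via Python and invented names; a real app would push this into the data-access layer or database-level row security:

import sqlite3

class TenantScopedDb:
    """Wraps a connection so every read is forced to filter by TenantId."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: int):
        self._conn = conn
        self._tenant_id = tenant_id

    def query(self, table: str, where: str = "1=1", params=()):
        # The table name comes from application code, never from user input;
        # the TenantId filter is always appended server-side.
        sql = f"SELECT * FROM {table} WHERE TenantId = ? AND ({where})"
        return self._conn.execute(sql, (self._tenant_id, *params)).fetchall()

# Hypothetical usage:
# db = TenantScopedDb(sqlite3.connect("saas.db"), tenant_id=42)
# invoices = db.query("Invoices", "Total > ?", (100,))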
What about solutions that provide an out of the box architecture like Apprenda's SaaSGrid? They let you make database decisions at deploy and maintenance time and not at design time. It seems they actively transform and manage the data layer, as well as provide an upgrade engine.
I have a similar case, but my solution takes advantage of both approaches.
Where the data lives and how it is placed is the tenant's question. As a tenant, of course I don't want my data to be shared; I want my data isolated and secure, and I want to be able to get it at any time.
Certain data can possibly be shared, e.g. a company list. So there should be a global database plus tenant databases; just make sure the tenant database schema is locked down in operation, and have a procedure to update all the tenant databases at once.
Anyway, in the SaaS model everything is delivered as a server/web service, so no matter where the database lives it comes to the client as a service and is only rendered by the client GUI.
Thanks
Existing answers are good. You should look deeply into the issue of upgrading and managing multiple databases. Without knowing the specific app, it might turn out easier to have multiple databases and not have to pay the extra cost of tracking the TenantID. This might not end up being the right decision, but you should certainly be wary of the dev cost of data sharing.