Creating a DB for comment sections with multiple page tags - sql

I'm having a tough time choosing the correct database (SQL, NoSQL) for this use case even though it's so common.
This is the main record -
Contains a number of fields (which will probably change and be updated in the future).
Connected to a list of tags (which can contain up to 50 tags).
Connected to a comment section.
Page records will be queried by using the tags.
In general, the reads are more important (so writes could be more expensive) and the availability should be high.
The reason not to choose a MongoDB-style DB is the comment section: there are no joins, so the comment section would have to be embedded in the page, and the document size could grow too much.
Also, in CAP terms MongoDB puts less emphasis on availability, and availability is important to me.
The reason not to choose SQL is that the schema of the Page could be updated, so there is no fixed schema.
Also, the tags system would need another relational table and, as I understand it, that is bad for performance.
What's the best approach here?

Take a look at Postgres and you can have the best of both.
Postgres supports jsonb, and jsonb columns can be indexed (for example with a GIN index), so your search by tags can be executed pretty efficiently. Alternatively, keep the tags as an array data type, which can be indexed the same way.
If you are concerned about embedding the comments, then link off to another table and benefit from joins, which are first-class citizens.
Given your use-case, you could have a Pages table with main, well known columns and a few foreign keys to Authors etc, tags as an array or jsonb column, some page attributes in jsonb and your comments in a separate Comments table with foreign key to Users and Pages.
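To make that layout concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in so it runs without a server, and it assumes the SQLite build includes the JSON functions (json_each). The table and column names are hypothetical. In Postgres the tags column would be text[] or jsonb with a GIN index, and attributes would be jsonb.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,     -- well-known, fixed columns
    tags       TEXT NOT NULL,     -- JSON array (text[] or jsonb in Postgres)
    attributes TEXT NOT NULL      -- JSON object for fields that may change
);
CREATE TABLE comments (
    id      INTEGER PRIMARY KEY,
    page_id INTEGER NOT NULL REFERENCES pages(id),
    body    TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO pages (id, title, tags, attributes) VALUES (?, ?, ?, ?)",
    [
        (1, "Intro to Postgres", json.dumps(["sql", "postgres"]), json.dumps({"lang": "en"})),
        (2, "Document stores", json.dumps(["nosql"]), json.dumps({})),
    ],
)
conn.executemany(
    "INSERT INTO comments (page_id, body) VALUES (?, ?)",
    [(1, "first"), (1, "second")],
)


def pages_with_tag(tag):
    """Return (id, title) for every page whose tags array contains `tag`."""
    rows = conn.execute(
        "SELECT p.id, p.title FROM pages AS p, json_each(p.tags) AS t "
        "WHERE t.value = ? ORDER BY p.id",
        (tag,),
    )
    return rows.fetchall()


def comments_for(page_id):
    """Comments live in their own table and are joined to the page on demand."""
    rows = conn.execute(
        "SELECT c.body FROM comments AS c "
        "JOIN pages AS p ON p.id = c.page_id "
        "WHERE p.id = ? ORDER BY c.id",
        (page_id,),
    )
    return [body for (body,) in rows]
```

The page row stays small however many comments accumulate, because the comments are never embedded in it.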
Both MongoDB and Postgres are great choices.
PS: I have built far more on MongoDB than on Postgres, but I was really impressed by Postgres after a recent evaluation for a new project.


Is it ever a good idea to store an array as a field value, or store array values as records?

In my application I've got "articles" (similar to posts/tweets/articles) that are tagged with descriptive predefined tags, e.g. "difficult", "easy", "red", "blue", "business" etc.
These available tags are stored in a table, call it "tags" that contains all available tags.
Each article can be tagged with multiple tags, editable through a custom admin interface.
It could be tempting to simply bundle the tags for each entity into a stringified array of the IDs of each tag and store it alongside the article record in my "articles" table:
id | title | author | tags
1 | title | TG | "[1,4,7,12]"
though I'm sure this is a bad idea for a number of reasons, is there ever a reasonable reason to do the above?
I think you should read about Database normalization and decide for yourself. In short though, there are a number of issues with your proposal, but you may decide you can live with them.
The most obvious are:
What if an additional tag is added to row(1)? Do you have to first parse the field, check whether the tag is already present, then update the row to be tags.append(newTag)?
Worse still, what about deleting a tag? Search the tags, check that it is present, then re-create the tags field.
What if a tag is to change name - some moderation process, perhaps?
Worse again, what about different people specifying a tag name differently? It would be hard to rationalise.
What if you want to query data based on tags? Your query becomes far more complex than it would need to be.
Presentation: The client has to parse the tag in order to use it. What about the separator field? Change that and all clients have to change.
In short, all of these operations become harder and more cumbersome. Normalization is designed to overcome such issues. Probably the only reason for doing what you say, IMO, is that you're capturing the data as a one-off and it's informational only - that is, makes sense to a user but not to a system per-se. This is kind of like saying it's probably best avoided (again, IMO).
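To illustrate the first two points, here is a rough sketch of what adding or removing one tag involves when the tags live in a stringified array. The helper names add_tag and remove_tag are made up for the example, and it assumes the field is always valid JSON.

```python
import json


def add_tag(tags_field, new_tag_id):
    """Parse the whole field, check for duplicates, then rewrite the whole field."""
    tags = json.loads(tags_field)
    if new_tag_id not in tags:
        tags.append(new_tag_id)
    return json.dumps(tags, separators=(",", ":"))


def remove_tag(tags_field, tag_id):
    """Removing a tag also means re-creating the complete field."""
    tags = [t for t in json.loads(tags_field) if t != tag_id]
    return json.dumps(tags, separators=(",", ":"))


# The row from the question, held as a plain dict for illustration.
row = {"id": 1, "title": "title", "author": "TG", "tags": "[1,4,7,12]"}
row["tags"] = add_tag(row["tags"], 9)
```

Every writer has to repeat this read-modify-write cycle, and two concurrent editors can overwrite each other's changes unless the application adds its own locking.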
It seems to me like you want to have a separate table that stores tags and holds a foreign key which relates the tag records back to their parent record in the articles table (this is referred to as "normalizing" the database structure).
Doing it like you have suggested by cramming the tags into one field may seem to make sense now, but it will prove to be difficult to maintain and difficult/time consuming to pull the values out efficiently as your application grows in size or the amount of data grows a lot larger.
I would say that there are very few reasons to do what you have suggested, given how straightforward it is to create another table and set up a relationship to link keys between the two tables to maintain referential integrity.
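As a minimal sketch of that normalized layout (using Python's sqlite3 so it runs anywhere): the linking table article_tags and the sample tag IDs are assumptions for illustration, since the question only names the "articles" and "tags" tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE articles (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    author TEXT NOT NULL
);
CREATE TABLE tags (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- One row per (article, tag) pair; the primary key prevents duplicates.
CREATE TABLE article_tags (
    article_id INTEGER NOT NULL REFERENCES articles(id),
    tag_id     INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (article_id, tag_id)
);
""")

conn.execute("INSERT INTO articles (id, title, author) VALUES (1, 'title', 'TG')")
conn.executemany(
    "INSERT INTO tags (id, name) VALUES (?, ?)",
    [(1, "difficult"), (4, "red"), (7, "business"), (12, "blue")],
)
conn.executemany(
    "INSERT INTO article_tags (article_id, tag_id) VALUES (1, ?)",
    [(1,), (4,), (7,), (12,)],
)


def articles_with_tag(name):
    """Titles of all articles carrying the named tag."""
    rows = conn.execute(
        "SELECT a.title FROM articles AS a "
        "JOIN article_tags AS at ON at.article_id = a.id "
        "JOIN tags AS t ON t.id = at.tag_id "
        "WHERE t.name = ? ORDER BY a.id",
        (name,),
    )
    return [title for (title,) in rows]


def rename_tag(old_name, new_name):
    """Renaming touches one row in tags; no article rows are rewritten."""
    conn.execute("UPDATE tags SET name = ? WHERE name = ?", (new_name, old_name))
```

Adding or removing a tag is a single INSERT or DELETE on article_tags, and a rename never touches the articles.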
I totally agree that it CAN be a good idea. I am a strong advocate of storing tags in the database as a single delimited list of strings.
BUT: The reason that I agree is that I like to use Azure Search API to index these types of data, so the query to do a lookup based on tags is not done via SQL. (Using the Azure search API service is not necessary, but in my experience you will get much better performance and scalability by using a search index that is outside of the database.)
If your primary query language will be SQL (relational queries), then you are better off creating a child table that has a row for each tag; otherwise you will take a performance hit whenever your query has to split each value before it can analyse it.
Tagging is a concept that we use to get around relational data or hierarchical mapping, so to get the best performance do not try to use these relational concepts to query the tags. It is often best implemented in NoSQL data storage because they don't try to use the database to process the search queries.
I encourage you to store the data as a delimited string, and use an external indexing service to provide search and insights into your data. This is a good trade-off between keeping CRUD data access fast and maintaining indexes that are optimised for searching. Sure, you can optimise the DB and the search queries to make it work in SQL, but it can take effort to get it right.
Once your user base hits large volumes and you need to support multiple concurrent searches without affecting update performance you will find that external indexing is an awesome investment in your time now, to save you time and resources later.

SQL Full Text Search - Design Decision: Multiple Tables or One Huge One

I'm somewhat new to SQL but I'm creating a database where there's one table of movie metadata, and I need to do full text searches on the movie scripts, which are currently organized into large tables, one for each movie, with columns for line number, timestamp, and a body of text (which needs to be able to be searched for keywords, phrases). My question is whether my searches would run faster using one massive table for all of the scripts instead of one for each movie. I'm using SQLite and Python. I'm using fts4 to implement the full text search capabilities.
Putting each movie into a separate table is not an easy architecture to maintain, as you have to add or delete tables whenever a movie is inserted or removed.
Searching in a massive table can be efficient only with indexes with a good clustering factor.
Generally, as I see the whole problem, I think that adopting a document-based database like MongoDB will provide better storage and search of your data.
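For the single-table layout, here is a minimal sketch with SQLite and fts4 from Python, as in the question. It assumes the SQLite build has FTS4 enabled, and the table names, column names and sample lines are invented for the example. One ordinary table holds the per-line metadata and one FTS table holds the searchable text, linked by rowid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
-- One table for the lines of ALL scripts, instead of one table per movie.
CREATE TABLE script_lines (
    id          INTEGER PRIMARY KEY,
    movie_id    INTEGER NOT NULL REFERENCES movies(id),
    line_number INTEGER NOT NULL,
    timestamp   TEXT NOT NULL
);
-- Full-text index over the line text; its rowid mirrors script_lines.id.
CREATE VIRTUAL TABLE script_fts USING fts4(body);
""")


def add_line(movie_id, line_number, timestamp, body):
    """Store the metadata row, then index the text under the same rowid."""
    cur = conn.execute(
        "INSERT INTO script_lines (movie_id, line_number, timestamp) VALUES (?, ?, ?)",
        (movie_id, line_number, timestamp),
    )
    conn.execute(
        "INSERT INTO script_fts (rowid, body) VALUES (?, ?)",
        (cur.lastrowid, body),
    )


def search(query):
    """Run an FTS query across every script; returns (movie title, line number)."""
    rows = conn.execute(
        "SELECT m.title, l.line_number FROM script_fts "
        "JOIN script_lines AS l ON l.id = script_fts.rowid "
        "JOIN movies AS m ON m.id = l.movie_id "
        "WHERE script_fts MATCH ? "
        "ORDER BY m.id, l.line_number",
        (query,),
    )
    return rows.fetchall()


conn.executemany("INSERT INTO movies (id, title) VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")])
add_line(1, 1, "00:00:05", "the storm is coming tonight")
add_line(1, 2, "00:00:09", "close the harbor gates")
add_line(2, 1, "00:01:30", "a storm of letters arrived")
```

A keyword search is search("storm") and a phrase search wraps the phrase in double quotes, as in search('"harbor gates"'). Adding a movie is then a matter of inserting rows, not creating a table.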

MongoDB embedding vs SQL foreign keys?

Are there any particular advantages to MongoDB's ability to embed objects within a document, compared to SQL's use of foreign keys for the same logic?
It seems to me that the only advantage is ease of use (and perhaps performance?), and even that seems like it could be easily abstracted away (e.g. Django seems to handle SQL's foreign keys pretty intuitively).
This boils down to a classic question of whether to embed or not.
Here are a few links to get started before I explain some more:
Where should I put activities timeline in mongodb, embedded in user or separately?
MongoDB schema design -- Choose two collection approach or embedded document
MongoDB schema for storing user location history
Now to answer more specifically.
You must remember the server-side usage of foreign keys in SQL: JOINs. Embedding takes a single round trip to get all the data you need in a single document. JOINs do not work that way: they are in fact two selections based upon a range, which are then merged to omit duplicates (with significant overhead on some data sets).
So the use of foreign keys is not totally app dependent; it is also server and database dependent.
That being said, some people misunderstand embedding in MongoDB and try to make all their data fit into one document. Unfortunately this is reinforced by the common knowledge that you should always try to embed everything. The links above, and others, provide some useful guides on this.
Now that we cleared some things up the main pros of embedding over JOINs are:
Single round trip
Easy to update the document in a lot of cases, unless you embed many levels deep
Can keep entity data with the entity it is related to
However embedding has a few flaws:
The document must be paged in to get its values; this can be problematic on larger documents
Subdocuments are designed to be unique to their parent entity and not to require advanced querying, so you would not normally embed two separate entities that are related to each other, i.e. a post could embed comments but a user probably wouldn't embed posts due to the query needs.
Nesting more than 3 levels deep could affect your ability to use things such as the atomic lock.
So when used right, MongoDB's embedding can be a huge advantage over SQL JOINs, but you must understand when it is the right tool.
The core strength of Mongo is in its document-view of data, and naturally this can be extended to a "POCO" view of data. Mongo clients like the NoRM Project in .NET will seem astonishingly similar to experienced Fluent NHibernate users, and this is no accident - your POCO data models are simply serialized to BSON and saved in Mongo 1:1. No mappings required.
Overall, the biggest difference between these two technologies is the model and how developers have to think about their data. Mongo is better suited to rapid application development.

Is duplicating data in SQL and Document store (like MongoDB) a legit idea or should be avoided?

I have a question. I am considering using a data store for some types of objects (e.g. product data). The criterion for using a document store is that the object has a detail page, so a fast read of the entire object is necessary (for example, a product with all its attributes, images, comments etc). The criterion for using SQL is displaying lists (e.g. N newest, most popular etc).
Some objects meet both criteria; products are an example. So is it normal practice to store the info used for rendering lists on index pages in an SQL database, and the other data in a document store?
If denormalization is suitable for getting performance, go ahead with denormalization. But you have to ensure that you have a way to deal with updates of denormalized data. Your options in MongoDB are:
multiple queries to avoid denormalization
embedded docs
database references
make your choice..
The main idea is that MongoDB was created for denormalization and embedding. On one of my past projects I denormalized in SQL to get better performance, but I don't like SQL denormalization because it produces a lot of duplicated data (if you have a one-to-many relation, for example). The second step was rewriting the data access layer for MongoDB. In MongoDB, for some difficult pages where I needed to load multiple documents, I created a denormalized document (with embedded collections and plain data from different documents) to fit the page content. Now all my problem pages work fast, like Facebook ;).
But there are possible problems here, because you have to maintain the denormalized document on every update. Also, all my denormalized data updates run asynchronously, so some data can be stale at a given moment, but that is normal practice. Even Stack Overflow uses denormalization: sometimes when I open a question I see an answer, but when I go back to the questions list and refresh the page, the question is shown as having no answers.
If I need denormalization, I choose MongoDB.

Index strategy for tagged documents where tags can change often

In addition to text content my documents have tags which can be searched too. The problem now is that the tags change quite often and every time a tag gets added or removed I have to call UpdateDocument which is quite slow when done for hundreds of documents.
Are there any well performing strategies for storing tags that change often and need to be searched with Lucene? I have been thinking about keeping the tags in separate documents to keep them smaller but I can't figure out how to quickly search for tags AND content.
Store [tag, UID] pairs in a relational database. Every time a tag is added or updated, it is added and updated in this table in the database.
When performing a Lucene search that includes both tag data (stored in a database) and content (indexed in Lucene) you will need to merge the results together. One way you can do this is to:
1. Make a database query to pull up all the UIDs for the tag in question
2. Translate all the UIDs to Lucene doc IDs and set a bit in a BitSet for every matching Lucene doc ID
3. Create a Filter that wraps the BitSet, and pass that filter in to your search.
We implemented this approach in our system, and it works well. You might need to put a cache in front of the database for performance reasons, though. The particulars of step (3) will vary depending on which version of Lucene you're using.
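The idea can be sketched without Lucene at all. The following is only an illustration in plain Python, not Lucene API code: a list stands in for the index (position = doc ID), an integer stands in for the BitSet, and the names doc_tags, INDEX, tag_bitset and search are invented for the example.

```python
import sqlite3

# [tag, UID] pairs live in a relational table, separate from the text index.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE doc_tags (tag TEXT NOT NULL, uid TEXT NOT NULL, PRIMARY KEY (tag, uid))"
)
db.executemany(
    "INSERT INTO doc_tags (tag, uid) VALUES (?, ?)",
    [("urgent", "a"), ("urgent", "c"), ("draft", "b")],
)

# Stand-in for the text index: the list position plays the role of the doc ID.
INDEX = [
    ("a", "quarterly report"),
    ("b", "quarterly draft budget"),
    ("c", "holiday schedule"),
]
UID_TO_DOC_ID = {uid: doc_id for doc_id, (uid, _text) in enumerate(INDEX)}


def tag_bitset(tag):
    """Look up the UIDs for a tag, translate them to doc IDs and set one bit per doc."""
    bits = 0
    for (uid,) in db.execute("SELECT uid FROM doc_tags WHERE tag = ?", (tag,)):
        doc_id = UID_TO_DOC_ID.get(uid)
        if doc_id is not None:
            bits |= 1 << doc_id
    return bits


def search(term, allowed_bits):
    """Content search that only considers docs whose bit is set in the filter."""
    return [
        uid
        for doc_id, (uid, text) in enumerate(INDEX)
        if (allowed_bits >> doc_id) & 1 and term in text.split()
    ]
```

Here search("quarterly", tag_bitset("urgent")) only considers documents tagged "urgent". Retagging a document is a single row change in doc_tags, and the indexed content is never rewritten.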