Database layout tagging system - sql

I am creating a website for a customer who wants to be able to create articles. My idea is to tag the articles, so I am going to implement a tagging system.
What is the best design, from both an architectural and a performance perspective?
1. Have a table with all tags, plus a linking (many-to-many) table that connects tags to articles, like this:
articles table with ID
tags table with ID
linking table with columns Article.ID and Tags.ID
2. Have one table with articles and one with the tags for each article, like this:
articles table with ID
tags table with Article.ID and tag text
Thanks in advance!

Your first option is the most appropriate and the theoretically correct one.
I would guess your client doesn't want tags merely as a nice feature that everybody else has; they will want to search by tag. Even if they don't understand that need yet and only want tags because everyone around them has them, they will realize it soon.
The first option will give you better search performance.
Implement separate tables for articles and tags, plus a many-to-many table between them.
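A minimal sketch of that three-table layout, using Python's built-in sqlite3 module so it can be run as-is. The table and column names (articles, tags, article_tags) and the sample rows are assumptions made for illustration, not something the question specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE tags (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Linking table: one row per (article, tag) pair.
    CREATE TABLE article_tags (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        tag_id     INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (article_id, tag_id)
    );
    -- Extra index for the "find articles by tag" direction.
    CREATE INDEX idx_article_tags_tag ON article_tags(tag_id);
""")

conn.executemany("INSERT INTO articles (id, title) VALUES (?, ?)",
                 [(1, "Intro to SQL"), (2, "Cooking pasta"), (3, "SQL indexes")])
conn.executemany("INSERT INTO tags (id, name) VALUES (?, ?)",
                 [(1, "sql"), (2, "food")])
conn.executemany("INSERT INTO article_tags (article_id, tag_id) VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 1)])


def articles_with_tag(conn, tag_name):
    """Return the titles of all articles carrying the given tag."""
    rows = conn.execute("""
        SELECT a.title
        FROM articles a
        JOIN article_tags atag ON atag.article_id = a.id
        JOIN tags t ON t.id = atag.tag_id
        WHERE t.name = ?
        ORDER BY a.id
    """, (tag_name,))
    return [title for (title,) in rows]
```

Searching by tag is then a join through the linking table on integer keys, rather than a text comparison against every row of a per-article tag table.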

Definitely the first option.
Apart from the other benefits, you could enforce some regularity in using tags, by checking if the tag (or a similar one) is already present before adding it, allowing users to select from existing tags, and/or allowing only superusers to add new tags.
This way you avoid misspellings or alternate spellings of the same tag (e.g. US, USA, USofA, U.S.A., U.S, US., America, Amerika, Amrica and so on when labelling something about the United States).
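One way to check for an existing or similar tag before adding a new one is to store a normalized key next to the display name. The sketch below is only an illustration: the norm column and the normalization rule (lowercase, drop everything except letters and digits) are assumptions. It catches punctuation and case variants such as USA and U.S.A., but not true synonyms such as America, which would still need a human or a synonym table.

```python
import re
import sqlite3


def normalize(tag):
    """Assumed rule: lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", tag.lower())


def get_or_create_tag(conn, raw_name):
    """Return the id of a matching existing tag, or insert a new one."""
    key = normalize(raw_name)
    if not key:
        raise ValueError("tag has no letters or digits")
    row = conn.execute("SELECT id FROM tags WHERE norm = ?", (key,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO tags (name, norm) VALUES (?, ?)",
                       (raw_name.strip(), key))
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tags (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        norm TEXT NOT NULL UNIQUE  -- normalized key used for duplicate checks
    )
""")

first = get_or_create_tag(conn, "USA")
second = get_or_create_tag(conn, "U.S.A.")   # same normalized key as "USA"
third = get_or_create_tag(conn, "America")   # a synonym, so still a new tag
```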

Related

Parent/Child Design For a Basic Social Network Using SQL

I'm trying to build a simple structure for a social-network-style application, but I am a little confused about how to design the relationship between posts, comments and media.
Simply put, media can be an image or a video (distinguished by an enumeration). It contains size and URL info about a standard image, a thumbnail and a video (according to the mediaType enumeration). A post may have multiple media attached to it. A comment may have multiple media attached to it. But once a media item is used in one place, it cannot be used in another. No other post can have it. No other comment can have it. Also, when I implement users, they can refer to an image-type media item as their profilePic. Once there is a messaging feature, some media might be attached to a message, etc. So, I want things to be a little flexible.
I didn't want to add specific columns such as thumbnailWidth, thumbnailSize, thumbnailURL etc. to multiple tables, because that would be too much repetition. So I've decided to use a centralized media table to hold all the main information about an uploaded image or video.
I've decided to put the thumbnail and standard image info in the same row; otherwise it felt too complicated to handle. I may split images and videos into separate tables later.
Note: I don't have a structure for comments to reply to each other. That is a later concern :)
Here is the current design without the connection between media and other tables.
Media
----------
id
thumbnail_width
thumbnail_height
thumbnail_URL
standard_width
standard_height
standard_URL
media_type ("video" or "image")
source_URL (only used if media_type is "video")
(maybe other columns to be used with "video" type)
user_id (who uploaded the media)
Post
----------
id
title
body
user_id (who sent the post.)
Comment
----------
id
body
post_id (which post has this comment)
user_id
Option 1
So one option is to add nullable commentId and postId fields to the media table.
If a media item is attached to a post, put the postId there. If it is attached to a comment, do the same with the commentId. If one of them has a value, the others must be null. But this may result in too many reference columns in the media table, because media might be used in a lot of places as the project grows.
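The "if one has a value, the others must be null" rule can be pushed into the database with a CHECK constraint. The following is a rough sketch using Python's sqlite3 with a cut-down media table; the column list is reduced to the two reference columns, and the constraint expression is one assumed way to write the rule (SQLite evaluates IS NOT NULL to 0 or 1, so the sum counts the owners).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE post    (id INTEGER PRIMARY KEY);
    CREATE TABLE comment (id INTEGER PRIMARY KEY);
    CREATE TABLE media (
        id         INTEGER PRIMARY KEY,
        post_id    INTEGER REFERENCES post(id),
        comment_id INTEGER REFERENCES comment(id),
        -- At most one owner column may be set.
        CHECK ((post_id IS NOT NULL) + (comment_id IS NOT NULL) <= 1)
    );
    INSERT INTO post (id) VALUES (1);
    INSERT INTO comment (id) VALUES (1);
""")


def attach_media(conn, media_id, post_id=None, comment_id=None):
    """Insert a media row; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO media (id, post_id, comment_id) VALUES (?, ?, ?)",
            (media_id, post_id, comment_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Each new kind of owner (message, profile picture and so on) still adds another column and another term to the CHECK, which is the growth problem described above.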
Option 2
Another option is creating a table for each relationship, like:
PostMedia
----------
id
post_id
media_id (unique. one-to-one relationship with Media table)
CommentMedia
----------
id
comment_id
media_id (unique. one-to-one relationship with Media table)
But now it becomes harder to check whether a media item is already used in a post before saving a comment, or the other way around: we would need to check the whole PostMedia table each time.
Another situation: when a user sets an image-type media item as their profile picture, we need to check whether it was used in a post or a comment. I'm not sure about this constraint, but it might come in handy in some situations.
I might add an ownerType enumeration to the media table, with values such as post, comment, profilePic etc. The PostMedia table could then reference a media item only if its ownerType is post.
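One way the ownerType idea could be enforced by the database itself is a composite foreign key: the linking table repeats the owner type as a fixed value and references (id, owner_type) together. This is a sketch under assumptions, not a tested design for the full schema; the snake_case names and the owner_type values are illustrative, and the post table is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE media (
        id         INTEGER PRIMARY KEY,
        owner_type TEXT NOT NULL
                   CHECK (owner_type IN ('post', 'comment', 'profile_pic')),
        UNIQUE (id, owner_type)  -- target of the composite foreign key
    );
    CREATE TABLE post_media (
        post_id    INTEGER NOT NULL,
        media_id   INTEGER NOT NULL UNIQUE,  -- a media item is used only once
        owner_type TEXT NOT NULL DEFAULT 'post' CHECK (owner_type = 'post'),
        FOREIGN KEY (media_id, owner_type) REFERENCES media (id, owner_type)
    );
    INSERT INTO media (id, owner_type) VALUES (1, 'post'), (2, 'comment');
""")


def link_to_post(conn, post_id, media_id):
    """Link a media item to a post; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO post_media (post_id, media_id) VALUES (?, ?)",
            (post_id, media_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Because post_media.owner_type can only be 'post', the foreign key matches only media rows whose owner_type is also 'post', so no scan of the other linking tables is needed.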
Option 3
The centralized Media table idea is cool, but I think it comes with a lot of complexity. With an object-oriented design, I might just create an abstract media class, put all of the required columns and methods in that class and extend it as PostMedia, CommentMedia etc. That would be much easier to handle, but it ends up with a lot of identical columns and similar tables across the db. I don't know if it is a good design.
What would be the best practice here? I might be overcomplicating things, and there might be simpler solutions. I'm open to any advice :)
Thanks!

Creating a SOLR index for activity stream or newsfeed

I am trying to index the activity feed of a social portal I am building. The portal allows users to follow each other and get updates from the people they follow as an activity feed sorted by date.
For example, user A will be following users B, C, D, E & F. So user A should see all the posts from B, C, D, E & F on his/her activity feed.
Let's assume the post consist of just two fields.
1. The text of the post. (text_field)
2. The name/UID of the user who posted it. (user_field)
Currently, I am creating an index for all the posts and indexing the text_field & user_field. At scale, there can be 1,000,000+ posts. A user may follow hundreds if not thousands of users. What is the best way to create an index for this scenario?
Should I also index a person's followers, so that they can be looked up quickly and then passed to a second query that gets the posts of all those users sorted by date?
What is the best way to query the index of all these posts by passing the UIDs of all the followed users, considering there may be hundreds or more?
Update:
The motivation for using Solr for the news feed mainly came from this detailed slide and my brief discussion with the OpenSocial team.
When starting off with a social portal, fan-out on write seems like overkill and is more expensive, whereas fan-out on read is better. Both the slide and the OpenSocial team suggested using a search backend for fan-out on read. The slide mentioned above also has data on how it helped them.
At present, the feed is going to be flat and the only sort criterion will be the date (recency). We won't be considering relevance or posts from closer groups.
It's kind of abstract, but I will do my best here. Based on what you mentioned, I am not sure if Solr is really the right tool for the job here. You can still have Solr for full text search, but I am not sure about generating a news feed from it in this scenario. Remember that although Solr is pretty impressive, it is a search engine. I will pretend that you will stick with Solr for the rest of the post, keep in mind that we are trying to put a square peg through a round hole here though.
Here are a few additional questions you should think about.
You will probably want to add a timestamp of the post to the data element
You need to figure out how to properly sort the results. Is it in order of recency? Or based on posts that the user is more likely to interact with?
If a user has 1000+ connections, would he want to see an update from every one of them in the main feed? Or should posts from a closer group of friends show up higher?
Here are some comments about your questions:
1) If you index a person's followers, it may be hard to keep the index up to date. I am assuming followers will change often, and re-indexing in that scenario would not really be practical.
2) That sounds more on par, but again, you need to figure out the sorting. You can get a list of connections for the user, then run a search for top posts from all of them.
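The "get the list of connections, then search their posts" step could look roughly like the sketch below, which only builds the request parameters and does not talk to a server. It assumes Solr's terms query parser is available for filtering on many UIDs, that UIDs contain no commas, and that a date field named timestamp has been added to each post; that field name and the build_feed_params helper are hypothetical.

```python
def build_feed_params(followed_uids, rows=20, start=0):
    """Build Solr select parameters for a recency-sorted feed.

    Assumes UIDs contain no commas, since the terms parser splits on them.
    """
    if not followed_uids:
        raise ValueError("user follows nobody, so there is nothing to query")
    return {
        "q": "*:*",
        # Filter query: match posts whose user_field is any followed UID.
        "fq": "{!terms f=user_field}" + ",".join(followed_uids),
        # "timestamp" is an assumed field; recency is the only sort criterion.
        "sort": "timestamp desc",
        "rows": rows,
        "start": start,
    }


params = build_feed_params(["B", "C", "D", "E", "F"])
```

Putting the UID list in a filter query rather than the main query keeps scoring out of the picture, which fits a flat feed sorted purely by date.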

How to build tag hierarchy on rails using acts-as-taggable-on

I have a rails app that includes tagging for blog posts using the gem acts-as-taggable-on. My idea is to extend the tagging mechanism of this gem using a moderate-link approach where I can choose to create a few users as tag owners and they can choose to link one tag to another as parent/child.
Presently the system has independent tags like Education, Child Education and Distance Education.
The tag owner of Education can select Child Education and Distance Education as first-level children and link them together. This relationship won't be visible until it's approved by the Taxonomist (a tag administrator).
Similarly, an end user can suggest that the Distance Education tag become a child of Education, and this request will become visible to the tag administrator. The relationship is established once he approves it.
These are the few questions I have pertaining to the requirement above:-
Is it recommended to extend the gem, or should I use an independent tagging model written from scratch to support this hierarchical system?
If I go ahead with the schema provided by the gem, what kind of model should be used for such a requirement? Specifically, should I use a single table with a parent_id column alongside the tag id and tag name? Or should I maintain the relationships in a separate table with many-to-many associations (tag_id, parent_tag_id as a foreign key)?
I am also new to data structures, so I will need some initial input on the choice of algorithms to efficiently traverse a tag family. Using a linked list was one of my options; however, considering the Rails mantra of convention over configuration, I am really unsure how to proceed.
I remember doing something similar about 4 years ago with ActsAsTree.
There's also an example of how to do it manually here.
Both options need a parent_id column on your tags table and are really straightforward. Just create a tag.rb in your models folder and extend the Tag class.
P.S. It's been a long time, but I remember having to check that there are no loops, so keep that in mind.
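The loop check amounts to walking up the parent chain from the proposed parent and making sure the tag itself never appears. Here is a language-neutral sketch, written in Python rather than Ruby; the parent_of dictionary stands in for the parent_id column and the function name is made up for illustration.

```python
def would_create_loop(parent_of, tag, new_parent):
    """Return True if making new_parent the parent of tag would form a cycle.

    parent_of maps each tag to its current parent (None for root tags).
    """
    seen = set()
    current = new_parent
    while current is not None:
        if current == tag:
            return True   # tag is its own ancestor via new_parent
        if current in seen:
            return True   # the existing data already contains a loop
        seen.add(current)
        current = parent_of.get(current)
    return False


parent_of = {
    "Education": None,
    "Child Education": "Education",
    "Distance Education": "Education",
}
```

In a Rails model the same walk would typically live in a validation that runs before the approved parent link is saved.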

Storing Blog Comments/Upvotes - Tracking Users?

I am working on a blog-type website in ASP.NET MVC 3. I am trying to figure out how to deal with post upvotes/downvotes (I will have to know which users have already voted where, to prevent spam voting). Comments on a blog post are another issue.
My thoughts so far(I am sure they are pretty far off the mark):
Votes:
Store a list of UserIDs in a voted field of my Blog table.
For each user in my Users table, store a list of all PostIDs they have voted on.
Comments:
Make a separate Comments table and in that table have a field referencing the parent blog post.
Store a list of CommentIDs in a Comment field in my Blogs table.
I know there are several other ways to go about this but I am trying to set this up so that I won't have to rewrite the whole thing should I get an influx of users.
You might wanna consider creating a Votes table like
User|Post|Type?
john|43 |Up
mary|43 |Down
making User + Post a composite primary key, and thus indexing by both... Then you can easily check whether a user has already voted on a post... You can also create additional indexes by user or post if needed...
It'd also be a good idea to keep the current up and down counts in the blogs table, so you don't have to count them each time...
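A small sketch of such a Votes table using Python's sqlite3, so that the composite primary key rejects a second vote by the same user on the same post. The column names and the cast_vote and tally helpers are assumptions; here the counts are computed on demand, where the suggestion above would cache them on the blog row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (
        user_id   TEXT    NOT NULL,
        post_id   INTEGER NOT NULL,
        vote_type TEXT    NOT NULL CHECK (vote_type IN ('Up', 'Down')),
        PRIMARY KEY (user_id, post_id)  -- one vote per user per post
    );
""")


def cast_vote(conn, user_id, post_id, vote_type):
    """Record a vote; return False if this user already voted on the post."""
    try:
        conn.execute(
            "INSERT INTO votes (user_id, post_id, vote_type) VALUES (?, ?, ?)",
            (user_id, post_id, vote_type),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def tally(conn, post_id):
    """Return (ups, downs) for a post, counted from the votes table."""
    return conn.execute(
        """
        SELECT COALESCE(SUM(vote_type = 'Up'), 0),
               COALESCE(SUM(vote_type = 'Down'), 0)
        FROM votes WHERE post_id = ?
        """,
        (post_id,),
    ).fetchone()


cast_vote(conn, "john", 43, "Up")
cast_vote(conn, "mary", 43, "Down")
```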

Product catalogue storage in mongoDB from an RDBMS perspective

I have a product page with an URL of the form http://host/products/{id}/{seo-friendly-url}, where the "seo-friendly-url" part may be something like category/subcategory/product.
The products controller gets the product with the specified ID and then ensures that the URL that follows is correct for the product - if it isn't the user is redirected to the appropriate URL (all URLs in the shop are generated correctly though, the redirect is just to maintain a canonical URL in the case of mistyping by the user or the URL changing since Google crawled it etc). The ID ensures fast product look-up, and the part on the end ensures keywords make it into the URL.
To check the URL, I have a SQL view which utilises a recursive common table expression to concatenate the product URL chunk with the URLs of its parent category URLs all the way up the hierarchy (generally just 3 deep).
I recently came across document-oriented storage and I can see it being very useful in a variety of situations (for example, my product entities have tags, multibuy prices, attributes etc., all currently in different tables).
So on to my question - how can I achieve the above functionality in mongoDB, or is there a better way to think about it? The naive way would be to retrieve each category in the hierarchy individually, but I'm assuming that would be slow.
Related: I've read in the docs that skip/limit is slow for large result sets. Would this be noticeable for the maximum of, say, 10 pages of 25 products each likely to be present in a retail website category?
I think your best option is to just store the full slug with the product. Then when you get the product just check to see if the slug matches and if not, redirect. Now, the trade-off is that if you want to rename a category you will need to do a batch job to find all products in the category and change their slugs. The good news is that category renames will be much less common than views (hopefully) so your total load will be reduced.
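The slug check and the category-rename batch job might look like the sketch below. Plain dictionaries stand in for MongoDB documents so it runs without a server; the slug field name, the sample products and both helper functions are assumptions for illustration.

```python
def check_slug(product, requested_slug):
    """Compare the requested slug with the stored one.

    Returns ("ok", url) when they match, otherwise ("redirect", url),
    where url is always the canonical product URL.
    """
    canonical_url = f"/products/{product['id']}/{product['slug']}"
    action = "ok" if requested_slug == product["slug"] else "redirect"
    return action, canonical_url


def rename_category(products, old_prefix, new_prefix):
    """Batch job: rewrite the stored slug of every product under a category."""
    changed = 0
    for product in products:
        slug = product["slug"]
        if slug == old_prefix or slug.startswith(old_prefix + "/"):
            product["slug"] = new_prefix + slug[len(old_prefix):]
            changed += 1
    return changed


products = [
    {"id": 123, "slug": "garden/tools/steel-spade"},
    {"id": 124, "slug": "kitchen/knives/bread-knife"},
]
```

The page view path only reads one document and compares a string, while the cost of the hierarchy is paid in the rare rename job.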
Not sure how skip and limit are related to this question, except that they both involve mongodb. Anyway for 25 results it's really no problem. Limit isn't slow and in fact can speed things up if less than 100 (default first batch size). Skip can hurt performance, but only by making it as slow as if you fetched all skipped documents w/o the extra network traffic. Therefore I wouldn't skip 1 million docs, but skipping 100 would be fine.
You can model a collection called products, with the document like:
product:{id:someId,category:someCategory,subcategory:someSubCategory,productSlug:somenameslug}
The query to get the product given the id, category and subcategory would be something like:
db.products.find({id:123,category:cat,subcategory:subcat})
This sounds pretty simplistic, but given my understanding of your question, IMO this should be a good start.
For your other question, there are skip and limit modifiers to help with pagination.