Facebook database design? - sql

I have always wondered how Facebook designed the friend <-> user relation.
I figure the user table is something like this:
user_email PK
user_id PK
password
I figure there is a table with the user's data (sex, age, etc.), connected via the user email or ID, I would assume.
How does it connect all the friends to this user?
Something like this?
user_id
friend_id_1
friend_id_2
friend_id_3
friend_id_N
Probably not, because the number of friends per user is unknown and will expand.

Keep a friend table that holds the UserID and then the UserID of the friend (we will call it FriendID). Both columns would be foreign keys back to the Users table.
Somewhat useful example:
Table Name: User
Columns:
UserID PK
EmailAddress
Password
Gender
DOB
Location
TableName: Friends
Columns:
UserID PK FK
FriendID PK FK
(This table features a composite primary key made up of the two foreign
keys, both pointing back to the user table. One ID will point to the
logged in user, the other ID will point to the individual friend
of that user)
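In SQL that could look roughly like this (a minimal sketch; MySQL-style syntax assumed, column types are only illustrative):

CREATE TABLE User (
    UserID       INT NOT NULL AUTO_INCREMENT,
    EmailAddress VARCHAR(255) NOT NULL,
    Password     VARCHAR(255) NOT NULL,
    Gender       CHAR(1),
    DOB          DATE,
    Location     VARCHAR(100),
    PRIMARY KEY (UserID)
);

CREATE TABLE Friends (
    UserID   INT NOT NULL,
    FriendID INT NOT NULL,
    PRIMARY KEY (UserID, FriendID),            -- composite PK, prevents duplicate pairs
    FOREIGN KEY (UserID)   REFERENCES User (UserID),
    FOREIGN KEY (FriendID) REFERENCES User (UserID)
);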
Example Usage:
Table User
--------------
UserID  EmailAddress  Password  Gender  DOB       Location
-----------------------------------------------------------
1       bob@bob.com   bobbie    M       1/1/2009  New York City
2       jon@jon.com   jonathan  M       2/2/2008  Los Angeles
3       joe@joe.com   joseph    M       1/2/2007  Pittsburgh

Table Friends
---------------
UserID  FriendID
----------------
1       2
1       3
2       3
This shows that Bob is friends with both Jon and Joe, and that Jon is also friends with Joe. In this example we assume that friendship is always two-way, so you do not need rows such as (2,1) or (3,2); they are already represented in the other direction. For cases where friendship or another relation is not inherently two-way, you would also need those rows to represent both directions.
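If you store each friendship only once like this, a query for all of Bob's friends has to look at both columns; a minimal sketch (same tables and ids as above):

SELECT FriendID AS friend FROM Friends WHERE UserID = 1
UNION
SELECT UserID   AS friend FROM Friends WHERE FriendID = 1;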

TL;DR:
They use a stack architecture with cached graphs for everything above the MySQL bottom of their stack.
Long Answer:
I did some research on this myself because I was curious how they handle their huge amount of data and search it in a quick way. I've seen people complaining about custom made social network scripts becoming slow when the user base grows. After I did some benchmarking myself with just 10k users and 2.5 million friend connections - not even trying to bother about group permissions and likes and wall posts - it quickly turned out that this approach is flawed. So I've spent some time searching the web on how to do it better and came across this official Facebook article:
TAO: Facebook’s Distributed Data Store for the Social Graph
TAO: The power of the graph.
I really recommend watching the presentation from the first link above before continuing to read. It's probably the best explanation of how FB works behind the scenes that you can find.
The video and the article tell you a few things:
They're using MySQL at the very bottom of their stack
Above the SQL DB there is the TAO layer which contains at least two levels of caching and is using graphs to describe the connections.
I could not find anything on what software / DB they actually use for their cached graphs
Let's take a look at the example graph from that presentation; friend connections are top left:
Well, this is a graph. :) It doesn't tell you how to build it in SQL; there are several ways to do it, but this site has a good number of different approaches. Attention: consider that a relational DB is what it is: it's designed to store normalised data, not a graph structure, so it won't perform as well as a specialised graph database.
Also consider that you have to do more complex queries than just friends of friends, for example when you want to filter all locations around a given coordinate that you and your friends of friends like. A graph is the perfect solution here.
I can't tell you how to build it so that it will perform well but it clearly requires some trial and error and benchmarking.
Here is my disappointing test for just finding friends of friends:
DB Schema:
CREATE TABLE IF NOT EXISTS `friends` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `friend_id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Friends of Friends Query:
(
  SELECT friend_id
  FROM friends
  WHERE user_id = 1
) UNION (
  SELECT DISTINCT ff.friend_id
  FROM friends f
  JOIN friends ff ON ff.user_id = f.friend_id
  WHERE f.user_id = 1
)
I really recommend creating some sample data with at least 10k user records, each of them having at least 250 friend connections, and then running this query. On my machine (i7 4770k, SSD, 16 GB RAM) the result was ~0.18 seconds for that query. Maybe it can be optimized, I'm not a DB genius (suggestions are welcome). However, if this scales linearly you're already at 1.8 seconds for just 100k users and 18 seconds for 1 million users.
This might still sound OK-ish for ~100k users, but consider that you just fetched friends of friends and didn't run any more complex query like "display me only posts from friends of friends + do the permission check whether I'm allowed or NOT allowed to see some of them + do a sub query to check whether I liked any of them". You want to let the DB check whether you already liked a post, or you'll have to do it in code. Also consider that this is not the only query you run, and that you have more than one active user at the same time on a more or less popular site.
I think my answer answers the question of how Facebook designed their friends relationship very well, but I'm sorry that I can't tell you how to implement it in a way that will perform fast. Implementing a social network is easy, but making sure it performs well is clearly not - IMHO.
I've started experimenting with OrientDB to do the graph-queries and mapping my edges to the underlying SQL DB. If I ever get it done I'll write an article about it.
How can I create a well performing social network site?
Update 2021-04-10: I'll probably never ever write the article ;) but here are a few bullet points on how you could try to scale it:
Use different read and write repositories
Build specific read repositories based on faster non-relational DB systems made for that purpose, don't be afraid of denormalizing data. Write to a normalized DB but read from specialized views.
Use eventual consistency
Take a look at CQRS
For a social network, graph-based read repositories might also be a good idea.
Use Redis as a read repository in which you store whole serialized data sets
If you combine the points from the list above in a smart way you can build a very well performing system. The list is not a "todo" list, you'll still have to understand, think about and adapt it! https://microservices.io/ is a nice site that covers a few of the topics I mentioned before.
What I do is to store events that are generated by aggregates and use projections and handlers to write to the different DBs as mentioned above. The cool thing about this is that I can re-build my data as needed at any time.
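As a rough illustration of the "normalized write model, denormalized read model" idea in plain SQL (the table and column names here are made up for this sketch; in practice the read side could just as well live in Redis or a document store):

-- write side stays normalized (the friends table from the benchmark above);
-- the read side keeps a precomputed friends-of-friends list per user,
-- refreshed by a projection/handler whenever a friendship event comes in.
CREATE TABLE IF NOT EXISTS read_friends_of_friends (
    user_id INT NOT NULL,
    fof_id  INT NOT NULL,
    PRIMARY KEY (user_id, fof_id)
);

-- rebuild the projection for user 1 after one of his friendships changed
DELETE FROM read_friends_of_friends WHERE user_id = 1;
INSERT INTO read_friends_of_friends (user_id, fof_id)
SELECT DISTINCT 1, ff.friend_id
FROM friends f
JOIN friends ff ON ff.user_id = f.friend_id
WHERE f.user_id = 1
  AND ff.friend_id <> 1;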

Have a look at the following database schema, reverse engineered by Anatoly Lubarsky:

My best bet is that they created a graph structure. The nodes are users and "friendships" are edges.
Keep one table of users, keep another table of edges. Then you can keep data about the edges, like "day they became friends" and "approved status," etc.
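A minimal sketch of such an edge table, with the friendship attributes stored on the edge itself (names, types and status values are only illustrative):

CREATE TABLE users (
    user_id INT NOT NULL PRIMARY KEY
    -- plus profile columns
);

CREATE TABLE friend_edges (
    user_id           INT NOT NULL,
    friend_id         INT NOT NULL,
    became_friends_on DATE,
    status            ENUM('pending', 'approved', 'blocked') NOT NULL DEFAULT 'pending',
    PRIMARY KEY (user_id, friend_id),
    FOREIGN KEY (user_id)   REFERENCES users (user_id),
    FOREIGN KEY (friend_id) REFERENCES users (user_id)
);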

It's most likely a many to many relationship:
FriendList (table)
user_id -> users.user_id
friend_id -> users.user_id
friendVisibilityLevel
EDIT
The user table probably doesn't have user_email as a PK, possibly as a unique key though.
users (table)
user_id PK
user_email
password
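In SQL that would be roughly (a sketch, types illustrative):

CREATE TABLE users (
    user_id    INT NOT NULL AUTO_INCREMENT,
    user_email VARCHAR(255) NOT NULL,
    password   VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id),
    UNIQUE KEY uq_users_email (user_email)   -- unique, but not the primary key
);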

Take a look at these articles describing how LinkedIn and Digg are built:
http://hurvitz.org/blog/2008/06/linkedin-architecture
http://highscalability.com/scaling-digg-and-other-web-applications
There's also "Big Data: Viewpoints from the Facebook Data Team" that might be helpful:
http://developer.yahoo.net/blogs/theater/archives/2008/01/nextyahoonet_big_data_viewpoints_from_the_fac.html
Also, there's this article that talks about non-relational databases and how they're used by some companies:
http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php
You'll see that these companies are dealing with data warehouses, partitioned databases, data caching and other higher-level concepts that most of us never deal with on a daily basis. Or at least, maybe we don't know that we do.
There are a lot of links on the first two articles that should give you some more insight.
UPDATE 10/20/2014
Murat Demirbas wrote a summary on
TAO: Facebook's distributed data store for the social graph (ATC'13)
F4: Facebook's warm BLOB storage system (OSDI'14)
http://muratbuffalo.blogspot.com/2014/10/facebooks-software-architecture.html
HTH

It's not possible to retrieve user-friends data from an RDBMS in constant time once the data crosses more than half a billion records,
so Facebook implemented this using a hash database (NoSQL), and they open-sourced the database, called Cassandra.
So every user has their own key and the friend details in a queue; to see how Cassandra works, look at this:
http://prasath.posterous.com/cassandra-55

It's a type of graph database:
http://components.neo4j.org/neo4j-examples/1.2-SNAPSHOT/social-network.html
It's not related to relational databases.
Google for graph databases.

You're looking for foreign keys. Basically you can't have an array in a database unless it has its own table.
Example schema:
Users Table
userID PK
other data
Friends Table
userID -- FK to the Users table, representing the user that has a friend.
friendID -- FK to the Users table, representing the user id of the friend.

Probably there is a table, which stores the friend <-> user relation, say "frnd_list", having fields 'user_id','frnd_id'.
Whenever a user adds another user as a friend, two new rows are created.
For instance, suppose my id is 'deep9c' and I add a user having id 'akash3b' as my friend, then two new rows are created in table "frnd_list" with values ('deep9c','akash3b') and ('akash3b','deep9c').
Now when showing the friends-list to a particular user, a simple SQL query would do that: "select frnd_id from frnd_list where user_id=<id>"
where <id> is the id of the logged-in user (stored as a session attribute).
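A sketch of the two inserts and the friends-list query (assuming the frnd_list table described above, with the example ids):

-- adding a friendship creates both directions, ideally in one transaction
START TRANSACTION;
INSERT INTO frnd_list (user_id, frnd_id) VALUES ('deep9c', 'akash3b');
INSERT INTO frnd_list (user_id, frnd_id) VALUES ('akash3b', 'deep9c');
COMMIT;

-- friends list for the logged-in user (id taken from the session)
SELECT frnd_id FROM frnd_list WHERE user_id = 'deep9c';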

Regarding the performance of a many-to-many table: if you have two 32-bit ints linking user IDs, that is 8 bytes per row, so the basic data storage for 200,000,000 users averaging 200 friends apiece is 200,000,000 × 200 × 8 bytes ≈ 320 GB, or just under 300 GiB.
Obviously, you would need some partitioning and indexing and you're not going to keep that in memory for all users.

Related

Use DB Relation To Avoid Redundancy

I have designed an ERD of movies and TV series which is confidential. I can give you an overview of the database.
It has more than 20 tables (more tables will be added later) and it is normalized. I have tables like Movie, Actor, TV Series, Director, Producer etc. These tables contain the most important information, and they are connected (by foreign keys and junction tables like MovieActor, MovieDirector etc).
So the scenario is like this:
1) The standard "starting" database should have Actors, Directors, Producers, Music Composers, Genres, Resolution Types... pre-populated and pre-defined by the Admin.
2) Every user creating his personal movie collection starts his database with all the pre-defined data, but if he wants to, he may add further data to his personal database. These changes will only affect his database and not the standard "starting" database (which was defined by the Admin).
3) The Admin should have a separate view to add Actors, Directors, Producers... that will become part of the standard "starting" database. Any further changes done to this database will be available to the users as updates.
Suggested Solution
Question
The suggested solution seems to require creating a new database for each user, which does not seem feasible. My question is how I can adapt the suggested solution so that it is effective and workable. I would prefer to handle the situation by using database relations, not by separate storage.
You wouldn't create multiple databases, you would simply add an ownerId field to all relevant tables - admin would have ownerId = 0, indicating the row is part of the 'starting database' and new admin entries are instantly available to users.
In any output for a user where you want to display the starting data and their own, you would add WHERE (ownerId = 0 or ownerId = userId) to the appropriate query or if they need to see just their own, just ownerId = userId.
Presumably, they would be able to create relationships between their own data or 'starting' data and this approach should still work.
Foreign keys will still work but deleting will delete user data - basically you should only ever add to the starting data, not take away or you will run into problems.
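For example (hypothetical table name Actors and parameter @userId), listing the rows visible to a given user:

-- starting data plus the user's own additions
SELECT * FROM Actors WHERE ownerId = 0 OR ownerId = @userId;

-- only the user's own additions
SELECT * FROM Actors WHERE ownerId = @userId;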

What is this form of database called?

I'm new to databases and I'm thinking of creating one for a website. I started with SQL, but I really am not sure if I'm using the right kind of database.
Here's the problem:
What I have right now is the first option (one row per photo). That means my data looks something like this:
user_id  photo_id  photo_url
0        0         abc.jpg
0        1         123.jpg
0        2         lol.png
etc. But to me that seems a little bit inefficient when the database becomes BIG. So the thing I want is the second option (all of a user's photos in one row), something like this:
user_id  photos
0        {abc.jpg, 123.jpg, lol.png}
Or something like this:
user_id  photo_ids
0        {0, 1, 2}
I couldn't find anything like that; I only found ordinary SQL. Is there any way to do something like that (even if it isn't considered a "database")? If not, why is SQL more efficient for those kinds of situations? How can I make it more efficient?
Thanks in advance.
Your initial approach to having a user_id, photo_id, photo_url is correct. This is the normalized relationship that most database management systems use.
The following relationship is called "one to many," as a user can have many photos.
You may want to go as far as separating the photo details and just providing a reference table between the users and photos.
The reason your second approach is inefficient is because databases are not designed to search or store multiple values in a single column. While it's possible to store data in this fashion, you shouldn't.
If you wanted to locate a particular photo for a user using your second approach, you would have to search using LIKE, which will most likely not make use of any indexes. The process of extracting or listing those photos would also be inefficient.
You can read more about basic database principles here.
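A rough sketch of that normalized layout and the lookup it enables (column names are just examples):

CREATE TABLE users (
    user_id INT NOT NULL PRIMARY KEY
    -- other user columns
);

CREATE TABLE photos (
    photo_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id   INT NOT NULL,
    photo_url VARCHAR(255) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

-- all photos of user 0: a plain indexed lookup, no LIKE needed
SELECT photo_id, photo_url FROM photos WHERE user_id = 0;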
Your first example looks like a traditional relational database, where a table stores a single record per row in a standard 1:1 key-value attribute set. This is how data is stored in RDBMS' like Oracle, MySQL and SQL Server. Your second example looks more like a document database or NoSQL database, where data is stored in nested data objects (like hashes and arrays). This is how data is stored in database systems like MongoDB.
There are benefits and costs to storing data in either model. With relational databases, where data is spread across multiple tables and linked by keys, it is easy to get at data from multiple angles and aggregate it for multiple purposes. With document databases, data is typically more difficult to join in single queries, but much faster to retrieve, and also typically formatted for quicker application use.
For your application, the latter (document database model) might be best if you only care about referencing a user's images when you have a user ID. This would not be ideal for say, querying for all images of category 'profile pic' or for all images uploaded after a certain date. You could probably accomplish your task with either database type, and choosing the right database will always depend on the application(s) that it will be used for, but as a general rule-of-thumb, relational databases are more flexible and hard to go wrong with.
What you want (having user -> (photo1, photo2, ...)) is essentially what an INDEX gives you:
when you execute your query, the database goes to the index on user in the photos table and gets the list of photos to fetch. The whole table is not scanned; it's optimised.
I would do something like this:
One User - One Photo
Users_Table, with all the columns that every user will have. If one user will only ever have one photo, then just add a photo_url column to this table.
One User - Many Photos
If one user can have multiple photos, then create a separate table for photos which contains the UserID from Users_Table plus the Photo_ID and Photo_File.
Many Users - Many Photos
If one photo can be assigned to multiple users, then create a separate table Photos with PhotoID and Photo_File, and a third table User_Photos which has the UserID from Users_Table and the PhotoID from the Photos table.
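A sketch of the many-to-many variant (names roughly follow the description above):

CREATE TABLE Users_Table (
    UserID INT NOT NULL PRIMARY KEY
    -- other user columns
);

CREATE TABLE Photos (
    PhotoID    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Photo_File VARCHAR(255) NOT NULL
);

CREATE TABLE User_Photos (
    UserID  INT NOT NULL,
    PhotoID INT NOT NULL,
    PRIMARY KEY (UserID, PhotoID),
    FOREIGN KEY (UserID)  REFERENCES Users_Table (UserID),
    FOREIGN KEY (PhotoID) REFERENCES Photos (PhotoID)
);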

General design principle for performance of OneToMany relations in relational model

I notice a pattern which seems pretty obvious now.
Need to get your opinion on this.
Suppose we have a one-to-many relationship from table 1 to table 2 in a relational model.
For example table 1 could be a User table and table 2 could be a Login table which logs all user logins. One user can log in multiple times.
Given a user we can find all logins by that user.
The first idea that comes to mind will be to store the logins only in the login table. This is design one.
But if for some usecases we are interested in a particular login of the user (say the last login) it is "generally a good idea" to cache the last login time in the user table itself.
Is that right?
Design 2 is obviously redundant, as we can always find the last login time by performing a join and then discarding all but the latest login.
For one user either should be fine. But if you want to find the last login time for all users in a single SQL query, design 1 would involve a join and a subquery to filter out the unneeded results.
But given our use case it is a good idea to store the last login time in the user table itself, which will save us from the join. Is that right?
Is that a generic pattern that you see when designing schemas?
You are confusing the concepts of TABLE and RELATION, a common mistake. You have two RELATIONS in your conceptual model (Users & Logins), but in practice this will involve more than two TABLES in your physical model, as non-clustered indices are nothing more than additional TABLES used to speed up the joining of multiple RELATIONS.
Once the INDEX (UserID, LoginTime) exists on Logins to support the FK relationship to Users, the query to find the most recent login for a user is covered by the non-clustered index. Only when a known, measurable, severe performance problem has been identified with this default model would one look to denormalize, as this (like all denormalizations) introduces a performance hit for EVERY OTHER READ AND WRITE operation on the denormalized table.
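For illustration (hypothetical table and column names), the index and the queries it covers:

CREATE INDEX idx_logins_user_time ON Logins (UserID, LoginTime);

-- last login per user, answered from the index alone
SELECT UserID, MAX(LoginTime) AS LastLogin
FROM Logins
GROUP BY UserID;

-- last login for one user
SELECT MAX(LoginTime) AS LastLogin
FROM Logins
WHERE UserID = 42;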

Find key by value

The thing I'm trying to implement is an id table. Basically it has the structure (user_id, lecturer_id), where user_id refers to the primary key in my User table and lecturer_id refers to the primary key of my Lecturer table.
I'm trying to implement this in Redis, but if I set the key to the User's primary id, then when I try to run a query like "get all the records with lecturer_id = 5", I won't be able to reach it in O(1) time, since the lecturer id is not the key but the value.
How can I form a structure like the id table I mentioned above, or does Redis not support that?
One of the things you learn fast while working with Redis is that you get to design your data structures around your access needs, especially when it comes to relations (it's not a relational database after all).
There is no way to search by "value" with O(1) time complexity, as you already noticed, but there are ways to approach what you describe using Redis. Here's what I would recommend:
Store your user data by user id (in e.g. a hash) as you are already doing.
Have an additional set for each lecturer id containing all user ids that correspond to the lecturer id in question.
This might seem like duplicating the data of the relation, since your user data would have to store the lecturer id and your lecturer data would store user ids, but that's the (tiny) price to pay if one is to build relations in a non-relational data store like Redis. In practical terms this works well; memory is rarely a bottleneck for small-ish data sets (think thousands of ids).
To get a better picture at how are people using redis to model applications with relations, I recommend reading Design and implementation of a simple Twitter clone and the source code of Lamernews, both of which are written by redis author Salvatore Sanfilippo.
As already answered, in vanilla Redis there is no way to store the data only once and have Redis query them for you.
You have to maintain secondary indexes yourself.
However, with modules in Redis this is not necessarily true. Modules like zeeSQL or RediSearch allow you to store data directly in Redis and retrieve it with a SQL query (for zeeSQL) or SQL-like queries (for RediSearch).
In your case, a small example with zeeSQL.
> ZEESQL.CREATE_DB DB
OK
> ZEESQL.EXEC DB COMMAND "CREATE TABLE user(user_id INT, lecture_id INT);"
OK
> ZEESQL.EXEC DB COMMAND "SELECT * FROM user WHERE lecture_id = 3;"
... your result ...

Link table(s) or redundant columns, SQL optimisation

I have two tables, Users and People, both of which share a common attribute, email address, and both should be allowed to have many email addresses.
I can see three options myself:
One link table with redundant columns:
Users [id,email_id] and People [id,email_id]
EmailAddress [id,user_id,person_id,email_id]
Emails [id,address,type]
Two link tables without redundancies:
Users [id,email_id] and People [id,email_id]
PersonEmail [id,person_id,email_id]
UserEmail [id,user_id,email_id]
Emails [id,address,type]
No link tables with redundant columns:
Users [id] and People [id]
Emails [id,address,type,user_id,person_id]
Does anyone have any idea which would be the best option, or if there is any other way? Also, if anyone knows how to implement link tables without the generated id column, or feels that this is better, please also specify.
Update: a User has many People, a person belongs to a User
First off, the relationship between user and e-mail is 1:N, not M:N, so in any case you don't need the "link" table EmailAddress.
You need to decide which of these possibilities is true for your application:
User is always person.
Person is always user.
There can be a person that is not user and there can be a user that is not person.
Option 1:
Assuming the option (1) is the correct one, the logical model should look like this:
The symbol between Person and User is "category", which at the level of the physical database can be implemented either:
as a "1 to 0 or 1" relationship between separate tables Person and User,
or a single table containing both person and user fields, where user fields are NULL for persons that are not also users.
If you have...
many user-specific fields,
there are user-specific foreign keys,
new kinds of persons could be added in the future
and you don't need to squeeze-out every last drop of performance,
...choose the implementation strategy with two tables.
If there are:
relatively few user-specific fields,
there are no user-specific relationships,
low "evolvability" is acceptable
and performance is of high importance,
...choose the implementation strategy with the single table.
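A minimal sketch of the two-table ("1 to 0 or 1") strategy, where the User table's primary key is also a foreign key to Person and the e-mails hang off Person (all names are illustrative):

CREATE TABLE Person (
    PersonID INT NOT NULL PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL
    -- fields common to every person
);

CREATE TABLE User (
    PersonID INT NOT NULL PRIMARY KEY,        -- same key as Person: 1 to 0..1
    Login    VARCHAR(50) NOT NULL,
    -- user-only fields
    FOREIGN KEY (PersonID) REFERENCES Person (PersonID)
);

CREATE TABLE Email (
    EmailID  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    PersonID INT NOT NULL,
    Address  VARCHAR(255) NOT NULL,
    Type     VARCHAR(20),
    FOREIGN KEY (PersonID) REFERENCES Person (PersonID)
);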
Similar analysis can be done for each of the remaining possibilities (Option 2 and Option 3).
If the two entities are conceptually related, then it might make sense to have one table. But if they are two different concepts, then in my experience it is best to have separate tables in order to avoid future confusion. And you're not going to take a big hit anywhere by doing so.
Isn't the User a Person (People)?
That would solve the redundant field issue right away.
----------
| Person |
----------
|
--------
| User |
--------
The User should have the single e-mail field, or maintain the relation with the e-mails table, since Person is an abstract concept not tied to any particular application.
I would say start thinking about (re)modeling your schema, so you won't have problems like this.
Read the Multiple Table Inheritance in Rails guide, that should get you started.