SQL Insert performance question

I have this phonebook table in SQL Server 2005:
username (PK) | Serial (PK) | contact_name | contact_adr   | contact_email | contact_phone
--------------|-------------|--------------|---------------|---------------|--------------
bob           | 1           | Steve        | 12 abc street | steve#bb.com  | 1234
bob           | 2           | John         | 34 xyz street | john#bb.com   | 5345
bob           | 3           | Mark         | 98 ggs street | mark#bb.com   | 1234
patrick       | 4           | lily         | 77 fgs street | lily#bb.com   | 1234
patrick       | 5           | mily         | 76 fgs street | mily#bb.com   | 1234
von           | 8           | jim          | 6767 jsd way  | jim#bb.com    | 4564
Now you can see the phonebook stores all contacts of the same user together.
Storing it this way has advantages that I can't give up.
My question is:
If I have 100 million entries in the table across all users, will future insertions into it be very expensive?
(Since the SQL engine needs to find the actual location where to put the data, I mean under which username.)
I tested with 1 million rows and didn't see any noticeable issues.
Has anyone had experience with this, or any suggestions for me?
Thanks

The approach that is optimal for an address book is a NoSQL-style hashed table. There's no need for an index on the PK: the hashing algorithm returns the "page" where the row identified by the PK can be found, and the address book of the user is stored with the user, as a denormalized relation. Insert overhead is negligible. A hashed PK is optimized for insert and retrieval when the PK is known, which makes it excellent for OLTP systems.
Now, if you want to do something like figure out who knows whom, so that a given user's contacts need to be related to the contacts of all other users, then you have a different can of worms. But for a straightforward address-book application, where the contacts of a given user remain "private" to that user, a hashed primary key is superb.

One of the first principles in DB design is data non-redundancy: your table design doesn't comply with that principle, since you have the same data repeated many times. A reasonable solution would be to create a separate table for users, a separate table for contacts, and a table for the relationship between users and contacts.
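A minimal sketch of that three-table shape (table and column names are mine, for illustration only):

CREATE TABLE users (
    user_id  INT         NOT NULL PRIMARY KEY,
    username VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE contacts (
    contact_id    INT          NOT NULL PRIMARY KEY,
    contact_name  VARCHAR(100) NOT NULL,
    contact_adr   VARCHAR(200),
    contact_email VARCHAR(100),
    contact_phone VARCHAR(20)
);

-- Junction table: which user's phonebook holds which contact.
CREATE TABLE user_contacts (
    user_id    INT NOT NULL REFERENCES users (user_id),
    contact_id INT NOT NULL REFERENCES contacts (contact_id),
    PRIMARY KEY (user_id, contact_id)
);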

It depends on the underlying database; every implementation has something different up its sleeve.
But insert performance will almost certainly suffer if you have indexes on that table and it holds a very large number of rows.

First of all, username doesn't seem to be a primary key for your table by itself; you will have to use it in combination with another field (the Serial column) for it to work. At this point, I would rather use your serial column as the primary key, and put an index on username to answer the query "get bob's contacts" efficiently.
Your inserts will certainly become slower as your table grows, but I don't think they will become slow enough to make you avoid this approach.
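A sketch of what this answer suggests, reusing the question's column names (the types are assumptions on my part):

CREATE TABLE phonebook (
    serial        INT          NOT NULL PRIMARY KEY,  -- clustered by default: inserts append at the end
    username      VARCHAR(50)  NOT NULL,
    contact_name  VARCHAR(100) NOT NULL,
    contact_adr   VARCHAR(200),
    contact_email VARCHAR(100),
    contact_phone VARCHAR(20)
);

-- Non-clustered index so "get bob's contacts" stays fast:
CREATE INDEX ix_phonebook_username ON phonebook (username);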

You can't force the data to be stored together. Are you re-sequencing the Serial upon an insert? How are you ensuring the data is "stored together"?
If you mean putting all this data in one table, then it really depends on your index structure. The more indexes on the table, the more processing takes place on every insert. Since user tables are usually heavily queried and rarely inserted into (relatively speaking), they are usually indexed heavily, in which case inserts can be slow. The answer, as with almost every DB question, is: "It depends".

Related

Is it a good idea to create a new table for each message between 2 users?

Is it a good idea to create a new table for each message thread between 2 users? For example, a table for Jonny and Henry holding only Jonny and Henry's messages. Is that fast, or is a column enough? I want the best speed. Which one is faster: one messages table for all users, or one table per user pair? (The table name would be unique because it would be built from their ids; for example, if Jonny's id is 13 and Henry's id is 19, the table would be 13_->19.)
Generally speaking, modifying the schema (e.g., creating tables) at runtime is a bad, bad idea. Every rule has its exceptions, but unless you have an extremely good reason to do this, indexing the relevant columns, or even partitioning your tables by them should be more than enough.
For sure no.
You can keep all of the messages in one table and have specific columns like sender_id and receiver_id.
Make the relations between the tables.
For more speed, consider your indexes and table engine.
And you can archive your old messages into another table at a specific interval.
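A minimal sketch of that single messages table (MySQL-flavored, since table engines were mentioned; names and types are illustrative):

CREATE TABLE messages (
    message_id  INT      NOT NULL AUTO_INCREMENT PRIMARY KEY,
    sender_id   INT      NOT NULL,  -- FK to the users table
    receiver_id INT      NOT NULL,  -- FK to the users table
    body        TEXT,
    sent_at     DATETIME NOT NULL
);

-- Index for "all messages between two users" (query both directions):
CREATE INDEX ix_messages_pair ON messages (sender_id, receiver_id);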

Indexing and table relationships for sql database

I'd really appreciate some advice on how to set up my table design and what type of indexing I should use.
I figured this type of requirement has come up a lot before, so hopefully I can benefit from your advice!
The requirement, and my initial plan is as follows:
I have one table which identifies a limited amount of forms
FormID FormName Desc etc..
I will have a second table that populates information for those forms.
(EquipmentIds are unique. So a piece of equipment may require one of the forms from the previous table.)
ID | FormID | EquipmentID | Element     | Value
---|--------|-------------|-------------|------------------------------
1  | 25     | 3432        | lightswitch | GE Lightbulb
2  | 25     | 3432        | lamp        | nice lamp
3  | 25     | 3432        | rug         | really ties the room together
4  | 25     | 3432        | shelf       | good shelf
5  | 25     | 3432        | ...         | ...
6  | 23     | 2314        | ...         | ...
So essentially all of the information for the forms would be in the second table. To populate a form I would then select from the form-filler table on the FormID AND EquipmentID.
Is there a better way of doing this? It makes sense to me but I could see the table growing very quickly and I am wondering what the best way to index this second table would be.
Thank you very much for your time and help
You are implementing a variation on the Entity-Attribute-Value design. I have written several times that this is not a valid design for a relational database.
You lose the ability to use SQL data types appropriately.
You lose the ability to use SQL constraints like UNIQUE, FOREIGN KEY, and even NOT NULL.
Queries against EAV data don't scale well.
The storage is bulky and inefficient. Not only does it create many rows, but the first three columns have a lot of duplication in values, making indexes less efficient.
To use EAV data, you have to write lots more application code to check data types, simulate constraints, and reformat result sets. This is extra work that the RDBMS already does for you more efficiently.
The correct design for a relational database is to design a table for each of your forms, with form fields in distinct columns, so you can choose the column names, data types, constraints appropriately. Form fields that allow multiple values require separate child tables too.
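As a sketch only, using the example elements from the question (the table name, column names and types are invented for illustration), form 25 could become its own table:

CREATE TABLE equipment_form_25 (
    equipment_id INT          NOT NULL PRIMARY KEY,
    lightswitch  VARCHAR(100) NOT NULL,  -- a real column: types and NOT NULL are now enforced by the DB
    lamp         VARCHAR(100),
    rug          VARCHAR(100),
    shelf        VARCHAR(100)
);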

Is this database structure sane, correct and normalized?

So, yesterday I asked 2 questions that pivoted around the same idea: Reorganizing a database that A- wasn't normalized and B- was a mess by virtue of my ignorance. I spent the better part of the day organizing my thoughts, reading up and working through some tests. Today I think I have a much better idea of how my DB should look and act, but I wanted to make sure I understood the core ideas of proper SQL DB design and normalization processes.
Originally I had ONE table called "Files" that held data about a file (its URL, date uploaded, user ID of whoever uploaded it, etc.) as well as a column called "grades" that represented the grade level you might use that file for. (FYI: these files are lesson plans for schools.) I realized I'd violated Rule #1 of normalization: I was storing my "grades" like "1,2" or "2,6" or "3,5,6" in one column. This caused major headaches when trying to parse that data if I wanted to see JUST 3rd grade lessons or JUST 5th grade lessons.
What was suggested to me, and what became evident later, was that I have 3 tables:
files (data about the files, url etc.)
grades (a table of available grade levels. Likely 1-6 to start)
files_grades (a junction table)
This makes sense. I just want to make sure I understand what I'm doing before I do it. Let's say User A uploads File xyz and decides that it's good for grades 2 and 3.
I'd write ONE record to the "files" table with data about that file (kb size, url, description, name, primary key files_id). Let's say it gets id 345.
Because of the limited number of grade options, grades will likely be equivalent to their ID (i.e., Grade 1 is grades_id 1, Grade 2 is grades_id 2)
I'd then write TWO records to the "files_grade" junction table containing
files_grade_id, files_id, and grades_id i.e.
1,345,2
1,345,3
To represent the 2 grades that files_id 345 is good for. Then I wave my magic SELECT and JOIN wands and pull the data I need.
Does this make sense? Am I, again, misunderstanding the proper structure of a relational many-to-many database?
Problem 2, which just dawned on me: so, a Lesson can have multiple "Grades". No problem, we just solved that (I hope!). But it could, in theory, have multiple "Schools" as well: Elementary, Middle, High. What do we do if a files entry has Grades 1 and 2 for Middle and High? This could very easily be solved by saying "One school per file, users!", but I'd like to throw it out there.
I'd then write TWO records to the "files_grade" junction table containing
files_grade_id, files_id, and grades_id i.e.
files_grade_id here is redundant, because the combination of files_id and grades_id is already unique (thus can be set as the primary key).
But it could, in theory, have multiple "Schools" as well- Elementary, Middle, High. What do we do if a files entry has Grades 1,2 for Middle,High?
Depending on your requirement, you can perhaps store those as "continuations" of the previous grades, e.g. 1-6 elementary, 7-9 middle, 10-12 high. Then you can make do without the grades table completely (since you can just store these numbers in the files_grade table).
From the sounds of it, it sounds pretty good. Just one thing, though: you don't really need to have an ID for the bridge table (FILES_GRADES), and if you do, you need to increment the ID.
You would have a two-part primary key: grade_id and file_id, the files_grade_id just complicates things, and would make for a bad index, since you'd never use it in a select.
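In other words (a sketch, reusing the question's column names):

CREATE TABLE files_grades (
    files_id  INT NOT NULL REFERENCES files (files_id),
    grades_id INT NOT NULL REFERENCES grades (grades_id),
    PRIMARY KEY (files_id, grades_id)  -- the two-part primary key
);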
Since your first question has already been answered, I will take a stab at the second one.
There are multiple ways to do this, but one possibility is to add another table for the "Schools" and include it as part of the junction table - renaming the junction table of course to match the new design. So, you could have:
School Table:
---------------------
SchoolId | School
---------------------
1        | Elementary
2        | Middle
3        | High
---------------------

Files_grades_school:
---------------------------
FileId | GradeId | SchoolId
---------------------------
345    | 1       | 1
345    | 1       | 2
You will probably want to create multiple indexes based on your usage patterns.
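As a sketch (names taken from the example above; the extra index is just one plausible choice):

CREATE TABLE Files_grades_school (
    FileId   INT NOT NULL REFERENCES files (files_id),
    GradeId  INT NOT NULL REFERENCES grades (grades_id),
    SchoolId INT NOT NULL REFERENCES School (SchoolId),
    PRIMARY KEY (FileId, GradeId, SchoolId)  -- prevents duplicate rows
);

-- If "all files for a school" is a common query:
CREATE INDEX ix_fgs_school ON Files_grades_school (SchoolId, GradeId);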

Normalization in plain English

I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife, it would be something like this:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country, which can be as long as "Bosnia & Herzegovina", for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45s. Now if, in the future, as often happens with Balkan countries, it splits into two countries, Bosnia and Herzegovina, I will have to change it in only one place. Well, sort of.
Now, to explain 2NF, I would change the example and assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person | CountryVisited | AnotherInformation | D.O.B.
-------|----------------|--------------------|---------
Faruz  | USA            | Blah Blah          | 1/1/2000
Faruz  | Canada         | Blah Blah          | 1/1/2000
I would create three tables: one with the list of countries, one with the list of persons, and another table to connect them both. That gives me the most freedom I can get to change a person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
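A sketch of those three tables (names and types are mine, for illustration):

CREATE TABLE countries (
    country_id INT          NOT NULL PRIMARY KEY,
    name       VARCHAR(100) NOT NULL  -- "Bosnia & Herzegovina" is stored once, here
);

CREATE TABLE persons (
    person_id INT          NOT NULL PRIMARY KEY,
    name      VARCHAR(100) NOT NULL,
    dob       DATE
);

-- One row per (person, country) pair; nothing else is duplicated.
CREATE TABLE visits (
    person_id  INT NOT NULL REFERENCES persons (person_id),
    country_id INT NOT NULL REFERENCES countries (country_id),
    PRIMARY KEY (person_id, country_id)
);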
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
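The flat "friends and cats" view can still be produced on demand with a join:

SELECT f.Name, f.Address, c.Name AS CatName
FROM Friends AS f
JOIN Cats AS c ON c.OwnerId = f.Id
ORDER BY f.Name, c.Name;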
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the fact that Bob's address is "The Belltower", are repeated, which is redundant, makes it difficult to query and change data, and (worst of all) makes it possible to introduce logical inconsistencies.
E.g., if Bob moves, I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. E.g., if I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of using multiple tables?
The answer is of course normalization. As already said, it's to avoid redundancy and thereby update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
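A sketch of that phone number table (names and types are illustrative):

CREATE TABLE phonenumbers (
    phonenumber_id INT         NOT NULL PRIMARY KEY,
    user_id        INT         NOT NULL REFERENCES users (user_id),  -- which user owns this number
    phonenumber    VARCHAR(20) NOT NULL
);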
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their normalized locations takes time. In high-load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to go shopping for ice cream, then without normalization you would have another note saying you have to go shopping for ice cream: one in each pocket.
Now, in real life, you would never do this, so why do it in a database?
For the designing and implementing part, that's when you can move back to "the lingo" and keep it away from layman terms, but I suppose you could simplify. You would say what you need at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that table's primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules used to design tables that are connected through relationships.
It helps avoid repetitive entries, reduces required storage space, prevents the need to restructure existing tables to accommodate new data, and increases the speed of queries.
First Normal Form: Data should be broken up into the smallest units. Tables should not contain repetitive groups of columns. Each row is identified by one or more primary key columns.
For example, if there is a column named 'Name' in a 'Customer' table, it should be broken into 'First Name' and 'Last Name'. Also, 'Customer' should have a column named 'CustomerID' to identify a particular customer.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Customer' table has a column named 'City', the city should have a separate table with a primary key and city name defined; in the 'Customer' table, replace the 'City' column with 'CityID' and make 'CityID' the foreign key in the table.
Third Normal Form: Each non-key column should not depend on other non-key columns.
For example, in an order table, the column 'Total' is dependent on 'Unit Price' and 'Quantity', so the 'Total' column should be removed.
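For instance (a sketch with invented names), the total is simply computed when queried instead of stored:

CREATE TABLE order_lines (
    order_line_id INT           NOT NULL PRIMARY KEY,
    unit_price    DECIMAL(10,2) NOT NULL,
    quantity      INT           NOT NULL
);

-- 'Total' is derived from the other two columns, so it is never stored:
SELECT order_line_id,
       unit_price * quantity AS total
FROM order_lines;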
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain why they should avoid blanks. I tell them that Access and most databases do not store blanks the way Excel does. Excel doesn't care if you have nothing typed in the cell and will not increase the file size, but Access will reserve that space until the point in time that you actually use the field. So even if it is blank, it will still use up space, and I explain to them that it slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype in values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. We instead, will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we are saving space because we are not typing the customer's name, address, etc multiple times and instead are just using the customer's ID number in a field for the customer. We then will discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will setup a drop-down list that will give you a list of customer, where you can select their name and it will fill in the customer’s ID for you. This will be a 1-to-many relationship, whereas 1 customer will have many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, one that will hold employee information and one that will hold project information. The tables are laid out similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects they work on. First, if we have a new employee, they will not have any projects, so we will be wasting all of those fields; second, if an employee has been here a long time, they might have worked on 300 projects, so we would have to include 300 project fields. People who are new and have only 1 project would have 299 wasted project fields. This design is also flawed because I would have to search each of the project fields to find all of the people who have worked on a certain project, since that project number could be in any of the project fields. A junction-table sketch follows below.
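Here is that junction table as a sketch (reusing the table names above; assume EmployeeID and ProjectNum are the primary keys of their tables):

-- Replaces the Project1..ProjectN columns entirely:
CREATE TABLE tblEmployeeProjects (
    EmployeeID INT NOT NULL REFERENCES tblEmployees (EmployeeID),
    ProjectNum INT NOT NULL REFERENCES tblProjects (ProjectNum),
    PRIMARY KEY (EmployeeID, ProjectNum)
);

-- "Everyone who worked on a given project" becomes a single indexed lookup:
SELECT EmployeeID FROM tblEmployeeProjects WHERE ProjectNum = 42;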
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarification / breaking it down in plain English. The wiki page does not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times, but I have found a better overview of normalization in this article. It is a simple, easy-to-understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in tables that will have records inserted but not updated, such as history logs or financial transactions.
References
When not to Normalize your SQL Database
Database Design Basics
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
When working in a truly normalized database, I have found situations where it was easy to write slow queries, because of bad join operations, bad indexing on the tables, and plain bad design of the tables themselves.
Bluntly, it's easier to write bad queries against highly normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out of a few tables, without having to join a ton of tables to get one data set.

Facebook database design?

I have always wondered how Facebook designed the friend <-> user relation.
I figure the user table is something like this:
user_email PK
user_id PK
password
I figure there's a table with the user's data (sex, age, etc.), connected via user email, I would assume.
How does it connect all the friends to this user?
Something like this?
user_id
friend_id_1
friend_id_2
friend_id_3
friend_id_N
Probably not, because the number of users is unknown and will expand.
Keep a friend table that holds the UserID and then the UserID of the friend (we will call it FriendID). Both columns would be foreign keys back to the Users table.
Somewhat useful example:
Table Name: User
Columns:
UserID PK
EmailAddress
Password
Gender
DOB
Location
TableName: Friends
Columns:
UserID PK FK
FriendID PK FK
(This table features a composite primary key made up of the two foreign keys, both pointing back to the user table. One ID will point to the logged-in user, the other ID will point to the individual friend of that user.)
Example Usage:
Table User
--------------
UserID | EmailAddress | Password | Gender | DOB      | Location
-------|--------------|----------|--------|----------|--------------
1      | bob#bob.com  | bobbie   | M      | 1/1/2009 | New York City
2      | jon#jon.com  | jonathan | M      | 2/2/2008 | Los Angeles
3      | joe#joe.com  | joseph   | M      | 1/2/2007 | Pittsburgh
Table Friends
---------------
UserID | FriendID
-------|---------
1      | 2
1      | 3
2      | 3
This will show that Bob is friends with both Jon and Joe and that Jon is also friends with Joe. In this example we will assume that friendship is always two ways, so you would not need a row in the table such as (2,1) or (3,2) because they are already represented in the other direction. For examples where friendship or other relations aren't explicitly two way, you would need to also have those rows to indicate the two-way relationship.
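Since each friendship is stored in only one direction, fetching all of a user's friends means checking both columns; one way to write it (a sketch):

SELECT FriendID AS friend FROM Friends WHERE UserID = 1
UNION
SELECT UserID AS friend FROM Friends WHERE FriendID = 1;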
TL;DR:
They use a stack architecture with cached graphs for everything above the MySQL bottom of their stack.
Long Answer:
I did some research on this myself because I was curious how they handle their huge amount of data and search it in a quick way. I've seen people complaining about custom made social network scripts becoming slow when the user base grows. After I did some benchmarking myself with just 10k users and 2.5 million friend connections - not even trying to bother about group permissions and likes and wall posts - it quickly turned out that this approach is flawed. So I've spent some time searching the web on how to do it better and came across this official Facebook article:
TAO: Facebook’s Distributed Data Store for the Social Graph
TAO: The power of the graph.
I really recommend watching the presentation in the first link above before you continue reading. It's probably the best explanation of how FB works behind the scenes that you can find.
The video and article tell you a few things:
They're using MySQL at the very bottom of their stack
Above the SQL DB there is the TAO layer which contains at least two levels of caching and is using graphs to describe the connections.
I could not find anything on what software / DB they actually use for their cached graphs
Let's take a look at this, friend connections are top left:
Well, this is a graph. :) It doesn't tell you how to build it in SQL; there are several ways to do it, and this site has a good number of different approaches. Attention: consider that a relational DB is what it is: it's designed to store normalized data, not a graph structure, so it won't perform as well as a specialized graph database.
Also consider that you have to do more complex queries than just friends of friends, for example when you want to filter all locations around a given coordinate that you and your friends of friends like. A graph is the perfect solution here.
I can't tell you how to build it so that it will perform well but it clearly requires some trial and error and benchmarking.
Here is my disappointing test for just finding friends of friends:
DB Schema:
CREATE TABLE IF NOT EXISTS `friends` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `friend_id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
  -- NB: no index on user_id or friend_id yet
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8;
Friends of Friends Query:
(
    select friend_id
    from friends
    where user_id = 1
) union (
    select distinct ff.friend_id
    from friends f
    join friends ff on ff.user_id = f.friend_id
    where f.user_id = 1
)
I really recommend creating some sample data with at least 10k user records, each of them having at least 250 friend connections, and then running this query. On my machine (i7 4770k, SSD, 16 GB RAM) the result was ~0.18 seconds for that query. Maybe it can be optimized; I'm not a DB genius (suggestions are welcome). However, if this scales linearly you're already at 1.8 seconds for just 100k users, and 18 seconds for 1 million users.
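One suggestion, untested against the benchmark above (so treat it as an assumption): the schema has no index on the join columns, and a composite index can serve both sides of the self-join from the index alone:

-- Covering index for the friends-of-friends self-join (MySQL syntax):
ALTER TABLE friends ADD INDEX ix_user_friend (user_id, friend_id);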
This might still sound OK-ish for ~100k users, but consider that you just fetched friends of friends and didn't do any more complex query like "display me only posts from friends of friends + do the permission check whether I'm allowed or NOT allowed to see some of them + do a sub-query to check if I liked any of them". You want to let the DB check whether you liked a post already, or else you'll have to do it in code. Also consider that this is not the only query you run, and that you have more than one active user at the same time on a more or less popular site.
I think my answer answers the question of how Facebook designed their friends relationship very well, but I'm sorry that I can't tell you how to implement it in a way that will work fast. Implementing a social network is easy, but making sure it performs well is clearly not - IMHO.
I've started experimenting with OrientDB to do the graph-queries and mapping my edges to the underlying SQL DB. If I ever get it done I'll write an article about it.
How can I create a well performing social network site?
Update 2021-04-10: I'll probably never ever write the article ;) but here are a few bullet points on how you could try to scale it:
Use different read and write repositories
Build specific read repositories based on faster non-relational DB systems made for that purpose, don't be afraid of denormalizing data. Write to a normalized DB but read from specialized views.
Use eventual consistency
Take a look at CQRS
For a social network, graph-based read repositories might also be a good idea.
Use Redis as a read repository in which you store whole serialized data sets
If you combine the points from the above list in a smart way, you can build a very well performing system. The list is not a "todo" list; you'll still have to understand, think and adapt it! https://microservices.io/ is a nice site that covers a few of the topics I mentioned before.
What I do is store events that are generated by aggregates and use projections and handlers to write to the different DBs mentioned above. The cool thing about this is that I can re-build my data as needed at any time.
Have a look at the following database schema, reverse engineered by Anatoly Lubarsky:
My best bet is that they created a graph structure. The nodes are users and "friendships" are edges.
Keep one table of users, keep another table of edges. Then you can keep data about the edges, like "day they became friends" and "approved status," etc.
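A sketch of that nodes-and-edges layout (names and types are mine, for illustration):

-- Users are the nodes; one row per edge, with data about the edge itself.
CREATE TABLE friend_edges (
    user_id        INT     NOT NULL REFERENCES users (user_id),
    friend_id      INT     NOT NULL REFERENCES users (user_id),
    became_friends DATE,                         -- "day they became friends"
    approved       CHAR(1) NOT NULL DEFAULT 'N', -- "approved status"
    PRIMARY KEY (user_id, friend_id)
);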
It's most likely a many to many relationship:
FriendList (table)
user_id -> users.user_id
friend_id -> users.user_id
friendVisibilityLevel
EDIT
The user table probably doesn't have user_email as a PK, possibly as a unique key though.
users (table)
user_id PK
user_email
password
Take a look at these articles describing how LinkedIn and Digg are built:
http://hurvitz.org/blog/2008/06/linkedin-architecture
http://highscalability.com/scaling-digg-and-other-web-applications
There's also "Big Data: Viewpoints from the Facebook Data Team" that might be helpful:
http://developer.yahoo.net/blogs/theater/archives/2008/01/nextyahoonet_big_data_viewpoints_from_the_fac.html
Also, there's this article that talks about non-relational databases and how they're used by some companies:
http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php
You'll see that these companies deal with data warehouses, partitioned databases, data caching and other higher-level concepts that most of us never deal with on a daily basis. Or at least, maybe we don't know that we do.
There are a lot of links on the first two articles that should give you some more insight.
UPDATE 10/20/2014
Murat Demirbas wrote a summary on
TAO: Facebook's distributed data store for the social graph (ATC'13)
F4: Facebook's warm BLOB storage system (OSDI'14)
http://muratbuffalo.blogspot.com/2014/10/facebooks-software-architecture.html
HTH
It's not possible to retrieve friend data from an RDBMS in constant time once it crosses more than half a billion users,
so Facebook implemented this using a hash database (NoSQL), and they open-sourced the database, called Cassandra.
So every user has their own key, with the friend details in a queue; to learn how Cassandra works, look at this:
http://prasath.posterous.com/cassandra-55
It's a type of graph database:
http://components.neo4j.org/neo4j-examples/1.2-SNAPSHOT/social-network.html
It's not related to relational databases.
Google for graph databases.
You're looking for foreign keys. Basically you can't have an array in a database unless it has its own table.
Example schema:
Users Table
userID PK
other data
Friends Table
userID -- FK to the Users table, representing the user who has a friend.
friendID -- FK to the Users table, representing the user id of the friend
Probably there is a table, which stores the friend <-> user relation, say "frnd_list", having fields 'user_id','frnd_id'.
Whenever a user adds another user as a friend, two new rows are created.
For instance, suppose my id is 'deep9c' and I add a user having id 'akash3b' as my friend, then two new rows are created in table "frnd_list" with values ('deep9c','akash3b') and ('akash3b','deep9c').
Now when showing the friends-list to a particular user, a simple SQL query would do it: "select frnd_id from frnd_list where user_id=<user_id>",
where <user_id> is the id of the logged-in user (stored as a session attribute).
Regarding the performance of a many-to-many table: if you have two 32-bit ints linking user IDs, your basic data storage for 200,000,000 users averaging 200 friends apiece is on the order of 300 GB.
Obviously, you would need some partitioning and indexing and you're not going to keep that in memory for all users.