How to store a large list in a database? - sql

I'm starting a project and unsure of the proper way to do this so I just need to be pointed in a general direction.
Essentially, each user stored in the database will have a large list of people (millions) that they want to connect with, and a large list of people they have connected with. The lists of people to connect with will be updated weekly, possibly monthly and duplicates will need to be checked for.
*It might be important to note that the lists of people to connect with won't be users in the system.
Should each list for each user be stored in separate tables and linked to or is there a more efficient structure for this?
Thanks!

One way would be
users table
-----------
user_id
user_name
...
lists table
-----------
list_id
user_id
list_name
...
list_persons table
----------------
list_id
person_name
...
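For the duplicate check, one option is to let the database enforce it. Here is a minimal sketch of the layout above, assuming SQLite through Python's sqlite3 module purely for illustration. The `person_key` column is hypothetical (some normalized identifier such as an email address), since names alone rarely identify a person, and `import_batch` is an illustrative helper name.

```python
import sqlite3

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT NOT NULL);
        CREATE TABLE lists (
            list_id   INTEGER PRIMARY KEY,
            user_id   INTEGER NOT NULL REFERENCES users(user_id),
            list_name TEXT NOT NULL
        );
        CREATE TABLE list_persons (
            list_id     INTEGER NOT NULL REFERENCES lists(list_id),
            person_key  TEXT NOT NULL,  -- hypothetical unique identifier per person
            person_name TEXT NOT NULL,
            PRIMARY KEY (list_id, person_key)  -- one row per person per list
        );
    """)

def import_batch(conn, list_id, people):
    """Insert (person_key, person_name) pairs; return how many were new."""
    count_sql = "SELECT COUNT(*) FROM list_persons WHERE list_id = ?"
    before = conn.execute(count_sql, (list_id,)).fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO list_persons (list_id, person_key, person_name) "
        "VALUES (?, ?, ?)",
        [(list_id, key, name) for key, name in people],
    )
    conn.commit()
    after = conn.execute(count_sql, (list_id,)).fetchone()[0]
    return after - before
```

With the composite primary key in place, the weekly import can simply re-send overlapping data and the duplicates are skipped. Other databases spell the "ignore duplicates" part differently, so treat the INSERT OR IGNORE syntax as SQLite-specific.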

Related

Multiple locations and different user privileges for database

I am not sure what the best route to go on this is. I have a client who has 3 different locations for his business. Each locations employees can only access their locations data. The owner can access all... Then, different roles should be able to access their stuff only (finance can see finance but not sales, etc..).
What is the best way to go about this? The solutions I can think of are:
Create a user table, give a location ID and role ID and base the data off of that. This would require adding the location ID a lot though..
Create 3 separate databases and have the information display based off of a role ID. This doesn't seem ideal
Use functionality on the DB side, stored procedures, etc...
Retrofitting a multi-tenancy security model into an existing database isn't a simple task - IMO this should be designed into the model from the start.
An extremely simple model (One Role per user, One Location per User) would look like this:
-- You need to add simple lookup tables for Role, Location
CREATE TABLE User
(
UserId INT, -- PK
RoleId INT, -- FK
LocationId INT NULL -- FK
);
All sensitive tables would either directly need the LocationId classification, or need to be joinable to a table which has the LocationId classification, i.e.:
CREATE TABLE SomeTable -- with location-sensitive data
(
Col1 ... Col N,
LocationId INT
);
The hard part however is to adjust all of your system's queries on the sensitive data tables such that they now enforce the Location-specific restriction. This is commonly done as an additional predicate filter which is appended to the where clause of queries done on these tables, and then joining back to the user-location table:
SELECT Col1 ... ColN
FROM SomeTable
INNER JOIN User on SomeTable.LocationId = User.LocationId
WHERE -- Usual Filter Criteria
AND ((User.UserId = @UserIdExecutingThisQuery
AND User.RoleId = `Finance`) -- Well, the Id for Finance
OR User.RoleId = `Administrator`) -- Well, the Id for Admin
Given the redesign effort involved, as a short-term solution you might instead look at maintaining 3 distinct regional databases (or 3 regional schemas in the same database), and then using replication or similar to centralize all data in a master database for the owner role to use.
This will give you the time to redesign your database (and app(s)) to use a multi-tenancy design. I would suggest a more comprehensive model of allowing multiple roles per user, and multiple locations per user (i.e. many-many junction tables), and not the simplistic model shown here.
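The many-to-many variant could be sketched roughly as follows. This assumes SQLite via Python's sqlite3 module for illustration only; the table names (`app_user`, `user_location`, `user_role`, `invoice`), the role strings and the `visible_invoices` helper are all hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE app_user (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Junction table: a user may belong to several locations.
CREATE TABLE user_location (
    user_id     INTEGER NOT NULL REFERENCES app_user(user_id),
    location_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, location_id)
);
-- Junction table: a user may hold several roles.
CREATE TABLE user_role (
    user_id INTEGER NOT NULL REFERENCES app_user(user_id),
    role    TEXT NOT NULL,
    PRIMARY KEY (user_id, role)
);
-- Example of a location-sensitive finance table.
CREATE TABLE invoice (
    invoice_id  INTEGER PRIMARY KEY,
    location_id INTEGER NOT NULL,
    amount      INTEGER NOT NULL
);
"""

def visible_invoices(conn, user_id):
    """Invoice ids the user may see: location must match and role must allow it."""
    rows = conn.execute("""
        SELECT i.invoice_id
        FROM invoice AS i
        JOIN user_location AS ul ON ul.location_id = i.location_id
        WHERE ul.user_id = ?
          AND EXISTS (SELECT 1 FROM user_role AS ur
                      WHERE ur.user_id = ul.user_id
                        AND ur.role IN ('finance', 'owner'))
        ORDER BY i.invoice_id
    """, (user_id,))
    return [r[0] for r in rows]
```

In this sketch the owner is just a user with a row in user_location for every location, so no special-case branch is needed in the query.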

Multiple tables or one big one?

In a scenario where I have a table of students all applying for a job and I want to record where they have applied would it be better practise to create a separate table for each student with their individual applications in or create a separate big table that just links an application to a student?
When you say database do you mean table?
Each noun should map to an entity, which in most cases would be represented by a single table so:
STUDENTS-<APPLICATIONS>-JOBS
A student can make many applications
An application is made by a single student
An application relates to a single job
One job could have many applications.
Students
--------
STUDENT_ID
NAME
...
Applications
------------
APPLICATION_ID
JOB_ID
STUDENT_ID
APPLICATION_DATE
...
Jobs
----
JOB_ID
TITLE
SALARY
...
add more columns as required!
One 'big' database but with two separate tables.
As long as the business process doesn't change, you shouldn't need to modify the database either. If you have a database per student, you will have to create new databases when you want to enter more students. That already is a signal that you're not doing it right.
Just make a table of students, and a table of student applications, which has a student_id that references a student. That way, you can store as many students as you like, and as many applications per student as you like, with only two tables, without redundant data, and without having to modify the database when you want to file a new record.
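As a rough illustration (SQLite via Python's sqlite3 module is an assumption here, as are the helper names `open_db` and `jobs_applied_to`), the whole thing fits in a few statements, and adding a student is an INSERT rather than a schema change:

```python
import sqlite3

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, title TEXT NOT NULL, salary INTEGER);
        CREATE TABLE applications (
            application_id   INTEGER PRIMARY KEY,
            job_id           INTEGER NOT NULL REFERENCES jobs(job_id),
            student_id       INTEGER NOT NULL REFERENCES students(student_id),
            application_date TEXT NOT NULL
        );
    """)
    return conn

def jobs_applied_to(conn, student_id):
    """Titles of the jobs one student applied to, oldest application first."""
    rows = conn.execute("""
        SELECT j.title
        FROM applications AS a
        JOIN jobs AS j ON j.job_id = a.job_id
        WHERE a.student_id = ?
        ORDER BY a.application_date
    """, (student_id,))
    return [r[0] for r in rows]
```

The foreign keys also stop an application from pointing at a student or job that does not exist.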
And in case you wondered why I put 'big' in quotes: two tables is not much at all, and unless you are going to store every student in your country along with every application they had anywhere, it won't be big in data terms either. A professional database like PostgreSQL can easily manage millions of rows if your database structure is set up properly.
I think you're talking about a table. A database is a collection of tables, and a server can host multiple databases.
For the question, I think the best answer is to have a table of students, and a table of applications with the students id.
I would really not consider creating a table per student, as that will force you to create a new table for every new student.

Link table(s) or redundant columns, SQL optimisation

I have two tables, Users and People, both of which share a common attribute, email address, of which they should be allowed to have many email addresses.
I can see three options myself:
One link table with redundant columns:
Users [id,email_id] and People [id,email_id]
EmailAddress [id,user_id,person_id,email_id]
Emails [id,address,type]
Two link tables without redundancies:
Users [id,email_id] and People [id,email_id]
PersonEmail [id,person_id,email_id]
UserEmail [id,user_id,email_id]
Emails [id,address,type]
No link tables with redundant columns:
Users [id] and People [id]
Emails [id,address,type,user_id,person_id]
Does anyone have any idea what would be the best option, or whether there are any other ways? Also, if anyone knows how to implement link tables without the generated id column, or feels that is better, please say so.
Update: a User has many People, a person belongs to a User
First off, the relationship between user and e-mail is 1:N, not M:N, so in any case you don't need the "link" table EmailAddress.
You need to decide which of these possibilities is true for your application:
User is always person.
Person is always user.
There can be a person that is not user and there can be a user that is not person.
Option 1:
Assuming the option (1) is the correct one, the logical model should look like this:
The symbol between Person and User is "category", which at the level of the physical database can be implemented either:
as a "1 to 0 or 1" relationship between separate tables Person and User,
or a single table containing both person and user fields, where user fields are NULL for persons that are not also users.
If you have...
many user-specific fields,
there are user-specific foreign keys,
new kinds of persons could be added in the future
and you don't need to squeeze-out every last drop of performance,
...choose the implementation strategy with two tables.
If there are:
relatively few user-specific fields,
there are no user-specific relationships,
low "evolvability" is acceptable
and performance is of high importance,
...choose the implementation strategy with the single table.
Similar analysis can be done for each of the remaining possibilities...
Option 2:
Option 3:
If the two entities are conceptually related, then it might make sense to have one table. But if they are two different concepts, then in my experience it is best to have separate tables in order to avoid future confusion. And you're not going to take a big hit anywhere by doing so.
Isn't the User a Person (People)?
That would solve the redundant field issue right away.
----------
| Person |
----------
|
--------
| User |
--------
The User should have the single e-mail field, or maintain the relation with the e-mails table, since Person is an abstract concept not related to any application.
I would say start thinking about (re)modeling your schema, so you won't have problems like this.
Read the Multiple Table Inheritance in Rails guide, that should get you started.

Adding new fields vs creating separate table

I am working on a project where there are several types of users (students and teachers). Currently to store the user's information, two tables are used. The users table stores the information that all users have in common. The teachers table stores information that only teachers have with a foreign key relating it to the users table.
users table
id
name
email
34 other fields
teachers table
id
user_id
subject
17 other fields
In the rest of the database, there are no references to teachers.id. All other tables who need to relate to a user use users.id. Since a user will only have one corresponding entry in the teachers table, should I just move the fields from the teachers table into the users table and leave them blank for users who aren't teachers?
e.g.
users
id
name
email
subject
51 other fields
Is this too many fields for one table? Will this impede performance?
I think this design is fine, assuming that most of the time you only need the user data, and that you know when you need to show the teacher-specific fields.
In addition, you get only teachers just by doing a JOIN, which might come in handy.
Tomorrow you might have another kind of user who is not a teacher, and you'll be glad of the separation.
Edited to add: yes, this is an inheritance pattern, but since he didn't say what language he was using I didn't want to muddy the waters...
In the rest of the database, there are no references to teachers.id. All other tables who need to relate to a user
use users.id.
I would expect relating to the teacher_id for classes/sections...
Since a user will only have one corresponding entry in the teachers table, should I just move the fields from the teachers table into the users table and leave them blank for users who aren't teachers?
Are you modelling a system for a high school, or post-secondary? Reason I ask is because in post-secondary, a user can be both a teacher and a student... in numerous subjects.
I would think it fine provided neither you or anyone else succumbs to the temptation to reuse 'empty' columns for other purposes.
By this I mean, there will in your new table be columns that are only populated for teachers. Someone may decide that there is another value they need to store for non-teachers, and use one of the teacher's columns to hold it, because after all it'll never be needed for this non-teacher, and that way we don't need to change the table, and pretty soon your code fills up with things testing row types to find what each column holds.
I've seen this done on several systems (for instance, when loaning a library book, if the loan is a long loan the due date holds the date the book is expected back, but if it's a short loan the due date holds the time it's expected back, and woe betide anyone who doesn't somehow know that).
It's not too many fields for one table (although without any details it does seem kind of suspicious). And worrying about performance at this stage is premature.
You're probably dealing with very few rows and a very small amount of data. Your concerns should be 1) getting the job done, 2) designing it correctly, 3) performance, in that order.
It's really not that big of a deal (at this stage/scale).
I would not stuff all fields in one table. Student to teacher ratio is high, so for 100 teachers there may be 10000 students with NULLs in those 17 fields.
Usually, a model would look close to this:
In your case, there are no specific fields for students, so you can omit the Student table, and the model would look like this
Note that for inheritance modeling, the Teacher table has UserID, same as the User table; contrast that to your example which has an Id for the Teacher table and then a separate user_id.
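A small sketch of that shared-key layout, assuming SQLite via Python's sqlite3 module for illustration (the column subset and the `load_user` helper are hypothetical): the teachers table uses the user's id as its own primary key, so a user can have at most one teacher row.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT NOT NULL
);
-- Same key as users: both the primary key and a foreign key.
CREATE TABLE teachers (
    user_id INTEGER PRIMARY KEY REFERENCES users(user_id),
    subject TEXT NOT NULL
);
"""

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def load_user(conn, user_id):
    """Return (name, subject); subject is None for users who are not teachers."""
    return conn.execute("""
        SELECT u.name, t.subject
        FROM users AS u
        LEFT JOIN teachers AS t ON t.user_id = u.user_id
        WHERE u.user_id = ?
    """, (user_id,)).fetchone()
```

Students simply have no row in teachers, so there are no NULL-filled teacher columns to carry around for them.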
It won't really hurt the performance, but the other programmers might hurt you if you don't redesign it :) (55-field tables??)

Facebook database design?

I have always wondered how Facebook designed the friend <-> user relation.
I figure the user table is something like this:
user_email PK
user_id PK
password
I figure there is a separate table with the user's data (sex, age, etc.), connected via the user email, I would assume.
How does it connect all the friends to this user?
Something like this?
user_id
friend_id_1
friend_id_2
friend_id_3
friend_id_N
Probably not. Because the number of users is unknown and will expand.
Keep a friend table that holds the UserID and then the UserID of the friend (we will call it FriendID). Both columns would be foreign keys back to the Users table.
Somewhat useful example:
Table Name: User
Columns:
UserID PK
EmailAddress
Password
Gender
DOB
Location
TableName: Friends
Columns:
UserID PK FK
FriendID PK FK
(This table features a composite primary key made up of the two foreign
keys, both pointing back to the user table. One ID will point to the
logged in user, the other ID will point to the individual friend
of that user)
Example Usage:
Table User
--------------
UserID EmailAddress Password Gender DOB Location
------------------------------------------------------
1 bob@bob.com bobbie M 1/1/2009 New York City
2 jon@jon.com jonathan M 2/2/2008 Los Angeles
3 joe@joe.com joseph M 1/2/2007 Pittsburgh
Table Friends
---------------
UserID FriendID
----------------
1 2
1 3
2 3
This will show that Bob is friends with both Jon and Joe and that Jon is also friends with Joe. In this example we will assume that friendship is always two ways, so you would not need a row in the table such as (2,1) or (3,2) because they are already represented in the other direction. For examples where friendship or other relations aren't explicitly two way, you would need to also have those rows to indicate the two-way relationship.
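If each two-way friendship is stored as a single row, lookups have to check both columns. A minimal sketch of one way to do that, assuming SQLite via Python's sqlite3 module; the smaller-id-first convention and the helper names `add_friendship` and `friends_of` are assumptions for illustration, not part of the schema above.

```python
import sqlite3

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE friends (
            user_id   INTEGER NOT NULL,
            friend_id INTEGER NOT NULL,
            PRIMARY KEY (user_id, friend_id),
            CHECK (user_id < friend_id)  -- assumed convention: smaller id first
        )
    """)
    return conn

def add_friendship(conn, a, b):
    """Store the pair once, whichever way round it was requested."""
    lo, hi = sorted((a, b))
    conn.execute(
        "INSERT OR IGNORE INTO friends (user_id, friend_id) VALUES (?, ?)",
        (lo, hi),
    )

def friends_of(conn, user_id):
    """The user may appear in either column, so search both."""
    rows = conn.execute("""
        SELECT friend_id FROM friends WHERE user_id = ?
        UNION
        SELECT user_id FROM friends WHERE friend_id = ?
        ORDER BY 1
    """, (user_id, user_id))
    return [r[0] for r in rows]
```

With the three rows from the example, friends_of for Jon (id 2) finds Bob through the second branch of the UNION and Joe through the first.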
TL;DR:
They use a stack architecture with cached graphs for everything above the MySQL bottom of their stack.
Long Answer:
I did some research on this myself because I was curious how they handle their huge amount of data and search it in a quick way. I've seen people complaining about custom made social network scripts becoming slow when the user base grows. After I did some benchmarking myself with just 10k users and 2.5 million friend connections - not even trying to bother about group permissions and likes and wall posts - it quickly turned out that this approach is flawed. So I've spent some time searching the web on how to do it better and came across this official Facebook article:
TAO: Facebook’s Distributed Data Store for the Social Graph
TAO: The power of the graph.
I really recommend that you watch the presentation in the first link above before you continue reading. It's probably the best explanation of how FB works behind the scenes you can find.
The video and article tell you a few things:
They're using MySQL at the very bottom of their stack
Above the SQL DB there is the TAO layer which contains at least two levels of caching and is using graphs to describe the connections.
I could not find anything on what software / DB they actually use for their cached graphs
Let's take a look at this, friend connections are top left:
Well, this is a graph. :) It doesn't tell you how to build it in SQL; there are several ways to do it, but this site has a good number of different approaches. Attention: consider that a relational DB is what it is. It's designed to store normalised data, not a graph structure, so it won't perform as well as a specialised graph database.
Also consider that you have to do more complex queries than just friends of friends, for example when you want to filter all locations around a given coordinate that you and your friends of friends like. A graph is the perfect solution here.
I can't tell you how to build it so that it will perform well but it clearly requires some trial and error and benchmarking.
Here is my disappointing test for just finding friends of friends:
DB Schema:
CREATE TABLE IF NOT EXISTS `friends` (
`id` int(11) NOT NULL,
`user_id` int(11) NOT NULL,
`friend_id` int(11) NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8;
Friends of Friends Query:
(
select friend_id
from friends
where user_id = 1
) union (
select distinct ff.friend_id
from
friends f
join friends ff on ff.user_id = f.friend_id
where f.user_id = 1
)
I really recommend that you create some sample data with at least 10k user records, each having at least 250 friend connections, and then run this query. On my machine (i7 4770k, SSD, 16 GB RAM) the query took ~0.18 seconds. Maybe it can be optimized; I'm not a DB genius (suggestions are welcome). However, if this scales linearly you're already at 1.8 seconds for just 100k users and 18 seconds for 1 million users.
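One thing that might be worth checking first: the schema as posted shows no index on user_id. The sketch below is an assumption-laden starting point for your own benchmark, not a measured fix. It uses SQLite via Python's sqlite3 module rather than MySQL, adds a hypothetical composite index `idx_friends_user`, and runs the same friends-of-friends query (without the outer parentheses, which SQLite does not accept).

```python
import sqlite3

def build_db(edges):
    """edges: iterable of (user_id, friend_id) pairs, one row per direction."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE friends (user_id INTEGER NOT NULL, friend_id INTEGER NOT NULL)"
    )
    # Assumed index: lets both lookups on user_id avoid a full table scan.
    conn.execute("CREATE INDEX idx_friends_user ON friends (user_id, friend_id)")
    conn.executemany("INSERT INTO friends VALUES (?, ?)", edges)
    return conn

def friends_and_fof(conn, user_id):
    """Direct friends plus friends of friends, de-duplicated by UNION."""
    rows = conn.execute("""
        SELECT friend_id FROM friends WHERE user_id = ?
        UNION
        SELECT ff.friend_id
        FROM friends AS f
        JOIN friends AS ff ON ff.user_id = f.friend_id
        WHERE f.user_id = ?
        ORDER BY 1
    """, (user_id, user_id))
    return [r[0] for r in rows]
```

Whether this changes the numbers above on MySQL with 2.5 million rows is something only a benchmark on your own data can tell.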
This might still sound OKish for ~100k users but consider that you just fetched friends of friends and didn't do any more complex query like "display me only posts from friends of friends + do the permission check if I'm allowed or NOT allowed to see some of them + do a sub query to check if I liked any of them". You want to let the DB check whether you already liked a post, or you'll have to do it in code. Also consider that this is not the only query you run and that you have more than one active user at the same time on a more or less popular site.
I think my answer answers the question how Facebook designed their friends relationship very well but I'm sorry that I can't tell you how to implement it in a way it will work fast. Implementing a social network is easy but making sure it performs well is clearly not - IMHO.
I've started experimenting with OrientDB to do the graph-queries and mapping my edges to the underlying SQL DB. If I ever get it done I'll write an article about it.
How can I create a well performing social network site?
Update 2021-04-10: I'll probably never ever write the article ;) but here are a few bullet points how you could try to scale it:
Use different read and write repositories
Build specific read repositories based on faster non-relational DB systems made for that purpose, don't be afraid of denormalizing data. Write to a normalized DB but read from specialized views.
Use eventual consistency
Take a look at CQRS
For a social network, graph-based read repositories might also be a good idea.
Use Redis as a read repository in which you store whole serialized data sets
If you combine the points from the above list in a smart way you can build a very well performing system. The list is not a "todo" list, you'll still have to understand, think and adapt it! https://microservices.io/ is a nice site that covers a few of the topics I mentioned before.
What I do is store events that are generated by aggregates and use projections and handlers to write to different DBs as mentioned above. The cool thing about this is that I can rebuild my data as needed at any time.
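To make the "rebuild at any time" part concrete, here is a toy sketch in plain Python. The event names, fields and the `project_friend_lists` function are all hypothetical; a real system would persist the log and write the read model to something like the read repositories listed above.

```python
from collections import defaultdict

# Append-only event log; the event names and fields are hypothetical.
events = [
    {"type": "FriendAdded", "user": 1, "friend": 2},
    {"type": "FriendAdded", "user": 1, "friend": 3},
    {"type": "FriendRemoved", "user": 1, "friend": 2},
]

def project_friend_lists(event_log):
    """Rebuild a denormalized read model (user -> set of friends) from scratch."""
    view = defaultdict(set)
    for event in event_log:
        a, b = event["user"], event["friend"]
        if event["type"] == "FriendAdded":
            view[a].add(b)
            view[b].add(a)
        elif event["type"] == "FriendRemoved":
            view[a].discard(b)
            view[b].discard(a)
    return view
```

Because the view is derived purely from the log, it can be thrown away and replayed into a different store or a different shape whenever the read side changes.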
Have a look at the following database schema, reverse engineered by Anatoly Lubarsky:
My best bet is that they created a graph structure. The nodes are users and "friendships" are edges.
Keep one table of users, keep another table of edges. Then you can keep data about the edges, like "day they became friends" and "approved status," etc.
It's most likely a many to many relationship:
FriendList (table)
user_id -> users.user_id
friend_id -> users.user_id
friendVisibilityLevel
EDIT
The user table probably doesn't have user_email as a PK, possibly as a unique key though.
users (table)
user_id PK
user_email
password
Take a look at these articles describing how LinkedIn and Digg are built:
http://hurvitz.org/blog/2008/06/linkedin-architecture
http://highscalability.com/scaling-digg-and-other-web-applications
There's also "Big Data: Viewpoints from the Facebook Data Team" that might be helpful:
http://developer.yahoo.net/blogs/theater/archives/2008/01/nextyahoonet_big_data_viewpoints_from_the_fac.html
Also, there's this article that talks about non-relational databases and how they're used by some companies:
http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php
You'll see that these companies are dealing with data warehouses, partitioned databases, data caching and other higher-level concepts that most of us never deal with on a daily basis. Or at least, maybe we don't know that we do.
There are a lot of links on the first two articles that should give you some more insight.
UPDATE 10/20/2014
Murat Demirbas wrote a summary on
TAO: Facebook's distributed data store for the social graph (ATC'13)
F4: Facebook's warm BLOB storage system (OSDI'14)
http://muratbuffalo.blogspot.com/2014/10/facebooks-software-architecture.html
HTH
It's not possible to retrieve a user's friend data from an RDBMS in constant time once the data set grows past half a billion rows,
so Facebook implemented this using a hash-style (NoSQL) database, which they open-sourced as Cassandra.
So every user has its own key and the friends' details in a queue; to see how Cassandra works, look at this:
http://prasath.posterous.com/cassandra-55
It's a type of graph database:
http://components.neo4j.org/neo4j-examples/1.2-SNAPSHOT/social-network.html
It's not related to relational databases.
Google for graph databases.
You're looking for foreign keys. Basically you can't have an array in a database unless it has its own table.
Example schema:
Users Table
userID PK
other data
Friends Table
userID -- FK to the Users table representing the user that has a friend.
friendID -- FK to the Users table representing the user id of the friend
Probably there is a table, which stores the friend <-> user relation, say "frnd_list", having fields 'user_id','frnd_id'.
Whenever a user adds another user as a friend, two new rows are created.
For instance, suppose my id is 'deep9c' and I add a user having id 'akash3b' as my friend, then two new rows are created in table "frnd_list" with values ('deep9c','akash3b') and ('akash3b','deep9c').
Now when showing the friends list to a particular user, a simple SQL query would do: "select frnd_id from frnd_list where user_id = ?"
where ? is the id of the logged-in user (stored as a session attribute).
Regarding the performance of a many-to-many table: if each row is two 32-bit ints linking user IDs, your basic data storage for 200,000,000 users averaging 200 friends apiece is 320 GB, or just under 300 GiB.
Obviously, you would need some partitioning and indexing and you're not going to keep that in memory for all users.
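The arithmetic behind that estimate, as a quick sketch (the function name is just for illustration, and index and row overhead are ignored):

```python
def friend_table_bytes(users, avg_friends, bytes_per_id=4, ids_per_row=2):
    """Raw payload size: one row of two 32-bit ids per (user, friend) pair."""
    return users * avg_friends * bytes_per_id * ids_per_row

raw = friend_table_bytes(200_000_000, 200)
print(raw / 10**9)  # size in decimal gigabytes
print(raw / 2**30)  # size in binary gibibytes
```

So the figure is 320 GB in decimal units and a little under 300 in binary units, before any indexes or per-row storage overhead, which will add to it.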