Friends of friends use case - Redis vs a Graph Database

I made a small social network as part of an assignment for school. Our next task is to add a feature to the project that uses a non-relational database. It was suggested that we go for Redis or ElasticSearch.
It is clear to me that I could use ElasticSearch to search for people and groups based on their names etc.
But at the moment I am more interested in making a potential friend finder which suggests friends based on common friends of the two users and maybe the groups they are part of.
My question is: Is this a good use-case for Redis or would it be much much better to use a Graph database for something like this?
This is how I imagined it:
I have a Set of registered users called "users" stored in Redis
For each user I have a Set which keeps track of their friends e.g.
"user:1:friends"
I also have a Sorted Set of potential friends stored for each user
e.g. "user:1:potential"
Let's say a user I am not friends with adds one of my friends to their friends list. When this happens, I would take the friend sets of all of my friend's friends and check whether my friend's new friend is in each of them. If not, I would increment the score assigned to his ID in the potential-friends sets of those friends of my friend who are not yet friends with the new guy.
All in all this seems to me like a lot of work, which is why I am not sure whether this is even a good idea.
So again: would a graph database just be much better for something like this?

Considering you'd have everything to implement anyway (including setting up the NoSQL DB), I would definitely go with a graph database.
Friends of friends (and, more generally, suggestions) is a basic use case for this type of DB. It's where they show their full potential.
I'd suggest that you have a look at Neo4j: http://neo4j.com
And their social network use case: http://neo4j.com/use-cases/social-network/
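To make the friends-of-friends idea concrete, here is a minimal sketch in plain Python. The `friends` dict is a hypothetical in-memory stand-in for the "user:<id>:friends" sets from the question, and `potential_friends` is an illustrative name, not part of Redis or Neo4j. It recomputes the scores on demand rather than updating them on every new friendship, which is only one of several possible designs.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the Redis sets "user:<id>:friends".
friends = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4, 5},
    4: {2, 3},
    5: {3},
}

def potential_friends(user_id):
    """Rank non-friends by the number of friends they share with user_id."""
    scores = Counter()
    for friend in friends[user_id]:
        for candidate in friends[friend]:
            # Skip the user and anyone who is already a friend.
            if candidate != user_id and candidate not in friends[user_id]:
                scores[candidate] += 1
    # Highest score first; ties are broken by id to keep the order stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(potential_friends(1))
```

With Redis, the inner loop would roughly correspond to reading each friend's set and incrementing scores in the user's Sorted Set; a graph database expresses the same two-hop traversal as a single query.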

Related

Creating a SOLR index for activity stream or newsfeed

I am trying to index the activity feed of a social portal I am building. The portal allows users to follow each other and get updates from the people they follow as an activity feed sorted by date.
For example, user A will be following users B, C, D, E & F. So user A should see all the posts from B, C, D, E & F on his/her activity feed.
Let's assume the post consist of just two fields.
1. The text of the post. (text_field)
2. The name/UID of the user who posted it. (user_field)
Currently, I am creating an index for all the posts and indexing the text_field & user_field. At scale, there can be 1,000,000+ posts. A user may follow 100s if not 1000s of users. What is the best way to create an index for this scenario?
Should I also index a person's followers, so that they can be looked up quickly and then passed to a second query that gets the posts of all those users sorted by date?
What is the best way to query the index consisting of all these posts by passing the UIDs of all the users that are followed, considering there may be 100s or more of them?
Update:
The motivation for using Solr for the news feed was mainly inspired by this detailed slide and my brief discussion with the OpenSocial team.
When starting off with a social portal, fan-out on write seems like overkill and is more expensive, whereas fan-out on read is a better fit. Both the slide and the OpenSocial team suggested using a search backend for fan-out on read. The slide mentioned above also has data on how it helped them.
At present, the feed is going to be flat and the only sort criterion will be the date (recency). We won't be considering relevance or posts from closer groups.
It's kind of abstract, but I will do my best here. Based on what you mentioned, I am not sure if Solr is really the right tool for the job here. You can still have Solr for full-text search, but I am not sure about generating a news feed from it in this scenario. Remember that although Solr is pretty impressive, it is a search engine. For the rest of the post I will assume that you will stick with Solr; keep in mind, though, that we are trying to put a square peg in a round hole here.
Here are a few additional questions you should think about.
You will probably want to add a timestamp of the post to the data element
You need to figure out how to properly sort the results. Is it in order of recency? Or based on posts that the user is more likely to interact with?
If a user has 1000+ connections, would he want to see an update from every one of them in the main feed? Or should posts from a closer group of friends show up higher?
Here are some comments about your questions:
1) If you index a person's followers, it may be hard to keep up. I am assuming followers are going to change often, and re-indexing in this scenario would not really be practical.
2) That sounds closer to the mark, but again, you need to figure out the sorting. You can get a list of connections for the user, then run a search for the top posts from all of them.
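The two-step lookup from point 2 (fetch the followed users, then fetch their posts sorted by recency) can be sketched in plain Python as follows. The sample data, the `following` mapping and the `feed` function are hypothetical, and the `timestamp` field is the extra field suggested above; with Solr, the second step would presumably become one query filtered on user_field and sorted by that timestamp.

```python
from datetime import datetime

# Hypothetical sample posts; field names follow the question.
posts = [
    {"user_field": "B", "text_field": "first from B", "timestamp": datetime(2024, 1, 1, 9, 0)},
    {"user_field": "C", "text_field": "from C", "timestamp": datetime(2024, 1, 2, 9, 0)},
    {"user_field": "Z", "text_field": "not followed", "timestamp": datetime(2024, 1, 3, 9, 0)},
    {"user_field": "B", "text_field": "second from B", "timestamp": datetime(2024, 1, 4, 9, 0)},
]

# Hypothetical follow graph: user -> set of users they follow.
following = {"A": {"B", "C"}}

def feed(user, limit=10):
    """Fan-out on read: build the feed at query time from the followed users."""
    followed = following.get(user, set())
    visible = [post for post in posts if post["user_field"] in followed]
    # Newest first, since recency is the only sort criterion.
    visible.sort(key=lambda post: post["timestamp"], reverse=True)
    return visible[:limit]

for post in feed("A"):
    print(post["timestamp"], post["user_field"], post["text_field"])
```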

Activity streams / feeds / news in social network database schema

My goal is to implement a database schema for a simple, typical social network.
I have read many threads and answers but still have a couple of open questions.
So we have a User table (userId, name, etc.). Users can perform actions (reply, like, follow, etc.). I want to implement a log of all activities using the PULL model, so we write an entry to the Activity table for every action. The schema for this table is (id, ownerId, actionType, targetId, time), where ownerId is the ID of the user who performed the action, actionType is reply, follow or another action, and targetId is the ID of a user or post, depending on actionType. When a user fetches his activities, we just query by his friends' IDs. That much is clear to me. My questions are:
1) If I follow a user and then unfollow him, what should I do? Should I make two entries in the Activity table, or should I remove the first follow entry? What is the best practice?
2) It is clear to me how to query by friend IDs to get all the activities of my friends. But if someone who is not my friend likes my photo, I must still get an event such as "Someone who is not your friend liked your photo". What are good solutions for this case? Maybe I need to change my current schema?
Related questions:
How to implement the activity stream in a social network
Database Design - "Push" Model, or Fan-out-on-write
What's the best manner of implementing a social activity stream?
Thank you all for the good answers.
First, it may be better to split each kind of action into its own table, rather than having all actions in one table, distinguished by types. This makes your metadata about each action more flexible; as you say, the target ID depends on the action; without splitting them out into other tables, it's harder to write constraints on what the data should be.
Second, on your question #1, I think you're confusing a log of user actions with user status. You may need both, so you might want two separate data structures. For example, if a user follows and then unfollows, the status is that they aren't following, but the log of actions is that they followed, then unfollowed. So I think you should be careful to have a separate data structure that captures the current status of certain relationships, apart from actions. Then the problem becomes simpler: you log all actions as they happen, and update the status accordingly.
For question #2, the photo should be its own data object, with "likes" split out into a different table; users like posts. Then of all of the users who like a post, they can easily be grouped into two categories; friends (those who have a friend relationship to the poster) and non-friends.
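The separation between an action log and current status can be sketched like this. The names `activity_log`, `following` and `record` are hypothetical, and in-memory structures stand in for what would be two database tables; the entry fields mirror the Activity schema from the question.

```python
from datetime import datetime, timezone

activity_log = []   # append-only history of actions (the Activity table)
following = set()   # current status: (follower_id, followed_id) pairs

def record(owner_id, action_type, target_id):
    """Log every action as it happens, then update the status accordingly."""
    activity_log.append({
        "ownerId": owner_id,
        "actionType": action_type,
        "targetId": target_id,
        "time": datetime.now(timezone.utc),
    })
    if action_type == "follow":
        following.add((owner_id, target_id))
    elif action_type == "unfollow":
        following.discard((owner_id, target_id))

record(1, "follow", 2)
record(1, "unfollow", 2)

# The log keeps both entries, while the status says "not following".
print(len(activity_log), (1, 2) in following)
```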

Ways to handle security/authorization in a multi tenant API

I'm playing around with a spare time project, mainly to try out new stuff :)
This involves designing a REST API for a system that is multi-tenant. Let's say you have an "organization" that is the "top" entity; it might have an API key assigned that is used for authenticating each request. So on each request we have an organization associated.
Now when a user of the API would like to get a list of, let's say, projects, only those that belong to that organization should be returned. The actual implementation, the queries to the database, is pretty straightforward. However, the approach is interesting, I think.
You could implement the filtering each time you query the database, but a better approach would be a general pre-query applied to all "organization"-related queries, i.e. all queries for entities that belong to an organization. It's all about preventing the wrong entities from being returned. You could isolate the database, but if that is not possible, how would you approach it?
Right now I use NancyFX and RavenDB, so input for that stack would be appreciated, but general ideas, best practices, and dos and don'ts are very welcome.
In this case you could isolate your collections by prefixing them with the organization_id. This may duplicate many collections.
Use case with MongoDB: http://support.mongohq.com/use-cases/multi-tenant.html
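The "general pre-query" idea from the question can be illustrated with a small wrapper that applies the organization filter to every query, so that individual call sites cannot forget it. This is only a sketch in Python with an in-memory dict; `TenantScopedStore` and its fields are hypothetical names and are not part of NancyFX, RavenDB or MongoDB.

```python
class TenantScopedStore:
    """Wraps a dict of collections and always filters by organization."""

    def __init__(self, records, organization_id):
        self._records = records
        self._organization_id = organization_id

    def query(self, collection, predicate=lambda record: True):
        # The organization filter is applied here, once, for every query.
        return [
            record
            for record in self._records.get(collection, [])
            if record["organization_id"] == self._organization_id and predicate(record)
        ]

# Hypothetical data shared by two organizations.
records = {
    "projects": [
        {"id": 1, "organization_id": "org-a", "name": "Alpha"},
        {"id": 2, "organization_id": "org-b", "name": "Beta"},
        {"id": 3, "organization_id": "org-a", "name": "Gamma"},
    ]
}

# One store per request, built from the organization tied to the API key.
store = TenantScopedStore(records, "org-a")
print([project["name"] for project in store.query("projects")])
```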

Should the database verify whether a user is authorized to perform an action

Should the database verify whether a user is authorized to perform a certain action?
Two examples:
1) A user is enrolled in at most 30 teams and can see the scoresheets of these teams only. I'm passing the userid and teamid to the stored procedure and fetching the scoresheet only if the user is authorized to view it. Is it more appropriate to pass in only the teamid and check beforehand which teams the user is enrolled in? Should I do both?
2) Currently I'm passing in the userid of the poster and the commentid of the comment to be deleted, and I'm deleting the comment only if both criteria are met (the userid matches the poster's ID and the commentid matches the comment's ID), just to make sure the user is deleting his own comment and not somebody else's. Is it overkill?
Multiple layers of validation are best practice, and it doesn't seem like your methods would cause additional overhead. Just make sure you connect to the database only once; I've found that the most costly parts of running database queries are the connection and cursors.
http://msdn.microsoft.com/en-us/library/aa174437%28v=sql.80%29.aspx
Security experts will tell you that no amount of security is enough! But at the same time you have to find a balance between security and unnecessary layers of protection that are bound to affect your application's performance.
Answering your 2nd question first: it is a good idea to pass both the userid and the commentid and to match both, so that you don't accidentally delete all comments by a particular user.
Coming to your 1st question now: as I understand it, you want only users who are part of a team to be able to view that team's scoresheet, right? To do so, passing only the teamids of the teams the user is a part of will do. I am not sure what you mean by authorization here!
NOTE:
I have answered your question from a theoretical point of view, with no idea about your table structure or what's written in your stored procedures.
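As an illustration of matching both IDs when deleting a comment, here is a minimal sketch using Python's built-in sqlite3 module. The `comments` table, its columns and the `delete_own_comment` function are assumptions made for the example, since the real table structure and stored procedures are unknown.

```python
import sqlite3

# Hypothetical schema; the real table structure is not known.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (commentid INTEGER PRIMARY KEY, userid INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [(1, 10, "my comment"), (2, 20, "somebody else's comment")],
)

def delete_own_comment(conn, userid, commentid):
    """Delete the comment only if it exists AND belongs to userid."""
    cursor = conn.execute(
        "DELETE FROM comments WHERE commentid = ? AND userid = ?",
        (commentid, userid),
    )
    conn.commit()
    # rowcount is 0 when either condition did not match.
    return cursor.rowcount == 1

print(delete_own_comment(conn, 10, 2))  # user 10 tries to delete user 20's comment
print(delete_own_comment(conn, 10, 1))  # user 10 deletes their own comment
```

Because both conditions sit in the same statement, the ownership check and the delete happen in one round trip to the database.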
Your frontend is a much friendlier environment (libraries, frameworks, best practices) in which to implement whatever access restrictions or authorization you could possibly have in mind. Adding another layer inside the database just adds a lot of complexity and a duplicate implementation of your access restrictions.
I would only consider doing it if clients connect and execute commands directly against the database.
So, rely on the ids provided by the application and spend your energy on sanitizing user input and implementing a sane authentication model. You will need it.

Typical normalization security issue in web applications

I am currently having a problem that I guess a lot of people have run into before, and I would like to know how you handled it.
So, imagine you have 10,000 users on your app (each one has his own user/password login to administer his stuff).
Imagine further that you have a growing normalized SQL table structure in the backend, with tables like: Users, Orders, OrderPositions, Invoices, etc.
So, to show/edit/delete records of a table which isn't the user table itself, you'll probably have links like these to let your users interact with the application.
~/Orders/EditOrder?id=12
~/Orders/ShowOrderPosition?orderId=12&posId=443
Ok, and now the problem:
How do I prevent, in a non-complex way, user A from accessing (show/edit/delete) the data of user B?
Example:
User B calls:
~/Orders/ShowOrderPosition?orderId=12&posId=443
which is an order of user A, so user B should have no access to it.
So, in my code I would need to have a UserIdentity check before or within every single SQL statement, like:
select * from OrderPosition op, Order o, User u
where op.Id = :orderpositionId
and op.Fk_OrderId = :orderId
and o.Id = :orderId
and o.Fk_User = u.Id
and u.Id = :userId
Only this way can I make sure that the data belongs to the requesting user.
Reaching the user table will of course get far more complex the deeper the user table connection is "buried" in the normalization (imagine tables like payments or invoices, connected to the order table...).
Question:
What is your approach to dealing with this, considering low complexity, DRY and performance?
(Hope you understand what I mean ;) )
This is a bit like a multi-tenant application - I have gone down this route and denormalized an ID onto all those tables that require this kind of check (a tenant ID, in your case, sounds like the user id).
I then created an interface that contains this field only and applied it to all those classes in my model layer that required this access.
In my base data access (repository) class, which all the select/update/delete calls go through, I then check whether the class is of the type of that interface, and if so, I check that the current user's ID matches that ID.
Of course, this depends on how your code is structured, and how simple/complex making this global kind of change will be...
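A rough sketch of this pattern in Python might look as follows. `TenantOwned`, `OrderPosition` and `Repository` are hypothetical names, a marker base class stands in for the interface, and a dict stands in for the database; the answer does not say which language or data access layer was actually used.

```python
from dataclasses import dataclass

class TenantOwned:
    """Marker base class for entities that carry a denormalized tenant_id."""

@dataclass
class OrderPosition(TenantOwned):
    id: int
    tenant_id: int   # denormalized copy of the owning user's id
    description: str

class Repository:
    """Base data access class that all reads go through."""

    def __init__(self, rows, current_tenant_id):
        self._rows = rows                      # stand-in for the database
        self._current_tenant_id = current_tenant_id

    def get(self, entity_id):
        entity = self._rows.get(entity_id)
        if entity is None:
            return None
        # Central check: applies to every entity type that is tenant-owned.
        if isinstance(entity, TenantOwned) and entity.tenant_id != self._current_tenant_id:
            raise PermissionError("entity belongs to another user")
        return entity

rows = {443: OrderPosition(id=443, tenant_id=1, description="position of user A")}

print(Repository(rows, current_tenant_id=1).get(443).description)
try:
    Repository(rows, current_tenant_id=2).get(443)
except PermissionError as error:
    print("denied:", error)
```

The check lives in one place, so new tables only need the extra ID column and the marker type to be covered.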
Never expose ids.
And if you have to: encrypt them.
Performance
For ultimate performance you will have to denormalize to the point that reading the row and comparing it with some application-level variable tells you what kind of rights the user has. This is fairly fast, and if your DAO/BAO level is well organized, plugging it in will keep it relatively DRY and at relatively low complexity. NOTE: complexity is also a function of your security model; once you start to implement inheritable, positive and negative, role-based access rules, it cannot be really simple.
DRY
Another route to take (which is very seldom taken these days) is to use your database roles to manage security. This might get complicated, but it will offer unparalleled security, as access will be enforced at the DB level and not at the application level. Complexity at the application code level should go down if you manage to encapsulate all of your access paths into views, which might require quite a bit of re-tailoring at the database level. However(!), it might be possible to implement the security model with very few changes to the application code by renaming the existing tables and replacing them with secured views.
Don't use your internal ID column, encrypted or not, it'll come back to bite you one day.
Create a random, unique, string (GUID, whatever), which contains the link between the user and the data he's requesting. So, instead of having, for user 34567:
Edit order
Create a record {"5dsfwe8frf823jrf",34567,12} in a temporary table and show:
Edit order
When the user clicks the link, fetch 34567,12 from your temporary table.
The string 5dsfwe8frf823jrf is practically impossible to guess = no security risk.
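The token scheme can be sketched in Python with the standard `secrets` module. The `link_tokens` dict stands in for the temporary table, and both function names are hypothetical. Checking that the token belongs to the logged-in user is an extra assumption added here; the answer above only describes the lookup.

```python
import secrets

# Stand-in for the temporary table: token -> (user_id, order_id).
link_tokens = {}

def create_link_token(user_id, order_id):
    """Create a random, unguessable token tied to one user and one order."""
    token = secrets.token_urlsafe(16)
    link_tokens[token] = (user_id, order_id)
    return token

def resolve_link_token(token, current_user_id):
    """Return the order id for a valid token, or None if it does not apply."""
    entry = link_tokens.get(token)
    # Assumption: a token is only honoured for the user it was issued to.
    if entry is None or entry[0] != current_user_id:
        return None
    return entry[1]

token = create_link_token(34567, 12)
print(resolve_link_token(token, 34567))    # the owner gets order 12 back
print(resolve_link_token(token, 11111))    # another user gets None
print(resolve_link_token("guess", 34567))  # an unknown token gets None
```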