How to design a news feed system like Google Reader?

I'm preparing for a system design interview, and I expect to be asked this kind of question, so I want to show my design process here. In addition, I would like to know the best practices for handling the difficulties that come up along the way. I'm thinking in terms of scalability and how I would handle heavy read and write load on the database. Please correct me if any of my thinking is wrong.
First, I want to build subscribe/unsubscribe functionality, and for each user I want to support marking feeds as read/unread. How can I design a system like this? At first glance, the first problem I can see is that if I put all the data in a database, it could involve tons of read/write operations once thousands of users subscribe/unsubscribe from a source, or once a media source like CNN posts a feed every 5-10 minutes.
Obviously, the database would become a bottleneck once the user base grows past a certain point.
How can I solve this? What are the usual approaches to this problem? Although the database is a bottleneck from this point of view, we still need one, just with a better design, right? I've seen a lot of articles talking about denormalized data.
Questions:
What's the best way to store the subscribers of each source?
In a database, I can think of a table with "source_id" and "user_id" columns, meaning user_id subscribes to source_id. Is this a good design or a bad one? If tons of users subscribe to new sources, the database becomes a burden.
The approach I can think of is using Redis, which provides fast writes and fast reads.
Advantages:
Fast read and write operations.
Provides multiple data structures rather than a simple key-value store.
Disadvantages:
Data needs to fit in memory ⇒ solution: sharding. For sharding I can use twemproxy to manage the cluster.
If Redis goes down, we lose data ⇒ solution: replication; embrace a "master-slave" setup. Write to the master, read from slaves, and back up data to disk (data persistence). Additionally, take snapshots on an hourly basis.
Now that I've listed the pros and cons of moving to a Redis cluster, how do I store the relation between a source and its subscribers in Redis? Is it a good design to have a hash where each source points to a list of subscribers?
For example,
Cnn ⇒ (sub1, sub2, sub3, sub4….)
Espn ⇒ (sub1,sub2,sub3,sub4..)
…
In terms of scalability, we can shard sources and users across dedicated Redis nodes; that is at least what I can think of right now.
In addition, we can store user info (which sources a user subscribes to) in Redis as well and shard users across multiple nodes:
User1 ⇒ (source1, source2, source4..)
User2 ⇒ (source1, source2, source4..)
...
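To make this concrete, here is a rough sketch of how both directions could be kept as Redis sets. I'm assuming Python with the redis-py client, and the key names (source:{id}:subscribers, user:{id}:sources) are just illustrative:

import redis

# Hypothetical connection; in production this would go through the shard map.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def subscribe(user_id, source_id):
    # Keep both directions so "who follows CNN?" and "what does user1 follow?"
    # are each a single set lookup.
    pipe = r.pipeline()  # MULTI/EXEC, so the two sets stay consistent
    pipe.sadd(f"source:{source_id}:subscribers", user_id)
    pipe.sadd(f"user:{user_id}:sources", source_id)
    pipe.execute()

def unsubscribe(user_id, source_id):
    pipe = r.pipeline()
    pipe.srem(f"source:{source_id}:subscribers", user_id)
    pipe.srem(f"user:{user_id}:sources", source_id)
    pipe.execute()

# subscribe("user1", "cnn"); r.smembers("source:cnn:subscribers") -> {"user1"}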
For feeds and posts from a single source, I can have both a database table and a Redis data structure (basically, my idea is to store everything in Redis with the database as a backup; is that a good design consideration in this case? Maybe not everything, perhaps only active users or recent feeds in Redis).
Database: I want to keep it as compact as possible, storing only a single copy:
feedID, sourceID, created_timestamp, data
Redis: store feedID, sourceID, and content, and find subscribers based on sourceID.
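As a sketch of the Redis side (again Python/redis-py; the key names and the 10-item cap per source are assumptions on my part):

import json
import time

import redis

r = redis.Redis(decode_responses=True)

def store_feed(feed_id, source_id, data):
    # Compact copy in Redis; the SQL table (feedID, sourceID,
    # created_timestamp, data) remains the durable copy.
    r.hset(f"feed:{feed_id}", mapping={
        "source_id": source_id,
        "created": int(time.time()),
        "data": json.dumps(data),
    })
    # Keep only the 10 most recent feed ids per source, as described below.
    pipe = r.pipeline()
    pipe.lpush(f"source:{source_id}:recent", feed_id)
    pipe.ltrim(f"source:{source_id}:recent", 0, 9)
    pipe.execute()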
For the read/unread part, I don't have a clear idea how to design around these limitations.
Every user has a join timestamp, and the server will push feeds (at most 10 per source) that the user hasn't read. What's a good design for telling whether a user has read a feed or not? My initial thought is to keep track of the read/unread state of every feed for every user, but that table could grow linearly with the number of feeds. In Redis, I can design a similar structure:
Userid, feedid, status
User1, 001, read
User1, 002, read
User1, 003, unread
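An alternative to one row per (user, feed) pair is to store only the IDs a user has already read in a Redis set and treat everything else as unread. A minimal sketch, assuming redis-py and hypothetical key names:

import redis

r = redis.Redis(decode_responses=True)

def mark_read(user_id, feed_id):
    # Store only what the user has read; anything not in the set counts as unread.
    r.sadd(f"user:{user_id}:read", feed_id)

def is_read(user_id, feed_id):
    return r.sismember(f"user:{user_id}:read", feed_id)

# mark_read("user1", "001"); is_read("user1", "003") -> False, i.e. unread

The set still grows with the number of feeds a user actually reads, but old IDs can be trimmed once they fall out of the 10-per-source window.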
At this point, my initial idea for structuring the data is as above. Redis runs in a "master-slave" setup and backs up to disk on an hourly basis.
Now I'm going to think about how the subscribe/unsubscribe flow works. A user clicks the subscribe button on a media page, for instance CNN. The web server receives a request saying "user X subscribes to source Y". In the application layer, we find the machine that holds user X's data; this can be achieved by installing a shard map on every application server, which works like: user_id mod number_of_shards = machine_id.
Once the application has looked up the IP of the server that holds user X's data, the application server talks to that Redis node and updates the user structure with the new source_id. Unsubscribe works the same way.
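A minimal sketch of that shard map, assuming numeric user IDs and a hard-coded list of Redis nodes (both are simplifications for illustration):

# Hypothetical shard map installed on every application server.
REDIS_SHARDS = [
    {"host": "10.0.0.1", "port": 6379},
    {"host": "10.0.0.2", "port": 6379},
    {"host": "10.0.0.3", "port": 6379},
]

def shard_for_user(user_id):
    # user_id mod number_of_shards = machine index, as described above.
    return REDIS_SHARDS[user_id % len(REDIS_SHARDS)]

In practice a proxy layer such as twemproxy (mentioned earlier) or Redis Cluster's hash slots can take care of this routing instead of a hand-rolled map.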
For marking a specific feed read/unread for user X, the application looks up the Redis node, updates the structure there, and Redis asynchronously propagates the update to the database (here I embrace eventual consistency).
Let's think about how to design the push/pull model.
For push notifications, once there is a new feed I can store the most recent feeds in Redis and update only the active users (the reason is to avoid as many write operations on the database as possible).
For the pull model, only update users when they reload their home feed page, which also avoids a lot of disk seek time.
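A naive fan-out-on-write sketch for the push side, reusing the sets from earlier plus a hypothetical active_users set (for a popular source this loop should run as a background job, not inline in the request):

import redis

r = redis.Redis(decode_responses=True)

def fan_out(feed_id, source_id, timeline_cap=100):
    # Push the new feed id only to subscribers currently marked active;
    # inactive users fall back to the pull path on their next page load.
    for user_id in r.smembers(f"source:{source_id}:subscribers"):
        if not r.sismember("active_users", user_id):
            continue
        pipe = r.pipeline()
        pipe.lpush(f"user:{user_id}:timeline", feed_id)
        pipe.ltrim(f"user:{user_id}:timeline", 0, timeline_cap - 1)
        pipe.execute()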
Some points:
Only put active users in Redis (logged in within the last 30 days).
If a user has been inactive for 6 months and recently logs back in to check their feeds, another service reconstructs their data from the database, puts it into Redis, and serves the user.
Store recent feeds in Redis and only push notifications to currently active subscribers. This avoids disk seek time on the database.
To make feeds sortable, embed a timestamp in the feedID. For example, the high-order bits of the feedID can be the timestamp, and some lower bits can encode the sourceID as well. This makes feeds naturally sortable (see the sketch after this list).
Application servers can scale horizontally and sit behind a load balancer.
Application servers connect to the Redis cluster, and the database is there for backing up and reconstructing data when needed (as in the inactive-user case).
Redis uses a "master-slave" setup: write to the master, read from slaves, and replicate data asynchronously. Back data up to disk on a regular basis, and also update the database asynchronously.
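Here is a rough sketch of such a sortable feed ID, loosely in the spirit of Twitter's Snowflake IDs; the bit widths are arbitrary assumptions rather than a recommendation:

import time

SOURCE_BITS = 16      # assumption: up to ~65k sources
SEQUENCE_BITS = 10    # disambiguates feeds created in the same millisecond

def make_feed_id(source_id, sequence):
    # Millisecond timestamp in the high-order bits keeps IDs naturally
    # ordered by creation time.
    ts = int(time.time() * 1000)
    return (ts << (SOURCE_BITS + SEQUENCE_BITS)) | (source_id << SEQUENCE_BITS) | sequence

def timestamp_of(feed_id):
    return feed_id >> (SOURCE_BITS + SEQUENCE_BITS)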
Questions:
Is updating the database asynchronously from Redis when new events come in a feasible solution? Or is keeping the replication enough?
I know it's a long post and I want to hear back from the community. Please correct me if I'm wrong or point anything out so we can discuss the approach further.

Related

System design to aggregate in near-real time, the N most shared articles over the last five minutes, last hour and last day?

I was recently asked this system design question in an interview:
Let's suppose an application allows users to share articles from 3rd party sites with their connections. Assume all share actions go through a common code path on the app site (served by multiple servers in geographically diverse colos). Design a system to aggregate, in near-real time, the N most shared articles over the last five minutes, last hour and last day. Assume the number of unique shared articles per day is between 1M and 10M.
So I came up with the following components:
Existing service tier that handles share events
Aggregation service
Data Store
Some Transport mechanism to send notifications of share events to aggregation service
Now I started talking about how data from the existing service tier that handles share events would get to the aggregation servers. A possible solution was to use a messaging queue like Kafka here.
The interviewer asked why I chose Kafka here and how Kafka would work: what topics would I create and how many partitions would each have? Since I was confused, I couldn't answer properly. Basically he was trying to probe point-to-point vs publish-subscribe, and push vs pull models.
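For example, producers could key each share event by article URL so that all shares of the same article land on the same partition, and therefore on the same consumer in the aggregation service's consumer group. A sketch using the kafka-python client (the topic name and message shape are assumptions):

import json

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def on_share(url, user_id):
    # Keying by URL keeps all counts for one article on one partition.
    producer.send("share-events", key=url, value={"url": url, "user": user_id})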
Now I started talking about how the Aggregation service operates. One solution I gave was to keep a collection of counters for each shared URL, bucketed into 5-minute windows over the last 24 hours (288 buckets per URL). As each share event happens, increment the current bucket and recompute the 5-minute, hour, and day totals. Update the Top-N lists as necessary. As each newly shared URL comes in, push out any URLs that haven't been updated in 24 hours. I think all of this can be done on a single machine.
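A single-machine sketch of those bucketed counters (plain Python, in memory; the names and structure are only illustrative):

import time
from collections import defaultdict

BUCKET = 300               # 5-minute buckets
WINDOW = 24 * 60 * 60      # keep 24 hours, i.e. 288 buckets per URL

counters = defaultdict(lambda: defaultdict(int))   # url -> bucket_start -> count

def record_share(url, now=None):
    now = now or time.time()
    bucket = int(now // BUCKET) * BUCKET
    counters[url][bucket] += 1
    # Drop buckets that have fallen out of the 24-hour window.
    for stale in [b for b in counters[url] if b < now - WINDOW]:
        del counters[url][stale]

def total(url, seconds, now=None):
    now = now or time.time()
    return sum(c for b, c in counters[url].items() if b >= now - seconds)

# total("http://example.com/a", 300) -> shares in the last 5 minutes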
The interviewer asked whether this can all be done on one machine, and whether maintaining 1M-10M tracked shares can be done on one machine. If not, how would you partition? What happens if it crashes and how will you recover? Basically I was confused about how the Aggregation service would actually work here: how it gets data from Kafka and what it actually does with that data.
Now for the data store part, I don't think we need a persistent data store here, so I suggested we could use Redis with partitioning and redundancy.
The interviewer asked how I would partition and add redundancy, how the Redis instances would get updated in the overall flow, and how the data in Redis would be structured. I was confused on this as well. I told him that we could write the output from the Aggregation service to these Redis instances.
There were a few things I was not able to answer since I am confused about how the entire flow would work. Can someone help me understand how we can design a system like this in a distributed fashion? And what should I have answered to the questions the interviewer asked me?
The intention of these questions is not to get the ultimate answer to the problem, but to check the competence and thought process of the interviewee. There is no point in panicking while answering this kind of question or while facing tough follow-ups; the intention of the follow-up questions is to guide the interviewee or give hints.
I will try to share one probable answer to this problem. Assume I have a distributed persistent store like Cassandra, and I am going to maintain the current sharing status using that Cassandra infrastructure. I will maintain a Redis cluster in front of the persistence layer for LRU caching and keep buckets for 5 minutes, 1 hour, and a day; eviction is configured using expiring keys. Now my aggregator service only needs to work with the minimal data present in the Redis LRU cache. A high-throughput distributed Kafka cluster pumps data from the share handler, and Kafka feeds the data to the Redis cluster and from there to Cassandra. To maintain near-real-time output, the Kafka cluster's throughput has to keep up with the share rate.
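One way to realize the Redis part of that answer is a sorted set per 5-minute bucket: ZINCRBY on every share, EXPIRE for eviction, and ZUNIONSTORE to merge the buckets covering a window for the top-N query. A sketch with redis-py (the key names are assumptions):

import time

import redis

r = redis.Redis(decode_responses=True)
BUCKET = 300  # 5-minute buckets

def record_share(url):
    bucket = int(time.time() // BUCKET) * BUCKET
    key = f"shares:{bucket}"
    pipe = r.pipeline()
    pipe.zincrby(key, 1, url)
    pipe.expire(key, 24 * 60 * 60 + BUCKET)   # buckets evict themselves after a day
    pipe.execute()

def top_n(n, window_seconds):
    # Merge the buckets covering the window and read the highest scores.
    now = int(time.time() // BUCKET) * BUCKET
    keys = [f"shares:{b}" for b in range(now - window_seconds + BUCKET, now + BUCKET, BUCKET)]
    r.zunionstore("shares:merged", keys)
    return r.zrevrange("shares:merged", 0, n - 1, withscores=True)

# top_n(10, 3600) -> the 10 most shared URLs over the last hour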

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes, it would. You have very little data, so there is no need to try to 'cache' it in some way (apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from making these requests from the web server to the database server. With as little data as this you will not find performance an issue, and even if it became an issue as everything grows, there is a lot to be gained on the database side (indexes, etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment either, so again, a direct query would be best.
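For illustration, a direct parameterized query from the web server might look like the sketch below (Python DB-API style with PyMySQL; the table and column names are assumptions, since the question doesn't give them):

import pymysql  # any DB-API driver works the same way

def get_prices(conn, start_date, end_date, product_names):
    # The date range and product list are bound as parameters, so injection is
    # not a concern, and an index on (product_name, price_date) keeps it fast.
    placeholders = ", ".join(["%s"] * len(product_names))
    sql = (
        "SELECT product_name, price_date, price "
        "FROM daily_prices "
        "WHERE price_date BETWEEN %s AND %s "
        "AND product_name IN (" + placeholders + ") "
        "ORDER BY price_date"
    )
    with conn.cursor() as cur:
        cur.execute(sql, [start_date, end_date] + list(product_names))
        return cur.fetchall()

# conn = pymysql.connect(host="warehouse.internal", user="web_ro",
#                        password="...", database="warehouse")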
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify the CSV method. Examples, and why you need not worry:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day the chance of a 1-in-10,000 failure might seem better for the once-a-day method. Still, these issues should not come up very often, and if they do, you should be able to handle them (retry the request, show a message to the user). This is what enormous numbers of websites do, so trust me when I say this will not be an issue. Also, think about what it would mean if your daily update failed; that would be a bigger problem!
Performance issues
As said, given the amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level: use a caching system (not CSV) on the database server, use a caching system on the web server, and fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
Your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise.
Your second database (even on your web server) gets all updates from the master and is read-only; you can only query it for data.
Your web server cannot connect to the data warehouse, but it can connect to the slave to read information. Even if there were an injection hack, it wouldn't matter, as the slave is read-only.
Now there is no single moment where you have to update the queried database (the master-slave replication keeps it up to date), and there is no chance that queries from the web server put your warehouse in danger. Profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If that is the only form, just ensure that the only field in it is a date; then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file with just the connection function should do fine in most cases, so that a user can't, say, open your web page in an HTML viewer and see your database connection string.
As for the CSV, I would have to say that querying the database per user request, especially when the feature is only used ~10 times weekly, would be much more efficient than the CSV. I consider the CSV overkill because, again, you only have ~10 users trying to get some information; exporting an updated CSV every day would be too much work for such a small payoff.
EDIT:
Also, if an attack is a big concern (which really depends on the nature of the business, the data being stored, and the visitors you receive), you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is possible that even with the best security an attack could happen. That mainly depends on whether attackers want the information you have.

The fastest method to move Redis data to MySQL

We have a big shopping and product-dealing system. We have faced lots of problems with MySQL, so after a bit of R&D we planned to use Redis, and we started integrating Redis into our system.
The following things, which previously hit the database directly, have now moved to Redis:
User shopping cart details
Affiliate click tracking records
Product dealing user data
Other site stats.
I am not only storing the data in Redis; I have written crons which move the Redis data into MySQL at intervals. This is the main point where I am facing issues.
Below are the points I am looking for solutions to:
Is there any other way to dump big data from Redis to MySQL?
Redis persists our data to a file, so is it possible to store that data directly in the MySQL database?
Does Redis have any trigger system I can use to avoid the crons, like a queue system?
Is there any other way to dump big data from Redis to MySQL?
Redis has the ability (using bgsave) to generate a dump of the data in a non-blocking and consistent way.
https://github.com/sripathikrishnan/redis-rdb-tools
You could use Sripathi Krishnan's well-known package to parse a Redis dump file (RDB) in Python and populate the MySQL instance offline, or you can convert the Redis dump to JSON format and write scripts in any language you want to populate MySQL.
This solution is only interesting if you want to copy the complete data of the Redis instance into MySQL.
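As a sketch of the JSON route (assuming the rdb CLI that ships with redis-rdb-tools, its documented JSON layout of one key/value map per Redis database, and a hypothetical staging table called redis_archive):

# First dump the RDB file to JSON, e.g.:
#   rdb --command json /var/redis/6379/dump.rdb > dump.json
import json

import pymysql

def load_dump_into_mysql(json_path, conn):
    with open(json_path) as f:
        databases = json.load(f)   # a list with one {key: value} map per Redis DB
    with conn.cursor() as cur:
        for db in databases:
            cur.executemany(
                "INSERT INTO redis_archive (redis_key, value) VALUES (%s, %s)",
                [(key, json.dumps(value)) for key, value in db.items()],
            )
    conn.commit()

From the staging table you can then transform the rows into your real MySQL schema.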
Does Redis have any trigger system that I can use to avoid the crons, like a queue system?
Redis has no trigger concept, but nothing prevents you from posting events to Redis queues each time something must be copied to MySQL. For instance, instead of:
# Add an item to a user shopping cart
RPUSH user:<id>:cart <item>
you could execute:
# Add an item to a user shopping cart
MULTI
RPUSH user:<id>:cart <item>
RPUSH cart_to_mysql <id>:<item>
EXEC
The MULTI/EXEC block makes it atomic and consistent. Then you just have to write a little daemon waiting on items of the cart_to_mysql queue (using BLPOP commands). For each dequeued item, the daemon has to fetch the relevant data from Redis, and populate the MySQL instance.
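A minimal version of such a daemon could look like this (Python with redis-py and PyMySQL; the cart_items table and the connection details are assumptions):

import pymysql
import redis

r = redis.Redis(decode_responses=True)
conn = pymysql.connect(host="localhost", user="shop", password="...", database="shop")

while True:
    # BLPOP blocks until an event arrives on the queue written by the
    # MULTI/EXEC block above; events look like "<id>:<item>".
    _, event = r.blpop("cart_to_mysql")
    user_id, item = event.split(":", 1)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO cart_items (user_id, item) VALUES (%s, %s)",
            (user_id, item),
        )
    conn.commit()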
Redis persists our data to a file, so is it possible to store that data directly in the MySQL database?
I'm not sure I understand the question here. But if you use the above solution, the latency between Redis updates and MySQL updates will be quite limited, so if Redis fails you will only lose the very last operations (contrary to a solution based on cron jobs). It is of course not possible to have 100% consistency in the propagation of the data, though.

What's the best way to get a 'lot' of small pieces of data synced between a Mac App and the Web?

I'm considering MongoDB right now. Just so the goal is clear here is what needs to happen:
In my app, Finch (finchformac.com for details), I have thousands and thousands of entries per day for each user: what window they had open, the time they opened it, the time they closed it, and a tag if they chose one. I need this data to be backed up online so it can sync to their other Mac computers, etc. I also need to be able to draw charts online from their data, which means some complex queries hitting hundreds of thousands of records.
Right now I have tried using Ruby/Rails/Mongoid with a JSON parser on the app side, sending up data in increments of 10,000 records at a time; the data is then processed into other collections with a background MapReduce job. But this all seems to block and is ultimately too slow. What recommendations does anyone have for how to go about this?
You've got a complex problem, which means you need to break it down into smaller, more easily solvable issues.
Problems (as I see it):
You've got an application which is collecting data. You just need to store that data somewhere locally until it gets sync'd to the server.
You've received the data on the server and now you need to shove it into the database fast enough so that it doesn't slow down.
You've got to report on that data, and this sounds hard and complex.
You probably want to write this as some sort of API. For simplicity (and since you've got loads of spare processing cycles on the clients), you'll want these chunks of data processed on the client side into JSON ready to import into the database. Once you've got JSON you don't need Mongoid (you just throw the JSON into the database directly). Also, you probably don't need Rails since you're just creating a simple API, so stick with just Rack or Sinatra (possibly using something like Grape).
Now you need to solve the whole "this all seems to block and is ultimately too slow" issue. We've already removed Mongoid (so there's no need to convert from JSON -> Ruby objects -> JSON) and Rails. Before we get to doing a MapReduce on this data, you need to ensure it's getting loaded into the database quickly enough. Chances are you should architect the whole thing so that your MapReduce supports your reporting functionality. For syncing data you shouldn't need to do anything but pass the JSON around. If your data isn't writing into your DB fast enough, you should consider sharding your dataset. This will probably be done using some user-based key, but you know your data schema better than I do. You need to choose your shard key so that when multiple users are syncing at the same time they will probably be using different servers.
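For the "load it quickly enough" part, a sketch of the server-side import with PyMongo (the database and collection names are hypothetical):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.finch.window_events   # hypothetical database/collection

def import_chunk(json_docs):
    # json_docs is the already-parsed JSON chunk the Mac client sent up.
    # insert_many with ordered=False lets the server batch the writes and
    # keep going past individual bad documents instead of failing the batch.
    events.insert_many(json_docs, ordered=False)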
Once you've solved Problems 1 and 2 you need to work on your reporting. This is probably supported by your MapReduce functions inside Mongo. My first comment on this part is to make sure you're running at least Mongo 2.0; in that release 10gen sped up MapReduce (my tests indicate that it is substantially faster than 1.8). Other than this, you can achieve further increases by sharding and by directing reads to the secondary servers in your replica set (you are using a replica set?). If this still isn't working, consider structuring your schema to support your reporting functionality. This lets you use more cycles on your clients to do work rather than loading your servers. But this optimisation should be left until after you've proven that conventional approaches won't work.
I hope that wall of text helps somewhat. Good luck!

Is this a good use-case for Redis on a ServiceStack REST API?

I'm creating a mobile app and it requires an API service backend to get/put information for each user. I'll be developing the web service on ServiceStack, but I was wondering about the storage. I love the idea of a fast in-memory caching system like Redis, but I have a few questions:
I created a sample schema of what my data store should look like. Does this seems like it's a good case for using Redis as opposed to a MySQL DB or something like that?
schema http://www.miles3.com/uploads/redis.png
How difficult is the setup for persisting the Redis store to disk or is it kind of built-in when you do writes to the store? (I'm a newbie on this NoSQL stuff)
I currently have my setup on AWS using a Linux micro instance (because it's free for a year). I know many factors go into this answer, but in general will this be enough for my web service and Redis? Since Redis is in-memory will that be enough? I guess if my mobile app skyrockets (hey, we can dream right?) then I'll start hitting the ceiling of the instance.
What to think about when designing a NoSQL Redis application
1) To develop correctly in Redis you should be thinking more about how you would structure the relationships in your C# program, i.e. with the C# collection classes, rather than a relational model meant for an RDBMS. The better mindset would be to think about data storage more like a document database than RDBMS tables. Essentially everything gets blobbed in Redis via a key (index), so you just need to work out which of your entities are primary entities (i.e. aggregate roots), which get kept in their own 'key namespace', and which are non-primary entities, i.e. simply metadata that should just get persisted with its parent entity.
Examples of Redis as a primary Data Store
Here is a good article that walks through creating a simple blogging application using Redis:
http://www.servicestack.net/docs/redis-client/designing-nosql-database
You can also look at the source code of RedisStackOverflow for another real world example using Redis.
Basically you would need to store and fetch the items of each type separately.
var redisUsers = redis.As<User>();
var user = redisUsers.GetById(1);
var userIsWatching = redisUsers.GetRelatedEntities<Watching>(user.Id);
The way you store relationships between entities is by making use of Redis sets, e.g. you can store the Users/Watchers relationship conceptually with:
SET["ids:User>Watcher:{UserId}"] = [{watcherId1},{watcherId2},...]
Redis is schema-less and idempotent
Storing ids into redis sets is idempotent i.e. you can add watcherId1 to the same set multiple times and it will only ever have one occurrence of it. This is nice because it means you don't ever need to check the existence of the relationship and can freely keep adding related ids like they've never existed.
Related: writing to or reading from a Redis collection (e.g. a List) that does not exist is the same as writing to an empty collection, i.e. a list gets created on the fly when you add an item to it, whilst accessing a non-existent list will simply return 0 results. This is a friction-free productivity win since you don't have to define your schemas up front in order to use them. Should you need to, though, Redis provides the EXISTS operation to determine whether a key exists and a TYPE operation so you can determine its type.
Create your relationships/indexes on your writes
One thing to remember is that because there are no implicit indexes in Redis, you will generally need to set up the indexes/relationships needed for reading yourself during your writes. Basically you need to think about all your query requirements up front and ensure you set up the necessary relationships at write time. The RedisStackOverflow source code above is a good example that shows this.
Note: the ServiceStack.Redis C# provider assumes you have a unique field called Id that is its primary key. You can configure it to use a different field with the ModelConfig.Id() config mapping.
Redis Persistence
2) Redis supports 2 persistence modes out of the box: RDB and Append Only File (AOF). RDB writes routine snapshots whilst the Append Only File acts like a transaction journal recording all the changes in between snapshots. I recommend using both until you're comfortable with what each does and what your application needs. You can read all about Redis persistence at http://redis.io/topics/persistence.
Note Redis also supports trivial replication you can read more about at: http://redis.io/topics/replication
Redis loves RAM
3) Since Redis operates predominantly in memory, the most important resource is having enough RAM to hold your entire dataset in memory, plus a buffer for when it snapshots to disk. Redis is very efficient, so even a small AWS instance will be able to handle a lot of load; what you want to watch is having enough RAM.
Visualizing your data with the Redis Admin UI
Finally if you're using the ServiceStack C# Redis Client I recommend installing the Redis Admin UI which provides a nice visual view of your entities. You can see a live demo of it at:
http://servicestack.net/RedisAdminUI/AjaxClient/