Calculate inbox based on dynamic groups - sql

I am facing this problem when calculating the inbox for a user:
On one hand I have a bunch of documents that can potentially have
many readers (DOCS table).
Each reader belongs to one or more defined groups of users.
I have a table DOC_ACCESS_BY_GROUP with (DOC_ID, GROUP_ID).
I need to know if a user has read a document or not. So, I have a table DOC_UNREAD with (DOC_ID, USER_ID) so that if a document is in that table, the user has not read the document yet.
Then each group can change in participants at any time, so I need to calculate my "inbox" for a certain user in real time.
The first guess is: calculate all the groups the user belongs to, then join DOCS with DOC_ACCESS_BY_GROUP to get all the documents for that user (with the associated data), and then do another join to see whether the user has read each document.
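For concreteness, that first guess might look roughly like the sketch below. It uses Python's built-in sqlite3 module only so that it runs anywhere. The USER_GROUPS membership table, the TITLE column, and the sample rows are assumptions for illustration; none of them is named in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOCS (DOC_ID INTEGER PRIMARY KEY, TITLE TEXT);
CREATE TABLE DOC_ACCESS_BY_GROUP (DOC_ID INTEGER, GROUP_ID INTEGER);
CREATE TABLE DOC_UNREAD (DOC_ID INTEGER, USER_ID INTEGER);
-- Hypothetical membership table (not named in the question).
CREATE TABLE USER_GROUPS (USER_ID INTEGER, GROUP_ID INTEGER);
INSERT INTO DOCS VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO DOC_ACCESS_BY_GROUP VALUES (1, 10), (2, 10), (2, 20), (3, 20);
INSERT INTO USER_GROUPS VALUES (1, 10), (1, 20), (2, 20);
INSERT INTO DOC_UNREAD VALUES (2, 1), (3, 2);
""")

# Documents visible through any of the user's groups, plus an unread flag.
INBOX_QUERY = """
SELECT d.DOC_ID, d.TITLE,
       CASE WHEN u.DOC_ID IS NULL THEN 0 ELSE 1 END AS UNREAD
FROM DOCS d
LEFT JOIN DOC_UNREAD u ON u.DOC_ID = d.DOC_ID AND u.USER_ID = ?
WHERE EXISTS (SELECT 1
              FROM DOC_ACCESS_BY_GROUP a
              JOIN USER_GROUPS g ON g.GROUP_ID = a.GROUP_ID
              WHERE a.DOC_ID = d.DOC_ID AND g.USER_ID = ?)
ORDER BY d.DOC_ID
"""

def inbox(user_id):
    """Compute the inbox for one user at query time."""
    return conn.execute(INBOX_QUERY, (user_id, user_id)).fetchall()

inbox_user_1 = inbox(1)
inbox_user_2 = inbox(2)
```

The EXISTS subquery avoids duplicate rows when a document is reachable through more than one of the user's groups.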
The problem is, when my DOCS table grows considerably and I have many users, and many groups... the performance is really poor.
I'm trying to abstract the problem, which is actually a bit more complex. Storing document permissions per user has been ruled out. I also suspect this can't be solved by optimizing the SQL query and has to be handled in software. We also support several databases, such as MySQL, PostgreSQL and MSSQL, so the solution cannot be tied to a specific vendor (I guess).
So, the question is: Does anyone know any mechanism or framework or algorithm to do things differently and solve this problem, in an optimal and performant way?
Memcached? Infinispan? Hadoop?

You probably want to "materialize" the inbox and update it every time the user reads something, the membership of a group changes etc. The materialized inbox could be stored either in a DB table or in a separate system like Infinispan/memcached.
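A minimal sketch of the "materialized inbox in a DB table" variant follows, using Python's sqlite3 module as a stand-in for whichever database is in use. The INBOX and USER_GROUPS tables and all function names are hypothetical; only DOC_ACCESS_BY_GROUP and DOC_UNREAD come from the question.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE DOC_ACCESS_BY_GROUP (DOC_ID INTEGER, GROUP_ID INTEGER);
    CREATE TABLE DOC_UNREAD (DOC_ID INTEGER, USER_ID INTEGER);
    -- Hypothetical tables: group membership and the materialized inbox.
    CREATE TABLE USER_GROUPS (USER_ID INTEGER, GROUP_ID INTEGER);
    CREATE TABLE INBOX (USER_ID INTEGER, DOC_ID INTEGER, UNREAD INTEGER,
                        PRIMARY KEY (USER_ID, DOC_ID));
    """)
    return conn

def rebuild_inbox(conn, user_id):
    """Recompute one user's inbox rows from the source tables."""
    with conn:
        conn.execute("DELETE FROM INBOX WHERE USER_ID = ?", (user_id,))
        conn.execute("""
            INSERT INTO INBOX (USER_ID, DOC_ID, UNREAD)
            SELECT DISTINCT g.USER_ID, a.DOC_ID,
                   EXISTS (SELECT 1 FROM DOC_UNREAD u
                           WHERE u.DOC_ID = a.DOC_ID
                             AND u.USER_ID = g.USER_ID)
            FROM USER_GROUPS g
            JOIN DOC_ACCESS_BY_GROUP a ON a.GROUP_ID = g.GROUP_ID
            WHERE g.USER_ID = ?""", (user_id,))

def change_membership(conn, user_id, group_id, member):
    """Add or remove a membership, then refresh the affected inbox."""
    with conn:
        if member:
            conn.execute("INSERT INTO USER_GROUPS VALUES (?, ?)",
                         (user_id, group_id))
        else:
            conn.execute("DELETE FROM USER_GROUPS "
                         "WHERE USER_ID = ? AND GROUP_ID = ?",
                         (user_id, group_id))
    rebuild_inbox(conn, user_id)

def mark_read(conn, user_id, doc_id):
    """Update the source table and the materialized row together."""
    with conn:
        conn.execute("DELETE FROM DOC_UNREAD WHERE DOC_ID = ? AND USER_ID = ?",
                     (doc_id, user_id))
        conn.execute("UPDATE INBOX SET UNREAD = 0 "
                     "WHERE USER_ID = ? AND DOC_ID = ?", (user_id, doc_id))

def inbox(conn, user_id):
    """Reading the inbox is now a single indexed lookup."""
    return conn.execute("SELECT DOC_ID, UNREAD FROM INBOX "
                        "WHERE USER_ID = ? ORDER BY DOC_ID",
                        (user_id,)).fetchall()

conn = make_db()
conn.executemany("INSERT INTO DOC_ACCESS_BY_GROUP VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 20)])
conn.executemany("INSERT INTO DOC_UNREAD VALUES (?, ?)", [(1, 7), (3, 7)])
change_membership(conn, 7, 10, True)
inbox_after_join = inbox(conn, 7)
mark_read(conn, 7, 1)
change_membership(conn, 7, 20, True)
inbox_final = inbox(conn, 7)
```

The trade-off is that every write path has to keep the inbox in sync. The sketch covers reads and membership changes only; granting a group access to a new document would also need to refresh the inboxes of that group's members.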

Related

Keep data in memory or use database

Let's assume we have a ticketing system with a web page that displays tickets (spread across multiple pages). The same page also has a search form that allows filtering.
Those tickets can be modified at any time (delete, update, insert).
So I'm a bit confused. How should the internal architecture look? I've been thinking about it for a while and I haven't found a clear path.
From my point of view there are 2 ways:
Use something like an in-memory database and store all the data there. So it's very easy to filter content and to display the requested items. But this solution implies storing a lot of useless data in ram. Like tickets closed or resolved. And those tickets should be there because they can be requested.
Use the database for every search, page display, etc. So there will be a lot of queries: every search and every page view (per user) will result in a database query. Isn't this a bit too much?
Which solution is better? Are there any better solutions? Are my concerns futile?
You said "But this solution implies storing a lot of useless data in ram. Like tickets closed or resolved. And those tickets should be there because they can be requested."
If those tickets should be there because they can be requested, then it's not really useless data, is it?
It sounds like a good use case for a hybrid in-memory/persistent database. Keep the open/displayed tickets in in-memory tables. When closed, move them to persistent tables.
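As a rough sketch of that hybrid idea, the class below keeps open tickets in a plain dict and moves them to a SQL table when they are closed. The class name, methods and table layout are all hypothetical, and an in-memory SQLite database stands in for the persistent store. A real system would also need to persist open tickets so they survive a restart.

```python
import sqlite3

class TicketStore:
    """Open tickets live in memory; closed tickets live in the database."""

    def __init__(self):
        self.open_tickets = {}                   # ticket id -> title
        self.db = sqlite3.connect(":memory:")    # stand-in for a persistent DB
        self.db.execute(
            "CREATE TABLE closed_tickets (id INTEGER PRIMARY KEY, title TEXT)")

    def add(self, ticket_id, title):
        self.open_tickets[ticket_id] = title

    def close(self, ticket_id):
        """Write the ticket to the database first, then drop it from memory."""
        title = self.open_tickets[ticket_id]
        with self.db:
            self.db.execute("INSERT INTO closed_tickets VALUES (?, ?)",
                            (ticket_id, title))
        del self.open_tickets[ticket_id]

    def search_open(self, text):
        """Filter open tickets without touching the database."""
        return sorted(tid for tid, title in self.open_tickets.items()
                      if text in title)

    def get(self, ticket_id):
        """Look in memory first and fall back to the database."""
        if ticket_id in self.open_tickets:
            return self.open_tickets[ticket_id]
        row = self.db.execute("SELECT title FROM closed_tickets WHERE id = ?",
                              (ticket_id,)).fetchone()
        return row[0] if row else None

store = TicketStore()
store.add(1, "printer broken")
store.add(2, "printer jam")
store.close(1)
```

Closed tickets stay requestable through `get`, but they no longer occupy the in-memory structure used for the common search and paging path.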

Table with multiple foreign keys -- only one not null

I'm trying to design a system where an administrator will have to approve changes to the data and other various administrative tasks -- add a user, add an admin etc.
My idea is to have a notification table that contains these notifications, but the problem is that a notification can be any of the previously mentioned types, i.e. its data is stored in one of many tables. Here is a picture to describe my current plan -- note I'm sure that it's not a proper ER diagram.
Also, the data goes into a pending table, that reflects the table it will eventually wind up in, provided the data is approved -- it's a staging ground of sorts. So, a pending_user is a user that is not in the user table. And as you can see the user table, amongst others, is not shown here, but one can use their imagination.
I'm concerned that the multiple null values in the pending table will have adverse effects that I'm not totally aware of, such as increased space usage and possibly increased query time. Also, I'm not sure how I'll implement the retrieval of these notifications. My naive approach is to select the first X notifications, analyze the rows to find the non-null column, retrieve the appropriate data and then load all the data in a response.
Is there a more straightforward pattern for this type of problem?
Thanks in advance for any help.
I think the traditional way is to give users various levels of access/read/write rights. These access rights define what actions a user can and can't perform. In this traditional approach, a user who has access to a certain function can use it without further approval.
Also, traditionally there are some kind of audit logs that contain a trace of all important changes to the data. With such logs it would be possible to know who made a change (and when).
If you need to build a two-stage system, where a change has to go through an approval, I'd add a flag column to each important table that would indicate that values in the given row are not final and have to be approved. The table would store all historical changes to the data and with the help of this flag the system would know which variant is the latest approved version and which variant is pending and waiting for approval.
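One possible reading of the flag-column idea is sketched below with Python's sqlite3 module. The user_history table, its columns and the helper functions are assumptions made for illustration. Each row is one version of a user record, and the `approved` flag separates pending versions from final ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE user_history (
    version_id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id    INTEGER NOT NULL,
    name       TEXT    NOT NULL,
    approved   INTEGER NOT NULL DEFAULT 0   -- 0 = pending, 1 = approved
)""")

def propose(user_id, name):
    """Store a change as a new, not yet approved version."""
    with conn:
        cur = conn.execute(
            "INSERT INTO user_history (user_id, name) VALUES (?, ?)",
            (user_id, name))
    return cur.lastrowid

def approve(version_id):
    """An administrator marks one pending version as final."""
    with conn:
        conn.execute("UPDATE user_history SET approved = 1 "
                     "WHERE version_id = ?", (version_id,))

def current_name(user_id):
    """Latest approved version; pending rows are ignored."""
    row = conn.execute(
        "SELECT name FROM user_history WHERE user_id = ? AND approved = 1 "
        "ORDER BY version_id DESC LIMIT 1", (user_id,)).fetchone()
    return row[0] if row else None

def pending():
    """Everything still waiting for approval, oldest first."""
    return conn.execute(
        "SELECT version_id, user_id, name FROM user_history "
        "WHERE approved = 0 ORDER BY version_id").fetchall()

v1 = propose(1, "alice")
approve(v1)
v2 = propose(1, "alice smith")
name_before = current_name(1)
pending_before = pending()
approve(v2)
name_after = current_name(1)
```

With this layout the administrator's notification list is simply the `pending()` query for each table, so no nullable foreign keys are needed.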
I would not try to make a single universal table that would hold data related to changes in many different tables. Each table is different and approval process for each table is likely to be different. I doubt that you'll have more than a dozen entities that are important enough to go through this approval process.

Best way to handle multiple lists in redis?

I am building a site that allows users to view and do some activities (vote, comments,...) on articles. I am using MySql as main storage. In order to improve performance, I am considering using Redis (4.x) to handle some view activities such as top/hot articles...
I am going to use one sorted set, called topArticleSortedSet, to store the top articles; this set will be updated every time a user votes or comments on an article.
Each user logs in and follows some topics, so I also need to filter the articles in topArticleSortedSet by the topics the user follows before displaying them.
There is of course scroll paging as well.
For those reasons, I intend to create one topArticleSortedSet per user, so that each user has an independent list. But I don't know if this is best practice, because there might be millions of logged-in users accessing my site (which would mean millions of sets of around 1000 articles each).
Can anyone give me some advice please?
I think you should keep to one Set, and filter it for each user, instead of having a Set per user. Here is why:
My understanding is that the Set has to be updated each time someone reads an article (probably by incrementing a counter).
Let's say you have n users, each one reading p articles per day. So you have to update the Set n*p times a day.
In the "single" set option, you need to update just one set each time an article is read, for a total of n*p updates. In the "one set per user" architecture, you need n*p*n updates, which is far more.
Of course, filtering a single Set will take some time, longer than accessing a Set designed for one user. But on average, I guess it would take much less time than n operations. Basically, you need to know which is faster: filtering one Set or updating n Sets?
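To make the comparison concrete, here is a small sketch in plain Python. The dictionaries are only a stand-in for a Redis sorted set plus a topic lookup, not Redis client code, and the article ids, topics and function names are made up for illustration.

```python
def update_counts(n_users, actions_per_user):
    """Daily write counts for one shared set versus one set per user."""
    single = n_users * actions_per_user   # n*p
    per_user = single * n_users           # n*p*n
    return single, per_user

# Stand-in for the single shared sorted set: article id -> score.
scores = {"a1": 50, "a2": 40, "a3": 30, "a4": 20}
# Hypothetical topic of each article.
topics = {"a1": "sports", "a2": "tech", "a3": "tech", "a4": "music"}

def top_for_user(followed, offset=0, limit=2):
    """Filter the shared ranking by followed topics, then page the result."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    visible = [a for a in ranked if topics[a] in followed]
    return visible[offset:offset + limit]

counts = update_counts(1000, 10)
page_1 = top_for_user({"tech", "music"})
page_2 = top_for_user({"tech", "music"}, offset=2)
```

With 1000 users and 10 actions each, the shared set takes 10,000 writes a day while the per-user layout takes 10,000,000, so the read-time filter has a large budget before it becomes the more expensive side.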

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One set of tables with a userID column on everything, plus custom attributes implemented as one table that indexes the custom attributes and another that holds their values, which can then be joined to the user's existing contact records. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems that once these tables reach millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then, rather than using a WHERE userID=? everywhere, I can use a FROM ?_contacts. For attributes, I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I cannot rule out, is that a new database should be created for each user with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but you can offset that by writing efficient SQL queries. According to this site, you probably won't run out of memory or run into concurrency issues: with a good server, performance ought to be good, provided your queries are efficient.
If you recreate the table set for each user, you will just end up with lots of tables, which can make indexing slow and is bad practice. Instead, opt for a relational schema, and normalize the database and its tables to improve efficiency.
Creating a new database for each and every user just combines the complexity of both points above, resulting in shabby and disorganized database access. If you run an individual database instance for every single user, you will end up consuming your server's physical resources, like RAM and CPU, which will affect the service quality for all other users.
Take up option 1. Assign separate userIds and assign them roles and privileges where needed. That is more efficient than the other two methods.
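A minimal sketch of what option 1 could look like follows, using Python's sqlite3 module. All table, column and index names are assumptions. The index on `contacts.user_id` is what keeps "just one user's records" from becoming a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, name TEXT);
CREATE INDEX idx_contacts_user ON contacts (user_id);
-- One row per custom attribute a user has defined.
CREATE TABLE custom_attributes (
    id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, label TEXT);
-- One row per (contact, attribute) value.
CREATE TABLE custom_values (
    contact_id INTEGER, attribute_id INTEGER, value TEXT,
    PRIMARY KEY (contact_id, attribute_id));
""")
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)",
                 [(1, 100, "Ann"), (2, 100, "Bob"), (3, 200, "Cy")])
conn.execute("INSERT INTO custom_attributes VALUES (1, 100, 'shoe size')")
conn.execute("INSERT INTO custom_values VALUES (1, 1, '38')")

def contacts_for(user_id):
    """One user's contacts with any custom attribute values attached."""
    return conn.execute("""
        SELECT c.name, a.label, v.value
        FROM contacts c
        LEFT JOIN custom_values v ON v.contact_id = c.id
        LEFT JOIN custom_attributes a ON a.id = v.attribute_id
        WHERE c.user_id = ?
        ORDER BY c.id, a.id""", (user_id,)).fetchall()

rows_100 = contacts_for(100)
rows_200 = contacts_for(200)
```

Contacts without custom values still appear, with `None` in the attribute columns, because both joins are LEFT JOINs.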

Database for microblogging startup

I am going to build a microblogging web service (for school, so don't blast me for the lack of a new idea), and I worry that the DB could often be overloaded (users can follow other users or even tags, so I suppose the SELECT will be heavy: fetch the 20 latest messages that match all followed tags and users).
My idea is to create another table and store in it only statusID and userID (the user who should receive the message). The danger is that if some tag or user has many followers, there will be a lot of records with that status ID. So, is it a good idea? Or would it be better to use an M2M relation (one status -> many receivers)?
I think most databases can easily handle large record sets. The responsibility for making it perform lies in your design and in setting up the indexes properly. If you create the right indexes, the SELECT queries should perform really well.
I'd go with a users table, a messages table, and a table that holds the m2m relationship between users (who follows whom).
You can then do one select to find all of the users a user is following and then a second select to get all of the messages of interest (sorting and limiting the results as appropriate). Extending this to tagging should be pretty simple.
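The two-select approach could be sketched as follows with Python's sqlite3 module. The table and column names are assumptions, and the message id is assumed to increase with time so it can double as the sort key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER,
                      PRIMARY KEY (follower_id, followee_id));
CREATE TABLE messages (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT);
-- Index that serves "latest messages by these authors".
CREATE INDEX idx_messages_author ON messages (author_id, id);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO follows VALUES (1, 2), (1, 3);
INSERT INTO messages VALUES (1, 2, 'hi'), (2, 3, 'yo'),
                            (3, 1, 'me'), (4, 2, 'again');
""")

def timeline(user_id, limit=20):
    """First select: who the user follows. Second select: their messages."""
    followees = [row[0] for row in conn.execute(
        "SELECT followee_id FROM follows WHERE follower_id = ?", (user_id,))]
    if not followees:
        return []
    marks = ",".join("?" * len(followees))
    return conn.execute(
        f"SELECT id, author_id, body FROM messages "
        f"WHERE author_id IN ({marks}) ORDER BY id DESC LIMIT ?",
        (*followees, limit)).fetchall()

latest_two = timeline(1, limit=2)
nobody_followed = timeline(2)
```

The same shape extends to tags by adding a hypothetical tag-follow table and a message-tag table and including their matches in the second select.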
This design should be fine for large numbers of users and messages as long as you index the right columns. If you got massive, you could also move the users and messages tables to different servers or add read-only replicas. I wouldn't even worry about that for the moment - you'd need to be huge.
When implementing Collabinate (http://www.collabinate.com), a service-based engine for microblogging and shared activity streams, I used a graph database. The fact that people create posts and follow other people lends itself to a graph structure. With the right relationships and algorithms, this can be a very efficient and performant solution.