Database for microblogging startup - sql

I am going to build a microblogging web service (it is for school, so don't blast me for the lack of a new idea), and I worry that the DB could often be overloaded. Users can follow other users or even tags, so I suppose the SELECT will be heavy: fetch the 20 latest messages from all the tags and users someone follows.
My idea is to create another table that stores only statusID and userID (the user who should receive the message). The danger is that if some tag or user has many followers, there will be a lot of records with that status ID. So, is this a good idea? Or would it be better to use an M2M relation (one status -> many receivers)?

I think most databases can easily handle large record sets. The responsibility for making it perform lies in your design, specifically in setting up the indexes properly. If you create the right indexes, the SELECT queries should perform really well.

I'd go with a users table, a table holding the m2m relationship between users (who follows whom), and a messages table.
You can then do one select to find all of the users a user is following and a second select to get all of the messages of interest (sorting and limiting the results as appropriate). Extending this to tagging should be pretty simple.
This design should be fine for large numbers of users and messages as long as you index the right columns. If you became massive, you could also move the users and messages tables to different servers or add read-only replicas. I wouldn't even worry about that for the moment - you'd need to be huge.
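A minimal sketch of that layout, using Python's built-in sqlite3 so it runs anywhere. The table and column names (follows, author_id and so on) are assumptions for illustration, and the two selects described above are folded into a single join here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- m2m relationship between users: who follows whom (hypothetical name).
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)
);
CREATE TABLE messages (
    id        INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES users(id),
    body      TEXT NOT NULL
);
-- Lets the timeline query find one author's messages without a full scan.
CREATE INDEX idx_messages_author ON messages (author_id, id);
""")

conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO messages (author_id, body) VALUES (?, ?)",
                 [(2, "bob 1"), (3, "carol 1"), (1, "alice 1"), (2, "bob 2")])


def timeline(conn, user_id, limit=20):
    """Return the newest messages from everyone user_id follows."""
    rows = conn.execute(
        """
        SELECT m.id, u.name, m.body
        FROM follows f
        JOIN messages m ON m.author_id = f.followee_id
        JOIN users u ON u.id = m.author_id
        WHERE f.follower_id = ?
        ORDER BY m.id DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()


alice_timeline = timeline(conn, 1)
```

Nothing is copied per follower when a message is posted, so a user with many followers costs no extra rows. The work moves to read time, which is where the indexes matter.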

When implementing Collabinate (http://www.collabinate.com), a service-based engine for microblogging and shared activity streams, I used a graph database. The fact that people create posts and follow other people lends itself to a graph structure. With the right relationships and algorithms, this can be a very efficient and performant solution.

Related

When to use one to many vs many to many in right situation?

I am quite confused about when to use one-to-many versus many-to-many, for example with user roles. In that situation many-to-many has the advantage of reducing data size, because each row just points to an integer. It may save 1-10 bytes per row: for example, the role "senior developer" with id 7 takes 2 bytes as a smallint instead of 16 bytes as text. But it bloats the schema with another table. If many-to-many has this advantage, why should one-to-many exist at all? Isn't many-to-many always the better choice?
Users table
    id
    username
    password
Users_Roles table
    user_id
    role

Versus

Users table
    (same columns as above)
Users_Roles table
    user_id
    role_id
Roles table
    id
    role

You're prematurely optimizing. A few integers here and there are unlikely to affect either your data size or your performance. If they do, the schema can be changed later, but usually there is much bigger bloat to be concerned with.
One-to-many vs many-to-many is not an optimizing issue. It's about the relationship between the tables.
If one and only one user can have a role, use one-to-many.
If many users can have the same role, use many-to-many.
For example, if you have an admin role and there can only ever be one admin user, use one-to-many. If there can be many admins, use many-to-many. You have to decide what the relationship is between users and roles.
Note: Use bigints for ids. 4 billion might seem like a lot, but it comes up fast and one of the worst things that can happen is to run out of IDs.
This is a data modeling question, and its answer comes out of, and is dictated by, the analysis of the relationships between the entities involved. You have identified 2 entities you want to store data about, users and roles. Now describe their relationship in spoken language terms, looking at the relationship from both directions.
Can a user have more than one role? Can a role be held by more than one user? If the answer to both is yes, then it's a many-to-many relationship. Take the primary keys of both entities and bring them together as the composite primary key of an associative table. It may not have any other attributes unless there is data about the user/role relationship itself that needs to be captured.
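As a rough illustration of that associative table, here is a sketch in Python with the built-in sqlite3 module. The table and column names follow the question's Users / Users_Roles / Roles layout, while the sample rows and the helper function names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE);
CREATE TABLE roles (id INTEGER PRIMARY KEY, role TEXT NOT NULL UNIQUE);
-- Associative table: the two foreign keys together form the primary key.
CREATE TABLE users_roles (
    user_id INTEGER NOT NULL REFERENCES users(id),
    role_id INTEGER NOT NULL REFERENCES roles(id),
    PRIMARY KEY (user_id, role_id)
);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "ben")])
conn.executemany("INSERT INTO roles VALUES (?, ?)",
                 [(1, "admin"), (2, "senior developer")])
conn.executemany("INSERT INTO users_roles VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])


def roles_of(conn, user_id):
    """All roles held by one user."""
    rows = conn.execute(
        "SELECT r.role FROM users_roles ur JOIN roles r ON r.id = ur.role_id "
        "WHERE ur.user_id = ? ORDER BY r.role", (user_id,))
    return [row[0] for row in rows]


def users_with(conn, role):
    """All users holding one role."""
    rows = conn.execute(
        "SELECT u.username FROM users_roles ur "
        "JOIN users u ON u.id = ur.user_id "
        "JOIN roles r ON r.id = ur.role_id "
        "WHERE r.role = ? ORDER BY u.username", (role,))
    return [row[0] for row in rows]


# The composite key also stops the same pairing from being stored twice.
try:
    conn.execute("INSERT INTO users_roles VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Both directions of the relationship are answered by the same three tables: roles_of(conn, 1) walks from a user to roles, users_with(conn, "senior developer") walks from a role to users.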
However, what if you are modeling entities of invoices and line items? Can an invoice have more than one line item? Yes. Can an instance of a line item on an invoice belong to more than one invoice? No (note I'm modeling a line item, not a product or part number as a line item could include special pricing for this invoice, color, logo, etc). So this is clearly a one to many relationship in the direction of one invoice can have many line items.
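The one-to-many case needs no associative table: the foreign key simply lives on the "many" side. A small sketch with Python's sqlite3, where the column names (customer, description, unit_price_cents) are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE invoices (
    id       INTEGER PRIMARY KEY,
    customer TEXT NOT NULL
);
-- Each line item points at exactly one invoice.
CREATE TABLE line_items (
    id               INTEGER PRIMARY KEY,
    invoice_id       INTEGER NOT NULL REFERENCES invoices(id),
    description      TEXT NOT NULL,
    unit_price_cents INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO invoices VALUES (1, 'Example Ltd')")
conn.executemany(
    "INSERT INTO line_items (invoice_id, description, unit_price_cents) "
    "VALUES (?, ?, ?)",
    [(1, "widget, blue, with logo", 1250), (1, "widget, special price", 990)],
)

item_count = conn.execute(
    "SELECT COUNT(*) FROM line_items WHERE invoice_id = 1").fetchone()[0]

# A line item cannot exist without its invoice.
try:
    conn.execute(
        "INSERT INTO line_items (invoice_id, description, unit_price_cents) "
        "VALUES (99, 'orphan', 100)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```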
For more information, do some searching on data modeling. It will be a huge help in your database design efforts, and designing the database correctly will give you more efficient queries.
Looks like Schwern and I were typing at the same time :-)

Handling relational model in Cassandra

Background
We have chosen Cassandra as our storage engine since we have an application that must handle async messaging between many users on the website and event storing (some types of analytics, what happens on site and when, etc.). We also have a voting platform, so we are storing votes per user per day, and Cassandra is good in those use cases.
Recently we got new requirements to build a relational model on top of our existing system (at least we think it is relational). Some types of political candidates with lists of jobs, education, historical voting, endorsements, etc.
Problem
We have relations which can be edited on both ends (i.e. a candidate is supported by companies, but in our admin panel a company can be edited without the candidate). A candidate is one row in our Cassandra DB identified by a UUID. On the front end, we need full information about candidates (political party, schools, jobs, voting history, supporting companies). We want to place the majority of candidate info in a single row so we can read the data with a single read. However, when we store the list of supporting companies as a UDT, we have problems editing it (we need to change it in both the company_by_id and candidate_by_id tables).
Question
How to solve the editing problem and relational model issues in our situation?
We came up with a couple of solutions:
1. Track relations in Cassandra with additional index-like tables: candidates_by_supporting_company. When updating a company, we update the candidates who have that company as well.
2. Similar to 1, but using a secondary index if the relation has low cardinality, and updating based on that index (we have 10 political parties, so we can place an index on political party in the candidates table, and when a political party changes we can update the candidates found through the index).
3. Use a relational database for relational data and leave Cassandra to handle only suitable use cases like time-series data, messaging, and event storing (this adds the maintenance and deployment costs of one more database, plus the problem of how to replicate data, since our system is distributed).
4. Use Spark to do joins (this will not be the sole purpose of adding Spark to the system; we are thinking of adding it for importing huge data sets in CSV and doing transformations, so having Spark will be an added bonus and we can use SparkSQL for places where we need joins).
We are leaning towards option 4: we will add Spark anyway, we stay with Cassandra as the only database (which avoids the maintenance and deployment of one more database), and we get reasonably efficient JOINs and GROUP BYs at the application level.
What do you think?
If you want to use only Cassandra, the right way to proceed is number 1: denormalization. But if you have a lot of relationships, it will take a lot of effort at the application level.
If adding another DBMS is not a problem in your environment, using the right tool for the right job is the best choice: number 3 for me.
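To make the application-level effort of the Cassandra-only route (number 1) concrete, here is a toy sketch in plain Python. Dictionaries stand in for the company_by_id, candidate_by_id and candidates_by_supporting_company tables from the question. No Cassandra driver is involved, and the function names are hypothetical, so this only shows the shape of the write path:

```python
from uuid import uuid4

# In-memory stand-ins for the Cassandra tables (assumption: one dict per table).
company_by_id = {}
candidate_by_id = {}
candidates_by_supporting_company = {}  # company id -> set of candidate ids


def add_company(name):
    company_id = uuid4()
    company_by_id[company_id] = {"name": name}
    return company_id


def add_candidate(name):
    candidate_id = uuid4()
    candidate_by_id[candidate_id] = {"name": name, "supporting_companies": {}}
    return candidate_id


def add_support(candidate_id, company_id):
    # Denormalized copy, so a candidate can be read with a single lookup.
    copy = dict(company_by_id[company_id])
    candidate_by_id[candidate_id]["supporting_companies"][company_id] = copy
    # Index-like table: remembers which candidate rows embed this company.
    candidates_by_supporting_company.setdefault(company_id, set()).add(candidate_id)


def rename_company(company_id, new_name):
    # One logical edit fans out to every row that embeds the company.
    company_by_id[company_id]["name"] = new_name
    for candidate_id in candidates_by_supporting_company.get(company_id, ()):
        embedded = candidate_by_id[candidate_id]["supporting_companies"]
        embedded[company_id]["name"] = new_name


acme = add_company("Acme")
jane = add_candidate("Jane Doe")
add_support(jane, acme)
rename_company(acme, "Acme Corp")
names = [c["name"] for c in candidate_by_id[jane]["supporting_companies"].values()]
```

Every relation that can be edited from both ends needs its own version of rename_company, which is the effort the answer warns about.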

Calculate inbox based on dynamic groups

I am facing this problem when calculating the inbox for a user:
On one hand I have a bunch of documents that can potentially have many readers (DOCS table).
Each reader belongs to one or more defined groups of users.
I have a table DOC_ACCESS_BY_GROUP with (DOC_ID, GROUP_ID).
I need to know if a user has read a document or not. So, I have a table DOC_UNREAD with (DOC_ID, USER_ID) so that if a document is in that table, the user has not read the document yet.
Each group's membership can change at any time, so I need to calculate the "inbox" for a given user in real time.
The first guess is: find all the groups the user belongs to, then join DOCS with the DOC_ACCESS_BY_GROUP table to get all the documents for that user (with the associated data), and then do another join to see whether each document has been read by the user.
The problem is, when my DOCS table grows considerably and I have many users, and many groups... the performance is really poor.
I'm trying to abstract the problem, which is actually a bit more complex. The possibility of storing document permissions per user is discarded. I also imagine it's not a problem that can be solved by optimizing the SQL query but should be handled in software. We also support many databases, such as MySQL, PostgreSQL or MSSQL, so it cannot be tied to a vendor-specific solution (I guess).
So, the question is: Does anyone know any mechanism or framework or algorithm to do things differently and solve this problem, in an optimal and performant way?
Memcached? Infinispan? Hadoop?
You probably want to "materialize" the inbox and update it every time the user reads something, the membership of a group changes etc. The materialized inbox could be stored either in a DB table or in a separate system like Infinispan/memcached.
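A rough sketch of such a materialized inbox kept in a DB table, using Python's sqlite3. The group_members and inbox tables and all function names are assumptions, and INSERT OR IGNORE is SQLite syntax that would need a per-vendor equivalent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE group_members (
    group_id INTEGER NOT NULL, user_id INTEGER NOT NULL,
    PRIMARY KEY (group_id, user_id)
);
CREATE TABLE doc_access_by_group (
    doc_id INTEGER NOT NULL, group_id INTEGER NOT NULL,
    PRIMARY KEY (doc_id, group_id)
);
-- The materialized inbox: one row per (user, visible document).
CREATE TABLE inbox (
    user_id INTEGER NOT NULL, doc_id INTEGER NOT NULL,
    is_read INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, doc_id)
);
""")


def grant_doc(doc_id, group_id):
    conn.execute("INSERT INTO doc_access_by_group VALUES (?, ?)", (doc_id, group_id))
    conn.execute(
        "INSERT OR IGNORE INTO inbox (user_id, doc_id) "
        "SELECT user_id, ? FROM group_members WHERE group_id = ?",
        (doc_id, group_id))


def add_member(group_id, user_id):
    conn.execute("INSERT INTO group_members VALUES (?, ?)", (group_id, user_id))
    conn.execute(
        "INSERT OR IGNORE INTO inbox (user_id, doc_id) "
        "SELECT ?, doc_id FROM doc_access_by_group WHERE group_id = ?",
        (user_id, group_id))


def remove_member(group_id, user_id):
    conn.execute("DELETE FROM group_members WHERE group_id = ? AND user_id = ?",
                 (group_id, user_id))
    # Drop only documents the user can no longer reach through another group.
    conn.execute(
        """
        DELETE FROM inbox
        WHERE user_id = ?
          AND doc_id NOT IN (
              SELECT a.doc_id
              FROM doc_access_by_group a
              JOIN group_members g ON g.group_id = a.group_id
              WHERE g.user_id = ?)
        """,
        (user_id, user_id))


def mark_read(user_id, doc_id):
    conn.execute("UPDATE inbox SET is_read = 1 WHERE user_id = ? AND doc_id = ?",
                 (user_id, doc_id))


def unread(user_id):
    rows = conn.execute(
        "SELECT doc_id FROM inbox WHERE user_id = ? AND is_read = 0 ORDER BY doc_id",
        (user_id,))
    return [row[0] for row in rows]


add_member(1, 10)
add_member(2, 10)
grant_doc(100, 1)
grant_doc(101, 2)
grant_doc(102, 1)
grant_doc(102, 2)
mark_read(10, 100)
unread_before = unread(10)
remove_member(2, 10)
unread_after = unread(10)
```

Reading the inbox becomes a single-table lookup. The joins are paid once per change (a grant, a membership change, a read) rather than on every page view.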

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything. Custom attributes would live in one table that defines the attributes and another that holds their values, which can then be joined to the user's existing contact records. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables start reaching millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using a WHERE userID=? everywhere, I can use a FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I can not rule out, is that a new database should be created for each user with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. A database's retrieval time does depend on the number of records, but there is a trade-off: you can write efficient SQL queries to fetch the data. Well, according to this site, you probably won't run out of memory or run into concurrency issues, because with a good server the performance ought to be good, provided that you write your queries efficiently.
If you recreate table sets, you will just end up creating lots of tables, which can make indexing slow and is bad practice. If you instead opt for a relational schema and normalize the database and its tables, you improve efficiency.
Creating a new database for each and every user just combines the complexity of both of the above, resulting in shabby and disorganized database access. If you decide to run individual database instances for every single user, you will end up consuming your server's physical resources, like RAM and CPU, which will affect the service quality for all the other users.
Take option 1. Assign separate userIds and give them roles and privileges where needed. That is more efficient than the other two methods.
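A sketch of what option 1 could look like, using Python's sqlite3. The contacts, custom_attributes and custom_attribute_values tables and their columns are illustrative guesses, not a schema from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    name    TEXT NOT NULL
);
-- Every per-user query filters on user_id, so index it.
CREATE INDEX idx_contacts_user ON contacts (user_id);

-- Which custom attributes each user has defined.
CREATE TABLE custom_attributes (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    name    TEXT NOT NULL,
    UNIQUE (user_id, name)
);
-- The value of one attribute for one contact.
CREATE TABLE custom_attribute_values (
    contact_id   INTEGER NOT NULL REFERENCES contacts(id),
    attribute_id INTEGER NOT NULL REFERENCES custom_attributes(id),
    value        TEXT,
    PRIMARY KEY (contact_id, attribute_id)
);
""")
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)",
                 [(1, 7, "Ada"), (2, 7, "Grace"), (3, 8, "Linus")])
conn.execute("INSERT INTO custom_attributes VALUES (1, 7, 'favourite_color')")
conn.execute("INSERT INTO custom_attribute_values VALUES (1, 1, 'green')")


def contacts_for(user_id):
    """One user's contacts with any custom attribute values attached."""
    rows = conn.execute(
        """
        SELECT c.name, a.name, v.value
        FROM contacts c
        LEFT JOIN custom_attribute_values v ON v.contact_id = c.id
        LEFT JOIN custom_attributes a ON a.id = v.attribute_id
        WHERE c.user_id = ?
        ORDER BY c.id
        """,
        (user_id,))
    return rows.fetchall()


# Ask SQLite how it would find one user's rows; the last column is the detail text.
plan = " ".join(
    row[-1] for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT id, name FROM contacts WHERE user_id = ?", (7,)))
```

With the index in place the lookup is a search on user_id rather than a scan of every user's rows, which addresses the worry about millions of records in shared tables.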

archiving strategies and limitations of data in a table

Environment: Jboss, Mysql, JPA, Hibernate
Our web application will be catering to a large number of users (~ 1,000,000), and there are a lot of child tables where user-specific data is stored (e.g. personal, health, forum contributions ...).
1. What would be the best practice to archive user & user-specific information?
[a] Would it be wise to move the archived user & user-specific information to their respective tables within the same database (e.g. user_archive, user_forum_comments_archive ...) OR
[b] Would you just mark the database entries with a flag in the original table(s) and query only non-archived entries?
2. We have a unique constraint on User.loginid. How do you handle this requirement if the users are archived via 1-[a]? (I.e. if a user with loginid 'samuel' gets moved into the archive table and a new user gets added with the same name in the original table, how would you prevent this?) What would be the best strategy to address the unique key constraints?
3. We have a requirement to selectively archive records and bring them back if necessary. Would you rely on database tools, or would you handle this via the persistence APIs exposed by the JPA entity model?
Personally, I'd go for solution "[a]".
Having things split on two table sets (current and archived) would make things a bit hard to manage in terms of common RDBMS concepts (example: forum comment author would be a foreign key pointing to the user's table... but you can't have a field behave as a foreign key to two different tables).
You could go for a compromise (users table uses solution "a", all the other tables like profile get archived to a twin table like per solution "b") but this would make things unnecessarily complicated for your code (in some cases you have to look at the non-archived, in some to the archived only, in some other cases to the union of both).
Solution A would easily solve #2 and #3 requirements, too. Uniqueness of user name is easy to enforce if everything is in the same table, and resurrecting archived users is just a matter of flipping a bit (Archived=Y/N) on the main user table.
10% is not much, I doubt that the difference in terms of performance would really justify the extra complexity (and risk of bugs).
I would put an archived flag on the table and then create a view to use when you don't want to see archived records. That way people will be more consistent in applying the archive flag I suspect.
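A small sketch of the flag-plus-view idea, using Python's sqlite3 rather than the MySQL/JPA stack from the question. The archived column and the active_users view name are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    loginid  TEXT NOT NULL UNIQUE,
    archived INTEGER NOT NULL DEFAULT 0
);
-- Day-to-day code queries the view and never sees archived rows.
CREATE VIEW active_users AS
    SELECT id, loginid FROM users WHERE archived = 0;
""")


def active_logins():
    rows = conn.execute("SELECT loginid FROM active_users ORDER BY loginid")
    return [row[0] for row in rows]


conn.execute("INSERT INTO users (loginid) VALUES ('samuel')")

# Archiving is a flag flip; the row stays in the main table.
conn.execute("UPDATE users SET archived = 1 WHERE loginid = 'samuel'")
active_after_archive = active_logins()

# The unique constraint still covers archived users.
try:
    conn.execute("INSERT INTO users (loginid) VALUES ('samuel')")
    reuse_blocked = False
except sqlite3.IntegrityError:
    reuse_blocked = True

# Restoring is the same flip in the other direction.
conn.execute("UPDATE users SET archived = 0 WHERE loginid = 'samuel'")
active_after_restore = active_logins()
```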