Consider a website with users, where each user can create any number of what we'll call "posts".
Efficiency-wise, is it better to create one table for all of the posts and store, for each post, the id of the user who created it, or to create a separate table for each user that holds only that user's posts?
The database layout should not change when you add more data to it, so the user data should definitely be in one table.
Also:
Having multiple tables means that you have to create queries dynamically.
The cached query plan for one table won't be used for any other of the tables.
Having a lot of data in one table doesn't affect performance much, but having a lot of tables does.
If you want to add an index to the table to make queries faster, it's a lot easier to do on a single table.
Well, to answer the specific question: in terms of raw query efficiency, searching a small table is cheaper than searching a large one, so a table per user can look like the most efficient option when each query is considered in isolation.
However, unless you have a lot of posts and users, this is not likely to matter. Even with millions of rows, you will get good performance with a well-placed index.
I would strongly advise against the table-per-user strategy, because it adds a lot of complexity to your solution. How would you query when you need to find, say, users that have posted on a subject within the last year?
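With a single posts table, that kind of cross-user question is one query. Here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (posts, user_id, subject, created_at) and the sample rows are assumptions for illustration, not part of the original question.

```python
import sqlite3

def users_who_posted(conn, subject, since):
    """Return ids of users with at least one post on `subject` on or after `since` (ISO date)."""
    rows = conn.execute(
        "SELECT DISTINCT user_id FROM posts "
        "WHERE subject = ? AND created_at >= ? ORDER BY user_id",
        (subject, since),
    )
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
    "subject TEXT NOT NULL, created_at TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO posts (user_id, subject, created_at) VALUES (?, ?, ?)",
    [
        (1, "sql", "2024-03-01"),
        (2, "sql", "2022-01-15"),  # too old to match
        (3, "css", "2024-02-10"),  # wrong subject
        (3, "sql", "2024-05-20"),
    ],
)
recent_sql_posters = users_who_posted(conn, "sql", "2024-01-01")
```

With one table per user, the same question would need one query per table, or a generated UNION across all of them.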
Optimize when you need to. Not because you think/are afraid something will be slow. (And even if you need to optimize, there will be easier options than table-per-user)
Schemas with a varying number of tables are, generally, bad. Use one single table for your posts.
If performance is a concern, you should learn about database indexes. While indexes are not part of the SQL standard, nearly all databases support them to help improve performance.
I recommend that you create a single table for all users' posts and then add indexes to this table to speed up searching. For example, you can add an index on the user column so that you can quickly find all posts for a given user. You may also want to consider adding other indexes, depending on your application's requirements.
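As a rough illustration of the index on the user column, the sketch below builds one posts table for 1,000 made-up users in SQLite and asks the engine how it would run a per-user lookup. The schema and the index name idx_posts_user_id are assumptions, and other databases expose their query plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, body TEXT NOT NULL)"
)
# 1,000 hypothetical users with 20 posts each, all in the same table.
conn.executemany(
    "INSERT INTO posts (user_id, body) VALUES (?, ?)",
    [(u, f"post {n} by user {u}") for u in range(1000) for n in range(20)],
)
conn.execute("CREATE INDEX idx_posts_user_id ON posts (user_id)")

posts_for_user = conn.execute(
    "SELECT id, body FROM posts WHERE user_id = ?", (42,)
).fetchall()

# Ask SQLite how it would run the query; the last column of each plan row
# is a human-readable description that names the index it picks.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id, body FROM posts WHERE user_id = ?", (42,)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
```

Printing plan_text should show a search that uses the index rather than a scan of all 20,000 rows.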
Your first proposal of having a single user and a single post table is the standard approach to take.
At the moment posts may be the only user-specific feature on your site, but imagine that it might need to grow in the future to support users having messages, preferences, etc. Now your separate table-per-user approach leads to an explosion in the number of tables you'd need to create.
I have a similar but different issue with your answer because both #guffa and #driis are assuming that the "posts" need to be shared among users.
In my particular situation, not a single user datapoint can be shared with any other user for privacy reasons, not even for analytics.
We plan on using MySQL or PostgreSQL, and here are the three options our team is arguing about:
1. N schemas and 5 tables - some of our devs feel that this is the best direction to take to keep the data completely segregated.
Pros - less complexity if you think of a schema as a folder and tables as files. We'll have one schema per user
Cons - most ORMs do connection pooling per schema
2. 1 schema and N x 5 tables - some devs like this because it allows for connection pooling, but it appears to make the issue more complex.
Pros - connection pooling in the ORM is possible
Cons - cannot find an ORM where Models are set up for this
3. 1 schema and 5 tables - some devs like this because they think we benefit from caching.
Pros - ORMs are happy because this is what they are designed to do
Cons - every query requires the username table
I, personally, land in camp 1: n schemas.
My lead dev lands in camp 3: 1 schema 5 tables.
Caching:
If data is always 1:1, I cannot see how caching will ever help regardless of the solution we use because each user will be searching for different info.
Any thoughts?
I've researched this topic and I'm relatively sure that in most cases the answer is "No", but I would like some second opinions specific to my case.
We're currently working on a multi-user web app where each user will basically have their own copy of a "portal/app" within the web app. It's not performance I'm worried about, but security.
I'm considering partitioning the data with a prefix userid_table1, userid_table2 to make it more manageable and ensure no security validation oversight is made by the team in development as we can easily add a validation to ensure that queries can only be run against tables with userid_*.
Would you still recommend against this method ?
"I'm considering partitioning the data with a prefix userid_table1, userid_table2 to make it more manageable and ensure no security validation oversight is made by the team in development as we can easily add a validation to ensure that queries can only be run against tables with userid_*."
More manageable? That sounds like a joke. Your database will end up with a zillion different tables. Any operation that you want to do across all users will be a nightmare:
Declaring foreign key constraints.
Defining a new index on the tables.
Adding a new column.
Restructuring the tables.
And so on. And so on.
Your users may be limited to a single table. But the application developer and DBA need to deal with all of them. I cringe thinking about trying to figure out where performance bottlenecks are in such a system.
I should add that databases are optimized for big tables not lots of tables, so multiple tables are typically less efficient. And even less efficient when you think about all the half-filled pages in all those tables.
The same entities should not be spread among multiple tables, unless you have a really, really good reason. This is not a really good reason. One simple solution is to prevent users from having access to the base tables. Just give them access to views or user-defined table functions -- and have all of these filter on user ids.
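SQLite has no per-user permissions, so the sketch below only approximates this idea at the application layer: all access goes through one helper that always binds the user id. The class name UserScopedPosts and the schema are hypothetical. In a server database, the same filter would live in a view or table function, with direct access to the base table revoked.

```python
import sqlite3

class UserScopedPosts:
    """Hypothetical helper: the only way application code reads or writes posts."""

    def __init__(self, conn, user_id):
        self._conn = conn
        self._user_id = user_id

    def all(self):
        # The user id is always bound here, so a caller cannot forget the filter.
        rows = self._conn.execute(
            "SELECT id, body FROM posts WHERE user_id = ? ORDER BY id",
            (self._user_id,),
        )
        return rows.fetchall()

    def add(self, body):
        self._conn.execute(
            "INSERT INTO posts (user_id, body) VALUES (?, ?)", (self._user_id, body)
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, body TEXT NOT NULL)"
)
alice, bob = UserScopedPosts(conn, 1), UserScopedPosts(conn, 2)
alice.add("alice's note")
bob.add("bob's note")
alice_rows = alice.all()
```

Both users' rows sit in the same table, yet each helper instance only ever sees its own.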
There are some edge cases where you do want separate tables for each user. Typically, each user would have very complex tables (think B2B application) and, in fact, they might have their own database. There may also be legal requirements to separate data. In these cases, though, the "separateness" would typically be at the database level, not the table level.
I am exploring options for designing the database for my new application. In general, I will have registered users and info about them. They will be able to do some things in the app, and that data will be in the same DB as the users' data (so I can have shared FKs and so on).
But then I plan to have a second database that will be logically independent of the first database, except that it will share userID as an FK.
I don't know whether I should put that second part in an extra DB or keep everything in the same database. I plan to have a subdomain in my app for the second part (it is like an app within the app), but what if I discover they should share more data? Will that cross-database querying hurt my performance? And is that the way to go at all; is there a real reason to separate databases?
As soon as you have two databases you have potential complexity. You have not given any particular reason why you need two databases. So keep it simple until you have a reason.
An example of what folks do: have a "current" database, small, holding just the data needed right now. That might be where orders are taken and fulfilled. Once the data is no longer current, say some days or weeks after the order is filled, move the data to a "historic" database. There, marketing and management folks can look at overall trends in the history without affecting the performance of the "current" database, whose performance might be critical to keeping your customers happy.
As an example of complexity: any time you have two databases you need to consider consistency between them, this is much harder to ensure than it might appear. Databases do offer Two-Phase Transactional capabilities, or you can devise batch processes but there are always subtleties that are hard to catch.
I would just keep it all in one database. Unless you have dozens of tables there should be no real performance problems, imho. It will, however, make your life a lot easier: you only have to work with one database connection and don't have to worry about merging information from two queries.
Also agree that unless volume of your data is going to be huge (judging by the question, doesn't seem like that is the case here), you can use single database to store your data without performance issues.
For "visual" separation of data structure, you can always create tables in two schemas of single database.
My web app offers personalized recommendations. When a user starts using it, 1000+ rows are inserted into one big recommendation table, which correlates with other tables in the database. Every item the user votes for affects all of those 1000+ rows.
Since the recommendation info is only useful during the session, and since the recommendation table is getting huge, we'd like to switch to a more appropriate method. There's the possibility of deleting the relevant rows as soon as the user session is over. I guess a PHP session array or temp tables are better for this case?
One temp table per session will lead to catalog pollution, so not really recommended.
Have you considered actually keeping the data, so that you can periodically mine it to improve the suggestions?
First: consider redesigning your data structure, I think it is not optimal.
Store a user's recommendation in a table user-recommendeditem-score: I don't see any need for a temp table or anything else.
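A minimal sketch of such a user / recommended item / score table, using SQLite. The table name user_recommendation, the column names and the helper functions are assumptions for illustration; how scores are actually computed is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_recommendation ("
    " user_id INTEGER NOT NULL,"
    " item_id INTEGER NOT NULL,"
    " score REAL NOT NULL,"
    " PRIMARY KEY (user_id, item_id))"
)

def set_score(conn, user_id, item_id, score):
    # INSERT OR REPLACE keeps exactly one row per (user, item) pair.
    conn.execute(
        "INSERT OR REPLACE INTO user_recommendation (user_id, item_id, score) "
        "VALUES (?, ?, ?)",
        (user_id, item_id, score),
    )

def top_items(conn, user_id, limit):
    rows = conn.execute(
        "SELECT item_id FROM user_recommendation WHERE user_id = ? "
        "ORDER BY score DESC, item_id LIMIT ?",
        (user_id, limit),
    )
    return [r[0] for r in rows]

set_score(conn, 7, 100, 0.2)
set_score(conn, 7, 101, 0.9)
set_score(conn, 7, 102, 0.5)
set_score(conn, 7, 100, 0.95)  # a later vote raises item 100's score
best = top_items(conn, 7, 2)
```

The composite primary key doubles as the index for per-user lookups, so no temp table is needed to fetch one user's recommendations.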
Otherwise, you could start using sessions, but you should encapsulate the code carefully, making it easy to change if/when this solution is no longer maintainable.
I suspect that the method is flawed - 1000+ recommendations per user? How many of them do they ever look at? If you don't know the answer to that question - then you need to spend some time thinking about why you don't know the answer.
"Every item the user votes for affects all of those 1000+ rows"
Are you sure your data is properly normalised?
But leaving that aside for the moment. The right place to generate and store that data is the database: a relational database is explicitly designed for, and a lot more efficient at, generating and maintaining tabular sets of data than a conventional programming language.
I am designing an app that would involve users 'following' each other's activity, in the twitter sense, but I am not very experienced with database/query design/efficiency. Are there best practices for managing this, pitfalls to avoid, etc.? I gather this can create a very large load on the db if not done properly (or maybe even then?).
If it makes a difference it is likely that people will 'follow' only a relatively small number of people (but a person may have many followers). However this is not certain, and I wouldn't want to count on it.
Any advice gratefully received. Thanks.
Pretty simple and easy to do with full normalisation. If you have a table of users, each with a unique ID, you would have a TABLE_FOLLOWERS table with the columns USERID and FOLLOWERID, which would describe all the followers for each user as a many-to-many relationship between users.
Even with millions of associations, on a half-decent database server this will perform well and fast as long as you are using a good database (i.e., not MS Access).
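The followers table described above might look like the following in SQLite. This is a sketch under assumptions: the lowercase table and column names, the second index and the sample users are illustrative choices, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE followers (
    user_id     INTEGER NOT NULL REFERENCES users(id),  -- the person being followed
    follower_id INTEGER NOT NULL REFERENCES users(id),  -- the person who follows
    PRIMARY KEY (user_id, follower_id)
);
-- The primary key serves "who follows X?"; this index serves "whom does X follow?".
CREATE INDEX idx_followers_follower ON followers (follower_id, user_id);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "ann"), (2, "ben"), (3, "cat")])
# ben and cat follow ann; ann follows cat.
conn.executemany("INSERT INTO followers (user_id, follower_id) VALUES (?, ?)",
                 [(1, 2), (1, 3), (3, 1)])

def followers_of(conn, user_id):
    rows = conn.execute(
        "SELECT u.name FROM followers f JOIN users u ON u.id = f.follower_id "
        "WHERE f.user_id = ? ORDER BY u.name", (user_id,))
    return [r[0] for r in rows]

def followed_by(conn, follower_id):
    rows = conn.execute(
        "SELECT u.name FROM followers f JOIN users u ON u.id = f.user_id "
        "WHERE f.follower_id = ? ORDER BY u.name", (follower_id,))
    return [r[0] for r in rows]

ann_followers = followers_of(conn, 1)
ann_follows = followed_by(conn, 1)
```

Indexing both directions matters because a feed needs "whom do I follow?" while a profile page needs "who follows me?".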
The model is fairly simple. The problem is in the size of the Subscription table; if there are 1 million users, and each subscribes to 1000, then the Subscription table has 1 billion rows.
That depends on how many users you expect to need to support; how many followers you expect users to have; and what sort of funding/development-effort you expect to have access to should your answers to the previous questions prove optimistic.
For a small scale project I would likely ignore the database, design the application as a simple object model with User objects that maintain a List[followers]. Keep it all in RAM for normal operation and use an ORM to persist to a database periodically (probably postgresql or mysql).
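A rough sketch of that in-memory object model. The User class and the follower_rows helper are hypothetical, and the ORM step is reduced to producing the rows a persistence layer could write on a schedule.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # compare users by identity, so follow cycles are harmless
class User:
    name: str
    followers: list = field(default_factory=list)  # User objects following this user

    def add_follower(self, other):
        # Ignore self-follows and duplicates.
        if other is not self and other not in self.followers:
            self.followers.append(other)

def follower_rows(users):
    """Flatten the object graph into sorted (user, follower) name pairs for persistence."""
    return sorted((u.name, f.name) for u in users for f in u.followers)

ann, ben, cat = User("ann"), User("ben"), User("cat")
ann.add_follower(ben)
ann.add_follower(cat)
ann.add_follower(ben)  # duplicate, ignored
cat.add_follower(ann)
rows = follower_rows([ann, ben, cat])
```

Everything here lives in one process's memory, which is why this approach only suits a small-scale project.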
For a larger project I would not be using a relational database at all; but exactly what I would use would depend on the specific details of the project.
If you are only trying to spike the concept, go with the ORM approach; but, keep in mind it won't scale.
You probably should read http://highscalability.com/ and its articles on how this is managed by the big sites.
I am looking to design a database for a website where users will be able to gain points (reputation) for performing certain activities and am struggling with the database design.
I am planning to keep records of the things a user does so they may have 25 points for an item they have submitted, 1 point each for 30 comments they have made and another 10 bonus points for being awesome!
Clearly all the data will be there, but it seems like a lot of querying to get the total score for each user, which I would like to display next to their username (in the form of a level). For example, a query to the submitted items table to get the scores for each item from that user, a query to the comments table, etc. If all this needs to be done for every user mentioned on a page... LOTS of queries!
I had considered keeping a score in the user table, which would seem a lot quicker to look up, but I've had it drummed into me that storing data that can be calculated from other data is BAD!
I've seen a lot of sites that do similar things (even stack overflow does similar) so I figure there must be a "best practice" to follow. Can anyone suggest what it may be?
Any suggestions or comments would be great. Thanks!
I think that this is definitely a great question. I've had to build systems that have similar behavior to this--especially when the table with the scores in it is accessed pretty often (like in your scenario). Here's my suggestion to you:
First, create some tables like the following (I'm using SQL Server best practices, but name them however you see fit):
UserAccount          UserAchievement
-Guid (PK)           -Guid (PK)
-FirstName           -UserAccountGuid (FK)
-LastName            -Name
-EmailAddress        -Score
Once you've done this, go ahead and create a view that looks something like the following (no, I haven't verified this SQL, but it should be a good start):
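SELECT [UserAccount].[Guid] AS UserAccountGuid,
       [UserAccount].[FirstName] AS FirstName,
       [UserAccount].[LastName] AS LastName,
       SUM([UserAchievement].[Score]) AS TotalPoints
FROM [UserAccount]
INNER JOIN [UserAchievement]
    ON [UserAccount].[Guid] = [UserAchievement].[UserAccountGuid]
-- Group by the key as well, so two users who share a name are not merged.
GROUP BY [UserAccount].[Guid],
         [UserAccount].[FirstName],
         [UserAccount].[LastName]
-- SQL Server does not allow ORDER BY in a view body without TOP;
-- when this SELECT becomes a view, move the sort to the query that reads it.
ORDER BY [UserAccount].[LastName] ASC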
I know you've mentioned some concern about performance and a lot of queries, but if you build out a view like this, you won't ever need more than one. I recommend not making this a materialized view; instead, just index your tables so that the lookups that you need (essentially, UserAccountGuid) will enable fast summation across the table.
I will add one more point: if your UserAccount table gets huge, you may consider a slightly more intelligent query that takes in the accounts you need roll-ups for. This makes it possible to avoid returning huge data sets to your web site when you're only showing, say, 3-10 users' information on the page. I'd have to think a bit more about how to do this elegantly, but I'd be wary of long "IN" lists: without an index on the filtered column they force a linear search of the table.
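One possible shape for the view plus a per-page roll-up, sketched in SQLite rather than SQL Server. The snake_case names are assumptions, and the sketch swaps the INNER JOIN for a LEFT JOIN so that users without achievements show 0 points. It also uses a parameterized IN list over the indexed key despite the caution above. Whether that performs well depends on the engine and its indexes, so it is something to measure rather than assume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_account (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE user_achievement (
    id INTEGER PRIMARY KEY,
    user_account_id INTEGER NOT NULL REFERENCES user_account(id),
    name TEXT NOT NULL,
    score INTEGER NOT NULL
);
CREATE INDEX idx_achievement_user ON user_achievement (user_account_id);
CREATE VIEW user_score AS
    SELECT a.id AS user_id, a.name AS name, COALESCE(SUM(s.score), 0) AS total_points
    FROM user_account a
    LEFT JOIN user_achievement s ON s.user_account_id = a.id
    GROUP BY a.id, a.name;
""")
conn.executemany("INSERT INTO user_account (id, name) VALUES (?, ?)",
                 [(1, "ann"), (2, "ben"), (3, "cat")])
conn.executemany(
    "INSERT INTO user_achievement (user_account_id, name, score) VALUES (?, ?, ?)",
    [(1, "item submitted", 25), (1, "comment", 1), (1, "comment", 1), (2, "bonus", 10)],
)

def scores_for(conn, user_ids):
    """Total points for just the users shown on one page, in a single query."""
    if not user_ids:
        return {}
    placeholders = ", ".join("?" for _ in user_ids)
    rows = conn.execute(
        f"SELECT user_id, total_points FROM user_score WHERE user_id IN ({placeholders})",
        list(user_ids),
    )
    return dict(rows.fetchall())

page_scores = scores_for(conn, [1, 3])
```

Only the placeholders are interpolated into the SQL text; the ids themselves are still passed as bound parameters.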
For very high read/write ratios, denormalizing is a very valid option. You can use an indexed view and the data will be kept in sync declaratively (so you never have to worry about there being bad score data). The downside is that it IS kept in sync, so the updates to the stored total are a synchronous part of committing the score action. This would normally be quite fast, but it is a design decision. If you denormalize yourself, you can choose whether you want some kind of delayed update system.
Personally I would go with an indexed view for starting, and then later you can replace it fairly seamlessly with a concrete table if your needs dictate.
In the past we've always used some sort of nightly or periodic cron job to calculate the current score and save it in the database, sort of like a persistent view of the SUM on the activities table. Like most "best practices", these are simply guidelines, and it's often better and more practical to deviate from a specific hard-nosed practice in very specific areas.
Plus it's not really all that much of a deviation if you use the cron job as it's better viewed as a cache stored in the database.
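The periodic job could be as small as the sketch below, which recomputes a cache table from the activity rows. The activity and user_score_cache tables and the refresh_scores function are hypothetical names, and scheduling (cron or otherwise) is left out.

```python
import sqlite3

def refresh_scores(conn):
    """Recompute every user's cached total; meant to be run by a scheduler such as cron."""
    with conn:  # one transaction, so the swap is committed as a whole
        conn.execute("DELETE FROM user_score_cache")
        conn.execute(
            "INSERT INTO user_score_cache (user_id, total_points) "
            "SELECT user_id, SUM(points) FROM activity GROUP BY user_id"
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
    "points INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE user_score_cache (user_id INTEGER PRIMARY KEY, "
    "total_points INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO activity (user_id, points) VALUES (?, ?)",
                 [(1, 25), (1, 1), (2, 10)])
refresh_scores(conn)

# New activity arrives between runs: the cache is stale until the next refresh.
conn.execute("INSERT INTO activity (user_id, points) VALUES (?, ?)", (2, 5))
stale = dict(conn.execute("SELECT user_id, total_points FROM user_score_cache"))
refresh_scores(conn)
fresh = dict(conn.execute("SELECT user_id, total_points FROM user_score_cache"))
```

The gap between stale and fresh is the trade-off of this approach: reads are cheap, but scores lag by up to one job interval.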
If you have a separate scores table, you could update it each time an item is submitted or a comment is posted by a user. You could do this using a trigger or within the site's code.
The user scores would be updated continuously, and could be quickly queried for display.
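A sketch of the trigger variant in SQLite. The comment and user_score tables and the trigger name are assumptions, and the 1 point per comment comes from the example in the question. Trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_score (
    user_id INTEGER PRIMARY KEY,
    total_points INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE comment (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    body TEXT NOT NULL
);

-- Each new comment is worth 1 point.
CREATE TRIGGER comment_awards_point AFTER INSERT ON comment
BEGIN
    -- Make sure the user has a score row, then add the point.
    INSERT OR IGNORE INTO user_score (user_id, total_points) VALUES (NEW.user_id, 0);
    UPDATE user_score SET total_points = total_points + 1 WHERE user_id = NEW.user_id;
END;
""")
conn.executemany("INSERT INTO comment (user_id, body) VALUES (?, ?)",
                 [(1, "nice"), (1, "thanks"), (2, "+1")])
totals = dict(conn.execute("SELECT user_id, total_points FROM user_score ORDER BY user_id"))
```

A matching AFTER DELETE trigger would be needed if removing a comment should also take the point away.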