What's the best way to store/calculate user scores? - sql

I am looking to design a database for a website where users will be able to gain points (reputation) for performing certain activities and am struggling with the database design.
I am planning to keep records of the things a user does so they may have 25 points for an item they have submitted, 1 point each for 30 comments they have made and another 10 bonus points for being awesome!
Clearly all the data will be there, but it seems like a lot of querying to get the total score for each user, which I would like to display next to their username (in the form of a level). For example, a query to the submitted-items table to get the scores for each item from that user, a query to the comments table, etc. If all this needs to be done for every user mentioned on a page.... LOTS of queries!
I had considered keeping a score in the user table, which would seem a lot quicker to look up, but I've had it drummed into me that storing data that can be calculated from other data is BAD!
I've seen a lot of sites that do similar things (even stack overflow does similar) so I figure there must be a "best practice" to follow. Can anyone suggest what it may be?
Any suggestions or comments would be great. Thanks!

I think that this is definitely a great question. I've had to build systems that have similar behavior to this--especially when the table with the scores in it is accessed pretty often (like in your scenario). Here's my suggestion to you:
First, create some tables like the following (I'm using SQL Server best practices, but name them however you see fit):
UserAccount            UserAchievement
-Guid (PK)             -Guid (PK)
-FirstName             -UserAccountGuid (FK)
-LastName              -Name
-EmailAddress          -Score
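If it helps to see that as DDL, here's a rough sketch (SQL Server syntax; the column types are guesses, adjust to taste):
CREATE TABLE UserAccount (
    [Guid]       UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    FirstName    NVARCHAR(100),
    LastName     NVARCHAR(100),
    EmailAddress NVARCHAR(256)
);
CREATE TABLE UserAchievement (
    [Guid]          UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    UserAccountGuid UNIQUEIDENTIFIER NOT NULL REFERENCES UserAccount ([Guid]),
    Name            NVARCHAR(100),
    Score           INT NOT NULL
);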
Once you've done this, go ahead and create a view that looks something like the following (no, I haven't verified this SQL, but it should be a good start):
SELECT [UserAccount].[Guid] AS UserAccountGuid,
       [UserAccount].[FirstName] AS FirstName,
       [UserAccount].[LastName] AS LastName,
       SUM([UserAchievement].[Score]) AS TotalPoints
FROM [UserAccount]
INNER JOIN [UserAchievement]
    ON [UserAccount].[Guid] = [UserAchievement].[UserAccountGuid]
GROUP BY [UserAccount].[Guid],
         [UserAccount].[FirstName],
         [UserAccount].[LastName]
ORDER BY [UserAccount].[LastName] ASC
I know you've mentioned some concern about performance and a lot of queries, but if you build out a view like this, you won't ever need more than one. I recommend not making this a materialized view; instead, just index your tables so that the lookups that you need (essentially, UserAccountGuid) will enable fast summation across the table.
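For example, an index along these lines (the name is made up) lets the view total a user's scores without scanning the whole achievement table:
CREATE INDEX IX_UserAchievement_UserAccountGuid
    ON UserAchievement (UserAccountGuid)
    INCLUDE (Score);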
I will add one more point: if your UserAccount table gets huge, you may consider a slightly more intelligent query that incorporates only the accounts you need roll-ups for. That makes it possible not to return huge data sets to your web site when you're only showing 3-10 users' information on a page. I'd have to think a bit more about how to do this elegantly, but I'd suggest staying away from "IN" statements, since these can invoke a linear search of the table.

For very high read/write ratios, denormalizing is a very valid option. You can use an indexed view and the data will be kept in sync declaratively (so you never have to worry about there being bad score data). The downside is that it IS kept in sync... so the updates to the score total are a synchronous part of committing the score action. This would normally be quite fast, but it is a design decision. If you denormalize yourself, you can choose whether you want some kind of delayed update system.
Personally I would go with an indexed view for starting, and then later you can replace it fairly seamlessly with a concrete table if your needs dictate.
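For reference, a sketch of what the indexed-view route looks like in SQL Server (names are made up, and Score must be NOT NULL for the SUM to be allowed):
CREATE VIEW dbo.UserScoreTotal
WITH SCHEMABINDING
AS
SELECT UserAccountGuid,
       SUM(Score)   AS TotalPoints,
       COUNT_BIG(*) AS ScoreCount  -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM dbo.UserAchievement
GROUP BY UserAccountGuid;
GO
-- The unique clustered index is what materializes the totals and keeps them in sync.
CREATE UNIQUE CLUSTERED INDEX IX_UserScoreTotal
    ON dbo.UserScoreTotal (UserAccountGuid);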

In the past we've always used some sort of nightly or periodic cron job to calculate the current score and save it in the database - sort of like a persistent view of the SUM over the activities table. Like most "best practices", they are simply guidelines, and it's often better and more practical to deviate from a specific hard-nosed practice in very specific areas.
Plus it's not really all that much of a deviation if you use the cron job as it's better viewed as a cache stored in the database.
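The job itself can be a single statement, something like this (the CachedScore column and the table names are just examples):
-- Recompute the cached totals from the underlying activity rows.
UPDATE UserAccount
SET CachedScore = (SELECT COALESCE(SUM(Score), 0)
                   FROM UserAchievement
                   WHERE UserAchievement.UserAccountGuid = UserAccount.Guid);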

If you have a separate scores table, you could update it each time an item is submitted or a comment is posted by a user. You could do this using a trigger or within the site's code.
The user scores would be updated continuously, and could be quickly queried for display.
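A sketch of the trigger variant (MySQL syntax; the table and column names are just examples):
-- Bump the cached total whenever a comment is inserted; similar triggers
-- would exist for item submissions, bonuses, and so on.
CREATE TRIGGER comment_scored AFTER INSERT ON comments
FOR EACH ROW
    UPDATE user_scores
    SET total_score = total_score + 1
    WHERE user_id = NEW.user_id;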

Dynamic SQL Tables A Good Idea?

I was hoping someone a bit more knowledgeable might be able to help me with the problem I'm facing.
I have a game in which I use Rails as the backend for handling a number of tasks, including fetching high scores. In the game users can create their own levels, and other users can subsequently play on said levels. I currently store scores in a table called "scores" in the following way:
level_id: the id of the level
user_id: the id of the user playing the level
total_score: the score the user achieved on that level
Each time a user completes a level, the high scores for that level are shown. The problem I see with this is that for each level I have to fetch every database entry with the corresponding level_id, then order the entries fetched in descending order by total_score. I feel as though there is likely a better way to do this, as I currently have to sort every score for each level, which seems repetitive, and I also expect this will be slow with a large number of entries in the table for the level_id being fetched.
I was wondering if a better approach would be to create a table for each level, which would possibly hold the user_id and total_score, and would be ordered by total_score. I feel as though this would cut down on the number of sort procedures that would need to be executed, but am wondering if there is a downside to having a large number of tables in a database in this format, or if there is possibly another way to address this issue?
Thanks,
John
No.
Your current design is most likely adequate and using one table per Level would offer few benefits but many headaches.
Basically, ActiveRecord is built around a "fixed" schema (it does not change at runtime), and having each model instance occupy its own table is just not going to work. A relational database is not a document store.
You would need batsh*t crazy joins to pull information from several levels, and you would basically have to do every kind of query by hand or write your own ORM.
Most likely what you are looking to do can be done by setting a default order on scores and joining:
@level = Level.eager_load(:scores)
              .find(params[:id])
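Whatever ActiveRecord generates, the SQL underneath stays simple, and a composite index makes the per-level ordering cheap. A sketch using the column names from the question (the index name is made up; the DESC in the index is honored by newer engines and harmlessly ignored by older MySQL):
-- One index covering both the filter and the sort.
CREATE INDEX index_scores_on_level_and_total
    ON scores (level_id, total_score DESC);
-- The leaderboard query is then an index range scan, not a full sort of the level's scores.
SELECT user_id, total_score
FROM scores
WHERE level_id = 42            -- 42 is a placeholder level id
ORDER BY total_score DESC
LIMIT 10;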
However first of all you need to ask yourself if this is premature optimization. Premature optimization often leads to the wrong answers to the wrong questions.

Database efficiency - table per user vs. table of users

For a website with users, where each user can create any number of (we'll call them) "posts":
Efficiency-wise, is it better to create one table for all of the posts, saving for each post the user id of the user who created it, OR to create a separate table for each user containing just the posts created by that user?
The database layout should not change when you add more data to it, so the user data should definitely be in one table.
Also:
Having multiple tables means that you have to create queries dynamically.
The cached query plan for one table won't be used for any other of the tables.
Having a lot of data in one table doesn't affect performance much, but having a lot of tables does.
If you want to add an index to the table to make queries faster, it's a lot easier to do on a single table.
Well to answer the specific question: In terms of efficiency of querying, it will always be better to have small tables, hence a table per user is likely to be the most efficient.
However, unless you have a lot of posts and users, this is not likely to matter. Even with millions of rows, you will get good performance with a well-placed index.
I would strongly advise against the table-per-user strategy, because it adds a lot of complexity to your solution. How would you query when you need to find, say, users that have posted on a subject within the year?
Optimize when you need to. Not because you think/are afraid something will be slow. (And even if you need to optimize, there will be easier options than table-per-user)
Schemas with a varying number of tables are, generally, bad. Use one single table for your posts.
If performance is a concern, you should learn about database indexes. While indexes are not part of the SQL standard, nearly all databases support them to help improve performance.
I recommend that you create a single table for all users' posts and then add indexes to this table to improve the performance of searching. For example, you can add an index on the user column so that you can quickly find all posts for a given user. You may also want to consider adding other indexes, depending on your application's requirements.
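For example, roughly (table, column, and index names are illustrative only):
-- One posts table for all users; the index keeps per-user lookups fast.
CREATE TABLE posts (
    id      INT NOT NULL PRIMARY KEY,
    user_id INT NOT NULL,
    body    TEXT,
    created TIMESTAMP
);
CREATE INDEX idx_posts_user ON posts (user_id);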
Your first proposal of having a single user and a single post table is the standard approach to take.
At the moment posts may be the only user-specific feature on your site, but imagine that it might need to grow in the future to support users having messages, preferences, etc. Now your separate table-per-user approach leads to an explosion in the number of tables you'd need to create.
I have a similar but different issue with your answer because both #guffa and #driis are assuming that the "posts" need to be shared among users.
In my particular situation, not a single user data point can be shared with any other user for privacy reasons, not even for analytics.
We plan on using MySQL or Postgres, and here are the three options our team is warring about:
N schemas and 5 tables - some of our devs feel that this is the best direction to take to keep the data completely segregated.
Pros - less complexity if you think of schema as a folder and tables as files. We'll have one schema per user
Cons - most ORMs do connection pooling per schema
1 schema and N x 5 tables - some devs like this because it allows for connection pooling but appears to make the issue more complex.
Pros - connection pooling in the ORM is possible
Cons - cannot find an ORM where Models are set up for this
1 schema and 5 tables - some devs like this because they think we benefit from caching.
Pros: ORMs are happy because this is what they are designed to do
Cons: every query requires the username table
I, personally, land in camp 1: n schemas.
My lead dev lands in camp 3: 1 schema 5 tables.
Caching:
If data is always 1:1, I cannot see how caching will ever help regardless of the solution we use because each user will be searching for different info.
Any thoughts?

Permanent table, temp tables or php session?

My web app offers personalized recommendations. When a user starts using it, 1000+ rows are inserted into one big recommendation table, correlated with other tables in the database. Every item the user votes for affects all of those 1000+ rows.
Since the recommendation info is only useful during the session, and since the recommendation table is getting huge, we'd like to switch to a more appropriate method. There's the possibility of deleting the relevant rows as soon as the user session is over. I guess a PHP session array or temp tables would be better for this case?
One temp table per session will lead to catalog pollution, so not really recommended.
Have you considered actually keeping the data, so you can periodically mine it to improve the suggestions?
First, consider redesigning your data structure; I don't think it is optimal.
Store a user's recommendations in a table of user / recommended item / score: I don't see any need for a temp table or anything else.
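Something along these lines (names are just examples):
-- One row per (user, recommended item), replacing the big shared recommendation table.
CREATE TABLE user_recommendation (
    user_id INT   NOT NULL,
    item_id INT   NOT NULL,
    score   FLOAT NOT NULL,
    PRIMARY KEY (user_id, item_id)
);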
Otherwise, you could start using sessions, but you should encapsulate the code carefully, making it easy to change if/when this solution is no longer maintainable.
I suspect that the method is flawed - 1000+ recommendations per user? How many of them do they ever look at? If you don't know the answer to that question - then you need to spend some time thinking about why you don't know the answer.
Every item the user votes for affects all of those 1000+ rows
Are you sure your data is properly normalised?
But leaving that aside for the moment: the right place to generate and store that is in the database. A relational database is explicitly designed for, and a lot more efficient at, generating and maintaining tabular sets of data than a conventional programming language.

Is it bad to not use normalised tables in this database?

I recently learned about normalisation in my informatics class and I'm developing a multiplayer game using SQLite as backend database at the moment.
Some information on it:
The simplified structure looks a bit like the following:
player_id | level | exp | money | inventory
----------|-------|-----|-------|--------------------------
1         | 3     | 120 | 400   | {item a; item b; item c}
Okay. As you can see, I'm storing a table/array in string form in the column "inventory". This is against normalization.
But the thing is: Making an extra table for the inventory of players brings only disadvantages for me!
The only points where I access the database is:
When a player joins the game and his profile is loaded
When a player's profile is saved
When a player joins, I load his data from the DB and store it in memory. I only write to the DB like every five minutes when the player is saved. So there are actually very few SQL queries in my script.
If I used an extra table for the inventory I would have to, upon loading:
Perform a more expensive and probably more data-intensive query to fetch all items from the inventory table which belong to player X
Walk through the results and convert them into a table for storage in memory
And upon saving:
Delete all items from the inventory table which belong to player X (player might have dropped/sold some items?)
Walk through the table and perform a query for each item the player owns
If I kept all the player data in one table:
I'd only have one query for saving and loading
Everything would be in one place
I would only have to (de)serialize the tables upon loading and saving, in my script
What should I do now?
Do my arguments and situation justify working against normalisation?
Are you saying that you think parsing a string out of "inventory" doesn't take any time or effort? Because everything you need to do to store/retrieve inventory items from a sub table is something you'd need to do with this string, and with the string you don't have any database tools to help you do it.
Also, if you had a separate subtable for inventory items, you could add and remove items in real time, meaning that if the app crashes or the user disconnects, they don't lose anything.
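A sketch of what that subtable might look like (table and column names are just examples):
-- One row per item a player holds; rows are added or removed as the player picks up or drops items.
CREATE TABLE player_item (
    player_id INT NOT NULL,
    item_id   INT NOT NULL,
    quantity  INT NOT NULL DEFAULT 1,
    PRIMARY KEY (player_id, item_id)
);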
There are a lot of possible answers, but the one that works for you is the one to choose. Keep in mind, your choice may need to change over time.
If the amount of data you need to persist is small (ie: fits into a single table row) and you only need to update that data infrequently, and you don't have any reason to care about subsets of that data, then your approach makes sense. As time goes on and your players gain more items and you add more personalization to the game, you may begin to push up against the limits of SQLite, and you'll need to evolve your design. If you discover that you need to be able to query the item list to determine which players have what items, you'll need to evolve your design.
It's generally considered a good idea to get your data architecture right early, but there's no point in sitting in meetings today trying to guess how you'll use your software in 5-10 years. Better to get a design that meets this year's needs, and then plan to re-evaluate the design again after a year.
What's going to happen when you have one hundred thousand items in your inventory and you only want to bring back two?
If this is something that you're throwing together for a one off class and that you won't ever use again, then yes, the quick and dirty route might be a quicker option for you.
However if this is something you're going to be working on for a few months, then you're going to run into long-term issues with that design decision.
No, your arguments aren't valid. They basically boil down to "I want to do all of this processing in my client code instead of in SQL and then just write it all to a single field", because you are still doing all of the exact same processing to generate the string. By doing this you are removing the ability to easily load a small portion of the list, and losing relationships to the actual item table, which could contain more information about the items (I assume you're hard-coding it all based on names instead of using internal item IDs, which is a really bad idea, imo).
Don't do it. Long term the approach you are wanting to take will generate a lot more work for you as your needs evolve.
Another case of premature optimization.
You are trying to optimize something that you don't have any performance metrics for. What is the target platform? Even the crappiest computers nowadays can handle at least hundreds of your read operations per second. Then you add better hardware for more users, then you can go to the cloud, and when you get into the problem space that Google, Twitter and Facebook are dealing with, you can consider denormalizing. Even then, the best solution is some sort of key-value database.
Maybe you should check the Wikipedia article on database normalization to remind yourself why a normalized database is a good thing.
You should also think about the items. Are the items unique for every user, or could user1 have item1 and user2 have item1 too? If you later want to change item1, you have to go through your whole table and check which users have this item. If you normalize your tables, this is much easier.
But in the end, I think the answer is: it depends.
Do my arguments and situation justify working against normalisation?
Not based on what I've seen so far.
Normalized database designs (appropriately indexed and with efficient usage of the database with UPSERTs, transactions, etc.) in general-purpose engines will generally outperform code, except where the code is very tightly optimized. Typically in such code, some feature of the general-purpose RDBMS engine is abandoned, such as one of the ACID properties or referential integrity.
If you want to have very simple data access (you tout one table, one query as a benefit), perhaps you should look at a document-centric database like MongoDB or CouchDB.
The reason that you use any technology is to leverage the technology's advantages. SQL has many advantages that you seem to not want to use, and that's fine, if you don't need them. In Neal Stephenson's Zodiac, the main character mentions that few things bought from a hardware store are used for their intended purpose. Software's like that, too. What counts is that it works, and it works nearly 100% of the time, and it works fast enough.
And yet, I can't help but think that someday you're going to have some overpowered item released into the wild, and you're going to want to deal with this problem at the database layer. Say you accidentally gave out some superinstakillmegadeathsword inventory items that kill everything within 50 meters on use (wielder included), and you want to remove those things from play. As an apology to the people who lose their superinstakillmegadeathsword items, you want to give them 100 money for each superinstakillmegadeathsword you take away.
With a properly normalized database structure, that's a trivial task. With a denormalized structure, it's quite a bit harder and slower. A normalized database will also be easier to expand on in the future.
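For example, assuming a normalized inventory subtable like player_item (player_id, item_id, quantity), a player table with a money column, and that the offending item happens to have id 666, the cleanup is just two statements:
-- Pay out 100 money per confiscated sword...
UPDATE player
SET money = money + 100 * COALESCE(
        (SELECT SUM(quantity)
         FROM player_item
         WHERE player_item.player_id = player.player_id
           AND player_item.item_id = 666), 0);
-- ...then take the swords away.
DELETE FROM player_item WHERE item_id = 666;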
So are you sure you don't want to normalize your database?

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows and I need varying sets of up to 1 or 2 thousand rows, but often enough a lot less. I want these queries to be as fast as possible and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, number 1 is the clear winner to me. If it was <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time then caching the results might be a better bet too, but when you are going to have a different WHERE all the time, it would be best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developer's code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the general case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look at how you could optimize it using caching. ("Premature optimization is the root of all evil," as Knuth once said.)
Also, remember that if you chose option 3, you'd be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
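For example (table and column names are placeholders):
-- Most engines expose the plan; MySQL and Postgres both accept this form.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;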
First of all, let us dismiss #2. Searching tables is a database server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is being done. If it's being done in the application code, then, as with #2, you have the same problem.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and did not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
If you do this:
SELECT * FROM users;
MySQL should perform two queries: one to find out which fields the table has, and another to bring back the data you asked for.
Doing this instead:
SELECT id, email, password FROM users;
MySQL only reads the data, since the fields are explicit.
About limits: it's always best to query only the quantity of rows you will need, no more, no less. More data means more time to move it.