PostgreSQL - "Ten most frequent entries" - sql

We've got a table with two columns: USER and MESSAGE.
A USER can have more than one message.
The table is frequently updated with more USER-MESSAGE pairs.
I want to frequently retrieve the top X users that sent the most messages. What would be the optimal (DX- and performance-wise) solution for it?
The solutions I see myself:
I could GROUP BY and COUNT, but that doesn't seem like the most performant or cleanest solution.
I could keep an additional table that keeps a count of every user's messages. On every message insertion into the main table, I could also update the relevant row here. Could the update be done automatically? Perhaps I could write a procedure for it?
For the main table, I could create a VIEW with an additional "calculated" column - it would GROUP BY and COUNT, but again, it's probably not the most performant solution. I'd query the view instead.
Please tell me whatever you think might be the best solution.

Some databases have incrementally maintained views: you create a view as in your option 3, and the database automatically keeps it updated as in your option 2. PostgreSQL does not have this feature.
For your option 1, it seems pretty darn clean to me. Hard to get much simpler than that. Yes, it could have performance problems, but how fast do you really need it to be? You should make sure you actually have a problem before worrying about solving it.
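For reference, a minimal sketch of that query, assuming the data lives in a table called messages ("user" has to be quoted in PostgreSQL because USER is a reserved word):

SELECT "user", count(*) AS message_count
FROM messages
GROUP BY "user"
ORDER BY message_count DESC
LIMIT 10;   -- top 10; replace with whatever X you need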
For your option 2, what you are looking for is a trigger. For each insertion, it would increment a count in the user table. If you ever delete, you would also need to decrease the count. Also, if you ever update an entry to change its user, the trigger would need to decrease the count of the old user and increase that of the new user. This will decrease concurrency: if two processes try to insert messages from the same user at the same time, one will block until the other finishes. This may not matter much to you. Also, the mere existence of triggers imposes some CPU overhead, plus whatever the trigger itself actually does. But unless your server is already overloaded, this might not matter.
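A rough sketch of such a trigger, assuming the same messages table and a users table with a message_count column (all names are hypothetical; EXECUTE FUNCTION needs PostgreSQL 11+, older versions use EXECUTE PROCEDURE):

CREATE OR REPLACE FUNCTION maintain_message_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE users SET message_count = message_count + 1 WHERE "user" = NEW."user";
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE users SET message_count = message_count - 1 WHERE "user" = OLD."user";
    ELSIF TG_OP = 'UPDATE' AND NEW."user" IS DISTINCT FROM OLD."user" THEN
        UPDATE users SET message_count = message_count - 1 WHERE "user" = OLD."user";
        UPDATE users SET message_count = message_count + 1 WHERE "user" = NEW."user";
    END IF;
    RETURN NULL;  -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_maintain_message_count
AFTER INSERT OR UPDATE OR DELETE ON messages
FOR EACH ROW EXECUTE FUNCTION maintain_message_count();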
Your option 3 doesn't make much sense to me, at least not in PostgreSQL. There is no performance benefit, and it would act to obscure rather than clarify what is going on. Anyone who can't understand a GROUP BY is probably going to have even more problems understanding a view which exists only to do a GROUP BY.
Another option is a materialized view. But you will see stale data from them between refreshes. For some uses that is acceptable, for some it is not.
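A minimal sketch, again with the assumed names from above:

CREATE MATERIALIZED VIEW user_message_counts AS
SELECT "user", count(*) AS message_count
FROM messages
GROUP BY "user";

-- the view only reflects the data as of the last refresh
REFRESH MATERIALIZED VIEW user_message_counts;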

The first and third solutions are essentially the same, since a view is nothing but a “crystallized” query.
The second solution would definitely make for faster queries, but at the price of storing redundant data. The disadvantages of such an approach are:
You are running danger of inconsistent data. You can reduce that danger somewhat by using triggers that automatically keep the data synchronized.
The performance of modifications to the message table will be worse, because the trigger will have to be executed, and each modification will also modify the user table (that is the natural place to keep such a count).
The decision should be based on the question whether the GROUP BY query will be fast enough for your purposes. If yes, use it and avoid the above disadvantages. If not, consider storing the extra count.

Buffer table in a database, Good or not?

I have a question!
I need to build a university project, and in this project I will have one database table like this:
This table will have a LOT of records!
To manage this I need to create a validation system.
What is better (and why): creating a buffer table like this:
Or adding a column to my table like this:
Thank you!
Your question does not have enough information to provide a real answer, so here is some guidance on how to think about the situation. Which approach is better depends on the nature of your application, and especially on what "validation" means.
One reasonable interpretation is that "validation" is part of a work-flow process, so it happens only once (or 99% of the time only once), and you never want to see unvalidated advertisements when you look at advertisements. If this is the case, then there would typically be additional information about the validation process.
This scenario suggests two reasonable approaches:
Do the validation inside a transaction. This would be reasonable if the validation process were entirely in the database and was measured in seconds.
Have a separate table for advertisements being validated. Perhaps even a separate table per "user" or "entity" responsible for them. Depending on the nature of the validation process, this could be a queue that feeds them to people doing the validation.
Putting them in the "advertisements" table doesn't make sense, because there is likely to be additional information involved with the validation process -- who, what, where, when, how.
If an advertisement can be validated and invalidated multiple times, then the best approach may be to put them in the same table. Once again, there are questions about the nature of the process.
Getting access to the two groups without a full table scan is tricky. If 10% of the rows are invalidated and 90% are validated, then a normal index does not help for reading the larger group (that is essentially the whole table anyway), and the optimizer may not even use it for the smaller one. To get faster access to the smaller group, here are two options:
clustered index on the validation flag.
separate partitions for validated and invalidated rows.
In both cases, changing the validation flag for a record is relatively expensive, because it involves reading and writing the record on different data pages. Unless dozens of changes are made per second, this is probably not a big deal.
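As an illustration of the partitioning option, here is a sketch in PostgreSQL's declarative-partitioning syntax (the table and column names are assumptions; other engines have their own partitioning features):

CREATE TABLE advertisements (
    id      bigint NOT NULL,
    content text,
    valid   int    NOT NULL
) PARTITION BY LIST (valid);

CREATE TABLE advertisements_pending   PARTITION OF advertisements FOR VALUES IN (0);
CREATE TABLE advertisements_validated PARTITION OF advertisements FOR VALUES IN (1);

Flipping valid from 0 to 1 then physically moves the row from one partition to the other, which is exactly the "relatively expensive" change described above.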
Here, there is no need to have a separate "buffer table". You can just properly index the valid field. So the following index would essentially automatically create a buffer table:
create unique index x on y (id)
include (all columns)
where (valid = 0)
This index effectively maintains a copy of the not-yet-validated data. You can do lots of variations, such as
create unique index x on y (valid, id)
There's really no need for a separate table. Indexes are very easy compared to partitioning, or even manual partitioning into separate tables: much less work, more general, more flexible, and less potential for human error.
Either approach is valid, and which will perform better will depend more on the type of database you are using rather than the theoretical question of whether it is more correct to use a boolean or partition this into two tables.
I actually prefer the partitioning approach (your buffer table idea), but it will be more complex to code around. This may be a significant point to consider. Most modern databases will handle the boolean criteria very well with an index, but sometimes you can be surprised.
The most important thing from a development perspective right now is to pick one and run with it instead of paralyzing your project while you decide the "right" one.

Improve Log Exceptions

I am planning to use log4net in a new web project. In my experience, I have seen how big the log table can get, and I have noticed that errors or exceptions are repeated. For instance, I just queried a log table that has more than 132,000 records, and using DISTINCT I found that only about 2,500 of them are unique (~2%); the others (~98%) are just duplicates. So I came up with this idea to improve logging.
Have a couple of new columns, counter and updated_dt, that are updated every time the same record would otherwise be inserted again.
If I want to track the user that caused the exception, I need to create a user_log or log_user table to map the N-to-N relationship.
Creating this model may make the system slow and inefficient, since it would have to compare all these long texts... Here is the trick: we should also have a binary(16) or binary(32) hash column that hashes the message and the exception, and configure an index on it. We can use HASHBYTES to help us.
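For illustration, a sketch of such a table in T-SQL (names and types are assumptions; note that before SQL Server 2016, HASHBYTES input was limited to 8000 bytes):

CREATE TABLE error_log (
    id         int IDENTITY PRIMARY KEY,
    message    nvarchar(max) NOT NULL,
    exception  nvarchar(max) NULL,
    msg_hash   binary(32)    NOT NULL,  -- e.g. HASHBYTES('SHA2_256', message + exception), computed on insert
    counter    int           NOT NULL DEFAULT 1,
    updated_dt datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
);

CREATE INDEX ix_error_log_hash ON error_log (msg_hash);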
I am not an expert in databases, but I think that would be the fastest way to locate a similar record. And since hashing doesn't guarantee uniqueness, it only helps to locate candidate records much faster; we can then compare by message or exception directly to make sure they really are duplicates.
This is a theoretical/practical solution, but will it work, or will it just bring more complexity? What aspects am I leaving out, or what other considerations do I need to take into account? A trigger could do the insert-or-update job, but is a trigger the best way to do it?
I wouldn't be too concerned with a log table of 132,000 records, to be honest; I have seen millions, if not billions, of records in a log table. If you are logging 132,000 records every few minutes, then you might want to tone it down a bit.
I think the idea is interesting, but here are my major concerns:
You could actually hurt the performance of your application by doing this. The log4net ADO.NET appender is synchronous, which means that if you make your INSERT any more complicated than it needs to be (looking up whether the data already exists, calculating hash codes, etc.), you will block the thread that is logging. That's not good! You could fix this by writing to some sort of staging table and doing the deduplication out of band with a job or something, but now you've created a bunch of moving parts for something that could be much simpler.
Time could probably be better spent doing other things. Storage is cheap, developer hours aren't, and logs don't need to be extremely fast to access, so a denormalized model should be fine.
Thoughts?
Yes you can do that. It is a good idea and it will work. Watch out for concurrency issues when inserting from multiple threads or processes. You probably need to investigate locking in detail. You should look into locking hints (in your case UPDLOCK, HOLDLOCK, ROWLOCK) and the MERGE statement. They can be used to maintain the dimension table.
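A hedged sketch of what that upsert could look like, using the MERGE statement and locking hints mentioned above and the hypothetical error_log table from the question:

CREATE PROCEDURE upsert_error_log
    @message   nvarchar(max),
    @exception nvarchar(max)
AS
BEGIN
    DECLARE @hash binary(32) =
        HASHBYTES('SHA2_256', CONCAT(@message, N'|', @exception));

    MERGE error_log WITH (UPDLOCK, HOLDLOCK, ROWLOCK) AS t
    USING (SELECT @hash AS msg_hash) AS s
        ON t.msg_hash = s.msg_hash
       AND t.message  = @message              -- guard against hash collisions
    WHEN MATCHED THEN
        UPDATE SET counter = t.counter + 1, updated_dt = SYSUTCDATETIME()
    WHEN NOT MATCHED THEN
        INSERT (message, exception, msg_hash)
        VALUES (@message, @exception, @hash);
END;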
As an alternative you could log to a file and compress it. Typical compression algorithms are very good at eliminating this type of exact redundancy.

Any SQL database: When is it better to fetch a whole table instead of querying for particular rows?

I have a table that contains maybe 10k to 100k rows, and I need varying sets of up to one or two thousand rows, but often a lot fewer. I want these queries to be as fast as possible, and I would like to know which approach is generally smarter:
Always query for exactly the rows I need with a WHERE clause that's different all the time.
Load the whole table into a cache in memory inside my app and search there, syncing the cache regularly
Always query the whole table (without WHERE clause), let the SQL server handle the cache (it's always the same query so it can cache the result) and filter the output as needed
I'd like to be agnostic of a specific DB engine for now.
With 10K to 100K rows, number 1 is the clear winner to me. If it were <1K I might say keep it cached in the application, but with this many rows, let the DB do what it was designed to do. With the proper indexes, number 1 would be the best bet.
If you were pulling the same set of data over and over each time, then caching the results might be a better bet, but when you are going to have a different WHERE all the time, it is best to let the DB take care of it.
Like I said though, just make sure you index well on all the appropriate fields.
Seems to me that a system that was designed for rapid searching, slicing, and dicing of information is going to be a lot faster at it than the average developers' code. On the other hand, some factors that you don't mention include the location or potential location of the database server in relation to the application - returning large data sets over slower networks would certainly tip the scales in favor of the "grab it all and search locally" option. I think that, in the 'general' case, I'd recommend querying for just what you want, but that in special circumstances, other options may be better.
I firmly believe option 1 should be preferred in an initial situation.
When you encounter performance problems, you can look at how you could optimize it using caching. ("Premature optimization is the root of all evil", as Knuth said.)
Also, remember that if you choose option 3, you'll be sending the complete table contents over the network as well. This also has an impact on performance.
In my experience it is best to query for what you want and let the database figure out the best way to do it. You can examine the query plan to see if you have any bottlenecks that could be helped by indexes as well.
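For example, in PostgreSQL (most engines offer an equivalent, such as Oracle's EXPLAIN PLAN or SQL Server's execution plans); the table and filter here are hypothetical:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;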
First of all, let us dismiss #2. Searching tables is a database server's reason for existence, and it will almost certainly do a better job of it than any ad hoc search you cook up.
For #3, you just say "filter the output as needed" without saying where that filtering is being done. If it is done in the application code, then you have the same problem as #2.
Databases were created specifically to handle this exact problem. They are very good at it. Let them do it.
The only reason to use anything other than option 1 is if the WHERE clause itself is huge (i.e. if your WHERE clause identifies each row individually, e.g. WHERE id = 3 or id = 4 or id = 32 or ...).
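For illustration, such a clause is normally written with IN; the table and ids here are hypothetical:

SELECT * FROM items WHERE id IN (3, 4, 32, 107);  -- fine for short lists, unwieldy for thousands of ids

For very long lists, loading the ids into a temporary table and joining against it is a common alternative.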
Is anything else changing your data? The point about letting the SQL engine optimally slice and dice is a good one. But it would be surprising if you were working with a database and did not have the possibility of "someone else" changing the data. If changes can be made elsewhere, you certainly want to re-query frequently.
Trust that the SQL server will do a better job of both caching and filtering than you can afford to do yourself (unless performance testing shows otherwise.)
Note that I said "afford to do" not just "do". You may very well be able to do it better but you are being paid (presumably) to provide functionality not caching.
Ask yourself this... Is spending time writing cache management code helping you fulfil your requirements document?
If you do this:
SELECT * FROM users;
MySQL effectively has to do two things: first find out which fields the table has, and then bring back the data you asked for.
Doing
SELECT id, email, password FROM users;
MySQL goes straight to the data, since the fields are explicit.
About limits: it is always best to query only the number of rows you will need, no more, no less. More data means more time to move it.

Refactoring "extreme" SQL queries

I have a business user who tried his hand at writing his own SQL query for a report of project statistics (e.g. number of tasks, milestones, etc.). The query starts off declaring a temp table of 80+ columns. There are then almost 70 UPDATE statements to the temp table over almost 500 lines of code that each contain their own little set of business rules. It finishes with a SELECT * from the temp table.
Due to time constraints and 'other factors', this was rushed into production and now my team is stuck with supporting it. Performance is appalling, although thanks to some tidy up it's fairly easy to read and understand (although the code smell is nasty).
What are some key areas we should be looking at to make this faster and follow good practice?
First off, if this is not causing a business problem, then leave it alone until it becomes one; then fix everything.
When you do decide to fix it, check whether there is one statement causing most of your speed issues... isolate and fix it.
If the speed issue is over all the statements, and you can combine it all into a single SELECT, this will probably save you time. I once converted a proc like this (not as many updates) to a SELECT and the time to run it went from over 3 minutes to under 3 seconds (no shit ... I couldn't believe it). By the way, don't attempt this if some of the data is coming from a linked server.
If you don't want to or can't do that for whatever reason, then you might want to adjust the existing proc. Here are some of the things I would look at:
If you are creating indexes on the temp table, wait until after your initial INSERT has populated it.
Adjust your initial INSERT to insert as many of the columns as possible. There are probably some updates you can eliminate by doing this.
Index the temp table before running your updates, but do not create indexes on any of the columns targeted by the update statements until after they're updated.
Group your updates if your table(s) and groupings allow for it. 70 updates is quite a few for only 80 columns, and it sounds like there may be an opportunity to do this.
Good luck
First thing I would do is check to make sure there is an active index maintenance job being run periodically. If not, get all existing indexes rebuilt or if not possible at least get statistics updated.
Second thing I would do is set up a trace (as described here) and find out which statements are causing the highest number of reads.
Then I would run in SSMS with 'show actual execution plan' and tally the results with the trace. From this you should be able to work out whether there are missing indexes that could improve performance.
Just like any refactoring, make sure you have an automated way to verify your refactorings after each change (you can write this yourself using queries which check the development output against a known good baseline). That way, you are always matching the known good data. This will give you a high degree of confidence in the correctness of your approach when you enter the phase where you are deciding whether to switch over to your new version of the process and want to run side by side for a few iterations to ensure correctness.
I also like to log all the test batches and the run times of the processes within the batch, so I can tell if some particular process within the batch was adversely affected at some point in time. I can get average times for processes and see trends of improvement or spot potential problems. This also lets me identify the low-hanging fruit within the batch where I can make the most improvement.
"There are then almost 70 UPDATE statements to the temp table over almost 500 lines of code that each contain their own little set of business rules. It finishes with a SELECT * from the temp table."
Actually, this sounds like it can be followed and understood quite well: each update statement does one thing to the table, with a specific purpose and set of business rules. I think that maintaining a procedure of 500 lines with one or a couple of SELECT statements that do "everything", built with 15 or so joins, CASE statements, etc. scattered all over the place, is a lot harder to maintain, although it would make for better performance.
It's a bit of a dilemma with SQL that writing clear and concise code (using multiple updates, creating functions, etc.) always seems to have a big negative impact on performance. Trying to do everything at once, which is considered bad practice in other programming languages, seems to be the very core of set-oriented languages.
If this is a report generating stored procedure, how often is it being run? If it's only necessary to run it once a day and is run during the night how much of an issue is the performance?
If it's not I'd recommend being careful in your choice to re-write it because there is a chance that you could muck up your figures.
Also it sounds like the sort of thing that should be pulled out into an SSIS package building up a new permanent table with the results so it only has to be run once.
Hope this makes sense
One thing you could try is to replace the temp table with a table variable. There are times when this is faster and times when it is not, you will have to just try it and see.
Look at the 70 update statements. Is it possible to combine any of them? If the person writing it did not use CASE statements, it might be possible to do fewer statements.
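For example, two single-purpose updates like these (hypothetical columns and rules) can often be folded into one pass with CASE:

-- before: two passes over the temp table
UPDATE #project_stats SET task_count = 0 WHERE task_count IS NULL;
UPDATE #project_stats SET milestone_flag = 1 WHERE milestone_count > 0;

-- after: one pass
UPDATE #project_stats
SET task_count     = COALESCE(task_count, 0),
    milestone_flag = CASE WHEN milestone_count > 0 THEN 1 ELSE milestone_flag END;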
Other obvious things to look at - eliminate any cursors, change any subqueries to joins to tables or derived tables.
Rewrite perhaps. One hardware solution would be to make sure your database temp table goes on a 'fast' drive, perhaps a solid state disk (SSD), or can be managed all in memory.
My guess is this 'solution' was developed by someone with a grasp of (and a dependency upon) spreadsheets, someone who may not be very savvy about 'normalized' databases: how to construct and populate tables to retain data for reporting purposes, something that BI (Business Intelligence) software can handle with sophistication while remaining adaptable.
You didn't say 'where' the update process is being run. Is it being run as a SQL script from a separate computer (desktop) against the server where the data lives? That approach can create significant bottlenecks and overhead. If so, consider running the entire update process directly on the server as a local job, as a compiled stored procedure, bypassing the network and (multiple) cursor-management overhead. It could have a scheduled time to run and a controlled priority, completing in off-peak business hours.
Evaluate how often 'commit' statements are really needed for the sequence of update statements... saving a bunch of commits could notably improve the overall update time. There may also be a couple of settings in the database client driver software that can make a notable difference.
Can the queries used for the update conditions be factored out as static 'views' which in turn can be shared across multiple update statements? Views can keep frequently accessed data/query rows in memory. There may also be tuning to be done in determining how much update data can be pending before a commit is optimal.
It might be worth evaluating whether triggers could be used to replace the batch-job update sequence. You don't say how many tables the data comes from; that might help with decision making. I don't know whether you have the option of adding triggers to the database tables from which the data is gathered. If so, adding a few triggers to a number of tables wouldn't really degrade overall system performance much, but might save a big wad of time on that update process. You could try replacing the update statements one at a time with triggers and see if the results are the same as before: create a similar temp table based on the same update process, then carefully test whether triggers feeding updates to the temp table could replace individual update statements. Perhaps you have a sort of 'data warehouse' application; it might be useful to review how to set up a 'star' schema of tables to retain summarized business data for reporting.
Creating a comprehensive, cached 'view' which is updated via the queries once per day, reflecting the changes, might be another approach to explore.
Well, since the only thing you've told us about this stored procedure is that it has an 80+ column temp table, the only thing I can recommend is to remove that table and rewrite the rest to remove the need for it.
You should get a tool that lets you see the explain plan of every query your app will run. It is the best bang for the buck for performance increases on an SQL-heavy app, if you read and react to what the explain plan is telling you. If you are on Oracle, what we used to use was TOAD by Quest Software. It was a great tool.
I would recommend looking at the tables involved, the end result, and starting from scratch to see if the query can be done in a more efficient manner. Keep the query to verify that the new one is working exactly the same as the old one, but try to forget all methods used to obtain the end result.
I would rewrite it from scratch.
You say that you understand what it is supposed to do, so it should not be that difficult. And I bet that the requirements for that piece of code will keep changing, so if you do not rewrite it now, you may end up maintaining some ugly monster.

use SUM() or caching

I don't have much experience with databases, so I don't know which is better for long-term performance, better practice, etc.
Here is my (hypothetical) case: imagine you have a database of customer information and history of purchase orders for each. You want to keep track of how much each customer is buying. I can think of two ways of calculating this:
1) Just do a SUM() every time it is needed. This is an easy solution, but the concern is that this database could be 20 years old with tens of thousands of rows for each customer. As more customer purchases are added to the database, the SUM() operation will take longer to calculate.
2) Keep the sum cached in the customer information table, and every time a purchase is made (updated, deleted, etc.), update this cache. That way, no matter how many purchase orders there are, there won't be an increase in calculation time. The downside is that this is a less flexible solution (it only gives the sum over all rows - what about a per-month sum? other intervals? etc.), and the cached value could somehow get out of sync with the actual total (technically that shouldn't happen, but it might).
So what should I be doing for this? I know I shouldn't be storing anything I can calculate from what's already in the database, but part of me doesn't like the fact that this type of calculation will get worse in time, and that there is some kind of elegance to option 2.
In database terms, there is no elegance in option 2 - it would be considered a hack that you might use as a last resort if your database gets to be truly humongous - not likely to happen for a novice setting it up for the first time (but possible).
There would be a lot of work maintaining the totals; and you'd be forever dealing with the question: "Why doesn't the detail add up to the total?"
Go for option 1 until you've proven you can't. Which will in most cases be a long time.
What you're describing in option #2 is a case of premature optimization. Using the SUM() of all purchases is going to work for a really long time (years). When (if) you start to see this function degrade, you can add indexes or a totals table to your database to speed things up. Don't complicate things when a simple solution exists.
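A rough sketch of option 1 with a supporting index (hypothetical table and column names):

-- the per-customer total, computed on demand
SELECT SUM(total_amount)
FROM purchase_orders
WHERE customer_id = 42;

-- lets the database find one customer's orders without scanning the whole table
CREATE INDEX ix_purchase_orders_customer ON purchase_orders (customer_id);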
Of course, the real solution is to try both solutions with 20 years of made-up data and see if there's any real difference. I suspect there isn't.
Almost always 1.
How often will you be querying for the total over 20 years' worth of history? If the answer is often, and the performance is poor, then you can think about optimisation or OLAP.
I suspect you're worrying about optimisation too early. This is the sort of thing databases are designed for - let them worry about caching.
I'll just add that another possibility is to create roll-up tables. For instance, when tracking page hits, it's not necessarily very useful to know that IP such-and-such accessed page1.php at 14:42:04 on 11/19/2008, but you might want to track daily stats for page1.php. In this case, at the end of each day you might run a procedure to sum up the hits for each page and create a record in a summary table, which can in turn be heavily indexed. Your reporting can then run against that table. In addition to speeding up reporting, it can also speed up writing the original records, as you don't need to worry about table locking or building indexes.
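A rough sketch of such an end-of-day roll-up (PostgreSQL-flavored SQL, hypothetical table and column names):

INSERT INTO page_hits_daily (page, hit_date, hit_count)
SELECT page, hit_time::date, count(*)
FROM page_hits
WHERE hit_time >= CURRENT_DATE - INTERVAL '1 day'
  AND hit_time <  CURRENT_DATE
GROUP BY page, hit_time::date;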
That said, good indexes can go a long way with reporting; and like others here have admonished, it's best to go with the easier, even if less optimal, solution until (if ever) it becomes a problem.
Kudos for thinking ahead, but the question that begs asking is: will your sales data stay in the transactional database for 20 years?
There comes a point where it's going to be a lot easier to move that data into a data warehouse, and just keep your frontline database current.
If this is a new project, be more concerned with making it work, and getting people to use it. Worry about scalability when you've crossed those bridges.
Use option 1. Later, if performance becomes poor, you can identify specific bottlenecks and resolve them with options like #2, or materialized views, or several other possibilities.