Best approach to cache Counts from SQL tables? - sql-server-2005

I would like to develop a Forum from scratch, with special needs and customization.
I would like to prepare my forum for intensive usage, and I am wondering how to cache things like the user post count and user reply count.
Having only three tables (tblForum, tblForumTopics, tblForumReplies), what is the best approach to caching the user topic and reply counts?
Consider a simple scenario: a user clicks a link, opens the Replies.aspx?id=x&page=y page, and starts reading replies. On the HTTP request, the server runs a SQL command which fetches all replies for that page, also inner joining with tblForumReplies to find out the number of replies for each user that replied:
select
    tblForumReplies.*,
    tblFR.TotalReplies
from
    tblForumReplies
    inner join
    (
        select IdRepliedBy, count(*) as TotalReplies
        from tblForumReplies
        group by IdRepliedBy
    ) as tblFR
        on tblFR.IdRepliedBy = tblForumReplies.IdRepliedBy
Unfortunately this approach is very CPU intensive, and I would like to hear your ideas on how to cache things like table counts.
If I count replies for each user on insert/delete and store the result in a separate field, how do I keep it synchronized with manual data changes? Suppose I manually delete replies in SQL.

These are the three approaches I'd be thinking of:
1) Maybe SQL Server performance will be good enough that you don't need to cache. You might be underestimating how well SQL Server can do its job. If you do your joins right, it's just one query to get the counts of all the users that are in that thread (see the sketch after this list). If you are thinking of this as one query per user, that's wrong.
2) Don't cache. Redundantly store the user counts on the user table. Update the user row whenever a post is inserted or deleted.
3) If you have thousands of users, even many thousand, but not millions, you might find that it's practical to cache user and their counts in the web layer's memory - for ASP.NET, the "Application" cache.
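To illustrate option 1, the per-user counts can be folded into the page query itself. A rough sketch, assuming a hypothetical IdTopic column and an @IdTopic parameter (neither is from the original schema):

-- Sketch only: IdTopic/@IdTopic are assumed; the point is that one round trip
-- returns the page of replies plus each replier's total reply count.
select
    r.*,
    fr.TotalReplies
from
    tblForumReplies r
    inner join
    (
        select IdRepliedBy, count(*) as TotalReplies
        from tblForumReplies
        group by IdRepliedBy
    ) as fr
        on fr.IdRepliedBy = r.IdRepliedBy
where
    r.IdTopic = @IdTopic

With an index on IdRepliedBy this should be a single, fairly cheap query per page request, not one query per user.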

I would not bother with caching until I know for sure that I need it. In my experience there is no way to predict the places that will require caching. Take an iterative approach: implement without a cache, then gather statistics, and then implement the right kind of caching (there are many kinds, like content, data, aggregates, distributed and so on).
BTW, I do not think that your query is CPU consuming. SQL Server will optimize that stuff and COUNT(*) will run in ticks...

tbl prefixes suck -- as much as Replies.aspx?id=x&page=y URIs do. Consider ASP.NET MVC, or at least the routing part of it.
Second, do not optimize prematurely. However, if you really need to, denormalize your data: add a TotalReplies column to your ForumTopics table and either rely on your DAL/BL to keep this field up to date (possibly with a scheduled task to resync it), or use triggers.
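A minimal sketch of the trigger route, assuming the replies table carries an IdTopic column and ForumTopics gains a TotalReplies column (both are assumptions about your schema):

-- Keeps tblForumTopics.TotalReplies in step with inserts and deletes on the
-- replies table, including deletes run manually in SQL.
CREATE TRIGGER trgForumReplies_Count
ON tblForumReplies
AFTER INSERT, DELETE
AS
BEGIN
    UPDATE t
    SET    t.TotalReplies = t.TotalReplies + i.cnt
    FROM   tblForumTopics t
           JOIN (SELECT IdTopic, COUNT(*) AS cnt FROM inserted GROUP BY IdTopic) i
             ON i.IdTopic = t.IdTopic;

    UPDATE t
    SET    t.TotalReplies = t.TotalReplies - d.cnt
    FROM   tblForumTopics t
           JOIN (SELECT IdTopic, COUNT(*) AS cnt FROM deleted GROUP BY IdTopic) d
             ON d.IdTopic = t.IdTopic;
END;

This also covers the "manual delete" concern in the question, since an ordinary DELETE still fires the trigger (a TRUNCATE would not).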

For each reply you need to keep TotalReplies and TotalDirectReplies. That way, you can support a tree-like structure of replies and keep the counts updated throughout the entire hierarchy without needing to count each time.
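A hedged sketch of how those counters might be maintained on insert, assuming tblForumReplies has IdReply, IdParentReply, TotalReplies and TotalDirectReplies columns (all assumptions), with @ParentId being the reply that was just replied to:

-- Bump the parent's direct-reply count, then walk up the tree and bump
-- TotalReplies for every ancestor (SQL Server 2005 recursive CTE).
UPDATE tblForumReplies
SET    TotalDirectReplies = TotalDirectReplies + 1
WHERE  IdReply = @ParentId;

WITH Ancestors AS
(
    SELECT IdReply, IdParentReply
    FROM   tblForumReplies
    WHERE  IdReply = @ParentId
    UNION ALL
    SELECT r.IdReply, r.IdParentReply
    FROM   tblForumReplies r
           JOIN Ancestors a ON r.IdReply = a.IdParentReply
)
UPDATE tblForumReplies
SET    TotalReplies = TotalReplies + 1
WHERE  IdReply IN (SELECT IdReply FROM Ancestors);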

Related

How to tell there is new data available in ga_sessions_intraday_ efficiently

Google Analytics data should be exported to Big Query 3 times a day, according to the docs. I am trying to determine an efficient way to detect when new data is available in the ga_sessions_intraday_ table and run a query in BQ to extract the new data.
My best idea is to poll ga_sessions_intraday_ by running a SQL query every hour. I would track the max visitStartTime (storing the state somewhere) and if a new max visitStartTime shows up in the ga_sessions_intraday_ then I would run my full queries.
The problem with this approach is that I need to store state about the max visitStartTime. I would prefer something simpler.
Does GA Big Query have a better way of telling that new data is available in ga_sessions_intraday_? Some kind of event that fires? Do I use the last modified date of the table (but I need to keep track of the time window to run against)?
Thanks in advance for your help,
Kevin
Last modified time on the table is probably the best approach here (and cheaper than issuing a probe query). I don't believe there is any other signalling mechanism for delivery of the data.
If your full queries run more quickly than your polling interval, you could probably just use the modified time of your derived tables to hold the data (and update when your output tables are older than your input tables).
Metadata queries are free, so you can even embed most of the logic in a query:
SELECT
  (SELECT MAX(last_modified_time)
   FROM `YOUR_INPUT_DATASET.__TABLES__`)
  >
  (SELECT MAX(last_modified_time)
   FROM `YOUR_OUTPUT_DATASET.__TABLES__`) need_update
If you have a mix of tables in your output dataset, you can be more selective (with a WHERE clause) to filter down the tables you examine.
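For example (the table_id pattern here is only a placeholder for whatever naming convention your output tables actually use):

SELECT
  MAX(last_modified_time) AS newest_output
FROM
  `YOUR_OUTPUT_DATASET.__TABLES__`
WHERE
  table_id LIKE 'daily_export_%'  -- placeholder pattern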
If you need a convenient place to run this scheduling logic (that isn't a developer's workstation), you might consider one of my previous answers. (Short version: Apps Script is pretty neat)
You might also consider filing a feature request for "materialized views" or "scheduled queries" on BigQuery's public issue tracker. I didn't see an existing entry for this with a quick skim, but I've certainly heard similar requests in the past.
I'm not sure how the Google Analytics team handles feature requests, but having a pubsub notification upon delivery of a new batch of Analytics data seems like it could be useful as well.

What is the best way to increase view counter column of a row in each select?

Often with portals like news sites, I wonder whether it is good practice or not to update the view counter field of a table while selecting the row. Let's say I have a News table with id, title, details, publishDate and viewCounter columns. Is it good to perform the following queries on each request of the news detail page? How much would mixing a select and an update together for each request hurt performance?
select * from News where id=120;
update News
set viewCounter=viewCounter+1
where id=120;
Could there be any difference in performance if I put the view tracker data in another table, say a table ViewsCount with columns id, newsID, viewCount? In this case, I would execute the following code:
select * from News where id=120;
update ViewsCount
set viewCount=viewCount+1
where newsID=120;
I see one more option where I would track browser request data for each request, and later aggregate the rows for each news id. With this design, I would run two queries for each request, a select and an insert, like the following:
select * from News where id=120;
insert into NewsView(newsID,browser,ipAddress,operatingSystem,col1,col2)
values(120,'Netscape','202.xx.xx.xx','Windows',col1Value,col2Value)
But with this approach I have seen that in a short span of time I get lots of rows, and the database size also increases significantly on heavy-traffic portals. This would definitely slow down the aggregate queries.
What are the alternatives I could use? Or is it OK to go with page view trackers like Google Analytics? I would welcome suggestions based on best practices you have been following in a similar context.
Updating a row for every view will take an exclusive lock on that row, effectively serializing read access to that resource, as each queued read transaction will need to wait for the previous one to commit, which in turn requires confirmation that the transaction log has been persisted to disk.
This will quickly become a bottleneck for even moderately popular articles.
I would consider tracking page view deltas in memory in the application and just writing them to the database at periodic intervals. If the application crashes you will lose the views for that period but this may well be acceptable.
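The flush itself can stay very small; a sketch, assuming the application hands over one accumulated delta per article (@id and @delta are illustrative parameters):

-- One short write per article per interval, instead of one UPDATE per view.
UPDATE News
SET    viewCounter = viewCounter + @delta
WHERE  id = @id;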
Alternatively your web server may well have log files that are appended to on each view and can be periodically parsed to extract information about new page views.

Is it possible to Cache the result set of a select query in the database?

I am trying to optimize the search query which is the most used in our system. So far I have added some missing indexes and that has helped slightly. But I want to further reduce the load on the db server. One option that I will use is caching the result set as a LIST in the asp.net Cache so that I don't have to hit the db often.
However, I was wondering if there is a way to cache some portions of the select query at the db as well. E.g. for the search results we consider only users who have been active in the last 180 days and who have share-info set to true. So this is like a super set which the db processes every time and then applies the other conditions that are passed in, such as the specified category, city etc. Is it possible to somehow cache the super set so that I can run queries against the super set rather than against the whole table? Will creating a view help with this? I am a bit hesitant to create a view, as I have read that managing views can be an overhead and takes away some flexibility to modify the tables.
I am using Sql-Server 2005 so cannot create a filtered index on the table, which I think would have been helpful.
I agree with @Neville K. SQL Server is pretty smart at caching data in memory. You might see limited / no performance gains for your effort.
You could consider indexed views (Enterprise Edition only) http://technet.microsoft.com/en-us/library/cc917715.aspx for your sub-query.
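A hedged sketch of what that indexed view might look like; dbo.Users and its ShareInfo/LastActiveDate columns are assumptions about your schema. Note that the 180-day "recently active" predicate cannot go into the view itself, because GETDATE() is non-deterministic; that filter stays in the query:

CREATE VIEW dbo.vwSharingUsers
WITH SCHEMABINDING
AS
SELECT UserId, City, Category, LastActiveDate
FROM   dbo.Users
WHERE  ShareInfo = 1;
GO

-- The first index on an indexed view must be unique and clustered.
CREATE UNIQUE CLUSTERED INDEX IX_vwSharingUsers
ON dbo.vwSharingUsers (UserId);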
It is, of course, possible to do this - but I'm not sure if it will help.
You can create a scheduled job - once a night, perhaps - which populates a table called "active_users_with_share_info" by truncating it, and then repopulating it based on a select query filtering out users active in the last 180 days with "share_info = true".
Then you can join your search query to this table.
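Purely as a sketch of that nightly job (the table and column names here, including active_users_with_share_info, LastActiveDate and ShareInfo, are assumptions):

TRUNCATE TABLE active_users_with_share_info;

INSERT INTO active_users_with_share_info (UserId, City, Category)
SELECT UserId, City, Category
FROM   Users
WHERE  LastActiveDate >= DATEADD(DAY, -180, GETDATE())
  AND  ShareInfo = 1;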
However, I doubt this would do much good - SQL Server is pretty smart at caching. Unless you're dealing with huge volumes of data (hundreds of millions of records), or very limited hardware, I doubt you'd get any measurable performance improvements - but by all means try it!
Of course, the price for this would be more moving parts in your application, more interesting failure modes (what happens if the overnight batch fails silently?), and more training for any new developers you bring into the team.

Complex SQL design, help/advice needed

I have a few questions for the SQL gurus in here...
Briefly, this is an ads management system where a user can define campaigns for different countries, categories and languages. I have a few questions in mind, so help me with what you can.
Generally I'm using ASP.NET and I want to cache the whole result set of a certain user once he asks for statistics for the first time; this way I will avoid large round-trips to the server.
Any help is welcome.
Click here for a diagram with all the details you need for my questions.
1. The main issue of this application is to show the user how many clicks/impressions there were and how much money he spent on a campaign. What is the easiest way to get this information for him? I will also include filtering by date, date ranges and a few other params in this statistics table.
2. Another issue is what happens when the user tries to edit a campaign. The old campaign dies; this means that if the user set $0.01 as the campaignPPU (pay-per-unit) and the next day updates it to $0.05, everything will be reset to $0.05.
3. If you could re-design some parts of the table design so it would be more flexible and easier to modify, how would you do it?
Thanks... sorry for such a large job, but it may interest some SQL folks in here.
For #1, you might want to use a series of views to show interesting statistics. If you want to cache results, you could store the results to a reporting table that only gets refreshed every n hours (maybe up to 3 or 4 times a day? I don't know what would be suitable for this scenario).
Once all the data is in a report table, you can better index it for filtering, and since it will be purged and re-populated on a schedule, it should be faster to access.
Note that this will only work if populating the stats table does not take too long (you will have to be the judge of how long is "too long").
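As a rough sketch of such a report table refresh (the Campaigns/CampaignEvents tables and their columns are assumptions, since I can only guess at the schema in the diagram):

TRUNCATE TABLE CampaignStats;

INSERT INTO CampaignStats (CampaignId, Clicks, Impressions, Spend)
SELECT c.CampaignId,
       SUM(CASE WHEN e.EventType = 'click' THEN 1 ELSE 0 END),
       SUM(CASE WHEN e.EventType = 'impression' THEN 1 ELSE 0 END),
       SUM(e.Cost)
FROM   Campaigns c
       JOIN CampaignEvents e ON e.CampaignId = c.CampaignId
GROUP BY c.CampaignId;

You can then add date columns and whatever filtering dimensions you need to the same table and index them for the statistics page.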
For #2, it sounds like an underlying design issue. What do you mean by "edit"? Does this edit operation destroy the old campaign and create a new "clone" campaign (that is obviously not a perfect clone or there wouldn't be a problem)? Is there historical data that is important, but getting orphaned or deleted by the edit? You might want to analyze this "edit" process and see if you need to add history-tracking to some of these tables. Maybe a simple datetime to old records, or a separate "history" table(s) that mirrors the structure of the tables being modified by the "edit" operation.
For #3, It looks alright, but I'm only seeing a sliver of the system. I don't know how the rest of the app is designed...
Eugene,
If you plan to keep the edited campaigns around, consider not removing them. Instead, make the campaigns date-sensitive. For example, UserA started campaign1 on 1/1/2010 and ended it on 2/1/2010, then started campaign2 on 2/2/2010.
Or, if you don't like the notion of end-dating your campaigns, you could consider a history table for your campaigns: basically the same table structure, but with an added UniqueIdentifier to make rows unique.
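A hedged sketch of that history-table idea; the column names (CampaignId, CampaignPPU) and the @CampaignId/@NewPPU parameters are illustrative only:

CREATE TABLE CampaignsHistory (
    HistoryId   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    CampaignId  INT              NOT NULL,
    CampaignPPU DECIMAL(10, 2)   NOT NULL,
    ValidFrom   DATETIME         NOT NULL,
    ValidTo     DATETIME         NULL   -- NULL = current version
);

-- On edit: close out the current version, then insert the new values.
UPDATE CampaignsHistory
SET    ValidTo = GETDATE()
WHERE  CampaignId = @CampaignId AND ValidTo IS NULL;

INSERT INTO CampaignsHistory (CampaignId, CampaignPPU, ValidFrom)
VALUES (@CampaignId, @NewPPU, GETDATE());

This way, clicks and impressions recorded under the old price stay attached to the version of the campaign that was live at the time.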
I should also note that the estimated size of this campaign table and its related tables is important to your design. If you expect to have only thousands of rows, keeping old and current records in one table isn't a problem. However, if you plan to have millions or more, then you may want to separate the old from the new for faster queries, or properly plan statistics and indexes on the fields you know will need to be filtered on. Also remember that indexes are useful for reads, but they slow down writes.

Deleting rows from a contended table

I have a DB table in which each row has a randomly generated primary key, a message and a user. Each user has about 10-100 messages but there are 10k-50k users.
I write the messages daily for each user in one go. I want to throw away the old messages for each user before writing the new ones to keep the table as small as possible.
Right now I effectively do this:
delete from table where user='mk'
Then write all the messages for that user. I'm seeing a lot of contention because I have lots of threads doing this at the same time.
I do have an additional requirement to retain the most recent set of messages for each user.
I don't have access to the DB directly. I'm trying to guess at the problem based on some second hand feedback. The reason I'm focusing on this scenario is that the delete query is showing a lot of wait time (again - to the best of my knowledge) plus it's a newly added bit of functionality.
Can anyone offer any advice?
Would it be better to:
select key from table where user='mk'
Then delete individual rows from there? I'm thinking that might lead to less brutal locking.
If you do this every day for every user, why not just delete every record from the table in a single statement? Or even
truncate table whatever reuse storage
/
edit
The reason I suggest this approach is that the process looks like a daily batch upload of user messages preceded by a clearing-out of the old messages. That is, the business rule seems to me to be "the table will hold only one day's worth of messages for any given user". If this process is done for every user, then a single operation would be the most efficient.
However, if users do not get a fresh set of messages each day and there is a subsidiary rule which requires us to retain the most recent set of messages for each user then zapping the entire table would be wrong.
No, it is always better to perform a single SQL statement on a set of rows than a series of "row-by-row" (or what Tom Kyte calls "slow-by-slow") operations. When you say you are "seeing a lot of contention", what are you seeing exactly? An obvious question: is column USER indexed?
(Of course, the column name can't really be USER in an Oracle database, since it is a reserved word!)
EDIT: You have said that column USER is not indexed. This means that each delete will involve a full table scan of up to 50K*100 = 5 million rows (or at best 10K * 10 = 100,000 rows) to delete a mere 10-100 rows. Adding an index on USER may solve your problems.
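Something along these lines, with placeholder names since the real table and column obviously cannot be called TABLE and USER:

create index messages_user_ix on messages (msg_user);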
Are you sure you're seeing lock contention? It seems more likely that you're seeing disk contention due to too many concurrent (but unrelated) updates. The solution to that is simply to reduce the number of threads you're using: less disk contention will mean higher total throughput.
I think you need to define your requirements a bit clearer...
For instance: if you know all of the users you want to write messages for, insert the IDs into a temp table, index it on ID, and batch delete. The threads you are firing off are then doing two things: writing the ID of the user to one temp table, and writing the message to another temp table. Then, when the threads have finished executing, the main thread should:
DELETE FROM Messages WHERE ID IN (SELECT TEMP_ID FROM TEMP_MEMBERS);
INSERT INTO Messages SELECT * FROM TEMP_MESSAGES;
I'm not familiar with Oracle syntax, but that is the way I would approach it IF the users' messages are all written in rapid succession.
Hope this helps
TALK TO YOUR DBA
He is there to help you. When we DBAs take access away from the developers for something such as this, it is assumed we will provide the support for you for that task. If your code is taking too long to complete and that time appears to be tied up in the database, your DBA will be able to look at exactly what is going on and offer suggestions or possibly even solve the problem without you changing anything.
Just glancing over your problem statement, it doesn't appear you'd be looking at contention issues, but I don't know anything about your underlying structure.
Really, talk to your DBA. He will probably enjoy looking at something fun instead of planning the latest CPU deployment.
This might speed things up:
Create a lookup table:
create table rowid_table (row_id ROWID ,user VARCHAR2(100));
create index rowid_table_ix1 on rowid_table (user);
Run a nightly job:
truncate table rowid_table;
insert /*+ append */ into rowid_table
select ROWID row_id, user
from table;

exec dbms_stats.gather_table_stats('SCHEMAOWNER','ROWID_TABLE');
Then when deleting the records:
delete from table
where ROWID IN (select row_id
from rowid_table
where user = 'mk');
Your own suggestion seems very sensible. Locking in small batches has two advantages:
the transactions will be smaller
locking will be limited to only a few rows at a time
Locking in batches should be a big improvement.
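A minimal PL/SQL sketch of that batched approach (the table and column names are placeholders, since USER itself is reserved, and the batch size of 500 is arbitrary):

BEGIN
  LOOP
    DELETE FROM messages
    WHERE  msg_user = 'mk'
    AND    ROWNUM <= 500;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
END;
/

Each iteration holds locks on at most 500 rows and commits quickly, so other sessions should spend far less time waiting.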