Best way to store "views" of a topic - SQL

I use this code to update the view count of a topic:
UPDATE topics
SET views = views + 1
WHERE id = $id
The problem is that users like to spam F5 to get ridiculous numbers of views.
What should I do to count unique hits? Make a new table where I store the IP?
I don't want to store it in cookies; it's too easy to clear them.

I would create a separate table for storing this information. You can then capture a larger amount of data without having to update the table that is likely to be read the most.
You would always use INSERT INTO tblTopicViews...
And you would want to capture as much information as you can, IP address, date and time of the hit, perhaps some information on browser version, operating system etc - whatever you can get your hands on. That way, you can fine-tune how you filter out refresh requests over time.
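A minimal MySQL-flavored sketch of what such a table might look like - the column names, sizes, and :placeholders are illustrative, not from the original answer:
CREATE TABLE tblTopicViews (
    topicID    INT          NOT NULL,
    ipAddress  VARCHAR(45)  NOT NULL,  -- 45 characters is enough for IPv6
    userAgent  VARCHAR(500) NULL,
    viewedAt   DATETIME     NOT NULL
);

-- one row per hit; deduplication happens later, at query time
INSERT INTO tblTopicViews (topicID, ipAddress, userAgent, viewedAt)
VALUES (:topicID, :ipAddress, :userAgent, NOW());

-- rough "unique" view count for a topic: at most one hit per IP per day
SELECT COUNT(DISTINCT ipAddress, DATE(viewedAt)) AS uniqueViews
FROM tblTopicViews
WHERE topicID = :topicID;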
It's worth bearing in mind that many users can share an IP - for example, an entire office might go via the same router.

I would create a table which stores unique views:
CREATE TABLE unique_views(
page_id number,
user_agent varchar2(500),
ip_address varchar2(16),
access_time date,
PRIMARY KEY (page_id, user_agent, ip_address, access_time)
)
Now if someone accesses the page and you want to allow one view per user per day, you could do
INSERT INTO unique_views VALUES (:page_id, :user_agent, :ip_address, trunc(SYSDATE))
which won't allow duplicate views for the same user during one day. You could then count the views for each page with a simple GROUP BY (example for today's views):
SELECT page_id, count(*) page_views
FROM unique_views
WHERE access_time = trunc(SYSDATE)
GROUP BY page_id

Well, you could write the individual page hits to a log table, including identifying information like cookies or the IP address. You can analyze that table at your leisure.
But the web server probably has a facility for this already. I know both IIS and Apache can create detailed usage logs, and for both there is a variety of graphing and analysis tools that take things like IP addresses into account.
So instead of rolling your own logging, you could use the web server's.

You could use session_id() to discriminate between different users; obviously you need a separate table to track each visit.
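A minimal MySQL-flavored sketch of that idea - table and column names are illustrative, and session_id would come from PHP's session_id():
CREATE TABLE topic_visits (
    topic_id   INT         NOT NULL,
    session_id VARCHAR(64) NOT NULL,
    visited_at DATETIME    NOT NULL,
    PRIMARY KEY (topic_id, session_id)
);

-- the primary key rejects a second row for the same session,
-- so repeated F5 presses within one session are not counted again
INSERT IGNORE INTO topic_visits (topic_id, session_id, visited_at)
VALUES (:topic_id, :session_id, NOW());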
UPDATE: I just noticed you don't want to depend on cookies, so this may not be suitable for you.

Note that due to various problems (e.g. the unknown behavior of cache servers) this kind of thing is always going to be inaccurate and a balance between various factors. However, for a rough, vaguely-secure counter, using a separate table as Karl Bartel and others suggest is a decent solution.
However, depending on how seriously you take this problem, you may want to leave out "user_agent" - it's far too easy to fake, so if I really wanted to inflate my hit counter I could rack up the hits with a script that called my page with user-agent="bot1", then again from the same IP with "bot2", and so on.
But then two users behind one IP will be counted as only one hit, so you lose accuracy - see what I mean about a balance between various factors?

Related

How to store unique visits in Redis

I want to know how many people visited each blog page. For that, I have a column in the Blogs table (MS SQL DB) to keep the total visit count. But I also want the visits to be as unique as possible.
So I keep the user's unique id and the blog id in the Redis cache, and every time a user visits a page I check whether she has visited this page before; if not, I increase the total visit count.
My question is, what is the best way of storing such data?
Currently, I create a key like this "project-visit-{blogId}-{userId}" and use StringSetAsync and StringGetAsync. But I don't know if this method is efficient or not.
Any ideas?
If you can sacrifice some precision, the HyperLogLog (HLL) probabilistic data structure is a great solution for counting unique visits because:
It only uses 12 KB of memory, and that is fixed - it doesn't grow with the number of unique visits
You don't need to store user data, which makes your service more privacy-oriented
The HyperLogLog algorithm is really smart, but you don't need to understand its inner workings in order to use it; Redis added it as a data structure some years ago. So all you, as a user, need to know is that with HyperLogLogs you can count unique elements (visits) in a fixed memory space of 12 KB, with a 0.81% margin of error.
Let's say you want to keep a count of unique visits per day; you would have to have one HyperLogLog per day, named something like cnt:page-name:20200917 and every time a user visits a page you would add them to the HLL:
> PFADD cnt:page-name:20200917 {userID}
If you add the same user multiple times, they will still be counted only once.
To get the count you run:
> PFCOUNT cnt:page-name:20200917
You can change the granularity of unique users by having different HLLs for different time intervals, for example cnt:page-name:202009 for the month of September, 2020.
This quick explainer lays it out pretty well: https://www.youtube.com/watch?v=UAL2dxl1fsE
This blog post might help too: https://redislabs.com/redis-best-practices/counting/hyperloglog/
And if you're curious about the internal implementation Antirez's release post is a great read: http://antirez.com/news/75
Note that with this solution you lose the information of which user visited the page; you only have the count.
Your solution is not atomic unless you wrap the get and set operations in a transaction or a Lua script.
A better solution is to save project-visit-{blogId}-{userId} as a member of a Redis set. When you get a visit, call SADD to add the item to the set; Redis only adds it if the user has not visited this page before. If you want the total count, just call SCARD to get the size of the set.
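For example, in the same command style as above (the set name project-visits is illustrative): SADD returns 1 if the member was newly added, i.e. a first visit, and 0 if it was already present, while SCARD returns the number of unique blog/user pairs stored so far.
> SADD project-visits project-visit-{blogId}-{userId}
> SCARD project-visits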
Regardless of the back-end technology (programming language etc.), you can use a Redis stream. Streams are a feature introduced in Redis 5 that lets you define a publisher and subscribers for a topic (stream) created in Redis. Then, on each user visit, you commit a new record (asynchronously, of course) to this stream. You can hold whatever info you want in that record (user IP, id, etc.).
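A minimal sketch in the same command style - the stream name and field names are illustrative: each visit is appended with XADD, and a consumer can later read the entries (e.g. with XRANGE or a consumer group) and aggregate them however it likes.
> XADD page-visits * blogId 17 userId 42 ip 203.0.113.7
> XRANGE page-visits - +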
Defining a key for each unique visit is not a good idea at all, because:
It makes life harder for the Redis GC
Performance for this use case is not comparable to streams, especially if you use that Redis instance for other purposes
Constantly collecting these unique visits and processing them is not efficient; you would have to scan through all the keys every time
Conclusion:
If you want to use Redis, go with Redis Stream. If Redis can be changed, go with Kafka for sure (or a similar technology).

Using Bigquery for web analytics, how to filter out malicious bad data like bots etc

I am using BigQuery to analyze web traffic, and I have some problems figuring out how to separate real users from bots and malicious requests.
I can filter based on IP, but the query will quickly get long if I have to include every bad IP, so that doesn't sound like a good solution.
I can keep the data from coming into BigQuery at all, but the problem is that I only notice the data is bad/malicious/spam after some time; I can't prevent it from getting in in the first place. I could write a query to find bots and feed the result back to the ingest layer to block them from getting into BigQuery, but that sounds like something others must already have experience with.
I can also ingest the data into BigQuery, run my query to find malicious users, and then create a new table with the cleaned-up data. That could also be a solution, but I am missing the experience of how others do it.
Is this just noise in your dataset that you must accept if it is a small percentage, or what measures should I take?
Filtering by IP is a good idea. The only thing here is to keep the bad IP addresses in a table so that your query does not grow as more IPs are added.
SELECT * FROM my_visit_history
WHERE ip_addr NOT IN (SELECT ip FROM blacklisted_ips);
-- Or with a view to further simplify your future query:
CREATE VIEW my_clean_visit_history AS
SELECT * FROM my_visit_history
WHERE ip_addr NOT IN (SELECT ip FROM blacklisted_ips);

Track database changes or differentiate records with timestamp?

Keeping track of changes to a database must be a big concern for lots of people, but it seems that the big names have software for that.
My question is for a small SQL database with 10 tables, <10 columns each, using joins to create a "master" junction table: is there a downside to updating a few times per year by adding rows (with a lot of duplicate information) and then taking the MAX id (PK) to generate and post on a website the most recent data in tabular form (excerpted from the "master")? This is versus updating the records in place, in which case I would lose the information about the values at a particular moment.
A typical row for teacher contact information would have fName, lName, schoolName, [address & phone info]; for repertoire or audition information: year, instrument, piece, composer, publisher/edition.
Others have asked about tracking db changes, but only one recently, and not with a lot of votes/details:
How to track data changes in a database table
Keeping history of data revisions - best practice?
This lightweight solution seems promising, but I don't know if it didn't get votes because it's not helpful, or because folks just weren't interested.
How to keep track of changes to data in a table?
more background if needed:
I'm a music teacher (i.e. amateur programmer) maintaining a Joomla website for our organization. I'm using a Joomla plugin called Sourcerer to create dynamic content (PHP/SQL to the Joomla database) to make it easier to communicate changes (dates, personnel, rules, repertoire, etc.) For years, this was done with static pages (and paper handbooks) that took days to update.
I also, however, want to be able to look back and see the database state at a particular time: who taught where, what audition piece was listed, etc., as we could with paper versions. NOTE: I'm not tracking HTML changes, only that information fed from the database.
Thanks for any help! (I've followed SO for years, but this is my first question.)
Here is the code I'm using now to generate the "master" junction table. I would modify this to INSERT INTO for my new rows and query from it via Sourcerer to post the information online.
CREATE TABLE 011people_to_schools_junction
AS (
    SELECT *
    FROM (
        SELECT a.peopleID, a.districtID, a.firstName, a.lastName, a.statusID, c.schoolName
        FROM 01People a
        INNER JOIN (
            SELECT districtID, MAX(peopleID) peopleID
            FROM 01People
            GROUP BY districtID
        ) b
            ON a.districtID = b.districtID
            AND a.peopleID = b.peopleID
        INNER JOIN (
            SELECT schoolID, MAX(peopleID) peopleID
            FROM 01people_to_schools_junction ab
            GROUP BY schoolID
        ) z
            ON z.peopleID = a.peopleID
        LEFT JOIN 01Schools c
            ON c.schoolID = z.schoolID
        WHERE z.schoolID IS NOT NULL
            OR z.peopleID IS NOT NULL
        ORDER BY c.schoolName
    ) t1
);
#Add a primary key as the first column
ALTER TABLE 011people_to_schools_junction
ADD COLUMN 011people_to_schoolsID INT NOT NULL AUTO_INCREMENT FIRST,
ADD PRIMARY KEY (011people_to_schoolsID);
To answer your questions in order:
Is there a downside?
Of course, and it's performance-related. If you add a million records each year, it will hurt performance and occupy disk space.
Were the suggestions in the linked questions bad, or just not popular?
The questions and answers are good, but the right answer depends on your specific use case: are you doing it for legal reasons, how quickly do you need to access the data, how much data and how many updates do you have, how long should your history functionality last without changes... you would only vote if an answer matched your use case.
As a rule of thumb, history should go into a separate table. This provides several advantages:
your current tables don't change, so your code needs no change except for storing the current version also in history;
your application doesn't slow down;
if your history tables grow you can move them easily to a different server;
Whether to have a single history table or several (one per backed-up table) depends on how you plan to retrieve the data and what you want to do with it:
if you mirror each of your tables, adding a timestamp and the user id, your code will need little modification, but you'll end up with twice as many tables, and any structure change will then need to be replicated in the history table as well;
if you build a single history table with the timestamp, the user id, the table name and a JSON representation of the record, it will be easier to build. For retrieval you would access the data as one object per row, e.g. using Joomla's dbo getObjectList(); the objects will then be in the same format you store in the history table, and applying changes will be fairly easy. But querying for changes across specific tables/fields will be much harder. (A sketch of this variant follows below.)
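A minimal MySQL-flavored sketch of that single-table variant - the table and column names are illustrative:
CREATE TABLE history (
    historyID  INT          NOT NULL AUTO_INCREMENT PRIMARY KEY,
    tableName  VARCHAR(64)  NOT NULL,  -- which table the snapshot came from
    recordID   INT          NOT NULL,  -- primary key of the original row
    changedBy  INT          NOT NULL,  -- user who made the change
    changedAt  DATETIME     NOT NULL,
    recordJson TEXT         NOT NULL   -- JSON snapshot of the whole row
);

-- store a snapshot whenever a row is inserted or updated
INSERT INTO history (tableName, recordID, changedBy, changedAt, recordJson)
VALUES ('01People', 123, 4, NOW(),
        '{"peopleID":123,"firstName":"Jane","lastName":"Smith","statusID":2}');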
Keep in mind that having data is useless if you can't retrieve it properly.
Since you mention pushing to the website a few times a year, the overhead of the queries should not be an issue (if you update monthly, waiting 5 minutes may not be a problem).
You should pick the best solution based on the other uses of this data: for it to be useful to anyone, you will have to implement a system to retrieve the historical data. If phpMyAdmin is enough, look no further.
I hope this scared you. Either way it's a lot of hard work.
If you just want to be able to look up old data, you may instead store a copy of the markup/output you generate from time to time, and save it to different folders on the webserver. This will take minutes to set up, and be extremely reliable.
Sure, it's more fun to code it. But are you really sure you need it? And you can keep the database dumps just in case one day you change your mind.

Creating a variable on database to hold global stats

Let's pretend I've got a social network.
I'm always showing to the user how many users are registered and have activated their profile.
So every time a single user logs in, it goes to the DB and runs:
select count(*) from users where status = 'activated'
So if 5,000 users log in, or simply refresh the page, it will run the SQL above 5,000 times.
I was wondering if it would be better to have a variable somewhere (I still have no idea where to put it) that is incremented every time a user activates their profile; then, when I want to show how many users are registered on the social network, I would just read the value of that variable.
How can I do this? Is it really a better solution than what I've got?
You could use an indexed view, that SQL Server will automatically maintain:
create table dbo.users (
ID int not null,
Activated bit not null
)
go
create view dbo.user_status_stats (Activated,user_count)
with schemabinding
as
select Activated,COUNT_BIG(*) from dbo.users group by Activated
go
create unique clustered index IX_user_status_stats on dbo.user_status_stats (Activated)
go
This just has two possible statuses, but could expand to more using a different data type. As I say, in this case, SQL Server will maintain the counts behind the scenes, so you can just query the view:
SELECT user_count from user_status_stats with (NOEXPAND) where Activated = 1
and it won't have to query the underlying table. You need the WITH (NOEXPAND) hint on editions other than Enterprise/Developer.
Although, as @Jim suggested, doing a COUNT(*) against an index should also be pretty quick when the index column(s) can satisfy the query criteria with equality comparisons.
As you've already guessed - it's not a great idea to calculate this value every time someone hits the site.
You could do as you suggest, and update a central value as users are added, although you'll have to ensure that you don't end up with two processes updating the number simultaneously.
Alternatively you could have a job which runs your SQL routinely and updates the central 'user count' value.
Alternatively (#2), you could use something like MemCache to hold the calculated value for a period of time, and then recalculate it when the cache expires.
There's a few options you could consider:
1) Like you say, maintain a global count each time a profile is activated to save the hit on the users table. You could store that count in a "Stats" table and then query the value from there.
2) Don't show the actual "live" count; show a count that's "pretty much up to date" - e.g. cache the count in your application and have the value expire periodically so you re-query it less frequently. Or, if you store the count in a "Stats" table as above, you could have a scheduled job that updates the count every hour instead of on every profile activation.
It depends on whether you want to show the exact figure in real time or whether you can live with a delay. Obviously data volumes matter too - if you have a large database, a slightly out-of-date cached value could be worthwhile.
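A minimal sketch of the "Stats" table variant - the table and column names are illustrative, and the UPDATE could run on each activation or from an hourly job:
CREATE TABLE Stats (
    StatName  VARCHAR(50) NOT NULL PRIMARY KEY,
    StatValue INT         NOT NULL
);

-- refresh the cached figure
UPDATE Stats
SET StatValue = (SELECT COUNT(*) FROM users WHERE status = 'activated')
WHERE StatName = 'activated_users';

-- what the page actually reads
SELECT StatValue FROM Stats WHERE StatName = 'activated_users';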
From a purely SQL Server standpoint, no, you are not going to find a better way of doing this. Unless, perhaps, your social network is Facebook-sized. Denormalizing your data design (such as keeping a count in a separate table) introduces possible ways for the data to get out of sync. It doesn't have to get out of sync if it is coded properly, but it can...
Just make sure that you have an index on Status, at which point SQL will not scan the table for the count; it will scan the index instead. The index is much smaller (that is, more data fits in a disk page). If you converted your status to an int, smallint, or tinyint, you would fit even more index entries in a disk page and thus need much less IO. To get your description ('activated', etc.), use a reference table. The reference table would be so small that SQL would just keep the whole thing in RAM after the first access.
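A sketch of that suggestion - the index and reference-table names are illustrative:
-- narrow index so the COUNT(*) is answered from the index, not the table
CREATE INDEX IX_users_status ON users (status);

SELECT COUNT(*) FROM users WHERE status = 'activated';

-- optional: swap the string status for a tinyint plus a small lookup table
CREATE TABLE statuses (
    statusID   TINYINT     NOT NULL PRIMARY KEY,
    statusName VARCHAR(20) NOT NULL
);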
Now, if you still think this is too much overhead (and it shouldn't be), you could come up with a hybrid method. You could store your count in a separate table (which SQL would keep in RAM if it is just the one record) or, assuming your site is in ASP.NET, you could create an Application variable to keep track of the count. You could increment it in Session_Start and decrement it in Session_End. But you will have to come up with a way of making the increment and decrement thread-safe so two sessions don't try to update the value at the same time.
You can also use a global temporary table; you will always get fast retrieval, even if you are polling every 30 seconds. The Example Trigger Link1 and Example Trigger Link2 will maintain such activity in this table.

Best approach to cache Counts from SQL tables?

I would like to develop a Forum from scratch, with special needs and customization.
I would like to prepare my forum for intensive usage, and I am wondering how to cache things like users' post counts and reply counts.
Having only three tables, tblForum, tblForumTopics, tblForumReplies, what is the best approach to caching the users' topic and reply counts?
Think of a simple scenario: a user follows a link and opens the Replies.aspx?id=x&page=y page, and starts reading replies. On the HTTP request, the server will run an SQL command which fetches all replies for that page, also inner joining with tblForumReplies to find the number of replies made by each user that replied:
select
    tblForumReplies.*,
    tblFR.TotalReplies
from
    tblForumReplies
    inner join (
        select IdRepliedBy, count(*) as TotalReplies
        from tblForumReplies
        group by IdRepliedBy
    ) as tblFR
        on tblFR.IdRepliedBy = tblForumReplies.IdRepliedBy
Unfortunately this approach is very CPU-intensive, and I would like to hear your ideas on how to cache things like table counts.
If I count replies for each user on insert/delete and store the count in a separate field, how do I keep it synchronized with manual data changes? Suppose I manually delete replies in SQL.
These are the three approaches I'd be thinking of:
1) Maybe SQL Server performance will be good enough that you don't need to cache. You might be underestimating how well SQL Server can do its job. If you do your joins right, it's just one query to get all the counts of all the users that are in that thread. If you are thinking of this as one query per user, that's wrong.
2) Don't cache. Redundantly store the user counts on the user table and update the user row whenever a post is inserted or deleted (see the sketch after this list).
3) If you have thousands of users, even many thousands, but not millions, you might find that it's practical to cache users and their counts in the web layer's memory - for ASP.NET, the "Application" cache.
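A minimal sketch of option 2 in SQL Server syntax - the tblUsers table, ReplyCount column, and @IdRepliedBy parameter are illustrative assumptions:
-- when a reply is inserted
UPDATE tblUsers
SET ReplyCount = ReplyCount + 1
WHERE UserID = @IdRepliedBy;

-- when a reply is deleted
UPDATE tblUsers
SET ReplyCount = ReplyCount - 1
WHERE UserID = @IdRepliedBy;

-- periodic resync, in case counts drift after manual edits
UPDATE u
SET u.ReplyCount = ISNULL(r.TotalReplies, 0)
FROM tblUsers u
LEFT JOIN (
    SELECT IdRepliedBy, COUNT(*) AS TotalReplies
    FROM tblForumReplies
    GROUP BY IdRepliedBy
) r ON r.IdRepliedBy = u.UserID;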
I would not bother with caching until I need it for sure. From my experience, there is no way to predict the places that will require caching. Try an iterative approach: implement without a cache, then gather statistics, and then implement the right kind of caching (there are many kinds, like content, data, aggregates, distributed, and so on).
By the way, I do not think that your query is that CPU-consuming. SQL Server will optimize that stuff and COUNT(*) will run in ticks...
tbl prefixes suck -- as much as Replies.aspx?id=x&page=y URIs do. Consider ASP.NET MVC, or at least its routing component.
Second, do not optimize prematurely. However, if you really need to, denormalize your data: add a TotalReplies column to your ForumTopics table and either rely on your DAL/BL to keep this field up to date (possibly with a scheduled task to resync it), or use triggers.
For each reply you would keep TotalReplies and TotalDirectReplies. That way you can support a tree-like structure of replies and keep counts updated throughout the entire hierarchy without needing to count each time.
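A minimal sketch of the trigger option for keeping ForumTopics.TotalReplies current, in SQL Server syntax - the IdTopic column name is an assumption, and a matching AFTER DELETE trigger would decrement the count:
CREATE TRIGGER trg_ForumReplies_AfterInsert
ON tblForumReplies
AFTER INSERT
AS
BEGIN
    UPDATE t
    SET t.TotalReplies = t.TotalReplies + i.NewReplies
    FROM ForumTopics t
    INNER JOIN (
        SELECT IdTopic, COUNT(*) AS NewReplies
        FROM inserted
        GROUP BY IdTopic
    ) i ON i.IdTopic = t.IdTopic;
END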