Optimizing SQL to determine unique page views per user - sql

I need to determine whether a user has already visited a page, in order to track unique page views.
I have already implemented some HTTP header caching, but now I need to optimize the SQL queries.
A visit is identified by one of the following keys, and counts as unique only if no matching row exists yet in the visit table:
the pair page_id + user_id
or the pair page_id + session_id
or page_id + [ip + useragent] (whether this should be ip alone or ip + useragent is a topic for another discussion)
So I have a table tracking user visits:
visit:
page_id
user_id
session_id
useragent
ip
created_at
updated_at
Now on each user visit (which does not hit the cache) I will update a row if it exists. If no rows are affected, I will insert a new visit into the table.
That is one or two queries per visit (assuming the cache works, mostly two), but the number of rows stays limited. Maybe it would be better to store all the visits and then clean up the database after, e.g., a month?
The questions are:
how should the visit table be constructed (keys, indexes, relations to the user and page_views tables)? Some of the important fields may be null (e.g. user_id), so what about indexes then? Do I need a multi-column primary key?
which would be the fastest SQL query to find the unique user?
is this a sane approach?
I use PostgreSQL and PDO (Doctrine ORM).
All my sessions are stored in the same DB.
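
A minimal sketch of the update-then-insert flow described above, written in Python against an in-memory SQLite database only so that it runs standalone (the question itself uses PostgreSQL). The function name record_visit, the column types, and the surrogate id column are assumptions for illustration, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visit (
        id         INTEGER PRIMARY KEY,   -- assumed surrogate key
        page_id    INTEGER NOT NULL,
        user_id    INTEGER,               -- NULL for anonymous visitors
        session_id TEXT,
        useragent  TEXT,
        ip         TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")


def record_visit(conn, page_id, user_id, session_id, useragent, ip):
    """Return True if the visit is unique (a new row was inserted)."""
    # Candidate keys, most specific first; keys with a missing value are skipped.
    candidates = []
    if user_id is not None:
        candidates.append(("user_id = ?", (user_id,)))
    if session_id is not None:
        candidates.append(("session_id = ?", (session_id,)))
    candidates.append(("ip = ? AND useragent = ?", (ip, useragent)))

    for condition, params in candidates:
        cur = conn.execute(
            "UPDATE visit SET updated_at = CURRENT_TIMESTAMP "
            "WHERE page_id = ? AND " + condition,
            (page_id, *params),
        )
        if cur.rowcount > 0:
            return False  # an existing row matched: repeat visit

    # No UPDATE touched a row, so this is the first visit for every key.
    conn.execute(
        "INSERT INTO visit (page_id, user_id, session_id, useragent, ip) "
        "VALUES (?, ?, ?, ?, ?)",
        (page_id, user_id, session_id, useragent, ip),
    )
    return True
```

In the worst case this issues three UPDATEs plus one INSERT, which is one reason the answers below suggest moving the work out of the request path.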

Personally I would not put this in the request-response path. I would log the raw data in a table (or push it on a queue) and let a background task/thread/cron job deal with it.
The queue (or the message-passing table) should then just contain pageid, userid, sessionid, useragent, ip.
Absolute timings are less important now, as long as the background task can keep up. Since a single thread will now do the heavy lifting, it will not create conflicting locks when updating the unique pageviews tables.
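
A rough sketch of that idea in Python, assuming an in-process queue.Queue as a stand-in for the queue or message-passing table and a plain dict as a stand-in for the unique pageviews table. The names visitor_key and process_visits are hypothetical, and picking a single key per visitor is a simplification of the three-way check in the question.

```python
import queue
import threading


def visitor_key(event):
    """Pick the most specific identifier available for a visit."""
    if event.get("user_id") is not None:
        return ("user", event["user_id"])
    if event.get("session_id") is not None:
        return ("session", event["session_id"])
    return ("ip_ua", event["ip"], event.get("useragent"))


def process_visits(events):
    """Enqueue raw visits, let one worker deduplicate them, return counts."""
    raw_visits = queue.Queue()  # stand-in for the queue / raw log table
    unique_views = {}           # page_id -> set of visitor keys

    def worker():
        # Single consumer: the only writer to unique_views, so no lock conflicts.
        while True:
            event = raw_visits.get()
            if event is None:   # sentinel: no more work
                break
            unique_views.setdefault(event["page_id"], set()).add(visitor_key(event))

    thread = threading.Thread(target=worker)
    thread.start()
    for event in events:        # the request path only enqueues
        raw_visits.put(event)
    raw_visits.put(None)
    thread.join()
    return {page_id: len(keys) for page_id, keys in unique_views.items()}
```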

Just some random thoughts:
Can I verify that the thinking behind the unique visit types is:
pageid + userid = user has logged in
pageid + sessionid = user not identified but has cookies enabled
pageid + ip / useragent = user not identified and no cookies enabled
For raw performance, you might consider #2 to be redundant, since #3 will probably cover #2 in most conditions (or is #2 important, e.g. if the user then registers and #2 can be mapped to a #1?). That would mean the session id might still be logged, but not used in any visit determination.
IMHO the IP will always be present (even if spoofed) and will be a good candidate for an index. The user agent can be hidden and will only have a limited range of values (not very selective).
I would use a surrogate primary key in this instance, due to the nullable fields and since none of the fields is unique by itself.
IMHO your idea of storing ALL the visits and then trimming the duplicates in a batch job is a good one to weigh up (rather than checking whether a row exists to decide between update and insert).
So PK = Surrogate
Clustering = Not sure - another query / requirement might drive this better.
NonClustered Index = IP Address, Page Id (assuming more distinct IP addresses than page id's)
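
A small sketch of the store-everything-then-trim variant with a surrogate key and an (ip, page_id) index, again using Python with in-memory SQLite only so that it is runnable here. The index name visit_ip_page, the function trim_duplicates, and the choice of (page_id, ip, useragent) as the duplicate key are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visit (
        id         INTEGER PRIMARY KEY,   -- surrogate key
        page_id    INTEGER NOT NULL,
        user_id    INTEGER,               -- nullable
        session_id TEXT,                  -- nullable
        useragent  TEXT,
        ip         TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- IP first, assuming more distinct IP addresses than page ids.
    CREATE INDEX visit_ip_page ON visit (ip, page_id);
""")


def trim_duplicates(conn):
    """Batch job: keep the earliest row per (page_id, ip, useragent)."""
    cur = conn.execute("""
        DELETE FROM visit
        WHERE id NOT IN (
            SELECT MIN(id) FROM visit GROUP BY page_id, ip, useragent
        )
    """)
    return cur.rowcount  # number of duplicate rows removed
```

The request path then only ever inserts, and the batch job decides later what counts as a duplicate.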

Related

In what instances would a GA4 session ID have 2 different user pseudo ids?

So I've been sorting out tracking and reporting on a new website using GA4 with a Big Query export.
As I've started building a report I've found 5 session IDs (out of a few hundred) that have 2 different user pseudo ids attached to them.
Any ideas why/when this would happen?
While I would expect one user pseudo id to have more than one session id, I was expecting each session id to have only one user pseudo id.
The only thing I thought it might be is cookies being deleted during a session, but I've tried this: the same user pseudo id persists if I change page (my page changes are just history changes), or I get a new pseudo id AND session id if I hard refresh.
ga_session_ids are unique only within a user_pseudo_id/user_id. That means that to identify a unique session in your property, you need a composite key of ga_session_id and user_pseudo_id. You can see the official/standard method of identifying and calculating sessions here. (Disclaimer: I wrote the linked article).
A ga_session_id is basically the timestamp at which the session started, attached to each event, so it's possible for visits from several users to start at the same time and share the same ga_session_id.
It's the user_pseudo_id that tells you the events come from different users, so combining the two will give you the correct number of sessions:
count(distinct concat(user_pseudo_id,(select value.int_value from unnest(event_params) where key = 'ga_session_id'))) as sessions
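
To illustrate why the composite key matters, here is a tiny Python sketch over made-up event rows (the sample values and the function name count_sessions are assumptions, not real export data). Counting distinct ga_session_id values alone undercounts when two users start a session in the same second.

```python
def count_sessions(events):
    """Count distinct (user_pseudo_id, ga_session_id) pairs."""
    return len({(e["user_pseudo_id"], e["ga_session_id"]) for e in events})


# Hypothetical events: users A and B both start a session at the same second.
events = [
    {"user_pseudo_id": "A", "ga_session_id": 1700000000},
    {"user_pseudo_id": "A", "ga_session_id": 1700000000},
    {"user_pseudo_id": "B", "ga_session_id": 1700000000},
    {"user_pseudo_id": "B", "ga_session_id": 1700003600},
]

sessions_by_pair = count_sessions(events)
sessions_by_id_only = len({e["ga_session_id"] for e in events})
```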

Table design in Azure Table Storage

I need to build a REST service for messaging using Azure, and I have a problem with the DB design. I have 3 tables: users, chats, messages of chats.
Users contains user data like login, password hash, salt.
Chats contains partitionkey - userlogin, rowkey - chatId, nowInChat - the user came from a chat.
Messages of chat contains a partitionkey, which consists of
userlogin_chatId_datetimeticks
(zevis_8a70ff8d-c363-4eb4-8a51-f853fa113fa8_634292263478068039),
rowkey - messageId, message, sender - userLogin.
I see a disadvantage in this design. Suppose two users communicated actively a year ago and no longer talk, and one of them wants to look at the history. I would then have to send a large number of requests to the server, each covering a time interval such as a week. Sending a single request for everything earlier than today would be inefficient, because it returns the whole history.
How should we change the design of the table?
Because Azure Storage Tables do not support secondary indexes, and storage is very inexpensive, your best option is to store the data twice, using different partition and/or row keys. From the Azure Storage Table Design Guide:
"To work around the lack of secondary indexes, you can store multiple copies of each entity with each copy using different PartitionKey and RowKey values"
https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#table-design-patterns
Thank you for your post. You have two options here. The easiest answer, with the least amount of design change, would be to include a StartTime and EndTime in the Chat table. Although these properties would not be indexed, I'm guessing there will not be many rows to scan once you filter on the UserID.
The second option requires a bit more work but is cleaner: create an additional table with Partition Key = UserID and Row Key = DateTimeTicks, where the entity properties contain the ChatID. This would enable you to quickly filter by user on a given date/date range. (This is the denormalization answer provided above).
Hopefully this helps your design progress.
I would create a separate table with these PK and RK values:
Partition Key = UserID, Row Key = DateTime.MaxValue.Ticks - DateTimeTicks
Optionally you can also append the ChatId to the end of the Row Key above.
This way the most recent communication made by the user will always be on top, so you can later simply query the table by passing in only the UserId and a take count (i.e. Take Count = 1 if you want the latest chat entry from the user). The query will also be very fast: because you use inverted ticks for your row keys, the Azure Table Storage service sorts the entries for the same user id in increasing lexicographical order of Row Key, always keeping the latest chat at the top of the partition since it has the minimum inverted tick value.
Even if you add the Chat Id at the end of the RowKey (i.e. InvertedTicks_ChatId), the sort order will not change and the latest conversation will be on top regardless of chat id.
Once you read the entity back, subtract the inverted ticks from DateTime.MaxValue.Ticks to find the actual date.
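
The inverted-ticks row key can be sketched in Python as follows. This is an illustration only: the helper names to_ticks, inverted_row_key and from_row_key are hypothetical, and the zero-padding to 19 digits is an assumption added so that lexicographic order matches numeric order. Ticks follow the .NET convention of 100-nanosecond intervals since 0001-01-01.

```python
from datetime import datetime, timedelta

MAX_TICKS = 3155378975999999999  # value of DateTime.MaxValue.Ticks in .NET
EPOCH = datetime(1, 1, 1)


def to_ticks(dt):
    """Convert a naive datetime to .NET-style ticks (100 ns units)."""
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


def inverted_row_key(dt, chat_id=None):
    """Row key that sorts newest-first; optionally suffixed with a chat id."""
    key = f"{MAX_TICKS - to_ticks(dt):019d}"  # fixed width keeps string order numeric
    return f"{key}_{chat_id}" if chat_id else key


def from_row_key(row_key):
    """Recover the original datetime from an inverted row key."""
    inverted = int(row_key.split("_", 1)[0])
    return EPOCH + timedelta(microseconds=(MAX_TICKS - inverted) // 10)
```

Because the numeric part has a fixed width, appending the chat id does not disturb the newest-first ordering.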

Model simple social network in Azure table service

What is the best table design for a simple social networking website using Azure Table Service?
The website could have millions of users.
Users need to be able to view a list of all other users in the system sorted by the number of mutual connections.
Users must be able to view a list of their connections
User must be able to view content posted by themselves and their connections.
One major design constraint is that Azure table service queries are generally limited to the partition key and row key when there are a large number of records or else they get really slow. Another constraint is that query results are only sorted by the partition key and then the row key.
Try this Design:
UserTable
PK: GUID ( GUID for PK will maximize scalability, only one partition with single row in each server)
RK: GUID
... Rest of properties
UserFriendsTable
PK: UserTable.RK ( Every User with his friends in a separate server)
RK: GUID
FriendWith: UserTable.Pk - UserTable.RK (Concatenate PK and RK from user table separated with "-", this will help you to execute point query fast when you try to access friend profile )
PostsTable
PK: UserTable.RK + "-" + YYYYMM + random number (This will allow Azure to put all monthly posts of any user on a separate server. The random number prevents Azure from auto-grouping partitions in sequence. You can query posts by filtering on part of the PK, e.g. PK starts with XCtghi94ktY-201411.)
RK: use the following code to generate the row key in descending order, meaning the latest post comes first.
// Inverted ticks: the newer the post, the smaller the value.
long ticks = DateTimeOffset.MaxValue.UtcDateTime.Ticks - DateTimeOffset.Now.UtcDateTime.Ticks;
string guid = Guid.NewGuid().ToString("N");
string suffix = "-";
// Zero-padded so that lexicographic order matches numeric order.
string rowKey = string.Format("{0:d21}{1}{2}", ticks, suffix, guid);
Post : String

How to properly index my database to increase query performance

I'm working on a simple login page using OpenID: if the user has just registered for an OpenID, then I need to create a new entry in the database for the user, otherwise I just display their alias with a greeting. Every time somebody gets authenticated with their OpenID, I must find their alias by looking up which user has the given OpenID, and it seems that it might be fairly slow if the primary key is the UserID (and there are millions of users).
I'm using SQL Server 2008 and I have two tables in my database (Users and OpenIDs): I plan to check whether the OpenID exists in the OpenIDs table, then use the corresponding UserID to get the rest of the user information from the Users table.
The Users table is indexed by UserID and has the following columns:
UserID (pk)
EMail
Alias
OpenID (fk)
The OpenIDs table is indexed by OpenID and has the following columns:
OpenID (pk)
UserID (fk)
Alternately, I can index the Users table by UserID and OpenID (i.e have 2 indexes) and completely drop the OpenIDs table.
What would be the recommended way to improve the query for a user with the matching OpenID in this case: index the Users table with two keys or use the OpenIDs table to find the matching UserID?
Maybe the answers to "What are some best practises and “rules of thumb” for creating database indexes?" can help you.
Without knowing what kind of queries you'll be running in detail, I would recommend indexing the two foreign key columns - Users.OpenID and OpenIDs.UserID.
Indexing the foreign keys is typically a good idea to help with JOIN conditions and other queries.
But quite honestly, if you use the OpenIDs table only to check the existence of an OpenID, you'd be much better off just indexing that column in the Users table (possibly with a unique index?) and being done with it. That OpenIDs table as you have it now serves no real purpose at all; it just takes up space with redundant information.
Other than that: you need to observe how your application behaves, sample some usage data, see which queries run most often and which run longest, and then start doing performance tweaking. Don't overdo the ahead-of-time performance optimizations: too many indices can be worse than having none at all!
"Every time somebody gets authenticated with their Open ID, I must find their alias by looking up which user has the given OpenID and it seems that it might be fairly slow if the primary key is the UserID (and there are millions of users)."
Actually, quite the contrary! If you have a value that's unique amongst millions of rows, finding that single value is actually quite quick, even with millions of users. It will take only a handful (max. 5-6) of comparisons, and bang! You have your one user out of a million. If you have an index on that OpenID column, that should be pretty fast indeed. Such a highly selective index (one value picks out 1 in a million) works very efficiently.
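
As a back-of-the-envelope check on the "handful of steps" claim, the depth of a B-tree-style index grows with the logarithm of the row count. The sketch below is only an estimate under an assumed fanout (keys per index page); the real fanout depends on key size and page size, and the function name btree_levels is hypothetical.

```python
def btree_levels(row_count, fanout):
    """Estimate how many index levels are needed to address row_count keys."""
    levels, reachable = 1, fanout
    while reachable < row_count:
        reachable *= fanout  # each extra level multiplies the reachable keys
        levels += 1
    return levels


levels_small_fanout = btree_levels(1_000_000, 10)    # pessimistic fanout
levels_large_fanout = btree_levels(1_000_000, 100)   # more typical order of magnitude
```

Even with a deliberately small fanout of 10, a million rows need only 6 levels, which is in line with the estimate above.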

Strategy to reduce db indexes, selectively

I have an indexed field users.username, which is only used in the admin interface. Because the table currently has lots of writes, I'd like to remove that index. Of course I want to keep the field searchable for admins.
I could extract the whole column to move that index to another table. But that feels stupid, because I'm already planning to move the write-heavy fields into another table (with just one index).
Throwing in a search engine would be overkill.
Any ideas for a simple solution?
[edit]
I've just realized that the need for the admins to search and sort lots of fields has an impact on many tables (which would actually need many more indexes). As a first step I'll ensure that the admins get a dedicated server+db to keep the slow sorts/searches off the live servers, and in the long run I'll investigate whether a search engine is suitable. Thanks all!
Maintaining an index only accessible by certain users is not supported in MySQL, and even if it was, it would be as expensive as maintaining a usual index.
Assuming the usernames are unique, you can create a separate index-like table like this:
CREATE TABLE shadow_username (username VARCHAR(100) NOT NULL PRIMARY KEY, userid INT NOT NULL, UNIQUE (userid));
refill it periodically:
TRUNCATE TABLE shadow_username;
INSERT
INTO shadow_username
SELECT username, id
FROM users;
and query it:
SELECT u.*
FROM (
SELECT userid
FROM shadow_username
WHERE username = 'user'
) s
JOIN users u
ON u.id = s.userid
UNION ALL
SELECT u.*
FROM users u
WHERE u.id >
(
SELECT MAX(userid)
FROM shadow_username
)
AND u.username = 'user'
UNION ALL
SELECT *
FROM users
WHERE username = 'user'
LIMIT 1
The first part does a normal search; the second part processes the usernames that were inserted in between the updates to shadow_username; the third part is a fallback method which does a normal search only if previous two steps found nothing (that may happen if a user changed their username).
If the username never changes, you should omit the third step.
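
A runnable sketch of the shadow-table lookup, using Python with in-memory SQLite as a stand-in for MySQL and keeping only the first two parts of the query (the variant for usernames that never change). The helper names refresh_shadow and find_user are hypothetical, DELETE replaces TRUNCATE because SQLite has no TRUNCATE, and the COALESCE is an addition so that the lookup still works while the shadow table is empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
    CREATE TABLE shadow_username (
        username TEXT NOT NULL PRIMARY KEY,
        userid   INTEGER NOT NULL UNIQUE
    );
""")


def refresh_shadow(conn):
    """Periodic job: rebuild the shadow table from users in one transaction."""
    with conn:
        conn.execute("DELETE FROM shadow_username")
        conn.execute("INSERT INTO shadow_username SELECT username, id FROM users")


LOOKUP = """
    SELECT u.id, u.username
    FROM shadow_username s
    JOIN users u ON u.id = s.userid
    WHERE s.username = :name
    UNION ALL
    SELECT u.id, u.username
    FROM users u
    WHERE u.id > (SELECT COALESCE(MAX(userid), 0) FROM shadow_username)
      AND u.username = :name
    LIMIT 1
"""


def find_user(conn, name):
    """Look up via the shadow table, then among rows added since the last refresh."""
    return conn.execute(LOOKUP, {"name": name}).fetchone()
```

The second branch only scans users whose id is larger than anything in the shadow table, so its cost depends on how many rows were inserted since the last refresh.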
If I understand you correctly, you can't have an index for only a certain subset of $ways_to_access_data (ie, admin interface vs public interface).
Either the column is indexed, or it isn't.
I'm not sure where the actual problem is. Either the "username" field is written to, in which case updating the index is warranted (and whether to have it indexed or not is a trade off between read performance and write performance), or it isn't written to (which I'd assume, as most users tend to change their name rather seldom), in which case your RDBMS should not be touching the index at all.
Looking into my crystal ball, I'd assume the "write heavy" fields in the "users" table are login sessions, which should live in a separate table anyway.