SQL query to calculate visit duration from log table

I have a MySQL table LOGIN_LOG with fields ID, PLAYER, TIMESTAMP and ACTION. ACTION can be either 'login' or 'logout'. Only around 20% of the logins have an accompanying logout row. For those that do, I want to calculate the average duration.
I'm thinking of something like
select avg(LL2.TIMESTAMP - LL1.TIMESTAMP)
from LOGIN_LOG LL1
inner join LOGIN_LOG LL2
    on LL1.PLAYER = LL2.PLAYER
    and LL2.TIMESTAMP > LL1.TIMESTAMP
left join LOGIN_LOG LL3
    on LL3.PLAYER = LL1.PLAYER
    and LL3.TIMESTAMP between LL1.TIMESTAMP + 1 and LL2.TIMESTAMP - 1
    and LL3.ACTION = 'login'
where LL1.ACTION = 'login'
    and LL2.ACTION = 'logout'
    and LL3.ID is null
Is this the best way to do it, or is there a more efficient one?

Given the data you have, there probably isn't anything much faster you can do because you have to look at a LOGIN and a LOGOUT record, and ensure there is no other LOGIN (or LOGOUT?) record for the same user between the two.
Alternatively, find a way to ensure that a disconnect records a logout, so that the data is complete (instead of 20% complete). However, the query probably still has to ensure that the criteria are all met, so it won't help the query all that much.
If you can get the data into a format where the LOGIN and corresponding LOGOUT times are both in the same record, then you can simplify the query immensely. I'm not clear if the SessionManager does that for you.
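For instance, a minimal sketch of that simplified query, assuming a hypothetical SESSIONS table with one row per visit and LOGIN_TIME/LOGOUT_TIME columns:
select avg(timestampdiff(second, LOGIN_TIME, LOGOUT_TIME)) as avg_duration_seconds
from SESSIONS
where LOGOUT_TIME is not null  -- only the sessions that actually recorded a logout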

Do you have a SessionManager-type object that can time out sessions? Because a timeout could be logged there, and you could work out the end of the visit from the last activity time and the timeout period.
Or you log all activity on the website/service, and thus you can query website/service visit duration directly, and see what activities they performed. For a website, Apache log analysers can probably generate the required stats.

I agree with JeeBee, but another advantage to a SessionManager type object is that you can handle the sessionEnd event and write a logout row with the active time in it. This way you would likely go from 20% accompanying logout rows to 100% accompanying logout rows. Querying for the activity time would then be trivial and consistent for all sessions.
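For example (a sketch only, reusing the LOGIN_LOG schema from the question), the sessionEnd handler would just write the row that 80% of sessions are currently missing:
insert into LOGIN_LOG (PLAYER, TIMESTAMP, ACTION)
values (?, ?, 'logout');  -- bind the player id and the session's last-activity timestamp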

If only 20% of your users actually log out, this search will not give you a very accurate time for each session. A better way to gauge how long an average user session is would be to take the average time between actions, or the average time per page. This can then be multiplied by the average number of pages/actions per visit to give a more accurate time.
Additionally, you can determine the average time for each page, and then estimate a session's end time as the session time up to that point plus the average time spent on its last page. This will give you a much more fine-grained (and accurate) measure of time spent per session.
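For example, a sketch of the average-time-between-actions calculation, assuming a hypothetical ACTIVITY_LOG(PLAYER, TIMESTAMP) table and a MySQL version with window functions:
select avg(gap_seconds) as avg_seconds_between_actions
from (
    select timestampdiff(second,
                         lag(TIMESTAMP) over (partition by PLAYER order by TIMESTAMP),
                         TIMESTAMP) as gap_seconds
    from ACTIVITY_LOG
) gaps
where gap_seconds is not null;  -- each player's first action has no predecessor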
Regarding the given SQL, it seems more complicated than you really need. This sort of statistical operation is often better handled, and more maintainable, in code outside the database, where you have the full power of whichever language you choose rather than SQL's rather convoluted facilities for statistical calculations.


Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, it scrapes data from a webpage and stores specific information from that page in a database. The important information in this query is: player name, position, kill points, and class.
Player name has every potential to change, or remain the same, from day to day.
Regarding the position, there can be multiple players sitting in one position.
Kill points have the potential to increase or remain the same every day.
Class: there are only 2 possibilities a class can be. E.g. A can change to B or remain A (the same in reverse), but it cannot be C, D, E, or F.
The player name can change on any particular day, and position can also change depending on the kill-point increase since the last update, which brings us back to the goal: search the database day by day, from the current date as far back as 2021-02-22, starting at the most recent entry for a player name and backtracking to the previous day to check whether that player name is still the same or has changed.
What is used as the main reference for the change is the kill points. As the days go on, this number will either stay exactly the same or increase; it can never decrease.
So now onto the implementation of this application.
The first query that runs finds the most recent entry for the player name:
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
It then continues to check the previous day's entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%') AND [Archived]=0
If no results are found, it then proceeds to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches, and then cross-reference them with the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
So, when checking the day ahead: if the character name is not found in the day ahead, it is considered to be the old player name for this specific character; otherwise, if all 5 of the results are found to be present in the day-ahead searches, the name is considered to be new to the table.
From the date this application started running up to today's date, that amounts to over 400 individual queries on the database to achieve one goal.
It is also worth noting that this table grows by 14,400-14,500 rows each and every day.
The overall question for this specific case: is it possible to bring all these queries down into fewer calls to the database, reduce queries, and improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely based on how well the database is ordered/normalized and just how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there has been a change between the current scrape and the last one would guarantee fewer redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc., effectively removing one WHERE clause.
Reduce time between queries - Fewer incoming concurrent requests means less resource contention and faster response times to prior requests.
Use another data structure - The only reason you're using TOP() is that you need data ordered in some specific way (most recent, etc.). If you used an in-code data structure that keeps the data ordered and still easily queryable, you could perhaps offload some SQL requests to this structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
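More concretely, one way to collapse the day-by-day walk into a single round trip is a windowed query that numbers each character's rows by date. A sketch only, assuming SQL Server-style T-SQL and the [changes] schema from the question:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY [CharacterName]
                              ORDER BY [Recorded] DESC) AS rn
    FROM [changes]
    WHERE [Territory] = #territory
      AND [Archived] = 0
      AND [Recorded] >= '2021-02-22'
) AS ranked
WHERE rn = 1;  -- one row per character: its most recent entry in the range
Fetching the whole range once and walking it in application code replaces hundreds of per-day round trips with one.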

What exactly Peak Concurrent User means

What exactly does PCU refer to?
I have data like this; can anyone tell me, from this data, what the result for peak concurrent users would be?
How can I calculate PCU from this data using a SQL query?
PCU would mean the maximum number of users logged in at the same time during a given time period.
In your data you only have two users, so it can never be more than 2. We can see that at times, both users are logged in at the same time, so PCU=2 for this data.
Assuming you have access to Window Functions / Analytic Functions...
SELECT
    MAX(concurrent_users) AS peak_concurrent_users
FROM
(
    SELECT
        SUM(CASE WHEN event = 'logout' THEN -1 ELSE 1 END)
            OVER (ORDER BY __time, user_id, event)
            AS concurrent_users
    FROM
        yourTable
    WHERE
        event IN ('login', 'logout')
) AS running_total
This just adds up, in time order, how many people have logged in or out, keeping a running total: when someone logs in the number goes up, and when someone logs out the number goes down.
Where two events happen at exactly the same time, it assumes the lowest user_id actually went first. And if a user has a login and a logout at exactly the same time, it assumes the login actually went first.

Laravel where clause based on conditions from value in database

I am building an event reminder page where people can set a reminder for certain events. There is an option for the user to set how long before the event they need to be notified. It is stored in notification_time and notification_unit: notification_time keeps track of the amount of time before they want to be notified, and notification_unit keeps track of the PHP date format character in which they selected the time, e.g. i for minutes, H for hours.
E.g. notification_time = 2 and notification_unit = H means they need to be notified 2 hours before.
I have cron jobs running in the background to handle the notifications. This function is hit once every minute.
Reminder::where(function ($query) {
    $query->where('event_time', '>=', now()->subMinutes(Carbon::createFromFormat('i', 60)->diffInMinutes() - 1)->format('H:i:s'));
    $query->where('event_time', '<=', now()->subMinutes(Carbon::createFromFormat('i', 60)->diffInMinutes())->format('H:i:s'));
})
In this function I am hard-coding the 'i', 60, while it should be fetched from the database. event_time is also part of the same table.
The table looks something like this:
id event_time ... notification_unit notification_time created_at updated_at
Is there any way to solve this issue? Is it possible to do the same logic with SQL instead?
A direct answer to this question is not possible. I found 2 ways to resolve my issue.
First Solution
MySQL has DATEDIFF and DATE_SUB to get the difference between timestamps and to subtract certain intervals from a timestamp. In my case the function runs every minute. To use them, I would have to refactor my database to store the time and unit as seconds, and then do the calculation in the query. I chose not to use this approach because both operations are a bit heavy on the server side, given that I am running the function every minute.
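As an illustration of that first approach, here is a sketch of what the per-minute job might run; the reminders table name and the notify_seconds column (the lead time converted to seconds) are assumptions, not part of the original schema:
SELECT * FROM reminders
WHERE event_time >= DATE_ADD(NOW(), INTERVAL notify_seconds SECOND)
  AND event_time <  DATE_ADD(NOW() + INTERVAL 60 SECOND, INTERVAL notify_seconds SECOND);
-- matches rows whose "notify me" moment (event_time - notify_seconds) falls within the current minute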
Second Solution
This is the solution I personally went with. Here I did the calculation while storing the row in the database. Meaning? Let me explain. I created a new table, notification_settings, which is linked to the reminder (a one-to-one relation). The table looks like this:
id, unit, time, notify_at, repeating, created_at, updated_at
The unit and time columns are only used while displaying the reminder. What I did is precompute, in the notify_at column, the moment when the user should be notified. So in the event scheduler I only need to check for reminders whose notify_at is the present minute (since I am running it every minute). The repeating column is there to keep track of whether the reminder is repeating or not. If it is repeating, I re-calculate the notify_at column at the time of scheduling. Once the user is notified, notify_at is set to null.
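A sketch of the resulting per-minute scheduler query (column names as above; rounding to the current minute is my assumption about how "the present minute" is matched):
SELECT * FROM notification_settings
WHERE notify_at >= DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i:00')
  AND notify_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i:00') + INTERVAL 1 MINUTE;
All the date arithmetic happens once at write time, so this hot query is a plain, indexable range scan on notify_at.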

How to alert on an event that normally happens once a day?

I have a batch job that runs once per day.
At the end of the job I submit a meter metric with a count of the items processed.
I want to alert if one day this metric is not updated.
On http://metrics.librato.com the maximum time I can check "not reported for" when creating an alert is 60 minutes.
I thought maybe I could create a composite metric and take the average rate of change over the past 24 hours, and alert if that reaches zero.
I've been trying:
derive(s("my.metric", "%", {function:"sum", period:"86400"}))
However it seems that, because I log only a single event, above quite small values of period (~250s) my rate of change simply drops to zero... I guess the low frequency means my single value is completely lost by the sampling.
Maybe I am using the wrong tool for the job...
Is there a way to achieve this in Librato?
There currently is not a way to achieve this as composite metrics are subject to the 60 minute limitation of alerts as well (as of 5/15/2015). You may need to look into configuring the metric (or a similar metric) to report within the 60m time range if possible.

Using Redis, how to identify active users based on their number of logins during a period of time?

I was told Redis was born for analytics, and I came across some bitmap use cases. Bitmaps are useful for counting based on yes/no (0/1) flags, but I can't find an efficient way to count the number of users who logged in at least 4 times during the last 10 days. Because Redis runs in memory, I tried using a bitmap to keep track of each user's login flag, and BITCOUNT to filter; on my laptop, it took a minute to return the count from about 4 million users' login activity.
Is there any way to solve this problem? I guess the round trips between my Node Redis client and the Redis server may be the issue; I'll try batched commands or a Lua script to see if that works.
I think you need to use Sorted Sets, with the user id as the member and the timestamp as the score.
When a user logs in, the score (timestamp) for that user is updated to the current time. Then you can get either the N most recently logged-in users (ZREVRANGE), or the users who logged in within some datetime range (ZRANGEBYSCORE).
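A minimal sketch of those commands, with illustrative key names and timestamps; the per-user variant at the end is my extension to answer the original "at least 4 logins in 10 days" count via ZCOUNT:
ZADD logins:last 1431648000 user:42                 # on login: set/update the user's last-login score
ZREVRANGE logins:last 0 9 WITHSCORES                # the 10 most recently active users
ZRANGEBYSCORE logins:last 1430784000 1431648000     # users whose last login falls in a time range
ZADD logins:user:42 1431648000 1431648000           # per-user set: one timestamped entry per login
ZCOUNT logins:user:42 1430784000 1431648000         # logins in the last 10 days; active if >= 4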