SQL question: count of occurrence greater than N in any given hour - sql

I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span

You could use the analytic function lag to look back in a sorted sequence of time stamps to see whether the record that came 19 entries earlier is within an hour difference:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence when they logged a twentieth time within an hour.

I think you might do something like that (I'll use a login table, with user, datetime as single column for the sake of simplicity):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to complete with your columns (user, datetime)
It is up to you to find the dateadd facilities (ua.datetime + 1 hour won't work); this is more or less dependent on the DB implementation, for example it is DATE_ADD in mySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp)
Due to the subquery (select count(*) ...), the whole query will not be the fastest because it is a corelative subquery - it needs to be reevaluated for each row.
The with is simply to compute a subset of user_logons to minimize its cost. This might not be useful, however this will lessen the complexity of the query.
You might have better performance using a stored function or a language driven (eg: java, php, ...) function.

Related

Oracle SQL: How to best go about counting how many values were in time intervals? Database query vs. pandas (or more efficient libraries)?

I currently have to wrap my head around programming the following task.
Situation: suppose we have one column where we have time data (Year-Month-Day Hours-Minutes). Our program shall get the input (weekday, starttime, endtime, timeslot) and we want to return the interval (specified by timeslot) where there are the least values. For further information, the database has several million entries.
So our program would be specified as
def calculate_optimal_window(weekday, starttime, endtime, timeslot):
return optimal_window
Example: suppose we want to input
weekday = Monday, starttime = 10:00, endtime = 12:00, timeslot = 30 minutes.
Here we want to count how many entries there are between 10:00 and 12:00 o'clock, and compute the number of values in every single 30 minute slot (i.e. 10:00 - 10:30, 10:01 - 10:31 etc.) and in the end return the slot with the least values. How would you go about formulating an efficient query?
Since I'm working with an Oracle SQL database, my second question is: would it be more efficient to work with libraries like Dask or Vaex to get the filtering and counting done? Where is the bottleneck in this situation?
Happy to provide more information if the formulation was too blurry.
All the best.
This part:
Since I'm working with an Oracle SQL database, my second question is:
would it be more efficient to work with libraries like Dask or Vaex to
get the filtering and counting done? Where is the bottleneck in this
situation?
Depending on your server's specs and the cluster/machine you have available for Dask, it is rather likely that the bottleneck in your analysis would be the transfer of data between the SQL and Dask workers, even in the (likely) case that this can be efficiently parallelised. From the DB's point of view, selecting data and serialising it is likely at least as expensive as counting in a relatively small number of time bins.
I would start by investigating how long the process takes with SQL alone, and whether this is acceptable, before moving the analysis to Dask. Usual rules would apply: having good indexing and sharding on the time index.
You should at least do the basic filtering and counting in the SQL query. With a simple predicate, Oracle can decide whether to use an index or a partition and potentially reduce the database processing time. And sending fewer rows will significantly decrease the network overhead.
For example:
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between timestamp '2021-01-25 10:00:00' and timestamp '2021-01-25 11:59:59'
group by trunc(the_time, 'MI')
order by the_minute desc;
(The trickiest part of these queries will probably be off-by-one issues. Do you really want "between 10:00 and 12:00", or do you want "between 10:00 and 11:59:59"?)
Optionally, you can perform the entire calculation in SQL. I would wager the SQL version will be slightly faster, again because of the network overhead. But sending one result row versus 120 aggregate rows probably won't make a noticeable difference unless this query is frequently executed.
At this point, the question veers into the more subjective question about where to put the "business logic". I bet most programmers would prefer your Python solution to my query. But one minor advantage of doing all the work in SQL is keeping all of the weird date logic in one place. If you process the results in multiple steps there are more chances for an off-by-one error.
--Time slots with the smallest number of rows.
--(There will be lots of ties because the data is so boring.)
with dates as
(
--Enter literals or bind variables here:
select
cast(timestamp '2021-01-25 10:00:00' as date) begin_date,
cast(timestamp '2021-01-25 11:59:59' as date) end_date,
30 timeslot
from dual
)
--Choose the rows with the smallest counts.
select begin_time, end_time, total_count
from
(
--Rank the time slots per count.
select begin_time, end_time, total_count,
dense_rank() over (order by total_count) smallest_when_1
from
(
--Counts per timeslot.
select begin_time, end_time, sum(the_count) total_count
from
(
--Counts per minute.
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between (select begin_date from dates) and (select end_date from dates)
group by trunc(the_time, 'MI')
order by the_minute desc
) counts
join
(
--Time ranges.
select
begin_date + ((level-1)/24/60) begin_time,
begin_date + ((level-1)/24/60) + (timeslot/24/60) end_time
from dates
connect by level <=
(
--The number of different time ranges.
select (end_date - begin_date) * 24 * 60 - timeslot + 1
from dates
)
) time_ranges
on the_minute between begin_time and end_time
group by begin_time, end_time
)
)
where smallest_when_1 = 1
order by begin_time;
You can run a db<>fiddle here.

sql query to find specific entries based on date range

I have a database with a table containing the history of the login times of all accounts. The table includes columns USERID and LOGIN_DATE. I want to find those users who have not logged in for over 60 days, so an SQL query that says
Find users who have a login date which was greater than 60 days ago, but have no entry for any date more recently than 60 days ago
Can anyone suggest how I would do this ?
Following can be a solution
select USERID, min(login_date) mind, max(login_date) maxd
from Logins
group by UserId
having max(login_date) < dateadd(d,-60,getdate())
You could use aggregation, and filter on users whose maximum login date is older than 60 days:
select userid
from mytable
group by userid
having max(login_date) < current_date - interval '60' day
You did not tell which database you are using, so this uses standard date arithmetics. You might need to adapt that to your actual database (all major databases have alternatives for this).
Now the question was tagged Oracle. The above would work; you might want to truncate the time portion of the date to check on entire days. And if you want to display the last login, just add the aggregate function to the select clause:
select userid, max(login_date) last_login_date
from mytable
group by userid
having max(login_date) < trunc(current_date) - interval '60' day
You can use aggregation:
select userid
from t
group by userid
having max(logindate) < trunc(sysdate) - interval '60' day;
Date/time functions are notoriously database-specific, so the exact syntax for the having clause might depend on your database.

How to group timestamps into islands (based on arbitrary gap)?

Consider this list of dates as timestamptz:
I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.
I'm trying to measure how much a given user studied, by looking at when they performed an action (the data is when they finished studying a sentence.) e.g.: on the yellow block, I'd consider the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.
I see how I could group these dates with a programming language by going through all of the dates and looking for the gap between two rows.
My question is: how would go about grouping dates in this way with Postgres?
(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
FROM tbl
) sub
ORDER BY done;
The subquery sub returns step = true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.
fiddle
Related:
Query to find all timestamps more than a certain interval apart
How to label groups in postgresql when group belonging depends on the preceding line?
Select longest continuous sequence
Grouping or Window
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
Conditional lead/lag function PostgreSQL?
Building up on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:
My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands only have one timestamp in them (leading to a 0-duration estimate.) I'm accounting for both by calculating the average review time and adding it to the total duration of islands.
This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.
-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT( EPOCH FROM logged) + countofislands * avgreviewtime) as totalstudytime, avgreviewtime -- add total logged time to estimate for first-review-in-island and 1-review islands
FROM
(
SELECT -- get the three key values that will let us calculate total time spent
sum(duration) as logged
, count(island) as countofislands
, EXTRACT( EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval) )/( sum(reviews) FILTER (WHERE duration != '00:00:00'::interval) - count(reviews) FILTER (WHERE duration != '00:00:00'::interval)) as avgreviewtime
FROM
(
SELECT island, age( max(done), min(done) ) as duration, count(island) as reviews -- calculate the duration of islands
FROM
(
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island -- give a unique number to each island
FROM (
SELECT -- detect the beginning of islands
done,
(
lag(done) OVER (ORDER BY done) <= done - interval '2 min'
) AS step
FROM review
WHERE clicker_id = 71 AND "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00' -- keep the queries small and fast for now
) sub
ORDER BY done
) grouped
GROUP BY island
) sessions
) summary

Select the date of a UserIDs first/most recent purchase

I am working with Google Analytics data in BigQuery, looking to aggregate the date of last visit and first visit up to UserID level, however my code is currently returning the max visit date for that user, so long as they have purchased within the selected date range, because I am using MAX().
If I remove MAX() I have to GROUP by DATE, which I don't want as this then returns multiple rows per UserID.
Here is my code which returns a series of dates per user - last_visit_date is currently working, as it's the only date that can simply look at the last date of user activity. Any advice on how I can get last_ord_date to select the date on which the order actually occurred?
SELECT
customDimension.value AS UserID,
# Last order date
IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
(MAX(DATE)),
"unknown") AS last_ord_date,
# first visit date
IF(SUM(totals.newvisits) IS NOT NULL,
(MAX(DATE)),
"unknown") AS first_visit_date,
# last visit date
MAX(DATE) AS last_visit_date,
# first order date
IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
(MIN(DATE)),
"unknown") AS first_ord_date
FROM
`XXX.XXX.ga_sessions_20*` AS t
CROSS JOIN
UNNEST (hits) AS hits
CROSS JOIN
UNNEST(t.customdimensions) AS customDimension
CROSS JOIN
UNNEST(hits.product) AS hits_product
WHERE
parse_DATE('%y%m%d',
_table_suffix) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 day)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND customDimension.index = 2
AND customDimension.value NOT LIKE "true"
AND customDimension.value NOT LIKE "false"
AND customDimension.value NOT LIKE "undefined"
AND customDimension.value IS NOT NULL
GROUP BY
UserID
the most efficient and clear way to do this (and also most portable) is to have a simple table/view that has two columns: userid, last_purchase and another that has other two cols userid, first_visit.
then you inner join it with the original raw table on userid and hit timestamp to get, say, the session IDs you're interested in. 3 steps but simple, readable and easy to maintain
It's very easy to hit too much complexity for a query that relies on first or last purchase/action (just look at the unnest operations you have there) that is becomes unusable and you'll spend way too much time trying to figure out the meaning of the output.
Also keep in mind that using the wildcard in the query has a limit of 1000 tables, so your last and first visits are in a rolling window of 1000 days.

How do I design a SQL query that will show me all users who visited at least one page for the last 20 out of 24 hours?

In order to identify human traffic (as opposed to crawlers, bots, etc), I would like to design an SQL query that will identify all unique visitor ID's that have visited websites in the last 20 of 24 hours (as most humans would not be browsing for that long). I believe I understand how I want to structure it, "How many UNIQUE hours have any activity for each visitor in the past 24 hours, and WHERE 20 hours have at least some activity".
While the specifics of such a query would depend on the tables involved, I'm having trouble understanding if my structure is on the right track:
SELECT page_url, affinity, num
FROM (
SELECT AGG GROUP BY visitor_id, pages.page_url, max(v.max_affinity) as affinity, COUNT(*) as num, Row_Number()
OVER (Partition By v.visitor_id ORDER BY COUNT(visitor_id) DESC) AS RowNumber
FROM audience_lab_active_visitors v
SELECT pages ON pages.p_date >= '2017-09-14'
WHERE v.p_date='2017-09-14'
GROUP BY v.vispage_visitors, pages.page_url
) tbl WHERE RowNumber < 20
I don't believe your query is valid SQL, but I have an idea of what you're trying to accomplish. Rather than use a static date, I filtered by the past 24 hours and truncated the current timestamp to the hour, otherwise the query would be considering 25 unique hours. I also removed page_url from the query since it didn't seem relevant to the results based on what you're trying to solve.
For each visitor_id, the query counts the number of unique hours recorded based on the column (timestamp_col in this example) used to record the timestamp of the page view. HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20 returns those you've identified at humans, meaning they visited the website up to 19 of the past 24 hours.
SELECT
visitor_id,
COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) AS num,
MAX(v.max_affinity) AS affinity
FROM audience_lab_active_visitors AS v
JOIN pages AS p ON v.page_url = p.page_url
WHERE
v.p_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP) - INTERVAL '24' hour
GROUP BY 1
HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20;