How to calculate an SQL MAX() for any N-second duration within a certain timeframe

I need to answer a question like this:
For each user, what is the most items that user viewed in any 60 second
time frame between START_TIMESTAMP and END_TIMESTAMP?
The 60 second time frame is a sliding window. It's not just a matter of "items viewed" counts for each whole minute. Also, 60 seconds was just an example, it should work for any number of seconds.
My data is stored like this:
-- Timestamped log of users viewing items
CREATE TABLE user_item_views (
user_id integer,
item_id integer,
timestamp timestamp
);
Doing it for each whole minute is easy enough, just format timestamp to something like YYYY-MM-DD hh:mm and do a count grouped by that formatted timestamp and the user_id.
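For illustration, the whole-minute version might look like this (a sketch in MySQL syntax; the concrete START/END timestamps are placeholders):

-- Views per user per whole minute (no sliding window yet)
SELECT
    user_id,
    DATE_FORMAT(timestamp, '%Y-%m-%d %H:%i') AS minute_bucket,
    COUNT(*) AS items_viewed
FROM user_item_views
WHERE timestamp BETWEEN '2021-03-01 00:00:00' AND '2021-03-02 00:00:00'  -- START/END placeholders
GROUP BY user_id, minute_bucket;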
Doing it for a sliding window, I have no idea how to approach.
If this would be easier outside of SQL, I am open to exporting the data to another format, or using another language.
Desired output is something like:
User ID | Max items viewed in N seconds, between START and END
------- | -----------------------------------------------------
...     | ...
...     | ...
...     | ...
How can I do this?

Here's how I would do it (beware: untested code, this is just to outline the idea).
You need a helper table with as many rows as there are seconds between START_TIMESTAMP and END_TIMESTAMP. Create that as a temp table before you begin your query.
For the sake of the sample, let's call it every_second. I'm assuming your minimum time resolution is one second.
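For example, in MySQL 8+ the helper table could be built with a recursive CTE (a sketch; the concrete timestamps are placeholders for START_TIMESTAMP and END_TIMESTAMP):

-- One row per second in the timeframe; raise cte_max_recursion_depth
-- first if the range spans more than 1000 seconds
SET SESSION cte_max_recursion_depth = 100000;
CREATE TEMPORARY TABLE every_second AS
WITH RECURSIVE seconds (ts) AS (
    SELECT TIMESTAMP '2021-03-01 00:00:00'        -- START_TIMESTAMP placeholder
    UNION ALL
    SELECT ts + INTERVAL 1 SECOND
    FROM seconds
    WHERE ts < TIMESTAMP '2021-03-01 01:00:00'    -- END_TIMESTAMP placeholder
)
SELECT ts AS `timestamp` FROM seconds;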
Then do:
SELECT
    s.timestamp,
    v.user_id,
    (
        SELECT COUNT(*)
        FROM user_item_views
        WHERE timestamp BETWEEN s.timestamp AND ADDTIME(s.timestamp, '00:00:59')
          AND user_id = v.user_id
    ) AS item_count
FROM
    every_second s
    LEFT JOIN user_item_views v ON v.timestamp = s.timestamp
GROUP BY
    s.timestamp,
    v.user_id
Store that result in another temporary table and select the per-user maxima from it (this extra step is necessary because of the classic "select max from group", i.e. greatest-n-per-group, problem).
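That second step might look like this (a sketch; per_second_counts stands for whatever you named the temporary table holding the result above):

-- Highest 60-second count per user
SELECT user_id, MAX(item_count) AS max_items_in_window
FROM per_second_counts
WHERE user_id IS NOT NULL    -- seconds with no views produce NULL user_ids via the LEFT JOIN
GROUP BY user_id;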

In MySQL (assuming that timestamp is unique):
SELECT
    user_id
  , MAX(max_count) AS max_count
FROM
  ( SELECT
        a.user_id
      , COUNT(*) AS max_count
    FROM user_item_views AS a
    JOIN user_item_views AS b
      ON  a.user_id = b.user_id
      AND a.timestamp <= b.timestamp
      AND b.timestamp < a.timestamp + INTERVAL 60 SECOND
    GROUP BY
        a.user_id
      , a.timestamp
  ) AS grp
GROUP BY
    user_id
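Each view's own timestamp anchors a candidate window here: the inner query counts, per view, how many of the same user's views fall into the 60 seconds starting at it. A window starting between two views can never be denser than one starting exactly at a view, so anchoring at the views themselves is sufficient; to use another N, change INTERVAL 60 SECOND accordingly.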

Related

Postgres Date and ID fill in Zero if no entry

If I have a table that has the following format:
purchase_time | user_id | items_purchased
The current query I'm doing is something like this:
SELECT user_id, date(purchase_time), sum(items_purchased)
from user_purchase_metrics
GROUP BY date(purchase_time), user_id;
I'm trying to create a query that will fill in 0 for purchases if there isn't an entry in that date for that given user. Is this possible?
Side-stepping the valid concern raised by @xQbert about the cost of generating missing dates: performance must always give way to necessity, and without a convenient calendar table, generating the dates of interest is a necessity. Moreover, in this case the dates must be generated for each user_id. In the following this is done by pairing each generated date with each distinct user_id from the user_purchase_metrics table. The result is then LEFT JOINed to the same table to sum the purchases, giving the desired 0 results for the missing dates (see demo; for the dates I just picked March 2021):
with dates( user_id, idate ) as
  ( select u.user_id, gs.d::date
    from ( select distinct user_id
           from user_purchase_metrics
         ) u
    join generate_series( date '2021-03-01'   -- start_date
                        , date '2021-03-31'   -- end_date
                        , interval '1 day'
                        ) gs(d)
      on true
  ) -- select * from dates;
select d.user_id
     , d.idate
     , coalesce(sum(pm.items_purchased), 0)
  from dates d
  left join user_purchase_metrics pm
         on (    pm.user_id = d.user_id
             and date(pm.purchase_time) = d.idate
            )
 group by d.user_id, d.idate
 order by d.user_id, d.idate;
To parametrize it, the query can be embedded in a SQL function that returns a table (also in the demo).
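A minimal sketch of such a wrapper, assuming the table from above (the function name and parameter names are made up):

create or replace function purchases_per_day(start_date date, end_date date)
  returns table (user_id integer, idate date, items bigint)
  language sql as
$$
  with dates(user_id, idate) as
    ( select u.user_id, gs.d::date
      from (select distinct user_id from user_purchase_metrics) u
      cross join generate_series(start_date, end_date, interval '1 day') gs(d)
    )
  select d.user_id
       , d.idate
       , coalesce(sum(pm.items_purchased), 0)
    from dates d
    left join user_purchase_metrics pm
           on pm.user_id = d.user_id
          and date(pm.purchase_time) = d.idate
   group by d.user_id, d.idate
   order by d.user_id, d.idate;
$$;

-- Usage:
select * from purchases_per_day(date '2021-03-01', date '2021-03-31');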

How to Average Number of Chats per Day on LEFT JOIN table in Snowflake SQL?

In Snowflake SQL, how do I average the number of video chats per day using a field from a table I LEFT JOINed to the rest of the query?
I'm thinking I have to use a SUM function to total the number of video chats, aggregate that per date, and then divide by 30 days (the rolling date range I specified throughout my query).
Any help would be appreciated as deadlines are approaching. Thank you.
SELECT DISTINCT
    t1."pid",
    IFNULL(t2."VideoChats", 0),
    t3."SFUser",
    t3."TotalProviders",
    t4."dimaccount.practice_specialty",
    t5."Account: CMRR",
    t6."CreatedDate",
    t7."stg_sf_case.Date_Time_Resolved__c",
    t8."stg_sf_case.Closed_Date",
    t9."pid"
FROM (SELECT "pid"
      FROM "EDW_PROD"."PUBLIC"."STG_MYSQL_PROVIDERMODULES" AS a
      WHERE a."active"
        AND a."status" = 'PURCHASED'
        AND a."module_id" = '14'
      GROUP BY a."pid"
     ) t1
LEFT JOIN (SELECT "started_at",
                  "pid",
                  COUNT(*) AS "VideoChats"
           FROM "EDW_PROD"."PUBLIC"."STG_MYSQL_VIDEOCHATROOM" AS b
           LEFT JOIN "EDW_PROD"."PUBLIC"."DIMACCOUNT" AS dimaccount
                  ON b."pid" = dimaccount."PID"
           WHERE b."started_at" >= DATE_TRUNC('month', CURRENT_DATE())
             AND b."started_at" < DATEADD('month', 1, DATE_TRUNC('month', CURRENT_DATE()))
             AND dimaccount."CurrentRow" = 'Y'
           GROUP BY b."pid", b."started_at"
          ) t2 ON t1."pid" = t2."pid"
For a rolling average you probably want to use a window function, something along these lines:

SELECT AVG(VideoChats) OVER (PARTITION BY pid
                             ORDER BY started_at
                             ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS AvgVideoChats

-- I saw a post about AVG not allowing a sliding window, so you may have to do this instead:
SELECT SUM(VideoChats) OVER (PARTITION BY pid
                             ORDER BY started_at
                             ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) / 30. AS AvgVideoChats
You may need to do this in a wrapper around your t2 query and adjust your date filters so that there are values available for averaging, but I'm not quite clear enough on what your query is doing with dates, or what results you are looking for, to be sure.
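For instance, a sketch of that wrapper, reusing the t2 subquery's table and aliases (untested; note that ROWS BETWEEN 30 PRECEDING counts rows, not calendar days, so a true 30-day average needs one row per day per "pid"):

SELECT b."pid",
       b."started_at",
       COUNT(*) AS "VideoChats",
       AVG(COUNT(*)) OVER (PARTITION BY b."pid"
                           ORDER BY b."started_at"
                           ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS "AvgVideoChats"
FROM "EDW_PROD"."PUBLIC"."STG_MYSQL_VIDEOCHATROOM" AS b
GROUP BY b."pid", b."started_at";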

Get 30 days prior data for each row of query

I have a query that gives me a list of ~20k users who logged on to our site during a specific week of the month.
What I need to get, for each of these users, over the past 30 days:
1. logged_on: whether they have any rows recorded in the same table
2. the max event in the 30-day window, prior to the date in the current WHERE clause
This is the current code snippet that helps me narrow to the ~20k users for a given week to begin with:
select
    user_id,
    max(timestamp)
from table
where timestamp between '2019-02-01' and '2019-02-05'
group by 1;
Expected result set/columns:
user_id,
max(timestamp),
logged_on, [if they have any # of rows in the same table within 30 days prior to their max(timestamp) date]
previous_timestamp, [the 2nd most recent login date within 30 days prior to their max(timestamp) date]
I think this is what you're looking for. Not sure if it's the most efficient method though; window functions may perform better, but as bob-mccormick mentioned, the tricky bit would be filling in dates where the user (the partition key) was not active so that the range query works correctly.
Example data setup (Snowflake syntax)
-- Create sample table
create temporary table user_logins (userid number, date_logged_on timestamp);

-- Insert some random sample data
insert overwrite into user_logins
select
    uniform(1, 10, random()) userid,
    dateadd('minutes', uniform(1, 86400, random()) * -1, current_timestamp::timestamp_ntz) date_logged_on
from table(generator(rowcount => 100))
;
Select statement
-- Run select
with user_last_logins as (
    select
        userid,
        max(date_logged_on) last_login
    from user_logins
    where date_logged_on between '2019-01-01' and '2019-05-08'
    group by userid
)
select
    user_last_logins.userid,
    max(user_last_logins.last_login) last_logged_on,
    count(prior_30_each_user.userid) num_logins_prior_30,
    max(prior_30_each_user.date_logged_on)
from user_last_logins
left join user_logins prior_30_each_user
       on user_last_logins.userid = prior_30_each_user.userid
      and prior_30_each_user.date_logged_on > dateadd('day', -30, user_last_logins.last_login)
      and prior_30_each_user.date_logged_on < user_last_logins.last_login
group by user_last_logins.userid
;

Optimizing Max Value query

I wanted to ask for advice on how I could optimize my query. I hope to make it run faster, as the current speed takes away from the UX.
My program collects data every hour, and I want to optimize the query that takes the latest data and builds the top 100 people for a specific event:
SELECT a.user_id AS user, nickname, value, s.created_on
FROM stats s, accounts a
WHERE a.user_id = s.user_id
  AND event_id = 1
  AND s.created_on IN (SELECT created_on
                       FROM stats
                       WHERE created_on >= NOW() - '1 hour'::INTERVAL)
ORDER BY value DESC
LIMIT 100
The query I have returns the top 100 from the last hour for event_id = 1, but I wish to optimize it; I believe the subquery is the root cause of the problem. I've tried other queries, but they end up with either duplicates or results that are not from the latest dataset.
Thank you
EDIT:
The accounts table contains [user_id, nickname].
The stats table contains [user_id, event_id, value, created_on].
NOW() - '1 hour'::INTERVAL is not MySQL syntax; perhaps you meant NOW() - INTERVAL 1 HOUR?
IN ( SELECT ... ) optimizes very poorly.
Not knowing the relationship between accounts and stats (1:1, 1:many, etc), I can only guess at what might work:
SELECT a.user_id AS user, nickname, value, s.created_on
FROM stats s, accounts a
WHERE a.user_id = s.user_id
  AND event_id = 1
  AND s.created_on >= NOW() - INTERVAL 1 HOUR
ORDER BY value DESC
LIMIT 100

INDEX(event_id, value)      -- on `stats`; both columns are there per your edit
INDEX(user_id, created_on)  -- on `stats`
or...
SELECT a.user_id AS user, nickname,
       ( SELECT MAX(value) FROM stats
         WHERE user_id = a.user_id
           AND event_id = 1 ) AS value,
       ( SELECT MAX(created_on) FROM stats
         WHERE user_id = a.user_id ) AS created_on
FROM accounts AS a
WHERE EXISTS
      ( SELECT *
        FROM stats
        WHERE created_on >= NOW() - INTERVAL 1 HOUR
          AND user_id = a.user_id
          AND event_id = 1
      )
ORDER BY value DESC
LIMIT 100
INDEX(user_id, created_on)
INDEX(event_id, value)
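As concrete DDL, those suggestions translate to something like this (MySQL syntax; the index names are made up):

ALTER TABLE stats ADD INDEX idx_user_created (user_id, created_on);
ALTER TABLE stats ADD INDEX idx_event_value (event_id, value);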
Please provide:
SHOW CREATE TABLE for each table;
EXPLAIN SELECT ...; for any reasonable candidates.

Select one row per day for each value

I have a SQL query in PostgreSQL 9.4 that, while more complex due to the tables I am pulling data from, boils down to the following:
SELECT entry_date, user_id, <other_stuff>
FROM <tables, joins, etc>
WHERE <whatever limits I want, such as limiting the date range or users>
GROUP BY entry_date, user_id
With the result that I have one row per user, per day for which I have data. In general, this query would be run for an entry_date period of one month, with the desired result of having one row per day of the month for each user.
The problem is that there may not be data for every user every day of the month, and this query only returns rows for days that have data.
Is there some way to modify this query so it returns one row per day for each user, even if there is no data (other than the date and the user) in some of the rows?
I tried doing a join with generate_series(), but that didn't work; it can ensure there are no missing days overall, but not per user. What I really need would be something like "for each user in list, generate a series of (user, date) records".
EDIT: To clarify, the final result that I am looking for would be that for each user in the database - defined as a record in a user table - I want one row per date. So if I specify a date range of 5/1/15-5/31/15 in my where clause, I want 31 rows per user, even if that user had no data in that range, or only had data for a couple of days.
generate_series() was the right idea. You probably did not get the details right. Could work like this:
WITH cte AS (
   SELECT entry_date, user_id, <other_stuff>
   FROM <tables, joins, etc>
   WHERE <whatever limits I want>
   GROUP BY entry_date, user_id
   )
SELECT *
FROM (SELECT DISTINCT user_id FROM cte) u
CROSS JOIN (
   SELECT entry_date::date
   FROM generate_series(current_date - interval '1 month'
                      , current_date - interval '1 day'
                      , interval '1 day') entry_date
   ) d
LEFT JOIN cte USING (user_id, entry_date);
I picked a running time window of one month ending "yesterday". You did not define your "month" exactly.
Assuming entry_date to be data type date.
Simpler for your updated requirements
To get results for every user in a users table (and not for a current selection) and for your given time range, it gets simpler. You don't need the CTE:
SELECT *
FROM (SELECT user_id FROM users) u
CROSS JOIN (
   SELECT entry_date::date
   FROM generate_series(timestamp '2015-05-01'
                      , timestamp '2015-05-31'
                      , interval '1 day') entry_date
   ) d
LEFT JOIN (
   SELECT entry_date, user_id, <other_stuff>
   FROM <tables, joins, etc>
   WHERE <whatever>
   GROUP BY entry_date, user_id
   ) t USING (user_id, entry_date);
Why this particular way to call generate_series()?
Generating time series between two dates in PostgreSQL
And best use ISO 8601 date format (YYYY-MM-DD) which works regardless of locale settings.