Monthly Moving Average of User Activity in SQL Server Using Window Functions

Let's say I have a table UserActivity in SQL Server 2012 with two columns:
ActivityDateTime
UserID
I want to calculate the number of distinct users with any activity in a 30-day period (my monthly active users) on a daily basis, so I have a 30-day window that increments a day at a time. How do I do this efficiently using window functions in SQL Server?
The output would look like this:
Date,NumberActiveUsersInPrevious30Days
01-01-2010,13567
01-02-2010,14780
01-03-2010,13490
01-04-2010,15231
01-05-2010,15321
01-06-2010,14513
...

SQL Server doesn't support COUNT(DISTINCT ...) OVER (), nor a numeric value (30 PRECEDING) in conjunction with RANGE.
I wouldn't bother trying to coerce window functions into doing this. Because of the COUNT(DISTINCT UserID) requirement, it is always going to have to re-examine the entire 30-day window for each date.
You can create a calendar table with a row for each date and use
SELECT C.[Date],
       NumberActiveUsersInPrevious30Days
FROM   Calendar C
       CROSS APPLY (SELECT COUNT(DISTINCT UserID)
                    FROM   UserActivity
                    WHERE  ActivityDateTime >= DATEADD(DAY, -30, C.[Date])
                      AND  ActivityDateTime < C.[Date]) CA(NumberActiveUsersInPrevious30Days)
WHERE  C.[Date] BETWEEN '2010-01-01' AND '2010-01-06'
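If you don't already have a Calendar table, here is a minimal sketch of one way to populate it for the range used above; the recursive-CTE approach and the date range are my assumptions, not part of the original answer:
CREATE TABLE Calendar ([Date] DATE PRIMARY KEY);

-- One row per day; MAXRECURSION 0 lifts the default 100-level recursion limit
;WITH Dates AS (
    SELECT CAST('2010-01-01' AS DATE) AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date])
    FROM Dates
    WHERE [Date] < '2010-12-31'
)
INSERT INTO Calendar ([Date])
SELECT [Date]
FROM Dates
OPTION (MAXRECURSION 0);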

Option 1: For (while) loop through each day and select 30 days backward for each (obviously quite slow).
Option 2: A separate table with a row for each day and join on the original table (again quite slow).
Option 3: Recursive CTEs or stored procs (still not doing much better).
Option 4: For (while) loop in combination with cursors (efficient, but requires some advanced SQL knowledge). With this solution you will step through each day and each row in order and keep track of the average (you'll need some sort of wrap-around array to know what value to subtract when a day moves out of range).
Option 5: Option 4 in a general-purpose / scripting programming language (C++ / Java / PHP) (easy to do with basic knowledge of one of those languages, efficient).

Related

Configurable SQL GROUP BY variable

I am currently developing some SQL aggregation queries to calculate data from one source table. The goal is to have different aggregation granularities in one executable query / function / etc. I am currently developing on PostgreSQL, but the code should be as ANSI SQL compliant as possible to be compatible with most DB variants.
Example:
SELECT
    COUNT(a) as amount,
    SUM(b) as sum,
    c as static_grouping,
    #vargr as variable_grouping,
    #vardesc as variable_grouping_description
FROM whatever
GROUP BY c, #vargr, #vardesc
#vargr can be date-driven, like daily, weekly, monthly, ...
#vardesc is the identifier that shows the aggregation type as text
Having multiple queries with UNION is not an option, since there are multiple grouping statements changing (resulting in 60+ single queries per result set). Is there a way to do this with a function, while loop, etc.?
If you want to group by varying date granularity, then one option uses date_trunc(). You would typically pass one of the supported precisions (such as day, week, month, and so on) as the parameter. Assuming that your date or timestamp column is tscol, you would do:
SELECT
    COUNT(a) as amount,
    SUM(b) as sum,
    c as static_grouping,
    DATE_TRUNC($1, tscol) as variable_grouping,
    $2 as variable_grouping_description
FROM whatever
GROUP BY c, DATE_TRUNC($1, tscol)
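Since the question asks whether this can be done with a function, here is a minimal sketch of one way to wrap it; the function name, parameter name, output column names, and declared column types are assumptions on my part (adjust them to the real types of a, b and c):
-- Sketch only: "whatever" and its columns come from the question, everything else is illustrative
CREATE FUNCTION aggregate_by_granularity(granularity text)
RETURNS TABLE (amount bigint,
               total numeric,
               static_grouping text,
               variable_grouping timestamp,
               variable_grouping_description text)
LANGUAGE sql AS $$
    SELECT COUNT(a),
           SUM(b)::numeric,                          -- casts keep the output aligned with the declared types
           c::text,
           DATE_TRUNC(granularity, tscol)::timestamp,
           granularity
    FROM whatever
    GROUP BY c, DATE_TRUNC(granularity, tscol);
$$;

-- Usage:
SELECT * FROM aggregate_by_granularity('month');
SELECT * FROM aggregate_by_granularity('week');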

Average calls per day in MS Access 2010

I'm a receptionist keeping track of incoming calls in MS Access 2010. The table has a Date column. I can get the count of calls per day, but I am having trouble with the SQL to get the average calls per day.
Assuming your table has one record per call, you can use a query like this; just replace the table and field names:
SELECT Avg(TotalCalls.DailyCalls) AS AverageCalls
FROM
(
    SELECT MyTable.MyDateField, Count(MyTable.MyDateField) AS DailyCalls
    FROM MyTable
    WHERE MyTable.MyDateField > #1-Feb-2017# AND MyTable.MyDateField <= #28-Feb-2017#
    GROUP BY MyTable.MyDateField
) AS TotalCalls
This won't take into account days that have no calls, just the ones that do. The WHERE clause is optional, but you might want to use that to pick a specific date range.
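If you do want days with zero calls reflected in the average, one rough alternative (my own sketch, not part of the original answer) is to divide the total call count by the number of days in the chosen range; here the WHERE clause and DateDiff both cover the same 27 days, and the table/field names are just the placeholders from above:
SELECT Count(*) / DateDiff("d", #1-Feb-2017#, #28-Feb-2017#) AS AverageCalls
FROM MyTable
WHERE MyTable.MyDateField > #1-Feb-2017# AND MyTable.MyDateField <= #28-Feb-2017#;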

SQL for Next/Prior Business Day from Calendar table (in MS Access)

I have a Calendar table pulled from our mainframe DBs and saved as a local Access table. The table has history back to the 1930s (and I know we use back to the 50s in at least one place), resulting in 31k records. This Calendar table has 3 fields of interest:
Bus_Dt - every day, not just business days. Primary Key
Bus_Day_Ind - indicates if the day was a valid business day for the stock market.
Prir_Bus_Dt - the prior business day. Contains some errors (about 50), all old.
I have written a query to retrieve the first business day on or after the current calendar day, but it runs supremely slowly (5+ minutes). I have examined the showplan output and see it is being run via an x-join, which between 30k+ record tables gives a solution space (and date comparisons) on the order of nearly 10 million. However, the actual task is not hard, and could be performed comfortably by Excel in minimal time using a simple sort.
My question is thus: is there any way to fix the poor performance of the query, or is this an inherent failing of SQL? (DB2 run on the mainframe is also slow, though not crushingly so. Throwing cycles at the problem and all that.) Secondarily, if I were to trust Prir_Bus_Dt, can I get there better? Or restrict the date range (aka "cheat"), or any other tricks I didn't think of yet?
SQL:
SELECT TE2Clndr.BUS_DT AS Cal_Dt
, Min(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
, TE2Clndr AS TE2Clndr_1
WHERE TE2Clndr_1.BUS_DAY_IND="Y" AND
TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]
GROUP BY TE2Clndr.BUS_DT;
Showplan:
Inputs to Query
Table 'TE2Clndr'
Table 'TE2Clndr'
End inputs to Query
01) Restrict rows of table TE2Clndr
by scanning
testing expression "TE2Clndr_1.BUS_DAY_IND="Y""
store result in temporary table
02) Inner Join table 'TE2Clndr' to result of '01)'
using X-Prod join
then test expression "TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]"
03) Group result of '02)'
Again, the question is, can this be made better (faster), or is this already as good as it gets?
I have a new query that is much faster for the same job, but it depends on the prir_bus_dt field (which has some errors). It also isn't great theory since prior business day is not necessarily available on everyone's calendar. So I don't consider this "the" answer, merely an answer.
New query:
SELECT TE2Clndr.BUS_DT as Cal_Dt
, Max(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
INNER JOIN TE2Clndr AS TE2Clndr_1
ON TE2Clndr.PRIR_BUS_DT = TE2Clndr_1.PRIR_BUS_DT
GROUP BY TE2Clndr.BUS_DT;
What about this approach:
select min(bus_dt)
from te2Clndr
where bus_dt >= date()
and bus_day_ind = 'Y'
This is my reference for date() representing the current date.
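Applying the same idea per calendar date, a correlated subquery would avoid the x-prod join entirely; this is only a sketch of how that approach could replace the original query (untested, and performance still depends on BUS_DT being indexed):
SELECT C.BUS_DT AS Cal_Dt,
       (SELECT Min(N.BUS_DT)
        FROM TE2Clndr AS N
        WHERE N.BUS_DT >= C.BUS_DT
          AND N.BUS_DAY_IND = 'Y') AS Next_Bus_Dt
FROM TE2Clndr AS C;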

How to adapt this query to use window functions

When I started tackling this problem, I thought, "This will be a great query to learn about window functions." I wasn't able to get it to work with window functions, but I was able to get what I wanted using a join.
How would you adapt this query to use window functions:
SELECT
day,
COUNT(i.project) as num_open
FROM generate_series(0, 364) as t(day)
LEFT JOIN issues i on (day BETWEEN i.closed_days_ago AND i.created_days_ago)
GROUP BY day
ORDER BY day;
The query above takes a list of issues whose lifetime is represented by created_days_ago and closed_days_ago, and, for each of the last 365 days, it counts the number of issues that were created but not yet closed on that specific day.
http://sqlfiddle.com/#!15/663f6/2
The issues table looks like:
CREATE TABLE issues (
id SERIAL,
project VARCHAR(255),
created_days_ago INTEGER,
closed_days_ago INTEGER);
What I was thinking was that the partition for a given day should include all the rows in issues where day is between the created and closed days ago. Something like SELECT day, COUNT(i.project) OVER (PARTITION day BETWEEN created_days_ago AND closed_days_ago) ...
I've never used window functions before, so I might be missing something basic, but it seemed like this was just the type of query that makes window functions so awesome.
The fact that you use generate_series() to create a full range of days, including those days with no changes (and thus no rows in table issues), does not rule out the use of window functions.
In fact, this query runs 50 times faster than the query in the Q in my local test:
SELECT t.day
, COALESCE(sum(a.created) OVER (ORDER BY t.day DESC), 0)
- COALESCE(sum(b.closed) OVER (ORDER BY t.day DESC), 0) AS open_tickets
FROM generate_series(0, 364) t(day)
LEFT JOIN (SELECT created_days_ago AS day, count(*) AS created
FROM issues GROUP BY 1) a USING (day)
LEFT JOIN (SELECT closed_days_ago AS day, count(*) AS closed
FROM issues GROUP BY 1) b USING (day)
ORDER BY 1;
It is also correct, as opposed to the query in the question, which reports 17 open tickets on day 0 even though all of them have been closed.
The error is due to BETWEEN in your join condition, which includes both the upper and lower bounds. This way, tickets are still counted as "open" on the day they are closed.
Each row in the result reflects the number of open tickets at the end of the day.
Explanation
The query combines window functions with aggregate functions.
Subquery a counts the number of created tickets per day. This results in a single row per day, making the rest easier.
Subquery b does the same for closed tickets.
Use LEFT JOINs to join to the generated list of days in subquery t.
Be wary of joining to multiple unaggregated tables! That could trigger a CROSS JOIN among the joined tables for multiple matches per row, generating incorrect results. Compare:
Two SQL LEFT JOINS produce incorrect result
Finally use two window functions to compute the running total of created versus closed tickets.
An alternative would be to use this in the outer SELECT:
sum(COALESCE(a.created, 0)
- COALESCE(b.closed, 0)) OVER (ORDER BY t.day DESC) AS open_tickets
Performs the same in my tests.
-> SQLfiddle demo.
Aside: I would never store "days_ago" in a table, but the absolute date / timestamp. Looks like a simplification for the purpose of this question.
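For reference, a sketch of what that schema could look like; the column names created_at / closed_at are my own, with a NULL closed_at meaning the issue is still open:
CREATE TABLE issues (
    id         serial PRIMARY KEY,
    project    varchar(255),
    created_at timestamp NOT NULL,
    closed_at  timestamp            -- NULL while the issue is still open
);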

Is there a way to handle immutability that's robust and scalable?

Since BigQuery is append-only, I was thinking about stamping each record I upload to it with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then I could issue a select statement and join on the max effective date:
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small joins' restriction, so I wanted to see if anyone else had thought around this use case.
Yes, adding a timestamp to each record (or, in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "small join" can actually return at least 8 MB (this value is compressed on our end, so it is usually 2 to 10 times larger), so for "lookup" table type subqueries, this can actually provide a lot of records.
In your case, it's not clear to me what the exact query you are trying to run is... it looks like you are trying to return the most recent sales times of every individual item, and then JOIN this information with the SUM of sales amt per month for each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username,
       UTC_USEC_TO_MONTH(timestamp * 1000000) AS month,
       SUM(num_characters) AS total_characters_used
FROM [publicdata:samples.wikipedia]
WHERE (contributor_username != '' AND contributor_username IS NOT NULL)
  AND timestamp > 1133395200
  AND timestamp < 1157068800
GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below): a similar query that finds num_characters for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
  (SELECT contributor_username, num_characters, timestamp AS time
   FROM [publicdata:samples.wikipedia]
   WHERE contributor_username != '' AND contributor_username IS NOT NULL) AS current
JOIN
  (SELECT contributor_username, MAX(timestamp) AS time
   FROM [publicdata:samples.wikipedia]
   WHERE contributor_username != '' AND contributor_username IS NOT NULL
     AND timestamp > 1265073722
   GROUP BY contributor_username) AS latest
ON current.contributor_username = latest.contributor_username
   AND current.time = latest.time;
If your query requires you to first build a large aggregate (for example, if you essentially need to run an accurate COUNT DISTINCT), another option is to break this up into two queries. The first query could provide the max effective date by month along with a count, and save this result as a new table. Then you could run a sum query on the resulting table.
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.
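In BigQuery's legacy SQL, listing several tables after FROM unions them together, so a per-month table layout could be queried along these lines (the monthly table names here are purely illustrative, not from the original answer):
SELECT UTC_USEC_TO_MONTH(timestamp) AS month, SUM(amt) / 100 AS sales
FROM foo.orders_201001, foo.orders_201002, foo.orders_201003
GROUP BY month
ORDER BY month;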