Using SQL to compute average daily unique usage

I have a MySQL table "stats" which holds one entry for each login to a website. Each entry has a "userId" string, a "loginTime" timestamp and other fields. There can be more than one entry per user - one for each login they make. I want to write a query that calculates the average number of unique daily logins over, say, 30 days.
Any ideas?

/*
This should give you one row for each date and unique visits on that date
*/
SELECT DATE(loginTime) AS LoginDate, COUNT(DISTINCT userID) AS UserCount
FROM stats
WHERE DATE(loginTime) BETWEEN [start date] AND [end date]
GROUP BY DATE(loginTime)
Note: it would be more helpful if you could provide some sample data along with the result you are looking for.

I'm probably wrong, but what if you did select count(distinct userid) from stats where logintime between the start and end of :day, for each of those 30 days, fetched those 30 counts (which could be pre-calculated and cached, since you presumably don't have users logging in at past times), and then just averaged them in the programming language you're executing the query from?
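For instance, one day's slice could look like this (a sketch with hypothetical placeholder dates; run it once per day in the window and average the 30 results in application code):
SELECT COUNT(DISTINCT userId) AS usersThatDay
FROM stats
WHERE loginTime >= '2013-01-01 00:00:00'  -- placeholder day start
  AND loginTime <  '2013-01-02 00:00:00'; -- placeholder next day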
I read http://unganisha.org/home/pages/Generating_Sequences_With_SQL/index.html while looking into this, and thought: if you had a table of, say, the numbers 0 to 30 (let's name it offsets for this example):
select avg(userstoday)
from (select count(distinct userid) as userstoday, offsets.day
from stats join offsets on (date(stats.logintime) = date_sub(curdate(), interval offsets.day day))
group by offsets.day) as daily
And as I noted, the userstoday value could be pre-calculated and stored in a table.
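A possible shape for that cache (an assumed schema, not from the original post; MySQL syntax):
CREATE TABLE daily_unique_logins (
    login_date   DATE PRIMARY KEY,
    unique_users INT NOT NULL
);
-- Refresh yesterday's count once a day, e.g. from a cron job
INSERT INTO daily_unique_logins (login_date, unique_users)
SELECT DATE(loginTime), COUNT(DISTINCT userId)
FROM stats
WHERE DATE(loginTime) = DATE_SUB(CURDATE(), INTERVAL 1 DAY)
GROUP BY DATE(loginTime);
-- The 30-day average is then a cheap scan over 30 rows
SELECT AVG(unique_users)
FROM daily_unique_logins
WHERE login_date > DATE_SUB(CURDATE(), INTERVAL 30 DAY);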

Thanks everyone, eventually I used:
SELECT SUM(uniqueUsers) / 30 AS DAU
FROM (
    SELECT DATE(loginTime) AS date, COUNT(DISTINCT userID) AS uniqueUsers
    FROM user_requests
    WHERE DATE(loginTime) > DATE_SUB(CURDATE(), INTERVAL 30 DAY)
    GROUP BY DATE(loginTime)
) AS daily_users
I use SUM and divide by 30 instead of AVG because on some days I may not have any logins, and I want to account for that. On any heavy-traffic website, simply using AVG will give the same result.
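For example (hypothetical numbers): with 29 days of 100 unique users each and one day with no logins at all, the zero-login day produces no row in the inner query, so AVG(uniqueUsers) = 2900 / 29 = 100, while SUM(uniqueUsers) / 30 = 2900 / 30 ≈ 96.7.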

Related

Get 30 days prior data for each row of query

I have a query that gives me a list of ~20k users for a specific week of the month, indicating that they have logged on to our site.
What I need to get, for each of these users, looking at the 30 days prior:
1. logged_on: whether they logged on, defined by any rows recorded in the same table
2. their max event in the 30-day window prior to the date in the current where clause
This is the current code snippet that helps me narrow to the ~20k users for a given week to begin with:
select
user_id,
max(timestamp)
from table
where timestamp between '2019-02-01' and '2019-02-05'
group by 1;
Expected result set/columns:
user_id,
max(timestamp),
logged_on, [if they have any # of rows in the same table within 30 days prior to their max(timestamp) date]
previous_timestamp, [the 2nd most recent login date within 30 days prior to their max(timestamp) date]
I think this is what you're looking for. I'm not sure it's the most efficient method, though - perhaps window functions may perform better, but as bob-mccormick mentioned, the tricky bit would be filling in dates where the user (partition key) was not active so that the range query works correctly.
Example data setup (Snowflake syntax)
-- Create sample table
create temporary table user_logins (userid number, date_logged_on timestamp);
-- Insert some random sample data
insert overwrite into user_logins
select
uniform(1,10,random()) userid,
dateadd('minutes', uniform(1,86400,random()) * -1,current_timestamp::timestamp_ntz) date_logged_on
from table(generator(rowcount => 100))
;
Select statement
-- Run select
with user_last_logins as (
select
userid,
max(date_logged_on) last_login
from user_logins
where
date_logged_on between '2019-01-01' and '2019-05-08'
group by userid
)
select
user_last_logins.userid,
max(user_last_logins.last_login) last_logged_on,
count(prior_30_each_user.userid) num_logins_prior_30,
max(prior_30_each_user.date_logged_on) previous_timestamp
from user_last_logins
left join user_logins prior_30_each_user
on user_last_logins.userid = prior_30_each_user.userid
and prior_30_each_user.date_logged_on > dateadd('day', -30, user_last_logins.last_login)
and prior_30_each_user.date_logged_on < user_last_logins.last_login
group by user_last_logins.userid
;
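As a sketch of the window-function route mentioned above (Snowflake syntax, same user_logins sample table; not from the original answer, and it returns the last and previous login but not the 30-day count):
with ordered as (
  select
    userid,
    date_logged_on,
    -- pair each login with that user's immediately preceding login
    lag(date_logged_on) over (partition by userid order by date_logged_on) as prev_login
  from user_logins
  where date_logged_on between '2019-01-01' and '2019-05-08'
)
select
  userid,
  date_logged_on as last_logged_on,
  -- null out the previous login if it falls outside the 30-day window
  iff(prev_login > dateadd('day', -30, date_logged_on), prev_login, null) as previous_timestamp
from ordered
qualify row_number() over (partition by userid order by date_logged_on desc) = 1;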

Group by for each row in BigQuery

I have a table that stores user comments for each month. Comments are stored using UTC timestamps, and I want to get the users that post more than 20 comments per day. I am able to get the start and end timestamps for each day, but I can't group the comments table by number of comments.
This is the script that I have for getting dates, timestamps and distinct users.
SELECT
DATE(TIMESTAMP_SECONDS(r.ts_start)) AS date,
r.ts_start AS timestamp_start,
r.ts_start+86400 AS timestamp_end,
COUNT(*) AS number_of_comments,
COUNT(DISTINCT s.author) AS distinct_authors
FROM ((
WITH
shifts AS (
SELECT
[STRUCT(" 00:00:00 UTC" AS hrs,
GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY) AS dt_range) ] AS full_timestamps )
SELECT
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) AS ts_start,
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) + 86400 AS ts_end
FROM
shifts,
shifts.full_timestamps
LEFT JOIN
full_timestamps.dt_range AS dt)) r
INNER JOIN
`user_comments.2018_07` s
ON
(s.created_utc BETWEEN r.ts_start
AND r.ts_end)
GROUP BY
r.ts_start
ORDER BY
number_of_comments DESC
And this is the sample output (screenshot omitted):
The user_comments.2018_07 table looks as follows (screenshot omitted):
More concretely, I want the first output to have one more column showing the number of authors that have more than 20 comments on that date. How can I do that?
If the goal is only to get the number of users with more than twenty comments for each day from table user_comments.2018_07 and add it to the output you have so far, this should simplify the query you first used - so long as you're not attached to keeping the min/max timestamps for each day.
with nb_comms_per_day_per_user as (
SELECT
day,
author,
COUNT(*) as nb_comments
FROM
# unnest as we don't really want an array
unnest(GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY)) AS day
INNER JOIN `user_comments.2018_07` c
on
# directly convert timestamp to a date, without using min/max timestamp
date(timestamp_seconds(created_utc))
=
day
GROUP BY day, c.author
)
SELECT
day,
sum(nb_comments) as total_comments,
count(*) as distinct_authors, # we have already grouped by author
# sum + if enables to count "very active" users
sum(if(nb_comments > 20, 1, 0)) as very_active_users
FROM nb_comms_per_day_per_user
GROUP BY day
ORDER BY total_comments desc
Also, I assumed the comment column containing booleans is not used, since you do not use it in your initial query.
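(Side note, not from the original answer: in BigQuery the sum(if(...)) line can equivalently be written with the COUNTIF aggregate, e.g. countif(nb_comments > 20) as very_active_users.)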

How many distinct active users did I have on a 90 day window? [duplicate]

This question already has answers here:
Query for count of distinct values in a rolling date range (5 answers)
I have a complex problem that seems to be trivial at first sight:
for a given 90 day window, how many distinct active users did I have?
The table I will use to query this is the logins table (hosted in Redshift); it has a logintime timestamp and a usertoken column as the user identifier.
Whenever I want to answer this for a single day, the query is easy and straightforward:
select count (distinct usertoken)
from logins
where datediff('d',logintime,getdate()) <= 90
The problem becomes complex because I want to have this in a table with the number for every given date.
07/07 100k
07/06 98k
07/05 99k
07/04 101k
(...)
Window functions do not help me because I need to count distinct, and this is not possible in a window function.
To my knowledge, there is no way to iterate in a SQL query.
How should I go about this?
Perhaps I am missing something, but from what I understand this should do:
-- In SQL Server
select cast(logintime as date), count(distinct usertoken)
from logins
where datediff(d, logintime, getdate()) <= 90
group by cast(logintime as date)
In PostgreSQL/Redshift:
Change cast(logintime as date) to date_trunc('day', logintime)
and datediff(d, logintime, getdate()) to datediff('d', logintime, getdate())
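Spelled out against the question's Redshift table, that would be something like (a sketch, not from the original answer):
select date_trunc('day', logintime) as day, count(distinct usertoken)
from logins
where datediff('d', logintime, getdate()) <= 90
group by 1
order by 1;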
I am assuming that if a day has zero users logging in you don't mind not showing it in the list.
First we get a set of all the days we care about and call that set "days".
with days as (
select date_trunc('day', logintime) as day from logins
where logintime > now() - '90 days'::interval
group by day
)
Then we join the days set with the logins.
select day, count(distinct usertoken)
from days
join logins on date_trunc('day', logins.logintime) = days.day
group by day
order by day
The trivial way is very computationally expensive:
select days.d, count(distinct l.usertoken)
from (select distinct date_trunc('day', logintime) as d
      from logins
     ) days left join
     (select distinct usertoken, date_trunc('day', logintime) as d
      from logins
     ) l
     on datediff('d', l.d, days.d) between 0 and 89
group by days.d
order by days.d;

Select one row per day for each value

I have a SQL query in PostgreSQL 9.4 that, while more complex due to the tables I am pulling data from, boils down to the following:
SELECT entry_date, user_id, <other_stuff>
FROM <tables, joins, etc>
WHERE <whatever limits I want, such as limiting the date range or users>
GROUP BY entry_date, user_id
With the result that I have one row per user, per day for which I have data. In general, this query would be run for an entry_date period of one month, with the desired result of having one row per day of the month for each user.
The problem is that there may not be data for every user every day of the month, and this query only returns rows for days that have data.
Is there some way to modify this query so it returns one row per day for each user, even if there is no data (other than the date and the user) in some of the rows?
I tried doing a join with a generate_series(), but that didn't work - it can ensure there are no missing days overall, but not per user. What I really need would be something like "for each user in the list, generate a series of (user, date) records".
EDIT: To clarify, the final result that I am looking for would be that for each user in the database - defined as a record in a user table - I want one row per date. So if I specify a date range of 5/1/15-5/31/15 in my where clause, I want 31 rows per user, even if that user had no data in that range, or only had data for a couple of days.
generate_series() was the right idea. You probably did not get the details right. Could work like this:
WITH cte AS (
SELECT entry_date, user_id, <other_stuff>
FROM <tables, joins, etc>
WHERE <whatever limits I want>
GROUP BY entry_date, user_id
)
SELECT *
FROM (SELECT DISTINCT user_id FROM cte) u
CROSS JOIN (
SELECT entry_date::date
FROM generate_series(current_date - interval '1 month'
, current_date - interval '1 day'
, interval '1 day') entry_date
) d
LEFT JOIN cte USING (user_id, entry_date);
I picked a running time window of one month ending "yesterday". You did not define your "month" exactly.
Assuming entry_date to be data type date.
Simpler for your updated requirements
To get results for every user in a users table (and not for a current selection) and for your given time range, it gets simpler. You don't need the CTE:
SELECT *
FROM (SELECT user_id FROM users) u
CROSS JOIN (
SELECT entry_date::date
FROM generate_series(timestamp '2015-05-01'
, timestamp '2015-05-31'
, interval '1 day') entry_date
) d
LEFT JOIN (
SELECT entry_date, user_id, <other_stuff>
FROM <tables, joins, etc>
WHERE <whatever>
GROUP BY entry_date, user_id
) t USING (user_id, entry_date);
Why this particular way to call generate_series()?
Generating time series between two dates in PostgreSQL
And best use ISO 8601 date format (YYYY-MM-DD) which works regardless of locale settings.
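For illustration, the call above yields one timestamp per day, which the ::date cast then truncates:
SELECT entry_date::date
FROM generate_series(timestamp '2015-05-01'
                   , timestamp '2015-05-03'
                   , interval '1 day') entry_date;
-- 2015-05-01
-- 2015-05-02
-- 2015-05-03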

SQL: Need to SUM on results that meet a HAVING statement

I have a table where we record per user values like money_spent, money_spent_on_candy and the date.
So the columns in this table (let's call it MoneyTable) would be:
UserId
Money_Spent
Money_Spent_On_Candy
Date
My goal is to SUM the total amount of money_spent - but only for those users who have spent more than 10% of their total money spent for the date range on candy.
What would that query be?
I know how to select the users that qualify - and I could output the data and sum it by hand - but I would like to do this in one single query.
Here would be the query to pull the sum of Spend per user for only the users that have spent > 10% of their money on candy.
SELECT
UserId,
SUM(Money_Spent),
SUM(Money_Spent_On_Candy) / SUM(Money_Spent) AS PercentCandySpend
FROM MoneyTable
WHERE Date >= '2010-01-01'
GROUP BY UserId
HAVING PercentCandySpend > 0.1;
You couldn't do this with a single plain query. You'd need a query that could reach back in time and retroactively filter the source table to handle only users with 10% candy spending. Luckily, that's kind of what sub-queries do:
SELECT SUM(spent) FROM (
SELECT SUM(Money_Spent) AS spent
FROM MoneyTable
WHERE (DATE >= '2010-01-01')
GROUP BY UserID
HAVING (SUM(Money_Spent_On_Candy)/SUM(Money_Spent)) > 0.1
) AS per_user;
The inner query does the heavy lifting of figuring out what the "10%" users spent, and then the outer query uses the sub-query as a virtual table to sum up the per-user Money_Spent sums.
Of course, this only works if you need ONLY the global total Money_Spent. If you end up needing the per-user sums as well, then you'd be better off just running the inner query and doing the global total in your application.
You can use common table expressions. Like this:
WITH temp AS (SELECT
UserId,
SUM(Money_Spent) AS MoneySpent,
SUM(Money_Spent_On_Candy)/SUM(Money_Spent) AS PercentCandySpend
FROM MoneyTable
WHERE Date >= '2010-01-01'
GROUP BY UserId
HAVING PercentCandySpend > 0.1)
SELECT
SUM(MoneySpent)
FROM temp
Or you can use a derived table:
SELECT SUM(Total_Money_Spent)
FROM ( SELECT UserId, SUM(Money_Spent) AS Total_Money_Spent,
SUM(Money_Spent_On_Candy)/SUM(Money_Spent) AS PercentCandySpend
FROM MoneyTable
WHERE Date >= '2010-01-01'
GROUP BY UserId
HAVING PercentCandySpend > 0.1 ) x;