Usage of two different timestamps in one SQL query - sql

I am stuck on a SQL challenge:
Per day in 2019 the number of page visits and registrations
The dataset looks like the following:
I don't get how I can count both the number of page visits per day and the number of registrations per day, as they are based on two different timestamps (users.user_registration_timestamp and informations.info_timestamp). Do I need to use a subquery? And what would the SQL query look like?

You don't need a join, you just need two separate SELECTs:
select cast(user_registration_timestamp as date), count(*)
from users
where user_registration_timestamp >= timestamp '2019-01-01 00:00:00'
  and user_registration_timestamp < timestamp '2020-01-01 00:00:00'
group by cast(user_registration_timestamp as date)
union all
select cast(info_timestamp as date), count(*)
from informations
where info_timestamp >= timestamp '2019-01-01 00:00:00'
  and info_timestamp < timestamp '2020-01-01 00:00:00'
group by cast(info_timestamp as date)
This returns two rows per date, one from each table. If you want a single row per date with both counts, join the two SELECTs on their date instead (a plain CROSS JOIN only works if each SELECT returns a single total row).
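For example, a FULL OUTER JOIN on the casted date gives one row per day with both counts side by side (a sketch reusing the table and column names from the question; the FULL JOIN and COALESCE handle days that appear in only one of the two tables):
select coalesce(v.visit_date, r.reg_date) as dt,
       coalesce(v.page_visits, 0) as page_visits,
       coalesce(r.registrations, 0) as registrations
from (select cast(info_timestamp as date) as visit_date, count(*) as page_visits
      from informations
      where info_timestamp >= timestamp '2019-01-01 00:00:00'
        and info_timestamp < timestamp '2020-01-01 00:00:00'
      group by cast(info_timestamp as date)) v
full outer join
     (select cast(user_registration_timestamp as date) as reg_date, count(*) as registrations
      from users
      where user_registration_timestamp >= timestamp '2019-01-01 00:00:00'
        and user_registration_timestamp < timestamp '2020-01-01 00:00:00'
      group by cast(user_registration_timestamp as date)) r
  on v.visit_date = r.reg_date;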

Related

PL/SQL query to calculate customers per period from start and stop dates

I have an Oracle (PL/SQL) table with a structure as shown in the example below:
I have customers (customer_number) with insurance cover start and stop dates (cover_start_date and cover_stop_date). I also have dates of accidents for those customers (accident_date). These customers may have more than one row in the table if they have had more than one accident. They may also have no accidents. And they may also have a blank entry for the cover stop date if their cover is ongoing. Sorry I did not design the data format, but I am stuck with it.
I am looking to calculate the number of accidents (num_accidents) and number of customers (num_customers) in a given time period (period_start), and from that the number of accidents-per-customer (which will be easy once I've got those two pieces of information).
Any ideas on how to design a PL/SQL function to do this in a simple way? Ideally with the time periods not fixed to monthly (so, for example, weekly or fortnightly too)? Ideally I will end up with a table like the one shown below:
Many thanks for any pointers...
You seem to need a list of dates. You can generate one in the query and then use correlated subqueries to calculate the columns you want:
select d.*,
       (select count(distinct customer_id)
        from t
        where t.cover_start_date <= d.dte and
              (t.cover_stop_date > d.dte + interval '1' month or t.cover_stop_date is null)
       ) as num_customers,
       (select count(*)
        from t
        where t.accident_date >= d.dte and
              t.accident_date < d.dte + interval '1' month
       ) as num_accidents,
       (select count(distinct customer_id)
        from t
        where t.accident_date >= d.dte and
              t.accident_date < d.dte + interval '1' month
       ) as num_customers_with_accident
from (select date '2020-01-01' as dte from dual union all
      select date '2020-02-01' as dte from dual union all
      . . .
     ) d;
If you want to do arithmetic on the columns, you can use this as a subquery or CTE.
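As an aside, the hard-coded UNION ALL list of period starts can be generated instead. In Oracle, a CONNECT BY LEVEL query produces the same derived table, and changing the step expression gives the weekly or fortnightly grain asked about (a sketch assuming twelve monthly periods starting 2020-01-01):
select add_months(date '2020-01-01', level - 1) as dte
from dual
connect by level <= 12;
-- weekly grain instead: date '2020-01-01' + (level - 1) * 7 as dte, with connect by level <= 52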

Get 30 days prior data for each row of query

I have a query where I have a list of ~ 20k users for a specific week of the month that represents that they have logged on to our site.
What I need to get, for each of these users, looking at the 30 days prior:
1. whether they logged on: defined by any rows recorded in the same table
2. the max event in the 30-day window, prior to the date in the current where clause
This is the current code snippet that helps me narrow to the ~20k users for a given week to begin with:
select
    user_id,
    max(timestamp)
from table
where timestamp between '2019-02-01' and '2019-02-05'
group by 1;
Expected result set/columns:
user_id,
max(timestamp),
logged_on, [if they have any # of rows in the same table within 30 days prior to their max(timestamp) date]
previous_timestamp, [the 2nd most recent login date within 30 days prior to their max(timestamp) date]
I think this is what you're looking for. Not sure if it's the most efficient method though; window functions may perform better, but as bob-mccormick mentioned, the tricky bit would be filling in dates where the user (partition key) was not active so that the range query works correctly.
Example data setup (Snowflake syntax)
-- Create sample table
create temporary table user_logins (userid number, date_logged_on timestamp);
-- Insert some random sample data
insert overwrite into user_logins
select
uniform(1,10,random()) userid,
dateadd('minutes', uniform(1,86400,random()) * -1,current_timestamp::timestamp_ntz) date_logged_on
from table(generator(rowcount => 100))
;
Select statement
-- Run select
with user_last_logins as (
    select
        userid,
        max(date_logged_on) last_login
    from user_logins
    where date_logged_on between '2019-01-01' and '2019-05-08'
    group by userid
)
select
    user_last_logins.userid,
    max(user_last_logins.last_login) last_logged_on,
    count(prior_30_each_user.userid) num_logins_prior_30,
    max(prior_30_each_user.date_logged_on) previous_timestamp
from user_last_logins
left join user_logins prior_30_each_user
    on user_last_logins.userid = prior_30_each_user.userid
    and prior_30_each_user.date_logged_on > dateadd('day', -30, user_last_logins.last_login)
    and prior_30_each_user.date_logged_on < user_last_logins.last_login
group by user_last_logins.userid
;
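For comparison, here is roughly what the window-function route mentioned above could look like against the same sample table (a sketch, not benchmarked; LAG fetches each user's previous login directly, QUALIFY keeps only each user's most recent row, and the week filter from user_last_logins could be added to the first CTE as needed):
with ordered as (
    select
        userid,
        date_logged_on,
        -- previous login for the same user, in time order
        lag(date_logged_on) over (partition by userid order by date_logged_on) prev_login
    from user_logins
)
select
    userid,
    date_logged_on last_logged_on,
    -- true when the 2nd most recent login falls within the prior 30 days
    coalesce(prev_login > dateadd('day', -30, date_logged_on), false) logged_on,
    iff(prev_login > dateadd('day', -30, date_logged_on), prev_login, null) previous_timestamp
from ordered
qualify date_logged_on = max(date_logged_on) over (partition by userid);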

Group by for each row in bigquery

I have a table that stores user comments for each month. Comments are stored using UTC timestamps. I want to get the users that post more than 20 comments per day. I am able to get the timestamp start and end for each day, but I can't group the comments table by number of comments.
This is the script that I have for getting dates, timestamps and distinct users.
SELECT
DATE(TIMESTAMP_SECONDS(r.ts_start)) AS date,
r.ts_start AS timestamp_start,
r.ts_start+86400 AS timestamp_end,
COUNT(*) AS number_of_comments,
COUNT(DISTINCT s.author) AS distinct_authors
FROM ((
WITH
shifts AS (
SELECT
[STRUCT(" 00:00:00 UTC" AS hrs,
GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY) AS dt_range) ] AS full_timestamps )
SELECT
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) AS ts_start,
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) + 86400 AS ts_end
FROM
shifts,
shifts.full_timestamps
LEFT JOIN
full_timestamps.dt_range AS dt)) r
INNER JOIN
`user_comments.2018_07` s
ON
(s.created_utc BETWEEN r.ts_start
AND r.ts_end)
GROUP BY
r.ts_start
ORDER BY
number_of_comments DESC
And this is the sample output:
The user_comments.2018_07 table looks like the following:
More concretely, I want the first output to have one more column showing the number of authors that have more than 20 comments for the date. How can I do that?
If the goal is only to get, for each day, the number of users with more than twenty comments from table user_comments.2018_07, and add it to the output you have so far, this should simplify the query you first used, as long as you're not attached to keeping the min/max timestamps for each day.
with nb_comms_per_day_per_user as (
SELECT
day,
author,
COUNT(*) as nb_comments
FROM
# unnest as we don't really want an array
unnest(GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY)) AS day
INNER JOIN `user_comments.2018_07` c
on
# directly convert timestamp to a date, without using min/max timestamp
date(timestamp_seconds(created_utc))
=
day
GROUP BY day, c.author
)
SELECT
day,
sum(nb_comments) as total_comments,
count(*) as distinct_authors, # we have already grouped by author
# sum + if enables to count "very active" users
sum(if(nb_comments > 20, 1, 0)) as very_active_users
FROM nb_comms_per_day_per_user
GROUP BY day
ORDER BY total_comments desc
Also, I assumed the column comment containing booleans is not needed, as you do not use it in your initial query?
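As a side note, BigQuery also has COUNTIF, so the sum + if line above could equivalently be written as countif(nb_comments > 20) as very_active_users.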

How many distinct active users did I have on a 90 day window? [duplicate]

This question already has answers here:
Query for count of distinct values in a rolling date range
(5 answers)
Closed 6 years ago.
I have a complex problem that seems to be trivial at first sight:
for a given 90 day window, how many distinct active users did I have?
The table I will use to query this is the login table (hosted in Redshift); it has a logintime timestamp and a usertoken column as the user identifier.
Whenever I want to answer this for a single day, the query is easy and straightforward:
select count (distinct usertoken)
from logins
where datediff('d',logintime,getdate()) <= 90
The problem becomes complex because I want to have this in a table with the number for every given date.
07/07 100k
07/06 98k
07/05 99k
07/04 101k
(...)
Window functions do not help me because I need to count distinct, and this is not possible in a window function.
To my knowledge, there is no way to iterate in a SQL query.
How should I go about this?
Perhaps I am missing something, but from what I understand this should do it:
-- In SQL Server
select cast(logintime as date), count(distinct usertoken)
from logins
where datediff(d, logintime, getdate()) <= 90
group by cast(logintime as date)
In Redshift (the question's database, which is PostgreSQL-based), change cast(logintime as date) to date_trunc('day', logintime), and datediff(d, logintime, getdate()) to datediff('d', logintime, getdate()).
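Put together for Redshift, that would look roughly like this (a sketch reusing the table and column names from the question):
select date_trunc('day', logintime) as login_date,
       count(distinct usertoken) as distinct_users
from logins
where datediff('d', logintime, getdate()) <= 90
group by 1
order by 1;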
I am assuming that if a day has zero users logging in you don't mind not showing it in the list.
First we get a set of all the days we care about and call that set "days".
with days as (
select date_trunc('day', date) as day from logins
where date > now() - '90 days'::interval
group by day
)
Then we join the days set with the logins.
select day, count(distinct userid)
from days
join logins on date_trunc('day', logins.date) = days.day
group by day
order by day
The trivial way is very computationally expensive:
select days.d, count(distinct l.userid)
from (select distinct date_trunc('day', logintime) as d
from logins l
) days left join
(select distinct userid, date_trunc('day', logintime) as d
from logins
) l
on datediff('d', l.d, days.d) between 0 and 89
group by days.d
order by days.d;

Select one row per day for each value

I have a SQL query in PostgreSQL 9.4 that, while more complex due to the tables I am pulling data from, boils down to the following:
SELECT entry_date, user_id, <other_stuff>
FROM <tables, joins, etc>
WHERE <whatever limits I want, such as limiting the date range or users>
GROUP BY entry_date, user_id
With the result that I have one row per user, per day for which I have data. In general, this query would be run for an entry_date period of one month, with the desired result of having one row per day of the month for each user.
The problem is that there may not be data for every user every day of the month, and this query only returns rows for days that have data.
Is there some way to modify this query so it returns one row per day for each user, even if there is no data (other than the date and the user) in some of the rows?
I tried doing a join with generate_series(), but that didn't work: it can ensure there are no missing days overall, but not per user. What I really need is something like "for each user in the list, generate a series of (user, date) records".
EDIT: To clarify, the final result that I am looking for would be that for each user in the database - defined as a record in a user table - I want one row per date. So if I specify a date range of 5/1/15-5/31/15 in my where clause, I want 31 rows per user, even if that user had no data in that range, or only had data for a couple of days.
generate_series() was the right idea. You probably did not get the details right. Could work like this:
WITH cte AS (
   SELECT entry_date, user_id, <other_stuff>
   FROM   <tables, joins, etc>
   WHERE  <whatever limits I want>
   GROUP  BY entry_date, user_id
   )
SELECT *
FROM (SELECT DISTINCT user_id FROM cte) u
CROSS JOIN (
SELECT entry_date::date
FROM generate_series(current_date - interval '1 month'
, current_date - interval '1 day'
, interval '1 day') entry_date
) d
LEFT JOIN cte USING (user_id, entry_date);
I picked a running time window of one month ending "yesterday". You did not define your "month" exactly.
Assuming entry_date to be data type date.
Simpler for your updated requirements
To get results for every user in a users table (and not for a current selection) and for your given time range, it gets simpler. You don't need the CTE:
SELECT *
FROM (SELECT user_id FROM users) u
CROSS JOIN (
SELECT entry_date::date
FROM generate_series(timestamp '2015-05-01'
, timestamp '2015-05-31'
, interval '1 day') entry_date
) d
LEFT JOIN (
   SELECT entry_date, user_id, <other_stuff>
   FROM   <tables, joins, etc>
   WHERE  <whatever>
   GROUP  BY entry_date, user_id
   ) t USING (user_id, entry_date);
Why this particular way to call generate_series()?
Generating time series between two dates in PostgreSQL
And it's best to use the ISO 8601 date format (YYYY-MM-DD), which works regardless of locale settings.