Select one row per day for each value - sql

I have a SQL query in PostgreSQL 9.4 that, while more complex due to the tables I am pulling data from, boils down to the following:
SELECT entry_date, user_id, <other_stuff>
FROM <tables, joins, etc>
WHERE <whatever limits I want, such as limiting the date range or users>
GROUP BY entry_date, user_id
The result is one row per user per day for which I have data. In general, this query would be run over an entry_date period of one month, with the desired result of one row per day of the month for each user.
The problem is that there may not be data for every user every day of the month, and this query only returns rows for days that have data.
Is there some way to modify this query so it returns one row per day for each user, even if there is no data (other than the date and the user) in some of the rows?
I tried joining against a generate_series(), but that didn't work: it can fill in the missing days overall, but not per user. What I really need is something like "for each user in the list, generate a series of (user, date) records".
EDIT: To clarify, the final result I am looking for is that for each user in the database - defined as a record in a user table - I want one row per date. So if I specify a date range of 5/1/15-5/31/15 in my WHERE clause, I want 31 rows per user, even if that user had no data in that range, or only had data for a couple of days.

generate_series() was the right idea. You probably did not get the details right. Could work like this:
WITH cte AS (
   SELECT entry_date, user_id, <other_stuff>
   FROM   <tables, joins, etc>
   WHERE  <whatever limits I want>
   GROUP  BY entry_date, user_id
   )
SELECT *
FROM  (SELECT DISTINCT user_id FROM cte) u
CROSS  JOIN (
   SELECT entry_date::date
   FROM   generate_series(current_date - interval '1 month'
                        , current_date - interval '1 day'
                        , interval '1 day') entry_date
   ) d
LEFT   JOIN cte USING (user_id, entry_date);
I picked a running time window of one month ending "yesterday". You did not define your "month" exactly.
Assuming entry_date to be data type date.
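If entry_date is actually a timestamp rather than date, a minimal tweak (a sketch, untested): cast once inside the CTE so the USING join keys have matching types:
WITH cte AS (
   SELECT entry_date::date AS entry_date, user_id, <other_stuff>
   FROM   <tables, joins, etc>
   WHERE  <whatever limits I want>
   GROUP  BY 1, user_id
   )
-- the rest of the query stays unchanged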
Simpler for your updated requirements
To get results for every user in a users table (and not for a current selection) and for your given time range, it gets simpler. You don't need the CTE:
SELECT *
FROM  (SELECT user_id FROM users) u
CROSS  JOIN (
   SELECT entry_date::date
   FROM   generate_series(timestamp '2015-05-01'
                        , timestamp '2015-05-31'
                        , interval '1 day') entry_date
   ) d
LEFT   JOIN (
   SELECT entry_date, user_id, <other_stuff>
   FROM   <tables, joins, etc>
   WHERE  <whatever>
   GROUP  BY entry_date, user_id
   ) t USING (user_id, entry_date);
Why this particular way to call generate_series()?
Generating time series between two dates in PostgreSQL
And best use the ISO 8601 date format (YYYY-MM-DD), which works regardless of locale settings.
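A quick illustration (assuming a default Postgres setup; the second, locale-dependent literal is the kind to avoid):
SELECT date '2015-05-01';  -- ISO 8601: always May 1st, under any DateStyle
SELECT date '5/1/15';      -- May 1st or January 5th, depending on DateStyle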

Related

Data value on a given date

This time I have a table in a PostgreSQL database that contains the employee name, the date he started working, and the date he left the company; in cases where the employee is still with the company, this field is null.
Knowing this, I would like to know how many people were working on a given date, e.g.:
I would like to know how many people worked at the company in January 2021.
I don't know where to start. In some attempts I got the number of hires and layoffs per month, but I need to show this accumulated value per month, in another column.
I hope I made myself understood; I'll leave the last SQL I got here.
select reference, sum(hires)
from (
    select
        date_trunc('month', date_hires) as reference,
        count(*) as hires
    from ponto_mais_relatorio_colaboradores
    group by date_hires
    union all
    select
        date_trunc('month', date_layoff) as reference,
        count(*) * -1 as layoffs
    from ponto_mais_relatorio_colaboradores
    group by date_layoff
) as reference
join calendar_aux on calendar_aux.ano_mes = reference
group by reference
order by reference
Break the requirement down. The question: how many are employed on any given date? That includes everyone hired on or before that date who has no layoff date, plus everyone hired on or before that date whose layoff date falls after the period you are interested in. I.e., if you are interested in January, you still want to count an employee with a layoff date in February. With that in place, convert it into SQL: the employment test is a simple comparison of dates. The other issue is that January is not a date but a range of dates, so you need each date in it. You can use generate_series() to create each day in January, then join the generated dates against a selection from your table. Resulting query:
with jan_dates( jdate ) as (
    select generate_series( date '2021-01-01'
                          , date '2021-01-31'
                          , interval '1' day
                          )::date
)
select jdate "Date", count(*) "Employees"
from jan_dates j
join employees e
    on (    e.date_hires <= j.jdate
        and (   e.date_layoff is null
             or e.date_layoff > j.jdate
            )
       )
group by j.jdate
order by j.jdate;
Note: Not tested.
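To then get the single figure asked for ("how many people in January 2021"), a follow-on sketch in the same spirit (same assumed columns, equally untested): count against one reference date, e.g. the end of January:
select count(*) as employees
from employees e
where e.date_hires <= date '2021-01-31'
  and (   e.date_layoff is null
       or e.date_layoff > date '2021-01-31'
      );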

Get 30 days prior data for each row of query

I have a query that gives me a list of ~20k users who logged on to our site during a specific week of the month.
What I need to get, for each of these users, looking at the 30 days prior:
1. whether they logged on: defined by any rows recorded in the same table
2. the max event in the 30-day window, prior to the date in the current where clause
This is the current code snippet that helps me narrow to the ~20k users for a given week to begin with:
select
    user_id,
    max(timestamp)
from table
where timestamp between '2019-02-01' and '2019-02-05'
group by 1;
Expected result set/columns:
user_id,
max(timestamp),
logged_on, [if they have any # of rows in the same table within 30 days prior to their max(timestamp) date]
previous_timestamp, [the 2nd most recent login date within 30 days prior to their max(timestamp) date]
I think this is what you're looking for. Not sure if it's the most efficient method though - perhaps window functions would perform better, but like bob-mccormick mentioned: the tricky bit would be filling in dates where the user (partition key) was not active so that the range query works correctly. (A rough window-function sketch follows after the select statement below.)
Example data setup (Snowflake syntax)
-- Create sample table
create temporary table user_logins (userid number, date_logged_on timestamp);
-- Insert some random sample data
insert overwrite into user_logins
select
uniform(1,10,random()) userid,
dateadd('minutes', uniform(1,86400,random()) * -1,current_timestamp::timestamp_ntz) date_logged_on
from table(generator(rowcount => 100))
;
Select statement
-- Run select
with user_last_logins as (
select
userid,
max(date_logged_on) last_login
from user_logins
where
date_logged_on between '2019-01-01' and '2019-05-08'
group by userid
)
select
user_last_logins.userid,
max(user_last_logins.last_login) last_logged_on,
count(prior_30_each_user.userid) num_logins_prior_30,
max(prior_30_each_user.date_logged_on)
from user_last_logins
left join user_logins prior_30_each_user
on user_last_logins.userid = prior_30_each_user.userid
and prior_30_each_user.date_logged_on > dateadd('day', -30, user_last_logins.last_login)
and prior_30_each_user.date_logged_on < user_last_logins.last_login
group by user_last_logins.userid
;
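For comparison, a rough window-function version (a sketch against the same sample table; it covers the previous-login column via lag(), keeping only each user's latest row in the range - the 30-day login count would still need the join or a windowed count):
with logins_in_range as (
    select
        userid,
        date_logged_on,
        lag(date_logged_on) over (partition by userid order by date_logged_on) prev_login,
        row_number() over (partition by userid order by date_logged_on desc) rn
    from user_logins
    where date_logged_on between '2019-01-01' and '2019-05-08'
)
select
    userid,
    date_logged_on last_logged_on,
    iff(prev_login > dateadd('day', -30, date_logged_on), prev_login, null) previous_timestamp
from logins_in_range
where rn = 1
;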

Group by for each row in bigquery

I have a table that stores user comments for each month. Comments are stored using UTC timestamps. I want to get the users that post more than 20 comments per day. I am able to get the timestamp start and end for each day, but I can't group the comments table by number of comments.
This is the script that I have for getting dates, timestamps and distinct users.
SELECT
DATE(TIMESTAMP_SECONDS(r.ts_start)) AS date,
r.ts_start AS timestamp_start,
r.ts_start+86400 AS timestamp_end,
COUNT(*) AS number_of_comments,
COUNT(DISTINCT s.author) AS distinct_authors
FROM ((
WITH
shifts AS (
SELECT
[STRUCT(" 00:00:00 UTC" AS hrs,
GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY) AS dt_range) ] AS full_timestamps )
SELECT
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) AS ts_start,
UNIX_SECONDS(CAST(CONCAT( CAST(dt AS STRING), CAST(hrs AS STRING)) AS TIMESTAMP)) + 86400 AS ts_end
FROM
shifts,
shifts.full_timestamps
LEFT JOIN
full_timestamps.dt_range AS dt)) r
INNER JOIN
`user_comments.2018_07` s
ON
(s.created_utc BETWEEN r.ts_start
AND r.ts_end)
GROUP BY
r.ts_start
ORDER BY
number_of_comments DESC
And this is the sample output (screenshot omitted): one row per day with date, timestamp_start, timestamp_end, number_of_comments and distinct_authors.
The user_comments.2018_07 table (screenshot omitted) includes, at minimum, a created_utc column in UNIX seconds and an author column.
More concretely, I want the output above to have one more column showing the number of authors that have more than 20 comments for that date. How can I do that?
If the goal is only to get the number of users with more than twenty comments for each day from table user_comments.2018_07, and add it to the output you have so far, this should simplify the query you first used, so long as you're not attached to keeping the min/max timestamps for each day.
with nb_comms_per_day_per_user as (
SELECT
day,
author,
COUNT(*) as nb_comments
FROM
# unnest as we don't really want an array
unnest(GENERATE_DATE_ARRAY('2018-07-01','2018-07-31', INTERVAL 1 DAY)) AS day
INNER JOIN `user_comments.2018_07` c
on
# directly convert timestamp to a date, without using min/max timestamp
date(timestamp_seconds(created_utc))
=
day
GROUP BY day, c.author
)
SELECT
day,
sum(nb_comments) as total_comments,
count(*) as distinct_authors, # we have already grouped by author
# sum + if enables to count "very active" users
sum(if(nb_comments > 20, 1, 0)) as very_active_users
FROM nb_comms_per_day_per_user
GROUP BY day
ORDER BY total_comments desc
Also, I assumed the boolean column comment is not needed, since you do not use it in your initial query.
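If the timestamp_start/timestamp_end columns from the original output are still wanted, a small hedged add-on: derive them from the date itself instead of rebuilding the shifts struct. For example:
SELECT
  day,
  UNIX_SECONDS(TIMESTAMP(day)) AS timestamp_start,
  UNIX_SECONDS(TIMESTAMP(day)) + 86400 AS timestamp_end
FROM UNNEST(GENERATE_DATE_ARRAY('2018-07-01', '2018-07-31', INTERVAL 1 DAY)) AS day
Since day is the grouping key of the final query above, the same two expressions can simply be added to its SELECT list.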

BigQuery Cross Join Failing

I'm trying to pull user activity by date. I am trying to build a table of every day since a user account was created, using a cross join and a where clause. In my case, the cross join cannot be avoided. The calendar table is just a list of all dates for the last 365 days (365 rows). The user table has ~1b rows.
Here is the query that fails with insufficient resources:
SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date
from (select date(dt) as date from [dw.calendar]
      where date(dt) < CURRENT_DATE() ) cal
cross join each dw.user u
where date(u.created) <= cal.date
Based on https://cloud.google.com/bigquery/query-reference, cross joins do not even support the "each" clause. How do I perform the above operation to successfully create a table?
You do not need to fill "empty" days just to calculate the daily count and apply a window function for the aggregated sum, so you don't even need the calendar table for this. To make this happen you need to use RANGE vs. ROWS in your window: a RANGE frame spans the numeric value of the ORDER BY key, so a frame of 6 following unix dates covers 6 calendar days even when some days have no rows, whereas ROWS would count physical rows. See the example below (for BigQuery Standard SQL)
#standardSQL
SELECT
user_id, created, daily_count,
SUM(daily_count) OVER(
PARTITION BY user_id ORDER BY created_unix_date DESC
RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING
) weekly_avg
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC
I am not sure about the exact schema/types of your table, so you might need to adjust the above accordingly, but in the meantime you can test/play with the dummy data below
#standardSQL
WITH `dw.user` AS (
SELECT
day AS created,
CAST(1 + 10 * RAND() AS INT64) AS user_id,
CAST(100 * RAND() AS INT64) AS daily_count
FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-26')) AS day
)
SELECT
user_id, created, daily_count,
SUM(daily_count) OVER(
PARTITION BY user_id ORDER BY created_unix_date DESC
RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING
) weekly_avg
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC

Postgres inner query performance

I have a table from which I need to select everything matching this rule:
id = 4524522000143 and validPoint = true
and date > (max(date)- interval '12 month')
-- the max date, is the max date for this id
Explaining the rule: I have to get all the records and count them; they must be no more than 12 months older than the newest record for this id.
This is my actual query:
WITH points as (
    select 1 as ct
    from base_faturamento_mensal
    where id = 4524522000143 and validPoint = true
    group by id, date
    having date > (max(date) - interval '12 month')
)
select sum(ct) from points
Is there a more efficient way for this?
Well, your query uses the trick of including an unaggregated column in the HAVING clause, but I don't find it particularly bad. It seems fine, though without the EXPLAIN ANALYZE <query> output I can't say much more.
One thing you can do is get rid of the CTE and use count(*) in a single query, instead of returning 1 per row and then running sum() over it afterwards.
select count(*) as ct
from base_faturamento_mensal
where id = 4524522000143
and validPoint = true
group by id, date
having date > max(date) - interval '12 months'
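For what it's worth, an alternative sketch (same table, untested) that computes the newest date once with a window function and filters against it, which avoids relying on the per-group HAVING evaluation:
select count(*) as ct
from (
    select date,
           max(date) over () as max_date
    from base_faturamento_mensal
    where id = 4524522000143
      and validPoint = true
) s
where date > max_date - interval '12 months';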