BigQuery: iterating groups within a window of 28 days before a start_date column using _TABLE_SUFFIX - sql

I have a table like this:
group_id  start_date  end_date
19335     20220613    20220714
19527     20220620    20220719
19339     20220614    20220720
19436     20220616    20220715
20095     20220711    20220809
I am trying to retrieve data from another table that is date-sharded, and data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_id values within the period [start_date, end_date]. What I need is to retrieve a column/metric for those users over the last 28 days prior to the start_date of each group_id.
My idea is to:
Retrieve distinct user_id per group_id within the period [start_date, end_date]
Retrieve previous 28d metric data prior to the start date of each group_id
A code snippet showing how to retrieve the data for a single group_id is the following:
WITH users_per_group AS (
  SELECT
    users_metadata.user_id,
    users_metadata.group_id
  FROM
    `my_table_users_*` users_metadata
  WHERE
    _TABLE_SUFFIX BETWEEN '20220314' -- start_date
    AND '20220413' -- end_date
    AND experiment_id = 16709
  GROUP BY 1, 2
)
SELECT
  _TABLE_SUFFIX AS date,
  user_id,
  SUM(COALESCE(metric, 0)) AS metric
FROM
  users_per_group
  JOIN `my_metric_table*` metric USING (user_id)
WHERE
  _TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
    '%Y%m%d',
    TIMESTAMP_SUB(PARSE_TIMESTAMP('%Y%m%d', '20220314'), INTERVAL 28 DAY) -- start_date
  ) -- 28 days before it starts
  AND FORMAT_TIMESTAMP(
    '%Y%m%d',
    TIMESTAMP_SUB(PARSE_TIMESTAMP('%Y%m%d', '20220314'), INTERVAL 1 DAY) -- start_date
  ) -- 1 day before it starts
GROUP BY 1, 2
ORDER BY date ASC
Also, I want to avoid retrieving all data (across all dates) from that metric table, as it is huge and retrieval would take a very long time.
Is there an easy way to retrieve the metric data of each user across groups, considering the previous 28 days before the start_date of each group_id?

I can think of 2 approaches:
1. Join all the tables and then perform your query.
2. Create dynamic queries for each of your users.
Both approaches will require search_from and search_to to be available beforehand, i.e. you need to calculate each user's search range before you do anything.
E.g.:
WITH users_per_group AS (
  SELECT
    user_id, group_id,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) search_from, -- change the interval as per your need (28 days in the question)
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) search_to
  FROM TableName
)
Once you have this kind of table, you can use either of the mentioned approaches.
Since I don't have your data and don't know your table names, I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user, grp, start_date, end_date
with maintable as (
  select 'India' visit_from, '20161115' as start_date, '20161202' end_date
  union all select 'Sweden', '20161201', '20161202'
),
-- then calculate the search from/to dates for every user and group
user_per_grp as (
  select
    *,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) search_from, -- change interval as per your need
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) search_to
  from maintable
)
select visit_from, _TABLE_SUFFIX date, count(visitId) total_visits
from user_per_grp ug
left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub
  on pub.geoNetwork.country = ug.visit_from
where _TABLE_SUFFIX between format_date("%Y%m%d", ug.search_from) and format_date("%Y%m%d", ug.search_to)
group by 1, 2
Approach 2
declare queries array<string> default [];

create temp table maintable as (
  select 'India' visit_from, '20161115' as start_date, '20161202' end_date
  union all select 'Sweden', '20161201', '20161202'
);

create temp table user_per_grp as (
  select
    *,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) search_from,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) search_to
  from maintable
);

-- for each user create a separate query here
FOR record IN (SELECT * from user_per_grp)
DO
  set queries = queries || [format('select "%s" Visit_From, _TABLE_SUFFIX Date, count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2', record.visit_from, record.search_from, record.search_to, record.visit_from)];
  -- replace your query here.
END FOR;

-- aggregating all the queries and executing them as one
execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
Here the 2nd approach processed much less data (~750 KB) than the 1st approach (~17 MB). But that might not hold for your dataset, as the date ranges of 2 users may overlap, which leads to reading the same table twice.
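If the per-group windows overlap heavily, a possible middle ground (a sketch, not part of either approach above; it reuses the user_per_grp temp table from Approach 2 and the question's `my_metric_table*` wildcard) is to compute the overall min/max suffix once into script variables, which are constants by the time the main query runs and so can bound the wildcard scan, and then narrow each row to its own window inside the join:

declare min_suffix, max_suffix string;

-- one cheap pass over the small groups table to find the overall scan window
set (min_suffix, max_suffix) = (
  select as struct
    format_date("%Y%m%d", min(search_from)),
    format_date("%Y%m%d", max(search_to))
  from user_per_grp
);

-- the first BETWEEN limits which daily tables are read at all;
-- the second BETWEEN restricts each row to its own lookback window
select
  ug.group_id,
  m._TABLE_SUFFIX as date,
  sum(coalesce(m.metric, 0)) as metric
from user_per_grp ug
join `my_metric_table*` m using (user_id)
where m._TABLE_SUFFIX between min_suffix and max_suffix
  and m._TABLE_SUFFIX between format_date("%Y%m%d", ug.search_from)
                          and format_date("%Y%m%d", ug.search_to)
group by 1, 2;

As always, compare the bytes-processed estimates on your own data before choosing.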

Related

Converting event-wise table to timeseries

I have an SQLite database (with Django as ORM) with a table of change events (an Account is assigned a new Strategy). I would like to convert it to a timeseries, so as to have, for each day, the Strategy the Account was following.
My table:
Expected output:
As shown, there can be more than one change per day. In this case I select the last change of the day, as the desired timeseries output must have only one value per day.
My question is similar to this one, but in SQL rather than BigQuery (and I'm not sure I understood the unnest part they propose). I have a working solution in Pandas with reindex and fillna, but I'm sure there is an elegant and simple solution in SQL (maybe even better with the Django ORM).
You can use a RECURSIVE Common Table Expression to generate all dates between the first and last event, and then join this generated table with your data to get the needed value for each day:
WITH RECURSIVE daterange(d) AS (
  SELECT date(min(created_at)) FROM events
  UNION ALL
  SELECT date(d, '1 day') FROM daterange
  WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange JOIN events
WHERE created_at = (SELECT max(e.created_at) FROM events e
                    WHERE e.account_id = events.account_id AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
The date() function converts a datetime value to a simple date, so you can use it to group your data by day.
date(d, '1 day') applies a modifier of +1 calendar day to d.
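For example (values are illustrative):

SELECT date('2022-10-07 15:00:45');   -- '2022-10-07'
SELECT date('2022-10-07', '1 day');   -- '2022-10-08'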
Here is an example with your data:
CREATE TABLE events (
  created_at,
  account_id,
  strategy_id
);
INSERT INTO events
VALUES ('2022-10-07 12:53:53', 4801323843, 7),
       ('2022-10-07 08:10:07', 4801323843, 5),
       ('2022-10-07 15:00:45', 4801323843, 8),
       ('2022-10-10 13:01:16', 4801323843, 6);
WITH RECURSIVE daterange(d) AS (
  SELECT date(min(created_at)) FROM events
  UNION ALL
  SELECT date(d, '1 day') FROM daterange
  WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange JOIN events
WHERE created_at = (SELECT max(e.created_at) FROM events e
                    WHERE e.account_id = events.account_id AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
d           account_id  strategy_id
2022-10-07  4801323843  8
2022-10-08  4801323843  8
2022-10-09  4801323843  8
2022-10-10  4801323843  6
2022-10-11  4801323843  6
fiddle
The query could be slow with many rows. In that case create an index on the created_at column:
CREATE INDEX events_created_idx ON events(created_at);
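Since the correlated subquery also filters on account_id, a composite index may help even more (an assumption worth checking with EXPLAIN QUERY PLAN):

CREATE INDEX events_account_created_idx ON events(account_id, created_at);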
My final version is the one proposed by @Andrea B., with just a slight improvement in performance: merging only the rows that we need in the join, and therefore discarding the WHERE clause.
I also converted the NULL end dates to datetime('now').
Here is the final version I used:
with recursive daterange(day) as
(
  select min(date(created_at)) from events
  union all
  select date(day, '1 day') from daterange
  where day < date('now')
),
events as (
  select account_id, strategy_id, created_at as start_date,
    case lead(created_at) over (partition by account_id order by created_at) is null
      when true then datetime('now')
      else lead(created_at) over (partition by account_id order by created_at)
    end as end_date
  from events
)
select * from daterange
join events on events.start_date < daterange.day and daterange.day < events.end_date
order by events.account_id
Hope this helps!

Postgres Date and ID fill in Zero if no entry

If I have a table that has the following format:
purchase_time | user_id | items_purchased
The current query I'm doing is something like this:
SELECT user_id, date(purchase_time), sum(items_purchased)
from user_purchase_metrics
GROUP BY date(purchase_time), user_id;
I'm trying to create a query that will fill in 0 for purchases if there isn't an entry in that date for that given user. Is this possible?
Side-stepping the valid concern by @xQbert about the performance of generating missing dates: performance must always give way to necessity. Without a convenient calendar table, generating the dates of interest is a necessity. Moreover, in this case the dates must be generated for each user_id. In the following, this is done by joining the generated dates with the distinct user_id values from the user_purchase_metrics table. The result is then LEFT joined to the same table to sum the purchases, giving the desired 0 results for the missing dates (see demo; for dates I just picked March):
with dates(user_id, idate) as
( select user_id, d::date
  from ( select distinct user_id
         from user_purchase_metrics
       ) u
  join generate_series( date '2021-03-01' -- start_date
                      , date '2021-03-31' -- end_date
                      , interval '1 day'
                      ) gs(d)
    on true
) -- select * from dates;
select d.user_id
     , d.idate
     , coalesce(sum(pm.items_purchased), 0)
from dates d
left join user_purchase_metrics pm
  on ( pm.user_id = d.user_id
   and date(pm.purchase_time) = d.idate
     )
group by d.user_id, d.idate
order by d.user_id, d.idate;
To parametrize it, the query can be embedded in a SQL function that returns a table (also in the demo).
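A sketch of such a function, assuming user_id is an integer and items_purchased sums to a bigint (the function and parameter names are illustrative, not from the demo):

create function purchases_by_day(p_from date, p_to date)
returns table (user_id int, idate date, items bigint)
language sql stable
as $$
  with dates(user_id, idate) as
  ( select u.user_id, gs.d::date
    from ( select distinct user_id
           from user_purchase_metrics
         ) u
    join generate_series(p_from, p_to, interval '1 day') gs(d)
      on true
  )
  select d.user_id
       , d.idate
       , coalesce(sum(pm.items_purchased), 0)
  from dates d
  left join user_purchase_metrics pm
    on ( pm.user_id = d.user_id
     and date(pm.purchase_time) = d.idate
       )
  group by d.user_id, d.idate
  order by d.user_id, d.idate;
$$;

-- usage:
select * from purchases_by_day(date '2021-03-01', date '2021-03-31');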

adding all columns from multiple tables

I have a simple question.
I need to count all records from multiple tables by day and hour and add all of them together in a single final table.
So the query for each table is something like this:
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_1 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_2 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_3 group by 1, 2
and so on and so forth.
I would like to combine all the results, showing the total number of records for each day and hour from these tables.
Expected results will be like this
date, hour, number of records of table 1, number of records of table 2, number of records of table 3 ........
What would be the most efficient SQL query for this?
Probably the simplest way is to union them together and then aggregate:
select timestamp_trunc(timestamp, hour) as hh,
       countif(which = 1) as num_1,
       countif(which = 2) as num_2
from ((select timestamp, 1 as which
       from table_1
      ) union all
      (select timestamp, 2 as which
       from table_2
      ) union all
      . . .
     ) t
group by hh
order by hh;
You are using timestamp_trunc(). It returns a timestamp truncated to the hour -- there is no need to also include the date.
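For instance (an illustrative value):

select timestamp_trunc(timestamp '2022-01-15 10:37:29', hour);
-- 2022-01-15 10:00:00 UTC: the date part is preserved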
Below is for BigQuery Standard SQL
#standardSQL
SELECT
  TIMESTAMP_TRUNC(TIMESTAMP, DAY) day,
  EXTRACT(HOUR FROM TIMESTAMP) hour,
  COUNT(*) cnt,
  _TABLE_SUFFIX AS table
FROM `project.dataset.table_*`
GROUP BY day, hour, table
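If you want the question's column-per-table layout instead of one row per table, the same wildcard scan can be pivoted with COUNTIF (a sketch that assumes the shards are named table_1, table_2, table_3 under a single wildcard):

#standardSQL
SELECT
  TIMESTAMP_TRUNC(TIMESTAMP, DAY) day,
  EXTRACT(HOUR FROM TIMESTAMP) hour,
  COUNTIF(_TABLE_SUFFIX = '1') records_table_1,
  COUNTIF(_TABLE_SUFFIX = '2') records_table_2,
  COUNTIF(_TABLE_SUFFIX = '3') records_table_3
FROM `project.dataset.table_*`
GROUP BY day, hour
ORDER BY day, hour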

Get 30 days prior data for each row of query

I have a query where I have a list of ~20k users for a specific week of the month, which represents that they have logged on to our site.
What I need to get, for each of these users, over the past 30 days:
1. logged on: defined by any rows recorded in the same table
2. max event in the 30-day window, prior to the date in the current where clause
This is the current code snippet that helps me narrow to the ~20k users for a given week to begin with:
select
  user_id,
  max(timestamp)
from table
where timestamp between '2019-02-01' and '2019-02-05'
group by 1;
Expected result set/columns:
user_id,
max(timestamp),
logged_on, [if they have any # of rows in the same table within 30 days prior to their max(timestamp) date]
previous_timestamp, [the 2nd most recent login date within 30 days prior to their max(timestamp) date]
I think this is what you're looking for. Not sure if it's the most efficient method, though; window functions may perform better, but as bob-mccormick mentioned, the tricky bit would be filling in dates where the user (partition key) was not active so that the range query works correctly.
Example data setup (Snowflake syntax)
-- Create sample table
create temporary table user_logins (userid number, date_logged_on timestamp);
-- Insert some random sample data
insert overwrite into user_logins
select
  uniform(1, 10, random()) userid,
  dateadd('minutes', uniform(1, 86400, random()) * -1, current_timestamp::timestamp_ntz) date_logged_on
from table(generator(rowcount => 100));
Select statement
-- Run select
with user_last_logins as (
  select
    userid,
    max(date_logged_on) last_login
  from user_logins
  where date_logged_on between '2019-01-01' and '2019-05-08'
  group by userid
)
select
  user_last_logins.userid,
  max(user_last_logins.last_login) last_logged_on,
  count(prior_30_each_user.userid) num_logins_prior_30,
  max(prior_30_each_user.date_logged_on) previous_timestamp
from user_last_logins
left join user_logins prior_30_each_user
  on user_last_logins.userid = prior_30_each_user.userid
  and prior_30_each_user.date_logged_on > dateadd('day', -30, user_last_logins.last_login)
  and prior_30_each_user.date_logged_on < user_last_logins.last_login
group by user_last_logins.userid;
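For comparison, a sketch of the window-function route mentioned at the top of this answer (same sample table, Snowflake syntax; the output column names are illustrative):

with ranked as (
  select
    userid,
    date_logged_on,
    max(date_logged_on) over (partition by userid) last_login,
    lead(date_logged_on) over (partition by userid order by date_logged_on desc) prev_login
  from user_logins
)
select
  userid,
  last_login,
  -- true when the second-most-recent login falls inside the prior 30-day window
  coalesce(prev_login > dateadd('day', -30, last_login), false) logged_on_prior_30,
  iff(prev_login > dateadd('day', -30, last_login), prev_login, null) previous_timestamp
from ranked
where date_logged_on = last_login;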

BigQuery Cross Join Failing

I'm trying to pull user activity by date. I am trying to build a table of every day since a user account was created, using a cross join and a where clause. In my case, the cross join cannot be avoided. The calendar table is just a list of all dates for the last 365 days (365 rows). The user table has ~1b rows.
Here is the query that fails with insufficient resources:
SELECT
  u.user_id as user_id,
  date(u.created) as signup_date,
  cal.date as date
from (select date(dt) as date from [dw.calendar]
      where date(dt) < CURRENT_DATE()) cal
cross join each dw.user u
where date(u.created) <= cal.date
Based on https://cloud.google.com/bigquery/query-reference, cross joins do not even support the "each" clause. How do I perform the above operation to successfully create a table?
You do not need to fill in "empty" days just to calculate a daily count and apply a window function to get the aggregated sum, so you don't even need the calendar table for this. To make this happen you need to use RANGE instead of ROWS in your window. See the example below (for BigQuery Standard SQL).
#standardSQL
SELECT
  user_id, created, daily_count,
  SUM(daily_count) OVER(
    PARTITION BY user_id ORDER BY created_unix_date DESC
    RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING
  ) weekly_sum
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC
I am not sure about the exact schema/types of your table, so you might need to adjust the above accordingly, but in the meantime you can test/play with the dummy data below:
#standardSQL
WITH `dw.user` AS (
  SELECT
    day AS created,
    CAST(1 + 10 * RAND() AS INT64) AS user_id,
    CAST(100 * RAND() AS INT64) AS daily_count
  FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-26')) AS day
)
SELECT
  user_id, created, daily_count,
  SUM(daily_count) OVER(
    PARTITION BY user_id ORDER BY created_unix_date DESC
    RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING
  ) weekly_sum
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC
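If you do still need the one-row-per-user-per-day expansion from the original question, Standard SQL can build it without a separate calendar table or CROSS JOIN EACH (a sketch; it assumes `dw.user` has a `created` date or timestamp column, and the result is still one row per user per day since signup, so the output stays large):

#standardSQL
SELECT
  u.user_id,
  DATE(u.created) AS signup_date,
  day AS date
FROM `dw.user` u,
  UNNEST(GENERATE_DATE_ARRAY(DATE(u.created), CURRENT_DATE())) AS day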