If I have a table that has the following format:
purchase_time | user_id | items_purchased
The current query I'm doing is something like this:
SELECT user_id, date(purchase_time), sum(items_purchased)
from user_purchase_metrics
GROUP BY date(purchase_time), user_id;
I'm trying to create a query that will fill in 0 for purchases if there isn't an entry in that date for that given user. Is this possible?
Sidestepping the valid concern raised by #xQbert about generating missing dates: performance must always give way to necessity. Without a convenient calendar table, generating the dates of interest is a necessity, and in this case the dates must be generated for each user_id. The following does this by pairing each generated date with each distinct user_id from the user_purchase_metrics table. The result is then LEFT JOINed back to the same table to sum the purchases, giving the desired 0 for the missing dates (see demo; for the dates I just picked March):
with dates( user_id, idate ) as
( select user_id, d::date
from ( select distinct user_id
from user_purchase_metrics
) u
join generate_series( date '2021-03-01' --- start_date
, date '2021-03-31' --- end_date
, interval '1 day'
) gs(d)
on true
) -- select * from dates;
select d.user_id
, d.idate
, coalesce(sum(pm.items_purchased),0)
from dates d
left join user_purchase_metrics pm
on ( pm.user_id = d.user_id
and date(pm.purchase_time) = d.idate
)
group by d.user_id, d.idate
order by d.user_id, d.idate;
To parametrize it, the query can be embedded in a SQL function that returns a table (also in the demo).
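A minimal sketch of what that function could look like (the function name fill_missing_purchases and its parameter names are my own, not from the demo; adjust the returned column types to your schema):
create or replace function fill_missing_purchases(p_start date, p_end date)
  returns table (user_id bigint, idate date, items_purchased bigint)
  language sql
as $$
  with dates(user_id, idate) as
       ( select u.user_id, gs.d::date
           from ( select distinct m.user_id
                    from user_purchase_metrics m
                ) u
          cross join generate_series(p_start, p_end, interval '1 day') gs(d)
       )
  select d.user_id
       , d.idate
       , coalesce(sum(pm.items_purchased), 0)
    from dates d
    left join user_purchase_metrics pm
      on pm.user_id = d.user_id
     and date(pm.purchase_time) = d.idate
   group by d.user_id, d.idate
   order by d.user_id, d.idate;
$$;
-- usage:
-- select * from fill_missing_purchases(date '2021-03-01', date '2021-03-31');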
I have an SQLite database (with Django as the ORM) with a table of change events (an Account is assigned a new Strategy). I would like to convert it to a time series, so that for each day I have the Strategy the Account was following.
My table:
Expected output:
As shown, there can be more than one change per day. In that case I select the last change of the day, since the desired time series output must have only one value per day.
My question is similar to this one, but in SQL rather than BigQuery (and I'm not sure I understood the unnest part they propose). I have a working solution in Pandas with reindex and fillna, but I'm sure there is an elegant and simple solution in SQL (maybe even better with the Django ORM).
You can use a RECURSIVE Common Table Expression to generate all dates between the first and the last, and then join this generated table with your data to get the needed value for each day:
WITH RECURSIVE daterange(d) AS (
    SELECT date(min(created_at)) FROM events
    UNION ALL
    SELECT date(d, '1 day') FROM daterange
    WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange JOIN events
WHERE created_at = (SELECT max(e.created_at) FROM events e
                    WHERE e.account_id = events.account_id
                      AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
The date() function converts a datetime value to a simple date, so you can use it to group your data by date.
date(d, '1 day') applies a modifier of +1 calendar day to d.
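A quick illustration of those two calls (the literals are my own, chosen to match the sample data below):
SELECT date('2022-10-07 12:53:53');   -- 2022-10-07 (datetime truncated to a date)
SELECT date('2022-10-07', '1 day');   -- 2022-10-08 (the modifier adds one calendar day)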
Here is an example with your data:
CREATE TABLE events (
created_at,
account_id,
strategy_id
);
insert into events
VALUES ('2022-10-07 12:53:53', 4801323843, 7),
('2022-10-07 08:10:07', 4801323843, 5),
('2022-10-07 15:00:45', 4801323843, 8),
('2022-10-10 13:01:16', 4801323843, 6);
WITH RECURSIVE daterange(d) AS (
    SELECT date(min(created_at)) FROM events
    UNION ALL
    SELECT date(d, '1 day') FROM daterange
    WHERE d < (SELECT max(created_at) FROM events)
)
SELECT d, account_id, strategy_id
FROM daterange JOIN events
WHERE created_at = (SELECT max(e.created_at) FROM events e
                    WHERE e.account_id = events.account_id
                      AND date(e.created_at) <= d)
GROUP BY account_id, d
ORDER BY account_id, d
d          | account_id | strategy_id
2022-10-07 | 4801323843 | 8
2022-10-08 | 4801323843 | 8
2022-10-09 | 4801323843 | 8
2022-10-10 | 4801323843 | 6
2022-10-11 | 4801323843 | 6
fiddle
The query could be slow with many rows. In that case create an index on the created_at column:
CREATE INDEX events_created_idx ON events(created_at);
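Since the correlated subquery looks up max(created_at) per account_id, a composite index covering both columns might serve it even better (my suggestion, not part of the original answer; benchmark against your data):
CREATE INDEX events_account_created_idx ON events(account_id, created_at);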
My final version is the one proposed by #Andrea B., with just a slight improvement in performance: joining only the rows that we need, and therefore discarding the WHERE clause.
I also converted the NULL to datetime('now').
Here is the final version I used:
with recursive daterange(day) as
(
    select min(date(created_at)) from events
    union all
    select date(day, '1 day') from daterange
    where day < date('now')
),
-- named "ranges" rather than "events": a CTE that selects from a table
-- bearing its own name risks a circular-reference error in SQLite
ranges as (
    select account_id, strategy_id, created_at as start_date,
           coalesce(lead(created_at) over (partition by account_id
                                           order by created_at),
                    datetime('now')) as end_date
    from events
)
select * from daterange
join ranges on ranges.start_date < daterange.day
           and daterange.day < ranges.end_date
order by ranges.account_id
Hope this helps!
I got a table like this:
group_id | start_date | end_date
19335    | 20220613   | 20220714
19527    | 20220620   | 20220719
19339    | 20220614   | 20220720
19436    | 20220616   | 20220715
20095    | 20220711   | 20220809
I am trying to retrieve data from another table that is partitioned, where the data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_ids within the period [start_date, end_date]. What I need is to retrieve, for the users of each group_id, the data of a column/metric over the last 28 days prior to that group's start_date.
My idea is to:
Retrieve distinct user_id per group_id within the period [start_date, end_date]
Retrieve the previous 28 days of metric data prior to the start_date of each group_id
A code snippet that retrieves the data for a single group_id is the following:
WITH users_per_group AS (
SELECT
users_metadata.user_id,
users_metadata.group_id,
FROM
`my_table_users_*` users_metadata
WHERE
_TABLE_SUFFIX BETWEEN '20220314' --start_date
AND '20220413' --end_date
AND experiment_id = 16709
GROUP BY
1,
2
)
SELECT
_TABLE_SUFFIX AS date,
user_id,
SUM(
COALESCE(metric, 0)
) AS metric,
FROM
users_per_group
JOIN `my_metric_table*` metric USING (user_id)
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 28 DAY
)
) -- 28 days before it starts
AND FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 1 DAY
)
) -- 1 day before it starts
GROUP BY
1,
2
ORDER BY
date ASC
Also, I want to avoid retrieving all data (across all dates) for that metric, as the table is huge and it would take a very long time.
Is there an easy way to retrieve the metric data of each user across groups, considering the 28 days prior to the start_date of each group_id?
I can think of 2 approaches.
Join all the tables and then perform your query.
Create dynamic queries for each of your users.
Both approaches require search_from and search_to to be available beforehand, i.e. you need to calculate each user's search range before you do anything. E.g.:
WITH users_per_group AS (
  SELECT
    user_id, group_id,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) search_from,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) search_to
  FROM TableName
)
Once you have this kind of table, you can use either of the mentioned approaches.
Since I don't have your data and don't know your table names, I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user, grp, start_date, end_date
with maintable as (
  select 'India' visit_from, '20161115' as start_date, '20161202' end_date
  union all select 'Sweden', '20161201', '20161202'
),
-- then calculate the search from/to dates for every user and group
user_per_grp as (
  select *,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) search_from, -- change interval as per your need
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) search_to
  from maintable
)
select visit_from, _TABLE_SUFFIX date, count(visitId) total_visits
from user_per_grp ug
left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub
  on pub.geoNetwork.country = ug.visit_from
where _TABLE_SUFFIX between format_date("%Y%m%d", ug.search_from) and format_date("%Y%m%d", ug.search_to)
group by 1, 2
Approach 2
declare queries array<string> default [];
create temp table maintable as (
  select 'India' visit_from, '20161115' as start_date, '20161202' end_date
  union all select 'Sweden', '20161201', '20161202'
);
create temp table user_per_grp as (
  select *,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) search_from,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) search_to
  from maintable
);
-- for each user create a separate query here
FOR record IN (SELECT * from user_per_grp)
DO
set queries = queries || [format('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2',record.visit_from,record.search_from,record.search_to,record.visit_from)];
--replace your query here.
END FOR;
--aggregating all the queries and executing it
execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
Here the 2nd approach processed much less data (~750 KB) than the 1st approach (~17 MB). But that might not be the same for your dataset, as the date ranges may overlap for 2 users, which would lead to reading the same table twice.
I'm trying to create a query in Toad for Oracle that allows me to pull users who have had more than a one-day gap between their previous and current supervisor(s) with a Supervisor Type of 'Registered Principal'.
For example, if the user has a Supervisor with an end date of 10/20/2019, I would expect to see a new Supervisor assigned by 10/21/2019. If not, I would want those exceptions displayed, since as of 10/22/2019 there is a one-day gap. If a date of '12/31/9999' is displayed, that means the supervisor assignment is current.
SELECT DISTINCT a.AssocID, a.SupervisorAssocID, TRUNC(a.StartDate),
TRUNC(a.EndDate), a.SupervisorType
FROM TableName a
INNER JOIN (SELECT AssocID, StartDate, EndDate
FROM TableName
) b ON a.AssocID = b.AssocID
WHERE a.StartDate != TRUNC(b.StartDate)
AND TRUNC(b.EndDate) > a.StartDate
AND a.StartDate != TRUNC(b.EndDate)
AND a.SupervisorType = 'Registered Principal';
I expect to only see users who have had a gap of more than one day between Supervisors.
You can use the LEAD analytic function to get the next start date:
SELECT *
FROM (
SELECT a.*,
LEAD( startdate ) OVER (
PARTITION BY AssocId
ORDER BY StartDate ASC
) AS next_startdate
FROM tablename a
-- WHERE SupervisorType = 'Registered Principal'
)
WHERE SupervisorType = 'Registered Principal'
AND TRUNC( enddate ) + INTERVAL '1' DAY < TRUNC( next_startdate )
Note: it's unclear where you want to filter on SupervisorType; your query makes it seem like it should be in the outer query, but it could be the inner query if you only want to consider gaps between Registered Principals and not any other type of supervisor.
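For instance, a sketch of the inner-filter variant (simply the commented-out WHERE moved inside, so that gaps are measured between consecutive Registered Principals only):
SELECT *
FROM (
    SELECT a.*,
           LEAD( startdate ) OVER (
               PARTITION BY AssocId
               ORDER BY StartDate ASC
           ) AS next_startdate
    FROM tablename a
    WHERE SupervisorType = 'Registered Principal'
)
WHERE TRUNC( enddate ) + INTERVAL '1' DAY < TRUNC( next_startdate )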
I need help with a business days calculation.
I have two tables:
1) One table ACTUAL_TABLE containing order date and contact date with timestamp datatypes.
2) The second table BUSINESS_DATES has each of the calendar dates listed and has a flag to indicate weekend days.
Using these two tables, I need to ensure that business days, not calendar days (the current logic), are calculated between these two fields.
My thought process was to first get a range of dates by comparing ORDER_DATE with the TABLE_DATE field, and then do a similar comparison of CONTACT_DATE to the TABLE_DATE field. This would get me a range from the BUSINESS_DATES table which I could then use to calculate COUNT(*) of days and SUM(Holiday_WKND_Flag), making the result look like:
Order# | Count(*) As DAYS | SUM(WEEKEND DATES)
100 | 25 | 8
However, this only works when I use a specific order number; I can't bring all order numbers into a subquery.
My Query:
SELECT SUM(Holiday_WKND_Flag), COUNT(*) FROM
(
  SELECT *
  FROM BUSINESS_DATES
  WHERE BUSINESS.Business BETWEEN (SELECT ORDER_DATE FROM ACTUAL_TABLE
                                   WHERE ORDER# = '100')
                              AND (SELECT CONTACT_DATE FROM ACTUAL_TABLE
                                   WHERE ORDER# = '100')
) TEMP
Uploading the table structure for your reference.
SELECT ORDER#, SUM(Holiday_WKND_Flag), COUNT(*)
FROM business_dates bd
INNER JOIN actual_table at ON bd.table_date BETWEEN at.order_date AND at.contact_date
GROUP BY ORDER#
Instead of joining on a BETWEEN (which always results in a bad Product Join) followed by a COUNT, you'd do better to assign a business day number to each date (in the best case this is calculated only once and added as a column to your calendar table). Then it's two Equi-Joins and no aggregation is needed:
WITH cte AS
(
SELECT
Cast(table_date AS DATE) AS table_date,
-- assign a consecutive number to each business day, i.e. not increased during weekends, etc.
Sum(CASE WHEN Holiday_WKND_Flag = 1 THEN 0 ELSE 1 end)
Over (ORDER BY table_date
ROWS Unbounded Preceding) AS business_day_nbr
FROM business_dates
)
SELECT ORDER#,
Cast(t.contact_date AS DATE) - Cast(t.order_date AS DATE) AS #_of_days,
b2.business_day_nbr - b1.business_day_nbr AS #_of_business_days
FROM actual_table AS t
JOIN cte AS b1
ON Cast(t.order_date AS DATE) = b1.table_date
JOIN cte AS b2
ON Cast(t.contact_date AS DATE) = b2.table_date
Btw, why are table_date and order_date timestamp instead of a date?
Porting from Oracle?
You can use this query. Hope it helps:
select order#,
order_date,
contact_date,
(select count(1)
from business_dates_table
where table_date between a.order_date and a.contact_date
and holiday_wknd_flag = 0
) business_days
from actual_table a
I want to count IDs per month using generate_series(). This query works in PostgreSQL 9.1:
SELECT (to_char(serie,'yyyy-mm')) AS year, sum(amount)::int AS eintraege FROM (
SELECT
COUNT(mytable.id) as amount,
generate_series::date as serie
FROM mytable
RIGHT JOIN generate_series(
(SELECT min(date_from) FROM mytable)::date,
(SELECT max(date_from) FROM mytable)::date,
interval '1 day') ON generate_series = date(date_from)
WHERE version = 1
GROUP BY generate_series
) AS foo
GROUP BY Year
ORDER BY Year ASC;
This is my output:
"2006-12" | 4
"2007-02" | 1
"2007-03" | 1
But what I want to get is this output ('0' value in January):
"2006-12" | 4
"2007-01" | 0
"2007-02" | 1
"2007-03" | 1
Months without id should be listed nevertheless.
Any ideas how to solve this?
Sample data:
drop table if exists mytable;
create table mytable(id bigint, version smallint, date_from timestamp);
insert into mytable(id, version, date_from) values
(4084036, 1, '2006-12-22 22:46:35'),
(4084938, 1, '2006-12-23 16:19:13'),
(4084938, 2, '2006-12-23 16:20:23'),
(4084939, 1, '2006-12-23 16:29:14'),
(4084954, 1, '2006-12-23 16:28:28'),
(4250653, 1, '2007-02-12 21:58:53'),
(4250657, 1, '2007-03-12 21:58:53')
;
Untangled, simplified and fixed, it might look like this:
SELECT to_char(s.tag,'yyyy-mm') AS monat
, count(t.id) AS eintraege
FROM (
SELECT generate_series(min(date_from)::date
, max(date_from)::date
, interval '1 day'
)::date AS tag
FROM mytable t
) s
LEFT JOIN mytable t ON t.date_from::date = s.tag AND t.version = 1
GROUP BY 1
ORDER BY 1;
db<>fiddle here
Among all the noise, misleading identifiers and unconventional format, the actual problem was hidden here:
WHERE version = 1
You made correct use of RIGHT [OUTER] JOIN. But adding a WHERE clause that requires an existing row from mytable effectively converts the RIGHT [OUTER] JOIN into an [INNER] JOIN.
Move that filter into the JOIN condition to make it work.
I simplified some other things while being at it.
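To see the effect in isolation against the sample data above, here is a minimal contrast pair (the fixed December 2006 date range is my own choice, wide enough to cover the sample rows):
-- filter in WHERE: the NULL-extended rows fail "t.version = 1",
-- so dates without matches disappear (the outer join degrades to an inner join)
SELECT s.tag, count(t.id)
FROM (SELECT generate_series(date '2006-12-01', date '2006-12-31'
                           , interval '1 day')::date AS tag) s
LEFT JOIN mytable t ON t.date_from::date = s.tag
WHERE t.version = 1
GROUP BY 1;
-- filter in the join condition: unmatched dates survive with count 0
SELECT s.tag, count(t.id)
FROM (SELECT generate_series(date '2006-12-01', date '2006-12-31'
                           , interval '1 day')::date AS tag) s
LEFT JOIN mytable t ON t.date_from::date = s.tag AND t.version = 1
GROUP BY 1;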
Better yet:
SELECT to_char(mon, 'yyyy-mm') AS monat
, COALESCE(t.ct, 0) AS eintraege
FROM (
SELECT date_trunc('month', date_from)::date AS mon
, count(*) AS ct
FROM mytable
WHERE version = 1
GROUP BY 1
) t
RIGHT JOIN (
SELECT generate_series(date_trunc('month', min(date_from))
, max(date_from)
, interval '1 mon')::date
FROM mytable
) m(mon) USING (mon)
ORDER BY mon;
db<>fiddle here
It's much cheaper to aggregate first and join later - joining one row per month instead of one row per day.
It's cheaper to base GROUP BY and ORDER BY on the date value instead of the rendered text.
count(*) is a bit faster than count(id), while equivalent in this query.
generate_series() is a bit faster and safer when based on timestamp instead of date. See:
Generating time series between two dates in PostgreSQL