SQL - Unequal left join BigQuery - sql

New here. I am trying to get the daily and weekly active users over time. Users have 30 days before they are considered inactive. My goal is to create graphs that can be split by user_id to show cohorts, regions, categories, etc.
I have created a date table to get every day for the time period and I have the simplified orders table with the base info that I need to calculate this.
I am trying to do a Left Join to get the status by date using the following SQL Query:
WITH daily_use AS (
SELECT
__key__.id AS user_id
, DATE_TRUNC(date(placeOrderDate), day) AS activity_date
FROM `analysis.Order`
where isBuyingGroupOrder = TRUE
AND testOrder = FALSE
GROUP BY 1, 2
),
dates AS (
SELECT DATE_ADD(DATE "2016-01-01", INTERVAL d.d DAY) AS date
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY __key__.id) -1 AS d
FROM `analysis.Order`
ORDER BY __key__.id
LIMIT 1096
) AS d
ORDER BY 1 DESC
)
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
LEFT JOIN daily_use
ON wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
I am getting this error in BigQuery: LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join. How can I get around this? I am using Standard SQL within BigQuery.
Thank you

The query below is for BigQuery Standard SQL and mostly reproduces the logic in your query, except that it does not include days on which no activity at all is found:
#standardSQL
SELECT
daily_use.user_id
, wd.date AS DATE
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
-- ORDER BY 1,2
If for whatever reason you still need to reproduce your logic exactly, you can wrap the above in a final left join as below:
#standardSQL
SELECT *
FROM dates AS wd
LEFT JOIN (
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
) AS daily_use
USING (date)
-- ORDER BY 1,2
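To check the range logic outside BigQuery, here is a minimal Python sketch of the same idea: pair every calendar date with every activity date, keep the pairs that fall inside the 30-day window, and take the smallest gap per user and date. The function name days_since_last_action, the parameter window_days and the sample data are made up for illustration; they are not part of the original query.

```python
from datetime import date, timedelta

def days_since_last_action(dates, daily_use, window_days=30):
    """dates: list of date; daily_use: list of (user_id, activity_date).
    Returns {(user_id, date): days since the most recent activity},
    only for dates that fall within window_days of some activity."""
    result = {}
    for d in dates:                          # CROSS JOIN dates x daily_use
        for user_id, activity in daily_use:
            gap = (d - activity).days
            if 0 <= gap < window_days:       # the range condition from the WHERE clause
                key = (user_id, d)
                # MIN(DATE_DIFF(...)) ... GROUP BY user_id, date
                result[key] = min(gap, result.get(key, gap))
    return result

# Hypothetical sample data: one user active on 1 and 5 January 2016.
dates = [date(2016, 1, 1) + timedelta(days=i) for i in range(40)]
daily_use = [("u1", date(2016, 1, 1)), ("u1", date(2016, 1, 5))]
out = days_since_last_action(dates, daily_use)
```

Dates with no entry in the result correspond to the rows that only the outer left join on the dates table would bring back, with NULL for user_id.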

Related

In Postgres how do I write a SQL query to select distinct values overall but aggregated over a set time period

What I mean by this is if I have a table called payments with a created_at column and user_id column I want to select the count of purchases aggregated weekly (can be any interval I want) but only selecting first time purchases e.g. if a user purchased for the first time in week 1 it would be counted but if he purchased again in week 2 he would not be counted.
created_at  user_id
timestamp   1
timestamp   1
This is the query I came up with. The issue is if the user purchases multiple times they are all included. How can I improve this?
WITH dates AS
(
SELECT *
FROM generate_series(
'2022-07-22T15:30:06.687Z'::DATE,
'2022-11-21T17:04:59.457Z'::DATE,
'1 week'
) date
)
SELECT
dates.date::DATE AS date,
COALESCE(COUNT(DISTINCT(user_id)), 0) AS registrations
FROM
dates
LEFT JOIN
payment ON created_at::DATE BETWEEN dates.date AND dates.date::date + '1 ${dateUnit}'::INTERVAL
GROUP BY
dates.date
ORDER BY
dates.date DESC;
You want to count only first purchases. So get those first purchases in the first step and work with these.
WITH dates AS
(
SELECT *
FROM generate_series(
'2022-07-22T15:30:06.687Z'::DATE,
'2022-11-21T17:04:59.457Z'::DATE,
'1 week'
) date
)
, first_purchases AS
(
SELECT user_id, MIN(created_at::DATE) AS purchase_date
FROM payment
GROUP BY user_id
)
SELECT
d.date,
COALESCE(COUNT(p.purchase_date), 0) AS registrations
FROM
dates d
LEFT JOIN
first_purchases p ON p.purchase_date >= d.date
AND p.purchase_date < d.date + '1 ${dateUnit}'::INTERVAL
GROUP BY
d.date
ORDER BY
d.date DESC;
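The two steps in that query (reduce each user to a first purchase date, then count those dates per week) can be sketched in plain Python. This is only an illustration under assumed sample data; first_purchases_per_week is a hypothetical helper name and the weeks are fixed at 7 days rather than a configurable interval.

```python
from datetime import date, timedelta

def first_purchases_per_week(payments, start, end):
    """payments: list of (user_id, purchase_date).
    Returns {week_start: number of users whose first purchase falls in that week}."""
    first = {}
    for user_id, d in payments:              # MIN(created_at::date) GROUP BY user_id
        if user_id not in first or d < first[user_id]:
            first[user_id] = d
    counts = {}
    week = start
    while week <= end:                       # generate_series(start, end, '1 week')
        week_end = week + timedelta(days=7)
        # LEFT JOIN on a half-open range: week <= purchase_date < week_end
        counts[week] = sum(week <= d < week_end for d in first.values())
        week = week_end
    return counts

# Hypothetical sample data: user 1 buys twice, user 2 once.
payments = [(1, date(2022, 7, 23)), (1, date(2022, 8, 1)), (2, date(2022, 8, 1))]
counts = first_purchases_per_week(payments, date(2022, 7, 22), date(2022, 8, 12))
```

User 1's second purchase is ignored because only the earliest date per user survives the first step, and weeks without any first purchase still appear with a count of 0.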

Postgresql left join date_trunc with default values

I have 3 tables which I'm querying to get the data based on different conditions. I have from and to params and these are the ones I'm using to create a range of time in which I'm looking for the data in those tables.
For instance, if from equals '2020-07-01' and to equals '2020-08-01', I expect to receive the row values of the tables grouped by week. If some weeks have no records I want to return 0, and if several tables have records for the same week I'd like to sum them.
Currently I have this:
SELECT d.day, COALESCE(t.total, 0)
FROM (
SELECT day::date
FROM generate_series(timestamp '2020-07-01',
timestamp '2020-08-01',
interval '1 week') day
) d
LEFT JOIN (
SELECT date AS day,
SUM(total)
FROM table1
WHERE id = '1'
AND date BETWEEN '2020-07-01' AND '2020-08-01'
GROUP BY day
) t USING (day)
ORDER BY d.day;
I'm generating a series of dates grouped by week, and on top of that I'm adding a left join. Now for some reason, it only works if the dates match exactly; otherwise COALESCE(t.total, 0) returns 0 even if the SUM(total) for that week is not 0.
I'm applying the same kind of LEFT JOIN to other tables in the same query, so I'm running into the same problem there.
Please see if this works for you. Whenever you find yourself aggregating more than once, ask yourself whether it is necessary.
Rather than try to match on discrete days, use time ranges.
with limits as (
select '2020-07-01'::timestamp as dt_start,
'2020-08-01'::timestamp as dt_end
), weeks as (
SELECT x.day::date as day, least(x.day::date + 7, dt_end::date) as day_end
FROM limits l
CROSS JOIN LATERAL
generate_series(l.dt_start, l.dt_end, interval '1 week') as x(day)
WHERE x.day::date != least(x.day::date + 7, dt_end::date)
), t1 as (
select w.day,
sum(coalesce(t.total, 0)) as t1total
from weeks w
left join table1 t
on t.id = 1
and t.date >= w.day
and t.date < w.day_end
group by w.day
), t2 as (
select w.day,
sum(coalesce(t.sum_measure, 0)) as t2total
from weeks w
left join table2 t
on t.something = 'whatever'
and t.date >= w.day
and t.date < w.day_end
group by w.day
)
select t1.day,
t1.t1total,
t2.t2total
from t1
join t2 on t2.day = t1.day;
You can keep adding tables like that with CTEs.
My earlier example with multiple left join was bad because it blows out the rows due to a lack of join conditions between the left-joined tables.
There is an interesting corner case for e.g. 2019-02-01 to 2019-03-01 which returns an empty interval as the last week. I have updated to filter that out.
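The half-open week ranges and the empty trailing interval mentioned above are easy to see in a small Python sketch. The helper names week_ranges and weekly_totals and the sample rows are assumptions made for illustration only.

```python
from datetime import date, timedelta

def week_ranges(dt_start, dt_end):
    """Return half-open (day, day_end) ranges, one per week, with the last
    range clamped to dt_end and an empty trailing range dropped."""
    ranges = []
    day = dt_start
    while day <= dt_end:                                  # generate_series(..., '1 week')
        day_end = min(day + timedelta(days=7), dt_end)    # least(day + 7, dt_end)
        if day != day_end:                                # filter out the empty interval
            ranges.append((day, day_end))
        day += timedelta(days=7)
    return ranges

def weekly_totals(rows, ranges):
    """rows: list of (date, total). Sum totals per range; weeks with no rows give 0."""
    return {day: sum(t for d, t in rows if day <= d < day_end)
            for day, day_end in ranges}

feb = week_ranges(date(2019, 2, 1), date(2019, 3, 1))
jul = week_ranges(date(2020, 7, 1), date(2020, 8, 1))
# Hypothetical rows: two in the first week of July, one in the third.
totals = weekly_totals([(date(2020, 7, 3), 5), (date(2020, 7, 7), 2),
                        (date(2020, 7, 16), 4)], jul)
```

For 2019-02-01 to 2019-03-01 the series lands exactly on the end date, so without the filter the last range would run from 2019-03-01 to 2019-03-01 and match nothing.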

How to do query on multiple dates on certain range on Google Big Query

I'm running quite a long query to find customers that match a certain condition on a certain date, in this case '2019-6-20'. The query looks like this:
Here's my code
select current_date() as date , count(customer_id) as cell13
from(
select customer_id, count(id) as total, string_agg(payment_state order by created_at desc limit 1) as cek
from(
select distinct(A.id), A.customer_id, extract(month from A.created_at) as months,extract(day from A.created_at) as days, extract(year from A.created_at) as years, payment_state, A.created_at, A.grandtotal_cents
from bl.orders as A
left join bl.blacklists as B
on A.customer_id = B.customer_id
where date(A.created_at) >= date_sub(date('2019-6-20') , interval 60 day) and grandtotal_cents > 0 and B.customer_id is null
)
group by customer_id
having cek = "unpaid")
Here's the result
Row date cell13
1 2019-06-21 696
Now I need to run this for multiple dates in a date range, for example 2019-03-23 to 2019-06-21. How should I do this so that the output looks like this?
Row date cell13
1 2019-06-21 696
...
90 2019-03-23 ...
You can generate a table of dates using generate_date_array() and unnest() and then use this with a left join.
Overall, though, your query is a mess and hard to follow, but here is the idea:
with dates as (
select dte
from (select generate_date_array('2019-03-23', '2019-06-21', interval 1 day) d
) d cross join
unnest(d.d) dte
)
select . . .
from dates left join
bl.orders o
on date(o.created_at) >= date_sub(dte, interval 60 day)
. . .

LEFT OUTER JOIN Error creating a subquery on bigquery

I'm trying to compute MAU, WAU and DAU from an event table in BigQuery.
I wrote a query that finds DAU and then uses it to find WAU and MAU,
but it does not work. I receive this error:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
This is my query:
WITH dau AS (
SELECT
date,
COUNT(DISTINCT(events.device_id)) as DAU_explorer
FROM `workspace.event_table` as events
GROUP BY 1
)
SELECT
date,
dau,
(SELECT
COUNT(DISTINCT(device_id))
FROM `workspace.event_table` as events
WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -30 DAY) AND dau.date
) AS mau,
(SELECT
COUNT(DISTINCT(device_id)) as DAU_explorer
FROM `workspace.event_table` as events
WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -7 DAY) AND dau.date
) AS wau
FROM dau
Where is my error? Is it not possible to run subqueries like this in BigQuery?
Try this instead:
#standardSQL
WITH data AS (
SELECT DATE(creation_date) date, owner_user_id device_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
)
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT IF(i<31,device_id,null)) unique_30_day_users
, COUNT(DISTINCT IF(i<8,device_id,null)) unique_7_day_users
FROM `data`, UNNEST(GENERATE_ARRAY(1, 30)) i
GROUP BY 1
ORDER BY date_grp
LIMIT 100
OFFSET 30
And if you are looking for a more efficient solution, try approximate results, for example with APPROX_COUNT_DISTINCT.
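The trick in that query is to fan each event out to 30 group dates instead of joining on a date range. A rough Python equivalent, with made-up sample events and a hypothetical function name rolling_uniques, looks like this. Note that, as written with DATE_SUB, each date_grp collects the devices seen on the 30 (or 7) days after it.

```python
from collections import defaultdict
from datetime import date, timedelta

def rolling_uniques(events, max_days=30, short_days=7):
    """events: iterable of (event_date, device_id).
    Returns {date_grp: (unique_30_day_users, unique_7_day_users)}."""
    long_sets = defaultdict(set)
    short_sets = defaultdict(set)
    for event_date, device_id in events:
        for i in range(1, max_days + 1):           # UNNEST(GENERATE_ARRAY(1, 30)) i
            grp = event_date - timedelta(days=i)   # DATE_SUB(date, INTERVAL i DAY)
            long_sets[grp].add(device_id)          # COUNT(DISTINCT IF(i < 31, ...))
            if i <= short_days:                    # COUNT(DISTINCT IF(i < 8, ...))
                short_sets[grp].add(device_id)
    return {g: (len(long_sets[g]), len(short_sets[g])) for g in long_sets}

# Hypothetical events: device a on two days, device b on one.
events = [(date(2017, 1, 31), "a"), (date(2017, 1, 31), "b"), (date(2017, 1, 10), "a")]
uniques = rolling_uniques(events)
```

Because every event is copied once per offset, the intermediate data is up to 30 times the size of the input, which is the cost of avoiding the non-equality join.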

Oracle - Split a record into multiple records

I have a schedule table for each month schedule. And this table also has days off within that month. I need a result set that will tell working days and off days for that month.
Eg.
CREATE TABLE SCHEDULE(sch_yyyymm varchar2(6), sch varchar2(20), sch_start_date date, sch_end_date date);
INSERT INTO SCHEDULE VALUES('201703','Working Days', to_date('03/01/2017','mm/dd/yyyy'), to_date('03/31/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','Off Day', to_date('03/05/2017','mm/dd/yyyy'), to_date('03/07/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','off Days', to_date('03/08/2017','mm/dd/yyyy'), to_date('03/10/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','off Days', to_date('03/15/2017','mm/dd/yyyy'), to_date('03/15/2017','mm/dd/yyyy'));
Using SQL or PL/SQL I need to split the record with Working Days and Off Days.
From above records I need result set as:
201703 Working Days 03/01/2017 - 03/04/2017
201703 Off Days 03/05/2017 - 03/10/2017
201703 Working Days 03/11/2017 - 03/14/2017
201703 Off Days 03/15/2017 - 03/15/2017
201703 Working Days 03/16/2017 - 03/31/2017
Thank You for your help.
Edit: I've had a bit more of a think, and this approach works fine for your insert records above - however, it misses records where there are not continuous "off day" periods. I need to have a bit more of a think and will then make some changes
I've put together a test using the lead and lag functions and a self join.
The upshot is you self-join the "Off Days" onto the existing tables to find the overlaps. Then calculate the start/end dates on either side of each record. A bit of logic then lets us work out which date to use as the final start/end dates.
SQL fiddle here - I used Postgres as the Oracle function wasn't working but it should translate ok.
select sch,
/* Work out which date to use as this record's Start date */
case when prev_end_date is null then sch_start_date
else off_end_date + 1
end as final_start_date,
/* Work out which date to use as this record's end date */
case when next_start_date is null then sch_end_date
when next_start_date is not null and prev_end_date is not null then next_start_date - 1
else off_start_date - 1
end as final_end_date
from (
select a.*,
b.*,
/* Get the start/end dates for the records on either side of each working day record */
lead( b.off_start_date ) over( partition by a.sch_start_date order by b.off_start_date ) as next_start_date,
lag( b.off_end_date ) over( partition by a.sch_start_date order by b.off_start_date ) as prev_end_date
from (
/* Get all schedule records */
select sch,
sch_start_date,
sch_end_date
from schedule
) as a
left join
(
/* Get all non-working day schedule records */
select sch as off_sch,
sch_start_date as off_start_date,
sch_end_date as off_end_date
from schedule
where sch <> 'Working Days'
) as b
/* Join on "Off Days" that overlap "Working Days" */
on a.sch_start_date <= b.off_end_date
and a.sch_end_date >= b.off_start_date
and a.sch <> b.off_sch
) as c
order by final_start_date
If you had a dates table this would have been easier.
You can construct a dates table using a recursive cte and join on to it. Then use the difference-of-row-numbers approach to classify rows with the same schedule on consecutive dates into one group, and take the min and max of each group, which are the start and end dates for a given sch. I assume there are only 2 sch values: Working Days and Off Day.
with dates(dt) as (select date '2017-03-01' from dual
union all
select dt+1 from dates where dt < date '2017-03-31')
,groups as (select sch_yyyymm,dt,sch,
row_number() over(partition by sch_yyyymm order by dt)
- row_number() over(partition by sch_yyyymm,sch order by dt) as grp
from (select s.sch_yyyymm,d.dt,
/*This condition is to avoid a given date with 2 sch values, as 03-01-2017 - 03-31-2017 are working days
on one row and there is an Off Day status for some of these days.
In such cases Off Day would be picked up as sch*/
case when count(*) over(partition by d.dt) > 1 then min(s.sch) over(partition by d.dt) else s.sch end as sch
from dates d
join schedule s on d.dt >= s.sch_start_date and d.dt <= s.sch_end_date
) t
)
select sch_yyyymm,sch,min(dt) as start_date,max(dt) as end_date
from groups
group by sch_yyyymm,sch,grp
I couldn't get the recursive cte running in Oracle. Here is a demo using SQL Server.
Sample Demo in SQL Server
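The difference-of-row-numbers idea can also be tried out in Python. The sketch below is an illustration under assumptions: collapse_schedule is a hypothetical helper, every off period is labelled "Off Days" regardless of the spelling in the table, and a day covered by any off period counts as off.

```python
from datetime import date, timedelta

def collapse_schedule(start, end, off_ranges):
    """Label each day from start to end, then merge consecutive days with the
    same label using the difference between two running row numbers."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    per_label_rn = {}   # row_number() over (partition by sch order by dt)
    groups = {}         # (label, grp) -> [first_day, last_day]
    for overall_rn, d in enumerate(days, start=1):   # row_number() over (order by dt)
        off = any(a <= d <= b for a, b in off_ranges)
        label = "Off Days" if off else "Working Days"
        per_label_rn[label] = per_label_rn.get(label, 0) + 1
        grp = overall_rn - per_label_rn[label]       # constant within one island
        span = groups.setdefault((label, grp), [d, d])
        span[1] = d                                  # max(dt) so far for this island
    return sorted((first, last, label)
                  for (label, _), (first, last) in groups.items())

# Off periods taken from the INSERT statements in the question.
off_ranges = [(date(2017, 3, 5), date(2017, 3, 7)),
              (date(2017, 3, 8), date(2017, 3, 10)),
              (date(2017, 3, 15), date(2017, 3, 15))]
periods = collapse_schedule(date(2017, 3, 1), date(2017, 3, 31), off_ranges)
```

The two adjacent off periods (03/05 to 03/07 and 03/08 to 03/10) merge into one island because their days are consecutive and share a label, which matches the result set asked for in the question.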