Reducing Series of Dates to minimal representation in BigQuery

I have a table like:
start_date|end_date
1/1/2018|1/5/2018
1/4/2018|1/10/2018
1/9/2018|1/22/2018
2/1/2018|2/1/2018
1/31/2018|2/5/2018
I want to get the minimal set of date ranges covered by these rows, so the result would look like:
1/1/2018|1/22/2018
1/31/2018|2/5/2018
Is there a function in BigQuery that can handle this?

There is no such function, but you can try something like the query below (BigQuery Standard SQL):
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1/1/2018' start_date, '1/5/2018' end_date UNION ALL
SELECT '1/4/2018', '1/10/2018' UNION ALL
SELECT '1/9/2018', '1/22/2018' UNION ALL
SELECT '2/1/2018', '2/1/2018' UNION ALL
SELECT '1/31/2018', '2/5/2018'
), parsed_as_dates AS (
SELECT PARSE_DATE('%m/%d/%Y', start_date) start_date, PARSE_DATE('%m/%d/%Y', end_date) end_date
FROM `project.dataset.table`
), days AS (
SELECT day FROM
(SELECT MIN(start_date) min_date, MAX(end_date) max_date FROM parsed_as_dates),
UNNEST(GENERATE_DATE_ARRAY(min_date, max_date)) day
), temp AS (
SELECT day, SIGN(COUNTIF(day BETWEEN start_date AND end_date)) flag
FROM days CROSS JOIN parsed_as_dates GROUP BY day
)
SELECT MIN(day) start_date, MAX(day) end_date
FROM (
SELECT day, flag, SUM(start) OVER(ORDER BY day) grp
FROM (
SELECT day, flag, ABS(flag - IFNULL(LAG(flag) OVER(ORDER BY day), 0)) start
FROM temp
)
)
WHERE flag = 1
GROUP BY grp
-- ORDER BY start_date
with the result below:
Row start_date end_date
1 2018-01-01 2018-01-22
2 2018-01-31 2018-02-05
This is just a quick idea. It looks a little over-engineered to me, so you might want to refactor it, but at least it does the job.
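For comparison, the same reduction can be sketched outside SQL with a single sort-and-merge pass. This is a minimal Python sketch, not part of the original answer: `merge_date_ranges` is a hypothetical helper name, and it assumes inclusive ranges where a range starting the day after another one ends is merged into it, which matches how the day-by-day query above behaves.

```python
from datetime import date, timedelta

def merge_date_ranges(ranges):
    """Collapse inclusive (start, end) date ranges into the minimal covering set."""
    merged = []
    for start, end in sorted(ranges):
        # Extend the last range when this one overlaps it or starts the very next day.
        if merged and start <= merged[-1][1] + timedelta(days=1):
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Sample rows from the question.
rows = [
    (date(2018, 1, 1), date(2018, 1, 5)),
    (date(2018, 1, 4), date(2018, 1, 10)),
    (date(2018, 1, 9), date(2018, 1, 22)),
    (date(2018, 2, 1), date(2018, 2, 1)),
    (date(2018, 1, 31), date(2018, 2, 5)),
]
result = merge_date_ranges(rows)
for start, end in result:
    print(start, end)
```

Unlike the query, this never enumerates individual days, so its cost depends on the number of rows rather than on the width of the overall date span.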

Related

Taking Count Based On Year and Month from Date Columns

I want to count rows by year and month based on a from date and a to date. Using the from and to dates, I am trying to derive the years and months each range covers, and then count per month and year. Can someone suggest how I can implement this?
Database : Snowflake

You want more or less the solution to this other question, but here, let me do all the work for you:
WITH data_table(start_date, end_date) as (
SELECT * from values
('2022-01-15'::date, '2022-02-12'::date),
('2021-12-25'::date, '2022-03-18'::date),
('2022-02-25'::date, '2022-03-06'::date),
('2021-10-20'::date, '2022-01-07'::date)
), large_range as (
SELECT row_number() over (order by null)-1 as rn
FROM table(generator(ROWCOUNT => 1000))
), pre_condition as (
SELECT
date_trunc('month', start_date) as month_start
,datediff('month', month_start, date_trunc('month', end_date)) as m
FROM data_table
)
SELECT
to_char(dateadd('month', r.rn, d.month_start),'MON-YY') as month_yr
,count(*) as count
FROM pre_condition as d
JOIN large_range as r ON r.rn <= d.m
GROUP BY 1;
MONTH_YR|COUNT
Jan-22|3
Dec-21|2
Feb-22|3
Oct-21|1
Nov-21|1
Mar-22|2
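The month expansion that the generator join performs can be cross-checked with a short Python sketch. It is only an illustration under the same sample data: `months_covered` is a hypothetical helper, and months are represented as `(year, month)` tuples rather than formatted strings.

```python
from collections import Counter
from datetime import date

def months_covered(start, end):
    """Yield every (year, month) touched by the inclusive range start..end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield (year, month)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

# Same sample rows as the data_table CTE above.
ranges = [
    (date(2022, 1, 15), date(2022, 2, 12)),
    (date(2021, 12, 25), date(2022, 3, 18)),
    (date(2022, 2, 25), date(2022, 3, 6)),
    (date(2021, 10, 20), date(2022, 1, 7)),
]

# One count per range for each month that range touches.
month_counts = Counter(ym for start, end in ranges for ym in months_covered(start, end))
for (year, month), count in sorted(month_counts.items()):
    print(f"{year}-{month:02d}", count)
```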

How to get max date among others ids for current id using BigQuery?

I need to get, for each row, the max date over all the other ids. Of course I can do this with a CROSS JOIN and a JOIN, like this:
WITH t AS (
SELECT 1 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-09-01','2021-09-09', INTERVAL 1 DAY)) rep_date
UNION ALL
SELECT 2 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-20','2021-09-03', INTERVAL 1 DAY)) rep_date
UNION ALL
SELECT 3 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-25','2021-09-05', INTERVAL 1 DAY)) rep_date
)
SELECT id, rep_date, MAX(rep_date) OVER (PARTITION BY id) max_date, max_date_over_others FROM t
JOIN (
SELECT t.id, MAX(max_date) max_date_over_others FROM t
CROSS JOIN (
SELECT id, MAX(rep_date) max_date FROM t
GROUP BY 1
) t1
WHERE t1.id <> t.id
GROUP BY 1
) USING (id)
But this is too awkward for huge tables, so I'm looking for a simpler way to do this. Any ideas?

I think your version is good enough. But if you want to try other options, consider the approach below. It might look more verbose at first glance, but it should be more efficient and cheaper than your version with the cross join. The temp CTE is meant to follow the t CTE in the question's WITH clause:
temp as (
select id,
greatest(
ifnull(max(max_date_for_id) over preceding_ids, '1970-01-01'),
ifnull(max(max_date_for_id) over following_ids, '1970-01-01')
) as max_date_for_rest_ids
from (
select id, max(rep_date) max_date_for_id
from t
group by id
)
window
preceding_ids as (order by id rows between unbounded preceding and 1 preceding),
following_ids as (order by id rows between 1 following and unbounded following)
)
select *
from t
join temp
using (id)
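The two window frames above amount to a prefix maximum and a suffix maximum over the per-id maxima. A minimal Python sketch of that idea follows; `max_of_others` is a hypothetical name, and it returns `None` instead of the '1970-01-01' placeholder when there is no other id.

```python
from datetime import date

def max_of_others(max_by_id):
    """For each id, return the largest per-id max date among all the other ids."""
    ids = sorted(max_by_id)
    prefix = []  # prefix[i]: max over ids strictly before position i
    running = None
    for key in ids:
        prefix.append(running)
        value = max_by_id[key]
        running = value if running is None else max(running, value)
    suffix = [None] * len(ids)  # suffix[i]: max over ids strictly after position i
    running = None
    for i in range(len(ids) - 1, -1, -1):
        suffix[i] = running
        value = max_by_id[ids[i]]
        running = value if running is None else max(running, value)
    result = {}
    for i, key in enumerate(ids):
        candidates = [d for d in (prefix[i], suffix[i]) if d is not None]
        result[key] = max(candidates) if candidates else None
    return result

# Per-id maxima implied by the question's sample data.
per_id_max = {1: date(2021, 9, 9), 2: date(2021, 9, 3), 3: date(2021, 9, 5)}
others = max_of_others(per_id_max)
print(others)
```

Both passes are linear in the number of ids, which is the reason this shape tends to be cheaper than comparing every id with every other id.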

Assuming your original table data just has columns id and dt - wouldn't this solve it? I'm using the fact that if an id has the max dt of everything, then it gets the second-highest over the other id values.
WITH max_dates AS
(
SELECT
id,
MAX(dt) AS max_dt
FROM
data
GROUP BY
id
),
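with_top1_value AS
(
SELECT
*,
-- Highest and lowest per-id max dates across all ids
MAX(max_dt) OVER () AS max_overall_dt1,
MIN(max_dt) OVER () AS min_overall_dt
FROM
max_dates
),
with_top2_values AS
(
SELECT
*,
-- Second-highest value: mask the top value with the minimum, then take the max again
-- (note: if several ids tie for the top value, all of them are masked)
MAX(CASE WHEN max_dt = max_overall_dt1 THEN min_overall_dt ELSE max_dt END) OVER () AS max_overall_dt2
FROM
with_top1_value
)
SELECT
*,
CASE WHEN max_dt = max_overall_dt1 THEN max_overall_dt2 ELSE max_overall_dt1 END AS max_dt_of_others
FROM
with_top2_values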
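The top-two idea can also be sketched in Python. This is an illustrative variant rather than a translation of the query: `max_of_others_top2` is a hypothetical name, and it takes the two largest values from a sorted list, so ids that tie for the overall maximum still see that maximum among the others.

```python
from datetime import date

def max_of_others_top2(max_by_id):
    """Max date over the other ids, using only the two largest per-id maxima."""
    ranked = sorted(max_by_id.values(), reverse=True)
    top1 = ranked[0]
    top2 = ranked[1] if len(ranked) > 1 else None
    # The id holding the overall max sees the runner-up; everyone else sees the overall max.
    return {key: (top2 if value == top1 else top1) for key, value in max_by_id.items()}

per_id_max = {1: date(2021, 9, 9), 2: date(2021, 9, 3), 3: date(2021, 9, 5)}
others_top2 = max_of_others_top2(per_id_max)
print(others_top2)
```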

how to calculate difference between dates in BigQuery

I have a table named Employees with Columns: PersonID, Name, StartDate. I want to calculate 1) difference in days between the newest and oldest employee and 2) the longest period of time (in days) without any new hires. I have tried to use DATEDIFF, however the dates are in a single column and I'm not sure what other method I should use. Any help would be greatly appreciated

Below is for BigQuery Standard SQL
#standardSQL
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.your_table` AS (
SELECT DATE '2019-01-01' StartDate UNION ALL
SELECT '2019-01-03' StartDate UNION ALL
SELECT '2019-01-13' StartDate
)
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
with result
Row days_between_newest_and_oldest_employee longest_period_without_new_hire
1 12 9
Note the use of -1 in calculating longest_period_without_new_hire. Whether to apply this adjustment is up to you and depends on how you prefer to count gaps: with it, only the days strictly between two hires are counted.
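The arithmetic behind both numbers can be checked by hand with the same three dummy dates. The Python sketch below is only a cross-check of the logic; the variable names are illustrative.

```python
from datetime import date

# Dummy hire dates from the example above.
start_dates = [date(2019, 1, 1), date(2019, 1, 3), date(2019, 1, 13)]

ordered = sorted(start_dates)
# 1) Days between the oldest and the newest hire.
span_days = (ordered[-1] - ordered[0]).days
# 2) Differences between consecutive hires (what LAG + DATE_DIFF produce).
gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
longest_gap = max(gaps)
# Same -1 adjustment as the query: count only the days strictly between two hires.
longest_gap_exclusive = longest_gap - 1

print(span_days, gaps, longest_gap_exclusive)
```

The consecutive differences always sum to the overall span, which is why the query can use SUM over the LAG differences for the first number.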

1) difference in days between the newest and oldest record
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT DATE_DIFF(MAX(date), MIN(date), DAY) max_minus_min
FROM table
2) the longest period of time (in days) without any new records
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT MAX(diff) max_diff
FROM (
SELECT DATE_DIFF(date, LAG(date) OVER(ORDER BY date), DAY) diff
FROM table
)

How can I count users in a month that were not present in the month before?

I am trying to count unique users on a monthly basis that were not present in the previous month. So if a user has a record for January and then another one for February, then I would only count January for that user.
user_id time
a1 1/2/17
a1 2/10/17
a2 2/18/17
a4 2/5/17
a5 3/25/17
My results should look like this
Month User Count
January 1
February 2
March 1

I'm not really familiar with BigQuery, but here's how I would solve the problem using TSQL. I imagine that you'd be able to use similar logic in BigQuery.
1). Order the data by user_id first, and then time. In TSQL, you can accomplish this with the following and store it in a common table expression, which you will query in the step after this.
;WITH cte AS
(
select ROW_NUMBER() OVER (PARTITION BY [user_id] ORDER BY [time]) AS rn,*
from dbo.employees
)
2). Next query for only the rows with rn = 1 (the first occurrence for a particular user) and group by the month.
select DATENAME(month, [time]) AS [Month], count(*) AS user_count
from cte
where rn = 1
group by DATENAME(month, [time])
This is assuming that 2017 is the only year you're dealing with. If you're dealing with more than one year, you probably want step #2 to look something like this:
select year([time]) as [year], DATENAME(month, [time]) AS [month],
count(*) AS user_count
from cte
where rn = 1
group by year([time]), DATENAME(month, [time])

First aggregate by the user id and the month. Then use lag() to see if the user was present in the previous month:
with du as (
select date_trunc(time, month) as yyyymm, user_id
from t
group by date_trunc(time, month), user_id
)
select yyyymm, count(*)
from (select du.*,
lag(yyyymm) over (partition by user_id order by yyyymm) as prev_yyyymm
from du
) du
where prev_yyyymm is null or
prev_yyyymm < date_sub(yyyymm, interval 1 month)
group by yyyymm;
Note: This uses the date functions, but similar functions exist for timestamp.

As I understand the question, a user is excluded from a given month's count only if the same user was present in the previous month. If the user was present a few months earlier but not in the previous month, the user should still be counted.
If this is correct, try the below for BigQuery Standard SQL:
#standardSQL
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
You can test and play with the above using the example below, with dummy data from your question:
#standardSQL
WITH yourTable AS (
SELECT 'a1' AS user_id, '1/2/17' AS time UNION ALL
SELECT 'a1', '2/10/17' UNION ALL
SELECT 'a2', '2/18/17' UNION ALL
SELECT 'a4', '2/5/17' UNION ALL
SELECT 'a5', '3/25/17'
)
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
The output is
Year Month User_Count
2017 January 1
2017 February 2
2017 March 1
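The rule being applied (count a user in a month only when that user has no record in the month immediately before) can be sketched in Python against the question's data. This is a hedged illustration: `month_index` is a hypothetical helper that maps a date string to a running month number, so that the previous month is always `index - 1`, even across a year boundary.

```python
from collections import Counter
from datetime import datetime

# Sample rows from the question (user_id, time as M/D/YY).
rows = [
    ("a1", "1/2/17"),
    ("a1", "2/10/17"),
    ("a2", "2/18/17"),
    ("a4", "2/5/17"),
    ("a5", "3/25/17"),
]

def month_index(text):
    """Map an M/D/YY string to a running month number (year * 12 + zero-based month)."""
    parsed = datetime.strptime(text, "%m/%d/%y")
    return parsed.year * 12 + (parsed.month - 1)

# Distinct (user, month) pairs, like the inner GROUP BY in the query.
present = {(user, month_index(text)) for user, text in rows}
# Count a user in a month only if the same user is absent from the previous month.
counts = Counter(month for user, month in present if (user, month - 1) not in present)
# Convert the running month number back to (year, month).
result = {(month // 12, month % 12 + 1): n for month, n in sorted(counts.items())}
print(result)
```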

Try this query:
SELECT
t1.d,
count(DISTINCT t1.user_id)
FROM
(
SELECT
EXTRACT(MONTH FROM time) AS d,
--EXTRACT(MONTH FROM time)-1 AS d2,
user_id
FROM nbitra.tmp
) t1
LEFT JOIN
(
SELECT
EXTRACT(MONTH FROM time) AS d,
user_id
FROM nbitra.tmp
) t2
ON t1.d = t2.d+1
AND t1.user_id = t2.user_id --Match the same user in the previous month
WHERE
(
t2.user_id IS NULL --No match: the user was not in the previous month (also covers January, which has no previous month to compare to)
)
GROUP BY t1.d;

Lowest continuous date without break

I have a table and each record has a date. We can assume that a date range is contiguous if there's not a 3 month break. How can I find the start of the most recent contiguous date range?
For example, imagine if I had this data:
1990-5-1
1990-6-4
1990-10-28
1990-11-14
1990-12-19
1991-1-20
1991-4-30
1991-5-13
I'd like for it to return 1991-4-30 because it's the start of the most recent contiguous range of dates.

I think this does what you're looking for. Using my own table and column names as test data. This is on Oracle.
select * from (
select * from sm_ss_tickets t1 where exists (
select * from sm_ss_tickets t2 where t2.created_date between t1.created_date and t1.created_date+90 and t1.rowid <> t2.rowid
) order by created_date asc
) where rownum = 1;

Maybe something like the following would work:
WITH d1 AS (
SELECT date'1990-05-01' AS dt FROM dual
UNION ALL
SELECT date'1990-06-04' AS dt FROM dual
UNION ALL
SELECT date'1990-10-28' AS dt FROM dual
UNION ALL
SELECT date'1990-11-14' AS dt FROM dual
UNION ALL
SELECT date'1990-12-19' AS dt FROM dual
UNION ALL
SELECT date'1991-01-20' AS dt FROM dual
UNION ALL
SELECT date'1991-04-30' AS dt FROM dual
UNION ALL
SELECT date'1991-05-13' AS dt FROM dual
)
SELECT MAX(dt) FROM (
SELECT dt, LAG(dt) OVER ( ORDER BY dt ) AS prev_dt, LEAD(dt) OVER ( ORDER BY dt ) AS next_dt
FROM d1
) WHERE ( dt > ADD_MONTHS(prev_dt, 3) OR prev_dt IS NULL )
AND dt > ADD_MONTHS(next_dt, -3)
In the above, a date can only be the start of a contiguous sequence if there is no prior date within 3 months (either it is more than three months ago or it doesn't exist at all) and there is also a subsequent date within 3 months.
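The same three-month rule can be sketched procedurally by walking back from the latest date until a gap appears. The Python below is a sketch under stated assumptions: `add_months` is a hypothetical helper that clamps to the last day of the target month (Oracle's ADD_MONTHS treats month-end dates slightly differently), and a lone final date is treated as its own range, whereas the query above also requires a following date within three months.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return d.replace(year=year, month=month, day=min(d.day, last_day))

def latest_range_start(dates, max_gap_months=3):
    """Return the start of the most recent run of dates with no gap over max_gap_months."""
    ordered = sorted(dates)
    start = ordered[-1]
    for i in range(len(ordered) - 1, 0, -1):
        # Stop as soon as the previous date is more than max_gap_months earlier.
        if ordered[i] > add_months(ordered[i - 1], max_gap_months):
            break
        start = ordered[i - 1]
    return start

# Sample dates from the question.
sample = [
    date(1990, 5, 1), date(1990, 6, 4), date(1990, 10, 28), date(1990, 11, 14),
    date(1990, 12, 19), date(1991, 1, 20), date(1991, 4, 30), date(1991, 5, 13),
]
range_start = latest_range_start(sample)
print(range_start)
```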

You can use LAG and LEAD; see the query below, which I think works fine.
tmp_year is the table I created, and tdate is its date column.
The records in the table are:
28-JAN-15
27-JAN-15
26-JAN-15
25-JAN-15
12-JUL-14
11-JUL-14
10-JUL-14
09-JUL-14
24-DEC-13
23-DEC-13
22-DEC-13
21-DEC-13
15-SEP-13
07-JUN-13
27-FEB-13
19-NOV-12
11-AUG-12
The query below returns 25th Jan 2015:
select max(d.tdate) from (
select c.tdate,c.next_date,c.date_diff,lag(date_diff) over( order by tdate) prev_diff from (
select b.tdate ,b.next_date,(next_date-tdate) date_diff from
(select a.tdate,lead(a.tdate) over(order by a.tdate) next_date from tmp_year a ) b ) c) d where d.date_diff<90 and d.prev_diff>=90;
