Check any missing days record in Bigquery PartitionTable

I have a BigQuery table with a date partition key.
Records arrive in that table daily, and I am trying to find out whether any day is missing across about 3 years of historical data.
So I tried the following query:
SELECT KeyPartitionDate
FROM (
SELECT KeyPartitionDate, DATE(KeyPartitionDate) as day, DATE_ADD(date(KeyPartitionDate), INTERVAL 1 DAY) AS dayplusone
FROM `project.dataset.table`
)
WHERE DATE_DIFF(day, dayplusone , DAY) > 1
GROUP BY KeyPartitionDate
ORDER BY KeyPartitionDate
The query is valid but returns no results, although I know some days are missing.
My guess is that I'm misusing the DATE_ADD function, but I can't tell how.

Below is for BigQuery Standard SQL and just gives you the list of missing days
#standardSQL
SELECT day AS missing_days
FROM (
SELECT MIN(KeyPartitionDate) min_day, MAX(KeyPartitionDate) max_day
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day)) day
LEFT JOIN (
SELECT DISTINCT KeyPartitionDate AS day
FROM `project.dataset.table`
) t
USING(day)
WHERE t.day IS NULL

You went about this the wrong way:
day = DATE(KeyPartitionDate)
then you did
dayplusone = DATE_ADD(DATE(KeyPartitionDate), INTERVAL 1 DAY)
which is basically saying dayplusone = day + (1 day)
Then you do:
WHERE DATE_DIFF(day, dayplusone, DAY) > 1
DATE_DIFF(a, b, DAY) returns a - b in days, so this is like saying: day - dayplusone > 1, which would mean
day - (day + (1 day)) > 1
The left side is always -1, so the condition can never be true and no rows come back.
What you need to do instead is compare the current row's date with the previous row's date. That is achieved using window functions:
SELECT KeyPartitionDate FROM (
SELECT DISTINCT KeyPartitionDate,
LAG(KeyPartitionDate)
OVER (ORDER BY KeyPartitionDate ASC) AS PreviousKeyPartitionDate
FROM `project.dataset.table`)
WHERE DATE_DIFF(DATE(KeyPartitionDate), DATE(PreviousKeyPartitionDate), DAY) > 1
ORDER BY KeyPartitionDate
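
The two answers above return different things: the GENERATE_DATE_ARRAY query lists the missing days themselves, while the LAG query lists the first day present after each gap. A minimal Python sketch of both ideas, useful for checking the logic on a small sample before running it over three years of partitions. The function names and the sample dates are hypothetical, not from the original question.

```python
from datetime import date, timedelta


def missing_days(present):
    """Every calendar day between the first and last present day that is absent."""
    days = set(present)
    if not days:
        return []
    first, last = min(days), max(days)
    span = (last - first).days
    calendar = (first + timedelta(days=i) for i in range(span + 1))
    return [d for d in calendar if d not in days]


def days_after_gaps(present):
    """Each day whose previous present day is more than one day earlier (the LAG idea)."""
    ordered = sorted(set(present))  # de-duplicate first, like SELECT DISTINCT
    return [cur for prev, cur in zip(ordered, ordered[1:]) if (cur - prev).days > 1]


# Hypothetical sample: Jan 3, 4 and 6 are missing; Jan 5 appears twice.
sample = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 5),
          date(2020, 1, 5), date(2020, 1, 7)]

print(missing_days(sample))
print(days_after_gaps(sample))
```

If the goal is the list of absent dates, the first function (and the GENERATE_DATE_ARRAY query) is the one to use; the second only tells you where gaps end.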

Related

Day wise Rolling 30 day uniques user count bigquery

I am trying to generate a day-by-day rolling 30-day unique user count. The query below gives the count for a single day, so I would have to run it once per day. I need the rolling 30-day count for every day of August in a single script. Please help.
-----------------------------------------
SELECT max(date),count(DISTINCT user_id) as MAU
FROM user_data
WHERE date between DATE_SUB('2020-08-31' ,INTERVAL 29 DAY) and '2020-08-31';

BigQuery doesn't support rolling windows for count(distinct). So, one approach is a brute force method:
select dte,
(select count(distinct ud.user_id)
from user_data ud
where ud.date between DATE_SUB(dte, INTERVAL 29 DAY) and dte
) as num_users
from unnest(generate_date_array(date('2020-08-01'), date('2020-08-31'))) dte
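
The brute-force query above recomputes the distinct set for every day in the range. A minimal Python sketch of the same logic, assuming events arrive as (date, user_id) pairs; the function name and sample rows are hypothetical. The window is the current day plus the 29 days before it, matching DATE_SUB(dte, INTERVAL 29 DAY).

```python
from datetime import date, timedelta


def rolling_distinct(events, start, end, window_days=30):
    """Map each day in [start, end] to the number of distinct users
    seen in the trailing window that ends on that day (inclusive)."""
    events = list(events)
    result = {}
    day = start
    while day <= end:
        lo = day - timedelta(days=window_days - 1)
        # Rebuild the distinct set from scratch for every day (brute force).
        result[day] = len({uid for d, uid in events if lo <= d <= day})
        day += timedelta(days=1)
    return result


# Hypothetical sample data.
events = [
    (date(2020, 7, 2), "d"),
    (date(2020, 7, 10), "c"),
    (date(2020, 8, 1), "a"),
    (date(2020, 8, 1), "b"),
    (date(2020, 8, 2), "a"),
]
mau = rolling_distinct(events, date(2020, 8, 1), date(2020, 8, 31))
print(mau[date(2020, 8, 1)], mau[date(2020, 8, 31)])
```

The cost grows with (days in range) x (events), which is why the source data should be filtered to the relevant period first.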

Gordon's approach works great.
If you need to calculate more numbers, cross join the data:
SELECT
date_gen,
COUNT(DISTINCT IF(ud.date BETWEEN DATE_SUB(date_gen ,INTERVAL 29 DAY) AND date_gen,ud.user_id,NULL)) as MAU
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE_SUB('2020-08-31' ,INTERVAL 29 DAY), date('2020-08-31'))) date_gen,
(SELECT * FROM user_data WHERE date BETWEEN DATE_SUB('2020-08-31' ,INTERVAL 60 DAY) AND '2020-08-31') AS ud
GROUP BY 1
ORDER BY 1 DESC
With DECLARE and SET you can avoid repeating the date literal multiple times.

Below is for BigQuery Standard SQL
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM `project.dataset.user_data`
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t
The above assumes you have entries in the project.dataset.user_data table for every day in the time period of interest.
If this is not the case and you have some gaps in your data, you can use the query below:
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM UNNEST(GENERATE_DATE_ARRAY('2020-08-01', '2020-08-31')) AS date
LEFT JOIN `project.dataset.user_data`
USING(date)
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t

Appending the result query in bigquery

I am writing a BigQuery query whose result should accumulate the data from previous dates.
So the result for today should be higher than yesterday's, because each day's data is appended to the running total.
So far I have only managed to get the data per day, where the number of IDs declines from day to day instead of accumulating.
What should I add to the query so that each day's result includes the data from the previous days?
Code:
WITH
table1 AS (
SELECT
ID,
...
FROM t
WHERE DATE_SUB('2020-01-31', INTERVAL 31 DAY) and '2020-01-31'
),
table2 AS (
SELECT
ID,
COUNTIF((rating < 7) as bad,
COUNTIF((rating >= 7 AND SAFE_CAST(NPS_Rating as INT64) < 9) as intermediate,
COUNTIF((rating as good
FROM
t
WHERE DATE_SUB('2020-01-31', INTERVAL 31 DAY) and '2020-01-31'
)
SELECT
DATE_SUB('2020-01-31', INTERVAL 31 DAY) as date,
*
FROM table1
FULL OUTER JOIN table2 USING (ID)

If you have counts that you want to accumulate, then you want a cumulative sum. The query would look something like this:
select datecol, count(*), sum(count(*)) over (order by datecol)
from t
group by datecol
order by datecol;
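
The window expression sum(count(*)) over (order by datecol) is a running total of the per-day counts. A minimal Python sketch of that calculation, assuming the input is simply a list of date values (ISO strings here, so they sort chronologically); the function name and sample are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(dates):
    """Return (day, count_on_day, running_total) rows ordered by day."""
    per_day = sorted(Counter(dates).items())             # GROUP BY datecol ORDER BY datecol
    running = accumulate(count for _, count in per_day)  # SUM(...) OVER (ORDER BY datecol)
    return [(day, count, total) for (day, count), total in zip(per_day, running)]


# Hypothetical sample: ISO date strings, one per record.
rows = cumulative_counts([
    "2020-01-01", "2020-01-01",
    "2020-01-02",
    "2020-01-04", "2020-01-04", "2020-01-04",
])
for row in rows:
    print(row)
```

Each day's running total is never lower than the previous day's, which is the "appending" behaviour the question asks for.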

SQL Get last 7 days from event date

The best way to explain what I need is showing, so, here it is:
Currently I have this query
select
date_
,count(*) as count_
from table
group by date_
which returns one row per date with its count.
Now I need a new column that shows the count over the previous 7 days, relative to the row's date_.
So, if the row is for 29/06, I have to count all occurrences on that day (my query already does this) and also all occurrences from 22/06 to 29/06.

If you have values for all dates, without gaps, then you can use window functions with a rows frame:
select
date,
count(*) cnt,
sum(count(*)) over(order by date rows between 7 preceding and current row) cnt_d7
from mytable
group by date
order by date

you can try something like this:
select
date_,
count(*) as count_,
(select count(*)
from table as b
where b.date_ <= a.date_ and b.date_ > a.date_ - interval '7 days'
) as count7days_
from table as a
group by date_

If you have gaps, you can do a more complicated solution where you add and subtract the values:
with t as (
select date_, count(*) as count_
from table
group by date_
union all
select date_ + interval '8 day', -count(*) as count_
from table
group by date_
)
select date_,
sum(sum(count_)) over (order by date_ rows between unbounded preceding and current row) - sum(count_)
from t
group by date_
order by date_;
The - sum(count_) is because you do not seem to want the current day in the cumulated amount.
You can also use the nasty self-join approach . . . which should be okay for 7 days:
with t as (
select date_, count(*) as count_
from table
group by date_
)
select t.date_, t.count_, sum(tprev.count_)
from t left join
t tprev
on tprev.date_ >= t.date_ - interval '7 day' and
tprev.date_ < t.date_
group by t.date_, t.count_;
The performance will get worse and worse as "7" gets bigger.
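
The answers above differ on whether the current day is included in the 7-day total. A minimal Python sketch of the date-range version, assuming the window from the question's example (22/06 through 29/06, inclusive on both ends). It works with gaps because it filters by date rather than by row position. The function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def trailing_counts(dates, days_back=7):
    """Return (day, count_on_day, count_in_window) for each day present,
    where the window runs from day - days_back through day, inclusive."""
    per_day = Counter(dates)
    rows = []
    for day in sorted(per_day):
        lo = day - timedelta(days=days_back)
        # Filter by date value, so missing days do not shift the window.
        window_total = sum(c for d, c in per_day.items() if lo <= d <= day)
        rows.append((day, per_day[day], window_total))
    return rows


# Hypothetical sample with gaps between the dates.
events = ([date(2020, 6, 20)] * 1
          + [date(2020, 6, 22)] * 2
          + [date(2020, 6, 29)] * 3)
for row in trailing_counts(events):
    print(row)
```

To exclude the current day from the total, as the add-and-subtract answer does, subtract the second column from the third.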

Try with subquery for the new column:
select
a.date_ as groupdate,
count(*) as date_count,
(select count(*)
from table as b
where b.date_ <= a.date_ and b.date_ >= a.date_ - interval '7 day'
) as total7
from table as a
group by a.date_
order by a.date_

How to generate date series to occupy absent dates in google BigQuery?

I am trying to get the daily sum of sales from a Google BigQuery table. I used the following code for that.
select Day(InvoiceDate) date, Sum(InvoiceAmount) sales from test_gmail_com.sales
where year(InvoiceDate) = Year(current_date()) and
Month(InvoiceDate) = Month(current_date())
group by date order by date
The above query only returns the daily sum for days that have sales in the table. Some days may have no sales. For those days I still need the date, with a sum of 0. For example, every month should have 30 or 31 rows with the sum of sales. An example is shown below: the 4th day of the month has no sales, so its sum should be 0.
date | sales
-----+------
1 | 259
2 | 359
3 | 45
4 | 0
5 | 156
Is it possible to do this in BigQuery? Basically, the date column should be a series from 1 to 28, 29, 30 or 31, depending on the month and year.

Generating a list of dates and then joining whatever table you need on top seems the easiest. I used GENERATE_DATE_ARRAY with UNNEST and it looks quite clean.
To generate a list of days (one day per row):
SELECT
*
FROM
UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example
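
The join step that follows the date list can be sketched in Python: build every day of the month, then look up each day's total and default to 0 when there is none. This is only an illustration of the left-join-with-default idea; the function name and the sample rows are hypothetical (the amounts echo the table in the question).

```python
from datetime import date, timedelta


def daily_sales_with_zeros(sales, year, month):
    """sales: iterable of (invoice_date, amount).
    Returns [(day_of_month, total)] with one row for every day of the month."""
    first = date(year, month, 1)
    # First day of the following month (rolls over to January after December).
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    totals = {}
    for day, amount in sales:
        if first <= day < next_first:
            totals[day] = totals.get(day, 0) + amount
    n_days = (next_first - first).days
    calendar = (first + timedelta(days=i) for i in range(n_days))
    # Equivalent of LEFT JOIN ... IFNULL(sales, 0).
    return [(d.day, totals.get(d, 0)) for d in calendar]


# Hypothetical sample: no sales on the 4th, two invoices on the 5th.
sales = [
    (date(2019, 2, 1), 259),
    (date(2019, 2, 2), 359),
    (date(2019, 2, 3), 45),
    (date(2019, 2, 5), 100),
    (date(2019, 2, 5), 56),
    (date(2019, 3, 1), 999),  # outside the month, ignored
]
report = daily_sales_with_zeros(sales, 2019, 2)
print(report[:5])
```

The key point carried over to SQL is that the calendar side drives the output, so the grouping and the selected date must come from the generated dates, not from the sales table.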

You can use the query below (legacy SQL) to generate all dates in a given range on the fly. In this example it is all dates from 2015-06-01 until CURRENT_DATE(); change those two values to control which date range is generated.
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
Now you can LEFT JOIN it with your table to have all dates accounted for. See a potential example below:
SELECT
calendar_day,
IFNULL(sales, 0) AS sales
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
) AS all_dates
LEFT JOIN (
SELECT DAY(InvoiceDate) DATE, SUM(InvoiceAmount) sales
FROM test_gmail_com.sales
WHERE YEAR(InvoiceDate) = YEAR(CURRENT_DATE()) AND
MONTH(InvoiceDate) = MONTH(CURRENT_DATE())
GROUP BY DATE
)
ON DATE = calendar_day
To get the previous month's sales instead, the query below gives all days of the previous month:
SELECT DATE(DATE_ADD(DATE_ADD(DATE_ADD(CURRENT_DATE(), -1, "MONTH"), 1 - DAY(CURRENT_DATE()), "DAY"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(DATE_ADD(CURRENT_DATE(), - DAY(CURRENT_DATE()), "DAY"), DATE_ADD(DATE_ADD(CURRENT_DATE(), -1, "MONTH"), 1 - DAY(CURRENT_DATE()), "DAY")), '.'),'') AS h
FROM (SELECT NULL)),h
)))

Using the Standard SQL dialect and the generate_array function to simplify the code:
WITH serialnum AS (
SELECT
sn
FROM
UNNEST(GENERATE_ARRAY(0,
DATE_DIFF(DATE_ADD(DATE_TRUNC(CURRENT_DATE()
, MONTH)
, INTERVAL 1 MONTH)
, DATE_TRUNC(CURRENT_DATE(), MONTH)
, DAY) - 1)
) AS sn
), date_seq AS (
SELECT
DATE_ADD(DATE_TRUNC(CURRENT_DATE(), MONTH),
INTERVAL(sn) DAY) AS this_day
FROM
serialnum
)
SELECT
EXTRACT(DAY FROM date_seq.this_day) AS date
, IFNULL(SUM(s.InvoiceAmount), 0) AS sales
FROM
date_seq
LEFT JOIN
test_gmail_com.sales AS s
ON
-- assumes InvoiceDate is a TIMESTAMP or DATETIME column
date_seq.this_day = DATE(s.InvoiceDate)
GROUP BY
date
ORDER BY
date
;
UPDATE
Or, simpler still using the generate_date_array function:
WITH date_seq AS (
SELECT
this_day
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE_TRUNC(CURRENT_DATE(), MONTH),
DATE_ADD(DATE_ADD(DATE_TRUNC(CURRENT_DATE(), MONTH)
, INTERVAL 1 MONTH)
, INTERVAL -1 DAY)
, INTERVAL 1 DAY))
AS this_day
)
SELECT
EXTRACT(DAY FROM date_seq.this_day) AS date
, IFNULL(SUM(s.InvoiceAmount), 0) AS sales
FROM
date_seq
LEFT JOIN
test_gmail_com.sales AS s
ON
-- assumes InvoiceDate is a TIMESTAMP or DATETIME column
date_seq.this_day = DATE(s.InvoiceDate)
GROUP BY
date
ORDER BY
date
;

For these purposes it is practical to have a 'calendar' table, a table that just lists all the days within a certain range. For your specific question, it would suffice to have a table with the numbers 1 to 31. A quick way to get this table is to make a spreadsheet with these numbers, save it as a csv file and import this file into BigQuery as a table.
You then left outer join your result set onto this table, with ifnull(sales,0) as sales.
If you want the number of days per month (28--31) to be right, you basically have two options. Either you create a proper calendar table that covers several years and that you join on using year, month and day. Or you use the simple table with numbers 1--31 and remove numbers based on the month and the year.

For Standard SQL
WITH
splitted AS (
SELECT
*
FROM
UNNEST( SPLIT(RPAD('',
1 + DATE_DIFF(CURRENT_DATE(), DATE("2015-06-01"), DAY),
'.'),''))),
with_row_numbers AS (
SELECT
ROW_NUMBER() OVER() AS pos,
*
FROM
splitted),
calendar_day AS (
SELECT
DATE_ADD(DATE("2015-06-01"), INTERVAL (pos - 1) DAY) AS day
FROM
with_row_numbers)
SELECT
*
FROM
calendar_day
ORDER BY
day DESC

Google BigQuery: Rolling Count Distinct

I have a table which is simply a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct number of IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (count distinct user from 45 days ago until today).
I have experimented with window functions, but can't figure out how to define a range based on the date values in a column. Instead, I believe the following query would work in a database like MySQL, but does not in BigQuery.
SELECT
day,
(SELECT
COUNT(DISTINCT visid)
FROM daily_users
WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How to do this in BigQuery?

BigQuery documentation claims that count(distinct) works as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method adds a record for each of the 45 days after a visit:
select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
from daily_users u cross join
(select 1 as n union all select 2 union all . . .
select 45
) n
) u
group by theday;
Note: there may be simpler ways to generate a series of 45 integers in BigQuery.
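
The idea in the query above is to copy each visit forward onto every later day whose window would include it, then count distinct IDs per day. A minimal Python sketch of that expansion. One assumption differs from the SQL: offsets here start at 0, so the visit day itself is included, matching the BETWEEN ... AND t.day window in the question. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_users(visits, window_days=45):
    """visits: iterable of (day, visitor_id).
    Returns {day: distinct visitors seen from day - window_days through day}."""
    seen = defaultdict(set)
    for day, visid in visits:
        # Copy the visit onto its own day and each of the next window_days days.
        for offset in range(window_days + 1):
            seen[day + timedelta(days=offset)].add(visid)
    return {day: len(ids) for day, ids in sorted(seen.items())}


# Hypothetical sample, using a 1-day window to keep the output small.
visits = [
    (date(2016, 1, 10), 1),
    (date(2016, 1, 10), 2),
    (date(2016, 1, 11), 1),
    (date(2016, 1, 11), 3),
]
counts = active_users(visits, window_days=1)
print(counts)
```

Note that the expansion also produces days after the last visit, and multiplies the row count by the window length, which is the main cost of this approach.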

Below should work with BigQuery
#legacySQL
SELECT day, active_users FROM (
SELECT
day,
COUNT(DISTINCT id)
OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
FROM daily_users
)
) GROUP BY 1, 2 ORDER BY 1
The above assumes that the day field is a string in '2016-01-10' format.
If that is not the case, adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in the innermost select.
Also, please take a look at the COUNT(DISTINCT) specifics in BigQuery.
Update for BigQuery Standard SQL
#standardSQL
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
You can test it with the dummy sample below, which uses a 1-day window (86400 seconds) so the effect is visible on a few rows:
#standardSQL
WITH daily_users AS (
SELECT 1 AS id, '2016-01-10' AS day UNION ALL
SELECT 2 AS id, '2016-01-10' AS day UNION ALL
SELECT 1 AS id, '2016-01-11' AS day UNION ALL
SELECT 3 AS id, '2016-01-11' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
