Appending the query result in BigQuery - SQL

I am working on a query where the result should append the data from the previous date in BigQuery.
So the result for today will be higher than yesterday's, because the data accumulates day by day.
So far, the only outcome I have managed to get is the data broken out by day, where you can see the number of IDs declining instead of accumulating from the previous day.
What should I do to add this appending behaviour to the query, so that each day's result includes the data from the previous days in BigQuery?
code:
WITH
table1 AS (
SELECT
ID,
...
FROM t
WHERE date BETWEEN DATE_SUB('2020-01-31', INTERVAL 31 DAY) AND '2020-01-31'
),
table2 AS (
SELECT
ID,
COUNTIF(rating < 7) AS bad,
COUNTIF(rating >= 7 AND SAFE_CAST(NPS_Rating AS INT64) < 9) AS intermediate,
COUNTIF(SAFE_CAST(NPS_Rating AS INT64) >= 9) AS good
FROM
t
WHERE date BETWEEN DATE_SUB('2020-01-31', INTERVAL 31 DAY) AND '2020-01-31'
GROUP BY ID
)
SELECT
DATE_SUB('2020-01-31', INTERVAL 31 DAY) as date,
*
FROM table1
FULL OUTER JOIN table2 USING (ID)

If you have counts that you want to accumulate, then you want a cumulative sum. The query would look something like this:
select datecol, count(*), sum(count(*)) over (order by datecol)
from t
group by datecol
order by datecol;
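
Applied to the rating buckets in the question, a hedged sketch (reusing the question's rating/NPS_Rating columns and assuming t has a date column) might look like this:
SELECT
  date,
  -- daily buckets, accumulated with a running SUM() over the grouped counts
  SUM(COUNTIF(rating < 7)) OVER (ORDER BY date) AS bad_cumulative,
  SUM(COUNTIF(rating >= 7 AND SAFE_CAST(NPS_Rating AS INT64) < 9)) OVER (ORDER BY date) AS intermediate_cumulative,
  SUM(COUNTIF(SAFE_CAST(NPS_Rating AS INT64) >= 9)) OVER (ORDER BY date) AS good_cumulative
FROM t
WHERE date BETWEEN DATE_SUB('2020-01-31', INTERVAL 31 DAY) AND '2020-01-31'
GROUP BY date
ORDER BY date;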

Related

Day-wise rolling 30-day unique user count in BigQuery

I am trying to generate a day-on-day rolling 30-day unique user count with this query, but the problem is that I have to run it once per day. I need the rolling 30-day count for each day of the full month of August in one script. Please help.
SELECT max(date),count(DISTINCT user_id) as MAU
FROM user_data
WHERE date between DATE_SUB('2020-08-31' ,INTERVAL 29 DAY) and '2020-08-31';
BigQuery doesn't support rolling window frames for COUNT(DISTINCT). So, one approach is a brute-force method:
select dte,
(select count(distinct ud.user_id)
from user_data ud
where ud.date between DATE_SUB(dte, INTERVAL 29 DAY) and dte
) as num_users
from unnest(generate_date_array(date('2020-08-01'), date('2020-08-31'))) dte
Gordon's approach works great.
If you need to calculate more metrics, cross join the data:
SELECT
date_gen,
COUNT(DISTINCT IF(ud.date BETWEEN DATE_SUB(date_gen ,INTERVAL 29 DAY) AND date_gen,ud.user_id,NULL)) as MAU
FROM
UNNEST(GENERATE_DATE_ARRAY(DATE_SUB('2020-08-31' ,INTERVAL 29 DAY), date('2020-08-31'))) date_gen,
(SELECT * FROM user_data WHERE date BETWEEN DATE_SUB('2020-08-31' ,INTERVAL 60 DAY) AND '2020-08-31') AS ud
GROUP BY 1
ORDER BY 1 DESC
With DECLARE and SET you can avoid repeating the date literal multiple times.
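For example, a minimal sketch of the cross-join query above with a declared variable (assuming BigQuery scripting; end_date is just an illustrative name):
DECLARE end_date DATE DEFAULT '2020-08-31';

SELECT
  date_gen,
  COUNT(DISTINCT IF(ud.date BETWEEN DATE_SUB(date_gen, INTERVAL 29 DAY) AND date_gen, ud.user_id, NULL)) AS MAU
FROM
  UNNEST(GENERATE_DATE_ARRAY(DATE_SUB(end_date, INTERVAL 29 DAY), end_date)) AS date_gen,
  (SELECT * FROM user_data WHERE date BETWEEN DATE_SUB(end_date, INTERVAL 60 DAY) AND end_date) AS ud
GROUP BY 1
ORDER BY 1 DESC;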
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM `project.dataset.user_data`
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t
The above assumes you have entries in the project.dataset.user_data table for all days in the time period of interest.
If that is not the case and you actually have some gaps in your data, you can use the query below:
#standardSQL
SELECT date, (SELECT COUNT(DISTINCT id) FROM t.users AS id) AS MAU
FROM (
SELECT date, ARRAY_AGG(user_id) OVER(mau_win) users
FROM UNNEST(GENERATE_DATE_ARRAY('2020-08-01', '2020-08-31')) AS date
LEFT JOIN `project.dataset.user_data`
USING(date)
WINDOW mau_win AS (
ORDER BY UNIX_DATE(date) DESC RANGE BETWEEN CURRENT ROW AND 29 FOLLOWING
)
) t

Rolling 12-month filter criteria in SQL

I'm having an issue with a SQL script where I'm trying to apply a rolling 12-month filter on the day column, which is stored as text on the server.
The goal is to count sizes for a product at a retail store location over the last 12 months from the current day. Currently my query filters on the year 2019, which only counts the sizes for that year rather than for a rolling 12 months from the current date.
The CALENDARDAY column is a text field in the data set and stores dates in yyyymmdd format.
When I try to run the script below in Tableau with the GETDATE and DATEADD functions, it gives me a function error. I am accessing a SAP HANA server with the query below.
Any help would be appreciated.
Select
SKU, STYLE_ID, Base_Style_ID, COLOR, SIZEKEY, STORE, Year,
count(SIZEKEY)over(partition by STYLE_ID,COLOR,STORE,Year) as SZ_CNT
from
(
select
a."RAW" As SKU,
a."STYLENUM" As STYLE_ID,
mat."BASENUM" AS Base_Style_ID,
a."COLORNUM" AS COLOR,
a."SIZE" AS SIZEKEY,
a."STORENUM" AS STORE,
substring(a."CALENDARDAY",1,4) As year
from PRTRPT_XRE as a
JOIN ZAT_SKU As mat On a."RAW" = mat."SKU"
where a."ORGANIZATION" = 'M20'
and a."COLORNUM" is not null
and substring(a."CALENDARDAY",1,4) = '2019'
Group BY
a."RAW",
a."STYLENUM",
mat."BASENUM",
a."ZCOLORCD",
a."SIZE",
a."STORENUM",
substring(a."CALENDARDAY",1,4)
)
I have never worked with that DB / server, so I don't have a way to test this.
But hopefully this will work (expecting exactly 12 months before today's date):
AND ADD_MONTHS (TO_DATE (a."CALENDARDAY", 'YYYYMMDD'), 12) > CURRENT_DATE
or
AND ADD_MONTHS (a."CALENDARDAY", 12) > CURRENT_DATE
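For instance, a hedged sketch of how that condition could replace the hard-coded year filter in the inner query (untested, as noted above; the format string assumes the yyyymmdd layout described in the question):
where a."ORGANIZATION" = 'M20'
and a."COLORNUM" is not null
-- rolling 12 months instead of the fixed year filter
and ADD_MONTHS (TO_DATE (a."CALENDARDAY", 'YYYYMMDD'), 12) > CURRENT_DATE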
The condition below, based on one of our CALENDAR tables, also worked the same way as the ADD_MONTHS approach mentioned in the response above:
select distinct CALENDARDAY
from
(
select FISCALWEEK, CALENDARDAY, CNST, row_number()over(partition by CNST order by FISCALWEEK desc) as rnum
from
(
select distinct FISCALWEEK, CALENDARDAY, 'A' as CNST
from CALENDARTABLE
where CALENDARDAY < current_date
order by 1,2
)
) where rnum < 366

Check for any missing days' records in a BigQuery partitioned table

I have a BigQuery table with a date partition key.
I get daily records in that table, and I'm trying to find out whether any day is missing across roughly 3 years of historical data.
So I tried to use the following query:
SELECT KeyPartitionDate
FROM (
SELECT KeyPartitionDate, DATE(KeyPartitionDate) as day, DATE_ADD(date(KeyPartitionDate), INTERVAL 1 DAY) AS dayplusone
FROM `project.dataset.table`
)
WHERE DATE_DIFF(day, dayplusone , DAY) > 1
GROUP BY KeyPartitionDate
ORDER BY KeyPartitionDate
The query is valid but returns no results, even though I know there are missing days...
My guess is that I'm messing up the DATE_ADD function, but I can't tell how.
Below is for BigQuery Standard SQL and just gives you the list of missing days
#standardSQL
SELECT day AS missing_days
FROM (
SELECT MIN(KeyPartitionDate) min_day, MAX(KeyPartitionDate) max_day
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day)) day
LEFT JOIN (
SELECT DISTINCT KeyPartitionDate AS day
FROM `project.dataset.table`
) t
USING(day)
WHERE t.day IS NULL
You went about this the wrong way:
day = DATE(KeyPartitionDate)
then you did
dayplusone = DATE_ADD(date(KeyPartitionDate), INTERVAL 1 DAY)
which is basically saying dayplusone = day +(1 day)
Then you do :
WHERE DATE_DIFF(day, dayplusone, DAY) > 1
which is like saying: day - dayplusone > 1 day, i.e.
day - (day + 1 day) > 1 day, which is always -1 > 1 and therefore never true.
You can clearly see why that returns no rows.
What you need to do instead is compare the current row's date with the previous row's date. That is achieved using window functions:
SELECT KeyPartitionDate FROM (
SELECT DISTINCT KeyPartitionDate,
LAG(KeyPartitionDate)
OVER (ORDER BY KeyPartitionDate ASC) AS PreviousKeyPartitionDate
FROM `project.dataset.table`)
WHERE DATE_DIFF(DATE(KeyPartitionDate), DATE(PreviousKeyPartitionDate), DAY) > 1
ORDER BY KeyPartitionDate

Adding counts from multiple tables

I have a simple question.
I need to count all records from multiple tables by day and hour and combine them all in a single final table.
So the query for each table is something like this:
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_1 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_2 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_3 group by 1, 2
and so on and so forth.
I would like to combine all the results, showing the total number of records for each day and hour from these tables.
The expected result would look like this:
date, hour, number of records in table 1, number of records in table 2, number of records in table 3, ...
What would be the most efficient SQL query for this?
Probably the simplest way is to union them together and aggregate:
select timestamp_trunc(timestamp, hour) as hh,
countif(which = 1) as num_1,
countif(which = 2) as num_2
from ((select timestamp, 1 as which
from table_1
) union all
(select timestamp, 2 as which
from table_2
) union all
. . .
) t
group by hh
order by hh;
Note on timestamp_trunc(): timestamp_trunc(timestamp, hour) returns a timestamp truncated to the hour, so there is no need to also include the date.
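To illustrate (a quick check; the timestamp value is just an example):
SELECT TIMESTAMP_TRUNC(TIMESTAMP '2020-08-15 14:35:27 UTC', HOUR);
-- returns 2020-08-15 14:00:00 UTC, so the date is still part of the value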
Below is for BigQuery Standard SQL
#standardSQL
SELECT
TIMESTAMP_TRUNC(TIMESTAMP, DAY) day,
EXTRACT(HOUR FROM TIMESTAMP) hour,
COUNT(*) cnt,
_TABLE_SUFFIX AS table
FROM `project.dataset.table_*`
GROUP BY day, hour, table
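
If you want the column-per-table layout from the expected result, a hedged sketch combining the two ideas above (COUNTIF plus _TABLE_SUFFIX, assuming the tables share the table_* wildcard prefix) could be:
#standardSQL
SELECT
  TIMESTAMP_TRUNC(timestamp, DAY) day,
  EXTRACT(HOUR FROM timestamp) hour,
  COUNTIF(_TABLE_SUFFIX = '1') AS table_1_records,
  COUNTIF(_TABLE_SUFFIX = '2') AS table_2_records,
  COUNTIF(_TABLE_SUFFIX = '3') AS table_3_records
FROM `project.dataset.table_*`
GROUP BY day, hour
ORDER BY day, hour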

Add one for every row that fulfills WHERE criteria within a period

I have a Postgres table that I'm trying to analyze based on some date columns.
I'm basically trying to count the number of rows in my table that fulfill this requirement, and then group them by month and year. Instead of my query looking like this:
SELECT * FROM $TABLE WHERE date1::date <= '2012-05-31'
and date2::date > '2012-05-31';
it should be able to display this for the months available in my data so that I don't have to change the months manually every time I add new data, and so I can get everything with one query.
In the case above I'd like it to group the sum of rows which fit the criteria into the year 2012 and month 05. Similarly, if my WHERE clause looked like this:
date1::date <= '2012-06-30' and date2::date > '2012-06-30'
I'd like it to group this sum into the year 2012 and month 06.
This isn't entirely clear to me:
I'd like it to group the sum of rows
I'll interpret it this way: you want to list all rows "per month" matching the criteria:
WITH x AS (
SELECT date_trunc('month', min(date1)) AS start
,date_trunc('month', max(date2)) + interval '1 month' AS stop
FROM tbl
)
SELECT to_char(y.mon, 'YYYY-MM') AS mon, t.*
FROM (
SELECT generate_series(x.start, x.stop, '1 month') AS mon
FROM x
) y
LEFT JOIN tbl t ON t.date1::date <= y.mon
AND t.date2::date > y.mon -- why the explicit cast to date?
ORDER BY y.mon, t.date1, t.date2;
Assuming date2 >= date1.
Compute the lower and upper borders of the time period and truncate them to the month (adding 1 month to the upper border to include the last row, too).
Use generate_series() to create the set of months in question.
LEFT JOIN rows from your table with the declared criteria and sort by month.
You could also GROUP BY at this stage to calculate aggregates.
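For example, a hedged sketch of that GROUP BY variant, counting the matching rows per month (same assumptions as above):
WITH x AS (
   SELECT date_trunc('month', min(date1)) AS start
         ,date_trunc('month', max(date2)) + interval '1 month' AS stop
   FROM   tbl
   )
SELECT to_char(y.mon, 'YYYY-MM') AS mon
      ,count(t.date1) AS matching_rows   -- counts 0 for months without a match
FROM  (
   SELECT generate_series(x.start, x.stop, '1 month') AS mon
   FROM   x
   ) y
LEFT   JOIN tbl t ON t.date1::date <= y.mon
                 AND t.date2::date >  y.mon
GROUP  BY y.mon
ORDER  BY y.mon;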
Here is the reasoning. First, create a list of all possible dates. Then get the cumulative number of date1 values up to each date, get the cumulative number of date2 values up to that same date, and subtract the two. The following query does this using correlated subqueries (not my favorite construct, but handy in this case):
select thedate,
(select count(*) from t where date1::date <= d.thedate) -
(select count(*) from t where date2::date <= d.thedate)
from (select distinct thedate
from ((select date1::date as thedate from t) union all
(select date2::date as thedate from t)
) d
) d
This is assuming that date2 occurs after date1. My model is start and stop dates of customers. If this isn't the case, the query might not work.
It sounds like you could benefit from the T-SQL DATEPART function. If I understand you correctly, you could do something like this:
SELECT DATEPART(year, date1) Year, DATEPART(month, date1) Month, SUM(value_col)
FROM $Table
-- WHERE CLAUSE ?
GROUP BY DATEPART(year, date1),
DATEPART(month, date1)
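
Since the question targets Postgres, where DATEPART does not exist, a hedged sketch of the equivalent grouping there (using EXTRACT, and counting rows rather than summing a value column; tbl stands for the question's table) could be:
SELECT EXTRACT(year FROM date1)  AS year
      ,EXTRACT(month FROM date1) AS month
      ,count(*)                  AS num_rows
FROM   tbl
-- WHERE clause ?
GROUP  BY 1, 2
ORDER  BY 1, 2;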