BigQuery wildcard date range with subselect seems to return null - google-bigquery

I'm trying to create a quarterly report where some of the dates are generated from a lookup query. The input is start_date = 20181001 and end_date = 20191231. While I could just query the whole range, I don't need Q1/Q2/Q3, so I'm dynamically generating the in-between dates.
The problem comes when I use them in the subquery against the table suffix.
The dynamically generated values don't work; the filter appears to return null, so the query scans the entire table rather than just the date range. But when I hard-code the same values in a subquery, they work fine.
If you query both date lookup tables, they look identical: dynamically_created and hard_coded return the same four values. So I have no idea where this error is coming from.
CREATE TEMP FUNCTION start_end() AS ( [parse_date('%Y%m%d','{start_date}'), parse_date('%Y%m%d','{end_date}')] );
CREATE TEMP FUNCTION wildcard_format(date_object date) as (replace(cast(date_object as string),"-",""));
-- create a calendar table with one column "day" and one row for each day in the desired timeframe
WITH
calendar AS (
SELECT
extract(quarter from day) quarter
,extract(year from day) year
,day
FROM
UNNEST(GENERATE_DATE_ARRAY( start_end()[OFFSET(0)], start_end()[OFFSET(1)], INTERVAL 1 DAY) ) AS day
),
dynamically_created as (
select
wildcard_format(min(day)) start_py
,wildcard_format(max(case when year = extract (year from parse_date('%Y%m%d','{start_date}')) then day else null end)) end_py
,wildcard_format(min(case when year = extract (year from parse_date('%Y%m%d','{end_date}')) then day else null end)) start_cy
,wildcard_format(max(day)) end_cy
from
calendar
where quarter = extract (quarter from parse_date('%Y%m%d','{end_date}'))
),
hard_coded as (
SELECT
'20181001' as start_py
,'20181231' as end_py
,'20191001' as start_cy
,'20191231' as end_cy
),
sesh_data as (
select
*
from
`projectid.datasetid.summary_*`
where
(SELECT _table_suffix BETWEEN start_py AND end_py FROM dynamically_created) # not working
-- (SELECT _table_suffix BETWEEN start_py AND end_py FROM hard_coded) # working
)
select * from sesh_data
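One observation: when hard-coded, the suffix bounds are constants, while the subquery form is not, which would explain the full scan. A hedged sketch of one workaround is to resolve the bounds first and inject them as literals with EXECUTE IMMEDIATE, so the final wildcard filter is constant (the DECLAREd values below are placeholders standing in for the calendar output, not part of the original query):
-- Sketch only: hard-coded placeholder bounds instead of the calendar logic
DECLARE start_py STRING DEFAULT '20181001';
DECLARE end_py STRING DEFAULT '20181231';
-- Build the final query with constant suffix literals and run it
EXECUTE IMMEDIATE FORMAT("""
SELECT * FROM `projectid.datasetid.summary_*`
WHERE _TABLE_SUFFIX BETWEEN '%s' AND '%s'
""", start_py, end_py);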

Related

most effective way to query all the month-end data in BigQuery

I have a table containing daily transactions, with a date column.
The table is in BigQuery and is partitioned by the date column.
What is the most effective way to query all month-end data from the table?
I tried the SQL below, but it processed the whole table, which is about 100 GB:
SELECT * FROM table
WHERE date = LAST_DAY(date , month)
Shouldn't it process fewer bytes, since the table is partitioned by the date column? (It processes about 300 MB if I choose one specific month-end in the WHERE clause:)
SELECT * FROM table
WHERE date = "2022-11-30"
Any ways to get what I want with processing less data?
You can minimize the volume of data processed, and the cost, by calculating the in-scope list of month-end dates and applying it as a filter condition on the partitioned table.
The following example illustrates this.
The original data looks like the sample below; the expected output is the single matching record, found without scanning the complete table.
The code to achieve it:
with data as
(select '2020-11-20' as add1, 'Robert' as name Union all
select '2021-10-10' as add1, 'Smith' as name Union all
select '2023-9-9' as add1, 'Mike' as name Union all
select '2024-8-2' as add1, 'Donal' as name Union all
select '2025-7-31' as add1, 'Kim' as name ),
-- Calculating the in-scope list of month-end dates
new_data as
(select add1, LAST_DAY(cast (add1 as date)) as last_dt
from data)
-- Applying the filter condition on the date fields
select * from data a, new_data b
where cast (a.add1 as date)=last_dt
The output is the last record, the only one that falls on the last day of its month.
You can use the following query to filter on the last day of the current month and process only that day's partition:
SELECT * FROM table
WHERE date = DATE_TRUNC(DATE_ADD(CURRENT_DATE('Europe/Paris'), INTERVAL 1 MONTH), MONTH) - 1;
The same query with a date column instead of the current date:
SELECT * FROM table
WHERE date = DATE_TRUNC(DATE_ADD(your_date_column, INTERVAL 1 MONTH), MONTH) - 1;
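As an aside, BigQuery's LAST_DAY function (which the question already uses) can express the same constant filter more directly; a minimal sketch, keeping the question's placeholder table and column names:
SELECT * FROM table
WHERE date = LAST_DAY(CURRENT_DATE('Europe/Paris'), MONTH);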

BigQuery: iterating groups within a window of 28 days before a start_date column using _TABLE_SUFFIX

I have a table like this:
group_id  start_date  end_date
19335     20220613    20220714
19527     20220620    20220719
19339     20220614    20220720
19436     20220616    20220715
20095     20220711    20220809
I am trying to retrieve data from another table that is sharded by date, where data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_ids within the period [start_date, end_date]. What I need is to retrieve, for those users, a metric column over the 28 days prior to the start_date of each group_id.
My idea is to:
1. Retrieve the distinct user_ids per group_id within the period [start_date, end_date].
2. Retrieve the metric data for the 28 days prior to the start_date of each group_id.
A code snippet showing how to retrieve data for a single group_id:
WITH users_per_group AS (
SELECT
users_metadata.user_id,
users_metadata.group_id,
FROM
`my_table_users_*` users_metadata
WHERE
_TABLE_SUFFIX BETWEEN '20220314' --start_date
AND '20220413' --end_date
AND experiment_id = 16709
GROUP BY
1,
2
)
SELECT
_TABLE_SUFFIX AS date,
user_id,
SUM(
COALESCE(metric, 0)
) AS metric,
FROM
users_per_group
JOIN `my_metric_table*` metric USING (user_id)
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 28 DAY
)
) -- 28 days before it starts
AND FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 1 DAY
)
) -- 1 day before it starts
GROUP BY
1,
2
ORDER BY
date ASC
Also, I want to avoid retrieving all data (across all dates) for that metric, as the table is huge and it would take a very long time.
Is there an easy way to retrieve the metric data for each user across groups, considering the 28 days before the start_date of each group_id?
I can think of 2 approaches:
1. Join all the tables and then perform your query.
2. Create dynamic queries for each of your users.
Both approaches require search_from and search_to to be available beforehand, i.e. you need to calculate each user's search range before you do anything else. E.g.:
WITH users_per_group AS (
SELECT
user_id, group_id
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
FROM TableName
)
Once you have this kind of table then you can use any of the mentioned approaches.
Since I don't have your data and don't know your table names, I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user,grp,start_date,end_date
with maintable as (
select 'India' visit_from, '20161115' as start_date, '20161202' end_date
union all select 'Sweden' , '20161201', '20161202'
),
--then calculate search from-to date for every user and group
user_per_grp as(
select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from -- change interval as per your need
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
from maintable
)
select visit_from,_TABLE_SUFFIX date,count(visitId) total_visits from
user_per_grp ug
left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub on pub.geoNetwork.country = ug.visit_from
where _TABLE_SUFFIX between format_date("%Y%m%d",ug.search_from) and format_date("%Y%m%d",ug.search_to)
group by 1,2
Approach 2
declare queries array<string> default [];
create temp table maintable as (
select 'India' visit_from, '20161115' as start_date, '20161202' end_date
union all select 'Sweden' , '20161201', '20161202'
);
create temp table user_per_grp as(
select *, DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from
,DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
from maintable
);
-- for each user create a separate query here
FOR record IN (SELECT * from user_per_grp)
DO
set queries = queries || [format('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2',record.visit_from,record.search_from,record.search_to,record.visit_from)];
--replace your query here.
END FOR;
--aggregating all the queries and executing it
execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
Here the 2nd approach processed much less data (~750 KB) than the 1st approach (~17 MB). But that might not hold for your dataset, as the date ranges of two users may overlap, which leads to reading the same tables twice.

Appending the result query in bigquery

I am writing a query in BigQuery where the result should append the data from the previous date.
So the result for today should be higher than yesterday's, as the data accumulates day by day.
So far, all I have managed to get is the data by day (where you can see the number of IDs declining rather than accumulating from the previous day).
What should I do to make the query accumulate, so that each day's result includes the data from the previous days?
code:
WITH
table1 AS (
SELECT
ID,
...
FROM t
WHERE date BETWEEN DATE_SUB('2020-01-31', INTERVAL 31 DAY) AND '2020-01-31' -- assuming a date column named "date"
),
table2 AS (
SELECT
ID,
COUNTIF(rating < 7) as bad,
COUNTIF(rating >= 7 AND SAFE_CAST(NPS_Rating as INT64) < 9) as intermediate,
COUNTIF(rating >= 9) as good -- truncated in the original; >= 9 is an assumed completion
FROM
t
WHERE date BETWEEN DATE_SUB('2020-01-31', INTERVAL 31 DAY) AND '2020-01-31' -- assuming a date column named "date"
)
SELECT
DATE_SUB('2020-01-31', INTERVAL 31 DAY) as date,
*
FROM table1
FULL OUTER JOIN table2 USING (ID)
If you have counts that you want to accumulate, then you want a cumulative sum. The query would look something like this:
select datecol, count(*), sum(count(*)) over (order by datecol)
from t
group by datecol
order by datecol;
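Adapted to the question's rating buckets, a hedged sketch (assuming table t has a date column alongside rating, as in the question's snippet; the analytic SUM runs over the per-day aggregates):
select
date,
sum(countif(rating < 7)) over (order by date) as bad_running,
sum(countif(rating >= 7 and rating < 9)) over (order by date) as intermediate_running,
sum(countif(rating >= 9)) over (order by date) as good_running
from t
group by date
order by date;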

Rolling 12 month filter criteria in SQL

I'm having an issue in a SQL script where I'm trying to build a rolling-12-months filter on the day column, which is stored as text on the server.
The goal is to count sizes per product at each retail store location over the last 12 months from the current day. Currently my query filters on the year 2019, which only counts sizes for that year, not for a rolling 12 months from the current date.
The CALENDARDAY column is a text field in the data set and stores dates in yyyymmdd format.
When I run the script below in Tableau with the GETDATE and DATEADD functions, it gives a functional error. I am querying an SAP HANA server.
Any help would be appreciated
Select
SKU, STYLE_ID, Base_Style_ID, COLOR, SIZEKEY, STORE, Year,
count(SIZEKEY)over(partition by STYLE_ID,COLOR,STORE,Year) as SZ_CNT
from
(
select
a."RAW" As SKU,
a."STYLENUM" As STYLE_ID,
mat."BASENUM" AS Base_Style_ID,
a."COLORNUM" AS COLOR,
a."SIZE" AS SIZEKEY,
a."STORENUM" AS STORE,
substring(a."CALENDARDAY",1,4) As year
from PRTRPT_XRE as a
JOIN ZAT_SKU As mat On a."RAW" = mat."SKU"
where a."ORGANIZATION" = 'M20'
and a."COLORNUM" is not null
and substring(a."CALENDARDAY",1,4) = '2019'
Group BY
a."RAW",
a."STYLENUM",
mat."BASENUM",
a."ZCOLORCD",
a."SIZE",
a."STORENUM",
substring(a."CALENDARDAY",1,4)
)
I have never worked on that DB / Server, so I don't have a way to test this.
But hopefully this will work (expecting exactly 12 months before today's date; the 'YYYYMMDD' mask matches the stated storage format):
AND ADD_MONTHS (TO_DATE (a."CALENDARDAY", 'YYYYMMDD'), 12) > CURRENT_DATE
or
AND ADD_MONTHS (a."CALENDARDAY", 12) > CURRENT_DATE
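Dropped into the question's WHERE clause, the condition would look something like this (a sketch only; it replaces the year-2019 filter, and 'YYYYMMDD' matches the stated storage format):
where a."ORGANIZATION" = 'M20'
and a."COLORNUM" is not null
and TO_DATE(a."CALENDARDAY", 'YYYYMMDD') > ADD_MONTHS(CURRENT_DATE, -12)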
The condition below, built from one of our CALENDAR tables, also worked the same way as the ADD_MONTHS approach mentioned in the response above:
select distinct CALENDARDAY
from
(
select FISCALWEEK, CALENDARDAY, CNST, row_number()over(partition by CNST order by FISCALWEEK desc) as rnum
from
(
select distinct FISCALWEEK, CALENDARDAY, 'A' as CNST
from CALENDARTABLE
where CALENDARDAY < current_date
order by 1,2
)
) where rnum < 366

Oracle - Split a record into multiple records

I have a schedule table holding each month's schedule. The table also records the days off within that month. I need a result set that shows the working days and off days for the month.
E.g.:
CREATE TABLE SCHEDULE(sch_yyyymm varchar2(6), sch varchar2(20), sch_start_date date, sch_end_date date);
INSERT INTO SCHEDULE VALUES('201703','Working Days', to_date('03/01/2017','mm/dd/yyyy'), to_date('03/31/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','Off Day', to_date('03/05/2017','mm/dd/yyyy'), to_date('03/07/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','off Days', to_date('03/08/2017','mm/dd/yyyy'), to_date('03/10/2017','mm/dd/yyyy'));
INSERT INTO SCHEDULE VALUES('201703','off Days', to_date('03/15/2017','mm/dd/yyyy'), to_date('03/15/2017','mm/dd/yyyy'));
Using SQL or PL/SQL, I need to split the records into Working Days and Off Days.
From the above records, I need a result set like:
201703 Working Days 03/01/2017 - 03/04/2017
201703 Off Days 03/05/2017 - 03/10/2017
201703 Working Days 03/11/2017 - 03/14/2017
201703 Off Days 03/15/2017 - 03/15/2017
201703 Working Days 03/16/2017 - 03/31/2017
Thank You for your help.
Edit: I've had a bit more of a think, and this approach works fine for your insert records above; however, it misses records where the "off day" periods are not contiguous. I need to have a bit more of a think and will then make some changes.
I've put together a test using the lead and lag functions and a self join.
The upshot is you self-join the "Off Days" onto the existing tables to find the overlaps. Then calculate the start/end dates on either side of each record. A bit of logic then lets us work out which date to use as the final start/end dates.
SQL fiddle here; I used Postgres, as the Oracle functions weren't working for me, but it should translate OK.
select sch,
/* Work out which date to use as this record's Start date */
case when prev_end_date is null then sch_start_date
else off_end_date + 1
end as final_start_date,
/* Work out which date to use as this record's end date */
case when next_start_date is null then sch_end_date
when next_start_date is not null and prev_end_date is not null then next_start_date - 1
else off_start_date - 1
end as final_end_date
from (
select a.*,
b.*,
/* Get the start/end dates for the records on either side of each working day record */
lead( b.off_start_date ) over( partition by a.sch_start_date order by b.off_start_date ) as next_start_date,
lag( b.off_end_date ) over( partition by a.sch_start_date order by b.off_start_date ) as prev_end_date
from (
/* Get all schedule records */
select sch,
sch_start_date,
sch_end_date
from schedule
) as a
left join
(
/* Get all non-working day schedule records */
select sch as off_sch,
sch_start_date as off_start_date,
sch_end_date as off_end_date
from schedule
where sch <> 'Working Days'
) as b
/* Join on "Off Days" that overlap "Working Days" */
on a.sch_start_date <= b.off_end_date
and a.sch_end_date >= b.off_start_date
and a.sch <> b.off_sch
) as c
order by final_start_date
If you had a dates table this would have been easier.
You can construct a dates table using a recursive CTE and join onto it. Then use the difference-of-row-numbers approach to classify rows with the same schedule on consecutive dates into one group, and take the min and max of each group as the start and end dates for a given sch. I assume there are only 2 sch values, Working Days and Off Day.
with dates(dt) as (select date '2017-03-01' from dual
union all
select dt+1 from dates where dt < date '2017-03-31')
,groups as (select sch_yyyymm,dt,sch,
row_number() over(partition by sch_yyyymm order by dt)
- row_number() over(partition by sch_yyyymm,sch order by dt) as grp
from (select s.sch_yyyymm,d.dt,
/*This condition is to avoid a given date with 2 sch values, as 03-01-2017 - 03-31-2017 are working days
on one row and there is an Off Day status for some of these days.
In such cases Off Day would be picked up as sch*/
case when count(*) over(partition by d.dt) > 1 then min(s.sch) over(partition by d.dt) else s.sch end as sch
from dates d
join schedule s on d.dt >= s.sch_start_date and d.dt <= s.sch_end_date
) t
)
select sch_yyyymm,sch,min(dt) as start_date,max(dt) as end_date
from groups
group by sch_yyyymm,sch,grp
I couldn't get the recursive cte running in Oracle. Here is a demo using SQL Server.
Sample Demo in SQL Server
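If the recursive CTE is the sticking point in Oracle, a hedged alternative is to generate the dates with CONNECT BY LEVEL instead (a sketch only, untested against the original schema; the rest of the query is unchanged):
with dates(dt) as (
  -- one row per day of March 2017: date arithmetic yields the day count
  select date '2017-03-01' + level - 1
  from dual
  connect by level <= date '2017-03-31' - date '2017-03-01' + 1
)
select dt from dates;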