Creating views periodically in BigQuery - google-bigquery

I'm currently using Firebase Analytics to export user-related data to BigQuery.
Is there a way to create a view automatically in BigQuery (every 24 hours, for example), since exports from Firebase create a new table every day, or to create a single view gathering the data from the tables created daily?
Is it possible to do this with the web UI?

You can create a view over a wildcard table so that you don't need to update it each day. Here is an example view definition, using the query from one of your previous questions:
#standardSQL
SELECT
*,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS date
FROM `com_test_testapp_ANDROID.app_events_*`
CROSS JOIN UNNEST(event_dim) AS event_dim
WHERE event_dim.name IN ("EventGamePlayed", "EventGetUserBasicInfos", "EventGetUserCompleteInfos");
Let's say that you name this view com_test_testapp_ANDROID.event_view (make sure to pick a name that isn't included in the app_events_* expansion). Now you can run a query to select yesterday's events, for instance:
#standardSQL
SELECT event_dim
FROM `com_test_testapp_ANDROID.event_view`
WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
Or all events in the past seven days:
#standardSQL
SELECT event_dim
FROM `com_test_testapp_ANDROID.event_view`
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 WEEK);
The important part is having a column in the select list for the view that lets you restrict the _TABLE_SUFFIX to whatever range of time you are interested in.
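To make the suffix-to-date mapping concrete, here is a minimal Python sketch of what the view's date column does. It is only an illustration: the table names and the helper names (suffix_to_date, tables_since) are made up, and the real filtering happens inside BigQuery, not in client code.

```python
from datetime import date, datetime, timedelta

PREFIX = "app_events_"  # the part of the table name before the wildcard


def suffix_to_date(table_id: str) -> date:
    """Mirror PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) for a single table name."""
    suffix = table_id[len(PREFIX):]
    return datetime.strptime(suffix, "%Y%m%d").date()


def tables_since(table_ids, today: date, days: int):
    """Tables kept by a filter like: date >= DATE_SUB(today, INTERVAL days DAY)."""
    cutoff = today - timedelta(days=days)
    return [t for t in table_ids if suffix_to_date(t) >= cutoff]


# Hypothetical daily export tables.
tables = ["app_events_20170101", "app_events_20170105", "app_events_20170108"]
recent = tables_since(tables, today=date(2017, 1, 9), days=7)
print(recent)
```

Only the tables whose suffix falls inside the requested range are kept, which is why exposing the parsed suffix as a column matters.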

Related

BigQuery - Schedule Query on an external partitioned table with the keyword @run_date

(Because it is client data, I've replaced the project name and the dataset name in this post with ******.)
I'm trying to create a new scheduled query in BigQuery on Google Cloud Platform.
The problem is that I get this error in the web query editor:
Cannot query over table '******.raw_bounce_rate' without a filter over column(s) 'dt' that can be used for partition elimination
The thing is, I do filter on the column dt.
Here is the schema of my external partitioned table:
Tracking_Code STRING
Pages STRING NULLABLE
Clicks_to_Page INTEGER
Path_Lengths INTEGER
Visit_Number INTEGER
Visitor_ID STRING
Mobile_Device_Type STRING
All_Visits INTEGER
dt DATE
dt is the partition field, and I selected the option "Require partition filter".
Here is the simplified SQL of my query:
WITH yesterday_raw_bounce_rate AS (
SELECT *
FROM `******.raw_bounce_rate`
WHERE dt = DATE_SUB(@run_date, INTERVAL 1 DAY)
),
entries_table as (
SELECT dt,
ifnull(Tracking_Code, "sans campagne") as tracking_code,
ifnull(Pages, "page non trackée") as pages,
Visitor_ID,
Path_Lengths,
Clicks_to_Page,
SUM(all_visits) AS somme_visites
FROM
yesterday_raw_bounce_rate
GROUP BY
dt,
Tracking_Code,
Pages,
Visitor_ID,
Path_Lengths,
Clicks_to_Page
HAVING
somme_visites = 1 and Clicks_to_Page = 1
)
select * from entries_table
If I remove the condition
Clicks_to_Page = 1
or if I replace
DATE_SUB(@run_date, INTERVAL 1 DAY)
with a hard-coded date,
the query is accepted by BigQuery. This does not make sense to me.
Currently, there is an open issue for this. It addresses the error raised when @run_date is used in the filter of a scheduled query over a partitioned table that requires a partition filter. The engineering team is working on it, although there is no ETA.
In your scheduled query, you can use one of two workarounds that still use @run_date, as follows:
First option,
DECLARE runDateVariable DATE DEFAULT @run_date;
#your code...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
Second option,
DECLARE runDateVariable DATE DEFAULT CAST(@run_date AS DATE);
#your code...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
In addition, you can also use CURRENT_DATE() instead of @run_date, as shown below:
DECLARE runDateVariable DATE DEFAULT CURRENT_DATE();
#your code...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
UPDATE
I have set up another scheduled query to run daily against a table partitioned by DATE on a field called date_formatted, with the partition filter required. Then I set up a backfill so I could see the result of the scheduled query for previous days. Below is the code I used:
DECLARE runDateVariable DATE DEFAULT @run_date;
SELECT @run_date as run_date, date_formatted, fullvisitorId FROM `project_id.dataset.table_name` WHERE date_formatted > DATE_SUB(runDateVariable, INTERVAL 1 DAY)
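As a rough way to reason about what a backfill reads, the sketch below maps each run date to the partition selected by a filter of the form dt = DATE_SUB(runDateVariable, INTERVAL 1 DAY). It is plain Python with made-up run dates, and it assumes each backfilled run receives its own date as the run date; it does not call BigQuery.

```python
from datetime import date, timedelta


def partition_read_by_run(run_date: date) -> date:
    """Partition matched by: dt = DATE_SUB(runDateVariable, INTERVAL 1 DAY)."""
    return run_date - timedelta(days=1)


# Hypothetical run dates covered by a backfill.
run_dates = [date(2020, 3, 1) + timedelta(days=i) for i in range(3)]

# Map each run date to the single partition that run would filter on.
plan = {r.isoformat(): partition_read_by_run(r).isoformat() for r in run_dates}
for run, part in plan.items():
    print(f"run {run} reads partition {part}")
```

Each run touches exactly one partition, which is the behavior the required partition filter is meant to enforce.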

Use DataStudio to specify the date range for a custom query in BigQuery, where the date range influences operators in the query

I currently have a DataStudio dashboard connected to a BigQuery custom query.
That BQ query has a hardcoded date range and the status of one of the columns (New_or_Relicensed) can change dynamically for a row, based on the dates specified in the range. I would like to be able to alter that range from DataStudio.
I have tried:
simply connecting the DS dashboard to the custom query in BQ and then introducing a date range filter, but as you can imagine - that does not work because it's operating on an already hard-coded date range.
reviewing similar answers, but their problem doesn't appear to be quite the same E.g. BigQuery Data Studio Custom Query
Here is the query I have in BQ:
SELECT t0.New_Or_Relicensed, t0.Title_Category FROM (WITH
report_range AS
(
SELECT
TIMESTAMP '2019-06-24 00:00:00' AS start_date,
TIMESTAMP '2019-06-30 00:00:00' AS end_date
)
SELECT
schedules.schedule_entry_id AS Schedule_Entry_ID,
schedules.schedule_entry_starts_at AS Put_Up,
schedules.schedule_entry_ends_at AS Take_Down,
schedule_entries_metadata.contract AS Schedule_Entry_Contract,
schedules.platform_id AS Platform_ID,
platforms.platform_name AS Platform_Name,
titles_metadata.title_id AS Title_ID,
titles_metadata.name AS Title_Name,
titles_metadata.category AS Title_Category,
IF (other_schedules.schedule_entry_id IS NULL, "new", "relicensed") AS New_Or_Relicensed
FROM
report_range, client.schedule_entries AS schedules
JOIN client.schedule_entries_metadata
ON schedule_entries_metadata.schedule_entry_id = schedules.schedule_entry_id
JOIN
client.platforms
ON schedules.platform_id = platforms.platform_id
JOIN
client.titles_metadata
ON schedules.title_id = titles_metadata.title_id
LEFT OUTER JOIN
client.schedule_entries AS other_schedules
ON schedules.platform_id = other_schedules.platform_id
AND other_schedules.schedule_entry_ends_at < report_range.start_date
AND schedules.title_id = other_schedules.title_id
WHERE
((schedules.schedule_entry_starts_at >= report_range.start_date AND
schedules.schedule_entry_starts_at <= report_range.end_date) OR
(schedules.schedule_entry_ends_at >= report_range.start_date AND
schedules.schedule_entry_ends_at <= report_range.end_date))
) AS t0 LIMIT 100;
Essentially - I would like to be able to set the start_date and end_date from google data studio, and have those dates incorporated into the report_range that then influences the operations in the rest of the query (that assign a schedule entry as new or relicensed).
Have you looked at using the Custom Query interface of the BigQuery connector in Data Studio to define start_date and end_date as parameters as part of a filter?
Your query would need a little rework...
The following example custom query uses the @DS_START_DATE and @DS_END_DATE parameters as part of a filter on the creation date column of a table. The records produced by the query will be limited to the date range selected by the report user, reducing the number of records returned and resulting in a faster query:
Resources:
Introducing BigQuery parameters in Data Studio
https://www.blog.google/products/marketingplatform/analytics/introducing-bigquery-parameters-data-studio/
Running parameterized queries
https://cloud.google.com/bigquery/docs/parameterized-queries
I had a similar issue where I wanted to incorporate a 30-day lookback before the start date (@DS_START_DATE). In this case I was using Google Analytics UA session data with _TABLE_SUFFIX in my WHERE clause. I was able to calculate a date relative to the built-in Data Studio "string" dates by using the following:
...
WHERE
_table_suffix BETWEEN
CAST(FORMAT_DATE('%Y%m%d', DATE_SUB (PARSE_DATE('%Y%m%d',@DS_START_DATE), INTERVAL 30 DAY)) AS STRING)
AND
CAST(FORMAT_DATE('%Y%m%d', DATE_SUB (PARSE_DATE('%Y%m%d',@DS_END_DATE), INTERVAL 0 DAY)) AS STRING)
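To check the arithmetic outside of BigQuery, the same lookback can be reproduced in a few lines of Python. This is only a sketch: suffix_bounds is a hypothetical helper, and it assumes the Data Studio date parameters arrive as YYYYMMDD strings, as the PARSE_DATE calls above imply.

```python
from datetime import datetime, timedelta


def suffix_bounds(ds_start: str, ds_end: str, lookback_days: int = 30):
    """Return the (lower, upper) _TABLE_SUFFIX bounds as YYYYMMDD strings."""
    start = datetime.strptime(ds_start, "%Y%m%d").date()
    end = datetime.strptime(ds_end, "%Y%m%d").date()
    # Extend the lower bound backwards; the upper bound is left unchanged.
    lower = start - timedelta(days=lookback_days)
    return lower.strftime("%Y%m%d"), end.strftime("%Y%m%d")


bounds = suffix_bounds("20190624", "20190630")
print(bounds)
```

For a report range of 24 to 30 June 2019, the lower bound moves back to 25 May 2019 while the upper bound stays at 30 June.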

Check if a dataset has been updated with new table BigQuery

I currently have a dataset (e.g. Traffic) with sharded tables that get added every week, named 'Traffic_timestamp', where timestamp is the day the table is created.
I would like to check whether a particular 'Traffic_timestamp' table is present in the dataset. I'm looking for an automatic way of checking instead of manually checking the dataset.
The example below (for BigQuery Standard SQL) should give you an idea:
#standardSQL
SELECT *
FROM `project.dataset.__TABLES_SUMMARY__`
WHERE REGEXP_CONTAINS(table_id, CONCAT('Traffic_', r'\d{8}'))
AND SUBSTR(table_id, -8) = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
You can adjust this to whatever specific logic for a new table you have.
For example, if you were looking for the previous day's table, you would use:
AND SUBSTR(table_id, -8) = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
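If the list of table IDs is already available on the client side, the same check can be sketched in Python. The table IDs below are invented and has_table_for is a hypothetical helper; how the IDs are fetched (for example, from the query above) is left out.

```python
import re
from datetime import date

# Same pattern as the REGEXP_CONTAINS call: 'Traffic_' followed by 8 digits.
PATTERN = re.compile(r"Traffic_\d{8}")


def has_table_for(table_ids, day: date) -> bool:
    """True if some Traffic_YYYYMMDD table ends with the given day's suffix."""
    wanted = day.strftime("%Y%m%d")
    return any(
        PATTERN.search(t) is not None and t[-8:] == wanted for t in table_ids
    )


# Hypothetical contents of the dataset.
ids = ["Traffic_20200101", "Traffic_20200108", "Other_20200115"]
print(has_table_for(ids, date(2020, 1, 8)))
print(has_table_for(ids, date(2020, 1, 15)))
```

The second call is false because the only table with that suffix does not match the Traffic_ pattern.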

Update tables with new data

I receive new tables with data in BigQuery every day.
One table = one date.
For example: costdata_01012018, costdata02012018 and so on.
I have a script that unions them every day so that I have a new table with all the data I need. For now I truncate the final table every day, and that doesn't seem right.
Is there any way to union them without truncation?
I just need to add each new table to the final one.
I tried to write a FROM clause that dynamically finds the new table, but it doesn't work.
SELECT date, adcost
FROM CONCAT('[test-project-187411:dataset.CostData_', STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d"), ']')
What am I doing wrong?
Two options to do this:
#standardsql
SELECT date, adcost
FROM `test-project-187411.dataset.CostData_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Legacy SQL
#legacysql
SELECT date, adcost
FROM TABLE_QUERY([test-project-187411:dataset], 'table_id = CONCAT("CostData_", STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d"))')
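One thing worth checking is the suffix format. The example names in the question (costdata_01012018) look like day-month-year, whereas both queries build a year-month-day suffix. The Python sketch below shows the difference; yesterday_table is a hypothetical helper and the day-month-year reading of the question's names is an assumption.

```python
from datetime import date, timedelta


def yesterday_table(today: date, fmt: str = "%Y%m%d", prefix: str = "CostData_") -> str:
    """Name of the table for the day before `today`, using the given suffix format."""
    return prefix + (today - timedelta(days=1)).strftime(fmt)


today = date(2018, 1, 3)
ymd_name = yesterday_table(today)                # year-month-day, as in the queries
dmy_name = yesterday_table(today, fmt="%d%m%Y")  # day-month-year, as the question's names suggest
print(ymd_name, dmy_name)
```

If the tables really are named day-month-year, the format string in FORMAT_DATE or STRFTIME_UTC_USEC would need to change accordingly.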

With daily tables on BigQuery, how can I query rolling 12 months?

I have daily tables in BigQuery, (table_name_yyyymmdd). How can I write a view that will always query the rolling 12 months of data?
As an example:
Save the query below as a view (let's name it view_test; I assume it is in the same dataset as the tables):
#standardSQL
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) as table, COUNT(1) as rows_count
FROM `yourProject.yourDataset.table_name_*`
WHERE _TABLE_SUFFIX
BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 12 DAY) )
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) )
GROUP BY 1
Now you can use it as below for example:
#standardSQL
SELECT *
FROM `yourProject.yourDataset.view_test`
So this view references the last 12 full days.
You can change DAY to MONTH to get 12 months instead.
Hope you get the idea.
If needed, this can easily be "translated" to Legacy SQL (make sure the view and the query that calls the view use the same SQL version/dialect).
Note: Google recommends migrating to Standard SQL whenever possible!
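For the MONTH variant, the suffix bounds can be sanity-checked with a small Python sketch. It assumes a lower bound of DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH) and an upper bound of yesterday; sub_months and rolling_suffix_bounds are hypothetical helpers, and the end-of-month clamping is meant to imitate how DATE_SUB handles shorter months.

```python
import calendar
from datetime import date, timedelta


def sub_months(d: date, months: int) -> date:
    """Subtract whole months, clamping the day to the end of the target month."""
    idx = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(idx, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def rolling_suffix_bounds(today: date, months: int = 12):
    """(lower, upper) YYYYMMDD suffixes: `months` back from today up to yesterday."""
    lower = sub_months(today, months)
    upper = today - timedelta(days=1)
    return lower.strftime("%Y%m%d"), upper.strftime("%Y%m%d")


print(rolling_suffix_bounds(date(2018, 3, 15)))
print(rolling_suffix_bounds(date(2020, 2, 29)))
```

The second call shows the clamping case: twelve months before 29 February 2020 lands on 28 February 2019.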
You could use the legacy SQL function TABLE_DATE_RANGE, which according to the docs (https://cloud.google.com/bigquery/docs/reference/legacy-sql#table-date-range):
Queries multiple daily tables that span a date range.
For example:
SELECT * FROM TABLE_DATE_RANGE(data_set_name.table_name,
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-12-31'))
As there is currently no option to parametrise a view programmatically, you need to generate your queries/views with some other tool or code.