I have daily tables in BigQuery (table_name_yyyymmdd). How can I write a view that will always query the rolling 12 months of data?
As an example, save the query below as a view (let's name it view_test; I assume it is in the same dataset as the tables):
#standardSQL
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) as table, COUNT(1) as rows_count
FROM `yourProject.yourDataset.table_name_*`
WHERE _TABLE_SUFFIX
BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 12 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY 1
Now you can use it as below for example:
#standardSQL
SELECT *
FROM `yourProject.yourDataset.view_test`
So, this view references the last 12 full days (today excluded).
You can change DAY to MONTH to have the last 12 months instead.
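For example, the 12-month version of the view body could look like this (same placeholder project and dataset names as above; the alias table_date is just a naming choice):

```sql
#standardSQL
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS table_date, COUNT(1) AS rows_count
FROM `yourProject.yourDataset.table_name_*`
WHERE _TABLE_SUFFIX
  BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY 1
```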
Hope this gives you an idea.
If needed, this can easily be "translated" to Legacy SQL (make sure the view and the query that calls the view use the same SQL dialect).
Note: Google recommends migrating to Standard SQL whenever possible!
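For reference, a rough Legacy SQL equivalent might use TABLE_DATE_RANGE. Treat this as a sketch only, with the same placeholder names as above:

```sql
#legacySQL
SELECT COUNT(1) AS rows_count
FROM TABLE_DATE_RANGE([yourProject:yourDataset.table_name_],
  DATE_ADD(CURRENT_TIMESTAMP(), -12, 'MONTH'),
  DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY'))
```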
You could use the function TABLE_DATE_RANGE which, according to the documentation (https://cloud.google.com/bigquery/docs/reference/legacy-sql#table-date-range):
Queries multiple daily tables that span a date range.
For example:
#legacySQL
SELECT *
FROM TABLE_DATE_RANGE([data_set_name.table_name_],
  TIMESTAMP('2016-01-01'),
  TIMESTAMP('2016-12-31'))
As there is currently no option to parametrise your view programmatically, you need to generate your queries/views with some other tool or code.
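In newer BigQuery, scripting offers a workaround: the query text can be built at run time with EXECUTE IMMEDIATE instead of an external tool. A sketch, assuming the same placeholder dataset and table prefix (this feature did not exist when Legacy SQL was current):

```sql
-- Compute the suffix for "12 months ago" once, then run a dynamically built query.
DECLARE start_suffix STRING DEFAULT
  FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH));

EXECUTE IMMEDIATE FORMAT("""
  SELECT COUNT(1) AS rows_count
  FROM `data_set_name.table_name_*`
  WHERE _TABLE_SUFFIX >= '%s'
""", start_suffix);
```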
Related
My dataset has sharded tables and I select from all of them:
Select Customer id
from Company.database.Customer_*
with shards ranging from 2022-01-01 till today.
But the 2022-06-08 shard has a bad version of the data, and I don't want to select that version.
I tried
Select Customer id
from Company.database.Customer_*[^20220608]
but that doesn't work.
Finally I found the answer:
Select Customer id
from `Company.database.Customer_*`
where _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
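If the goal is specifically to skip only the bad 2022-06-08 shard, a plain suffix filter should also work. A sketch; the column name Customer_id is a guess at the real schema:

```sql
#standardSQL
-- Exclude a single known-bad shard by its date suffix
SELECT Customer_id
FROM `Company.database.Customer_*`
WHERE _TABLE_SUFFIX != '20220608'
```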
I currently have a dataset (e.g. Traffic) with shard tables that get added every week with the name 'Traffic_timestamp', where timestamp is the day the table was created.
I would like to check whether a particular 'Traffic_timestamp' is present in the dataset - looking for an automatic way of checking instead of checking the dataset manually.
The example below (for BigQuery Standard SQL) should give you an idea:
#standardSQL
SELECT *
FROM `project.dataset.__TABLES_SUMMARY__`
WHERE REGEXP_CONTAINS(table_id, CONCAT('Traffic_', r'\d{8}'))
AND SUBSTR(table_id, -8) = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
You can adjust this to whatever specific logic for new tables you have.
For example, if you were looking for the previous day's table, you would use:
AND SUBSTR(table_id, -8) = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
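If you just want a yes/no answer for today's table, the same metadata view can return a single boolean (hypothetical project/dataset names as above):

```sql
#standardSQL
-- Returns TRUE if today's Traffic_ shard already exists in the dataset
SELECT COUNT(1) > 0 AS table_exists
FROM `project.dataset.__TABLES_SUMMARY__`
WHERE table_id = CONCAT('Traffic_', FORMAT_DATE('%Y%m%d', CURRENT_DATE()))
```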
I receive new tables with data in BigQuery every day.
One table = one date.
For example: costdata_01012018, costdata_02012018, and so on.
I have a script that unions them every day, so I have a new table with all the data I need. For now I truncate the final table every day, and that doesn't seem right.
Is there any way to union them without truncation?
I just need to add the new table to the final one.
I tried to create a 'from' clause that dynamically finds the new table, but it doesn't work.
SELECT date, adcost
FROM CONCAT('[test-project-187411:dataset.CostData_', STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d"), ']')
What am I doing wrong?
Two options to do this:
Standard SQL
#standardSQL
SELECT date, adcost
FROM `test-project-187411.dataset.CostData_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Legacy SQL
#legacysql
SELECT date, adcost
FROM TABLE_QUERY([test-project-187411:dataset], 'table_id = CONCAT("CostData_", STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d"))')
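To append yesterday's shard into the final table without truncating it, the Standard SQL option can also be run as a DML INSERT. A sketch; `costdata_all` is a hypothetical name for the final table, deliberately not matching the `CostData_*` wildcard so the query never reads its own destination:

```sql
#standardSQL
-- Append only yesterday's daily shard into the consolidated table
INSERT INTO `test-project-187411.dataset.costdata_all` (date, adcost)
SELECT date, adcost
FROM `test-project-187411.dataset.CostData_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
```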
if I have a series of fact tables such as:
fact-01012001
fact-01022001
fact-01032001
dim001
dim002
a wildcard will allow me to search all three, for example:
select * from fact-*
is there a way to use wildcards or otherwise to get the most recent fact table? say only 01032001?
Until the relevant feature request is implemented, you will need to use a query to determine the most recent date, then another query to select from that table. For example:
#standardSQL
SELECT _TABLE_SUFFIX AS latest_date
FROM `fact-*`
ORDER BY PARSE_DATE('%m%d%Y', _TABLE_SUFFIX) DESC LIMIT 1;
After retrieving the latest date, query it:
#standardSQL
SELECT *
FROM `fact-01032001`;
Below is a one-step approach for BigQuery Standard SQL:
#standardSQL
SELECT *
FROM `yourProject.yourDataset.fact_*`
WHERE _TABLE_SUFFIX IN (
SELECT
FORMAT_DATE('%m%d%Y', MAX(PARSE_DATE('%m%d%Y', SUBSTR(table_id, -8)))) AS d
FROM `yourProject.yourDataset.__TABLES_SUMMARY__`
WHERE SUBSTR(table_id, 1, LENGTH('fact_')) = 'fact_'
AND LENGTH(table_id) = LENGTH('fact_') + 8
GROUP BY SUBSTR(table_id, 1, LENGTH(table_id) - 8)
)
Of course you can replace LENGTH('fact_') with 5 - I just wrote it this way so it is easier to understand.
And 8 is the length of the expected suffix, so you catch only the expected tables from a list like:
fact_01012001
fact_01022001
fact_01032001
I would like to improve the solution given by Mikhail Berlyant one step further.
As the number of shards grows, you will see a couple of problems:
You are always querying all the shards, which increases the billing (BigQuery bills by the bytes processed by the query: more shards, more bytes).
There is a limit of 1000 shards that you can query in a single query (at one shard per day, that is a bit under 3 years' worth of data).
With the solution below, you would only be querying 1 or 2 shards, depending on whether the daily data has already been loaded or not.
SELECT *
FROM `yourProject.yourDataset.fact_*`
WHERE
PARSE_DATE('%m%d%Y',
_TABLE_SUFFIX) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND CURRENT_DATE()
AND
_TABLE_SUFFIX IN (
SELECT
FORMAT_DATE('%m%d%Y', MAX(PARSE_DATE('%m%d%Y', SUBSTR(table_id, -8)))) AS d
FROM `yourProject.yourDataset.__TABLES_SUMMARY__`
WHERE STARTS_WITH(table_id, 'fact_')
AND LENGTH(table_id) = LENGTH('fact_') + 8
)
All the notes from the original answer apply.
I'm currently using Firebase Analytics to export user-related data to BigQuery.
Since the exports from Firebase create a new table every day, is there a way to create a view automatically in BigQuery (every 24 hours, for example), or a single view gathering the data from the daily tables?
Is it possible to do such things with the web UI?
You can create a view over a wildcard table so that you don't need to update it each day. Here is an example view definition, using the query from one of your previous questions:
#standardSQL
SELECT
*,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS date
FROM `com_test_testapp_ANDROID.app_events_*`
CROSS JOIN UNNEST(event_dim) AS event_dim
WHERE event_dim.name IN ("EventGamePlayed", "EventGetUserBasicInfos", "EventGetUserCompleteInfos");
Let's say that you name this view com_test_testapp_ANDROID.event_view (make sure to pick a name that isn't included in the app_events_* expansion). Now you can run a query to select yesterday's events, for instance:
#standardSQL
SELECT event_dim
FROM `com_test_testapp_ANDROID.event_view`
WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
Or all events in the past seven days:
#standardSQL
SELECT event_dim
FROM `com_test_testapp_ANDROID.event_view`
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 WEEK);
The important part is having a column in the select list for the view that lets you restrict the _TABLE_SUFFIX to whatever range of time you are interested in.