I receive new tables with data in BigQuery every day.
One table = one date.
For example: costdata_01012018, costdata_02012018, and so on.
I have a script that unions them every day, so I have a new table with all the data I need. For now I truncate the final table every day, and it doesn't seem right.
Is there any way to union them without truncation?
I just need to append the new table to the final one.
I tried to write a FROM clause that dynamically finds the new table, but it doesn't work:
SELECT date, adcost
FROM CONCAT('[test-project-187411:dataset.CostData_', STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d"), ']')
What am I doing wrong?
Two options to do this:
Standard SQL
#standardSQL
SELECT date, adcost
FROM `test-project-187411.dataset.CostData_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Legacy SQL
#legacysql
SELECT date, adcost
FROM TABLE_QUERY([test-project-187411:dataset], 'table_id = CONCAT("CostData_", STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d"))')
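Both variants reduce to the same string computation: yesterday's date formatted as %Y%m%d, used as the table suffix. A minimal Python sketch of that logic (the function name is my own, for illustration):

```python
from datetime import date, timedelta

def yesterday_suffix(today: date) -> str:
    """Format yesterday's date as YYYYMMDD, matching FORMAT_DATE('%Y%m%d', ...)."""
    return (today - timedelta(days=1)).strftime("%Y%m%d")

# The standard SQL query compares _TABLE_SUFFIX to this value;
# the legacy SQL query builds "CostData_" + this value as the table id.
print(yesterday_suffix(date(2018, 1, 2)))  # 20180101
```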
I have partitioned my DATA_BUCKET in S3 with the structure
S3/DATA_BUCKET/table_1/YYYY/MM/DD/files.parquet
Now I have three additional columns in table_1, visible in Athena as "partition_0", "partition_1" and "partition_2" (for year, month and day respectively).
Until now my apps made time-related queries based on the "time_stamp" column in the table:
select * from table_1 where time_stamp like '2023-01-17%'
Now, to take advantage of the partitions for performance, the corresponding new query is:
select * from table_1 where partition_0 = '2023' and partition_1 = '01' and partition_2 = '17'
Problem:
Since my apps already make many queries on time_stamp, I do not want to change them, but I would still like to somehow transform those queries into the "partition-type" queries above.
Is there any way to do this, internally in Athena or otherwise?
TIA
You can create a view over the original table with a new "time_stamp" column.
This column computes the date from the date parts:
CREATE OR REPLACE VIEW my_view AS
SELECT mytable.col1,
mytable.col2,
cast(date_add('day', trans_day - 1, date_add('month', trans_month - 1, date_add('year', trans_year - 1970, from_unixtime(0)))) as Date) as time_stamp
FROM my_db.my_table mytable
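The nested date_add() calls build a date by starting at the epoch and adding the year, month and day offsets. A Python sketch of the same arithmetic, assuming integer part columns as in the view above:

```python
from datetime import date, timedelta

def date_from_parts(trans_year: int, trans_month: int, trans_day: int) -> date:
    """Mirror the nested date_add() calls: start at the epoch (1970-01-01),
    then add (year - 1970) years, (month - 1) months and (day - 1) days."""
    d = date(1970, 1, 1)
    d = d.replace(year=d.year + (trans_year - 1970))
    # adding months can roll over a year boundary
    total_month = d.month + (trans_month - 1)
    d = d.replace(year=d.year + (total_month - 1) // 12,
                  month=(total_month - 1) % 12 + 1)
    return d + timedelta(days=trans_day - 1)

print(date_from_parts(2023, 1, 17))  # 2023-01-17
```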
(Because this is client data, I've replaced the project name and the dataset name with ****** in this post.)
I'm trying to create a new scheduled query in BigQuery on Google Cloud Platform.
The problem is that I get this error in the web query editor:
Cannot query over table '******.raw_bounce_rate' without a filter over column(s) 'dt' that can be used for partition elimination
The thing is, I do filter on the column dt.
Here is the schema of my external partitioned table:
Tracking_Code STRING
Pages STRING NULLABLE
Clicks_to_Page INTEGER
Path_Lengths INTEGER
Visit_Number INTEGER
Visitor_ID STRING
Mobile_Device_Type STRING
All_Visits INTEGER
dt DATE
dt is the field of the partition and I selected the option "Require partition filter"
Here is the simplified SQL of my query:
WITH yesterday_raw_bounce_rate AS (
SELECT *
FROM `******.raw_bounce_rate`
WHERE dt = DATE_SUB(@run_date, INTERVAL 1 DAY)
),
entries_table as (
SELECT dt,
ifnull(Tracking_Code, "sans campagne") as tracking_code,
ifnull(Pages, "page non trackée") as pages,
Visitor_ID,
Path_Lengths,
Clicks_to_Page,
SUM(all_visits) AS somme_visites
FROM
yesterday_raw_bounce_rate
GROUP BY
dt,
Tracking_Code,
Pages,
Visitor_ID,
Path_Lengths,
Clicks_to_Page
HAVING
somme_visites = 1 and Clicks_to_Page = 1
)
select * from entries_table
If I remove the condition
Clicks_to_Page = 1
or replace
DATE_SUB(@run_date, INTERVAL 1 DAY)
with a hard-coded date, the query is accepted by BigQuery. It does not make sense to me.
Currently, there is an open issue that addresses the error regarding the @run_date parameter in the filter of scheduled queries against partitioned tables with a required partition filter. The engineering team is working on it, although there is no ETA.
In your scheduled query, you can use one of the two following workarounds with @run_date:
First option,
DECLARE runDateVariable DATE DEFAULT @run_date;
#your code...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
Second option,
DECLARE runDateVariable DATE DEFAULT CAST(@run_date AS DATE);
#your code...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
In addition, you can also use CURRENT_DATE() instead of @run_date, as shown below:
DECLARE runDateVariable DATE DEFAULT CURRENT_DATE();
#your code...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
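All three workarounds resolve to the same thing at execution time: the partition read is the run date minus one day. A trivial sketch of that resolution:

```python
from datetime import date, timedelta

def partition_for_run(run_date: date) -> date:
    """A scheduled query executing on run_date reads yesterday's partition,
    i.e. DATE_SUB(run_date, INTERVAL 1 DAY)."""
    return run_date - timedelta(days=1)

print(partition_for_run(date(2020, 5, 2)))  # 2020-05-01
```

For a backfill, the scheduler supplies each historical run date in turn, so each backfilled run reads the partition of the day before its own run date.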
UPDATE
I have set up another scheduled query that runs daily against a table partitioned by DATE on a field called date_formatted, with the partition filter required. Then I set up a backfill so I could see the result of the scheduled query for previous days. Below is the code I used:
DECLARE runDateVariable DATE DEFAULT @run_date;
SELECT @run_date AS run_date, date_formatted, fullvisitorId FROM `project_id.dataset.table_name` WHERE date_formatted > DATE_SUB(runDateVariable, INTERVAL 1 DAY)
I currently have a dataset (e.g. Traffic) with sharded tables that get added every week, named 'Traffic_timestamp', where timestamp is the day the table is created.
I would like to check whether a particular 'Traffic_timestamp' table is present in the dataset. I'm looking for an automatic way of checking instead of checking the dataset manually.
Below example (for BigQuery Standard SQL) should give you an idea
#standardSQL
SELECT *
FROM `project.dataset.__TABLES_SUMMARY__`
WHERE REGEXP_CONTAINS(table_id, CONCAT('Traffic_', r'\d{8}'))
AND SUBSTR(table_id, -8) = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
You can adjust this to whatever specific logic for new tables you have.
For example, if you were looking for the previous day's table, you would use
AND SUBSTR(table_id, -8) = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
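The same check can also be scripted outside SQL. A minimal sketch, assuming you already have the list of table IDs as strings (for example from client.list_tables(dataset) in the google-cloud-bigquery client; here it is just a plain list):

```python
from datetime import date, timedelta

def shard_exists(table_ids, prefix="Traffic_", on=None):
    """Check whether the shard for a given day (default: yesterday) is present.
    table_ids is any iterable of table-ID strings."""
    day = on or date.today() - timedelta(days=1)
    return prefix + day.strftime("%Y%m%d") in set(table_ids)

tables = ["Traffic_20230110", "Traffic_20230117"]
print(shard_exists(tables, on=date(2023, 1, 17)))  # True
print(shard_exists(tables, on=date(2023, 1, 16)))  # False
```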
I'm currently using Firebase Analytics to export user-related data to BigQuery.
Since exports from Firebase create a new table every day, is there a way to create a view in BigQuery automatically (every 24 hours, for example), or a single view gathering the data from the daily tables?
Is it possible to do such things with the web UI?
You can create a view over a wildcard table so that you don't need to update it each day. Here is an example view definition, using the query from one of your previous questions:
#standardSQL
SELECT
*,
PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS date
FROM `com_test_testapp_ANDROID.app_events_*`
CROSS JOIN UNNEST(event_dim) AS event_dim
WHERE event_dim.name IN ("EventGamePlayed", "EventGetUserBasicInfos", "EventGetUserCompleteInfos");
Let's say that you name this view com_test_testapp_ANDROID.event_view (make sure to pick a name that isn't included in the app_events_* expansion). Now you can run a query to select yesterday's events, for instance:
#standardSQL
SELECT event_dim
FROM `com_test_testapp_ANDROID.event_view`
WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
Or all events in the past seven days:
#standardSQL
SELECT event_dim
FROM `com_test_testapp_ANDROID.event_view`
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 WEEK);
The important part is having a column in the select list for the view that lets you restrict the _TABLE_SUFFIX to whatever range of time you are interested in.
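That date column is just the shard suffix parsed into a DATE. A quick Python sketch of the equivalence, so the filters above are easy to reason about:

```python
from datetime import datetime, date, timedelta

def suffix_to_date(table_suffix: str) -> date:
    """Equivalent of PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) in the view."""
    return datetime.strptime(table_suffix, "%Y%m%d").date()

# With the date column in place, "yesterday" is a plain date comparison:
suffix = "20230116"
print(suffix_to_date(suffix) == date(2023, 1, 17) - timedelta(days=1))  # True
```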
I have daily tables in BigQuery (table_name_yyyymmdd). How can I write a view that will always query the rolling 12 months of data?
As an example:
Save the below query as a view (let's name it view_test; I assume it is in the same dataset as the tables):
#standardSQL
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) as table, COUNT(1) as rows_count
FROM `yourProject.yourDataset.table_name_*`
WHERE _TABLE_SUFFIX
BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 12 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY 1
Now you can use it as below for example:
#standardSQL
SELECT *
FROM `yourProject.yourDataset.view_test`
So, this view references the last 12 full days.
You can change DAY to MONTH to get 12 months instead.
Hope this gives you an idea.
If needed, this can easily be translated to Legacy SQL (make sure the view and the query that calls the view use the same SQL version/dialect).
Note: Google recommends migrating to Standard SQL whenever possible!
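The BETWEEN bounds are just formatted dates. A sketch of how a rolling window of N full days (inclusive on both ends, up to yesterday) maps to suffix strings:

```python
from datetime import date, timedelta

def rolling_window_suffixes(today: date, days: int = 12):
    """Lower and upper _TABLE_SUFFIX bounds for the BETWEEN clause:
    from `days` days ago up to yesterday, as YYYYMMDD strings."""
    lo = (today - timedelta(days=days)).strftime("%Y%m%d")
    hi = (today - timedelta(days=1)).strftime("%Y%m%d")
    return lo, hi

print(rolling_window_suffixes(date(2018, 1, 20)))  # ('20180108', '20180119')
```

Because the view calls CURRENT_DATE() itself, the window slides automatically each day with no need to redefine the view.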
You could use the function TABLE_DATE_RANGE, which according to the docs (https://cloud.google.com/bigquery/docs/reference/legacy-sql#table-date-range):
Queries multiple daily tables that span a date range.
like below:
SELECT * FROM TABLE_DATE_RANGE([data_set_name.table_name_],
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-12-31'))
As there is currently no option to parameterize your view programmatically, you need to generate your queries/views with some other tool or code.
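As one sketch of such generation, assuming the table prefix and date layout from this question, a small Python function could emit the legacy SQL for a rolling twelve-month range (the function name and prefix handling are my own):

```python
from datetime import date, timedelta

def rolling_year_query(table_prefix: str, today: date) -> str:
    """Generate the legacy SQL TABLE_DATE_RANGE query for the past 12 months.
    table_prefix is e.g. 'data_set_name.table_name_' (trailing underscore)."""
    end = today - timedelta(days=1)               # up to yesterday
    start = end.replace(year=end.year - 1) + timedelta(days=1)
    return (
        "SELECT * FROM TABLE_DATE_RANGE([{p}], TIMESTAMP('{s}'), TIMESTAMP('{e}'))"
        .format(p=table_prefix, s=start.isoformat(), e=end.isoformat())
    )

print(rolling_year_query("data_set_name.table_name_", date(2017, 1, 1)))
```

Regenerating and redeploying the view on a schedule (e.g. via cron or a small script) gives the rolling window that the view definition alone cannot express in Legacy SQL. Note that replace(year=...) would need extra care if the end date fell on Feb 29.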