I'm partitioning my data on BigQuery by day, and I want a quick way to query "yesterday's data".
Is this possible? How can I write queries that automatically point to the latest data, without having to re-write the tables I want to query?
You can create a view with TABLE_QUERY to find yesterday's (or an arbitrary relative date) data.
For example, GitHubArchive stores daily tables, and I created a view that points to yesterday's table:
SELECT *
FROM TABLE_QUERY([githubarchive:day], 'table_id CONTAINS "events_"
AND table_id CONTAINS STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d")')
You can test and query this view:
SELECT COUNT(*)
FROM [fh-bigquery:public_dump.github_yesterday]
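These days the same result can be written in standard SQL with a wildcard table. A minimal sketch, assuming the daily tables carry an events_YYYYMMDD suffix:

SELECT *
FROM `githubarchive.day.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))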
I'm working on BigQuery and have created a view using multiple tables. Each day the data needs to be synced to multiple platforms. I need to insert a date (or some other field) via SQL through which I can identify which rows were added to the view each day, or which rows got updated, so that I only carry that data forward each day instead of syncing everything. The best way I can think of is to somehow add the current date wherever a row is updated, but that date needs to stay constant until a further update happens for that record.
Example (sample data): say we get the view as T1 on 1st September and as T2 on 2nd. I need to spot only ID 2 for 1st September and IDs 3, 4, and 5 for 2nd September. Note: no such date column exists. I need help creating such a column, or finding any other approach, to identify which rows get updated or added daily.
You can create a BigQuery scheduled query with a daily frequency (every 24 hours) using the INSERT statement below:
INSERT INTO dataset.T1
SELECT
*
FROM
dataset.T2
WHERE
date > (SELECT MAX(date) FROM dataset.T1);
Your table that the data is being streamed to (in your case: the sample data) needs to be configured as a partitioned table. Use "Partition by ingestion time" so that you don't need to handle the date yourself.
(Screenshot: partitioning configuration in the BigQuery UI.)
After you have recreated that table, append your existing data to the new table using the write options in BigQuery (choose append) and run the job.
Then you create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS rank
FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
From then on, always query the view.
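To isolate each day's new or updated rows afterwards, one option is to also expose the partition date and filter on it. A minimal sketch, assuming invoice_id identifies a record (the table name is illustrative):

SELECT * EXCEPT (rank)
FROM (
  SELECT
    *,
    DATE(_PARTITIONTIME) AS ingest_date,
    ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS rank
  FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
  AND ingest_date = CURRENT_DATE()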
I want to create a table like this:
CREATE TABLE sometable
(SELECT columns, columns, date_col)
PARTITIONED BY date_col
And I want it to be date-partitioned, with the date in the table suffix: sometable$date_partition
I read the docs, but can't accomplish this with either the web UI or SQL.
The web UI shows such error "Missing argument for parameter DATE."
My table name is "daily_export_${DATE}"
My partitioning column isn't blank, it's date_col.
Can I have a simple example, please?
PARTITION BY goes earlier in the statement, and the query needs to parse the table suffix into a DATE type.
For example:
CREATE OR REPLACE TABLE temp.so
PARTITION BY date_from_table_name
AS
SELECT PARSE_DATE('%Y%m%d', _table_suffix) date_from_table_name, event_timestamp, event_name, items
FROM `bingo-blast-174dd.analytics_151321511.events_*`
WHERE _table_suffix BETWEEN '20200530' AND '20200531'
LIMIT 10
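Once the table exists, filtering on the partitioning column lets BigQuery prune partitions. A usage sketch against the table created above:

SELECT event_name, COUNT(*) AS events
FROM temp.so
WHERE date_from_table_name = '2020-05-30'
GROUP BY event_name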
As you can see in this documentation, BigQuery implements two different concepts: sharded tables and partitioned tables.
The first one (sharded tables) is a way of dividing a whole table into many tables with a date suffix. You can query those tables individually or using wildcards. For example, instead of creating a single table named events, you can create many tables named events_20200101, events_20200102, [...]
When you do that, you are able to query any of those tables individually, or you can query all of them at once by running a query like select * from events_*.
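For example, to read only a week of shards at once (a sketch, assuming events_YYYYMMDD naming in a dataset of your own):

SELECT *
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20200101' AND '20200107'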
The second concept (partitioned tables) is an approach to fragment your table into smaller pieces in order to improve performance and reduce costs when querying data. Partitioned tables can be based on a column of your table or on the ingestion time. When your table is partitioned by ingestion time, you can access a pseudo column named _PARTITIONTIME.
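For instance, a query against an ingestion-time-partitioned table can restrict itself to a single day's partition (the table name is illustrative):

SELECT *
FROM `your_project.your_dataset.events`
WHERE _PARTITIONTIME = TIMESTAMP('2020-01-01')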
When comparing both approaches, the documentation says:

"Date/timestamp partitioned tables perform better than tables sharded by date. When you create date-named tables, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance. The recommended best practice is to use date/timestamp partitioned tables instead of date-sharded tables."
In your case, you basically need to create a partitioned table without a date in its name.
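A minimal sketch of such a table (the schema and names are illustrative):

CREATE TABLE `your_project.your_dataset.events`
(
  event_timestamp TIMESTAMP,
  event_name STRING
)
PARTITION BY DATE(event_timestamp)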
I have many SQL Server tables in a database that have information about the same domain (same columns) and their names are the same plus a date suffix (yyyyMMdd):
TABLE_ABOUT_THIS_THING_20200131
TABLE_ABOUT_THIS_THING_20191231
TABLE_ABOUT_THIS_THING_20191130
TABLE_ABOUT_THIS_THING_20191031
TABLE_ABOUT_THIS_THING_20190930
TABLE_ABOUT_THIS_THING_20190831
...
This seems like it would make more sense if it were all in a single table. Is there a way, using a query, SSIS, or something similar, to merge these tables into one (TABLE_ABOUT_THIS_THING) with a new column (extraction_date) derived from each table's suffix?
Using SSIS: use a Union All transformation to collect the data from the multiple tables, and use a Derived Column transformation to add extraction_date before the destination. For more information, see the following link:
https://www.tutorialgateway.org/union-all-transformation-in-ssis/
You can use UNION ALL:
create view v_about_this_thing as
select convert(date, '20200131') as extraction_date, t.*
from TABLE_ABOUT_THIS_THING_20200131 t
union all
select convert(date, '20191231') as extraction_date, t.*
from TABLE_ABOUT_THIS_THING_20191231 t
union all
. . .
This happens to be a partitioned view, which has some other benefits.
The challenge is how to keep this up-to-date. My recommendation is to fix your data processing so all the data goes into a single table. You can also set up a job that runs once a month and inserts the most recent values into an existing table.
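A sketch of such a monthly job in T-SQL (the column list and the source table's date suffix are illustrative):

-- Append the newest monthly table, deriving extraction_date from its suffix
INSERT INTO TABLE_ABOUT_THIS_THING (extraction_date, col1, col2)
SELECT CONVERT(date, '20200229'), col1, col2
FROM TABLE_ABOUT_THIS_THING_20200229;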
An alternative is to reconstruct the view periodically, for example every month. You can do this using a DDL trigger that recreates the view when a new table appears.
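A hedged sketch of that trigger (usp_rebuild_about_this_thing_view is a hypothetical helper procedure you would write to regenerate the UNION ALL view):

CREATE TRIGGER trg_refresh_about_this_thing
ON DATABASE
FOR CREATE_TABLE
AS
BEGIN
    -- Rebuild the view only when a new TABLE_ABOUT_THIS_THING_* table appears
    IF EVENTDATA().value('(/EVENT_INSTANCE/ObjectName)[1]', 'sysname')
       LIKE 'TABLE[_]ABOUT[_]THIS[_]THING[_]%'
        EXEC dbo.usp_rebuild_about_this_thing_view; -- hypothetical helper
END;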
Another alternative is to create a year's worth of tables all at once -- but empty -- and to recreate the view manually once a year. Put a note on your calendar to remind you!
You can use SSIS with the new table TABLE_ABOUT_THIS_THING as the destination and a source query that looks like this:

Select * FROM table1
UNION ALL
Select * FROM table2
UNION ALL
.
.
.
We have a large database with monthly partitioned tables. I need to aggregate a selection of these tables every month, but I don't want to update the UNION ALL view every month to add the new monthly table.
CREATE VIEW dynamic_view AS
SELECT timestamp,
traffic
FROM traffic_table_m_2017_01
UNION ALL
SELECT timestamp,
traffic
FROM traffic_table_m_2017_02
Is this where I would use a stored procedure? I am not really familiar with them.
I think it would also work as:
SELECT timestamp,
traffic
FROM REPLACE(REPLACE('traffic_table_m_yyyy_mm',
yyyy, FORMAT(GETDATE(),'yyyy', 'en-us')),
mm, FORMAT(GETDATE(),'mm', 'en-us'));
This might work for the current month but I would need to save the data from the past months which would also be an issue.
You should append each table as it arrives to one larger table, then run your queries against that. There are many ways to do this, but probably the fastest and most elegant is to use:
ALTER TABLE APPEND
Instructions here https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_APPEND.html
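A minimal sketch, assuming a consolidated target table traffic_table with the same columns as the monthly tables:

-- Moves (rather than copies) all rows from the monthly table into the target
ALTER TABLE traffic_table APPEND FROM traffic_table_m_2017_03;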
A BigQuery best practice is to split timeseries in daily tables (as "NAME_yyyyMMdd") and then use Table Wildcards to query one or more of these tables.
Sometimes it is useful to get the last update time on a certain set of data (i.e. to check correctness of the ingestion procedure). How do I get the last update time over a set of tables organized like that?
A good way to achieve that is to use the __TABLES__ meta-table. Here is a generic query I use in several projects:
SELECT
MAX(last_modified_time) LAST_MODIFIED_TIME,
IF(REGEXP_MATCH(RIGHT(table_id,8),"[0-9]{8}"),LEFT(table_id,LENGTH(table_id) - 8),table_id) AS TABLE_ID
FROM
[my_dataset.__TABLES__]
GROUP BY
TABLE_ID
It will return the last update time of every table in my_dataset. For tables organized with a daily-split structure, it will return a single value (the update time of the latest table), with the initial part of their name as TABLE_ID.
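If you prefer standard SQL, a roughly equivalent sketch (last_modified_time is epoch milliseconds; the project and dataset names are illustrative):

SELECT
  IF(REGEXP_CONTAINS(table_id, r'[0-9]{8}$'),
     REGEXP_REPLACE(table_id, r'[0-9]{8}$', ''),
     table_id) AS table_group,
  TIMESTAMP_MILLIS(MAX(last_modified_time)) AS last_modified_time
FROM `my_project.my_dataset.__TABLES__`
GROUP BY table_group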
Solution for Google BigQuery: for partitioned tables, INFORMATION_SCHEMA.PARTITIONS exposes a last_modified_time column per partition.

SELECT *
FROM project_name.data_set_name.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'my_table';