How to create a sharded table in GCP BigQuery - google-bigquery

As we started working with GCP BigQuery, our code has to retrieve data from a so-called sharded table in a dataset. This table group appears with a name like sometablename_(3000) and a distinct icon. The number in parentheses is the total count of date-suffixed tables created in the dataset so far; other publishers add tables every day, so the count keeps growing. Our code uses a wildcard query to limit the date range when reading from this group, and that works fine. The only other option we see when creating a table from the console is a partitioned table, which is represented differently.
But the curious question is: how are these tables getting created daily in the first place? When we manually tried creating another table with the same name format, it was created as a separate table but still ended up in this group. We are not sure whether the documentation covers this, but we can't find any reference.
So any help in understanding this background is appreciated.

The sharded table grouping happens automatically once BigQuery finds tables that share the following characteristics:
They exist in the same dataset
They have the exact same table schema
They have the same prefix
They have a suffix of the form _YYYYMMDD (e.g. 20210130)
You can find additional info about sharded tables in the official documentation, Partitioning versus sharding.
So, that means that if I create 3 tables named Business_YYYYMMDD, they will be grouped once the UI is refreshed (each daily table can be created with plain DDL, as sketched after the list below).
* Business_(3)
- Business_20211201
- Business_20211202
- Business_20211203
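For instance, a publisher could create each daily shard with a simple DDL statement along these lines (just a sketch; the project, dataset, and source table names are placeholders):
-- Hypothetical sketch of how a daily shard might be created; all names are placeholders.
CREATE TABLE `my-project-id.my-dataset.Business_20211204` AS
SELECT *
FROM `my-project-id.my-dataset.business_staging`
WHERE DATE(event_timestamp) = '2021-12-04';
Any process with permission to create tables in the dataset can add new shards this way, which is why the count keeps growing daily.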
And if I want to query those tables, I can either go through the UI and select the table under the shard:
# UI under schema tab
BUSINESS_20211203 2021-12-03 v # Filter tables under the shard
Table schema
...
Or go directly to the query editor (Compose new query) and run a query:
SELECT * FROM `my-project-id.my-dataset.Business_20211203` LIMIT 1
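And to read a date range across the whole shard group at once (the wildcard approach mentioned in the question), a query along these lines should work; again, project and dataset names are placeholders:
-- Sketch: wildcard query over the shard group, limited to a date range.
SELECT *
FROM `my-project-id.my-dataset.Business_*`
WHERE _TABLE_SUFFIX BETWEEN '20211201' AND '20211203'
LIMIT 10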
So if publishers in your org create tables inside the same dataset that fit the conditions mentioned at the top, they will be grouped.
As for querying these groups, Google recommends partitioning instead of sharding. You can see the process of converting sharded tables into a partitioned table by going to this link.
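As a rough illustration of that conversion (a sketch only, with placeholder names; the linked documentation covers the supported approaches in detail), the shard group can be rematerialized into a date-partitioned table:
-- Sketch: rematerialize a shard group into a single date-partitioned table.
-- All project, dataset, and table names are placeholders.
CREATE TABLE `my-project-id.my-dataset.Business_partitioned`
PARTITION BY shard_date AS
SELECT
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS shard_date,
  *
FROM `my-project-id.my-dataset.Business_*`;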
Also, I found this post, which compares the pros and cons of each mode.

Related

Get column timestamp when they got added in Bigquery

I'm trying to find which new columns got added to a table. Is there any way to find this? I was thinking of getting all columns for a table along with the timestamps when they were created or modified, so that I can filter out which columns are new.
With INFORMATION_SCHEMA.SCHEMATA I get only the table creation and modification dates, but nothing for columns.
With INFORMATION_SCHEMA.COLUMNS I am able to get all column names and their information, but no details about their creation or modification timestamps.
My table doesn't have a snapshot so I can't compare it with the previous version to get changes.
Is there any way to capture this?
According to the BigQuery columns documentation, this is not metadata currently captured by BigQuery.
A possible solution would be to go into the BigQuery logs to see when and how tables were updated. Source control over the schemas and scripts that create these tables could also give you insight into how and when columns may have been added.
As @RileyRunnoe mentioned, this kind of metadata is not captured by BQ, and a possible solution is to dig into the Audit Logs. Prior to doing this, you should have created a BQ sink that points to the dataset. See creating a sink for more details.
When the sink is created, all executed operations will store data access logs in the table cloudaudit_googleapis_com_data_access_YYYYMMDD and activity logs in the table cloudaudit_googleapis_com_activity_YYYYMMDD under the BigQuery dataset you selected in your sink. Keep in mind that you can only track usage starting from the date when you set up the log export tables.
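As a quick sanity check that the sink is actually receiving entries, you could, for instance, count recent log records per method (a sketch; the project and dataset names are placeholders):
-- Sketch: confirm audit log entries are flowing into the sink dataset.
SELECT
  protopayload_auditlog.methodName AS method,
  COUNT(*) AS entries
FROM `my-project.dataset_name.cloudaudit_googleapis_com_activity_*`
GROUP BY method
ORDER BY entries DESC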
The query below has a CTE that reads from cloudaudit_googleapis_com_data_access_*, since this table logs the data changes, and it keeps only completed jobs by filtering on jobservice.jobcompleted. The outer query then selects the jobs whose SQL contains "COLUMN" and excludes queries that don't have a proper destination table (such as the ad hoc query we are about to run), which land in datasets whose names start with an underscore.
WITH CTE AS (
  SELECT
    protopayload_auditlog.methodName,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatus.state AS status,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId AS dataset,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId AS table,
    timestamp
  FROM `my-project.dataset_name.cloudaudit_googleapis_com_data_access_*`
  WHERE protopayload_auditlog.methodName = 'jobservice.jobcompleted'
)
SELECT
  query,
  REGEXP_EXTRACT(query, r'ADD COLUMN (\w+) \w+') AS column,
  table,
  timestamp,
  status
FROM CTE
WHERE query LIKE '%COLUMN%'
  AND NOT REGEXP_CONTAINS(dataset, r'^_')
ORDER BY timestamp DESC
Result: one row per matching job, showing the query text, the extracted column name, the destination table, the timestamp, and the job status.

How to check if tables refreshed in Bigquery or not?

Currently I have around 1000 tables, of which I need to track around 500 across various BigQuery datasets, and generate a report or dashboard so that we can monitor them and act promptly if a table is not refreshed.
Could someone please tell me how I can do that with minimal usage of BigQuery slots?
I think you should be able to query the last modification time as shown here:
https://cloud.google.com/bigquery/docs/dataset-metadata
You could then add a table with the max allowed time interval for a table to be updated and include that table in the query to create your own alerts.
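A sketch of that idea, using the __TABLES__ metadata and a hypothetical monitoring.freshness_thresholds helper table holding the allowed staleness per table (all names below are placeholders):
-- Sketch: flag tables whose last modification is older than their allowed staleness.
-- monitoring.freshness_thresholds is a made-up helper table (table_id, max_staleness_hours).
SELECT
  t.table_id,
  TIMESTAMP_MILLIS(t.last_modified_time) AS last_modified,
  f.max_staleness_hours,
  TIMESTAMP_MILLIS(t.last_modified_time)
    < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL f.max_staleness_hours HOUR) AS is_stale
FROM `my-project.my_dataset.__TABLES__` AS t
JOIN `my-project.monitoring.freshness_thresholds` AS f
  ON f.table_id = t.table_id
Since this reads only dataset metadata rather than the tables themselves, it should stay cheap in terms of slots.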
There is a preview feature, INFORMATION_SCHEMA.PARTITIONS, giving you the LAST_MODIFIED_TIME per table in a dataset:
SELECT *
FROM yourDataset.INFORMATION_SCHEMA.PARTITIONS;

Create a date-limited view on a hive table containing complex types in a way that is queryable with Impala?

I have a very large parquet table containing nested complex types such as structs and arrays. I have partitioned it by date and would like to restrict certain users to, say, the latest week of data.
The usual way of doing this would be to create a time-limited view on top of the table, e.g.:
CREATE VIEW time_limited_view AS
SELECT * FROM my_table
WHERE partition_date >= '2020-01-01';
This will work fine when querying the view in Hive. However, if I try to query this view from Impala, I get an error:
AnalysisException: Expr 'my_table.struct_column' in select list returns a complex type
The reason for this is that Impala does not allow complex types in the select list. Any view I build which selects the complex columns will cause errors like this. If I flatten/unnest the complex types, this would of course get around this issue. However due to the layers of nesting involved I would like to keep the table structure as is.
I see another suggested workaround has been to use Ranger row-level filtering but I do not have Ranger and will not be able to install it on the cluster. Any suggestions on Hive/Impala SQL workarounds would be appreciated
While working on a different problem I came across a kind of solution that fits my needs (but is by no means a general solution). I figured I'd post it in case anyone has similar needs.
Rather than using a view, I can simply use an external table. So firstly I would create a table in database_1 using Hive, which has a corresponding location, location_1, in hdfs. This is my "production" database/table which I use for ETL and contains a very large amount of data. Only certain users have access to this database.
CREATE TABLE database_1.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Next, I create a second, external table in the same location in hdfs. However this table is stored in a database with a much broader user group (database_2).
CREATE EXTERNAL TABLE database_2.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Since this is an external table, I can add/drop date partitions at will without affecting the underlying data. I can add one week's worth of date partitions to the metastore and, as far as end users can tell, that's all that is available in the table. I can even make this part of my ETL job: each time new data is added, I add that partition to the external table and then drop the partition from a week ago, resulting in a rolling one-week window of data being made available to this user group without having to duplicate a load of data to a separate location.
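For example, the rolling window could be maintained with plain partition DDL against the external table (a sketch; the partition values are placeholders):
-- Sketch: expose a new day and retire the oldest one on the external table.
ALTER TABLE database_2.tablename ADD IF NOT EXISTS PARTITION (date_col = '2020-01-08');
ALTER TABLE database_2.tablename DROP IF EXISTS PARTITION (date_col = '2020-01-01');
Because the table is external, the DROP only removes the partition from the metastore; the underlying Parquet data in location_1 is untouched.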
This is by no means a row-filtering solution, but is a handy way to use partitions to expose a subset of data to a broader user group without having to duplicate that data in a separate location.

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables that have hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then batch jobs that we have running every 10 minutes would perform the joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery as both a data lake and a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to maintain low costs:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add one partitioning column and up to four clustering columns as well (BigQuery allows at most four clustering columns), and you could even use yearly suffixed tables. This way you have several dimensions to scan only a limited number of rows for rematerialization.
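A minimal DDL sketch of that layout, assuming the payload is kept as a serialized JSON string and a few frequently filtered fields are promoted into their own columns (all column and table names here are placeholders):
-- Sketch: raw events table with one partitioning column and clustering columns.
-- Field names are hypothetical; pick the fields you actually filter on most.
CREATE TABLE `my-project.my_dataset.raw_events`
(
  event_ts   TIMESTAMP,
  event_type STRING,
  user_id    STRING,
  page       STRING,
  form_id    STRING,
  payload    STRING  -- the serialized JSON blob
)
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id, page, form_id;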
The other option would be to change your model and build an event-processing middle layer. You could first wire all your events to either Dataflow or Pub/Sub, process them there, and write to BigQuery under a new schema. This pipeline would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns; that's rematerialization: you rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
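For instance, dropping a deprecated column by rewriting the table in place could look like the sketch below (placeholder names; note that the rewrite is billed as a full query over the table, and any partitioning/clustering would need to be re-specified in the DDL):
-- Sketch: rematerialize the table without a deprecated column.
CREATE OR REPLACE TABLE `my-project.my_dataset.events` AS
SELECT * EXCEPT (deprecated_column)
FROM `my-project.my_dataset.events`;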
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types); Dynamic Destinations lets you specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from Dynamic Destinations and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.

Bigquery caching when hitting table would provide a different result?

As part of our BigQuery solution we have a cron job which checks the latest table created in a dataset and will create more if this table is out of date. This check is done with the following query:
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting Bigquery's internal cache even though running the query against the underlying table would provide a different result. This caching also occurs if I run this query in the web interface from Google cloud console.
If I tell the query not to use the cache with the
queryRequest.setUseQueryCache(false)
flag in the code then the tests pass correctly.
My understanding was that Bigquery automatic caching would not occur if running the query against the underlying table would provide a different result. Am I incorrect in this assumption in which case when does it occur or is this a bug?
Well, the answer to your question is: you have this conceptually wrong. You always need to set the no-cache parameter if you don't want cached data. Even in the web UI there are options you need to set. The default is to use the cached version.
But fundamentally you need to change the process and use the more recent features:
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need to have a cron, as it will automatically create the missing tables.
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables