I currently have around 1,000 tables, of which I need to track around 500 across various BigQuery datasets, and I want to generate a report or create a dashboard so that we can monitor them and act promptly if a table is not refreshed.
Could someone please tell me how I can do that with minimal usage of BigQuery slots?
I think you should be able to query the last modification time as shown here:
https://cloud.google.com/bigquery/docs/dataset-metadata
You could then add a table with the max allowed time interval for a table to be updated and include that table in the query to create your own alerts.
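The threshold idea above can be sketched in plain Python. This is only an illustration of the comparison logic, not a BigQuery client: the table names, the 24-hour limits, and the function name find_stale_tables are all hypothetical, and the last-modified values are assumed to have been fetched already from the dataset metadata.

```python
from datetime import datetime, timedelta, timezone


def find_stale_tables(last_modified, max_age, now):
    """Return (table, overdue_by) pairs for tables past their allowed interval.

    last_modified: dict of table name -> last modification time (assumed fetched elsewhere)
    max_age:       dict of table name -> maximum allowed time between refreshes
    """
    stale = []
    for table, allowed in max_age.items():
        modified = last_modified.get(table)
        if modified is None:
            # A tracked table with no metadata at all is reported as stale.
            stale.append((table, None))
            continue
        age = now - modified
        if age > allowed:
            stale.append((table, age - allowed))
    return stale


# Hypothetical sample data standing in for the metadata and the threshold table.
now = datetime(2021, 12, 3, 12, 0, tzinfo=timezone.utc)
last_modified = {
    "sales.orders": datetime(2021, 12, 3, 6, 0, tzinfo=timezone.utc),
    "sales.refunds": datetime(2021, 12, 1, 6, 0, tzinfo=timezone.utc),
}
max_age = {
    "sales.orders": timedelta(hours=24),
    "sales.refunds": timedelta(hours=24),
    "sales.returns": timedelta(hours=24),
}

stale = find_stale_tables(last_modified, max_age, now)
for table, overdue in stale:
    print(table, "overdue by", overdue)
```

Only the tracked tables appear in max_age, so the other tables in the project are never inspected by this check.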
There is a Preview feature, INFORMATION_SCHEMA.PARTITIONS, that gives you the LAST_MODIFIED_TIME for the tables in a dataset:
select *
from yourDataset.INFORMATION_SCHEMA.PARTITIONS;
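If that view returns one row per partition, as its name suggests, a report needs the most recent value per table. Below is a rough Python sketch of that reduction on already-fetched rows. The row layout (table name, partition id, last modified time) and the sample values are assumptions made for illustration.

```python
from datetime import datetime, timezone


def latest_modification_per_table(rows):
    """Collapse per-partition rows into the newest modification time per table."""
    latest = {}
    for table_name, _partition_id, modified in rows:
        if table_name not in latest or modified > latest[table_name]:
            latest[table_name] = modified
    return latest


# Hypothetical rows in the assumed (table_name, partition_id, last_modified_time) shape.
rows = [
    ("events", "20211201", datetime(2021, 12, 1, 23, 50, tzinfo=timezone.utc)),
    ("events", "20211202", datetime(2021, 12, 2, 23, 55, tzinfo=timezone.utc)),
    ("users", None, datetime(2021, 11, 30, 8, 0, tzinfo=timezone.utc)),
]

latest = latest_modification_per_table(rows)
for table_name, modified in latest.items():
    print(table_name, modified.isoformat())
```

The same reduction could be done in the query itself with MAX and GROUP BY; doing it client-side is just one option.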
I'm trying to find which new columns were added to a table. Is there any way to find this? I was thinking of getting all columns of a table together with the timestamps of when they were created or modified, so that I can filter for the new columns.
With INFORMATION_SCHEMA.SCHEMATA I only get the table creation and modification dates, not those of the columns.
With INFORMATION_SCHEMA.COLUMNS I am able to get all column names and their information, but no details about their creation or modification timestamps.
My table doesn't have a snapshot so I can't compare it with the previous version to get changes.
Is there any way to capture this?
According to the BigQuery columns documentation, this metadata is not currently captured by BigQuery.
A possible solution would be to go into the BigQuery logs to see when and how tables were updated. Source control over the schemas and scripts that create these tables could also give you insight into how and when columns may have been added.
As @RileyRunnoe mentioned, this kind of metadata is not captured by BQ, and a possible solution is to dig into the Audit Logs. Before doing this, you need to have created a BQ sink that points to a dataset. See creating a sink for more details.
When the sink is created, all operations to be executed will store data usage logs in table cloudaudit_googleapis_com_data_access_YYYYMMDD and activity logs in table cloudaudit_googleapis_com_activity_YYYYMMDD under the BigQuery dataset you selected in your sink. Keep in mind that you can only track the usage starting at the date when you set up the logs export tables.
The query below uses a CTE that reads from cloudaudit_googleapis_com_data_access_*, since this is where data changes are logged, and keeps only completed jobs by filtering for jobservice.jobcompleted. The outer query then selects the statements that contain "COLUMN" and excludes queries that don't have an explicit destination table (like the query we are about to run), whose results go to a dataset with a name starting with an underscore.
WITH CTE AS (
  SELECT
    protopayload_auditlog.methodName,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatus.state AS status,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId AS dataset,
    protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId AS table,
    timestamp
  FROM `my-project.dataset_name.cloudaudit_googleapis_com_data_access_*`
  WHERE protopayload_auditlog.methodName = 'jobservice.jobcompleted'
)
SELECT
  query,
  REGEXP_EXTRACT(query, r'ADD COLUMN (\w+) \w+') AS column,
  table,
  timestamp,
  status
FROM CTE
WHERE query LIKE '%COLUMN%'
  AND NOT REGEXP_CONTAINS(dataset, r'^_')
ORDER BY timestamp DESC
Result: (screenshot not included)
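The ADD COLUMN pattern used in REGEXP_EXTRACT in the query above can be tried locally before running it against the logs. Here is a small sketch using Python's re module, on the assumption that this simple pattern behaves the same way there as in BigQuery. The sample statements and the helper name extract_added_column are made up for illustration.

```python
import re

# Same pattern as in the REGEXP_EXTRACT call: the first capture group is the column name.
ADD_COLUMN = re.compile(r"ADD COLUMN (\w+) \w+")


def extract_added_column(query):
    """Return the first word after ADD COLUMN, or None when the pattern does not match."""
    match = ADD_COLUMN.search(query)
    return match.group(1) if match else None


# Hypothetical statements of the kind that might appear in the audit log.
samples = [
    "ALTER TABLE my_dataset.my_table ADD COLUMN new_col STRING",
    "ALTER TABLE my_dataset.my_table ADD COLUMN IF NOT EXISTS other_col INT64",
    "SELECT * FROM my_dataset.my_table",
]

extracted = [extract_added_column(q) for q in samples]
for query, column in zip(samples, extracted):
    print(column, "<-", query)
```

Two limitations are visible here: a statement written with IF NOT EXISTS yields "IF" rather than the column name, and the pattern is case-sensitive, so a lowercase "add column" would not match.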
As we started working with GCP BigQuery, our code has to retrieve data from a so-called sharded table in a dataset. This table group appears under a name like sometablename_(3000), with its own icon. The number in parentheses is the total count of dated tables created in the dataset so far; other publishers add tables every day, so the count increases daily. Our code uses a wildcard query to limit the date range it reads from this table, which works fine. The only other option we see when creating a table from the console is a partitioned table, which is represented differently.
But what we are curious about is how these tables are getting created daily in the first place. When we manually tried creating another table with the same name format, it was created as a separate table but ended up in this group. We couldn't find any reference to this in the documentation.
So any help in understanding this background is appreciated.
BigQuery groups tables into a sharded table automatically once it finds tables that share the following characteristics:
Exist in the same dataset
Have the exact same table schema
Have the same prefix
Have a suffix of the form _YYYYMMDD (e.g. _20210130)
You can find additional information about sharded tables in the official documentation, under Partitioning versus sharding.
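The naming part of the conditions above can be illustrated with a short Python sketch that groups table names by prefix when they end in an eight-digit suffix. It is a simplification: it only looks at names within one dataset, does not compare schemas, and does not check that the digits form a valid date. The table names are hypothetical.

```python
import re
from collections import defaultdict

# A prefix ending in an underscore, followed by exactly eight digits (YYYYMMDD).
SHARD_NAME = re.compile(r"^(?P<prefix>.+_)(?P<suffix>\d{8})$")


def group_shards(table_names):
    """Group table names of one dataset by shard prefix; other names are ignored."""
    groups = defaultdict(list)
    for name in table_names:
        match = SHARD_NAME.match(name)
        if match:
            groups[match.group("prefix")].append(name)
    return {prefix: sorted(names) for prefix, names in groups.items()}


# Hypothetical table names in a single dataset.
tables = [
    "Business_20211201",
    "Business_20211202",
    "Business_20211203",
    "customers",
    "Orders_2021",
]

groups = group_shards(tables)
for prefix, names in groups.items():
    print(f"{prefix}({len(names)})", names)
```

With these names, only the three Business_ tables form a group; customers and Orders_2021 lack the eight-digit suffix and stay ungrouped.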
So, that means if I create 3 tables named Business_YYYYMMDD, they will be grouped once the UI is refreshed.
* Business_(3)
- Business_20211201
- Business_20211202
- Business_20211203
And if I want to query those tables, I can either go through the UI and select the table:
# UI under schema tab
BUSINESS_20211203 2021-12-03 v # Filter tables under the shard
Table schema
...
Or just go directly to the query UI, compose a new query, and run it:
SELECT * FROM `my-project-id.my-dataset.Business_20211203` LIMIT 1
So if publishers or your organization create tables inside the same dataset that fit the conditions mentioned at the top, they will be grouped.
Regarding querying these groups, Google recommends partitioning instead of sharding. You can see the process of converting sharded tables into a partitioned table by going to this link.
Also, I found this post, which compares the two approaches.
I am using Firebase Analytics and BigQuery, with an average of 50 to 60 GB of data per day.
For the most recent daily table, a query gives a different result from the one it gave yesterday, even though the query conditions are exactly the same, including the target date.
I just found that there is a gap of 1 to 2 days between the table creation date and the last modified date.
I assume the difference between the query results is because of this (perhaps the query ran on a different data volume).
Does this date gap mean that a single daily table needs at least 2 days to be fully loaded from the intraday table?
Thanks in advance.
(Screenshot: BigQuery table info)
In the documentation we can find the following information:
After you link a project to BigQuery, the first daily export of
events creates a corresponding dataset in the associated BigQuery
project. Then, each day, raw event data for each linked app populates
a new daily table in the associated dataset, and raw event data is
streamed into a separate intraday BigQuery table in real-time.
It seems that the intraday table is loaded into the main table each day, and if you want to access this data in real time you'll have to use the separate intraday table.
If this information doesn't help you, please provide some extra information so I can help you more efficiently.
When I try to create a view that queries more than 600 tables, BigQuery runs for a long time and the response is:
BigQuery error in mk operation: Backend Error.
the query itself is like:
'select col1,col2,col3 from t1,t2,t3......t600'
I suspect the operation is timing out. The limit here is whether validating the view query can be completed within the deadline limits for a single synchronous request like view creation. This many tables may just be too many.
A potential work-around might be to shard this view: create smaller view tables, then a single view of the set of smaller views.
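The work-around above could be planned with a small script that splits the table list into chunks and generates the SQL for each smaller view plus the top-level view. The sketch below only builds strings and does not call BigQuery. It assumes the comma-separated table list in the question is meant as a union of the tables, written here as UNION ALL; the view names (part_view_N) and the chunk size of 100 are hypothetical, and whether 100 tables per view stays within the deadline is not known.

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_sharded_views(tables, columns, per_view, view_prefix="part_view_"):
    """Return ({sub_view_name: sql}, top_view_sql) for a two-level view layout."""
    select_list = ", ".join(columns)
    sub_views = {}
    for index, group in enumerate(chunk(tables, per_view), start=1):
        name = f"{view_prefix}{index}"  # hypothetical naming scheme
        sub_views[name] = "\nUNION ALL\n".join(
            f"SELECT {select_list} FROM {table}" for table in group
        )
    # The top-level view reads only from the smaller views.
    top_view = "\nUNION ALL\n".join(
        f"SELECT {select_list} FROM {name}" for name in sub_views
    )
    return sub_views, top_view


# t1 .. t600, as in the question.
tables = [f"t{i}" for i in range(1, 601)]
sub_views, top_view = plan_sharded_views(tables, ["col1", "col2", "col3"], per_view=100)

print(len(sub_views), "smaller views")
print(top_view)
```

Each generated statement would still need to be saved as a view, and the table names would need to be qualified with their dataset.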
An alternate solution would be to explore your data layout. Perhaps you don't need 600 tables to hold your data? The BigQuery team announced at GCP Next 2016 that table partitioning by date will be coming soon, so if you are sharding your tables by day and need to reference years of data, then there will be a single-table solution for you soon.
I am trying to copy a BigQuery table using the API from one table to the other in the same dataset.
While copying big tables seems to work just fine, I noticed that when copying small tables with a limited number of rows (1-10), the destination table comes out empty (created, but with 0 rows).
I get the same results using the API and the BigQuery management console.
The issue is replicated for any table in any dataset I have. Looks like a bug or a designed behavior.
I could not find any "minimum lines" directive in the docs. Am I missing something?
EDIT:
Screenshots
Original table: video_content_events with 2 rows
Copy table: copy111 with 0 rows
How are you populating the small tables? Are you perchance using streaming insert (bq insert from the command line tool, tabledata.insertAll method)? If so, per the documentation, data can take up to 90 minutes to be copyable/exportable:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
I won't get super detailed, but the reason is that our copy and export operations are optimized to work on materialized files. Data within our streaming buffers are stored in a completely different system, and thus aren't picked up until the buffers are flushed into the traditional storage mechanism. That said, we are working on removing the copy/export delay.
If you aren't using streaming insert to populate the table, then definitely contact support/file a bug here.
There is no minimum record limit for copying a table within the same dataset or to a different dataset. This applies to both the API and the BigQuery UI. I just replicated your scenario by creating a new table with just 2 records, and I was able to successfully copy the table to another table using the UI.
Attaching screenshot
I tried to copy to a timestamp-partitioned table. I messed up the timestamp: the value was 1000 times the current timestamp, which I guess is beyond BigQuery's maximum partition range. Although the copy job succeeded, no data was actually loaded into the destination table.
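A quick sanity check on the timestamp values before copying might catch this kind of mistake. The sketch below assumes the values are epoch seconds and that the relevant upper bound is the maximum TIMESTAMP value of 9999-12-31 23:59:59 UTC; the actual limit for a partitioned destination may be tighter. The function name and the sample value are hypothetical.

```python
from datetime import datetime, timezone

# Upper bound assumed here: 9999-12-31 23:59:59 UTC, expressed in epoch seconds.
MAX_EPOCH_SECONDS = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp()


def is_plausible_epoch_seconds(value):
    """Return True when the value, read as epoch seconds, falls on or before the bound."""
    return 0 <= value <= MAX_EPOCH_SECONDS


seconds = 1638532800          # a sample epoch value in seconds
scaled = seconds * 1000       # the same value multiplied by 1000, as in the mistake above

print(is_plausible_epoch_seconds(seconds))
print(is_plausible_epoch_seconds(scaled))
```

One common way to end up with a value 1000 times too large is passing milliseconds where seconds are expected, though the post does not say that this was the cause.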