Bigquery caching when hitting table would provide a different result? - google-bigquery

As part of our Bigquery solution we have a cron job which checks the latest table created in a dataset and will create more if this table is out of date.This check is done with the following query
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting Bigquery's internal cache even though running the query against the underlying table would provide a different result. This caching also occurs if I run this query in the web interface from Google cloud console.
If I specify for the query not to cache using the
queryRequest.setUseQueryCache(false)
flag in the code then the tests pass correctly.
My understanding was that Bigquery automatic caching would not occur if running the query against the underlying table would provide a different result. Am I incorrect in this assumption in which case when does it occur or is this a bug?

Well the answer for your question is: you are doing conceptually wrong. You always need to set the no cache param if you want no cache data. Even on the web UI there are options you need to use. The default is to use the cached version.
But, fundamentally you need to change the process and use the recent features:
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need to have a cron, as it will automatically create the missing tables.
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables

Related

How to create sharded table in GCP BigQuery

As we started working on GCP BigQuery, our code has to retrieve data from so called sharded table in a dataset. This table group is with the name seen like sometablename_(3000) with the icon represent as . The number there in parenthesis represents total count of tables created in the dataset so far with the date, and everyday the tables are getting added there by some other publishers and the count increases daily thus. Our code needs a wildcard query to limit date range to read data from this table which works fine. Only other option we see while creating a table from console is partition table which is represented differently.
But curious question is how are these tables getting created daily in the first place? When we manually tried creating another table with same name format, it's getting created as separate table but getting into this group. Not sure if documentation has any reference but can't find any.
So any help in understanding this background is appreciated.
Sharded tables are generated automatically once google-bigquery finds tables that share the following characteristics:
Exist in the same dataset
Have the exact same table schema
The same prefix
Have a suffix of the form _YYYYMMDD (eg. 20210130)
You can find additional info about sharded table on official documention, Partitioning versus sharding.
So, that means if I create 3 tables named BUSINES_YYYYMMDD it will be grouped once refreshed in the UI.
* Business_(3)
- Business_20211201
- Business_20211202
- Business_20211203
And if I want to query those tables I will just have to either go trough the ui and select the table.
# UI under schema tab
BUSINESS_20211203 2021-12-03 v # Filter tables under the shard
Table schema
...
Or just go directly to the query ui compose new query and perform a query.
Select * from my-project-id.my-dataset.Business_20211203 limit 1
So if you are getting tables created by publishers/org inside the same dataset that fits the conditions mention at the top it will be grouped.
About querying this groups, google recommends to do partition instead of sharding. You can see the process of converting sharded into partion table by going to this link.
Also, I found this post which also shows the vs of each mode.

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field to the website would correspond to new columns for in BigQuery. Also if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in Bigquery.
So we're going to eventually result in tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as json (for example where each Bigquery table will just have two columns, one for timestamp and another for the json data). Then batch jobs that we have running every 10minutes will perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our bigquery schema based off the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document storage instead, but we use Bigquery as both a data lake and also as a data warehouse for BI and building Tableau reports off of. So we have jobs that aggregates raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared, you layout several options in your question.
You could go with the JSON table and to maintain low costs
you can use a partition table
you can cluster your table
so instead of having just two timestamp+json column I would add 1 partitioned column and 5 cluster colums as well. Eventually even use yearly suffixed tables. This way you have at least 6 dimensions to scan only limited number of rows for rematerialization.
The other would be to change your model, and do an event processing middle-layer. You could first wire all your events either to Dataflow or Pub/Sub then process it there and write to bigquery as a new schema. This script would be able to create tables on the fly with the schema you code in your engine.
Btw you can remove columns, that's rematerialization, you can rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
I think this use case can be implemeted using Dataflow (or Apache Beam) with Dynamic Destination feature in it. The steps of dataflow would be like:
read the event/json from pubsub
flattened the events and put filter on the columns which you want to insert into BQ table.
With Dynamic Destination you will be able to insert the data into the respective tables
(if you have various event of various types). In Dynamic destination
you can specify the schema on the fly based on the fields in your
json
Get the failed insert records from the Dynamic
Destination and write it to a file of specific event type following some windowing based on your use case (How frequently you observe such issues).
read the file and update the schema once and load the file to that BQ table
I have implemented this logic in my use case and it is working perfectly fine.

Switching a BigQuery table into "readonly mode"

I want to ensure that a table in BigQuery can no longer receive any inserts (be it "load/batch" inserts or "streaming" inserts).
Is there any possibility to turn a table into a "readonly mode"?
I would like to avoid playing with the standard IAM / access control whose smallest level of permissions is the dataset level. If there were an option to force "readonly" on 1 table for all the users independently of their role (just like when you force a filesystem in "readonly mode"), that would be awesome.
(final goal is to do a safe merge of a "master" and "update" table as explained here: Delete/update table entries by joining 2 tables on Google BigQuery without import/export ).
Currently, this is not possible in BigQuery. You can submit feature request at https://issuetracker.google.com/issues/new?component=187149&template=0
Meantime, workaround would be to use Snapshot decorators.
So, without restricting adding rows to table, you will be able to get table state at any moment (within last two weeks if I remember correctly) - so indirectly this will give you what you want
Snapshot decorators are available in Legacy SQL and I think they recently were added to Standard SQL

allowLargeResults in Query job in BigQuery

I'm trying to run a Query job in BigQuery and getting the following error:
Response too large to return. Consider setting allowLargeResults to
true in your job configuration
I understand that I need to set allowLargeResults to True in my job configuration, but then I also have to supply a destination table field.
I don't want to insert the results of the query to specific table, only to process it locally.
how can I manage this situation?
I don't want to insert the results of the query to specific table,
only to process it locally.
Wanted to clarify – so you hopefully feel better about using destination table:
In reality, any query result ends up in some table!
If result is smaller than 128MB - BigQuery creates temporary table on your behalf (in special dataset which name starts with underscore so it is not visible in Web UI dataset/table navigator).
This temporary table is available for 24 hours and is used if you use Query Cashing or you can even use it by yourself – you just need to find which table is created. You can find this in API – destination table – which as I said above exists even if you have not set specific table. Or you can find it in Web UI
When result is bigger than 128MB – you must set destination table. The only drawback in your case is that you need to make sure you delete this table after you don’t need it anymore otherwise you will be paying for storage
You can do this either by actually deleting table - manually (in UI) or programmatically (API). Or you can set expiration on the table (API)
First of all if it's means it's too large, then probably greater than 128MB. You need to make sure that you query is accurate and if indeed you want to return the large data. Usually people make mistakes in the queries, like join explosion, missing time filters to reduce data, or missing limits.
After you are convinced the data is too large, you need to write to a table, then export to GCS, then download, and then deal with it.
https://cloud.google.com/bigquery/docs/exporting-data#exportingmultiple

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).