If I have a table partitioned by date with a WRITE_APPEND policy, what happens if I write data into existing partitions? Is it simply ignored, or is it appended as the name indicates? My understanding is that it is appended to the existing data in the same partition, but I'm not 100% sure.
The doc only says "WRITE_APPEND: If the table already exists, BigQuery appends the data to the table." This is highly ambiguous and doesn't even mention partitioned tables.
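For reference, the setup being asked about could be sketched as the load-job request body below. WRITE_APPEND adds the loaded rows to whatever the destination already holds, including rows in an existing partition; nothing is skipped or overwritten. The project, dataset, table and bucket names are placeholders, and the `$YYYYMMDD` suffix is the partition decorator used to target a single partition.

```python
def load_job_body(project, dataset, table, source_uri, partition=None):
    """Build a jobs.insert request body for an appending load job."""
    # "events$20240101" addresses one partition; a bare table name lets
    # BigQuery route each row to its partition.
    table_id = f"{table}${partition}" if partition else table
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table_id,
                },
                # Rows are appended to existing data, partition included.
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


# Placeholder names, for illustration only.
body = load_job_body("my-project", "my_dataset", "events",
                     "gs://my-bucket/events.json", partition="20240101")
```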
I have a Spark job that gets data from multiple sources and aggregates it into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table and compare it with the new data that comes in. The comparison happens in the Spark layer.
I was wondering if there is a better way to compare that would improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
"One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in"
IMHO, comparing the entire dataset just to load new data is not performant.
Option 1:
Instead, you can create a partitioned BigQuery table with a partition column to load the data into, and while loading new data you can check whether it falls into an existing partition.
Hitting partition-level data in Hive or BigQuery is more efficient than selecting the entire data and comparing it in Spark.
The same applies to Hive as well.
See "Creating partitioned tables" or "Creating and using integer range partitioned tables" in the BigQuery docs.
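A minimal sketch of Option 1, assuming a date-partitioned table: only the partitions touched by the incoming batch are read back for comparison. The table and column names are made up for the example.

```python
from datetime import date


def partition_scan_sql(table, partition_col, incoming_rows):
    """Build a query that reads only the partitions the new batch touches."""
    dates = sorted({row[partition_col] for row in incoming_rows})
    if not dates:
        return None  # nothing incoming, nothing to compare
    literals = ", ".join(f"DATE '{d.isoformat()}'" for d in dates)
    return f"SELECT * FROM `{table}` WHERE {partition_col} IN ({literals})"


# Hypothetical batch: two rows for Jan 2, one for Jan 3.
batch = [
    {"event_date": date(2024, 1, 2), "id": 1},
    {"event_date": date(2024, 1, 2), "id": 2},
    {"event_date": date(2024, 1, 3), "id": 3},
]
sql = partition_scan_sql("my_dataset.events", "event_date", batch)
```

Filtering on the partition column is what lets the engine prune partitions, so the comparison set stays proportional to the batch rather than to the whole table.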
Option 2:
Another alternative: BigQuery has a MERGE statement. If your requirement is to merge the data without comparing it yourself, you can go ahead with MERGE. The docs describe it as follows:
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement for each change to the target table.
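As a rough illustration of Option 2, the helper below assembles an upsert-style MERGE statement. It assumes the new batch has already been loaded into a staging table; the table and column names are placeholders.

```python
def build_merge_sql(target, source, key_cols, value_cols):
    """Assemble a MERGE that updates matching rows and inserts the rest."""
    on = " AND ".join(f"T.{c} = S.{c}" for c in key_cols)
    updates = ", ".join(f"{c} = S.{c}" for c in value_cols)
    all_cols = list(key_cols) + list(value_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"S.{c}" for c in all_cols)
    return (
        f"MERGE `{target}` T\n"
        f"USING `{source}` S\n"
        f"ON {on}\n"
        f"WHEN MATCHED THEN\n"
        f"  UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical target and staging tables.
merge_sql = build_merge_sql(
    "my_dataset.aggregates", "my_dataset.staging", ["id"], ["amount", "updated_at"]
)
```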
There are many ways to solve this problem; one of the less expensive, more performant and scalable ways is to use a datastore on the file system to determine which data is truly new.
When data comes in for the first time, write it to two places: the database and a file (say, in S3). If data is already in the database, you need to initialize the local/S3 file with the table data.
From the second time onwards, check whether incoming data is new based on its presence in the local/S3 file.
Mark the delta data as new or updated, and export it to the database as an insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won’t be coming. Regularly truncate this file to keep only data within that time range.
You can also bucket and partition this data. You can use Delta Lake to maintain it too.
One downside is that whenever the database is updated, this file may need to be updated too, depending on whether relevant data has changed. You can maintain a marker column on the database table to signify the sync date, and index that column. Read changed records based on this column and update the file/Delta Lake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them off the critical path is better.
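The delta check described above might look like this in miniature. It is a sketch only: the snapshot is an in-memory dict standing in for the local/S3 file, and the `id` key and row layout are assumptions.

```python
import hashlib
import json


def row_hash(row):
    """Stable fingerprint of a row's content."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(incoming, snapshot, key="id"):
    """Split incoming rows into new and updated, refreshing the snapshot.

    snapshot maps key -> content hash and is mutated in place.
    """
    new, updated = [], []
    for row in incoming:
        fingerprint = row_hash(row)
        previous = snapshot.get(row[key])
        if previous is None:
            new.append(row)
        elif previous != fingerprint:
            updated.append(row)
        snapshot[row[key]] = fingerprint
    return new, updated


snapshot = {}
first = classify([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}], snapshot)
second = classify([{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}], snapshot)
```

Only the rows in `new` and `updated` would then be sent to the database as inserts and updates; unchanged rows never touch it.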
Shouldn't you have a last update time in your DB? The approach you are using doesn't sound scalable; if you had a way to set an update time on each row in the table, that would solve the problem.
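A small sketch of that idea, assuming each row carries an `updated_at` timestamp (a hypothetical column name): keep the time of the last sync and pick up only rows changed since then.

```python
from datetime import datetime


def rows_since(rows, last_sync, ts_col="updated_at"):
    """Return rows changed after last_sync and the next watermark."""
    changed = [r for r in rows if r[ts_col] > last_sync]
    next_watermark = max((r[ts_col] for r in changed), default=last_sync)
    return changed, next_watermark


rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]
changed, watermark = rows_since(rows, datetime(2024, 1, 1, 12, 0))
```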
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we'll eventually end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Then the batch jobs that we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery as both a data lake and a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just two columns (timestamp + JSON), I would add 1 partition column and up to 4 cluster columns as well (4 is the most BigQuery allows). You could eventually even use yearly suffixed tables. This way you have at least 5 dimensions along which to limit the number of rows scanned for rematerialization.
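As a sketch of what such a table definition could look like, the helper below emits BigQuery DDL for a JSON-payload table that is partitioned by day and clustered. The column names (`event_ts`, `payload`, and the cluster columns) are invented for the example, and the guard reflects BigQuery's limit of four clustering columns.

```python
def json_table_ddl(table, cluster_cols):
    """Emit DDL for a timestamp + JSON table with partitioning and clustering."""
    if not 1 <= len(cluster_cols) <= 4:
        # BigQuery accepts at most four clustering columns.
        raise ValueError("expected between 1 and 4 clustering columns")
    extra = "".join(f"  {c} STRING,\n" for c in cluster_cols)
    return (
        f"CREATE TABLE `{table}` (\n"
        f"  event_ts TIMESTAMP,\n"
        f"{extra}"
        f"  payload STRING\n"
        f")\n"
        f"PARTITION BY DATE(event_ts)\n"
        f"CLUSTER BY {', '.join(cluster_cols)}"
    )


ddl = json_table_ddl("my_dataset.raw_events", ["event_type", "page"])
```

Queries that filter on the date and on the clustered columns then scan only the matching blocks instead of every JSON payload.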
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to either Dataflow or Pub/Sub, process them there, and write them to BigQuery under a new schema. This script would be able to create tables on the fly with the schema you code in your engine.
Btw, you can remove columns: that's rematerialization. You can rewrite the same table with a query that leaves them out. You can rematerialize to remove duplicate rows as well.
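Rematerialization to drop columns can be sketched as a query that rewrites the table onto itself without them. The table and column names below are placeholders; for a partitioned or clustered table the PARTITION BY / CLUSTER BY clauses would likely need to be restated in the statement.

```python
def drop_columns_sql(table, deprecated_cols):
    """Rewrite a table in place, leaving out the deprecated columns."""
    if not deprecated_cols:
        raise ValueError("no columns to drop")
    excluded = ", ".join(deprecated_cols)
    return (
        f"CREATE OR REPLACE TABLE `{table}` AS\n"
        f"SELECT * EXCEPT ({excluded})\n"
        f"FROM `{table}`"
    )


remat_sql = drop_columns_sql("my_dataset.events", ["old_form_field", "legacy_flag"])
```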
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types). In Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations write and write them to a file per event type, following some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
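The flatten-and-route part of those steps can be illustrated in plain Python, independent of Beam. This is not the Dataflow API, just the kind of logic a dynamic-destination hook would call; the `type` field, the dataset name and the table naming scheme are assumptions.

```python
def flatten(event, parent="", sep="_"):
    """Flatten nested dicts into one level; lists are left as they are."""
    flat = {}
    for key, value in event.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def route(event, dataset="analytics", allowed=None):
    """Pick a destination table from the event type and build its row."""
    row = flatten(event)
    if allowed is not None:
        row = {k: v for k, v in row.items() if k in allowed}
    return f"{dataset}.events_{event['type']}", row


table, row = route(
    {"type": "signup", "user": {"id": 7, "geo": {"country": "DE"}}, "tags": ["a", "b"]}
)
```

Because the row is derived from the fields actually present, its keys can also drive the per-table schema that is specified on the fly.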
Recently our 5-year-old MySQL data warehouse (used mostly for business reporting) has gotten quite full, and we need to come up with a way to archive old data that is not frequently accessed in order to clear up space.
I created a process which dumps old data from the DW into .parquet files in Amazon S3, which are then mapped onto an Athena table. This works quite well.
However, we sometimes add/rename/delete columns in existing tables. I'd like the changes to be reflected in the old, archived data as well, but I just can't come up with a good way to do it without reprocessing the entire dataset.
Is there a 'canon' way to maintain structural compatibility between a live data warehouse and its file-based archived data? I've googled the relevant literature and come up with nothing.
Should I just accept the fact that if I need to actively maintain schemas then the data is not really archived?
There is plenty of material online if you search for the term "schema evolution" in the big data space.
The Athena documentation has a chapter on schema updates with case-by-case examples.
If you are reprocessing the whole archived dataset to handle a schema change, you are probably doing a bit too much.
Since you have Parquet files, and by default Athena resolves Parquet columns by name rather than by index, you are safe in almost all cases (adding new columns, dropping columns, etc.) except column renames. To handle renamed columns (and also added or dropped columns), the fastest way is to use a view. In the view definition you can alias the renamed column. Also, if column renames make up most of your schema evolution and you are doing them a lot, you can consider Avro to handle that gracefully.
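A minimal sketch of the view approach: the helper generates a view over the archived table that exposes old column names under their current names. The table, view and column names are invented for the example.

```python
def compat_view_sql(view, table, columns, renamed):
    """Build a view that aliases archived column names to current ones.

    renamed maps the archived column name to its current name.
    """
    select = [f"{c} AS {renamed[c]}" if c in renamed else c for c in columns]
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT {', '.join(select)}\n"
        f"FROM {table}"
    )


view_sql = compat_view_sql(
    "archive.orders_current",
    "archive.orders_parquet",
    ["order_id", "cust_id", "total"],
    {"cust_id": "customer_id"},
)
```

Reports then query the view, and a later rename only requires regenerating the view rather than rewriting any Parquet files.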
Plan A:
It's too late to do this, but PARTITIONing is an excellent tool for getting the data out of the table.
I say "too late" because adding partitioning would require enough space for making a copy of the already-big table. And you don't have that much disk space?
If the table were partitioned by year, quarter, or month, you could:
Every period, "Export tablespace" to remove the oldest partition from the partition scheme.
That tablespace will then be a separate table; you could copy/dump/whatever, then drop it.
At about the same time, you would build a new partition to receive new data.
(I would keep the two processes separate so that you could stretch beyond 5 years or shrink below 5 with minimal extra effort.)
A benefit of the method is that there is virtually zero impact on the big table during the processing.
An extra benefit of partitioning: You can actually return space to the OS (assuming you have innodb_file_per_table=ON).
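One way to picture Plan A is as two separate sets of MySQL statements, one for detaching the oldest partition and one for adding the next. The sketch below uses `ALTER TABLE ... EXCHANGE PARTITION` to turn the oldest partition into a standalone table, which is my reading of the "Export tablespace" step rather than something the answer spells out. It also assumes RANGE partitioning with a catch-all partition named `future`; all table and partition names are hypothetical.

```python
def archive_oldest_sql(table, oldest_partition, archive_table):
    """Statements that move the oldest partition into its own table."""
    return [
        # Same structure as the big table, but not partitioned.
        f"CREATE TABLE {archive_table} LIKE {table}",
        f"ALTER TABLE {archive_table} REMOVE PARTITIONING",
        # Swap the partition's data with the empty archive table.
        f"ALTER TABLE {table} EXCHANGE PARTITION {oldest_partition} "
        f"WITH TABLE {archive_table}",
        # The partition is now empty and can leave the scheme.
        f"ALTER TABLE {table} DROP PARTITION {oldest_partition}",
    ]


def add_partition_sql(table, new_partition, upper_bound):
    """Statement that splits the catch-all partition to add a new period."""
    return (
        f"ALTER TABLE {table} REORGANIZE PARTITION future INTO ("
        f"PARTITION {new_partition} VALUES LESS THAN ({upper_bound}), "
        f"PARTITION future VALUES LESS THAN MAXVALUE)"
    )


archive_steps = archive_oldest_sql("facts", "p2019_01", "facts_2019_01")
add_step = add_partition_sql("facts", "p2024_02", "TO_DAYS('2024-03-01')")
```

Keeping the two helpers apart mirrors the advice to run the archive and add steps as separate processes.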
Plan B:
Look at what you do with the oooold data. Only a few things? Possibly involving summarization? So...
Don't archive the old data.
Summarize the data to-be-removed into new tables. Since they will be perhaps one-tenth the size, you can keep them online 'forever'.
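To make Plan B concrete, here is a toy roll-up of detail rows into monthly totals, the sort of summary that would be kept online after the detail is removed. The `day` and `revenue` fields are assumptions about the data.

```python
from collections import defaultdict
from datetime import date


def summarize_by_month(rows):
    """Collapse detail rows into one summary row per (year, month)."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0})
    for row in rows:
        key = (row["day"].year, row["day"].month)
        totals[key]["orders"] += 1
        totals[key]["revenue"] += row["revenue"]
    return dict(totals)


detail = [
    {"day": date(2019, 1, 5), "revenue": 100},
    {"day": date(2019, 1, 20), "revenue": 50},
    {"day": date(2019, 2, 1), "revenue": 70},
]
summary = summarize_by_month(detail)
```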
I'm trying to run a Query job in BigQuery and getting the following error:
Response too large to return. Consider setting allowLargeResults to true in your job configuration
I understand that I need to set allowLargeResults to True in my job configuration, but then I also have to supply a destination table field.
I don't want to insert the results of the query to specific table, only to process it locally.
How can I manage this situation?
"I don't want to insert the results of the query to specific table, only to process it locally."
I wanted to clarify this, so you hopefully feel better about using a destination table:
In reality, any query result ends up in some table!
If the result is smaller than 128 MB, BigQuery creates a temporary table on your behalf (in a special dataset whose name starts with an underscore, so it is not visible in the Web UI dataset/table navigator).
This temporary table is available for 24 hours and is used for query caching. You can even use it yourself; you just need to find which table was created. You can find this in the API as the job's destination table, which as I said above exists even if you have not set a specific table. Or you can find it in the Web UI.
When the result is bigger than 128 MB, you must set a destination table. The only drawback in your case is that you need to make sure you delete this table once you don’t need it anymore, otherwise you will be paying for storage.
You can do this either by actually deleting the table, manually (in the UI) or programmatically (API), or by setting an expiration on the table (API).
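Put together, the two API pieces mentioned above might look like the request bodies below: a query job that writes large results to a scratch table, and a table patch that sets an expiration so the table cleans itself up. The project, dataset and table names are placeholders. Note that `allowLargeResults` is a legacy SQL option; with standard SQL, setting a destination table is enough.

```python
import time


def large_result_job(project, dataset, table, sql):
    """jobs.insert body for a query that writes to a destination table."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useLegacySql": True,
                "allowLargeResults": True,  # requires destinationTable
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }


def expiration_patch(hours, now_ms=None):
    """tables.patch body that expires the table `hours` from now."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # expirationTime is milliseconds since the epoch, sent as a string.
    return {"expirationTime": str(now_ms + hours * 3600 * 1000)}


job = large_result_job("my-project", "scratch", "big_result", "SELECT ...")
patch = expiration_patch(24, now_ms=0)
```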
First of all, if it says the response is too large, then it is probably greater than 128 MB. You need to make sure that your query is accurate and that you really do want to return that much data. People often make mistakes in their queries, like join explosions, missing time filters to reduce the data, or missing limits.
Once you are convinced the data really is that large, you need to write it to a table, then export it to GCS, then download it, and then deal with it.
https://cloud.google.com/bigquery/docs/exporting-data#exportingmultiple
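The export step can be sketched as an extract-job request body. A wildcard in the destination URI lets BigQuery shard a large table across multiple files; the bucket, prefix and table names here are placeholders.

```python
def extract_job(project, dataset, table, bucket, prefix):
    """jobs.insert body that exports a table to sharded files in GCS."""
    return {
        "configuration": {
            "extract": {
                "sourceTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                # The * is replaced with a shard number per output file.
                "destinationUris": [f"gs://{bucket}/{prefix}-*.json"],
                "destinationFormat": "NEWLINE_DELIMITED_JSON",
            }
        }
    }


export = extract_job("my-project", "scratch", "big_result", "my-bucket", "exports/big_result")
```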
When streaming data into a BigQuery table, I wonder whether the default is to append the JSON data if the table already exists? The API documentation for tabledata().insertAll() is very brief and doesn't mention parameters like configuration.load.writeDisposition as in a load job.
There are no multiple choices here, so there is no default and no overriding case. Don't forget that BigQuery is a WORM technology (append-only by design). It looks to me like you are not aware of this, as there is no option like UPDATE.
You just set the path parameters (the trio of project, dataset, and table ID), then supply the rows as JSON matching the existing schema, and they will be appended to the table.
To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis.
In case of error you have a short error code that summarizes the error. For help on debugging the specific reason value you receive, see troubleshooting errors.
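A small sketch of the request body described above, with an `insertId` per row. Deriving the ID from a hash of the row content is one possible choice, not something the API requires; it makes a retried batch carry the same IDs so that best-effort de-duplication can apply.

```python
import hashlib
import json


def insert_all_body(rows):
    """Build a tabledata.insertAll body with a deterministic insertId per row."""

    def insert_id(row):
        payload = json.dumps(row, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    return {"rows": [{"insertId": insert_id(r), "json": r} for r in rows]}


# Hypothetical rows matching the table's existing schema.
rows = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]
body = insert_all_body(rows)
retry = insert_all_body(rows)
```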
Also worth reading:
Bigquery internalError when streaming data