I am looking at moving our Shopify data to BigQuery for reporting purposes. I paginate through the customers endpoint of the Shopify API to get all the customer-level data, export it into a CSV that I store on Google Cloud Storage, and then import it into BigQuery. My question is: what is the best way to deal with incremental data loads, given that some of the entries in the current customer datamart (for example, total order count) might have changed and some new customers might have been created since the last table update? Any advice on the design pattern would be appreciated. Many thanks
To handle incremental data landing on GCS (source) with BigQuery as the target, you have a couple of Google options:
Dataflow: You can create a Dataflow pipeline that loads the incremental data into intermediate (staging) tables in BigQuery. Once the data is in the intermediate table, you can compute the current state by joining the two tables (target and intermediate) and append or merge the latest data into the target BigQuery table.
This reconciliation can be done through a scheduled Dataflow pipeline or through a scheduled BigQuery query, as sketched below.
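A minimal sketch of the scheduled-query variant, using a MERGE to upsert the staging rows into the reporting table. Project, dataset, table, and column names (customer_id, total_order_count, etc.) are assumptions, not from the original question:

```python
# Minimal sketch: after the CSV from GCS has been loaded into a staging
# table, a scheduled MERGE upserts it into the reporting table. Project,
# dataset, table, and column names are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my-project.shopify_dw.customers` AS target
USING `my-project.shopify_staging.customers` AS staging
ON target.customer_id = staging.customer_id
WHEN MATCHED THEN
  UPDATE SET
    target.email = staging.email,
    target.total_order_count = staging.total_order_count,
    target.updated_at = staging.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, total_order_count, updated_at)
  VALUES (staging.customer_id, staging.email, staging.total_order_count, staging.updated_at)
"""

# Run as a query job; schedule it (Cloud Composer, Cloud Scheduler, or a
# BigQuery scheduled query) right after the GCS-to-staging load completes.
client.query(merge_sql).result()
```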
Dataprep: here you can refer to how to create an ETL pipeline. You can add the target (BigQuery table) as a reference.
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables that have hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then the batch jobs we have running every 10 minutes would perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and building Tableau reports. So we have jobs that aggregate the raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp + JSON columns, I would add one partitioning column and up to four clustering columns (the current BigQuery maximum) as well. Eventually you could even use yearly suffixed tables. This way you have several dimensions that limit the number of rows scanned during rematerialization, as in the sketch below.
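A rough sketch of what such a table could look like, assuming the JSON-payload model from the question; dataset and column names are illustrative only:

```python
# Sketch of the raw-JSON table with a partition column and clustering, so
# the 10-minute rematerialization jobs only scan the slices they need.
# Dataset and column names are illustrative, not from the original post.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS `my-project.raw_events.events`
(
  event_timestamp TIMESTAMP,
  event_type      STRING,
  user_id         STRING,
  page            STRING,
  form_id         STRING,
  payload         JSON   -- or a STRING holding the serialized protobuf/JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_type, user_id, page, form_id
"""
client.query(ddl).result()
```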
The other option would be to change your model and add an event-processing middle layer. You could first route all your events to Dataflow or Pub/Sub, process them there, and write to BigQuery with the new schema. Such a pipeline can create tables on the fly with whatever schema you code in your engine.
Btw, you can remove columns: that's rematerialization, i.e. rewriting the same table with a query. You can rematerialize to remove duplicate rows as well. For example:
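A quick sketch of rematerialization (table and column names are made up):

```python
# Quick sketch of rematerialization: rewrite the table without deprecated
# columns and without duplicate rows. Table and column names are made up.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE OR REPLACE TABLE `my-project.analytics.pages` AS
SELECT DISTINCT * EXCEPT(deprecated_form_field_1, deprecated_form_field_2)
FROM `my-project.analytics.pages`
""").result()
```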
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
Read the event/JSON from Pub/Sub.
Flatten the events and filter down to the columns you want to insert into the BQ table.
With Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types), and you can specify the schema on the fly based on the fields in your JSON.
Get the failed insert records from the Dynamic Destinations step and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues).
Read the file, update the schema once, and load the file into that BQ table.
I have implemented this logic in my use case and it is working perfectly fine.
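For reference, a rough sketch of that flow in the Beam Python SDK (the answer above may well have used the Java SDK's DynamicDestinations class); the Pub/Sub topics, dataset, field names, and the dead-letter handling are assumptions rather than the answerer's exact code:

```python
# Rough sketch in the Beam Python SDK: route each event to a table named
# after its event type and collect failed inserts for later schema fixes.
# Topic names, dataset, fields, and the dead-letter topic are assumptions.
import json

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryWriteFn
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    events = (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # Keep only the fields we actually want to land in BigQuery.
        | "Filter" >> beam.Map(
            lambda e: {k: e.get(k) for k in ("event_type", "user_id", "ts")})
    )

    result = events | "WriteToBQ" >> beam.io.WriteToBigQuery(
        # Dynamic destination: one table per event type.
        table=lambda row: f"my-project:events.{row['event_type']}",
        schema="event_type:STRING,user_id:STRING,ts:TIMESTAMP",
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        method="STREAMING_INSERTS",
    )

    # Rows BigQuery rejected (e.g. on a schema mismatch) come back on the
    # FAILED_ROWS output (newer SDKs also expose result.failed_rows). Here
    # they go to a dead-letter topic; the answer above windows them and
    # writes them to files per event type instead.
    _ = (
        result[BigQueryWriteFn.FAILED_ROWS]
        | "FormatFailed" >> beam.Map(lambda failed: json.dumps(failed, default=str).encode("utf-8"))
        | "DeadLetter" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/bq-failed-rows")
    )
```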
I have a use case for designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the options below would be better for this use case?
Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or using Cloud Bigtable for storage and installing the HBase shell on Compute Engine to query the Bigtable data.
Based on my analysis below for this specific use case, I see that Cloud Storage can be queried through BigQuery, and that Bigtable supports CSV imports and querying. The BigQuery quotas also mention a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro, which I assume means I could run multiple load jobs if I am loading more than 15 TB.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above use case?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading, cluster your tables for massive cost improvements on the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and lower costs (see the sketch after this list).
Instead of CREATE TABLE you can also do imports via the API; those are free (unlike the query cost of a CREATE TABLE ... AS SELECT).
15 TB can be handled easily by BigQuery.
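A minimal sketch of that flow with the Python client: expose the GCS CSVs as an external table, then materialize them into a native partitioned and clustered table. Bucket, dataset, and column names are placeholders:

```python
# Sketch: expose the GCS CSVs as an external table, then materialize them
# into a native partitioned + clustered table for cheaper repeated queries.
# Bucket, dataset, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# 1) External (permanent) table pointing at the CSVs in GCS.
table = bigquery.Table("my-project.staging.events_external")
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/events/*.csv"]
external_config.autodetect = True
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# 2) Materialize into a native table.
client.query("""
CREATE OR REPLACE TABLE `my-project.warehouse.events`
PARTITION BY DATE(event_ts)
CLUSTER BY user_id
AS
SELECT * FROM `my-project.staging.events_external`
""").result()
```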
I am trying to explore BigQuery's ability to load CSV files (DoubleClick impression data) into BigQuery's partitioned tables. My use case includes:
1. Reading daily (nightly) dumps (CSV) from Google Cloud Storage for my customer's (an ad agency's) 30 different clients into BQ. A daily dump may contain data from the previous day/week. All data should be loaded into the respective daily partition (in BQ) so as to provide daily reporting to individual clients.
2. The purpose here is to build an analytical system that gives the ad agency the ability to analyze trends and patterns over time and across clients.
I am new to BQ and thus trying to understand its Schema layout.
Should I create a single table with daily partitions (holding data from all 50 clients / 50 daily CSV load files)? Do the partitions need to be created well in advance?
Or should I create 50 different tables (partitioned by date), one per client, so as NOT to run into any data sharing/security concerns of the single-table option?
My customer wants a simple solution with minimal cost.
If you are going to use the transfer service (as mentioned in the comment), you don't need to create the tables by hand; the transfer service will do that for you. It will schedule daily jobs and load the data into the right daily partition, and if the data arrives with a short delay (2-3 days), the transfer service will still pick it up.
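If you end up scripting the loads yourself instead of using the transfer service, one common pattern is to target the daily partition directly with a partition decorator. A rough sketch with the Python client; bucket, table, and date are placeholders:

```python
# Sketch: load one day's CSV dump into the matching daily partition by
# addressing it with a partition decorator ("table$YYYYMMDD").
# Bucket, table, and date are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    # WRITE_TRUNCATE on a partition decorator replaces only that partition,
    # which makes re-running a day's load idempotent.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/impressions/2024-05-01/*.csv",
    "my-project.ads.impressions$20240501",
    job_config=job_config,
)
load_job.result()
```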
I am trying to create date-partitioned + template tables in BigQuery:
Create base table using bq mk --time_partitioning_type=DAY myapp.customer
Call API insertAll with "tableId": "customer", "templateSuffix": "_activated"
The resulting customer_activated table inherits the schema of the customer table, but has no timePartitioning.
How can I ensure template tables inherit the time partitioning of the base table?
For people coming here in the future: the accepted answer is outdated. The BigQuery streaming APIs now support date-partitioned tables, both streaming to the table itself and to a specific partition.
Link to docs
Streaming APIs do not yet support date-partitioning
Your option is to use a load job with the partition as the destination for the initial population, and then just stream directly to the table (without using partition decorators) and let BigQuery infer the partition from the timestamp.
Otherwise you will have to wait until streaming supports date-partitioning, which the Google team has said will happen in the near future.
Update:
Since around mid-2017 BigQuery supports Streaming into partitioned tables
Just FYI, as of November 2022, it is indeed possible to stream data into already existing partitioned tables; however, tables created automatically from a template table do NOT inherit the time-partitioning configuration of the parent table, which is what the OP was asking about in the first place.
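To illustrate the current behaviour: create the partitioned destination table explicitly (since a template table won't inherit the partitioning) and stream into it. A small sketch with the Python client; project, dataset, and schema are made up:

```python
# Sketch: create the partitioned table explicitly (template tables won't
# inherit partitioning) and stream rows into it. Names and schema are
# made up for illustration.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("activated_at", "TIMESTAMP"),
]
table = bigquery.Table("my-project.myapp.customer_activated", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY)
client.create_table(table, exists_ok=True)

# Stream to the table and let BigQuery route rows to partitions, or target
# a specific partition with "customer_activated$20240501".
errors = client.insert_rows_json(
    "my-project.myapp.customer_activated",
    [{"customer_id": "c-123", "activated_at": "2024-05-01T12:00:00Z"}],
)
assert not errors, errors
```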
I'm using dataflow to process files stored in GCS and write to Bigquery tables. Below are my requirements:
input files contain event records, each record pertaining to one eventType;
need to partition records by eventType;
for each eventType output/write records to a corresponding Bigquery table, one table per eventType.
the event types present in each batch of input files vary;
I'm thinking of applying transforms such as "GroupByKey" and "Partition"; however, it seems that I would have to know the number (and types) of events at development time, which is needed to determine the partitions.
Do you guys have a good idea for doing the partitioning dynamically, meaning the partitions can be determined at run time?
Why not load everything into a single "raw" BigQuery table, and then use the BigQuery API to determine the different event types and export each event type to its own table (e.g., via https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery or an API call)?
If your input format is simple, you can do that without using Dataflow at all, and it will probably be more cost-efficient.
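A rough sketch of that raw-table approach with the BigQuery Python client; dataset and column names are placeholders, and it assumes the event_type values are valid table names:

```python
# Sketch: everything lands in one raw table; discover the event types with
# a query, then materialize one table per type. Names are placeholders and
# event_type values are assumed to be valid table names.
from google.cloud import bigquery

client = bigquery.Client()

event_types = [
    row.event_type
    for row in client.query(
        "SELECT DISTINCT event_type FROM `my-project.raw.events`"
    ).result()
]

for event_type in event_types:
    client.query(f"""
        CREATE OR REPLACE TABLE `my-project.events_by_type.{event_type}` AS
        SELECT * FROM `my-project.raw.events`
        WHERE event_type = '{event_type}'
    """).result()
```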