ETL on Google Cloud (Dataflow vs. Spring Batch) -> BigQuery

I am considering BigQuery for my data warehouse requirement. Right now, my data lives in Google Cloud (Cloud SQL and Bigtable), and I have exposed REST APIs to retrieve data from both. I would like to pull data from these APIs, do the ETL, and load the data into BigQuery. I am evaluating two ETL options (a daily job over hourly data):
Use Java Spring Batch, build a microservice, and use Kubernetes as the deployment environment. Will it scale?
Use Cloud Dataflow for the ETL.
Then use the BigQuery batch insert API (for the initial load) and the streaming insert API (for incremental loads when new data becomes available in the source) to load a denormalized BigQuery schema.
Please let me know your opinions.

Without knowing your data volumes (specifically how much new or changed data you have per day) or how you are doing paging with your REST APIs, here is my guidance...
If you go down the path of using Spring Batch, you are more than likely going to have to come up with your own sharding mechanism: how will you divide up the REST calls across your Spring service instances? You will also be in the Kubernetes management space and will have to handle retries against the BigQuery streaming API yourself.
If you go down the Dataflow route, you will have to write some transform code to call your REST API and perform the paging that populates the PCollection destined for BigQuery. With the recent addition of Dataflow templates you could create a pipeline that is triggered every N hours and parameterize your REST call(s) to pull only the data ?since=latestCall, and from there execute the BigQuery writes. I recommend doing this in batch mode because 1) it will scale better if you have millions of rows and 2) it will be less cumbersome to manage (during non-active times).
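For illustration, here is a minimal sketch of such a templated batch pipeline with the Beam Java SDK. The RestToBqOptions interface, fetchPage() helper, schema, and table name are all placeholders you would replace with your own; treat this as a starting point under those assumptions, not a finished implementation.

import java.util.Arrays;
import java.util.List;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class RestToBigQuery {

  // Hypothetical template parameter: the timestamp of the latest successful call.
  public interface RestToBqOptions extends PipelineOptions {
    ValueProvider<String> getSince();
    void setSince(ValueProvider<String> value);
  }

  // Placeholder for your REST call and paging logic (?since=<latestCall>&page=N).
  static List<TableRow> fetchPage(String since) {
    throw new UnsupportedOperationException("implement your REST client here");
  }

  public static void main(String[] args) {
    RestToBqOptions options = PipelineOptionsFactory.fromArgs(args).as(RestToBqOptions.class);
    final ValueProvider<String> since = options.getSince(); // serializable, resolved at run time

    // Placeholder denormalized schema.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("STRING"),
        new TableFieldSchema().setName("value").setType("FLOAT")));

    Pipeline p = Pipeline.create(options);
    p.apply("Seed", Create.of("trigger"))
     .apply("Call REST API", ParDo.of(new DoFn<String, TableRow>() {
         @ProcessElement
         public void process(ProcessContext c) {
           for (TableRow row : fetchPage(since.get())) {
             c.output(row);
           }
         }
     }))
     .apply("Write to BigQuery", BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")          // placeholder table
         .withSchema(schema)
         .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    p.run();
  }
}

Since the source PCollection is bounded, this write goes through BigQuery load jobs rather than streaming inserts, which matches the batch-mode recommendation above.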
Since Cloud Dataflow has built-in retry logic for BigQuery and provides consistency across all input and output collections, my vote is for Dataflow in this case.
How big are your REST call results in record count?

Related

Avoid session shutdown on BigQuery Storage API with Dataflow

I am implementing an ETL job that migrates a non-partitioned BigQuery table to a partitioned one.
To do so I use the BigQuery Storage API. This creates a number of sessions to pull data from.
In order to route the BigQuery writes to the right partition I use the File Loads method.
Streaming inserts were not an option due to the 30-day limitation.
The Storage Write API also seems to be limited in identifying the partition.
By resorting to the File Loads method, the data are first written to GCS.
The issue is that this takes too much time and there is a risk of the sessions closing.
Behind the scenes the File Loads method is a complex one with multiple steps, for example writing to GCS and combining the entries into a joined file per destination/partition.
Based on the Dataflow processes it seems that workers can execute workloads on different parts of the pipeline.
How can I avoid the risk of the sessions closing? Is there a way for my Dataflow workers to focus only on the critical part, which is writing to GCS first, and once that is done, focus on all the other aspects?
You can do a Reshuffle right before applying the write to BigQuery. In Dataflow, that will create a checkpoint and a new stage in the job. The write to BigQuery will start only once all steps prior to the Reshuffle have finished, and in case of errors and retries the job will backtrack to that checkpoint.
Please note that doing a Reshuffle implies shuffling the data, so there will be a performance impact.
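As a minimal Beam Java fragment, assuming rows is the PCollection<TableRow> produced by the earlier steps of your migration pipeline and the destination table name is a placeholder, this could look like:

rows
    .apply("Checkpoint before write", Reshuffle.<TableRow>viaRandomKey())
    .apply("Write to partitioned table", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.partitioned_table")     // placeholder destination
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

Everything before the Reshuffle (reading the sessions, transforming the rows) then forms one stage, and the file-load write only begins once that stage has completed.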

PubSub topic with binary data to BigQuery

I expect to have thousands of sensors sending telemetry data at 10 FPS with around 1 KB of binary data per frame, using IoT Core, meaning I'll get it via Pub/Sub. I'd like to get that data into BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
The question is: what's my best alternative?
I've thought about a Cloud Run service running an Express app that accepts the data from Pub/Sub, uses a global variable to accumulate around 500 rows in RAM, and then dumps them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain something from the accumulation, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method streams one row at a time irrespective of accumulation of rows.
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only supports the Java, Python and Go (in preview) client libraries.
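As an illustration of the legacy streaming path, here is a rough sketch using the google-cloud-bigquery Java client (shown in Java to match the other snippets in this thread; the Node.js client's table.insert() is the equivalent call). Dataset, table, and field names are placeholders.

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("my_dataset", "telemetry");   // placeholder names

InsertAllResponse response = bigquery.insertAll(
    InsertAllRequest.newBuilder(tableId)
        // one row per frame; "payload" would carry your base64-encoded binary frame
        .addRow(Map.of("device_id", "sensor-001", "payload", "base64-frame", "ts", "2021-01-01T00:00:00Z"))
        .build());

if (response.hasErrors()) {
  // per-row errors are keyed by the index of the failed row
  response.getInsertErrors().forEach((index, errors) -> System.err.println(index + ": " + errors));
}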
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all of the data or none of it. Queries never scan partial data.
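For reference, a batch load job from GCS with the google-cloud-bigquery Java client looks roughly like the following sketch; bucket, path, and table names are placeholders.

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("my_dataset", "telemetry");

// Load newline-delimited JSON files from GCS into the table; the load job itself is free.
LoadJobConfiguration config = LoadJobConfiguration
    .newBuilder(tableId, "gs://my-bucket/telemetry/2021-01-01/*.json", FormatOptions.json())
    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
    .build();

Job job = bigquery.create(JobInfo.of(config));
job = job.waitFor();                                   // block until the job finishes
if (job.getStatus().getError() != null) {
  throw new RuntimeException(job.getStatus().getError().toString());
}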

Pull data from HTTP request API to Google Cloud

I have an app that sends me data from an API. The data is semi-structured (JSON).
I would like to send this data to Google BigQuery in order to store all the information.
However, I'm not able to find out how to do it properly.
So far I have used Node.js on my own server to get the data via POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts.
You can also write the data to Pub/Sub or Google Cloud Storage and use Dataflow pipelines to load it into BigQuery (using either streaming inserts, which incur costs, or batch load jobs, which are free).
You can also log to Stackdriver and from there select and send to BigQuery (there are already direct options for this in GCP; note that under the hood it performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files and perform batch load jobs by directly calling the BigQuery API. Note that there are limits on the number of batch loads you can run per day against a particular table (1,000 per day).
There is a page in the official documentation that lists all the ways of loading data into BigQuery.
For simplicity, you can just send data from your local data source using the Google Cloud client libraries for BigQuery. There is a guide on how to do that as well as a relevant code example.
But my honest recommendation is to send the data to Google Cloud Storage and load it into BigQuery from there. That way the whole process will be more stable.
You can check all the options in the first link I posted and choose whatever fits best with your workflow.
Keep in mind the limitations of this process.

How can I load data from BigQuery to Spanner?

I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table of ~80 GB over ~900M rows that I write back to BigQuery. I'd like to make this dataset available for an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programmatically? We are already using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it there that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
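A minimal sketch with the Beam Java SDK, assuming placeholder table, instance, database, and column names:

p.apply("Read aggregates from BigQuery", BigQueryIO.readTableRows()
        .from("my-project:my_dataset.daily_aggregates"))       // placeholder source table
 .apply("To Spanner mutations", MapElements.via(new SimpleFunction<TableRow, Mutation>() {
     @Override
     public Mutation apply(TableRow row) {
       return Mutation.newInsertOrUpdateBuilder("aggregates")  // placeholder Spanner table
           .set("id").to((String) row.get("id"))
           .set("value").to(Double.parseDouble(row.get("value").toString()))
           .build();
     }
 }))
 .apply("Write to Spanner", SpannerIO.write()
     .withInstanceId("my-instance")
     .withDatabaseId("my-database"));

Since you are already using Airflow, you could orchestrate this pipeline from there (for example with the Dataflow operators) and run it right after the daily aggregation job completes.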

Inserting into BigQuery via load jobs (not streaming)

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs, not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built-in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but it's now possible to use a micro-batching tactic with a streaming pipeline. If you decide to go that way, you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are two different methods: FILE_LOADS and STREAMING_INSERTS. If you use the first one you need to include withTriggeringFrequency and withNumFileShards. For the triggering frequency, in my experience it is better to use minutes, and the number will depend on the amount of throughput. If you receive quite a lot of data, try to keep it small; I have seen "stuck" errors when you increase it too much. The number of shards mostly affects your GCS billing: too many shards will create more files per table per X minutes.
If your input data size is not that big, streaming inserts can work really well and the cost shouldn't be a big deal. In that scenario you can use the STREAMING_INSERTS method and remove withTriggeringFrequency and withNumFileShards. You can also add withFailedInsertRetryPolicy, for example InsertRetryPolicy.retryTransientErrors(), so that no rows are lost (keep in mind that idempotency is not guaranteed with STREAMING_INSERTS, so duplication is possible); a minimal variant is sketched after these notes.
You can check your jobs in BigQuery and validate that everything is working. Keep in mind BigQuery's quotas for load jobs (I think it is 1,000 jobs per table per day) when you are trying to define the triggering frequency and number of shards.
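For reference, a minimal streaming-inserts variant of the snippet above (only the write step shown, with the same placeholder destinationTable and myTableSchema) could look like this:

.apply("Saving via streaming inserts", BigQueryIO.writeTableRows()
    .to(destinationTable(options))
    .withJsonSchema(myTableSchema)
    .withMethod(Method.STREAMING_INSERTS)
    // retry rows that fail with transient errors so they are not lost
    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withExtendedErrorInfo());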
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).