Difference between BigQuery API and BigQuery Storage API? - google-bigquery

What is the difference between the BigQuery API Client Libraries and BigQuery Storage API Client Libraries?
In the Overview section of the BigQuery Storage Read API documentation, it says:
The BigQuery Storage Read API provides fast access to BigQuery-managed storage by using an rpc-based protocol.
Is the BigQuery Storage API faster simply because it uses RPC?

Yes, you are correct: it is fast because it uses RPC. Also, as stated in this documentation,
The Storage Read API does not provide functionality related to managing BigQuery resources such as datasets, jobs, or tables.
Basically, you would want to use the BigQuery Storage API on top of the BigQuery API when your operation needs to scan large volumes of managed data, since it exposes high-throughput data reading for consumers. Otherwise, the BigQuery API is enough for interacting with core resources such as datasets, tables, jobs, and routines.
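As a rough illustration, here is a minimal Python sketch (assuming the google-cloud-bigquery, google-cloud-bigquery-storage, and pandas packages are installed; the table is a public dataset) that downloads the same query result either over the REST-based BigQuery API or over the rpc-based Storage Read API:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials
    query = """
        SELECT name, number
        FROM `bigquery-public-data.usa_names.usa_1910_current`
    """
    job = client.query(query)

    # Regular BigQuery API: result rows are paged through the REST endpoint.
    df_rest = job.result().to_dataframe(create_bqstorage_client=False)

    # Storage Read API: the same rows are streamed over the rpc-based protocol,
    # which is noticeably faster for large result sets.
    df_storage = job.result().to_dataframe(create_bqstorage_client=True)

Note that the Storage Read API only speeds up reading the result; the query itself, and any dataset, table, or job management, still goes through the regular BigQuery API.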
For further reading, see the documentation on some of the key features of the Storage Write API and the Storage Read API.

Related

PubSub topic with binary data to BigQuery

I expect to have thousands of sensors sending telemetry data at 10 FPS, with around 1 KB of binary data per frame, using IoT Core, meaning I'll get it via Pub/Sub. I'd like to get that data into BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
Question is, what's my best alternative?
I've thought about a Cloud Run service running an Express app to accept the data from Pub/Sub, using a global variable to accumulate around 500 rows in RAM, and then dumping them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain something from accumulation, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method, as you have already mentioned. The insert() method streams rows one at a time regardless of whether you accumulate them first.
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently only supports the Java, Python, and Go (in preview) client libraries.
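As a rough sketch of a streaming insert (shown with the Python client for brevity; the Node.js table.insert() call from the question is the analogous entry point, and both go through the tabledata.insertAll endpoint), with a made-up table and rows:

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.telemetry.frames"  # hypothetical dataset and table

    # Rows arriving from Pub/Sub, e.g. one dict per sensor frame.
    rows = [
        {"sensor_id": "sensor-1", "ts": "2022-01-01T00:00:00Z", "payload": "aGVsbG8="},
        {"sensor_id": "sensor-2", "ts": "2022-01-01T00:00:01Z", "payload": "d29ybGQ="},
    ]

    # Legacy streaming API. Sending a few hundred rows per call reduces HTTP
    # overhead, but every row is still streamed (and billed) individually.
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("Rows that failed to insert:", errors)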
Batch Ingestion
If your requirement is to load large, bounded data sets that don't have to be processed in real time, prefer batch loading. BigQuery batch load jobs are free: you only pay for storing and querying the data, not for loading it. Refer to the quotas and limits for batch load jobs here. Some more key points on batch load jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data. Queries never scan partial data.
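As a minimal sketch of such a batch load with the Python client (the bucket, file layout, and table here are made up), a load job is started from files already staged in Cloud Storage and is free of charge:

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.telemetry.frames"          # hypothetical target table
    uri = "gs://my-bucket/frames/2022-01-01/*.json"   # hypothetical staged files

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # or pass an explicit schema
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # waits for the job; queries see all of the data or none of it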

Pull data from HTTP request API to Google Cloud

I have an app that sends me data from an API. The data is semi-structured (JSON).
I would like to send this data to Google BigQuery in order to store all the information.
However, I haven't been able to figure out how to do it properly.
So far I have used Node.js on my own server to get the data using POST requests.
Could you please help me? Thanks.
You can use the BigQuery API to do streaming inserts.
You can also write the data to Pub/Sub or Google Cloud Storage and use Dataflow pipelines to load it into BigQuery (you can use either streaming inserts, which incur costs, or batch load jobs, which are free).
You can also log to Stackdriver and from there select and send data to BigQuery (direct options for this already exist in GCP; note that under the hood this performs streaming inserts).
If you feel that setting up Dataflow is complicated, you can store your files and perform batch load jobs by calling the BigQuery API directly. Note that there are limits on the number of batch loads you can run per day against a particular table (1,000 per day).
There is a page in the official documentation that lists all the options for loading data into BigQuery.
For simplicity, you can just send data from your local data source. You should use the Google Cloud client libraries for BigQuery. Here you have a guide on how to do that as well as a relevant code example.
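For example, here is a minimal sketch of that path with the Python client library (the table name and rows are made up); the JSON records received from the API are loaded directly from memory as a batch load job rather than a streaming insert:

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.api_events"  # hypothetical table

    # Semi-structured records received from the external API.
    rows = [
        {"event": "signup", "user": "alice", "ts": "2022-03-01T12:00:00Z"},
        {"event": "login", "user": "bob", "ts": "2022-03-01T12:05:00Z"},
    ]

    job_config = bigquery.LoadJobConfig(autodetect=True)
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # runs as a (free) load job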
But my honest recommendation is to send the data to Google Cloud Storage and, from there, load it into BigQuery. This way the whole process will be more stable.
You can check all the options in the first link that I've posted and choose whichever you think will fit best with your workflow.
Keep in mind the limitations of this process.

Export pubsub data to object storage using SCIO

I am trying to export Cloud Pub/Sub streams to Cloud Storage as described in this post by Spotify, "Reliable export of Cloud Pub/Sub streams to Cloud Storage", or this post by Google, "Simple backup and replay of streaming events using Cloud Pub/Sub, Cloud Storage, and Cloud Dataflow".
Pub/Sub creates an unbounded PCollection (or SCollection in SCIO), but saveAsTextFile requires a bounded collection.
Is there any way to overcome this?
The new dynamic IO module should support saving unbounded collections to files.
However, note that the approach in that Spotify article doesn't use Dataflow, since it has a lot of custom logic for SLA/bucketing/reliability reasons. So YMMV.
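For illustration, here is a rough sketch of that idea using the Beam Python SDK (the topic and bucket names are made up; in SCIO the equivalent is the dynamic file IO mentioned above). The unbounded stream is put into fixed windows so that each window pane can be flushed out as files:

    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    TOPIC = "projects/my-project/topics/events"    # hypothetical topic
    OUTPUT = "gs://my-bucket/pubsub-export/"       # hypothetical bucket

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(topic=TOPIC)
            | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
            # Windowing turns the unbounded stream into finite panes that the
            # file sink can write out.
            | "Window" >> beam.WindowInto(window.FixedWindows(5 * 60))
            | "Write" >> fileio.WriteToFiles(
                path=OUTPUT,
                sink=lambda dest: fileio.TextSink(),
            )
        )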

ETL on Google Cloud - (Dataflow vs. Spring Batch) -> BigQuery

I am considering BigQuery for my data warehouse requirement. Right now, I have my data in Google Cloud (Cloud SQL and Bigtable). I have exposed REST APIs to retrieve data from both. Now, I would like to retrieve data from these APIs, do the ETL, and load the data into BigQuery. I am evaluating two ETL options (a daily job over hourly data) right now:
Use Java Spring Batch to create a microservice and use Kubernetes as the deployment environment. Will it scale?
Use Cloud Dataflow for ETL
Then use the BigQuery batch insert API (for the initial load) and the streaming insert API (for incremental loads when new data is available in the source) to load a denormalized BigQuery schema.
Please let me know your opinions.
Without knowing your data volumes, specifically how much new or changed data you have per day, or how you are doing paging with your REST APIs, here is my guidance...
If you go down the path of using Spring Batch, you are more than likely going to have to come up with your own sharding mechanism: how will you divide up the REST calls to instantiate your Spring services? You will also be in the Kubernetes management space and will have to handle retries with the streaming API to BQ.
If you go down the Dataflow route, you will have to write some transform code to call your REST API and perform the paging to populate your PCollection destined for BQ. With the recent addition of Dataflow templates you could create a pipeline that is triggered every N hours and parameterize your REST call(s) to just pull data ?since=latestCall. From there you could execute BigQuery writes. I recommend doing this in batch mode, as (1) it will scale better if you have millions of rows and (2) it will be less cumbersome to manage (during non-active times).
Since Cloud Dataflow has built-in retry logic for BigQuery and provides consistency across all input and output collections, my vote is for Dataflow in this case.
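For illustration, a rough Beam Python sketch of that templated, batch-mode shape (the endpoint, paging scheme, and table are made up; a Java pipeline would look analogous):

    import apache_beam as beam
    import requests
    from apache_beam.options.pipeline_options import PipelineOptions

    API_URL = "https://api.example.com/records"   # hypothetical REST endpoint
    TABLE = "my-project:warehouse.records"        # hypothetical, pre-created table

    def fetch_pages(since):
        """Call the REST API and yield one dict per record, following paging."""
        page = 1
        while True:
            resp = requests.get(API_URL, params={"since": since, "page": page})
            resp.raise_for_status()
            records = resp.json()
            if not records:
                break
            for record in records:
                yield record
            page += 1

    def run(since):
        # On Dataflow, also pass --temp_location=gs://... so FILE_LOADS can stage files.
        with beam.Pipeline(options=PipelineOptions()) as p:
            (
                p
                | beam.Create([since])        # a single element kicks off the fetch
                | beam.FlatMap(fetch_pages)   # fan out into one element per record
                | beam.io.WriteToBigQuery(
                    TABLE,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # batch load jobs
                )
            )

    if __name__ == "__main__":
        run("2022-01-01T00:00:00Z")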
How big are your REST call results in record count?

Google DataFlow API for ingesting HLL_COUNT.INIT into BigQuery

I am sending data to Pub/Sub, from where I am trying to create a Dataflow job to put the data into BigQuery.
I have a column in the data for unique counting on which I want to run HLL_COUNT.INIT.
Is there an equivalent method on the Dataflow side so that I can directly store the HLL version of the column in BigQuery?
No, Dataflow doesn't have support for BigQuery's HLL sketch format, but it is clearly something that would be useful. I created a feature request for it in the Dataflow issue tracker: https://issuetracker.google.com/62153424.
Update: A BigQuery-compatible implementation of HyperLogLog++ has been open-sourced at github.com/google/zetasketch, and a design doc (docs.google.com/document/d/…) about integrating it into Apache Beam has been sent out to dev@beam.apache.org.
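In the meantime, a hedged sketch of what still works today (table and column names are made up): load the raw column values into BigQuery from Dataflow and let BigQuery itself build the sketch with HLL_COUNT.INIT, for example via the Python client:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical tables: raw events written by the Dataflow job, plus a rollup
    # table that stores one HLL sketch per day instead of the raw values.
    query = """
    CREATE OR REPLACE TABLE `my-project.analytics.daily_sketches` AS
    SELECT
      DATE(event_ts) AS day,
      HLL_COUNT.INIT(user_id) AS user_sketch  -- can later be merged/extracted
    FROM `my-project.analytics.raw_events`
    GROUP BY day
    """

    client.query(query).result()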