As data arrives in my BigQuery table, I want to send some of it to another database: a datamart or an operational database that serves real-time dashboards.
How do I do this? Polling the enormous BQ table is too expensive and slow, and I want updates to be frequent, close to real-time.
Strangely, I find little information about streaming data out of BigQuery.
"Polling the enormous BQ table is too expensive and slow"
Make sure to partition your data by day, and if you have too much data, cluster it by hour.
There isn't a natural way to stream data out of BigQuery as it arrives, but if you partition and cluster your data appropriately, scans will be far less costly than scanning a naive, unpartitioned table.
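To make such polling cheap, the query has to prune partitions. A minimal sketch in Python, assuming an ingestion-time day-partitioned table (the table name is hypothetical); only partitions on or after the cutoff date are scanned and billed:

```python
from datetime import date

def incremental_query(table: str, since: date) -> str:
    # Filtering on the _PARTITIONDATE pseudo-column lets BigQuery prune
    # every older partition, so scan cost is bounded by the recent
    # partitions rather than by the whole table.
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _PARTITIONDATE >= DATE '{since.isoformat()}'"
    )

sql = incremental_query("my-project.my_dataset.events", date(2023, 1, 2))
print(sql)
```

Running this query on a schedule (via the client library of your choice) approximates "streaming out" without full-table scans.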
For real time: would it be an option to split the data to BigQuery and to the other tools from within the pipeline, instead of after it is stored in BQ?
Regarding the comment
"I would rather not alter each of clients to write to two targets, BQ plus PubSub"
Have each client write only to Pub/Sub. Then use the click-to-deploy Pub/Sub-to-BigQuery Dataflow template for the most reliable pipeline. Other consumers can then subscribe to the same Pub/Sub topic that feeds BigQuery.
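In that fan-out pattern, clients only need to agree on one message encoding; the BigQuery pipeline and every other consumer subscribe independently. A sketch (the project, topic, and row fields below are hypothetical, and the publish call needs the google-cloud-pubsub library and a real topic):

```python
import json

def encode_event(row: dict) -> bytes:
    # Pub/Sub carries opaque bytes; compact, key-sorted JSON keeps the
    # payload identical and readable for every subscriber (the BigQuery
    # loader and any downstream dashboard consumer alike).
    return json.dumps(row, separators=(",", ":"), sort_keys=True).encode("utf-8")

# Hypothetical publish, commented out because it needs live GCP resources:
# from google.cloud import pubsub_v1
# publisher = pubsub_v1.PublisherClient()
# topic = publisher.topic_path("my-project", "events")
# publisher.publish(topic, encode_event({"user_id": 1, "score": 0.9}))
```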
Related
I am implementing an ETL job that migrates a non-partitioned BigQuery table to a partitioned one.
To do so I use the BigQuery Storage Read API, which creates a number of sessions to pull data from.
In order to route the BigQuery writes to the right partition, I use the file loads method.
Streaming inserts were not an option due to the 30-day partition limitation.
The Storage Write API seems limited in identifying the partition.
With the file load method, the data is first written to GCS.
The issue is that this takes too much time, and there is a risk of the sessions closing.
Behind the scenes, the file load method is a complex one with multiple steps: for example, writing to GCS and combining the entries into a file joined per destination/partition.
Based on the Dataflow processes, it seems that workers can execute workloads on different parts of the pipeline.
How can I avoid the risk of the sessions closing? Is there a way for my Dataflow workers to focus only on the critical part, which is the write to GCS, and once that is done, then focus on all the other aspects?
You can apply a Reshuffle right before the write to BigQuery. In Dataflow, that creates a checkpoint and a new stage in the job. The write to BigQuery then starts only once all steps before the Reshuffle have finished, and in case of errors and retries the job backtracks to that checkpoint instead of to the source.
Please note that a Reshuffle implies shuffling the data, so there will be a performance impact.
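A sketch of where the Reshuffle sits, assuming apache_beam is installed; the query, transform, and table name are placeholders, and running it needs live GCP resources:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(query="SELECT ...")  # placeholder query
            | "Transform" >> beam.Map(lambda row: row)  # your partition-routing logic
            # Reshuffle breaks fusion: everything above must finish and
            # checkpoint before the write starts, and retries restart here.
            | "Checkpoint" >> beam.Reshuffle()
            | "Write" >> beam.io.WriteToBigQuery(
                "project:dataset.partitioned_table",
                method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            )
        )
```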
We have an analytical ETL that stores model results in a Snowflake table (two columns: user_id, score).
We need to use that information in our low-latency service, and Snowflake is not suitable for that latency.
I thought about storing the table in a Redis collection.
I would like some ideas on how to keep Redis in sync with the table.
Any other solution for the latency problem is also welcome.
Well, it depends on how frequently your Snowflake data is updated, what process updates it (Snowpipe, or some external tool that you can hook into), and what latency you are prepared to accept between the Snowflake data changing and Redis having the new values.
You could add a task to export the changes to S3, then have a Lambda watching the bucket/folder push the changes into Redis.
You could have the tool that loads the changes also pull them out and push them into Redis (we did a form of this).
You could have something poll the Snowflake data and push changes into Redis, which seems the worst idea: polling the main table sounds bad. But you could also use a multi-table insert/merge so that when you update the main table you also insert into a changes table or a stream, and have your Redis sync read from that.
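Whichever change-capture route you pick, the push side can stay small. A sketch, assuming the redis-py client and a `score:` key prefix (both my own choices, not anything standard):

```python
def rows_to_redis_mapping(rows):
    # Each (user_id, score) row becomes one Redis string key, so the
    # low-latency service does a single GET per lookup; one MSET per
    # batch of changes keeps the sync itself cheap.
    return {f"score:{user_id}": str(score) for user_id, score in rows}

# Hypothetical push, commented out because it needs a reachable server:
# import redis
# r = redis.Redis()
# r.mset(rows_to_redis_mapping([(42, 0.87), (43, 0.12)]))
```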
I expect to have thousands of sensors sending telemetry data at 10 FPS, with around 1 KB of binary data per frame, using IoT Core, meaning I'll get it via Pub/Sub. I'd like to get that data into BigQuery; no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to avoid it and go fully serverless.
The question is: what's my best alternative?
I've thought about a Cloud Run service running an Express app to accept the data from Pub/Sub, using a global variable to accumulate around 500 rows in RAM, then dumping them with BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain anything from accumulation, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method accepts either a single row or an array of rows, so accumulated rows can be sent in one request.
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently supports only the Java, Python, and Go (in preview) client libraries.
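The accumulation idea itself is simple; here it is sketched in Python rather than Node, with `flush_fn` standing in for whatever client call you use (for example table.insert(rows) in the Node library). The 500-row threshold is the asker's number, not a recommendation:

```python
class RowBuffer:
    """Accumulate rows and hand them to flush_fn in batches."""

    def __init__(self, flush_fn, max_rows=500):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        # One flush == one streaming-insert request. Remember to flush on
        # shutdown too: rows held only in RAM die with the instance.
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []
```

One caveat for the Cloud Run idea specifically: instances can be stopped at any time, so rows buffered only in a global variable can be lost; weigh that against the per-request overhead you save.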
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on the performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will reflect either all of the data or none of it.
Queries never scan partial data.
I have the following data pipeline:
A process writes messages to Kafka
A Spark Structured Streaming application listens for new Kafka messages and writes them as-is to HDFS
A batch Hive job runs on an hourly basis, reads the newly ingested messages from HDFS, and populates some tables via some moderately complex INSERT INTO statements (I do not have materialized views available). EDIT: Essentially, after my Hive job I have as a result Table1 storing the raw data, then Table2 = fun1(Table1), then Table3 = fun2(Table2), then Table4 = join(Table2, Table3), etc., where fun is a selection or an aggregation.
A Tableau dashboard visualizes the data I wrote.
As you can see, step 3 makes my pipeline not real time.
What can you suggest in order to make my pipeline fully real-time? EDIT: I'd like Table1, ..., TableN to be updated in real time!
Using Hive with Spark Streaming is not recommended at all: the purpose of Spark Streaming is low latency, while Hive introduces the highest latency possible (OLAP), since behind the scenes it executes an MR/Tez job (depending on hive.execution.engine).
Recommendation: use Spark Streaming with a low-latency database such as HBase or Phoenix.
Solution: develop a Spark Streaming job with Kafka as the source, and use a custom sink to write the data into HBase/Phoenix.
Introducing HDFS obviously isn't real-time. MemSQL or Druid/Imply offer much more real-time ingestion from Kafka.
You need historical data to perform roll-ups and aggregations. Tableau may cache datasets, but it doesn't store anything persistently itself. You therefore need some storage, and you've chosen to use HDFS rather than a database.
Note: Hive / Presto can read directly from Kafka. Therefore you don't really even need Spark.
If you want to do rolling aggregates from Kafka and make them queryable, KSQL could be used instead, or you can write your own Kafka Streams solution.
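To make the rolling-aggregate idea concrete, here is the tumbling-window grouping emulated in plain Python; a real deployment would express this as a KSQL `WINDOW TUMBLING` query or a Kafka Streams topology rather than hand-rolled code:

```python
from collections import defaultdict

def tumbling_counts(events, window_s=3600):
    # events: iterable of (epoch_seconds, key) pairs. Each event lands in
    # the window starting at ts - ts % window_s, mirroring an hourly
    # GROUP BY ... WINDOW TUMBLING (SIZE 1 HOUR) count.
    counts = defaultdict(int)
    for ts, key in events:
        counts[(ts - ts % window_s, key)] += 1
    return dict(counts)
```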
I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table, ~80 GB over ~900M rows, that I write back to BigQuery. I'd like to make this dataset available for an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle the "switchover" between uploads, since Spanner doesn't support table renaming.
Is there a way to perform this sort of bulk loading programmatically? We already use Apache Airflow internally for similar data-processing and transfer tasks, so if it's possible to handle it there, that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
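A small helper for the Spanner side of such a pipeline; the column list and table names are hypothetical, and the Dataflow read/write calls are commented out since they need live GCP resources:

```python
COLUMNS = ("user_id", "score")  # hypothetical Spanner table schema

def row_to_values(row: dict, columns=COLUMNS):
    # Spanner inserts take values ordered to match the column list, so
    # each BigQuery result row (a dict) is flattened accordingly.
    return tuple(row[c] for c in columns)

# Hypothetical Beam pipeline skeleton (requires apache-beam[gcp]):
# rows = p | beam.io.ReadFromBigQuery(table="project:dataset.scores")
# rows | beam.Map(row_to_values) | <write to Cloud Spanner>
```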