I have the following data pipeline:
A process writes messages to Kafka
A Spark structured streaming application is listening for new Kafka messages and writes them as they are to HDFS
A batch Hive job runs on a hourly basis and reads the newly ingested messages from HDFS and via some medium complex INSERT INTO statements populates some tables (I do not have materialized views available). EDIT: Essentially after my Hive job I have as result Table1 storing the raw data, then another table Table2 = fun1(Table1), then Table3 = fun2(Table2), then Table4 = join(Table2, Table3), etc. Fun is a selection or an aggregation.
A Tableau dashboard visualizes the data I wrote.
As you can see, step 3 makes my pipeline not real time.
What can you suggest me in order to make my pipeline fully real time? EDIT: I'd like to have Table1, ... TableN updated on real time!
Using Hive with Spark Streaming is not recommended at all. Since the purpose of Spark streaming is to have low latency. Hive introduces the highest latency possible (OLAP) since at backend it executes MR/Tez job (depends on hive.execution.engine).
Recommendation: Use spark streaming with the low latency DB like HBASE, Phoenix.
Solution: Develop a Spark streaming job with Kafka as a source and use the custom sink to write the data into Hbase/Phoenix.
Introducing HDFS obviously isn't real time. MemSQL or Druid/Imply offer much more real time ingestion from Kafka
You need historical data to perform roll ups and aggregations. Tableau may cache datasets, but it doesn't store persistently itself. You therefore need some storage, and you've chosen to use HDFS rather than a database.
Note: Hive / Presto can read directly from Kafka. Therefore you don't really even need Spark.
If you want to do rolling aggregates from Kafka and make it queryable, KSQL could be used instead, or you can write your own Kafka Streams solution
Related
I'm expected to have thousands of sensors sending telemetry data at 10FPS with around 1KB of binary data per frame, using IOT Core, meaning I'll get it via PubSub. I'd like to get that data to BigQuery, and no processing is needed.
As Dataflow don't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to try to avoid it and go full serverless.
Question is, what's my best alternative?
I've thought about Cloud Run service running an express app to accept the data from PubSub, and using global variable to accumulate around 500 rows in ram, then dump it using BigQuery's insert() method (NodeJS client).
How reasonable is that? Will I gain something from accumulation, or should I just insert to bigquery every single incoming row?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method streams one row at a time irrespective of accumulation of rows.
For new projects, the BigQuery Storage Write API is recommended as it is cheaper and has an enriched feature set than the legacy API does. The BigQuery Storage Write API only supports Java, Python and Go(in preview) client libraries currently.
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user,
BigQuery does not make guarantees on performance and available
capacity of this shared pool. This is governed by the fair scheduler
allocating resources among load jobs that may be competing with loads
from other users or projects. Quotas for load jobs are in place to
minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data
ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data .
Queries never scan partial data.
I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics, creating a streaming dataframe which is then cleaned and being printed on the console. I need this data to be integrated with existing Python scripts which is doing all the data operations by Pandas. I have considered the following options:
Write streaming data to local memory (e.g. Hive Tables).
Use Spark Structured Streaming ForeachBatch Sink.
I want to mention that the data is to be read after a certain interval and there will be a real time data dashboard in the future with this data.
Please advise which will be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.
If you save data on Hive each time before accessing the newly streamed data through python scripts, the newly added hive partitions are required to be refreshed each time as well.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Here are some disadvantages of having a hive for mentioned real-time scenarios.
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
Whereas, Using Spark Structured Streaming looks a better choice for the near-real-time experience.
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
As data arrives in my BigQuery, I want to send some of it to another database--a datamart or an operational database that serves real-time dashboards.
How do I do this? Polling the enormous BQ table is too expensive and slow, and I want updates to be frequent--close to real-time.
Strangely, I find little info about streaming from BigQuery.
Polling the enormous BQ table is too expensive and slow
Make sure to partition your data by day, and if you have too much data, cluster it by hour.
There isn't a natural way to stream data out of BigQuery as it arrives, but if you partition and cluster your data appropriately, then scans will be way less costly than doing it from a naive table.
For realtime: Would it be an option to split data to BigQuery and other tools from the pipeline, instead of after it being stored in BQ?
To the comment
"I would rather not alter each of clients to write to two targets, BQ plus PubSub"
Have each client write only to Pub/Sub. Then click-to-deploy a pipeline that writes to BigQuery from Pub/Sub - for the most reliable pipeline. Then other consumers can subscribe to the same Pub/Sub topic that feeds BigQuery.
I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programatically? We already are using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it in there that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
I have recently started with Apache beam. I am sure I am missing something here. I have a requirement to load from a very huge database to bigquery. These tables are huge. I have written sample beam jobs to load minimal rows from simple tables.
How would I able to load n number of rows from tables using JDBCIO? Is there anyway that i can load these data in batches as we do in conventional data migration jobs.?
Can I do batch read from a database and write in batches to bigquery?
Also i have seen that, the suggested approach to load the data to bigquery is by adding the files to the data store buckets. But, in automated environment, the requirement is to write it as a dataflow job to load from db and write it to bigquery. What should my design approach to solve this issue using apache beam?
Please help.!
It looks[1] like BigQueryIO will write batches of data if it comes from a bounded PCollection (otherwise it uses streaming inserts). It also appears to bound the size of each file and batch, so I don't think you'll need to do any manual batching.
I'd just read from your database via JDBCIO, transform it if needed, and write it to BigQueryIO.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java