I'm using Java + Apache Beam SDK for Java 2.0.1-SNAPSHOT
Scenario:
Read Data from BigQuery(BQ) -> ETL Process in Dataflow -> Write Data in BQ tables
The problem is that the pipeline is trying to process all data before performing the insertion in BQ.
Is there a way to execute stream inserts in this case? I've already tried to set a timestamp to the elements when extracting from BQ, but it didn't work.
Or is it possible to set the BatchLoads so that it inserts bulks of data time to time?
I would take a look at this link to Googles Solution. That being said, BigQuery sounds like it is being treated as a bounded source, but that shouldn't be a problem sinking data back into dataflow, see here.
Related
I am looking for the best option to access data from Spark data pipelines. The scenario is as follows:
I am reading data from Kafka topics, creating a streaming dataframe which is then cleaned and being printed on the console. I need this data to be integrated with existing Python scripts which is doing all the data operations by Pandas. I have considered the following options:
Write streaming data to local memory (e.g. Hive Tables).
Use Spark Structured Streaming ForeachBatch Sink.
I want to mention that the data is to be read after a certain interval and there will be a real time data dashboard in the future with this data.
Please advise which will be the best approach to handle this scenario. Apologies if the question sounds too basic. Thanks in advance.
If you save data on Hive each time before accessing the newly streamed data through python scripts, the newly added hive partitions are required to be refreshed each time as well.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Here are some disadvantages of having a hive for mentioned real-time scenarios.
https://www.quora.com/What-are-some-disadvantages-of-Apache-Hive#
Whereas, Using Spark Structured Streaming looks a better choice for the near-real-time experience.
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programatically? We already are using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it in there that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.
I have a use case
We have java framework to parse realtime data from Kinesis to Hive table in every half an hour.
I need to access this hive table and do some processing near realtime. An hour delay is fine, as I dont have permission to access Kinesis stream.
Once processing is done in spark (pyspark preferably), I have to create a new kinesys stream and push the data.
I will then use Splunk and pull it near realtime.
Question is, any one has done spark streaming from hive using python ? I have to do a POC and then the actual work.
Any help will be highly appreciated.
Thanks in advance!!
There are 2 ways to go ahead on this:
Use spark-streaming to obtain messages drectly from Kinesis. That will give you something that is real time.
Once the file drops into your staging area ( either your hive ware-house OR your some HDFS location ), you can pick it up for processing using spark-streaming for files.
Do let us know which approch worked best for you.
I have recently started with Apache beam. I am sure I am missing something here. I have a requirement to load from a very huge database to bigquery. These tables are huge. I have written sample beam jobs to load minimal rows from simple tables.
How would I able to load n number of rows from tables using JDBCIO? Is there anyway that i can load these data in batches as we do in conventional data migration jobs.?
Can I do batch read from a database and write in batches to bigquery?
Also i have seen that, the suggested approach to load the data to bigquery is by adding the files to the data store buckets. But, in automated environment, the requirement is to write it as a dataflow job to load from db and write it to bigquery. What should my design approach to solve this issue using apache beam?
Please help.!
It looks[1] like BigQueryIO will write batches of data if it comes from a bounded PCollection (otherwise it uses streaming inserts). It also appears to bound the size of each file and batch, so I don't think you'll need to do any manual batching.
I'd just read from your database via JDBCIO, transform it if needed, and write it to BigQueryIO.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but now it's possible to use a micro-batching tactic using a Stream Pipeline. If you decide one way or another you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are 2 different methods: FILE_LOADS and STREAMING_INSERT, if you use the first one you need to include the withTriggeringFrequency and withNumFileShards. For the first one, from my experience, is better to use minutes and the number will depend on the amount of throughput data. If you receive quite a lot try to keep it small, I have seen "stuck errors" when you increase it too much. The shards can affect mostly your GCS billing, if you add to much shards it will create more files per table per x amount of minutes.
If your input data size is not so big the streaming insert can work really well and the cost shouldn't be a big deal. In that scenario you can use STREAMING_INSERT method and remove the withTriggeringFrequency and withNumFileShards. Also, you can add withFailedInsertRetryPolicy like InsertRetryPolicy.retryTransientErrors() so no rows are being lost (keep in mind that idempotency is not guaranteed with STREAM_INSERTS, so duplication is possible)
You can check your Jobs in BigQuery and validate that everything is working! Keep in mind the policies for jobs with BigQuery (I think is 1000 jobs per table) when you are trying to define triggering frequency and shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).