Flink batching Sink - batch-processing

I'm trying to use Flink in both a streaming and a batch way to add a lot of data into Accumulo (a few million records a minute). I want to batch up records before sending them to Accumulo.
I ingest data either from a directory or via Kafka, convert it using a flatMap and then pass it to a RichSinkFunction, which adds the data to a collection.
With the streaming data, batching seems OK: I can add the records to a fixed-size collection which gets sent to Accumulo once the batch threshold is reached. But for the batch data, which is finite, I'm struggling to find a good approach to batching, as it would require a flush timeout in case no further data arrives within a specified time.
There doesn't seem to be an Accumulo connector, unlike for Elasticsearch or other alternative sinks.
I thought about using a ProcessFunction with a trigger for batch size and time interval, but this requires a keyed window. I didn't want to go down the keyed route as the data looks to be very skewed, in that some keys would have a tonne of records and some would have very few. If I don't use a windowed approach, then I understand that the operator won't be parallel. I was hoping to batch lazily, so that each sink instance only cares about a record count or a time interval.
Has anybody got any pointers on how best to address this?

You can access timers in a sink by implementing ProcessingTimeCallback. For an example, look at the BucketingSink -- its open and onProcessingTime methods should get you started.
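Below is a rough, untested sketch of that pattern, assuming a 1.x Flink (the exact package of ProcessingTimeService/ProcessingTimeCallback varies between versions): a RichSinkFunction that buffers records and flushes either when the batch is full, when a processing-time timer fires, or when the bounded input ends. The Accumulo write itself is a placeholder.
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
import org.apache.flink.streaming.runtime.tasks.ProcessingTimeCallback;
import org.apache.flink.streaming.runtime.tasks.ProcessingTimeService;

public class BatchingAccumuloSink<T> extends RichSinkFunction<T> implements ProcessingTimeCallback {

    private final int maxBatchSize;          // flush when this many records are buffered
    private final long flushIntervalMillis;  // also flush at least this often

    private transient List<T> batch;
    private transient ProcessingTimeService timerService;

    public BatchingAccumuloSink(int maxBatchSize, long flushIntervalMillis) {
        this.maxBatchSize = maxBatchSize;
        this.flushIntervalMillis = flushIntervalMillis;
    }

    @Override
    public void open(Configuration parameters) {
        batch = new ArrayList<>(maxBatchSize);
        timerService = ((StreamingRuntimeContext) getRuntimeContext()).getProcessingTimeService();
        timerService.registerTimer(timerService.getCurrentProcessingTime() + flushIntervalMillis, this);
    }

    @Override
    public void invoke(T value) throws Exception {
        batch.add(value);
        if (batch.size() >= maxBatchSize) {
            flush();
        }
    }

    @Override
    public void onProcessingTime(long timestamp) throws Exception {
        flush();  // time-based flush, so a stalled source never leaves a partial batch hanging
        timerService.registerTimer(timerService.getCurrentProcessingTime() + flushIntervalMillis, this);
    }

    @Override
    public void close() throws Exception {
        flush();  // final flush when a bounded (batch) input ends
    }

    private void flush() {
        if (!batch.isEmpty()) {
            // writeToAccumulo(batch);  // placeholder: hand the batch to an Accumulo BatchWriter here
            batch.clear();
        }
    }
}
Each parallel sink instance keeps its own buffer and timer, so no keying or windowing is needed.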

Related

How to batch streaming inserts to BigQuery from a Beam job

I'm writing to BigQuery in a Beam job from an unbounded source. I'm using STREAMING_INSERTS as the method. I was looking at how to throttle the rows to BigQuery based on the recommendations in
https://cloud.google.com/bigquery/quotas#streaming_inserts
The BigQueryIO.Write API doesn't provide a way to set the micro batches.
I was looking at using triggers, but I'm not sure whether BigQuery groups everything in a pane into a single request. I've set up the trigger as below:
Window.<Long>into(new GlobalWindows())
    .triggering(
        Repeatedly.forever(
            AfterFirst.of(
                AfterPane.elementCountAtLeast(5),
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(2)))))
    .discardingFiredPanes();
Q1. Does Beam support micro batches or does it create one request for each element in the PCollection?
Q2. Does the above trigger make sense? Even if I set the window/trigger, it could be sending one request for every element.
I don't know what you mean by micro-batch. The way I see it, BigQuery supports loading data either in batches or via streaming.
Basically, batch loads are subject to quotas and streaming loads are a bit more expensive.
Once you set the insertion method for your BigQueryIO, the documentation states:
Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.
Never tried it, but withTriggeringFrequency seems to be what you need here.
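For illustration, a micro-batched write with load jobs might look roughly like this (untested; the table name, schema and the rows PCollection are placeholders):
rows.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")
    .withJsonSchema(myTableSchema)
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    // batch the streamed elements into a load job every 2 minutes
    .withTriggeringFrequency(Duration.standardMinutes(2))
    .withNumFileShards(1)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));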

Dataflow: Can I write continuously/stream write to BigQuery with batch job?

I can't seem to find any documentation about this. I have an apache-beam pipeline that takes some information, formats it into TableRows and then writes to BigQuery.
[+] The problem:
The rows are not written to BigQuery until the Dataflow job finishes. If I have a Dataflow job that takes a long time, I'd like to be able to see the rows being inserted into BigQuery as it runs; can anybody point me in the right direction?
Thanks in advance
Since you're working in batch mode, data needs to be written at the same time into the same table. If you're working with partitions, all data belonging to a partition needs to be written at the same time. That's why the insertion is done last.
Please note that the WriteDisposition is very important when you work in batches, because you either append data or truncate it. But does this distinction make sense for streaming pipelines?
In Java, you can specify the method of insertion with the following call:
.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
I've not tested it, but I believe it should work as expected. Also note that streaming inserts to BigQuery are not free of charge.
Depending on how complex your initial transform+load operation is, you could just use the BigQuery driver to do streaming inserts into the table from your own worker pool, rather than loading it in via a Dataflow job explicitly.
Alternatively, you could do smaller batches:
N Independent jobs each loading TIME_PERIOD/N amounts of data

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible interval (15 minutes or 128 MB), and therefore I have 96 data files daily, but I want to aggregate all the data into a single daily data file for the fastest performance when reading the data later in Spark (EMR).
I created a solution where a Lambda function gets invoked when Firehose streams a new file into S3. The function then reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file (if it already exists with previous daily data, otherwise it creates a new one) from the destination bucket, decodes both response bodies to strings, concatenates them and writes the result to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
The problem is that once the aggregated file reaches 150+ MB, the Lambda function hits its ~1,500 MB memory limit when reading the two files, and then fails.
Currently I have a minimal amount of data, a few hundred MB per day, but this amount will be growing exponentially in the future. It seems weird to me that Lambda has such low limits and that they are already reached with such small files.
What are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or by a scheduled job, for example run daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…
Instead, unless it's impossible in your situation, make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
There's a tiny amount of latency increase from establishing TCP connections and authentication.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
Of course, if it is not possible to keep Spark data ready (but not queryable) for a reasonable amount of money, this may not be an option. It may also be that injecting small chunks of data is extremely time consuming, but that seems unlikely for a production-ready system.
If you really need to chunk the data into daily dumps you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
You may create a Lambda function that is invoked only once a day using Scheduled Events, and in your Lambda function you should use Upload Part - Copy, which does not need to download your files to the Lambda function. There is already an example of this in this thread.
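For illustration, a rough, untested sketch of the Upload Part - Copy approach with the AWS SDK for Java v1 (bucket names, keys and the list of source files are placeholders; note that S3 requires every part except the last to be at least 5 MB, so very small source files may need to be grouped first):
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

public class DailyConcatenator {
    public static void concatenate(String srcBucket, List<String> sourceKeys,
                                   String destBucket, String destKey) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Start a multipart upload for the aggregated daily object.
        InitiateMultipartUploadResult init = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(destBucket, destKey));

        // Copy each Firehose file server-side as one part; nothing is downloaded into Lambda.
        List<PartETag> partETags = new ArrayList<>();
        int partNumber = 1;
        for (String sourceKey : sourceKeys) {
            CopyPartResult result = s3.copyPart(new CopyPartRequest()
                    .withSourceBucketName(srcBucket)
                    .withSourceKey(sourceKey)
                    .withDestinationBucketName(destBucket)
                    .withDestinationKey(destKey)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(partNumber));
            partETags.add(new PartETag(partNumber, result.getETag()));
            partNumber++;
        }

        // Stitch the parts together into the single daily file.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                destBucket, destKey, init.getUploadId(), partETags));
    }
}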

Inserting into BigQuery via load jobs (not streaming)

I'm looking to use Dataflow to load data into BigQuery tables using BQ load jobs - not streaming (streaming would cost too much for our use case). I see that the Dataflow SDK has built-in support for inserting data via BQ streaming, but I wasn't able to find anything in the Dataflow SDK that supports load jobs out of the box.
Some questions:
1) Does the Dataflow SDK have OOTB support for BigQuery load job inserts? If not, is it planned?
2) If I need to roll my own, what are some good approaches?
If I have to roll my own, performing a BQ load job using Google Cloud Storage is a multi step process - write the file to GCS, submit the load job via the BQ API, and (optionally) check the status until the job has completed (or failed). I'd hope I could use the existing TextIO.write() functionality to write to GCS, but I'm not sure how I'd compose that step with the subsequent call to the BQ API to submit the load job (and optionally the subsequent calls to check the status of the job until it's complete).
Also, I'd be using Dataflow in streaming mode, with windows of 60 seconds - so I'd want to do the load job every 60 seconds as well.
Suggestions?
I'm not sure which version of Apache Beam you are using, but it's now possible to use a micro-batching tactic with a streaming pipeline. If you decide to go one way or the other, you can use something like this:
.apply("Saving in batches", BigQueryIO.writeTableRows()
.to(destinationTable(options))
.withMethod(Method.FILE_LOADS)
.withJsonSchema(myTableSchema)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withTriggeringFrequency(Duration.standardMinutes(2))
.withNumFileShards(1);
.optimizedWrites());
Things to keep in mind
There are two different methods: FILE_LOADS and STREAMING_INSERTS. If you use the first one, you need to include withTriggeringFrequency and withNumFileShards. From my experience it's better to express the triggering frequency in minutes, and the number will depend on your throughput; if you receive quite a lot of data, try to keep it small, as I have seen "stuck errors" when you increase it too much. The number of shards mostly affects your GCS billing: if you add too many shards, it will create more files per table per x amount of minutes.
If your input data size is not that big, streaming inserts can work really well and the cost shouldn't be a big deal. In that scenario you can use the STREAMING_INSERTS method and remove withTriggeringFrequency and withNumFileShards. You can also add withFailedInsertRetryPolicy, like InsertRetryPolicy.retryTransientErrors(), so no rows are lost (keep in mind that idempotency is not guaranteed with STREAMING_INSERTS, so duplication is possible); a sketch of this variant follows these notes.
You can check your jobs in BigQuery and validate that everything is working. Keep in mind BigQuery's job quotas (I think it's 1,000 jobs per table) when you are trying to define the triggering frequency and number of shards.
Note: You can always read this article about efficient aggregation pipelines https://cloud.google.com/blog/products/data-analytics/how-to-efficiently-process-both-real-time-and-aggregate-data-with-dataflow
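For the streaming-insert variant described in the notes above, a minimal, untested sketch (reusing the placeholder destinationTable(options) and myTableSchema from the example) could be:
.apply("Saving via streaming inserts", BigQueryIO.writeTableRows()
    .to(destinationTable(options))
    .withJsonSchema(myTableSchema)
    .withMethod(Method.STREAMING_INSERTS)
    // InsertRetryPolicy lives in org.apache.beam.sdk.io.gcp.bigquery
    .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withExtendedErrorInfo());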
BigQueryIO.write() always uses BigQuery load jobs when the input PCollection is bounded. If you'd like it to also use them if it is unbounded, specify .withMethod(FILE_LOADS).withTriggeringFrequency(...).

Trident or Storm topology that writes on Redis

I have a problem with a topology. I'll try to explain the workflow...
I have a source that emits ~500k tuples every 2 minutes; these tuples must be read by a spout and processed exactly once as a single object (I think a batch in Trident).
After that, a bolt/function/whatever else must append a timestamp and save the tuples into Redis.
I tried to implement a Trident topology with a Function that saves all the tuples into Redis using a Jedis object (a Redis library for Java) inside this Function class, but when I deploy I receive a NotSerializableException on this object.
My question is: how can I implement a Function that writes this batch of tuples to Redis? Reading on the web I cannot find any example that writes from a function to Redis, or any example using a State object in Trident (probably I have to use it...).
My simple topology:
TridentTopology topology = new TridentTopology();
topology.newStream("myStream", new mySpout())
        .each(new Fields("field1", "field2"), new myFunction("redis_ip", "6379"));
Thanks in advance
(replying about state in general since the specific issue related to Redis seems solved in other comments)
The concepts of DB updates in Storm become clearer when we keep in mind that Storm reads from distributed (or "partitioned") data sources (through Storm "spouts"), processes streams of data on many nodes in parallel, optionally performs calculations on those streams of data (called "aggregations") and saves the results to distributed data stores (called "states"). Aggregation is a very broad term that just means "computing stuff": for example, computing the minimum value over a stream is seen in Storm as an aggregation of the previously known minimum value with the new values currently processed in some node of the cluster.
With the concepts of aggregations and partition in mind, we can have a look at the two main primitives in Storm that allow to save something in a state: partitionPersist and persistentAggregate, the first one runs at the level of each cluster node without coordination with the other partitions and feels a bit like talking to the DB through a DAO, while the second one involves "repartitioning" the tuples (i.e. re-distributing them across the cluster, typically along some groupby logic), doing some calculation (an "aggregate") before reading/saving something to DB and it feels a bit like talking to a HashMap rather than a DB (Storm calls the DB a "MapState" in that case, or a "Snapshot" if there's only one key in the map).
One more thing to keep in mind is that the exactly-once semantic of Storm is not achieved by processing each tuple exactly once: this would be too brittle since there are potentially several read/write operations per tuple defined in our topology; we want to avoid two-phase commits for scalability reasons, and at large scale network partitions become more likely. Rather, Storm will typically continue replaying the tuples until it's sure they have been completely and successfully processed at least once. The important relationship of this to state updates is that Storm gives us a primitive (OpaqueMap) that allows idempotent state updates, so that those replays do not corrupt previously stored data. For example, if we are summing up the numbers [1,2,3,4,5], the resulting thing saved in the DB will always be 15, even if they are replayed and processed in the "sum" operation several times due to some transient failure. OpaqueMap has a slight impact on the format used to save data in the DB. Note that this replay and opaque logic is only present if we tell Storm to act like that, but we usually do.
If you're interested in reading more, I posted 2 blog articles here on the subject.
http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/
http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/
One last thing: as hinted by the replay stuff above, Storm is a very asynchronous mechanism by nature: we typically have some data producer that posts events into a queueing system (e.g. Kafka or 0MQ) and Storm reads from there. As a result, assigning a timestamp from within Storm, as suggested in the question, may or may not have the desired effect: this timestamp will reflect the "latest successful processing time", not the data ingestion time, and of course it will not be identical in the case of replayed tuples.
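To make the partitionPersist idea concrete, here is a rough, untested sketch of a non-transactional Redis-backed state (i.e. without the opaque replay guarantees described above). The Jedis client is built inside the StateFactory on the worker, which avoids the NotSerializableException; the key/value layout is just a placeholder, and the imports assume Storm 1.x package names (org.apache.storm.*), while older releases use storm.trident.* and backtype.storm.*.
import java.util.List;
import java.util.Map;
import org.apache.storm.task.IMetricsContext;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.state.BaseStateUpdater;
import org.apache.storm.trident.state.State;
import org.apache.storm.trident.state.StateFactory;
import org.apache.storm.trident.tuple.TridentTuple;
import redis.clients.jedis.Jedis;

public class RedisState implements State {

    private final Jedis jedis;

    public RedisState(String host, int port) {
        this.jedis = new Jedis(host, port);  // created on the worker, never serialized with the topology
    }

    public void write(List<TridentTuple> tuples) {
        long timestamp = System.currentTimeMillis();
        for (TridentTuple t : tuples) {
            // placeholder layout: append the timestamp and store field1 -> field2
            jedis.hset("records:" + timestamp, t.getString(0), t.getString(1));
        }
    }

    @Override public void beginCommit(Long txid) { }
    @Override public void commit(Long txid) { }

    public static class Factory implements StateFactory {
        private final String host;
        private final int port;

        public Factory(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public State makeState(Map conf, IMetricsContext metrics, int partitionIndex, int numPartitions) {
            return new RedisState(host, port);
        }
    }

    public static class Updater extends BaseStateUpdater<RedisState> {
        @Override
        public void updateState(RedisState state, List<TridentTuple> tuples, TridentCollector collector) {
            state.write(tuples);  // called once per Trident batch, so the whole batch lands together
        }
    }
}
It would then be wired in with something like:
topology.newStream("myStream", new mySpout())
        .partitionPersist(new RedisState.Factory("redis_ip", 6379),
                          new Fields("field1", "field2"),
                          new RedisState.Updater());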
Have you tried trident-state for Redis? There is code on GitHub that does it already:
https://github.com/kstyrc/trident-redis
Let me know if this answers your question or not.