Handling bad messages using Kafka's Streams API - error-handling

I have a basic stream processing flow which looks like
master topic -> my processing in a mapper/filter -> output topics
and I am wondering about the best way to handle "bad messages". This could potentially be things like messages that I can't deserialize properly, or perhaps the processing/filtering logic fails in some unexpected way (I have no external dependencies so there should be no transient errors of that sort).
I was considering wrapping all my processing/filtering code in a try catch and if an exception was raised then routing to an "error topic". Then I can study the message and modify it or fix my code as appropriate and then replay it on to master. If I let any exceptions propagate, the stream seems to get jammed and no more messages are picked up.
Is this approach considered best practice?
Is there a convenient Kafka streams way to handle this? I don't think there is a concept of a DLQ...
What are the alternative ways to stop Kafka jamming on a "bad message"?
What alternative error handling approaches are there?
For completeness here is my code (pseudo-ish):
class Document {
// Fields
}
class AnalysedDocument {
Document document;
String rawValue;
Exception exception;
Analysis analysis;
// All being well
AnalysedDocument(Document document, Analysis analysis) {...}
// Analysis failed
AnalysedDocument(Document document, Exception exception) {...}
// Deserialisation failed
AnalysedDocument(String rawValue, Exception exception) {...}
}
KStreamBuilder builder = new KStreamBuilder();
KStream<String, AnalysedPolecatDocument> analysedDocumentStream = builder
.stream(Serdes.String(), Serdes.String(), "master")
.mapValues(new ValueMapper<String, AnalysedDocument>() {
#Override
public AnalysedDocument apply(String rawValue) {
Document document;
try {
// Deserialise
document = ...
} catch (Exception e) {
return new AnalysedDocument(rawValue, exception);
}
try {
// Perform analysis
Analysis analysis = ...
return new AnalysedDocument(document, analysis);
} catch (Exception e) {
return new AnalysedDocument(document, exception);
}
}
});
// Branch based on whether analysis mapping failed to produce errorStream and successStream
errorStream.to(Serdes.String(), customPojoSerde(), "error");
successStream.to(Serdes.String(), customPojoSerde(), "analysed");
KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
Any help greatly appreciated.

Right now, Kafka Streams offers only limited error handling capabilities. There is work in progress to simplify this. For now, your overall approach seems to be a good way to go.
One comment about handling de/serialization errors: handling those error manually, requires you to do de/serialization "manually". This means, you need to configure ByteArraySerdes for key and value for you input/output topic of your Streams app and add a map() that does the de/serialization (ie, KStream<byte[],byte[]> -> map() -> KStream<keyType,valueType> -- or the other way round if you also want to catch serialization exceptions). Otherwise, you cannot try-catch deserialization exceptions.
With your current approach, you "only" validate that the given string represents a valid document -- but it could be the case, that the message itself is corrupted and cannot be converted into a String in the source operator in the first place. Thus, you don't actually cover deserialization exception with you code. However, if you are sure a deserialization exception can never happen, you approach would be sufficient, too.
Update
This issues is tackled via KIP-161 and will be included in the next release 1.0.0. It allows you to register an callback via parameter default.deserialization.exception.handler. The handler will be invoked every time a exception occurs during deserialization and allows you to return an DeserializationResponse (CONTINUE -> drop the record an move on, or FAIL that is the default).
Update 2
With KIP-210 (will be part of in Kafka 1.1) it's also possible to handle errors on the producer side, similar to the consumer part, by registering a ProductionExceptionHandler via config default.production.exception.handler that can return CONTINUE.

Update Mar 23, 2018: Kafka 1.0 provides much better and easier handling for bad error messages ("poison pills") via KIP-161 than what I described below. See default.deserialization.exception.handler in the Kafka 1.0 docs.
This could potentially be things like messages that I can't deserialize properly [...]
Ok, my answer here focuses on the (de)serialization issues as this might be the most tricky scenario to handle for most users.
[...] or perhaps the processing/filtering logic fails in some unexpected way (I have no external dependencies so there should be no transient errors of that sort).
The same thinking (for deserialization) can also be applied to failures in the processing logic. Here, most people tend to gravitate towards option 2 below (minus the deserialization part), but YMMV.
I was considering wrapping all my processing/filtering code in a try catch and if an exception was raised then routing to an "error topic". Then I can study the message and modify it or fix my code as appropriate and then replay it on to master. If I let any exceptions propagate, the stream seems to get jammed and no more messages are picked up.
Is this approach considered best practice?
Yes, at the moment this is the way to go. Essentially, the two most common patterns are (1) skipping corrupted messages or (2) sending corrupted records to a quarantine topic aka a dead letter queue.
Is there a convenient Kafka streams way to handle this? I don't think there is a concept of a DLQ...
Yes, there is a way to handle this, including the use of a dead letter queue. However, it's (at least IMHO) not that convenient yet. If you have any feedback on how the API should allow you to handle this -- e.g. via a new or updated method, a configuration setting ("if serialization/deserialization fails send the problematic record to THIS quarantine topic") -- please let us know. :-)
What are the alternative ways to stop Kafka jamming on a "bad message"?
What alternative error handling approaches are there?
See my examples below.
FWIW, the Kafka community is also discussing the addition of a new CLI tool that allows you to skip over corrupted messages. However, as a user of the Kafka Streams API, I think ideally you want to handle such scenarios directly in your code, and fallback to CLI utilities only as a last resort.
Here are some patterns for the Kafka Streams DSL to handle corrupted records/messages aka "poison pills". This is taken from http://docs.confluent.io/current/streams/faq.html#handling-corrupted-records-and-deserialization-errors-poison-pill-messages
Option 1: Skip corrupted records with flatMap
This is arguably what most users would like to do.
We use flatMap because it allows you to output zero, one, or more output records per input record. In the case of a corrupted record we output nothing (zero records), thereby ignoring/skipping the corrupted record.
Benefit of this approach compared to the others ones listed here: We need to manually deserialize a record only once!
Drawback of this approach: flatMap "marks" the input stream for potential data re-partitioning, i.e. if you perform a key-based operation such as groupings (groupBy/groupByKey) or joins afterwards, your data will be re-partitioned behind the scenes. Since this might be a costly step we don't want that to happen unnecessarily. If you KNOW that the record keys are always valid OR that you don't need to operate on the keys (thus keeping them as "raw" keys in byte[] format), you can change from flatMap to flatMapValues, which will not result in data re-partitioning even if you join/group/aggregate the stream later.
Code example:
Serde<byte[]> bytesSerde = Serdes.ByteArray();
Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();
// Input topic, which might contain corrupted messages
KStream<byte[], byte[]> input = builder.stream(bytesSerde, bytesSerde, inputTopic);
// Note how the returned stream is of type KStream<String, Long>,
// rather than KStream<byte[], byte[]>.
KStream<String, Long> doubled = input.flatMap(
(k, v) -> {
try {
// Attempt deserialization
String key = stringSerde.deserializer().deserialize(inputTopic, k);
long value = longSerde.deserializer().deserialize(inputTopic, v);
// Ok, the record is valid (not corrupted). Let's take the
// opportunity to also process the record in some way so that
// we haven't paid the deserialization cost just for "poison pill"
// checking.
return Collections.singletonList(KeyValue.pair(key, 2 * value));
}
catch (SerializationException e) {
// log + ignore/skip the corrupted message
System.err.println("Could not deserialize record: " + e.getMessage());
}
return Collections.emptyList();
}
);
Option 2: dead letter queue with branch
Compared to option 1 (which ignores corrupted records) option 2 retains corrupted messages by filtering them out of the "main" input stream and writing them to a quarantine topic (think: dead letter queue). The drawback is that, for valid records, we must pay the manual deserialization cost twice.
KStream<byte[], byte[]> input = ...;
KStream<byte[], byte[]>[] partitioned = input.branch(
(k, v) -> {
boolean isValidRecord = false;
try {
stringSerde.deserializer().deserialize(inputTopic, k);
longSerde.deserializer().deserialize(inputTopic, v);
isValidRecord = true;
}
catch (SerializationException ignored) {}
return isValidRecord;
},
(k, v) -> true
);
// partitioned[0] is the KStream<byte[], byte[]> that contains
// only valid records. partitioned[1] contains only corrupted
// records and thus acts as a "dead letter queue".
KStream<String, Long> doubled = partitioned[0].map(
(key, value) -> KeyValue.pair(
// Must deserialize a second time unfortunately.
stringSerde.deserializer().deserialize(inputTopic, key),
2 * longSerde.deserializer().deserialize(inputTopic, value)));
// Don't forget to actually write the dead letter queue back to Kafka!
partitioned[1].to(Serdes.ByteArray(), Serdes.ByteArray(), "quarantine-topic");
Option 3: Skip corrupted records with filter
I only mention this for completeness. This option looks like a mix of options 1 and 2, but is worse than either of them. Compared to option 1, you must pay the manual deserialization cost for valid records twice (bad!). Compared to option 2, you lose the ability to retain corrupted records in a dead letter queue.
KStream<byte[], byte[]> validRecordsOnly = input.filter(
(k, v) -> {
boolean isValidRecord = false;
try {
bytesSerde.deserializer().deserialize(inputTopic, k);
longSerde.deserializer().deserialize(inputTopic, v);
isValidRecord = true;
}
catch (SerializationException e) {
// log + ignore/skip the corrupted message
System.err.println("Could not deserialize record: " + e.getMessage());
}
return isValidRecord;
}
);
KStream<String, Long> doubled = validRecordsOnly.map(
(key, value) -> KeyValue.pair(
// Must deserialize a second time unfortunately.
stringSerde.deserializer().deserialize(inputTopic, key),
2 * longSerde.deserializer().deserialize(inputTopic, value)));
Any help greatly appreciated.
I hope I could help. If yes, I'd appreciate your feedback on how we could improve the Kafka Streams API to handle failures/exceptions in a better/more convenient way than today. :-)

For the processing logic you could take this approach:
someKStream
.mapValues(inputValue -> {
// for each execution the below "return" could provide a different class than the previous run!
// e.g. "return isFailedProcessing ? failValue : successValue;"
// where failValue and successValue have no related classes
return someObject; // someObject class vary at runtime depending on your business
}) // here you'll have KStream<whateverKeyClass, Object> -> yes, Object for the value!
// you could have a different logic for choosing
// the target topic, below is just an example
.to((k, v, recordContext) -> v instanceof failValueClass ?
"dead-letter-topic" : "success-topic",
// you could completelly ignore the "Produced" part
// and rely on spring-boot properties only, e.g.
// spring.kafka.streams.properties.default.key.serde=yourKeySerde
// spring.kafka.streams.properties.default.value.serde=org.springframework.kafka.support.serializer.JsonSerde
Produced.with(yourKeySerde,
// JsonSerde could be an instance configured as you need
// (with type mappings or headers setting disabled, etc)
new JsonSerde<>()));
Your classes, though different and landing into different topics, will serialize as expected.
When not using to(), but instead one wants to continue with other processing, he could use branch() with splitting the logic based on the kafka-value class; the trick for branch() is to return KStream<keyClass, ?>[] in order to further allow one to cast to the appropriate class the individual array items.

If you want to send an exception (custom exception) to another topic (ERROR_TOPIC_NAME):
#Bean
public KStream<String, ?> kafkaStreamInput(StreamsBuilder kStreamBuilder) {
KStream<String, InputModel> input = kStreamBuilder.stream(INPUT_TOPIC_NAME);
return service.messageHandler(input);
}
public KStream<String, ?> messageHandler(KStream<String, InputModel> inputTopic) {
KStream<String, Object> output;
output = inputTopic.mapValues(v -> {
try {
//return InputModel
return normalMethod(v);
} catch (Exception e) {
//return ErrorModel
return errorHandler(e);
}
});
output.filter((k, v) -> (v instanceof ErrorModel)).to(KafkaStreamsConfig.ERROR_TOPIC_NAME);
output.filter((k, v) -> (v instanceof InputModel)).to(KafkaStreamsConfig.OUTPUT_TOPIC_NAME);
return output;
}
If you want to handle Kafka exceptions and skip it:
#Autowired
public ConsumerErrorHandler(
KafkaProducer<String, ErrorModel> dlqProducer) {
this.dlqProducer = dlqProducer;
}
#Bean
ConcurrentKafkaListenerContainerFactory<?, ?> kafkaListenerContainerFactory(
ConcurrentKafkaListenerContainerFactoryConfigurer configurer,
ObjectProvider<ConsumerFactory<Object, Object>> kafkaConsumerFactory) {
ConcurrentKafkaListenerContainerFactory<Object, Object> factory = new ConcurrentKafkaListenerContainerFactory<>();
configurer.configure(factory, kafkaConsumerFactory.getIfAvailable());
factory.setErrorHandler(((exception, data) -> {
ErrorModel errorModel = ErrorModel.builder().message()
.status("500").build();
assert data != null;
dlqProducer.send(new ProducerRecord<>(DLQ_TOPIC, data.key().toString(), errorModel));
}));
return factory;
}

All above answers although valid and useful, they are assuming that your streams topology is stateless. For example going back to the original example,
master topic -> my processing in a mapper/filter -> output topics
"my processing in a mapper/filter" should be stateless. I.e. Not re-partitioning (aka writing to a persistent re-partition topic) or doing a toTable() (aka writing to a changelog topic). If the processing fails further down the topology and you commit the transaction (by following any of the 3 option mention above - flatmap, branch or filter - then you have to cater for manually or programmatically eventually deleting that inconsistent state. That would mean writing extra custom code for automatic this.
I would personally expect Streams to also give you a LogAndSkip option for any unhandled runtime exception, not only for deserialization and production ones.
Has anyone any ideas on this?

I don't believe these examples work at all when working with Avro.
When the schema can't be resolved (i.e there is bad/non-avro message corrupting the topic, for example) there is no key or value to deserialize in the first place because by the time the DSL .branch() code is called, the exception has already been thrown (or handled).
Can anyone confirm if this i indeed the case? The very fluent approach you refer to here isn't possible when working with Avro?
KIP-161 does explain how to use a handler, however, it's much more fluent to see it as part of the topology.

Related

Spring Cloud Stream deserialization error handling for Batch processing

I have a question about handling deserialization exceptions in Spring Cloud Stream while processing batches (i.e. batch-mode: true).
Per the documentation here, https://docs.spring.io/spring-kafka/docs/2.5.12.RELEASE/reference/html/#error-handling-deserializer, (looking at the implementation of FailedFooProvider), it looks like this function should return a subclass of the original message.
Is the intent here that a list of both Foo's and BadFoo's will end up at the original #StreamListener method, and then it will be up to the code (i.e. me) to sort them out and handle separately? I suspect this is the case, as I've read that the automated DLQ sending isn't desirable for batch error handling, as it would resubmit the whole batch.
And if this is the case, what if there is more than one message type received by the app via different #StreamListener's, say Foo's and Bar's. What type should the value function return in that case? Below is the pseudo code to illustrate the second question?
#StreamListener
public void readFoos(List<Foo> foos) {
List<> badFoos = foos.stream()
.filter(f -> f instanceof BadFoo)
.map(f -> (BadFoo) f)
.collect(Collectors.toList());
// logic
}
#StreamListener
public void readBars(List<Bar> bars) {
// logic
}
// Updated to return Object and let apply() determine subclass
public class FailedFooProvider implements Function<FailedDeserializationInfo, Object> {
#Override
public Object apply(FailedDeserializationInfo info) {
if (info.getTopics().equals("foo-topic") {
return new BadFoo(info);
}
else if (info.getTopics().equals("bar-topic") {
return new BadBar(info);
}
}
}
Yes, the list will contain the function result for failed deserializations; the application needs to handle them.
The function needs to return the same type that would have been returned by a successful deserialization.
You can't use conditions with batch listeners. If the list has a mixture of Foos and Bars, they all go to the same listener.

Apache HTTPClient5 - How to Prevent Connection/Stream Refused

Problem Statement
Context
I'm a Software Engineer in Test running order permutations of Restaurant Menu Items to confirm that they succeed order placement w/ the POS
In short, this POSTs a JSON payload to an endpoint which then validates the order w/ a POS to define success/fail/other
Where POS, and therefore Transactions per Second (TPS), may vary, but each Back End uses the same core handling
This can be as high as ~22,000 permutations per item, in easily manageable JSON size, that need to be handled as quickly as possible
The Network can vary wildly depending upon the Restaurant, and/or Region, one is testing
E.g. where some have a much higher latency than others
Therefore, the HTTPClient should be able to intelligently negotiate the same content & endpoint regardless of this
Direct Problem
I'm using Apache's HTTP Client 5 w/ PoolingAsyncClientConnectionManager to execute both the GET for the Menu contents, and the POST to check if the order succeeds
This works out of the box, but sometimes loses connections w/ Stream Refused, specifically:
org.apache.hc.core5.http2.H2StreamResetException: Stream refused
No individual tuning seems to work across all network contexts w/ variable latency, that I can find
Following the stacktrace seems to indicate it is that the stream had closed already, therefore needs a way to keep it open or not execute an already-closed connection
if (connState == ConnectionHandshake.GRACEFUL_SHUTDOWN) {
throw new H2StreamResetException(H2Error.PROTOCOL_ERROR, "Stream refused");
}
Some Attempts to Fix Problem
Tried to use Search Engines to find answers but there are few hits for HTTPClient5
Tried to use official documentation but this is sparse
Changing max connections per route to a reduced number, shifting inactivity validations, or connection time to live
Where the inactivity checks may fix the POST, but stall the GET for some transactions
And that tuning for one region/restaurant may work for 1 then break for another, w/ only the Network as variable
PoolingAsyncClientConnectionManagerBuilder builder = PoolingAsyncClientConnectionManagerBuilder
.create()
.setTlsStrategy(getTlsStrategy())
.setMaxConnPerRoute(12)
.setMaxConnTotal(12)
.setValidateAfterInactivity(TimeValue.ofMilliseconds(1000))
.setConnectionTimeToLive(TimeValue.ofMinutes(2))
.build();
Shifting to a custom RequestConfig w/ different timeouts
private HttpClientContext getHttpClientContext() {
RequestConfig requestConfig = RequestConfig.custom()
.setConnectTimeout(Timeout.of(10, TimeUnit.SECONDS))
.setResponseTimeout(Timeout.of(10, TimeUnit.SECONDS))
.build();
HttpClientContext httpContext = HttpClientContext.create();
httpContext.setRequestConfig(requestConfig);
return httpContext;
}
Initial Code Segments for Analysis
(In addition to the above segments w/ change attempts)
Wrapper handling to init and get response
public SimpleHttpResponse getFullResponse(String url, PoolingAsyncClientConnectionManager manager, SimpleHttpRequest req) {
try (CloseableHttpAsyncClient httpclient = getHTTPClientInstance(manager)) {
httpclient.start();
CountDownLatch latch = new CountDownLatch(1);
long startTime = System.currentTimeMillis();
Future<SimpleHttpResponse> future = getHTTPResponse(url, httpclient, latch, startTime, req);
latch.await();
return future.get();
} catch (IOException | InterruptedException | ExecutionException e) {
e.printStackTrace();
return new SimpleHttpResponse(999, CommonUtils.getExceptionAsMap(e).toString());
}
}
With actual handler and probing code
private Future<SimpleHttpResponse> getHTTPResponse(String url, CloseableHttpAsyncClient httpclient, CountDownLatch latch, long startTime, SimpleHttpRequest req) {
return httpclient.execute(req, getHttpContext(), new FutureCallback<SimpleHttpResponse>() {
#Override
public void completed(SimpleHttpResponse response) {
latch.countDown();
logger.info("[{}][{}ms] - {}", response.getCode(), getTotalTime(startTime), url);
}
#Override
public void failed(Exception e) {
latch.countDown();
logger.error("[{}ms] - {} - {}", getTotalTime(startTime), url, e);
}
#Override
public void cancelled() {
latch.countDown();
logger.error("[{}ms] - request cancelled for {}", getTotalTime(startTime), url);
}
});
}
Direct Question
Is there a way to configure the client such that it can handle for these variances on its own without explicitly modifying the configuration for each endpoint context?
Fixed w/ Combination of the below to Assure Connection Live/Ready
(Or at least is stable)
Forcing HTTP 1
HttpAsyncClients.custom()
.setConnectionManager(manager)
.setRetryStrategy(getRetryStrategy())
.setVersionPolicy(HttpVersionPolicy.FORCE_HTTP_1)
.setConnectionManagerShared(true);
Setting Effective Headers for POST
Specifically the close header
req.setHeader("Connection", "close, TE");
Note: Inactivity check helps, but still sometimes gets refusals w/o this
Setting Inactivity Checks by Type
Set POSTs to validate immediately after inactivity
Note: Using 1000 for both caused a high drop rate for some systems
PoolingAsyncClientConnectionManagerBuilder
.create()
.setValidateAfterInactivity(TimeValue.ofMilliseconds(0))
Set GET to validate after 1s
PoolingAsyncClientConnectionManagerBuilder
.create()
.setValidateAfterInactivity(TimeValue.ofMilliseconds(1000))
Given the Error Context
Tracing the connection problem in stacktrace to AbstractH2StreamMultiplexer
Shows ConnectionHandshake.GRACEFUL_SHUTDOWN as triggering the stream refusal
if (connState == ConnectionHandshake.GRACEFUL_SHUTDOWN) {
throw new H2StreamResetException(H2Error.PROTOCOL_ERROR, "Stream refused");
}
Which corresponds to
connState = streamMap.isEmpty() ? ConnectionHandshake.SHUTDOWN : ConnectionHandshake.GRACEFUL_SHUTDOWN;
Reasoning
If I'm understanding correctly:
The connections were being un/intentionally closed
However, they were not being confirmed ready before executing again
Which caused it to fail because the stream was not viable
Therefore the fix works because (it seems)
Given Forcing HTTP1 allows for a single context to manage
Where HttpVersionPolicy NEGOTIATE/FORCE_HTTP_2 had greater or equivalent failures across the spectrum of regions/menus
And it assures that all connections are valid before use
And POSTs are always closed due to the close header, which is unavailable to HTTP2
Therefore
GET is checked for validity w/ reasonable periodicity
POST is checked every time, and since it is forcibly closed, it is re-acquired before execution
Which leaves no room for unexpected closures
And otherwise the potential that it was incorrectly switching to HTTP2
Will accept this until a better answer comes along, as this is stable but sub-optimal.

Reactive programming - running jobs in a cluster

I need to run some jobs in a cluster, only one at a time.
Because my team uses Hazelcast, I ended up with a solution based on
Hazelcast ILock implementation. For the purpose of the question, I am going to make a generalisation about it. Let's suppose we have the following interfaces (that could be easily implemented e.g. by Hazelcast or Reddison (Redis)):
public interface MyDistributedLock {
boolean lock();
void unlock();
boolean isLockedByCurrentThread();
}
public interface MyLockDistributedFactory {
MyDistributedLock getLock(String name);
}
And lock method waiting if lock cannot be acquired:
private Mono<Void> lock(String name, Publisher<?> publisher, MyLockDistributedFactory myLockFactory) {
// important to release lock on the same thread as
// it was aquired
Scheduler scheduler = Schedulers.newSingle(name.toLowerCase());
return Mono.defer(() -> Mono.just(myLockFactory.getLock(name)))
publishOn(scheduler)
.doOnNext(MyDistributedLock::lock)
.doOnNext(lock -> LOGGER.info("Process acquired lock for resource {}", name))
.flatMapMany(lock -> Flux.from(publisher))
.publishOn(scheduler)
.doFinally(signalType -> {
MyDistributedLock lock = myLockFactory.getLock(name);
if (signalType == SignalType.CANCEL) {
// cancel ignores publishOn
scheduler.schedule(() -> {
lock.unlock();
LOGGER.info("Process released lock for resource {} due to signal type {}", name, signalType);
});
} else if (lock.isLockedByCurrentThread()) {
lock.unlock();
LOGGER.info("Process released lock for resource {} due to signal type {}", name, signalType);
}
})
.then();
}
And example of some job
private Mono<Void> someJobRunEveryOneHourOnEveryNodeInCluster() {
MyLockDistributedFactory hazelcast = ...;
return lock("some-job", Flux.just(1,2), hazelcast)
.repeatWhen(afterOneHour());
}
I wonder whether this is a good approach of using Project reactor (and correct implementation) or it should be done in a different way. Please advice.
it is a correct approach when using Reactor, because you took care of offsetting the blocking portion into a dedicated Scheduler/Thread.
But I'd say mutually exclusive code like this is not a very good fit for reactive programming in general: you lose one of the key benefits of doing more with less threads, you risk blocking other parts of the application should you forget to publishOn a dedicated thread, etc...

Extraordinary uses of Exceptions vs OOP rules

Me and my friend have some arguement about Exceptions. He proposes to use Exception as some kind of transporter for response (we can't just return it). I'm saying its contradictory to the OOP rules, he say it's ok because application flow was changed and information was passed.
Can you help us settle the dispute?
function example() {
result = pdo.find();
if (result) {
e = new UniqueException();
e.setExistingItem(result);
throw new e;
}
}
try {
this.example();
} catch (UniqueException e) {
this.response(e.getExistingItem());
}
Using exceptions for application flow is a misleading practice. Anyone else (even you) maintaining that code will be puzzled because the function of the exception in your flow is totally different to the semantic of exceptions.
I imagine the reason you're doing this is because you want to return different results. For that, create a new class Result, that holds all information and react to it via an if-statement.

Apache Camel : GBs of data from database routed to JMS endpoint

I've done a few small projects in camel now but one thing I'm struggling to understand is how to deal with big data (that doesn't fit into memory) when consuming in camel routes.
I have a database containing a couple of GBs worth of data that I would like to route using camel. Obviously reading all data into memory isn't an option.
If I were doing this as a standalone app I would have code that paged through the data and send chunks to my JMS enpoint. I'd like to use camel as it provides a nice pattern. If I were consuming from a file I could use the streaming() call.
Also should I use camel-sql/camel-jdbc/camel-jpa or use a bean to read from my database.
Hope everyone is still with me. I'm more familiar with the Java DSL but would appreciate any help/suggestions people can provide.
Update : 2-MAY-2012
So I've had some time to play around with this and I think what I'm actually doing is abusing the concept of a Producer so that I can use it in a route.
public class MyCustomRouteBuilder extends RouteBuilder {
public void configure(){
from("timer:foo?period=60s").to("mycustomcomponent:TEST");
from("direct:msg").process(new Processor() {
public void process(Exchange ex) throws Exception{
System.out.println("Receiving value" : + ex.getIn().getBody() );
}
}
}
}
My producer looks something like the following. For clarity I've not included the CustomEndpoint or CustomComponent as it just seems to be a thin wrapper.
public class MyCustomProducer extends DefaultProducer{
Endpoint e;
CamelContext c;
public MyCustomProducer(Endpoint epoint){
super(endpoint)
this.e = epoint;
this.c = e.getCamelContext();
}
public void process(Exchange ex) throws Exceptions{
Endpoint directEndpoint = c.getEndpoint("direct:msg");
ProducerTemplate t = new DefaultProducerTemplate(c);
// Simulate streaming operation / chunking of BIG data.
for (int i=0; i <20 ; i++){
t.start();
String s ="Value " + i ;
t.sendBody(directEndpoint, value)
t.stop();
}
}
}
Firstly the above doesn't seem very clean. It seems like the cleanest way to perform this would be to populate a jms queue (in place of direct:msg) via a scheduled quartz job that my camel route then consumes so that I can have more flexibility over the message size received within my camel pipelines. However I quite liked the semantics of setting up time based activations as part of the Route.
Does anyone have any thoughts on the best way to do this.
In my understanding, all you need to do is:
from("jpa:SomeEntity" +
"?consumer.query=select e from SomeEntity e where e.processed = false" +
"&maximumResults=150" +
"&consumeDelete=false")
.to("jms:queue:entities");
maximumResults defines a limit of how many entities you get per query.
When you finish the processing of an entity instance, you need to set e.processed = true; and persist() it, so that the entity won't be processed again.
One way to do that is with the #Consumed annotation:
class SomeEntity {
#Consumed
public void markAsProcessed() {
setProcessed(true);
}
}
Another thing, you need to be careful with is how you serialize the entity before sending it to the queue. You might need to use an enricher between the from and to.