Kafka streams: groupByKey and reduce not triggering action exactly once when error occurs in stream - kotlin

I have a simple Kafka Streams scenario where I do a groupByKey, then a reduce, and then an action. There could be duplicate events in the source topic, hence the groupByKey and reduce.
The action could error and in that case, I need the streams app to reprocess that event. In the example below I'm always throwing an error to demonstrate the point.
It is very important that the action happens exactly once: never more than once, but at least once.
The problem I'm finding is that when the streams app reprocesses the event, the reduce function is called and, because it returns null, the action is not invoked again.
As only one event is produced to the source topic TOPIC_NAME I would expect the reduce to not have any values and skip down to the mapValues.
val topologyBuilder = StreamsBuilder()
topologyBuilder.stream(
    TOPIC_NAME,
    Consumed.with(Serdes.String(), EventSerde())
)
    .groupByKey(Grouped.with(Serdes.String(), EventSerde()))
    // The reducer only runs when the key already has a value in the state store, i.e. for duplicates
    .reduce { current, _ ->
        println("reduce hit")
        null
    }
    .mapValues { _, v ->
        println("Id: ${v.correlationId}")
        throw Exception("simulate error")
    }
To cause the issue I run the streams app twice. This is the output:
First run
Id: 90e6aefb-8763-4861-8d82-1304a6b5654e
11:10:52.320 [test-app-dcea4eb1-a58f-4a30-905f-46dad446b31e-StreamThread-1] ERROR org.apache.kafka.streams.KafkaStreams - stream-client [test-app-dcea4eb1-a58f-4a30-905f-46dad446b31e] All stream threads have died. The instance will be in error state and should be closed.
Second run
reduce hit
As you can see the .mapValues doesn't get called on the second run even though it errored on the first run causing the streams app to reprocess the same event again.
Is it possible to have a streams app reprocess an event through a reduce step as if it had never seen the event before? Or is there a better approach to what I'm doing?

I was missing a property setting for the streams app.
props["processing.guarantee"] = "exactly_once"
With this set, any state created from the point of picking up the event is rolled back if an exception is thrown and the streams app crashes.
The problem was that the streams app would pick up the event again to reprocess it, but the reduce step had state that had already been persisted. Enabling exactly_once ensures that the reducer state is rolled back as well.
It now successfully reprocesses the event as if it had never seen it before.
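A minimal sketch of where that setting sits in the configuration, written in plain Java with string keys so it needs no Kafka classes to compile. The class name EosConfig, the application id, and the broker address are placeholders, not taken from the original setup. Note that from Kafka 3.0 onwards the value "exactly_once" is deprecated in favour of "exactly_once_v2", so the right value depends on the client and broker versions in use.

```java
import java.util.Properties;

// Hypothetical helper: builds a Streams configuration with a processing guarantee set.
final class EosConfig {
    static Properties build(String applicationId, String bootstrapServers, String guarantee) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // "exactly_once" as in the answer above; "exactly_once_v2" on newer versions.
        props.put("processing.guarantee", guarantee);
        return props;
    }
}
```

The resulting Properties object would be passed to the KafkaStreams constructor together with the built topology, for example EosConfig.build("test-app", "localhost:9092", "exactly_once").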

Related

Kafka Streams write an event back to the input topic

In my Kafka Streams app, I need to retry processing a message whenever a particular type of exception is thrown in the processing logic.
Rather than wrapping my logic in a RetryTemplate (I am using Spring Boot), I am considering simply writing the message back to the input topic. My assumption is that the message will be appended to the end of the log in the appropriate partition and will eventually be reprocessed.
I am aware that this would mess up the ordering, and I am okay with that.
My question is: would Kafka Streams have an issue when it encounters a message that was supposedly already processed in the past? (I am assuming Kafka Streams has a way of marking the messages it has processed, especially when exactly-once is enabled.)
Here is an example of the code I am considering for this solution.
// SuccessfulResult and ErrorResult stand in for the result types returned by executeSomeLogic
val branches = streamsBuilder.stream(inputTopicName)
    .mapValues { it -> myServiceObject.executeSomeLogic(it) }
    .branch(
        { _, value -> value is SuccessfulResult }, // success
        { _, error -> error is ErrorResult }, // exception was thrown
    )
branches[0].to(outputTopicName)
branches[1].to(inputTopicName) // write them back to input as a way of retrying

Airflow/SQLAlchemy Error - Loading context has changed within a load/refresh handler

I am attempting to use Clairvoyant's db-cleanup DAG to clear metadata in our xcom table, but when I run it, I receive the following warning, printed thousands of times before I manually stop the job so as not to take down our MySQL instance:
SAWarning: Loading context for <BaseXCom at 0x7f26f789b370> has changed within a load/refresh handler, suggesting a row refresh operation took place. If this event handler is expected to be emitting row refresh operations within an existing load or refresh operation, set restore_load_context=True when establishing the listener to ensure the context remains unchanged when the event handler completes.
The other cleanup tasks work fine, but it is the xcom table in particular I am having trouble with. We have hundreds/thousands of active dags and so the xcom table is constantly being written to nearly every second or two. I think that is what is causing this error, the fact that the data is continually changing while it is being queried.
I have been unable to find the cause of this or any examples of how this can be resolved. I tried adding a "restore_load_context":True line as per SQLAlchemy docs but it did not work.
Here are the snippets I attempted to add to the database object and the cleanup task:
{
    "airflow_db_model": XCom,
    "age_check_column": XCom.execution_date,
    "keep_last": False,
    "keep_last_filters": None,
    "keep_last_group_by": None,
    "restore_load_context": True
},
....
def cleanup_function(**context):
    logging.info("Retrieving max_execution_date from XCom")
    max_date = context["ti"].xcom_pull(
        task_ids=print_configuration.task_id, key="max_date"
    )
    max_date = dateutil.parser.parse(max_date)  # stored as iso8601 str in xcom
    airflow_db_model = context["params"].get("airflow_db_model")
    state = context["params"].get("state")
    age_check_column = context["params"].get("age_check_column")
    keep_last = context["params"].get("keep_last")
    keep_last_filters = context["params"].get("keep_last_filters")
    keep_last_group_by = context["params"].get("keep_last_group_by")
    restore_load_context = context["params"].get("restore_load_context")
To avoid pasting too much code here: apart from the snippets above, I am using the same code as in the db-cleanup DAG. Has anyone encountered this and found a way to resolve it?
I am very inexperienced with SQLAlchemy and am entirely unsure where else to place this code or how to go about it.

Persist detailed information about failed Item processing

I've got a Job that runs a TaskletStep, then a chunk-based step, and then another TaskletStep.
In each of these steps, errors (in the form of Exceptions) can occur.
The chunk-based step looks like this:
stepBuilderFactory
    .get("step2")
    .chunk<SomeItem, SomeItem>(1)
    .reader(flatFileItemReader)
    .processor(itemProcessor)
    .writer {}
    .faultTolerant()
    .skipPolicy { _, _ -> true } // skip all Exceptions and continue
    .taskExecutor(taskExecutor)
    .throttleLimit(taskExecutor.corePoolSize)
    .build()
The whole job definition:
jobBuilderFactory.get("job1")
    .validator(validator())
    .preventRestart()
    .start(taskletStep1)
    .next(step2)
    .next(taskletStep2)
    .build()
I expected that Spring Batch would somehow pick up the Exceptions that occur along the way, so that I could create a report including them after the Job has finished processing. Looking at the different contexts, there are also fields that should contain failureExceptions. However, it seems there's no such information (especially for the chunk-based step).
What would be a good approach if I need information about:
- which Exceptions occurred in which Job execution
- which Item triggered each of them
The JobExecution provides a method to get all failure exceptions that happened during the job. You can use that in JobExecutionListener#afterJob(JobExecution jobExecution) to generate your report.
As for which items caused the issue, this depends on where the exception happens (during the read, process, or write operation). For this requirement, you can use an ItemReadListener, ItemProcessListener, or ItemWriteListener to keep a record of those items (for example, by adding them to the job execution context so that they are accessible in the JobExecutionListener#afterJob method for your report).
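As a rough illustration of the "keep a record of those items" part, here is a framework-free Java sketch of a thread-safe holder that such listeners could delegate to. FailureLog and its methods are hypothetical names, not Spring Batch API; the comments only indicate which listener callbacks would be expected to call them. Because the step in the question runs on a task executor, the sketch assumes entries can arrive from several threads at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical holder for failed items, shared between the listeners and the report.
final class FailureLog<T> {
    // One entry per failure: the phase, the item (null for read errors), and the cause.
    static final class Entry<I> {
        final String phase;
        final I item;
        final Throwable cause;

        Entry(String phase, I item, Throwable cause) {
            this.phase = phase;
            this.item = item;
            this.cause = cause;
        }
    }

    // Lock-free queue so concurrent chunk threads can append safely.
    private final Queue<Entry<T>> entries = new ConcurrentLinkedQueue<>();

    // Would be called from e.g. ItemProcessListener#onProcessError(item, exception).
    void add(String phase, T item, Throwable cause) {
        entries.add(new Entry<>(phase, item, cause));
    }

    // Would be called from JobExecutionListener#afterJob to build the report.
    List<Entry<T>> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Whether to hold the entries in a shared object like this or in the job execution context, as suggested above, is a design choice; the context is persisted with the job metadata, while a plain object lives only for the duration of the JVM.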

How to handle Not authorized to access topic ... in Kafka Streams

The situation is the following.
We have set up SSL + ACLs on the Kafka broker.
We are setting up a stream that reads messages from two topics:
KStream<String, String> stringInput
    = kBuilder.stream( STRING_SERDE, STRING_SERDE, inTopicName );
stringInput
    .filter( streamFilter::passOrFilterMessages )
    .map( processor )
    .to( outTopicName );
This is done twice (in a loop), once per topic.
Then we set a general error handler:
streams.setUncaughtExceptionHandler( ( Thread t, Throwable e ) -> {
        synchronized ( this ) {
            LOG.fatal( ... );
            this.stop();
        }
    }
);
The problem is the following: if, for example, the certificate for one topic is no longer valid, the stream throws the exception Not authorized to access topics ...
So far so good.
But the exception is handled by the general error handler, so the complete application stops even if the second topic has no problems.
The question is: how can this exception be handled per topic?
How can we avoid the complete application stopping at some point because a single topic has authorization problems?
I understand that if the broker is not available, the complete app may stop. But if only one topic is unavailable, then only that single stream should stop, not the complete application, right?
By design, Kafka Streams treats the topology as one unit and cannot distinguish between the two parts. For your specific case, since you loop and build two independent pipelines, you could run two KafkaStreams instances in parallel (within the same application/JVM) to isolate them from each other. Thus, if one fails, the other is not affected. You would need to use a different application.id for each instance.
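A small sketch of the configuration side of that suggestion, in plain Java without Kafka classes. PerPipelineConfig, the base id, and the pipeline names are hypothetical; the only point it illustrates is that each instance gets a copy of the shared settings plus its own application.id.

```java
import java.util.Properties;

// Hypothetical helper: derives an isolated configuration for each independent pipeline.
final class PerPipelineConfig {
    // Copies the shared settings and assigns a distinct application.id,
    // which Kafka Streams also uses as the consumer group id.
    static Properties forPipeline(Properties shared, String baseAppId, String pipelineName) {
        Properties props = new Properties();
        props.putAll(shared);
        props.put("application.id", baseAppId + "-" + pipelineName);
        return props;
    }
}
```

Each Properties object would then be passed to its own KafkaStreams instance, built from a topology containing only that pipeline, with an uncaught exception handler that stops just that instance rather than the whole application.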

Suppressing NServicebus Transaction to write errors to database

I'm using NServiceBus to handle some calculation messages. I have a new requirement to handle calculation errors by writing them to the same database. I'm using NHibernate as my DAL, which auto-enlists in the NServiceBus transaction and provides rollback in case of exceptions; this works really well. However, if I write this particular error to the database, it is also rolled back, which is a problem.
I knew this would be a problem, but I thought I could just wrap the call in a new transaction with TransactionScopeOption = Suppress. However, the error data is still rolled back. I believe that's because it was using the existing session, which has already enlisted in the NServiceBus transaction.
Next I tried opening a new session from the existing SessionFactory within the suppression transaction scope. However, the first call to the database to retrieve or save data using this new session blocks and then times out.
InnerException: System.Data.SqlClient.SqlException
Message=Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Finally I tried creating a new SessionFactory and using it to open a new session within the suppression transaction scope. However, it again blocks and times out.
I feel like I'm missing something obvious here, and would greatly appreciate any suggestions on this probably common task.
As Adam suggests in the comments, in most cases it is preferred to let the entire message fail processing, giving the built-in Retry mechanism a chance to get it right, and eventually going to the error queue. Then another process can monitor the error queue and do any required notification, including logging to a database.
However, there are some use cases where the entire message is not a failure, i.e. on the whole, it "succeeds" (whatever the business-dependent definition of that is) but there is some small part that is in error. For example, a financial calculation in which the processing "succeeds" but some human element of the data is "in error". In this case I would suggest catching that exception and sending a new message which, when processed by another endpoint, will log the information to your database.
I could see another case where you want the entire message to fail, but you want the fact that it was attempted noted somehow. This may be closest to what you are describing. In this case, create a new TransactionScope with TransactionScopeOption = Suppress, and then (again) send a new message inside that scope. That message will be sent whether or not your full message transaction rolls back.
You are correct that your transaction is rolling back because the NHibernate session is opened while the transaction is in force. Trying to open a new session inside the suppressed transaction can cause a problem with locking. That's why, most of the time, sending a new message asynchronously is part of the solution in these cases, but how you do it is dependent upon your specific business requirements.
I know I'm late to the party, but as an alternative suggestion, you could simply raise a separate log message, which NSB handles independently, for example:
public void Handle(DebitAccountMessage message)
{
    var account = this.dbcontext.GetById(message.Id);
    if (account.Balance <= 0)
    {
        // log request - new handler
        this.Bus.Send(new DebitAccountLogMessage
        {
            originalMessage = message,
            account = account,
            timeStamp = DateTime.UtcNow
        });
        // throw error - NSB will handle
        throw new DebitException("Not enough funds");
    }
}
public void Handle(DebitAccountLogMessage message)
{
    var messageString = message.originalMessage.Dump();
    var accountString = message.account.Dump(DumpOptions.SuppressSecurityTokens);
    this.Logger.Log(message.UniqueId, string.Format("{0}, {1}", messageString, accountString));
}