How to interpret the RabbitMQ Message stats? - rabbitmq

I to want get and historize queue metrics for the "Enqueued, Dequeued an Size" (Terminology formerly met on ActiveMQ).
The moving charts provided in the management plugin are not enough for the monitoring that I need to do.
So with RabbitMQ, I'm getting data from https://rabbitmq-server:15672/api/queues/myvhost
This returns json.. for a queue, I can obtain real life production data like :
"messages":0, // for "Size"
"message_stats":{
"deliver_get":171528, // for "Dequeued"
"ack":162348,
"redeliver":9513,
"deliver_no_ack":0,
"deliver":171528,
"get":0,
"publish":51293 // for "Enqueued"
(...)
I'm in particular surprised by the publish counter:
Its value can even decrease between 2 measures done with a couple of minutes of delay ! (see sample chart around 17:00)
As you can see on my data, the deliver_get is significantly larger than the publish.
https://my-rabbitmq:15672/doc/stats.html doesn't give a lot of details that could explain what I actually notice.
Also, under the message_stats object that I obtain, I'm missing the some counters like confirm and return which could be related to the enqueuing.
Are there relationships between these metrics ? (like deliver_get + messages = redeliver + publish.. but that one doesn't work with my figures)
Is there another more detailed documentation about these metrics ?

Related

Kafka streams: groupByKey and reduce not triggering action exactly once when error occurs in stream

I have a simple Kafka streams scenario where I am doing a groupyByKey then reduce and then an action. There could be duplicate events in the source topic hence the groupyByKey and reduce
The action could error and in that case, I need the streams app to reprocess that event. In the example below I'm always throwing an error to demonstrate the point.
It is very important that the action only ever happens once and at least once.
The problem I'm finding is that when the streams app reprocesses the event, the reduce function is being called and as it returns null the action doesn't get recalled.
As only one event is produced to the source topic TOPIC_NAME I would expect the reduce to not have any values and skip down to the mapValues.
val topologyBuilder = StreamsBuilder()
topologyBuilder.stream(
TOPIC_NAME,
Consumed.with(Serdes.String(), EventSerde())
)
.groupByKey(Grouped.with(Serdes.String(), EventSerde()))
.reduce { current, _ ->
println("reduce hit")
null
}
.mapValues { _, v ->
println(Id: "${v.correlationId}")
throw Exception("simulate error")
}
To cause the issue I run the streams app twice. This is the output:
First run
Id: 90e6aefb-8763-4861-8d82-1304a6b5654e
11:10:52.320 [test-app-dcea4eb1-a58f-4a30-905f-46dad446b31e-StreamThread-1] ERROR org.apache.kafka.streams.KafkaStreams - stream-client [test-app-dcea4eb1-a58f-4a30-905f-46dad446b31e] All stream threads have died. The instance will be in error state and should be closed.
Second run
reduce hit
As you can see the .mapValues doesn't get called on the second run even though it errored on the first run causing the streams app to reprocess the same event again.
Is it possible to be able to have a streams app re-process an event with a reduced step where it's treating the event like it's never seen before? - Or is there a better approach to how I'm doing this?
I was missing a property setting for the streams app.
props["processing.guarantee"]= "exactly_once"
By setting this, it will guarantee that any state created from the point of picking up the event will rollback in case of a exception being thrown and the streams app crashing.
The problem was that the streams app would pick up the event again to re-process but the reducer step had state which has persisted. By enabling the exactly_once setting it ensures that the reducer state is also rolled back.
It now successfully re-processes the event as if it had never seen it before

Persist detailed information about failed Item processing

I´ve got a Job that runs a TaskletStep, then a chunk-based step and then another TaskletStep.
In each of these steps, errors (in the form of Exceptions) can occur.
The chunk-based step looks like this:
stepBuilderFactory
.get("step2")
.chunk<SomeItem, SomeItem>(1)
.reader(flatFileItemReader)
.processor(itemProcessor)
.writer {}
.faultTolerant()
.skipPolicy { _ , _ -> true } // skip all Exceptions and continue
.taskExecutor(taskExecutor)
.throttleLimit(taskExecutor.corePoolSize)
.build()
The whole job definition:
jobBuilderFactory.get("job1")
.validator(validator())
.preventRestart()
.start(taskletStep1)
.next(step2)
.next(taskletStep2)
.build()
I expected that Spring Batch somehow picks up the Exceptions that occur along the way, so I can then create a Report including them after the Job has finished processing. Looking at the different contexts, there´s also fields that should contain failureExceptions. However, it seems there´s no such information (especially for the chunked step).
What would be a good approach if I need information about:
what Exceptions did occur in which Job execution
which Item was the one that triggered it
The JobExecution provides a method to get all failure exceptions that happened during the job. You can use that in a JobExecutionListener#afterJob(JobExecution jobExecution) to generate your report.
In regards to which items caused the issue, this will depend on where the exception happens (during the read, process or write operation). For this requirement, you can use one of the ItemReadListener, ItemProcessListener or ItemWriteListener to keep record of the those items (For example, by adding them to the job execution context to be able to get access to them in the JobExecutionListener#afterJob method for your report).

Ignite TcpCommunicationSpi : Can slowClientQueueLimit be set to same value as messageQueueLimit as per docs?

I am not completely sure of the meaning or the interplay between slowClientQueueLimit and messageQueueLimit.
As per the documentation, they both should ideally be set to the same value, https://ignite.apache.org/releases/2.4.0/javadoc/org/apache/ignite/spi/communication/tcp/TcpCommunicationSpi.html#setSlowClientQueueLimit-int-
However when i do set that i see this in the logs, is it a minor bug in the check or should i change this?
[WARN ] 2018-06-27 22:32:18.429 [main] org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - Slow client queue limit is set to a value greater than message queue limit (slow client queue limit will have no effect) [msgQueueLimit=1024, slowClientQueueLimit=1024]
Thanks
From code the warning is correct, but javadoc is not. slowClientQueueLimit has to be less than msgQueueLimit, because when message is being prepared to sending, first are checked back pressure limits, and only then slowClientQueueLimit. If these two numbers are equal, sender thread will be blocked by back pressure before it could go to slow client check. What means client would not be dropped.
Set slowClientQueueLimit to msgQueueLimit - 1 or less, and I'll suggest community to fix the docs.

Dataflow's BigQuery inserter thread pool exhausted

I'm using Dataflow to write data into BigQuery.
When the volume gets big and after some time, I get this error from Dataflow:
{
metadata: {
severity: "ERROR"
projectId: "[...]"
serviceName: "dataflow.googleapis.com"
region: "us-east1-d"
labels: {…}
timestamp: "2016-08-19T06:39:54.492Z"
projectNumber: "[...]"
}
insertId: "[...]"
log: "dataflow.googleapis.com/worker"
structPayload: {
message: "Uncaught exception: "
work: "[...]"
thread: "46"
worker: "[...]-08180915-7f04-harness-jv7y"
exception: "java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#1a1680f rejected from java.util.concurrent.ThreadPoolExecutor#b11a8a1[Shutting down, pool size = 100, active threads = 100, queued tasks = 2316, completed tasks = 1192]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681)
at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:218)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2155)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2113)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:62)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:79)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:657)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:86)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:483)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)"
logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"
stage: "F10"
job: "[...]"
}
}
It looks like I'm exhausting the thread pool defined in BigQueryTableInserter.java:84. This thread pool has an hardcoded size of 100 threads and cannot be configured.
My questions are:
How could I avoid this error?
Am I doing something wrong?
Shouldn't the pool size be configurable? How can 100 threads be the perfect fit for all needs and machine types?
Here's a bit of context of my usage:
I'm using Dataflow in streaming mode, reading from Kafka using KafkaIO.java
"After some time" is a few hours, (less than 12h)
I'm using 36 workers of type n1-standard-4
I'm reading around 180k messages/s from Kafka (about 130MB/s of network input to my workers)
Messages are grouped together, outputting around 7k messages/s into BigQuery
Dataflow workers are in the us-east1-d zone, BigQuery dataset location is US
You aren't doing anything wrong, though you may need more resources, depending on how long volume stays high.
The streaming BigQueryIO write does some basic batching of inserts by data size and row count. If I understand your numbers correctly, your rows are large enough that each is being submitted to BigQuery in its own request.
It seems that the thread pool for inserts should install ThreadPoolExecutor.CallerRunsPolicy which causes the caller to block and run jobs synchronously when they exceed the capacity of the executor. I've posted PR #393. This will convert the work queue overflow into pipeline backlog as all the processing threads block.
At this point, the issue is standard:
If the backlog is temporary, you'll catch up once volume decreases.
If the backlog grows without bound, then of course it will not solve the issue and you will need to apply more resources. The signs should be the same as any other backlog.
Another point to be aware of is that around 250 rows/second per thread this will exceed the BigQuery quota of 100k updates/second for a table (such failures will be retried, so you might get past them anyhow). If I understand your numbers correctly, you are far from this.

How to optimize golang program that spends most time in runtime.osyield and runtime.usleep

I've been working on optimizing code that analyzes social graph data (with lots of help from https://blog.golang.org/profiling-go-programs) and I've successfully reworked a lot of slow code.
All data is loaded into memory from db first, and the data analysis from there appears CPU bound (max memory consumption < 10MB, CPU1 # 100%)
But now most of my program's time seems to be in runtime.osyield and runtime.usleep. What's the way to prevent that?
I've set GOMAXPROCS=1 and the code does not spawn any goroutines (other than what the golang libraries may call).
This is my top10 output from pprof
(pprof) top10
62550ms of 72360ms total (86.44%)
Dropped 208 nodes (cum <= 361.80ms)
Showing top 10 nodes out of 77 (cum >= 1040ms)
flat flat% sum% cum cum%
20760ms 28.69% 28.69% 20850ms 28.81% runtime.osyield
14070ms 19.44% 48.13% 14080ms 19.46% runtime.usleep
11740ms 16.22% 64.36% 23100ms 31.92% _/C_/code/sc_proto/cloudgraph.(*Graph).LeafProb
6170ms 8.53% 72.89% 6170ms 8.53% runtime.memmove
4740ms 6.55% 79.44% 10660ms 14.73% runtime.typedslicecopy
2040ms 2.82% 82.26% 2040ms 2.82% _/C_/code/sc_proto.mAvg
890ms 1.23% 83.49% 1590ms 2.20% runtime.scanobject
770ms 1.06% 84.55% 1420ms 1.96% runtime.mallocgc
760ms 1.05% 85.60% 760ms 1.05% runtime.heapBitsForObject
610ms 0.84% 86.44% 1040ms 1.44% _/C_/code/sc_proto/cloudgraph.(*Node).DeepestChildren
(pprof)
The _ /C_/code/sc_proto/* functions are my code.
And the output from web:
(better, SVG version of graph here: https://goo.gl/Tyc6X4)
Found the answer myself, so I'm posting this here for anyone else who is having a similar problem. And special thanks to #JimB for sending me down the right path.
As can be seen from the graph, the paths which lead to osyield and usleep are garbage collection routines. This program was using a linked list which generated a lot of pointers, which created a lot of work for the gc, which occasionally blocked execution of my code while it cleaned up my mess.
Ultimately the solution to this problem came from https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs (which was an awesome resource btw). I followed the instructions about the memory profiler there; and the recommendation to replace collections of pointers with slices cleared up my garbage collection issues, and my code is much faster now!