Setting up an alert when the interval between messages is too large - splunk

Good afternoon!
I receive messages from various systems in Splunk, and several messages from one system form a message chain.
As a rule, the messages from one system form a chain of six messages.
By message chain, I mean that Splunk receives six messages with the same field: "srcMsgId".
The messages arrive one after another at varying intervals, but the interval should not exceed a value N.
How can I set up an alert in Splunk for the case where the interval between messages in the chain exceeds the value N?

Something like this should work:
index=ndx sourcetype=srctp srcMsgId=* system=*
| stats min(_time) as early max(_time) as late by srcMsgId system
| where (late-early)>N
Use a value (in seconds) for "N" - like | where (late-early)>90 for a minute and a half, or | where (late-early)>300 for 5 minutes
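A side note on what that condition measures: (late-early) is the total span of the chain, not the gap between two consecutive messages, so a chain can trip the alert even though every individual interval stays below N. Purely to illustrate the difference, a small Python sketch (not SPL; the timestamps are made up):
# Illustration only: "total span" vs. "largest gap between consecutive
# messages" for the epoch timestamps of one srcMsgId chain.
def check_chain(timestamps, n_seconds):
    ts = sorted(timestamps)
    total_span = ts[-1] - ts[0]                       # what (late-early) measures
    max_gap = max(b - a for a, b in zip(ts, ts[1:]))  # largest single interval
    return total_span > n_seconds, max_gap > n_seconds

print(check_chain([0, 50, 120, 200, 260, 330], 90))   # -> (True, False)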


nats jetstream: continue from last acknowledged message in correct order

What do I have?
A producer that publishes a stream like this:
- START_JOB 1
- do_task 1_1
- do_task 1_2
[...]
- do_task 1_X
- END_JOB 1
- START_JOB 2
- do_task 2_1
- do_task 2_2
- END_JOB 2
- START_JOB 3
- do_task 3_1
- do_task 3_2
[...]
- do_task 3_X
- END_JOB 3
Or in other words sequences of:
START_JOB <job_nr>
a random number of do_task <job_nr>_<task_nr>
END_JOB <job_nr>
What do I want?
A consumer that (after a crash/restart) always starts at the last START_JOB for which it has not yet acknowledged the matching END_JOB, and receives the messages in the correct order from there on.
So if the consumer crashes while handling do_task 12_34, then after a restart it should start with START_JOB 12 and then get all messages from there on in chronological order.
What doesn't work?
Even after trying different combinations of:
- push/pull consumers
- using "AckPolicy All" and only sending an ack on END_JOB
- playing with "AckWait" and adding a sleep of "AckWait"+2s before starting the consumer
I sometimes still get the wrong message first after a restart.
What do I think could be the problem?
I currently think that it has something to do with the acknowledgement settings (and with messages of the current "START_JOB" -> "END_JOB" sequence still waiting to expire).
My impression is that playing with these settings has improved the situation, but there are still issues depending on how fast my producer is and how much time passes between START_JOB and END_JOB (and between the individual do_task messages).
But maybe I'm looking at the wrong thing here, and the solution for my problem is something entirely different.
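For reference, here is a minimal sketch of the setup described above (push consumer, "AckPolicy All", acking only on END_JOB), assuming the nats-py client. The subject "jobs.>", the durable name "worker" and the 30s AckWait are made-up placeholders, not a recommended configuration:
import asyncio
import nats
from nats.js.api import AckPolicy, ConsumerConfig

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    async def handle(msg):
        text = msg.data.decode()
        print("handling", text)  # START_JOB / do_task / END_JOB
        if text.startswith("END_JOB"):
            # With AckPolicy ALL this single ack also covers the START_JOB
            # and all do_task messages delivered before it.
            await msg.ack()

    # Durable push consumer; manual_ack so the client never acks on our behalf.
    await js.subscribe(
        "jobs.>",
        durable="worker",
        cb=handle,
        manual_ack=True,
        config=ConsumerConfig(ack_policy=AckPolicy.ALL, ack_wait=30),  # ack_wait in seconds
    )

    await asyncio.Event().wait()  # keep the consumer running

asyncio.run(main())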

rabbitmq prefetch with multiple consumers

I'm trying to understand how RabbitMQ works with multiple consumers and prefetch_count.
I have three consumers consuming from the same queue, and all of them are configured with QoS prefetch_count = 200.
Now, assuming that at a certain point I have an unlimited backlog of messages in the queue and consumers A, B, C connect to it, would A get messages 1-200, B get 201-400, and C get 401-600 simultaneously? That would mean messages 1, 201, and 401 get processed first, ahead of the rest. I don't want that; I'd like the messages to be processed sequentially.
If that's the case, I guess it implies that messages may be processed out of order depending on how the consumers are set up, even though the queue itself is FIFO.
Or should I set prefetch_count = 1 to make sure of REAL FIFO?
Edited:
I just set up a local RabbitMQ environment and experimented a bit. I used a producer to bombard a queue with the numbers 0 to 100000 sequentially to accumulate a backlog of messages. Then I had two consumers A and B consuming messages from that queue with prefetch_count = 200.
From what I observed, A got 0-199 and B got 200-399 at the very beginning. However, after that A started getting {401, 403, 405, 406, ...} and B got {400, 402, 404, ...}.
I guess A and B got consecutive blocks at the beginning only because I wasn't spinning up the two consumers at exactly the same moment. But the pattern that followed shows well how prefetch_count works: it doesn't necessarily hand consumers consecutive messages (I knew messages are dispatched in a round-robin fashion, but it's more intuitive to see it in an experiment). There's no guarantee in what order the messages will be processed when prefetch_count is used.
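For reference, a minimal sketch of a consumer like the ones in the experiment above, assuming the pika client (the queue name "numbers" is made up). prefetch_count is set per channel with basic_qos; with prefetch_count = 1 a consumer is only handed the next message after acking the previous one, which is the closest you get to strictly sequential processing with several consumers:
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="numbers", durable=True)

# Each consumer may hold at most this many unacknowledged messages at a time.
channel.basic_qos(prefetch_count=200)

def on_message(ch, method, properties, body):
    print("got", body.decode())
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="numbers", on_message_callback=on_message)
channel.start_consuming()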

How to get pending items with minIdleTime greater than some value?

Using Redis Streams, we can have pending items that haven't been finished by some consumer.
I can find such items using the XPENDING command.
Suppose we have two pending items:
1) 1) "1-0"
2) "local-dev"
3) (integer) 9599
4) (integer) 1
2) 1) "2-0"
2) "local-dev"
3) (integer) 9599
4) (integer) 1
The problem is that with XPENDING we can only filter by ID. I have a couple of service nodes (A, B) that run a zombie check: XPENDING mystream test_group - 5 1
Each of them receives the "1-0" item, both issue XCLAIM, and only one of them (for example A) becomes the owner and starts processing the item. But when B runs XPENDING again to get new items, it receives "1-0" again, because the item hasn't been processed yet (A is still working on it), so it looks like my whole queue is blocked.
Is there a way to avoid this and process pending items concurrently?
You want to see the documentation, in particular Recovering from permanent failures.
The way this is normally used is:
- You allow the same consumer to consume its messages from the PEL after it recovers.
- You only XCLAIM from another consumer when a reasonably large time has elapsed, which suggests the original consumer is in permanent failure.
- You use the delivery count to detect poison pills or dead letters. If a message has been retried many times, maybe it's better to report it to an admin for analysis.
So normally, all the other consumers need for the permanent-failure-recovery logic is the oldest idle time in the PEL, and you consume the items one by one.
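A small sketch of that flow, assuming the redis-py client (the consumer name "local-dev-B", the 60-second threshold and the process() helper are made up). The key point is that XCLAIM with a min-idle-time only hands over entries that have been idle long enough, and claiming resets the idle time, so a second node running the same code gets nothing back for that ID and moves on instead of blocking:
import redis

r = redis.Redis()
MIN_IDLE_MS = 60_000  # only take over entries idle for at least a minute

def process(fields):
    print("processing", fields)  # placeholder for real work

# Extended XPENDING: id, consumer, idle time and delivery count per entry.
pending = r.xpending_range("mystream", "test_group", min="-", max="+", count=10)

for entry in pending:
    if entry["time_since_delivered"] < MIN_IDLE_MS:
        continue  # the original consumer may still be working on it
    if entry["times_delivered"] > 5:
        print("possible poison pill:", entry["message_id"])  # report instead of retrying forever
        continue
    claimed = r.xclaim("mystream", "test_group", "local-dev-B",
                       min_idle_time=MIN_IDLE_MS,
                       message_ids=[entry["message_id"]])
    # An empty result means another node claimed it first (its idle time was reset).
    for msg_id, fields in claimed:
        process(fields)
        r.xack("mystream", "test_group", msg_id)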

AWS Cloudwatch alarm set to NonBreaching (or notBreaching) is not triggering, based on a log filter

With the following Metric and Alarm combination
Metric
Comes from a Cloudwatch log filter (when a match is found on the log)
Metric value: "1"
Default value: None
Unit: Count
Alarm
Statistic: Sum
Period: 1 minute
Treat missing data as: notBreaching
Threshold: [Metric] > 0 for 1 datapoints within 1 minute
The alarm goes to:
State changed to OK at 2018/12/17.
Reason: Threshold Crossed: no datapoints were received for 1 period and 1 missing datapoint was treated as [NonBreaching].
And then it doesn't trigger, even though I force the metric > 0
Why is the alarm stuck in OK? How can the alarm become triggered again?
Solution
Remove the "Unit" property from the stack template Alarm config.
The source of the problem was actually the "Unit" property. Setting it to "Count" is what made the alarm get stuck :(
Ensure the stack is producing the same result as a manual alarm setup by checking with the describe-alarms API.
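That comparison can be scripted; a small sketch assuming boto3 (the alarm names are made up): fetch the stack-created alarm next to a manually created reference alarm and compare the fields, in particular that no Unit is set on the one coming from the stack:
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fetch both the stack-created alarm and a manually created reference alarm.
resp = cloudwatch.describe_alarms(AlarmNames=["stack-log-filter-alarm", "manual-log-filter-alarm"])
for alarm in resp["MetricAlarms"]:
    # "Unit" should be absent for the log-filter metric; a leftover Unit of
    # "Count" is what kept the alarm stuck in OK.
    print(alarm["AlarmName"], alarm.get("Unit"), alarm["StateValue"])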

How to get priority of current job?

In beanstalkd
telnet localhost 11300
use foo
USING foo
put 0 100 120 5
hello
INSERTED 1
How can I find out the priority of this job when I reserve it? And can I release it with a new priority equal to the current priority + 100?
Beanstalkd doesn't return the priority with the data - but you could easily add it as metadata in your own message body. For example, with JSON as a message wrapper:
{"priority": 100, "timestamp": 1302642381, "job": "download http://example.com/"}
The next message that will be reserved will be the next available entry from the selected tubes, according to priority and time - subject to any delay that you had requested when you originally sent the message to the queue.
Addition: You can get the priority of a beanstalkd job (as well as a number of other pieces of information, such as how many times it has previously been reserved), but it requires an additional call - to the stats-job command. Called with the job id, it returns about a dozen different pieces of information. See the protocol document and your library's docs.
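A short sketch of both options, assuming the greenstalk Python client (the exact client API is an assumption; the underlying protocol commands are put, reserve, stats-job and release), using the tube "foo" from the question:
import greenstalk

# Reserve from the same tube the job was put into.
client = greenstalk.Client(("127.0.0.1", 11300), use="foo", watch="foo")

job = client.reserve()

# Option 1: ask the server - stats-job reports the priority ("pri") along
# with about a dozen other fields (reserves, age, ttr, ...).
current_pri = client.stats_job(job)["pri"]

# Option 2: if the body is your own JSON wrapper, read the priority you
# stored there yourself when the job was created.

# Release the job back with the priority lowered by 100
# (larger numbers mean lower priority in beanstalkd).
client.release(job, priority=current_pri + 100)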