How the batch tuples transport in Jstorm Trident when do a rebalance? - jstorm

When i use trident in jstorm,tuples in one batch which have same fields will be send to different tasks after rebalance,so how does the trident's CombinerAggregator work?

if the task num changed,jstorm can't ensure that tuples of same field are sent to same tasks。so,when we do rebalance in jstorm,we should be ca

Related

Am I able to view a list of devices by partition in iothub?

I have 2 nodes of a cluster receiving messages from iothub. I split their responsibility by partition. Node 1 reads from partitions 1,3,5,7,9 and the other 2,4,6,8, and 0. Recently, my partition 8 stops responding until I stop my code and restart it. It seems like a device is sending a message that locks up the partition. What I want to do is list all devices in my partition 8. Is that possible? Is there a cloud shell command to get those devices in a list?
Not sure this will help you, but you can see the partition on the incoming messages. For example you could use Azure Stream Analytics to see the partitions using this query:
Select GetMetadataPropertyValue(IoTHub, '[IoTHub].[ConnectionDeviceId]') as DeviceId, partitionId
from IoTHub
Also, if you run locally in VisualStudio it will tell you which device is sending malformed JSON. eg.
[Warning] 10/21/2021 9:12:54 AM : User Warning Source 'IoTHub' had 1 occurrences of kind 'InputDeserializerError.InvalidData' between processing times '2021-10-21T15:12:50.5076449Z' and '2021-10-21T15:12:50.5712076Z'. Could not deserialize the input event(s) from resource 'Partition: [1], Offset: [455266583232], SequenceNumber: [634800], DeviceId: [DeviceName]' as Json. Some possible reasons: 1) Malformed events 2) Input source configured with incorrect serialization format
Also check your "Activity Log" blade in the ASA job. It may have more details for you.

How to get pending items with minIdleTime greater then some value?

Using Redis stream we can have pending items which aren't finished by some consumers.
I can find such items using xpending command.
Let we have two pending items:
1) 1) "1-0"
2) "local-dev"
3) (integer) 9599
4) (integer) 1
2) 1) "2-0"
2) "local-dev"
3) (integer) 9599
4) (integer) 1
The problem that by using xpending we can set filters based on id only. I have a couple of service nodes (A, B) which make zombie check: XPENDING mystream test_group - 5 1
Each of them receives "1-0" item and they make xclaim and only one of them (for example A) becomes the owner and starts processing this item. But B runs xpending again to get new items but it receives again "1-0" because it hasn't been processed yet (A is working) and it looks like all my queue is blocked.
Is there any solution how I can avoid it and process pending items concurrently?
You want to see the documentation, in particular Recovering from permanent failures.
The way this is normally used is:
You allow the same consumer to consume its messages from PEL after recovering.
You only XCLAIM from another consumer when a reasonably large time elapsed, that suggests the original consumer is in permanent failure.
You use delivery count to detect poison pills or death letters. If a message has been retried many times, maybe it's better to report it to an admin for analysis.
So normally all you need is to see the oldest age in PEL from other consumers for the Permanent Failure Recovery logic, and you consume one by one.

How should I avoid sending duplicate emails using mailgun, taskqueue and ndb?

I am using the taskqueue API to send multiple emails is small groups with mailgun. My code looks more or less like this:
class CpMsg(ndb.Model):
group = ndb.KeyProperty()
sent = ndb.BooleanProperty()
#Other properties
def send_mail(messages):
"""Sends a request to mailgun's API"""
# Some code
pass
class MailTask(TaskHandler):
def post(self):
p_key = utils.key_from_string(self.request.get('p'))
msgs = CpMsg.query(
CpMsg.group==p_key,
CpMsg.sent==False).fetch(BATCH_SIZE)
if msgs:
send_mail(msgs)
for msg in msgs:
msg.sent = True
ndb.put_multi(msgs)
#Call the task again in COOLDOWN seconds
The code above has been working fine, but according to the docs, the taskqueue API guarantees that a task is delivered at least once, so tasks should be idempotent. Now, most of the time this would be the case with the above code, since it only gets messages that have the 'sent' property equal to False. The problem is that non ancestor ndb queries are only eventually consistent, which means that if the task is executed twice in quick succession the query may return stale results and include the messages that were just sent.
I thought of including an ancestor for the messages, but since the sent emails will be in the thousands I'm worried that may mean having large entity groups, which have a limited write throughput.
Should I use an ancestor to make the queries? Or maybe there is a way to configure mailgun to avoid sending the same email twice? Should I just accept the risk that in some rare cases a few emails may be sent more than once?
One possible approach to avoid the eventual consistency hurdle is to make the query a keys_only one, then iterate through the message keys to get the actual messages by key lookup (strong consistency), check if msg.sent is True and skip sending those messages in such case. Something along these lines:
msg_keys = CpMsg.query(
CpMsg.group==p_key,
CpMsg.sent==False).fetch(BATCH_SIZE, keys_only=True)
if not msg_keys:
return
msgs = ndb.get_multi(msg_keys)
msgs_to_send = []
for msg in msgs:
if not msg.sent:
msgs_to_send.append(msg)
if msgs_to_send:
send_mail(msgs_to_send)
for msg in msgs_to_send:
msg.sent = True
ndb.put_multi(msgs_to_send)
You'd also have to make your post call transactional (with the #ndb.transactional() decorator).
This should address the duplicates caused by the query eventual consistency. However there still is room for duplicates caused by transaction retries due to datastore contention (or any other reason) - as the send_mail() call isn't idempotent. Sending one message at a time (maybe using the task queue) could reduce the chance of that happening. See also GAE/P: Transaction safety with API calls

Inserting Celery tasks directly into Redis

I have an Erlang system. I want this system to be able to trigger Celery tasks on another, Python-based system. They share the same host, and Celery is using Redis as its broker.
Is it possible to insert tasks for Celery directly into Redis (in my case, from Erlang), instead of using a Celery API?
Yes, you can insert tasks directly into redis or whatever backend you are using with celery.
You'll have to match the celery serialization format (which is in JSON by default) and figure out which keys it is inserting to. The key structure used isn't clearly documented, but this part of the source code is a good place to start.
You can also use the redis monitor command to watch which keys celery uses in real time.
According to the task message definition from the Celery docs, the body of the message has the following format (for version 5.2):
body = (
object[] args,
Mapping kwargs,
Mapping embed {
'callbacks': Signature[] callbacks,
'errbacks': Signature[] errbacks,
'chain': Signature[] chain,
'chord': Signature chord_callback,
}
)
Therefore, to trigger a task you should put a message with a body like this to your Celery backend's queue (represented as a Python data structure):
[
['arg1', 'arg2'], # positional arguments for the task
{'kwarg1': 'val1', 'kwarg2': 'val2'}, # keyword arguments for the task
{'callbacks': None, 'errbacks': None, 'chain': None, 'chord': None}
]

Why does celery add thousands of queues to rabbitmq that seem to persist long after the tasks completel?

I am using celery with a rabbitmq backend. It is producing thousands of queues with 0 or 1 items in them in rabbitmq like this:
$ sudo rabbitmqctl list_queues
Listing queues ...
c2e9b4beefc7468ea7c9005009a57e1d 1
1162a89dd72840b19fbe9151c63a4eaa 0
07638a97896744a190f8131c3ba063de 0
b34f8d6d7402408c92c77ff93cdd7cf8 1
f388839917ff4afa9338ef81c28aad75 0
8b898d0c7c7e4be4aa8007b38ccc00ea 1
3fb4be51aaaa4ac097af535301084b01 1
This seems to be inefficient, but further I have observed that these queues persist long after processing is finished.
I have found the task that appears to be doing this:
#celery.task(ignore_result=True)
def write_pages(page_generator):
g = group(render_page.s(page) for page in page_generator)
res = g.apply_async()
for rendered_page in res:
print rendered_page # TODO: print to file
It seems that because these tasks are being called in a group, they are being thrown into the queue but never being released. However, I am clearly consuming the results (as I can view them being printed when I iterate through res. So, I do not understand why those tasks are persisting in the queue.
Additionally, I am wondering if the large number queues that are being created is some indication that I am doing something wrong.
Thanks for any help with this!
Celery with the AMQP backend will store task tombstones (results) in an AMQP queue named with the task ID that produced the result. These queues will persist even after the results are drained.
A couple recommendations:
Apply ignore_result=True to every task you can. Don't depend on results from other tasks.
Switch to a different backend (perhaps Redis -- it's more efficient anyway): http://docs.celeryproject.org/en/latest/userguide/tasks.html
Use CELERY_TASK_RESULT_EXPIRES (or on 4.1 CELERY_RESULT_EXPIRES) to have a periodic cleanup task remove old data from rabbitmq.
http://docs.celeryproject.org/en/master/userguide/configuration.html#std:setting-result_expires