Size of messages exchanged between workers - dask-distributed

When using Dask Distributed on multiple computers, data is exchanged between workers, client, and scheduler. This data appears to be chunked into messages of 64 MB. Can this value be changed? If so, what is the relevant configuration option? If there is no option, can anyone point to the relevant source code where this can be changed? I am asking this question for a NumPy application.

The following constant in distributed/protocol/utils.py sets this value (2 ** 26 bytes = 64 MiB):
BIG_BYTES_SHARD_SIZE = 2 ** 26
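To get a feel for what this constant means for a NumPy payload, here is a minimal sketch of the arithmetic. It assumes a payload is simply cut into consecutive pieces of at most BIG_BYTES_SHARD_SIZE bytes; `shard_sizes` is a hypothetical helper written for illustration and is not part of distributed.

```python
# Value quoted in the answer above: 2 ** 26 bytes = 64 MiB.
BIG_BYTES_SHARD_SIZE = 2 ** 26


def shard_sizes(nbytes, shard_size=BIG_BYTES_SHARD_SIZE):
    """Return the sizes of the pieces a payload of `nbytes` would split into.

    Hypothetical helper for illustration only; assumes plain consecutive
    slicing into pieces of at most `shard_size` bytes.
    """
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    full, rest = divmod(nbytes, shard_size)
    return [shard_size] * full + ([rest] if rest else [])


# A 200 MiB array buffer: three full 64 MiB pieces plus one 8 MiB remainder.
sizes = shard_sizes(200 * 2 ** 20)
print(len(sizes), sizes[-1] // 2 ** 20)  # prints: 4 8
```

Under that assumption, raising the constant would mean fewer, larger messages per array, and lowering it would mean more, smaller ones.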

Related

Rabbitmq :: Message is never removed from stream queue

I have created a stream queue in the RabbitMQ instance of my project and configured max-age to 1 minute. I sent a message to the queue and all the consumers consumed it, but the message remains in the queue as "ready" (I waited more than 1 minute). My worry is the accumulation of messages on the disk of the RabbitMQ instance.
So, my question is: are all the messages marked as "ready" stored on disk, even after all consumers have consumed them? If yes, how can I purge these messages from the disk of the RabbitMQ instance (max-age is not doing it in this case)?
That is the design; see https://www.rabbitmq.com/streams.html#retention
Streams are implemented as an immutable append-only disk log. This means that the log will grow indefinitely until the disk runs out. To avoid this undesirable scenario it is possible to set a retention configuration per stream which will discard the oldest data in the log based on total log data size and/or age.
There are two parameters that control the retention of a stream. These can be combined. These are either set at declaration time using a queue argument or as a policy which can be dynamically updated. ...
max-age:
    valid units: Y, M, D, h, m, s
    e.g. 7D for a week
max-length-bytes:
    the max total size in bytes
NB: retention is evaluated on per segment basis so there is one more parameter that comes into effect and that is the segment size of the stream. The stream will always leave at least one segment in place as long as the segment contains at least one message. When using broker-provided offset-tracking, offsets for each consumer are persisted in the stream itself as non-message data.
But I see what you mean.
I suggest you ask on the rabbitmq-users Google group where the RabbitMQ engineers hang out; they don't monitor SO closely.
Same problem here: the messages are never deleted.
The solution that I found:
It's not possible to avoid storing data on disk or to purge it, but it is possible to prevent excessive disk usage.
Add the argument x-stream-max-segment-size-bytes to the queue, decreasing the default segment size to one that suits your needs. I set it to 1 MB, for example. More details: https://www.rabbitmq.com/streams.html#declaring
At least one segment file will always remain, so if you just send 1 message and wait, it will remain on disk forever. However, if you keep publishing, a new segment file gets created at some point and the retention process kicks in. Files that only contain messages older than the retention period will be deleted.
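The behaviour described above can be illustrated with a toy model. This is only a sketch of segment-based retention as the answers describe it, not RabbitMQ's implementation: `ToyStream` and `Segment` are hypothetical names, and the segment size is counted in messages here rather than bytes.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One segment file: holds the publish timestamps (seconds) of its messages."""
    timestamps: list = field(default_factory=list)


class ToyStream:
    """Toy model of per-segment retention; not RabbitMQ's implementation."""

    def __init__(self, max_age_s, max_msgs_per_segment):
        self.max_age_s = max_age_s
        # Stand-in for x-stream-max-segment-size-bytes (counted in messages).
        self.max_msgs_per_segment = max_msgs_per_segment
        self.segments = [Segment()]

    def publish(self, now):
        if len(self.segments[-1].timestamps) >= self.max_msgs_per_segment:
            self.segments.append(Segment())  # roll over to a new segment
        self.segments[-1].timestamps.append(now)
        self.apply_retention(now)

    def apply_retention(self, now):
        # Only closed segments are candidates; the newest segment always stays.
        closed, current = self.segments[:-1], self.segments[-1]
        kept = [s for s in closed if now - max(s.timestamps) <= self.max_age_s]
        self.segments = kept + [current]

    def message_count(self):
        return sum(len(s.timestamps) for s in self.segments)


stream = ToyStream(max_age_s=60, max_msgs_per_segment=1)
stream.publish(now=0)
stream.apply_retention(now=3600)      # an hour passes with no new traffic
count_idle = stream.message_count()   # 1: the lone message is still on disk
stream.publish(now=3600)              # new segment created, old one expires
oldest = stream.segments[0].timestamps[0]
print(count_idle, stream.message_count(), oldest)  # prints: 1 1 3600
```

In this model, a smaller segment size makes rollover happen sooner, which is why shrinking the segment size limits how much expired data can linger.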

Baselining internal network traffic (corporate)

We are collecting network traffic from switches using Zeek in the form of ‘connection logs’. The connection logs are then stored in Elasticsearch indices via filebeat. Each connection log is a tuple with the following fields: (source_ip, destination_ip, port, protocol, network_bytes, duration) There are more fields, but let’s just consider the above fields for simplicity for now. We get 200 million such logs every hour for internal traffic. (Zeek allows us to identify internal traffic through a field.) We have about 200,000 active IP addresses.
What we want to do is digest all these logs and create a graph where each node is an IP address, and an edge (directed, source → destination) represents traffic between two IP addresses. There will be one unique edge for each distinct (port, protocol) tuple. Each edge will have these properties: average duration, average bytes transferred, and a histogram of the number of logs by hour of the day.
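To make the target data structure concrete, here is a minimal single-pass sketch of that aggregation. It assumes each log is a dict with the fields listed in the question plus an `hour` field (0-23) derived from the log timestamp; `EdgeStats` and `build_edges` are hypothetical names, and nothing here is tied to Zeek or Elasticsearch.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeStats:
    """Running totals for one (source, destination, port, protocol) edge."""
    count: int = 0
    total_bytes: int = 0
    total_duration: float = 0.0
    by_hour: list = field(default_factory=lambda: [0] * 24)

    def add(self, network_bytes, duration, hour):
        self.count += 1
        self.total_bytes += network_bytes
        self.total_duration += duration
        self.by_hour[hour] += 1

    @property
    def avg_bytes(self):
        return self.total_bytes / self.count

    @property
    def avg_duration(self):
        return self.total_duration / self.count


def build_edges(logs):
    """Fold connection logs into a dict keyed by the directed edge."""
    edges = {}
    for log in logs:
        key = (log["source_ip"], log["destination_ip"],
               log["port"], log["protocol"])
        stats = edges.setdefault(key, EdgeStats())
        # "hour" is an assumed field, precomputed from the log timestamp.
        stats.add(log["network_bytes"], log["duration"], log["hour"])
    return edges


sample_logs = [
    {"source_ip": "10.0.0.1", "destination_ip": "10.0.0.2", "port": 443,
     "protocol": "tcp", "network_bytes": 1000, "duration": 2.0, "hour": 9},
    {"source_ip": "10.0.0.1", "destination_ip": "10.0.0.2", "port": 443,
     "protocol": "tcp", "network_bytes": 3000, "duration": 4.0, "hour": 9},
    {"source_ip": "10.0.0.2", "destination_ip": "10.0.0.1", "port": 53,
     "protocol": "udp", "network_bytes": 120, "duration": 0.5, "hour": 14},
]
edges = build_edges(sample_logs)
edge = edges[("10.0.0.1", "10.0.0.2", 443, "tcp")]
print(edge.count, edge.avg_bytes, edge.avg_duration, edge.by_hour[9])
# prints: 2 2000.0 3.0 2
```

Because the edge keeps sums and counts rather than averages, partial results from separate batches or workers can be merged by adding the fields, which is what makes a streaming approach feasible at this volume.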
I have tried using Elasticsearch’s aggregation and also the newer Transform technique. While both work in theory, and I have tested them successfully on a very small subset of IP addresses, the processes simply cannot keep up for our entire internal traffic. E.g. digesting 1 hour of logs (about 200M logs) using Transform takes about 3 hours.
My question is:
Is post-processing Elasticsearch data the right approach to making this graph? Or is there some product that we can use upstream to do this job? Someone suggested looking into ntopng, but I did not find this specific use case in their product description. (Not sure if it is relevant, but we use ntop's PF_RING product as a frontend for Zeek.) Are there other products that do the job out of the box? Thanks.
What problems or root causes are you attempting to elicit with a graph of Zeek east-west traffic?
Seems that a more-tailored use case, such as a specific type of authentication, or even a larger problem set such as endpoint access expansion might be a better use of storage, compute, memory, and your other valuable time and resources, no?
Even if you did want to correlate or group on Zeek data, try to normalize it to OSSEM, and there would be no reason to, say, collect the tuple when you can collect community-id instead. You could correlate Zeek in the large to Suricata in the small. Perhaps a better data architecture would be VAST.
Kibana, in its latest iterations, does have Graph, and even older versions can leverage the third-party kbn_network plugin. I could see you hitting a wall with 200k active IP addresses and Elasticsearch aggregations or even summary indexes.
Many orgs will build data architectures beyond the simple Serving layer provided by Elasticsearch. What I have heard of would be a Kappa architecture streaming into the graph database directly, such as dgraph, and perhaps just those edges of the graph available from a Serving layer.
There are other ways of asking questions from IP address data, such as the ML options in AWS SageMaker IP Insights or the Apache Spot project.
Additionally, I'm a huge fan of getting only the right data as the situation arises, although in an automated way so that the puzzle pieces bubble up for me and I can simply lock them into place. If I were working with Zeek data especially, I could leverage a platform such as SecurityOnion and its orchestrated Playbook engine to kick off other tasks for me, such as querying out with one of the Velocidex tools, or even cross-correlating using the built-in Sigma sources.

In distributed TensorFlow, is it possible to share the same queue across different workers?

In TensorFlow, I want to have a filename queue shared across different workers on different machines, such that each machine can get a subset of files to train. I searched a lot, and it seems that only variables could be put on a PS task to be shared. Does anyone have any example? Thanks.
It is possible to share the same queue across workers, by setting the optional shared_name argument when creating the queue. Just as with tf.Variable objects, you can place the queue on any device that can be accessed from different workers. For example:
import tensorflow as tf
with tf.device("/job:ps/task:0"):  # Place queue on parameter server.
    q = tf.FIFOQueue(..., shared_name="shared_queue")
A few notes:
The value for shared_name must be unique to the particular queue that you are sharing. Unfortunately, the Python API does not currently use scoping or automatic name uniqification to make this easier, so you will have to ensure this manually.
You do not need to place the queue on a parameter server. One possible configuration would be to set up an additional "input job" (e.g. "/job:input") containing a set of tasks that perform pre-processing, and export a shared queue for the workers to use.

scalability of azure cloud queue

In our current project we use 8 worker role machines side by side that work a little differently than Azure may expect.
Short outline of the system:
each worker starts up to 8 processes that connect to Cloud Queue and process messages
each process accesses three different cloud queues to collect messages for different purposes (delta recognition, backup, metadata)
each message leads to a WCF call to an ERP system to gather information and finally adds the retrieved response to a Redis cache
this approach was chosen over many smaller machines for cost and performance reasons. While 24 one-core machines manage about 400 calls/s to the ERP system, 8 four-core machines with 8 processes each manage over 800 calls/s.
Now to the question: when we increased the number of machines further to reach 1200 calls/s, we experienced outages of Cloud Queue. At the same moment, 80% of the machines' processes stop processing messages.
Here we have two problems:
Remote debugging is not possible for these processes, but we were able to use dile to get some information out.
We use the GetMessages method of Cloud Queue to get up to 4 messages from the queue. Cloud Queue always answers with 0 messages. Reconnecting to the cloud queue does not help.
Restarting the workers does help, but shortly leads to the same problem.
Are we hitting the natural end of scalability of Cloud Queue and should switch to Service Bus?
Update:
I have not been able to fully understand the problem; I described it in "the natural borders of Cloud Queue".
To summarize:
The count of TCP connections was impressive. Actually too impressive (multiple hundreds).
Going back to the original memory size let the system operate normally again.
In my experience I have been able to get better raw performance out of Azure Cloud Queues than service bus, but Service Bus has better enterprise features (reliable, topics, etc). Azure Cloud Queue should process up to 2K/second per queue.
https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
You can also try partitioning to multiple queues if there is some natural partition key.
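As an illustration of that partitioning idea, here is a minimal sketch that maps a partition key to one of several queues with a stable hash. The queue names and the key are hypothetical, and the code only picks a name; it does not call the Azure SDK.

```python
import hashlib


def queue_for(partition_key, queue_names):
    """Pick a queue name for a key using a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    digest is used to make every worker agree on the mapping.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(queue_names)
    return queue_names[index]


# Hypothetical queue names; one per partition.
QUEUES = ["delta-0", "delta-1", "delta-2", "delta-3"]

# Hypothetical partition key, e.g. a customer or document identifier.
target = queue_for("customer-4711", QUEUES)
print(target in QUEUES)  # prints: True
```

Producers and consumers that share the same list of names will route a given key to the same queue, which spreads the load while keeping related messages together.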
Make sure that your processes don't have some sort of thread deadlock that is the real culprit. You can test this by connecting to the queue when it appears hung and trying to pull messages from it. If that works, the problem is your process, not the queue.
Also take a look at this to setup some other monitors:
https://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
It took some time to solve this issue:
First, a summary of how the storage account was used:
We used the blob storage once a day pretty heavily.
The "normal" diagnostics that Azure provides out of the box also used the same storage account.
Some controlling processes used small tables to store and read information once an hour for ca. 20 minutes.
There may be up to 800 calls/s that try to increment a counter of calls to the ERP system.
Once we recognized that the storage account was under heavy load, we split it up.
Now there are three physical storage accounts having 2 queues.
The original one still handles up to 800 calls/s for incrementing counters
Diagnostics are still on the original one
Controlling information has also been moved
The system has now been running for 2 weeks and works like a charm. There are several things we learned from this:
No, the infrastructure is "not just there" and it doesn't scale endlessly.
Even though we thought we didn't use "that much", taken together our usage was quite heavy and uncontrolled.
There are no "best practices" anywhere on the net that tell the complete story. Especially when starting to work with the storage account, a guide from MS would be quite helpful.
Exception handling in storage is quite bad. Even if the storage account is overused, I would expect some kind of exception and not just zero messages returned without any surrounding information.
Read complete story here: natural borders of cloud storage scalability
UPDATE:
Many factors influence scalability. You may be interested in Azure Service Bus: Massive count of listeners and senders, to be aware of some more pitfalls.

Fault tolerant system design

There is a DB as the data store and y (>5) other machines. There is a machine A that has data (updated) every x mins. The y machines get the data from machine A every x mins and update the data in the database. Every machine does the same thing for the sake of fault tolerance. Is there a clean way to model this setup with fault tolerance?
Any pointers are appreciated.
This is a problem with a very large scope. How is the data structured? How do the "db loaders" get the data from the "data producing" machine? What happens if an update fails: is the data lost, or must it be persisted at any cost?
I will make some assumptions and suggest a solution:
1. The data can be partitioned.
2. You have access to a central persistent buffer. e.g. MSMQ or WebSphere MQ.
The machine generating the data puts chunks inside a central queue. Each chunk is composed of a set of record IDs and the new values for the relevant properties; you decide the granularity.
The "db loaders" listen to the queue and each de-queues a chunk (the contention is only on the dequeue stage and is very optimized) and updates its own set of ids.
This way the insert work is distributed among the machines, each handles its own portion, and if one crashes, well, the others will simply work a bit harder.
In case of a failure to update you can return the chunk to the queue and retry later (transactional read).
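The dequeue, update, and retry cycle described above can be sketched as follows. This is a single-process illustration that uses Python's in-memory `queue.Queue` as a stand-in for a persistent broker such as MSMQ or WebSphere MQ; `drain`, `flaky_update`, the chunk layout, and the retry limit are all assumptions made for the example.

```python
import queue


def drain(work, apply_update, max_attempts=3):
    """Process chunks until the queue is empty; return chunks that kept failing."""
    dead = []
    while True:
        try:
            chunk, attempts = work.get_nowait()
        except queue.Empty:
            return dead
        try:
            apply_update(chunk)
        except Exception:
            # Failed update: put the chunk back so any loader can retry it.
            if attempts + 1 < max_attempts:
                work.put((chunk, attempts + 1))
            else:
                dead.append(chunk)


db = {}             # stand-in for the real database
failed_once = set()


def flaky_update(chunk):
    """Hypothetical updater that fails the first time it sees record 3."""
    if 3 in chunk["ids"] and 3 not in failed_once:
        failed_once.add(3)
        raise RuntimeError("simulated update failure")
    for record_id in chunk["ids"]:
        db[record_id] = chunk["value"]


work = queue.Queue()
for ids in ([1, 2], [3, 4], [5]):
    work.put(({"ids": ids, "value": "v2"}, 0))

dead = drain(work, flaky_update)
print(sorted(db), dead)  # prints: [1, 2, 3, 4, 5] []
```

With a real broker, the put-back step would normally be handled by rolling back a transactional read rather than by an explicit re-enqueue, and several loaders would run `drain` concurrently on different machines.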