Below are two screenshots. One shows that RabbitMQ is entirely using his memory. The second the breakdown of the memory usage. The queue is empty, and as far as I know, RabbitMQ should not hold any messages as binaries when the queue is empty. Also, when killing al the consumers, the memory usages does drop a little bit but only by 0.2GB.
Related
Vulkan is intended to be thin and explicit to user, but queues are a big exception to this rule: queues may be multiplexed by driver and it's not always obvious if using multiple queues from a family will improve performance or not.
After one of driver updates, I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them (will be happy to be proved wrong)
So why not just say "we have N separate hardware queues and if you want to use some of them in parallel, just mutex it yourself"? Now it looks like there's no way to know, how independent queues in family really are.
GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.
That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.
And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:
I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them
Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that, at any one time, only one thing can be DMA'd from CPU memory to GPU memory at one time.
Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.
With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
With your way, you'd have to upload each item one at a time at regular intervals. A process that will have to be able to be interrupted by a possible download. To do that, you'd basically have to have a recurring tasks that shows up to perform and submit a single upload to the transfer queue.
It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, then it'll probably perform operations round-robin, jumping back and forth between the input transfer queue operations rather than waiting for one queue to run dry before trying another.
But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.
The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware for example offers two transfer queues, but these actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.
My embedded application uses the LwIP library to send UDP messages of varying lengths, depending on the contents.
Right now I'm calling pbuf_alloc / pbuf_free every time a message needs sending using PBUF_RAM. It appears to work fine, but I'm worried it will lead to memory fragmentation nastiness after it's been running for a long time. Should I be worried?
Also, is it true that PBUF_POOL is for receiving messages only, not for sending?
PBUF_POOL is for RX only, the idea is to separate the memory pool for received packets and buffered TX segments. See PBUF_POOL define for documentation
In terms of PBUF_RAM and fragmentation in the memory heap, there are a number of configurations which determine how the heap is implemented, which may affect fragmentation. So you'll want to understand your configuration.
Heap could be implemented by standard C library malloc, a series of various fixed sized pools, or a single static array. If using the latter, plug_holes() is called from mem_free() which should handle fragments. See mem.c
When running in a cluster, if something wrong happens, a worker generally dies (JVM shutdown). It can be caused by many factors, most of the time it is a challenge (the biggest difficulty with storm?) to find out what causes the crash.
Of course, storm-supervisor restarts dead workers and liveness is quite good within a storm cluster, still a worker crash is a mess that we should avoid as it adds overhead, latency (can be very long until a worker is found dead and respawned) and data loss if you didn't design your topology to prevent that.
Is there an easy way / tool / methodology to check when and possibly why a storm worker crashes? They are not shown in storm-ui (whereas supervisors are shown), and everything needs manual monitoring (with jstack + JVM opts for instance) with a lot of care.
Here are some cases that can happen:
timeouts and many possible reasons: slow java garbage collection, bad network, bad sizing in timeout configuration. The only output we get natively from supervisor logs is "state: timeout" or "state: disallowed" which is poor. Also when a worker dies the statistics on storm-ui are rebooted. As you get scared of timeouts you end up using long ones which does not seem to be a good solution for real-time processing.
high back pressure with unexpected behaviour, starving worker heartbeats and inducing a timeout for instance. Acking seems to be the only way to deal with back pressure and needs good crafting of bolts according to your load. Not acking seems to be a no-go as it would indeed crash workers and get bad results in the end (even less data processed than an acking topology under pressure?).
code runtime exceptions, sometimes not shown in storm-ui that need manual checking of application logs (the easiest case).
memory leaks that can be found out with JVM dumps.
The storm supervisor logs restart by timeout.
you can monitor the supervisor log, also you can monitor your bolt's execute(tuple) method's performance.
As for memory leak, since storm supervisor does kill -9 the worker, the heap dump is likely to be corrupted, so i would use tools that monitor your heap dynamically or killing the supervisor to produce heap dumps via jmap. Also, try monitoring the gc logs.
I still recommend increasing the default timeouts.
I have simple RabbitMQ cluster with 2 physical identical linux nodes: (CentOS, RabbitMQ 3.1.5, Erlang R15B, 2GB Ram, CPU 1xCore). Mirroring and synchronization of nodes is turned on.
I have two problems which bothers me:
In a normal situation everything is fine, but after restarting one of the nodes(by stop_app and start_app in the commandline) the whole cluster becomes unavaible to producers and consumers - I can't produce or receive messages from a queue during synchronization. Is this situation normal?
During synchronization I observed very high CPU load (almost 100%) on the slave node(that which was restarted). I measured the speed of synchronization - it's dramatic low (synchronization of 2 millions of messages takes above 3 hours). It's strange because producing of such amount takes much less. Is this situation normal too?
I've recently been tasked with looking into RabbitMQ at work and so have been deep in the documentation.
When synchronising this is the case. This is an extract from the RabbitMQ HA documentation here.
If a queue is set to automatically synchronise it will synchronise
whenever a new slave joins - becoming unresponsive until it has done
so.
If the messages are being read-from and persisted-to disk (either through choice or through memory limitations) there may be overhead there. You can see a chart on this blog entry (it's the last chart before the comments) which indicates that there are performance changes when reading from and writing to queues of that many messages. These charts are for older versions of RabbitMQ but I've not seen anything more recent.
Hope this helps!
I have 3 threads (in addition to the main thread). The threads read, process, and write. They each do this to a number of buffers, which are cycled through and reused. The reason it's set up this way is so the program can continue to do the other tasks while one of them is running. So, for example, while the program is writing to disk, it can simultaneously be reading more data.
The problem is I need to synchronize all this so the processing thread doesn't try to process buffers that haven't been filled with new data. Otherwise, there is a chance that the processing step could process leftover data in one of the buffers.
The read thread reads data into a buffer, then marks the buffer as "new data" in an array. So, it works like this:
//set up in main thread
NSConditionLock *readlock = [[NSConditionLock alloc] initWithCondition:0];
//set up lock in thread
[readlock lockWhenCondition:buffer_new[current_buf]];
//copy data to buffer
memcpy(buffer[current_buf],source_data,data_length);
//mark buffer as new (this is reset to 0 once the data is processed)
buffer_new[current_buf] = 1;
//unlock
[readlock unlockWithCondition:0];
I use buffer_new[current_buf] as a condition variable to NSConditionLock. If the buffer isn't marked as new, then the thread in question will lock, waiting for the previous thread to write new data. That part seems to work okay.
The main problem is I need to sync this in both directions. If the read thread happens to take too long for some reason and the processing thread has already finished with processing all the buffers, the processing thread needs to wait and vice-versa.
I'm not sure NSConditionLock is the appropriate way to do this.
I'd turn this on its head. As you say, threading is hard and multi-way synchronization of threads is even harder. Queue based concurrency is often much more natural.
Define three queues; a read queue, a write queue and a processing queue. Then employ a rule stating that no buffer shall be enqueued in more than one queue at a time.
That is, a buffer may be enqueued onto the read queue and, once done reading, enqueued into the processing queue, and once done processing, enqueued into the write queue.
You could use a stack of buffers if you want but, typically, the cost of allocation is pretty cheap compared to the cost of processing and, thus, enqueue-for-read could also do the allocation while dequeue-once-written could do the free.
This would be pretty straightforward to code with GCD. Note that if you really want parallelism, your various queues would really just be throttles, using semaphores -- potentially shared -- to enqueue the work to the global concurrent queues.
Note also that this design has a distinct advantage over what you are currently using in that it uses no locks. The only locks are hidden below the GCD APIs as a part of queue management, but that is effectively invisible to your code.
Have you seen then Apple Concurrency Programming Guide ?
It recommends several preferable methods for moving away from a Threads and Locks concurrency model. Using Operation Queues for example can not only reduce and simplify your code, speed up your development and give you better performance.
Sometimes you need to use threads, and you already have the correct idea. You will need to keep adding locks, and with each it will get exponentially more complicated until you can't understand your own code. Then you can start adding locks at random places. Then you're screwed.
Read the concurrency guide, then follow bbum's advice.