LwIP buffer management for sending UDP messages

My embedded application uses the LwIP library to send UDP messages of varying lengths, depending on the contents.
Right now I'm calling pbuf_alloc / pbuf_free every time a message needs sending using PBUF_RAM. It appears to work fine, but I'm worried it will lead to memory fragmentation nastiness after it's been running for a long time. Should I be worried?
Also, is it true that PBUF_POOL is for receiving messages only, not for sending?

PBUF_POOL is for RX only; the idea is to keep the pool used for received packets separate from the memory used for buffered TX segments. See the documentation on the PBUF_POOL define.
As for PBUF_RAM and heap fragmentation, there are a number of configuration options that determine how the heap is implemented, and that implementation is what determines how prone it is to fragmentation, so you'll want to understand your configuration.
The heap can be implemented by the standard C library's malloc (MEM_LIBC_MALLOC), by a set of fixed-size pools (MEM_USE_POOLS), or by a single static array. If you're using the latter, plug_holes() is called from mem_free() and coalesces adjacent free blocks, which keeps fragmentation in check. See mem.c.
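For reference, the per-message pattern being discussed looks roughly like the sketch below (illustrative only; send_message, pcb, dest_ip and dest_port are made-up names, a raw-API UDP PCB is assumed to be set up elsewhere, and error handling is minimal):

    #include <string.h>
    #include "lwip/pbuf.h"
    #include "lwip/udp.h"

    /* Sketch: allocate a PBUF_RAM pbuf from the lwIP heap, copy the payload in,
     * send it, then drop our reference. */
    static err_t send_message(struct udp_pcb *pcb, const ip_addr_t *dest_ip,
                              u16_t dest_port, const void *data, u16_t len)
    {
        struct pbuf *p = pbuf_alloc(PBUF_TRANSPORT, len, PBUF_RAM);
        if (p == NULL) {
            return ERR_MEM;                /* heap exhausted */
        }
        memcpy(p->payload, data, len);     /* PBUF_RAM payload is contiguous */
        err_t err = udp_sendto(pcb, p, dest_ip, dest_port);
        pbuf_free(p);                      /* lwIP keeps its own reference if it still needs the pbuf */
        return err;
    }

Because every message goes through pbuf_alloc(..., PBUF_RAM) and pbuf_free(), whether fragmentation becomes a problem comes down to the heap implementation options described above.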

Akka Stream application using more memory than the JVM's heap

Summary:
I have a Java application that uses Akka Streams and is using more memory than I have specified for the JVM. The values below are what I have set through JAVA_OPTS.
maximum heap size (-Xmx) = 700MB
metaspace (-XX) = 250MB
stack size (-Xss) = 1025kb
Using those values and plugging them into the formula below, one would expect the application to use around 950MB. However, that is not the case: it's using over 1.5GB.
Max memory = [-Xmx] + [-XX:MetaspaceSize] + number_of_threads * [-Xss]
Question: Thoughts on how this is possible?
Application overview:
This Java application uses Alpakka to connect to pubsub and consume messages. It uses Akka Streams' parallelism to perform logic on the consumed messages and then produce them to a Kafka instance. A heap dump (the screenshot is not reproduced here) shows the heap at only 912.9MB, so something is taking up 587.1MB and pushing memory usage over 1.5GB.
Why is this a problem?
This application is deployed on a Kubernetes cluster, and the pod has a memory limit of 1.5GB. So when the container where the Java application is running consumes more than 1.5GB, the container is killed and restarted.
The short answer is that those do not account for all the memory consumed by the JVM.
Outside of the heap, for instance, memory is allocated for:
compressed class space (governed by the MaxMetaspaceSize)
direct byte buffers (especially if your application performs network I/O and cares about performance, it's virtually certain to make somewhat heavy use of those)
threads (each thread has a stack governed by -Xss ... note that if mixing different concurrency models, each model will tend to allocate its own threads and not necessarily provide a means to share threads)
if native code is involved (e.g. perhaps in the library Alpakka is using to interact with pubsub?), that can allocate arbitrary amounts of memory outside of the heap
the code cache (typically 48MB)
the garbage collector's state (will vary based on the GC in use, including the presence of any tunable options)
various other things that generally aren't going to be that large
In my experience you're generally fairly safe with a heap that's at most (pod memory limit minus 1 GB), but if you're performing exceptionally large I/Os etc. you can pretty easily get OOM even then.
Your JVM may ship with support for native memory tracking (on HotSpot, enabled with -XX:NativeMemoryTracking=summary and queried at runtime with jcmd <pid> VM.native_memory summary), which can shed light on at least some of that non-heap consumption. Most of these allocations tend to happen soon after the application is fully loaded, so running with a much higher resource limit and then stopping (e.g. via SIGTERM, with enough time to allow it to save results) should give you an idea of what you're dealing with.

Memory reference traces of packet processing applications with Intel Pin

I'm learning how to use Intel Pin and I have a couple of questions regarding the instrumentation process for a particular use case. I would like to create a memory reference trace of a simple packet processing application. I have developed the required pintool for that purpose, and my questions are the following.
Assuming I use the same network packet trace at all times as input to my packet processing application, and let's say I instrument that same application on two different machines: how will the memory reference traces be different? Apparently Pin instruments user space and is architecture independent, so I wouldn't expect to see big qualitative differences between the two output memory reference traces. Is that assumption correct?
How will the memory trace change if I experiment with the rate at which I inject network packets into my packet processing application? Or will it change at all, and if so, how can I detect how the output traces differ?
Thank you
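For context, a memory-reference pintool of the kind described above is typically structured like the pinatrace example that ships with Pin. The sketch below is a stripped-down illustration (the output file name and format are arbitrary choices, not part of the question):

    #include <stdio.h>
    #include "pin.H"

    static FILE *trace;

    static VOID RecordMemRead(VOID *ip, VOID *addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
    static VOID RecordMemWrite(VOID *ip, VOID *addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

    /* Called for every instruction; insert a callback before each memory operand access. */
    static VOID Instruction(INS ins, VOID *v)
    {
        UINT32 memOperands = INS_MemoryOperandCount(ins);
        for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
            if (INS_MemoryOperandIsRead(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
            if (INS_MemoryOperandIsWritten(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
        }
    }

    static VOID Fini(INT32 code, VOID *v) { fclose(trace); }

    int main(int argc, char *argv[])
    {
        PIN_Init(argc, argv);
        trace = fopen("memtrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();   /* never returns */
        return 0;
    }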
I assume you are doing something related to following the data flow / code flow of the network packet, probably closely related to data tainting?
Assuming I use the same network packet trace at all times as input to my packet processing application, and let's say I instrument that same application on two different machines: how will the memory reference traces be different?
There are multiple factors that can make the memory traces quite different, the crucial point being "two different machines":
Exact copy of the same OS: traces nearly the same (as the stack, heap and virtual memory manager will work the same), except addresses will change (ASLR).
Same OS (but not necessarily the same version of the system shared libraries): probably the same as above if there is no recompilation of the target application; maybe minor differences due to the heap manager, which can behave differently.
Different OS (where a recompilation of the traced application is needed): completely different traces.
Apparently Pin instruments user space and is architecture independent, so I wouldn't expect to see big qualitative differences between the two output memory reference traces. Is that assumption correct?
Pintools need to be recompiled for different architectures, but the pintool itself should not change the way the target application is traced (same pintool + same OS + same application = nearly the same trace).
How will the memory trace change if I experiment with the rate at which I inject network packets into my packet processing application?
This is system dependent and also depends on your insertion point(s). If you start tracing at recv() or recvfrom(), there might be some congestion or dropped packets (UDP) if, for example, the rate is too high. It depends on the protocol, your receive window, etc. There are really multiple factors here.
Or will it change at all, and if so, how can I detect how the output traces differ?
I'd probably check the code flow rather than the data flow for this case (it seems easier to me). Given exactly the same packet but different rates, if the code branches are not the same (maybe at the basic block (BBL) level), that immediately tells you that the same packet is handled differently.
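To make the code-flow comparison concrete, a BBL-level log can be collected with trace-level instrumentation along the following lines (a minimal sketch; diffing the logs produced at two injection rates then shows where the handling of a packet diverges):

    #include <stdio.h>
    #include "pin.H"

    static FILE *out;

    static VOID LogBbl(ADDRINT addr) { fprintf(out, "BBL 0x%lx\n", (unsigned long)addr); }

    /* Called once per trace; log the entry address of every basic block executed. */
    static VOID Trace(TRACE trace, VOID *v)
    {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)LogBbl,
                           IARG_ADDRINT, BBL_Address(bbl), IARG_END);
    }

    static VOID Fini(INT32 code, VOID *v) { fclose(out); }

    int main(int argc, char *argv[])
    {
        PIN_Init(argc, argv);
        out = fopen("bbltrace.out", "w");
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();
        return 0;
    }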

Where are buffers located?

I hear a lot about flushing buffers, sending to a buffer, etc., but I don't have a clear picture of where buffers reside or what they look like.
Are buffers part of the OS kernel or part of each process? If the former, can the same buffers be used by multiple processes?
A buffer is a generic term for a collection of bytes, typically used in the context of either sending, receiving or storing information where the internal data-structure of the information isn't important.
In the case of "flushing" buffers, this typically comes up in the context of sending data either to a file or to a network; the buffer in this case is used to coalesce multiple small writes to the file or network into one larger and more-efficient-to-transmit buffer. After the final write has been performed (or after some "commit" point), the buffer must be "flushed" to ensure that any data left waiting to coalesce with a future write is committed immediately to the underlying file or sent over the network, rather than left waiting for a future write that might never come.
In the case of both network and file IO, buffers are usually used in multiple places. File IO may well be buffered by a buffer in the application, in a library (for instance, an implementation of fwrite may buffer the output), in the kernel, and even on the device itself; network writes may well be buffered by the device whilst waiting for bandwidth on the wire, and hard-disk drives will buffer output from the OS to ensure that data isn't lost as the physical platters spin to the correct position for the write.
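As a concrete illustration of the application/library layer, C's stdio buffers small writes in user space until the buffer fills or is flushed explicitly (a minimal sketch; the file name and 4 KB buffer size are arbitrary):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("out.txt", "w");
        if (f == NULL)
            return 1;

        /* Give the stream an explicit 4 KB application-level buffer. */
        static char buf[4096];
        setvbuf(f, buf, _IOFBF, sizeof buf);

        /* These small writes are coalesced in 'buf'; nothing has necessarily
         * reached the kernel (let alone the disk) yet. */
        for (int i = 0; i < 100; i++)
            fprintf(f, "record %d\n", i);

        fflush(f);   /* push the user-space buffer down to the kernel */
        fclose(f);   /* also flushes, then releases the stream */
        return 0;
    }

The kernel and the device then apply their own layers of buffering below this, as described above.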

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking: why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question about the instruction. What you propose is an asynchronous, non-blocking copy instruction rather than the synchronous, blocking memcpy available now via "rep mov".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from that of the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of the OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of asynch memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient, since at least the destination buffer remains local to the core's L1. In the case of a network copy, the copy from the kernel to a user buffer occurs just before handing the user buffer to the application. So the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy; those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
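For reference, the synchronous, blocking "rep mov" copy mentioned at the top of this answer boils down to something like the following (x86-64, GCC/Clang inline assembly; an illustrative sketch, not tuned library code):

    #include <stddef.h>

    /* Copy n bytes with the x86 string-move instruction. The destination, source
     * and count live in RDI, RSI and RCX, and the CPU blocks until the copy is done. */
    static void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

The fact that the operands sit in ordinary general-purpose registers is also what the "Heavier User Context?" discussion further down refers to.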
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted at in srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
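To make the zero-copy idea concrete, here is a purely illustrative descriptor-ring shape (hypothetical, not any real driver's API): buffers are allocated once up front and shared with the hardware, the NIC DMA-writes received frames straight into them, and only ownership of the descriptor changes hands instead of copying the payload:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical receive descriptor shared between the driver and the NIC. */
    struct rx_descriptor {
        uint8_t  *buffer;       /* preallocated, DMA-visible memory           */
        uint16_t  capacity;     /* size of the preallocated buffer            */
        uint16_t  length;       /* filled in by the NIC when a frame arrives  */
        uint8_t   owned_by_hw;  /* 1: NIC may write; 0: application may read  */
    };

    /* The application consumes frames from a ring of such descriptors,
     * parsing each frame in place and then handing the buffer back to the NIC. */
    struct rx_ring {
        struct rx_descriptor desc[64];
        size_t next_to_clean;   /* next descriptor the application will look at */
    };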
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed, then that state is not useful during the operation, and one may as well just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

How to dynamically allocate a buffer for a receiving UDP socket (VB.Net)

A friend and I are working on a project where we're required to build a reliable UDP client/server using VB.Net. We have things working well, but one thing that still eludes us is how to dynamically allocate a (byte) buffer for the incoming data. Right now we have to hard-code a maximum value/MTU (or use a really large buffer size and resize it once we've finished receiving). Does anyone know of a way this can be done without needing to specify the receive buffer size?
Basically, before calling the receive function on the socket with a buffer of size x, we want to know x so we can allocate an appropriately sized buffer. Perhaps this is a problem in all socket programming that you just have to deal with?
This is one of the burdens you'll have to take on when you use UDP. You'll have to consider Path MTU discovery. Then again, since you are making reliable UDP, you should be able to auto-detect this and dynamically switch to a smaller packet size. That will solve PMTUD problems as well.
Hopefully this doesn't sound too much like "those who don't use TCP are doomed to reinvent it." Check out the RFCs linked in that article for ideas.