Memory reference traces of packet processing applications with Intel Pin - instrumentation

I'm learning how to use Intel Pin and I have a couple of questions regarding the instrumentation process for a particular use case. I would like to create a memory reference trace of a simple packet processing application. I have developed the required pintool for that purpose, and my questions are the following.
Assuming I use the same network packet trace at all times as input to my packet processing application, let's say I instrument that same application on two different machines. How will the memory reference traces differ? Apparently Pin instruments user space and is architecture independent, so I wouldn't expect to see big qualitative differences between the two output memory reference traces. Is that assumption correct?
How will the memory trace change if I experiment with the rate at which I inject network packets into my packet processing application? Or will it change at all, and if yes, how can I detect how the output traces differ?
Thank you

I assume you are doing something related to following the data flow / code flow of the network packet, probably something close to data tainting?
Assuming I use the same network packet trace at all times as input to my packet processing application, let's say I instrument that same application on two different machines. How will the memory reference traces differ?
There are multiple factors that can make the memory traces quite different, the crucial point being "two different machines":
Exact copy of the same OS: traces nearly the same (as the stack, heap and virtual memory manager will work the same), except that addresses will change (ASLR).
Same OS (but not necessarily the same version of the system shared libraries): probably the same as above if there is no recompilation of the target application; maybe minor differences due to the heap manager, which can behave differently.
Different OS (where a recompilation of the traced application is needed): completely different traces.
Apparently Pin instruments user space and is architecture independent, so I wouldn't expect to see big qualitative differences between the two output memory reference traces. Is that assumption correct?
Pin tools need to be recompiled for different architectures, but the pintool itself should not change the way the target application is traced (same pintool + same OS + same application = nearly the same trace).
How will the memory trace change if I experiment with the rate at which I inject network packets into my packet processing application?
This is system dependent and also depends on your insertion point(s). If you start tracing at recv() or recvfrom(), there might be congestion or dropped packets (UDP) if, for example, the rate is too high. It depends on the protocol, your receive window, etc. There are really multiple factors here.
Or will it change at all, and if yes, how can I detect how the output traces differ?
I'd probably check the code flow rather than the data flow for this case (it seems easier to me). Given exactly the same packet but different rates, if the code branches are not the same (maybe at the basic block (BBL) level), that immediately tells you the same packet is handled differently.
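As a sketch of that code-flow comparison, assuming each run of the pintool dumps one basic-block address per line into a trace you then load into a vector, a minimal divergence check could look like this (the function name is illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the index of the first basic-block address where the two traces
// diverge, or -1 if one trace is a prefix of the other (or they are equal).
long first_divergence(const std::vector<uint64_t>& a,
                      const std::vector<uint64_t>& b) {
    size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i)
        if (a[i] != b[i]) return static_cast<long>(i);
    return -1;
}
```

Running this per packet against the same packet at two injection rates pinpoints exactly where the handling starts to differ.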

Related

Using openOCD to extract ETB traces on Cortex-M (STM32F4)

The STM32F4 Discovery (Cortex-M4) has an ETB, a buffer storing instruction traces. How can I use OpenOCD and the on-chip ST-Link debugger to pull the traces out of the ETB?
I am a little confused between SWO/SWD. What should I be using? Also, do I need any additional hardware for extracting traces?
Thank you
I'm afraid I have never heard of the STM32F4 including an Embedded Trace Buffer (ETB) in the implemented subset of the ARM core and its CoreSight features. I think this is because ETB is an optional feature, and ST has decided not to configure/implement this ETB option in its STM32F4 controllers and the ARM core they embed.
I looked up the programming/reference manuals and datasheet of an upper-level representative of STM32F4xx family, and I didn't find anything about ETB, which seems to confirm this assumption.
Now, ETB is not the only option if one wants to stream trace data out of one's MCU:
STM32F4 controllers all have an Instrumentation Trace Macrocell (ITM), which can alternatively deliver a software-defined character output stream, or snapshots of data values or the program counter collected either at your breakpoints or just periodically, with assistance from the Data Watchpoint and Trace (DWT) unit.
You can use the ITM
to instrument the application with character output (printf())
to profile your application
to inspect certain properties/state flows of your software by tracing program breakpoints or data watchpoints
The ITM is usable through the SWO pin, using adapter hardware like any version of ST-Link, J-Link, ULINK, etc.
It is a proper trace interface since it works without stopping the CPU at breakpoints, so examination won't break your system's real-time properties.
Many STM32F4 controllers (AFAIK, those with >= 100 pins) include an Embedded Trace Macrocell (ETM), which is able to trace the program counter (PC) and data of every CPU cycle, so you can use this one to trace the entire control flow (and data flow) of your controller, also without stopping it at any breakpoint.
The humongous amount of data to be traced (make sure you have a free USB3 port...) can only be delivered in a useful way through the synchronous port interface around the GPIOE group (alternate functions TRACECLK+TRACED0/1/2/3 => 5 pins in total), which is connected to the Trace Port Interface Unit (TPIU) next to the ETM.
In order to use this technology, you need the more expensive variants of debug adapters like J-Trace, ULINK-Pro or Lauterbach. The cheapest ETM-capable adapter I'm aware of (though I haven't used it yet) is QTrace by PDQlogic, starting around £379. The others are available for about 1-4 k£/k€/k$.
The way your question sounds tells me that you have probably just started programming STM32s. I therefore recommend getting a development board with an embedded ST-Link. This is the cheapest way to get SWD debugging running first, and then SWO trace. The Atollic blog has a nice intro on how to do that quickly.
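As an illustration of the first ITM use above (character output), the retargeting idea can be sketched on the host like this; `itm_send_char` stands in for the CMSIS-style write to an ITM stimulus port on a real Cortex-M, and `swo_capture` plays the role of the debug adapter's SWO capture (both names are made up for this sketch):

```cpp
#include <cstdint>
#include <string>

std::string swo_capture;   // stands in for the adapter's SWO capture buffer

// On a real Cortex-M this would poll the stimulus-port FIFO and then write
// the byte to the ITM stimulus port; here it just records what would be sent.
void itm_send_char(uint8_t c) {
    swo_capture.push_back(static_cast<char>(c));
}

// Retarget-style output: once the toolchain's low-level write hook is
// redirected here, printf() text ends up on the SWO pin.
void swo_puts(const char* s) {
    while (*s) itm_send_char(static_cast<uint8_t>(*s++));
}
```

On target, the only piece that changes is the body of `itm_send_char`; the retargeting structure stays the same.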

Chip to chip communication protocol over SPI

I'm trying to design an efficient communication protocol between a micro-controller on one side and an ARM processor on a multi-core TI chip on the other side through SPI.
The requirements for the needed protocol:
1 - Multi-session with queuing support: I have multiple sending/receiving threads, so more than one application will use this communication protocol, and I need the protocol to handle queuing these requests (I will keep holding the buffer while the transmission is queued; I just need the protocol to manage scheduling the queues).
2 - Works over SPI as an underlying protocol.
3 - Simple error checking.
In this thread: "Simple serial point-to-point communication protocol", PPP was a recommended option, however I see PPP does only part of the job.
I also found the lightweight IP (lwIP) project featuring PPP over serial (which I assume I can use over SPI), so I thought about the possibility of utilizing one of the upper-layer protocols like TCP/UDP to do the rest of the required jobs. Fortunately, I found that TI includes lwIP as part of their Ethernet software in the StarterWare package, which I assume eases porting, at least on the TI chip side.
So, my questions are:
1 - Is it valid to use lwIP for this communication scheme? Won't it introduce much overhead due to IP headers, which are unnecessary for point-to-point (chip-level) communication, and kill the throughput?
2 - Will the TCP or any similar protocol residing in LwIP handle the queuing of transmission requests, for example if I request transmission through a socket while the communication channel is busy transmitting/receiving request for another socket (session) of another thread, will this be managed by the protocol stack? If so, which protocol layer manages it?
3 - Is there a more efficient protocol stack than lwIP that meets the above requirements?
Update 1: More points to consider
1 - SPI is the only available option; I use it with available GPIOs to indicate to the master when the slave has data to send.
2 - The current (non-standard) protocol implementation uses DMA with SPI, and a message format of <STX_MsgID_length_payload_ETX> with a fixed message fragment length; however, the main drawback of the current scheme is that the master waits for a response to the message (not the fragment) before sending another one, which kills the throughput and does not utilise the full-duplex nature of SPI.
3 - An improvement on this point was to use a kind of mailbox for receiving fragments, so a long message can be interrupted by a higher-priority one and fragments of a single message can arrive non-sequentially. The problem is that this design led to complications, especially as I don't have many resources available for the buffers the mailbox approach needs on the controller (master) side. So I felt I was re-inventing the wheel by designing a protocol stack for a simple point-to-point link, which may not be efficient.
4 - What kind of higher-level protocols can normally be used above SPI to establish multiple sessions and solve the queuing/scheduling of messages?
Update 2: Another useful thread "A good serial communications protocol/stack for embedded devices?"
Update 3: I had a look at the Modbus protocol. It seems to specify the application layer and then directly the data link layer for serial-line communication, which appears to skip the unnecessary overhead of network-oriented protocol layers.
Do you think this would be a better option than lwIP for the intended purpose? Also, is there a widely used open-source implementation like lwIP, but for Modbus?
I think that perhaps you are expecting too much of the humble SPI.
An SPI link is little more than a pair of shift registers, one in each node. The master selects a single node to connect to its SPI shift register. As it shifts its data in, the slave simultaneously shifts data out. Data is not exchanged unless the master explicitly clocks the data out. Efficient protocols on SPI involve the slave having something useful to output while the master inputs. This may be difficult to arrange, so you usually need a means of indicating null data.
PPP is useful when establishing a connection between two arbitrary endpoints; when the endpoints are fixed and known a priori, PPP would serve no purpose other than to complicate things unnecessarily.
SPI is not a very sophisticated nor flexible interface and probably unsuited to heavyweight general purpose protocols such as TCP/IP. Since "addressing" on SPI is performed by physical chip-select, the addressing inherent in such protocols is meaningless.
Flow control is also a problem with SPI. The master has no way of determining whether the slave has copied the data from the SPI shift register before pushing more data. If your slave SPI supports DMA, you would be wise to use it.
Either way I suggest that you develop something specific to your purpose. Since SPI is not a network as such, you only need a means to address threads on the selected node. This could be as simple as STX<thread ID><length><payload>ETX.
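A minimal sketch of that STX&lt;thread ID&gt;&lt;length&gt;&lt;payload&gt;ETX idea, with a simple XOR checksum added for the "simple error checking" requirement (the names are illustrative, and the 1-byte length field assumes payloads of at most 255 bytes):

```cpp
#include <cstdint>
#include <vector>

constexpr uint8_t STX = 0x02, ETX = 0x03;

// Frame layout: STX | thread ID | length | payload... | XOR checksum | ETX.
// The checksum covers the thread ID, length and payload bytes.
std::vector<uint8_t> encode_frame(uint8_t thread_id,
                                  const std::vector<uint8_t>& payload) {
    std::vector<uint8_t> f;
    f.push_back(STX);
    f.push_back(thread_id);
    f.push_back(static_cast<uint8_t>(payload.size()));
    uint8_t csum = thread_id ^ static_cast<uint8_t>(payload.size());
    for (uint8_t b : payload) { f.push_back(b); csum ^= b; }
    f.push_back(csum);
    f.push_back(ETX);
    return f;
}
```

The receiver validates STX/length/checksum/ETX and dispatches on the thread ID, which is all the "addressing" a chip-selected point-to-point link needs.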
Added 27 September 2013 in response to comments
Generally, SPI, as its name suggests, is used to connect to peripheral devices, and in that context the protocol is defined by the peripheral. EEPROMs, for example, typically use a common, or at least compatible, command interface across vendors, and the SD/MMC card SPI interface uses a standardised command set and protocol.
Between two microcontrollers, I would imagine that most implementations are proprietary and application specific. Open protocols are designed for generic interoperability and to achieve that might impose significant unnecessary overhead for a closed system, unless perhaps the nodes were running a system that already had a network stack built in.
I would suggest that if you do want to use a generic network stack that you should abstract the SPI with device drivers at each end that give the SPI a standard I/O stream interface (open(), close(), read(), write() etc.), then you can use the higher-level PPP and TCP/IP protocols (although PPP can probably be avoided since the connection is permanent). However that would only be attractive if both nodes already supported these protocols (running Linux for example), otherwise it will be significant effort and code for little benefit, and would certainly not be "efficient".
I assume you don't really want, or have room for, a full IP (lwIP) stack on the microcontroller? That just sounds like a lot of overkill. Why not roll your own simple packet structure to move the data items you need to move? Depending on how SPI is supported on both sides, you may or may not be able to use it to define the frame for your data; if not, a simple start pattern, a length, a trailing checksum and maybe a tail pattern would suffice for finding packet boundaries in the stream (no different from a serial/UART solution). You could even use the PPP solution for that, with a start pattern and, I think, an end pattern, escaping the start pattern with a two-byte sequence whenever it happens to show up in the payload. I don't remember all the details now.
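The PPP-style escaping mentioned above is essentially HDLC byte stuffing; as a sketch, using the FLAG/ESC values from RFC 1662 for concreteness (the function name is made up):

```cpp
#include <cstdint>
#include <vector>

constexpr uint8_t FLAG = 0x7E, ESC = 0x7D;   // RFC 1662 values

// Delimit the payload with FLAG bytes; any FLAG or ESC occurring inside the
// payload is sent as ESC followed by the byte XOR 0x20, so a bare FLAG on
// the wire can only ever mean "frame boundary".
std::vector<uint8_t> stuff(const std::vector<uint8_t>& payload) {
    std::vector<uint8_t> out{FLAG};
    for (uint8_t b : payload) {
        if (b == FLAG || b == ESC) {
            out.push_back(ESC);
            out.push_back(b ^ 0x20);
        } else {
            out.push_back(b);
        }
    }
    out.push_back(FLAG);
    return out;
}
```

The receiver reverses the transform: on ESC, it un-XORs the next byte; on FLAG, it closes the current frame.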
Whatever your frame is, then add a packet type and your handshakes; or, if the data is only ever going microcontroller-to-ARM, you don't even need to do that.
To get back to your direct question: yes, I think an IP stack (lwIP or other) will introduce a lot of overhead; both the bandwidth and, more importantly, the amount of code needed to support the stack will chew up ROM/RAM on both sides. If you ultimately need to present this data in an IP fashion (say, a website hosted by the embedded system), then somewhere in the path you need an IP stack, etc.
I can't imagine that lwIP manages your queues for you; I assume you would need to do that yourself. The various queues might want to talk to a single driver that deals with the single SPI bus (assuming there is a single SPI bus with multiple chip selects). It also depends on how you are using the SPI interface: are you allowing the ARM to talk to multiple microcontrollers with the data broken up (a little from this controller, a little from that controller) so that nobody has to wait too long for a few more bytes, or will a complete frame have to move from one microcontroller before moving on to the next GPIO interrupt to pull that one's data? The long and short of it is that I assume you have to manage the shared resource just as you would in any other situation with multiple users of a shared resource (RTOS, full-blown operating system, etc.). I don't remember lwIP that well, but with a full Berkeley sockets application interface the user could write separate applications where each application only cared about one TCP or UDP port, and the libraries and drivers managed separating those packets out to each application, as well as all of the rules for the IP stack.
If you are not already experimenting with moving data over the SPI interface(s), I would start with simple experiments first, just to get a feel for how well it is or isn't going to work, the sizes of transfers you can do reliably per SPI transaction, etc. Your solution may naturally fall out of those experiments.

Memory Map for RTOS

I am looking to understand what purpose a memory map serves in an embedded system.
How does the function stack here differ from a normal Unix system?
Any insights that can help me debug a few memory-related crashes on an embedded system would be helpful.
Embedded systems, especially real-time ones, often have a lot of statically-allocated data, and/or data placed at specific locations in memory. The memory map tells you where these things are, which can be helpful when you run into problems and need to examine the state of the system. For example, you might dump all of memory and then analyze it after the fact; in such a case, the memory map will be rather handy for finding the objects you suspect might be related to the problem.
On the code side, your system might log a hardware exception that points to the address of the instruction where the exception was detected. Looking up the memory locations of functions, combined with a disassembly of the function, can help you analyze such problems.
The details really depend on what kind of embedded system you're building. If you provide more details, people may be able to give better responses.
I am not sure that I understand the question. You seem to be suggesting that a "memory map" is something unique to embedded systems or that it is a tangible software component. It is neither; it is merely a description of the layout of an application's memory usage.
All applications will have a memory map regardless of platform, the difference is that typically on an embedded system the application is linked as a single monolithic entity, so that the resultant memory layout refers to the entire system rather than an individual process as it might in an application on a GPOS platform.
It is the linker and the linker script that determines memory mapping, and your linker will be able to output a map report file that describes the layout and allocation applied. This is true of embedded and desktop applications regardless of OS or architecture.
The memory map for a RTOS is not that much different than the memory map for any computer. It defines which hardware resides at which of the processor's addresses. That hardware may be RAM, ROM, Flash, serial ports, parallel ports, timers, interrupt vectors, or any number of other parts addressable by the processor.
The memory map also describes how you intend to budget for limited resources such as RAM, ROM, or Flash in your system design.
For instance, if there are multiple tasks running, RAM might be mapped so that each task has its own specific area of RAM allocated to it.
In turn, each task's part of RAM would be mapped so that there are specific areas for the stack, another for static variables, and perhaps more again for heap(s).
When you have an operating system on the target, it looks after a lot of this dynamically. However, if your application is the only software on the device, you'll have to manage these decisions yourself, usually at compile/link time. Search for "linker scripts" for further clues.
The memory map is a layout of the memory of a system. It is present in both embedded systems and normal applications. Though it is present in normal applications, its usage is especially appreciated in embedded systems due to system constraints.
The memory map is managed by means of linker scripts or linker command files. It maps resources like flash, internal RAM (L1P, L1D, L2, L3), external RAM (DDR), ROM, peripherals (ports, serial, parallel, USB, etc.), specific device registers, and I/O ports to appropriate fixed addresses in the memory space of the system.
In the case of embedded systems, based on the memory configuration or constraints of the board and the performance requirements, segments like the text segment, data segment or BSS can also be placed in the appropriate memory of choice.
There are occasions where various versions of development boards will have different configurations of memory and peripherals. In that case, we may need to edit the linker scripts to match the memory configuration and peripherals of the board, as an essential checkpoint in board bring-up.
The memory map can help in defining shared memory too, which can play a key role in multi-threaded applications and also for multi-core applications.
Crashes can be debugged by back-tracing the address of the crash and mapping it into the memory map of the system, to get a high-level idea of the possible library or object causing the problem.
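That back-tracing step can be sketched as a lookup of the crash address in a table of regions taken from the linker's map report; the types and names here are hypothetical:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Region {
    uint32_t start;
    uint32_t size;
    std::string name;   // e.g. a section, object or symbol name from the map file
};

// Linear scan is plenty for the handful of regions in a typical map report.
std::string owner_of(uint32_t addr, const std::vector<Region>& map) {
    for (const Region& r : map)
        if (addr >= r.start && addr - r.start < r.size)
            return r.name;
    return "<unmapped>";
}
```

Feeding it a fault address from an exception log immediately tells you which section or object the faulting access landed in.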

On reset, what happens in an embedded system?

I have a few questions regarding the reset that occurs at power-up:
1. As I understand it, the microcontroller is hardwired to start at some particular memory location, say 0000H, on power-up. Is the interrupt service routine for reset (initialisation of the stack pointer, program counter, etc.) written at 0000H, or does 0000H hold the reset address (say 7000H) so that the microcontroller jumps to address 7000H, where the initialisation of the stack and PC is written?
2. Who writes this reset service routine? Is it the manufacturer of the microcontroller chip (Intel, Microchip, etc.), or can any programmer change it (for example, changing the reset address from 7000H to 4000H so that the first instruction is fetched from 4000H instead of 7000H)?
3. How are the stack pointer and program counter initialised to their initial addresses, given that on power-up the microcontroller is not yet in a state to load addresses into them (no initialisation has been done before the reset service routine runs)?
4. What should the steps in the reset service routine be, considering all possibilities?
With reference to your numbering:
The hardware reset process is processor dependent and will be fully described in the data sheet or reference manual for the part, but your description is generally the case - different architectures may have subtle variations.
While some microcontrollers include a ROM based boot-loader that may contain start-up code, typically such bootloaders are only used to load code over a communications port, either to program flash memory directly or to load and execute a secondary bootloader to RAM that then programs flash memory. As far as C runtime start-up goes, this is either provided with the compiler/toolchain, or you write it yourself in assembler. Normally even when start-up code is provided by the compiler vendor, it is supplied as source to be assembled and linked with your application. The compiler vendor cannot always know things like memory map, SDRAM mapping and timing, or processor clock speed or what oscillator crystal is used in your hardware, so the start-up code will generally need customisation or extension through initialisation stubs that you must implement for your hardware.
On ARM Cortex-M devices the initial PC and stack pointer are in fact loaded by hardware: they are stored at the reset address and loaded on power-up. In the general case, however, you are right; the reset address contains either the start-up code or a vector to the start-up code (on pre-Cortex ARM architectures, the reset address actually contains a jump instruction rather than a true vector address). Either way, the start-up code for a C/C++ runtime must at least initialise the stack pointer, initialise static data, perform any necessary C library initialisation and jump to main(). In the case of C++, it must also execute the constructors of any global static objects before calling main().
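The start-up duties listed above can be sketched as host-runnable pseudo-crt0 code, with plain arrays standing in for the linker-provided section symbols (everything here is illustrative):

```cpp
#include <cstdint>
#include <cstring>

uint8_t flash_data[4] = {1, 2, 3, 4};           // load image of .data in "flash"
uint8_t ram_data[4];                            // .data at its run-time address
uint8_t ram_bss[4] = {0xAA, 0xAA, 0xAA, 0xAA};  // deliberately dirty "BSS"

// The core C start-up steps: copy initialised data from flash to RAM and
// zero the BSS. On a real target this would then run static constructors
// and jump to main().
void crt0_init() {
    std::memcpy(ram_data, flash_data, sizeof ram_data);
    std::memset(ram_bss, 0, sizeof ram_bss);
}
```

In a real crt0 the array bounds come from linker symbols (often named along the lines of `_sidata`/`_sdata`/`_sbss`, though the names vary by toolchain), not from fixed-size arrays.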
The processor cores normally have, as you say, a starting address of some sort: either a table (a list of addresses) or, as on ARM, a place where instructions are executed. What is wrapped around that core, but within the chip, can vary. Cores that are not specific to the chip vendor (8051, MIPS, ARM, XScale, etc.) are going to have a much wider range of different answers. Some microcontroller vendors, for example, will look at strap pins, and if a strap is wired a certain way when reset is released, the part executes from a special boot flash inside the chip, a bootloader that you can, for example, use to program the user boot flash with. If the strap is not tied that way, it may boot your user code instead. One vendor I know of always boots its bootloader flash; if the vector table has a valid checksum, it jumps to the reset vector in your vector table, otherwise it sits in bootloader mode waiting for you to talk to it.
When you get into the bigger processors (non-microcontrollers), the software lives outside the processor, either in a boot flash (a separate chip from the processor) or in some RAM that is managed somehow before reset. Those usually follow the rule for the core: start at address 0xFFFFFFF0, or start at address 0x00000000; if there is garbage there, oh well, fire off the undefined-instruction vector, and if that is garbage too, just hang or sit in an infinite loop calling the undefined-instruction vector. This works well for an ARM, for example: you can build a board with a boot flash that is erased from the factory (all 0xFFs), then use JTAG to stop the ARM and program the flash the first time, and you don't have to unsolder, socket or pre-program anything. As long as your bootloader doesn't hang the ARM, you can have an unbrickable design. (Actually, you can often hold the ARM in reset and still get at it with the JTAG debugger, and not worry about bad code messing with the JTAG pins or hanging the ARM core.)
The short answer: How many different processor chip vendors have there been? There are many different solutions, as many as you can think of and more have been deployed. Placing a reset handler address in a known place in memory is the most common though.
EDIT:
Questions 2 and 3: if you are buying a chip, some of the microcontrollers have this protected bootloader, but even with that you normally write the boot code that will be used by the product. Part of that boot code initialises the stack pointers, prepares memory, brings up parts of the chip, and all those good things. Sometimes chip vendors will provide examples. If you are buying a board-level product, you will often find a board support package (BSP) with working example code to bring up the board and perhaps do a few things. The BeagleBoard and the Open-RD, or the boards from embeddedarm.com, for example, come with a bootloader (U-Boot or other), and some already have Linux pre-installed. With boards like that, the user usually just writes some Linux apps/drivers and adds them to the BSP, but you are not limited to that; you are often welcome to completely rewrite and replace the bootloader. And whoever writes the bootloader has to set up the stacks, bring up the hardware, etc.
On systems like the Game Boy Advance or the NDS, the vendor has some start-up code that calls your start-up code, so the stack and such may be set up before they hand off to you; much of the system may already be up, and you just get to decide how to slice up the memories: where you want your stack, data, program, etc.
Some vendors want to keep this stuff controlled or secret, others do not. In some cases you may end up with a board or chip with no example code, just some data sheets and reference manuals.
If you want to get into this business, though, you need to be prepared to write this start-up code (in assembler), which may call some C code to bring up the rest of the system, which then might start the main operating system or application or whatever. Microcontrollers sound like what you are playing with; the answers to your questions are in the chip vendors' user guides, and some vendors are better than others. Search for the words "reset" or "boot" in the document to try to figure out what their boot schemes are. I recommend you use "dollar votes" to choose the better vendors: don't give your money to a vendor with bad docs, secret docs or bad support; spend it on vendors with freely downloadable, well-written docs, well-written examples, and/or user forums with full-time employees answering questions. There are times when the docs are not available except to serious, paying customers; it depends on the market. Most general-purpose embedded systems, though, are openly documented; the quality varies widely, but the docs are there.
Depends completely on the controller/embedded system you use. The ones I've used in game development have the IP point at a starting address in RAM. The boot strap code supplied from the compiler initializes static/const memory, sets the stack pointer, and then jumps execution to a main() routine of some sort. Older systems also started at a fixed address, but you manually had to set the stack, starting vector table, and other stuff in assembler. A common name for the starting assembler file is CRT0.s for the stuff I've done.
So 1. You are correct. The microprocessor has to start at some fixed address.
2. The ISR can be supplied by the manufacturer or compiler creator, or you can write one yourself, depending on the complexity of the system in question.
3. The stack and initial program counter are usually handled via some sort of bootstrap routine that can quite often be overridden with your own code. See above.
Last: The steps will depend on the chip. If there is a power interruption of any sort, RAM may be scrambled and all ISR vector tables and startup code should be rewritten, and the app should be run as if it just powered up. But, read your documentation! I'm sure there is platform specific stuff there that will answer these for your specific case.

Grand Unified Theory of logging

Is there a Grand Unified Theory of logging? Shall we develop one? A question (just to show this is not a discussion :): how can I improve on the following? (Note that I live mainly in the embedded world, but non-embedded suggestions are also welcome.)
How do you log, when do you log, what do you log, what do you do with log files?
How do you log - I generally have macros, #ifdef TESTING, sort of thing. They write to RAM and a low priority process writes them out when the system is idle (using UDP, since I do embedded systems)
When do you log - same as voting, early and often. At every (in)significant program event, I log at varying levels. Events received, transaction succeed/fail, data updated, etc
What do you log - Fatal/Error/Warning/Info/Debug/Trace is covered in When to use the different log levels?
What do you do with log files - 1) keep them (in CVS), both pass and fail 2) capture everything and filter later in case I can't repeat a problem. I have tools to filter the log by "level" (Fatal/Error/etc), process, file, etc. And to draw message sequence charts, dump data structures, draw histograms of memory usage - what am I missing?
Hmmm, binary or ASCII log file format? ASCII is bulkier, but binary requires more processing. I have done both; currently I use ASCII.
Question - did I miss anything, and how can I improve on this?
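A minimal sketch of the "write to RAM, drain later" macro approach described above (the record layout, capacity and names are all illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// One fixed-size record per event keeps the logging cost tiny; a
// low-priority task formats and ships the records when the system is idle.
struct LogRec {
    uint8_t  level;   // Fatal/Error/Warning/Info/Debug/Trace
    uint16_t id;      // event identifier
    uint32_t arg;     // one payload word
};

constexpr std::size_t LOG_CAP = 64;
LogRec log_buf[LOG_CAP];
std::size_t log_head = 0;    // next slot to write
std::size_t log_count = 0;   // number of valid records (saturates at LOG_CAP)

void log_push(uint8_t level, uint16_t id, uint32_t arg) {
    log_buf[log_head] = {level, id, arg};
    log_head = (log_head + 1) % LOG_CAP;   // the oldest record is overwritten
    if (log_count < LOG_CAP) ++log_count;
}

#define LOG(lvl, id, arg) log_push((lvl), (id), (arg))
```

Compiling the macro away under `#ifdef TESTING`, and draining the buffer over UDP from an idle-priority task, matches the scheme in the question.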
You could "instrument" your code in many different ways, everything from start-up/shut-down events to individual machine instruction execution (using a processor emulator). Of all the possibilities, what's worth doing? Don't just do it for the sake of completeness; have a specific goal in mind. A business case if you like, with a benefit you expect to receive. E.g.:
Insight into CPU task execution times/patterns to enable optimisation (if you need to improve performance).
Insight into other systems to resolve system integration issues (e.g. what messages is your VoIP box sending and receiving when it connects to a particular peer?)
Insight into the nature of errors (for field diagnostics)
Aid in development
Aid in validation testing
I imagine that there's no grand unified theory of logging, because what you do would depend on many details:
- Quantity of data
- Type of data
  - Events
  - Streamed audio/video
- Available storage
  - Storage speed
  - Storage capacity
- Available channels to extract data
  - Bandwidth
  - Cost
  - Availability
    - Internet connected 24×7
    - Site visit required
      - Need to unlock a rusty gate, climb a ladder onto a roof, to plug in a cable, after filling out OHS documentation
      - Need to wait until the Antarctic winter is over and the ice sheets thaw
- Random access vs linear access (e.g. if you compress it, do you need to read from the start to decompress and access some random point?)
- Need to survive error conditions
  - Watchdog reboots
  - Possible data corruption
    - Due to failing power supply
    - Due to unreliable storage media
  - Need to survive a plane crash
As for ASCII vs binary, I usually prefer to keep the logging simple, and put any nice presentation in a PC application that decodes the data. It's usually easier to create a user-friendly presentation in PC software (written in e.g. Python) rather than in the embedded system itself.
did I miss anything, and how can I improve on this?
Asynchronous logging.
Using multiple log files for the same process for different logging abstractions. e.g. the process' activities are logged in a normal log file. And the process' stats (periodic statistics that you might be interested in) are logged in a separate stats log file.
Hmmm, binary or ASCII log file format? ASCII is bulkier, but binary requires more processing. I have done both; currently I use ASCII.
ASCII is good. More often than not, logs are meant to be used for debugging purposes. A human readable form eases and speeds this up.
However, if your logs are mostly used to record information for later analysis and report generation (e.g. stats, latencies, etc.), a binary format would be preferred. You can go one step further and use a custom format along with a DB service that does index-based sorting, where the index can be a tuple of time and event type.
--
One thing which may be helpful is to have a "maybeLogger" object which will accept log records for an operation which may or may not succeed, and then either ditch those records if the operation succeeds or fails in an uninteresting way, or log them if it does something interesting. This is relatively easy to do in something like .net. In an embedded system, it can only be done really easily if the amount of stuff to be logged is small enough to fit in free RAM, but one could probably use a garbage-collection-based approach to hold stuff in flash (have one 'stream' of data in flash for new log entries, and another for ones that are confirmed to be interesting; periodically move data which is known to be good from the first stream to the second).
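A sketch of that "maybeLogger" idea, assuming the pending records fit in RAM (the class and member names are made up):

```cpp
#include <string>
#include <vector>

// Buffers records for one operation; commit() forwards them to the real log,
// discard() drops them when the outcome turned out to be uninteresting.
struct MaybeLogger {
    std::vector<std::string> pending;  // held until the outcome is known
    std::vector<std::string>& sink;    // the real log

    explicit MaybeLogger(std::vector<std::string>& s) : sink(s) {}

    void log(const std::string& msg) { pending.push_back(msg); }
    void commit() {
        for (const std::string& m : pending) sink.push_back(m);
        pending.clear();
    }
    void discard() { pending.clear(); }
};
```

The flash-based variant described above replaces `pending` and `sink` with two flash streams and moves records between them during idle-time garbage collection.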
Here's my $0.02.
I only log when I'm having a problem and need to track down the source. Usually this has to do with a customer's environment, so I can't just attach the debugger. My solution is to enable the Telnet port and use that to print out statements as to where the program is and values of variables.
I do ASCII only because it's over telnet.
Another aspect of telnet is that it is pretty simple. It's a TCP port with text being thrown out. Very little processing other than the normal TCP headaches.
The log files are dumped as soon as I get them because I have not tried to capture and save a telnet session. I guess I could with WireShark, but I don't need a history of that session. I just need to find the problem and verify a fix.