Recovering from OOM on DirectByteBuffer allocation on a WebFlux Web Server - spring-webflux

I'm not sure if this makes sense so please comment if I need to provide more info:
My web server is used to upload files (it receives files as multipart/form-data and uploads them to another service). Using WebFlux, the controller defines the argument as @RequestPart(name = "payload") final Part payload, which wraps the headers and the content Flux.
Reactor / Netty uses DirectByteBuffers to accommodate the payload. If the request handler cannot get enough direct memory to handle the request, it fails with an OOM and returns 500. So this is normal / expected.
However, what's supposed to happen after?
I'm running load tests by sending multiple requests at the same time (either many requests with small files or fewer requests with bigger files). Once I get the first 500 due to an OOM, the system becomes unstable. Some requests go through, and others fail with an OOM (even requests with a very small payload can fail).
This behaviour leads me to believe the allocated pooled buffers are not shared between IO channels? That seems odd, though, since it would make the system very easy to DDoS.
From the tests I did, I get the same behaviour using unpooled data buffers, although for a different reason. I do see the memory being deallocated when running jcmd <PID> VM.native_memory, but it isn't released to the OS according to metrics and htop. For instance, the reserved memory shown by jcmd goes back down, but htop still reports the previous high amount, and the process eventually hits an OOM.
So Question :
Is that totally expected or am I missing a config value somewhere?
Setup :
Spring-boot 2.5.5 on openjdk11:jdk-11.0.10_9
Netty config :
-Dio.netty.allocator.type=pooled -Dio.netty.leakDetectionLevel=paranoid -Djdk.nio.maxCachedBufferSize=262144 -XX:MaxDirectMemorySize=1g -Dio.netty.maxDirectMemory=0

Related

Would a blocking web server hang to the point that it needs restarting if many HTTP clients send requests at the same time?

I have read that some web servers are described as blocking, whereas Node.js is said to be non-blocking.
Would a blocking web server hang to the point that it needs restarting if many HTTP clients send requests at the same time?
To clarify, I would not say it needs restarting if it works fine again once the flood of parallel requests has stopped.
And I currently don't understand how request buffers and overflows work for web servers.
Although it is technically possible to build a single-threaded, single-process blocking server that can only handle one request at a time, it doesn't make practical sense. Concurrency is kind of important.
The three main paradigms for parallelism (that I know of) are:
Multi-process/forking
Threading
Using an event loop/reactor pattern
Node falls in the third category, and also a bit in the second category depending on how you look at it.
Most languages can look at a socket and read from it, and immediately move on if there was nothing to read. Therefore most languages can have this non-blocking behavior.
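To make the point about reading from a socket and moving on concrete, here is a minimal sketch in Python. It is only an illustration, not how Node or any particular web server is implemented; the helper names try_read and demo are made up for this example. It uses a local socket pair, switches one end to non-blocking mode, and shows that a read with nothing pending returns immediately instead of waiting.

```python
import select
import socket


def try_read(sock, max_bytes=1024):
    """Return whatever bytes are available, or None if nothing is ready yet."""
    try:
        return sock.recv(max_bytes)
    except BlockingIOError:
        # In non-blocking mode the OS reports "no data yet" instead of waiting.
        return None


def demo():
    # Hypothetical setup: a connected local socket pair stands in for a client
    # connection so the example needs no network access.
    writer, reader = socket.socketpair()
    try:
        reader.setblocking(False)

        # Nothing has been sent yet, so the read returns at once with None.
        first = try_read(reader)

        writer.sendall(b"ping")
        # Wait (up to one second) until the OS says the socket is readable.
        select.select([reader], [], [], 1.0)
        second = try_read(reader)
        return first, second
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    print(demo())
```

An event loop is essentially this check repeated over many sockets, with select (or a similar call) telling the loop which ones are worth reading.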

Understanding Cro request/response cycle and memory use

I'm a bit confused about how Cro handles client requests and, specifically, why some requests seem to cause Cro's memory usage to balloon.
A minimal example of this shows up in the literal "Hello world!" Cro server.
use Cro::HTTP::Router;
use Cro::HTTP::Server;
my $application = route {
    get -> {
        content 'text/html', 'Hello Cro!';
    }
}
my Cro::Service $service = Cro::HTTP::Server.new:
    :host<localhost>, :port<10000>, :$application;
$service.start;
react whenever signal(SIGINT) {
    $service.stop;
    exit;
}
All that this server does is respond to GET requests with "Hello Cro!" – which certainly shouldn't be taxing. However, if I navigate to localhost:10000 and then rapidly refresh the page, I notice Cro's memory use start to climb (and then stay elevated).
This only seems to happen when the refreshes are rapid, which suggests that the issue might be related either to not properly closing connections or to a concurrency issue (a maybe-slightly-related prior question).
Is there some performance technique or best practice that this "Hello world" server has omitted for simplicity? Or am I missing something else about how Cro is designed to work?
The Cro request processing pipeline is a chain of supply blocks that requests and, later, responses pass through. Decisions about the optimal number of processing threads to create are left to the Raku ThreadPoolScheduler implementation.
So far as connection lifetime goes, it's up to the client - that is, the web browser - how eagerly connections are closed; if the browser uses a keep-alive HTTP/1.1 connection or retains an HTTP/2 connection, Cro respects that request.
Regarding memory use, growth up to a certain point isn't surprising; it's only a problem if it doesn't eventually level out. Causes include:
The scheduler determining more threads are required to handle the load. Each OS thread comes with some overhead inside the VM, the majority of it being that the GC nursery is per thread to allow simple bump-the-pointer allocation.
The MoarVM optimizer using memory for specialized bytecode and JIT-compiled machine code, which it produces in the background as the application runs, and is driven by certain bits of code having been executed enough times.
The GC trying to converge on a full collection threshold.

How to create an immediately-completed client response?

I want to stub out a JAX-RS client request. Instead of making an HTTP call, I want to return an immediately-completed client Response. I tried invoking javax.ws.rs.core.Response.ok().build(), but when the application later invokes Response.getEntity() it gets this exception:
java.lang.IllegalStateException: Method not supported on an outbound message.
at org.glassfish.jersey.message.internal.OutboundJaxrsResponse.readEntity(OutboundJaxrsResponse.java:144)
I dug into Jersey's source-code but couldn't figure out a way to do this. How can one translate a server-side Response to a client-side Response in Jersey (or more generally JAX-RS)?
Use-case
I want to stub-out the HTTP call in a development environment (not a unit test) in order to prove that the network call is responsible for a performance problem I am seeing. I tried using a profiler to do this, but the call is asynchronous with 500+ threads and some network calls return fast (~100ms) while others return much slower (~1.5 seconds). Profilers do not follow asynchronous workflows well, and even if they did they only display the average time consumed across all invocations. I need to see the timing for each individual call. Stubbing-out the network call allows me to test whether the server is returning calls with such a large delta (100ms to 1.5 seconds) or whether the surrounding code is responsible.

Async WCF and Protocol Behaviors

FYI: This will be my first real foray into Async/Await; for too long I've been settling for the familiar territory of BackgroundWorker. It's time to move on.
I wish to build a WCF service, self-hosted in a Windows service running on a remote machine in the same LAN, that does this:
Accepts a request for a single .ZIP archive
Creates the archive and packages several files
Returns the archive as its response to the request
I have to support archives as large as 10GB. Needless to say, this scenario isn't covered by basic WCF designs; we must take additional steps to meet the requirement. We must eliminate timeouts while the archive is building and memory errors while it's being sent. Both of these occur under basic WCF designs, depending on the size of the file returned.
My plan is to proceed using task-based asynchronous WCF calls and streaming mode.
I have two concerns:
Is this the proper approach to the problem?
Microsoft has done a nice job at abstracting all of this, but what of the underlying protocols? What goes on 'under the hood?' Does the server keep the connection alive while the archive is building (could be several minutes) or instead does it close the connection and initiate a new one once the operation is complete, thereby requiring me to properly route the request through the client machine firewall?
For #2, clearly I'm hoping for the former (keep-alive). But after some searching I'm not easily finding an answer. Perhaps you know.
You need streaming for big payloads. That is the right approach. This has nothing at all to do with asynchronous IO. The two are independent. The client cannot even tell that the server is async internally.
I'll add my standard answers for whether to use async IO or not:
https://stackoverflow.com/a/25087273/122718 Why does the EF 6 tutorial use asychronous calls?
https://stackoverflow.com/a/12796711/122718 Should we switch to use async I/O by default?
Each request runs over a single connection that is kept alive. This holds both when streaming large amounts of data and when there is a long initial delay. Not sure why you are concerned about routing. Does your router kill such connections? That's a problem.
Regarding keep alive, there is nothing going over the wire to do that. TCP sessions can stay open indefinitely without any kind of wire traffic.
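As a rough illustration of why streaming is what keeps memory bounded for big payloads, here is a minimal sketch in Python. It is not WCF code and says nothing about WCF's internals; stream_chunks, copy_stream and the 64 KiB chunk size are assumptions made up for this example. The idea is only that the sender holds one chunk in memory at a time, no matter how large the whole archive is.

```python
import io


def stream_chunks(source, chunk_size=64 * 1024):
    """Yield the contents of a file-like object in fixed-size pieces."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


def copy_stream(source, sink, chunk_size=64 * 1024):
    """Copy source to sink chunk by chunk.

    Returns (total bytes copied, size of the largest chunk held in memory).
    """
    total = 0
    largest = 0
    for chunk in stream_chunks(source, chunk_size):
        sink.write(chunk)
        total += len(chunk)
        largest = max(largest, len(chunk))
    return total, largest


if __name__ == "__main__":
    # Hypothetical 1 MB payload standing in for a much larger archive.
    payload = io.BytesIO(b"x" * 1_000_000)
    received = io.BytesIO()
    print(copy_stream(payload, received))
```

Whether the server side is written with async IO or with plain blocking calls does not change this picture, which matches the point that the two concerns are independent.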

Mule and memory (RAM) usage

I've tried running Mule in 3 cases in order to test its memory usage:
In one case I had a Quartz generator create an event that a filter (right after it in the flow) always stopped (returned false), meaning the flow did absolutely nothing.
In another case I did not use the filter but just used that flow to send a custom object to a WCF service running on another computer (using a CXF endpoint).
I also checked what happened when I left the flow as is but dropped the WCF service (meaning a lot of socket connection exceptions were thrown).
I did this because I am building a large app that needs this bus to work at all times (weeks at a time).
In all of those cases, the memory usage kept rising (getting as high as 200 MB of RAM after a few hours).
Are there any specific reasons this could happen? What is causing Mule to take more memory in all of these cases?
Off the top of my head, I'd point to lazy initialization of the thread pools as the explanation for this behavior. As time goes on and usage gets higher, the thread pools get fully initialized.
If you want proof, take a look at this approach, or this one (with enableStatistics).
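For readers unfamiliar with lazy thread pool initialization, here is a small sketch of the general effect in Python. It is an analogy only and does not use or describe Mule's own thread pools; the function name observe_lazy_pool is made up for this example. CPython's ThreadPoolExecutor starts worker threads on demand, so the thread count (and the memory that comes with each thread) only grows once work is actually submitted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def observe_lazy_pool(max_workers=8, tasks=4):
    """Return (threads right after creating the pool, threads while tasks run)."""
    baseline = threading.active_count()
    release = threading.Event()
    started = threading.Semaphore(0)

    def blocking_task():
        started.release()   # signal that this task is now running
        release.wait()      # keep the worker busy until told to finish

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Creating the pool does not start any worker threads yet.
        after_create = threading.active_count() - baseline

        futures = [pool.submit(blocking_task) for _ in range(tasks)]
        for _ in range(tasks):
            started.acquire()  # wait until every task is running

        # Each busy task forced the pool to start one more worker.
        after_submit = threading.active_count() - baseline

        release.set()
        for future in futures:
            future.result()

    return after_create, after_submit


if __name__ == "__main__":
    print(observe_lazy_pool())
```

If the same thing happens in a long-running bus, memory would be expected to rise as load touches more of the pool and then level off once every pool has reached its configured size.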