Which scheduler should I use for a "penalty box" when I do delayElement in WebFlux? - spring-webflux

I am trying to implement a "penalty box" for unsuccessful attempts to use our API when it is dealing with security tokens. This extends my previous question Would delayElement be susceptible to a DoS attack? which is isn't.
Here's my handler at the moment.
.onErrorResume(
SecurityException.class,
ex -> {
ServerWebExchangeUtils.setResponseStatus(
exchange, HttpStatus.UNAUTHORIZED);
ServerWebExchangeUtils.setAlreadyRouted(exchange);
securityLog.warn("security error obtaining claims: {}", ex.getMessage());
return chain
.filter(exchange)
.delayElement(
Duration.ofMillis(authProperties.getPenaltyDelayInMillis()),
penaltyScheduler
)
.then(
respondWithUnauthorized(config, exchange, "invalid_token"));
})
I have my scheduler as Schedulers.newParallel("penalty") which allows for parallel threading. But I am not sure if parallel would be the better scheduler or should I just use boundedElastic?

i see no reason as to why you would use parallel.
A parallel scheduler guarantees that work will be scheduled over several cores. Running things in parallel is only beneficial if you need raw computing power, for instance if you need to do heavy mathematical calculations.
Using a parallel scheduler also has a setup time as we need to send instructions to the other cpus, they need to allocate threads etc.
What you want is to allocate threads that are going to most likely spend 99% of their time waiting, which means there is no reason whatsoever that you need the multiple cpu core guarantee.
boundedElastic will create a scheduler with an upper bounded limit which will suffice. Async work is more about orchestration, and not raw cpu-power so the framework will handle the threading and optimization for you as efficient as possible.
Thats the reason for using webflux, so we dont need to think about those things.

Related

How to get basic coroutine health metrics?

tl;dr: How can I get a glimpse of basic state for a coroutine dispatcher?
Specifically:
How many total coroutines are active?
Is the dispatcher exceeding its maximum parallelism? (e.g. are there coroutines stuck waiting)
Context:
I'm running a game server where coroutines are used to listen on ports, handle asynchronous work, manage game state, etc. It is very easy to introduce subtle latency bugs where someone adds a new coroutine inside of a request and then the coroutine load substantially increases and request latency suffers (and in rare cases, deadlocks).
Optimal coroutine scheduling is integral to a good user experience for my service and is something I want to measure. Are there any available health metrics I can observe related to coroutines?

Can I use Dispatchers.Default for CPU bound/intensive operations?

I wanted to understand best practices/approaches for performing blocking CPU operations in production grade server side applications. (Note that I am talking about server side and not Android apps)
What I mean by blocking CPU operation?
Operation which runs on and consumes CPU, for example, huge matrix multiplication, huge data transformation done in while loop etc.
My Setup
We are building kotlin dsl powered by coroutines
We have created our own CoroutineScope with single thread for non blocking operations so that mutating variables from callback are thread safe because all of them runs on same thread
We are recommending users to use withContext(Dispatchers.IO) for doing all the blocking IO operations
Open Questions
We have fixed strategy to do non blocking ops and thread safe mutations as well as doing IO ops as mentioned above
We are exploring options to perform blocking CPU bound operations as mentioned in the beginning
Some of the Solutions
Use Dispatchers.Default for do blocking cpu operations
Does anyone foresee any issue with using Default dispatcher?
What if other part of code/transitive libs are also using Default dispatcher?
Create separate CoroutineScope with fixed threads (usually equals to no. CPU threads) and then ask our users to use that isolated scope for running any blocking CPU bound operations
Which approach we should go with? pros and cons? Is there any other alternative good approach?
Side Note
Coming from Scala, it is usually not preferred to use Global execution context and Dispatchers.Default somehow maps to global context

How to best convert legacy polling embedded firmware architecture into an event driven one?

I've got a family of embedded products running a typical main-loop based firmware with 150+k lines of code. A load of complex timing critical features is realized by a combination of hardware interrupt handlers, timer polling and protothreads (think co-routines). Actually protothreads are polling well and "only" are syntactic sugar to mimic pseudo-parallel scheduling of multiple threads (endless loops). I add bugfixes and extensions to the firmware all the time. There are about 30k devices of about 7 slightly different hardware types and versions out in the field.
For a new product-family member I need to integrate an external FreeRTOS-based project on the new product only while all older products need to get further features and improvements.
In order to not have to port the whole complex legacy firmware to FreeRTOS with all the risk of breaking perfectly fine products I plan to let the old firmware run inside a FreeRTOS task. On the older products the firmware shall still not run without FreeRTOS. Inside the FreeRTOS task the legacy firmware would consume all available processor time because its underlying implementation scheme is polling based. Due to the fact that its using prothreads (timer and hardware polling based behind the scenes) and polls on a free-running processor counter register I hope that I can convert the polling into a event driven behavior.
Here are two examples:
// first example: do something every 100 ms
if (GET_TICK_COUNT() - start > MS(100))
{
start = GET_TICK_COUNT();
// do something every 100 ms
}
// second example: wait for hardware event
setup_hardware();
PT_WAIT_UNTIL(hardware_ready(), pt);
// hardware is ready, do something else
So I got the impression that I can convert these 2 programming patterns (e.g. through macro magic and FreeRTOS functionality) into an underlying event based scheme.
So now my question: has anybody done such a thing already? Are there patterns and or best practices to follow?
[Update]
Thanks for the detailed responses. Let me comment some details: My need is to combine a "multithreading-simulation based legacy firmware (using coo-routines implementation protothreads)" and a FreeRTOS based project being composed of a couple of interacting FreeRTOS tasks. The idea is to let the complete old firmware run in its own RTOS task besides the other new tasks. I'm aware of RTOS principles and patterns (pre-emption, ressource sharing, blocking operations, signals, semaphores, mutexes, mailboxes, task priorities and so forth). I've planed to base the interaction of old and new parts exactly on these mechanisms. What I'm asking for is 1) Ideas how to convert the legacy firmware (150k+ LOC) in a semi-automated way so that the busy-waiting / polling schemes I presented above are using either the new mechanisms when run inside the RTOS tasks or just work the old way when build and run as the current main-loop-kind of firmware. A complete rewrite / full port of the legacy code is no option. 2) More ideas how to teach the old firmware implementation that is used to have full CPU resources on its hands to behave nicely inside its new prison of a RTOS task and not just consume all CPU cycles available (when given the highest priority) or producing new large Real Time Latencies when run not at the highest RTOS priority.
I guess here is nobody who already has done such a very special tas. So I just have to do the hard work and solve all the described issues one after another.
In an RTOS you create and run tasks. If you are not running more than one task, then there is little advantage in using an RTOS.
I don't use FreeRTOS (but have done), but the following applies to any RTOS, and is pseudo-code rather then FreeRTOS API specific - many details such as task priorities and stack allocation are deliberately missing.
First in most simple RTOS, including FreeRTOS, main() is used for hardware initialisation, task creation, and scheduler start:
int main( void )
{
// Necessary h/w & kernel initialisation
initHardware() ;
initKernel() ;
// Create tasks
createTask( task1 ) ;
createTask( task2 ) ;
// Start scheduling
schedulerStart() ;
// schedulerStart should not normally return
return 0 ;
}
Now let us assume that your first example is implemented in task1. A typical RTOS will have both timer and delay functions. The simplest to use is a delay, and this is suitable when the periodic processing is guaranteed to take less than one OS tick period:
void task1()
{
// do something every 100 ms
for(;;)
{
delay( 100 ) ; // assuming 1ms tick period
// something
...
}
}
If the something takes more than 1ms in this case, it will not be executed every 100ms, but 100ms plus the something execution time, which may itself be variable or non-deterministic leading to undesirable timing jitter. In that case you should use a timer:
void task1()
{
// do something every 100 ms
TIMER timer = createTimer( 100 ) ; // assuming 1ms tick period
for(;;)
{
timerWait() ;
// something
...
}
}
That way something can take up to 100ms and will still be executed accurately and deterministically every 100ms.
Now to your second example; that is a little more complex. If nothing useful can happen until the hardware is initialised, then you may as well use your existing pattern in main() before starting the scheduler. However as a generalisation, waiting for something in a different context (task or interrupt) to occur is done using a synchronisation primitive such as a semaphore or task event flag (not all RTOS have task event flags). So in a simple case in main() you might create a semaphore:
createSemaphore( hardware_ready ) ;
Then in the context performing the process that must complete:
// Init hardware
...
// Tell waiting task hardware ready
semaphoreGive( hardware_ready ) ;
Then in some task that will wait for the hardware to be ready:
void task2()
{
// wait for hardware ready
semaphoreTake( hardware_ready ) ;
// do something else
for(;;)
{
// This loop must block is any lower-priority task
// will run. Equal priority tasks may run is round-robin
// scheduling is implemented.
...
}
}
You are facing two big gotchas...
Since the old code is implemented using protothreads (coroutines), there is never any asynchronous resource contention between them. If you split these into FreeRTOS tasks, there there will be preemptive scheduling task switches; these switches may occur at places where the protothreads were not expecting, leaving data or other resources in an inconsistent state.
If you convert any of your protothreads' PT_WAIT calls into real waits in FreeRTOS, the call will really block. But the protothreads assume that other protothreads continue while they're blocked.
So, #1 implies you cannot just convert protothreads to tasks, and #2 implies you must convert protothreads to tasks (if you use FreeRTOS blocking primitives, like xEventGroupWaitBits().
The most straightforward approach will be to put all your protothreads in one task, and continue polling within that task.

Should I try to use as many queues as possible?

On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workload that is highly independent use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically what you are doing is supplying GPU with some alternative work it can do (and fill stalls and bubbles and idles with and giving GPU the choice) in the case of same queue family. And there is some potential to better use CPU (e.g. singlethreaded vs one queue per thread).
Using separate transfer queues (or other specialized family) seem to be the recommended approach even.
That is generally speaking. More realistic, empirical, sceptical and practical view was already presented by SW and NB answers. In reality one does have to be bit more cautious as those queues target the same resources, have same limits, and other common restrictions, limiting potential benefits gained from this. Notably, if the driver does the wrong thing with multiple queues, it may be very very bad for cache.
This AMD's Leveraging asynchronous queues for concurrent execution(2016) discusses a bit how it maps to their HW\driver. It shows potential benefits of using separate queue families. It says that although they offer two queues of compute family, they did not observe benefits in apps at that time. They say they have only one graphics queue, and why.
NVIDIA seems to have a similar idea of "asynch compute". Shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics, and one async compute queue though on current HW. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for Host->Device transfers. And the non-dedicated should be used for device->device transfer ops.
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may get actually worse performance than just using one queue.
Note that even if you submit to only one queue an implementation may execute command buffers in parallel and even out-of-order (aka "in-flight"), see details on this in chapter chapter 2.2 of the specs or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and a synchronization) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
Since you can submit multiple independent workload in the same queue, and it doesn't seem there is any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.

Grand Central Dispatch vs NSThreads?

I searched a variety of sources but don't really understand the difference between using NSThreads and GCD. I'm completely new to the OS X platform so I might be completely misinterpreting this.
From what I read online, GCD seems to do the exact same thing as basic threads (POSIX, NSThreads etc.) while adding much more technical jargon ("blocks"). It seems to just overcomplicate the basic thread creation system (create thread, run function).
What exactly is GCD and why would it ever be preferred over traditional threading? When should traditional threads be used rather than GCD? And finally is there a reason for GCD's strange syntax? ("blocks" instead of simply calling functions).
I am on Mac OS X 10.6.8 Snow Leopard and I am not programming for iOS - I am programming for Macs. I am using Xcode 3.6.8 in Cocoa, creating a GUI application.
Advantages of Dispatch
The advantages of dispatch are mostly outlined here:
Migrating Away from Threads
The idea is that you eliminate work on your part, since the paradigm fits MOST code more easily.
It reduces the memory penalty your application pays for storing thread stacks in the application’s memory space.
It eliminates the code needed to create and configure your threads.
It eliminates the code needed to manage and schedule work on threads.
It simplifies the code you have to write.
Empirically, using GCD-type locking instead of #synchronized is about 80% faster or more, though micro-benchmarks may be deceiving. Read more here, though I think the advice to go async with writes does not apply in many cases, and it's slower (but it's asynchronous).
Advantages of Threads
Why would you continue to use Threads? From the same document:
It is important to remember that queues are not a panacea for
replacing threads. The asynchronous programming model offered by
queues is appropriate in situations where latency is not an issue.
Even though queues offer ways to configure the execution priority of
tasks in the queue, higher execution priorities do not guarantee the
execution of tasks at specific times. Therefore, threads are still a
more appropriate choice in cases where you need minimal latency, such
as in audio and video playback.
Another place where I haven't personally found an ideal solution using queues is daemon processes that need to be constantly rescheduled. Not that you cannot reschedule them, but looping within a NSThread method is simpler (I think). Edit: Now I'm convinced that even in this context, GCD-style locking would be faster, and you could also do a loop within a GCD-dispatched operation.
Blocks in Objective-C?
Blocks are really horrible in Objective-C due to the awful syntax (though Xcode can sometimes help with autocompletion, at least). If you look at blocks in Ruby (or any other language, pretty much) you'll see how simple and elegant they are for dispatching operations. I'd say that you'll get used to the Objective-C syntax, but I really think that you'll get used to copying from your examples a lot :)
You might find my examples from here to be helpful, or just distracting. Not sure.
While the answers so far are about the context of threads vs GCD inside the domain of a single application and the differences it has for programming, the reason you should always prefer GCD is because of multitasking environments (since you are on MacOSX and not iOS). Threads are ok if your application is running alone on your machine. Say, you have a video edition program and want to apply some effect to the video. The render is going to take 10 minutes on a machine with eight cores. Fine.
Now, while the video app is churning in the background, you open an image edition program and play with some high resolution image, decide to apply some special image filter and your image application being clever detects you have eight cores and starts eight threads to process the image. Nice isn't it? Except that's terrible for performance. The image edition app doesn't know anything about the video app (and vice versa) and therefore both will request their respectively optimum number of threads. And there will be pain and blood while the cores try to switch from one thread to another, because to avoid starvation the CPU will eventually let all threads run, even though in this situation it would be more optimal to run only 4 threads for the video app and 4 threads for the image app.
For a more detailed reference, take a look at http://deusty.blogspot.com/2010/11/introducing-gcd-based-cocoahttpserver.html where you can see a benchmark of an HTTP server using GCD versus thread, and see how it scales. Once you understand the problem threads have for multicore machines in multi-app environments, you will always want to use GCD, simply because threads are not always optimal, while GCD potentially can be since the OS can scale thread usage per app depending on load.
Please, remember we won't have more GHz in our machines any time soon. From now on we will only have more cores, so it's your duty to use the best tool for this environment, and that is GCD.
Blocks allow for passing a block of code to execute. Once you get past the "strange syntax", they are quite powerful.
GCD also uses queues which if used properly can help with lock free concurrency if the code executing in the separate queues are isolated. It's a simpler way to offer background and concurrency while minimizing the chance for deadlocks (if used right).
The "strange syntax" is because they chose the caret (^) because it was one of the few symbols that wasn't overloaded as an operator in C++
See:
https://developer.apple.com/library/ios/#documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Although you might think rewriting your code for dispatch queues would
be difficult, it is often easier to write code for dispatch queues
than it is to write code for threads. The key to writing your code is
to design tasks that are self-contained and able to run
asynchronously. (This is actually true for both threads and dispatch
queues.)
...
Although you would be right to point out that two tasks running in a
serial queue do not run concurrently, you have to remember that if two
threads take a lock at the same time, any concurrency offered by the
threads is lost or significantly reduced. More importantly, the
threaded model requires the creation of two threads, which take up
both kernel and user-space memory. Dispatch queues do not pay the same
memory penalty for their threads, and the threads they do use are kept
busy and not blocked.
GCD (Grand Central Dispatch): GCD provides and manages FIFO queues to which your application can submit tasks in the form of block objects. Work submitted to dispatch queues are executed on a pool of threads fully managed by the system. No guarantee is made as to the thread on which a task executes. Why GCD over threads :
How much work your CPU cores are doing
How many CPU cores you have.
How much threads should be spawned.
If GCD needs it can go down into the kernel and communicate about resources, thus better scheduling.
Less load on kernel and better sync with OS
GCD uses existing threads from thread pool instead of creating and then destroying.
Best advantage of the system’s hardware resources, while allowing the operating system to balance the load of all the programs currently running along with considerations like heating and battery life.
I have shared my experience with threads, operating system and GCD AT http://iosdose.com