Running jobs concurrently with multiple steps - nextflow

I need to run 10,000 pipelines, each consisting of 5 steps (processes).
Another requirement is that I want to run about 300 pipelines concurrently: 300 should start, and whenever a pipeline finishes all 5 steps, a new one should start. I couldn't find how to do this using channels.
Some initial thoughts:
Start by splitting the channel of 10,000 into buffers of 300 items.
But that doesn't help with starting a new pipeline when one ends...
proteins = Channel.fromPath( '/some/path/*.fa' ).buffer( size: 300 )

process A {
    input:
    file query_file from proteins

    output:
}

process B {
}

You can achieve approximately what you want using the following in your nextflow.config:
process {
    maxForks = 300
}

executor {
    queueSize = 300
}
The maxForks directive sets the maximum number of process instances that can be executed in parallel. By setting this value in your nextflow.config, we ensure it is applied across the board to each of your five processes. If you have other processes that you don't want covered by this directive, you can of course use one or more process selectors to limit this configuration to just the processes you want, as sketched below. Alternatively, just add the directive to each of your five process definitions.
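For example, a selector might look like this (a sketch only; it assumes your five processes are actually named A through E):

process {
    withName: 'A|B|C|D|E' {
        maxForks = 300
    }
}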
The executor queueSize just defines the number of tasks the executor will handle in parallel.
This of course won't guarantee the completion of a chunk of five processes before starting a new chunk, but that usually isn't much of a concern.

How to add 2 minutes delay between jobs in a queue?

I am using Hangfire in ASP.NET Core with a server that has 20 workers, which means 20 jobs can run at the same time.
What I need is to enqueue them one by one with a 2-minute delay between each one and the next. Each job can take 1-45 minutes, and I don't have a problem running jobs concurrently, but I do have a problem starting 20 jobs at the same time. That's why changing the worker count to 1 is not practical for me (it would slow the process down a lot).
The idea is that I just don't want 2 jobs to start in the same second, since this may cause conflicts in my logic; but if the second job starts 2 minutes after the first one, then I am good.
How can I achieve that?
You can use BackgroundJob.Schedule() to run your job at a specific time:
BackgroundJob.Schedule(() => Console.WriteLine("Hello"), dateTimeToExecute);
Based on that, set a date for the first job to execute, and then push this date back by 2 minutes for each new job.
Something like this:
var dateStartDate = DateTime.Now;
foreach (var j in listOfjobsToExecute)
{
    // Each job is scheduled 2 minutes after the previous one
    BackgroundJob.Schedule(() => j.Run(), dateStartDate);
    dateStartDate = dateStartDate.AddMinutes(2);
}
See more here:
https://docs.hangfire.io/en/latest/background-methods/calling-methods-with-delay.html?highlight=delay

Increase delay and load over time in jMeter

I'm trying to get an increasing delay/pause and load over time in JMeter load testing while keeping the sequence constant. For example -
Initially - 10 samples (of 2 get requests - a,b,a,b,a,b...)
Then after 10 samples, a delay/pause of 10 secs and then 20 samples (a,b,a,b,a,b...)
After the 20 samples, another delay/pause of 20 secs; then 30 samples (a,b,a,b,a,b...)
And so on.
Constraints here being -
Getting exact number of samples
Getting the desired delay
The order of requests should be maintained
The Critical Section Controller helps with maintaining the order of threads, but only in a normal Thread Group. So if I try the Ultimate Thread Group to get the desired variable delay and load, the order and number of samples go haywire.
I've tried the following-
Run test group consecutively
Flow control action
Throughput controller
Module controller
Interleave controller
Synchronizing timer (with and without flow control)
Add think times to children
Is there any way to get this output in JMeter? Or should I just opt for scripting?
Add User Defined Variables and set the following variables there:
samples=10
delay=10
Add Thread Group and specify the required number of threads and iterations
Add Loop Controller under the Thread Group and set "Loop Count" to ${samples}. Put your requests under the Loop Controller
Add a JSR223 Sampler and put the following code into the Script area:
// pause for the current delay (in seconds), then bump it by 10 for the next round
def delay = vars.get('delay') as long
sleep(delay * 1000)
def new_delay = delay + 10
vars.put('delay', new_delay as String)

// bump the number of samples by 10 for the next round as well
def samples = vars.get('samples') as int
def new_samples = samples + 10
vars.put('samples', new_samples as String)

Flink: How to process rest of finite stream with combination of countWindowAll()

// assume the following logic
val source = arrayOf(1,2,3,4,5,6,7,8,9,10,11,12) // 12 elements in total
val env = StreamExecutionEnvironment.createLocalEnvironment(1)
val input = env.fromCollection(source)
    .countWindowAll(5)
    .aggregate(...) // pack them into a List<Int> for bulk upload to the DB
    .addSink(...)   // sends the bulk
When I execute it, only the first 10 elements are processed; the remaining 2 are thrown away, and Flink shuts down without processing them.
The only workaround I see, since I fully control the source data, is to push some well-known IGNORABLE_VALUES into the source collection to fill out the window size and then ignore them in the sink... but I think there must be a more professional way to do this in Flink.
You have a finite stream of 12 elements and a window that triggers for every 5 elements. So the first window gets 5 elements and triggers, then the next 5 arrive and it triggers again, but when the last 2 arrive the job knows that no more elements are coming. Since there aren't 5 elements in the window, the trigger doesn't fire and nothing is done with them. Typical workarounds are a custom trigger that also fires on the final watermark of a finite stream, or doing the batching in a RichSinkFunction that flushes whatever is still buffered in close().

JMeter HTTP Request Post Body from File

I am trying to send an HTTP request via JMeter. I have created a thread group with a loop count of 25. I have a ramp-up period of 120 and the number of threads set to 30. Within the thread group, I have 20 HTTP Requests. I am a little confused as to how JMeter runs these requests. Does each of the 20 requests within a thread group run in a single thread, with each loop over the thread group running concurrently on a different thread? Or does each of the 20 requests run in a different thread as and when one becomes available?
My other question is: over each loop, I want to vary the body of the post data that is being sent via the HTTP request. Is it possible to pass the post data body via a file instead of inserting the data into the JMeter Body Data tab?
Instead of hard-coding the body, I want to define some kind of variable that picks a file based on the iteration of the thread group that is running; for example, if it is looping over the thread group the second time, I want to use test2.txt, the third time test3.txt, etc., and these text files will contain different post data. Could anyone tell me if this is possible with JMeter and, if so, how I would go about doing it?
Point 1 - JMeter concurrency
JMeter starts with 1 thread and spawns more threads per the configured ramp-up. In your case (30 threads and 120 seconds of ramp-up), another thread is added every 4 seconds. Each thread executes the 20 requests and, if there is another loop, starts over; if there is no loop left, the thread shuts down. To control load and concurrency, JMeter provides 2 options:
Synchronizing Timer - pause all threads till specified threshold is reached and then release all of them at the same time
Constant Throughput Timer - to specify the load in requests per minute.
Point 2 - Send file instead of text
You can replace your request body with the __FileToString function. If you want to parametrize it, you can use a nested function to provide the current iteration - see below.
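For example (a sketch; it assumes files named test1.txt, test2.txt, ... in the JMeter working directory), the entire Body Data could be replaced with:

${__FileToString(test${__counter(TRUE,)}.txt,,)}

Here __counter(TRUE,) keeps a separate counter per thread, so the file number follows each user's iteration.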
Point 3 - adding iteration as a parameter
JMeter provides 2 options on how you can increment a counter each loop
Counter config element - starts from a specified value and gets incremented by a specified value each time it's called.
__counter function - starts from 1 and gets incremented by 1 each time it's called. Can be "per-user" or "global".
See How to Use JMeter Functions post series for comprehensive information on above and more JMeter functions.

Scalable delayed task execution with Redis

I need to design a Redis-driven scalable task scheduling system.
Requirements:
Multiple worker processes.
Many tasks, but long periods of idleness are possible.
Reasonable timing precision.
Minimal resource waste when idle.
Should use synchronous Redis API.
Should work for Redis 2.4 (i.e. no features from upcoming 2.6).
Should not use other means of RPC than Redis.
Pseudo-API: schedule_task(timestamp, task_data). Timestamp is in integer seconds.
Basic idea:
Listen for upcoming tasks on list.
Put tasks to buckets per timestamp.
Sleep until the closest timestamp.
If a new task appears with timestamp less than closest one, wake up.
Process all upcoming tasks with timestamp ≤ now, in batches (assuming that task execution is fast).
Make sure that concurrent worker wouldn't process same tasks. At the same time, make sure that no tasks are lost if we crash while processing them.
So far I can't figure out how to fit this in Redis primitives...
Any clues?
Note that there is a similar old question: Delayed execution / scheduling with Redis? In this new question I introduce more details (most importantly, many workers). So far I was not able to figure out how to apply old answers here — thus, a new question.
Here's another solution that builds on a couple of others [1]. It uses the Redis WATCH command to remove the race condition without using the Lua scripting that arrives in Redis 2.6.
The basic scheme is:
Use a redis zset for scheduled tasks and redis queues for ready to run tasks.
Have a dispatcher poll the zset and move tasks that are ready to run into the redis queues. You may want more than 1 dispatcher for redundancy but you probably don't need or want many.
Have as many workers as you want which do blocking pops on the redis queues.
I haven't tested it :-)
The foo job creator would do:
import json
import time

def schedule_task(queue, data, delay_secs):
    # This calculation for run_at isn't great - it won't deal well with daylight
    # savings changes, leap seconds, and other time anomalies. Improvements
    # welcome :-)
    run_at = time.time() + delay_secs
    # The zset member must be a string, so serialize the task as JSON.
    # If you're using redis-py's Redis class and not StrictRedis, swap run_at
    # and the member.
    redis.zadd(SCHEDULED_ZSET_KEY, run_at, json.dumps({'queue': queue, 'data': data}))

schedule_task('foo_queue', foo_data, 60)
The dispatcher(s) would look like:
while working:
    # WATCH the zset so that the MULTI/EXEC below fails (and we retry) if
    # another dispatcher modifies it first. With redis-py, watch/multi/exec
    # are issued through a pipeline object.
    redis.watch(SCHEDULED_ZSET_KEY)
    min_score = 0
    max_score = time.time()
    results = redis.zrangebyscore(
        SCHEDULED_ZSET_KEY, min_score, max_score, start=0, num=1, withscores=False)
    if results is None or len(results) == 0:
        redis.unwatch()
        sleep(1)
    else:  # len(results) == 1
        task = json.loads(results[0])
        redis.multi()
        redis.rpush(task['queue'], task['data'])
        redis.zrem(SCHEDULED_ZSET_KEY, results[0])
        redis.exec()
The foo worker would look like:
while working:
    # blpop returns a (key, value) tuple, or None on timeout
    task_data = redis.blpop('foo_queue', POP_TIMEOUT)
    if task_data:
        foo(task_data[1])
[1] This solution is based on not_a_golfer's, one at http://www.saltycrane.com/blog/2011/11/unique-python-redis-based-queue-delay/, and the redis docs for transactions.
You didn't specify the language you're using. You have at least 3 ready-made alternatives, at least in Python, that let you do this without writing a single line of code:
Celery has an optional redis broker.
http://celeryproject.org/
resque is an extremely popular redis task queue using redis.
https://github.com/defunkt/resque
RQ is a simple and small redis based queue that aims to "take the good stuff from celery and resque" and be much simpler to work with.
http://python-rq.org/
You can at least look at their design if you can't use them.
But to answer your question - what you want can be done with redis. I've actually written more or less that in the past.
EDIT:
As for modeling what you want on redis, this is what I would do:
queuing a task with a timestamp will be done directly by the client - you put the task in a sorted set with the timestamp as the score and the task as the value (see ZADD).
A central dispatcher wakes every N seconds, checks the earliest timestamps in this set, and if there are tasks ready for execution, it pushes them to a "to be executed NOW" list. This can be done with ZREVRANGEBYSCORE on the "waiting" sorted set, getting all items with timestamp <= now, so you get all the ready items at once. Pushing is done with RPUSH.
workers use BLPOP on the "to be executed NOW" list, wake when there is something to work on, and do their thing. This is safe since redis is single threaded, and no 2 workers will ever take the same task.
once finished, the workers put the result back in a response queue, which is checked by the dispatcher or another thread. You can add a "pending" bucket to avoid failures or something like that.
so the code will look something like this (this is just pseudo code):
client:
    ZADD "new_tasks" <TIMESTAMP> <TASK_INFO>

dispatcher:
    while working:
        tasks = ZREVRANGEBYSCORE "new_tasks" <NOW> 0  # this will only take tasks with timestamp lower/equal than now
        for task in tasks:
            # do the delete and queue as a transaction
            MULTI
            RPUSH "to_be_executed" task
            ZREM "new_tasks" task
            EXEC
        sleep(1)
I didn't add the response queue handling, but it's more or less like the worker:
worker:
    while working:
        task = BLPOP "to_be_executed" <TIMEOUT>
        if task:
            response = work_on_task(task)
            RPUSH "results" response
Edit: a stateless atomic dispatcher:

while working:
    # atomically read and remove the first item of the zset
    MULTI
    ZREVRANGE "new_tasks" 0 1
    ZREMRANGEBYRANK "new_tasks" 0 1
    task = EXEC
    # this is the only risky place - you can solve it by using Lua internally in 2.6
    SADD "tmp" task
    if task.timestamp <= now:
        MULTI
        RPUSH "to_be_executed" task
        SREM "tmp" task
        EXEC
    else:
        MULTI
        ZADD "new_tasks" task.timestamp task
        SREM "tmp" task
        EXEC
    sleep(RESOLUTION)
If you're looking for a ready-made solution in Java, Redisson is right for you. It allows you to schedule and execute tasks (with cron-expression support) in a distributed way on Redisson nodes, using the familiar ScheduledExecutorService API backed by a Redis queue.
Here is an example. First, define a task using the java.lang.Runnable interface. Each task can access the Redis instance via an injected RedissonClient object.
public class RunnableTask implements Runnable {

    @RInject
    private RedissonClient redissonClient;

    @Override
    public void run() {
        RMap<String, Integer> map = redissonClient.getMap("myMap");
        long result = 0;
        for (Integer value : map.values()) {
            result += value;
        }
        redissonClient.getTopic("myMapTopic").publish(result);
    }
}
Now it's ready to submit to the ScheduledExecutorService:
RScheduledExecutorService executorService = redisson.getExecutorService("myExecutor");
ScheduledFuture<?> future = executorService.schedule(new RunnableTask(), 10, TimeUnit.MINUTES);
future.get();
// or cancel it
future.cancel(true);
Examples with cron expressions:
executorService.schedule(new RunnableTask(), CronSchedule.of("10 0/5 * * * ?"));
executorService.schedule(new RunnableTask(), CronSchedule.dailyAtHourAndMinute(10, 5));
executorService.schedule(new RunnableTask(), CronSchedule.weeklyOnDayAndHourAndMinute(12, 4, Calendar.MONDAY, Calendar.FRIDAY));
All tasks are executed on Redisson node.
A combined approach seems plausible:
No new task timestamp may be less than current time (clamp if less). Assuming reliable NTP synch.
All tasks go to bucket-lists at keys, suffixed with task timestamp.
Additionally, all task timestamps go to a dedicated zset (key and score — timestamp itself).
New tasks are accepted from clients via separate Redis list.
Loop: Fetch oldest N expired timestamps via zrangebyscore ... limit.
BLPOP with timeout on new tasks list and lists for fetched timestamps.
If we got an old task, process it. If it's a new one, add it to its bucket and to the zset.
Check whether processed buckets are empty. If so, delete the list and the entry from the zset. Probably do not check very recently expired buckets, to safeguard against time synchronization issues. End loop.
Critique? Comments? Alternatives?
Lua
I made something similar to what's been suggested here, but optimized the sleep duration to be more precise. This solution is good if you have few inserts into the delayed task queue. Here's how I did it with a Lua script:
local laterChannel = KEYS[1]
local nowChannel = KEYS[2]
local currentTime = tonumber(KEYS[3])

-- peek at the earliest scheduled event and its timestamp
local first = redis.call("zrange", laterChannel, 0, 0, "WITHSCORES")
if #first ~= 2 then
    -- nothing is scheduled: sleep "forever"
    return "2147483647"
end

local execTime = tonumber(first[2])
local event = first[1]

if currentTime >= execTime then
    -- due now: move the event from the zset to the ready list
    redis.call("zrem", laterChannel, event)
    redis.call("rpush", nowChannel, event)
    return "0"
else
    -- not due yet: tell the caller how long to sleep
    return tostring(execTime - currentTime)
end
It uses two "channels": laterChannel is a ZSET and nowChannel is a LIST. Whenever it's time to execute a task, the event is moved from the ZSET to the LIST. The Lua script responds with how many ms the dispatcher should sleep until the next poll. If the ZSET is empty, sleep forever. If it's time to execute something, do not sleep (i.e. poll again immediately). Otherwise, sleep until it's time to execute the next task.
So what if something is added while the dispatcher is sleeping?
This solution works in conjunction with keyspace events. You basically need to subscribe to the key of laterChannel, and whenever there is an add event, you wake up all the dispatchers so they can poll again.
Then you have another dispatcher that uses the blocking left pop on nowChannel. This means:
You can run the dispatcher across multiple instances (i.e. it scales)
The polling is atomic, so you won't have any race conditions or double events
The task is executed by any of the instances that are free
There are ways to optimize this even more. For example, instead of returning "0", you fetch the next item from the zset and return the correct amount of time to sleep directly.
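To make this concrete, the dispatcher's polling loop could be driven from redis-py roughly like this (a sketch; it assumes the script source is stored in a LUA_POLL string and that scores in the zset are millisecond timestamps, matching the script above):

import time

import redis

r = redis.Redis()
poll = r.register_script(LUA_POLL)  # the Lua script shown above

while True:
    # The script returns the number of ms to sleep; "0" means an event was
    # just moved to nowChannel, so poll again immediately.
    sleep_ms = int(poll(keys=['laterChannel', 'nowChannel', int(time.time() * 1000)]))
    if sleep_ms > 0:
        time.sleep(sleep_ms / 1000.0)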
Expiration
If you cannot use Lua scripts, you can use keyspace events on expired keys.
Subscribe to the channel and receive the event when Redis evicts the key. Then, grab a lock: the first instance to do so moves the event to a list (the "execute now" channel). That way you don't have to worry about sleeps and polling; Redis will tell you when it's time to execute something.
execute_later(timestamp, eventId, event) {
    SET eventId event EXP timestamp
    SET "lock:" + eventId, ""
}

subscribeToEvictions(eventId) {
    var deletedCount = DEL eventId
    if (deletedCount == 1) {
        // move to list
    }
}
This however has its own downsides. For example, if you have many nodes, all of them will receive the event and try to grab the lock. But I still think it's overall fewer requests than anything suggested here.
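For illustration, a subscriber along those lines could look roughly like this in redis-py (a sketch; it assumes expired-key notifications are enabled, which requires Redis 2.8+ with notify-keyspace-events including Ex, and that the lock key is the one being deleted):

import redis

r = redis.Redis()
p = r.pubsub()
# expired-key events for db 0; the message payload is the expired key name
p.subscribe('__keyevent@0__:expired')

for message in p.listen():
    if message['type'] != 'message':
        continue
    event_id = message['data'].decode()
    # Only the first instance whose DEL succeeds wins the lock and moves
    # the event to the "execute now" list.
    if r.delete('lock:' + event_id) == 1:
        r.rpush('execute_now', event_id)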