Avoid running same job by two EC2 instances - apscheduler

I am using APScheduler in decorator way to run jobs at certain intervals. The problem is that when below code is deployed in two EC2 instances then same job runs twice at same with difference in milliseconds.
My question is : How to avoid running same job by two EC2 instances at same time or Do I need to follow different code design pattern in this case. I want to run this job only once either by one of the severs.
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler
sched = BlockingScheduler()
sched.start()
#sched.scheduled_job('interval', id='my_job_id', hours=2)
def job_function():
print("Hello World")
If you can share any locking mechanism examples it would be appreciable

You can use AWS-SDK/AWS-CLI by using AWS-SDK/AWS-CLI you can set
If instance_id = "your instance id"
Write your code here
Now your cron will get execute on each instances you have and your code will be executed from that specific instance.

Related

Scrapyd: How to cancel all jobs with one command?

I am running over 40 spiders which are until now scheduled via cron and issued via scrapy crawl Due to several reasons I am now switching to scrapyd, one of them is to be able to see which jobs are running in case I need to do maintenance and reboot - so I can cancel a job.
Is it possible to cancel multiple jobs at once? I noticed that multiple jobs might be running at once with many waiting in quene with status "pending". Stopping the crawl might therefore require multiple calls of the cancel.json endpoint.
How to stop (or better pause) all jobs?
The scrapyd API (as of v1.3.0) does not support pausing. It does have stopping one job per call, however, so you have to loop the jobs yourself
I took #kolas's script from this question and update it to work with python 3.
import json, os
PROJECT_NAME = "MY_PROJECT"
cd = os.system('curl http://localhost:6800/listjobs.json?project={} > kill_job.text'.format(PROJECT_NAME))
with open('kill_job.text', 'r') as f:
a = json.loads(f.readlines()[0])
pending_jobs = list(a.values())[2]
for job in pending_jobs:
job_id = job['id']
kill = 'curl http://localhost:6800/cancel.json -d project={} -d job={}'.format(PROJECT_NAME, job_id)
os.system(kill)

Run Job every 4 days but first run should happen now

I am trying to setup APScheduler to run every 4 days, but I need the job to start running now. I tried using interval trigger but I discovered it waits the specified period before running. Also I tried using cron the following way:
sched = BlockingScheduler()
sched.add_executor('processpool')
#sched.scheduled_job('cron', day='*/4')
def test():
print('running')
One final idea I got was using a start_date in the past:
#sched.scheduled_job('interval', seconds=10, start_date=datetime.datetime.now() - datetime.timedelta(hours=4))
but that still waits 10 seconds before running.
Try this instead:
#sched.scheduled_job('interval', days=4, next_run_time=datetime.datetime.now())
Similar to the above answer, only difference being it uses add_job method.
scheduler = BlockingScheduler()
scheduler.add_job(dump_data, trigger='interval', days=21,next_run_time=datetime.datetime.now())

How redis pipe-lining works in pyredis?

I am trying to understand, how pipe lining in redis works? According to one blog I read, For this code
Pipeline pipeline = jedis.pipelined();
long start = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
pipeline.set("" + i, "" + i);
}
List<Object> results = pipeline.execute();
Every call to pipeline.set() effectively sends the SET command to Redis (you can easily see this by setting a breakpoint inside the loop and querying Redis with redis-cli). The call to pipeline.execute() is when the reading of all the pending responses happens.
So basically, when we use pipe-lining, when we execute any command like set above, the command gets executed on the server but we don't collect the response until we executed, pipeline.execute().
However, according to the documentation of pyredis,
Pipelines are a subclass of the base Redis class that provide support for buffering multiple commands to the server in a single request.
I think, this implies that, we use pipelining, all the commands are buffered and are sent to the server, when we execute pipe.execute(), so this behaviour is different from the behaviour described above.
Could someone please tell me what is the right behaviour when using pyreids?
This is not just a redis-py thing. In Redis, pipelining always means buffering a set of commands and then sending them to the server all at once. The main point of pipelining is to avoid extraneous network back-and-forths-- frequently the bottleneck when running commands against Redis. If each command were sent to Redis before the pipeline was run, this would not be the case.
You can test this in practice. Open up python and:
import redis
r = redis.Redis()
p = r.pipeline()
p.set('blah', 'foo') # this buffers the command. it is not yet run.
r.get('blah') # pipeline hasn't been run, so this returns nothing.
p.execute()
r.get('blah') # now that we've run the pipeline, this returns "foo".
I did run the test that you described from the blog, and I could not reproduce the behaviour.
Setting breakpoints in the for loop, and running
redis-cli info | grep keys
does not show the size increasing after every set command.
Speaking of which, the code you pasted seems to be Java using Jedis (which I also used).
And in the test I ran, and according to the documentation, there is no method execute() in jedis but an exec() and sync() one.
I did see the values being set in redis after the sync() command.
Besides, this question seems to go with the pyredis documentation.
Finally, the redis documentation itself focuses on networking optimization (Quoting the example)
This time we are not paying the cost of RTT for every call, but just one time for the three commands.
P.S. Could you get the link to the blog you read?

Celery: Task Singleton?

I have a task that I need to run asynchronously from the web page that triggered it. This task runs rather long, and as the web page could be getting a lot of these requests, I'd like celery to only run one instance of this task at a given time.
Is there any way I can do this in Celery natively? I'm tempted to create a database table that holds this state for all the tasks to communicate with, but it feels hacky.
You probably can create a dedicated worker for that task configured with CELERYD_CONCURRENCY=1 then all tasks on that worker will run synchronously
You can use memcache/redis for that.
There is an example on the celery official site - http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html
And if you prefer redis (This is a Django realization, but you can also easily modify it for your needs):
from django.core.cache import cache
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
class SingletonTask(Task):
def __call__(self, *args, **kwargs):
lock = cache.lock(self.name)
if not lock.acquire(blocking=False):
logger.info("{} failed to lock".format(self.name))
return
try:
super(SingletonTask, self).__call__(*args, **kwargs)
except Exception as e:
lock.release()
raise e
lock.release()
And then use it as a base task:
#shared_task(base=SingletonTask)
def test_task():
from time import sleep
sleep(10)
This realization is nonblocking. If you want next task to wait for the previous task change blocking=False to blocking=True and add timeout

Celery task schedule (Celery, Django and RabbitMQ)

I want to have a task that will execute every 5 minutes, but it will wait for last execution to finish and then start to count this 5 minutes. (This way I can also be sure that there is only one task running) The easiest way I found is to run django application manage.py shell and run this:
while True:
result = task.delay()
result.wait()
sleep(5)
but for each task that I want to execute this way I have to run it's own shell, is there an easy way to do it? May be some king custom ot django celery scheduler?
Wow it's amazing how no one understands this person's question. They are asking not about running tasks periodically, but how to ensure that Celery does not run two instances of the same task simultaneously. I don't think there's a way to do this with Celery directly, but what you can do is have one of the tasks acquire a lock right when it begins, and if it fails, to try again in a few seconds (using retry). The task would release the lock right before it returns; you can make the lock auto-expire after a few minutes if it ever crashes or times out.
For the lock you can probably just use your database or something like Redis.
You may be interested in this simpler method that requires no changes to a celery conf.
#celery.decorators.periodic_task(run_every=datetime.timedelta(minutes=5))
def my_task():
# Insert fun-stuff here
All you need is specify in celery conf witch task you want to run periodically and with which interval.
Example: Run the tasks.add task every 30 seconds
from datetime import timedelta
CELERYBEAT_SCHEDULE = {
"runs-every-30-seconds": {
"task": "tasks.add",
"schedule": timedelta(seconds=30),
"args": (16, 16)
},
}
Remember that you have to run celery in beat mode with the -B option
manage celeryd -B
You can also use the crontab style instead of time interval, checkout this:
http://ask.github.com/celery/userguide/periodic-tasks.html
If you are using django-celery remember that you can also use tha django db as scheduler for periodic tasks, in this way you can easily add trough the django-celery admin panel new periodic tasks.
For do that you need to set the celerybeat scheduler in settings.py in this way
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
To expand on #MauroRocco's post, from http://docs.celeryproject.org/en/v2.2.4/userguide/periodic-tasks.html
Using a timedelta for the schedule means the task will be executed 30 seconds after celerybeat starts, and then every 30 seconds after the last run. A crontab like schedule also exists, see the section on Crontab schedules.
So this will indeed achieve the goal you want.
Because of celery.decorators deprecated, you can use periodic_task decorator like that:
from celery.task.base import periodic_task
from django.utils.timezone import timedelta
#periodic_task(run_every=timedelta(seconds=5))
def my_background_process():
# insert code
Add that task to a separate queue, and then use a separate worker for that queue with the concurrency option set to 1.