Doing an atomic update of the first instance in a QuerySet - sql

I'm working on a system which has to handle a number of race-conditions when serving jobs to a number of worker-machines.
The clients would query the system for jobs with status='0' (ToDo), then, in an atomic way, update the 'oldest' row with status='1' (Locked) and retrieve the id for that row (for updating the job with worker information like which machine is working on it etc.).
The main issue here is that there might be any number of clients updating at the same time. A solution would be to lock around 20 of the rows with status='0', update the oldest one and release all the locks again afterwards. I've been looking into the TransactionMiddleware but I don't see how this would prevent the case of the oldest one being updated from under me after I query it.
I've looked into QuerySet.update(), and it looks promising, but if two clients get hold of the same record, the status would simply be updated twice, and we would have two workers working on the same job. I'm really at a loss here.
I also found ticket #2705 which seems to handle the case nicely, but I have no idea how to get the code from there because of my limited SVN experience (the last updates are simply diffs, but I don't know how to merge that with the trunk of the code).
Code: Result = Job
class Result(models.Model):
    """
    Result: completed- and pending runs
    'ToDo': job hasn't been acquired by a client
    'Locked': job has been acquired
    'Paused'
    """
    # relations
    run = models.ForeignKey(Run)
    input = models.ForeignKey(Input)
    PROOF_CHOICES = (
        (1, 'Maybe'),
        (2, 'No'),
        (3, 'Yes'),
        (4, 'Killed'),
        (5, 'Error'),
        (6, 'NA'),
    )
    proof_status = models.IntegerField(
        choices=PROOF_CHOICES,
        default=6,
        editable=False)
    STATUS_CHOICES = (
        (0, 'ToDo'),
        (1, 'Locked'),
        (2, 'Done'),
    )
    result_status = models.IntegerField(choices=STATUS_CHOICES, editable=False, default=0)
    # != 'None' => status = 'Done'
    proof_data = models.FileField(upload_to='results/',
                                  null=True, blank=True)
    # part of the proof_data
    stderr = models.TextField(editable=False,
                              null=True, blank=True)
    realtime = models.TimeField(editable=False,
                                null=True, blank=True)
    usertime = models.TimeField(editable=False,
                                null=True, blank=True)
    systemtime = models.TimeField(editable=False,
                                  null=True, blank=True)
    # updated when client sets status to locked
    start_time = models.DateTimeField(editable=False)
    worker = models.ForeignKey('Worker', related_name='solved',
                               null=True, blank=True)

To merge #2705 into your django, you need to download it first:
cd <django-dir>
wget -O for_update_11366_cdestigter.diff "http://code.djangoproject.com/attachment/ticket/2705/for_update_11366_cdestigter.diff?format=raw"
then rewind svn to the necessary django version:
svn update -r11366
then apply it (patch reads the diff from standard input):
patch -p1 < for_update_11366_cdestigter.diff
It will inform you which files were patched successfully and which were not. In the unlikely case of conflicts you can fix them manually by looking at http://code.djangoproject.com/attachment/ticket/2705/for_update_11366_cdestigter.diff
To unapply the patch, just write
svn revert --recursive .

If your django is running on one machine (and in a single process, since a threading.Lock is not shared between processes), there is a much simpler way to do it... Excuse the pseudo-code as the details of your implementation aren't clear.
from threading import Lock
workers_lock = Lock()
def get_work(request):
    workers_lock.acquire()
    try:
        # Imagine this method exists for brevity
        work_item = WorkItem.get_oldest()
        work_item.result_status = 1
        work_item.save()
    finally:
        workers_lock.release()
    return work_item

You have two choices off the top of my head. One is to lock rows immediately upon retrieval and only release the lock once the appropriate one has been marked as in use. The problem here is that no other client process can even look at the jobs which don't get selected. If you're always just automatically selecting the last one then it may be a brief enough of a window to be o.k. for you.
The other option would be to bring back the rows that are open at the time of the query, but to then check again whenever the client tries to grab a job to work with. When a client attempts to update a job to work on it a check would first be done to see if it's still available. If someone else has already grabbed it then a notification would be sent back to the client. This allows all of the clients to see all of the jobs as snapshots, but if they are constantly grabbing the latest one then you might have the clients constantly receiving notifications that a job is already in use. Maybe this is the race condition to which you're referring?
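The "check again when the client grabs a job" idea can be folded into a single conditional UPDATE, so the check and the claim happen in one statement. The following is only a minimal sketch under assumptions: it uses an in-memory SQLite database instead of Django, the `jobs` table and the `claim_oldest_job` helper are hypothetical names, and "oldest" is taken to mean "lowest id".

```python
import sqlite3

def claim_oldest_job(conn, worker):
    """Try to claim the oldest ToDo job; return its id, or None if none is left."""
    while True:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to do
        job_id = row[0]
        # The WHERE clause re-checks the status, so only one client can win.
        cur = conn.execute(
            "UPDATE jobs SET status = 1, worker = ? WHERE id = ? AND status = 0",
            (worker, job_id),
        )
        conn.commit()
        if cur.rowcount == 1:
            return job_id
        # Someone else claimed it between our SELECT and UPDATE; try the next one.

# Hypothetical demo data: two open jobs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status INTEGER, worker TEXT)")
conn.executemany("INSERT INTO jobs (status) VALUES (?)", [(0,), (0,)])
conn.commit()

first = claim_oldest_job(conn, "worker-a")
second = claim_oldest_job(conn, "worker-b")
third = claim_oldest_job(conn, "worker-c")
```

A client that loses the race simply sees a row count of 0 and moves on to the next candidate, so no explicit notification round trip is needed.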
One way to get around that would be to return the jobs in specific groups to the clients so that they are not always getting the same lists. For example, break them down by geographic area or even just randomly. For example, each client could have an ID of 0 to 9. Take the mod of an ID on the jobs and send back those jobs with the same ending digit to the client. Don't limit it to just those jobs though, as you don't want there to be jobs that you can't reach. So for example if you had clients of 1, 2, and 3 and a job of 104 then no one would be able to get to it. So, once there aren't enough jobs with the correct ending digit jobs would start coming back with other digits just to fill the list. You might need to play around with the exact algorithm here, but hopefully this gives you an idea.
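As a rough illustration of the grouping idea, here is a small sketch that prefers jobs whose id matches the client's bucket and then fills the list with other jobs so that nothing becomes unreachable. The function name `jobs_for_client`, the bucket count and the list limit are all assumptions made for the example.

```python
def jobs_for_client(job_ids, client_id, num_buckets=10, limit=5):
    """Return up to `limit` job ids, preferring ids in the client's bucket."""
    bucket = client_id % num_buckets
    preferred = [j for j in job_ids if j % num_buckets == bucket]
    others = [j for j in job_ids if j % num_buckets != bucket]
    # Fill with non-matching jobs so that no job is unreachable.
    return (preferred + others)[:limit]

open_jobs = [101, 102, 103, 104, 111, 112]
batch_for_1 = jobs_for_client(open_jobs, client_id=1, limit=3)
batch_for_3 = jobs_for_client(open_jobs, client_id=3, limit=3)
```

Clients 1 and 3 now start from different jobs, and job 104 still shows up once a client's preferred jobs run out.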
How you lock the rows in your database in order to update them and/or send back the notifications will largely depend on your RDBMS. In MS SQL Server you could wrap all of that work nicely in a stored procedure as long as user intervention isn't needed in the middle of it.
I hope this helps.

Related

How should I avoid sending duplicate emails using mailgun, taskqueue and ndb?

I am using the taskqueue API to send multiple emails in small groups with mailgun. My code looks more or less like this:
class CpMsg(ndb.Model):
    group = ndb.KeyProperty()
    sent = ndb.BooleanProperty()
    # Other properties
def send_mail(messages):
    """Sends a request to mailgun's API"""
    # Some code
    pass
class MailTask(TaskHandler):
    def post(self):
        p_key = utils.key_from_string(self.request.get('p'))
        msgs = CpMsg.query(
            CpMsg.group==p_key,
            CpMsg.sent==False).fetch(BATCH_SIZE)
        if msgs:
            send_mail(msgs)
            for msg in msgs:
                msg.sent = True
            ndb.put_multi(msgs)
        # Call the task again in COOLDOWN seconds
The code above has been working fine, but according to the docs, the taskqueue API guarantees that a task is delivered at least once, so tasks should be idempotent. Most of the time this would be the case with the above code, since it only gets messages that have the 'sent' property equal to False. The problem is that non-ancestor ndb queries are only eventually consistent, which means that if the task is executed twice in quick succession the query may return stale results and include the messages that were just sent.
I thought of including an ancestor for the messages, but since the sent emails will be in the thousands I'm worried that may mean having large entity groups, which have a limited write throughput.
Should I use an ancestor to make the queries? Or maybe there is a way to configure mailgun to avoid sending the same email twice? Should I just accept the risk that in some rare cases a few emails may be sent more than once?
One possible approach to avoid the eventual consistency hurdle is to make the query a keys_only one, then iterate through the message keys to get the actual messages by key lookup (strong consistency), check if msg.sent is True and skip sending those messages in such case. Something along these lines:
msg_keys = CpMsg.query(
    CpMsg.group==p_key,
    CpMsg.sent==False).fetch(BATCH_SIZE, keys_only=True)
if not msg_keys:
    return
msgs = ndb.get_multi(msg_keys)
msgs_to_send = []
for msg in msgs:
    if not msg.sent:
        msgs_to_send.append(msg)
if msgs_to_send:
    send_mail(msgs_to_send)
    for msg in msgs_to_send:
        msg.sent = True
    ndb.put_multi(msgs_to_send)
You'd also have to make your post call transactional (with the @ndb.transactional() decorator).
This should address the duplicates caused by the query eventual consistency. However there still is room for duplicates caused by transaction retries due to datastore contention (or any other reason) - as the send_mail() call isn't idempotent. Sending one message at a time (maybe using the task queue) could reduce the chance of that happening. See also GAE/P: Transaction safety with API calls

Can a telegram bot block a specific user?

I have a telegram bot that, for any received message, runs a program on the server and sends its result back. But there is a problem! If a user sends too many messages to my bot (spamming), it makes the server very busy!
Is there any way to block people who send more than 5 messages in a second and stop receiving their messages? (using the Telegram API!)
First, I have to say that the Telegram Bot API does not have such a capability itself. Therefore you will need to implement it on your own, and all you need to do is:
Count the number of messages that a user sends within a second, which won't be easy without a database. But if you have a database with a table called Black_List and save all the messages with their sent-time in another table, you'll be able to count the number of messages sent via one specific ChatID in a pre-defined time period (in your case, 1 second) and check whether the count is bigger than 5. If it is, you can insert that ChatID into the Black_List table.
Every time the bot receives a message, it must run a database query to see whether the sender's ChatID exists in the Black_List table. If it exists, the bot should continue its own job and ignore the message (or it can even send an alert to the user saying: "You're blocked.", which I think can be time consuming).
Note that, as far as I know, the current Telegram Bot API doesn't have a feature to stop receiving messages, but as I mentioned above you can ignore the messages from spammers.
In order to save time, You should avoid making a database connection
every time the bot receives an update(message), instead you can load
the ChatIDs that exist in the Black_List to a DataSet and update the
DataSet right after the insertion of a new spammer ChatID to the
Black_List table. This way the number of the queries will reduce
noticeably.
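The counting and blacklist steps above can also be kept entirely in memory. The following is a minimal sketch, assuming a single bot process; the `SpamFilter` class and its parameters are hypothetical and not part of the Telegram Bot API. It counts messages per chat ID in a sliding window and ignores a chat once it exceeds the limit.

```python
import time
from collections import defaultdict, deque

class SpamFilter:
    def __init__(self, max_messages=5, window_seconds=1.0):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # chat_id -> timestamps of recent messages
        self.blacklist = set()            # in-memory stand-in for the Black_List table

    def allow(self, chat_id, now=None):
        """Return True if the message should be processed, False if ignored."""
        if chat_id in self.blacklist:
            return False
        now = time.monotonic() if now is None else now
        stamps = self.recent[chat_id]
        stamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_messages:
            self.blacklist.add(chat_id)
            return False
        return True

spam_filter = SpamFilter(max_messages=5, window_seconds=1.0)
# Seven messages from chat 42, sent 0.1 seconds apart.
results = [spam_filter.allow(42, now=0.1 * i) for i in range(7)]
```

The `now` argument exists only to make the example deterministic; a real handler would call `spam_filter.allow(chat_id)` and return early when it yields False.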
I have achieved it this way:
import cachetools
from datetime import datetime, timedelta
# Using TTLCache as a time-limited dict; you can adjust the ttl.
ttl_cache = cachetools.TTLCache(maxsize=128, ttl=60)
def check_user_msg_frequency(message):
    print(ttl_cache)
    # .get() avoids a KeyError if the entry is missing or has already expired
    msg_cnt = ttl_cache.get(message.from_user.id, 0)
    if msg_cnt > 3:
        now = datetime.now()
        until = now + timedelta(seconds=60*10)
        bot.restrict_chat_member(message.chat.id, message.from_user.id, until_date=until)
def set_user_msg_frequency(message):
    if not ttl_cache.get(message.from_user.id):
        ttl_cache[message.from_user.id] = 1
    else:
        ttl_cache[message.from_user.id] += 1
With these two functions, you can record how many messages each user has sent in the period. If a user sends more messages than expected, they are restricted.
Then, every handler you define should call these two functions:
@bot.message_handler(commands=['start', 'help'])
def handle_start_help(message):
    set_user_msg_frequency(message)
    check_user_msg_frequency(message)
I'm using the pyTelegramBotAPI module for this.
I know I'm late to the party, but here is another simple solution that doesn't use a DB:
Create a ConversationState class to attach to each Telegram ID when they start to chat with the bot.
Then add a LastMessage DateTime variable to the ConversationState class.
Now every time you receive a message, check whether enough time has passed since the LastMessage DateTime; if not enough time has passed, answer with a warning message.
You can also implement a timer that deletes the conversation state class if you are worried about performance.
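A minimal sketch of the ConversationState idea might look like this. The one-second minimum gap, the `states` dictionary and the `should_warn` helper are assumptions made for illustration; they are not taken from any bot library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional

MIN_GAP = timedelta(seconds=1)  # assumed minimum time between two messages

@dataclass
class ConversationState:
    last_message: Optional[datetime] = None

states: Dict[int, ConversationState] = {}  # Telegram ID -> state

def should_warn(telegram_id: int, now: datetime) -> bool:
    """Return True if this message arrived too soon after the previous one."""
    state = states.setdefault(telegram_id, ConversationState())
    too_soon = state.last_message is not None and now - state.last_message < MIN_GAP
    state.last_message = now
    return too_soon

t0 = datetime(2020, 1, 1, 12, 0, 0)
first = should_warn(1, t0)
second = should_warn(1, t0 + timedelta(milliseconds=200))
third = should_warn(1, t0 + timedelta(seconds=5))
```

Only the second message, sent 200 ms after the first, would trigger the warning here.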

Using multiple threads for DB updates results in higher write time per update

So I have a script that is supposed to update a giant table (Postgres). Since the table has about 150m rows and I want to complete this as fast as possible, using multiple threads seemed like a perfect answer. However, I'm seeing something very weird.
When I use a single thread, the write time to an update is much much lower than when I use multiple threads.
require 'sequel'
.....
DB = Sequel.connect(DB_CREDS)
queue = Queue.new
read_query = query = DB["
  SELECT id, extra_fields
  FROM objects
  WHERE XYZ IS FALSE
"]
read_query.use_cursor(:rows_per_fetch => 1000).each do |row|
  queue.push(row)
end
Up until this point, IMO it shouldn't matter because we're just reading stuff from the DB and it has nothing to do with writing. From here, I've tried two approaches. Single-threaded and Multi-threaded.
NOTE - This is not the actual UPDATE query that I want to execute, it's just a pseudo one for demonstration purposes. The actual query is a lot longer and plays with JSON and stuff so I can't really update the entire table using a single query.
Single-threaded
until queue.empty?
  photo = queue.shift
  id = photo[:id]
  update_query = DB["
    UPDATE objects
    SET XYZ = TRUE
    WHERE id = #{id}
  "]
  result = update_query.update
end
If I execute this, I see in my DB logs that each update query takes less than 0.01 seconds:
I, [2016-08-15T10:45:48.095324 #54495] INFO -- : (0.001441s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395179
I, [2016-08-15T10:45:48.103818 #54495] INFO -- : (0.008331s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395181
I, [2016-08-15T10:45:48.106741 #54495] INFO -- : (0.002743s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395182
Multi-threaded
MAX_THREADS = 5
num_threads = 0
all_threads = []
until queue.empty?
  if num_threads < MAX_THREADS
    photo = queue.shift
    num_threads += 1
    all_threads << Thread.new {
      id = photo[:id]
      update_query = DB["
        UPDATE photos
        SET cv_tagged = TRUE
        WHERE id = #{id}
      "]
      result = update_query.update
      num_threads -= 1
      Thread.exit
    }
  end
end
all_threads.each do |thread|
  thread.join
end
Now, in theory it should be faster, right? But each update takes about 0.5 seconds. I'm surprised that this is the case.
I, [2016-08-15T11:02:10.992156 #54583] INFO -- : (0.414288s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498834
I, [2016-08-15T11:02:11.097004 #54583] INFO -- : (0.622775s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498641
I, [2016-08-15T11:02:11.097074 #54583] INFO -- : (0.415521s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498826
Any ideas on -
Why this is happening?
How can I increase the update speed for multiple threads approach.
1. Have you configured Sequel so that it has a connection pool of 5 connections?
2. Have you considered doing multiple updates per call via an IN clause?
If you haven't done 1, you have N threads fighting over N-n connections, which equates to resource starvation, which is a classic concurrency issue.
Your example can be reduced to: DB[:objects].where(:XYZ=>false).update(:XYZ=>true)
I'm guessing your actual need is not that simple. But the same approach may still work. Instead of issuing a query per row, use a single query to update all related rows.
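To illustrate the "one query for many rows" idea independently of Sequel, here is a small sketch in Python with an in-memory SQLite table. It is an assumption-laden stand-in for the real Postgres setup: the table layout, the batch size of 4 and the `chunks` helper are made up for the example. The point is only that ten rows are updated with three statements instead of ten.

```python
import sqlite3

def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, xyz INTEGER NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO objects (id) VALUES (?)", [(i,) for i in range(1, 11)])

ids = [row[0] for row in conn.execute("SELECT id FROM objects WHERE xyz = 0")]
statements = 0
for batch in chunks(ids, 4):
    # One UPDATE per batch, with one bound placeholder per id.
    placeholders = ",".join("?" for _ in batch)
    conn.execute(f"UPDATE objects SET xyz = 1 WHERE id IN ({placeholders})", batch)
    statements += 1
conn.commit()  # a single commit covers all batches

remaining = conn.execute("SELECT COUNT(*) FROM objects WHERE xyz = 0").fetchone()[0]
```

With a real workload the batch size would be much larger (hundreds or thousands of ids), and the per-row logic that cannot be expressed in SQL would run before the batch is sent.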
I went through something similar on a project ("import all history from a legacy database into a new one with completely different structure and organization"). Unless you managed to shoot yourself in the foot somewhere else, you have 2 basic bottlenecks to look for:
the database's disk IO
the ruby process' CPU
Some suggestions,
database IO: use DB transactions, update 1000 records per transaction (you can tweak the exact number but 1000 is usually good). Outside an explicit transaction every UPDATE is committed on its own, and each commit has to be flushed to disk; on a huge table, which usually means a lot of indexes too, that per-statement overhead adds up quickly. A transaction basically allows you to push 1000 updated records and pay the commit cost once, and the result is MUCH faster (something like an order of magnitude)
database IO: change indexes, drop every index you can live without during the update process, ideally you will have only 1 very streamlined index which allows unique row lookups for update purposes
ruby CPU: unless you are using JRuby or Rubinius, or REALLY paying the price of network latency to your DB, threads won't give you much benefit; use fork/processes instead (see GIL). You did a great job choosing Sequel over AR for this
ruby CPU: if you decide to go threads + JRuby with this, don't forget to try and plug in JProfiler; it's amazing at tracing bottlenecks in Java and the author of Sidekiq swears it is amazing for JRuby too. Unfortunately, afaik, there is no equivalent of JProfiler for C Ruby (there are profiling tools, but nowhere near as useful)
After you implement these suggestions you know you did all you could when:
all of the CPUs on the Ruby box are on 100% load
the hard disk IO of the DB is on 100% throughput
Find this sweet spot and don't add additional ruby update threads/processes after that (or add more hardware) and that's that
PS check out https://github.com/ruby-concurrency/concurrent-ruby - it's a great parallelization lib

Load Runner Session ID Changes Indefinitely

Good day
I'm trying to perform load testing with LoadRunner 11. Here's the issue:
I have an automatically generated script from recording my actions.
I need to capture the session ID. I do it with web_reg_save_param() in the following way:
web_reg_save_param("S_ID",
    "LB=Set-Cookie: JSESSIONID=",
    "RB=; Path=/app/;",
    LAST);
web_add_cookie("S_ID; DOMAIN={host}");
I catch ID from the response (Tree View):
D2B6F5B05A1366C395F8E86D8212F324
Compare it with Replay Log and see:
"S_ID = 75C78912AE78D26BDBDE73EBD9ADB510".
Compare 2 IDs above with the next request ID and see 3rd ID (Tree View):
80FE367101229FA34EB6429F4822E595
Why do I have 3 different IDs?
Let me know if I have to provide extra information.
You should use Search=All, as in the code below, provided your left and right boundaries are correct. Note that the cookie is passed as name=value, with the captured parameter in braces:
web_reg_save_param("S_ID",
    "LB=Set-Cookie: JSESSIONID=",
    "RB=; Path=/app/;",
    "Search=All",
    LAST);
web_add_cookie("JSESSIONID={S_ID}; DOMAIN={host}");
For details, refer to the HP manual for the web_reg_save_param function.
I do not see what the conflict or controversy is here. Yes, items related to state or session will definitely change from user to user, one recording session to the next. They may even change from one request to the next. You may need to record several times to identify the change and use pattern for when you need to collect and when you need to reuse the collected data from a response in a subsequent request.
Take a listen to this podcast. It should help
http://www.perfbytes.com/dynamic-data-correlation

Dynamically change the periodic interval of celery task at runtime

I have a periodic celery task running once per minute, like so:
# tasks.py
@periodic_task(run_every=(crontab(hour="*", minute="*", day_of_week="*")))
def scraping_task():
    result = pollAPI()
Where the function pollAPI(), as you might have guessed from the name, polls an API. The catch is that the API has an undisclosed rate limit and sometimes gives an error response if that limit is hit. I'd like to be able to take that response and, if the limit is hit, lengthen the periodic task interval dynamically (or even put the task on pause for a while). Is this possible?
I read in the docs about overwriting the is_due method of schedules, but I am lost on exactly what to do to give the behaviour I'm looking for here. Could anyone help?
You could try using celery.conf.update to update your CELERYBEAT_SCHEDULE.
You can add a model in the database that will store the information if the rate limit is reached. Before doing an API poll, you can check the information in the database. If there is no limit, then just send an API request.
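The "check a stored flag before polling" approach can be sketched without Celery or Django. In the example below, `RateLimitState` is a hypothetical in-memory stand-in for the database model, `RateLimited` is an assumed exception raised by the polling function, and the 600-second pause is an arbitrary choice.

```python
import time

class RateLimited(Exception):
    """Assumed to be raised by the poll function when the API limit is hit."""

class RateLimitState:
    """In-memory stand-in for the database row that records a rate-limit hit."""
    def __init__(self):
        self.paused_until = 0.0

    def record_limit_hit(self, now, pause_seconds):
        self.paused_until = now + pause_seconds

    def may_poll(self, now):
        return now >= self.paused_until

def scraping_task(state, poll, now=None, pause_seconds=600):
    """Poll unless we are still inside a pause window; report what happened."""
    now = time.time() if now is None else now
    if not state.may_poll(now):
        return "skipped"
    try:
        poll()
    except RateLimited:
        state.record_limit_hit(now, pause_seconds)
        return "limited"
    return "polled"

def limited_poll():
    raise RateLimited()

state = RateLimitState()
r1 = scraping_task(state, lambda: None, now=0)
r2 = scraping_task(state, limited_poll, now=60)
r3 = scraping_task(state, lambda: None, now=120)
r4 = scraping_task(state, lambda: None, now=700)
```

The periodic schedule itself stays at once per minute; the task just returns early while the pause is active, which avoids touching the beat schedule at all.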
The other approach is to use PeriodicTask from django-celery-beat. You can update the interval dynamically. I created an example project and wrote an article showing how to use dynamic periodic tasks in Celery and Django.
The example code that updates the task when the limit is reached:
def scraping_task(special_object_id, larger_interval=1000):
    try:
        result = pollAPI()
    except Exception as e:
        # limit reached
        special_object = ModelWithTask.objects.get(pk=special_object_id)
        task = PeriodicTask.objects.get(pk=special_object.task.id)
        new_schedule, created = IntervalSchedule.objects.get_or_create(
            every=larger_interval,
            period=IntervalSchedule.SECONDS,
        )
        task.interval = new_schedule
        task.save()
You can pass the parameters to the scraping_task when creating a PeriodicTask object. You will need to have an additional model in the database to have access to the task:
import json
from django.db import models
from django_celery_beat.models import IntervalSchedule, PeriodicTask
class ModelWithTask(models.Model):
    task = models.OneToOneField(
        PeriodicTask, null=True, blank=True, on_delete=models.SET_NULL
    )
# create periodic task
special_object = ModelWithTask.objects.create()
schedule, created = IntervalSchedule.objects.get_or_create(
    every=10,
    period=IntervalSchedule.SECONDS,
)
task = PeriodicTask.objects.create(
    interval=schedule,
    name="Task 1",
    task="scraping_task",
    kwargs=json.dumps(
        {
            # key must match the scraping_task parameter name
            "special_object_id": special_object.id,
        }
    ),
)
special_object.task = task
special_object.save()