Possible to speed up this algorithm? - sql

I am trying to speed up a Ruby algorithm. I have a Rails app that uses Active Record and Nokogiri to visit a list of URLs stored in a database, scrape the main image from each page, and save it under the image attribute associated with that URL.
This Rake task usually takes about 2 minutes 30 seconds to complete, and I am trying to speed it up as a learning exercise. Would it be possible to use C through RubyInline together with raw SQL to achieve the same result? My only issue is that if I drop down to C I lose the database connection that Active Record provided, and I have no idea how to write SQL queries from the C code that will properly connect to my database.
Has anyone had experience with this, or does anyone know whether it is even possible? I am doing this primarily as a learning exercise. Here is the code that I want to translate into C and SQL, if you are interested:
task :getimg => :environment do
  stories = FeedEntry.all
  stories.each do |story|
    if story.image.nil?
      url = story.url
      doc = Nokogiri::HTML(open(url))
      if doc.at_css(".full-width img")
        img = doc.at_css(".full-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-width img")
        img = doc.at_css(".body-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-narrow-width img")
        img = doc.at_css(".body-narrow-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".caption img")
        img = doc.at_css(".caption img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnnArticleGalleryPhotoContainer img")
        img = doc.at_css(".cnnArticleGalleryPhotoContainer img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_strylftcntnt div img")
        img = doc.at_css(".cnn_strylftcntnt div img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_stryimg640captioned img")
        img = doc.at_css(".cnn_stryimg640captioned img")[:src]
        story.image = img
        story.save!
      end
    else
      # do nothing
    end
  end
end
I would appreciate any and all help and insights in this matter. Thank you in advance!!

Speed of DB Saving
I've written a web crawler in Ruby, and I found that one of the bottlenecks that can affect performance is the actual creation of rows in the database. It's faster to do a single mass insert at the end, once all URLs have been extracted, than to do many individual inserts (at least for Postgres).
So instead of calling YourModel.save! for every URL you visit, push each value into an array that keeps track of what needs to be saved. Then, once you've finished scraping all the links, do a single mass insert of all the image links with one SQL command.
to_insert = []

stories.each do |story|
  url = story.url
  doc = Nokogiri::HTML(open(url))
  img_url = doc.at_css("img")[:src]
  # For real code, escape the value, e.g. ActiveRecord::Base.connection.quote(img_url)
  to_insert.push "('#{img_url}')"
end

# Notice the single mass insert at the end.
# CONN is a constant declared at the top of your file that holds the
# database connection: CONN = ActiveRecord::Base.connection
sql = "INSERT INTO your_table (img_url) VALUES #{to_insert.join(', ')}"
CONN.execute sql
"Speed Up" Downloading
The downloading of links will also be a bottleneck. Thus, the best option would be to create a thread pool, where each thread is allocated a partition of URLs from the database to scrape. This way, you will never be stuck waiting for a single page to download before you do any real processing.
Some pseudo-ish Ruby code:
number_of_workers = 10

(1..number_of_workers).each do |worker|
  Thread.new do
    begin
      urls_to_scrape_for_this_thread = [...list of urls to scrape...]
      while urls_to_scrape_for_this_thread.length > 0
        url = take_one_url_from_list
        scrape(url)
      end
    rescue => e
      puts "========================================"
      puts "Thread ##{worker} error"
      puts "#{e.message}"
      puts "#{e.backtrace}"
      puts "========================================"
      raise e
    end
  end
end

Are the URLs remote? If so, benchmark first to see how much of the time is network latency. If that is the bottleneck, there is little you can do in your code or with your choice of language.
How many FeedEntry records do you have in your database? I suggest using FeedEntry.find_each instead of FeedEntry.all.each, because the former loads 1000 entries into memory, processes them, and then loads the next 1000, while the latter loads all entries into memory and then iterates over them, which requires more memory and increases GC cycles.
If the bottleneck is neither of the above, then maybe it's the DOM searching that is slow. You can find the (only?) img node, then check its parent or grandparent node as necessary, and update your entries accordingly:
image_node = doc.at_css('img')
story.update image: image_node['src'] if needed?(image_node)

def needed?(image_node)
  parent_node = image_node.parent
  parent_class = image_node.parent['class']
  return true if parent_class == 'full-width'
  return true if parent_class == 'body-width'
  return true if parent_class == 'body-narrow-width'
  return true if parent_class == 'caption'
  return true if parent_class == 'cnnArticleGalleryPhotoContainer'
  return true if parent_class == 'cnn_stryimg640captioned'
  # Use #name for the tag name; #node_type returns a node-type code, not "div".
  return false unless parent_node.name == 'div'
  return true if parent_node.parent['class'] == 'cnn_strylftcntnt'
  false
end

Related

How can I create reliable Flask-SQLAlchemy interactions with server-sent events?

I have a Flask app that is functioning as expected, and I am now trying to add a message-notification section to my page. The difficulty I am having is that the database changes I am relying on do not seem to become visible in a timely fashion.
The HTML code is elementary:
<ul id="out" cols="85" rows="14">
</ul><br><br>

<script type="text/javascript">
    var ul = document.getElementById("out");
    var eventSource = new EventSource("/stream_game_channel");
    eventSource.onmessage = function(e) {
        ul.innerHTML += e.data + '<br>';
    };
</script>
Here is the message-write code that the second user executes. I know this code block runs, because the Redis trigger is properly invoked:
msg_join = Messages(game_id=game_id[0],
                    type="gameStart",
                    msg_from=current_user.username,
                    msg_to="Everyone",
                    message=f'{current_user.username} has requested to join.')
db.session.add(msg_join)
db.session.commit()

channel = str(game_id[0]).zfill(5) + 'startGame'
session['channel'] = channel
date_time = datetime.utcnow().strftime("%Y/%m/%d %H:%M:%S")
redisChannel.set(channel, date_time)
Here is the Flask stream code, which is correctly triggered by a new Redis timestamp, but when I pull the list of messages, the new message that the second user added is not yet visible:
@games.route('/stream_game_channel')
def stream_game_channel():
    @stream_with_context
    def eventStream():
        channel = session.get('channel')
        game_id = int(left(channel, 5))
        cnt = 0
        while cnt < 1000:
            print(f'cnt = 0 process running from: {current_user.username}')
            time.sleep(1)
            ntime = redisChannel.get(channel)
            if cnt == 0:
                msgs = db.session.query(Messages).filter(Messages.game_id == game_id)
                msg_list = [i.message for i in msgs]
                cnt += 1
                ltime = ntime
                lmsg_list = msg_list
                for i in msg_list:
                    yield "data: {}\n\n".format(i)
            elif ntime != ltime:
                print(f'cnt > 0 process running from: {current_user.username}')
                time.sleep(3)
                msgs = db.session.query(Messages).filter(Messages.game_id == game_id)
                msg_list = [i.message for i in msgs]
                new_messages = []  # need to write this code still
                ltime = ntime
                cnt += 1
                yield "data: {}\n\n".format(msg_list[len(msg_list) - len(lmsg_list)])
    return Response(eventStream(), mimetype="text/event-stream")
The problem I am running into is that msg_list comes back exactly the same length as before (i.e. the newly pushed message does not show up when I expect it to). Strangely, the second user's session appears to see the new row, because its stream correctly reflects the addition.
I am using an Amazon RDS MySQL database.
The solution was to call db.session.commit() before my db.session.query(Messages).filter(...), even though no writes were pending. Committing ends the session's open transaction, so the next query sees rows committed by other sessions in the meantime; with that change, an immediate read from a different user session worked, and my code reacted to the change in message-list length properly.
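A minimal sketch of that pattern applied to the polling loop (the names db, Messages, game_id, and redisChannel come from the code above; db.session.rollback() would end the open transaction just as well):

def fetch_messages(game_id):
    # End the session's open transaction so the next query starts a fresh one.
    # Under MySQL's default REPEATABLE READ isolation, a long-lived transaction
    # keeps returning the same snapshot, which is why new rows never appeared.
    db.session.commit()
    msgs = db.session.query(Messages).filter(Messages.game_id == game_id)
    return [m.message for m in msgs]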

insert_many in pymongo not persisting

I'm having some issues with persisting documents with pymongo when using insert_many.
I'm handing over a list of dicts to insert_many and it works fine from inside the same script that does the inserting. Less so once the script has finished.
import numpy as np
from pymongo import MongoClient


def row_to_doc(row):
    rowdict = row.to_dict()
    for key in rowdict:
        val = rowdict[key]
        if type(val) == float or type(val) == np.float64:
            if np.isnan(val):
                # If we want a SQL style document collection
                rowdict[key] = None
                # If we want a NoSQL style document collection
                # del rowdict[key]
    return rowdict


def dataframe_to_collection(df):
    n = len(df)
    doc_list = []
    for k in range(n):
        doc_list.append(row_to_doc(df.iloc[k]))
    return doc_list


def get_mongodb_client(host="localhost", port=27017):
    return MongoClient(host, port)


def create_collection(client):
    db = client["material"]
    return db["master-data"]


def add_docs_to_mongo(collection, doc_list):
    collection.insert_many(doc_list)


def main():
    client = get_mongodb_client()
    csv_fname = "some_csv_fname.csv"
    df = get_clean_csv(csv_fname)
    doc_list = dataframe_to_collection(df)
    collection = create_collection(client)
    add_docs_to_mongo(collection, doc_list)
    test_doc = collection.find_one({"MATERIAL": "000000000000000001"})
When I open up another python REPL and start looking through the client.material.master_data collection with collection.find_one({"MATERIAL": "000000000000000001"}) or collection.count_documents({}) I get None for the find_one and 0 for the count_documents.
Is there a step where I need to call some method to persist the data to disk? db.collection.save() in the mongo client API sounds like what I need but it's just another way of inserting documents from what I have read. Any help would be greatly appreciated.
The problem was that I was getting my collection via attribute access, client.material.master_data, which refers to a collection literally named master_data, whereas the code above creates one named master-data (with a hyphen). Using bracket access with the exact name, client.material["master-data"], solved my issue.
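To make the mismatch concrete, here is a small check you could run (a sketch; it assumes a local mongod and the database and collection names used above):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["material"]

# Attribute access maps the attribute name verbatim, so these are two
# different collections: "master_data" vs "master-data".
print(db.master_data.full_name)      # material.master_data
print(db["master-data"].full_name)   # material.master-data

# Counting documents in each makes the mix-up obvious.
print(db.master_data.count_documents({}))
print(db["master-data"].count_documents({}))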

Django local server: Atomic Database from a text file

I made a web app that takes in a text file, reads each line, takes the 11th character, and saves it to a SQLite3 db. How do I lock the database, or keep two or more separate tables, while multiple requests are running?
I have tried adding 'ATOMIC_REQUESTS': True to settings.py in Django, and I tried creating temporary tables for each request, but I can't figure it out. I am pretty fresh to Django 2.2.
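For reference, ATOMIC_REQUESTS is configured per connection inside the DATABASES setting rather than at the top level of settings.py; a minimal sketch (BASE_DIR as in a standard Django 2.2 settings file):

import os

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        # Wraps each request's view in a transaction on this connection;
        # it does not by itself serialize concurrent requests.
        'ATOMIC_REQUESTS': True,
    }
}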
My View.py
def home(request):
    if request.method == 'GET':
        return render(request, 'home.html')
    if request.method == 'POST':
        form = DocumentForm(data=request.POST, files=request.FILES)
        print(form.errors)
        if form.is_valid():
            try:
                f = request.FILES['fileToUpload']
            except:
                print('\033[0;93m' + "No File uploaded, Redirecting" + '\033[0m')
                return HttpResponseRedirect('/tryagain')
            print('\033[32m' + "Success" + '\033[0m')
            print('Working...')
            line = f.readline()
            while line:
                # print(line)
                mst = message_typer.messages.create_message(str(line)[11])
                line = f.readline()
        else:
            print('\033[0;91m' + "Failed to validate Form" + '\033[0m')
        return HttpResponseRedirect('/output')
    return HttpResponse('Failure')


def output(request):
    s = message_typer.messages.filter(message='s').count()
    A = message_typer.messages.filter(message='A').count()
    d = message_typer.messages.filter(message='d').count()
    E = message_typer.messages.filter(message='E').count()
    X = message_typer.messages.filter(message='X').count()
    P = message_typer.messages.filter(message='P').count()
    r = message_typer.messages.filter(message='r').count()
    B = message_typer.messages.filter(message='B').count()
    H = message_typer.messages.filter(message='H').count()
    I = message_typer.messages.filter(message='I').count()
    J = message_typer.messages.filter(message='J').count()
    R = message_typer.messages.filter(message='R').count()

    message_types = {'s': s, 'A': A, 'd': d, 'E': E, 'X': X, 'P': P,
                     'r': r, 'B': B, 'H': H, 'I': I, 'J': J, 'R': R}
    output = {'output': message_types}
    # return HttpResponse('Received')
    message_typer.messages.all().delete()
    return render(request, 'output.html', output)
When the web page loads, it should display a simple breakdown of the character found at the 11th position of each line of the uploaded text file.
However, if two requests are running concurrently, the first page that makes the request gets an OperationalError: the database is locked.
The traceback points here:
message_typer.messages.all().delete()
The second page will then show the combined totals of the two files that were uploaded.
I do want to wipe the table after so that the next user will have an empty table to populate and perform a count on.
Is there a better way?

how does scrapy-splash handle infinite scrolling?

I want to reverse engineer the content generated by scrolling down in the webpage. The problem is the URL https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933: screwrand doesn't seem to follow any pattern, so reversing the URLs doesn't work. I'm considering automatic rendering with Splash. How do I use Splash to scroll like a browser does? Thanks a lot!
Here is the code for the two requests:
request1 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following/{}'.format(user_id),
    self.parse_follow_relationship,
    args={'wait': 2},
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')
yield request1

request2 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following_user/80159?user_id=80159&limit=0&per_page=20&screwrand=76',
    self.parse_tmp,
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')
yield request2
[Screenshot: the AJAX requests shown in the browser console]
To scroll a page you can write a custom rendering script (see http://splash.readthedocs.io/en/stable/scripting-tutorial.html), something like this:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
To render this script, use the 'execute' endpoint instead of the render.html endpoint:
script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
                            endpoint='execute',
                            args={'wait': 2, 'lua_source': script}, ...)
Thanks Mikhail, I tried your scroll script and it worked, but I also noticed that the script scrolls too far in one step, so some JS has no time to render and gets skipped. So I made a small change, as follows:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        local height = get_body_height()
        for i = 1, 10 do
            scroll_to(0, height * i / 10)
            splash:wait(scroll_delay / 10)
        end
    end
    return splash:html()
end
I do not think hard-coding the number of scrolls is a good idea for infinite-scroll pages, so I modified the above-mentioned code like this:
function main(splash, args)
    current_scroll = 0

    scroll_to = splash:jsfunc("window.scrollTo")
    get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(3)

    height = get_body_height()
    while current_scroll < height do
        scroll_to(0, get_body_height())
        splash:wait(5)
        current_scroll = height
        height = get_body_height()
    end
    splash:set_viewport_full()
    return splash:html()
end
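For completeness, a sketch of how this last script could be wired into a spider, following the same execute-endpoint pattern shown earlier (the spider name, start URL, and callback body are placeholders):

import scrapy
import scrapy_splash

INFINITE_SCROLL_LUA = """ <the Lua script above> """


class FollowingSpider(scrapy.Spider):
    name = 'following'

    def start_requests(self):
        url = 'https://www.crowdfunder.com/user/following/80159'
        yield scrapy_splash.SplashRequest(
            url,
            self.parse,
            endpoint='execute',  # run the Lua script instead of render.html
            args={'wait': 2, 'lua_source': INFINITE_SCROLL_LUA},
        )

    def parse(self, response):
        # response.text is the fully scrolled HTML returned by splash:html()
        for href in response.css('a::attr(href)').getall():
            self.log(href)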

django.db.utils.IntegrityError when trying to delete duplicate images

I've got the following code to delete duplicate images, based on a perceptual hash I calculated.
images = Image.objects.all()
images_deleted = 0

for image in images:
    duplicates = Image.objects.filter(hash=image.hash).exclude(pk=image.pk).exclude(hash="ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff")
    for duplicate in duplicates:
        duplicate_tags = duplicate.tags.all()
        image.tags.add(*duplicate_tags)
        duplicate.delete()
        images_deleted += 1
        print(str(images_deleted))
Running it, I get the following exception:
django.db.utils.IntegrityError: insert or update on table "crawlers_image_tags" violates foreign key constraint "crawlers_image_t_image_id_72a28d1d54e11b5f_fk_crawlers_image_id"
DETAIL: Key (image_id)=(5675) is not present in table "crawlers_image".
Can anyone shed some light on what exactly the problem is?
edit:
models:
class Tag(models.Model):
    name = models.CharField(max_length=100)

    def __str__(self):
        return self.name


class Image(models.Model):
    origins = (
        ('PX', 'Pexels'),
        ('MG', 'Magdeleine'),
        ('FC', 'FancyCrave'),
        ('SS', 'StockSnap'),
        ('PB', 'PixaBay'),
        ('TP', 'tookapic'),
        ('KP', 'kaboompics'),
        ('PJ', 'picjumbo'),
        ('LS', 'LibreShot')
    )
    source_url = models.URLField(max_length=400)
    page_url = models.URLField(unique=True, max_length=400)
    thumbnail = models.ImageField(upload_to='thumbs', null=True)
    origin = models.CharField(choices=origins, max_length=2)
    tags = models.ManyToManyField(Tag)
    hash = models.CharField(max_length=200)

    def __str__(self):
        return self.page_url

    def create_hash(self):
        thumbnail = Imagelib.open(self.thumbnail.path)
        thumbnail = thumbnail.convert('RGB')
        self.hash = blockhash(thumbnail, 24)
        self.save(update_fields=["hash"])

    def create_thumbnail(self, image_url):
        if not self.thumbnail:
            if not image_url:
                image_url = self.source_url
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
            }
            for i in range(5):
                r = requests.get(image_url, stream=True, headers=headers)
                if r.status_code != 200 and r.status_code != 304:
                    print("error loading image url status code: {}".format(r.status_code))
                    time.sleep(2)
                else:
                    break
            if r.status_code != 200 and r.status_code != 304:
                print("giving up on this image, final status code: {}".format(r.status_code))
                return False

            # Create the thumbnail of dimension size
            size = 500, 500
            img = Imagelib.open(r.raw)
            thumb = ImageOps.fit(img, size, Imagelib.ANTIALIAS)

            # Get the image name from the url
            img_name = os.path.basename(image_url.split('?', 1)[0])
            file_path = os.path.join(djangoSettings.MEDIA_ROOT, "thumb" + img_name)
            thumb.save(file_path, 'JPEG')

            # Save the thumbnail in the media directory, prepend thumb
            self.thumbnail.save(
                img_name,
                File(open(file_path, 'rb')))
            os.remove(file_path)
        return True
Let's examine your code step by step.
Say you have 3 images in your database (for simplicity I've skipped the irrelevant fields):
Image(pk=1, hash='d2ffacb...e3')
Image(pk=2, hash='afcbdee...77')
Image(pk=3, hash='d2ffacb...e3')
As we can see, the first and third image have exactly the same hash. Let's assume all your images have some tags. Now back to your code. Let's check what will happen in the first iteration:
- All images with the same hash will be fetched from the database; here that is only the image with pk=3.
- Iterating through those duplicates will copy all their tags onto the original image. Nothing wrong there.
- Iterating through those duplicates will also delete them.
So after the first iteration, the image with pk=3 no longer exists.
Next iteration, image pk=2: nothing happens, because there are no duplicates.
Next iteration, image pk=3:
- All images with the same hash will be fetched from the database; here that is only the image with pk=1.
- Iterating through those duplicates will copy all their tags onto the original image. But wait... there is no image with pk=3 in the database anymore, so we can't assign any tags to it. And that is what throws your IntegrityError.
To avoid that, you should fetch only the original images in the outer loop. To do that, you can use:
images = Image.objects.distinct('hash')
You can also add ordering, so that, for example, the image with the lowest ID is always treated as the original (note that distinct() with field names is supported only on PostgreSQL, and the order_by() fields must start with the distinct() fields):
images = Image.objects.order_by('hash', 'id').distinct('hash')
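Putting that together, a sketch of the corrected de-duplication loop (it assumes the Image model above and a PostgreSQL backend, and still excludes the all-"f" blank-image hash from the question):

# Pick the lowest-pk image per hash as the original, merge the duplicates'
# tags onto it, then delete the duplicates.
BLANK_HASH = "f" * 144  # the all-"f" hash excluded in the question (24x24-bit blockhash -> 144 hex chars)

originals = Image.objects.order_by('hash', 'pk').distinct('hash')

for image in originals:
    duplicates = (Image.objects
                  .filter(hash=image.hash)
                  .exclude(pk=image.pk)
                  .exclude(hash=BLANK_HASH))
    for duplicate in duplicates:
        image.tags.add(*duplicate.tags.all())  # keep the duplicate's tags
        duplicate.delete()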
This is to do with the evaluation strategy of the queryset.
Image.objects.all() returns a thunk - that is, a sort of promise of an iterable sequence of images. The SQL query is not executed at this stage.
When you start iterating over it - for image in images - the SQL query is evaluated. You now have a list of image objects in memory.
Now, say you have four images in the database - ids 0, 1, 2, and 3. 0 and 3 are duplicates. The first image is processed, turning up 3 as a duplicate. You delete 3. Image 3 is still in the images iterator, however. When you get there, you're going to try to add tags from image 0 to image 3's tags collection. This will trigger the integrity error, since image 3 has already been deleted.
The simple fix is to keep an accumulator of images to be deleted, and do them all at the end.
images = Image.objects.all()
images_to_delete = []

for image in images:
    if image.pk in images_to_delete:
        pass
    else:
        duplicates = Image.objects.filter(hash=image.hash).exclude(pk=image.pk).exclude(hash="ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff")
        for duplicate in duplicates:
            duplicate_tags = duplicate.tags.all()
            image.tags.add(*duplicate_tags)
            images_to_delete.append(duplicate.pk)

print(len(images_to_delete))

for pk in images_to_delete:
    Image.objects.get(pk=pk).delete()
EDIT: corrected proximate cause of the error, as pointed out by GwynBleidD.