Start multiple spiders sequentially from another spider - scrapy

I have one spider which creates 100+ spiders with arguments.
Those spiders scrape x items and forward them to a MySQL pipeline.
The MySQL database can handle 10 connections at a time.
For that reason I can only have a maximum of 10 spiders running at the same time.
How can I make this happen?
My current, unsuccessful approach is:
- Add spiders to a list in the first spider like this:
if item.get('location_selectors') is not None and item.get('start_date_selectors') is not None:
    spider = WikiSpider.WikiSpider(template=item.get('category'), view_threshold=0, selectors={
        'location': [item.get('location_selectors')],
        'date_start': [item.get('start_date_selectors')],
        'date_end': [item.get('end_date_selectors')]
    })
    self.spiders.append(spider)
Then in the first spider I listen for the spider_closed signal:
def spider_closed(self, spider):
    for spider in self.spiders:
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider)
But this approach gives me the following error:
connection to the other side was lost in a non-clean fashion
What is the correct way to start multiple spiders sequentially?
Thanks in advance!
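For reference, here is a minimal sketch of the pattern the Scrapy docs describe for running crawls one after another with CrawlerRunner and a chained deferred. The spider class and constructor arguments are taken from the question above; the argument list itself is a placeholder, and crawl() is given the spider class plus keyword arguments rather than a pre-built instance:

import WikiSpider  # the spider module from the question
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

# placeholder: the keyword arguments collected by the first spider
spider_kwargs_list = [
    {'template': 'some-category', 'view_threshold': 0,
     'selectors': {'location': ['...'], 'date_start': ['...'], 'date_end': ['...']}},
]


@defer.inlineCallbacks
def crawl_sequentially():
    for kwargs in spider_kwargs_list:
        # each yield waits for the previous crawl to finish before starting the next one
        yield runner.crawl(WikiSpider.WikiSpider, **kwargs)
    reactor.stop()


crawl_sequentially()
reactor.run()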

Related

Why is Python object id different after the Process starts but the pid remains the same?

"""
import time
from multiprocessing import Process, freeze_support
class FileUploadManager(Process):
"""
WorkerObject which uploads files in background process
"""
def __init__(self):
"""
Worker class to upload files in a separate background process.
"""
super().__init__()
self.daemon = True
self.upload_size = 0
self.upload_queue = set()
self.pending_uploads = set()
self.completed_uploads = set()
self.status_info = {'STOPPED'}
print(f"Initial ID: {id(self)}")
def run(self):
try:
print("STARTING NEW PROCESS...\n")
if 'STARTED' in self.status_info:
print("Upload Manager - Already Running!")
return True
self.status_info.add('STARTED')
print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
# Upload files
while True:
print("File Upload Queue Empty.")
time.sleep(10)
except Exception as e:
print(f"{repr(e)} - Cannot run upload process.")
if __name__ == '__main__':
upload_manager = FileUploadManager()
print(f"Object ID: {id(upload_manager)}")
upload_manager.start()
print(f"Process Info: {upload_manager.status_info}, ID After: {id(upload_manager)}")
while 'STARTED' not in upload_manager.status_info:
print(f"Not Started! Process Info: {upload_manager.status_info}")
time.sleep(7)
"""
OUTPUT
Initial ID: 2894698869712
Object ID: 2894698869712
Process Info: {'STOPPED'}, ID After: 2894698869712
Not Started! Process Info: {'STOPPED'}
STARTING NEW PROCESS...
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
File Upload Queue Empty.
Not Started! Process Info: {'STOPPED'}
File Upload Queue Empty.
Why does the Process object have the same id and attribute values before and after it has started, but a different id when the run method starts?
Initial ID: 2894698869712
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
Process Info: {'STOPPED'}, ID After: 2894698869712
I fixed your indentation, and I also removed everything from your script that was not actually being used. It is now a minimal, reproducible example that anyone can run. In the future, please adhere to the site guidelines and proofread your questions; it will save everybody's time and you will get better answers.
I would also like to point out that the question in your title is not at all the same as the question asked in your text. At no point do you retrieve the process ID, which is an operating system value. You are printing out the ID of the object, which is a value that has meaning only within the Python runtime environment.
import time
from multiprocessing import Process
# Removed freeze_support since it was unused


class FileUploadManager(Process):
    """
    WorkerObject which uploads files in background process
    """

    def __init__(self):
        """
        Worker class to upload files in a separate background process.
        """
        super().__init__(daemon=True)
        # The next line probably does not work as intended, so
        # I commented it out. The docs say that the daemon
        # flag must be set by a keyword-only argument
        # self.daemon = True
        # I removed a bunch of unused variables for this test program
        self.status_info = {'STOPPED'}
        print(f"Initial ID: {id(self)}")

    def run(self):
        try:
            print("STARTING NEW PROCESS...\n")
            if 'STARTED' in self.status_info:
                print("Upload Manager - Already Running!")
                return  # Removed True return value (it was unused)
            self.status_info.add('STARTED')
            print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
            # Upload files
            while True:
                print("File Upload Queue Empty.")
                time.sleep(1.0)
        except Exception as e:
            print(f"{repr(e)} - Cannot run upload process.")


if __name__ == '__main__':
    upload_manager = FileUploadManager()
    print(f"Object ID: {id(upload_manager)}")
    upload_manager.start()
    print(f"Process Info: {upload_manager.status_info}",
          f"ID After: {id(upload_manager)}")
    while 'STARTED' not in upload_manager.status_info:
        print(f"Not Started! Process Info: {upload_manager.status_info}")
        time.sleep(0.7)
Your question is: why is the id of upload_manager the same before and after it is started? Simple answer: because it's the same object. It does not become a different object just because you called one of its methods. That would not make any sense.
I suppose you might be wondering why the ID of the FileUploadManager object is different when you print it out from its "run" method. It's the same simple answer: because it's a different object. Your script actually creates two instances of FileUploadManager, although it's not obvious. In Python, each Process has its own memory space. When you start a secondary Process (upload_manager.start()), Python makes a second instance of FileUploadManager to execute in this new Process. The two instances are completely separate and "know" nothing about each other.
You did not mention it, but your script never terminates: it runs forever, stuck in the loop while 'STARTED' not in upload_manager.status_info. That's because 'STARTED' was added to self.status_info in the secondary Process. That Process is working with a different instance of FileUploadManager. The changes you make there do not get automatically reflected in the first instance, which lives in the main Process. Therefore the first instance of FileUploadManager never changes, and the loop never exits.
This all makes perfect sense once you realize that each Process works with its own separate objects. If you need to pass data from one Process to another, that can be done with Pipes, Queues, Managers and shared variables, as documented in the Concurrent Execution section of the standard library documentation.
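As an illustrative sketch (not part of the original answer; the names are mine): a multiprocessing.Event created in the parent and handed to the worker is one such shared primitive, and unlike the plain status_info attribute it makes the "has it started?" check visible across both processes:

import os
import time
from multiprocessing import Process, Event


class FileUploadManager(Process):
    def __init__(self, started_event):
        super().__init__(daemon=True)
        self.started_event = started_event  # shared by parent and child

    def run(self):
        # runs in the child process: a different object and a different OS pid
        print(f"child: pid={os.getpid()}, object id={id(self)}")
        self.started_event.set()  # visible to the parent, unlike a plain attribute
        time.sleep(1.0)


if __name__ == '__main__':
    started = Event()
    manager = FileUploadManager(started)
    print(f"parent: pid={os.getpid()}, object id={id(manager)}")
    manager.start()
    started.wait()  # returns as soon as the child calls set()
    print("child reported STARTED")
    manager.join()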

Two consecutive yields, only the first works

I have this piece of code that only executes the first yield's callback and not the next one. I have tried reordering them and it gives the same result:
Only the first yield callback gets executed.
for j in range(totalOrderPages):  # the code gets in the loop
    productURI = feedUrl % (productId, j + 1)
    print "Got in the loop"  # this gets printed
    yield response.follow(productURI, self.parse_orders, meta={'pid': productId, 'categories': categories})
    yield response.follow(first_page, self.parse_product, meta={'pid': productId, 'categories': categories})
Is there anything in Python or scrapy that prevents 2 consecutive yields?
Second question:
I'm trying to debug this using pdb.set_trace(), but when I try to execute yield from the debugging console, it gives a 'yield outside function' error.
Does anyone know how we can debug yields?
Thank you.
Without knowing more details, like the redirection behaviour of the specific site or the contents of the variables (feedUrl, productURI, first_page, etc.), I would say that some requests are being dropped by the dupefilter (https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class).
I'd recommend enabling the DEBUG logging level, setting DUPEFILTER_DEBUG=True, and checking the logs to see if that's the case.
You can force requests to bypass the Dupefilter by adding dont_filter=True when calling response.follow.
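For illustration, the flag goes straight into the follow call from the loop above (everything else unchanged):

yield response.follow(productURI, self.parse_orders,
                      meta={'pid': productId, 'categories': categories},
                      dont_filter=True)  # bypass the duplicate-request filter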
If this doesn't solve your issue, please share your crawl logs so we can have more information to debug the issue. Happy scraping!

Twisted deferreds block when URI is the same (multiple calls from the same browser)

I have the following code
# -*- coding: utf-8 -*-
# 好
##########################################
import time
from twisted.internet import reactor, threads
from twisted.web.server import Site, NOT_DONE_YET
from twisted.web.resource import Resource
##########################################


class Website(Resource):
    def getChild(self, name, request):
        return self

    def render(self, request):
        if request.path == "/sleep":
            duration = 3
            if 'duration' in request.args:
                duration = int(request.args['duration'][0])
            message = 'no message'
            if 'message' in request.args:
                message = request.args['message'][0]
            #-------------------------------------
            def deferred_activity():
                print 'starting to wait', message
                time.sleep(duration)
                request.setHeader('Content-Type', 'text/plain; charset=UTF-8')
                request.write(message)
                print 'finished', message
                request.finish()
            #-------------------------------------
            def responseFailed(err, deferred):
                print err.getErrorMessage()
                deferred.cancel()
            #-------------------------------------
            def deferredFailed(err, deferred):
                pass  # print err.getErrorMessage()
            #-------------------------------------
            deferred = threads.deferToThread(deferred_activity)
            deferred.addErrback(deferredFailed, deferred)  # will get called indirectly by responseFailed
            request.notifyFinish().addErrback(responseFailed, deferred)  # to handle client disconnects
            #-------------------------------------
            return NOT_DONE_YET
        else:
            return 'nothing at', request.path
##########################################
reactor.listenTCP(321, Site(Website()))
print 'starting to serve'
reactor.run()
##########################################
# http://localhost:321/sleep?duration=3&message=test1
# http://localhost:321/sleep?duration=3&message=test2
##########################################
My issue is the following:
When I open two tabs in the browser, point one at http://localhost:321/sleep?duration=3&message=test1 and the other at http://localhost:321/sleep?duration=3&message=test2 (the messages differ), reload the first tab and then, as quickly as possible, the second one, they finish almost at the same time: the first tab about 3 seconds after hitting F5, and the second tab about half a second after the first.
This is expected, as each request got deferred into a thread, and they are sleeping in parallel.
But when I now change the URL of the second tab to be the same as the one of the first tab, that is to http://localhost:321/sleep?duration=3&message=test1, then all this becomes blocking. If I press F5 on the first tab and as quickly as possible F5 on the second one, the second tab finishes about 3 seconds after the first one. They don't get executed in parallel.
As long as the entire URI is the same in both tabs, this server starts to block. This is the same in Firefox as well as in Chrome. But when I start one in Chrome and another one in Firefox at the same time, then it is non-blocking again.
So it may not necessarily be related to Twisted, but perhaps to some kind of connection reuse or something like that.
Does anyone know what is happening here and how I can solve this issue?
Coincidentally, someone asked a related question over at the Tornado section. As you suspected, this is not an "issue" in Twisted but rather a "feature" of web browsers :). Tornado's FAQ page has a small section dedicated to this issue. The proposed solution is appending an arbitrary query string.
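For illustration, appending a throwaway parameter (the name is arbitrary) makes the two URIs differ, which stops the browser from queuing the second request behind the first on the same connection:

# http://localhost:321/sleep?duration=3&message=test1&nocache=1
# http://localhost:321/sleep?duration=3&message=test1&nocache=2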
Quote of the day:
One dev's bug is another dev's undocumented feature!

Celery: Task Singleton?

I have a task that I need to run asynchronously from the web page that triggered it. This task runs rather long, and as the web page could be getting a lot of these requests, I'd like celery to only run one instance of this task at a given time.
Is there any way I can do this in Celery natively? I'm tempted to create a database table that holds this state for all the tasks to communicate with, but it feels hacky.
You can probably create a dedicated worker for that task, configured with CELERYD_CONCURRENCY=1; then all tasks sent to that worker will run sequentially.
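A minimal sketch of that setup (the module, task, and queue names are placeholders, not from the question): route the task to its own queue and start a single-process worker for that queue.

# settings (old-style names, to match CELERYD_CONCURRENCY):
CELERY_ROUTES = {'myapp.tasks.long_task': {'queue': 'singleton'}}

# then run one dedicated worker for that queue with a single process:
#   celery -A myproject worker -Q singleton --concurrency=1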
You can use memcache/redis for that.
There is an example on the official Celery site: http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html
And if you prefer Redis (this is a Django implementation, but you can easily modify it for your needs):
from django.core.cache import cache

from celery import Task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)


class SingletonTask(Task):
    def __call__(self, *args, **kwargs):
        lock = cache.lock(self.name)
        if not lock.acquire(blocking=False):
            logger.info("{} failed to lock".format(self.name))
            return
        try:
            super(SingletonTask, self).__call__(*args, **kwargs)
        except Exception as e:
            lock.release()
            raise e
        lock.release()
And then use it as a base task:
from celery import shared_task


@shared_task(base=SingletonTask)
def test_task():
    from time import sleep
    sleep(10)
This implementation is non-blocking. If you want the next task to wait for the previous one, change blocking=False to blocking=True and add a timeout.
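A hedged sketch of that blocking variant (assuming django-redis hands back a redis-py lock, whose acquire() accepts blocking and blocking_timeout; the numbers are placeholders):

lock = cache.lock(self.name, timeout=60 * 10)  # let the lock expire on its own after 10 minutes
if not lock.acquire(blocking=True, blocking_timeout=30):  # wait up to 30 s for the previous task
    logger.info("{} timed out waiting for the lock".format(self.name))
    return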

Apache mod_wsgi daemon mode not yielding small string

While testing a few things using Django + Apache2 + mod_wsgi 3.3, I get two different results when a view yields its response periodically, depending on whether I run in embedded or daemon mode.
In embedded mode, i.e. with no WSGIDaemonProcess or WSGIProcessGroup directive used, the function below yields results one after the other, with every digit printed in the browser view after each 2-second sleep.
import time

from django.http import HttpResponse


def yielder(request):
    gen = testYielding()
    return HttpResponse(gen)


def testYielding():
    yield "3"
    time.sleep(2)
    yield "4"
    time.sleep(2)
    yield "5"
    time.sleep(2)
    yield "6"
    time.sleep(2)
    yield "7"
With daemon mode on, however, this view returns the complete collated response after 8 seconds, with all the digits printed together instead of being yielded one after the other.
Is this behavior correct, and is there a way to make sure that in daemon mode responses are streamed the way they are in embedded mode?
The flush which occurs in the daemon process isn't transferred across to the Apache child worker process that is doing the proxying. Whether the output is therefore passed back through to the client immediately will depend in part on what Apache output filters you have registered. If you have output filters which want to buffer up response data before flushing, you will see this issue.
You should therefore look closely at what Apache output filters are in place. If you cannot change these, then you would have no choice but to use embedded mode.