Why does calling a scrapy spider from pywikibot give a ReactorNotRestartable error? - scrapy

I am able to call a scrapy spider from another Python script using either CrawlerRunner or CrawlerProcess. But, when I try to call the same spider calling class from a pywikibot robot, I get a ReactorNotRestartable error. Why is this and how can I fix it?
Here is the error:
File ".\scripts\userscripts\ReplicationWiki\RWLoad.py", line 161, in format_new_page
aea = AEAMetadata(url=DOI_url)
File ".\scripts\userscripts\ReplicationWiki\GetAEAMetadata.py", line 39, in __init__
reactor.run() # the script will block here until all crawling jobs are finished
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1282, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
ReactorBase.startRunning(self)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
CRITICAL: Exiting due to uncaught exception <class 'twisted.internet.error.ReactorNotRestartable'>
Here is the script which calls my scrapy spider. It runs fine if I just call the class from main.
from twisted.internet import reactor, defer
from scrapy import signals
from scrapy.crawler import Crawler, CrawlerProcess, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider
class AEAMetadata:
"""
Helper to run ScrapeAEA spider and return JEL codes and data links
for a given AEA article link.
"""
def __init__(self, *args, **kwargs):
"""Initializer"""
url = kwargs.get('url')
if not url:
raise ValueError('No article url given')
self.items = []
def collect_items(item, response, spider):
self.items.append(item)
settings = get_project_settings()
crawler = Crawler(ScrapeaeaSpider, settings)
crawler.signals.connect(collect_items, signals.item_scraped)
runner = CrawlerRunner(settings)
d = runner.crawl(crawler, url=url)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
#process = CrawlerProcess(settings)
#process.crawl(crawler, url=url)
#process.start() # the script will block here until the crawling is finished
def get_jelcodes(self):
jelcodes = self.items[0]['jelcodes']
return jelcodes
def main():
aea = AEAMetadata(url='https://doi.org/10.1257/app.20180286')
jelcodes = aea.get_jelcodes()
print(jelcodes)
if __name__ == '__main__':
main()
Updated simple Test that instantiates the AEAMetadata class twice.
Here is the calling code in my pywikibot bot which fails:
from GetAEAMetadata import AEAMetadata
def main(*args):
for _ in [1,2]:
print('Top')
url = 'https://doi.org/10.1257/app.20170442'
aea = AEAMetadata(url=url)
print('After AEAMetadata')
jelcodes = aea.get_jelcodes()
print(jelcodes)
if __name__ == '__main__':
main()

My call to AEAMetadata was embedded in a larger script which fooled me into thinking the AEAMetadata class was only instantiated once before failure.
In fact, AEAMetadata was called twice.
And, I also thought that the script would block after the reactor.run() because the comment in all the scrapy examples stated that was the case.
However, the second deferred callback is reactor.stop() which unblocks the reactor.run().
A more basic incorrect assumption was that the reactor was deleted and recreated on each iteration. In fact, the reactor is instantiated and initialized when it is first imported. And, it is a global object which lives as long as the underlying process and was not designed to be restarted. The extremes actually needed to delete and restart a reactor are described here:
http://www.blog.pythonlibrary.org/2016/09/14/restarting-a-twisted-reactor/
So, I guess I've answered my own question.
And, I'm rewriting my script so it doesn't try to use the reactor in a way it was never intended to be used.
And, thanks Gallaecio for getting me thinking in the right direction.

Related

Trigger errback when process_exception() is called in Middleware

Using Scrapy i'm implementing a CrawlSpider which will scrape all kinds of websites and hence, sometimes very slow ones which will produce a timeout eventually.
My problem is that if such a twisted.internet.error.TimeoutError occurs, i want to trigger the errback of my spider. I don't want to raise this exception and i also don't want to return a dummy Response object which may would suggest that scraping was successful.
Note that i was already able to made all of this work, but only using a "dirty" workaround:
myspider.py (excerpt)
class MySpider(CrawlSpider):
name = 'my-spider'
rules = (
Rule(
link_extractor=LinkExtractor(unique=True),
callback='_my_callback', follow=True
),
)
def parse_start_url(self, response):
# (...)
def errback(self, failure):
log.warning('Failed scraping following link: {}'
.format(failure.request.url))
middlewares.py (excerpt)
from twisted.internet.error import DNSLookupError, TimeoutError
# (...)
class MyDownloaderMiddleware(object):
#classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
return None
def process_response(self, request, response, spider):
return response
def process_exception(self, request, exception, spider):
if (isinstance(exception, TimeoutError)
or (isinstance(exception, DNSLookupError))):
# just 2 examples of errors i want to catch
# set status=500 to enforce errback() call
return Response(request.url, status=500)
Settings should be fine with my custom Middleware already enabled.
Now as you can see by using return Response(request.url, status=500) i can trigger my errback() function in MySpider as desired. However, the status code 500 is very misleading because it's not only incorrect but technically i never receive any response at all.
So my question is, how can i trigger my errback() function trough DownloaderMiddleware.process_exception() in a clean way?
EDIT: I quickly figured it out that for similar exceptions like DNSLookupError i want to have the same behaviour in place. I've updated the coding snippets accordingly.
I didn't find it in the docs, but looking at the source I find DownloaderMiddleware.process_exception() can return twisted.python.failure.Failure objects as well as Request or Response objects.
This means you can return a Failure object to be handled by the errback by wrapping the exception in the Failure object.
This is cleaner than creating a fake Response object, see an example Middleware implementation that does this here: https://github.com/miguelsimon/site2graph/blob/master/site2graph/middlewares.py
The core idea:
from twisted.python.failure import Failure
class MyDownloaderMiddleware:
def process_exception(self, request, exception, spider):
return Failure(exception)
The __init__ method of the Rule class accepts a process_request parameter that you can use to attatch an errback to a request:
class MySpider(CrawlSpider):
name = 'my-spider'
rules = (
Rule(
# …
process_request='process_request',
),
)
def process_request(self, request, response):
return request.replace(errback=self.errback)
def errback(self, failure):
pass

Code Error with Scrapy Tutorial

I am trying to learn Scrapy and going through the basic tutorial.I am using Anaconda Navigator. I am working in an environment with scrapy installed. I have inputted the code, but keep getting an error.
Here is the code:
import scrapy
class FirstSpider(scrapy.Spider):
name = "FirstSpider"
def start_requests(self):
urls = [
'http://quotes.toscrape.com/page/1/',
'http://quotes.toscrape.com/page/2/',
]
for url in urls:
yield scrapy.Requests(url=url, callback = self.parse)
def parse(self, response):
page = response.url.split("/")[-2]
filename = "quotes-%.html" % page
with open(filename, "wb") as f:
f.write(response.body)
self.log("saved file %s")% filename
The code runs for a bit. Says it crawled 0 pages. Then DEBUGS: Telnet Console, and then puts out this error,"[scrapy.core.engine] ERROR: Error while obtaining start requests."
The code then runs some more, and puts out another error after, "yield scrapy.Requests(utl=url, callback = self.parse)" that says "AttributeError: Module 'scrapy' has no attribute 'Requests'.
I have re-written the code, and looked for answers. Please help. Thanks!
You have a typo here:
yield scrapy.Requests(url=url, callback = self.parse)
It's Request and not Requests.

ScrapyDeprecationWaring: Command's default `crawler` is deprecated and will be removed. Use `create_crawler` method to instantiate crawlers

Scrapy version 0.19
I am using the code at this page ( Run multiple scrapy spiders at once using scrapyd ). When I run scrapy allcrawl, I got
ScrapyDeprecationWaring: Command's default `crawler` is deprecated and will be removed. Use `create_crawler` method to instantiate crawlers
Here is the code:
from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log
class AllCrawlCommand(ScrapyCommand):
requires_project = True
default_settings = {'LOG_ENABLED': False}
def short_desc(self):
return "Schedule a run for all available spiders"
def run(self, args, opts):
url = 'http://localhost:6800/schedule.json'
for s in self.crawler.spiders.list(): #this line raise the warning
values = {'project' : 'YOUR_PROJECT_NAME', 'spider' : s}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
log.msg(response)
How do I fix the DeprecationWarning ?
Thanks
Use:
crawler = self.crawler_process.create_crawler()

https with jython2.7 + trusting all certificates does not work. Result: httplib.BadStatusLine

UPDATE: Problem related to bug in jython 2.7b1. See bug report: http://bugs.jython.org/issue2021. jython-coders are working on a fix!
After changing to jython2.7beta1 from Jython2.5.3 I am no longer able to read content of webpages using SSL, http and "trusting all certificates". The response from the https-page is always an empty string, resulting in httplib.BadStatusLine exception from httplib.py in Jython.
I need to be able to read from a webpage which requires authentication and do not want to setup any certificate store since I must have portability. Therefore my solution is to use the excellent implementation provided by http://tech.pedersen-live.com/2010/10/trusting-all-certificates-in-jython/
Example code is detailed below. Twitter might not be the best example, since it does not require certificate trusting; but the result is the same with or without the decorator.
#! /usr/bin/python
import sys
from javax.net.ssl import TrustManager, X509TrustManager
from jarray import array
from javax.net.ssl import SSLContext
class TrustAllX509TrustManager(X509TrustManager):
# Define a custom TrustManager which will blindly
# accept all certificates
def checkClientTrusted(self, chain, auth):
pass
def checkServerTrusted(self, chain, auth):
pass
def getAcceptedIssuers(self):
return None
# Create a static reference to an SSLContext which will use
# our custom TrustManager
trust_managers = array([TrustAllX509TrustManager()], TrustManager)
TRUST_ALL_CONTEXT = SSLContext.getInstance("SSL")
TRUST_ALL_CONTEXT.init(None, trust_managers, None)
# Keep a static reference to the JVM's default SSLContext for restoring
# at a later time
DEFAULT_CONTEXT = SSLContext.getDefault()
def trust_all_certificates(f):
# Decorator function that will make it so the context of the decorated
# method will run with our TrustManager that accepts all certificates
def wrapped(*args, **kwargs):
# Only do this if running under Jython
if 'java' in sys.platform:
from javax.net.ssl import SSLContext
SSLContext.setDefault(TRUST_ALL_CONTEXT)
print "SSLContext set to TRUST_ALL"
try:
res = f(*args, **kwargs)
return res
finally:
SSLContext.setDefault(DEFAULT_CONTEXT)
else:
return f(*args, **kwargs)
return wrapped
##trust_all_certificates
def read_page(host):
import httplib
print "Host: " + host
conn = httplib.HTTPSConnection(host)
conn.set_debuglevel(1)
conn.request('GET', '/example')
response = conn.getresponse()
print response.read()
read_page("twitter.com")
This results in:
Host: twitter.com
send: 'GET /example HTTP/1.1\r\nHost: twitter.com\r\nAccept-Encoding: identity\r\n\r\n'
reply: ''
Traceback (most recent call last):
File "jytest.py", line 62, in <module>
read_page("twitter.com")
File "jytest.py", line 59, in read_page
response = conn.getresponse()
File "/Users/erikiveroth/Workspace/Procera/sandbox/jython/jython2.7.jar/Lib/httplib.py", line 1030, in getresponse
File "/Users/erikiveroth/Workspace/Procera/sandbox/jython/jython2.7.jar/Lib/httplib.py", line 407, in begin
File "/Users/erikiveroth/Workspace/Procera/sandbox/jython/jython2.7.jar/Lib/httplib.py", line 371, in _read_status
httplib.BadStatusLine: ''
Changing back to jython2.5.3 gives me parseable output from twitter.
Have any of you seen this before? Can not find any bug-tickets on jython project page about this nor can I understand what changes could result in this behaviour (more than maybe #1309, but I do not understand if it is related to my problem).
Cheers

twisted - interrupt callback via KeyboardInterrupt

I'm currently repeating a task in a for loop inside a callback using Twisted, but would like the reactor to break the loop in the callback (one) if the user issues a KeyboardInterrupt via Ctrl-C. From what I have tested, the reactor only stops or processes interrupts at the end of the callback.
Is there any way of sending a KeyboardInterrupt to the callback or the error handler in the middle of the callback run?
Cheers,
Chris
#!/usr/bin/env python
from twisted.internet import reactor, defer
def one(result):
print "Start one()"
for i in xrange(10000):
print i
print "End one()"
reactor.stop()
def oneErrorHandler(failure):
print failure
print "INTERRUPTING one()"
reactor.stop()
if __name__ == '__main__':
d = defer.Deferred()
d.addCallback(one)
d.addErrback(oneErrorHandler)
reactor.callLater(1, d.callback, 'result')
print "STARTING REACTOR..."
try:
reactor.run()
except KeyboardInterrupt:
print "Interrupted by keyboard. Exiting."
reactor.stop()
I got this working dandy. The fired SIGINT sets a flag running for any running task in my code, and additionally calls reactor.callFromThread(reactor.stop) to stop any twisted running code:
#!/usr/bin/env python
import sys
import twisted
import re
from twisted.internet import reactor, defer, task
import signal
def one(result, token):
print "Start one()"
for i in xrange(1000):
print i
if token.running is False:
raise KeyboardInterrupt()
#reactor.callFromThread(reactor.stop) # this doesn't work
print "End one()"
def oneErrorHandler(failure):
print "INTERRUPTING one(): Unkown Exception"
import traceback
print traceback.format_exc()
reactor.stop()
def oneKeyboardInterruptHandler(failure):
failure.trap(KeyboardInterrupt)
print "INTERRUPTING one(): KeyboardInterrupt"
reactor.stop()
def repeatingTask(token):
d = defer.Deferred()
d.addCallback(one, token)
d.addErrback(oneKeyboardInterruptHandler)
d.addErrback(oneErrorHandler)
d.callback('result')
class Token(object):
def __init__(self):
self.running = True
def sayBye():
print "bye bye."
if __name__ == '__main__':
token = Token()
def customHandler(signum, stackframe):
print "Got signal: %s" % signum
token.running = False # to stop my code
reactor.callFromThread(reactor.stop) # to stop twisted code when in the reactor loop
signal.signal(signal.SIGINT, customHandler)
t2 = task.LoopingCall(reactor.callLater, 0, repeatingTask, token)
t2.start(5)
reactor.addSystemEventTrigger('during', 'shutdown', sayBye)
print "STARTING REACTOR..."
reactor.run()
This is intentional to avoid (semi-)preemption, since Twisted is a cooperative multitasking system. Ctrl-C is handled in Python with a SIGINT handler installed by the interpreter at startup. The handler sets a flag when it is invoked. After each byte code is executed, the interpreter checks the flag. If it is set, KeyboardInterrupt is raised at that point.
The reactor installs its own SIGINT handler. This replaces the behavior of the interpreter's handler. The reactor's handler initiates reactor shutdown. Since it doesn't raise an exception, it doesn't interrupt whatever code is running. The loop (or whatever) gets to finish, and when control is returned to the reactor, shutdown proceeds.
If you'd rather have Ctrl-C (ie SIGINT) raise KeyboardInterrupt, then you can just restore Python's SIGINT handler using the signal module:
signal.signal(signal.SIGINT, signal.default_int_handler)
Note, however, that if you send a SIGINT while code from Twisted is running, rather than your own application code, the behavior is undefined, as Twisted does not expect to be interrupted by KeyboardInterrupt.