ScrapyDeprecationWaring: Command's default `crawler` is deprecated and will be removed. Use `create_crawler` method to instantiate crawlers - scrapy

Scrapy version 0.19
I am using the code at this page ( Run multiple scrapy spiders at once using scrapyd ). When I run scrapy allcrawl, I got
ScrapyDeprecationWaring: Command's default `crawler` is deprecated and will be removed. Use `create_crawler` method to instantiate crawlers
Here is the code:
from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log
class AllCrawlCommand(ScrapyCommand):
requires_project = True
default_settings = {'LOG_ENABLED': False}
def short_desc(self):
return "Schedule a run for all available spiders"
def run(self, args, opts):
url = 'http://localhost:6800/schedule.json'
for s in self.crawler.spiders.list(): #this line raise the warning
values = {'project' : 'YOUR_PROJECT_NAME', 'spider' : s}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
log.msg(response)
How do I fix the DeprecationWarning ?
Thanks

Use:
crawler = self.crawler_process.create_crawler()

Related

Simple scraper with Scrapy API

I am writing a scraper with Scrapy within a larger project, and I'm trying to keep it as minimal as possible (without create a whole scrapy project). This code downloads a single URL correctly:
import scrapy
from scrapy.crawler import CrawlerProcess
class WebsiteSpider(scrapy.Spider):
"""
https://docs.scrapy.org/en/latest/
"""
custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}
name = 'my_website_scraper'
def parse(self,response):
html = response.body
url = response.url
# process page here
process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()
How can I enrich this code to keep scraping the links found in the start URLs (with a maximum depth, for example of 3)?
Try this.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain
class WebsiteSpider(Spider):
name = 'bbc.co.uk'
allowed_domains = ['.bbc.co.uk']
start_urls = ['https://www.bbc.co.uk/']
# refresh_urls = True # For debug. If efresh_urls = True, start_urls will be crawled again.
def extract(self, url, html, models, modelNames):
doc = SimplifiedDoc(html)
lstA = doc.listA(url=url["url"]) # Get link data for subsequent crawling
data = [{"title": doc.title.text}] # Get target data
return {"Urls": lstA, "Data": data} # Return data to framework
SimplifiedMain.startThread(WebsiteSpider()) # Start crawling

Using scrapy in a script and passing args

I want to use scrapy in a larger project, but I am unsure how to pass args like name,start_urls,and allowed_domains. As I understand it name,start_urls,and allowed_domains variables are settings for process.crawl, but I am not able to use self.var like I have with line- site = self.site since self obviously isn't defined there. There is also the problem of the proper way to return. At the end of the day I just want a way to crawl all urls on a single domain from within a script.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse
from scrapy.crawler import CrawlerProcess
#from project.spiders.test_spider import SpiderName
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(settings={
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
crawledUrls = []
class MySpider(CrawlSpider):
name = 'spider_example_name'
def __init__(self,site):
self.site=site
site = self.site
domain = urlparse(site).netloc
start_urls = [site]
allowed_domains = [domain]
rules = (
Rule(LinkExtractor(unique=True), callback='parse_item', follow=True),
)
def parse_item(self, response):
#I think there is a way to do this with yeild
print(self.site)
crawledUrls.append(response.url)
def main():
spider = MySpider('http://quotes.toscrape.com')
process.crawl(spider)
process.start() # the script will block here until the crawling is finished
print("###########################################")
print(len(crawledUrls))
print(crawledUrls)
print("###########################################")
if __name__ == "__main__":
main()
See this comment on the scrapy github:
https://github.com/scrapy/scrapy/issues/1823#issuecomment-189731464
It appears you made the same mistakes as the reporter in that comment, where
process.crawl(...) takes a class, not instance, of Spider
params can be specified within the call to process.crawl(...) as keyword arguments. Check the possible kwargs in the Scrapy docs for CrawlerProcess.
So, for example, your main could look like this:
def main():
process.crawl(
MySpider,
start_urls=[
"http://example.com",
"http://example.org"
)
process.start()
...

Why does calling a scrapy spider from pywikibot give a ReactorNotRestartable error?

I am able to call a scrapy spider from another Python script using either CrawlerRunner or CrawlerProcess. But, when I try to call the same spider calling class from a pywikibot robot, I get a ReactorNotRestartable error. Why is this and how can I fix it?
Here is the error:
File ".\scripts\userscripts\ReplicationWiki\RWLoad.py", line 161, in format_new_page
aea = AEAMetadata(url=DOI_url)
File ".\scripts\userscripts\ReplicationWiki\GetAEAMetadata.py", line 39, in __init__
reactor.run() # the script will block here until all crawling jobs are finished
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1282, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
ReactorBase.startRunning(self)
File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
CRITICAL: Exiting due to uncaught exception <class 'twisted.internet.error.ReactorNotRestartable'>
Here is the script which calls my scrapy spider. It runs fine if I just call the class from main.
from twisted.internet import reactor, defer
from scrapy import signals
from scrapy.crawler import Crawler, CrawlerProcess, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider
class AEAMetadata:
"""
Helper to run ScrapeAEA spider and return JEL codes and data links
for a given AEA article link.
"""
def __init__(self, *args, **kwargs):
"""Initializer"""
url = kwargs.get('url')
if not url:
raise ValueError('No article url given')
self.items = []
def collect_items(item, response, spider):
self.items.append(item)
settings = get_project_settings()
crawler = Crawler(ScrapeaeaSpider, settings)
crawler.signals.connect(collect_items, signals.item_scraped)
runner = CrawlerRunner(settings)
d = runner.crawl(crawler, url=url)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
#process = CrawlerProcess(settings)
#process.crawl(crawler, url=url)
#process.start() # the script will block here until the crawling is finished
def get_jelcodes(self):
jelcodes = self.items[0]['jelcodes']
return jelcodes
def main():
aea = AEAMetadata(url='https://doi.org/10.1257/app.20180286')
jelcodes = aea.get_jelcodes()
print(jelcodes)
if __name__ == '__main__':
main()
Updated simple Test that instantiates the AEAMetadata class twice.
Here is the calling code in my pywikibot bot which fails:
from GetAEAMetadata import AEAMetadata
def main(*args):
for _ in [1,2]:
print('Top')
url = 'https://doi.org/10.1257/app.20170442'
aea = AEAMetadata(url=url)
print('After AEAMetadata')
jelcodes = aea.get_jelcodes()
print(jelcodes)
if __name__ == '__main__':
main()
My call to AEAMetadata was embedded in a larger script which fooled me into thinking the AEAMetadata class was only instantiated once before failure.
In fact, AEAMetadata was called twice.
And, I also thought that the script would block after the reactor.run() because the comment in all the scrapy examples stated that was the case.
However, the second deferred callback is reactor.stop() which unblocks the reactor.run().
A more basic incorrect assumption was that the reactor was deleted and recreated on each iteration. In fact, the reactor is instantiated and initialized when it is first imported. And, it is a global object which lives as long as the underlying process and was not designed to be restarted. The extremes actually needed to delete and restart a reactor are described here:
http://www.blog.pythonlibrary.org/2016/09/14/restarting-a-twisted-reactor/
So, I guess I've answered my own question.
And, I'm rewriting my script so it doesn't try to use the reactor in a way it was never intended to be used.
And, thanks Gallaecio for getting me thinking in the right direction.

HttpResponseRedirect' object has no attribute 'client'

Django 1.9.6
I'd like to write some unit test for checking redirection.
Could you help me understand what am I doing wrongly here.
Thank you in advance.
The test:
from django.test import TestCase
from django.core.urlresolvers import reverse
from django.http.request import HttpRequest
from django.contrib.auth.models import User
class GeneralTest(TestCase):
def test_anonymous_user_redirected_to_login_page(self):
user = User(username='anonymous', email='vvv#mail.ru', password='ttrrttrr')
user.is_active = False
request = HttpRequest()
request.user = user
hpv = HomePageView()
response = hpv.get(request)
self.assertRedirects(response, reverse("auth_login"))
The result:
ERROR: test_anonymous_user_redirected_to_login_page (general.tests.GeneralTest)
Traceback (most recent call last):
File "/home/michael/workspace/photoarchive/photoarchive/general/tests.py", line 44, in test_anonymous_user_redirected_to_login_page
self.assertRedirects(response, reverse("auth_login"))
File "/home/michael/workspace/venvs/photoarchive/lib/python3.5/site-packages/django/test/testcases.py", line 326, in assertRedirects
redirect_response = response.client.get(path, QueryDict(query),
AttributeError: 'HttpResponseRedirect' object has no attribute 'client'
Ran 3 tests in 0.953s
What pdb says:
-> self.assertRedirects(response, reverse("auth_login"))
(Pdb) response
<HttpResponseRedirect status_code=302, "text/html; charset=utf-8", url="/accounts/login/">
You need to add a client to the response object. See the updated code below.
from django.test import TestCase, Client
from django.core.urlresolvers import reverse
from django.http.request import HttpRequest
from django.contrib.auth.models import User
class GeneralTest(TestCase):
def test_anonymous_user_redirected_to_login_page(self):
user = User(username='anonymous', email='vvv#mail.ru', password='ttrrttrr')
user.is_active = False
request = HttpRequest()
request.user = user
hpv = HomePageView()
response = hpv.get(request)
response.client = Client()
self.assertRedirects(response, reverse("auth_login"))
Looks like you are directly calling your view's get directly rather than using the built-in Client. When you use the test client, you get your client instance back in the response, presumably for cases such as this where you want to check/fetch a redirect.
One solution would be to use the client to fetch the response from your view. Another is to stick a client in the response as mentioned above.
A third option is tell assertRedirects not to fetch the redirect. There is no need for client if you don't ask the assertion to fetch the redirect. That's done by adding fetch_redirect_response=False to your assertion.

Scrapy: How to redownload already cached URL if certain conditions apply

I have a large amount of http data stored in my cache backend in scrapy. There are certain pages that contain false data. these urls need to be rescheduled for download on the next run of scrapy.
I came up with the idea to modify the dummy cache policy that comes with scrapy. Unfortunately this doesn't seem to work.
Can anyone see what is wrong in the method is_cached_response_fresh?:
import os
import cPickle as pickle
from time import time
from weakref import WeakKeyDictionary
from email.utils import mktime_tz, parsedate_tz
from w3lib.http import headers_raw_to_dict, headers_dict_to_raw
from scrapy.http import Headers
from scrapy.responsetypes import responsetypes
from scrapy.utils.request import request_fingerprint
from scrapy.utils.project import data_path
from scrapy.utils.httpobj import urlparse_cached
class DummyPolicy(object):
def __init__(self, settings):
self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]
def should_cache_request(self, request):
return urlparse_cached(request).scheme not in self.ignore_schemes
def should_cache_response(self, response, request):
return response.status not in self.ignore_http_codes
def is_cached_response_fresh(self, response, request):
if "thisstring" in response.body.lower():
print "got mobile page. redownload"
return False
else:
return True
def is_cached_response_valid(self, cachedresponse, response, request):
return True
I think the answer here is that your content is most likely gzipped or deflated.
Try
from scrapy.utils.gz import gunzip
if "thisstring" in gunzip(response.body).lower()
Cannot say that solution is versatile but most likely it'll work in your case.