Scrapy: How to write a UserAgentMiddleware?

I want to write a UserAgentMiddleware for Scrapy. The docs say:
Middleware that allows spiders to override the default user agent.
In order for a spider to override the default user agent, its user_agent attribute must be set.
docs:
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.useragent
But there is no example, and I have no idea how to write one.
Any suggestions?

You can look at its source in your Scrapy installation path, e.g.:
/Users/tarun.lalwani/.virtualenvs/project/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py
"""Set User-Agent header per spider or use a default value from settings"""
from scrapy import signals
class UserAgentMiddleware(object):
"""This middleware allows spiders to override the user_agent"""
def __init__(self, user_agent='Scrapy'):
self.user_agent = user_agent
#classmethod
def from_crawler(cls, crawler):
o = cls(crawler.settings['USER_AGENT'])
crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
return o
def spider_opened(self, spider):
self.user_agent = getattr(spider, 'user_agent', self.user_agent)
def process_request(self, request, spider):
if self.user_agent:
request.headers.setdefault(b'User-Agent', self.user_agent)
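For the question itself: because spider_opened() above does getattr(spider, 'user_agent', ...), you usually don't need to write any middleware at all; setting a user_agent attribute on the spider is enough. A minimal sketch (the spider name, URL and UA string are just placeholders):

import scrapy


class MySpider(scrapy.Spider):
    name = 'my-spider'
    start_urls = ['https://example.com']
    # Picked up by the built-in UserAgentMiddleware in spider_opened()
    user_agent = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}

The project-wide default still comes from the USER_AGENT setting in settings.py.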
You can see an example of setting a random user agent here:
https://github.com/alecxe/scrapy-fake-useragent/blob/master/scrapy_fake_useragent/middleware.py

First, visit some website and collect a few of the newest user agents. Then, in your standard middleware, do something like the following; this is the same place you would set up your own proxy settings. Grab a random UA from the text file and put it in the headers. The code below is sloppy, just to show the idea: you would want to import random at the top and also make sure to close useragents.txt when you are done with it. I would probably just load the user agents into a list once instead (see the sketch after the code block).
import random

from scrapy import signals


class GdataDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        with open('useragents.txt', 'r') as f:
            user_agents = f.readlines()
        user_agent = random.choice(user_agents).strip()
        request.headers.setdefault(b'User-Agent', user_agent)

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
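As a slightly cleaner variant of the idea above, you can read the file once in from_crawler() and keep the user agents in a list, instead of reopening useragents.txt on every request. A rough sketch; the file name and class name are just placeholders:

import random


class RandomUserAgentMiddleware(object):

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the file once at startup instead of once per request.
        with open('useragents.txt', 'r') as f:
            user_agents = [line.strip() for line in f if line.strip()]
        return cls(user_agents)

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers.setdefault(b'User-Agent', random.choice(self.user_agents))
        return None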


Trigger errback when process_exception() is called in Middleware

Using Scrapy, I'm implementing a CrawlSpider which will scrape all kinds of websites and hence, sometimes, very slow ones which will eventually produce a timeout.
My problem is that if such a twisted.internet.error.TimeoutError occurs, I want to trigger the errback of my spider. I don't want to raise this exception, and I also don't want to return a dummy Response object, which would suggest that scraping was successful.
Note that I was already able to make all of this work, but only using a "dirty" workaround:
myspider.py (excerpt)
class MySpider(CrawlSpider):
    name = 'my-spider'

    rules = (
        Rule(
            link_extractor=LinkExtractor(unique=True),
            callback='_my_callback', follow=True
        ),
    )

    def parse_start_url(self, response):
        # (...)
        pass

    def errback(self, failure):
        log.warning('Failed scraping following link: {}'
                    .format(failure.request.url))
middlewares.py (excerpt)
from twisted.internet.error import DNSLookupError, TimeoutError
from scrapy.http import Response  # needed for the status=500 workaround below
# (...)

class MyDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        if (isinstance(exception, TimeoutError)
                or isinstance(exception, DNSLookupError)):
            # just 2 examples of errors I want to catch
            # set status=500 to enforce errback() call
            return Response(request.url, status=500)
Settings should be fine, with my custom middleware already enabled.
Now, as you can see, by using return Response(request.url, status=500) I can trigger my errback() function in MySpider as desired. However, the status code 500 is very misleading because it's not only incorrect but technically I never receive any response at all.
So my question is: how can I trigger my errback() function through DownloaderMiddleware.process_exception() in a clean way?
EDIT: I quickly figured out that for similar exceptions like DNSLookupError I want the same behaviour in place. I've updated the code snippets accordingly.
I didn't find it in the docs, but looking at the source I found that DownloaderMiddleware.process_exception() can return twisted.python.failure.Failure objects as well as Request or Response objects.
This means you can return a Failure object to be handled by the errback, by wrapping the exception in a Failure object.
This is cleaner than creating a fake Response object; see an example middleware implementation that does this here: https://github.com/miguelsimon/site2graph/blob/master/site2graph/middlewares.py
The core idea:
from twisted.python.failure import Failure


class MyDownloaderMiddleware:

    def process_exception(self, request, exception, spider):
        return Failure(exception)
The __init__ method of the Rule class accepts a process_request parameter that you can use to attach an errback to a request:
class MySpider(CrawlSpider):
    name = 'my-spider'

    rules = (
        Rule(
            # …
            process_request='process_request',
        ),
    )

    def process_request(self, request, response):
        return request.replace(errback=self.errback)

    def errback(self, failure):
        pass
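To complete the picture, the Failure-returning middleware still has to be enabled in settings.py. A minimal sketch, assuming it lives in your project's middlewares.py module (the module path and order number are just examples):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,
}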

Scrapy: how to remove a URL from the httpcache or prevent adding it to the cache

I am using the latest Scrapy version, v1.3.
I crawl a website page by page, by following the URLs in the pagination. On some pages, the website detects that I am using a bot and gives me an error in the HTML. Since it is a successful request, it caches the page, and when I run the crawl again, I get the same error.
What I need is a way to prevent that page from getting into the cache. Or, if I cannot do that, I need to remove it from the cache after I detect the error in the parse method. Then I can retry and get the correct one.
I have a partial solution: I yield all requests with the "dont_cache": False parameter in meta so I make sure they use the cache. Where I detect the error and retry the request, I put dont_filter=True along with "dont_cache": True to make sure I get a fresh copy of the erroneous URL.
def parse(self, response):
    page = response.meta["page"] + 1
    html = Selector(response)
    counttext = html.css('h2#s-result-count::text').extract_first()
    if counttext is None:
        page = page - 1
        yield Request(url=response.url, callback=self.parse,
                      meta={"page": page, "dont_cache": True}, dont_filter=True)
I also tried a custom retry middleware, where I managed to get it working before the cache, but I couldn't read response.body successfully. I suspect that it is compressed somehow, as it is binary data.
from scrapy import log
from scrapy.selector import Selector
from scrapy.downloadermiddlewares.retry import RetryMiddleware


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        with open('debug.txt', 'wb') as outfile:
            outfile.write(response.body)
        html = Selector(text=response.body)
        url = response.url
        counttext = html.css('h2#s-result-count::text').extract_first()
        if counttext is None:
            log.msg("Automated process error: %s" % url, level=log.INFO)
            reason = 'Automated process error %d' % response.status
            return self._retry(request, reason, spider) or response
        return response
Any suggestion is appreciated.
Thanks
Mehmet
The middleware responsible for request/response caching is HttpCacheMiddleware. Under the hood, it is driven by cache policies: special classes which decide which requests and responses should or shouldn't be cached. You can implement your own cache policy class and use it with the setting
HTTPCACHE_POLICY = 'my.custom.cache.Class'
More information in docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
Source code of built-in policies: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/httpcache.py#L18
Thanks to mizhgun, I managed to develop a solution using a custom policy.
Here is what I did:
from scrapy.utils.httpobj import urlparse_cached


class CustomPolicy(object):

    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True
And when I catch the error (after caching has occurred, of course):
def parse(self, response):
    html = Selector(response)
    counttext = html.css('selector').extract_first()
    if counttext is None:
        yield Request(url=response.url, callback=self.parse,
                      meta={"refresh_cache": True}, dont_filter=True)
When you add refresh_cache to meta, it can be caught in the custom policy class.
Don't forget to add dont_filter=True, otherwise the second request will be filtered as a duplicate.
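For completeness, the custom policy also has to be registered via the HTTPCACHE_POLICY setting mentioned in the previous answer. A minimal sketch, assuming the class lives in a policies.py module of your project (the module path is just an example):

HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'myproject.policies.CustomPolicy'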

How to pass response to a spider without fetching a web page?

The Scrapy documentation specifically mentions that I should use downloader middleware if I want to pass a response to a spider without actually fetching the web page. However, I can't find any documentation or examples on how to achieve this functionality.
I am interested in passing only the URL to the request callback, populating an item's file_urls field with the URL (and certain permutations thereof), and using the FilesPipeline to handle the actual download.
How can I write a downloader middleware class that passes the URL to the spider while avoiding downloading the web page?
You can return a Response object from the downloader middleware's process_request() method. This method is called for every request your spider yields.
Something like:
from scrapy.http import Response


class NoDownloadMiddleware(object):

    def process_request(self, request, spider):
        # only process marked requests
        if not request.meta.get('only_download'):
            return
        # now make the Response object however you wish
        response = Response(request.url)
        return response
and in your spider:
def parse(self, response):
    yield Request(some_url, meta={'only_download': True})
and in your settings.py activate the middleware:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.NoDownloadMiddleware': 543,
}
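Since the question mentions FilesPipeline, here is a rough sketch of how the spider side could look once NoDownloadMiddleware is active. The file_urls field name is the standard FilesPipeline convention; the spider name, selectors and settings values are just placeholders:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/path/to/files'

# spider
import scrapy


class FileSpider(scrapy.Spider):
    name = 'file-spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            # Marked requests are short-circuited by NoDownloadMiddleware.
            yield scrapy.Request(url, callback=self.parse_url,
                                 meta={'only_download': True})

    def parse_url(self, response):
        # The body is empty, but the URL is all FilesPipeline needs.
        yield {'file_urls': [response.url]}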

Outgoing and incoming bandwidth at regular intervals of time using Scrapy

Is it possible to get stats like the outgoing and incoming bandwidth used during a crawl with Scrapy, at regular intervals of time?
Yes, it is possible. =)
The total request and response bytes are already tracked in the stats by the DownloaderStats middleware that comes with Scrapy. You can add another downloader middleware that tracks time and adds the new stats.
Here are the steps for it:
1) Configure a new downloader middleware in settings.py with a high order number so it will execute later in the pipeline:
DOWNLOADER_MIDDLEWARES = {
    'testing.middlewares.InOutBandwithStats': 990,
}
2) Put the following code into a middlewares.py file in the same dir as settings.py:
import time


class InOutBandwithStats(object):

    def __init__(self, stats):
        self.stats = stats
        self.startedtime = time.time()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def elapsed_seconds(self):
        return time.time() - self.startedtime

    def process_request(self, request, spider):
        request_bytes = self.stats.get_value('downloader/request_bytes')
        if request_bytes:
            outgoing_bytes_per_second = request_bytes / self.elapsed_seconds()
            self.stats.set_value('downloader/outgoing_bytes_per_second',
                                 outgoing_bytes_per_second)

    def process_response(self, request, response, spider):
        response_bytes = self.stats.get_value('downloader/response_bytes')
        if response_bytes:
            incoming_bytes_per_second = response_bytes / self.elapsed_seconds()
            self.stats.set_value('downloader/incoming_bytes_per_second',
                                 incoming_bytes_per_second)
        return response
And that's it. The process_request/process_response methods will be called whenever a request/response is processed and will keep updating the stats accordingly.
If you want logs at regular times, you can also call spider.log('Incoming bytes/sec: %s' % incoming_bytes_per_second) there; a sketch of a standalone extension that logs these values periodically follows below.
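As a rough sketch of that periodic logging, a small extension driven by a twisted LoopingCall can read those stats keys on a timer. This assumes the InOutBandwithStats middleware above is active; the class name, settings entry and interval are placeholders:

from twisted.internet import task

from scrapy import signals


class BandwidthLoggerExtension(object):

    def __init__(self, crawler, interval=60.0):
        self.crawler = crawler
        self.interval = interval
        self.task = None

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat('BANDWIDTH_LOG_INTERVAL', 60.0)
        ext = cls(crawler, interval)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        self.task = task.LoopingCall(self.log_bandwidth, spider)
        self.task.start(self.interval, now=False)

    def spider_closed(self, spider):
        if self.task and self.task.running:
            self.task.stop()

    def log_bandwidth(self, spider):
        stats = self.crawler.stats
        spider.logger.info(
            'Outgoing bytes/sec: %s, incoming bytes/sec: %s',
            stats.get_value('downloader/outgoing_bytes_per_second'),
            stats.get_value('downloader/incoming_bytes_per_second'))

Enable it with something like EXTENSIONS = {'testing.extensions.BandwidthLoggerExtension': 500} in settings.py (again, the module path is just an example).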
Read more
Downloader middlewares in Scrapy - how to activate and create your own
DownloaderStats - downloader middleware stats class

Emailing when Scrapy project is finished

So I re-read this page in the docs and still can't grasp into which files in the project I should insert these lines:
from scrapy.mail import MailSender

mailer = MailSender()
mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])
# ...
from scrapy.mail import MailSender
# ...

class MailSpider(Spider):
    # ...

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        spider.mailer = MailSender()
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])
    # ...
You could use signals like this to send the e-mail after the spider has closed, but I am not sure if this is the best way of doing it.
Also, I believe you can send e-mails from anywhere Python code is allowed.
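If you want the mailer to pick up its SMTP configuration from the project settings rather than the defaults, you can build it with MailSender.from_settings() inside from_crawler. A minimal sketch; all the SMTP values are placeholders:

# settings.py
MAIL_FROM = 'scrapy@example.com'
MAIL_HOST = 'smtp.example.com'
MAIL_PORT = 587
MAIL_USER = 'user'
MAIL_PASS = 'secret'

# in from_crawler():
#     spider.mailer = MailSender.from_settings(crawler.settings)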