Emailing when Scrapy project is finished - scrapy

So I re-read this page in the docs and still can't grasp which files in the project I should insert these lines into:
from scrapy.mail import MailSender

mailer = MailSender()
mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])

# ...
from scrapy import Spider, signals
from scrapy.mail import MailSender
# ...

class MailSpider(Spider):
    # ...

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        spider.mailer = MailSender()
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])

    # ...
You could use signals like this to send the e-mail after the spider has closed, but I am not sure it is the best way of doing this.
Also, I believe you could send e-mails anywhere Python code is allowed.
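If you want the mailer to pick up the MAIL_* settings (MAIL_HOST, MAIL_FROM, ...) from your project's settings.py instead of using the defaults, a minimal variation of the snippet above could build it from the crawler settings; the spider name, start URL, and recipient below are illustrative assumptions:

from scrapy import Spider, signals
from scrapy.mail import MailSender

class MailSpider(Spider):
    name = "mail_spider"                  # assumed name
    start_urls = ["https://example.com"]  # assumed URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Build the mailer from the project settings (MAIL_HOST, MAIL_FROM, ...)
        spider.mailer = MailSender.from_settings(crawler.settings)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        pass

    def spider_closed(self, spider):
        self.mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body")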

Related

Response works in Scrapy Shell, but doesn't work in code

I'm new to Scrapy. I wrote my first spider for this site https://book24.ru/knigi-bestsellery/?section_id=1592 and it works fine:
import scrapy

class BookSpider(scrapy.Spider):
    name = 'book24'
    start_urls = ['https://book24.ru/knigi-bestsellery/']

    def parse(self, response):
        for link in response.css('div.product-card__image-holder a::attr(href)'):
            yield response.follow(link, callback=self.parse_book)
        for i in range(1, 5):
            next_page = f'https://book24.ru/knigi-bestsellery/page-{i}/'
            yield response.follow(next_page, callback=self.parse)
            print(i)

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
Now I'm trying to write a spider for only one page:
import scrapy

class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book24.ru/product/transhumanism-inc-6015821/']

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
And it doesn't work; I get an empty file after this command in the terminal:
scrapy crawl book -O book.csv
I don't know why. I will be grateful for any help!
You were getting:
raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: BookSpider.parse callback is not defined
According to the documentation:

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

Just rename your def parse_book(self, response): to def parse(self, response): and it will work fine.
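For reference, a minimal sketch of the single-page spider with the callback renamed (the name and start_urls are taken from the question):

import scrapy

class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book24.ru/product/transhumanism-inc-6015821/']

    # parse() is the default callback Scrapy uses for responses to start_urls
    def parse(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }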

Using scrapy in a script and passing args

I want to use Scrapy in a larger project, but I am unsure how to pass args like name, start_urls, and allowed_domains. As I understand it, name, start_urls, and allowed_domains are settings for process.crawl, but I am not able to use self.var the way I have with the line site = self.site, since self obviously isn't defined there. There is also the problem of the proper way to return results. At the end of the day, I just want a way to crawl all URLs on a single domain from within a script.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse
from scrapy.crawler import CrawlerProcess
#from project.spiders.test_spider import SpiderName
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(settings={
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

crawledUrls = []

class MySpider(CrawlSpider):
    name = 'spider_example_name'

    def __init__(self, site):
        self.site = site

    site = self.site
    domain = urlparse(site).netloc
    start_urls = [site]
    allowed_domains = [domain]

    rules = (
        Rule(LinkExtractor(unique=True), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # I think there is a way to do this with yield
        print(self.site)
        crawledUrls.append(response.url)

def main():
    spider = MySpider('http://quotes.toscrape.com')
    process.crawl(spider)
    process.start()  # the script will block here until the crawling is finished
    print("###########################################")
    print(len(crawledUrls))
    print(crawledUrls)
    print("###########################################")

if __name__ == "__main__":
    main()
See this comment on the Scrapy GitHub:
https://github.com/scrapy/scrapy/issues/1823#issuecomment-189731464
It appears you made the same mistakes as the reporter in that comment: process.crawl(...) takes a class, not an instance, of Spider, and params can be specified within the call to process.crawl(...) as keyword arguments. Check the possible kwargs in the Scrapy docs for CrawlerProcess.
So, for example, your main could look like this:
def main():
    process.crawl(
        MySpider,
        start_urls=[
            "http://example.com",
            "http://example.org"
        ]
    )
    process.start()
    ...
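Since the question really wants to pass a single site and derive start_urls and allowed_domains from it, here is a minimal sketch of that approach; the spider name, the site keyword argument, and the yielded item field are illustrative assumptions rather than part of the original answer:

from urllib.parse import urlparse

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SiteSpider(CrawlSpider):
    name = 'site_spider'  # assumed name
    rules = (
        Rule(LinkExtractor(unique=True), callback='parse_item', follow=True),
    )

    def __init__(self, site, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Derive the crawl scope from the single `site` argument
        self.start_urls = [site]
        self.allowed_domains = [urlparse(site).netloc]

    def parse_item(self, response):
        # Yield items instead of appending to a module-level list
        yield {'url': response.url}

if __name__ == '__main__':
    process = CrawlerProcess(settings={'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(SiteSpider, site='http://quotes.toscrape.com')
    process.start()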

Trigger errback when process_exception() is called in Middleware

Using Scrapy, I'm implementing a CrawlSpider which will scrape all kinds of websites and hence, sometimes, very slow ones that will eventually produce a timeout.
My problem is that if such a twisted.internet.error.TimeoutError occurs, I want to trigger the errback of my spider. I don't want to raise this exception, and I also don't want to return a dummy Response object, which might suggest that scraping was successful.
Note that I was already able to make all of this work, but only using a "dirty" workaround:
myspider.py (excerpt)
class MySpider(CrawlSpider):
    name = 'my-spider'

    rules = (
        Rule(
            link_extractor=LinkExtractor(unique=True),
            callback='_my_callback', follow=True
        ),
    )

    def parse_start_url(self, response):
        # (...)

    def errback(self, failure):
        log.warning('Failed scraping following link: {}'
                    .format(failure.request.url))
middlewares.py (excerpt)
from scrapy import signals
from scrapy.http import Response
from twisted.internet.error import DNSLookupError, TimeoutError
# (...)

class MyDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        if (isinstance(exception, TimeoutError)
                or isinstance(exception, DNSLookupError)):
            # just 2 examples of errors I want to catch
            # set status=500 to enforce errback() call
            return Response(request.url, status=500)
Settings should be fine, with my custom middleware already enabled.
Now, as you can see, by using return Response(request.url, status=500) I can trigger my errback() function in MySpider as desired. However, the status code 500 is very misleading, because it's not only incorrect but technically I never receive any response at all.
So my question is: how can I trigger my errback() function through DownloaderMiddleware.process_exception() in a clean way?
EDIT: I quickly figured out that for similar exceptions like DNSLookupError I want the same behaviour in place. I've updated the code snippets accordingly.
I didn't find it in the docs, but looking at the source I found that DownloaderMiddleware.process_exception() can return twisted.python.failure.Failure objects as well as Request or Response objects.
This means you can return a Failure object to be handled by the errback, by wrapping the exception in the Failure object.
This is cleaner than creating a fake Response object; see an example middleware implementation that does this here: https://github.com/miguelsimon/site2graph/blob/master/site2graph/middlewares.py
The core idea:
from twisted.python.failure import Failure

class MyDownloaderMiddleware:
    def process_exception(self, request, exception, spider):
        return Failure(exception)
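For this to take effect, the middleware has to be enabled in the downloader middleware chain; a minimal sketch, assuming it lives in a (hypothetical) module myproject/middlewares.py and using an arbitrary priority:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,  # hypothetical path and priority
}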
The __init__ method of the Rule class accepts a process_request parameter that you can use to attach an errback to a request:
class MySpider(CrawlSpider):
    name = 'my-spider'

    rules = (
        Rule(
            # …
            process_request='process_request',
        ),
    )

    def process_request(self, request, response):
        return request.replace(errback=self.errback)

    def errback(self, failure):
        pass

Scrapy : How to write a UserAgentMiddleware?

I want to write a UserAgentMiddleware for Scrapy.
The docs say:

Middleware that allows spiders to override the default user agent. In order for a spider to override the default user agent, its user_agent attribute must be set.

Docs: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.useragent
But there is no example, and I have no idea how to write it.
Any suggestions?
You can look at it in your Scrapy install path, e.g.:
/Users/tarun.lalwani/.virtualenvs/project/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py
"""Set User-Agent header per spider or use a default value from settings"""
from scrapy import signals
class UserAgentMiddleware(object):
"""This middleware allows spiders to override the user_agent"""
def __init__(self, user_agent='Scrapy'):
self.user_agent = user_agent
#classmethod
def from_crawler(cls, crawler):
o = cls(crawler.settings['USER_AGENT'])
crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
return o
def spider_opened(self, spider):
self.user_agent = getattr(spider, 'user_agent', self.user_agent)
def process_request(self, request, spider):
if self.user_agent:
request.headers.setdefault(b'User-Agent', self.user_agent)
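With this built-in middleware enabled (it is by default), overriding the user agent per spider is just a matter of setting the user_agent attribute; a minimal sketch with an assumed spider name and URL:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                      # assumed name
    start_urls = ['https://example.com']  # assumed URL
    # Picked up by UserAgentMiddleware.spider_opened()
    user_agent = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'

    def parse(self, response):
        yield {'title': response.css('title::text').get()}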
You can see an example of setting a random user agent below:
https://github.com/alecxe/scrapy-fake-useragent/blob/master/scrapy_fake_useragent/middleware.py
First visit some website and get some of the newest user agents. Then, in your standard middleware, do something like this. This is the same place you would set up your own proxy settings: grab a random UA from the text file and put it in the headers. This is sloppy, just to show an example; you would want to import random at the top and also make sure to close useragents.txt when you are done with it. I would probably just load them into a list at the top of the document, as in the sketch after this block.
class GdataDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        user_agents = open('useragents.txt', 'r')
        user_agents = user_agents.readlines()
        import random
        user_agent = random.choice(user_agents)
        request.headers.setdefault(b'User-Agent', user_agent)

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
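Following the suggestion above, a cleaner sketch that loads the user agents into a list once at import time and strips newlines (the file name useragents.txt and the middleware name are assumptions):

import random

from scrapy import signals

# Load once at import time instead of reopening the file per request
with open('useragents.txt', 'r') as f:
    USER_AGENTS = [line.strip() for line in f if line.strip()]

class RandomUserAgentMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Pick a random UA for every outgoing request
        request.headers.setdefault(b'User-Agent', random.choice(USER_AGENTS))
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)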

open link authentication using scrapy

Hello, I'm having trouble using Scrapy. I want to scrape some data from clinicalkey.com.
I have an ID and password for my hospital, and my hospital has access to clinicalkey.com,
so if I log in to my hospital's library page, I can also use clinicalkey.com without further authentication.
But my Scrapy script doesn't work, and I can't find out why.
My script is here:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.FormRequest(loginsite, formdata={'id': 'Myid', 'password': 'MyPassword'}, callback=self.after_login)

    def after_login(self, response):
        yield scrapy.Request(clinicalkeysite, callback=self.parse_detail)

    def parse_detail(self, response):
        blahblah
When I look at the final response, it contains a message saying that I need to log in.
This site uses a JSON body for authentication.
Try something like this:
import json

body = json.dumps({"username": yourname, "password": yourpassword, "remember_me": True, "product": "CK_US"})
yield scrapy.FormRequest(loginsite, method='POST', body=body, headers={'Content-Type': 'application/json'}, callback=self.after_login)
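For context, a minimal sketch of how that snippet could slot into the spider from the question, using a plain scrapy.Request for the JSON POST; the URLs and credentials are placeholders, and whether the site accepts exactly this payload is an assumption:

import json
import scrapy

LOGIN_URL = 'https://example.org/login'       # placeholder, not the real login endpoint
TARGET_URL = 'https://example.org/protected'  # placeholder for the clinicalkey page

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        body = json.dumps({'username': 'Myid', 'password': 'MyPassword',
                           'remember_me': True, 'product': 'CK_US'})
        yield scrapy.Request(LOGIN_URL, method='POST', body=body,
                             headers={'Content-Type': 'application/json'},
                             callback=self.after_login)

    def after_login(self, response):
        # Assumption: check here that the login response looks successful
        # before requesting the authenticated page.
        yield scrapy.Request(TARGET_URL, callback=self.parse_detail)

    def parse_detail(self, response):
        pass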