Scrapy doesn't crawl the whole site - scrapy

I have fallen into a common trap and can't get out of it: my Scrapy spider is very lazy, that is, it only parses the start_urls. Code below:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Field
from scrapy.selector import Selector


class HabraPostSpider(scrapy.Spider):
    name = 'habrapost'
    allowed_domains = ['habrahabr.ru']
    start_urls = ['https://habrahabr.ru/interesting/']

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

    rules = (Rule(LinkExtractor()),
             Rule(LinkExtractor(allow=('/post/',)), callback='parse_post', follow=True))
I would be very happy if anybody could tell me how to fix my spider!

Reading between the lines, what I understand is that you want the crawler to follow every link it sees.
For that you have to use CrawlSpider instead of Spider; the rules attribute is only processed by CrawlSpider:
class HabraPostSpider(scrapy.spiders.CrawlSpider):
Check the documentation.

Related

Scrapy not defined

I'm trying to write a web crawler using VS Code and encountered this error. Below is my code:
class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass
May I know what's wrong?
Thanks.
try to import scrapy first
import scrapy
This should work; you are missing the scrapy import that provides the parent class of spider1:
import scrapy


class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass

Is there a way to run code after reactor.run() in scrapy?

I am working on a Scrapy API. One of my issues was that the Twisted reactor isn't restartable; I fixed this by using CrawlerRunner instead of CrawlerProcess. My spider extracts links from a website and validates them. My issue is that if I add the validation code after reactor.run(), it doesn't get executed. This is my code:
import urllib.parse
from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

links = set()
links_validate = set()

runner = CrawlerRunner()


class Crawler(CrawlSpider):
    name = "Crawler"
    start_urls = ['https://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

    def parse_links(self, response):
        base_url = urlparse(response.url).netloc
        hrefs = response.xpath('//a/@href').getall()
        links.add(urllib.parse.quote(response.url, safe=':/'))
        for link in hrefs:
            if base_url not in link:
                links.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
        for link in links:
            if base_url in link:
                links_validate.add(link)


runner.crawl(Crawler)
reactor.run()
If I add the code that validates the links after reactor.run(), it doesn't get executed. And if I put the code before reactor.run(), nothing happens, because the spider hasn't finished crawling all the links yet. What should I do? The code that validates the links is fine; I have used it before and it works.
We can do that with d.addCallback(<callback_function>) and d.addErrback(<errback_function>), and stop the reactor once the crawl is done so that reactor.run() returns:
...
runner = CrawlerRunner()
d = runner.crawl(MySpider)


def finished(result):
    print("finished :D")
    # put your link-validation code here
    reactor.stop()


def spider_error(err):
    print("Spider error :/")
    reactor.stop()


d.addCallback(finished)
d.addErrback(spider_error)
reactor.run()
For your ScraperApi you can use Klein.
Klein is a micro-framework for developing production-ready web services with Python. It is 'micro' in that it has an incredibly small API similar to Bottle and Flask.
...
import scrapy
from scrapy.crawler import CrawlerRunner
from klein import Klein

app = Klein()


@app.route('/')
async def hello(request):
    status = list()

    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            'https://quotes.toscrape.com/',
            'https://quotes.toscrape.com/page/2/',
            'https://quotes.toscrape.com/page/3/',
            'https://quotes.toscrape.com/page/4/'
        ]

        def parse(self, response):
            status.append(response.status)

    runner = CrawlerRunner()
    await runner.crawl(TestSpider)
    return str(status)


@app.route('/h')
def index(request):
    return 'Index Page'


app.run('localhost', 8080)

scrapy not entering parse(response.url)

I'm a beginner. When I crawl, there is no error, but Scrapy never enters parse() for response.url; the browser just shows an empty page titled "data;".
How do I get it to request response.url?
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
import time
from result_crawler.items import RESULT_Item


class RESULT_Spider(scrapy.Spider):
    name = "EPL"
    allowed_domains = ["premierleague.com"]
    starts_urls = ["https://www.premierleague.com/match/38567"]

    def __init__(self):
        scrapy.Spider.__init__(self)
        self.browser = webdriver.Chrome("/users/germpark/chromedriver")

    def parse(self, response):
        self.browser.get(response.url)
        time.sleep(5)
        ...
I want it to request https://www.premierleague.com/match/38567, but it never does.
The correct attribute name is start_urls, not starts_urls. Because of the incorrect attribute name, Scrapy does not detect any start pages.

login using post request in scrapy

I want to log in to rediffmail, but this error is generated:
exceptions.NameError: global name 'FormRequest' is not defined
Here is my spider code:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from rediffmail.items import RediffmailItem


class MySpider(BaseSpider):
    name = 'rediffmail'
    allowed_domains = ["rediff.com"]
    start_urls = ['https://mail.rediff.com/cgi-bin/login.cgi']

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'login': 'XXXX', 'passwd': 'XXXX'},
                                          callback=self.after_login)]

    def after_login(self, response):
        # check login succeeded before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return
Please check whether there is any issue in my code. I am a beginner in Python.
You are missing the import of FormRequest. In your version of Scrapy, FormRequest lives in scrapy.http.
So add this line in your import section:
from scrapy.http import FormRequest

Scrapy-Recursively Scrape Webpages and save content as html file

I am using Scrapy to extract the information in the tags of web pages and then save those pages as HTML files. E.g. http://www.austlii.edu.au/au/cases/cth/HCA/1945/ has some pages related to judicial cases. I want to go to each link and save only the content related to the particular judicial case as an HTML page, e.g. go to http://www.austlii.edu.au/au/cases/cth/HCA/1945/1.html and save the information related to that case.
Is there a way to do this recursively in Scrapy and save the content as HTML pages?
Yes, you can do it with Scrapy, Link Extractors will help:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class AustliiSpider(CrawlSpider):
    name = "austlii"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+\.html"), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # do whatever with the HTML content (response.body variable)
        pass
Hope that helps.