How to crawl a whole website by following links using CrawlSpider? - scrapy

I realized that using CrawlSpider with a LinkExtractor rule only parses the linked pages but not the starting page itself.
For example, if http://mypage.test contains links to http://mypage.test/cats/ and http://mypage.test/horses/, the crawler would parse the cats and horses page without parsing http://mypage.test. Here's a simple code sample:
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://mypage.test']
    rules = [
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        yield {
            'url': response.url,
            'status': response.status,
        }

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {
        'pipelines.MyPipeline': 100,
    },
})
process.crawl(MySpider)
process.start()
My goal is to parse every single page in a website by following links. How do I accomplish that?
As you observed, CrawlSpider with a LinkExtractor rule only parses the linked pages, not the starting page itself.

Remove start_urls and add:

from scrapy import Request

def start_requests(self):
    # Parse the starting page itself...
    yield Request('http://mypage.test', callback=self.parse_page)
    # ...and also hand it to CrawlSpider's built-in parse so the rules run
    # (dont_filter is needed because the same URL was already requested above).
    yield Request('http://mypage.test', callback=self.parse, dont_filter=True)

CrawlSpider uses self.parse to extract and follow links, so the second request keeps the crawl going.

Related

Waiting for element to be visible

I'm practicing with a Playwright and Scrapy integration, trying to click a selector that has two hidden child selectors. The aim is to click the first selector, wait for the two hidden ones to load, then click one of those and move on. However, I'm getting the following error:
waiting for selector "option[value='type-2']"
selector resolved to hidden <option value="type-2" defaultvalue="">Type 2 (I started uni on or after 2012)</option>
attempting click action
waiting for element to be visible, enabled and stable
element is not visible - waiting...
I think the issue is that the selector disappears for some reason once it is clicked. I have added a wait on the selector, but the issue still persists.
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy_playwright.page import PageCoroutine

class JobSpider(scrapy.Spider):
    name = 'job_play'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.student-loan-calculator.co.uk/',
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine("fill", "#salary", '28000'),
                    PageCoroutine("fill", "#debt", '25000'),
                    PageCoroutine("click", selector="//select[@id='loan-type']"),
                    PageCoroutine('wait_for_selector', "//select[@id='loan-type']"),
                    PageCoroutine('click', selector="//select[@id='loan-type']/option[2]"),
                    PageCoroutine('wait_for_selector', "//div[@class='form-row calculate-button-row']"),
                    PageCoroutine('click', selector="//button[@class='btn btn-primary calculate-button']"),
                    PageCoroutine('wait_for_selector', "//div[@class='container results-table-container']"),
                    PageCoroutine("wait_for_timeout", 5000),
                ]
            ),
        )

    def parse(self, response):
        container = response.xpath("//div[@class='container results-table-container']")
        for something in container:
            yield {
                'some': something.get()
            }

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI": 'loans.jl',
            "FEED_FORMAT": 'jsonlines',
        }
    )
    process.crawl(JobSpider)
    process.start()

inherit the home controller odoo 13

I wanted to inherit the home page to add some information for the logged-in user. I followed the Odoo forums and Odoo Mates on YouTube, but it is not working. This is my code:
from odoo import http
from odoo.http import request
from odoo.addons.website_sale.controllers.main import Website

class Website_inherit(Website):

    @http.route('/', type='http', auth="public", website=True)
    def index(self, **kw):
        res = super(Website_inherit, self).index(**kw)
        print("vlad work")
        return http.request.render('n_theme.name_app', data)

Problem having same data while crawling a web page

I am trying to crawl a web page to get its reviews and ratings, but I am getting the same data for every page as output.
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
                          'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))
expected: crawling 1000 pages of the web site https://www.fandango.com/aquaman-208499/movie-reviews
actual output:
https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
The reviews are dynamically populated using JavaScript.
You have to inspect the requests made by the site in cases like this.
The URL to get user reviews is this:
https://www.fandango.com/napi/fanReviews/208499/1/5
It returns a json with 5 reviews.
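To make the shape concrete: the endpoint returns a JSON object with a top-level data list of review objects, which is the only key the answer's parse method relies on. A stdlib illustration with a made-up sample (the field names inside each review are hypothetical):

```python
import json

# Hypothetical payload mirroring the endpoint's shape; real responses carry
# more fields per review, but only the top-level 'data' key matters here.
sample = '{"data": [{"rating": 5, "review": "Great"}]}'
payload = json.loads(sample)
reviews = payload['data']  # the list the spider iterates and yields from
```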
Your spider could be rewritten like this:
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-208499/movie-reviews?pn={page}'.format(page=page)}
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review
Note that I am also using yield instead of print to extract the items; this is how Scrapy expects items to be generated.
You can run this spider like this to export the extracted items to a file:
scrapy crawl rate -o outputfile.json
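If you do stay with the HTML pages instead, note that the width: N%; styles extracted in the question are how the page encodes star ratings (100% presumably being 5 stars), so they need a small conversion step. A sketch, assuming that mapping:

```python
def style_to_stars(style):
    """Convert a rating style like 'width: 90%;' to a star count (100% == 5 stars)."""
    percent = float(style.split(':')[1].strip().rstrip(';').rstrip('%'))
    return percent / 100 * 5

# e.g. style_to_stars("width: 90%;") -> 4.5
```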

Scrapy : Preserving a website

I'm trying to save a copy the pyparsing project on wikispaces.com before they take wikispaces down at the end of the month.
It seems odd (perhaps my version of google is broken ^_^) but I can't find any examples of duplicating/copying a site as-is, that is, as one views it in a browser. SO has this and this on the topic, but they only save the text, strictly the HTML/DOM structure, of the site. Unless I'm mistaken, these answers do not appear to save the images/header link files/JavaScript and related information necessary to render the page. Further examples I have seen are more concerned with extracting parts of the page than with duplicating it as-is.
I was wondering if anyone had any experience with this sort of thing or could point me to a useful blog/doc somewhere. I've used WinHTTrack in the past, but the robots.txt or the pyparsing.wikispaces.com/auth/ route is preventing it from running properly, and I figured I'd get some scrapy experience in.
For those interested in what I have tried thus far: here is my crawl spider implementation, which acknowledges the robots.txt file.
import scrapy
from scrapy.spiders import SitemapSpider
from urllib.parse import urlparse
from pathlib import Path
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PyparsingSpider(CrawlSpider):
    name = 'pyparsing'
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com/']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
Trying the same thing with the sitemap spider is similar. The first SO link provides an implementation with a plain spider.
import scrapy
from scrapy.spiders import SitemapSpider
from urllib.parse import urlparse
from pathlib import Path

class PyParsingSiteMap(SitemapSpider):
    name = "pyparsing"
    sitemap_urls = [
        'http://pyparsing.wikispaces.com/sitemap.xml',
        # 'http://pyparsing.wikispaces.com/robots.txt',
    ]
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com']  # "/home"
    custom_settings = {
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
None of these spiders collects more than the HTML structure.
I have also found that the links, ..., that are saved do not appear to point to proper relative paths. At least, when opening the saved files, the links point to a path relative to the hard drive rather than relative to the file, and when serving a page via http.server the links point to dead locations; presumably the .html extension is the trouble here. It might be necessary to remap/replace the links in the stored structure.
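For that last point, one option is a post-processing pass over the saved files that rewrites same-domain links to the local .html naming scheme the spiders use. A rough stdlib sketch (regex-based and not tested against the real wikispaces markup, so treat it as a starting point; relative links would need similar handling):

```python
import re
from urllib.parse import urlparse

def remap_links(html, domain='pyparsing.wikispaces.com'):
    """Rewrite absolute href links to `domain` into relative *.html paths."""
    def repl(match):
        prefix, url, suffix = match.group(1), match.group(2), match.group(3)
        page = urlparse(url)
        if page.netloc != domain:
            return match.group(0)  # leave external (and relative) links alone
        path = page.path if page.path not in ('', '/') else '/index'
        return prefix + path.lstrip('/') + '.html' + suffix
    return re.sub(r'(href=")([^"]+)(")', repl, html)
```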

How to use middlewares when using scrapy runspider command?

I know that we can configure middlewares in settings.py when we have a scrapy project.
I haven't started a scrapy project; I use the runspider command to run my spider, but I want to use some middlewares. How do I set them in the spider file?
When you run a spider using scrapy runspider my_file.py, the -s option only lets you pass simple scalar settings (like strings or integers). The SPIDER_MIDDLEWARES setting, however, expects a dictionary, and there isn't a really straightforward way to pass one through the command line.
Currently, the only way I know to set SPIDER_MIDDLEWARES for a spider without a project is the custom_settings spider attribute, available since Scrapy 1.0 (at the time of writing, only from the code repo, not yet officially released).
If you go that route, you can put your middlewares in a file middlewares.py and do:
import scrapy
import middlewares  # need this, or you get an import error

class MySpider(scrapy.Spider):
    name = 'my-spider'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'middlewares.SampleMiddleware': 500,
        }
    }
    ...
...
Alternatively, if you're putting the middleware class in the same file, you can use:
import scrapy

class SampleMiddleware(object):
    # your middleware code here
    ...

def fullname(o):
    return o.__module__ + "." + o.__name__

class MySpider(scrapy.Spider):
    name = 'my-spider'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            fullname(SampleMiddleware): 500,
        }
    }
    ...
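For clarity, fullname simply joins the class's module and name into the dotted-path string that SPIDER_MIDDLEWARES keys use. A quick standalone check (the exact module prefix depends on how the file is run, so only the class-name part is fixed):

```python
def fullname(o):
    """Return the dotted module.ClassName path for a class object."""
    return o.__module__ + "." + o.__name__

class SampleMiddleware(object):
    pass

key = fullname(SampleMiddleware)
# key is e.g. '__main__.SampleMiddleware' when run as a script
```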