I'm practicing with a Playwright and Scrapy integration, trying to click a select element whose options are hidden. The aim is to click the select, wait for the hidden options to load, click one of them, and then move on. However, I'm getting the following error:
waiting for selector "option[value='type-2']"
selector resolved to hidden <option value="type-2" defaultvalue="">Type 2 (I started uni on or after 2012)</option>
attempting click action
waiting for element to be visible, enabled and stable
element is not visible - waiting...
I think the issue is that when the select is clicked, it disappears for some reason. I have added a wait on the selector, but the issue still persists.
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy_playwright.page import PageCoroutine

class JobSpider(scrapy.Spider):
    name = 'job_play'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.student-loan-calculator.co.uk/',
            callback=self.parse,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine("fill", "#salary", '28000'),
                    PageCoroutine("fill", "#debt", '25000'),
                    PageCoroutine("click", selector="//select[@id='loan-type']"),
                    PageCoroutine('wait_for_selector', "//select[@id='loan-type']"),
                    PageCoroutine('click', selector="//select[@id='loan-type']/option[2]"),
                    PageCoroutine('wait_for_selector', "//div[@class='form-row calculate-button-row']"),
                    PageCoroutine('click', selector="//button[@class='btn btn-primary calculate-button']"),
                    PageCoroutine('wait_for_selector', "//div[@class='container results-table-container']"),
                    PageCoroutine("wait_for_timeout", 5000),
                ]
            ),
        )

    def parse(self, response):
        container = response.xpath("//div[@class='container results-table-container']")
        for something in container:
            yield {
                'some': something
            }

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI": 'loans.jl',
            "FEED_FORMAT": 'jsonlines',
        }
    )
    process.crawl(JobSpider)
    process.start()
Related
My question is related to:
Scrapy-splash not rendering dynamic content from a certain react-driven site
But disabling private mode did not fix the issue (passing --disable-private-mode as an argument to the Docker container).
I made a single page app using React and all it does is add an element to root:
function F(){
class HelloMessage extends React.Component {
render() {
return React.createElement(
"div",
null,
"Hello ",
this.props.name
);
}
}
return(React.createElement(HelloMessage, { name: "Taylor" }));
}
But Splash only shows a blank white page!
I also made a single page website using Django and the HTML loads a script on head:
<script src="/static/js/modify-the-dom.js"></script>
And all the script does is to write something on the page after the content is loaded:
window.onload = function(){
document.write('Modified the DOM!');
}
This change shows up in browsers, but not in Splash. However, if I remove the window.onload wrapper and just write document.write('Modified the DOM!'); it works fine! So the issue seems to be with window.onload, even though I'm waiting long enough (about 10 seconds) in the Lua script.
So I was wondering how to fix that.
I also checked my Splash instance against this website, and it reports that JS is not enabled: https://www.whatismybrowser.com/detect/is-javascript-enabled I wonder if there's an option I have not enabled?
One example of a client-rendering website that Splash does not crawl properly (same issues above) is https://www.digikala.com/ it only shows a blank white page with some HTML that needs to be populated with AJAX requests. Even https://web.archive.org/ does not crawl this website's pages properly and shows a blank page.
Thanks a lot!
Update:
The Splash script is:
function main(splash, args)
    splash:set_user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36")
    assert(splash:go(args.url))
    assert(splash:wait(30)) -- Tried different timeouts.
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
I also tried setting these to no avail:
splash.plugins_enabled = true
splash.resource_timeout = 0
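To rule Scrapy out of the equation, one can also query Splash's HTTP API directly; a minimal sketch, assuming a Splash instance listening on localhost:8050 (the default of the official Docker image):

```python
from urllib.parse import urlencode

# Splash's render.html endpoint loads the page, runs its JS, and returns the rendered DOM.
# The host and port here are assumptions (the Docker image's defaults).
SPLASH = "http://localhost:8050/render.html"

def splash_render_url(target_url, wait=10):
    """Build a render.html request URL for the given page."""
    query = urlencode({"url": target_url, "wait": wait})
    return "{}?{}".format(SPLASH, query)

print(splash_render_url("https://www.digikala.com/"))
```

Fetching that URL with curl or a browser shows what Splash actually renders, independent of the scrapy-splash middleware or any Lua script.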
I realized that using CrawlSpider with a LinkExtractor rule only parses the linked pages but not the starting page itself.
For example, if http://mypage.test contains links to http://mypage.test/cats/ and http://mypage.test/horses/, the crawler would parse the cats and horses page without parsing http://mypage.test. Here's a simple code sample:
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://mypage.test']
    rules = [
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        yield {
            'url': response.url,
            'status': response.status,
        }

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {
        'pipelines.MyPipeline': 100,
    },
})
process.crawl(MySpider)
process.start()
My goal is to parse every single page in a website by following links. How do I accomplish that?
Apparently, CrawlSpider with a LinkExtractor rule only parses the linked pages but not the starting page itself.
Remove start_urls and add (note that Request callbacks must be callables, and the second request for the same URL needs dont_filter=True or the duplicate filter will drop it):

from scrapy import Request

def start_requests(self):
    yield Request('http://mypage.test', callback=self.parse_page)
    yield Request('http://mypage.test', callback=self.parse, dont_filter=True)

CrawlSpider uses self.parse to extract and follow links.
I am trying to crawl a web page to get its reviews and ratings, but I am getting the same data as output for every page.
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({
            'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
            'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall(),
        }))
expected: crawling 1000 pages of the web site https://www.fandango.com/aquaman-208499/movie-reviews
actual output:
https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
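As an aside, the ratings arrive as CSS width strings rather than numbers; a hypothetical helper to convert them into star scores (it assumes a full-width bar means five stars):

```python
def width_to_stars(style, max_stars=5):
    """Convert a CSS declaration like 'width: 90%;' into a star score."""
    percent = float(style.split(":")[1].strip().rstrip(";").rstrip("%"))
    return round(max_stars * percent / 100, 1)

print(width_to_stars("width: 90%;"))   # 4.5
print(width_to_stars("width: 100%;"))  # 5.0
```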
The reviews are dynamically populated using JavaScript.
You have to inspect the requests made by the site in cases like this.
The URL to get user reviews is this:
https://www.fandango.com/napi/fanReviews/208499/1/5
It returns a json with 5 reviews.
Your spider could be rewritten like this:
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-208499/movie-reviews?pn={page}'.format(page=page)}
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review
Note that I am also using yield instead of print to extract the items; this is how Scrapy expects items to be generated.
You can run this spider like this to export the extracted items to a file:
scrapy crawl rate -o outputfile.json
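The endpoint appears to follow the pattern /napi/fanReviews/&lt;movie_id&gt;/&lt;page&gt;/&lt;page_size&gt;; a small sketch for generating the paged URLs (the movie id is the one from the question, and the pattern itself is inferred from the observed requests):

```python
def fan_reviews_url(movie_id, page, page_size=5):
    # URL pattern inferred from the site's observed network requests
    return "https://www.fandango.com/napi/fanReviews/{}/{}/{}".format(movie_id, page, page_size)

for page in range(1, 4):
    print(fan_reviews_url("208499", page))
```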
I'm using NightwatchJS to test a page with an embedded iframe. The test opens the page and waits for the iframe to be present. All test steps work so far, but the iframe's content tells me that the browser cannot render embedded iframes.
Nightwatch Config (nightwatch.conf.js)
"chrome" : {
"desiredCapabilities": {
"browserName": "chrome",
"javascriptEnabled": true,
"acceptSslCerts": true,
"nativeElements": true
}
},
Test Code
.waitForElementVisible('//iframe[@id = "botframe"]')
.element('xpath', '//iframe[@id = "botframe"]', (r) => console.log(r))
.assert.containsText('//iframe[@id = "botframe"]', 'Hello')
Output
✔ Element <//iframe[@id = "botframe"]> was visible after 61 milliseconds.
{ sessionId: '76685e966809e760c639d589ba318693',
  status: 0,
  value: { ELEMENT: '0.5426473824985356-1' } }
✖ Testing if element <//iframe[@id = "botframe"]> contains text: "Hallo" in 1000 ms. - expected "Hallo" but got: "<br><p>Da Ihr Browser keine eingebetteten Frames anzeigen kann..." (German: "Since your browser cannot display embedded frames...")
You need to switch into the frame in order to check its contents:

.frame('botframe')

After you have finished checking the frame and want to return to your primary HTML content:

.frame(null)
// or
.frame()

will return you to the top-level document. (Note that .frame(0) would switch into the first frame on the page, not back out of the current one.)
I'm trying to save a copy of the pyparsing project on wikispaces.com before they take wikispaces down at the end of the month.
It seems odd (perhaps my version of google is broken ^_^) but I can't find any examples of duplicating/copying a site as-is, that is, as one views it in a browser. SO has this and this on the topic, but they only save the text, strictly the HTML/DOM structure, of the site. Unless I'm mistaken, these answers do not appear to save the images/header link files/javascript and related information necessary to render the page. Further examples I have seen are more concerned with extracting parts of the page than with duplicating it as-is.
I was wondering if anyone had any experience with this sort of thing or could point me to a useful blog/doc somewhere. I've used WinHTTrack in the past, but the robots.txt or the pyparsing.wikispaces.com/auth/ route prevents it from running properly, and I figured I'd get some scrapy experience in.
For those interested to see what I have tried thus far, here is my crawl spider implementation, which honors the robots.txt file:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse
from pathlib import Path

class PyparsingSpider(CrawlSpider):
    name = 'pyparsing'
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com/']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
Trying the same thing with the sitemap spider is similar. The first SO link provides an implementation with a plain spider.
import scrapy
from scrapy.spiders import SitemapSpider
from urllib.parse import urlparse
from pathlib import Path

class PyParsingSiteMap(SitemapSpider):
    name = "pyparsing"
    sitemap_urls = [
        'http://pyparsing.wikispaces.com/sitemap.xml',
        # 'http://pyparsing.wikispaces.com/robots.txt',
    ]
    allowed_domains = ['pyparsing.wikispaces.com']
    start_urls = ['http://pyparsing.wikispaces.com']  # "/home"
    custom_settings = {
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        page = urlparse(response.url)
        path = Path(page.netloc) / Path("" if page.path == "/" else page.path[1:])
        if path.parent:
            path.parent.mkdir(parents=True, exist_ok=True)  # Creates the folder
        path = path.with_suffix(".html")
        with open(path, 'wb') as file:
            file.write(response.body)
None of these spiders collect more than the HTML structure.
Also, I have found that the links, ..., that are saved do not appear to point to proper relative paths. At least, when opening the saved files, the links point to a path relative to the hard drive and not relative to the file, and when opening a page via http.server the links point to dead locations; presumably the .html extension is the trouble here. It might be necessary to remap/replace links in the stored structure.
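A minimal sketch of such a remapping pass (pure stdlib). It assumes internal links are site-absolute paths like /HowToUsePyparsing and simply appends the .html suffix the files were saved with; a real pass would also have to handle query strings and fragments:

```python
import re

def remap_links(html):
    """Rewrite site-absolute hrefs to the relative .html files the spider saved."""
    def repl(match):
        path = match.group(1).rstrip("/") or "/index"
        return 'href="{}.html"'.format(path.lstrip("/"))
    # Only touch hrefs that are site-absolute paths (start with "/"),
    # skipping any with fragments or query strings.
    return re.sub(r'href="(/[^"#?]*)"', repl, html)

print(remap_links('<a href="/HowToUsePyparsing">docs</a>'))
```

Running this over each saved file before (or instead of) writing response.body would at least make the local copies navigable from one another.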