scrapy-splash does not crawl recursively with CrawlSpider - scrapy

I have integrated scrapy-splash into my CrawlSpider's process_request in its rules like this:
def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request
The problem is that the crawl only renders the URLs at the first depth.
I also wonder how I can get the response even when the HTTP status code is bad or the response is redirected.
Thanks in advance,

Your problem may be related to this: https://github.com/scrapy-plugins/scrapy-splash/issues/92
In short, try to add this to your parsing callback function:
from scrapy.http import HtmlResponse
from scrapy_splash import SplashRequest, SplashTextResponse

def parse_item(self, response):
    """Parse response into item and also create new requests."""
    page = RescrapItem()
    ...
    yield page

    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded,
                                  args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)
In case you wonder why this can return both items and new requests, here is what the Scrapy docs say: https://doc.scrapy.org/en/latest/topics/spiders.html
In the callback function, you parse the response (web page) and return
either dicts with extracted data, Item objects, Request objects, or an
iterable of these objects. Those Requests will also contain a callback
(maybe the same) and will then be downloaded by Scrapy and then their
response handled by the specified callback.
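To make that concrete, here is a minimal, hypothetical sketch (the spider name and selectors are illustrative, not taken from the question) of a callback that yields both extracted data and new requests:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # extracted data: a dict (an Item object would work the same way)
        yield {"title": response.css("title::text").get()}
        # new requests: their responses are handled by the callback given here
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)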

Related

How to swiftly scrape a list of URLs from dynamically rendered websites using scrapy-playwright with parallel processing?

Here is my spider. It works well, but it is not very fast for larger numbers of pages (tens of thousands):
import scrapy
import csv
from scrapy_playwright.page import PageMethod

class ImmoSpider(scrapy.Spider):
    name = "immo"

    def start_requests(self):
        with open("urls.csv", "r") as f:
            reader = csv.DictReader(f)
            urls = [item['Url-scraper'] for item in reader][0:1]
        for elem in urls:
            yield scrapy.Request(
                elem,
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                    'playwright_page_methods': [
                        PageMethod("wait_for_load_state", 'networkidle')
                    ],
                },
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        # parse stuff
        yield {
            # yield stuff
        }
My scraper project is set up like the official Scrapy getting-started tutorial.
I'm still a beginner at scraping, so maybe I missed a simple solution.
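Not a verified fix, but one thing usually worth checking for throughput is Scrapy's concurrency settings together with scrapy-playwright's context and page limits. A sketch of settings one might experiment with (the values below are illustrative assumptions, not tuned for this site):
# settings.py -- illustrative values; tune for your hardware and the target site
CONCURRENT_REQUESTS = 32              # Scrapy-level parallelism
CONCURRENT_REQUESTS_PER_DOMAIN = 16
PLAYWRIGHT_MAX_CONTEXTS = 8           # scrapy-playwright: parallel browser contexts
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4  # pages opened per context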

How to add a waiting time with playwright

I am integrating Scrapy with Playwright but I am having difficulty adding a delay after a click. As a result, when I take a screenshot of the page after the click, it is still stuck on the log-in page.
How can I add a delay so that the spider waits a few seconds for the page to load?
import scrapy
from scrapy_playwright.page import PageCoroutine

class DoorSpider(scrapy.Spider):
    name = 'door'
    start_urls = ['https://nextdoor.co.uk/login/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        PageCoroutine("click", selector=".onetrust-close-btn-handler.onetrust-close-btn-ui.banner-close-button.onetrust-lg.ot-close-icon"),
                        PageCoroutine("fill", "#id_email", 'my_email'),
                        PageCoroutine("fill", "#id_password", 'my_password'),
                        PageCoroutine('waitForNavigation'),
                        PageCoroutine("click", selector="#signin_button"),
                        PageCoroutine("screenshot", path="cookies.png", full_page=True),
                    ],
                ),
            )

    def parse(self, response):
        yield {
            'data': response.body
        }
There are many waiting methods that you can use depending on your particular use case. Below is a sample, but you can read more in the docs:
wait_for_event(event, **kwargs)
wait_for_selector(selector, **kwargs)
wait_for_load_state(**kwargs)
wait_for_url(url, **kwargs)
wait_for_timeout(timeout)
For your question, if you need to wait until the page loads, you can use the coroutine below and insert it at the appropriate place in your list:
...
PageCoroutine("wait_for_load_state", "load"),
...
or
...
PageCoroutine("wait_for_load_state", "domcontentloaded"),
...
You can try any of the other wait methods if the two above don't work, or you can use an explicit timeout value such as 3 seconds (this is not recommended, as it will fail more often and is not optimal for web scraping):
...
PageCoroutine("wait_for_timeout", 3000),
...
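Applied to the question's coroutine list (reusing the asker's selectors; the exact placement is only a suggestion), the wait would go after the sign-in click and before the screenshot, so the screenshot is taken once the post-login page has loaded:
from scrapy_playwright.page import PageCoroutine

# placement sketch; pass this list as the playwright_page_coroutines meta value
page_coroutines = [
    PageCoroutine("fill", "#id_email", "my_email"),
    PageCoroutine("fill", "#id_password", "my_password"),
    PageCoroutine("click", selector="#signin_button"),
    PageCoroutine("wait_for_load_state", "load"),
    PageCoroutine("screenshot", path="cookies.png", full_page=True),
]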

Scrapy Playwright: execute CrawlSpider using scrapy playwright

Is it possible to run a CrawlSpider using the Playwright integration for Scrapy? I am trying the following script to execute a CrawlSpider, but it does not scrape anything. It also does not show any errors!
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london/page',
            meta={"playwright": True}
        )
        return super().start_requests()

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"), callback='parse_item', follow=False),
    )

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url
        }
Requests extracted from the rule do not have the playwright=True meta key; that's a problem if they need to be rendered by the browser to have useful content. You could solve that by using Rule.process_request, something like:
def set_playwright_true(request, response):
    request.meta["playwright"] = True
    return request

class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(LinkExtractor(...), callback='parse_item', follow=False, process_request=set_playwright_true),
    )
Update after comment
Make sure your URL is correct; I get no results for that particular one (remove /page?).
Bring back your start_requests method; it seems the first page also needs to be downloaded using the browser (see the sketch below).
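Putting those pieces together, a hedged sketch of how the whole spider could look (the listing URL and the XPath come from the question, with /page removed as suggested, and are not verified; the simplified parse_item fields are placeholders):
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def set_playwright_true(request, response):
    # Rule.process_request hook: make extracted requests go through Playwright
    request.meta["playwright"] = True
    return request

class GumtreeCrawlSpider(CrawlSpider):
    name = "gumtree_crawl"
    allowed_domains = ["www.gumtree.com"]

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
            callback="parse_item",
            follow=False,
            process_request=set_playwright_true,
        ),
    )

    def start_requests(self):
        # the first page also needs the browser, so mark it explicitly
        yield scrapy.Request(
            url="https://www.gumtree.com/property-for-sale/london",
            meta={"playwright": True},
        )

    def parse_item(self, response):
        # placeholder extraction; replace with the real fields
        yield {"Title": response.xpath("//h1/text()").get(), "Link": response.url}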
Unless marked explicitly (e.g. @classmethod, @staticmethod), Python instance methods receive the calling object as the implicit first argument. The convention is to call this self (e.g. def set_playwright_true(self, request, response)). However, if you do this, you will need to change the way you create the rule, either:
Rule(..., process_request=self.set_playwright_true)
Rule(..., process_request="set_playwright_true")
From the docs: "process_request is a callable (or a string, in which case a method from the spider object with that name will be used)"
My original example defines the processing function outside of the spider, so it's not an instance method.
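For the string form, a minimal sketch (hypothetical spider) of how the instance-method variant would be wired up:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MyCrawlSpider(CrawlSpider):
    name = "my_crawl"

    # the method is referenced by name, so it can be an instance method even
    # though `self` does not exist yet while the class body defines `rules`
    rules = (
        Rule(LinkExtractor(), callback="parse_item", process_request="set_playwright_true"),
    )

    def set_playwright_true(self, request, response):
        request.meta["playwright"] = True
        return request

    def parse_item(self, response):
        yield {"url": response.url}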
As suggested by elacuesta, the only thing I'd add is to change your parse_item def from an async def to a standard def.
def parse_item(self, response):
It goes against everything I've read too, but that's what got it working for me.

Get request URL after AJAX request

I have a search page with the link Search?params, but any subsequent search requests are made via Ajax forms using ASP.NET. These make a request to an action with a different name, like InstantSearch?params, while in the browser I still see Search?params.
From this page I have a link to another page, and I need to save the URL so I can return to this page.
But after an AJAX request, Request.Url returns InstantSearch?params, not the link from the browser address bar. And the action behind that link returns only a partial view, so when the user returns to the previous URL the page is messed up.
How do I get the link of the previous page, from the browser address bar in ASP.NET, rather than the actual last requested URL?
While searching we are loading masonry containers like this:
$("#main-content-container").load("/Kit/InstantSearch?" + parameters, function() {
    $('#mason-container').imagesLoaded(function() {
        $('#mason-container').masonry({
            itemSelector: '.kit-thumb-container',
            columnWidth: 210,
            isFitWidth: true,
            gutter: 10
        });
    });
});
Then I'm calling Foundation Joyride on the same page and need to pass the current page URL so I can return to it. Joyride is invoked on load of the page behind this link:
@Html.ActionLink("Go to kit details help", "OrderPageHelp", "Kit", new { returnUrl = Request.Url }, new { @style = "font-size:16px;" })
The return URL I need is Kit/Search?params, but Request.Url returns the last request made while loading the masonry content, Kit/InstantSearch?params.
How can I pass the needed Url without hard-coding it?
So this one's a bit old, but I found myself in a similar situation recently and found a quick workaround. Posting it in case anyone's interested.
You can solve this problem by taking advantage of the TempData class.
TempData can be used to store data between requests. The information remains as long as the session is active, or until you retrieve the data again.
So when the user first loads the page, before the Ajax method is triggered, store the data in a variable on the page AND in the TempData("YourVariableName") object, and create the action link with the saved URL. When the Ajax request is fired, it will overwrite the value in Request.Url. So check for a value in TempData("YourVariableName"); if it is there, use that value AND reset the TempData("YourVariableName") value. This keeps the original value of the page URL even after many Ajax requests have been triggered. Code in Visual Basic:
@Code
    Dim LastURL As String = ""
    If Not TempData("LastURL") Is Nothing Then
        LastURL = TempData("LastURL")
        TempData("LastURL") = LastURL
    Else
        LastURL = Request.Url.AbsoluteUri
        TempData("LastURL") = LastURL
    End If
End Code
And pass the value stored in the LastURL variable as a parameter to your action link.

Grab the resource contents in CasperJS or PhantomJS

I see that CasperJS has a "download" function and an "on resource received" callback but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.
I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?
This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment, so I found out where PhantomJS's QtNetwork core puts resources when it caches them.
Long story short, here is my gist. You need the cache.js and mimetype.js files:
https://gist.github.com/bshamric/4717583
//for this to work, you have to call phantomjs with the cache enabled:
//usage: phantomjs --disk-cache=true test.js

var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

//this is the path that the QtNetwork classes use for caching files for their http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

//when the resource is received, go ahead and include a reference to it in the cache object
page.onResourceReceived = function(response) {
    //I only cache images, but you can change this
    if(response.contentType.indexOf('image') >= 0)
    {
        cache.includeResource(response);
    }
};

//when the page is done loading, go through each cached resource and do something with it;
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    for(index in cache.cachedResources) {
        var file = cache.cachedResources[index].cacheFileNoPath;
        var ext = mimetype.ext[cache.cachedResources[index].mimetype];
        var finalFile = file.replace("."+cache.cacheExtension,"."+ext);
        fs.write('saved/'+finalFile,cache.cachedResources[index].getContents(),'b');
    }
};

page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});
Then when you call phantomjs, just make sure the cache is enabled:
phantomjs --disk-cache=true test.js
Some notes:
I wrote this for the purpose of getting the images on a page without using the proxy or taking a low res snapshot. QT uses compression on certain text file resources and you will have to deal with the decompression if you use this for text files. Also, I ran a quick test to pull in html resources and it didn't parse the http headers out of the result. But, this is useful to me, hopefully someone else will find it so, modify it if you have problems with a specific content type.
I've found that, until PhantomJS matures a bit, this is a bit of a headache for them; see issue 158: http://code.google.com/p/phantomjs/issues/detail?id=158
So you want to do it anyway? I've opted to go one level higher to accomplish this: I grabbed PyMiProxy from https://github.com/allfro/pymiproxy, downloaded and installed it, set it up, took their example code and made this proxy.py:
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):

    def do_request(self, data):
        data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
        return data

    def do_response(self, data):
        #print '<< %s' % repr(data[:100])
        request_line, headers_alone = data.split('\r\n', 1)
        headers = Message(StringIO(headers_alone))
        print "Content type: %s" % (headers['content-type'])
        if headers['content-type'] == 'text/x-comma-separated-values':
            f = open('data.csv', 'w')
            f.write(data)
        print ''
        return data

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()
Then I fire it up
python proxy.py
Next I execute phantomjs with the proxy specified...
phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js
You may want to turn your security back on and so on; it was needless for me at the moment as I'm scraping just one source. You should now see a bunch of text flowing through your proxy console, and if it lands on something with the MIME type "text/x-comma-separated-values" it'll save it as data.csv. This will also save all the headers and everything, but if you've come this far I'm sure you can figure out how to pop those off.
One other detail: I found that I had to disable gzip encoding. I could use zlib to decompress gzipped data coming from my own Apache web server, but if it comes out of IIS or the like the decompression gets errors, and I'm not sure about that part of it.
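For reference, a sketch (not part of the original setup; the helper name is hypothetical) of how the intercepted response could be split into headers and body, with the body gunzipped via zlib before inspecting it:
import zlib

def split_and_gunzip(data):
    # split the raw HTTP response into headers and body at the blank line
    headers, _, body = data.partition('\r\n\r\n')
    try:
        # wbits = 32 + 15 lets zlib auto-detect a gzip or zlib header
        body = zlib.decompress(body, 32 + 15)
    except zlib.error:
        # body was not compressed (or used raw deflate); keep it as-is
        pass
    return headers, body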
So my power company won't offer me an API? Fine! We do it the hard way!
Did not realize I could grab the source from the document object like this:
casper.start(url, function() {
    var js = this.evaluate(function() {
        return document;
    });
    this.echo(js.all[0].outerHTML);
});
More info here.
You can use Casper.debugHTML() to print out the contents of an HTML resource:
var casper = require('casper').create();

casper.start('http://google.com/', function() {
    this.debugHTML();
});

casper.run();
You can also store the HTML contents in a variable using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in latest master)