How to quickly scrape a list of URLs from dynamically rendered websites with scrapy-playwright and parallel processing? - scrapy

Here is my spider. It works well, but it is not very fast for larger numbers of pages (tens of thousands).
import scrapy
import csv
from scrapy_playwright.page import PageMethod


class ImmoSpider(scrapy.Spider):
    name = "immo"

    def start_requests(self):
        with open("urls.csv", "r") as f:
            reader = csv.DictReader(f)
            urls = [item['Url-scraper'] for item in reader][0:1]
        for elem in urls:
            yield scrapy.Request(
                elem,
                meta={
                    'playwright': True,
                    'playwright_include_page': True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_load_state", 'networkidle')
                    ],
                },
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        # parse stuff
        yield {
            # yield stuff
        }
My scraper project is set up like the official Scrapy getting-started tutorial.
I'm still a beginner at scraping, so maybe I missed a simple solution.
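One thing worth noting: scrapy-playwright schedules page renders through the normal Scrapy downloader, so throughput is usually governed by Scrapy's concurrency settings together with scrapy-playwright's context limits. A minimal settings.py sketch to experiment with (the numbers below are placeholders, not recommendations):

# settings.py -- illustrative values only, tune for the target site and hardware
CONCURRENT_REQUESTS = 32                # how many requests Scrapy keeps in flight
PLAYWRIGHT_MAX_CONTEXTS = 8             # scrapy-playwright: parallel browser contexts
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4    # pages rendered concurrently per context
DOWNLOAD_DELAY = 0                      # no artificial delay between requests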

Related

How to add a waiting time with playwright

I am integrating Scrapy with Playwright but am having difficulty adding a timer after a click; when I take a screenshot of the page after the click, it is still stuck on the log-in page.
How can I integrate a timer so that the page waits a few seconds until it loads?
import scrapy
from scrapy_playwright.page import PageCoroutine


class DoorSpider(scrapy.Spider):
    name = 'door'
    start_urls = ['https://nextdoor.co.uk/login/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        PageCoroutine("click", selector=".onetrust-close-btn-handler.onetrust-close-btn-ui.banner-close-button.onetrust-lg.ot-close-icon"),
                        PageCoroutine("fill", "#id_email", 'my_email'),
                        PageCoroutine("fill", "#id_password", 'my_password'),
                        PageCoroutine('waitForNavigation'),
                        PageCoroutine("click", selector="#signin_button"),
                        PageCoroutine("screenshot", path="cookies.png", full_page=True),
                    ],
                ),
            )

    def parse(self, response):
        yield {
            'data': response.body
        }
There are many waiting methods that you can use, depending on your particular use case. Below is a sample, but you can read more in the docs:
wait_for_event(event, **kwargs)
wait_for_selector(selector, **kwargs)
wait_for_load_state(**kwargs)
wait_for_url(url, **kwargs)
wait_for_timeout(timeout)
For your question, if you need to wait until the page loads, you can use the coroutine below and insert it at the appropriate place in your list:
...
PageCoroutine("wait_for_load_state", "load"),
...
or
...
PageCoroutine("wait_for_load_state", "domcontentloaded"),
...
You can try any of the other wait methods if the two above don't work, or you can use an explicit timeout value like 3 seconds (this is not recommended, as it will fail more often and is not optimal for web scraping):
...
PageCoroutine("wait_for_timeout", 3000),
...
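Putting it together, a minimal sketch of the coroutine list from the question with an explicit wait inserted after the sign-in click (selectors and credentials are copied from the question, so treat them as placeholders):

playwright_page_coroutines = [
    PageCoroutine("click", selector=".onetrust-close-btn-handler.onetrust-close-btn-ui.banner-close-button.onetrust-lg.ot-close-icon"),
    PageCoroutine("fill", "#id_email", "my_email"),
    PageCoroutine("fill", "#id_password", "my_password"),
    PageCoroutine("click", selector="#signin_button"),
    # wait for the post-login page before taking the screenshot
    PageCoroutine("wait_for_load_state", "load"),
    PageCoroutine("screenshot", path="cookies.png", full_page=True),
]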

Scrapy Playwright: execute CrawlSpider using scrapy playwright

Is it possible to execute a CrawlSpider using the Playwright integration for Scrapy? I am trying the following script to run a CrawlSpider, but it does not scrape anything. It also does not show any error!
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london/page',
            meta={"playwright": True}
        )
        return super().start_requests()

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"), callback='parse_item', follow=False),
    )

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url
        }
Requests extracted from the rule do not have the playwright=True meta key, which is a problem if they need to be rendered by the browser to have useful content. You could solve that by using Rule.process_request, something like:
def set_playwright_true(request, response):
    request.meta["playwright"] = True
    return request


class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(LinkExtractor(...), callback='parse_item', follow=False, process_request=set_playwright_true),
    )
Update after comment
Make sure your URL is correct; I get no results for that particular one (remove /page?).
Bring back your start_requests method; it seems the first page also needs to be downloaded using the browser.
Unless marked explicitly (e.g. @classmethod, @staticmethod), Python instance methods receive the calling object as the implicit first argument. The convention is to call this self (e.g. def set_playwright_true(self, request, response)). However, if you do this, you will need to change the way you create the rule, either:
Rule(..., process_request=self.set_playwright_true)
Rule(..., process_request="set_playwright_true")
From the docs: "process_request is a callable (or a string, in which case a method from the spider object with that name will be used)"
My original example defines the processing function outside of the spider, so it's not an instance method.
As suggested by elacuesta, I'd only add that you should change your parse_item from an async def to a standard def:
def parse_item(self, response):
It goes against everything I've read too, but that's what got it working for me.
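Putting the pieces above together, a minimal end-to-end sketch might look like this (the XPaths are copied from the question and may be stale):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def set_playwright_true(request, response):
    # make sure every request extracted by the rule is rendered by the browser
    request.meta["playwright"] = True
    return request


class GumtreeCrawlSpider(CrawlSpider):
    name = "gumtree_crawl"
    allowed_domains = ["www.gumtree.com"]

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
            callback="parse_item",
            follow=False,
            process_request=set_playwright_true,
        ),
    )

    def start_requests(self):
        # the listing page itself also needs browser rendering
        yield scrapy.Request(
            url="https://www.gumtree.com/property-for-sale/london",
            meta={"playwright": True},
        )

    def parse_item(self, response):
        yield {
            "Title": response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            "Links": response.url,
        }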

Django generic Views with templates

I've added a new template to my project (thing_listings.html) and I've added the views:
from django.views import generic
from .models import Things


class IndexView(generic.ListView):
    template_name = 'home/index.html'

    def get_queryset(self):
        return Things.objects.all()


# newly added view
class ThingView(generic.ListView):
    template_name = 'home/thing_listings.html'

    def get_queryset(self):
        return Things.objects.all()


class DetailView(generic.DetailView):
    model = Labs
    template_name = 'home/detail.html'
and the URLs:
from django.conf.urls import url
from . import views

app_name = 'home'

urlpatterns = [
    # /home/
    url(r'^$', views.IndexView.as_view(), name='index'),
    # /thingview/ (newly added)
    url(r'^$', views.ThingView.as_view(), name='thingview'),
    # /home/"details"/
    url(r'^(?P<pk>[0-9]+)/$', views.DetailView.as_view(), name='detail'),
]
At the moment the site runs fine, except that when I click on the thing_listings link I just get directed to the index instead of the page ThingView is supposed to show. Please help; I'm not sure where I've gone wrong.
I've used the href: {% url 'home:thingview' %}
I've found the solution, in case anyone else is having the same issue.
All you should need to do is add the path to your regular expression, e.g.:
url(r'^servicesview/$', views.ServicesView.as_view(), name='services'),
I've repeated the process multiple times to make sure it works.
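Applied to the URLconf from the question, the fix might look like this (the 'thingview/' prefix is just an example path):

urlpatterns = [
    # /home/
    url(r'^$', views.IndexView.as_view(), name='index'),
    # /home/thingview/ -- give the list view its own path so it no longer matches '^$' like the index
    url(r'^thingview/$', views.ThingView.as_view(), name='thingview'),
    # /home/"details"/
    url(r'^(?P<pk>[0-9]+)/$', views.DetailView.as_view(), name='detail'),
]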

Scrapy ends after first result

I've been looking around and can't find the answer I'm looking for. I got my crawler (Scrapy) to return results close to what I'm looking for. So what I'm trying to do now is get it to pull multiple results from the page. Currently it pulls the first one and stops. If I take off the extract_first() it pulls all the data but groups it together. So I'm looking for one of two answers that would work:
1) continue crawling results and not ending
2) ungroup each item onto a new line of results
Here is my code:
import scrapy
from scrapy.selector import Selector
from urlparse import urlparse
from urlparse import urljoin
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
# from scrappy.http import HtmlResponse


class MySpider(CrawlSpider):
    name = "ziprecruiter"

    def start_requests(self):
        allowed_domains = ["https://www.ziprecruiter.com/"]
        urls = [
            'https://www.ziprecruiter.com/candidate/search?search=operations+manager&location=San+Francisco%2C+CA'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for houses in response.xpath('/html/body'):
            yield {
                'Job_title:': houses.xpath('.//span[@class="just_job_title"]//text()[1]').extract_first(),
                'Company:': houses.xpath('.//a[@class="t_org_link name"]//text()[1]').extract_first(),
                'Location:': houses.xpath('.//a[@class="t_location_link location"]//text()[1]').extract_first(),
                'FT/PT:': houses.xpath('.//span[@class="data_item"]//text()[1]').extract_first(),
                'Link': houses.xpath('/html/body/main/div/section/div/div[2]/div/div[2]/div[1]/article[4]/div[1]/button[1]/text()').extract_first(),
                'Link': houses.xpath('.//a/@href[1]').extract_first(),
                'pay': houses.xpath('./section[@class="perks_item"]/span[@class="data_item"]//text()[1]').extract_first()
            }
Thank you in advance!
EDIT:
After more research I redefined the container to crawl in, and that gives me all the right answers. Now my question is how do I get each item on the page instead of only the first result... it just doesn't loop. Here's my code:
import scrapy
from scrapy.selector import Selector
from urlparse import urlparse
from urlparse import urljoin
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
# from scrappy.http import HtmlResponse


class MySpider(CrawlSpider):
    name = "ziprecruiter"

    def start_requests(self):
        allowed_domains = ["https://www.ziprecruiter.com/"]
        urls = [
            'https://www.ziprecruiter.com/candidate/search?search=operations+manager&location=San+Francisco%2C+CA'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for houses in response.xpath('/html/body/main/div/section/div/div[2]/div/div[2]/div[1]/article[1]/div[2]'):
            yield {
                'Job_title:': houses.xpath('.//span[@class="just_job_title"]//text()').extract(),
                'Company:': houses.xpath('.//a[@class="t_org_link name"]//text()').extract(),
                'Location:': houses.xpath('.//a[@class="t_location_link location"]//text()').extract(),
                'FT/PT:': houses.xpath('.//span[@class="data_item"]//text()').extract(),
                'Link': houses.xpath('.//a/@href').extract(),
                'pay': houses.xpath('./section[@class="perks_item"]/span[@class="data_item"]//text()').extract()
            }
It seems to me that you should use this XPath instead:
//div[@class="job_content"]
as that is the class of the div you're looking for. When I execute it for this page, I get 20 div elements returned. However, you might want to add some more filtering to the XPath query, just in case there are other divs with that class name that you don't want to parse.
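A minimal sketch of the parse method using that container (the field XPaths are reused from the question and the class names may have changed since):

def parse(self, response):
    # one div per job card, instead of a single hard-coded article index
    for house in response.xpath('//div[@class="job_content"]'):
        yield {
            'Job_title': house.xpath('.//span[@class="just_job_title"]//text()').extract_first(),
            'Company': house.xpath('.//a[@class="t_org_link name"]//text()').extract_first(),
            'Location': house.xpath('.//a[@class="t_location_link location"]//text()').extract_first(),
            'Link': house.xpath('.//a/@href').extract_first(),
        }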

scrapy-splash does not crawl recursively with CrawlSpider

I have integrated scrapy-splash in my CrawlSpider's process_request in rules like this:
def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request
The problem is that the crawl renders only the URLs at the first depth. I also wonder how I can get a response even with a bad HTTP code or a redirected response.
Thanks in advance.
Your problem may be related to this: https://github.com/scrapy-plugins/scrapy-splash/issues/92
In short, try to add this to your parsing callback function:
def parse_item(self, response):
    """Parse response into item also create new requests."""
    page = RescrapItem()
    ...
    yield page

    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded,
                                  args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)
In case you wonder why this can return both items and new requests, here is what the docs say: https://doc.scrapy.org/en/latest/topics/spiders.html
In the callback function, you parse the response (web page) and return
either dicts with extracted data, Item objects, Request objects, or an
iterable of these objects. Those Requests will also contain a callback
(maybe the same) and will then be downloaded by Scrapy and then their
response handled by the specified callback.
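As a generic illustration of that quote (not specific to Splash), a callback can yield both extracted items and new requests from the same response; a minimal sketch:

def parse_item(self, response):
    # yield extracted data ...
    yield {"url": response.url, "title": response.css("title::text").get()}
    # ... and also yield follow-up requests handled by the same callback
    for href in response.css("a::attr(href)").getall():
        yield scrapy.Request(response.urljoin(href), callback=self.parse_item)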