I am hitting a dead end with this problem I am having for 4 days. I want to crawl "http://www.ledcor.com/careers/search-careers". On each job listing page (i.e. http://www.ledcor.com/careers/search-careers?page=2) I go inside each job link and get the job title. I have this working so far.
Now, I am trying to make the spider go to next job listing page (i.g. from http://www.ledcor.com/careers/search-careers?page=2 to http://www.ledcor.com/careers/search-careers?page=3 and crawl all the jobs). My crawl rule does not work and I have no clues what is wrong, what is missing. Please help.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
class LedcorSpider(CrawlSpider):
name = "ledcor"
allowed_domains = ["www.ledcor.com"]
start_urls = ["http://www.ledcor.com/careers/search-careers"]
rules = [
Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/careers/search-careers\?page=\d",),restrict_xpaths=('//div[#class="pager"]/a',)), follow=True),
Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/job\?(.*)",)),callback="parse_items")
]
def parse_items(self, response):
hxs = HtmlXPathSelector(response)
item = CraigslistSampleItem()
item['title'] = hxs.select('//h1/text()').extract()[0].encode('utf-8')
item['link'] = response.url
return item
here is Items.py
from scrapy.item import Item, Field
class CraigslistSampleItem(Item):
title = Field()
link = Field()
desc = Field()
Here is Pipelines.py
class CraigslistSamplePipeline(object):
def process_item(self, item, spider):
return item
Updated: (#blender suggestion) It doesnt crawl
rules = [
Rule(SgmlLinkExtractor(allow=(r"http://www.ledcor.com/careers/search-careers\?page=\d",),restrict_xpaths=('//div[#class="pager"]/a',)), follow=True),
Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/job\?(.*)",)),callback="parse_items")
]
Your restrict_xpaths argument is wrong. Remove it and it will work.
$ scrapy shell http://www.ledcor.com/careers/search-careers
In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
In [2]: lx = SgmlLinkExtractor(allow=("http://www.ledcor.com/careers/search-careers\?page=\d",),restrict_xpaths=('//div[#class="pager"]/a',))
In [3]: lx.extract_links(response)
Out[3]: []
In [4]: lx = SgmlLinkExtractor(allow=("http://www.ledcor.com/careers/search-careers\?page=\d",))
In [5]: lx.extract_links(response)
Out[5]:
[Link(url='http://www.ledcor.com/careers/search-careers?page=1', text=u'', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=2', text=u'2', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=3', text=u'3', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=4', text=u'4', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=5', text=u'5', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=6', text=u'6', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=7', text=u'7', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=8', text=u'8', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=9', text=u'9', fragment='', nofollow=False),
Link(url='http://www.ledcor.com/careers/search-careers?page=10', text=u'10', fragment='', nofollow=False)]
You need to escape the question mark and use a raw string for the regex:
r"http://www\.ledcor\.com/careers/search-careers\?page=\d"
Otherwise, it looks for URLs like ...careerspage=2 and ...carrerpage=3.
try this:
rules = [Rule(SgmlLinkExtractor(), follow=True, callback="parse_items")]
Also, suitable changes need to be made in pipeline.py and do paste pipeline and items code.
Related
I am trying to get data using scrapy-selenium but there is some issue with the pagination. I have tried my level best to use different selectors and methods but nothing changes. It can only able to scrape the 1st page. I have also checked the other solutions but still, I am unable to make it work. Looking forward to experts' advice.
Source: https://www.gumtree.com/property-for-sale/london
import scrapy
from urllib.parse import urljoin
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
class Basic2Spider(scrapy.Spider):
name = 'basic2'
def start_requests(self):
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.set_window_size(1920, 1080)
driver.get("https://www.gumtree.com/property-for-sale/london")
time.sleep(2)
property_xpath = driver.find_elements(By.XPATH, "(//article[#class='listing-maxi']/a)[position()>=2 and position()<30]")
for detail in property_xpath:
href= detail.get_attribute('href')
time.sleep(2)
yield SeleniumRequest(
url = href,
)
driver.quit()
return super().start_requests()
def parse(self, response):
yield {
'Title': response.xpath("//div[#class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
'Price': response.xpath("//h3[#itemprop='price']/text()").get(),
'Add Posted': response.xpath("//*[#id='content']/div[1]/div/main/div[5]/section/div[1]/dl[1]/dd/text()").get(),
'Links': response.url
}
next_page = response.xpath("//li[#class='pagination-currentpage']/following-sibling::li[1]/a/text()").get()
if next_page:
abs_url = f'https://www.gumtree.com/property-for-sale/london/page{next_page}'
yield SeleniumRequest(
url= abs_url,
wait_time=5,
callback=self.parse
)
Your code seem to be correct but getting tcp ip block. I also tried alternative way where code is correct and pagination is working and this type of pagination is two times faster than others but gives me sometimes strange result and sometimes getting ip block.
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
class Basic2Spider(scrapy.Spider):
name = 'basic2'
responses = []
def start_requests(self):
url='https://www.gumtree.com/property-for-sale/london/page{page}'
for page in range(1,6):
print(page)
yield SeleniumRequest(
url=url.format(page=page),
callback=self.parse,
wait_time=5
)
def parse(self, response):
driver = response.meta['driver']
intial_page = driver.page_source
self.responses.append(intial_page)
for resp in self.responses:
r = Selector(text=resp)
property_xpath = r.xpath("(//article[#class='listing-maxi']/a)[position()>=2 and position()<30]")
for detail in property_xpath:
yield {
'Title': detail.xpath('.//*[#class="listing-title"]/text()').get().strip(),
'Price': detail.xpath('.//*[#class="listing-price"]/strong/text()').get(),
'Add Posted': detail.xpath('.//*[#class="listing-posted-date txt-sub"]/span//text()').getall()[2].strip(),
'Links': response.url
}
In CrawlSpider, how can I scrape the marked field "4 days ago" in the image before extracting each link?
The below-mentioned CrawlSpider is working fine. But in 'parse_item' I want to add a new field named 'Add posted' where I want to get the field marked on the image.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class PropertySpider(CrawlSpider):
name = 'property'
allowed_domains = ['www.openrent.co.uk']
start_urls = [
'https://www.openrent.co.uk/properties-to-rent/london?term=London&skip='+ str(x) for x in range(0, 5, 20)
]
rules = (
Rule(LinkExtractor(restrict_xpaths="//div[#id='property-data']/a"), callback='parse_item', follow=True),
)
def parse_item(self, response):
yield {
'Title': response.xpath("//h1[#class='property-title']/text()").get(),
'Price': response.xpath("//h3[#class='perMonthPrice price-title']/text()").get(),
'Links': response.url,
'Add posted': ?
}
When using the Rule object of the scrapy crawl spider, the extracted link's text is saved in a meta field of the request named link_text. You can obtain this value in the parse_item method and extract the time information using regex. You can read more about it from the docs. See below example.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re
class PropertySpider(CrawlSpider):
name = 'property'
allowed_domains = ['www.openrent.co.uk']
start_urls = [
'https://www.openrent.co.uk/properties-to-rent/london?term=London&skip='+ str(x) for x in range(0, 5, 20)
]
rules = (
Rule(LinkExtractor(restrict_xpaths="//div[#id='property-data']/a"), callback='parse_item', follow=True),
)
def parse_item(self, response):
link_text = response.request.meta.get("link_text")
m = re.search(r"(Last Updated.*ago)", link_text)
if m:
posted = m.group(1).replace("\xa0", " ")
yield {
'Title': response.xpath("//h1[#class='property-title']/text()").get(),
'Price': response.xpath("//h3[#class='perMonthPrice price-title']/text()").get(),
'Links': response.url,
"Add posted": posted
}
To show in a loop, you can use the following xpath to receive that data point:
x = response.xpath('//div[#class="timeStamp"]')
for i in x:
yield {'result': i.xpath("./i/following-sibling::text()").get().strip() }
Here's my current code:
#scrap all the cafe links from example.com
import scrapy, re
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import Selector
class DengaSpider(scrapy.Spider):
name = 'cafes'
allowed_domains = ['example.com']
start_urls = [
'http://example.com/archives/8136.html',
]
cafeOnlyLink = []
def parse(self, response):
cafelink = response.xpath('//li/a[contains(#href, "archives")]/#href').extract()
twoHourRegex = re.compile(r'^http://example\.com/archives/\d+.html$')
cafeOnlyLink = [ s for s in cafelink if twoHourRegex.match(s) ]
So how should I continue to parse content from each url containing in the [cafeOnlyLink] list? and I want to save all the result from each page in a csv file.
You can use something like this:
for url in cafeOnlyLink:
yield scrapy.Request(url=url, callback=self.parse_save_to_csv)
def parse_save_to_csv(self, response):
# The content is in response.body, so you have to select what information
# you want to sent to the csv file.
I am trying to create a simple crawl spider, but the response.url seem to be broken.
The code i am currently running is:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from teatrorecur.items import TeatrorecurItem
class Teatrorecurspider(CrawlSpider):
name = "teatrorecurspider"
allowed_domains = ["cartelera.com.uy"]
start_urls = (
'http://www.cartelera.com.uy/apeliculafunciones.aspx?,,PELICULAS,OBRA,0,26',
)
rules = (
Rule(LinkExtractor(allow=('CINE&OBRA&-1&29', )), callback='parse_item', follow=False),
#Rule(LinkExtractor(restrict_xpaths='//a[#href="CINE%2COBRA%2C-1%2C29"]'), follow=False, callback='parse_item'),
#Rule(LinkExtractor(allow=('CINE&OBRA&-1&29$', )), callback='parse_item', follow=False),
)
def parse_item(self, response):
item = TeatrorecurItem()
item['url']=response.url
yield item
a sample url i'm getting from this code is
<200 http://www.cartelera.com.uy/apeliculafunciones.aspx?-1=&12415=&29=&CINE=&OBRA=>
but the corresponding element in the page has the following href value
<a href="http://www.cartelera.com.uy/apeliculafunciones.aspx?12415&&CINE&OBRA&-1&29">
as you can see, the string following the .aspx? is messed up, i have no clue what is wrong.
LinkExtractor has a option named canonicalize that defaults to True.
Set it to False like so:
rules = (
Rule(LinkExtractor(allow=('CINE&OBRA&-1&29',), canonicalize=False), callback='parse_item', follow=False),
)
This will prevent LinkExtractor from performing changes to the url described at the def of canonicalize_url.
here is my spider:
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem
class vriskoSpider(CrawlSpider):
name = 'vrisko'
allowed_domains = ['vrisko.gr']
start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
rules = (
Rule(SgmlLinkExtractor(allow=('\?page=\d')), callback='parse_vrisko'),
)
def parse_vrisko(self, response):
hxs = HtmlXPathSelector(response)
vriskoit = VriskoItem()
vriskoit['eponimia'] = hxs.select("//a[#itemprop='name']/text()").extract()
vriskoit['address'] = hxs.select("//div[#class='results_address_class']/text()").extract()
print ' '.join(vriskoit['eponimia']).join(vriskoit['address'])
return vriskoit
The pages i try to crawl have the format http://www.blabla.com/blabla/bla?page=x
where x = any integer.
My problem is that my spider crawls all pages except the first one!
Any ideas why does this happen ?
Thank you in advance!
if you look into scrapy doc , start_urls response goes to **
parse
** method
so you can change your rule like this
rules = (
Rule(SgmlLinkExtractor(allow=('\?page=\d')), callback='parse'),
)
and method name from def parse_vrisko(self, response): to def parse(self, response):
or you can remove start_urls and start your spider with def start_requests(self): with callback to parse_vrisko