Scrapy LinkExtractor ScraperAPI integration

I am trying to extract links from a webpage, but I have to use a proxy service. When I use the proxy service, the links are not extracted correctly: the extracted links are missing the https://www.homeadvisor.com part and use api.scraperapi.com as the domain instead of the website's domain. How can I fix this problem?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    # allowed_domains = ['homeadvisor.com']
    # start_urls = ['https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html']
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True)
    ]

    def parse_page(self, response):
        company_name = response.css("h1.\#w-full.\#text-3xl").css("::text").get().strip()
        yield {
            "company_name": company_name
        }
2022-10-24 18:13:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python> (referer: None)
2022-10-24 18:13:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.MarkAllenContracting.117262730.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:51 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FireSignDBAdditionsand.123758218.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:52 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.G3BuildersLLC.71091804.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:54 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FletcherCooleyInc.43547458.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)

It looks like the ScraperAPIClient requires you to use that specific client.scrapyGet(url=...) syntax for each and every request. However, since you are using a CrawlSpider with a LinkExtractor set to follow, Scrapy automatically sends out requests in its usual way, so those requests are getting blocked. You might be better off extracting all of the links yourself and then filtering the ones you want to follow.
For example:
import scrapy
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(scrapy.Spider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    def parse(self, response):
        print(response)
        links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
        yield {"links": list(set(links))}
This will yield:
[
{
"links": [
"https://www.homeadvisor.com/rated.TapConstructionLLC.42214874.html",
"https://www.homeadvisor.com#quote=42214874",
"https://www.homeadvisor.com/emc.Drywall-Plaster-directory.-12025.html",
"https://www.linkedin.com/company/homeadvisor/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Philadelphia.PA.-12001.html",
"https://www.homeadvisor.com/login",
"https://www.homeadvisor.com/task.Major-Home-Repairs-General-Contractor.40062.html",
"https://www.homeadvisor.com/near-me/home-addition-builders/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Lawrenceville.GA.-12001.html",
"https://www.homeadvisor.com/near-me/carpentry-contractors/",
"https://www.homeadvisor.com/emc.Roofing-directory.-12061.html",
"https://www.homeadvisor.com/c.Doors.Atlanta.GA.-12024.html",
"https://www.homeadvisor.com#quote=20057351",
"https://www.homeadvisor.com/near-me/deck-companies/",
"https://www.homeadvisor.com/tloc/Atlanta-GA/Bathroom-Remodel/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Knoxville.TN.-12001.html",
"https://www.homeadvisor.com/xm/35317287/task-selection/-12001?postalCode=30301",
"https://www.homeadvisor.com/category.Additions-Remodeling.12001.html",
"https://www.homeadvisor.comtel:4042672949",
"https://www.homeadvisor.com/rated.DCEnclosuresInc.16852798.html",
"https://www.homeadvisor.com#quote=16721785",
"https://www.homeadvisor.com/near-me/bathroom-remodeling/",
"https://www.homeadvisor.com/near-me",
"https://www.homeadvisor.com/emc.Heating-Furnace-Systems-directory.-12040.html",
"https://pro.homeadvisor.com/r/?m=sp_pro_center&entry_point_id=33522463",
"https://www.homeadvisor.com/r/hiring-a-home-architect/",
"https://www.homeadvisor.com#quote=119074241",
"https://www.homeadvisor.comtel:8669030759",
"https://www.homeadvisor.com/rated.SilverOakRemodel.78475581.html#ratings-reviews",
"https://www.homeadvisor.com/emc.Tree-Service-directory.-12074.html",
"https://www.homeadvisor.com/task.Bathroom-Remodel.40129.html",
"https://www.homeadvisor.com/rated.G3BuildersLLC.71091804.html",
"https://www.homeadvisor.com/sp/horizon-remodeling-construction",
"https://www.homeadvisor.com/near-me/fence-companies/",
"https://www.homeadvisor.com/emc.Gutters-directory.-12038.html",
"https://www.homeadvisor.com/c.GA.html#topcontractors",
...
...
...
]
}
]
The actual output is almost 400 links...
Then you can use some kind of filtering to decide which links you want to follow, and use the same API SDK syntax to follow them. Applying a filtering step will also cut down on the number of requests sent, which conserves API calls and saves you money as well.
For example:
def parse(self, response):
    print(response)
    links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
    yield {"links": list(set(links))}
    # some filtering process
    for link in links:
        yield scrapy.Request(client.scrapyGet(url=link))
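For instance, a simple filter that mirrors the allow="/rated" pattern from your original Rule would keep only the company profile pages before wrapping them in the API call. This is just a sketch of one possible filter; the parse_page callback is assumed to be the same one you already have in your CrawlSpider:

def parse(self, response):
    links = [self.domain + i if not i.startswith('https://') else i
             for i in response.xpath("//a/@href").getall()]
    # keep only the /rated company profile pages, mirroring allow="/rated"
    profile_links = [link for link in set(links) if '/rated' in link]
    for link in profile_links:
        # route the follow-up request through the API client as well
        yield scrapy.Request(client.scrapyGet(url=link), callback=self.parse_page)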
UPDATE:
Try this...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlencode

APIKEY = "67e5e7755771b9abf8062e595dd5cc2a"  # <- your api key
APIDOMAIN = "http://api.scraperapi.com/"
DOMAIN = 'https://www.homeadvisor.com/'

def get_scraperapi_url(url):
    payload = {'api_key': APIKEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def process_links(links):
    for link in links:
        i = link.url.index('rated')
        link.url = DOMAIN + link.url[i:]
        link.url = get_scraperapi_url(link.url)
    return links

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [get_scraperapi_url('https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True, process_links=process_links)
    ]

    def parse_page(self, response):
        company_name = response.xpath("//h1[contains(@class,'#w-full #text-3xl')]/text()").get()
        yield {
            "company_name": company_name
        }
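The key piece is process_links: the LinkExtractor resolves the extracted links against api.scraperapi.com, so each URL is cut back down to its rated... path, re-attached to the homeadvisor.com domain, and only then wrapped in the proxy URL again. Roughly, for one of the links from your log (values shown just to illustrate the transformation):

# what process_links does to a single extracted link, step by step (illustrative)
extracted = 'https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html'
rebuilt = DOMAIN + extracted[extracted.index('rated'):]
# rebuilt -> 'https://www.homeadvisor.com/rated.DCEnclosuresInc.16852798.html'
proxied = get_scraperapi_url(rebuilt)
# proxied -> 'http://api.scraperapi.com/?api_key=...&url=https%3A%2F%2Fwww.homeadvisor.com%2Frated.DCEnclosuresInc.16852798.html'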

Related

How to build own middleware in Scrapy?

I'm just starting to learn Scrapy and I have a question. For my "spider" I have to take a list of urls (start_urls) from a Google Sheets table, and I have this code:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
client = gspread.authorize(creds)
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)
for link in records_data:
    print(link)
........
How do I configure the middleware so that when the spider is launched (scrapy crawl my_spider), the links from this code are automatically substituted into start_urls? Perhaps I need to create a class in middlewares.py?
I will be grateful for any help, with examples.
The rule needs to apply to all new spiders; generating a list from a file in start_requests (for example start_urls = [l.strip() for l in open('urls.txt').readlines()]) is not convenient...
Read this
spider.py:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'tempbuffer.middlewares.ExampleMiddleware': 543,
        }
    }

    def parse(self, response):
        print(response.url)
middlewares.py:
import scrapy

class ExampleMiddleware(object):
    def process_start_requests(self, start_requests, spider):
        # change this to your needs:
        with open('urls.txt', 'r') as f:
            for url in f:
                yield scrapy.Request(url=url)
urls.txt:
https://example.com
https://example1.com
https://example2.org
output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example2.org> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example1.com> (referer: None)
https://example2.org
https://example.com
https://example1.com
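If you want to pull the start URLs from the Google Sheet instead of a local file, the same process_start_requests hook can wrap the gspread code from the question. This is a rough sketch (the GoogleSheetMiddleware name is made up; token.json and 'Sheet_1' are taken from the question), and registering it under SPIDER_MIDDLEWARES in settings.py rather than per-spider custom_settings makes it apply to every spider in the project:

# middlewares.py
import scrapy
import gspread
from oauth2client.service_account import ServiceAccountCredentials

class GoogleSheetMiddleware(object):
    def process_start_requests(self, start_requests, spider):
        scope = ['https://spreadsheets.google.com/feeds',
                 'https://www.googleapis.com/auth/drive']
        creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
        client = gspread.authorize(creds)
        sheet_instance = client.open('Sheet_1').get_worksheet(0)
        # column 2 holds the urls, as in the question
        for url in sheet_instance.col_values(col=2):
            yield scrapy.Request(url=url)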

INFO: Ignoring response <403 webpage_name> HTTP status code is not handled or not allowed

I'm constantly getting a 403 error in my spider. Note that my spider is just scraping the very first page of the website; it is not doing the pagination. Could this be a clue? These are a few things that I have done to see if the issue resolves; both pieces of advice were given to similar questions here on Stack Overflow:
I have changed my user agent, didn't work.
I have set handle_httpstatus_list = [403] in settings.py, didn't work.
This is my code:
import scrapy
from scrapy_selenium import SeleniumRequest

# myclass
class QrocasasSpider(scrapy.Spider):
    name = 'qrocasas'
    page_number = 2

    # functions to clean data

    # starting my request
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro.html',
            wait_time=5,
            callback=self.parse
        )

    # scraping
    def parse(self, response):
        homes = response.xpath("//div[@class='postingCardTop']/div[2]")
        for home in homes:
            yield {
                'price': home.xpath("normalize-space(.//div[1]/div[1]/span/text())").get()
                # 'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                # 'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                # 'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                # 'm2': home.xpath(".//div[@class='chiplets-inline-block surface-area']/text()").get(),
            }
        next_page = 'https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro-pagina-' + str(self.page_number) + '.html'
        if int(self.page_number) < 10:
            self.page_number += 1
            yield response.follow(next_page, callback=self.parse)
This is the error I get:
2021-12-08 10:53:41 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro-pagina-2.html> (referer: https://www.inmuebles24.com/casas-o-terrenos-o-departamentos-en-venta-en-queretaro.html)
2021-12-08 10:53:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro-pagina-2.html>: HTTP status code is not handled or not allowed
2021-12-08 10:53:41 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-08 10:53:41 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:59145/session/2e39a7f0-1ffa-48e7-ad30-9ca71202c44a {}
2021-12-08 10:53:42 [urllib3.connectionpool] DEBUG: http://127.0.0.1:59145 "DELETE /session/2e39a7f0-1ffa-48e7-ad30-9ca71202c44a HTTP/1.1" 200 14
2021-12-08 10:53:42 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request

Can anyone help look into my scrapy xpath for this site?

When I tried to scrape a pissedconsumer.com item with the following code:
import scrapy

class PissedreviewsSpider(scrapy.Spider):
    name = 'pissedreviews'
    allowed_domains = ['pissedconsumer.com']
    start_urls = ['https://lazada-malaysia.pissedconsumer.com/review.html']

    def parse(self, response):
        selectors = response.xpath('//div[@class="f-component-info"]')
        for selector in selectors:
            title = selector.xpath('./h2/text()').get()
            print(title)
Here is the log in the shell when crawling:
2020-04-11 19:00:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://lazada-malaysia.pissedconsumer.com/review.html> (referer: None) <Selector xpath='//div[@class="f-component-info"]' data='<div class="f-component-info">\n ...'>
None
<Selector xpath='//div[@class="f-component-info"]' data='<div class="f-component-info">\n ...'>
I already set ROBOTSTXT_OBEY to false and added headers
Any other things I can do to make it work?
Thanks
Update, since the question was updated:
You are missing a / in your xpath. It should be title = selector.xpath('.//h2/text()').get()
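The difference is that ./h2 only matches an <h2> that is a direct child of the selected div, while .//h2 matches an <h2> anywhere inside it. A quick illustration with a standalone Selector (the HTML is made up just to show the behaviour):

from scrapy.selector import Selector

html = '<div class="f-component-info"><div><h2>Review title</h2></div></div>'
sel = Selector(text=html).xpath('//div[@class="f-component-info"]')[0]
print(sel.xpath('./h2/text()').get())    # None, h2 is not a direct child
print(sel.xpath('.//h2/text()').get())   # 'Review title'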

Scrapy Spider gets stuck in middle of crawling

I'm new to scrapy and I'm trying to build a spider that will crawl a website and get all the phone numbers, emails, pdfs, etc. from it (and I want it to follow all the links from the main page so it searches the entire domain).
This question had a similar issue but it wasn't resolved: Why scrapy crawler stops?
Here is the code for my spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mobilesuites.items import MobilesuitesItem
import re

class ExampleSpider(CrawlSpider):
    name = "hyatt"
    allowed_domains = ["hyatt.com"]
    start_urls = (
        'http://www.hyatt.com/',
    )
    # follow only non-javascript links
    rules = (
        Rule(SgmlLinkExtractor(deny=('.*\.jsp.*')), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        # self.log('The current url is %s' % response.url)
        selector = Selector(response)
        item = MobilesuitesItem()
        # get url
        item['url'] = response.url
        # get page title
        titles = selector.select("//title")
        for t in titles:
            item['title'] = t.select("./text()").extract()
        # get all phone numbers, emails, and pdf links
        text = response.body
        item['phone'] = '|'.join(re.findall('\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}', text))
        item['email'] = '|'.join(re.findall("[^\s@]+@[^\s@]+\.[^\s@]+", text))
        item['pdfs'] = '|'.join(re.findall("[^\s\"<]*\.pdf[^\s\">]*", text))
        # check to see if dining is mentioned on the page
        item['dining'] = bool(re.findall("\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s", text))
        return item
Here's the last part of the crawl log before it hangs:
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Scraped from <200 http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html>
{'email': '',
'phone': '',
'title': [u'24/7 Gallery Menu'],
'url': 'http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html'}
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Ignoring response <404 http://hyatt.com/gallery/thrive/siteMap.html>: HTTP status code is not handled or not allowed
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.hyatt.com/hyatt/pure/contact/> (referer: http://www.hyatt.com/hyatt/pure/?icamp=HY_HyattPure_HPLS)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/aboutus.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.place.hyatt.com/en/hyattplace/eat-and-drink/eat-and-drink.html> (referer: http://www.place.hyatt.com/en/hyattplace/eat-and-drink.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.park.hyatt.com/en/parkhyatt/newsandannouncements.html?icamp=park_hppsa_new_hotels> (referer: http://www.park.hyatt.com/en/parkhyatt.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.regency.hyatt.com/en/hyattregency/meetingsandevents.html> (referer: http://www.regency.hyatt.com/en/hyattregency.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/specialoffers.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/locations.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)

How to check if a specific button exists in Scrapy?

I have a button in a web page as
<input class="nextbutton" type="submit" name="B1" value="Next 20>>"></input>
Now I want to check if this button exists on the page or not using XPath selectors, so that if it exists I can go to the next page and retrieve information from there.
First, you have to determine what counts as "this button". Given the context, I'd suggest looking for an input with a class of 'nextbutton'. You could check for an element with only one class like this in XPath:
//input[@class='nextbutton']
But that looks for exact matches only. So you could try:
//input[contains(@class, 'nextbutton')]
Though this will also match "nonextbutton" or "nextbuttonbig". So your final answer will probably be:
//input[contains(concat(' ', @class, ' '), ' nextbutton ')]
In Scrapy, a Selector will evaluate as true if it matches some nonzero content. So you should be able to write something like:
from scrapy.selector import Selector

input_tag = Selector(text=html_content).xpath("//input[contains(concat(' ', @class, ' '), ' nextbutton ')]")
if input_tag:
    print("Yes, I found a 'next' button on the page.")
http://www.trumed.org/patients-visitors/find-a-doctor loads an iframe with src="http://verify.tmcmed.org/iDirectory/"
<iframe border="0" frameborder="0" id="I1" name="I1"
src="http://verify.tmcmed.org/iDirectory/"
style="width: 920px; height: 600px;" target="I1">
Your browser does not support inline frames or is currently configured not to display inline frames.
</iframe>
The search form is in this iframe.
Here's a scrapy shell session illustrating this:
$ scrapy shell "http://www.trumed.org/patients-visitors/find-a-doctor"
2014-07-10 11:31:05+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-07-10 11:31:07+0200 [default] DEBUG: Crawled (200) <GET http://www.trumed.org/patients-visitors/find-a-doctor> (referer: None)
...
In [1]: response.xpath('//iframe/@src').extract()
Out[1]: [u'http://verify.tmcmed.org/iDirectory/']
In [2]: fetch('http://verify.tmcmed.org/iDirectory/')
2014-07-10 11:31:34+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/intropage.asp> from <GET http://verify.tmcmed.org/iDirectory/>
2014-07-10 11:31:35+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/search.asp> from <GET http://verify.tmcmed.org/iDirectory/applicationspecific/intropage.asp>
2014-07-10 11:31:36+0200 [default] DEBUG: Crawled (200) <GET http://verify.tmcmed.org/iDirectory/applicationspecific/search.asp> (referer: None)
...
In [3]: from scrapy.http import FormRequest
In [4]: frq = FormRequest.from_response(response, formdata={'LastName': 'c'})
In [5]: fetch(frq)
2014-07-10 11:32:15+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/SearchStart.asp> from <POST http://verify.tmcmed.org/iDirectory/applicationspecific/search.asp>
2014-07-10 11:32:15+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/searchresults.asp> from <GET http://verify.tmcmed.org/iDirectory/applicationspecific/SearchStart.asp>
2014-07-10 11:32:17+0200 [default] DEBUG: Crawled (200) <GET http://verify.tmcmed.org/iDirectory/applicationspecific/searchresults.asp> (referer: None)
...
In [6]: response.css('input.nextbutton')
Out[6]: [<Selector xpath=u"descendant-or-self::input[@class and contains(concat(' ', normalize-space(@class), ' '), ' nextbutton ')]" data=u'<input type="submit" value=" Next 20 &gt'>]
In [7]: response.xpath('//input[@class="nextbutton"]')
Out[7]: [<Selector xpath='//input[@class="nextbutton"]' data=u'<input type="submit" value=" Next 20 &gt'>]
In [8]:
According to the Scrapy Selectors documentation, you can use XPath and the extracted value to check whether an element exists or not.
Try this:
isExists = response.xpath("//input[@class='nextbutton']").extract_first(default='not-found')
if isExists == 'not-found':
    # input does not exist
    pass
else:
    # input exists, crawl the next page
    pass
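A slightly more compact version of the same check relies on .get() returning None when nothing matches, so there is no sentinel string to compare against:

next_button = response.xpath("//input[@class='nextbutton']").get()
if next_button is not None:
    # input exists, crawl the next page
    pass
else:
    # input does not exist
    pass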