How to check if a specific button exists in Scrapy?

I have a button in a web page:
<input class="nextbutton" type="submit" name="B1" value="Next 20>>"></input>
Now I want to check whether this button exists on the page using XPath selectors, so that if it does I can go to the next page and retrieve information from there.

First, you have to determine what counts as "this button". Given the context, I'd suggest looking for an input with a class of 'nextbutton'. You could check for an element with only one class like this in XPath:
//input[@class='nextbutton']
But that looks for exact matches only. So you could try:
//input[contains(@class, 'nextbutton')]
Though this will also match "nonextbutton" or "nextbuttonbig". So your final answer will probably be:
//input[contains(concat(' ', @class, ' '), ' nextbutton ')]
In Scrapy, the selector list returned by .xpath() evaluates as true if it matched some content. So you should be able to write something like:
from scrapy.selector import Selector

input_tag = Selector(text=html_content).xpath(
    "//input[contains(concat(' ', @class, ' '), ' nextbutton ')]")
if input_tag:
    print("Yes, I found a 'next' button on the page.")

http://www.trumed.org/patients-visitors/find-a-doctor loads an iframe with src="http://verify.tmcmed.org/iDirectory/"
<iframe border="0" frameborder="0" id="I1" name="I1"
src="http://verify.tmcmed.org/iDirectory/"
style="width: 920px; height: 600px;" target="I1">
Your browser does not support inline frames or is currently configured not to display inline frames.
</iframe>
The search form is in this iframe.
Here's a scrapy shell session illustrating this:
$ scrapy shell "http://www.trumed.org/patients-visitors/find-a-doctor"
2014-07-10 11:31:05+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
2014-07-10 11:31:07+0200 [default] DEBUG: Crawled (200) <GET http://www.trumed.org/patients-visitors/find-a-doctor> (referer: None)
...
In [1]: response.xpath('//iframe/@src').extract()
Out[1]: [u'http://verify.tmcmed.org/iDirectory/']
In [2]: fetch('http://verify.tmcmed.org/iDirectory/')
2014-07-10 11:31:34+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/intropage.asp> from <GET http://verify.tmcmed.org/iDirectory/>
2014-07-10 11:31:35+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/search.asp> from <GET http://verify.tmcmed.org/iDirectory/applicationspecific/intropage.asp>
2014-07-10 11:31:36+0200 [default] DEBUG: Crawled (200) <GET http://verify.tmcmed.org/iDirectory/applicationspecific/search.asp> (referer: None)
...
In [3]: from scrapy.http import FormRequest
In [4]: frq = FormRequest.from_response(response, formdata={'LastName': 'c'})
In [5]: fetch(frq)
2014-07-10 11:32:15+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/SearchStart.asp> from <POST http://verify.tmcmed.org/iDirectory/applicationspecific/search.asp>
2014-07-10 11:32:15+0200 [default] DEBUG: Redirecting (302) to <GET http://verify.tmcmed.org/iDirectory/applicationspecific/searchresults.asp> from <GET http://verify.tmcmed.org/iDirectory/applicationspecific/SearchStart.asp>
2014-07-10 11:32:17+0200 [default] DEBUG: Crawled (200) <GET http://verify.tmcmed.org/iDirectory/applicationspecific/searchresults.asp> (referer: None)
...
In [6]: response.css('input.nextbutton')
Out[6]: [<Selector xpath=u"descendant-or-self::input[@class and contains(concat(' ', normalize-space(@class), ' '), ' nextbutton ')]" data=u'<input type="submit" value=" Next 20 &gt'>]
In [7]: response.xpath('//input[@class="nextbutton"]')
Out[7]: [<Selector xpath='//input[@class="nextbutton"]' data=u'<input type="submit" value=" Next 20 &gt'>]
In [8]:
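Put together as a spider, the same flow might look roughly like this (only a sketch: the iframe lookup, the 'LastName' form field and the nextbutton check come from the shell session above; the spider name and callback names are assumptions):
import scrapy

class TrumedSpider(scrapy.Spider):
    name = "trumed"  # hypothetical name
    start_urls = ["http://www.trumed.org/patients-visitors/find-a-doctor"]

    def parse(self, response):
        # Follow the iframe that actually hosts the search form.
        iframe_src = response.xpath("//iframe/@src").get()
        if iframe_src:
            yield scrapy.Request(iframe_src, callback=self.parse_search_form)

    def parse_search_form(self, response):
        # Submit the search form, as done interactively in the shell above.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"LastName": "c"},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # ... extract doctor data here ...
        if response.css("input.nextbutton"):
            # A "Next 20" button exists, so there is another page of results
            # (see the pagination sketch in the first answer above).
            pass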

According to the Scrapy Selectors documentation, you can use an XPath expression together with extract_first() to check whether an element exists.
Try this:
isExists = response.xpath("//input[@class='nextbutton']").extract_first(default='not-found')
if isExists == 'not-found':
    # input does not exist
    pass
else:
    # input exists, crawl the next page
    pass

Related

Scrapy LinkExtractor ScraperApi integration

I am trying to extract links from a webpage, but I have to use a proxy service. When I use the proxy service, the links are not extracted correctly: they are missing the https://www.homeadvisor.com part and use api.scraperapi.com as the domain instead of the website's domain. How can I fix this problem?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    # allowed_domains = ['homeadvisor.com']
    # start_urls = ['https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html']
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True)
    ]

    def parse_page(self, response):
        company_name = response.css("h1.\#w-full.\#text-3xl").css("::text").get().strip()
        yield {
            "company_name": company_name
        }
2022-10-24 18:13:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python> (referer: None)
2022-10-24 18:13:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.MarkAllenContracting.117262730.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:51 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FireSignDBAdditionsand.123758218.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:52 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.G3BuildersLLC.71091804.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:54 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FletcherCooleyInc.43547458.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
It looks like the ScraperAPIClient requires you to use that specific client.scrapyGet(url=...) syntax for each and every request. However, since you are using the CrawlSpider with a LinkExtractor set to follow, Scrapy automatically sends out requests in its usual way, so those requests are getting blocked. You might be better off extracting all of the links yourself and then filtering the ones you want to follow.
For example:
import scrapy
from scraper_api import ScraperAPIClient

client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")

class Sip2Spider(scrapy.Spider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [client.scrapyGet(url='https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    def parse(self, response):
        print(response)
        links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
        yield {"links": list(set(links))}
This will yield:
[
{
"links": [
"https://www.homeadvisor.com/rated.TapConstructionLLC.42214874.html",
"https://www.homeadvisor.com#quote=42214874",
"https://www.homeadvisor.com/emc.Drywall-Plaster-directory.-12025.html",
"https://www.linkedin.com/company/homeadvisor/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Philadelphia.PA.-12001.html",
"https://www.homeadvisor.com/login",
"https://www.homeadvisor.com/task.Major-Home-Repairs-General-Contractor.40062.html",
"https://www.homeadvisor.com/near-me/home-addition-builders/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Lawrenceville.GA.-12001.html",
"https://www.homeadvisor.com/near-me/carpentry-contractors/",
"https://www.homeadvisor.com/emc.Roofing-directory.-12061.html",
"https://www.homeadvisor.com/c.Doors.Atlanta.GA.-12024.html",
"https://www.homeadvisor.com#quote=20057351",
"https://www.homeadvisor.com/near-me/deck-companies/",
"https://www.homeadvisor.com/tloc/Atlanta-GA/Bathroom-Remodel/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Knoxville.TN.-12001.html",
"https://www.homeadvisor.com/xm/35317287/task-selection/-12001?postalCode=30301",
"https://www.homeadvisor.com/category.Additions-Remodeling.12001.html",
"https://www.homeadvisor.comtel:4042672949",
"https://www.homeadvisor.com/rated.DCEnclosuresInc.16852798.html",
"https://www.homeadvisor.com#quote=16721785",
"https://www.homeadvisor.com/near-me/bathroom-remodeling/",
"https://www.homeadvisor.com/near-me",
"https://www.homeadvisor.com/emc.Heating-Furnace-Systems-directory.-12040.html",
"https://pro.homeadvisor.com/r/?m=sp_pro_center&entry_point_id=33522463",
"https://www.homeadvisor.com/r/hiring-a-home-architect/",
"https://www.homeadvisor.com#quote=119074241",
"https://www.homeadvisor.comtel:8669030759",
"https://www.homeadvisor.com/rated.SilverOakRemodel.78475581.html#ratings-reviews",
"https://www.homeadvisor.com/emc.Tree-Service-directory.-12074.html",
"https://www.homeadvisor.com/task.Bathroom-Remodel.40129.html",
"https://www.homeadvisor.com/rated.G3BuildersLLC.71091804.html",
"https://www.homeadvisor.com/sp/horizon-remodeling-construction",
"https://www.homeadvisor.com/near-me/fence-companies/",
"https://www.homeadvisor.com/emc.Gutters-directory.-12038.html",
"https://www.homeadvisor.com/c.GA.html#topcontractors",
...
...
...
]
}
]
The actual output is almost 400 links...
Then you can use some kind of filtering to decide which links you want to follow and use the same API SDK syntax to follow them. Applying a filtering step will also cut down on the number of requests sent, which conserves API calls and saves you money as well.
For example:
def parse(self, response):
    print(response)
    links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
    yield {"links": list(set(links))}
    # some filtering process
    for link in links:
        yield scrapy.Request(client.scrapyGet(url=link))
UPDATE:
Try this; the process_links callback rewrites each extracted link back onto the real homeadvisor.com domain and then wraps it in the proxy URL before the request is sent:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlencode

APIKEY = "67e5e7755771b9abf8062e595dd5cc2a"  # <- your api key
APIDOMAIN = "http://api.scraperapi.com/"
DOMAIN = 'https://www.homeadvisor.com/'

def get_scraperapi_url(url):
    payload = {'api_key': APIKEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def process_links(links):
    for link in links:
        i = link.url.index('rated')
        link.url = DOMAIN + link.url[i:]
        link.url = get_scraperapi_url(link.url)
    return links

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls = [get_scraperapi_url('https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    rules = [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True, process_links=process_links)
    ]

    def parse_page(self, response):
        company_name = response.xpath("//h1[contains(@class,'#w-full #text-3xl')]/text()").get()
        yield {
            "company_name": company_name
        }

How to build own middleware in Scrapy?

I'm just starting to learn Scrapy and I have a question: for my spider I have to take the list of URLs (start_urls) from a Google Sheets table, and I have this code:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
client = gspread.authorize(creds)
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)
for link in records_data:
    print(link)
........
How do I configure the middleware so that when the spider is launched (scrapy crawl my_spider), the links from this code are automatically substituted into start_urls? Perhaps I need to create a class in middlewares.py?
I will be grateful for any help, with examples.
The rule needs to apply to all new spiders; generating the list from a file in start_requests (for example start_urls = [l.strip() for l in open('urls.txt').readlines()]) is not convenient...
Read this
spider.py:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'tempbuffer.middlewares.ExampleMiddleware': 543,
        }
    }

    def parse(self, response):
        print(response.url)
middlewares.py:
import scrapy

class ExampleMiddleware(object):
    def process_start_requests(self, start_requests, spider):
        # change this to your needs:
        with open('urls.txt', 'r') as f:
            for url in f:
                yield scrapy.Request(url=url)
urls.txt:
https://example.com
https://example1.com
https://example2.org
output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example2.org> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example1.com> (referer: None)
https://example2.org
https://example.com
https://example1.com
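If you want the URLs to come straight from Google Sheets rather than a text file, the same process_start_requests hook can wrap the gspread code from the question. A rough sketch (the sheet name, credentials file and column index come from the question; the middleware class name is an assumption, registered in SPIDER_MIDDLEWARES just like above):
import scrapy
import gspread
from oauth2client.service_account import ServiceAccountCredentials

class GoogleSheetStartUrlsMiddleware(object):
    def process_start_requests(self, start_requests, spider):
        scope = ['https://spreadsheets.google.com/feeds',
                 'https://www.googleapis.com/auth/drive']
        creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
        client = gspread.authorize(creds)
        sheet_instance = client.open('Sheet_1').get_worksheet(0)
        for url in sheet_instance.col_values(col=2):
            yield scrapy.Request(url=url)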

Can anyone help look into my scrapy xpath for this site?

I tried to scrape pissedconsumer.com items with the following code:
import scrapy

class PissedreviewsSpider(scrapy.Spider):
    name = 'pissedreviews'
    allowed_domains = ['pissedconsumer.com']
    start_urls = ['https://lazada-malaysia.pissedconsumer.com/review.html']

    def parse(self, response):
        selectors = response.xpath('//div[@class="f-component-info"]')
        for selector in selectors:
            title = selector.xpath('./h2/text()').get()
            print(title)
Here is the shell log when crawling:
2020-04-11 19:00:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://lazada-malaysia.pissedconsumer.com/review.html> (referer: None)
<Selector xpath='//div[@class="f-component-info"]' data='<div class="f component-info">\n ...'>
None
<Selector xpath='//div[@class="f-component-info"]' data='<div class="f-component-info">\n ...'>
I already set ROBOTSTXT_OBEY to false and added headers
Any other things I can do to make it work?
Thanks
Update, since the question was updated:
You are missing a / in your XPath. It should be title = selector.xpath('.//h2/text()').get()
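Applied to the spider from the question, the corrected callback would look roughly like this:
def parse(self, response):
    selectors = response.xpath('//div[@class="f-component-info"]')
    for selector in selectors:
        # './/h2' searches all descendant h2 elements, not just direct children
        title = selector.xpath('.//h2/text()').get()
        print(title)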

Login to the site through scrapy

I'm following an official video lesson on logging in to a site through Scrapy.
If the username and password are incorrect, the transition to the callback method succeeds; if the login and password are correct, the callback method is never reached.
My code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://www.darkorbit.com"]

    def parse(self, response):
        login_url = response.css('form[name="bgcdw_login_form"]::attr(action)').extract_first()
        data = {
            'username': 'testscrapy',
            'password': 'testtest',
        }
        yield scrapy.FormRequest(url=login_url, formdata=data, callback=self.after_login)

    def after_login(self, response):
        print('----------------------------------------')
With correct credentials, I get the following log (long fragments are cut):
2017-06-03 22:04:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.darkorbit.com/robots.txt> (referer: None)
2017-06-03 22:04:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.darkorbit.com> (referer: None)
2017-06-03 22:04:42 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://auth3.bpsecure.com/robots.txt> (referer: None)
2017-06-03 22:04:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.darkorbit.com/ProjectAp........>
2017-06-03 22:04:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ru4.darkorbit.com/Pro..........>
2017-06-03 22:04:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ru4.darkorbit.com/robots.txt> (referer: None)
2017-06-03 22:04:43 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://ru4.darkorbit.com/Pro......>
From this line of your logs:
2017-06-03 22:04:43 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://ru4.darkorbit.com/Pro......>
I can tell that you need to change a setting in your settings.py file.
The variable ROBOTSTXT_OBEY needs to be set to False:
ROBOTSTXT_OBEY = False
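If you would rather not disable robots.txt handling for the whole project, the same setting can be applied to just this spider through custom_settings (a sketch based on the spider from the question):
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://www.darkorbit.com"]
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # ignore robots.txt only for this spider
    }

    # parse() and after_login() stay the same as above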

Scrapy Spider gets stuck in middle of crawling

I'm new to scrapy and I'm trying to build a spider that will crawl a website and get all the phone numbers, emails, pdfs, etc. from it (and I want it to follow all the links from the main page so it searches the entire domain).
This question had a similar issue but it wasn't resolved: Why scrapy crawler stops?
Here is the code for my spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mobilesuites.items import MobilesuitesItem
import re

class ExampleSpider(CrawlSpider):
    name = "hyatt"
    allowed_domains = ["hyatt.com"]
    start_urls = (
        'http://www.hyatt.com/',
    )

    # follow only non-javascript links
    rules = (
        Rule(SgmlLinkExtractor(deny=('.*\.jsp.*')), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        # self.log('The current url is %s' % response.url)
        selector = Selector(response)
        item = MobilesuitesItem()
        # get url
        item['url'] = response.url
        # get page title
        titles = selector.select("//title")
        for t in titles:
            item['title'] = t.select("./text()").extract()
        # get all phone numbers, emails, and pdf links
        text = response.body
        item['phone'] = '|'.join(re.findall('\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}', text))
        item['email'] = '|'.join(re.findall("[^\s@]+@[^\s@]+\.[^\s@]+", text))
        item['pdfs'] = '|'.join(re.findall("[^\s\"<]*\.pdf[^\s\">]*", text))
        # check to see if dining is mentioned on the page
        item['dining'] = bool(re.findall("\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s", text))
        return item
Here's the last part of the crawl log before it hangs:
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Scraped from <200 http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html>
{'email': '',
'phone': '',
'title': [u'24/7 Gallery Menu'],
'url': 'http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html'}
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Ignoring response <404 http://hyatt.com/gallery/thrive/siteMap.html>: HTTP status code is not handled or not allowed
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.hyatt.com/hyatt/pure/contact/> (referer: http://www.hyatt.com/hyatt/pure/?icamp=HY_HyattPure_HPLS)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/aboutus.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.place.hyatt.com/en/hyattplace/eat-and-drink/eat-and-drink.html> (referer: http://www.place.hyatt.com/en/hyattplace/eat-and-drink.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.park.hyatt.com/en/parkhyatt/newsandannouncements.html?icamp=park_hppsa_new_hotels> (referer: http://www.park.hyatt.com/en/parkhyatt.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.regency.hyatt.com/en/hyattregency/meetingsandevents.html> (referer: http://www.regency.hyatt.com/en/hyattregency.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/specialoffers.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/locations.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)