I'm just starting to learn Scrapy and I have a question. For my spider I need to take a list of URLs (start_urls) from a Google Sheets table, and I have this code:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
client = gspread.authorize(creds)
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)
for link in records_data:
    print(link)
........
How do I configure a middleware so that when the spider is launched (scrapy crawl my_spider), the links from this code are automatically substituted into start_urls? Perhaps I need to create a class in middlewares.py?
I will be grateful for any help, with examples.
This rule needs to apply to all new spiders; generating the list from a file in each spider (for example start_urls = [l.strip() for l in open('urls.txt').readlines()]) is not convenient...
Read this
spider.py:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'tempbuffer.middlewares.ExampleMiddleware': 543,
        }
    }

    def parse(self, response):
        print(response.url)
middlewares.py:
import scrapy

class ExampleMiddleware(object):
    def process_start_requests(self, start_requests, spider):
        # change this to your needs:
        with open('urls.txt', 'r') as f:
            for url in f:
                # strip the trailing newline so the URL is well-formed
                yield scrapy.Request(url=url.strip())
urls.txt:
https://example.com
https://example1.com
https://example2.org
output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example2.org> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example1.com> (referer: None)
https://example2.org
https://example.com
https://example1.com
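To feed the Google Sheets column from the question into the same middleware, the file loop can be swapped for the list that col_values returns. Below is a minimal sketch of the pure-Python part only. It assumes the column may contain a header cell or blank cells (the question does not say), and clean_urls is a hypothetical helper name, not part of Scrapy or gspread.

```python
def clean_urls(values):
    """Strip whitespace and keep only cells that look like HTTP(S) URLs."""
    urls = []
    for value in values:
        url = value.strip()
        if url.startswith(("http://", "https://")):
            urls.append(url)
    return urls


# Hypothetical column contents, standing in for sheet_instance.col_values(col=2)
column = ["URL", " https://example.com \n", "", "https://example1.com"]
print(clean_urls(column))

# Inside process_start_requests the loop would then become (not run here):
#     for url in clean_urls(sheet_instance.col_values(col=2)):
#         yield scrapy.Request(url=url)
```

Because the middleware is enabled through SPIDER_MIDDLEWARES, moving that setting from custom_settings to settings.py should apply it to every spider in the project.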
I am trying to extract links from a webpage, but I have to use a proxy service. When I go through the proxy, the links are not extracted correctly: they are missing the https://www.homeadvisor.com part and use api.scraperapi.com as the domain instead of the website's domain. How can I fix this problem?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scraper_api import ScraperAPIClient
client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")
class Sip2Spider(CrawlSpider):
    name = 'sip2'
    # allowed_domains = ['homeadvisor.com']
    # start_urls = ['https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html']
    start_urls =[client.scrapyGet(url = 'https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]
    rules= [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True)
    ]

    def parse_page(self, response):
        company_name = response.css("h1.\@w-full.\@text-3xl").css("::text").get().strip()
        yield {
            "company_name" : company_name
        }
2022-10-24 18:13:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python> (referer: None)
2022-10-24 18:13:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.MarkAllenContracting.117262730.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:51 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FireSignDBAdditionsand.123758218.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:52 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.DCEnclosuresInc.16852798.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:53 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.G3BuildersLLC.71091804.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
2022-10-24 18:13:54 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://api.scraperapi.com/rated.FletcherCooleyInc.43547458.html> (referer: https://api.scraperapi.com/?url=https%3A%2F%2Fwww.homeadvisor.com%2Fc.Additions-Remodeling.Atlanta.GA.-12001.html&api_key=67e5e7755771b9abf8062e595dd5cc2a&scraper_sdk=python)
It looks like the ScraperAPIClient requires you to use the specific syntax client.scrapyGet(url=...) for each and every request. However, since you are using a CrawlSpider with a LinkExtractor set to follow, Scrapy sends out the follow-up requests in its usual way: the relative links are resolved against the api.scraperapi.com response URL and are never wrapped by the SDK, so those requests fail with a 404. You might be better off extracting all of the links yourself and then filtering the ones you want to follow.
For example:
import scrapy
from scraper_api import ScraperAPIClient
client = ScraperAPIClient("67e5e7755771b9abf8062e595dd5cc2a")
class Sip2Spider(scrapy.Spider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls =[client.scrapyGet(url = 'https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]

    def parse(self, response):
        print(response)
        links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
        yield {"links" : list(set(links))}
This will yield:
[
{
"links": [
"https://www.homeadvisor.com/rated.TapConstructionLLC.42214874.html",
"https://www.homeadvisor.com#quote=42214874",
"https://www.homeadvisor.com/emc.Drywall-Plaster-directory.-12025.html",
"https://www.linkedin.com/company/homeadvisor/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Philadelphia.PA.-12001.html",
"https://www.homeadvisor.com/login",
"https://www.homeadvisor.com/task.Major-Home-Repairs-General-Contractor.40062.html",
"https://www.homeadvisor.com/near-me/home-addition-builders/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Lawrenceville.GA.-12001.html",
"https://www.homeadvisor.com/near-me/carpentry-contractors/",
"https://www.homeadvisor.com/emc.Roofing-directory.-12061.html",
"https://www.homeadvisor.com/c.Doors.Atlanta.GA.-12024.html",
"https://www.homeadvisor.com#quote=20057351",
"https://www.homeadvisor.com/near-me/deck-companies/",
"https://www.homeadvisor.com/tloc/Atlanta-GA/Bathroom-Remodel/",
"https://www.homeadvisor.com/c.Additions-Remodeling.Knoxville.TN.-12001.html",
"https://www.homeadvisor.com/xm/35317287/task-selection/-12001?postalCode=30301",
"https://www.homeadvisor.com/category.Additions-Remodeling.12001.html",
"https://www.homeadvisor.comtel:4042672949",
"https://www.homeadvisor.com/rated.DCEnclosuresInc.16852798.html",
"https://www.homeadvisor.com#quote=16721785",
"https://www.homeadvisor.com/near-me/bathroom-remodeling/",
"https://www.homeadvisor.com/near-me",
"https://www.homeadvisor.com/emc.Heating-Furnace-Systems-directory.-12040.html",
"https://pro.homeadvisor.com/r/?m=sp_pro_center&entry_point_id=33522463",
"https://www.homeadvisor.com/r/hiring-a-home-architect/",
"https://www.homeadvisor.com#quote=119074241",
"https://www.homeadvisor.comtel:8669030759",
"https://www.homeadvisor.com/rated.SilverOakRemodel.78475581.html#ratings-reviews",
"https://www.homeadvisor.com/emc.Tree-Service-directory.-12074.html",
"https://www.homeadvisor.com/task.Bathroom-Remodel.40129.html",
"https://www.homeadvisor.com/rated.G3BuildersLLC.71091804.html",
"https://www.homeadvisor.com/sp/horizon-remodeling-construction",
"https://www.homeadvisor.com/near-me/fence-companies/",
"https://www.homeadvisor.com/emc.Gutters-directory.-12038.html",
"https://www.homeadvisor.com/c.GA.html#topcontractors",
...
...
...
]
}
]
The actual output is almost 400 links...
Then you can apply some kind of filtering to decide which links you want to follow, and use the same API SDK syntax to follow them. Filtering also cuts down on the number of requests sent, which conserves API calls and saves you money.
For example:
def parse(self, response):
    print(response)
    links = [self.domain + i if not i.startswith('https://') else i for i in response.xpath("//a/@href").getall()]
    yield {"links" : list(set(links))}
    # some filtering process
    for link in links:
        yield scrapy.Request(client.scrapyGet(url = link))
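The plain string concatenation in the examples above is what produces entries such as https://www.homeadvisor.comtel:4042672949 in the output. One possible alternative is to resolve each href with urllib.parse.urljoin against the real page URL and keep only http(s) results. This is a sketch, not the answer's method: absolute_links is a hypothetical helper, the proxy URL is simplified (no api_key), and the tel: href is a made-up stand-in for the ones in the output. The first print also shows why the CrawlSpider in the question ended up requesting api.scraperapi.com/rated...: relative links are resolved against the URL the response came from.

```python
from urllib.parse import urljoin, urlsplit

# Illustrative values; PROXY_URL is a simplified stand-in for the SDK-built URL.
PAGE_URL = "https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html"
PROXY_URL = "https://api.scraperapi.com/?url=" + PAGE_URL


def absolute_links(hrefs, base=PAGE_URL):
    """Resolve hrefs against the real page URL and keep only http(s) links."""
    resolved = {urljoin(base, href) for href in hrefs}
    return sorted(u for u in resolved if urlsplit(u).scheme in ("http", "https"))


hrefs = [
    "/rated.DCEnclosuresInc.16852798.html",
    "tel:+14042672949",  # hypothetical tel: link
    "https://www.linkedin.com/company/homeadvisor/",
]

# Resolving against the proxy URL reproduces the kind of URL that returned 404.
wrong = urljoin(PROXY_URL, hrefs[0])
print(wrong)

# Resolving against the real page URL gives usable links and drops tel:.
links = absolute_links(hrefs)
print(links)
```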
UPDATE:
Try this...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlencode
APIKEY = "67e5e7755771b9abf8062e595dd5cc2a" # <- your api key
APIDOMAIN = "http://api.scraperapi.com/"
DOMAIN = 'https://www.homeadvisor.com/'
def get_scraperapi_url(url):
    payload = {'api_key': APIKEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def process_links(links):
    for link in links:
        i = link.url.index('rated')
        link.url = DOMAIN + link.url[i:]
        link.url = get_scraperapi_url(link.url)
    return links

class Sip2Spider(CrawlSpider):
    name = 'sip2'
    domain = 'https://www.homeadvisor.com'
    start_urls =[get_scraperapi_url('https://www.homeadvisor.com/c.Additions-Remodeling.Atlanta.GA.-12001.html')]
    rules= [
        Rule(LinkExtractor(allow="/rated"), callback="parse_page", follow=True, process_links=process_links)
    ]

    def parse_page(self, response):
        company_name = response.xpath("//h1[contains(@class,'@w-full @text-3xl')]/text()").get()
        yield {
            "company_name" : company_name
        }
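The rewriting step in process_links can also be tried outside Scrapy on plain URL strings. The sketch below is a variant, not the code above: rewrap is a hypothetical name, YOUR_API_KEY is a placeholder, and it takes the path with urlsplit instead of searching for 'rated'. It moves a link that was resolved against the proxy host back onto the target site and wraps it again, then decodes the query to check the round trip.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "http://api.scraperapi.com/"  # endpoint used in the answer above
TARGET_DOMAIN = "https://www.homeadvisor.com"
API_KEY = "YOUR_API_KEY"  # placeholder, not a real key


def rewrap(link_url, api_key=API_KEY):
    """Rebuild the target-site URL from the link's path and wrap it for the proxy."""
    target = TARGET_DOMAIN + urlsplit(link_url).path
    return API_ENDPOINT + "?" + urlencode({"api_key": api_key, "url": target})


wrapped = rewrap("https://api.scraperapi.com/rated.G3BuildersLLC.71091804.html")
print(wrapped)

# Decode the query string again to confirm which page the proxy will fetch.
unwrapped = parse_qs(urlsplit(wrapped).query)["url"][0]
print(unwrapped)
```

Note that this variant drops any query string or fragment on the original link; keep them if the target pages depend on them.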
I'm constantly getting a 403 error in my spider. Note that my spider is only scraping the very first page of the website; it is not following the pagination. Could this be a clue? These are a few things I have tried to resolve the issue, both suggested in answers to similar questions here on Stack Overflow:
I have changed my user agent; it didn't work.
I have set handle_httpstatus_list = [403] in settings.py; it didn't work.
This is my code:
import scrapy
from scrapy_selenium import SeleniumRequest
#myclass
class QrocasasSpider(scrapy.Spider):
    name = 'qrocasas'
    page_number = 2
    #functions to clean data

    #starting my request
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro.html',
            wait_time=5,
            callback=self.parse
        )

    #scraping
    def parse(self, response):
        homes = response.xpath("//div[@class='postingCardTop']/div[2]")
        for home in homes:
            yield {
                'price': home.xpath("normalize-space(.//div[1]/div[1]/span/text())").get()
                # 'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                # 'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                # 'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                # 'm2': home.xpath(".//div[@class='chiplets-inline-block surface-area']/text()").get(),
            }
        next_page = 'https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro-pagina-'+ str(self.page_number) + '.html'
        if int(self.page_number) < 10:
            self.page_number += 1
            yield response.follow(next_page,callback=self.parse)
This is the error I get:
2021-12-08 10:53:41 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro-pagina-2.html> (referer: https://www.inmuebles24.com/casas-o-terrenos-o-departamentos-en-venta-en-queretaro.html)
2021-12-08 10:53:41 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.inmuebles24.com/casas-o-departamentos-o-terrenos-en-venta-en-queretaro-pagina-2.html>: HTTP status code is not handled or not allowed
2021-12-08 10:53:41 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-08 10:53:41 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:59145/session/2e39a7f0-1ffa-48e7-ad30-9ca71202c44a {}
2021-12-08 10:53:42 [urllib3.connectionpool] DEBUG: http://127.0.0.1:59145 "DELETE /session/2e39a7f0-1ffa-48e7-ad30-9ca71202c44a HTTP/1.1" 200 14
2021-12-08 10:53:42 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
When I tried to scrape an item from pissedconsumer.com with the following code:
import scrapy
class PissedreviewsSpider(scrapy.Spider):
    name = 'pissedreviews'
    allowed_domains = ['pissedconsumer.com']
    start_urls = ['https://lazada-malaysia.pissedconsumer.com/review.html']

    def parse(self, response):
        selectors = response.xpath('//div[@class="f-component-info"]')
        for selector in selectors:
            title = selector.xpath('./h2/text()').get()
            print(title)
Here is the log in the shell when I crawl:
2020-04-11 19:00:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://lazada-malaysia.pissedconsumer.com/review.html> (referer: None)
<Selector xpath='//div[@class="f-component-info"]' data='<div class="f component-info">\n ...'>
None
<Selector xpath='//div[@class="f-component-info"]' data='<div class="f-component-info">\n ...'>
I already set ROBOTSTXT_OBEY to False and added headers.
Is there anything else I can do to make it work?
Thanks
Update because question updated:
You are missing a / in your xpath. It should be title = selector.xpath('.//h2/text()').get()
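The difference between ./h2 (direct children only) and .//h2 (any descendant) can be checked without running a crawl. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, since it supports both path forms. The markup is a made-up fragment, on the assumption that the h2 on the real page sits inside a nested element.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the h2 is nested one level below the selected div.
snippet = (
    '<div class="f-component-info">'
    '<div class="header"><h2>Review title</h2></div>'
    '</div>'
)
root = ET.fromstring(snippet)

direct = root.findall("./h2")   # only h2 elements that are direct children
nested = root.findall(".//h2")  # h2 elements at any depth below the div

print(len(direct))
print(nested[0].text)
```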
I am following an official video tutorial on logging in to a site.
If the username and password are incorrect, the callback method is reached; if the username and password are correct, the callback is never called.
My code:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://www.darkorbit.com"]

    def parse(self, response):
        login_url = response.css('form[name="bgcdw_login_form"]::attr(action)').extract_first()
        data = {
            'username': 'testscrapy',
            'password': 'testtest',
        }
        yield scrapy.FormRequest(url=login_url, formdata=data, callback=self.after_login)

    def after_login(self, response):
        print('----------------------------------------')
With the correct credentials, I get this log (long fragments are cut):
2017-06-03 22:04:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.darkorbit.com/robots.txt> (referer: None)
2017-06-03 22:04:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.darkorbit.com> (referer: None)
2017-06-03 22:04:42 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://auth3.bpsecure.com/robots.txt> (referer: None)
2017-06-03 22:04:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.darkorbit.com/ProjectAp........>
2017-06-03 22:04:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ru4.darkorbit.com/Pro..........>
2017-06-03 22:04:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ru4.darkorbit.com/robots.txt> (referer: None)
2017-06-03 22:04:43 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://ru4.darkorbit.com/Pro......>
From this line of your logs:
2017-06-03 22:04:43 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://ru4.darkorbit.com/Pro......>
I can tell that you need to change a setting in your settings.py file.
The variable ROBOTSTXT_OBEY needs to be set to False:
ROBOTSTXT_OBEY=False
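For context, the kind of check behind a "Forbidden by robots.txt" message can be approximated with the standard library's urllib.robotparser. This is only an illustration: the rules, the user agent string and the example.com URLs are made up, and Scrapy's own robots.txt handling may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real site's rules are not shown in the log.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A URL under a disallowed prefix is rejected, anything else is allowed.
blocked = parser.can_fetch("Scrapy", "https://example.com/private/login")
allowed = parser.can_fetch("Scrapy", "https://example.com/index.html")
print(blocked, allowed)
```

With ROBOTSTXT_OBEY set to False, Scrapy skips this check, so the redirect target from the log is no longer filtered out.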
I'm new to scrapy and I'm trying to build a spider that will crawl a website and get all the phone numbers, emails, pdfs, etc. from it (and I want it to follow all the links from the main page so it searches the entire domain).
This question had a similar issue but it wasn't resolved: Why scrapy crawler stops?
Here is the code for my spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mobilesuites.items import MobilesuitesItem
import re
class ExampleSpider(CrawlSpider):
    name = "hyatt"
    allowed_domains = ["hyatt.com"]
    start_urls = (
        'http://www.hyatt.com/',
    )

    #follow only non-javascript links
    rules = (
        Rule(SgmlLinkExtractor(deny = ('.*\.jsp.*')), follow = True, callback = 'parse_item'),
    )

    def parse_item(self, response):
        #self.log('The current url is %s' % response.url)
        selector = Selector(response)
        item = MobilesuitesItem()
        #get url
        item['url'] = response.url
        #get page title
        titles = selector.select("//title")
        for t in titles:
            item['title'] = t.select("./text()").extract()
        #get all phone numbers, emails, and pdf links
        text = response.body
        item['phone'] = '|'.join(re.findall('\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}', text))
        item['email'] = '|'.join(re.findall("[^\s@]+@[^\s@]+\.[^\s@]+", text))
        item['pdfs'] = '|'.join(re.findall("[^\s\"<]*\.pdf[^\s\">]*", text))
        #check to see if dining is mentioned on the page
        item['dining'] = bool(re.findall("\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s", text))
        return item
Here's the last part of the crawl log before it hangs:
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Scraped from <200 http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html>
{'email': '',
'phone': '',
'title': [u'24/7 Gallery Menu'],
'url': 'http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html'}
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Ignoring response <404 http://hyatt.com/gallery/thrive/siteMap.html>: HTTP status code is not handled or not allowed
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.hyatt.com/hyatt/pure/contact/> (referer: http://www.hyatt.com/hyatt/pure/?icamp=HY_HyattPure_HPLS)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/aboutus.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.place.hyatt.com/en/hyattplace/eat-and-drink/eat-and-drink.html> (referer: http://www.place.hyatt.com/en/hyattplace/eat-and-drink.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.park.hyatt.com/en/parkhyatt/newsandannouncements.html?icamp=park_hppsa_new_hotels> (referer: http://www.park.hyatt.com/en/parkhyatt.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.regency.hyatt.com/en/hyattregency/meetingsandevents.html> (referer: http://www.regency.hyatt.com/en/hyattregency.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/specialoffers.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/locations.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)