Looping through the pages of a web page's request URL with Scrapy

I'm looking to adapt this tutorial (https://medium.com/better-programming/a-gentle-introduction-to-using-scrapy-to-crawl-airbnb-listings-58c6cf9f9808) to scrape this site of tiny home listings: https://tinyhouselistings.com/.
The tutorial uses the request URL to get a very complete and clean JSON file, but does so for the first page only. It seems that looping through the 121 pages of my tinyhouselistings request URL should be pretty straightforward, but I have not been able to get anything to work. The tutorial does not loop through the pages of the request URL; instead it uses Scrapy Splash, run within a Docker container, to get all the listings. I am willing to try that, but I feel it should be possible to loop through this request URL.
This outputs only the first page of the tinyhouselistings request URL for my project:
import scrapy


class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['http://www.tinyhouselistings.com']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page=1'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        _file = "tiny_listings.json"
        with open(_file, 'wb') as f:
            f.write(response.body)
I've tried this:
class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page='
        for page in range(1, 121):
            self.start_urls.append(url + str(page))
            yield scrapy.Request(url=start_urls, callback=self.parse)
But I'm not sure how to then pass start_urls to parse so as to write the response to the json being written at the end of the script.
Any help would be much appreciated!

Remove allowed_domains = ['tinyhouselistings.com']; otherwise requests to thl-prod.global.ssl.fastly.net will be filtered out by Scrapy as offsite.
Since you are using the start_requests method, you do not need start_urls; you only need one of the two.
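To illustrate the first point, here is a rough, standalone approximation of the host check behind allowed_domains. It is not Scrapy's actual code: the helper name is_allowed is made up for this example, and it assumes that a request passes when its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Approximate the allowed_domains check (hypothetical helper, not Scrapy's code)."""
    host = urlparse(url).hostname or ""
    # A host passes if it is an allowed domain or a subdomain of one.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


api_url = ("https://thl-prod.global.ssl.fastly.net/api/v1/listings/search"
           "?area_min=0&measurement_unit=feet&page=1")
site_url = "https://tinyhouselistings.com/"

print(is_allowed(api_url, ["tinyhouselistings.com"]))   # False
print(is_allowed(site_url, ["tinyhouselistings.com"]))  # True
```

Under the same assumption, listing the API host itself in allowed_domains would also let the request through, which is an alternative to removing the attribute.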
import json

import scrapy


class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    listings_url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page={}'

    def start_requests(self):
        page = 1
        yield scrapy.Request(url=self.listings_url.format(page),
                             meta={"page": page},
                             callback=self.parse)

    def parse(self, response):
        resp = json.loads(response.body)
        # Yield every listing on the current page as an item.
        for ad in resp["listings"]:
            yield ad
        # Request the next page until the last page has been fetched.
        page = int(response.meta['page']) + 1
        if page <= int(resp['meta']['pagination']['page_count']):
            yield scrapy.Request(url=self.listings_url.format(page),
                                 meta={"page": page},
                                 callback=self.parse)
From the terminal, run the spider with the following command to save the scraped data to a JSON file:
scrapy crawl tinyhouselistings -o output_file.json
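The paging decision in parse can be checked without running a crawl. The sketch below isolates it in a plain function that returns the next page's URL as long as that page number does not exceed page_count, so the last page is included. next_page_url is a hypothetical helper, the sample payload is made up, and the payload shape (meta, pagination, page_count) is taken from the answer rather than verified against the live API.

```python
LISTINGS_URL = ("https://thl-prod.global.ssl.fastly.net/api/v1/listings/search"
                "?area_min=0&measurement_unit=feet&page={}")


def next_page_url(current_page, payload):
    """Return the URL of the next page, or None after the last page."""
    page_count = int(payload["meta"]["pagination"]["page_count"])
    next_page = current_page + 1
    if next_page <= page_count:
        return LISTINGS_URL.format(next_page)
    return None


# Made-up payload using the 121 pages mentioned in the question.
sample = {"listings": [], "meta": {"pagination": {"page_count": 121}}}

print(next_page_url(1, sample))    # URL ending in page=2
print(next_page_url(120, sample))  # URL ending in page=121
print(next_page_url(121, sample))  # None
```

If the page count is known up front, note that range(1, 122) covers pages 1 through 121, whereas range(1, 121) stops at page 120.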

Related

Response works in Scrapy Shell, but doesn't work in code

I'm new to Scrapy. I wrote my first spider for this site, https://book24.ru/knigi-bestsellery/?section_id=1592, and it works fine:
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book24'
    start_urls = ['https://book24.ru/knigi-bestsellery/']

    def parse(self, response):
        for link in response.css('div.product-card__image-holder a::attr(href)'):
            yield response.follow(link, callback=self.parse_book)
        for i in range(1, 5):
            next_page = f'https://book24.ru/knigi-bestsellery/page-{i}/'
            yield response.follow(next_page, callback=self.parse)
            print(i)

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
Now I am trying to write a spider for a single page:
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book'
    start_urls = ['https://book24.ru/product/transhumanism-inc-6015821/']

    def parse_book(self, response):
        yield {
            'name': response.css('h1.product-detail-page__title::text').get(),
            'type': response.css('div.product-characteristic__value a::attr(title)')[2].get()
        }
It doesn't work: I get an empty file after running this command in the terminal.
scrapy crawl book -O book.csv
I don't know why.
I would be grateful for any help!
You were getting this error:
raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: BookSpider.parse callback is not defined
According to the documentation:
parse(): a method that will be called to handle the response
downloaded for each of the requests made. The response parameter is an
instance of TextResponse that holds the page content and has further
helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped
data as dicts and also finding new URLs to follow and creating new
requests (Request) from them.
Just rename def parse_book(self, response): to def parse(self, response):
and it works fine.
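Why the method name matters can be shown with a toy model. The classes below are not Scrapy's: MiniSpider and crawl are invented for this sketch and only imitate the behaviour that a response for a request without an explicit callback is handed to parse.

```python
class MiniSpider:
    """Toy stand-in for a spider base class (not Scrapy's implementation)."""

    def parse(self, response):
        # Default callback: subclasses are expected to override it.
        raise NotImplementedError(
            f"{self.__class__.__name__}.parse callback is not defined"
        )


def crawl(spider, response, callback=None):
    """Pass the response to the given callback, falling back to spider.parse."""
    handler = callback or spider.parse
    return list(handler(response))


class BrokenSpider(MiniSpider):
    def parse_book(self, response):  # never called by default
        yield {"name": response}


class FixedSpider(MiniSpider):
    def parse(self, response):  # picked up as the default callback
        yield {"name": response}


print(crawl(FixedSpider(), "some page"))  # [{'name': 'some page'}]
try:
    crawl(BrokenSpider(), "some page")
except NotImplementedError as exc:
    print(exc)  # BrokenSpider.parse callback is not defined
```

In a real spider, the other option is to keep the name parse_book and pass it explicitly as the callback of the request, for example from start_requests.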

How to use python requests with scrapy?

I am trying to use requests to fetch a page then pass the response object to a parser, but I ran into a problem:
def start_requests(self):
    yield self.parse(requests.get(url))

def parse(self, response):
    # pass
builtins.AttributeError: 'generator' object has no attribute 'dont_filter'
You first need to download the page's response and then convert that string to an HtmlResponse object:
import requests
from scrapy.http import HtmlResponse

resp = requests.get(url)
response = HtmlResponse(url="", body=resp.text, encoding='utf-8')
What you need to do is:
Get the page with Python requests and save it to a variable with a different name than the Scrapy response.
r = requests.get(url)
Replace the Scrapy response body with the text returned by Python requests.
response = response.replace(body=r.text)
That's it. Now you have a Scrapy response object with all the data available from Python requests.
yield returns a generator, so Scrapy iterates over it before the request gets the data. You can remove the yield and it should work. I have tested it with a sample URL:
def start_requests(self):
    self.parse(requests.get(url))

def parse(self, response):
    # pass
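The error message itself can be reproduced in plain Python. In the sketch below, parse is assumed to contain a yield (as most Scrapy callbacks do), so calling it returns a generator instead of running its body. The function body and the sample input are illustrative only.

```python
import types


def parse(response):
    # Because of this yield, calling parse() does not run the body;
    # it only builds a generator object.
    yield {"length": len(response)}


result = parse("<html></html>")

print(isinstance(result, types.GeneratorType))  # True
print(hasattr(result, "dont_filter"))           # False
print(list(result))                             # [{'length': 13}]
```

Yielding such a generator from start_requests hands Scrapy an object that lacks Request attributes such as dont_filter, which would explain the AttributeError in the question.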

Web Crawler not printing pages correctly

Good morning!
I've developed a very simple spider with Scrapy just to get used to FormRequest. I'm trying to send a request to this page: https://www.caramigo.eu/, which should lead me to a page like this one: https://www.caramigo.eu/be/fr/recherche?address=Belgique%2C+Li%C3%A8ge&date_debut=16-03-2019&date_fin=17-03-2019. The issue is that my spider does not retrieve the page correctly (the car images and info do not appear at all), and therefore I can't collect any data from it. Here is my spider:
import scrapy


class CarSpider(scrapy.Spider):
    name = "caramigo"

    def start_requests(self):
        urls = [
            'https://www.caramigo.eu/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.search_line)

    def search_line(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'address': 'Belgique, Liège', 'date_debut': '16-03-2019', 'date_fin': '17-03-2019'},
            callback=self.parse
        )

    def parse(self, response):
        filename = 'caramigo.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Sorry if the syntax is not correct, I'm pretty new to coding.
Thank you in advance!
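One thing that can be checked offline is how the form fields map onto the query string of the results URL quoted in the question. This is only a sketch using the standard library; whether the site accepts a direct GET to this URL, or fills in the results with JavaScript afterwards, is not something the question establishes.

```python
from urllib.parse import urlencode

# Field names and values taken from the spider's formdata.
formdata = {
    "address": "Belgique, Liège",
    "date_debut": "16-03-2019",
    "date_fin": "17-03-2019",
}

search_url = "https://www.caramigo.eu/be/fr/recherche?" + urlencode(formdata)
print(search_url)
```

The encoded string matches the results URL given in the question.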

Easier way to follow links with Scrapy

I have the following code in a scrapy spider:
from scrapy import Request, Spider


class ContactSpider(Spider):
    name = "contact"
    # allowed_domains = ["http://www.domain.com/"]
    start_urls = [
        "http://web.domain.com/DECORATION"
    ]
    BASE_URL = "http://web.domain.com"

    def parse(self, response):
        links = response.selector.xpath('//*[contains(@class,"MAIN")]/a/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield Request(absolute_url, headers=headers, callback=self.second)
I'm surprised there is not a simpler way in Scrapy to follow links than building each absolute_url by hand. Is there a better way to do this?
For absolute URLs you can use urljoin (urllib.parse.urljoin in Python 3, urlparse.urljoin in Python 2). Response already has a shortcut for that via response.urljoin(link), so your code could easily be replaced by:
def parse(self, response):
    links = response.selector.xpath('//*[contains(@class,"MAIN")]/a/@href').extract()
    for link in links:
        yield Request(response.urljoin(link), headers=headers, callback=self.second)
You can also use Scrapy's LinkExtractors, which extract links according to some rules and manage all of the joining automatically.
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # restrict_xpaths must select elements (regions to search), not attributes
    le = LinkExtractor(restrict_xpaths='//*[contains(@class,"MAIN")]/a')
    links = le.extract_links(response)
    for link in links:
        yield Request(link.url, headers=headers, callback=self.second)
For a more automated crawling experience, Scrapy has CrawlSpider, which uses a set of rules to extract and follow links on each page. You can read more about it here: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
The docs have some examples of it as well.
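The joining that response.urljoin performs can be tried without a crawl by calling the standard library function it builds on. In this sketch the sample hrefs are made up, and page_url stands in for response.url.

```python
from urllib.parse import urljoin

page_url = "http://web.domain.com/DECORATION"

# Root-relative href: only the path is replaced.
print(urljoin(page_url, "/contact/1"))             # http://web.domain.com/contact/1
# Relative href: resolved against the directory of the current page.
print(urljoin(page_url, "details.html"))           # http://web.domain.com/details.html
# Absolute href: returned unchanged.
print(urljoin(page_url, "http://other.example/"))  # http://other.example/
```

Unlike concatenating BASE_URL and the link, this handles relative and already-absolute hrefs as well.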

Any way to follow further requests in one web page?

I need to download a web page that makes intensive use of AJAX. Currently, I am using Scrapy with Ajaxenabled. After I write out the response and open it in a browser, some requests are still initiated. I am not sure whether I am right that the rendered response only includes the first-level requests. So, how can we make Scrapy include all sub-requests in one response?
In this case, 72 requests are sent when opening the page online, compared with 23 when opening the saved copy offline.
I'd really appreciate any help!
Here are the screenshots of the requests sent before and after download:
[Screenshot: requests sent before download]
[Screenshot: requests sent after download]
Here is the code:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/caplinked/bridge',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
The code is as follows:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/startmart/pre.seed',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item