I need to download a web page that makes heavy use of AJAX. Currently I am using Scrapy with AJAX enabled. After I write out the response and open it in a browser, there are still some requests being initiated. I am not sure whether I am right that the rendered response only includes the first-level requests. So, how can I get Scrapy to include all sub-requests in one response?
In this case, 72 requests are sent when the page is opened online, compared with 23 requests when the saved copy is opened offline.
Really appreciate it!
Here are the screenshots of the requests sent before and after the download:
[screenshot: requests sent before download]
[screenshot: requests sent after download]
Here is the code:
from scrapy.spiders import CrawlSpider
# adjust this import to wherever SeedinvestDownloadItem is defined in your project
from myproject.items import SeedinvestDownloadItem

class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/caplinked/bridge',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
The code is as follows:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/startmart/pre.seed',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
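For what it's worth, Scrapy by itself only downloads the initial HTML; the XHR calls the page fires in a browser are never executed, which is why the saved copy still triggers requests when opened. One common approach is to render the page in a headless browser before saving it, for example with scrapy-splash. The following is a rough sketch only, assuming scrapy-splash is installed, a Splash instance is running on localhost:8050, and the scrapy-splash middlewares and SPLASH_URL are configured in settings.py; the spider name is made up:

import scrapy
from scrapy_splash import SplashRequest

class SeedinvestRenderedSpider(scrapy.Spider):
    name = "seedinvest_rendered"  # hypothetical name, not from the original project
    start_urls = ["https://www.seedinvest.com/caplinked/bridge"]

    def start_requests(self):
        for url in self.start_urls:
            # let Splash wait a few seconds so the page's AJAX calls can finish
            yield SplashRequest(url, callback=self.parse, args={"wait": 3})

    def parse(self, response):
        # response.body is now the DOM after JavaScript ran, not just the first-level HTML
        yield {"url": response.url, "html": response.body}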
I am writing a scraper with Scrapy as part of a larger project, and I'm trying to keep it as minimal as possible (without creating a whole Scrapy project). This code downloads a single URL correctly:
import scrapy
from scrapy.crawler import CrawlerProcess

class WebsiteSpider(scrapy.Spider):
    """
    https://docs.scrapy.org/en/latest/
    """
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}
    name = 'my_website_scraper'

    def parse(self, response):
        html = response.body
        url = response.url
        # process page here

process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()
How can I extend this code to keep scraping the links found in the start URLs (with a maximum depth of, for example, 3)?
Try this.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain

class WebsiteSpider(Spider):
    name = 'bbc.co.uk'
    allowed_domains = ['.bbc.co.uk']
    start_urls = ['https://www.bbc.co.uk/']
    # refresh_urls = True  # For debugging. If refresh_urls = True, start_urls will be crawled again.

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = doc.listA(url=url["url"])  # Get link data for subsequent crawling
        data = [{"title": doc.title.text}]  # Get target data
        return {"Urls": lstA, "Data": data}  # Return data to the framework

SimplifiedMain.startThread(WebsiteSpider())  # Start crawling
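If you would rather stay with plain Scrapy instead of pulling in another framework, the same idea can be expressed by following links from parse and letting the DEPTH_LIMIT setting stop the recursion. Below is a minimal sketch reusing the spider class from the question; the link selector and the yielded fields are illustrative only:

import scrapy
from scrapy.crawler import CrawlerProcess

class WebsiteSpider(scrapy.Spider):
    name = 'my_website_scraper'
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}
    # in practice you would also set allowed_domains to stay on the site

    def parse(self, response):
        # process the page here, e.g. yield a small item
        yield {'url': response.url, 'title': response.css('title::text').get()}
        # follow every link on the page; DEPTH_LIMIT cuts the recursion off at depth 3
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()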
I'm looking to adapt this tutorial (https://medium.com/better-programming/a-gentle-introduction-to-using-scrapy-to-crawl-airbnb-listings-58c6cf9f9808) to scrape this site of tiny home listings: https://tinyhouselistings.com/.
The tutorial uses the request URL to get a very complete and clean JSON file, but does so for the first page only. It seems that looping through the 121 pages of my tinyhouselistings request URL should be pretty straightforward, but I have not been able to get anything to work. The tutorial does not loop through the pages of the request URL, but rather uses Scrapy Splash, run within a Docker container, to get all the listings. I am willing to try that, but I just feel like it should be possible to loop through this request URL.
This outputs only the first page of the tinyhouselistings request URL for my project:
import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['http://www.tinyhouselistings.com']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page=1'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        _file = "tiny_listings.json"
        with open(_file, 'wb') as f:
            f.write(response.body)
I've tried this:
class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    allowed_domains = ['tinyhouselistings.com']
    start_urls = ['']

    def start_requests(self):
        url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page='
        for page in range(1, 121):
            self.start_urls.append(url + str(page))
            yield scrapy.Request(url=start_urls, callback=self.parse)
But I'm not sure how to then pass start_urls to parse so that each response gets written to the JSON file at the end of the script.
Any help would be much appreciated!
Remove allowed_domains = ['tinyhouselistings.com'], because otherwise the URL thl-prod.global.ssl.fastly.net will be filtered out by Scrapy's offsite middleware.
Since you are using the start_requests method, you do not need start_urls; you only need one of the two.
import json
import scrapy

class TinyhouselistingsSpider(scrapy.Spider):
    name = 'tinyhouselistings'
    listings_url = 'https://thl-prod.global.ssl.fastly.net/api/v1/listings/search?area_min=0&measurement_unit=feet&page={}'

    def start_requests(self):
        page = 1
        yield scrapy.Request(url=self.listings_url.format(page),
                             meta={"page": page},
                             callback=self.parse)

    def parse(self, response):
        resp = json.loads(response.body)
        for ad in resp["listings"]:
            yield ad

        page = int(response.meta['page']) + 1
        if page < int(resp['meta']['pagination']['page_count']):
            yield scrapy.Request(url=self.listings_url.format(page),
                                 meta={"page": page},
                                 callback=self.parse)
From the terminal, run the spider like this to save the scraped data to a JSON file:
scrapy crawl tinyhouselistings -o output_file.json
I am trying to use requests to fetch a page and then pass the response object to a parser, but I ran into a problem:
def start_requests(self):
    yield self.parse(requests.get(url))

def parse(self, response):
    pass
builtins.AttributeError: 'generator' object has no attribute 'dont_filter'
You first need to download the page's response and then convert that string to an HtmlResponse object:
import requests
from scrapy.http import HtmlResponse

resp = requests.get(url)
response = HtmlResponse(url=resp.url, body=resp.text, encoding='utf-8')
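As a small usage example (the helper name and the URL below are placeholders, not part of the original answer), the wrapped object behaves like any other Scrapy response, so selectors work on it:

import requests
from scrapy.http import HtmlResponse

def fetch_as_scrapy_response(url):
    # fetch with requests, then wrap the text so Scrapy selectors work on it
    resp = requests.get(url)
    return HtmlResponse(url=resp.url, body=resp.text, encoding='utf-8')

response = fetch_as_scrapy_response('https://example.com/')
titles = response.css('title::text').getall()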
What you need to do is:
Get the page with python requests and save it to a variable different from the Scrapy response:
r = requests.get(url)
Replace the Scrapy response body with your python requests text:
response = response.replace(body=r.text)
That's it. Now you have a Scrapy response object with all the data fetched by python requests.
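Put into a spider callback, that could look roughly like the sketch below (the spider name and the re-fetch of response.url are purely illustrative):

import requests
import scrapy

class ReplaceBodySpider(scrapy.Spider):
    name = 'replace_body_example'  # hypothetical name
    start_urls = ['https://example.com/']

    def parse(self, response):
        r = requests.get(response.url)            # fetch the page again with requests
        response = response.replace(body=r.text)  # swap the requests body into the Scrapy response
        # selectors now run against the requests-fetched HTML
        yield {'title': response.css('title::text').get()}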
yield returns a generator, so it gets iterated over before the request has fetched the data. You can remove the yield and it should work; I have tested it with a sample URL:
def start_requests(self):
    self.parse(requests.get(url))

def parse(self, response):
    pass
Hello, I am having trouble using Scrapy.
I want to scrape some data from clinicalkey.com.
I have an ID and password for my hospital, and my hospital has access to clinicalkey.com, so if I log in to my hospital's library page, I can also use clinicalkey.com without further authentication.
But my Scrapy script doesn't work, and I can't find out why.
My script is here:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.FormRequest(loginsite, formdata={'id': 'Myid', 'password': 'MyPassword'}, callback=self.after_login)

    def after_login(self, response):
        yield scrapy.Request(clinicalkeysite, callback=self.parse_detail)

    def parse_detail(self, response):
        pass  # blahblah
When I look at the final response, it contains a message saying "You need to log in".
This site uses a JSON body for authentication.
Try something like this:
import json

# json.dumps avoids the brace clashes that str.format would hit with a literal JSON template
body = json.dumps({"username": yourname, "password": yourpassword, "remember_me": True, "product": "CK_US"})
yield scrapy.FormRequest(loginsite, body=body, headers={'Content-Type': 'application/json'}, callback=self.after_login)
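If you are on Scrapy 1.8 or later, an arguably cleaner variant is JsonRequest, which serializes a dict for you and sets the Content-Type header automatically. This is a sketch only; the login URL is a placeholder and the field names are taken from the answer above:

import scrapy
from scrapy.http import JsonRequest  # available since Scrapy 1.8

class ClinicalKeyLoginSpider(scrapy.Spider):
    name = 'clinicalkey_login'  # hypothetical name

    def start_requests(self):
        payload = {
            "username": "Myid",
            "password": "MyPassword",
            "remember_me": True,
            "product": "CK_US",
        }
        # JsonRequest serializes payload to JSON and sets Content-Type: application/json
        yield JsonRequest(url="https://example.com/login",  # placeholder login URL
                          data=payload,
                          callback=self.after_login)

    def after_login(self, response):
        # continue to the authenticated pages here
        pass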
I am using the latest Scrapy version, v1.3.
I crawl a website page by page, by following the URLs in the pagination. On some pages, the website detects that I am using a bot and returns an error inside the HTML. Since it is a successful request, the page gets cached, and when I run the crawl again, I get the same error.
What I need is a way to prevent that page from getting into the cache. Or, if I cannot do that, I need to remove it from the cache after I notice the error in the parse method. Then I can retry and get the correct page.
I have a partial solution: I yield all requests with "dont_cache": False in meta so I make sure they use the cache. Where I detect the error and retry the request, I set dont_filter=True along with "dont_cache": True to make sure I get a fresh copy of the erroneous URL.
def parse(self, response):
    page = response.meta["page"] + 1
    html = Selector(response)
    counttext = html.css('h2#s-result-count::text').extract_first()
    if counttext is None:
        page = page - 1
        yield Request(url=response.url, callback=self.parse, meta={"page": page, "dont_cache": True}, dont_filter=True)
I also tried a custom retry middleware, where I managed to get it running before the cache, but I couldn't read response.body successfully. I suspect that it is zipped somehow, as it is binary data.
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.selector import Selector

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        with open('debug.txt', 'wb') as outfile:
            outfile.write(response.body)
        html = Selector(text=response.body)
        url = response.url
        counttext = html.css('h2#s-result-count::text').extract_first()
        if counttext is None:
            log.msg("Automated process error: %s" % url, level=log.INFO)
            reason = 'Automated process error %d' % response.status
            return self._retry(request, reason, spider) or response
        return response
Any suggestion is appreciated.
Thanks
Mehmet
The middleware responsible for request/response caching is HttpCacheMiddleware. Under the hood it is driven by cache policies - special classes which decide which requests and responses should or shouldn't be cached. You can implement your own cache policy class and use it with the setting
HTTPCACHE_POLICY = 'my.custom.cache.Class'
More information in docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
Source code of built-in policies: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/httpcache.py#L18
Thanks to mizhgun, I managed to develop a solution using custom policies.
Here is what I did:
from scrapy.utils.httpobj import urlparse_cached

class CustomPolicy(object):

    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        if "refresh_cache" in request.meta:
            return False
        return True
And when I catch the error (after caching has occurred, of course):
def parse(self, response):
    html = Selector(response)
    counttext = html.css('selector').extract_first()
    if counttext is None:
        yield Request(url=response.url, callback=self.parse, meta={"refresh_cache": True}, dont_filter=True)
When you add refresh_cache to meta, it can be caught in the custom policy class.
Don't forget to add dont_filter, otherwise the second request will be filtered out as a duplicate.
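One last point: the custom policy only takes effect once it is wired into the settings. A minimal settings.py sketch (the dotted path is an assumption; adjust it to wherever CustomPolicy actually lives in your project):

# settings.py (sketch)
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'myproject.middlewares.CustomPolicy'  # assumed module path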