I need to delay a request in Scrapy. The page I am scraping tells me "Your data will be ready in 50 seconds", and the time can be anywhere from 2 to 60 seconds. Since I then want to scrape a lot of pages (I get a list of them from the second request), setting a global DOWNLOAD_DELAY of 60 seconds is not the best idea.
You can try this:
from scrapy.spider import BaseSpider
from twisted.internet import reactor, defer
from scrapy.http import Request

DELAY = 5  # seconds

class MySpider(BaseSpider):
    name = 'wikipedia'
    max_concurrent_requests = 1
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        nextreq = Request('http://en.wikipedia.org')
        dfd = defer.Deferred()
        reactor.callLater(DELAY, dfd.callback, nextreq)
        return dfd
In your case, DELAY will be the time you get from the response.
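For example, a minimal sketch of reading the wait time out of the page and reusing the same callLater pattern; the regex, the URLs, and the parse_data callback are assumptions about how your site is structured:

import re
from twisted.internet import reactor, defer
from scrapy import Spider, Request

class DelayedSpider(Spider):
    name = 'delayed'
    max_concurrent_requests = 1
    start_urls = ['http://example.com/start']  # placeholder URL

    def parse(self, response):
        # e.g. "Your data will be ready in 50 seconds"
        match = re.search(r'ready in (\d+) seconds', response.text)
        delay = int(match.group(1)) if match else 5  # fall back to 5 s
        nextreq = Request('http://example.com/data', callback=self.parse_data)
        # schedule the next request only after the announced delay
        dfd = defer.Deferred()
        reactor.callLater(delay, dfd.callback, nextreq)
        return dfd

    def parse_data(self, response):
        pass  # the data page is ready here; parse it as needed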
I am writing a scraper with Scrapy within a larger project, and I'm trying to keep it as minimal as possible (without creating a whole Scrapy project). This code downloads a single URL correctly:
import scrapy
from scrapy.crawler import CrawlerProcess

class WebsiteSpider(scrapy.Spider):
    """
    https://docs.scrapy.org/en/latest/
    """
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}
    name = 'my_website_scraper'

    def parse(self, response):
        html = response.body
        url = response.url
        # process page here

process = CrawlerProcess()
process.crawl(WebsiteSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()
How can I enrich this code to keep scraping the links found in the start URLs (with a maximum depth, for example 3)?
Try this.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain

class WebsiteSpider(Spider):
    name = 'bbc.co.uk'
    allowed_domains = ['.bbc.co.uk']
    start_urls = ['https://www.bbc.co.uk/']
    # refresh_urls = True  # For debug. If refresh_urls = True, start_urls will be crawled again.

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = doc.listA(url=url["url"])  # Get link data for subsequent crawling
        data = [{"title": doc.title.text}]  # Get target data
        return {"Urls": lstA, "Data": data}  # Return data to framework

SimplifiedMain.startThread(WebsiteSpider())  # Start crawling
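If you would rather stay with plain Scrapy and the CrawlerProcess setup from the question, a minimal sketch of the same idea is to have parse() follow the links it finds and let DEPTH_LIMIT cap the recursion; the selectors and the allowed_domains value here are assumptions:

import scrapy
from scrapy.crawler import CrawlerProcess

class FollowingSpider(scrapy.Spider):
    name = 'my_website_scraper'
    allowed_domains = ['bbc.co.uk']  # assumption: stay on the same site
    custom_settings = {'DOWNLOAD_DELAY': 1, 'DEPTH_LIMIT': 3}

    def parse(self, response):
        # process the current page here
        title = response.css('title::text').get()
        yield {'url': response.url, 'title': title}
        # follow every link on the page; DEPTH_LIMIT stops the crawl at depth 3
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

process = CrawlerProcess()
process.crawl(FollowingSpider, start_urls=['https://www.bbc.co.uk/'])
process.start()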
I am trying to use requests to fetch a page then pass the response object to a parser, but I ran into a problem:
def start_requests(self):
    yield self.parse(requests.get(url))

def parse(self, response):
    # pass

builtins.AttributeError: 'generator' object has no attribute 'dont_filter'
You first need to download the page's response and then convert that string to an HtmlResponse object:
import requests
from scrapy.http import HtmlResponse

resp = requests.get(url)
response = HtmlResponse(url="", body=resp.text, encoding='utf-8')
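Once you have the HtmlResponse, the usual Scrapy selectors work on it; a quick sketch, where the URL and the selectors are just placeholders:

import requests
from scrapy.http import HtmlResponse

url = 'https://example.com/'  # placeholder
resp = requests.get(url)
response = HtmlResponse(url=url, body=resp.text, encoding='utf-8')

# the object now behaves like a response Scrapy would pass to a callback
print(response.css('title::text').get())
print(response.xpath('//a/@href').getall())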
What you need to do is:
1) Get the page with python requests and save it to a variable different from the Scrapy response:
r = requests.get(url)
2) Replace the Scrapy response body with your python requests text:
response = response.replace(body=r.text)
That's it. Now you have a Scrapy response object with all the data available from python requests.
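In context, this might look like the sketch below, where the URL handling is purely illustrative:

import requests
import scrapy

class SwapBodySpider(scrapy.Spider):
    name = 'swap_body'
    start_urls = ['https://example.com/']  # placeholder

    def parse(self, response):
        # fetch the same page with python requests (e.g. to reuse a session or special headers)
        r = requests.get(response.url)
        # keep the Scrapy response object, but swap in the body from requests
        response = response.replace(body=r.text)
        yield {'title': response.css('title::text').get()}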
yield returns a generator, so Scrapy iterates over it before the request gets the data. You can remove the yield and it should work; I have tested it with a sample URL:
def start_requests(self):
    self.parse(requests.get(url))

def parse(self, response):
    # pass
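The error itself comes from start_requests yielding the return value of self.parse() instead of Request objects, which is what Scrapy expects there. If you can let Scrapy do the downloading, a more conventional sketch (with a placeholder URL) would be:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # start_requests should yield Request objects, not call a parse method directly
        yield scrapy.Request('https://example.com/', callback=self.parse)

    def parse(self, response):
        pass  # parse the downloaded response here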
I need to download a web page that makes heavy use of AJAX. Currently I am using Scrapy with AJAX crawling enabled. After I write out this response and open it in a browser, there are still some requests initiated. I am not sure whether I am right that the rendered response only includes the first-level requests. So, how can I make Scrapy include all sub-requests in one response?
In this case, 72 requests are sent when opening the page online, versus 23 requests when opening the saved copy offline.
Really appreciate it!
Here are the screenshots for the requests sent before and after the download:
[screenshot: requests sent before download]
[screenshot: requests sent after download]
Here is the code:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/caplinked/bridge',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
The code is as follows:
class SeedinvestSpider(CrawlSpider):
    name = "seedinvest"
    allowed_domains = ["seedinvest.com"]
    start_urls = (
        'https://www.seedinvest.com/startmart/pre.seed',
    )

    def parse_start_url(self, response):
        item = SeedinvestDownloadItem()
        item['url'] = response.url
        item['html'] = response.body
        yield item
Is it possible to get stats such as the outgoing and incoming bandwidth used during a crawl with Scrapy, at regular intervals of time?
Yes, it is possible. =)
The total request and response bytes are already tracked in the stats by the DownloaderStats middleware that comes with Scrapy. You can add another downloader middleware that tracks time and adds the new stats.
Here are the steps for it:
1) Configure a new downloader middleware in settings.py with a high order number so it will execute later in the pipeline:
DOWNLOADER_MIDDLEWARES = {
    'testing.middlewares.InOutBandwithStats': 990,
}
2) Put the following code into a middlewares.py file in the same dir as settings.py:
import time

class InOutBandwithStats(object):

    def __init__(self, stats):
        self.stats = stats
        self.startedtime = time.time()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def elapsed_seconds(self):
        return time.time() - self.startedtime

    def process_request(self, request, spider):
        request_bytes = self.stats.get_value('downloader/request_bytes')
        if request_bytes:
            outgoing_bytes_per_second = request_bytes / self.elapsed_seconds()
            self.stats.set_value('downloader/outgoing_bytes_per_second',
                                 outgoing_bytes_per_second)

    def process_response(self, request, response, spider):
        response_bytes = self.stats.get_value('downloader/response_bytes')
        if response_bytes:
            incoming_bytes_per_second = response_bytes / self.elapsed_seconds()
            self.stats.set_value('downloader/incoming_bytes_per_second',
                                 incoming_bytes_per_second)
        return response
And that's it. The process_request/process_response methods will be called whenever a request/response is processed and will keep updating the stats accordingly.
If you want to have logs at regular times, you can also call spider.log('Incoming bytes/sec: %s' % incoming_bytes_per_second) there.
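For instance, a rough sketch of throttled logging, done by subclassing the middleware above; the 30-second interval and the use of spider.logger (the newer spelling of spider.log) are assumptions:

import time

class InOutBandwithStatsWithLogging(InOutBandwithStats):
    """Same middleware, but also logs the incoming rate every LOG_EVERY seconds."""

    LOG_EVERY = 30  # assumed logging interval in seconds
    lastlog = 0.0

    def process_response(self, request, response, spider):
        response_bytes = self.stats.get_value('downloader/response_bytes')
        if response_bytes:
            incoming = response_bytes / self.elapsed_seconds()
            self.stats.set_value('downloader/incoming_bytes_per_second', incoming)
            if time.time() - self.lastlog > self.LOG_EVERY:
                spider.logger.info('Incoming bytes/sec: %s', incoming)
                self.lastlog = time.time()
        return response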
Read more:
Downloader middlewares in Scrapy - how to activate and create your own
DownloaderStats - the downloader middleware stats class
I have a large amount of HTTP data stored in my cache backend in Scrapy. Certain pages contain false data; these URLs need to be rescheduled for download on the next run of Scrapy.
I came up with the idea of modifying the dummy cache policy that comes with Scrapy. Unfortunately, this doesn't seem to work.
Can anyone see what is wrong in the method is_cached_response_fresh?:
import os
import cPickle as pickle
from time import time
from weakref import WeakKeyDictionary
from email.utils import mktime_tz, parsedate_tz
from w3lib.http import headers_raw_to_dict, headers_dict_to_raw
from scrapy.http import Headers
from scrapy.responsetypes import responsetypes
from scrapy.utils.request import request_fingerprint
from scrapy.utils.project import data_path
from scrapy.utils.httpobj import urlparse_cached

class DummyPolicy(object):

    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        if "thisstring" in response.body.lower():
            print "got mobile page. redownload"
            return False
        else:
            return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        return True
I think the answer here is that your content is most likely gzipped or deflated.
Try:
from scrapy.utils.gz import gunzip

if "thisstring" in gunzip(response.body).lower():
I cannot say this solution is universal, but most likely it will work in your case.
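A slightly more defensive sketch, assuming a recent Scrapy where DummyPolicy lives in scrapy.extensions.httpcache, and checking Content-Encoding before decompressing (whether the cached body is still gzip-encoded depends on your cache storage and middleware order):

from scrapy.extensions.httpcache import DummyPolicy
from scrapy.utils.gz import gunzip

class RedownloadMarkedPagesPolicy(DummyPolicy):
    """Treat cached pages containing a marker string as stale."""

    def is_cached_response_fresh(self, response, request):
        body = response.body
        # only decompress when the cached body is actually still gzip-encoded
        if response.headers.get('Content-Encoding') == b'gzip':
            body = gunzip(body)
        # returning False forces a re-download on the next run
        return b"thisstring" not in body.lower()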