Scrapy: KeyError ('last_page') when passing a value through meta in a Request. It works once but raises a KeyError when looping for pagination - scrapy

I am new to Scrapy and I get a KeyError when passing last_page through meta in the request.
This is the code:
def parse_start(self, response):
    last_page = response.request.meta['last_page']
    dress_links = response.xpath('//p[@class="brand"]/a/@href').getall()
    for dress_link in dress_links:
        link_item = 'https://www.modanisa.com/en' + dress_link
        yield scrapy.Request(
            url=link_item,
            callback=self.parse_clothing,
            meta={'link_item': link_item})
    for idx_next in range(2, last_page + 1):
        url_next = 'https://www.modanisa.com/en/dresses-en.list' + f'?page={idx_next}'
        yield scrapy.Request(
            url=url_next,
            callback=self.parse_start
        )
And this is the error I get:
last_page = response.meta['last_page']
KeyError: 'last_page'
Is there a way to keep the value of last_page constant throughout the scraping?
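The pagination requests at the end of parse_start are yielded without any meta, so when their responses come back to parse_start there is no 'last_page' key to read, which is where the KeyError comes from. A minimal sketch of one way to keep the value available on every call, assuming the rest of parse_start stays as above, is to forward it on each pagination request:

    for idx_next in range(2, last_page + 1):
        url_next = f'https://www.modanisa.com/en/dresses-en.list?page={idx_next}'
        yield scrapy.Request(
            url=url_next,
            callback=self.parse_start,
            # forward the value so the next parse_start call finds it in meta
            meta={'last_page': last_page},
        )

On Scrapy 1.7+ the same idea can be expressed with cb_kwargs={'last_page': last_page} and a matching last_page parameter on parse_start, which keeps meta free for framework use.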

Related

Scrapy doesn't stop after yield in Python

I'm trying to make a spider that goes through a certain number of start URLs, and if the resulting page is the right one, I yield another request. The problem is that if I don't yield a second request, the spider stops immediately; there are no problems if I do yield the second request.
Here is the relevant code:
def start_requests(self):
    urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
    print(len(urls))
    for url in urls:
        return [scrapy.Request(url=url, callback=self.parse)]

def parse(self, response):
    result = response.xpath("//div[@class = 'playerTeam']//a/@href").get()
    if result is None:
        result = response.xpath("//span[contains(concat(' ',normalize-space(@class),' '),' profile-player-stat-value bold ')]//a/@href").get()
    if result is not None:
        yield scrapy.Request(
            url="https://www.hltv.org" + result,
            callback=self.parseTeam
        )
So I want a way to make the spider continue after the parse function is called even when it doesn't yield a request.
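One thing that stands out in start_requests is the return inside the for loop: returning a list on the first iteration means only the first URL is ever scheduled, so the crawl finishes as soon as that single parse call yields nothing. A minimal sketch of a generator version that schedules every URL, assuming the rest of the spider stays unchanged:

    def start_requests(self):
        urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
        for url in urls:
            # yield each request instead of returning after the first one,
            # so the crawl keeps going even when parse() yields nothing
            yield scrapy.Request(url=url, callback=self.parse)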

Scrapy CrawlSpider only crawls as if DEPTH = 1 and stops with reason = finished

I've got a rather simple spider that loads URLs from files (this part works) and should then start crawling and archive the HTML responses.
It was working nicely before, and for days now I haven't been able to figure out what I changed to make it stop working.
Now the spider only crawls the first page of every URL and then stops:
'finish_reason': 'finished',
Spider:
class TesterSpider(CrawlSpider):
    name = 'tester'
    allowed_domains = []
    rules = (
        Rule(LinkExtractor(allow=(), deny=(r'.*Zahlung.*', r'.*Cookies.*', r'.*Login.*', r'.*Datenschutz.*', r'.*Registrieren.*', r'.*Kontaktformular.*', )), callback='parse_item'),
    )

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)

    def start_requests(self):
        logging.log(logging.INFO, "======== Starting with start_requests")
        self._compile_rules()
        smgt = Sourcemanagement()
        rootdir = smgt.get_root_dir()
        file_list = smgt.list_all_files(rootdir + "/sources")
        links = smgt.get_all_domains()
        links = list(set(links))
        request_list = []
        for link in links:
            o = urlparse(link)
            result = '{uri.netloc}'.format(uri=o)
            self.allowed_domains.append(result)
            request_list.append(Request(url=link, callback=self.parse_item))
        return request_list

    def parse_item(self, response):
        item = {}
        self.write_html_file(response)
        return item
And the settings:
BOT_NAME = 'crawlerscrapy'
SPIDER_MODULES = ['crawlerscrapy.spiders']
NEWSPIDER_MODULE = 'crawlerscrapy.spiders'
USER_AGENT_LIST = "useragents.txt"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 150
DOWNLOAD_DELAY = 43
CONCURRENT_REQUESTS_PER_DOMAIN = 1
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Accept-Language': 'de',
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
}
AUTOTHROTTLE_ENABLED = False
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'DEBUG'
DEPTH_LIMIT = 0
DOWNLOAD_TIMEOUT = 15
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
Any idea what I'm doing wrong?
EDIT:
I found out the answer:
request_list.append ( Request(url=link, callback=self.parse_item) )
# to be replaced by:
request_list.append ( Request(url=link, callback=self.parse) )
But I don't really understand why.
https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.parse
So I can return an empty dict in parse_item but I shouldn't because it would break the flow of things?
CrawlSpider.parse is the method that takes care of applying your rules to a response. Only responses you send to CrawlSpider.parse will get your rules applied, generating additional responses.
By yielding a request with a different callback, you are specifying that you don’t want rules to be applied to the response to that request.
The right place to put your parse_item callback when using a CrawlSpider subclass (as opposed to a Spider) is your rules. You already did that.
If what you want is to have responses to your start requests be handled both by rules and by a different callback, you might be better off using a regular spider. CrawlSpider is a very specialized spider, with a limited set of use cases; as soon as you need to do something it doesn’t support, you need to switch to a regular spider.
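A minimal sketch of the start_requests change the edit describes, assuming the rest of TesterSpider stays as above; the point is simply that the initial responses must reach CrawlSpider.parse (the default callback) so the rules get applied:

    def start_requests(self):
        links = list(set(Sourcemanagement().get_all_domains()))
        for link in links:
            self.allowed_domains.append(urlparse(link).netloc)
            # no custom callback (equivalent to callback=self.parse): the response
            # goes to CrawlSpider.parse, which applies the rules and schedules
            # the follow-up requests that end up in parse_item
            yield Request(url=link)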

Update response.body in Scrapy (without reloading)

I use Scrapy and Selenium for crawling. My site uses AJAX for pagination: the URL doesn't change, so response.body doesn't change either. I want to click with Selenium (for pagination), get self.driver.page_source, and use it instead of response.body.
So I wrote this code:
res = scrapy.http.TextResponse(url=self.driver.current_url, body=self.driver.page_source,
                               encoding='utf-8')
print(str(res))  # nothing to print!
for quote in res.css("#ctl00_ContentPlaceHolder1_Grd_Dr_DXMainTable > tr.dxgvDataRow_Office2003Blue"):
    i = i + 1
    item = dict()
    item['id'] = int(quote.css("td.dxgv:nth-child(1)::text").extract_first())
And there is no error!
You can replace the body of the original response in Scrapy by using the response.replace() method:
def parse(self, response):
    response = response.replace(body=driver.page_source)
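A slightly fuller sketch of how this might look inside the spider, assuming the Selenium driver lives on self.driver and that the pagination selector below is a placeholder rather than the real one from the page:

    import time

    def parse(self, response):
        # click the AJAX pagination control with Selenium (placeholder selector)
        self.driver.find_element_by_css_selector('a.next-page').click()
        time.sleep(2)  # crude wait for the AJAX content to render
        # swap the freshly rendered HTML into the Scrapy response without re-downloading
        response = response.replace(body=self.driver.page_source)
        for row in response.css("#ctl00_ContentPlaceHolder1_Grd_Dr_DXMainTable > tr.dxgvDataRow_Office2003Blue"):
            yield {'id': row.css("td.dxgv:nth-child(1)::text").extract_first()}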

Scrapy FormRequest returns 400 error code

I am trying to scrape the following website, in which the pagination is through an AJAX request.
http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak
I am sending a FormRequest to access the different pages; however, I am getting the following error:
Retrying http://studiegids.uva.nl/xmlpages/plspub/uva_search.courses_pls> (failed 1 times): 400 Bad Request
I'm not able to understand what is wrong. Following is the code:
class Spider(BaseSpider):
    name = "zoek"
    allowed_domains = ["studiegids.uva.nl"]
    start_urls = ["http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"]

    def parse(self, response):
        base_url = "http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"
        for i in range(1, 10):
            data = {'p_fetch_size': unicode(20),
                    'p_page:': unicode(i),
                    'p_searchpagetype': u'courses',
                    'p_site_lang': u'nl',
                    'p_strip': u'/2014-2015',
                    'p_ctxparam': u'/xmlpages/page/2014-2015/',
                    'p_rsrcpath': u'/xmlpages/resources/TXP/studiegidswebsite/'}
            yield FormRequest.from_response(response,
                                            formdata=data,
                                            callback=self.fetch_details,
                                            dont_click=True)
            # yield FormRequest(base_url,
            #                   formdata=data,
            #                   callback=self.fetch_details)

    def fetch_details(self, response):
        # print response.body
        hxs = HtmlXPathSelector(response)
        item = ZoekItem()
        Studiegidsnummer = hxs.select("//div[@class='item-info']//tr[1]/td[2]/p/text()")
        Studielast = hxs.select("//div[@class='item-info']//tr[2]/td[2]/p/text()")
        Voertaal = hxs.select("//div[@class='item-info']//tr[3]/td[2]/p/text()")
        Ingangseis = hxs.select("//div[@class='item-info']//tr[4]/td[2]/p/text()")
        Studiejaar = hxs.select("//div[@class='item-info']//tr[5]/td[2]/p/text()")
        Onderwijsinstituut = hxs.select("//div[@class='item-info']//tr[6]/td[2]/p/text()")
        for i in range(20):
            item['Studiegidsnummer'] = Studiegidsnummer
            item['Studielast'] = Studielast
            item['Voertaal'] = Voertaal
            yield item
Also try checking the headers using Firebug.
400 Bad Request usually means that your request does not fully match the expected request format. Common causes include missing or invalid cookies, headers or parameters.
On your web browser, open the Network tab of the Developer Tools and trigger the request. When you see the request in the Network tab, inspect it fully (parameters, headers, etc.). Try to match such a request in your code.
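As a sketch of what that usually ends up looking like, here is the kind of explicit FormRequest you might build after copying the parameters and headers from the Network tab (meant to replace the commented-out FormRequest inside the loop in parse). The header values are typical guesses, not verified against this site, and the 'p_page:' key in the code above has a trailing colon that is worth double-checking against the real request:

    headers = {
        # commonly required by AJAX endpoints; confirm the exact headers in the Network tab
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak',
    }
    yield FormRequest(
        'http://studiegids.uva.nl/xmlpages/plspub/uva_search.courses_pls',
        formdata={'p_fetch_size': '20',
                  'p_page': str(i),
                  'p_searchpagetype': 'courses',
                  'p_site_lang': 'nl',
                  'p_strip': '/2014-2015',
                  'p_ctxparam': '/xmlpages/page/2014-2015/',
                  'p_rsrcpath': '/xmlpages/resources/TXP/studiegidswebsite/'},
        headers=headers,
        callback=self.fetch_details,
    )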

Sequentially crawl a website using Scrapy

Is there a way to tell Scrapy to stop crawling based on a condition found on a 2nd-level page? I am doing the following:
I have a start_url to begin with (1st-level page)
I have a set of URLs extracted from the start_url using parse(self, response)
Then I queue those links using Request, with the callback set to parseDetailPage(self, response)
In parseDetailPage (2nd-level page) I find out whether I should stop crawling or not
Right now I am using CloseSpider() to accomplish this, but the problem is that the URLs to be parsed are already queued by the time I start crawling the second-level pages, and I do not know how to remove them from the queue. Is there a way to sequentially crawl the list of links and then be able to stop in parseDetailPage?
global job_in_range
start_urls = []
start_urls.append("http://sfbay.craigslist.org/sof/")

def __init__(self):
    self.job_in_range = True

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    items = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            if link is self.end_url:
                break
            nextUrl = link.extract()
            isValid = WPUtil.validateUrl(nextUrl)
            if isValid:
                item = WoodPeckerItem()
                item['url'] = nextUrl
                item = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                items.append(item)
    else:
        self.error.log('Could not parse the document')
    return items

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    hxs = HtmlXPathSelector(response)
    print response
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].select('.//date/text()')[0].extract()
    if item['jobTitle'] is 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    item['jobTitle'] = body.select('.//h2[@class="postingtitle"]/text()')[0].extract()
    item['description'] = body.select(str('.//section[@class="userbody"]/section[@id="postingbody"]')).extract()
    return item
Do you mean that you would like to stop the spider and later resume it without re-parsing the URLs that have already been parsed?
If so, you may try setting JOBDIR. This setting keeps the request queue in a specified directory on disk.
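A minimal sketch of how that is usually enabled, assuming the spider is named woodpecker (the name and directory below are placeholders). The job directory persists the scheduler queue and the set of seen requests, so a crawl stopped with CloseSpider or Ctrl-C can be resumed without re-queuing everything:

    # settings.py (or pass it per run with -s)
    JOBDIR = 'crawls/woodpecker-1'

    # resume later by running the spider again with the same directory:
    #   scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1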