I'm getting an error message when using the response.urljoin() function after extracting the href attribute:
next_page = response.urljoin(href_attribute)
It says
Cannot mix string and non-string arguments
Any idea what could be wrong and how to fix that?
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
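One likely cause, offered as an assumption because the question does not show how href_attribute was extracted: the value passed to response.urljoin() is a list (for example from .getall() or .extract()) rather than a single string. response.urljoin() relies on the standard library's urljoin, so the behaviour can be reproduced without Scrapy. The base URL and href below are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://example.com/catalogue/"  # hypothetical page URL
hrefs = ["page-2.html"]                      # what .getall() would return: a list

# Passing the list itself mixes a str base with a non-str argument.
try:
    urljoin(base_url, hrefs)
    error = None
except TypeError as exc:
    error = exc

# Passing a single string (what .get() returns) works.
next_page = urljoin(base_url, hrefs[0])
print(type(error).__name__, "|", next_page)
```

If this is what happened, extracting the attribute with .get(), as in the snippet above, should avoid the error.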
Related
I am new to Scrapy and I get a key error when passing last_page through meta in the request.
This is the code
def parse_start(self, response):
    last_page = response.request.meta['last_page']
    dress_links = response.xpath('//p[@class="brand"]/a/@href').getall()
    for dress_link in dress_links:
        link_item = 'https://www.modanisa.com/en' + dress_link
        yield scrapy.Request(
            url=link_item,
            callback=self.parse_clothing,
            meta={'link_item': link_item})
    for idx_next in range(2, last_page + 1):
        url_next = 'https://www.modanisa.com/en/dresses-en.list' + f'?page={idx_next}'
        yield scrapy.Request(
            url=url_next,
            callback=self.parse_start
        )
And this is the error I get:
last_page = response.meta['last_page']
KeyError: 'last_page'
Is there a way to keep the value of last_page constant throughout the scraping?
Can anyone help me? I'm practicing and I can't work out what I did wrong with pagination. The spider only returns the first page, and sometimes this error comes up:
"The source list for the Content Security Policy directive 'frame-src' contains an invalid source '*trackcmp.net' It will be ignored", source: https://naturaldaterra.com.br/hortifruti.html?page=2"
import scrapy
from scrapy_selenium import SeleniumRequest


class ComputerdealsSpider(scrapy.Spider):
    name = 'produtos'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://naturaldaterra.com.br/hortifruti.html?page=1',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        for produto in response.xpath("//div[@class='gallery-items-1IC']/div"):
            yield {
                'nome_produto': produto.xpath(".//div[@class='item-nameContainer-1kz']/span/text()").get(),
                'valor_produto': produto.xpath(".//span[@class='itemPrice-price-1R-']/text()").getall(),
            }
        next_page = response.xpath("//button[@class='tile-root-1uO'][1]/text()").get()
        if next_page:
            absolute_url = f"https://naturaldaterra.com.br/hortifruti.html?page={next_page}"
            yield SeleniumRequest(
                url=absolute_url,
                wait_time=3,
                callback=self.parse
            )
The problem is that your XPath selector returns None instead of the next page number. Consider changing it from
next_page = response.xpath("//button[@class='tile-root-1uO'][1]/text()").get()
to
next_page = response.xpath("//button[@class='tile-root_active-TUl tile-root-1uO']/following-sibling::button[1]/text()").get()
For future projects, consider using scrapy-playwright to scrape JavaScript-rendered websites. It is faster and simple to use. Here is a sample implementation of your scraper using scrapy-playwright:
import scrapy
from scrapy.crawler import CrawlerProcess


class ComputerdealsSpider(scrapy.Spider):
    name = 'produtos'

    def start_requests(self):
        yield scrapy.Request(
            url='https://naturaldaterra.com.br/hortifruti.html?page=1',
            meta={"playwright": True}
        )

    def parse(self, response):
        for produto in response.xpath("//div[@class='gallery-items-1IC']/div"):
            yield {
                'nome_produto': produto.xpath(".//div[@class='item-nameContainer-1kz']/span/text()").get(),
                'valor_produto': produto.xpath(".//span[@class='itemPrice-price-1R-']/text()").getall(),
            }
        # scrape next page
        next_page = response.xpath(
            "//button[@class='tile-root_active-TUl tile-root-1uO']/following-sibling::button[1]/text()").get()
        # .get() returns None when there is no following button, so guard before concatenating
        if next_page:
            yield scrapy.Request(
                url='https://naturaldaterra.com.br/hortifruti.html?page=' + next_page,
                meta={"playwright": True}
            )


if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    })
    process.crawl(ComputerdealsSpider)
    process.start()
My spider code is not working. I created a demo project, but it does not run. Please check my VS Code screenshot; I cannot tell what is wrong with the spider code.
import scrapy
class EmailSpider(scrapy.Spider):
name='Email'
start_url = [
'http://jsjy.114chn.com/'
]
def parse(self,response):
for Email in response.xpath("//span[@id='lblEmail']"):
yiel{
'email_text': Email.xpath(".//span[@id='lblEmail_text']/p").extract_first()
}
next_page= response.xpath("//li[@class='next']/a/@href").extract_first()
if next_page is not None:
next_page_link= response.urljoin(next_page)
yield scrapy.Request(url=next_page_link, callback=self.parse)
Your code has indentation problems and a misspelled yield statement, and start_url should be start_urls. I also made some code-style corrections:
import scrapy


class EmailSpider(scrapy.Spider):
    name = 'Email'
    start_urls = ['http://jsjy.114chn.com/']

    def parse(self, response):
        for email in response.xpath("//span[@id='lblEmail']"):
            yield {
                'email_text': email.xpath(".//span[@id='lblEmail_text']/p").get()
            }
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
However, since the page has no #lblEmail elements, this spider will not output anything.
I'm using a loop to generate my requests inside start_requests() and I'd like to pass the index to parse() so it can store it in the item. However, when I use self.i, every item gets the maximum value of i (from the last loop iteration). I could extract the index from response.url with a regex, but I wonder if there is a clean way to pass a variable from start_requests() to parse().
You can use the meta argument of scrapy.Request:
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = [...]
        for index, url in enumerate(urls):
            yield scrapy.Request(url, meta={'index': index})

    def parse(self, response):
        print(response.url)
        print(response.meta['index'])
You can also pass the cb_kwargs argument to scrapy.Request():
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.cb_kwargs
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = [...]
        for index, url in enumerate(urls):
            yield scrapy.Request(url, callback=self.parse, cb_kwargs={'index': index})

    def parse(self, response, index):
        # index arrives as a keyword argument of the callback
        pass
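As for why self.i held the last value in every item: a likely explanation is that callbacks run only after responses arrive, by which time the loop in start_requests() may already have finished and overwritten the shared attribute. The sketch below mimics this with plain Python (no Scrapy); FakeSpider and the dictionary-based requests are hypothetical stand-ins.

```python
class FakeSpider:
    """Plain-Python stand-in for a spider; not a Scrapy class."""

    def start_requests(self, urls):
        for index, url in enumerate(urls):
            self.i = index  # shared attribute, overwritten on each turn
            yield {"url": url, "meta": {"index": index}}  # per-request value

    def parse(self, request):
        # Runs later, once the "response" is available.
        return {"from_attribute": self.i, "from_meta": request["meta"]["index"]}


spider = FakeSpider()
# All requests are generated first, as when responses arrive after the loop ends.
pending = list(spider.start_requests(["a", "b", "c"]))
items = [spider.parse(request) for request in pending]
print(items)
```

The value stored on the request travels with it, while the attribute only reflects the most recent loop turn, which is why meta or cb_kwargs is the safer choice here.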
I am trying to scrape the following website, in which pagination is done through AJAX requests.
http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak
I am sending a FormRequest to access the different pages, but I am getting the following error.
Retrying http://studiegids.uva.nl/xmlpages/plspub/uva_search.courses_pls> (failed 1 times): 400 Bad Request
I cannot work out what is wrong. The code follows.
class Spider(BaseSpider):
    name = "zoek"
    allowed_domains = ["studiegids.uva.nl"]
    start_urls = ["http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"]

    def parse(self, response):
        base_url = "http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"
        for i in range(1, 10):
            data = {'p_fetch_size': unicode(20),
                    'p_page:': unicode(i),
                    'p_searchpagetype': u'courses',
                    'p_site_lang': u'nl',
                    'p_strip': u'/2014-2015',
                    'p_ctxparam': u'/xmlpages/page/2014-2015/',
                    'p_rsrcpath': u'/xmlpages/resources/TXP/studiegidswebsite/'}
            yield FormRequest.from_response(response,
                                            formdata=data,
                                            callback=self.fetch_details,
                                            dont_click=True)
            # yield FormRequest(base_url,
            #                   formdata=data,
            #                   callback=self.fetch_details)

    def fetch_details(self, response):
        # print response.body
        hxs = HtmlXPathSelector(response)
        item = ZoekItem()
        Studiegidsnummer = hxs.select("//div[@class='item-info']//tr[1]/td[2]/p/text()")
        Studielast = hxs.select("//div[@class='item-info']//tr[2]/td[2]/p/text()")
        Voertaal = hxs.select("//div[@class='item-info']//tr[3]/td[2]/p/text()")
        Ingangseis = hxs.select("//div[@class='item-info']//tr[4]/td[2]/p/text()")
        Studiejaar = hxs.select("//div[@class='item-info']//tr[5]/td[2]/p/text()")
        Onderwijsinstituut = hxs.select("//div[@class='item-info']//tr[6]/td[2]/p/text()")
        for i in range(20):
            item['Studiegidsnummer'] = Studiegidsnummer
            item['Studielast'] = Studielast
            item['Voertaal'] = Voertaal
            yield item
Also try checking the request headers using Firebug.
400 Bad Request usually means that your request does not fully match the expected request format. Common causes include missing or invalid cookies, headers or parameters.
On your web browser, open the Network tab of the Developer Tools and trigger the request. When you see the request in the Network tab, inspect it fully (parameters, headers, etc.). Try to match such a request in your code.
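One way to do that comparison, sketched with the standard library only: paste the form body copied from the browser's Network tab and diff its parameter names against the dictionary the spider sends. The browser payload below is invented for illustration (the real one must be copied from the site), and diff_form_keys is a hypothetical helper.

```python
from urllib.parse import parse_qsl


def diff_form_keys(browser_body, spider_data):
    """Return (missing, unexpected) parameter names relative to the browser request."""
    browser_keys = {key for key, _ in parse_qsl(browser_body, keep_blank_values=True)}
    spider_keys = set(spider_data)
    return sorted(browser_keys - spider_keys), sorted(spider_keys - browser_keys)


# Invented example of a body copied from the Network tab.
browser_body = "p_fetch_size=20&p_page=2&p_searchpagetype=courses&p_site_lang=nl"

# Subset of the form data from the question, including its 'p_page:' key.
spider_data = {
    "p_fetch_size": "20",
    "p_page:": "2",
    "p_searchpagetype": "courses",
    "p_site_lang": "nl",
}

missing, unexpected = diff_form_keys(browser_body, spider_data)
print("missing:", missing, "unexpected:", unexpected)
```

With this made-up payload, the trailing colon in 'p_page:' shows up as an unexpected key. Whether that is the actual cause of the 400 response depends on what the real request contains.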