change response in scrapy itemloader - scrapy

Is there a way to change the response used in an ItemLoader, for example I am parsing two pages and I pass my_item in meta, the response in the item loader is the first one (where I initially defined it).
I am currently using my_item.add_value('price',response.xpath('//p[#id="price"]').extract_first() to get passed this since I can't use my_item.add_xpath('price', '//p[#id="price"]') because the response is from the initial page
my_item = ItemLoader(item=MyItem(),response=response)
#fill my_item
yield Request(My_url,callback=self.parse_item,meta={'my_item':my_item)

You need something like this:
def parse(self, response):
l = ItemLoader(item=YourItem(), response=response)
l.add_xpath('Field1', '...')
l.add_value('Field2', '...')
item = l.load_item()
yield scrapy.Request(
url=another_url,
callback=self.second,
meta={'item': item}
)
def second(self, response):
l = ItemLoader(item=response.meta["item"], response=response)
l.add_xpath("Field3", '...')
yield l.load_item()

Related

How to store scraped links in Scrapy

I did a lot of searches on the web but I couldn't find anything related or maybe it has to do with the wording used.
Basically, I would like to write a spider that would able to save the scraped links and to check if some other links have been already scraped. Is there any build in function in scrapy to do so?
Many thanks
You can write your own method for this purpose. I have written in my project and you can take reference from this. A dictionary called already_parsed_urls and for every callback, I am updating this dictionary.
You can look at the below code snippet and take reference.
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest
class Spider(CrawlSpider):
name = 'test'
allowed_domains = []
web_url = ''
start_urls = ['']
counter = 0
already_parsed_urls = {}
wait_time = 3
timeout = '90'
def start_requests(self):
for start_url in self.start_urls:
yield SplashRequest(start_url, callback=self.parse_courses,
args={'wait': self.wait_time, 'timeout': self.timeout})
def parse_courses(self, response):
course_urls = []
yield SplashRequest(course_urls[0], callback=self.parse_items, args={'wait': self.wait_time})
def parse_items(self, response):
if not self.already_parsed_urls.get(response.url):
# Get Program URL
program_url = response.url
self.already_parsed_urls[response.url] = 1
else:
return {}

Count scraped items from scrapy

Looking to just count the number of things scraped. New to python and scraping just following the example and what to know how to just count the number of times Albert Einstein shows up and print to a json file. Just can not get it to print to file using print, yield, or return.
import scrapy
class QuotesSpider(scrapy.Spider):
name = "author"
start_urls = [
'http://quotes.toscrape.com/page/1/',
]
def parse(self, response):
i=0
for quote in response.css('div.quote'):
author = quote.css("small.author::text").get()
if author == "Albert Einstein":
i+=1
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
yield response.follow(next_page, callback=self.parse)
I found out how to get to the item_scraped_count that shows up in the log output at the end of the spider.
import scrapy
from scrapy import signals
class CountSpider(scrapy.Spider):
name = 'count'
start_urls = ['https://example.com']
#classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(CountSpider, cls).from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
return spider
def spider_closed(self, spider):
stats = spider.crawler.stats.get_stats()
numcount = str(stats['item_scraped_count'])
Here I can create a csv file with the stats
In scrapy request are made asynchronously, and each request will callback to the parse function indepedently. Your i variable is not an instance variable, so it's scope is limited to each function call.
Even if that wasn't the case, the recursion would turn your counter to 0 in each callback.
I would suggest you to take a look at scrapy items, at the end of the scrapy process it will return a counter with the nr of scraped items. Although that maybe an overkill if you don't want to store anymore information but the number of occurrences of "Albert Einstein".
If that's all you want, you can use a dirtier solution, set your counter var to be a instance var and have parse method to increment it, like this:
import scrapy
class QuotesSpider(scrapy.Spider):
name = "author"
start_urls = [
'http://quotes.toscrape.com/page/1/',
]
counter = 0
def parse(self, response):
for quote in response.css('div.quote'):
author = quote.css("small.author::text").get()
if author == "Albert Einstein":
self.counter += 1
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
yield response.follow(next_page, callback=self.parse)

How to add multiple variables in one string (URL)

My spider starts off with the start_urls, being:
https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL
Based on a keywords.csv file, located in my resource folder, the keywordsID (number 20035386) will change. Once the number changed, the spider will fetch the data from another product.
I also have a chunk of code which constantly checks the page if isTruncated = true, if that's the case, it will change the page number in the URL to +1. The only problem I am having right now, is that I don't know how to set a second variable in one string (URL). When isTruncated = true the code need to adjust the URL's page number AND keywordsID accordingly. Currently, I only managed to add a variable for the page number.
Currently the chunk of code is:
if data["isTruncated"]:
yield scrapy.Request(
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page={page}&size=8&isocode=nl-NL".format(page=next_page),
callback=self.parse,
meta={'page': next_page, "category": category},
)
However, it should become something like:
if data["isTruncated"]:
yield scrapy.Request(
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/ {keywordsid} ?page={page}&size=8&isocode=nl-NL".format(page=next_page),
callback=self.parse,
meta={'page': next_page, "category": category},
)
When I run the spider, it will crawl all the pages of the product with keywordsID 20035386, but it will only crawl the first page of all the other products listed in the keywords.csv file.
FULL CODE
./krc/spiders/krc_spider.py
# -*- coding: utf-8 -*-
import scrapy
from krc.items import KrcItem
import json
import os
import csv
import time
import datetime
class KRCSpider(scrapy.Spider):
name = "krc_spider"
allowed_domains = ["kaercher.com"]
start_urls = ['https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL']
def start_requests(self):
"""Read keywords from keywords file amd construct the search URL"""
with open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv")) as search_keywords:
for keyword in csv.DictReader(search_keywords):
search_text=keyword["keyword"]
category = keyword["keywordtype"]
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/{0}?page=1&size=8&isocode=nl-NL".format(search_text)
# The meta is used to send our search text into the parser as metadata
yield scrapy.Request(url, callback = self.parse, meta = {"search_text": search_text, "category": category})
def parse(self, response):
category = response.meta["category"]
current_page = response.meta.get("page", 1)
next_page = current_page + 1
#Printing the timestamp when fetching the data, using default timezone from the requesting machine
ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('%d-%m-%Y %H:%M:%S')
#Defining the items
item = KrcItem()
data = json.loads(response.text)
for company in data.get('products', []):
item["productid"] = company["id"]
item["category"] = category
item["name"] = company["name"]
item["description"] = company["description"]
item["price"] = company["priceFormatted"].replace("\u20ac","").strip()
item["timestamp"] = timestamp
yield item
#Checking whether "isTruncated" is true (boolean), if so, next page will be triggered
if data["isTruncated"]:
yield scrapy.Request(
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page={page}&size=8&isocode=nl-NL".format(page=next_page),
callback=self.parse,
meta={'page': next_page, "category": category},
)
./krc/resources/keywords.csv
keyword,keywordtype
20035386,Hogedrukreiniger
20035424,Window Vacs
Current Output
When I run the spider it fetches the data from all the page's of the product with keywordsID 20035386. From all the other products with a different keywordsID, only the data from the first page will be fetched.
Use response.meta for this:
def start_requests(self):
"""Read keywords from keywords file amd construct the search URL"""
with open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv")) as search_keywords:
for keyword in csv.DictReader(search_keywords):
product_id = keyword["keyword"]
category = keyword["keywordtype"]
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/{0}?page=1&size=8&isocode=nl-NL".format(product_id)
# The meta is used to send our search text into the parser as metadata
yield scrapy.Request(url, callback = self.parse, meta = {"category": category, "product_id": product_id})
def parse(self, response):
category = response.meta["category"]
product_id = response.meta["product_id"]
current_page = response.meta.get("page", 1)
next_page = current_page + 1
#Printing the timestamp when fetching the data, using default timezone from the requesting machine
ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('%d-%m-%Y %H:%M:%S')
#Defining the items
item = KrcItem()
data = json.loads(response.text)
for company in data.get('products', []):
item["productid"] = company["id"]
item["category"] = category
item["name"] = company["name"]
item["description"] = company["description"]
item["price"] = company["priceFormatted"].replace("\u20ac","").strip()
item["timestamp"] = timestamp
yield item
#Checking whether "isTruncated" is true (boolean), if so, next page will be triggered
if data["isTruncated"]:
yield scrapy.Request(
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/{product_id}?page={page}&size=8&isocode=nl-NL".format(page=next_page, product_id=product_id),
callback=self.parse,
meta={'page': next_page, "category": category, "product_id": product_id},
)
I believe you need a nested for when your search_text change.
for [first iterating variable] in [outer loop]: # Outer loop
[do something] # Optional
for [second iterating variable] in [nested loop]: # Nested loop
[do something]
Checks this out, it might help you.
For Loops
I think adding the keyword to the url would be the following. It may or may not need + signs before and after search_text, my knowledge is limited.
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/"search_text"?page={page}&size=8&isocode=nl-NL".format(page=next_page),
though I'm not really following what this line is doing, at least the format(search_text) portion of it.
url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/{0}?page=1&size=8&isocode=nl-NL".format(search_text)

Scrapy using start_requests with rules

I can't find any solution for using start_requests with rules, also I haven't seen any example on the Internet with this two. My purpose is simple, I wanna redefine start_request function to get an ability catch all exceptions dunring requests and also use meta in requests. This is a code of my spider:
class TestSpider(CrawlSpider):
name = 'test'
allowed_domains = ['www.oreilly.com']
start_urls = ['https://www.oreilly.com/library/view/practical-postgresql/9781449309770/ch04s05.html']
# Base on scrapy doc
def start_requests(self):
for u in self.start_urls:
yield Request(u, callback=self.parse_item, errback=self.errback_httpbin, dont_filter=True)
rules = (
Rule(LinkExtractor(), callback='parse_item', follow=True),
)
def parse_item(self, response):
item = {}
item['title'] = response.xpath('//head/title/text()').extract()
item['url'] = response.url
yield item
def errback_httpbin(self, failure):
self.logger.error('ERRRRROR - {}'.format(failure))
This code scrape only one page. I try to modify it and instead of:
def parse_item(self, response):
item = {}
item['title'] = response.xpath('//head/title/text()').extract()
item['url'] = response.url
yield item
I've tried to use this, based on this answer
def parse_item(self, response):
item = {}
item['title'] = response.xpath('//head/title/text()').extract()
item['url'] = response.url
return self.parse(response)
It seems to work, but it doesn't scrape anything, even if I add parse function to my spider. Does anybody know how to use start_request and rules together? I will be glad any information about this topic. Have a nice coding!
I found a solution, but frankly speaking I don't know how it works but it sertantly does it.
class TSpider(CrawlSpider):
name = 't'
allowed_domains = ['books.toscrapes.com']
start_urls = ['https://books.toscrapes.com']
login_page = 'https://books.toscrapes.com'
rules = (
Rule(LinkExtractor(), callback='parse_item', follow=True),
)
def start_requests(self):
yield Request(url=self.login_page, callback=self.login, errback=self.errback_httpbin, dont_filter=True)
def login(self, response):
return FormRequest.from_response(response)
def parse_item(self, response):
item = {}
item['title'] = response.xpath('//head/title/text()').extract()
item['url'] = response.url
yield item
def errback_httpbin(self, failure):
self.logger.error('ERRRRROR - {}'.format(failure))
To catch errors from your rules you need to define errback for your Rule(). But unfortunately this is not possible now.
You need to parse and yield request by yourself (this way you can use errback) or process each response using middleware.
Here is a solution for handle errback in LinkExtractor
Thanks this dude!

Crawler get Info about pages (Scrapy)

How to implement a crawler (using SCRAPY) that get all information about a page. For example, size of images, size of CSS files and save in .txt file (page1.txt, page2.txt)
I tried this with images:
class TestSpider(scrapy.Spider):
name="Test"
start_urls = ["http://www.example.com/page1.html", "http://www.example.com/page2", ]
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait':5})
def parse(self, response):
for url_image in response.xpath('//img/#src').extract():
yield scrapy.Request(url=url_image, callback=self.parse_image)
def parse_image(self, response):
with open('page1.txt', 'a+') as f:
f.write(str(len(response.body)))
This code would save all size images in page1.txt, how to send parameters to parse_image()? For example, filename to parse_image() function.
Splash browser does exactly what I need --> link.
To transfer data between your parse methods you can use Requests meta attribute:
def parse(self, response):
data = {'foo': 'bar'}
yield Request(url, self.parse2, meta=data)
def parse2(self, response):
data = response.meta
# {'foo': 'bar'}