How to scrape products from an infinite-scrolling page using scrapy? - api

I recently started to learn scrapy and decided to scrape this site.
There are 24 products on 1 page, and when you scroll down more products load.
There should be about 334 products on this page.
I used scrapy and tried to scrape the products and the information inside them, but I can't make scrapy scrape more than 24 products.
I think I need Selenium or Splash to render and scroll down to the end, and only then would I be able to scrape everything.
This is the code that scrapes 24 products:
import scrapy

class BookSpider(scrapy.Spider):
    name = 'basics2'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 OPR/92.0.0.0'
    }
    api_url = 'https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page'
    start_urls = ['https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page=1']

    # parse follows the href of every product on the grid
    def parse(self, response):
        for link in response.xpath("//div[@class='product-grid-product-info__main-info']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--1th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product carousel__item product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='product-grid-product-info__main-info']//a"):
            yield response.follow(link, callback=self.parse_book)

    # parse_book extracts all the information inside each product page
    def parse_book(self, response):
        yield {
            'title': response.xpath("//div[@class='product-detail-info__header']/h1/text()").get(),
            'normal_price': response.xpath("//div[@class='money-amount price-formatted__price-amount']//span//text()").get(),
            'discounted_price': response.xpath("(//span[@class='price__amount price__amount--on-sale price-current--with-background']//div[@class='money-amount price-formatted__price-amount']//span)[1]").get(),
            'Reference': response.xpath("//div[@class='product-detail-color-selector product-detail-info__color-selector']//p[@class='product-detail-selected-color product-detail-color-selector__selected-color-name']//text()").get(),
            'Description': response.xpath("//div[@class='expandable-text__inner-content']//p//text()").get(),
            'Image': response.xpath("//picture[@class='media-image']//source//@srcset").extract(),
            'item_url': response.url,
            # 'User-Agent': response.request.headers['User-Agent']
        }

There's no need to use slow, complex Selenium here; you can grab all the required data from the site's JSON API, like this:
import scrapy
import json

API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
    custom_settings = {
        'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            yield {
                "name": data.get("commercialComponents")[0]['name']
            }
Output:
{'name': 'БОТИЛЬОНЫ ИЗ ТКАНИ С ОТДЕЛКОЙ ПАЙЕТКАМИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТУФЛИ С ОТДЕЛКОЙ ПАЙЕТКАМИ, НА КАБЛУКЕ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ФУТБОЛКА С ВОРОТНИКОМ-СТОЙКОЙ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'СУМКА-ШОПЕР С УЗЛАМИ НА ЛЯМКАХ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНАЯ ЮБКА ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНОЕ ПЛАТЬЕ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНЫЕ ЛЕГИНСЫ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-19 22:39:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186484,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 3.171018,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 19, 16, 39, 52, 441260),
'httpcompression/response_bytes': 2096267,
'httpcompression/response_count': 1,
'item_scraped_count': 476,
Note in the stats above that the single API request yielded item_scraped_count: 476, so no scrolling, rendering, or pagination is needed.
Update: Here is how to extract the image URL from the API response data of this website.
import scrapy
import json

API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
    custom_settings = {
        'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            name = data.get("commercialComponents")[0]['xmedia'][0]['name']
            path = data.get("commercialComponents")[0]['xmedia'][0]['path']
            ts = data.get("commercialComponents")[0]['xmedia'][0]['timestamp']
            img = 'https://static.zara.net/photos//' + path + '/' + name + '.jpg?ts=' + ts
            yield {
                "image_url": img
            }
Output:
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_2_2_1.jpg?ts=1668003224849'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_1_1_1.jpg?ts=1668003224932'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/744/505/2/1067744505_1_1_1.jpg?ts=1668155524538'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_1_1.jpg?ts=1668085284347'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8587/866/099/2/8587866099_1_1_1.jpg?ts=1668003219701'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_10_1.jpg?ts=1668081955599'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/5388/629/711/2/5388629711_1_1_1.jpg?ts=1668008862794'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/800/2/6672010800_1_1_1.jpg?ts=1668172065554'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/002/2/6672010002_2_3_1.jpg?ts=1668164312812'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_8_1.jpg?ts=1668696590284'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/938/822/2/7901938822_2_5_1.jpg?ts=1668767172364'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/935/822/2/7901935822_2_5_1.jpg?ts=1668764555064'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_1_1.jpg?ts=1668691124206'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/936/822/2/7901936822_2_5_1.jpg?ts=1668767061454'}
2022-11-20 23:16:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 23:16:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186815,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.670308,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 20, 17, 16, 14, 180866),
'httpcompression/response_bytes': 2100146,
'httpcompression/response_count': 1,
'item_scraped_count': 474,
... so on
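If you need the name and the image URL in one item, the two parse methods above can be merged. A minimal sketch, assuming the same JSON structure as in the answers above; the guards against a missing xmedia entry are my addition, not from the original answer:

def parse(self, response):
    json_response = json.loads(response.text)
    for data in json_response["productGroups"][0]['elements']:
        component = data.get("commercialComponents", [{}])[0]
        xmedia = component.get("xmedia") or []
        image_url = None
        if xmedia:
            media = xmedia[0]
            # Same URL pattern as above, assembled with an f-string.
            image_url = f"https://static.zara.net/photos//{media['path']}/{media['name']}.jpg?ts={media['timestamp']}"
        yield {
            "name": component.get("name"),
            "image_url": image_url,
        }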

Related

Can't scrape multiple pages using scrapy-playwright api

CONTEXT: I'm a newbie in web scraping. I was trying to scrape a local e-commerce site. It's a dynamic website, so I am using scrapy-playwright (Chromium) with proxies.
PROBLEM: It was running smoothly until I tried to scrape multiple pages. I am using multiple URLs, each with its own page number, but instead of scraping different pages, it scrapes the first page multiple times. It seems that Playwright is at fault, but I am not sure whether it's wrong code or a bug. I have tried doing it in different processes, but the results are the same. I tried with and without proxies and user agents, and I can't figure out why it's happening...
import logging

import scrapy
from scrapy_playwright.page import PageMethod

from helper import should_abort_request

class ABCSpider(scrapy.Spider):
    name = "ABC"
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                ],
            },
        )

    async def parse(self, response):
        total = response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)  # total_pages = 4
        links = []
        for i in range(1, total_pages + 1):
            a = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            links.append(a)
        for link in links:
            res = scrapy.Request(url=link, meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector",
                               '[class="box--ujueT"]'),
                ]})
            yield res and {
                "link": response.url
            }
OUTPUT :
[
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"}
]
Instead of iterating over the pages in the start_requests method, you are trying to pull the number of pages in the parse method and generate further requests from there.
The issue with this strategy is that each of the requests you generate in the parse method is itself parsed by the parse method, so every single request tells Scrapy to generate a whole new set of requests for every page it detects, and the page count is likely the same on every page.
Luckily, Scrapy has a duplicate filter built in, so it would likely ignore these duplicates if you were yielding them properly.
The next issue is your yield statement. The expression a and b doesn't return a and b; it only returns b, unless a is falsy, in which case it returns a.
So your yield expression...
yield res and {
    "link": response.url
}
will only ever actually yield {"link": response.url}; the Request stored in res is created but never yielded, so Scrapy never schedules it.
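If you wanted to keep the Playwright approach, the item and the request would need to be yielded separately; a minimal sketch of the last loop (note link rather than response.url, so each page reports its own URL):

yield {"link": link}  # yield the item by itself
yield res             # yield the request by itself so Scrapy can schedule it
                      # (with no explicit callback it comes back to parse;
                      # the built-in dupe filter drops the repeats)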
Beyond what I mention above, your code doesn't do anything else. However, since you instruct the page to wait for the element containing the items for sale to render, I am assuming that your eventual goal is to scrape the data from each of the items on the page.
With this in mind, I would suggest not using scrapy_playwright at all, and instead getting the data from the JSON API that the website uses in its AJAX requests.
For example:
import scrapy

class ABCSpider(scrapy.Spider):
    name = "ABC"

    def start_requests(self):
        for i in range(4):
            url = f"https://www.daraz.com.bd/xbox-games/?ajax=true&page={i}&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO"
            yield scrapy.Request(url)

    def parse(self, response):
        data = response.json()
        items = data["mods"]["listItems"]
        for item in items:
            yield {"name": item['name'],
                   "brand": item['brandName'],
                   "price": item['price']}
partial output:
{'name': 'Xbox 360 GamePad, Xbox 360 Controller for Windows', 'brand': 'Microsoft', 'price': '1400.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Pole Bugatt RT 360 12FIT Fishing Rod Hat Chip', 'brand': 'No Brand', 'price': '1020.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'Microsoft', 'price': '1250.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '【Seyijian】 1Set RB LB Bumpers Buttons for Microsoft XBox Series X Controller Button Holder RHA', 'brand': 'No Brand', 'price': '452.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'For Xbox One S Slim Internal Power Supply Adapter Replacement N115-120P1A 12V', 'brand': 'No Brand', 'price': '2591.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'DOU Lb Rb Lt Rt Front Bumper Buttons Set Replacement Accessory, Fits for X box Series S X Controllers', 'brand': 'No Brand', 'price': '602.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'IVYUEEN 2 Sets RB LB Bumpers Buttons for XBox Series X S Controller Trigger Button Middle Holder with Screwdriver Tool', 'brand': 'No Brand', 'price': '645.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Alloy Analog Controller Thumbsticks Replacement Parts Joysticks Analog Sticks for Xbox ONE / PS4 / Switch Controller 11 Pcs', 'brand': 'MOONEYE', 'price': '1544.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FIFA 21 – Xbox One & Xbox Series X', 'brand': 'No Brand', 'price': '1800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'No Brand', 'price': '1150.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Game Consoles Flight Stick Joystick USB Simulator Flight Controller Joystick', 'brand': 'No Brand', 'price': '15179.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Power Charger Adapter For Microsoft Surfa.6 RT Charger US Plug', 'brand': 'No Brand', 'price': '964.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '684.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FORIDE Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '763.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '663.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '739.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2208.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'TP4-005 Smart Turbo Temperature Control 5-Fan For Playstation 4 For PS4', 'brand': 'No Brand', 'price': '1239.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Stencils Bga Reballing Kit for Xbox Ps3 Chip Reballing Repair Game Consoles Repair Tools Kit', 'brand': 'No Brand', 'price': '1331.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Preloved Game Kinect xbox 360 CD Cassette Xbox360', 'brand': 'No Brand', 'price': '2138.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '734'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Shadow of the Tomb Raider - Xbox One', 'brand': 'No Brand', 'price': '2800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2322.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2027.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motheoard Repair', 'brand': 'No Brand', 'price': '649'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'XBOX 360 GAMES - DANCE CENTRAL 3 (KINECT REQUIRED) (FOR MOD /JAILBREAK CONSOLE)', 'brand': 'No Brand', 'price': '1485.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Kontrol Freek Call Of Duty Black Ops 4 Xbox One Series S-X', 'brand': 'No Brand', 'price': '810.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Hitman 2 - Xbox One', 'brand': 'No Brand', 'price': '2500.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Red Dead Redemption 2 XBOX ONE', 'brand': 'No Brand', 'price': '3800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Wired Gaming Headphones Bass Stereo Headsets with Mic for PS4 for XBOX-ONE', 'brand': 'No Brand', 'price': '977.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '10X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motheoard', 'brand': 'No Brand', 'price': '3615'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '739.00'}

My scrapy code seems okay in dev tools but is not working

I tried to scrape this website using scrapy and scrapy-selenium for practice. I am trying to get names, prices, etc. My XPath expression seems okay in the dev tools on Chrome, but it isn't working in my script, and I don't know what I am doing wrong. Can you please explain why my XPath expression is not working?
import scrapy
from scrapy_selenium import SeleniumRequest
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class ComputerdealsSpider(scrapy.Spider):
    name = 'computerdeals'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://slickdeals.net/computer-deals',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("//ul[@class='bp-p-filterGrid_items']/li")
        for product in products:
            yield {
                'price': product.xpath(".//div/span[@class='bp-c-card_subtitle']/text()").get(),
            }
OUTPUT
2022-11-20 13:59:59 [scrapy.utils.log] INFO: Scrapy 2.7.0 started (bot: silkdeals)
2022-11-20 13:59:59 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Windows-10-10.0.19044-SP0
2022-11-20 13:59:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'silkdeals',
'NEWSPIDER_MODULE': 'silkdeals.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['silkdeals.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-20 13:59:59 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-20 13:59:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-20 13:59:59 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-20 13:59:59 [scrapy.extensions.telnet] INFO: Telnet Password: d3adcd8a4caad669
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-11-20 13:59:59 [scrapy.middleware] WARNING: Disabled SeleniumMiddleware: SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE_PATH must be set
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-20 13:59:59 [scrapy.core.engine] INFO: Spider opened
2022-11-20 13:59:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-20 13:59:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-20 14:00:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://slickdeals.net/robots.txt> (referer: None)
2022-11-20 14:00:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://slickdeals.net/computer-deals/> from <GET https://slickdeals.net/computer-deals>
2022-11-20 14:00:01 [filelock] DEBUG: Attempting to acquire lock 2668401413376 on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Lock 2668401413376 acquired on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Attempting to release lock 2668401413376 on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Lock 2668401413376 released on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://slickdeals.net/computer-deals/> (referer: None)
2022-11-20 14:00:01 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 14:00:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 96185,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 2.098319,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 20, 11, 0, 1, 689826),
'httpcompression/response_bytes': 617590,
'httpcompression/response_count': 2,
'log_count/DEBUG': 10,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2022, 11, 20, 10, 59, 59, 591507)}
2022-11-20 14:00:01 [scrapy.core.engine] INFO: Spider closed (finished)
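Note the warning in the log above: "Disabled SeleniumMiddleware: SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE_PATH must be set". Because the middleware was disabled, the SeleniumRequest was downloaded as a plain HTTP request, so the JavaScript-rendered grid never existed in the response and the XPath matched nothing. A minimal settings.py sketch based on the scrapy-selenium README (the browser name and driver path are assumptions for your machine):

# settings.py
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'                          # assumed browser
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # assumes chromedriver is on PATH
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}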

Iteration returns the same result while crawling

I'm new to Scrapy and I'm working through the manual. I'm doing some exercises and am stuck on this issue: while iterating through the list of books, every iteration returns the same key/value pairs, despite the fact that there are 20 different elements on the page.
This is my code:
import scrapy

class MyBooks(scrapy.Spider):
    name = 'bookstore'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        for book in response.xpath('//article[@class="product_pod"]'):
            yield {
                'title': book.xpath('//h3/a/text()').get(),
                'price': book.xpath('//p[@class="price_color"]/text()').get(),
            }
And this is my result:
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
Why is that? Where am I wrong?
The problem is that book.xpath('//h3/a/text()') and book.xpath('//p[@class="price_color"]/text()') start with //, which searches the whole document rather than just the current book node, so each call returns the selectors for every book's data and .get() always gives you the first one. To confirm this, call .getall() instead of .get() on these selectors; you will see that it returns a list with every book's result. I got it working with CSS selectors, though:
def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            'title': book.css('h3').css('a::text').get(),
            'price': book.css('.price_color::text').get()
        }
You can read more about selectors here.
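For completeness, the original XPath version also works once the expressions are made relative to book by prefixing them with a dot; a sketch of the same parse method:

def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            # The leading '.' anchors the XPath to the current book node.
            'title': book.xpath('.//h3/a/text()').get(),
            'price': book.xpath('.//p[@class="price_color"]/text()').get(),
        }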

Getting an error while implementing headers and body in a Scrapy Spider

When trying to scrape a page passing headers and a body, I get the errors shown below.
I tried converting the payload to JSON and to str before sending it, but it doesn't give any results.
Please let me know if anything needs to be changed.
Code
import scrapy

class TestingSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        request_headers = {
            "Host": "host_here",
            "User-Agent": "Mozilla/5.0 20100101 Firefox/46.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0"
        }
        url = "my_url_here"
        payload = {
            "searchargs.approvedFrom.input": "05/18/2017",
            "searchargs.approvedTO.input": "05/18/2017",
            "pagesize": -1
        }
        yield scrapy.Request(url, method="POST", callback=self.parse, headers=request_headers, body=payload)

    def parse(self, response):
        print("-------------------------------came here-------------------------------")
        print(response.body)
Error 1
Traceback (most recent call last):
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/home/suventure/Desktop/suventure-projects/python-projects/scraper_txrrc/scraper_txrrc/spiders/wells_spider.py", line 114, in start_requests
yield scrapy.Request(url, method="POST", callback=self.parse, headers=request_headers, body=payload)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 26, in __init__
self._set_body(body)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 68, in _set_body
self._body = to_bytes(body, self.encoding)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/utils/python.py", line 117, in to_bytes
'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got dict
Error 2 without any response if dict is converted to string and sent in body
2017-05-19 22:39:38 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scraper_)
2017-05-19 22:39:38 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scraper', 'NEWSPIDER_MODULE': 'scraper_.spiders', 'SPIDER_MODULES': ['scraper_.spiders'], 'ROBOTSTXT_OBEY': True}
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-05-19 22:39:39 [scrapy.core.engine] INFO: Spider opened
2017-05-19 22:39:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-19 22:39:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-19 22:39:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://website_link_here/robots.txt> (referer: None)
2017-05-19 22:39:40 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <POST website_link_here>
2017-05-19 22:39:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-19 22:39:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 258,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 19, 17, 9, 40, 581949),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 5, 19, 17, 9, 39, 332675)}
2017-05-19 22:39:40 [scrapy.core.engine] INFO: Spider closed (finished)
The POST is being blocked because the site's robots.txt forbids it ("Forbidden by robots.txt" in the log above). In settings.py change:
ROBOTSTXT_OBEY = False
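That only clears Error 2. Error 1 is separate: Request(body=...) must be a str or bytes, not a dict. A sketch of two common fixes for the yield inside start_requests (whether the server expects a JSON body or form data is an assumption; check the original request in the browser's dev tools), with import json added at the top of the file:

# Option 1: serialize the dict yourself and send it as a JSON body.
yield scrapy.Request(url, method="POST", callback=self.parse,
                     headers=request_headers, body=json.dumps(payload))

# Option 2: send it form-encoded; FormRequest does the encoding,
# but every value must be a string.
yield scrapy.FormRequest(url, callback=self.parse, headers=request_headers,
                         formdata={k: str(v) for k, v in payload.items()})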

Getting NotImplementedError when running Scrapy tutorial

I'm following the tutorial for Scrapy.
I used this code from the tutorial:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
When I then run the command scrapy crawl quotes I get the following output:
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWS
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-14 02:19:55 [scrapy.core.engine] INFO: Spider opened
2017-05-14 02:19:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped
2017-05-14 02:19:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/ro
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-14 02:19:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1121,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 6956,
'downloader/response_count': 5,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 14, 0, 19, 56, 125822),
'log_count/DEBUG': 6,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/NotImplementedError': 2,
'start_time': datetime.datetime(2017, 5, 14, 0, 19, 55, 659206)}
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Spider closed (finished)
What is going wrong?
This error means you did not implement the parse function. But according to your post, you did, so it is probably an indentation error: parse must be indented so that it is a method of the spider class. Your code should look like this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'filename'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
I tested it and it works.
Shouldn't the line
page = response.url.split("/")[-2]
be
page = response.url.split("/")[-1]
since as written it selects the word "page" when you want the number?
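A quick illustration of the difference, using one of the tutorial URLs:

url = 'http://quotes.toscrape.com/page/1'
print(url.split("/"))      # ['http:', '', 'quotes.toscrape.com', 'page', '1']
print(url.split("/")[-2])  # 'page'
print(url.split("/")[-1])  # '1'  <- the page number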