Iteration returns the same result while crawling - scrapy

I'm new to Scrapy and I'm working through the manual. I'm doing some exercises and got stuck on this issue. While iterating through the list of books, the result always returns the same "key: value" pairs, despite the fact that there are 20 different elements on the page.
This is my code:
import scrapy

class MyBooks(scrapy.Spider):
    name = 'bookstore'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        for book in response.xpath('//article[@class="product_pod"]'):
            yield {
                'title': book.xpath('//h3/a/text()').get(),
                'price': book.xpath('//p[@class="price_color"]/text()').get(),
            }
And this is my result:
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
Why is that? Where am I wrong?

I am not really familiar with XPath selectors, but for some reason it looks like book.xpath('//h3/a/text()') and book.xpath('//p[@class="price_color"]/text()') each return a list of selectors containing every book's data. To confirm this, call .getall() instead of .get() on these selectors; you will see that each returns a list with every book's result. I got it working with CSS selectors, though:
def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            'title': book.css('h3').css('a::text').get(),
            'price': book.css('.price_color::text').get()
        }
You can read more about selectors here.
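For what it's worth, the XPath version can be fixed too: an expression that starts with // searches the whole document no matter which selector it is called on, so .get() always returns the first book on the page. Prefixing the expression with a dot anchors it to the current node. A minimal sketch of that fix:

def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            # the leading '.' makes the query relative to this <article> node
            'title': book.xpath('.//h3/a/text()').get(),
            'price': book.xpath('.//p[@class="price_color"]/text()').get(),
        }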

Related

Can't scrape multiple pages using scrapy-playwright api

CONTEXT: I'm just a newbie in web scraping. I was trying to scrape a local e-commerce site. It's a dynamic website, so I am using scrapy-playwright (Chromium) with proxies.
PROBLEM: It was running smoothly until I tried to scrape multiple pages. I am using multiple URLs, each with its own page number, but instead of scraping different pages, it scrapes the first page multiple times. It seems that Playwright is at fault, but I am not sure whether it's my code or a bug. I have tried different approaches, with and without proxies and user agents, and the results are the same. I can't figure out why it's happening.
import logging

import scrapy
from scrapy_playwright.page import PageMethod

from helper import should_abort_request


class ABCSpider(scrapy.Spider):
    name = "ABC"
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                ],
            },
        )

    async def parse(self, response):
        total = response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)  # total_pages = 4
        links = []
        for i in range(1, total_pages + 1):
            a = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            links.append(a)
        for link in links:
            res = scrapy.Request(url=link, meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--ujueT"]'),
                ]})
            yield res and {
                "link": response.url
            }
OUTPUT:
[
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"}
]
Instead of iterating over the pages in the start_requests method, you are trying to pull the number of pages in the parse method and generate further requests from there.
The issue with this strategy is that each request you generate in the parse method is itself parsed by the parse method, so every single request is told to generate a whole new set of requests for every page it detects, and that page count is likely the same on every page.
Luckily, Scrapy has a built-in duplicate filter, so it would likely ignore these duplicates if you were yielding them properly.
The next issue is your yield statement. The expression a and b doesn't return a and b; it returns b, unless a is falsy, in which case it returns a. A scrapy.Request object is always truthy, so your yield expression...
yield res and {
    "link": response.url
}
will only ever actually yield {"link": response.url}. The Request stored in res is never yielded, which is why none of the other pages are ever fetched.
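If you wanted to keep this approach, the fix would be to yield the request and the item as two separate statements, for example (a sketch; parse_page is a hypothetical separate callback so the pagination logic doesn't run again on every page, and the meta dict is trimmed for brevity):

for link in links:
    yield {"link": link}             # emit the item
    yield scrapy.Request(            # and schedule the page fetch separately
        url=link,
        callback=self.parse_page,
        meta={"playwright": True},
    )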
Beyond what I mention above, your code doesn't do anything else. However, since you instruct the page to wait for the element containing the items for sale to render, I am assuming your eventual goal is to scrape the data for each item on the page.
With that in mind, I would suggest not using scrapy_playwright at all and instead getting the data from the JSON API that the website uses in its AJAX requests.
For example:
import scrapy

class ABCSpider(scrapy.Spider):
    name = "ABC"

    def start_requests(self):
        for i in range(1, 5):  # pages 1 through 4
            url = f"https://www.daraz.com.bd/xbox-games/?ajax=true&page={i}&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO"
            yield scrapy.Request(url)

    def parse(self, response):
        data = response.json()
        items = data["mods"]["listItems"]
        for item in items:
            yield {"name": item['name'],
                   "brand": item['brandName'],
                   "price": item['price']}
partial output:
{'name': 'Xbox 360 GamePad, Xbox 360 Controller for Windows', 'brand': 'Microsoft', 'price': '1400.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Pole Bugatt RT 360 12FIT Fishing Rod Hat Chip', 'brand': 'No Brand', 'price': '1020.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'Microsoft', 'price': '1250.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '【Seyijian】 1Set RB LB Bumpers Buttons for Microsoft XBox Series X Controller Button Holder RHA', 'brand': 'No Brand', 'price': '452.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'For Xbox One S Slim Internal Power Supply Adapter Replacement N115-120P1A 12V', 'brand': 'No Brand', 'price': '2591.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'DOU Lb Rb Lt Rt Front Bumper Buttons Set Replacement Accessory, Fits for X box Series S X Controllers', 'brand': 'No Brand', 'price': '602.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'IVYUEEN 2 Sets RB LB Bumpers Buttons for XBox Series X S Controller Trigger Button Middle Holder with Screwdriver Tool', 'brand': 'No Brand', 'price': '645.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Alloy Analog Controller Thumbsticks Replacement Parts Joysticks Analog Sticks for Xbox ONE / PS4 / Switch Controller 11 Pcs', 'brand': 'MOONEYE', 'price': '1544.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FIFA 21 – Xbox One & Xbox Series X', 'brand': 'No Brand', 'price': '1800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'No Brand', 'price': '1150.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Game Consoles Flight Stick Joystick USB Simulator Flight Controller Joystick', 'brand': 'No Brand', 'price': '15179.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Power Charger Adapter For Microsoft Surfa.6 RT Charger US Plug', 'brand': 'No Brand', 'price': '964.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '684.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FORIDE Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '763.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '663.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '739.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2208.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'TP4-005 Smart Turbo Temperature Control 5-Fan For Playstation 4 For PS4', 'brand': 'No Brand', 'price': '1239.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Stencils Bga Reballing Kit for Xbox Ps3 Chip Reballing Repair Game Consoles Repair Tools Kit', 'brand': 'No Brand', 'price': '1331.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Preloved Game Kinect xbox 360 CD Cassette Xbox360', 'brand': 'No Brand', 'price': '2138.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '734'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Shadow of the Tomb Raider - Xbox One', 'brand': 'No Brand', 'price': '2800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2322.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2027.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motheoard Repair', 'brand': 'No Brand', 'price': '649'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'XBOX 360 GAMES - DANCE CENTRAL 3 (KINECT REQUIRED) (FOR MOD /JAILBREAK CONSOLE)', 'brand': 'No Brand', 'price': '1485.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Kontrol Freek Call Of Duty Black Ops 4 Xbox One Series S-X', 'brand': 'No Brand', 'price': '810.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Hitman 2 - Xbox One', 'brand': 'No Brand', 'price': '2500.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Red Dead Redemption 2 XBOX ONE', 'brand': 'No Brand', 'price': '3800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Wired Gaming Headphones Bass Stereo Headsets with Mic for PS4 for XBOX-ONE', 'brand': 'No Brand', 'price': '977.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '10X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motheoard', 'brand': 'No Brand', 'price': '3615'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '739.00'}
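If you would rather not hard-code the number of pages, one option is to keep requesting the next page until the API stops returning items. A sketch, assuming listItems comes back empty past the last page:

import scrapy

class ABCSpider(scrapy.Spider):
    name = "ABC"
    base_url = "https://www.daraz.com.bd/xbox-games/?ajax=true&page={}"

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(1), cb_kwargs={"page": 1})

    def parse(self, response, page):
        items = response.json()["mods"]["listItems"]
        if not items:  # an empty page marks the end of the listing
            return
        for item in items:
            yield {"name": item["name"],
                   "brand": item["brandName"],
                   "price": item["price"]}
        # follow the next page
        yield scrapy.Request(self.base_url.format(page + 1),
                             cb_kwargs={"page": page + 1})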

Scrapy not returning all items from pagination

I want to scrape all monitor items from the site https://www.startech.com.bd, but when I run my spider it returns only 60 results.
Here is my code, which doesn't work right:
import scrapy
import time

class StartechSpider(scrapy.Spider):
    name = 'startech'
    allowed_domains = ['startech.com.bd']
    start_urls = ['https://www.startech.com.bd/monitor/']

    def parse(self, response):
        monitors = response.xpath("//div[@class='p-item']")
        for monitor in monitors:
            item = monitor.xpath(".//h4[@class='p-item-name']/a/text()").get()
            price = monitor.xpath(".//div[@class='p-item-price']/span/text()").get()
            yield {
                'item': item,
                'price': price
            }
        next_page = response.xpath("//ul[@class='pagination']/li/a/@href").get()
        print(next_page)
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Any help is much appreciated!
//ul[@class='pagination']/li/a/@href selects all ten pagination links at once, and .get() returns only the first of them; you have to select a unique link, namely the one pointing to the next page. The following XPath expression grabs the right pagination link.
Code:
next_page = response.xpath("//a[contains(text(), 'NEXT')]/@href").get()
print(next_page)
if next_page:
    yield response.follow(next_page, callback=self.parse)
Output:
2022-11-26 01:45:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.startech.com.bd/monitor?page=19> (referer: https://www.startech.com.bd/monitor?page=18)
2022-11-26 01:45:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.startech.com.bd/monitor?page=19>
{'item': 'HP E27q G4 27 Inch 2K QHD IPS Monitor', 'price': '41,000৳'}
None
2022-11-26 01:45:06 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-26 01:45:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6702,
'downloader/request_count': 19,
'downloader/request_method_count/GET': 19,
'downloader/response_bytes': 546195,
'downloader/response_count': 19,
'downloader/response_status_count/200': 19,
'elapsed_time_seconds': 9.939978,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 25, 19, 45, 6, 915772),
'httpcompression/response_bytes': 6200506,
'httpcompression/response_count': 19,
'item_scraped_count': 361,
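Putting it together, only the next_page line changes in the question's parse method; a sketch of the full method:

    def parse(self, response):
        for monitor in response.xpath("//div[@class='p-item']"):
            yield {
                'item': monitor.xpath(".//h4[@class='p-item-name']/a/text()").get(),
                'price': monitor.xpath(".//div[@class='p-item-price']/span/text()").get(),
            }
        # follow only the unique NEXT link instead of the first of all pagination links
        next_page = response.xpath("//a[contains(text(), 'NEXT')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)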

How to scrape products from finite-scrolling page using scrapy?

I recently started to learn scrapy and decided to scrape this site.
There are 24 products on one page, and more products load as you scroll down.
There should be about 334 products on this page in total.
I used scrapy and tried to scrape the products and the information inside them, but I can't make scrapy scrape more than 24 products.
I think I need Selenium or Splash to render and scroll down to the end; then I would be able to scrape it.
This is the code that scrapes 24 products:
import scrapy

class BookSpider(scrapy.Spider):
    name = 'basics2'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 OPR/92.0.0.0'
    }
    api_url = 'https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page'
    start_urls = ['https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page=1']

    # parse follows the href of every product
    def parse(self, response):
        for link in response.xpath("//div[@class='product-grid-product-info__main-info']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--1th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product carousel__item product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='product-grid-product-info__main-info']//a"):
            yield response.follow(link, callback=self.parse_book)

    # parse_book gets all the information inside each product
    def parse_book(self, response):
        yield {
            'title': response.xpath("//div[@class='product-detail-info__header']/h1/text()").get(),
            'normal_price': response.xpath("//div[@class='money-amount price-formatted__price-amount']//span//text()").get(),
            'discounted_price': response.xpath("(//span[@class='price__amount price__amount--on-sale price-current--with-background']//div[@class='money-amount price-formatted__price-amount']//span)[1]").get(),
            'Reference': response.xpath("//div[@class='product-detail-color-selector product-detail-info__color-selector']//p[@class='product-detail-selected-color product-detail-color-selector__selected-color-name']//text()").get(),
            'Description': response.xpath("//div[@class='expandable-text__inner-content']//p//text()").get(),
            'Image': response.xpath("//picture[@class='media-image']//source//@srcset").extract(),
            'item_url': response.url,
            # 'User-Agent': response.request.headers['User-Agent']
        }
There is no need to use slow and complex Selenium; you can grab all the required data from the API, like this:
import scrapy
import json

API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            yield {
                "name": data.get("commercialComponents")[0]['name']
            }
Output:
{'name': 'БОТИЛЬОНЫ ИЗ ТКАНИ С ОТДЕЛКОЙ ПАЙЕТКАМИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТУФЛИ С ОТДЕЛКОЙ ПАЙЕТКАМИ, НА КАБЛУКЕ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ФУТБОЛКА С ВОРОТНИКОМ-СТОЙКОЙ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'СУМКА-ШОПЕР С УЗЛАМИ НА ЛЯМКАХ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНАЯ ЮБКА ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНОЕ ПЛАТЬЕ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНЫЕ ЛЕГИНСЫ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-19 22:39:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186484,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 3.171018,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 19, 16, 39, 52, 441260),
'httpcompression/response_bytes': 2096267,
'httpcompression/response_count': 1,
'item_scraped_count': 476,
Update: see the updated answer below for how to extract the image URL from the API response data of this website.
import scrapy
import json

API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            # each element carries its media info in the xmedia list
            name = data.get("commercialComponents")[0]['xmedia'][0]['name']
            path = data.get("commercialComponents")[0]['xmedia'][0]['path']
            ts = data.get("commercialComponents")[0]['xmedia'][0]['timestamp']
            # the CDN URL is assembled from path, name and timestamp
            img = 'https://static.zara.net/photos//' + path + '/' + name + '.jpg?ts=' + ts
            yield {
                "image_url": img
            }
Output:
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_2_2_1.jpg?ts=1668003224849'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_1_1_1.jpg?ts=1668003224932'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/744/505/2/1067744505_1_1_1.jpg?ts=1668155524538'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_1_1.jpg?ts=1668085284347'}2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8587/866/099/2/8587866099_1_1_1.jpg?ts=1668003219701'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_10_1.jpg?ts=1668081955599'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/5388/629/711/2/5388629711_1_1_1.jpg?ts=1668008862794'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/800/2/6672010800_1_1_1.jpg?ts=1668172065554'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/002/2/6672010002_2_3_1.jpg?ts=1668164312812'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_8_1.jpg?ts=1668696590284'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/938/822/2/7901938822_2_5_1.jpg?ts=1668767172364'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/935/822/2/7901935822_2_5_1.jpg?ts=1668764555064'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_1_1.jpg?ts=1668691124206'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/936/822/2/7901936822_2_5_1.jpg?ts=1668767061454'}
2022-11-20 23:16:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 23:16:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186815,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.670308,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 20, 17, 16, 14, 180866),
'httpcompression/response_bytes': 2100146,
'httpcompression/response_count': 1,
'item_scraped_count': 474,
... so on
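If you want the product name and the image URL in one item, the two parse methods above combine directly; a sketch using only the fields already shown:

import scrapy

API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"

class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        for element in response.json()["productGroups"][0]["elements"]:
            component = element.get("commercialComponents")[0]
            media = component["xmedia"][0]
            img = ('https://static.zara.net/photos//' + media["path"]
                   + '/' + media["name"] + '.jpg?ts=' + media["timestamp"])
            yield {
                "name": component["name"],
                "image_url": img,
            }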

Scrapy/Selenium: Send more than 3 failed requests before script stops

I am currently trying to crawl a website (around 500 subpages).
The script works quite smoothly. However, after 3 to 4 hours of running, I sometimes get the error message you can find below. I don't think it is the script that causes the problem; it is the website's server, which is quite slow.
My question is: is it somehow possible to allow more than 3 failed requests before the script automatically stops and closes the spider?
2019-09-27 10:53:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 1 pages/min), scraped 4480 items (at 10 items/min)
2019-09-27 10:54:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 1 times): 504 Gateway Time-out
2019-09-27 10:54:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:55:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 2 times): 504 Gateway Time-out
2019-09-27 10:55:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:56:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 3 times): 504 Gateway Time-out
2019-09-27 10:56:00 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (referer: https://blogabet.com/tipsters) ['partial']
2019-09-27 10:56:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480>: HTTP status code is not handled or not allowed
2019-09-27 10:56:00 [scrapy.core.engine] INFO: Closing spider (finished)
UPDATED CODE ADDED
from time import sleep

from scrapy import Selector, Spider
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']
    start_urls = ('https://blogabet.com',)

    def parse(self, response):
        self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 5 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            while True:
                try:
                    element = self.driver.find_element_by_id('last_item')
                    self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                    sleep(3)
                    self.driver.find_element_by_id('last_item').click()
                    sleep(7)
                except NoSuchElementException:
                    self.logger.info('No more tipps')
                    sel = Selector(text=self.driver.page_source)
                    allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
                    for post in allposts:
                        username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                        publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
                        yield {'Username': username,
                               'Publish date': publish_date}
                    self.driver.quit()
                    break
You can do this by simply changing the RETRY_TIMES setting to a higher number.
You can read about your retry-related options in the RetryMiddleware docs: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-RETRY_TIMES
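For reference, a minimal sketch of the relevant settings (in settings.py, or in the spider's custom_settings):

RETRY_ENABLED = True  # on by default
RETRY_TIMES = 10      # the default is 2, which is why your log shows "failed 3 times"
# 504 is already included in the default RETRY_HTTP_CODES, so no change is needed there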

How to get cookies from response of scrapy splash

I want to get the cookie value from the response object of a Splash request, but it is not working as I expected.
Here is the spider code:
class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']

    def start_requests(self):
        url = 'https://www.amazon.com/gp/goldbox?ref_=nav_topnav_deals'
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        print(response.headers)
Output log:
2019-08-17 11:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2019-08-17 11:53:08 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.99.100:8050/robots.txt> (referer: None)
2019-08-17 11:53:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/gp/goldbox?ref_=nav_topnav_deals via http://192.168.99.100:8050/render.html> (referer: None)
{b'Date': [b'Sat, 17 Aug 2019 06:23:09 GMT'], b'Server': [b'TwistedWeb/18.9.0'], b'Content-Type': [b'text/html; charset=utf-8']}
2019-08-17 11:53:24 [scrapy.core.engine] INFO: Closing spider (finished)
You can try the following approach: write a small Lua script that returns the HTML plus the cookies:
lua_request = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    splash:wait(0.5)
    return {
        html = splash:html(),
        cookies = splash:get_cookies()
    }
end
"""
Change your Request to the following:
yield SplashRequest(
    url,
    self.parse,
    endpoint='execute',
    args={'lua_source': self.lua_request}
)
Then read the cookies in your parse method as follows:
def parse(self, response):
    cookies = response.data['cookies']
    headers = response.headers
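Because the Lua script calls splash:init_cookies(splash.args.cookies), you can also feed the captured cookies back into a follow-up request so the session carries over; a sketch (next_url is hypothetical):

yield SplashRequest(
    next_url,  # hypothetical next page to visit with the same session
    self.parse,
    endpoint='execute',
    args={
        'lua_source': self.lua_request,
        'cookies': cookies,  # exposed to the Lua script as splash.args.cookies
    },
)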