CONTEXT: I'm a newbie at web scraping. I was trying to scrape a local e-commerce site. It's a dynamic website, so I am using scrapy-playwright (Chromium) with proxies.
PROBLEM: It ran smoothly until I tried to scrape multiple pages. I am generating multiple URLs, each with its own page number, but instead of scraping different pages it scrapes the first page multiple times. It seems Playwright is at fault, but I'm not sure whether it's my code or a bug. I have tried running it as separate processes, with and without proxies and user agents, and the results are the same. I can't figure out why this is happening.
import logging

import scrapy
from scrapy_playwright.page import PageMethod

from helper import should_abort_request


class ABCSpider(scrapy.Spider):
    name = "ABC"
    custom_settings = {
        'PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT': '100000',
        'PLAYWRIGHT_ABORT_REQUEST': should_abort_request,
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1',
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--LNmE6"]'),
                ],
            },
        )

    async def parse(self, response):
        total = response.xpath('/html/body/div[3]/div/div[2]/div/div/div[1]/div[3]/div/ul/li[last()-1]/a/text()').extract()[0]
        total_pages = int(total)  # total_pages = 4
        links = []
        for i in range(1, total_pages + 1):
            a = 'https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page={}'.format(i)
            links.append(a)
        for link in links:
            res = scrapy.Request(url=link, meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '[class="box--ujueT"]'),
                ]})
            yield res and {
                "link": response.url
            }
OUTPUT:
[
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"},
{"link": "https://www.daraz.com.bd/xbox-games/?spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO&page=1"}
]
Instead of iterating over the pages in the start_requests method, you are pulling the number of pages in the parse method and generating further requests from there.
The issue with this strategy is that each of the requests you generate in the parse method is itself parsed by the parse method, so for each and every request you are telling it to generate a whole new set of requests for every page it detects from the page number, which is likely the same on every page.
Luckily Scrapy has a duplicate filter built in, so it would likely ignore these duplicates if you were yielding them properly.
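As an aside, those duplicate requests are dropped silently by default; if you ever genuinely need to re-fetch a URL Scrapy has already seen, the standard dont_filter flag opts a single request out of that filter:
yield scrapy.Request(url, dont_filter=True)  # re-fetched even if already seen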
The next issue is your yield statement. The expression a and b doesn't return both a and b; it only returns b, unless a is falsy, in which case it returns a.
So your yield expression...
yield res and {
    "link": response.url
}
will only ever actually yield: {"link": response.url}.
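The straightforward fix is to yield the request and the item as separate statements; a minimal sketch of the corrected end of parse (the duplicate filter then takes care of the repeated page requests, and each parsed page yields its own link):

for link in links:
    yield scrapy.Request(url=link, meta={
        "playwright": True,
        "playwright_include_page": True,
        "playwright_page_methods": [
            PageMethod("wait_for_selector", '[class="box--ujueT"]'),
        ],
    })
yield {"link": response.url}  # one item per parsed page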
Beyond what I mention above, your code doesn't do anything else. However, since you instruct the page to wait for the element containing the items for sale to render, I am assuming that your eventual goal is to scrape the data from each of the items on the page.
With this in mind, I would suggest that you not use scrapy_playwright at all and instead get the data from the JSON API that the website uses in its AJAX requests.
For example:
import scrapy


class ABCSpider(scrapy.Spider):
    name = "ABC"

    def start_requests(self):
        for i in range(1, 5):  # the category has 4 pages, numbered from 1
            url = f"https://www.daraz.com.bd/xbox-games/?ajax=true&page={i}&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO"
            yield scrapy.Request(url)

    def parse(self, response):
        data = response.json()
        items = data["mods"]["listItems"]
        for item in items:
            yield {"name": item['name'],
                   "brand": item['brandName'],
                   "price": item['price']}
Partial output:
{'name': 'Xbox 360 GamePad, Xbox 360 Controller for Windows', 'brand': 'Microsoft', 'price': '1400.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Pole Bugatt RT 360 12FIT Fishing Rod Hat Chip', 'brand': 'No Brand', 'price': '1020.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'Microsoft', 'price': '1250.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '【Seyijian】 1Set RB LB Bumpers Buttons for Microsoft XBox Series X Controller Button Holder RHA', 'brand': 'No Brand', 'price': '452.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'For Xbox One S Slim Internal Power Supply Adapter Replacement N115-120P1A 12V', 'brand': 'No Brand', 'price': '2591.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'DOU Lb Rb Lt Rt Front Bumper Buttons Set Replacement Accessory, Fits for X box Series S X Controllers', 'brand': 'No Brand', 'price': '602.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'IVYUEEN 2 Sets RB LB Bumpers Buttons for XBox Series X S Controller Trigger Button Middle Holder with Screwdriver Tool', 'brand': 'No Brand', 'price': '645.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Alloy Analog Controller Thumbsticks Replacement Parts Joysticks Analog Sticks for Xbox ONE / PS4 / Switch Controller 11 Pcs', 'brand': 'MOONEYE', 'price': '1544.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FIFA 21 – Xbox One & Xbox Series X', 'brand': 'No Brand', 'price': '1800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Xbox 360 Controller,USB Wired Controller Gamepad for Microsoft Xbox 360,PC Windowns,XP,Vista,Win7 - Black', 'brand': 'No Brand', 'price': '1150.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Game Consoles Flight Stick Joystick USB Simulator Flight Controller Joystick', 'brand': 'No Brand', 'price': '15179.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Power Charger Adapter For Microsoft Surfa.6 RT Charger US Plug', 'brand': 'No Brand', 'price': '964.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '684.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'FORIDE Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '763.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '663.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motherboard Repair', 'brand': 'No Brand', 'price': '739.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2208.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'TP4-005 Smart Turbo Temperature Control 5-Fan For Playstation 4 For PS4', 'brand': 'No Brand', 'price': '1239.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Stencils Bga Reballing Kit for Xbox Ps3 Chip Reballing Repair Game Consoles Repair Tools Kit', 'brand': 'No Brand', 'price': '1331.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Preloved Game Kinect xbox 360 CD Cassette Xbox360', 'brand': 'No Brand', 'price': '2138.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '734'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Shadow of the Tomb Raider - Xbox One', 'brand': 'No Brand', 'price': '2800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2322.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '5X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motherboard', 'brand': 'No Brand', 'price': '2027.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Motheoard Repair', 'brand': 'No Brand', 'price': '649'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'XBOX 360 GAMES - DANCE CENTRAL 3 (KINECT REQUIRED) (FOR MOD /JAILBREAK CONSOLE)', 'brand': 'No Brand', 'price': '1485.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Kontrol Freek Call Of Duty Black Ops 4 Xbox One Series S-X', 'brand': 'No Brand', 'price': '810.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Hitman 2 - Xbox One', 'brand': 'No Brand', 'price': '2500.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Red Dead Redemption 2 XBOX ONE', 'brand': 'No Brand', 'price': '3800.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Wired Gaming Headphones Bass Stereo Headsets with Mic for PS4 for XBOX-ONE', 'brand': 'No Brand', 'price': '977.00'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': '10X Matrix Glitcher V3 Corona 48MHZ Crystals IC Chip Repair for Xbox 360 Gaming Console Motheoard', 'brand': 'No Brand', 'price': '3615'}
2022-12-20 23:21:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.daraz.com.bd/xbox-games/?ajax=true&page=3&spm=a2a0e.searchlistcategory.cate_6_6.5.2a4e15ab6961xO>
{'name': 'Matrix Glitcher V1 Run Chip Board for Xbox 360/Xbox 360 Slim Repair', 'brand': 'No Brand', 'price': '739.00'}
Related
I want to scrape all the monitor items from the site https://www.startech.com.bd, but when I run my spider it returns only 60 results.
Here is my code, which doesn't work right:
import scrapy
import time


class StartechSpider(scrapy.Spider):
    name = 'startech'
    allowed_domains = ['startech.com.bd']
    start_urls = ['https://www.startech.com.bd/monitor/']

    def parse(self, response):
        monitors = response.xpath("//div[@class='p-item']")
        for monitor in monitors:
            item = monitor.xpath(".//h4[@class='p-item-name']/a/text()").get()
            price = monitor.xpath(".//div[@class='p-item-price']/span/text()").get()
            yield {
                'item': item,
                'price': price
            }
        next_page = response.xpath("//ul[@class='pagination']/li/a/@href").get()
        print(next_page)
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Any help is much appreciated!
//ul[@class='pagination']/li/a/@href selects 10 pagination links at once, but you have to select the unique link, meaning only the next page. The following XPath expression grabs the right pagination link.
Code:
next_page = response.xpath("//a[contains(text(), 'NEXT')]/@href").get()
print(next_page)
if next_page:
    yield response.follow(next_page, callback=self.parse)
Output:
2022-11-26 01:45:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.startech.com.bd/monitor?page=19> (referer: https://www.startech.com.bd/monitor?page=18)
2022-11-26 01:45:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.startech.com.bd/monitor?page=19>
{'item': 'HP E27q G4 27 Inch 2K QHD IPS Monitor', 'price': '41,000৳'}
None
2022-11-26 01:45:06 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-26 01:45:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6702,
'downloader/request_count': 19,
'downloader/request_method_count/GET': 19,
'downloader/response_bytes': 546195,
'downloader/response_count': 19,
'downloader/response_status_count/200': 19,
'elapsed_time_seconds': 9.939978,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 25, 19, 45, 6, 915772),
'httpcompression/response_bytes': 6200506,
'httpcompression/response_count': 19,
'item_scraped_count': 361,
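Putting that fix back into the original spider gives a complete working version; a sketch combining the question's item extraction with the corrected pagination:

import scrapy


class StartechSpider(scrapy.Spider):
    name = 'startech'
    allowed_domains = ['startech.com.bd']
    start_urls = ['https://www.startech.com.bd/monitor/']

    def parse(self, response):
        for monitor in response.xpath("//div[@class='p-item']"):
            yield {
                'item': monitor.xpath(".//h4[@class='p-item-name']/a/text()").get(),
                'price': monitor.xpath(".//div[@class='p-item-price']/span/text()").get(),
            }
        # Follow only the unique NEXT link instead of the 10 pagination links
        next_page = response.xpath("//a[contains(text(), 'NEXT')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)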
I recently started to learn Scrapy and decided to scrape this site.
There are 24 products on the first page, and when you scroll down more products load; there should be about 334 products on this page in total.
I used Scrapy and tried to scrape the products and the information inside each one, but I can't make Scrapy scrape more than 24 products.
I think I need Selenium or Splash to render the page and scroll down to the end; then I would be able to scrape it.
This is the code that scrapes 24 products:
import scrapy


class BookSpider(scrapy.Spider):
    name = 'basics2'
    api_url = 'https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page'
    start_urls = ['https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page=1']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 OPR/92.0.0.0'
    }

    # parse goes to the href of every product
    def parse(self, response):
        for link in response.xpath("//div[@class='product-grid-product-info__main-info']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--1th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='carousel__items']//li[@class='product-grid-product _product carousel__item product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a"):
            yield response.follow(link, callback=self.parse_book)
        for link in response.xpath("//ul[@class='product-grid-product-info__main-info']//a"):
            yield response.follow(link, callback=self.parse_book)

    # parse_book gets all the information inside each product
    def parse_book(self, response):
        yield {
            'title': response.xpath("//div[@class='product-detail-info__header']/h1/text()").get(),
            'normal_price': response.xpath("//div[@class='money-amount price-formatted__price-amount']//span//text()").get(),
            'discounted_price': response.xpath("(//span[@class='price__amount price__amount--on-sale price-current--with-background']//div[@class='money-amount price-formatted__price-amount']//span)[1]").get(),
            'Reference': response.xpath("//div[@class='product-detail-color-selector product-detail-info__color-selector']//p[@class='product-detail-selected-color product-detail-color-selector__selected-color-name']//text()").get(),
            'Description': response.xpath("//div[@class='expandable-text__inner-content']//p//text()").get(),
            'Image': response.xpath("//picture[@class='media-image']//source//@srcset").extract(),
            'item_url': response.url,
            # 'User-Agent': response.request.headers['User-Agent']
        }
There is no need to use slow, complex Selenium; you can grab all the required data from the API, like this:
import scrapy
import json

API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            yield {
                "name": data.get("commercialComponents")[0]['name']
            }
Output:
{'name': 'БОТИЛЬОНЫ ИЗ ТКАНИ С ОТДЕЛКОЙ ПАЙЕТКАМИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТУФЛИ С ОТДЕЛКОЙ ПАЙЕТКАМИ, НА КАБЛУКЕ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ФУТБОЛКА С ВОРОТНИКОМ-СТОЙКОЙ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'СУМКА-ШОПЕР С УЗЛАМИ НА ЛЯМКАХ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'МИНИ-СУМКА ГЕОМЕТРИЧЕСКОЙ ФОРМЫ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНАЯ ЮБКА ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНОЕ ПЛАТЬЕ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'ТОП ИЗ ЭЛАСТИЧНОГО ТРИКОТАЖА'}
2022-11-19 22:39:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'name': 'БЕСШОВНЫЕ ЛЕГИНСЫ ИЗ МЯГКОЙ ТКАНИ'}
2022-11-19 22:39:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-19 22:39:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186484,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 3.171018,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 19, 16, 39, 52, 441260),
'httpcompression/response_bytes': 2096267,
'httpcompression/response_count': 1,
'item_scraped_count': 476,
Update: see the updated answer below for how to extract the image URL from the API response data of this website.
import scrapy
import json

API_URL = "https://www.zara.com/ru/ru/category/2111785/products?ajax=true"


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [API_URL]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        json_response = json.loads(response.text)
        datas = json_response["productGroups"][0]['elements']
        for data in datas:
            name = data.get("commercialComponents")[0]['xmedia'][0]['name']
            path = data.get("commercialComponents")[0]['xmedia'][0]['path']
            ts = data.get("commercialComponents")[0]['xmedia'][0]['timestamp']
            img = 'https://static.zara.net/photos//' + path + '/' + name + '.jpg?ts=' + ts
            yield {
                "image_url": img
            }
Output:
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_2_2_1.jpg?ts=1668003224849'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/785/800/2/1067785800_1_1_1.jpg?ts=1668003224932'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/1067/744/505/2/1067744505_1_1_1.jpg?ts=1668155524538'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_1_1.jpg?ts=1668085284347'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8587/866/099/2/8587866099_1_1_1.jpg?ts=1668003219701'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/8586/866/099/2/8586866099_15_10_1.jpg?ts=1668081955599'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/5388/629/711/2/5388629711_1_1_1.jpg?ts=1668008862794'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/800/2/6672010800_1_1_1.jpg?ts=1668172065554'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/1/1/p/6672/010/002/2/6672010002_2_3_1.jpg?ts=1668164312812'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_8_1.jpg?ts=1668696590284'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/938/822/2/7901938822_2_5_1.jpg?ts=1668767172364'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/935/822/2/7901935822_2_5_1.jpg?ts=1668764555064'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2023/V/0/1/p/5584/151/800/2/5584151800_2_1_1.jpg?ts=1668691124206'}
2022-11-20 23:16:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.zara.com/ru/ru/category/2111785/products?ajax=true>
{'image_url': 'https://static.zara.net/photos///2022/I/0/1/p/7901/936/822/2/7901936822_2_5_1.jpg?ts=1668767061454'}
2022-11-20 23:16:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 23:16:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 330,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 186815,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.670308,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 20, 17, 16, 14, 180866),
'httpcompression/response_bytes': 2100146,
'httpcompression/response_count': 1,
'item_scraped_count': 474,
... so on
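The two snippets combine naturally if you want the product name and the image URL in one item; a sketch reusing only the fields shown above:

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = ["https://www.zara.com/ru/ru/category/2111785/products?ajax=true"]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        for data in response.json()["productGroups"][0]['elements']:
            component = data.get("commercialComponents")[0]
            media = component['xmedia'][0]
            yield {
                "name": component['name'],
                # Same URL pattern as above: photos//<path>/<name>.jpg?ts=<timestamp>
                "image_url": ('https://static.zara.net/photos//' + media['path']
                              + '/' + media['name'] + '.jpg?ts=' + media['timestamp']),
            }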
I am trying to execute this script, but I don't know why it throws null values and duplicates at the same time. My goal is to fill in the search criteria and click the search button, get all the hrefs from the results page, and collect the data. That part works, but the output contains null and duplicate values at the same time, and I don't know what I am missing here.
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


class RightMove2Spider(scrapy.Spider):
    name = 'rightmove2'
    start_urls = ["https://www.rightmove.co.uk/property-for-sale/search.html?searchLocation=London&useLocationIdentifier=true&locationIdentifier=REGION%5E87490&buy=For+sale"]

    def __init__(self, name=None, **kwargs):
        chrome_options = Options()
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        driver.set_window_size(1920, 1080)
        driver.get("https://www.rightmove.co.uk/property-for-sale/search.html?searchLocation=London&useLocationIdentifier=true&locationIdentifier=REGION%5E87490&buy=For+sale")
        price_range = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//option[@value='2000000'])[2]")))
        price_range.click()
        time.sleep(1)
        bedroom_range = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//option[@value='5'])[1]")))
        bedroom_range.click()
        time.sleep(1)
        tick_box = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='tickbox--indicator']")))
        tick_box.click()
        time.sleep(1)
        find_properties_btn = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='submit']")))
        find_properties_btn.click()
        time.sleep(3)
        self.property_xpath = driver.find_elements(By.XPATH, "//*[@class='l-searchResult is-list']/div/div/div[4]/div[1]/div[2]/a")
        # driver.close()
        super().__init__(name, **kwargs)

    def parse(self, response):
        for el in self.property_xpath:
            href = el.get_attribute('href')
            time.sleep(1)
            yield SeleniumRequest(
                url=href,
                wait_time=3)
            yield {
                'Title': response.xpath("//h1[@itemprop='streetAddress']/text()").get(),
                'Price': response.xpath("//div[@class='_1gfnqJ3Vtd1z40MlC0MzXu']/span/text()").get(),
                'Agent Name': response.xpath("//div[@class='RPNfwwZBarvBLs58-mdN8']/a/text()").get(),
                'Agent Address': response.xpath("//div[@class='OojFk4MTxFDKIfqreGNt0']/text()").get(),
                'Agent Telephone': response.xpath("//a[@class='_3E1fAHUmQ27HFUFIBdrW0u']/text()").get(),
                'Added on': response.xpath("//div[@class='_2nk2x6QhNB1UrxdI5KpvaF']/text()").get(),
                'Links': response.url
            }
        for x in range(24, 1008, 24):
            abs_url = f'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E87490&minBedrooms=5&maxPrice=2000000&index={x}&propertyTypes=&includeSSTC=true&mustHave=&dontShow=&furnishTypes=&keywords='
            yield SeleniumRequest(
                url=abs_url,
                callback=self.parse
            )
Output:
{"Title": null, "Price": null, "Agent Name": null, "Agent Address": null, "Agent Telephone": null, "Added on": null, "Links": "https://www.rightmove.co.uk/property-for-sale/search.html?searchLocation=London&useLocationIdentifier=true&locationIdentifier=REGION%5E87490&buy=For+sale"},
{"Title": "Combwell Crescent, Abbey Wood, London", "Price": "£450,000", "Agent Name": "Anthony Martin Estate Agents, Bexleyheath", "Agent Address": "2 Pickford Lane,\r\nBexleyheath,\r\nDA7 4QW", "Agent Telephone": "020 8012 7475", "Added on": "Added on 30/11/2021", "Links": "https://www.rightmove.co.uk/properties/117050312"},
{"Title": null, "Price": null, "Agent Name": null, "Agent Address": null, "Agent Telephone": null, "Added on": null, "Links": "https://www.rightmove.co.uk/property-for-sale/search.html?searchLocation=London&useLocationIdentifier=true&locationIdentifier=REGION%5E87490&buy=For+sale"},
{"Title": null, "Price": null, "Agent Name": null, "Agent Address": null, "Agent Telephone": null, "Added on": null, "Links": "https://www.rightmove.co.uk/property-for-sale/search.html?searchLocation=London&useLocationIdentifier=true&locationIdentifier=REGION%5E87490&buy=For+sale"},
{"Title": "Combwell Crescent, Abbey Wood, London", "Price": "£450,000", "Agent Name": "Anthony Martin Estate Agents, Bexleyheath", "Agent Address": "2 Pickford Lane,\r\nBexleyheath,\r\nDA7 4QW", "Agent Telephone": "020 8012 7475", "Added on": "Added on 30/11/2021", "Links": "https://www.rightmove.co.uk/properties/117050312"},
{"Title": null, "Price": null, "Agent Name": null, "Agent Address": null, "Agent Telephone": null, "Added on": null, "Links": "https://www.rightmove.co.uk/property-for-sale/search.html?searchLocation=London&useLocationIdentifier=true&locationIdentifier=REGION%5E87490&buy=For+sale"},
{"Title": "Combwell Crescent, Abbey Wood, London", "Price": "£450,000", "Agent Name": "Anthony Martin Estate Agents, Bexleyheath", "Agent Address": "2 Pickford Lane,\r\nBexleyheath,\r\nDA7 4QW", "Agent Telephone": "020 8012 7475", "Added on": "Added on 30/11/2021", "Links": "https://www.rightmove.co.uk/properties/117050312"},
Before starting a web-scraping project, success depends on choosing the right tool and using it the right way. Here, the data is also generated from API calls with a JSON response, so why make web scraping so complex with Selenium when you can easily grab the data from the API?
Script:
import scrapy
#import json


class PropertySpider(scrapy.Spider):
    name = 'property'

    def start_requests(self):
        headers = {
            "Content-Type": "application/x-www-form-urlencoded",
            "X-Requested-With": "XMLHttpRequest",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
        }
        yield scrapy.Request(
            url='https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index=24&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false',
            method="GET",
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        resp = response.json()
        for item in resp['properties']:
            yield {
                "title": item['summary'],
                'price': item['price']['amount'],
                'url': 'https://www.rightmove.co.uk' + item['propertyUrl']
            }
Output:
{'title': "A stunning two bedroom, two bathroom apartment on the 11th floor set over approx 1,645 sq ft, located in St George's brilliant new river fronted development, One Blackfriars, SE1.", 'price': 2000000, 'url': 'https://www.rightmove.co.uk/properties/118739888#/?channel=RES_BUY'}
2022-03-29 15:54:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index=24&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false>
{'title': 'An immaculate four bedroom townhouse arranged over three floors nestled along a peaceful row of pretty houses.', 'price': 2000000, 'url': 'https://www.rightmove.co.uk/properties/118772936#/?channel=RES_BUY'}
2022-03-29 15:54:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index=24&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false>
{'title': "With a great view of the River Thames, The Shard and the City of London, this bright and ideally located two bedroom apartment is 'as new' and is available for chain free sale through Prime London. The bright and clean living space, coming in at 1,210 sq ft / 112 sq m presents exceptionally well...", 'price': 2000000, 'url': 'https://www.rightmove.co.uk/properties/113289182#/?channel=RES_BUY'}
2022-03-29 15:54:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index=24&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false>
{'title': 'With incredible views from the 19th and 20th floors, this 1,678 sq ft (155.9 sqm) penthouse apartment at The Perspective Building is available for sale exclusively through Prime London. The property features two large double bedrooms (both with en suite), occasional/guest bedroom, large open-pla...', 'price': 1999950, 'url': 'https://www.rightmove.co.uk/properties/73980120#/?channel=RES_BUY'}
2022-03-29 15:54:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index=24&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false>
{'title': 'Set in one of London’s most desirable riverside locations, adjacent to Westminster and next to the London Eye, 8 Casson Square celebrates the rich history and heritage of its surroundings. The combination of the intricate architectural design and the impressive location will together create so...', 'price': 1965000, 'url': 'https://www.rightmove.co.uk/properties/79565985#/?channel=RES_BUY'}
2022-03-29 15:54:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index=24&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false>
{'title': 'Newly refurbished two bedroom, two bathroom apartment in Whitehall Court, Westminster.', 'price': 1950000, 'url': 'https://www.rightmove.co.uk/properties/116568074#/?channel=RES_BUY'}
2022-03-29 15:54:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index=24&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false>
... so on
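To walk the full result set rather than a single page of 24, the index loop from the question drops straight into this API approach; a sketch, assuming the index parameter advances 24 results at a time as in the original Selenium code (starting at 0 so the first page is included):

import scrapy


class PropertySpider(scrapy.Spider):
    name = 'property'
    api = ('https://www.rightmove.co.uk/api/_search?locationIdentifier=STATION%5E9662'
           '&numberOfPropertiesPerPage=24&radius=0.5&sortType=2&index={}'
           '&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft'
           '&currencyCode=GBP&isFetching=false')

    def start_requests(self):
        # Same 24-result stride as the question's range(24, 1008, 24) loop
        for index in range(0, 1008, 24):
            yield scrapy.Request(self.api.format(index), callback=self.parse)

    def parse(self, response):
        for item in response.json()['properties']:
            yield {
                'title': item['summary'],
                'price': item['price']['amount'],
                'url': 'https://www.rightmove.co.uk' + item['propertyUrl'],
            }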
I'm trying to learn how to use PyMongo, so I borrowed some code from a tutorial. Here's the entire program:
from pymongo import MongoClient

cars = [{'name': 'Audi', 'price': 52642},
        {'name': 'Mercedes', 'price': 57127},
        {'name': 'Skoda', 'price': 9000},
        {'name': 'Volvo', 'price': 29000},
        {'name': 'Bentley', 'price': 350000},
        {'name': 'Citroen', 'price': 21000},
        {'name': 'Hummer', 'price': 41400},
        {'name': 'Volkswagen', 'price': 21600}]

client = MongoClient('mongodb://localhost:27017/')
print("Created client")

with client:
    db = client.testdb
    print("Created db")
    db.cars.insert_many(cars)
    print("Inserted")
When I run it, it prints "Created client" and "Created db", but never prints "Inserted", and the program never terminates.
I'm using Python 3.8.5, the Eclipse IDE, and I just did "pip install PyMongo" today, so I should have the latest version. Thanks for any help.
SOLVED:
Whoops, I didn't realize that you have to install MongoDB separately from PyMongo!
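A note for anyone who hits the same hang: PyMongo connects lazily, so a missing or unreachable server only surfaces once the first real operation (here insert_many) blocks waiting on server selection, which by default can take around 30 seconds. A minimal fail-fast sketch, using an arbitrary 2-second timeout:

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

# Give up after 2 seconds instead of the default ~30 if no server responds
client = MongoClient('mongodb://localhost:27017/', serverSelectionTimeoutMS=2000)
try:
    client.admin.command('ping')  # forces a real round-trip to the server
    print("Connected")
except ServerSelectionTimeoutError:
    print("MongoDB server is not running or is unreachable")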
I'm new to Scrapy and I'm working through the manual. I'm doing some exercises and got stuck on this issue: while iterating through the list of books, the results always return the same key/value pairs on every iteration, despite the fact that there are 20 different elements on the page.
This is my code:
import scrapy


class MyBooks(scrapy.Spider):
    name = 'bookstore'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        for book in response.xpath('//article[@class="product_pod"]'):
            yield {
                'title': book.xpath('//h3/a/text()').get(),
                'price': book.xpath('//p[@class="price_color"]/text()').get(),
            }
And this is my result:
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
{'title': 'A Light in the ...', 'price': '£51.77'}
2020-02-07 12:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://books.toscrape.com>
Why is that? Where am I wrong?
The problem is that an XPath query starting with // searches from the document root regardless of the context node, so book.xpath('//h3/a/text()') and book.xpath('//p[@class="price_color"]/text()') match every book's data, and .get() just returns the first match each time. To confirm this, call .getall() instead of .get() on these selectors; you will see that it returns a list with every book's result. Prefixing the query with a dot (e.g. './/h3/a/text()') makes it relative to the current book. I got it working with CSS selectors, though:
def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            'title': book.css('h3').css('a::text').get(),
            'price': book.css('.price_color::text').get()
        }
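For completeness, the original XPath version also works once the queries are made relative with a leading dot; a sketch:

def parse(self, response):
    for book in response.xpath('//article[@class="product_pod"]'):
        yield {
            # './/' anchors the search at the current <article>, not the document root
            'title': book.xpath('.//h3/a/text()').get(),
            'price': book.xpath('.//p[@class="price_color"]/text()').get(),
        }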
You can read more about selectors in the Scrapy documentation.