Scrapy Crawl Spider is not following the links - scrapy

I wrote this script to get data from https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/. My goal is to follow all the links and extract items from all those pages, but the spider is not following the links and I don't know what is wrong with it. If I use a basic spider it easily gets items from the page, but the crawl spider does not work. It does not throw any error, only the following message:
2022-02-19 21:36:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-19 21:36:56 [scrapy.core.engine] INFO: Spider opened
2022-02-19 21:36:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-19 21:36:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ProductSpider(CrawlSpider):
    name = 'product'
    allowed_domains = ['de.rs-online.com']
    start_urls = ['https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tr/td/div/div/div[2]/div/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath("//h1/text()").get(),
            'Categories': response.xpath("(//ol[@class='Breadcrumbstyled__StyledBreadcrumbList-sc-g4avu2-1 gHzygm']/li/a)[4]/text()").get(),
            'RS Best.-Nr.': response.xpath("//dl[@data-testid='key-details-desktop']/dd[1]/text()").get(),
            'URL': response.url
        }

If you want to follow all links without any filtering, you can simply omit the restrict_xpaths argument in your Rule definition. However, note that the XPaths in your parse_item callback are not correct, so you will still receive empty items. Recheck your XPaths and define them correctly to obtain the information you are after.
rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)


Unable to paginate with selenium-scrapy, only extracting data for first page

I am scraping a website for most recent customer rating, with several pages.
The problem is that I am able to interact with the "sortby" option and select "most recent" using Selenium, and to scrape the data for the first page using Scrapy. However, I am unable to extract the data for the other pages; the Selenium web driver somehow does not render the next page. My intention is to automate the data scraping.
I am a newb to web scraping. A snippet of the code is attached here (Some information is removed due to confidentiality)
import scrapy
import selenium.webdriver as webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
import time
from selenium.webdriver.support import expected_conditions as EC
from scrapy import Selector
from selenium.webdriver.edge.options import Options
from scrapy.utils.project import get_project_settings

class ABC(scrapy.Spider):
    # "........."

    def start_requests(self):
        # " ...... "
        yield scrapy.Request(url)

    def parse(self, response):
        settings = get_project_settings()
        driver_path = settings.get('EDGE_DRIVER_PATH')
        options = Options()
        options.add_argument("headless")
        ser = Service(driver_path)
        driver = webdriver.Edge(service=ser, options=options)
        driver.get(response.url)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "sort-order-dropdown")))
        element_dropdown = driver.find_element(By.ID, "sort-order-dropdown")
        select = Select(element_dropdown)
        select.select_by_value("recent")
        time.sleep(5)
        for review in response.css('[data-hook="review"]'):
            res = {
                "rating": review.css('[class="a-icon-alt"]::text').get(),
            }
            yield res
        next_page = response.xpath('//a[text()="Next page"]/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
        driver.quit()
It looks like you're using Scrapy and Selenium directly instead of scrapy_selenium (I don't see any SeleniumRequest in your code).
Your current spider works like this:
Get page using Scrapy
Get the same page using Selenium webdriver
Perform some actions using Selenium
Parse Scrapy response (for rating and next_page)
As you can see, you never use or parse the Selenium result.

Scrapy spider on an API to wait until new items are available

I'm writing a Scrapy spider that scrapes an API for all its items. The API does not provide the total count of results, so I go through all the pages in sequence until a page returns zero results. When it does, my spider currently exits.
Instead, I would like the spider to wait for 30 minutes, then try the same page again. Based on the previous question Scrapy: non-blocking pause, I tried the following code:
def parse(self, response):
    items = json.loads(response.text)
    for item in items:
        yield scrapy.Request(f'{self.settings.get("API_URL")}/{item["id"]}',
                             callback=self.parse_item,
                             headers=self.settings.get('API_HEADERS')
                             )
    if len(items) == 0:
        self.logger.info('No new items found. Waiting for 30 mins...')
        d = defer.Deferred()
        reactor.callLater(60.0*30.0, d.callback, self.page_request())
        return d
but I get an error twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed.
Since I am not familiar with Twisted and am just learning Scrapy, I wonder if anyone has a suggestion on how to make progress. Thanks!

Missing results in export when running scrapy spider with multiple start_urls

I am running a scrapy spider to export some football data and using the scrapy splash plugin.
For development I am running the spider against cached results, so as not to hit the website too much. The strange thing is that I am consistently missing some items in the export when running the spider with multiple start_urls. The number of missing items differs slightly each time.
However, when I comment out all but one of the start_urls and run the spider for each one separately, I get all the results. I am not sure whether this is a bug in scrapy or whether I am missing something about scrapy here, as this is my first project with the framework.
Here is my caching configuration:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [403]
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
These are my start_urls:
start_urls = [
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2017',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2018',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2019',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2020',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2017',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2018',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2019',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2020'
]
I have a standard setup with an export pipeline and my spider yielding splash requests multiple times for each relevant url on the page from a parse method. Each parse method fills the same item passed via cb_kwargs or creates a new one with data from the passed item.
Please let me know if further code from my project, like the spider, pipelines or item loaders might be relevant to the issue and I will edit my question here.
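One thing worth checking, given the description above: if several callbacks receive the same item object via cb_kwargs, they all mutate one shared dict, and with multiple start_urls running concurrently they can overwrite each other's fields. Copying per request keeps the results independent. A stdlib-only sketch of that difference (the field names here are invented for illustration, not taken from the project):

```python
base_item = {"league": "L1"}  # hypothetical item handed around via cb_kwargs

def fill_shared(item, season):
    item["season"] = season        # mutates the single shared dict
    return item

def fill_copied(item, season):
    item = dict(item)              # each callback works on its own copy
    item["season"] = season
    return item

# Both "shared" entries are the same object; the second call clobbers the first.
shared = [fill_shared(base_item, s) for s in (2017, 2018)]
copied = [fill_copied({"league": "L1"}, s) for s in (2017, 2018)]
print(shared[0]["season"], copied[0]["season"])
```

Whether this is the actual cause of the missing items cannot be told from the snippets shown; it is simply a common source of nondeterministic results with the cb_kwargs pattern described.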

Scrapy Broad Crawl: Quickstart Example Project

Would there be any code example showing a minimal structure of a Broad Crawls with Scrapy?
Some desirable requirements:
crawl in BFO order; (DEPTH_PRIORITY?)
crawl only from URLs that follow certain patterns; and (LinkExtractor?)
URLs must have a maximum depth. (DEPTH_LIMIT)
I am starting with:
import scrapy
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class WebSpider(scrapy.Spider):
    name = "webspider"

    def __init__(self):
        super().__init__()
        self.link_extractor = LxmlLinkExtractor(allow=r"\.br\/")
        self.collecton_file = open("collection.jsonl", 'w')

    start_urls = [
        "https://www.uol.com.br/"
    ]

    def parse(self, response):
        data = {
            "url": response.request.url,
            "html_content": response.body
        }
        self.collecton_file.write(f"{data}\n")
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)
Is that the correct approach to crawl and store the raw HTML on disk?
How to stop the spider when it has already collected n pages?
How to show some stats (pages/min, errors, number of pages collected so far) instead of the standard log?
It should work, but you are writing to a file manually instead of letting Scrapy (feed exports) do that for you.
CLOSESPIDER_PAGECOUNT. But mind that it starts shutting down once the specified number of pages is reached, yet it will still finish processing some additional pages that are already in flight.
I suspect you just want to set LOG_LEVEL to something else.
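Taken together, the ordering, depth, and stopping requirements map onto a handful of settings (the URL-pattern filter stays in the LinkExtractor's allow argument). A sketch of the values, which could go into the spider's custom_settings; the specific numbers are placeholders:

```python
# Settings sketch for the requirements above; numbers are placeholders.
BROAD_CRAWL_SETTINGS = {
    # BFO order: positive DEPTH_PRIORITY plus FIFO scheduler queues
    "DEPTH_PRIORITY": 1,
    "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
    "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
    # maximum crawl depth
    "DEPTH_LIMIT": 3,
    # stop after roughly this many pages (a few extra may still finish)
    "CLOSESPIDER_PAGECOUNT": 10000,
    # keep the periodic logstats line (pages/min, items/min) without DEBUG noise
    "LOG_LEVEL": "INFO",
}
print(sorted(BROAD_CRAWL_SETTINGS))
```

The DEPTH_PRIORITY/FIFO-queue combination is the breadth-first recipe from the Scrapy documentation; everything else is a single setting per requirement.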

Run multiple processes of a Scrapy Spider

I have a Scrapy project which reads 1 million product IDs from a database and then starts scraping product details by ID from a website.
My Spider is fully working.
I want to run 10 instances of Spider with each assigned an equal number of product IDs.
I can do it like,
SELECT COUNT(*) FROM product_ids and then divide it by 10 and then do
SELECT * FROM product_ids LIMIT 0, N and so on
I have an idea I can do it in Terminal by passing LIMIT in scrapy command like scrapy crawl my_spider scrape=1000 and so on.
But I want to do it in the spider itself, so I run my spider only once and it then launches 10 further processes of the same spider from within.
One way to do this is using the CrawlerProcess helper class or the CrawlerRunner class.
import scrapy
from scrapy.crawler import CrawlerProcess

class Spider1(scrapy.Spider):
    # Your first spider definition
    ...

process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider1)
process.start()
Note that this runs multiple spiders in the same process, not multiple processes.
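To actually get separate OS processes, one option is to partition the ID range up front and hand each slice to its own invocation, e.g. via multiprocessing or separate `scrapy crawl my_spider -a offset=... -a limit=...` commands (offset/limit as spider arguments are an assumption here; the spider would need to read them and build its query accordingly). The slicing itself is plain arithmetic:

```python
def partition(total, parts):
    """Split `total` IDs into `parts` contiguous (offset, limit) slices."""
    base, rem = divmod(total, parts)
    slices, offset = [], 0
    for i in range(parts):
        limit = base + (1 if i < rem else 0)  # spread any remainder evenly
        slices.append((offset, limit))
        offset += limit
    return slices

# 10 slices covering 1 million product IDs, e.g. for
#   SELECT id FROM product_ids LIMIT {limit} OFFSET {offset}
slices = partition(1_000_000, 10)
print(slices[0], slices[-1])
```

Each slice then drives one independent crawl process, which also keeps the per-process memory and request queues separate.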