I have built a basic Scrapy spider which scrapes a site's product category page, opens all the individual product pages, and scrapes some product information. When there are multiple pages for one category, the site uses JavaScript to refresh the product list (the URL does not change).
I am trying to use Selenium to access the JS generated pages.
import time
import scrapy
from myscraper.items import myscraperItem
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


class websiteSpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/cat1',
    )

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(10)

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_css_selector("li.active a#page_right div")
            try:
                for href in response.css('div.productlist h3 a::attr(href)'):
                    url = response.urljoin(href.extract())
                    yield scrapy.Request(url, callback=self.parse_product_page)
                time.sleep(10)
                next.click()
            except:
                break

    def parse_product_page(self, response):
        ...
When I run this I only scrape products from the first page. How can I push the newly generated source code for pages 2 onwards from Selenium into Scrapy? I have tried a few things involving:
hxs = HtmlXPathSelector(response)
but I don't really understand it. Any help would be much appreciated!
Thanks
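(For reference, the approach being asked about, building a fresh selector from Selenium's page source after each click instead of reusing the original response, might look roughly like the untested sketch below. It reuses the selectors and the 10-second sleep from the code above; Selector is the modern name for what HtmlXPathSelector used to provide.)

from scrapy.selector import Selector

def parse(self, response):
    self.driver.get(response.url)
    while True:
        # Build a selector from whatever the browser currently renders,
        # instead of the original first-page response.
        sel = Selector(text=self.driver.page_source)
        for href in sel.css('div.productlist h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_product_page)
        try:
            next_btn = self.driver.find_element_by_css_selector("li.active a#page_right div")
            next_btn.click()
            time.sleep(10)  # crude wait for the JS-refreshed product list
        except Exception:
            break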
Related
I'm trying to write a web crawler using VS Code and encountered an error. Below is my code.
class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass
May I know what's wrong?
Thanks.
Try importing scrapy first:
import scrapy
This should work; you are missing the scrapy import that provides the parent class of spider1 (scrapy.Spider):
import scrapy

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass
I would like to add a proxy to my script.
How do I do that? Do I have to configure it in Selenium or in Scrapy?
I think Scrapy is making the initial request, so it would make sense to handle it in Scrapy. But what exactly do I have to do?
Can you recommend any proxy list that works reliably?
This is my current script:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import re
import csv
from time import sleep


class PostsSpider(Spider):
    name = 'posts'
    allowed_domains = ['xyz']
    start_urls = ('xyz',)

    def parse(self, response):
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
            self.driver.get(url)
            try:
                self.driver.find_element_by_id('currentTab').click()
                sleep(3)
                self.logger.info('Sleeping for 3 sec.')
                self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
                sleep(7)
                self.logger.info('Sleeping for 7 sec.')
            except NoSuchElementException:
                self.logger.info('Blog does not exist anymore')
            while True:
                try:
                    element = self.driver.find_element_by_id('last_item')
                    self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                    sleep(3)
                    self.driver.find_element_by_id('last_item').click()
                    sleep(7)
                except NoSuchElementException:
                    self.logger.info('No more tipps')
                    sel = Selector(text=self.driver.page_source)
                    allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
                    for post in allposts:
                        username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                        publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
                        yield {'Username': username,
                               'Publish date': publish_date}
                    self.driver.close()
                    break
One simple way is to set the http_proxy and https_proxy environment variables.
You could set them in your environment before starting your script, or maybe add this at the beginning of your script:
import os
os.environ['http_proxy'] = 'http://my/proxy'
os.environ['https_proxy'] = 'http://my/proxy'
For a list of publicly available proxies, you will find plenty if you just search Google.
You should also read about Scrapy's HttpProxyMiddleware; its documentation explains how to use such proxies.
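If you want per-request control rather than environment variables, Scrapy's built-in HttpProxyMiddleware also honours request.meta['proxy']. A minimal sketch with a placeholder proxy URL (note that requests made by the Selenium-driven Chrome in the script above bypass Scrapy's middlewares, so the browser itself would need its own proxy configuration):

import scrapy


class ProxiedSpider(scrapy.Spider):
    name = 'proxied'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            # HttpProxyMiddleware picks the proxy up from request.meta.
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://user:pass@my.proxy.host:8080'},  # placeholder proxy
            )

    def parse(self, response):
        self.logger.info('Fetched %s via proxy', response.url)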
I would like to get the link located in the href attribute of an <a> element. The URL is: https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00
I'm searching for the href of this element:
<a class="car_owner_section" href="/users/2643273" rel="nofollow"></a>
When I enter response.css('a.car_owner_section::attr(href)').get() in the terminal I just get nothing, but the element exists, even when I inspect view(response).
Does anybody have a clue about this issue?
The site seems to load its content via JavaScript; using Splash works perfectly.
Here is the code:
import scrapy
from scrapy_splash import SplashRequest


class ScrapyOverflow1(scrapy.Spider):
    name = "overflow1"

    def start_requests(self):
        url = 'https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00'
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        links = response.xpath('//a[@class="car_owner_section"]/@href').extract()
        print(links)
To use Splash, install Splash and scrapy-splash, and run sudo docker run -p 8050:8050 scrapinghub/splash
before running the spider. A good article on installing and running Splash will also cover adding the scrapy-splash middlewares to settings.py.
Running the spider then prints the extracted link.
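For completeness, the settings.py additions that scrapy-splash needs (as described in its README) look roughly like this; SPLASH_URL assumes the Docker container above is running locally:

# settings.py additions for scrapy-splash
SPLASH_URL = 'http://localhost:8050'  # assumes the local Docker container above

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'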
I'm a beginner. When I crawl, there is no error, but Scrapy never enters parse() with response.url; the browser just shows an empty page titled "data;".
How can I get it to load response.url?
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
import time
from result_crawler.items import RESULT_Item


class RESULT_Spider(scrapy.Spider):
    name = "EPL"
    allowed_domains = ["premierleague.com"]
    starts_urls = ["https://www.premierleague.com/match/38567"]

    def __init__(self):
        scrapy.Spider.__init__(self)
        self.browser = webdriver.Chrome("/users/germpark/chromedriver")

    def parse(self, response):
        self.browser.get(response.url)
        time.sleep(5)
        .
        .
        .
I want to open https://www.premierleague.com/match/38567, but no result is produced.
The correct attribute name is start_urls, not starts_urls. Because of the incorrect name, Scrapy does not detect any start pages.
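In other words, only that one attribute in the spider above needs renaming; a minimal sketch:

class RESULT_Spider(scrapy.Spider):
    name = "EPL"
    allowed_domains = ["premierleague.com"]
    # Scrapy only reads an attribute named exactly `start_urls`;
    # `starts_urls` is silently ignored, so no requests ever reach parse().
    start_urls = ["https://www.premierleague.com/match/38567"]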
I am using Scrapy to extract the information in a tag of web pages and then save those web pages as HTML files. For example, http://www.austlii.edu.au/au/cases/cth/HCA/1945/ has some pages related to judicial cases. I want to go to each link and save only the content related to the particular judicial case as an HTML page, e.g. go to http://www.austlii.edu.au/au/cases/cth/HCA/1945/1.html and save the information related to that case.
Is there a way to do this recursively in Scrapy and save the content as HTML pages?
Yes, you can do it with Scrapy; Link Extractors will help:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class AustliiSpider(CrawlSpider):
    name = "austlii"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+.html"), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # do whatever with the html content (response.body variable)
Hope that helps.
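Note that the scrapy.contrib imports and SgmlLinkExtractor above come from old Scrapy versions and have since been removed. On current Scrapy, an equivalent sketch that also writes each case page to an HTML file (the file-naming scheme is just an illustration) could look like this:

from pathlib import Path

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AustliiSpider(CrawlSpider):
    name = "austlii"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        Rule(LinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+\.html"),
             follow=True, callback="parse_item"),
    )

    def parse_item(self, response):
        # Save the raw HTML of each case page, named after the last URL segment.
        filename = response.url.rstrip("/").split("/")[-1]
        Path(filename).write_bytes(response.body)
        self.logger.info("Saved %s", filename)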