Scrapy - CSS selector issue

I would like to get the link located in the href attribute of an <a> element. The URL is: https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00
I'm searching for the href of this element:
<a class="car_owner_section" href="/users/2643273" rel="nofollow"></a>
When I enter response.css('a.car_owner_section::attr(href)').get() in the Scrapy shell I get nothing, even though the element exists when I inspect view(response).
Does anybody have a clue about this issue?

The site seems to load its content with JavaScript; using Splash works perfectly.
Here is the code:
import scrapy
from scrapy_splash import SplashRequest

class ScrapyOverflow1(scrapy.Spider):
    name = "overflow1"

    def start_requests(self):
        url = 'https://www.drivy.com/location-voiture/antwerpen/bmw-serie-1-477429?address=Gare+d%27Anvers-Central&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-20&end_time=18%3A30&latitude=51.2162&longitude=4.4209&start_date=2019-05-20&start_time=06%3A00'
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        # note: the XPath must use @class and @href, not #class and #href
        links = response.xpath('//a[@class="car_owner_section"]/@href').extract()
        print(links)
To use Splash, install splash and scrapy-splash, and run sudo docker run -p 8050:8050 scrapinghub/splash before running the spider. There is a great article on installing and running Splash (article on scrapy-splash), and you also need to add the scrapy-splash middlewares to settings.py (also covered in the article).
The result is as above
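For reference, here is a sketch of the settings.py configuration described in the scrapy-splash README (the SPLASH_URL below assumes the local Docker container started above):

# settings.py
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'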

Related

Scrapy not defined

I'm trying to write a web crawler using VS Code and encountered this error. Below is my code.
class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass
May I know what's wrong?
Thanks.
Try to import scrapy first:
import scrapy
This should work; you are missing the scrapy import, which is needed because scrapy.Spider is the parent class of spider1:
import scrapy

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass

scrapy not entering parse(response.url)

I'm a beginner. When I crawl, there is no error, but Scrapy does not open response.url in parse. That is, the browser only shows an empty page titled "data;".
How do I get it to open response.url?
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
import time
from result_crawler.items import RESULT_Item

class RESULT_Spider(scrapy.Spider):
    name = "EPL"
    allowed_domains = ["premierleague.com"]
    starts_urls = ["https://www.premierleague.com/match/38567"]

    def __init__(self):
        scrapy.Spider.__init__(self)
        self.browser = webdriver.Chrome("/users/germpark/chromedriver")

    def parse(self, response):
        self.browser.get(response.url)
        time.sleep(5)
        .
        .
        .
I want to open https://www.premierleague.com/match/38567, but the result does not appear.
The correct attribute name is start_urls, not starts_urls. Because of the incorrect attribute name, Scrapy does not detect any start pages.
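As a minimal sketch, only the attribute name needs to change; everything else in the spider stays the same:

class RESULT_Spider(scrapy.Spider):
    name = "EPL"
    allowed_domains = ["premierleague.com"]
    # start_urls (not starts_urls) is the attribute Scrapy reads for the first requests
    start_urls = ["https://www.premierleague.com/match/38567"]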

How to get Selenium to pass JavaScript-generated source code to Scrapy?

I have built a basic Scrapy spider which scrapes a site's product category page, opens all the individual product pages, and scrapes some product information. When there are multiple pages for one category, the site uses JavaScript to refresh the product list (the URL does not change).
I am trying to use Selenium to access the JS generated pages.
import time
import scrapy
from myscraper.items import myscraperItem
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

class websiteSpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/cat1',
    )

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(10)

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_css_selector("li.active a#page_right div")
            try:
                for href in response.css('div.productlist h3 a::attr(href)'):
                    url = response.urljoin(href.extract())
                    yield scrapy.Request(url, callback=self.parse_product_page)
                time.sleep(10)
                next.click()
            except:
                break

    def parse_product_page(self, response):
        ...
When I run this I only scrape products from the first page. How can I push the newly generated source code for pages 2 onwards from Selenium into Scrapy? I have tried a few things involving:
hxs = HtmlXPathSelector(response)
But I don't really understand it; any help would be much appreciated!
Thanks
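For what it's worth, one common pattern (a sketch reusing the selectors from the question, not the original poster's final code) is to replace parse() so it builds a Scrapy Selector from self.driver.page_source after each click; the CSS selectors then run against the JavaScript-rendered markup instead of the original response:

from scrapy.selector import Selector

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            # parse the JS-rendered source that Selenium currently sees
            sel = Selector(text=self.driver.page_source)
            for href in sel.css('div.productlist h3 a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_product_page)
            try:
                next = self.driver.find_element_by_css_selector("li.active a#page_right div")
                next.click()
                time.sleep(10)  # give the JS pagination time to refresh the product list
            except Exception:
                break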

Scrapy doesn't crawl over the site

I have fallen into a common trap and can't get out of it: my Scrapy spider is very lazy, so it only parses the start_urls. Code below:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Field
from scrapy.selector import Selector

class HabraPostSpider(scrapy.Spider):
    name = 'habrapost'
    allowed_domains = ['habrahabr.ru']
    start_urls = ['https://habrahabr.ru/interesting/']

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

    rules = (Rule(LinkExtractor()),
             Rule(LinkExtractor(allow=('/post/'),), callback='parse_post', follow=True))
I would be very happy if anybody could tell me how to fix my spider!
Your English is quite broken, but reading between the lines, what I understand is that you want the crawler to go into every link it sees.
For that you have to use CrawlSpider instead of Spider:
class HabraPostSpider(scrapy.spiders.CrawlSpider):
Check the documentation.
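A minimal sketch of what that change could look like, reusing the rules from the question (parse_post is an assumed callback name you would still have to fill in; the more specific /post/ rule is listed first so it actually gets matched, and a CrawlSpider must not override parse):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HabraPostSpider(CrawlSpider):
    name = 'habrapost'
    allowed_domains = ['habrahabr.ru']
    start_urls = ['https://habrahabr.ru/interesting/']

    rules = (
        # more specific rule first: /post/ pages go to parse_post
        Rule(LinkExtractor(allow=('/post/',)), callback='parse_post', follow=True),
        # follow every other link to keep discovering pages
        Rule(LinkExtractor(), follow=True),
    )

    def parse_post(self, response):
        self.logger.info('A response from %s just arrived!', response.url)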

Scrapy - Recursively scrape webpages and save content as HTML files

I am using Scrapy to extract the information in a tag of web pages and then save those webpages as HTML files. E.g. http://www.austlii.edu.au/au/cases/cth/HCA/1945/ has some webpages related to judicial cases. I want to go to each link and save only the content related to the particular judicial case as an HTML page, e.g. go to http://www.austlii.edu.au/au/cases/cth/HCA/1945/1.html and then save the information related to that case.
Is there a way to do this recursively in Scrapy and save the content as HTML pages?
Yes, you can do it with Scrapy; Link Extractors will help:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class AustliiSpider(CrawlSpider):
    name = "austlii"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+.html"), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # do whatever with the html content (response.body variable)
Hope that helps.
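Since the question also asks about saving each case as an HTML file, here is one way parse_item could be filled in; this is only a sketch with a made-up filename scheme that writes the raw response body to disk:

import os

    def parse_item(self, response):
        # derive a filename from the last URL segment, e.g. "1.html" (hypothetical scheme)
        filename = response.url.rstrip('/').split('/')[-1]
        os.makedirs('cases', exist_ok=True)
        with open(os.path.join('cases', filename), 'wb') as f:
            f.write(response.body)  # response.body is bytes, hence 'wb'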