scrapy not entering parse(response.url) - selenium

I'm a beginner. When I crawl, there is no error code, but scrapy does not enter response.url in parse. That is, the page is empty page titled "data;"
how to enter the repsonse.url?
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
import time
from result_crawler.items import RESULT_Item
class RESULT_Spider(scrapy.Spider):
name="EPL"
allowed_domains=["premierleague.com"]
starts_urls=["https://www.premierleague.com/match/38567"]
def __init__(self):
scrapy.Spider.__init__(self)
self.browser=webdriver.Chrome("/users/germpark/chromedriver")
def parse(self,response):
self.browser.get(response.url)
time.sleep(5)
.
.
.
I want to enter https://www.premierleague.com/match/38567 but result did not exist.

The correct property name is start_urls not starts_urls. Because of the incorrect property name it does not detect any start pages.

Related

Scrapy not defined

I'm trying to write a web crawler using VSC & encountered the error. Below are my codes.
class spider1(scrapy.Spider):
name = 'Wikipedia'
start_urls = <kbd>['https://en.wikipedia.org/wiki/Battery_(electricity)']</kbd>
def parse(self, response):pass
May I know what's wrong?
Thanks.
try to import scrapy first
import scrapy
This should work, you are missing a scrapy import that is being used to import a parent of Spider1
import scrapy
class spider1(scrapy.Spider):
name = 'Wikipedia'
start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']
def parse(self, response):
pass

How to get Selenium to pass Javascript generated source-code to Scrapy?

I have built a basic Scrapy Spider which scrapes a site's product category page, opens all the individual product pages and scrapes some product information. When there are multiple pages for one category the site uses Javascript to refresh the product list (the URL does not change).
I am trying to use Selenium to access the JS generated pages.
import time
import scrapy
from myscraper.items import myscraperItem
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
class websiteSpider(scrapy.Spider):
name = "myspider"
allowed_domains = ["example.com"]
start_urls = (
'http://www.example.com/cat1',
)
def __init__(self):
self.driver = webdriver.Firefox()
self.driver.implicitly_wait(10)
def parse(self, response):
self.driver.get(response.url)
while True:
next = self.driver.find_element_by_css_selector("li.active a#page_right div")
try:
for href in response.css('div.productlist h3 a::attr(href)'):
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_product_page)
time.sleep(10)
next.click()
except:
break
def parse_product_page(self, response):
...
When I run this I only scrape products from the first page, how can I push the newly generated source code for pages 2 onwards from Selenium into Scrapy? I have tried a few things involving:
hxs = HtmlXPathSelector(response)
But I don't really understand it, any help would be much appreciated!!
Thanks

Scrapy don't crawl over site

I have common trap and can't get rid with it: my Scrapy spider very lazy, so that it is, it can parse only start_urls. Code below:
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Field
from scrapy.selector import Selector
class HabraPostSpider(scrapy.Spider):
name = 'habrapost'
allowed_domains = ['habrahabr.ru']
start_urls = ['https://habrahabr.ru/interesting/']
def parse(self, response):
self.logger.info('A response from %s just arrived!', response.url)
rules = (Rule(LinkExtractor()),
Rule(LinkExtractor(allow=('/post/'),),callback='parse_post',follow= True))
I will be very happy if anybody could say how to fix my spider!
Your english is totally broken but reading between the lines what I understand is that you want the crawler to go into every link it sees.
For that you have to use CrawlSpider instead of Spider
class HabraPostSpider(scrapy.spiders.CrawlSpider)
Check the documentation.

login using post request in scrapy

I want to login rediffmail but error is generate
exceptions.NameError: global name 'FormRequest' is not defined
here my spider code:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from rediffmail.items import RediffmailItem
class MySpider(BaseSpider):
name = 'rediffmail'
allowed_domains = ["rediff.com"]
start_urls = ['https://mail.rediff.com/cgi-bin/login.cgi']
def parse(self, response):
return [FormRequest.from_response(response,formdata={'login': 'XXXX', 'passwd': 'XXXX'},callback=self.after_login)]
def after_login(self, response):
# check login succeed before going on
if "authentication failed" in response.body:
self.logger.error("Login failed")
return
Please check is there any issue in my code. I am beginner in python
You are missing the import of FormRequest. And in your version of scrapy, the FormRequest is in scrapy.http.
So add this line in your import section:
from scrapy.http import FormRequest

Scrapy-Recursively Scrape Webpages and save content as html file

I am using scrapy to extract the information in tag of web pages and then save those webpages as HTML files.Eg http://www.austlii.edu.au/au/cases/cth/HCA/1945/ this site has some webpages related to judicial cases.I want to go to each link and save only the content related to the particular judicial case as an HTML page.eg go to this http://www.austlii.edu.au/au/cases/cth/HCA/1945/1.html and then save information related to case.
Is there a way to do this recursively in scrapy and save content in HTML page
Yes, you can do it with Scrapy, Link Extractors will help:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
class AustliiSpider(CrawlSpider):
name = "austlii"
allowed_domains = ["austlii.edu.au"]
start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
rules = (
Rule(SgmlLinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+.html"), follow=True, callback='parse_item'),
)
def parse_item(self, response):
hxs = HtmlXPathSelector(response)
# do whatever with html content (response.body variable)
Hope that helps.