Scrapy FormRequest redirect not wanted link - scrapy

I followed the basic Scrapy Login. It always works, but in this case, I had some problems. The FormRequest.from_response didn't request the https://www.crowdfunder.com/user/validateLogin, instead it always sent payload to https://www.crowdfunder.com/user/signup. I tried directly request the validateLogin with payload, but it responded with 404 Error. Any idea to help me solve this problem? Thanks in advance!!!
class CrowdfunderSpider(InitSpider):
name = "crowdfunder"
allowed_domains = ["crowdfunder.com"]
start_urls = [
'http://www.crowdfunder.com/',
]
login_page = 'https://www.crowdfunder.com/user/login/'
payload = {}
def init_request(self):
"""This function is called before crawling starts."""
return scrapy.Request(url=self.login_page, callback=self.login)
def login(self, response):
"""Generate a login request."""
self.payload = {'email': 'my_email',
'password': 'my_password'}
# scrapy login
return scrapy.FormRequest.from_response(response, formdata=self.payload, callback=self.check_login_response)
def check_login_response(self, response):
"""Check the response returned by a login request to see if we are
successfully logged in.
"""
if 'https://www.crowdfunder.com/user/settings' == response.url:
self.log("Successfully logged in. :) :) :)")
# start the crawling
return self.initialized()
else:
# login fail
self.log("login failed :( :( :(")
Here is the payload and request link sent by clicking login in browser:
payload and request url sent by clicking login button
Here is the log info:
2016-10-21 21:56:21 [scrapy] INFO: Scrapy 1.1.0 started (bot: crowdfunder_crawl)
2016-10-21 21:56:21 [scrapy] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'NEWSPIDER_MODULE': 'crowdfunder_crawl.spiders', 'SPIDER_MODULES': ['crowdfunder_crawl.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'crowdfunder_crawl'}
2016-10-21 21:56:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-21 21:56:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-21 21:56:21 [scrapy] INFO: Spider opened
2016-10-21 21:56:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-21 21:56:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-21 21:56:21 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/robots.txt> (referer: None)
2016-10-21 21:56:21 [scrapy] DEBUG: Redirecting (301) to <GET http://www.crowdfunder.com/user/login> from <GET https://www.crowdfunder.com/user/login/>
2016-10-21 21:56:22 [scrapy] DEBUG: Redirecting (301) to <GET https://www.crowdfunder.com/user/login> from <GET http://www.crowdfunder.com/user/login>
2016-10-21 21:56:22 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/user/login> (referer: None)
2016-10-21 21:56:23 [scrapy] DEBUG: Crawled (200) <POST https://www.crowdfunder.com/user/signup> (referer: https://www.crowdfunder.com/user/login)
2016-10-21 21:56:23 [crowdfunder] DEBUG: login failed :( :( :(
2016-10-21 21:56:23 [scrapy] INFO: Closing spider (finished)
2016-10-21 21:56:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1569,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 16313,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 22, 4, 56, 23, 232493),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 10, 22, 4, 56, 21, 180030)}
2016-10-21 21:56:23 [scrapy] INFO: Spider closed (finished)

FormRequest.from_response(response) by default uses the first form it finds. If you check what forms the page has you'd see:
In : response.xpath("//form")
Out:
[<Selector xpath='//form' data='<form action="/user/signup" method="post'>,
<Selector xpath='//form' data='<form action="/user/login" method="POST"'>,
<Selector xpath='//form' data='<form action="/user/login" method="post"'>]
So the form you are looking for is not 1st one. The way to fix it is to use one of many from_response method parameters to specify which form to use.
Using formxpath is the most flexible and my personal favorite:
In : FormRequest.from_response(response, formxpath='//*[contains(#action,"login")]')
Out: <POST https://www.crowdfunder.com/user/login>

Related

My scrapy code is seems okey at dev tools but not working

So i tried to scrape this website using scrapy and scrapy-selenium for exercise.Iam trying to get names,prices etc. My xpath expression seems okey at dev tool on chrome.But it isnt working at my script.I dont know what i am doing wrong? Can u please explain that why my xpath expression not working?
import scrapy
from scrapy_selenium import SeleniumRequest
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
class ComputerdealsSpider(scrapy.Spider):
name = 'computerdeals'
def start_requests(self):
yield SeleniumRequest(
url = 'https://slickdeals.net/computer-deals',
wait_time=3,
callback = self.parse
)
def parse(self, response):
products = response.xpath("//ul[#class ='bp-p-filterGrid_items']/li")
for product in products:
yield{
'price' : product.xpath(".//div/span[#class='bp-c-card_subtitle']/text()").get(),
}
OUTPUT
2022-11-20 13:59:59 [scrapy.utils.log] INFO: Scrapy 2.7.0 started (bot: silkdeals)
2022-11-20 13:59:59 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Windows-10-10.0.19044-SP0
2022-11-20 13:59:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'silkdeals',
'NEWSPIDER_MODULE': 'silkdeals.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['silkdeals.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-20 13:59:59 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-20 13:59:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-20 13:59:59 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-20 13:59:59 [scrapy.extensions.telnet] INFO: Telnet Password: d3adcd8a4caad669
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-11-20 13:59:59 [scrapy.middleware] WARNING: Disabled SeleniumMiddleware: SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE_PATH must be set
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-20 13:59:59 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-20 13:59:59 [scrapy.core.engine] INFO: Spider opened
2022-11-20 13:59:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-20 13:59:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-20 14:00:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://slickdeals.net/robots.txt> (referer: None)
2022-11-20 14:00:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://slickdeals.net/computer-deals/> from <GET https://slickdeals.net/computer-deals>
2022-11-20 14:00:01 [filelock] DEBUG: Attempting to acquire lock 2668401413376 on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Lock 2668401413376 acquired on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Attempting to release lock 2668401413376 on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [filelock] DEBUG: Lock 2668401413376 released on C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-11-20 14:00:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://slickdeals.net/computer-deals/> (referer: None)
2022-11-20 14:00:01 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-20 14:00:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 96185,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 2.098319,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 20, 11, 0, 1, 689826),
'httpcompression/response_bytes': 617590,
'httpcompression/response_count': 2,
'log_count/DEBUG': 10,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2022, 11, 20, 10, 59, 59, 591507)}
2022-11-20 14:00:01 [scrapy.core.engine] INFO: Spider closed (finished)

Scrapy - Multiple resquest by looping JSON file

I'm trying to get the latitude and longitude of different cities. The name of cities are stored in a JSON file. Here is my code:
import scrapy
import json
with open('C:/Users/coppe/tutorial/cities.json') as json_file:
cities = json.load(json_file)
class communes_spider(scrapy.Spider):
name = "geo"
start_urls = ['https://www.latlong.net/']
def parse(self, response):
for city in cities:
return scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)
def get_geo(self, response):
yield {response.css('span.coordinatetxt::text').get()}
The objective is to loop through the JSON file and for each city send a resquest to a form from the url "https://www.latlong.net/". However nothing is prompting from this request. Is this a bad way to make loop ? Should I treat the JSON file inside the class ?
Log:
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17763-SP0
2019-04-01 16:27:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-01 16:27:17 [scrapy.core.engine] INFO: Spider opened
2019-04-01 16:27:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-01 16:27:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/robots.txt> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.latlong.net/> (referer: https://www.latlong.net/)
2019-04-01 16:27:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.latlong.net/>
{'latlong': '0,0'}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:27:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 874,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 29252,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 14, 27, 18, 923987),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 1, 14, 27, 17, 773592)}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Spider closed (finished)
Your parse method should be a generator, so you need to use yield instead of return on the for loop, otherwise you'll finish the loop on the first iteration. Furthermore, get_get method is returning a set, but it must return Request, BaseItem, dict or None.
I suggest changing the code as follow:
import scrapy
import json
with open('C:/Users/coppe/tutorial/cities.json') as json_file:
cities = json.load(json_file)
class communes_spider(scrapy.Spider):
name = "geo"
start_urls = ['https://www.latlong.net/']
def parse(self, response):
for city in cities:
yield scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)
def get_geo(self, response):
yield {'coord': response.css('span.coordinatetxt::text').get()}
https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/

Scrapy doesnt get products from an ecommerce site

I try to learn Scrapy and I manage to crawl some sites and other I fail for example:
I try to crawl: http://www.polyhousestore.com/
I created a test spider that will get all the products in the page
http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60
When I run the spider I get that it didn’t find any product.
Can someone help me understand what am I doing wrong, is it related to the CSS ::before and ::after ?
And how can I make it to work?
Spider code(that fail to get the products in the page)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
class PolySpider(scrapy.Spider):
name = "poly"
allowed_domains = ["polyhousestore.com"]
start_urls = (
'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60',
)
def parse(self, response):
sel = Selector(response)
products = sel.xpath('/html/body/div[4]/div/div[5]/div/div/div/div/div[2]/div[3]/div[2]/div')
items = []
if not products:
print '------------- No products from sel.xpath'
else:
print '------------- Found products ' + str(len(products))
Command line that I run and the output
D:\scrapyProj\cmdProj>scrapy crawl poly
2016-01-19 10:23:16 [scrapy] INFO: Scrapy 1.0.3 started (bot: cmdProj)
2016-01-19 10:23:16 [scrapy] INFO: Optional features available: ssl, http11
2016-01-19 10:23:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cm
dProj.spiders', 'SPIDER_MODULES': ['cmdProj.spiders'], 'BOT_NAME': 'cmdProj'}
2016-01-19 10:23:17 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsol
e, LogStats, CoreStats, SpiderState
2016-01-19 10:23:17 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-19 10:23:17 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-19 10:23:17 [scrapy] INFO: Enabled item pipelines:
2016-01-19 10:23:17 [scrapy] INFO: Spider opened
2016-01-19 10:23:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2016-01-19 10:23:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-19 10:23:17 [scrapy] DEBUG: Crawled (200) <GET http://www.polyhousestore
.com/catalogsearch/result/?cat=&q=lc+60> (referer: None)
------------- No products from sel.xpath
2016-01-19 10:23:18 [scrapy] INFO: Closing spider (finished)
2016-01-19 10:23:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 254,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16091,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 19, 8, 23, 18, 53000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 1, 19, 8, 23, 17, 376000)}
2016-01-19 10:23:18 [scrapy] INFO: Spider closed (finished)
Thanks for the help
When I look at the URL given in your question with Chrome I see only 2 div tags in the body of the site. This means scrapy sees these div tags too. However you want to access the 4th, which does not exist so your search returns no elements.
If I open a scrapy shell and execute a count on the div tags on the body I get 2 in return:
[<Selector xpath='count(/html/body/div)' data=u'2.0'>]
This above is the same as
len(response.xpath('/html/body/div'))
All this means you have to modify your query to get all the products. If you need the 4 elements in the site, try:
response.xpath('//div[#class="item-inner"]')
As you can see you do not need to wrap the response into a selector with scrapy anymore.

How to scrape data in an authenticated session within a dynamic page?

I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/) and taking this post as reference for dynamic webpages. This is the code:
class MySpider(CrawlSpider):
login_user = 'myusername'
login_pass = 'mypassword'
name = "tv"
allowed_domains = []
start_urls = ["https://twitter.com/Acrocephalus/followers"]
rules = (
Rule(SgmlLinkExtractor(allow=('https://twitter\.com/.*')), callback='parse_items', follow=True),
)
def parse(self, response):
args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
return FormRequest(url, method=method, formdata=args, callback=self.after_login)
def after_login(self, response):
# you are logged in here
def __init__(self):
CrawlSpider.__init__(self)
# use any browser you wish
self.browser = webdriver.Firefox()
def __del__(self):
self.browser.close()
def parse_item(self, response):
item = TwitterVizItem()
self.browser.get(response.url)
# let JavaScript Load
time.sleep(3)
# scrape dynamically generated HTML
hxs = Selector(text=self.browser.page_source)
item['Follower'] = hxs.select('(//span[#class="u-linkComplex-target"])[position()>2]').extract()
item['Follows'] = hxs.select('(//span[#class="u-linkComplex-target"])[position()=1]').extract()
return item
When I run it, I get this output:
2015-07-22 18:46:38 [scrapy] INFO: Scrapy 1.0.1.post1+g5b8c9e5 started (bot: TwitterViz)
2015-07-22 18:46:38 [scrapy] INFO: Optional features available: ssl, http11
2015-07-22 18:46:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TwitterViz.spiders', 'SPIDER_MODULES': ['TwitterViz.spiders'], 'BOT_NAME': 'TwitterViz'}
2015-07-22 18:46:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-22 18:46:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-22 18:46:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-22 18:46:38 [scrapy] INFO: Enabled item pipelines:
2015-07-22 18:46:38 [scrapy] INFO: Spider opened
2015-07-22 18:46:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-22 18:46:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> from <GET https://twitter.com/Acrocephalus/followers>
2015-07-22 18:46:39 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> (referer: None)
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/Acrocephalus/followers> from <POST https://twitter.com/sessions>
2015-07-22 18:46:40 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/Acrocephalus/followers> (referer: https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers)
2015-07-22 18:46:40 [scrapy] INFO: Closing spider (finished)
2015-07-22 18:46:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2506,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 63188,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 22, 16, 46, 40, 538785),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 7, 22, 16, 46, 38, 926487)}
2015-07-22 18:46:40 [scrapy] INFO: Spider closed (finished)
From the ouput I understand that it logs in properly. However, it doesn't scrape anything. I've tested the XPath with Firefox's XPath Checker and they seem to work properly. Can anyone have a look to see if we can find what is wrong?

Scrapy crawl resume does not crawl anything and just finishes

I start a crawl with a CrawlSpider Derived class, and pause it with Ctrl+C. When I execute the command again to resume it, it does not continue.
My start and resume command:
scrapy crawl mycrawler -s JOBDIR=crawls/test5_mycrawl
Scrapy creates the folder. The permissions are 777.
When I resume the crawl, it just outputs:
/home/adminuser/.virtualenvs/rg_harvest/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL tosupport it, Twisted can perform only rudimentary TLS client hostnameverification. Many valid certificate/hostname mappings may be rejected.
verifyHostname, VerificationError = _selectVerifyImplementation()
2014-11-21 11:05:10-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: rg_harvest_scrapy)
2014-11-21 11:05:10-0500 [scrapy] INFO: Optional features available: ssl, http11, django
2014-11-21 11:05:10-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'rg_harvest_scrapy.spiders', 'SPIDER_MODULES': ['rg_harvest_scrapy.spiders'], 'BOT_NAME': 'rg_harvest_scrapy'}
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled item pipelines: ValidateMandatory, TypeConversion, ValidateRange, ValidateLogic, RestegourmetImagesPipeline, SaveToDB
2014-11-21 11:05:10-0500 [mycrawler] INFO: Spider opened
2014-11-21 11:05:10-0500 [mycrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-11-21 11:05:10-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-11-21 11:05:10-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-11-21 11:05:10-0500 [mycrawler] DEBUG: Crawled (200) <GET http://eatsmarter.de/suche/rezepte> (referer: None)
2014-11-21 11:05:10-0500 [mycrawler] DEBUG: Filtered duplicate request: <GET http://eatsmarter.de/suche/rezepte?page=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2014-11-21 11:05:10-0500 [mycrawler] INFO: Closing spider (finished)
2014-11-21 11:05:10-0500 [mycrawler] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 225,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 19242,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'dupefilter/filtered': 29,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 11, 21, 16, 5, 10, 733196),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/disk': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/disk': 1,
'start_time': datetime.datetime(2014, 11, 21, 16, 5, 10, 528629)}
I have one start_url. Could this be the reason? My crawler uses one start_url and then follows the pagination by a Rule with a LinkExtractor, and calls the parse item by a specific url format:
My Spider code:
class MyCrawlSpiderBase(CrawlSpider):
name = 'test_spider'
testmode = True
crawl_start = datetime.utcnow().isoformat()
def __init__(self, testmode=True, urls=None, *args, **kwargs):
self.testmode = bool(int(testmode))
super(MyCrawlSpiderBase, self).__init__(*args, **kwargs)
def parse_item(self, response):
# Item Values
l = MyItemLoader(RecipeItem(), response=response)
l.replace_value('url', response.url)
l.replace_value('crawl_start', self.crawl_start)
return l.load_item()
class MyCrawlSpider(MyCrawlSpiderBase):
name = 'example_de'
allowed_domains = ['example.de']
start_urls = [
"http://example.de",
]
rules = (
Rule(
LinkExtractor(
allow=('/search/entry\?page=', )
)
),
Rule(
LinkExtractor(
allow=('/entry/[0-9A-z\-]{3,}$', ),
),
callback='parse_item'
),
)
def parse_item(self, response):
item = super(MyCrawlSpider, self).parse_item(response)
l = MyItemLoader(item=item, response=response)
l.replace_xpath("name", "//h1[#class='fn title']/text()")
(...)
return l.load_item()
Since your URL is always the same, the requests are most likely being filtered.
You can solve this in two ways:
In yoursettings.py file, add this line:
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
This replaces the default RFPDupeFilter with the BaseDupeFilter which will not filter any requests. This my not be what you want if you in fact want to filter out some other requests not relevant to this question.
You can get more involved in the process of creating requests, and create them with the parameter dont_filter=True, which will disable filtering on a per-requests basis. To achieve this, you could remove the start_urls and replace it with a method start_requests() that would yield the requests for parsing. Check out more info in the official documentation.
If you click Ctrl+C twice (force stop) it won't be able to be continued. Click Ctrl+C just once and wait.