When trying to scrape a page while passing headers and a body, I get the error shown below.
I tried converting the payload to JSON and to str before sending it, but that doesn't return any results.
Please let me know if anything needs to be changed.
Code
import scrapy


class TestingSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        request_headers = {
            "Host": "host_here",
            "User-Agent": "Mozilla/5.0 20100101 Firefox/46.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0"
        }
        url = "my_url_here"
        payload = {
            "searchargs.approvedFrom.input": "05/18/2017",
            "searchargs.approvedTO.input": "05/18/2017",
            "pagesize": -1
        }
        yield scrapy.Request(url, method="POST", callback=self.parse, headers=request_headers, body=payload)

    def parse(self, response):
        print("-------------------------------came here-------------------------------")
        print(response.body)
Error 1
Traceback (most recent call last):
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/home/suventure/Desktop/suventure-projects/python-projects/scraper_txrrc/scraper_txrrc/spiders/wells_spider.py", line 114, in start_requests
yield scrapy.Request(url, method="POST", callback=self.parse, headers=request_headers, body=payload)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 26, in __init__
self._set_body(body)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 68, in _set_body
self._body = to_bytes(body, self.encoding)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/utils/python.py", line 117, in to_bytes
'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got dict
Error 2: no response is received when the dict is converted to a string and sent in the body
2017-05-19 22:39:38 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scraper_)
2017-05-19 22:39:38 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scraper', 'NEWSPIDER_MODULE': 'scraper_.spiders', 'SPIDER_MODULES': ['scraper_.spiders'], 'ROBOTSTXT_OBEY': True}
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-05-19 22:39:39 [scrapy.core.engine] INFO: Spider opened
2017-05-19 22:39:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-19 22:39:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-19 22:39:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://website_link_here/robots.txt> (referer: None)
2017-05-19 22:39:40 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <POST website_link_here>
2017-05-19 22:39:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-19 22:39:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 258,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 19, 17, 9, 40, 581949),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 5, 19, 17, 9, 39, 332675)}
2017-05-19 22:39:40 [scrapy.core.engine] INFO: Spider closed (finished)
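In settings.py, disable robots.txt checking. The log above shows the POST request being dropped as "Forbidden by robots.txt" before it was ever sent: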
ROBOTSTXT_OBEY = False
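Separately from the robots.txt setting, Error 1 above arises because the request body must be a str or bytes object, as the TypeError says. Here is a minimal sketch of one way to serialize the payload, assuming the server expects a standard URL-encoded form (the question does not confirm this). The names encode_form_body and form_headers are hypothetical and introduced only for illustration.

```python
from urllib.parse import urlencode

# Payload copied from the question; -1 is an int and must become text.
payload = {
    "searchargs.approvedFrom.input": "05/18/2017",
    "searchargs.approvedTO.input": "05/18/2017",
    "pagesize": -1,
}


def encode_form_body(fields):
    """Return the fields as an application/x-www-form-urlencoded string."""
    return urlencode({key: str(value) for key, value in fields.items()})


body = encode_form_body(payload)

# Assumption: the endpoint wants this content type for POSTed forms.
form_headers = {"Content-Type": "application/x-www-form-urlencoded"}

print(body)
```

A string like this could be passed as body= to scrapy.Request together with the extra header. Alternatively, scrapy.FormRequest accepts a formdata dict of string values and does the encoding itself.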
Related
I'm trying to get the latitude and longitude of different cities. The names of the cities are stored in a JSON file. Here is my code:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)


class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            return scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {response.css('span.coordinatetxt::text').get()}
The objective is to loop through the JSON file and, for each city, send a request to the form at "https://www.latlong.net/". However, nothing comes back from this request. Is this a bad way to write the loop? Should I process the JSON file inside the class?
Log:
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17763-SP0
2019-04-01 16:27:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-01 16:27:17 [scrapy.core.engine] INFO: Spider opened
2019-04-01 16:27:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-01 16:27:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/robots.txt> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.latlong.net/> (referer: https://www.latlong.net/)
2019-04-01 16:27:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.latlong.net/>
{'latlong': '0,0'}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:27:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 874,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 29252,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 14, 27, 18, 923987),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 1, 14, 27, 17, 773592)}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Spider closed (finished)
Your parse method should be a generator, so you need to use yield instead of return inside the for loop; otherwise the method exits on the first iteration. Furthermore, the get_geo method is returning a set, but it must return a Request, BaseItem, dict or None.
I suggest changing the code as follows:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)


class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            yield scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {'coord': response.css('span.coordinatetxt::text').get()}
https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/
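The difference between return and yield inside a loop can be seen without Scrapy at all. The sketch below uses plain Python; the function names, city names and coordinate string are made-up sample values, not data from the question.

```python
def first_only(cities):
    # `return` ends the function during the first iteration,
    # so only one city is ever processed.
    for city in cities:
        return {"place": city["city"]}


def every_city(cities):
    # `yield` turns the function into a generator that produces
    # one value per city.
    for city in cities:
        yield {"place": city["city"]}


# Hypothetical sample data standing in for cities.json.
cities = [{"city": "Paris"}, {"city": "Lyon"}, {"city": "Nice"}]

returned = first_only(cities)
yielded = list(every_city(cities))

# Braces around a bare value build a set; a key makes it a dict.
item_as_set = {"48.85,2.35"}
item_as_dict = {"coord": "48.85,2.35"}

print(returned)
print(len(yielded))
print(type(item_as_set).__name__, type(item_as_dict).__name__)
```

The last two assignments mirror the get_geo fix: without a key the braces create a set, which is not one of the accepted item types.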
I have written code that follows links within a web page to extract data and then moves to the next page. It follows the "about" link of each author on quotes.toscrape.com.
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com',]

    def parse(self, response):
        linkto = response.css('div.quote > span > a::attr(href)').extract()
        for links in linkto:
            links = response.urljoin(links)
            yield scrapy.Request(url=links, callback = scrapy.parse_about)
        # extract_first() returns a single href (or None) rather than a list
        nextp = response.css('li.next > a::attr(href)').extract_first()
        if nextp:
            nextp = response.urljoin(nextp)
            yield scrapy.Request(url=nextp, callback=self.parse)

    def parse_about(self, response):
        yield {
            'date_of_birth': response.css('span.author-born-date::text').extract(),
            'author': response.css('h3.author-title::text').extract(),
        }
I executed in the command prompt:
scrapy crawl test -o test.csv
but the results I got:
2019-03-20 16:36:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotestoscrape)
2019-03-20 16:36:03 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 17.5.0, Python 2.7.15 |Anaconda, Inc.| (default, Nov 13 2018, 17:33:26) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1 11 Sep 2018), cryptography 2.5, Platform Windows-10-10.0.17134
2019-03-20 16:36:03 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotestoscrape.spiders', 'SPIDER_MODULES': ['quotestoscrape.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'quotestoscrape'}
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-20 16:36:03 [scrapy.core.engine] INFO: Spider opened
2019-03-20 16:36:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-20 16:36:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-03-20 16:36:04 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com> (referer: None)
Traceback (most recent call last):
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\quotestoscrape\quotestoscrape\spiders\QuoteTestSpider.py", line 13, in parse
yield scrapy.Request(url=links, callback = scrapy.parse_about)
AttributeError: 'module' object has no attribute 'parse_about'
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-20 16:36:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 20, 21, 36, 4, 41000),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2019, 3, 20, 21, 36, 3, 468000)}
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Spider closed (finished)
And the CSV file it wrote is empty.
Please let me know what I am doing wrong.
According to your log, parse_about is never called because you pass scrapy.parse_about as the callback instead of the spider's own method, self.parse_about:
....
        for links in linkto:
            links = response.urljoin(links)
            yield scrapy.Request(url=links, callback=self.parse_about)
Because the spider doesn't scrape any data, the resulting CSV file is empty.
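The AttributeError comes from ordinary Python attribute lookup: the method lives on the spider instance, not on the imported module. The following sketch reproduces that with the standard library only. Both the library module object and the MiniSpider class are hypothetical stand-ins, not part of Scrapy.

```python
import types

# Hypothetical stand-in for an imported module such as `scrapy`.
library = types.ModuleType("library")


class MiniSpider:
    """Hypothetical class playing the role of the spider."""

    def parse(self):
        # The callback is looked up on the instance via `self`.
        return self.parse_about

    def parse_about(self, response):
        return {"author": response}


spider = MiniSpider()

try:
    library.parse_about  # the module has no such attribute
    lookup_error = None
except AttributeError as exc:
    lookup_error = str(exc)

callback = spider.parse()
result = callback("Albert Einstein")

print(lookup_error)
print(result)
```

Because self.parse_about is a bound method, the framework can later call it with just the response argument.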
I was curious to know whether Google authentication could be achieved via Scrapy. I tried the following code snippet.
My code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import FormRequest, Request
import logging
import json


class LoginSpider(scrapy.Spider):
    name = 'hello'
    start_urls = ['https://accounts.google.com']
    handle_httpstatus_list = [404, 405]

    def parse(self, response):
        print('inside parse')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'abc@gmail.com'},
                                          callback=self.log_password)]

    def log_password(self, response):
        print('inside log_password')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        print('inside after_login')
        print(response.url)
        # response.body is bytes on Python 3, so compare against the decoded text
        if "authentication failed" in response.text:
            self.log("Login failed", level=logging.ERROR)
            return
        else:
            # We've successfully authenticated, let's have some fun!
            print("Login Successful!!")


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'DOWNLOAD_DELAY': 0.2,
        'HANDLE_HTTPSTATUS_ALL': True
    })
    process.crawl(LoginSpider)
    process.start()
But I get the following output when I run the script.
Output
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.4 (default, Mar 22 2018, 14:05:57) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Darwin-15.6.0-x86_64-i386-64bit
2018-08-15 10:38:05 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 0.2, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-08-15 10:38:05 [scrapy.core.engine] INFO: Spider opened
2018-08-15 10:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-15 10:38:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-15 10:38:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ManageAccount> from <GET https://accounts.google.com>
2018-08-15 10:38:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> from <GET https://accounts.google.com/ManageAccount>
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> (referer: None)
**inside parse**
https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (405) <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
**inside log_password**
https://accounts.google.com/
2018-08-15 10:38:10 [scrapy.core.scraper] ERROR: Spider error processing <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
Traceback (most recent call last):
File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "google_login.py", line 79, in log_password
callback=self.after_login)]
File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 48, in from_response
form = _get_form(response, formname, formid, formnumber, formxpath)
File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 77, in _get_form
raise ValueError("No <form> element found in %s" % response)
ValueError: No <form> element found in <405 https://accounts.google.com/>
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-15 10:38:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 1810, 'downloader/request_count': 4, 'downloader/request_method_count/GET': 3, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 357598, 'downloader/response_count': 4, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/302': 2, 'downloader/response_status_count/405': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2018, 8, 15, 4, 53, 10, 164911), 'log_count/DEBUG': 5, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'memusage/max': 41132032, 'memusage/startup': 41132032, 'request_depth_max': 1, 'response_received_count': 2, 'scheduler/dequeued': 4, 'scheduler/dequeued/memory': 4, 'scheduler/enqueued': 4, 'scheduler/enqueued/memory': 4, 'spider_exceptions/ValueError': 1, 'start_time': datetime.datetime(2018, 8, 15, 4, 53, 5, 463699)}
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Spider closed (finished)
Any help would be appreciated
Error 405 means that the URL doesn't accept the HTTP method used, in your case the POST generated in parse.
Google's default login page is much more complex than a simple POST form; it relies heavily on JS and Ajax under the hood. To log in using Scrapy you have to use https://accounts.google.com/ServiceLogin?nojavascript=1 as the start URL instead.
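To make the status code concrete, here is a small sketch using only Python's standard library. It looks up the official reason phrase for 405 and shows a guard that could run before trying to fill in a form. The helpers describe_status and is_success are hypothetical names, and treating only 2xx responses as worth parsing is an assumption made for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code with its standard reason phrase."""
    status = HTTPStatus(code)
    return "%d %s" % (status.value, status.phrase)


def is_success(code):
    """Hypothetical guard: only 2xx responses are treated as usable pages."""
    return 200 <= code < 300


print(describe_status(405))
print(is_success(405))
print(is_success(200))
```

In the spider above, checking response.status this way at the top of log_password would stop the callback before FormRequest.from_response raises the "No <form> element found" ValueError on the 405 page.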
I'm following the tutorial for Scrapy.
I used this code from the tutorial:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self,response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename,'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
When I then run the command scrapy crawl quotes I get the following output:
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWS
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-14 02:19:55 [scrapy.core.engine] INFO: Spider opened
2017-05-14 02:19:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped
2017-05-14 02:19:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/ro
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-14 02:19:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1121,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 6956,
'downloader/response_count': 5,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 14, 0, 19, 56, 125822),
'log_count/DEBUG': 6,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/NotImplementedError': 2,
'start_time': datetime.datetime(2017, 5, 14, 0, 19, 55, 659206)}
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Spider closed (finished)
What is going wrong?
This error means you did not implement the parse method. But according to your post you did, so it could be an indentation error. Your code should look like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse must be indented at the same level as start_requests
    def parse(self,response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        # response.body is bytes, so open the file in binary mode
        with open(filename,'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
I tested it and it works.
Shouldn't the line
page = response.url.split("/")[-2]
be
page = response.url.split("/")[-1]
as it now looks like you are selecting the word "page" when you want a number?
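Which index is right depends on whether the final URL ends in a slash. The log shows both start URLs being redirected (301), but the targets are truncated, so the trailing-slash form used below is an assumption. This plain-Python sketch compares the candidates; page_from_url is a hypothetical helper that works for either form.

```python
with_slash = "http://quotes.toscrape.com/page/1/"
without_slash = "http://quotes.toscrape.com/page/1"

# With a trailing slash the last piece is empty, so [-2] is the number.
print(with_slash.split("/")[-2], repr(with_slash.split("/")[-1]))

# Without it, [-2] is the word "page" and [-1] is the number.
print(without_slash.split("/")[-2], without_slash.split("/")[-1])

# The misplaced quote in split("/)[-2]") makes the whole thing a separator
# string that never occurs, so the URL comes back unsplit inside a list.
garbled = without_slash.split("/)[-2]")
print(garbled)


def page_from_url(url):
    """Hypothetical helper: page number regardless of a trailing slash."""
    return url.rstrip("/").split("/")[-1]


print(page_from_url(with_slash), page_from_url(without_slash))
```

Stripping the trailing slash first avoids having to guess which form the server redirects to.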
I followed the basic Scrapy login example. It usually works, but in this case I had some problems. FormRequest.from_response didn't send the request to https://www.crowdfunder.com/user/validateLogin; instead it always sent the payload to https://www.crowdfunder.com/user/signup. I tried requesting validateLogin directly with the payload, but it responded with a 404 error. Any idea how to solve this problem? Thanks in advance!
import scrapy
from scrapy.spiders.init import InitSpider


class CrowdfunderSpider(InitSpider):
    name = "crowdfunder"
    allowed_domains = ["crowdfunder.com"]
    start_urls = [
        'http://www.crowdfunder.com/',
    ]
    login_page = 'https://www.crowdfunder.com/user/login/'
    payload = {}

    def init_request(self):
        """This function is called before crawling starts."""
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        self.payload = {'email': 'my_email',
                        'password': 'my_password'}
        # scrapy login
        return scrapy.FormRequest.from_response(response, formdata=self.payload, callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if 'https://www.crowdfunder.com/user/settings' == response.url:
            self.log("Successfully logged in. :) :) :)")
            # start the crawling
            return self.initialized()
        else:
            # login fail
            self.log("login failed :( :( :(")
Here are the payload and request URL sent when clicking the login button in the browser:
[Image: payload and request URL sent by clicking the login button]
Here is the log info:
2016-10-21 21:56:21 [scrapy] INFO: Scrapy 1.1.0 started (bot: crowdfunder_crawl)
2016-10-21 21:56:21 [scrapy] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'NEWSPIDER_MODULE': 'crowdfunder_crawl.spiders', 'SPIDER_MODULES': ['crowdfunder_crawl.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'crowdfunder_crawl'}
2016-10-21 21:56:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-21 21:56:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-21 21:56:21 [scrapy] INFO: Spider opened
2016-10-21 21:56:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-21 21:56:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-21 21:56:21 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/robots.txt> (referer: None)
2016-10-21 21:56:21 [scrapy] DEBUG: Redirecting (301) to <GET http://www.crowdfunder.com/user/login> from <GET https://www.crowdfunder.com/user/login/>
2016-10-21 21:56:22 [scrapy] DEBUG: Redirecting (301) to <GET https://www.crowdfunder.com/user/login> from <GET http://www.crowdfunder.com/user/login>
2016-10-21 21:56:22 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/user/login> (referer: None)
2016-10-21 21:56:23 [scrapy] DEBUG: Crawled (200) <POST https://www.crowdfunder.com/user/signup> (referer: https://www.crowdfunder.com/user/login)
2016-10-21 21:56:23 [crowdfunder] DEBUG: login failed :( :( :(
2016-10-21 21:56:23 [scrapy] INFO: Closing spider (finished)
2016-10-21 21:56:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1569,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 16313,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 22, 4, 56, 23, 232493),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 10, 22, 4, 56, 21, 180030)}
2016-10-21 21:56:23 [scrapy] INFO: Spider closed (finished)
FormRequest.from_response(response) by default uses the first form it finds. If you check what forms the page has you'd see:
In : response.xpath("//form")
Out:
[<Selector xpath='//form' data='<form action="/user/signup" method="post'>,
<Selector xpath='//form' data='<form action="/user/login" method="POST"'>,
<Selector xpath='//form' data='<form action="/user/login" method="post"'>]
So the form you are looking for is not the first one. The way to fix it is to use one of the from_response parameters that specify which form to use.
Using formxpath is the most flexible and my personal favorite:
In : FormRequest.from_response(response, formxpath='//*[contains(@action,"login")]')
Out: <POST https://www.crowdfunder.com/user/login>
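The selection logic can be illustrated without Scrapy. The sketch below uses Python's built-in html.parser to list the form actions on a page and find the first one that mentions "login". The FormCollector class and the sample markup are hypothetical; the markup only mimics the three forms shown in the selector output above.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Hypothetical helper that records the action of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action") or "")


# Sample markup imitating the forms listed in the answer.
page = """
<form action="/user/signup" method="post"></form>
<form action="/user/login" method="POST"></form>
<form action="/user/login" method="post"></form>
"""

collector = FormCollector()
collector.feed(page)

# What a "first form wins" default would pick.
first_action = collector.actions[0]

# Zero-based position of the first form whose action mentions "login".
login_index = next(
    index for index, action in enumerate(collector.actions) if "login" in action
)

print(collector.actions)
print(first_action, login_index)
```

The traceback in the Google login question shows the selectors from_response passes along: formname, formid, formnumber and formxpath. An index found this way would correspond to formnumber, assuming it counts forms from zero in document order.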