Google Authentication via Scrapy - scrapy

I was curious to know whether Google authentication could be achieved via Scrapy, so I tried the following code snippet.
My code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import FormRequest, Request
import logging
import json

class LoginSpider(scrapy.Spider):
    name = 'hello'
    start_urls = ['https://accounts.google.com']
    handle_httpstatus_list = [404, 405]

    def parse(self, response):
        print('inside parse')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'abc@gmail.com'},
                                          callback=self.log_password)]

    def log_password(self, response):
        print('inside log_password')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        print('inside after_login')
        print(response.url)
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        else:
            print("Login Successful!!")

if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'DOWNLOAD_DELAY': 0.2,
        'HANDLE_HTTPSTATUS_ALL': True
    })
    process.crawl(LoginSpider)
    process.start()
But I'm getting the following output when I run the script.
Output
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.4 (default, Mar 22 2018, 14:05:57) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Darwin-15.6.0-x86_64-i386-64bit
2018-08-15 10:38:05 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 0.2, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-08-15 10:38:05 [scrapy.core.engine] INFO: Spider opened
2018-08-15 10:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-15 10:38:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-15 10:38:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ManageAccount> from <GET https://accounts.google.com>
2018-08-15 10:38:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> from <GET https://accounts.google.com/ManageAccount>
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> (referer: None)
**inside parse**
https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (405) <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
**inside log_password**
https://accounts.google.com/
2018-08-15 10:38:10 [scrapy.core.scraper] ERROR: Spider error processing <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
Traceback (most recent call last):
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "google_login.py", line 79, in log_password
    callback=self.after_login)]
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 48, in from_response
    form = _get_form(response, formname, formid, formnumber, formxpath)
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 77, in _get_form
    raise ValueError("No <form> element found in %s" % response)
ValueError: No <form> element found in <405 https://accounts.google.com/>
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-15 10:38:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1810,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 357598,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 2,
 'downloader/response_status_count/405': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 15, 4, 53, 10, 164911),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'memusage/max': 41132032,
 'memusage/startup': 41132032,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2018, 8, 15, 4, 53, 5, 463699)}
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Spider closed (finished)
Any help would be appreciated.

Error 405 means that the URL doesn't accept the HTTP method, in your case the POST generated in parse.
Google's default login page is much more complex than a simple POST form; it relies heavily on JS and Ajax under the hood. To log in using Scrapy you have to use https://accounts.google.com/ServiceLogin?nojavascript=1 as the start URL instead.
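A minimal sketch of the spider with that start URL, keeping the field names ('Email', 'Passwd') from the question. Google has changed its sign-in flow repeatedly, so treat the field names and the two-step sequence as assumptions that may need adjusting against the live page:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = 'google_login'
    # The no-JavaScript login page serves a plain HTML <form> that
    # FormRequest.from_response can actually find.
    start_urls = ['https://accounts.google.com/ServiceLogin?nojavascript=1']

    def parse(self, response):
        # First page asks for the email address (field name assumed from the question).
        return FormRequest.from_response(
            response,
            formdata={'Email': 'abc@gmail.com'},
            callback=self.log_password,
        )

    def log_password(self, response):
        # Second page asks for the password (field name assumed from the question).
        return FormRequest.from_response(
            response,
            formdata={'Passwd': 'password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # response.body is bytes in Python 3, so compare against bytes.
        if b'authentication failed' in response.body:
            self.logger.error('Login failed')
        else:
            self.logger.info('Login successful')


if __name__ == '__main__':
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(LoginSpider)
    process.start()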

Related

Why does my code not wait as per the wait argument, and why does it not return the JavaScript-rendered content?

I am trying to learn Scrapy and Splash to scrape efficiently from the web. I have installed scrapy and scrapy-splash, and have Splash running in a Docker container. My code is below:
import scrapy
from scrapy.crawler import CrawlerProcess
from pathlib import Path
from scrapy_splash import SplashRequest

max_price = "110000"
min_price = "65000"
region_code = "5E430"

class QuotesSpider(scrapy.Spider):
    name = "propertySearch"

    def start_requests(self):
        url = "http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%" + region_code + "&minBedrooms=2&maxPrice=" + max_price + "&minPrice=" + min_price + "&propertyTypes=detached" + \
              "%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords="
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 3})

    def parse(self, response):
        work_path = "C:/Users/Cristi/Desktop/Scrapy_ROI_work_area/"
        no_of_pages = response.xpath('//span[@class = "pagination-pageInfo"]')
        with open(Path(work_path, "test.txt"), 'wb') as f:
            f.write(response.body)
        # with open(Path(work_path, "extract.txt"), 'wb') as g:
        #     g.write(no_of_pages)
        self.log('Saved file test.txt')

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
When I run it, the console outputs the below, in case it is of any use.
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-06-07 00:06:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-07 00:06:32 [scrapy.crawler] INFO: Overridden settings:
{}
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet Password: 162ccfed8b528ac9
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-06-07 00:06:32 [scrapy.core.engine] INFO: Spider opened
2020-06-07 00:06:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-07 00:06:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> from <GET http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=>
2020-06-07 00:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> (referer: None)
2020-06-07 00:06:33 [propertySearch] DEBUG: Saved file test.txt
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-07 00:06:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 994,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 77255,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 0.429054,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 6, 23, 6, 33, 185514),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 6, 6, 23, 6, 32, 756460)}
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Spider closed (finished)
The text file that it outputs only contains the bare HTML with none of the data that JavaScript would bring in.
I read the documentation closely and found that I first needed to follow the Scrapy project set-up process from here, and then found very useful info here about how to run my script from my IDE. It's working now.
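For anyone hitting the same issue: the log above lists only the stock downloader middlewares and none of the scrapy-splash ones, so the SplashRequest was fetched directly instead of going through Splash, which is why the wait argument had no effect and no JavaScript was rendered. Below is a sketch of the settings the scrapy-splash README asks for, assuming Splash is listening on localhost:8050; they go in the project's settings.py that the set-up process creates.
# scrapy-splash settings (per its README). Without these, SplashRequest is
# downloaded directly from the target site rather than through Splash.
# Assumes the Splash Docker container is exposed on localhost:8050.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
When running the spider from a plain script rather than with scrapy crawl, the same values can be passed as the settings dict given to CrawlerProcess.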

Scrapy - Multiple requests by looping over a JSON file

I'm trying to get the latitude and longitude of different cities. The names of the cities are stored in a JSON file. Here is my code:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)

class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            return scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {response.css('span.coordinatetxt::text').get()}
The objective is to loop through the JSON file and, for each city, send a request to a form at the URL "https://www.latlong.net/". However, nothing comes out of this request. Is this a bad way to make the loop? Should I handle the JSON file inside the class?
Log:
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17763-SP0
2019-04-01 16:27:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-01 16:27:17 [scrapy.core.engine] INFO: Spider opened
2019-04-01 16:27:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-01 16:27:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/robots.txt> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.latlong.net/> (referer: https://www.latlong.net/)
2019-04-01 16:27:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.latlong.net/>
{'latlong': '0,0'}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:27:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 874,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 29252,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 14, 27, 18, 923987),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 1, 14, 27, 17, 773592)}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Spider closed (finished)
Your parse method should be a generator, so you need to use yield instead of return in the for loop, otherwise you'll finish the loop on the first iteration. Furthermore, the get_geo method is returning a set, but it must return a Request, BaseItem, dict or None.
I suggest changing the code as follows:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)

class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            yield scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {'coord': response.css('span.coordinatetxt::text').get()}
https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/

Spider runs using Scrapy but no data is stored into a CSV

I have written code that goes through the links within a web page to extract data and then moves to the next page. It follows the "about" link of each author on quotes.toscrape.com.
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com',]

    def parse(self, response):
        linkto = response.css('div.quote > span > a::attr(href)').extract()
        for links in linkto:
            links = response.urljoin(links)
            yield scrapy.Request(url=links, callback = scrapy.parse_about)
        nextp = response.css('li.next > a::attr(href)').extract()
        if nextp:
            nextp = response.urljoin(nextp)
            yield scrapy.Request(url=nextp, callback=self.parse)

    def parse_about(self, response):
        yield {
            'date_of_birth': response.css('span.author-born-date::text').extract(),
            'author': response.css('h3.author-title::text').extract(),
        }
I executed the following in the command prompt:
scrapy crawl test -o test.csv
but the results I got:
2019-03-20 16:36:03 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotestoscrape)
2019-03-20 16:36:03 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 17.5.0, Python 2.7.15 |Anaconda, Inc.| (default, Nov 13 2018, 17:33:26) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1 11 Sep 2018), cryptography 2.5, Platform Windows-10-10.0.17134
2019-03-20 16:36:03 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotestoscrape.spiders', 'SPIDER_MODULES': ['quotestoscrape.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'quotestoscrape'}
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-20 16:36:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-03-20 16:36:03 [scrapy.core.engine] INFO: Spider opened
2019-03-20 16:36:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-20 16:36:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-03-20 16:36:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-03-20 16:36:04 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com> (referer: None)
Traceback (most recent call last):
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\kenny\quotestoscrape\quotestoscrape\spiders\QuoteTestSpider.py", line 13, in parse
yield scrapy.Request(url=links, callback = scrapy.parse_about)
AttributeError: 'module' object has no attribute 'parse_about'
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-20 16:36:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 3, 20, 21, 36, 4, 41000),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2019, 3, 20, 21, 36, 3, 468000)}
2019-03-20 16:36:04 [scrapy.core.engine] INFO: Spider closed (finished)
And the csv file I moved it to is empty.
Please let me know what I am doing wrong.
According to your log, the method parse_about is not called because you are trying to call scrapy.parse_about instead of the spider's self.parse_about:
....
for links in linkto:
    links = response.urljoin(links)
    yield scrapy.Request(url=links, callback=self.parse_about)
As your application doesn't scrape any data, it creates an empty csv file as a result.
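For reference, here is a sketch of the spider with that fix applied. Note one additional adjustment, not part of the original error: the next-page selector uses extract_first(), because extract() returns a list that response.urljoin() cannot join, and extract_first() is likewise used for the item fields so each CSV cell holds a single value rather than a one-element list.
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Follow each author's "about" link with the spider's own callback.
        for link in response.css('div.quote > span > a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.parse_about)

        # extract_first() returns a single href (or None), which urljoin can handle.
        next_page = response.css('li.next > a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page),
                                 callback=self.parse)

    def parse_about(self, response):
        yield {
            'date_of_birth': response.css('span.author-born-date::text').extract_first(),
            'author': response.css('h3.author-title::text').extract_first(),
        }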

Getting an error while implementing headers and body in a Scrapy spider

When trying to scrape a page passing headers and a body, I get the error shown below.
I tried converting it to JSON and to a string and sending it, but it doesn't give any results.
Please let me know if anything needs to be changed.
Code
import scrapy

class TestingSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        request_headers = {
            "Host": "host_here",
            "User-Agent": "Mozilla/5.0 20100101 Firefox/46.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Cache-Control": "max-age=0"
        }
        url = "my_url_here"
        payload = {
            "searchargs.approvedFrom.input": "05/18/2017",
            "searchargs.approvedTO.input": "05/18/2017",
            "pagesize": -1
        }
        yield scrapy.Request(url, method="POST", callback=self.parse, headers=request_headers, body=payload)

    def parse(self, response):
        print("-------------------------------came here-------------------------------")
        print(response.body)
Error 1
Traceback (most recent call last):
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/home/suventure/Desktop/suventure-projects/python-projects/scraper_txrrc/scraper_txrrc/spiders/wells_spider.py", line 114, in start_requests
yield scrapy.Request(url, method="POST", callback=self.parse, headers=request_headers, body=payload)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 26, in __init__
self._set_body(body)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 68, in _set_body
self._body = to_bytes(body, self.encoding)
File "/home/suventure/home/python/lib/python3.5/site-packages/scrapy/utils/python.py", line 117, in to_bytes
'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got dict
Error 2 without any response if dict is converted to string and sent in body
2017-05-19 22:39:38 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scraper_)
2017-05-19 22:39:38 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scraper', 'NEWSPIDER_MODULE': 'scraper_.spiders', 'SPIDER_MODULES': ['scraper_.spiders'], 'ROBOTSTXT_OBEY': True}
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-19 22:39:39 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-05-19 22:39:39 [scrapy.core.engine] INFO: Spider opened
2017-05-19 22:39:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-19 22:39:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-19 22:39:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://website_link_here/robots.txt> (referer: None)
2017-05-19 22:39:40 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <POST website_link_here>
2017-05-19 22:39:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-19 22:39:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 258,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 19, 17, 9, 40, 581949),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 5, 19, 17, 9, 39, 332675)}
2017-05-19 22:39:40 [scrapy.core.engine] INFO: Spider closed (finished)
In settings.py, change:
ROBOTSTXT_OBEY = False
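That takes care of the "Forbidden by robots.txt" line in the second log. The first traceback is a separate problem: a Request body must be str or bytes, not a dict, so the payload either has to be serialized (e.g. with json.dumps) or sent through FormRequest, which form-encodes a formdata dict for you. Below is a sketch of the latter, keeping the placeholder URL and field names from the question and assuming the endpoint expects ordinary form-encoded POST data:
import scrapy


class TestingSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        url = "my_url_here"  # placeholder from the question
        payload = {
            "searchargs.approvedFrom.input": "05/18/2017",
            "searchargs.approvedTO.input": "05/18/2017",
            "pagesize": "-1",  # formdata values must be strings
        }
        # FormRequest issues a POST with an application/x-www-form-urlencoded
        # body, so no manual body= serialization is needed.
        yield scrapy.FormRequest(url, formdata=payload, callback=self.parse)

    def parse(self, response):
        self.logger.info("Got %d bytes", len(response.body))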

Scrapy FormRequest redirects to unwanted link

I followed the basic Scrapy login. It always works, but in this case I had some problems. FormRequest.from_response didn't request https://www.crowdfunder.com/user/validateLogin; instead it always sent the payload to https://www.crowdfunder.com/user/signup. I tried requesting validateLogin directly with the payload, but it responded with a 404 error. Any idea how to solve this problem? Thanks in advance!
import scrapy
from scrapy.spiders.init import InitSpider


class CrowdfunderSpider(InitSpider):
    name = "crowdfunder"
    allowed_domains = ["crowdfunder.com"]
    start_urls = [
        'http://www.crowdfunder.com/',
    ]
    login_page = 'https://www.crowdfunder.com/user/login/'
    payload = {}

    def init_request(self):
        """This function is called before crawling starts."""
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        self.payload = {'email': 'my_email',
                        'password': 'my_password'}
        # scrapy login
        return scrapy.FormRequest.from_response(response, formdata=self.payload, callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if 'https://www.crowdfunder.com/user/settings' == response.url:
            self.log("Successfully logged in. :) :) :)")
            # start the crawling
            return self.initialized()
        else:
            # login fail
            self.log("login failed :( :( :(")
Here is the payload and request URL sent by clicking login in the browser.
Here is the log info:
2016-10-21 21:56:21 [scrapy] INFO: Scrapy 1.1.0 started (bot: crowdfunder_crawl)
2016-10-21 21:56:21 [scrapy] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'NEWSPIDER_MODULE': 'crowdfunder_crawl.spiders', 'SPIDER_MODULES': ['crowdfunder_crawl.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'crowdfunder_crawl'}
2016-10-21 21:56:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-21 21:56:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-21 21:56:21 [scrapy] INFO: Spider opened
2016-10-21 21:56:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-21 21:56:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-21 21:56:21 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/robots.txt> (referer: None)
2016-10-21 21:56:21 [scrapy] DEBUG: Redirecting (301) to <GET http://www.crowdfunder.com/user/login> from <GET https://www.crowdfunder.com/user/login/>
2016-10-21 21:56:22 [scrapy] DEBUG: Redirecting (301) to <GET https://www.crowdfunder.com/user/login> from <GET http://www.crowdfunder.com/user/login>
2016-10-21 21:56:22 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/user/login> (referer: None)
2016-10-21 21:56:23 [scrapy] DEBUG: Crawled (200) <POST https://www.crowdfunder.com/user/signup> (referer: https://www.crowdfunder.com/user/login)
2016-10-21 21:56:23 [crowdfunder] DEBUG: login failed :( :( :(
2016-10-21 21:56:23 [scrapy] INFO: Closing spider (finished)
2016-10-21 21:56:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1569,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 16313,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 22, 4, 56, 23, 232493),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 10, 22, 4, 56, 21, 180030)}
2016-10-21 21:56:23 [scrapy] INFO: Spider closed (finished)
FormRequest.from_response(response) by default uses the first form it finds. If you check what forms the page has, you'd see:
In : response.xpath("//form")
Out:
[<Selector xpath='//form' data='<form action="/user/signup" method="post'>,
<Selector xpath='//form' data='<form action="/user/login" method="POST"'>,
<Selector xpath='//form' data='<form action="/user/login" method="post"'>]
So the form you are looking for is not the first one. The way to fix it is to use one of the many from_response method parameters to specify which form to use.
Using formxpath is the most flexible and my personal favorite:
In : FormRequest.from_response(response, formxpath='//*[contains(@action,"login")]')
Out: <POST https://www.crowdfunder.com/user/login>
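Applied to the spider above, the login callback might look like this (a sketch; only the from_response call changes, everything else stays as in the question):
# Drop-in replacement for the login() method of the spider above:
# formxpath steers from_response to the /user/login form instead of
# the default first form (/user/signup).
def login(self, response):
    """Generate a login request against the correct form."""
    self.payload = {'email': 'my_email', 'password': 'my_password'}
    return scrapy.FormRequest.from_response(
        response,
        formxpath='//form[contains(@action, "login")]',
        formdata=self.payload,
        callback=self.check_login_response,
    )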