Scrapy with Tor and Polipo (Ubuntu): how to renew the IP

I have added the following to the /etc/polipo/config file:
# This file only needs to list configuration variables that deviate
# from the default values. See /usr/share/doc/polipo/examples/config.sample
# and "polipo -v" for variables you can tweak and further information.
logSyslog = true
logFile = /var/log/polipo/polipo.log
socksParentProxy = localhost:9050
diskCacheRoot=""
disableLocalInterface=""
Added this code to the settings.py file:
# More comprehensive list can be found at
# http://techpatterns.com/forums/about304.html
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
]

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'myproject.scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    # Disable compression middleware, so the actual HTML pages are cached
}
Deleted the middlewares.py file in my scrapy project folder and added a new middlewares.py with the following code:
# TOR proxy
import os
import random
from scrapy.conf import settings

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
Set up the Tor Browser network settings as follows:
Proxy type: HTTP/HTTPS
Address: localhost
Port: 9050
Started Polipo with the following command:
sudo /etc/init.d/polipo restart
It seems it's not crawling the site. This is my log:
scrapy crawl event -o items.csv
2018-09-13 21:02:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: tentimes)
2018-09-13 21:02:02 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.0-33-generic-x86_64-with-Ubuntu-16.04-xenial
2018-09-13 21:02:02 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tentimes.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['tentimes.spiders'], 'BOT_NAME': 'tentimes', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'csv'}
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'tentimes.middlewares.RandomUserAgentMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'tentimes.middlewares.ProxyMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-13 21:02:02 [scrapy.core.engine] INFO: Spider opened
2018-09-13 21:02:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-13 21:02:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-13 21:03:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
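For the "renew IP" part of the title, the usual pattern with this stack is to leave Scrapy talking to Polipo (which forwards to Tor's SOCKS port) and ask Tor for a new circuit over its control port with the NEWNYM signal. A minimal sketch using the stem library (pip install stem) is below; it assumes ControlPort 9051 with password authentication is enabled in /etc/tor/torrc, and that settings.py defines the HTTP_PROXY value the ProxyMiddleware above reads, e.g. pointing at Polipo's default listening port 8123. Neither of these appears in the question.
# Sketch only. Assumes torrc contains `ControlPort 9051` plus a
# HashedControlPassword, and that settings.py defines
# HTTP_PROXY = 'http://127.0.0.1:8123'  (Polipo's default port).
from stem import Signal
from stem.control import Controller

def renew_tor_identity(password='my_control_password'):
    """Ask Tor to switch to new circuits, which usually yields a new exit IP."""
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
Tor rate-limits NEWNYM, so a fresh identity is not guaranteed on every call; waiting roughly ten seconds between signals is the usual advice.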

Related

Scrapy crawl not crawling any URL

This is my first spider. When I execute it from cmd, the log shows that the URLs are not even getting crawled and there are no DEBUG messages for them.
I can't find a solution to this problem anywhere, and I am not able to understand what is wrong. Can somebody help me with this?
My code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"

    def start_request(self):
        urls = ["http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/"
                ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Log:
2021-06-19 23:19:01 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: my_scrapy)
2021-06-19 23:19:01 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect
1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5
2020, 15:34:40) [MSC v.1927 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021),
cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-06-19 23:19:01 [scrapy.utils.log] DEBUG: Using reactor:
twisted.internet.selectreactor.SelectReactor
2021-06-19 23:19:01 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_scrapy',
'NEWSPIDER_MODULE': 'my_scrapy.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['my_scrapy.spiders']}
2021-06-19 23:19:01 [scrapy.extensions.telnet] INFO: Telnet Password: 1a9440bbf933d074
2021-06-19 23:19:01 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-06-19 23:19:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Spider opened
2021-06-19 23:19:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2021-06-19 23:19:02 [scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Closing spider (finished)
2021-06-19 23:19:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.008228,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 6, 19, 17, 49, 2, 99933),
'log_count/INFO': 10,
'start_time': datetime.datetime(2021, 6, 19, 17, 49, 2, 91705)}
2021-06-19 23:19:02 [scrapy.core.engine] INFO: Spider closed (finished)
Note: As I do not have 50 reputation to comment, I am answering here.
The problem is the function name: it should be def start_requests(self) instead of def start_request(self).
The first requests to perform are obtained by calling the start_requests() method, which by default generates a Request for each URL in start_urls. Because your method is misnamed, it is never called, so the requests for those URLs are never made.
Your code after the small change:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"

    def start_requests(self):
        urls = ["http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/"
                ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Why does my code not wait as per the wait argument, and why does it not return the JavaScript-rendered content?

I am trying to learn scrapy and splash to scrape efficiently from the web. I have installed scrapy and scrapy-splash, and have Splash running in a Docker container. My code is below:
import scrapy
from scrapy.crawler import CrawlerProcess
from pathlib import Path
from scrapy_splash import SplashRequest

max_price = "110000"
min_price = "65000"
region_code = "5E430"

class QuotesSpider(scrapy.Spider):
    name = "propertySearch"

    def start_requests(self):
        url = "http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%" + region_code + "&minBedrooms=2&maxPrice=" + max_price + "&minPrice=" + min_price + "&propertyTypes=detached" + \
              "%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords="
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 3})

    def parse(self, response):
        work_path = "C:/Users/Cristi/Desktop/Scrapy_ROI_work_area/"
        no_of_pages = response.xpath('//span[@class = "pagination-pageInfo"]')
        with open(Path(work_path, "test.txt"), 'wb') as f:
            f.write(response.body)
        # with open(Path(work_path, "extract.txt"), 'wb') as g:
        #     g.write(no_of_pages)
        self.log('Saved file test.txt')

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
When I run it, the console outputs the following, in case it is of any use:
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-07 00:06:32 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-06-07 00:06:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-07 00:06:32 [scrapy.crawler] INFO: Overridden settings:
{}
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet Password: 162ccfed8b528ac9
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-07 00:06:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-06-07 00:06:32 [scrapy.core.engine] INFO: Spider opened
2020-06-07 00:06:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-07 00:06:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-07 00:06:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> from <GET http://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=>
2020-06-07 00:06:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E430&minBedrooms=2&maxPrice=110000&minPrice=65000&propertyTypes=detached%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords=> (referer: None)
2020-06-07 00:06:33 [propertySearch] DEBUG: Saved file test.txt
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-07 00:06:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 994,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 77255,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 0.429054,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 6, 23, 6, 33, 185514),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 6, 6, 23, 6, 32, 756460)}
2020-06-07 00:06:33 [scrapy.core.engine] INFO: Spider closed (finished)
The text file that it outputs only contains the bare HTML with none of the data that JavaScript would bring in.
I read the documentation closely and found that I first needed to follow the Scrapy project set-up process properly, and then found very useful information about how to run my script from my IDE. It's working now.
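For reference, a sketch of what that usually looks like, assuming the spider was moved into a proper Scrapy project and the scrapy-splash settings live in that project's settings.py (the file name run.py and the details here are assumptions, not the poster's actual code):
# Hypothetical run.py at the project root, next to scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() loads the project's settings.py, so the scrapy-splash
# options (SPLASH_URL, the Splash downloader middlewares, the dupefilter) are
# actually applied. A bare CrawlerProcess() ignores them, which is consistent
# with the log above, where the request goes straight to rightmove.co.uk
# rather than to the Splash endpoint.
process = CrawlerProcess(get_project_settings())
process.crawl('propertySearch')  # spider name, found via the project's SPIDER_MODULES
process.start()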

Error in extracting links from website using scrapy

When I execute the following code, I am able to extract only one link instead of all the links on that specific page of the website.
from scrapy import Spider
from scrapy.http import Request

class BooksSpider(Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com/']
    start_urls = ["http://books.toscrape.com"]

    def parse(self, response):
        books = response.xpath("//h3/a/@href").extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url), callback=self.parse_page)

    def parse_page(self, response):
        pass
This is the output, which shows only the first link of "books.toscrape.com" being extracted. Can anyone help me understand what the mistake is here, or whether it is due to some system error? This is very frustrating, as the loops and everything else look fine. I guess there is some issue with yield; how can I handle this error?
2020-05-26 12:09:23 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrap_book)
2020-05-26 12:09:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-05-26 12:09:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-26 12:09:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrap_book',
'NEWSPIDER_MODULE': 'scrap_book.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrap_book.spiders']}
2020-05-26 12:09:23 [scrapy.extensions.telnet] INFO: Telnet Password: 7b1edefe67af4658
2020-05-26 12:09:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-05-26 12:09:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-26 12:09:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-26 12:09:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-26 12:09:26 [scrapy.core.engine] INFO: Spider opened
2020-05-26 12:09:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-26 12:09:26 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-05-26 12:09:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2020-05-26 12:09:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com> (referer: None)
2020-05-26 12:09:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'books.toscrape.com': <GET http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/ind
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 5, 26, 6, 24, 26, 117907)}
2020-05-26 12:09:28 [scrapy.core.engine] INFO: Spider closed (finished)
As Scrapy says, the links are filtered because of your allowed_domains:
DEBUG: Filtered offsite request to 'books.toscrape.com'
Change your code to allowed_domains = ['books.toscrape.com'] and it should work fine.
In addition, in the code you posted there is an error in the yield: there is one bracket too many right after absolute_url. It should be: yield Request(absolute_url, callback=self.parse_page)
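Putting both fixes together, the corrected spider would look roughly like this (trailing slash removed from allowed_domains, stray bracket removed from the yield):
from scrapy import Spider
from scrapy.http import Request

class BooksSpider(Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']   # no trailing slash
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url, callback=self.parse_page)

    def parse_page(self, response):
        pass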

Google Authentication via Scrapy

I was curious to know whether Google authentication could be achieved via Scrapy. I tried the following code snippet to achieve this.
My code:
import scrapy
from scrapy.http import FormRequest, Request
from scrapy.crawler import CrawlerProcess
import logging
import json

class LoginSpider(scrapy.Spider):
    name = 'hello'
    start_urls = ['https://accounts.google.com']
    handle_httpstatus_list = [404, 405]

    def parse(self, response):
        print('inside parse')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'abc@gmail.com'},
                                          callback=self.log_password)]

    def log_password(self, response):
        print('inside log_password')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        print('inside after_login')
        print(response.url)
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        else:
            print("Login Successful!!")

if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'DOWNLOAD_DELAY': 0.2,
        'HANDLE_HTTPSTATUS_ALL': True
    })
    process.crawl(LoginSpider)
    process.start()
But I'm getting the following output when I run the script.
Output
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.4 (default, Mar 22 2018, 14:05:57) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Darwin-15.6.0-x86_64-i386-64bit
2018-08-15 10:38:05 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 0.2, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-08-15 10:38:05 [scrapy.core.engine] INFO: Spider opened
2018-08-15 10:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-15 10:38:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-15 10:38:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ManageAccount> from <GET https://accounts.google.com>
2018-08-15 10:38:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> from <GET https://accounts.google.com/ManageAccount>
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> (referer: None)
**inside parse**
https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (405) <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
**inside log_password**
https://accounts.google.com/
2018-08-15 10:38:10 [scrapy.core.scraper] ERROR: Spider error processing <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
Traceback (most recent call last):
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "google_login.py", line 79, in log_password
    callback=self.after_login)]
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 48, in from_response
    form = _get_form(response, formname, formid, formnumber, formxpath)
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 77, in _get_form
    raise ValueError("No <form> element found in %s" % response)
ValueError: No <form> element found in <405 https://accounts.google.com/>
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-15 10:38:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1810,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 357598,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 2,
 'downloader/response_status_count/405': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 15, 4, 53, 10, 164911),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'memusage/max': 41132032,
 'memusage/startup': 41132032,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2018, 8, 15, 4, 53, 5, 463699)}
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Spider closed (finished)
Any help would be appreciated
Error 405 means that the URL doesn't accept the HTTP method, in your case the POST generated in parse.
Google's default login page is much more complex than a simple POST form; it relies heavily on JS and Ajax under the hood. To log in using Scrapy you have to use https://accounts.google.com/ServiceLogin?nojavascript=1 as the start URL instead.
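Applied to the spider above, the suggested change is only the start URL; everything else can stay as it is (a sketch, not tested against Google's current login flow):
class LoginSpider(scrapy.Spider):
    name = 'hello'
    # Start from the no-JavaScript login form instead of the account root,
    # so FormRequest.from_response() actually finds a <form> element.
    start_urls = ['https://accounts.google.com/ServiceLogin?nojavascript=1']
    handle_httpstatus_list = [404, 405]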

Scrapy: USER_AGENT and ROBOTSTXT_OBEY are properly set, but I still get error 403

Hello and thanks in advance for the help or direction you can bring. This is my scraper:
import scrapy

class RakutenSpider(scrapy.Spider):
    name = "rak"
    allowed_domains = ["rakuten.com"]
    start_urls = ['https://www.rakuten.com/deals?omadtrack=hp_deals_viewmore']

    def parse(self, response):
        for sel in response.xpath('//div[@class="page-bottom"]/div'):
            yield {
                'titles': sel.xpath("//div[@class='slider-prod-title']").extract_first(),
                'prices': sel.xpath("//span[@class='price-bold']").extract_first(),
                'images': sel.xpath("//div[@class='deal-img']/img").extract_first()
            }
And this is part of my settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 5
# Obey robots.txt rules
ROBOTSTXT_OBEY = 'False'
and this is part of the log:
DEBUG: Crawled (403) <GET https://www.rakuten.com/deals?omadtrack=hp_deals_viewmore> (referer: None)
I have tried almost all the solutions I found on Stack Overflow.
Log file: this is a new log after installing the Firefox driver. Now I get an ERROR: Error downloading https://www.rakuten.com/deals?omadtrack=hp_deals_viewmore
2017-11-17 00:38:45 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-17 00:38:45 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'deals.spiders', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['deals.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36', 'TELNETCONSOLE_ENABLED': False, 'DOWNLOAD_DELAY': 5}
2017-11-17 00:38:45 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named cryptography.x509'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2017-11-17 00:38:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2017-11-17 00:38:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['deals.middlewares.JSMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-17 00:38:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-17 00:38:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-17 00:38:45 [scrapy.core.engine] INFO: Spider opened
2017-11-17 00:38:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-17 00:38:45 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.rakuten.com/deals?omadtrack=hp_deals_viewmore>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "/home/seealldeals/tmp/scrapy/deals/deals/middlewares.py", line 63, in process_request
driver = webdriver.Firefox()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/webdriver.py", line 144, in __init__
self.service.start()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/common/service.py", line 74, in start
stdout=self.log_file, stderr=self.log_file)
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error
2017-11-17 00:38:45 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-17 00:38:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/exceptions.OSError': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 17, 5, 38, 45, 328366),
'log_count/ERROR': 1,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'memusage/max': 33509376,
'memusage/startup': 33509376,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 11, 17, 5, 38, 45, 112667)}
2017-11-17 00:38:45 [scrapy.core.engine] INFO: Spider closed (finished)
What's wrong
rakuten.com is integrated with Google Analytics, which has an anti-spider feature.
If your request can't process rakuten.com's analytics.js properly, you will be blocked from the site and get a 403 error code.
How to fix it
Use a JavaScript rendering technique.
Solution 1: (Integrate scrapy with scrapy-splash)
Here is Scrapy-splash github repository
Install scrapy-splash from PyPI:
pip install scrapy-splash
Install Docker on your machine.
Run a scrapy-splash container:
docker run -p 8050:8050 scrapinghub/splash
Add the following line to your settings.py:
SPLASH_URL = 'http://192.168.59.103:8050'
Append the Splash downloader middlewares to your settings.py (a few more recommended scrapy-splash settings are sketched after the block below):
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
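The scrapy-splash README also recommends a Splash-aware spider middleware and dupefilter alongside the downloader middlewares above, and when the Splash container runs on the same machine as in the docker command earlier, SPLASH_URL is normally just localhost. A sketch of those extra lines:
SPLASH_URL = 'http://localhost:8050'  # if the Docker container runs locally

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'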
Change your spider's code to
import scrapy
from scrapy_splash import SplashRequest

class RakutenSpider(scrapy.Spider):
    name = "rak"
    allowed_domains = ["rakuten.com"]
    start_urls = ['https://www.rakuten.com/deals?omadtrack=hp_deals_viewmore']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        for sel in response.xpath('//div[@class="page-bottom"]/div'):
            yield {
                'titles': sel.xpath("//div[@class='slider-prod-title']").extract_first(),
                'prices': sel.xpath("//span[@class='price-bold']").extract_first(),
                'images': sel.xpath("//div[@class='deal-img']/img").extract_first()
            }
Solution 2: (Integrate scrapy with selenium webdriver as a middleware)
Selenium web driver python binding documentation
Install Selenium from PyPI:
pip install selenium
If you want to use the Firefox browser, install Mozilla's geckodriver and put it on your PATH.
Download Mozilla Geckodriver here
If you want to use the Chrome browser, install ChromeDriver and put it on your PATH.
Download Chromedriver
If you want to use the PhantomJS browser, install PhantomJS from Homebrew:
brew install phantomjs
Add a JSMiddleware class to your middlewares.py:
from scrapy.http import HtmlResponse
from selenium import webdriver

class JSMiddleware(object):
    def process_request(self, request, spider):
        driver = webdriver.Firefox()
        driver.get(request.url)
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
Append the Selenium downloader middleware to your settings.py:
DOWNLOADER_MIDDLEWARES = {
    'youproject.middlewares.JSMiddleware': 200
}
Use your original spider's code
import scrapy

class RakutenSpider(scrapy.Spider):
    name = "rak"
    allowed_domains = ["rakuten.com"]
    start_urls = ['https://www.rakuten.com/deals?omadtrack=hp_deals_viewmore']

    def parse(self, response):
        for sel in response.xpath('//div[@class="page-bottom"]/div'):
            yield {
                'titles': sel.xpath("//div[@class='slider-prod-title']").extract_first(),
                'prices': sel.xpath("//span[@class='price-bold']").extract_first(),
                'images': sel.xpath("//div[@class='deal-img']/img").extract_first()
            }
More
If you want to use the Chrome browser in headless mode, check this tutorial; a minimal sketch of the headless driver set-up follows.
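A minimal headless-Chrome variant of the driver creation used in the middleware above might look like this (it assumes chromedriver is on your PATH; this is not part of the original answer):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)  # chromedriver must be on the PATH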
There is a problem with your settings; it should be:
ROBOTSTXT_OBEY = False
The ROBOTSTXT_OBEY setting needs a boolean, but you were setting it to a string. You can check your logs: the robots.txt request was being made first.
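As a side note on why the quotes matter (plain Python illustration only, not Scrapy internals): the string 'False' is not the boolean False, and a non-empty string even evaluates as truthy, so settings that expect a real boolean should be given one.
>>> ROBOTSTXT_OBEY = 'False'   # a string, not a boolean
>>> bool(ROBOTSTXT_OBEY)       # non-empty strings are truthy in Python
True
>>> ROBOTSTXT_OBEY = False     # the actual boolean the setting expects
>>> bool(ROBOTSTXT_OBEY)
False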