I'm trying to scrape the website www.zillow.com using Scrapy. I'm importing addresses from a CSV file and searching the site by each of them, but I'm getting an error. Here is my code.
csv_read.py
import pandas as pd


def read_csv():
    df = pd.read_csv('report2.csv')
    return df['site_address'].values.tolist()
zillow.py
import scrapy
from csv_automation.spiders.csv_read import read_csv

base_url = "https://www.zillow.com/homes/{}_rb"


class ZillowSpider(scrapy.Spider):
    name = 'zillow'

    def start_requests(self):
        for tag in read_csv():
            yield scrapy.Request(base_url.format(tag))

    def parse(self, response):
        yield {
            'Address': response.body(".(//h1[@id='ds-chip-property-address']/span)[1]/text()").get(),
            'zestimate': response.body(".(//span[@class='Text-c11n-8-38-0__aiai24-0 jtMauM'])[1]/text()").get(),
            'rent zestimate': response.body(".(//span[@class='Text-c11n-8-38-0__aiai24-0 jtMauM'])[2]/text()").get()
        }
settings.py
# Scrapy settings for csv_automation project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'csv_automation'
SPIDER_MODULES = ['csv_automation.spiders']
NEWSPIDER_MODULE = 'csv_automation.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 15
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'csv_automation.middlewares.CsvAutomationSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'csv_automation.middlewares.CsvAutomationDownloaderMiddleware': 543,
#}
DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'csv_automation.pipelines.CsvAutomationPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
PROXY_POOL_ENABLED = True
PROXY_POOL_BAN_POLICY = 'policy.policy.BanDetectionPolicyNotText'
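The PROXY_POOL_BAN_POLICY above points at a small module of my own. As a purely hypothetical sketch (not my actual file), assuming scrapy-proxy-pool uses the same response_is_ban / exception_is_ban hooks as scrapy-rotating-proxies, such a policy could look roughly like this:
# policy/policy.py (hypothetical sketch)
class BanDetectionPolicyNotText:
    NOT_BAN_STATUSES = {200, 301, 302}

    def response_is_ban(self, request, response):
        # Treat unexpected status codes or empty bodies as a ban
        if response.status not in self.NOT_BAN_STATUSES:
            return True
        return not len(response.body)

    def exception_is_ban(self, request, exception):
        # Let the middleware's default exception handling decide
        return None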
I tried to get the data from the JSON but found nothing; maybe that is just because I am searching for a specific address. So I tried to use XPath instead.
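A minimal sketch of one way to probe a response for embedded JSON (this assumes the data, if present, sits in <script type="application/json"> blocks; the actual field names would still have to be discovered):
import json

def parse(self, response):
    # Decode any application/json <script> blocks and log what they contain
    for raw in response.xpath('//script[@type="application/json"]/text()').getall():
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        self.logger.info("embedded JSON keys: %s",
                         list(data)[:10] if isinstance(data, dict) else type(data))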
My output
PS G:\Python_Practice\scrapy_practice\csv_automation> scrapy crawl zillow
2021-08-15 22:31:10 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: csv_automation)
2021-08-15 22:31:10 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64
bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19041-SP0
2021-08-15 22:31:10 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-15 22:31:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'csv_automation',
'DOWNLOAD_DELAY': 15,
'NEWSPIDER_MODULE': 'csv_automation.spiders',
'SPIDER_MODULES': ['csv_automation.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
2021-08-15 22:31:10 [scrapy.extensions.telnet] INFO: Telnet Password: 5004f35501fe1348
2021-08-15 22:31:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-08-15 22:31:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware',
'scrapy_proxy_pool.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-08-15 22:31:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-08-15 22:31:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-08-15 22:31:12 [scrapy.core.engine] INFO: Spider opened
2021-08-15 22:31:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-15 22:31:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-15 22:31:12 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.us-proxy.org:443
2021-08-15 22:31:14 [urllib3.connectionpool] DEBUG: https://www.us-proxy.org:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET /anonymous-proxy.html HTTP/1.1" 200 None
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2021-08-15 22:31:15 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET /uk-proxy.html HTTP/1.1" 200 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.free-proxy-list.net:80
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: http://www.free-proxy-list.net:80 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.free-proxy-list.net:443
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: https://www.free-proxy-list.net:443 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): free-proxy-list.net:80
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: http://free-proxy-list.net:80 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2021-08-15 22:31:16 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:17 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.sslproxies.org:443
2021-08-15 22:31:17 [urllib3.connectionpool] DEBUG: https://www.sslproxies.org:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2021-08-15 22:31:18 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:18 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.proxy-daily.com:443
2021-08-15 22:31:19 [urllib3.connectionpool] DEBUG: https://www.proxy-daily.com:443 "GET / HTTP/1.1" 301 None
2021-08-15 22:31:19 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): proxy-daily.com:443
2021-08-15 22:31:20 [urllib3.connectionpool] DEBUG: https://proxy-daily.com:443 "GET / HTTP/1.1" 200 None
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-JP Japanese prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: GB2312 Chinese prober hit error at byte 3654
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-KR Korean prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: CP949 Korean prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: Big5 Chinese prober hit error at byte 3654
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-TW Taiwan prober hit error at byte 3652
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: utf-8 confidence = 0.87625
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: SHIFT_JIS Japanese confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-JP not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: GB2312 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-KR not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: CP949 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: Big5 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-TW not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1251 Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: KOI8-R Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-5 Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: MacCyrillic Russian confidence = 0.0
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: IBM866 Russian confidence = 0.0
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: IBM855 Russian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-7 Greek confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1253 Greek confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-5 Bulgarian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1251 Bulgarian confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: TIS-620 Thai confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: ISO-8859-9 Turkish confidence = 0.5252901105163297
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.0
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: windows-1255 Hebrew confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: utf-8 confidence = 0.87625
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: SHIFT_JIS Japanese confidence = 0.01
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-JP not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: GB2312 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-KR not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: CP949 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: Big5 not active
2021-08-15 22:31:21 [chardet.charsetprober] DEBUG: EUC-TW not active
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.94:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://185.77.221.113:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.55:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://109.94.172.150:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://109.94.172.150:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.149.75:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://193.56.64.200:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.122.237:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://109.94.172.211:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://54.156.145.160:8080
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://185.77.220.189:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.151.241:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://94.231.216.42:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://193.56.64.179:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://213.166.78.52:8085
2021-08-15 22:31:21 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://5.181.2.102:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1421-Beechwood-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://204.87.183.21:3128
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/7393-Frolic-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.74:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.208.211.87:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/673-Hummingbird-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.123.116:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/303-Old-Farm-Rd_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://18.205.10.48:80
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/8430-Burket-Way_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://91.188.246.162:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/778-Courtney-Cir_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/283-Meadow-Glen-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://94.231.216.146:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://5.181.2.54:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2020-Bridlewood-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/6515-Springview-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/578-Shortleaf-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/493-Founders-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/481-Autumn-Springs-Ct_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.208.211.224:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.123.101:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://5.181.2.113:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://213.166.79.79:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.149.204:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1377-E-New-Rd_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://91.188.247.38:8085
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1452-S-Highland-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:31:42 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://213.166.78.61:8085
2021-08-15 22:31:43 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/303-Old-Farm-Rd_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:31:43 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.208.211.87:8085
2021-08-15 22:32:01 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2051-Milburn-Dr_rb> with another proxy (failed 1 times, max retries: 5)
2021-08-15 22:32:01 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://3.12.95.129:80
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/7393-Frolic-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://185.77.221.177:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/1421-Beechwood-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.115:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.122.202:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/673-Hummingbird-Dr_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://102.64.120.43:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/8430-Burket-Way_rb> with another proxy (failed 2 times, max retries: 5)
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: [ProxyChoosen] http://85.209.150.46:8085
2021-08-15 22:32:03 [scrapy_proxy_pool.middlewares] DEBUG: Retrying <GET https://www.zillow.com/homes/283-Meadow-Glen-Dr_rb> with another proxy (failed 2 times, max retries: 5)
Traceback (most recent call last):
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "G:\Python_Practice\scrapy_practice\csv_automation\csv_automation\spiders\zillow.py", line 17, in parse
'Address': response.body(".(//h1[@id='ds-chip-property-address']/span)[1]/text()").get(),
TypeError: 'bytes' object is not callable
2021-08-15 22:43:12 [scrapy.extensions.logstats] INFO: Crawled 14 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
2021-08-15 22:43:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb/> (referer: None)
2021-08-15 22:43:18 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.zillow.com/homes/2759-Armaugh-Dr_rb/> (referer: None)
Traceback (most recent call last):
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\raisu\anaconda3\envs\Scrapy_Workspace2\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "G:\Python_Practice\scrapy_practice\csv_automation\csv_automation\spiders\zillow.py", line 17, in parse
'Address': response.body(".(//h1[@id='ds-chip-property-address']/span)[1]/text()").get(),
TypeError: 'bytes' object is not callable
2021-08-15 22:43:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/scrapy.core.downloader.handlers.http11.TunnelError': 1,
'bans/error/twisted.internet.error.TCPTimedOutError': 106,
'bans/status/307': 1,
'downloader/exception_count': 107,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 106,
'downloader/request_bytes': 66947,
'downloader/request_count': 141,
'downloader/request_method_count/GET': 141,
'downloader/response_bytes': 3019748,
'downloader/response_count': 34,
'downloader/response_status_count/200': 15,
'downloader/response_status_count/301': 18,
'downloader/response_status_count/307': 1,
'elapsed_time_seconds': 725.869701,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2021, 8, 15, 16, 43, 18, 373681),
'log_count/DEBUG': 326,
'log_count/ERROR': 15,
'log_count/INFO': 24,
'response_received_count': 15,
'scheduler/dequeued': 141,
'scheduler/dequeued/memory': 141,
'scheduler/enqueued': 144,
'scheduler/enqueued/memory': 144,
'spider_exceptions/TypeError': 15,
'start_time': datetime.datetime(2021, 8, 15, 16, 31, 12, 503980)}
2021-08-15 22:43:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
PS G:\Python_Practice\scrapy_practice\csv_automation>
I'm using a proxy pool to avoid getting banned by the website.
The problem is in the response parsing method: you should use response.xpath(), not response.body.
def parse(self, response):
    # The address contains a non-breaking space (\xa0); filter it out
    address = [
        text
        for text in response.xpath('//h1[@id="ds-chip-property-address"]//span//text()').getall()
        if text not in ['\xa0']
    ]
    address = " ".join(address)

    # The class "Text-c11n-8-38-0__aiai24-0 jtMauM" seems to be generated by a
    # front-end framework. In case the class name changes later, use the id:
    # find the element with that id and take its following `span` siblings.
    zestimate, rent_zestimate = response.css('#dsChipZestimateTooltip ~ span::text').getall()[:2]

    yield {
        'Address': address,
        'zestimate': zestimate,
        'rent zestimate': rent_zestimate,
    }
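With that parse method in place, the items can be written straight to a feed when the spider runs, e.g.:
scrapy crawl zillow -o zillow_items.json
(-o appends to the file; on Scrapy 2.x, -O overwrites it.)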
First, it's always important to make sure you're scraping on the right webpage - I always run:
from scrapy.utils.response import open_in_browser
open_in_browser(response)
to check if I'm on the webpage I want to get information from.
Second, you could use xpath, like:
yield {
    'Address': "".join(response.xpath("//h1[@id='ds-chip-property-address']/span/text()").getall()).replace('\xa0', ''),
    'zestimate': "".join(response.xpath("//button[@class='TriggerText-c11n-8-38-0__sc-139r5uq-0 kulCB']/span/text()").getall()),
    'rent zestimate': "".join(response.xpath("//div[@class='Spacer-c11n-8-38-0__sc-17suqs2-0 dwpwDj']/span[@class='Text-c11n-8-38-0__aiai24-0 cMFTfb']/text()").getall())
}
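Before running the full crawl, it can also help to verify the selectors interactively in the Scrapy shell against one of the generated URLs, for example:
scrapy shell "https://www.zillow.com/homes/1421-Beechwood-Dr_rb"
>>> view(response)   # opens the downloaded page in your browser
>>> response.xpath("//h1[@id='ds-chip-property-address']/span/text()").getall()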
Related
I have generated the quotes spider from the tutorial and added a yield statement to parse. However, the spider is not working: it is having trouble downloading quotes.toscrape.com.
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = (
        'http://www.quotes.toscrape.com/',
    )

    def parse(self, response):
        h1_tag = response.xpath('//h1/a/text()').extract_first()
        tags = response.xpath('//*[@class="tag-item"]/a/text()').extract()
        yield {'H1 Tag': h1_tag, 'Tags': tags}
2019-07-31 12:04:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.quotes.toscrape.com/> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:09 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.quotes.toscrape.com/> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:09 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.quotes.toscrape.com/> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:09 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.quotes.toscrape.com/>
2019-07-31 12:04:07 [scrapy.utils.log] INFO: Scrapy 1.7.2 started (bot: quotes_spider)
2019-07-31 12:04:07 [scrapy.utils.log] INFO: Versions: lxml 4.4.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (default, Mar 27 2019, 16:54:48) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Darwin-18.6.0-x86_64-i386-64bit
2019-07-31 12:04:07 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'quotes_spider', 'NEWSPIDER_MODULE': 'quotes_spider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotes_spider.spiders']}
2019-07-31 12:04:07 [scrapy.extensions.telnet] INFO: Telnet Password: ab4784ba2a683680
2019-07-31 12:04:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2019-07-31 12:04:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-31 12:04:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-31 12:04:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-07-31 12:04:07 [scrapy.core.engine] INFO: Spider opened
2019-07-31 12:04:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-31 12:04:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-31 12:04:07 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.quotes.toscrape.com/robots.txt> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.quotes.toscrape.com/robots.txt> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:08 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.quotes.toscrape.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:08 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.quotes.toscrape.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
Traceback (most recent call last):
File "/Users/aakankshasaxena/anaconda3/envs/API_env/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.quotes.toscrape.com/> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:09 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.quotes.toscrape.com/> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:09 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.quotes.toscrape.com/> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:09 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.quotes.toscrape.com/>
Traceback (most recent call last):
File "/Users/aakankshasaxena/anaconda3/envs/API_env/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-07-31 12:04:09 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-31 12:04:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
'downloader/request_bytes': 1362,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 2.28015,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 7, 31, 19, 4, 9, 880266),
'log_count/DEBUG': 6,
'log_count/ERROR': 2,
'log_count/INFO': 10,
'memusage/max': 50892800,
'memusage/startup': 50892800,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
"robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2019, 7, 31, 19, 4, 7, 600116)}
2019-07-31 12:04:09 [scrapy.core.engine] INFO: Spider closed (finished)
This should actually yield the correct result. The error was a syntax error.
I'm trying to scrape reviews for all hotels in Punta Cana. The code seems to run, but when I call crawl it doesn't actually scrape any of the sites. Here is my file structure, the command I ran, and what happened when I ran it.
folder structure:
├── scrapy.cfg
└── tripadvisor_reviews
├── __init__.py
├── __pycache__
│ ├── __init__.cpython-37.pyc
│ ├── items.cpython-37.pyc
│ └── settings.cpython-37.pyc
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
├── __init__.py
├── __pycache__
│ ├── __init__.cpython-37.pyc
│ └── tripadvisorSpider.cpython-37.pyc
└── tripadvisorSpider.py
tripadvisorSpider.py
import scrapy
from tripadvisor_reviews.items import TripadvisorReviewsItem


class tripadvisorSpider(scrapy.Spider):
    name = "tripadvisorspider"
    allowed_domains = ["www.tripadvisor.com"]

    def start_requests(self):
        urls = [
            'https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

        next_page = response.xpath(
            '//div[@class="nav next taLnk ui_button primary"]/@href').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[@class="hotels-review-list-parts-ReviewTitle__reviewTitleText"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)

        next_page = response.xpath(
            '//div[@class="ui_button nav next primary "]/@href').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorReviewsItem()
        item['title'] = response.xpath(
            '//div[@class="hotels-review-list-parts-ReviewTitle__reviewTitleText"]/text()').extract()
        item['content'] = response.xpath(
            '//q[@class="hotels-review-list-parts-ExpandableReview__reviewText"]/text()').extract()
        # item['stars'] = response.xpath(
        #     '//span[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()[0]
        print(item)
        yield item
items.py
import scrapy


class TripadvisorReviewsItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()
    # stars = scrapy.Field()
I ran it using the following command in terminal:
scrapy crawl tripadvisorspider -o items.json
This is my terminal output
2019-05-14 12:32:12 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: tripadvisor_reviews)
2019-05-14 12:32:12 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.1 (default, Dec 14 2018, 13:28:58) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.4.2, Platform Darwin-18.5.0-x86_64-i386-64bit
2019-05-14 12:32:12 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tripadvisor_reviews', 'FEED_FORMAT': 'csv', 'FEED_URI': 'items.csv', 'NEWSPIDER_MODULE': 'tripadvisor_reviews.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tripadvisor_reviews.spiders']}
2019-05-14 12:32:12 [scrapy.extensions.telnet] INFO: Telnet Password: aae78556d6b8c59b
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-05-14 12:32:12 [scrapy.core.engine] INFO: Spider opened
2019-05-14 12:32:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-14 12:32:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2019-05-14 12:32:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/robots.txt> (referer: None)
2019-05-14 12:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html> (referer: None)
2019-05-14 12:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d15025732-Reviews-Impressive_Resort_Spa_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d313884-Reviews-Punta_Cana_Princess_All_Suites_Resort_Spa-Bavaro_Punta_Cana_La_Altagracia_Province_Dom.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d4451011-Reviews-The_Westin_Puntacana_Resort_Club-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d10175054-Reviews-Secrets_Cap_Cana_Resort_Spa-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d7307251-Reviews-The_Level_at_Melia_Caribe_Beach-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Re.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d292158-Reviews-Grand_Palladium_Punta_Cana_Resort_Spa-Punta_Cana_La_Altagracia_Province_Dominican_Repub.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d1604057-Reviews-Secrets_Royal_Beach_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d649099-Reviews-Zoetry_Agua_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d150842-Reviews-Iberostar_Dominicana_Hotel-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d15515013-Reviews-Grand_Memories_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d150841-Reviews-Iberostar_Selection_Bavaro-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d1233228-Reviews-Iberostar_Grand_Bavaro-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d149397-Reviews-Bavaro_Princess_Resort_Spa_Casino-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d584407-Reviews-Ocean_Blue_Sand-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d259337-Reviews-Grand_Palladium_Bavaro_Suites_Resort_Spa-Bavaro_Punta_Cana_La_Altagracia_Province_Domi.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d150854-Reviews-Hotel_Riu_Palace_Macao-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d11701188-Reviews-BlueBay_Grand_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d14838260-Reviews-Melia_Punta_Cana_Beach_Resort-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d1595124-Reviews-Luxury_Bahia_Principe_Esmeralda-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d508162-Reviews-Dreams_Punta_Cana_Resort_Spa-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d1076311-Reviews-Hard_Rock_Hotel_Casino_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d10595200-Reviews-Grand_Bahia_Principe_Aquamarine-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d611114-Reviews-Hotel_Riu_Palace_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3200043-d8709413-Reviews-Excellence_El_Carmen-Uvero_Alto_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d1199681-Reviews-Luxury_Bahia_Principe_Ambar-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d1889895-Reviews-Karibo_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d6454132-Reviews-Premium_Level_at_Barcelo_Bavaro_Palace-Bavaro_Punta_Cana_La_Altagracia_Province_Domin.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d15080584-Reviews-Impressive_Premium_Resorts_Spa-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Re.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d2687221-Reviews-NH_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d579774-Reviews-Iberostar_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-14 12:32:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 46023,
'downloader/request_count': 32,
'downloader/request_method_count/GET': 32,
'downloader/response_bytes': 5599418,
'downloader/response_count': 32,
'downloader/response_status_count/200': 32,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 5, 14, 19, 32, 21, 637712),
'log_count/DEBUG': 33,
'log_count/INFO': 8,
'memusage/max': 51412992,
'memusage/startup': 51412992,
'request_depth_max': 1,
'response_received_count': 32,
'scheduler/dequeued': 31,
'scheduler/dequeued/memory': 31,
'scheduler/enqueued': 31,
'scheduler/enqueued/memory': 31,
'start_time': datetime.datetime(2019, 5, 14, 19, 32, 12, 996979)}
2019-05-14 12:32:21 [scrapy.core.engine] INFO: Spider closed (finished)
This selector is not working:
response.xpath('//div[@class="hotels-review-list-parts-ReviewTitle__reviewTitleText"]/a/@href')
The element on the site is an <a>, not a <div>, and the class name also seems wrong. Perhaps this site appends some random data to the class name, as you can see below:
<span><span>Ótima experiência! Resort amplo, com diversas opções de entretenimento!!!</span></span>
You can try to match only part of the string, for example:
//a[contains(@class, "hotels-review-list-parts-ReviewTitle__reviewTitleText")]
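A sketch of how parse_hotel might use that partial match (assuming the rest of the spider from the question stays the same):
def parse_hotel(self, response):
    # Match the review-title links by a stable part of the class name instead
    # of the full, partly auto-generated value.
    links = response.xpath(
        '//a[contains(@class, "hotels-review-list-parts-ReviewTitle__reviewTitleText")]/@href')
    for href in links:
        yield scrapy.Request(response.urljoin(href.extract()), callback=self.parse_review)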
If you just use Scrapy, there is no way to get the parts of the page that are generated dynamically by JavaScript; you only get the page's source HTML. I think you can use Scrapy-Splash for that.
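For example, a minimal sketch using scrapy-splash (this assumes a Splash instance is running and the scrapy-splash middlewares are configured in settings.py):
from scrapy_splash import SplashRequest

def start_requests(self):
    urls = [
        'https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html'
    ]
    for url in urls:
        # Ask Splash to render the page and wait briefly for JS-generated content
        yield SplashRequest(url, callback=self.parse, args={'wait': 2})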
I was curious to know whether Google authentication could be achieved via Scrapy, so I tried the following code snippet.
My code:
import scrapy
from scrapy.http import FormRequest, Request
from scrapy.crawler import CrawlerProcess
import logging
import json


class LoginSpider(scrapy.Spider):
    name = 'hello'
    start_urls = ['https://accounts.google.com']
    handle_httpstatus_list = [404, 405]

    def parse(self, response):
        print('inside parse')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'abc@gmail.com'},
                                          callback=self.log_password)]

    def log_password(self, response):
        print('inside log_password')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        print('inside after_login')
        print(response.url)
        # response.text is a str, so the substring check works in Python 3
        if "authentication failed" in response.text:
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        else:
            print("Login Successful!!")


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'DOWNLOAD_DELAY': 0.2,
        'HANDLE_HTTPSTATUS_ALL': True
    })
    process.crawl(LoginSpider)
    process.start()
But I'm getting the following output when I run the script.
Output
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.4 (default, Mar 22 2018, 14:05:57) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Darwin-15.6.0-x86_64-i386-64bit
2018-08-15 10:38:05 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 0.2, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-08-15 10:38:05 [scrapy.core.engine] INFO: Spider opened
2018-08-15 10:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-15 10:38:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-15 10:38:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ManageAccount> from <GET https://accounts.google.com>
2018-08-15 10:38:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> from <GET https://accounts.google.com/ManageAccount>
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> (referer: None)
**inside parse**
https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (405) <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
**inside log_password**
https://accounts.google.com/
2018-08-15 10:38:10 [scrapy.core.scraper] ERROR: Spider error processing <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
Traceback (most recent call last):
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "google_login.py", line 79, in log_password
    callback=self.after_login)]
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 48, in from_response
    form = _get_form(response, formname, formid, formnumber, formxpath)
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 77, in _get_form
    raise ValueError("No <form> element found in %s" % response)
ValueError: No <form> element found in <405 https://accounts.google.com/>
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-15 10:38:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1810,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 357598,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 2,
 'downloader/response_status_count/405': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 15, 4, 53, 10, 164911),
 'log_count/DEBUG': 5,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'memusage/max': 41132032,
 'memusage/startup': 41132032,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2018, 8, 15, 4, 53, 5, 463699)}
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Spider closed (finished)
Any help would be appreciated.
Error 405 means that the URL does not accept the HTTP method you sent, in your case the POST generated in parse.
Google's default login page is much more complex than a simple POST form; it relies heavily on JS and Ajax under the hood. To log in with Scrapy you have to use https://accounts.google.com/ServiceLogin?nojavascript=1 as the start URL instead.
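For illustration, here is a minimal sketch of that approach, not your exact spider: it starts from the no-JavaScript login page and submits the form with FormRequest.from_response. The Email/Passwd field names are taken from your code, and the real flow may still require submitting the email and the password in two separate steps (as your spider already does), so treat this as a starting point to verify against the live page:
import scrapy
from scrapy.http import FormRequest


class NoJsLoginSpider(scrapy.Spider):
    # Hypothetical spider name; rename to fit your project.
    name = 'nojs_login'
    # The no-JavaScript page serves a plain HTML form that Scrapy can parse.
    start_urls = ['https://accounts.google.com/ServiceLogin?nojavascript=1']

    def parse(self, response):
        # Fill and submit the first <form> found on the page.
        yield FormRequest.from_response(
            response,
            formdata={'Email': 'user@example.com', 'Passwd': 'password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # response.body is bytes in Python 3, so compare against bytes.
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        self.logger.info("Logged in, landed on %s", response.url)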
I'm following the tutorial for Scrapy.
I used this code from the tutorial:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/)[-2]")
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
When I then run the command scrapy crawl quotes I get the following output:
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWS
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-14 02:19:55 [scrapy.core.engine] INFO: Spider opened
2017-05-14 02:19:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped
2017-05-14 02:19:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/ro
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-14 02:19:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1121,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 6956,
'downloader/response_count': 5,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 14, 0, 19, 56, 125822),
'log_count/DEBUG': 6,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/NotImplementedError': 2,
'start_time': datetime.datetime(2017, 5, 14, 0, 19, 55, 659206)}
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Spider closed (finished)
What is going wrong?
This error means you did not implement the parse method. But according to your post you did, so it is probably an indentation error. Your code should look like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/)[-2]")
        filename = 'filename'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
I tested it and it works.
Shouldn't the line
page = response.url.split("/)[-2]")
be
page = response.url.split("/)[-1]")
as it looks like you are currently selecting the word "page" when you want the number?
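For reference, here is a small sketch of what that line actually does with the URLs used in this spider; the closing quote is misplaced, so the ")[-2]" ends up inside the separator string and no index is applied at all:
url = 'http://quotes.toscrape.com/page/1'

# As written in the question: ")[-2]" is part of the separator, which never
# occurs in the URL, so split() returns the whole URL wrapped in a list.
page = url.split("/)[-2]")
print(page)                # ['http://quotes.toscrape.com/page/1']

# With the quote closed after "/", the index picks out a path segment.
print(url.split("/")[-2])  # 'page'
print(url.split("/")[-1])  # '1'  (the page number the comment above is after)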
I'm following the Scrapy tutorial documentation at http://media.readthedocs.org/pdf/scrapy/0.14/scrapy.pdf and I've verified that items.py and dmoz_spider.py are typed (not cut & pasted) correctly.
The first "hmmm..." part for me was this instruction:
This is the code for our first Spider; save it in a file named dmoz_spider.py under the dmoz/spiders directory
I'm using the latest version of Ubuntu and there wasn't a dmoz folder created, so I've put this code into ~/tutorial/tutorial/spiders. (Was this my first error?)
So here's my dmoz_spider.py script:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

def parse(self, response):
    filename = response.url.split("/")[-2]
    open(filename, 'wb').write(response.body)
In my terminal I type
scrapy crawl dmoz
And I get this:
2012-10-08 13:20:22-0700 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: tutorial)
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled item pipelines:
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-10-08 13:20:22-0700 [dmoz] INFO: Spider opened
2012-10-08 13:20:22-0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-10-08 13:20:22-0700 [dmoz] ERROR: Spider error processing <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 62, in parse
raise NotImplementedError
exceptions.NotImplementedError:
2012-10-08 13:20:22-0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-10-08 13:20:22-0700 [dmoz] ERROR: Spider error processing <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 62, in parse
raise NotImplementedError
exceptions.NotImplementedError:
2012-10-08 13:20:22-0700 [dmoz] INFO: Closing spider (finished)
2012-10-08 13:20:22-0700 [dmoz] INFO: Spider closed (finished)
In my searching, I saw that someone else had said twisted probably wasn't installed... but wouldn't it be installed if I used the Ubuntu package installer for Scrapy?
Thanks in advance!
The parse method in BaseSpider is getting called instead of yours because you have not correctly overridden it: your indentation is wrong, so parse is declared as a function outside the DmozSpider class. Welcome to Python :)
It's nothing to do with Twisted; I can see that Twisted appears in the tracebacks, so it's clearly installed.
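To make the fix concrete, here is the same spider with parse indented one level so it lives inside the class; this is just the tutorial code with the indentation corrected:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    # Indented under the class, so it overrides BaseSpider.parse.
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)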