I'm using scrapy to scrape a website: https://www.sephora.fr/marques/de-a-a-z/.
It used to work well a year ago but it now shows an error:
User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds
It retries five times and then fails completely. I can access the URL in Chrome, but it doesn't work in Scrapy. I've tried using custom user agents and emulating browser request headers, but it still doesn't work.
Below is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import json
import requests
from urllib.parse import parse_qsl, urlencode
import re
from ..pipelines import Pipeline


class SephoraSpider(scrapy.Spider):
    """
    The SephoraSpider object gets you the data on all products hosted on sephora.fr
    """

    name = 'sephora'
    allowed_domains = ['sephora.fr']
    # the url of all the brands
    start_urls = ['https://www.sephora.fr/marques-de-a-a-z/']
    custom_settings = {
        'DOWNLOAD_TIMEOUT': '180',
    }

    def __init__(self):
        self.base_url = 'https://www.sephora.fr'

    def parse(self, response):
        """
        Parses the response of a webpage we are given when we start crawling the first webpage.
        This method is automatically launched by Scrapy when crawling.

        :param response: the response from the webpage triggered by a GET query while crawling.
            A Response object represents an HTTP response, which is usually downloaded (by the Downloader)
            and fed to the Spiders for processing.
        :return: the results of parse_brand().
        :rtype: scrapy.Request()
        """
        # if we are given the url of the brand we are interested in (burberry), we send an http request to it
        if response.url == "https://www.sephora.fr/marques/de-a-a-z/burberry-burb/":
            yield scrapy.Request(url=response.url, callback=self.parse_brand)
        # otherwise it means we are visiting another html object (another brand, a higher-level url ...)
        # we call the url back with another method
        else:
            self.log("parse: I just visited: " + response.url)
            urls = response.css('a.sub-category-link::attr(href)').extract()
            if urls:
                for url in urls:
                    yield scrapy.Request(url=self.base_url + url, callback=self.parse_brand)

    ...
Scrapy log:
(scr_env) antoine.cop1#protonmail.com:~/environment/bass2/scraper (master) $ scrapy crawl sephora
2022-03-13 16:39:19 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: nosetime_scraper)
2022-03-13 16:39:19 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.6.9 (default, Dec 8 2021, 21:08:43) - [GCC 8.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.4.0-1068-aws-x86_64-with-Ubuntu-18.04-bionic
2022-03-13 16:39:19 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nosetime_scraper',
'CONCURRENT_REQUESTS': 1,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'DOWNLOAD_TIMEOUT': '180',
'EDITOR': '',
'NEWSPIDER_MODULE': 'nosetime_scraper.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['nosetime_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2022-03-13 16:39:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-13 16:39:19 [scrapy.extensions.telnet] INFO: Telnet Password: af81c5b648cc3542
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-13 16:39:19 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-13 16:39:19 [scrapy.core.engine] INFO: Spider opened
2022-03-13 16:39:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:39:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-13 16:40:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:41:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:42:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:42:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/robots.txt> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds..
2022-03-13 16:42:19 [py.warnings] WARNING: /home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/engine.py:276: ScrapyDeprecationWarning: Passing a 'spider' argument to ExecutionEngine.download is deprecated
return self.download(result, spider) if isinstance(result, Request) else result
2022-03-13 16:43:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/robots.txt took longer than 180.0 seconds..
2022-03-13 16:49:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:50:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:51:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:51:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:52:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:53:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:54:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:54:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 2 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:55:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:56:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:57:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 16:57:19 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 3 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:57:19 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sephora.fr/marques-de-a-a-z/>
Traceback (most recent call last):
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
cast(Failure, result).throwExceptionIntoGenerator, gen
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 62, in run
return f(*args, **kwargs)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 858, in _runCallbacks
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 16:57:19 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-13 16:57:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6,
'downloader/request_bytes': 1881,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 1080.231435,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 13, 16, 57, 19, 904633),
'log_count/DEBUG': 5,
'log_count/ERROR': 4,
'log_count/INFO': 28,
'log_count/WARNING': 1,
'memusage/max': 72749056,
'memusage/startup': 70950912,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.internet.error.TimeoutError': 4,
"robotstxt/exception_count/<class 'twisted.internet.error.TimeoutError'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 3, 13, 16, 39, 19, 673198)}
2022-03-13 16:57:19 [scrapy.core.engine] INFO: Spider closed (finished)
I am going to look at the request headers using Fiddler and do some tests. Maybe Scrapy is sending a Connection: close header by default, and that is why I'm not getting any response from the Sephora site?
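In the meantime, a quick way to see exactly which headers Scrapy sends (without Fiddler) is a small diagnostic downloader middleware that logs every outgoing request's headers. This is only a sketch; the class name and middleware priority are my own choices, not part of the project above:

# nosetime_scraper/middlewares.py (hypothetical diagnostic helper)
import logging

logger = logging.getLogger(__name__)


class HeaderLoggingMiddleware:
    """Log the headers of every outgoing request so they can be compared
    with what a real browser sends."""

    def process_request(self, request, spider):
        logger.info("Outgoing headers for %s: %s",
                    request.url, request.headers.to_unicode_dict())
        return None  # returning None lets the request continue normally

# settings.py (sketch):
# DOWNLOADER_MIDDLEWARES = {
#     'nosetime_scraper.middlewares.HeaderLoggingMiddleware': 543,
# }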
Here are the logs when I chose not to respect robots.txt:
(scr_env) antoine.cop1#protonmail.com:~/environment/bass2/scraper (master) $ scrapy crawl sephora
2022-03-13 23:23:38 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: nosetime_scraper)
2022-03-13 23:23:38 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.6.9 (default, Dec 8 2021, 21:08:43) - [GCC 8.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Linux-5.4.0-1068-aws-x86_64-with-Ubuntu-18.04-bionic
2022-03-13 23:23:38 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nosetime_scraper',
'CONCURRENT_REQUESTS': 1,
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 7,
'DOWNLOAD_TIMEOUT': '180',
'EDITOR': '',
'NEWSPIDER_MODULE': 'nosetime_scraper.spiders',
'SPIDER_MODULES': ['nosetime_scraper.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 '
'Safari/537.36'}
2022-03-13 23:23:38 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-13 23:23:38 [scrapy.extensions.telnet] INFO: Telnet Password: 3f4205a34aff02c5
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-13 23:23:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-13 23:23:38 [scrapy.core.engine] INFO: Spider opened
2022-03-13 23:23:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:23:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-13 23:24:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:25:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:26:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:26:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 1 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:27:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:28:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:29:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:29:38 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 2 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:30:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:31:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:32:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-13 23:32:38 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.sephora.fr/marques-de-a-a-z/> (failed 3 times): User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:32:38 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.sephora.fr/marques-de-a-a-z/>
Traceback (most recent call last):
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
cast(Failure, result).throwExceptionIntoGenerator, gen
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 62, in run
return f(*args, **kwargs)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 858, in _runCallbacks
current.result, *args, **kwargs
File "/home/ubuntu/environment/bass2/scraper/scr_env/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 360, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.sephora.fr/marques-de-a-a-z/ took longer than 180.0 seconds..
2022-03-13 23:32:39 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-13 23:32:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 3,
'downloader/request_bytes': 951,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'elapsed_time_seconds': 540.224149,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 3, 13, 23, 32, 39, 59500),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 19,
'memusage/max': 72196096,
'memusage/startup': 70766592,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.TimeoutError': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 3, 13, 23, 23, 38, 835351)}
2022-03-13 23:32:39 [scrapy.core.engine] INFO: Spider closed (finished)
And here is my environment, pip list output:
(scr_env) C:\Users\antoi\Documents\Programming\Work\scrapy-scraper>pip list
Package Version
------------------- ------------------
async-generator 1.10
attrs 21.4.0
Automat 20.2.0
beautifulsoup4 4.10.0
blis 0.7.5
bs4 0.0.1
catalogue 2.0.6
certifi 2021.10.8
cffi 1.15.0
charset-normalizer 2.0.12
click 8.0.4
colorama 0.4.4
configparser 5.2.0
constantly 15.1.0
crayons 0.4.0
cryptography 36.0.1
cssselect 1.1.0
cymem 2.0.6
DAWG-Python 0.7.2
docopt 0.6.2
en-core-web-sm 3.2.0
et-xmlfile 1.1.0
geographiclib 1.52
geopy 2.2.0
h11 0.13.0
h2 3.2.0
hpack 3.0.0
hyperframe 5.2.0
hyperlink 21.0.0
idna 3.3
incremental 21.3.0
itemadapter 0.4.0
itemloaders 1.0.4
Jinja2 3.0.3
jmespath 0.10.0
langcodes 3.3.0
libretranslatepy 2.1.1
lxml 4.8.0
MarkupSafe 2.1.0
murmurhash 1.0.6
numpy 1.22.2
openpyxl 3.0.9
outcome 1.1.0
packaging 21.3
pandas 1.4.1
parsel 1.6.0
pathy 0.6.1
pip 22.0.4
preshed 3.0.6
priority 1.3.0
Protego 0.2.1
pyaes 1.6.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
pydantic 1.8.2
PyDispatcher 2.0.5
pymongo 3.11.0
pymorphy2 0.9.1
pymorphy2-dicts-ru 2.4.417127.4579844
pyOpenSSL 22.0.0
pyparsing 3.0.7
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2021.3
queuelib 1.6.2
requests 2.27.1
rsa 4.8
ru-core-news-md 3.2.0
Scrapy 2.5.1
selenium 4.1.2
service-identity 21.1.0
setuptools 56.0.0
six 1.16.0
smart-open 5.2.1
sniffio 1.2.0
sortedcontainers 2.4.0
soupsieve 2.3.1
spacy 3.2.2
spacy-legacy 3.0.9
spacy-loggers 1.0.1
srsly 2.4.2
Telethon 1.24.0
thinc 8.0.13
tqdm 4.62.3
translate 3.6.1
trio 0.20.0
trio-websocket 0.9.2
Twisted 22.1.0
twisted-iocpsupport 1.0.2
typer 0.4.0
typing_extensions 4.1.1
urllib3 1.26.8
w3lib 1.22.0
wasabi 0.9.0
webdriver-manager 3.5.3
wsproto 1.0.0
zope.interface 5.4.0
When I run scrapy runspider sephora.py, I notice it doesn't accept my relative import (from ..pipelines import Pipeline):
(scr_env) C:\Users\antoi\Documents\Programming\Work\scrapy-scraper\nosetime_scraper\spiders>scrapy runspider sephora.py
2022-03-14 01:00:27 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: nosetime_scraper)
2022-03-14 01:00:27 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.
9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform
Windows-10-10.0.19043-SP0
2022-03-14 01:00:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
Usage
=====
scrapy runspider [options] <spider_file>
runspider: error: Unable to load 'sephora.py': attempted relative import with no known parent package
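That error is expected: scrapy runspider loads sephora.py as a standalone module, so a relative import has no parent package. Running scrapy crawl sephora from the project root (where scrapy.cfg lives) avoids it; alternatively, an absolute import works with both commands. A minimal sketch, assuming the default nosetime_scraper package layout:

# sephora.py -- replace the relative import with an absolute one
from nosetime_scraper.pipelines import Pipeline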
Here is my settings.py:
# Scrapy settings for nosetime_scraper project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'nosetime_scraper'
SPIDER_MODULES = ['nosetime_scraper.spiders']
NEWSPIDER_MODULE = 'nosetime_scraper.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 7
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'nosetime_scraper.middlewares.NosetimeScraperSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'nosetime_scraper.middlewares.NosetimeScraperDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'nosetime_scraper.pipelines.NosetimeScraperPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
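If the header comparison shows differences, one place to apply browser-like headers project-wide is the commented-out DEFAULT_REQUEST_HEADERS block above. The values below are only an illustrative sketch (including the keep-alive Connection header I speculated about), not a confirmed fix for the timeout:

# settings.py (sketch)
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'fr-FR,fr;q=0.9,en;q=0.8',
    'Connection': 'keep-alive',
}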
I'm following the developer-tools tutorial on scrapy.org:
https://docs.scrapy.org/en/latest/topics/developer-tools.html#topics-developer-tools
At the "The network-tool" section, when running:
scrapy shell "quotes.toscrape.com/scroll"
Inside the terminal, I'm getting this:
2020-11-24 11:22:37 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: quotetutorial)
2020-11-24 11:22:37 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.2.1, Platform Linux-5.4.0-54-generic-x86_64-with-glibc2.29
2020-11-24 11:22:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-11-24 11:22:37 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'quotetutorial',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'quotetutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['quotetutorial.spiders']}
2020-11-24 11:22:37 [scrapy.extensions.telnet] INFO: Telnet Password: a6b7e55ad47ad876
2020-11-24 11:22:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-11-24 11:22:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-11-24 11:22:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-11-24 11:22:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-11-24 11:22:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-11-24 11:22:37 [scrapy.core.engine] INFO: Spider opened
2020-11-24 11:22:37 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-11-24 11:22:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/scroll> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f4845b32070>
[s] item {}
[s] request <GET http://quotes.toscrape.com/scroll>
[s] response <200 http://quotes.toscrape.com/scroll>
[s] settings <scrapy.settings.Settings object at 0x7f4845b2fe20>
[s] spider <DefaultSpider 'default' at 0x7f4844ea69a0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
All good so far.
Then I run:
view(response)
and I get this in the terminal:
>>> view(response)
True
But in my browser I get:
file:///tmp/tmp_ssijiw4.html
Your file couldn’t be accessed
It may have been moved, edited, or deleted.
ERR_FILE_NOT_FOUND
I've tried finding answers to the issue online without any useful results.
Why is this happening and how can I fix it?
This is because of Chromium's limited access to the file system:
https://askubuntu.com/questions/1184357/why-cant-chromium-suddenly-access-any-partition-except-for-home
Change your default browser to Firefox:
https://askubuntu.com/questions/79305/how-do-i-change-my-default-browser
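If changing the default browser is not an option, a workaround from the shell is to write the response to a location the browser can read (e.g. your home directory) and open it with an explicitly chosen browser. A sketch, assuming Firefox is installed:

>>> from pathlib import Path
>>> import webbrowser
>>> path = Path.home() / "scrapy_response.html"   # any readable location
>>> path.write_bytes(response.body)
>>> webbrowser.get("firefox").open(path.as_uri())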
I keep getting a syntax error when trying to run Scrapy on an AWS Ubuntu 18.04 instance:
scrapy crawl pcz -o px.csv
Here's the log:
ubuntu#ip-172-31-60-245:~/free_proxy/free_proxy$ scrapy crawl pcz -o px.csv
2020-08-27 14:09:37 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: free_proxy)
2020-08-27 14:09:37 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 17.9.0, Python 3.8.5 (default, Jul 20 2020, 19:48:14) - [GCC 7.5.0], pyOpenSSL 17.5.0 (OpenSSL 1.1.1 11 Sep 2018), cryptography 2.1.4, Platform Linux-5.3.0-1033-aws-x86_64-with-glibc2.27
2020-08-27 14:09:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-08-27 14:09:37 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'free_proxy',
'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
'DOWNLOAD_TIMEOUT': 10,
'NEWSPIDER_MODULE': 'free_proxy.spiders',
'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408, 401],
'RETRY_TIMES': 10,
'SPIDER_MODULES': ['free_proxy.spiders']}
2020-08-27 14:09:37 [scrapy.middleware] WARNING: Disabled TelnetConsole: TELNETCONSOLE_ENABLED setting is True but required twisted modules failed to import:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/extensions/telnet.py", line 15, in <module>
from twisted.conch import manhole, telnet
File "/usr/lib/python3/dist-packages/twisted/conch/manhole.py", line 154
def write(self, data, async=False):
^
SyntaxError: invalid syntax
2020-08-27 14:09:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
Temporarily disabling observer LegacyLogObserverWrapper(<bound method PythonLoggingObserver.emit of <twisted.python.log.PythonLoggingObserver object at 0x7fe474901f10>>) due to exception: [Failure instance: Traceback: <class 'TypeError'>: _findCaller() takes from 1 to 2 positional arguments but 3 were given
/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/cmdline.py:153:_run_command
/usr/lib/python3/dist-packages/twisted/internet/defer.py:954:__del__
/usr/lib/python3/dist-packages/twisted/logger/_logger.py:261:critical
/usr/lib/python3/dist-packages/twisted/logger/_logger.py:135:emit
--- <exception caught here> ---
/usr/lib/python3/dist-packages/twisted/logger/_observer.py:131:__call__
/usr/lib/python3/dist-packages/twisted/logger/_legacy.py:93:__call__
/usr/lib/python3/dist-packages/twisted/python/log.py:595:emit
/usr/lib/python3/dist-packages/twisted/logger/_legacy.py:154:publishToNewObserver
/usr/lib/python3/dist-packages/twisted/logger/_stdlib.py:115:__call__
/usr/lib/python3.8/logging/__init__.py:1500:log
/usr/lib/python3.8/logging/__init__.py:1565:_log
]
Temporarily disabling observer LegacyLogObserverWrapper(<bound method PythonLoggingObserver.emit of <twisted.python.log.PythonLoggingObserver object at 0x7fe474901f10>>) due to exception: [Failure instance: Traceback: <class 'TypeError'>: _findCaller() takes from 1 to 2 positional arguments but 3 were given
/home/ubuntu/.local/lib/python3.8/site-packages/scrapy/cmdline.py:153:_run_command
/usr/lib/python3/dist-packages/twisted/internet/defer.py:963:__del__
/usr/lib/python3/dist-packages/twisted/logger/_logger.py:181:failure
/usr/lib/python3/dist-packages/twisted/logger/_logger.py:135:emit
--- <exception caught here> ---
/usr/lib/python3/dist-packages/twisted/logger/_observer.py:131:__call__
/usr/lib/python3/dist-packages/twisted/logger/_legacy.py:93:__call__
/usr/lib/python3/dist-packages/twisted/python/log.py:595:emit
/usr/lib/python3/dist-packages/twisted/logger/_legacy.py:154:publishToNewObserver
/usr/lib/python3/dist-packages/twisted/logger/_stdlib.py:115:__call__
/usr/lib/python3.8/logging/__init__.py:1500:log
/usr/lib/python3.8/logging/__init__.py:1565:_log
]
I'm unable to get the Scrapy spider to crawl my Discover account page.
I'm new to Scrapy. I've read all the relevant documentation, but can't seem to get the form request to submit correctly. I've added formname, userID, and password.
import scrapy


class DiscoverSpider(scrapy.Spider):
    name = "Discover"
    start_urls = ['https://www.discover.com']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formname='loginForm',
            formdata={'userID': 'userID', 'password': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeeded before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return
After form submission, I expect the spider to crawl my account page. Instead, the spider is redirected to 'https://portal.discover.com/psv1/notification.html'. The following is the spider console output:
2018-12-26 11:39:46 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot:
MoneySpiders)
2018-12-26 11:39:46 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0,
libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0,
Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)],
pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1,
Platform Windows-10-10.0.17134-SP0
2018-12-26 11:39:46 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'MoneySpiders', 'NEWSPIDER_MODULE': 'MoneySpiders.spiders',
'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['MoneySpiders.spiders']}
2018-12-26 11:39:46 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-12-26 11:39:46 [scrapy.middleware] INFO: Enabled downloader
middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-12-26 11:39:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-12-26 11:39:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-12-26 11:39:47 [scrapy.core.engine] INFO: Spider opened
2018-12-26 11:39:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at
0 pages/min), scraped 0 items (at 0 items/min)
2018-12-26 11:39:47 [scrapy.extensions.telnet] DEBUG: Telnet console
listening on
2018-12-26 11:39:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://www.discover.com/robots.txt> (referer: None)
2018-12-26 11:39:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://www.discover.com> (referer: None)
2018-12-26 11:39:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://portal.discover.com/robots.txt> (referer: None)
2018-12-26 11:39:48 [scrapy.downloadermiddlewares.redirect] DEBUG:
Redirecting (302) to <GET
https://portal.discover.com/psv1/notification.html> from <POST
https://portal.discover.com/customersvcs/universalLogin/signin>
2018-12-26 11:39:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://portal.discover.com/psv1/notification.html> (referer:
https://www.discover.com)
2018-12-26 11:39:48 [scrapy.core.scraper] ERROR: Spider error processing
<GET https://portal.discover.com/psv1/notification.html> (referer:
https://www.discover.com)
From the response I got this:
Your account cannot currently be accessed. Outdated browsers can
expose your computer to security risks. To get the best experience on
Discover.com, you may need to update your browser to the latest
version and try again.
So it looks like the website doesn't recognize your spider as a valid browser. To solve that, you will need to set a proper User-Agent and maybe some other headers commonly used by real browsers.
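A minimal sketch of that suggestion applied to the spider above; the User-Agent string is just an example of a current desktop browser, not a specific value the site requires:

    # inside DiscoverSpider
    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formname='loginForm',
            formdata={'userID': 'userID', 'password': 'password'},
            # present the request as coming from an up-to-date desktop browser
            headers={
                'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                               'AppleWebKit/537.36 (KHTML, like Gecko) '
                               'Chrome/71.0.3578.98 Safari/537.36'),
            },
            callback=self.after_login
        )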
What I'm trying to do is trigger a function (abc) when a Scrapy spider is opened, which should be triggered by Scrapy's 'signals'.
(Later on I want to change it to 'closed' to save the stats from each spider to the database for daily monitoring.)
For now I tried this simple solution, just to print something out, which I would expect to see in the console the moment the spider is opened when I run the CrawlerProcess.
What happens is that the crawler runs fine, but it does not print the output of 'abc' at the moment the spider is opened, which should trigger it. At the end I posted what I see in the console, which is just that the spider runs perfectly fine.
Why is the abc function not triggered by the signal at the point where I see 'INFO: Spider opened' in the log (or at all)?
MyCrawlerProcess.py:
from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())


def abc():
    print '######################works!######################'


def from_crawler(crawler):
    crawler.signals.connect(abc, signal=signals.spider_opened)


process.crawl('Dissident')
process.start()  # the script will block here until the crawling is finished
Console output:
2016-03-17 13:00:14 [scrapy] INFO: Scrapy 1.0.4 started (bot: Chrome 41.0.2227.1. Mozilla/5.0 (Macintosh; Intel Mac Osource)
2016-03-17 13:00:14 [scrapy] INFO: Optional features available: ssl, http11
2016-03-17 13:00:14 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapytry.spiders', 'SPIDER_MODULES': ['scrapytry.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'Chrome 41.0.2227.1. Mozilla/5.0 (Macintosh; Intel Mac Osource'}
2016-03-17 13:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-17 13:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-17 13:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-17 13:00:14 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilesPipeline, ScrapytryPipeline
2016-03-17 13:00:14 [scrapy] INFO: Spider opened
2016-03-17 13:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-17 13:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-17 13:00:14 [scrapy] DEBUG: Crawled (200) <GET http://www.xyz.zzm/> (referer: None)
Simply defining from_crawler isn't enough, as it's not hooked into the Scrapy framework. Take a look at the docs here, which show how to create an extension that does exactly what you're attempting to do. Be sure to follow the instructions for enabling the extension via the MYEXT_ENABLED setting.
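For reference, a minimal sketch of that extension pattern (the module path and print message are placeholders to adapt; the structure follows the extension example in the Scrapy docs):

# myproject/extensions.py (hypothetical module)
from scrapy import signals
from scrapy.exceptions import NotConfigured


class SpiderOpenedLogger:
    @classmethod
    def from_crawler(cls, crawler):
        # only enable the extension when the setting is switched on
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        print('######################works!######################')


# settings.py (sketch):
# MYEXT_ENABLED = True
# EXTENSIONS = {'myproject.extensions.SpiderOpenedLogger': 500}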