Scrapy always raises the same error for any command from the command prompt - scrapy

I'm trying to learn Scrapy on Bash on Ubuntu on Windows 10. I created a spider (yelprest) using the genspider command, and then created another spider (quotes_spider) directly by writing the spider file myself, following the official tutorial (https://doc.scrapy.org/en/latest/intro/tutorial.html).
The first spider is not yet tested, but when I tried to work through the tutorial with the second spider and run it, I got an error that points to the first spider. I also get the same error when I run any other Scrapy command, such as version. Below is the error:
(BashEnv) root > scrapy version
Traceback (most recent call last):
File "/mnt/s/BashEnv/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 148, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/mnt/s/BashEnv/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/mnt/s/BashEnv/Scrapy/Scrapy/spiders/yelprest.py", line 14
rules = (
^
IndentationError: unexpected indent
(BashEnv) root >
I don't understand why I get the same error for every command I run.

There is an error in your yelprest.py file (at line 14 or before): it is not valid Python. Fix that error and everything will work. Make sure your file is consistently indented and does not mix spaces and tabs.
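For reference, a minimal, correctly indented spider skeleton looks like this (the domain, URLs, and rule below are hypothetical placeholders; the point is that every class-level attribute lines up with four-space indentation):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class YelprestSpider(CrawlSpider):
    name = 'yelprest'
    allowed_domains = ['yelp.com']
    start_urls = ['https://www.yelp.com/']

    # 'rules' must line up with 'name' above; an extra leading space or a
    # stray tab here raises exactly this IndentationError
    rules = (
        Rule(LinkExtractor(allow=r'/biz/'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}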
Edit:
To make sure the error is in this file, just delete it: if everything works without the file, the error must be there!
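If you would rather not delete it outright, moving it aside runs the same test non-destructively (path taken from your traceback):

mv /mnt/s/BashEnv/Scrapy/Scrapy/spiders/yelprest.py /tmp/
scrapy version   # should now print the version with no traceback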
Update:
Your question does not state it explicitly, but judging by your comment, what you are really asking is: "why does Scrapy load my spider code for every command?". And the answer is: because Scrapy is designed that way. Some commands can be run only inside a project, like check or crawl. Some commands may be run anywhere, like startproject. But inside a Scrapy project, ANY command will load ALL your code. Scrapy was made this way.
For example, I have a project named crawler (I know, very descriptive!):
$ cd ~
$ scrapy version
Scrapy 1.4.0
$ cd crawler/
$ scrapy version
2017-10-31 14:47:42 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: crawler)
2017-10-31 14:47:42 [scrapy.utils.log] INFO: Overridden settings: {...}
Scrapy 1.4.0
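Under the hood, every command run inside a project builds a CrawlerProcess, whose spider loader imports every module under the packages listed in SPIDER_MODULES - you can see spiderloader.py calling walk_modules in the traceback above. A rough sketch of that idea (not Scrapy's exact code):

from importlib import import_module
import pkgutil

def walk_spider_modules(package_name):
    # Import the spiders package itself, then every module inside it.
    # A SyntaxError or IndentationError in any submodule surfaces here,
    # before any command gets to run.
    package = import_module(package_name)  # e.g. 'myproject.spiders'
    yield package
    for _, name, _ in pkgutil.iter_modules(package.__path__):
        yield import_module(package_name + '.' + name)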

Related

How can I use the GPU in Google Colab to train a spaCy relation extraction model? [E002] Can't find factory for 'transformer' for language English (en)

I am running a relation extraction spaCy model on Google Colab. It works when I use !spacy project run all or !spacy project run train_cpu, but when I run !spacy project run train_gpu it returns the following error:
================================= train_gpu =================================
Running command: /usr/bin/python3 -m spacy train configs/rel_trf.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
ℹ Saving to output directory: training
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/spacy/__main__.py", line 4, in <module>
setup_cli()
File "/usr/local/lib/python3.7/dist-packages/spacy/cli/_util.py", line 71, in setup_cli
command(prog_name=COMMAND)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/usr/local/lib/python3.7/dist-packages/spacy/cli/train.py", line 45, in train_cli
train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
File "/usr/local/lib/python3.7/dist-packages/spacy/cli/train.py", line 72, in train
nlp = init_nlp(config, use_gpu=use_gpu)
File "/usr/local/lib/python3.7/dist-packages/spacy/training/initialize.py", line 41, in init_nlp
nlp = load_model_from_config(raw_config, auto_fill=True)
File "/usr/local/lib/python3.7/dist-packages/spacy/util.py", line 531, in load_model_from_config
validate=validate,
File "/usr/local/lib/python3.7/dist-packages/spacy/language.py", line 1784, in from_config
raw_config=raw_config,
File "/usr/local/lib/python3.7/dist-packages/spacy/language.py", line 794, in add_pipe
validate=validate,
File "/usr/local/lib/python3.7/dist-packages/spacy/language.py", line 652, in create_pipe
raise ValueError(err)
ValueError: [E002] Can't find factory for 'transformer' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).
Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, spancat, textcat_multilabel, relation_extractor, en.lemmatizer
I tried both of the following installations (interchangeably) in case the GPU wasn't being invoked correctly:
!pip install -U spacy[cuda101]
#!pip install -U spacy-nightly --pre
You haven't installed spacy-transformers. The easiest way to get it is probably to run spacy download en_core_web_trf, which pulls it in as a dependency.
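In other words, something along these lines should make the 'transformer' factory available (either command should work; the second also downloads the pretrained transformer pipeline):

!pip install spacy-transformers
# or
!python -m spacy download en_core_web_trf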
I would recommend you check the install quickstart again - I don't think spacy-nightly has been updated since v3 was released almost a year ago. Also check the Discussions FAQ - we haven't heard reports of it for a while, but at one point you had to specifically not install cupy (that is, not use pip install spacy[cuda101]) in order to get GPU support on Colab.

Cannot run dask-mpi with Python 3.7 -- timeout when connecting client to dask-mpi scheduler

I'm attempting to run the Dask-MPI "Getting Started" (http://mpi.dask.org/en/latest/) example in a fresh Anaconda environment.
I set up an environment using
conda create -n dask-mpi -c conda-forge python=3.7 dask-mpi
conda activate dask-mpi
Inside the environment, I run
mpirun -np 4 dask-mpi --scheduler-file ./scheduler.json
Then, from a python interpreter on the same machine (and in the same folder), I run
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 712, in __init__
self.start(timeout=timeout)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 858, in start
sync(self.loop, self._start, **kwargs)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
six.reraise(*error[0])
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
result[0] = yield future
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 954, in _start
yield self._ensure_connected(timeout=timeout)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 1015, in _ensure_connected
timedelta(seconds=timeout), self._update_scheduler_info()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
The terminal I ran dask-mpi from shows no output suggesting that anything is trying to connect. I have verified that the port in question, 8786, is open. I have also verified via a debugger that the client is getting the correct address from the scheduler file.
I've tried this in quite a few different environments and on a few different machines, including a fresh Ubuntu 18.04 docker container. I'm completely at a loss for what steps I might be missing.
It turns out this was due to a bug in newer versions of dask.distributed (1.25.3) that broke the behavior of dask-mpi. This is fixed as of dask-mpi 1.0.3 (https://github.com/dask/dask-mpi/releases/tag/1.0.3).
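So in an environment like the one above, upgrading should pick up the fix (assuming the release is available on conda-forge; pip works as well):

conda install -c conda-forge "dask-mpi>=1.0.3"
# or: pip install --upgrade "dask-mpi>=1.0.3"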

Issues while using Snappy for tensorflow preprocessing using BeamIO

While using Apache Beam I/O for preprocessing data, the snappy library was a nice-to-have module for compression, but the file transformation doesn't seem to work: it cannot find the crc32c function in the library. I'm using python-snappy version 0.5.2.
The error looks like this:
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
ERROR:root:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7f1dd1d60e50>, due to an exception.
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/direct/executor.py", line 312, in call
side_input_values)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/direct/executor.py", line 347, in attempt_call
evaluator.process_element(value)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/direct/transform_evaluator.py", line 551, in process_element
self.runner.process(element)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.py", line 390, in process
self._reraise_augmented(exn)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.py", line 388, in process
self.do_fn_invoker.invoke_process(windowed_value)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.py", line 281, in invoke_process
self._invoke_per_window(windowed_value)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/common.py", line 307, in _invoke_per_window
windowed_value, self.process_method(*args_for_process))
File "/usr/local/lib/python2.7/dist-packages/apache_beam/typehints/typecheck.py", line 63, in process
return self.wrapper(self.dofn.process, args, kwargs)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/typehints/typecheck.py", line 81, in wrapper
result = method(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/iobase.py", line 965, in process
self.writer.write(element)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", line 299, in write
self.sink.write_record(self.temp_handle, value)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", line 129, in write_record
self.write_encoded_record(file_handle, self.coder.encode(value))
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.py", line 235, in write_encoded_record
_TFRecordUtil.write_record(file_handle, value)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.py", line 97, in write_record
struct.pack('<I', cls._masked_crc32c(encoded_length)), #
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.py", line 77, in _masked_crc32c
crc = crc32c_fn(value)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/tfrecordio.py", line 43, in _default_crc32c_fn
_default_crc32c_fn.fn = snappy._crc32c # pylint: disable=protected-access
AttributeError: 'module' object has no attribute '_crc32c' [while running 'WriteTrainData/Write/WriteImpl/WriteBundles']
Could anyone help me use snappy with TensorFlow correctly?
Thank you!
I just hit this issue; I think it is due to Beam being a little careless about versions of optional test-dependencies (in this case, tensorflow and python-snappy).
The problematic code:
import snappy
snappy._crc32c
works in python-snappy version 0.5.1 but not in 0.5.2 (the latest version).
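A quick way to check which situation your environment is in from a Python prompt:

import snappy
# True on python-snappy 0.5.1; False on 0.5.2, where the private
# attribute that Beam reaches for is gone:
print(hasattr(snappy, '_crc32c'))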
I got these Beam tests passing by installing python-snappy 0.5.1 via:
pip install \
--upgrade --ignore-installed \
python-snappy==0.5.1 \
--global-option=build_ext \
--global-option="-I/usr/local/include" \
--global-option="-L/usr/local/lib"
On OS X I needed the three --global-option flags; without them it doesn't find my snappy headers (symptom: errors about #include <snappy-c.h>) and library files, which brew install snappy placed in /usr/local/include and /usr/local/lib, respectively.
The flags before that seem necessary to override pip's default of giving me the latest version.

LetsEncrypt Certbot-Auto freezes when trying to run any command on Apache

I am trying to get LetsEncrypt SSL certificates installed on a CentOS 6 server using Certbot-Auto; however, no matter what I try, it just hangs. The Apache version is 2.2.15.
Command:
./certbot-auto -v
When I press CTRL + C to exit the program, it takes about 15 seconds and then exits with a stack trace:
Exiting abnormally:
Traceback (most recent call last):
File "/opt/eff.org/certbot/venv/bin/letsencrypt", line 9, in <module>
load_entry_point('letsencrypt==0.7.0', 'console_scripts', 'letsencrypt')()
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/main.py", line 1240, in main
return config.func(config, plugins)
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/main.py", line 981, in run
installer, authenticator = plug_sel.choose_configurator_plugins(config, plugins, "run")
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/plugins/selection.py", line 189, in choose_configurator_plugins
authenticator = installer = pick_configurator(config, req_inst, plugins)
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/plugins/selection.py", line 25, in pick_configurator
(interfaces.IAuthenticator, interfaces.IInstaller))
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/plugins/selection.py", line 77, in pick_plugin
verified.prepare()
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/plugins/disco.py", line 248, in prepare
return [plugin_ep.prepare() for plugin_ep in six.itervalues(self._plugins)]
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/plugins/disco.py", line 248, in <listcomp>
return [plugin_ep.prepare() for plugin_ep in six.itervalues(self._plugins)]
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot/plugins/disco.py", line 130, in prepare
self._initialized.prepare()
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot_apache/configurator.py", line 225, in prepare
self.parser = self.get_parser()
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot_apache/override_centos.py", line 39, in get_parser
self.version, configurator=self)
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot_apache/override_centos.py", line 47, in __init__
super(CentOSParser, self).__init__(*args, **kwargs)
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot_apache/parser.py", line 74, in __init__
if self.find_dir("Define", exclude=False):
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/certbot_apache/parser.py", line 401, in find_dir
"%s//*[self::directive=~regexp('%s')]" % (start, regex))
File "/opt/eff.org/certbot/venv/lib64/python3.4/site-packages/augeas.py", line 413, in match
ctypes.byref(array))
KeyboardInterrupt
Please see the logfiles in /var/log/letsencrypt for more details.
I thought it might be a Python version issue, but when I checked, the server is running Python 2.6.6, which, according to the Certbot system requirements, is acceptable.
Letsencrypt.log
When I checked the log, it contained exactly the same stack trace the script had reported.
Any ideas?

Cannot install or use scrapy on OS X -- blocked by proxy (I think)

So I am really having a tough time with something that should be very easy. At work we have a lot of annoying proxies etc., and I'm pretty sure that's involved here. Anyhow, when I try to install scrapy, I get "Connection reset by peer" in the middle of downloading lxml, always 37% of the way in:
root#rcmac (~ ): pip install scrapy
Requirement already satisfied (use --upgrade to upgrade): scrapy in /Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg
Requirement already satisfied (use --upgrade to upgrade): Twisted>=10.0.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from scrapy)
Requirement already satisfied (use --upgrade to upgrade): w3lib>=1.8.0 in /Library/Python/2.7/site-packages (from scrapy)
Collecting queuelib (from scrapy)
Using cached queuelib-1.2.2-py2.py3-none-any.whl
Collecting lxml (from scrapy)
/Library/Python/2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:79: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Downloading lxml-3.4.2.tar.gz (3.5MB)
37% |############ | 1.3MB 186kB/s eta 0:00:12 Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 246, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 342, in run
requirement_set.prepare_files(finder)
File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 345, in prepare_files
functools.partial(self._prepare_file, finder))
File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 290, in _walk_req_to_install
more_reqs = handler(req_to_install)
File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 487, in _prepare_file
download_dir, do_download, session=self.session,
File "/Library/Python/2.7/site-packages/pip/download.py", line 827, in unpack_url
session,
File "/Library/Python/2.7/site-packages/pip/download.py", line 673, in unpack_http_url
from_path, content_type = _download_http_url(link, session, temp_dir)
File "/Library/Python/2.7/site-packages/pip/download.py", line 888, in _download_http_url
_download_url(resp, link, content_file)
File "/Library/Python/2.7/site-packages/pip/download.py", line 621, in _download_url
for chunk in progress_indicator(resp_read(4096), 4096):
File "/Library/Python/2.7/site-packages/pip/utils/ui.py", line 133, in iter
for x in it:
File "/Library/Python/2.7/site-packages/pip/download.py", line 586, in resp_read
decode_content=False):
File "/Library/Python/2.7/site-packages/pip/_vendor/requests/packages/urllib3/response.py", line 273, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Library/Python/2.7/site-packages/pip/_vendor/requests/packages/urllib3/response.py", line 203, in read
data = self._fp.read(amt)
File "/Library/Python/2.7/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 49, in read
data = self.__fp.read(amt)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 241, in recv
return self.read(buflen)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 160, in read
return self._sslobj.read(len)
error: [Errno 54] Connection reset by peer
I can get my hands on the lxml tarball, but I don't know how to get pip past this roadblock. I have somehow managed to get scrapy installed, but it blows up when I try to import it:
root#rcmac (~ ): python
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/__init__.py", line 56, in <module>
from scrapy.spider import Spider
File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/spider.py", line 7, in <module>
from scrapy.http import Request
File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/http/__init__.py", line 11, in <module>
from scrapy.http.request.form import FormRequest
File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/http/request/form.py", line 9, in <module>
import lxml.html
ImportError: No module named lxml.html
So let's see: I guess my question is, "Help?" :-) Thanks.
OK, solved again, sorry: it looks like a proxy server here at work was blocking my lxml install from pip. I reran the pip install command for scrapy, lxml was properly installed that time, and after that the error went away.
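For anyone hitting the same wall, pip can also be told about the proxy explicitly via its --proxy flag (the proxy address below is a hypothetical placeholder; substitute your own):

pip install --proxy http://user:password@proxy.example.com:8080 lxml
pip install scrapy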