scrapy tutorial: cannot run scrapy crawl dmoz - scrapy

I'm asking a new question because I'm aware I wasn't clear enough in the last one.
I'm trying to follow the Scrapy tutorial, but I'm stuck at the crucial step, the "scrapy crawl dmoz" command.
The code is this one (I wrote it in the Python shell and saved it with a .py extension):
ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
Python 2.7.2 (default, Jun 24 2011, 12:20:15)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "copyright", "credits" or "license()" for more information.
>>> from scrapy.spider import BaseSpider
class dmoz(BaseSpider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
filename = response.url.split("/")[-2]
open(filename, 'wb').write(response.body)
>>>
The directory I'm using should be fine; please find the tree below:
.
├── scrapy.cfg
└── tutorial
├── __init__.py
├── __init__.pyc
├── items.py
├── pipelines.py
├── settings.py
├── settings.pyc
└── spiders
├── __init__.py
├── __init__.pyc
└── dmoz_spider.py
2 directories, 10 files
Now when I try to run "scrapy crawl dmoz" I get this:
$ scrapy crawl dmoz
2013-08-14 12:51:40+0200 [scrapy] INFO: Scrapy 0.16.5 started (bot: tutorial)
2013-08-14 12:51:40+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/bin/scrapy", line 5, in <module>
pkg_resources.run_script('Scrapy==0.16.5', 'scrapy')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources.py", line 1235, in run_script
execfile(script_filename, namespace, namespace)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
execute()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/cmdline.py", line 131, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/cmdline.py", line 76, in _run_print_help
func(*a, **kw)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/cmdline.py", line 138, in _run_command
cmd.run(args, opts)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/commands/crawl.py", line 43, in run
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/command.py", line 33, in crawler
self._crawler.configure()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/crawler.py", line 40, in configure
self.spiders = spman_cls.from_crawler(self)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler
sm = cls.from_settings(crawler.settings)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings
return cls(settings.getlist('SPIDER_MODULES'))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__
for module in walk_modules(name):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.5-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
submod = __import__(fullpath, {}, {}, [''])
File "/Users//Documents/tutorial/tutorial/spiders/dmoz_spider.py", line 1
ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
^
SyntaxError: invalid syntax
Does anybody know what is wrong with the steps I'm taking?
Thank you for your help. This is my very first programming experience, so it might be a very stupid issue.

It's not an indentation problem; the error message is clear:
File "/Users//Documents/tutorial/tutorial/spiders/dmoz_spider.py", line 1
ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
^
SyntaxError: invalid syntax
You have clearly copy-pasted the code from IDLE, including the IDLE startup banner, which is not code.
Instead of copy-pasting, try opening an editor and actually typing the tutorial code there; you'll learn better and you won't inadvertently paste text that isn't code.
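For reference, here is a minimal sketch of what tutorial/spiders/dmoz_spider.py could contain once the banner and the >>> prompts are stripped out (essentially the same code as in the question, just saved as a plain file):

# tutorial/spiders/dmoz_spider.py -- no IDLE banner, no >>> prompts
from scrapy.spider import BaseSpider

class dmoz(BaseSpider):
    # the name used on the command line: scrapy crawl dmoz
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # save each downloaded page body to a file named after the last URL segment
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)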

The indentation is not correct.
It should be:
>>> from scrapy.spider import BaseSpider
>>> class dmoz(BaseSpider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
        def parse(self, response):
            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)
I think you copy-pasted the code from IDLE; please indent the class body.

Related

Scrapyd-Deploy: SPIDER_MODULES not found

I am trying to deploy a Scrapy 2.1.0 project with scrapyd-deploy 1.2 and get this error:
scrapyd-deploy example
/Library/Frameworks/Python.framework/Versions/3.8/bin/scrapyd-deploy:23: ScrapyDeprecationWarning: Module `scrapy.utils.http` is deprecated, Please import from `w3lib.http` instead.
from scrapy.utils.http import basic_auth_header
fatal: No names found, cannot describe anything.
Packing version r1-master
Deploying to project "crawler" in http://myip:6843/addversion.json
Server response (200):
{"node_name": "spider1", "status": "error", "message": "/usr/local/lib/python3.8/dist-packages/scrapy/utils/project.py:90: ScrapyDeprecationWarning: Use of environment variables prefixed with SCRAPY_ to override settings is deprecated. The following environment variables are currently defined: EGG_VERSION\n warnings.warn(\nTraceback (most recent call last):\n File \"/usr/lib/python3.8/runpy.py\", line 193, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/lib/python3.8/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File \"/usr/local/lib/python3.8/dist-packages/scrapyd/runner.py\", line 40, in <module>\n main()\n File \"/usr/local/lib/python3.8/dist-packages/scrapyd/runner.py\", line 37, in main\n execute()\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/cmdline.py\", line 142, in execute\n cmd.crawler_process = CrawlerProcess(settings)\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/crawler.py\", line 280, in __init__\n super(CrawlerProcess, self).__init__(settings)\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/crawler.py\", line 152, in __init__\n self.spider_loader = self._get_spider_loader(settings)\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/crawler.py\", line 146, in _get_spider_loader\n return loader_cls.from_settings(settings.frozencopy())\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/spiderloader.py\", line 60, in from_settings\n return cls(settings)\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/spiderloader.py\", line 24, in __init__\n self._load_all_spiders()\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/spiderloader.py\", line 46, in _load_all_spiders\n for module in walk_modules(name):\n File \"/usr/local/lib/python3.8/dist-packages/scrapy/utils/misc.py\", line 69, in walk_modules\n mod = import_module(path)\n File \"/usr/lib/python3.8/importlib/__init__.py\", line 127, in import_module\n return _bootstrap._gcd_import(name[level:], package, level)\n File \"<frozen importlib._bootstrap>\", line 1014, in _gcd_import\n File \"<frozen importlib._bootstrap>\", line 991, in _find_and_load\n File \"<frozen importlib._bootstrap>\", line 973, in _find_and_load_unlocked\nModuleNotFoundError: No module named 'crawler.spiders_prod'\n"}
crawler.spiders_prod is the first module defined in SPIDER_MODULES
Part of crawler.settings.py:
SPIDER_MODULES = ['crawler.spiders_prod', 'crawler.spiders_dev']
NEWSPIDER_MODULE = 'crawler.spiders_dev'
The crawler works locally, but when deployed it fails to import the folder my spiders live in, whatever I call it.
scrapyd-deploy setup.py:
# Automatically created by: scrapyd-deploy
from setuptools import setup, find_packages

setup(
    name = 'project',
    version = '1.0',
    packages = find_packages(),
    entry_points = {'scrapy': ['settings = crawler.settings']},
)
scrapy.cfg:
[deploy:example]
url = http://myip:6843/
username = test
password = whatever.
project = crawler
version = GIT
Is this possibly a bug or am I missing something?
The spider modules have to be initialised as packages so Scrapy can import them. This is done by simply placing the following file into each folder defined as a module:
__init__.py
This solved the problem described above.
Lesson learned:
If you want to split your spiders into folders, it is not enough to simply create a folder and specify it as a module in the settings file; you also need to place an __init__.py file into the new folder. Funnily enough, the crawler works locally without the file; only deployment to scrapyd fails.
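As a sketch (assuming only the module names listed in SPIDER_MODULES above), the project layout would then look something like this:

crawler/
├── __init__.py
├── settings.py
├── spiders_dev/
│   ├── __init__.py
│   └── ...
└── spiders_prod/
    ├── __init__.py
    └── ...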

cannot import name _parse_proxy (urllib2)

I was trying to use Scrapy with Python 2 and got this error:
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 95, in crawl
six.reraise(*exc_info)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 77, in crawl
self.engine = self._create_engine()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 102, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/Library/Python/2.7/site-packages/scrapy/downloadermiddlewares/httpproxy.py", line 4, in <module>
from urllib2 import _parse_proxy
ImportError: cannot import name _parse_proxy
I have urllib2 installed:
Name: urllib2
Version: 1498656401.94
Summary: Checking out the typosquatting state of PyPi
Home-page: https://github.com/benjaoming
Author: Benjamin Bach
Author-email: benjamin@overtag.dk
License: MIT
Location: /usr/local/lib/python2.7/site-packages
Requires:
I have tried uninstalling and reinstalling urllib2 and requests.
You probably have two different Python environments: Scrapy lives in /Library/Python/2.7/site-packages while your pip-installed urllib2 lives in /usr/local/lib/python2.7/site-packages. Also note that urllib2 is part of the Python 2 standard library and never needs to be pip-installed; the PyPI package of that name (whose summary even mentions typosquatting) is just a placeholder and does not provide _parse_proxy. Uninstall it and run Scrapy with an interpreter whose standard library is intact.
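A quick, minimal check (run it with the same interpreter that launches Scrapy) to see which copies of the modules are actually being imported; the file name is just a hypothetical helper:

# check_imports.py -- Python 2 only
import scrapy
import urllib2

print(scrapy.__file__)   # where Scrapy is imported from
print(urllib2.__file__)  # should point into the Python 2 standard library,
                         # not into site-packages

If urllib2.__file__ points into site-packages, the PyPI placeholder package is shadowing the standard-library module and should be removed.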

Packaging Bokeh with py2exe

I've got a Python application that generates charts using Bokeh (0.12). The charts are all standalone (i.e. BokehJS gets inlined), so the browser doesn't need to go out to the web to reach the CDN or make any external connections.
It all works fine when I run it from Eclipse; the chart displays with no problem. But when I try to package it with py2exe, the HTML file is created but no chart is shown when I open it in a browser. This is what my setup.py looks like:
from distutils.core import setup
import py2exe
import os
import psutil
import pkg_resources
import inspect
import matplotlib
import sys
import bokeh.core
import zipfile

sys.setrecursionlimit(5000)

includes = ["sqlite3", "PyQt4", "decimal", "bokeh.core", "jinja2", "matplotlib",
            "mpl_toolkits", "matplotlib.backends.backend_wx", "bokeh"]
excludes = []
packages = ["pkg_resources"]
dll_excludes = ['libgdk-win32-1.0-0.dll', 'libgobject-2.0-0.dll', 'tcl84.dll',
                'tk84.dll', 'msvcp90.dll', 'msvcr71.dll', 'IPHLPAPI.DLL',
                'NSI.dll', 'WINNSI.DLL', 'WTSAPI32.dll']

dir_name = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))  # script directory
parent_dir_name = os.path.abspath(os.path.join(dir_name, os.pardir))

iconfile_config_location = "my_icon.ico"
configuration_location = "my_configuration.ini"
database_location = "localDB"

datafiles = [('', [configuration_location, database_location])]
datafiles.extend(matplotlib.get_py2exe_datafiles())

current_dir = os.path.dirname(os.path.realpath(__file__))
dist_dir = os.path.join(current_dir, "release")

setup(
    options={"py2exe": {"compressed": 2,
                        # optimization level: 0 = do not optimize (generate .pyc),
                        # 1 = normal optimization (like python -O),
                        # 2 = extra optimization (like python -OO)
                        "optimize": 0,
                        "includes": includes,
                        "excludes": excludes,
                        "packages": packages,
                        "dll_excludes": dll_excludes,
                        "bundle_files": 2,
                        "dist_dir": "release",
                        "xref": False,
                        "skip_archive": False,
                        "ascii": False,
                        "custom_boot_script": '',
                        }
             },
    windows=[{"script": "main.py", "icon_resources": [(1, iconfile_config_location)]}],
    data_files=datafiles
)

# Add bokeh/core/_templates files to the library.zip file
bokeh_path = sys.modules['bokeh.core'].__path__[0]
zipfile_path = os.path.join(dist_dir, "library.zip")
z = zipfile.ZipFile(zipfile_path, 'a')
for dirpath, dirs, files in os.walk(os.path.join(bokeh_path, '_templates')):
    for f in files:
        fn = os.path.join(dirpath, f)
        z.write(fn, os.path.join(dirpath[dirpath.index('bokeh'):], f))
z.close()
Can anyone guide me on how to package Bokeh so that the resulting executable can make use of it (i.e. everything gets bundled in together)? Is it possible?
I also noticed that when I set the optimize option to 2 it doesn't work; I get the error:
File "zipextimporter.pyo", line 82, in load_module
File "bokeh\plotting\__init__.pyo", line 2, in <module>
File "zipextimporter.pyo", line 82, in load_module
File "bokeh\document.pyo", line 36, in <module>
File "zipextimporter.pyo", line 82, in load_module
File "bokeh\model.pyo", line 12, in <module>
File "zipextimporter.pyo", line 82, in load_module
File "bokeh\core\properties.pyo", line 73, in <module>
File "zipextimporter.pyo", line 82, in load_module
File "bokeh\core\enums.pyo", line 25, in <module>
File "zipextimporter.pyo", line 82, in load_module
File "bokeh\icons.pyo", line 78, in <module>
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'
Any ideas?

What dependencies do I need for USB programming in Python with pyUSB?

I am trying to get the usb.core.find call to work properly in a Python script I'm writing on Angstrom for the BeagleBoard.
Here is my code:
#!/usr/bin/env python
import usb.core
import usb.util
import usb.backend.libusb01 as libusb
PYUSB_DEBUG_LEVEL = 'debug'
# find our device
# Bus 002 Device 006: ID 1208:0815
# idVendor 0x1208
# idProduct 0x0815
# dev = usb.core.find(idVendor=0xfffe, idProduct=0x0001)
# iManufacturer 1 TOROBOT.com
dev = usb.core.find(idVendor=0x1208, idProduct=0x0815,
                    backend=libusb.get_backend())
I don't know what's missing, but here is what I do know.
When I don't specify the backend, no backend is found. When I do specify the backend usb.backend.libusb01 I get the following error:
root@beagleboard:~/servo# ./pyServo.py
Traceback (most recent call last):
File "./pyServo.py", line 17, in <module>
dev = usb.core.find(idVendor=0x1208, idProduct=0x0815, backend=libusb.get_backend() )
File "/usr/lib/python2.6/site-packages/usb/core.py", line 854, in find
return _interop._next(device_iter(k, v))
File "/usr/lib/python2.6/site-packages/usb/_interop.py", line 60, in _next
return next(iter)
File "/usr/lib/python2.6/site-packages/usb/core.py", line 821, in device_iter
for dev in backend.enumerate_devices():
File "/usr/lib/python2.6/site-packages/usb/backend/libusb01.py", line 390, in enumerate_devices
_check(_lib.usb_find_busses())
File "/usr/lib/python2.6/ctypes/__init__.py", line 366, in __getattr__
func = self.__getitem__(name)
File "/usr/lib/python2.6/ctypes/__init__.py", line 371, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: python: undefined symbol: usb_find_busses
What am I missing so that this will work properly?
Try installing python-pyusb through opkg; that should pull in the native libusb dependency. The AttributeError (python: undefined symbol: usb_find_busses) indicates that the libusb 0.1 shared library the backend loads through ctypes could not be found.
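A minimal check (assuming the libusb 0.1 backend from the question) to confirm that pyUSB can locate the native library after installing it:

#!/usr/bin/env python
# Prints the pyUSB backend object; None means the native libusb 0.1
# library still cannot be located.
import usb.backend.libusb01 as libusb01

backend = libusb01.get_backend()
print(backend)

If this prints None, the libusb 0.1 shared library is still missing or not on the loader path.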

Why is Scrapy spider not loading?

I am a little new to the scraping domain and managed to put together the following piece of code for my spider:
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'thesentientspider.settings')

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urljoin
from thesentientspider.items import RestaurantDetails, UserReview
import urllib
from scrapy.conf import settings
import pymongo
from pymongo import MongoClient

# MONGODB Settings
MongoDBServer = settings['MONGODB_SERVER']
MongoDBPort = settings['MONGODB_PORT']

class ZomatoSpider(BaseSpider):
    name = 'zomatoSpider'
    allowed_domains = ['zomato.com']
    CITY = ["hyderabad"]
    start_urls = [
        'http://www.zomato.com/%s/restaurants/' % cityName for cityName in CITY
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        BASE_URL = get_base_url(response)
However, when I try to launch it through the scrapy crawl zomatoSpider command, it throws the following error:
Traceback (most recent call last):
File "/usr/bin/scrapy", line 4, in <module>
execute()
File "/usr/lib/pymodules/python2.6/scrapy/cmdline.py", line 131, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/pymodules/python2.6/scrapy/cmdline.py", line 76, in _run_print_help
func(*a, **kw)
File "/usr/lib/pymodules/python2.6/scrapy/cmdline.py", line 138, in _run_command
cmd.run(args, opts)
File "/usr/lib/pymodules/python2.6/scrapy/commands/crawl.py", line 43, in run
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "/usr/lib/pymodules/python2.6/scrapy/command.py", line 33, in crawler
self._crawler.configure()
File "/usr/lib/pymodules/python2.6/scrapy/crawler.py", line 40, in configure
self.spiders = spman_cls.from_crawler(self)
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 35, in from_crawler
sm = cls.from_settings(crawler.settings)
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 31, in from_settings
return cls(settings.getlist('SPIDER_MODULES'))
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 23, in __init__
self._load_spiders(module)
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 26, in _load_spiders
for spcls in iter_spider_classes(module):
File "/usr/lib/pymodules/python2.6/scrapy/utils/spider.py", line 21, in iter_spider_classes
issubclass(obj, BaseSpider) and \
TypeError: issubclass() arg 1 must be a class
Could anyone please point out the root cause and suggest modification for the same via a code snippet?
def __init__(self):
    MongoDBServer = settings['MONGODB_SERVER']
    MongoDBPort = settings['MONGODB_PORT']
    database = settings['MONGODB_DB']
    rest_coll = settings['RESTAURANTS_COLLECTION']
    review_coll = settings['REVIEWS_COLLECTION']
    client = MongoClient(MongoDBServer, MongoDBPort)
    db = client[database]
    self.restaurantsCollection = db[rest_coll]
    self.reviewsCollection = db[review_coll]
This is the code that I added to make it work. Hope it helps.
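A minimal sketch of how that __init__ could sit on the spider from the question, moving the MongoDB settings lookups out of module level so nothing reads the settings at import time (the settings keys are the ones used in the answer above):

from scrapy.spider import BaseSpider
from scrapy.conf import settings
from pymongo import MongoClient

class ZomatoSpider(BaseSpider):
    name = 'zomatoSpider'
    allowed_domains = ['zomato.com']
    CITY = ["hyderabad"]
    start_urls = [
        'http://www.zomato.com/%s/restaurants/' % cityName for cityName in CITY
    ]

    def __init__(self, *args, **kwargs):
        super(ZomatoSpider, self).__init__(*args, **kwargs)
        # read the MongoDB settings when the spider is instantiated,
        # not when the module is imported
        client = MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = client[settings['MONGODB_DB']]
        self.restaurantsCollection = db[settings['RESTAURANTS_COLLECTION']]
        self.reviewsCollection = db[settings['REVIEWS_COLLECTION']]

    # parse() stays the same as in the question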