Why is Scrapy spider not loading?

I am fairly new to scraping and managed to put together the following code for my spider:
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'thesentientspider.settings')
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urljoin
from thesentientspider.items import RestaurantDetails, UserReview
import urllib
from scrapy.conf import settings
import pymongo
from pymongo import MongoClient
#MONGODB Settings
MongoDBServer=settings['MONGODB_SERVER']
MongoDBPort=settings['MONGODB_PORT']
class ZomatoSpider(BaseSpider):
    name = 'zomatoSpider'
    allowed_domains = ['zomato.com']
    CITY=["hyderabad"]
    start_urls = [
        'http://www.zomato.com/%s/restaurants/' %cityName for cityName in CITY
    ]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        BASE_URL=get_base_url(response)
However, when I try to launch it with the scrapy crawl zomatoSpider command, it throws the following error:
Traceback (most recent call last):
File "/usr/bin/scrapy", line 4, in <module>
execute()
File "/usr/lib/pymodules/python2.6/scrapy/cmdline.py", line 131, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/pymodules/python2.6/scrapy/cmdline.py", line 76, in _run_print_help
func(*a, **kw)
File "/usr/lib/pymodules/python2.6/scrapy/cmdline.py", line 138, in _run_command
cmd.run(args, opts)
File "/usr/lib/pymodules/python2.6/scrapy/commands/crawl.py", line 43, in run
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "/usr/lib/pymodules/python2.6/scrapy/command.py", line 33, in crawler
self._crawler.configure()
File "/usr/lib/pymodules/python2.6/scrapy/crawler.py", line 40, in configure
self.spiders = spman_cls.from_crawler(self)
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 35, in from_crawler
sm = cls.from_settings(crawler.settings)
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 31, in from_settings
return cls(settings.getlist('SPIDER_MODULES'))
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 23, in __init__
self._load_spiders(module)
File "/usr/lib/pymodules/python2.6/scrapy/spidermanager.py", line 26, in _load_spiders
for spcls in iter_spider_classes(module):
File "/usr/lib/pymodules/python2.6/scrapy/utils/spider.py", line 21, in iter_spider_classes
issubclass(obj, BaseSpider) and \
TypeError: issubclass() arg 1 must be a class
Could anyone please point out the root cause and suggest a fix with a code snippet?

def __init__(self):
    MongoDBServer=settings['MONGODB_SERVER']
    MongoDBPort=settings['MONGODB_PORT']
    database=settings['MONGODB_DB']
    rest_coll=settings['RESTAURANTS_COLLECTION']
    review_coll=settings['REVIEWS_COLLECTION']
    client=MongoClient(MongoDBServer, MongoDBPort)
    db=client[database]
    self.restaurantsCollection=db[rest_coll]
    self.reviewsCollection=db[review_coll]
This is the code that I added to make it work. Hope it helps.
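The traceback ends in `issubclass(obj, BaseSpider)` while the spider manager walks the names defined in the spider module, so a module-level object that is not a class is a likely trigger. The sketch below illustrates one way that can happen. It rests on two assumptions the question does not confirm: that the loader's "is this a class?" test is duck-typed (for example `hasattr(obj, '__bases__')`), and that a database handle at module level answers every attribute lookup. `PermissiveHandle`, `FakeCollection`, `looks_like_class` and `try_issubclass` are hypothetical stand-ins, not Scrapy or pymongo code.

```python
class FakeCollection(object):
    """Hypothetical stand-in for a collection object."""

    def __init__(self, name):
        self.name = name


class PermissiveHandle(object):
    """Hypothetical handle that turns any unknown attribute into a collection."""

    def __getattr__(self, name):
        # Called only when normal lookup fails, so it also answers '__bases__'.
        return FakeCollection(name)


def looks_like_class(obj):
    # Assumed duck-typed class test; not Scrapy's actual code.
    return hasattr(obj, "__bases__")


def try_issubclass(obj, base):
    """Return the message of the TypeError issubclass() raises, or None."""
    try:
        issubclass(obj, base)
    except TypeError as exc:
        return str(exc)
    return None


handle = PermissiveHandle()
print(looks_like_class(handle))        # the handle passes the duck-typed test
print(try_issubclass(handle, object))  # yet issubclass() refuses it
```

Under those assumptions, creating such handles inside `__init__`, as the snippet above does, keeps them out of the module namespace that the loader scans.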

Related

FileNotFoundError when I invoke AWS lambda function for integration test

I want to write an integration test for an AWS Lambda function.
For that I used the AWS SDK for Java v2.
The test code is shown below:
package integration;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import com.squareup.moshi.JsonAdapter;
import com.squareup.moshi.JsonReader;
import com.squareup.moshi.JsonWriter;
import com.squareup.moshi.Moshi;
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.InvokeRequest;
import software.amazon.awssdk.services.lambda.model.InvokeResponse;
import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
public class ConsumerRequestHandlerTest extends WsinDatabaseConnectionTest
{
private static final String FUNCTION_NAME = "ConsumerFunction";
private static final String LOCAL_ENDPOINT = "http://127.0.0.1:3001";
private LambdaClient lambdaClient;
private TestWsinJdbcTemplate testWsinJdbcTemplate = new TestWsinJdbcTemplate(getConnection());
private String makerId;
@BeforeEach
void setUp()
{
lambdaClient = LambdaClient.builder()
.region(Region.AP_NORTHEAST_2)
.endpointOverride(URI.create(LOCAL_ENDPOINT))
.build();
}
@Test
void integrationTest()
{
SQSEvent event = createEvent();
String jsonEvent = toJson(event);
SdkBytes payload = SdkBytes.fromUtf8String(jsonEvent);
InvokeRequest request = InvokeRequest.builder()
.functionName(FUNCTION_NAME)
.payload(payload)
.build();
InvokeResponse response = lambdaClient.invoke(request);
System.out.println(response);
}
@AfterEach
void tearDown()
{
if (Objects.nonNull(lambdaClient))
{
lambdaClient.close();
}
}
}
When the function is invoked, the client gets the following error:
software.amazon.awssdk.services.lambda.model.LambdaException: ServiceException (Service: Lambda, Status Code: 500, Request ID: null, Extended Request ID: null)
and the server gets the following error:
Invoking com.manufacturer.ui.ConsumerRequestHandler::handleRequest (java8.al2)
WsinManufacturerLambdaShutdownExtension is a local Layer in the template
Image was not found.
Building image...Exception on /2015-03-31/functions/ConsumerFunction/invocations [POST]
Traceback (most recent call last):
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/local/lambda_service/local_lambda_invoke_service.py", line 165, in _invoke_request_handler
self.lambda_runner.invoke(function_name, request_data, stdout=stdout_stream_writer, stderr=self.stderr)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/commands/local/lib/local_lambda.py", line 137, in invoke
self.local_runtime.invoke(
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/lib/telemetry/metric.py", line 230, in wrapped_func
return_value = func(*args, **kwargs)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/local/lambdafn/runtime.py", line 178, in invoke
container = self.create(function_config, debug_context, container_host, container_host_interface)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/local/lambdafn/runtime.py", line 73, in create
container = LambdaContainer(
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/local/docker/lambda_container.py", line 93, in __init__
image = LambdaContainer._get_image(
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/local/docker/lambda_container.py", line 236, in _get_image
return lambda_image.build(runtime, packagetype, image, layers, architecture, function_name=function_name)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/local/docker/lambda_image.py", line 163, in build
self._build_image(
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/local/docker/lambda_image.py", line 261, in _build_image
with create_tarball(tar_paths, tar_filter=tar_filter) as tarballfile:
File "/usr/local/Cellar/python@3.8/3.8.12_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/usr/local/Cellar/aws-sam-cli/1.38.1/libexec/lib/python3.8/site-packages/samcli/lib/utils/tar.py", line 29, in create_tarball
archive.add(path_on_system, arcname=path_in_tarball, filter=tar_filter)
File "/usr/local/Cellar/python@3.8/3.8.12_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/tarfile.py", line 1955, in add
tarinfo = self.gettarinfo(name, arcname)
File "/usr/local/Cellar/python@3.8/3.8.12_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/tarfile.py", line 1834, in gettarinfo
statres = os.lstat(name)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/we/IdeaProjects/wsin-manufacturer-queue-serverless/WsinManufacturerLambdaShutdownExtension.zip'
2022-04-22 10:41:20 127.0.0.1 - - [22/Apr/2022 10:41:20] "POST /2015-03-31/functions/ConsumerFunction/invocations HTTP/1.1" 500 -
Do I have to place the file at that path?
Why does the error occur?
Thank you in advance.

Receive error when I run an executable created after py2exe conversion

After running my setup.py file and attempting to run the newly created .exe file, I receive an error log that says:
Traceback (most recent call last):
File "Healthy_temp.py", line 7, in <module>
File "pandas\__init__.pyc", line 42, in <module>
File "pandas\core\api.pyc", line 26, in <module>
File "pandas\core\groupby\__init__.pyc", line 1, in <module>
File "pandas\core\groupby\groupby.pyc", line 37, in <module>
File "pandas\core\frame.pyc", line 100, in <module>
File "pandas\core\series.pyc", line 4390, in <module>
File "pandas\core\generic.pyc", line 10138, in _add_series_or_dataframe_operations
File "pandas\core\window.pyc", line 14, in <module>
File "pandas\_libs\window.pyc", line 12, in <module>
File "pandas\_libs\window.pyc", line 10, in __load
File "pandas\_libs\skiplist.pxd", line 31, in initpandas._libs.window
ImportError: No module named skiplist
My setup.py file is here:
from distutils.core import setup
import py2exe
import matplotlib
from PyQt4 import QtCore, QtGui
from AboutDialog import Ui_Dialog
from matplotlibwidget import MatplotlibWidget
import sys
from os import walk
from os import getcwd
import pandas
import numpy
from glob import glob
from datetime import datetime
from math import isnan
import sip
# setup(windows=['Healthy_temp.py'])
setup(options={'py2exe': {'bundle_files': 3, 'compressed': True}}, windows=[
'Healthy_temp.py'], zipfile=None,
data_files=matplotlib.get_py2exe_datafiles())
What can I do to correct this problem?
Add 'pandas._libs.skiplist' to the "includes" option list.
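A sketch of how that could look in the question's setup.py, assuming the rest of the file stays as posted. `includes` is the py2exe option that lists modules to bundle even when the import scan does not find them. The helper name `build_py2exe_options` is hypothetical, and the `setup(...)` call is left as a comment so the snippet runs without py2exe installed.

```python
def build_py2exe_options(extra_includes):
    """Return the options mapping for setup(), with extra modules force-included."""
    return {
        "py2exe": {
            "bundle_files": 3,
            "compressed": True,
            # Modules pulled in from compiled extensions are typically missed
            # by the import scan, so they are listed explicitly here.
            "includes": list(extra_includes),
        }
    }


options = build_py2exe_options(["pandas._libs.skiplist"])
print(options["py2exe"]["includes"])

# In the real setup.py this mapping would be passed on unchanged:
# setup(options=options, windows=['Healthy_temp.py'], zipfile=None,
#       data_files=matplotlib.get_py2exe_datafiles())
```

If the rebuilt executable then reports another missing module, the same list can be extended with that module's dotted name.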

ImportError: cannot import name ResponseError

I am trying to use mod_python.
However, when I try to access my page I get the following error:
Traceback (most recent call last):
File "/usr/lib64/python2.6/site-packages/mod_python/importer.py", line 1540, in HandlerDispatch
default=default_handler, arg=req, silent=hlist.silent)
File "/usr/lib64/python2.6/site-packages/mod_python/importer.py", line 1205, in _process_target
module = import_module(module_name, path=path)
File "/usr/lib64/python2.6/site-packages/mod_python/importer.py", line 299, in import_module
log, import_path)
File "/usr/lib64/python2.6/site-packages/mod_python/importer.py", line 683, in import_module
execfile(file, module.__dict__)
File "/var/www/pylons-data/prod/reports/scripts/access_reports.py", line 2, in <module>
import requests
File "/usr/lib/python2.6/site-packages/requests/__init__.py", line 60, in <module>
from .api import request, get, head, post, patch, put, delete, options
File "/usr/lib/python2.6/site-packages/requests/api.py", line 14, in <module>
from . import sessions
File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 27, in <module>
from .adapters import HTTPAdapter
File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 29, in <module>
from .packages.urllib3.exceptions import ResponseError
ImportError: cannot import name ResponseError
Any help is appreciated. I'm not sure how I can fix this.
Maybe make sure that your module is in the sys.path list.
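One way to act on that suggestion is to list every place on sys.path where a package can be found. The traceback loads mod_python from /usr/lib64 and requests from /usr/lib, so a second or stale copy of requests is one possible cause, though that is an assumption rather than something the question confirms. A minimal sketch (the function name `find_package_copies` is hypothetical):

```python
import os
import sys


def find_package_copies(package, search_path=None):
    """Return every directory on search_path that contains `package`."""
    hits = []
    for entry in (sys.path if search_path is None else search_path):
        base = entry or os.getcwd()  # an empty entry means the current directory
        as_dir = os.path.join(base, package)
        as_file = os.path.join(base, package + ".py")
        if os.path.isdir(as_dir) or os.path.isfile(as_file):
            hits.append(base)
    return hits


# Print each location that could supply 'requests' to this interpreter.
for location in find_package_copies("requests"):
    print(location)
```

More than one line of output means several copies are visible, and the first one listed is the one an import would normally pick up.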

cannot import name _parse_proxy (urllib2)

I was trying to use Scrapy with Python 2 and I got this error:
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 95, in crawl
six.reraise(*exc_info)
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 77, in crawl
self.engine = self._create_engine()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 102, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/Library/Python/2.7/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/Library/Python/2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/Library/Python/2.7/site-packages/scrapy/downloadermiddlewares/httpproxy.py", line 4, in <module>
from urllib2 import _parse_proxy
ImportError: cannot import name _parse_proxy
I have urllib2 installed:
Name: urllib2
Version: 1498656401.94
Summary: Checking out the typosquatting state of PyPi
Home-page: https://github.com/benjaoming
Author: Benjamin Bach
Author-email: benjamin@overtag.dk
License: MIT
Location: /usr/local/lib/python2.7/site-packages
Requires:
I have tried uninstalling and reinstalling urllib2 and requests.
You probably have two different Python environments. Scrapy seems to live in /Library/Python/2.7/site-packages/scrapy while your urllib2 lives in /usr/local/lib/python2.7/site-packages.
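To check this, it helps to ask the interpreter itself which executable is running and which file a module was loaded from. A minimal sketch, using `json` only as a harmless example module (the helper name `describe_module` is hypothetical):

```python
import importlib
import sys


def describe_module(name):
    """Report which interpreter is running and which file provides `name`."""
    module = importlib.import_module(name)
    return {
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        # Built-in modules have no __file__, hence the default.
        "file": getattr(module, "__file__", None),
    }


info = describe_module("json")
for key in sorted(info):
    print("%s: %s" % (key, info[key]))
```

Running it with `urllib2` in place of `json`, under the same interpreter that launches Scrapy, would show whether the imported file sits in the standard library or in a site-packages directory.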

python urllib error - AttributeError: 'module' object has no attribute 'request'

I am trying out tutorial code that fetches the HTML from a website and prints it. I'm using Python 3.4.0 on Ubuntu. The code:
import urllib.request
page = urllib.request.urlopen("http://www.brainjar.com/java/host/test.html")
text = page.read().decode("utf8")
print(text)
I saw previous solutions and tried them. I also tried importing only urllib, but it still doesn't work. The error message displayed is as shown:
Traceback (most recent call last):
File "string.py", line 1, in <module>
import urllib.request
File "/usr/lib/python3.4/urllib/request.py", line 88, in <module>
import http.client
File "/usr/lib/python3.4/http/client.py", line 69, in <module>
import email.parser
File "/usr/lib/python3.4/email/parser.py", line 12, in <module>
from email.feedparser import FeedParser, BytesFeedParser
File "/usr/lib/python3.4/email/feedparser.py", line 27, in <module>
from email import message
File "/usr/lib/python3.4/email/message.py", line 15, in <module>
from email import utils
File "/usr/lib/python3.4/email/utils.py", line 40, in <module>
from email.charset import Charset
File "/usr/lib/python3.4/email/charset.py", line 15, in <module>
import email.quoprimime
File "/usr/lib/python3.4/email/quoprimime.py", line 44, in <module>
from string import ascii_letters, digits, hexdigits
File "/media/saiwal/D89602199601F930/Documents/Copy/codes/python/headfirst/string.py", line 2, in <module>
page = urllib.request.urlopen("http://www.brainjar.com/java/host/test.html")
AttributeError: 'module' object has no attribute 'request'
Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
from apport.fileutils import likely_packaged, get_recent_crashes
File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
from apport.report import Report
File "/usr/lib/python3/dist-packages/apport/report.py", line 21, in <module>
from urllib.request import urlopen
File "/usr/lib/python3.4/urllib/request.py", line 88, in <module>
import http.client
File "/usr/lib/python3.4/http/client.py", line 69, in <module>
import email.parser
File "/usr/lib/python3.4/email/parser.py", line 12, in <module>
from email.feedparser import FeedParser, BytesFeedParser
File "/usr/lib/python3.4/email/feedparser.py", line 27, in <module>
from email import message
File "/usr/lib/python3.4/email/message.py", line 15, in <module>
from email import utils
File "/usr/lib/python3.4/email/utils.py", line 40, in <module>
from email.charset import Charset
File "/usr/lib/python3.4/email/charset.py", line 15, in <module>
import email.quoprimime
File "/usr/lib/python3.4/email/quoprimime.py", line 44, in <module>
from string import ascii_letters, digits, hexdigits
File "/media/saiwal/D89602199601F930/Documents/Copy/codes/python/headfirst/string.py", line 2, in <module>
page = urllib.request.urlopen("http://www.brainjar.com/java/host/test.html")
AttributeError: 'module' object has no attribute 'request'
Original exception was:
Traceback (most recent call last):
File "string.py", line 1, in <module>
import urllib.request
File "/usr/lib/python3.4/urllib/request.py", line 88, in <module>
import http.client
File "/usr/lib/python3.4/http/client.py", line 69, in <module>
import email.parser
File "/usr/lib/python3.4/email/parser.py", line 12, in <module>
from email.feedparser import FeedParser, BytesFeedParser
File "/usr/lib/python3.4/email/feedparser.py", line 27, in <module>
from email import message
File "/usr/lib/python3.4/email/message.py", line 15, in <module>
from email import utils
File "/usr/lib/python3.4/email/utils.py", line 40, in <module>
from email.charset import Charset
File "/usr/lib/python3.4/email/charset.py", line 15, in <module>
import email.quoprimime
File "/usr/lib/python3.4/email/quoprimime.py", line 44, in <module>
from string import ascii_letters, digits, hexdigits
File "/media/saiwal/D89602199601F930/Documents/Copy/codes/python/headfirst/string.py", line 2, in <module>
page = urllib.request.urlopen("http://www.brainjar.com/java/host/test.html")
AttributeError: 'module' object has no attribute 'request'
This looks like a nasty coincidence.
TL;DR: Don’t name your script string.py.
So what’s happening here?
You’re trying to import urllib.request.
urllib.request tries to import http.client, which tries to import email.parser, which tries to import email.feedparser, which tries to import email.message, which tries to import email.utils, which tries to import email.charset, which tries to import email.quoprimime.
email.quoprimime tries to import string, expecting it to be the standard Python string module—but since the current working directory has priority over the standard Python library directories, it finds your string.py instead and tries to import that.
When importing your string.py, you try to import urllib.request. Since urllib.request is still being imported, you get back a skeleton urllib without a request attribute yet.
Because your imported string.py then fails because it can’t find the request attribute, the exception starts propagating back up.
But wait, there’s more! Since there was an error during an import, Ubuntu tries to be helpful by seeing if you’re missing a dpkg package. If so, it could say “hey, it looks like you’re missing this module; want to apt-get it?” So the mechanism for looking up the appropriate package is activated…
…but the module for looking up the appropriate package itself depends on urllib.request, so it tries to import it, and again fails…
In short, because you picked string.py as a file name, you overrode the standard string module, which broke a lot of other modules, and even broke the module that was supposed to be helpful when you were missing a module, causing a whole lot of havoc. Fortunately the solution is easy: rename your script.
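A quick way to catch this kind of clash before it bites is to scan a project directory for scripts named after standard-library modules. The sketch below is one possible check, not a standard tool. `find_shadowing_scripts` is a hypothetical helper, and on interpreters older than Python 3.10 it falls back to a small sample of module names rather than a complete list.

```python
import os
import sys
import tempfile

# sys.stdlib_module_names exists from Python 3.10 on; the fallback set is a
# small illustrative sample, not a complete list.
STDLIB_NAMES = getattr(sys, "stdlib_module_names",
                       frozenset(["string", "email", "json", "random", "os"]))


def find_shadowing_scripts(directory):
    """Return .py files in `directory` whose names match standard-library modules."""
    shadows = []
    for filename in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(filename)
        if ext == ".py" and stem in STDLIB_NAMES:
            shadows.append(filename)
    return shadows


# Demonstration in a throwaway directory with one clashing file name.
demo_dir = tempfile.mkdtemp()
for name in ("string.py", "fetch_page.py"):
    open(os.path.join(demo_dir, name), "w").close()

print(find_shadowing_scripts(demo_dir))  # → ['string.py']
```

Pointing the function at the directory that holds the tutorial script would flag string.py in the same way.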