Why does Scrapy log differently to the console and to an external log file? - scrapy

I'm new to Scrapy. My script ran well on Scrapy 0.24, but after switching to the newly released 1.0 I ran into a logging problem. I want both the file and the console log level set to INFO. However I set LOG_LEVEL or call configure_logging() (using Python's built-in logging package instead of scrapy.log), Scrapy always logs DEBUG-level messages to the console, including the whole item object as a dict. The LOG_LEVEL option only takes effect for the external file. I suspect it has something to do with Python logging, but I have no idea how to configure it. Could anyone help me out?
This is how I configure logging in run_my_spider.py:
from crawler.settings import LOG_FILE, LOG_FORMAT
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from crawler.spiders.MySpiders import MySpider
import logging
def run_spider(spider):
    settings = get_project_settings()
    # configure file logging
    # It ONLY works for the file
    configure_logging({'LOG_FORMAT': LOG_FORMAT,
                       'LOG_ENABLED': True,
                       'LOG_FILE': LOG_FILE,
                       'LOG_LEVEL': 'INFO',
                       'LOG_STDOUT': True})
    # instantiate spider
    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    logging.info('Running Crawler: ' + spider.name)
    process.start()  # the script will block here until the spider_closed signal is sent
    logging.info('Crawler ' + spider.name + ' stopped.\n')
......
This is the console output:
DEBUG:scrapy.core.engine:Crawled (200) <GET http://mil.news.sina.com.cn/2014-10-09/0450804543.html> (referer: http://rss.sina.com.cn/rollnews/jczs/20141009.js)
{'item_name': 'item_sina_news_reply',
 'news_id': u'jc:27-1-804530',
 'reply_id': u'jc:27-1-804530:1',
 'reply_lastcrawl': '1438605374.41',
 'reply_table': 'news_reply_20141009'}
Many Thanks!

What you are seeing in the console may be the Twisted logs, which print DEBUG-level messages to the console.
You can route them into Python logging, and from there to your log files, using:
from twisted.python import log
observer = log.PythonLoggingObserver(loggerName='logname')
observer.start()
(As given in How to make Twisted use Python logging?)
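If the goal is only to keep DEBUG records off the console, one option on the Python logging side is to raise the level of the handlers attached to a logger. The sketch below uses only the standard library. The logger name demo.console and the helper cap_handler_levels are made up for illustration, and whether applying it to the root logger is sufficient under Scrapy 1.0 is an assumption that has not been verified here.

```python
import io
import logging


def cap_handler_levels(logger, level=logging.INFO):
    """Raise every handler attached to `logger` to at least `level`."""
    for handler in logger.handlers:
        if handler.level < level:
            handler.setLevel(level)


# Demonstration with an in-memory stream standing in for the console.
stream = io.StringIO()
demo_logger = logging.getLogger("demo.console")  # hypothetical logger name
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False  # keep the demo isolated from the root logger
demo_logger.addHandler(logging.StreamHandler(stream))

cap_handler_levels(demo_logger, logging.INFO)
demo_logger.debug("Crawled (200) <GET ...>")  # dropped by the handler
demo_logger.info("Spider opened")             # kept

console_text = stream.getvalue()
```

In a real script the same helper could be called as cap_handler_levels(logging.getLogger()) once all handlers have been installed, so that records still reach the file handler at its own level.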

Related

unable to open database file on a hosting service pythonanywhere

I want to deploy my project on PythonAnywhere. error.log says that the server is unable to open my database file. Everything works fine on my local machine. I followed a Pretty Printed video on YouTube.
This is how I initialize the database in app.py:
db_session.global_init("db/data.sqlite")
This is in db_session:
def global_init(db_file):
    global __factory
    if __factory:
        return
    if not db_file or not db_file.strip():
        raise Exception("You must specify a database file.")
    conn_str = f'sqlite:///{db_file.strip()}?check_same_thread=False'
    print(f"Connecting to the database at {conn_str}")
    engine = sa.create_engine(conn_str, echo=False)
    __factory = orm.sessionmaker(bind=engine)
    from . import __all_models
    SqlAlchemyBase.metadata.create_all(engine)
def create_session() -> Session:
    global __factory
    return __factory()
The last piece is my wsgi.py:
import sys
path = '/home/r1chter/Chicken-beta'
if path not in sys.path:
    sys.path.append(path)
import os
from dotenv import load_dotenv
project_folder = os.path.expanduser(path)
load_dotenv(os.path.join(project_folder, '.env'))
import app # noqa
application = app.app()
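Errors like this on PythonAnywhere are usually caused by using a relative path instead of an absolute path. A relative path such as db/data.sqlite is resolved against the process's working directory, which on the server may not be your project folder.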
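A minimal sketch of building an absolute path before handing it to global_init() could look like this. The helper name absolute_db_path is hypothetical, and the base directory literal is simply the project folder from the wsgi.py shown in the question.

```python
import os


def absolute_db_path(relative_path, base_dir):
    """Join a project-relative path such as 'db/data.sqlite' onto base_dir."""
    return os.path.join(base_dir, *relative_path.split("/"))


# In app.py, base_dir would typically be the folder containing the file:
#     base_dir = os.path.dirname(os.path.abspath(__file__))
# The literal below is the project folder taken from the question's wsgi.py.
example_path = absolute_db_path("db/data.sqlite", "/home/r1chter/Chicken-beta")
```

Passing the resulting path to db_session.global_init() means the connection string no longer depends on the directory the web app happens to be started from.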

Why does calling a scrapy spider from pywikibot give a ReactorNotRestartable error?

I am able to call a Scrapy spider from another Python script using either CrawlerRunner or CrawlerProcess. But when I call the same spider-calling class from a pywikibot bot, I get a ReactorNotRestartable error. Why is this, and how can I fix it?
Here is the error:
  File ".\scripts\userscripts\ReplicationWiki\RWLoad.py", line 161, in format_new_page
    aea = AEAMetadata(url=DOI_url)
  File ".\scripts\userscripts\ReplicationWiki\GetAEAMetadata.py", line 39, in __init__
    reactor.run() # the script will block here until all crawling jobs are finished
  File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
CRITICAL: Exiting due to uncaught exception <class 'twisted.internet.error.ReactorNotRestartable'>
Here is the script which calls my scrapy spider. It runs fine if I just call the class from main.
from twisted.internet import reactor, defer
from scrapy import signals
from scrapy.crawler import Crawler, CrawlerProcess, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider
class AEAMetadata:
    """
    Helper to run ScrapeAEA spider and return JEL codes and data links
    for a given AEA article link.
    """
    def __init__(self, *args, **kwargs):
        """Initializer"""
        url = kwargs.get('url')
        if not url:
            raise ValueError('No article url given')
        self.items = []
        def collect_items(item, response, spider):
            self.items.append(item)
        settings = get_project_settings()
        crawler = Crawler(ScrapeaeaSpider, settings)
        crawler.signals.connect(collect_items, signals.item_scraped)
        runner = CrawlerRunner(settings)
        d = runner.crawl(crawler, url=url)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()  # the script will block here until all crawling jobs are finished
        #process = CrawlerProcess(settings)
        #process.crawl(crawler, url=url)
        #process.start() # the script will block here until the crawling is finished
    def get_jelcodes(self):
        jelcodes = self.items[0]['jelcodes']
        return jelcodes
def main():
    aea = AEAMetadata(url='https://doi.org/10.1257/app.20180286')
    jelcodes = aea.get_jelcodes()
    print(jelcodes)
if __name__ == '__main__':
    main()
Update: a simple test that instantiates the AEAMetadata class twice.
Here is the calling code in my pywikibot bot, which fails:
from GetAEAMetadata import AEAMetadata
def main(*args):
    for _ in [1, 2]:
        print('Top')
        url = 'https://doi.org/10.1257/app.20170442'
        aea = AEAMetadata(url=url)
        print('After AEAMetadata')
        jelcodes = aea.get_jelcodes()
        print(jelcodes)
if __name__ == '__main__':
    main()
My call to AEAMetadata was embedded in a larger script, which fooled me into thinking the class was instantiated only once before the failure.
In fact, AEAMetadata was instantiated twice.
I also assumed that the script would block indefinitely after reactor.run(), because the comments in the Scrapy examples say so.
However, the callback added with d.addBoth() calls reactor.stop(), which makes reactor.run() return once the crawl finishes.
A more basic incorrect assumption was that the reactor was deleted and recreated on each iteration. In fact, the reactor is instantiated and initialized when it is first imported. It is a global object that lives as long as the underlying process, and it was not designed to be restarted. The extreme measures needed to delete and restart a reactor are described here:
http://www.blog.pythonlibrary.org/2016/09/14/restarting-a-twisted-reactor/
So I guess I've answered my own question.
I'm rewriting my script so that it doesn't use the reactor in a way it was never intended to be used.
Thanks to Gallaecio for getting me thinking in the right direction.
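One way to stay within that constraint is to queue every URL up front and call reactor.run() exactly once. The following is a rough sketch along those lines, not the rewrite the author ended up with. ItemCollector and crawl_urls are hypothetical names, and it assumes a Scrapy version in which CrawlerRunner.create_crawler() is available and a spider that accepts a url keyword argument as in the question. It has not been tested inside pywikibot.

```python
class ItemCollector:
    """Stores every item passed to its `add` method."""

    def __init__(self):
        self.items = []

    def add(self, item, response, spider):
        # Signature matches what the item_scraped signal sends.
        self.items.append(item)


def crawl_urls(spider_cls, urls):
    """Crawl each URL in turn under a single reactor.run() call.

    Returns a dict mapping each URL to the list of items scraped for it.
    """
    # Imported lazily so the reactor is only installed when a crawl is requested.
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import defer, reactor

    runner = CrawlerRunner(get_project_settings())
    results = {}

    @defer.inlineCallbacks
    def crawl_sequentially():
        try:
            for url in urls:
                collector = ItemCollector()
                crawler = runner.create_crawler(spider_cls)
                crawler.signals.connect(collector.add, signal=signals.item_scraped)
                yield runner.crawl(crawler, url=url)  # wait for this crawl to end
                results[url] = collector.items
        finally:
            reactor.stop()  # stop only after the last crawl

    crawl_sequentially()
    reactor.run()  # must happen at most once per process
    return results
```

The calling code would then collect all DOIs first and call crawl_urls() a single time, instead of constructing a new helper object per article. Calling crawl_urls() a second time in the same process would raise ReactorNotRestartable again.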

Django & Celery & Rabbit getting Not Registered error

I am trying to set up Django, Celery and RabbitMQ for the first time, following this tutorial. I am using Django 2.0, Celery 4.2.0 and RabbitMQ on Windows.
I am getting the error: celery.exceptions.NotRegistered: 'GeneratePDF'
I have set things up as follows:
in my __init__.py:
from __future__ import absolute_import, unicode_literals
import celery
from .celery import app as celery_app
__all__ = ['celery_app']
in my celery.py:
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery
from django.conf import settings
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'abc.settings')
app = Celery('abc')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
@app.task(bind=True)
def debug_task(self):
    print('Request: {0!r}'.format(self.request))
in my tasks.py:
from celery import shared_task
from abc.celery import app
@shared_task(name='GeneratePDF')
class GeneratePDF(View):
    def get(self, request, *args, **kwargs):
        ....
in my views.py:
from abc.tasks import GeneratePDF
@method_decorator(login_required, name='dispatch')
class ClientProfilePDF(RedirectView):
    def get(self, request, *args, **kwargs):
        GeneratePDF.delay(request)
        return HttpResponseRedirect('/home/')
in my settings.py:
CELERY_BROKER_URL = 'amqp://localhost'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_RESULT_BACKEND = 'django-db'
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'Australia/Sydney'
CELERY_IMPORTS = ('abc.tasks',)
Can anyone point me in the right direction as to where I am going wrong and why I am getting this error? Any help is much appreciated!
Two quick things:
There is no need to pass any parameters to app.autodiscover_tasks(); Celery already knows how to use settings.INSTALLED_APPS.
The @shared_task decorator is for tasks that live in apps that do not have their own celery.py file that instantiates an app. From the looks of it, your tasks.py file lives in the same Django app as the celery.py file. In this case, you should use @app.task and not @shared_task.
Before you start, you can get a list of registered tasks by running celery -A myapp inspect registered. That will show you whether your GeneratePDF task is registered.

ScrapyDeprecationWarning: Command's default `crawler` is deprecated and will be removed. Use `create_crawler` method to instantiate crawlers

Scrapy version 0.19
I am using the code from this page (Run multiple scrapy spiders at once using scrapyd). When I run scrapy allcrawl, I get:
ScrapyDeprecationWarning: Command's default `crawler` is deprecated and will be removed. Use `create_crawler` method to instantiate crawlers
Here is the code:
from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log
class AllCrawlCommand(ScrapyCommand):
    requires_project = True
    default_settings = {'LOG_ENABLED': False}
    def short_desc(self):
        return "Schedule a run for all available spiders"
    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        for s in self.crawler.spiders.list():  # this line raises the warning
            values = {'project': 'YOUR_PROJECT_NAME', 'spider': s}
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response)
How do I fix the DeprecationWarning?
Thanks
Use:
crawler = self.crawler_process.create_crawler()

ImproperlyConfigured: Requested setting MIDDLEWARE_CLASSES, but settings are not configured

I am configuring Apache mod_wsgi for a Django project.
Here is my djangotest.wsgi file:
import os
import sys
sys.path = ['/home/pavan/djangoproject'] + sys.path
os.environ['DJANGO_SETTINS_MODULE'] = 'djangoproject.settings'
import django.core.handlers.wsgi
_application = django.core.handlers.wsgi.WSGIHandler()
def application(environ, start_response):
    environ['PATH_INFO'] = environ['SCRIPT_NAME'] + environ['PATH_INFO']
    return _application(environ, start_response)
I added the WSGIScriptAlias to my virtual directory.
When I try to load the project's homepage, I get the following error:
ImproperlyConfigured: Requested setting MIDDLEWARE_CLASSES, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
Looks like you have a typo on line 4. 'DJANGO_SETTINS_MODULE' should be 'DJANGO_SETTINGS_MODULE'.
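A misspelled key like this fails silently, because os.environ accepts any name. As an illustrative sketch (the helper require_settings_module is hypothetical and not part of Django), a small guard placed before the handler is created can make the mistake obvious by listing similarly named keys:

```python
import os


def require_settings_module(environ=None):
    """Return DJANGO_SETTINGS_MODULE or fail with a pointed message."""
    environ = os.environ if environ is None else environ
    name = environ.get("DJANGO_SETTINGS_MODULE")
    if not name:
        # List similarly named keys so that a misspelling stands out.
        lookalikes = sorted(k for k in environ if k.startswith("DJANGO_"))
        raise RuntimeError(
            "DJANGO_SETTINGS_MODULE is not set; similar keys found: %r" % lookalikes
        )
    return name
```

With the misspelled key from the question, the error message would include 'DJANGO_SETTINS_MODULE', which points straight at the typo.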