APScheduler + Scrapy: signal only works in main thread

I want to combine APScheduler with Scrapy, but my code is wrong.
How should I modify it?
from datetime import datetime

from apscheduler.schedulers.blocking import BlockingScheduler
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
configure_logging(settings)
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    reactor.run()
    yield runner.crawl(Jobaispider)      # this is my spider
    yield runner.crawl(Jobpythonspider)  # this is my spider
    reactor.stop()

sched = BlockingScheduler()
sched.add_job(crawl, 'date', run_date=datetime(2018, 12, 4, 10, 45, 10))
sched.start()
Error: builtins.ValueError: signal only works in main thread

This question has been answered in good detail here: How to integrate Flask & Scrapy?, which covers a variety of use cases and ideas. I also found one of the links in that thread very useful: https://github.com/notoriousno/scrapy-flask
To answer your question more directly, try the code below. It uses the solution from the above two links; in particular, it uses the crochet library.
import crochet
crochet.setup()

from datetime import datetime

from apscheduler.schedulers.blocking import BlockingScheduler
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
configure_logging(settings)
runner = CrawlerRunner(settings)

# Note: removing @defer.inlineCallbacks here for the example
@crochet.run_in_reactor
def crawl():
    runner.crawl(Jobaispider)      # this is my spider
    runner.crawl(Jobpythonspider)  # this is my spider

sched = BlockingScheduler()
sched.add_job(crawl, 'date', run_date=datetime(2018, 12, 4, 10, 45, 10))
sched.start()
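If you want the scheduled job to block until both crawls are finished (so overlapping runs can't pile up), crochet also provides a wait_for decorator. Here is a minimal sketch under that assumption, chaining the two crawls on the Deferred returned by runner.crawl(); the 600-second timeout is just a placeholder:
import crochet

@crochet.wait_for(timeout=600)  # block the scheduler thread until both crawls finish (placeholder timeout)
def crawl_and_wait():
    d = runner.crawl(Jobaispider)
    # start the second spider only after the first one has finished
    d.addCallback(lambda _: runner.crawl(Jobpythonspider))
    return d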

Plotly Dash: Chained callbacks are working fine, but always throw ValueError. How to fix or silence?

I have a dataset that lists different jobs and what tools are needed by date and location. Here is a tiny example of my df:
State,County,Job,Tool,Start,End
LA,Acadia,A,Hammer,2020-10-19,2020-11-02
LA,Acadia,A,Drill,2020-11-02,2020-12-02
LA,Acadia,B,Hammer,2020-11-28,2020-12-30
LA,Orleans,A,Hammer,2020-10-03,2020-11-02
LA,Orleans,A,Drill,2020-11-05,2020-12-02
LA,Orleans,B,Hammer,2020-12-03,2020-12-30
TX,Travis,A,Hammer,2020-10-19,2020-11-02
TX,Travis,A,Drill,2020-11-02,2020-12-02
TX,Travis,B,Hammer,2020-11-28,2020-12-30
TX,Brazoria,A,Hammer,2020-10-03,2020-11-02
TX,Brazoria,A,Drill,2020-11-05,2020-12-02
TX,Brazoria,B,Hammer,2020-12-03,2020-12-30
My dcc.Dropdowns and dcc.RadioItems are as follows:
html.Div([
    html.Label("State:", style={'fontSize': 30, 'textAlign': 'center'}),
    dcc.Dropdown(
        id='state-dpdn',
        options=[{'label': s, 'value': s} for s in sorted(df.State.unique())],
        value='LA',
        clearable=False,
    )],
html.Div([
    html.Label("County:", style={'fontSize': 30, 'textAlign': 'center'}),
    dcc.Dropdown(
        id='county-dpdn',
        options=[],
        value=[],
        clearable=False
    )],
html.Div([
    html.Label("Job:", style={'fontSize': 30, 'textAlign': 'center'}),
    dcc.RadioItems(
        id='radio-buttons',
        options=[],
    ),
And here are my callbacks:
@app.callback(
    Output('county-dpdn', 'options'),
    Output('county-dpdn', 'value'),
    Input('state-dpdn', 'value'),
)
def set_county_options(chosen_state):
    dff = df[df.State == chosen_state]
    counties_of_state = [{'label': c, 'value': c} for c in sorted(dff.County.unique())]
    values_selected = [x['value'] for x in counties_of_state]
    return counties_of_state, values_selected

@app.callback(
    Output('radio-buttons', 'options'),
    Output('radio-buttons', 'value'),
    Input('county-dpdn', 'value'),
)
def set_job_options(chosen_county):
    dff = df[df.County == chosen_county]  ###THIS IS LINE 255 W/ ERROR###
    job_of_county = [{'label': c, 'value': c} for c in dff.Job.unique()]
    values_selected = job_of_county[0]['value']
    return job_of_county, values_selected

@app.callback(
    Output('graph1', 'children'),
    Input('radio-buttons', 'value'),
    Input('county-dpdn', 'value'),
    Input('state-dpdn', 'value')
)
def build_graph(job_selected, location_selected, state_selected):
    if isinstance(state_selected, str):
        df['State'].eq(state_selected)
    if isinstance(location_selected, str):
        df['County'].eq(location_selected)
    if isinstance(job_selected, str):
        df['Job'].eq(job_selected)
    dffgraph = df[df['State'].eq(state_selected) & df['County'].eq(location_selected) & df['Job'].eq(job_selected)]
    ###dffgraph manipulation and figure creation down here###
All of this works, loads the data as intended, and my graphs are great. My problem comes when trying to get this out to users without throwing errors. Errors were helpful in constructing everything but when it came time to tweak the presentation of my graphs, I silenced the errors to come back to them later using dev_tools:
if __name__ == '__main__':
    app.run_server(debug=False, port=5000, dev_tools_ui=False, dev_tools_props_check=False)
I am creating a multipage Dash app which works great, but when I put this code into my main multipage app, or if I remove the dev tools from the local app.run_server shown above, I get the following error:
File "/Users/Documents/pythonProject/app.py", line 255, in set_job_options
dff = df[df.County == chosen_county]
File "/Users/venv/lib/python3.7/site-packages/pandas/core/ops/common.py", line 65, in new_method
return method(self, other)
File "/Users/venv/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 370, in wrapper
res_values = comparison_op(lvalues, rvalues, op)
File "/Users/venv/lib/python3.7/site-packages/pandas/core/ops/array_ops.py", line 225, in comparison_op
"Lengths must match to compare", lvalues.shape, rvalues.shape
ValueError: ('Lengths must match to compare', (4781,), (64,))
The csv file used for creating my df is 4781 lines long, and 64 is the number of parishes in Louisiana, so the numbers make sense.
While ignoring this error has worked thus far, after weeks of occasionally trying to figure it out I would love some help with this error. Thank you!
Edit 1
For help with simply silencing the errors, it is best to show how my app is set up. I have a multipage Flask app, with my Dash app being called when the user is directed to it.
My dashboard.py file looks like:
def init_dashboard(server):
    """Create a Plotly Dash dashboard."""
    dash_app = dash.Dash(
        server=server,
        routes_pathname_prefix='/dashapp/',
        external_stylesheets=[dbc.themes.FLATLY]
    )
    # Load DataFrame
    df = create_dataframe()
    # and so on....
    return dash_app.server
The __init__.py file in my Flask app that calls my dashboard.py:
def create_app(config_class=Config):
    app = Flask(__name__)
    app.config.from_object(config_class)
    # .... all the normal init stuff ....
    from .plotlydash.dashboard import init_dashboard
    app = init_dashboard(app)
    return app
Prior to deployment, while testing my Dash app as a single entity, debug=False, dev_tools_ui=False, and dev_tools_props_check=False all work well to silence the Dash errors. When I put my working Dash app into my Flask app as shown above, it still works just fine, but it throws errors which are logged and emailed to me like crazy. I've tried a few different approaches to silencing only the errors from my Dash app but haven't found a solution yet.
Edit at bottom.
The problem is you're doing this:
dff = df[df.County == chosen_county]
but the == won't work with a list, and chosen_county is a list. You should do
dff = df[df.County.isin(chosen_county)]
or require that only a single value be selected and it not be a list.
Edit:
To do this with a single value dropdown, I think the right way would be:
dff = df[df['County'].eq(chosen_county)]
The dropdown should not have multi=True, that was from my previous misunderstanding. Please let me know if this still is not working.
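Here is a minimal sketch of set_job_options handling both shapes of chosen_county (the branching logic is just for illustration, not code from the original app):
@app.callback(
    Output('radio-buttons', 'options'),
    Output('radio-buttons', 'value'),
    Input('county-dpdn', 'value'),
)
def set_job_options(chosen_county):
    # chosen_county is a list when the dropdown is multi-select, a plain string otherwise
    if isinstance(chosen_county, list):
        dff = df[df['County'].isin(chosen_county)]
    else:
        dff = df[df['County'].eq(chosen_county)]
    job_options = [{'label': j, 'value': j} for j in dff.Job.unique()]
    return job_options, job_options[0]['value']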

Python multiprocessing: how to update a complex object in a manager list without using the .join() method

I started programming in Python about 2 months ago and I've been struggling with this problem for the last 2 weeks.
I know there are many threads similar to this one, but I can't really find a solution that suits my case.
I need the main process, which is the one that interacts with Telegram, and another process, buffer, which understands the complex object received from the main process and updates it.
I'd like to do this in a simpler and smoother way.
At the moment objects are not being updated due to the use of multiprocessing without the join() method.
I then tried to use multithreading instead, but it gives me compatibility problems with Pyrogram, a framework which I am using to interact with Telegram.
I rewrote the "complexity" of my project in order to reproduce the same error I am getting, and in order to get and give the best help possible from and for everyone.
a.py
class A():
    def __init__(self, length = -1, height = -1):
        self.length = length
        self.height = height
b.py
from a import A

class B(A):
    def __init__(self, length = -1, height = -1, width = -1):
        super().__init__(length = -1, height = -1)
        self.length = length
        self.height = height
        self.width = width

    def setHeight(self, value):
        self.height = value
c.py
class C():
    def __init__(self, a, x = 0, y = 0):
        self.a = a
        self.x = x
        self.y = y

    def func1(self):
        if self.x < 7:
            self.x = 7
d.py
from c import C

class D(C):
    def __init__(self, a, x = 0, y = 0, z = 0):
        super().__init__(a, x = 0, y = 0)
        self.a = a
        self.x = x
        self.y = y
        self.z = z

    def func2(self):
        self.func1()
main.py
from b import B
from d import D
from multiprocessing import Process, Manager
from buffer import buffer

if __name__ == "__main__":
    manager = Manager()
    lizt = manager.list()
    buffer = Process(target = buffer, args = (lizt, ))  # passing the list as a parameter
    buffer.start()
    # can't invoke buffer.join() here because I need the below code to keep running while
    # the buffer process takes a few minutes to end an instance passed in the list
    # hence I can't wait for join() to update the objects inside the buffer, but I need
    # the objects updated in order to pop them out from the list

    import datetime as dt
    t = dt.datetime.now()

    # library of a kind of multithreading (pool of 4 processes), uses the asyncio lib
    # this while loop was put here to reproduce the same error I am getting
    while True:
        if t + dt.timedelta(seconds = 10) < dt.datetime.now():
            lizt.append(D(B(5, 5, 5)))
            t = dt.datetime.now()

"""
# This is the code which looks like the one in my project
# main.py
from pyrogram import Client  # library of a kind of multithreading (pool of 4 processes), uses the asyncio lib
from b import B
from d import D
from multiprocessing import Process, Manager
from buffer import buffer

if __name__ == "__main__":
    api_id = 1234567
    api_hash = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    app = Client("my_account", api_id, api_hash)
    manager = Manager()
    lizt = manager.list()
    buffer = Process(target = buffer, args = (lizt, ))  # passing the list as a parameter
    buffer.start()
    # can't invoke buffer.join() here because I need the below code to run at the same time as the buffer process
    # hence I can't wait for join() to update the objects inside the buffer

    @app.on_message()
    def my_handler(client, message):
        lizt.append(complex_object_conatining_message)
"""
buffer.py
def buffer(buffer):
    print("buffer was defined")
    while True:
        if len(buffer) > 0:
            print(buffer[0].x)          # prints 0
            buffer[0].func2()           # this changes the attribute locally in the class instance, but not in here
            print(buffer[0].x)          # prints 0, but I'd like it to be 7
            print(buffer[0].a.height)   # prints 5
            buffer[0].a.setHeight(10)   # and this has the same behaviour
            print(buffer[0].a.height)   # prints 5, but I'd like it to be 10
            buffer.pop(0)
This is all the code relevant to the problem I am having.
Literally every suggestion is welcome, and hopefully constructive; thank you in advance!
In the end I had to change my approach to this problem and use asyncio, just as the framework itself does.
This solution offers everything I was looking for:
-complex objects are updated
-the problems of multiprocessing (in particular with join()) are avoided
It is also:
-lightweight: before, I had 2 Python processes of about 40K and 75K;
the current single process is about 30K (and it's also faster and cleaner).
Here's the solution; I hope it will be useful for someone else like it was for me:
The part with the classes is skipped because this solution updates complex objects absolutely fine.
main.py
from pyrogram import Client
import asyncio
import time

from firstWorker import firstWorker

def cancel_tasks():
    # get all tasks in the current loop
    tasks = asyncio.Task.all_tasks()
    for t in tasks:
        t.cancel()

try:
    buffer = []
    firstWorker(buffer)  # this one is the old buffer.py file and function
    # the missing loop and loop method are explained in the next piece of code
except KeyboardInterrupt:
    print("")
finally:
    print("Closing Loop")
    cancel_tasks()
firstWorker.py
import asyncio

from pyrogram import Client
from secondWorker import secondWorker

def firstWorker(buffer):
    print("First Worker Executed")
    api_id = 1234567
    api_hash = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    app = Client("my_account", api_id, api_hash)

    @app.on_message()
    async def my_handler(client, message):
        print("Message Arrived")
        buffer.append(complex_object_conatining_message)
        await asyncio.sleep(1)

    app.run(secondWorker(buffer))  # here is the trick: I changed the run() method of the
                                   # Client class inside the Pyrogram framework, since it
                                   # was a loop itself. In this way I added another task
                                   # to the existing loop in order to let both of them run together.
my secondWorker.py
import asyncio

async def secondWorker(buffer):
    while True:
        if len(buffer) > 0:
            print(buffer.pop(0))
        await asyncio.sleep(1)
The resources to understand the asyncio used in this code can be found here:
Asyncio simple tutorial
Python Asyncio Official Documentation
This tutorial about how to fix classical Asyncio errors
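As a side note on the original Manager-based attempt (not part of the asyncio solution above): indexing a manager list gives you a copy of the stored object, so mutating that copy in place never reaches the manager process; writing the mutated object back does. A minimal sketch of that re-assign pattern, assuming the same B and D classes:
def buffer(buffer):
    while True:
        if len(buffer) > 0:
            obj = buffer[0]        # this is a local copy, not a proxy to the stored object
            obj.func2()            # mutate the copy...
            obj.a.setHeight(10)
            buffer[0] = obj        # ...then write it back so the manager sees the update
            print(buffer[0].x)         # now prints 7
            print(buffer[0].a.height)  # now prints 10
            buffer.pop(0)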

Deferred requests in scrapy

I would like to repeatedly scrape the same URLs with different delays. After researching the issue it seemed that the appropriate solution was to use something like
nextreq = scrapy.Request(url, dont_filter=True)
d = defer.Deferred()
delay = 1
reactor.callLater(delay, d.callback, nextreq)
yield d
in parse.
However, I have been unable to make this work. I am getting the error message
ERROR: Spider must return Request, BaseItem, dict or None, got 'Deferred'
I am not familiar with Twisted, so I hope I am just missing something obvious.
Is there a better way of achieving my goal that doesn't fight the framework so much?
I finally found an answer in an old PR
def parse(self, response):
    req = scrapy.Request(...)
    delay = 0
    reactor.callLater(delay, self.crawler.engine.schedule, request=req, spider=self)
However, the spider can exit due to being idle too early. Based on the outdated middleware https://github.com/ArturGaspar/scrapy-delayed-requests, this can be remedied with:
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class ImmortalSpiderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_idle, signal=signals.spider_idle)
        return s

    @classmethod
    def spider_idle(cls, spider):
        raise DontCloseSpider()
The final option, updating the middleware by ArturGaspar, led to:
from weakref import WeakKeyDictionary

from scrapy import signals
from scrapy.exceptions import DontCloseSpider, IgnoreRequest
from twisted.internet import reactor

class DelayedRequestsMiddleware(object):
    requests = WeakKeyDictionary()

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    @classmethod
    def spider_idle(cls, spider):
        if cls.requests.get(spider):
            spider.log("delayed requests pending, not closing spider")
            raise DontCloseSpider()

    def process_request(self, request, spider):
        delay = request.meta.pop('delay_request', None)
        if delay:
            self.requests.setdefault(spider, 0)
            self.requests[spider] += 1
            reactor.callLater(delay, self.schedule_request, request.copy(), spider)
            raise IgnoreRequest()

    def schedule_request(self, request, spider):
        spider.crawler.engine.schedule(request, spider)
        self.requests[spider] -= 1
It can then be used in parse like:
yield Request(..., meta={'delay_request': 5})
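For the middleware to run it also has to be enabled in settings.py; a sketch assuming the class lives in a module called middlewares.py inside your project package (the priority value 543 is arbitrary):
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DelayedRequestsMiddleware': 543,
}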

How does scrapy crawl work: which class is instantiated and which method is called?

Here is a simple Python file, test.py:
import math

class myClass():
    def myFun(self, x):
        return(math.sqrt(x))

if __name__ == "__main__":
    myInstance = myClass()
    print(myInstance.myFun(9))
It prints 3 when run with python test.py. Let's analyse the running process:
1. myClass is instantiated and assigned to myInstance.
2. the myFun method is called and the result is printed.
Now it is Scrapy's turn.
In the Scrapy 1.4 manual, quotes_spider.py is as below:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
To run the spider with scrapy crawl quotes, I am puzzled:
1. Where is the main function or main body of the spider?
2. Which class is instantiated?
3. Which method is called? Something like:
mySpider = QuotesSpider(scrapy.Spider)
mySpider.parse(response)
How does scrapy crawl work exactly?
So let's start. Assuming you use Linux or macOS, let's check where scrapy is:
$ which scrapy
/Users/tarun.lalwani/.virtualenvs/myproject/bin/scrapy
Let's look at the content of this file
$ cat /Users/tarun.lalwani/.virtualenvs/myproject/bin/scrapy
#!/Users/tarun.lalwani/.virtualenvs/myproject/bin/python3.6
# -*- coding: utf-8 -*-
import re
import sys
from scrapy.cmdline import execute
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())
So this executes the execute method from cmdline.py, and here is your main method.
cmdline.py
from __future__ import print_function
....
....

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

    if settings is None:
        settings = get_project_settings()
        # set EDITOR from environment if available
        try:
            editor = os.environ['EDITOR']
        except KeyError: pass
        else:
            settings['EDITOR'] = editor
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
                                   conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)

if __name__ == '__main__':
    execute()
Now if you notice the execute method, it processes the arguments you passed, which is crawl quotes in your case. The execute method scans the project for spider classes and checks which one has name defined as quotes. It then creates the CrawlerProcess, and that runs the whole show.
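A rough sketch of what that amounts to, written as a standalone script; this uses the documented CrawlerProcess API rather than the exact internals of cmdline.py, and the import path for the spider is hypothetical:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.quotes_spider import QuotesSpider  # hypothetical path to your spider module

process = CrawlerProcess(get_project_settings())  # reads settings.py, sets up logging
process.crawl(QuotesSpider)  # builds a Crawler around the spider class whose name is "quotes"
process.start()              # starts the Twisted reactor and blocks until crawling finishes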
Scrapy is based on the Twisted Python framework, which is a scheduler-based framework.
Consider the below part of the code
for url in urls:
    yield scrapy.Request(url=url, callback=self.parse)
When the engine executes this function, the first yield is executed and the value is returned to the engine. The engine now looks at other tasks that are pending and executes them (when those yield, some other pending task in the queue gets a chance). So yield is what allows a function's execution to be broken into parts, and that is what lets Scrapy/Twisted work.
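A toy illustration of that yield mechanism (plain Python, not Scrapy internals): the generator pauses at each yield, hands a value to the caller, and the caller decides when to resume it.
def start_requests():
    for url in ['http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/']:
        yield url  # execution pauses here; control goes back to the caller

pending = start_requests()  # nothing runs yet, this just creates a generator object
print(next(pending))        # runs up to the first yield -> page/1/
# ... the "engine" is free to do other work here ...
print(next(pending))        # resumes after the yield -> page/2/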
You can get a detailed overview on the link below
https://doc.scrapy.org/en/latest/topics/architecture.html

Wrote an errback for my Scrapy spider, but tracebacks also keep happening, why?

I am using Scrapy 1.1 and I call Scrapy from within a script. My spider launching method looks like this:
def run_spider(self):
    runner = CrawlerProcess(get_project_settings())
    spider = SiteSpider()
    configure_logging()
    d = runner.crawl(spider, websites_file=self.raw_data_file)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
Here is an extract of my spider, with an errback written as in the documentation; it only prints when it catches a failure.
class SiteSpider(scrapy.Spider):
    name = 'SiteCrawler'
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json',
    }

    def __init__(self, websites_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.websites_file = websites_file
        print('***********')
        print(self.websites_file)

    def start_requests(self):
        .....
        if is_valid_url(website_url):
            yield scrapy.Request(url=website_url, callback=self.parse, errback=self.handle_errors, meta={'url': account_id})

    def parse(self, response):
        .....
        yield item

    def handle_errors(self, failure):
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            print('HttpError on ' + response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            print('DNSLookupError on ' + request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            print('TimeoutError on ' + request.url)
My problem is that I get errors I expect, like:
TimeoutError on http://www.example.com
But I also get tracebacks for the same websites:
2016-08-05 13:40:55 [scrapy] ERROR: Error downloading <GET http://www.example.com/robots.txt>: TCP connection timed out: 60: Operation timed out.
Traceback (most recent call last):
File ".../anaconda/lib/python3.5/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File ".../anaconda/lib/python3.5/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File ".../anaconda/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
The exception-handling messages I print and the tracebacks can often be traced to the same websites. After searching a lot on Stack Overflow, in the docs, and the like, I still don't know why I see the tracebacks.
This also occurs with DNSLookupErrors, for example.
Excuse me, my Scrapy knowledge is juvenile. Is this normal behavior?
Also, I added this to settings.py, which is under my crawler. Other entries (for example the item_pipelines) definitely work.
LOG_LEVEL = 'WARNING'
But I still see debug messages, not only warnings and everything above that (if configure_logging() is added to the spider launch). I am running this from the terminal on Mac OS X.
I would be very happy to get any help with this.
Try this in a script:
if __name__ == '__main__':
    runner = CrawlerProcess(get_project_settings())
    spider = SiteSpider()
    configure_logging()
    d = runner.crawl(spider, websites_file=self.raw_data_file)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
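One likely reason the DEBUG messages keep showing up (this is an assumption based on Scrapy's logging docs, not something confirmed above): a bare configure_logging() call installs the root log handler with default settings, which overrides the LOG_LEVEL = 'WARNING' that CrawlerProcess already picked up from settings.py. A sketch that simply drops the extra call; 'websites.txt' is a hypothetical input path:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

if __name__ == '__main__':
    settings = get_project_settings()  # picks up LOG_LEVEL = 'WARNING' from settings.py
    runner = CrawlerProcess(settings)  # CrawlerProcess configures logging from these settings itself
    d = runner.crawl(SiteSpider, websites_file='websites.txt')
    d.addBoth(lambda _: reactor.stop())
    reactor.run()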