Can I run scrapy spider with different setting in different process(parallel)? - scrapy

I define one spider with name='myspider', its behavior would be different according to the setting.And I want to run the spider with different instances in different process, is it possible?
I check the source code,it seems the SpiderLoader just walk the spiders module and I could just run one spider with the same name one time.
the running code seems:
for item in items:
settings = get_project_settings()
settings.set('item', item)
settings.set('DEFAULT_REQUEST_HEADERS', item.get('request_header'))
process = CrawlerProcess(settings)
process.crawl("myspider")
process.start()
and of course, the error shows:
Traceback (most recent call last):
File "/home/xuanqi/workspace/github/foolcage/fospider/fospider/main.py", line 44, in <module>
process.start() # the script will block here until the crawling is finished
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 280, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1194, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1174, in startRunning
ReactorBase.startRunning(self)
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Thanks advance for any help!

Setting cannot be changed at runtime.
I suggest you to use spider argument to pass different variable to spider.
process = CrawlerProcess(settings)
process.crawl("myspider", request_headers='specified headers...')
process.start()
And for doing this, you have to override init function of your spider to accept these variables. And pass the request_header to every Request object you use in the spider.
def __init__(self, **kw):
super(MySpider, self).__init__(**kw)
self.headers = kw.get('request_headers')
...
yield scrapy.Request(url='www.example.com', headers=self.headers)

Related

No `test dataloader()` method defined to run Trainer.test while training Dreambooth-Stable-Diffusion

I'm implementing Dreambooth-Stable-Diffusion On Google Colab.
I was able to install coda replicate the same steps mentioned in the repo above, and generate the regulation images successfully. However, I'm getting pytorch_lightning.utilities.exceptions.MisconfigurationException: No test dataloader() method defined to run Trainer.test after running the Training command.
Here is the full log:
trainer.test(model, data)
File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
results = self._run(model, ckpt_path=self.tested_ckpt_path)
File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1128, in _run
verify_loop_configurations(self)
File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 42, in verify_loop_configurations
__verify_eval_loop_configuration(trainer, model, "test")
File "/usr/local/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 186, in __verify_eval_loop_configuration
raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.
I tried this solution - even though it was hard to know where to exactly put the edit and I'm not 100%, but still, it didn't work.

Scrapy: drop item from scraper

I would like to drop an item from the scraper itself instead of add the particular dropping logic of this scraper into the pipeline due is a specific case.
Scrapy has DropItem exception that is nicely handled from the Pipeline but it produces an error if is raised from the scraper:
#...
raise DropItem('Item dropped ' + self.id())
Output:
2019-11-13 13:27:27 [scrapy.core.scraper] ERROR: Spider error processing <GET http://domain.tld/> (referer: http://domain.tld/referer)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/core/core/spiders/my_spider.py", line 46, in parse get.photos())
scrapy.exceptions.DropItem: Item dropped 35
Is there a more elegant way to handle this situation?

AttributeError: 'Context' object has no attribute 'browser'

I am currently experimenting with Behavioral Driven Development. I am using behave_django with selenium. I get the following output
Creating test database for alias 'default'...
Feature: Open website and print title # features/first_selenium.feature:1
Scenario: Open website # features/first_selenium.feature:2
Given I open seleniumframework website # features/steps/first_selenium.py:2 0.001s
Traceback (most recent call last):
File "/home/vagrant/newproject3/newproject3/venv/local/lib/python2.7/site-packages/behave/model.py", line 1456, in run
match.run(runner.context)
File "/home/vagrant/newproject3/newproject3/venv/local/lib/python2.7/site-packages/behave/model.py", line 1903, in run
self.func(context, *args, **kwargs)
File "features/steps/first_selenium.py", line 4, in step_impl
context.browser.get("http://www.seleniumframework.com")
File "/home/vagrant/newproject3/newproject3/venv/local/lib/python2.7/site-packages/behave/runner.py", line 214, in __getattr__
raise AttributeError(msg)
AttributeError: 'Context' object has no attribute 'browser'
Then I print the title # None
Failing scenarios:
features/first_selenium.feature:2 Open website
0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
0 steps passed, 1 failed, 1 skipped, 0 undefined
Took 0m0.001s
Destroying test database for alias 'default'...
Here is the code:
first_selenium.feature
Feature: Open website and print title
Scenario: Open website
Given I open seleniumframework website
Then I print the title
first_selenium.py
from behave import *
#given('I open seleniumframework website')
def step_impl(context):
context.browser.get("http://www.seleniumframework.com")
#then('I print the title')
def step_impl(context):
title = context.browser.title
assert "Selenium" in title
manage.py
#!/home/vagrant/newproject3/newproject3/venv/bin/python
import os
import sys
sys.path.append("/home/vagrant/newproject3/newproject3/site/v2/features")
import dotenv
if __name__ == "__main__":
path = os.path.realpath(os.path.dirname(__file__))
dotenv.load_dotenv(os.path.join(path, '.env'))
from configurations.management import execute_from_command_line
#from django.core.management import execute_from_command_line
execute_from_command_line(sys.argv)
I'm not sure what this error means
I know it is a late answer but maybe somebody is going to profit from it:
you need to declare the context.browser (in a before_all/before_scenario/before_feature hook definition or just test method definition) before you use it, e.g.:
context.browser = webdriver.Chrome()
Please note that the hooks must be defined in a separate environment.py module
In my case the browser wasn't installed. That can be a case too. Also ensure path to geckodriver is exposed if you are working with Firefox.

Using LineReciever with Twisted Process protocols

I'm trying to use twisted to handle data generated by a binary (which indefinitely dumps lines onto stdout). Since by data is inherently line-delimited, I was trying to used the LineReciever instead of trying to parse data. The following is the relevant bit of the code which seems to be causing trouble :
class ProtocolBareQDAL41xB(ProcessProtocol, LineReceiver):
...
def outReceived(self, data):
print "Got Data:" + repr(data)
self.dataReceived(data)
def lineReceived(self, line):
print "Got Line: " + line
self._process_line(line)
...
This 'works' for the first of two lines in the output. I don't know yet if it works for only one line, or if it works for all but the last line. The resulting output looks something like :
$ python BareQDAL41xB.py
Made Connection
<Process pid=16486 status=-1>
Got Data:'No device found!\nMultiple devices found! Please connect only one.\n'
Got Line: No device found!
Got Serial Number : found!
Unhandled Error
Traceback (most recent call last):
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/python/log.py", line 101, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/python/log.py", line 84, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
--- <exception caught here> ---
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 597, in _doReadOrWrite
why = selectable.doRead()
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/internet/process.py", line 274, in doRead
return fdesc.readFromFD(self.fd, self.dataReceived)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/internet/fdesc.py", line 94, in readFromFD
callback(output)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/internet/process.py", line 277, in dataReceived
self.proc.childDataReceived(self.name, data)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/internet/process.py", line 931, in childDataReceived
self.proto.childDataReceived(name, data)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/internet/protocol.py", line 604, in childDataReceived
self.outReceived(data)
File "BareQDAL41xB.py", line 104, in outReceived
self.dataReceived(data)
File "/media/ldata/code/virtualenvs/tendril/local/lib/python2.7/site-packages/twisted/protocols/basic.py", line 573, in dataReceived
self.transport.disconnecting):
exceptions.AttributeError: 'Process' object has no attribute 'disconnecting'
processExited, status 0
processEnded, status 0
LineReciever seems to be expecting the transport to implement disconnecting.
Is it possible to use twisted's LineReciever with twisted's ProcessProtocol, or should I implement the line parser in my protocol instead?
LineReceiver is already a Protocol, which implements different interfaces than IProcessProtocol.
Luckily, recent versions of Twisted already contain an adapter that does what you want - which is to treat a subprocess as a stream of bytes. Rather than calling spawnProcess directly, use ProcessEndpoint, and you can pass a regular ProtocolFactory, no ProcessProtocol involved.
However, as a commenter has already pointed out, there's a bug here, where the disconnecting attribute is not formally part of ITransport, but LineReceiver (and LineOnlyReceiver) depend on it anyway, and since it's not part of the interface, ProcessEndpoint doesn't implement it. That should definitely be fixed, but in the meanwhile, we'll need to work around it.
As a happy accident, Twisted's built-in support for wrapping protocols, WrappingFactory, already has support for the disconnecting attribute, specifically because of this ugly disparity between the theory of the interface specifications and the reality of the most popular ITransport implementations. So even a do-nothing wrapper will work around the problem. You can implement this like so:
from zope.interface import implementer
from twisted.internet.interfaces import IStreamClientEndpoint
from twisted.protocols.policies import WrappingFactory
#implementer(IStreamClientEndpoint)
class DisconnectingWorkaroundEndpoint(object):
def __init__(self, endpoint):
self._endpoint = endpoint
def connect(self, protocolFactory):
return self._endpoint.connect(WrappingFactory(protocolFactory))
and then when you construct your ProcessEndpoint, do:
endpoint = DisconnectingWorkaroundEndpoint(ProcessEndpoint(...))
Sorry for the delay on answering; while you've probably worked out your own workaround, I hope this will be useful to others with the same question!

How do I set a backend for django-celery. I set CELERY_RESULT_BACKEND, but it is not recognized

I set CELERY_RESULT_BACKEND = "amqp" in celeryconfig.py
but I get:
>>> from tasks import add
>>> result = add.delay(3,5)
>>> result.ready()
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/djangoprojects/venv/local/lib/python2.7/site-packages/celery/result.py", line 105, in ready
return self.state in self.backend.READY_STATES
File "/djangoprojects/venv/local/lib/python2.7/site-packages/celery/result.py", line 184, in state
return self.backend.get_status(self.task_id)
File "/djangoprojects/venv/local/lib/python2.7/site-packages/celery/backends/base.py", line 414, in _is_disabled
raise NotImplementedError("No result backend configured. "
NotImplementedError: No result backend configured. Please see the documentation for more information.
I just went through this so I can shed some light on this. One might think for all of the great documentation stating some of this would have been a bit more obvious.
I'll assume you have both RabbitMQ up and functioning (it needs to be running), and that you have dj-celery installed.
Once you have that then all you need to do is to include this single line in your setting.py file.
BROKER_URL = "amqp://guest:guest#localhost:5672//"
Then you need to run syncdb and start this thing up using:
python manage.py celeryd -E -B --loglevel=info
The -E states that you want events captured and the -B states you want celerybeats running. The former enable you to actually see something in the admin window and the later allows you to schedule. Finally you need to ensure that you are actually going to capture the events and the status. So in another terminal run this:
./manage.py celerycam
And then finally your able to see the working example provided in the docs.. -- Again assuming you created the tasks.py that is says to.
>>> result = add.delay(4, 4)
>>> result.ready() # returns True if the task has finished processing.
False
>>> result.result # task is not ready, so no return value yet.
None
>>> result.get() # Waits until the task is done and returns the retval.
8
>>> result.result # direct access to result, doesn't re-raise errors.
8
>>> result.successful() # returns True if the task didn't end in failure.
True
Furthermore then you are able to view your status in the admin panel.
I hope this helps!! I would add one more thing which helped me. Watching the RabbitMQ Log file was key as it helped me identify that django-celery was actually talking to RabbitMQ.
Are you running django celery?
If so, you need to start a python shell in the context of django (or whatever the technical term is).
Type:
python manage.py shell
And try your commands from that shell
HI tried everything to work celery v3.1.25 with Django 1.8 version nothing worked..
Finally below line helped me ,feeling happy
app = Celery('documents',backend="celery.backends.amqp:AMQPBackend")
Setting backend="celery.backends.amqp:AMQPBackend" fixed my error.