I run my crawler like this:
if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    s = get_project_settings()
    process = CrawlerProcess(s)
    process.crawl(MySpider)
    process.start()
And I use custom settings which include:
"JOBDIR": "crawler_1",
If the crawler fails, how do I restart it from the point of failure?
Assuming the JOBDIR captured all the persistent data accurately, all you should need to do is run it again.
Scrapy will automatically see the JOBDIR in your settings and will check if there is any persistent data from previous runs.
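A minimal sketch (spider name, domain, and the run script are placeholders): if the JOBDIR lives in the spider's custom_settings as in the question, re-running the same script after a crash resumes from the state saved in crawler_1/.
import scrapy

class MySpider(scrapy.Spider):
    # Placeholder spider; only the JOBDIR setting matters for resuming.
    name = "myspider"
    start_urls = ["https://example.com"]
    custom_settings = {"JOBDIR": "crawler_1"}  # request queue and dedup state are persisted here

    def parse(self, response):
        yield {"url": response.url}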
In a Scrapy project one uses middleware quite often. Is there a generic way of enabling middleware in the scrapy shell during interactive sessions as well?
Middlewares set up in settings.py are enabled by default in scrapy shell; you can see them in the logs when running scrapy shell.
So to answer your question: yes, you can do so using this command.
scrapy shell -s DOWNLOADER_MIDDLEWARES='<<your custom middleware>>'
You can override settings using the -s parameter.
Remember, just running scrapy shell inside a folder that contains a Scrapy project will load the default settings from settings.py.
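For example, assuming a project whose settings.py already registers a custom downloader middleware (the module path and priority below are placeholders), that middleware will be active in the shell with no extra flags:
# settings.py -- hypothetical entry; anything registered here is loaded
# automatically when `scrapy shell` is run from inside the project directory.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyCustomDownloaderMiddleware": 543,
}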
Happy Scraping :)
For my current project, after I trigger a workflow, I need to check the status of its execution. I am not sure about the exact command. I have tried 'get-workflow' but it didn't seem to work.
There are a few ways, increasing in order of heavy-handedness.
You can hit the API endpoint in Admin directly with curl or something similar.
The Python SDK (flytekit) also ships with a command-line control plane utility called flyte-cli. In the future this may move to another location, but it's here for now and you can invoke it with this command:
flyte-cli -p yourproject -d development get-execution -u ex:yourproject:development:2fd90i
You can also use the Python class in flytekit that represents a workflow execution.
In [1]: from flytekit.configuration import set_flyte_config_file
In [2]: set_flyte_config_file('/Users/user/.flyte/config')
In [3]: from flytekit.common.workflow_execution import SdkWorkflowExecution
In [4]: e = SdkWorkflowExecution.fetch('yourproject', 'development', '2fd90i')
I am developing peripheral hardware and want to use QEMU to test it.
The plan is to run the device driver in QEMU and use libvirt (or something else?) to interface the VM with a Python-based simulation model of the peripheral.
I am aware that QEMU can be single-stepped via GDB, but I am looking at a Python approach to do the following:
Wait for a write to a specific memory location.
Suspend QEMU.
Run some background task on the host.
Run QEMU for N cycles.
Write to a memory location.
Continue.
Is this possible with libvirt or any other toolkit?
I needed to do something similar, and came across two approaches:
Run Python in GDB, using a python script of the commands
Use a Python API to GDB like pygdbmi
The latter ended up being more flexible, so I'll explain those steps here.
Configure qemu with debugging information:
./configure --enable-debug
Build qemu and invoke it halted, with debug hooks:
make
sudo make install
qemu-system-x86_64 -S -s
Now, use a Python script to attach to and interact with qemu via pygdbmi (instructions here):
from pygdbmi.gdbcontroller import GdbController
from pprint import pprint
# Start gdb process
gdbmi = GdbController()
print(gdbmi.get_subprocess_cmd()) # print actual command run as subprocess
response = gdbmi.write('target remote localhost:1234')  # attach to QEMU's GDB socket
pprint(response)
response = gdbmi.write('-break-insert main') # machine interface (MI) commands start with a '-'
response = gdbmi.write('break main') # normal gdb commands work too, but the return value is slightly different
response = gdbmi.write('-exec-run')
response = gdbmi.write('run')
response = gdbmi.write('-exec-next', timeout_sec=0.1) # the wait time can be modified from the default of 1 second
response = gdbmi.write('next')
response = gdbmi.write('next', raise_error_on_timeout=False)
response = gdbmi.write('next', raise_error_on_timeout=True, timeout_sec=0.01)
response = gdbmi.write('-exec-continue')
response = gdbmi.send_signal_to_gdb('SIGKILL') # name of signal is okay
response = gdbmi.send_signal_to_gdb(2) # value of signal is okay too
response = gdbmi.interrupt_gdb() # sends SIGINT to gdb
response = gdbmi.write('si 20') # step 20 instructions
response = gdbmi.write('continue')
response = gdbmi.exit()
If you have trouble with kernel symbols, you might also need to issue a command 'file myKernel' to load the symbol table from that file, assuming it was compiled with debugging information.
For reference, the '-s' option makes QEMU listen for a GDB connection at localhost:1234, so the first command you issue must direct gdb to look there:
gdbmi.write('target remote localhost:1234')
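As a rough sketch of how the steps in the question could map onto GDB commands sent through pygdbmi (the watched address 0x40000000 and the written value are placeholders; note that GDB steps instructions, not clock cycles):
from pygdbmi.gdbcontroller import GdbController

gdbmi = GdbController()
gdbmi.write('target remote localhost:1234')               # attach to QEMU started with -S -s
gdbmi.write('watch *(unsigned int *)0x40000000')          # break when the guest writes this address
gdbmi.write('continue', raise_error_on_timeout=False)     # run until the watchpoint fires; QEMU is halted while stopped
# ... run the host-side Python simulation model here ...
gdbmi.write('si 20')                                      # step 20 instructions (an approximation of "N cycles")
gdbmi.write('set {unsigned int}0x40000000 = 0xdeadbeef')  # write a value back into guest memory
gdbmi.write('continue')                                   # resume the guest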
case 1: running scrapy crawl somespider several times (at the same time, using nohup in the background)
case 2: using CrawlerProcess, configuring multiple spiders in a Python script, and running that
What are the differences between these cases? I already tried case 2 with 5 spiders, but it was not that fast.
scrapy crawl uses one process for each spider, while CrawlerProcess uses a single Twisted reactor in one process (while also doing some things under the hood which I'm not so sure about) to run multiple spiders at once.
So, basically:
scrapy crawl -> more than one process
CrawlerProcess -> runs only one process with a Twisted Reactor
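A minimal sketch of case 2 (spider names and URLs are placeholders): both spiders run concurrently inside one process on one Twisted reactor.
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderA(scrapy.Spider):
    name = "spider_a"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"url": response.url}

class SpiderB(scrapy.Spider):
    name = "spider_b"
    start_urls = ["https://example.org"]

    def parse(self, response):
        yield {"url": response.url}

process = CrawlerProcess()
process.crawl(SpiderA)
process.crawl(SpiderB)
process.start()  # blocks until both crawls finish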
When I register a Lua script to a redis client:
script = redis_client.register_script(lua_string)
and then run the script with the default client:
script(keys, args)
does this automatically use evalsha internally or does it send the whole script to the server every time?
Yes. Here's the (abridged) source code:
class Script(object):
    def __call__(self, keys=[], args=[], client=None):
        if client is None:
            client = self.registered_client
        args = tuple(keys) + tuple(args)
        if isinstance(client, BasePipeline):
            # Make sure the pipeline can register the script before executing.
            client.scripts.add(self)
        return client.evalsha(self.sha, len(keys), *args)
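A quick usage sketch (assuming a Redis server on localhost; the key and script are placeholders):
import redis

redis_client = redis.Redis()
lua_string = "return redis.call('INCRBY', KEYS[1], ARGV[1])"

script = redis_client.register_script(lua_string)  # the script's SHA1 is cached on the Script object
print(script(keys=["counter"], args=[5]))          # dispatched as EVALSHA with the cached SHA
If the server does not recognize the SHA (a NoScriptError), redis-py loads the script and retries the EVALSHA, so the full script body is only sent when it has to be.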