Run multiple spiders from a script in Scrapy

I am working on a Scrapy project and want to run multiple spiders at the same time.
This is my code for running the spiders from a script, but I am getting an error. How should I do this?
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings

TO_CRAWL = [DmozSpider, CraigslistSpider]
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

log.start(loglevel=log.DEBUG)
for spider in TO_CRAWL:
    settings = Settings()
    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)
    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process so always keep as the last statement
reactor.run()

Sorry not to answer the question itself, but let me bring scrapyd and Scrapinghub to your attention (at least for a quick test). reactor.run() (once you reach it) will run any number of Scrapy instances on a single CPU. Is that a side effect you want? If you look at scrapyd's code, it does not run multiple instances in a single thread; it forks/spawns subprocesses.
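For illustration, here is a minimal sketch of that subprocess idea using the multiprocessing module (this is not scrapyd's actual code): each spider gets its own process and therefore its own Twisted reactor. The spider imports are the ones from the question.

# A minimal sketch of the subprocess approach, not scrapyd's actual code:
# each spider runs in its own process with its own Twisted reactor.
import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

def run_spider(spider_cls):
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()  # blocks until this crawl finishes

if __name__ == "__main__":
    processes = [multiprocessing.Process(target=run_spider, args=(cls,))
                 for cls in (DmozSpider, CraigslistSpider)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()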

You need something like the code below. You can easily find it in the Scrapy docs :)
First utility you can use to run your spiders is
scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor
for you, configuring the logging and setting shutdown handlers. This
class is the one used by all Scrapy commands.
# -*- coding: utf-8 -*-
import sys
import logging
import traceback

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

SPIDER_LIST = [
    DmozSpider, CraigslistSpider
]

if __name__ == "__main__":
    try:
        # set up the crawler process, schedule every spider, then start crawling
        process = CrawlerProcess(get_project_settings())
        for spider in SPIDER_LIST:
            process.crawl(spider)
        process.start()
    except Exception:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        logging.info('Error on line {}'.format(exc_tb.tb_lineno))
        logging.info("Exception: %s" % str(traceback.format_exc()))
References:
http://doc.scrapy.org/en/latest/topics/practices.html

Related

Is there any limit on running spiders in parallel?

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
Here we are running 2 spiders in parallel.
So, is there any limit on the number of parallel jobs in Scrapy? When I ran 10 spiders this way, some of them returned no records at all.
I have tried different numbers of spiders, but I am not sure what the limit for running them in parallel is. If there is an example of running spiders in parallel, please share it.
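As far as I know, CrawlerProcess does not impose a hard cap on the number of crawlers, but they all share one process and one reactor, so the per-spider concurrency settings are usually what need tuning when many spiders run at once. A minimal sketch, reusing the spiders above and otherwise assuming default settings:

# A minimal sketch: make per-spider concurrency explicit when scheduling
# many spiders on one CrawlerProcess (all of them share a single reactor).
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "CONCURRENT_REQUESTS": 16,            # requests in flight per spider
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,  # per-domain politeness limit
})

for spider_cls in (MySpider1, MySpider2):  # extend the tuple as needed
    process.crawl(spider_cls)
process.start()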

How can I add a new spider arg to my own template in Scrapy/Zyte

I am working on a paid proxy spider template and would like the ability to pass in a new argument on the command line for a Scrapy crawler. How can I do that?
This is achievable by using kwargs in your spider's __init__ method:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        self.your_arg = kwargs.get("your_cmd_arg", 42)
Now it would be possible to call the spider as follows:
scrapy crawl your_spider -a your_cmd_arg=foo
For more information on the topic, feel free to check this page in the Scrapy documentation.
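The same keyword argument also works when crawling from a script, because CrawlerProcess.crawl() forwards extra keyword arguments to the spider. A minimal sketch (the import path for YourSpider is hypothetical):

# A minimal sketch: pass the spider argument programmatically instead of via -a.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.your_spider import YourSpider  # hypothetical import path

process = CrawlerProcess(get_project_settings())
process.crawl(YourSpider, your_cmd_arg="foo")  # ends up in kwargs of __init__
process.start()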

Relative imports from within a notebook that is not at the 'main.py' level

I have a structure like the following one:
/src
    __init__.py
    module1.py
    module2.py
/tests
    __init__.py
    test_module1.py
    test_module2.py
/notebooks
    __init__.py
    exploring.ipynb
main.py
I'd like to use the notebook 'exploring' to do some data exploration, and to do so I'd need to perform relative imports of module1 and module2. But if I try to run from ..src.module1 import funct1, I receive an ImportError: attempted relative import with no known parent package, which I understand is expected because I'm running the notebook as if it was a script and not as a module.
So as a workaround I have been mainly pulling the notebook outside its folder to the main.py level every time I need to use it, and then from src.module1 import funct1 works.
I know there are already tons of threads on relative imports, but so far I couldn't find a simpler way to make this work without moving the notebook every time. Is there any way to perform this relative import, given that the notebook when called is running "as a script"?
Scripts cannot do relative imports. Have you considered something like:
import os
import sys

if __name__ == "__main__":
    sys.path.insert(0,
        os.path.abspath(os.path.join(os.getcwd(), '..')))
    from src.module1 import funct1
else:
    from ..src.module1 import funct1
Or using exceptions:
try:
    from ..src.module1 import funct1
except ImportError:
    sys.path.insert(0,
        os.path.abspath(os.path.join(os.getcwd(), '..')))
    from src.module1 import funct1
?
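A minimal sketch of the same workaround from the notebook's side, assuming exploring.ipynb stays one level below the project root: put this in the notebook's first cell and use absolute imports from there on.

# First cell of exploring.ipynb: add the project root (the directory that
# contains src/) to sys.path, then import with absolute paths.
import os
import sys

sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))

from src.module1 import funct1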

Having problems declaring SUMO_HOME

I'm trying to run a test Python script that uses the traci library, and it keeps returning "please declare environment variable 'SUMO_HOME'".
I'm on Ubuntu 18.04.2 and SUMO 0.32.0. I solved this problem before by running
export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tools/
but this time it didn't solve the problem. So I tried adding a line inside the Python file, using the os library to issue the same command from the code itself:
os.system("export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tool/")
That didn't work either, so I came here to ask for help. Could any of you help me, please?
import os
import sys
import optparse

os.system("export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tool/")

# we need to import some python modules from the $SUMO_HOME/tools directory
if 'SUMO_HOME' in os.environ:
    tools = os.path.join(os.environ['SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0/tools/'], 'tools')
    sys.path.append(tools)
else:
    sys.exit("please declare environment variable 'SUMO_HOME'")

from sumolib import checkBinary  # Checks for the binary in environ vars
import traci

def get_options():
    opt_parser = optparse.OptionParser()
    opt_parser.add_option("--nogui", action="store_true",
                          default=False, help="run the commandline version of sumo")
    options, args = opt_parser.parse_args()
    return options

# contains TraCI control loop
def run():
    step = 0
    while traci.simulation.getMinExpectedNumber() > 0:
        traci.simulationStep()
        print(step)
        step += 1
    traci.close()
    sys.stdout.flush()

# main entry point
if __name__ == "__main__":
    options = get_options()

    # check binary
    if options.nogui:
        sumoBinary = checkBinary('sumo')
    else:
        sumoBinary = checkBinary('sumo-gui')

    # traci starts sumo as a subprocess and then this script connects and runs
    traci.start([sumoBinary, "-c", "demo.sumocfg",
                 "--tripinfo-output", "tripinfo.xml"])
    run()
I expected the steps to appear in the terminal.
The correct location is probably
export SUMO_HOME=/home/gustavo/Downloads/sumo-0.32.0
without the tools or tool suffix. It will not work from inside the Python script with os.system, but you can modify os.environ directly.
Furthermore you mixed up the call to os.environ in the script. It should read:
tools = os.path.join(os.environ['SUMO_HOME'], 'tools')
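A minimal sketch of that os.environ approach, assuming SUMO was unpacked to /home/gustavo/Downloads/sumo-0.32.0:

# A minimal sketch: set SUMO_HOME in this process; os.system runs export in a
# child shell, which cannot affect the running Python process.
import os
import sys

os.environ.setdefault("SUMO_HOME", "/home/gustavo/Downloads/sumo-0.32.0")

tools = os.path.join(os.environ["SUMO_HOME"], "tools")
sys.path.append(tools)

from sumolib import checkBinary
import traci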
I swapped the if/else part for this code instead:
try:
    sys.path.append("/home/gustavo/Downloads/sumo-0.32.0/tools")
    from sumolib import checkBinary
except ImportError:
    sys.exit("please declare environment variable 'SUMO_HOME' as the root directory of your sumo installation (it should contain folders 'bin', 'tools' and 'docs')")
That solved the problem.

Python Redis Queue ValueError: Functions from the __main__ module cannot be processed by workers

I'm trying to enqueue a basic job in Redis using python-rq, but it throws this error:
"ValueError: Functions from the __main__ module cannot be processed by workers"
Here is my program:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
Break the provided code into two files:
count_words.py:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())
and main.py (where you'll import the required function):
from rq import Connection, Queue
from redis import Redis
from count_words import count_words_at_url  # added import!

redis_conn = Redis()
q = Queue(connection=redis_conn)
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
I always separate the tasks from the logic running those tasks into different files. It's just better organization. Also note that you can define a class of tasks and import/schedule tasks from that class instead of the (over-simplified) structure I suggest above. This should get you going.
Also see here to confirm you're not the first to struggle with this example. RQ is great once you get the hang of it.
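Note that an enqueued job only runs once a worker picks it up. Here is a minimal sketch of starting one programmatically (running the rq worker command on the command line does the same thing):

# A minimal sketch: a worker process that consumes jobs from the same queue.
from redis import Redis
from rq import Queue, Worker

redis_conn = Redis()
q = Queue(connection=redis_conn)

if __name__ == "__main__":
    Worker([q], connection=redis_conn).work()  # blocks and processes jobs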
Currently there is a bug in RQ which leads to this error: you will not be able to pass functions to enqueue from the same file without explicitly importing them.
Just add from app import count_words_at_url above the enqueue call:
import requests

def count_words_at_url(url):
    resp = requests.get(url)
    return len(resp.text.split())

from rq import Connection, Queue
from redis import Redis

redis_conn = Redis()
q = Queue(connection=redis_conn)

from app import count_words_at_url
job = q.enqueue(count_words_at_url, 'http://nvie.com')
print(job)
The other way is to have the functions in a separate file and import them.