What are the differences between CrawlerProcess and scrapy crawl somespider on the command line in Scrapy?

Case 1: running scrapy crawl somespider several times (at the same time, in the background with nohup)
Case 2: configuring multiple spiders with CrawlerProcess in a Python script and running that
What are the differences between these cases? I already tried case 2 with 5 spiders, but it was not much faster.

scrapy crawl uses one process for each spider, while CrawlerProcess uses a single Twisted reactor in one process (while also doing some things under the hood that I'm not entirely sure about) to run multiple spiders at once.
So, basically:
scrapy crawl -> more than one process
CrawlerProcess -> runs only one process with a Twisted Reactor
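For reference, a minimal sketch of running several spiders in one process with CrawlerProcess (SpiderOne and SpiderTwo are hypothetical spider classes from your project):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.spider_one import SpiderOne  # hypothetical spider classes
from myproject.spiders.spider_two import SpiderTwo

process = CrawlerProcess(get_project_settings())
process.crawl(SpiderOne)   # schedule both crawls on the same Twisted reactor
process.crawl(SpiderTwo)
process.start()            # blocks until every scheduled crawl has finished

Because everything shares a single process, the spiders compete for one CPU core; that can explain why running 5 spiders through CrawlerProcess did not feel faster than separate scrapy crawl processes.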

Related

How can I use scrapy middleware in the scrapy Shell?

In a Scrapy project one uses middleware quite often. Is there a generic way of enabling middleware in the scrapy shell during interactive sessions as well?
Actually, middlewares set up in settings.py are enabled by default in scrapy shell. You can see this in the logs when running scrapy shell.
So to answer your question: yes, you can do so using this command.
scrapy shell -s DOWNLOADER_MIDDLEWARES='<<your custom middleware>>'
You can override settings using the -s parameter.
Remember to run scrapy shell inside a folder that contains a Scrapy project.
It will then load the default settings from settings.py.
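For reference, a middleware enabled in settings.py looks roughly like the sketch below (the module path is a hypothetical example); this is what the shell picks up automatically when launched inside the project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomUserAgentMiddleware': 543,  # hypothetical middleware; lower numbers sit closer to the engine
}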
Happy Scraping :)

Why does running a specific spider not work (it runs all the spiders)?

I created a scrapy project and want to have two separate spiders (with two different names).
I'm trying to run only the "listing_enseigne.py" spider with the command scrapy crawl nameofthespider, but it seems that this command also runs the other spider (from the file "detail_enseigne.py")...
However, when looking at the Scrapy documentation, it seems that this command should run only the named spider.
If anyone can help me... thanks!
Edit 1:
Indeed, Scrapy won't run them both, but it will execute all module-level code in every spider file before running the actual spider (thanks wishmaster for the answer).
I don't really understand how to organize my spiders then.
I want a first spider to collect urls from a website (in fact the first spider should export csv files containing multiple pieces of information, including urls).
Then I want a second spider to find the latest file in the export folder, collect all urls from that latest file, and then parse these urls to collect other information...
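One way to structure the second spider (a minimal sketch, assuming the first spider writes its CSV exports into an exports/ folder with a url column; both the folder name and the column name are assumptions):

import csv
import glob
import os

import scrapy


class DetailSpider(scrapy.Spider):
    name = "detail_enseigne"

    def start_requests(self):
        # Pick the most recently modified export produced by the first spider.
        latest = max(glob.glob("exports/*.csv"), key=os.path.getmtime)
        with open(latest, newline="") as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row["url"], callback=self.parse)

    def parse(self, response):
        # Collect the additional information for each url here.
        yield {"url": response.url, "title": response.css("title::text").get()}

Keeping the CSV-reading logic inside start_requests (rather than at module level) also avoids the problem from the answer above, because module-level code in every spider file runs whenever any spider starts.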

Python Scrapy Error. Running 'scrapy crawl' with more than one spider is no longer supported

I made a script in Scrapy (Python) which had been working fine for months (without changes). Recently, when I execute the script in Windows PowerShell, it raises the following error:
scrapy crawl spider –o 'filename.csv' –t 'csv'
...
Running 'scrapy crawl' with more than one spider is no longer supported
I wonder what the problem is.
Thanks in advance.
When you experience this, most probably you have an extra space somewhere in the params and the crawler sees more params than expected.
Remove it and it should work.
Make sure that you write the command option with a plain hyphen: -o and not –o.
I tried copying and pasting your command and it did not work, but it works with plain hyphens.
I had this issue and I fixed it by changing:
scrapy crawl s-mart -o test 2.csv -t csv
To:
scrapy crawl s-mart -o test2.csv -t csv
So I am guessing that the whitespace was causing this?
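To see why the space matters, here is a quick illustration (shlex.split mimics how the shell splits a command line into arguments):

import shlex

# Unquoted space: "test 2.csv" becomes two arguments, and Scrapy treats the stray
# "2.csv" as a second spider name, triggering the "more than one spider" error.
print(shlex.split("scrapy crawl s-mart -o test 2.csv -t csv"))
# ['scrapy', 'crawl', 's-mart', '-o', 'test', '2.csv', '-t', 'csv']

# Quoting the filename (or removing the space) keeps it as a single argument.
print(shlex.split('scrapy crawl s-mart -o "test 2.csv" -t csv'))
# ['scrapy', 'crawl', 's-mart', '-o', 'test 2.csv', '-t', 'csv']

The same thing happens with the en dash: –o is not recognized as an option, so it and the filename end up as extra positional arguments.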
Try running the command without -t; that flag is the shortcut scrapy genspider uses for generating new spiders based on pre-defined templates:
scrapy genspider [-t template] <name> <domain>
Try something like:
scrapy crawl <yourspidername> -o filename.csv
Documentation https://doc.scrapy.org/en/0.10.3/topics/commands.html#available-tool-commands
Try
scrapy crawl spider -o filename.csv
Possible solution:
Try changing the name of your spider in its module. Maybe you created a spider with the same name somewhere else or copied it, and Scrapy keeps track of what you have run in the past; it then encounters two or more spiders with the same name, and since names must be unique, it can't crawl.
Changing the name solved my problem.
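To check whether two spiders accidentally share a name, you can list what Scrapy sees in the project (a sketch using the project's spider loader):

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

# Print every spider name Scrapy can discover in the project (run from the project root).
print(SpiderLoader.from_settings(get_project_settings()).list())

Running scrapy list from the project directory gives the same information.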
I had this error after I renamed my spider.
The solution was to delete all the *.pyc files in the spiders folder.
These get regenerated by the Python compiler the next time you run Scrapy.
I had a "\n" in the parameter.
...replace("\n","")

nutch crawl using protocol-selenium with phantomjs launched as a Mesos task : org.openqa.selenium.NoSuchElementException

I am trying to crawl AJAX-based sites with Nutch using protocol-selenium with the PhantomJS driver. I am using apache-nutch-1.13 compiled from Nutch's GitHub repository. These crawls are launched as tasks in a system managed by Mesos. When I launch Nutch's crawl script from a terminal on the server, everything works perfectly and the site is crawled as requested. However, when I execute the same crawl script with the same parameters inside a Mesos task, Nutch raises the exception:
fetch of http://XXXXX failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: {"errorMessage":"Unable to find element with tag name 'body'","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"35","Content-Type":"application/json; charset=utf-8","Host":"localhost:12215","User-Agent":"Apache-HttpClient/4.3.5 (java 1.5)"},"httpVersion":"1.1","method":"POST","post":"{\"using\":\"tag name\",\"value\":\"body\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/a7f98ec0-b8aa-11e6-8b84-232b0d8e1024/element"}}
My first impression was that there was something odd with the environment variables (HADOOP_HOME, PATH, CLASSPATH...), but I set the same vars in the Nutch script and in the terminal and still got the same result.
Any ideas about what I am doing wrong?

How to invoke shell and pass arguments to spiders

I used to do this in order to inspect the HTML and find XPaths:
$ scrapy shell http://www.domain.com/whatever
Now I have a spider which must receive some arguments. Something like
$ scrapy crawl -a arg1=one MySpiderForDomainDotCom
And I still want to invoke the shell from the command line. But now scrapy tries to use/load my spider (the documentation says it does it this way) and I get an error saying the spider is missing arguments.
My question is: how do I invoke the shell from the command line when the spider must receive arguments?
I have tried some things and combinations, searched the web, but nothing...
PS: scrapy 0.22.2
PS2: I do not want to invoke the shell from within my spider.
The simple solution is to invoke:
$ scrapy shell
from the command line, and once console is launched:
>>> fetch('http://www.domain.com/whatever')
The scrapy shell command will load all the settings you have defined in settings.py. This does not instantiate any spider.
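If the spider arguments only influence which requests are made, you can also build that request by hand inside the shell before fetching it (the URL and header below are placeholders; fetch() accepts a Request object as well as a URL):
>>> from scrapy import Request
>>> req = Request('http://www.domain.com/whatever', headers={'Referer': 'http://www.domain.com'})
>>> fetch(req)
>>> response.xpath('//title/text()').extract()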