Why running a specific spider doesn't work (it runs all the spiders) - scrapy

I created a scrapy project, and want to have two separate spiders (with two different names):
I'm trying to run only the "listing_enseigne.py" spider with the command scrapy crawl nameofthespider, but it seems that this command also runs the other spider (from the file "detail_enseigne.py")...
However, when looking in the scrapy documentation, it seems that this command should run only the named spider.
If anyone can help me.. thanks!
Edit 1:
Indeed, scrapy won't run them both, but it will execute all module-level code in every spider file before running the actual spider (thanks wishmaster for the answer).
I don't really understand how to organize my spiders then..
I want to have a first spider to collect URLs from a website (in fact the first spider should export CSV files containing several pieces of information, including the URLs).
Then I want a second spider to find the latest file in the export folder, collect all the URLs from that latest file, and then parse those URLs to collect other information (see the sketch below)...
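In case it helps, here is a minimal sketch of how the second spider could be organised, with the file-system work kept inside start_requests so it only executes when this spider is actually crawled. The exports/ folder, the url column name and the parsed fields are assumptions for illustration, not taken from the original project:

import csv
import glob
import os

import scrapy


class DetailEnseigneSpider(scrapy.Spider):
    # Hypothetical name; use whatever name your second spider already has.
    name = 'detail_enseigne'

    def start_requests(self):
        # Keeping this logic inside the spider (not at module level) means it
        # only runs when *this* spider is crawled, not when Scrapy imports
        # every module in the spiders package.
        exports = glob.glob('exports/*.csv')
        latest = max(exports, key=os.path.getmtime)
        with open(latest, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row['url'], callback=self.parse)

    def parse(self, response):
        # Collect the extra information for each URL here.
        yield {'url': response.url, 'title': response.css('title::text').get()}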

Related

Differences between CrawlerProcess and scrapy crawl somespider on the command line in scrapy?

case 1: running scrapy crawl somespider several times (at the same time, in the background using nohup)
case 2: using CrawlerProcess, configuring multiple spiders in a python script and running it
What is the difference between these cases? I already tried case 2 with 5 spiders but it was not that fast.
scrapy crawl uses one process for each spider, while CrawlerProcess uses a single Twisted reactor in one process (while also doing some things under the hood that I'm not so sure about) to run multiple spiders at once.
So, basically:
scrapy crawl -> more than one process
CrawlerProcess -> runs only one process with a Twisted Reactor
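For reference, a minimal sketch of the CrawlerProcess case; the spider names are placeholders for spiders registered in your project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# One process, one Twisted reactor, several spiders scheduled on it.
process = CrawlerProcess(get_project_settings())
process.crawl('spider_one')
process.crawl('spider_two')
process.start()  # blocks until every scheduled crawl has finished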

Python Scrapy Error. Running 'scrapy crawl' with more than one spider is no longer supported

I made a script in Scrapy (Python) which has been working fine for months (without changes). Recently, when I execute the script in Windows PowerShell, it raises the following error:
scrapy crawl spider –o 'filename.csv' –t 'csv'
...
Running 'scrapy crawl' with more than one spider is no longer supported
I wonder what the problem is.
Thanks in advance.
When you experience this, most probably you have an extra space somewhere in the params, so the crawler sees more params than expected.
Remove it and it should work.
Make sure that you write the command option with a regular hyphen: -o and not –o (an en dash).
I tried copying and pasting your command and it did not work, but it works with regular hyphens.
I had this issue and I fixed it by changing:
scrapy crawl s-mart -o test 2.csv -t csv
To:
scrapy crawl s-mart -o test2.csv -t csv
So I am guessing that the whitespace was causing this?
Run the command without -t; that is the shortcut scrapy genspider uses for generating new spiders based on pre-defined templates:
scrapy genspider [-t template] <name> <domain>
Try something like:
scrapy crawl <yourspidername> -o filename.csv
Documentation https://doc.scrapy.org/en/0.10.3/topics/commands.html#available-tool-commands
Try
scrapy crawl spider -o filename.csv
Possible solution:
try changing the name of your spider in its module. Maybe you have created a spider with the same name somewhere else, or copied it, and scrapy keeps track of what you have run in the past; it then encounters 2+ spiders with the same name, and since names must be unique, it can't crawl.
Changing the name solved my problem.
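For context, the name being talked about is the name attribute on the spider class, not the file name; a minimal sketch:

import scrapy


class SmartSpider(scrapy.Spider):
    # `scrapy crawl s-mart` matches this attribute, and it must be unique
    # across the whole project.
    name = 's-mart'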
I had this error after I renamed my spider.
The solution was to delete all the *.pyc files in the spiders folder.
These get regenerated by the Python compiler the next time you run scrapy.
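If you prefer not to hunt for the files by hand, a small sketch of the same cleanup (the spiders path is a placeholder):

import pathlib

# Remove stale compiled files; Python regenerates them on the next run.
for pyc in pathlib.Path('myproject/spiders').rglob('*.pyc'):
    pyc.unlink()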
I had a "\n" in the parameter.
...replace("\n","")

nutch crawl using protocol-selenium with phantomjs launched as a Mesos task: org.openqa.selenium.NoSuchElementException

I am trying to crawl AJAX-based sites with Nutch using protocol-selenium and the phantomjs driver. I am using apache-nutch-1.13 compiled from Nutch's GitHub repository. These crawls are launched as tasks in a system managed by Mesos. When I launch Nutch's crawl script from a terminal on the server, everything goes perfectly and the site is crawled as I asked. However, when I execute the same crawl script with the same parameters inside a Mesos task, Nutch raises the exception:
fetch of http://XXXXX failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: {"errorMessage":"Unable to find element with tag name 'body'","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"35","Content-Type":"application/json; charset=utf-8","Host":"localhost:12215","User-Agent":"Apache-HttpClient/4.3.5 (java 1.5)"},"httpVersion":"1.1","method":"POST","post":"{\"using\":\"tag name\",\"value\":\"body\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/a7f98ec0-b8aa-11e6-8b84-232b0d8e1024/element"}}
My first impression was that there was something strange with the environment variables (HADOOP_HOME, PATH, CLASSPATH...), but I set the same vars in the nutch script and in the terminal and still got the same result.
Any ideas about what I am doing wrong?
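One hedged way to chase the environment-variable theory is to dump the environment from both contexts and diff the two dumps; a minimal sketch (the output path is arbitrary):

import os

# Write the environment seen by the current context (terminal run vs.
# Mesos task) to a file so the two can be diffed afterwards.
with open('/tmp/env-dump.txt', 'w') as f:
    for key in sorted(os.environ):
        f.write('{}={}\n'.format(key, os.environ[key]))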

How to invoke shell and pass arguments to spiders

I used to do this in order to inspect the HTML and find XPaths:
$ scrapy shell http://www.domain.com/whatever
Now I have a spider which must receive some arguments. Something like
$ scrapy crawl -a arg1=one MySpiderForDomainDotCom
And I still want to invoke the shell from the command line. But now scrapy tries to use/load my spider (the documentation says it does it this way) and I get an error saying the spider has no arguments.
My question is: how do I invoke the shell from the command line when the spider must receive arguments?
I have tried some things and combinations, searched the web, but nothing...
PS: scrapy 0.22.2
PS2: I do not want to invoke the shell from within my spider.
The simplest solution is to invoke:
$ scrapy shell
from the command line, and once the console is launched:
>>> fetch('http://www.domain.com/whatever')
The scrapy shell command will load all the settings you have defined in settings.py. This does not initiate any spider.
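If the goal is to exercise the spider's parsing logic with those arguments, one hedged option is to instantiate the spider by hand inside the shell (this assumes the spider does not override __init__ in an incompatible way; the module path below is hypothetical):

$ scrapy shell
>>> fetch('http://www.domain.com/whatever')
>>> from myproject.spiders.mydomain import MySpiderForDomainDotCom  # hypothetical path
>>> spider = MySpiderForDomainDotCom(arg1='one')
>>> items = list(spider.parse(response))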

Scrapy not exporting to csv

I have just created a new scrapy project after ages, and seem to be forgetting something. In any case, my spider runs great, but does not store the output to a csv. Is there something that needs to go into the pipeline or settings files? I am using this command:
scrapy crawl ninfo -- set FEED_URI=myinfo.csv --set FEED_FORMAT=csv
Any help is appreciated, Thanks.
TM
Try with this command:
$ scrapy crawl ninfo -o myinfo.csv -t csv
See http://doc.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data (the only difference being that they use it to generate JSON data, but Scrapy also ships with a CSV exporter: http://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-csv)
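On newer Scrapy releases (roughly 2.1 onwards) the same export can instead be configured once in settings.py via the FEEDS setting, after which a plain scrapy crawl ninfo writes the file with no extra flags; a minimal sketch:

# settings.py
FEEDS = {
    'myinfo.csv': {'format': 'csv'},
}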