How to invoke shell and pass arguments to spiders - scrapy

I used to do this in order to inspect the HTML and work out XPaths:
$ scrapy shell http://www.domain.com/whatever
Now I have a spider which must receive some arguments. Something like
$ scrapy crawl -a arg1=one MySpiderForDomainDotCom
And I still want to invoke the shell from the command line. But now scrapy tries to load my spider (the documentation says it works this way) and I get an error saying the spider has no arguments.
My question is: how do I invoke the shell from the command line when the spider must receive arguments?
I have tried several things and combinations and searched the web, but nothing works...
PS: scrapy 0.22.2
PS2: I do not want to invoke the shell from within my spider.

The simple solution is to invoke:
$ scrapy shell
from the command line, and once the console is launched:
>>> fetch('http://www.domain.com/whatever')
The scrapy shell command will load all the settings you have defined in settings.py. It does not instantiate any spider.
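For example, a full session might look like this (the XPath below is only an illustration, not from the original question):

$ scrapy shell
>>> fetch('http://www.domain.com/whatever')
>>> sel.xpath('//title/text()').extract()

In Scrapy 0.22 the shell exposes the sel selector object; in newer versions response.xpath(...) works as well. Spider arguments are irrelevant here because the shell never loads your spider.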

Related

How can I use Scrapy middleware in the Scrapy shell?

In a Scrapy project one uses middleware quite often. Is there a generic way of enabling middleware in the scrapy shell during interactive sessions as well?
Middlewares set up in settings.py are enabled by default in the scrapy shell; you can see this in the logs when running scrapy shell.
So to answer your question: yes, you can do so using this command.
scrapy shell -s DOWNLOADER_MIDDLEWARES='<<your custom middleware>>'
You can override settings using the -s parameter.
Remember, just run scrapy shell inside a folder that contains a Scrapy project.
It will load the default settings from settings.py.
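For reference, a custom downloader middleware is just a class with process_request/process_response hooks; a minimal sketch (the class name, module path, and header value are assumptions, not from the original question):

# myproject/middlewares.py -- minimal sketch, names are assumptions
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Add a debug header to every outgoing request; returning None
        # lets Scrapy continue handling the request normally.
        request.headers.setdefault('X-Debug', 'shell-session')
        return None

Enabled in settings.py (and therefore also in the shell) with:

DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomHeaderMiddleware': 543}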
Happy Scraping :)

Why running a specific spider doesn't work (but runs all the spiders)

I created a scrapy project and want to have two separate spiders (with two different names).
I'm trying to run only the "listing_enseigne.py" spider with the command scrapy crawl nameofthespider, but it seems that this command also runs the other spider (from the file "detail_enseigne.py")...
However, according to the Scrapy documentation, this command should only run the named spider.
If anyone can help me.. thanks!
Edit 1:
Indeed, scrapy won't run them both, but it will execute all module-level code in every spider file before running the actual spider (thanks wishmaster for the answer).
I don't really understand how to organize my spiders then...
I want a first spider to collect URLs from a website (in fact the first spider should export CSV files containing several pieces of information, including the URLs).
Then I want a second spider to find the latest file in the export folder, collect all the URLs from that latest file, and then parse those URLs to collect other information...
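One way to organize this (a rough sketch; the folder name, filename pattern, and the 'url' column are assumptions): keep all file-reading logic inside start_requests rather than at module level, so importing the spider module has no side effects when another spider runs.

import csv
import glob
import os

import scrapy


class DetailSpider(scrapy.Spider):
    name = 'detail_enseigne'

    def start_requests(self):
        # Find the most recent CSV exported by the first spider
        # (folder and pattern are assumptions).
        exports = glob.glob('exports/*.csv')
        latest = max(exports, key=os.path.getmtime)
        with open(latest) as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row['url'], callback=self.parse)

    def parse(self, response):
        # collect the other information here
        pass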

Python Scrapy Error. Running 'scrapy crawl' with more than one spider is no longer supported

I made a script in Scrapy (Python) which had been working fine for months (without changes). Recently, when I execute the script in Windows PowerShell, it raises the following error:
scrapy crawl spider –o 'filename.csv' –t 'csv'
...
Running 'scrapy crawl' with more than one spider is no longer supported
I wonder what the problem is.
Thanks in advance.
When you experience this, most probably you have an extra space somewhere in the params and the crawler sees more params than expected.
Remove it and it should work.
Make sure that you write the command option with a plain hyphen: -o and not –o (an en dash).
I tried copying and pasting your command and it did not work, but it works with short dashes.
I had this issue and I fixed it by changing:
scrapy crawl s-mart -o test 2.csv -t csv
To:
scrapy crawl s-mart -o test2.csv -t csv
So I am guessing that the whitespace was causing this?
Run the command without -t; it's a shortcut for generating new spiders based on pre-defined templates:
scrapy genspider [-t template] <name> <domain>
Try something like:
scrapy crawl <yourspidername> -o filename.csv
Documentation https://doc.scrapy.org/en/0.10.3/topics/commands.html#available-tool-commands
Try
scrapy crawl spider -o filename.csv
Possible solution:
Try changing the name of your spider in its module. Maybe you created a spider with the same name somewhere else, or copied it, and scrapy keeps track of what you have run in the past; it then encounters two or more spiders with the same name, and since names must be unique, it can't crawl.
Changing the name solved my problem.
I had this error after I renamed my spider.
The solution was to delete all the *.pyc files in the spiders folder.
These get regenerated by the Python compiler the next time you run scrapy.
I had a "\n" in the parameter.
...replace("\n","")
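If the command is built programmatically, stripping whitespace and newlines from each argument avoids this whole class of error; a small sketch (the spider and file names are placeholders, not from the original answers):

import subprocess

raw_filename = 'filename.csv\n'                       # e.g. read from a config file
filename = raw_filename.replace('\n', '').strip()     # sanitize the parameter
subprocess.call(['scrapy', 'crawl', 'spider', '-o', filename, '-t', 'csv'])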

Convert .odt to .docx using libreoffice5.0 in python

command = "libreoffice5.0 --headless --convert-to odt /data/Format/000001535edbaf8f27a9c331003600c900520045/test.docx --outdir /data/Format/000001535edbaf8f27a9c331003600c900520045"
When we run this command in a terminal, it gives me the output
/data/Format/000001535edbaf8f27a9c331003600c900520045/test.odt
But whenever I trigger it from an Apache request with os.system(command), the process starts but doesn't return anything. The process keeps running in the background continuously.
Have you thought about using
subprocess.call(["ls", "-l"])
"The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several older modules and functions:"
os.system
os.spawn*
os.popen*
popen2.*
commands.*
Ref: Python 2.7.x subprocess module
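A sketch of the same LibreOffice conversion run through subprocess (reusing the paths from the question), so the return code is visible instead of the call silently hanging:

import subprocess

args = [
    'libreoffice5.0', '--headless',
    '--convert-to', 'odt',
    '/data/Format/000001535edbaf8f27a9c331003600c900520045/test.docx',
    '--outdir', '/data/Format/000001535edbaf8f27a9c331003600c900520045',
]
return_code = subprocess.call(args)   # blocks until LibreOffice exits
print('conversion finished with code', return_code)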

Scrapy not exporting to csv

I have just created a new scrapy project after ages, and seem to be forgetting something. In any case, my spider runs great, but does not store the output to a csv. Is there something that needs to go into the pipeline or settings files? I am using this command:
scrapy crawl ninfo -- set FEED_URI=myinfo.csv --set FEED_FORMAT=csv
Any help is appreciated, Thanks.
TM
Try with this command:
$ scrapy crawl ninfo -o myinfo.csv -t csv
See http://doc.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data (the only difference being that they use it to generate JSON data, but Scrapy also ships with a CSV exporter: http://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-csv)
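Alternatively (a sketch of equivalent configuration, not from the original answer), the same feed export can be set once in settings.py so the command-line flags aren't needed:

# settings.py -- feed export settings for older Scrapy versions
FEED_URI = 'myinfo.csv'
FEED_FORMAT = 'csv'

Then a plain scrapy crawl ninfo will write the items to myinfo.csv.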