How can I use Scrapy middleware in the Scrapy shell?

In a Scrapy project one uses middleware quite often. Is there a generic way of enabling middleware in the Scrapy shell during interactive sessions as well?

Middlewares set up in settings.py are enabled by default in the Scrapy shell; you can see them in the logs when the shell starts.
So to answer your question: yes, you can do so using this command:
scrapy shell -s DOWNLOADER_MIDDLEWARES='<<your custom middleware>>'
You can override settings using the -s parameter.
Remember, just run scrapy shell inside a folder that contains a Scrapy project and it will load the default settings from settings.py.
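For example, a hypothetical invocation (the middleware path and priority are placeholders; recent Scrapy versions accept dict settings passed on the command line as a JSON string):
scrapy shell -s DOWNLOADER_MIDDLEWARES='{"myproject.middlewares.CustomProxyMiddleware": 543}' 'http://example.com'
In practice, if you run the shell from inside the project, the middlewares declared in settings.py are already active, so an override like this is only needed for ad-hoc additions.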
Happy Scraping :)

Related

How can I find the status of a workflow execution?

For my current project, after I trigger a workflow, I need to check the status of its execution. I am not sure about the exact command. I have tried 'get-workflow' but it didn't seem to work.
There are a few ways, increasing in order of heavy-handedness.
You can hit the Admin API endpoint directly with curl or a similar tool.
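For example, a rough sketch of the curl approach (the host is a placeholder and the exact route can vary between deployments; the Admin REST path for a single execution is /api/v1/executions/<project>/<domain>/<name>):
curl https://<flyteadmin-host>/api/v1/executions/yourproject/development/2fd90i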
The Python SDK (flytekit) also ships with a command-line control plane utility called flyte-cli. In the future this may move to another location, but it's there for now and you can hit it with this command:
flyte-cli -p yourproject -d development get-execution -u ex:yourproject:development:2fd90i
You can also use the Python class in flytekit that represents a workflow execution.
In [1]: from flytekit.configuration import set_flyte_config_file
In [2]: set_flyte_config_file('/Users/user/.flyte/config')
In [3]: from flytekit.common.workflow_execution import SdkWorkflowExecution
In [4]: e = SdkWorkflowExecution.fetch('yourproject', 'development', '2fd90i')
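To actually read the status off the fetched object, something along these lines should work; note that the method and attribute names below are assumptions about the legacy flytekit SDK rather than documented API, so double-check them in your session (e.g. with dir(e)):
In [5]: e.sync()           # refresh the local copy from Admin (assumed method name)
In [6]: e.closure.phase    # current phase of the execution (assumed attribute path)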

Scrapyd: No active project - How to schedule spiders with scrapyd

I am trying to schedule a scrapy 2.1.0 spider with the help of scrapyd 1.2
curl --insecure http://localhost:6800/schedule.json -d project=bid -d spider=test
This should in theory start the crawl for spider test within project bid. Instead it outputs the error message:
{"node_name": "spider1", "status": "error", "message": "Scrapy 2.1.0 - no active project\n\nUnknown command: list\n\nUse \"scrapy\" to see available commands\n"}
If I cd into the project directory, the project is there with several spiders, and I can start them via "cd /var/spiders/ && scrapy crawl test &".
However, being in another folder also gives me the message "no active project":
/var$ scrapy list
Scrapy 2.1.0 - no active project
Unknown command: list
Use "scrapy" to see available commands
This looks like the exact same info I get from scrapyd, so I suspect that I somehow need to configure the working directory where my projects live.
Scrapyd is running and I can access the console via the web GUI.
What is the right approach to start the job via scrapyd?
Before you can launch your spider with scrapyd, you have to deploy it first; a short example follows below. You can do this by:
Using addversion.json (https://scrapyd.readthedocs.io/en/latest/api.html#addversion-json)
Using scrapyd-deploy (https://github.com/scrapy/scrapyd-client)
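For instance, a minimal scrapyd-deploy flow might look like this (the project name bid and the localhost:6800 endpoint mirror the question; adjust them for your setup). In scrapy.cfg at the project root:
[deploy]
url = http://localhost:6800/
project = bid
Then, from the project root:
scrapyd-deploy
Or build an egg yourself and upload it via addversion.json:
curl http://localhost:6800/addversion.json -F project=bid -F version=1.0 -F egg=@bid.egg
After a successful deploy, the schedule.json call from the question should find the project.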

Python Scrapy Error. Running 'scrapy crawl' with more than one spider is no longer supported

I made a script in Scrapy (Python) which has been working fine for months (without changes). Recently, when I execute the script in Windows PowerShell, it raises the following error:
scrapy crawl spider –o 'filename.csv' –t 'csv'
...
Running 'scrapy crawl' with more than one spider is no longer supported
I wonder what the problem is.
Thanks in advance.
When you experience this, most probably you have an extra space somewhere in the parameters and the crawler sees more arguments than expected.
Remove it and it should work.
Make sure that you write the command option with a regular hyphen: -o, not the en dash –o.
I tried copying and pasting your command and it did not work, but it works with plain hyphens.
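For reference, the command from the question with plain ASCII hyphens:
scrapy crawl spider -o 'filename.csv' -t 'csv'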
I had this issue and I fixed it by changing:
scrapy crawl s-mart -o test 2.csv -t csv
To:
scrapy crawl s-mart -o test2.csv -t csv
So I am guessing that the whitespace was causing this?
Run the command without -t. For scrapy crawl, -t just sets the output feed format, and in recent Scrapy versions the format is inferred from the -o file extension anyway. (Don't confuse it with the -t of scrapy genspider, which selects a spider template: scrapy genspider [-t template] <name> <domain>.)
Try something like:
scrapy crawl <yourspidername> -o filename.csv
Documentation: https://doc.scrapy.org/en/0.10.3/topics/commands.html#available-tool-commands
Try
scrapy crawl spider -o filename.csv
Possible solution:
try changing the name of your spider in its module. Maybe you have created a spider with the same name somewhere else, or copied the module; spider names must be unique, so when Scrapy finds two or more spiders with the same name it cannot run the crawl.
Changing the name solved my problem.
I had this error after I renamed my spider.
The solution was to delete all the *.pyc files in the spiders folder.
These get regenerated by the Python compiler the next time you run Scrapy.
I had a "\n" in the parameter.
...replace("\n","")
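A minimal sketch of that idea, assuming the crawl command is assembled in Python (the file name and spider name are hypothetical):
import subprocess

# strip stray newlines from the argument before building the command
filename = "filename.csv\n".replace("\n", "")
subprocess.run(["scrapy", "crawl", "spider", "-o", filename], check=True)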

How to invoke shell and pass arguments to spiders

I used to do this in order to inspect the HTML and work out XPaths:
$ scrapy shell http://www.domain.com/whatever
Now I have a spider which must receive some arguments. Something like
$ scrapy crawl -a arg1=one MySpiderForDomainDotCom
And I still want to invoke the shell from the command line. But now Scrapy tries to load my spider (the documentation says it does it this way) and I get an error saying the spider has no arguments.
My question is: how do I invoke the shell from the command line when the spider must receive arguments?
I have tried some things and combinations, searched the web, but nothing...
PS: scrapy 0.22.2
PS2: I do not want to invoke the shell from within my spider.
The simple solution is to invoke:
$ scrapy shell
from the command line, and once the console is launched:
>>> fetch('http://www.domain.com/whatever')
The scrapy shell command will load all the settings you have defined in settings.py. It does not instantiate any spider.
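Once the page is fetched, you can test your XPaths directly in the shell; current Scrapy versions expose the selector shortcuts on the response object, while very old releases such as 0.22 expose a sel object instead (the XPath below is only an illustration):
>>> response.xpath('//title/text()').extract()   # recent Scrapy versions
>>> sel.xpath('//title/text()').extract()        # older versions such as 0.22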

Scrapy not exporting to csv

I have just created a new scrapy project after ages, and seem to be forgetting something. In any case, my spider runs great, but does not store the output to a csv. Is there something that needs to go into the pipeline or settings files? I am using this command:
scrapy crawl ninfo --set FEED_URI=myinfo.csv --set FEED_FORMAT=csv
Any help is appreciated, Thanks.
TM
Try with this command:
$ scrapy crawl ninfo -o myinfo.csv -t csv
See http://doc.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data (the only difference being that they use it to generate JSON data, but Scrapy ships with a CSV exporter: http://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-csv)
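If you would rather configure the export in settings.py than on the command line, something along these lines should work (the file name is the one from the question; FEED_URI/FEED_FORMAT are the older-style settings, while Scrapy 2.1+ uses the FEEDS dict):
# settings.py, older Scrapy versions
FEED_URI = 'myinfo.csv'
FEED_FORMAT = 'csv'

# settings.py, Scrapy 2.1 and later
FEEDS = {
    'myinfo.csv': {'format': 'csv'},
}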