Question
I want to know how to disable Item storing in scrapyd.
What I tried
I deployed a spider to the Scrapy daemon, Scrapyd. The deployed spider stores the scraped data in a database, and that works fine.
However, Scrapyd logs every scraped item. You can see this when examining the Scrapyd web interface.
This item data is stored in ..../items/<project name>/<spider name>/<job name>.jl
I have no clue how to disable this. I run scrapyd in a Docker container and it uses way too much storage.
I have tried the suggestions from "suppress Scrapy Item printed in logs after pipeline", but this seems to do nothing for Scrapyd. All spider logging settings seem to be ignored by Scrapyd.
Edit
I found this entry in the documentation about item storing. It seems that if you omit the items_dir setting, item storing will not happen. It is said to be disabled by default. I do not have a scrapyd.conf file, so item storing should be disabled, but it is not.
After writing my answer I re-read your question, and I see that what you want has nothing to do with logging; it's about not writing items to the (default-ish) .jl feed (maybe update the title to "Disable scrapyd item storing"). To override scrapyd's default, just set FEED_URI to an empty string, like this:
$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=
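If you prefer to make this explicit at the Scrapyd level instead of per job, an empty items_dir in scrapyd.conf should have the same effect. A minimal sketch (the other values are just common defaults, adjust as needed):
# scrapyd.conf, placed e.g. in /etc/scrapyd/ or next to where scrapyd is started
[scrapyd]
eggs_dir     = eggs
logs_dir     = logs
items_dir    =
jobs_to_keep = 5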
For other people who are looking into logging, let's see an example. We do the usual:
$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com
then edit tutorial/spiders/example.py to contain the following:
import scrapy


class TutorialItem(scrapy.Item):
    name = scrapy.Field()
    surname = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        for i in range(100):
            t = TutorialItem()
            t['name'] = "foo"
            t['surname'] = "bar %d" % i
            yield t
Notice the difference between running:
$ scrapy crawl example
# or
$ scrapy crawl example -L DEBUG
# or
$ scrapy crawl example -s LOG_LEVEL=DEBUG
and
$ scrapy crawl example -s LOG_LEVEL=INFO
# or
$ scrapy crawl example -L INFO
By trying such combinations on your spider, confirm that it doesn't print item data for log levels above DEBUG.
Now, after you deploy to scrapyd, it's time to do exactly the same:
$ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example
Confirm that the logs no longer contain item data when the job runs.
Note that if your items are still printed at INFO level, it likely means that your code or some pipeline is printing them. You could raise the log level further and/or investigate, find the code that prints them, and remove it.
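As an illustration only (this pipeline is hypothetical, not taken from your project), the difference is between printing and logging at DEBUG level:
# pipelines.py -- hypothetical example
class ExamplePipeline:
    def process_item(self, item, spider):
        # print(item)  # a bare print shows up regardless of LOG_LEVEL -- avoid it
        spider.logger.debug("processed item: %r", item)  # hidden when LOG_LEVEL=INFO
        return item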
Related
I created a GCP VM (Ubuntu) and installed Python and Scrapy.
I would like to run my spider from there: scrapy crawl test -o test1.csv
I opened the terminal from GCP and ran the spider (it worked), but it will take at least 3 hours.
How can I make sure the script keeps running when I exit the terminal (browser)?
You can use nohup to make sure the crawling continues:
nohup scrapy crawl test -o test1.csv &
When you log off, the crawler will continue until it finishes. The & at the end makes the process execute in the background.
To redirect the output to a log file, you can execute it as follows:
nohup scrapy crawl test -o test1.csv &> test.log &
For a better way to run and deploy spiders on a server, you can check out scrapyd.
You can create a run.py file in the spiders directory.
run.py contents:
# run.py -- equivalent to running "scrapy crawl test -o test1.csv" from the shell
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'test', '-o', 'test1.csv'])
After that:
nohup python -u run.py > spider_house.log 2>&1 &
If logging is configured inside the crawler, the log will be written according to that configuration, and the output redirected by nohup will not be used.
If you have configured a resumable crawl (i.e. the JOBDIR= setting) and want to pause the crawler gracefully so that the next run resumes where the last one stopped, close the crawler by sending it SIGINT:
kill -2 <pid>
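For reference, such a resumable crawl would be started with a job directory, for example (the directory name crawls/test-1 is just a placeholder):
$ scrapy crawl test -o test1.csv -s JOBDIR=crawls/test-1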
The name of the spider is quotes14 and it works well from the command line,
i.e. if I run scrapy crawl quotes14 from the directory /var/www/html/sprojects/tutorial/ it works fine.
I have scrapyd running as daemon.
My scrapy spider files are present here: /var/www/html/sprojects/tutorial/tutorial/spiders
I have many spiders and other files under the above directory, and the project is /var/www/html/sprojects/tutorial/tutorial/
I have tried
curl http://localhost:6800/schedule.json -d project=tutorial -d spider=spiders/quotes14
curl http://localhost:6800/schedule.json -d project=/var/www/html/sprojects/tutorial/tutorial/tutorial -d spider=quotes14
curl http://localhost:6800/schedule.json -d project=/var/www/html/sprojects/tutorial/tutorial/ -d spider=quotes14
curl http://localhost:6800/schedule.json -d project=/var/www/html/sprojects/tutorial/tutorial/tutorial -d spider=spiders/quotes14
It either says the project is not found or the spider is not found.
Please help
In order to use the schedule endpoint you have to first deploy the spider to the daemon. The docs tell you how to do this.
Deploying your project involves eggifying it and uploading the egg to Scrapyd via the addversion.json endpoint. You can do this manually, but the easiest way is to use the scrapyd-deploy tool provided by scrapyd-client which will do it all for you.
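A minimal walkthrough, assuming a default local Scrapyd and that your project is called tutorial (the target URL below is the standard local one):
# in /var/www/html/sprojects/tutorial/scrapy.cfg, add or uncomment a deploy target:
[deploy]
url = http://localhost:6800/
project = tutorial

$ pip install scrapyd-client
$ cd /var/www/html/sprojects/tutorial/
$ scrapyd-deploy

# then schedule using the project and spider names only, no paths:
$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=quotes14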
I am trying to set up a crontab for scraping something. So far, I wrote
23 18 * * * cd PycharmProjects/untitled/Project1 && scrapy crawl xx -o test.csv
But when I do that I get this:
/bin/sh: scrapy: command not found.
What should I do?
I tried to locate scrapy on my Mac but couldn't find it. However, I am able to run the second part of the crontab task from the terminal.
Since cron doesn't set up the PATH variable for you, it doesn't know what scrapy is.
The easy way to remedy this is to use the full path of scrapy:
$ which scrapy
/usr/bin/scrapy
Then use that instead of just scrapy:
23 18 * * * cd PycharmProjects/untitled/Project1 && /usr/bin/scrapy crawl xx -o test.csv
Another way of doing this is to set the PATH environment in your crontab:
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
# or your custom path, check your `.bashrc` for PATH you have set in your shell
23 18 * * * cd PycharmProjects/untitled/Project1 && scrapy crawl xx -o test.csv
Sidenote:
Also, it's very common in cron to wrap your command in some sort of script that populates the PATH and other configuration, and to call that script from cron instead of calling the commands directly.
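For example, a hypothetical wrapper script (the name and paths are placeholders to adapt) could look like this:
#!/bin/sh
# run_spider.sh -- sets up the environment that cron does not provide
export PATH=/usr/local/bin:/usr/bin:/bin
cd "$HOME/PycharmProjects/untitled/Project1" || exit 1
scrapy crawl xx -o test.csv
and the crontab entry then becomes:
23 18 * * * /path/to/run_spider.sh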
So I've written my first scraper with Scrapy and I'm having some trouble with the next steps. I want to run the scraper daily, probably with cron, and track the changes in the values I've scraped. When I export to a json or csv file, then run the scraper again, the new data gets dumped into the same file. Is there a way to make each scrape export into a separate file? Any insight would be great, thanks!
Tell Scrapy the name of the file to write to using -o:
$ scrapy crawl -h | grep output=
--output=FILE, -o FILE dump scraped items into FILE (use - for stdout)
You can use the current date as the file name, for example:
$ scrapy crawl <spider-name> -t json -o $(date '+%Y-%m-%d').json
# or, for CSV: -t csv -o $(date '+%Y-%m-%d').csv
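If you schedule this with cron as you mention, a minimal crontab entry might look like the following; the time and paths are placeholders, and note that % must be escaped as \% inside a crontab:
0 2 * * * cd /path/to/project && /usr/local/bin/scrapy crawl <spider-name> -o /path/to/output/$(date +\%Y-\%m-\%d).json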
I have scrapy and scrapyd installed on a Debian machine. I log in to this server using an SSH tunnel. I then start scrapyd by running:
scrapyd
Scrapyd starts up fine and I then open up another ssh-tunnel to the server and schedule my spider with:
curl localhost:6800/schedule.json -d project=myproject -d spider=myspider
The spider runs nicely and everything is fine.
The problem is that scrapyd stops running when I quit the session in which I started scrapyd. This prevents me from using cron to schedule spiders with scrapyd, since scrapyd isn't running when the cronjob is launched.
My simple question is: How do I keep scrapyd running so that it doesn't shut down when I quit the ssh session.
Run it in a screen session:
$ screen
$ scrapyd
# hit ctrl-a, then d to detach from that screen
$ screen -r # to re-attach to your scrapyd process
You might consider launching scrapyd with supervisor.
And there is a good .conf script available as a gist here:
https://github.com/JallyHe/scrapyd/blob/master/supervisord.conf
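For reference, a minimal [program] section for Scrapyd might look like this; the paths and user are assumptions you should adapt to your install:
[program:scrapyd]
command=/usr/local/bin/scrapyd
directory=/var/lib/scrapyd
user=scrapy
autostart=true
autorestart=true
stdout_logfile=/var/log/scrapyd.out.log
stderr_logfile=/var/log/scrapyd.err.log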
How about:
$ sudo service scrapyd start