I developed a spider with Scrapy and run it with the command
scrapy crawl myspider
Now I am trying to run it in the background with:
nohup scrapy crawl myspider &
but after I close the SSH session, Scrapy stops. Why?
My guess is this has nothing to do with Scrapy.
Does this suggested solution work?
https://unix.stackexchange.com/questions/658535/what-is-the-real-reason-why-nohup-a-out-dies-when-ssh-session-times-out
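If the cause is indeed the shell signalling its children on disconnect, here is a minimal sketch of one workaround, assuming you want to launch the crawl fully detached from the terminal (the log file name is just an example):
import subprocess

# Start the crawl in its own session so it does not receive SIGHUP when the
# SSH session ends; output goes to a log file instead of the terminal.
with open("crawl.log", "ab") as log:
    subprocess.Popen(
        ["scrapy", "crawl", "myspider"],
        stdout=log,
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL,
        start_new_session=True,  # roughly equivalent to running under setsid
    )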
A related question:
I created a GCP VM (Ubuntu) and installed Python and Scrapy.
I would like to run my spider from there: scrapy crawl test -o test1.csv
I opened the terminal from GCP and ran the spider (it worked), but it will take at least 3 hours.
How can I make sure the script keeps running after I exit the terminal (browser)?
You can use nohup to make sure the crawling continues:
nohup scrapy crawl test -o test1.csv &
When you log off, the crawler will continue until it finishes. The & at the end makes the process execute in the background.
To redirect the output to a log file, you can execute it as follows:
nohup scrapy crawl test -o test1.csv &> test.log &
For a better way to run and deploy spiders on a server, you can check out Scrapyd.
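For illustration, here is a rough sketch of scheduling a deployed spider through Scrapyd's schedule.json API from Python (the project name myproject is an assumption, and the requests package must be installed):
import requests

# Schedule a crawl on a running Scrapyd instance (assumed at localhost:6800).
response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "test"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}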
You can create a run.py file in the spiders directory with the following content:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'test', '-o', 'test1.csv'])
After that, run:
nohup python -u run.py > spider_house.log 2>&1 &
If logging is configured inside the crawler, the log will be written according to the crawler's own log settings, and the output captured by nohup will not be used.
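For reference, a minimal sketch of that case, assuming the log is configured in the project's settings.py (the file name and level are just examples):
# settings.py (sketch): with LOG_FILE set, Scrapy writes its log to this file,
# so the output redirected by nohup contains little or nothing.
LOG_FILE = "spider_house.log"
LOG_LEVEL = "INFO"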
If a resumable crawl is configured (that is, the JOBDIR= setting is used) and you want to pause the crawler gracefully, so that the next start resumes from where it was paused, stop the crawler by sending SIGINT to its process:
kill -2 <pid>
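A minimal sketch of a run.py for that resumable case (the JOBDIR path is just an example):
from scrapy.cmdline import execute

# Same invocation as above, plus a job directory so the crawl can be paused
# with SIGINT (kill -2) and resumed on the next start.
execute(['scrapy', 'crawl', 'test', '-o', 'test1.csv',
         '-s', 'JOBDIR=crawls/test-run-1'])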
I'm using Splash on an Ubuntu server and followed the instructions to install it with Docker (https://github.com/scrapy-plugins/scrapy-splash):
docker run -p 8050:8050 scrapinghub/splash
How can I change the settings and set username and password?
The easiest way is to use Aquarium with the auth_user and auth_password options.
This is described in How to run Splash in production?, from the Splash FAQ.
When it comes to integrating Aquarium/Splash with Scrapy, the username and password can be passed to scrapy crawl using the -a option, as per the official Scrapy documentation.
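For illustration, a rough sketch of how -a arguments arrive in a spider; the attribute names below and the way you forward them to Splash afterwards are assumptions, not something prescribed by the linked docs:
import scrapy

class MySplashSpider(scrapy.Spider):
    name = "my_splash_spider"  # hypothetical spider name

    def __init__(self, auth_user=None, auth_password=None, *args, **kwargs):
        # `scrapy crawl my_splash_spider -a auth_user=... -a auth_password=...`
        # exposes the values here as plain keyword arguments.
        super().__init__(*args, **kwargs)
        self.auth_user = auth_user
        self.auth_password = auth_password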
In Windows PowerShell, I can run:
scrapy shell 'http://www.hao123.com'
I can also run ipython, but I cannot run scrapy shell 'http://www.hao123.com' from inside ipython (starting ipython and then typing scrapy shell 'http://www.hao123.com' does not work).
In an IPython notebook I also can't run scrapy shell 'http://www.hao123.com':
scrapy shell 'http://www.hao123.com'
File "<ipython-input-3-be4048c8f90b>", line 1
scrapy shell 'http://www.hao123.com'
^
SyntaxError: invalid syntax
IPython was installed by Anaconda and Scrapy was installed by pip; Anaconda and pip live in different directories.
Please help me!
That's not a feature you can have in IPython. scrapy shell is a command, its own application completely separate from IPython.
However, there are two things you can do:
If you have Spider and Response objects from somewhere, you can simply use scrapy.shell.inspect_response:
from scrapy import Spider
from scrapy.http import Response
from scrapy.shell import inspect_response

# You need a Scrapy spider object that has a crawler instance attached to it
some_spider = Spider(name="some_spider")
# and a response object
some_response = Response(url="http://www.hao123.com")
# note the argument order: response first, then spider
inspect_response(some_response, some_spider)
# a new IPython shell will be embedded, like the one from the scrapy shell command
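In practice, the more common pattern (a sketch with a made-up spider) is to call it from inside a spider callback, where both objects already exist:
import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = "debug_example"  # hypothetical spider
    start_urls = ["http://stackoverflow.com"]

    def parse(self, response):
        # opens an interactive (IPython) shell with `response` in scope
        inspect_response(response, self)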
Otherwise you can spawn a subprocess:
import subprocess
subprocess.call('scrapy shell http://stackoverflow.com', shell=True)
# the scrapy shell command runs in a child process and embeds its own IPython shell
If you are using Anaconda on Windows 7:
Follow: Environments --> root --> Open terminal. Then you can run:
scrapy shell 'http://www.hao123.com'
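Since the question notes that IPython came from Anaconda while Scrapy was installed with pip, a quick sanity check (just a sketch) is to confirm that both packages resolve from the same interpreter:
import sys
print(sys.executable)    # the Python interpreter currently running

import scrapy
import IPython
print(scrapy.__file__)   # where Scrapy was imported from
print(IPython.__file__)  # where IPython was imported from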
Question
I want to know how to disable Item storing in scrapyd.
What I tried
I deployed a spider to the Scrapy daemon, Scrapyd. The deployed spider stores the scraped data in a database, and that works fine.
However, Scrapyd logs each scraped item. You can see this when examining the Scrapyd web interface.
This item data is stored in ..../items/<project name>/<spider name>/<job name>.jl
I have no clue how to disable this. I run scrapyd in a Docker container and it uses way too much storage.
I have tried the suggestions from "Suppress Scrapy Item printed in logs after pipeline", but this seems to do nothing for Scrapyd. All spider logging settings seem to be ignored by Scrapyd.
Edit
I found this entry in the documentation about item storing. It seems that if you omit the items_dir setting, item logging will not happen. It is said that this is disabled by default. I do not have a scrapyd.conf file, so item logging should be disabled, but it is not.
After writing my answer I re-read your question, and I see that what you want has nothing to do with logging; it's about not writing the items to the (default-ish) .jl feed (maybe update the title to "Disable scrapyd item storing"). To override scrapyd's default, just set FEED_URI to an empty string, like this:
$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=
For other people who are looking into logging... Let's see an example. We do the usual:
$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com
then edit tutorial/spiders/example.py to contain the following:
import scrapy


class TutorialItem(scrapy.Item):
    name = scrapy.Field()
    surname = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        for i in range(100):
            t = TutorialItem()
            t['name'] = "foo"
            t['surname'] = "bar %d" % i
            yield t
Notice the difference between running:
$ scrapy crawl example
# or
$ scrapy crawl example -L DEBUG
# or
$ scrapy crawl example -s LOG_LEVEL=DEBUG
and
$ scrapy crawl example -s LOG_LEVEL=INFO
# or
$ scrapy crawl example -L INFO
By trying such combinations on your spider, confirm that it doesn't print item info for any log level above DEBUG.
Now it's time, after you deploy to scrapyd, to do exactly the same:
$ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example
Then confirm that the resulting logs don't contain the items.
Note that if your items are still printed at INFO level, it likely means that your code or some pipeline is printing them. You could raise the log level further and/or investigate, find the code that prints them, and remove it.
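For example, a pipeline that logs items only at DEBUG level (the class name below is made up) stops showing them once LOG_LEVEL is raised to INFO:
import logging

logger = logging.getLogger(__name__)

class QuietItemPipeline:
    def process_item(self, item, spider):
        # logged at DEBUG, so invisible when LOG_LEVEL=INFO
        logger.debug("scraped item: %r", item)
        return item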
I have Scrapy and Scrapyd installed on a Debian machine. I log in to this server using an SSH tunnel. I then start Scrapyd by running:
scrapyd
Scrapyd starts up fine, and I then open another SSH tunnel to the server and schedule my spider with:
curl localhost:6800/schedule.json -d project=myproject -d spider=myspider
The spider runs nicely and everything is fine.
The problem is that Scrapyd stops running when I quit the session in which I started it. This prevents me from using cron to schedule spiders with Scrapyd, since Scrapyd isn't running when the cron job is launched.
My simple question is: How do I keep scrapyd running so that it doesn't shut down when I quit the ssh session.
Run it in a screen session:
$ screen
$ scrapyd
# hit ctrl-a, then d to detach from that screen
$ screen -r # to re-attach to your scrapyd process
You might consider launching scrapyd with supervisor.
And there is a good .conf script available here:
https://github.com/JallyHe/scrapyd/blob/master/supervisord.conf
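As a rough sketch (the paths, user, and log location are assumptions to adapt), a minimal supervisor program entry could look like this:
; supervisord.conf sketch: keep scrapyd running and restart it if it dies
[program:scrapyd]
command=/usr/local/bin/scrapyd
directory=/var/lib/scrapyd
user=scrapy
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/scrapyd/supervisor.log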
How about:
$ sudo service scrapyd start