Update spider code controlled by scrapyd - scrapy

What is the proper way to install/activate a spider that is controlled by scrapyd?
Suppose I install a new spider version using scrapyd-deploy while a job is currently running. Do I have to stop the job using cancel.json and then schedule a new job?

Answering my own question:
I wrote a little python script that stops all running spiders. After running this script, I run scrapyd-deploy, then relaunch my spiders.
I am still not sure whether this is how scrapy pros would do it, but it looks sensible to me.
This is the script (replace the value of PROJECT with your project's name); it requires the requests package (pip install requests):
import requests
import sys
import time

PROJECT = 'crawler'  # replace with your project's name

resp = requests.get("http://localhost:6800/listjobs.json?project=%s" % PROJECT)
list_json = resp.json()

failed = False
count = len(list_json["running"])
if count == 0:
    print "No running spiders found."
    sys.exit(0)

for sp in list_json["running"]:
    # cancel this spider
    r = requests.post("http://localhost:6800/cancel.json", data={"project": PROJECT, "job": sp["id"]})
    print "Sent cancel request for %s %s" % (sp["spider"], sp["id"])
    print "Status: %s" % r.json()
    if r.json()["status"] != "ok":
        print "ERROR: Failed to stop spider %s" % sp["spider"]
        failed = True

if failed:
    sys.exit(1)

# poll running spiders and wait until all spiders are down
while count:
    time.sleep(2)
    resp = requests.get("http://localhost:6800/listjobs.json?project=%s" % PROJECT)
    count = len(resp.json()["running"])
    print "%d spiders still running" % count

Related

stderr changes behavior of python's Popen when closing the script

I was using the code below on Python 2 to launch an application and then immediately exit the script without waiting for the child process to end:
import os
import platform
import subprocess
import sys

kwargs = {}
if platform.system() == 'Windows':
    # from msdn [1]
    CREATE_NEW_PROCESS_GROUP = 0x00000200  # note: could get it from subprocess
    DETACHED_PROCESS = 0x00000008          # 0x8 | 0x200 == 0x208
    kwargs.update(creationflags=DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP)
    kwargs.update(stderr=subprocess.PIPE)
elif sys.version_info < (3, 2):  # assume posix
    kwargs.update(preexec_fn=os.setsid)
else:  # Python 3.2+ and Unix
    kwargs.update(start_new_session=True)

# APP is the path to the application being launched
subprocess.Popen([APP], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                 stderr=subprocess.PIPE, **kwargs)
It works OK on Python 2.7, but on Python 3.7 it does not start the expected 'APP' without changes.
To make it work on Python 3, I found two independent workarounds:
Change stderr to this: stderr=subprocess.DEVNULL
or
Add a time.sleep(0.1) call after the Popen (before closing the script).
I assume this is not actually related to Python itself, but to some event that needs to happen after the process is started before the Python script can safely exit?
Any hints? I'd really like to know why this happens. For now I have simply added the sleep call.
Thank you

Using scrapy: created a spider but unable to store the data to CSV

I'm pretty new to scrapy. I created a spider for the Amazon URL below, but I am unable to get the output into a CSV file.
Here is my code:
import scrapy

class AmazonMotoMobilesSpider(scrapy.Spider):
    name = "amazon"
    start_urls = ['https://www.amazon.in/Samsung-Mobiles/b/ref=amb_link_47?ie=UTF8&node=4363159031&pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-leftnav&pf_rd_r=NGA52N9RAWY1W103MPZX&pf_rd_r=NGA52N9RAWY1W103MPZX&pf_rd_t=101&pf_rd_p=1ce3e975-c6e8-479a-8485-2e490b9f58a9&pf_rd_p=1ce3e975-c6e8-479a-8485-2e490b9f58a9&pf_rd_i=1389401031',]

    def parse(self, response):
        product_name = response.xpath('//h2[contains(@class,"a-size-base s-inline s-access-title a-text-normal")]/text()').extract()
        product_price = response.xpath('//span[contains(@class,"a-size-base a-color-price s-price a-text-bold")]/text()').extract()
        yield {'product_name'product_name,'product_price': product_price}
My shell is showing this result:
len(response.xpath('//h2[contains(@class,"a-size-base s-inline s-access-title a-text-normal")]/text()'))
24
Do I need to change any settings?
To generate results in CSV, you need to run the crawler with an output option:
scrapy crawl spidername -o results.csv
Only when you activate an output are the results written to the file. Otherwise they are just processed by your pipelines; if you are not saving them anywhere through a pipeline, they will only show up in the terminal's console logs.
I think it's because your yield has a syntax error in the dictionary (a missing colon).
Change this
yield {'product_name'product_name,'product_price': product_price}
to
yield {'product_name':product_name,'product_price': product_price}
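If you'd rather not pass -o on every run, the feed export can also be configured on the spider itself. A minimal sketch, assuming Scrapy 2.1+ where the FEEDS setting is available (older versions use FEED_URI and FEED_FORMAT instead); the file name results.csv is just an example:
import scrapy

class AmazonMotoMobilesSpider(scrapy.Spider):
    name = "amazon"
    # write scraped items to results.csv on every run of this spider
    custom_settings = {
        "FEEDS": {
            "results.csv": {"format": "csv"},
        },
    }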

How to use scrapyd to start a weekly or monthly task that runs a spider?

As the title says: how do I schedule a weekly or monthly spider run with scrapyd, and how do I use the setting parameter? From the schedule.json documentation:
Parameters:
project (string, required) - the project name
spider (string, required) - the spider name
setting (string, optional) - a Scrapy setting to use when running the spider
jobid (string, optional) - a job id used to identify the job, overrides the default generated UUID
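As for the setting parameter: it is passed once per Scrapy setting to override, as setting=NAME=value. A minimal sketch of such a schedule call from Python with requests (the project, spider and jobid values are placeholders):
import requests

# schedule a run and override a couple of Scrapy settings for this job only;
# a list of tuples is used so the "setting" key can be repeated
resp = requests.post("http://localhost:6800/schedule.json", data=[
    ("project", "crawler"),           # placeholder project name
    ("spider", "amazon"),             # placeholder spider name
    ("setting", "DOWNLOAD_DELAY=2"),  # each setting=NAME=value overrides one setting
    ("setting", "LOG_LEVEL=INFO"),
    ("jobid", "weekly-run-01"),       # optional, overrides the generated UUID
])
print(resp.json())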
You can leverage crontab to schedule weekly and monthly spider runs. Once you have deployed your scrapy project to scrapyd, you can use a function like the one below to create the crontab entry, using the python-crontab package:
from crontab import CronTab

def set_cron_job(project_name, spider_name, log_path, run_type, hours, minutes, dow, dom, month):
    my_cron = CronTab(user='xyz')
    cmd = 'curl http://localhost:6800/schedule.json -d project=' + project_name + ' -d spider=' + spider_name + ' > ' + log_path + ' 2>&1'
    job = my_cron.new(command=cmd)
    if run_type.lower() == "daily":
        job.minute.on(minutes)
        job.hour.on(hours)
    elif run_type.lower() == "weekly":
        job.dow.on(dow)
        job.minute.on(minutes)
        job.hour.on(hours)
    elif run_type.lower() == "monthly":
        job.dom.on(dom)
        job.minute.on(minutes)
        job.hour.on(hours)
    elif run_type.lower() == "yearly":
        job.dom.on(dom)
        job.month.on(month)
        job.minute.on(minutes)
        job.hour.on(hours)
    my_cron.write()
    print my_cron.render()
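For example, a call like the one below (all values are placeholders) creates a crontab entry that runs the spider every Monday at 02:30; the dom and month arguments are only used for monthly and yearly runs, so they can be passed as None here:
# run the 'amazon' spider of project 'crawler' every Monday at 02:30
set_cron_job(
    project_name='crawler',
    spider_name='amazon',
    log_path='/var/log/crawler/amazon-cron.log',
    run_type='weekly',
    hours=2,
    minutes=30,
    dow=1,       # 1 = Monday in cron's day-of-week field
    dom=None,    # only used for monthly/yearly runs
    month=None,  # only used for yearly runs
)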

Running an external selenium script from Groovy [SOAPUI]

I see online that the way to run a simple python file from groovy is:
def cmdArray2 = ["python", "/Users/test/temp/hello.py"]
def cmd2 = cmdArray2.execute()
cmd2.waitForOrKill(1000)
log.info cmd2.text
If hello.py contains print "Hello", it seems to work fine.
But when I try to run a .py file containing the Selenium code below, nothing happens.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.PhantomJS(executable_path=r'C:\phantomjs\bin\phantomjs.exe')
driver.get("http://www.google.com") # Load page
# 1 & 2
title = driver.title
print title, len(title)
driver.quit()
Any help would be appreciated.
FYI - I have tried using all the browsers including headless browsers but no luck.
Also, I am able to run the selenium script fine from the command line. But when I run it from SOAPUI, I get no errors; the script runs and I do not see anything in the log.
Most likely you do not see any errors from your python script because they are printed to stderr, not stdout (which is what cmd2.text reads). Try this groovy script to check the error messages from the python script's stderr:
def cmdArray2 = ["python", "/Users/test/temp/hello.py"]
def process = new ProcessBuilder(cmdArray2).redirectErrorStream(true).start()
process.inputStream.eachLine {
    log.warn(it)
}
process.waitFor()
return process.exitValue()
Another thing you might want to try is using selenium directly from groovy without calling external python script.

Is it possible to ack nagios alerts from the terminal?

I have nagios alerts set up to come through jabber with an http link to ack.
Is it possible that there is a script I can run from a terminal on a remote workstation that takes the hostname as a parameter and acks the alert?
./ack hostname
The benefit, while seemingly mundane, is threefold. First, it takes HTTP load off nagios. Second, the nagios web pages can take 10-20 seconds to load, so I save time there. Third, it avoids the slower workflow of mouse + web interface + an annoyingly slow browser.
Ideally, I would like a script bound to a keyboard shortcut that simply acks the most recent alert. Finally, I want to take the inputs from a joystick, buttons and whatnot, and connect one to a big red button bound to the script so I can just ack the most recent nagios alert by hitting the button lol. (It would be rad too if the button had a screen on the enclosure that showed the text of the alert getting acked lol)
Make fun of me all you want, but this is actually something that would be useful to me. If I can save five seconds per alert, and I get 200 alerts per day I need to ack, that's saving me 15 minutes a day. And isn't the whole point of the sysadmin to automate what can be automated?
Thanks!
Yes, it's possible. You can find the services to ack by parsing the /var/lib/nagios3/retention.dat file. See:
#!/usr/bin/env python
# -*- coding: utf8 -*-
# vim:ts=4:sw=4
import sys

file = "/var/lib/nagios3/retention.dat"

try:
    sys.argv[1]
except:
    print("Usage:\n" + sys.argv[0] + " <HOST>\n")
    sys.exit(1)

f = open(file, "r")
line = f.readline()
c = 0
name = {}
state = {}
host = {}
# walk through retention.dat and collect service name, state and host per block
while line:
    if "service_description=" in line:
        name[c] = line.split("=", 2)[1]
    elif "current_state=" in line:
        state[c] = line.split("=", 2)[1]
    elif "host_name=" in line:
        host[c] = line.split("=", 2)[1]
    elif "}" in line:
        c += 1
    line = f.readline()

# print every service of the given host that is not in state OK (0)
for i in name:
    num = int(state[i])
    if num > 0 and sys.argv[1] == host[i].strip():
        print(name[i].strip("\n"))
You simply have to pass the host as a parameter, and the script will display that host's broken services.
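As for the acknowledgement itself, one common approach is to write an ACKNOWLEDGE_SVC_PROBLEM external command to the Nagios command file. A minimal sketch, assuming external commands are enabled and the command file lives at /var/lib/nagios3/rw/nagios.cmd (the path, host and service names are placeholders to adapt):
#!/usr/bin/env python
import sys
import time

# path of the Nagios external command file; adjust to your installation
COMMAND_FILE = "/var/lib/nagios3/rw/nagios.cmd"

def ack_service(host, service, author="cli", comment="acked from terminal"):
    # documented format of the external command:
    # [time] ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    now = int(time.time())
    cmd = "[%d] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;1;1;%s;%s\n" % (
        now, host, service, author, comment)
    with open(COMMAND_FILE, "w") as f:  # the command file is a named pipe
        f.write(cmd)

if __name__ == "__main__":
    # usage: ./ack.py <host> <service>
    ack_service(sys.argv[1], sys.argv[2])

Combined with the retention.dat script above, this could be wrapped in a small loop that acks every broken service of a host, which is what the big red button would call.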