I want to scrape all the images of the items (iPhones) on this web page. First I extract all the image links, then I send a request to each src one by one and download the images to the folder '/phone/'. Here is my code:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        print 'hi'
        self.crawl('https://s.taobao.com/search?q=iphone&imgfile=&ie=utf8', callback=self.index_page, fetch_type='js')

    #@config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        items = response.doc('.item').items()
        for item in items:
            imgurl = item('.J_ItemPic img').attr('.src')
            if imgurl:
                filename = item('.J_ItemPic.img').attr('.id')
                self.crawl(imgurl, callback=self.scrape_photo, save={'filename': filename})

    def save_photo(self, content, filename):
        with open('phone/' + filename, 'wb') as f:
            f.write(content)

    def scrape_photo(self, response):
        content = response.content
        filename = response.save['filename'] + '.jpg'
        self.save_photo(content, filename)
It's quite intuitive and simple. But when I run the code, nothing happens and I just get these log messages in the terminal:
[I 160602 18:57:42 scheduler:664] restart task sk:on_start data:,on_start
[I 160602 18:57:42 scheduler:771] select sk:on_start data:,on_start
[I 160602 18:57:42 tornado_fetcher:178] [200] sk:on_start data:,on_start 0s
[I 160602 18:57:42 processor:199] process sk:on_start data:,on_start -> [200] len:8 -> result:None fol:1 msg:0 err:None
[I 160602 18:57:42 scheduler:712] task done sk:on_start data:,on_start
This issue is driving me crazy. Could you please tell me what the problem is and how I can fix it? Thanks in advance!
Have you ever crawled the link 'https://s.taobao.com/search?q=iphone&imgfile=&ie=utf8' before?
pyspider discards already-crawled links by default (and your commented-out @config(age=10 * 24 * 60 * 60) means it will never recrawl them).
If you want to restart the whole project, http://docs.pyspider.org/en/latest/apis/self.crawl/#itag will help.
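For example, here is a minimal sketch of forcing a recrawl via itag; the value 'v2' is just an illustrative marker you bump whenever you want previously seen URLs treated as new tasks (and @config(age=...) is the alternative if you only want pages to expire after some time):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        # Change this marker (e.g. 'v2' -> 'v3') to make pyspider treat
        # already-crawled URLs as new tasks and fetch them again.
        'itag': 'v2',
    }

    @config(age=10 * 24 * 60 * 60)  # or let pages expire after 10 days
    def index_page(self, response):
        ...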
In the code below I am trying to collect email IDs from a website. They can be on the contact or about-us page.
From the parse method I follow to the extemail method for all those pages.
From every page I collect a few email IDs.
Now I need to print them together with the original record sent to the __init__ method.
For example:
record = "https://www.wockenfusscandies.com/"
I want to print the output as:
https://www.wockenfusscandies.com/|abc@gmail.com|def@outlook.com
I am not able to store them in self.emails and deliver them back to the __init__ method.
Please help.
import scrapy
from scrapy.crawler import CrawlerProcess

class EmailSpider(scrapy.Spider):
    def __init__(self, record):
        self.record = record
        self.emails = []
        url = record.split("|")[4]
        if not url.startswith("http"):
            url = "http://{}".format(url)
        if url:
            self.start_urls = ["https://www.wockenfusscandies.com/"]
        else:
            self.start_urls = []

    def parse(self, response):
        contact_list = [a.attrib['href'] for a in response.css('a') if 'contact' in a.attrib['href'] or 'about' in a.attrib['href']]
        contact_list.append(response.request.url)
        for fllink in contact_list:
            yield response.follow(fllink, self.extemail)

    def extemail(self, response):
        emails = response.css('body').re('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
        yield {
            'emails': emails
        }

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

f = open("/Users/kalpesh/work/data/test.csv")
for rec in f:
    process.crawl(EmailSpider, record=rec)
f.close()

process.start()
If I understand your intent correctly, you could try the following approach:
a) collect the email IDs in self.emails, like this:
def extemail(self, response):
    emails = response.css('body').re('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
    self.emails.extend(emails)  # accumulate results from every page instead of overwriting
    yield {
        'emails': emails
    }
(or however else you extract the email IDs from emails)
b) add a close(self, reason) method, as in the GitHub example, which is called when the spider has finished:
def close(self, reason):
    mails_for_record = ""
    for mail in self.emails:
        mails_for_record += mail + "|"
    print(self.record + mails_for_record)
Please also note: I read somewhere that for some versions of Scrapy it is def close(self, reason), and for others it is def closed(self, reason).
Hope this approach helps you.
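For reference, a minimal sketch putting a) and b) together, assuming (as in the question's code) that the URL is the fifth pipe-separated field of the record and that your Scrapy version uses the closed spelling:

import scrapy


class EmailSpider(scrapy.Spider):
    name = "email_spider"

    def __init__(self, record, **kwargs):
        super().__init__(**kwargs)
        self.record = record.strip()
        self.emails = []
        # Assumption: as in the question, the URL is the fifth pipe-separated
        # field of the record line read from the CSV file.
        url = self.record.split("|")[4]
        if not url.startswith("http"):
            url = "http://{}".format(url)
        self.start_urls = [url]

    def parse(self, response):
        links = [a.attrib['href'] for a in response.css('a')
                 if 'contact' in a.attrib.get('href', '') or 'about' in a.attrib.get('href', '')]
        links.append(response.request.url)
        for link in links:
            yield response.follow(link, self.extemail)

    def extemail(self, response):
        emails = response.css('body').re(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
        self.emails.extend(emails)

    def closed(self, reason):
        # Spelled close(self, reason) in some Scrapy versions, closed(self, reason) in others.
        print(self.record + "|" + "|".join(sorted(set(self.emails))))

You would run it the same way as in the question, e.g. process.crawl(EmailSpider, record=rec).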
You should visit all of the site's pages before yielding the result for that site.
This means you need a queue of pages to visit and somewhere to store the results.
This can be done using meta.
Some pseudocode:
def parse(self, response):
    meta = response.meta
    if not meta.get('seen'):
        # -- finding urls of contact and about us pages --
        # -- putting it to meta['queue'] --
        # -- setting meta['seen'] = True

    page_emails_found = ...getting emails here...

    # --- extending already discovered emails
    # --- from other pages/initial empty list with new ones
    meta['emails'].extend(page_emails_found)

    # if queue isn't empty - yielding new request
    if meta['queue']:
        next_url = meta['queue'].pop()
        yield Request(next_url, callback=self.parse, meta=copy(meta))
    # if queue is empty - yielding result from meta
    else:
        yield {'url': current_domain, 'emails': meta['emails']}
Something like this..
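Filling in the blanks, here is a hedged sketch of that meta-based approach. The email regex and the contact/about link filter are borrowed from the question; the rest (spider name, start URL, dont_filter flag) is an assumption:

import scrapy


class EmailSpider(scrapy.Spider):
    name = "email_meta_spider"
    # Assumption: one start URL per spider run, as in the question.
    start_urls = ["https://www.wockenfusscandies.com/"]

    def parse(self, response):
        meta = dict(response.meta)
        if not meta.get('seen'):
            # First visit: queue up the contact / about pages and mark the site as seen.
            queue = [response.urljoin(a.attrib['href']) for a in response.css('a')
                     if 'contact' in a.attrib.get('href', '') or 'about' in a.attrib.get('href', '')]
            meta.update(seen=True, queue=queue, emails=[], start=response.url)

        # Extend the emails discovered so far with the ones on the current page.
        found = response.css('body').re(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
        meta['emails'].extend(found)

        if meta['queue']:
            # Queue isn't empty: visit the next page, carrying the state along.
            yield scrapy.Request(meta['queue'].pop(), callback=self.parse,
                                 meta=meta, dont_filter=True)
        else:
            # Queue is empty: yield one result for the whole site.
            yield {'url': meta['start'], 'emails': sorted(set(meta['emails']))}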
I posted a similar question earlier but I think this is a more refined question.
I'm trying to scrape: https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=0
My code randomly throws errors when I send a GET request to the URL. After debugging, I saw the following happen: a GET request is sent for a URL like this one (an example; it could happen on any page): https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=2400
The webpage will then say "There were no matching transactions found.". However, if I refresh the page, the content loads. I'm using BeautifulSoup and Selenium and have put sleep statements in my code in the hope that it would work, but to no avail. Is this a problem on the website's end? It doesn't make sense to me how one GET request returns nothing but the exact same request returns something. Also, is there anything I can do to fix it, or is it out of my control?
Here is a sample of my code:
def scrapeWebsite(url, start, stop):
    driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
    print(start, stop)

    madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}

    #for i in range(0, 214025, 25):
    for i in range(start, stop, 25):
        print("Current Page: " + str(i))
        currUrl = url + str(i)
        #print(currUrl)
        #r = requests.get(currUrl)
        #soupPage = BeautifulSoup(r.content)
        driver.get(currUrl)
        #Sleep program for dynamic refreshing
        time.sleep(1)
        soupPage = BeautifulSoup(driver.page_source, 'html.parser')
        #page = urllib2.urlopen(currUrl)
        #time.sleep(2)
        #soupPage = BeautifulSoup(page, 'html.parser')
        info = soupPage.find("table", attrs={'class': 'datatable center'})
        time.sleep(1)
        extractedInfo = info.findAll("td")
The error occurs at the last line: the findAll call fails because info is None when the table is missing from the page (meaning the GET request returned no matching content).
I worked around this and scraped all the pages using try/except.
Probably the request loop is so fast that the page can't keep up.
See the example below; it worked like a charm:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=' \
      '&PlayerMovementChkBx=yes&submit=Search&start=%s'


def scrape(start=0, stop=214525):
    for page in range(start, stop, 25):
        current_url = URL % page
        print('scrape: current %s' % page)
        while True:
            try:
                response = requests.request('GET', current_url)
                if response.ok:
                    soup = BeautifulSoup(response.content.decode('utf-8'), features='html.parser')
                    table = soup.find("table", attrs={'class': 'datatable center'})
                    trs = table.find_all('tr')
                    slice_pos = 1 if page > 0 else 0
                    for tr in trs[slice_pos:]:
                        yield tr.find_all('td')
                    break
            except Exception as exception:
                print(exception)


for columns in scrape():
    values = [column.text.strip() for column in columns]
    # Continue your code ...
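If the bare while True worries you (it retries forever when a page never loads), here is a hedged variant with a retry cap and a short pause between attempts; the 3-attempt limit, the 30-second timeout, and the 2-second delay are arbitrary choices, not anything the site requires:

import time

import requests
from bs4 import BeautifulSoup


def fetch_table(url, attempts=3, delay=2.0):
    """Fetch one results page, retrying a few times before giving up."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=30)
            if response.ok:
                soup = BeautifulSoup(response.text, features='html.parser')
                table = soup.find("table", attrs={'class': 'datatable center'})
                if table is not None:
                    return table
        except requests.RequestException as exc:
            print('attempt %d failed: %s' % (attempt + 1, exc))
        time.sleep(delay)  # give the site a moment before retrying
    return None  # the caller decides what to do with a page that never loads

fetch_table could then replace the requests/BeautifulSoup block inside the scrape loop above, yielding rows only when it returns a table.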
Running the first.feature file on its own succeeds; however, calling it from second.feature fails without any clue to analyze. Do you have any idea how I can find the root cause?
The source of my first.feature:
Feature: Sampling management - sample registration

  Background: Read the randomly generated barcode, phone number, sample type, etc. as input parameters
    * url baseURL
    * def randomData = Java.type('utils.RandomData')
    * def barcode = randomData.getRandom(11)
    * def randomPhone = randomData.getTelephone()
    * def sampletype = randomData.getNum(0,1)

  Scenario: Register a sample with valid input parameters and confirm the registration succeeds
    Given path 'iEhr/PersonSample'
    # * header Content-type = 'application/x-www-form-urlencoded; charset=UTF-8'
    * cookies { JSESSIONID: '#(jsessionID)', SESSION: '#(sessionID)', ACMETMP: '#(acmetmpID)'}
    * def autoMotherName = "autoMname" + barcode
    # * def confData = {mothername: "#(autoMotherName)", barcode: "#(barcode)", mobile: '#(randomPhone)', sampletype: "#(sampletype)" }
    # Set sampletype to 1 (already sampled)
    * def confData = {mothername: "#(autoMotherName)", barcode: "#(barcode)", mobile: '#(randomPhone)', sampletype: "1" }
    # Print the input parameters
    * print confData
    # Keep the test case separate from its data
    * def paramObj = read('classpath:mainFlow/sampleSaveReqTest.json')
    * print paramObj
    * form field param = paramObj
    When method post
    Then status 200
    * json result = response[0].result
    * def personId = result[0].personid
    * def sampleid = result[0].sampleid
    * print personId
    * print sampleid
The source of my second.feature:
Feature: Submit and deliver samples

  Background:
    * def sampleResult = call read('classpath:mainFlow/first.feature')
    * print sampleResult
When I run first.feature on its own, it works. However, Karate reports the error below after running second.feature. Any idea how I can debug this to find the root cause? I have no idea what is wrong with the second read. Many thanks!
* def sampleResult = call read('classpath:mainFlow/first.feature')
-unknown-:14 - javascript evaluation failed: read('classpath:mainFlow/first.feature'), null
Look for some issue with karate-config.js. As Babu said in the comments, it is very hard to make out what the problem is; I suggest you follow this process: https://github.com/intuit/karate/wiki/How-to-Submit-an-Issue
Also try the latest preview version 0.9.3.RC2 to see whether it is better at showing what the error is.
If you can replicate the problem as a small example, it will help us, because we really need to do better at showing useful error logs instead of just null.
I'm a newbie here, so forgive my question.
I have a URL, http://example.com/news?count=XX, and I want Scrapy to go over all counts (1, 2, 3, 4, 5, ...) until it reaches an empty page (no HTML) or a 404 page.
My issue is that the total count is unknown, so I'm not sure how I can tell Scrapy to work like this:
http://example.com/news?count=1 ===> found data, save it
http://example.com/news?count=2 ===> found data, save it
http://example.com/news?count=3 ===> found data, save it
....
....
....
http://example.com/news?count=X ===> no data found, stop here.
Just code a spider to do it:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/news?count=1"]
    count = 1

    def parse(self, response):
        ... make your magic! ...
        self.count = self.count + 1
        next_url = response.url[:-1] + str(self.count)
        yield scrapy.Request(next_url, callback=self.parse)
Obviously you must improve the logic in next_url if you want count > 9.
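For instance, here is a hedged sketch that rebuilds the count query parameter instead of editing the last character of the URL, and stops when a page comes back empty; the response.css('article') check is only a placeholder for whatever "no data" looks like on your pages:

import scrapy
from w3lib.url import add_or_replace_parameter


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/news?count=1"]
    # Let 404 responses reach parse so the crawl can stop cleanly.
    handle_httpstatus_list = [404]
    count = 1

    def parse(self, response):
        # Placeholder stop condition: a 404 or a page with no items ends the crawl.
        items = response.css('article')
        if response.status == 404 or not items:
            return

        for item in items:
            yield {'title': item.css('h2::text').get()}

        self.count += 1
        next_url = add_or_replace_parameter(response.url, 'count', str(self.count))
        yield scrapy.Request(next_url, callback=self.parse)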
I save crawled URLs in a MySQL database. When Scrapy crawls the sites again, the scheduler or the downloader should only hit/crawl/download a page if its URL is not in the database.
#settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'myproject.middlewares.DupFilterMiddleware': 390,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    # Disable compression middleware, so the actual HTML pages are cached
}
#middlewares.py
class DupFilterMiddleware(object):
    def process_response(self, request, response, spider):
        conn = MySQLdb.connect(user='dbuser', passwd='dbpass', db='dbname', host='localhost', charset='utf8', use_unicode=True)
        cursor = conn.cursor()
        log.msg("Make mysql connection", level=log.INFO)

        cursor.execute("""SELECT id FROM scrapy WHERE url = %s""", (response.url))

        if cursor.fetchone() is None:
            return None
        else:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % response.url)
#spider.py
class TestSpider(CrawlSpider):
    name = "test_spider"
    allowed_domains = ["test.com"]
    start_urls = ["http://test.com/company/JV-Driver-Jobs-dHJhZGVzODkydGVhbA%3D%3D"]
    rules = [
        Rule(SgmlLinkExtractor(allow=("http://example.com/job/(.*)",)), callback="parse_items"),
        Rule(SgmlLinkExtractor(allow=("http://example.com/company/",)), follow=True),
    ]

    def parse_items(self, response):
        l = XPathItemLoader(testItem(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
        l.add_xpath('job_title', '//h1/text()')
        l.add_value('url', response.url)
        l.add_xpath('job_description', '//tr[2]/td[2]')
        l.add_value('job_code', '99')
        return l.load_item()
It works, but I get ERROR: Error downloading from the raised IgnoreRequest(). Is that intended?
2013-10-15 17:54:16-0600 [test_spider] ERROR: Error downloading <GET http://example.com/job/aaa>: Duplicate --db-- item found: http://example.com/job/aaa
Another problem with my approach is that I have to run a query for every URL I am going to crawl. Say I have 10k URLs to crawl; that means I hit the MySQL server 10k times. How can I do it in one MySQL query? (e.g. get all crawled URLs, store them somewhere, then check each request URL against them)
Update:
Following audiodude's suggestion, here is my latest code. However, DupFilterMiddleware stops working: it runs __init__ but never calls process_request anymore. Removing __init__ makes process_request work again. What did I do wrong?
class DupFilterMiddleware(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='myuser', passwd='mypw', db='mydb', host='localhost', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

        self.url_set = set()
        self.cursor.execute('SELECT url FROM scrapy')
        for url in self.cursor.fetchall():
            self.url_set.add(url)

        print self.url_set
        log.msg("DupFilterMiddleware Initialize mysql connection", level=log.INFO)

    def process_request(self, request, spider):
        log.msg("Process Request URL:{%s}" % request.url, level=log.WARNING)
        if request.url in url_set:
            log.msg("IgnoreRequest Exception {%s}" % request.url, level=log.WARNING)
            raise IgnoreRequest()
        else:
            return None
A few things I can think of:
First, you should use process_request in your DupFilterMiddleware. That way, you filter the request before it even gets downloaded. Your current solution wastes a lot of time and resources downloading pages that eventually get thrown out.
Secondly, you should not connect to your database inside process_response/process_request. That means you are creating a new connection for every item (and throwing away the old one). This is very inefficient. Try the following:
class DupFilterMiddleware(object):
    def __init__(self):
        self.conn = MySQLdb.connect(...
        self.cursor = self.conn.cursor()
Then replace cursor.execute(... in your process_response method with self.cursor.execute(...
Finally, I would agree that it can be suboptimal to hit the MySQL server 10k times. For such a low volume of data, why not load it all into a set() in memory? Put this in the __init__ method of your downloader middleware:
self.url_set = set()
self.cursor.execute('SELECT url FROM scrapy')
for row in self.cursor.fetchall():
    self.url_set.add(row[0])  # each row is a 1-tuple, so keep just the url column
Then instead of executing a query and checking results, simply do:
if response.url in self.url_set:
    raise IgnoreRequest(...
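Putting these pieces together, here is a rough sketch of what the middleware could look like. The connection details and the scrapy table with its url column are copied from the question; treat it as an illustration rather than a drop-in replacement:

import MySQLdb
from scrapy.exceptions import IgnoreRequest


class DupFilterMiddleware(object):
    """Drop requests whose URL was already crawled, using one query at startup."""

    def __init__(self):
        conn = MySQLdb.connect(user='dbuser', passwd='dbpass', db='dbname',
                               host='localhost', charset='utf8', use_unicode=True)
        cursor = conn.cursor()
        # One query up front instead of one query per URL.
        cursor.execute('SELECT url FROM scrapy')
        self.url_set = set(row[0] for row in cursor.fetchall())
        conn.close()

    def process_request(self, request, spider):
        # Filter before the download happens, so duplicate pages are never fetched.
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None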