Scrapy start_requests() didn't yield all requests - scrapy

def start_requests(self):
db = SeedUserGenerator()
result = db.selectSeedUsers()
db.closeDB()
urls = []
for name in result:
urls.append(self.user_info_url.format(name))
for url in urls:
yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
print('fin')
def parse_user(self, response):
.........ignore some code here...........
yield Request(url=next_url, priority=20, callback=self.parse_info)
def parse_info(self, response):
.........ignore some code here...........
yield Request(url=next_url, priority=30, callback=self.parse_user)
The program runs as follows:
several Requests yields from start_requests, and the function start_requests seems to be paused without outputing the string fin.
a response comes, and the function parse_user yield another Request, but the remaining Requests in the function start_requests can not be yield until the response has been processed, and here the yield operation formed a ring.
It seems to be
synchronous: Before sending a Request from start_requests and processing its response, other Requests can not be yield?
Is that mean scrapy can never yield the remaining Requests in the function start_requests?
How could I make scrapy finish running start_requests first?
I'm new in python and scrapy. Can scrapy process a response and yield Requests at the same time?
By the way, I'm using Python3.6 and Scrapy1.5.1 Twisted 20.3.0

I solved my problem by referring to the source code of Scrapy engine:
def _next_request(self, spider):
slot = self.slot
if not slot:
return
if self.paused:
return
while not self._needs_backout(spider):
if not self._next_request_from_scheduler(spider):
break
if slot.start_requests and not self._needs_backout(spider):
try:
request = next(slot.start_requests)
except StopIteration:
slot.start_requests = None
except Exception:
slot.start_requests = None
logger.error('Error while obtaining start requests',
exc_info=True, extra={'spider': spider})
else:
self.crawl(request, spider)
if self.spider_is_idle(spider) and slot.close_if_idle:
self._spider_idle(spider)
Here Scrapy always tries to get requests from scheduler's queues first, rather than start_requests.
What's more, Scrapy never put all requests of function start_requests first.
So, I change my code like this:
def start_requests(self):
db = SeedUserGenerator()
result = db.selectSeedUsers()
db.closeDB()
urls = []
for name in result:
urls.append(self.user_info_url.format(name))
yield Request(url=urls[0], callback=self.parse_temp, dont_filter=True, priority=10, meta={'urls': urls})
def parse_temp(self, response):
urls = response.meta['urls']
for url in urls:
print(url)
yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
print('fin2')
Then Scrapy put all requests into the queues first.

Related

scrapy dosn't stop after yield in python

I'm trying to make a spider that goes through a certain amount of start urls and if the resulting page is the right one I yield another request. The problem is that if I try anyway of not yielding a second request the spider will stop directly. There are no problems if I yield the second request.
Here is the relevant code:
def start_requests(self):
urls = ['https://www.hltv.org' + player for player in self.hashPlayers]
print(len(urls))
for url in urls:
return [scrapy.Request(url=url, callback=self.parse)]
def parse(self, response):
result = response.xpath("//div[#class = 'playerTeam']//a/#href").get()
if result is None:
result = response.xpath("//span[contains(concat(' ',normalize-space(#class),' '),' profile-player-stat-value bold ')]//a/#href").get()
if result is not None:
yield scrapy.Request(
url = "https://www.hltv.org" + result,
callback = self.parseTeam
)
So I want a way to make the spider to continue after I call the parse function and don't yield a request.

Need to return Scrapy callback method data to calling function

In below code I am trying to collect email ids from a website. It can be on contact or about us page.
From parse method I follow extemail method for all those pages.
From every page I collected few email ids.
Now I need to print them with original record sent to init method.
For example:
record = "https://www.wockenfusscandies.com/"
I want to print output as,
https://www.wockenfusscandies.com/|abc#gamil.com|def#outlook.com
I am not able to store them in self.emails and deliver back to init method.
Please help.
import scrapy
from scrapy.crawler import CrawlerProcess
class EmailSpider(scrapy.Spider):
def __init__(self, record):
self.record = record
self.emails = []
url = record.split("|")[4]
if not url.startswith("http"):
url = "http://{}".format(url)
if url:
self.start_urls = ["https://www.wockenfusscandies.com/"]
else:
self.start_urls = []
def parse(self, response):
contact_list = [a.attrib['href'] for a in response.css('a') if 'contact' in a.attrib['href'] or 'about' in a.attrib['href']]
contact_list.append(response.request.url)
for fllink in contact_list:
yield response.follow(fllink, self.extemail)
def extemail(self, response):
emails = response.css('body').re('[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
yield {
'emails': emails
}
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
f = open("/Users/kalpesh/work/data/test.csv")
for rec in f:
process.crawl(EmailSpider, record=rec)
f.close()
process.start()
If I understand your intend correctly you could try the following proceeding:
a) collect the mail-ids in self.emails like
def extemail(self, response):
emails = response.css('body').re('[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
self.emails = emails.copy()
yield {
'emails': emails
}
(Or on what other way you get the email-ids from emails)
b) add a close(self, reason) method as in GitHub-Example which is called when the spider has finished
def close(self, reason):
mails_for_record = ""
for mail in self.emails:
mails_for_record += mail + "|"
print(self.record + mails_for_record)
Please also note, I read somewhere that for some versions of Scrapy it is def close(self, reason), for others it is def closed(self, reason).
Hope, this proceeding helps you.
You should visit all the site pages before yielding result for this one site.
This means that you should have queue of pages to visit and results storage.
It can be done using meta.
Some pseudocode:
def parse(self, response):
meta = response.meta
if not meta.get('seen'):
# -- finding urls of contact and about us pages --
# -- putting it to meta['queue'] --
# -- setting meta['seen'] = True
page_emails_found = ...getting emails here...
# --- extending already discovered emails
# --- from other pages/initial empty list with new ones
meta['emails'].extend(page_emails_found)
# if queue isn't empty - yielding new request
if meta['queue']:
next_url = meta['queue'].pop()
yield Request(next_url, callback=self.parse, meta=copy(meta))
# if queue is empty - yielding result from meta
else:
yield {'url': current_domain, 'emails': meta['emails']}
Something like this..

How to use the yield function to scrape data from multiple pages

I'm trying to scrape data from amazon India website. I am not able collect response and parse the elements using the yield() method when:
1) I have to move from product page to review page
2) I have to move from one review page to another review page
Product page
Review page
Code flow:
1) customerReviewData() calls the getCustomerRatingsAndComments(response)
2) The getCustomerRatingsAndComments(response)
finds the URL of the review page and call the yield request method with getCrrFromReviewPage(request) as callback method, with url of this review page
3) getCrrFromReviewPage() gets new response of the firstreview page and scrape all the elements from the first review page (page loaded) and add it to customerReviewDataList[]
4) get URL of the next page if it exists and recursively call getCrrFromReviewPage() method, and crawl elements from next page, until all the review page is crawled
5) All the reviews gets added to the customerReviewDataList[]
I have tried playing around with yield() changing the parameters and also looked up the scrapy documentation for yield() and Request/Response yield
# -*- coding: utf-8 -*-
import scrapy
import logging
customerReviewDataList = []
customerReviewData = {}
#Get product name in <H1>
def getProductTitleH1(response):
titleH1 = response.xpath('normalize-space(//*[#id="productTitle"]/text())').extract()
return titleH1
def getCustomerRatingsAndComments(response):
#Fetches the relative url
reviewRelativePageUrl = response.css('#reviews-medley-footer a::attr(href)').extract()[0]
if reviewRelativePageUrl:
#get absolute URL
reviewPageAbsoluteUrl = response.urljoin(reviewRelativePageUrl)
yield Request(url = reviewPageAbsoluteUrl, callback = getCrrFromReviewPage())
self.log("yield request complete")
return len(customerReviewDataList)
def getCrrFromReviewPage():
userReviewsAndRatings = response.xpath('//div[#id="cm_cr-review_list"]/div[#data-hook="review"]')
for userReviewAndRating in userReviewsAndRatings:
customerReviewData[reviewTitle] = response.css('#cm_cr-review_list .review-title span ::text').extract()
customerReviewData[reviewDescription] = response.css('#cm_cr-review_list .review-text span::text').extract()
customerReviewDataList.append(customerReviewData)
reviewNextPageRelativeUrl = response.css('#cm_cr-pagination_bar .a-pagination .a-last a::attr(href)')[0].extract()
if reviewNextPageRelativeUrl:
reviewNextPageAbsoluteUrl = response.urljoin(reviewNextPageRelativeUrl)
yield Request(url = reviewNextPageAbsoluteUrl, callback = getCrrFromReviewPage())
class UsAmazonSpider(scrapy.Spider):
name = 'Test_Crawler'
allowed_domains = ['amazon.in']
start_urls = ['https://www.amazon.in/Philips-Trimmer-Cordless-Corded-QT4011/dp/B00JJIDBIC/ref=sr_1_3?keywords=philips&qid=1554266853&s=gateway&sr=8-3']
def parse(self, response):
titleH1 = getProductTitleH1(response),
customerReviewData = getCustomerRatingsAndComments(response)
yield{
'Title_H1' : titleH1,
'customer_Review_Data' : customerReviewData
}
I'm getting the following response:
{'Title_H1': (['Philips Beard Trimmer Cordless and Corded for Men QT4011/15'],), 'customer_Review_Data': <generator object getCustomerRatingsAndComments at 0x048AC630>}
The "Customer_review_Data" should be a list of dict of title and review
I am not able to figure out as to what mistake I am doing here.
When I use the log() or print() to see what data is captured in customerReviewDataList[], unable to see the data in the console either.
I am able to scrape all the reviews in customerReviewDataList[], if they are present in the product page,
In this scenario where I have to use the yield function I am getting the output stated above like this [https://ibb.co/kq8w6cf]
This is the kind of output I am looking for:
{'customerReviewTitle': ['Difficult to find a charger adapter'],'customerReviewComment': ['I already have a phillips trimmer which was only cordless. ], 'customerReviewTitle': ['Good Product'],'customerReviewComment': ['Solves my need perfectly HK']}]}
Any help is appreciated. Thanks in advance.
You should complete the Scrapy tutorial. The Following links section should be specially helpful to you.
This is a simplified version of your code:
def data_request_iterator():
yield Request('https://example.org')
class MySpider(Spider):
name = 'myspider'
start_urls = ['https://example.com']
def parse(self, response):
yield {
'title': response.css('title::text').get(),
'data': data_request_iterator(),
}
Instead, it should look like this:
class MySpider(Spider):
name = 'myspider'
start_urls = ['https://example.com']
def parse(self, response):
item = {
'title': response.css('title::text').get(),
}
yield Request('https://example.org', meta={'item': item}, callback=self.parse_data)
def parse_data(self, response):
item = response.meta['item']
# TODO: Extend item with data from this second response as needed.
yield item

Scrapy FormRequest return 400 error code

I am trying to scrapy following website in which the pagination is though AJAX request.
http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak
I am sending FormRequest to access the different pages, however I am getting following error.
Retrying http://studiegids.uva.nl/xmlpages/plspub/uva_search.courses_pls> (failed 1 times): 400 Bad Request
Not able to understand what is wrong? Following is the code.
class Spider(BaseSpider):
name = "zoek"
allowed_domains = ["studiegids.uva.nl"]
start_urls = ["http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"]
def parse(self, response):
base_url = "http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"
for i in range(1, 10):
data = {'p_fetch_size': unicode(20),
'p_page:': unicode(i),
'p_searchpagetype': u'courses',
'p_site_lang': u'nl',
'p_strip': u'/2014-2015',
'p_ctxparam': u'/xmlpages/page/2014-2015/',
'p_rsrcpath':u'/xmlpages/resources/TXP/studiegidswebsite/'}
yield FormRequest.from_response(response,
formdata=data,
callback=self.fetch_details,
dont_click=True)
# yield FormRequest(base_url,
# formdata=data,
# callback=self.fetch_details)
def fetch_details(self, response):
# print response.body
hxs = HtmlXPathSelector(response)
item = ZoekItem()
Studiegidsnummer = hxs.select("//div[#class=item-info']//tr[1]/td[2]/p/text()")
Studielast = hxs.select("//div[#class=item-info']//tr[2]/td[2]/p/text()")
Voertaal = hxs.select("//div[#class=item-info']//tr[3]/td[2]/p/text()")
Ingangseis = hxs.select("//div[#class=item-info']//tr[4]/td[2]/p/text()")
Studiejaar = hxs.select("//div[#class=item-info']//tr[5]/td[2]/p/text()")
Onderwijsinstituut = hxs.select("//div[#class=item-info']//tr[6]/td[2]/p/text()")
for i in range(20):
item['Studiegidsnummer'] = Studiegidsnummer
item['Studielast'] = Studielast
item['Voertaal'] = Voertaal
yield item
Try also check headers using firebug.
400 Bad Request usually means that your request does not fully match the expected request format. Common causes include missing or invalid cookies, headers or parameters.
On your web browser, open the Network tab of the Developer Tools and trigger the request. When you see the request in the Network tab, inspect it fully (parameters, headers, etc.). Try to match such a request in your code.

do not crawl saved urls in database

I save crawled urls in Mysql database. When scrapy crawls sites again, the schedule or the downloader should only hit/crawl/download the page if its url is not in database.
#settings.py
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.RandomUserAgentMiddleware': 400,
'myproject.middlewares.ProxyMiddleware': 410,
'myproject.middlewares.DupFilterMiddleware': 390,
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
# Disable compression middleware, so the actual HTML pages are cached
}
#middlewares.py
class DupFilterMiddleware(object):
def process_response(self, request, response, spider):
conn = MySQLdb.connect(user='dbuser',passwd='dbpass',db='dbname',host='localhost', charset='utf8', use_unicode=True)
cursor = conn.cursor()
log.msg("Make mysql connection", level=log.INFO)
cursor.execute("""SELECT id FROM scrapy WHERE url = %s""", (response.url))
if cursor.fetchone() is None:
return None
else:
raise IgnoreRequest("Duplicate --db-- item found: %s" % response.url)
#spider.py
class TestSpider(CrawlSpider):
name = "test_spider"
allowed_domains = ["test.com"]
start_urls = ["http://test.com/company/JV-Driver-Jobs-dHJhZGVzODkydGVhbA%3D%3D"]
rules = [
Rule(SgmlLinkExtractor(allow=("http://example.com/job/(.*)",)),callback="parse_items"),
Rule(SgmlLinkExtractor(allow=("http://example.com/company/",)), follow=True),
]
def parse_items(self, response):
l = XPathItemLoader(testItem(), response = response)
l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
l.add_xpath('job_title', '//h1/text()')
l.add_value('url',response.url)
l.add_xpath('job_description', '//tr[2]/td[2]')
l.add_value('job_code', '99')
return l.load_item()
It works but I got ERROR: Error downloading from raise IgnoreRequest() . Is it intended ?
2013-10-15 17:54:16-0600 [test_spider] ERROR: Error downloading <GET http://example.com/job/aaa>: Duplicate --db-- item found: http://example.com/job/aaa
Another problem with my approach is I have to query for each url I am going to crawl. Says, I have 10k urls to crawl which means I hit mysql server 10k times. How can i do it in 1 mysql query? (e.g. get all crawled urls and store them somewhere, then check the request url against them)
Update:
Follow audiodude suggestion, here is my latest code. However, DupFilterMiddleware stops working. It runs the init but never call process_request anymore. Removing _init_ will make the process_request works again. What did I do wrong ?
class DupFilterMiddleware(object):
def __init__(self):
self.conn = MySQLdb.connect(user='myuser',passwd='mypw',db='mydb',host='localhost', charset='utf8', use_unicode=True)
self.cursor = self.conn.cursor()
self.url_set = set()
self.cursor.execute('SELECT url FROM scrapy')
for url in self.cursor.fetchall():
self.url_set.add(url)
print self.url_set
log.msg("DupFilterMiddleware Initialize mysql connection", level=log.INFO)
def process_request(self, request, spider):
log.msg("Process Request URL:{%s}" % request.url, level=log.WARNING)
if request.url in url_set:
log.msg("IgnoreRequest Exception {%s}" % request.url, level=log.WARNING)
raise IgnoreRequest()
else:
return None
A few things I can think of:
First, you should use process_request in your DupFilterMiddleware. That way, you filter the request before it ever even gets downloaded. Your current solution is wasting alot of time and resources downloading pages that eventually get thrown out.
Secondly, you should not connect to your database inside of process_response/process_request. That means you are creating a new connection for every item (and throwing away the old one). This is very inefficient. Try the following:
class DupFilterMiddleware(object):
def __init__(self):
self.conn = MySQLdb.connect(...
self.cursor = conn.cursor()
Then replace cursor.execute(... in your process_response method with self.cursor.execute(...
Finally, I would agree that it can be suboptimal to hit the MySQL server 10k times. For such a low volume of data, why not load it all into a set() in memory. Put this in the __init__ method of your downloader middleware:
self.url_set = set()
cursor.execute('SELECT url FROM scrapy')
for url in cursor.fetchall():
self.url_set.add(url)
Then instead of executing a query and checking results, simply do:
if response.url in url_set:
raise IgnoreRequest(...