Scrapy just stops unexpectedly with no error - scrapy

Every time, Scrapy just stops after running for a few minutes, and I still can't figure out why.
I have added 69 spiders; is that too many? There are indeed lots of ParseExceptions... Any help will be appreciated.
It stops like this:
2014-12-01 12:34:06+0800 [scrapy] INFO: info : {'platform_name': '\xe7\xa7\xaf\xe6\x9c\xa8\xe7\x9b\x92\xe5\xad\x90', 'platform': 'jimubox', 'loanId': '13774', 'create_time': 1417408354133} exception : ParseException('type', 'list index out of range')
2014-12-01 12:34:06+0800 [scrapy] INFO: info : {'platform_name': '\xe7\xa7\xaf\xe6\x9c\xa8\xe7\x9b\x92\xe5\xad\x90', 'platform': 'jimubox', 'loanId': '13773', 'create_time': 1417408354133} exception : ParseException('type', 'list index out of range')
2014-12-01 12:34:07+0800 [eloancn] INFO: Crawled 19 pages (at 19 pages/min), scraped 0 items (at 0 items/min)
2014-12-01 12:34:09+0800 [qmdai] INFO: Crawled 10 pages (at 10 pages/min), scraped 0 items (at 0 items/min)
Scrapy version: 0.24.4
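A hedged diagnostic sketch, not from the question: Scrapy records a close reason for every spider, so one way to see why a crawl ended is to implement the spider's closed() hook (a shortcut for the spider_closed signal) and log the reason together with the crawl stats. The spider name here is hypothetical.

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, shown only to illustrate the closed() shortcut.
    name = 'example'

    def closed(self, reason):
        # reason is e.g. 'finished', 'shutdown' or 'closespider_timeout'
        self.log('Spider closed, reason: %s' % reason)
        self.log('Stats: %s' % self.crawler.stats.get_stats())

If the reason is 'finished', the spider simply ran out of requests, which can happen quickly when callbacks fail before yielding any follow-up requests.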

Related

Amazon reviews: List index out of range

I would like to scrape the customer reviews of the Kindle Paperwhite from Amazon.
I am aware that although Amazon might say they have 5,900 reviews, it is only possible to access 5,000 of them (after page=500, no more reviews are displayed at 10 reviews per page).
For the first few pages my spider returns 10 reviews per page, but later this shrinks to just one or two, which results in only about 1,300 reviews.
There seems to be a problem with adding the data of the variables "helpful" and "verified". Both throw the following error:
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Any help would be greatly appreciated!
I tried implementing if statements in case the variables were empty or contained a list, but it didn't work.
My spider, amazon_reviews.py:
import scrapy
from scrapy.extensions.throttle import AutoThrottle

class AmazonReviewsSpider(scrapy.Spider):
    name = 'amazon_reviews'
    allowed_domains = ['amazon.com']
    myBaseUrl = "https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber="
    start_urls = []
    # Creating the list of urls to be scraped by appending the page number to the end of the base url
    for i in range(1, 550):
        start_urls.append(myBaseUrl + str(i))

    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        # Collecting various data
        star_rating = data.css('.review-rating')
        title = data.css('.review-title')
        text = data.css('.review-text')
        date = data.css('.review-date')
        # Number of people who thought the review was helpful.
        helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
        verified = response.xpath('.//span[@data-hook="avp-badge"]//text()').extract()
        # I scrape more information, but deleted it here not to make the code too big
        # yielding the scraped results
        count = 0
        for review in star_rating:
            yield {
                'ASIN': 'B07CXG6C9W',
                #'ID': ''.join(id.xpath('.//text()').extract()),
                'stars': ''.join(review.xpath('.//text()').extract_first()),
                'title': ''.join(title[count].xpath(".//text()").extract_first()),
                'text': ''.join(text[count].xpath(".//text()").extract_first()),
                'date': ''.join(date[count].xpath(".//text()").extract_first()),
                ### There seems to be a problem with adding these two, as I get 5000 reviews back if I delete them. ###
                'verified purchase': ''.join(verified[count]),
                'helpful': ''.join(helpful[count])
            }
            count = count + 1
My settings.py:
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_TIMEOUT = 180
REDIRECT_ENABLED = False
#DOWNLOAD_DELAY =5.0
RANDOMIZE_DOWNLOAD_DELAY = True
The extraction of the data works fine; the reviews I do get have complete and accurate information. It's just that the number of reviews I get is too small.
When I run the spider with the following command:
runspider amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py -o reviews.csv
The output on the console looks like the following:
2019-04-22 11:54:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=164> (referer: None)
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'BRANDI', 'title': 'Bookworms rejoice!', 'text': "The (...) 5 STARS! 🌟🌟🌟🌟🌟", 'date': 'December 7, 2018'}
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'Doug Stender', 'title': 'As good as adverised', 'text': 'I read (...) mazon...', 'date': 'January 8, 2019'}
2019-04-22 11:54:41 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161> (referer: None)
Traceback (most recent call last):
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\OneDrive\Dokumente\Uni\05_SS 19\Masterarbeit\Code\Scrapy\amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py", line 78, in parse
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Turns out that if a review didn't have the "verified" tag, or if nobody had found it helpful, the HTML element Scrapy was looking for isn't there, so no entry gets added to the list. That makes the "verified" and "helpful" lists shorter than the other ones, the index lookup fails, and all items on that page get dropped and aren't added to my CSV file. The simple fix below, which pads the lists until they are as long as the other lists, worked just fine :)
Edit:
When using this fix, it can happen that values are assigned to the wrong review, because the padding is always appended to the end of the list.
If you want to be on the safe side, don't scrape the verified tag, or replace the whole list with "Na" or something else that indicates that the value is unclear (a per-review alternative is sketched after the snippet below).
helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
while len(helpful) != len(date):
    helpful.append("0 people found this helpful")
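As noted in the edit, padding the lists can attach a value to the wrong review. A different hedged sketch, not the answer's code: extract every field relative to one review container node, so a missing "helpful" or "verified" element only affects that single review. The container selector div[data-hook="review"] is an assumption about Amazon's markup; the field selectors are the ones already used in the question.

def parse(self, response):
    # One container node per review instead of parallel page-wide lists.
    for review in response.css('div[data-hook="review"]'):
        yield {
            'ASIN': 'B07CXG6C9W',
            'stars': review.css('.review-rating ::text').extract_first(),
            'title': review.css('.review-title ::text').extract_first(),
            'text': review.css('.review-text ::text').extract_first(),
            'date': review.css('.review-date ::text').extract_first(),
            # extract_first() falls back to the default when the element is absent,
            # so the fields can no longer get out of step between reviews.
            'verified purchase': review.xpath('.//span[@data-hook="avp-badge"]//text()').extract_first(default='Not verified'),
            'helpful': review.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract_first(default='0 people found this helpful'),
        }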

Start multiple spiders sequentially from another spider

I have one spider which creates 100+ spiders with arguments.
Those spiders scrape x items and forward them to a MySQL pipeline.
The MySQL database can handle 10 connections at a time.
For that reason I can only have at most 10 spiders running at the same time.
How can I make this happen?
My unsuccessful approach so far is:
- Add spiders to a list in the first spider like this:
if item.get('location_selectors') is not None and item.get('start_date_selectors') is not None:
    spider = WikiSpider.WikiSpider(template=item.get('category'), view_threshold=0, selectors={
        'location': [item.get('location_selectors')],
        'date_start': [item.get('start_date_selectors')],
        'date_end': [item.get('end_date_selectors')]
    })
    self.spiders.append(spider)
Then in the first spider I listen for the spider_closed signal:
def spider_closed(self, spider):
    for spider in self.spiders:
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider)
But this approach gives me the following error:
connection to the other side was lost in a non-clean fashion
What is the correct way to start multiple spiders sequentially?
Thanks in advance!
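For what it's worth, a minimal sketch of Scrapy's documented pattern for running crawls one after another in a single process, using CrawlerRunner and chained deferreds. WikiSpider and the per-spider keyword arguments mirror the question and are assumptions, not importable names here.

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

import WikiSpider  # assumption: the asker's spider module

configure_logging()
runner = CrawlerRunner(get_project_settings())

# Hypothetical: in practice this list would be built from the first spider's items.
spider_kwargs_list = [
    {'template': 'some-category', 'view_threshold': 0,
     'selectors': {'location': [], 'date_start': [], 'date_end': []}},
]

@defer.inlineCallbacks
def crawl_sequentially():
    # Each yield waits for the previous crawl to finish before the next one starts,
    # so only one spider (and one MySQL pipeline connection) is active at a time.
    for kwargs in spider_kwargs_list:
        yield runner.crawl(WikiSpider.WikiSpider, **kwargs)
    reactor.stop()

crawl_sequentially()
reactor.run()

If strictly one-at-a-time is slower than needed, the same pattern can run batches of up to 10 spiders by yielding a twisted defer.DeferredList of runner.crawl() calls instead of a single deferred.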

How to Configure the Web Connector from metrics.log Values

I am reviewing the ColdFusion web connector settings in workers.properties, hoping to address a sporadic response-time issue.
I've been advised to inspect the output of the metrics.log file (CF Admin > Debugging & Logging > Debug Output Settings > Enable Metric Logging) and use it to inform adjustments to the settings max_reuse_connections, connection_pool_size and connection_pool_timeout.
My question is: how do I interpret the metrics.log output to inform the choice of setting values? Is there any documentation that can guide me?
Examples from a 120-hour period:
95% of entries -
"Information","scheduler-2","06/16/14","08:09:04",,"Max threads: 150 Current thread count: 4 Current thread busy: 0 Max processing time: 83425 Request count: 9072 Error count: 72 Bytes received: 1649 Bytes sent: 22768583 Free memory: 124252584 Total memory: 1055326208 Active Sessions: 1396"
Occurred once -
"Information","scheduler-2","06/13/14","14:20:22",,"Max threads: 150 Current thread count: 10 Current thread busy: 5 Max processing time: 2338 Request count: 21 Error count: 4 Bytes received: 155 Bytes sent: 139798 Free memory: 114920208 Total memory: 1053097984 Active Sessions: 6899"
Environment:
3 x Windows 2008 R2 (hardware load balanced)
ColdFusion 10 (update 12)
Apache 2.2.21
Richard, I realize your question here is from 2014, and perhaps you have since resolved it, but I suspect your problem was that the port set in the CF admin (below the "metrics log" checkbox) was set to 8500, which is your internal web server (used by the CF admin only, typically, if at all). That's why the numbers are not changing. (And for those who don't enable the internal web server at installation of CF, or later, most values in the metrics log are null).
I address this problem in a blog post I happened to do just last week: http://www.carehart.org/blog/client/index.cfm/2016/3/2/cf_metrics_log_part1
Hope any of this helps.

nutch 2.0 fetch page repeatedly when a job failed

I use MySQL as the storage backend with Nutch.
The job failed when crawling some sites. I got the following exception, and Nutch exits when it reaches this page: http://www.appchina.com/users.html
Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:47)
at org.apache.nutch.parse.ParserJob.run(ParserJob.java:249)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
So I modified ./src/java/org/apache/nutch/util/NutchJob.java, changing
if (getConfiguration().getBoolean("fail.on.job.failure", true)) {
to
if (getConfiguration().getBoolean("fail.on.job.failure", false)) {
After recompiling, I no longer get any exception, but the crawl restarts endlessly:
FetcherJob : timelimit set for : -1
FetcherJob: threads: 30
FetcherJob: parsing: false
FetcherJob: resuming: false
Using queue mode : byHost
Fetcher: threads: 30
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.appchina.com/
fetching http://www.appchina.com/users.html
-finishing thread FetcherThread0, activeThreads=29
-finishing thread FetcherThread29, activeThreads=28
...
0/0 spinwaiting/active, 2 pages, 0 errors, 0.4 0.4 pages/s, 137 137 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://www.appchina.com/
Parsing http://www.appchina.com/users.html
UPDATE
Error in hadoop.log:
2012-09-17 18:48:51,257 WARN mapred.LocalJobRunner - job_local_0004
java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.sql.BatchUpdateException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
... 6 more
Caused by: java.sql.SQLException: Incorrect string value: '\xE7\x94\xA8\xE6\x88\xB7...' for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
... 8 more
UPDATE again
I've dropped the table Gora created and created a similar table with a VARCHAR(128) id and utf8mb4 DEFAULT CHARSET. It works now. Why?
Can anyone help?
You need to add the Hadoop logs for the parse job; the attached stack trace does not show that info. After you made that code change, did parsing happen successfully?

Can't crawl with Scrapy at a depth of more than 1

I couldn't configure Scrapy to run at a depth greater than 1. I have tried the three following options; none of them worked, and request_depth_max in the summary log is always 1:
1) Adding:
from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2
to the spider file (the example from the site, just with a different site)
2) Running the command line with the -s option:
/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org
3) Adding to settings.py and scrapy.cfg:
DEPTH_LIMIT=2
How should it be configured to crawl deeper than depth 1?
warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
So let's scrape mininova and see what happens. Starting at the today page we see that there are two tor links:
stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]
Let's scrape the first link, where we see there are no new tor links on that page, just a link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):
>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]
No luck there, we are still at depth 1. Let's try the other link:
>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]
Nope, this page also contains only a single link, a link to itself, which likewise gets filtered. So there are actually no new links to scrape, and Scrapy closes the spider (at depth == 1).
I had a similar issue; it helped to set follow=True when defining the Rule:
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
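Concretely, a sketch of what that looks like; the link extractor pattern and callback name follow the mininova example further down in this thread, not the asker's actual spider:

rules = [
    # With a callback set, follow defaults to False, so links found on the crawled
    # pages are not extracted again; follow=True keeps the spider descending.
    Rule(SgmlLinkExtractor(allow=['/tor/\d+']), callback='parse_torrent', follow=True),
]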
The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
You wrote:
request_depth_max at summary log is always 1
What you see in the logs is a statistic, not the setting. When it says request_depth_max is 1, it means that no further requests were yielded from the first callback.
You have to show your spider code to understand what is going on.
But create another question for it.
UPDATE:
Ah, I see you are running the mininova spider from the Scrapy intro:
class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
As you can see from the code, the spider never issues a request for other pages; it scrapes all the data right from the top-level pages. That's why the maximum depth is 1.
If you make your own spider that follows links to other pages, the maximum depth will be greater than 1.
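For illustration, a hedged sketch, not part of the original answer: a callback that also yields requests for further pages, which is what pushes request_depth_max above 1. The /tor/ pattern is reused purely as a placeholder for whatever links your own spider should follow.

from scrapy.http import Request

def parse_torrent(self, response):
    x = HtmlXPathSelector(response)
    torrent = TorrentItem()
    torrent['url'] = response.url
    torrent['name'] = x.select("//h1/text()").extract()
    yield torrent
    # Requests yielded here are at depth 2, their links at depth 3, and so on.
    for link in SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response):
        yield Request(link.url, callback=self.parse_torrent)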