Amazon reviews: List index out of range - scrapy

I would like to scrape the customer reviews of the Kindle Paperwhite from Amazon.
I am aware that although Amazon may say there are 5,900 reviews, only about 5,000 of them are accessible (after page=500 no more reviews are displayed, at 10 reviews per page).
For the first few pages my spider returns 10 reviews per page, but later this shrinks to just one or two, which leaves me with only about 1,300 reviews.
There seems to be a problem with adding the data for the variables "helpful" and "verified". Both throw the following error:
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Any help would be greatly appreciated!
I tried adding if statements in case the variables were empty or contained a list, but that didn't work.
My Spider amazon_reviews.py:
import scrapy
from scrapy.extensions.throttle import AutoThrottle

class AmazonReviewsSpider(scrapy.Spider):
    name = 'amazon_reviews'
    allowed_domains = ['amazon.com']
    myBaseUrl = "https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber="
    start_urls = []

    # Creating the list of urls to be scraped by appending the page number to the end of the base url
    for i in range(1, 550):
        start_urls.append(myBaseUrl + str(i))

    def parse(self, response):
        data = response.css('#cm_cr-review_list')

        # Collecting various data
        star_rating = data.css('.review-rating')
        title = data.css('.review-title')
        text = data.css('.review-text')
        date = data.css('.review-date')
        # Number of people who found the review helpful.
        helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
        verified = response.xpath('.//span[@data-hook="avp-badge"]//text()').extract()
        # I scrape more information, but deleted it here to keep the code short

        # Yielding the scraped results
        count = 0
        for review in star_rating:
            yield {'ASIN': 'B07CXG6C9W',
                   #'ID': ''.join(id.xpath('.//text()').extract()),
                   'stars': ''.join(review.xpath('.//text()').extract_first()),
                   'title': ''.join(title[count].xpath(".//text()").extract_first()),
                   'text': ''.join(text[count].xpath(".//text()").extract_first()),
                   'date': ''.join(date[count].xpath(".//text()").extract_first()),
                   ### There seems to be a problem with adding these two, as I get 5000 reviews back if I delete them. ###
                   'verified purchase': ''.join(verified[count]),
                   'helpful': ''.join(helpful[count])
                   }
            count = count + 1
My settings.py :
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_TIMEOUT = 180
REDIRECT_ENABLED = False
#DOWNLOAD_DELAY =5.0
RANDOMIZE_DOWNLOAD_DELAY = True
The extraction itself works fine. The reviews I do get contain complete and accurate information; I just get too few of them.
When I run the spider with the following command:
scrapy runspider amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py -o reviews.csv
The output on the console looks like the following:
2019-04-22 11:54:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=164> (referer: None)
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'BRANDI', 'title': 'Bookworms rejoice!', 'text': "The (...) 5 STARS! 🌟🌟🌟🌟🌟", 'date': 'December 7, 2018'}
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'Doug Stender', 'title': 'As good as adverised', 'text': 'I read (...) mazon...', 'date': 'January 8, 2019'}
2019-04-22 11:54:41 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161> (referer: None)
Traceback (most recent call last):
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\OneDrive\Dokumente\Uni\05_SS 19\Masterarbeit\Code\Scrapy\amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py", line 78, in parse
'helpful': ''.join(helpful[count]),
IndexError: list index out of range

It turns out that if a review didn't have the "verified" tag, or if no one had found it helpful, the HTML element Scrapy was looking for simply isn't there, so nothing gets added to the corresponding list. That makes the "verified" and "helpful" lists shorter than the other ones. Because of the resulting IndexError, all items from that page got dropped and weren't added to my CSV file. The simple fix below, which pads the shorter list until it is as long as the other lists, worked just fine :)
Edit:
When using this fix it can happen that values are assigned to the wrong review, because the padding is always added to the end of the list.
If you want to be on the safe side, don't scrape the verified tag at all, or replace the whole list with "NA" or something else that indicates that the value is unclear.
helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
while len(helpful) != len(date):
    helpful.append("0 people found this helpful")
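A more robust alternative is to loop over each review block and extract every field relative to that block, so a missing "helpful" or "verified" element simply comes back as None for that one review instead of shifting the indices of a page-wide list. This is only a sketch: the div[data-hook="review"] container selector is an assumption about Amazon's markup and is not taken from the question.
    def parse(self, response):
        # Scope every field to a single review container so missing elements
        # cannot shift the indices of the other fields.
        # NOTE: 'div[data-hook="review"]' is an assumed selector, not from the question.
        for review in response.css('div[data-hook="review"]'):
            helpful = review.xpath('.//span[@data-hook="helpful-vote-statement"]/text()').extract_first()
            verified = review.xpath('.//span[@data-hook="avp-badge"]/text()').extract_first()
            yield {
                'ASIN': 'B07CXG6C9W',
                'stars': review.css('.review-rating ::text').extract_first(),
                'title': review.css('.review-title ::text').extract_first(),
                'text': review.css('.review-text ::text').extract_first(),
                'date': review.css('.review-date ::text').extract_first(),
                # Missing elements become None instead of raising IndexError.
                'verified purchase': verified or 'NA',
                'helpful': helpful or '0 people found this helpful',
            }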

Related

Download a page doesn't return a status code

I have found a page I need to download that doesn't include an HTTP status code in the returned headers. I get the error ParseError: ('non-integer status code', b'Tag: "14cc1-5a76434e32f9e"'), which is obviously accurate, but otherwise the returned data is complete.
I'm just trying to save the page content manually in a callback, afilehandle.write(response.body) sort of thing. It's a PDF. Is there a way I can bypass this and still get the contents of the page?
Here is the returned response, which also crashed Fiddler. The first thing in the header is Tag:
Tag: "14cc1-5a76434e32f9
e"..Accept-Ranges: bytes
..Content-Length: 85185.
.Keep-Alive: timeout=15,
max=100..Connection: Ke
ep-Alive..Content-Type:
application/pdf....%PDF-
1.4.%ร“รดรŒรก.1 0 obj.<<./Cr
eationDate(D:20200606000
828-06'00')./Creator(PDF
sharp 1.50.4740 \(www.pd
fsharp.com\))./Producer(
PDFsharp 1.50.4740 \(www
.pdfsharp.com\)).>>.endo
bj.2 0 obj.<<./Type/Cata
log./Pages 3 0 R.>>.endo
bj.3 0 obj.<<./Type/Page
s./Count 2./Kids[4 0 R 8
0 R].>>.endobj.4 0 obj.
<<./Type/Page./MediaBox[
0 0 612 792]./Parent 3 0
R./Contents 5 0 R./Reso
urces.<<./ProcSet [/PDF/
Text/Ima.... etc
Note: for anyone not familiar with the PDF file structure, %PDF-1.4 and everything after it is the correct format for a PDF document. Chrome downloads the PDF just fine even with the bad headers.
In the end, I modified the file twisted/web/_newclient.py directly so that it does not throw the error and instead uses a made-up status code that I could identify later:
def statusReceived(self, status):
    parts = status.split(b' ', 2)
    if len(parts) == 2:
        version, codeBytes = parts
        phrase = b""
    elif len(parts) == 3:
        version, codeBytes, phrase = parts
    else:
        raise ParseError(u"wrong number of parts", status)
    try:
        statusCode = int(codeBytes)
    except ValueError:
        # Changes were made here
        version = b'HTTP/1.1'  # just assume it is what it should be
        statusCode = 5200      # deal with invalid status codes later
        phrase = b'non-integer status code'  # sure, pass on the error message
        # and commented out the line below.
        # raise ParseError(u"non-integer status code", status)
    self.response = Response._construct(
        self.parseVersion(version),
        statusCode,
        phrase,
        self.headers,
        self.transport,
        self.request,
    )
And I set the spider to accept that status code.
class MySpider(Spider):
    handle_httpstatus_list = [5200]
However, in the end I discovered the target site behaved correctly when accessed via https, so I ended up rolling back all the above changes.
Note that the above hack would work only until you update the library, at which point you would need to reapply it. But it could possibly get the job done if you are desperate.
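For completeness, the callback described in the question, which just writes response.body to disk, can look roughly like the sketch below. The spider name, URL, and output filename are placeholders rather than values from the question:
import scrapy

class PdfSpider(scrapy.Spider):
    name = 'pdf_download'  # placeholder name
    # Accept the fake status code injected by the patched Twisted client.
    handle_httpstatus_list = [5200]
    start_urls = ['http://example.com/some-document.pdf']  # placeholder URL

    def parse(self, response):
        # response.body holds the raw bytes; for a PDF that is the whole file.
        with open('downloaded.pdf', 'wb') as afilehandle:
            afilehandle.write(response.body)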

Start multiple spiders sequentially from another spider

I have one spider which creates more than 100 spiders with arguments.
Those spiders scrape x items and forward them to a MySQL pipeline.
The MySQL database can handle 10 connections at a time.
For that reason I can only have a maximum of 10 spiders running at the same time.
How can I make this happen?
My current, unsuccessful approach is to add the spiders to a list in the first spider, like this:
if item.get('location_selectors') is not None and item.get('start_date_selectors') is not None:
    spider = WikiSpider.WikiSpider(template=item.get('category'), view_threshold=0, selectors={
        'location': [item.get('location_selectors')],
        'date_start': [item.get('start_date_selectors')],
        'date_end': [item.get('end_date_selectors')]
    })
    self.spiders.append(spider)
Then in the first spider I listen for the spider_closed signal:
def spider_closed(self, spider):
    for spider in self.spiders:
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider)
But this approach gives me the following error:
connection to the other side was lost in a non-clean fashion
What is the correct way to start multiple spiders sequentially?
Thanks in advance!
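A pattern commonly recommended in the Scrapy documentation for running multiple spiders in the same process is to chain the crawls with CrawlerRunner and Twisted deferreds, so each crawl starts only after the previous one finishes. This is normally run as a standalone script rather than from inside a running spider. The sketch below passes the spider class plus keyword arguments instead of pre-built spider instances, since CrawlerRunner.crawl takes a spider class (or crawler) rather than an instance; spider_arguments stands in for the list of argument dicts built in the first spider, and the WikiSpider import path is assumed:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

import WikiSpider  # assumed: the module containing the WikiSpider class from the question

configure_logging()
runner = CrawlerRunner(get_project_settings())

# Placeholder: fill with the template/selector kwargs collected by the first spider.
spider_arguments = []

@defer.inlineCallbacks
def crawl_sequentially():
    for kwargs in spider_arguments:
        # Each yield waits until the previous crawl has finished before starting the next.
        yield runner.crawl(WikiSpider.WikiSpider, **kwargs)
    reactor.stop()

crawl_sequentially()
reactor.run()
A batched variant could split spider_arguments into chunks of 10 and wait on a DeferredList per chunk, which would respect the 10-connection limit while still allowing some parallelism.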

Trippinin api Keyerror

I need some help getting started with this Trippinin API. If you have worked with this API, it would be very nice of you to help me get started! I don't understand what I should write in for day in data[....]:
import requests
import json

r = requests.get("http://api.v1.trippinin.com/City/London/Eat?day=monday&time=morning&limit=10& offset=2&KEY=58ffb98334528b72937ce3390c0de2b7")
data = r.json()

for day in data['city Name']:
    print(day['city Name']['weekday'] + ":")
The error:
Traceback (most recent call last):
File "C:\Users\Nux\Desktop\Kurs3\test.py", line 7, in <module>
for day in data['city Name']:
KeyError: 'city Name'
The error KeyError: 'X' means you are trying to access the key X in a dictionary, but it doesn't exist. In your case you're trying to access data['city Name']. Apparently, the information in data does not have the key city Name. That means either a) you aren't getting any data back, or b) the data isn't in the format you expected. In both cases you can validate (or invalidate) your assumptions by printing out the value of data.
To help debug this issue, add the following immediately after you assign a value to data:
print(data)
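A slightly fuller debugging sketch: check the HTTP status and the top-level structure before indexing into the response. The parameters simply mirror the original URL; the real key names are unknown, hence the generic inspection:
import requests

r = requests.get("http://api.v1.trippinin.com/City/London/Eat", params={
    "day": "monday", "time": "morning", "limit": 10, "offset": 2,
    "KEY": "58ffb98334528b72937ce3390c0de2b7",
})
print(r.status_code)          # make sure the request succeeded at all
data = r.json()
print(type(data))             # dict or list?
if isinstance(data, dict):
    print(list(data.keys()))  # the real top-level keys to iterate over
else:
    print(data[:2])           # first couple of items if it's a list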

Cant crawl scrapy with depth more than 1

I couldn't configure Scrapy to crawl with a depth greater than 1. I have tried the three options below; none of them worked, and request_depth_max in the summary log is always 1:
1) Adding:
from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2
to the spider file (the example from the site, just with a different target site)
2) Running the command line with -s option:
/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org
3) Adding to settings.py and scrapy.cfg:
DEPTH_LIMIT=2
How should it be configured to crawl deeper than 1?
warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
So let's scrape mininova and see what happens. Starting at the today page, we see that there are two tor links:
stav#maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]
Let's scrape the first link, where we see there are no new tor links on that page, just the link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):
>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]
No luck there, we are still at depth 1. Let's try the other link:
>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]
Nope, this page only contains one link as well, a link to itself, which also gets filtered. So there are actually no new links to scrape, and Scrapy closes the spider (at depth==1).
I had a similar issue; it helped to set follow=True when defining the Rule:
follow is a boolean which specifies if links should be followed from
each response extracted with this rule. If callback is None, follow
defaults to True; otherwise it defaults to False.
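For example, a rule along the lines of the sketch below keeps extracting and following matching links from every crawled page instead of stopping after the first level. The imports match the old scrapy.contrib layout used elsewhere in this thread, and the spider name and callback body are illustrative:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FollowingSpider(CrawlSpider):
    name = 'following_example'  # illustrative, not from the question
    start_urls = ['http://www.mininova.org/today']

    rules = [
        # follow=True tells the CrawlSpider to keep extracting links
        # from the pages reached through this rule as well.
        Rule(SgmlLinkExtractor(allow=['/tor/\d+']),
             callback='parse_torrent',
             follow=True),
    ]

    def parse_torrent(self, response):
        # Placeholder callback; the real extraction would go here.
        pass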
The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
You wrote:
request_depth_max at summary log is always 1
What you see in the logs is a statistic, not a setting. When it says request_depth_max is 1, it means that no further requests were yielded from the first callback.
You would have to show your spider code to understand what is going on, but please create another question for that.
UPDATE:
Ah, I see you are running the mininova spider from the Scrapy intro:
class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
As you can see from the code, the spider never issues requests for other pages; it scrapes all the data directly from the top-level pages. That's why the maximum depth is 1.
If you write your own spider that follows links to other pages, the maximum depth will be greater than 1.

Automating the detection of 500s and 404s?

Like most portals out there, our portal makes calls to several services to display the requested information.
My question: is there a way to automate the capture of any 500s or 404s that any of these (GET) calls return? Using Selenium?
I personally wouldn't use Selenium for testing in this manner. I would do the testing in a more programmatic way.
In Python I would do it like this:
import urllib2

try:
    urllib2.urlopen('http://my-site')
except urllib2.HTTPError, e:
    print e.getcode()  # prints whether it is a 404 or 500
Starting up a browser is a very expensive task in terms of load time and other overhead along that vein.
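On Python 3, a roughly equivalent sketch with the requests library checks each service endpoint and reports any 404 or 500 it gets back; the URLs below are placeholders for the portal's actual service calls:
import requests

# Placeholder endpoints; replace with the portal's actual (GET) service calls.
SERVICE_URLS = [
    'http://my-site/api/service-a',
    'http://my-site/api/service-b',
]

for url in SERVICE_URLS:
    response = requests.get(url)
    if response.status_code in (404, 500):
        print(url, 'returned', response.status_code)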
I actually found a 'graceful' way of doing this using Selenium.
Starting the server instance as -
selenium.start("captureNetworkTraffic=true");
and then using
String trafficOutput = selenium.captureNetworkTraffic("json"); // "xml" or "plain"
inside the @Test method gives all the HTTP traffic stats.
The advantage in this approach is that network stats can be captured, while navigating through a portal.
Here is a sample (formatted) output from www.google.com:
--------------------------------
results for http://www.google.com/
content size: 149841 kb
http requests: 14
status 204: 1
status 200: 8
status 403: 1
status 301: 2
status 302: 2
file extensions: (count, size)
png: 2, 60.019000
js: 2, 67.443000
ico: 2, 2.394000
xml: 4, 11.254000
unknown: 4, 8.731000
http timing detail: (status, method, url, size(bytes), time(ms))
301, GET, http://google.com/, 219, 840
200, GET, http://www.google.com/, 8358, 586
403, GET, http://localhost:4444/favicon.ico, 1244, 2
200, GET, http://www.google.com/images/logos/ps_logo2.png, 26209, 573
200, GET, http://www.google.com/images/nav_logo29.png, 33810, 1155
200, GET, http://www.google.com/extern_js/f/CgJlbhICdXMrMEU4ACwrMFo4ACwrMA44ACwrMBc4ACwrMCc4ACwrMDw4ACwrMFE4ACwrMFk4ACwrMAo4AEAvmgICcHMsKzAWOAAsKzAZOAAsKzAlOM-IASwrMCo4ACwrMCs4ACwrMDU4ACwrMEA4ACwrMEE4ACwrME04ACwrME44ACwrMFM4ACwrMFQ4ACwrMF84ACwrMGM4ACwrMGk4ACwrMB04AZoCAnBzLCswXDgALCswGDgALCswJjgALIACKpACLA/rw4kSbs2oIQ.js, 61717, 1413
200, GET, https://sb-ssl.google.com:443/safebrowsing/newkey?pver=2%2E2&client=navclient%2Dauto%2Dffox&appver=3%2E6%2E13, 154, 1055
200, GET, http://www.google.com/extern_chrome/8ce0e008a607e93d.js, 5726, 159
204, GET, http://www.google.com/csi?v=3&s=webhp&action=&e=27944,28010,28186,28272&ei=a6M5TfqRHYPceYybsJYK&expi=27944,28010,28186,28272&imc=2&imn=2&imp=2&rt=xjsls.6,prt.54,xjses.1532,xjsee.1579,xjs.1581,ol.1702,iml.1200, 0, 230
200, GET, http://www.google.com/favicon.ico, 1150, 236
302, GET, http://fxfeeds.mozilla.com/en-US/firefox/headlines.xml, 232, 1465
302, GET, http://fxfeeds.mozilla.com/firefox/headlines.xml, 256, 317
301, GET, http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml, 256, 1357
200, GET, http://feeds.bbci.co.uk/news/rss.xml?edition=int, 10510, 221
I am, though, interested in knowing whether anyone has validated the results captured by Selenium in this way.