Like most portals out there, our portal makes calls to several services to display the requested information.
My question: is there a way to automate the capture of any 500s or 404s that any of these (GET) calls return, using Selenium?
I personally wouldn't use Selenium for testing in this manner. I would do the testing in a more programmatic way.
In Python, I would do it like this:
import urllib2

try:
    urllib2.urlopen('http://my-site')
except urllib2.HTTPError, e:
    print e.getcode()  # prints the status code, e.g. 404 or 500
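If you are on Python 3, a roughly equivalent sketch would be (the URLs below are placeholders for whatever GET calls your portal makes):
from urllib.error import HTTPError
from urllib.request import urlopen

# Placeholder URLs - substitute the service endpoints your portal calls.
for url in ['http://my-site/service-a', 'http://my-site/service-b']:
    try:
        urlopen(url)
    except HTTPError as e:
        print(url, e.code)  # e.code is the HTTP status, e.g. 404 or 500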
Starting up a browser is a very expensive task in terms of load time and other overhead along those lines.
I actually found a 'graceful' way of doing this using Selenium.
Starting the server instance with
selenium.start("captureNetworkTraffic=true");
and then calling
String trafficOutput = selenium.captureNetworkTraffic("json"); // "xml" or "plain"
in the @Test presents all the HTTP traffic stats.
The advantage of this approach is that network stats can be captured while navigating through a portal.
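Once the JSON dump is saved somewhere, scanning it for error responses is straightforward. This is only a sketch: the entry layout (statusCode, method, url) is an assumption, so inspect the actual output of captureNetworkTraffic("json") for your Selenium version and adjust the keys.
import json

# Hypothetical file written from the trafficOutput string in the test above.
with open('traffic.json') as f:
    entries = json.load(f)

# Keep only the 4xx/5xx responses; the key names are assumptions (see above).
errors = [e for e in entries if int(e.get('statusCode', 0)) >= 400]
for e in errors:
    print(e.get('statusCode'), e.get('method'), e.get('url'))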
Here is a sample (formatted) output from www.google.com:
--------------------------------
results for http://www.google.com/
content size: 149841 kb
http requests: 14
status 204: 1
status 200: 8
status 403: 1
status 301: 2
status 302: 2
file extensions: (count, size)
png: 2, 60.019000
js: 2, 67.443000
ico: 2, 2.394000
xml: 4, 11.254000
unknown: 4, 8.731000
http timing detail: (status, method, url, size(bytes), time(ms))
301, GET, http://google.com/, 219, 840
200, GET, http://www.google.com/, 8358, 586
403, GET, http://localhost:4444/favicon.ico, 1244, 2
200, GET, http://www.google.com/images/logos/ps_logo2.png, 26209, 573
200, GET, http://www.google.com/images/nav_logo29.png, 33810, 1155
200, GET, http://www.google.com/extern_js/f/CgJlbhICdXMrMEU4ACwrMFo4ACwrMA44ACwrMBc4ACwrMCc4ACwrMDw4ACwrMFE4ACwrMFk4ACwrMAo4AEAvmgICcHMsKzAWOAAsKzAZOAAsKzAlOM-IASwrMCo4ACwrMCs4ACwrMDU4ACwrMEA4ACwrMEE4ACwrME04ACwrME44ACwrMFM4ACwrMFQ4ACwrMF84ACwrMGM4ACwrMGk4ACwrMB04AZoCAnBzLCswXDgALCswGDgALCswJjgALIACKpACLA/rw4kSbs2oIQ.js, 61717, 1413
200, GET, https://sb-ssl.google.com:443/safebrowsing/newkey?pver=2%2E2&client=navclient%2Dauto%2Dffox&appver=3%2E6%2E13, 154, 1055
200, GET, http://www.google.com/extern_chrome/8ce0e008a607e93d.js, 5726, 159
204, GET, http://www.google.com/csi?v=3&s=webhp&action=&e=27944,28010,28186,28272&ei=a6M5TfqRHYPceYybsJYK&expi=27944,28010,28186,28272&imc=2&imn=2&imp=2&rt=xjsls.6,prt.54,xjses.1532,xjsee.1579,xjs.1581,ol.1702,iml.1200, 0, 230
200, GET, http://www.google.com/favicon.ico, 1150, 236
302, GET, http://fxfeeds.mozilla.com/en-US/firefox/headlines.xml, 232, 1465
302, GET, http://fxfeeds.mozilla.com/firefox/headlines.xml, 256, 317
301, GET, http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml, 256, 1357
200, GET, http://feeds.bbci.co.uk/news/rss.xml?edition=int, 10510, 221
I am, though, interested in knowing whether anyone has verified the accuracy of the results captured by Selenium.
I have several projects created in the web interface, and each has several batches that have already ended. The max assignments per HIT is 3. I need to add 2 more assignments for each HIT. Is that possible?
I've tried using the API on an in-progress batch:
mturk.create_additional_assignments_for_hit(HITId=HIT_ID,NumberOfAdditionalAssignments=2)
and the response is:
{'ResponseMetadata': {'RequestId': '.....some id ...',
'HTTPStatusCode': 200,
'HTTPHeaders': {'x-amzn-requestid': '.....some id ...',
'content-type': 'application/x-amz-json-1.1',
'content-length': '2',
'date': 'Thu, 20 Jan 2022 12:20:02 GMT'},
'RetryAttempts': 0}}
But I can't see any update in the web interface for the 2 extra assignments.
Found the answer.
First, the Requester UI (RUI) and the API are not fully connected, such that not all changes in the API will be visible in the RUI.
Here is my answer using the Python API (boto3).
To "revive" old HITs and add new assignments to them, first create the client:
import datetime

import boto3

mturk = boto3.client('mturk',
                     aws_access_key_id='xxxxxxxxxxxx',
                     aws_secret_access_key='xxxxxxxxxxxxx',
                     region_name='us-east-1',
                     endpoint_url='https://mturk-requester.us-east-1.amazonaws.com')
Extend the expiration date of the HIT (even if it's already passed):
mturk.update_expiration_for_hit(HITId=HIT_ID_STRING,ExpireAt=datetime.datetime(2022, 1, 23, 20, 00, 00))
Then increase the max assignments; here we add 2 more:
mturk.create_additional_assignments_for_hit(HITId=HIT_ID_STRING,NumberOfAdditionalAssignments=2)
That's it. You can see that NumberOfAssignmentsAvailable increased by 2 and that MaxAssignments increased as well:
mturk.get_hit(HITId=HIT_ID_STRING)
'MaxAssignments':5,
'NumberOfAssignmentsPending': 0,
'NumberOfAssignmentsAvailable': 2,
'NumberOfAssignmentsCompleted': 3
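Since the question involves several batches, the same two calls can be wrapped in a loop over all the HITs. This is just a sketch; the credentials and the hit_ids list are placeholders:
import datetime

import boto3

mturk = boto3.client('mturk',
                     aws_access_key_id='xxxxxxxxxxxx',
                     aws_secret_access_key='xxxxxxxxxxxxx',
                     region_name='us-east-1',
                     endpoint_url='https://mturk-requester.us-east-1.amazonaws.com')

hit_ids = ['HIT_ID_1', 'HIT_ID_2']  # hypothetical list of ended HITs

for hit_id in hit_ids:
    # Re-open the HIT by pushing its expiration into the future...
    mturk.update_expiration_for_hit(
        HITId=hit_id,
        ExpireAt=datetime.datetime(2022, 1, 23, 20, 0, 0))
    # ...then raise MaxAssignments by 2.
    mturk.create_additional_assignments_for_hit(
        HITId=hit_id,
        NumberOfAdditionalAssignments=2)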
From various other StackOverflow posts I understand I can do a Zeppelin API call to run and get the output from a paragraph using the URL:
https://[zeppelin url]:[port]/api/notebook/run/[note ID]/[paragraph ID]
but this gives me:
HTTP ERROR 405
Problem accessing /api/notebook/run/2GG52SU6/2025492809-066545_207456631. Reason:
Method Not Allowed
Is there a way of fixing this? Other API calls work fine and the paragraph runs fine in the Zeppelin Web UI (it just does a simple Impala query). I just want to get the output via a REST API so I can call it from an Angular paragraph and manipulate the results before display.
Thanks!
The documentation of the run-paragraph API states that it is a POST request; if you send a GET request, it will fail with 405 Method Not Allowed.
curl -X POST http://localhost:8000/zeppelin/api/notebook/run/2GUEWJDQ4/paragraph_1642773079113_366171993|jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 298 100 298 0 0 2712 0 --:--:-- --:--:-- --:--:-- 2733
{
"status": "OK",
"body": {
"code": "SUCCESS",
"msg": [
{
"type": "TEXT",
"data": "common.cmd\ncommon.sh\nfunctions.cmd\nfunctions.sh\ninstall-interpreter.sh\ninterpreter.cmd\ninterpreter.sh\nstop-interpreter.sh\nupgrade-note.sh\nzeppelin-daemon.sh\nzeppelin-systemd-service.sh\nzeppelin.cmd\nzeppelin.sh\n"
}
]
}
}
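The same POST can be made from Python with requests; the host, note ID, and paragraph ID below are the ones from the curl example and are placeholders for your own:
import requests

base = "http://localhost:8000/zeppelin"
note_id = "2GUEWJDQ4"
paragraph_id = "paragraph_1642773079113_366171993"

resp = requests.post(f"{base}/api/notebook/run/{note_id}/{paragraph_id}")
resp.raise_for_status()

# The paragraph output is under body -> msg in the JSON response.
for msg in resp.json()["body"]["msg"]:
    print(msg["type"], msg["data"])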
I would like to scrape the customer reviews of the Kindle Paperwhite from Amazon.
I am aware that although Amazon might say there are 5900 reviews, it is only possible to access 5000 of them (after page=500, no more reviews are displayed, at 10 reviews per page).
For the first few pages my spider returns 10 reviews per page, but later this shrinks to just one or two, which results in only about 1300 reviews.
There seems to be a problem with adding the data of the variables "helpful" and "verified". Both throw the following error:
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Any help would be greatly appreciated!
I tried implementing if statements in case the variables were empty or contained a list, but it didn't work.
My Spider amazon_reviews.py:
import scrapy
from scrapy.extensions.throttle import AutoThrottle


class AmazonReviewsSpider(scrapy.Spider):
    name = 'amazon_reviews'
    allowed_domains = ['amazon.com']
    myBaseUrl = "https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber="
    start_urls = []

    # Creating the list of urls to be scraped by appending the page number to the end of the base url
    for i in range(1, 550):
        start_urls.append(myBaseUrl + str(i))

    def parse(self, response):
        data = response.css('#cm_cr-review_list')

        # Collecting various data
        star_rating = data.css('.review-rating')
        title = data.css('.review-title')
        text = data.css('.review-text')
        date = data.css('.review-date')

        # Number of people who thought the review was helpful.
        helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
        verified = response.xpath('.//span[@data-hook="avp-badge"]//text()').extract()

        # I scrape more information, but deleted it here so as not to make the code too big

        # yielding the scraped results
        count = 0
        for review in star_rating:
            yield {'ASIN': 'B07CXG6C9W',
                   #'ID': ''.join(id.xpath('.//text()').extract()),
                   'stars': ''.join(review.xpath('.//text()').extract_first()),
                   'title': ''.join(title[count].xpath(".//text()").extract_first()),
                   'text': ''.join(text[count].xpath(".//text()").extract_first()),
                   'date': ''.join(date[count].xpath(".//text()").extract_first()),
                   ### There seems to be a problem with adding these two, as I get 5000 reviews back if I delete them. ###
                   'verified purchase': ''.join(verified[count]),
                   'helpful': ''.join(helpful[count])
                   }
            count = count + 1
My settings.py :
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_TIMEOUT = 180
REDIRECT_ENABLED = False
#DOWNLOAD_DELAY =5.0
RANDOMIZE_DOWNLOAD_DELAY = True
The extraction of the data works fine, and the reviews I do get have complete and accurate information; it is just that the number of reviews is too small.
When I run the spider with the following command:
scrapy runspider amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py -o reviews.csv
The ouput on the console looks like the following:
2019-04-22 11:54:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=164> (referer: None)
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'BRANDI', 'title': 'Bookworms rejoice!', 'text': "The (...) 5 STARS! 🌟🌟🌟🌟🌟", 'date': 'December 7, 2018'}
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'Doug Stender', 'title': 'As good as adverised', 'text': 'I read (...) mazon...', 'date': 'January 8, 2019'}
2019-04-22 11:54:41 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161> (referer: None)
Traceback (most recent call last):
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\OneDrive\Dokumente\Uni\05_SS 19\Masterarbeit\Code\Scrapy\amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py", line 78, in parse
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
It turns out that if a review didn't have the "verified" tag, or if no one had marked it as helpful, the HTML element Scrapy was looking for isn't there, so nothing gets added to the list, which makes the "verified" and "helpful" lists shorter than the other ones. Because of this error, all items in the list got dropped and weren't added to my CSV file. The simple fix below, which checks that the lists are as long as the other lists and pads them if they are not, worked just fine :)
Edit:
When using this fix, it can happen that values get assigned to the wrong review, because the placeholder is always appended to the end of the list.
If you want to be on the safe side, don't scrape the verified tag, or replace the whole list with "Na" or something else that indicates that the value is unclear.
helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
while len(helpful) != len(date):
    helpful.append("0 people found this helpful")
I've had my S3 bucket logging to another bucket using the Server Access Log Format for a while. For the operation REST.GET.OBJECT, sometimes an HTTP Status of 206 Partial Content is returned because the whole file wasn't downloaded. But I can see in the logs that sometimes when HTTP Status 206 is returned, the whole file was downloaded. I've removed some fields to make it simpler:
Operation: REST.GET.OBJECT
Request-URI: "GET [File] HTTP/1.1"
HTTP Status: 206
Error Code: -
Bytes Sent: 76431360
Object Size: 76431360
Total Time: 16276
Turn-Around Time: 190
What happened here? If the Bytes Sent are the same as the Object Size, then how can this be reported as Partial Content?
The 206 status has nothing to do with an incomplete file transfer. The server determines which status code to send before it starts sending the response body, so it would have to predict the future to know whether it will be able to send the whole file.
Instead, what the 206 status code actually means is that the following three things happened at once:
the client sent a Range header in its request;
the server decided to honour it and send exactly the bytes requested, not the whole file;
the server was actually able to do so, i.e. the range was valid and satisfiable.
In this case, the standard requires the server to reply with the 206 status code, not 200, regardless of whether the range happens to cover exactly the whole file or only a part of it.
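For illustration, a GET with a Range header that happens to cover the entire object still comes back as 206; a quick sketch in Python (the URL is a placeholder):
import requests

# A Range request that covers the entire object. Even though every byte is
# returned, a server that honours the header replies with 206, not 200.
url = "https://example-bucket.s3.amazonaws.com/some-object"  # placeholder
resp = requests.get(url, headers={"Range": "bytes=0-"})

print(resp.status_code)                   # 206
print(resp.headers.get("Content-Range"))  # e.g. "bytes 0-76431359/76431360"
print(len(resp.content))                  # equals the full object size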
We noticed we got an error from the SoftLayer API when trying to get categories from product package 200 (hourly bare metal server), preset Id=64, starting 10/18.
The following API query
https://<apiuser>:<apikey>@api.softlayer.com/rest/v3/SoftLayer_Product_Package/200/getActivePresets.json?objectMask=mask[id,packageId,description,name,keyName,isActive,categories.id,categories.name,categories.categoryCode]
now returns preset Ids 103, 97, 93, 95, 99, 101, 105, 151, 147, 149, 143, 157.
It used to return the following additional active preset Ids before 10/17/2016: 64, 66, 68, 70, 74, 76, 78.
I don't see these changes in the SoftLayer release notes:
https://softlayer.github.io/release_notes/
Why are the previously active preset Ids 64, 66, 68, 70, 74, 76, 78 no longer available? Will they be added back?
Thanks.
You are right; these presets are no longer available since 10/17/2016 because the DCs are no longer building those configurations and have moved to the Haswell and Broadwell configurations.
For Haswell:
Presets: 93, 95, 97, 99, 101, 103, 105
For Broadwell:
Presets: 147, 149, 151, 153, 157.
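To check which presets are currently active, the same REST call from the question can be reproduced in Python; this is only a sketch, with placeholder credentials and a trimmed object mask:
import requests

API_USER = "apiuser"   # placeholder
API_KEY = "apikey"     # placeholder

url = "https://api.softlayer.com/rest/v3/SoftLayer_Product_Package/200/getActivePresets.json"
object_mask = "mask[id,name,keyName,isActive]"

resp = requests.get(url, params={"objectMask": object_mask}, auth=(API_USER, API_KEY))
resp.raise_for_status()

# Print the preset ids and key names to compare against the old list.
for preset in resp.json():
    print(preset["id"], preset.get("keyName"))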