How to capture JS redirects in Selenium?

Is there any way to capture all the redirects performed in JS on a page? For instance, let's take a look at this web page, which redirects using window.location:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Redirect JS</title>
</head>
<body>
    <script>
        window.location = "http://www.example.com";
    </script>
</body>
</html>
or a meta tag:
<meta http-equiv="refresh" content="0; url=http://example.com/">
I would like to render the web page and get all URLs the user was redirected through. Is that possible? How can I do that in Selenium?

In Python (see http://selenium-python.readthedocs.org/en/latest/api.html), the webdriver has a current_url property. After you driver.get() the page, I would assume current_url is the redirected URL. Is it not?
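A minimal sketch of that idea, assuming the redirecting page from the question is served at a hypothetical local URL. Note that current_url only exposes the final destination, not any intermediate hops:

from selenium import webdriver

driver = webdriver.Chrome()
# Hypothetical copy of the window.location page from the question
driver.get("http://localhost/redirect.html")

# Once the JS redirect has fired, current_url holds the final URL;
# if the redirect is slow, an explicit wait on the URL may be needed
print(driver.current_url)  # e.g. "http://www.example.com/"

driver.quit()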

Your requirement "in Selenium" will make this impossible: Selenium interacts with a browser as a human would, and a human generally should not know or care about all the redirects. If you are willing to abandon Selenium for this purpose, there are libraries such as HttpBuilder (in the Java world) and many others (for other languages) that allow you to manipulate and watch HTTP traffic, which is what you are after here.
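That said, if staying inside Selenium is a hard requirement, one partial workaround (Chrome-only, and a different technique from the HTTP-library approach above) is to read the browser's DevTools performance log, which records every network request the page made, including each hop of a JS or meta redirect. A hedged sketch, assuming Selenium 4 with chromedriver; the local URL stands in for the redirecting page from the question:

import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Ask chromedriver to capture DevTools network events
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("http://localhost/redirect.html")  # hypothetical test page

# Each document-level navigation (the original page plus every redirect
# target) appears as a Network.requestWillBeSent event, in order
for entry in driver.get_log("performance"):
    event = json.loads(entry["message"])["message"]
    if (event["method"] == "Network.requestWillBeSent"
            and event["params"].get("type") == "Document"):
        print(event["params"]["request"]["url"])

driver.quit()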

Related

Unable to scrape parts of a webpage with scrapy

I'm using scrapy to crawl an e-commerce website. I'm experienced with simpler websites, where scrapy alone or with splash/selenium handles most cases.
I have a new situation I have no experience dealing with. From my investigation it could be like a captcha, but one that never prompts the user.
I've made tests to solve it with scrapy alone, and with scrapy plus selenium, with no success.
With my scrapy request I receive the following response:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <title>Challenge Validation</title>
    <link rel="stylesheet" type="text/css" href="/_sec/cp_challenge/sec-2-9.css">
    <script type="text/javascript">function cp_clge_done(){location.reload(true);}</script>
    <script src="/_sec/cp_challenge/sec-cpt-int-2-9.js" async defer></script>
    <script type="text/javascript">sessionStorage.setItem('data-duration', 5);</script>
</head>
<body>
    <div class="sec-container">
        <div id="sec-text-container"><iframe id="sec-text-if" class="custmsg" src="https://beta.elcorteingles.es/sgfm/statics/eci_non_food/contents/cc/cca.html"></iframe></div>
        <div id="sec-if-container">
            <iframe id="sec-cpt-if" class="crypto" data-key="" data-duration=5 src="/_sec/cp_challenge/ak-challenge-2-9.htm"></iframe>
        </div>
    </div>
</body>
</html>
With the Chrome inspector I also noticed two GET requests (non-JS) that might be related:
check -> returns HTML ( ... <title>RP iframe</title> ...)
check-session?origin=https%3A%2F%2Fwww.elcorteingles.es -> returns HTML (...<title>OP iframe</title>...)
Using scrapy shell with view(response), it looks like a captcha situation that is waiting for something. An example page:
scrapy shell "https://www.elcorteingles.es/supermercado/0110120903000022-coosur-aceite-de-oliva-intenso-1-botella-1-l/"
The title 'Challenge Validation' suggests as much. I have no idea how to handle this case. From research, I've seen solutions involving scrapy middleware, but only for cases where input was requested from the user; I found no example similar to this one. Any guidance on how to proceed is appreciated.
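Not a definitive answer, but one untested avenue suggested by the page itself: its script wires cp_clge_done() to location.reload(true), and data-duration hints at a roughly five-second check, so the interstitial appears designed to reload into the real page once the challenge script finishes in a real browser. A sketch of waiting that out with Selenium; whether the challenge actually clears this way is an assumption:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.elcorteingles.es/supermercado/"
           "0110120903000022-coosur-aceite-de-oliva-intenso-1-botella-1-l/")

# The challenge script reloads the page when its check completes, so wait
# until the interstitial title is replaced by the real product page's title
WebDriverWait(driver, 30).until(lambda d: d.title != "Challenge Validation")

html = driver.page_source  # hand this off to the scrapy parsing code

driver.quit()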

Browser doesn't cache script tag requests upon page reload even if the URL is the same

This might sound like a very basic question, but I couldn't find much help from Google.
I have an HTML file:
<!doctype html>
<html>
<head>
    <title>New Form Title</title>
    <script type='text/javascript' src='http://localhost/whatever.js'></script>
</head>
<body>
</body>
</html>
When I hit F5 (after loading the page for the first time), I can see that the server returned a 304 status, but I was under the assumption that a server request would not even be sent in the first place (i.e. the browser would not send a request because the URL is the same, and would use the cached item).
What am I missing? Is this the actual behaviour?
Thank you.
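What F5 typically triggers is revalidation rather than a plain cache hit: the browser re-sends the request with cache validators, and the 304 response means "your cached copy is still good", so only headers cross the wire. (For ordinary navigation, a Cache-Control: max-age response header lets the browser reuse the copy without asking at all; a refresh deliberately revalidates.) A small sketch of that exchange using Python's requests library, with the localhost URL from the question; which validators are available depends on what the server sends:

import requests

url = "http://localhost/whatever.js"

# First load: a full 200 response carrying cache validators
first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# What the browser does on F5: a conditional request with those validators
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified
second = requests.get(url, headers=headers)

print(second.status_code)  # 304 if the file is unchanged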

Permalinks vs pretty URLs

Let's say I have a simple blog engine. I've posted a simple post with the URL
http://example.org/blog/awesomr-post
A few days later I noticed the typo and fixed my URL
http://example.org/blog/awesome-post
But search engines have already indexed "awesomr-post", and anybody following that link will get a 404 error. The same problem affects bookmarked pages.
So I think the post should be reachable via two links
http://example.org/blog/awesome-post
http://example.org/permalinks/1
Now I have to specify the relationship somehow. What I can do:
http://example.org/permalinks/1
<!DOCTYPE html>
<html>
<head>
    <link rel="canonical" href="http://example.org/blog/awesome-post">
</head>
<body>
page content
</body>
</html>
http://example.org/blog/awesome-post
<!DOCTYPE html>
<html>
<head>
    <link rel="bookmark" href="http://example.org/permalinks/1">
</head>
<body>
page content
</body>
</html>
Is this the right solution? And should I use the canonical or the permalink URL when linking from other pages of the site?
One way is to have a 301 (permanent) redirect from http://example.org/blog/awesomr-post to http://example.org/blog/awesome-post.
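A sketch of what that 301 could look like in a hypothetical Python/Flask blog engine; SLUG_REDIRECTS and render_post are stand-ins for whatever storage and rendering the real engine uses:

from flask import Flask, redirect

app = Flask(__name__)

# Hypothetical map from retired slugs to their corrected replacements
SLUG_REDIRECTS = {"awesomr-post": "awesome-post"}

def render_post(slug):
    # Stand-in for the real template rendering
    return "<h1>%s</h1>" % slug

@app.route("/blog/<slug>")
def blog_post(slug):
    if slug in SLUG_REDIRECTS:
        # 301 tells browsers and crawlers the move is permanent, so search
        # engines transfer the old URL's standing to the new one
        return redirect("/blog/" + SLUG_REDIRECTS[slug], code=301)
    return render_post(slug)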

Sniff and modify URL requests coming from UIWebViewController

I have an HTML page open inside a UIWebViewController with Cordova. While index.html is loading inside the UIWebViewController, can we sniff the requests originating from index.html?
For example, I have the following HTML that gets opened in the UIWebViewController:
<html>
<head>
    <link rel="stylesheet" type="text/css" href="theme.css">
    <script src="app.js"></script>
</head>
<body>
    <img src="img.jpg"/>
</body>
</html>
Can I sniff and modify the URLs that get requested inside the UIWebViewController, i.e. rewrite img.jpg, theme.css, app.js to something like content/img.jpg, css/theme.css, js/app.js, using Objective-C?
Yes, that's possible using NSURLProtocol; see this blog post by NSHipster and this related Stack Overflow thread.

SSL and W3 XHTML Validator

This may be a dumb newbie question, so apologies for that.
My website uses an SSL certificate. I also include the W3C validator link on each of my webpages as follows:
<img src="valid-xhtml1.png" alt="Valid XHTML 1.0 Strict" height="31" width="88" />
(Note: I copied the W3C validator image over to my site so SSL wouldn't complain about insecure resources.)
When I do this and click on the image to validate the page, the validator returns an error about requesting it insecurely. So I tried changing the href of the <a> tag to use https for the validator, but then the page simply doesn't load (I guess because the validator doesn't support SSL).
Does anyone know a way around this? I'm guessing there is no way to use the code as is, but maybe there is a way to change uri=referer to uri=https://mysite.com/...? Is there a way to dynamically grab the URL of the current page?
Also, just for further reference: does SSL simply prevent the Referer request header from being accessed?
Oh, and I know the validator works if I visit my website over http instead of https, but I'd rather get it working with https too.
As for the "validate icon" question:
This would usually lead to displaying a messages about "unsecure items" (=mixed http+https content)... the validate icon is not officially supported in such constellation... a partial workaround is described here.
IF you want to grab the uri dynamically I suspect you will have to use JavaScript for that and then create/add the <a> in the DOM...
As for the SSL/Referer question:
The standard says that a client (=browser) should send referer only if the destination is secure - so yes, in mixed cases the referer won't get sent to the non-secure URL.
OK, so it doesn't look like there is a way to do this with just HTML, so I decided to use JavaScript to handle the issue.
I removed the <a> tag from around the W3C logo and added an onclick JavaScript function, validatePage(). So here is basically a template for an XHTML Strict page that still lets you include the validation icon.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <title>Title of document</title>
    <script type="text/javascript">
        function validatePage() {
            var validatorUrl = "http://validator.w3.org/check?uri=http" + (document.URL).substring(5);
            window.open(validatorUrl);
        }
    </script>
</head>
<body>
    <h1>Test Template Page</h1>
    <p><img src="valid-xhtml1.png" alt="Valid XHTML 1.0 Strict" height="31" width="88" onclick="validatePage()" /></p>
</body>
</html>
Notice how the validatorUrl variable trims the leading "https" off the page URL and uses "http" instead, so I simply circumvented relying on the HTTP Referer header.
Hope this helps someone else.