phantomjs to render a page from a string - phantomjs

I want to render a webpage from a string. I've looked at the docs of phantomjs and they suggested the following:
var webPage = require('webpage');
var page = webPage.create();
var expectedContent = '<html><body><div>Test div</div></body></html>';
var expectedLocation = 'http://www.phantomjs.org/';
page.setContent(expectedContent, expectedLocation);
It's not quite working. Why? (I use the latest version).

I suggest you open a normal page first (about:blank works) and then set page.content = '<html><body><div>Test div</div></body></html>';
then render your page.
Hope that helps.
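The suggestion above can be sketched as a small helper. This is only a sketch under assumptions: `renderFromString` and the output file name are hypothetical names, and the helper expects a page object that has already been created and opened (for example on about:blank). It relies only on the `content` property and the `render` method of a PhantomJS page.

```javascript
// Sketch: put an HTML string into an existing PhantomJS page and render it.
// `renderFromString` is a hypothetical helper name, not part of PhantomJS.
function renderFromString(page, html, outputPath) {
  // Assigning to `content` replaces the document of the current page.
  page.content = html;
  // Pass through whatever page.render reports back to the caller.
  return page.render(outputPath);
}

// Possible usage inside a PhantomJS script (assumed file name 'out.png'):
// var page = require('webpage').create();
// page.open('about:blank', function () {
//   renderFromString(page, '<html><body><div>Test div</div></body></html>', 'out.png');
//   phantom.exit();
// });
```

Keeping the page object as a parameter means the helper can be tried with a stand-in object before running it under PhantomJS.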

Related

Unable to get `src` attribute of `<video>` with HTMLUnit

I am creating a video scraper (for the Rumble website) and I am trying to get the src attribute of the video using HTMLUnit, because the element is added dynamically to the page (I am a beginner with these APIs):
val webClient = WebClient()
webClient.options.isThrowExceptionOnFailingStatusCode = false
webClient.options.isThrowExceptionOnScriptError = false
webClient.options.isJavaScriptEnabled = true
val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
Thread.sleep(10000)
val document: Document = Jsoup.parse(myPage!!.asXml())
println(document)
The issue is, the output for the <video> element is the following:
<video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>
Whereas -- if you navigate to the page itself and let the JS load -- it should be:
<video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg" src="blob:https://rumble.com/91372f42-30cf-46b3-8850-805ee634e2e8"></video>
Some attributes are missing, which are crucial for my scraper to work. I need the src value so that ExoPlayer can play the video.
I am not totally sure, but I was wondering whether it had to do with the fact that the crossOrigin attribute is anonymous in the JavaScript:
<video muted playsinline hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="'+t+'"'+(a.vars.opts.cc?' crossorigin="anonymous"':"")+'>
I tried to play around with the different HTMLUnit options, as well as look online but I still haven't been able to extract the right attributes I need so that it can work.
How would I be able to bypass this and get the appropriate element values (src) that I need for the scraper using HTMLUnit? Is this even possible to do with HTMLUnit? I was also suspecting that maybe the site owners added this cross origin anonymous statement because it can bypass scrapers, though I am not sure.
How to reproduce my issue
Navigate to this link with a GUI browser.
Press 'Inspect Element' until you find the <video> HTML tag and observe that it contains an src attribute as you would expect to the mp4 file:
<video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" src="https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&b=0" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg"></video>
Now, let's simulate this with a headless browser, so add the following code to IntelliJ or any IDE (add a dependency to HTMLUnit and JSoup):
To gradle (Kotlin):
implementation(group = "net.sourceforge.htmlunit", name = "htmlunit", version = "2.64.0")
implementation("org.jsoup:jsoup:1.15.3")
To gradle (Groovy):
implementation group: 'net.sourceforge.htmlunit', name: 'htmlunit', version: '2.64.0'
implementation 'org.jsoup:jsoup:1.15.3'
Then in Main function:
val webClient = WebClient()
webClient.options.isThrowExceptionOnFailingStatusCode = false
webClient.options.isThrowExceptionOnScriptError = false
webClient.options.isJavaScriptEnabled = true
val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
Thread.sleep(10000)
val document: Document = Jsoup.parse(myPage!!.asXml())
println(".....................")
println(document.getElementsByTag("video").first())
If it throws an exception add this:
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.html.HtmlScript").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.WindowProxy").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache").setLevel(Level.OFF);
We are simply fetching the page with the headless browser and then using JSoup to parse the HTML output and finding the first video element.
Observe that the output does not contain any 'src' attribute as you saw in the GUI browser:
<video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>
This is the major issue I am having: the src attribute of the <video> element seemingly disappears in the headless browser, and I am unsure why, although I suspect it's related to some sort of mp4 codec issue.
Correct, the js support for the video element was not sufficient for this case.
Have done a bunch of fixes/improvements and the upcoming version 2.66.0 will be able to support this.
Btw: there is no need to parse the page a second time using jsoup - HtmlUnit has all the methods to deeply look inside the dom tree of the current page.
String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10_000);
HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);
System.out.println(video.getSrc());
}
This code prints https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&b=0 - the same as the source attribute in the browser.
But there are still two JS errors reported when running this code. This is because some other JS (I guess some tracking stuff) provokes these errors. You can fix this by ignoring the JS code for these two locations; this will also make the code a bit faster.
String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
// ignore some js
new WebConnectionWrapper(webClient) {
public WebResponse getResponse(WebRequest request) throws IOException {
WebResponse response = super.getResponse(request);
if (request.getUrl().toExternalForm().contains("sovrn_standalone_beacon.js")
|| request.getUrl().toExternalForm().contains("r2.js")) {
WebResponseData data = new WebResponseData("".getBytes(response.getContentCharset()),
response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
response = new WebResponse(data, request, response.getLoadTime());
}
return response;
}
};
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10_000);
HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);
System.out.println(video.getSrc());
}
Thanks for this report - will inform on https://twitter.com/htmlunit about the new release.

Could not get the link in Selenium using the href attribute

I tried following to get the link on Start Onboarding but could not.
//string webElement = driver.FindElement(By.XPath("/html/body/div/div/div/a")).GetAttribute("href");
var tes = driver.FindElement(By.XPath(@"/html/body/div/div/div/a")).GetAttribute("href");
var web2 = driver.FindElement(By.XPath(@"//a[contains('saa',@href)"));
var web3 = driver.FindElement(By.XPath(@"//a[text()='saa')]/@href"));
var tes2 = driver.FindElement(By.XPath(@"//a[text()='saa')]/@href")).GetAttribute("href");
Start Onboarding
Please suggest me, where I fail and what I can do. I am new to selenium.
Thanks in advance.
A better way is to stay away from absolute XPaths such as /html/body/div/div/div/a, as they are rigid and break as soon as the page structure changes. The below would be better.
var tes2 = driver.FindElement(By.CssSelector("a#idOfElement")).GetAttribute("href");
or
var tes2 = driver.FindElement(By.CssSelector("a.classnameOfElement")).GetAttribute("href");

Alternative to innerHTML for IE

I create HTML documents that include proprietary tags for links that get transformed into standard HTML when they go through our publishing system. For example:
<LinkTag contents="Link Text" answer_id="ID" title="Tooltip"></LinkTag>
When I'm authoring and reviewing these documents, I need to be able to test these links in a browser, before they get published. I wrote the following JavaScript to read the attributes and write them into an <a> tag:
var LinkCount = document.getElementsByTagName('LinkTag').length;
for (i=0; i<LinkCount; i++) {
var LinkText = document.getElementsByTagName('LinkTag')[i].getAttribute('contents');
var articleID = document.getElementsByTagName('LinkTag')[i].getAttribute('answer_id');
var articleTitle = document.getElementsByTagName('LinkTag')[i].getAttribute('title');
document.getElementsByTagName('LinkTag')[i].innerHTML = '' + LinkText + '';
}
This works great in Firefox, but not in IE. I've read about the innerHTML issue with IE and imagine that's the problem here, but I haven't been able to figure a way around it. I thought perhaps jQuery might be the way to go, but I'm not that well versed in it.
This would be a huge productivity boost if I could get this working in IE. Any ideas would be greatly appreciated.
innerHTML only works for things INSIDE of the open/close tags. So, for instance, if your LinkTag[i] is an <a> element, then setting innerHTML = "<a ...> </a>" would put that markup literally between the opening <a> tag and its closing </a>.
You would need to put that in a DIV. Perhaps use your code to draw from links, then place the corresponding HTML code into a div.
var LinkCount = document.getElementsByTagName('LinkTag').length;
for (i=0; i<LinkCount; i++) {
var LinkText = document.getElementsByTagName('LinkTag')[i].getAttribute('contents');
var articleID = document.getElementsByTagName('LinkTag')[i].getAttribute('answer_id');
var articleTitle = document.getElementsByTagName('LinkTag')[i].getAttribute('title');
document.getElementById('MyDisplayDiv').innerHTML = '' + LinkText + '';
}
This should produce your HTML results within a div. You could also simply append the other LinkTag elements to a single DIV to produce a sort of "Preview of Website" within the div.
document.getElementById('MyDisplayDiv').innerHTML += '' + LinkText + '';
Note the +=. Should append your HTML fabrication to the DIV with ID "MyDisplayDiv". Your list of LinkTag elements would be converted into a list of HTML elements within the div.
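Building the HTML string itself can be kept in a small helper so that the attribute values are escaped before they are appended. This is a sketch only: `escapeHtml` and `buildAnchorHtml` are hypothetical names, and the '/answers/' URL prefix is an assumed placeholder because the real link target is not shown in the snippets above.

```javascript
// Escape characters that are special inside HTML text and attribute values.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build one preview anchor from the LinkTag attribute values.
// The '/answers/' prefix is a hypothetical placeholder URL scheme.
function buildAnchorHtml(linkText, articleId, articleTitle) {
  return '<a href="/answers/' + encodeURIComponent(articleId) + '"' +
    ' title="' + escapeHtml(articleTitle) + '">' +
    escapeHtml(linkText) + '</a>';
}
```

Inside the loop, the result could then be appended with something like document.getElementById('MyDisplayDiv').innerHTML += buildAnchorHtml(LinkText, articleID, articleTitle);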
DOM functions might be considered more "clean" (and faster) than simply replacing the innerHTML in cases like this. Something like this may work better for you:
// where el is the element you want to wrap in an <a> link
var newel = document.createElement('a');
// put the new <a> in front of el, then move el inside it
el.parentNode.insertBefore(newel, el);
el = el.parentNode.removeChild(el);
newel.appendChild(el);
newel.setAttribute('href', 'urlhere');
Similar code worked fine for me in Firebug, it should wrap the element with <a> ... </a>.
I wrote a script that I have on my blog, which you can find here: http://blog.funelr.com/?p=61. Take a look at it: it automatically fixes IE innerHTML errors in IE8 and IE9 by hijacking IE's innerHTML property, which is "read-only" for table elements, and replacing it with my own.
I have this xmlObject:
<c:value i:type="b:AccessRights">ReadAccess WriteAccess AppendAccess AppendToAccess CreateAccess DeleteAccess ShareAccess AssignAccess</c:value>
I find value use this: responseXML.getElementsByTagNameNS("http://schemas.datacontract.org/2004/07/System.Collections.Generic",'value')[0].firstChild.data
This one is the best: elm.insertAdjacentHTML('beforeend', str);
In case your element is very small: elm.innerHTML += str;
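The two options above can be combined into one helper that prefers insertAdjacentHTML and falls back to innerHTML += when the method is missing. This is a sketch; `appendHtml` is a hypothetical name, and `elm` is assumed to be a DOM element (or any object exposing the same members).

```javascript
// Append an HTML string to the end of an element's content.
// `appendHtml` is a hypothetical helper name.
function appendHtml(elm, str) {
  if (typeof elm.insertAdjacentHTML === 'function') {
    // Parses str and inserts it after the element's last child.
    elm.insertAdjacentHTML('beforeend', str);
  } else {
    // Fallback: re-serialises and re-parses the whole content.
    elm.innerHTML += str;
  }
}
```

The fallback rebuilds all existing children of the element, so it is best kept for elements with little content.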

html 5 video thumb gallery with jQuery

I have been trying to alter my code from an image thumbnail gallery to a video thumbnail gallery. I have the videos encoded in the various formats for HTML5. It works in Firefox and Safari. In Chrome it works one time for each thumbnail, then doesn't load the video if you click it again. Is there a better way to do this? The html is like <a href = "#" rel ="videos/video1" class = "image"><a href = "#" rel ="videos/video2" class = "image">
$(function() {
$(".image").live('click',function() {
var image = $(this).attr("rel");
var title = $(this).attr("alt");
$('#largevideo').hide();
$('#largevideo').fadeIn(1500);
$('#largevideo').html('<video controls><source src="'+image+'.mp4" type="video/mp4"/><source src="'+image+'.webm" type="video/webm"/><source src="'+image+'.ogv" type="video/ogg"/><embed src=".'+image+'.mp4" type="application/x-shockwave-flash" autoplay="0"allowfullscreen="true" allowScriptAccess="always"></embed></video>');
return false;
});
(location.attr)? $("a [rel="+location.attr+"]").click():$(".thumbs a:first").click();
});

How to verify links

How to verify whether links are present or not?
eg.
I have 10 links in a page, I want to verify the particular link
Is it possible?
I am using selenium with Java.
Can I write this inside the Selenium code?
eg
selenium.click("searchimage-size");
selenium.waitForPopUp("dataitem", "3000");
selenium.selectWindow("name=dataitem");
What you can do is find all links on the page like this:
var anchorTags = driver.FindElements(By.TagName("a"));
and then iterate through the anchorTags collection to make sure you've got what you're looking for.
Or if you have a list of the link texts you can do something like this:
foreach(var link in getMyLinkTextsToTest())
{
var elementToTest = driver.FindElement(By.LinkText(link));
Assert.IsNotNull(elementToTest);
}
This code is all untested and right off the top of my head so you might need to do some slight modification but it should be close to usable.
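The comparison step can also be separated from the browser calls. The sketch below assumes the visible text of every anchor has already been collected into an array (for example by reading the text of each element returned for the "a" tag) and reports which expected link texts are absent. `findMissingLinks` is a hypothetical name and the sample texts are made up.

```javascript
// Return the expected link texts that do not appear among the actual ones.
// Surrounding whitespace is ignored; the comparison is case-sensitive.
function findMissingLinks(expectedTexts, actualTexts) {
  var seen = new Set(actualTexts.map(function (t) { return t.trim(); }));
  return expectedTexts.filter(function (t) { return !seen.has(t.trim()); });
}

// Example with made-up link texts:
var missing = findMissingLinks(['Home', 'About', 'Contact'], [' Home ', 'Contact', 'Blog']);
console.log(missing); // [ 'About' ]
```

An empty result means every expected link is present, which gives a single assertion for the whole page instead of one lookup per link.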
If you are using Selenium 1.x you can use this code.
String xpath = "//<xpath till your anchor tag>a/@href";
String href = selenium.getAttribute(xpath);
String expectedLink = "your link";
assertEquals(expectedLink, href);
I hope this may help you...
List<WebElement> links = driver.findElements(By.tagName("a"));
for(WebElement we : links) {
if("Specific link text".equals(we.getText())) {
we.click();
}
}
I'm collecting all links into the List variable 'links' and iterating over it, checking whether the specific text we are looking for is present in any link. If it is found, the code clicks on that link.
If you're looking to verify the href content of a specific link, you can use JavaScript to return the outerHTML of a specific WebElement, which you can locate however you like; in the example below I use By.cssSelector:
WebElement element = driver.findElement(By.cssSelector("..."));
String sourceContents = (String)((JavascriptExecutor)driver).executeScript("return arguments[0].outerHTML;", element);
assertEquals(sourceContents, "Learn More");
If you want to make it a tad more elegant you can shave the undesired elements off of the string, but this is the general case as of Selenium-java: 2.53.1 / Selenium-api: 2.47.1 as I can observe.
Best approach would be to use getText() method
List<WebElement> allLinks = driver.findElements(By.tagName("a"));
for(WebElement specificlink : allLinks ) {
if(specificlink.getText().equals("link Text")){
//SOPL("Link found");
break;
}
}