I am creating a video scraper for the Rumble website, and I am trying to get the src attribute of the video using HtmlUnit, because the element is added to the page dynamically (I am a beginner with these APIs):
val webClient = WebClient()
webClient.options.isThrowExceptionOnFailingStatusCode = false
webClient.options.isThrowExceptionOnScriptError = false
webClient.options.isJavaScriptEnabled = true
val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
Thread.sleep(10000)
val document: Document = Jsoup.parse(myPage!!.asXml())
println(document)
The issue is that the output for the <video> element is the following:
<video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>
Whereas, if you navigate to the page itself and let the JS load, it should be:
<video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg" src="blob:https://rumble.com/91372f42-30cf-46b3-8850-805ee634e2e8"></video>
Some attributes are missing, which are crucial for my scraper to work. I need the src value so that ExoPlayer can play the video.
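For reference, this is roughly how the extracted src would be handed to ExoPlayer (a sketch assuming a standard Android setup; context and the URL value are placeholders):
import android.content.Context
import com.google.android.exoplayer2.ExoPlayer
import com.google.android.exoplayer2.MediaItem

// Sketch: feed the scraped src URL to ExoPlayer (context is a placeholder)
fun playVideo(context: Context, srcUrl: String) {
    val player = ExoPlayer.Builder(context).build()
    player.setMediaItem(MediaItem.fromUri(srcUrl))
    player.prepare()
    player.play()
}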
I am not totally sure, but I was wondering whether it had to do with the fact that the crossOrigin attribute is anonymous in the JavaScript:
<video muted playsinline hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="'+t+'"'+(a.vars.opts.cc?' crossorigin="anonymous"':"")+'>
I tried playing around with the different HtmlUnit options, as well as looking online, but I still haven't been able to extract the attributes I need.
How would I be able to get the appropriate element values (src) that I need for the scraper using HtmlUnit? Is this even possible with HtmlUnit? I also suspected that the site owners added this crossorigin="anonymous" statement to deter scrapers, though I am not sure.
How to reproduce my issue
Navigate to this link with a GUI browser.
Use 'Inspect Element' to find the <video> HTML tag and observe that it contains a src attribute pointing to the mp4 file, as you would expect:
<video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" src="https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&b=0" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg"></video>
Now, let's simulate this with a headless browser. Add the following code in IntelliJ or any IDE (with dependencies on HtmlUnit and jsoup):
To gradle (Kotlin):
implementation(group = "net.sourceforge.htmlunit", name = "htmlunit", version = "2.64.0")
implementation("org.jsoup:jsoup:1.15.3")
To gradle (Groovy):
implementation group: 'net.sourceforge.htmlunit', name: 'htmlunit', version: '2.64.0'
implementation 'org.jsoup:jsoup:1.15.3'
Then in the main function:
val webClient = WebClient()
webClient.options.isThrowExceptionOnFailingStatusCode = false
webClient.options.isThrowExceptionOnScriptError = false
webClient.options.isJavaScriptEnabled = true
val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
Thread.sleep(10000)
val document: Document = Jsoup.parse(myPage!!.asXml())
println(".....................")
println(document.getElementsByTag("video").first())
If the console gets flooded with warnings and script errors, silence the loggers by adding this:
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.html.HtmlScript").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.WindowProxy").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache").setLevel(Level.OFF);
We are simply fetching the page with the headless browser, parsing the HTML output with jsoup, and selecting the first video element.
Observe that the output does not contain the src attribute you saw in the GUI browser:
<video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>
This is the major issue I am having: the src attribute of the <video> element seemingly disappears in the headless browser, and I am unsure why, although I suspect it's related to some sort of mp4 codec issue.
Correct, the JS support for the video element was not sufficient for this case.
I have done a bunch of fixes/improvements, and the upcoming version 2.66.0 will be able to support this.
Btw: there is no need to parse the page a second time using jsoup - HtmlUnit has all the methods to look deeply inside the DOM tree of the current page.
String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10_000);
HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);
System.out.println(video.getSrc());
}
This code prints https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&b=0 - the same as the src attribute in the browser.
But there are still two JS errors reported when running this code. This is because some other JS (I guess some tracking stuff) provokes these errors. You can fix this by ignoring the JS from these two locations, which will also make the code a bit faster.
String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
// ignore some js
new WebConnectionWrapper(webClient) {
public WebResponse getResponse(WebRequest request) throws IOException {
WebResponse response = super.getResponse(request);
if (request.getUrl().toExternalForm().contains("sovrn_standalone_beacon.js")
|| request.getUrl().toExternalForm().contains("r2.js")) {
WebResponseData data = new WebResponseData("".getBytes(response.getContentCharset()),
response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
response = new WebResponse(data, request, response.getLoadTime());
}
return response;
}
};
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10_000);
HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);
System.out.println(video.getSrc());
}
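For the OP's Kotlin setup, a rough equivalent of the first snippet might look like this (a sketch, assuming a version with the video fixes, i.e. 2.66.0 or later):
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.html.HtmlPage
import com.gargoylesoftware.htmlunit.html.HtmlVideo

fun main() {
    val url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html"
    // use { } closes the WebClient, like try-with-resources in Java
    WebClient(BrowserVersion.FIREFOX).use { webClient ->
        webClient.options.isThrowExceptionOnScriptError = false
        val page = webClient.getPage<HtmlPage>(url)
        webClient.waitForBackgroundJavaScript(10_000)
        val video = page.getElementsByTagName("video")[0] as HtmlVideo
        println(video.src)
    }
}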
Thanks for this report - will inform on https://twitter.com/htmlunit about the new release.
Related
Every day, I need to check my visa application status on the USCIS website (https://egov.uscis.gov/casestatus/landing.do). Since doing it manually gets cumbersome, I created an automation in UIPath to run every few hours and email me if the status changed. However, it still needs to open the browser, navigate to the page, read the result, etc.
Is there a better way of going about this?
I tried finding if USCIS has any API that I could programmatically call, but there doesn't seem to be any. I looked at the page and found that the text box for the receipt number has the following HTML:
<input id="receipt_number" name="appReceiptNum" type="text" class="form-control textbox" maxlength="13"/>
So, from Postman, I tried firing a GET request:
GET https://egov.uscis.gov/casestatus/landing.do?receipt_number=XXXXXXXX
where XXXXXXXX would be my actual application number. But this didn't work and it just returned the main page. I tried switching it to a POST, but that didn't work either and returned the same result. On further inspection, I realized that the actual result page has a different URL, so I tried GET and POST both, on the result URL:
GET https://egov.uscis.gov/casestatus/mycasestatus.do?receipt_number=XXXXXXXX
This gets me a page telling me that there were validation errors and the receipt number was not recognized.
I went back to the manual process to see if I was missing anything. The result page URL has the format
https://egov.uscis.gov/casestatus/mycasestatus.do?JSESSIONID=ZZZZZZZZZ
where ZZZZZZZZZ is the value of JSESSIONID cookie set during the landing page. So I changed my process to:
Send a GET request to the landing page (https://egov.uscis.gov/casestatus/landing.do)
Copy the value of JSESSIONID cookie from the response and set that as a query parameter in the request to the result page (https://egov.uscis.gov/casestatus/mycasestatus.do), while sending receipt_number as the payload in a POST request
This isn't working either. My end goal was to write a Python or Java code (since those are the two I am familiar with) to get me the result, but I guess if I can't get my manual requests working from Postman, getting it to work from code is a pipe dream.
You don't need the session id; just change the param name in your Postman request to appReceiptNum and it will work: https://egov.uscis.gov/casestatus/mycasestatus.do?appReceiptNum=LINXXXXXXXXXX
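For example, the same request can be fired from Kotlin with Java 11's built-in HttpClient (a sketch; the receipt number is a placeholder):
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val receiptNumber = "LINXXXXXXXXXX" // placeholder; substitute your real receipt number
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://egov.uscis.gov/casestatus/mycasestatus.do?appReceiptNum=$receiptNumber"))
        .GET()
        .build()
    // the response body is the result page HTML containing the case status
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
}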
@Alok
What you require is termed a "headless browser / scraper".
I just created a quick sample (but in Node.js):
const puppeteer = require('puppeteer');
// formatYmd (date formatting) and removeTags (strips HTML tags)
// are small helpers defined in the linked repo.

(async () => {
  const url = 'https://egov.uscis.gov/casestatus/landing.do';
  const today = formatYmd(new Date());
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  console.log('going to URL');
  await page.goto(url);
  // fill in the receipt number from the environment and submit the form
  await page.$eval('#receipt_number', (el, receipt) => el.value = `${receipt}`, process.env.RECEIPT_NUMBER);
  await page.click('input[type="submit"]');
  console.log('waiting for submission to be completed.');
  await page.waitForSelector('div.current-status-sec').catch(t => console.log('Not able to load status screen'));
  const status = removeTags(await page.$eval('.current-status-sec', el => el.innerText));
  console.log(`${today}: ${status}`);
  await page.screenshot({ path: `./screenshot/${today}_screenshot.png` });
  await browser.close();
})();
You can find the full repo here:
https://github.com/Parthashah/uscis-status-check
The code saves a screenshot and returns the status.
I just created a simple USCIS web crawler/scraper Spring Boot app using HtmlUnit.
The only caveat is that my crawler ignores invalid case numbers, but it is working. The GitHub link is here: https://github.com/somych1/USCISCaseStatusWebScraper
public ResponseDTO getStatus(String caseId){
ResponseDTO responseDTO = new ResponseDTO();
//browser setup
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(false);
webClient.getOptions().setTimeout(8000);
webClient.getOptions().setDownloadImages(false);
webClient.getOptions().setGeolocationEnabled(false);
webClient.getOptions().setAppletEnabled(false);
try{
// load the page as an HtmlPage (url is a field holding the USCIS case status page address)
HtmlPage page = webClient.getPage(url);
// case lookup
HtmlInput input = page.getHtmlElementById("receipt_number");
input.setValueAttribute(caseId);
HtmlInput button = page.getElementByName("initCaseSearch");
HtmlPage pageAfterClick = button.click();
// new page after click
HtmlHeading1 h1 = pageAfterClick.getFirstByXPath("//div/h1");
HtmlParagraph paragraph = pageAfterClick.getFirstByXPath("//div/p");
// populate the response object from the scraped heading and paragraph
responseDTO.setCaseId(caseId);
responseDTO.setStatus(h1.asNormalizedText());
responseDTO.setDescription(paragraph.asNormalizedText());
} catch (IOException ex) {
ex.printStackTrace();
}
return responseDTO;
}
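The ResponseDTO type isn't shown in the snippet; a minimal stand-in (sketched here in Kotlin, with field names inferred from the setters used above) could be:
// Minimal stand-in for ResponseDTO; field names inferred from the setters above.
data class ResponseDTO(
    var caseId: String? = null,
    var status: String? = null,
    var description: String? = null
)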
I have tried various approaches to embedding a PDF blob in HTML in IE in order to display it.
1) Creating an object URL and passing it to the embed or iframe tag. This works fine in Chrome but not in IE.
</head>
<body>
<input type="file" onchange="previewFile()">
<iframe id="test_iframe" style="width:100%;height:500px;"></iframe>
<script>
function previewFile() {
var file = document.querySelector('input[type=file]').files[0];
var downloadUrl = URL.createObjectURL(file);
console.log(downloadUrl);
var element = document.getElementById('test_iframe');
element.setAttribute('src',downloadUrl);
}
</script>
</body>
2) I have also tried wrapping the blob URL inside encodeURIComponent().
Any pointers on how I can approach solving this?
IE doesn't support an iframe with a data URL as its src attribute. You can check it on caniuse: it shows that in IE the support is limited to images and linked resources like CSS or JS. Please also check this documentation:
Data URIs are supported only for the following elements and/or attributes.
object (images only)
img
input type=image
link
CSS declarations that accept a URL, such as background, backgroundImage, and so on.
Besides, IE doesn't have a PDF viewer embedded, so you can't display PDFs directly in IE 11. You can only use msSaveOrOpenBlob to handle blobs in IE, which lets the user choose to open or save the PDF file:
if(window.navigator.msSaveOrOpenBlob) {
//IE11
window.navigator.msSaveOrOpenBlob(blobData, fileName);
}
else{
//Other browsers
window.URL.createObjectURL(blobData);
...
}
I am trying to scrape this website as part of my lessons in learning Kotlin and web scraping with jsoup.
What I am trying to scrape is the Jackpot $1,000,000 est. values.
The code below is something I wrote after searching and checking out a couple of tutorials online, but it won't even give me $1,000,000 (which is what it was trying to scrape).
Jsoup.connect("https://online.singaporepools.com/lottery/en/home")
.get()
.run {
select("div.slab__text slab__text--highlight").forEachIndexed { i, element ->
val titleAnchor = element.select("div")
val title = titleAnchor.text()
println("$i. $title")
}
}
My first thought is that this website might be using JavaScript to render these values, and that's why my attempt was not successful.
How should I go about scraping it?
I was able to scrape what you were looking for from this page on that same site.
Even if it's not what you want, the procedure may help someone in the future.
Here is how I did that:
First I opened that page
Then I opened the Chrome developer tools by pressing CTRL+
SHIFT+i or
by right-clicking somewhere on page and selecting Inspect or
by clicking ⋮ ➜ More tools ➜ Developer tools
Next I selected the Network tab
And finally I refreshed the page with F5 or with the refresh button ⟳
A list of requests starts to appear (the network log), and after, say, a few seconds, all requests will have completed. Here, we want to look for and inspect a request with a Type of xhr. We can filter requests by clicking the filter icon and then selecting the desired type.
To inspect a request, click on its name (first column from left):
Clicking on one of the XHR requests, and then selecting the Response tab shows that the response contains exactly what we are looking for. And it is HTML, so jsoup can parse it:
Here is that response (if you want to copy or manipulate it):
<div style='vertical-align:top;'>
<div>
<div style='float:left; width:120px; font-weight:bold;'>
Next Jackpot
</div>
<span style='color:#EC243D; font-weight:bold'>$8,000,000 est</span>
</div>
<div>
<div style='float:left; width:120px; font-weight:bold;'>
Next Draw
</div>
<div class='toto-draw-date'>Mon, 15 Nov 2021 , 9.30pm</div>
</div>
</div>
By selecting the Headers tab (to the left of the Response tab), we see that the Request URL is https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m, the Request Method is GET, and again the Content-Type is text/html.
So, with the URL and the HTTP method we found, here is the code to scrape that HTML:
val document = Jsoup
.connect("https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m")
.userAgent("Mozilla")
.get()
val targetElement = document
.body()
.children()
.single()
val phrase = targetElement.child(0).text()
val prize = targetElement.select("span").text().removeSuffix(" est")
println(phrase) // Next Jackpot $8,000,000 est
println(prize) // $8,000,000
Here is another solution for parsing a dynamic page with Selenium and jsoup.
We first get and store the page with Selenium and then parse it with jsoup.
Just make sure to download the browser driver and place its executable where your script can find it.
I downloaded the Chrome driver version 95 and placed it alongside my Kotlin .kts script.
System.setProperty("webdriver.chrome.driver", "chromedriver.exe")
val result = File("output.html")
// OR FirefoxDriver(); download its driver and set the appropriate system property above
val driver = ChromeDriver()
driver.get ("https://www.singaporepools.com.sg/en/product/sr/Pages/toto_results.aspx")
result.writeText(driver.pageSource)
driver.close()
val document = Jsoup.parse(result, "UTF-8")
val targetElement = document
.body()
.children()
.select(":containsOwn(Next Jackpot)")
.single()
.parent()!!
val phrase = targetElement.text()
val prize = targetElement.select("span").text().removeSuffix(" est")
println(phrase) // Next Jackpot $8,000,000 est
println(prize) // $8,000,000
Another version of the code for getting the target element:
val targetElement = document
.body()
.selectFirst(":containsOwn(Next Jackpot)")
?.parent()!!
I only used the following dependencies:
org.seleniumhq.selenium:selenium-java:4.0.0
org.jsoup:jsoup:1.14.3
See the standalone script file. It can be executed with the Kotlin runner from the command line like this:
kotlin my-script.main.kts
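As an aside, in a .main.kts script those dependencies can also be declared inline at the top of the file, so no build script is needed (a sketch; versions as listed above):
// Inline dependency declarations understood by the Kotlin scripting runner.
@file:DependsOn("org.seleniumhq.selenium:selenium-java:4.0.0")
@file:DependsOn("org.jsoup:jsoup:1.14.3")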
I'm using the awesome GoSquared API to get the number of current visitors on my site.
I have built a jQuery script that automatically updates the number every two seconds with jQuery .get, but this doesn't seem to work in IE and Firefox.
JSFiddle
Thanks :)
In Firefox data is a string for some reason. You can specify the data type of the response explicitly:
$.get('url', function(){}, "json");
Otherwise you can turn it into an object like this:
if (typeof data === "string"){
data = JSON.parse(data);
}
For some reason that I'm sure the folks at DivX think is important, there is no straightforward way to prevent their plugin from replacing all video elements on your page with their fancy logo.
What I need is a workaround for this: telling the plugin to skip some videos, i.e. not replace them with its playable content.
I got around this by putting in an empty HTML5 video tag, then adding the video source tags in a JavaScript function in the body onload event. The video then comes up in the normal HTML5 player and not the DivX web player.
e.g.
This would give the DivX player:
<video width="320" height="240" controls="controls">
<source src="movie.mp4" type="video/mp4" />
</video>
But this would give the normal HTML5 player:
<head>
<script type="text/javascript">
function changevid() {
document.getElementById('vid').innerHTML = '<source src="inc/videos/sample1.mp4" type="video/mp4" />';
document.getElementById('vid').load();
}
</script>
</head>
<body onload="changevid()">
<video id="vid" width="800" height="450" controls="controls">
</video>
</body>
At this time, there is no API or means to block the DivX plugin from replacing video elements with its placeholder. :-(
I started reverse-engineering the DivX plugin to find out what can be done to hack a way into disabling it. An example, including the complete source code of the DivX plugin, can be found here: http://jsfiddle.net/z4JPB/1/
It currently appears to me that a possible solution could work like this:
create a "clean" backup of the methods appendChild, replaceChild and insertBefore - this has to happen before the content-script from the chrome-extension is executed.
the content-script will execute, overrides the methods mentioned above and adds event-listeners to the DOMNodeInsertedIntoDocument and DOMNodeInserted events
after that, the event-listeners can be removed and the original DOM-Methods restored. You should now be able to replace the embed-elements created by the plugin with the video-elements
It seems that the plugin only replaces the video when there are source elements within the video tag. For me it worked by first adding the video tag and then, in a deferred step, adding the source tags. However, this doesn't work in IE, but IE had no problem with inserting the complete video tag at once.
So the following code worked for me in all browsers (jQuery required, of course):
var $container = $('#video_container'); // assumes an element with id "video_container"
var video = 'my-movie';
var videoSrc = '<source src="video/'+video+'.mp4" type="video/mp4"></source>' +
'<source src="video/'+video+'.webm" type="video/webm"></source>' +
'<source src="video/'+video+'.ogv" type="video/ogg"></source>';
// note: $.browser was removed in jQuery 1.9, so this check needs jQuery < 1.9
if(!$.browser.msie) {
$container.html('<video autoplay loop></video>');
// this timeout avoids divx player to be triggered
setTimeout(function() {
$container.find('video').html(videoSrc);
}, 50);
}
else {
// IE has no problem with divx player, so we add the src in the same thread
$container.html('<video autoplay loop>' + videoSrc + '</video>');
}