Web scraping with jsoup in Kotlin

I am trying to scrape this website as part of learning Kotlin and web scraping with jsoup.
What I am trying to scrape is the Jackpot $1,000,000 est. value.
The code below is what I wrote after searching and checking out a couple of tutorials online, but it doesn't even give me the $1,000,000 (which is what this code is trying to scrape).
Jsoup.connect("https://online.singaporepools.com/lottery/en/home")
    .get()
    .run {
        select("div.slab__text slab__text--highlight").forEachIndexed { i, element ->
            val titleAnchor = element.select("div")
            val title = titleAnchor.text()
            println("$i. $title")
        }
    }
My first thought is that maybe the website renders these values with JavaScript, which is why the scrape was not successful.
How should I go about scraping it?

I was able to scrape what you were looking for from this page on that same site.
Even if it's not what you want, the procedure may help someone in the future.
Here is how I did that:
First I opened that page
Then I opened the Chrome developer tools by pressing CTRL+SHIFT+I, or by right-clicking somewhere on the page and selecting Inspect, or by clicking ⋮ ➜ More tools ➜ Developer tools
Next I selected the Network tab
And finally I refreshed the page with F5 or with the refresh button ⟳
A list of requests starts to appear (the network log), and after a few seconds all of them will have completed. Here, we want to find and inspect a request whose Type is xhr. We can filter requests by clicking the filter icon and then selecting the desired type.
To inspect a request, click on its name (first column from left):
Clicking on one of the XHR requests, and then selecting the Response tab shows that the response contains exactly what we are looking for. And it is HTML, so jsoup can parse it:
Here is that response (if you want to copy or manipulate it):
<div style='vertical-align:top;'>
<div>
<div style='float:left; width:120px; font-weight:bold;'>
Next Jackpot
</div>
<span style='color:#EC243D; font-weight:bold'>$8,000,000 est</span>
</div>
<div>
<div style='float:left; width:120px; font-weight:bold;'>
Next Draw
</div>
<div class='toto-draw-date'>Mon, 15 Nov 2021 , 9.30pm</div>
</div>
</div>
By selecting the Headers tab (to the left of the Response tab), we see that the Request URL is https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m, the Request Method is GET, and, again, the Content-Type is text/html.
So, with the URL and the HTTP method we found, here is the code to scrape that HTML:
val document = Jsoup
    .connect("https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m")
    .userAgent("Mozilla")
    .get()

val targetElement = document
    .body()
    .children()
    .single()

val phrase = targetElement.child(0).text()
val prize = targetElement.select("span").text().removeSuffix(" est")

println(phrase) // Next Jackpot $8,000,000 est
println(prize)  // $8,000,000
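If you also need the draw date, it is in the same response, inside the div with class toto-draw-date shown earlier (a small sketch building on the snippet above; the class name is taken from that response HTML):
// The draw date sits in the same response, in the div with class "toto-draw-date".
val drawDate = targetElement.select(".toto-draw-date").text()
println(drawDate) // Mon, 15 Nov 2021 , 9.30pm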

Here is another solution for parsing a dynamic page with Selenium and jsoup.
We first get and store the page with Selenium and then parse it with jsoup.
Just make sure to download the browser driver and point the corresponding system property at its executable (or put the executable on your PATH).
I downloaded Chrome driver version 95 and placed it alongside my Kotlin .kts script.
System.setProperty("webdriver.chrome.driver", "chromedriver.exe")
val result = File("output.html")
// OR FirefoxDriver(); download its driver and set the appropriate system property above
val driver = ChromeDriver()
driver.get ("https://www.singaporepools.com.sg/en/product/sr/Pages/toto_results.aspx")
result.writeText(driver.pageSource)
driver.close()
val document = Jsoup.parse(result, "UTF-8")
val targetElement = document
.body()
.children()
.select(":containsOwn(Next Jackpot)")
.single()
.parent()!!
val phrase = targetElement.text()
val prize = targetElement.select("span").text().removeSuffix(" est")
println(phrase) // Next Jackpot $8,000,000 est
println(prize) // $8,000,000
Another version of code for getting the target element:
val targetElement = document
.body()
.selectFirst(":containsOwn(Next Jackpot)")
?.parent()!!
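As a side note (a small variation, not part of the original script), the intermediate output.html file is not strictly necessary; the page source held by the driver can be parsed directly, as long as this happens before driver.close():
// Parse the page source from memory instead of round-tripping through output.html.
val documentInMemory = Jsoup.parse(driver.pageSource)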
I only used the following dependencies:
org.seleniumhq.selenium:selenium-java:4.0.0
org.jsoup:jsoup:1.14.3
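For reference, in a self-contained .main.kts script those dependencies can be declared at the top of the file with @file:DependsOn annotations (a minimal sketch using the coordinates listed above):
@file:DependsOn("org.seleniumhq.selenium:selenium-java:4.0.0")
@file:DependsOn("org.jsoup:jsoup:1.14.3")

import org.jsoup.Jsoup
import org.openqa.selenium.chrome.ChromeDriver
import java.io.File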
See the standalone script file. It can be executed with the Kotlin runner from the command line like this:
kotlin my-script.main.kts

Related

Webdriver Selenium not loading new page after click()

I'm using Selenium to scrape a webpage and it finds the elements on the main page, but when I use the click() function, the driver never finds the elements on the new page. I used BeautifulSoup to see if it's getting the HTML, but the HTML is always from the main page. (When I look at the driver window, it shows that the page is open.)
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify())
I've used WebDriverWait() to see if the page is just not loading, but even after 60 seconds it never does:
element = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.ID, "ddlProducto")))
I also used execute_script() to check whether clicking the button via JavaScript loads the page, but it returns None when I print the variable that should hold the element from the new page.
selectProducto = driver.execute_script("return document.getElementById('ddlProducto');")
print(selectProducto)
I also used chwd = driver.window_handles and driver.switch_to.window(chwd[1]), but it says that the index is out of range.
chwd = driver.window_handles
driver.switch_to.window(chwd[1])

Unable to get `src` attribute of `<video>` with HTMLUnit

I am creating a video scraper (for the Rumble website) and I am trying to get the src attribute of the video using HtmlUnit, because the element is added dynamically to the page (I am a beginner with these APIs):
val webClient = WebClient()
webClient.options.isThrowExceptionOnFailingStatusCode = false
webClient.options.isThrowExceptionOnScriptError = false
webClient.options.isJavaScriptEnabled = true
val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
Thread.sleep(10000)
val document: Document = Jsoup.parse(myPage!!.asXml())
println(document)
The issue is, the output for the <video> element is the following:
<video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>
Whereas -- if you navigate to the page itself and let the JS load -- it should be:
<video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg" src="blob:https://rumble.com/91372f42-30cf-46b3-8850-805ee634e2e8"></video>
Some attributes are missing, which are crucial for my scraper to work. I need the src value so that ExoPlayer can play the video.
I am not totally sure, but I was wondering whether it had to do with the fact that the crossOrigin attribute is anonymous in the JavaScript:
<video muted playsinline hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="'+t+'"'+(a.vars.opts.cc?' crossorigin="anonymous"':"")+'>
I tried playing around with the different HtmlUnit options and looked online, but I still haven't been able to extract the attributes I need.
How would I be able to get the appropriate element value (src) that I need for the scraper using HtmlUnit? Is this even possible with HtmlUnit? I also suspect that the site owners added this crossorigin="anonymous" attribute to defeat scrapers, though I am not sure.
How to reproduce my issue
Navigate to this link with a GUI browser.
Press 'Inspect Element' until you find the <video> HTML tag and observe that it contains a src attribute pointing to the mp4 file, as you would expect:
<video muted="" playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata" src="https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&b=0" poster="https://sp.rmbl.ws/s8/1/I/6/v/1/I6v1f.OvCc-small-Our-First-Automatic-AFK-Far.jpg"></video>
Now, let's simulate this with a headless browser. Add the following code in IntelliJ or any IDE (with dependencies on HtmlUnit and jsoup):
Gradle (Kotlin DSL):
implementation(group = "net.sourceforge.htmlunit", name = "htmlunit", version = "2.64.0")
implementation("org.jsoup:jsoup:1.15.3")
Gradle (Groovy DSL):
implementation group: 'net.sourceforge.htmlunit', name: 'htmlunit', version: '2.64.0'
implementation 'org.jsoup:jsoup:1.15.3'
Then, in the main function:
val webClient = WebClient()
webClient.options.isThrowExceptionOnFailingStatusCode = false
webClient.options.isThrowExceptionOnScriptError = false
webClient.options.isJavaScriptEnabled = true
val myPage: HtmlPage? = webClient.getPage("https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html")
Thread.sleep(10000)
val document: Document = Jsoup.parse(myPage!!.asXml())
println(".....................")
println(document.getElementsByTag("video").first())
If it throws an exception add this:
LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.html.HtmlScript").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit.javascript.host.WindowProxy").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache").setLevel(Level.OFF);
We are simply fetching the page with the headless browser and then using jsoup to parse the HTML output and find the first video element.
Observe that the output does not contain any 'src' attribute as you saw in the GUI browser:
<video muted playsinline="" hidefocus="hidefocus" style="width:100% !important;height:100% !important;display:block" preload="metadata"></video>
This is the major issue I am having: the src attribute of the <video> element seemingly disappears in the headless browser, and I am unsure why, although I suspect it's related to some sort of mp4 codec issue.
Correct, the JS support for the video element was not sufficient for this case.
I have done a bunch of fixes/improvements, and the upcoming version 2.66.0 will be able to support this.
By the way, there is no need to parse the page a second time using jsoup: HtmlUnit has all the methods needed to look deeply inside the DOM tree of the current page.
String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10_000);
HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);
System.out.println(video.getSrc());
}
This code prints https://sp.rmbl.ws/s8/2/I/6/v/1/I6v1f.caa.rec.mp4?u=3&b=0 - the same as the src attribute in the browser.
But there are still two JS errors reported when running this code, because some other JS (I guess some tracking stuff) provokes these errors. You can fix this by ignoring the JS code from those two locations, which will also make the code a bit faster.
String url = "https://rumble.com/v1m9oki-our-first-automatic-afk-farms-locals-minecraft-server-smp-ep3-live-stream.html";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
// ignore some js
new WebConnectionWrapper(webClient) {
public WebResponse getResponse(WebRequest request) throws IOException {
WebResponse response = super.getResponse(request);
if (request.getUrl().toExternalForm().contains("sovrn_standalone_beacon.js")
|| request.getUrl().toExternalForm().contains("r2.js")) {
WebResponseData data = new WebResponseData("".getBytes(response.getContentCharset()),
response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
response = new WebResponse(data, request, response.getLoadTime());
}
return response;
}
};
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10_000);
HtmlVideo video = (HtmlVideo) page.getElementsByTagName("video").get(0);
System.out.println(video.getSrc());
}
Thanks for this report; I will announce the new release on https://twitter.com/htmlunit.

How do I programmatically check my application status on USCIS?

Every day, I need to check my visa application status on the USCIS website (https://egov.uscis.gov/casestatus/landing.do). Since manually doing it gets cumbersome, I created automation in UIPath to run every few hours and email me if the status changed. However, it still needs to open the browser, navigate to the page, read the result, etc.
Is there a better way of going about this?
I tried finding if USCIS has any API that I could programmatically call, but there doesn't seem to be any. I looked at the page and found that the text box for the receipt number has the following HTML:
<input id="receipt_number" name="appReceiptNum" type="text" class="form-control textbox" maxlength="13"/>
So, from Postman, I tried firing a GET request:
GET https://egov.uscis.gov/casestatus/landing.do?receipt_number=XXXXXXXX
where XXXXXXXX would be my actual application number. But this didn't work and it just returned the main page. I tried switching it to a POST, but that didn't work either and returned the same result. On further inspection, I realized that the actual result page has a different URL, so I tried GET and POST both, on the result URL:
GET https://egov.uscis.gov/casestatus/mycasestatus.do?receipt_number=XXXXXXXX
This gets me a page telling me that there were validation errors and the receipt number was not recognized.
I went back to the manual process to see if I was missing anything. The result page URL has the format
https://egov.uscis.gov/casestatus/mycasestatus.do?JSESSIONID=ZZZZZZZZZ
where ZZZZZZZZZ is the value of the JSESSIONID cookie set by the landing page. So I changed my process to:
Send a GET request to the landing page (https://egov.uscis.gov/casestatus/landing.do)
Copy the value of JSESSIONID cookie from the response and set that as a query parameter in the request to the result page (https://egov.uscis.gov/casestatus/mycasestatus.do), while sending receipt_number as the payload in a POST request
This isn't working either. My end goal was to write Python or Java code (since those are the two languages I am familiar with) to get me the result, but I guess if I can't get my manual requests working from Postman, getting it to work from code is a pipe dream.
You don't need the session ID; just change the parameter name in your Postman request to appReceiptNum and it will work: https://egov.uscis.gov/casestatus/mycasestatus.do?appReceiptNum=LINXXXXXXXXXX
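Coming back to jsoup and Kotlin from the original question, that same request could look roughly like this (a minimal sketch; the receipt number is a placeholder, and the .current-status-sec selector is borrowed from the Puppeteer answer below, so it may need adjusting):
import org.jsoup.Jsoup

// Query the result page directly with the appReceiptNum parameter (sent as a GET query string).
val statusPage = Jsoup
    .connect("https://egov.uscis.gov/casestatus/mycasestatus.do")
    .data("appReceiptNum", "LINXXXXXXXXXX") // placeholder receipt number
    .userAgent("Mozilla")
    .get()

// The status text is assumed to live inside the .current-status-sec container.
println(statusPage.selectFirst(".current-status-sec")?.text())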
@Alok, what you require is termed a "headless browser" scraper.
I just created a quick sample (but in Node.js):
const today = formatYmd(new Date())
const browser = await puppeteer.launch()
const page = await browser.newPage()
console.log("going to URL")
await page.goto(url)
await page.$eval('#receipt_number', (el,receipt) => el.value = `${receipt}`, process.env.RECEIPT_NUMBER)
await page.click('input[type="submit"]')
console.log("waiting for submission to be completed.")
await page.waitForSelector('div.current-status-sec').catch(t => console.log("Not able to load status screen"))
const status = removeTags(await page.$eval('.current-status-sec', el => el.innerText))
console.log(`${today}: ${status}`)
await page.screenshot({path: `./screenshot/${today}_screenshot.png`})
browser.close()
You can find the full repo here: https://github.com/Parthashah/uscis-status-check
The code gives back a screenshot and the status.
I just created a simple USCIS web crawler/scraper Spring Boot app using HtmlUnit.
The only caveat is that my crawler ignores invalid case numbers, but it is working. The GitHub link is here: https://github.com/somych1/USCISCaseStatusWebScraper
public ResponseDTO getStatus(String caseId){
ResponseDTO responseDTO = new ResponseDTO();
//browser setup
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(false);
webClient.getOptions().setTimeout(8000);
webClient.getOptions().setDownloadImages(false);
webClient.getOptions().setGeolocationEnabled(false);
webClient.getOptions().setAppletEnabled(false);
try{
// loading the HTML to a Document Object
HtmlPage page = webClient.getPage(url);
// case lookup
HtmlInput input = page.getHtmlElementById("receipt_number");
input.setValueAttribute(caseId);
HtmlInput button = page.getElementByName("initCaseSearch");
HtmlPage pageAfterClick = button.click();
// new page after click
HtmlHeading1 h1 = pageAfterClick.getFirstByXPath("//div/h1");
HtmlParagraph paragraph = pageAfterClick.getFirstByXPath("//div/p");
// extract the visible text of the heading (status) and paragraph (description)
String status = h1.asNormalizedText();
String description = paragraph.asNormalizedText();
//setting response object
responseDTO.setCaseId(caseId);
responseDTO.setStatus(status);
responseDTO.setDescription(description);
} catch (IOException ex) {
ex.printStackTrace();
}
return responseDTO;
}

Selenium - move page to front

I am performing a web page scrape and update using Selenium and phantomJS webdriver. At one stage, completing one page can bring up either page_A or page_B. The page I want is page_B. If page_A appears, I can locate the elements:
<li class="menu_item">
<a class="menu_item_anchor_selected" href="#page_a">Page_A</a>
</li>
<li class="menu_item">
<a class="menu_item_anchor" href="#page_b">Page_B</a>
</li>
and I can then extract the target from the anchor to page_B. However, once loaded by the driver, every element on page_B remains invisible.
public void fill_page_B (WebDriver driver, String target) throws Exception
{
    String currentUrl = driver.getCurrentUrl();
    String page_B_Url = currentUrl + target;
    driver.get(page_B_Url);

    // prove that page_B actually loaded
    String page = driver.getPageSource();
    System.out.println(page);

    WebElement we = driver.findElement(By.id("MyId"));
    // ignore test for we exists; it does

    WebDriverWait wait = new WebDriverWait(driver, 60); // should be long enough!
    wait.until(ExpectedConditions.visibilityOf(we)); // fails
    . . .
}
I have tried using:
driver.manage().window().maximize();
and other tricks to ensure the window hasn't shrunk so that the required control falls outside the viewable area, and similar tricks to make sure it hasn't migrated off the screen entirely.
I have also tried:
JavascriptExecutor jse = (JavascriptExecutor) driver;
jse.executeScript ("arguments[0].checked=true;", we);
to force the element to be visible. This approach, however, merely threw an exception: "cannot click on option element. Executing JavaScript click function returned an unexpected error", etc.
However, when I ran my application again, this time using an Internet Explorer driver, I noticed that when I got page_B (and proved it by displaying the page contents in the system log), page_A was still visible. I did try resizing page_A (which was previously maximized) before getting page_B, but this made no difference; everything on page_B remained invisible and unclickable.
Any suggestions as to how Selenium could deal with this?

How to get multi page JMeter Webdriver timing

I've been using JMeter for quite a while but webdriver is new to me.
I'm trying to do some timings for a multi-page scenario and have a question.
I'm using the JMeter WebDriver Sampler and HtmlUnit:
Here is the scenario
1. Go to a web page http://162.243.100.234
2. Enter the word hello in the search box
3. Click on submit
What I want to get is:
1. How long it took to load the first page
2. How long it took from when I clicked on submit to when the results page was loaded
I have the following code which only gives me ONE sample timing.
How do I change it so I'll have two?
var pkg = JavaImporter(org.openqa.selenium)
var support_ui = JavaImporter(org.openqa.selenium.support.ui.WebDriverWait)
var wait = new support_ui.WebDriverWait(WDS.browser, 5000)
WDS.sampleResult.sampleStart()
WDS.browser.get('http://162.243.100.234/')
var searchField = WDS.browser.findElement(pkg.By.id('s'))
searchField.click()
searchField.sendKeys(['hello'])
var button = WDS.browser.findElement(pkg.By.id('searchsubmit'))
button.click()
WDS.sampleResult.sampleEnd()
I tried adding another sampleStart and sampleEnd but got an error.
Do I need to use two samplers somehow?
Yep, you need to split your code into 2 pieces:
First Sampler:
WDS.sampleResult.sampleStart()
WDS.browser.get('http://162.243.100.234')
WDS.sampleResult.sampleEnd()
Second Sampler:
var pkg = JavaImporter(org.openqa.selenium)
WDS.sampleResult.sampleStart()
var searchField = WDS.browser.findElement(pkg.By.id('s'))
searchField.click()
searchField.sendKeys(['hello'])
var button = WDS.browser.findElement(pkg.By.id('searchsubmit'))
button.click()
WDS.sampleResult.sampleEnd()
Note the WDS.sampleResult.sampleStart() and WDS.sampleResult.sampleEnd() method invocations.
As per the Using Selenium with JMeter's WebDriver Sampler guide:
WDS.sampleResult.sampleStart() and WDS.sampleResult.sampleEnd() capture the sampler's time and track it. You can remove them; the script will still work, but you won't get the load time.
Hope this helps