How to get the last child of an HTMLElement - vba

I have written a macro in Excel that opens and parses a website and pulls the data from it. The trouble I'm having is once I'm done with all of the data on the current page I want to go to the next page. To do this I want to get the last child of the "result-stats" node. I found the lastChild function, and so came up with the following code:
'Checks to see if there is a next page
If html.getElementById("result-stats").LastChild.innerText = "Next" Then
html.getElementById("result-stats").LastChild.Click
End If
And here is the HTML that it is accessing:
<p id="result-stats">
949 results
<span class="optional"> (1.06 seconds)</span>
Modify search
Show more columns
Next
</p>
When I try to run this, I get an error. After a lot of searching I think I found the reason. According to what I read, getElementById returns an element and not a node. lastChild only works on nodes, which is why the function doesn't work here.
My question is this. Is there a clean and simple way to grab the last child of an element? Or is there a way to typecast an element to that of a node? I feel like I'm missing something obvious, but I've been at this way longer than I should have been. Any help anyone could provide would be greatly appreciated.
Thanks.

Here's a shell of how to do it. If my comments are not clear, ask away. I assumed knowledge of how to navigate to the page, wait for the browser, etc.
Sub ClickLink()
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")
    'load up page and all that stuff
    'process data ...
    'click link
    Dim doc As Object
    Set doc = IE.document
    Dim sLink As Object
    For Each sLink In doc.getElementsByTagName("a")
        If sLink.innerText = "Next" Then 'may need to play with this if innerText doesn't work
            sLink.Click
            Exit For
        End If
    Next
End Sub
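For the original question of grabbing the last child directly, here is a minimal sketch that goes through the element's children collection, which contains only element children (no text nodes). LastChildElement and ClickNextIfPresent are hypothetical names, and it is an assumption, based on the HTML shown in the question, that the last child of "result-stats" is the Next link.

```vba
' Hypothetical helper: returns the last child element of parentEl,
' or Nothing when parentEl has no element children.
Function LastChildElement(parentEl As Object) As Object
    Dim kids As Object
    Set kids = parentEl.Children            ' element children only, no text nodes
    If kids.Length > 0 Then
        Set LastChildElement = kids.Item(kids.Length - 1)   ' collection is zero-based
    Else
        Set LastChildElement = Nothing
    End If
End Function

' Usage sketch: html is assumed to be the document object from the question.
Sub ClickNextIfPresent(html As Object)
    Dim lastEl As Object
    Set lastEl = LastChildElement(html.getElementById("result-stats"))
    If Not lastEl Is Nothing Then
        If lastEl.innerText = "Next" Then lastEl.Click
    End If
End Sub
```

Checking for Nothing first avoids a runtime error on the last page, where the paragraph may have no link at all.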

Related

Macro single step works when routine doesn't

I have been running this macro and it comes up with a 424 Object Required error, but the macro works and I get the expected result when I single-step through it with F8.
Sub FileUpload()
Dim IEexp As InternetExplorer
Set IEexp = CreateObject("InternetExplorer.Application")
IEexp.Visible = False
IEexp.navigate "https://www.google.co.uk/?gws_rd=ssl#q=lenti+a+contatto+colorate"
Do While IEexp.ReadyState <> 4: DoEvents: Loop
Dim inputElement As HTMLDivElement
Set inputElement = IEexp.Document.getElementById("brs")
MsgBox inputElement.textContent
IEexp.Quit
Set IEexp = Nothing
End Sub
The error comes up on the Set inputElement = IEexp.Document.getElementById("brs") line.
You’re checking the ReadyState of the browser, but with some modern web pages parts of the DOM are only populated after the browser reports that it is ready.
IE automation in VBA is quite primitive, and it sounds like in this scenario you’re trying to access a node in the DOM before it exists, despite your best efforts to wait until the browser is ready. In some cases the timing can be off by only a few milliseconds, which is why single-stepping with F8 works.
Your quickest fix here is to simply add Application.Wait() in your loop to cause an actual time delay. A more elegant option is to check for the element inside your loop and exit the loop once the desired DOM object actually exists. If you do this, there’s a danger of ending up in an infinite loop, so I would always recommend setting a maximum number of iterations (or a timeout) as a backup.
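As a rough sketch of the second option, the function below polls for the element and gives up after a timeout. WaitForElement and timeoutSeconds are hypothetical names, not part of any library, and the 10-second default is an arbitrary assumption.

```vba
' Hypothetical helper: polls until an element with the given id exists,
' or until timeoutSeconds has elapsed. Returns Nothing on timeout.
Function WaitForElement(ie As Object, elementId As String, _
                        Optional timeoutSeconds As Double = 10) As Object
    Dim startTime As Double, elapsed As Double
    Dim found As Object

    startTime = Timer
    Do
        DoEvents                                ' let IE keep loading
        Set found = Nothing
        On Error Resume Next                    ' the document may not be available yet
        Set found = ie.Document.getElementById(elementId)
        On Error GoTo 0
        If Not found Is Nothing Then Exit Do

        elapsed = Timer - startTime
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer resets at midnight
    Loop While elapsed < timeoutSeconds

    Set WaitForElement = found
End Function
```

In the macro from the question this could be called as Set inputElement = WaitForElement(IEexp, "brs"), followed by a check for Nothing before reading textContent.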

IE source code placeholder control for my VBA scraper

I have the following code which opens an IE page, and fills in the fields with the value "caravan". However I only need the first field to be filled in with "caravan". I need the second one to be filled in with "2016" for example. I've had trouble with this task because I can't seem to uniquely identify each element within the input tag (to which all of the fields belong).
Here is my code:
Sub Quote()
    Dim ie As Object
    Set ie = CreateObject("InternetExplorer.Application")
    ie.navigate ("https://insurance.qbe.com.au/portal/caravan/form/estimate")
    ie.Visible = True
    Do
        DoEvents
    Loop Until ie.readystate = 4
    Application.Wait (Now + TimeValue("00:00:03"))
    Do
        DoEvents
    Loop Until ie.readystate = 4
    Set inputCollection = ie.document.getElementsByTagName("input")
    For Each inputElement In inputCollection
        inputElement.Value = "Caravan"
    Next inputElement
End Sub
So it's taking each "inputElement" that is housed within the "input" tag, and where possible, it's making a corresponding field's display value be that of "caravan".
To illustrate why I'm having difficulty in uniquely identifying each field, here is the source of the first two fields (first one is for caravan type; second one is for caravan year-of-manufacture):
First one
Second one
So neither have an id. And both are within the "input" tag and both have the same classname. So I can't get-element-by-id or get-elements-by-classname. I've tried getting elements by classname in a wide range of ways and it simply does nothing (no error is produced and the web page isn't affected).
The only way I've managed to fill in a field is through using the code I have above. But, again, it's changing all the fields of course. I figure that the only thing I can really use to get my code to tell the two apart is the placeholder element of each one.
But how do I achieve this seeing as you cannot "get element by placeholder"
//
I've since tried to confirm that there's no way to use classname, with the following code modification:
Set inputCollection = ie.document.getElementsByTagName("input")
For Each inputElement In inputCollection
If ie.document.getElementsByClassName.Value = "ui-select-search ui-select-toggle ng-pristine ng-valid ng-touched" Then inputElement.Value = "Caravan"
Oh my! How exciting!! I finally found out how to do this after literally days of searching online. It always had to be something simple (but, alas, this isn't my area of expertise at all so it's always going to be really challenging for me). Anyway, this code works (and I expect I will need to put a fire-event line in soon):
Set inputCollection = ie.document.getElementsByTagName("input")
For Each inputElement In inputCollection
If inputElement.getAttribute("placeholder") = "Caravan type" Then
inputElement.Value = "Caravan"
Exit For
End If
Next inputElement
I was so unaware of "getAttribute" but it makes so much sense. If you don't have an id and some of the fields you are looking at have the same classname (as can often be the case), then you need to rely on other unique attributes and use this sort of code.
If you're wondering where I found out about this, I found this pretty cool Youtube channel, and here's the specific video that helped me:
https://www.youtube.com/user/alexcantu3/videos
Hope it helps someone else some day!

Scrape html data Vba

I want to make a function that extracts data from a part of a site.
The following is the HTML site. HTML code.
Code for the function
Function GetElementById(url As String, id As String, Optional isVolatile As Boolean)
Application.Volatile (isVolatile)
On Error Resume Next
Dim html As Object, objResult As Object
ret = GetPageContent(url)
Set html = CreateObject("htmlfile")
html.Body.innerHtml = ret
Set objResult = html.GetElementById(id)
GetElementById = objResult.innerHtml
End Function
I need the function to extract only the element with the class "panel-body". I think it would be .children(3). Is that correct?
It also needs to be practical and fast, because I need to extract data from more than 50 sites.
I see at least two options.
The first: once you have the HTMLDivElement with id=Result, you could simply walk its children. Please test this by first doing objResult.Children(2) and checking which element is returned.
objResult.Children(2).Children(0).Children(0)
The second: in later versions of MSHTML (I think with IE9 or later installed) you have the method "GetElementsByClassName". This returns a collection of IHTMLElements. If the HTMLDocument only has one "panel-body" then you are in luck. If not, you would need to loop through each one and check some other unique feature to know you have the right one.
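If GetElementsByClassName turns out not to be available on your document object, a hedged alternative is to loop over the elements yourself and compare class names. The sketch below assumes the target is a div (the tagName default is a guess), and FirstInnerHtmlByClass is a hypothetical name.

```vba
' Hypothetical helper: returns the innerHTML of the first element under root
' whose class list contains className, or "" if none is found.
Function FirstInnerHtmlByClass(root As Object, className As String, _
                               Optional tagName As String = "div") As String
    Dim el As Object
    For Each el In root.getElementsByTagName(tagName)
        ' Pad with spaces so "panel-body" does not match "panel-body-extra"
        If InStr(1, " " & el.className & " ", " " & className & " ", vbBinaryCompare) > 0 Then
            FirstInnerHtmlByClass = el.innerHTML
            Exit Function
        End If
    Next el
    FirstInnerHtmlByClass = ""
End Function
```

Inside the function from the question, the last assignment could then become GetElementById = FirstInnerHtmlByClass(objResult, "panel-body"), which avoids depending on the exact nesting depth of Children.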
Another way to generate the code for this job is to record a macro, then wrap the recorded macro in a loop that goes through your 50 pages and gets the results.
On the Data tab in the ribbon there is an option to get data from external sources. If you use this, it gives you a point-and-click interface that lets you choose the table you're looking for. Record a macro while you're doing this and it generates the code for you.

How to work with result collections from Selenium

Or how to work with collections (or arrays) in VBA.
The issue is most probably myself, but I couldn't find an answer yet.
I am trying to go through some pages on a website with Selenium-vba to find some data.
As usual, if there is more to display, the site shows a 'NEXT' button. The button has <a href ... > when the link is active; otherwise it's just plain text.
To test whether there is another page, I found that I can use findElementsByLinkText: either there is a link or the collection is empty. So this can be tested by the size of the collection.
This works so far.
But when I try to use the collection (aside from a for each loop) for further action I can't get it to operate.
This is the code:
Dim driver As New SeleniumWrapper.WebDriver
Dim By As New By, Assert As New Assert, Verify As New Verify, Waiter As New Waiter
On Error GoTo ende1
driver.Start "chrome", "http://www.domain.tld/"
driver.setImplicitWait 5000
driver.get "//......."
Set mynext = driver.findElementsByLinkText("Next")
if mynext.Count >0 Then
mynext(1).Click 'THIS STATEMENT DOES NOT WORK
End If
So please help me get past what I am convinced is a gap in my understanding.
How can I access an element from the collection?
My workaround so far is to execute
driver.findElementByLinkText("Next").Click
but this is unprofessional as it executes the query again.
The Next button is probably loaded asynchronously after the page has finished loading.
This implies that findElementsByLinkText("Next") returns no elements at the time it's called.
A way to handle this case is to silence the error, adjust the timeout, and test the returned element:
Dim driver As New Selenium.ChromeDriver
driver.Get "https://www.google.co.uk/search?q=selenium"
Set ele = driver.FindElementByLinkText("Next", Raise:=False, timeout:=1000)
If Not ele Is Nothing Then
ele.Click
End If
driver.Quit
The latest release, which works with the above example, is available here:
https://github.com/florentbr/SeleniumBasic/releases/latest

VB.NET Webbrowser.Document - what you see is not what you can get

My attempts at writing a simple crawler seem to be confounded by the fact that my target webpage (as would appear in the UI browser control, or through a typical browser application) is not completely accessible as an HTMLDocument (due to frames, javascript, etc.)
The code below executes, and the correct webpage (e.g. the one displaying items 50-59) can even be seen in the control, but where I would expect the “next page” hyperlink retrieved to be “...&start=60”, I see something else – the one corresponding to opening the first catalog page “...&start=10”.
What is odd, is that if I press the button a second time, I DO get what I’m looking for. Even odder to me, if I inserted a MsgBox, say right after I’ve looped to wait until WebBrowserReadyState.Complete, then I get what I’m looking for.
Private Sub ButtonGo_Click(sender As System.Object, e As System.EventArgs) Handles ButtonGo.Click
'start at this URL
'e.g. http://www.somewebsite.com/properties?l=Dallas+TX&co=US&start=50
catalogPageURL = TextBoxInitialURL.Text
WebBrowser1.Navigate(catalogPageURL)
While WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
Application.DoEvents()
End While
'Locate the URL associated with the NEXT>> hyperlink
Dim allLinksInDocument As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
Dim strNextPgLink As String = ""
For Each link As HtmlElement In allLinksInDocument
If link.GetAttribute("className") = "next" Then
strNextPgLink = link.GetAttribute("href")
End If
Next
End Sub
I’ve googled around enough to try things like using a WebBrowser1.DocumentCompleted event, but that still didn’t work. I’ve tried inserting sleep commands.
I’ve avoided using WebClient and regular expressions, the way I would have ordinarily done this, because I’m convinced using the DOM will be easier for other things I have planned down the road, and I’m aware of HTML Agility Pack but not ambitious enough to learn it. Because it seems there has to be a simple way to have this dang webbrowser.document object synchronized with the stuff you can actually see.
If this is because of javascript, is there a way I can tell the webbrowser to just execute them all?
First question on the forum, looking forward to more (smarter ones hopefully)
Be warned when using webbrowser1.Document or something similar - you will not get 'raw html'
Example: (assume wbMain is a webbrowser control)
RTB_RawHTML.Text = wbMain.DocumentText
Try
RTB_BodyHTML.Text = wbMain.Document.Body.OuterHtml
Catch
debugMessage("Body tag not found.")
End Try
In this example, the code in the body tag portion of RTB_RawHTML will NOT perfectly match the HTML displayed in RTB_BodyHTML. Accessing it through (yourwebbrowserhere).Document.Body.OuterHtml appears to 'clean' it somewhat, as opposed to the 'raw' HTML retrieved by (yourwebbrowserhere).DocumentText.
This was a problem for me when I was making a web scraper, as it would continually throw me off: sometimes I would try to match a tag and it would find it, and other times it wouldn't, even though I was sure it was there. The reason was that I was trying to match the raw HTML, but I needed to match the 'cleaned' HTML.
I'm not sure if this will help you isolate the problem or not. For me it did.