I'd like to use the MSHTML library to parse some HTML that I have in a string variable. However, I can't figure out how to do this. I can easily parse the contents of a webpage given a known URL, but not the source HTML directly. Is this possible? If so, how?
Public Sub ParseHTML(sHTML As String)
    Dim oHTML As New HTMLDocument, oDoc As HTMLDocument
    'This works:
    Set oDoc = oHTML.createDocumentFromUrl("http://www.google.com", "")
    'I would like to do the following but no such method actually exists:
    Set oDoc = oHTML.createDocumentFromString(sHTML)
    ....
    'Parse the HTML using the oDoc variable
    ....
End Sub
You can:
Dim odoc As Object
Set odoc = CreateObject("htmlfile") '// late binding
'// or:
'// Set odoc = New HTMLDocument
'// for early binding
odoc.open
odoc.write "<p> In his house at R'lyeh, dead <b>Cthulhu</b> waits dreaming</p>"
odoc.Close
MsgBox odoc.body.outerHTML
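As a minimal sketch, the open/write/close steps above could be wrapped in a small helper so that any HTML string can be parsed and then queried. The names HtmlFromString and DemoHtmlFromString are hypothetical and chosen only for illustration; the sketch uses late binding, so no library reference is assumed.

```vba
' Hypothetical helper: returns a parsed document built from an HTML string.
Public Function HtmlFromString(ByVal sHTML As String) As Object
    Dim doc As Object
    Set doc = CreateObject("htmlfile")   ' late binding, no reference needed
    doc.Open
    doc.Write sHTML
    doc.Close
    Set HtmlFromString = doc
End Function

' Hypothetical demo: parses a small list and prints the text of each item.
Public Sub DemoHtmlFromString()
    Dim doc As Object, el As Object
    Set doc = HtmlFromString("<ul><li>one</li><li>two</li></ul>")
    For Each el In doc.getElementsByTagName("li")
        Debug.Print el.innerText
    Next el
End Sub
```

Running DemoHtmlFromString from the VBA editor should list the items in the Immediate window.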
For straight HTML code such as Access-Rich-Text this does it:
Dim HTMLDoc As New HTMLDocument
HTMLDoc.Body.innerHTML = strHTMLText
This is a much better example. You will not get a null exception, nor late binding.
(And if you use WPF, just add System.Windows.Forms in your reference.)
Dim a As Object
a = New mshtml.HTMLDocument
a.open()
a.writeln(code)
a.close()
Do Until a.readyState = "complete"
System.Windows.Forms.Application.DoEvents()
Loop
Dim doc As mshtml.HTMLDocument = a
Dim b As mshtml.HTMLSelectElement = doc.getElementsByTagName("Select").item("lang", 0)
Related
I want to replace the htmldoc line, which comes from the HTML Object Library, with something suitable for Selenium. I want to pass htmldoc as an argument to another subroutine. Here is the code:
Dim htmldoc As MSHTML.HTMLDocument
Dim htmldiv As Selenium.WebElement
Dim htmlul As Selenium.WebElement
Dim htmlAs As Selenium.WebElements
Dim htmlA As Selenium.WebElement
Dim TableName As String
URL = "https://www.whoscored.com/Statistics"
sel.Start "Chrome"
sel.Get URL
'set htmldoc= sel.document..... something....
Set htmldiv = sel.FindElementById("top-player-stats")
Set htmlul = sel.FindElementById("top-player-stats-options")
Set htmlAs = htmlul.FindElementsByTag("a")
For Each htmlA In htmlAs
TableName = htmlA.attribute("href")
htmlA.Click
GoToTable htmldoc, TableName
Next htmlA
End Sub
If you're trying to capture the entire HTML source code, one option is to use:
sel.PageSource
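But that might not behave as you expect because of a limitation in how it is generated: if the page was modified after loading (for example, by JavaScript), the returned text is not guaranteed to reflect those changes (source: https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/WebDriver.html#getPageSource()).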
You could also try these after the page is fully loaded:
sel.ExecuteScript("return document.documentElement.innerHTML")
sel.ExecuteScript("return document.body.innerHTML")
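If the goal is still to hand an MSHTML document to another subroutine, one possible approach is to load the string returned by Selenium into a new HTMLDocument. This is only a sketch: DocFromSource is a hypothetical helper name, and it assumes a reference to the Microsoft HTML Object Library.

```vba
' Hypothetical helper: builds an MSHTML document from an HTML string.
' Assumes a reference to the Microsoft HTML Object Library.
Function DocFromSource(ByVal source As String) As MSHTML.HTMLDocument
    Dim doc As New MSHTML.HTMLDocument
    ' Let MSHTML parse the markup so the DOM methods can be used on it
    doc.body.innerHTML = source
    Set DocFromSource = doc
End Function
```

Inside the loop from the question, it might be used as Set htmldoc = DocFromSource(sel.PageSource) after htmlA.Click. The resulting document is a static snapshot, so it has to be rebuilt after every click.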
I'm trying to select the main menu of this page, http://greyhoundstats.co.uk/index.php, by its ID ("menu_wholesome") in order to get its hyperlinks later on. In the HTML document there are two tags with this ID, a <div> and its child element <ul>, but when I search for them with the code below, I get the "object variable not set" error.
Option Explicit
Public Const MenuPage As String = "http://greyhoundstats.co.uk/index.php"
Sub BrowseMenus()
Dim XMLHTTPReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim MainMenuList As MSHTML.IHTMLElement
Dim aElement As MSHTML.IHTMLElementCollection
Dim ulElement As MSHTML.IHTMLUListElement
Dim liElement As MSHTML.IHTMLLIElement
XMLHTTPReq.Open "GET", MenuPage, False
XMLHTTPReq.send
HTMLDoc.body.innerText = XMLHTTPReq.responseText
Set MainMenuList = HTMLDoc.getElementById("menu_wholesome")(0) '<-- error happens here
End Sub
Does anyone know why getElementById can't find the referenced ID, although it is part of the HTML document? I know that IDs are supposed to be unique, but when the same ID is used by more than one tag, my understanding is that the method returns the first match, which should be the <div id="menu_wholesome"> part of the requested HTML page.
Firstly: Set innerHTML rather than innerText, since you intend to traverse a DOM document.
Secondly: This line
Set MainMenuList = HTMLDoc.getElementById("menu_wholesome")(0)
It is incorrect. getElementById returns a single element which you cannot index into. You index into a collection.
Please note: Both div and ul lead to the same content.
If you want to select them separately use querySelector
HTMLDoc.querySelector("div#menu_wholesome")
HTMLDoc.querySelector("ul#menu_wholesome")
The selectors above match on tag name first, then on the id attribute.
If you want all the elements that share an id, use querySelectorAll to return a nodeList of matching items. Ids should be unique to the page, but sometimes they are not!
HTMLDoc.querySelectorAll("#menu_wholesome")
You can then index into the nodeList e.g.
HTMLDoc.querySelectorAll("#menu_wholesome").item(0)
VBA:
Option Explicit
Public Const MenuPage As String = "http://greyhoundstats.co.uk/index.php"
Sub BrowseMenus()
Dim sResponse As String, HTMLDoc As New MSHTML.HTMLDocument
Dim MainMenuList As Object, div As Object, ul As Object
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", MenuPage, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
HTMLDoc.body.innerHTML = sResponse
Set MainMenuList = HTMLDoc.querySelectorAll("#menu_wholesome")
Debug.Print MainMenuList.Length
Set div = HTMLDoc.querySelector("div#menu_wholesome")
Set ul = HTMLDoc.querySelector("ul#menu_wholesome")
Debug.Print div.outerHTML
Debug.Print ul.outerHTML
End Sub
It is unclear what you are trying to achieve, so I have only fixed the problem you are having at the moment. .getElementById() deals with an individual element, so when you treat its result as a collection of elements, it throws that error. Compare getElementBy with getElementsBy: the trailing s tells you which methods return a collection of elements (don't overlook the s). You can only use (0) or something similar when you use one of the getElementsBy methods.
You should indent your code in the right way so that others can read it without any trouble:
Sub BrowseMenus()
    Const MenuPage$ = "http://greyhoundstats.co.uk/index.php"
    Dim HTTPReq As New XMLHTTP60, HTMLDoc As New HTMLDocument
    Dim MainMenuList As Object

    With HTTPReq
        .Open "GET", MenuPage, False
        .send
        HTMLDoc.body.innerHTML = .responseText
    End With

    Set MainMenuList = HTMLDoc.getElementById("menu_wholesome")
End Sub
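The difference between getElementById and the getElementsBy methods described above can be sketched on a small in-memory document. The markup and the name SingleVersusCollection are made up for illustration, and late binding is used so that no reference is assumed.

```vba
' Hypothetical demo: a single element versus a collection of elements.
Sub SingleVersusCollection()
    Dim doc As Object, one As Object, many As Object
    Set doc = CreateObject("htmlfile")
    doc.Open
    doc.Write "<div id=""menu""><a href=""#a"">A</a><a href=""#b"">B</a></div>"
    doc.Close

    ' getElementById returns one element, so it must not be indexed
    Set one = doc.getElementById("menu")
    Debug.Print one.tagName

    ' getElementsByTagName returns a collection, so indexing is allowed
    Set many = doc.getElementsByTagName("a")
    Debug.Print many.Length
    Debug.Print many(0).innerText
End Sub
```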
I am trying to open a file but the Office application hangs during document opening in VB.NET.
I have this code:
Dim oProp As Object
Dim strPropValue As String
Dim lngRetVal As Integer
Dim strmsg As String
Dim lngretcode As Integer
Dim strPropertyName As String
Dim oWordDoc As Word.Document
Dim ObjOfficeAPP As Object
ObjOfficeAPP = New Word.Application()
GetWORDKEYS = cstFAILURE
ObjOfficeAPP.DisplayAlerts = WdAlertLevel.wdAlertsAll
ObjOfficeAPP.Application.Visible = True
oWordDoc = ObjOfficeAPP.Documents.Open(FileName:=strpFileName, Visible:=False)
I have problems on the line:
oWordDoc = ObjOfficeAPP.Documents.Open(FileName:=strpFileName, Visible:=False)
The debugger hangs on the Documents.Open() call, and just stays there waiting - without firing any type of exception or error. We have looked in the event log but only found the following.
My question is: how can I open this document without blocking on the line with the Documents.Open() call?
Here are a few points that could help:
Don't use the Application property for setting the Visible property:
ObjOfficeAPP.Visible = True
Don't set the DisplayAlerts property before opening a document.
Use the System.Reflection.Missing.Value for missing arguments.
You may find the article "How to automate Word from Visual Basic .NET to create a new document" helpful. It explains the required steps for automating Word and provides sample code:
'Start Word and open the document template.
oWord = CreateObject("Word.Application")
oWord.Visible = True
oDoc = oWord.Documents.Add
I have checked that both the browser-generated page and the string response of the VBA XMLHTTP request have the same tree structure, with the <a> tag being a child of <aside>.
Unfortunately, when I want to return the bookie name, which is the title attribute of <a>, I get an error accessing the first child of <aside>. It turns out that I need to write the code as if the <a> tag were a sibling of <aside> to get it working:
Required reference: Microsoft HTML Object Library
Sub SendRequest()
Dim XMLHTTP As Object: Set XMLHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
Dim htmlEle1 As IHTMLElement
Dim htmlDoc As New HTMLDocument
Dim urlName As String
urlName = "https://www.oddschecker.com/golf/the-masters/2018-us-masters/winner"
With XMLHTTP
.Open "GET", urlName, False
.send
htmlDoc.body.innerHTML = .responseText
For Each htmlEle1 In htmlDoc.getElementsByClassName("eventTableHeader")(0).Children
If InStr(htmlEle1.className, "bookie-area") <> 0 Then
Debug.Print htmlEle1.Children(1).getAttribute("title")
End If
Next htmlEle1
End With
End Sub
Does this behavior have something to do with the fact that <aside> is an HTML5 element and VBA treats it as a self-closing tag?
This took an awful lot of time to figure out. The issue is that you can't do it this way. When you create a new HTMLDocument, its documentMode is set to 5 by default.
So when we load or write any HTML into it, it has no idea about these HTML5 tags and applies its own correction. This is as good as running an HTML5 site in IE6 or something similar. Unfortunately, I could not find any way to create or parse a document with a higher documentMode.
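To check which mode a freshly created document reports on a given machine, a minimal sketch like the following could be used. ShowDocumentMode is a hypothetical name, the markup is made up, and late binding is assumed.

```vba
' Hypothetical check: prints the documentMode of an in-memory document.
Sub ShowDocumentMode()
    Dim doc As Object
    Set doc = CreateObject("htmlfile")
    doc.Open
    doc.Write "<aside><a title=""bookie"">link</a></aside>"
    doc.Close
    ' A low value here means HTML5 elements such as aside are not recognised
    Debug.Print doc.documentMode
End Sub
```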
Update
Thanks to @FlorentB for pointing out that emulation mode works with the MSHTML library as well. I was already aware of this from the question below:
Embedding Youtube Videos in webbrowser. Object doesn't support property or method
But I assumed it wouldn't work for the MSHTML library. I have now tested it by running the command below:
REG ADD "HKCU\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION" /v excel.exe /t REG_DWORD /d 11001 /f
I then ran the existing code, and it works.
Alternative approach
If setting the registry key needs to be avoided for any reason, you can use the IE COM browser directly.
You can do this by adding a reference to Microsoft Internet Controls and then executing the code below:
Sub dothis()
Dim XMLHTTP As Object: Set XMLHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
Dim htmlEle1 As IHTMLElement
Dim htmlDoc As HTMLDocument
'Set htmlIDoc = htmlDoc
Dim urlName As String
urlName = "https://www.oddschecker.com/golf/the-masters/2018-us-masters/winner"
Dim ie As InternetExplorerMedium
Set ie = New InternetExplorerMedium
ie.Visible = False
ie.navigate2 urlName
While ie.readyState <> READYSTATE_COMPLETE
DoEvents
Wend
Set htmlDoc = ie.document
Debug.Print (htmlDoc.documentMode)
For Each htmlEle1 In htmlDoc.getElementsByClassName("eventTableHeader")(0).Children
If InStr(htmlEle1.className, "bookie-area") <> 0 Then
Debug.Print htmlEle1.Children(0).children(0).getAttribute("title")
End If
Next htmlEle1
End Sub
And now you can see that <a> is a child of <aside>.
How can I use "createDocumentFromUrl()" to fetch an "HTMLDocument" from a webpage directly in VBA? I searched SO at length for documentation on it but could not find any. I hope somebody can help me accomplish this. Thanks in advance.
Here is what I've tried so far which is definitely not right:
Sub HtmlScraper()
Dim odoc As Object
Set odoc = New HTMLDocument
odoc.Open createDocumentFromUrl("http://www.stackoverflow.com", "null")
MsgBox odoc.body.innerHTML
End Sub
I tried like this as well but no luck:
Sub htmlparser()
Dim odoc As HTMLDocument, hdoc As HTMLDocument
Set odoc = New HTMLDocument
Set hdoc = New HTMLDocument
Set hdoc = odoc.createDocumentFromUrl("http://www.stackoverflow.com", Null, False)
MsgBox hdoc.body.outerHTML
End Sub
This worked for me; it may be the site.
Sub test()
    Dim d As MSHTML.HTMLDocument
    Set d = New MSHTML.HTMLDocument
    Dim d2 As MSHTML.HTMLDocument
    Set d2 = d.createDocumentFromUrl("www.bbc.co.uk", "null")
    ' Wait on d2, the document being loaded, rather than on the empty d
    While d2.readyState <> "complete"
        DoEvents
    Wend
End Sub