Excel VBA XML HTTP - Code not working on Windows 8 - vba

I have the following function which returns HTML document for the URL passed. I am using the returned HTML Doc in some other function.
The function works perfectly on Windows 7 but NOT on windows 8 unfortunately. How can I write code which works on Windows 7 and 8 both? I think I need to use a different version of XML HTTP object.
Function GetHtmlDoc(ByVal URL As String) As Object
Dim msg As String
' Reset the Global variables.
PageSrc = ""
Set htmlDoc = Nothing
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, True
.Send
While .readyState <> 4: DoEvents: Wend
a = .statusText
' Check for any server errors.
If .statusText <> "OK" Then
' Check for WinHTTP error.
'msg = GetWinHttpErrorMsg(.Status)
'If msg = "" Then msg = .statusText
' Return the error number and message.
GetPageSource = "ERROR: " & .Status & " - " & msg
Exit Function
End If
' Save the HTML code in the Global variable.
PageSrc = .responseText
End With
' Create an empty HTML Document.
Set htmlDoc = CreateObject("htmlfile")
htmlDoc.Open URL:="text/html", Replace:=False
' Convert the HTML code into an HTML Document Object.
htmlDoc.write PageSrc
' Terminate Input-Output with the Global variable.
htmlDoc.Close
' Return the HTML text of the web page.
Set GetHtmlDoc = htmlDoc
End Function
Example function call:
Set htmlDoc = GetHtmlDoc("http://www.censusdata.abs.gov.au/census_services/getproduct/census/2011/quickstat/POA2155?opendocument&navpos=220")

XMLHTTP no longer likes accessing remote servers from local scripts, switch to ServerXMLHTTP:
With CreateObject("MSXML2.ServerXMLHTTP")
.Open "GET", URL, False
(Using False performs the operation synchronously ngating the need for the readyState loop.)

I had same error turns out my code was working fine, just firewall blocked Excel so it failed.

Related

Can't get rid of "old format or invalid type library" error in vba

I've written a script in vba to get some set names from a webpage and the script is getting them accordingly until it catches an error somewhere within the execution. This is the first time I encountered such error.
What my script is doing is get all the links under Company Sets and then tracking down each of the links it goes one layer deep and then following all the links under Set Name it goes another layer deep and finally parse the table from there. I parsed the name of PUBLISHED SET which is stored within the variable bName instead of the table as the script is getting bigger. I used IE to get the PUBLISHED SET as there are few leads which were causing encoding issues.
I searched through all the places to find any workaround but no luck.
However, I came across this thread where there is a proposed solution written in vb but can't figure out how can I make it work within vba.
Script I'm trying with:
Sub FetchRecords()
Const baseUrl$ = "https://www.psacard.com"
Const link = "https://www.psacard.com/psasetregistry/baseball/company-sets/16"
Dim IE As New InternetExplorer, Htmldoc As HTMLDocument
Dim Http As New XMLHTTP60, Html As New HTMLDocument, bName$, tRow As Object
Dim post As Object, elem As Object, posts As Object, I&, R&, C&
Dim key As Variant
Dim idic As Object: Set idic = CreateObject("Scripting.Dictionary")
With Http
.Open "GET", link, False
.send
Html.body.innerHTML = .responseText
End With
Set posts = Html.querySelectorAll(".dataTable tr td a[href*='/psasetregistry/baseball/company-sets/']")
For I = 0 To posts.Length - 7
idic(baseUrl & Split(posts(I).getAttribute("href"), "about:")(1)) = 1
Next I
For Each key In idic.Keys
With Http
.Open "GET", key, False
.send
Html.body.innerHTML = .responseText
End With
For Each post In Html.getElementsByTagName("a")
If InStr(post.getAttribute("title"), "Contact User") > 0 Then
If InStr(post.ParentNode.getElementsByTagName("a")(0).getAttribute("href"), "publishedset") > 0 Then
IE.Visible = True
IE.navigate baseUrl & Split(post.ParentNode.getElementsByTagName("a")(0).getAttribute("href"), "about:")(1)
While IE.Busy = True Or IE.readyState < 4: DoEvents: Wend
Set Htmldoc = IE.document
bName = Htmldoc.querySelector("h1 b.text-primary").innerText
If InStr(bName, "/") > 0 Then bName = Split(Htmldoc.querySelector(".inline-block a[href*='/contactuser/']").innerText, " ")(1)
R = R + 1: Cells(R, 1) = bName
End If
End If
Next post
Next key
IE.Quit
End Sub
I get that error pointing at the following line after extracting records between 70 to 90:
bName = Htmldoc.querySelector("h1 b.text-primary").innerText
The error looks like:
Automation Error: old format or invalid type library
Proposed solution in the linked thread written in vb (can't convert to vba):
'save the current settings for easier restoration later
Dim oldCI As System.Globalization.CultureInfo = _
System.Threading.Thread.CurrentThread.CurrentCulture
'change the settings
System.Threading.Thread.CurrentThread.CurrentCulture = _
New System.Globalization.CultureInfo("en-US")
Your code here
'restore the previous settings
System.Threading.Thread.CurrentThread.CurrentCulture = oldCI

Unable to parse some links lying within an iframe

I've written a script in vba using IE to parse some links from a webpage. The thing is the links are within an iframe. I've twitched my code in such a way so that the script will first find a link within that iframe and navigate to that new page and parse the required content from there. If i do this way then I can get all the links.
Webpage URL: weblink
Successful approach (working one):
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim elem As Object, post As Object
With IE
.Visible = True
.navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set elem = .document.getElementById("compInfo") #it is within iframe
.navigate elem.src
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
For Each post In HTML.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've seen few sites where no such links exist within iframe so, I will have no option to use any link to track down the content.
If you take a look at the below approach by tracking the link then you can notice that I've parsed the content from a webpage which are within Iframe. There is no such link within Iframe to navigate to a new webpage to locate the content. So, I used contentWindow.document instead and found it working flawlessly.
Link to the working code of parsing Iframe content from another site:
contentWindow approach
However, my question is: why should i navigate to a new webpage to collect the links as I can see the content in the landing page? I tried using contentWindow.document but it is giving me access denied error. How can I make my below code work using contentWindow.document like I did above?
I tried like this but it throws access denied error:
Sub Get_Links()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim frm As Object, post As Object
With IE
.Visible = True
.Navigate "put here the above link"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
''the code breaks when it hits the following line "access denied error"
Set frm = HTML.getElementById("compInfo").contentWindow.document
For Each post In frm.getElementsByClassName("news")
With post.getElementsByTagName("a")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
End With
Next post
IE.Quit
End Sub
I've attached an image to let you know which links (they are marked with pencil) I'm after.
These are the elements within which one such link (i would like to grab) is found:
<div class="news">
<span class="news-date_time"><img src="images/arrow.png" alt="">19 Jan 2018 00:01</span>
<a style="color:#5b5b5b;" href="/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17019039003&opt=9">ABB India Limited - Press Release</a>
</div>
Image of the links of that page I would like to grab:
From the very first day while creating this thread I strictly requested not to use this url http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022 to locate the data. I requested any solution from this main_page_link without touching the link within iframe. However, everyone is trying to provide solutions that I've already shown in my post. What did I put a bounty for then?
You can see the links within <iframe> in browser but can't access them programmatically due to Same-origin policy.
There is the example showing how to retrieve the links using XHR and RegEx:
Option Explicit
Sub Test()
Dim sContent As String
Dim sUrl As String
Dim aLinks() As String
Dim i As Long
' Retrieve initial webpage HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/", False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Extract target iframe URL via RegEx
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Process all a within div.news
.Pattern = "<iframe[\s\S]*?src=""([^""]*?Companydetails[^""]*)""[^>]*>"
sUrl = .Execute(sContent).Item(i).SubMatches(0)
End With
' Retrieve iframe HTML content via XHR
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", sUrl, False
.Send
sContent = .ResponseText
End With
'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
' Parse links via XHR
With CreateObject("VBScript.RegExp")
.Global = True
.MultiLine = True
.IgnoreCase = True
' Process all anchors within div.news
.Pattern = "<div class=""news"">[\s\S]*?href=""([^""]*)"
With .Execute(sContent)
ReDim aLinks(0 To .Count - 1)
For i = 0 To .Count - 1
aLinks(i) = .Item(i).SubMatches(0)
Next
End With
End With
Debug.Print Join(aLinks, vbCrLf)
End Sub
Generally RegEx's aren't recommended for HTML parsing, so there is disclaimer. Data being processed in this case is quite simple that is why it is parsed with RegEx.
The output for me as follows:
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17047038016&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17046039003&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17045039006&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043039002&opt=9
/HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043010019&opt=9
I also tried to copy the content of the <iframe> from IE to clipboard (for further pasting to the worksheet) using commands:
IE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT
IE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT
But actually that commands select and copy the main document, excluding the frame, unless I click on the frame manually. So that might be applied if click on the frame could be reproduced from VBA (frame node methods like .focus and .click didn't help).
Something like this should work. They key is to realize the iFrame is technically another Document. Reviewing the iFrame on the page you listed, you can easily use a web request to get at the data you need. As already mentioned, the reason you get an error is due to the Same-Origin policy. You could write something to get the src of the iFrame then do the web request as I've shown below, or, use IE to scrape the page, get the src, then load that page which looks like what you have done.
I would recommend using a web request approach, Internet Explorer can get annoying, fast.
Code
Public Sub SOExample()
Dim html As Object 'To store the HTML content
Dim Elements As Object 'To store the anchor collection
Dim Element As Object 'To iterate the anchor collection
Set html = CreateObject("htmlFile")
With CreateObject("MSXML2.XMLHTTP")
'Navigate to the source of the iFrame, it's another page
'View the source for the iframe. Alternatively -
'you could navigate to this page and use IE to scrape it
.Open "GET", "https://stocks.thehindubusinessline.com/Companydetails.aspx?&cocode=INE117A01022"
.send ""
'See if the request was ok, exit it there was an error
If Not .Status = 200 Then Exit Sub
'Assign the page's HTML to an HTML object
html.body.InnerHTML = .responseText
Set Elements = html.body.document.getElementByID("hmstockchart_CompanyNews1_updateGLVV")
Set Elements = Elements.getElementsByTagName("a")
For Each Element In Elements
'Print out the data to the Immediate window
Debug.Print Element.InnerText
Next
End With
End Sub
Results
ABB India Limited - AGM/Book Closure
Board of ABB India recommends final dividend
ABB India to convene AGM
ABB India to pay dividend
ABB India Limited - Outcome of Board Meeting
More ?
The simple of solution like everyone suggested is to directly go the link. This would take the IFRAME out of picture and it would be easier for you loop through links. But in case you still don't like the approach then you need to get a bit deeper into the hole.
Below is a function from a library I wrote long back in VB.NET
https://github.com/tarunlalwani/ScreenCaptureAPI/blob/2646c627b4bb70e36fe2c6603acde4cee3354b39/Source%20Code/ScreenCaptureAPI/ScreenCaptureAPI/ScreenCapture.vb#L803
Private Function _EnumIEFramesDocument(ByVal wb As HTMLDocumentClass) As Collection
Dim pContainer As olelib.IOleContainer = Nothing
Dim pEnumerator As olelib.IEnumUnknown = Nothing
Dim pUnk As olelib.IUnknown = Nothing
Dim pBrowser As SHDocVW.IWebBrowser2 = Nothing
Dim pFramesDoc As Collection = New Collection
_EnumIEFramesDocument = Nothing
pContainer = wb
Dim i As Integer = 0
' Get an enumerator for the frames
If pContainer.EnumObjects(olelib.OLECONTF.OLECONTF_EMBEDDINGS, pEnumerator) = 0 Then
pContainer = Nothing
' Enumerate and refresh all the frames
Do While pEnumerator.Next(1, pUnk) = 0
On Error Resume Next
' Clear errors
Err.Clear()
' Get the IWebBrowser2 interface
pBrowser = pUnk
If Err.Number = 0 Then
pFramesDoc.Add(pBrowser.Document)
i = i + 1
End If
Loop
pEnumerator = Nothing
End If
_EnumIEFramesDocument = pFramesDoc
End Function
So basically this is a VB.NET version of below C++ version
Accessing body (at least some data) in a iframe with IE plugin Browser Helper Object (BHO)
Now you just need to port it to VBA. The only problem you may have is finding the olelib rerefernce. Rest most of it is VBA compatible
So once you get the array of object, you will find the one which belongs to your frame and then you can just that one
frames = _EnumIEFramesDocument(IE)
frames.Item(1).document.getElementsByTagName("A").length

Web-scraping across multipages without even knowing the last page number

Running my code for a site to crawl the titles of different tutorials spreading across several pages, I found it working flawless. I tried to write some code not depending on the last page number the url has but on the status code until it shows http.status<>200. The code I'm pasting below is working impeccably in this case. However, Trouble comes up when I try to use another url to see whether it breaks automatically but found that the code did fetch all the results but did not break. What is the workaround in this case so that the code will break when it is done and stop the macro? Here is the working one?
Sub WiseOwl()
Const mlink = "http://www.wiseowl.co.uk/videos/default"
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim post As Object
Do While True
y = y + 1
With http
.Open "GET", mlink & "-" & y & ".htm", False
.send
If .Status <> 200 Then
MsgBox "It's done"
Exit Sub
End If
html.body.innerHTML = .responseText
End With
For Each post In html.getElementsByClassName("woVideoListDefaultSeriesTitle")
With post.getElementsByTagName("a")
x = x + 1
If .Length Then Cells(x, 1) = .item(0).innerText
End With
Next post
Loop
End Sub
I found a logic to get around with yellowpage. My update script is able to parse yellowpage but breaks before scraping the last page because there is no "Next Page" button. I tried with this:
"https://www.dropbox.com/s/iptqm79b0byw3dz/Yellowpage.txt?dl=0"
However, the same logic I tried to apply with torrent site but it doesn't work here:
"https://www.yify-torrent.org/genres/western/p-1/"
You can always rely on elements if they exits or not. Here for example, if you try to use the object which you have set your element to, you will get:
Run-time error '91': Object variable or With block variable not set
This is the key you should be looking for to put an end to your code. Please see the below example:
Sub yify()
Const mlink = "https://www.yify-torrent.org/genres/western/p-"
Dim http As New XMLHTTP60, html As New HTMLDocument
Dim post As Object
Dim posts As Object
y = 1
Do
With http
.Open "GET", mlink & y & "/", False
.send
html.body.innerHTML = .responseText
End With
Set posts = html.getElementsByClassName("mv")
On Error GoTo Endofpage
Debug.Print Len(posts) 'to force Error 91
For Each post In posts
With post.getElementsByTagName("div")
x = x + 1
If .Length Then Cells(x, 1) = .Item(0).innerText
End With
Next post
y = y + 1
Endofpage:
Loop Until Err.Number = 91
Debug.Print "It's over"
End Sub

VBScript GetElementsByClassName not supported?

I am trying to convert some of my working VBA code to VBScript, but keep getting errors in VBScript when trying to use the getElementsByClassName method. Here's the full code:
option explicit
Dim XMLPage, html
Set XMLPage = WScript.CreateObject("MSXML2.XMLHTTP")
Set html= CreateObject("HTMLFile")
XMLPage.Open "GET", "https://www.hltv.org/stats/matches/mapstatsid/48745/immortals-vs-dignitas", False
Wscript.Sleep 50
XMLPage.send
Wscript.Sleep 50
If XMLPage.Status <> 200 Then MsgBox XMLPage.statusText
html.Open
html.write XMLPage.responseText
html.Close
'msgbox html.getElementsByTagName("tbody")(0).innertext'WORKS
msgbox html.getElementsByClassName("match-info-box-con")(0).innertext'DOESNT WORK
The last line of code is where the following error occurs:
If I comment that out and run it to search for a tag name instead (code on the line above) - it works fine no problem.
I suspect it has something to do with how the variable html is declared, as from what I understand. getElementsByClassName comes from IHTMLElement6 - but I am unsure on how to get this to work in VBScript.
MSHTML behaves differently depending on how it was instantiated - it exposes different interfaces depending on whether or not its early or late bound (its heavily reliant on IDispatch).
You are late binding and no interface exposing getElementsByClassName is available.
You can loop over document.all() and look at each item.className.
I used similar code to retrieve data from a POST request.
getElementsByClassName worked only if preceded with another command, like "msgbox 1" or anything to halt the script for a fraction of a second.
Then I tried Wscript.Sleep 200, decreasing it to the smallest possible number, and it still worked.
Wscript.Sleep 1 ' This line got it working.
msgbox html.getElementsByClassName("match-info-box-con")(0).innertext'
I took some time to elaborate a working example.
Thanks to GSerg for pointing out the load delay. That is accurate.
Had to tweak the code a little to get it working though.
Based on the previous comments, maybe the behaviour of MSHTML depends on the code being parsed. Hence the aditional meta tag below.
Set objHTTP = CreateObject("MSXML2.XMLHTTP")
Dim htmldoc: Set htmldoc = CreateObject("htmlfile")
'
URL = "https://stackoverflow.com/questions/44853941/vbscript-getelementsbyclassname-not-supported"
' sEnv = ""
'
objHTTP.Open "GET", URL, False
' objHTTP.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
objHTTP.send ' (sEnv)
ttext = objHTTP.responsetext
ttext = "<meta http-equiv=""X-UA-Compatible"" content=""IE=EDGE,chrome=1"" />" & vbnewline & ttext
htmldoc.write ttext
htmldoc.close
htmldoc.designMode = "on" ' Refer to https://developpaper.com/method-of-parsing-html-documents-by-vbs-htmlfile/
WScript.ConnectObject htmldoc, "htmldoc_"
Sub htmldoc_onreadystatechange()
If htmldoc.readyState = "interactive" Then
ttext = htmldoc.getElementsByClassName("fs-headline1").Item(0).innerText
msgbox ttext
Wscript.quit
End If
End Sub
'-----------------
Wscript.Sleep 10000 ' Random timeout
msgbox "Timeout!"

How to differentiate legitmate URL redirects from ISP redirects or other hijacks

I'm trying to use VBA to test URLs for bad links and redirects.
I'm able to obtain redirect URLs using HTTP request GET method for URLs that return a 3xx series .Status response using either MSXML2 or WinHttp libraries through VBA.
When I attempt this method for some URLs that have legitimate (intended by the URL site) redirects, I receive an error when trying to obtain the Request object.
For example, when this URL is entered into a browser:
http://www.teconnectivity.com
...the browser will ultimately arrive at this address:
http://www.te.com/en/home.html
However this VBA code will return Error -2147012744: "The server returned an invalid or unrecognized response". The error is thrown on the .Send statement.
Sub Test()
Debug.Print sGetRedirectURL("http://www.teconnectivity.com/")
End Sub
Private Function sGetRedirectURL(ByVal sURL As String) As String
Dim oRequest As WinHttp.WinHttpRequest
Dim sReturn As String
Set oRequest = New WinHttp.WinHttpRequest
On Error GoTo ErrProc
With oRequest
.Option(WinHttpRequestOption_EnableRedirects) = False
.Open "GET", sURL, False
.Send
If Left$(.Status, 1) = "3" Then
sReturn = .GetResponseHeader("Location")
End If
End With
On Error GoTo 0
ExitProc:
sGetRedirectURL = sReturn
Exit Function
ErrProc:
Debug.Print Err.Number & ": " & Err.Description
Resume ExitProc
End Function
How is my browser able to get to the destination: http://www.te.com/en/home.html, if an HTTP request with that URL returns an error?
I can use the browser's application to get the final destination URL like this:
Function sGetFinalURL(sURL As String) As String
Dim oAppIE As Object
Dim sReturn As String
Set oAppIE = CreateObject("InternetExplorer.Application")
With oAppIE
.Navigate sURL
Do While .Busy
Loop
sReturn = .document.url
.Quit
End With
Set oAppIE = Nothing
sGetFinalURL = sReturn
End Function
However, that approach will also return URLs for ISP redirects or other hijacks.
The approach I ended up taking for this was to use the browser application to try to navigate to a non-existent domain URL. If the browser arrives at a URL, then that means it has been redirected by the ISP or something other than the non-existent domain.
Private Function sGetNXDomainRedirect() As String
'--attempts to return the domain used to redirect non-existant
' domain urls. Used to distinguish legitimate redirects
' from ISP redirects or other hijacks
'--returns "NoRedirect" if the bogus url is not redirected.
Dim sReturn As String
Dim oIEapp As Object
Dim uTest As udtHttpInfo
Const sBOGUS_URL As String = _
"http://wwXYXw.NXDomainToTest"
Set oIEapp = CreateObject("InternetExplorer.Application")
With oIEapp
.Navigate sBOGUS_URL
Do While .Busy
Loop
On Error Resume Next
sReturn = .document.Domain
On Error GoTo 0
.Quit
End With
Set oIEapp = Nothing
If Len(sReturn) = 0 Then sReturn = "NoRedirect"
sGetNXDomainRedirect = sReturn
End Function
Link to complete code on MrExcel.com forum
Once that "ISP Redirect" site is known, then the IE Application can be used to test for bad links or legitimate redirects. If the Navigation through the IE Application returns a domain other than the ISP Redirect domain, then I interpret that as a legitimate redirect.