System.UnauthorizedAccessException only when using multithreading - vb.net

I wrote some code to parse web tables.
I get the tables into an IHTMLElementCollection using Internet Explorer with this code:
TabWeb = IE.document.getelementsbytagname("table")
Then I use a sub that receives an object array containing the IHTMLElementCollection and some other data:
Private Sub TblParsing(ByVal ArrVal() As Object)
Dim WTab As mshtml.IHTMLElementCollection = ArrVal(0)
'some code
End Sub
My issue is: if I simply "call" this code, it works correctly:
Call TblParsing({WTab, LiRow})
but if I try to run it on a thread pool:
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing), {WTab, LiRow})
the code fails and gives me multiple
System.UnauthorizedAccessException
This happens on (each of) these lines:
Rws = WTab(RifWT("Disc")).Rows.Length
If Not IsError(WTab(6).Cells(1).innertext) Then
Ogg_W = WTab(6).Cells(1).innertext
My goal is to navigate to another web page while my sub performs the parsing.
I want to clarify that:
1) I've tried to send the entire HTML to the sub and load it into a WebBrowser, but it didn't work because it isn't possible to cast from System.Windows.Forms.HtmlElementCollection to mshtml.IHTMLElementCollection (or at least I wasn't able to do it);
2) I can't use WebRequest or similar: I'm forced to use InternetExplorer;
3) I can't use System.Windows.Forms.HtmlElementCollection because my parsing code uses Cells, Rows and so on, which are unavailable there (and I don't want to rewrite all my parsing code).
EDIT:
OK, I modified my code following the hints in the answer, as below:
'This in the caller sub
Dim IE As Object = CreateObject("internetexplorer.application")
'...some code
Dim IE_Body As String = IE.document.body.innerhtml
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing_2), {IE_Body, LiRow})
'...some code
'This is the called sub
Private Sub TblParsing_2(ByVal ArrVal() As Object)
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(ArrVal(0))
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim TabWeb As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
'...some code
I get no errors, but I'm not sure everything is right, because when I tried to load the IE_Body string into a WebBrowser it threw errors in the web page (it shows a popup and I can choose to ignore the errors).
Am I using the right way to get the HTML from Internet Explorer into a string?
EDIT2:
I changed my code to:
Dim IE As New SHDocVw.InternetExplorer
'... some code
Dim sourceIDoc3 As mshtml.IHTMLDocument3 = CType(IE.Document, mshtml.IHTMLDocument3)
Dim html As String = sourceIDoc3.documentElement.outerHTML
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing_2), {html, LiRow})
'... some code
Private Sub TblParsing_2(ByVal ArrVal() As Object)
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(ArrVal(0))
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim TabWeb As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
But I get an error popup like this (I tried to translate it):
Title:
Web page error
Text:
Debug this page?
This page contains errors that might prevent it from displaying or functioning properly.
If you are not testing the web page, click No.
Two checkboxes:
Do not show this message again
Use the script debugger built into Internet Explorer
It's the same error I got when trying to load the HTML text into a WebBrowser.
But if I could ignore this error, I think the code would work!
While the popup is showing I get an error on
Dim domDoc As New mshtml.HTMLDocument
The translated error text is:
Retrieving the COM class factory for component with CLSID {25336920-03F9-11CF-8FD0-00AA00686F13} failed due to the following error: The message filter indicated that the application is busy. (Exception from HRESULT: 0x8001010A (RPC_E_SERVERCALL_RETRYLATER)).
Note that I've already set IE.silent = True.

Edit: There was confusion as to what the OP meant by "Internet Explorer". I originally assumed that it meant the WinForms WebBrowser control; however, the OP is creating the COM browser directly instead of using the .NET wrapper.
To get the HTML that defines the browser's document, you can cast the document to the mshtml.IHTMLDocument3 interface to expose the documentElement property.
Dim ie As New SHDocVw.InternetExplorer ' Proj COM Ref: Microsoft Internet Controls
ie.Navigate("some url")
' ... other stuff
Dim sourceIDoc3 As mshtml.IHTMLDocument3 = CType(ie.Document, mshtml.IHTMLDocument3)
Dim html As String = sourceIDoc3.documentElement.outerHTML
End Edit.
The following is based on my comment above. You can use the WebBrowser.DocumentText property to create an mshtml.HTMLDocument.
Use this property when you want to manipulate the contents of an HTML page displayed in the WebBrowser control using string processing tools.
Once you extract this property as a String, there is no connection to the WebBrowser control and you can process the data in any thread you want.
Dim html As String = WebBrowser1.DocumentText
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(html)
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim tables As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
' ... do something
' cleanup COM objects
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(body)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(tables)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(domDoc)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(domDoc2)
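Either way the HTML ends up in a plain String, which can then be queued to the thread pool (the OP's stated goal) and parsed in a standalone document there. Below is only a minimal sketch along the lines of the OP's TblParsing_2; the DoParse name is illustrative, and the key point is that only the String crosses threads, while every mshtml object is created and released on the worker thread. It matches what the OP reported working in the edits above, though creating an HTMLDocument on a thread-pool (MTA) thread may not behave identically in every environment.
Dim html As String = sourceIDoc3.documentElement.outerHTML ' or WebBrowser1.DocumentText
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf DoParse), html)

Private Sub DoParse(ByVal state As Object)
' Build a standalone DOM from the plain string; nothing here touches the live browser.
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(CStr(state))
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim tables As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
' ... parse Rows, Cells and so on here ...
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(tables)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(body)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(domDoc)
End Sub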

Related

Using MSXML to fetch data from website

I am trying to use the following code to geocode a bunch of cities from the website mygeoposition.com, but there seems to be some sort of issue and the variable 'Lat' in the following code always comes back empty:
Sub Code()
Dim IE As MSXML2.XMLHTTP60
Set IE = New MSXML2.XMLHTTP60
IE.Open "GET", "http://mygeoposition.com/?q=Chuo-ku, Osaka", False
IE.send
While IE.ReadyState <> 4
DoEvents
Wend
Dim HTMLDoc As MSHTML.HTMLDocument
Dim htmlBody As MSHTML.htmlBody
Set HTMLDoc = New MSHTML.HTMLDocument
Set htmlBody = HTMLDoc.body
htmlBody.innerHTML = IE.responseText
Lat = HTMLDoc.getElementById("geodata-lat").innerHTML
IE.abort
End Sub
I have other code that uses the browser to do the same thing and it works fine, but it gets quite slow. When I use this code with MSXML, it just doesn't work. Apologies, I am a newbie at using VBA for data extraction from websites. Please help.
The response contains no content in the geodata-lat element. It appears that client-side code fills in that data, so your response only contains the HTML that the server generated. I tried this out myself and inspected the section of the response you are looking for; the geodata-lat element is empty.
If you try an element that does have content in the server's HTML (geodata-kml-button), the same approach does pull in a value ("Download KML file").
If they don't have a real API then I don't think you can get your data this way.

VBA on Webpage throws automation, unspecified error at HTML document

I'm doing automation for a web page. I just need to do a simple task (set a value on the page) but I cannot get it to work. Here's an excerpt from my code.
Dim IExp As SHDocVw.InternetExplorer
Dim hDoc As HTMLDocument
Dim hCol As MSHTML.IHTMLElementCollection
Dim hInp As MSHTML.HTMLInputElement
Dim hPoint As MSHTML.tagPOINT
Set IExp = New SHDocVw.InternetExplorer
IExp.Visible = True
IExp.navigate "http://somesite.com/"
Do Until IExp.Busy = False
DoEvents
Loop
Set hDoc = IExp.document
' Find the "search for" input box on the page
Set hCol = hDoc.getElementsByTagName("input")
For Each hInp In hCol
The code throws an exception at this line
Set hDoc = IExp.document
Error says:
Run-time error '-2147467259 (80004005)':
Automation error
Unspecified error
I just want it to put a single string onto the web page.
Things that might help:
- I just need the htmldoc line to run without error.
- By the way, I'm using IE9 and MS Excel 2010. I added the vb.net tag because I know I could adapt some syntax to get a VB.NET answer working in VBA.
- I tried changing my references and the declared variables. I used MSXML2.XMLHTTP60, and also the version with CreateObject("InternetExplorer.Application"), but they don't work either.
Thanks in advance guys!
Hmm, I think the problem might be with what you have referenced, and the fact that you don't specify which library you are referencing HTMLDocument from:
Try changing
Dim hDoc As HTMLDocument
To:
Dim hDoc As MSHTML.HTMLDocument
If that sorts it out, make sure you don't have multiple libraries referenced that have an HTMLDocument object.
I just tested the code in Office 2013 referencing 'Microsoft Internet Controls' and 'Microsoft HTML Object Library' only and it worked fine for me.
I'm sure I've had a similar problem in the past, when I had different XML libraries referenced that both exposed an xDocument object.

MSXML2.XMLHTTP page request: How do you make sure you get ALL of the final HTML code?

I've used this simple subroutine for loading HTML documents from the web for some time now with no problems:
Function GetSource(sURL As String) As Variant
' Purpose: To obtain the HTML text of a web page
' Receives: The URL of the web page
' Returns: The HTML text of the web page in a variant
Dim oXHTTP As Object, n As Long
Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
oXHTTP.Open "GET", sURL, False
oXHTTP.send
GetSource = oXHTTP.responsetext
Set oXHTTP = Nothing
End Function
but I've run into a situation where it only loads part of a page most of the time (not always -- sometimes it loads all of the expected HTML code). If you SAVE the HTML of the page to another file on the web from a browser, the subroutine will always read it with no problem.
I'm guessing that the issue is timing -- that the dynamic page registers "done" while a script is still filling in details. Sometimes it completes in time, other times it doesn't.
Has anyone ever encountered this behavior before and surmounted it? It seems that there should be a way of capturing, via the MSXML2.XMLHTTP object, exactly what you'd get if you went to the page and chose the save-to-HTML option.
If you'd like to see the behavior for yourself, here's a sample of a page that doesn't load consistently:
http://www.tiff.net/festivals/thefestival/programmes/specialpresentations/mr-turner
and here's a saved HTML file of that same page:
http://tofilmfest.ca/2014/film/fest/Mr_Turner.htm
Is there any known workaround for this?
I found a workaround that gives me what I want. I control Internet Explorer programmatically and invoke a three-second delay after I tell it to navigate to a page, to let the content finish loading. Then I extract the HTML code by using an IHTMLElement from Microsoft's HTML library. It's not pretty, but it retrieves all of the HTML code for every page I've tried it with. If anybody has a better way of accomplishing the same end, feel free to show off.
Function testbrowser() As Variant
Dim oIE As InternetExplorer
Dim hElm As IHTMLElement
Set oIE = New InternetExplorer
oIE.Height = 600
oIE.Width = 800
oIE.Visible = True
oIE.Navigate "http://www.tiff.net/festivals/thefestival/programmes/galapresentations/the-riot-club"
Call delay(3)
Set hElm = oIE.Document.all.tags("html").Item(0)
testbrowser = hElm.outerHTML
End Function
Sub delay(ByVal secs As Integer)
Dim datLimit As Date
datLimit = DateAdd("s", secs, Now())
While Now() < datLimit
Wend
End Sub
Following Alex's suggestion, here's how to do it without a brute force fixed delay:
Function GetHTML(ByVal strURL As String) As Variant
Dim oIE As InternetExplorer
Dim hElm As IHTMLElement
Set oIE = New InternetExplorer
oIE.Navigate strURL
Do While (oIE.Busy Or oIE.ReadyState <> READYSTATE_COMPLETE)
DoEvents
Loop
Set hElm = oIE.Document.all.tags("html").Item(0)
GetHTML = hElm.outerHTML
Set oIE = Nothing
Set hElm = Nothing
End Function

HtmlAgilityPack not finding nodes from HttpWebRequest's returned HTML

I am a little new to HtmlAgilityPack. I want to use my HttpWebRequest, which returns the HTML of a web page, and then parse that HTML with HtmlAgilityPack. I want to find all divs with a specific class and then get the inner text of what is inside those divs. This is what I have so far. My GET request successfully returns the page HTML:
Public Function mygetreq(ByVal myURL As String, ByRef thecookie As CookieContainer) As String
Dim getreq As HttpWebRequest = DirectCast(HttpWebRequest.Create(myURL), HttpWebRequest)
getreq.Method = "GET"
getreq.KeepAlive = True
getreq.CookieContainer = thecookie
getreq.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
Dim getresponse As HttpWebResponse
getresponse = DirectCast(getreq.GetResponse, HttpWebResponse)
Dim getreqreader As New StreamReader(getresponse.GetResponseStream())
Dim thePage = getreqreader.ReadToEnd
'Clean up the streams and the response.
getreqreader.Close()
getresponse.Close()
Return thePage
End Function
This function returns the html. I then put the html into this:
'The html successfully shows up in the RichTextBox
RichTextBox1.Text = mygetreq("http://someurl.com", thecookie)
Dim htmldoc = New HtmlAgilityPack.HtmlDocument()
htmldoc.LoadHtml(RichTextBox1.Text)
Dim htmlnodes As HtmlNodeCollection
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='someClass']")
If htmlnodes IsNot Nothing Then
For Each node In htmlnodes
MessageBox.Show(node.InnerText())
Next
End If
The problem is that htmlnodes comes back as Nothing, so the final If block won't run. It finds nothing, but I KNOW for a fact that this div and class exist in the HTML page, because I can see the HTML in RichTextBox1:
<div class="someClass"> This is inner text </div>
What exactly is the problem here? Does htmldoc.LoadHtml not like the kind of string that mygetreq returns for the page HTML?
Does this have anything to do with HTML entities? thePage contains literal < and > brackets; they are not encoded as entities.
I also saw someone here post (in C#) about using the HtmlWeb class, but I am not sure how I would set that up. Most of my code is already written with HttpWebRequest.
Thanks for reading and thanks for helping.
If you are willing to switch, you could use CsQuery, something along these lines:
Dim q As New CQ(mygetreq("http://someurl.com", thecookie))
For Each node In q("div.someClass")
Console.WriteLine(node.InnerText)
Next
You may want to add some error handling, but overall it should be a good start for you.
You can add CsQuery to your project via NuGet:
Install-Package CsQuery
And don't forget to use Imports CsQuery at the top of your code file.
This may not directly solve your problem, but should make it easier to experiment with your data (via immediate window, for example).
Interesting read (performance comparison):
CsQuery Performance vs. Html Agility Pack and Fizzler
Using HtmlWeb is truly a simple and good way to work with HtmlAgilityPack... here is an example:
Private Sub GetHtml()
Dim HtmlWeb As New HtmlWeb
Dim HtmlDoc As HtmlDocument
Dim NodeCollection As HtmlNodeCollection
Dim URL As String = ""
HtmlDoc = HtmlWeb.Load(URL) 'Notice that I used Load, not LoadHtml
NodeCollection = HtmlDoc.DocumentNode.SelectNodes("put your XPath here")
For Each Node As HtmlNode In NodeCollection
If IsNothing(Node) = False Then
MsgBox(Node.InnerText)
End If
Next
End Sub
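Applied to the case in the question, a minimal sketch might look like the following. This is only an illustration: it assumes HtmlAgilityPack is referenced, reuses the http://someurl.com placeholder and the someClass class name from the question, and skips the cookie handling that mygetreq does:
Dim web As New HtmlAgilityPack.HtmlWeb()
Dim doc As HtmlAgilityPack.HtmlDocument = web.Load("http://someurl.com")
' Select every div whose class attribute is exactly "someClass".
Dim nodes As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='someClass']")
If nodes IsNot Nothing Then
For Each node As HtmlAgilityPack.HtmlNode In nodes
MessageBox.Show(node.InnerText)
Next
End If
The same XPath also works with the LoadHtml approach from the question; SelectNodes simply returns Nothing when nothing matches, so the Nothing check stays useful either way.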

Text from webpage

I need to get some text from this web page. I want to use the trade feed for my program to analyse the sentiment of the markets.
I used the browser control and the get-element methods, but it's not working. The problem is that whenever my browser starts to open the page I get JavaScript errors.
I tried with the DOM, but it seems that I don't quite understand what I need to do :)
Here is the code:
Dim code As String
Using client As New WebClient
code = client.DownloadString("http://openbook.etoro.com/ahanit/#/profile/Trades/")
End Using
Dim htmlDocument As IHTMLDocument2 = CType(New HTMLDocument(), IHTMLDocument2)
htmlDocument.write(code)
Dim allElements As IHTMLElementCollection = htmlDocument.body.all
Dim allid As IHTMLElementCollection = allElements.tags("id")
Dim element As IHTMLElement
For Each element In allid
element.title = element.innerText
MsgBox(element.innerText)
Next
Update: So I tried the HTML Agility Pack, as suggested in the comments, and I am stuck again on this code:
Dim plain As String = String.Empty
Dim htmldoc As New HtmlAgilityPack.HtmlDocument
htmldoc.LoadHtml("http://openbook.etoro.com/ahanit/#/profile/Trades/")
Dim goodnods As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("THE PROBLEM")
For Each node In goodnods
TextBox1.Text = htmldoc.DocumentNode.InnerText
Next
Any advice on what to do now?
OK, I think I know what the problem is: somehow the div that I need is hidden, and it isn't loaded when I load the web page, just the source code. Does anyone know how to load all the hidden divs?
Here is my new code:
Dim doc As New HtmlAgilityPack.HtmlDocument
Dim web As New HtmlWeb
doc = web.Load("http://openbook.etoro.com/ahanit/#/profile/Trades/")
Dim nodes As HtmlNode = doc.GetElementbyId("feed-items")
Dim id As String = nodes.WriteTo()
TextBox1.Text = TextBox1.Text & vbCrLf & id
user1336635,
Welcome to SO! Something you might try is to check out the page's source code, figure out which JavaScript function is populating the field you want (using Firebug - I assume it's the one with "trades result in profit" next to it), and then embed that script into a web page that your WebBrowser control loads. That's where I'd try to start. I checked the source code and searched for "trades result in profit" and didn't find anything, which leads me to believe that hunting for the element might not be possible. Just a starting place until someone with more experience chimes in!! Best!
-sf
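For what it's worth, a variant of the wait-and-scrape workaround shown in the MSXML2.XMLHTTP question above, adapted to this page, might look like the sketch below. It is only an illustration: ScriptErrorsSuppressed is there to silence the JavaScript error popups mentioned in the question, "feed-items" is the id from the update, WebBrowser1 and TextBox1 are assumed form controls, and because the feed may be filled in asynchronously the element can still be empty at DocumentCompleted, so a delay or polling (as in the earlier three-second workaround) may also be needed.
Private Sub LoadFeed()
' Let the page's scripts run inside the WebBrowser control instead of downloading raw HTML.
WebBrowser1.ScriptErrorsSuppressed = True
AddHandler WebBrowser1.DocumentCompleted, AddressOf OnFeedPageLoaded
WebBrowser1.Navigate("http://openbook.etoro.com/ahanit/#/profile/Trades/")
End Sub

Private Sub OnFeedPageLoaded(ByVal sender As Object, ByVal e As WebBrowserDocumentCompletedEventArgs)
' Read the feed container once the document reports it has finished loading.
Dim feed As HtmlElement = WebBrowser1.Document.GetElementById("feed-items")
If feed IsNot Nothing Then
TextBox1.Text = feed.InnerText
End If
End Sub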