Text from webpage - vb.net

I need to extract some text from this web page. I want to use the trade feed in my program to analyse market sentiment.
I used the browser control and the get-element command, but it's not working: whenever the browser starts to open the page, I get JavaScript errors.
I also tried the DOM, but it seems I don't quite understand what I need to do :)
Here is the code:
Dim code As String
Using client As New WebClient
code = client.DownloadString("http://openbook.etoro.com/ahanit/#/profile/Trades/")
End Using
Dim htmlDocument As IHTMLDocument2 = New HTMLDocument(code)
htmlDocument.write(htmlDocument)
Dim allElements As IHTMLElementCollection = htmlDocument.body.all
Dim allid As IHTMLElementCollection = allElements.tags("id")
Dim element As IHTMLElement
For Each element In allid
element.title = element.innerText
MsgBox(element.innerText)
Next
Update: I tried the HTML Agility Pack, as suggested in the comments, and I am stuck again on this code:
Dim plain As String = String.Empty
Dim htmldoc As New HtmlAgilityPack.HtmlDocument
htmldoc.LoadHtml("http://openbook.etoro.com/ahanit/#/profile/Trades/")
Dim goodnods As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("THE PROBLEM")
For Each node In goodnods
TextBox1.Text = htmldoc.DocumentNode.InnerText
Next
Any advice on what to do now?
OK, I think I know what the problem is: the div that I need is hidden, and it is not loaded when I load the web page, only the source code. Does anyone know how to load all the hidden divs?
Here is my new code:
Dim doc As New HtmlAgilityPack.HtmlDocument
Dim web As New HtmlWeb
doc = web.Load("http://openbook.etoro.com/ahanit/#/profile/Trades/")
Dim nodes As HtmlNode = doc.GetElementbyId("feed-items")
Dim id As String = nodes.WriteTo()
TextBox1.Text = TextBox1.Text & vbCrLf & id

user1336635,
Welcome to SO! One thing you might try is to look at the page's source, figure out which JavaScript function populates the field you want (using Firebug; I assume it's the one with "trades result in profit" next to it), and then embed that script in a web page that your WebBrowser control loads. That's where I'd start. I checked the source and searched for "trades result in profit" but didn't find anything, which leads me to believe that hunting for the element in the static source might not be possible. Just a starting place until someone with more experience with this chimes in! Best!
-sf
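A note on why the lookup can come back empty: the part of the URL after # is a fragment, which is never sent to the server, and the feed is apparently filled in by script after the page loads, so the downloaded source may simply not contain the element. Below is a minimal sketch of guarding against that, assuming HtmlAgilityPack is referenced. The sample HTML string is made up to stand in for the downloaded source.

```vbnet
Imports HtmlAgilityPack

Module FeedCheck
    Sub Main()
        ' Hypothetical stand-in for the static source: the feed container is absent
        ' because the page's script would only create it after load.
        Dim source As String = "<html><body><div id=""header"">Profile</div></body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(source)

        ' GetElementbyId returns Nothing when no element has that id,
        ' so check before calling WriteTo() or InnerText on the result.
        Dim feed As HtmlNode = doc.GetElementbyId("feed-items")
        If feed Is Nothing Then
            Console.WriteLine("feed-items is not in the downloaded HTML")
        Else
            Console.WriteLine(feed.InnerText)
        End If
    End Sub
End Module
```

If the check reports that the element is missing, no parser working on the raw download will find it; the content has to come from a browser that has run the page's scripts, or from whatever request the script itself makes.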

Related

Creating new Section in onenote using VBA

Hi, I have searched the internet for examples of creating a new OneNote section, but I can't find one that I understand. The closest I have come is the .OpenHierarchy function, but I'm still very new to it and couldn't get the parameters right.
I'm currently working on an OCR macro for multiple PDF files. Everything was working fine until I realised that I'm creating huge waste files on my computer.
Here's the code I use to delete all the pages created in the section:
Dim oneNote As OneNote14.Application
Dim secDoc As MSXML2.DOMDocument60
Set secDoc = New MSXML2.DOMDocument60
Dim secNodes As MSXML2.IXMLDOMNodeList
Set secNodes = secDoc.DocumentElement.getElementsByTagName("one:Section")
' Get the first section.
Dim secNode As MSXML2.IXMLDOMNode
Set secNode = secNodes(0)
Dim sectionName As String
sectionName = secNode.Attributes.getNamedItem("name").Text
Dim sectionID As String
sectionID = secNode.Attributes.getNamedItem("ID").Text
oneNote.DeleteHierarchy (sectionID)
oneNote.OpenHierarchy
End Sub
The DeleteHierarchy function deletes the entire section, leaving no section behind, but my OCR macro requires at least one section to work.
Thanks for reading and thank you in advance!
In VBA, oneNote.OpenHierarchy cannot be called as a statement with its arguments wrapped in parentheses, and that is what causes the error.
Solution:
oneNote.OpenHierarchy fileName, "", "New Section 1", 3

System.UnauthorizedAccessException only using multithreading

I wrote some code to parse web tables.
I get some web tables into an IHTMLElementCollection using Internet Explorer with this code:
TabWeb = IE.document.getelementsbytagname("table")
Then I use a sub that takes an object array containing the IHTMLElementCollection and some other data:
Private Sub TblParsing(ByVal ArrVal() As Object)
    Dim WTab As mshtml.IHTMLElementCollection = ArrVal(0)
    'some code
End Sub
My issue is: if I simply call this code, it works correctly:
Call TblParsing({WTab, LiRow})
but if I try to run it on the thread pool:
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing), {WTab, LiRow})
the code fails and gives me multiple
System.UnauthorizedAccessException
This happens on each of these lines:
Rws = WTab(RifWT("Disc")).Rows.Length
If Not IsError(WTab(6).Cells(1).innertext) Then
Ogg_W = WTab(6).Cells(1).innertext
My goal is to navigate to another web page while my sub performs the parsing.
I want to clarify that:
1) I've tried sending the entire HTML to the sub and loading it into a WebBrowser, but it didn't work because it isn't possible to cast from System.Windows.Forms.HtmlElementCollection to mshtml.IHTMLElementCollection (or I wasn't able to do it);
2) I can't use WebRequest and similar: I'm forced to use InternetExplorer;
3) I can't use System.Windows.Forms.HtmlElementCollection because my parsing code uses Cells, Rows and so on, which are unavailable there (and I don't want to rewrite all my parsing code).
EDIT:
Ok, I modified my code using answer hints as below:
'This in the caller sub
Dim IE As Object = CreateObject("internetexplorer.application")
'...some code
Dim IE_Body As String = IE.document.body.innerhtml
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing_2), {IE_Body, LiRow})
'...some code
'This is the called sub
Private Sub TblParsing_2(ByVal ArrVal() As Object)
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(ArrVal(0))
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim TabWeb As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
'...some code
I get no errors, but I'm not sure that it's all right, because when I tried loading the IE_Body string into a WebBrowser it threw errors in the web page (it shows a popup and I can ignore the errors).
Am I using the right way to get the HTML from Internet Explorer into a string?
EDIT2:
I changed my code to:
Dim IE As New SHDocVw.InternetExplorer
'... some code
Dim sourceIDoc3 As mshtml.IHTMLDocument3 = CType(IE.Document, mshtml.IHTMLDocument3)
Dim html As String = sourceIDoc3.documentElement.outerHTML
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing_2), {html, LiRow})
'... some code
Private Sub TblParsing_2(ByVal ArrVal() As Object)
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(ArrVal(0))
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim TabWeb As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
But I get an error popup like this (I tried to translate it):
Title:
Web page error
Text:
Debug this page?
This page contains errors that might prevent it from displaying or functioning properly.
If you are not testing the web page, click No.
Two checkboxes:
Do not show this message again
Use the built-in script debugger in Internet Explorer
It's the same error I got when loading the HTML text into a WebBrowser.
But if I could ignore this error, I think the code could work!
While the popup is showing, I get an error on
Dim domDoc As New mshtml.HTMLDocument
The error text, translated, is:
Retrieving the COM class factory for component with CLSID {25336920-03F9-11CF-8FD0-00AA00686F13} failed due to the following error: 8001010a. The message filter indicated that the application is busy. (Exception from HRESULT: 0x8001010A (RPC_E_SERVERCALL_RETRYLATER)).
Note that I've already set IE.Silent = True
Edit: There was confusion as to what the OP meant by "Internet Explorer". I originally assumed that it meant the WinForm Webbrowser control; however the OP is creating the COM browser directly instead of using the .Net wrapper.
To get the browser document's defining HTML, you can cast the document against the mshtml.IHTMLDocument3 interface to expose the documentElement property.
Dim ie As New SHDocVw.InternetExplorer ' Proj COM Ref: Microsoft Internet Controls
ie.Navigate("some url")
' ... other stuff
Dim sourceIDoc3 As mshtml.IHTMLDocument3 = CType(ie.Document, mshtml.IHTMLDocument3)
Dim html As String = sourceIDoc3.documentElement.outerHTML
End Edit.
The following is based on my comment above. You can use the WebBrowser.DocumentText property to create an mshtml.HTMLDocument.
Use this property when you want to manipulate the contents of an HTML page displayed in the WebBrowser control using string processing tools.
Once you extract this property as a String, there is no connection to the WebBrowser control and you can process the data in any thread you want.
Dim html As String = WebBrowser1.DocumentText
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(html)
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim tables As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
' ... do something
' cleanup COM objects
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(body)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(tables)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(domDoc)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(domDoc2)

HtmlAgilityPack not finding nodes from HttpWebRequest's returned HTML

I am a little new to HtmlAgilityPack. I want to use my HttpWebRequest, which returns the HTML of a web page, and then parse that HTML with HtmlAgilityPack. I want to find all divs with a specific class and then get the inner text of those divs. This is what I have so far. My GET request successfully returns the web page HTML:
Public Function mygetreq(ByVal myURL as String, ByRef thecookie As CookieContainer)
Dim getreq As HttpWebRequest = DirectCast(HttpWebRequest.Create(myURL), HttpWebRequest)
getreq.Method = "GET"
getreq.KeepAlive = True
getreq.CookieContainer = thecookie
getreq.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
Dim getresponse As HttpWebResponse
getresponse = DirectCast(getreq.GetResponse, HttpWebResponse)
Dim getreqreader As New StreamReader(getresponse.GetResponseStream())
Dim thePage = getreqreader.ReadToEnd
'Clean up the streams and the response.
getreqreader.Close()
getresponse.Close()
Return thePage
End Function
This function returns the html. I then put the html into this:
'The html successfully shows up in the RichTextBox
RichTextBox1.Text = mygetreq("http://someurl.com", thecookie)
Dim htmldoc = New HtmlAgilityPack.HtmlDocument()
htmldoc.LoadHtml(RichTextBox1.Text)
Dim htmlnodes As HtmlNodeCollection
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='someClass']")
If htmlnodes IsNot Nothing Then
For Each node In htmlnodes
MessageBox.Show(node.InnerText())
Next
End If
The problem is that htmlnodes comes back as Nothing (null), so the body of the final If block never runs. It finds nothing, but I KNOW for a fact that this div and class exist in the HTML page because I can see the HTML in RichTextBox1:
<div class="someClass"> This is inner text </div>
What exactly is the problem here? Does htmldoc.LoadHtml not like the type of string that mygetreq returns for the page HTML?
Does this have anything to do with HTML entities? thePage contains literal < and > brackets. They are not encoded as entities.
I also saw someone post here (in C#) to use the HtmlWeb class, but I am not sure how I would set that up. Most of my code is already written with HttpWebRequest.
Thanks for reading and thanks for helping.
If you are willing to switch, you could use CsQuery, something along these lines:
Dim q As New CQ(mygetreq("http://someurl.com", thecookie))
For Each node In q("div.someClass")
Console.WriteLine(node.InnerText)
Next
You may want to add some error handling, but overall this should be a good start for you.
You can add CsQuery to your project via NuGet:
Install-Package CsQuery
And don't forget to use Imports CsQuery at the top of your code file.
This may not directly solve your problem, but should make it easier to experiment with your data (via immediate window, for example).
Interesting read (performance comparison):
CsQuery Performance vs. Html Agility Pack and Fizzler
Using HtmlWeb is truly a simple and good way to work with HtmlAgilityPack. Here is an example:
Private Sub GetHtml()
    Dim HtmlWeb As New HtmlWeb
    Dim HtmlDoc As HtmlDocument
    Dim NodeCollection As HtmlNodeCollection
    Dim URL As String = "" ' put the page address here
    HtmlDoc = HtmlWeb.Load(URL) ' Notice that I used Load (fetches the URL), not LoadHtml (parses a string)
    NodeCollection = HtmlDoc.DocumentNode.SelectNodes("//div[@class='someClass']") ' put your XPath here
    ' SelectNodes returns Nothing when no node matches, so check before looping
    If NodeCollection IsNot Nothing Then
        For Each Node As HtmlNode In NodeCollection
            MsgBox(Node.InnerText)
        Next
    End If
End Sub
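If you would rather keep the existing HttpWebRequest code, LoadHtml on the returned string should work as well. Here is a minimal sketch, assuming HtmlAgilityPack is referenced; the pageHtml string is a made-up stand-in for whatever mygetreq returns. The detail to check is that XPath attribute tests are written with @, as in @class.

```vbnet
Imports HtmlAgilityPack

Module DivTextDemo
    Sub Main()
        ' Hypothetical stand-in for the string returned by mygetreq(...)
        Dim pageHtml As String = "<html><body><div class=""someClass""> This is inner text </div></body></html>"

        Dim htmldoc As New HtmlDocument()
        htmldoc.LoadHtml(pageHtml)

        ' "@class" tests the class attribute; SelectNodes returns Nothing when no node matches
        Dim nodes As HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//div[@class='someClass']")
        If nodes IsNot Nothing Then
            For Each node As HtmlNode In nodes
                Console.WriteLine(node.InnerText.Trim())
            Next
        End If
    End Sub
End Module
```

Note that the equality test only matches when the class attribute is exactly someClass. For elements carrying several classes, an XPath such as //div[contains(@class,'someClass')] is a common, if looser, alternative.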

How do I search through a string for a particular hyperlink in Visualbasic.net?

I have written a program which downloads a web page's source, and now I want to search the source for a particular link. I know the link is written like this:
<a href="/internet/A2"><b>Geographical Survey Work</b></a>
Is there any way of using "Geographical Survey Work" as the criterion to retrieve the link? The code I am using to download the source to a string is this:
Dim sourcecode As String = ((New Net.WebClient).DownloadString("http://examplesite.com"))
So, just to clarify: I want to type "Geographical Survey Work" into an input box, for instance, and have "/internet/A2" pop up in a message box. I think it can be done using a regex, but that's a bit beyond me. Any help would be great.
With HTMLAgilityPack:
Dim vsPageHTML As String = "<html>... your webpage HTML code ...</html>"
Dim voHTMLDoc As New HtmlAgilityPack.HtmlDocument
voHTMLDoc.LoadHtml(vsPageHTML) : vsPageHTML = ""
Dim vsURI As String = ""
' Select every <a> element that has an href attribute
Dim voNodes As HtmlAgilityPack.HtmlNodeCollection = voHTMLDoc.DocumentNode.SelectNodes("//a[@href]")
If Not IsNothing(voNodes) Then
    For Each voNode As HtmlAgilityPack.HtmlNode In voNodes
        If voNode.InnerHtml.ToLower() = "<b>geographical survey work</b>" Then
            vsURI = voNode.GetAttributeValue("href", "")
            Exit For
        End If
    Next
End If
voNodes = Nothing : voHTMLDoc = Nothing
Do whatever you want with vsURI.
You might need to tweak the code a bit, as I'm writing free-hand.
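Since the question mentions a regex, here is a rough sketch of that route using only System.Text.RegularExpressions. The FindHref helper and the sample string are made up for illustration, and a regex like this only copes with simple, regular markup; an HTML parser is the more robust choice.

```vbnet
Imports System.Text.RegularExpressions

Module LinkFinder
    ' Hypothetical helper: returns the href of the first <a> whose inner HTML
    ' contains linkText, or an empty string when nothing matches.
    Function FindHref(html As String, linkText As String) As String
        ' Group 1 captures the href value, group 2 the anchor's inner HTML
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*[""']([^""']*)[""'][^>]*>(.*?)</a>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        For Each m As Match In Regex.Matches(html, pattern, options)
            If m.Groups(2).Value.IndexOf(linkText, StringComparison.OrdinalIgnoreCase) >= 0 Then
                Return m.Groups(1).Value
            End If
        Next
        Return String.Empty
    End Function

    Sub Main()
        ' Made-up sample standing in for the downloaded source
        Dim sample As String = "<p><a href=""/internet/A1"">Other</a> <a href=""/internet/A2""><b>Geographical Survey Work</b></a></p>"
        Console.WriteLine(FindHref(sample, "Geographical Survey Work"))
    End Sub
End Module
```

With the sample string above, the call prints /internet/A2. In the real program you would pass sourcecode and the text from the input box instead.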

HtmlAgilityPack - getting error when looping through nodes. Doesn't make sense

I'm trying to get all nodes below but I am getting an error message of:
Overload resolution failed because no accessible 'GetAttributeValue' accepts this number of arguments.
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
Dim docNode As HtmlNode = doc.DocumentNode
Dim nodes As HtmlNodeCollection = docNode.SelectNodes("//input")
For Each node As HtmlNode In nodes
Dim id As String = node.GetAttributeValue("id")
Next
Any ideas on why I am getting this error message? Thanks
You need to provide a default value as a second parameter to GetAttributeValue:
Dim id As String = node.GetAttributeValue("id", "")
Update for updated question
In addition to the above fix, you are retrieving the HtmlDocument incorrectly. HtmlDocument.Load reads from a file or stream, and HtmlDocument.LoadHtml parses an HTML string; neither retrieves the page from the web server.
You need to modify your code to fetch the data from the URL using HtmlWeb. Replace the following lines:
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
with these:
Dim doc As HtmlDocument
Dim web As New HtmlWeb
doc = web.Load("http://shaggybevo.com/board/register.php")