Web parsing issue using VB - vb.net

I am very new to VB.NET and currently learning how to scrape and parse websites. My problem in a nutshell is - if I use “getElementsByClassName” more than one time in my code, it will only work the first time. Same situation with “getElementsByTagName”. And even when I just parse html code manually it will only work the first time.
Here is an example using “getElementsByClassName”. I have Form1 with Button 1 and ListBox1. I am trying to get news titles from two websites (Google and BBC) and then put them into the ListBox1. You can see I split my code into two parts. I would like to point out that both parts work very well and get the information I need, but only when used individually. When put together like in the example below, the first part (Google) will execute without problems but the second part (BBC) will give me an error on line “Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")”.
Now what’s more interesting, if I flip the code around and put the BBC part first and Google second, BBC will execute without problems and Google will give me error on line “Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")”. Basically whichever is first executes without problems, second one fails.
The error message shows “An unhandled exception of type 'System.NotSupportedException' occurred in Microsoft.VisualBasic.dll Additional information: Exception from HRESULT: 0x800A01B6”.
Example 1:
Public Class Form1
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
'START OF PART 1
'Creating and navigating the IE browser to Google news page
Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
FirstBrowser.Visible = True
FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
Do
Application.DoEvents()
Loop Until FirstBrowser.readyState = 4
'Getting the titles from Google news page and adding them to ListBox1
Dim ItemGoogle As Object
Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")
For Each ItemGoogle In AllItemsGoogle
ListBox1.Items.Add(ItemGoogle.InnerText)
Next ItemGoogle
'Closing the browser
FirstBrowser.Quit()
'END OF PART1
'START OF PART 2
'Creating and navigating the IE browser to BBC news page
Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
SecondBrowser.Visible = True
SecondBrowser.Navigate("http://www.bbc.com/news")
Do
Application.DoEvents()
Loop Until SecondBrowser.readyState = 4
'Getting the titles from BBC news page and adding them to ListBox1
Dim ItemBBC As Object
Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")
For Each ItemBBC In AllItemsBBC
ListBox1.Items.Add(ItemBBC.InnerText)
Next ItemBBC
'Closing the browser
SecondBrowser.Quit()
'END OF PART 2
End Sub
End Class
My second example is me parsing same websites by basically just finding the phrases I need. Same situation, Google part works, BBC fails on line “Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML”.
Flip it around and BBC works and Google fails on line “Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML”.
The error message shows “An unhandled exception of type 'System.MissingMemberException' occurred in Microsoft.VisualBasic.dll Additional information: Public member 'InnerHTML' on type 'JScriptTypeInfo' not found.”
Example 2
Public Class Form1
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
'START OF PART 1
'Creating and navigating the IE browser to Google news page
Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
FirstBrowser.Visible = True
FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
Do
Application.DoEvents()
Loop Until FirstBrowser.readyState = 4
'Getting the titles from Google news page and adding them to ListBox1
Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML
Dim start_of_code_google As String
Dim code_selection_google As String
Do
Application.DoEvents()
start_of_code_google = InStr(the_html_code_google, "titletext")
If start_of_code_google > 0 Then
code_selection_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
the_html_code_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
code_selection_google = Mid(code_selection_google, 1, InStr(code_selection_google, Chr(60)) - 1)
ListBox1.Items.Add(code_selection_google)
End If
Loop Until start_of_code_google = 0
'Closing the browser
FirstBrowser.Quit()
'END OF PART1
'START OF PART 2
'Creating and navigating the IE browser to BBC news page
Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
SecondBrowser.Visible = True
SecondBrowser.Navigate("http://www.bbc.com/news")
Do
Application.DoEvents()
Loop Until SecondBrowser.readyState = 4
'Getting the titles from BBC news page and adding them to ListBox1
Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML
Dim start_of_code_bbc As String
Dim code_selection_bbc As String
Do
Application.DoEvents()
start_of_code_bbc = InStr(the_html_code_bbc, "title-link__title-text")
If start_of_code_bbc > 0 Then
code_selection_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
the_html_code_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
code_selection_bbc = Mid(code_selection_bbc, 1, InStr(code_selection_bbc, Chr(60)) - 1)
ListBox1.Items.Add(code_selection_bbc)
End If
Loop Until start_of_code_bbc = 0
'Closing the browser
SecondBrowser.Quit()
'END OF PART 2
End Sub
End Class
Another thing worth mentioning is that if I use one method of parsing for the Google part and a different method for BBC, everything works great.
I must be missing something due to my inexperience with Visual Studio. I am using Express 2013 for Windows Desktop version. If you know what’s causing this issue, I would greatly appreciate your advice.

Related

Controlling Web browser through VB.net application

I want to open separate instances of internet explorer to specified websites. After they are open I would like to cycle through them being displayed on a timer.
I have the following code, but I am not able to switch to the specified IE process:
Dim rotatethrough As Boolean = True
For i = 0 To ListBox1.Items.Count - 1 'I have a list box that contains the website URLs
Dim Processname As New List(Of String)
Dim processnum(Environment.ProcessorCount) As Process
processnum(i) = New Process
processnum(i) = System.Diagnostics.Process.Start(ListBox1.Items(i)) 'start up separate instances of IE for each website
Next
Do While rotatethrough = True
For n = 0 To ListBox1.Items.Count - 1
AppActivate(processnum(n).Id) 'activate the websites
Threading.Thread.Sleep(1000)
Next
Loop
So far the code opens the separate instances of IE, but it fails on AppActivate with "Object reference not set to an instance of an object.". I have tried creating IE as an object, but then I am not sure how to get the correct process ID for the corresponding ListBox item.
Any help would be awesome; I just can't seem to figure this one out.

Populate Web Form From VB Application

I have created a simple FORM in VB.NET that takes some details and then needs to log in to 3 locations using this information.
At the moment I have the code so it takes this data from the textBoxs and assigns them to 4 different variables. From there I have also opened up the three different websites.
I am having difficulties finding how I will take the variables and then populate the corresponding field on the web application. Any suggestions?
My Code:
Public Class Form1
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
'Define Store variable
Dim Store As String
Store = Me.TextBox1.Text
'Define IP Address variable
Dim IPAddress As String
IPAddress = Me.TextBox2.Text
'Define Username variable
Dim Username As String
Username = Me.TextBox3.Text
'Define Password variable
Dim Password As String
Password = Me.TextBox4.Text
' Open Store Specific URL 1
Dim WebAddress1 As String = "http://" & IPAddress & ":"
Process.Start(WebAddress1)
getElementByName
' Open Store Specific URL 2
Dim WebAddress2 As String = "http://somedomain2.com"
Process.Start(WebAddress2)
' Open Store Specific URL 3
Dim WebAddress3 As String = "http://somedomain3.com"
Process.Start(WebAddress3)
End Sub
End Class
What you need to do is identify the name of the element that you want to populate. This can typically be done by going to the web page and choosing View Source (how you get there varies by browser: in some you can right-click the page, in others it is under the settings button).
Once you are looking at the source, find the object (usually a text box or something along those lines) that you want to send the information to. These boxes usually have labels, like Username or Password, so I would recommend doing a Ctrl + F search based on the text you can see on the site. I see in your code you have GetElementByName, and that's exactly what you'll do. You will want to store the matching element in a variable so that you can set its value.
Here's some example code:
Dim IE As Object 'Internet explorer object
Dim objCollection As Object 'Variable used for cycling through different elements
'Create IE Object
IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.Navigate("https://somewebsite.com/") 'Your website
Do While IE.Busy
Application.DoEvents() 'This allows the site to load first
Loop
'Find the field you are looking for and store it into the objCollection variable
objCollection = IE.document.getelementsbyname("CustomerInfo.AccountNumber") 'The "CustomerInfo.AccountNumber" is the name of the element I looked for in this case.
'Call element, and set value equal to the data you have from your form
objCollection(0).Value = MainForm.tbLoan.Text
' Clean up
IE = Nothing
objCollection = Nothing
This should be a good start for you. There are multiple resources on this site that might be able to give you additional information when it comes to entering data into websites using vb.net.
Hopefully this helps!

System.UnauthorizedAccessException only using multithreading

I wrote some code to parse some web tables.
I get some web tables into an IHTMLElementCollection using Internet Explorer with this code:
TabWeb = IE.document.getelementsbytagname("table")
Then I use a Sub that takes an object array containing the IHTMLElementCollection and some other data:
Private Sub TblParsing(ByVal ArrVal() As Object)
Dim WTab As mshtml.IHTMLElementCollection = ArrVal(0)
'some code
End Sub
My issue is: if I simply "call" this code, it works correctly:
Call TblParsing({WTab, LiRow})
but, if I try to run it into a threadpool:
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing), {WTab, LiRow})
the code fails and gives me multiple
System.UnauthorizedAccessException
This happens on (each of) these code rows:
Rws = WTab(RifWT("Disc")).Rows.Length
If Not IsError(WTab(6).Cells(1).innertext) Then
Ogg_W = WTab(6).Cells(1).innertext
My goal is to navigate to another web page while my sub performs the parsing.
I want to clarify that:
1) I've tried sending the entire HTML to the sub and loading it into a WebBrowser, but it didn't work because it isn't possible to cast from System.Windows.Forms.HtmlElementCollection to mshtml.IHTMLElementCollection (or I wasn't able to do it);
2) I can't use WebRequest or similar: I'm forced to use InternetExplorer;
3) I can't use System.Windows.Forms.HtmlElementCollection because my parsing code uses Cells, Rows and so on, which are unavailable there (and I don't want to rewrite all my parsing code).
EDIT:
OK, I modified my code using the hints from the answer, as below:
'This in the caller sub
Dim IE As Object = CreateObject("internetexplorer.application")
'...some code
Dim IE_Body As String = IE.document.body.innerhtml
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing_2), {IE_Body, LiRow})
'...some code
'This is the called sub
Private Sub TblParsing_2(ByVal ArrVal() As Object)
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(ArrVal(0))
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim TabWeb As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
'...some code
I get no errors, but I'm not sure it's all right, because when I tried loading the IE_Body string into a WebBrowser it raised errors in the web page (it shows a popup and I can ignore the errors).
Am I getting the HTML from Internet Explorer into a string the right way?
EDIT2:
I changed my code to:
Dim IE As New SHDocVw.InternetExplorer
'... some code
Dim sourceIDoc3 As mshtml.IHTMLDocument3 = CType(IE.Document, mshtml.IHTMLDocument3)
Dim html As String = sourceIDoc3.documentElement.outerHTML
ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf TblParsing_2), {html, LiRow})
'... some code
Private Sub TblParsing_2(ByVal ArrVal() As Object)
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(ArrVal(0))
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim TabWeb As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
But I get an error popup like this (I have tried to translate it):
Title:
Web page error
Text:
Debug this page?
This page contains errors that might prevent it from displaying or working properly.
If you are not testing the web page, click No.
two checkboxes
do not show this message again
Use script debugger built-in Internet Explorer
It's the same error I got when trying to load the HTML text into a WebBrowser.
But if I could ignore this error, I think the code could work!
While the popup is showing, I get an error on
Dim domDoc As New mshtml.HTMLDocument
The error text, translated, is:
Retrieving the COM class factory for component with CLSID {25336920-03F9-11CF-8FD0-00AA00686F13} failed due to the following error: 8001010a The message filter indicated that the application is busy. (Exception from HRESULT: 0x8001010A (RPC_E_SERVERCALL_RETRYLATER)).
Note that I've already set IE.Silent = True
Edit: There was confusion as to what the OP meant by "Internet Explorer". I originally assumed that it meant the WinForm Webbrowser control; however the OP is creating the COM browser directly instead of using the .Net wrapper.
To get the browser document's defining HTML, you can cast the document against the mshtml.IHTMLDocument3 interface to expose the documentElement property.
Dim ie As New SHDocVw.InternetExplorer ' Proj COM Ref: Microsoft Internet Controls
ie.Navigate("some url")
' ... other stuff
Dim sourceIDoc3 As mshtml.IHTMLDocument3 = CType(ie.Document, mshtml.IHTMLDocument3)
Dim html As String = sourceIDoc3.documentElement.outerHTML
End Edit.
The following is based on my comment above. You use the WebBrowser.DocumentText property to create a mshtml.HTMLDocument.
Use this property when you want to manipulate the contents of an HTML page displayed in the WebBrowser control using string processing tools.
Once you extract this property as a String, there is no connection to the WebBrowser control and you can process the data in any thread you want.
Dim html As String = WebBrowser1.DocumentText
Dim domDoc As New mshtml.HTMLDocument
Dim domDoc2 As mshtml.IHTMLDocument2 = CType(domDoc, mshtml.IHTMLDocument2)
domDoc2.write(html)
Dim body As mshtml.IHTMLElement2 = CType(domDoc2.body, mshtml.IHTMLElement2)
Dim tables As mshtml.IHTMLElementCollection = body.getElementsByTagName("TABLE")
' ... do something
' cleanup COM objects
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(body)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(tables)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(domDoc)
System.Runtime.InteropServices.Marshal.FinalReleaseComObject(domDoc2)
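The point above about the extracted String having no tie to the browser can be illustrated with a minimal sketch. It assumes the HTML has already been captured as a plain String on the UI thread; the module name, the CountTables helper and the sample markup are hypothetical, and a simple string search stands in for the mshtml parsing so that the example needs no COM references.

```vbnet
Imports System
Imports System.Threading

Module HtmlStringOnWorkerSketch
    ' Hypothetical helper: counts case-insensitive occurrences of "<table"
    ' in a plain HTML string. It touches no browser or COM object.
    Function CountTables(html As String) As Integer
        Dim count As Integer = 0
        Dim pos As Integer = html.IndexOf("<table", StringComparison.OrdinalIgnoreCase)
        While pos >= 0
            count += 1
            pos = html.IndexOf("<table", pos + 1, StringComparison.OrdinalIgnoreCase)
        End While
        Return count
    End Function

    Sub Main()
        ' Stand-in for a string read from the browser on the UI thread.
        Dim html As String = "<html><body><table></table><TABLE></TABLE></body></html>"
        Dim tableCount As Integer = 0

        Using done As New ManualResetEvent(False)
            ' Only the String crosses the thread boundary.
            ThreadPool.QueueUserWorkItem(
                Sub(state As Object)
                    tableCount = CountTables(CStr(state))
                    done.Set()
                End Sub, html)
            done.WaitOne() ' wait for the worker before reading the result
        End Using

        Console.WriteLine(tableCount)
    End Sub
End Module
```

With the sample markup this prints 2. In a real form you would not block on WaitOne from the UI thread; it is used here only to keep the console sketch deterministic.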

Working with Internet Explorer Object

My goal is to create a VB application that reads and writes information to various webpages loaded in Internet Explorer.
I have created a program that works exactly as I intend in VBA. I am now trying to re-implement the same logic in VB.NET.
I have a function that looks for and returns an Internet Explorer Object where the input matches the LocationName.
Assuming the target page is loaded, I can work with it. Methods such as getElementByID() work perfectly. If the browser window is closed and reopened, and the code is run again, the results are very inconsistent. The getIE function seems to work, but when trying to use methods like document.getElementByID() a NullReferenceException is thrown.
Does anyone know if there is anything I am missing that I need to include to get the document property to update?
EDIT: I have looked over the NullReferenceException article. It hasn't helped me unfortunately. In case I was unclear in my wording, I would like to reiterate that the same bit of code yields a different result when executed the 2nd time under the same conditions (same webpage open in Internet Explorer).
On the second execution IE.locationName is retrievable but IE.Document.title is not. The problem is definitely with the Document property as far as I can tell. I am truly stumped.
Many thanks
Public Function getIE(targetTitle As String) As SHDocVw.InternetExplorer
' Create the shell application and Collection of open windows
Dim shellObj As Object = CreateObject("Shell.Application")
Dim shellWindows As SHDocVw.ShellWindows = shellObj.Windows()
getIE = Nothing
' Scan through the Collection
For I = (shellObj.Windows.Count - 1) To 0 Step -1
' If found, assign this to the function output and exit early
If InStr(shellWindows(I).LocationName, targetTitle) Then
getIE = shellWindows(I)
Debug.Print("Found: " & shellWindows(I).LocationName)
Exit For
End If
Next I
End Function
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
Dim IE As SHDocVw.InternetExplorer
IE = getIE("my site title")
If IE Is Nothing Then
Debug.Print("Site not open")
Exit Sub
End If
' The code always gets this far, and despite the page being found, starts throwing exceptions from here on if the window has been closed and reopened whilst my application has stayed running.
' Sample form data insertion code
IE.Document.getElementById("textbox1").value = "my value"
' Click the submit button
IE.Document.getElementById("submit").click()
' Wait for page to load
While IE.Busy
End While
IE.Document.getElementById("textbox2").value = "my 2nd value"
' done
IE = Nothing
End Sub

Get HTML element without 'For Each' loop

I've started to do some programming in Visual Basic and I need some help. This might be a simple question but I can't figure this out.
So I have created a web browser and I would like to get the Profile Picture of a person on Facebook and display it on a Picture Box, and also the Name of that person and display it on a TextBox. After that the program should save the picture in a folder on my hard drive.
Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
Dim elemCollection As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("img")
Dim fbPath As String = "C:\Users\" + Environment.UserName.ToString + "\Documents\FB Images\"
For Each curElement As HtmlElement In elemCollection
If curElement.OuterHtml.Contains("profilePic img") Then
PictureBox1.ImageLocation = curElement.GetAttribute("src") 'Showing the profile pic in picture box
TextBoxName.Text = curElement.GetAttribute("alt") 'Showing the name in a textbox
Dim NumberOfFiles As Integer = System.IO.Directory.GetFiles(fbPath).Length
Dim bmp As New Bitmap(PictureBox1.Width, PictureBox1.Height)
PictureBox1.DrawToBitmap(bmp, PictureBox1.ClientRectangle)
bmp.Save(System.IO.Path.Combine(fbPath, CStr(NumberOfFiles) + TextBoxName.Text + ".jpg"), System.Drawing.Imaging.ImageFormat.Jpeg)
End If
Next
End Sub
So the program should save the picture with a file name based on the number of files in the folder path (e.g. '1 George Johnson.jpg'). But what happens is that it saves 5 different images, I guess because more than one HTML element matches those attributes, so the For Each loop brings up more results.
Is there any way to get an HTML element without using this loop and just get the particular element that I want?
If you only want the first element for which the If condition is true, add "Exit For" on the line before the End If.
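As a minimal sketch of that suggestion, the loop below stops at the first match instead of visiting every element. Plain strings stand in for the HtmlElement objects so the example runs without a WebBrowser; the module name, the FindFirstMatch helper and the sample tags are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module FirstMatchSketch
    ' Hypothetical helper: returns the first entry containing the marker,
    ' or Nothing if no entry matches.
    Function FindFirstMatch(items As IEnumerable(Of String), marker As String) As String
        Dim result As String = Nothing
        For Each item As String In items
            If item.Contains(marker) Then
                result = item
                Exit For ' stop at the first hit; later matches are never visited
            End If
        Next
        Return result
    End Function

    Sub Main()
        ' Stand-ins for the OuterHtml of several <img> elements.
        Dim tags As New List(Of String) From {
            "<img class=""logo"">",
            "<img class=""profilePic img"" alt=""first"">",
            "<img class=""profilePic img"" alt=""second"">"
        }
        Console.WriteLine(FindFirstMatch(tags, "profilePic img"))
    End Sub
End Module
```

Only the entry with alt="first" is printed, even though two entries match. In the original handler the same effect comes from placing Exit For after bmp.Save, so the picture is saved once.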