How would I retrieve certain information on a webpage using their ID value? - vb.net

In vb.net, I can download a webpage as a string like this:
Using ee As New System.Net.WebClient()
Dim reply As String = ee.DownloadString("https://pastebin.com/eHcQRiff")
MessageBox.Show(reply)
End Using
Would it be possible to specify an ID tag of an item on the webpage so that the reply will only output the information inside of the code box/id tag?
Example:
The ID tag of RAW Paste Data on https://pastebin.com/eHcQRiff is id="paste_code" which includes the following text:
Test=1
Test=2
Is there any way to get the WebClient to output only that exact message using the ID tag (or any other method)?

You can use the HtmlAgilityPack library:
Dim document As New HtmlAgilityPack.HtmlDocument()
document.Load("C:\YourDownloadedHtml.html")
Dim text as string = document.GetElementbyId("paste_code").InnerText
Some more sample code:
(Tested with HtmlAgilityPack 1.6.10.0)
Dim html As String = "<TD width=""50%""><DIV align=right>Name :<B> </B></DIV></TD><TD width=""50%""><div id='i1'>SomeText</div></TD><TR vAlign=center>"
Dim htmlDoc As HtmlDocument = New HtmlDocument
htmlDoc.LoadHtml(html) 'To load from html string directly
Dim name As String = htmlDoc.DocumentNode.SelectSingleNode("//td/div[@id='i1']").InnerText
Console.WriteLine(name)
Output:
SomeText
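To tie this back to the question, the download and the lookup can be combined: fetch the page with WebClient, parse the resulting string with LoadHtml, then look the element up by id. This is a minimal sketch, assuming the page still contains an element with id="paste_code"; the helper name GetTextById is made up for illustration.

```vbnet
Imports System
Imports System.Net
Imports HtmlAgilityPack

Module PasteReader
    ' Hypothetical helper (not part of HtmlAgilityPack): returns the inner text
    ' of the element with the given id, or Nothing when no such element exists.
    Function GetTextById(html As String, id As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html) ' parse the HTML string that was already downloaded
        Dim node As HtmlNode = doc.GetElementbyId(id)
        If node Is Nothing Then Return Nothing
        Return node.InnerText
    End Function

    Sub Main()
        Using client As New WebClient()
            Dim html As String = client.DownloadString("https://pastebin.com/eHcQRiff")
            Dim text As String = GetTextById(html, "paste_code")
            Console.WriteLine(If(text, "Element not found"))
        End Using
    End Sub
End Module
```

InnerText can still contain HTML entities such as &amp;amp;. If that matters, passing the result through HtmlEntity.DeEntitize should decode them.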

Related

Drill down using HtmlAgilityPack.HtmlDocument in VB.NET

I've created an HTML Document using
Dim htmlDoc = New HtmlAgilityPack.HtmlDocument()
and have a node
node = htmlDoc.DocumentNode.SelectSingleNode("/html/body/main/section/form[1]/input[2]")
and the OuterHtml is
"<input type="hidden" id="public-id" value="michael.smith.1">"
I need the value of michael.smith.1. Is there a way to pull the value property from the node or am I at the point where I use substring to parse out the value?
Thanks for the help
I would select the node by its id first, as this makes for faster matching, then use the GetAttributeValue method of HtmlNode to extract the value attribute:
Imports System
Imports HtmlAgilityPack
Public Class Program
Public Shared Sub Main()
Dim doc = new HtmlDocument
Dim output As String = "<html><head><title>Text</title></head><body><input type=""hidden"" id=""public-id"" value=""michael.smith.1""></body></html>"
doc.LoadHtml(output)
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='public-id']").GetAttributeValue("value","Not present"))
End Sub
End Class

How to get a specific class that is in a div

I'm just starting out with the Html Agility Pack library. I can't understand why it works when I select only the divs with the following code:
Dim url As String = "https://example.com"
Dim web = New HtmlWeb()
Dim doc = web.Load(url)
For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//div")
Console.Write(node.InnerText)
Next
while if I try to find divs with a specific class, as in the following code, it gives me an error:
Dim url As String = "https://example.com"
Dim web = New HtmlWeb()
Dim doc = web.Load(url)
For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='myclass']")
Console.Write(node.InnerText)
Next
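The question does not quote the error, so this is only a guess at the cause: SelectNodes returns Nothing rather than an empty collection when no node matches, and For Each over Nothing throws a NullReferenceException. A minimal sketch that guards against that follows. It uses an inline HTML string (an assumption, since the real page is not shown) so that it runs offline, and contains() in case the div carries more than one class.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ClassLookup
    Sub Main()
        ' Inline sample markup; stands in for the page loaded with HtmlWeb.
        Dim html As String = "<div class=""myclass other"">Hello</div><div>Ignored</div>"
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' contains() is a substring test, so it also matches class values
        ' such as "myclass2"; use [@class='myclass'] for an exact match.
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//div[contains(@class, 'myclass')]")

        If nodes Is Nothing Then
            Console.WriteLine("No matching div found")
        Else
            For Each node As HtmlNode In nodes
                Console.WriteLine(node.InnerText)
            Next
        End If
    End Sub
End Module
```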

VB.net - How to extract content of HTML using regex?

<div class="gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" dir="ltr" style="word-break:break-all;">pastebin.com/N8VKGxR9</div>
If I have this, how can I extract only the pastebin url portion in VB.net using regex? I've downloaded the entire webpage using WC.DownloadString().
Dim text As String = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
Dim pattern As String = "<div[\w\W]+gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long.*>(.*)<\/div>"
' Requires: Imports System.Text.RegularExpressions
Dim r As New Regex(pattern)
Dim m As Match = r.Match(text)
Dim g As Group = m.Groups(1)
This gives you pastebin.com/N8VKGxR9 in g.Value.
BTW: the topic mentioned in the comments is about matching the tags themselves, not the text between tags, so this is quite possible.
Edited to keep only divs with these classes
If you use an HTML parser like HtmlAgilityPack (Getting Started With HTML Agility Pack), you can do something like this:
Option Infer On
Option Strict On
Imports HtmlAgilityPack
Module Module1
Sub Main()
' some test data...
Dim s = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
s &= "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/ABC</div>"
s &= "<div class=""WRONGCLASS gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
Dim doc As New HtmlDocument
doc.LoadHtml(s)
' match the classes string /exactly/:
Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[@class='gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long']")
' An alternative for if you want the divs with /at least/ those classes:
'Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'gs-bidi-start-align') and contains(@class, 'gs-visibleUrl') and contains(@class, 'gs-visibleUrl-Long')]")
' show the resultant data:
If wantedNodes IsNot Nothing Then
For Each n In wantedNodes
Console.WriteLine(n.InnerHtml)
Next
End If
Console.ReadLine()
End Sub
End Module
Outputs:
pastebin.com/N8VKGxR9
pastebin.com/ABC
HTML parsers have the advantage that they will generally tolerate malformed HTML - for example, the test data shown above is not a valid HTML document and yet the desired data is parsed from it successfully.

How to Post & Retrieve Data from Website

I am working with a Windows form application. I have a textbox called "tbPhoneNumber" which contains a phone number.
I want to go on the website http://canada411.com and enter in the number that was in my textbox, into the website textbox ID: "c411PeopleReverseWhat" and then somehow send a click on "Find" (which is an input belonging to class "c411ButtonImg").
After that, I want to retrieve what is in between the asterisks of the following HTML section:
<div id="contact" class="vcard">
<span><h1 class="fn c411ListedName">**Full Name**</h1></span>
<span class="c411Phone">**(###)-###-####**</span>
<span class="c411Address">**Address**</span>
<span class="adr">
<span class="locality">**City**</span>
<span class="region">**Province**</span>
<span class="postal-code">**L#L#L#**</span>
</span>
So basically I am trying to send data into an input box, click the input button and store the values retrieved into variables. I want to do this seamlessly, so would I need to do something like an HTTPWebRequest? Or do I use a WebBrowser object? I just don't want the user to see that the application is going to a website.
I do a good amount of website scraping and I will show you how I do it. Feel free to skip ahead if I am being too specific, but this is a commonly requested theme and should be made specific.
URL Simplification
The library I use for this is HtmlAgilityPack (it is a DLL; create a new project and add a reference to it). The first thing to check is whether we have to take any special steps to get to a page by using a phone number. I searched for John Smith and found quite a few results. I opened two of them and noticed that the URL format is very simple. Those results were:
http://www.canada411.ca/res/7056736767/John-Smith/138223109.html
http://www.canada411.ca/res/7052355273/John-Smith/172439951.html
I tested whether I could remove the values I don't know from the URL and leave just the phone number. It turns out I can:
http://www.canada411.ca/search/re/1/7056736767/-
http://www.canada411.ca/search/re/1/7052355273/-
You can see that the URL consists of some static parts plus our phone number. From this, let's construct a string for the URL.
Dim phoneNumber as string = "7056736767" 'this could be TextBox1.Text or whatever
Dim URL as string = "http://www.canada411.ca/search/re/1/" + phoneNumber +"/-"
Value Extraction with XPath
Now that we have the page dialed in, let's examine the HTML you provided above. You need 6 values from the page, so we will declare variables for them now:
Dim FullName As String
Dim Phone As String
Dim Address As String
Dim Locality As String
Dim Region As String
Dim PostalCode As String
As mentioned above, we will be using HtmlAgilityPack, which uses XPath. The cool thing about this is that once we can find some unique identifier in the HTML, we can use XPath to find our values. I know it may be confusing, but it will become clearer.
All of the values you need are within tags that have a class name. Let's use the class name in our XPath to find them.
Dim FullNameXPath As String = "//*[@class='fn c411ListedName']"
Dim PhoneXPath As String = "//*[@class='c411Phone']"
Dim AddressXPath As String = "//*[@class='c411Address']"
Dim LocalityXPath As String = "//*[@class='locality']"
Dim RegionXPath As String = "//*[@class='region']"
Dim PostalCodeXPath As String = "//*[@class='postal-code']"
Essentially what we are looking at is a string that will inform htmlagilitypack what to look for. In our case, text contained within the classes we named. There is a lot to XPath and it could take a while to explain all of it. On a side note though...If you use Google Chrome and highlight a value on a page, you can right click inspect element. In the code that appears below, you can right click the value and copy to XPath!!! Very useful.
Basic HTMLAgilityPack Template
Now, all that is left is to connect to the page and get those variables populated.
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load(URL)
For Each nameResult As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(FullNameXPath)
Msgbox(nameResult.InnerText)
Next
In the above example we create an HtmlWeb object named Web. This is the actual crawler of our project. We then define a HtmlDocument which will consist of our converted and searchable page source. All of this is done behind the scenes. We then send Web to get the page source and assign it to the Doc object we created. Doc is reusable, which thankfully requires us to connect to the page only once.
The for loop looks for any nodes in our Doc that match FullNameXPath which was defined previously as the XPath value for finding the name. When a Node is found, it is assigned to the nameResult variable and from within the loop we call a message box to display the inner text of our node.
So when we put it all together
Complete Working Code (As of 2/17/2013)
Dim phoneNumber As String = "7056736767" 'this could be TextBox1.Text or whatever
Dim URL As String = "http://www.canada411.ca/search/re/1/" + phoneNumber + "/-"
Dim FullName As String
Dim Phone As String
Dim Address As String
Dim Locality As String
Dim Region As String
Dim PostalCode As String
Dim FullNameXPath As String = "//*[@class='fn c411ListedName']"
Dim PhoneXPath As String = "//*[@class='c411Phone']"
Dim AddressXPath As String = "//*[@class='c411Address']"
Dim LocalityXPath As String = "//*[@class='locality']"
Dim RegionXPath As String = "//*[@class='region']"
Dim PostalCodeXPath As String = "//*[@class='postal-code']"
Dim Web As New HtmlAgilityPack.HtmlWeb
Dim Doc As New HtmlAgilityPack.HtmlDocument
Doc = Web.Load(URL)
For Each nameResult As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(FullNameXPath)
FullName = nameResult.InnerText
MsgBox(FullName)
Next
For Each PhoneResult As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(PhoneXPath)
Phone = PhoneResult.InnerText
MsgBox(Phone)
Next
For Each ADDRResult As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(AddressXPath)
Address = ADDRResult.InnerText
MsgBox(Address)
Next
For Each LocalResult As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(LocalityXPath)
Locality = LocalResult.InnerText
MsgBox(Locality)
Next
For Each RegionResult As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(RegionXPath)
Region = RegionResult.InnerText
MsgBox(Region)
Next
For Each postalCodeResult As HtmlAgilityPack.HtmlNode In Doc.DocumentNode.SelectNodes(PostalCodeXPath)
PostalCode = postalCodeResult.InnerText
MsgBox(PostalCode)
Next
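The six loops above each keep only the last match, and each would throw if SelectNodes found nothing, because it returns Nothing in that case. As a possible tidy-up, they could be folded into one small helper built on SelectSingleNode. This is a sketch only: the helper name FirstText is hypothetical, and the inline markup copies the snippet from the question (placeholder values included) so that the example runs without hitting the site.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ContactScraper
    ' Hypothetical helper: inner text of the first node matching the XPath,
    ' or an empty string when nothing matches.
    Function FirstText(doc As HtmlDocument, xpath As String) As String
        Dim node As HtmlNode = doc.DocumentNode.SelectSingleNode(xpath)
        If node Is Nothing Then Return String.Empty
        Return HtmlEntity.DeEntitize(node.InnerText).Trim()
    End Function

    Sub Main()
        ' Sample markup modelled on the question; in the real program this
        ' document would come from Web.Load(URL) instead.
        Dim html As String = _
            "<div id=""contact"" class=""vcard"">" & _
            "<span><h1 class=""fn c411ListedName"">Full Name</h1></span>" & _
            "<span class=""c411Phone"">(###)-###-####</span>" & _
            "<span class=""c411Address"">Address</span>" & _
            "<span class=""adr"">" & _
            "<span class=""locality"">City</span>" & _
            "<span class=""region"">Province</span>" & _
            "<span class=""postal-code"">L#L#L#</span>" & _
            "</span></div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim fullName As String = FirstText(doc, "//*[@class='fn c411ListedName']")
        Dim phone As String = FirstText(doc, "//*[@class='c411Phone']")
        Dim address As String = FirstText(doc, "//*[@class='c411Address']")
        Dim locality As String = FirstText(doc, "//*[@class='locality']")
        Dim region As String = FirstText(doc, "//*[@class='region']")
        Dim postalCode As String = FirstText(doc, "//*[@class='postal-code']")

        Console.WriteLine(String.Join(", ", fullName, phone, address, locality, region, postalCode))
    End Sub
End Module
```

An empty string signals "not found" here; returning Nothing instead would work just as well if the caller needs to tell a missing node from an empty one.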
Yes, it is possible. I've done this using the Selenium framework, which is aimed at test automation; however, it provides you with the tools to do exactly that.
Download for .net here:
http://docs.seleniumhq.org/download/

HtmlAgilityPack - getting error when looping through nodes. Doesn't make sense

I'm trying to get all nodes below but I am getting an error message of:
Overload resolution failed because no accessible 'GetAttributeValue' accepts this number of arguments.
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
Dim docNode As HtmlNode = doc.DocumentNode
Dim nodes As HtmlNodeCollection = docNode.SelectNodes("//input")
For Each node As HtmlNode In nodes
Dim id As String = node.GetAttributeValue("id")
Next
Any ideas on why I am getting this error message? Thanks
You need to provide a default value as a second parameter to GetAttributeValue:
Dim id As String = node.GetAttributeValue("id", "")
Update for updated question
In addition to the above fix, you are retrieving the HtmlDocument incorrectly. HtmlDocument.Load reads from a file or stream and HtmlDocument.LoadHtml parses an HTML string; neither retrieves the page from the web server.
You need to modify your code to fetch the data from the URL using HtmlWeb. Replace the following lines:
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
with these:
Dim doc As HtmlDocument
Dim web As New HtmlWeb
doc = web.Load("http://shaggybevo.com/board/register.php")
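Putting both fixes together, the corrected loop might look like the sketch below. The Nothing check is an addition not mentioned above: SelectNodes returns Nothing when the page has no input elements, so the loop is skipped in that case. Printing the id is only for illustration.

```vbnet
Imports System
Imports HtmlAgilityPack

Module RegisterInputs
    Sub Main()
        ' HtmlWeb fetches the page over HTTP and parses it in one step.
        Dim web As New HtmlWeb()
        Dim doc As HtmlDocument = web.Load("http://shaggybevo.com/board/register.php")

        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//input")
        If nodes Is Nothing Then Return ' no <input> elements on the page

        For Each node As HtmlNode In nodes
            ' The second argument is the default returned when the attribute is missing.
            Dim id As String = node.GetAttributeValue("id", "")
            Console.WriteLine(id)
        Next
    End Sub
End Module
```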