Drill down using HtmlAgilityPack.HtmlDocument in VB.NET - vb.net

I've created an HTML Document using
Dim htmlDoc = New HtmlAgilityPack.HtmlDocument()
and have a node
node = htmlDoc.DocumentNode.SelectSingleNode("/html/body/main/section/form[1]/input[2]")
and the OuterHtml is
"<input type="hidden" id="public-id" value="michael.smith.1">"
I need the value of michael.smith.1. Is there a way to pull the value property from the node or am I at the point where I use substring to parse out the value?
Thanks for the help

I would select the node by its id first, as this makes for faster matching, then use the GetAttributeValue method of HtmlNode to extract the value attribute:
Imports System
Imports HtmlAgilityPack
Public Class Program
Public Shared Sub Main()
Dim doc = new HtmlDocument
Dim output As String = "<html><head><title>Text</title></head><body><input type=""hidden"" id=""public-id"" value=""michael.smith.1""></body></html>"
doc.LoadHtml(output)
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='public-id']").GetAttributeValue("value","Not present"))
End Sub
End Class

Related

Need Example HtmlAgilityPack

I am trying again to scrape a page as an example.
Currently I have the following code:
Imports System
Imports System.Xml
Imports HtmlAgilityPack
Imports System.Net
Imports System.IO
Imports System.Collections.Generic
Public Class Program
Public Shared Sub Main()
' Enable TLS 1.2 support
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
' Web page to scrape
Dim link As String = "https://www.nextinpact.com"
' Download the page from the link into an HtmlDocument
Dim doc As HtmlDocument = New HtmlWeb().Load(link)
' Select the title
Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//section[@class='small_article_section']")
If Not div Is Nothing Then
For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//h2[@class='color_title']//a[@class='ui-link'][contains(text())]")
Console.Write(div.InnerText.Trim())
Next
End If
End Sub
End Class
Currently I am trying to get all the titles from
"//section[@class='small_article_section']"
But how do I get all the titles?
For the first title the XPath is
"//h2[@class='color_title']//a[@class='ui-link'][contains(text(),'Les obligations de Netflix passeront d')]"
Thank you.
Edit:
I tried another example, with
Dim doc As HtmlDocument = New HtmlWeb().Load("https://www.sideshow.com/collectibles?manufacturer=sideshow+collectibles&type=premium+format%28tm%29+figure&brand=aspen")
Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row']")
Now I am trying to get the title of each product, with:
For Each node As HtmlNode In div.SelectNodes("//h2[contains(text(),'Grace')]") ' This matches only Grace
Console.Write(node.InnerText.Trim())
Next
But with
//h2[contains(text(),'Grace')]
I get Nothing, and I want both Grace and Aspen. I also tried
.//h2[contains(text()]
and got nothing either.
This is how you do it.
Dim doc As HtmlDocument = New HtmlWeb().Load("https://www.nextinpact.com/")
Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//section[@class='small_article_section']")
'If div IsNot Nothing Then 'I think this part is pointless as it will always exist
For Each node As HtmlNode In div.SelectNodes(".//h2[@class='color_title']/a") 'a class='ui-link' doesn't exist so do h2/a
Console.Write(node.InnerText.Trim())
Next
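The null check skipped above may still be worth keeping: in HtmlAgilityPack, SelectSingleNode returns Nothing when no node matches, and SelectNodes returns Nothing rather than an empty collection, so a layout change on the site would throw a NullReferenceException. A minimal sketch with both guards, assuming a small inline HTML string as a stand-in for the live page (the markup, the titles, and the TitleScraper module name are hypothetical):

```vbnet
Imports System
Imports HtmlAgilityPack

Public Module TitleScraper
    Public Sub Main()
        ' Hypothetical stand-in markup; the live page may look different.
        Dim html As String =
            "<section class='small_article_section'>" &
            "<h2 class='color_title'><a href='/a'>First title</a></h2>" &
            "<h2 class='color_title'><a href='/b'>Second title</a></h2>" &
            "</section>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectSingleNode returns Nothing when the XPath matches no node.
        Dim section As HtmlNode = doc.DocumentNode.SelectSingleNode("//section[@class='small_article_section']")
        If section Is Nothing Then
            Console.WriteLine("Section not found")
            Return
        End If

        ' SelectNodes also returns Nothing (not an empty collection) on no match.
        Dim links As HtmlNodeCollection = section.SelectNodes(".//h2[@class='color_title']/a")
        If links Is Nothing Then
            Console.WriteLine("No titles found")
            Return
        End If

        ' The leading "." in the XPath keeps the search inside the section.
        For Each link As HtmlNode In links
            Console.WriteLine(link.InnerText.Trim())
        Next
    End Sub
End Module
```

To run it against the real site, the LoadHtml call would be swapped for New HtmlWeb().Load(link) as in the question.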

How would I retrieve certain information on a webpage using their ID value?

In vb.net, I can download a webpage as a string like this:
Using ee As New System.Net.WebClient()
Dim reply As String = ee.DownloadString("https://pastebin.com/eHcQRiff")
MessageBox.Show(reply)
End Using
Would it be possible to specify an ID tag of an item on the webpage so that the reply will only output the information inside of the code box/id tag?
Example:
The ID tag of RAW Paste Data on https://pastebin.com/eHcQRiff is id="paste_code" which includes the following text:
Test=1
Test=2
Is there any way to get the WebClient to output only that exact message using the ID tag (or any other method)?
You can use the HtmlAgilityPack library:
Dim document As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
document.Load("C:\YourDownloadedHtml.html")
Dim text As String = document.GetElementbyId("paste_code").InnerText
Some more sample code:
(Tested with HtmlAgilityPack 1.6.10.0)
Dim html As String = "<TD width=""50%""><DIV align=right>Name :<B> </B></DIV></TD><TD width=""50%""><div id='i1'>SomeText</div></TD><TR vAlign=center>"
Dim htmlDoc As HtmlDocument = New HtmlDocument
htmlDoc.LoadHtml(html) 'To load from html string directly
Dim name As String = htmlDoc.DocumentNode.SelectSingleNode("//td/div[@id='i1']").InnerText
Console.WriteLine(name)
Output:
SomeText

How to read simple pseudo XML file?

I want to be able to read info from a simple, pseudo XML file I created to get the content. Here is what my XML file would look like :
<title>Form Title</title>
<Message1>A message or something</Message1>
<FormWidth>500</FormWidth>
<FormHeight>500</FormHeight>
The XML classes I find online and in Visual Studio are too advanced. This is just a simple config file I'd like to use. Any tips?
Add a function like this to your code behind (I just converted this from c# so you may need to change some things):
Private Shared Function ReadValueFromXML(ValueToRead As String) As String
Try
Dim doc As New XPathDocument(System.Web.HttpContext.Current.Server.MapPath("filenameOfYourXML.xml"))
Dim nav As XPathNavigator = doc.CreateNavigator()
Dim expr As XPathExpression
expr = nav.Compile("/" & ValueToRead)
Dim iterator As XPathNodeIterator = nav.Select(expr)
While iterator.MoveNext()
Return iterator.Current.Value
End While
Return String.Empty
Catch
Return String.Empty
End Try
End Function
Don't forget to add these import statements in your code behind as well:
Imports System.Xml
Imports System.Xml.XPath
And when you use it, let's say you want to get the value of FormWidth:
Dim FormWidth As String = ReadValueFromXML("FormWidth")
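One caveat: the file shown in the question has several top-level elements, so it is not well-formed XML. A standard XML loader will raise an exception on it, and the function above would then fall into its Catch branch and return an empty string. A minimal sketch of one workaround, assuming it is acceptable to wrap the fragment in a root element before parsing; the config root name, the ConfigReader module, and the ReadValue helper are hypothetical, and it uses LINQ to XML instead of XPathDocument:

```vbnet
Imports System
Imports System.Xml.Linq

Public Module ConfigReader
    ' Wraps the fragment in a hypothetical <config> root so it becomes
    ' well-formed, then returns the text of the first child element with
    ' the given name, or an empty string if there is no such element.
    Public Function ReadValue(fragment As String, name As String) As String
        Dim root As XElement = XElement.Parse("<config>" & fragment & "</config>")
        Dim element As XElement = root.Element(name)
        Return If(element Is Nothing, String.Empty, element.Value)
    End Function

    Public Sub Main()
        ' Same content as the pseudo XML file in the question.
        Dim fragment As String =
            "<title>Form Title</title>" &
            "<Message1>A message or something</Message1>" &
            "<FormWidth>500</FormWidth>" &
            "<FormHeight>500</FormHeight>"

        Console.WriteLine(ReadValue(fragment, "FormWidth"))
    End Sub
End Module
```

In a real program the fragment would come from something like System.IO.File.ReadAllText. Element names are case-sensitive, so "formwidth" would not match.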

VB.net - How to extract content of HTML using regex?

<div class="gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" dir="ltr" style="word-break:break-all;">pastebin.com/N8VKGxR9</div>
If I have this, how can I extract only the pastebin url portion in VB.net using regex? I've downloaded the entire webpage using WC.DownloadString().
' Requires: Imports System.Text.RegularExpressions
Dim text As String = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
Dim pattern As String = "<div[\w\W]+gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long.*>(.*)<\/div>"
Dim m As Match = Regex.Match(text, pattern)
Dim g As Group = m.Groups(1)
g.Value will give you pastebin.com/N8VKGxR9
BTW: the topic mentioned in the comments is about matching special tags, not the text between the tags itself, so this is quite possible.
Edited to keep only divs with these classes.
If you use an HTML parser like HtmlAgilityPack (Getting Started With HTML Agility Pack), you can do something like this:
Option Infer On
Option Strict On
Imports HtmlAgilityPack
Module Module1
Sub Main()
' some test data...
Dim s = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
s &= "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/ABC</div>"
s &= "<div class=""WRONGCLASS gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
Dim doc As New HtmlDocument
doc.LoadHtml(s)
' match the classes string /exactly/:
Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[@class='gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long']")
' An alternative for if you want the divs with /at least/ those classes:
'Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'gs-bidi-start-align') and contains(@class, 'gs-visibleUrl') and contains(@class, 'gs-visibleUrl-Long')]")
' show the resultant data:
If wantedNodes IsNot Nothing Then
For Each n In wantedNodes
Console.WriteLine(n.InnerHtml)
Next
End If
Console.ReadLine()
End Sub
End Module
Outputs:
pastebin.com/N8VKGxR9
pastebin.com/ABC
HTML parsers have the advantage that they will generally tolerate malformed HTML - for example, the test data shown above is not a valid HTML document and yet the desired data is parsed from it successfully.

Check if element has a specific attribute using HtmlAgilityPack in VB.Net

I'm using HtmlAgilityPack to parse HTML.
I want to check if an element has a specific attribute.
I want to check whether an <a> tag has the href attribute.
Dim doc As HtmlDocument = New HtmlDocument()
doc.Load(New StringReader(content))
Dim root As HtmlNode = doc.DocumentNode
Dim anchorTags As New List(Of String)
For Each link As HtmlNode In root.SelectNodes("//a")
If link.HasAttributes("href") Then doSomething() 'this doesn't work because hasAttributes only checks whether an element has attributes or not
Next
Like this:
If link.Attributes("href") IsNot Nothing Then
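For context, here is a minimal sketch of that check inside a complete loop. The inline HTML and the HrefCheck module name are hypothetical stand-ins; the Attributes indexer returns Nothing when the named attribute is absent, which is what makes the test work:

```vbnet
Imports System
Imports HtmlAgilityPack

Public Module HrefCheck
    Public Sub Main()
        ' Hypothetical test markup: one anchor with href, one without.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<a href='/home'>Home</a><a name='top'>Top</a>")

        ' SelectNodes returns Nothing when nothing matches, so guard first.
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a")
        If anchors Is Nothing Then Return

        For Each link As HtmlNode In anchors
            ' Only anchors that actually carry an href attribute pass this test.
            If link.Attributes("href") IsNot Nothing Then
                Console.WriteLine(link.GetAttributeValue("href", String.Empty))
            End If
        Next
    End Sub
End Module
```

Another option is to let XPath do the filtering: SelectNodes("//a[@href]") selects only the anchors that have an href attribute, so no per-node check is needed.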