VB.net - How to extract content of HTML using regex? - vb.net

<div class="gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long" dir="ltr" style="word-break:break-all;">pastebin.com/N8VKGxR9</div>
If I have this, how can I extract only the pastebin url portion in VB.net using regex? I've downloaded the entire webpage using WC.DownloadString().

' Requires: Imports System.Text.RegularExpressions
Dim text As String = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
Dim pattern As String = "<div[\w\W]+gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long.*>(.*)<\/div>"
Dim r As New Regex(pattern)
Dim m As Match = r.Match(text)
Dim g As Group = m.Groups(1)
g.Value will then give you pastebin.com/N8VKGxR9
BTW: the topic mentioned in the comments is about matching the tags themselves, not the text between the tags, so this is quite possible.
Edited to keep only divs with these classes
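If the downloaded page contains several such divs, the greedy pattern above would span from the first one to the last one. A minimal sketch of a stricter variant follows, assuming the class attribute is written exactly as in the question; the module name and the shortened test markup are illustrative only. It uses [^>]* to stay inside one opening tag and a lazy group to stop at the first closing tag.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSketch
    Sub Main()
        ' Shortened test markup (assumption): two divs with the wanted classes.
        Dim html As String =
            "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"">pastebin.com/N8VKGxR9</div>" &
            "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long"" dir=""ltr"">pastebin.com/ABC</div>"

        ' [^>]* cannot cross the end of the opening tag; (.*?) is lazy,
        ' so each match ends at the first </div> that follows it.
        Dim pattern As String =
            "<div[^>]*class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-long""[^>]*>(.*?)</div>"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.Singleline)
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

With this test markup the loop should print pastebin.com/N8VKGxR9 and then pastebin.com/ABC. It still breaks on nested divs, which is where a real HTML parser becomes the safer choice.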

If you use an HTML parser like HtmlAgilityPack (Getting Started With HTML Agility Pack), you can do something like this:
Option Infer On
Option Strict On
Imports HtmlAgilityPack
Module Module1
Sub Main()
' some test data...
Dim s = "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
s &= "<div class=""gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/ABC</div>"
s &= "<div class=""WRONGCLASS gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long"" dir=""ltr"" style=""word-break:break-all;"">pastebin.com/N8VKGxR9</div>"
Dim doc As New HtmlDocument
doc.LoadHtml(s)
' match the classes string /exactly/:
Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[@class='gs-bidi-start-align gs-visibleUrl gs-visibleUrl-Long']")
' An alternative for if you want the divs with /at least/ those classes:
'Dim wantedNodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'gs-bidi-start-align') and contains(@class, 'gs-visibleUrl') and contains(@class, 'gs-visibleUrl-Long')]")
' show the resultant data:
If wantedNodes IsNot Nothing Then
For Each n In wantedNodes
Console.WriteLine(n.InnerHtml)
Next
End If
Console.ReadLine()
End Sub
End Module
Outputs:
pastebin.com/N8VKGxR9
pastebin.com/ABC
HTML parsers have the advantage that they will generally tolerate malformed HTML - for example, the test data shown above is not a valid HTML document and yet the desired data is parsed from it successfully.

Related

Drill down using HtmlAgilityPack.HtmlDocument in VB.NET

I've created an HTML Document using
Dim htmlDoc = New HtmlAgilityPack.HtmlDocument()
and have a node
node = htmlDoc.DocumentNode.SelectSingleNode("/html/body/main/section/form[1]/input[2]")
and the OuterHtml is
"<input type="hidden" id="public-id" value="michael.smith.1">"
I need the value of michael.smith.1. Is there a way to pull the value property from the node or am I at the point where I use substring to parse out the value?
Thanks for the help
I would select the node by its id first, as this makes for faster matching, and then use the GetAttributeValue method of HtmlNode to extract the value attribute:
Imports System
Imports HtmlAgilityPack
Public Class Program
Public Shared Sub Main()
Dim doc = new HtmlDocument
Dim output As String = "<html><head><title>Text</title></head><body><input type=""hidden"" id=""public-id"" value=""michael.smith.1""></body></html>"
doc.LoadHtml(output)
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='public-id']").GetAttributeValue("value","Not present"))
End Sub
End Class

How would I retrieve certain information on a webpage using their ID value?

In vb.net, I can download a webpage as a string like this:
Using ee As New System.Net.WebClient()
Dim reply As String = ee.DownloadString("https://pastebin.com/eHcQRiff")
MessageBox.Show(reply)
End Using
Would it be possible to specify an ID tag of an item on the webpage so that the reply will only output the information inside of the code box/id tag?
Example:
The ID tag of RAW Paste Data on https://pastebin.com/eHcQRiff is id="paste_code" which includes the following text:
Test=1
Test=2
Is there any way to get the WebClient to output only that exact message using the ID tag (or any other method)?
You can use the HtmlAgilityPack library:
Dim document As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
document.Load("C:\YourDownloadedHtml.html")
Dim text As String = document.GetElementbyId("paste_code").InnerText
Some more sample code:
(Tested with HtmlAgilityPack 1.6.10.0)
Dim html As string = "<TD width=""""50%""""><DIV align=right>Name :<B> </B></DIV></TD><TD width=""""50%""""><div id='i1'>SomeText</div></TD><TR vAlign=center>"
Dim htmlDoc As HtmlDocument = New HtmlDocument
htmlDoc.LoadHtml(html) 'To load from html string directly
Dim name As String = htmlDoc.DocumentNode.SelectSingleNode("//td/div[@id='i1']").InnerText
Console.WriteLine(name)
Output:
SomeText

Grab specific part of text from a local html file and use it as variable

I am making a small "home" application using VB. As the title says, I want to grab a part of text from a local html file and use it as variable, or put it in a textbox.
I have tried something like this...
Private Sub Open_Button_Click(sender As Object, e As EventArgs) Handles Open_Button.Click
Dim openFileDialog As New OpenFileDialog()
openFileDialog.CheckFileExists = True
openFileDialog.CheckPathExists = True
openFileDialog.FileName = ""
openFileDialog.Filter = "All|*.*"
openFileDialog.Multiselect = False
openFileDialog.Title = "Open"
If openFileDialog.ShowDialog = Windows.Forms.DialogResult.OK Then
Dim fileReader As String = My.Computer.FileSystem.ReadAllText(openFileDialog1.FileName)
TextBox.Text = fileReader
End If
End Sub
The result is that the whole HTML code is loaded into this textbox. What should I do to grab a specific part of the HTML file's code? Let's say I want to grab only the word text from this span...<span id="something">This is a text!!!</span>
I make the following assumptions in this answer:
Your html is valid - i.e. the id is completely unique in the document.
You will always have an id on your html tag
You'll always be using the same tag (e.g. span)
I'd do something like this:
' get the html document
Dim fileReader As String = My.Computer.FileSystem.ReadAllText(openFileDialog1.FileName)
' split the html text based on the span element
Dim fileSplit as string() = fileReader.Split(New String () {"<span id=""something"">"}, StringSplitOptions.None)
' get the last part of the text
fileReader = fileSplit.last
' we now need to trim everything after the close tag
fileSplit = fileReader.Split(New String () {"</span>"}, StringSplitOptions.None)
' get the first part of the text
fileReader = fileSplit.first
' the fileReader variable should now contain the contents of the span tag with id "something"
Note: this code is untested and I've typed it on the stack exchange mobile app, so there might be some auto correct typos in it.
You might want to add in some error validation such as making sure that the span element only occurs once, etc.
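One way to add that validation is sketched below, under the same assumptions as this answer (a fixed opening tag and a plain </span> close). ExtractSpanText is a hypothetical helper name, not something from the question; it uses IndexOf instead of Split so that a missing, repeated, or unclosed tag can be detected.

```vbnet
Imports System

Module SpanTextSketch
    ' Hypothetical helper: returns the text between openTag and the next </span>,
    ' or Nothing when the tag is missing, occurs more than once, or is never closed.
    Function ExtractSpanText(html As String, openTag As String) As String
        Dim first As Integer = html.IndexOf(openTag, StringComparison.OrdinalIgnoreCase)
        If first < 0 Then Return Nothing

        Dim contentStart As Integer = first + openTag.Length

        ' A second occurrence means the id is not unique, so the result would be ambiguous.
        If html.IndexOf(openTag, contentStart, StringComparison.OrdinalIgnoreCase) >= 0 Then Return Nothing

        Dim closeAt As Integer = html.IndexOf("</span>", contentStart, StringComparison.OrdinalIgnoreCase)
        If closeAt < 0 Then Return Nothing

        Return html.Substring(contentStart, closeAt - contentStart)
    End Function

    Sub Main()
        ' Test markup (assumption) standing in for the file contents.
        Dim html As String = "<p>Intro</p><span id=""something"">This is a text!!!</span>"
        Dim text As String = ExtractSpanText(html, "<span id=""something"">")
        Console.WriteLine(If(text, "(not found)"))
    End Sub
End Module
```

Returning Nothing for every failure case keeps the caller simple; a real application might prefer to distinguish the cases so it can show a more precise message.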
Using an HTML parser is highly recommended due to the HTML language's many nested tags (see this question for example).
However, finding the contents of a single tag using Regex is possible without major problems if the HTML is formatted correctly.
This would be what you need (the function is case-insensitive):
Public Function FindTextInSpan(ByVal HTML As String, ByVal SpanId As String, ByVal LookFor As String) As String
Dim m As Match = Regex.Match(HTML, "(?<=<span.+id=""" & SpanId & """.*>.*)" & LookFor & "(?=.*<\/span>)", RegexOptions.IgnoreCase)
' Regex.Match never returns Nothing; check Success instead
Return If(m.Success, m.Value, "")
End Function
The parameters of the function are:
HTML: The HTML code as string.
SpanId: The id of the span (ex. <span id="hello"> - hello is the id)
LookFor: What text to look for inside the span.
Online test: http://ideone.com/luGw1V

Grab text from webpage using vb

I want code that grabs text from a webpage.
Here is the HTML:
<div><span>Version : </span> " 1.3"</div>
So I want the text 1.3 to appear in textbox1.
To manipulate HTML elements/documents easily, you need to install HTML Agility Pack. You can get it from NuGet at: https://www.nuget.org/packages/HtmlAgilityPack
After you have it, you can do a lot of magic with HTML documents/tags.
Dim voHAP As New HtmlAgilityPack.HtmlDocument
voHAP.LoadHtml("<div><span>Version : </span> "" 1.3""</div>")
Dim voDiv As HtmlAgilityPack.HtmlNode = voHAP.DocumentNode.Elements("div")(0)
voDiv.RemoveChild(voDiv.Element("span"))
Dim vsText As String = Replace(voDiv.InnerText, """", "").Trim
The vsText variable will contain your value of 1.3. The final Replace() function is to remove the unwanted " characters in the string.

HtmlAgilityPack not finding nodes from HttpWebRequest's returned HTML

I am a little new to htmlagilitypack. I want to use my HttpWebRequest which can return the html of a webpage and then parse that html with htmlagilitypack. I want to find all div's with a specific class and then get the inner text of what is inside those div's. This is what I have so far. My get request successfully returns webpage html:
Public Function mygetreq(ByVal myURL As String, ByRef thecookie As CookieContainer) As String
Dim getreq As HttpWebRequest = DirectCast(HttpWebRequest.Create(myURL), HttpWebRequest)
getreq.Method = "GET"
getreq.KeepAlive = True
getreq.CookieContainer = thecookie
getreq.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
Dim getresponse As HttpWebResponse
getresponse = DirectCast(getreq.GetResponse, HttpWebResponse)
Dim getreqreader As New StreamReader(getresponse.GetResponseStream())
Dim thePage = getreqreader.ReadToEnd
'Clean up the streams and the response.
getreqreader.Close()
getresponse.Close()
Return thePage
End Function
This function returns the html. I then put the html into this:
'The html successfully shows up in the RichTextBox
RichTextBox1.Text = mygetreq("http://someurl.com", thecookie)
Dim htmldoc = New HtmlAgilityPack.HtmlDocument()
htmldoc.LoadHtml(RichTextBox1.Text)
Dim htmlnodes As HtmlNodeCollection
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='someClass']")
If htmlnodes IsNot Nothing Then
For Each node In htmlnodes
MessageBox.Show(node.InnerText())
Next
End If
The problem is, htmlnodes is coming back as null. So the final If Then loop won't run. It finds nothing, but I KNOW for a fact that this div and class exists in the html page because I can see the html in the RichTextBox1:
<div class="someClass"> This is inner text </div>
What exactly is the problem here? Does the htmldoc.LoadHtml not like the type of string that the mygetreq returns for the page html?
Does this have anything to do with html entities? thePage contains < and > brackets. They are not entitied.
I also saw someone post here (C#) to use the HtmlWeb class, but I am not sure how I would set that up. Most of my code is already written with httpWebRequest.
Thanks for reading and thanks for helping.
If you are willing to switch, you could use CsQuery, something along these lines:
Dim q As New CQ(mygetreq("http://someurl.com", thecookie))
For Each node In q("div.someClass")
Console.WriteLine(node.InnerText)
Next
You may want to add some error handling, but overall should be a good start for you.
You can add CsQuery to your project via NuGet:
Install-Package CsQuery
And don't forget to use Imports CsQuery at the top of your code file.
This may not directly solve your problem, but should make it easier to experiment with your data (via immediate window, for example).
Interesting read (performance comparison):
CsQuery Performance vs. Html Agility Pack and Fizzler
Using HtmlWeb is truly a simple and good way to work with HtmlAgilityPack. Here is an example:
Private Sub GetHtml()
Dim HtmlWeb As New HtmlWeb
Dim HtmlDoc As HtmlDocument
Dim NodeCollection As HtmlNodeCollection
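Dim URL As String = "" 'Put the page URL here
HtmlDoc = HtmlWeb.Load(URL) 'Notice that I used Load, and not LoadHtml
NodeCollection = HtmlDoc.DocumentNode.SelectNodes("//div[@class='someClass']") 'Put your own XPath here
'SelectNodes returns Nothing when no node matches, so check before looping
If NodeCollection IsNot Nothing Then
For Each Node As HtmlNode In NodeCollection
MsgBox(Node.InnerText)
Next
End If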
End Sub
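One possible reason for an exact class comparison in XPath to find nothing is that the real attribute holds more than the one class name or extra whitespace. That is only a guess about the page in the question. A minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and Option Infer is on, filters divs by class token with LINQ instead; the module name and test markup are illustrative.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module ClassTokenSketch
    Sub Main()
        Dim doc As New HtmlDocument()
        ' Test markup (assumption): the wanted div carries a second class as well.
        doc.LoadHtml("<div class=""box someClass""> This is inner text </div><div class=""other"">skip</div>")

        ' Split each class attribute on spaces and look for the exact token,
        ' so "box someClass" matches but "someClassExtra" would not.
        Dim matches = doc.DocumentNode.Descendants("div").
            Where(Function(d) d.GetAttributeValue("class", "").
                Split({" "c}, StringSplitOptions.RemoveEmptyEntries).
                Contains("someClass"))

        For Each d In matches
            Console.WriteLine(d.InnerText.Trim())
        Next
    End Sub
End Module
```

Because Descendants returns an empty sequence rather than Nothing when no div is present, the loop needs no separate null check.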