How to extract multiple values from a string in VB.NET

<tr class="sh"onclick="ii.ShowShareHolder('7358,IRO1GNBO0008')">
<td>ghanisha sherkat-</td>
<td><div class='ltr' title="141,933,691">142 M</div></td>
<td>52.560</td>
<td>0</td>
<td><div class=""/></td>
</tr>
I want to extract the following items from the text above:
ghanisha sherkat
141,933,691
52.560
0
My first attempt:
Dim input as string="above text"
Dim c2 As String() = input.Split(New String() {"</td>"},StringSplitOptions.None)
Dim r As Integer
For r = 0 To c2.Length - 2
MessageBox.Show(c2(r))
Next
My second attempt:
Dim sDelimStart As String = "<td>"
Dim sDelimEnd As String = "</td>"
Dim nIndexStart As Integer = input.IndexOf(sDelimStart)
Dim nIndexEnd As Integer = input.IndexOf(sDelimEnd)
Dim res As String = Strings.Mid(input, nIndexStart + sDelimStart.Length + 1, nIndexEnd - nIndexStart - sDelimStart.Length)
MessageBox.Show(res)
This way I can extract "ghanisha sherkat".
How do I extract the other items?
How should I continue from here? Thank you.

It is not a good idea to search the string directly or to use regular expressions to parse markup such as HTML. If the fragment is well-formed, you can instead load it into an XmlDocument and parse it as XML, prefixed with an <?xml ...?> declaration.
1. Using XmlDocument
Dim input As String = <?xml version="1.0" encoding="utf-8"?>
<tr class="sh" onclick="ii.ShowShareHolder('7358,IRO1GNBO0008')">
<td>ghanisha sherkat</td>
<td><div class="ltr" title="141,933,691">142 M</div></td>
<td>52.560</td>
<td>0</td>
<td><div class=""/></td>
</tr>.ToString()
Dim doc As New XmlDocument
doc.LoadXml(input)
' Fetch all TDs inside TRs using XPath
Dim tds = doc.SelectNodes("/tr/td")
For Each item As XmlNode In tds
' If the first child element is a DIV
If item.FirstChild.Name = "div" Then
' Get the title attribute
Dim titleAttr = item.FirstChild.Attributes("title")
If Not titleAttr Is Nothing Then
Console.WriteLine(titleAttr.Value)
End If
End If
Console.WriteLine(item.InnerText())
Next
You can see a working example using XmlDocument at this .NET Fiddle.
2. Using Regular Expressions (not recommended)
Dim input As String =
<tr class="sh" onclick="ii.ShowShareHolder('7358,IRO1GNBO0008')">
<td>ghanisha sherkat</td>
<td><div class="ltr" title="141,933,691">142 M</div></td>
<td>52.560</td>
<td>0</td>
<td><div class=""/></td>
</tr>.ToString()
' Extract all TDs
Dim tds = Regex.Matches(input, "<td[^>]*>\s*(.*?)\s*<\/td>")
For Each td As Match In tds
Dim content = td.Groups(1).ToString()
' Check if the element is a DIV
If Regex.IsMatch(content, "^<div") Then
' Check if DIV has a title attribute
Dim title = Regex.Match(content, "<div.*title=""(.*?)"">(.*)<\/div>")
If title.Success Then
' Print the first group for title 141,933,691
Console.WriteLine(title.Groups(1))
' Print the second group for element content 142 M
Console.WriteLine(title.Groups(2))
End If
Else
Console.WriteLine(td.Groups(1))
End If
Next
You can see a working example using Regular Expressions at this .NET Fiddle.
Result
The result on both examples will be the same:
ghanisha sherkat
141,933,691
142 M
52.560
0

Regex can make this job easier.
I'm assuming the values you want to capture are either directly inside a td tag or in a title attribute of an element inside a td.
So you can use the following regular expression. It matches an opening td tag, an optional inner tag that has a title attribute, the value itself, an optional closing quote after the value, and finally the closing td tag. The pattern is shown as it appears in a VB.NET string literal, where "" stands for one double-quote character:
<td>(?:<.*title="")?([a-zA-Z0-9 \.,]*)""?.*\-?</td>
Here's the VB.Net code:
Imports System.Text.RegularExpressions
Public Class Form1
Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
Dim input As String = "above text"
Dim matches As MatchCollection = Regex.Matches(input, "<td>(?:<.*title="")?([a-zA-Z0-9 \.,]*)""?.*\-?</td>")
' Loop over matches.
For Each m As Match In matches
MessageBox.Show(m.Groups(1).Value)
Next
End Sub
End Class
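If you do want to continue the IndexOf approach from the question, the missing piece is to start each search where the previous match ended instead of at the beginning of the string. The following is only a minimal sketch under that assumption; ExtractCells and TdLoopDemo are hypothetical names, and a cell that contains nested markup (such as the div with the title attribute) is returned as raw markup that still needs further handling.

```vbnet
Imports System
Imports System.Collections.Generic

Module TdLoopDemo
    ' Collects the raw text between every <td> ... </td> pair, in document order.
    Function ExtractCells(html As String) As List(Of String)
        Dim cells As New List(Of String)
        Const openTag As String = "<td>"
        Const closeTag As String = "</td>"
        Dim pos As Integer = 0
        Do
            ' Search from the end of the previous cell, not from index 0.
            Dim startIndex As Integer = html.IndexOf(openTag, pos, StringComparison.OrdinalIgnoreCase)
            If startIndex < 0 Then Exit Do
            startIndex += openTag.Length
            Dim endIndex As Integer = html.IndexOf(closeTag, startIndex, StringComparison.OrdinalIgnoreCase)
            If endIndex < 0 Then Exit Do
            cells.Add(html.Substring(startIndex, endIndex - startIndex))
            pos = endIndex + closeTag.Length
        Loop
        Return cells
    End Function

    Sub Main()
        Dim html As String = "<tr><td>ghanisha sherkat-</td><td>52.560</td><td>0</td></tr>"
        For Each cell As String In ExtractCells(html)
            Console.WriteLine(cell)
        Next
    End Sub
End Module
```

For the shortened sample row in Main this prints ghanisha sherkat-, 52.560 and 0, one per line.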

Related

Extracting information from page source

I hope the title is a good one. I use this code to download the page source into the string Page:
Dim Page As String = New System.Net.WebClient().DownloadString("https://live.blockcypher.com/btc/address/3M23WLCdaBVXaAjG3CuVUDoAKgpd3xiv8V/")
Console.WriteLine(Page)
The source page:
view-source:https://live.blockcypher.com/btc/address/bc1qwsn95n9ddnxq8aduxnwp6tdc6gagj02kqqew7d/
How can I get this information from the page source and show the values together in a message box?
The message box should look like: Received: 0.0 BTC ; Sent: 0.0 BTC ; Balance: 0.0 BTC
<ul>
<li>
<span class="dash-label">Received</span><br>
0.0 BTC
</li>
<li>
<span class="dash-label">Sent</span><br>
0.0 BTC
</li>
<li>
<span class="dash-label">Balance</span><br>
0.0 BTC
</li>
</ul>
I tried to write a code to get the received bitcoin number but it doesn't work:
Private Sub StartBTN_Click(sender As Object, e As EventArgs) Handles StartBTN.Click
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
Dim Page As String = New System.Net.WebClient().DownloadString("https://live.blockcypher.com/btc/address/3M23WLCdaBVXaAjG3CuVUDoAKgpd3xiv8V/")
'Console.WriteLine(Page)
Dim ReciviedDelimStart As String = "<span class=""dash-label"">Received</span><br>"
Dim ReciviedDelimEnd As String = "</li>"
Dim Find_ReciviedStart As Integer = Page.IndexOf(ReciviedDelimStart)
Dim Find_ReciviedEnd As Integer = Page.IndexOf(ReciviedDelimEnd)
If Find_ReciviedStart > -1 AndAlso Find_ReciviedEnd > -1 Then
Dim Result As String = Mid(Page, Find_ReciviedStart + ReciviedDelimStart.Length + 1, Find_ReciviedEnd - Find_ReciviedStart - ReciviedDelimStart.Length)
MessageBox.Show("Recivied: " + Result) 'THE OUTPUT MUST BE: "Recivied: 0.0 BTC"
End If
End Sub
I also tried to write similar code that reads the link to the QR code of the bitcoin address from the page source and loads the QR code image into Picture_BTC_QR:
Private Sub StartBTN_Click(sender As Object, e As EventArgs) Handles StartBTN.Click
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
Dim Page As String = New System.Net.WebClient().DownloadString("https://live.blockcypher.com/btc/address/3M23WLCdaBVXaAjG3CuVUDoAKgpd3xiv8V/")
'Console.WriteLine(Page)
Dim QR_DelimStart As String = "<img src=""//"
Dim QR_DelimEnd As String = """"
Dim QR_FindStart As Integer = Page.IndexOf(QR_DelimStart)
Dim QR_FindEnd As Integer = Page.IndexOf(QR_DelimEnd)
If QR_FindStart > -1 AndAlso QR_FindEnd > -1 Then
Dim Result As String = Mid(Page, QR_FindStart + QR_DelimStart.Length + 1, QR_FindEnd - QR_FindStart - QR_DelimStart.Length)
Picture_BTC_QR.Load("https://" + Result)
'MessageBox.Show("https://" + Result) 'THE OUTPUT MUST BE: "https://chart.googleapis.com/chart?cht=qr&chl=bitcoin%3Abc1qwsn95n9ddnxq8aduxnwp6tdc6gagj02kqqew7d&choe=UTF-8&chs=300x300"
End If
End Sub
The error I get is:
"System.ArgumentException: 'Argument 'Length' must be greater or equal to zero."
I did not find a solution, which is why I decided to post here. Maybe someone can help me fix the error, or suggest code that does what I want. Thanks!

I would use a Regex to match the text. First add this line at the beginning of the file:
Imports System.Text.RegularExpressions
Then match like this:
Dim Re As Regex = New Regex("(.*)\s+BTC")
Dim Matches As MatchCollection = Re.Matches(Page)
For Each Match As Match In Matches
Console.WriteLine(Match.Value)
Next
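As for the ArgumentException in the question: Page.IndexOf(ReciviedDelimEnd) returns the first "</li>" (or the first quote) anywhere in the page, which most likely sits before the start delimiter, so the length passed to Mid becomes negative. One way to avoid index arithmetic altogether is to capture each label together with its value. The sketch below assumes the markup looks exactly like the fragment shown in the question; ReadDashValues and DashLabelDemo are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module DashLabelDemo
    ' Maps each dash-label (Received, Sent, Balance) to the amount that follows it.
    Function ReadDashValues(page As String) As Dictionary(Of String, String)
        Dim values As New Dictionary(Of String, String)
        ' Group 1: the label text. Group 2: the amount, e.g. "0.0 BTC".
        Dim re As New Regex("<span class=""dash-label"">(\w+)</span><br>\s*([\d.,]+ BTC)")
        For Each m As Match In re.Matches(page)
            values(m.Groups(1).Value) = m.Groups(2).Value
        Next
        Return values
    End Function

    Sub Main()
        ' Shortened stand-in for the downloaded page source.
        Dim page As String =
            "<ul><li><span class=""dash-label"">Received</span><br> 0.0 BTC </li>" &
            "<li><span class=""dash-label"">Sent</span><br> 0.0 BTC </li>" &
            "<li><span class=""dash-label"">Balance</span><br> 0.0 BTC </li></ul>"
        Dim v As Dictionary(Of String, String) = ReadDashValues(page)
        Console.WriteLine("Received: " & v("Received") & " ; Sent: " & v("Sent") & " ; Balance: " & v("Balance"))
    End Sub
End Module
```

In a Windows Forms handler the same string could be passed to MessageBox.Show instead of Console.WriteLine.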

Xpath syntax for HtmlAgilityPack row

I'm using the following code:
Dim cl As WebClient = New WebClient()
Dim html As String = cl.DownloadString(url)
Dim doc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)
Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("//table[@class='table']")
For Each row As HtmlNode In table.SelectNodes(".//tr")
Dim inner_text As String = row.InnerHtml.Trim()
Next
My inner_text for each row looks like this, with different years and data:
"<th scope="row">2015<!-- --> RG Journal Impact</th><td>6.33</td>"
Each row has a th element and a td element. I have tried different ways to pull the values, but I can't seem to pull them one after the other by looping over the column collection. How can I pull just the th element and the td element using the correct XPath syntax?
Until I can use better code I'll use standard parsing functions:
Dim hname As String = row.InnerHtml.Trim()
Dim items() As String = hname.Split("</td>")
Dim year As String = items(1).Substring(items(1).IndexOf(">") + 1)
Dim value As String = items(4).Substring(items(4).IndexOf(">") + 1)
If value.ToLower.Contains("available") Then
value = ""
End If
You can carry on with querying the row:
Option Infer On
Option Strict On
Imports HtmlAgilityPack
Module Module1
Sub Main()
Dim h = "<html><head><title></title></head><body>
<table class=""table"">
<tr><th scope=""row"">2015<!-- --> RG Journal Impact</th><td>6.33</td></tr>
<tr><th scope=""row"">2018 JIR</th><td>9.99</td></tr>
</table>
</body></html>"
Dim doc = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(h)
Dim table = doc.DocumentNode.SelectSingleNode("//table[@class='table']")
For Each row In table.SelectNodes(".//tr")
Dim yearData = row.SelectSingleNode(".//th").InnerText.Split(" "c)(0)
Dim value = row.SelectSingleNode(".//td").InnerText
Console.WriteLine($"Year: {yearData} Value: {value}")
Next
Console.ReadLine()
End Sub
End Module
Outputs:
Year: 2015 Value: 6.33
Year: 2018 Value: 9.99

HtmlAgilityPack - SelectNodes

I'm trying to retrieve a <p class> element.
<div class="thread-plate__details">
<h3 class="thread-plate__title">(S) HexHunter BOW</h3>
<p class="thread-plate__summary">created by Aazoth</p> <!-- (THIS ONE) -->
</div>
But with no luck.
The code I am using is below:
' the example url to scrape
Dim url As String = "http://services.runescape.com/m=forum/forums.ws?39,40,goto," & Label6.Text
Dim source As String = GetSource(url)
If source IsNot Nothing Then
' create a new html document and load the pages source
Dim htmlDocument As New HtmlDocument
htmlDocument.LoadHtml(source)
' Create a new collection of all href tags
Dim nodes As HtmlNodeCollection = htmlDocument.DocumentNode.SelectNodes("//p[@class]")
' Using LINQ get all href values that start with http://
' of course there are others such as www.
Dim links =
(
From node
In nodes
Let attribute = node.Attributes("class")
Where attribute.Value.StartsWith("created by ")
Select attribute.Value
)
Me.ListBox1a.Items.AddRange(links.ToArray)
Dim o, j As Long
For o = 0 To ListBox1a.Items.Count - 1
For j = ListBox1a.Items.Count - 1 To (o + 1) Step -1
If ListBox1a.Items(o) = ListBox1a.Items(j) Then
ListBox1a.Items.Remove(ListBox1a.Items((j)))
End If
Next
Next
For i As Integer = 0 To Me.ListBox1a.Items.Count - 1
Me.ListBox1a.Items(i) = Me.ListBox1a.Items(i).ToString.Replace("created by ", "")
Next
For Each s As String In ListBox1a.Items
Dim lvi As New NetSeal.NSListView
lvi.Text = s
NsListView1.Items.Add(lvi.Text)
Next
It runs, but I can't get the 'created by XXX' text.
I've tried many ways with no luck; a hand would be appreciated.
Thanks in advance, everyone.

It looks like you are testing for the wrong string in attribute.Value. The class attribute holds the class name, not the paragraph text, so attribute.Value.StartsWith("created by ") must be changed to attribute.Value.StartsWith("thread-plate__summary").
To grab the inner content of the node, select node.InnerText instead of attribute.Value:
' the example url to scrape
Dim url As String = "http://services.runescape.com/m=forum/forums.ws?39,40,goto," & Label6.Text
Dim source As String = GetSource(url)
If source IsNot Nothing Then
' create a new html document and load the pages source
Dim htmlDocument As New HtmlDocument
htmlDocument.LoadHtml(source)
' Create a new collection of all p tags that have a class attribute
Dim nodes As HtmlNodeCollection = htmlDocument.DocumentNode.SelectNodes("//p[@class]")
' Using LINQ, keep the p tags whose class starts with thread-plate__summary
' and select their inner text
Dim links =
(
From node
In nodes
Let attribute = node.Attributes("class")
Where attribute.Value.StartsWith("thread-plate__summary")
Select node.InnerText
)
Me.ListBox1a.Items.AddRange(links.ToArray)
Dim o, j As Long
For o = 0 To ListBox1a.Items.Count - 1
For j = ListBox1a.Items.Count - 1 To (o + 1) Step -1
If ListBox1a.Items(o) = ListBox1a.Items(j) Then
ListBox1a.Items.Remove(ListBox1a.Items((j)))
End If
Next
Next
For i As Integer = 0 To Me.ListBox1a.Items.Count - 1
Me.ListBox1a.Items(i) = Me.ListBox1a.Items(i).ToString.Replace("created by ", "")
Next
For Each s As String In ListBox1a.Items
Dim lvi As New NetSeal.NSListView
lvi.Text = s
NsListView1.Items.Add(lvi.Text)
Next
I hope this will work for you.
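The class filter, the "created by " clean-up and the duplicate removal can also be folded into one query, which avoids the nested ListBox loops. This is only a sketch: it assumes the HtmlAgilityPack package is referenced and that the summary paragraphs carry exactly the class thread-plate__summary; ExtractAuthors and SummaryDemo are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Module SummaryDemo
    ' Returns the distinct author names found in "created by ..." summary paragraphs.
    Function ExtractAuthors(html As String) As List(Of String)
        Const prefix As String = "created by "
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//p[@class='thread-plate__summary']")
        ' SelectNodes returns Nothing when the XPath matches no nodes.
        If nodes Is Nothing Then Return New List(Of String)
        Return nodes.Select(Function(n) n.InnerText.Trim()).
            Where(Function(t) t.StartsWith(prefix)).
            Select(Function(t) t.Substring(prefix.Length)).
            Distinct().
            ToList()
    End Function

    Sub Main()
        Dim html As String =
            "<div class=""thread-plate__details"">" &
            "<h3 class=""thread-plate__title"">(S) HexHunter BOW</h3>" &
            "<p class=""thread-plate__summary"">created by Aazoth</p></div>"
        For Each author As String In ExtractAuthors(html)
            Console.WriteLine(author)
        Next
    End Sub
End Module
```

The resulting list can then be added to the list view directly.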

How to replace the html tagged text in a word Document in VB.NET

I have VB.NET code that finds and replaces text in a Word document (.docx). I am using OpenXml for this.
But I want to replace only the text that is wrapped in HTML-style tags, and remove the tags once the new text has been inserted.
My code is:
Public Sub SearchAndReplace(ByVal document As String)
Dim wordDoc As WordprocessingDocument = WordprocessingDocument.Open(document, True)
Using (wordDoc)
Dim docText As String = Nothing
Dim sr As StreamReader = New StreamReader(wordDoc.MainDocumentPart.GetStream)
Using (sr)
docText = sr.ReadToEnd
End Using
Dim regexText As Regex = New Regex("<ReplaceText>")
docText = regexText.Replace(docText, "Hi Everyone!")
Dim sw As StreamWriter = New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
Using (sw)
sw.Write(docText)
End Using
End Using
End Sub
Here is an example that should help you resolve your problem:
Imports System.Text.RegularExpressions
Module Module1
Sub Main()
Dim Text As String = "Blah<foo>Blah"
'Prints Text
Console.WriteLine(Text)
Dim regex As New Regex("(<)[]\w\/]+(>)")
'Prints Text after replace the in-between the capturing group 1 and 2.
'Capturing group are marked between parenthesis in the regex pattern
Console.WriteLine(regex.Replace(Text, "$1foo has been replaced.$2"))
'Update Text
Text = regex.Replace(Text, "$1foo has been replaced.$2")
'Remove starting tag
Dim p As Integer = InStr(Text, "<")
Text = Text.Remove(p - 1, 1)
'Remove trailing tag
Dim pp As Integer = InStr(Text, ">")
Text = Text.Remove(pp - 1, 1)
'Print Text
Console.WriteLine(Text)
Console.ReadLine()
End Sub
End Module
Output:
Blah<foo>Blah
Blah<foo has been replaced.>Blah
Blahfoo has been replaced.Blah
The above code will not function if you have multiple tags per line.
I would advise not to use regex to parse HTML.
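One detail that may explain why a pattern such as "<ReplaceText>" never matches in the question's code: inside the document XML, angle brackets that were typed as literal text are stored escaped. The sketch below assumes the placeholder was typed as plain text in Word, that it appears as &lt;ReplaceText&gt; in the XML, and that Word did not split it across several runs; ReplacePlaceholder and PlaceholderDemo are hypothetical names.

```vbnet
Imports System
Imports System.Security

Module PlaceholderDemo
    ' Replaces the literal placeholder <name> as it is stored in the document XML,
    ' that is, with its angle brackets escaped as &lt; and &gt;.
    Function ReplacePlaceholder(docXml As String, name As String, replacement As String) As String
        Dim escapedPlaceholder As String = "&lt;" & name & "&gt;"
        ' Escape the replacement too, so characters such as & or < keep the XML valid.
        Return docXml.Replace(escapedPlaceholder, SecurityElement.Escape(replacement))
    End Function

    Sub Main()
        ' Shortened stand-in for the text read from MainDocumentPart.
        Dim docXml As String = "<w:p><w:r><w:t>Dear &lt;ReplaceText&gt;</w:t></w:r></w:p>"
        Console.WriteLine(ReplacePlaceholder(docXml, "ReplaceText", "Hi Everyone!"))
    End Sub
End Module
```

Because the tag and its brackets are replaced in one step, no separate tag-removal pass is needed.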

how to extract certain text from string

How do I filter/extract strings?
I have converted a PDF file into a String using iTextSharp and display the text in RichTextBox1.
However, there is a lot of irrelevant text that I don't need in the RichTextBox.
Is there a way to display only the text I want, based on keywords, searching the entire length of the text?
Example of the text displayed in RichTextBox1 after converting the PDF to text:
**774**
**Bos00232940
Bos00320491
Das1234
Das3216**
RAGE*
So the keywords would be "Bos", "Das" and "774", and the new text displayed in RichTextBox1 would be the following, instead of the entire text above:
*Bos00232940
Bos00320491
Das1234
Das3216
774*
Here is what I have so far. It doesn't work: it still displays the entire PDF text in the RichTextBox.
Public Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim pdffilename As String
pdffilename = TextBox1.Text
Dim filepath = "c:\temp\" & TextBox1.Text & ".pdf"
Dim thetext As String
thetext = GetTextFromPDF(filepath)
Dim lines() As String = System.Text.RegularExpressions.Regex.Split(thetext, Environment.NewLine)
Dim keywords As New List(Of String)
keywords.Add("Bos")
keywords.Add("Das")
keywords.Add("774")
Dim newTextLines As New List(Of String)
For Each line As String In lines
For Each keyw As String In thetext
If line.Contains(keyw) Then
newTextLines.Add(line)
Exit For
End If
Next
Next
RichTextBox1.Text = String.Join(Environment.NewLine, newTextLines.ToArray)
End Sub
SOLUTION
Thanks everyone for your help. Below is the code that worked and did exactly what I wanted it to do.
Public Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim pdffilename As String
pdffilename = TextBox1.Text
Dim filepath = "c:\temp\" & TextBox1.Text & ".pdf"
Dim thetext As String
thetext = GetTextFromPDF(filepath)
Dim re As New Regex("[\t ](?<w>((774)|(Bos)|(Das))[a-z0-9]*)[\t ]", RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase Or RegexOptions.Compiled)
Dim Lines() As String = {thetext}
Dim words As New List(Of String)
For Each s As String In Lines
Dim mc As MatchCollection = re.Matches(s)
For Each m As Match In mc
words.Add(m.Groups("w").Value)
Next
Next
RichTextBox1.Text = String.Join(Environment.NewLine, words.ToArray)
End Sub

For Each Word As String In thetext.Split(" "c)
For Each key As String In keywords
If Word.StartsWith(key) Then
newTextLines.Add(Word)
Exit For ' Stop at the first matching keyword so the word is added only once
End If
Next
Next
or using LINQ:
Dim q = From word In thetext.Split(" "c)
Where keywords.Any(Function(s) word.StartsWith(s))
Select word
RichTextBox1.Text = String.Join(Environment.NewLine, q.ToArray())
If you don't know the keywords in advance but know the context in which they occur, you can find them with a regular expression. Two very handy constructs let you find a pattern that follows or precedes another:
(?<=prefix)find finds a pattern that follows another.
find(?=suffix) finds a pattern that comes before another.
If your number keyword (774) always precedes " SIZE", you can find it like this: \w+(?=\sSIZE).
If the other keywords are always between "EX " and " DETAILS" you can find them like this: (?<=EX\s)(\w+\s)+(?=DETAILS).
You can put the whole thing together like this: \w+(?=\sSIZE)|(?<=EX\s)(\w+\s)+(?=DETAILS).
The disadvantage is that the keywords between "EX " and "DETAILS" will be returned as one match. But you can split the matches afterwards as in:
Const input As String = "2 3 3 4 4 A A B B SHEET 1 OF 1 774 SIZE SCALE 24.000-47.999 12.000-23.999 CON BAG WIRE 90in. EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"
Dim matches = Regex.Matches(input, "\w+(?=\sSIZE)|(?<=EX\s)(\w+\s)+(?=DETAILS)")
For Each m As Match In matches
Dim words = m.Value.Split(" "c)
For Each word As String In words
If word.Length > 0 Then ' Suppress the last empty word.
Console.WriteLine(word)
End If
Next
Next
Output:
774
Bos00232940
Bos00320491
Das1234
Das3216
How to do it with a regular expression:
Dim re As New Regex("[\t ](?<w>((774)|(Bos)|(Das))[a-z0-9]*)[\t ]", RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase Or RegexOptions.Compiled)
Private Sub test()
Dim Lines() As String = {"2 3 3 4 4 A A B B SHEET 1 OF 1 774 SIZE SCALE 24.000-47.999 12.000-23.999 CON BAG WIRE 90in. EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"}
Dim words As New List(Of String)
For Each s As String In Lines
Dim mc As MatchCollection = re.Matches(s)
For Each m As Match In mc
words.Add(m.Groups("w").Value)
Next
Next
End Sub
Regex breakdown:
[\t ] Single tab or space (\s is an alternative that matches any whitespace)
(?<w> Start of the capture group named "w". This is the text returned later by m.Groups("w")
((774)|(Bos)|(Das)) One of the three keyword prefixes
[a-z0-9]* Any character a-z or 0-9, repeated zero or more times
) End of the capture group "w" from above
[\t ] Single tab or space after the word
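One thing to watch with this pattern: the leading and trailing [\t ] are consumed as part of each match, so two keywords separated by a single space cannot both match. On the sample line it appears to return only 774, Bos00232940 and Das1234. A minimal sketch of one way around this is to assert the surrounding whitespace with lookarounds instead of consuming it; FindKeywords and KeywordDemo are hypothetical names, and the keyword list is the one from the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module KeywordDemo
    ' Finds whitespace-delimited words that start with 774, Bos or Das.
    Function FindKeywords(text As String) As List(Of String)
        ' (?<=...) and (?=...) check for a tab or space without consuming it,
        ' so the same space can end one match and precede the next.
        Dim re As New Regex("(?<=[\t ])(?<w>(774|Bos|Das)[a-z0-9]*)(?=[\t ])",
                            RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase)
        Dim words As New List(Of String)
        For Each m As Match In re.Matches(text)
            words.Add(m.Groups("w").Value)
        Next
        Return words
    End Function

    Sub Main()
        Dim line As String = "2 3 3 4 4 A A B B SHEET 1 OF 1 774 SIZE SCALE 24.000-47.999 12.000-23.999 CON BAG WIRE 90in. EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"
        For Each word As String In FindKeywords(line)
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

On the sample line this prints 774, Bos00232940, Bos00320491, Das1234 and Das3216. A keyword at the very start or end of the text still needs a surrounding space to match, as in the original pattern.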