Xpath syntax for HtmlAgilityPack row - vb.net

I'm using the following code:
Dim cl As WebClient = New WebClient()
Dim html As String = cl.DownloadString(url)
Dim doc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)
Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("//table[@class='table']")
For Each row As HtmlNode In table.SelectNodes(".//tr")
Dim inner_text As String = row.InnerHtml.Trim()
Next
My inner_text for each row looks like this, with different years and data:
"<th scope="row">2015<!-- --> RG Journal Impact</th><td>6.33</td>"
Each row has a th element and a td element. I have tried different ways to pull the values, but I can't seem to get them one after the other by looping over the column collection. How can I pull just the th element and the td element using the correct XPath syntax?
Until I have better code, I'm using standard string-parsing functions:
Dim hname As String = row.InnerHtml.Trim()
Dim items() As String = hname.Split("</td>")
Dim year As String = items(1).Substring(items(1).IndexOf(">") + 1)
Dim value As String = items(4).Substring(items(4).IndexOf(">") + 1)
If value.ToLower.Contains("available") Then
value = ""
End If

You can carry on with querying the row:
Option Infer On
Option Strict On
Imports HtmlAgilityPack
Module Module1
Sub Main()
Dim h = "<html><head><title></title></head><body>
<table class=""table"">
<tr><th scope=""row"">2015<!-- --> RG Journal Impact</th><td>6.33</td></tr>
<tr><th scope=""row"">2018 JIR</th><td>9.99</td></tr>
</table>
</body></html>"
Dim doc = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(h)
Dim table = doc.DocumentNode.SelectSingleNode("//table[@class='table']")
For Each row In table.SelectNodes(".//tr")
Dim yearData = row.SelectSingleNode(".//th").InnerText.Split(" "c)(0)
Dim value = row.SelectSingleNode(".//td").InnerText
Console.WriteLine($"Year: {yearData} Value: {value}")
Next
Console.ReadLine()
End Sub
End Module
Outputs:
Year: 2015 Value: 6.33
Year: 2018 Value: 9.99

Related

How to extract multiple words from a string in VB.NET

<tr class="sh"onclick="ii.ShowShareHolder('7358,IRO1GNBO0008')">
<td>ghanisha sherkat-</td>
<td><div class='ltr' title="141,933,691">142 M</div></td>
<td>52.560</td>
<td>0</td>
<td><div class=""/></td>
</tr>
We want the following items extracted from the text above:
ghanisha sherkat
141,933,691
52.560
0
My try:
Dim input as string="above text"
Dim c2 As String() = input.Split(New String() {"</td>"},StringSplitOptions.None)
Dim r As Integer
For r = 0 To c2.Length - 2
MessageBox.Show(c2(r))
Next
My other attempt:
Dim sDelimStart As String = "<td>"
Dim sDelimEnd As String = "</td>"
Dim nIndexStart As Integer = input.IndexOf(sDelimStart)
Dim nIndexEnd As Integer = input.IndexOf(sDelimEnd)
Dim res As String = Strings.Mid(input, nIndexStart + sDelimStart.Length + 1, nIndexEnd - nIndexStart - sDelimStart.Length)
MessageBox.Show(res)
This way extracts "ghanisha sherkat".
How do I extract the other items?
How do I continue from here? Thank you.
It is not a good idea to search the string directly or to use regular expressions to parse markup such as HTML. If the fragment is well-formed XML, you can instead load it into an XmlDocument and parse the HTML as XML using the <?xml...?> tag.
1. Using XmlDocument
Dim input As String = <?xml version="1.0" encoding="utf-8"?>
<tr class="sh" onclick="ii.ShowShareHolder('7358,IRO1GNBO0008')">
<td>ghanisha sherkat</td>
<td><div class="ltr" title="141,933,691">142 M</div></td>
<td>52.560</td>
<td>0</td>
<td><div class=""/></td>
</tr>.ToString()
Dim doc As New XmlDocument
doc.LoadXml(input)
' Fetch all TDs inside TRs using XPath
Dim tds = doc.SelectNodes("/tr/td")
For Each item As XmlNode In tds
' If the first child element is a DIV
If item.FirstChild.Name = "div" Then
' Get the title attribute
Dim titleAttr = item.FirstChild.Attributes("title")
If Not titleAttr Is Nothing Then
Console.WriteLine(titleAttr.Value)
End If
End If
Console.WriteLine(item.InnerText())
Next
You can see a working example using XmlDocument at this .NET Fiddle.
2. Using Regular Expressions (not recommended)
Dim input As String =
<tr class="sh" onclick="ii.ShowShareHolder('7358,IRO1GNBO0008')">
<td>ghanisha sherkat</td>
<td><div class="ltr" title="141,933,691">142 M</div></td>
<td>52.560</td>
<td>0</td>
<td><div class=""/></td>
</tr>.ToString()
' Extract all TDs
Dim tds = Regex.Matches(input, "<td[^>]*>\s*(.*?)\s*<\/td>")
For Each td As Match In tds
Dim content = td.Groups(1).ToString()
' Check if the content is a DIV element
If Regex.IsMatch(content, "^<div") Then
' Check if the DIV has a title attribute
Dim title = Regex.Match(content, "<div.*title=""(.*?)"">(.*)<\/div>")
If title.Success Then
' Print the first group for title 141,933,691
Console.WriteLine(title.Groups(1))
' Print the second group for element content 142 M
Console.WriteLine(title.Groups(2))
End If
Else
Console.WriteLine(td.Groups(1))
End If
Next
You can see a working example using Regular Expressions at this .NET Fiddle.
Result
The result on both examples will be the same:
ghanisha sherkat
141,933,691
142 M
52.560
0
Regex can make this job easier.
I'm assuming the values you want to capture are the ones directly inside a td tag or into a title attribute inside a td.
So you can use the following regular expression. It captures every value that starts with a td tag, may or may not have a nested tag with a title attribute, may or may not have a closing quote after the value, and ends with the closing td tag.
<td>(?:<.*title="")?([a-zA-Z0-9 \.,]*)""?.*\-?</td>
Here's the VB.Net code:
Imports System.Text.RegularExpressions
Public Class Form1
Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
Dim input As String = "above text"
Dim matches As MatchCollection = Regex.Matches(input, "<td>(?:<.*title="")?([a-zA-Z0-9 \.,]*)""?.*\-?</td>")
' Loop over matches.
For Each m As Match In matches
MessageBox.Show(m.Groups(1).Value)
Next
End Sub
End Class

check if file contains item in list

I have a list of graphic names, "graphicList". I need to search my file for each item in the graphicList in an entity string. I don't know how to reference each item in the graphicList and search for it.
Code so far:
Dim Regex = New Regex("<!ENTITY .*?SYSTEM ""<graphicListItem>"" .*?>")
Dim strMasterDoc = File.ReadAllText(FileLocation)
Dim rxMatches = Regex.Matches(strMasterDoc)
Dim entityList As New List(Of String)
Dim entityFound As MatchCollection = Regex.Matches(strMasterDoc)
'For each file's multipled image file references
For Each m As Match In entityFound
Dim found As Group = m.Groups(1)
entityList.Add(found.Value)
Next
I found the answer to my question
For Each item As Match In graphicRefList
Dim found As Group = item.Groups(1)
GraphicList.Add(found.Value)
Dim Regex = New Regex("(<!ENTITY " & found.Value & " SYSTEM ""\w.+?\w"" .*?>)")
Next
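One way to reference each item in the list is to build the pattern inside the loop and escape the name with Regex.Escape, so that characters such as the dot in a file name are matched literally. This is a minimal sketch, assuming the graphic name is the quoted SYSTEM identifier as in the question's pattern; FindDeclaredGraphics and its parameter names are hypothetical:

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EntitySearch
    ' Hypothetical helper: returns the names from graphicList that appear as
    ' the quoted SYSTEM identifier of an ENTITY declaration in docText.
    Function FindDeclaredGraphics(docText As String, graphicList As IEnumerable(Of String)) As List(Of String)
        Dim found As New List(Of String)
        For Each name As String In graphicList
            ' Regex.Escape makes the dot in "a.png" match a literal dot only
            Dim pattern As String = "<!ENTITY\s+\S+\s+SYSTEM\s+""" & Regex.Escape(name) & """[^>]*>"
            If Regex.IsMatch(docText, pattern) Then
                found.Add(name)
            End If
        Next
        Return found
    End Function
End Module
```

Calling FindDeclaredGraphics(File.ReadAllText(FileLocation), graphicList) would then give the subset of graphic names that the file declares.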

Updating values in .csv file depend to values in second .csv

I have two csv files, which contains some data. One of them looks like this:
drid;aid;1date;2date;res;
2121;12;"01.11.2019 06:49";"01.11.2019 19:05";50;
9;10;"01.11.2019 10:47";"01.11.2019 11:33";0;
72;33;"01.11.2019 09:29";"01.11.2019 14:19";0;
777;31;"03.11.2019 04:34";"03.11.2019 20:38";167,35;
The second csv looks like this:
datetime;res;drid
"2019-11-01 09:02:00";14,59;2121
"2019-11-03 12:59:00";25,00;777
My goal is to compare the day of the date and the "drid" in both files; if they match, add the two "res" values and replace "res" in the first csv with the sum. The result has to look like this:
2121;12;"01.11.2019 06:49";"01.11.2019 19:05";64,59;
9;10;"01.11.2019 10:47";"01.11.2019 11:33";0;
72;33;"01.11.2019 09:29";"01.11.2019 14:19";0;
777;31;"03.11.2019 04:34";"03.11.2019 20:38";192,35;
What do I have to do to get that result in VB.NET? I tried a LINQ query, but without success: I'm a newbie and couldn't find a way to declare variables for the two files and then compare them.
With a .bat file I joined both csv files into one big.csv and tried to get the result from that single file, but again without success. My latest code is:
Private Sub Button12_Click(sender As Object, e As EventArgs) Handles Button12.Click
Dim Raplines As String() = IO.File.ReadAllLines("C:\Users\big.csv")
Dim strList As New List(Of String)
Dim readFirst As Boolean
For Each line In Raplines
If readFirst Then
Dim strValues As String() = line.Split(";")
Dim kn1 As String = strValues(0)
Dim kn2 As String = strValues(59)
Dim pvm1 As Date = strValues(2)
Dim pvm1Changed = pvm1.ToString("dd")
Dim pvm2 As Date = strValues(3)
Dim pvm2Changed = pvm2.ToString("dd")
Dim pvm3 As Date = strValues(60)
Dim pvm3Changed = pvm3.ToString("dd")
Dim Las1 As Decimal = strValues(9)
Dim Las2 As Decimal = strValues(61)
Dim sum As Decimal = Las1 - Las2
If kn1 = kn2 And pvm3Changed = pvm1Changed Or pvm3Changed = pvm2Changed Then
strValues(9) = sum
strList.Add(String.Join(";", strValues))
End If
End If
readFirst = True
Next
IO.File.WriteAllLines("C:\Users\big_new.csv", strList.ToArray())
End Sub
Instead of changing the existing file, I wrote a new one. I used a StringBuilder so the runtime would not have to create and throw away so many strings; unlike a String, a StringBuilder is mutable. I parsed the two date formats and used .Date to disregard the times.
' Requires Imports System.IO, System.Text, System.Linq and System.Globalization
Private Sub ChangeCSVFile()
Dim lines1 = File.ReadAllLines("C:\Users\someone\Desktop\CSV1.csv")
Dim lines2 = File.ReadAllLines("C:\Users\someone\Desktop\CSV2.csv")
Dim sb As New StringBuilder
' Copy the header row unchanged; it contains no dates to parse
sb.AppendLine(lines1(0))
For Each line1 In lines1.Skip(1)
Dim Fields1 = line1.Split(";"c) 'drid;aid;1date;2date;res
' Trim the extra double quotes from "01.11.2019 06:49"; HH is the 24-hour clock
Dim d1 = DateTime.ParseExact(Fields1(2).Trim(Chr(34)), "dd.MM.yyyy HH:mm", CultureInfo.InvariantCulture).Date
For Each line2 In lines2.Skip(1) ' skip the header of the second file too
Dim Fields2 = line2.Split(";"c) 'datetime;res;drid
' "2019-11-01 09:02:00"
Dim d2 = DateTime.ParseExact(Fields2(0).Trim(Chr(34)), "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture).Date
If Fields1(0) = Fields2(2) AndAlso d1 = d2 Then
' CDec follows the current culture, so "14,59" needs a culture with a decimal comma
Dim sum = CDec(Fields1(4)) + CDec(Fields2(1))
Fields1(4) = sum.ToString
End If
Next
sb.AppendLine(String.Join(";", Fields1))
Next
File.WriteAllText("C:\Users\someone\Desktop\CSV3.csv", sb.ToString)
End Sub
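CDec in the code above follows the current culture, so "14,59" is only read as 14.59 on a machine whose regional settings use a decimal comma. A minimal sketch of a culture-independent alternative, assuming the "res" values always use a comma as the decimal separator; AddRes and CommaFormat are hypothetical names:

```vbnet
Imports System.Globalization

Module CsvDecimals
    ' Number format with a decimal comma, independent of the machine's
    ' regional settings (an assumption about how the files were written).
    Private ReadOnly CommaFormat As New NumberFormatInfo With {
        .NumberDecimalSeparator = ",",
        .NumberGroupSeparator = "."
    }

    ' Adds two "res" fields and formats the sum with a decimal comma.
    Function AddRes(res1 As String, res2 As String) As String
        Dim a As Decimal = Decimal.Parse(res1, NumberStyles.Number, CommaFormat)
        Dim b As Decimal = Decimal.Parse(res2, NumberStyles.Number, CommaFormat)
        ' "0.##" drops trailing zeros, so 50 + 14,59 is written as 64,59
        Return (a + b).ToString("0.##", CommaFormat)
    End Function
End Module
```

With this helper, the assignment inside the If block could become Fields1(4) = AddRes(Fields1(4), Fields2(1)).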

HtmlAgilityPack - SelectNodes

I'm trying to retrieve a <p class> element.
<div class="thread-plate__details">
<h3 class="thread-plate__title">(S) HexHunter BOW</h3>
<p class="thread-plate__summary">created by Aazoth</p> <!-- (THIS ONE) -->
</div>
But with no luck.
The code I am using is below:
' the example url to scrape
Dim url As String = "http://services.runescape.com/m=forum/forums.ws?39,40,goto," & Label6.Text
Dim source As String = GetSource(url)
If source IsNot Nothing Then
' create a new html document and load the pages source
Dim htmlDocument As New HtmlDocument
htmlDocument.LoadHtml(source)
' Create a new collection of all href tags
Dim nodes As HtmlNodeCollection = htmlDocument.DocumentNode.SelectNodes("//p[@class]")
' Using LINQ get all href values that start with http://
' of course there are others such as www.
Dim links =
(
From node
In nodes
Let attribute = node.Attributes("class")
Where attribute.Value.StartsWith("created by ")
Select attribute.Value
)
Me.ListBox1a.Items.AddRange(links.ToArray)
Dim o, j As Long
For o = 0 To ListBox1a.Items.Count - 1
For j = ListBox1a.Items.Count - 1 To (o + 1) Step -1
If ListBox1a.Items(o) = ListBox1a.Items(j) Then
ListBox1a.Items.Remove(ListBox1a.Items((j)))
End If
Next
Next
For i As Integer = 0 To Me.ListBox1a.Items.Count - 1
Me.ListBox1a.Items(i) = Me.ListBox1a.Items(i).ToString.Replace("created by ", "")
Next
For Each s As String In ListBox1a.Items
Dim lvi As New NetSeal.NSListView
lvi.Text = s
NsListView1.Items.Add(lvi.Text)
Next
It runs, but I can't get the 'created by XXX' text.
I've tried many ways without luck; a hand would be appreciated.
Thanks in advance, everyone.
It looks like you are testing the wrong string in attribute.Value. The class attribute holds thread-plate__summary, not the paragraph text, so attribute.Value.StartsWith("created by ") must be changed to attribute.Value.StartsWith("thread-plate__summary").
To grab the inner content of the node, select node.InnerText instead of attribute.Value:
' the example url to scrape
Dim url As String = "http://services.runescape.com/m=forum/forums.ws?39,40,goto," & Label6.Text
Dim source As String = GetSource(url)
If source IsNot Nothing Then
' create a new html document and load the pages source
Dim htmlDocument As New HtmlDocument
htmlDocument.LoadHtml(source)
' Select every <p> element that has a class attribute
Dim nodes As HtmlNodeCollection = htmlDocument.DocumentNode.SelectNodes("//p[@class]")
' Using LINQ, keep the paragraphs whose class starts with thread-plate__summary
' and take their inner text
Dim links =
(
From node
In nodes
Let attribute = node.Attributes("class")
Where attribute.Value.StartsWith("thread-plate__summary")
Select node.InnerText
)
Me.ListBox1a.Items.AddRange(links.ToArray)
Dim o, j As Long
For o = 0 To ListBox1a.Items.Count - 1
For j = ListBox1a.Items.Count - 1 To (o + 1) Step -1
If ListBox1a.Items(o) = ListBox1a.Items(j) Then
ListBox1a.Items.Remove(ListBox1a.Items((j)))
End If
Next
Next
For i As Integer = 0 To Me.ListBox1a.Items.Count - 1
Me.ListBox1a.Items(i) = Me.ListBox1a.Items(i).ToString.Replace("created by ", "")
Next
For Each s As String In ListBox1a.Items
Dim lvi As New NetSeal.NSListView
lvi.Text = s
NsListView1.Items.Add(lvi.Text)
Next
I hope this will work for you.
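As an alternative sketch (not the method used in the answer above), the class test can be moved into the XPath expression itself, so that only the summary paragraphs are selected and the duplicates are removed with LINQ instead of nested loops. GetAuthors is a hypothetical helper name, and the exact-match predicate assumes the class attribute contains only thread-plate__summary:

```vbnet
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Module ThreadAuthors
    ' Hypothetical helper: returns the distinct author names found in
    ' <p class="thread-plate__summary">created by NAME</p> elements.
    Function GetAuthors(source As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(source)
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//p[@class='thread-plate__summary']")
        ' SelectNodes returns Nothing when no node matches
        If nodes Is Nothing Then
            Return New List(Of String)
        End If
        Return nodes.
            Select(Function(n) n.InnerText.Trim()).
            Where(Function(t) t.StartsWith("created by ")).
            Select(Function(t) t.Substring("created by ".Length)).
            Distinct().
            ToList()
    End Function
End Module
```

The returned list could be passed straight to ListBox1a.Items.AddRange(GetAuthors(source).ToArray()).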

VB .NET HTMLAgilityPack Colon Separated Values

Is there a way to get the values within a tag using HTMLAgilityPack?
My variable dataNode is an HtmlAgilityPack.HtmlNode and contains:
Dim doc as New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml("
<div id="container" data="id:12,country:usa,city:oregon,id:13,country:usa,city:atlanta">
Google
</div>
")
I would like to get the value of each id, country and city. They repeat within the tag and have different values.
Dim dataNode as HtmlAgilityPack.HtmlNode
dataNode = doc.documentNode.SelectSingleNode("//div")
txtbox.text = dataNode.Attributes("id[1]").value
This throws a System.NullReferenceException.
You need the "data" attribute, not the "id" attribute.
Once you have the value of the correct attribute, you will need to parse it into some data structure suitable for holding each part of the data, for example:
Option Infer On
Option Strict On
Module Module1
Public Class LocationDatum
Property ID As Integer
Property Country As String
Property City As String
Public Overrides Function ToString() As String
Return $"ID={ID}, Country={Country}, City={City}"
End Function
End Class
Sub Main()
Dim doc As New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml("
<div id=""container"" data=""id:12,country:usa,city:oregon,id:13,country:usa,city:atlanta"">
Google
</div>
")
Dim dataNode = doc.DocumentNode.SelectSingleNode("//div")
Dim rawData = dataNode.Attributes("data").Value
Dim dataParts = rawData.Split(","c)
Dim locationData As New List(Of LocationDatum)
' A simple way of parsing the data
For i = 0 To dataParts.Count - 1 Step 3
If i + 2 < dataParts.Count Then
Dim id As Integer = -1
Dim country As String = ""
Dim city As String = ""
' used to check all three required parts have been found:
Dim partsFoundFlags = 0
For j = 0 To 2
Dim itemParts = dataParts(i + j).Split(":"c)
Select Case itemParts(0)
Case "id"
id = CInt(itemParts(1))
partsFoundFlags = partsFoundFlags Or 1
Case "country"
country = itemParts(1)
partsFoundFlags = partsFoundFlags Or 2
Case "city"
city = itemParts(1)
partsFoundFlags = partsFoundFlags Or 4
End Select
Next
If partsFoundFlags = 7 Then
locationData.Add(New LocationDatum With {.ID = id, .Country = country, .City = city})
End If
End If
Next
For Each d In locationData
Console.WriteLine(d)
Next
Console.ReadLine()
End Sub
End Module
Which outputs:
ID=12, Country=usa, City=oregon
ID=13, Country=usa, City=atlanta
It is resistant to some data malformations, such as id/city/country being in a different order, and spurious data at the end.
You would, of course, put the parsing code into its own function.
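A minimal sketch of what such a function might look like. ParseLocations and the Location class are hypothetical names (Location repeats the shape of LocationDatum so the sketch compiles on its own), and a dictionary per group of three parts replaces the bit flags used above:

```vbnet
Imports System
Imports System.Collections.Generic

Module LocationParsing
    ' Same shape as the LocationDatum class above.
    Public Class Location
        Public Property ID As Integer
        Public Property Country As String
        Public Property City As String
    End Class

    ' Parses "id:12,country:usa,city:oregon,id:13,..." in groups of three parts.
    Public Function ParseLocations(rawData As String) As List(Of Location)
        Dim result As New List(Of Location)
        Dim parts As String() = rawData.Split(","c)
        ' Stop before a trailing group that has fewer than three parts
        For i As Integer = 0 To parts.Length - 3 Step 3
            Dim fields As New Dictionary(Of String, String)
            For j As Integer = 0 To 2
                Dim kv As String() = parts(i + j).Split(":"c)
                If kv.Length = 2 Then
                    fields(kv(0).Trim()) = kv(1).Trim()
                End If
            Next
            ' Keep the group only if all three keys are present and id is numeric
            Dim id As Integer
            If fields.ContainsKey("id") AndAlso fields.ContainsKey("country") AndAlso
               fields.ContainsKey("city") AndAlso Integer.TryParse(fields("id"), id) Then
                result.Add(New Location With {.ID = id, .Country = fields("country"), .City = fields("city")})
            End If
        Next
        Return result
    End Function
End Module
```

Sub Main would then reduce to reading the "data" attribute and calling ParseLocations(rawData). Using Integer.TryParse rather than CInt means a non-numeric id skips that group instead of throwing.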