How to get a specific class that is in a div - vb.net

I'm just starting out with the Html Agility Pack library. I can't understand why the following code, which selects only the divs, works:
Dim url As String = "https://example.com"
Dim web = New HtmlWeb()
Dim doc = web.Load(url)
For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//div")
Console.Write(node.InnerText)
Next
whereas the following code, which looks for a div with a specific class, gives me an error:
Dim url As String = "https://example.com"
Dim web = New HtmlWeb()
Dim doc = web.Load(url)
For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='myclass']")
Console.Write(node.InnerText)
Next
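A minimal sketch of the class lookup, assuming the HtmlAgilityPack package is referenced. The question does not say which error occurred, so this covers two likely causes rather than a confirmed diagnosis: XPath attribute tests are written with `@` (as in `@class`), and `SelectNodes` returns `Nothing` instead of an empty collection when no node matches, so looping over its result directly can throw a NullReferenceException. The inline HTML and the module name are illustrative only.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ClassLookupSketch
    Sub Main()
        ' Illustrative markup; a real program would call New HtmlWeb().Load(url) instead.
        Dim html As String = "<div class='myclass'>first</div>"
        html &= "<div class='other myclass'>second</div>"
        html &= "<div>third</div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Exact match: the whole class attribute must equal "myclass".
        Dim exact As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='myclass']")

        ' Token match: also finds elements that carry several classes.
        Dim tokenXPath As String = "//div[contains(concat(' ', normalize-space(@class), ' '), ' myclass ')]"
        Dim byToken As HtmlNodeCollection = doc.DocumentNode.SelectNodes(tokenXPath)

        ' SelectNodes returns Nothing when nothing matches, so check before looping.
        If exact IsNot Nothing Then
            For Each node As HtmlNode In exact
                Console.WriteLine("exact: " & node.InnerText)
            Next
        End If

        If byToken IsNot Nothing Then
            For Each node As HtmlNode In byToken
                Console.WriteLine("token: " & node.InnerText)
            Next
        End If
    End Sub
End Module
```

The exact form only matches when `class` contains nothing but `myclass`; the token form is the usual XPath 1.0 workaround for elements with multiple classes.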

Related

Go to each link stored in string and list all PDF links

This is my first attempt at writing a program.
I am trying to go to a website, collect all of its links, then visit each link and collect every link that ends with .pdf.
I am able to get all the links I need. Now I want to visit each one and search for PDF files.
Imports HtmlAgilityPack
Module Module1
Sub Main()
Dim mainUrl As String = "xxx"
Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) '< - - - Load the webpage into htmldocument
Dim listLinks As New List(Of String)
Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a") '< - - - select nodes with links
For Each src As HtmlNode In srcs
' Store links in array
listLinks.Add(src.Attributes("href").Value)
Next
' Here I am attempting to go through each link and list all .pdf links
'get the array from the list.
Dim arrayLinks() As String = listLinks.ToArray()
'Console.Read()
Dim scrapedsrcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='dl-items']//a") '< - - - select nodes with links
For Each scrapedlink As HtmlNode In scrapedsrcs
' Show links in console
Console.WriteLine(scrapedlink.Attributes("href").Value) '< - - - Print urls
Next
End Sub
End Module
How to make it happen? Can somebody give me a hint?
EDIT:
First of all, you did not iterate over the product links and download each page's HTML to scan it for the PDF download links.
This is done by:
For Each productLink As String In listLinks
Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)
Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a") '< - - - select nodes with links
If scrapedsrcs IsNot Nothing Then
For Each scrapedlink As HtmlNode In scrapedsrcs
' Show links in console
Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") '< - - - Print urls
Next
End If
Next
Secondly, the link to download the PDF is contained inside a div instead of a ul. So to select the nodes, use:
prodDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a")
Or you can specify * to select by class regardless of the element, like:
prodDoc.DocumentNode.SelectNodes("//*[@class='dl-items']//a")
Since you are not doing anything with those links, why not simply write it nice and short?
Like:
Imports HtmlAgilityPack
Module Module1
Sub Main()
Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/")
For Each src As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a")
htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
Dim LinkTest As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='dl-items']/a")
If LinkTest IsNot Nothing AndAlso LinkTest.Attributes("href").Value.Length > 0 Then Console.WriteLine(LinkTest.Attributes("href").Value)
Next
End Sub
End Module

Download PDF files from webpage

I am trying to download files from website. My current solution seems to work but there are some things I don't understand.
The first issue is with this XPath:
//div[@class='large-4 medium-4 columns']//a
There are other divs with the class large-4 medium-4 columns, so I am getting a couple of unnecessary links. How do I get rid of them? I only need pages whose URL contains /products/
The second issue is that nothing gets downloaded to C:\temp\ and I guess the problem is with:
//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]
but what is wrong?
"xxx" is the link in my code and it should be
Imports HtmlAgilityPack
Module Module1
Sub Main()
Dim mainUrl As String = "xxx"
Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) '< - - - Load the webpage into htmldocument
Dim listLinks As New List(Of String)
Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='large-4 medium-4 columns']//a") '< - - - select nodes with links
For Each src As HtmlNode In srcs
' Store links in array
listLinks.Add(src.Attributes("href").Value)
Console.WriteLine(src.Attributes("href").Value)
Next
Console.Read()
For Each productLink As String In listLinks
Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)
Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]") '< - - - select nodes with links
If scrapedsrcs IsNot Nothing Then
For Each scrapedlink As HtmlNode In scrapedsrcs
' Show links in console
'Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") '< - - - Print urls
My.Computer.Network.DownloadFile(scrapedlink.Attributes("href").Value, "C:\temp\" & System.IO.Path.GetFileName(scrapedlink.Attributes("href").Value) & ".pdf")
Next
End If
Next
Console.Read()
' End of scraping
End Sub
End Module
EDIT:
OK, the first one should be
//div[@class='row inset1 productItem padb1 padt1']/div[@class='large-4 medium-4 columns']//a
This will download the brochures to the folder where the app is run:
Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://webpage.com")
Dim ProductListPage As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a")
For Each src As HtmlNode In ProductListPage
htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
Dim LinkTester As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a")
If LinkTester IsNot Nothing Then
For Each dllink In LinkTester
Dim LinkURL As String = dllink.Attributes("href").Value
Console.WriteLine(LinkURL)
Dim ExtractFilename As String = LinkURL.Substring(LinkURL.LastIndexOf("/"))
Dim DLClient As New WebClient
DLClient.DownloadFileAsync(New Uri(LinkURL), ".\" & ExtractFilename)
Next
End If
Next
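The "only pages that contain /products/" requirement from the question can also be expressed in the XPath itself with `contains(@href, ...)`. The following is a sketch under assumptions, not the poster's final code: it assumes HtmlAgilityPack is referenced, the markup and example.com URLs are made up for illustration, and the actual download call is left as a comment so the snippet runs offline.

```vbnet
Imports System
Imports System.IO
Imports HtmlAgilityPack

Module ProductLinkFilterSketch
    Sub Main()
        ' Illustrative markup standing in for the downloaded product list page.
        Dim html As String = "<div class='large-4 medium-4 columns'><a href='https://example.com/products/pump/'>Pump</a></div>"
        html &= "<div class='large-4 medium-4 columns'><a href='https://example.com/news/'>News</a></div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Keep only anchors whose href contains "/products/".
        Dim xpath As String = "//div[@class='large-4 medium-4 columns']//a[contains(@href, '/products/')]"
        Dim productLinks As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)

        ' SelectNodes returns Nothing when nothing matches.
        If productLinks IsNot Nothing Then
            For Each link As HtmlNode In productLinks
                Console.WriteLine(link.GetAttributeValue("href", ""))
            Next
        End If

        ' Build a local target path from a (hypothetical) PDF URL.
        Dim pdfUrl As New Uri("https://example.com/files/brochure.pdf")
        Dim fileName As String = Path.GetFileName(pdfUrl.LocalPath) ' already includes the extension
        Dim target As String = Path.Combine("C:\temp", fileName)
        Console.WriteLine(target)

        ' The download itself would then be, for example:
        ' My.Computer.Network.DownloadFile(pdfUrl.AbsoluteUri, target)
    End Sub
End Module
```

Because `Path.GetFileName` keeps the extension, appending ".pdf" again to a link that already ends in .pdf would produce a name like brochure.pdf.pdf.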

How would I retrieve certain information on a webpage using their ID value?

In vb.net, I can download a webpage as a string like this:
Using ee As New System.Net.WebClient()
Dim reply As String = ee.DownloadString("https://pastebin.com/eHcQRiff")
MessageBox.Show(reply)
End Using
Would it be possible to specify the ID of an element on the webpage so that the reply contains only the content of that element (the code box)?
Example:
The ID tag of RAW Paste Data on https://pastebin.com/eHcQRiff is id="paste_code" which includes the following text:
Test=1
Test=2
Is there any way to get the WebClient to output only that exact text using the ID (or any other method)?
You can use the HtmlAgilityPack library:
Dim document As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
document.Load("C:\YourDownloadedHtml.html")
Dim text As String = document.GetElementbyId("paste_code").InnerText
Some more sample code:
(Tested with HtmlAgilityPack 1.6.10.0)
Dim html As String = "<TD width=""50%""><DIV align=right>Name :<B> </B></DIV></TD><TD width=""50%""><div id='i1'>SomeText</div></TD><TR vAlign=center>"
Dim htmlDoc As HtmlDocument = New HtmlDocument
htmlDoc.LoadHtml(html) 'To load from html string directly
Dim name As String = htmlDoc.DocumentNode.SelectSingleNode("//td/div[@id='i1']").InnerText
Console.WriteLine(name)
Output:
SomeText
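To connect this back to the WebClient code in the question, the downloaded string can be passed straight to `LoadHtml` instead of being saved to a file first. A minimal sketch, assuming HtmlAgilityPack is referenced; `GetTextById` is a hypothetical helper name, and the inline HTML only stands in for whatever `DownloadString` would return (whether the live page still exposes `id="paste_code"` is not something this snippet can confirm).

```vbnet
Imports System
Imports HtmlAgilityPack

Module ElementByIdSketch
    ' Hypothetical helper: returns the inner text of the element with the
    ' given id, or Nothing when no such element exists.
    Function GetTextById(html As String, elementId As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim node As HtmlNode = doc.GetElementbyId(elementId)
        If node Is Nothing Then Return Nothing

        ' Decode entities such as &amp; into plain characters.
        Return HtmlEntity.DeEntitize(node.InnerText)
    End Function

    Sub Main()
        ' In the question this string would come from:
        '   Using ee As New System.Net.WebClient()
        '       reply = ee.DownloadString(url)
        '   End Using
        Dim reply As String = "<html><body><div id='paste_code'>Test=1" & vbLf & "Test=2</div></body></html>"

        Dim text As String = GetTextById(reply, "paste_code")
        If text Is Nothing Then
            Console.WriteLine("Element not found")
        Else
            Console.WriteLine(text)
        End If
    End Sub
End Module
```

Checking for `Nothing` before reading `InnerText` avoids a NullReferenceException when the id is missing from the page.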

HtmlAgilityPack - getting error when looping through nodes. Doesn't make sense

I'm trying to get all nodes below but I am getting an error message of:
Overload resolution failed because no accessible 'GetAttributeValue' accepts this number of arguments.
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
Dim docNode As HtmlNode = doc.DocumentNode
Dim nodes As HtmlNodeCollection = docNode.SelectNodes("//input")
For Each node As HtmlNode In nodes
Dim id As String = node.GetAttributeValue("id")
Next
Any ideas on why I am getting this error message? Thanks
You need to provide a default value as a second parameter to GetAttributeValue:
Dim id As String = node.GetAttributeValue("id", "")
Update for updated question
In addition to the above fix, you are retrieving the HtmlDocument incorrectly. HtmlDocument.Load reads from a file or stream and HtmlDocument.LoadHtml parses an HTML string; neither retrieves the page from the web server.
You need to modify your code to fetch the data from the URL using HtmlWeb. Replace the following lines:
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
with these:
Dim doc As HtmlDocument
Dim web As New HtmlWeb
doc = web.Load("http://shaggybevo.com/board/register.php")

Check if element has a specific attribute using HtmlAgilityPack in VB.Net

I'm using HtmlAgilityPack to parse HTML.
I want to check if an element has a specific attribute.
I want to check whether an <a> tag has the href attribute.
Dim doc As HtmlDocument = New HtmlDocument()
doc.Load(New StringReader(content))
Dim root As HtmlNode = doc.DocumentNode
Dim anchorTags As New List(Of String)
For Each link As HtmlNode In root.SelectNodes("//a")
If link.HasAttributes("href") Then doSomething() 'this doesn't work because hasAttributes only checks whether an element has attributes or not
Next
Like this:
If link.Attributes("href") IsNot Nothing Then
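A slightly fuller sketch of the same check, assuming HtmlAgilityPack is referenced; the inline HTML and module name are illustrative. It shows two options: testing each node with `Attributes.Contains`, or letting the XPath `//a[@href]` select only anchors that have the attribute.

```vbnet
Imports System
Imports HtmlAgilityPack

Module HrefCheckSketch
    Sub Main()
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<a href='/home'>Home</a><a name='top'>Top</a>")

        ' Option 1: select every anchor and test each one for the attribute.
        Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a")
        If links IsNot Nothing Then
            For Each link As HtmlNode In links
                If link.Attributes.Contains("href") Then
                    Console.WriteLine("has href: " & link.GetAttributeValue("href", ""))
                Else
                    Console.WriteLine("no href: " & link.InnerText)
                End If
            Next
        End If

        ' Option 2: let XPath filter, so only anchors with an href are returned.
        Dim withHref As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If withHref IsNot Nothing Then
            Console.WriteLine("anchors with href: " & withHref.Count.ToString())
        End If
    End Sub
End Module
```

Both collections are checked against `Nothing` because `SelectNodes` returns `Nothing` when no node matches.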