Need example: HtmlAgilityPack - VB.NET

I am trying again to scrape, as an exercise.
Currently I have the following code:
Imports System
Imports System.Xml
Imports HtmlAgilityPack
Imports System.Net
Imports System.IO
Imports System.Collections.Generic
Public Class Program
    Public Shared Sub Main()
        'Enable SSL support
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
        'Web page to scrape
        Dim link As String = "https://www.nextinpact.com"
        'Download the page from the link into an HtmlDocument
        Dim doc As HtmlDocument = New HtmlWeb().Load(link)
        'Select the title
        Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//section[@class='small_article_section']")
        If div IsNot Nothing Then
            For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//h2[@class='color_title']//a[@class='ui-link'][contains(text())]")
                Console.Write(div.InnerText.Trim())
            Next
        End If
    End Sub
End Class
Currently I am trying to get all the titles from
"//section[@class='small_article_section']"
but how do I get all of them?
For the first title the XPath is
"//h2[@class='color_title']//a[@class='ui-link'][contains(text(),'Les obligations de Netflix passeront d')]"
Thank you.
Edit:
I tried another example, with
Dim doc As HtmlDocument = New HtmlWeb().Load("https://www.sideshow.com/collectibles?manufacturer=sideshow+collectibles&type=premium+format%28tm%29+figure&brand=aspen")
Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row']")
Now I am trying to get the title of each product, with:
For Each node As HtmlNode In div.SelectNodes("//h2[contains(text(),'Grace')]") 'That is for only Grace
    Console.Write(node.InnerText.Trim())
Next
But with
//h2[contains(text(),'Grace')]
I get Nothing, and I want Grace and Aspen. I also tried
.//h2[contains(text()]
and got nothing too.

This is how you do it:
Dim doc As HtmlDocument = New HtmlWeb().Load("https://www.nextinpact.com/")
Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//section[@class='small_article_section']")
'If div IsNot Nothing Then 'I think this check is pointless as the section will always exist
For Each node As HtmlNode In div.SelectNodes(".//h2[@class='color_title']/a") 'a class='ui-link' doesn't exist, so use h2/a
    Console.Write(node.InnerText.Trim())
Next
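The key difference from the question's selectors is the leading ".//": in standard XPath, "//" always searches from the document root, even when SelectNodes is called on a sub-node, while ".//" searches only under that node. HtmlAgilityPack follows the same convention. A sketch of that rule using Python's third-party lxml library (the HTML below is made-up stand-in markup, not the real page):

```python
# Demonstrates absolute ("//") vs relative (".//") XPath when querying
# from a sub-node, using lxml as a stand-in for HtmlAgilityPack
# (the XPath semantics are the same in both).
from lxml import html

doc = html.fromstring("""
<body>
  <section class="small_article_section">
    <h2 class="color_title"><a>Title A</a></h2>
    <h2 class="color_title"><a>Title B</a></h2>
  </section>
  <h2 class="color_title"><a>Outside the section</a></h2>
</body>""")

section = doc.xpath("//section[@class='small_article_section']")[0]
# "//" searches the whole document, even when called on a sub-node:
absolute = section.xpath("//h2[@class='color_title']/a")
# ".//" searches only inside the sub-node:
relative = section.xpath(".//h2[@class='color_title']/a")
print(len(absolute), len(relative))  # 3 2
```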


Drill down using HtmlAgilityPack.HtmlDocument in VB.NET

I've created an HTML Document using
Dim htmlDoc = New HtmlAgilityPack.HtmlDocument()
and have a node
node = htmlDoc.DocumentNode.SelectSingleNode("/html/body/main/section/form[1]/input[2]")
and the OuterHtml is
"<input type="hidden" id="public-id" value="michael.smith.1">"
I need the value of michael.smith.1. Is there a way to pull the value property from the node or am I at the point where I use substring to parse out the value?
Thanks for the help
I would use the id first, as this makes for faster matching, then use the GetAttributeValue method of HtmlNode to extract the value attribute:
Imports System
Imports HtmlAgilityPack
Public Class Program
    Public Shared Sub Main()
        Dim doc = New HtmlDocument()
        Dim output As String = "<html><head><title>Text</title></head><body><input type=""hidden"" id=""public-id"" value=""michael.smith.1""></body></html>"
        doc.LoadHtml(output)
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id='public-id']").GetAttributeValue("value", "Not present"))
    End Sub
End Class
Fiddle
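For comparison, lxml's Element.get mirrors GetAttributeValue's default-value behaviour. A sketch using Python's third-party lxml library, reusing the HTML string from the snippet above:

```python
# Select by id, then read the "value" attribute with a fallback default,
# mirroring GetAttributeValue("value", "Not present") in the VB answer.
from lxml import html

output = ('<html><head><title>Text</title></head><body>'
          '<input type="hidden" id="public-id" value="michael.smith.1">'
          '</body></html>')
doc = html.fromstring(output)
node = doc.xpath("//*[@id='public-id']")[0]
print(node.get("value", "Not present"))  # michael.smith.1
```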

Go to each link stored in string and list all PDF links

This is my first attempt at creating a program.
I am trying to go to a website, get all the links, and then proceed to each link and collect all the links that end with
.pdf
I am able to get all the needed links. Now I want to proceed to each link and search for PDF files.
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim mainUrl As String = "xxx"
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) '< - - - Load the webpage into an HtmlDocument
        Dim listLinks As New List(Of String)
        Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a") '< - - - Select nodes with links
        For Each src As HtmlNode In srcs
            ' Store links in the list
            listLinks.Add(src.Attributes("href").Value)
        Next
        ' Here I am attempting to go through each link and list all the .pdf links
        ' Get an array from the list
        Dim arrayLinks() As String = listLinks.ToArray()
        'Console.Read()
        Dim scrapedsrcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='dl-items']//a") '< - - - Select nodes with links
        For Each scrapedlink As HtmlNode In scrapedsrcs
            ' Show links in the console
            Console.WriteLine(scrapedlink.Attributes("href").Value) '< - - - Print URLs
        Next
    End Sub
End Module
How to make it happen? Can somebody give me a hint?
EDIT:
First of all, you did not iterate over each product link and download its HTML to scan for the PDF download links.
This is done by:
For Each productLink As String In listLinks
    Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)
    Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a") '< - - - Select nodes with links
    If scrapedsrcs IsNot Nothing Then
        For Each scrapedlink As HtmlNode In scrapedsrcs
            ' Show links in the console
            Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") '< - - - Print URLs
        Next
    End If
Next
Secondly, the a link to download the PDF is contained inside a div instead of a ul. So to select the nodes, use:
prodDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a")
Or you can specify * to select by class regardless of the element, like:
prodDoc.DocumentNode.SelectNodes("//*[@class='dl-items']//a")
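The //*[@class='dl-items'] form is plain XPath, so the wildcard behaves the same in any engine. A small sketch using Python's third-party lxml library, with made-up markup showing the wildcard matching both a div and a ul by class:

```python
# The * wildcard selects any element with the given class, so one
# selector covers both the div and the ul variants of the markup.
from lxml import html

doc = html.fromstring("""
<body>
  <div class="dl-items"><a href="a.pdf">A</a></div>
  <ul class="dl-items"><li><a href="b.pdf">B</a></li></ul>
</body>""")

hrefs = [a.get("href") for a in doc.xpath("//*[@class='dl-items']//a")]
print(hrefs)  # ['a.pdf', 'b.pdf']
```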
Since you are not doing anything with those links, why not simply write it nice and short?
Like:
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/")
        For Each src As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a")
            htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
            Dim LinkTest As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='dl-items']/a")
            If LinkTest IsNot Nothing AndAlso LinkTest.Attributes("href").Value.Length > 0 Then Console.WriteLine(LinkTest.Attributes("href").Value)
        Next
    End Sub
End Module

Download PDF files from webpage

I am trying to download files from a website. My current solution seems to work, but there are some things I don't understand.
The first issue comes with:
//div[@class='large-4 medium-4 columns']//a
There are other divs with class large-4 medium-4 columns, so I am getting a couple of unnecessary links. How do I get rid of them? I only need pages that contain /products/.
The second issue is that nothing gets downloaded to C:\temp\, and I guess there is something wrong with:
//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]
but what is it?
"xxx" is a placeholder for the actual link in my code, which is:
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim mainUrl As String = "xxx"
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) '< - - - Load the webpage into an HtmlDocument
        Dim listLinks As New List(Of String)
        Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='large-4 medium-4 columns']//a") '< - - - Select nodes with links
        For Each src As HtmlNode In srcs
            ' Store links in the list
            listLinks.Add(src.Attributes("href").Value)
            Console.WriteLine(src.Attributes("href").Value)
        Next
        Console.Read()
        For Each productLink As String In listLinks
            Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)
            Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]") '< - - - Select nodes with links
            If scrapedsrcs IsNot Nothing Then
                For Each scrapedlink As HtmlNode In scrapedsrcs
                    ' Show links in the console
                    'Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") '< - - - Print URLs
                    My.Computer.Network.DownloadFile(scrapedlink.Attributes("href").Value, "C:\temp\" & System.IO.Path.GetFileName(scrapedlink.Attributes("href").Value) & ".pdf")
                Next
            End If
        Next
        Console.Read()
        ' End of scraping
    End Sub
End Module
EDIT:
Ok, the first one should be
//div[@class='row inset1 productItem padb1 padt1']/div[@class='large-4 medium-4 columns']//a
This will download the brochures to the folder where the app is run:
Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://webpage.com")
Dim ProductListPage As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a")
For Each src As HtmlNode In ProductListPage
    htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
    Dim LinkTester As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a")
    If LinkTester IsNot Nothing Then
        For Each dllink In LinkTester
            Dim LinkURL As String = dllink.Attributes("href").Value
            Console.WriteLine(LinkURL)
            Dim ExtractFilename As String = LinkURL.Substring(LinkURL.LastIndexOf("/"))
            Dim DLClient As New WebClient
            DLClient.DownloadFileAsync(New Uri(LinkURL), ".\" & ExtractFilename)
        Next
    End If
Next
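One detail worth noting in the snippet above: LinkURL.Substring(LinkURL.LastIndexOf("/")) keeps the leading slash (and any query string) in the extracted name. A sketch of a cleaner extraction, shown here with Python's standard library and a made-up URL:

```python
# Take the basename of the URL's path component; urlsplit strips any
# query string before the filename is extracted.
from urllib.parse import urlsplit
import posixpath

def filename_from_url(url):
    return posixpath.basename(urlsplit(url).path)

print(filename_from_url("https://example.com/files/brochure.pdf?v=2"))
# brochure.pdf
```

In the VB code the rough equivalent would be System.IO.Path.GetFileName, as used in the earlier download loop.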

Stream editing OpenXml powerpoint presentation slides

I am trying to edit the XML stream of PowerPoint slides using OpenXml and StreamReader/StreamWriter.
For a Word document, it's easy:
Imports System.IO
Imports DocumentFormat.OpenXml
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Presentation
Imports DocumentFormat.OpenXml.Wordprocessing
'
'
'
' Open a Word document
CurrentOpenDocument = WordprocessingDocument.Open(TheWordFileName, True)
' For a Word document, this works
Using (CurrentOpenDocument)
    ' Read the XML stream
    Dim sr As StreamReader = New StreamReader(CurrentOpenDocument.MainDocumentPart.GetStream)
    docText = sr.ReadToEnd
    ' Do the substitutions here
    docText = DoSubstitutions(docText)
    ' Write the modified XML stream
    Dim sw As StreamWriter = New StreamWriter(CurrentOpenDocument.MainDocumentPart.GetStream(FileMode.Create))
    Using (sw)
        sw.Write(docText)
    End Using
End Using
But for PowerPoint presentations, I find that the modified XML streams written to the slide parts do not get saved:
Imports System.IO
Imports DocumentFormat.OpenXml
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Presentation
Imports DocumentFormat.OpenXml.Wordprocessing
'
'
' Open a PowerPoint presentation
CurrentOpenPresentation = PresentationDocument.Open(ThePowerpointFileName, True)
' For a PowerPoint presentation, this doesn't work
Using (CurrentOpenPresentation)
    ' Get the presentation part of the presentation document.
    Dim pPart As PresentationPart = CurrentOpenPresentation.PresentationPart
    ' Verify that the presentation part and presentation exist.
    If pPart IsNot Nothing AndAlso pPart.Presentation IsNot Nothing Then
        ' Get the Presentation object from the presentation part.
        Dim pres As Presentation = pPart.Presentation
        ' Verify that the slide ID list exists.
        If pres.SlideIdList IsNot Nothing Then
            ' Get the collection of slide IDs from the slide ID list.
            Dim slideIds = pres.SlideIdList.ChildElements
            ' Loop through each slide
            For Each sID In slideIds
                Dim slidePartRelationshipId As String = (TryCast(sID, SlideId)).RelationshipId
                Dim TheslidePart As SlidePart = CType(pPart.GetPartById(slidePartRelationshipId), SlidePart)
                ' If the slide exists...
                If TheslidePart.Slide IsNot Nothing Then
                    Dim sr As StreamReader = New StreamReader(TheslidePart.GetStream)
                    Using (sr)
                        docText = sr.ReadToEnd
                    End Using
                    docText = DoSubstitutions(docText)
                    Dim sw As StreamWriter = New StreamWriter(TheslidePart.GetStream(FileMode.Create))
                    Using (sw)
                        sw.Write(docText)
                    End Using
                End If
            Next
        End If
    End If
End Using
I've also tried iterating through the in-memory slide parts to check the XML stream, and they ARE changed.
It's just that the changes never get saved back to the file on dispose (End Using), and no exceptions are raised.
Has anyone else experienced this?
After about a week of messing around with this, I came upon the answer: reference the slide parts from the SlideParts collection instead of via the relationship IDs, although I don't know why this works and the initial approach doesn't:
' This DOES work
Using (CurrentOpenPresentation)
    ' Get the presentation part of the presentation document.
    Dim pPart As PresentationPart = CurrentOpenPresentation.PresentationPart
    ' Verify that the presentation part and presentation exist.
    If pPart IsNot Nothing AndAlso pPart.Presentation IsNot Nothing Then
        ' Reference each slide in turn and do the substitution
        For Each s In pPart.SlideParts
            Dim sr As StreamReader = New StreamReader(s.GetStream)
            Using (sr)
                docText = sr.ReadToEnd
            End Using
            docText = DoSubstitutions(docText)
            Dim sw As StreamWriter = New StreamWriter(s.GetStream(FileMode.Create))
            Using (sw)
                sw.Write(docText)
            End Using
        Next
    End If
End Using
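One reason both versions pass FileMode.Create to GetStream: rewriting a stream in place without truncating it leaves stale bytes behind whenever the new XML is shorter than the old. That pitfall can be sketched with Python's standard library, using BytesIO as a stand-in for the part stream:

```python
# Overwriting a stream from position 0 without truncating leaves the
# tail of the old, longer content in place; truncating after the write
# plays the role of FileMode.Create in the OpenXml examples.
import io

buf = io.BytesIO(b"<p>original longer text</p>")
buf.seek(0)
buf.write(b"<p>short</p>")   # overwrite without truncating
print(buf.getvalue())        # trailing bytes of the old content remain

buf.seek(0)
buf.write(b"<p>short</p>")
buf.truncate()               # cut the stream at the current position
print(buf.getvalue())        # b'<p>short</p>'
```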

Download a webpage to a text file

I have the following code which works.
Imports System.IO
Imports System.Net
Module Module1
    Sub Main()
        Dim webClient1 As New WebClient()
        webClient1.Encoding = System.Text.Encoding.ASCII
        webClient1.DownloadFile("http://www.bmreports.com/servlet/com.logica.neta.bwp_MarketIndexServlet?displayCsv=true", "C:\temp\stream.txt")
    End Sub
End Module
This creates the text file, but it downloads all the HTML as well. How can I omit that and just get the text that is displayed on the page?
You can remove all the HTML tags from the document using Regex:
Imports System.IO
Imports System.Text.RegularExpressions
'
Dim source As String = File.ReadAllText("C:\temp\stream.txt")
'Clean HTML tags
source = StripTagsRegex(source)
'Strip function
Private Function StripTagsRegex(source As String) As String
    Return Regex.Replace(source, "<.*?>", String.Empty)
End Function
Here you have an example of this regex; it extracts only the text:
http://regexr.com?36ori
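The same tag-stripping pattern carries over directly to other languages. A sketch in Python, with the caveat that "<.*?>" is only a rough approach: it removes tags but won't correctly handle HTML comments, script contents, a literal "<" inside attribute values, or tags split across lines.

```python
# Non-greedy "<.*?>" removes each tag but leaves the text between tags,
# mirroring the VB StripTagsRegex function above.
import re

def strip_tags(source):
    return re.sub(r"<.*?>", "", source)

print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```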