Download PDF files from webpage - vb.net

I am trying to download files from a website. My current solution seems to work, but there are some things I don't understand.
The first issue is with this XPath:
//div[@class='large-4 medium-4 columns']//a
There are other divs with the class large-4 medium-4 columns, so I am getting a couple of unnecessary links. How do I get rid of them? I only need pages whose URL contains /products/
The second issue is that nothing gets downloaded to C:\temp\. I suspect the problem is in:
//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]
but what is wrong?
"xxx" is the link in my code and it should be
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim mainUrl As String = "xxx"
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) '< - - - Load the webpage into an HtmlDocument
        Dim listLinks As New List(Of String)
        Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='large-4 medium-4 columns']//a") '< - - - select nodes with links
        For Each src As HtmlNode In srcs
            ' Store links in the list
            listLinks.Add(src.Attributes("href").Value)
            Console.WriteLine(src.Attributes("href").Value)
        Next
        Console.Read()
        For Each productLink As String In listLinks
            Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)
            Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='large-6 medium-8 columns large-centered']/a[string-length(@href)>0]") '< - - - select nodes with links
            If scrapedsrcs IsNot Nothing Then
                For Each scrapedlink As HtmlNode In scrapedsrcs
                    ' Show links in console
                    'Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") '< - - - Print urls
                    My.Computer.Network.DownloadFile(scrapedlink.Attributes("href").Value, "C:\temp\" & System.IO.Path.GetFileName(scrapedlink.Attributes("href").Value) & ".pdf")
                Next
            End If
        Next
        Console.Read()
        ' End of scraping
    End Sub
End Module
EDIT:
OK, the first XPath should be:
//div[@class='row inset1 productItem padb1 padt1']/div[@class='large-4 medium-4 columns']//a
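Another way to drop the unwanted links is to filter on the href itself with the XPath contains() function. The following is only a sketch: it runs on an inline HTML fragment that is made up for illustration, and the module and function names (ProductLinkFilter, GetProductLinks) are hypothetical, not part of the original code.

```vb
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module ProductLinkFilter
    ' Returns the href of every <a> inside a matching div whose href contains "/products/".
    Function GetProductLinks(html As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim links As New List(Of String)
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(
            "//div[@class='large-4 medium-4 columns']//a[contains(@href, '/products/')]")
        ' SelectNodes returns Nothing when no node matches.
        If nodes Is Nothing Then Return links
        For Each node As HtmlNode In nodes
            links.Add(node.GetAttributeValue("href", ""))
        Next
        Return links
    End Function

    Sub Main()
        ' Hypothetical markup, made up for illustration.
        Dim sample As String =
            "<div class='large-4 medium-4 columns'><a href='/products/pump/'>Pump</a></div>" &
            "<div class='large-4 medium-4 columns'><a href='/about/'>About</a></div>"
        For Each link As String In GetProductLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

With the sample markup, only the /products/pump/ link passes the filter; the /about/ link is skipped even though its div has the same class.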

This will download the brochures to the folder where the app is run:
Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://webpage.com")
Dim ProductListPage As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='productContain padb6']//div[@class='large-4 medium-4 columns']/a")
For Each src As HtmlNode In ProductListPage
    ' Load each product page and look for download links on it
    htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
    Dim LinkTester As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//div[@class='row padt6 padb4']//a")
    If LinkTester IsNot Nothing Then
        For Each dllink In LinkTester
            Dim LinkURL As String = dllink.Attributes("href").Value
            Console.WriteLine(LinkURL)
            ' Everything from the last "/" onward is used as the file name
            Dim ExtractFilename As String = LinkURL.Substring(LinkURL.LastIndexOf("/"))
            Dim DLClient As New WebClient ' WebClient lives in System.Net
            DLClient.DownloadFileAsync(New Uri(LinkURL), ".\" & ExtractFilename)
        Next
    End If
Next
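The file name in the snippet above is cut out of the URL by hand, so it keeps the leading "/" and any query string. A small sketch of an alternative, assuming the links are absolute URLs: parse the link with System.Uri and take the file name from its path. The module and function names and the sample URL are hypothetical.

```vb
Imports System
Imports System.IO

Module DownloadNames
    ' Derives a local file name from the path part of an absolute URL,
    ' ignoring any query string.
    Function FileNameFromUrl(url As String) As String
        Dim parsed As New Uri(url)
        Return Path.GetFileName(parsed.LocalPath)
    End Function

    Sub Main()
        ' Hypothetical URL, for illustration only.
        Console.WriteLine(FileNameFromUrl("https://webpage.com/files/brochure.pdf?lang=en"))
    End Sub
End Module
```

Note also that DownloadFileAsync returns before the download has finished, so a console program that exits right after the loop may leave files incomplete; the synchronous DownloadFile method avoids that at the cost of downloading one file at a time.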

Related

Go to each link stored in string and list all PDF links

This is my first attempt to create a program.
I am trying to go to a website, get all its links, then follow each link and get all links ending with
.pdf
I am able to get all the links I need. Now I want to follow each link and search for PDF files.
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim mainUrl As String = "xxx"
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load(mainUrl) '< - - - Load the webpage into an HtmlDocument
        Dim listLinks As New List(Of String)
        Dim srcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a") '< - - - select nodes with links
        For Each src As HtmlNode In srcs
            ' Store links in the list
            listLinks.Add(src.Attributes("href").Value)
        Next
        ' Here I am attempting to go through each link and list all .pdf links
        'get the array from the list.
        Dim arrayLinks() As String = listLinks.ToArray()
        'Console.Read()
        Dim scrapedsrcs As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//ul[@class='dl-items']//a") '< - - - select nodes with links
        For Each scrapedlink As HtmlNode In scrapedsrcs
            ' Show links in console
            Console.WriteLine(scrapedlink.Attributes("href").Value) '< - - - Print urls
        Next
    End Sub
End Module
How to make it happen? Can somebody give me a hint?
EDIT:
First of all, you did not iterate over the product links and download the HTML of each page to scan it for the PDF download links.
This is done by:
For Each productLink As String In listLinks
    Dim prodDoc As HtmlDocument = New HtmlWeb().Load(productLink)
    Dim scrapedsrcs As HtmlNodeCollection = prodDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a") '< - - - select nodes with links
    If scrapedsrcs IsNot Nothing Then
        For Each scrapedlink As HtmlNode In scrapedsrcs
            ' Show links in console
            Console.WriteLine($"-- {scrapedlink.Attributes("href").Value}") '< - - - Print urls
        Next
    End If
Next
Secondly, the link to download the PDF is contained inside a div, not a ul. So to select the nodes, use:
prodDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a")
Or you can specify * to select by class regardless of the element type, like:
prodDoc.DocumentNode.SelectNodes("//*[@class='dl-items']//a")
Since you are not doing anything else with those links, why not simply write it nice and short?
Like:
Imports HtmlAgilityPack
Module Module1
    Sub Main()
        Dim htmlDoc As HtmlDocument = New HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/")
        For Each src As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//ul[@class='products-list-page']//a")
            htmlDoc = New HtmlWeb().Load(src.Attributes("href").Value)
            Dim LinkTest As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='dl-items']/a")
            If LinkTest IsNot Nothing AndAlso LinkTest.Attributes("href").Value.Length > 0 Then Console.WriteLine(LinkTest.Attributes("href").Value)
        Next
    End Sub
End Module

How to get a specific class that is in a div

I'm just starting out with the Html Agility Pack library. I can't understand why it works if I select only the divs, as in the following code:
Dim url As String = "https://example.com"
Dim web = New HtmlWeb()
Dim doc = web.Load(url)
For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//div")
    Console.Write(node.InnerText)
Next
while if I try to find a div with a specific class, as in the following code, it gives me an error:
Dim url As String = "https://example.com"
Dim web = New HtmlWeb()
Dim doc = web.Load(url)
For Each node As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='myclass']")
    Console.Write(node.InnerText)
Next
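The question does not say which error is raised. One possible cause, assuming it is a NullReferenceException, is that SelectNodes returns Nothing when no node matches the XPath, so the For Each loop fails on pages without such a div. A minimal sketch of the guard, run on a made-up HTML fragment (the module and function names are hypothetical):

```vb
Imports System
Imports HtmlAgilityPack

Module SafeSelect
    ' Counts the nodes matched by an XPath expression, treating "no match" as zero.
    Function CountMatches(html As String, xpath As String) As Integer
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes(xpath)
        ' SelectNodes returns Nothing, not an empty collection, when nothing matches.
        If nodes Is Nothing Then Return 0
        Return nodes.Count
    End Function

    Sub Main()
        ' Hypothetical markup: one div, but none with class 'myclass'.
        Dim sample As String = "<div class='other'>Hello</div>"
        Console.WriteLine(CountMatches(sample, "//div"))
        Console.WriteLine(CountMatches(sample, "//div[@class='myclass']"))
    End Sub
End Module
```

Keep in mind that @class='myclass' compares the whole attribute value, so a div declared as class="myclass wide" does not match; contains(@class, 'myclass') is a looser test for that case.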

How to get/download the album art from internet for Music Player project

Hello, I'm making a music player with a clean look, and I'm about to finish it, but I want to add a smart feature: when there is an internet connection and the MP3 file (the song) has no album cover, the program should download the cover using the name of the MP3 file; if the MP3 file does contain a cover, it should use that one (I have already done that part). All I want to know is what I did wrong in this code. The problem is that a MsgBox appears saying "Error in Downloading Song Cover", but why? Is there something wrong in my code? Thanks for your time and help :)
''Label13 and Label5 hold the song name
''download cover if internet is available
If My.Computer.Network.IsAvailable Then
    Label13.Text = Label5.Text
    Dim Clint As New WebClient()
    Dim photolink As String = Nothing
    Application.DoEvents()
    Dim sourceCollection As String = Clint.DownloadString(New Uri("https://www.amazon.com/s?k=" + Label13.Text + "&i=digital-music&ref=nb_sb_noss"))
    Dim Weber As New WebBrowser With {.ScriptErrorsSuppressed = True}
    Weber.DocumentText = sourceCollection
    Dim htmlColl As HtmlDocument = Weber.Document.OpenNew(True)
    htmlColl.Write(sourceCollection)
    If htmlColl.GetElementById("mp3StoreShovelerCell0") IsNot Nothing Then
        Dim theHtmlElementCollection As HtmlElementCollection = htmlColl.GetElementById("mp3StoreShovelerCell0").GetElementsByTagName("img")
        For Each curElement As HtmlElement In theHtmlElementCollection
            photolink = curElement.GetAttribute("src")
        Next
        If photolink IsNot Nothing Then
            photolink = photolink.ToString.Replace("._SL500_SS110_.jpg", "._SS500_.jpg")
            BunifuImageButton2.ImageLocation = photolink
        End If
    Else
        BunifuImageButton2.Image = My.Resources.clipart536736
        MsgBox("Error in Downloading Song Cover")
    End If
Else
    Dim file As TagLib.File = TagLib.File.Create(AxWindowsMediaPlayer1.URL.ToString())
    If (file.Tag.Pictures.Length > 0) Then
        Dim bin = CType(file.Tag.Pictures(0).Data.Data, Byte())
        BunifuImageButton2.Image = Image.FromStream(New MemoryStream(bin)).GetThumbnailImage(600, 600, Nothing, IntPtr.Zero)
        BunifuImageButton7.Image = Image.FromStream(New MemoryStream(bin)).GetThumbnailImage(600, 600, Nothing, IntPtr.Zero)
    Else
        BunifuImageButton7.Image = My.Resources.clipart536736
    End If
End If
BunifuImageButton7.Image = BunifuImageButton2.Image
''end of code cover
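In the code above, the MsgBox branch runs whenever GetElementById("mp3StoreShovelerCell0") returns Nothing, that is, whenever the downloaded page does not contain an element with that id. Separately, the song name is concatenated into the query string as-is, so names with spaces or characters such as & produce a different query than intended. The sketch below only addresses that second point and may not be the cause of the error; the module and function names are hypothetical.

```vb
Imports System

Module CoverSearch
    ' Builds the search URL used in the question, escaping the song name
    ' so that spaces and characters like "&" stay inside the k= parameter.
    Function BuildSearchUrl(songName As String) As String
        Return "https://www.amazon.com/s?k=" & Uri.EscapeDataString(songName) &
               "&i=digital-music&ref=nb_sb_noss"
    End Function

    Sub Main()
        ' Hypothetical song name, for illustration only.
        Console.WriteLine(BuildSearchUrl("Rock & Roll"))
    End Sub
End Module
```

In the original code this would replace the string passed to New Uri(...), with Label13.Text as the songName argument.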

How can I get the URL of an internet shortcut (.url)?

My client has some internet shortcuts (*.url) on his desktop, and I want to read their URLs from a VB application and use them as variables.
Any idea how I can do that?
There's a sample on MSDN for *.lnk and *.appref-ms files.
It seems to work for *.url files too.
Quote from the site:
To check if a file is a shortcut and to resolve a shortcut path, the
COM Library Microsoft Shell Controls And Automation is used. This
library is added to the References of the Visual Studio project.
Code:
Public Function IsShortcut(strPath As String) As Boolean
    If Not File.Exists(strPath) Then
        Return False
    End If
    Dim directory As String = Path.GetDirectoryName(strPath)
    Dim strFile As String = Path.GetFileName(strPath)
    Dim shell As Shell32.Shell = New Shell32.Shell()
    Dim folder As Shell32.Folder = shell.NameSpace(directory)
    Dim folderItem As Shell32.FolderItem = folder.ParseName(strFile)
    If folderItem IsNot Nothing Then
        Return folderItem.IsLink
    End If
    Return False
End Function
Public Function ResolveShortcut(strPath As String) As String
    If IsShortcut(strPath) Then
        Dim directory As String = Path.GetDirectoryName(strPath)
        Dim strFile As String = Path.GetFileName(strPath)
        Dim shell As Shell32.Shell = New Shell32.Shell()
        Dim folder As Shell32.Folder = shell.NameSpace(directory)
        Dim folderItem As Shell32.FolderItem = folder.ParseName(strFile)
        Dim link As Shell32.ShellLinkObject = folderItem.GetLink
        Return link.Path
    End If
    Return String.Empty
End Function
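As an alternative that avoids the COM reference: a .url file is a plain-text INI-style file whose [InternetShortcut] section contains a URL= line, so the address can also be read directly. This is only a sketch; it returns the first URL= line it finds without checking which section it is in, and the module and function names are hypothetical.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module UrlShortcut
    ' Returns the value of the first "URL=" line, or an empty string if there is none.
    Function ReadUrlFromLines(lines As IEnumerable(Of String)) As String
        For Each line As String In lines
            Dim trimmed As String = line.Trim()
            If trimmed.StartsWith("URL=", StringComparison.OrdinalIgnoreCase) Then
                Return trimmed.Substring(4)
            End If
        Next
        Return String.Empty
    End Function

    ' Convenience wrapper that reads the lines from a .url file on disk.
    Function ReadUrlFromFile(shortcutPath As String) As String
        Return ReadUrlFromLines(File.ReadLines(shortcutPath))
    End Function

    Sub Main()
        ' Typical content of a .url file, inlined so the example needs no file.
        Dim sample As String() = {"[InternetShortcut]", "URL=https://example.com/"}
        Console.WriteLine(ReadUrlFromLines(sample))
    End Sub
End Module
```

For a real shortcut, ReadUrlFromFile would be called with the full path of the .url file on the desktop.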

HtmlAgilityPack - getting error when looping through nodes. Doesn't make sense

I'm trying to get all nodes below but I am getting an error message of:
Overload resolution failed because no accessible 'GetAttributeValue' accepts this number of arguments.
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
Dim docNode As HtmlNode = doc.DocumentNode
Dim nodes As HtmlNodeCollection = docNode.SelectNodes("//input")
For Each node As HtmlNode In nodes
    Dim id As String = node.GetAttributeValue("id")
Next
Any ideas on why I am getting this error message? Thanks
You need to provide a default value as a second parameter to GetAttributeValue:
Dim id As String = node.GetAttributeValue("id", "")
Update for updated question
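In addition to the above fix, you are retrieving the HtmlDocument incorrectly. HtmlDocument.Load reads from a file or stream and HtmlDocument.LoadHtml parses an HTML string; neither retrieves the page from the web server, so your call treats the URL text itself as the HTML to parse.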
You need to modify your code to fetch the data from the URL using HtmlWeb. Replace the following lines:
Dim doc As New HtmlDocument()
doc.LoadHtml("shaggybevo.com/board/register.php")
with these:
Dim doc As HtmlDocument
Dim web As New HtmlWeb
doc = web.Load("http://shaggybevo.com/board/register.php")