Preventing errors with HTMLAgilitypack in VB.Net - vb.net

I'm using the HtmlAgilityPack to parse HTML pages. However, at some point I try to parse the wrong kind of data (in this specific case an image), which of course fails.
Private Sub parseHtml(ByVal content As String, ByVal url As String)
Try
Dim contentHash As String = hashGenerator.ComputeHash(content, "SHA1")
Dim doc As HtmlDocument = New HtmlDocument()
doc.Load(New StringReader(content))
Dim root As HtmlNode = doc.DocumentNode
Dim anchorTags As New List(Of String)
For Each link As HtmlNode In root.SelectNodes("//a")
cururl = link.OuterHtml
If link.Attributes("href") Is Nothing Then Continue For
If Uri.IsWellFormedUriString(link.Attributes("href").Value, UriKind.Absolute) Then
urlQueue.Enqueue(link.Attributes("href").Value)
Else
Dim myUri As New Uri(url)
urlQueue.Enqueue(myUri.Scheme & "://" & myUri.Host & link.Attributes("href").Value)
End If
Next
Catch ex As Exception
MsgBox(ex.Message, MsgBoxStyle.Critical, "Error (parseHtml(" & url & "))")
End Try
End Sub
The error I get is:
A first chance exception of type 'System.NullReferenceException' occurred in Webcrawler.exe
Object reference not set to an instance of an object.
On the content I try to parse:
�����Iޥ�+�: 8�0�x�
How to check whether the content is 'parse-able' before trying to parse it to prevent the error?
For now it is an image that makes the error pop up, but I think it could be anything that isn't (X)HTML.
Thanks in advance, oh great community :)

You need to check the returned content-type header before trying to parse the returned data.
For an HTML page this should be text/html; for XHTML it would be application/xhtml+xml.
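As a minimal sketch of that check, assuming the page is fetched with HttpWebRequest: the module name and the helper names IsHtmlContentType and DownloadIfHtml are made up for illustration, and the header value may carry parameters such as a charset, so only the media type is compared.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module ContentTypeCheck
    ' Returns True when a Content-Type header value denotes HTML or XHTML.
    ' The header may carry parameters, e.g. "text/html; charset=utf-8".
    Function IsHtmlContentType(ByVal contentType As String) As Boolean
        If String.IsNullOrEmpty(contentType) Then Return False
        Dim mediaType As String = contentType.Split(";"c)(0).Trim()
        Return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase) OrElse
               mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase)
    End Function

    ' Hypothetical helper: returns the body only if the server reports (X)HTML,
    ' otherwise Nothing so the caller can skip parsing.
    Function DownloadIfHtml(ByVal url As String) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            If Not IsHtmlContentType(response.ContentType) Then Return Nothing
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

A crawler could then call parseHtml only when DownloadIfHtml returns something other than Nothing.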

If you only have the content (if you can't get access to the original HTTP headers, as Oded suggested), you could assume that a good HTML string should contain at least one "<" character within, say, the first 10 characters of the string.
Of course, there is no guarantee and you will still need to handle the extreme cases, but this should discard most garbage or unexpected content types, and will let specific encoding bytes pass fine (like UTF-8 byte order mark, etc...).
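A rough sketch of that heuristic, for illustration only: the module name, the function name LooksLikeHtml and the maxOffset parameter are assumptions, not part of any library.

```vbnet
Imports System

Module HtmlSniffer
    ' Heuristic only: True when "<" occurs within the first maxOffset characters.
    ' It lets a leading byte order mark or some whitespace pass, as described above.
    Function LooksLikeHtml(ByVal content As String, Optional ByVal maxOffset As Integer = 10) As Boolean
        If String.IsNullOrEmpty(content) Then Return False
        ' Never search past the end of a short string
        Dim window As Integer = Math.Min(maxOffset, content.Length)
        Return content.IndexOf("<"c, 0, window) >= 0
    End Function
End Module
```

In parseHtml this could be used as an early exit, e.g. If Not LooksLikeHtml(content) Then Return. Note also that HtmlAgilityPack's SelectNodes returns Nothing when no node matches, so the For Each in the question can throw a NullReferenceException even on a valid page that contains no anchors; checking the result for Nothing before looping covers that case.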

Related

Google searching URL and add Listbox C# vb.net

I want to use WebRequest to run a Google search and add the result URLs to ListBox1, but my code gives an error.
Try
Dim adres As String = "https://www.google.com/search?q=" + TextBox1.Text
Dim istek As WebRequest = HttpWebRequest.Create(adres)
Dim cevap As WebResponse
cevap = istek.GetResponse()
Dim donenBilgiler As StreamReader = New StreamReader(cevap.GetResponseStream())
Dim gelen As String = donenBilgiler.ReadToEnd()
Dim titleIndexBaslangici As Integer = gelen.IndexOf("<link href") + 2
Dim titleIndexBitisi As Integer = gelen.Substring(titleIndexBaslangici).IndexOf(">")
ListBox1.Items.Add(gelen.Substring(titleIndexBaslangici, titleIndexBitisi))
Catch ex As Exception
MsgBox(ex.Message)
End Try
First, are you familiar with APIs? Because that's what you need! You might get this to work the way you are doing it, but it's really bad practice and no one will recommend that you continue like this.
What you need to look for is an API (the Google search API). An API's main purpose is to give access to data (in a database) through easy, well-documented HTTP routes. Try it out!
If you keep trying to do it your own way, the best result you will get is a really messy HTML page that you then have to parse, and you don't want that!

Get IP From Google

I'm writing an app in VB.NET that needs the public IP address in plain-text format. I know there are lots of sites that give you your IP in text format, but there is always a chance of one being closed or going out of service. Google, however, will not stop any time soon! So I want to get my IP from a Google search. For example, if you search for "my ip" in Google, it shows your IP like this:
Sample of search
Is there any way to get the IP from Google?
Thanks guys, but I found a way:
First, import some namespaces:
Imports System.Net
Imports System.Text.RegularExpressions
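Now let's write a function:
Private Function GetIpFromGoogle() As String ' the name is illustrative; the original post showed only the body
    Dim client As New WebClient
    ' Pattern tied to Google's markup at the time of writing
    Dim To_Match As String = "<div class=""_h4c _rGd vk_h"">(.*)"
    Dim received As String = client.DownloadString("https://www.google.com/search?sclient=psy-ab&site=&source=hp&btnG=Search&q=my+ip")
    Dim m As Match = Regex.Match(received, To_Match)
    Dim text_with_divs As String = m.Groups(1).Value
    ' Keep only the text before the next tag
    Dim finalize As String() = text_with_divs.Split("<"c)
    Return finalize(0)
End Function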
It's now working and live!
Hard-coded Div Class names make me a bit nervous because they can easily change at any time, so, I expanded on Hirod Behnam's example a little bit.
I removed the Div Class pattern, replacing it with a simpler IP address search, and it'll return only the first one found, which for this search, should be the first one shown on the page (your external IP).
That also removed the need for splitting the results into the array, and those related vars. I also simplified the Google search string to the bare minimum.
It'd still probably be a nice touch to include a timeout or two for the .DownloadString() and the .Match() respectively, if speed is of the essence.
Private Function GetExternalIP() As String
Dim m As Match = Match.Empty
Try
Dim wClient As New System.Net.WebClient
Dim strURL As String = wClient.DownloadString("https://www.google.com/search?q=my+ip")
Dim strPattern As String = "\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b"
' Look for the IP
m = Regex.Match(strURL, strPattern)
Catch ex As Exception
Debug.WriteLine(String.Format("GetExternalIP Error: {0}", ex.Message))
End Try
' Failed getting the IP
If m.Success = False Then Return "IP: N/A"
' Got the IP
Return m.value
End Function
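The regex timeout mentioned above could look roughly like this. It is a sketch that assumes .NET Framework 4.5 or later, where Regex.Match has an overload taking a TimeSpan; the module and function names are made up for illustration. WebClient itself exposes no timeout property, so a download timeout needs a different approach, such as HttpWebRequest.Timeout.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IpMatcher
    ' Same IPv4 pattern as in GetExternalIP above
    Private Const Ipv4Pattern As String =
        "\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b"

    ' Returns the first IPv4 address found in text, or Nothing when there is
    ' no match or the match attempt exceeds the given timeout.
    Function FindFirstIpv4(ByVal text As String, ByVal timeout As TimeSpan) As String
        Try
            Dim m As Match = Regex.Match(text, Ipv4Pattern, RegexOptions.None, timeout)
            Return If(m.Success, m.Value, Nothing)
        Catch ex As RegexMatchTimeoutException
            ' The regex engine gave up after the timeout elapsed
            Return Nothing
        End Try
    End Function
End Module
```

For example, FindFirstIpv4(pageSource, TimeSpan.FromSeconds(1)) would replace the plain Regex.Match call, where pageSource stands for the downloaded string.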

Saving embedded resource contents to string

I am trying to copy the contents of an embedded file to a string in Visual Basic using Visual Studio 2013. I already have the resource (Settings.xml) imported and set as an embedded resource. Here is what I have:
Function GetFileContents(ByVal FileName As String) As String
Dim this As [Assembly]
Dim fileStream As IO.Stream
Dim streamReader As IO.StreamReader
Dim strContents As String
this = System.Reflection.Assembly.GetExecutingAssembly
fileStream = this.GetManifestResourceStream(FileName)
streamReader = New IO.StreamReader(fileStream)
strContents = streamReader.ReadToEnd
streamReader.Close()
Return strContents
End Function
When I try to save the contents to a string by using:
Dim contents As String = GetFileContents("Settings.xml")
I get the following error:
An unhandled exception of type 'System.ArgumentNullException' occurred in mscorlib.dll
Additional information: Value cannot be null.
Which occurs at line:
streamReader = New IO.StreamReader(fileStream)
Nothing else I've read has been very helpful, hoping someone here can tell me why I'm getting this. I'm not very good with embedded resources in vb.net.
First, check that fileStream is not Nothing; it seems to contain nothing, which is why you are getting a null exception.
Before passing it to the StreamReader, test it with a MsgBox to see whether it is Nothing.
fileStream is Nothing because no resources were specified during compilation, or because the resource is not visible to GetFileContents.
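One way to diagnose this is to list the names the assembly actually contains. GetManifestResourceStream returns Nothing when the name does not match exactly, and embedded resource names are usually prefixed with the project's root namespace (for example MyApp.Settings.xml rather than Settings.xml). The following is a sketch under that assumption; the module name, the function name ReadEmbeddedText and the suffix-matching strategy are illustrative choices.

```vbnet
Imports System
Imports System.IO
Imports System.Reflection

Module EmbeddedResources
    ' Reads an embedded text resource whose full name ends with fileName.
    Function ReadEmbeddedText(ByVal fileName As String) As String
        Dim asm As [Assembly] = [Assembly].GetExecutingAssembly()
        ' Full names are usually "<RootNamespace>.<FileName>", so match on the suffix
        Dim fullName As String = Nothing
        For Each candidate As String In asm.GetManifestResourceNames()
            If candidate.EndsWith(fileName, StringComparison.OrdinalIgnoreCase) Then
                fullName = candidate
                Exit For
            End If
        Next
        If fullName Is Nothing Then
            ' Fail with a clear message instead of an ArgumentNullException later on
            Throw New FileNotFoundException("Embedded resource not found: " & fileName)
        End If
        Using stream As Stream = asm.GetManifestResourceStream(fullName)
            Using reader As New StreamReader(stream)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

If GetManifestResourceNames returns an empty array, the file was not compiled into the assembly as an embedded resource at all.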
After fighting the thing for hours, I discovered I wasn't importing the resource correctly. I had to go to Project -> Properties -> Resources and add the resource from existing file there, rather than importing the file from the Solution Explorer. After adding the file correctly, I was able to write the contents to a string by simply using:
Dim myString As String = (My.Resources.Settings)
Ugh, it's always such a simple solution, not sure why I didn't try that first. Hopefully this helps someone else because I saw nothing about this anywhere else I looked.

HtmlAgilityPack not finding nodes from HttpWebRequest's returned HTML

I am a little new to htmlagilitypack. I want to use my HttpWebRequest which can return the html of a webpage and then parse that html with htmlagilitypack. I want to find all div's with a specific class and then get the inner text of what is inside those div's. This is what I have so far. My get request successfully returns webpage html:
Public Function mygetreq(ByVal myURL as String, ByRef thecookie As CookieContainer)
Dim getreq As HttpWebRequest = DirectCast(HttpWebRequest.Create(myURL), HttpWebRequest)
getreq.Method = "GET"
getreq.KeepAlive = True
getreq.CookieContainer = thecookie
getreq.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
Dim getresponse As HttpWebResponse
getresponse = DirectCast(getreq.GetResponse, HttpWebResponse)
Dim getreqreader As New StreamReader(getresponse.GetResponseStream())
Dim thePage = getreqreader.ReadToEnd
'Clean up the streams and the response.
getreqreader.Close()
getresponse.Close()
Return thePage
End Function
This function returns the html. I then put the html into this:
'The html successfully shows up in the RichTextBox
RichTextBox1.Text = mygetreq("http://someurl.com", thecookie)
Dim htmldoc = New HtmlAgilityPack.HtmlDocument()
htmldoc.LoadHtml(RichTextBox1.Text)
Dim htmlnodes As HtmlNodeCollection
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='someClass']")
If htmlnodes IsNot Nothing Then
For Each node In htmlnodes
MessageBox.Show(node.InnerText())
Next
End If
The problem is, htmlnodes is coming back as null. So the final If Then loop won't run. It finds nothing, but I KNOW for a fact that this div and class exists in the html page because I can see the html in the RichTextBox1:
<div class="someClass"> This is inner text </div>
What exactly is the problem here? Does the htmldoc.LoadHtml not like the type of string that the mygetreq returns for the page html?
Does this have anything to do with html entities? thePage contains < and > brackets. They are not entitied.
I also saw someone post here (C#) to use the HtmlWeb class, but I am not sure how I would set that up. Most of my code is already written with httpWebRequest.
Thanks for reading and thanks for helping.
If you are willing to switch, you could use CsQuery, something along these lines:
Dim q As New CQ(mygetreq("http://someurl.com", thecookie))
For Each node In q("div.someClass")
Console.WriteLine(node.InnerText)
Next
You may want to add some error handling, but overall should be a good start for you.
You can add CsQuery to your project via NuGet:
Install-Package CsQuery
And don't forget to use Imports CsQuery at the top of your code file.
This may not directly solve your problem, but should make it easier to experiment with your data (via immediate window, for example).
Interesting read (performance comparison):
CsQuery Performance vs. Html Agility Pack and Fizzler
Using HtmlWeb is truly a simple and good way to work with HtmlAgilityPack. Here is an example:
Private Sub GetHtml()
    Dim web As New HtmlWeb
    Dim URL As String = "" ' put the page URL here
    ' Notice that I used Load, and not LoadHtml
    Dim HtmlDoc As HtmlDocument = web.Load(URL)
    ' Replace the placeholder string with your XPath expression
    Dim NodeCollection As HtmlNodeCollection = HtmlDoc.DocumentNode.SelectNodes("put here your XPath")
    ' SelectNodes returns Nothing when no node matches
    If NodeCollection Is Nothing Then Return
    For Each Node As HtmlNode In NodeCollection
        MsgBox(Node.InnerText)
    Next
End Sub
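For the approach in the question (parsing a string that was already downloaded), a minimal sketch with LoadHtml follows. It assumes the HtmlAgilityPack package is referenced, and the sample markup is the div quoted in the question wrapped in a hypothetical page. XPath tests attributes with the @ prefix.

```vbnet
Imports System
Imports HtmlAgilityPack

Module DivTextExtractor
    Sub Main()
        ' Sample markup for illustration; in the question this string comes from mygetreq
        Dim html As String = "<html><body><div class=""someClass""> This is inner text </div></body></html>"
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        ' @class selects the class attribute; the comparison is an exact string match
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='someClass']")
        If nodes Is Nothing Then
            Console.WriteLine("No matching div found")
            Return
        End If
        For Each node As HtmlNode In nodes
            Console.WriteLine(node.InnerText.Trim())
        Next
    End Sub
End Module
```

Because the comparison is exact, a div with several classes (class="someClass other") would not match this expression; contains(@class, 'someClass') is a looser alternative.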

Check if folder on web exists or not

I'm creating a desktop application. When I type a username into TextBox1, the Button1.Click event should check whether a folder with that name exists on the web.
So far I've tried this:
username = Me.TextBox1.Text
password = Me.TextBox2.Text
Dim dir As Boolean = IO.Directory.Exists("http://www.mywebsite.com/" + username)
If dir = true Then
Dim response As String = web.DownloadString("http://www.mywebsite.com/" + username + "/Password.txt")
If response.Contains(password) Then
MsgBox("You've logged in succesfully", MsgBoxStyle.Information)
Exit Sub
Else
MsgBox("Password is incorect!")
End If
Else
MsgBox("Username is wrong, try again!")
End If
The first problem is that my Boolean comes back False (the directory exists for sure, and all permissions are granted to see the folder). I tried to work around that by setting dir = False and then entering the first If anyway (but that's not what I want, since it should be True, not False).
That brings us to the second problem. On this line: Dim response As String=web.DownloadString("http://www.mywebsite.com/" + username + "/Password.txt") I get this error message: The remote server returned an error: (404) Not Found.
Can anyone more experienced with this kind of thing help me?
IO.Directory.Exists will not work in this case. That method only checks for a folder on a disk (local or network); you can't use it to check for the existence of a resource over HTTP (i.e. a URI).
But even if it did work this way, it's actually pointless to call it before attempting the download: DownloadString will throw an exception if something goes wrong. As you have seen, in this case it's telling you 404 Not Found, which means "This resource does not exist as far as you are concerned". **
So you should wrap the operation in Try/Catch: catch exceptions of type WebException, cast the Response member to HttpWebResponse, and check its StatusCode property.
A good example (albeit in C#) is here
** I say "as far as you are concerned" because for all you know, the resource may very well exist on the server, but it has decided to hide it from you because you do not have access to it, etc, and the developer of that site decided to return 404 in this case instead of 401 Unauthorised. The point being that from your point of view, the resource is not available.
Update:
here is the code from the answer I linked to, translated via this online tool because my VB is dodgy enough :). This code runs just fine for me in LinqPad, and produces the output "testlozinka"
Sub Main
    Try
        Dim myString As String
        Using wc As New WebClient()
            myString = wc.DownloadString("http://dota2world.org/HDS/Klijenti/TestKlijent/password.txt")
            Console.WriteLine(myString)
        End Using
    Catch ex As WebException
        Console.WriteLine(ex.ToString())
        If ex.Status = WebExceptionStatus.ProtocolError AndAlso ex.Response IsNot Nothing Then
            Dim resp = DirectCast(ex.Response, HttpWebResponse)
            If resp.StatusCode = HttpStatusCode.NotFound Then
                ' HTTP 404: the page was not found, so handle it here
                ' (the C# original used "continue" inside a loop at this point)
                Console.WriteLine("Page not found")
                Return
            End If
        End If
        ' Rethrow any other exception - this should not occur
        Throw
    End Try
End Sub
Hope that helps.