VB.NET: get src links inside iframes with HtmlAgilityPack

I'm using HtmlAgilityPack and trying to get both wanted1 and wanted2.
The HTML looks like this:
<div class='class1' id='id1'>
<iframe id="iframe1" src="wanted1"></iframe>
<iframe id="iframe" src="wanted2"></iframe>
</div>
I've had no luck so far. Can someone help me, please?

Here is a commented sample to get you started:
Dim htmlDoc As New HtmlAgilityPack.HtmlDocument
Dim html As String = <![CDATA[<div class='class1' id='id1'>
<iframe id="iframe1" src="wanted1"></iframe>
<iframe id="iframe" src="wanted2"></iframe>
</div>]]>.Value
'load the html string to the HtmlDocument we defined
htmlDoc.LoadHtml(html)
'using LINQ and some XPath you can target any node you want
'the XPath //iframe[@src] passed to SelectNodes selects all iframe nodes that have a src attribute
Dim srcs = From iframeNode In htmlDoc.DocumentNode.SelectNodes("//iframe[@src]")
Select iframeNode.Attributes("src").Value
'print all the src you got
For Each src In srcs
Console.WriteLine(src)
Next
Make sure you learn about XPath.
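One thing worth guarding against: SelectNodes returns Nothing rather than an empty collection when the XPath matches no nodes, so the query above would throw on a page without iframes. A minimal sketch of a null-safe variant follows. The module and function names (IframeSources, GetIframeSources) are made up for illustration, and it assumes the HtmlAgilityPack package is referenced.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Module IframeSources
    ' Hypothetical helper: returns the src of every <iframe> that has one,
    ' or an empty list when the document contains no matching iframe.
    Function GetIframeSources(html As String) As List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes gives Nothing (not an empty collection) on no match
        Dim nodes = doc.DocumentNode.SelectNodes("//iframe[@src]")
        If nodes Is Nothing Then Return New List(Of String)()

        Return nodes.Select(Function(n) n.GetAttributeValue("src", "")).ToList()
    End Function

    Sub Main()
        Dim html As String = "<div class='class1' id='id1'>" &
                             "<iframe id='iframe1' src='wanted1'></iframe>" &
                             "<iframe id='iframe' src='wanted2'></iframe>" &
                             "</div>"

        ' Prints each src value on its own line
        For Each src As String In GetIframeSources(html)
            Console.WriteLine(src)
        Next
    End Sub
End Module
```

GetAttributeValue with a default is used here instead of Attributes("src").Value so that a missing attribute yields an empty string instead of an exception.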

Related

Why HtmlAgilityPack adds some characters to my html

Here is my code:
Dim input = "<div><textarea>something</div></textarea>"
Dim doc As New HtmlAgilityPack.HtmlDocument
doc.OptionOutputAsXml = True
doc.LoadHtml(input)
Using writer As New StringWriter
doc.Save(writer)
Dim res = writer.ToString
End Using
and the value of 'res' is:
"<?xml version="1.0" encoding="windows-1255"?>
<div>
<textarea>
//<![CDATA[
something
//]]>//
</textarea>
</div>"
the result as html is: My textarea
How can I prevent it?
From my understanding of it, the reason is implied by this answer to Set textarea value with HtmlAgilityPack:
A <textarea> element doesn't have a value attribute. Its content is its own text node:
<textarea>
Some content
</textarea>
To simulate the same thing safely, HAP has to enclose the content in a //<![CDATA[ section.
The source code for HAP has this comment for the relevant line(s):
// tags whose content may be anything
ElementsFlags.Add("textarea", HtmlElementFlag.CData);
So, you can't prevent it.

replacing attributes within an html image tag

I have 1000+ database entries that contain HTML image tags.
The problem is that 90% of the 'src' attributes are just placeholders. I need to replace all of those placeholders with the appropriate, real sources.
A typical database entry looks like this (the number of image tags varies from entry to entry):
<p>A monster rushes at you!</p>
Monster:<p><img id="d8fh4-gfkj3" src="(image_placeholder)" /></p>
<br />
Treasure: <p><img id="x23zo-115a9" src="(image_placeholder)" /></p>
Please select your action below:
</br />
Using the IDs in the image tags above, 'd8fh4-gfkj3' & 'x23zo-115a9', I can query another function to get the "real" sources for those images.
So I tried using HtmlAgilityPack and came up with this:
Dim doc As New HtmlDocument()
doc.LoadHtml(encounterText)
For Each imgTag As HtmlNode In doc.DocumentNode.SelectNodes("//img")
'get the ID
Dim imgId As HtmlAttribute = imgTag.Attributes("id")
Dim imageId As String = imgId.Value
'get the new/real path
Dim newPath = getMediaPath(imageId)
Dim imgSrc As HtmlAttribute = imgTag.Attributes("src")
'check to see if the <img> tag "src" attribute has a placeholder
If imgSrc.Value.Contains("(image_placeholder)") Then
'replace old image src attribute with 'src=newPath'
End If
Next
But I can't figure out how to actually replace the old value with the new value.
Is there a way to do this with the HtmlAgilityPack?
Thanks!
You should be able to just set the value for the attribute:
'check to see if the <img> tag "src" attribute has a placeholder
If imgSrc.Value.Contains("(image_placeholder)") Then
'replace old image src attribute with 'src=newPath'
imgSrc.Value = newPath
End If
After the replacement, you can get the updated HTML with:
doc.DocumentNode.OuterHtml
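Putting the question's loop and this answer together, the whole replacement could be sketched roughly as below. This is only an illustration under a few assumptions: the resolve delegate is a hypothetical stand-in for the asker's getMediaPath function, the module and function names are invented, and the XPath is narrowed to img tags that carry both id and src so the loop never touches a missing attribute.

```vbnet
Imports System
Imports HtmlAgilityPack

Module PlaceholderReplacer
    ' Hypothetical helper: swaps every placeholder src for the path returned
    ' by resolve(id) and returns the updated HTML.
    Function ReplacePlaceholders(encounterText As String,
                                 resolve As Func(Of String, String)) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(encounterText)

        ' SelectNodes returns Nothing when no node matches
        Dim imgs = doc.DocumentNode.SelectNodes("//img[@id and @src]")
        If imgs Is Nothing Then Return encounterText

        For Each img As HtmlNode In imgs
            If img.GetAttributeValue("src", "").Contains("(image_placeholder)") Then
                ' Overwrite the existing src attribute with the real path
                img.SetAttributeValue("src", resolve(img.GetAttributeValue("id", "")))
            End If
        Next

        Return doc.DocumentNode.OuterHtml
    End Function

    Sub Main()
        Dim entry As String = "<p><img id='d8fh4-gfkj3' src='(image_placeholder)' /></p>"
        ' The lambda below is a made-up resolver used only for this demo
        Console.WriteLine(ReplacePlaceholders(entry, Function(id) "/media/" & id & ".png"))
    End Sub
End Module
```

Passing the lookup in as a delegate keeps the sketch self-contained; in the asker's code the call would simply be getMediaPath(imageId).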

Access elements inside html <embed> tag source html using VB.Net

I’m using the SHDocVw.InternetExplorer APIs in my VB.NET WinForms application to fetch elements from Internet Explorer. I can easily access the elements inside the parent document and frame elements, but I am not able to access the elements inside the 'embed' container. Here's the sample code:
Dim ie As New SHDocVw.InternetExplorer
ie.Navigate("Some URL")
ie.Visible = True
Dim ieDoc As mshtml.IHTMLDocument2 = ie.Document
'All Elements
Dim allElements = ieDoc.all
'Frames
Dim allFrames = ieDoc.frames
'Fetch each frame and use its document to get all elements
Dim allEmbed = ieDoc.embeds
'How to fetch document inside embed to access its elements?
And here's a sample html:
Sample.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Sample</title>
</head>
<body>
<embed src="test.html" name="test1"/>
</body>
</html>
Test.html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Sample</title>
</head>
<body bgcolor="#FFFFFF">
<button>Button1</button>
<label>Test 1</label>
</body>
</html>
How can I access the button and label inside the Test.html loaded in Sample.html using 'embed' tag?
Edit 1:
From my research, I can access the document inside an 'object' container using the .contentDocument property of the 'object' element, but the same does not work for the 'embed' container.
I can get a COM object by calling getSVGDocument() on the 'embed' container, but I am not able to cast it to mshtml.IHTMLDocument2.
Well, I have been using "Html Agility Pack" to parse HTML and it's pretty awesome:
you can get all embed elements in your page and then read/parse the inner content.
http://html-agility-pack.net/
My sample:
'<html xmlns='http://www.w3.org/1999/xhtml'>
'<head>
' <title>Sample</title>
'</head>
'<body>
' <embed src='http://stackoverflow.com/questions/41806246/access-elements-inside-html-embed-tag-source-html-using-vb-net' name='test1'/>
'</body>
'</html>
'The htmlCode string:
Dim htmlCode As String = "<html xmlns='http://www.w3.org/1999/xhtml'><head><title>Sample</title></head><body><embed src='http://stackoverflow.com/questions/41806246/access-elements-inside-html-embed-tag-source-html-using-vb-net' name='test1'/></body></html>"
Dim client As New WebClient()
Dim doc = New HtmlDocument()
doc.LoadHtml(htmlCode)
Dim nodes = doc.DocumentNode.Descendants("embed")
For Each item As HtmlNode In nodes
Dim srcEmbed = item.GetAttributeValue("src", "")
If Not String.IsNullOrWhiteSpace(srcEmbed) Then
Dim yourEmbedHtml As String = client.DownloadString(srcEmbed)
'Do what you want with yourEmbedHtml
End If
Next

How to get elements inside of commented out code HtmlAgilityPack in VB.NET

Is there a way to use HtmlAgilityPack on HTML that is inside <!-- --> comment blocks? For example, how can I target the inner text of "//div[@class='theClass']" that is inside a block like this:
<!-- <div class="theClass">Hello I am <span class="theSpan">some text.</span> </div>-->
So that I get
Hello I am some text.
The reason I ask is that this kept returning Nothing, because the divs are inside comments:
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='theClass']")
Unfortunately, XPath treats the content of a comment node as plain text, which means you can't query that content the way you query ordinary nodes.
One possible way is to parse the comment node's content as another HtmlDocument so you can query it, for example:
'get the desired comment node (note the quotes around the search text)
Dim htmlnode As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//comment()[contains(., 'theClass')]")
Dim comment As New HtmlDocument()
'remove the outer <!-- --> so we have clean content
comment.LoadHtml(htmlnode.InnerHtml.Replace("<!--", "").Replace("-->", ""))
'here you can use a normal XPath query again
Dim result As HtmlNode = comment.DocumentNode.SelectSingleNode("//div[@class='theClass']")
'the following line prints "Hello I am some text."
Console.WriteLine(result.InnerText)

VB.net Can you read the text in a certain Div

I was wondering whether you can read the text in a certain div. Say the HTML is:
<html>
<head>
</head>
<body>
<div id="Main">SomeText</div>
<div id="Text">Welcome to my website</div>
</body>
</html>
I only want to see 'Welcome to my website' in TextBox1.
Does anyone know how I can do that?
Any help would be much appreciated.
Mark your div with runat="server":
<div id="TextDiv" runat="server">Welcome to my website</div>
then access the text in VB.NET code:
TextDiv.InnerHtml
I would recommend the HTML Agility Pack, hosted on CodePlex at http://htmlagilitypack.codeplex.com/. With it you can connect to an HTML source, load the HTML into a reasonably friendly navigator, and use XML-style queries to traverse and manipulate the HTML.
I would use HtmlAgilityPack; then it's as easy as:
Dim html = System.IO.File.ReadAllText("path")
Dim doc = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)
Dim welcomeDiv = doc.GetElementbyId("Text")
Me.TextBox1.Text = welcomeDiv.InnerText