Visual Basic 2010 - Retrieving numbers from an html div - vb.net

I'm currently working on a program that will average the prices from a searched item on Amazon.
I have a button in the program that, when pressed, prints the HTML source code into a RichTextBox and then finds the specific div within the source code.
My only problem right now is getting it to print the money amount that follows each div.
Is there any way to do this?

You can use HtmlAgilityPack, one of the best HTML parsing libraries, to ease this task.
Examples (in C#):
Assuming the div's id is divPrice:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHTML);
HtmlNode priceNode = doc.GetElementbyId("divPrice");
string price = priceNode.InnerText;
Let's say the div has no id but a CSS class cssPrice; then you can query it with XPath:
HtmlNode priceNode = doc.DocumentNode.SelectSingleNode("//div[@class='cssPrice']");
Or, to get every matching div:
// SelectNodes returns null when nothing matches, so check before looping
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class='cssPrice']");
foreach (HtmlNode node in nodes) {
    string nodeText = node.InnerText;
}
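Once InnerText has been read from each node, the remaining step in the question is turning strings such as "$1,234.56" into numbers and averaging them. Below is a minimal sketch of that step, written in JavaScript only so it can be run on its own; the same approach should carry over to VB.NET with Regex and Decimal.Parse. The helper names parsePrice and averagePrices are hypothetical, and the pattern assumes US-style amounts with a leading dollar sign.

```javascript
// Hypothetical helper: pull the first money amount out of a text fragment.
// Assumes US-style formatting such as "$1,234.56"; other locales need another pattern.
function parsePrice(text) {
  const match = /\$\s*(\d[\d,]*(?:\.\d+)?)/.exec(text);
  if (!match) {
    return null; // no dollar amount found in this fragment
  }
  // Drop thousands separators before converting to a number.
  return Number(match[1].replace(/,/g, ""));
}

// Hypothetical helper: average every price that could be parsed.
// Returns null when no fragment contained a price.
function averagePrices(texts) {
  const prices = texts.map(parsePrice).filter((p) => p !== null);
  if (prices.length === 0) {
    return null;
  }
  const total = prices.reduce((sum, p) => sum + p, 0);
  return total / prices.length;
}

// Usage: fragments without a price are skipped.
console.log(averagePrices(["$10.00", "Now $20", "n/a"])); // 15
```

Returning null instead of 0 for "no price found" keeps missing values from dragging the average down.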

Related

How to find the href attribute using VBA/selenium/chrome?

I am fairly new to web scraping and I started with Selenium in VBA/Excel using Chrome. My target elements are the menu items.
In some websites I can find the elements but I can't get their "href". Here is the destination website. I want the "href" of the menu items in the right side of the page.
This is what I tried:
' getting the menus, working well:
Set mnus = bot1.FindElementsByClass("topmenu")
MenuCounter = 0
For Each mnu In mnus
    ' getting the href, fails
    lnk = mnu.Attribute("href")
Next mnu
I also tried some other approaches, without success.
This is a screen shot of the inspect:
Note that I don't only want the href for this specific element (whose href is "art"). I also want the href of other equivalent menu items (except for the first item).
The href value is on the anchor tag, not the li element, so you need to target the anchor tag. Use the following CSS selector:
Set mnus = bot1.FindElementsByCssSelector("li.topmenu > a[href]")
MenuCounter = 0
For Each mnu In mnus
    lnk = mnu.Attribute("href")
Next mnu
Or, to get the links of all menus, try this:
Set mnus = bot1.FindElementsByCssSelector("ul.topmenu a[href]")
MenuCounter = 0
For Each mnu In mnus
    lnk = mnu.Attribute("href")
Next mnu

Selenium webdriver:select a "div" from many "div"s that dynamically change the absolute path

I need some help selecting one div from many divs; the number of divs changes with every session.
For example:
one time I have the div at this position (absolute path): /html/body/div[97]
another time at this position (absolute path): /html/body/div[160]
and so on...
At any moment only one div is active and the other divs are hidden.
I attached a picture to show the code.
I tried the XPath below but it doesn't work; I get the error "no such element: Unable to locate element ..."
driver.findElement(By.xpath(".//*[@class='ui-selectmenu-menu ui-selectmenu-open']/ul/li[1]")).click();
Picture with html code is here:
An absolute XPath is fine if you expect the target element to be in the exact same location every time the page is displayed. However, when the content is constantly changing, it is better to select by attribute, for example with a CSS selector.
In the example below I locate the DIV and the element that you highlighted in your screen capture.
WebElement targetElementDiv = null;
WebElement targetElement = null;
targetElementDiv = driver.findElement(By.cssSelector("[class='ui-selectmenu-menu ui-selectmenu-open']"));
if (targetElementDiv != null) {
    targetElement = targetElementDiv.findElement(By.cssSelector("[class='ui-menu-item ui-state-focus']"));
}

replacing attributes within an html image tag

I have 1000+ database entries that contain HTML image tags.
The problem is, 90% of the 'src' attributes are just placeholders. I need to replace all of those placeholders with the appropriate, real sources.
A typical database entry looks like this (the number of image tags varies from entry to entry):
<p>A monster rushes at you!</p>
Monster:<p><img id="d8fh4-gfkj3" src="(image_placeholder)" /></p>
<br />
Treasure: <p><img id="x23zo-115a9" src="(image_placeholder)" /></p>
Please select your action below:
</br />
Using the IDs in the image tags above, 'd8fh4-gfkj3' & 'x23zo-115a9', I can query another function to get the "real" sources for those images.
So I tried using HtmlAgilityPack and came up with this(below):
Dim doc As New HtmlDocument()
doc.LoadHtml(encounterText)
For Each imgTag As HtmlNode In doc.DocumentNode.SelectNodes("//img")
'get the ID
Dim imgId As HtmlAttribute = imgTag.Attributes("id")
Dim imageId As String = imgId.Value
'get the new/real path
Dim newPath = getMediaPath(imageId)
Dim imgSrc As HtmlAttribute = imgTag.Attributes("src")
'check to see if the <img> tag "src" attribute has a placeholder
If imgSrc.Value.Contains("(image_placeholder)") Then
'replace old image src attribute with 'src=newPath'
End If
Next
But I can't figure out how to actually replace the old value with the new value.
Is there a way to do this with the HtmlAgilityPack?
Thanks!
You should be able to just set the value for the attribute:
'check to see if the <img> tag "src" attribute has a placeholder
If imgSrc.Value.Contains("(image_placeholder)") Then
'replace old image src attribute with 'src=newPath'
imgSrc.Value = newPath
End If
After the replacement, you can get the updated HTML with:
doc.DocumentNode.OuterHtml

How to get elements inside of commented out code HtmlAgilityPack in VB.NET

Is there a way to use HtmlAgilityPack on HTML that is inside <!-- --> comment blocks? For example, how can I target the inner text of "//div[@class='theClass']" that is inside a block like this:
<!-- <div class="theClass">Hello I am <span class="theSpan">some text.</span> </div>-->
So that I get
Hello I am some text.
The reason I ask is that the following keeps returning null, because the divs are inside comments:
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='theClass']")
Unfortunately, XPath treats the content of a comment node as plain text, which means you can't query that content the way you query ordinary nodes.
One possible way is to parse the comment node's content as another HtmlDocument so that you can query it, for example:
'get the desired comment node
Dim htmlnode As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//comment()[contains(., 'theClass')]")
Dim comment As New HtmlDocument()
'remove the outer <!-- --> so we have clean content
comment.LoadHtml(htmlnode.InnerHtml.Replace("<!--", "").Replace("-->", ""))
'here you can use a common XPath query again
Dim result As HtmlNode = comment.DocumentNode.SelectSingleNode("//div[@class='theClass']")
'following line will print "Hello I am some text."'
Console.WriteLine(result.InnerText)
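The delimiter-stripping step can be looked at on its own. The sketch below is in JavaScript purely so it can be run standalone, and unwrapComment is a hypothetical helper name. It assumes the comment text arrives in the usual <!-- ... --> form and removes only the outermost delimiters, leaving any "<!--" that occurs inside the content untouched.

```javascript
// Hypothetical helper: strip only the leading "<!--" and trailing "-->"
// from a comment's text so the remainder can be parsed as ordinary HTML.
function unwrapComment(commentText) {
  return commentText
    .replace(/^\s*<!--/, "") // opening delimiter at the very start
    .replace(/-->\s*$/, ""); // closing delimiter at the very end
}

const inner = unwrapComment('<!-- <div class="theClass">Hello</div>-->');
console.log(inner.trim()); // <div class="theClass">Hello</div>
```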

Alternative to innerHTML for IE

I create HTML documents that include proprietary tags for links that get transformed into standard HTML when they go through our publishing system. For example:
<LinkTag contents="Link Text" answer_id="ID" title="Tooltip"></LinkTag>
When I'm authoring and reviewing these documents, I need to be able to test these links in a browser, before they get published. I wrote the following JavaScript to read the attributes and write them into an <a> tag:
var LinkCount = document.getElementsByTagName('LinkTag').length;
for (i=0; i<LinkCount; i++) {
var LinkText = document.getElementsByTagName('LinkTag')[i].getAttribute('contents');
var articleID = document.getElementsByTagName('LinkTag')[i].getAttribute('answer_id');
var articleTitle = document.getElementsByTagName('LinkTag')[i].getAttribute('title');
document.getElementsByTagName('LinkTag')[i].innerHTML = '' + LinkText + '';
}
This works great in Firefox, but not in IE. I've read about the innerHTML issue with IE and imagine that's the problem here, but I haven't been able to figure a way around it. I thought perhaps jQuery might be the way to go, but I'm not that well versed in it.
This would be a huge productivity boost if I could get this working in IE. Any ideas would be greatly appreciated.
innerHTML only affects what is INSIDE an element's opening and closing tags. So, for instance, if your LinkTag[i] were an <a> element, then setting innerHTML = "<a ...> </a>" would put that markup literally between the opening <a> tag and its closing </a>.
You would need to put the generated markup in a DIV instead. Perhaps use your code to read the attributes from the links, then place the corresponding HTML into a div.
var LinkCount = document.getElementsByTagName('LinkTag').length;
for (var i = 0; i < LinkCount; i++) {
  var LinkText = document.getElementsByTagName('LinkTag')[i].getAttribute('contents');
  var articleID = document.getElementsByTagName('LinkTag')[i].getAttribute('answer_id');
  var articleTitle = document.getElementsByTagName('LinkTag')[i].getAttribute('title');
  // getElementById returns a single element, so no [i] index is needed
  document.getElementById('MyDisplayDiv').innerHTML = '' + LinkText + '';
}
This should produce your HTML results within a div. You could also simply append the other LinkTag elements to a single DIV to produce a sort of "Preview of Website" within the div.
document.getElementById('MyDisplayDiv').innerHTML += '' + LinkText + '';
Note the +=, which appends the generated HTML to the DIV with ID "MyDisplayDiv". Your list of LinkTag elements would be converted into a list of HTML elements within the div.
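Whichever element ends up receiving the markup, the string itself is built by concatenating attribute values, and a quote or ampersand in a value can break it. Below is a minimal sketch of assembling the anchor markup with escaping. The names escapeHtml and buildAnchorHtml are hypothetical, and because the real URL scheme is not shown in the post, the href is produced by a caller-supplied hrefFor callback, which is purely an assumption for illustration.

```javascript
// Escape characters that are special inside HTML text and attribute values.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build the <a> markup for one LinkTag.
// `hrefFor` maps an answer_id to a URL; the real mapping is not known here.
function buildAnchorHtml(attrs, hrefFor) {
  const href = escapeHtml(hrefFor(attrs.answerId));
  const title = escapeHtml(attrs.title);
  const text = escapeHtml(attrs.contents);
  return '<a href="' + href + '" title="' + title + '">' + text + "</a>";
}

// Usage with a made-up preview URL scheme.
const html = buildAnchorHtml(
  { contents: "Link Text", answerId: "ID", title: "Tooltip" },
  (id) => "/preview/" + id
);
console.log(html); // <a href="/preview/ID" title="Tooltip">Link Text</a>
```

The resulting string can then be assigned to innerHTML or passed to insertAdjacentHTML.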
DOM functions might be considered more "clean" (and faster) than simply replacing the innerHTML in cases like this. Something like this may work better for you:
// where el is the element you want to wrap in an <a> link
var newel = document.createElement('a');
el.parentNode.insertBefore(newel, el);   // put the new <a> where el currently sits
el = el.parentNode.removeChild(el);      // detach el from its old position
newel.appendChild(el);                   // re-attach el inside the <a>
newel.setAttribute('href', 'urlhere');
Similar code worked fine for me in Firebug, it should wrap the element with <a> ... </a>.
I wrote a script, which you can find on my blog here: http://blog.funelr.com/?p=61. Take a look; it automatically fixes IE's innerHTML errors in IE8 and IE9 by hijacking IE's innerHTML property, which is read-only for table elements, and replacing it with my own.
I have this xmlObject:
<c:value i:type="b:AccessRights">ReadAccess WriteAccess AppendAccess AppendToAccess CreateAccess DeleteAccess ShareAccess AssignAccess</c:value>
I find the value using this: responseXML.getElementsByTagNameNS("http://schemas.datacontract.org/2004/07/System.Collections.Generic",'value')[0].firstChild.data
This one is the best: elm.insertAdjacentHTML('beforeend', str);
In case your element's content is very small: elm.innerHTML += str;