XPath extracted from Chrome only works for first TR-element in .NET - vb.net

Could someone help me with the following problem?
I am trying to get the third link text, "Sangria Vinyl Lyrics", from this webpage:
http://www.metrolyrics.com/pharrell-williams-lyrics.html
Thanks to Google Chrome I found the following XPath to extract it:
//*[@id='popular']/div/table/tbody/tr[3]/td[2]/a
If I use this XPath in an add-on like XPath Helper, it works for every element I search for, from tr[1] to tr[65].
If I copy and paste this XPath into my .NET code, with HtmlAgilityPack installed, I get nothing back.
Strangely, a modified XPath (/tr[1]) for the first text is the only one that works, returning "Happy Lyrics".
Is there something special about XPath, element indexes and HtmlAgilityPack in .NET?
My code so far:
Dim doc As HtmlDocument = New HtmlWeb().Load("http://www.metrolyrics.com/pharell-williams-lyrics.html")
Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='popular']/div/table/tbody/tr[3]/td[2]/a")
I even tried it with the full XPath:
(/html[1]/body[1]/div[2]/div[3]/div[3]/div[2]/div[1]/div[3]/div[1]/div[1]/div[2]/div[1]/div[1]/table[1]/tbody[1]/tr[3]/td[2]/a)
but with the same result. It works in the add-on but not in .NET.
Any idea what I'm doing wrong?

Related

MS Access VBA web scraper returns "&amp;" in place of "&"

I am using the Access VBA to do some web scraping.
It works fine for scraping table columns in most places, but I have found that for a string such as
Mon&day it actually returns Mon&amp;day.
I am using the IE object to do the web scraping
Set ie = CreateObject("InternetExplorer.Application")
And for scraping individual cells I am doing:
tdRow(subCounter).innerHTML
I know that the & is a special character in HTML, which is probably why this is happening. Is there a way to return the HTML as it is instead of letting VBA do some further parsing?
Use innerText to get just the text, without spacing and inner element tags.
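The difference between the two properties can be illustrated outside the browser. This is a minimal sketch in Python (not VBA), using only the standard `html` module; the variable names are made up for illustration. It assumes that `innerHTML` hands back the escaped markup while `innerText` hands back the decoded text, which is what the question describes.

```python
import html

# What the markup (and innerHTML) contains: the ampersand is escaped.
cell_markup = "Mon&amp;day"

# Decoding the entity gives the text a reader sees on the page.
cell_text = html.unescape(cell_markup)
print(cell_text)

# Escaping the plain text again reproduces the markup form.
round_trip = html.escape(cell_text)
print(round_trip)
```

If the raw markup is needed for other reasons, decoding the entities afterwards is an alternative to switching properties.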

Selenium XPath with Single, double quotes and comma

I am trying to create an XPath for the string below (the string is in a .properties file and is picked up by the framework for identification):
If you would like to change your answers or finish a section that doesn’t have a checkmark, click on “Change”.
I tried the XPath below:
CHANGE_ANSWER = xpath://ul/li[contains(.,concat('If you would like to change your answers or finish a section that doesn',"’", 't have a checkmark',",", ' click on ','“','Change','”','.'))]
I don't know what exactly is wrong.
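One way to take the guesswork out of the quoting is to generate the string literal instead of writing `concat()` by hand. The sketch below is Python and not part of any Selenium API; `xpath_literal` is a hypothetical helper name. It relies on the XPath 1.0 rule that only the straight quotes `'` and `"` delimit string literals, so the typographic quotes in the sentence should not need splitting at all. Whether the .properties loader preserves those characters is a separate question that this sketch does not address.

```python
def xpath_literal(text: str) -> str:
    """Return text as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both straight quote types occur: split on ' and rejoin with concat().
    pieces = []
    for i, part in enumerate(text.split("'")):
        if i > 0:
            pieces.append('"\'"')  # a lone single quote, wrapped in double quotes
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


sentence = (
    "If you would like to change your answers or finish a section "
    "that doesn’t have a checkmark, click on “Change”."
)
locator = f"//ul/li[contains(., {xpath_literal(sentence)})]"
print(locator)
```

The comma inside the sentence is harmless here because it sits inside a quoted literal rather than between `concat()` arguments.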

Trying to scrape item from website

I am trying to write a simple program that pulls a text item from a website and adds it to a textbox. I am just experimenting and thought I could manage it, but it is not that easy for me. I know how to get the entire source code of a page (below). The item has an id that I know, but it does not have a tag name, so I'm not sure how to read through the text and keep only the part next to the id. Or would it be better to use a WebBrowser control and get the text item that way? I just want whichever is faster. I think my first option is better because it should use less RAM. Given the code below, I don't know what to add next.
Dim request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create("Website")
Dim response As System.Net.HttpWebResponse = request.GetResponse()
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
Dim source As String = sr.ReadToEnd()
Let's say the id is "name", for example. In the page source, the relevant part looks like the line below. How can I parse the source, which is a string, find this section, get the name Brandon, and add it to the textbox?
<span id="name">Brandon</span>
There are a few ways to go about this. I'm not going to write any source code though since I haven't used Visual Basic in a very long time. But if you Google for how to do any of the following you should find many tutorials and documents on it.
Regular Expressions
Running a regular expression over the full source code can find the element by searching for its ID attribute, which should be unique. Regular expressions can sometimes be very slow, so they are best avoided if you have to perform a lot of searches on large sections of text.
/<([a-z0-9]+)\sid="name"(.*?)>(.*?)<\// -> Not Tested, but might help you
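A runnable version of that idea, sketched in Python rather than Visual Basic since the answer is language-neutral. The sample `source` string is made up around the span from the question, and the pattern is a variation on the untested one above: it assumes the id value is double-quoted and that the element contains no nested tag of the same name.

```python
import re

# Stand-in for the downloaded page source.
source = '<p>Hello</p><span id="name">Brandon</span>'

# Any tag carrying id="name"; group 1 is the tag name, group 2 its content.
pattern = re.compile(
    r'<([a-z0-9]+)\s[^>]*?id="name"[^>]*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)

match = pattern.search(source)
name = match.group(2) if match else None
print(name)
```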
String Position
A function that finds the position of a substring within a string is useful here. In C it's strstr and in PHP it's strpos. These functions give you the starting position of a substring, which in your case would be id="name". Once you have that, find the position of the end of the opening tag and then the position of the closing tag for that element. Then take the substring that starts just after the end of the opening tag and runs up to the start of the closing tag.
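The steps above can be sketched as follows, again in Python for illustration; `str.find` plays the role of strstr or strpos, and `extract_text_by_id` is a hypothetical helper name. It assumes the element holds plain text with no nested tags.

```python
from typing import Optional


def extract_text_by_id(source: str, element_id: str) -> Optional[str]:
    """Return the text of the element with the given id, or None if absent."""
    attr_pos = source.find(f'id="{element_id}"')
    if attr_pos == -1:
        return None
    tag_end = source.find(">", attr_pos)      # end of the opening tag
    if tag_end == -1:
        return None
    close_start = source.find("<", tag_end)   # start of the closing tag
    if close_start == -1:
        return None
    return source[tag_end + 1:close_start]


source = '<p>Hello</p><span id="name">Brandon</span>'
name = extract_text_by_id(source, "name")
print(name)
```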
HTML / XML Library
There are probably plenty of HTML / XML libraries that parse the document into some sort of object or array. You can then loop through the elements until you find the one you are looking for. Some of these libraries may even offer lookups by element ID, similar to how JavaScript finds a specific element.
These libraries may be hard to get started with, but they will offer you a lot of options in the future if you need to continue finding more HTML elements.
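As a rough illustration of the library route, here is a sketch using Python's standard `html.parser` module; it is not the .NET library a Visual Basic program would use, and `IdTextFinder` is a hypothetical class name. It collects the text inside the first element whose id matches, and it does not account for void elements such as `<br>` inside the target.

```python
from html.parser import HTMLParser


class IdTextFinder(HTMLParser):
    """Collect the text inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                      # nested tag inside the target
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1                       # entered the target element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


source = '<p>Hello</p><span id="name">Brandon</span>'
finder = IdTextFinder("name")
finder.feed(source)
name = "".join(finder.chunks)
print(name)
```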

Insert image from URL bookmark Microsoft word

I have an image URL stored in my SQL database.
I created a bookmark for that column in the Word document (this works fine).
Now I want to use the image URL passed from the database to insert an image.
I have tried a hyperlink (does not work and does not display the image).
I have tried Quick Parts - IncludePicture (does not work).
I have been Googling and have not found anything that works.
OK, let me simplify this.
I want to insert an image using a URL.
I know you can do this in a lot of different ways.
For instance, using Quick Parts and then selecting IncludePicture, you paste the URL of the picture and the image is inserted.
Now I want to do exactly that, with one exception: the URL is a Microsoft Word bookmark that I get from my database.
For some reason this does not work. I have also checked the bookmark data and it is correct, and it is a valid URL, because if I copy and paste it from the database in the way described above, it works.
So is there any other way to do this?
To be honest, I still don't know exactly where your problem is. I assume you have the knowledge and code to get both the bookmark name and the URL from your database using VBA. If so, fairly simple code will load a picture from the web into a bookmark in your Word document.
Below is the code I tested, with partial success. It works fine for an ordinary picture, but it does not work with the URL of an active Google map. I have no idea what you mean by 'static google map' (in the comment); you didn't provide an example, so you will need to run your own test.
Before you run this test, make sure you have two bookmarks in your active document: bookmark_logo and bookmark_poland. Hope this helps a bit.
Sub Insert_picture_To_Bookmark()
    Dim mapURL As String
    Dim soLOGO As String

    ' Ordinary image URL: this insert worked in testing.
    soLOGO = "http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png"
    ' AddPicture arguments: FileName, LinkToFile, SaveWithDocument
    ActiveDocument.Bookmarks("bookmark_logo"). _
        Range.InlineShapes.AddPicture _
        soLOGO, True, True

    ' Interactive Google Maps URL: this insert did not work in testing.
    mapURL = "https://maps.google.pl/maps?q=poland&hl=pl&sll=50.046766,20.004863&sspn=0.22047,0.617294&t=h&hnear=Polska&z=6"
    ActiveDocument.Bookmarks("bookmark_poland"). _
        Range.InlineShapes.AddPicture _
        mapURL, True, True
End Sub

HTML Tag Isolation

Using the HtmlAgilityPack, how can I isolate the tag I am searching for?
The application parses the source code of a certain website to find a tag whose text is to be extracted, as shown below:
<div> this is the text i want to extract</div>
I have tried RegEx and some string manipulation, but to no avail.
When you have a good discriminant (which you pointed out in the comment), for example a class attribute, XPath is an easy way to query the HTML DOM, like this:
Dim question As String = "what time is it"
Dim web As New HtmlWeb
Console.WriteLine(web.Load(("http://wiki.answers.com/Q/" & question)).DocumentNode.SelectSingleNode("//div[@class='answer_text']").InnerText.Trim)
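The same attribute predicate can be tried without a network call. This is a minimal sketch in Python using the standard `xml.etree.ElementTree` module instead of HtmlAgilityPack; the `page` markup is invented for illustration. Unlike HtmlAgilityPack, ElementTree only accepts well-formed markup, so real-world HTML would need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the downloaded page.
page = """<html><body>
  <div class="question">what time is it</div>
  <div class="answer_text"> this is the text i want to extract </div>
</body></html>"""

root = ET.fromstring(page)

# The class attribute is the discriminant; @ introduces an attribute test.
node = root.find(".//div[@class='answer_text']")
text = node.text.strip() if node is not None else None
print(text)
```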