Using the HtmlAgilityPack, how can I isolate the tag I am searching for?
The application parses the source code of a certain website to find a tag whose text is to be extracted, as shown below:
<div> this is the text i want to extract</div>
I have tried RegEx and some string manipulation, but to no avail.
When you have a good discriminant (which you pointed out in the comment), for example a CLASS attribute, XPATH is an easy way to query the HTML DOM, like this:
Dim question As String = "what time is it"
Dim web As New HtmlWeb
Console.WriteLine(web.Load(("http://wiki.answers.com/Q/" & question)).DocumentNode.SelectSingleNode("//div[@class='answer_text']").InnerText.Trim)
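To try the same kind of XPath query without depending on a live page, here is a minimal sketch that parses an inline HTML string instead. It assumes the HtmlAgilityPack NuGet package is referenced; the module name, the function name ExtractDivText and the sample markup are made up for illustration.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathByClassDemo
    ' Returns the trimmed text of the first <div> whose class attribute
    ' equals cssClass, or Nothing when no such node exists.
    Function ExtractDivText(html As String, cssClass As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        ' Note the "@" in front of the attribute name in the XPath predicate.
        Dim node As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='" & cssClass & "']")
        If node Is Nothing Then Return Nothing
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' Hypothetical sample markup standing in for the downloaded page.
        Dim html As String = "<html><body><div class='answer_text'> this is the text i want to extract </div></body></html>"
        Console.WriteLine(ExtractDivText(html, "answer_text"))
    End Sub
End Module
```

Checking the result for Nothing before reading InnerText avoids a NullReferenceException when the page layout changes and the XPath no longer matches.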
Related
Could someone help me to solve my problem in the following situation:
I am trying to get the third text, "Sangria Vinyl Lyrics", from this webpage:
http://www.metrolyrics.com/pharrell-williams-lyrics.html
Thanks to Google Chrome I found the following XPath to extract it:
//*[@id='popular']/div/table/tbody/tr[3]/td[2]/a
If I use this XPath in an add-on like XPath Helper, it works for every element I search for, from tr[1] through tr[65].
If I copy and paste this XPath into my .NET code, with HtmlAgilityPack installed, I get nothing.
Strangely, a modified XPath (/tr[1]) that targets the first text is the only one that works, returning "Happy Lyrics".
Is there something special about XPath, element indexes and HtmlAgilityPack in .NET?
My code so far:
Dim doc As HtmlDocument = New HtmlWeb().Load("http://www.metrolyrics.com/pharell-williams-lyrics.html")
Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='popular']/div/table/tbody/tr[3]/td[2]/a")
I tried it even with the "full path XPath" like
(/html[1]/body[1]/div[2]/div[3]/div[3]/div[2]/div[1]/div[3]/div[1]/div[1]/div[2]/div[1]/div[1]/table[1]/tbody[1]/tr[3]/td[2]/a)
but with the same result. It works in the add-on but not in .NET.
Any idea what I'm doing wrong?
So, I am creating a program in VB that opens the HTML of a webpage and searches the page code for a string like "youtube.com/watch?". I want to know how I can copy into a variable the text that follows the string I am looking for.
Here is an example of what I am looking for:
https://www.youtube.com/watch?v=NwYv-f65P6w
So let's say this is what I found on the page, and what I want to copy is "v=NwYv-f65P6w". The problem is that "youtube.com/watch?" is always the same, but the part after it is different for every video. So how can I copy it?
Use a regular expression to extract a specific text pattern; this is easy in VB.NET.
You only need to learn how to write your own pattern.
Something like this: (https?://).+(\?v=). This pattern matches any text that starts with http:// or https://, followed by any characters, followed by the text ?v= (the question mark must be escaped because it is a special character in regular expressions).
Search Google for more RegEx patterns.
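As a rough illustration of the regular-expression approach, the sketch below captures the part that follows "youtube.com/watch?". The module and function names are hypothetical, and the pattern assumes a video id consists only of letters, digits, underscores and hyphens.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module VideoQueryDemo
    ' Returns the "v=..." part of the first YouTube watch link found in the
    ' page source, or an empty string when there is no such link.
    Function ExtractVideoQuery(source As String) As String
        ' The literal text "youtube.com/watch?" followed by a captured group:
        ' "v=" plus the characters assumed to be valid in a video id.
        Dim m As Match = Regex.Match(source, "youtube\.com/watch\?(v=[A-Za-z0-9_-]+)")
        If m.Success Then Return m.Groups(1).Value
        Return String.Empty
    End Function

    Sub Main()
        ' Hypothetical fragment of downloaded page source.
        Dim source As String = "<a href=""https://www.youtube.com/watch?v=NwYv-f65P6w"">video</a>"
        Console.WriteLine(ExtractVideoQuery(source))
    End Sub
End Module
```

To collect every link on the page rather than the first one, Regex.Matches can be used with the same pattern.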
I was attempting to create a simple program that would pull a text item from a website and add it to a textbox. I'm just experimenting and thought I could do it, but it is not that easy for me. I know how to get the entire source code of a website (below). The item has an id that I know, but it does not have a tag name, so I'm not really sure how to make the program read through the text and keep only the part next to the id. Or would it be better to use a WebBrowser control and get the text item that way? I'm just trying to do whatever is faster. I think my first option is better because it would use less of the computer's RAM. Given the code below, what should I add next?
Dim request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create("Website")
Dim response As System.Net.HttpWebResponse = request.GetResponse()
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
Dim source As String = sr.ReadToEnd()
Let's say the id is "name", for example. Viewing the source of the page, this is what the part looks like (below). How can I parse the source, which is a string, find this section, get the name Brandon, and add it to the textbox?
<span id="name">Brandon</span>
There are a few ways to go about this. I'm not going to write any source code though since I haven't used Visual Basic in a very long time. But if you Google for how to do any of the following you should find many tutorials and documents on it.
Regular Expressions
Using a Regular Expression on the full source code can help you find the element by searching for the ID attribute, which should be unique. Regular Expressions can sometimes be very slow, so they should be avoided if you have to perform a lot of searches on large sections of text.
/<([a-z0-9]+)\sid="name"(.*?)>(.*?)<\// -> Not Tested, but might help you
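A VB.NET variation on that pattern might look like the sketch below. It is adapted rather than copied from the expression above: the PHP-style delimiters are dropped, other attributes are allowed around the id, and the closing tag must match the opening tag name. The names SpanByIdDemo and ExtractById are hypothetical, and nested elements with the same tag name are not handled.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SpanByIdDemo
    ' Returns the inner text of the first element whose id attribute equals
    ' the given id, or Nothing when no such element is found.
    Function ExtractById(source As String, id As String) As String
        ' Group 1: tag name. Group 2: everything up to the matching close tag.
        Dim pattern As String = "<([a-z0-9]+)\s[^>]*\bid=""" & Regex.Escape(id) & """[^>]*>(.*?)</\1>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim m As Match = Regex.Match(source, pattern, options)
        If m.Success Then Return m.Groups(2).Value
        Return Nothing
    End Function

    Sub Main()
        Dim source As String = "<p>Hello</p><span id=""name"">Brandon</span>"
        Console.WriteLine(ExtractById(source, "name"))
    End Sub
End Module
```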
String Position
A function that finds the position of a substring within a string would be useful. In C it's strstr and in PHP it's strpos. These types of functions give you the starting position of a substring, which in your case would be id="name". Once you find that, find the position of the end of the opening tag and then the position of the closing tag for that element. You then perform a substring operation that returns the text starting at position X for the length you specify, which would be the closing tag position minus the end-of-opening-tag position.
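In VB.NET the corresponding calls are String.IndexOf and String.Substring. The steps above can be sketched as follows; the module and function names are hypothetical, and the code assumes the element contains plain text with no nested tags.

```vbnet
Imports System

Module StringPositionDemo
    ' Finds the marker (for example id="name"), then returns the text between
    ' the end of that opening tag and the start of the next closing tag.
    Function ExtractByPosition(source As String, marker As String) As String
        Dim markerPos As Integer = source.IndexOf(marker, StringComparison.Ordinal)
        If markerPos < 0 Then Return Nothing

        ' End of the opening tag that contains the marker.
        Dim openEnd As Integer = source.IndexOf(">"c, markerPos)
        If openEnd < 0 Then Return Nothing

        ' Start of the closing tag that follows.
        Dim closeStart As Integer = source.IndexOf("</", openEnd + 1, StringComparison.Ordinal)
        If closeStart < 0 Then Return Nothing

        ' Length = closing tag position minus end-of-opening-tag position.
        Return source.Substring(openEnd + 1, closeStart - openEnd - 1)
    End Function

    Sub Main()
        Dim source As String = "<span id=""name"">Brandon</span>"
        Console.WriteLine(ExtractByPosition(source, "id=""name"""))
    End Sub
End Module
```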
HTML / XML Library
There are probably a ton of HTML / XML libraries that will parse the document into some sort of object or an array. You can then loop through these elements until you find the one you are looking for. Some of these libraries may even have search functions for element IDs, similar to how JavaScript can look up a specific element.
These libraries may be hard to get started with, but they will offer you a lot of options in the future if you need to continue finding more HTML elements.
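As one possible illustration, the sketch below uses HtmlAgilityPack, which is only one such library and is an assumption here since the text above does not name any. It requires the HtmlAgilityPack NuGet package; the module and function names are hypothetical.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ElementByIdDemo
    ' Parses the HTML and returns the text of the element with the given id,
    ' or Nothing when the document contains no such element.
    Function GetTextById(html As String, id As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        ' The library spells this method GetElementbyId (lowercase "b").
        Dim node As HtmlNode = doc.GetElementbyId(id)
        If node Is Nothing Then Return Nothing
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        Dim html As String = "<html><body><span id=""name"">Brandon</span></body></html>"
        Console.WriteLine(GetTextById(html, "name"))
    End Sub
End Module
```

In a Windows Forms program the returned string could then be assigned to a textbox, for example TextBox1.Text = GetTextById(source, "name"), where TextBox1 is whatever the control is called on your form.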
Sorry if that title wasn't very easy to get the gist of. This is my first post.
Basically, my program has an option to send an email. It's an E-Ticket. I have set 'IsBodyHtml' to true. It sends fine, with no problems at all.
Within the HTML code however I want to insert some fields that are relevant to each customer.
When I set ETicket.Body equal to the HTML code, I get a number of errors because words such as 'Width' and 'Height' are being picked up as VB words.
As a short term fix so I could test that the HTML body actually works I put the code into a rich text box and then set ETicket.Body = RichTextBox1.Text . It works, but doesn't have the data in it that I want.
The data relevant to each customer is held in an array. Any idea how I can get the HTML code to be accepted by VB? Or how I can get my data from the array into the relevant position in the rich text box?
Thank you!
Joe
This will likely be due to the double quotes in the HTML markup. Try doing a find and replace on the HTML, and replace double quotes (") with single ones (').
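Another option is to keep the double quotes and write each one as "" inside the VB string literal. The sketch below does that and also shows one way to place per-customer data into the markup with String.Format. The module and function names, the table markup and the layout of the customer array (name first, seat second) are made up for illustration.

```vbnet
Imports System
Imports System.Net

Module ETicketBodyDemo
    ' Builds the HTML body for one customer. The array layout
    ' (index 0 = name, index 1 = seat) is a hypothetical example.
    Function BuildBody(customer As String()) As String
        ' Inside a VB string literal a double quote is written as "".
        ' {0} and {1} are placeholders that String.Format fills in.
        Dim template As String = _
            "<table width=""400"" height=""100"">" & _
            "<tr><td>Name: {0}</td><td>Seat: {1}</td></tr>" & _
            "</table>"
        ' HtmlEncode keeps characters such as < and & in the data from
        ' being interpreted as markup.
        Return String.Format(template, _
                             WebUtility.HtmlEncode(customer(0)), _
                             WebUtility.HtmlEncode(customer(1)))
    End Function

    Sub Main()
        Dim customer As String() = {"Joe", "12A"}
        Console.WriteLine(BuildBody(customer))
    End Sub
End Module
```

The result can be assigned directly, as in ETicket.Body = BuildBody(customer). If the HTML contains literal curly braces (for example in an inline style block), they have to be doubled as {{ and }} for String.Format to leave them alone.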
I know how to make the VB program go to Google. I even know how to navigate around, but I don't know how to manipulate the results.
Basically I want the program to grab search results from Google and output them to a listbox. So if the user searches for burgers, then the search results would be output to a listbox. Does anyone know how to do this?
here's my code so far:
Public Class Form1
Dim look, retrieve As String
Private Sub Search_Click(sender As Object, e As EventArgs) Handles Search.Click
look = InputBox("What are you looking for?")
look = look.Replace(" ", "+")
Dim G1 As String = "http://www.google.co.uk/#hl=en&tbo=d&output=search&sclient=psy-ab&q="
WebBrowser1.Navigate(G1 + look)
retrieve = InputBox("What links do you want to retrieve?")
End Sub
End Class
I know it is easier to use the Google API, but it is also a lot slower. I've used the API in the past and have seen performance issues. I've just seen in another thread how to download a website's source pretty quickly. I just don't know how to grab the URLs from the downloaded source. Is anyone here any good with string manipulation?
Code so far:
sourcecode = ((New Net.WebClient).DownloadString(G1 + look))
If you look into XPATH and are not averse to using open source third party tools, the HTML Agility Pack (Code Examples) is supposed to be a great tool for parsing HTML.
Another option, which can be a pain, is to convert the source HTML string into a valid XML document, and then parse it using VB's XML namespace. I have done this in an application I use to parse YouTube playlists. The issue with this approach is that it takes a bit of manual cleaning of the HTML string before you can turn it into an XML document.
Lastly you could try to digest the html string using string methods only, however this is going to be error prone and will again depend very largely on the structure of the document.
No matter which method of parsing the HTML you choose, Google search results currently contain a div with the ID 'search'. From a purely string standpoint, you could search for this in your source string like so:
Dim searchTerm As String = "<div id=""search"""
Dim searchLoc As Integer = 0
searchLoc = sourceCode.IndexOf(searchTerm)
Once you know where the search results section starts, you can search first for "<li class=""g""" tokens and then for "<h3 class=""r""" tokens inside those. The result text is inside the h3. You would want to consume up to the first </h3> and </li> respectively to get the tokens.
Once you have this text, you would need to sanitize it by searching through it and removing the HTML tags. You could easily write an algorithm that consumes just the link text by looping through the indexes of key characters.
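A minimal sketch of that token-by-token approach follows. It relies on the markup described above (a div with id "search" and h3 elements with class "r"), which may no longer match what Google actually serves; the sample source string and all names in the code are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module SearchResultDemo
    ' Removes everything between "<" and ">" and keeps the remaining text.
    Function StripTags(fragment As String) As String
        Dim sb As New StringBuilder()
        Dim insideTag As Boolean = False
        For Each c As Char In fragment
            If c = "<"c Then
                insideTag = True
            ElseIf c = ">"c Then
                insideTag = False
            ElseIf Not insideTag Then
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    ' Collects the text of every <h3 class="r"> that appears after the
    ' <div id="search"> marker.
    Function ExtractResultTitles(source As String) As List(Of String)
        Dim titles As New List(Of String)()
        Dim startToken As String = "<h3 class=""r"""
        Dim endToken As String = "</h3>"
        Dim pos As Integer = source.IndexOf("<div id=""search""", StringComparison.Ordinal)
        If pos < 0 Then Return titles
        Do
            Dim startPos As Integer = source.IndexOf(startToken, pos, StringComparison.Ordinal)
            If startPos < 0 Then Exit Do
            Dim endPos As Integer = source.IndexOf(endToken, startPos, StringComparison.Ordinal)
            If endPos < 0 Then Exit Do
            ' Consume up to the first </h3>, then strip the tags inside it.
            titles.Add(StripTags(source.Substring(startPos, endPos - startPos)).Trim())
            pos = endPos + endToken.Length
        Loop
        Return titles
    End Function

    Sub Main()
        ' Made-up source resembling the structure described in the text.
        Dim source As String = "<div id=""search""><ol>" & _
            "<li class=""g""><h3 class=""r""><a href=""http://example.com/a"">Best <b>burgers</b></a></h3></li>" & _
            "<li class=""g""><h3 class=""r""><a href=""http://example.com/b"">Burger recipes</a></h3></li>" & _
            "</ol></div>"
        For Each title As String In ExtractResultTitles(source)
            Console.WriteLine(title)
        Next
    End Sub
End Module
```

In the form from the question, each title could be added to the listbox with ListBox1.Items.Add(title), assuming that is the name of the control.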
The whole point is to break it down into smaller pieces incrementally and then digest the smaller pieces. No matter how you approach it you are going to be doing this. However using a parser of some kind and utilizing the power of XPATH selector expressions would make it much easier than manually generating the tokens.
The pure string way is going to be the most difficult and also the slowest way to try and accomplish this. I would highly recommend trying to find a way to do it with some form of HTML parser otherwise you may go mad before you get a working solution.
As a final note, it looks like you are using a WebBrowser control on your form. You can use this control and its related classes to parse the HTML of the pages it retrieves. I have done this before, and while it is not the most efficient way of scraping the web, it can be very easy. Look into the HtmlDocument class for methods that work with the objects this control returns.