VB.NET Webbrowser.Document - what you see is not what you can get - vb.net

My attempts at writing a simple crawler seem to be confounded by the fact that my target webpage (as would appear in the UI browser control, or through a typical browser application) is not completely accessible as an HTMLDocument (due to frames, javascript, etc.)
The code below executes, and the correct webpage (e.g. the one displaying items 50-59) can even be seen in the control, but where I would expect the “next page” hyperlink retrieved to be “...&start=60”, I see something else – the one corresponding to opening the first catalog page “...&start=10”.
What is odd, is that if I press the button a second time, I DO get what I’m looking for. Even odder to me, if I inserted a MsgBox, say right after I’ve looped to wait until WebBrowserReadyState.Complete, then I get what I’m looking for.
Private Sub ButtonGo_Click(sender As System.Object, e As System.EventArgs) Handles ButtonGo.Click
'start at this URL
'e.g. http://www.somewebsite.com/properties?l=Dallas+TX&co=US&start=50
catalogPageURL = TextBoxInitialURL.Text
WebBrowser1.Navigate(catalogPageURL)
While WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
Application.DoEvents()
End While
'Locate the URL associated with the NEXT>> hyperlink
Dim allLinksInDocument As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
Dim strNextPgLink As String = ""
For Each link As HtmlElement In allLinksInDocument
If link.GetAttribute("className") = "next" Then
strNextPgLink = link.GetAttribute("href")
End If
Next
End Sub
I’ve googled around enough to try things like using a WebBrowser1.DocumentCompleted
event, but that still didn’t work. I’ve tried inserting sleep commands.
I’ve avoided using WebClient and regular expressions, the way I would have ordinarily done this, because I’m convinced using the DOM will be easier for other things I have planned down the road, and I’m aware of HTML Agility Pack but not ambitious enough to learn it. Because it seems there has to be a simple way to have this dang webbrowser.document object synchronized with the stuff you can actually see.
If this is because of javascript, is there a way I can tell the webbrowser to just execute them all?
First question on the forum, looking forward to more (smarter ones hopefully)

Be warned when using webbrowser1.Document or something similar - you will not get 'raw html'
Example: (assume wbMain is a webbrowser control)
RTB_RawHTML.Text = wbMain.DocumentText
Try
RTB_BodyHTML.Text = wbMain.Document.Body.OuterHtml
Catch
debugMessage("Body tag not found.")
End Try
In this example, the markup inside the body tag as shown in RTB_RawHTML will NOT perfectly match the HTML shown in RTB_BodyHTML. Accessing it through (yourwebbrowserhere).Document.Body.OuterHtml appears to 'clean' it somewhat, as opposed to the 'raw' HTML retrieved by (yourwebbrowserhere).DocumentText.
This was a problem for me when I was making a web scraper, as it would continually throw me off: sometimes I would try to match a tag and it would find it, and other times it wouldn't, even though I was sure it was there. The reason was that I was trying to match the raw HTML, but I needed to match the 'cleaned' HTML.
I'm not sure if this will help you isolate the problem or not, but for me it did.
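Separately from the raw-versus-cleaned difference, the question mentions frames and the DocumentCompleted event. On a page with frames that event fires once per frame, so reading the document on the first firing may be too early. The following is only a sketch under that assumption, not a confirmed diagnosis of the original problem. The class name CrawlerForm and the StartCrawl helper are hypothetical; the link lookup is the one from the question.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Windows.Forms

Public Class CrawlerForm
    Inherits Form

    Private WithEvents WebBrowser1 As New WebBrowser()

    Public Sub New()
        WebBrowser1.Dock = DockStyle.Fill
        Controls.Add(WebBrowser1)
    End Sub

    ' Hypothetical helper: starts the navigation and returns immediately.
    ' The document is read later, in the DocumentCompleted handler.
    Public Sub StartCrawl(catalogPageURL As String)
        WebBrowser1.Navigate(catalogPageURL)
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As Object,
            e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        ' The event also fires for each frame; only act on the top-level document.
        If Not e.Url.Equals(WebBrowser1.Url) Then Return

        ' Locate the URL associated with the NEXT>> hyperlink (same lookup as in the question).
        Dim strNextPgLink As String = ""
        For Each link As HtmlElement In WebBrowser1.Document.GetElementsByTagName("a")
            If link.GetAttribute("className") = "next" Then
                strNextPgLink = link.GetAttribute("href")
            End If
        Next
        Debug.WriteLine("Next page link: " & strNextPgLink)
    End Sub
End Class
```

Content that scripts inject after the page has loaded could still be missing at this point, so this may not be sufficient on script-heavy pages.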

Related

In VB.NET, use a textbox as a log showing which If statement is being processed inside a Sub

Hi, I have a Sub that has multiple If statements in it.
Each If statement has a large loop that searches for specific files and for text inside files.
I tried various ways of using a text box to show which If statement is currently being processed, but for some reason the UI is not refreshed until the Sub finishes, so every time I only see the message for the last processed If statement in the textbox.
What do you think is the best way to handle it?
I hope that this has nothing to do with threads, because threads are something that I am not familiar with!
I think calling Application.DoEvents() inside the loop is the easy choice, although I don't know whether it gives the behavior you want.
If Application.DoEvents() is not enough, the work should be moved to another thread.
https://learn.microsoft.com/en-us/dotnet/api/system.windows.forms.application.doevents?view=netframework-3.5
I use this:
Public Sub logWithCrLf(tx As TextBox, s As String)
tx.AppendText(s & vbCrLf)
tx.Select(tx.TextLength - 1, 0)
tx.ScrollToCaret()
tx.Refresh()
End Sub
I see that it scrolls a bit smoother using tx.AppendText(s) than tx.Text &= s, which scrolls up to 0 then down to caret again.
(I write this as an answer to contribute with the tx.AppendText() recommendation)

FindElements not working for controls in some parts of some Windows programs

WinAppDriver's FindElement will not always find objects in the program to be automated.
I've gotten this to work with other programs, like Notepad, and even a different dialog in my program to be automated, and it worked in those places.
This is the code I am using so far. The first three lines execute without error, successfully launching the application into its Login dialog:
Dim appCapabilities As DesiredCapabilities = New DesiredCapabilities()
appCapabilities.SetCapability("app", "C:\[my program].exe")
Dim ProgramSession = New WindowsDriver(Of WindowsElement)(New Uri("http://127.0.0.1:4723"), appCapabilities)
ProgramSession.FindElementByName("Password").SendKeys("Password")
The fourth line should find the element, a text box, and enter the string "Password" into it via sendkeys, but it fails, with the following exception:
System.InvalidOperationException: 'An element could not be located on the page using the given search parameters.'
The target object is on screen, and this should work. I'm using the info shown for the object in Inspect.exe, Name: "Password".
WinAppDriver's window shows the following error information:
{"using":"name","value":"Password"}
HTTP/1.1 404 Not Found
Content-Length: 139
Content-Type: application/json
{"status":7,"value":{"error":"no such element","message":"An element could not be located on the page using the given search parameters."}}
The fourth line of code is executed directly after program startup.
Since a program needs some load time, you'll need to wait for the program to finish loading before trying to search for a control on the GUI. You can do this by using a while loop in combination with a stopwatch for timeout.
Dim shouldContinue As Boolean = True
Dim stopWatch As StopWatch = New StopWatch()
Dim timeOut As TimeSpan = TimeSpan.FromSeconds(30)
stopWatch.Start()
While shouldContinue AndAlso timeOut > stopWatch.Elapsed
If element.IsFound Then
shouldContinue = False
stopWatch.Stop()
End If
End While
element.IsFound is just mock-up code; you will need to fill in that blank. This is a good Q/A to show you how to check if an element has loaded.
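One way to fill in that blank is to retry the lookup itself until it succeeds or the timeout expires. This is a minimal sketch, assuming the Appium.WebDriver package used in the question (the version that exposes FindElementByName). The module name ElementWaiter, the function name WaitForElementByName and the 500 ms pause are hypothetical choices, not part of WinAppDriver.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Threading
Imports OpenQA.Selenium
Imports OpenQA.Selenium.Appium.Windows

Module ElementWaiter
    ' Hypothetical helper: polls for an element by name until it is found
    ' or the timeout expires. Returns Nothing if it never appeared.
    Function WaitForElementByName(session As WindowsDriver(Of WindowsElement),
                                  name As String,
                                  timeOut As TimeSpan) As WindowsElement
        Dim watch As Stopwatch = Stopwatch.StartNew()
        While watch.Elapsed < timeOut
            Try
                Return session.FindElementByName(name)
            Catch ex As WebDriverException
                ' Element not available yet; fall through and retry.
            Catch ex As InvalidOperationException
                ' The exception type reported in the question; retry as well.
            End Try
            ' Assumed pause between attempts, to avoid a busy loop.
            Thread.Sleep(500)
        End While
        Return Nothing
    End Function
End Module
```

With the session from the question, a call such as WaitForElementByName(ProgramSession, "Password", TimeSpan.FromSeconds(30)) would return the text box, or Nothing after 30 seconds.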
Another thing you need to take into account is the possibility that your Login dialog runs in another window handle. If the window handle WinAppDriver is using is different from the window handle where your element is, you won't be able to find that element.
Also check whether you can find whatever you are searching for in the PageSource property XML from your driver. I usually do this by evaluating that property in the Visual Studio watch window and copying its content into an XML formatter tool.
I was able to find the password field by using FindElementByXPath instead of FindElementByName.
In order to find the xpath, I used the Recorder for WinAppDriver.
These xpaths can be VERY long. I was able to shorten some of them by removing some duplicate attributes, but some are over 450 characters long. I can sometimes reduce it further with variables, but I'm not exactly delighted so far with WinAppDriver as a replacement for CodedUI.

How to get the last child of an HTMLElement

I have written a macro in Excel that opens and parses a website and pulls the data from it. The trouble I'm having is once I'm done with all of the data on the current page I want to go to the next page. To do this I want to get the last child of the "result-stats" node. I found the lastChild function, and so came up with the following code:
'Checks to see if there is a next page
If html.getElementById("result-stats").LastChild.innerText = "Next" Then
html.getElementById("result-stats").LastChild.Click
End If
And here is the HTML that it is accessing:
<p id="result-stats">
949 results
<span class="optional"> (1.06 seconds)</span>
Modify search
Show more columns
Next
</p>
When I try to run this, I get an error. After a lot of searching I think I found the reason. According to what I read, getElementById returns an element and not a node. lastChild only works on nodes, which is why the function doesn't work here.
My question is this. Is there a clean and simple way to grab the last child of an element? Or is there a way to typecast an element to that of a node? I feel like I'm missing something obvious, but I've been at this way longer than I should have been. Any help anyone could provide would be greatly appreciated.
Thanks.
Here's a shell of how to do it. If my comments are not clear, ask away. I assumed knowledge of how to navigate to the page, wait for the browser, etc.
Sub ClickLink()
Dim IE As Object
Set IE = CreateObject("InternetExplorer.Application")
'load up page and all that stuff
'process data ...
'click link
Dim doc As Object
Set doc = IE.document
Dim sLink As Object
For Each sLink In doc.getElementsByTagName("a")
If sLink.innerText = "Next" Then 'may need to play with this, if `innerText` doesn't work
sLink.Click
Exit For
End If
Next
End Sub

Visual Basic (2010) - Using variables in embedded text files?

I've always been able to just search for what I need on here, and I've usually found it fairly easily, but this seems to be an exception.
I'm writing a program in Visual Basic 2010 Express, it's a fairly simple text based adventure game.
I have a story, with multiple possible paths based on what button/option you choose.
The text of each story path is saved in its own embedded resource .txt file. I could just write the contents of the text files straight into VB, and that would solve my problem, but that's not the way I want to do this, because that would end up looking really messy.
My problem is that I need to use variable names within my story, here's an example of the contents of one of the embedded text files,
"When "+playername+" woke up, "+genderheshe+" didn't recognise "+genderhisher+" surroundings."
I have used the following code to read the file into my text box
Private Sub frmAdventure_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
Dim thestorytext As String
Dim imageStream As Stream
Dim textStreamReader As StreamReader
Dim assembly As [Assembly]
assembly = [assembly].GetExecutingAssembly()
imageStream = assembly.GetManifestResourceStream("Catastrophe.CatastropheStoryStart.png")
textStreamReader = New StreamReader(assembly.GetManifestResourceStream("Catastrophe.CatastropheStoryStart.txt"))
thestorytext = textStreamReader.ReadLine()
txtAdventure.Text = thestorytext
End Sub
Which works to an extent, but displays it exactly as it is in the text file, keeps the quotes and the +s and the variable names instead of removing the quotes and the +s and replacing the variable names with what's stored within the variables.
Can anyone tell me what I need to change or add to make this work?
Thanks, and apologies if this has been answered somewhere and I just didn't recognise it as the solution or didn't know what to search to find it or something.
Since your application is compiled, you cannot just put some of your VB code in the text file and have it executed when it is read.
What you can do, and what is usually done, is that you leave certain tags inside your text file, then locate them and replace them with the actual values.
For example:
When %playername% woke up, %genderheshe% didn't recognise %genderhisher% surroundings.
Then in your code, you would find all the tags:
Dim matches = Regex.Matches(thestorytext, "%(\w+?)%")
For Each match in matches
' the tag name is now in: match.Groups(1).Value
' replace the tag with the value and replace it back into the original string
Next
Of course the big problem still remains - which is how to fill in the actual values. Unfortunately, there is no clean way to do this, especially using any local variables.
You can either manually maintain a Dictionary of tag names and their values, or use Reflection to get the values directly at the runtime. While it should be used carefully (speed, security, ...), it will work just fine for your case.
Assuming you have all your variables defined as properties in the same class (Me) as the code that reads and processes this text, the code will look like this:
Dim matches = Regex.Matches(thestorytext, "%(\w+?)%")
For Each match As Match In matches
    Dim tag = match.Groups(1).Value
    ' Look up the property whose name matches the tag
    Dim value = Me.GetType().GetProperty(tag).GetValue(Me, Nothing)
    thestorytext = thestorytext.Replace(match.Value, value.ToString()) ' Lazy code
Next
txtAdventure.Text = thestorytext
If you don't use properties, but only fields, change the line to this:
Dim value = Me.GetType().GetField(tag).GetValue(Me)
Note that this example is rough and the code will happily crash if the tags are misspelled or not existing (you should do some error checking), but it should get you started.
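The Dictionary alternative mentioned above can be sketched as follows. This is only an illustration: the module name StoryTags, the function name FillTags and the sample values are hypothetical, and unknown tags are left in place rather than causing a crash.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module StoryTags
    ' Replaces each %tag% with its value from the dictionary.
    ' Tags that are not in the dictionary are left unchanged.
    Function FillTags(template As String, values As Dictionary(Of String, String)) As String
        Return Regex.Replace(template, "%(\w+?)%",
            Function(m As Match)
                Dim replacement As String = Nothing
                If values.TryGetValue(m.Groups(1).Value, replacement) Then
                    Return replacement
                End If
                Return m.Value
            End Function)
    End Function

    Sub Main()
        ' Sample values, made up for the example.
        Dim values As New Dictionary(Of String, String) From {
            {"playername", "Alex"},
            {"genderheshe", "she"},
            {"genderhisher", "her"}
        }
        Dim story As String = "When %playername% woke up, %genderheshe% didn't recognise %genderhisher% surroundings."
        Console.WriteLine(FillTags(story, values))
    End Sub
End Module
```

Compared with Reflection, this keeps the set of allowed tags explicit, at the cost of having to update the dictionary whenever a value changes.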

VB.NET WebBrowser disable javascript

Is there a way to disable javascript webbrowser in vb.net?
works for me:
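Private Function TrimScript(ByVal htmlDocText As String) As String
    Const openTag As String = "<script type=""text/javascript"">"
    Const closeTag As String = "</script>"
    Dim s_index As Integer = htmlDocText.IndexOf(openTag, StringComparison.OrdinalIgnoreCase)
    While s_index > -1
        ' Search for the closing tag only after the current opening tag
        Dim e_index As Integer = htmlDocText.IndexOf(closeTag, s_index, StringComparison.OrdinalIgnoreCase)
        If e_index = -1 Then Exit While
        ' Remove the whole element, closing tag included
        htmlDocText = htmlDocText.Remove(s_index, e_index + closeTag.Length - s_index)
        s_index = htmlDocText.IndexOf(openTag, StringComparison.OrdinalIgnoreCase)
    End While
    Return htmlDocText
End Function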
Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
Dim webClient As New System.Net.WebClient
Dim result As String = webClient.DownloadString(yourUrl)
Dim wb As New WebBrowser
wb.Navigate("")
Do While wb.ReadyState <> WebBrowserReadyState.Complete
Application.DoEvents()
Loop
Dim script As String = TrimScript(result)
wb.DocumentText = script
End Sub
The short answer is: No.
The slightly longer answer is: No, the web-browser control API does not allow disabling standard browser functionality.
Not really... but if what bothers you is the error message that pops up saying a script is running, you can set the WebBrowser control's ScriptErrorsSuppressed property to True.
Which popup message do you want to disable? If it's the alert message, try the code below. I've assumed the top-level document; if you need an iframe, you can access it using window.frames(0) for the first frame, and so on, so resolve the window or frame object to suit your particular needs. Here is some code, assuming WB is your webbrowser control:
WB.Document.parentWindow.execScript "window.alert = function () { };", "JScript"
You must run the above code only after the entire page is done loading. I understand this is very difficult to do (and a foolproof version hasn't been published yet); however, I have been doing it reliably for some time now, and you can gather hints on how to do this accurately if you read some of my previous answers labelled "webbrowser" and "webbrowser-control". Getting back to the question at hand: if you want to cancel the .confirm JavaScript message, just replace window.alert with window.confirm (qualifying your window object with the correct object to reach the document hierarchy you are working with). You can also disable the .print method with the above technique, and the new IE9 .prompt method as well.
If you want to disable JavaScript entirely, you can use the registry. You must make the registry change before the webbrowser control loads into memory, and every time you change it (on or off) you must unload and reload the webbrowser control (or just restart your application).
The registry key is HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Zones\ - the value name is 1400, and the value to disable scripting is 3; to enable it, 0.
Of course, because there are 5 zones under the Zones key, you need to either change it for the active zone or for all zones to be sure. However, you really don't need to do this if all you want to do is suppress JS dialog popup messages.
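A minimal sketch of the registry change described above, assuming the key and value given in this answer, and taking the "all zones" route. The module name ScriptingSetting and the Sub name SetScripting are hypothetical. Note that this setting is per user and is shared with Internet Explorer itself, so treat it as an illustration rather than a recommendation.

```vbnet
Imports System
Imports Microsoft.Win32

Module ScriptingSetting
    Private Const ZonesPath As String = "Software\Microsoft\Windows\CurrentVersion\Internet Settings\Zones"

    ' Writes value 1400 for every zone found under the current user's Zones key:
    ' 0 enables scripting, 3 disables it (values as given in the answer above).
    Sub SetScripting(enabled As Boolean)
        Using zones As RegistryKey = Registry.CurrentUser.OpenSubKey(ZonesPath)
            If zones Is Nothing Then Return
            For Each zoneName As String In zones.GetSubKeyNames()
                Using zone As RegistryKey = zones.OpenSubKey(zoneName, True)
                    If zone IsNot Nothing Then
                        zone.SetValue("1400", If(enabled, 0, 3), RegistryValueKind.DWord)
                    End If
                End Using
            Next
        End Using
    End Sub
End Module
```

As the answer says, the call would have to happen before the webbrowser control is loaded, for example at application start-up, and SetScripting(True) would restore the enabled state afterwards.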
Let me know how you go, and if I can help further.