How to read/parse dynamically generated web content?

I need to find a way to write a program (in any language) that connects to a website and reads dynamically generated data from it.
Note that the data is dynamically generated: it's not enough to get the source HTML, because the data I'm interested in is generated by JavaScript that calls back-end code. So when I view the page source, I can't see the data. (For example, go to Google and do a search, then check the source of the results page. Very little of the data your browser displays is reflected in the source; most of it is dynamically generated. I need some way to access this data.)

Pick a language and environment that includes an HTML renderer (e.g. .NET and the WebBrowser control). Use the HTML renderer to load the URL and produce an HTML DOM in memory (making sure that scripting is enabled). Read the contents of the HTML DOM after the renderer has done its work.
Example (you'll need to do this inside a class derived from System.Windows.Forms.Form):
WebBrowser browser = new WebBrowser();
// Navigate is asynchronous, so read the DOM only once loading has completed.
browser.DocumentCompleted += (sender, e) =>
{
    HtmlDocument document = browser.Document;
    // extract what you want from the document
};
browser.Navigate("http://www.google.com");

I used to have a Perl program that accessed Mapguide.com to get driving directions from one location to another. I parsed the returned page and saved the result to a database. That works as long as the source never changes its format. The problem is that the source format often changes, and then your parser has to change too.

A simple thought: if we're talking about AJAX, you can instead look up the URLs that serve the dynamic data and request them directly. You can then use the JavaScript on the page in question to reformat the response.
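As a rough sketch of that idea: once the browser's network tools reveal the URL that the page's script calls, the data can be requested directly and decoded without rendering anything. The endpoint URL, the `results` key and the `title` field below are hypothetical placeholders, not taken from any real site, and the sketch assumes the endpoint returns JSON.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, found by watching the page's network requests.
DATA_URL = "http://example.com/api/search?q=test"


def extract_titles(payload):
    """Pull the 'title' of each item out of a decoded JSON payload.

    Assumes a shape like {"results": [{"title": "..."}, ...]}.
    """
    return [item["title"] for item in payload.get("results", []) if "title" in item]


def fetch_titles(url=DATA_URL):
    """Request the AJAX endpoint directly and return the extracted titles."""
    # Some back ends only answer requests that look like they came from script.
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(request, timeout=10) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return extract_titles(payload)


# Offline demonstration with a made-up payload (no network access needed).
sample = json.loads('{"results": [{"title": "First"}, {"title": "Second"}]}')
print(extract_titles(sample))  # ['First', 'Second']
```

Calling `fetch_titles()` against a real endpoint would need the actual URL and field names substituted in. Endpoints like this can change as often as the page markup does.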

If you have Firefox/Greasemonkey, making a DOM dumper should be a simple matter.

Related

Where can I find a usage reference for SVG-edit?

I found this tool (https://code.google.com/p/svg-edit/) very useful, but is there any reference for the project that explains how to integrate it properly into an application instead of simply adding it in an iframe?
For example, I want to retrieve the SVG code so I can save it in a variable or something like that.
It's possible but might be a little tricky to strip out the required resources and load them in your own application while making sure that there aren't any conflicts. This is actually something I plan on doing for one of my own projects.
Do you need to do this, though? You can talk to the iframe from your application pretty easily. For example, to get the SVG content you would use this (assuming you only have one iframe on the page):
var svgedit = window.frames[0];
var svgString = svgedit.svgCanvas.svgCanvasToString();

How to pass data between pages through the Worklight client API

I want to invoke a procedure in one page and use its response in another page. The response is only used by the next page, so I don't think JSONStore is suited for that. Should I define a global variable?
Is there any code sample that does this? Thanks for your help.
I presume that by pages you mean different HTML files. If so, that is not recommended: Worklight is intended for single-page applications. There are no code samples that show how to do that.
I would recommend having a single HTML page and using something like jQuery.load to inject new HTML / DOM elements. By dynamically injecting new HTML, your single/main HTML file shouldn't be too big, and you can destroy (i.e. remove from memory / the DOM) unused DOM elements. Searching on Google for page fragments and HTML templates could help you find examples. The idea is that you don't lose the JavaScript context.
Maybe you can get away with doing a new init to re-initialize JSONStore (it won't delete any of the data, just give you access) on every new HTML page, and then using get to access the JSONStore collections and perform operations such as find.

Windows Phone 8: Load/Create HTML on the fly and load into browser

I am working on an app that reads XML and displays content according to what's contained in the XML. I have the XML part done, but I need one other part: loading a small section of HTML code into a web browser element. Is there any way for me to either dynamically create an HTML file (I was thinking maybe create one, save it in storage, then load it from there?) or directly insert code into the web browser element?
Failing this, I'll just create a PHP page on my server that adjusts according to the value it's passed.
You can store your entire HTML code in a string variable and call the NavigateToString method.
myWebBrowser.NavigateToString(myHtmlCode); // myHtmlCode is the string variable holding your HTML
How you create the HTML string depends on your app but you could store a basic template and use String.Replace to replace any particular items in the code.
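The template-and-replace idea can be sketched as follows. This is written in Python rather than the app's own language, purely to illustrate the approach; the template text and the `{{TITLE}}` / `{{BODY}}` markers are made-up placeholders. Escaping the inserted values is an addition not mentioned above, included so that characters such as `&` or `<` coming from the XML don't break the markup.

```python
from html import escape

# Hypothetical template; {{TITLE}} and {{BODY}} are made-up placeholder markers.
TEMPLATE = "<html><body><h1>{{TITLE}}</h1><p>{{BODY}}</p></body></html>"


def render(template, values):
    """Replace each {{KEY}} marker with its HTML-escaped value."""
    html = template
    for key, value in values.items():
        html = html.replace("{{" + key + "}}", escape(value))
    return html


page = render(TEMPLATE, {"TITLE": "News", "BODY": "Fish & chips"})
print(page)  # <html><body><h1>News</h1><p>Fish &amp; chips</p></body></html>
```

The resulting string is what would then be handed to the browser element.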

Find a link inside an iframe in a WebBrowser control in VB.NET

I want to find a URL inside an iframe in a WebBrowser control.
1) My WebBrowser control opens a URL
2) That URL has one iframe inside it
3) That iframe has a link which I want to grab programmatically using VB.NET
At any point in time, use webBrowser1.Url.ToString() to get the URL of the currently open page.
You can get the HTML code of the open URL by using webBrowser1.DocumentText. Once you have the HTML code, use string manipulation to find the "iframe src" value.
This can be a bit complicated, as you might not know how many iframes you need to handle.
There are also some limitations on FRAME elements, according to the documentation for the HtmlWindow.WindowFrameElement property:
You cannot access a FRAME elements or the FRAME's document if the
FRAME is in a different zone than the FRAMESET that contains it. For a
full explanation, see About Cross-Frame Scripting and Security.
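As an alternative to hand-rolled string manipulation, the iframe sources can be pulled out with an HTML parser, which also copes with an unknown number of frames. The sketch below is in Python rather than VB.NET and assumes the page HTML (the kind of text DocumentText returns) is already available as a string; the sample markup and the `find_frame_sources` helper name are made up for illustration.

```python
from html.parser import HTMLParser


class FrameSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> or <frame> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("iframe", "frame"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_frame_sources(html_text):
    """Return the src values of all frames found in the given HTML text."""
    collector = FrameSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Made-up sample markup with two frames.
sample = (
    '<html><body><IFRAME SRC="http://example.com/inner.html"></IFRAME>'
    '<iframe src="/ads"></iframe></body></html>'
)
print(find_frame_sources(sample))  # ['http://example.com/inner.html', '/ads']
```

This only sees frames present in the HTML text itself; frames added later by script would not appear.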
Actually, all you need to do is this...
MsgBox(WebBrowser1.Document.Window.Frames(0).Document.GetElementById("linkTagId").GetAttribute("href"))
This will show you the href of the link; don't bother wasting time with string manipulation.
Of course, you can also loop through the frames and links, using their length properties in a for loop.
Also, there are ways to bypass the cross-frame security issues, since you are running the code in an exe. There are examples online; just search for "bypass cross-frame security webbrowser control" in Google without the quotes.
If you need more help with these, let me know, as I can tell you how. Remember that the cross-frame restrictions only need bypassing if the parent domain name and the iframe domain name are different (subdomains don't count, though; they can differ without problems).
Let me know mate :)

Checking the contents of an embed tag using Selenium

We generate a PDF document via a call to a web service that returns the path to the generated document.
We use an embed HTML tag to display the PDF inline, i.e.
<div id="ctl00_ContentPlaceHolder2_ctl01_embedArea">
<embed wmode="transparent" src="http://www.company.com/vdir/folder/Pdfs/file.pdf" width="710" height="400"/>
</div>
I'd like to use Selenium to check that the PDF is actually being displayed and, if possible, save the path (i.e. the src link) into a variable.
Does anyone know how to do this? Ideally we'd like to then be able to compare this PDF to a reference one, but that's a question for another day.
As far as inspecting the pdf from selenium, you're more or less out of luck. The embed tag just drops a plugin into the page, and because a plugin isn't well represented in the DOM, Selenium can't get a very good handle on it.
However, if you're using Selenium-RC, you may want to consider getting the src of the embed element, then requesting that URL directly and evaluating the resulting PDF in code. Assuming your embed element looks like this <embed id="embedded" src="http://example.com/static/pdf123.pdf" /> you can try something like this:
String pdfSrc = selenium.getAttribute("embedded@src");
Then make a web request to the pdfSrc URL and (somehow) validate that it's the one you want. It may be enough to just check that it's not a 404.
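A minimal sketch of that last check, written in Python rather than Java purely for illustration. It goes slightly beyond "not a 404" by also testing that the response starts with the `%PDF-` signature that PDF files begin with; the `looks_like_pdf` and `check_pdf_url` helper names are made up here, and comparing against a reference PDF is not attempted.

```python
from urllib.request import urlopen

PDF_MAGIC = b"%PDF-"  # PDF files start with this signature


def looks_like_pdf(data):
    """True if the bytes start with the PDF file signature."""
    return data.startswith(PDF_MAGIC)


def check_pdf_url(url, timeout=10):
    """Fetch the URL and report whether it returned something PDF-shaped.

    Returns False on HTTP errors such as 404, unreachable hosts, or timeouts.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return looks_like_pdf(response.read(len(PDF_MAGIC)))
    except OSError:
        # HTTPError, URLError and socket timeouts are all OSError subclasses.
        return False
```

For example, `check_pdf_url(pdf_src)` would be called with the src value read from the embed element.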