Script or piece of code to get a quick list of links per page in a website - scripting

How can I quickly produce a report of a website in the format:
Page Name.
- Links within the page
Page Name.
- Links within the page
Any programming or scripting language will do.
Although I'd prefer a solution on Windows, we have Windows, Mac, and Linux machines available in the office.
Just looking for a way to do it without much fanfare.

There might be tools that can do this for you, but it isn't all that hard to put together yourself. One possible solution would be to:
Use wget (available for Windows too) to download all the HTML files, and
use an XPath tool, or grep with regexes, to pull the title and the links out of each page.
///Jens
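If it helps, here's a rough sketch of the second step in Python. The URL list and the report format are just placeholders; files already downloaded by wget could be read from disk instead of fetched with urllib.
import urllib.request
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every href on a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

for url in ["http://example.com/"]:  # the pages you want in the report
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    parser = TitleAndLinks()
    parser.feed(html)
    print(parser.title.strip() or url)
    for link in parser.links:
        print("  -", link)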

There are loads of link analysers that will do exactly that. Here's the first one I found on Google.
For something a little more interesting, Don Syme did a great F# demo in which he wrote a really simple async URL-processing class. I can't find the exact link, but here's something similar from an F# MVP. You would need to adapt it to pull out links, and recursively follow them if you want nesting.

Related

What is the correct technical name for these products, and is there a list of available ones?

I'm trying to find other products that allow me to create a desktop app with HTML5 + JavaScript. I've found these three so far, but I still don't know their technical name, so I can't really search for them on Google. Any suggestions?
Also, I'm looking for a list of similar products, so I can choose the one that fits my needs.
I really like how you can build interfaces with HTML + CSS + JavaScript with great results, but I need fairly good interaction with the OS to handle windows. In particular, I was looking for transparent windows, which don't seem to be implemented in node-webkit at the moment; AppJS seems to support them, but I don't like the idea of serving the content like a web server, I prefer the node-webkit approach.
Search for "HTML5 Desktop" and you will find all the platforms that allows you to build desktop apps using html5 in the first page like appjs, tidesk, pokki, node-webkit etc.

Objective C get html page's links

I'm quite new to Objective-C programming and I'm trying to make an application that returns all the link addresses in an HTML page. In this case I shouldn't just parse the HTML, but rather catch the links by intercepting them from the page's network requests.
Is it possible to intercept the application's network requests or something?
Thanks
Coincidentally, Ray Wenderlich's rather AWESOME iOS tutorial site posted this article in the last hour. As you are new to iOS/ObjC, I highly recommend reading it thoroughly.
Let’s say you want to find some information inside a web page and display it in a custom way in your app. This technique is called “scraping.” Let’s also assume you’ve thought through alternatives to scraping web pages from inside your app, and are pretty sure that’s what you want to do. Well then you get to the question – how can you programmatically dig through the HTML and find the part you’re looking for, in the most robust way possible? Believe it or not, regular expressions won’t cut it!
And before you think Regular Expressions might really be an answer, please read this.
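For a concrete (and language-agnostic) feel for why, here is a tiny illustration in Python rather than Objective-C; the sample markup is made up. The naive regex misses the uppercase tag and the unquoted attribute, while a real HTML parser handles both.
import re
from html.parser import HTMLParser

sample = '<A HREF=/about>About</A> <a href="/contact">Contact</a>'

print(re.findall(r'<a href="([^"]+)"', sample))  # finds ['/contact'] only

class LinkPrinter(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            print(dict(attrs).get("href"))  # prints /about, then /contact

LinkPrinter().feed(sample)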

How does Safari's reader feature work?

I want to add a similar feature to a tool I'm making. I'm interested in how it works code-wise. I want to be able to get an HTML page and exclude everything but the article.
The Readability project does something similar for Chrome and iOS. I'm not sure how it detects the content automatically, but I know that Readability has an API for people who want to integrate its features. You might want to check that out.
http://www.readability.com/learn-more
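If you want to experiment with the same idea from Python, there is a port of the Readability algorithm; a minimal sketch, assuming the readability-lxml package is installed and using a placeholder URL:
import urllib.request
from readability import Document  # pip install readability-lxml

html = urllib.request.urlopen("http://example.com/some-article").read()
doc = Document(html)
print(doc.title())    # the detected article title
print(doc.summary())  # the article body as cleaned-up HTML, page chrome stripped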
If you're working with Ruby, you could use Pismo. It extracts an article from a given document.

Where are the docs for the Chromium Embedded Framework?

I downloaded and started playing with CEF, but there doesn't seem to be any docs for it. Not even a working wiki… Am I missing something?
Most of the documentation is in CEF's header files. The binary distribution comes with docs generated from those files. It's well documented in terms of the amount of content written, but I had a lot of trouble while learning to use it. The project's wiki contains a lot of useful content, as does the cefclient sample program.
The CEF3 API documentation can be found at http://magpcss.org/ceforum/apidocs3/
and the CEF1 API documentation can be found at http://magpcss.org/ceforum/apidocs/. Both links can be found on the Chromium Embedded Framework (CEF) wiki home page: https://bitbucket.org/chromiumembedded/cef/wiki/Home
You didn't provide a link to CEF, so I Googled it, and found the project's Web site, which features a prominent link to their wiki.
The wiki has several pages, but the first one that jumped out at me is the General Usage page that shows how to create a "fully functional embedded browser window using CEF".
So I'm not sure where you were looking, but yes, it looks like you were missing something (grin). The wiki documentation is right there.

How can I programmatically obtain content from a website on a regular basis?

Let me preface this by saying I don't care what language this solution gets written in as long as it runs on Windows.
My problem is this: there is a site that has data which is frequently updated that I would like to get at regular intervals for later reporting. The site requires JavaScript to work properly so just using wget doesn't work. What is a good way to either embed a browser in a program or use a stand-alone browser to routinely scrape the screen for this data?
Ideally, I'd like to grab certain tables on the page but can resort to regular expressions if necessary.
You could probably use web app testing tools like Watir, Watin, or Selenium to automate the browser to get the values from the page. I've done this for scraping data before, and it works quite well.
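For example, a minimal sketch with Selenium's Python bindings; the URL and CSS selector are placeholders, and Watir or WatiN would look much the same in Ruby or .NET.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome() / webdriver.Ie()
try:
    driver.get("http://example.com/report")
    # JavaScript has run by now, so the rendered table is in the DOM
    for row in driver.find_elements(By.CSS_SELECTOR, "table#data tr"):
        print(row.text)
finally:
    driver.quit()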
If JavaScript is a must, you can try instantiating Internet Explorer via ActiveX (CreateObject("InternetExplorer.Application")) and use its Navigate2() method to open your web page.
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.Navigate2 "http://stackoverflow.com"
After the page has finished loading (check document.ReadyState), you have full access to the DOM and can use whatever methods to extract any content you like.
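The same approach works from Python via COM, if that's more convenient; a rough sketch, assuming the pywin32 package is installed (the URL and the tag name are placeholders):
import time
import win32com.client

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = True
ie.Navigate2("http://stackoverflow.com")

# READYSTATE_COMPLETE is 4; wait until the page (and its scripts) are done
while ie.Busy or ie.ReadyState != 4:
    time.sleep(0.1)

# Full DOM access from here, e.g. dump the text of every table
for table in ie.Document.getElementsByTagName("table"):
    print(table.innerText)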
You could look at Beautiful Soup. Being open-source Python, it is easily programmable. Quoting the site:
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
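For example, a minimal sketch; the URL is a placeholder, and note that Beautiful Soup itself doesn't run JavaScript, so for a JS-heavy site you would feed it HTML saved from one of the browser-automation approaches above.
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = urllib.request.urlopen("http://example.com/report").read()
soup = BeautifulSoup(html, "html.parser")

for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
        print(cells)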
I would recommend Yahoo Pipes; that's exactly what it was built to do. You can then get the Yahoo Pipes output as an RSS feed and do what you want with it.
If you are familiar with Java (or perhaps another language that runs on a JVM, such as JRuby, Jython, etc.), you can use HtmlUnit. HtmlUnit simulates a complete browser: HTTP requests, creating a DOM for each page, and running JavaScript (using Mozilla's Rhino).
Additionally, you can run XPath queries on documents loaded in the simulated browser, simulate events, etc.
http://htmlunit.sourceforge.net
Give Badboy a try. It's meant to automate the system testing of your websites, but you may find its regular expression rules handy enough to do what you want.
If you have Excel then you should be able to import the data from the webpage into Excel.
From the Data menu select Import External Data and then New Web Query.
Once the data is in Excel then you can either manipulate it within Excel or output it in a format (e.g. CSV) you can use elsewhere.
To complement Whaledawg's suggestion, I was going to suggest using an RSS scraper application (do a Google search), so that you get nice raw XML to consume programmatically instead of a response stream. There may even be a few open-source implementations that would give you more of an idea if you wanted to implement it yourself.
You could use the Perl module LWP together with the JavaScript module. While this may not be the quickest to set up, it should work reliably. I definitely wouldn't make this your first foray into Perl, though.
I recently did some research on this topic. The best resource I found is this Wikipedia article, which gives links to many screen scraping engines.
I needed something that I could use as a server and run in batch, and from my initial investigation I think Web-Harvest is quite good as an open-source solution. I have also been impressed by Screen-Scraper, which seems to be very feature-rich and can be used with different languages.
There is also a new project called Scrapy; I haven't checked it out yet, but it's a Python framework.
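For what it's worth, a minimal Scrapy spider looks something like this; the start URL and the output fields are placeholders.
import scrapy

class ReportSpider(scrapy.Spider):
    name = "report"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # yield one item per link found on the page
        for href in response.css("a::attr(href)").getall():
            yield {"page": response.url, "link": response.urljoin(href)}
Saved as report_spider.py, it can be run with "scrapy runspider report_spider.py -o links.csv" to dump the results to CSV.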