Extract Address Information from a Web Page - vb.net

I need to take a web page and extract the address information from the page. Some are easier than others. I'm looking for a Firefox plugin, Windows app, or VB.NET code that will help me get this done.
Ideally I would like to have a web page on our admin (ASP.NET/VB.NET) where you enter a URL and it scrapes the page and returns a DataSet that I can put in a grid.

If you know the format of the page (for instance, if they're all like that ashnha.com page) then it's fairly easy to write VB.NET code that does this:
Create a System.Net.WebRequest and read the response into a string. Then create a System.Text.RegularExpressions.Regex and iterate over the collection of Matches between that and the string you just retrieved. For each match, create a new row in a DataTable.
The tough bit is writing the regex, which is a bit of a black art. See regexlib.com for loads of tools, books etc about regexes.
If the HTML format isn't well-defined enough for a regex, then you're probably going to have to rely on some amount of user intervention in order to identify which bits are the addresses...
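The regex step described above can be sketched roughly as follows. This is a minimal illustration in Python rather than VB.NET, and it assumes a hypothetical, well-behaved page where every address looks like "123 Main St, Springfield, IL 62704". The pattern, the function name and the sample HTML are all made up for the example, and the download step is left out so the sketch runs offline.

```python
import re

# Hypothetical pattern: "<number> <street>, <city>, <ST> <zip>".
# Real pages will need a pattern tuned to their own markup.
ADDRESS_RE = re.compile(
    r"(?P<street>\d+\s+[A-Za-z0-9 .]+?),\s*"
    r"(?P<city>[A-Za-z .]+?),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)"
)


def extract_addresses(html):
    """Return one dict (one 'row') per address-like match in the page text."""
    return [match.groupdict() for match in ADDRESS_RE.finditer(html)]


# Stand-in for the string read from the web response.
sample = (
    "<p>123 Main St, Springfield, IL 62704</p>"
    "<p>9 Elm Ave., Dover, DE 19901-1234</p>"
)
rows = extract_addresses(sample)
for row in rows:
    print(row)
```

Each dict plays the role of a DataTable row. In VB.NET the same shape would come from looping over the Regex matches and reading the named groups.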

What type of address information are you referring to?
There are a couple of Firefox plugins, Operator and Tails, that allow you to extract and view microformats from web pages.

Aza Raskin has talked about recognising when selected text is an address in his Firefox Proposal: A Better New Tab Screen. No code yet, but I mention it as there may be code in Firefox to do this in the future.
Alternatively, you could look at using the map command in Ubiquity, although you'd have to select the addresses yourself.

For general HTML screen scraping in VB.NET, check out HTML Agility Pack. Much easier than trying to Regex it (unless you happen to be a Regex ninja already!)
The page you mentioned in your answer would be easy to automate, as the addresses are in a consistent format.
But to allow the users to point to any page, that's a much harder job. The data could be in any format at all. You could write something to dump all the text, guess how it is divided, try to recognise bits like country and state names, telephone numbers etc, and then show your results with an interface that will let the users complete missing sections, move the dividers, and identify the bits you missed or they didn't want.
It's not simple though, and making an interface that provides a big advantage over simply cutting and pasting into validated form fields would be quite an achievement I think - I'd be interested to know how you get on!
EDIT: Just noticed this other question that might cover quite a bit of what you want to do:
Parse usable Street Address, City, State, Zip from a string

Trying to understand Google Results and meta tags

Note: this does NOT regard ranking, I just want the results to look better overall.
I'm working with a "news site" with a lot of articles, some dynamic, some static.
The developers haven't really given much thought about SEO but now want the Google Results to look a bit prettier - which landed on my table.
In the source code there's a few meta-tags, example:
<meta name="twitter:title" content="content">
<meta name="og:title" content="content">
Running it through Google Structured Data Testing Tool shows what I'd expect but it doesn't look like my search result for that specific link has the correct snippet.
Seems like it doesn't want to pick the og:description content all the time. Sometimes it does, and sometimes it also adds the title again in the snippet.
What I don't get: is Google using og:title for results, or is that only for things like Facebook sharing? Do I simply need the one below, since that is actually missing from the code?
The description itself would be the same as og:description since they contain the same content.
<meta name="description" content="content">
As far as I understand it can be quite tricky to customize these sorts of things but could it really be that hard to have any sort of consistency throughout the results from our page?
There are two things you can do but both come with a caveat.
Google takes anything from your site as a suggestion. There is no way to program it to perform identically in all situations. If Google's algorithm believes there is a better way to present a result - it will ignore any direction you give it and auto-generate a new presentation for your page.
That said, there are two things you can do:
Add meta tags with the exact text you'd like to appear on the SERP. The page title may or may not be appended with your brand/company name. If it already contains the company/brand name, Google is more likely to leave it where it is.
Google takes text from the page based on what it thinks is more important/relevant to the search. For News, using either HTML5 elements (nav, article, aside) or labelling your divs with a class using those key words will help Google understand what the real content is. Asides are less likely to be used while Articles will be focused upon.
I would also recommend having authors write their own custom descriptions and insert them with your CMS. They're likely much better at constructing a good summary than Google or an auto-summary script. Google will experiment with alt descriptions occasionally but once something solidifies itself as popular in terms of click rate, it'll stick.
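Before relying on Google to pick up a description, it can help to check which meta tags a page actually ships. The following is a small sketch using only Python's standard library. The class and function names and the sample markup are invented for illustration, and the sample deliberately mirrors the situation in the question, where the plain description tag is missing.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags keyed by their name (or property) attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        if key and "content" in attributes:
            self.meta[key] = attributes["content"]


def collect_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta


# Hypothetical page head: social tags are present, the plain description is not.
sample = (
    '<head><meta name="og:title" content="Story">'
    '<meta name="twitter:title" content="Story"></head>'
)
found = collect_meta(sample)
missing = [name for name in ("description",) if name not in found]
print("present:", sorted(found))
print("missing:", missing)
```

Running a check like this over a handful of article URLs shows quickly whether the inconsistency comes from missing tags or from Google choosing its own snippet.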

PDF file search then display that page only

I create a PDF file with 20,000 pages. Send it to a printer and individual pages are printed and mailed. These are tax bills to homeowners.
I would like to place the PDF file on my web server.
When a customer inputs a unique bill number on a search page, a search for that specific page is started.
When the page within the PDF file is located, only that page is displayed to the requester.
There are other issues with security, uniqueness of bill number to search that can be worked out.
The main question is... 1: Can this be done? 2: Is there a third-party program that is required?
I am a novice programmer and would like to try and do this myself.
Thank you
It is possible but I would strongly recommend a different route. Instead of one 20,000 page document which might be great for printing, can you instead make 20,000 individual documents and just name them with something unique (bill number or whatever)? PDFs are document presentations and aren't suited for searching or even text information storage. There are no "words" or "paragraphs" and there's even no guarantee that text is written letter after letter. "Hello World" could be written "Wo", "He", "llo", "rld". Your customer's number might be "H1234567" but be written "1234567", "H". Text might be "in-page" but it also might be in form fields which adds to the complexity. There are many PDF libraries out there that try to solve these problems but if you can avoid them in the first place your life will be much easier.
If you can't re-make the main document then I would suggest a compromise. Take some time now and use a library like iText (Java) or iTextSharp (.Net) to split the giant document into smaller documents arbitrarily named. Then try to write your text extraction logic using the same libraries to find your uniqueifiers in the documents and rename each document accordingly. This is really the only way that you can prove that your logic worked on every possible scenario.
Also, be careful with your uniqueifiers. If you have accounts like "H1234" and "H12345" you need to make sure that your search algorithm is aware that one is contained in the other, so a plain substring search for "H1234" will also match "H12345".
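One way to guard against that overlap is to require that the bill number is not glued to another letter or digit on either side. The sketch below is only an illustration: the function name is made up, and it assumes the text has already been extracted from the PDF page as a normal string, which (as noted above) is not guaranteed.

```python
import re


def page_has_bill(bill_number, page_text):
    """True only if bill_number appears as a whole token in page_text."""
    pattern = (
        r"(?<![A-Za-z0-9])"      # no letter/digit directly before
        + re.escape(bill_number)
        + r"(?![A-Za-z0-9])"     # no letter/digit directly after
    )
    return re.search(pattern, page_text) is not None


text = "Tax bill H12345 is due on receipt."
print("H1234" in text)                 # naive substring test: True (wrong page)
print(page_has_bill("H1234", text))    # whole-token test: False
print(page_has_bill("H12345", text))   # whole-token test: True
```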
Finally, and this depends on how sensitive your client's data is, but if you're transporting very sensitive material I'd really suggest you spot-check every single document. Sucks, I know, I've had to do it. I'd get a copy of Ghostscript and convert all of the PDFs to images and then just run them through a program that can show me the document and the file name all at once. Google Picasa works nice for this. You could also write a Photoshop action that cropped the document to a specific region and then just use Windows Explorer.

Business Applications: What are the fundamental features of a search form?

In a typical business application it is quite common to have forms that are used for searching.
Some basic features are:
A pane that contains the search criteria
A grid to display the results
Sorting on the grid
A detail page that opens when an item is selected in the results grid
What other features would you expect in a business application's search functionality?
Maybe it's a bit trite but there is some sense in this picture:
removed dead ImageShack link
Do it as shown in the second example, not as in the third one.
There is a well-known extreme programming principle: YAGNI. I think it's applicable to almost any problem. You can always add something new if it's necessary, but it's much more difficult to remove something that already exists, because someone already uses it even if it's wrong.
How about the ability to save search criteria, in order to easily re-run a search later. Or, the ability to easily, cleanly, print the list of results.
If search refining is allowed (given a search result, limiting future searches to the current results), you may also want to add a breadcrumb system, so that the user can see the sequence of refinements that led to the current result-set -- and by clicking on a breadcrumb, return to a previous refinement stage.
Faceted search:
(source: msdn.com)
This is displayed in the area in the right ellipse. There are filters and the engine shows the number of results that will remain after applying the filter. This is very useful and can be done without pain in some search engines, such as Apache Solr. Of course, implement this only if filters make sense in your task.
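The counts shown next to each filter are just a tally of the current result set by field value. As a rough sketch of the idea, not of how Solr implements it, with made-up field names and data:

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results would remain for each value of one field."""
    return Counter(row[field] for row in results)


# Hypothetical result set from a search.
results = [
    {"title": "Invoice 1", "city": "New York", "status": "open"},
    {"title": "Invoice 2", "city": "Boston", "status": "open"},
    {"title": "Invoice 3", "city": "New York", "status": "paid"},
]

city_facets = facet_counts(results, "city")
status_facets = facet_counts(results, "status")
print(city_facets)    # tallies per city
print(status_facets)  # tallies per status
```

A dedicated search engine computes these tallies inside the index, which is why it stays cheap there even for large result sets.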
Aggregate summary info, like total(s), count(s) or percentages.
One or more menus, like right click context for the grid, a ribbon or menu on top.
Your list of UI elements is kinda good. Export, print (ask them whether it is really necessary to print this), and category/tag and language selection are worth considering. So is smart, working pagination (don't forget ordering).
Please do not force a search to open in a new window (or even worse, always in the same window). Links to search results should be copy-pastable (always use GET).
But what really matters is having a functional (i.e. really good) algorithm. Mostly I google company websites, because their search engine is, cough, awwwwkward. Someone looking for a feature chart, technical spec, pricing etc. is not interested in press releases, and vice versa.
Search engine providers offer integration into company websites.
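The point about copy-pastable links comes down to putting every search criterion in the query string of a GET request. A minimal sketch with Python's standard library follows. The base URL and parameter names are placeholders, and sorting the parameters is just one possible choice to make equal searches produce equal links.

```python
from urllib.parse import urlencode


def search_url(base, criteria):
    """Build a shareable GET URL from the non-empty search criteria."""
    pairs = sorted(
        (key, value) for key, value in criteria.items() if value not in ("", None)
    )
    query = urlencode(pairs)
    return f"{base}?{query}" if query else base


url = search_url(
    "https://example.com/search",
    {"q": "tax bill", "page": 2, "city": ""},
)
print(url)
```

Because the whole state of the search lives in the URL, the link can be bookmarked, mailed to a colleague, or opened in a new tab by the user's own choice.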
Use Auto-complete wherever possible on your text input fields.
If using selects or combo boxes with related information try and use chain selects to organise the information.
Where results depend on location try and serve relevant results.
Also remember to keep the search form as simple as possible even down to one text field. To refine the search you can have an alternate form as an "Advanced Search interface".
Printing, export.
A grid to display the results
Watch out not to display results a user is not authorized to see (roles / permissions / access rights).
A detail page that opens when an item is selected in the results grid
In case a user attempts to circumvent the search page links and open some document directly, again, check permissions.
Validation, validation, validation.
It should be very hard, near impossible, for me to run a query that makes no sense, e.g., a start date occurring after an end date.
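A check of that kind can be tiny. The sketch below uses invented names and treats either date as optional; it returns a list of messages rather than raising, so that a form could display every problem at once.

```python
from datetime import date


def validate_date_range(start, end):
    """Return a list of validation messages; an empty list means the range is fine."""
    errors = []
    if start is not None and end is not None and start > end:
        errors.append("Start date must not be after end date.")
    return errors


print(validate_date_range(date(2024, 5, 1), date(2024, 4, 1)))  # one message
print(validate_date_range(date(2024, 4, 1), date(2024, 5, 1)))  # no messages
```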
Export a numerical dataset (even if it only has one numeric column, so just make it the default) to CSV for import into Excel. People love this function, even if only 1% of users seem to use it with any regularity. Just ask yourself when's the last time you highlighted something for copy-n-paste. Would it have been easier to open a CSV?
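For the export itself, most languages have a CSV writer that handles quoting, which matters as soon as a value contains a comma. A small sketch with Python's csv module, using hypothetical column names and rows:

```python
import csv
import io


def to_csv(rows, columns):
    """Serialise result rows (dicts) to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"id": 1, "customer": "Smith, J", "amount": 10.5},
    {"id": 2, "customer": "Lee", "amount": 7},
]
csv_text = to_csv(rows, ["id", "customer", "amount"])
print(csv_text)
```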
Refinable searches (think Google's use of site: -). People who use the search utility a lot will appreciate this. People who don't won't know it's not there.
The ability to choose to display 1 records, 5 records, 100 records, 1000 records, etc. "Paging" I believe is what we most commonly call it ;).
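Paging with a user-selectable page size is mostly slice arithmetic. The following sketch works on an in-memory list with made-up names. A real application would more likely push the offset and limit into the database query.

```python
def paginate(items, page, page_size):
    """Return (items on the requested page, total number of pages)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    total_pages = max(1, -(-len(items) // page_size))  # ceiling division
    start = (page - 1) * page_size
    return items[start:start + page_size], total_pages


records = list(range(1, 12))  # 11 hypothetical records
page_items, total = paginate(records, page=3, page_size=5)
print(page_items, total)
```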
You mentioned sortable grids. Somebody else mentioned auto-sum or auto-count. Those are good if (once again) you have largely numeric data. But those are almost report-oriented functions.
Hope this helps.
One thing you can do is have a drop-down of the most common searches in plain English, e.g. "High value sales in New York in last 5 days". This is the equivalent of the user selecting an amount, the city, date ranges etc., done conveniently for them.
Another thing is to have multiple search criteria tabs based on perspective of the user. Like "sales search", "reporting search", "admin search" etc.
Also consider limiting the number of entries retrieved in the search and allowing users to do narrower searches. This depends on the business needs, however.
The most commonly used search option listed first and in a prominent location.
I think your requirements are good. Take a cue from Google. Google got it right. One text box where you type whatever you want, and your engine spits out the answers. Most folks will try this, and if the answers are good enough, then that is what they will use. In the back-end, you'll probably want to flatten all of the data into a big honkin' table and then index it or use a SQL query with "LIKE" in it.
However, you will probably want to allow the user to refine the search. For this, have a link to "Advanced Search" and use a form there to specify filter criteria. This lets the user zero in on the results if basic search is not good enough. For the results on this page, you will certainly want to have sorting on key fields, but do it after you have produced the initial result set.
It depends on the content that you are searching for... make it relevant :) Search always looks easy but can be incredibly difficult to get right.
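The "one text box over a flattened table" idea from the answer above can be tried out with SQLite in a few lines. Everything named here (the search_index table, the body column, the sample rows) is hypothetical. The sketch also escapes the LIKE wildcards so that a user typing % or _ gets a literal match.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory table holding one flattened text blob per record."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE search_index (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO search_index (id, body) VALUES (?, ?)", rows)
    return conn


def basic_search(conn, text):
    """Substring search; % and _ in the user's text are matched literally."""
    escaped = text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    cursor = conn.execute(
        "SELECT id, body FROM search_index "
        "WHERE body LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),
    )
    return cursor.fetchall()


conn = build_index([
    (1, "Acme invoice New York"),
    (2, "Bolt order Boston"),
    (3, "acme refund 100%"),
])
print(basic_search(conn, "acme"))
print(basic_search(conn, "100%"))
```

A LIKE scan with a leading wildcard cannot use an ordinary index, so this approach suits small tables. Larger data sets usually call for a full-text index instead.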
Not mentioned yet, but very important I think - a search that actually works. This item is often neglected and makes the rest a bit moot.

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search ("window.window->window", but it doesn't seem to return any relevant results for this request).
There are similar tools all over the internet, like codase or koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you so I think they're worth mentioning.
edit: It is very unlikely you'll find a general purpose search engine which will allow you to search for something like "window.window->window" because most search engines will do some processing on the document before storing it. For instance they might represent it internally as vectors of words (a vector space model) and use that to do the search, not the actual original string. And creating such a vector involves first cutting the document according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about. My bad memory did a pretty good job since I studied it at school!
BTW they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its friends are doing but can give you a hint about what happens to your query.
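To make the point concrete, here is a toy version of that pipeline: a tokenizer that throws punctuation away and a textbook tf-idf weight. It is a classroom-style sketch with invented documents, not a description of what any real search engine does internally.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(term, doc_tokens, corpus_tokens):
    """Term frequency in one document times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    idf = math.log(len(corpus_tokens) / df) if df else 0.0
    return tf * idf


corpus = [
    tokenize("window.window->window"),
    tokenize("open the window"),
    tokenize("plugin structure"),
]

# After tokenizing, the code snippet is indistinguishable from three plain words.
print(tokenize("window.window->window"))
print(tf_idf("window", corpus[0], corpus))
```

Once the index only stores tokens like these, the `.` and `->` in the query have nothing left to match against, which is why the punctuation appears to be ignored.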
There is no way to do that by itself in the main Google engine, as you discovered. However, if you are looking for information about Mozilla then the best bet would be to structure your query something more like this:
"window.window->window" +Mozilla OR +XUL + another search string related to what you are trying to do.
SymbolHound is a web search that does not remove punctuation from the queries. There is an option to search source code repositories (like the now-discontinued Google Code Search), but it also has the option to search the Internet for special characters (primarily programming-related sites such as StackOverflow).
Try it here: http://www.symbolhound.com
-Tom (co-founder)

Does apparent filename affect SEO?

If I name my HTML file "Banks.html" located at www.example.com/Banks.html, but all the content is about Cats and all my other SEO tags are about Cats on the page, will it affect my page's SEO?
Can you name your files whatever you want, as long as you have the page title, description, and the rest of the SEO done properly?
Page names are often not very representative of the page content (I've seen pages named 7d57As09). Therefore search engines are not going to be particularly upset if the page names appear misleading. However, it's likely that the page name is one of many factors a search engine considers.
If there's no disadvantage in naming a page about cats, "cats.html", then do so! If it doesn't help your SEO, it will make it easier for your visitors!
If you want to rank better when someone searches for 'banks', then yes, it can help you. But unless you are creating pages about cats in banks I'm sure that this won't help you very much :)
It shouldn't affect your search engine ranking, but it may influence people who, having completed a search on Google (or some of the other great search engines, like um...uh...), are now scanning the results to decide where to click first. Someone faced with a URL like www.dummy.com/banks.html would be more likely to click than someone faced with www.dummy.com/default.php?p_id=1&sessid=876492u942fgspw24z because most people haven't a clue what the last part means. It's also more memorable and gives people greater faith in getting back to the same site if you write your URLs nicely. No one that isn't Dustin Hoffman can remember the second URL without a little intense memory training, while everyone can remember banks.html. Just make sure your URL generation is consistent and your rewriting is solid, so you don't end up with loads of page not found errors, which can hurt search engine ranking.
Ideally, your page name should be relevant to the content of the page - so your ranking may improve if you call the page "cats.html", as that is effectively another occurrence of the keyword in the page.
Generally, this is fairly minor compared to the benefits of decent keywords, titles, etc on the page. For more information take a look at articles around Url Strategy, for example:
"I’ve heard that search engines give some weighting to pages which contain keywords users are searching for which are contained within the page URL?"
Naming your pages something meaningful is a good idea and does improve SEO. It's another hint to the search engines what the page is about, in addition to the title and content. You would be surprised if you opened a file on your computer called "Letter to Grandma.doc" and it was actually your tax return!
In general, the best URLs are those that simply give a page name and hierarchical structure, without extensions or ID numbers. Keep it lowercase and separate words with dashes, like this:
example.com/my-cats
or
example.com/cats/mittens
In your case you will probably wanna keep the .html extension to avoid complexities with URL rewriting.
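The "lowercase, words separated by dashes" convention described above is easy to automate when page names come from titles. A minimal sketch follows; the function name is made up, and dropping non-ASCII characters is a simplifying assumption that will not suit every language.

```python
import re
import unicodedata


def slugify(title):
    """Turn a page title into a lowercase, dash-separated URL segment."""
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


print(slugify("My Cats"))
print(slugify("  Mittens & Friends! "))
```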
Under some circumstances this can be considered a black-hat SEO technique. Watch out not to be caught or reported by curious users.
Google's PageRank algo has hundreds, thousands or even millions of variables and factors. From this point of view, you can be sure that the name of the files that you use on your website will affect your pagerank and/or your keyword targeting. Think about it.
There are few on-page elements that have significance. The URL, while it can be /234989782 is going to be more beneficial if it's named relevantly.
From any point of view, Google and all search engines like to see coherence between everything: if you have a page named XYZ, then Google will like it better if the text, meta, images, URL, documents, etc. on the page have XYZ in them. The greater this synchronisation between the different elements on a page, the more the search engine sees how focused the content of that page is, resulting in more hits for you when someone looks up that focused search term.
If you have an image for example, you're better off having the same:
caption
description
name
alt text
(WordPress users will recognize that these are the four parameters that can be set for images on WordPress).
The same goes for all files you have on your website. Any parameter that can be seen by a search engine is better off optimized in regard to the content that goes with it, in sync with all the other parameters of that same thing.
The question of how useful this all is arises afterwards. Will I really rank lower if my alt text is different than the name of my image? Probably not by a lot. But either way, taking advantage of small subtleties like these can take you a long way in SEO. There are so many things we can't control in SEO (or that we shouldn't be able to control, like backlinks), that we have to use what we can control in the best way possible, to compensate.
It's also hard to tell if it is all useful after the Google Panda and Penguin. It definitely has less of an impact ever since those reforms (back then, this kind of thing was crucial), the question is simply how much of an impact it still has. But all in all, as I said, whenever possible, name your files according to your content.
Today's algorithms are totally different from when SEO was introduced. SEO today is about content and its quality. It must attract readers and followers, so filenames and descriptions are no longer important.
The page name doesn't affect much in terms of SEO, but it is one of Google's 200 SEO signals.
Giving a URL a descriptive name will surely reduce your bounce rate a little, because otherwise a user who comes to your site through organic search results doesn't know what the page contains.
Even search engines love it when a page name is relevant to the topic of the page.