Apache Nutch, relevant crawling

I am crawling websites using Apache Nutch 2.2.1, which provides me with content to index in Solr. When Nutch fetches a page, the content includes contextual information such as "contact us", "legal notice" or other irrelevant text (generally coming from the top menu, left menu or footer of the page) that I do not need to index.
One solution would be to automatically select the most relevant part of the content to index, which could be done by an automatic summarizer. There is a plugin "summary-basic"; is it used for this purpose? If so, how is it configured? Other solutions are also welcome.

In regex-urlfilter.txt you can specify the list of URLs you want to ignore. You can add the links for "contact us" pages (and typically all the header and footer links that you don't want to crawl) to that regex list. While crawling, Nutch will ignore those URLs and only fetch the required content. You can find regex-urlfilter.txt in the apache-nutch-2.2.1/conf folder.
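To make that concrete, here is a minimal regex-urlfilter.txt sketch; the domain and the contact/legal paths are assumptions and would need to match your site's actual URLs:
# skip boilerplate pages such as contact or legal-notice URLs (paths are assumptions)
-^https?://www\.example\.com/contact-us
-^https?://www\.example\.com/legal-notice
# accept everything else
+.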

Related

Extracting specific tag values with Apache Nutch

I'm trying to fetch a list of several URLs and parse their title, keywords and description (and ignore all the rest) using Apache Nutch.
After that I just want to save, for each URL, the title, keywords and description content (preferably without the tags themselves), without any indexing.
I looked at several examples of how to do this. Just a few examples of what I encountered:
How to parse content located in specific HTML tags using nutch plugin?
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
However, they all propose complicated (at least to a Nutch newbie) plugin configurations and settings.
Since my use case sounds like a very common one, I was wondering if there is a simpler solution?
Thanks
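Just to make the target output concrete (this is not Nutch-specific): a minimal sketch of pulling the title, keywords and description for a list of URLs, assuming plain Python with requests and BeautifulSoup; the URL list is a placeholder.
# Illustrative sketch, not a Nutch plugin: fetch each URL and pull title/keywords/description.
import requests
from bs4 import BeautifulSoup

def meta_content(soup, name):
    # Return the content attribute of <meta name="..."> if present, else "".
    tag = soup.find("meta", attrs={"name": name})
    return tag.get("content", "") if tag else ""

urls = ["http://www.example.com/"]  # placeholder list of URLs

for url in urls:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    print(url, title, meta_content(soup, "keywords"), meta_content(soup, "description"))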

Google Custom Search - Add/remove sites to search dynamically

Google Custom Search has a feature to specify the sites you want the search engine to search - the "Sites to search" feature.
I have a requirement to add/remove these sites on the fly. Is there any API or any other way provided by Google with which I can achieve this?
Here you can find the relevant information:
https://developers.google.com/custom-search/docs/tutorial/creatingcse
To create a custom search engine:
Sign into Control Panel using your Google Account (get an account if you don't have one).
In the Sites to search section, add the pages you want to include in your search engine. You can include any sites you want, not just the sites you own. You can include whole site URLs or individual page URLs. You can also use URL patterns.
https://support.google.com/customsearch/answer/71826?hl=en
URL patterns
URL patterns are used to specify what pages you want included in your custom search engine. When you use the control panel or the Google Marker to add sites, you're generating URL patterns. Most URL patterns are very simple and simply specify a whole site. However, by using more advanced patterns, you can more precisely pick out portions of sites.
For example, the pattern 'www.foo.com/bar' will only match the single page 'www.foo.com/bar'. To cover all the pages where the URL starts with 'www.foo.com/bar', you must explicitly add a '*' at the end. In the form-based interfaces for adding sites, 'foo.com' defaults to '*.foo.com/*'. If this is not what you want, you can change it back in the control panel. No such defaulting occurs for patterns that you upload. Also note that URLs are case sensitive - if your site URLs include capital letters, you'll need to make sure your patterns do as well.
In addition, the use of wildcards in URL patterns allows you to include or exclude multiple pages or portions of a site all at once.
So basically you have to navigate to the "Sites to search" section and enter the needed sites there. If you want to change these sites on the fly, you have to manipulate your URL patterns.
There's also an option to use the XML configuration files. You just have to add (or remove) your sites there:
https://developers.google.com/custom-search/docs/annotations
Annotations: The annotations XML file lists the webpages or websites you want your search engine to cover, and indicates any preferences you have about how these sites should be ranked in your search results. Each site and its associated information is called an annotation. More information about the annotations XML file.
An example for an annotation:
<Annotation about="http://www.solarenergy.org/*">
<Label name="_cse_abcdefghijk"/>
</Annotation>
Using the API, we can add the filters "siteSearch"=>"somedomain.com somedomain2.com", "siteSearchFilter"=>"e", but there will be spacing between the separate domains.
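For reference, a minimal sketch of the siteSearch/siteSearchFilter parameters against the Custom Search JSON API, assuming Python with requests; the API key, engine ID, query and domain are placeholders:
# Sketch: query the Custom Search JSON API, excluding results from one domain.
import requests

params = {
    "key": "YOUR_API_KEY",          # placeholder
    "cx": "YOUR_SEARCH_ENGINE_ID",  # placeholder
    "q": "solar energy",
    "siteSearch": "somedomain.com",
    "siteSearchFilter": "e",        # "e" = exclude, "i" = include
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in resp.json().get("items", []):
    print(item["link"])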

noindex follow in Robots.txt

I have a WordPress website which has been indexed by search engines.
I have edited robots.txt to disallow certain directories and webpages from the search index.
I only know how to use Allow and Disallow, but don't know how to use follow and nofollow in the robots.txt file.
I read somewhere while Googling that I can have webpages that won't be indexed by Google but will still be crawled for PageRank. This can be achieved by disallowing the webpages in robots.txt and using follow for those webpages.
Please let me know how to use follow and nofollow in the robots.txt file.
Thanks
Sumit
a.) The follow/nofollow and index/noindex rules are not for robots.txt (which sets general site rules) but for an on-page meta robots tag (which sets the rules for that specific page).
More info about Meta-Robots
b.) Google won't crawl the disallowed pages, but it can still index them in the SERPs (using info from inbound links or web directories like DMOZ).
Having said that, there is no PR value you can gain from this.
More info about Googlebot's indexing behavior
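For completeness, the on-page tag referred to in a.) looks like this; noindex, follow is just the combination the question asks about:
<meta name="robots" content="noindex, follow">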
Google actually does recognize the Noindex: directive inside robots.txt. Here's Matt Cutts talking about it: http://www.mattcutts.com/blog/google-noindex-behavior/
If you put "Disallow" in robots.txt for a page already in Google's index, you will usually find that the page stays in the index, like a ghost, stripped of its keywords. I suppose this is because they know they won't be crawling it, and they don't want the index containing bit-rot. So they replace the page description with "A description for this result is not available because of this site's robots.txt – learn more."
So, the problem remains: How do we remove that link from Google since "Disallow" didn't work? Typically, you'd want to use meta robots noindex on the page in question because Google will actually remove the page from the index if it sees this update, but with that Disallow directive in your robots file, they'll never know about it.
So you could remove that page's Disallow rule from robots.txt and add a meta robots noindex tag to the page's header, but now you've got to wait for Google to go back and look at a page you told them to forget about.
You could create a new link to it from your homepage in hopes that Google will get the hint, or you could avoid the whole thing by just adding that Noindex rule directly to the robots.txt file. In the post above, Matt says that this will result in the removal of the link.
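A sketch of what that robots.txt entry would look like, using the unofficial Noindex: syntax discussed in the linked post; the path is a placeholder, and support for this directive is not guaranteed:
User-agent: *
Noindex: /page-to-remove/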
No, you can't.
You can set which directories you want to block and for which bots, but you can't set nofollow via robots.txt.
Use the robots meta tag on the pages to set nofollow.

Will limiting dynamic URLs with robots.txt improve my SEO ranking?

My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering URLs with all possible permutations of additional parameters such as tags, search phrases, versions, dates, etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google's Webmaster Tools, Google has spidered only about 150 of the 200 entries in the XML sitemap. It looks as if Google has not yet seen all of the content, years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spider those dynamic URLs. In addition, I plan to disable some URL parameters in the Webmaster Tools "website configuration" --> "URL parameters" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
This is exactly what canonical URLs are for. If one page (e.g. an article) can be reached by more than one URL, then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in its search results.
So do not block any of your articles, and you don't need to enter any parameters either. Just use canonical URLs and you'll be fine.
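For reference, the canonical link element goes in the head of each variant of the page; the URL here is a placeholder:
<link rel="canonical" href="http://www.example.com/articles/my-article">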
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have the search results pages include a robots meta tag saying noindex (see the sketch below). This will help get them removed from Google's index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
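A minimal sketch of that tag in the head of the search results template; adding follow so that links on the results pages are still followed is an assumption, not something the answer specifies:
<!-- in the <head> of the search results template only -->
<meta name="robots" content="noindex, follow">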
Other measures:
In GWMT, tell Google to ignore all those search parameters. This is just a band-aid, but it may help speed up the recovery.
Don't block the search pages in the robots.txt file, as this would block the robots from crawling and cleanly removing those pages that are already indexed. Wait until your index is clear before doing a full block like that.
Your search system is presumably based on links (a tags) or GET-based forms rather than POST-based forms; that is why those pages got indexed. Switching them to POST-based forms should stop robots from trying to index those pages in the first place (a sketch follows below). JavaScript or AJAX is another way to do it.
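To illustrate that last point, a GET form produces crawlable URLs such as /search?q=widgets while a POST form does not; a minimal HTML sketch (the action and field name are assumptions):
<!-- GET: every submission yields a distinct, crawlable URL like /search?q=widgets -->
<form action="/search" method="get">
  <input type="text" name="q">
</form>
<!-- POST: submissions do not produce distinct crawlable URLs -->
<form action="/search" method="post">
  <input type="text" name="q">
</form>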

Is there a way to prevent Googlebot from indexing certain parts of a page?

Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact details such as phone numbers, who want them visible on the site but would rather they not be Google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for can't really be done; Google either takes the entire page or none of it.
You could do some sneaky tricks, though, like inserting the part of the page you don't want indexed in an iframe and using robots.txt to block the iframe's source URL from being crawled.
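A sketch of that trick; the fragment path is a placeholder:
<!-- main page: the part you don't want indexed lives in a separate document -->
<iframe src="/fragments/contact-details.html"></iframe>
and in robots.txt:
User-agent: *
Disallow: /fragments/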
In short NO - unless you use cloaking, which is discouraged by Google.
Please check out the official documentation here:
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
I found a useful resource for including certain duplicate content on a page while not allowing search engines to index that content.
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
On your server, detect the search bot by IP using PHP or ASP. Then serve the IP addresses that fall into that list a version of the page you wish to be indexed. In that search-engine-friendly version of your page, use the canonical link tag to specify to the search engine the page version that you do not want to be indexed.
This way, the page with the content that you do not want indexed will be indexed by its address only, while only the content you wish to be indexed will actually be indexed. This method will not get you blocked by the search engines and is completely safe.
Yes, you can definitely stop Google from indexing some parts of your website by creating a custom robots.txt and specifying which portions you don't want indexed, such as wp-admin, or a particular post or page. You can do that easily by creating this robots.txt file; before creating it, check your site's existing robots.txt, for example www.yoursite.com/robots.txt.
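A minimal robots.txt along those lines; the paths are placeholders, and note the caveat discussed earlier that Disallow prevents crawling rather than guaranteeing removal from the index:
User-agent: *
Disallow: /wp-admin/
Disallow: /some-private-page/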
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta-tags for bots, and there's also the robots.txt, with which you can restrict access to certain directories.