Not sure if this is even possible but wanted to give it a shot.
Is it possible to add PDFs and other files to Kentico Media Library folder that wouldn’t be searchable through Google or another search engine? It also should not be searchable through Kentico's Smart Search.
Users should be able to access it ONLY in case they know the full URL.
I know I can add the path to robots.txt to disable indexing, but is there are more foolproof way?
Thanks.
By default, any files in the Media Library are not searchable by Kentico's smart search index (you need to add files to the Content Tree to be able to index them, or create a custom indexer yourself).
The robots.txt is the way to go, search engines honor it as long as it's set properly.
If you want to take another step, you would have to modify the Response the server gives for those files and include the headers
X-Robots-Tag: noindex
there are more tags to look at here.
You can modify the response tags through the URL rewrite engine in IIS.
Related
Google Custom Search has a feature to specify the sites you want the search engine to search - "Sites to search" feature.
I have a requirement to add/remove these sites on the fly. Is there any api or any other way provided by Google with which I can achieve this?
Here you can find the relevant information:
https://developers.google.com/custom-search/docs/tutorial/creatingcse
To create a custom search engine:
Sign into Control Panel using your Google Account (get an account if you don't have one).
In the Sites to search section, add the pages you want to include in your search engine. You can include any sites you want, not just
the sites you own. You can include whole site URLs or individual pages
URLs. You can also use URL patterns.
https://support.google.com/customsearch/answer/71826?hl=en
URL patterns
URL patterns are used to specify what pages you want included in your
custom search engine. When you use the control panel or the Google
Marker to add sites, you're generating URL patterns. Most URL patterns
are very simple and simply specify a whole site. However, by using
more advanced patterns, you can more precisely pick out portions of
sites.
For example, the pattern 'www.foo.com/bar' will only match the single
page 'www.foo.com/bar'. To cover all the pages where the URL starts
with ' www.foo.com/bar', you must explicitly add a '' at the end. In
the form-based interfaces for adding sites, 'foo.com' defaults to
'.foo.com/*'. If this is not what you want, you can change it back in
the control panel. No such defaulting occurs for patterns that you
upload. Also note that URLs are case sensitive - if your site URLs
include capital letters, you'll need to make sure your patterns do as
well.
In addition, the use of wildcards in URL patterns allows you to
include or exclude multiple pages or portions of a site all at once.
So, basically you've to navigate to the "Sites to search section" and enter the needed sites there. If you want to change these site on the fly, you've to manipulate your URL pattern.
There's also an option to use the XML configuration files. You just have to add (or remove) your sites there:
https://developers.google.com/custom-search/docs/annotations
Annotations: The annotations XML file lists the webpages or websites
you want your search engine to cover, and indicates any preferences
you have about how these sites should be ranked in your search
results. Each site and its associated information is called an
annotation. More information about the annotations XML file.
An example for an annotation:
<Annotation about="http://www.solarenergy.org/*">
<Label name="_cse_abcdefghijk"/>
</Annotation>
Using api we can add filter "siteSearch"=>"somedmain.com somdomain2.com","siteSearchFilter"=>"e" but there will be spacing between seperate domains.
We do a lot of email marketing and sometimes developers will put the html file out on the image server (i know the easy answer is to not do this) but those html files end up getting indexed by Google and eventually rank high on search results. Which in turns makes the SEO company's want us to remove these pages. Is it possible to have google not index anything from our sub domain? we have image.{ourUrl}.com where we put all these files.
Would putting a robot.txt file in the main directory do it? Or would we need to add that robot text file in every directory?
Is there an easy way to blanket this?
A robots.txt file would just stop crawling, files might still be indexed. a noindex directive would work, you could use an x-robots-tag. See here https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
In the effort of building a live site on its actual live hosting platform is there a way to tell google to not YET index the website? I found the following:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93710
But would that tell them to never come back or would they simply see the noindex tag and then not list the results, then when it comes back to crawl again later and my site is good to go I would have the noindex removed and the site would then start getting indexed?
Sounds like you want to use a robots.txt file instead:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&topic=2370588&ctx=topic
Update your robots.txt file when you want your content to be indexed.
You can use the robot.txt method.
You can specify which subpage could be spidered. And google comes back, checking the file before indexing. So you can delete the file later in order to get fully indexed.
More Information
About /robots.txt
Robots.txt File Generator
You can always change it. The way Google and other robots find your page is if it is linked to on another page. As long as it isn't linked to on another page, it won't be found. Also, once your site is up, chances are that it will be far back in the list of sites.
I have a couple of PDF files with services descriptions in the URL:
{service_name}_{year}.pdf
I am just wondering what would be better for search engines - to index the file with a direct link like http://mysite.com/pdf/service_name_details_2010.pdf(opens in new browser tab using target="_blank" attribute on href) or have action http://mysite.com/pdf/servicename/2010 which would ask user to save the file.
So what is the better way for SEO, what would a search engine crawler prefer more?
target=_blank can easily be over-ridden, and is a hasn't been a supported attribute for some time. I'd suggest you read the following: http://www.useit.com/alertbox/open_new_windows.html
With regards to SEO, the search engines will identify the documents for what they are regardless of the file extension: http://googlewebmastercentral.blogspot.com/2011/09/pdfs-in-google-search-results.html
Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact phone etc. details who want them visible on the site but would rather they not be google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for, can't really be done, Google either takes the entire page, or none of it.
You could do some sneaky tricks though like insert the part of the page you don't want indexed in an iFrame and use robots.txt to ask Google not to index that iFrame.
In short NO - unless you use cloaking with is discouraged by Google.
Please check out the official documentation from here
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
Found useful resource for using certain duplicate content and not to allow index by search engine for such content.
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index>
At your server detect the search bot by IP using PHP or ASP. Then feed the IP addresses that fall into that list a version of the page you wish to be indexed. In that search engine friendly version of your page use the canonical link tag to specify to the search engine the page version that you do not want to be indexed.
This way the page with the content that do want to be index will be indexed by address only while the only the content you wish to be indexed will be indexed. This method will not get you blocked by the search engines and is completely safe.
Yes definitely you can stop Google from indexing some parts of your website by creating custom robots.txt and write which portions you don't want to index like wpadmins, or a particular post or page so you can do that easily by creating this robots.txt file .before creating check your site robots.txt for example www.yoursite.com/robots.txt.
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta-tags for bots, and there's also the robots.txt, with which you can restrict access to certain directories.