I have seen sites that hide their robots.txt file.
If you request it directly, for example
http://www.mysite.com/robots.txt
you do not get the file back.
I also want to hide my robots.txt file from visitors. How can I do it?
Is it connected to these lines?
<meta name="ROBOTS" content="NOODP">
<meta name="Slurp" content="NOYDIR">
I do not understand what this code means.
Thank you!
I'm not sure exactly what you're asking, but couldn't you do that with URL rewrites? You might be able to serve the robots.txt file only to requests whose User-Agent string belongs to a crawler (for instance, "Googlebot"), and return a 404 for any non-crawler UA.
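A minimal sketch of that User-Agent check, written in Python only for illustration. The token list, the robots.txt body, and the function names are all assumptions, not something taken from a particular server or framework:

```python
# Hypothetical list of crawler User-Agent substrings; extend as needed.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")

# Placeholder robots.txt body that allows everything.
ROBOTS_BODY = "User-agent: *\nDisallow:\n"


def is_crawler(user_agent):
    """Return True if the User-Agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def robots_response(user_agent):
    """Return (status, body) for a request to /robots.txt."""
    if is_crawler(user_agent):
        return 200, ROBOTS_BODY
    # Everyone else is told the file does not exist.
    return 404, "Not Found\n"
```

Keep in mind that any visitor can send a crawler's User-Agent string, so this only hides the file from casual requests.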
I have an e-commerce site with hundreds of products. I recently changed the permalinks and their base. Using WordPress with the WooCommerce plugin, I removed /shop/%product-category% from the URL. However, my old URLs are still active. Check out the following example:
greenenvysupply.com/shop/accessories/gro1-1-3mp-usb-led-digital-microscope-10x-300x/
greenenvysupply.com/gro1-1-3mp-usb-led-digital-microscope-10x-300x/
The first URL is old. Why does it still work? Shouldn't I get a 404 page?
Here is code from page source related to the canonical:
href="https://www.greenenvysupply.com/shop/feed/" />
<link rel='canonical' href='https://www.greenenvysupply.com/gro1-1-3mp-usb-led-digital-microscope-10x-300x/' />
<meta name="description" content="The 1.3 Mega-Pixel USB LED Digital Microscope is great for identifying pests and diseases on your plants so you can accurately resolve the problem."/>
<link rel="canonical" href="https://www.greenenvysupply.com/gro1-1-3mp-usb-led-digital-microscope-10x-300x/" />
Because the old URL is still active and not redirecting, my entire website is being seen as having duplicate content. Google crawlers are not being redirected. Why is the URL with /shop/ in it still active even though I have changed the permalink? There has got to be an easy fix for this.
A canonical URL or other metadata in your response is not the same as a redirect. To accomplish a redirect, your server needs to return a 3xx status code (typically a 301 or 308 for a permanent move as you have here or a 302 or 307 for a temporary move) and return a "Location" header that indicates the URL to which to redirect. How exactly you make your server do this is dependent on the type of server or server framework that you happen to be using for your website.
How to accomplish a redirect is somewhat independent of your implicit SEO question about whether to prefer a redirect over a canonical URL, which I'm afraid I cannot answer. Regardless of the approach you use, though, you should be aware that search engines -- Google or otherwise -- may not reflect the changes from your website immediately, so don't panic if you don't see the desired search engine change you were looking for immediately following a change to your website.
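As a rough, server-independent illustration of the redirect described above, here is a Python sketch using only the standard library. The old permalink shape (/shop/<category>/<product-slug>/) is inferred from the example URLs in the question, and the names `OLD_PRODUCT_PATH`, `redirect_target`, and `RedirectHandler` are made up for this example. On a real WordPress site you would more likely configure this in the web server or through a plugin:

```python
import re
from http.server import BaseHTTPRequestHandler

# Assumed old permalink shape: /shop/<category>/<product-slug>/
OLD_PRODUCT_PATH = re.compile(r"^/shop/[^/]+/(?P<slug>[^/]+)/?$")


def redirect_target(path):
    """Return the new path for an old-style product URL, or None."""
    match = OLD_PRODUCT_PATH.match(path)
    if match is None:
        return None
    return "/" + match.group("slug") + "/"


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer old product URLs with a permanent redirect."""

    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            self.send_error(404)
            return
        # A permanent move: 3xx status code plus a Location header.
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()
```

To try it locally you could pass `RedirectHandler` to `http.server.HTTPServer` and request one of the old-style paths; the response should carry the 301 status and the new path in its Location header.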
Is there a reliable way to prevent Google from crawling/indexing/caching a page?
I am thinking about creating a product where users could temporarily share information, using temporary URLs.
The information is not very confidential, but I'd definitely like to not see it show up in some cache or even in search results.
What's the most reliable way of doing this, and what are the possible pitfalls?
Make a robots.txt file. See http://www.robotstxt.org/ for information.
You can also use <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> in the <HEAD> of your index.html.
Google also publishes specifications for how it handles robots.txt.
We have clients who build their site on a UserDir URL before their real domain goes live. The UserDir URL is always in the format:
http://1.2.3.4/~johndoe
Sometimes, Google crawls these UserDir URLs and the temporary site will show up in results even after the site is live on http://johndoe.com
So, once a client is live on http://johndoe.com, how can I prevent Google from crawling the UserDir address?
(of course, I need Google to crawl the real domain because SEO is important to our clients)
I use the canonical tag for this purpose. If you put the canonical tag in the index.html file like this:
<link rel="canonical" href="http://johndoe.com/" />
Then when Googlebot finds it at http://1.2.3.4/~johndoe it will know that it is a duplicate of http://johndoe.com/ and Google will index the correct one. Googlebot will see the same tag when it crawls the real site and not have a problem with the self-referential canonical.
We're building a white-label site, which mustn't be indexed by Google.
Does anyone know a tool to check whether Googlebot will index a given URL?
I've put <meta name="robots" content="noindex" /> on all pages, so it shouldn't be indexed - however I'd rather be 110% certain by testing it.
I know I could use robots.txt, but the problem with robots.txt is as follows:
Our main site should be indexed, and it's the same application on IIS (ASP.NET) as the white-label site. The only difference is the URL.
I cannot modify the robots.txt depending on the incoming URL, but I can add a meta tag to all pages from my code-behind.
You should add a Robots.txt to your site.
However, the only perfect way to prevent search engines from indexing a site is to require authentication. (Some spiders ignore Robots.txt)
EDIT: You need to add a handler for Robots.txt that serves different files depending on the Host header.
You'll need to configure IIS to send the Robots.txt request through ASP.Net; the exact instructions depend on the IIS version.
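The host-based selection itself is simple. Here is a sketch of the logic in Python rather than ASP.NET, purely to show the idea; the host names and the function name `robots_for_host` are placeholders, and how the function gets wired into IIS is left out:

```python
# Hypothetical host names; replace with the real main-site hosts.
INDEXABLE_HOSTS = {"www.example.com", "example.com"}

ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def robots_for_host(host_header):
    """Pick the robots.txt body based on the request's Host header."""
    # Drop an optional port and normalize case before comparing.
    host = (host_header or "").split(":")[0].strip().lower()
    if host in INDEXABLE_HOSTS:
        return ALLOW_ALL
    # Any other host, including the white-label one, is blocked.
    return BLOCK_ALL
```

Blocking by default means a newly added white-label host is not crawlable until someone explicitly lists it as indexable.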
Google Webmaster Tools (google.com/webmasters/tools) will, besides letting you upload a sitemap, do a test crawl of your site and tell you what it crawled, how the site rates for certain queries, and what it will and will not crawl.
The test crawl isn't automatically included in Google results. Still, if you're trying to hide sensitive data from the prying eyes of Google, you cannot count on that alone: put some authentication in the line of fire, no matter what.
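For a quick local sanity check of the meta tag mentioned in the question, something like the following Python sketch could be used. It only inspects HTML that you pass in, it says nothing about what Googlebot will actually do, and the names `RobotsMetaParser` and `has_noindex` are invented for this example:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attr_map.get("content", "").split(","):
            self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

You could fetch each page of the white-label site, feed the response body to `has_noindex`, and flag any page for which it returns False.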
Let's say I have a web site for hosting community generated content that targets a very specific set of users. Now, let's say in the interest of fostering a better community I have an off-topic area where community members can post or talk about anything they want, regardless of the site's main theme.
Now, I want most of the content to get indexed by Google. The notable exception is the off-topic content. Each thread has its own page, but all the threads are listed in the same folder, so I can't just exclude search engines from a folder somewhere. It has to be per-page. A traditional robots.txt file would get huge, so how else could I accomplish this?
This will work for all well-behaved search engines. Just add it to the <head>:
<meta name="robots" content="noindex, nofollow" />
If using Apache, I'd use mod_rewrite to alias robots.txt to a script that could dynamically generate the necessary content.
Edit: If using IIS, you could use ISAPI_Rewrite to do the same.
You can implement it by replacing robots.txt with a dynamic script that generates the output.
With Apache, you could write a simple .htaccess rule to achieve that (mod_rewrite must be enabled):
RewriteEngine On
RewriteRule ^robots\.txt$ /robots.php [NC,L]
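What the generating script has to produce is just a list of Disallow lines. Here is a sketch of that step in Python (the rule above points at a PHP script, but the logic is the same); the function name `build_robots_txt` and the example thread paths are assumptions, and in practice the paths would come from wherever the site records which threads are off-topic:

```python
def build_robots_txt(off_topic_paths):
    """Build a robots.txt body that disallows each off-topic thread path."""
    lines = ["User-agent: *"]
    paths = sorted(set(off_topic_paths))
    if not paths:
        # An empty Disallow value means nothing is blocked.
        lines.append("Disallow:")
    for path in paths:
        lines.append("Disallow: " + path)
    return "\n".join(lines) + "\n"
```

For example, `build_robots_txt(["/threads/7", "/threads/42"])` yields one `Disallow:` line per thread under a single `User-agent: *` group, so the file never has to be maintained by hand.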
Similarly to @James Marshall's suggestion, in ASP.NET you could use an HttpHandler to route requests for robots.txt to a script that generates the content.
Just for that thread, make sure your <head> contains a noindex meta tag. That's another way to tell search engines not to index your page, besides blocking it in robots.txt.
Just keep in mind that a robots.txt disallow will NOT prevent Google from indexing pages that have links from external sites; all it does is prevent crawling internally. See http://www.webmasterworld.com/google/4490125.htm or http://www.stonetemple.com/articles/interview-matt-cutts.shtml.
You can stop search engines from reading or indexing your content by using robots meta tags. Spiders that honor these instructions will index only the pages you want.
To block dynamic web pages with robots.txt, use rules like these:
User-agent: *
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&