I am using DokuWiki, and since we've tried to secure it as much as possible, the best security for us is to keep its location on our server secret. Therefore we want to make sure no link can be clicked on any page which would reveal the location of our infrastructure. Is there any way to configure this restriction in DokuWiki, or are there known ways to pass URLs through a third party?
Have you tried protecting the site with .htaccess and .htpasswd? That is a good way to keep others from getting into your site.
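As a minimal sketch of what that could look like in an .htaccess file (the realm name and the path to the .htpasswd file are placeholders to adapt to your own server):
AuthType Basic
AuthName "Private wiki"
# Password file created with the htpasswd utility (placeholder path)
AuthUserFile /etc/apache2/.htpasswd
Require valid-user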
And if the site is online, you should include a robots.txt to keep crawlers from indexing it:
User-agent: *
Disallow: /
Hope this helps.
Related
How do I tell crawlers / bots not to index any URL that has /node/ pattern?
The following has been in my robots.txt since day one, but I noticed that Google has still indexed a lot of URLs that have /node/ in them, e.g. www.mywebsite.com/node/123/32:
Disallow: /node/
Is there anything that states: do not index any URL that contains /node/?
Should I write something like the following?
Disallow: /node/*
Update:
The real problem is that despite having:
Disallow: /node/
in robots.txt, Google has indexed pages under this URL, e.g. www.mywebsite.com/node/123/32.
/node/ is not a physical directory; this is how Drupal 6 exposes its content. I guess that is my problem: node is not a directory, merely part of the URLs Drupal generates for the content. How do I handle this? Will this work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any URL whose path starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
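To illustrate the difference, here is a sketch of both forms in robots.txt (the leading-wildcard form relies on the wildcard extension that Google and Bing support, not the original robots.txt standard):
User-agent: *
# Matches only paths that start with /node/, e.g. /node/123/32
Disallow: /node/
# Matches any path that contains /node/, e.g. /foo/node/bar.html
Disallow: */node/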
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
Disallow: /node/* is exactly what you want to do. Search engines support wildcards in their robots.txt notation, and the * character means "any characters". See Google's notes on robots.txt for more.
update
An alternative way to make sure search engines stay out of a directory, and all directories below it, is to block them with the robots HTTP header. This can be done by placing the following in an htaccess file in your node directory:
Header set X-Robots-Tag "noindex"
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've blocked them in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.
When searching for our web site on Google I found three sites with the same content showing up. I always thought we were using only one site, www.foo.com, but it turns out we also have www.foo.net and www.foo.info with the same content as www.foo.com.
I know it is extremely bad to have the same content under different URLs. It seems we have been using three domains for years, and I have not seen any penalty so far. What is going on? Is Google using a new policy like the one this blog advocates: http://www.seodenver.com/duplicate-content-over-multiple-domains-seo-issues/ ? Or is it OK to use a DNS redirect? What should I do? Thanks
If you are managing the websites via Google Webmaster Tools, it is possible to specify the "primary domain".
However, the world of search engines doesn't stop with Google, so your best bet is to send a 301 redirect to your primary domain. For example:
www.foo.net should 301 redirect to www.foo.com
www.foo.net/bar should 301 redirect to www.foo.com/bar
and so on.
This will ensure that www.foo.com gets the entire score, rather than (potentially) a third of the score that you might get for link-backs (internal and external).
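If these domains are served by Apache, a rough sketch of such a redirect in an .htaccess file could look like this (assuming mod_rewrite is available and all three domains point at the same document root; the host names are from the example above):
RewriteEngine On
# Send any host other than www.foo.com to the same path on www.foo.com with a 301
RewriteCond %{HTTP_HOST} !^www\.foo\.com$ [NC]
RewriteRule ^(.*)$ http://www.foo.com/$1 [R=301,L]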
Look into canonical links, as documented by Google.
"If your site has identical or vastly similar content that's accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version."
They explicitly state it will work cross-domain.
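As a sketch, each duplicate page on www.foo.net and www.foo.info would declare its www.foo.com counterpart in its <head> (the /bar path is just the example from above):
<link rel="canonical" href="http://www.foo.com/bar" />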
I'm working on this big website and I want to put it online before it's fully finished...
I'm working locally and the database is getting really big, so I wanted to upload the website and continue working on it on the server, while allowing people in so I can test.
The question is whether this is bad for SEO; I mean, there are a lot of SEO-related things that are incomplete. For example: there are no friendly URLs, no sitemap, no .htaccess file, and lots of 'under construction' sections...
Will Google penalize me forever? How does it work? Does Google index and get the structure of the site just once, or is it constantly updating and checking for changes? Will using User-agent: * Disallow: / in robots.txt fully stop Google from indexing it? Can I change the robots.txt file later and have Google index it again? What do you recommend?
Sure, just put a robots.txt file in your root so that Google doesn't start indexing it.
Like this:
User-agent: *
Disallow: /
This is how I understand this issue:
Google will reach your website if someone submits your URL at http://www.google.com/addurl/ or if there is a link to your website on another already-indexed site.
When Google reaches your website it will look at the robots.txt and see what rules are there; if you disallow indexing using rules like the following, Google will not index your website for the moment.
User-agent: *
Disallow: /
But Google will visit your website again after some days and do the same as the first time; if it doesn't find the robots.txt, or finds rules that allow it to index the website (like the following), it will start indexing the website's pages and content.
User-agent: *
Allow: /
As for putting the website online now or not: if you disallow Google from indexing using robots.txt, there is no difference, so go with whichever is better for you.
Note:
I am not 100% sure about the rules I mentioned in this answer, as Google is always changing its indexing techniques.
Also, what I said about Google applies to other search engines such as Yahoo and Bing as well, but it is not a guarantee for every search engine; it is just common behavior, so another search engine may still index all your website's links while your robots.txt disallows indexing.
I usually put a staging version of my websites up to test in the live environment before going live with the real version, using robots.txt this way, and I have never found any of those staging links in Google, Bing, or Yahoo.
As long as your security is not beta quality, it's a good idea to get your site online as early as possible.
Google indexes your site periodically, and will index more frequently as it detects more frequent changes and/or your pagerank increases.
We're doing a whitelabel site, which must not be indexed by Google.
Does anyone know a tool to check whether Googlebot will index a given URL?
I've put <meta name="robots" content="noindex" /> on all pages, so it shouldn't be indexed - however I'd rather be 110% certain by testing it.
I know I could use robots.txt, however the problem with robots.txt is as follows:
Our main site should be indexed, and it's the same IIS (ASP.NET) application as the whitelabel site - the only difference is the URL.
I cannot vary the robots.txt depending on the incoming URL, but I can add a meta tag to all pages from my code-behind.
You should add a Robots.txt to your site.
However, the only reliable way to prevent search engines from indexing a site is to require authentication. (Some spiders ignore robots.txt.)
EDIT: You need to add a handler for robots.txt to serve different files depending on the Host header.
You'll need to configure IIS to send the robots.txt request through ASP.NET; the exact instructions depend on the IIS version.
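One possible alternative to writing a handler, assuming the IIS URL Rewrite module is installed, is to serve a disallow-everything file only for the whitelabel host; the host name and file name below are placeholders:
<system.webServer>
  <rewrite>
    <rules>
      <rule name="Whitelabel robots.txt" stopProcessing="true">
        <match url="^robots\.txt$" />
        <conditions>
          <!-- placeholder host name for the whitelabel site -->
          <add input="{HTTP_HOST}" pattern="^whitelabel\.example\.com$" />
        </conditions>
        <!-- robots-disallow.txt contains "User-agent: *" and "Disallow: /" -->
        <action type="Rewrite" url="robots-disallow.txt" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>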
Google Webmaster Tools (google.com/webmasters/tools) will (besides letting you upload a sitemap) do a test crawl of your site and tell you what it crawled, how your site rates for certain queries, and what it will and will not crawl.
The test crawl isn't automatically included in Google results; anyway, if you're trying to hide sensitive data from the prying eyes of Google, you cannot count on that alone: put some authentication in the line of fire, no matter what.
A few days ago we replaced our web site with an updated version. The original site's content was migrated to http://backup.example.com. Search engines do not know about the old site, and I do not want them to know.
While we were in the process of updating our site, Google crawled the old version.
Now when using Google to search for our web site, we get results for both the new and old sites (e.g., http://www.example.com and http://backup.example.com).
Here are my questions:
Can I update the backup site content with the new content? Then we can get rid of all the old content. My concern is that Google will lower our page ranking due to duplicate content.
If I prevent the old site from being accessed, how long will it take for the information to clear out of Google's search results?
Can I use a Google disallow to block Google from the old web site?
You should probably put a robots.txt file in your backup site and tell robots not to crawl it at all. Google will obey the restrictions though not all crawlers will. You might want to check out the options available to you at Google's WebMaster Central. Ask Google and see if they will remove the errant links for you from their data.
You can always use robots.txt on the backup.* site to disallow Google from indexing it.
More info here: link text
Are the URL formats consistent enough between the backup and current site that you could redirect a given page on the backup site to its equivalent on the current one? If so, you could have the backup site send 301 Permanent Redirects to each of the equivalent pages on the site you actually want indexed. The redirecting pages should drop out of the index (after how much time, I do not know).
If not, definitely look into robots.txt as Zepplock mentioned. After setting the robots.txt you can expedite removal from Google's index with their Webmaster Tools
Also, you can make a rule in your scripts to redirect each page to the new one with a 301 header.
Robots.txt is a good suggestion but...Google doesn't always listen. Yea, that's right, they don't always listen.
So, disallow all spiders, but also put this in your header:
<meta name="robots" content="noindex, nofollow, noarchive" />
It's better to be safe than sorry. Meta commands are like yelling at Google "I DONT WANT YOU TO DO THIS TO THIS PAGE". :)
Do both, save yourself some pain. :)
I suggest you either add a noindex meta tag to all old pages or just disallow them via robots.txt; the best way is to block them with robots.txt. One more thing: add a sitemap to the new site and submit it in Webmaster Tools, which improves the indexing of your new website.
Password protect the web pages or directories that you don't want web spiders to crawl/index by putting password protection code in the .htaccess file (if present in your website's root directory on the server, or create a new one and upload it).
The web spiders will never know that password and hence won't be able to index the protected directories or web pages.
You can block particular URLs in Webmaster Tools; check it out. You can also block them using robots.txt. Remove the sitemap for your old backup site and put a noindex, nofollow tag on all of your old backup pages. I handled this situation for one of my clients too.