Robots.txt Specific Exclusion

Currently my robots.txt is the following
#Sitemaps
Sitemap: http://www.baopals.com.com/sitemap.xml
#Disallow select URLs
User-agent: *
Disallow: /admin/
Disallow: /products/
My products have a lot of duplicate content: I pull data from taobao.com and translate it automatically, which produces many duplicate and low-quality names. That is why I disallow the whole directory. However, I manually change the titles of certain products, re-save them to the database, and showcase them on the homepage with proper translations. They are still saved under /products/, so they are lost forever once I remove them from the homepage.
Is it possible to let Google still index the products that I save to the homepage with updated translations, or am I forced to move the manually updated products to a different directory?

Some bots (including Googlebot) support the Allow field. It lets you specify paths that may be crawled even though a broader Disallow rule matches them.
So you would have to add an Allow line for each product that you want to get crawled.
User-agent: *
Disallow: /admin/
Disallow: /products/
Allow: /products/foo-bar-1
Allow: /products/foo-foo-2
Allow: /products/bar-foo
But instead of disallowing crawling of your product pages, you might want to disallow indexing. Then a bot is still allowed to visit your pages and follow links, but it won’t add the pages to its search index.
Add <meta name="robots" content="noindex" /> to each product page (in the head), and remove it (or change it to index) for each product page you want to get indexed. There’s also a corresponding HTTP header, if that’s easier for you.
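A minimal sketch of that per-product switch, assuming a hypothetical `manually_translated` flag stored with each product (the flag name is invented for illustration and does not come from the question):

```python
def robots_meta(manually_translated: bool) -> str:
    """Return the robots meta tag for a product page.

    `manually_translated` is a hypothetical flag; it stands for whatever
    marks a product whose title was fixed by hand.
    """
    content = "index, follow" if manually_translated else "noindex, follow"
    return f'<meta name="robots" content="{content}" />'


print(robots_meta(True))   # tag for a hand-translated product
print(robots_meta(False))  # tag for an auto-translated product
```

For this to have any effect, /products/ must not be disallowed in robots.txt, because a bot that is not allowed to fetch a page never sees its meta tag.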

Related

Robots.txt disallow by regex

On my website I have a page for the cart, http://www.example.com/cart, and another for the cartoons, http://www.example.com/cartoons. How should I write my robots.txt file so that only the cart page is ignored?
The cart page does not accept a trailing slash in the URL, so if I write
Disallow: /cart, it will ignore /cartoon too.
I don't know whether something like /cart$ is possible and would be parsed correctly by the spider bots. I don't want to force Allow: /cartoon, because there may be other pages with the same prefix.

In the original robots.txt specification, this is not possible. It neither supports Allow nor any characters with special meaning inside a Disallow value.
But some consumers support additional things. For example, Google gives a special meaning to the $ sign, where it represents the end of the URL path:
Disallow: /cart$
For Google, this will block /cart, but not /cartoon.
Consumers that don’t give this special meaning will interpret $ literally, so they will block /cart$, but not /cart or /cartoon.
So if using this, you should specify the bots in User-agent.
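To make the two interpretations concrete, here is a rough Python sketch. The function names are made up for illustration, and the translation to a regular expression is only an approximation of the extended matching described above, not Google's actual implementation:

```python
import re


def google_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an extended robots.txt path pattern into a regex.

    '*' matches any run of characters, and a trailing '$' anchors the
    pattern at the end of the URL path. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def literal_match(pattern: str, path: str) -> bool:
    """How a consumer without the extension sees it: plain prefix match."""
    return path.startswith(pattern)


rule = "/cart$"
for path in ("/cart", "/cartoon", "/cart$"):
    extended = bool(google_pattern_to_regex(rule).match(path))
    literal = literal_match(rule, path)
    print(f"{path!r}: extended={extended} literal={literal}")
```

Under the extended reading only /cart is blocked; under the literal reading only the odd path /cart$ is.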
Alternative
Maybe you are fine with crawling but just want to prevent indexing? In that case you could use meta-robots (with a noindex value) instead of robots.txt. Supporting bots will still crawl the /cart page (and follow links, unless you also use nofollow), but they won’t index it.
<!-- in the <head> of the /cart page -->
<meta name="robots" content="noindex" />

You could explicitly allow and disallow both paths. The more specific rule, meaning the one with the longer path, takes precedence:
disallow: /cart
allow: /cartoon
More info is available at: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
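The longest-match idea can be sketched in a few lines of Python. This is an illustration under the assumption that rules are plain path prefixes and that, on a tie in length, the less restrictive rule wins; `is_allowed` is a made-up helper, not part of any library:

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Resolve Allow/Disallow by longest matching prefix.

    `rules` is a list of (directive, path_prefix) pairs. A path that
    matches no rule is allowed. On a tie in length, Allow wins here.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


rules = [("disallow", "/cart"), ("allow", "/cartoon")]
for path in ("/cart", "/cartoon", "/cartoons", "/carts"):
    print(path, is_allowed(path, rules))
```

Note that a path such as /carts still matches only the Disallow rule and stays blocked, so every other page sharing the /cart prefix would need its own Allow line.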

Google still indexing unique URLs

I have a robots.txt file set up as such
User-agent: *
Disallow: /*
This is for a site that is entirely based on unique URLs, sort of like https://jsfiddle.net/: when you save a new fiddle, it gets a unique URL. I want all of my unique URLs to be invisible to Google. No indexing.
Google has indexed all of my unique URLs, even though it says "A description for this result is not available because of the site's robots.txt file. - learn more"
But that still sucks because all the URLs are there, and clickable - so all the data inside is available. What can I do to 1) get rid of these off Google and 2) stop Google from indexing these URLs.

Robots.txt tells search engines not to crawl the page, but it does not stop them from indexing the page, especially if there are links to the page from other sites. If your main goal is to guarantee that these pages never wind up in search results, you should use robots meta tags instead. A robots meta tag with 'noindex' means "Do not index this page at all". Blocking the page in robots.txt means "Do not request this page from the server."
After you have added the robots meta tags, you will need to change your robots.txt file to no longer disallow the pages. Otherwise, the robots.txt file would prevent the crawler from loading the pages, which would prevent it from seeing the meta tags. In your case, you can just change the robots.txt file to:
User-agent: *
Disallow:
(or just remove the robots.txt file entirely)
If robots meta tags are not an option for some reason, you can also use the X-Robots-Tag header to accomplish the same thing.
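As a sketch of the header approach, here is a tiny WSGI application in Python that sends X-Robots-Tag: noindex with every response. The app itself and its response body are placeholders; a real site would set the header in its own framework or web server configuration:

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Tiny WSGI app that marks every response as noindex."""
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        # Header equivalent of <meta name="robots" content="noindex">
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [b"hello\n"]


# Call the app directly, without starting a server, to inspect the headers.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, response_headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(response_headers)


body = b"".join(app(environ, start_response))
print(captured["status"], captured["headers"]["X-Robots-Tag"])
```

To try it over HTTP, the same `app` can be passed to `wsgiref.simple_server.make_server` from the standard library.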

robots.txt file syntax: can I disallow all, then allow only some sites?

Can you disallow all and then allow specific sites only? I am aware that one approach is to disallow specific sites and allow all. Is it valid to do the reverse? E.g.:
User-agent: *
Disallow: /
Allow: /siteOne/
Allow: /siteTwo/
Allow: /siteThree/
Simply disallowing all and then allowing specific sites seems much more secure than allowing all and then having to think about all the places you don't want them to crawl.
Could the method above be responsible for the site's description saying 'A description for this result is not available because of this site's robots.txt – learn more.' in the organic ranking on Google's home page?
UPDATE - I have gone into Google webmaster tools > Crawl > robots.txt tester. At first, when I entered siteTwo/default.asp, it said Blocked and highlighted the 'Disallow: /' line. After leaving and revisiting the tool, it now says Allowed. Very weird. So if this says Allowed, I wonder why it gave the message above in the description for the site?
UPDATE2 - The example of the robots.txt file above should have said dirOne, dirTwo and not siteOne, siteTwo. Two great links to know all about robots.txt are unor's robots.txt specification in the accepted answer below and the robots exclusion standard, which is also a must-read. This is all explained in these two pages. In summary: yes, you can disallow and then allow, BUT always place the disallow last.

(Note: You don’t disallow/allow crawling of "sites" in the robots.txt, but URLs. The value of Disallow/Allow is always the beginning of a URL path.)
The robots.txt specification does not define Allow.
Consumers following this specification would simply ignore any Allow fields. Some consumers, like Google, extend the spec and understand Allow.
For those consumers that don’t know Allow: Everything is disallowed.
For those consumers that know Allow: Yes, your robots.txt should work for them. Everything’s disallowed, except those URLs matched by the Allow fields.
Assuming that your robots.txt is hosted at http://example.org/robots.txt, Google would be allowed to crawl the following URLs:
http://example.org/siteOne/
http://example.org/siteOne/foo
http://example.org/siteOne/foo/
http://example.org/siteOne/foo.html
Google would not be allowed to crawl the following URLs:
http://example.org/siteone/ (it’s case-sensitive)
http://example.org/siteOne (missing the trailing slash)
http://example.org/foo/siteOne/ (not matching the beginning of the path)
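The same checks can be reproduced with a short Python sketch. It models only a consumer that understands Allow, with `Disallow: /` plus the three Allow values from the question; `crawlable` is a made-up helper name:

```python
from urllib.parse import urlsplit

# Allow values from the robots.txt in the question.
ALLOW_PREFIXES = ["/siteOne/", "/siteTwo/", "/siteThree/"]


def crawlable(url: str) -> bool:
    """True if the URL path starts with one of the Allow values.

    The comparison is case-sensitive and starts at the beginning of
    the URL path; everything else falls under `Disallow: /`.
    """
    path = urlsplit(url).path or "/"
    return any(path.startswith(prefix) for prefix in ALLOW_PREFIXES)


for url in (
    "http://example.org/siteOne/foo.html",
    "http://example.org/siteone/",
    "http://example.org/siteOne",
    "http://example.org/foo/siteOne/",
):
    print(url, crawlable(url))
```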

Why is Google crawling pages blocked by my robots.txt?

I have a “double” question about the number of pages crawled by Google, its possible relation to duplicate content (or not), and the impact on SEO.
Facts on my number of pages and pages crawled by Google
I launched a new website two months ago. Today, it has close to 150 pages (it's increasing every day). This is the number of pages in my sitemap anyway.
If I look at "Crawl stats" in Google Webmaster Tools, I can see that the number of pages crawled by Google every day is much bigger (see image below).
I'm not sure this is actually good: not only does it make my server a bit busier (5.6 MB of downloads for 903 pages in a day), but I'm also afraid it creates duplicate content.
I have checked on Google (site:mysite.com) and it gives me 1290 pages, but only 191 are shown unless I click on "repeat the search with the omitted results included". Let’s suppose the 191 are the ones in my sitemap (I think I have a duplicate-content problem with around 40 pages, but I just updated the website for that).
Facts on my robots.txt
I use a robots.txt file to keep all crawlers away from pages with parameters (see the robots.txt below) and also from “Tags”.
User-Agent: *
Disallow: /administrator
Disallow: *?s
Disallow: *?r
Disallow: *?c
Disallow: *?viewmode
Disallow: */tags/*
Disallow: *?page=1
Disallow: */user/*
The most important one is tags. They appear in my URLs as follows:
www.mysite.com/tags/Advertising/writing
It is blocked by the robots.txt (I’ve checked with Google Webmaster Tools), but it is still present in Google search (although you need to click on “repeat the search with the omitted results included.”)
I don’t want those pages to be crawled, as they are duplicate content (a kind of search on a keyword). That’s why I put them in robots.txt.
Finally, my questions are:
Why is Google crawling the pages that I blocked in robots.txt?
Why is Google indexing pages that I have blocked? Are those pages considered by Google as duplicate content? If yes I guess it’s bad for SEO.
EDIT: I'm NOT asking how to remove the pages indexed in Google (I know the answer already).

Why is Google crawling the pages that I blocked in robots.txt? Why is Google indexing pages that I have blocked?
They may have crawled it before you blocked it. You have to wait until they read your updated robots.txt file and then update their index accordingly. There is no set timetable for this but it is typically longer for newer websites.
Are those pages considered as duplicate content?
You tell us. Duplicate content is when two or more pages have identical or nearly identical content. Is that happening on your site?
Blocking duplicate content is not the way to solve that problem. You should be using canonical URLs. Blocking pages means you're linking to "black holes" in your website, which hurts your SEO efforts. Canonical URLs prevent this and give the canonical URL full credit for its related terms and for all links to the duplicated pages as well.
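As a rough illustration, the following Python sketch builds a canonical link tag by dropping the query string. It assumes the query parameters only change presentation, which has to be verified per site, and the example path /articles/writing is hypothetical:

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(url: str) -> str:
    """Build a <link rel="canonical"> tag that drops the query string.

    Assumes the parameter-free URL is the preferred version of the page.
    """
    parts = urlsplit(url)
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}" />'


# Hypothetical page reached through a presentation-only parameter.
print(canonical_link("http://www.mysite.com/articles/writing?viewmode=list"))
```

The tag goes in the head of each duplicate page, and the crawler must be allowed to fetch that page in order to see it.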

Meta tag vs robots.txt

Is it better to use meta tags* or the robots.txt file for informing spiders/crawlers to include or exclude a page?
Are there any issues in using both the meta tags and the robots.txt?
*E.g.: <meta name="robots" content="index, follow">

There is one significant difference. According to Google they will still index a page behind a robots.txt DENY, if the page is linked to via another site.
However, they will not if they see a metatag:
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Google Search results completely by using other URL blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.

Both are supported by all crawlers that respect webmasters' wishes. Not all do, but against those neither technique is sufficient.
You can use robots.txt rules for general things, like disallowing whole sections of your site. If you say Disallow: /family, then all links starting with /family are not crawled.
A meta tag can be used to exclude a single page. A meta tag on one page does not affect subpages in the page hierarchy: if you have a noindex meta tag on /work, it does not prevent a crawler from accessing /work/my-publications if there is a link to it on an allowed page.

Robots.txt IMHO.
The Meta tag option tells bots not to index individual files, whereas Robots.txt can be used to restrict access to entire directories.
Sure, use a Meta tag if you have the odd page in indexed folders that you want skipping, but generally, I'd recommend you put most of your non-indexed content in one or more folders and use robots.txt to skip the lot.
No, there isn't a problem in using both - if there is a clash, in general terms, a deny will overrule an allow.

meta is superior.
In order to exclude individual pages from search engine indices, the noindex meta tag is actually superior to robots.txt.

There is a very big difference between the meta robots tag and robots.txt.
In robots.txt, we tell crawlers which pages to crawl and which to exclude, but we do not tell them not to index the pages excluded from crawling.
With the meta robots tag, we can ask search engine crawlers not to index a page. The tag to use for this is:
<meta name="robots" content="noindex">
OR
<meta name="robots" content="follow, noindex">
In the second meta tag, I have asked the robot to follow the links on that page but not to index the page in the search engine.
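To see how a consumer might read such a tag, here is a small Python sketch using the standard library's html.parser. The class and function names are invented for illustration, and real crawlers handle more cases (bot-specific names, conflicting tags) than this does:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def may_index(html: str) -> bool:
    """False if the page carries a noindex (or none) robots directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"noindex", "none"} & parser.directives)


page = '<html><head><meta name="robots" content="follow, noindex"></head></html>'
print(may_index(page))
```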

Here is what I know about them and about where each one applies. Both can be used for blocking content.
The difference between them is:
The meta robots tag can block a single page with a piece of code pasted into the head of that page. With the meta robots tag we tell the search engine what it may do with the page.
In the robots.txt file you can block the whole website.
Here are examples of the meta robots tag:
<meta name="robots" content="index, follow">
<meta name="robots" CONTENT="all">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="index, nofollow" />
<meta name="robots" content="noindex, nofollow" />
Here are examples of a robots.txt file:
Allowing crawlers to crawl the whole website:
user-agent: *
Allow:
Disallow:
Disallowing crawlers from crawling the whole website:
user-agent: *
Allow:
Disallow: /

I would probably use robots.txt over the meta tag. Robots.txt has been around longer, and might be more widely supported (But I am not 100% sure on that).
As for the second part, I think most spiders will take whatever is the most restrictive setting for a page - if there is a disparity between the robots.txt and meta tag.

Robots.txt is good for pages that consume a lot of your crawl budget, such as internal search or filters with infinite combinations. If you allow Google to crawl yoursite.com/search=lalalala, it will waste your crawl budget.

You want to use 'noindex,follow' in a robots meta tag, rather than robots.txt, because it will allow the link juice to pass through. It is better from an SEO perspective.

Is it better to use meta tags* or the robots.txt file for informing spiders/crawlers to include or exclude a page?
Answer: Both are important, and they are used for different purposes. The robots.txt file is used to include or exclude pages or root files from a spider's crawl, while meta tags are used on an individual page to describe its niche and content.
Are there any issues in using both the meta tags and the robots.txt?
Answer: Both should be implemented to sites so that search engine spiders/crawlers can index or de-index the site urls.
Read more here about working of a search engine spiders >>https://www.playbuzz.com/alexhuber10/how-search-and-spider-engines-work

You can use either one, but if your website has plenty of pages, robots.txt is easier and saves time.
