Does this block or allow Googlebot access?
User-Agent: Googlebot
Allow: /*.js*
Allow: /*.css*
Does anybody know whether the robots.txt rules above block or allow Googlebot access?
Your rules would work, but probably the simplest form of allow rule for letting JavaScript and CSS resources be crawled is:
User-Agent: Googlebot
Allow: .js
Allow: .css
This will allow anything like https://example.com/deep/style.css?something=1 or https://example.com/deep/javascript.js, and leaves little room for interpretation by other search engines.
If, however, you have a disallow rule that's more specific than the blanket allow rules, that disallow will take precedence. For example, if you have:
User-Agent: Googlebot
Disallow: /deep/
Allow: .js
Allow: .css
Then the allow rules won't work for https://example.com/deep/javascript.js (though they would for https://example.com/javascript.js). To allow the JS and CSS files in the generally disallowed directory, you would do:
User-Agent: Googlebot
Disallow: /deep/
Allow: /deep/*.js
Allow: /deep/*.css
Once you have this, you can test your setup with the Blocked Resources feature in Search Console.
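If it helps to see the precedence logic spelled out, here is a rough Python sketch of the matching behaviour as Google documents it (the longest matching rule wins, and an Allow wins a tie), run against the /deep/ example above. It's only an illustration, not Google's actual parser:
import re

def rule_to_regex(rule_path):
    # "*" matches any run of characters, "$" anchors the end of the URL path.
    return re.compile("".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in rule_path))

def allowed(path, rules):
    # rules: list of ("allow" | "disallow", path-pattern) pairs.
    # Longest matching pattern wins; if an allow and a disallow tie, allow wins.
    best_kind, best_len = "allow", -1          # no matching rule means crawling is allowed
    for kind, rule_path in rules:
        if rule_to_regex(rule_path).match(path):
            if len(rule_path) > best_len or (len(rule_path) == best_len and kind == "allow"):
                best_kind, best_len = kind, len(rule_path)
    return best_kind == "allow"

rules = [("disallow", "/deep/"), ("allow", "/deep/*.js"), ("allow", "/deep/*.css")]
print(allowed("/deep/javascript.js", rules))   # True  - Allow: /deep/*.js is the longer match
print(allowed("/deep/page.html", rules))       # False - only Disallow: /deep/ matches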
User-Agent: Googlebot
Allow: .js
Allow: .css
Those rules will only allow URLs like the following...
website.com/.jswebpage.html
website.com/.csswebpage.html
The best way to unblock JS and CSS files would be to either unblock the directories holding them or use the full Allow: path to the holding directory, such as...
Allow: /assets/*.js
Allow: /assets/*.css
The above example assumes the JS and CSS files are held in /assets/.
Also to note...
If you had the following in your robots.txt file...
User-Agent: *
Disallow: /cat1/
Disallow: /cat2/
Disallow: /cat3/
Allow: /assets/*.js
Allow: /assets/*.css
User-Agent: Googlebot
Allow: /assets/*.js
Allow: /assets/*.css
Google will skip the wildcard (User-Agent: *) entries entirely and only take note of what is listed under the Googlebot user-agent. So it's best not to use the Googlebot user-agent in robots.txt unless you absolutely have to, and if you do, add all the pages/assets it needs to take note of, even if they duplicate the wildcard entries.
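To make that group-selection behaviour concrete, here's a minimal Python sketch (an illustration, not a full robots.txt parser) that splits the example file above into per-user-agent groups and shows that a crawler matching the Googlebot group never sees the rules in the * group:
def parse_groups(robots_txt):
    # Map each user-agent to its own rule list. Real parsers also merge
    # consecutive User-agent lines into one shared group; this sketch keeps
    # one agent per group, which is enough for the example above.
    groups, current = {}, None
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current = groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current is not None:
            current.append((field, value))
    return groups

robots_txt = """\
User-Agent: *
Disallow: /cat1/
Disallow: /cat2/
Disallow: /cat3/
Allow: /assets/*.js
Allow: /assets/*.css

User-Agent: Googlebot
Allow: /assets/*.js
Allow: /assets/*.css
"""

groups = parse_groups(robots_txt)
# A crawler uses its own group if one exists, otherwise the "*" group - never both.
print(groups.get("googlebot", groups.get("*")))
# [('allow', '/assets/*.js'), ('allow', '/assets/*.css')]  <- none of the Disallow lines carry over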
Related
I'm trying to set it up where www.url.com/folder is disallowed, but www.url.com/folder/1 is allowed. I have it set up as follows:
User-agent: *
Disallow: /folder
Allow: /folder/*
which works when testing with the Google robots.txt tester, but if I look at the logs, I can see Googlebot hitting all of the URLs other than /folder.
Am I missing something? Should allow go first?
I think this one should work:
User-agent: *
Disallow: /folder/$
Allow: /folder/*
I'd like to disallow all subdirectories in my folder /search but allow indexing the search folder itself (I have content on /search).
Testing this does not work:
User-Agent: *
Allow: /search/
Disallow: /search/*
Your code appears correct. Try with a slight adjustment to Allow:
User-Agent: *
Disallow: /search/*
Allow: /search/$
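As a quick sanity check of what the $ anchor does, here is a tiny Python sketch of that style of pattern matching (* as a wildcard, $ as an end anchor, per Google's documented syntax); it's only an illustration, not a real robots.txt parser:
import re

def matches(rule_path, url_path):
    # "*" matches any run of characters, "$" anchors the end of the path.
    regex = "".join(".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
                    for ch in rule_path)
    return re.match(regex, url_path) is not None

print(matches("/search/$", "/search/"))         # True  - the Allow rule covers the folder itself
print(matches("/search/$", "/search/widgets"))  # False - the Allow rule stops at the anchor
print(matches("/search/*", "/search/widgets"))  # True  - subdirectories hit the Disallow rule
For /search/ itself both rules match; as Google documents it, the less restrictive rule (the Allow) wins such a tie, which is why the folder page stays crawlable while everything beneath it is blocked.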
I have a site with the following structure:
http://www.example.com/folder1/folder2/folder3
I would like to disallow indexing in folder1, and folder2.
But I would like the robots to index everything under folder3.
Is there a way to do this with the robots.txt?
From what I've read, I think that everything inside a specified folder is disallowed.
Would the following achieve my goal?
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Yes, it works... however, Google has a tool to test your robots.txt file.
You only need to go to Google Webmaster Tools (https://www.google.com/webmasters/tools/)
and open the section "Site configuration -> Crawler access".
All you would need is:
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Allow: /
At least Googlebot will see the more specific Allow for that one directory and disallow everything else under folder1. This is backed up by a post from a Google employee.
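Here is a tiny, purely illustrative Python check of that "more specific rule wins" idea; since these rules contain no wildcards, plain prefix matching is enough (this is not Google's parser, just the documented longest-match precedence):
rules = [("allow", "/folder1/folder2/folder3"), ("disallow", "/folder1/")]

def allowed(path):
    # Keep every rule whose path is a prefix of the URL path, then let the
    # longest one decide; True in the tuple stands for "allow".
    hits = [(len(rule_path), kind == "allow") for kind, rule_path in rules
            if path.startswith(rule_path)]
    return max(hits, default=(0, True))[1]

print(allowed("/folder1/folder2/folder3/page.html"))  # True  - the longer Allow wins
print(allowed("/folder1/folder2/other.html"))         # False - only Disallow: /folder1/ matches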
Line breaks in records are not allowed, so your original robots.txt should look like this:
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Possible improvements:
Specifying Allow: / is superfluous, as it’s the default anyway.
Specifying Disallow: /folder1/folder2/ is superfluous, as Disallow: /folder1/ is sufficient.
As Sitemap is not per record, but for all bots, you could specify it as a separate block.
So your robots.txt could look like this:
User-agent: *
Crawl-delay: 0
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Sitemap: http://example.com/sitemap
(Note that the Allow field is not part of the original robots.txt specification, so don’t expect all bots to understand it.)
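If you want to double-check a file like this outside of Webmaster Tools, Python's standard urllib.robotparser can evaluate a robots.txt you feed it. It does understand Allow (though, as noted above, not every crawler will), and it doesn't support wildcards, which is fine here because none are used:
import urllib.robotparser

robots_txt = """\
User-agent: *
Crawl-delay: 0
Allow: /folder1/folder2/folder3
Disallow: /folder1/

Sitemap: http://example.com/sitemap
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("AnyBot", "http://example.com/folder1/folder2/folder3/page"))  # True
print(parser.can_fetch("AnyBot", "http://example.com/folder1/secret.html"))           # False
Note that urllib.robotparser evaluates rules first-match rather than longest-match, so keeping the Allow line above the Disallow line, as in the record above, is what makes it agree with the intended verdict here.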
I don't want search engines to index most of my website.
I do, however, want search engines to index 2 folders (and their children). This is what I set up, but I don't think it works; I see pages in Google that I wanted to hide.
Here's my robots.txt
User-agent: *
Allow: /archive/
Allow: /lsic/
User-agent: *
Disallow: /
What's the correct way to disallow all folders, except for 2?
I gave a tutorial about this on this forum, and it's also covered on Wikipedia.
Basically, the first matching robots.txt pattern always wins:
User-agent: *
Allow: /archive/
Allow: /lsic/
Disallow: /
But I suspect it might be too late. Once a page is indexed it's pretty hard to remove. The only way is to shift it to another folder or just password-protect the folder. You should be able to do that in your host's cPanel.
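Under that first-match convention you can sanity-check the file with a few lines of Python; this is just a sketch of first-match evaluation (Googlebot itself now documents longest-match precedence, which gives the same verdicts for these particular rules):
rules = [("allow", "/archive/"), ("allow", "/lsic/"), ("disallow", "/")]

def allowed(path):
    # First matching rule wins, as in the original robots.txt convention.
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched: crawling is allowed by default

print(allowed("/archive/2013/post.html"))  # True
print(allowed("/lsic/index.html"))         # True
print(allowed("/private/draft.html"))      # False - falls through to Disallow: /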
Should I then do
User-agent: *
Disallow: /
Is it as simple as that?
Or will that not crawl the files in the root either?
Basically, that is what I am after: crawling all the files/pages in the root, but not any of the folders at all.
Or am I going to have to specify each folder explicitly, i.e.
Disallow: /admin
Disallow: /this
.. etc
thanks
nat
Your example will block all the files in the root.
There isn't a "standard" way to easily do what you want without specifying each folder explicitly.
Some crawlers, however, do support extensions that allow pattern matching. You could disallow all bots that don't support pattern matching, but allow those that do.
For example
# disallow all robots
User-agent: *
Disallow: /
# let Google read HTML and PDF files
User-agent: Googlebot
Allow: /*.html
Allow: /*.pdf
Disallow: /
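To illustrate that last point, here is a small, purely illustrative Python sketch contrasting a crawler that treats rule paths as literal prefixes with one that understands * wildcards; only the latter ever reaches the Allow lines in the Googlebot group, while everything else falls through to Disallow: /.
import re

def literal_match(rule_path, url_path):
    # A crawler with no pattern-matching support: plain prefix comparison.
    return url_path.startswith(rule_path)

def wildcard_match(rule_path, url_path):
    # A pattern-matching crawler: "*" matches any run of characters.
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in rule_path)
    return re.match(regex, url_path) is not None

print(literal_match("/*.html", "/index.html"))       # False - the Allow never applies
print(wildcard_match("/*.html", "/index.html"))      # True  - the Allow can apply
print(wildcard_match("/*.pdf", "/docs/report.pdf"))  # True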