Should I use a trailing slash when disallowing a directory in robots.txt? - seo

I want to disallow crawling of a directory /acct in robots.txt. Which rule should I use?
Disallow: /acct or Disallow: /acct/
acct contains both subdirectories and files. What is the effect of a trailing slash?

Since robots.txt rules are all "starts with" rules, both of your proposed rules would disallow the following:
https://example.com/acct/
https://example.com/acct/foo
https://example.com/acct/bar
However, the following would only be disallowed by the rule without the trailing slash:
https://example.com/acct
https://example.com/acct.html
https://example.com/acctbar
Disallow: /acct/ is usually better because there is no risk of disallowing unexpected URLs. However, it does NOT prevent crawling of /acct.
In most cases web servers redirect directory URLs without a trailing slash to add the trailing slash. It is likely that on your server, https://example.com/acct redirects to https://example.com/acct/. If that is the case, it is usually fine to allow bots to crawl /acct with no trailing slash and see the redirect. They would be blocked from crawling the target of the redirect.
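The "starts with" behavior can be checked with a short Python sketch. The helper `is_disallowed` is hypothetical and deliberately simplified: it only compares the URL path against a single Disallow prefix and ignores Allow rules, wildcards, and user-agent groups, all of which real crawlers also take into account.

```python
from urllib.parse import urlsplit


def is_disallowed(rule_path: str, url: str) -> bool:
    """Return True if the URL path starts with the Disallow prefix.

    Simplified on purpose: no Allow rules, wildcards, or user-agent groups.
    """
    path = urlsplit(url).path or "/"
    return path.startswith(rule_path)


urls = [
    "https://example.com/acct/",
    "https://example.com/acct/foo",
    "https://example.com/acct",
    "https://example.com/acct.html",
    "https://example.com/acctbar",
]

for url in urls:
    # Show the verdict of both candidate rules for each URL.
    print(url, is_disallowed("/acct", url), is_disallowed("/acct/", url))
```

Only the last three URLs get a different verdict from the two rules, which matches the lists above.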

Related

Handling multiple re-writing and redirection with .htaccess

Working with .htaccess has always been a little confusing for many developers.
Currently I am also experiencing an issue.
We want 3-4 things to work simultaneously with .htaccess:
1) redirect non-www to www
2) remove .php extension
3) for pages with trailing parameters abc.php?pageid=28 and abc.php?pageid=95&cat=92 - these pages must show their actual page names like www.xyz.com/about-us rather than ids.
All of the above must work together.
Refer to the following links; they may resolve the issue.
How to write multiple rewrite conditions and rules in one .htaccess file for redirecting urls?
htaccess mod_rewrite - Trailing Slash and loop of redirects
.htaccess - Rewrite multiple subdirectories to root

Redirect from a "shortcut site" to a full site preserving an argument

I've got a bunch of QR code labels encoded with URLs like "https://mysh.ort.url/A5B6D", and I need to redirect them to "https://www.myfullsite.com/items/item.php?A5B6D".
I tried with:
RewriteRule ^/([a-zA-Z0-9]+)$/ https://www.myfullsite.com/items/item.php?$1 [R,L]
But it doesn't work. I'm not too strong with regexes, but none of the "item codes" I've tried seem to work. They are all 5 characters long and contain only uppercase letters or numbers. I managed to make it work with a blanket redirect, but I want to preserve URLs in other formats on the shorthand domain for other purposes, so ideally it should only redirect requests that consist of exactly 5 letters or numbers after the /, like "https://mysh.ort.url/A1BCD" but not "https://mysh.ort.url/gimme/news".
Is that rule placed in the real host configuration or a cheap .htaccess style file?
RewriteEngine on
RewriteRule ^/?([a-zA-Z0-9]+)$ https://www.myfullsite.com/items/item.php?$1 [R=301,L]
This would be a more precise version:
RewriteEngine on
RewriteRule ^/?([A-Z0-9]{5})$ https://www.myfullsite.com/items/item.php?$1 [R=301,L]
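Note the ? following the leading slash (/). You have to consider the different bases RewriteRule patterns work on: in the real host configuration the pattern is matched against the absolute path (with the leading slash), while in .htaccess style files it is matched against a relative path (without it). The optional slash covers both. Also, you have to remove the trailing slash from your pattern.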
I also added the =301 parameter to the redirection flag, since most likely you want a permanent redirect (301) rather than the temporary one (302) that a plain R flag issues.
In general one should prefer to place such rules inside the host configuration of the http server whenever possible. More transparent, more reliable, more efficient. .htaccess style files are notoriously error prone, hard to debug, and they really slow down the server, often for nothing. They are only provided as a last fallback for users without access to the host configuration, for example in situations with a really cheap web hosting provider.
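To see which request paths the more precise pattern accepts, it can be tried outside Apache. The following is a minimal Python sketch, not a replacement for testing on the server: it assumes Python's `re` module behaves like Apache's PCRE for this simple pattern, and `redirect_target` is a hypothetical helper that uses the target URL from the question.

```python
import re
from typing import Optional

# Same pattern as the more precise RewriteRule: optional leading slash,
# then exactly five uppercase letters or digits.
PATTERN = re.compile(r"^/?([A-Z0-9]{5})$")
TARGET = "https://www.myfullsite.com/items/item.php?"


def redirect_target(path: str) -> Optional[str]:
    """Return the redirect URL for a matching path, or None if it does not match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return TARGET + match.group(1)


for path in ["/A5B6D", "A5B6D", "/gimme/news", "/A5B6D/", "/a5b6d"]:
    # Paths with and without the leading slash mimic the two rule contexts.
    print(repr(path), "->", redirect_target(path))
```

Both "/A5B6D" (host configuration style) and "A5B6D" (.htaccess style) match, while "/gimme/news", a path with a trailing slash, and a lowercase code do not.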

How to prevent AEM/Sling from adding trailing slash to the extensionless URLs?

All extensionless URLs on the site that resolve to actual nodes are being redirected (with a 301 code) to their versions with an added trailing slash. This doubles the number of requests to the frontend web server, so we would like to fix it.
We do use Apache mod_rewrite to rewrite all incoming URLs (with or without slash) to their .html equivalents in order to make dispatcher caching consistent, but the actual processing is a bit weird.
In general, we have three cases:
1) URL has an extension ( i.e. /content/xxx/yyy.html ) - it is processed right away, no redirects.
2) URL has a trailing slash ( /content/xxx/yyy/ ) - it is processed by mod_rewrite and rewritten to /content/xxx/yyy.html successfully, no redirects.
3) Extensionless URL ( /content/xxx/yyy ) - it is processed by mod_rewrite, rewritten to /content/xxx/yyy.html and immediately redirected to /content/xxx/yyy/, which subsequently goes through the routine from point 2 above.
To exclude Apache-originated redirects we disabled almost all modules, such as mod_dir, mod_negotiation, mod_autoindex, etc., to avoid redirects due to content negotiation or directory indexing, but requests are still being redirected.
Our app doesn't contain any redirects based on the URL, so I'm wondering if there is any OSGi service or hidden configuration setting which triggers such redirects?
We also have a set of shortcuts on the site; Apache rewrites them to actual URLs and they are NOT being redirected.
For example, if the requested URL is /aboutus, it is successfully mapped to /content/xxx/yyy/operations/aboutus.html and processed in one loop without any additional redirects. The problem described above occurs only when there is an actual corresponding node in JCR and the request is extensionless.

Why doesn't this disable directory listing for my site?

I'm using Options -Indexes in a .htaccess file in the directory where I want to disable listing, but when I go to the directory I can still see all the files in it. What am I doing wrong? (Apache server).
Is it possible you have DirectorySlash turned off? It makes requests for directories that are missing the trailing slash get redirected to the same directory with the trailing slash. By default it is on, but if it's turned off, it has a strange side effect that allows listing of directories regardless of the Indexes option:
Turning off the trailing slash redirect may result in an information disclosure. Consider a situation where mod_autoindex is active (Options +Indexes) and DirectoryIndex is set to a valid resource (say, index.html) and there's no other special handler defined for that URL. In this case a request with a trailing slash would show the index.html file. But a request without trailing slash would list the directory contents.

In robots.txt only allow crawling for subdomain NOT subdirectory on shared hosting?

I just changed the DNS settings so the folder /forum is now a subdomain instead of a subdirectory. If I do a robots.txt file and say:
User-agent: *
Disallow: /forum
Will that disallow crawling for the subdirectory AND subdomain?
I want to disallow crawling of the subdirectory, but ALLOW crawling of the subdomain. Note: this is on shared hosting so both the subdirectory and subdomain can be visited. This is why I have this issue.
So, how can I permit crawling only for the subdomain?
That is the correct way if you want to stop crawling of the subdirectory. But note: if the URLs are already indexed, they won't be removed.
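One point worth making explicit: crawlers fetch robots.txt separately for each host, so a rule in the main domain's file is not applied to URLs on the subdomain. The sketch below only illustrates which robots.txt a crawler would consult for a given URL; `robots_txt_url` is a hypothetical helper and the host names are placeholders.

```python
from urllib.parse import urlsplit


def robots_txt_url(url: str) -> str:
    """Return the robots.txt URL that governs the given URL (one file per host)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# The same content reached through two different hosts (placeholder names).
print(robots_txt_url("http://www.yourdomain.com/forum/some-thread"))
print(robots_txt_url("http://forum.yourdomain.com/some-thread"))
```

On shared hosting where the subdomain is served from the /forum folder, the second file would presumably be the robots.txt stored inside that folder; this depends on the hosting setup and is an assumption here.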
The way I would prefer is to set all pages to "noindex/follow" via meta tags or, even better, use the "canonical tag" to send the search engines' traffic to the subdomain URL.
In the <head> of a given URL like "http://www.yourdomain.com/directoryname/post-of-the-day", use
<link rel="canonical" href="http://directoryname.yourdomain.com/post-of-the-day" />
The latter URL should then be the only one in SERPs.
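If the pages are generated by a script, the canonical tag could be derived from the requested URL. This is only a sketch under assumptions: `canonical_link` is a hypothetical helper, and the directory and domain names are the placeholders from the example above.

```python
from html import escape
from urllib.parse import urlsplit


def canonical_link(url: str, directory: str = "directoryname",
                   domain: str = "yourdomain.com") -> str:
    """Build a canonical <link> tag pointing a subdirectory URL at the subdomain."""
    parts = urlsplit(url)
    prefix = f"/{directory}"
    if parts.path != prefix and not parts.path.startswith(prefix + "/"):
        raise ValueError("URL is not inside the subdirectory")
    # Drop the directory prefix; the subdomain serves the rest of the path.
    rest = parts.path[len(prefix):] or "/"
    href = f"{parts.scheme}://{directory}.{domain}{rest}"
    return f'<link rel="canonical" href="{escape(href, quote=True)}" />'


print(canonical_link("http://www.yourdomain.com/directoryname/post-of-the-day"))
```

The query string is dropped on purpose in this sketch; whether it should be kept depends on how the forum builds its URLs.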