robots.txt block bots crawling subdirectory [closed] - seo

I'd like to block all bots from crawling a subdirectory, http://www.mysite.com/admin, plus any files and folders in that directory. For example, there may be further directories inside /admin, such as http://www.mysite.com/admin/assets/img.
I'm not sure what the exact correct declarations to include in robots.txt are to do this.
Should it be:
User-agent: *
Disallow: /admin/
Or:
User-agent: *
Disallow: /admin/*
Or:
User-agent: *
Disallow: /admin/
Disallow: /admin/*

Based on information available on the net (I can't retrieve all of it, but some forums actually report the problem, as in here and here, for example), I'd follow those who suggest never telling people or bots (or both) where the things we don't want them to look at live ("admin" looks like sensitive content...).
After checking, I can confirm it's the first form you list. Reference here.
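In other words, the first form is enough on its own, because a Disallow rule is a path-prefix match. A minimal sketch of the final file:
User-agent: *
# Prefix match: this also covers /admin/assets/img and anything else nested under /admin/
Disallow: /admin/
The trailing * in the other variants is redundant for standards-compliant parsers. Keep in mind, though, that robots.txt is purely advisory and world-readable, which is the path-disclosure concern raised above.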

Related

Apache structure issue (.htaccess) [closed]

I have a structure like this:
/ < root folder
/Site/Public/.htaccess
/Site/Public/index.php
/Site/Public/error.php
/Site/Public/images/chat.png
In my .htaccess I have disabled access to subfolders and set a default 403 document like so:
ErrorDocument 403 error.php
Options All -Indexes
But the problem is that I cannot get it to pick that error.php file unless I use the full path starting from root. I also tried this
ErrorDocument 403 chat.png
And it doesn't pick that up either; in both situations it just displays the value as a string. Can anyone tell me how to target that error.php file without using the absolute path?
The URL I'm experimenting with is localhost/Site/Public/images
From http://httpd.apache.org/docs/2.2/mod/core.html#errordocument:
URLs can begin with a slash (/) for local web-paths (relative to the DocumentRoot), or be a full URL which the client can resolve. Alternatively, a message can be provided to be displayed by the browser.
Any argument that is neither a full URL (like http://www.example.com) nor a path starting with / will be treated as a plain message string.
The URLs have to be defined relative to the DocumentRoot, which in your case seems to be the same for all your sites. Alternatively, you can use full URLs that the client can resolve; that may be an option for you.
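To make that concrete, here is a sketch of the two supported forms, using the paths from the question and assuming the DocumentRoot is the root folder shown there:
# Option 1: a local web path, interpreted relative to the DocumentRoot (not relative to this .htaccess)
ErrorDocument 403 /Site/Public/error.php
# Option 2: a full URL the client can resolve (note that Apache answers with a redirect,
# so the client sees a 302 rather than the original 403)
# ErrorDocument 403 http://localhost/Site/Public/error.php
Options All -Indexes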
Everything else you need to know can be read in the manual:
http://httpd.apache.org/docs/current/mod/core.html#errordocument

index.php appending to url [closed]

I have a Magento site which has index.php appended to the URLs you click on. I googled a lot to find a solution and did what I could find. To clear my doubts I uploaded a fresh copy of the .htaccess file from the Magento package, set URL rewrites in Configuration > System > Web to Yes, and cleared the cache too, but it still puts index.php in the URL. I have also double-checked the secure and unsecure base URLs to see if they contain index.php, which they don't.
I have done all the research I can and applied it, but nothing changes. What can I do, or what could be wrong?
The steps you describe should be right:
System > Configuration > Web > Use Web Server Rewrites set to Yes (also check the value at store-view level, because the scope of this setting is not global)
.htaccess present in document root
clear Magento cache
Additional things to check:
System > Configuration > Web > (Un)secure base url
does your Apache honour .htaccess files at all (AllowOverride; see the sketch below)
how did you clear the cache
the scope at which you set System > Configuration > Web > Use Web Server Rewrites
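If the .htaccess is being ignored, the usual culprit is AllowOverride in the server configuration. A minimal sketch of the relevant directory block (the path is a placeholder, not taken from the question):
<Directory /var/www/magento>
    # With AllowOverride None, Apache silently ignores Magento's .htaccess and its rewrite rules
    AllowOverride All
</Directory>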

Generating a subdomain wrecks URLs with IP? [closed]

On my site until today these 2 URLS gave me the same result:
www.mysite.com/test.jpg
10.10.10.10/test.jpg
(where 10.10.10.10 is my static IP address)
Today I used cPanel to generate a new subdomain (blog.mysite.com) and since then
10.10.10.10/test.jpg
resolves to
www.mysite.com/blog/test.jpg
(which doesn't exist)
My ISP's tech support says that by default any new subdomain is placed at the top of the Apache conf file, so after creating the subdomain, requests made by IP address are now served by its vhost.
What would be the best way to get back to the original functionality?
I can't edit the server conf files but can edit my own htaccess.
You can use mod_rewrite. Try placing this in an .htaccess file in your document root (for the blog.mysite.com site)
RewriteEngine On
# If the request came in via the bare IP address, redirect it to the canonical hostname
RewriteCond %{HTTP_HOST} ^10\.10\.10\.10$
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]
Or replace the R=301 flag with P if you really don't want to redirect the browser (the P flag proxies the request internally instead, which requires mod_proxy to be enabled).
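A quick way to sanity-check the redirect once the rules are in place (assuming curl is available):
curl -I http://10.10.10.10/test.jpg
# Expected (abridged) response:
# HTTP/1.1 301 Moved Permanently
# Location: http://www.mysite.com/test.jpg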

robots.txt file is probably invalid [closed]

This is my robots.txt. I want to allow only the base URL domain.com for indexing and disallow all sub-URLs like domain.com/foo and domain.com/bar.html.
User-agent: *
Disallow: /*/
Because I am not sure whether this is valid syntax, I tested it using Google Webmaster Tools. It shows me this message:
robots.txt file is probably invalid.
Is my file valid? Is there a better way of allowing only the base URL for indexing?
Update: Google downloaded my robots.txt 4 hours ago. I think that's why it doesn't work yet. I will wait a while, and if the problem persists I will update my question again.
Here is a link to a validator. It might help you work through any errors in the file.
Robots.txt Checker
I checked on another validator, robots.txt Checker, and this is what I got for the second line:
Wildcard characters (like "*") are not allowed here. The line below must be an allow, disallow, comment or a blank line statement.
This might be what you're looking for:
User-Agent: *
Allow: /index.html
Disallow: /
This assumes your homepage is index.html.
If index.php is your homepage, you should be able to swap out index.html for index.php.
User-Agent: *
Allow: /index.php
Disallow: /
On my dynamic websites that run through index.php, going to mydomain.com/index.php still takes me to the homepage, so the above should work.
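If the crawlers you care about support the $ end-of-URL anchor (Google and Bing do, although it is an extension beyond the original robots.txt standard), another pattern worth testing is to allow only the bare root URL:
User-Agent: *
# $ anchors the match at the end of the URL, so only the root itself is allowed
Allow: /$
Disallow: /
Whichever variant you pick, re-check it in the Webmaster Tools robots.txt tester, since wildcard handling differs between crawlers.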

Apache wildcard at domain level [closed]

I have a few sites, and they all have an identical setup on a single server. Now, instead of a separate configuration file for each of them in the sites-enabled directory, I want to have a common file.
Idea is this:
www.abc.com should have /var/www/abc as DocumentRoot,
www.xyz.com should have /var/www/xyz as DocumentRoot, etc.
All other parameters like log files, contact emails etc. should also follow the same pattern (abc.com should have contact@abc.com as admin email, xyz.com should have contact@xyz.com as admin email, etc.).
I couldn't find any tutorial on how to back-reference wildcards, etc.
regards,
JP
Aha. Found the solution. VirtualDocumentRoot is the answer.
A single line like:
VirtualDocumentRoot /var/www/%0
does the job. Haven't really figured out the logging part, but it should be similar and easy.
See https://serverfault.com/questions/182929/wildcard-subdomain-directory-names for a nice related thread.
You have to enable the vhost_alias module for this (sudo a2enmod vhost_alias on Ubuntu).
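For reference, a sketch of what the single shared vhost could look like (the hostnames and log path are placeholders based on the question; %0 interpolates the full requested hostname, while other specifiers such as %2 pick out a single component of the name):
<VirtualHost *:80>
    ServerName www.abc.com
    ServerAlias www.xyz.com
    # mod_vhost_alias maps each requested hostname to its own directory, e.g. /var/www/www.abc.com
    VirtualDocumentRoot /var/www/%0
    # Keep the hostname the client asked for, so the interpolation works per request
    UseCanonicalName Off
    # One shared log; %V records which hostname served each request
    LogFormat "%V %h %l %u %t \"%r\" %>s %b" pervhost
    CustomLog /var/log/apache2/vhosts.log pervhost
</VirtualHost>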