How to ignore some links in my website? - seo

I am working on a small PHP script and I have some links like this:
*-phones-*.html
The * parts are variables. I want to stop Google from indexing this kind of link using robots.txt. Is it possible?

Strictly speaking, you're not disallowing anything. robots.txt is just a set of guidelines for web crawlers, which can choose to follow them or not.
Rude crawlers should of course be IP banned, but you can't prevent a crawler from coming across that page. Anyway, you can add the pattern to your robots.txt, and Google's crawler might obey it.
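For what it's worth, Google documents a `*` wildcard (and a trailing `$` anchor) in robots.txt paths, so a rule like the one below should cover the *-phones-*.html links. Here is a rough Python sketch of how such a pattern is matched. The robots.txt body, the `pattern_to_regex` and `is_disallowed` helpers, and the sample paths are all made up for illustration, and the sketch ignores User-agent groups and Allow rules entirely.

```python
import re

# Hypothetical robots.txt body for the question's URL pattern.
# The '*' wildcard is an extension honored by Google, not by every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*-phones-*.html
"""

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional
    trailing '$' anchor) into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path, robots_txt=ROBOTS_TXT):
    """Return True if any Disallow pattern matches the start of `path`.
    Simplification: User-agent groups and Allow rules are not handled."""
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        value = value.strip()
        if field.strip().lower() == "disallow" and value:
            if pattern_to_regex(value).match(path):
                return True
    return False

print(is_disallowed("/samsung-phones-galaxy.html"))
print(is_disallowed("/about.html"))
```

This only checks that the pattern matches what you expect. Whether a given crawler obeys it is still up to the crawler.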

Related

SEO - Need full path url on link for multiple subdomain site?

My site is available in several languages, and I plan to use subdomains like en.domain.com, cht.domain.com,
chs.domain.com. Inside the site, all other links will be relative, like href='/music', so that they work on every subdomain.
Will this be confusing for search engines indexing my site? Do I need to dynamically set the full path for each subdomain?
Thanks.
The drawback of your method is that you will have to create one sitemap for each subdomain and publish it on that subdomain, which can be tedious if you have many subdomains. You would also need to maintain several Google Webmaster Tools accounts to monitor them. Maintaining several subdomains is not very efficient SEO-wise.
Another method is to use folders, such as domain.com/en, domain.com/cht, domain.com/chs, etc. You would only need to maintain one sitemap and one Google Webmaster Tools account, which is less hassle. It would also be much more efficient regarding SEO and rankings.
No matter which method you choose, it is highly recommended to use rel="alternate" hreflang="x" tags to declare the existing translations of each page. This is very good for indexing and helps search engines a lot.
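To give an idea of what those tags look like, here is a small Python sketch that prints one tag per translation of a page. The `SITES` mapping and the `hreflang_links` helper are hypothetical, and the "zh-Hant"/"zh-Hans" codes are only my assumption about what the cht and chs subdomains stand for. Note that hreflang needs full URLs, so the relative href='/music' style is fine for normal links but not here.

```python
from html import escape

# Hypothetical mapping of hreflang codes to site roots. "zh-Hant" and
# "zh-Hans" are assumed to correspond to the cht and chs subdomains.
SITES = {
    "en": "https://en.domain.com",
    "zh-Hant": "https://cht.domain.com",
    "zh-Hans": "https://chs.domain.com",
}

def hreflang_links(path, sites=SITES):
    """Return one <link rel="alternate"> tag per translation of `path`."""
    tags = []
    for lang, root in sites.items():
        href = root.rstrip("/") + path  # absolute URL of the translated page
        tags.append(
            f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(href)}" />'
        )
    return tags

for tag in hreflang_links("/music"):
    print(tag)
```

The same set of tags would go into the head of every language version of the page. With folders instead of subdomains, only the roots in the mapping change.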

How to make sure a link in the spam-post won't get benefit in search-engine result

I have a wiki website. Many spammers are using it for SEO: they add spam posts with a link to an external website. Is there a way to make sure they won't get any benefit from it? My thought is to add a text file like robots.txt to tell search engines "don't consider external website links for search results". I don't want to prevent spammers from creating posts for the sake of advertisements :)
Add rel="nofollow" to the links when you output them on your site.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=96569
They will still spam your site with links, so you'll need to monitor as well.
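If your wiki renders links in its own code, the idea is roughly the following. This is only a Python sketch: `OWN_HOST` and `render_link` are made-up names, and a real wiki engine will have its own hook or setting for this.

```python
from html import escape
from urllib.parse import urlparse

OWN_HOST = "wiki.example.com"  # hypothetical host name of your own wiki

def render_link(url, text):
    """Render an <a> tag, adding rel="nofollow" to links that leave the site."""
    host = urlparse(url).hostname  # None for relative (internal) URLs
    is_external = host is not None and host != OWN_HOST
    rel = ' rel="nofollow"' if is_external else ""
    return f'<a href="{escape(url)}"{rel}>{escape(text)}</a>'

print(render_link("http://spam.example.net/", "cheap pills"))
print(render_link("/Main_Page", "Main page"))
```

Internal links are left alone, so your own pages still pass weight to each other.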

Sub-domain vs Sub-directory to block from crawlers

I've googled a lot and read a lot of articles, but got mixed reactions.
I'm a little confused about which is a better option if I want a certain section of my site to be blocked from being indexed by Search Engines. Basically I make a lot of updates to my site and also design for clients, I don't want all the "test data" that I upload for previews to be indexed to avoid the duplicate content issue.
Should I use a sub-domain and block the whole sub-domain
or
Create a sub-directory and block it using robots.txt.
I'm new to web design and was a little insecure about using sub-domains. I read somewhere that it's a somewhat advanced procedure and that even a tiny mistake could have big consequences. Moreover, Matt Cutts has also mentioned something similar (source):
"I’d recommend using sub directories until you start to feel pretty
confident with the architecture of your site. At that point, you’ll be
better equipped to make the right decision for your own site."
But on the other hand, I'm hesitant about using robots.txt as well, since anyone can access the file.
What are the pros and cons of both?
For now I am under the impression that Google treats both similarly and it would be best to go for a sub-directory with robots.txt, but I'd like a second opinion before "taking the plunge".
Either you ask bots not to index your content (→ robots.txt) or you lock everyone out (→ password protection).
For this decision it's not relevant whether you use a separate subdomain or a folder. You can use robots.txt or password protection for both. Note that the robots.txt always has to be put in the document root.
Using robots.txt gives no guarantee; it's only a polite request. Polite bots will honor it, others will not. Human users will still be able to visit your "disallowed" pages. Even those bots that honor your robots.txt (e.g. Google) may still link to your "disallowed" content in their search results (they won't index the content, though).
Using a login mechanism protects your pages from all bots and visitors.
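If you go the robots.txt route, you can check your rules before relying on them. A minimal Python sketch using the standard library's urllib.robotparser follows. The /preview/ folder and the example.com URLs are placeholders, not anything from your site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that keeps client previews in /preview/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /preview/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# A polite bot would ask this question before fetching each URL.
blocked = rules.can_fetch("Googlebot", "https://example.com/preview/client-a/index.html")
allowed = rules.can_fetch("Googlebot", "https://example.com/index.html")
print(blocked, allowed)
```

For a subdomain the file would be the same idea with "Disallow: /", served from that subdomain's own document root. Either way this only models what a well-behaved bot does, so password protection is still the only real lock.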

noindex follow in Robots.txt

I have a wordpress website which has been indexed in search engines.
I have edited Robots.txt to disallow certain directories and webpages from search index.
I only know how to use allow and disallow, but don't know how to use the follow and nofollow in Robots.txt file.
I read somewhere while Googling on this that I can have webpages that won't be indexed in Google but will be crawled for pageranks. This can be achieved by disallowing the webpages in Robots.txt and use follow for the webpages.
Please let me know how to use follow and nofollow in Robots.txt file.
Thanks
Sumit
a.) The follow/nofollow and index/noindex rules are not for robots.txt (which sets general site rules) but for an on-page meta robots tag (which sets the rules for that specific page).
More info about Meta-Robots
b.) Google won't crawl the Disallowed pages but it can index them on SERP (using info from inbound links or website directories like Dmoz).
Having said that, there is no PR value you can gain from this.
More info about Googlebot's indexing behavior
Google did at one time recognize an unofficial Noindex: directive inside robots.txt. Here's Matt Cutts talking about it: http://www.mattcutts.com/blog/google-noindex-behavior/ (Google never documented the directive and stopped honoring it on September 1, 2019, so it is no longer a working option.)
If you put "Disallow" in robots.txt for a page already in Google's index, you will usually find that the page stays in the index, like a ghost, stripped of its keywords. I suppose this is because they know they won't be crawling it, and they don't want the index containing bit-rot. So they replace the page description with "A description for this result is not available because of this site's robots.txt – learn more."
So the problem remains: how do we remove that link from Google, since "Disallow" didn't work? Typically, you'd use meta robots noindex on the page in question, because Google will remove the page from the index once it sees that tag. But with the Disallow directive still in your robots file, Google never crawls the page and so never sees the tag.
So you could remove that page's Disallow rule from robots.txt and add a meta robots noindex tag to the page's header, but now you have to wait for Google to go back and look at a page you told it to forget about.
You could create a new link to it from your homepage in the hope that Google will get the hint. The old shortcut was to add that Noindex rule directly to the robots.txt file (in the post above, Matt says that this results in the removal of the link), but that only worked while Google still honored the directive.
No, you can't.
You can set which directories you want to block and for which bots, but you can't set nofollow via robots.txt.
Use the robots meta tag on the pages to set nofollow.
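For reference, the tag itself is just a pair of keywords in the content attribute. Here is a tiny Python sketch that builds it. `robots_meta` is a made-up helper, and in WordPress you would normally let the theme or an SEO plugin emit the tag instead.

```python
def robots_meta(index, follow):
    """Build a meta robots tag from two booleans."""
    content = ", ".join([
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ])
    return f'<meta name="robots" content="{content}">'

# "Keep this page out of the index, but follow the links on it":
print(robots_meta(index=False, follow=True))
```

The tag goes in the head of each page you want to control. The page must not be Disallowed in robots.txt, otherwise the crawler never fetches it and never sees the tag.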

How to tell search engines NOT to look at this specific link?

Suppose I have a link in the page My Messages, which on click will display an alert message "You must login to access my messages".
Maybe it's better to just not display this link when the user is not logged in, but I want "My Messages" to be visible even then.
I think this link is user-friendly, but search engines will get redirected to the login page, which I think is bad for SEO. Or is it fine?
I thought of displaying My Messages as normal text (not as a link) and then wrapping it in a link tag using JavaScript/jQuery. Is this solution good or bad? Any other ideas? Thank you.
Try to create a robots.txt file and write:
User-agent: *
Disallow: /mymessages
This will keep SEO bots out of that folder
Use a robots.txt file to tell search engines which pages they should not index.
Using nofollow to block access to a page is erroneous; this is not what nofollow is for. The attribute was designed to allow you to place a link in a page without conferring any weight or endorsement on the link. In other words, it marks a link that search engines should not regard as significant for page-ranking algorithms. It does not mean "do not index this page", just "don't follow this particular link to that page".
Here's what Google have to say about nofollow
...However, the target pages may still appear in our index if other
sites link to them without using nofollow or if the URLs are submitted
to Google in a Sitemap. Also, it's important to note that other search
engines may handle nofollow in slightly different ways.
One way of keeping the URL from affecting your rank is setting the rel attribute of your link:
<a href="/mymessages" rel="nofollow">My Messages</a>
Another option is robots.txt, that way you can disallow the bots from the URL entirely.
You might want to use robots.txt to exclude /mymessages. This will also prevent engines which have already visited /mymessages from visiting it again.
Alternatively, add the following to the top of the /mymessages script:
<meta name="robots" content="noindex" />
If you want to tell search engines not to follow a particular link, use rel="nofollow".
It is a way to tell search engines and bots not to follow this link.
Google will then generally not follow that link and will not transfer PageRank or anchor text across it.