Should I be concerned if googlebot is trying to index marketing URLs? - seo

I have recently started using Google Webmaster Tools.
I was quite surprised to see just how many links Google is trying to index.
http://www.example.com/?c=123
http://www.example.com/?c=82
http://www.example.com/?c=234
http://www.example.com/?c=991
These are all campaigns that exist as links from partner sites.
Right now they're all being denied by my robots.txt file until the site is complete, as is EVERY page on the site.
I'm wondering what the best approach is for dealing with links like this, before I make my robots.txt file less restrictive.
I'm concerned that they will be treated as different URLs and start appearing in Google's search results. They all correspond to the same page, give or take. I don't want people finding them as they are and clicking on them.
My best idea so far is to add a robots meta tag whenever the page is requested with a query string, as follows:
// DO NOT TRY THIS AT HOME. See edit below
<% if (Request.QueryString.Count > 0) { %>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<% } %>
Do I need to do this? Is this the best approach?
Edit: This turns out NOT TO BE A GOOD APPROACH. Google sees NOINDEX on a page that has the same content as another page that does not have NOINDEX. Apparently it figures they're the same thing and the NOINDEX takes precedence. My site completely disappeared from Google as a result. Caveat: it could have been something else I did at the same time, but I wouldn't risk this approach.

This is the sort of thing that rel="canonical" was designed for. Google posted a blog article about it.
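As a rough illustration, here is a minimal Python sketch that builds the canonical tag for one of the campaign URLs. It assumes the whole query string is tracking-only, which matches the ?c=123 URLs in the question; canonical_link is a made-up helper name, not part of any framework.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(request_url):
    """Build a <link rel="canonical"> tag pointing at the URL without its query string."""
    parts = urlsplit(request_url)
    # Assumption: every query parameter is a campaign marker, so all are dropped.
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path or "/", "", ""))
    return '<link rel="canonical" href="%s" />' % escape(canonical, quote=True)


print(canonical_link("http://www.example.com/?c=123"))
```

The tag goes in the head of every variant of the page, so each ?c=... URL names the plain URL as the preferred one.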

Yes, Google would interpret them as different URLs.
Depending on your web server you could use a rewrite filter to remove the parameter for search engines, e.g. UrlRewriteFilter for Tomcat or mod_rewrite for Apache.
Personally I'd just redirect to the same page with the tracking parameter removed.
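A minimal sketch of that redirect logic in Python, in case it helps. It assumes c is the only campaign parameter (as in the question) and leaves any other parameters alone; redirect_target is a hypothetical helper, and turning its result into an actual 301 response depends on your server.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "c" is the only campaign parameter, as in the question's URLs.
TRACKING_PARAMS = {"c"}


def redirect_target(url):
    """Return the URL to redirect to, or None if no tracking parameter is present."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k not in TRACKING_PARAMS]
    if len(kept) == len(pairs):
        return None  # nothing to strip, serve the page normally
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Returning None for clean URLs matters: redirecting a URL to itself would create a loop.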

That seems like the best approach unless the page exists in its own folder, in which case you can modify the robots.txt file to exclude just that folder.

For resources that should not be indexed I prefer to do a simple return in the page load:
// IsBot is a user-agent check that this answer does not show.
if (IsBot(Request.UserAgent))
    return;

Related

301 redirect vs canonical links?

For technical reasons on a site we may have two or more links that refer to the same product page. For example:
http://example.com/a-nice-product-no1234.html
and:
http://example.com/a-nice-foobar-product-no1234.html
Apparently the first one is the "correct" link. What is the right approach when the second link is opened?
Approach 1)
Redirect 301 to the first link
Approach 2)
Status 200 and
<link rel="canonical" href="http://example.com/a-nice-product-no1234.html">
in the HTML head? Is approach 2) applicable to search engines other than Google? Other suggestions?
Thank you!
If
http://example.com/a-nice-foobar-product-no1234.html
is in any way invalid, or you intend to remove it, a 301 Moved Permanently is the way to go.
A technical discussion from Google of rel="canonical" shows it should be used to indicate original content, as opposed to, say, the same content ordered differently, using different formatting and so on.
The redirect also has the benefit that users stop bookmarking and using links to these "slightly invalid" pages, so their use lessens over time.
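A sketch of approach 1 in Python, under some assumptions of mine: product URLs end in -no<digits>.html as in the question, and the site can look up the one correct path per product number. CANONICAL_PATHS and resolve are hypothetical names for illustration only.

```python
import re

# Hypothetical lookup table: product number -> the one "correct" path.
CANONICAL_PATHS = {"1234": "/a-nice-product-no1234.html"}
PRODUCT_RE = re.compile(r"-no(\d+)\.html$")


def resolve(path):
    """Return (status, location) for a requested product path."""
    match = PRODUCT_RE.search(path)
    canonical = CANONICAL_PATHS.get(match.group(1)) if match else None
    if canonical is None:
        return 404, None       # unknown product or unrecognised URL shape
    if path != canonical:
        return 301, canonical  # alias URL: redirect permanently
    return 200, None           # already the canonical URL
```

With this, /a-nice-foobar-product-no1234.html answers with a 301 to the first link, while the first link itself is served normally.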

How to tell search engines NOT to look at this specific link?

Suppose I have a link on the page, My Messages, which on click displays an alert message: "You must login to access my messages".
Maybe it's better to just not display this link when the user is not logged in, but I want "My Messages" to be visible even if the user is not logged in.
I think this link is user-friendly, but search engines will get redirected to the login page, which I think is bad for SEO. Or is it fine?
I thought of keeping My Messages displayed as normal text (not as a link), then wrapping it in a link tag using JavaScript/jQuery. Is this solution good or bad? Other ideas, please? Thank you.
Try to create a robots.txt file and write:
User-agent: *
Disallow: /mymessages
This will keep search engine bots out of that folder.
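If you want to check what those two lines actually block before deploying them, Python's standard urllib.robotparser can evaluate them locally. This is only a sketch: the allowed helper and the sample paths are mine, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, fed to the parser as lines.
RULES = ["User-agent: *", "Disallow: /mymessages"]

parser = RobotFileParser()
parser.parse(RULES)


def allowed(path, agent="Googlebot"):
    """Report whether the given user agent may fetch the path under RULES."""
    return parser.can_fetch(agent, path)


for path in ("/mymessages", "/mymessages/inbox", "/about"):
    print(path, allowed(path))
```

Note that Disallow matches by prefix, so everything whose path starts with /mymessages is covered, not just the one page.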
Use a robots.txt file to tell search engines which pages they should not crawl.
Using nofollow to block access to a page is erroneous; this is not what nofollow is for. This attribute was designed to let you place a link in a page without conferring any weight or endorsement on the link. In other words, it's not a link that search engines should regard as significant for page-ranking algorithms. It does not mean "do not index this page", just "don't follow this particular link to that page".
Here's what Google has to say about nofollow:
...However, the target pages may still appear in our index if other
sites link to them without using nofollow or if the URLs are submitted
to Google in a Sitemap. Also, it's important to note that other search
engines may handle nofollow in slightly different ways.
One way of keeping the URL from affecting your rank is setting the rel attribute of your link:
<a href="/mymessages" rel="nofollow">My Messages</a>
Another option is robots.txt, that way you can disallow the bots from the URL entirely.
You might want to use robots.txt to exclude /mymessages. This will also prevent engines which have already visited /mymessages from visiting it again.
Alternatively, add the following to the top of the /mymessages script:
<meta name="robots" content="noindex" />
If you want to tell search engines not to follow a particular link, then use rel="nofollow".
It is a way to tell search engines and bots not to follow this link.
Google will then not follow that link and will not transfer PageRank or anchor text across it.

Short URL or long URL for SEO

I am implementing CS-Cart for a web site. Which of the following is better for SEO? If possible, please give a reason or reference. The site sells books, stamps, CDs, etc.
www.domain.com/book/Java.html (or) www.domain.com/book/programming/Java.html
or
www.domain.com/Java.html
Some say short URLs are good. But isn't it good to state which category the product is in? Thanks
You can go for both, via canonical URLs. For example, in the <head> of both /Java.html and /book/Java.html:
<link rel="canonical" href="/book/programming/Java.html" />
With that, Googlebot (and Yahoo/MS' spiders) will see the current page as a duplicate of the canonical link and ignore it, without the usual demerits that come with dupe content.
Long URLs are good for being descriptive, clear, & searchable, while short URLs are nice for people to send around to friends and whatever social network du jour - chances are you want both.
Maintaining the different URLs & dupes will add some server work though. If it's too much effort, I'd go with the long form for the users' sake & search-ability. "java.html" could just be some random page about coffee, it needs context.
What if it fits into two categories? That would be the case where I say that it's better to go with a short URL because you don't want duplicate content.
Try to get the right size for the content of the site. For example the word "book" is redundant in a URL of a bookshop.

Google Webmaster Tools - Remove query parameters from URL

I am using JBoss Seam on a Jetty web server and am having some issues with the query parameters breaking links when they appear in google searches.
The first parameter is one JBoss Seam uses to track conversations, cid or conversationId. This is a minor issue as Google is complaining I am submitting different urls with the same information.
Secondly, would it make sense to publish/remove urls via the Google Webmaster API instead of publishing/removing via the sitemap?
Walter
Hey Walter, I would recommend that you use the rel=canonical tag to tell the search engines to ignore certain parameters in your URL strings. The canonical tag is a common standard that Google, Yahoo and Microsoft have committed to supporting.
For example, if JBoss is creating URLs that look like this: mysite.com?cid=FOO&conversationId=BAR, then you can create a canonical tag in the <head> section of your page like this:
<html>
<head>
<link rel="canonical" href="http://mysite.com" />
</head>
</html>
The search engines will use this information to normalize the URLs on your website to the canonical (or shortest & most authoritative) version. Specifically, they will treat this much like a 301 redirect from the URL of the HTTP request to the URL specified in the canonical tag (as long as you haven't done anything silly, like make it an infinite loop, or pointed to a URL that doesn't exist).
While the canonical tag is pretty fricken cool, it is only a 90% solution, in that you can still run into issues with metrics tracking with all the extra parameters on your website. The best solution would be to update your infrastructure to trap these tracking parameters, create a cookie, and then use a 301 redirect to redirect the URL to the canonical version. However, this can be a prohibitive amount of work for that extra 10% gain, so many people prefer to start with the canonical tag.
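To make the "trap, cookie, 301" idea a bit more concrete, here is a framework-neutral Python sketch. It assumes the parameters to trap are the cid and conversationId from the example URL above; trap_tracking is a hypothetical helper, and whether a Seam conversation id can safely live in a cookie is something you would need to check for your own application.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the parameters to trap, taken from the example URL.
TRACKED = ("cid", "conversationId")


def trap_tracking(url):
    """Return (status, headers) for a request URL.

    If tracked parameters are present, copy them into cookies and answer
    with a 301 to the same URL without them; otherwise return a plain 200.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    tracked = [(k, v) for k, v in pairs if k in TRACKED]
    if not tracked:
        return 200, []
    kept = [(k, v) for k, v in pairs if k not in TRACKED]
    cookie = SimpleCookie()
    for key, value in tracked:
        cookie[key] = value
    headers = [("Location", urlunsplit(parts._replace(query=urlencode(kept))))]
    headers += [("Set-Cookie", morsel.OutputString()) for morsel in cookie.values()]
    return 301, headers
```

The clean URL returns 200 with no headers, so the redirect fires at most once per visit and cannot loop.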
As for your second question, generally you don't want to remove these URLs from Google if people are linking to them. By using the canonical tag, you achieve the same goal, but don't lose any value of the inbound links to your website.
For more information about the canonical tag, and the specific issues & solutions, check out this article I wrote on it here: http://janeandrobot.com/library/url-referrer-tracking.
Google Webmaster Tools will tell you about duplicate titles and other issues that Google sees that are being caused by "duplicates" that are really the same page being served up under two different URL versions. I suggest trying to make sure the number of errors listed under duplicate titles in your Webmaster Tools account is as close to zero as possible.

About Isolated Page In My Web Site

I produced a page that I do not want search engines to find and crawl.
The advisable solution is robots.txt. But it is not applicable in my situation.
So I isolated this page from my site by removing all links to it from other pages, and I never put its URL on external sites.
Logically, then, it is impossible for search engines to find this page. And that means that no matter how many outbound links this page contains, the PR of the site is safe.
Am I right?
Thank you very much!
Hope this question is programming related!
No, there's still a chance your page can be found by search engine crawlers. For example, it's been speculated that data from the Google Toolbar can be used to alert Googlebot to the presence of a page. And there's still a chance others might link to your page from external sites if the URL becomes known.
Your best bet is to add a robots meta tag to your page. This will prevent it from being indexed, and prevent crawlers from following any links:
<meta name="robots" content="noindex,nofollow" />
If it is on the internet and not restricted, it will be found. It may make it harder to find, but it is still possible a crawler may happen across it.
What is the link so I can check? ;)
If you have outbound links on this "isolated" page then your page will probably show up as a referrer in the logs of the linked-to page. Depending on how closely the owners of the linked-to page track their stats, they may find your page.
I've seen httpd log files turn up in Google searches. This in turn may lead others to find your page, including crawlers and other robots.
The easiest solution might be to password protect the page?