Removing URL duplicates when using pretty urls - seo

I'm using pretty URLs in my web app, one example is 'forum/post/1' which invokes PostController in Forum module, which loads a post with id=1. This is what I need but that post is also accessible from 'forum/post/view/id/1'. That's bad, because search crawlers don't like when same page is accessible from several URLs, right?
I'm using Yii framework which supports 'useStrictParsing' option, which tells that incoming request must match at least one "pretty" route, otherwise request fails with 404. However it's not a perfect solution, because I don't have pretty URLs for every controller/action.
Ideally, framework should redirect 'forum/post/view/id/1' to 'forum/post/1' with a 301 status code. How did you solve this problem? It's not Yii/PHP specific question, how does your framework/tool deal with it?

The best way to make sure search engines only rank one page the pretty url over another, if there are multiple ways to view the content is to your a canonical tag within the header of your document
<link rel="canonical" href="http://www.mydomain.com/nice-url/" />
This is very useful with windows based system as IIS is not case sensitive with its web pages but the web standard is case sensitive.
So
www.maydomain.com/Newpage.aspx
www.maydomain.com/newpage.aspx
www.maydomain.com/NEWPAGE.aspx
These are all seen by Google as different pages, and you are then marked down for having a site with duplicate content, but not so with a canonical as each page in the case above would have the same canonical meta tag and the that url is the only one which will be used by the search engines.

Provided that no one links to your non-pretty urls, the search engines will never know that they exist.
If you do want to eliminate them, you could bypass your web framework by adding an alias in you web server's configuration file; the url will be redirected before it ever reaches the framework.
Frameworks like Django, which don't provide 'magic' routing, don't face this issue, the only routes which exist are those which you define manually. In it's case, you could define a view for the non-pretty url which returns the appropriate redirect.

Related

How do the Facebook like button and Google +1 button deal with a redirected url? [duplicate]

I understand the og:url meta tag is the canonical url for the resource in the open graph.
What strategies can I use if I wish to support 301 redirecting of the resource, while preserving its place in the open graph? I don't want to lose my likes because i've changed the URLs.
Is the best way to do this to store the original url of the content, and refer to that? Are there any other strategies for dealing with this?
To clarify - I have page:
/page1, with an og:url of http://www.example.com/page1
I now want to move it to
/page2, using a 301 redirect to http://www.example.com/page2
Do I have any options to avoid losing the likes and comments other than setting the og:url meta to /page1?
Short answer, you can't.
Once the object has been created on Facebook's side its URL in Facebook's graph is fixed - the Likes and Comments are associated with that URL and object; you need that URL to be accessible by Facebook's crawler in order to maintain that object in the future. (note that the object becoming inaccessible doesn't necessarily remove it from Facebook, but effectively you'd be starting over)
What I usually recommend here is (with examples http://www.example.com/oldurl and http://www.example.com/newurl):
On /newpage, keep the og:url tag pointing to /oldurl
Add a HTTP 301 redirect from /oldurl to /newurl
Exempt the Facebook crawler from this redirect
Continue to serve the meta tags for the page on http://www.example.com/oldurl if the request comes from the Facebook crawler.
No need to return any actual content to the crawler, just a simple HTML page with the appropriate tags
Thus:
Existing instances of the object on Facebook will, when clicked, bring users to the correct (new) page via your redirect
The Like button on the (new) page will still produce a like of the correct object (but at the old URL)
If you're moving a lot of URLs around or completely rewriting your URL scheme you should use the new URLs for new articles/products/etc, but you'll need to keep the redirect in place if you want to retain likes, comments, etc on the older content.
This includes if you're changing domain.
The only problem here is maintaining the old URL -> new URL mapping somewhere in your code, but it's not technically difficult, just an additional thing to maintain in the future.
BTW, The Facebook crawler UA is currently facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
I'm having the same problem with my old sites. Domains are changing, admins want to change urls for seo etc
I came to conclusion its best to have some sort uniqe id in db just for facebook - from the beginning. For articles for example I have myurl.com/a/123 where 123 is ID of the article.
Real url is myurl.com/category/article-title. Article can then be put in different category, renamed etc with extensive logic for 301 redirects behind it. But the basic fb identifier can stay the same for ever.
Of course this is viable only when starting with a fresh site or when implementing fb comments for the first time.
Just an idea if you can plan ahead :) Let me know what you think.

How to use regular urls without the hash symbol in spine.js?

I'm trying to achieve urls in the form of http://localhost:9294/users instead of http://localhost:9294/#/users
This seems possible according to the documentation but I haven't been able to get this working for "bookmarkable" urls.
To clarify, browsing directly to http://localhost:9294/users gives a 404 "Not found: /users"
You can turn on HTML5 History support in Spine like this:
Spine.Route.setup(history: true)
By passing the history: true argument to Spine.Route.setup() that will enable the fancy URLs without hash.
The documentation for this is actually buried a bit, but it's here (second to last section): http://spinejs.com/docs/routing
EDIT:
In order to have urls that can be navigated to directly, you will have to do this "server" side. For example, with Rails, you would have to build a way to take the parameter of the url (in this case "/users"), and pass it to Spine accordingly. Here is an excerpt from the Spine docs:
However, there are some things you need to be aware of when using the
History API. Firstly, every URL you send to navigate() needs to have a
real HTML representation. Although the browser won't request the new
URL at that point, it will be requested if the page is subsequently
reloaded. In other words you can't make up arbitrary URLs, like you
can with hash fragments; every URL passed to the API needs to exist.
One way of implementing this is with server side support.
When browsers request a URL (expecting a HTML response) you first make
sure on server-side that the endpoint exists and is valid. Then you
can just serve up the main application, which will read the URL,
invoking the appropriate routes. For example, let's say your user
navigates to http://example.com/users/1. On the server-side, you check
that the URL /users/1 is valid, and that the User record with an ID of
1 exists. Then you can go ahead and just serve up the JavaScript
application.
The caveat to this approach is that it doesn't give search engine
crawlers any real content. If you want your application to be
crawl-able, you'll have to detect crawler bot requests, and serve them
a 'parallel universe of content'. That is beyond the scope of this
documentation though.
It's definitely a good bit of effort to get this working properly, but it CAN be done. It's not possible to give you a specific answer without knowing the stack you're working with.
I used the following rewrites as explained in this article.
http://www.josscrowcroft.com/2012/code/htaccess-for-html5-history-pushstate-url-routing/

Hash character in URLs (accessing and redirecting in Apache)

It looks as though this question has been asked in part by some others, but I can't find the answer I'm looking for specifically, so I thought I'd pose my particular scenario in case anyone is able to help.
We have an old website (developed externally by a third party) that is due to be retired and replaced by a new site designed in house. For reasons best known to themselves, the developers of the old site used the hash character as part of the URL for the old site (www.mysite.com/#/my-content-stuff). To assist with the transition and help with SEO I need to set up 301 redirects for the top performing URLs from the old site. As I'm now discovering however, I'm not able to set up a simple redirect in the .htaccess file as I believe it takes the hash character to be a comment and ignores the remainder of the line. I've tried escape characters, using %23 instead, wildcard matching, nothing seems to work.
As a workaround, I wondered about simply creating dummy files with the same paths and URLs as the old site had, then simply creating HTML redirects within them to drive traffic to the correct new pages, but it looks as though the server is doing something similar regarding the hash character in the URL, and ignoring anything afterit. So, if I create a sub-folder on my news server called '#' and create a file in there called 'test.html', I expected to be able to just go to 'www.myNEWsite.com/#/test.html', but it just takes me to the default root file of my site.
Please can anyone shed any light on how I might get around this? I must admit I'm not that clued up on Apache so I'm having to learn a lot as I go.
Many thanks in advance for any pointers or info anyone can provide.
Cheers,
Rich
A hash character in the URL specifies the anchor, and it's not even sent to your webserver. A redirect is impossible on the server side, and the old developer probably did it using JavaScript. Implement fallback URLs without the hash instead, and have a global JavaScript script detect these URLs and redirect automatically.
Hash tags cannot be read by the server. They are regarded as locations within the document and are therefore not exposed to the server. The client is the only one whom see's these. The best you could do is use a "meta refresh" tag, or alternatively, you could use javascript to detect the url, and if its one which requires 301 redirection, use "window.location" to move the user to a full url where mod_rewrite or a php page can issue a 301 header.
However neither are SEO friendly and only really solve the issue for users that click onto an old link via an external site
<!-- Put in head tag so the page does not wait to load the content-->
<script type="text/javascript">
if(window.location.hash != "") {
var h = window.location.hash.match(/#\/?(.*)/i)[1];
switch(h) {
case "something_old":
window.location = "/something_new.html";
break;
case "something_also_old":
window.location = "/something_also_new.html";
break;
}
}
</script>

Can search engines index pages generated by server side code?

I'm guessing a site like stack overflow doesn't keep an html file around for every question ever asked. Instead, server-side code creates the page every time a question is clicked on(I think). Is it possible for search engines to index every quesiton on Stack Overflow, or would a page-per-question need to be kept in the directory so the search engine can crawl it?
Yes. Search engines can index dynamically generated pages no problem. In fact, from the search engine bot's perspective, it can't really even distinguish between a dynamically generated page and a static one.
You might be interested by the Dynamic URLs vs. static URLs post on the Official Google Webmaster Central Blog.
Yes it's perfectly possible - when a link is followed the server returns HTML just like any other web page. The only difference is that the server generated it, rather than a person.
As far as the client (be it a browser or search engine) is concerned, there is no difference between a server-generated page and a static file. They're virtually indistinguishable (depending on how the page is generated, it might be missing Last-Modified headers, etc). As such, yes, search engines can index generated pages without a problem.
That said, there is something to be said for giving them a hint. Using sitemaps, for example, gives a search engine a nice listing of all your pages, so it's less likely to miss them. More importantly, it can summarize last modified times, to focus the search engine's attention on what has changed recently. This isn't mandatory, but it does help - regardless of whether the pages are static HTML or generated.
Any link that uses a GET can be followed by most crawlers. Anything that requires a POST will generally be ignored.
The mechanism for generating the page is irrelevant.
yes if this is not restricted by robot.txt or meta tags.Search engine requests web page like normal user,no one have access to server side code(if your site isn't hacked))
Search engines can see pretty much anything on a given Web page that isn't hidden behind client-side code (i.e., JavaScript).
So, if there's a URL that you can enter into your browser's address bar to get this page, and this page is linked to from somewhere, a search engine will find it and "see" the same content that you do. The fact that the page was generated dynamically by a server is irrelevant to a search engine, since what is sent to a browser upon requesting a URL is still just an HTML file.
In other words, that HTML file doesn't exist in the same form on the server - i.e., it's actually some server-side code that generates HTML, not a static HTML file - but that's not what a search engine is crawling though and indexing, rather links to document URLs that are exactly what you see in your browser's address bar.

Google Webmaster Tools - Remove query parameters from URL

I am using JBoss Seam on a Jetty web server and am having some issues with the query parameters breaking links when they appear in google searches.
The first parameter is one JBoss Seam uses to track conversations, cid or conversationId. This is a minor issue as Google is complaining I am submitting different urls with the same information.
Secondly, would it make sense to publish/remove urls via the Google Webmaster API instead of publishing/removing via the sitemap?
Walter
Hey Walter, I would recommend that you use the rel=canonical tag to tell the search engines to ignore certain parameters in your URL strings. The canonical tag is a common standard that Google, Yahoo and Microsoft have committed to supporting.
For example, if JBoss is creating URLs that look like this: mysite.com?cid=FOO&conversationId=BAR, then you can create a canonical tag in the section of your website like this:
<html>
<head>
<link rel="canonical" href="http://mysite.com" />
</head>
</html>
The search engines will use this information to normalize the URLs on your website to the canonical (or shortest & most authoritative) version. Specifically, they will treat this as a 301 redirect from the URL of the HTTP request to the URL specified in the canonical tag (as long as you haven't done anything silly, like make it an infinite loop, or pointed to a URL that doesn't exist).
While the canonical tag is pretty fricken cool, it is only a 90% solution, in that you can still run into issues with metrics tracking with all the extra parameters on your website. The best solution would be to update your infrastructure to trap these tracking parameters, create a cookie, and then use a 301 redirect to redirect the URL to the canonical version. However, this can be a prohibitive amount of work for that extra 10% gain, so many people prefer to start with the canonical tag.
As for your second question, generally you don't want to remove these URLs from Google if people are linking to them. By using the canonical tag, you achieve the same goal, but don't loose any value of the inbound links to your website.
For more information about the canonical tag, and the specific issues & solutions, check out this article I wrote on it here: http://janeandrobot.com/library/url-referrer-tracking.
Google Webmaster Tools will tell you about duplicate titles and other issues that Google see that are being caused by "duplicates" that are really the same page being served up with two different URL versions. I suggest trying to make sure the number of errors listed in Webmaster Tools account under duplicate titles is as close to zero as possible.