I am developing an app that allows people to publish content to my site and then have it pushed to their blog. I don't want to get hit by Google or the other search engines for duplicate content, so what can I do to avoid being penalized? Thanks.
You need to figure out which site (yours or theirs) should be treated as the canonical source of the content. Depending on your answer, the following applies:
Your site canonical:
- reference the URL with the rel="canonical" link element.
- delay the push to their blog by 24 hours
- update the URL in your XML sitemap with a time-stamp
- make the href values of all links in the article absolute (using your domain)
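As a sketch, the canonical reference in the first bullet would go on the copy pushed to the partner blog, pointing back at your original URL (the domain and path here are hypothetical):

```html
<!-- On the copy pushed to their blog: declare your URL as canonical.
     Domain and path are placeholders. -->
<head>
  <link rel="canonical" href="https://www.yoursite.com/articles/example-post" />
</head>
```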
Their site canonical:
- reference their site with a rel="canonical" link element in your <head>
- push instantly to their blog
- don't include any reference to the article in your XML sitemap
- consider using "noindex, follow" in your robots meta tag
- make the href values of all links in the article relative
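For the "their site canonical" case, the head of your own copy would combine the canonical reference and the robots meta tag from the bullets above (URLs are hypothetical):

```html
<!-- On your copy: point the canonical at their URL and keep this
     page out of the index while still letting crawlers follow links. -->
<head>
  <link rel="canonical" href="https://theirblog.example.com/example-post" />
  <meta name="robots" content="noindex, follow" />
</head>
```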
Then it comes down to what control you can exert on their site - but ultimately it's up to them.
Related
I was searching the web after analyzing the link structure of Yoast, where he uses links that redirect users to a different page.
Here is an example:
https://yoast.com/out/synthesis/
Can someone tell me what this is called, or how I create such links as well?
It's actually really simple. He isn't using it for SEO purposes, since it's just a 301 redirect. He is purposefully hiding the affiliate URL AND adding 'onclick' Google Analytics tracking to the link. Also - the "/out/" directory is blocked by robots.txt and redirects back to the index page.
To answer your question:
This is not for SEO reasons. He is using it both for tracking clicks and for hiding his affiliate link/URL.
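A minimal sketch of that "/out/" pattern, assuming an Apache server (the affiliate destination URL is a placeholder):

```
# robots.txt - keep crawlers out of the redirect directory
User-agent: *
Disallow: /out/

# .htaccess - send each /out/ slug to its affiliate URL with a 301
Redirect 301 /out/synthesis https://affiliate.example.com/?ref=yoast
```

The robots.txt rule stops the redirect URLs themselves from being crawled, while the 301 still works for human visitors who click the link.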
These are called internal links - links that point to one of your own domain or subdomain pages. Internal links add value for SEO because they make crawlers aware of those existing pages. There are many options for generating internal links, depending on your page structure. Some common options are an HTML sitemap, like TripAdvisor uses, or header and footer links. For an example of an HTML sitemap, go to http://www.tripadvisor.com/ and scroll all the way down to the footer section. There you can see a sitemap link, which is a pathway to many internal links.
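As a simple illustration of the footer approach (the path is a placeholder):

```html
<!-- A footer link to an HTML sitemap page, a common way to expose
     internal links to crawlers from every page on the site -->
<footer>
  <a href="/sitemap">Site Map</a>
</footer>
```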
I've developed a service that allows users to search for stores on www.mysite.com.
I also have partners that use my service. To the user, it looks like they are on my partner's web site, when in fact they are on my site. I have only replaced my own header and footer with my partner's header and footer.
For the user, it looks like they are on mysite.partner.com when in reality they are on partner.mysite.com.
If you understood what I tried to explain, my question is:
Will Google and other search engines consider this duplicate content?
Update - canonical page
If I understand canonical pages correctly, www.mysite.com is my canonical page.
So when my partner uses mysite.partner.com?store=wallmart&id=123, which "redirects" (CNAME) to partner.mysite.com?store=wallmart&id=123, my server recognizes my sub-domain.
So what I need to do, is to dynamically add the following in my <HEAD> section:
<link rel="canonical" href="https://www.mysite.com?store=wallmart&id=123">
Is this correct?
It's duplicate content but there is no penalty as such.
The problem is, for a specific search Google will pick one version of a page and filter out the others from the results. If your partner is targeting the same region then you are in direct competition.
The canonical tag is a way to tell Google which is the official version. If you use it then only the canonical page will show up in search results. So if you canonicalise back to your domain then your partners will be excluded from search results. Only your domains pages will ever show up. Not good for your partners.
There is no win. The only way your partners will do well is if they have their own content or target a different region and you don't do the canonical tag.
So that your partners have a chance, I would not add the canonical. Then it's down to the Google gods to decide which of your duplicate pages gets shown.
Definitely. You'll want to use canonical tagging to stop this happening.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
Yes, it will be considered duplicate content by Google, because you have only replaced the footer and header. Under recent Google algorithms, content should be unique to a website or blog. If content is not unique, your website may be penalized by Google.
My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering urls with all possible permutations of additional parameters such as tags, search phrases, versions, dates etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google's Webmaster-tools Google spidered only about 150 of the 200 entries in the xml sitemap. It looks as if Google has not yet seen all of the content years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spider those dynamic URLs. In addition, I plan to disable some URL parameters in the Webmaster Tools "website configuration" --> "URL parameter" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
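The plan described above would look roughly like this (the paths and parameter names are hypothetical, and the `*` wildcard is a Google extension to robots.txt):

```
# Block crawling of dynamic search-result permutations
User-agent: *
Disallow: /search
Disallow: /*?tag=
Disallow: /*?date=
```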
This is exactly what canonical URLs are for. If one page (e.g. an article) can be reached by more than one URL, then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in its search results.
So do not block any of your articles and you don't need to enter any parameters, either. Just use canonical URLs and you'll be fine.
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have search results pages include a robots meta tag saying noindex. This will help get them removed from Google's index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
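Concretely, that tag on the search-result templates would look like this:

```html
<!-- On search-result pages only: drop the page from the index
     but still let crawlers follow links to the real articles -->
<meta name="robots" content="noindex, follow">
```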
Other measures:
In GWMT tell Google to ignore all those search parameters. Just a band aid but may help speed up the recovery.
Don't block the search page in the robots.txt file as this will block the robots from crawling and cleanly removing those pages already indexed. Wait till your index is clear before doing a full block like that.
Your search system is probably based on links (<a> tags) or GET-based forms rather than POST-based forms - this is why the results got indexed. Switching to POST-based forms should stop robots from trying to index those pages in the first place. JavaScript or AJAX is another way to do it.
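The difference between the two form types is just the `method` attribute (action path and field name are placeholders):

```html
<!-- GET form: the query ends up in a crawlable URL like /search?q=term,
     so crawlers can discover and index result pages -->
<form action="/search" method="get">
  <input type="text" name="q">
</form>

<!-- POST form: no crawlable results URL is produced -->
<form action="/search" method="post">
  <input type="text" name="q">
</form>
```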
Here's the story: I have a website for a local company that publishes quality content to an on-site blog. We're expanding to a new geographic region, and I'm in the process of building another website targeting the new region.
I'd like to include a blog on the new website, which will pull in any content/posts from our existing blog. I primarily want to do this for the added SEO benefit of having fresh, relevant content that's frequently updated on the website. However, I would of course need to add a rel=canonical link back to the original blog in order to ensure I don't get any duplicate content penalties for posting the same content across two separate domains.
My question is whether adding that rel=canonical link will eliminate the SEO value of that content being posted to the new website?
I'm not really talking about which blog post would show up in SERPs, as I understand that the point of the rel=canonical tag is to provide attribution to the primary source of the content. I'm more concerned about whether using a rel=canonical on the content would eliminate the secondary SEO benefit of having relevant, frequently updated content on your website, due to Google being essentially "blind" to the duplicate content.
In most cases the answer to your question is "yes". With regard to Google - see Google's answer (the last question).
Other search engines may not support this attribute.
Regards.
If you pull content from somewhere else to post, it isn't "fresh". Fresh content is newly written. You won't get any credit for fresh content whether or not you use rel=canonical
The canonical tag behaves in a similar way to a 301 redirect. That is, ranking that the page with the canonical tag has will mostly get transferred to the page it points to.
I am using JBoss Seam on a Jetty web server and am having some issues with the query parameters breaking links when they appear in google searches.
The first parameter is one JBoss Seam uses to track conversations, cid or conversationId. This is a minor issue as Google is complaining I am submitting different urls with the same information.
Secondly, would it make sense to publish/remove urls via the Google Webmaster API instead of publishing/removing via the sitemap?
Walter
Hey Walter, I would recommend that you use the rel=canonical tag to tell the search engines to ignore certain parameters in your URL strings. The canonical tag is a common standard that Google, Yahoo and Microsoft have committed to supporting.
For example, if JBoss is creating URLs that look like this: mysite.com?cid=FOO&conversationId=BAR, then you can create a canonical tag in the <head> section of your website like this:
<html>
<head>
<link rel="canonical" href="http://mysite.com" />
</head>
</html>
The search engines will use this information to normalize the URLs on your website to the canonical (or shortest & most authoritative) version. Specifically, they will treat this as a 301 redirect from the URL of the HTTP request to the URL specified in the canonical tag (as long as you haven't done anything silly, like make it an infinite loop, or pointed to a URL that doesn't exist).
While the canonical tag is pretty fricken cool, it is only a 90% solution, in that you can still run into issues with metrics tracking with all the extra parameters on your website. The best solution would be to update your infrastructure to trap these tracking parameters, create a cookie, and then use a 301 redirect to redirect the URL to the canonical version. However, this can be a prohibitive amount of work for that extra 10% gain, so many people prefer to start with the canonical tag.
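A hedged sketch of that infrastructure-level approach, assuming Apache with mod_rewrite (the parameter names follow the cid/conversationId example above; note this simple version drops the entire query string, so it would need refinement if other parameters must survive):

```
RewriteEngine On
# If a tracking parameter is present in the query string,
# 301-redirect to the clean URL with the query string removed
RewriteCond %{QUERY_STRING} (^|&)(cid|conversationId)= [NC]
RewriteRule ^(.*)$ /$1? [R=301,L]
```

Setting the tracking cookie before the redirect would be handled by the application (or a `CO=` flag on the rule), which is the "extra work" referred to above.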
As for your second question, generally you don't want to remove these URLs from Google if people are linking to them. By using the canonical tag, you achieve the same goal but don't lose any value from the inbound links to your website.
For more information about the canonical tag, and the specific issues & solutions, check out this article I wrote on it here: http://janeandrobot.com/library/url-referrer-tracking.
Google Webmaster Tools will tell you about duplicate titles and other issues that Google sees being caused by "duplicates" that are really the same page served up with two different URL versions. I suggest making sure the number of errors listed under duplicate titles in your Webmaster Tools account is as close to zero as possible.