How to cleanup URLs indexed by google - seo

so I made a mistake in my application which caused thousands of URLs to be indexed by google with the session id appended. What should I do to remove all those session id's from the google index? I'd like to only have the page indexed minus the session id.

You can fix this with an edit to your robots.txt. Also, there's a webmasters stackexchange -- consider checking there for your answer in the future, I know they have an SEO tag.

robots.txt stops crawling, it doesnt stop urls being indexed.
You could use a canonical tag to consolidate the urls to their parent url and/or parameter handling in webmaster tools.

Related

Deprecated domain in google index

We have got a deprecated domain www.deprecateddomain.com. Specific fact is that we have got reverse proxy working and redirecting all requests from this domain to the new one www.newdomain.com.
The problem is when you type "deprecateddomain.com" in google search, there is a link to www.deprecateddomain.com in search results besides results with "newdomain.com". It means that there is such entries in google index. Our customer don't want to see links to old site.
We were suggested to create fake robots.txt with Disallow: / directive for www.deprecateddomain.com and reverse proxy rules to get this file from some directory. But after investigation the subject I started hesitating that it will help. Will it remove entries with old domain from index?
Why not to just create the request in search console to remove www.deprecateddomain.com from index? In my opinion it might help.
Anyway, I'm novice in this question. Could you give me advice what to do?
Google takes time to remove old/obsolete entries from its ranking, especially on low visited or low value pages. You have no control on it. Google needs to revisit each page to see the redirection you have implemented.
So DO NOT implement a disallow on the old website, because it will make the problem worse. Bots won't be able to crawls those pages and see the redirection you have implemented. So they will stay longer in the rankings.
You must also make sure you implement a proper 301 redirection (i.e. a permanent one, not a temporary) for all pages of the old website. Else, some pages may stay in the ranking for quite some time.
If some pages are obsolete and should be deleted rather than redirected, return a 404 for them. Google will remove them quickly from its index.

How to tell search engines to only index pages inside my sitemap?

Recently my site got hacked and it has been restored. Thousands of spam URLs were indexed by google. I do have a google webmaster account and I can update and submit my sitemap. But how do I tell google to strictly only index the URLs inside my sitemap? I want to prevent any new spam urls created by hackers from being indexed.
Any parameter inside the sitemap.xml that I can use to do this?
Your sitemap should only include the new URLs and google will crawl and index only them.
If you have removed the old spammed URLs and they are in 404(not found) status, Google will remove them from the index (albeit quite slowly, it could take even 1-2 months).
If you need to remove those URLs from being displayed in the index there's a section about it the webmaster guide: https://support.google.com/webmasters/answer/1663419?hl=en

How to remove duplicate title and meta description tags if google indexed them

So, I have been building an ecommerce site for a small company.
The url structure is : www.example.com/product_category/product_name and the site has around 1000 products.
I've checked google webmaster tools and in the HTML improvements section it shows that I have multiple title and meta description tags for all the product pages. They all appear two times, both:
-www.example.com/product_category/product_name
and
-www.example.com/product_category/product_name/ (with slash in the end)
got indexed as separate pages.
I've added a 301 redirect from every www.example.com/product_category/product_name/ to www.example.com/product_category/product_name, but this was almost two weeks ago. I have resubmitted my sitemap and asked google to fetch the whole page a few times. Nothing has changed, GWT still shows the pages as duplicate tags.
I did not get any manual action message.
So I have two questions:
-how can I accelerate the reindexation process, if it's possible?
-and do these tags hurt my organic search results? I've googled it, yes and some say it does and some say it doesn't.
An option is to set a canonical link on both URLs (with and without /) using the URL without a /. Little by little, Google will stop complaining. Keep in mind Google Webmaster Tools is slow to react, especially when you don't have much traffic or backlinks.
And yes, duplicate tags can influence your rankings negatively because users won't have proper and specific information for each page.
Set a canonical link on both Urls is a solution but it take time from my experience.
The fasted way is to block old URL in robots.txt file.
Disallow: /old_url
canonical tag is option but why you are not adding different title and description for all pages.
you can add dynamic meta tags one time and it will create automatically for all pages so we dont worry about duplication.

Google, do not index YET

In the effort of building a live site on its actual live hosting platform is there a way to tell google to not YET index the website? I found the following:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93710
But would that tell them to never come back or would they simply see the noindex tag and then not list the results, then when it comes back to crawl again later and my site is good to go I would have the noindex removed and the site would then start getting indexed?
Sounds like you want to use a robots.txt file instead:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&topic=2370588&ctx=topic
Update your robots.txt file when you want your content to be indexed.
You can use the robot.txt method.
You can specify which subpage could be spidered. And google comes back, checking the file before indexing. So you can delete the file later in order to get fully indexed.
More Information
About /robots.txt
Robots.txt File Generator
You can always change it. The way Google and other robots find your page is if it is linked to on another page. As long as it isn't linked to on another page, it won't be found. Also, once your site is up, chances are that it will be far back in the list of sites.

will limiting dynamic urls with robots.txt improve my SEO ranking?

My website has about 200 useful articles. Because the website has an internal search function with lots of parameters, the search engines end up spidering urls with all possible permutations of additional parameters such as tags, search phrases, versions, dates etc. Most of these pages are simply a list of search results with some snippets of the original articles.
According to Google's Webmaster-tools Google spidered only about 150 of the 200 entries in the xml sitemap. It looks as if Google has not yet seen all of the content years after it went online.
I plan to add a few "Disallow:" lines to robots.txt so that the search engines no longer spiders those dynamic urls. In addition I plan to disable some url parameters in the Webmaster-tools "website configuration" --> "url parameter" section.
Will that improve or hurt my current SEO ranking? It will look as if my website is losing thousands of content pages.
This is exactly what canonical URLs are for. If one page (e.g. article) can be reached by more then one URL then you need to specify the primary URL using a canonical URL. This prevents duplicate content issues and tells Google which URL to display in their search results.
So do not block any of your articles and you don't need to enter any parameters, either. Just use canonical URLs and you'll be fine.
As nn4l pointed out, canonical is not a good solution for search pages.
The first thing you should do is have search results pages include a robots meta tag saying noindex. This will help get them removed from your index and let Google focus on your real content. Google should slowly remove them as they get re-crawled.
Other measures:
In GWMT tell Google to ignore all those search parameters. Just a band aid but may help speed up the recovery.
Don't block the search page in the robots.txt file as this will block the robots from crawling and cleanly removing those pages already indexed. Wait till your index is clear before doing a full block like that.
Your search system must be based on links (a tags) or GET based forms and not POST based forms. This is why they got indexed. Switching them to POST based forms should stop robots from trying to index those pages in the first place. JavaScript or AJAX is another way to do it.