How will new 404 not found indexed pages affect rankings? [closed] - seo

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
We had a situation where all of our page links were crawled and continue to be crawled. The page links contain "~/{someTerm}/{someOtherTerm}/__p/##/##".
The problem is that now both Google and MSN bots are crawling tens of thousands of pages that don't need to be crawled and causing a strain on the system.
So we changed the paging link to a JavaScript link and removed all URLs containing "__p", so they now return a 404 - Page Not Found. We only really want page 1 indexed, and maybe a page or two thereafter (but we're not worried about that now).
Is there a way to remove all pages containing "__p" in the URL using WebMasterTools for Google and MSNBot, and if so, how?
Thanks.

I think you should use a <meta> tag in those pages you'd like to remove from search engines.
<meta name="robots" content="noindex, nofollow" />
Also, you can try a robots.txt exclusion, look at this site
User-agent: *
Disallow: /*__p
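To see which URLs a wildcard rule like the one above would cover, here is a minimal sketch of the matching logic. It assumes the crawler honours the `*` wildcard and the trailing `$` anchor (major crawlers such as Googlebot do, but Python's built-in robots.txt parser does not), and the helper names `robots_pattern_to_regex` and `is_disallowed` are made up for this illustration. Note that the pattern has to contain the same number of underscores as the real URLs.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path: str, pattern: str) -> bool:
    """Return True if the URL path is covered by the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# Hypothetical paging URL following the "~/{someTerm}/{someOtherTerm}/__p/##/##" shape.
paged = "/someTerm/someOtherTerm/__p/2/10"

print(is_disallowed(paged, "/*__p"))                       # pattern with two underscores
print(is_disallowed(paged, "/*___p"))                      # pattern with three underscores
print(is_disallowed("/someTerm/someOtherTerm", "/*__p"))   # page 1, no paging segment
```

The first call reports a match, the other two do not, so page 1 stays crawlable while the paged URLs are excluded.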

Related

SEO and Content - Duplicated content [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
I'm working on an application whose content is taken from books, and I can find similar content on competitor sites. I'm not copying their content, only the content from the books, but I fear it could still be marked as duplicate content. How can I resolve this issue or avoid duplication?
You're correct. Without proper care, your site could get dinged for having duplicate content. There are several options that you can take:
1) If you want the page to be indexed, then putting the book content into an iframe (which search engine spiders can't crawl) is a good solution. Include some original content as an introduction to the page and then place the "duplicate content" into an iframe. This will allow the page to get indexed in the search engine results without putting you at risk. I recommend having at least 500 words of unique content per page.
2) The other option, if you don't want to write introductory text, is to tell Google not to index those pages. Add a noindex,follow tag to pages that have duplicate content.
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
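If you want to verify that a page really carries the tag, a rough sketch using only Python's standard library could look like this. The class name `RobotsMetaParser` and the sample HTML are assumptions made for the example, not part of any real site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


# Hypothetical page containing the tag from the answer above.
sample_html = """
<html><head>
<title>Book excerpt</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head><body>...</body></html>
"""

parser = RobotsMetaParser()
parser.feed(sample_html)
print(sorted(parser.directives))
print("noindex" in parser.directives)
```

Normalising to lowercase means the check works regardless of how the tag was typed in the template.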

Best way to prevent Google from indexing a directory [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I've researched many methods on how to prevent Google/other search engines from crawling a specific directory. The two most popular ones I've seen are:
Adding it into the robots.txt file: Disallow: /directory/
Adding a meta tag: <meta name="robots" content="noindex, nofollow">
Which method would work the best? I want this directory to remain "invisible" from search engines so it does not affect any of my site's ranking.
In other words, I want this directory to be neutral/invisible and "just there." I don't want it to affect any ranking. Which method would be the best to achieve this?
Robots.txt is the way to go for this.
According to Google, you only use the meta tag if you don't have rights to create/edit the robots.txt file.
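As a quick way to check what a `Disallow: /directory/` rule covers, here is a small sketch using Python's standard `urllib.robotparser`. The `example.com` URLs and the `ExampleBot` user agent are placeholders. Keep in mind this only tells you whether a well-behaved crawler would fetch a URL; it says nothing about how a search engine treats URLs it already knows about.

```python
from urllib.robotparser import RobotFileParser

# The rule from the question, wrapped in a catch-all user-agent group.
rules = """\
User-agent: *
Disallow: /directory/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def may_crawl(url: str) -> bool:
    """Return True if a crawler obeying these rules may fetch the URL."""
    return parser.can_fetch("ExampleBot", url)


print(may_crawl("http://example.com/directory/page.html"))  # inside the blocked directory
print(may_crawl("http://example.com/index.html"))           # outside it
```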

GET vs POST in SEO [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
My web application retrieves a page for every request generated by a form submission. That form submits to the same URL of the page.
Each time the page loads with a different title tag. Does it indicate different pages with the same URL?
How does it affect SEO? How can I manage this situation?
Edit
This question is not purely SEO-related, nor does it require SEO-specific reasoning or answers; it can also be explained technically in terms of how search engine robots work. If it still seems off-topic to moderators, I request that they explain why.
Try using a rewrite rule to give each page a unique URL. If you're always loading the same URL, Google (or other search engines) will only index that single page.
http://www.seomoz.org/img/upload/anatomy-of-a-url.jpg
In addition to loading the page with a different title tag each time, you need to append some unique text to the URL, such as your GET variable data.
To get crawled by spiders, don't forget to submit your sitemap, containing the relevant URLs, to the search engines.
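One way to picture the "unique URL per page" advice: move the form data into the query string so every distinct result has its own address. The sketch below is only an illustration; the function name `page_url`, the base URL and the parameter names are invented for it.

```python
from urllib.parse import urlencode


def page_url(base: str, params: dict) -> str:
    """Build a distinct, stable URL for one combination of form values."""
    # Sorting the parameters means the same values always give the same URL,
    # so one logical page does not end up under several addresses.
    return base + "?" + urlencode(sorted(params.items()))


print(page_url("http://example.com/search", {"q": "red shoes", "page": 2}))
print(page_url("http://example.com/search", {"page": 2, "q": "red shoes"}))
```

Both calls print the same URL, which could then be listed in the sitemap and rewritten into a prettier path if desired.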

Should sitemap be disallowed in robots.txt? and robot.txt itself? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
This is a very basic question, but I can't find a direct answer anywhere online. When searching for my website on Google, sitemap.xml and robots.txt are returned as search results (amongst more useful results). To prevent this, should I add the following lines to robots.txt?
Disallow: /sitemap.xml
Disallow: /robots.txt
This won't stop search engines from accessing the sitemap or robots file, will it?
Also, or instead, should I use Google's URL removal tool?
You won't stop the crawler from indexing robots.txt, because it's a chicken-and-egg situation. However, if you aren't telling Google and other search engines to look directly at the sitemap, you could lose some indexing weight by denying access to your sitemap.xml.
Is there a particular reason why you don't want users to be able to see the sitemap?
I actually do this, which is specific to the Google crawler:
Allow: /
# Sitemap
Sitemap: http://www.mysite.com/sitemap.xml
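To double-check that a file like this both advertises the sitemap and leaves it fetchable, here is a rough sketch with Python's standard `urllib.robotparser`. The `User-agent: *` line is an assumption added for the example (the parser ignores rules that are not inside a user-agent group), and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The snippet from the answer, preceded by an assumed catch-all user-agent line.
robots_txt = """\
User-agent: *
Allow: /
# Sitemap
Sitemap: http://www.mysite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Sitemap URLs declared in the file (None if there are none).
print(parser.site_maps())

# The sitemap itself is not blocked, so crawlers can still read it.
print(parser.can_fetch("*", "http://www.mysite.com/sitemap.xml"))
```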

Where to put robots.txt file? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Where should I put robots.txt?
domainname.com/robots.txt
or
domainname/public_html/robots.txt
I placed the file in domainname.com/robots.txt, but it's not opening when I type this in the browser.
Screenshot: http://shup.com/Shup/358900/11056202047-My-Desktop.png
Where the file goes in your filesystem depends on what host you're using, so it's hard for us to give a specific answer about that.
The best description is: put it wherever the index.html (or index.php or whatever) file is that represents your homepage. If that's domainname/public_html/index.html, for example, put it in domainname/public_html/robots.txt.
I think the better way to describe it is to have it in the root web folder of your domain, so http://example.com/robots.txt. You can also put your sitemap.xml in the root, or refer to it with a Sitemap: http://example.com/fldr/smap.xml line in your robots.txt.
Don't forget: you can use Google Webmaster Tools to make sure you haven't restricted anything you didn't mean to (you also get to see queries and links, woohoo!).
Suggestion: I'd consider using <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"> if possible, because you will still earn link juice for the links on the page but it won't show up in Google's index. A robots.txt directive, by contrast, can leave a plain URL with no description in the SERPs (it ranks because of anchor text), yet the page loses all the value of the links pointed at it because it is blocked by robots.txt.
In the root of your web directory (where you put the files that show up on your website)
In this case you should put it in domainname/public_html/robots.txt, as the public_html folder is where your index file will be.
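Whatever the folder is called on disk, crawlers only ever ask for /robots.txt at the root of the host. A minimal sketch of that lookup, with a made-up helper name `robots_txt_url` and example.com URLs as placeholders:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://example.com/fldr/page.html?x=1"))
print(robots_txt_url("http://example.com/"))
```

Both calls give the same address, so if that address does not open in a browser, the file is not in the web root.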