Should sitemap.xml be disallowed in robots.txt? And robots.txt itself? [closed]

This is a very basic question, but I can't find a direct answer anywhere online. When I search for my website on Google, sitemap.xml and robots.txt are returned as search results (among more useful results). To prevent this, should I add the following lines to robots.txt?
Disallow: /sitemap.xml
Disallow: /robots.txt
This won't stop search engines from accessing the sitemap or robots file, will it?
Also (or instead), should I use Google's URL removal tool?

You won't stop the crawler from reading robots.txt, because it's a chicken-and-egg situation. However, if you aren't pointing Google and the other search engines directly at the sitemap, you could lose some indexing benefit by denying your sitemap.xml.
Is there a particular reason you don't want users to be able to see the sitemap?
I actually do this, which is specific just to the Google crawler:
User-agent: Googlebot
Allow: /
# Sitemap
Sitemap: http://www.mysite.com/sitemap.xml
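If the goal is only to keep sitemap.xml out of the search results while leaving it fetchable, another option is serving it with an X-Robots-Tag: noindex HTTP response header, which keeps a file out of the index without blocking crawlers from reading it. A minimal sketch, assuming Apache with mod_headers enabled (adjust for your server):
<Files "sitemap.xml">
  Header set X-Robots-Tag "noindex"
</Files>
Unlike a Disallow rule, this doesn't prevent search engines from fetching the sitemap, so they can still use it for discovery.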

Google webmaster tools and duplicated links [closed]

I have some problems and I'm confused now with Google Webmaster Tools.
For example, I have:
http://site.com/link/text.html
and I'm using a custom variable in Google Analytics to track clicks from an external site, for example:
http://site.com/link/text.html?promoid=123
Now in Webmaster Tools I have tons of duplicated links.
In robots.txt I added
Disallow: *?promoid
but I'm not sure if this is a good idea...
What should I do now: keep using the robots file to disallow promoid, or use rel="canonical" instead?
Edit: all the links with ?promoid=123 are posted on an external site, not on mine.
This is exactly what canonical URLs are for. A canonical tag tells Google that http://site.com/link/text.html is the main URL to use for that page, and that any other URL carrying the same canonical is just a minor variation of it.
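For reference, the tag goes in the <head> of the page, pointing at the clean URL from the question as the canonical target:
<link rel="canonical" href="http://site.com/link/text.html">
Google should then consolidate the ?promoid=123 variants onto that one URL, while the tracking parameter keeps working for visitors.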

Best way to prevent Google from indexing a directory [closed]

I've researched many methods on how to prevent Google/other search engines from crawling a specific directory. The two most popular ones I've seen are:
Adding it into the robots.txt file: Disallow: /directory/
Adding a meta tag: <meta name="robots" content="noindex, nofollow">
Which method would work best? I want this directory to remain "invisible" to search engines so it doesn't affect my site's ranking; in other words, I want it to be neutral and "just there."
Robots.txt is the way to go for this.
According to Google, you should only use the meta tag if you don't have the rights to create or edit the robots.txt file.
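A minimal robots.txt for this, using the directory from the question:
User-agent: *
Disallow: /directory/
One caveat worth knowing: robots.txt blocks crawling, not indexing, so a disallowed URL can still show up in results (without a snippet) if other sites link to it. The noindex meta tag keeps the page out of the index entirely, but only if the page remains crawlable so the tag can be read.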

best way to allow search engine to crawl site [closed]

Thanks for reading my question. I am building a site that will list products from each manufacturer. I'm planning to structure the URLs in the following variations:
www.mysite.com/manufacturer_name/product_name/product_id
www.mysite.com/product_name/product_id
www.mysite.com/manufacturer_name
There are millions of products and I want all the major search engines to crawl them. What is the best way to go about doing that?
Would simply submitting the site to all the search engines be enough? I would assume that if I submit the manufacturer page, which lists all the manufacturer names as links, the search engines will follow each link and then follow all the product links displayed within each manufacturer page (I will have paging for products), so they can keep crawling the site for more products until they run out of pages.
Would that be sufficient to list every product on every search engine? Or is there a newer, better way to do this? Maybe there are SEO tricks I'm not aware of. I'm hoping you can point me in the right direction.
I've previously used robots.txt to tell search engines which pages to crawl, and that seemed to work fine.
Thanks,
bad_at_coding
Submit an XML sitemap. The easiest way to do this is to link to it in your robots.txt file.
Sample robots.txt file:
Sitemap: http://example.com/sitemap_location.xml
See Google's Submitting Sitemaps documentation for more on this topic.
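For a catalog of millions of products, note that the sitemap protocol caps each file at 50,000 URLs, so you would split the catalog across several sitemap files and reference them from a sitemap index. A minimal sketch, with hypothetical file names modeled on the URLs in the question:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.mysite.com/sitemap-products-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.mysite.com/sitemap-products-2.xml</loc>
  </sitemap>
</sitemapindex>
Each referenced file is then an ordinary <urlset> sitemap listing up to 50,000 product URLs.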

How will new 404 not found indexed pages affect rankings? [closed]

We had a situation where all of our page links were crawled and continue to be crawled. The page links contain "~/{someTerm}/{someOtherTerm}/__p/##/##".
The problem is that now both Google and MSN bots are crawling tens of thousands of pages that don't need to be crawled and causing a strain on the system.
So we changed the paging link to a JavaScript link and removed all URLs containing "__p", so they now return a 404 - Page Not Found. We only really want page 1 indexed, and maybe a page or two after that (but we're not worried about that for now).
Is there a way to remove all pages containing "__p" in the URL using Webmaster Tools for Google and MSNBot, and if so, how?
Thanks.
I think you should use a <meta> tag in those pages you'd like to remove from search engines.
<meta name="robots" content="noindex, nofollow" />
Also, you can try robots.txt exclusion:
User-agent: *
Disallow: /*__p
Use one approach or the other, though: a URL blocked by robots.txt can no longer be crawled, so its noindex meta tag will never be seen.

Asterisk in robots.txt [closed]

Wondering if the following will work for Google in robots.txt:
Disallow: /*.action
I need to exclude all URLs ending with .action.
Is this correct?
To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
So you are close: use Disallow: /*.action$ with a trailing "$".
Of course, that's merely what Google suggests: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449
All bots are different.
The original robots.txt specification provides no way to include wildcards; rules match only the beginning of URIs.
Google implements non-standard extensions, described in its documentation (look in the "Manually create a robots.txt file" section under "To block files of a specific file type").
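To make the behavior of Google's extension concrete, here is a minimal illustrative sketch in Python (not an official parser, just the documented matching rules: * matches any sequence of characters, and a trailing $ anchors the rule to the end of the URL):
import re

def google_rule_matches(rule, path):
    """Check a Disallow rule against a URL path using Google's
    wildcard extensions: '*' matches any character sequence,
    a trailing '$' anchors the rule to the end of the path,
    and everything else is an ordinary prefix match."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'
    pattern = "^" + re.escape(rule).replace(r"\*", ".*")
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

print(google_rule_matches("/*.action$", "/orders/submit.action"))      # True
print(google_rule_matches("/*.action$", "/orders/submit.actions"))     # False
print(google_rule_matches("/*.action$", "/orders/submit.action?x=1"))  # False: '$' anchors the match
Note the last case: Google matches rules against the path including the query string, so with the trailing $ a URL that takes parameters no longer matches. Keep that in mind if your .action URLs accept query strings.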
I don't think it will work; you would need to move all the .action files to a location that you then disallow.