Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I found this in my root.txt file:
Disallow: /search
What does it mean?
If you're talking about a robots.txt file, then it indicates to web crawlers that they are to avoid going into URLs beginning with /search on that host. Your robots.txt file is related to the Robots Exclusion Standard.
You mention "robot.txt" in the question title and "root.txt" in the body. If this is indeed a robots.txt file, it needs to be named "robots.txt", otherwise it has no effect at all.
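For reference, that rule normally sits inside a User-agent group in robots.txt; a minimal example (the wildcard group here is just an illustration) would be:
User-agent: *
Disallow: /search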
It instructs robots/crawlers/spiders that they shouldn't access anything under that path, or variants of that URL, such as the following examples:
/search
/search?term=x
/search/page/
/search/category=y&term=x
/search/category-name/term/
With regards to the comments above on how this affects indexation (whether or not a search engine or other entity will catalogue the URL), none of them are quite correct.
It should be noted that instructions in a robots.txt file are crawl directives, not indexation directives. Whilst compliant bots will read the robots.txt file prior to requesting a URL and determine whether or not they're allowed to crawl it, disallow rules do not prevent indexation (nor even, in the case of non-compliant bots, prevent access/crawling/scraping).
You'll see instances periodically of search results in Google with a meta description alluding to the page having been included though inaccessible; something along the lines of "we weren't able to show a description because we're not allowed to crawl this page". This typically happens when Google (or whichever engine) encounters a disallowed URL but believes that the URL should still be catalogued - in Google's case, this typically occurs when a highly linked and/or authoritative URL is disallowed.
To prevent indexation, you're much better off using an on-page meta tag, or even an X-Robots-Tag HTTP header (particularly useful for non-page resources, such as PDFs, etc.).
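As a rough illustration of those two alternatives (these lines are generic examples, not taken from the question), a noindex hint can be expressed either in the page markup or as a response header:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
Note that for either of these to be seen, the URL must not also be disallowed in robots.txt - a bot that never fetches the page never sees the directive.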
"Disallow: /search" tells search engine robots not to index and crawl those links which contains "/search" For example if the link is http://yourblog.blogspot.com/search.html/bla-bla-bla then robots won't crawl and index this link.
Closed. This question is not about programming or software development. It is not currently accepting answers.
I have made custom segments/blocks around my website which I use for advertising/marketing. Google's search bots are treating those as part of my website and get confused about what my site really is versus what is advertising.
This negatively impacts my SEO. Is there a way I can use certain directives or elements to tell Google and other bots to avoid crawling that portion, or, even if it is crawled, not to consider it part of that page?
I am aware of the robots.txt file, but that applies to an entire page. I would like to block certain blocks within each page, such as a sidebar or a floating bar.
There's no way to ensure all bots don't index parts of a page; it's kind of an all-or-nothing thing.
You could use a robots.txt file with
Disallow: /iframes/
then load the content you don't want indexed into iframes.
There's also the data-nosnippet attribute.
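A rough sketch of both ideas (the path and markup here are made up, and data-nosnippet only controls what can be quoted in search snippets, not whether the page itself is indexed):
<iframe src="/iframes/ad-sidebar.html"></iframe>
<div data-nosnippet>Promotional copy you don't want quoted in search results.</div>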
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Recently I had a problem where a client of mine sent out an email with MailChimp containing UTM (Google) and MC (Mailchimp) parameters in the URL.
Since the link was pointing to a Magento 2 site with Varnish running, I had to come up with a fix for that, otherwise Varnish would create a lot of different cache entries for the "unique" URLs.
Now, by using this adjusted snippet in the Varnish .vcl, I was able to strip these parameters:
if (req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|mc_[a-z]+|utm_[a-z]+)=") {
    # Remove each tracking parameter and its value (plus a trailing "&", if any).
    set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|mc_[a-z]+|utm_[a-z]+)=[-_A-Za-z0-9+()%.]+&?", "");
    # Drop any "?" or "&" left dangling at the end of the URL.
    set req.url = regsub(req.url, "[?|&]+$", "");
}
And this works pretty well; it strips the parameters from the URL.
But I can't seem to find a clear explanation of whether this will affect SEO or Analytics tracking in any way - I Googled it as much as I could, but cannot find a clear answer.
Anyone here with a solution and / or explanation?
This will not affect SEO in any way. Those links are typically added by Google itself (Analytics, Adwords) or email marketing campaigns which use the same. The search engines will not see those links so there's no impact on SEO whatsoever.
The parameters mentioned are used by JavaScript libraries and never by the PHP scripts, so what you did for better cacheability is correct. The browser's JavaScript engine will still see them because it has access to the full URL. The PHP backend (Magento) does not need them.
Closed. This question is off-topic. It is not currently accepting answers.
I've seen tutorials/articles discussing using Robots.txt. Is this still a necessary practice? Do we still need to use this technique?
A robots.txt file is not necessary, but it is recommended if you want to block a few pages or folders on your website from being crawled by search engine crawlers.
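A minimal example along those lines (the paths here are made up) would be:
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /checkout.html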
I agree with the above answer. The robots.txt file is used to block pages and folders from being crawled by search engines. For example, you can block search engines from crawling and indexing URLs that contain session IDs, which in rare cases could become a security threat! Other than this, I don't see much importance.
The way that a lot of the robots crawl through your site and rank your pages has changed recently as well.
I believe that for a short period of time the use of robots.txt may have helped quite a bit, but nowadays most other steps you take for SEO will have more of a positive impact than this little .txt file ever will.
The same goes for backlinks: they used to be far, far more important for getting ranked than they are now.
Robots.txt is not for indexing; it's used to block the things that you don't want search engines to index.
Robots.txt can help with the indexation of large sites, if you use it to reveal an XML sitemap file.
Like this:
Sitemap: http://www.domain.com/sitemap.xml
Within the XML file, you can list up to 50,000 URLs for search engines to index. There are plugins for many content management systems that can generate and update these files automatically.
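For reference, such a sitemap.xml is just a plain XML list of URLs, roughly of this shape (the URLs and date are placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.domain.com/some-page/</loc>
    <lastmod>2013-01-15</lastmod>
  </url>
  <url>
    <loc>http://www.domain.com/another-page/</loc>
  </url>
</urlset>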
Closed. This question is off-topic. It is not currently accepting answers.
Let's say we have Twitter, and every profile needs to get indexed in search engines; how does Twitter handle its sitemap? Is there something like a "regex" sitemap for the domain, or do they regenerate a sitemap for each user?
How does this work for pages that you don't know in advance, i.e. dynamic pages? Look at Wikipedia, for example: how do they make sure everything is indexed by search engines?
Most likely, they don't bother to do a sitemap.
For highly dynamic sites, a sitemap will not help that much. Google will index only some amount, and if everything changes before Google gets around to revisiting it, you don't gain much.
For slowly changing sites this is different. The sitemap tells Google, on the one hand, which pages exist that it may not have visited at all yet, and (more importantly) which pages have not changed and thus do not need to be revisited.
But the sitemap.xml mechanism just does not scale up to huge and highly dynamic sites such as Twitter.
Many systems use a dynamically generated sitemap.
You can upload any sitemap to Google via Webmaster Tools (the service is free of charge) - Optimization > Sitemaps. It does not have to be sitemap.xml; it can be a JSP or ASPX page too.
Webmaster Tools allows you to upload many different sitemaps for a single website. However, I am not sure what the maximum number of sitemaps is.
Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form as follows:
Sitemap: http://www.yoursite.com/profiles-sitemap.xml
Sitemap: http://www.yoursite.com/sitemap_index.xml
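A sitemap index file such as the sitemap_index.xml above simply lists other sitemap files; its shape is roughly the following (the second file name is illustrative):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yoursite.com/profiles-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.yoursite.com/posts-sitemap.xml</loc>
  </sitemap>
</sitemapindex>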
EDIT
Microsoft website is a very good example:
The robots.txt file contains lots of sitemap entries. Example:
Sitemap: http://www.microsoft.com/en-us/sqlazure/sitemap.xml
Sitemap: http://www.microsoft.com/en-us/cloud/sitemap.xml
Sitemap: http://www.microsoft.com/en-us/server-cloud/sitemap.xml
Sitemap: http://www.microsoft.com/france/sitemap_index.xml
Sitemap: http://www.microsoft.com/fr/ca/sitemap.xml
Sitemap: http://www.microsoft.com/germany/kleinunternehmen/gsitemap.aspx
Sitemap: http://www.microsoft.com/germany/newsroom/sitemap.xml
As you can see, some sitemaps are static (XML) and some are dynamic (ASPX).
Closed. This question is off-topic. It is not currently accepting answers.
I am using postbacks to perform paging on a large amount of data. Since I did not have a sitemap for Google to read, there will be products that Google will never know about, because Google does not push any buttons.
I am doing cloaking to spit out all the products with no paging if the user agent is that of a search engine. There may be some workarounds for situations like this, such as hidden buttons linking to paged URLs.
What about information you want indexed by Google, but for which you want to charge? Imagine that I have articles that I want users to be able to find in Google, but when the user visits the page, only half the content is displayed and users have to pay for the rest.
I have heard that Google may blacklist you for cloaking. I am not being evil, just helpful. Does Google recognize the intention?
Here is a FAQ by Google on that topic. I suggest using CSS to hide some content. For example, just provide links to your products as an alternative to your buttons and use display:none; on them. The layout stays intact and the search engines will find your pages. Most search engines will not find out about cloaking and other techniques, but maybe competitors will report you. Either way: don't risk it. Use sitemaps, use RSS feeds, use XML documents or even PDF files with links to offer your whole range of products. Good luck!
This is why Google supports a sitemap protocol. The sitemap file needs to render as XML, but can certainly be a code-generated file, so you can produce on-demand from the database. And then point to it from your robots.txt file, as well as telling Google about it explicitly from your Google Webmaster Console area.
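As a rough, framework-agnostic sketch of that idea in Python (the function name and URLs are hypothetical; in practice the URL list would come from your product database and the output would be served at the sitemap URL):
# Build sitemap XML on demand from a list of product URLs.
from xml.sax.saxutils import escape

def build_sitemap(product_urls):
    # One <url><loc>...</loc></url> entry per product URL.
    entries = "\n".join(
        "  <url><loc>%s</loc></url>" % escape(url) for url in product_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>"
    )

# Example usage with placeholder URLs:
print(build_sitemap([
    "http://www.example.com/products/1",
    "http://www.example.com/products/2",
]))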
Highly doubtful. If you are serving different content based on IP address or User-Agent from the same URL, it's cloaking, regardless of the intentions. How would a spider parse two sets of content and figure out the "intent"?
There is intense disagreement over whether "good" cloakers are even helping the user anyway.
Why not just add a sitemap?
I don't think Google will recognize your intent, unfortunately. Have you considered creating a sitemap dynamically? http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40318