I need xml file for indexing my website for google crawling. I'm using some software to make XML file. My question is do I need to list all dynamic pages. I mean like this:
http://mysite.com/page/?id=01
http://mysite.com/page/?id=02
http://mysite.com/page/?id=03
http://mysite.com/page/?id=04
http://mysite.com/page/?id=05
if yes, why is that? and what is going to happend if I wouldnt include them and just say:
http://mysite.com/page/
If I include all the id's the result would be a huge XML file. Does google accept this such a large file or they have limit for it?
Thanks in advance for all help and time.
Google isn't going to index all your dynamic pages anyways. It will throw many of them out even if you put them in the sitemap.xml. The content will be too similar.
There is a limit to the number of entries in a sitemap.xml It used to be ~50k pages/10MB. In my experience Google will crawl a few thousand and stop if they look too similar and have no inbound links.
You do not need an XML sitemap at all. It just makes it easier for google to crawl your content.
And obviously you don't have to put dynamic stuff in it.
If this is a real issue, try reading up on rel="canonical" which is made to exclude those types of pages from Google. While it's usefulness is based on use case, you may find it is the right solution for you.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=139394
Related
We do a lot of email marketing and sometimes developers will put the html file out on the image server (i know the easy answer is to not do this) but those html files end up getting indexed by Google and eventually rank high on search results. Which in turns makes the SEO company's want us to remove these pages. Is it possible to have google not index anything from our sub domain? we have image.{ourUrl}.com where we put all these files.
Would putting a robot.txt file in the main directory do it? Or would we need to add that robot text file in every directory?
Is there an easy way to blanket this?
A robots.txt file would just stop crawling, files might still be indexed. a noindex directive would work, you could use an x-robots-tag. See here https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
I hope my question is not too irrelevant to stackoverflow.
this is my website: http://www.rader.my
It's a car information website. The content is dynamic. Therefore, google crawler could not find all the cars specification pages in my website.
I created a sitemap with all my cars URL in it (for instance: http://www.rader.my/Details.php?ID=13 is for one car). I know I haven't made any mistake in my .xml file format and structure. But after submission, google only indexed one URL which is my index.php.
I have also read about rel="canonical". But I don't think in my case I should use such a thing since all my pages ARE different with different content but only the structure is the same.
Is there anything that I missed? Why google doesn't accept my URLs even though the contents are different? What can I do to fix this?
Thanks and regards,
Amin
I have a similar type of site. Google is good about figuring out dynamic sites. They'll crawl the pages and figure out the unique content as time goes on. Give it time.
You should do all the standard things:
Make sure each page has a unique H1 tag.
Make sure each page has substantial unique content
Unique keywords and description tags aren't as useful as they used to be but they can't hurt.
Cross-link internally. Create category pages that include links to all of one manufacturer and have each of the pages of that manufacturer link back to 'similar' pages.
Get links to your pages. Nothing helps getting indexed like external authority.
In the effort of building a live site on its actual live hosting platform is there a way to tell google to not YET index the website? I found the following:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93710
But would that tell them to never come back or would they simply see the noindex tag and then not list the results, then when it comes back to crawl again later and my site is good to go I would have the noindex removed and the site would then start getting indexed?
Sounds like you want to use a robots.txt file instead:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&topic=2370588&ctx=topic
Update your robots.txt file when you want your content to be indexed.
You can use the robot.txt method.
You can specify which subpage could be spidered. And google comes back, checking the file before indexing. So you can delete the file later in order to get fully indexed.
More Information
About /robots.txt
Robots.txt File Generator
You can always change it. The way Google and other robots find your page is if it is linked to on another page. As long as it isn't linked to on another page, it won't be found. Also, once your site is up, chances are that it will be far back in the list of sites.
Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact phone etc. details who want them visible on the site but would rather they not be google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for, can't really be done, Google either takes the entire page, or none of it.
You could do some sneaky tricks though like insert the part of the page you don't want indexed in an iFrame and use robots.txt to ask Google not to index that iFrame.
In short NO - unless you use cloaking with is discouraged by Google.
Please check out the official documentation from here
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
Found useful resource for using certain duplicate content and not to allow index by search engine for such content.
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index>
At your server detect the search bot by IP using PHP or ASP. Then feed the IP addresses that fall into that list a version of the page you wish to be indexed. In that search engine friendly version of your page use the canonical link tag to specify to the search engine the page version that you do not want to be indexed.
This way the page with the content that do want to be index will be indexed by address only while the only the content you wish to be indexed will be indexed. This method will not get you blocked by the search engines and is completely safe.
Yes definitely you can stop Google from indexing some parts of your website by creating custom robots.txt and write which portions you don't want to index like wpadmins, or a particular post or page so you can do that easily by creating this robots.txt file .before creating check your site robots.txt for example www.yoursite.com/robots.txt.
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta-tags for bots, and there's also the robots.txt, with which you can restrict access to certain directories.
If the value of the href for Canonical tags is populated via javascript function, would that affect the Search engine indexing (as search engines ignore javascript) ?
I'm not sure I fully understand the question as you worded it. But here's my take:
Canonical tags are used to make sure that Google (et al) knows that the same page with different URLs are, in fact, the same page.
This saves Google a lot of processing time, because it will treat those pages as a single page instead of trying to index every one of them. Also, your domain's search engine ranking will probably go up because Google doesn't think you're duplicating content.
For any page that could be duplicated because of parameters, you should include a canonical link of the page you want known as the original. So yes, it would help in your case. Though you cannot put a canonical link on someone else's domain pointing to your domain, so putting it on a partner's page would not have the intended consequences.
If you want more information, read up here: Google Webmaster Central: Specify Your Canonical