why are some pages in my start directory not showing in my site map? - seo

If I generate a site map at http://www.xml-sitemaps.com/ for aromapersona.com, I get 9 pages, but there are a bunch more pages that should show up. For example, aromapersona.com/candle_holder is in the same "front" directory as the other 9 pages, but it doesn't appear in the sitemap. Is this because no other pages on my site link to it? I'm trying to get these other URLs indexed. I even edited the site map to include this URL as well as others and submitted it to Google via Webmaster Tools, and still nothing. Advice?

I'm not familiar with aromapersona.com, but the sitemap generator will only be able to list pages that are linked from the initial page you give it (or from pages those link to), unless you give the generator FTP access (which I presume you don't).
If you include the URLs in your sitemap for Google, it should eventually list them, but linking to them from other parts of your site is probably the most effective approach.

I have not checked the website, but also make sure the cause is not noindex, nofollow, robots.txt, JavaScript links, mixing http/https, etc.

In plain terms: there is no link pointing to the subpage "candle_holder", so the XML sitemap generator (which works by following links on your site) cannot detect it.
You can add it manually to the XML, but even then the page should be reachable from the site itself.
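
To illustrate why an unlinked page is invisible to a link-following generator, here is a minimal sketch in Python. The link graph and every page name except /candle_holder are hypothetical, and the real generator fetches pages over HTTP rather than reading a dictionary, so treat this only as a model of the discovery step.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
# "/candle_holder" exists on the server, but no other page links to it.
SITE_LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/", "/contact"],
    "/contact": [],
    "/candle_holder": ["/"],
}


def discover_pages(links, start="/"):
    """Return every page reachable from `start` by following links (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)


found = discover_pages(SITE_LINKS)
missing = sorted(set(SITE_LINKS) - set(found))
print("discovered:", found)
print("never discovered:", missing)
```

In this model, adding "/candle_holder" to the link list of any reachable page is enough for the crawl to pick it up.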

Related

Google 404 soft error on index page that is working fine

A friend of mine has been having trouble getting her site indexed by Google and asked me to have a look, but that is not something I really know much about, so I was hoping for some assistance.
Looking at her Search Console, the Google crawl shows a soft 404 error on the index page. I marked this as fixed a few times because the site looks fine to me, but it keeps coming back.
If I fetch the site as Google it seems to be working fine, although it is showing the mobile version instead of the desktop.
It also keeps reporting a recurring 404 for a page, http://www.smeyan.com/new-page, which doesn't exist anywhere I can see, including server files and sitemaps.
Here is what I know about this site:
It used to be a Wix site and was moved to a HostGator shared server 2-3 months ago.
It's using JavaScript/jQuery .load to get page content outside the index.html template.
It has 2 sitemaps, one for the URLs and one for both URLs and images:
http://www.smeyan.com/sitemap_url.xml http://www.smeyan.com/sitemap.xml
It has been about 2 months since it was submitted for indexing, and Google has not indexed any of the content. Searching for site:www.smeyan.com shows some old stuff from the Wix server, although Search Console says it has 172 images indexed.
It has www. set as the preference in Search Console.
Has anyone experienced this and has any direction for a fix?
How long a lifetime was set for this site in the Cache-Control header? If it is long, you should use Google's Removals tool for the obsolete snippets and cache. I simulated a Google visit to your webpage: the 404 return code is correct and the headers are correct. So report the "not found" pages through Removals, request a visit from Googlebot, then keep calm and wait for the reaction.
BTW: For permanently removed content, return 410 Gone, or report it via Removals.
https://support.google.com/webmasters/answer/1663419?hl=en
The only download error that I saw while using Chrome's Inspect function pertains to a SCRIPT tag with a Facebook URL as its source (src).
This is the error as reported by Inspect.
This is the SCRIPT tag that caused the error.
I am not sure that this is the cause of the recurring 404 error, but it is an issue that needs attention on this website.
I checked your site with Tor Browser, which has scripts disabled. You should make the content of your site available inside a <noscript> element. It doesn't have to be beautiful, but it should be visible to bots: <a href... ></a>, <img/>, etc., and text. Without it the site is not optimized for search bots. Read up on SEO. Content listed in the sitemap may never be indexed if nothing ever links to it.
Your webpage probably also doesn't meet the requirements for screen readers (for blind people).
Note: The image with the "SMEYAN" caption is visible on the webpage and is indexed.
The second image on the webpage (in the source), <img class="gallery-full-image" src="./galleries/home_gallery/smeyan_home-1.jpg" />, is also indexed.
The menu also doesn't work without scripts.
I thought that part was implemented well.
Please use the <noscript> element and implement a version for blind people (without scripts, with an alt attribute on images) and for browsers with scripting turned off. You can test it by disabling scripts or with the NoScript extension for Firefox.
BTW: You should use HTML and CSS (including animations), and use JS only where it is needed. Otherwise, use the <noscript> approach.
Googlebot currently uses a web rendering service (WRS) that is based on the old Chrome 41 (M41), so it may fail where browsers succeed.
To learn how Googlebot works, read this.
Add this code to the page to see the real error.
You can see the error using the live URL Inspection test in Google Search Console. It is shown on the More info tab.
Note: if the bot gets a 301 code, or if the page has too little significant content, it will report a soft 404 error and won't show a preview or any other error.

Bing.com indexes URLs I've never submitted

Sometimes, before launching new web projects, I put the site / app under a subdomain like new.domain.com or beta.domain.com.
These URLs are only meant for my clients, so they don't get submitted to search engines and there aren't any public links to them.
However, I noticed on a few occasions that these subdomains get indexed by Bing anyway. How is this possible?
Does Bing crawl generic subdomain names like new, old, archive, beta, ... ?
Or do URLs sent in emails get scraped in Office 365 (which my clients use) and then indexed?
This is possible if a user has installed a toolbar from that search engine.
The best way to prevent this is to add a noindex tag to all pages in the subdomain. You can also block crawlers using robots.txt.
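
As a rough way to verify the robots.txt half of that advice, the sketch below uses Python's standard urllib.robotparser to check whether a rule set blocks a crawler. The robots.txt content is an assumed example for a staging subdomain, the helper name is made up for illustration, and "bingbot" stands in for whichever crawler you care about.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for a staging subdomain such as beta.domain.com:
# it asks every crawler to stay away from every path.
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = not is_crawl_allowed(
    STAGING_ROBOTS_TXT, "bingbot", "http://beta.domain.com/"
)
print("staging site blocked for bingbot:", blocked)
```

Keep in mind that a crawler which is blocked from fetching a page cannot read a noindex tag on that page, so the two measures do not reinforce each other automatically.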

Does google index my index.* page and my folder structure?

I am doing some research on canonical pages on our site.
Does Google create two index entries in this case:
http://www.foo.com/folder/index.html
http://www.foo.com/folder/
Or does it only index one of the above?
I am curious if I need to add a rel="canonical" or if I am just overthinking this simple idea.
After some research: it depends on the web server.
In our case it was a Sun One web server, where you could hit both foo.com/ and foo.com/index.jsp.
Even though these pulled up the same content, they are two different URLs, and Google saw them as two separate pages with duplicate content. This was hurting our SEO.
The fix was to modify the web server to automatically redirect /index.jsp pages to /.
So yes, Google will index any page that you can browse to in your browser, unless it's in your robots.txt or you are manually telling Google not to index it in some fashion.
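
The redirect rule described above can be sketched as a small URL-normalizing function. This is Python written only to show the logic, not the Sun One configuration that was actually used. The function names and the regular expression are assumptions, and the pattern treats any file named index.<extension> as the folder's default document.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: any "index.<extension>" at the end of the path is the
# default document for its folder (index.html, index.jsp, ...).
INDEX_FILE = re.compile(r"/index\.[A-Za-z0-9]+$")


def canonical_url(url):
    """Map .../folder/index.* to .../folder/ and leave other URLs unchanged."""
    parts = urlsplit(url)
    path = INDEX_FILE.sub("/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def redirect_for(url):
    """Return (301, target) if `url` should be redirected, otherwise None."""
    target = canonical_url(url)
    return (301, target) if target != url else None


print(redirect_for("http://www.foo.com/folder/index.html"))
print(redirect_for("http://www.foo.com/folder/"))
```

With a rule like this in place only one URL per folder answers with content, so there is nothing left for a rel="canonical" tag to disambiguate.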

How to Inform Google For Page URL Modifications in Same Domain?

I am renewing my website and changing the site structure. It was in ASP and now it will be in ASP.NET.
So page URLs will be modified, and some pages will be removed and some added. But mostly the content and page names stay the same; only the URLs will change.
The site has had SEO work done on it and we want to lose as little of it as possible. The site is registered in Analytics and Webmaster Tools.
Google searches will end up on blank pages, and I don't want to lose my rank.
So I'm looking for a way to inform Google about the new page URLs. The domain is the same; only the URLs change. For example, the home page was /default.asp and is now /home.aspx.
Is there a way to tell Google that a particular URL address or page name has changed?
If all that is changing are the page URLs, Google Analytics cannot "know" that a page is the same, just with a different URL.
But you could record a customized pageview using the _trackPageview() method, giving it the original URL as a parameter.
If you choose to do this, you will have to remove the line that calls the method in the original GA code and call it elsewhere, or pass the original URL to it directly as a parameter. All of this has to be done on each page.
You can also read more about the method here.
For IIS (Asp.Net) you want to look into the following to find out how to do 301 redirects:
Response.RedirectPermanent(...) for redirecting from a page
or
"IIS 7 Routing Module and web.config" to set up bulk redirecting
I'd also suggest you consider supporting Search Engine Friendly (SEF) URLs while you're making the move. The Routing Module can help you there as well.
You need to implement some form of 301 redirect (301 is key). This way, when Google or any other search engine hits the old page, the index is refreshed with the new page. ASP.NET allows you to do these redirects even at the IIS level, which is where I'd suggest they live. You'll also want to submit an up-to-date site map in Webmaster Tools.
Edit: Here's a good link on the redirects, http://www.iis.net/ConfigReference/system.webServer/httpRedirect
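
The idea behind a bulk 301 mapping can be shown with a short language-neutral sketch. It is written in Python as a tiny WSGI app purely for illustration; it is not IIS or ASP.NET configuration. Only the /default.asp to /home.aspx pair comes from the question. The second mapping and the 404 fallback are assumptions, and a real site would serve its normal pages instead of the fallback.

```python
# Old URL -> new URL. Only the first pair comes from the question;
# the second is a hypothetical example.
REDIRECTS = {
    "/default.asp": "/home.aspx",
    "/products.asp": "/products.aspx",
}


def app(environ, start_response):
    """Minimal WSGI app: answer old URLs with a 301 pointing at the new URL."""
    path = environ.get("PATH_INFO", "")
    target = REDIRECTS.get(path.lower())
    if target is not None:
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    # Placeholder for everything that is not an old URL.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]


if __name__ == "__main__":
    # Serve locally for a manual check, e.g. curl -I http://localhost:8000/default.asp
    from wsgiref.simple_server import make_server

    make_server("localhost", 8000, app).serve_forever()
```

Keeping the mapping in one table makes it easy to check that every old URL listed in the previous sitemap has a destination before the new site goes live.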

Google found my backup web site. What can I do about It?

A few days ago we replaced our web site with an updated version. The original site's content was migrated to http://backup.example.com. Search engines do not know about the old site, and I do not want them to know.
While we were in the process of updating our site, Google crawled the old version.
Now when using Google to search for our web site, we get results for both the new and old sites (e.g., http://www.example.com and http://backup.example.com).
Here are my questions:
Can I update the backup site content with the new content? Then we can get rid of all the old content. My concern is that Google will lower our page ranking due to duplicate content.
If I prevent the old site from being accessed, how long will it take for the information to clear out of Google's search results?
Can I use a Disallow rule to block Google from the old web site?
You should probably put a robots.txt file in your backup site and tell robots not to crawl it at all. Google will obey the restrictions though not all crawlers will. You might want to check out the options available to you at Google's WebMaster Central. Ask Google and see if they will remove the errant links for you from their data.
You can always use robots.txt on the backup.* site to disallow Google from indexing it.
Are the URL formats consistent enough between the backup and current site that you could redirect a given page on the backup site to its equivalent on the current one? If so you could do so, having the backup site send 301 Permanent Redirects to each of the equivalent pages on the site you actually want indexed. The redirecting pages should drop out of the index (after how much time, I do not know).
If not, definitely look into robots.txt as Zepplock mentioned. After setting up the robots.txt you can expedite removal from Google's index with their Webmaster Tools.
Also, you can add a rule in your scripts that sends a 301 redirect from each old page to its new one.
Robots.txt is a good suggestion, but... Google doesn't always listen. Yeah, that's right, they don't always listen.
So, disallow all spiders, but... also put this in the <head> of each page:
<meta name="robots" content="noindex, nofollow, noarchive" />
It's better to be safe than sorry. Meta commands are like yelling at Google "I DONT WANT YOU TO DO THIS TO THIS PAGE". :)
Do both, save yourself some pain. :)
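One way to confirm that the tag really made it onto a page is to parse the HTML and read the robots directives back. The sketch below uses Python's standard html.parser. The class and function names are made up for illustration, and the sample page is just the meta tag from above wrapped in a minimal document.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


SAMPLE_PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow, noarchive" />'
    "</head><body>Old backup page</body></html>"
)

directives = robots_directives(SAMPLE_PAGE)
print("noindex present:", "noindex" in directives)
```

Running a check like this over every page of the backup site would catch templates that were missed when the tag was added.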
I suggest you either add a noindex meta tag to all the old pages or just disallow them in robots.txt. The simplest way is to block them with robots.txt. One more thing: add a sitemap for the new site and submit it in Webmaster Tools, which will improve indexing of your new website.
Password-protect the web pages or directories that you don't want web spiders to crawl or index by putting password-protection directives in the .htaccess file (use the one in your website's root directory on the server if present, or create a new one and upload it).
The web spiders will never know that password and hence won't be able to index the protected directories or web pages.
You can block particular URLs in Webmaster Tools, or block them using robots.txt. Also remove the sitemap for your old backup site and put a noindex, nofollow tag on all of your old backup pages. I handled this same situation for one of my clients.