Website Sitemaps and <priority>, is it working? - seo

My "Privacy Policy" page is seen more important by Google than other really more important pages on my website.
I'm currently creating a script to generate a sitemap, should I bother with the priority?
How do you effectively assign priorities to pages? I consider one of my page important but the page have less content than another one less important to my eyes... but maybe Google bot will see it the other way around.
If my degree of "importantness" differs from the one of Google, will I get penalized on the ranking for a particular page?
Thank you for sharing your black art with us :P

The priority setting is different than what you're thinking. The priority setting is what priority you put to that page for crawling frequency. It doesn't mean you care about that page more than others or anything like that, but how often do you want it crawled?
Google doesn't follow it 100%, but it's pretty good about it, and I would say your privacy policy should be the lowest priority, what you should ask: How often does this page change? For a typical site, it's something like this:
/index.htm - [10] - New articles posted all the time, crawl it often
/programming - [8] - This is my programming section, doesn't change as much
/photos - [5] - I update this about 3 times a year.
/About Us - [1] - This page rarely changes, no need to crawl it a lot.
/Privacy Policy - [1] - This page rarely changes, no need to crawl it a lot.
And remember, don't try to game the system. A setting of "10" for all pages just means you're cancelling it all out. And the Google algorithm is an extremely complex method that weeds out people trying to cheat, and I guarantee you priority abuse is addressed somewhere in their engine.
Hope this answers your question, and good luck!

Related

Hiding a page part from Google, does it hurt SEO?

We all know that showing inexistent stuff to Google bots is not allowed and will hurt the search positioning but what about the other way around; showing stuff to visitors that are not displayed for Google bots?
I need to do this because I have photo pages each with the short title and the photo along with textarea containing the embed HTML code. googlebot is taking the embed code and putting it at the page description on its search results which is very ugly.
Please advise.
When you start playing with tricks like that, you need to consider several things.
... showing stuff to visitors that are not displayed for Google bots.
That approach is a bit tricky.
You can certainly check User-agents to see if a visitor is Googlebot, but Google can add any number of new spiders with different User-agents, which will index your images in the end. You will have to constantly monitor that.
Testing of each code release your website will have to check "images and Googlebot" scenario. That will extend testing phase and testing cost.
That can also affect future development - all changes will have to be done with "images and Googlebot" scenario in mind which can introduce additional constraints to your system.
Personally I would choose a bit different approach:
First of all review if you can use any methods recommended by Google. Google provides a few nice pages describing that problem e.g. Blocking Google or Block or remove pages using a robots.txt file.
If that is not enough, maybe restructuring of you HTML would help. Consider using JavaScript to build some customer facing interfaces.
And whatever you do, try to keep it as simple as possible, otherwise very complex solutions can turn around and bite you.
It is very difficult to give you very good advise without knowledge of your system, constraints and strategy. But I hope my answer will help you out to choose good architecture / solution for your system.
Boy, you want more.
Google does not because of a respect therefore judge you cheat, he needs a review, as long as your purpose to the user experience, the common cheating tactics, Google does not think you cheating.
just block these pages with robots.txt and you`ll be fine, it is not cheating - that's why they came with solution like that in the first place

Crawling for Eternity

I've recently been building a new web app dealing with Recurring Events. These events can recur on a daily, weekly or monthly basis.
This all is working great. But when I started creating the Event Browser Page (which will be visible to the public internet) a thought came across my mind.
If a crawler hits this page, with a next and previous button to browse the dates, it will just continue forever ? So I opted out of using generic HTML links and used AJAX. Which means that bots will not be able to follow the links.
But this method means I'm losing any that functionality for users without Javascript. Or is the amount of users without Javascript too small to worry ?
Is there a better way to handle this ?
I'm also very interested in how bots like the Google Crawler detects black holes like these and what it does to handle them ?
Add a nofollow tag to the page, or to the individual links you don't want crawled. This can be in robots.txt or in the page source. See the Robots Exclusion Standard
You may still need to think about how to fend off ill-behaved bots which do not respect the standard.
Even a minimally functional web crawler requires a lot more sophistication than you might imagine, and the situation you describe is not a problem. Crawlers operate on some variant of a breadth-first search, so even if they do nothing to detect black holes, it's not a big deal. Another typical feature of web crawlers that helps is that they avoid fetching a lot of pages from the same domain in a short time span, because otherwise they would inadvertently be performing a DOS attack against any site with less bandwidth than the crawler.
Even though it's not strictly necessary for a crawler to detect black holes, a good one might have all sorts of heuristics to avoid wasting time on low-value pages. For instance, it may choose ignore a pages that don't have a minimum amount of English (or whatever language) text, pages that contain nothing but links, pages that seem to contain binary data, etc. The heuristics don't have to be perfect because the basic breadth-first nature of the search ensures that no single site can waste too much of the crawler's time, and the sheer size of the web means that even if it misses some "good" pages, there are always plenty of other good pages to be found. (Of course this is from the perspective of the web crawler; if you own the pages being skipped, it might be more of a problem for you, but companies like Google that run web crawlers are intentionally secretive about the exact details of things like that because they don't want people trying to outguess their heuristics.)

Is there a way that is more efficient than sitemap to add/force recrawl/remove your website's index entries in google?

Pretty much that is the question. Is there a way that is more efficient than the standart sitemap.xml to [add/force recrawl/remove] i.e. manage your website's index entries in google?
I remember a few years ago I was reading an article of an unknown blogger that was saying that when he write news in his website, the url entry of the news will appear immediately in google's search result. I think he was mentioning about something special. I don't remember exactly what.. . some automatic re-crawling system that is offered by google themselves? However, I'm not sure about it. So I ask, do you think that I am blundering myself and there is NO OTHER way to manage index content besides sitemap.xml ? I just need to be sure about this.
Thank you.
I don't think you will find that magical "silver bullet" answer you're looking for, but here's some additional information and tips that may help:
Depth of crawl and rate of crawl is directly influenced by PageRank (one of the few things it does influence). So increasing your site's homepage and internal pages back-link count and quality will assist you.
QDF - this Google algorithm factor, "Query Deserves Freshness", does have a real impact and is one of the core reasons behind the Google Caffeine infrastructure project to allow much faster finding of fresh content. This is one of the main reasons that blogs and sites like SE do well - because the content is "fresh" and matches the query.
XML sitemaps do help with indexation, but they won't result in better ranking. Use them to assist search bots to find content that is deep in your architecture.
Pinging, especially by blogs, to services that monitor site changes like ping-o-matic, can really assist in pushing notification of your new content - this can also ensure the search engines become immediately aware of it.
Crawl Budget - be mindful of wasting a search engine's time on parts of your site that don't change or don't deserve a place in the index - using robots.txt and the robots meta tags can herd the search bots to different parts of your site (use with caution so as to not remove high value content).
Many of these topics are covered online, but there are other intrinsic things like navigational structure, internal linking, site architecture etc that also contribute just as much as any "trick" or "device".
Getting many links, from good sites, to your website will make the Google "spiders" reach your site faster.
Also links from social sites like Twitter can help the crawlers visit your site (although the Twitter links do not pass "link juice" - the spiders still go through them).
One last thing, update your content regularly, think of content as "Google Spider Food". If the spiders will come to your site, and will not find new food, they will not come back again soon, if each time they come, there is new food, they will come a lot. Article directories for example, get indexed several times a day.

How does Google Know you are Cloaking?

I can't seem to find any information on how google determines if you are cloaking your content. How, from a technical standpoint, do you think they are determining this? Are they sending in things other than the googlebot and comparing it to the googlebot results? Do they have a team of human beings comparing? Or can they somehow tell that you have checked the user agent and executed a different code path because you saw "googlebot" in the name?
It's in relation to this question on legitimate url cloaking for seo. If textual content is exactly the same, but the rendering is different (1995-style html vs. ajax vs. flash), is there really a problem with cloaking?
Thanks for your put on this one.
As far as I know, how Google prepares search engine results is secret and constantly changing. Spoofing different user-agents is easy, so they might do that. They also might, in the case of Javascript, actually render partial or entire pages. "Do they have a team of human beings comparing?" This is doubtful. A lot has been written on Google's crawling strategies including this, but if humans are involved, they're only called in for specific cases. I even doubt this: any person-power spent is probably spent by tweaking the crawling engine.
Google looks at your site while presenting user-agent's other than googlebot.
See the Google Chrome comic book page 11 where it describes (even better than layman's terms) about how a Google tool can take a schematic of a web page. They could be using this or similar technology for Google search indexing and cloak detection - at least that would be another good use for it.
Google does hire contractors (indirectly, through an outside agency, for very low pay) to manually review documents returned as search results and judge their relevance to the search terms, quality of translations, etc. I highly doubt that this is their only tool for detecting cloaking, but it is one of them.
In reality, many of Google's algos are trivially reversed and are far from rocket science. In the case of, so called, "cloaking detection" all of the previous guesses are on the money (apart from, somewhat ironically, John K lol) If you don't believe me set up some test sites (inputs) and some 'cloaking test cases' (further inputs), submit your sites to uncle Google (processing) and test your non-assumptions via pseudo-advanced human-based cognitive correlationary quantum perceptions (<-- btw, i made that up for entertainment value (and now i'm nesting parentheses to really mess with your mind :)) AKA "checking google resuts to see if you are banned yet" (outputs). Loop until enlightenment == True (noob!) lol
A very simple test would be to compare the file size of a webpage the Googlbot saw against the file size of the page scanned by an alias user of Google that looks like a normal user.
This would detect most suspect candidates for closeer examination.
They call your page using tools like curl and they construct a hash based on the page without the user agent, then they construct another hash with the googlebot user-agent. Both hashes must be similars, they have algorithms to check the hashes and know if its cloaking or not

Is listing all products on the homepage's footer making a real difference SEO-wise?

I'm working on a website on which I am asked to add to the homepage's footer a list of all the products that are sold on the website along with a link to the products' detail pages.
The problem is that there are about 900 items to display.
Not only that doesn't look good but that makes the page render a lot slower.
I've been told that such a technique would improve the website's visibility in Search Engine.
I've also heard that such techniques could lead to the opposite effect: google seeing it as "spam".
My question is: Is listing products of a website on its homepage really efficient when it comes to becoming more visible on search engines?
That technique is called keyword stuffing and Google says that it's not a good idea:
"Keyword stuffing" refers to the practice of loading a webpage with keywords in an attempt to manipulate a site's ranking in Google's search results. Filling pages with keywords results in a negative user experience, and can harm your site's ranking. Focus on creating useful, information-rich content that uses keywords appropriately and in context.
Now you might want to ask: Does their crawler really realize that the list at the bottom of the page is just keyword stuffing? Well, that's a question that only Google could answer (and I'm pretty sure that they don't want to). In any case: Even if you could make a keyword stuffing block that is not recognized, they will probably improve they algorithm and -- sooner or later -- discover the truth. My recommendation: Don't do it.
If you want to optimize your search engine page ranking, do it "the right way" and read the Search Engine Optimization Guide published by Google.
Google is likely to see a huge list of keywords at the bottom of each page as spam. I'd highly recommend not doing this.
When is it ever a good idea to specify 900 items to a user? good practice dictates that large lists are usually paginated to avoid giving the user a huge blob of stuff to look through at once.
That's a good rule of thumb, if you're doing it to help the user, then it's probably good ... if you're doing it purely to help a machine (ie. google/bing), then it might be a bad idea.
You can return different html to genuine users and google by inspecting the user agent of the web request.
That way you can provide the google bot with a lot more text than you'd give a human user.
Update: People have pointed out that you shouldn't do this. I'm leaving this answer up though so that people know it's possible but bad.