HTML Compression and SEO? [closed]

At work, we have a dedicated SEO analyst whose job is to pore over lots of data (KeyNote/Compete, etc.) and generate fancy reports for the executives so they can see how we are doing against our competitors in organic search rankings. He also leads initiatives to improve the SEO rankings on our sites by optimizing things as best we can.
We also have a longstanding mission to decrease our page load time, which right now is pretty shoddy on some pages.
The SEO guy mentioned that semantic, valid HTML gets more points from crawlers than jumbled, messy HTML. I've been working on a real-time HTML compressor that will decrease our page sizes by a pretty good chunk. Will compressing the HTML hurt us in site rankings?

I would suggest using compression at the transport layer and eliminating whitespace from the HTML, but not sacrificing the semantics of your markup in the interest of speed. In fact, the better you "compress" your markup, the less effective the transport-layer compression will be. Or, to put it another way: let the gzip transfer-coding slim your HTML for you, and pour your energy into writing clean markup that renders quickly once it hits the browser.
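As a rough illustration of that trade-off, here is a small sketch (assuming Node.js and its built-in zlib module; the sample markup is made up) that compares the gzipped size of a snippet before and after a naive whitespace strip:

var zlib = require("zlib");

var original = "<html>  <body>    <h1>Widgets</h1>    <p>Plenty   of   whitespace   here.</p>  </body></html>";

// Naive minification: collapse whitespace between tags, then collapse runs of spaces.
// (A real minifier must leave <pre>, <textarea> and inline scripts alone.)
var minified = original.replace(/>\s+</g, "><").replace(/\s+/g, " ");

[["original", original], ["minified", minified]].forEach(function (pair) {
  var gzipped = zlib.gzipSync(pair[1]);
  console.log(pair[0] + ": " + pair[1].length + " bytes raw, " + gzipped.length + " bytes gzipped");
});

The absolute numbers are tiny here, but the pattern holds on real pages: gzip already removes most of the redundancy that whitespace stripping targets, so the extra savings from aggressive markup "compression" are usually modest.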

Compressing HTML should not hurt you.
When you say HTML compressor, I assume you mean a tool that removes whitespace, etc., from your pages to make them smaller, right? This doesn't impact how a crawler will see your HTML, as it likely strips the same things from the HTML when it grabs the page from your site. The 'semantic' structure of the HTML exists whether it is compressed or not.
You might also want to look at:
Compressing pages with GZIP compression in the web server (see the sketch after this list)
Reducing the size of images, CSS, JavaScript, etc.
Considering how the browser's layout engine loads your pages.
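For the first point, a minimal sketch of server-side gzip, assuming a Node.js/Express app with the third-party compression middleware installed (npm install compression); any web server (Apache mod_deflate, IIS, etc.) can do the equivalent with configuration alone:

var express = require("express");
var compression = require("compression");

var app = express();
app.use(compression());            // gzip response bodies when the client sends Accept-Encoding
app.use(express.static("public")); // static HTML/CSS/JS get compressed on the way out too

app.listen(3000);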
By jumbled HTML, this SEO person probably means the use of tables for layout and the re-purposing of built-in HTML elements (e.g. <p class="headerOne">Header 1</p>; see the example after this list). This increases the ratio of HTML markup to page content, which lowers your keyword density in SEO terms. It has bigger problems, though:
Longer page load times due to the increased content to download (why not just use the H1 tag?)
It's difficult for screen readers to understand and hurts the site's accessibility.
Browsers may take longer to render the content, depending on how they parse and lay out pages with styles.
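For example, the difference between a re-purposed element and its semantic equivalent:

<!-- Re-purposed: looks like a heading once styled, but carries no heading semantics -->
<p class="headerOne">Header 1</p>

<!-- Semantic: the element itself tells crawlers and screen readers this is a top-level heading -->
<h1>Header 1</h1>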

I once retooled a messy tables-for-layout site to XHTML 1.0 Transitional, and the page size went from 100 KB to 40 KB. The images loaded went from 200 KB to just 50 KB.
The reason I got such large savings was that the site had all of its JS embedded in every page. I also retooled all of the JS so it was correct for both IE6 and FF2. The images were also compiled down into an image map. All of these techniques were well documented on A List Apart and easy to implement.

Use gzip compression to compress the HTML at the transport stage, then just make sure that your code validates and that you are using logical tags for everything.

The SEO guy mentioned that semantic, valid HTML gets more points from crawlers than jumbled, messy HTML.
If an SEO guy ever tries to present a fact about SEO, tell him to provide a source, because to the best of my knowledge that is simply untrue. If the content is there, it will be crawled. It is a common urban myth amongst SEO analysts that just isn't true.
However, the use of heading tags is recommended: <h1> tags for the page title and <h2> for main headings, then lower levels for lower headings.
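For instance, a simple heading outline might look like this:

<h1>Page title</h1>
<h2>First main heading</h2>
<h3>A sub-heading under the first main heading</h3>
<h2>Second main heading</h2>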
I've been working on a real-time HTML compressor that will decrease our page sizes by a pretty good chunk. Will compressing the HTML hurt us in site rankings?
If it can be read on the client side without problems then it is perfectly fine. If you want to look up any of this, I recommend anything referencing Matt Cutts, or the following post.
FAQ: Search Engine Optimisation

Using compression does not hurt your page ranking. Matt Cutts talks about this in his article on the Crawl Caching Proxy.
Your page load time can also be greatly improved by resizing your images. While you can use the height and width attributes on the img tag, this does not change the size of the image file that is downloaded to the browser. Resizing the images before putting them on your pages can reduce the load time by 50% or more, depending on the number and type of images that you're using.
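In other words (the file names here are hypothetical), the first of these still downloads the full-size original, while the second downloads a file that is actually 200 pixels wide:

<!-- Anti-pattern: the browser fetches the full 2000px image and merely scales it down -->
<img src="photo-2000px.jpg" width="200" height="150" alt="Product photo">

<!-- Better: serve a file that was resized to fit its slot before upload -->
<img src="photo-200px.jpg" width="200" height="150" alt="Product photo">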
Other things that can improve your page load time are:
Use web standards/CSS for layout instead of tables
If you copy/paste content from MS Word, strip out the extra tags that Word generates
Put CSS and JavaScript in external files rather than embedding them in the page; this helps when users visit more than one page on your site, because browsers typically cache these files (see the example after this list)
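A minimal illustration of that last point (the file paths are made up); after the first page view, these files come straight from the browser cache:

<link rel="stylesheet" href="/css/site.css">
<script src="/js/site.js"></script>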
This Web Page Analyzer will give you a speed report that shows how long the different elements of your page take to download.

First, check your code: validate it against W3C standards such as HTML and CSS.


hide text or div from crawlers [closed]

Let's say I have this markup:
<span class="hide">for real</span><h2 id='show'>Obama is rocking the house</h2>
<span class="hide">not real</span><h2 id='show'>Bill gates is buying stackoverflow</h2>
I need the crawler to read just:
<h2 id='show'>Obama is rocking the house</h2>
<h2 id='show'>Bill gates is buying stackoverflow</h2>
Can we do that?
I'm a bit confused here. This question says that a hidden div is read by Google:
Does google index pages with hidden divs?
But when I googled for a bit, I found out that Google doesn't read hidden divs. So which is right?
http://www.seroundtable.com/archives/002971.html
What I have in mind is to obfuscate it, for example by using CSS instead.
I could also put my text in an image and output it using an image generator or something.
FYI, serving different content to users than to search engines is a violation of Google's terms of service and will get you banned if you're caught. Content that is hidden but can be accessed through some kind of trigger (a navigation menu link is hovered over, the user clicks an icon to expand a content area, etc.) is acceptable. But in your example you are showing different content to search engines specifically for their benefit, and that is definitely what you don't want to do.
The best way to suggest that a web crawler not access content on your site is to create a robots.txt file. See http://robotstxt.org. There is no way to tell a robot not to access one part of a page.
http://code.google.com/web/controlcrawlindex/docs/faq.html#h22
If you are going to use CSS, remember that robots can still read CSS files! You could, though, list the CSS file in the robots.txt file to exclude it.
If you really must have indexed and non-indexed content on the same page, maybe you should use frames and have the non-indexed frame listed in the robots.txt file as not to be indexed.
Well-behaved crawlers, e.g. Google's, will follow the robots.txt guidance, but naughty ones will not. So there is no guarantee.
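A minimal robots.txt along those lines, with hypothetical paths; remember this only asks crawlers not to fetch these URLs, it is not access control:

User-agent: *
Disallow: /css/hide.css
Disallow: /frames/not-for-indexing.html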
I can confirm that Google does read the hidden div, while it does not show up in the search results.
The reason I know: I admin a website that has backlinks on a highly respected non-profit's site. As the non-profit doesn't want to show up in search results for a company website, they hide the links.
However, if I check Google's Webmaster Tools, I can see the backlinks from this non-profit.

would lazy-loading img src negatively impact SEO

I'm working on a shopping site. We display 40 images in our results. We're looking to reduce the onload time of our page, and since images block the onload event, I'm considering lazy loading them by initially setting img.src="" and then setting it after onload. Note that this is not AJAX loading of HTML fragments: the image HTML, along with the alt text, is present; it's just the image src that is deferred.
Does anyone have any idea whether this may harm SEO or lead to a Google penalty box now that they are measuring site speed?
Images don't block anything; they are already loaded lazily. The onload event notifies you that all of the content has been downloaded, including images, but that is long after the document is ready.
It might hurt your rank because of the lost keywords and empty src attributes. You'll probably lose more than you gain; you're better off optimizing your page in other ways, including your images. Gzip + fewer requests + proper Expires headers + a fast static server should go a long way. There is also a free CDN that might interest you.
I'm sure Google doesn't mean for the whole web to remove its images from the source code to gain a few points. And keep in mind that they consider anything under 3 seconds to be a good loading time, so there's plenty of room to wiggle before resorting to voodoo techniques.
From a pure SEO perspective, you shouldn't be indexing search result pages. You should index your home page and your product detail pages, and have a spiderable method of getting to those pages (category pages, sitemap.xml, etc.)
Here's what Matt Cutts has to say on the topic, in a post from 2007:
In general, we’ve seen that users usually don’t want to see search results (or copies of websites via proxies) in their search results. Proxied copies of websites and search results that don’t add much value already fall under our quality guidelines (e.g. “Don’t create multiple pages, subdomains, or domains with substantially duplicate content.” and “Avoid “doorway” pages created just for search engines, or other “cookie cutter” approaches…”), so Google does take action to reduce the impact of those pages in our index.
http://www.mattcutts.com/blog/search-results-in-search-results/
This isn't to say that you're going to be penalised for indexing the search results, just that Google will place little value on them, so lazy-loading the images (or not) won't have much of an impact.
There are some different ways to approach this question.
Images don't block load. JavaScript does; stylesheets do to an extent (it's complicated); images do not. However, they will consume HTTP connections, of which the browser will only fire off two per domain at a time.
So, what you can do that should be worry-free and the "Right Thing" is to set up a poor man's CDN and just drop them on www1, www2, www3, etc. on your own site and servers. There are a number of ways to do that without much difficulty.
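For example (example.com and the subdomain split are hypothetical), spreading the images across a few host names lets the browser open more parallel connections:

<img src="http://www1.example.com/img/product-01.jpg" alt="Product 1">
<img src="http://www2.example.com/img/product-02.jpg" alt="Product 2">
<img src="http://www3.example.com/img/product-03.jpg" alt="Product 3">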
On the other hand: no, it shouldn't affect your SEO. I don't think Google even bothers to load images, actually.
We display 40 images in our results.
First question: is this page even a landing page? Is it targeted at a specific keyword? Internal search result pages are not automatically landing pages. If they are not landing pages, then do whatever you want with them (and make sure they do not get indexed by Google).
If they are landing pages (pages targeted at a specific keyword), the performance of the site is indeed important, both for the conversion rate of these pages and, indirectly (and to a smaller extent also directly), for Google. So some kind of lazy-load logic for pages with a lot of images is a good idea.
I would go for:
Load the first two (product?) images in an SEO-optimized way (as normal HTML, with targeted alt text and a targeted filename). For the rest of the images, build a lazy-load logic, but don't just set the src to blank; insert the whole img tag into your code on load (or on scroll, or whatever), as in the sketch below.
Having a lot of broken img tags in the HTML for non-JavaScript users (i.e. Google, old mobile devices, text browsers) is not a good idea. You will not get a penalty as long as the lazy-loaded images are not misleading, but shoddy markup is never a good idea.
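A minimal sketch of that approach; the class and data-* attribute names are made up, and a real implementation would also want a noscript fallback:

<span class="deferred-img" data-src="/img/product-03.jpg" data-alt="Red widget"></span>

<script>
// After the window has loaded, turn each placeholder into a real <img> element.
window.addEventListener("load", function () {
  var placeholders = document.querySelectorAll(".deferred-img");
  for (var i = 0; i < placeholders.length; i++) {
    var img = document.createElement("img");
    img.src = placeholders[i].getAttribute("data-src");
    img.alt = placeholders[i].getAttribute("data-alt") || "";
    placeholders[i].parentNode.replaceChild(img, placeholders[i]);
  }
});
</script>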
For general SEO questions please visit https://webmasters.stackexchange.com/ (Stack Overflow is more for programming-related questions).
I have to disagree with Alex. Google recently updated its algorithm to account for page load time. According to the official Google blog
...today we're including a new signal in our search ranking algorithms: site speed. Site speed reflects how quickly a website responds to web requests.
However, it is important to keep in mind that the most important aspect of SEO is original, quality content.
http://googlewebmastercentral.blogspot.com/2010/04/using-site-speed-in-web-search-ranking.html
I have added lazy loading to my site (http://www.amphorashoes.ro) and I get better PageRank from Google (maybe because the content is loading faster) :)
First, don't use src=""; it may hurt your page. Use a small loading image as a placeholder instead.
Second, I don't think it will affect SEO. We always use alt="imgDesc.." to describe the image, and the spider may pick up this alt text, even though it can't analyse what the image really is.
I found this tweet regarding Google's SEO
There are various ways to lazy-load images, it's certainly worth thinking about how the markup could work for image search indexing (some work fine, others don't). We're looking into making some clearer recommendations too.
12:24 AM - 28 Feb 2018
John Mueller - Senior Webmaster Trends Analyst
From what I understand, it looks like it depends on how you implement your lazy loading, and Google has yet to recommend an approach that would be SEO-friendly.
Theoretically, Google should be running the scripts on websites, so it should be OK to lazy load. However, I can't find a source (from Google) that confirms this.
So it looks like crawling lazy-loaded or deferred images may not be foolproof yet. Here's an article I wrote about lazy loading, image deferring, and SEO that talks about it in detail.
Here's a working library that I authored which focuses on lazy loading or deferring images in an SEO-friendly way.
What it basically does is cancel the image loading when the DOM is ready and continue loading the images after the window load event, roughly like this:
...
<div>My last DOM element</div>
<script>
(function() {
  // Remove all the sources: stash each src in data-src so the pending downloads are cancelled.
  var imgs = document.getElementsByTagName("img");
  for (var i = 0; i < imgs.length; i++) {
    imgs[i].setAttribute("data-src", imgs[i].getAttribute("src"));
    imgs[i].removeAttribute("src");
  }
})();
window.addEventListener("load", function() {
  // Return all the sources: restore each img's original src from data-src.
  var imgs = document.getElementsByTagName("img");
  for (var i = 0; i < imgs.length; i++) {
    imgs[i].setAttribute("src", imgs[i].getAttribute("data-src"));
  }
}, false);
</script>
</body>
You can cancel the loading of an image by removing its src value or replacing it with a placeholder image. You can test this approach with Google Fetch.
You have to make sure that you have the correct src until the DOM is ready, so that Google Fetch will capture your imgs' original src.

Is it really helpful to store content in flat-file than a database for better google/yahoo/bing searches?

I just came across a few articles while selecting a wiki for my personal site. I am confused: as I am setting up a personal wiki for my personal projects, I think a flat-file system is good, even for maintaining revisions of the design documents, design decisions, and comments/feedback from peers.
But the internet gives a mixed bag of responses, mostly irrelevant information. Can anyone please shed some light on the selection? It would be nice if someone could share their experience of selecting a wiki for a personal/small-business site.
You're asking more about Search Engine Optimization (SEO), which has nothing much to do with how you store your content on the server. Whether it is static HTML or a DB-driven application, search engines will still index your pages by trawling from link to link.
Some factors that do affect search engines' ability to index your site:
Over-dependency on JavaScript to drive dynamic content. If certain blocks of information can't even be rendered on the page without invoking JavaScript, that will be a problem. Search engines typically don't execute the JS on your page; they just take the content as-is (see the example after this list).
Not making use of proper HTML tags to represent the varying classes of data. An <h1> tag is given more emphasis by search engines than a <p> tag. Basically, you just need a proper grasp of which HTML element to tag your content with.
URLs. Strictly speaking, I don't think having complicated dynamic URLs is a problem for search engines. However, I've seen some weird content management systems that expose several different URL mappings that all point to the same content. It would be logical for the search engines to deem this same content to be separate pages, which can dilute your ranking.
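To illustrate the first point (the class, id and price are invented), a crawler that does not execute JavaScript sees the first block as-is and the second block as an empty element:

<!-- Content present in the HTML itself: indexed as-is -->
<div class="price">$19.99</div>

<!-- Content injected by script: invisible to a crawler that skips the JS -->
<div id="js-price"></div>
<script>document.getElementById("js-price").innerHTML = "$19.99";</script>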
There are other factors. I suggest you search for "accessible web content" as your Google search key.
As for flat files vs DB-driven content, think about how you're going to manage the system. At the end of the day, it's your own labor (or your subordinates'). I, for one, sure don't want to spend my time managing content manually. So, a convenient content management system is pretty much mandatory. I know that there are a couple of Wiki implementations that write directly to flat files. As long as the management part of it is good enough, I'm sure they'd be fine for your purposes.

SEO Superstitions: Are <script> tags really bad? [closed]

We have an SEO team at my office, and one of their dictums is that having lots of <script> blocks inline with the HTML is apocalyptically bad. As a developer, that makes no sense to me at all. Surely the Google search engineers, who are the smartest people on the planet, know how to skip over such blocks?
My gut instinct is that minimizing script blocks is a superstition that comes from the early ages of search engine optimizations, and that in today's world it means nothing. Does anyone have any insight on this?
Per our SEO guru, script blocks (especially those that are inline, or occur before the actual content) are very, very bad, and make the Google bots give up before processing your actual content. Seems like bull to me, but I'd like to see what others say.
It's been ages since I've played the reading-Google's-tea-leaves game, but there are a few reasons your SEO expert might be saying this.
Three or four years back there was a bit of conventional wisdom floating around that the search engine algorithms would give more weight to search terms that happened sooner in the page. If all other things were equal on Pages A and B, if Page A mentions widgets earlier in the HTML file than Page B, Page A "wins". It's not that Google's engineers and PhD employees couldn't skip over the blocks, it's that they found a valuable metric in their presence. Taking that into account, it's easy to see how unless something "needs" (see #2 below) to be in the head of a document, an SEO obsessed person would want it out.
The SEO people who aren't offering a quick fix tend to be proponents of well-crafted, validating/conforming HTML/XHTML structure. Inline Javascript, particularly the kind web-ignorant software engineers tend to favor, makes these people (I'm one) seethe. The bias against script tags themselves could also stem from some of the work Yahoo and others have done in optimizing Ajax applications (don't make the browser parse Javascript until it has to). Not necessarily directly related to SEO, but a best practice a white-hat SEO type will have picked up.
It's also possible you're misunderstanding each other. Content that's generated by Javascript is considered controversial in the SEO world. It's not that Google can't "see" this content, it's that people are unsure how its presence will rank the page, as a lot of black hat SEO games revolve around hiding and showing content with Javascript.
SEO is at best Kremlinology and at worst a field that the black hats won over a long time ago. My free, unsolicited advice is to stay out of the SEO game, present your managers with estimates of how long it will take to implement their SEO-related changes, and leave it at that.
There are several reasons to avoid inline/internal JavaScript:
HTML is for structure, not behavior or style. For the same reason you should not put CSS directly in HTML elements, you should not put JS there.
If your client does not support JS, you have just pushed a lot of junk: wasted bandwidth.
External JS files are cached, which saves some bandwidth (see the example after this list).
You'll have decentralized JavaScript, which leads to code repetition and all the known problems that come with it.
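A small illustration of the caching point; the function and file path are made up:

<!-- Inline: this script is re-sent inside every page that uses it -->
<script>
function trackOutboundClick(url) { window.location = url; }
</script>

<!-- External: fetched once, then reused from the browser cache on other pages -->
<script src="/js/tracking.js"></script>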
I don't know about the SEO aspect of this (because I can never tell the mumbo jumbo from the real deal). But as Douglas Crockford pointed out in one of his JavaScript webcasts, the browser always stops to parse the script at each script element. So, if possible, I'd rather deliver the whole document and enhance the page as late as possible with scripts anyway.
Something like
<head>
--stylesheets--
</head>
<body>
Lorem ipsum dolor
...
...
<script src="theFancyStuff.js"></script>
</body>
I've read in a few places that Google's spiders only index the first 100KB of a page. 20KB of JS at the top of your page would mean 20KB of content later on that Google wouldn't see, etc.
Mind you, I have no idea if this is still true, but when you combine it with the rest of the superstitions/rumors/outright quackery you find in the dark underbelly of SEO forums, it starts to make a strange sort of sense.
This is in addition to the fact that inline JS is a Bad Thing with respect to the separation of presentation, content, and behavior, as mentioned in other answers.
Your SEO guru is slightly off the mark, but I understand the concern. This has nothing to do with whether or not the practice is proper, or whether or not a certain number of script tags is looked upon poorly by Google, but everything to do with page weight. Google stops caching after (I think) 150 KB. The more inline scripts your page contains, the greater the chance that important content will not be indexed because those scripts added too much weight.
I've spent some time working on search engines (not Google), but have never really done much from an SEO perspective.
Anyway, here are some factors which Google could reasonably use to penalise a page, all of which get worse when you include big blocks of JavaScript inline:
Overall page size.
Page download time (a mix of page size and download speed).
How early in the page the search terms occurred (might ignore script tags, but that's a lot more processing).
Script tags with lots of inline JavaScript might be interpreted as bad on their own. If users frequently load a lot of pages from the site, they'd find it much faster if the script was in a single shared file.
I would agree with all of the other comments, but would add that when a page has more than just <p> tags around the content, you are putting your faith in Google to interpret the mark-up correctly, and that is always a risky thing to do. Content is king, and if Google can't read the content perfectly then that's just another reason for Google not to show you the love.
This is an old question, but still pretty relevant!
In my experience, script tags are bad if they cause your site to load slowly. Site speed actually does have an impact on your appearance in SERPs, but script tags in and of themselves aren't necessarily bad for SEO.
Many activities in SEO are not actually recommended by the search engines. You can use <script> tags, just not excessively; even the Google Analytics snippet comes in a <script> tag.