Should I use <![CDATA[...]]> in HTML5? - cdata

I'm pretty sure <![CDATA[...]]> sections can be used in XHTML5, but what about HTML5?

The CDATA structure isn't really for HTML at all, it's for XML.
People sometimes use them in XHTML inside script tags because it removes the need for them to escape <, > and & characters. It's unnecessary in HTML though, since script tags in HTML are already parsed like CDATA sections.
Edit: This is where we open that really mouldy old can of worms from 2002 over whether you're sending XHTML as text/html or as application/xhtml+xml like you’re “supposed” to :-)
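To make concrete the escaping that CDATA saves you in XHTML, here is a minimal JavaScript sketch of what you would otherwise have to apply to inline script text. The helper name escapeXmlText is made up for illustration; it is not part of any library.

```javascript
// Hypothetical helper: escape the characters an XML parser would
// otherwise read as markup or as the start of an entity reference.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;") // must run first, so the entities added below are not escaped again
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeXmlText("if (a < b && c > d) go();"));
// prints: if (a &lt; b &amp;&amp; c &gt; d) go();
```

The escaped text is only correct for an XML parser. An HTML parser does not decode entities inside script tags, which is why the escaped form cannot be shared between the two.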

From the same page @pst linked to:
Element-specific parsing for script and style tags, Guidance for XHTML-HTML compatibility: "The following code with escaping can ensure script and style elements will work in both XHTML and HTML, including older browsers."
Maximum backwards compatibility:
<script type="text/javascript"><!--//--><![CDATA[//><!--
...
//--><!]]></script>
Simpler version, sort of incompatible with "much older browsers":
<script>//<![CDATA[
...
//]]></script>
So, CDATA can be used in HTML5, and it's recommended in the official Guidance for XHTML-HTML compatibility.
This is useful for polyglot HTML/XML/XHTML pages, which are served as strict application/xml XML during development, but served as text/html HTML5 in production mode for better cross-browser compatibility. Polyglot pages have their benefits; I've used this myself, as it's much easier to debug XML/XHTML5. Google Chrome, for example, will throw an error for invalid XML/XHTML5 (including, for example, character escaping), whereas the same page served as HTML5 will "just work", also known as "probably work".
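If you generate such pages from code, the simpler wrapper quoted above can be produced by a small helper. This is only a sketch under stated assumptions: the function name wrapPolyglotScript is hypothetical, and the two checks cover just the most obvious ways the wrapper can be broken.

```javascript
// Hypothetical helper: wrap inline JavaScript in the simpler
// //<![CDATA[ ... //]]> form so it can be read as both HTML and XHTML.
function wrapPolyglotScript(code) {
  // "]]>" would end the CDATA section early for an XML parser.
  if (code.includes("]]>")) {
    throw new Error('code must not contain "]]>"');
  }
  // "</script" would end the element early for an HTML parser.
  if (/<\/script/i.test(code)) {
    throw new Error('code must not contain "</script"');
  }
  return "<script>//<![CDATA[\n" + code + "\n//]]></script>";
}

console.log(wrapPolyglotScript("if (a < b && c) { run(); }"));
```

The newline before the closing //]]> matters: without it, the last line of your code would swallow the marker only if it already ended in a line comment, and otherwise ]]> would be parsed as JavaScript.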

The spec seems to clear up this issue. script and style tags are considered to be "raw text elements." CDATA is not needed or allowed for them. CDATA is only used with "foreign content", i.e. MathML and SVG. Note that there are some restrictions on what can go in the script tag: basically, you can't put something like var x = '</script>' in there, because it will close the tag, so the string needs to be split as pst noted in his answer. http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions
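As an alternative to splitting the string by hand, the closing-tag sequence can be broken up with a backslash, since '<\/script>' is the same string as '</script>' in JavaScript. The sketch below assumes the sequence only occurs inside string literals in the source being inlined; the helper name escapeInlineScript is hypothetical.

```javascript
// Hypothetical helper: make JavaScript source safe to place between
// <script> tags in HTML by turning "</script" into "<\/script".
// Assumption: the sequence only occurs inside string literals,
// where "\/" is just another way of writing "/".
function escapeInlineScript(source) {
  return source.replace(/<\/(script)/gi, "<\\/$1");
}

const escaped = escapeInlineScript("var x = '</script>';");
console.log(escaped);
// prints: var x = '<\/script>';
```

The HTML parser no longer sees a closing tag, while the script still assigns the original string to x.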

HTML5-supporting browsers (and most older browsers going all the way back to 2001) already read the content inside <style> and <script> tags as CDATA (character data). That means you generally do not need to add CDATA tags inside those elements for most HTML browsers built in the past 20 years, as they will correctly handle any special characters that might pop up when adding CSS and JavaScript code between them.
However...you do need to add the CDATA block inside <style> and <script> HTML5 tags if you want your HTML5 page to be compatible with XHTML and XML browsers and parsers, which do need CDATA tags. For that reason, I do recommend you use CDATA in HTML5 <style> and <script> tags, but please read on. If you do not do this right, you will break your website!
Note: The CDATA tag helps XML parsers ignore special characters that might pop up in between those elements and that would otherwise be read as part of the XML markup and break it (like < or > characters, for example). Only the <style> and <script> elements in modern HTML parsers have this special feature already built in. That simply means that HTML browsers and parsers are designed not to read or parse those characters as part of the markup. If these elements did not have built-in CDATA behavior, your web page, styles, and scripts could break!
XML and XHTML parsers will read the <style> and <script> tag content as they do the content of all other elements, as PCDATA (i.e. as a normal element), meaning the contents are parsed as markup and can break when special characters are added in between those tags. You can add special CDATA sections between those two tags to prevent that. Because XML and XHTML parsers read everything inside elements as potentially more markup, adding CDATA prevents certain characters from being interpreted as XML tags or character references.
The problem is, most HTML4/HTML5 browsers and parsers don't support adding additional CDATA sections between those tags, so CDATA blocks have to be commented out for those agents if you add them for XHTML/XML support.
Also, note that all HTML comments (<!-- or -->) added inside those tags are ignored by HTML parsers, but honored by XHTML ones, which comments out the CSS and JavaScript in XHTML. Many people in the past would add comment markers between those tags to hide styles and scripts from very old browsers that did not understand CSS or JavaScript (pre-1998 browsers). But that strategy fails in XHTML without additional code.
So how do you combine all that inside <style> and <script> tags, and should you care?
I am a purist and like my HTML5 content to still be XML/XHTML-friendly, regardless of what markup recommendation I am using. I also like my pages to work in browsers that know CSS and older browsers that do not. So here are two solutions to support all those scenarios and still display your styles and scripts in modern browsers without error. They are totally safe to use in modern HTML5 browsers:
STYLE
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
...put your styles here
/*]]>*/-->
</style>
SCRIPT
<script type="text/javascript">
<!--//--><![CDATA[//><!--
...put your scripts here
//--><!]]>
</script>
ADDITIONAL NOTES
These two code blocks do not change anything in modern HTML5 browsers.
These two code blocks will allow your CSS and JavaScript to work normally as before in HTML5 browsers but hide CSS and JavaScript from very old browsers (pre-2001) that do not support those technologies.
XHTML browsers will now parse your CSS and JavaScript as before but not allow special characters like <, >, and & to be interpreted as markup or entities/escaped characters which would generate parsing errors. They are CDATA now.
XML parsers of your page will not understand your CSS and JavaScript, of course, but will accept any type of text you add in there and not try and parse them as markup. They are CDATA now.
HOW THE EXAMPLES WORK
For modern HTML5-supporting browsers, because script and style elements act like CDATA, all markup is ignored and treated as characters. So comment markers <!-- and --> inside script and style tags are ignored. Older browsers (pre-2001) that do not know scripts or CSS do not treat script and style elements as CDATA-supporting elements. They recognize the HTML comment tags, so they comment out all the CSS and JavaScript between them. Note that some browsers do know CSS and scripts but also read the HTML comments, so we close out the first comment (<!--/*-->). They are then forced to read the <![CDATA[/*> block (used for XHTML and XML parsers), which to them is an empty unknown element and so is ignored. The HTML comment that follows it is designed to hide all the CSS and scripts from there to the end of the block. The final <!]]> is another ignored empty element that closes the unknown CDATA markup tag for those that still read it.
XHTML parsers read all the code inside these tags as markup; on their own they do not treat the content of these elements as CDATA. They therefore need a CDATA section wrapped around all CSS and JavaScript in the block so that they act like HTML5 browsers. XHTML parsers do read the HTML comment markers, but the first comment is closed right away, so it ends early. The <![CDATA[ marker is then read and starts the CDATA section for them, as it is standard XML syntax. It wraps around all styles and scripts inside the tags until ]]> ends it, creating a true CDATA wrapper that hides XML special characters correctly for them. Everything inside the CDATA section is interpreted as HTML5 parsers interpret it now, as normal CSS and scripts, and the XHTML parser no longer recognizes markup inside it. Because old and new XHTML browsers all know CSS and JavaScript, they still parse and process that code correctly, ignoring XML reserved characters.
Generic XML parsers know XML comments and CDATA sections but not CSS or JavaScript comments. They skip the short leading comment and then read everything inside the CDATA section as plain character data. Since they do not need to interpret CSS or JavaScript, nothing more is needed.
Your HTML5 page is fully cross-compatible with modern HTML5 and XHTML5 browsers, older HTML/XHTML browsers, very old 1990's non-supporting CSS/script browsers, and various XML parsers, old and new! Enjoy

Perhaps see: http://wiki.whatwg.org/wiki/HTML_vs._XHTML
In HTML, <![CDATA[...]]> is a bogus comment.
In HTML, <script> is already protected -- this is why sometimes it must be written as a = "<" + "/script>", to avoid confusing the browser. Note that the code is valid outside a CDATA in HTML.

Related

Prettier, is it possible to disable for a language?

I'm using Prettier with pretty-quick and do not agree with how it formats HTML. I know I can exclude by globs but can I exclude by language so that the HTML in my Vue files are untouched?
No. It's not possible. HTML is considered the "main language" of a .vue file, JS and CSS "embedded". You can only disable formatting for embedded languages by the --embedded-language-formatting off option. As for template tags, the only way to keep them unformatted is to put <!--prettier-ignore--> in front of each of them.

Vue.js - Friendly syntax without compilation templates

Vue.js is an ideal library for my case, but I use it on a non-SPA page.
Is there a way to avoid the v-bind:click syntax? I would like the attributes to start with data-v-* and not contain a colon (:).
My solution (based on accepted answer):
It looks like Vue.js will not pass the exam here.
Knockout proved to be the ideal library for SEO-friendly HTML syntax without compiled templates.
You can use script templates to "hide" your Vue-HTML from the validator. The following validates as HTML5.
<!DOCTYPE html>
<html>
<head>
<title>Whatever</title>
</head>
<body>
<script id="some-template" type="text/template">
<div v-model="foo" v-bind:click="dontCare">Whatever</div>
</script>
<some-template id="app"></some-template>
</body>
</html>
This is not as much of a cheat as it might seem, since Vue-HTML is not HTML, but is in fact a template used for generating HTML (or, I think more accurately, for generating the DOM). But it is a complete cheat in the sense that the generated HTML is never subjected to the validator. :)
Alternatively, you could look at using Knockout, which uses pure HTML5 (what you write is what is delivered to the browser). It is not as performant as Vue, but it is an MVVM framework.
Short answer is: No.
I don't think there is a way to change the templating of Vue. The generated HTML shipped to the user will be valid, because directives (v-for, v-bind, etc.) will be stripped from your HTML. A framework like Angular, for example, does not strip its many "ng-*" attributes. For instance, you could write:
<div v-if="model.length" />
The resulting html will be:
<div />
Which is valid and compatible with any W3C validator.

Exclude menu from content extraction with tika

I generate HTML documents that contain a menu and a content part. Then I want to extract the content of these documents to feed it to a Lucene index. However, I would like to exclude the menu from the content extraction and thus only index the content.
<div class="menu">my menu goes here</div>
<div class="content">my content goes here</div>
What is the simplest way to achieve this with Apache Tika?
As a more general solution (not just for your specific menu) I would advise looking at boilerpipe, which deals with removing uninteresting parts from pages (menus, navigation, etc.).
I know it can be integrated in Solr/Tika; have a look, and you can probably integrate it in your scenario.
Have a look at this post, which specifies how to handle DIVs during the HTML parse by specifying whether they are safe to parse or not; a DIV that is not safe is ignored. For your problem, you could have some logic in the override methods which ignores only DIV elements with the attribute value "menu" (i.e. tells the Tika parser this DIV is unsafe to parse).
You can parse the HTML with a parser into an XHTML DOM object and remove the div tag containing the attribute class="menu".

remove redundant css rules from dynamic website

I'm trying to reduce the size of my CSS file. It is from a template which is very CSS- and JS-heavy. Even with CSSMin the CSS file size is 250kb.
Whilst I use a lot of the CSS, I know I don't use it all. So I'm trying to work out which styles can be removed. I'm aware of Dust-Me Selectors, but that only takes a static look at the website. With HTML5 and CSS3, websites are now very dynamic, and most of my CSS is applied by dynamic events or 'responsive' events, e.g. Bootstrap.
Question: Is there a tool which 'records' all my CSS use on a website for a period of time, so I can go and click/hover/move over each element and interact with my site, and then at the end lets me know which styles were and were not used?
CSS Usage is a great extension for firefox. It tells which css are currently used in a page.
Link: https://addons.mozilla.org/en-us/firefox/addon/css-usage/
There are two tools that I think might help you out.
helium is a javascript tool that will discover any unused css rules.
csscss is a source code analyzer that will report any duplication. I'm biased because I wrote csscss precisely because I couldn't find anything that did this. But many people seem to find it useful.
250kb is really such a big figure for just CSS files.
The templates generally have all the CSS required for all the pages in a single file.
I would suggest:
Do not cut your CSS code; it might be needed at some point.
Instead, I would suggest breaking your CSS file into a number of small files for the different pages,
such as one CSS file for the login page, another CSS file for the home page, etc.
Read your own CSS and HTML code rigorously to find out which significant parts of the CSS code are used in which HTML sections.
Update:
You may try Removed Unused CSS - CSS optimizer.
I have not used it personally; I just hope it works for you.

RDFa Snippet Generator from GoodRelations

I've created an RDFa snippet to use on a client's website using the GoodRelations tool. The generated code creates the tags as expected, but there's no text between the divs, for instance:
<div typeof="vcard:Address">
<div property="vcard:locality" content="Yorba Linda"></div>
</div>
I'm assuming that this is OK, and that I am expected to put descriptive text for humans between the 'locality' divs without any adverse effects (in relation to SEO.) Correct?
As William says: In most cases, it is impractical to reuse visible content for publishing meta-data, because they differ in sequence or structure. In that case, it is better to put all meta-data in a single block of <div> elements without visible content. This is called "RDFa in Snippet Style", see
http://www.ebusiness-unibw.org/tools/rdf2rdfa/
Hepp, Martin; García, Roberto; Radinger, Andreas: RDF2RDFa: Turning RDF into Snippets for Copy-and-Paste, Technical Report TR-2009-01, 2009., PDF at http://www.heppnetz.de/files/RDF2RDFa-TR.pdf
Google is consuming such markup, despite a general preference for marking up visible content. Many big shops are using this approach with good results, e.g. http://www.rachaelraystore.com/Product/detail/Rachael-Ray-Stoneware-2-pc-Bubble-Brown-Baker-Set-Eggplant/316398
So if you can integrate the visible content and the RDFa constructs, then use
<div typeof="vcard:Address">
<div property="vcard:locality">Yorba Linda</div>
</div>
If you cannot, then use
<div typeof="vcard:Address">
<div property="vcard:locality" content="Yorba Linda"></div>
</div>
...
<div>
<div>Yorba Linda</div>
</div>
But the divs with invisible content must be close to the visible content, and should preferably be placed before rather than after the visible markup.
From an RDFa point of view, it is fine (I am assuming you are using braces because you don't know how to escape greater than / less than characters).
The only thing you need to think about is how adding this fragment of HTML to your HTML document will affect the rendering. Based on the fact that you are using the content attribute, this fragment is destined to remain hidden. So you should think about this in relation to the CSS architecture. My advice would be to create a specific CSS class that is for annotations.
Having spoken to the author of Good Relations, his advice would be to put this fragment before any other HTML element in the body of your document. Generally, the Rich Snippets team indicate that they ignore hidden RDFa, but it doesn't actually matter and really in the long run it enables the publishing of RDF to anyone (not only Google) who wants to consume it.