Are CDATA sections really unnecessary? - sql

This question is prompted by the rather militant refusal of developer Michael Rys to include the parsing of CDATA sections into FOR XML PATH because "There is no semantic difference in the data that you store."
I have stored nuggets of HTML in CDATA nodes and other content that requires the use of special or awkward characters. However I don't feel qualified to challenge Rys's controversial assertion because, I suppose, technically he is correct in the scenarios where I've employed CDATA for convenience.
What's really baking my noodle is that, as developers take to the internet begging for advice on how to render CDATA segments using FOR XML PATH, respondents continually direct them to use FOR XML EXPLICIT instead, the XML rendering method Rys cited as being the "query from hell".
If we can really do without CDATA in every use case that anyone can suggest I guess we should stop moaning and reject CDATA usage henceforth. But if there are clearly defined cases where CDATA is essential Rys already undertook that he would bake it into FOR XML PATH going forward in the topmost link in this question.
So which is it to be? Are CDATA sections really relics of the past? Or should Rys pull his finger out and allow for CDATA parsing in FOR XML PATH? And while we're at it, in the meanwhile, are there any hacks for getting FOR XML PATH to return CDATA sections?

CDATA sections are unnecessary. They're not a "relic of the past" because they've always been unnecessary.
This does not mean they aren't useful. Look at just about any programming language or library and you can find a large number of things you could do without because they are semantically equivalent to something else, but which are useful if there's a human being sitting there having to write the stuff.
For that matter, even with programmatic production it's also handy that one could take the opposite approach and use CDATA sections for every single piece of c-data (bloaty, but it could have efficiency gains elsewhere).
FOR XML PATH does not involve a human being sitting there having to write the stuff. It's a means of producing valid XML from the results of an SQL query. (It's also not a matter of parsing CDATA sections, but of producing them, which is a different matter.)
And you can't really complain about FOR XML EXPLICIT being the alternative when you want really fine control - the reason FOR XML EXPLICIT is so nasty to use sometimes is precisely because it gives you really fine control. Indeed, consider if they first added support for CDATA sections and then added support for every other tweak and configuration option that seemed just as vital to someone else out there. How long would it take before FOR XML EXPLICIT was the automatic choice due to it being more straightforward than FOR XML PATH‽
There are four cases where CDATA sections are useful:
1. You're sitting at a keyboard typing this stuff in yourself.
2. You are dealing with a mix of different technologies with different standards, designed at different times, which will be interpreted by different parsers in different ways (e.g. JavaScript embedded in XHTML; CDATA isn't 100% necessary there, but it's a nightmare to do otherwise).
3. You're trying to parse the XML with something that doesn't understand XML.
4. You're using something built on a parser whose low-level access distinguishes between CDATA sections and other character data, and you're using that low-level access inappropriately.
Funnily enough, these four cases are also the four cases where a ban on accepting CDATA sections can make sense.
Case 1 doesn't apply here, it isn't human-generated code.
Case 2 could apply here if you are doing something really crazy. Frankly, the lack of CDATA sections is the least of your worries here; switch to producing simpler XML in the query and transforming it elsewhere.
Case 3 could apply here, but it's not fair to complain to the SQL people if it does, when you should complain to the broken XML parser that doesn't treat &lt;example&gt; the same as <![CDATA[<example>]]>.
Case 4 could apply here, but again complain to the person who wrote the buggy code, not the SQL people.
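To make Case 4 concrete, here is a minimal sketch using Python's standard xml.dom.minidom, which is one parser that exposes the low-level distinction. The helper names text_of and node_types are made up for illustration. The character data is identical in both documents; only the node type differs, and code that branches on that node type is the kind of inappropriate use described above.

```python
from xml.dom import minidom

# The same character data, written once with escapes and once as a CDATA section.
escaped = minidom.parseString("<a>&lt;example&gt;</a>")
cdata = minidom.parseString("<a><![CDATA[<example>]]></a>")


def text_of(doc):
    """Concatenate all character data under the root, however it was written."""
    return "".join(node.data for node in doc.documentElement.childNodes)


def node_types(doc):
    """Return the DOM node types of the root element's children."""
    return [node.nodeType for node in doc.documentElement.childNodes]


# The data is the same...
same_text = text_of(escaped) == text_of(cdata)

# ...but the low-level API still reports different node kinds.
escaped_types = set(node_types(escaped))  # text nodes
cdata_types = set(node_types(cdata))      # CDATA section nodes
```

Code that only ever calls something like text_of never notices whether the producer emitted a CDATA section, which is the behaviour a consumer of FOR XML PATH output should be relying on.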

CDATA sections are useful if you don't care about the semantics of the data in them (i.e. you do not need to parse it - it is simply a run of characters), and you don't wish to escape any of the XML within them.
The definition, according to w3:
CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.
From wikipedia:
New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to "protect" data from being treated as ordinary character data during processing. Some APIs for working with XML documents do offer options for independent access to CDATA sections, but such options exist above and beyond the normal requirements of XML processing systems, and still do not change the implicit meaning of the data. Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup.
CDATA sections are useful for writing XML code as text data within an XML document. For example, if one wishes to typeset a book with XSL explaining the use of an XML application, the XML markup to appear in the book itself will be written in the source file in a CDATA section. However, a CDATA section cannot contain the string "]]>" and therefore it is not possible for a CDATA section to contain nested CDATA sections. The preferred approach to using CDATA sections for encoding text that contains the triad "]]>" is to use multiple CDATA sections by splitting each occurrence of the triad just before the ">". For example, to encode "]]>" one would write:
<![CDATA[]]]]><![CDATA[>]]>
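The splitting rule quoted above is mechanical enough to sketch in a few lines of Python. The function name to_cdata is hypothetical, and the sketch assumes the input contains only characters that are legal in XML in the first place.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" just before the ">"."""
    # Each "]]>" becomes "]]" (closing the current section) followed by a
    # new section that starts with the ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Encoding the bare triad yields two adjacent CDATA sections.
encoded = to_cdata("]]>")

# A parser reads the adjacent sections back as one run of character data.
sample = "a ]]> b <tag> & c"
round_trip = ET.fromstring("<x>" + to_cdata(sample) + "</x>").text
```

Because adjacent CDATA sections are just more character data, the consumer sees the original string and has no need to know it was split.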

It is interesting to see how someone can just throw away a very valuable piece of the standard with such a whimsical approach. Not everyone is using XML for a few hundred characters of HTML or a list of items for a drop-down.
Some of us are actually using XML to exchange data, very complex data like a CCD, CDA, CDR; these are all standard document formats in the healthcare arena and are becoming more and more prominent with ObamaCare. Part of the structure of these documents is attachments: things like DICOM images, PDFs and other binary data that should not be read by the parser, which is the reason the CDATA definition exists.
Why should I pay the overhead of the parser reading a 3 megabyte DICOM image embedded in a CCD document? Why should I be forced to separate the document when it came in the original data and is part of the XML standard? And I want to be able to locate and recover the document and its contents with XML.
It bewilders me why you all would support the parsing of data that is intended not to be parsed by the engine. If the engine sees CDATA, ignore it; it is very simple. And the continued argument that some do not need it is irrelevant. It is part of the standard and the standard should be maintained. If they would like to add a "feature", as it has been called, then support the default behavior with an option.
Please stop parsing CDATA and ignore it.

You are absolutely right: CDATA sections are essential in many scenarios, they're part of the XML standard, and they should be supported by every XML manipulation tool/method. But the thing is that MS usually doesn't care... you know, the "640kB should be enough for everyone" kind of approach.
Edit: About FOR XML EXPLICIT - this is THE best method for generating precisely formatted XML data. Yes, the syntax is kinda painful to look at and confusing, but once you use it a few times, you'll admire its beauty and power.

Related

How to style parts of i18n messages when using thymeleaf

I'm not sure this is the right place to ask this. I would like to know how best to style parts of messages from l10n properties files. For example, my client wants this message and formatting in a help window:
This is a self-assessment and comparison application.
Simplest solution would be to include the HTML tags in the messages.properties entry for this label. The problem with that is that the 40 translators that will process the messages.properties are bound to make mistakes like deleting the <, translating the attributes or styles of the HTML markup etc. Also it makes maintaining the markup and styling difficult for the devs.
Any better way to do this?
The solution I've seen typically done just uses th:utext with HTML tags in the .properties files. I would opine it does create a maintenance hassle as you mention and should be kept to a minimum.
One workaround is to create separate strings in some cases, like:
<span th:text="#{thisIsA}">This is a </span><strong><span th:text="#{selfAssessment}">self-assessment</span></strong>
However, this is error-prone since certain languages may change the order of the words. So that's not a great option.
If the HTML tags specifically are an issue, another way albeit somewhat ugly could be:
thisIsASelfAssessment=This is a {0}self-assessment{1}.
Or even
thisIsA=This is a {0}.
selfAssessment=self-assessment
But that might be confusing for the next developer reading it and may introduce the same issue you have with the 40 translators looking at it since you have curly braces. It also all becomes very tedious and generates more lines.
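To show what the {0}/{1} placeholder idea buys you, here is a rough sketch in Python rather than Thymeleaf. Python's str.format happens to use the same {0} notation, but this is not Thymeleaf's or Java's message formatting, and render_message is a made-up helper. The point is only that the translators' string carries no markup, and the developer-controlled tags are injected after the translated text has been escaped.

```python
import html


def render_message(translated: str, open_tag: str, close_tag: str) -> str:
    """Escape the translated text, then substitute the markup placeholders."""
    # Escape first, so a stray "<" or "&" typed by a translator cannot
    # turn into markup. Curly braces are left alone by html.escape.
    safe = html.escape(translated, quote=False)
    # Then fill {0} and {1} with tags that only the developers control.
    return safe.format(open_tag, close_tag)


# The entry from the .properties example above, without any HTML in it.
message = "This is a {0}self-assessment{1}."
rendered = render_message(message, "<strong>", "</strong>")
```

A translator can still break things by deleting or mistyping a brace, which is the same risk mentioned above, but the styling itself now lives in one place on the developer side.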
So in the end, you're likely best going with the simplest solution of utext.
Project-wise, you could have the initial translation done without the markup and add the markup in after they are done with a first pass at translating it. The issue may arise in the future when you need to change strings, but doing this would minimize some headache. It could make sense to keep these strings in a separate block in the .properties file so you can target them later.
Good question as I've had this issue myself.

A valid reason to enable CDATA support for SQL Server?

I am posing this use case as a reason to enable support for the CDATA section of XML documents on SQL Server, in response to the opinion of Michael Rys.
He states that
"There is no semantic difference in the data that you store."
I am a software controls engineer where we use a supervised distributed system, we generally have a windows based server and database for supervisory functions as well as high speed machine control applications. We use any number of PLCs to compose our distributed control system, and we keep a copy of the PLC program on the server. The PLC program is L5X format that calls for the CDATA section per specification (see page 40 for more info).
The CDATA section is used for component descriptions due to invalid XML characters being present in some of them and the need to preserve them:
"Component descriptions are brought into the project without being processed by
the XML parser for markup language. The description text is contained in a
CDATA element, a standard in the XML specification. A CDATA element
begins with the character sequence <![CDATA[ and ends with the character
sequence ]]>. None of the text within the CDATA element is interpreted by the
XML parser. The CDATA element preserves formatting so there is no need to use
control characters to enter formatted descriptions."
Here, I think at least, is an entirely valid reason for the existence and use of the CDATA section, in contrast to the opinion of Microsoft.
Buggy tools.
You may find it more or less convenient, but the only technical reason is if you have buggy tools that don't follow the XML rules in some way.
The CDATA section is used for component descriptions due to invalid XML characters being present in some of them and the need to preserve them.
Either you mean characters that are invalid in XML unescaped, in which case they could also be escaped, or you mean characters that are not valid in XML at all, in which case they are not valid CDATA sections. In the first case if your tools can't work with that, they are buggy. In the second case if your tools require you to work with that, they are buggy. Either way this is in the buggy tools category.
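Both branches of that argument can be checked against a conforming parser. The following is a small sketch using Python's standard library (which sits on top of expat); the parses helper and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Branch 1: characters that are only a problem when unescaped ("<", "&").
# Escaping them and wrapping them in CDATA give the same data back.
awkward = "if (a < b && c) { ... }"
escaped_ok = ET.fromstring("<d>" + escape(awkward) + "</d>").text == awkward
cdata_ok = ET.fromstring("<d><![CDATA[" + awkward + "]]></d>").text == awkward

# Branch 2: a character that XML 1.0 does not allow at all (U+0001).
# Putting it inside a CDATA section does not make the document well-formed.
control_in_text = parses("<d>\x01</d>")
control_in_cdata = parses("<d><![CDATA[\x01]]></d>")
```

So CDATA buys nothing in the first branch that escaping doesn't, and it cannot rescue the second.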
The general consensus in the XML community is that the following three forms are semantically equivalent:
<x>±</x>
<x>&#xB1;</x>
<x><![CDATA[±]]></x>
and that XML processing software therefore does not need to preserve the distinction between them. This is why entity references (or numeric character references) and CDATA sections are not part of the data model used by XPath, XSLT, and XQuery (known as XDM).
Unfortunately the XML specification itself does not define a data model and is rather weak on saying which constructs are information-bearing and which are not, so there will always be people coming up with arguments like this, just as there will be people arguing that the order of attributes should be preserved.
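The equivalence claimed above for the literal character, the numeric character reference and the CDATA section is easy to confirm with any parser that exposes only the data. A minimal sketch with Python's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# Three spellings of the same element content (U+00B1, plus-minus sign).
forms = [
    "<x>\u00b1</x>",              # literal character
    "<x>&#xB1;</x>",              # numeric character reference
    "<x><![CDATA[\u00b1]]></x>",  # CDATA section
]

# After parsing, nothing records which spelling was used.
texts = [ET.fromstring(form).text for form in forms]
all_equal = len(set(texts)) == 1
```

ElementTree simply has no place to store the difference, which matches the point that the XDM-based languages do not model it either.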

General strategy for designing Flexible Language application using ANTLR4

Requirement:
I am trying to develop a language application using antlr4. The language in question is not important. The important thing is that the grammar is very vast (easily >2000 rules!!!). I want to do a number of operations:
Extract a bunch of information. This can be call graphs, variable names, constant expressions, etc.
Any number of transformations:
If a loop can be expanded, we go ahead and expand it.
If we can eliminate dead code, we might choose to do that.
We might choose to rename all variables to conform to some norms.
Each of these operations can be applied independently of the others. And after applying these steps I want to rewrite the input as close as possible to the original input.
E.g. we might want to eliminate loops and rename the variables, and then output the result in the original language format.
Questions:
I see a need to build a custom Tree (read AST) for this. So that I can modify the tree with each of the transformations. However when I want to generate the output, I lose the nice abilities of the TokenStreamRewriter. I have to specify how to write each of the nodes of the tree and I lose the original input formatting for the places I didn't do any transformations. Does antlr4 provide a good way to get around this problem?
Is AST the best way to go? Or do I build my own object representation? If so how do I create that object efficiently? Creating object representation is very big pain for such a vast language. But may be better in the long run. Again how do I get back the original formatting?
Is it possible to work just on the parse tree?
Are there similar language applications which do the same thing? If so what strategy do they use?
Any input is welcome.
Thanks in advance.
In general, what you want is called a Program Transformation System (PTS).
PTSs generally have parsers, build ASTs, can prettyprint the ASTs to recover compilable source text. More importantly, they have standard ways to navigate/inspect/modify the ASTs so that you can change them programmatically.
Many offer these capabilities in the form of pattern-matching code fragments written in the surface syntax of the language being transformed; this avoids the need to forever know excruciatingly fine details about which nodes are in your AST and how they are related to their children. This is incredibly useful when you have big, complex grammars, as most of our modern (and our legacy) languages seem to have.
More sophisticated PTSs (very few) provide additional facilities for teasing out the semantics of the source code. It is pretty hard to analyze/transform most code without knowing what scopes individual symbols belong to, or their type, and many other details such as data flow. Full disclosure: I build one of these.
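On the formatting-preservation concern raised in the question, the general idea behind a token-stream rewriter is to record edits against positions in the original text and copy everything else through verbatim. Below is a toy sketch of that idea in Python. It is not ANTLR's API, rename_identifiers is a made-up name, and the regular expression is a crude stand-in for a real lexer (it would also rename matches inside strings and comments).

```python
import re


def rename_identifiers(source: str, renames: dict[str, str]) -> str:
    """Rename identifiers while leaving every other character untouched."""
    # Pass 1: record edits as (start, end, replacement) spans against the
    # original text instead of rebuilding the text from a tree.
    edits = []
    for match in re.finditer(r"[A-Za-z_]\w*", source):
        new_name = renames.get(match.group())
        if new_name is not None:
            edits.append((match.start(), match.end(), new_name))

    # Pass 2: apply the edits right to left so earlier offsets stay valid.
    result = source
    for start, end, replacement in reversed(edits):
        result = result[:start] + replacement + result[end:]
    return result


# Odd spacing and the comment survive because they are never re-printed.
original = "int  cnt = 0;   // counter\nwhile (cnt < 10)\n    cnt++;\n"
rewritten = rename_identifiers(original, {"cnt": "loopCounter"})
```

This works nicely for local edits such as renames. Structural changes like loop expansion are where it gets hard, since the replacement text has to be generated from somewhere, and that is the part a prettyprinter in a PTS takes care of.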

Is the <wbr> element semantic HTML? What about in a microdata context?

In short, this is bad web development and UX:
But solving it by using CSS3 word breaking (code & demo) can lead to an 'awkward whitespace' situation, and strange cut-offs — here's an example of both:
Maybe it's not such a big deal, and the UX perspective of it is here, but let's look at the semantics of one of the solutions:
You could ... use the <wbr> element to indicate an optional word
break opportunity. This will tell the browser to insert a line break
as necessary to flow onto a new line inside the container.
The first question: is using <wbr> semantic HTML? (Does it at least degrade gracefully?)
In either case, it seems that being un-semantic in the general sense is a small price to pay for good UX functionality.
However, the second question is about the big picture:
Are there any schema.org (microdata/RDFa) ramifications to consider when using <wbr> to split up an email address? Will it still be valid there?
The wbr element is defined in the HTML5 spec. So it's fine to use it. If it's used right (= according to the definition in the spec), you may call it also "semantic use".
I don't think that there would be any problems in combination with microdata/RDFa. Usually you'd provide the URL in an attribute anyway, which can't contain wbr elements of course: foo<wbr>#example<wbr>.com.
For element content I'd guess (didn't check though) that microdata/RDFa parsers should use the text content without markup resp. understand what is markup and what is text, otherwise e.g. a FOAF name would be <abbr>Dr.</abbr> Foo instead of Dr. Foo.
So you can bet that microdata/RDFa parsers know HTML ;), and therefore it shouldn't be a problem to use its elements.

What does the word "semantic" mean in Computer Science context?

I keep coming across the use of this word and I never understand its use or the meaning being conveyed.
Phrases like...
"add semantics for those who read"
"HTML5 semantics"
"semantic web"
"semantically correct way to..."
... confuse me and I'm not just referring to the web. Is the word just another way to say "grammar" or "syntax"?
Thanks!
Semantics are the meaning of various elements in the program (or whatever).
For example, let's look at this code:
int width, numberOfChildren;
Both of these variables are integers. From the compiler's point of view, they are exactly the same. However, judging by the names, one is the width of something, while the other is a count of some other things.
numberOfChildren = width;
Syntactically, this is 100% okay, since you can assign integers to each other. However, semantically, this is totally wrong, since the width and the number of children (probably) don't have any relationship. In this case, we'd say that this is semantically incorrect, even if the compiler permits it.
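One way to see the difference is to push some of that meaning into the types, so that the mistake becomes detectable. This is only an illustrative sketch in Python with made-up names (Width, ChildCount, record_children); the original snippet uses plain integers precisely because the compiler knows nothing about the meaning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Width:
    value: int


@dataclass(frozen=True)
class ChildCount:
    value: int


def record_children(count: ChildCount) -> int:
    """Accept only values that mean 'a number of children'."""
    if not isinstance(count, ChildCount):
        raise TypeError("expected a ChildCount, got " + type(count).__name__)
    return count.value


width = Width(3)

# Both wrappers hold an int, but passing a width where a child count is
# expected is now rejected instead of silently accepted.
try:
    record_children(width)
    rejected = False
except TypeError:
    rejected = True

accepted = record_children(ChildCount(2))
```

With bare integers the same assignment goes through without complaint, which is exactly the "syntactically fine, semantically wrong" situation.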
Syntax is structure. Semantics is meaning. Each different context will give a different shade of meaning to the term.
HTML 5, for example, has new tags that are meant to provide meaning to the data that is wrapped in the tags. The <aside> tag conveys that the data contained within is tangentially-related to the information around itself. See, it is meaning, not markup.
Take a look at this list of HTML 5's new semantic tags. Contrast them against the older and more familiar HTML tags like <b>, <em>, <pre>, <h1>. Each one of those will affect the appearance of HTML content as rendered in a browser, but they can't tell us why. They contain no information of meaning.
The word 'semantic' as an adjective simply means 'meaningful', which is closely related to the term 'high level' in computer science.
For instance:
Semantic data model:
a data model that is semantic, that is meaningful and understood by anyone regardless of his background or expertise.
C++ is less semantic than Java, because Java uses meaningful words for its classes, methods and fields.
HTML5 semantics: refer to the tags that describe themselves such , , and so on.
It means "meaning", what you've got left when you've already accounted for syntax and grammar. For example, in C++ i++; is defined by the grammar as a valid statement, but says nothing about what it does. "Increment i by one" is semantics.
HTML5 semantics is what a well-formed HTML5 description is supposed to put on the page. "Semantic web" is, generally, a web where links and searches are on meaning, not words. The semantically correct way to do something is how to do it so it means the right thing.
It is not just Computer Science terminology, and if you ask,
What is the meaning behind this Computer Science lingo?
then I'm afraid we'll get in a recursive loop just like this.
In the HTML world, "semantic" is used to talk about the meaning of tags, rather than just considering how the output looks. For example, it's common to italicize foreign words, and it's also common to italicize emphasized words. You could simply wrap all foreign or emphasized words in <i>..</i> tags, but that only describes how they look, it doesn't describe why they look that way.
A better tag to use for emphasized word is <em>..</em>, because it conveys the semantics of emphasis. The browser (or your stylesheet) can then render them in italics, and other consumers of the page will know the word is emphasized. For example, a screen-reader could properly read it as an emphasized word.
From my view, it's almost like looking at syntax in a grammatical way. I can't speak to semantics in broad terms, but when people talk about semantics on the web, they are normally referring to the idea that if you stripped away all of the CSS and JavaScript etc., what was left (the bare-bones HTML) would make sense to read.
It also takes into account using the correct tags for correct markup. This stems from the old table-based layouts (tables should only be used for tabular data), and using lists to present list-like content.
You wouldn't use an h1 for something that was less important than an h2. That wouldn't make sense.
The below is syntactically different but semantically the same:
C, C++, C#, Java, JavaScript, Python, Ruby, etc.
x += y
Perl, PHP
$x += $y