A valid reason to enable CDATA support for SQL Server?

I am posing this use case as a reason to enable support for the CDATA section of XML documents on SQL Server, in response to the opinion of Michael Rys.
He states that
"There is no semantic difference in the data that you store."
I am a software controls engineer. We work with supervised distributed systems: generally a Windows-based server and database for supervisory functions alongside high-speed machine control applications. We compose our distributed control system from any number of PLCs, and we keep a copy of the PLC program on the server. The PLC program is in L5X format, which calls for the CDATA section per the specification (see page 40 for more info).
The CDATA section is used for component descriptions due to invalid XML characters being present in some of them and the need to preserve them:
"Component descriptions are brought into the project without being processed by
the XML parser for markup language. The description text is contained in a
CDATA element, a standard in the XML specification. A CDATA element
begins with the character sequence <![CDATA[ and ends with the character
sequence ]]>. None of the text within the CDATA element is interpreted by the
XML parser. The CDATA element preserves formatting so there is no need to use
control characters to enter formatted descriptions."
Here, I think, at least, is an entirely valid reason for the existence and use of the CDATA section - in contrast to the opinion of Microsoft.

Buggy tools.
You may find it more or less convenient, but the only technical reason is if you have buggy tools that don't follow the XML rules in some way.
The CDATA section is used for component descriptions due to invalid XML characters being present in some of them and the need to preserve them.
Either you mean characters that are invalid in XML unescaped, in which case they could also be escaped, or you mean characters that are not valid in XML at all, in which case they are not valid CDATA sections. In the first case if your tools can't work with that, they are buggy. In the second case if your tools require you to work with that, they are buggy. Either way this is in the buggy tools category.

The general consensus in the XML community is that the following three forms are semantically equivalent:
<x>±</x>
<x>&#xB1;</x>
<x><![CDATA[±]]></x>
and that XML processing software therefore does not need to preserve the distinction between them. This is why entity references (or numeric character references) and CDATA sections are not part of the data model used by XPath, XSLT, and XQuery (known as XDM).
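As a quick check of that equivalence (a minimal sketch using Python's lxml; the element name and character are taken from the example above), all three forms yield the same text value once parsed:

from lxml import etree

# Parse each serialization; the parser erases the distinction between
# a literal character, a numeric character reference, and a CDATA section.
forms = [
    '<x>\u00b1</x>',
    '<x>&#xB1;</x>',
    '<x><![CDATA[\u00b1]]></x>',
]
texts = {etree.fromstring(f.encode()).text for f in forms}
print(texts)   # {'±'} - a single distinct value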
Unfortunately the XML specification itself does not define a data model and is rather weak on saying which constructs are information-bearing and which are not, so there will always be people coming up with arguments like this, just as there will be people arguing that the order of attributes should be preserved.

Related

Usage of Vue.js with XHTML

In a page which uses Vue.js there are often attributes whose names have a colon as their first character or somewhere in the middle (e.g. after v-bind or v-on). Those starting with a colon aren't namespace-well-formed per Namespaces in XML, and those with a prefix before the colon require that prefix to be declared with the xmlns:prefix syntax. There's even the possibility of starting an attribute name with #, which plain XML already forbids.
Therefore, is it at all possible, and if so, how practical, to use Vue.js in an XHTML document? Can at least a subset of its features be used? Or do there exist simple workarounds for the above problems, enabling all of Vue.js's power in a straightforward and interoperable way?

I want to load a YAML file, possibly edit the data, and then dump it again. How can I preserve formatting?

This question tries to collect information spread over questions about different languages and YAML implementations in a mostly language-agnostic manner.
Suppose I have a YAML file like this:
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: | # some comment
  some long block scalar value
I want to load this file into a native data structure, possibly change or add some values, and dump it again. However, when I dump it, the original formatting is not preserved:
The scalars are formatted differently, e.g. "b" loses its quotation marks, the value of second is not a literal block scalar anymore, etc.
The collections are formatted differently, e.g. the mapping value of foo is written in block style instead of the given flow style, similarly the sequence value of "bar" is written in block style
The order of mapping keys (e.g. first/second) changes
The comment is gone
The indentation level differs, e.g. the items in first are not indented anymore.
How can I preserve the formatting of the original file?
Preface: Throughout this answer, I mention some popular YAML implementations. Those mentions are never exhaustive since I do not know all YAML implementations out there.
I will use YAML terms for data structures: Atomic text content (even numbers) is a scalar. Item sequences, known elsewhere as arrays or lists, are sequences. A collection of key-value pairs, known elsewhere as dictionary or hash, is a mapping.
If you are using Python, ruamel will help you preserve a good deal of formatting, since it implements round-tripping up to native structures. It isn't perfect, however, and cannot preserve all formatting.
Background
The process of loading YAML is also a process of losing information. Let's have a look at the process of loading/dumping YAML as given in the spec; in the Load direction, the stages are:
Presentation (Character Stream) → Serialization (Event Tree) → Representation (Node Graph) → Native (Data Structure)
Dumping walks the same stages in the opposite direction.
When you are loading a YAML file, you are executing some or all of the steps in the Load direction, starting at the Presentation (Character Stream). YAML implementations usually promote their most high-level APIs, which load the YAML file all the way to Native (Data Structure). This is true for most common YAML implementations, e.g. PyYAML/ruamel, SnakeYAML, go-yaml, and Ruby's YAML module. Other implementations, such as libyaml and yaml-cpp, only provide deserialization up to the Representation (Node Graph), possibly due to restrictions of their implementation languages (loading into native data structures requires either compile-time or runtime reflection on types).
The important information for us is what is contained in those stages: each stage drops information that was still present in the previous one. According to the YAML specification, styles and comments are only present in the actual YAML file content and are discarded as soon as the YAML file is parsed. For you, this means that once you have loaded a YAML file to a native data structure, all information about how it originally looked in the input file is gone. When you then dump the data, the YAML implementation chooses a representation it deems useful for it. Some implementations let you give general hints/options, e.g. that all scalars should be quoted, but that doesn't help you restore the original formatting.
Thankfully, this only describes the logical process of loading YAML; a conforming YAML implementation does not need to follow it slavishly. Most implementations actually preserve data longer than they need to. This is true for PyYAML/ruamel, SnakeYAML, go-yaml, yaml-cpp, libyaml and others. In all these implementations, the style of scalars, sequences and mappings is remembered up until the Representation (Node Graph) level.
On the other hand, comments are discarded rather early since they do not belong to an event or node (the exceptions here are ruamel, which links comments to the following event, and go-yaml, which remembers comments before, at and after the line that created a node). Some YAML implementations (libyaml, SnakeYAML) provide access to a token stream which is even more low-level than the Event Tree. This token stream does contain comments; however, it is usable only for things like syntax highlighting, since the APIs do not contain methods for consuming the token stream again.
So what to do?
Loading & Dumping
If you only need to load your YAML file and then dump it again, use one of the lower-level APIs of your implementation to load the YAML only up to the Representation (Node Graph) or Serialization (Event Tree) level. The API functions to search for are compose/parse and serialize/present respectively.
It is preferable to use the Event Tree instead of the Node Graph as some implementations already forget the original order of mapping keys (due to internally using hashmaps) when composing. This question, for example, details loading / dumping events with SnakeYAML.
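As a minimal sketch of such an event-level round-trip with PyYAML (the file name is a placeholder):

import yaml

# Parse the character stream into events (Serialization level) and emit
# them again: scalar/collection styles and mapping key order survive the
# trip; comments do not, since PyYAML drops them during parsing.
with open("config.yaml") as f:
    events = list(yaml.parse(f))

with open("config.yaml", "w") as f:
    yaml.emit(events, stream=f)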
Information that is already lost in the event stream of your implementation, for example comments in most implementations, is impossible to preserve. Also impossible to preserve is scalar layout, like in this example:
"1 \x2B 1"
This loads as string "1 + 1" after resolving the escape sequence. Even in the event stream, the information about the escape sequence has already been lost in all implementations I know. The event only remembers that it was a double-quoted scalar, so writing it back will result in:
"1 + 1"
Similarly, a folded block scalar (starting with >) will usually not remember where line breaks in the original input have been folded into space characters.
To sum up, loading to the Event Tree and dumping again will usually preserve:
Style: unquoted/quoted/block scalars, flow/block collections (sequences & mappings)
Order of keys in mappings
YAML tags and anchors
You will usually lose:
Information about escape sequences and line breaks in flow scalars
Indentation and non-content spacing
Comments – unless the implementation specifically supports putting them in events and/or nodes
If you use the Node Graph instead of the Event Tree, you will likely lose anchor representations (i.e. that &foo may be written out as &a later with all aliases referring to it using *a instead of *foo). You might also lose key order in mappings. Some APIs, like go-yaml, don't provide access to the Event Tree, so you have no choice but to use the Node Graph instead.
Modifying Data
If you want to modify data and still preserve what you can of the original formatting, you need to manipulate your data without loading it to a native structure. This usually means that you operate on YAML scalars, sequences and mappings, instead of strings, numbers, lists or whatever structures the target programming language provides.
You have the option to either process the Event Tree or the Node Graph (assuming your API gives you access to it). Which one is better usually depends on what you want to do:
The Event Tree is usually provided as a stream of events. It may be better for large data since you do not need to load the complete data into memory; instead, you inspect each event, track your position in the input structure, and place your modifications accordingly. The answer to this question shows how to append items, given a path and a value, to a YAML file with PyYAML's event API.
The Node Graph is better for highly structured data. If you use anchors and aliases, they will be resolved there but you will probably lose information about their names (as explained above). Unlike with events, where you need to track the current position yourself, the data is presented as complete graph here, and you can just descend into the relevant sections.
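For the Node Graph route, here is a minimal sketch using PyYAML's compose/serialize API (the document content and key names are invented for the example):

import yaml

doc = """first:
  - foo: {a: "b"}
second: some value
"""

# compose() stops at the Representation (Node Graph) level.
root = yaml.compose(doc)                # a MappingNode

# A MappingNode's .value is a list of (key_node, value_node) pairs,
# so the original key order is preserved here.
for key_node, value_node in root.value:
    if key_node.value == "second":
        value_node.value = "new value"  # style and tag stay untouched

print(yaml.serialize(root))             # flow style and quotes survive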
In any case, you need to know a bit about YAML type resolution to work with the given data correctly. When you load a YAML file into a declared native structure (typical in languages with a static type system, e.g. Java or Go), the YAML processor will map the YAML structure to the target type if that's possible. However, if no target type is given (typical in scripting languages like Python or Ruby, but also possible in Java), types are deduced from node content and style.
Since we are not working with native loading because we need to preserve formatting information, this type resolution will not be executed. However, you need to know how it works in two cases:
When you need to decide on the type of a scalar node or event, e.g. you have a scalar with content 42 and need to know whether that is a string or integer.
When you need to create a new event or node that should later be loaded as a specific type. E.g. if you create a scalar containing 42, you might want to control whether it is loaded as the integer 42 or the string "42" later.
I won't discuss all the details here; in most cases, it suffices to know that if a string is encoded as a scalar but looks like something else (e.g. a number), you should use a quoted scalar.
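For instance, here is a sketch with PyYAML events showing how a quoted style keeps 42 a string (the mapping content is invented for the example):

import yaml

# style='"' forces double quotes, so the scalar later loads as a string;
# a plain scalar with the same text would resolve to the integer 42.
events = [
    yaml.StreamStartEvent(),
    yaml.DocumentStartEvent(),
    yaml.MappingStartEvent(anchor=None, tag=None, implicit=True),
    yaml.ScalarEvent(None, None, (True, False), "answer"),
    yaml.ScalarEvent(None, None, (False, True), "42", style='"'),
    yaml.MappingEndEvent(),
    yaml.DocumentEndEvent(),
    yaml.StreamEndEvent(),
]
print(yaml.emit(events))   # answer: "42"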
Depending on your implementation, you may come in touch with YAML tags. Seldom used in YAML files (they look like e.g. !!str, !!map, !!int and so on), they contain type information about a node which can be used in collections with heterogeneous data. More importantly, YAML defines that all nodes without an explicit tag will be assigned one as part of type resolution. This may or may not have already happened at the Node Graph level. So in your node data, you may see a node's tag even when the original node does not have one.
Tags starting with two exclamation marks are actually shorthands, e.g. !!str is a shorthand for tag:yaml.org,2002:str. You may see either in your data, since implementations handle them quite differently.
Important for you is that when you create a node or event, you may be able to, and may also need to, assign a tag. If you don't want the output to contain an explicit tag, use the non-specific tags: ! for non-plain scalars and ? for everything else at the event level. At the node level, consult your implementation's documentation about whether you need to supply resolved tags; if not, the same rule for the non-specific tags applies. If the documentation does not mention it (few do), try it out.
So to sum up: You modify data by loading either the Event Tree or the Node Graph, you add, delete or modify events or nodes in the data you get, and then you present the modified data as YAML again. Depending on what you want to do, it may help you to create the data you want to add to your YAML file as native structure, serialize it to YAML and then load it again as Node Graph or Event Tree. From there, you can include it in the structure of the YAML file you want to modify.
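The serialize-and-reload trick mentioned at the end can be as small as this sketch (the added key is invented):

import yaml

# Build the addition as a native structure, dump it to YAML, and load it
# back only to the Node Graph level, so it can be spliced into the graph
# of the document being edited.
addition = yaml.compose(yaml.dump({"retries": 3}))

root = yaml.compose("first: 1\n")
root.value.extend(addition.value)   # splice the new key/value pairs in
print(yaml.serialize(root))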
Conclusion / TL;DR
YAML has not been designed for this task. In fact, it has been defined as a serialization language, assuming that your data is authored as native data structures in some programming language and from there dumped to YAML. However, in reality, YAML is used a lot for configuration, meaning that you typically write YAML by hand and then load it into native data structures.
This contrast is the reason why it is so difficult to modify YAML files while preserving formatting: the YAML format has been designed as a transient data format, to be written by one application and then loaded by another (or the same) application, and in that process preserving formatting does not matter. It does matter, however, for data that is checked in to version control (you want your diff to contain only the line(s) with data you actually changed) and in other situations where you write your YAML by hand, because you want to keep the style consistent.
There is no perfect solution for changing exactly one data item in a given YAML file and leaving everything else intact. Loading a YAML file does not give you a view of the YAML file, it gives you the content it describes. Therefore, everything that is not part of the described content – most importantly, comments and whitespace – is extremely hard to preserve.
If format preservation is important to you and you can't live with the compromises made by the suggestions in this answer, YAML is not the right tool for you.
I would like to challenge the accepted answer. Whether you can preserve comments, the order of map keys, or other features depends on the YAML parsing library that you use. For starters, the library needs to give you access to the parsed YAML as a YAML document, which is a collection of YAML nodes. These nodes can contain metadata besides the actual key/value pairs, and the kinds of metadata your library chooses to store determine how much of the initial YAML document you can preserve. I will not speak for all languages and all libraries, but Go's most popular YAML parsing library, go-yaml, supports parsing YAML into a document tree and serializing the document back, and preserves:
comments
the order of keys
anchors and aliases
scalar blocks
However, it does not preserve indentation, insignificant whitespace, and some other minor things. On the plus side, it allows modifying the YAML document, and there is another library, yaml-jsonpath, that simplifies browsing the YAML node tree. Example:
package main

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"gopkg.in/yaml.v3"
)

// Round-trip the document through yaml.Node and check that the comments,
// the anchor/alias pair, key order and the block scalar all survive.
func Test1(t *testing.T) {
	var n yaml.Node
	y := []byte(`# Comment
t: &t
    - x: 1 # anchor
a:
    b: *t # alias
b: |
    cccc
    dddd
`)
	err := yaml.Unmarshal(y, &n)
	assert.NoError(t, err)
	y2, err := yaml.Marshal(&n)
	assert.NoError(t, err)
	assert.Equal(t, y, y2)
}

(Note: the input uses go-yaml's default four-space indentation so that the marshalled output can match the input byte for byte.)

Optimal set of characters to disallow/allow that prevents SQL injection perfectly?

I know that SQL injection must rely on a certain set of special characters used in SQL. So here's the question:
What would be the minimal set of characters to disallow, or the largest possible set to allow, that would prevent SQL injection perfectly?
I understand this is neither the best nor the easiest approach to preventing SQL injection, but I'm just curious about questions like "can you do it even without xxx".
Answering a comment:
Just for the sake of curiosity: some languages other than English can indeed be written normally without some of these special characters. For example:
Chinese:“”‘’,。!?:;——
Japanese:””’’、。!?:;ーー
English:""'',.!?:;--
You may want to look up some examples of the kinds of ways a SQL injection attack can be leveraged against a web application with weak security architecture. As for the "why", you'll need to analyze the kinds of site-user hacks that can get through what's supposed to be a locked-down (but public-facing) system.
I turned up this reference for further analysis:
You can see more at the source of this injection:
http://www.unixwiz.net/techtips/sql-injection.html
I know that sql injection must rely on certain set of special characters used in SQL. So here's the question:
There isn't any one character that is absolutely banned, nor are there others that can be trusted as 100% safe.
EDIT: Response to a revised OP...
A brief history of one type of SQL injection vulnerability: without purpose-built data access APIs for your database inputs, the most common approach is "dynamically" constructing SQL statements and then executing the raw SQL (a dangerous mix of developer-prepared code and outside user input).
Consider in pseudocode:
#SQL_QUERY = "SELECT NAME, SOCIAL_SECURITY_NUMBER, PASSWD, BANK_ACCOUNT_NUMBER" +
"FROM SECURE_TABLE " + "WHERE NAME = " + #USER_INPUT_NAME;
$CONNECT_TO SQL_DB;
EXECUTE #SQL_QUERY;
...
The expected use is a lookup of a specific user's information based on a single name value entered into some user form.
While it isn't my point to accidentally teach a how-to for a new generation of would-be SQL injection hackers, you can guess that a long concatenated string of code isn't really secure if user input on a web form can tag anything onto the code, including more code of the user's own.
The utility or danger of dynamic SQL coding is a discussion by itself. You can put a few barriers in front of a dynamic query such as this:
An "input handler" procedure that all inputs from app forms go through before the query code module is executed:
run_query( parse_and_clean( #input_value ) )
This is where you look for things like string inputs too long for the expected sizes (a 500-character name? Non-alphanumeric name values?).
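A minimal sketch of such a handler in Python (the rules and the function name are invented for illustration; real validation depends on what your inputs are supposed to look like):

import re

def parse_and_clean(value, max_len=100):
    # Hypothetical input handler: reject anything that doesn't look like
    # a human name before it gets near the query code module.
    if len(value) > max_len:
        raise ValueError("input too long for a name")
    if not re.fullmatch(r"[A-Za-z][A-Za-z '\-]*", value):
        raise ValueError("unexpected characters in name")
    return value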
Web forms can also have their own scripted input handlers. These are usually intended for validating specific inputs (not empty, longer than three characters, no punctuation, format masks like "1-XXX-XXX-XXXX", etc.), but they can also scrub for input hacks such as HTML tags, JavaScript, URLs with remote calls, and so on. Sometimes this need is best served here, because the offending input is caught before it gets anywhere near the database.
False positives: more often than not it really is just user error sending malformed input.
These examples date to the really early days of the Internet. There are lots of source code modules/libraries, add-ons, open-source and paid variants of attempts to make SQL injection much harder to break through into your host systems. Dig around; there is even a huge thread on Stack Overflow that discusses this exact topic.
For the time I have been stumbling around and developing for the Internet, I share a common observation: "perfectly" is a description of a system that has already been hacked. You won't get there, and that isn't the point. You should look towards not being the "easy, quick target" (i.e., the Benz in the gas station with the windows down, keys in the ignition and owner in the restroom).
When you say "can you do it even without...?" The answer is "sure" if it's a "Hello World" type project for some classroom assignment... hosted on your local machine... behind your router/firewall. After all, you gotta learn somewhere. Just be smart where you deploy stuff that isn't so secured.
It's just when you put things out open to the "world" on the Internet, you should always be thinking how the "same task" has to be done differently with a security mindset. Same outputs as your test/dev workspace but with more precautions such as locking down obvious and commonly exploited weaknesses.
Enjoy your additional research, and hopefully this post (devoid of code examples) can encourage your mindset in helpful directions.
Onward.
I know that SQL injection must rely on a certain set of special characters used in SQL. So here's the question:
What would be the minimal set of characters to disallow, or the largest possible set to allow, that would prevent SQL injection perfectly?
This is the wrong approach to this topic. Instead you need to understand why and how SQL injections happen. Then you can work out measures which prevent them.
SQL injections happen because user-supplied data is incorporated into a SQL command. If the user-supplied fragments are not processed properly, they can modify the intention of the SQL command.
The aspect of proper processing is crucial here, and it depends on how the developer intends the user-provided data to be interpreted. In most cases it's a literal value (e.g., a string, number, or date), sometimes a SQL keyword (e.g., the sort order ASC or DESC, or operators), and sometimes even an identifier (e.g., a column name).
These all follow certain syntax rules against which the user-provided data can be validated, or which can be enforced outright:
Prepared statements are the first choice for literal values. They separate the SQL statement from the parameter values, so that the latter can't be mistakenly interpreted as something other than literal values (see the sketch after this list).
Whitelisting can always be used when the set of valid values is limited, which is the case for SQL keywords and identifiers.
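A minimal sketch of the prepared-statement approach in Python (sqlite3 serves as a stand-in database; the table and the hostile input are invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, account TEXT)")

# Even a hostile input stays a plain literal: the ? placeholder keeps data
# and SQL code separated, so no quoting trick can change the query's shape.
user_input = "Robert'; DROP TABLE users;--"
rows = conn.execute(
    "SELECT name, account FROM users WHERE name = ?",
    (user_input,),
).fetchall()
print(rows)   # [] - no rows returned, and the table still exists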
So looking at the use cases for user-provided data and the corresponding options that prevent SQL injections, there is no need to restrict the set of allowed characters globally. Better to stick to the methods which are best practice and have proved effective.
And if you happen to have a use case that doesn't fit those mentioned, e.g., allowing users to provide something more complex than atomic language elements, such as an arbitrary filter expression, you should break it down into its atoms to be able to ensure its integrity.

Are CDATA sections really unnecessary?

This question is prompted by the rather militant refusal of developer Michael Rys to include the parsing of CDATA sections into FOR XML PATH because "There is no semantic difference in the data that you store."
I have stored nuggets of HTML in CDATA nodes and other content that requires the use of special or awkward characters. However I don't feel qualified to challenge Rys's controversial assertion because, I suppose, technically he is correct in the scenarios where I've employed CDATA for convenience.
What's really baking my noodle is that, as developers take to the internet begging for advice on how to render CDATA sections using FOR XML PATH, respondents continually direct them to use FOR XML EXPLICIT instead - the XML rendering method Rys himself cited as being the "query from hell".
If we can really do without CDATA in every use case anyone can suggest, I guess we should stop moaning and reject CDATA usage henceforth. But if there are clearly defined cases where CDATA is essential, Rys has already undertaken, in the topmost link in this question, to bake it into FOR XML PATH going forward.
So which is it to be? Are CDATA sections really relics of the past? Or should Rys pull his finger out and allow for CDATA parsing in FOR XML PATH? And while we're at it, in the meanwhile, are there any hacks for getting FOR XML PATH to return CDATA sections?
CDATA sections are unnecessary. They're not a "relic of the past" because they've always been unnecessary.
This does not mean they aren't useful. Look at just about any programming language or library and you can find a large number of things you could do without because they are semantically equivalent to something else, but which are useful if there's a human being sitting there having to write the stuff.
For that matter, even with programmatic production it's also handy that one could take the opposite approach and use CDATA sections for every single piece of c-data (bloaty, but it could have efficiency gains elsewhere).
FOR XML PATH does not involve a human being sitting there having to write the stuff. It's a means of producing valid XML from the results of an SQL query. (It's also not a matter of parsing CDATA sections, but of producing them - a different matter.)
And you can't really complain about FOR XML EXPLICIT being the alternative when you want really fine control - the reason FOR XML EXPLICIT is so nasty to use sometimes is precisely because it gives you really fine control. Indeed, consider if they first added support for CDATA sections and then added support for every other tweak and configuration option that seemed just as vital to someone else out there. How long would it take before FOR XML EXPLICIT was the automatic choice due to it being more straightforward than FOR XML PATH‽
There are four cases where CDATA are useful:
You're sitting at a keyboard typing this stuff in yourself.
You are dealing with a mix of different technologies with different standards, designed at different times, which will be interpreted by different parsers in different ways (e.g. JavaScript embedded into XHTML - though CDATA is not 100% necessary here, it's a nightmare to do otherwise).
You're trying to parse the XML with something that doesn't understand XML.
You're trying to use something built on a parser that allows low-level access that distinguishes between CDATA sections and other character data and using that low-level access inappropriately.
Funnily enough, these four cases are also the four cases where a ban on accepting CDATA sections can make sense.
Case 1 doesn't apply here: this isn't human-generated code.
Case 2 could apply here if you are doing something really crazy. Frankly, the lack of CDATA sections is the least of your worries here; switch to producing simpler XML in the query and transforming it elsewhere.
Case 3 could apply here, but it's not fair to complain to the SQL people if it does, when you should complain to the broken XML parser that doesn't treat &lt;example&gt; the same as <![CDATA[<example>]]>.
Case 4 could apply here, but again complain to the person who wrote the buggy code, not the SQL people.
CDATA sections are useful if you don't care about the semantics of the data in them (i.e. you do not need to parse it - it is simply a run of characters), and you don't wish to escape any of the XML within them.
The definition, according to the W3C:
CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.
From wikipedia:
New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to "protect" data from being treated as ordinary character data during processing. Some APIs for working with XML documents do offer options for independent access to CDATA sections, but such options exist above and beyond the normal requirements of XML processing systems, and still do not change the implicit meaning of the data. Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup.
CDATA sections are useful for writing XML code as text data within an XML document. For example, if one wishes to typeset a book with XSL explaining the use of an XML application, the XML markup to appear in the book itself will be written in the source file in a CDATA section. However, a CDATA section cannot contain the string "]]>" and therefore it is not possible for a CDATA section to contain nested CDATA sections. The preferred approach to using CDATA sections for encoding text that contains the triad "]]>" is to use multiple CDATA sections by splitting each occurrence of the triad just before the ">". For example, to encode "]]>" one would write:
<![CDATA[]]]]><![CDATA[>]]>
It is interesting to see how someone can just throw away a very valuable piece of the standard with such a whimsical approach. Not everyone is using XML for a few hundred characters of HTML or a list of items for a drop-down.
Some of us actually use XML to exchange data - very complex data, like a CCD, CDA or CDR; these are all standard document formats in the healthcare arena and are becoming more and more prominent with ObamaCare. Part of these documents' structure contains attachments: things like DICOM images, PDFs and other binary data that should not be read by the parser, which is the very reason the CDATA definition exists.
Why should I pay the overhead of the parser reading a 3-megabyte DICOM image embedded in a CCD document? Why should I be forced to separate it out when it came in the original data and is part of the XML standard? And I want to be able to locate and recover the document and its contents with XML.
It bewilders me why you all would support the parsing of data that is intended not to be parsed by the engine. If the engine sees CDATA, it should ignore it; it is very simple. And the continued argument that some do not need it is irrelevant; it is part of the standard, and the standard should be maintained. If they would like to add a "feature", as it has been called, then support the default behavior with an option.
Please stop parsing CDATA and ignore it.
You are absolutely right: CDATA sections are essential in many scenarios. They're part of the XML standard and should be supported by every XML manipulation tool/method. But the thing is that MS usually doesn't care ... you know, the "640kB should be enough for everyone" kind of approach.
Edit: About FOR XML EXPLICIT - this is THE best method for generating precisely formatted XML data. Yes, the syntax is kind of painful to look at and confusing, but once you use it a few times, you'll admire its beauty and power.

removing dead variables using antlr

I am currently servicing an old VBA (Visual Basic for Applications) application. I've got a legacy tool which analyzes that application and prints out dead variables. As there are more than 2000 of them, I do not want to do this by hand.
Therefore I had the idea to transform the separate code files which contain the dead variables (according to the aforementioned tool) into ASTs and remove the variables that way.
My question: Is there a recommended way to do this?
I do not want to use StringTemplate, as I would need to create templates for all rules, and if I had a comment on the hidden channel, it would be lost, right?
All I need is to remove parts of the code and print out the rest exactly as it was read in.
Does anyone have any recommendations?
Some theory
I assume that regular expressions are not enough to solve your task; that is, the notion of a dead-code section cannot be defined in a regular language, so you need the expressive power of a context-free language described by some antlr grammar.
The algorithm
The following algorithm can be suggested:
Tokenize the source code with a lexer.
Since you want to preserve all the correct code, don't skip or hide its tokens. Make sure to define separate tokens for the parts which may be removed or which will be used to determine the dead code; all other characters can be collected under a single token type. Here you can use the output of your auxiliary tool in predicates to reduce the number of tokens generated. I guess antlr's tokenization (like any other tokenization) is expressed in a regular language, so you can't remove all the dead code in this step.
Construct AST with a parser.
Here the full power of a context-free language can be applied: define dead-code sections in the parser's rules and omit them from the AST being constructed.
Convert the AST back to source code. You can use a tree parser here, but I guess there is an easier way, which can be found by observing toString and similar methods of the tree type returned by the parser.
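As a rough, non-antlr illustration of the overall idea, here is a Python sketch in which the "lexer" covers every character of the input, so everything that is kept is reprinted verbatim, comments and whitespace included. The dead-variable names are assumed to come from your legacy tool, and the token definitions are deliberately naive (they only handle single-name Dim lines):

import re

# Naive stand-in for step 1: every character of the source belongs to some
# token, so joining the kept tokens reproduces the input byte for byte.
TOKEN = re.compile(
    r"(?P<DIM>^[ \t]*Dim[^\r\n]*\r?\n)"   # a whole VBA Dim declaration line
    r"|(?P<OTHER>.|\r?\n)",               # everything else, char by char
    re.MULTILINE,
)

def strip_dead(source, dead):
    out = []
    for m in TOKEN.finditer(source):
        if m.lastgroup == "DIM":
            name = m.group("DIM").split()[1].split(",")[0]
            if name in dead:
                continue                  # drop the dead declaration line
        out.append(m.group())
    return "".join(out)

# the set of dead names would come from the legacy analysis tool
print(strip_dead("Dim a\nDim b ' keep\nb = 1\n", {"a"}))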