Is it possible to have a rule use hidden terminals? - grammar

In all my grammar rules, except for one, spaces should be hidden and not picked up by the parser.
For this single rule it would be useful to parse the spaces, but it seems unnecessary to stop hiding spaces altogether, since that would force me to parse spaces explicitly in every other rule.
So I am wondering if there is some way for a rule to use hidden terminals, or is there some other way of doing it? Thanks!

Related

Help with context-free grammar of lots of literal text with some special syntax

How do you make a context-free grammar that analyzes mostly literal text (spaces, characters, symbols, etc.) while also looking for expressions of the form ${...} or $someCommand{...}? Note that if it finds "I got $10 today", it should not do anything special with it.
Is it possible?
Yes, it is possible.
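To make the "possible" concrete, here is a minimal sketch, assuming the only special forms are ${...} and $someCommand{...} (the function name and the exact identifier rule are my own choices, not from the question): a lexer-style pass that splits the input into literal runs and special expressions, so a bare "$" as in "$10" falls through to literal text.

```python
import re

# Matches $someCommand{...} or ${...}; a bare "$" followed by
# anything else (e.g. "$10") is left as ordinary literal text.
SPECIAL = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*\{[^}]*\}|\$\{[^}]*\}")

def split_template(text):
    """Split text into ('literal', s) and ('special', s) pieces."""
    pieces, pos = [], 0
    for m in SPECIAL.finditer(text):
        if m.start() > pos:
            pieces.append(("literal", text[pos:m.start()]))
        pieces.append(("special", m.group()))
        pos = m.end()
    if pos < len(text):
        pieces.append(("literal", text[pos:]))
    return pieces
```

In a real grammar the same idea becomes two token types: one for the special forms, one catch-all for everything else.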

The idea of text highlighting, code completion, etc in programming

I want to know how advanced text editor features like text highlighting, code completion, automatic indentation, etc. are implemented.
To make my idea clear: I imagine that text highlighting reads the entire text into a string, then does a regular-expression replacement of keywords with keyword + color codes, and writes the text back. That looks logical, but it would be very inefficient to do with every keystroke when your file is 4000 lines, for example! So I want to know how such a thing is implemented, in C# for example (any other language would be fine also, but that's what I am experimenting with right now).
Syntax highlighting:
This comes to mind. I haven't actually tried the example, so I cannot say anything about the performance, but it seems to be the simplest way of getting basic syntax highlighting up and running.
Auto-completion:
Given a list of possible keywords (that could be filtered based on context) you could quickly discard anything that doesn't match what the user is currently typing. In most languages, you can safely limit yourself to one "word", since whitespace isn't usually legal in an identifier. For example, if I start typing "li", the auto-completion database can discard anything that doesn't start with the letters 'l' and 'i' (ignoring case). As the user continues to type, more and more options can be discarded until only one -- or at least a few -- remains. Since you're just looking at one word at a time, this would be very fast indeed.
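The prefix filtering described above can be sketched as follows (the keyword list is a hypothetical stand-in for a context-filtered database): keep the candidates sorted, binary-search to the first possible match, and scan forward until the prefix no longer matches, so each keystroke does very little work.

```python
import bisect

# Hypothetical keyword list; in practice this would already be
# filtered based on context, and kept lowercase for case-insensitivity.
KEYWORDS = sorted(["list", "lambda", "len", "line", "int", "import"])

def complete(prefix, words=KEYWORDS):
    """Return every word starting with prefix, using binary search
    on a sorted list so the per-keystroke cost stays tiny."""
    prefix = prefix.lower()
    lo = bisect.bisect_left(words, prefix)
    matches = []
    for w in words[lo:]:
        if not w.startswith(prefix):
            break  # sorted order: nothing later can match either
        matches.append(w)
    return matches
```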
Indentation:
A quick-and-dirty approach that would (kind of) work in C-like languages is to have a counter that you increment once for every '{', and decrement once for every '}'. When you hit enter to start a new line, the indentation level is then counter * indentWidth, where indentWidth is a constant number of spaces or tabs to indent by. This suffers from a serious drawback, though -- consider the following:
if(foo)
bar(); // This line should be indented, but how does the computer know?
To deal with this, you might look for lines that end with a ')', not a semicolon.
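The counter idea, including the crude ')' heuristic just mentioned, can be sketched like this (only a sketch under the stated assumptions: no awareness of strings, comments, or multi-line expressions):

```python
def indent_for_next_line(lines, indent_width=4):
    """Quick-and-dirty indentation for C-like languages:
    count '{' and '}' to get the nesting depth, then add one
    extra level if the last line looks like a brace-less
    if/for/while header (crudely: it ends with ')')."""
    depth = 0
    for line in lines:
        depth += line.count("{") - line.count("}")
    extra = 1 if lines and lines[-1].rstrip().endswith(")") else 0
    return max(depth + extra, 0) * indent_width
```

Real editors replace the ')' heuristic with at least a token-level scan, but the counter itself is how many simple modes work.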
An old, but still applicable, resource for editor internals is The Craft of Text Editing. Chapter 7 addresses the question of redisplay strategies directly.
In order to do "advanced" syntax highlighting -- that is, highlighting that requires contextual knowledge -- a parser is often needed. Most parsers are built on some sort of formal grammar; these come in several varieties, of which LL, LALR, and LR are common.
However, most parsers operate on whole files, which is quite inefficient for text editing, so instead we turn to incremental parsers. Incremental parsers use knowledge of the language and the structure of what has previously been processed in order to re-do the least amount of work possible.
Here's a few references to incremental parsing:
Language Design Patterns and incremental parsing
Incremental Parsing
Incremental Parsing in the Yi Editor and the presentation on Vimeo

removing dead variables using antlr

I am currently servicing an old VBA (visual basic for applications) application. I've got a legacy tool which analyzes that application and prints out dead variables. As there are more than 2000 of them I do not want to do this by hand.
Therefore I had the idea to transform the separate code files which contain the dead variables (according to the aforementioned tool) into ASTs and remove the variables that way.
My question: Is there a recommended way to do this?
I do not want to use StringTemplate, as I would need to create templates for all rules, and if I had a comment on the hidden channel, it would be lost, right?
All I need is to remove parts of that code and print out the rest exactly as it was read in.
Does anyone have any recommendations?
Some theory
I assume that regular expressions are not enough to solve your task: the notion of a dead-code section can't be captured by a regular language, so you need the context-free language described by some antlr grammar.
The algorithm
The following algorithm can be suggested:
Tokenize source code with a lexer.
Since you want to preserve all the correct code -- don't skip or hide its tokens. Make sure to define separate tokens for the parts which may be removed or which will be used to detect the dead code; all other characters can be collected under a single token type. Here you can use the output of your auxiliary tool in predicates to reduce the number of tokens generated. I guess antlr's tokenization (like any other tokenization) is expressed in a regular language, so you can't remove all the dead code in this step.
Construct AST with a parser.
Here all the power of a context-free language can be applied -- define dead-code sections in the parser's rules and omit them from the AST being constructed.
Convert the AST back to source code. You can use some tree parser here, but I guess there is an easier way, which can be found by observing toString and similar methods of the tree type returned by the parser.
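The overall idea can be sketched without antlr for the simplest case, assuming the dead variables are whole `Dim` declaration lines reported by your auxiliary tool (the names in `DEAD` are hypothetical); everything else, including comments, is emitted exactly as it was read in:

```python
import re

# Hypothetical: names reported dead by the legacy analysis tool.
DEAD = {"oldCounter", "tmpName"}

# Matches the variable name in a VBA "Dim x ..." declaration line.
DIM_LINE = re.compile(r"^\s*Dim\s+(\w+)\b", re.IGNORECASE)

def strip_dead_dims(source):
    """Remove 'Dim x ...' lines whose variable is in DEAD,
    passing every other line through unchanged."""
    kept = []
    for line in source.splitlines(keepends=True):
        m = DIM_LINE.match(line)
        if m and m.group(1) in DEAD:
            continue  # drop the dead declaration
        kept.append(line)
    return "".join(kept)
```

For anything beyond single-line declarations (e.g. `Dim a, b, c` with only `b` dead) you need the token/AST approach described above rather than a line filter.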

Lucene gotchas with punctuation

Whilst building some unit tests for my Lucene queries I noticed some strange behavior related to punctuation, in particular around parentheses.
What are some of the best ways to deal with search fields that contain significant amounts of punctuation?
If you haven't customized the query parser, Lucene should behave according to the default query parser syntax. Are you getting something different than that? Do you want punctuation to have a special meaning or just to remove the punctuation from searches?
The other usual suspect here is the Analyzer, which determines how your field is indexed and how the query is broken into pieces for searching. Can you post specific examples of bad behavior?
It is not just parentheses; other punctuation such as the colon, hyphen, etc. will cause issues as well. Here is a way to deal with them.
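One standard fix is to backslash-escape the query parser's metacharacters before parsing, which is what Lucene's own `QueryParserBase.escape()` does in Java. A Python rendering of the same idea (the character set below is the one documented for the classic query parser syntax):

```python
# Characters the classic Lucene query parser treats as syntax:
# + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
LUCENE_SPECIAL = set('+-!(){}[]^"~*?:\\/&|')

def escape_query(text):
    """Backslash-escape query-parser metacharacters so user input
    is searched literally, mirroring Lucene's QueryParserBase.escape()."""
    return "".join("\\" + c if c in LUCENE_SPECIAL else c for c in text)
```

The alternative, if punctuation should carry no meaning at all, is to strip it during analysis so the index never contains it in the first place.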

URL formatting tips for search engine optimization?

I am looking for url encoding tips for SEO compliant site.
I have a list of variables I need!
hyphen = used to split locations, e.g. Leeds-UK-England
space = replaced by an underscore wherever spaces occur
hyphen (within a name) = replaced by a plus sign, used in some British locations (stafford-upon-avon)
forward slash = replaced by an exclamation mark, used in-house for names of things
Are the ones chosen good or bad? Are there any better ones? I'm pretty sure I need all the data in order to decode the URLs properly.
My "SEO" gave me a list of things which are bad, but not which are good. I've searched for these and Google seems to give the same type of results.
Cheers, Sarkie
Google used not to recognise underscores as word separators - see this article from 2005. This has entered received wisdom, and most of the 'experts' and articles you will find on SEO will still be recommending it.
However, last year this changed: underscores are now recognised as word separators, which opens things up for URL design. It now allows using dashes as dashes and underscores as spaces, which some consider more natural. I've not found many people who have caught up with this, including SEO consultants I deal with professionally.
As to a good system for your use case, I would recommend asking around some non technical people (colleagues, friends, family, etc) to see what they like.
Using hyphens for spaces is the usual and preferred method.
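Whichever mapping you settle on, the key property is that it round-trips. A sketch of the scheme listed in the question (function names are mine; the character mappings are the ones the question proposes):

```python
# Per the question's scheme: "-" splits locations, "_" stands for a
# space, "+" for a literal hyphen, "!" for a forward slash.
ENC = {" ": "_", "-": "+", "/": "!"}
DEC = {v: k for k, v in ENC.items()}

def encode_segment(name):
    return "".join(ENC.get(c, c) for c in name)

def decode_segment(slug):
    return "".join(DEC.get(c, c) for c in slug)

def encode_path(parts):
    """['Leeds', 'UK', 'England'] -> 'Leeds-UK-England'"""
    return "-".join(encode_segment(p) for p in parts)

def decode_path(path):
    return [decode_segment(s) for s in path.split("-")]
```

Note the scheme only round-trips if the original names never contain "_", "+", or "!" themselves; otherwise you need a real escaping rule rather than a one-to-one character swap.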