How can I set an Atom Grammar based on file content?

In Sublime Text 3 there is a package called ApplySyntax which lets you set syntax highlighting based on file contents. How can this be achieved in Atom?

Looking through the specs I noticed that Atom does in fact support content-based grammar selection:
it "uses the filePath's shebang line if the grammar cannot be determined by the extension or basename", ->
filePath = require.resolve("./fixtures/shebang")
expect(atom.grammars.selectGrammar(filePath).name).toBe "Ruby"
Atom works by calculating a score for each grammar based on the file path and the content of the file.
More specifically, regarding content, each grammar may contain a firstLineMatch: a regex applied to the first line (or first few lines) of the file to determine whether the file belongs to that grammar. In the case of Ruby this is:
'firstLineMatch': '^#!\\s*/.*\\bruby|^#\\s+-\\*-\\s*ruby\\s*-\\*-'
This looks for the shebang or an equivalent comment found in some Ruby files.
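As a quick sanity check outside Atom, you can exercise that pattern directly. A minimal sketch in Java (the regex is the same once the CSON string escaping is accounted for; the class name is just for illustration):

import java.util.regex.Pattern;

public class FirstLineMatchDemo {
    public static void main(String[] args) {
        // The firstLineMatch regex from Atom's Ruby grammar.
        Pattern firstLine = Pattern.compile(
            "^#!\\s*/.*\\bruby|^#\\s+-\\*-\\s*ruby\\s*-\\*-");

        System.out.println(firstLine.matcher("#!/usr/bin/env ruby").find()); // true
        System.out.println(firstLine.matcher("# -*- ruby -*-").find());      // true
        System.out.println(firstLine.matcher("#!/bin/sh").find());           // false
    }
}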
Related Reading:
Open .rb files with RoR grammar by default instead of Ruby
Atom Flight Manual: Grammar

Lucene query result is not correct when running official demo

I tried the official Lucene demo by running IndexFiles with the arguments -index . -docs . , and the console output shows that pom.xml, the *.java files, and the *.class files are added to the index.
Then I ran SearchFiles with the arguments -index . -query "lucene AND main", and the console prints only IndexFiles.class, SearchFiles.class, and IndexFiles.java, but not SearchFiles.java (which I think should be one of the search results).
Your search results are correct (for the .java files, at least).
The sample code uses the StandardAnalyzer which, in turn, uses the StandardTokenizer.
The StandardTokenizer splits input text into tokens using the word-boundary rules described in the Unicode Text Segmentation annex, UAX #29 (section 4, "Word Boundaries").
When you have text such as the following in the source files
org.apache.lucene.analysis.Analyzer
it is tokenized as a single token: the embedded periods do not create word boundaries.
Looking in the IndexFiles.java source file, there is the following text:
demonstrating simple Lucene indexing
This is tokenized into four separate tokens: demonstrating, simple, lucene, and indexing (the StandardAnalyzer also lowercases tokens, which is why "Lucene" yields the token lucene).
But in the SearchFiles.java source file, the text "lucene" only ever appears in text such as org.apache.lucene.analysis.Analyzer - and therefore the single token lucene is never created.
Your query therefore finds no hits in the SearchFiles.java document, because the query matches exact tokens. Both source files contain the word "main", but only IndexFiles.java produces the token "lucene".
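You can see the tokenization directly with a few lines of code. A minimal sketch (assuming a recent Lucene version where StandardAnalyzer has a no-argument constructor; the field name "contents" is arbitrary):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDemo {
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // Prints a single token: org.apache.lucene.analysis.analyzer
        printTokens(analyzer, "org.apache.lucene.analysis.Analyzer");
        // Prints four tokens: demonstrating, simple, lucene, indexing
        printTokens(analyzer, "demonstrating simple Lucene indexing");
    }
}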
For the .class files, because these are compiled bytecode files, I would say they should not be indexed in the first place. Lucene works with text files, not binary files. Yes, the class files will contain fragments of text, but they will also typically contain unprintable control characters, which are not suitable to be indexed. I think indexing results could be unpredictable because of this.
You can explore the indexed data using Luke, which is bundled with the Lucene binary releases.

awk pattern to match an XML PI at the start of a line

I have an XML document containing a number of XML Processing Instructions which are of the form:
<?cpdoc something?>
I am trying to match them in awk with the pattern
/^\<\?cpdoc/
but it's not returning anything. If I remove the ^ anchor, it works (but I have other similar PIs which don't start a line which I don't want matched).
It looks as if it's being confused by the \<\?, but why is it ignoring the line-start anchor?
Don't parse XML with regex; use a proper XML/HTML parser.
Theory:
According to parsing theory, XML cannot be parsed with a regex, because a regex corresponds to a finite state machine. Due to the hierarchical structure of XML you need a pushdown automaton, e.g. an LALR grammar handled with a tool like yacc.
realLife©®™ everyday tools in a shell:
You can use one of the following:
xmllint
xmlstarlet
saxon-lint (my own project)
Check: Using regular expressions with HTML tags
Example using XPath:
xmllint --xpath '//processing-instruction()' file.xml
Solution by OP and explanation by Ed Morton.
It works if the less-than sign is not escaped, because in GNU awk \< is a (start-of-)word-boundary operator rather than a literal <. So instead of:
\<\?
I should use literal:
<\?
This is because we can't just go escaping any character and hoping for the best; we have to know which characters are metacharacters and then escape them if we want them treated as literals.
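The corrected pattern carries over to most other engines, where < is likewise an ordinary character and ? must be escaped (the \< word operator is a peculiarity of GNU/POSIX tools). A small illustrative sketch in Java (class name is just for illustration):

import java.util.regex.Pattern;

public class PiMatchDemo {
    public static void main(String[] args) {
        // '<' is a literal; '?' is a metacharacter and needs escaping.
        Pattern pi = Pattern.compile("^<\\?cpdoc");

        System.out.println(pi.matcher("<?cpdoc something?>").find());         // true
        System.out.println(pi.matcher("text <?cpdoc not at start?>").find()); // false
    }
}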

Parsing a G4 file to generate doc / schema

I realize this question is a bit meta, but I essentially want to parse an ANTLR4 grammar (an actual .g4 file) to then generate documentation and other artifacts based on the grammar (not an instance of the grammar).
For example, consider the example Java grammar that contains this rule:
compilationUnit
: packageDeclaration? importDeclaration* typeDeclaration* EOF
;
I want to be able to parse the Java.g4 file and produce documentation that says "A compilationUnit contains an optional packageDeclaration, 0 or more importDeclarations, and 0 or more typeDeclarations". Or perhaps I want to produce an XSD with a data type called "compilationUnit" that contains "packageDeclaration", "importDeclaration", and "typeDeclaration" elements (with proper cardinality set).
What is the best way of accomplishing something like this? Is it to create a target (even though the goal isn't to create lexers/parsers), is it to use the example ANTLR 4 grammar to parse the .g4 file, or is it something else?
Thanks!
This would be a very typical use of ANTLR, and convenient given the existing ANTLR 4 grammar.
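One possible starting point, sketched below, is to use the ANTLR 4 tool's own grammar loader rather than generating a parser. This is only a sketch under assumptions: it requires the ANTLR 4 tool jar (not just the runtime) on the classpath, Tool.loadGrammar and the Grammar/Rule fields are internal API that may change between releases, and "Java.g4" stands in for your grammar file:

import org.antlr.v4.Tool;
import org.antlr.v4.tool.Grammar;
import org.antlr.v4.tool.Rule;

public class GrammarDocGenerator {
    public static void main(String[] args) {
        Tool tool = new Tool();
        // Parses the .g4 file itself into a Grammar object,
        // without generating any lexer/parser code.
        Grammar g = tool.loadGrammar("Java.g4");
        for (Rule r : g.rules.values()) {
            // Each rule's AST exposes the ?, *, + suffixes and sub-rules,
            // which is what documentation or an XSD type would be derived from.
            System.out.println(r.name + " : " + r.ast.toStringTree());
        }
    }
}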

Failed to parse command using ANTLR3 grammar if the command contains a word that is declared as a rule

I am facing a problem while parsing commands with a parser I have implemented using ANTLR3. The parser fails on commands that contain any word that is declared as a lexer rule in the grammar.
For example, take a look at the following grammar:
show :
SHOW TABLES '[' projectName? tableName ']' -> ^(SHOW TABLES_ ^(PROJECT_NAME projectName)? ^(DATASET_TABLE tableName));
SHOW : S H O W;
If I try to parse the command 'SHOW TABLES [sample-project:SHOW]' then parsing fails for this command, but if I change the word SHOW to anything else, the command parses fine.
I don't want to have to write the name as a string surrounded by quotes (").
Can anyone suggest solution? I am using ANTLR3.
Thanks in advance.
This is a typical effect of using a reserved word as an identifier. In ANTLR, when you define a reserved word like your SHOW rule, it is implicitly excluded from any identifier rule you might have defined after that keyword rule.
The solution, to allow such keywords also as identifiers in rules like your tableName, is to make that rule accept certain (or all) keywords that could be accepted in that place (where they will not act as keywords). Example:
tableName:
IDENTIFIER
| SHOW
| <others go here>
;

Inconsistencies in tokenizing large English files using Stanford's PTBTokenizer?

I'm using the Stanford PTBTokenizer (included with POS tagger v3.2.0) from the Stanford JavaNLP API to tokenize a largish (~12 MB) file of English-language text. Invoking it from bash:
java -cp ../david/Desktop/quest/lib/stanford-parser.jar \
edu.stanford.nlp.process.PTBTokenizer -options normalizeAmpersandEntity=true \
-preserveLines foo.txt >tmp.out
I see instances of punctuation not tokenized properly in some places but not others. E.g., output contains "Center, Radius{4}" and also contains elsewhere "Center , Radius -LCB- 4 -RCB-". (The former is a bad tokenization; the latter is correct.)
If I isolate the lines that don't get tokenized properly in their own file and invoke the parser on the new file, the output is fine.
Has anybody else run into this? Is there a way to work around that doesn't involve checking output for bad parses, separating them, and re-tokenizing?
Upgrading to the latest version (3.3.0) fixed the comma attachment problem. There are spurious cases of brackets/braces not being tokenized correctly (mostly because they are [mis-]recognized as emoticons).
Thanks again to Professor Manning & John Bauer for their prompt & thorough help.