Define grammar in Xtext for optional array dimensions - grammar

I'm trying to define grammar in xtext for arrays where dimensions can be empty like int[][] or int[5][10]
My grammar looks like:
ArrayType:
[BasicType] ('['(dimension+=Expression)?']')+;
The problem with that rule when I use int[][] is there is no way from the model to know how many [] included because dimension list would be empty.
So I wonder if there is a way in xtext to default value like 0 in case Expression wasn't found? Otherwise what is the best way to handle such situation without changing the metamodel?
Thanks in advance for your help.

By no means an Xtext expert but can't you add an intermediary term:
ArrayType:
[BasicType] dimensions=Dimension+;
Dimension:
('['(size+=Expression)?']')

Related

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like to remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
If you using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as a similar word (check that the similar word has a vector first). This is how to tell the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so it's possible if you choose those well, you might get pretty close to the results you want.

ANTLR4 and parsing a type-length-value format

I am trying create a grammar for a format that follows a type-length-value convention. Can ANTLR4 read in a length value and then parse that many characters?
NO ...
From your question (which is very short so I could miss something ...) I gather you are mixing grammar and encoding rules.
When you say type-length-value, it sounds like an encoding rule to me (how to serialize a data). In my experience, you write this code yourself.
A grammar is at a higher level: it's a piece of text that describes something. Antlr will help you breaking this text into tokens and then into a tree that you can navigate.
This step only handles text: if you were going that way to solve your problem, you would still have to handle type, length and value yourself.
EDIT:
with a bit of googling I found this https://github.com/NickstaDB/SerializationDumper

How does spaCy tokenizer splits sentences?

I am finding the tokenization code quite complicated and I still couldn't find where in the code the sentences are split.
For example, how does the tokenizer know that
Mr. Smitt stayed at home. He was tired
should not be split in "Mr." and should be split before "He".? And where in the code does the split before "He" happens?
(In fact, I am unsure actually unsure if I am looking at the right place: if I search for sents in tokenizer.pyx I don't find any occurrence)
You access the splits via the doc object, with the generator:
doc.sents
The output of the generator is a series of spans.
As for how the splits are chosen, the document is parsed for dependency relationships. Understanding the parser is not trivial - you'll have to read into it if you want to understand it - it's using a neural network to inform the decision about how to construct the dependency trees; but the splits are those gaps between tokens which are not crossed by dependencies. This is not simply where you find a full-stop, and the method is more robust as a result.

SWI prolog make set of variables name with rbtrees or others means

I have got a term from which I want to get set of variables name.
Eg. input: my_m(aa,b,B,C,max(D,C),D)
output: [B,C,D] (no need to be ordered as order of appearance in input)
(That would call like set_variable_name(Input,Output).)
I can simply get [B,C,D,C,D] from the input, but don't know how to implement set (only one appearance in output). I've tried something like storing in rbtrees but that failed, because of
only_one([],T,T) :- !.
only_one([X|XS],B,C) :- rb_in(X,X,B), !, only_one(XS,B,C).
only_one([X|XS],B,C) :- rb_insert(B,X,X,U), only_one(XS,U,C).
it returns tree with only one node and unification like B=C, C=D.... I think I get it why - because of unification of X while questioning rb_in(..).
So, how to store only once that name of variable? Or is that fundamentally wrong idea because we are using logic programming? If you want to know why I need this, it's because we are asked to implement A* algorithm in Prolog and this is one part of making search space.
You can use sort/2, which also removes duplicates.

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, So it is very navie question.
Pardon me if so. May look like homework question - but I need to implement project based on below concept.
My question is related to two parts,
Question 1
In Bison parser, How do I provide rules for optional input.
Like, I need to parse the statment
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, If I have many tokens optional, then How do I provide the grammar in the parser for the same?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since there are many tokens which are optional, If I use "|" then there will be many different ways of input combination possible.
Question 2
There are good chance that the comment might have quotes as part of the input, so I have added a token -tag which user can provide to interpret the same,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material for to understand these topic.
Q1: I am assuming we have 4 possible tokens: NAME , '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unlike you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
Q2: I understand this so, that you think of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?