"He was talking about his last night."
In the sentence above, the preposition is "about", but it appears neither under the Prep syntax nor under the English morphology in the RGL, as shown below.
Is there a reason for this, or does this preposition act differently in different languages?

I don't know of any particular reason. "About" is not among the most common English prepositions like "in", "on", "for", but there are less frequent prepositions on that list, like "during", so the choice isn't frequency-based. A list like that can never be complete, though, which is why we have mkPrep in the lexical paradigms for English:
mkPrep : Str -> Prep -- e.g. "in front of"
mkPost : Str -> Prep -- e.g. "ago"
noPrep : Prep -- no preposition
So whenever you want to use a preposition that isn't in the RGL API, just use mkPrep. In this case, mkPrep "about".


Lucene operator precedence for boolean operators

What is the order of operations for boolean operators? Left to right? Right to left? Specific operators have higher priority?
For example, if I search for:
jakarta OR apache AND website
What do I get? Is it
Anything with "jakarta" as well as anything with both "apache" and "website"?
Anything with "website" that also has either "jakarta" or "apache"?
Something else?
Short answer:
In Lucene, the AND operator takes precedence over the OR operator. So, you are effectively doing this:
jakarta OR (apache AND website)
You can verify this for yourself by parsing your query string and seeing how it converts AND and OR to the "required" and "optional" operators.
And, since we are discussing precedence, the NOT operator takes precedence over the AND operator.
But you need to be very careful when dealing with Lucene's so-called "boolean" operators, as they do not behave the way you may expect based on their collective name ("boolean").
(Unfortunately I have never seen any official documentation which provides a citation for these precedence rules - but instead I am relying on empirical observations. See below for more about that. If the documentation for this does exist, that would be great to see.)
Longer Answer
One key thing to understand is that Lucene boolean operators are not really "boolean" in the sense that you may think, based on Boolean algebra, where you use parentheses to help avoid ambiguity (or where you need to know what rules a programming language may be applying) - and where everything evaluates to TRUE or FALSE.
Lucene boolean operators serve a subtly different purpose.
They are not purely concerned with TRUE/FALSE inclusion/exclusion, but also concerned with how to score results so that the more relevant results have higher scores than less relevant results.
The Lucene query jakarta OR apache AND website is equivalent to the following:
jakarta +apache +website
This means the document's field must contain apache and website, but may also include jakarta (for a higher relevance score).
You can see this for yourself by taking your original query string and parsing it:
Query query = parser.parse(queryString);
...and then printing the resulting string representation of the query. The + operator is the "required" operator. It:
requires that the term after the "+" symbol exist somewhere in the field
And the lack of a + operator means the default of "may" as in "may contain" - meaning the term is optional: it does not need to be present, if there is some other clause in the query which does match a document.
The use of AND forces the terms on either side of the AND to be required.
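As a rough illustration of the required/optional distinction described above, here is a minimal Python sketch. It is not Lucene code: the function names `matches` and `toy_score` are made up for this example, and the score is a simple count of matching terms, whereas Lucene's real scoring is far more involved.

```python
def matches(doc_terms, required=(), optional=(), prohibited=()):
    """Decide whether a document satisfies a +required / optional / -prohibited query."""
    terms = set(doc_terms)
    if any(t in terms for t in prohibited):
        return False
    if required:
        # With required terms present, optional terms no longer affect matching.
        return all(t in terms for t in required)
    # With no required terms, at least one optional term has to match.
    return any(t in terms for t in optional)


def toy_score(doc_terms, required=(), optional=()):
    """Toy relevance: count how many query terms the document contains."""
    terms = set(doc_terms)
    return sum(t in terms for t in (*required, *optional))


# The query "jakarta +apache +website" from above:
query = {"required": ("apache", "website"), "optional": ("jakarta",)}
doc_a = ["apache", "website"]
doc_b = ["jakarta", "apache", "website"]
doc_c = ["jakarta"]

for doc in (doc_a, doc_b, doc_c):
    print(doc, matches(doc, **query), toy_score(doc, **query))
```

In this sketch, the document containing only "jakarta" does not match at all, while the document containing all three terms matches and outscores the one that lacks "jakarta".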
You can encounter some potentially surprising situations.
Consider this:
foo AND bar OR baz AND bat
This parses to the following:
+foo +bar +baz +bat
This is because the AND operators are transformed to + operators for every term, rendering the OR redundant.
It's the same result as if you had written this:
foo AND bar AND baz AND bat
But not the same as this:
(foo AND bar) OR (baz AND bat)
which is parsed to this, where the parentheses are retained:
(+foo +bar) (+baz +bat)
Bottom Line:
Use parentheses to explicitly make your intentions clear, when using AND and OR and also NOT.
Regarding NOT, since we mentioned it: it takes precedence over AND.
The query:
foo AND bar NOT baz AND bat
Is parsed as:
+foo +bar -baz +bat
So, a document field must contain foo, bar and bat - and must not contain baz.
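The three conversions shown in this answer can be reproduced with a small toy model in Python. This is only a sketch inferred from the outputs quoted above, not Lucene's parser: the function name `to_required_optional` is hypothetical, and it handles only flat queries of single terms joined by AND, OR and NOT (no parentheses, phrases or field names).

```python
def to_required_optional(query):
    """Rewrite a flat AND/OR/NOT query into +term / -term / bare-term form."""
    clauses = []                 # each entry is [prefix, term]
    conj, negate = None, False
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
        elif token == "NOT":
            negate = True
        else:
            # AND also makes the preceding term required, unless it is prohibited.
            if conj == "AND" and clauses and clauses[-1][0] != "-":
                clauses[-1][0] = "+"
            if negate:
                prefix = "-"
            elif conj == "AND":
                prefix = "+"
            else:
                prefix = ""      # OR (or no operator) leaves the term optional
            clauses.append([prefix, token])
            conj, negate = None, False
    return " ".join(prefix + term for prefix, term in clauses)


for q in ("jakarta OR apache AND website",
          "foo AND bar OR baz AND bat",
          "foo AND bar NOT baz AND bat"):
    print(q, "=>", to_required_optional(q))
```

Seen this way, OR never marks anything by itself; it only fails to add a +, which is why a neighbouring AND can silently override it.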
Why does this situation exist?
I don't know, but I think Lucene originally did not include AND, OR and NOT - but instead used + (must include), - (must not include) and "nothing" (may include). The so-called boolean operators AND, OR, NOT were added later on, as a kind of "syntactic sugar" for these original operators - introduced for people who were more familiar with AND, OR and NOT from other contexts. I'm basing this on the following thread:
Getting a Better Understanding of Lucene's Search Operators
A summary of that thread is included in this answer about the NOT operator.

How to add new lemma rule to existing language for spacy

I want to add a new lemmatiser rule for an existing language, i.e. lemmatise every noun ending in "z" by stripping the trailing "z".
In the case of individual words, spaCy gives the opportunity to add a tokeniser exception to an existing language after loading using
nlp.tokenizer.add_special_case("adidas", [{ORTH: 'adidas', LEMMA: 'Adidas', POS: 'NOUN', TAG: 'NNP'}])
The above sets the lemma, pos and tag of the new word and this is not altered.
The default English lemmatiser returned "adida" as the lemma.
Now, I am trying to "lemmatise" nouns "wordz" to "word", "windowz" to "window" etc without setting all cases as exceptions but rather add a new rule: Noun ending with "z" has lemma the noun without the trailing "z".
I understand that it will depend on the tagger output as the rules that exist in _lemma_rules.py are pos dependent.
Is there a way to add the rule without creating a new language as a copy of an existing with just one modified file?
Since my question was very specific, I had to communicate with the spaCy developer team and got a working answer.
Actually, it does not work for the fake English example, but it works in a real-case scenario using the Greek models, as Greek lemmatisation is mainly rule-based.
The proposed solution is to use the Lookups API, which is only available in versions 2.2 and later.
As they mention,
returns a dict-like table that you can write to.
Full answer in spaCy GitHub
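To make the idea of a writable rule table concrete, here is a minimal plain-Python sketch. It does not call spaCy: an ordinary dict stands in for the dict-like table, and the table shape (part of speech mapped to a list of [suffix, replacement] pairs) as well as the helper names `add_rule` and `apply_rules` are assumptions made for illustration.

```python
# A plain dict standing in for a dict-like lemma rule table (assumed shape).
lemma_rules = {"noun": [["s", ""]]}


def add_rule(table, pos, suffix, replacement):
    """Append a suffix-rewrite rule for one part of speech."""
    table.setdefault(pos, []).append([suffix, replacement])


def apply_rules(word, pos, table):
    """Return the word rewritten by the first matching suffix rule, else unchanged."""
    for suffix, replacement in table.get(pos, []):
        if suffix and word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


# The rule from the question: a noun ending in "z" loses the trailing "z".
add_rule(lemma_rules, "noun", "z", "")

print(apply_rules("wordz", "noun", lemma_rules))
print(apply_rules("windowz", "noun", lemma_rules))
print(apply_rules("wordz", "verb", lemma_rules))   # no verb rules, so unchanged
```

As noted in the question, the rule only fires when the tagger assigns the part of speech the rule is registered under. A real rule-based lemmatiser may also check candidates against a list of known forms, which this sketch leaves out.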

Nearley Tokenizers vs Rules

I'm pretty new to nearley.js, and I would like to know what tokenizers/lexers do compared to rules. According to the website:
By default, nearley splits the input into a stream of characters. This is called scannerless parsing.
A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
Wouldn't that be the same as:
Math -> Number _ "+" _ Number
Number -> [0-9]:+
I don't see what the purpose of lexers is. It seems that rules are always usable in this case and there is no need for lexers.
After fiddling around with them, I found out what tokenizers are for. Say we had the following:
Keyword -> "if"|"else"
Identifier -> [a-zA-Z_]:+
This won't work: if we try compiling it, we get an ambiguous grammar, because "if" is matched as both a Keyword and an Identifier. With a tokenizer, however:
"keyword": /if|else/
"identifier": /[a-zA-Z_]+/
Trying to compile this will not result in ambiguous grammar, because tokenizers are smart (at least the one shown in this example, which is Moo).
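The same idea can be sketched outside nearley. The Python snippet below is only an illustration of why a tokenizer removes the ambiguity, not Moo's implementation: each stretch of input is assigned exactly one token type, chosen by rule order. The `\b` word boundaries are an addition of this sketch so that a word like "iffy" is not split into a keyword plus an identifier.

```python
import re

# Rules are tried in order; the first one that matches at a position wins.
TOKEN_RULES = [
    ("keyword", r"\b(?:if|else)\b"),
    ("identifier", r"[a-zA-Z_]+"),
    ("ws", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES)
)


def tokenize(text):
    """Split text into (type, value) pairs, dropping whitespace tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("if iffy else x"))
```

Because "if" has already been labelled as a keyword before the grammar sees it, the parser never has to consider the reading in which it is an identifier.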

Fast lookup of tree with placeholders?

For an application I'm considering, there would be a large (100,000+) 'database' of trees (think expressions in a programming language, or S-expressions), and I would need to query that database for expressions that match a specific given expression.
Before giving the details of what I'd like to have, note that I'd appreciate any information related to indexing a large set of trees for optimizing lookup by a subtree.
In my specific situation (which would be for a backend to be used by Metamath proof assistants), expressions have the following structure (in Haskell-like notation):
data Expression = Placeholder Id | VarName Id | ConstName Id [Expression]
or as a BNF for an S-expression form:
Expression = '?' Id | Id | '(' Id Expression* ')'
where Id is some kind of identifier.
For example, I could have a database with expressions like
(equiv ?ph ?ps)
(not (in (appl (sqrt) (2)) (Q)))
(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))
In this context, two expressions match if they can be made equal by substitution of expressions for placeholders. So looking up (equiv (eq A (emptyset)) ?ph) in the above mini-database would result in the first and last expressions.
So again: how would I implement fast lookups in a large set of (expression) trees with placeholders? What kind of index data structure could I use?
I would implement the lookup with a trie. Each key would consist of one of the following:
ConstName Identifier
Variable w/ context info
These should be ordered in some fashion- possibly Placeholder, then all ConstNames (alphabetical), then variables (scope ordering, then argument order), then ConstValues (numerical order). As long as there's a concrete ordering for usage in the trie, you're fine.
Traverse the expression's tree, injecting the appropriate keys into the trie as they are encountered. Do this for all the expressions you want to insert into your data structure. When it comes time to query it, you can traverse the trie in a similar fashion, but with a few new rules.
Everything matches a placeholder node. If it matches some other key as well, then you'll need to explore both branches (easily done via a recursive DFS-like approach).
A placeholder matches everything. This is not equivalent to the previous point- we are talking about placeholders in the query here, the previous bullet is regarding placeholders as trie keys.
Now, this does mean that the search space can somewhat "explode" as you encounter placeholders, but there is one thing you can do to try to mitigate this in practice. Traverse the expression's tree in a breadth-first fashion (both in construction of the trie, and querying). This means if one of the arguments is a placeholder, you won't have to full-depth search every single subtree that matches that expression so far- instead you jump ahead to the next argument- which may not be a placeholder, and will thus greatly prune the search space (compared to matching "everything").
For completeness' sake, let's take one of your examples
(not (in (appl (sqrt) (2)) (Q)))
and make a trie entry from that-
not -> in -> appl -> "Q" -> sqrt -> 2
Adding (not (in ?ph E)) to this would result in:
not -> in -> appl -> "Q" -> sqrt -> 2
          \-> ?ph -> "E"
Continue in this fashion injecting expressions into the trie. Also traverse in this fashion for querying until you reach the ends of your searches into the trie, and return those that matched.
Note: the uniqueness of these entries is based on the assumption that you do not have to support variadic functions. If you do, attach to each key some context info (read the next paragraphs for how to do this) to distinguish which arguments go to which functions.
There is one detail I glossed over: variables. If you only want it to match if they are the exact same variable name, then no work is necessary. But this likely isn't what you want; you probably want it to match generic variables as long as they are "consistent" with each other. The way to do this is to assign each variable an identifier that represents the scope in which it was first defined.
The easiest way to do this is just compose an identifier from the concatenation of the argument ordering of its ancestors. That is, if a variable is first defined as the second argument to a function which is the fifth argument to the root function, then we might label it as (5, 2) or (2, 5), whichever makes more sense intuitively. Either way, this will ensure the variable is given a consistent identifier regardless of other variables / functions elsewhere. Then proceed as normal with this new variable name.
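The breadth-first trie described in this answer could be sketched in Python roughly as follows. Several choices here are assumptions of the sketch rather than part of the answer: each constant's key includes its arity (so the search knows how many stored children to skip when a query placeholder swallows a subtree), variables are compared by exact name, and repeated placeholders are not checked for consistency, so the results are candidates that may still need a final unification pass. All names (`ExpressionTrie`, `parse`, `bfs_keys`) are hypothetical.

```python
from collections import deque

END = object()           # marks "an expression ends here" inside a trie node
PLACEHOLDER = ("?",)     # every stored placeholder shares this one key


def parse(text):
    """Parse one S-expression into nested tuples; atoms stay plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] != "(":
            return tokens[pos], pos + 1
        node, pos = [tokens[pos + 1]], pos + 2
        while tokens[pos] != ")":
            child, pos = read(pos)
            node.append(child)
        return tuple(node), pos + 1

    return read(0)[0]


def is_placeholder(node):
    return isinstance(node, str) and node.startswith("?")


def key(node):
    if is_placeholder(node):
        return PLACEHOLDER
    if isinstance(node, str):
        return ("var", node)
    return ("const", node[0], len(node) - 1)   # arity is part of the key


def arity(k):
    return k[2] if k[0] == "const" else 0


def bfs_keys(expr):
    """List the keys of an expression tree in breadth-first order."""
    keys, queue = [], deque([expr])
    while queue:
        node = queue.popleft()
        keys.append(key(node))
        if isinstance(node, tuple):
            queue.extend(node[1:])
    return keys


class ExpressionTrie:
    def __init__(self):
        self.root = {}

    def insert(self, text):
        node = self.root
        for k in bfs_keys(parse(text)):
            node = node.setdefault(k, {})
        node.setdefault(END, []).append(text)

    def search(self, text):
        results = []

        def walk(node, pending):
            # pending holds the query subtrees still to be matched, in BFS order
            if not pending:
                results.extend(node.get(END, []))
                return
            q, rest = pending[0], pending[1:]
            for k, child in node.items():
                if k is END:
                    continue
                if k == PLACEHOLDER:          # stored placeholder matches anything
                    walk(child, rest)
                elif is_placeholder(q):       # query placeholder matches anything
                    walk(child, rest + ("?",) * arity(k))
                elif k == key(q):
                    walk(child, rest + (q[1:] if isinstance(q, tuple) else ()))

        walk(self.root, (parse(text),))
        return results


exprs = [
    "(equiv ?ph ?ps)",
    "(not (in (appl (sqrt) (2)) (Q)))",
    "(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))",
]
db = ExpressionTrie()
for e in exprs:
    db.insert(e)
print(db.search("(equiv (eq A (emptyset)) ?ph)"))
```

On the question's mini-database, this lookup returns the first and last expressions, which is the result the question asks for.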

Handling Grammar / Spelling Issues in Translation Strings

We are currently implementing a Zend Framework project that needs to be translated into 6 different languages. We already have a pretty sophisticated translation system, based on Zend_Translate, which also handles variables in translation keys.
Our project has a new Turkish translator, and we are facing a new issue: grammar, especially Turkish grammar. I noticed that this problem probably exists in every translation system and in most languages, so I posted a question here.
Question: Any ideas how to handle translations like:
Key: I have a[n] {fruit}
Variables: apple, banana
Result: I have an apple. I have a banana.
Key: Stimme für {user}[s] Einsendung
Variables: Paul, Markus
Result: Stimme für Pauls Einsendung,
Result: Stimme für Markus Einsendung
Does anybody have a solution or idea for this? My only guess would be to avoid this by not using translations where these issues occur.
How do other platforms handle this?
Of course, the translation system has no idea which type of word it is placing where in which type of sentence. It only does some string replacements...
PS: Turkish is even more complicated:
For example, on a profile page, we have "Annie's Network". This should translate as "Annie'nin Ağı".
If the first name ends in a vowel, the suffix will start with an n and look like "Annie'nin"
If the first name ends in a consonant, it will not have the first n, and look like "Kris'in"
If the last vowel is an a or ı, it will look like "Dan'ın" or "Seyma'nın"
If the last vowel is an o or u, it will look like "Davud'un" or "Burcu'nun"
If the last vowel is an e or i, it will look like "Erin'in" or "Efe'nin"
If the last vowel is an ö or ü, it will look like "Göz'ün" or "Iminönü'nün"
If the last letter is a k (like the name "Basak"), it will look like "Basağın" or "Eriğin"
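To show that these rules are mechanical enough to live in a per-language helper rather than in the translation string, here is a minimal Python sketch that follows the list above literally. The function name `turkish_genitive` is hypothetical, and the sketch covers only the cases listed here. It also relies on Python's default lowercasing, which turns "I" into "i" instead of the Turkish "ı".

```python
# Maps the last vowel of the name to the vowel used in the suffix.
HARMONY = {"a": "ı", "ı": "ı", "o": "u", "u": "u",
           "e": "i", "i": "i", "ö": "ü", "ü": "ü"}


def turkish_genitive(name):
    """Append the possessive suffix to a name, following the rules listed above."""
    lowered = name.lower()
    # Pick the suffix vowel from the last vowel found in the name.
    suffix_vowel = next(
        (HARMONY[c] for c in reversed(lowered) if c in HARMONY), "i"
    )
    if lowered[-1] == "k":
        # Final k softens to ğ and, per the list above, no apostrophe is used.
        return name[:-1] + "ğ" + suffix_vowel + "n"
    # Names ending in a vowel get a linking n before the suffix.
    linking_n = "n" if lowered[-1] in HARMONY else ""
    return name + "'" + linking_n + suffix_vowel + "n"


for person in ("Annie", "Kris", "Dan", "Burcu", "Göz", "Basak"):
    print(turkish_genitive(person))
```

A translation key could then carry a marker for "possessive form of {user}" and leave the suffix choice to a helper like this one.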
It is actually a very hard problem, as grammar rules differ even among languages from the same family. I don't think you could easily do anything for, let's say, Slavic languages...
However, if you want to solve this problem (because it is extra challenging) and you are looking for creative (cross-inspiring) ways to do that, you might want to look into something called ChoiceFormat (one example is in the ICU Project), or you can look up GNU Gettext's solution to the plural-forms problem.
ICU (mentioned above) has a SelectFormat http://site.icu-project.org/design/formatting/select that may be of help: it's like a choice format but with arbitrary keywords. It also has a PluralFormat, which already has plural rules for many languages.
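The "choice format with arbitrary keywords" idea can be illustrated with a small Python sketch. This is not ICU's API: `select_format` and `english_article_class` are hypothetical names, and the article rule looks only at the first letter, so it would get words like "hour" or "university" wrong.

```python
def select_format(variants, keyword, **params):
    """Pick the message variant for a keyword, falling back to "other"."""
    template = variants.get(keyword, variants["other"])
    return template.format(**params)


def english_article_class(noun):
    """Classify a noun for a/an selection (naive first-letter heuristic)."""
    return "vowel" if noun[0].lower() in "aeiou" else "other"


# The translator supplies one variant per keyword instead of one fixed string.
i_have = {
    "vowel": "I have an {fruit}.",
    "other": "I have a {fruit}.",
}

for fruit in ("apple", "banana"):
    print(select_format(i_have, english_article_class(fruit), fruit=fruit))
```

The translation string stays declarative, and the language-specific decision (which keyword applies to a given variable) lives in a small function per language.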