Custom tokenization rules - tokenize

Is it possible to configure custom tokenization rules for a field that will break words containing letters and numbers into separate tokens? For example, I'd like the string "S1E1e2" to be split into three tokens "S1", "E1", and "e2".
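(Purely to illustrate the split being asked for, not a field configuration: in plain Python, a hypothetical letter-run-plus-digit-run pattern reproduces the example.)

import re

# A sketch only: each token is a run of letters followed by a run of digits.
print(re.findall(r"[A-Za-z]+\d+", "S1E1e2"))  # ['S1', 'E1', 'e2']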

Related

Seeking the right token filters for my requirements and getting desperate

I'm indexing documents which contain normal text, programming code and other non-linguistic fragments. For reasons which aren't particularly relevant I am trying to tokenise the content into lowercased strings of normal language, and single character symbols.
Thus the input
a few words. Cost*count
should generate the tokens
[a] [few] [words] [.] [cost] [*] [count]
So far, this is all extremely straightforward. But I want to handle "compound" words too, because the content can include words like order_date and object-oriented and class.method as well.
I'm following the principle that any of [-], [_] and [.] should be treated as a compound word conjunction rather than a symbol IF they are between two word characters, and should be treated as a separate symbol character if they are adjacent to a space, another symbol character, or the beginning or end of a string. I can handle all of this with a PatternTokenizer, like so:
public static final String tokenRgx = "(([A-Za-z0-9]+[-_.])*[A-Za-z0-9]+)|[^A-Za-z0-9\\s]{1}";

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // Group 0: emit each whole regex match as a token, then lowercase it.
    PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRgx), 0);
    TokenStream result = new LowerCaseFilter(src);
    return new TokenStreamComponents(src, result);
}
This successfully distinguishes between full stops at the end of sentences and full stops in compounds, between hyphens introducing negative numbers and hyphenated words, etc. So in the above analyzer, the input:
a few words. class.simple_method_name. dd-mm-yyyy.
produces the tokens
[a] [few] [words] [.] [class.simple_method_name] [.] [dd-mm-yyyy] [.]
We're almost there, but not quite. Finally I want to split the compound terms into their parts RETAINING the trailing hyphen/underscore/stop character in each part. So I think I need to introduce another filter step to my analyzer so that the final set of tokens I end up with is this:
[a] [few] [words] [.] [class.] [simple_] [method_] [name] [.] [dd-] [mm-] [yyyy] [.]
And this is the piece that I am having trouble with. I presume that some kind of PatternCaptureGroupTokenFilter is required here but I haven't been able to find the right set of expressions to get the exact tokens I want emerging from the analyzer.
I know it must be possible, but I seem to have walked into a regular expression wall that blocks me. I need a flash of insight or a hint, if anyone can offer me one.
Thanks,
T
Edit:
Thanks to @rici for pointing me towards the solution.
The string which works (including support for decimal numbers) is:
String tokenRegex = "-?[0-9]+\\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]";
Seems to me like it would be easier to do the whole thing in one scan, using a regex like:
[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\\s]
That uses a zero-width forward assertion in order to only add [-._] to the preceding word if it is immediately followed by a letter or digit. (Because (?=…) is an assertion, it doesn't include the following character in the match.)
To my mind, that won't correctly handle decimal numbers; -3.14159 will be three tokens rather than a single number token. But it depends on your precise needs.
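To make the lookahead behaviour concrete, the final pattern can be sanity-checked outside Lucene, for example with Python's re module (a sketch only; the lowercasing stands in for the LowerCaseFilter, and PatternTokenizer is not involved):

import re

# The pattern from the edit above, written as a raw Python string.
token_regex = re.compile(r"-?[0-9]+\.[0-9]+|[A-Za-z0-9]+([-_.](?=[A-Za-z0-9]))?|[^A-Za-z0-9\s]")

text = "a few words. class.simple_method_name. dd-mm-yyyy."
tokens = [m.group(0).lower() for m in token_regex.finditer(text)]
print(tokens)
# ['a', 'few', 'words', '.', 'class.', 'simple_', 'method_', 'name', '.', 'dd-', 'mm-', 'yyyy', '.']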

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
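A minimal sketch of such a component for spaCy 2.x (the regex, the component name, and the choice of NNP/PROPN are illustrative; the point is that it runs before the tagger and pre-sets TAG/POS, which the tagger then leaves untouched):

import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical pattern for the tokens that should be forced to PROPN.
SKIP_RE = re.compile(r"^[A-Z]+_\d+$")

def force_propn(doc):
    # Pre-set TAG and POS; the tagger won't overwrite tags that are already set.
    for token in doc:
        if SKIP_RE.match(token.text):
            token.tag_ = "NNP"
            token.pos_ = "PROPN"
    return doc

nlp.add_pipe(force_propn, name="force_propn", before="tagger")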
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
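A sketch of the sentence-boundary idea, continuing the sketch above (so nlp is the pipeline created there) and assuming the retokenizing component flags skipped tokens with a custom extension; Token._.is_skipped is hypothetical. This has to run before the parser, since sentence starts can't be changed once the document is parsed:

from spacy.tokens import Token

Token.set_extension("is_skipped", default=False)

def isolate_skipped(doc):
    # Pre-set sentence boundaries so each skipped token ends up in its own sentence.
    for i, token in enumerate(doc):
        if token._.is_skipped:
            token.is_sent_start = True
            if i + 1 < len(doc):
                doc[i + 1].is_sent_start = True
    return doc

nlp.add_pipe(isolate_skipped, name="isolate_skipped", before="parser")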
The parser does use the token.norm feature, so if you set the NORM attribute in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent, similar proper noun in American newspaper text from 20 years ago; so if the skipped token should behave like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
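In the retokenizer call that creates the skipped token, NORM can be set alongside TAG and POS. A sketch (the example text, the span indices, and the choice of "Bush" are illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("the report mentions XYZZY 12 B twice")

# Hypothetical span found by the scanner: tokens 3..5 ("XYZZY 12 B").
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:6],
                      attrs={"TAG": "NNP", "POS": "PROPN", "NORM": "Bush"})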
If you're using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as a similar word (check that the similar word has a vector first). The snippets below tell the model to use the same row in the vector table for UNKNOWN_SKIPPED as for Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so it's possible that, if you choose those well, you might get pretty close to the results you want.

Spacy tokenizer to handle final period in sentence

I'm using Spacy to tokenize sentences, and I know that the text I pass to the tokenizer will always be a single sentence.
In my tokenization rules, I would like non-final periods (".") to be attached to the text before them, so I updated the suffix rules to remove the rules that split on periods (this handles abbreviations correctly).
The exception, however, is that the very last period should be split into a separate token.
I see that the latest version of Spacy allows you to split tokens after the fact, but I'd prefer to do this within the Tokenizer itself so that other pipeline components are processing the correct tokenization.
Here is one solution that uses some post processing after the tokenizer:
I added "." to suffixes so that a period is always split into its own token.
I then used a regex to find non-final periods, generated a span with doc.char_span, and merged the span into a single token with span.merge.
Would be nice to be able to do this within the tokenizer if anyone knows how to do that.
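A sketch of that approach for spaCy 2.x (approximate: the regex for "non-final period" is mine, and doc.retokenize() is used as the non-deprecated equivalent of span.merge):

import re
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# 1. Add a bare "." to the suffixes so a period is always split into its own token.
suffixes = list(nlp.Defaults.suffixes) + [r"\."]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# 2. Merge every non-final period back onto the word before it.
def merge_nonfinal_periods(doc):
    with doc.retokenize() as retokenizer:
        # "Non-final" here means: a word plus period with more text after it.
        for match in re.finditer(r"\w+\.(?!\s*$)", doc.text):
            span = doc.char_span(match.start(), match.end())
            if span is not None and len(span) > 1:
                retokenizer.merge(span)
    return doc

doc = merge_nonfinal_periods(nlp("This sentence mentions Dr. Smith and ends here."))
print([t.text for t in doc])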

How can I use the same lexer to provide token streams with and without whitespace?

I have a lexer grammar that defines a lexer that is used in two ways: to identify tokens for a syntax-aware editor, and to identify tokens for the parser. In the first case, the lexer should return comments and whitespace, but in the second case, the comments and whitespace are not wanted. Do I need two different lexer classes, each defined by its own variant of the grammar? Or can I accomplish this with a single lexer by using channels? How?
If I need two separate grammars, I assume I can factor out all the rules except for comments and whitespace, and then import those rules from that separate "common" grammar.
Usually you filter out tokens (like whitespaces) via token channels (or skip them entirely). This is part of your grammar and hence you'd need 2 grammars if you want whitespaces in one use case and not in the other. And yes, you can import a base grammar with all the common rules into specialized grammars which only hold the differences. You can even override rules (define e.g. the whitespace rule in the base grammar and redefine it in your main grammar).
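For the two-grammar route, a minimal sketch of what the base grammar and the override might look like (grammar and rule names are illustrative; channel(HIDDEN) keeps the tokens in the token stream for tools that want to see them, while skip discards them entirely):

// CommonLexer.g4 - parser use case: whitespace and comments go to the hidden channel.
lexer grammar CommonLexer;
WS      : [ \t\r\n]+    -> channel(HIDDEN) ;
COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;

// EditorLexer.g4 - editor use case: override WS so whitespace stays on the default channel.
lexer grammar EditorLexer;
import CommonLexer;
WS : [ \t\r\n]+ ;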
But keep in mind that not filtering whitespaces will have consequences for all your other rules. In that case you would have to explicitly add whitespace handling to your parser rules everywhere. For instance:
blah: a or b;
versus
blah: a WS* or WS* b;

Is there a way to get the number of tokens in an ANTLR4 parser rule?

In ANTLR4, it seems that predicates can only be placed at the front of sub-rules in order for them to cause the sub-rule to be skipped. In my grammar, some predicates depend on a token that appears near the end of the sub-rule, with one or more rule invocations in front of it. For example:
date :
    {isYear(_input.LT(3).getText())}?
    month day=INTEGER year=INTEGER { ... }
    ;
In this particular example, I know that month is always one single token, so it is always Token 3 that needs to be checked by isYear(). In general, though, I won't know the number of tokens making up a rule like month until runtime. Is there a way to get its token count?
There is no built-in way to get the length of a rule programmatically. You could use the documentation for ATNState in combination with the _ATN field in your parser to calculate all paths through a rule; if all paths through the rule contain the same number of tokens, then you have calculated the exact number of tokens used by the rule.