StanfordNLP Spanish Tokenizer - tokenize

I want to tokenize a text in Spanish with StanfordNLP. My problem is that the tokenizer splits any word matching the pattern "\d*s " (a word composed of digits and ending with an "s") into two tokens. If the word ends with another letter, such as "e", the tokenizer returns only one token.
For instance, given the sentence:
"Vendo iPhone 5s es libre de fabrica esta nuevo sin usar."
The tokenizer returns three tokens for the text "iPhone 5s": "iPhone", "5" and "s".
Does anyone have an idea how I could avoid this behaviour?

I suppose you are working with the SpanishTokenizer rather than the PTBTokenizer.
The SpanishTokenizer is heavily based on the FrenchTokenizer, which in turn derives from the PTBTokenizer (English).
I've run all three on your sentence, and it seems that the PTBTokenizer gives the result you need, but the other two do not.
Since all of them are deterministic tokenizers, I don't think you can avoid the problem: it seems to me that it lies in the deterministic part, not in the heuristic part, which runs afterwards.
A possible workaround is to use the WhitespaceTokenizer, as long as you can live without separate punctuation tokens and some of the other grammar rules.
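To illustrate what the whitespace workaround trades away, here is a minimal Python sketch. It uses plain `str.split`, not the Stanford WhitespaceTokenizer itself, so it only approximates that tokenizer's behaviour; the function name is made up for the example.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()


sentence = "Vendo iPhone 5s es libre de fabrica esta nuevo sin usar."
tokens = whitespace_tokenize(sentence)
print(tokens)
```

"5s" survives as a single token, but the final period stays glued to "usar.", which is the cost mentioned above.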

Related

Does Chinese need WordPiece?

I want to use the Chinese BERT model. In tokenization.py, I found the WordpieceTokenizer class (https://github.com/google-research/bert/blob/master/tokenization.py), but I don't think WordPiece is needed for Chinese, because the minimal unit of Chinese is the character.
WordpieceTokenizer is just for English text, am I right?
From the README:
We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages.
However, from the Multilingual README (emphasis added):
Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace characters, we add spaces around every character in the CJK Unicode range before applying WordPiece.
So WordPiece is presumably run on the whole sentence, though it would only matter for sentences that contained non-Chinese characters. So to run the code as-is you would want WordPiece.
However, to clarify:
WordPiece is not just for English, it can be used on any language and in practice is used on many
Whether single character-based tokenization for Chinese is the best decision is debated
WordPiece is not available outside Google; SentencePiece could be used as a replacement (though I think the BERT code might have a pretrained model)

Parser not recognizing a dash

My program makes calculations on physics vectors. It allows copy/pasting from websites and then tries to parse the pasted text into the x, y, and z components automatically. I've come across one website (http://mathinsight.org/cross_product_examples) that has (3,−3,1). While that looks normal, that minus is actually not recognized by VB. Visually, it is longer than the normal minus (− and -), but it returns the same Unicode value of 45. This picture shows the Unicode value for every character in the Textbox (I added a minus in front of the first 3 for comparison). Also, on this website I had to use Ctrl+C, because right-clicking shows that this is not simple HTML.
One is valid (the first), but the second gives VB fits, as shown below. Either it won't compile (shown by the blue line below) or a simple assignment (the second one) wreaks havoc on my form.
I have tried using
vectorString.Replace("–", "-")
and pasting in the longer dash for the target string and a normal keystroke dash as the replacement, but nothing happens. I'm guessing that is because they both have the same Unicode value.
Is there some way to convert the longer, invalid dash into the one recognized by VB? I tried using the dash symbol that Word likes to replace the minus sign with, and it comes up as Unicode 150. So, apparently there are at least three different kinds of dashes. Any thoughts?
The character from Math Insight is U+2212, minus sign. The character you tried using in your Replace call is U+2013, en dash. That's why your replace didn't work.
Beyond the standard ASCII hyphen (-, U+002D, decimal 45), there are two common dashes: the en dash (–, U+2013) and the em dash (—, U+2014). There is also a figure dash (‒, U+2012), but it is not as common.
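As a rough sketch of the fix, the dash-like characters listed above can be mapped to the ASCII hyphen by code point before parsing. The example is in Python rather than VB, and the names are made up for illustration. The same replacement can be written in VB, as long as the target character is specified by its code point and not by a pasted glyph.

```python
# Map dash-like characters to the ASCII hyphen-minus (U+002D).
DASH_LIKE = {
    "\u2212": "-",  # minus sign (the character used on Math Insight)
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2012": "-",  # figure dash
}
DASH_TABLE = str.maketrans(DASH_LIKE)


def normalize_dashes(text):
    """Replace every dash-like character in text with a plain ASCII hyphen."""
    return text.translate(DASH_TABLE)


vector = "(3,\u22123,1)"
cleaned = normalize_dashes(vector)
print(cleaned, [hex(ord(c)) for c in cleaned])
```

Printing the code points, as the last line does, is also a quick way to check which dash a pasted string really contains.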

Collect a word between two spaces in objective c

I'm trying to implement stuff similar to spell check, but I need to get the word that is limited by a space. EX: "HI HOW R U", I need to collect HI, HOW and so on as they type. i.e. After user hits HI and space I need to collect HI and do a spell check.
Check the documentation for NSString here. You want the message componentsSeparatedByString:.
I don't know objective-C, but I'm fairly sure it'll have a Regexp library - although it'd be straightforward to code it without one.
Regexp: \b([^\s])*\b
\b = word boundary (whitespace, comma, dot, exclamation-mark, etc.)
\s = whitespace character
[...] = character set
[^...] = negated character set (any character(s) EXCEPT ...)
() = grouping construct
* = zero or more times
So the suggested expression would start matching at any word boundary, then match every subsequent character that is not a whitespace character, then match a word boundary.
Your stated case is so simple you may just want to look for spaces (one char at a time) and get the substring, but RegExp is very widely used across a range of languages and platforms, and so it's fairly easy to find an expression when you need to - and one often does for common stuff like checking if zip codes, phone numbers, email addresses and so on are syntactically correct. So it's worth learning in any case. :)
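The "just look for spaces" approach could be sketched like this. It is written in Python rather than Objective-C, and `completed_word` is a hypothetical helper, not an existing API: it returns the word that was just finished whenever the typed text ends with a space.

```python
def completed_word(buffer):
    """Return the word just finished if the buffer ends with a space, else None."""
    if not buffer.endswith(" "):
        return None
    words = buffer.split()
    return words[-1] if words else None


typed = ""
collected = []
for ch in "HI HOW R U ":  # simulate the user typing one character at a time
    typed += ch
    word = completed_word(typed)
    if word is not None:
        collected.append(word)  # this is where a spell check would run

print(collected)
```

Note that typing two spaces in a row would report the same word twice, so a real implementation would need to guard against that.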

RegexKitLite: Match Expression --> Match anything except ] --> Match ]

I am essentially attempting to replace all of the footnotes in a large text. There are various reasons I am doing this in Objective-C, so please assume that constraint.
Every footnote begins with this: [Footnote
Every footnote ends with this: ]
There can be absolutely anything between those two markers, including line breaks. However, there will never be ] between them.
So, essentially I want to match [Footnote, then match anything except ], until ] is matched.
This is the closest I have been able to get to identifying all of the footnotes:
NSString *regexString = @"[\\[][F][o][o][t][n][o][t][e][^\\]\n]*[\\]]";
Using this regular expression manages to identify 780/889 footnotes. It also appears that none of those 780 are false alarms. The only ones it appears to miss are those footnotes that have line breaks in them.
I have spent a lengthy amount of time on www.regular-expressions.info, specifically on the page about dots (http://www.regular-expressions.info/dot.html). This has helped me to create the above regular expression, but I have not truly figured out how to match any character or line break except a right bracket.
Using the following regular expression instead manages to capture all of the footnotes, but it captures way too much text, because * is greedy: (?s)[\\[][F][o][o][t][n][o][t][e].*[\\]]
Here is some sample text that the regular expression is run on:
<p id="id00082">[Footnote 1: In the history of Florence in the early part of the XVIth century <i>Piero di Braccio Martelli</i> is frequently mentioned as <i>Commissario della Signoria</i>. He was famous for his learning and at his death left four books on Mathematics ready for the press; comp. LITTA, <i>Famiglie celebri Italiane</i>, <i>Famiglia Martelli di Firenze</i>.—In the Official Catalogue of MSS. in the Brit. Mus., New Series Vol. I., where this passage is printed, <i>Barto</i> has been wrongly given for Braccio.</p>
<p id="id00083">2. <i>addi 22 di marzo 1508</i>. The Christian era was computed in Florence at that time from the Incarnation (Lady day, March 25th). Hence this should be 1509 by our reckoning.</p>
<p id="id00084">3. <i>racolto tratto di molte carte le quali io ho qui copiate</i>. We must suppose that Leonardo means that he has copied out his own MSS. and not those of others. The first thirteen leaves of the MS. in the Brit. Mus. are a fair copy of some notes on physics.]</p>
<p id="id00085">Suggestions for the arrangement of MSS treating of particular subjects.(5-8).</p>
When you put together the science of the motions of water, remember to include under each proposition its application and use, in order that this science may not be useless.--
[Footnote 2: A comparatively small portion of Leonardo's notes on water-power was published at Bologna in 1828, under the title: "_Del moto e misura dell'Acqua, di L. da Vinci_".]
In this example there are two footnotes and some non-footnote text. The first footnote, as you can see, contains two line breaks inside it. The second one contains no line breaks.
The first regular expression I mentioned above will manage to capture Footnote 2 in this example text, but it will not capture Footnote 1 because it contains line breaks.
Any improvements on my regular expression would be most appreciated.
Try
@"\\[Footnote[^\\]]*\\]";
This should match across newlines. No need to put a single character into a character class, either.
As a commented, multiline regex (without string escapes):
\[ # match a literal [
Footnote # match literal "Footnote"
[^\]]* # match zero or more characters except ]
\] # match ]
Inside a character class ([...]), the caret ^ takes on a different meaning; it negates the contents of the class. So [ab] matches a or b, whereas [^ab] matches any character except a or b.
Of course, if you have nested footnotes, this will malfunction. A text like [Footnote foo [footnote bar] foo] will match from the beginning until bar]. To avoid this, change the regex to
@"\\[Footnote[^\\]\\[]*\\]";
so neither opening nor closing brackets are allowed. Then of course, you only match the innermost Footnotes and will have to apply the same regex twice (or more, depending on the maximum level of nesting) to the entire text, "peeling back" layer by layer.
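A quick way to check that the negated character class matches across line breaks is the sketch below. It uses Python's `re` module instead of RegexKitLite, and the sample text is a shortened, made-up stand-in for the excerpt in the question.

```python
import re

# [^\]]* matches any character except ], including newlines.
FOOTNOTE = re.compile(r"\[Footnote[^\]]*\]")

text = (
    "<p>[Footnote 1: first line.</p>\n"
    "<p>2. second line.</p>\n"
    "<p>3. third line.]</p>\n"
    "Body text.--\n"
    "[Footnote 2: a single-line note.]"
)

matches = FOOTNOTE.findall(text)   # both footnotes, including the multi-line one
stripped = FOOTNOTE.sub("", text)  # text with the footnotes removed
print(len(matches))
print(stripped)
```

No dot-matches-newline flag is needed, because a negated character class already includes line breaks, unlike the dot.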

How to remove strings contained in a list in VB.NET?

How can I find words like and, or, to, a, no, with, for, etc. in a sentence using VB.NET and remove them? Also, where can I find a full list of such words?
Note that unless you use Regex word boundaries you risk falling afoul of the Scunthorpe (Sfannythorpe) problem.
string pattern = @"\band\b";
Regex re = new Regex(pattern);
string input = "a band loves and its fans";
string output = re.Replace(input, ""); // "a band loves  its fans" (a double space is left behind)
Notice the 'and' in 'band' is untouched.
You can indeed replace your list of words using the .Replace function (as colithium described) ...
myString.Replace("and", "")
Edit:
... but indeed, a nicer way is to use Regular Expressions (as edg suggested) to avoid replacing parts of words.
As your question suggests that you would like to clean up a sentence to keep only the meaningful words, you have to do more than just remove two- and three-letter words.
What you need is a list of stop-words:
http://en.wikipedia.org/wiki/Stop_word
A comma-separated list of stop-words for the English language can be found here:
http://www.textfixer.com/resources/common-english-words.txt
The easiest way is:
myString.Replace("and", "")
You'd loop over your word list and have a statement like the above. Google for a list of common English words?
List of English 2 Letter Words
List of English 3 Letter Words
You can match the words and remove them using regular expressions.
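Putting the suggestions together, a minimal sketch of whole-word stop-word removal might look like this. It is in Python rather than VB.NET, the short word list is taken from the question rather than from a full stop-word file, and the names are made up for the example.

```python
import re

# Small illustrative list; a real one would be loaded from a stop-word file.
STOP_WORDS = ["and", "or", "to", "a", "no", "with", "for"]

# \b ... \b restricts matches to whole words, so "band" is left alone.
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)


def remove_stop_words(sentence):
    """Delete stop words, then collapse the leftover runs of whitespace."""
    without = PATTERN.sub("", sentence)
    return " ".join(without.split())


print(remove_stop_words("a band loves and its fans"))
```

Collapsing whitespace afterwards avoids the double spaces that a bare replace leaves where a word was removed.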