How to avoid punctuation during tokenization using Stanford NLP - tokenize

I am using Stanford CoreNLP and have tried the following example. It tokenizes the words in the text, but it also extracts punctuation such as commas and full stops. How can I set the properties so that punctuation is not extracted, or is there another way to do the same thing? Here is the code example. I know it's easy in Python, but I'm not sure how to do it in Java. Please suggest.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = "this is simple text written in English,Spanish etc.";
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for(CoreMap sentence: sentences) {
for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
// this is the text of the token
String word = token.get(TextAnnotation.class);
}
}

We don't have any tokenizer option for skipping these, but it shouldn't be difficult. Punctuation strings are a closed class.
You could match tokens which are punctuation using a regular expression. (Use \p{Punct}; see e.g. "Punctuation Regex in Java".) Then just drop tokens whose text content matches such a regex.
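A minimal sketch of that filtering step, written against plain strings so it runs without CoreNLP on the classpath. The class and method names (PunctuationFilter, dropPunctuation) are made up for illustration; in the question's loop you would apply the same check to `word`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PunctuationFilter {
    // Matches tokens that consist only of punctuation, e.g. "," or "...".
    // Without extra flags, \p{Punct} covers ASCII punctuation only.
    private static final Pattern PUNCT_ONLY = Pattern.compile("\\p{Punct}+");

    // Returns the tokens that are not pure punctuation, in their original order.
    static List<String> dropPunctuation(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!PUNCT_ONLY.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Using matches() rather than find() means a token such as "etc." or "semi-final" is kept, because it is not made up entirely of punctuation.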

Related

Escape special characters in Apache pig data

I am using Apache Pig to process some data.
My data set has some strings that contain special characters, i.e. (#,{}[]).
The Programming Pig book says that you can't escape those characters.
So how can I process my data without deleting the special characters?
I thought about replacing them but would like to avoid that.
Thanks
Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters in when part of a string. Just specify that field as type chararray.
The only issue you will have to watch out for here is if your strings ever contain the character that Pig is using as field delimiter - for example, if you are USING PigStorage(',') and your strings contain commas. But as long as you are not telling Pig to parse your field as a map, #, [, and ] will be handled just fine.
The easiest way would be:
input = LOAD 'inputLocation' USING TextLoader() as unparsedString:chararray;
TextLoader just reads each line of input into a String regardless of what's inside that string. You could then use your own parsing logic.
When writing your loader function, instead of returning tuples with e.g. maps as a String (and thus later relying on Utf8StorageConverter to get the conversion to a map right):
Tuple tuple = tupleFactory.newTuple( 1 );
tuple.set(0, new DataByteArray("[age#22, name#joel]"));
you can create and set directly a Java map:
HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);
This is useful especially if you have to do the parsing during loading anyway.

Issues of Error handling with ANTLR3

I tried error reporting in following manner.
@members{
public String getErrorMessage(RecognitionException e,String[] tokenNames)
{
List stack=getRuleInvocationStack(e,this.getClass().getName());
String msg=null;
if(e instanceof NoViableAltException){
<some code>
}
else{
msg=super.getErrorMessage(e,tokenNames);
}
String[] inputLines = e.input.toString().split("\r\n");
String line = "";
if(e.token.getCharPositionInLine()==0)
line = "at \"" + inputLines[e.token.getLine() - 2];
else if(e.token.getCharPositionInLine()>0)
line = "at \"" + inputLines[e.token.getLine() - 1];
return ": " + msg.split("at")[0] + line + "\" => [" + stack.get(stack.size() - 1) + "]";
}
public String getTokenErrorDisplay(Token t){
return t.toString();
}
}
And now errors are displayed as follows.
line 6:7 : missing CLOSSB at "int a[6;" => [var_declaration]
line 8:0 : missing SEMICOL at "int p" => [var_declaration]
line 8:5 : missing CLOSB at "get(2;" => [call]
I have 2 questions.
1) Is there a proper way to do the same thing I have done?
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
Thank you.
1) Is there a proper way to do the same thing I have done?
I don't know if there is a defined proper way of showing errors. My take on showing errors is a litmus test: if the user can figure out how to fix the error based on what you have given them, then it is good. If the user is confused by the error message, then the message needs more work. Based on the examples given in the question, symbols were only char constants.
My favorite way of seeing errors is with the line with an arrow pointing at the location.
i.e.
Expected closing brace on line 6.
int a[6;
^
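As a rough sketch of how such a message could be built: the helper below takes the whole source text plus a 1-based line and a 0-based column (the conventions of ANTLR 3's Token.getLine() and getCharPositionInLine()) and puts a caret under the offending column. ErrorPointer and format are hypothetical names, not part of ANTLR.

```java
final class ErrorPointer {
    // line is 1-based and column is 0-based, matching ANTLR 3 tokens.
    static String format(String message, String source, int line, int column) {
        String[] lines = source.split("\r?\n", -1);
        String text = lines[line - 1];
        StringBuilder sb = new StringBuilder();
        sb.append(message).append(" on line ").append(line).append(".\n");
        sb.append(text).append('\n');
        for (int i = 0; i < column; i++) {
            // Copy tabs from the source line so the caret stays aligned.
            char c = i < text.length() ? text.charAt(i) : ' ';
            sb.append(c == '\t' ? '\t' : ' ');
        }
        sb.append('^');
        return sb.toString();
    }
}
```

Inside getErrorMessage you could call it with e.input.toString(), e.token.getLine() and e.token.getCharPositionInLine(), assuming the token is not null.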
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You will have to read the separately generated token file and then make a map, i.e. a dictionary data structure, to translate the token name into the token character(s).
EDIT
First we have to clarify what is meant by symbol. If you limit the definition of symbol to only those tokens that are defined in the tokens file with a char or string, e.g. '!'=13 or 'public'=92, then this can be done. If, however, you choose to define symbol as any text associated with a token, then that is something other than what I was addressing or plan to address.
When ANTLR generates its token map it uses three different sources:
The char or string constants in the lexer
The char or string constants in the parser.
Internal tokens such as Invalid, Down, Up
Since the tokens in the lexer are not the complete set, one should use the tokens file as a starting point. If you look at the tokens file you will note that the lowest value is 4. If you look at the TokenTypes file (This is the C# version name) you will find the remaining defined tokens.
If you find names like T__ in the tokens file, those are the names ANTLR generated for the char/string literals in the parser.
If you are using string and/or char literals in parser rules, then ANTLR must create a new set of lexer rules that include all of the string and/or char literals in the parser rules. Remember that the parser can only see tokens and not raw text. So string and/or char literals cannot be passed to the parser.
To see the new set of lexer rules, use org.antlr.Tool -Xsavelexer, and then open the created grammar file. The name may be like.g . If you have string and/or char literals in your parser rules you will see lexer rules with names starting with T .
Now that you know all of the tokens and their values you can create a mapping table from the info given in the error to the string you want to output instead for the symbol.
The code at http://markmail.org/message/2vtaukxw5kbdnhdv#query:+page:1+mid:2vtaukxw5kbdnhdv+state:results
is an example.
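A minimal sketch of building such a mapping table, assuming the tokens file contains lines of the form NAME=number and 'literal'=number, as in the '!'=13 example above. TokenSymbols and nameToLiteral are hypothetical names. A token name is mapped to a literal only when both share the same number; this only helps if the literal actually appears in the file, and escape sequences inside literals are not decoded.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TokenSymbols {
    // Builds token name -> literal text from the lines of a .tokens file.
    static Map<String, String> nameToLiteral(List<String> tokenFileLines) {
        Map<Integer, String> literalByType = new HashMap<>();
        Map<String, Integer> typeByName = new HashMap<>();
        for (String raw : tokenFileLines) {
            String line = raw.trim();
            int eq = line.lastIndexOf('='); // last '=' so the literal '=' still parses
            if (eq <= 0) {
                continue;
            }
            String key = line.substring(0, eq);
            int type;
            try {
                type = Integer.parseInt(line.substring(eq + 1).trim());
            } catch (NumberFormatException e) {
                continue; // not a NAME=number line
            }
            if (key.length() >= 2 && key.startsWith("'") && key.endsWith("'")) {
                literalByType.put(type, key.substring(1, key.length() - 1));
            } else {
                typeByName.put(key, type);
            }
        }
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : typeByName.entrySet()) {
            String literal = literalByType.get(e.getValue());
            if (literal != null) {
                result.put(e.getKey(), literal);
            }
        }
        return result;
    }
}
```

Tokens without a literal alias, such as an identifier rule, are simply left out of the result, so the error message can fall back to the token name for those.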
However, the mapping of the tokens can change when, for example, you change rules in the lexer or char/string literals in the parser. So if the message suddenly outputs the wrong string for a symbol, you will have to update the mapping table by hand.
While this is not a perfect solution, it is a possible solution depending on how you define symbol.
Note: Last time I looked ANTLR 4.x creates the table automatically for access within the parser because it was such a problem for so many with ANTLR 3.x.
Bhathiya wrote:
1) Is there a proper way to do the same thing I have done?
There is no single way to do this. Note that proper error-handling and reporting is tricky. Terence Parr spends a whole chapter on this in The Definitive ANTLR Reference (chapter 10). I recommend you get hold of a copy and read it.
Bhathiya wrote:
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You can't. For SEMICOL this may seem easy to do, but how would you get this information for a token like FOO:
FOO : (X | Y)+;
fragment X : '4'..'6';
fragment Y : 'a' | 'bc' | . ;

Strange behavior of Lucene SpanishAnalyzer class with accented words

I'm using the SpanishAnalyzer class in Lucene 3.4. When I want to parse accented words, I'm having a strange result. If I parse, for example, these two words: "comunicación" and "comunicacion", the stems I'm getting are "comun" and "comunicacion". If I instead parse "maratón" and "maraton", I'm getting the same stem for both words ("maraton").
So, at least in my opinion, it's very strange that the same word, "comunicación", gives different results depending on whether it is accented or not. If I search the word "comunicacion", I should get the same result regardless of whether it's accented or not.
The code I'm using is the following:
SpanishAnalyzer sa = new SpanishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", sa);
String str = "comunicación";
String str2 = "comunicacion";
System.out.println("first: " + parser.parse(str)); //stem = comun
System.out.println("second: " + parser.parse(str2)); //stem = comunicacion
The solution I've found to be able to get every single word that shares the stem of "comunicacion", accented or not, is to take off accents in a first step, and then parse it with the Analyzer, but I don't know if it's the right way.
Please, can anyone help me?
Did you check which tokenizer and token filters SpanishAnalyzer uses? There is something called ASCIIFoldingFilter. Try placing it before the StemFilter. It will remove the accents.
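To illustrate what removing the accents before stemming amounts to, here is a small sketch that uses only the JDK's java.text.Normalizer. It is a stand-in for the folding step, not the Lucene filter itself, and AccentFolding and stripAccents are made-up names. It is also the "take off accents in a first step" approach from the question.

```java
import java.text.Normalizer;

final class AccentFolding {
    // Decomposes accented letters (o with acute -> 'o' + combining accent)
    // and then drops the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

With this, "comunicación" and "comunicacion" become the same string before they reach the stemmer. Note that it also turns "ñ" into "n", which may or may not be what you want for Spanish text.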

Using Hibernate Search (Lucene), I Need to Be Able to Search a Code With or Without Dashes

This is really the same as it would be for a social security #.
If I have a code with this format:
WHO-S-09-0003
I want to be able to do:
query = qb.keyword().onFields("key").matching("WHOS090003").createQuery();
I tried using a WhitespaceAnalyzer.
Using StandardAnalyzer or WhitespaceAnalyzer both have the same problem. They will index 'WHO-S-09-0003' as is which means that when you do a search it will only work if you have hyphens in the search term.
One solution to your problem would be to implement your own TokenFilter which detects the format of your codes and removes the hyphens during indexing. You can use AnalyzerDef to build a chain of token filters and an overall custom analyzer. Of course you will have to use the same analyzer when searching, but the Hibernate Search query DSL will take care of that.
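The core of such a filter is just a string transformation. Here is a minimal sketch of that step on its own, without the Lucene or Hibernate Search wiring. The pattern is an assumption based on the single example WHO-S-09-0003 (uppercase letters followed by hyphen-separated groups of uppercase letters or digits), and CodeNormalizer is a hypothetical name.

```java
import java.util.regex.Pattern;

final class CodeNormalizer {
    // Assumed code format: letters, then one or more "-GROUP" parts.
    private static final Pattern CODE = Pattern.compile("[A-Z]+(-[A-Z0-9]+)+");

    // Removes hyphens only from terms that look like a code.
    static String normalize(String term) {
        return CODE.matcher(term).matches() ? term.replace("-", "") : term;
    }
}
```

Because the same function is applied at index time and at query time, both "WHO-S-09-0003" and "WHOS090003" end up as the term "WHOS090003".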
Actually, you can implement your own method like this one:
private String specialCharacters(String keyword) {
String [] specialChars = {"-","!","?"};
for(int i = 0; i < specialChars.length; i++ )
if(keyword.indexOf(specialChars[i]) > -1)
keyword = keyword.replace(specialChars[i], "\\"+specialChars[i]);
return keyword;
}
As you know, Lucene has special characters, so if you want to escape them you should insert a backslash before each one (written as a double backslash in a Java string literal).

Hyphens in Lucene

I'm playing around with Lucene and noticed that the use of a hyphen (e.g. "semi-final") will result in two words ("semi" and "final") in the index. How is this supposed to match if the user searches for "semifinal", in one word?
Edit: I'm just playing around with the StandardTokenizer class actually, maybe that is why? Am I missing a filter?
Thanks!
(Edit)
My code looks like this:
StandardAnalyzer sa = new StandardAnalyzer();
TokenStream ts = sa.TokenStream("field", new StringReader("semi-final"));
while (ts.IncrementToken())
{
string t = ts.ToString();
Console.WriteLine("Token: " + t);
}
This is the explanation for the tokenizer in Lucene:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Found here
This explains why it would be splitting your word.
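The hyphen rule from that description can be sketched in a few lines. This is a deliberate simplification of the prose above, not the tokenizer's real grammar (which is stricter about where the digits must appear), and HyphenSplitRule is a made-up name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class HyphenSplitRule {
    // Splits at hyphens unless the token contains a digit, in which case
    // it is kept whole, like a product number.
    static List<String> tokens(String word) {
        boolean hasDigit = word.chars().anyMatch(Character::isDigit);
        if (hasDigit) {
            return Collections.singletonList(word);
        }
        return Arrays.asList(word.split("-"));
    }
}
```

So "semi-final" comes out as the two tokens "semi" and "final", while something like "XL-1000" stays in one piece.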
This is probably the hardest thing to correct, human error. If an individual types in semifinal, this is theoretically not the same as searching semi-final. So if you were to have numerous words that could be typed in different ways ex:
St-Constant
Saint Constant
Saint-Constant
you're stuck with the task of having both "st" and "saint", hyphenated or not, to verify. Your token lists would be huge, and each word would need to be compared to see whether it matched.
I'm still looking to see if there is a good way of approaching this. Otherwise, if you don't have a lot of words you wish to use, have all the possibilities stored and tested, or have a loop that splits the word in two, starting at the first letter and moving through each letter, testing the whole way through to see if it matches. But again, who's to say you only have two words? If you are verifying more than two words, then you have the problem of splitting the word into multiple sections,
example
saint-jean-sur-richelieu
If I come up with anything else I will let you know.
I would recommend you use the WordDelimiterFilter from Solr (you can use it in just your Lucene application as a TokenFilter added to your analyzer, just go get the java file for this filter from Solr and add it to your application).
This filter is designed to handle cases just like this:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
If you're looking for a port of the WordDelimiterFilter then I advise a google of WordDelimiter.cs, I found such a port here:
http://osdir.com/ml/attachments/txt9jqypXvbSE.txt
I then created a very basic WordDelimiterAnalyzer:
public class WordDelimiterAnalyzer: Analyzer
{
#region Overrides of Analyzer
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
TokenStream result = new WhitespaceTokenizer(reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(true, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
result = new WordDelimiterFilter(result, 1, 1, 1, 1, 0);
return result;
}
#endregion
}
I said it was basic :)
If anyone has an implementation I would be keen to see it!
You can write your own tokenizer which will produce for words with hyphen all possible combinations of tokens like that:
semifinal
semi
final
You will need to set proper token offsets to tell Lucene that semi and semifinal actually start at the same place in the document.
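A rough sketch of the token list such a tokenizer might produce, kept independent of the Lucene API. Tok is a hypothetical holder class, not a Lucene type; a position increment of 0 is how Lucene stacks a token at the same position as the previous one, which is what lets "semi" and "semifinal" start at the same place.

```java
import java.util.ArrayList;
import java.util.List;

final class HyphenVariants {
    // Hypothetical token holder; not a Lucene class.
    static final class Tok {
        final String text;
        final int start;             // start offset in the original word
        final int end;               // end offset (exclusive)
        final int positionIncrement; // 0 = same position as the previous token

        Tok(String text, int start, int end, int positionIncrement) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits the joined form first, then each hyphen-separated part.
    static List<Tok> expand(String word) {
        List<Tok> out = new ArrayList<>();
        if (!word.contains("-")) {
            out.add(new Tok(word, 0, word.length(), 1));
            return out;
        }
        // Joined form covers the whole original word, hyphens included.
        out.add(new Tok(word.replace("-", ""), 0, word.length(), 1));
        int offset = 0;
        boolean first = true;
        for (String part : word.split("-")) {
            if (!part.isEmpty()) {
                // The first part shares its position with the joined form.
                out.add(new Tok(part, offset, offset + part.length(), first ? 0 : 1));
                first = false;
            }
            offset += part.length() + 1; // skip the part and its hyphen
        }
        return out;
    }
}
```

For "semi-final" this yields semifinal (offsets 0 to 10), semi (0 to 4, increment 0) and final (5 to 10). In a real tokenizer you would copy these values into the offset and position increment attributes of each emitted token.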
The rule (for the classic analyzer) is written in JFlex:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
| {HAS_DIGIT} {P} {ALPHANUM}
| {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
| {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
| {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
// punctuation
P = ("_"|"-"|"/"|"."|",")