OpenNLP Name Finder - apache

I am using the NameFinder API example doc of OpenNLP. After initializing the Name Finder the documentation uses the following code for the input text:
for (String document[][] : documents) {
for (String[] sentence : document) {
Span nameSpans[] = nameFinder.find(sentence);
// do something with the names
}
nameFinder.clearAdaptiveData();
}
However, when I bring this into Eclipse, the 'documents' (not 'document') variable gives me an error saying the variable documents cannot be resolved. What is the documentation referring to with the 'documents' array variable? Do I need to initialize an array called 'documents' that holds txt files for this error to go away?
Thank you for your help.

The OpenNLP documentation states that the input text should be segmented into documents, sentences and tokens. The piece of code you provided illustrates how to deal with several documents.
If you have only one document, you don't need the outer for loop, just the inner one over the array of sentences, each of which is an array of tokens.
To create an array of sentences from a document you can use the OpenNLP SentenceDetector, and for each sentence you can use OpenNLP Tokenizer to get the array of tokens.
Your code will look like this:
// somehow get the contents from the txt file
// and populate a string called documentStr
String sentences[] = sentenceDetector.sentDetect(documentStr);
for (String sentence : sentences) {
String tokens[] = tokenizer.tokenize(sentence);
Span nameSpans[] = nameFinder.find(tokens);
// do something with the names
System.out.println("Found entity: " + Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
}
You can learn how to use the SentenceDetector and the Tokenizer from the OpenNLP documentation.
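To make the shape of the 'documents' variable from the question concrete, here is a minimal, self-contained sketch. It assumes naive regex splitting as a stand-in for the OpenNLP SentenceDetector and Tokenizer; the class and method names (DocumentsShape, splitSentences, splitTokens, toDocument) are hypothetical and not part of OpenNLP. It only shows how a String[][][] of documents, sentences and tokens could be built and iterated:

```java
import java.util.Arrays;

public class DocumentsShape {

    // Hypothetical stand-in for a sentence detector: splits after '.', '!' or '?'.
    static String[] splitSentences(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    // Hypothetical stand-in for a tokenizer: splits on whitespace only.
    static String[] splitTokens(String sentence) {
        return sentence.trim().split("\\s+");
    }

    // One document is an array of sentences; each sentence is an array of tokens.
    static String[][] toDocument(String text) {
        String[] sentences = splitSentences(text);
        String[][] document = new String[sentences.length][];
        for (int i = 0; i < sentences.length; i++) {
            document[i] = splitTokens(sentences[i]);
        }
        return document;
    }

    public static void main(String[] args) {
        String[] texts = {
            "Pierre Vinken joined the board. He is 61 years old.",
            "Rudolph Agnew was named a director."
        };
        // This is the array the documentation snippet calls 'documents'.
        String[][][] documents = new String[texts.length][][];
        for (int i = 0; i < texts.length; i++) {
            documents[i] = toDocument(texts[i]);
        }
        for (String[][] document : documents) {
            for (String[] sentence : document) {
                // nameFinder.find(sentence) would be called here
                System.out.println(Arrays.toString(sentence));
            }
            // nameFinder.clearAdaptiveData() would be called here, once per document
        }
    }
}
```

With real text, the regex splitting would mis-handle abbreviations and punctuation, which is why the trained OpenNLP components are the better choice.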

Related

How can I Extract words with its coordinates from pdf using .net?

I'm working with a PDF in Hebrew with diacritical marks. I want to extract all the words with their coordinates. I tried ITextSharp and pdfClown, and neither gave me what I want.
With pdfClown, letters/characters are missing; with ITextSharp, I don't get the word coordinates.
Is there a way to do it? (I'm looking for a free framework\code)
EDIT:
PDFClown Code:
File file = new File(PDFFilePath);
TextExtractor te = new TextExtractor();
IDictionary<RectangleF?, IList<ITextString>> strs = te.Extract(file.Document.Pages[0].Contents);
List<string> correctText = new List<string>();
foreach (var key in strs.Keys)
{
foreach (var value in strs[key])
{
string reversedText = new string(value.Text.Reverse().ToArray());
string cleanText = RemoveDiacritics(reversedText);
correctText.Add(cleanText);
}
}
You aren't showing how you are trying to extract text using iText(Sharp). I am assuming that you are following the official documentation and that your code looks like this:
public string ExtractText(byte[] src) {
PdfReader reader = new PdfReader(src);
MyTextRenderListener listener = new MyTextRenderListener();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
PdfDictionary pageDic = reader.GetPageN(1);
PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
processor.ProcessContent(
ContentByteUtils.GetContentBytesForPage(reader, 1), resourcesDic);
return listener.Text.ToString();
}
If your code doesn't look like this, that already explains the first thing you're doing wrong.
In this method, there is one class that isn't part of iTextSharp: MyTextRenderListener. This is a class you should write and that looks for instance like this:
public class MyTextRenderListener : IRenderListener {
public StringBuilder Text { get; set; }
public MyTextRenderListener() {
Text = new StringBuilder();
}
public void BeginTextBlock() {
Text.Append("<");
}
public void EndTextBlock() {
Text.AppendLine(">");
}
public void RenderImage(ImageRenderInfo renderInfo) {
}
public void RenderText(TextRenderInfo renderInfo) {
Text.Append("<");
Text.Append(renderInfo.GetText());
LineSegment segment = renderInfo.GetBaseline();
Vector start = segment.GetStartPoint();
Text.Append("| x=");
Text.Append(start[Vector.I1]);
Text.Append("; y=");
Text.Append(start[Vector.I2]);
Text.Append(">");
}
}
When you run this code and look at what's inside Text, you'll notice that a PDF document doesn't store words. Instead, it stores text blocks. In our special IRenderListener, we indicate the start and the end of text blocks using < and >. Inside these text blocks, you'll find text snippets. We'll mark text snippets like this: <text snippet| x=36.0000; y=806.0000> where the x and y values give you the coordinates of the start of the baseline (as opposed to the ascent and descent position). You can also get the end position of the baseline (and the ascent/descent).
Now how do you distill words out of all of this? The problem with the text snippets you get, is that they don't correspond with words. See for instance this file: hello_reverse.pdf
When you open it in Adobe Reader, you read "Hello World Hello People." You'd hope you'd find four words in the content stream, wouldn't you? In reality, this is what you'll find:
<>
<<ld><Wor><llo><He>>
<<Hello People>>
To distill the words "World" and "Hello" from the first line, you need to do plenty of math. Instead of getting the baseline of the TextRenderInfo object passed to the RenderText() method of your render listener, you have to use the GetCharacterRenderInfos() method. This returns a list of TextRenderInfo objects that give you more information about every character (including the position of those characters). You then need to compose the words from those different characters.
This is explained in mkl's answer to this question: Retrieve the respective coordinates of all words on the page with itextsharp
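The composing step can be sketched independently of any PDF library. The following is a minimal illustration in plain Java, not iText code: Glyph, Word, compose and maxGap are hypothetical names, and it assumes horizontal, left-to-right text where each snippet's x position, width and baseline y are already known. For right-to-left text such as Hebrew, the horizontal ordering would have to be reversed. Snippets are sorted by position and joined into a word as long as the gap to the previous snippet stays below a threshold:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class WordComposer {

    // Hypothetical holder for one text snippet and its baseline geometry.
    static final class Glyph {
        final String text; final float x; final float width; final float y;
        Glyph(String text, float x, float width, float y) {
            this.text = text; this.x = x; this.width = width; this.y = y;
        }
    }

    // A composed word with the coordinates of the start of its baseline.
    static final class Word {
        final String text; final float x; final float y;
        Word(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    static List<Word> compose(List<Glyph> glyphs, float maxGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Top of the page first (larger y), then left to right.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> -g.y)
                              .thenComparingDouble(g -> g.x));
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float startX = 0, lineY = 0, prevEnd = 0;
        for (Glyph g : sorted) {
            boolean blank = g.text.trim().isEmpty();
            boolean sameLine = current.length() > 0 && Math.abs(g.y - lineY) < 0.5f;
            boolean adjacent = sameLine && (g.x - prevEnd) <= maxGap;
            // Close the current word on whitespace, a line change, or a wide gap.
            if (current.length() > 0 && (blank || !adjacent)) {
                words.add(new Word(current.toString(), startX, lineY));
                current.setLength(0);
            }
            if (blank) continue;
            if (current.length() == 0) { startX = g.x; lineY = g.y; }
            current.append(g.text);
            prevEnd = g.x + g.width;
        }
        if (current.length() > 0) words.add(new Word(current.toString(), startX, lineY));
        return words;
    }

    public static void main(String[] args) {
        // Snippets in content-stream order, as in the hello_reverse.pdf example.
        List<Glyph> snippets = Arrays.asList(
            new Glyph("ld", 60, 10, 806), new Glyph("Wor", 40, 20, 806),
            new Glyph("llo", 22, 14, 806), new Glyph("He", 10, 12, 806));
        for (Word w : compose(snippets, 2f)) {
            System.out.println(w.text + " | x=" + w.x + "; y=" + w.y);
        }
    }
}
```

The gap threshold is a tuning choice; in practice it is usually derived from the font size or the width of a space character rather than hard-coded.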
We've done similar projects. One of them is described here: https://www.youtube.com/watch?v=lZnbhnU4m3Y
You'll need to do quite a bit of coding to get it right. One word about PdfClown: your text is probably stored as Unicode in your PDF. To retrieve the correct characters, the parser needs to examine the mapping between the glyphs stored in the font and the corresponding Unicode characters. If PdfClown returns text with missing characters, it isn't handling this mapping correctly. PdfClown is a one-man project, so you'll have to ask that developer to fix this (if he has the time).
As you can tell from the video, iText could help you out, but iText is a company with subsidiaries in the US, Belgium and Singapore. It is a company with many employees and to keep that company running, we need to make money (that's how we pay our employees). Hence you shouldn't expect that we help you for free. Surely you can understand this as you wouldn't want to work for free either, would you?

Objective-C string wrapped in parentheses?

I'm getting this in the console log:
2014-08-13 11:55:11.877 Wevo[14264:1830541] artist name: (
"Vance Joy"
)
How do I unwrap it so it's just the string?
The problem comes because I'm parsing json that looks like this:
output = {
contributor = {
"/music/recording/artist" = [
{
mid = "/m/026hdj4";
name = "Marie-Mai";
}
];
};
};
Notice how the mid is wrapped in an array?
So it gets converted to an object literal somewhere.
I'm setting the value using:
_artistName = [[attributes[@"output"][@"contributor"][@"/music/recording/artist"] valueForKeyPath:@"name"] copy];
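Why are you using valueForKeyPath:? When sent to an array, it returns an array containing the value for each element, which is why the log shows the name wrapped in parentheses. If you instead index into the array with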
_artistName = attributes[@"output"][@"contributor"][@"/music/recording/artist"][0][@"name"];
it should come out correctly.
Edit: For future viewers, one-off lines like this will work. However, for a more maintainable and debuggable app, I would recommend splitting up the lines to extract only one object per line. That way, if something breaks, the debugger will be more helpful.
For apps where you deal with more JSON than just a one-off, I would recommend creating model objects and pulling your JSON into those. There are libraries on GitHub that could also help you there with model objects.

How to implement just some basic keywords highlighting in text editor?

I'm a novice programmer trying to learn plug-in development. I'd like to upgrade the sample XML editor so that some words like "cat", "dog", "hamster", "rabbit" and "bird" are highlighted when they appear in an XML file (it's just for learning purposes). Can anyone give me some implementation tips or suggestions? I am clueless. (But I am carrying out my own research on this as well, I'm not being lazy. You have my word.) Thanks in advance.
You can detect words in the plain text part of the XML by modifying the sample XML editor as follows.
We can use the provided WordRule class to detect the words. The XMLScanner class which scans the plain text needs to be updated to include the word rule:
public XMLScanner(final ColorManager manager)
{
IToken procInstr = new Token(new TextAttribute(manager.getColor(IXMLColorConstants.PROC_INSTR)));
WordRule words = new WordRule(new WordDetector());
words.addWord("cat", procInstr);
words.addWord("dog", procInstr);
// TODO add more words here
IRule [] rules = new IRule [] {
// Add rule for processing instructions
new SingleLineRule("<?", "?>", procInstr),
// Add generic whitespace rule.
new WhitespaceRule(new XMLWhitespaceDetector()),
// Words rules
words
};
setRules(rules);
}
I have used the existing processing instruction token here to reduce the amount of new code, but you should define a new color and use a new token.
The WordRule constructor requires an IWordDetector; a very simple detector is enough here:
class WordDetector implements IWordDetector
{
@Override
public boolean isWordStart(final char c)
{
return Character.isLetter(c);
}
@Override
public boolean isWordPart(final char c)
{
return Character.isLetter(c);
}
}
This detector accepts only letters as word characters.
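To see what a word rule and a detector do together, here is a minimal stand-alone sketch that does not use the JFace classes. KeywordScanSketch, isWordChar and findKeywordOffsets are hypothetical names; the sketch assumes the same letters-only rule as the detector above, collects each maximal run of letters, and reports it only if the whole run is one of the registered words:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordScanSketch {

    // Same test the WordDetector applies to word starts and word parts.
    static boolean isWordChar(char c) {
        return Character.isLetter(c);
    }

    // Returns the start offset of every whole word that is in the keyword set.
    static List<Integer> findKeywordOffsets(String text, Set<String> keywords) {
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isWordChar(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            // Consume the whole word before looking it up.
            while (i < text.length() && isWordChar(text.charAt(i))) {
                i++;
            }
            if (keywords.contains(text.substring(start, i))) {
                offsets.add(start);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("cat", "dog"));
        // prints [3]: "cat" matches, "dogs" does not because the whole word is compared
        System.out.println(findKeywordOffsets("my cat chased two dogs", keywords));
    }
}
```

Because the whole word is compared, "dogs" is not highlighted when only "dog" is registered; each form has to be added as its own word.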

Lucene searching using payload and NLP tags

I have already indexed the documents with each word having payload that contains the Part of speech (POS) tag.
I want to search only those documents for which the search query words have that POS tag.
E.g. 'access google' has google as Noun. It should show only docs with google as noun.
Can writing a custom analyser help?
How can I access the Term when the Payload is being accessed in the Similarity class?
Doing exact (:google AND :'noun') queries in Lucene can be tricky. What is your query, and how are you writing the docs to the index?
I would recommend using span queries. Span queries can return a Spans object, which allows you to inspect the payload of every matching token.
See PayloadTermQuery.
You can use the PayloadAttribute class to store the tags as payloads and then override the scorePayload method of DefaultSimilarity class to make use of the tags. In your case you would want to return 1 if the tag content is noun and zero otherwise.
The following code snippet is useful to set the payload information
String tag = "noun";
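// encode explicitly as UTF-8 so it matches utf8ToString() used when scoring
byte[] payload = tag.getBytes(java.nio.charset.StandardCharsets.UTF_8);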
Payload payloadData = new Payload(payload);
payloadAttr.setPayload(payloadData);
Now use the following lines of code to make use of the tags during retrieval. This has to be done by extending the DefaultSimilarity class.
class PayloadSimilarity extends DefaultSimilarity {
...
...
@Override
public float scorePayload(int doc, int start, int end, BytesRef payload) {
String payloadData = payload.utf8ToString();
return payloadData.equals("noun")? 1 : 0;
}
...
...
}
Finally just set your similarity class to your extended class during retrieval.
searcher.setSimilarity(new PayloadSimilarity());

NSPredicateEditorRowTemplate, specifying of Key Path with spaces?

As per a previous question, I have reluctantly given up on using IB/Xcode4 to edit an NSPredicateEditor and done it purely in code.
In the GUI way of editing the fields, key paths can be specified with spaces, like 'field name', and it makes them work as 'fieldName'-style key paths, while still displaying them in the UI with spaces. How do I do this in code? When I specify them with spaces, they don't work. When I specify them in camelCase, they work but display in camelCase. I'm just adding a bunch of NSExpressions like this:
[NSExpression expressionForKeyPath:@"original filename"]
The proper way to get human readable strings in the predicate editor's row views is to use the localization capabilities of NSRuleEditor and NSPredicateEditor.
If you follow the instructions in this blog post, you'll have everything you need to localize the editor.
As an example, let's say your key path is fileName, you support 2 operators (is and contains), and you want the user to enter a string. You'll end up with a strings file that looks like this:
"%[fileName]@ %[is]@ %@" = "%1$[fileName]@ %2$[is]@ %3$@";
"%[fileName]@ %[contains]@ %@" = "%1$[fileName]@ %2$[contains]@ %3$@";
You can use this file to put in human-readable stuff, and even reorder things:
"%[fileName]@ %[is]@ %@" = "%1$[original filename]@ %2$[is]@ %3$@";
"%[fileName]@ %[contains]@ %@" = "%3$@ %2$[is contained in]@ %1$[original filename]@";
Once you've localized the strings file, you hand that file back to the predicate editor, and it'll pull out the translated values, do its magic, and everything will show up correctly.
If you don't want to localize everything and just want to map the key paths, consider overriding value(forKey:) in your evaluated object like this:
class Match: NSObject {
// @objc exposes the properties to key-value coding
@objc var date: Date?
@objc var fileName: String?
override func value(forKey key: String) -> Any? {
// Alternatively use static dictionary for mapping key paths
super.value(forKey: camelCasedKeyPath(forKey: key))
}
private func camelCasedKeyPath(forKey key: String) -> String {
key.components(separatedBy: .whitespaces)
.enumerated()
.map { $0.offset > 0 ? $0.element.capitalized : $0.element.lowercased() }
.joined()
}
}