How to implement just some basic keywords highlighting in text editor? - eclipse-plugin

I'm a novice programmer trying to learn plug-in development. I'd like to upgrade the sample XML editor so that some words like "cat", "dog", "hamster", "rabbit" and "bird" would be highlighted when it appears in an XML file (it's just for learning purpose). Can anyone give me some implementation tips or suggestions? I am clueless.. (But I am carrying out my research on this as well, I'm not being lazy. You have my word.) Thanks in advance.

You can detect words in the plain text part of the XML by modifying the sample XML editor as follows.
We can use the provided WordRule class to detect the words. The XMLScanner class which scans the plain text needs to be updated to include the word rule:
public XMLScanner(final ColorManager manager)
{
IToken procInstr = new Token(new TextAttribute(manager.getColor(IXMLColorConstants.PROC_INSTR)));
WordRule words = new WordRule(new WordDetector());
words.addWord("cat", procInstr);
words.addWord("dog", procInstr);
// TODO add more words here
IRule [] rules = new IRule [] {
// Add rule for processing instructions
new SingleLineRule("<?", "?>", procInstr),
// Add generic whitespace rule.
new WhitespaceRule(new XMLWhitespaceDetector()),
// Words rules
words
};
setRules(rules);
}
I have used the existing processing instruction token here to reduce the amount of new code, but you should define a new color and use a new token.
The WordRule constructor requires an IWordDetector class, we can use a very simple detector here:
class WordDetector implements IWordDetector
{
#Override
public boolean isWordStart(final char c)
{
return Character.isLetter(c);
}
#Override
public boolean isWordPart(final char c)
{
return Character.isLetter(c);
}
}
This is just accepting letters in words.

Related

Regenerate source from Antlr4 ParseTree preserving whitespaces

I am able to use Java.g4 grammar and generate lexers and parsers. This grammar is used to get started. I have a ParseTree and I walk it and change whatever I want and write it back to a Java source code file. The ParseTree is not directly changed though.
So for example this is the code. But I also modify method names and add annotations to methods. In some cases I also remove method parameters
#Override
public void enterClassDeclaration(JavaParser.ClassDeclarationContext ctx) {
printer.write( intervals );
printer.writeList(importFilter );
ParseTree pt = ctx.getChild(1);
/*
The class name is changed by a visitor because
we get a ParseTree back
*/
pt.accept(new ParseTreeVisitor<Object>() {
#Override
public Object visitChildren(RuleNode ruleNode) {
return null;
}
#Override
public Object visitErrorNode(ErrorNode errorNode) {
return null;
}
#Override
public Object visit(ParseTree parseTree) {
return null;
}
#Override
public Object visitTerminal(TerminalNode terminalNode) {
String className = terminalNode.getText();
System.out.println("Name of the class is [ " + className + "]");
printer.writeText( classModifier.get() + " class " + NEW_CLASS_IDENTIFIER );
return null;
}
});
}
But I am not sure how to print the changed Java code while preserving all the original whitespaces.
How is that done ?
Update : It seems that the whitespaces and comments are there but not accessible easily. So it looks like I need to specifically keep track of them and write them along with the code. Not sure though.
So more specifically the code is this.
package x;
import java.util.Enumeration;
import java.util.*;
As I hit the first ImportDeclarationContext I need to store all the hidden space tokens. When I write this code back I want to include those spaces too.
Solution :
Don't skip but add to a HIDDEN channel
//
// Whitespace and comments
//
WS : [ \t\r\n\u000C]+ -> channel(HIDDEN)
;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
Use code like this to get them back
I use this to get the whitespaces and comments before each method. But it should be possible to get whitespaces from other places too. I think.
((CommonTokenStream) tokens).getHiddenTokensToLeft( classOrInterfaceModifierContext.getStart().getTokenIndex(),
Token.HIDDEN_CHANNEL);

How can I Extract words with its coordinates from pdf using .net?

I'm working with pdf in hebrew language with diacritical marks. I want to extract all the words with its coordinates. I tried to use ITextSharp and pdfClown and they both didn't give me what I want.
In pdfClown there are missing letters\chars in ITextSharp I don't get the words coordinates.
Is there a way to do it? (I'm looking for a free framework\code)
EDIT:
PDFClown Code:
File file = new File(PDFFilePath);
TextExtractor te = new TextExtractor();
IDictionary<RectangleF?, IList<ITextString>> strs = te.Extract(file.Document.Pages[0].Contents);
List<string> correctText = new List<string>();
foreach (var key in strs.Keys)
{
foreach (var value in strs[key])
{
string reversedText = new string(value.Text.Reverse().ToArray());
string cleanText = RemoveDiacritics(reversedText);
correctText.Add(cleanText);
}
}
You aren't showing how you are trying to extract text using iText(Sharp). I am assuming that you are following the official documentation and that your code looks like this:
public string ExtractText(byte[] src) {
PdfReader reader = new PdfReader(src);
MyTextRenderListener listener = new MyTextRenderListener();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
PdfDictionary pageDic = reader.GetPageN(1);
PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
processor.ProcessContent(
ContentByteUtils.GetContentBytesForPage(reader, 1), resourcesDic);
return listener.Text.ToString();
}
If your code doesn't look like this, this explains already explains the first thing you're doing wrong.
In this method, there is one class that isn't part of iTextSharp: MyTextRenderListener. This is a class you should write and that looks for instance like this:
public class MyTextRenderListener : IRenderListener {
public StringBuilder Text { get; set; }
public MyTextRenderListener() {
Text = new StringBuilder();
}
public void BeginTextBlock() {
Text.Append("<");
}
public void EndTextBlock() {
Text.AppendLine(">");
}
public void RenderImage(ImageRenderInfo renderInfo) {
}
public void RenderText(TextRenderInfo renderInfo) {
Text.Append("<");
Text.Append(renderInfo.GetText());
LineSegment segment = renderInfo.GetBaseline();
Vector start = segment.GetStartPoint();
Text.Append("| x=");
Text.Append(start[Vector.I1]);
Text.Append("; y=");
Text.Append(start[Vector.I2]);
Text.Append(">");
}
}
When you run this code, and you look what's inside Text, you'll notice that a PDF document doesn't store words. Instead, it stores text blocks. In our special IRenderListener, we indicate the start and the end of text blocks using < and >. Inside these text blocks, you'll find text snippets. We'll mark text snippets like this: <text snippet| x=36.0000; y=806.0000> where the x and y value give you the coordinate of the start of the baseline (as opposed to the ascent and descent position). You can also get the end position of the baseline (and the ascent/descent).
Now how do you distill words out of all of this? The problem with the text snippets you get, is that they don't correspond with words. See for instance this file: hello_reverse.pdf
When you open it in Adobe Reader, you read "Hello World Hello People." You'd hope you'd find four words in the content stream, wouldn't you? In reality, this is what you'll find:
<>
<<ld><Wor><llo><He>>
<<Hello People>>
To distill the words, "World" and "Hello" from the first line, you need to do plenty of Math. Instead of getting the base line of the TextRenderInfo object returned in the RenderText() method of your render listener, you have to use the GetCharacterRenderInfos() method. This will return a list of TextRenderInfo objects that gives you more info about every character (including the position of those characters). You then need to compose the words from those different characters.
This is explained in mkl's answer to this question: Retrieve the respective coordinates of all words on the page with itextsharp
We've done similar projects. One of them is described here: https://www.youtube.com/watch?v=lZnbhnU4m3Y
You'll need to do quite some coding to get it right. One word about PdfClown: your text is probably stored as UNICODE in your PDF. To retrieve the correct characters, the parser needs to examine the mapping of the glyphs stored in the font and the corresponding UNICODE character. If PdfClown can't do this, this means that PdfClown doesn't do this task correctly. PdfClown is a one man project, so you'll have to ask that developer to fix this (if he has the time).
As you can tell from the video, iText could help you out, but iText is a company with subsidiaries in the US, Belgium and Singapore. It is a company with many employees and to keep that company running, we need to make money (that's how we pay our employees). Hence you shouldn't expect that we help you for free. Surely you can understand this as you wouldn't want to work for free either, would you?

Code Editor with intellisense for ANTLR4

Looking for sample for building ANTLR4 grammar based Code Editor with intellisense. SharpDevelop provides all Code Editor features, however if we need to provide the intellisense and Code Completion details, then we need to write own parser.
Need sample where ANTLR4, SharpDevelop is used for building the Code Editor for custom language.
Thanks.
I could get the expected Tokens from ANTLR4 using GetExpectedTokensWithinRule API in the Listener and converting them to tokens.
pseudo code looks like this
public class MyGrammarListener : MyGrammarBaseListener
{
public MyGrammarListener(MyGrammarParser parser)
{
this.Parser = parser;
}
public override void EnterXXXXX(XXXXX_Context context)
{
IntervalSet set = Parser.GetExpectedTokensWithinCurrentRule();
base.EnterXXXXX(context);
foreach (int token in set.ToIntegerList())
{
// Returns the expected tokens.
string data = Parser.Vocabulary.GetLiteralName(token);
}
}
}
I have used Jide CodeEditor with antlr4, it seems to be working OK but took some time to get it together. I generate the errors and keywords for highlighting from the parser. I use listeners for parsing etc. and a visitor for executing the language. Not familiar with SharpDevelop

Lucene 4.1 : How split words that contains "dots" when indexing?

I'l trying to figure out what I should do to index my keywords that contains "." .
ex : this.name
I want to index the terms : this and name in my index.
I use the StandardAnalyser. I try to extends the WhitespaceTokensizer or extends TokenFilter, but I'm not sure if I'm in the right direction.
if I use the StandardAnalyser, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyser do the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
import java.io.IOException;
import java.io.Reader;
public final class MyAnalyzer extends StopwordAnalyzerBase {
private int maxTokenLength = 255;
public MyAnalyzer() {
super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
}
#Override
protected TokenStreamComponents createComponents
(final String fieldName, final Reader reader) {
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new StopFilter(matchVersion, tok, stopwords);
return new TokenStreamComponents(src, tok) {
#Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
#Override
protected Reader initReader(String fieldName, Reader reader) {
NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add(".", " ");
builder.add("_", " ");
NormalizeCharMap normMap = builder.build();
return new MappingCharFilter(normMap, reader);
}
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
public class TestMyAnalyzer extends BaseTokenStreamTestCase {
private Analyzer analyzer = new MyAnalyzer();
public void testPeriods() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"this.name; here.i.am; sentences ... end with periods.",
new String[] { "name", "here", "i", "am", "sentences", "end", "periods" } );
}
public void testUnderscores() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"some_underscore_term _and____ stuff that is_not in it",
new String[] { "some", "underscore", "term", "stuff" } );
}
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
you are getting caught by behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces some more complex to parsing rules than you may not be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, idenifiers, etc. A simpler implementation might suit your needs, like LetterTokenizer.
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word, do I have to parse the document char by char ?
Sebastien Dionne: I still want to know how to split a token into multiple part, and index them all
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer : Takes in the input text passed by you probably as a java.io.Reader. It
JUST breakdowns the text. Doesn't alter, just breaks it down.
TokenFilter : Takes in the token emitted by Tokenizer, adds / removes / alters tokens and emits the same one by one until all are finished.
If it replaces a token with multiple tokens based on requirements, buffers all, emits them one by one to the Indexer.
You may check following resource, unfortunately, you may have to sign-up for a trial membership.
By writing a custom analyzer, you can breakdown the text the way you want to. You may even use some existing components like LowercaseFilter. Fortunately, it is achievable with Lucene to come up with some Analyzer that serves your purpose if you couldn't find that as a built-in or on the web.
" Writing Custom Filters: Lucene in Action 2"

What analyzer should I use for a URL in lucene.net?

I'm having problems getting a simple URL to tokenize properly so that you can search it as expected.
I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):
(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)
In general it looks good, http itself, then the hostname but the issue seems to come with the forward slashes. Surely it should consider them as seperate words?
What do I need to do to correct this?
Thanks
P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.
The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize urls (although it recognised emails and treats them as one token). What you are seeing is it's default behaviour - splitting on various punctuation characters. The simplest solution might be to use a write a custom Analyzer and supply a UrlTokenizer, that extends/modifies the code in StandardTokenizer, to tokenize URLs. Something like:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer() {
super();
}
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new MyUrlTokenizer(reader);
result = new LowerCaseFilter(result);
result = new StopFilter(result);
result = new SynonymFilter(result);
return result;
}
}
Where the URLTokenizer splits on /, - _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
Note that if you have a distinct fieldName for urls then you can modify the above code the use the StandardTokenizer by default, else use the UrlTokenizer.
e.g.
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = null;
if (fieldName.equals("url")) {
result = new MyUrlTokenizer(reader);
} else {
result = new StandardTokenizer(reader);
}
You should parse the URL yourself (I imagine there's at least one .Net class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.