Having trouble rereading a Lucene TokenStream - lucene

I am using Lucene 4.6, and am apparently unclear on how to reuse a TokenStream, because I get the exception:
java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at the start of the second pass. I've read the Javadoc, but I'm still missing something. Here is a simple example that throws the above exception:
#Test
public void list() throws Exception {
String text = "here are some words";
TokenStream ts = new StandardTokenizer(Version.LUCENE_46, new StringReader(text));
listTokens(ts);
listTokens(ts);
}
public static void listTokens(TokenStream ts) throws Exception {
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
try {
ts.reset();
while (ts.incrementToken()) {
System.out.println("token text: " + termAtt.toString());
}
ts.end();
}
finally {
ts.close();
}
}
I've tried not calling TokenStream.end() or TokenStream.close() thinking maybe they should only be called at the very end, but I get the same exception.
Can anyone offer a suggestion?

The Exception lists, as a possible issue, calling reset() multiple times, which you are doing. This is explicitly not allowed in the implementation of Tokenizer. Since the the java.io.Reader api does not guarantee support of the reset() operation by all subclasses, the Tokenizer can't assume that the Reader passed in can be reset, after all.
You may simply construct a new TokenStream, or I believe you could call Tokenizer.setReader(Reader) (in which case you certainly must close() it first).

Related

Can ANTLR4 parsers (in Java) be instrumented to be interruptible?

I'd like to run my ANTLR parser with an ExecutorService so I can call Future.cancel() on it after a timeout. AIUI, I need the parser to check Thread.isInterrupted(); is there a mechanism in the parser interface for this kind of instrumentation?
In the cases where this is relevant, the parser seems to be deep in PredictionContext recursion.
There is a ParseCancellationException (https://javadoc.io/doc/org.antlr/antlr4-runtime/latest/index.html).
Per the docs: “This exception is thrown to cancel a parsing operation. This exception does not extend RecognitionException, allowing it to bypass the standard error recovery mechanisms. BailErrorStrategy throws this exception in response to a parse error.”
You can attach a Listener that overrides enterEveryRule() to your parser. You could have a method on that listener to set a flag to throw the ParseCancellationException the next time the parser enters a rule (which happens very frequently).
Here's a short example of what the listener might look like:
public class CancelListener extends YourBaseListener {
public boolean cancel = false;
#Override
public void enterEveryRule(ParserRuleContext ctx) {
if (cancel) {
throw new ParseCancellationException("gotta go");
}
super.enterEveryRule(ctx);
}
}
then you can add that listener to your parser:
parser.addParseListener(cancelListener);
And then:
cancelListener.cancel = true

Spring Cloud Stream deserialization error handling for Batch processing

I have a question about handling deserialization exceptions in Spring Cloud Stream while processing batches (i.e. batch-mode: true).
Per the documentation here, https://docs.spring.io/spring-kafka/docs/2.5.12.RELEASE/reference/html/#error-handling-deserializer, (looking at the implementation of FailedFooProvider), it looks like this function should return a subclass of the original message.
Is the intent here that a list of both Foo's and BadFoo's will end up at the original #StreamListener method, and then it will be up to the code (i.e. me) to sort them out and handle separately? I suspect this is the case, as I've read that the automated DLQ sending isn't desirable for batch error handling, as it would resubmit the whole batch.
And if this is the case, what if there is more than one message type received by the app via different #StreamListener's, say Foo's and Bar's. What type should the value function return in that case? Below is the pseudo code to illustrate the second question?
#StreamListener
public void readFoos(List<Foo> foos) {
List<> badFoos = foos.stream()
.filter(f -> f instanceof BadFoo)
.map(f -> (BadFoo) f)
.collect(Collectors.toList());
// logic
}
#StreamListener
public void readBars(List<Bar> bars) {
// logic
}
// Updated to return Object and let apply() determine subclass
public class FailedFooProvider implements Function<FailedDeserializationInfo, Object> {
#Override
public Object apply(FailedDeserializationInfo info) {
if (info.getTopics().equals("foo-topic") {
return new BadFoo(info);
}
else if (info.getTopics().equals("bar-topic") {
return new BadBar(info);
}
}
}
Yes, the list will contain the function result for failed deserializations; the application needs to handle them.
The function needs to return the same type that would have been returned by a successful deserialization.
You can't use conditions with batch listeners. If the list has a mixture of Foos and Bars, they all go to the same listener.

how to catch minor errors?

I have a little ANTLR v4 grammer and I am implementing a visitor on it.
Lets say it is a simple calculator and every input must be terminated with a ";"
e.g. x=4+5;
If I do not put the ; at the end, then it is working too but I get a output the teminal.
line 1:56 missing ';' at '<EOF>'
Seems it can find the rule and more or less ignores the missing terminal ";".
I would prefer a strict error or an exception instead of this soft information.
The output is generated by the line
ParseTree tree = parser.input ()
Is there a way I can intensify the error-handling and check for that kind of error?
Yes, you can. Like you, I wanted a 100% perfect parse from user-submitted text and so created a strict error handler that prevents recovery from even simple errors.
The first step is in removing the default error listeners and adding your own STRICT error handler:
AntlrInputStream inputStream = new AntlrInputStream(stream);
BailLexer lexer = new BailLexer(inputStream); // TALK ABOUT THIS AT BOTTOM
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
LISBASICParser parser = new LISBASICParser(tokenStream);
parser.RemoveErrorListeners(); // UNHOOK ERROR HANDLER
parser.ErrorHandler = new StrictErrorStrategy(); // REPLACE WITH YOUR OWN
LISBASICParser.CalculationContext context = parser.calculation();
CalculationVisitor visitor = new CalculationVisitor();
visitor.VisitCalculation(context);
Here's my StrictErrorStrategy class. It inherits from the DefaultErrorStrategy class and overrides the two 'recovery' methods that are letting small errors like your semicolon error be recoverable:
public class StrictErrorStrategy : DefaultErrorStrategy
{
public override void Recover(Parser recognizer, RecognitionException e)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, e);
}
public override IToken RecoverInline(Parser recognizer)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, new InputMismatchException(recognizer));
}
public override void Sync(Parser recognizer) { }
}
Overriding these two methods allows you to stop (in this case with an exception that is caught elsewhere) on ANY parser error. And making the Sync method empty prevents the normal 're-sync after error' behavior from happening.
The final step is in catching all LEXER errors. You do this by creating a new class that inherits from your main lexer class; it overrides the Recover() method like so:
public class BailLexer : LISBASICLexer
{
public BailLexer(ICharStream input) : base(input) { }
public override void Recover(LexerNoViableAltException e)
{
string message = string.Format("lex error after token {0} at position {1}", _lasttoken.Text, e.StartIndex);
BasicEnvironment.SyntaxError = message;
BasicEnvironment.ErrorStartIndex = e.StartIndex;
throw new ParseCanceledException(BasicEnvironment.SyntaxError);
}
}
(Edit: In this code, BasicEnvironment is a high-level context object I used in the application to hold settings, errors, results, etc. So if you decide to use this, either do as another reader commented below, or substitute your own context/container.)
With this in place, even small errors during the lexing step will be caught as well. With these two overridden classes in place, the user of my app must supply absolutely perfect syntax to get a successful execution. There you go!
Because my ANTLR is in Java I add the answer here too. But it is the same idea as the accepted answer.
TempParser parser = new TempParser (tokens);
parser.removeErrorListeners ();
parser.addErrorListener (new BaseErrorListener ()
{
#Override
public void syntaxError (final Recognizer <?,?> recognizer, Object sym, int line, int pos, String msg, RecognitionException e)
{
throw new AssertionError ("ANTLR - syntax-error - line: " + line + ", position: " + pos + ", message: " + msg);
}
});

Force antlr3 to immediately exit when a rule fails

I've got a rule like this:
declaration returns [RuntimeObject obj]:
DECLARE label value { $obj = new RuntimeObject($label.text, $value.text); };
Unfortunately, it throws an exception in the RuntimeObject constructor because $label.text is null. Examining the debug output and some other things reveals that the match against "label" actually failed, but the Antlr runtime "helpfully" continues with the match for the purpose of giving a more helpful error message (http://www.antlr.org/blog/antlr3/error.handling.tml).
Okay, I can see how this would be useful for some situations, but how can I tell Antlr to stop doing that? The defaultErrorHandler=false option from v2 seems to be gone.
I don't know much about Antlr, so this may be way off base, but the section entitled "Error Handling" on this migration page looks helpful.
It suggests you can either use #rulecatch { } to disable error handling entirely, or override the mismatch() method of the BaseRecogniser with your own implementation that doesn't attempt to recover. From your problem description, the example on that page seems like it does exactly what you want.
You could also override the reportError(RecognitionException) method, to make it rethrow the exception instead of print it, like so:
#parser::members {
#Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
However, I'm not sure you want this (or the solution by ire_and_curses), because you will only get one error per parse attempt, which you can then fix, just to find the next error. If you try to recover (ANTLR does it okay) you can get multiple errors in one try, and fix all of them.
You need to override the mismatch and recoverFromMismatchedSet methods to ensure an exception is thrown immediately (examples are for Java):
#members {
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow) throws RecognitionException {
throw e;
}
}
then you need to change how the parser deals with those exceptions so they're not swallowed:
#rulecatch {
catch (RecognitionException e) {
throw e;
}
}
(The bodies of all the rule-matching methods in your parser will be enclosed in try blocks, with this as the catch block.)
For comparison, the default implementation of recoverFromMismatchedSet inherited from BaseRecognizer:
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow) throws RecognitionException {
if (mismatchIsMissingToken(input, follow)) {
reportError(e);
return getMissingSymbol(input, e, Token.INVALID_TOKEN_TYPE, follow);
}
throw e;
}
and the default rulecatch:
catch (RecognitionException re) {
reportError(re);
recover(input,re);
}

What is the most efficient way to handle the lifecycle of an object with COM interop?

I have a Windows Workflow application that uses classes I've written for COM automation. I'm opening Word and Excel from my classes using COM.
I'm currently implementing IDisposable in my COM helper and using Marshal.ReleaseComObject(). However, if my Workflow fails, the Dispose() method isn't being called and the Word or Excel handles stay open and my application hangs.
The solution to this problem is pretty straightforward, but rather than just solve it, I'd like to learn something and gain insight into the right way to work with COM. I'm looking for the "best" or most efficient and safest way to handle the lifecycle of the classes that own the COM handles. Patterns, best practices, or sample code would be helpful.
I can not see what failure you have that does not calls the Dispose() method. I made a test with a sequential workflow that contains only a code activity which just throws an exception and the Dispose() method of my workflow is called twice (this is because of the standard WorkflowTerminated event handler). Check the following code:
Program.cs
class Program
{
static void Main(string[] args)
{
using(WorkflowRuntime workflowRuntime = new WorkflowRuntime())
{
AutoResetEvent waitHandle = new AutoResetEvent(false);
workflowRuntime.WorkflowCompleted += delegate(object sender, WorkflowCompletedEventArgs e)
{
waitHandle.Set();
};
workflowRuntime.WorkflowTerminated += delegate(object sender, WorkflowTerminatedEventArgs e)
{
Console.WriteLine(e.Exception.Message);
waitHandle.Set();
};
WorkflowInstance instance = workflowRuntime.CreateWorkflow(typeof(WorkflowConsoleApplication1.Workflow1));
instance.Start();
waitHandle.WaitOne();
}
Console.ReadKey();
}
}
Workflow1.cs
public sealed partial class Workflow1: SequentialWorkflowActivity
{
public Workflow1()
{
InitializeComponent();
this.codeActivity1.ExecuteCode += new System.EventHandler(this.codeActivity1_ExecuteCode);
}
[DebuggerStepThrough()]
private void codeActivity1_ExecuteCode(object sender, EventArgs e)
{
Console.WriteLine("Throw ApplicationException.");
throw new ApplicationException();
}
protected override void Dispose(bool disposing)
{
if (disposing)
{
// Here you must free your resources
// by calling your COM helper Dispose() method
Console.WriteLine("Object disposed.");
}
}
}
Am I missing something? Concerning the lifecycle-related methods of an Activity (and consequently of a Workflow) object, please check this post: Activity "Lifetime" Methods. If you just want a generic article about disposing, check this.
Basically, you should not rely on hand code to call Dispose() on your object at the end of the work. You probably have something like this right now:
MyComHelper helper = new MyComHelper();
helper.DoStuffWithExcel();
helper.Dispose();
...
Instead, you need to use try blocks to catch any exception that might be triggered and call dispose at that point. This is the canonical way:
MyComHelper helper = new MyComHelper();
try
{
helper.DoStuffWithExcel();
}
finally()
{
helper.Dispose();
}
This is so common that C# has a special construct that generates the same exact code [see note] as shown above; this is what you should be doing most of the time (unless you have some special object construction semantics that make a manual pattern like the above easier to work with):
using(MyComHelper helper = new MyComHelper())
{
helper.DoStuffWithExcel();
}
EDIT:
NOTE: The actual code generated is a tiny bit more complicated than the second example above, because it also introduces a new local scope that makes the helper object unavailable after the using block. It's like if the second code block was surrounded by { }'s. That was omitted for clarify of the explanation.