Lucene stop phrases filter

I'm trying to write a filter for Lucene, similar to StopWordsFilter (thus implementing TokenFilter), but I need to remove phrases (sequence of tokens) instead of words.
The "stop phrases" are represented themselves as a sequence of tokens: punctuation is not considered.
I think I need to do some kind of buffering of the tokens in the token stream, and when a full phrase is matched, I discard all tokens in the buffer.
What would be the best approach to implement a "stop phrases" filter given a stream of words like Lucene's TokenStream?

In this thread I was given a solution: use Lucene's CachingTokenFilter as a starting point.
That solution was indeed the right way to go.
EDIT: I fixed the dead link; here is a transcript of the thread.
MY QUESTION:
I'm trying to implement a "stop phrases filter" with the new TokenStream
API.
I would like to be able to peek into N tokens ahead, see if the current
token + N subsequent tokens match a "stop phrase" (the set of stop phrases
are saved in a HashSet), then discard all these tokens when they match a
stop phrase, or keep them all if they don't match.
For this purpose I would like to use captureState() and then restoreState()
to get back to the starting point of the stream.
I tried many combinations of these APIs. My last attempt is in the code
below, which doesn't work.
static private HashSet<String> m_stop_phrases = new HashSet<String>();
static private int m_max_stop_phrase_length = 0;
...
public final boolean incrementToken() throws IOException {
    if (!input.incrementToken())
        return false;
    Stack<State> stateStack = new Stack<State>();
    StringBuilder match_string_builder = new StringBuilder();
    int skippedPositions = 0;
    boolean is_next_token = true;
    while (is_next_token && match_string_builder.length() < m_max_stop_phrase_length) {
        if (match_string_builder.length() > 0)
            match_string_builder.append(" ");
        match_string_builder.append(termAtt.term());
        skippedPositions += posIncrAtt.getPositionIncrement();
        stateStack.push(captureState());
        is_next_token = input.incrementToken();
        if (m_stop_phrases.contains(match_string_builder.toString())) {
            // Stop phrase is found: skip the number of tokens
            // without restoring the state
            posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
            return is_next_token;
        }
    }
    // No stop phrase found: restore the stream
    while (!stateStack.empty())
        restoreState(stateStack.pop());
    return true;
}
What is the correct direction I should look into to implement my "stop phrases" filter?
CORRECT ANSWER:
restoreState only restores the token contents, not the complete stream. So
you cannot roll back the token stream (and this was also not possible with
the old API). The while loop at the end of your code is not working as you
expect because of this. You may use CachingTokenFilter, which can be reset
and consumed again, as a source for further work.
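For illustration, here is a minimal sketch of that buffering idea (not a drop-in implementation): instead of trying to rewind the underlying stream, the filter reads ahead, keeps the captured states in its own queue, and replays them when no stop phrase matched. It only checks phrases that start at the current token, ignores position-increment bookkeeping, and uses the newer CharTermAttribute API; the class name and constructor are made up for this example (imports from java.util and org.apache.lucene.analysis omitted).
public final class StopPhraseFilter extends TokenFilter {

    private final Set<String> stopPhrases;   // stop phrases as space-joined terms
    private final int maxPhraseLength;       // max number of tokens in any stop phrase
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Deque<State> pending = new ArrayDeque<State>();

    public StopPhraseFilter(TokenStream input, Set<String> stopPhrases, int maxPhraseLength) {
        super(input);
        this.stopPhrases = stopPhrases;
        this.maxPhraseLength = maxPhraseLength;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Replay tokens that were read ahead but turned out not to be a stop phrase.
        if (!pending.isEmpty()) {
            restoreState(pending.removeFirst());
            return true;
        }
        while (true) {
            // Read ahead up to maxPhraseLength tokens, remembering their states.
            List<State> buffer = new ArrayList<State>();
            StringBuilder phrase = new StringBuilder();
            boolean matched = false;
            for (int i = 0; i < maxPhraseLength; i++) {
                if (!input.incrementToken())
                    break;
                buffer.add(captureState());
                if (phrase.length() > 0)
                    phrase.append(' ');
                phrase.append(termAtt);
                if (stopPhrases.contains(phrase.toString())) {
                    matched = true;
                    break;
                }
            }
            if (buffer.isEmpty())
                return false;          // underlying stream is exhausted
            if (matched)
                continue;              // drop the whole buffer and start over
            pending.addAll(buffer);    // no match: emit the buffered tokens one by one
            restoreState(pending.removeFirst());
            return true;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}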

You'll really have to write your own Analyzer, I should think, since whether or not some sequence of words is a "phrase" is dependent on cues, such as punctuation, that are not available after tokenization.

Related

Unable to save data from datagridview [duplicate]

I have some code and when it executes, it throws an IndexOutOfRangeException, saying,
Index was outside the bounds of the array.
What does this mean, and what can I do about it?
Depending on the classes used, it can also be an ArgumentOutOfRangeException:
An exception of type 'System.ArgumentOutOfRangeException' occurred in mscorlib.dll but was not handled in user code. Additional information: Index was out of range. Must be non-negative and less than the size of the collection.
What Is It?
This exception means that you're trying to access a collection item by index, using an invalid index. An index is invalid when it's lower than the collection's lower bound or greater than or equal to the number of elements it contains.
When It Is Thrown
Given an array declared as:
byte[] array = new byte[4];
You can access this array with indices 0 to 3; values outside this range will cause an IndexOutOfRangeException to be thrown. Remember this when you create and access an array.
Array Length
In C#, arrays are usually 0-based. This means the first element has index 0 and the last element has index Length - 1 (where Length is the total number of items in the array), so this code doesn't work:
array[array.Length] = 0;
Moreover, please note that if you have a multidimensional array, you can't use Array.Length for both dimensions; you have to use Array.GetLength():
int[,] data = new int[10, 5];
for (int i=0; i < data.GetLength(0); ++i) {
    for (int j=0; j < data.GetLength(1); ++j) {
        data[i, j] = 1;
    }
}
Upper Bound Is Not Inclusive
In the following example we create a raw two-dimensional array of Color. Each item represents a pixel; indices go from (0, 0) to (imageWidth - 1, imageHeight - 1).
Color[,] pixels = new Color[imageWidth, imageHeight];
for (int x = 0; x <= imageWidth; ++x) {
    for (int y = 0; y <= imageHeight; ++y) {
        pixels[x, y] = backgroundColor;
    }
}
This code will then fail because array is 0-based and last (bottom-right) pixel in the image is pixels[imageWidth - 1, imageHeight - 1]:
pixels[imageWidth, imageHeight] = Color.Black;
In another scenario you may get an ArgumentOutOfRangeException for this kind of code (for example if you're using the GetPixel method of the Bitmap class).
Arrays Do Not Grow
An array is fast. Very fast in linear search compared to every other collection. That's because items are contiguous in memory, so the memory address can be calculated (and incrementing it is just an addition). No need to follow a node list: simple math! You pay for this with a limitation: arrays can't grow; if you need more elements you need to reallocate the array (this may take a relatively long time if old items must be copied to a new block). You resize them with Array.Resize<T>(); this example adds a new entry to an existing array:
Array.Resize(ref array, array.Length + 1);
Don't forget that valid indices go from 0 to Length - 1. If you simply try to assign an item at Length you'll get an IndexOutOfRangeException (this behavior may confuse you if you think arrays can grow with a syntax similar to the Insert method of other collections).
Special Arrays With Custom Lower Bound
The first item in an array does not always have index 0, because you can create an array with a custom lower bound:
var array = Array.CreateInstance(typeof(byte), new int[] { 4 }, new int[] { 1 });
In that example, array indices are valid from 1 to 4. Of course, upper bound cannot be changed.
Wrong Arguments
If you access an array using unvalidated arguments (from user input or from a caller of your function) you may get this error:
private static string[] RomanNumbers =
    new string[] { "I", "II", "III", "IV", "V" };

public static string Romanize(int number)
{
    return RomanNumbers[number];
}
Unexpected Results
This exception may be thrown for another reason too: by convention, many search functions return -1 if they didn't find anything (nullables were introduced with .NET 2.0 and, anyway, returning -1 has been a well-known convention for many years). Let's imagine you have an array of objects comparable with a string. You might think to write this code:
// Items comparable with a string
Console.WriteLine("First item equals to 'Debug' is '{0}'.",
    myArray[Array.IndexOf(myArray, "Debug")]);

// Arbitrary objects
Console.WriteLine("First item equals to 'Debug' is '{0}'.",
    myArray[Array.FindIndex(myArray, x => x.Type == "Debug")]);
This will fail if no item in myArray satisfies the search condition, because Array.IndexOf() will return -1 and the subsequent array access will throw.
The next example is a naive attempt to calculate the occurrences of a given set of numbers (knowing the maximum number and returning an array where the item at index 0 represents number 0, the item at index 1 represents number 1, and so on):
static int[] CountOccurences(int maximum, IEnumerable<int> numbers) {
    int[] result = new int[maximum + 1]; // Includes 0

    foreach (int number in numbers)
        ++result[number];

    return result;
}
Of course, it's a pretty terrible implementation, but what I want to show is that it will fail for negative numbers and numbers above the maximum.
How does this apply to List<T>?
The same cases as arrays apply: the range of valid indexes goes from 0 (a List's indexes always start at 0) to list.Count - 1; accessing elements outside of this range will cause the exception.
Note that List<T> throws ArgumentOutOfRangeException for the same cases where arrays use IndexOutOfRangeException.
Unlike arrays, List<T> starts empty, so trying to access items of a just-created list leads to this exception.
var list = new List<int>();
A common case is trying to populate a list with indexing (similar to a Dictionary<int, T>), which will cause the exception:
list[0] = 42; // exception
list.Add(42); // correct
IDataReader and Columns
Imagine you're trying to read data from a database with this code:
using (var connection = CreateConnection()) {
    using (var command = connection.CreateCommand()) {
        command.CommandText = "SELECT MyColumn1, MyColumn2 FROM MyTable";

        using (var reader = command.ExecuteReader()) {
            while (reader.Read()) {
                ProcessData(reader.GetString(2)); // Throws!
            }
        }
    }
}
GetString() will throw an IndexOutOfRangeException because your dataset has only two columns, but you're trying to get a value from the third one (indices are always 0-based).
Please note that this behavior is shared with most IDataReader implementations (SqlDataReader, OleDbDataReader and so on).
You can also get the same exception if you use the IDataReader indexer overload that takes a column name and pass an invalid column name.
Suppose, for example, that you have retrieved a column named Column1, but then you try to retrieve the value of that field with
var data = dr["Colum1"]; // Missing the n in Column1.
This happens because the indexer is implemented by trying to retrieve the index of a Colum1 field that doesn't exist. The GetOrdinal method will throw this exception when its internal helper code returns -1 as the index of "Colum1".
Others
There is another (documented) case when this exception is thrown: if, in a DataView, the data column name supplied to the DataView sort property is not valid.
How to Avoid
In this example, let me assume, for simplicity, that arrays are always monodimensional and 0-based. If you want to be strict (or you're developing a library), you may need to replace 0 with GetLowerBound(0) and .Length with GetUpperBound(0) (this matters only if you have parameters of type System.Array; it doesn't apply to T[]). Please note that in this case the upper bound is inclusive, so this code:
for (int i=0; i < array.Length; ++i) { }
should be rewritten like this:
for (int i=array.GetLowerBound(0); i <= array.GetUpperBound(0); ++i) { }
Please note that the following is not allowed (it'll throw an InvalidCastException); that's why, if your parameters are of type T[], you're safe from custom lower bound arrays:
void foo<T>(T[] array) { }

void test() {
    // This will throw InvalidCastException, cannot convert Int32[] to Int32[*]
    foo((int[])Array.CreateInstance(typeof(int), new int[] { 1 }, new int[] { 1 }));
}
Validate Parameters
If an index comes from a parameter you should always validate it (throwing an appropriate ArgumentException or ArgumentOutOfRangeException). In the next example, wrong parameters may cause an IndexOutOfRangeException; users of this function may expect this because they're passing an array, but it's not always so obvious. I'd suggest always validating parameters of public functions:
static void SetRange<T>(T[] array, int from, int length, Func<int, T> function)
{
    if (from < 0 || from >= array.Length)
        throw new ArgumentOutOfRangeException("from");

    if (length < 0)
        throw new ArgumentOutOfRangeException("length");

    if (from + length > array.Length)
        throw new ArgumentException("...");

    for (int i=from; i < from + length; ++i)
        array[i] = function(i);
}
If the function is private you may simply replace the if logic with Debug.Assert():
Debug.Assert(from >= 0 && from < array.Length);
Check Object State
An array index may not come directly from a parameter. It may be part of object state. In general it is always good practice to validate object state (by itself and together with function parameters, if needed). You can use Debug.Assert(), throw a proper exception (more descriptive of the problem), or handle it like in this example:
class Table {
    public int SelectedIndex { get; set; }
    public Row[] Rows { get; set; }

    public Row SelectedRow {
        get {
            if (Rows == null)
                throw new InvalidOperationException("...");

            // No or wrong selection, here we just return null for
            // this case (it may be the reason we use this property
            // instead of direct access)
            if (SelectedIndex < 0 || SelectedIndex >= Rows.Length)
                return null;

            return Rows[SelectedIndex];
        }
    }
}
Validate Return Values
In one of the previous examples we directly used the return value of Array.IndexOf(). If we know it may fail then it's better to handle that case:
int index = Array.IndexOf(myArray, "Debug");
if (index != -1) { } else { }
How to Debug
In my opinion, most of the questions here on SO about this error could simply be avoided. The time you spend writing a proper question (with a small working example and a small explanation) can easily be much more than the time you'd need to debug your code. First of all, read Eric Lippert's blog post about debugging small programs; I won't repeat his words here, but it's absolutely a must read.
You have the source code, and you have the exception message with a stack trace. Go there, pick the right line number, and you'll see:
array[index] = newValue;
You found your error: check how index increases. Is it right? Check how the array is allocated: is it coherent with how index increases? Is it right according to your specifications? If you answer yes to all these questions, then you'll find good help here on StackOverflow, but please first check for it by yourself. You'll save your own time!
A good starting point is to always use assertions and to validate inputs. You may even want to use code contracts. When something goes wrong and you can't figure out what's happening with a quick look at your code, you have to resort to an old friend: the debugger. Just run your application in debug mode inside Visual Studio (or your favorite IDE); you'll see exactly which line throws this exception, which array is involved, and which index you're trying to use. Really, 99% of the time you'll solve it by yourself in a few minutes.
If this happens in production then you'd better add assertions to the incriminated code; we probably won't see in your code what you can't see by yourself (but you can always bet).
The VB.NET side of the story
Everything that we have said in the C# answer is valid for VB.NET with the obvious syntax differences but there is an important point to consider when you deal with VB.NET arrays.
In VB.NET, arrays are declared setting the maximum valid index value for the array. It is not the count of the elements that we want to store in the array.
' Declares an array with space for 5 integers;
' 4 is the maximum valid index (indexes run from 0 to 4)
Dim myArray(4) as Integer
So this loop will fill the array with 5 integers without causing any IndexOutOfRangeException
For i As Integer = 0 To 4
myArray(i) = i
Next
The VB.NET rule
This exception means that you're trying to access a collection item by index, using an invalid index. An index is invalid when it's lower than the collection's lower bound or greater than the maximum allowed index defined in the array declaration.
A simple explanation of what an index out of bounds exception is:
Think of a train whose compartments are D1, D2, and D3.
A passenger arrives to board the train, and he has a ticket for D4.
What will happen? The passenger wants to enter a compartment that does not exist, so obviously a problem will arise.
The same scenario applies whenever we try to access an array, a list, etc.: we can only access indexes that exist. array[0] and array[1] exist; if we try to access array[3], it's not actually there, so an index out of bounds exception will arise.
To easily understand the problem, imagine we wrote this code:
static void Main(string[] args)
{
    string[] test = new string[3];
    test[0] = "hello1";
    test[1] = "hello2";
    test[2] = "hello3";

    for (int i = 0; i <= 3; i++)
    {
        Console.WriteLine(test[i].ToString());
    }
}
Result will be:
hello1
hello2
hello3
Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
The size of the array is 3 (indices 0, 1 and 2), but the for-loop runs 4 times (0, 1, 2 and 3). So when it tries to access index 3, which is outside the bounds, it throws the exception.
Aside from the very long and complete accepted answer, there is an important point to make about IndexOutOfRangeException compared with many other exception types, and that is:
Often there is complex program state that may be difficult to control at a particular point in code, e.g. a DB connection goes down so data for an input cannot be retrieved, etc. This kind of issue often results in an exception of some kind that has to bubble up to a higher level, because the code where it occurs has no way of dealing with it at that point.
IndexOutOfRangeException is generally different in that in most cases it is pretty trivial to check for at the point where the exception is being raised. Generally this kind of exception gets thrown by code that could very easily deal with the issue at the place it is occurring - just by checking the actual length of the array. You don't want to 'fix' this by handling the exception higher up, but instead by ensuring it's not thrown in the first instance - which in most cases is easy to do by checking the array length.
Another way of putting this is that other exceptions can arise due to a genuine lack of control over input or program state, BUT IndexOutOfRangeException, more often than not, is simply pilot (programmer) error.
These two exceptions are common in various programming languages, and as others said, they occur when you access an element with an index greater than or equal to the size of the array. For example:
var array = [1,2,3];
/* var lastElement = array[3] this will throw an exception, because indices
start from zero, length of the array is 3, but its last index is 2. */
The main reason behind this is that compilers usually don't check for this, so such errors only show up at runtime.
Similar to this:
Why don't modern compilers catch attempts to make out-of-bounds access to arrays?

Fastest way to read huge text file (6 GB) line by line [duplicate]

I have a large txt file with 100,000 lines.
I need to start n threads and give every thread a unique line from this file.
What is the best way to do this? I think I need to read the file line by line, and the iterator must be global so I can lock it. Loading the text file into a list would be time-consuming, and I could get an OutOfMemoryException. Any ideas?
You can use the File.ReadLines Method to read the file line-by-line without loading the whole file into memory at once, and the Parallel.ForEach Method to process the lines in multiple threads in parallel:
Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
// your code here
});
After performing my own benchmarks for loading 61,277,203 lines into memory and shoving values into a Dictionary / ConcurrentDictionary(), the results seem to support @dtb's answer above that using the following approach is the fastest:
Parallel.ForEach(File.ReadLines(catalogPath), line =>
{
});
My tests also showed the following:
File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Looking at my CPU activity, it appears they both seem to use two out of my 8 cores?
Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
I also tried a producer / consumer or MapReduce style pattern where one thread was used to read the data and a second thread was used to process it. This also did not seem to outperform the simple pattern above.
I have included an example of this pattern for reference, since it is not included on this page:
var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();

var readLines = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines(catalogPath))
        inputLines.Add(line);

    inputLines.CompleteAdding();
});

var processLines = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
    {
        string[] lineFields = line.Split('\t');
        int genomicId = int.Parse(lineFields[3]);
        int taxId = int.Parse(lineFields[0]);
        catalog.TryAdd(genomicId, taxId);
    });
});

Task.WaitAll(readLines, processLines);
Here are my benchmarks:
I suspect that under certain processing conditions, the producer / consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.
Read the file on one thread, adding its lines to a blocking queue. Start N tasks reading from that queue. Set max size of the queue to prevent out of memory errors.
Something like:
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public class ParallelReadExample
{
    public static IEnumerable<string> LineGenerator(StreamReader sr)
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            yield return line;
        }
    }

    static void Main()
    {
        StreamReader sr = new StreamReader("yourfile.txt");

        Parallel.ForEach(LineGenerator(sr), currentLine =>
        {
            // Do your thing with currentLine here...
        } // close lambda expression
        );

        sr.Close();
    }
}
Think it would work. (No C# compiler/IDE here)
If you want to limit the number of threads to n, the easiest way is to use AsParallel() along with WithDegreeOfParallelism(n) to limit the thread count:
string filename = "C:\\TEST\\TEST.DATA";
int n = 5;
foreach (var line in File.ReadLines(filename).AsParallel().WithDegreeOfParallelism(n))
{
// Process line.
}
As @dtb mentioned above, the fastest way to read a file and then process its individual lines is to:
1) do a File.ReadAllLines() into an array
2) Use a Parallel.For loop to iterate over the array.
You can read more performance benchmarks here.
The basic gist of the code you would have to write is:
string[] AllLines = File.ReadAllLines(fileName);

Parallel.For(0, AllLines.Length, x =>
{
    DoStuff(AllLines[x]);
    //whatever you need to do
});
With the introduction of bigger array sizes in .NET 4, as long as you have plenty of memory, this shouldn't be an issue.

Handling bad messages using Kafka's Streams API

I have a basic stream processing flow which looks like
master topic -> my processing in a mapper/filter -> output topics
and I am wondering about the best way to handle "bad messages". This could potentially be things like messages that I can't deserialize properly, or perhaps the processing/filtering logic fails in some unexpected way (I have no external dependencies so there should be no transient errors of that sort).
I was considering wrapping all my processing/filtering code in a try catch and if an exception was raised then routing to an "error topic". Then I can study the message and modify it or fix my code as appropriate and then replay it on to master. If I let any exceptions propagate, the stream seems to get jammed and no more messages are picked up.
Is this approach considered best practice?
Is there a convenient Kafka streams way to handle this? I don't think there is a concept of a DLQ...
What are the alternative ways to stop Kafka jamming on a "bad message"?
What alternative error handling approaches are there?
For completeness here is my code (pseudo-ish):
class Document {
    // Fields
}

class AnalysedDocument {
    Document document;
    String rawValue;
    Exception exception;
    Analysis analysis;

    // All being well
    AnalysedDocument(Document document, Analysis analysis) {...}

    // Analysis failed
    AnalysedDocument(Document document, Exception exception) {...}

    // Deserialisation failed
    AnalysedDocument(String rawValue, Exception exception) {...}
}
KStreamBuilder builder = new KStreamBuilder();

KStream<String, AnalysedDocument> analysedDocumentStream = builder
    .stream(Serdes.String(), Serdes.String(), "master")
    .mapValues(new ValueMapper<String, AnalysedDocument>() {
        @Override
        public AnalysedDocument apply(String rawValue) {
            Document document;
            try {
                // Deserialise
                document = ...
            } catch (Exception e) {
                return new AnalysedDocument(rawValue, e);
            }
            try {
                // Perform analysis
                Analysis analysis = ...
                return new AnalysedDocument(document, analysis);
            } catch (Exception e) {
                return new AnalysedDocument(document, e);
            }
        }
    });
// Branch based on whether analysis mapping failed to produce errorStream and successStream
errorStream.to(Serdes.String(), customPojoSerde(), "error");
successStream.to(Serdes.String(), customPojoSerde(), "analysed");
KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
Any help greatly appreciated.
Right now, Kafka Streams offers only limited error handling capabilities. There is work in progress to simplify this. For now, your overall approach seems to be a good way to go.
One comment about handling de/serialization errors: handling those errors manually requires you to do de/serialization "manually". This means you need to configure ByteArraySerdes for the key and value of the input/output topic of your Streams app and add a map() that does the de/serialization (i.e., KStream<byte[],byte[]> -> map() -> KStream<keyType,valueType> -- or the other way round if you also want to catch serialization exceptions). Otherwise, you cannot try-catch deserialization exceptions.
With your current approach, you "only" validate that the given string represents a valid document -- but it could be the case that the message itself is corrupted and cannot be converted into a String in the source operator in the first place. Thus, you don't actually cover deserialization exceptions with your code. However, if you are sure a deserialization exception can never happen, your approach would be sufficient, too.
Update
This issue is tackled via KIP-161 and will be included in the next release, 1.0.0. It allows you to register a callback via the parameter default.deserialization.exception.handler. The handler will be invoked every time an exception occurs during deserialization and allows you to return a DeserializationHandlerResponse (CONTINUE -> drop the record and move on, or FAIL, which is the default).
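For illustration, a minimal configuration sketch (the application id and broker address are placeholders; LogAndContinueExceptionHandler is the built-in handler that logs the error and skips the bad record):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");      // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
// Log and skip records that cannot be deserialized instead of failing the application
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
          LogAndContinueExceptionHandler.class);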
Update 2
With KIP-210 (which will be part of Kafka 1.1) it's also possible to handle errors on the producer side, similar to the consumer part, by registering a ProductionExceptionHandler via config default.production.exception.handler that can return CONTINUE.
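As a sketch, a custom handler could skip records that are too large and fail on everything else (the handler class name here is made up for the example):
public class IgnoreRecordTooLargeHandler implements ProductionExceptionHandler {
    @Override
    public void configure(final Map<String, ?> configs) {}

    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        // Skip over-sized records, fail on everything else
        return exception instanceof RecordTooLargeException
                ? ProductionExceptionHandlerResponse.CONTINUE
                : ProductionExceptionHandlerResponse.FAIL;
    }
}

// Register it via the Streams configuration:
props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
          IgnoreRecordTooLargeHandler.class);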
Update Mar 23, 2018: Kafka 1.0 provides much better and easier handling of bad messages ("poison pills") via KIP-161 than what I described below. See default.deserialization.exception.handler in the Kafka 1.0 docs.
This could potentially be things like messages that I can't deserialize properly [...]
Ok, my answer here focuses on the (de)serialization issues as this might be the most tricky scenario to handle for most users.
[...] or perhaps the processing/filtering logic fails in some unexpected way (I have no external dependencies so there should be no transient errors of that sort).
The same thinking (for deserialization) can also be applied to failures in the processing logic. Here, most people tend to gravitate towards option 2 below (minus the deserialization part), but YMMV.
I was considering wrapping all my processing/filtering code in a try catch and if an exception was raised then routing to an "error topic". Then I can study the message and modify it or fix my code as appropriate and then replay it on to master. If I let any exceptions propagate, the stream seems to get jammed and no more messages are picked up.
Is this approach considered best practice?
Yes, at the moment this is the way to go. Essentially, the two most common patterns are (1) skipping corrupted messages or (2) sending corrupted records to a quarantine topic aka a dead letter queue.
Is there a convenient Kafka streams way to handle this? I don't think there is a concept of a DLQ...
Yes, there is a way to handle this, including the use of a dead letter queue. However, it's (at least IMHO) not that convenient yet. If you have any feedback on how the API should allow you to handle this -- e.g. via a new or updated method, a configuration setting ("if serialization/deserialization fails send the problematic record to THIS quarantine topic") -- please let us know. :-)
What are the alternative ways to stop Kafka jamming on a "bad message"?
What alternative error handling approaches are there?
See my examples below.
FWIW, the Kafka community is also discussing the addition of a new CLI tool that allows you to skip over corrupted messages. However, as a user of the Kafka Streams API, I think ideally you want to handle such scenarios directly in your code, and fallback to CLI utilities only as a last resort.
Here are some patterns for the Kafka Streams DSL to handle corrupted records/messages aka "poison pills". This is taken from http://docs.confluent.io/current/streams/faq.html#handling-corrupted-records-and-deserialization-errors-poison-pill-messages
Option 1: Skip corrupted records with flatMap
This is arguably what most users would like to do.
We use flatMap because it allows you to output zero, one, or more output records per input record. In the case of a corrupted record we output nothing (zero records), thereby ignoring/skipping the corrupted record.
Benefit of this approach compared to the other ones listed here: we need to manually deserialize a record only once!
Drawback of this approach: flatMap "marks" the input stream for potential data re-partitioning, i.e. if you perform a key-based operation such as groupings (groupBy/groupByKey) or joins afterwards, your data will be re-partitioned behind the scenes. Since this might be a costly step, we don't want that to happen unnecessarily. If you KNOW that the record keys are always valid OR that you don't need to operate on the keys (thus keeping them as "raw" keys in byte[] format), you can change from flatMap to flatMapValues, which will not result in data re-partitioning even if you join/group/aggregate the stream later.
Code example:
Serde<byte[]> bytesSerde = Serdes.ByteArray();
Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();

// Input topic, which might contain corrupted messages
KStream<byte[], byte[]> input = builder.stream(bytesSerde, bytesSerde, inputTopic);

// Note how the returned stream is of type KStream<String, Long>,
// rather than KStream<byte[], byte[]>.
KStream<String, Long> doubled = input.flatMap(
    (k, v) -> {
        try {
            // Attempt deserialization
            String key = stringSerde.deserializer().deserialize(inputTopic, k);
            long value = longSerde.deserializer().deserialize(inputTopic, v);

            // Ok, the record is valid (not corrupted). Let's take the
            // opportunity to also process the record in some way so that
            // we haven't paid the deserialization cost just for "poison pill"
            // checking.
            return Collections.singletonList(KeyValue.pair(key, 2 * value));
        }
        catch (SerializationException e) {
            // log + ignore/skip the corrupted message
            System.err.println("Could not deserialize record: " + e.getMessage());
        }
        return Collections.emptyList();
    }
);
Option 2: dead letter queue with branch
Compared to option 1 (which ignores corrupted records) option 2 retains corrupted messages by filtering them out of the "main" input stream and writing them to a quarantine topic (think: dead letter queue). The drawback is that, for valid records, we must pay the manual deserialization cost twice.
KStream<byte[], byte[]> input = ...;

KStream<byte[], byte[]>[] partitioned = input.branch(
    (k, v) -> {
        boolean isValidRecord = false;
        try {
            stringSerde.deserializer().deserialize(inputTopic, k);
            longSerde.deserializer().deserialize(inputTopic, v);
            isValidRecord = true;
        }
        catch (SerializationException ignored) {}
        return isValidRecord;
    },
    (k, v) -> true
);

// partitioned[0] is the KStream<byte[], byte[]> that contains
// only valid records. partitioned[1] contains only corrupted
// records and thus acts as a "dead letter queue".
KStream<String, Long> doubled = partitioned[0].map(
    (key, value) -> KeyValue.pair(
        // Must deserialize a second time unfortunately.
        stringSerde.deserializer().deserialize(inputTopic, key),
        2 * longSerde.deserializer().deserialize(inputTopic, value)));

// Don't forget to actually write the dead letter queue back to Kafka!
partitioned[1].to(Serdes.ByteArray(), Serdes.ByteArray(), "quarantine-topic");
Option 3: Skip corrupted records with filter
I only mention this for completeness. This option looks like a mix of options 1 and 2, but is worse than either of them. Compared to option 1, you must pay the manual deserialization cost for valid records twice (bad!). Compared to option 2, you lose the ability to retain corrupted records in a dead letter queue.
KStream<byte[], byte[]> validRecordsOnly = input.filter(
    (k, v) -> {
        boolean isValidRecord = false;
        try {
            stringSerde.deserializer().deserialize(inputTopic, k);
            longSerde.deserializer().deserialize(inputTopic, v);
            isValidRecord = true;
        }
        catch (SerializationException e) {
            // log + ignore/skip the corrupted message
            System.err.println("Could not deserialize record: " + e.getMessage());
        }
        return isValidRecord;
    }
);

KStream<String, Long> doubled = validRecordsOnly.map(
    (key, value) -> KeyValue.pair(
        // Must deserialize a second time unfortunately.
        stringSerde.deserializer().deserialize(inputTopic, key),
        2 * longSerde.deserializer().deserialize(inputTopic, value)));
Any help greatly appreciated.
I hope I could help. If yes, I'd appreciate your feedback on how we could improve the Kafka Streams API to handle failures/exceptions in a better/more convenient way than today. :-)
For the processing logic you could take this approach:
someKStream
    .mapValues(inputValue -> {
        // For each execution the below "return" could provide a different class than the previous run!
        // e.g. "return isFailedProcessing ? failValue : successValue;"
        // where failValue and successValue have no related classes
        return someObject; // someObject class varies at runtime depending on your business
    }) // here you'll have KStream<whateverKeyClass, Object> -> yes, Object for the value!
    // You could have a different logic for choosing
    // the target topic, below is just an example
    .to((k, v, recordContext) -> v instanceof failValueClass ?
            "dead-letter-topic" : "success-topic",
        // You could completely ignore the "Produced" part
        // and rely on spring-boot properties only, e.g.
        // spring.kafka.streams.properties.default.key.serde=yourKeySerde
        // spring.kafka.streams.properties.default.value.serde=org.springframework.kafka.support.serializer.JsonSerde
        Produced.with(yourKeySerde,
            // JsonSerde could be an instance configured as you need
            // (with type mappings or headers setting disabled, etc)
            new JsonSerde<>()));
Your classes, though different and landing into different topics, will serialize as expected.
When not using to() but instead continuing with other processing, one could use branch(), splitting the logic based on the Kafka value class; the trick with branch() is that it returns KStream<keyClass, ?>[], which then lets you cast the individual array items to the appropriate class, as sketched below.
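A rough sketch of that branch()-based variant, continuing the fragment above (the value classes ErrorModel/SuccessModel, yourKeySerde, and the topic names are placeholders, and JsonSerde is the Spring serde mentioned above):
// Assume mixed is a KStream<String, Object> whose values may be ErrorModel or SuccessModel
KStream<String, Object>[] branches = mixed.branch(
    (k, v) -> v instanceof ErrorModel,   // branch 0: failures
    (k, v) -> true);                     // branch 1: everything else

// Cast each branch back to a concrete value type before continuing
KStream<String, ErrorModel> failures = branches[0].mapValues(v -> (ErrorModel) v);
KStream<String, SuccessModel> successes = branches[1].mapValues(v -> (SuccessModel) v);

failures.to("dead-letter-topic", Produced.with(yourKeySerde, new JsonSerde<>(ErrorModel.class)));
successes.to("success-topic", Produced.with(yourKeySerde, new JsonSerde<>(SuccessModel.class)));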
If you want to send an exception (custom exception) to another topic (ERROR_TOPIC_NAME):
@Bean
public KStream<String, ?> kafkaStreamInput(StreamsBuilder kStreamBuilder) {
    KStream<String, InputModel> input = kStreamBuilder.stream(INPUT_TOPIC_NAME);
    return service.messageHandler(input);
}

public KStream<String, ?> messageHandler(KStream<String, InputModel> inputTopic) {
    KStream<String, Object> output;
    output = inputTopic.mapValues(v -> {
        try {
            // return InputModel
            return normalMethod(v);
        } catch (Exception e) {
            // return ErrorModel
            return errorHandler(e);
        }
    });
    output.filter((k, v) -> (v instanceof ErrorModel)).to(KafkaStreamsConfig.ERROR_TOPIC_NAME);
    output.filter((k, v) -> (v instanceof InputModel)).to(KafkaStreamsConfig.OUTPUT_TOPIC_NAME);
    return output;
}
If you want to handle Kafka exceptions and skip them:
@Autowired
public ConsumerErrorHandler(
        KafkaProducer<String, ErrorModel> dlqProducer) {
    this.dlqProducer = dlqProducer;
}

@Bean
ConcurrentKafkaListenerContainerFactory<?, ?> kafkaListenerContainerFactory(
        ConcurrentKafkaListenerContainerFactoryConfigurer configurer,
        ObjectProvider<ConsumerFactory<Object, Object>> kafkaConsumerFactory) {
    ConcurrentKafkaListenerContainerFactory<Object, Object> factory = new ConcurrentKafkaListenerContainerFactory<>();
    configurer.configure(factory, kafkaConsumerFactory.getIfAvailable());
    factory.setErrorHandler(((exception, data) -> {
        ErrorModel errorModel = ErrorModel.builder().message()
                .status("500").build();
        assert data != null;
        dlqProducer.send(new ProducerRecord<>(DLQ_TOPIC, data.key().toString(), errorModel));
    }));
    return factory;
}
All the above answers, although valid and useful, assume that your streams topology is stateless. For example, going back to the original example,
master topic -> my processing in a mapper/filter -> output topics
"my processing in a mapper/filter" should be stateless, i.e. not re-partitioning (aka writing to a persistent re-partition topic) or doing a toTable() (aka writing to a changelog topic). If the processing fails further down the topology and you commit the transaction (by following any of the 3 options mentioned above - flatMap, branch or filter), then you have to cater for manually or programmatically deleting that inconsistent state eventually. That would mean writing extra custom code to automate this.
I would personally expect Streams to also give you a LogAndSkip option for any unhandled runtime exception, not only for deserialization and production ones.
Has anyone any ideas on this?
I don't believe these examples work at all when working with Avro.
When the schema can't be resolved (i.e. there is a bad/non-Avro message corrupting the topic, for example), there is no key or value to deserialize in the first place, because by the time the DSL .branch() code is called, the exception has already been thrown (or handled).
Can anyone confirm whether this is indeed the case? The very fluent approach you refer to here isn't possible when working with Avro?
KIP-161 does explain how to use a handler, however, it's much more fluent to see it as part of the topology.

How to create suggestion messages with ANTLR?

I want to create an interactive version of the ANTLR calculator example, which tells the user what to type next. For instance, in the beginning, the ID, INT, NEWLINE, and WS tokens are possible. Ignoring WS, a suggestion message could be:
Type an identifier, a number, or newline.
After parsing a number, the message should be
Type +, -, *, or newline.
and so on. How can I do this?
Edit
What I have tried so far:
private void accept(String sentence) {
    ANTLRInputStream is = new ANTLRInputStream(sentence);
    OperationLexer l = new OperationLexer(is);
    CommonTokenStream cts = new CommonTokenStream(l);
    final OperationParser parser = new OperationParser(cts);
    parser.addParseListener(new OperationBaseListener() {
        @Override
        public void enterEveryRule(ParserRuleContext ctx) {
            ATNState state = parser.getATN().states.get(parser.getState());
            System.out.print("RULE " + parser.ruleNames[state.ruleIndex] + " ");
            IntervalSet following = parser.getATN().nextTokens(state, ctx);
            for (Integer token : following.toList()) {
                System.out.print(parser.tokenNames[token] + " ");
            }
            System.out.println();
        }
    });
    parser.prog();
}
This prints the right suggestion for the first token, but for all other tokens it prints the current token. I guess capturing the state in enterEveryRule() is too early.
Accurately gathering this information in an LL(k) parser, where k>1, requires a thorough understanding of the parser internals. Several years ago, I faced this problem with ANTLR 3, and found the only real solution was so complex that it resulted in me becoming a co-author of ANTLR 4 specifically so I could handle this issue.
ANTLR (including ANTLR 4) disambiguates the parse tree during the parsing phase, which means if your grammar is not LL(1) then performing this analysis in the parse tree means you have already lost information necessary to be accurate. You'll need to write your own version of ParserATNSimulator (or a custom interpreter which wraps it) which does not lose the information.

sprintf() and WriteFile() affecting string Buffer

I have a very weird problem which I cannot seem to figure out. Unfortunately, I'm not even sure how to describe it without describing my entire application. What I am trying to do is:
1) read a byte from the serial port
2) store each char into tagBuffer as they are read
3) run a query using tagBuffer to see what type of tag it is (book or shelf tag)
4) depending on the type of tag, output a series of bytes corresponding to the type of tag
Most of my code is implemented and I can get the right tag code sent back out the serial port. But there are two lines that I added as debug statements which, when I try to remove them, cause my program to stop working.
The lines are the two lines at the very bottom:
sprintf(buf,"%s!\n", tagBuffer);
WriteFile(hSerial,buf,strlen(buf), &dwBytesWritten,&ovWrite);
If I try to remove them, "tagBuffer" will only store the last character as opposed to being a buffer. The same thing happens with the next line, WriteFile().
I thought sprintf and WriteFile were I/O functions and would have no effect on variables.
I'm stuck and I need help to fix this.
// keep polling as long as stop character '-' is not read
while(szRxChar != '-')
{
    // Check if a read is outstanding
    if (HasOverlappedIoCompleted(&ovRead))
    {
        // Issue a serial port read
        if (!ReadFile(hSerial,&szRxChar,1,
                &dwBytesRead,&ovRead))
        {
            DWORD dwErr = GetLastError();
            if (dwErr!=ERROR_IO_PENDING)
                return dwErr;
        }
    }

    // resets tagBuffer in case tagBuffer is out of sync
    time_t t_time = time(0);
    char buf[50];

    if (HasOverlappedIoCompleted(&ovWrite))
    {
        i=0;
    }

    // Wait 5 seconds for serial input
    if (!(HasOverlappedIoCompleted(&ovRead)))
    {
        WaitForSingleObject(hReadEvent,RESET_TIME);
    }

    // Check if serial input has arrived
    if (GetOverlappedResult(hSerial,&ovRead,
            &dwBytesRead,FALSE))
    {
        // Wait for the write
        GetOverlappedResult(hSerial,&ovWrite,
            &dwBytesWritten,TRUE);

        if( strlen(tagBuffer) >= PACKET_LENGTH )
        {
            i = 0;
        }

        // load tagBuffer with byte stream
        tagBuffer[i] = szRxChar;
        i++;
        tagBuffer[i] = 0; // char arrays are \0 terminated

        // run query with tagBuffer
        sprintf(query,"select type from rfid where rfidnum=\"");
        strcat(query, tagBuffer);
        strcat(query, "\"");
        mysql_real_query(&mysql,query,(unsigned int)strlen(query));

        // process result and send back to handheld
        res = mysql_use_result(&mysql);
        while(row = mysql_fetch_row(res))
        {
            printf("result of query is %s\n",row[0]);
            string str = "";
            str = string(row[0]);
            if( str == "book" )
            {
                WriteFile(hSerial,BOOK_INDICATOR,strlen(BOOK_INDICATOR),
                    &dwBytesWritten,&ovWrite);
            }
            else if ( str == "shelf" )
            {
                WriteFile(hSerial,SHELF_INDICATOR,strlen(SHELF_INDICATOR),
                    &dwBytesWritten,&ovWrite);
            }
            else // this else doesn't work
            {
                WriteFile(hSerial,NOK,strlen(NOK),
                    &dwBytesWritten,&ovWrite);
            }
        }
        mysql_free_result(res);

        // Display a response to input
        //printf("query is %s!\n", query);
        //printf("strlen(tagBuffer) is %d!\n", strlen(tagBuffer));

        // without these, tagBuffer only holds the last character
        sprintf(buf,"%s!\n", tagBuffer);
        WriteFile(hSerial,buf,strlen(buf), &dwBytesWritten,&ovWrite);
    }
}
With those two lines, my output looks like this:
s sh she shel shelf shelf0 shelf00 BOOKCODE shelf0001
Without them, I figured out that tagBuffer and buf only store the most recent character at any one time.
Any help at all will be greatly appreciated. Thanks.
Where are you allocating tagBuffer, and how large is it?
It's possible that you are overwriting 'buf' because you are writing past the end of tagBuffer.
It seems unlikely that those two lines would have that effect on a correct program - maybe you haven't allocated sufficient space in buf for the whole length of the string in tagBuffer? This might cause a buffer overrun that is disguising the real problem?
The first thing I'd say is a piece of general advice: bugs aren't always where you think they are. If you've got something going on that doesn't seem to make sense, it often means that your assumptions somewhere else are wrong.
Here, it does seem very unlikely that an sprintf() and a WriteFile() would change the state of the "buf" array variable. However, those two lines of test code do write to "hSerial", while your main loop also reads from "hSerial". That sounds like a recipe for changing the behaviour of your program.
Suggestion: Change your lines of debugging output to store the output somewhere else: to a dialog box, or to a log file, or similar. Debugging output should generally not go to files used in the core logic, as that's too likely to change how the core logic behaves.
In my opinion, the real problem here is that you're trying to read and write the serial port from a single thread, and this is making the code more complex than it needs to be. I suggest that you read the following articles and reconsider your design:
Serial Port I/O from Joseph Newcomer's website.
Serial Communications in Win32 from MSDN.
In a multithreaded implementation, whenever the reader thread reads a message from the serial port you would then post it to your application's main thread. The main thread would then parse the message and query the database, and then queue an appropriate response to the writer thread.
This may sound more complex than your current design, but really it isn't, as Newcomer explains.
I hope this helps!