The process by which Lucene tokenizes text - lucene

This can be considered a general Java question, but for better understanding I'm using Lucene as an example.
You can use different Tokenizers in Lucene to tokenize text. There's the main abstract Tokenizer class and then many different classes that extend it. Same thing for TokenFilter.
Now, it seems that each time you want to index a document, a new Tokenizer is created. The question is, since a Tokenizer is just a utility class, why not make it static? For example, a tokenizer that converts all letters to lower case could have a static method that does just that for every input it gets. What's the point of creating a new object for every piece of text we want to index?
One thing to mention: Tokenizer has a private field that holds the input it receives to tokenize. I just don't see why we need to store it this way, since the object is destroyed right after the tokenization process is over and the newly tokenized text is returned. The only thing I can think of is multi-threaded access, maybe?
Thank you!

Now, it seems that each time you want to index a document, a new Tokenizer is created
This is not true: the Analyzer.reusableTokenStream method is called, which reuses not just the Tokenizer but the entire chain (TokenFilters, etc.).
See http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String, java.io.Reader)
One thing to mention: Tokenizer has a private field that holds the input it receives to tokenize. I just don't see why we need to store it this way, since the object is destroyed right after the tokenization process is over and the newly tokenized text is returned. The only thing I can think of is multi-threaded access, maybe?
As mentioned earlier, the entire chain of Tokenizers and TokenFilters is reused across documents. So all of their Attributes are reused, but it's also important to note that Attributes are shared across the chain (i.e. all Tokenizers' and TokenFilters' Attribute references point to the same instances). This is why it is crucial to call clearAttributes() in your tokenizer to reset all Attributes.
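For illustration, here is a minimal custom tokenizer (a made-up SimpleWhitespaceTokenizer, sketched against the Lucene 3.0 API) that calls clearAttributes() before emitting each token:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class SimpleWhitespaceTokenizer extends Tokenizer {
    // This attribute instance is shared with every filter wrapped around us.
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public SimpleWhitespaceTokenizer(Reader input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes(); // wipe whatever the previous token left behind
        int c;
        // skip leading whitespace
        while ((c = input.read()) != -1 && Character.isWhitespace(c)) { }
        if (c == -1) {
            return false; // end of input
        }
        StringBuilder sb = new StringBuilder();
        do {
            sb.append((char) c);
        } while ((c = input.read()) != -1 && !Character.isWhitespace(c));
        termAtt.setTermBuffer(sb.toString());
        return true;
    }
}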
To continue the example, a whitespace tokenizer adds a reference to a TermAttribute in its constructor, and it's wrapped by a LowerCaseFilter, which adds a reference to a TermAttribute in its constructor too. Both of these TermAttributes point to the same underlying char[]. When a new document is processed, Analyzer.reusableTokenStream is invoked, which returns the same TokenStream chain (in this case a WhitespaceTokenizer wrapped with a LowerCaseFilter) used for the previous document. The reset(Reader) method is called, pointing the tokenizer's input at the new document contents. Finally, reset() is called on the entire stream, which resets any internal state from the previous document, and the contents are processed until incrementToken() returns false.
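In code, the reuse cycle described above looks roughly like this (a sketch against the Lucene 3.0 API; the field name "body" and the demo class are invented):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ReuseDemo {
    public static void main(String[] args) throws IOException {
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
        for (String doc : new String[] { "First Document", "Second Document" }) {
            // Returns the same Tokenizer instance each time,
            // re-pointed at the new Reader via reset(Reader).
            TokenStream stream = analyzer.reusableTokenStream("body", new StringReader(doc));
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            stream.reset(); // clear state left over from the previous document
            while (stream.incrementToken()) {
                System.out.println(term.term());
            }
            stream.end();
        }
    }
}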

Don't worry about creating an instance here and there of a class when doing something complex like indexing a document with Lucene. There are probably going to be lots and lots of objects created inside the tokenizing and indexing process. One more tokenizer instance is literally nothing compared to the leftover garbage from thrown-away objects when the process completes. If you don't believe me, get a profiler and watch the object-creation counts.

Related

What is the use case of atomFamily in Recoil?

I did my first experiment with recoil, building an editable table. Each cell has an atom which stores its row, column, and text value.
The way I built this was by
initializing each cell's atom into a dictionary (just a plain object), with keys in the format [column]x[row]
I then iterate over these keys in the Table component and pass only the key to each Cell component
The Cell component uses useRecoilState and finds its specific atom by accessing the main dictionary using the key it was passed as a prop.
Now, it seems to me that this use case (creating thousands of related atoms with the same shape) is what atomFamily is meant to make easier, but I don't understand how to use it in this way, where you initialize each atom with a specific value.
And besides that, I don't understand the advantage of using atomFamily over storing a collection of atoms. I understand there is memoization involved, but I don't understand what is getting memoized other than, if I am reading correctly, the ability to recall a specific atom by calling the function again with the same id, which gets you pretty much the same behavior I'm getting with a dictionary.
There is very little difference: if you want to manually memoize and manage your own collection of atoms, then you certainly can. atomFamily is essentially just sugar for that: it handles the memoization for you, so all that you have to do is use a unique key to access each atom. Verbatim from the documentation for atomFamily:
An atom represents a piece of state with Recoil. An atom is created and registered per <RecoilRoot> by your app. But, what if your state isn’t global? What if your state is associated with a particular instance of a control, or with a particular element? For example, maybe your app is a UI prototyping tool where the user can dynamically add elements and each element has state, such as its position. Ideally, each element would get its own atom of state. You could implement this yourself via a memoization pattern. But, Recoil provides this pattern for you with the atomFamily() utility. An Atom Family represents a collection of atoms. When you call atomFamily() it will return a function which provides the RecoilState atom based on the parameters you pass in.
As far as examples for how to use atomFamily: beyond the documentation linked above, there are lots of existing questions and answers on Stack Overflow which already cover exactly that. Here are a couple which I have answered previously:
Dynamic atom keys in Recoil
Looking for a pattern to normalize state in Recoil without losing the benefit of Suspense

Intercept array access using Bytebuddy

I have been using Byte Buddy to monitor application behaviour, and I would like to check whether an array field of one of the application's classes is updated before executing a particular method. I have read the Byte Buddy documentation and Stack Overflow questions, and have found some useful documentation on how to intercept field accesses using MemberSubstitution.
However, because the field I'm interested in is an array, onWrite and onRead events in MemberSubstitution seem irrelevant.
Is it possible to track updates on an array field using Bytebuddy?
No, unfortunately not. An array is first read onto the operand stack; only afterwards are its elements accessed by index. That element access can be preceded by arbitrary instructions and is not individually interceptable.
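To illustrate, here is roughly what an array field update compiles to (Holder is a hypothetical class; the comments show approximately what javap -c prints):

class Holder {
    int[] data = new int[8];

    void update(int i, int v) {
        data[i] = v;
        // compiles to roughly:
        //   aload_0           // push 'this'
        //   getfield data     // the only instruction MemberSubstitution can see
        //   iload_1           // push the index
        //   iload_2           // push the value
        //   iastore           // the element store: a plain stack instruction
    }
}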

Stamping / Tagging / Branding Object Instances

I have a routine which accepts an object and does some processing on it. The objects may or may not be mutable.
void CommandProcessor(ICommand command) {
// do a lot of things
}
There is a probability that the same command instance loops back into the processor. Things turn nasty when that happens. I want to detect these return visitors and prevent them from being processed. The question is: how can I do that transparently, i.e. without disturbing the objects themselves?
Here is what I tried:
Added a property Boolean Visited { get; set; } on the ICommand.
I don't like this, because the logic of one module shows up in another. The ShutdownCommand is concerned with shutting down, not with bookkeeping. Also, an EatIceCreamCommand may always return false in the hope of getting more. And some immutable objects have outright problems with a setter.
Privately maintain a lookup table of all processed instances. When an object comes in, first check it against the list.
I don't like this either. (1) Performance: the lookup table grows large, and we need to do a linear search to match instances. (2) We can't rely on the hash code: the object may forge a different hash code from time to time. (3) Keeping the objects in a list prevents them from being garbage collected.
I need a way to put some invisible marker on the instance (of ICommand) which only my code can see. Currently I don't discriminate between the invocations; I just pray the same instances don't come back. Does anyone have a better idea for implementing this functionality?
Assuming you can't stop this from happening just logically (try to cut out the loop), I would go for a HashSet of commands that you've already seen.
Even if the objects are violating the contracts of HashCode and Equals (which I would view as a problem to start with) you can create your own IEqualityComparer<ICommand> which uses System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode to call Object.GetHashCode non-virtually. The Equals method would just test for reference identity. So your pool would contain distinct instances without caring whether or how the commands override Equals and GetHashCode.
That just leaves the problem of accumulating garbage. Assuming you don't have the option of purging the pool periodically, you could use WeakReference<T> (or the non-generic WeakReference class for .NET 4) to avoid retaining objects. You would then find all "dead" weak references every so often to prevent even accumulating those. (Your comparer would actually be an IEqualityComparer<WeakReference<T>> in this case, comparing the targets of the weak references for identity.)
It's not particularly elegant, but I'd argue that's inherent in the design: processing a command needs to change state somewhere, and an immutable object can't change state by definition, so the state has to live outside the command. A hash set seems a fairly reasonable approach for that, and hopefully I've made it clear how you can avoid all three of the problems you mentioned.
EDIT: One thing I hadn't considered is that using WeakReference<T> makes it hard to remove entries - when the original value is garbage collected, you're not going to be able to find its hash code any more. You may well need to just create a new HashSet with the still-alive entries. Or use your own LRU cache, as mentioned in comments.

Is it possible to hook into the protobuf-net serializer to add some custom logic?

This may be overkill, but I am trying to reduce the network consumption of a client/server protocol, by having both sides keep copies of previously transferred URIs, so as to use 2-4 byte placeholders instead of the full URIs on subsequent chatter.
The problem is, I think it will be quite expensive to reflect through all the complex objects being transferred to locate the URIs that need processing, whereas the serializer is already visiting all these fields, probably using a mechanism much faster than reflection.
Can this be done in protobuf-net?
If this is part of a single call to Serialize/Deserialize (i.e. your data has the same uri repeated at multiple locations), then you can already do this, simply by telling it to treat those strings as references (it has special handling of strings, so two different references of the same string contents count as equal):
[ProtoMember(7, AsReference=true)]
public string Uri {get;set;}
During serialization, the first time it spots a new string value (decorated with AsReference=true) it will generate a unique token to represent the string; all subsequent usages of that same string will serialize only the token.
If this is in separate calls to Serialize/Deserialize, then no: you would have to do it manually. I can think of some ways of doing it, but I think this would be better handled outside of the serialization layer.
Could you possibly customise the objects whose URIs you want to tokenise, and have them inherit or implement an interface that you can check to see whether a particular object is a tokenizer?
Then, if that's the case, you might be able to use BeforeSerialization / AfterDeserialization to make your transformations.

How much responsibility should a method have?

This is most certainly a language-agnostic question, and one that has bothered me for quite some time now. An example will probably help me explain the dilemma I am facing:
Let us say we have a method which is responsible for reading a file, populating a collection with some objects (which store information from the file), and then returning the collection... something like the following:
public List<SomeObject> loadConfiguration(String filename);
Let us also say that, at the time of implementing this method, it would seem infeasible for the application to continue if the returned collection was empty (a size of 0). Now, the question is: should this validation (checking for an empty collection and perhaps the subsequent throwing of an exception) be done within the method? Or should this method's sole responsibility be to perform the load of the file, ignoring the task of validation and allowing validation to be done at some later stage outside of the method?
I guess the general question is: is it better to decouple the validation from the actual task being performed by a method? Will this make things, in general, easier at a later stage to change or build upon? In the case of my example above, it may be that at a later stage a different strategy is added to recover from the event of an empty collection being returned from the loadConfiguration method... this would be difficult if the validation (and resulting exception) were being done in the method.
Perhaps I am being overly pedantic in the quest for some dogmatic answer, where instead it simply just relies on the context in which a method is being used. Anyhow, I would be very interested in seeing what others have to say regarding this.
Thanks all!
My recommendation is to stick to the single responsibility principle, which says, in a nutshell, that each object should have one purpose. In this instance, your method has three purposes, or four if you count the validation aspect.
Here's my recommendation on how to handle this and how to provide a large amount of flexibility for future updates.
Keep your loadConfiguration method.
Have it call a new method that reads the file.
Pass that method's return value to another method that loads the data into the collection.
Pass the object collection into some validation method.
Return the collection.
That takes the one initial method and breaks it into four, with one calling the other three, as sketched below. This should allow you to change pieces without having any impact on the others.
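A rough sketch of that split, keeping the loadConfiguration signature from the question (SomeObject comes from the question; the one-entry-per-line format is a made-up placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ConfigurationLoader {

    // One method calling the other three, each with a single purpose.
    public List<SomeObject> loadConfiguration(String filename) throws IOException {
        String contents = readFile(filename);
        List<SomeObject> configuration = parse(contents);
        validate(configuration);
        return configuration;
    }

    private String readFile(String filename) throws IOException {
        return new String(Files.readAllBytes(Paths.get(filename)));
    }

    private List<SomeObject> parse(String contents) {
        List<SomeObject> result = new ArrayList<SomeObject>();
        for (String line : contents.split("\n")) {
            if (!line.isEmpty()) {
                result.add(new SomeObject(line)); // hypothetical constructor
            }
        }
        return result;
    }

    private void validate(List<SomeObject> configuration) {
        if (configuration.isEmpty()) {
            throw new IllegalStateException("configuration is empty");
        }
    }
}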
Hope this helps
I guess the general question is: is it better to decouple the validation from the actual task being performed by a method?
Yes. (At least if you really insist on answering such a general question; it's always quite easy to find a counter-example.) If you keep the two parts of the solution separate, you can exchange, drop or reuse either of them. That's a clear plus. Of course you must be careful not to jeopardize your object's invariants by exposing the non-validating API, but I think you are aware of that. You'll have to do a little extra typing, but that won't hurt you.
I will answer your question with a question: do you want various validation methods for the product of your method?
This is the same as the 'constructor' issue: is it better to raise an exception during construction, or to initialize a blank object and then call an 'init' method? You are sure to spark a debate here!
In general, I would recommend performing the validation as soon as possible: this is known as Fail Fast, which advocates that finding problems as soon as possible is better than delaying detection, since diagnosis is immediate, whereas later you would have to unwind the whole flow...
If you're not convinced, think of it this way: do you really want to write three lines (load, parse, validate) every time you load a file? That would violate the DRY principle.
So, go agile there:
write your method with validation: it is responsible for loading a valid configuration (1)
if you ever need some parametrization, add it then (like a 'check' parameter, with a default value that preserves the old behavior, of course); see the sketch below
(1) Of course, I don't advocate a single method doing all of this at once... it's a matter of organization: under the covers, this method should call dedicated methods to organize the code :)
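A possible shape for that 'check' parameter, sketched on the loadConfiguration signature from the question (Java has no default arguments, so an overload preserves the old behavior; parse and readFile are hypothetical helpers):

public List<SomeObject> loadConfiguration(String filename) throws IOException {
    return loadConfiguration(filename, true); // validate by default, preserving the old behavior
}

public List<SomeObject> loadConfiguration(String filename, boolean check) throws IOException {
    // under the covers, dedicated methods do the real work, as (1) suggests
    List<SomeObject> configuration = parse(readFile(filename)); // hypothetical helpers
    if (check && configuration.isEmpty()) {
        throw new IllegalStateException("empty configuration: " + filename);
    }
    return configuration;
}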
To deflect the question to a more basic one, each method should do as little as possible. So in your example, there should be a method that reads in the file, a method that extracts the necessary data from the file, another method to write that data to the collection, and another method that calls these methods. The validation can go in a separate method, or in one of the others, depending on where it makes the most sense.
private string ReadFile(string fileSpec)
{
    // code to read in the file and return its contents
}

private FileData GetFileData(string fileContents)
{
    // code to create a FileData struct from the file contents
}

private class FileDataCollection : Collection<FileData> { }

public void DoItAll(string fileSpec, FileDataCollection filDtaCol)
{
    filDtaCol.Add(GetFileData(ReadFile(fileSpec)));
}
Add validation and verification to each of the methods as appropriate.
You are designing an API and should not make any unnecessary assumptions about your client. A method should take only the information that it needs, return only the information requested, and only fail when it is unable to return a meaningful value.
So, with that in mind, if the configuration is loadable but empty, then returning an empty list seems correct to me. If your client has an application specific requirement to fail when provided an empty list, then it may do so, but future clients may not have that requirement. The loadConfiguration method itself should fail when it really fails, such as when it is unable to read or parse the file.
But you can continue to decouple your interface. For example, why must the configuration be stored in a file? Why can't I provide a URL, a row in a database, or a raw string containing the configuration data? Very few methods should take a file path as an argument since it binds them tightly to the local file system and makes them responsible for opening, reading, and closing files in addition to their core logic. Consider accepting an input stream as an alternative. Or if you want to allow for elaborate alternatives -- like data from a database -- consider accepting a ConfigurationReader interface or similar.
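For instance, a hedged sketch of the stream-based alternative (readAll and parse are hypothetical helpers standing in for the core logic):

// The stream overload carries the core logic and knows nothing about files.
public List<SomeObject> loadConfiguration(InputStream in) throws IOException {
    return parse(readAll(in)); // hypothetical helpers
}

// The file-path overload becomes a thin convenience wrapper: opening,
// reading and closing the file stays at the edge of the API.
public List<SomeObject> loadConfiguration(String filename) throws IOException {
    try (InputStream in = new FileInputStream(filename)) {
        return loadConfiguration(in);
    }
}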
Methods should be highly cohesive... that is, single-minded. So my opinion would be to separate the responsibilities as you have described. I sometimes feel tempted to say it is just a short method, so it does not matter... then I regret it 1.5 weeks later.
I think this depends on the case: if you can think of a scenario where you would use this method and an empty returned list would be okay, then I would not put the validation inside the method. But for, e.g., a method which inserts data into a database that has to be validated (is the email address correct, has a name been specified, ...), it should be fine to put validation code inside the function and throw an exception.
Another alternative, not mentioned above, is to support Dependency Injection and have the method client inject a validator. This would allow the preservation of the "strong" Resource Acquisition Is Initialization principle, that is to say Any Object which Loads Successfully is Ready For Business (Matthieu's mention of Fail Fast is much the same notion).
It also allows a resource implementation class to create its own low-level validators which rely on the structure of the resource without exposing clients to implementation details unnecessarily, which can be useful when dealing with multiple disparate resource providers such as Ryan listed.
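A minimal sketch of that injection, assuming a hypothetical Validator interface (all names here are made up for illustration):

public interface Validator<T> {
    void validate(T value); // throws if the value is unacceptable
}

public class ConfigurationLoader {
    private final Validator<List<SomeObject>> validator;

    // The client decides what "valid" means by injecting the policy.
    public ConfigurationLoader(Validator<List<SomeObject>> validator) {
        this.validator = validator;
    }

    public List<SomeObject> loadConfiguration(String filename) throws IOException {
        List<SomeObject> configuration = parse(readFile(filename)); // hypothetical helpers
        validator.validate(configuration); // fail fast, with client-chosen rules
        return configuration;
    }
}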