We're using Lucene to develop a free-text search box for data delivered to a user, as in the case of an email inbox. We'd like the box to handle dates, for instance 5/1/2011. To make things easier, we are limiting the current version of the feature to just two date formats:
mm/dd/yy
mm/dd/yyyy
For our prototype we hacked the query analysis process to pre-process the query string and look for these two date patterns. This was about two years ago, and we were on Lucene 2.4. I'm curious whether there are any out-of-the-box tools in Lucene that accept a DateFormat and return a TokenStream with any identified dates. Looking through the javadocs for Lucene 2.9, I found the class:
org.apache.lucene.analysis.sinks.DateRecognizerSinkFilter
which seems to do what I need, but it implements a SinkFilter, a concept which doesn't seem to be documented in the Lucene Wiki. Has anyone used this filter before, and if so, what is the most effective way to use it?
There is a bit of sample code (which is, admittedly, over-complicated) in the documentation for TeeSinkTokenFilter. Note that the way the DateRecognizerSinkFilter is designed, it does not store the actual date; it just detects that a token is a date that conforms to the specified format. What I would try is to re-implement the DateRecognizerSinkFilter class to take an array of DateFormat instances, create a new Attribute class called DateAttribute (or some-such) and use the date recognizer subclass to set the parsed date into the DateAttribute if one of its formats matches. That way, you can always test whether you have a valid date by interrogating the DateAttribute, and localize the date formats to one class. Another advantage is that you won't have to handle multiple sinks, thereby simplifying the code from the linked example.
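To make the wiring concrete, a stripped-down version of that sample might look like the following (a minimal sketch against the Lucene 2.9 API; the WhitespaceTokenizer source and the single format are assumptions, and note that Java's SimpleDateFormat uses "MM" for months, not "mm"):

    import java.io.IOException;
    import java.io.StringReader;
    import java.text.SimpleDateFormat;

    import org.apache.lucene.analysis.TeeSinkTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.sinks.DateRecognizerSinkFilter;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class DateSinkSketch {
        public static void main(String[] args) throws IOException {
            String queryText = "lunch with bob 05/01/2011 in denver";

            // The tee passes every token through; the sink only receives tokens
            // that DateRecognizerSinkFilter accepts as dates in the given format.
            TokenStream source = new WhitespaceTokenizer(new StringReader(queryText));
            TeeSinkTokenFilter tee = new TeeSinkTokenFilter(source);
            TeeSinkTokenFilter.SinkTokenStream dates = tee.newSinkTokenStream(
                new DateRecognizerSinkFilter(new SimpleDateFormat("MM/dd/yyyy")));

            // Drive the tee so the sink is populated (normally the main chain
            // would be consumed by the indexer or query parser).
            while (tee.incrementToken()) { }

            // Read back the tokens the sink recognized as dates.
            TermAttribute term = dates.addAttribute(TermAttribute.class);
            while (dates.incrementToken()) {
                System.out.println("date token: " + term.term());
            }
        }
    }

The re-implementation suggested above would replace the single-format recognizer with one that loops over an array of DateFormats and exposes the parsed java.util.Date through a custom DateAttribute, so the query code only has to interrogate one attribute.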
Related
I was replacing the internal serialization in my application, moving from Jil to Bond.
I was converting simple classes with Microsoft Bond attributes, and everything worked fine until I got to one with a DateTime.
I then got a KeyNotFoundException from a dictionary lookup during serialization.
I suspect Bond does not support DateTime; is that so?
And if so, why is it not implemented? DateTime may not be a basic type, but adding a custom converter is not worth it to me: the speed gain over protobuf-net is minimal, and I don't need generics, just a simple, fast serializer/deserializer.
I hope I'm missing something. I really want to use Bond, but I also need an easy tool; I cannot risk breaking the application because something as basic as a date or a GUID is not supported by default.
I'm writing here after hours of research, and the Young Person's Guide to C# Bond does not clearly state what is and is not supported.
No, there is no built-in timestamp type in Bond. The built-in types in Bond are documented in the manual for the gbc compiler.
For GUIDs, there's Bond.GUID, which has implicit conversions to/from System.Guid. Note that Bond.GUID lives in bond.bond, so if you want to refer to this from a .bond file, you'll need to use Bond's import functionality and import "bond/core/bond.bond"
There's an example showing how to use DateTime with a custom type alias.
The reason there is no built-in timestamp type in Bond is that there are so many different ways (and standards) for representing timestamps. There's a similar C++ example that shows representing time with a boost::posix_time::ptime, highlighting the various different ways that time is represented.
Our experience has been that projects usually already have a representation for timestamps that they want to use, so, we recommend using a converter so that you can use the representation that's appropriate for your circumstances.
As a side note, my experience has been that DateTimeOffset is a more generally useful type, compared to DateTime.
I am trying to change the scoring in Apache Lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar questions that there is no easy way to do this, because Lucene doesn't keep it in the index. So I thought that while indexing I would build a map from docID to document length and then use it during query evaluation. But I have no idea where I should put this map or where I would update it.
You are exactly right: storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm; that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
If you are extending TFIDFSimilarity instead, you just need to implement the encodeNormValue, decodeNormValue, and lengthNorm methods.
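As a concrete starting point, a minimal sketch against the Lucene 5.x API (class and method names as in the 5.3 javadocs) could look like this; note that DefaultSimilarity encodes the norm into a single byte, so the stored length is only approximate unless you also take over TFIDFSimilarity's encodeNormValue/decodeNormValue:

    import org.apache.lucene.index.FieldInvertState;
    import org.apache.lucene.search.similarities.DefaultSimilarity;

    public class LengthAwareSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(FieldInvertState state) {
            // Keep the raw token count of the field instead of the default
            // boost * 1/sqrt(numTerms); it comes back (decoded) as the norm
            // during scoring.
            return state.getLength();
        }
    }

The same Similarity has to be set both at index time (IndexWriterConfig.setSimilarity) and at search time (IndexSearcher.setSimilarity), and existing documents must be reindexed for the new norms to take effect.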
I am building a REST API that uses a filter parameter to control search results. E.g., one could search for a user by calling:
GET /users/?filter=name%3Dfoo
Now, my API should allow many different filter operators: numeric operators such as equals, greater than, or less than; string operators like contains, begins with, or ends with; and date operators such as year of or timediff. Moreover, AND and OR combinations should be possible.
Basically, I want to support a subset of the underlying MySQL database operators.
I found a lot of different implementations (two good examples are Google Analytics and LongJump) that seem to use custom syntax.
Looking at my requirements, I would probably design a custom syntax pretty similar to the MySQL operator syntax.
However, I was wondering if there are any best practices established that I should follow and whether I should consider anything else. Thanks!
You need an already existing query language; don't try to reinvent the wheel! With REST this is a complicated and not fully solved issue. There are some REST constraints your application must fulfill:
uniform interface / hypermedia as the engine of application state:
You have to send hypermedia responses to your clients, and they have to follow the hyperlinks given in those responses, instead of building the requests on their own. So you can decouple the clients from the structure of the URI.
uniform interface / self-descriptive messages:
You have to send messages annotated with semantics, so you can decouple the clients from the data structure. The best way to do this is RDF with, for example, linked open data vocabs. If you don't want to use RDF, the second-best solution is to use a vendor-specific MIME type, so your messages will be self-descriptive, but the clients will need to know how to parse your custom MIME type.
To describe simple search links, you can use URI templates; for example GET /users/{?name} expects a name parameter in the query string. You can use hydra:IriTemplateMapping from the Hydra vocab to add semantics to parameters like name.
Describing ad-hoc queries is a hard task. You have to describe somehow what your query can contain.
You can choose a URI query language and stick with URI templates and probably Hydra annotations. There are many existing URI query languages, like HTSQL, OData queries (people don't like that one), etc...
You can choose an existing query language and send it in a single URI parameter. This can be anything you want, for example SQL, SPARQL, etc... You have to teach your client to generate that parameter. You can create your own vocab to describe the constraints of the actual query. If you don't need complicated things, this should not be a problem. I don't know of any existing vocabs for describing query structures, but I have never looked for them...
You can choose an existing query language and send it in the body of a SEARCH request. AFAIK SEARCH is not cached or supported by recent HTTP clients; it was defined by WebDAV. You can describe your query with the proper MIME type, and you can use the same vocab as in the previous solution.
You can use an RDF query solution, for example a SPARQL endpoint, triple pattern fragments, etc... That way the semantic metadata will be in your queries rather than in your link descriptions. With SPARQL you don't necessarily need a triple store; you can translate the queries on the server side to SQL or whatever you use. You can probably use SPIN to describe query constraints and query templates, but that is new to me too. There might be other solutions for describing SPARQL query structures...
So, to summarize: if you want a real REST solution, you have to describe to your clients how they can construct queries and which parameters and logical operators they can use. Without query descriptions they won't be able to generate, for example, an HTML form for the user. If you don't want a REST solution, then pick a query language, write a builder on the client, write a parser on the server, and that's all.
The Open Data Protocol (OData)
You can check BreezeJS too and see how this protocol is implemented for Node.js + MongoDB with the breeze-mongodb module, and for a .NET project using Web API and Entity Framework with the Breeze.ContextProvider DLL.
By embracing a set of common, accepted delimiters, equality comparison can be implemented in a straightforward fashion. Setting the value of the filter query-string parameter to a string using those delimiters creates a list of name/value pairs which can be parsed easily on the server side and used to enhance database queries as needed. You can use the delimiters of your choice, say "|" to separate individual filter phrases for OR, "&" to separate individual filter phrases for AND, and a double colon ("::") to separate names from values. This provides a unique-enough set of delimiters to support the majority of use cases and creates a user-readable query-string parameter.
A simple example will serve to clarify the technique. Suppose we want to request users with the name "Todd" who live in "Denver" and have the title of "Grand Poobah". The request URI, complete with query string, might look like this:
GET http://www.example.com/users?filter="name::todd&city::denver&title::grand poobah"
The double colon ("::") separates the property name from the comparison value, enabling the comparison value to contain spaces, which makes it easier to parse the delimiter from the value on the server.
Note that the property names in the name/value pairs match the names of the properties that would be returned by the service in the payload.
Case sensitivity is certainly up for debate on a case-by-case basis, but in general filtering works best when case is ignored. You can also offer wildcards as needed, using the asterisk ("*") as the value portion of the name/value pair.
For queries that require more than simple equality or wildcard comparisons, the introduction of operators is necessary. In this case, the operators themselves should be part of the value and parsed on the server side, rather than part of the property name. When complex query-language-style functionality is needed, consider introducing the query concept from the Open Data Protocol (OData) Filter System Query Option specification (http://www.odata.org/documentation/odata-version-4-0/).
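A rough sketch of the corresponding server-side parsing (plain Java; the class and method names are made up for illustration, and the filter value is assumed to be URL-decoded already, so any "&" inside it must be percent-encoded in the actual URI):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class FilterParser {

        /**
         * Parses an AND-joined filter such as
         * "name::todd&city::denver&title::grand poobah"
         * into an ordered map of property name to comparison value.
         * OR groups would be handled by splitting on "|" before calling this.
         */
        public static Map<String, String> parseAndClauses(String filter) {
            Map<String, String> clauses = new LinkedHashMap<>();
            for (String phrase : filter.split("&")) {
                // A limit of 2 keeps any "::" occurring inside the value intact.
                String[] nameValue = phrase.split("::", 2);
                if (nameValue.length == 2) {
                    clauses.put(nameValue[0].trim(), nameValue[1].trim());
                }
            }
            return clauses;
        }
    }

For the example above, parseAndClauses would yield {name=todd, city=denver, title=grand poobah}.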
There seem to be a lot of standards (like OData), but many are quite complicated in that they introduce new syntax.
For simple multi-filtering, the following format avoids polluting the parameter namespace while still building on existing web technology:
GET /users?filter[name]=John&filter[title]=Manager
It's easily readable, and on the backend, languages like PHP will receive it as an array of filters to apply.
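On the Java side (PHP gets the array for free), collecting those parameters might look roughly like this, assuming a plain servlet-API request; the class name and regex are purely illustrative:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import javax.servlet.http.HttpServletRequest;

    public final class FilterParams {

        private static final Pattern FILTER_PARAM = Pattern.compile("^filter\\[(.+)]$");

        /** Turns ?filter[name]=John&filter[title]=Manager into {name=John, title=Manager}. */
        public static Map<String, String> from(HttpServletRequest request) {
            Map<String, String> filters = new LinkedHashMap<>();
            for (Map.Entry<String, String[]> param : request.getParameterMap().entrySet()) {
                Matcher m = FILTER_PARAM.matcher(param.getKey());
                if (m.matches() && param.getValue().length > 0) {
                    filters.put(m.group(1), param.getValue()[0]);
                }
            }
            return filters;
        }
    }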
A possible standard would be SCIM, which is adopted by some commercial products, but it's not distinguished by brevity. For a pet project I used this:
= equal
! not equal
* like
< smaller
> greater
& bitwise and
| bitwise or
^ bitwise xor
~ in comma separated value list
Examples
So GET /user?name=*An* would get all users whose name starts with An, and GET /user?name=~Anna,Bertha would get those two users.
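Dispatching on those one-character prefixes on the server can be as simple as the following sketch (the enum and method names are made up; the operator set is the one listed above, with a bare value treated as plain equality):

    public final class FilterOperators {

        public enum Op { EQ, NEQ, LIKE, LT, GT, BIT_AND, BIT_OR, BIT_XOR, IN }

        /** Maps the first character of a query-string value to its operator. */
        public static Op operatorOf(String value) {
            if (value == null || value.isEmpty()) {
                return Op.EQ;
            }
            switch (value.charAt(0)) {
                case '=': return Op.EQ;
                case '!': return Op.NEQ;
                case '*': return Op.LIKE;
                case '<': return Op.LT;
                case '>': return Op.GT;
                case '&': return Op.BIT_AND;
                case '|': return Op.BIT_OR;
                case '^': return Op.BIT_XOR;
                case '~': return Op.IN;
                default:  return Op.EQ; // no prefix means plain equality
            }
        }
    }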
Not yet a standard but who knows...
For a recent project to aid me in learning NLP, I am working with a number of documents, each of which contains a date. What I would like to be able to do is read the unstructured data, identify the date or dates within, convert them into a numeric format, and possibly set them in the document's metadata. (Note: since the documents being used contain pseudo information, the actual metadata of the files being read in is false.)
Recently I have been attempting to use OpenNLP in conjunction with Lucene to do this, and it works to a degree.
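For reference, the kind of pipeline I mean is roughly the following (just a sketch; it assumes the pretrained en-token.bin and en-ner-date.bin models from the OpenNLP models download, which may not match anyone else's setup):

    import java.io.FileInputStream;
    import java.util.Arrays;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class DateFinderSketch {
        public static void main(String[] args) throws Exception {
            // Pretrained models downloaded from the OpenNLP site.
            TokenizerModel tokenizerModel = new TokenizerModel(new FileInputStream("en-token.bin"));
            TokenNameFinderModel dateModel = new TokenNameFinderModel(new FileInputStream("en-ner-date.bin"));

            TokenizerME tokenizer = new TokenizerME(tokenizerModel);
            NameFinderME dateFinder = new NameFinderME(dateModel);

            String[] tokens = tokenizer.tokenize("The report was filed on 13 January 1990 in London.");
            Span[] dateSpans = dateFinder.find(tokens);

            // Print whatever the date model recognized, e.g. "January 1990".
            for (Span span : dateSpans) {
                String[] slice = Arrays.copyOfRange(tokens, span.getStart(), span.getEnd());
                System.out.println(String.join(" ", slice));
            }
        }
    }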
However, if the date is written as "13 January 1990" or "2010/01/05", OpenNLP identifies only "January 1990" and "2010" respectively, not the entire date. Other date formats may have issues as well; I have yet to try them all. While I recognise that OpenNLP works on a statistical basis rather than a format basis, I can't help but feel I'm making an elementary mistake.
Am I making a mistake? If not, is there an easy way to rectify this?
I understand that I may be able to construct my own trained model based on a training data set. Is the Apache OpenNLP one freely available, so I may extend it? Are there any others that are freely available?
Is there a better way to do this? I've heard of Apache UIMA, the main reason why I went for OpenNLP is due to its mention in Taming Text by Manning. I should note that the extraction of dates is the first stage of the project and other data will be extracted later as well.
Many thanks for any response.
I am not an expert in OpenNLP but I know that the problem you are trying to solve is called Temporal Expression Extraction (because I do research in this field :P). Nowadays, there are some systems which can greatly help you in extracting and unambiguously representing the temporal meaning of such expressions.
Here are some references:
ManTIME, online demo, software
HeidelTime, online demo, software
SUTime, online demo, software
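For example, SUTime ships with Stanford CoreNLP; a minimal pipeline, closely following the SUTime demo bundled with CoreNLP (the POS tagger model and the sutime rule files need to be on the classpath), looks roughly like this:

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.AnnotationPipeline;
    import edu.stanford.nlp.pipeline.POSTaggerAnnotator;
    import edu.stanford.nlp.pipeline.TokenizerAnnotator;
    import edu.stanford.nlp.pipeline.WordsToSentencesAnnotator;
    import edu.stanford.nlp.time.TimeAnnotations;
    import edu.stanford.nlp.time.TimeAnnotator;
    import edu.stanford.nlp.time.TimeExpression;
    import edu.stanford.nlp.util.CoreMap;

    public class SUTimeSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            AnnotationPipeline pipeline = new AnnotationPipeline();
            pipeline.addAnnotator(new TokenizerAnnotator(false));
            pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
            pipeline.addAnnotator(new POSTaggerAnnotator(false));
            pipeline.addAnnotator(new TimeAnnotator("sutime", props));

            Annotation annotation = new Annotation("The contract was signed on 13 January 1990.");
            // SUTime resolves relative expressions ("next Friday") against the document date.
            annotation.set(CoreAnnotations.DocDateAnnotation.class, "2013-07-14");
            pipeline.annotate(annotation);

            List<CoreMap> timexes = annotation.get(TimeAnnotations.TimexAnnotations.class);
            for (CoreMap timex : timexes) {
                System.out.println(timex + " --> "
                    + timex.get(TimeExpression.Annotation.class).getTemporal());
            }
        }
    }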
If you want a broader overview of the field, please have a look at the results of the last temporal information extraction challenge (TempEval-3, Task A).
I hope this helps. :)
I'm looking to have the ability to access the length (in terms) of a specific field of a document post-indexing. Preferably, if there is a way without re-indexing I would like to do that. But if re-indexing in a certain way will give easy access to this value, that would also serve.
http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html
That link (scroll down a bit and find the mention of length) talks about accessing the value at indexing time. I wish to be able to do so post-indexing. The link also talks about saving the value away in a doc value, but it gives no examples of how to do so.
If anyone could provide examples of saving the document length, or accessing it post-indexing, it would be incredibly helpful. Thanks.
The mention of that statistic in the article is in reference to a FieldInvertState. Once you have that, it should be fairly straightforward to get the statistics you are looking for (just call getLength, getUniqueTermCount, or whatever you need).
The FieldInvertState is passed into the Similarity, particularly to the call Similarity.computeNorm. The norm value is calculated and stored at index time, rather than evaluated at query time, so making effective use of it would require you to reindex.
The typical way to make use of this would be to create a custom Similarity, possibly extending DefaultSimilarity. Simply overriding the lengthNorm method of DefaultSimilarity would be the simplest approach. Its standard implementation is:
return (float)(1.0 / Math.sqrt(numTerms));
You could override this with whatever you like.
That would work for tweaking scoring based on a custom length-based calculation. If that's not what you are looking for, and you instead just need to be able to fetch that information, I would think simply storing the field and getting the length from the field value returned when you fetch a Document would be the simplest implementation.
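If you do go the norms route, reading the values back after indexing looks roughly like this (a sketch using the Lucene 5.x reader API, where NumericDocValues still exposes get(docID); in 4.x the reader classes are AtomicReader/AtomicReaderContext, and "body" is just a hypothetical field name):

    import java.io.IOException;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.store.Directory;

    public final class NormDumper {

        /** Prints the encoded norm of the "body" field for every document. */
        public static void dumpNorms(Directory directory) throws IOException {
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                for (LeafReaderContext ctx : reader.leaves()) {
                    NumericDocValues norms = ctx.reader().getNormValues("body");
                    if (norms == null) {
                        continue; // field was not indexed with norms in this segment
                    }
                    for (int docId = 0; docId < ctx.reader().maxDoc(); docId++) {
                        long encoded = norms.get(docId);
                        // Decode with the Similarity used at index time (for
                        // TFIDFSimilarity subclasses, decodeNormValue(encoded)).
                        System.out.println("doc " + docId + " norm=" + encoded);
                    }
                }
            }
        }
    }

What the raw long means depends entirely on the Similarity's computeNorm/lengthNorm and norm encoding used when the index was built.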