How to extract information from a resume using Lucene

Hi everyone!
I am new to Lucene.
I am working on a resume filter project using Lucene. First, I want to extract some basic information, such as the birthday, from the resumes.
Suppose there is always one line that says birthday: 1989/10/19 or something similar. How could I extract this kind of information with Lucene instead of directly using a regular expression?
Currently I think SpanNearQuery might be helpful, but it seems that I cannot add a WildcardQuery to a SpanNearQuery to match the birthday info.
I am totally stuck. Any good suggestions? Much appreciated!

There is no magic bullet for extracting dates from a Lucene field that contains a bunch of text with a date somewhere inside it. The best way would be to write a custom analyzer that breaks the terms apart during indexing and identifies the numeric characters as a date.
I have written a couple of Analyzers for Lucene; however, something like that is not really trivial, especially if you are new to Lucene.

Related

How to get the results which has all the strings specified in the search query

I am a beginner in Lucene. I am writing a search engine to search our code base for certain keywords, and I have a requirement I need help with. Say I am searching for "Apple computers": I would like Lucene to return only the lines that contain "apple computers", case-insensitively. What I actually get is lines containing "Apple computers", lines containing only "apple", and lines containing only "computers". How do I filter the results to get only the lines that contain both "apple" and "computers"?
How do you query Lucene? Basically what you are asking about is covered by building a query using BooleanClause.Occur.MUST.
Exactly how to do this is dependent on your query construction: For the default query parser you should use something like
+Apple +computers
While if you are building queries programmatically you should use MUST for every term.
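As a minimal sketch of the query-parser form above, the helper below turns free-text input into a string in which every term is required. The class and method names are hypothetical and not part of Lucene, and the sketch assumes the input contains no query-parser special characters (it does no escaping).

```java
// Hypothetical helper, not part of Lucene: builds a query-parser string
// in which every whitespace-separated term is marked as required with '+'.
class RequiredTerms {
    static String allRequired(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(allRequired("Apple computers")); // prints "+Apple +computers"
    }
}
```

The resulting string would then be handed to the query parser. When building queries programmatically, the equivalent is to add each term with BooleanClause.Occur.MUST, as described above.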
As Yuval suggested, it's important to know how you use Lucene.
If you use it through lucene-java and need exact phrase results (docs that contain only "apple computers" together) you can use PhraseQuery.
The example of how to compose it.

Getting exact matches in Lucene using the standard analyzer?

Given 2 documents with the content as follows
"I love Lucene"
"Lucene is nice"
I want to be able to query Lucene only for those documents with Lucene at the beginning, i.e., everything that would match the regexp "^Lucene .*".
Is there a way to do it, given that I can't change the index and that it was analyzed using the standard analyzer?
Sure, take a look at SpanFirstQuery. Here is a good tutorial:
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/

Lucene Boolean Query on Not Analyzed Fields

Using RavenDB to do a query on Lucene Index.
This query parses okay:
X:[[a]] AND Y:[[b]] AND Z:[[c]]
However this query gives me a parse exception:
X:[[a]] AND Y:[[b]] AND Z:[[c]] AND P:[[d]]
"Lucene.Net.QueryParsers.ParseException: Cannot parse '( AND )': Encountered \" \"AND"
I tried this on a complex index and on simple reproduction cases, with the same result: it seems that once you go past three ANDs, it blows up. I'm using [[]] and not-analyzed fields because I want exact matches (sometimes the values also contain whitespace, etc.), and from RavenDB I have very little control over the indexing.
I'm wondering how I can rewrite the query to avoid the parse exception.
This is now fixed in the latest RavenDB builds. See this thread for more info.
This looks rather like a bug in Lucene's QueryParser, perhaps try reporting this in the user mailing list.
As a bypass, you could create a BooleanQuery manually and add the terms you want yourself. Since they are not analyzed, and the query doesn't look too complicated, you may be better off without the query-parser.

Retrieving per keyword/field match position in Lucene Solr -- possible?

Is there any way to retrieve the match field/position for each keyword for each matching document from solr?
For example, if the document has title "Retrieving per keyword/field match position in Lucene Solr -- possible?" and the query is "solr keyword", I'd like to get, in addition to the doc-id (I normally only want the doc-id, not the full document), something that can tell me the matches are at:
solr:
title: 9
keyword:
title: 3
I'm pretty sure such info is computed during query execution (for phrase queries), but is it possible to return it to the application?
Thanks!
"Debugging Relevance Issues in Search" suggests using Solr analysis, which you can get to from the admin URL, using something like http://localhost:8983/solr/admin/analysis.jsp?highlight=on .
This highlights matching terms and gives their position.
AFAIK there is no way to do that directly, but you can use hit highlighting to implement it.

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly ran into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a term query and a wildcard query.
I searched around the net, and the resources I found suggest raising the limit with BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. What should I raise it to? How can I be sure that this new magic number will be sufficient for my query? How far can I increase this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem.
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
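To make the expansion described above concrete, here is a simplified model of it in plain Java. This is not Lucene code: the sorted set stands in for the term dictionary of the paintCode field, the class and method names are made up for illustration, and the 1024 limit is the one mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified model of prefix expansion; illustrative only, not Lucene code.
class PrefixExpansionDemo {
    static final int MAX_CLAUSES = 1024; // the clause limit mentioned above

    // Collects one "clause" per indexed term that starts with the prefix,
    // and fails once the number of clauses would exceed the limit.
    static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so we are past the matching range
            }
            if (clauses.size() == MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses (limit " + MAX_CLAUSES + ")");
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("a" + i); // 2000 distinct codes that start with "a"
        }
        terms.add("b1");

        System.out.println(expand(terms, "b").size()); // one matching term
        try {
            expand(terms, "a"); // 2000 matching terms, over the limit
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the model is that the cost of a prefix query depends on how many distinct terms share the prefix, not on how many documents match, which is why a short prefix on a high-cardinality field hits the limit.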
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
    query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
    break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your search term decides which field to search on. This means that you perform a prefix query only for terms with more than 3 characters, which greatly decreases the number of expanded terms and prevents the infamous TooManyClauses exception. Apparently this also speeds up the search.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
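A minimal sketch of that automation, in plain Java without Lucene on the classpath. The class and method names are hypothetical; the field naming scheme (paintCode1n, paintCode2n, paintCode3n) and the cut-off of 3 characters follow the example above. The query is returned as a query-parser style string, and the sketch assumes the input needs no escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class PrefixFields {
    // Longest prefix that gets its own field (3 in the example above).
    static final int MAX_PREFIX_FIELD = 3;

    // Index time: field name -> value pairs to store for one paint code.
    static Map<String, String> fieldsFor(String paintCode) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("paintCode", paintCode);
        int limit = Math.min(MAX_PREFIX_FIELD, paintCode.length());
        for (int n = 1; n <= limit; n++) {
            fields.put("paintCode" + n + "n", paintCode.substring(0, n));
        }
        return fields;
    }

    // Query time: pick the field based on the length of the typed prefix.
    static String queryFor(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        if (prefix.length() <= MAX_PREFIX_FIELD) {
            // Short prefix: exact term match on the dedicated field, no expansion.
            return "paintCode" + prefix.length() + "n:" + prefix;
        }
        // Longer prefix: a real prefix query, which now expands to far fewer terms.
        return "paintCode:" + prefix + "*";
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor("a4c2d3"));
        System.out.println(queryFor("a"));    // prints "paintCode1n:a"
        System.out.println(queryFor("a4c2")); // prints "paintCode:a4c2*"
    }
}
```

Each entry returned by fieldsFor would become a non-analyzed field on the indexed document, so a search for "a" becomes a single-term lookup on paintCode1n instead of a prefix expansion over every code that starts with "a".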
Some issues may arise if you have multiple tokens for each field. You can find more details in the article