Could someone give me a hint on how to index only words with a minimum length using Apache Lucene 5.3.1?
I've searched through the API but didn't find anything that suits my needs except this, but I couldn't figure out how to use it.
Thanks!
Edit:
I guess that's important info, so here's a copy of my explanation of what I want to achieve from my reply below:
"I don't intend to use queries. I want to create a source code summarization tool for which I created a doc-term matrix using Lucene. Now it also shows single- or double-character words. I want to exclude them so they don't show up in the results as they have little value for a summary. I know I could filter them when outputting the results, but that's not a clean solution imo. An even worse would be to add all combinations of single- or double-character words to the stoplist. I am hoping there is a more elegant way then one of those."
You should use a custom Analyzer with a LengthFilter (the "length" token filter). E.g.:
Analyzer ana = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("standard")
        .addTokenFilter("lowercase")
        // keep only tokens between 4 and 50 characters long
        .addTokenFilter("length", "min", "4", "max", "50")
        .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();
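If it helps, here is a minimal sketch of wiring that analyzer into indexing; dir is an assumed Directory, and the field name and sample text are made up:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;

// The analyzer runs at index time, so short tokens never enter the index
// (and therefore never show up in the doc-term matrix).
IndexWriterConfig config = new IndexWriterConfig(ana);
try (IndexWriter writer = new IndexWriter(dir, config)) {
    Document doc = new Document();
    doc.add(new TextField("content", "an if id summarization", Field.Store.NO));
    writer.addDocument(doc); // only "summarization" survives the length filter
}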
But it is better to use a stopword list (words that occur in almost all documents, like articles in English); that gives a more accurate result.
Is it possible to search for document similarity based on term-vector position in Lucene?
For example there are three documents with content as follows
1: Hi how are you
2: Hi how you are
3: Hi how are you
Now if doc 1 is searched for in Lucene, it should return doc 3 with a higher score and doc 2 with a lower score, because doc 2 has the words "you" and "are" at different positions.
In short, Lucene should return exactly matching documents based on term positions.
I think what you need is a PhraseQuery. It is a Lucene Query type that takes the precise positions of your tokens into account and lets you define a slop, or permutation tolerance, for those tokens.
In other words, the more your tokens differ from the source in terms of positions, the lower they will be scored.
You can use it like this:
QueryBuilder analyzedBuilder = new QueryBuilder(new MyAnalyzer());
Query query = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery);
createPhraseQuery also accepts a third parameter, the slop I alluded to, if you want to tweak it.
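For example, with the builder above (the slop value of 2 is just illustrative):

// Allow terms to sit up to two position edits apart and still match;
// closer matches (doc 3) still outscore permuted ones (doc 2).
Query slopQuery = analyzedBuilder.createPhraseQuery("fieldToSearchOn", textQuery, 2);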
Regards,
Dim ii = _DsAttribute.Tables(0).Rows.Find(Convert.ToString(DtgFields.Rows(e.RowIndex).Cells("AttributeID").Value, CultureInfo.CreateSpecificCulture("en-US"))) '.ToString.ToString(CultureInfo.InvariantCulture))
Dim jj = _DsAttribute.Tables(0).Rows.Find(Convert.ToString(DtgFields.Rows(e.RowIndex).Cells("AttributeID").Value, CultureInfo.CreateSpecificCulture("en-US"))).Item("Checked")
I tried many variations of the above, trying to keep the database's data from being "corrupted" by my machine's Danish culture (set in region/language settings in Windows). I tried invariant culture, fr-FR and en-US.
When my machine is Danish, ii is null and jj throws an exception ("Object reference not set to an instance of an object") but, interestingly, _DsAttribute has the same data as when my machine is English (US-English). Also, when I search for the value of DtgFields.Rows(e.RowIndex).Cells("AttributeID").Value, I can find it in the data from _DsAttribute. The data for the ID is, at least to the naked eye, the same.
How do I make use of CultureInfo to avoid this sort of problem?
This is technically possible; Danish uses a different sort order than US-English. You'll find Mr Åårdvårk at the beginning of the phone book in Denmark, not at the end as in the USA.
That makes string comparison a dangerous proposition if the Find() method uses a binary search or a tree to locate the data, which is certainly the case for DataSet: its primary key index is a red-black tree. What goes wrong is that the algorithm follows the wrong path down the tree when the index was written with Danish as the collation but is read with English as the collation, or the other way around. The result is that it can't find an entry in the tree, even though it exists.
The contra-indications are that a database column of type Guid should never be a string (although that is not uncommon), and that Guid values should not contain characters that can steer the search wrong; maybe the column isn't clean and contains other, non-Guid values. You fix it by changing the column type or by using the same collation order (i.e. language) consistently. Try CultureInfo.CurrentCulture for a possible quick fix.
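To see the collation effect in isolation, here is a small illustrative sketch (Java shown; .NET's culture-aware comparers behave analogously). The same strings sort differently under Danish and US-English rules, which is exactly what desynchronizes a tree index written under one collation and probed under another:

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        // Ø sorts after Z under Danish rules, but alongside O under US rules.
        String[] names = { "Ødegård", "Olsen", "Aagaard", "Andersen" };
        Arrays.sort(names, Collator.getInstance(new Locale("da", "DK")));
        System.out.println("da-DK: " + Arrays.toString(names));
        Arrays.sort(names, Collator.getInstance(Locale.US));
        System.out.println("en-US: " + Arrays.toString(names));
        // The two orderings differ, so a red-black tree written with one
        // collation and searched with the other can miss existing entries.
    }
}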
Dim ii = _DsAttribute.Tables(0).Rows.Find(New Guid(DtgFields.Rows(e.RowIndex).Cells("AttributeID").Value.ToString))
Dim jj = _DsAttribute.Tables(0).Rows.Find(New Guid(DtgFields.Rows(e.RowIndex).Cells("AttributeID").Value.ToString)).Item("Checked")
The above was the solution for me. Hans' answer has lots of technical information that was very helpful (thus my upvote) and led me to my answer.
I have a question about the searching process in Lucene.
I use this code for searching:
Directory directory = FSDirectory.GetDirectory(@"c:\index");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("content", analyzer);
qp.SetDefaultOperator(QueryParser.Operator.AND);
Query query = qp.Parse(searchString);
In one document I've set "I want to go shopping" as a field value, and in another document I've set "I wanna go shopping".
The meaning of both sentences is the same!
Is there any good solution for Lucene to understand the meaning of sentences, or to somehow normalize them? For example, saving the field as "I wanna /want to/ go shopping" and removing the comment with a regexp in the results.
Lucene provides filters to normalize words and even to map similar words.
PorterStemFilter -
Stemming allows words to be reduced to their roots, e.g. wanted and wants are reduced to the root want, and a search for any of those words would match the document.
However, wanna does not reduce to the root want, so it may not work in this case.
SynonymFilter -
helps you map similar words in a configuration, so wanna can be mapped to want, and if you search for either of those, the document will match.
You would need to add the filters to your analysis chain, as sketched below.
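A minimal sketch of such a chain, written against the Java Lucene API (Lucene.NET exposes equivalent types); the class name and the wanna-to-want mapping are illustrative:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

public class NormalizingAnalyzer extends Analyzer {
    private final SynonymMap synonyms;

    public NormalizingAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(new CharsRef("wanna"), new CharsRef("want"), true); // keep original too
        this.synonyms = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);   // normalize case first
        result = new SynonymFilter(result, synonyms, true); // then expand synonyms
        return new TokenStreamComponents(source, result);
    }
}

Use the same analyzer at index time and query time, so both "wanna" and "want" hit the document.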
I am working on a web-based job search application using Lucene. Users on my site can search for jobs within a radius of 100 miles from, say, "Boston, MA" or any other location.
Also, I need to show the search results sorted by "relevance" (i.e. the score returned by Lucene) in descending order.
I'm using a 3rd-party API to fetch all the cities within a given radius of a city. This API returns around 864 cities within a 100-mile radius of "Boston, MA".
I'm building the city/state Lucene query using the following logic, which is part of my BuildNearestCitiesQuery method.
Here nearestCities is a hashtable returned by the above API. It contains 864 cities, with CityName as key and StateCode as value.
And finalQuery is a Lucene BooleanQuery object that contains the other search criteria entered by the user, like skills, keywords, etc.
foreach (string city in nearestCities.Keys)
{
    BooleanQuery tempFinalQuery = finalQuery;
    cityStateQuery = new BooleanQuery();
    queryCity = queryParserCity.Parse(city);
    queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);
    cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST);   // MUST is like an AND
    cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);
    nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD); // SHOULD is like an OR
}
finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);
I then pass the finalQuery object to Lucene's Search method to get all the jobs within a 100-mile radius:
searcher.Search(finalQuery, collector);
I found that this BuildNearestCitiesQuery method takes a whopping 29 seconds on average to execute, which is obviously unacceptable by any standard for a website. I also found that the statements involving Parse take a considerable amount of time to execute compared to the other statements.
A job for a given location is a dynamic attribute, in the sense that a city could have 2 jobs (meeting particular search criteria) today but zero jobs for the same criteria 3 days later. So I cannot use any caching here.
Is there any way I can optimize this logic? Or, for that matter, my whole approach/algorithm for finding all jobs within 100 miles using Lucene?
FYI, here is what my indexing in Lucene looks like:
doc.Add(new Field("jobId", job.JobID.ToString().Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("title", job.JobTitle.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("description", job.JobDescription.Trim(), Field.Store.NO, Field.Index.TOKENIZED));
doc.Add(new Field("city", job.City.Trim(), Field.Store.YES, Field.Index.TOKENIZED , Field.TermVector.YES));
doc.Add(new Field("state", job.StateCode.Trim(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("citystate", job.City.Trim() + ", " + job.StateCode.Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED , Field.TermVector.YES));
doc.Add(new Field("datePosted", jobPostedDateTime, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("company", job.HiringCoName.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("jobType", job.JobTypeID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED,Field.TermVector.YES));
doc.Add(new Field("sector", job.SectorID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("showAllJobs", "yy", Field.Store.NO, Field.Index.UN_TOKENIZED));
Thanks a ton for reading! I would really appreciate your help on this.
Janis
I'm not quite sure I completely understand your code, but when it comes to geospatial search, a filter approach might be more appropriate. Maybe this link can give you some ideas: http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html
Maybe you can use Filters for other parts of your query as well. To be honest, your query looks quite complex.
--Hardy
Apart from tempFinalQuery being unused and an unnecessary map lookup to get the state, there doesn't seem to be anything too egregious in the code you posted. Apart from the formatting...
If all the time is taken in the Parse methods, posting their code here would make sense.
I might have missed the point of your question, but do you have the possibility of storing latitude and longitude for zip codes? If that is an option, you could then compute the distance between two coordinates, providing a much more straightforward scoring metric.
I believe the best approach is to move the nearest-city determination into a search filter. I would also reconsider how you have the fields set up; consider creating one term that combines city and state, which would simplify the query. A sketch follows.
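Here is an illustrative sketch in Java Lucene syntax (Lucene.NET's calls are nearly identical; I'm assuming nearestCities as a Map<String, String> from city to state). Because citystate is indexed UN_TOKENIZED, each city can become a raw TermQuery, which sidesteps the per-city QueryParser.Parse calls that seem to dominate your 29 seconds:

// One SHOULD clause per nearby city, built without any analyzer or parsing.
BooleanQuery nearestCityQuery = new BooleanQuery();
for (Map.Entry<String, String> entry : nearestCities.entrySet()) {
    String cityState = entry.getKey() + ", " + entry.getValue(); // e.g. "Boston, MA"
    nearestCityQuery.add(new TermQuery(new Term("citystate", cityState)),
                         BooleanClause.Occur.SHOULD);
}
finalQuery.add(nearestCityQuery, BooleanClause.Occur.MUST);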
I'd suggest:
storing the latitude and longitude of locations as they come in
when a user enters a city and a distance, turn that into a lat/lon point and a radius in degrees
do a single, simple lookup based on numerical lat/lon distance comparisons
You can see an example of how this works in the Geo::Distance Perl module. Take a look at the closest method in the source, which implements this lookup via simple SQL.
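To make that concrete, here is a hedged sketch in Java (the coordinates, constants, and column names are illustrative): a cheap bounding-box prefilter followed by the exact Haversine great-circle distance, which is essentially the lookup Geo::Distance implements in SQL.

// Great-circle distance in miles between two lat/lon points (Haversine formula).
static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
    double earthRadiusMiles = 3958.8;
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
             * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * earthRadiusMiles * Math.asin(Math.sqrt(a));
}

// Prefilter: one degree of latitude is roughly 69 miles, so only rows inside
// this box need the exact distance check, e.g. in SQL:
//   SELECT ... WHERE lat BETWEEN :lat - :dLat AND :lat + :dLat
//                AND lon BETWEEN :lon - :dLon AND :lon + :dLon
double radiusMiles = 100, lat = 42.36, lon = -71.06; // Boston, illustrative
double dLat = radiusMiles / 69.0;
double dLon = radiusMiles / (69.0 * Math.cos(Math.toRadians(lat)));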
I agree with the others here that this approach smells too much. Also, doing a textual search on city names is not always reliable: there is often a degree of subjectivity between place names (particularly areas within a city, which might themselves be large).
Doing a geospatial query is the way to go. Not knowing the rest of your setup, it is hard to advise. You do have spatial support built into Fluent NHibernate and SQL Server 2008, for example, so you could do such a search very quickly and efficiently. However, your challenge is to get this working within Lucene.
You could possibly do a "first pass" query using the spatial support in SQL Server, and then run those results through Lucene?
The other major benefit of doing spatial queries is that you can then easily sort your results by distance, which is a win for your customers.