Can I insert a Document into Lucene without generating a TokenStream?

Is there a way to add a document to the index by supplying terms and term frequencies directly, rather than via Analysis and/or TokenStream? I ask because I want to model some data where I know the term frequencies, but there is no underlying text document to be analyzed. I could create one by repeating the same term many times (I don't care about positions or highlighting in this case, either, just scoring), but that seems a bit perverse (and probably slower than just supplying the counts directly).
(also asked on the mailing list)

At any rate, you don't need to pass everything through an Analyzer in order to create the document. I'm not aware of any way to pass in Terms and Frequencies as you've asked (though I'd be interested to know if you find a good approach to it), but you can certainly pass in IndexableFields one term at a time. That would still require you to add each term multiple times, like:
IndexableField field = new StringField(fieldName, myTerm, Field.Store.NO); // StringField indexes the value as a single, un-analyzed token
for (int i = 0; i < frequency; i++) {
    document.add(field);
}
You can also take a further step back and cut the Document class out entirely by passing any Iterable<IndexableField>, a simple List for instance, to the IndexWriter, which might be a more direct way to model your data.
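As a rough sketch of that (assuming a recent Lucene where IndexWriter.addDocument accepts any Iterable of IndexableField; the termCounts map and fieldName are illustrative names, not anything from your code):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexableField;

// Index one "document" straight from a term -> frequency map, skipping Document
// entirely and repeating each term as many times as its known frequency.
static void addFromCounts(IndexWriter writer, String fieldName,
                          Map<String, Integer> termCounts) throws IOException {
    List<IndexableField> fields = new ArrayList<>();
    for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
        for (int i = 0; i < e.getValue(); i++) {
            fields.add(new StringField(fieldName, e.getKey(), Field.Store.NO));
        }
    }
    writer.addDocument(fields); // addDocument(Iterable<? extends IndexableField>)
}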
Not sure if that gets you any closer to what you are looking for, but perhaps a step vaguely in the right direction.

Related

Is it possible to obtain, alter and replace the tfidf document representations in Lucene?

Hi guys,
I'm working on some ranking related research. I would like to index a collection of documents with Lucene, take the tfidf representations (of each document) it generates, alter them, put them back into place and observe how the ranking over a fixed set of queries changes accordingly.
Is there any non-hacky way to do this?
Your question is too vague to have a clear answer, especially regarding what you plan to do with:
take the tfidf representations (of each document) it generates, alter them
Lucene stores raw values for scoring:
CollectionStatistics
TermStatistics
Per term/doc pair stats: PostingsEnum
Per field/doc pair: norms
All of this data is managed by Lucene and is used to compute a score for a given query term. A custom Similarity class can be used to change the formula that generates this score.
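For example, a minimal sketch, assuming a Lucene version that still ships ClassicSimilarity: subclass it, override the factor you care about, and install the same instance at both index time and search time.

import org.apache.lucene.search.similarities.ClassicSimilarity;

// Illustrative only: a TF-IDF variant that flattens term frequency.
public class FlatTfSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f; // ignore how often the term occurs in the document
    }
}

// Install it on both sides so scoring stays consistent:
// new IndexWriterConfig(analyzer).setSimilarity(new FlatTfSimilarity());
// indexSearcher.setSimilarity(new FlatTfSimilarity());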
But you have to consider that a search query is made of multiple terms, and the way the scores of individual terms are combined can be changed as well. You could use existing Query classes (e.g. BooleanQuery, DisjunctionMaxQuery), but you could also write your own.
So it really depends on what you want to do with all of this, but note that if you want to change the raw values stored by Lucene, it is going to be rather hard. You would have to write a custom Lucene codec, and probably most of the query stack, to benefit from your new data.
One nice thing you should consider is the possibility of storing arbitrary byte[] payloads per term position. That way you can store a value computed outside of Lucene and use it in a custom Similarity or query.
Please see the following tutorials: Getting Started with Payloads and Custom Scoring with Lucene Payloads; they may give you some ideas.

Forward Index Implementation in Google

I am trying to develop a search engine in my free time, modeled after Google.
I am using the original Google research paper listed here: http://infolab.stanford.edu/~backrub/google.html
However, I am having a few problems. To be exact, I am having trouble developing the forward index.
In the paper it says:
If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words.
Now there are two problems with this statement. First, who decides which words out of the huge lexicon go into the forward barrels? Do all of them go? Second, what is the meaning of the word "correspond"? Does it mean words that actually appear in that document after the previous word, or something else?
I am really new to search engines and would really appreciate any information retrieval expert helping me with this. If moderators think this question belongs on some other Stack Exchange site, please move it.
First Question:
The string value of every word is mapped to an integer (by a hash function), because integers are far easier to handle than strings. You can then define ranges (buckets or bins or whatever else you want to call them) over these integer values, e.g.
term ids 0 to 1000 => Bin-1
term ids 1001 to 2000 => Bin-2
and so on.
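A minimal sketch of that mapping (the hash function and the bin size of 1000 are illustrative choices, not anything prescribed by the paper):

// Map a term to a stable non-negative id, then to a bin/barrel number.
static int termId(String term) {
    return term.hashCode() & 0x7fffffff; // drop the sign bit so ids are non-negative
}

static int binOf(String term, int binSize) {
    return termId(term) / binSize;       // binSize = 1000 gives bins 0, 1, 2, ...
}

// binOf("quick", 1000) and binOf("brown", 1000) tell you which forward barrel
// each (docID, wordID, hitlist) entry should be written into.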
Second question:
The context information is typically not used. A word is simply a term present in a document, such as the terms "the", "quick", "brown" etc.
Since you said you are new to IR, a good way to start would be to read an introductory IR book, e.g. the one by Manning and Schütze.

Cplex/OPL local search

I have a model implemented in OPL. I want to use this model to implement a local search in Java. I want to initialize solutions with some heuristics and give these initial solutions to CPLEX to find a better solution based on the model, but I also want to limit the search to a specific neighborhood. Any idea how to do it?
Also, how can I limit the range of all variables? And which is best: implementing these heuristics and the local search in OPL itself, in Java, or even in C++?
Thanks in advance!
Just to add some related observations:
Re Ram's point 3: we have had a lot of success with approach b. In particular, it is simple to add constraints that fix some of the variables to values from a known solution, and then re-solve for the rest of the variables in the problem. More generally, you can add constraints that limit the values to be similar to a previous solution, like:
var >= previousValue - 1
var <= previousValue + 2
This is no use for binary variables of course, but for general integer or continuous variables can work well. This approach can be generalised for collections of variables:
sum(i in indexSet) var[i] >= (sum(i in indexSet) value[i]) - 2
sum(i in indexSet) var[i] <= (sum(i in indexSet) value[i]) + 2
This can work well for sets of binary variables. For an array of 100 binary variables of which maybe 10 had the value 1, we would be looking for a solution where at least 8 have the value 1, but not more than 12. Another variant is to limit something like the Hamming distance (assume that the vars are all binary here):
dvar int changed[indexSet] in 0..1;

forall(i in indexSet)
  if (previousValue[i] <= 0.5)
    changed[i] == (var[i] >= 0.5) // was zero before
  else
    changed[i] == (var[i] <= 0.5) // was one before

sum(i in indexSet) changed[i] <= 2;
Here we would be saying that out of an array of e.g. 100 binary variables, only a maximum of two would be allowed to have a different value from the previous solution.
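If you drive CPLEX from Java via the Concert API, the same Hamming-distance idea can be sketched roughly like this (x is assumed to be your existing array of 0/1 variables and prev the values from the previous solution; the helper name is illustrative, not a drop-in):

import ilog.concert.IloException;
import ilog.concert.IloIntVar;
import ilog.concert.IloNumExpr;
import ilog.cplex.IloCplex;

// Restrict the next solve to solutions within Hamming distance maxFlips
// of a previous 0/1 solution prev[] over the binary variables x[].
static void addHammingNeighbourhood(IloCplex cplex, IloIntVar[] x,
                                    double[] prev, int maxFlips) throws IloException {
    IloNumExpr[] changed = new IloNumExpr[x.length];
    for (int i = 0; i < x.length; i++) {
        IloIntVar c = cplex.intVar(0, 1, "changed_" + i);
        if (prev[i] <= 0.5) {
            cplex.addEq(c, x[i]);                 // was zero before: changed == x
        } else {
            cplex.addEq(cplex.sum(c, x[i]), 1.0); // was one before: changed == 1 - x
        }
        changed[i] = c;
    }
    cplex.addLe(cplex.sum(changed), maxFlips);    // allow at most maxFlips variables to flip
}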
Of course you can combine these ideas. For example, add simple constraints to fix a large part of the problem to previous values, while leaving some other variables to be re-solved, and then add constraints on some of the remaining free variables to limit the new solution to be near to the previous one. You will notice of course that these schemes get more complex to implement and maintain as we try to be more clever.
To make the local search work well you will need to think carefully about how you construct your local neighbourhoods - too small and there will be too little opportunity to make the improvements you seek, while if they are too large they take too long to solve, so you don't get to make so many improvement steps.
A related point is that each neighbourhood needs to be reasonably internally connected. We have done some experiments where we fixed the values of maybe 99% of the variables in a model and solved for the remaining 1%. When the 1% was clustered together in the model (e.g. all the allocation variables for a subset of resources) we got good results, while in comparison we got nowhere by just choosing 1% of the variables at random from anywhere in the model.
An often overlooked idea is to invert these same limits on the model, as a way of forcing some changes into the solution to achieve a degree of diversification. So you could add a constraint to force a specific value to be different from a previous solution, or ensure that at least two out of an array of 100 binary variables have a different value from the previous solution. We have used this approach to get a sort-of tabu search with a hybrid matheuristic model.
Finally, we have mainly done this in C++ and C#, but it would work perfectly well from Java. Not tried it much from OPL, but it should be fine too. The key for us was being able to traverse the problem structure and use problem knowledge to choose the sets of variables we freeze or relax - we just found that easier and faster to code in a language like C#, but then the modelling stuff is more difficult to write and maintain. We are maybe a bit "old-school" and like to have detailed fine-grained control of what we are doing, and find we need to create many more arrays and index sets in OPL to achieve what we want, while we can achieve the same effect with more intelligent loops etc without creating so many data structures in a language like C#.
Those are several questions. So here are some pointers and suggestions:
In CPLEX, you give your model an initial solution using IloOplCplexVectors().
Here's a good example in IBM's documentation of how to alter CPLEX's solution.
Within OPL, you can do the same. You basically set a series of values for your variables, and hand those over to CPLEX. (See this example.)
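If you are calling CPLEX directly from Java rather than through OPL, the equivalent mechanism is a MIP start; here is a minimal sketch, assuming vars and startValues come from your heuristic:

import ilog.concert.IloException;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

// Hand a heuristic solution to CPLEX as a starting point before solving.
static void solveFromHeuristic(IloCplex cplex, IloNumVar[] vars,
                               double[] startValues) throws IloException {
    cplex.addMIPStart(vars, startValues); // CPLEX may repair or complete a partial start
    if (cplex.solve()) {
        System.out.println("objective = " + cplex.getObjValue());
    }
}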
Limiting the search to a specific neighborhood: There is no easy way to respond without knowing the details. But there are two ways that people do this:
a. Change the objective to favor that 'neighborhood' and make other areas unattractive.
b. Add constraints that weed out other neighborhoods from the search space.
Regarding limiting the range of variables in OPL, you can do it directly:
dvar int supply in minQty..maxQty;
Or for a whole array of decision variables, you can do something along the lines of:
range CreditsAllowed = 3..12;
dvar int credits[student] in CreditsAllowed;
Hope this helps you move forward.

Improving lucene spellcheck

I have a Lucene index whose documents are in around 20 different languages, all in the same index. I have a field 'lng' which I use to filter the results to only one language.
Based on this index I implemented a spell checker. The issue is that I get suggestions from all languages, which are irrelevant (if I am searching in English, suggestions in German are not what I need). My first idea was to create a separate spell-check index for each language and then select the index based on the language of the query, but I don't like this. Is it possible to add an additional column to the spell-check index and use it, or is there some better way to do this?
Another question is how I could improve suggestions for two or more terms in a search query. Currently I only do it for the first term, which could be much improved by using them in combination, but I could not find any samples or implementations that would help me solve this issue.
thanks
almir
As far as I know, it's not possible to add a 'language' field to the spellchecker index. I think you need to define several SpellCheckers, one per language, to achieve this.
EDIT: As it turned out in the comments, the language of the query is entered by the user as well, so my answer reduces to: define multiple spellcheckers. As for the second question that you added, I think it was discussed before, for example here.
However, even if it were possible, it wouldn't solve the biggest problem, which is detecting the language of the query. That is a highly non-trivial task for very short messages that can include acronyms, proper nouns and slang terms. Simple n-gram based methods can be inaccurate (e.g. the language detector from Tika). So I think the most challenging part is how to combine the certainty scores from the language detector and the spellchecker, and what threshold should be chosen to provide meaningful corrections (e.g. the language detector prefers German, but the spellchecker has a good match in Danish...).
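For what it's worth, a rough sketch of the multiple-spellchecker setup in Java Lucene (the per-language field names like content_en and the index paths are made up for illustration, and this assumes a recent Lucene where FSDirectory.open takes a Path):

import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

// Build one SpellChecker per language, each fed from a per-language field
// of the main index (e.g. content_en, content_de).
static Map<String, SpellChecker> buildSpellCheckers(String mainIndexPath,
                                                    String... languages) throws IOException {
    Map<String, SpellChecker> checkers = new HashMap<>();
    for (String lng : languages) {
        SpellChecker checker = new SpellChecker(FSDirectory.open(Paths.get("spell-" + lng)));
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(mainIndexPath)))) {
            checker.indexDictionary(new LuceneDictionary(reader, "content_" + lng),
                                    new IndexWriterConfig(new StandardAnalyzer()), true);
        }
        checkers.put(lng, checker);
    }
    return checkers;
}

// At query time, pick the checker for the language the user selected:
// String[] suggestions = checkers.get("en").suggestSimilar("misspeled", 5);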
If you look at the source of SpellChecker.SuggestSimilar you can see:
BooleanQuery query = new BooleanQuery();
String[] grams;
String key;
for (int ng = GetMin(lengthWord); ng <= GetMax(lengthWord); ng++)
{
    <...>
    if (bStart > 0)
    {
        Add(query, "start" + ng, grams[0], bStart); // matches start of word
    }
    <...>
I.e. the suggestion search is just a bunch of OR'd boolean queries. You can certainly modify this code with something like:
query.Add(new BooleanClause(new TermQuery(new Term("Language", "German")),
          BooleanClause.Occur.MUST));
which will only look for suggestions in German. There is no way to do this without modifying your code though, apart from having multiple spellcheckers.
To deal with multiple terms, use QueryTermExtractor to get an array of your terms. Do a spellcheck for each one and take the cartesian join of the suggestions. You may want to run a query on each combination and then sort based on how frequently they occur (like how the single-word spellchecker works).
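A rough sketch of that combination step (this assumes an existing SpellChecker and a parsed Query; QueryTermExtractor lives in Lucene's highlighter module, and the suggestion count of 3 is arbitrary):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.QueryTermExtractor;
import org.apache.lucene.search.highlight.WeightedTerm;
import org.apache.lucene.search.spell.SpellChecker;

// Build every combination of per-term suggestions (the "cartesian join").
static List<String> suggestPhrases(SpellChecker spellChecker, Query query) throws IOException {
    List<String> combos = new ArrayList<>();
    combos.add("");
    for (WeightedTerm term : QueryTermExtractor.getTerms(query)) {
        String[] suggestions = spellChecker.suggestSimilar(term.getTerm(), 3);
        if (suggestions.length == 0) {
            suggestions = new String[] { term.getTerm() }; // keep the original term if nothing better
        }
        List<String> next = new ArrayList<>();
        for (String prefix : combos) {
            for (String s : suggestions) {
                next.add(prefix.isEmpty() ? s : prefix + " " + s);
            }
        }
        combos = next;
    }
    return combos; // rank these, e.g. by running each as a query and comparing hit counts
}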
After implementing two different search features on two different sites with both Lucene and Sphinx, I can say that Sphinx is the clear winner.
Consider using http://sphinxsearch.com/ instead of Lucene. It's used by Craigslist, among others.
They have a feature called morphology preprocessors:
# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology = stem_en, stem_ru, soundex
# morphology = libstemmer_german
# morphology = libstemmer_sv
morphology = none
There are many stemmers available, and as you can see, German is among them.
UPDATE:
Elaboration on why I feel that sphinx has been the clear winner for me.
Speed: Sphinx is stupid fast, both at indexing and at serving search queries.
Relevance: Though it's hard to quantify, I felt I was able to get more relevant results with Sphinx compared to my Lucene implementation.
Dependence on the filesystem: With Lucene, I was unable to break the dependence on the filesystem. And while there are workarounds, like creating a RAM disk, I felt it was easier to just select the "run only in memory" option of Sphinx. This has implications for websites with more than one webserver, adding dynamic data to the index, reindexing, etc.
Yes, these are just points of an opinion. However, they are an opinion from someone that has tried both systems.
Hope that helps...

Fastest way to query for object existence in NHibernate

I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple, assume a directory tool, which reads the current hard drive. When a directory is found, it should be either created, or, if already present, updated.
First, let's focus only on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
    var result = DatabaseController.Session
        .CreateCriteria(typeof(DatabaseDirectory))
        .Add(Restrictions.Eq("FullName", dI.FullName))
        .List<DatabaseDirectory>().FirstOrDefault();

    if (result == null)
    {
        result = new DatabaseDirectory
        {
            CreationTime = dI.CreationTime,
            Existing = dI.Exists,
            Extension = dI.Extension,
            FullName = dI.FullName,
            LastAccessTime = dI.LastAccessTime,
            LastWriteTime = dI.LastWriteTime,
            Name = dI.Name
        };
    }

    return result;
}
Is this the way to go regarding:
Speed
Separation of Concern
What comes to mind is the following: a scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan and look them up that way. On the other hand, this may not be suitable for large sets of data, like files (which will be 600.000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (other reasons). So the real question is, what is the fastest way to "check for the existence of an object, where the ID is not known, just a property of that object"?
And batching might be possible, by selecting a big group with something like "starts with C:\Testfiles\", but the problem remains: how do I know in advance how big this set will be? I can't select "max 1000" and check against this buffered dictionary, because I might "hit next to the searched dir"... I hope the problem is clear. The most important part is: does buffering really affect performance this much? If so, does it make sense to load the whole DB into a dictionary containing only PATH and ID (which should be OK even if there are 1.000.000 objects, I think)?
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
// directories: FullName -> DirectoryInfo, collected during the disk scan
Dictionary<string, DirectoryInfo> directories;

var existing = session
    .CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
    .SetParameterList("ids", directories.Keys)
    .List();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.