Lucene Spans objects have startPosition() and endPosition() methods, which, according to their Javadoc, return "position[s] in the current doc." How are these useful?
My understanding is that these positions are the indices of the start and end tokens of the span, that is, token indices after an Analyzer has processed the original text. But after digging around the Javadocs for a while, I don't know what I can do with these positions. It seems like I should be able to query a document and, say, get the tokens between startPosition and endPosition, or maybe get the offsets corresponding to the positions, but I don't see anything like that.
How can I relate these positions back to the original text?
Perhaps span positions are only useful for search queries. From Lucene Javadocs: "phrase and proximity searches rely on position info."
Related
This might end up being a very general question, but hopefully it will be useful to others as well.
I want to be able to request a word that has x syllables with the stress on the y-th syllable. I've found plenty of APIs that return both of these, such as Wordnik, but I'm not sure how to approach the search aspect. The URL to get the syllables is
GET /word.json/{word}/hyphenation
but I won't know the word ahead of time to make this request. They also have this:
GET /words.json/randomWords
which returns a list of random words.
Is there a way to achieve what I want with this API without asking for random words over and over and checking if they meet my needs? That just seems like it would be really slow and push me over my usage limits.
Do I need to build my own data structure with the words and syllables to query locally?
I doubt you'll find this kind of specialized query on any of the big dictionary APIs. You'll need to download an English dictionary and create your own data structure to do this kind of thing.
The Moby Project has a hyphenated dictionary with about 185,000 words in it. There are many other dictionary projects available. A good place to start looking is http://www.dicts.info/dictionaries.php.
Once you've downloaded the dictionary, you'll need to preprocess it to build your data structure. You should be able to construct a dictionary or hash map that is indexed by (syllables, emphasis), and whose data member is a list of words. So you'd have an entry like (4, 2) (4-syllable word with emphasis on the 2nd syllable), and a list of all such words.
To query it, then, you'd just pack the query into a structure and look up that key in the hash map. Then pick a random word from the resulting list.
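The preprocessing and lookup described above can be sketched roughly as follows. This is only a minimal illustration: the sample entries, the (word, syllables, stress) tuple layout and the function names are assumptions made up for the example, and a real dictionary file would need its own parsing step to produce such tuples.

```python
import random
from collections import defaultdict

# Hypothetical sample data: (word, syllable count, stressed syllable, 1-based).
# A real list would come from preprocessing a downloaded dictionary.
SAMPLE_ENTRIES = [
    ("banana", 3, 2),
    ("computer", 3, 2),
    ("elephant", 3, 1),
    ("photography", 4, 2),
    ("biology", 4, 2),
    ("water", 2, 1),
]


def build_index(entries):
    """Group words under a (syllables, stress) key."""
    index = defaultdict(list)
    for word, syllables, stress in entries:
        index[(syllables, stress)].append(word)
    return index


def random_word(index, syllables, stress, rng=random):
    """Return a random word for the key, or None if no word matches."""
    candidates = index.get((syllables, stress))
    if not candidates:
        return None
    return rng.choice(candidates)


index = build_index(SAMPLE_ENTRIES)
print(random_word(index, 4, 2))  # one of the 4-syllable words stressed on the 2nd
```

Building the index is a one-time cost, and each query afterwards is a single hash lookup, so there is no need to poll a remote API.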
The various examples I see about how to find the positions of the matches returned by an IndexSearcher require either retrieving the document's content and searching a TokenStream, or indexing the positions and offsets in the term vectors, turning the query into a term and finding it in the term vector.
But what happens when I use a FuzzyQuery? Is there a way to know which term(s) exactly matched in the hit so that I may look for them in the term vector of this document?
In case that's of any value, I'm new to Lucene and my goal here is to annotate a set of documents (the ones indexed in Lucene) with a set of terms, but the documents are from scanned texts and contain OCR errors, therefore I must use a FuzzyQuery. I thought about using lucene-suggest to do some spellchecking beforehand, but it occurred to me that it boiled down to trying to find fuzzy matches.
I am trying to develop a search engine in my free time, modeled after Google.
I am using the original Google research paper listed here: http://infolab.stanford.edu/~backrub/google.html
However, I am having a few problems. To be exact, I am having trouble developing the forward index.
In the paper it says:
If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words.
Now there are two problems with this statement. First, who decides which words out of the huge lexicon go into the forward barrels? Do all of them go? Second is the meaning of the word "correspond". Does it mean words that actually appear in that document after the previous word, or something else?
I am really new to search engines and would really appreciate any information retrieval expert helping me with this. If moderators think that this question belongs on some other Stack Exchange site, please move it.
First Question:
The string value of every word is mapped to an integer (by a hash function). This is because integers are far easier to handle than strings. You can then define ranges (buckets or bins or whatever else you might want to call them) over these integer values, e.g.
term ids 0 to 1000 => Bin-1
term ids 1001 to 2000 => Bin-2
and so on.
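A minimal sketch of that mapping, under stated assumptions: zlib.crc32 stands in for "a hash function", the id space of 10,000 is arbitrary, and the function names are made up for the example. The bin boundaries follow the ranges listed above.

```python
import zlib

BIN_WIDTH = 1000  # taken from the example ranges above


def term_id(word, id_space=10_000):
    """Map a word to a stable integer id (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(word.encode("utf-8")) % id_space


def bin_for(term_id_value):
    """Ids 0..1000 go to bin 1, 1001..2000 to bin 2, and so on."""
    return max(term_id_value - 1, 0) // BIN_WIDTH + 1


for word in ("the", "quick", "brown"):
    tid = term_id(word)
    print(word, tid, "=> Bin-%d" % bin_for(tid))
```

Python's built-in hash() is deliberately avoided here because it is randomized per process for strings, so the ids would not be stable between runs.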
Second question:
The context information is typically not used. A word is simply a term present in a document, such as the terms "the", "quick", "brown" etc.
Since you said you are new to IR, a good way to start would be to read an introductory book on IR, e.g. Introduction to Information Retrieval by Manning, Raghavan and Schütze.
No keyboard patterns, i.e. keys that are adjacent vertically or horizontally on a keyboard. For example, 'ZXCVBN123' should be rejected.
No commonly used words and no words written backwards or disguised with special characters. For example 'Universe1' and 'Un1ver$e' should be rejected.
Well, first you need to define exactly what you want. What are keyboard patterns? Is 'jk' a keyboard pattern, or just 'jkl'? What's the shortest pattern there is? Is 'gy' a pattern? First you need to define what a pattern really is.
Then you should make a list of all the available patterns. (There aren't all that many: you have 36 starting points and 4 directions to go from each starting point.) When you get a password, try to locate each of the patterns in it. Note that if you decide the shortest pattern is 3 letters long, you don't need to search for 4-letter patterns, because every 4-letter pattern already contains a 3-letter pattern.
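As a rough sketch of that approach, assuming a US QWERTY layout, a minimum pattern length of 3, and a simple column alignment that ignores the physical stagger of the keys (all three are choices made for this example, not part of any standard):

```python
# Key rows of an assumed US QWERTY layout; columns are aligned by index.
ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
MIN_LEN = 3  # assumed shortest run that counts as a pattern


def build_patterns(rows=ROWS, length=MIN_LEN):
    """Collect every horizontal and vertical run of `length` keys, both directions."""
    lines = list(rows)
    width = max(len(row) for row in rows)
    for col in range(width):
        # Vertical line through the keyboard, e.g. "1qaz".
        lines.append("".join(row[col] for row in rows if col < len(row)))
    patterns = set()
    for line in lines:
        for i in range(len(line) - length + 1):
            run = line[i:i + length]
            patterns.add(run)
            patterns.add(run[::-1])
    return patterns


PATTERNS = build_patterns()


def has_keyboard_pattern(password):
    """True if any MIN_LEN-character window of the password is a known run."""
    lowered = password.lower()
    return any(
        lowered[i:i + MIN_LEN] in PATTERNS
        for i in range(len(lowered) - MIN_LEN + 1)
    )


print(has_keyboard_pattern("ZXCVBN123"))  # True
print(has_keyboard_pattern("hello"))      # False
```

Only the shortest runs are stored, since any longer run necessarily contains one of them.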
As for words, that's easier, but first you need to make a list of all disallowed transformations ($->S, 1->i, etc.). Once you get a password, apply all the transformations to get a 'normalized' word. Compare the normalized password against a dictionary of real words twice, the second time with the password reversed.
You will probably need to do something a little more complicated than that, because you need to ignore numbers at the end of the word, but only sometimes. '1ncredible' can be a substitute for 'incredible', although 'ncredible' is not a word.
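One possible way to handle this is to try more than one normalization and accept a match on any of them. The sketch below is illustrative only: the substitution table and the tiny word set are assumptions, and a real check would use a full dictionary and a longer list of transformations.

```python
import re

# Assumed substitution table; extend it with whatever disguises you disallow.
SUBSTITUTIONS = str.maketrans({"$": "s", "1": "i", "0": "o", "3": "e", "@": "a"})

# Hypothetical stand-in for a real dictionary of words.
WORDS = {"universe", "incredible"}


def normalized_candidates(password):
    """Return the normalized forms to test: with and without trailing digits."""
    lowered = password.lower()
    without_trailing_digits = re.sub(r"\d+$", "", lowered)
    return {
        lowered.translate(SUBSTITUTIONS),
        without_trailing_digits.translate(SUBSTITUTIONS),
    }


def is_disguised_word(password, words=WORDS):
    """True if any candidate, read forwards or backwards, is a dictionary word."""
    for candidate in normalized_candidates(password):
        if candidate in words or candidate[::-1] in words:
            return True
    return False


for pw in ("Universe1", "Un1ver$e", "1ncredible", "esrevinu", "xk7#pq"):
    print(pw, is_disguised_word(pw))
```

Stripping trailing digits before substituting is what lets 'Universe1' match, while substituting without stripping is what lets '1ncredible' match, so both candidates are kept.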
If you inspect the code of http://howsecureismypassword.net you can see that the password is compared to a large array of common passwords.
On the page there is a reference to the page http://xato.net/passwords/more-top-worst-passwords/ which lists the top 10,000 most common passwords.
One approach would be to download that list and check the user's password against it, or at least against the top 100 entries.
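A minimal sketch of such a check, assuming the downloaded list has one password per line with the most common first. The sample list and function names below are hypothetical stand-ins; with a real file you would pass the open file object as `lines`.

```python
def load_common_passwords(lines, top_n=None):
    """Build a lookup set from an iterable of lines, optionally only the first top_n."""
    cleaned = [line.strip().lower() for line in lines if line.strip()]
    if top_n is not None:
        cleaned = cleaned[:top_n]
    return set(cleaned)


def is_common_password(password, common):
    """Case-insensitive membership test against the loaded set."""
    return password.lower() in common


# Hypothetical stand-in for the downloaded list.
SAMPLE_LINES = ["password\n", "123456\n", "qwerty\n", "letmein\n"]
common = load_common_passwords(SAMPLE_LINES, top_n=3)

print(is_common_password("Password", common))  # True
print(is_common_password("letmein", common))   # False, outside the top 3
```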
Does TermQuery.ExtractTerms result in a higher count when termvectors/positions/offsets are turned on (assuming that there is more than one occurrence of a match)? Conversely, with the inverted file info turned off, does ExtractTerms always return one and only one term?
EDIT: How and where does turning on termvectors in the schema affect scoring?
TermQuery.ExtractTerms extracts the terms in the query, not the result. So calling it on the query "foo:bar" will always yield exactly one term, regardless of what's in the index.
It sounds to me like you want to know about highlighting, not Query.ExtractTerms.
EDIT: Based on your comment, it sounds like you are asking: "how is scoring affected by term vectors?" The answer to that is: not at all. The term frequency, norm, etc. are calculated at index time, so it doesn't matter what you store.
The major exception is PhraseQuery with slop, which uses the term positions. A minor exception is that custom scoring classes can use whatever data they want, so not only term vectors but also payloads etc. can potentially affect the score.
If you're just doing TermQuerys though, what you store should have no effect.