How to build a simple inverted index? - indexing

I want to build a simple indexing function for a search engine without using any library such as Lucene. In the inverted index, I just need to record basic information about each word, e.g. docID, position, and frequency.
Now, I have several questions:
What kind of data structure is often used for building an inverted index? A multidimensional list?
After building the index, how do I write it to files? What format should the file use? Something like a table, as if I were drawing an index table on paper?

You can see a very simple implementation of inverted index and search in TinySearchEngine.
For your first question, if you want to build a simple (in-memory) inverted index, the straightforward data structure is a hash map like this:
val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]
or a Java-esque:
HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Posting>>();
The hash maps each term/word/token to a list of Postings. A Posting is just an object that represents an occurrence of a word inside a document:
case class Posting(docId:Int, var termFrequency:Int)
Indexing a new document is just a matter of tokenizing it (splitting it into tokens/words) and, for each token, inserting a new Posting into the correct list of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase the termFrequency. There are other ways of doing this. For in-memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert Postings once with the correct termFrequency instead of updating it every time.
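Here is a rough Java sketch of that idea (the class and method names are my own, not from any library); the Posting class mirrors the Scala case class above:

import java.util.*;

class Posting {
    final int docId;
    int termFrequency;
    Posting(int docId, int termFrequency) {
        this.docId = docId;
        this.termFrequency = termFrequency;
    }
}

class InMemoryIndex {
    // term -> list of postings, exactly the hash map described above
    private final Map<String, List<Posting>> invertedIndex = new HashMap<>();

    void indexDocument(int docId, String text) {
        // naive tokenization: lowercase and split on non-word characters
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Posting> postings = invertedIndex.computeIfAbsent(token, t -> new ArrayList<>());
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last != null && last.docId == docId) {
                last.termFrequency++;                // same doc: bump the term frequency
            } else {
                postings.add(new Posting(docId, 1)); // first occurrence of this term in this doc
            }
        }
    }

    List<Posting> lookup(String term) {
        return invertedIndex.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }
}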
Regarding your second question, there are normally two cases:
(1) you have an (almost) immutable index. You index all your data once, and if you get new data you can just reindex. There is no need for real-time indexing or for indexing many times per hour, for example.
(2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible.
For case (1), you can have at least 2 files:
1 - The Inverted Index file. It lists for each term all Postings (docId/termFrequency pairs). Here represented in plain text, but normally stored as binary data.
Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
...
TermN<docId5,termFreq><docId7,termFreq>
2 - The offset file. It stores, for each term, the offset at which its inverted list starts in the inverted index file. Here I'm representing the offset in characters, but you'll normally store binary data, so the offset will be in bytes. This file can be loaded into memory at startup time. When you need to look up a term's inverted list, you look up its offset and read the inverted list from the file.
Term1 -> 0
Term2 -> 126
Term3 -> 222
....
Along with these two files, you can (and generally will) have file(s) storing each term's IDF and each document's norm.
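As a minimal sketch of writing these two files (reusing the Posting class from the sketch above; the file names and binary layout here are just illustrative, not any standard format):

import java.io.*;
import java.util.*;

class IndexFilesSketch {
    // index: term -> postings, built in memory as shown earlier
    static void write(Map<String, List<Posting>> index, File dir) throws IOException {
        Map<String, Long> offsets = new TreeMap<>();
        try (DataOutputStream postingsOut = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(new File(dir, "postings.bin"))))) {
            long offset = 0;
            for (Map.Entry<String, List<Posting>> e : index.entrySet()) {
                offsets.put(e.getKey(), offset);           // remember where this term's list starts
                postingsOut.writeInt(e.getValue().size()); // how many postings follow
                offset += 4;
                for (Posting p : e.getValue()) {
                    postingsOut.writeInt(p.docId);
                    postingsOut.writeInt(p.termFrequency);
                    offset += 8;                           // two 4-byte ints per posting
                }
            }
        }
        // the offset file: one "term <tab> byte offset" line per term, loaded into memory at startup
        try (PrintWriter offsetsOut = new PrintWriter(new File(dir, "offsets.txt"))) {
            for (Map.Entry<String, Long> e : offsets.entrySet()) {
                offsetsOut.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}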
For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) do it.
The file format can be the same as explained above. The main difference is that when new documents are indexed, systems like Lucene don't rebuild the index from scratch; they create a new index containing only the new documents. So every time you have to index something, you do it in a new, separate index.
To perform a query over this split index, you run the query against each individual index (in parallel) and merge the results before returning them to the user.
Lucene calls these "little" indexes segments.
The obvious concern here is that you'll accumulate a lot of little segments very quickly. To avoid this, you need a policy for merging segments into larger ones. For example, if you have more than N segments you can decide to merge all segments smaller than 10 KB together.
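To make the query-and-merge step concrete, here is a toy sketch; the Segment interface and its scoring are made up for illustration and are nothing like Lucene's real segment API:

import java.util.*;
import java.util.concurrent.*;

// hypothetical interface: one small index that can answer a query with docId -> score
interface Segment {
    Map<Integer, Double> search(String query);
}

class SegmentedSearch {
    static Map<Integer, Double> search(List<Segment> segments, String query)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, segments.size()));
        List<Future<Map<Integer, Double>>> futures = new ArrayList<>();
        for (Segment segment : segments) {
            Callable<Map<Integer, Double>> task = () -> segment.search(query); // query each segment in parallel
            futures.add(pool.submit(task));
        }
        Map<Integer, Double> merged = new HashMap<>();
        for (Future<Map<Integer, Double>> f : futures) {
            f.get().forEach((docId, score) -> merged.merge(docId, score, Double::sum)); // combine per-segment hits
        }
        pool.shutdown();
        return merged; // sort by score before showing results to the user
    }
}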

Related

Is it possible to obtain, alter and replace the tfidf document representations in Lucene?

Hi guys,
I'm working on some ranking-related research. I would like to index a collection of documents with Lucene, take the tfidf representations (of each document) it generates, alter them, put them back in place, and observe how the ranking over a fixed set of queries changes accordingly.
Is there any non-hacky way to do this?
Your question is too vague to have a clear answer, especially regarding what you plan to do with:
take the tfidf representations (of each document) it generates, alter them
Lucene stores raw values for scoring:
CollectionStatistics
TermStatistics
Per term/doc pair stats: PostingsEnum
Per field/doc pair: norms
All this data is managed by Lucene and is used to compute a score for a given query term. A custom Similarity class can be used to change the formula that generates this score.
But you have to consider that a search query is made of multiple terms, and the way the scores of individual terms are combined can be changed as well. You could use existing Query classes (e.g. BooleanQuery, DisjunctionMaxQuery) but you could also write your own.
So it really depends on what you want to do with all of this, but note that if you want to change the raw values stored by Lucene, it is going to be rather hard. You'll have to write a custom Lucene codec and probably reimplement most of the query stack to take advantage of your new data.
One nice thing you should consider is the possibility of storing arbitrary byte[] payloads. This way you can store a value computed outside of Lucene and use it in a custom Similarity or query.
Please see the following tutorials: Getting Started with Payloads and Custom Scoring with Lucene Payloads; they may give you some ideas.

lucene index match

I am trying to use Lucene for doing an undup or dedup match. Essentially I have a file with records which I want to group based on certain fields (fuzzy search) and get back a result with a match key that tells me which records within that file matched each other.
Is this possible?
This can be done (if I understand correctly). You would index the terms/records that will be searched on in one pass. In a second pass, you search for each term and log the results.
While pre-processing the document you can generate a hash that aggregates those fields and store it (as NOT_ANALYZED). This way you only have to search one field with a known size; take a look at MessageDigest. This is what I normally do for duplicate detection of file content (since the content might be too big for a single query).
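For example, a minimal sketch of that field-hashing idea using java.security.MessageDigest (the normalization and field separator are my own choices):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MatchKey {
    // Concatenate the normalized match fields and hash them into one fixed-size key,
    // which can then be indexed as a single NOT_ANALYZED field.
    static String of(String... fields) throws NoSuchAlgorithmException {
        StringBuilder joined = new StringBuilder();
        for (String field : fields) {
            joined.append(field == null ? "" : field.trim().toLowerCase()).append('\u0001'); // field separator
        }
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(joined.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}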
If what you are looking for is building a more complex query, try using CachingWrapperFilter; that way subsequent calls to your deduplication algorithm will be much faster.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be a 10-digit number) is present in the dataset.
The approach I have planned:
1. Parse the dataset and put it in a DB (to be done at the start of the week), such as MySQL or Postgres. The reason I want an RDBMS in the first step is that I want to keep the full time-series data.
2. Generate some kind of key-value store from this database with only the latest valid data, which supports checking whether each item is present in the dataset (thinking of some kind of NoSQL DB here, like Redis, optimized for lookup; it should have persistence and be distributed). This data structure will be read-only.
3. Query this key-value store to find out whether each item is present (if possible, matching a list of values all at once instead of one item at a time). I want this to be blazing fast, and I will be using this functionality as the back-end of a REST API.
Side note: my language of preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use Redis SINTER, which performs set intersection.
You might benefit from a grid structure by distributing number ranges over some hash function, such as the first digit of the phone number (there are probably better ones; you have to experiment). With an even distribution over 10 nodes, this would reduce the size per node to roughly 20 million entries.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
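As a rough sketch of the batch membership check with Redis sets, assuming the Jedis client and made-up key names:

import java.util.List;
import java.util.Set;
import redis.clients.jedis.Jedis;

class PhoneLookup {
    // Assumes a set "phones:all" holding every active 10-digit number as a member.
    static Set<String> present(Jedis jedis, List<String> queryNumbers) {
        String tmpKey = "phones:query:tmp";
        jedis.del(tmpKey);
        jedis.sadd(tmpKey, queryNumbers.toArray(new String[0])); // stage the numbers to check
        Set<String> hits = jedis.sinter("phones:all", tmpKey);   // SINTER: numbers present in the dataset
        jedis.del(tmpKey);
        return hits;
    }
}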

What's the difference between an inverted index and a plain old index?

In software engineering we create indexes all the time (e.g., in databases) but I also hear a lot of people talk about inverted indices. Is there something fundamentally different between the two? They sound like the same thing.
One common use is "...to allow fast full-text searching."
The two types denote directionality. One takes you forward through the index, and the other takes you backward (the inverse) through the index. That's it. There's no mystery to uncover here. Otherwise the two types are identical, it's just a question of what information you have, and as a result what information you're trying to find.
To address your inquiry, I don't think there's actually a way to know why the use is what it is today. The only reason it's important to define which is forward and which one is inverted is so that we can all have a conversation about them, and everyone knows which direction we're talking about. Think about the terms "left" and "right": they are relative. Which is which doesn't matter, except that everyone needs to agree which one is "left" and which one is "right" in order for the words to have meaning. If, as a culture, we decided to flip left and right, then you'd have the same issue figuring out what a "right turn" vs a "left turn" is since the agreed upon meaning had changed. However, the naming is arbitrary, so which one is which (in and of itself) doesn't matter - what matters is that we all agree on the meaning.
In your comment where you ask, "please don't just define the terms", you're missing the point, and I think you're just getting hung up on the wording when there is absolutely no difference between them.
For the benefit of future readers, I will now provide several "forward" and "inverted" index examples:
Example 1: Web search
If you're thinking that the inverse of an index is something like the inverse of a function in mathematics, where the inverse is a special thing that has a different form, then you're mistaken: that's not the case here.
In a search engine you have a list of documents (pages on web sites), where you enter some keywords and get results back.
A forward index (or just index) is the list of documents, and which words appear in them. In the web search example, Google crawls the web, building the list of documents, figuring out which words appear in each page.
The inverted index is the list of words, and the documents in which they appear. In the web search example, you provide the list of words (your search query), and Google produces the documents (search result links).
They are both indexes - it's just a question of which direction you're going. Forward is from documents->to->words, inverted is from words->to->documents.
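In code, the two directions are literally just two maps; a toy sketch (not tied to any particular library):

import java.util.*;

class TwoDirections {
    public static void main(String[] args) {
        // forward index: document -> the words it contains
        Map<Integer, Set<String>> forward = new HashMap<>();
        forward.put(1, new HashSet<>(Arrays.asList("cat", "sat", "mat")));
        forward.put(2, new HashSet<>(Arrays.asList("cat", "ran")));

        // inverted index: word -> the documents it appears in (built by flipping the forward index)
        Map<String, Set<Integer>> inverted = new HashMap<>();
        forward.forEach((docId, words) ->
            words.forEach(word ->
                inverted.computeIfAbsent(word, w -> new HashSet<>()).add(docId)));

        System.out.println(inverted.get("cat")); // e.g. [1, 2]
    }
}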
Example 2: DNS
Another example is a DNS lookup (which takes a host name, and returns an IP address) and a reverse lookup (which takes an IP address, and gives you the host name).
Example 3: A book
The index in the back of a book is actually an inverted index, as defined by the examples above - a list of words, and where to find them in the book. In a book, the table of contents is like a forward index: it's a list of documents (chapters) which the book contains, except instead of listing the words in those sections, the table of contents just gives a name/general description of what's contained in those documents (chapters).
Example 4: Your cell phone
The forward index in your cell phone is your list of contacts, and which phone numbers (cell, home, work) are associated with those contacts. The inverted index is what allows you to manually enter a phone number, and when you hit "dial" you see the person's name, rather than the number, because your phone has taken the phone number and found you the contact associated with it.
They called it "inverted" simply because a forward index already exists. Take the example of a search engine: it is composed of two parts. The first part is the web crawler and parser, which builds an index from document to word; the second part is the search database, which builds an index from word to document. Because the first index already exists, we naturally call the second one the inverted index.
If you call the TOC (table of contents) of a book its index, then you should call the index at the end of the book the "inverted index". Or, looked at from the other side, you could call the TOC the inverted index.
Typically, when speaking about an index, you mean some precomputed values or stored results of procedures, kept in order to speed up an application (e.g. in MySQL or another RDBMS; consult the MySQL docs). Indexing can also be related to caching, etc.
An inverted index creates a file whose structure is primarily intended for (full-text) searching.
An inverted index consists of two main files:
Vocabulary
Occurrences
The vocabulary holds the words extracted from the text (after filtering out blacklisted words such as pronouns). The occurrences file holds the connections between words and documents (word1 appears in doc1 and doc2, but not in doc3), represented in the form of a matrix.
If you are further interested in this topic, I can recommend a great book by Ricardo Baeza-Yates - Modern Information Retrieval (see it on Amazon) - around page 200, I think.
Hope it helps :-)
normalocity has already wonderfully differentiated between a forward and an inverted index, but as for the question of why one is called a forward index and the other an inverted index, maybe this is why they are named that way:
Taking the example of a search engine crawling and indexing web pages (or building the index of a book), a forward index can be built as you go, while you are crawling the pages (or reading the book), i.e. while moving forward. So if you have 10 web pages to crawl (or 10 chapters in a book), you crawl the first page (read the first chapter), make a list of the words that appear in it, and continue the same process for the other pages (chapters). By the time you have crawled all 10 pages (read all 10 chapters), your forward index is complete, with each page (chapter) pointing to the list of words it contains.
But to build an inverted index, you first have to crawl all 10 pages (read all 10 chapters), and then take each word from each document's list and figure out which documents contain that word. This is like going backward once you have crawled the pages (read the chapters of the book), so it is called an inverted index.
This is just my speculation.
The term "Inverted Word Index" refers to the change in relationship of
a single-document containing many-words, to each unique word containing
(or identifying) a list of many-documents. This is effectively taking a One-to-Many Relationship (Docs to Words) and Inverting (or reversing) it such that a new "Inverted" One-to-Many Relationship now exists, which is each-unique-word relating to Many-Documents (i.e., all that contain that word). It's origin really is that simple, and the term "inverted index" was used to describe manual indexes of the same type long before computers and electronic high-speed indexing even existed (yes, admittedly, I'm an old, geezer programmer, almost old enough to have considered Grace Hopper a "sweet young lady" age appropriate for courting back when COBOL was a shiny new language). Please don't discard us geezers just yet, as we may occasionally provide a useful, and possibly even valuable, historical tid-bit or two - when our personal RAM is still working, that is. [grin]
There are many types of index: B-tree, R-tree, hash, and so on. For different purposes, you must choose the right one.
An inverted index is a special kind, usually used in full-text search engines. With an inverted index you can find out where a word is located in a document (or set of documents) as fast as possible. Given the limits of memory and CPU, other kinds of index can't do this job.
You can read the Lucene documentation for more details. It's an open-source search engine. http://lucene.apache.org/java/docs/index.html
In inverted indexes, we have the following form:
word1 -> list of docs it occurs in (sorted order)
word2 -> list of docs it occurs in (sorted order)
It is very useful for search engine query processing, as it allows us to find the docs that a word occurs in.
You can use supervised machine learning to build this inverted index.
One more difference:
Handling updates with an inverted index is expensive compared with a forward index.
A forward index handles updates easily by reflecting the changes only in the corresponding document's entry, whereas in an inverted index the same change has to be reflected in multiple positions across the index.

How does Lucene work

I would like to find out how Lucene search works so fast. I can't find any useful docs on the web. If you have anything (short of the Lucene source code) to read, let me know.
A text search query using MySQL 5 full-text search with an index takes about 18 minutes in my case. A Lucene search for the same query takes less than a second.
Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact I think you'll find the big databases can do a simple string-equality query very quickly in that case.
Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly. And it need not optimize for updates to existing documents.
However, at the end of the day, if you really want to know, you need to read the source. Both things you reference are open source, after all.
Lucene creates a big index. The index contains the word ID, the number of docs in which the word is present, and the positions of the word in those documents. So when you issue a single-word query it just searches the index (O(1) lookup). The results are then ranked using different algorithms. For a multi-word query, it just takes the intersection of the sets of files in which the words are present (see the sketch after this answer).
Thus Lucene is very very fast.
For more info read this article by Google developers- http://infolab.stanford.edu/~backrub/google.html
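Here is the sketch mentioned above: a minimal merge-style intersection of two sorted docId lists, the classic way to AND posting lists (a simplified illustration, not Lucene's actual code):

import java.util.*;

class PostingIntersection {
    // Both lists must be sorted in ascending docId order.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int docA = a.get(i), docB = b.get(j);
            if (docA == docB) { result.add(docA); i++; j++; } // doc contains both words
            else if (docA < docB) i++;                        // advance the list with the smaller docId
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> word1Docs = Arrays.asList(1, 3, 5, 9, 10);
        List<Integer> word2Docs = Arrays.asList(3, 5, 7, 10);
        System.out.println(intersect(word1Docs, word2Docs)); // [3, 5, 10]
    }
}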
In a word: indexing.
Lucene creates an index of your document that allows it to search much more quickly.
It's the same difference between a list O(N) data structure and a hash table O(1) data structure. The list has to walk through the entire collection to find what you want. The hash table has an index that lets it figure out exactly where the desired item is and simply fetch it.
Update:
I'm not certain what you mean by "Lucene index searches are a lot faster than mysql index searches."
My guess is that you're using MySQL "WHERE document LIKE '%phrase%'" to search for a document. If that's true, then MySQL has to do a table scan on every row, which will be O(N).
Lucene gets to parse the document into tokens, group them into n-grams at your direction, and calculate indexes for each one of those. It's O(1) to find a word in an indexed Lucene document.
Lucene works with term frequency and inverse document frequency. It creates an index mapping each word to the documents it appears in, together with its frequency count, which is nothing but an inverted index over the documents.
Example :
File 1 : Random Access Memory is the main memory.
File 2 : Hard disks are secondary memory.
Lucene creates a reverse index something like this:
File 1 :
Term : Random
Frequency : 1
Position : 0
Term : Memory
Frequency : 2
Position : 2
Position : 6
So it is able to search and retrieve the matching content quickly. When there are too many matches for the search query, it orders the results based on weight. Consider the search query "Main Memory": it searches for each word individually, and the result would look like this:
Main
File 1 : Frequency - 1
Memory
File 1 : Frequency - 2
File 2 : Frequency - 1
The result would be File 1 followed by File 2. To avoid being carried away by the weights of very common words like 'and', 'or', and 'the', it factors in the inverse document frequency (i.e. it decreases the weight of words that are most common across the document set).
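To make that weighting concrete, here is a toy calculation using one common tf-idf formulation, tf * ln(1 + N/df), applied to the two files above; Lucene's actual scoring formula is more elaborate:

class TfIdfToy {
    public static void main(String[] args) {
        int totalDocs = 2;
        // smoothed idf = ln(1 + N / df); "main" is in 1 doc, "memory" is in both
        double idfMain   = Math.log(1.0 + (double) totalDocs / 1);
        double idfMemory = Math.log(1.0 + (double) totalDocs / 2);

        // score = sum over query terms of tf * idf, for the query "Main Memory"
        double file1 = 1 * idfMain + 2 * idfMemory; // "main" once, "memory" twice in File 1
        double file2 = 0 * idfMain + 1 * idfMemory; // only "memory", once, in File 2

        System.out.println("File 1: " + file1); // ~2.48, ranked first
        System.out.println("File 2: " + file2); // ~0.69, ranked second
    }
}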