Understanding Lucene segments

I have these 3 files in a folder and they are all related to an index created by Lucene:
_0.cfs
segments_2
segments.gen
What are they all used for, and is it possible to convert any of them to a human-readable format to discern a bit more about how Lucene works with its indexes?

The two segments files store information about the segments, and the .cfs is a compound file that bundles the other per-segment index files (term index, stored fields, deletions, and so on).
For documentation of the different file types that make up a Lucene index, see the summary of file extensions in the Lucene documentation.
Generally, no, Lucene files are not human-readable; they are designed for efficiency and speed rather than readability. The way to get a human-readable view is to go through the Lucene API (via Luke, or Solr, or something like that).
If you want a thorough understanding of the file formats in use, the codecs package would be the place to look.
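If you just want to see what the segments_N file records, a small program against the Lucene API can print it. Here is a minimal sketch, written against a recent Lucene release, so class names may differ from the (much older) version that produced these particular files:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SegmentDump {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            // Read the latest segments_N file and list each segment it references.
            SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
            System.out.println("Index version: " + infos.getVersion());
            for (SegmentCommitInfo si : infos) {
                System.out.println("segment " + si.info.name
                        + ": " + si.info.maxDoc() + " docs"
                        + ", compound=" + si.info.getUseCompoundFile());
            }
        }
    }
}
```

For this index it would report one segment (_0) stored as a compound file.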

Related

How does IntelliJ IDEA store search index on disk?

I know that for its search capabilities IDEA builds an inverted index of all tokens (words). For instance, for "Find in files" and regex search it uses a trigram index (see the Wiki and the IDEA sources).
I also know that this index can be really huge, so it must be stored on disk, because it cannot fully fit into RAM; yet it needs to be loaded into RAM rapidly when a search action is executed.
I have found that they use an externalization approach (see the IDEA sources) to serialize and deserialize index data in the implementation of the indexes.
Questions:
Does IDEA cache the index in memory, or does it load index data for each search action?
If (1.) is true, how does IDEA decide which indexes to keep in memory and which to evict? In other words, which cache-replacement policy is used?
Where in the repository is the code that stores and reads the index on disk?
(optional) What is the format of the indexes stored on disk? Is there any documentation?
I will try to post my answers in the same order.
After going through the entire project we write all the forward and inverse indexes to disk. When a user edits a file in the IDE, they are changing the contents of the Document representation (stored in memory) but not the contents of VirtualFile (which is stored on disk). To deal with this there are large indices on disk that reflect the state of physical files (the VirtualFile representation), and for Document and PsiFile representations there is an additional in-memory index. When an index is queried, the in-memory index, being the most up-to-date, is interrogated first, then the remaining keys are retrieved from the main on-disk indices and the cache.
The on-disk indexes are located in the IDE system directories; see https://intellij-support.jetbrains.com/hc/en-us/articles/206544519-Directories-used-by-the-IDE-to-store-settings-caches-plugins-and-logs
I suggest going through the usages of the methods of com.intellij.util.indexing.IndexInfrastructure and com.intellij.util.indexing.FileBasedIndex; these classes work with the file paths and have methods for working with and reading from the indexes.
The contents of the /index directory are project-dependent.
Additionally: if a user edits a file, we don't create indices for it until we need them, for example until the value of a specific key is requested. If the findUsages command is called while a file is being edited, additional indexing occurs only at that moment. However, such a situation is almost impossible in practice, since files are written to disk quite frequently and global indexation runs on changes.
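To make the layering concrete, here is a toy sketch (my own illustration, not IDEA's actual code) of an inverted index where an in-memory layer for unsaved edits shadows the large on-disk layer:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of the layered lookup described above -- not IDEA's actual
// implementation. Edited-but-unsaved Documents live in a small in-memory
// inverted index that shadows the big on-disk index for those files.
class LayeredInvertedIndex {
    private final Map<String, Set<String>> onDisk = new HashMap<>();   // token -> files (persisted state)
    private final Map<String, Set<String>> inMemory = new HashMap<>(); // token -> files (unsaved edits)
    private final Set<String> dirtyFiles = new HashSet<>();            // files whose on-disk entries are stale

    // Record an unsaved edit: its tokens go into the in-memory layer,
    // and its stale on-disk entries are shadowed from now on.
    void recordEdit(String file, Iterable<String> tokens) {
        dirtyFiles.add(file);
        for (String token : tokens) {
            inMemory.computeIfAbsent(token, t -> new HashSet<>()).add(file);
        }
    }

    // Query the in-memory index first (most up-to-date), then fall back
    // to the on-disk index for files that have no pending edits.
    Set<String> search(String token) {
        Set<String> hits = new HashSet<>(inMemory.getOrDefault(token, Set.of()));
        for (String file : onDisk.getOrDefault(token, Set.of())) {
            if (!dirtyFiles.contains(file)) {
                hits.add(file);
            }
        }
        return hits;
    }
}
```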

Solr indexing parquet file

I have a Solr instance up and running, and it needs to index Parquet files. Right now I am converting the Parquet files to flat text files and having Solr index those. I'd like to know if it is possible for Solr to consume Parquet files directly.
Thanks
Directly: no, not possible.
If you want something more integrated than what you are doing now (converting to text and then indexing might already be good enough), there are two ways to go:
Write specialized code around DIH: you can probably implement a custom DataSource, so you could use DIH to do the indexing.
Just write some Java code using SolrJ that reads your file and indexes it into Solr, as in the sketch below.
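A minimal sketch of the SolrJ route, assuming the parquet-avro library for reading, the SolrJ 8.x HttpSolrClient, and a hypothetical core named "mycore" with "id" and "text" fields (adjust all names to your schema; error handling and batching omitted):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParquetToSolr {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/mycore").build();
             ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path(args[0])).build()) {
            GenericRecord record;
            // Stream records out of the Parquet file and push each one
            // into Solr as a document.
            while ((record = reader.read()) != null) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", record.get("id").toString());
                doc.addField("text", record.get("text").toString());
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```

This keeps everything in one process, which is roughly what a custom DIH DataSource would buy you, with less plumbing.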

What is the advantage of Lucene's compound file

What is the difference between keeping the index in a single file, like Lucene's compound file, and keeping a separate file for each kind of index data?
It uses fewer files to keep the index, so it helps avoid the "too many open files" issue (which on Unix you manage with ulimit).
It is also somewhat slower.
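For what it's worth, whether new segments are packed into a compound file is a per-writer setting. In recent Lucene versions the toggle looks roughly like this:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CompoundFileToggle {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // true: pack each new segment into a single .cfs (fewer open files);
        // false: keep the per-segment files separate (a bit faster).
        config.setUseCompoundFile(true);
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            writer.commit();
        }
    }
}
```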

Lucene indexing for structured documents where each text line has meta-data

I have a document structure where each text line in the document has some meta-data associated with it. The search result must show the line and the meta-data for the line.
Currently I am storing each such line as a Lucene document, with the meta-data in one of the non-indexed (stored-only) fields. That is, I create and add a Lucene Document for each line. My concern is that I may end up with too many documents in the index.
Is there a more elegant approach?
Thanks
Personally I'd index the documents as normal, and figure out the metadata / line number later.
There is no question about whether Lucene can cope with that many documents; however, it might degrade the search results somewhat. For example, you can normally search for multiple terms in close proximity to each other, but this obviously won't work when the terms are split across multiple documents (lines).
How many is "too many"? Lucene has been known to handle hundreds of millions of records in a single index, so I doubt you will have a problem. That said, there's no substitute for testing and benchmarking with your own data to see if this approach meets your needs.
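For reference, the per-line approach from the question looks roughly like this in the current Lucene API (field names here are just placeholders):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class LineIndexer {
    // One Lucene Document per text line: the line itself is indexed and
    // stored, while the meta-data is stored but never indexed, so it
    // travels with each hit without growing the inverted index.
    static void addLine(IndexWriter writer, String line, String metadata)
            throws IOException {
        Document doc = new Document();
        doc.add(new TextField("line", line, Field.Store.YES));
        doc.add(new StoredField("metadata", metadata));
        writer.addDocument(doc);
    }
}
```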

Information Retrieval database formats?

I'm looking for some documentation on how Information Retrieval systems (e.g., Lucene) store their indexes for speedy "relevancy" lookups. My Google-fu is failing me: I've found a page that describes Lucene's file format, but it focuses more on how many bits each number occupies than on how the database is used to produce speedy queries.
Surely someone has some useful bookmarks lying around that they can refer me to.
Thanks!
The Lucene index is an inverted index, so any material on that topic should be relevant, for example:
http://en.wikipedia.org/wiki/Inverted_index
http://www.ibm.com/developerworks/library/wa-lucene/
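To make the idea concrete, here is a toy inverted index. Lucene's on-disk version adds sorted postings lists, skip lists, and heavy compression, but the lookup principle is the same: each term maps directly to the documents containing it, so a query never scans the documents themselves.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyInvertedIndex {
    // term -> set of document ids containing that term (the "postings").
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
            }
        }
    }

    // AND query: intersect the postings lists of the two terms.
    public Set<Integer> search(String a, String b) {
        Set<Integer> result = new HashSet<>(postings.getOrDefault(a, Set.of()));
        result.retainAll(postings.getOrDefault(b, Set.of()));
        return result;
    }
}
```

Ranking ("relevancy") is then computed over only the documents the intersection returns, which is what makes the lookups fast.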