I know that for its search capabilities IDEA builds an inverted
index of all tokens (words).
For instance, for "Find in Files" and regex search it uses
a trigram index (see Wikipedia
and the IDEA sources).
I also know that this index can be really huge,
so it must be stored on disk,
because it cannot fully fit into RAM,
and it has to be loaded into RAM quickly
when a search action is executed.
I have found that they use an externalization
approach (see the IDEA sources) to serialize
and deserialize index data in the index implementations.
Questions:
1. Does IDEA cache indexes in memory, or does it load index data for each search action?
2. If the answer to (1) is yes, how does IDEA decide which indexes to keep in memory and which to evict? In other words, which cache replacement policy is used?
3. Where is the code in the repository that stores and reads the index on disk?
4. (optional) What is the format of the indexes stored on disk? Is there any documentation?
I will try to post my answers in the same order.
After going through the entire project we write all the forward and inverted indexes to disk. When a user edits a file in the IDE, they are changing the contents of the Document representation (stored in memory) but not the contents of the VirtualFile (which is stored on disk). To deal with this, there are large indices on disk that reflect the state of the physical files (the VirtualFile representation), and for the Document and PsiFile representations there is an additional in-memory index. When an index is queried, the in-memory index, being the most up-to-date, is interrogated first; then the remaining keys are retrieved from the main on-disk indices and the cache.
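To make that lookup order concrete, here is a purely illustrative sketch (not IDEA's actual code) of a small in-memory index layered over a large on-disk one:

    import java.util.HashMap;
    import java.util.Map;

    // Purely illustrative: a small in-memory index layered over a large on-disk one.
    class LayeredIndex<K, V> {
        private final Map<K, V> inMemory = new HashMap<>(); // reflects unsaved Document/PsiFile state
        private final Map<K, V> onDisk;                      // reflects VirtualFile state on disk

        LayeredIndex(Map<K, V> onDisk) {
            this.onDisk = onDisk;
        }

        // The most up-to-date (in-memory) data wins; everything else
        // falls back to the persistent index.
        V lookup(K key) {
            V fresh = inMemory.get(key);
            return fresh != null ? fresh : onDisk.get(key);
        }

        // Called when the editor changes a document that has not been saved yet.
        void onDocumentChanged(K key, V value) {
            inMemory.put(key, value);
        }
    }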
Indexes stored on disk can be found in the IDE system directories: https://intellij-support.jetbrains.com/hc/en-us/articles/206544519-Directories-used-by-the-IDE-to-store-settings-caches-plugins-and-logs
I suggest going through the usages of the methods of com.intellij.util.indexing.IndexInfrastructure and com.intellij.util.indexing.FileBasedIndex; these classes work with file paths and have methods for working with and reading from indexes.
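As a rough sketch of what a query through that API looks like (MY_INDEX_ID is a hypothetical identifier; real ones are registered by FileBasedIndexExtension implementations):

    import com.intellij.openapi.project.Project;
    import com.intellij.openapi.vfs.VirtualFile;
    import com.intellij.psi.search.GlobalSearchScope;
    import com.intellij.util.indexing.FileBasedIndex;
    import com.intellij.util.indexing.ID;

    class IndexQueryExample {
        // Hypothetical index id; real ids come from FileBasedIndexExtension implementations.
        static final ID<String, Void> MY_INDEX_ID = ID.create("my.sample.index");

        static void printFilesContaining(Project project, String key) {
            FileBasedIndex index = FileBasedIndex.getInstance();
            // The platform resolves the query against the on-disk index plus
            // any in-memory data for unsaved documents.
            for (VirtualFile file : index.getContainingFiles(MY_INDEX_ID, key,
                    GlobalSearchScope.allScope(project))) {
                System.out.println(file.getPath());
            }
        }
    }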
The contents of the /index directory are project-dependent.
Additionally: if a user edits a file, we don't create indices for it until we need them, for example until the value of a specific key is requested. If the findUsages command is called while a file is being edited, the additional indexing occurs only at that moment. However, a situation like that is almost impossible, since files are written to disk quite frequently and global indexing is run on changes.
If I insert a document and, on the next line of code, search it by one of its fields (other than Id), will I find it? Or do I have to wait for some indexing to happen?
Microsoft provides clear documentation on the different indexing strategies available and how to use them. The information below is a summary.
Cosmos DB has multiple indexing strategies. By default, it's set to consistent, which means that documents are indexed as they are placed into the collection. New documents should be immediately available for querying. You are free to switch this to lazy indexing mode, which indexes when it's more convenient for the database.
It's good to know that with consistent indexing turned on, you will observe a higher RU cost per insert/upsert because the cost of indexing is included. So whether consistent or lazy indexing makes sense for you depends on the nature of the app you're building.
You can check the type of indexing you're using in the portal and tune indexing by including or excluding specific JSON paths in your documents. This is a really powerful and cool feature in Cosmos. By default, the indexing mode is consistent and a path of /* indicates that all JSON properties are covered by the index.
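For reference, the default indexing policy shown in the portal looks roughly like the JSON below (exact property names can vary between API versions); switching indexingMode to lazy or trimming includedPaths is how you tune it:

    {
      "indexingMode": "consistent",
      "automatic": true,
      "includedPaths": [
        { "path": "/*" }
      ],
      "excludedPaths": []
    }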
Does anybody have experience with large synonym files for the SynonymFilterFactory? We want to write down functional requirements for a new project (grouping the search results by facets with hierarchical synonyms) but have no experience of our own.
How much will the indexing time increase per document? What is a common size for synonym files, and what size should such a file not exceed?
I think you'll be pleasantly surprised; Solr can handle some decent-sized lists: https://issues.apache.org/jira/browse/LUCENE-3233
That said, the only way to know if your particular use case will behave according to your particular requirements is to test it.
One thing, though: if you're using configsets stored in ZooKeeper (SolrCloud), the maximum file size in the default ZK config is 1 MB. If your synonym file exceeds that, you'll need to chop it up, not store it in ZK, or change the jute.maxbuffer setting in your ZK config.
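For a sense of what such a file contains: a synonyms.txt for the SynonymFilterFactory is just a plain-text list. A minimal illustration (entries invented for the example):

    # Comma-separated groups are treated as equivalent terms.
    couch, sofa, divan
    TV, television

    # Explicit mappings rewrite the left-hand side to the right-hand side,
    # which is one way to approximate hierarchical synonyms.
    laptop => laptop, notebook, portable computer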
I have these 3 files in a folder and they are all related to an index created by Lucene:
_0.cfs
segments_2
segments.gen
What are they all used for, and is it possible to convert any of them to a human-readable format to discern a bit more about how Lucene works with its indexes?
The two segments files store information about the segments, and the .cfs file is a compound file that bundles the other per-segment index files (such as the term index, stored fields, and deletions files).
For documentation of the different types of files used in a Lucene index, see this summary of file extensions.
Generally, no, Lucene files are not human readable. They are designed more for efficiency and speed than human readability. The way to get a human readable format is to access them through the Lucene API (via Luke, or Solr, or something like that).
If you want a thorough understanding of the file formats in use, the codecs package would be the place to look.
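As a starting point, a small sketch of reading an index through the API (assuming a Lucene version compatible with the index on disk) might look like this:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.store.FSDirectory;

    public class InspectIndex {
        public static void main(String[] args) throws Exception {
            // Open the directory containing segments_N, *.cfs, etc.
            try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                System.out.println("Live documents: " + reader.numDocs());
                // One leaf per segment; each segment corresponds to a set of index files.
                for (LeafReaderContext leaf : reader.leaves()) {
                    System.out.println("Segment with " + leaf.reader().numDocs() + " docs, "
                            + leaf.reader().getFieldInfos().size() + " fields");
                }
            }
        }
    }

The CheckIndex tool bundled with lucene-core also prints a detailed, human-readable report of the segments when pointed at the index directory.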
In the sample installation and configuration instructions, it is suggested that OpenGrok requires two staging areas, the rationale being that one area is an index-regeneration work area and the other is a production area, and they are rotated with every index regeneration.
Is that really necessary? Can I only have one area instead of two?
I'm looking for an answer that is specific to opengrok, and not a general list of race conditions one might encounter.
Strictly speaking, this is not necessary. In fact, I am pretty sure the overwhelming majority of deployments are without a staging area.
That said, you need to decide whether you are comfortable with a window of inconsistency that could result in some failed or imprecise searches. Let's assume that the source was updated (e.g. via git pull in the case of Git) and the indexer has not finished processing the new changes yet. Thus, the index still contains data reflecting the old state of the source. Say the changes applied to the source removed a file. Now, if someone initiates a search that matches the contents of the removed file, the search will probably end with an error. That is probably the better alternative - consider the case where a more subtle change is made to a file, such as the removal or addition of a couple of lines of code. In that case the symbol definitions will be off, so the search results will bring you to the wrong line of code. Or, with a not-so-subtle change, when e.g. a function definition is removed from a file, the search results for references to this function will point to invalid places.
The length of the inconsistency window stems from the indexing time, which currently depends largely on two things:
size of the changes applied to the source
size of the source directory tree
The first is relevant because of history processing. The more incoming history changes there are (e.g. changesets in Git), the more work the indexer will have to do to generate the history cache and/or history fields for the index (assuming history handling is on).
The second is relevant because the indexer traverses the whole source directory tree to find out which files have changed, which might incur lots of syscalls and potentially lots of I/O, at least until https://github.com/oracle/opengrok/issues/3077 is implemented; and that will help only source code management systems based on changesets.
From some blogs and the Lucene website, I know Lucene uses the "skip list" data structure in its inverted index. But I am puzzled about a few things.
1. In general, a skip list is used in memory, but the inverted index is stored on disk. So how does Lucene use it when searching the index? Does it just scan it on disk, or load it into memory?
2. A skip list's insert operation often uses random(0,1) to decide whether to promote an entry to the next level, but in the Lucene introduction it seems there is a fixed interval between entries. So does Lucene create its "skip list" differently or not?
Please correct me if I am wrong.
Even though the index is persisted on disk, Lucene uses memory in a couple of different ways when the IndexReader is created for searching and for operations like sorting (the field cache):
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
Basically, those binary files get copied into RAM for much faster scanning and reduced I/O. The link above gives a hint of how searching with some parameters can force Lucene to "skip terms in searching"; hence, that is where the skip-list data structure can be used.
Lucene is open source, so you can see for yourself what is being used in the Java code, or in Lucene.NET for the C# implementation.
See also: "To accelerate posting list skips, Lucene uses skip lists."
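To see where the skip data shows up in practice: as an API consumer you never touch the skip list directly; it is what makes PostingsEnum.advance() cheap. A rough sketch (the index path, field, and term are placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.FSDirectory;

    public class SkipDemo {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                // Look at the first segment only, for simplicity.
                LeafReader leaf = reader.leaves().get(0).reader();
                PostingsEnum postings = leaf.postings(new Term("body", "lucene"));
                if (postings != null) {
                    // Jump to the first document >= 1000; the codec can consult the
                    // on-disk skip data instead of decoding every posting before it.
                    int doc = postings.advance(1000);
                    if (doc != DocIdSetIterator.NO_MORE_DOCS) {
                        System.out.println("first match at or after doc 1000: " + doc);
                    }
                }
            }
        }
    }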