Solr: What size should a synonym file not exceed?

Does anybody have experience with large synonym files for the SynonymFilterFactory? We want to write down functional requirements for a new project (grouping search results by facets with hierarchical synonyms), but we have no experience of our own.
How much will the indexing time per document increase? What is a common file size for synonym files, and what size should such a file not exceed?

I think you'll be pleasantly surprised; Solr can handle some decent-sized lists: https://issues.apache.org/jira/browse/LUCENE-3233
That said, the only way to know if your particular use case will behave according to your particular requirements is to test it.
One thing though: if you're using configsets stored in ZooKeeper (SolrCloud), the maximum file size in the default ZK config is 1 MB. If your synonym file exceeds that, you'll need to split it up, store it outside of ZK, or change the jute.maxbuffer setting in your ZK config.
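For reference, a minimal sketch of a field type that wires a synonym file into the analysis chain at query time; the field type name and file name are placeholders, and recent Solr versions recommend SynonymGraphFilterFactory as the replacement for SynonymFilterFactory:

```xml
<!-- Sketch only: a text field that applies synonyms.txt at query time.
     Applying synonyms at query time keeps indexing speed and index size
     unaffected by the size of the synonym list. -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```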

How does IntelliJ IDEA store its search index on disk?

I know that for its search capabilities IDEA builds an inverted index of all tokens (words). For instance, for "Find in Files" and regex search it uses a trigram index (see the Wiki and the IDEA sources). I also know that this index can be really huge, so it must be stored on disk because it cannot fully fit into RAM, and it should be loaded into RAM quickly when a search action is executed. I have found that they use an externalization approach (see the IDEA sources) to serialize and deserialize index data in the index implementation.
Questions:
1. Does IDEA cache the index in memory, or does it load index data for each search action?
2. If (1.) is true, how does IDEA decide which indexes to keep in memory and which should be cleared? In other words, which cache replacement policy is used?
3. Where in the repository is the code that stores and reads the index on disk?
4. (Optional) What is the format of the indexes stored on disk? Is there any documentation?
I will try to post my answers in the same order:
After going through the entire project, we write all the forward and inverted indexes to disk. When a user edits a file in the IDE, they are changing the contents of the Document representation (stored in memory) but not the contents of the VirtualFile (which is stored on disk). To deal with this, there are large indices on disk that reflect the state of the physical files (the VirtualFile representation), and for the Document and PsiFile representations there is an additional in-memory index. When an index is queried, the in-memory index, being the most up to date, is interrogated first; the remaining keys are then retrieved from the main on-disk indices and the cache (a toy sketch of this lookup order follows after this answer).
The indexes located on disk can be found in the IDE system directories: https://intellij-support.jetbrains.com/hc/en-us/articles/206544519-Directories-used-by-the-IDE-to-store-settings-caches-plugins-and-logs
I suggest going through the usages of the methods of com.intellij.util.indexing.IndexInfrastructure and com.intellij.util.indexing.FileBasedIndex; these classes work with the file paths and have methods for working with and reading from the indexes.
The contents of the /index directory are project-dependent.
Additionally: if a user edits a file, we don't create indices for it until we need them, for example until the value of a specific key is requested. If the findUsages command is called while a file is being edited, additional indexing occurs only at that moment. However, such a situation almost never arises, since files are written to disk quite frequently and global indexing runs on changes.
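To make the lookup order described in the first answer concrete, here is a purely illustrative Java sketch; none of these types exist in the IntelliJ Platform API, it only mirrors the "in-memory index first, then on-disk index" behaviour described above.

```java
import java.util.HashSet;
import java.util.Set;

// Purely illustrative -- not the IntelliJ Platform API.
interface TokenIndex {
    Set<String> filesContaining(String token);
}

class LayeredTokenIndex implements TokenIndex {
    private final TokenIndex inMemory; // reflects unsaved Document/PsiFile state
    private final TokenIndex onDisk;   // reflects persisted VirtualFile state

    LayeredTokenIndex(TokenIndex inMemory, TokenIndex onDisk) {
        this.inMemory = inMemory;
        this.onDisk = onDisk;
    }

    @Override
    public Set<String> filesContaining(String token) {
        // The most up-to-date (in-memory) index is consulted first,
        // then the remaining hits come from the large on-disk index.
        Set<String> result = new HashSet<>(inMemory.filesContaining(token));
        result.addAll(onDisk.filesContaining(token));
        return result;
    }
}
```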

Media file path generation

Is there a way to migrate existing media files into the new structure after changing

    shopware:
        cdn:
            strategy: id

to

    shopware:
        cdn:
            strategy: plain

and are there any required steps for new uploads to be stored the "plain" way? As far as I can tell, my newly uploaded files are not affected by the config change.
Additionally: are there any drawbacks of using the plain strategy?
My reasoning behind the change is to speed up an rsync of ~40 GB of files when they are stored on one level as public/media/<filename.xy> instead of the "default" nested approach. Would that even gain me any speed?
As far as I know there is no readily available method to migrate existing files when changing the strategy.
The whole idea of the id strategy is to make lookups faster, so the drawback of using the plain strategy would be a performance loss with a huge number of files in a single directory.
While rsync probably doesn't benefit from the id strategy, since it has to traverse the directories anyway, I couldn't find any reports of the number of directories impacting the speed in a significant way.

Limitation of an index entry on PostgreSQL

I was reading this section of the PostgreSQL documentation. I got to this sentence and I can't understand the concept behind it:
The only limitation is that an index entry cannot exceed approximately
one-third of a page (after TOAST compression, if applicable)
I want to know the underlying reason for this limitation. What is the "page" mentioned above? (Is it the same kind of page as in the journal of, for example, the ext4 file system?) Why does an index entry have this limitation?
Is there any resource to give a comprehensive understanding of these concepts?
Update: the book Database Internals gives some deep insights into the design of database systems and also answers this question.
A page is the same thing as a database block. The size of a database block is 8 kB by default. You can change it at compile time, but this is seldom done.
You can see it in the Database block size: line of the pg_controldata output, or from within a running server with show block_size;.
The reasoning here is that you must be able to store enough information on the block/page so that it can have a fan-out factor greater than one.
The page referred to is the database page that holds the index data.
Basically, each page needs to hold several comparison values for the index to be useful; limiting an entry to about a third of a page guarantees that at least two or three values fit on each page (with the default 8 kB page, that is roughly 2.7 kB per entry).
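To put numbers on this, a quick sketch against a running server (8192 bytes is the default block size; the one-third figure is approximate):

```sql
-- Page (block) size of this cluster; 8192 bytes unless changed at compile time.
SHOW block_size;

-- Rough upper bound for a single index entry: about a third of a page,
-- i.e. roughly 2730 bytes with the default 8 kB page.
SELECT current_setting('block_size')::int / 3 AS approx_max_index_entry_bytes;
```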

MongoDB - Change document structure

I'm working with a MongoDB database and, due to the high consumption of resources (I work with a dataset of almost 100 GB), I need to shrink the field names of the documents (something like an "ALTER TABLE").
Is there an easy / automatic way to do this?
I think so! Check out $rename: http://www.mongodb.org/display/DOCS/Updating#Updating-%24rename
Run an update() on your data set with a bunch of $rename operations and I think that will get you what you want.
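A rough sketch of what such an update could look like in the mongo shell; the collection and field names here are made up, and multi: true makes it touch every matching document:

```javascript
// Illustrative only: shorten two verbose field names across the whole collection.
db.mycollection.update(
    {},                                              // match every document
    { $rename: { "customerFullName": "cn",
                 "transactionTimestamp": "ts" } },   // old name -> new name
    { multi: true }                                  // apply to all matches, not just the first
);
```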
There's no built-in way to do this, though you could write a script in your preferred language to do so. Another alternative is to update your application code to rewrite documents to use shorter field names when the documents are accessed, which has the advantage of not requiring downtime or coordination of the script and your application code.
Note that even once you shrink the field names, your data set will remain the same size -- MongoDB will update the documents in place, leaving free space "around" the documents, so you may not see a reduction in your working set size. This may be advantageous if you expect your documents to grow, as MongoDB will update in place when a document grows if there is enough free space to fit the new document.
Alternatively, you can use the repairDatabase command, which will shrink your data set. repairDatabase can be quite slow, and requires quite a bit of free disk space (it has to make a full copy of the entire database). repairDatabase also locks the entire database, so you should run this during a scheduled maintenance window.
Finally, if you are using version 1.9 or newer, you can use the compact command. compact requires less free space than repairDatabase (it needs about an additional 2 gigabytes of disk space) and operates on only a single collection at a time. compact locks the database in the same way as repairDatabase, and the same warning about scheduling compaction during a maintenance window applies.
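For reference, hedged shell examples of the two maintenance commands mentioned above, run against placeholder names:

```javascript
// Rewrites and compacts the whole current database (slow, needs lots of free disk space).
db.repairDatabase();

// Compacts a single collection (MongoDB 1.9+); "mycollection" is a placeholder.
db.runCommand({ compact: "mycollection" });
```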

Which would be better: storing/accessing data in a local text file, or in a database?

Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?
The database will give you the best performance with the least amount of work. The built-in index support and query analyzers will give you good performance for free, while a text file might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface that hides the difference between a database and a flat file. Later on, you can benchmark which one provides the best performance, but I think the database will give you the best bang per hour of development.
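A minimal sketch of that generic interface, assuming a Java backend (all names here are illustrative): callers only see WordSource, so the flat-file implementation could later be swapped for a database-backed one (e.g. SELECT word FROM Words5 WHERE word LIKE 'r__n') without touching the rest of the site.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Illustrative sketch only; names are not from the question.
interface WordSource {
    /** Returns every word matching a puzzle pattern such as "r??n" ('?' = any single letter). */
    List<String> match(String pattern) throws IOException;
}

class FlatFileWordSource implements WordSource {
    private final Path wordFile; // e.g. words5.txt, one word per line

    FlatFileWordSource(Path wordFile) {
        this.wordFile = wordFile;
    }

    @Override
    public List<String> match(String pattern) throws IOException {
        // Turn the puzzle pattern into a regex: '?' becomes '.', letters stay literal.
        Pattern regex = Pattern.compile(pattern.replace("?", "."));
        try (Stream<String> lines = Files.lines(wordFile)) {
            return lines.filter(word -> regex.matcher(word).matches()).toList();
        }
    }
}
```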
For fast retrieval you certainly want some kind of index. If you don't want to write index code yourself, it's certainly easiest to use a database.
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.
Storing data in a local text file (when you append new records to the end of the file) is always faster than storing it in a database. So if you build a high-load application, you can save the data in a text file and copy it to a database later. However, in most applications you should use a database instead of a text file, because the database approach has many benefits.