How does the Lucene tool Luke determine a file count?

Using Luke, I see 348K files in the Lucene index. Our repository, queried with SQL commands via ACCE (IBM Connections storing files in Connections Content Manager [i.e. FileNet]), comes back with 345K files uploaded by users. Is there any way to explain the 3K difference? It seems odd that Luke would report MORE documents than the actual repository contains.
Are there control docs? Versions? I can see 325 docs listed as deletions on the Luke page, so it is also counting deletions, but that still leaves a 3K difference (the actual difference was originally closer to 3.5K when counting deletions). We have been monitoring the number of documents users add over time, and it is growing at a consistent rate. However, the discrepancy between Luke and the file count returned by ACCE keeps increasing; we are now approaching 4K, even without taking into account the deletions listed by Luke. How can we explain this anomaly?
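For reference, Luke reads these numbers straight from the Lucene index rather than from the repository. A minimal sketch of the counters involved, assuming a recent Lucene Java API (the index path is a placeholder):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class IndexCounts {
    public static void main(String[] args) throws Exception {
        // "/path/to/index" is a placeholder for the directory Luke is pointed at.
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            System.out.println("numDocs (live docs)      : " + reader.numDocs());
            System.out.println("numDeletedDocs (deleted) : " + reader.numDeletedDocs());
            System.out.println("maxDoc (live + deleted)  : " + reader.maxDoc());
        }
    }
}
```

If Luke's figure is closer to maxDoc than to numDocs, part of the gap is simply deleted documents that have not been merged away yet; anything beyond that would have to be extra documents in the index itself (for example versions or control documents) that the repository query does not count.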
Thanks.

Related

Memory consumption in my Jersey application keeps growing with time

Memory consumption in my application keeps growing over time. The app uses Lucene, and search is performed through REST endpoints that query Lucene directories. Around 10 different directories are created, and multiple users can search one or more directories at the same time. During a search the app also checks whether any record has been added or modified in the DB; if so, the directories are updated by deleting and re-adding the affected documents. I could not find anything wrong in the Lucene configuration or in the search code: the IndexWriter for each directory is flushed and committed after documents are deleted and added. I am just wondering if searching can also consume memory. I can provide more details if required.
I would appreciate any clue about what exactly might be wrong.
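For what it's worth, a common cause of steadily growing memory in this kind of setup is opening a new IndexReader/IndexSearcher per request or per refresh without releasing the old one. A minimal sketch, assuming a reasonably recent Lucene version, of the SearcherManager pattern that keeps one shared, refreshable searcher per directory (the index path is a placeholder):

```java
import java.nio.file.Paths;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DirectorySearcher {
    // One long-lived SearcherManager per Lucene directory, shared by all requests.
    private final SearcherManager manager;

    public DirectorySearcher(String indexPath) throws Exception {
        manager = new SearcherManager(FSDirectory.open(Paths.get(indexPath)), null);
    }

    public TopDocs search(Query query, int n) throws Exception {
        manager.maybeRefresh();               // sees changes committed by your IndexWriter
        IndexSearcher searcher = manager.acquire();
        try {
            return searcher.search(query, n);
        } finally {
            manager.release(searcher);        // without this, old readers keep segments in memory
        }
    }
}
```

Each of the ~10 directories would get one such instance, created once at startup and reused by every request that searches it.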

What to do with old files of the SoftIndexFileStore in Infinispan persistent cache store?

I have a clustered cache store set up with Infinispan (8.2.4 Final) using the SoftIndexFileStore for persistence.
The documentation states that if entries expire it's not possible for the Compactor to clean up purged entries, and disk usage will grow over time. From the user guide:
When entries are stored with expiration, SIFS cannot detect that some
of those entries are expired. Therefore, such old file will not be
compacted (method AdvancedStore.purgeExpired() is not implemented).
This can lead to excessive file-system space usage.
Most of my entries expire, but there are some which need to persist indefinitely, meaning I can't simply run a cleanup job every once in a while to delete all the data files.
How should I deal with this wasted disk space? After several weeks of running I see many files which haven't been modified in weeks. Is it safe to delete old files which haven't been modified in, say, more than a month?
No; old files won't ever be modified again (they are written once and then considered immutable until removed). Removing them manually could lead to failures since these files are referenced in the index.
Regrettably, when the store is iterated and entries are found to be expired, Compactor.free() is not called, because there could be multiple concurrent iterations and we could end up calling it many times for a single entry.
A proper solution would be to implement a periodic (or JMX-triggered) process that goes through old files, computes the space occupied by expired entries, and schedules files that exceed some threshold for compaction. This should go into Compactor. Please see the SIFS javadoc for a general design description.
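Nothing in the current store does this out of the box, but as a very rough sketch of the shape such a periodic job could take, here is an outline using a plain ScheduledExecutorService. Note that expiredBytesInFile() and scheduleForCompaction() are hypothetical hooks that would have to be implemented against the SIFS index and Compactor; they are not existing Infinispan APIs.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExpiredSpaceReclaimer {
    // Hypothetical threshold: compact a data file once half of its bytes belong to expired entries.
    private static final double COMPACTION_THRESHOLD = 0.5;

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(File dataDir) {
        scheduler.scheduleAtFixedRate(() -> {
            File[] dataFiles = dataDir.listFiles();
            if (dataFiles == null) {
                return;
            }
            for (File dataFile : dataFiles) {
                long total = dataFile.length();
                // Hypothetical hook: sum the bytes occupied by expired entries in this file.
                long expired = expiredBytesInFile(dataFile);
                if (total > 0 && (double) expired / total > COMPACTION_THRESHOLD) {
                    // Hypothetical hook: let the Compactor rewrite the live entries elsewhere,
                    // after which the old file can be dropped without breaking index references.
                    scheduleForCompaction(dataFile);
                }
            }
        }, 1, 1, TimeUnit.HOURS);
    }

    // Both methods are placeholders; implementing them requires access to SIFS internals.
    private long expiredBytesInFile(File dataFile) { return 0L; }
    private void scheduleForCompaction(File dataFile) { }
}
```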
If you're interested in developing this feature and you want to discuss that more, please go to Infinispan forum.

Search index replication

I am developing an application that requires a CLucene index to be created in a desktop application, but replicated for (read-only) searching on iOS devices and efficiently synchronized whenever the desktop index changes.
Aside from simply re-downloading the entire index whenever it changes, what are my options here? CLucene does not support replication on its own, but Solr (which is built on top of Lucene) does, so it's clearly possible. Does anybody know how Solr does this and how one would approach implementing similar functionality?
If this is not possible, are there any (non-Java-based) full-text search implementations that would meet my needs better than CLucene?
Querying the desktop application is not an option - the mobile applications must be able to search offline.
A Lucene index is based on write-once, read-many segments. This means that when new documents have been committed to a Lucene index, all you need to retrieve is:
the new segments,
the merged segments (old segments which have been merged into a single segment, if any),
the segments file (which stores information about the current segments).
Once all these new files have been downloaded, the old segment files that were merged away can be safely removed. To take the changes into account, just reopen an IndexReader.
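To make the reopen step concrete, here is a minimal sketch using the Java Lucene API (CLucene tracks an older Java API, so the exact calls will differ); the index path is a placeholder:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexRefresher {
    public static DirectoryReader openInitial(String indexPath) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        return DirectoryReader.open(dir);
    }

    public static DirectoryReader refresh(DirectoryReader current) throws Exception {
        // Returns a reader over the newly downloaded segments, or null if nothing changed.
        DirectoryReader updated = DirectoryReader.openIfChanged(current);
        if (updated == null) {
            return current;      // segments file unchanged, keep using the old reader
        }
        current.close();         // release the old segments so merged-away files can be deleted
        return updated;
    }
}
```

The key point is that the freshly copied segments file determines which segment files the reader uses; anything it no longer references can be deleted once the old reader is closed.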
Solr has a Java implementation to do this, but given how simple the process is, a synchronization tool such as rsync would do the trick too. By the way, this is how Solr replication worked before Solr 1.4; you can still find some documentation about rsync replication on the wiki.

IIS access log to SQL normalization

I am looking to insert IIS 6.0 access logs (5 servers, over 400 MB daily) into a SQL database. What scares me is the size. A lot of information is duplicated (e.g. site name, URL, referrer, browser) and could be normalized with an artificial key and a look-up table.
The reason I am looking at building my own database instead of using other tools is that there are 5 servers and I need very custom statistics and reports on each one, on a few of them, or on all of them. Also, installing any software (especially open source) is a massacre (it has to have 125% of the required functionality and takes months).
I wonder what would be the most efficient way to do this. Has anyone seen examples or articles about it?
While I would suggest buying a decent log-parsing tool, if you insist on going it alone, take a look at Log Parser
http://www.microsoft.com/downloads/en/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en
to help you do some of the heavy lifting, either loading the data into SQL or perhaps producing the results you are after directly.
On the one hand, you will reduce disk space for values a lot by using artificial keys for things like server IP address, user agent, and referrer. Some of that space you save will be lost to the index, but the overall disk savings for 400 MB per day, times 5 servers, should still be substantial.
The tradeoff, of course, is the need to use joins to bring that information back together for reporting.
My nitpick is that replacing one column's values with an artificial key to a two-column lookup table shouldn't be called "normalizing". You can do that without identifying any functional dependencies. (I'm not certain you're proposing to do that, but it sounds like it.)
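To illustrate the "artificial key plus two-column lookup table" pattern under discussion, here is a rough JDBC sketch; the table and column names (user_agent, iis_hit, and so on) are made up for the example, and a real importer would batch the inserts and cache the lookups in memory:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class LogImporter {
    // Look up the surrogate key for a user agent, inserting it on first sight.
    // Assumed schema for the example: user_agent(id INT IDENTITY PRIMARY KEY, value NVARCHAR(500) UNIQUE)
    static int userAgentId(Connection conn, String userAgent) throws Exception {
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT id FROM user_agent WHERE value = ?")) {
            select.setString(1, userAgent);
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next()) return rs.getInt(1);
            }
        }
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO user_agent(value) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
            insert.setString(1, userAgent);
            insert.executeUpdate();
            try (ResultSet keys = insert.getGeneratedKeys()) {
                keys.next();
                return keys.getInt(1);
            }
        }
    }

    // The fact row then stores the small integer key instead of repeating the full string.
    static void insertHit(Connection conn, java.sql.Timestamp time, int userAgentId, String uriStem)
            throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO iis_hit(hit_time, user_agent_id, uri_stem) VALUES (?, ?, ?)")) {
            ps.setTimestamp(1, time);
            ps.setInt(2, userAgentId);
            ps.setString(3, uriStem);
            ps.executeUpdate();
        }
    }
}
```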
You're looking at about 12 gigs a month in raw data, right? Did you consider approaching it from a data warehousing point of view? (Instead of an OLTP point of view.)

Low MySQL Table Cache Hit Rate

I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything right except for the table cache hit rate: no matter how high I raise it in my.cnf, I am still hitting about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this, so I don't know what to look for in my queries/database structure to fix it.
The table cache defines the number of file descriptors that MySQL keeps open simultaneously. So the table cache hit rate will be affected by how many tables you have relative to your limit, as well as how frequently you re-reference tables (keeping in mind that it counts not just a single connection, but all simultaneous connections).
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to a 100% hit rate unless you run FLUSH TABLES a lot (as long as your table_cache is set higher than the typical number of simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.
A cache is supposed to maintain copies of hot data. Hot data is data that is used a lot. If you cannot retrieve data out of a certain cache it means the DB has to go to disk to retrieve it.
--edit--
Sorry if the definition seemed a bit obnoxious. A specific cache often covers a lot of entities, and these are database specific; you first need to find out what is cached by the table cache.
--edit: some investigation --
OK, it seems (from the reply to this post) that MySQL uses the table cache for the data structures used to represent a table. Those data structures also (via encapsulation, or by having duplicate cache entries for each table) represent a set of file descriptors open for the data files on the file system. The MyISAM engine uses one for the table and one for each index; additionally, each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file I/O; it represents the low-level context of a particular file read or write.
I think you are either interpreting the values incorrectly or they need to be interpreted differently in this context. 284 is the number of tables open at the instant you took the snapshot, and the second value represents the number of times a table has been opened since you started MySQL.
I would hazard a guess that you need to take multiple snapshots of this reading and see whether the first value (active file descriptors at that instant) ever exceeds your cache size.
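As a rough way to take those repeated snapshots, assuming JDBC access to the server (connection details are placeholders): the relevant counters are Open_tables and Opened_tables, and the limit itself is table_cache (renamed table_open_cache in later MySQL versions).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class TableCacheSnapshot {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/", "monitor", "secret");
             Statement st = conn.createStatement()) {

            Map<String, Long> status = new HashMap<>();
            try (ResultSet rs = st.executeQuery("SHOW GLOBAL STATUS LIKE 'Open%tables'")) {
                while (rs.next()) {
                    status.put(rs.getString(1), rs.getLong(2));
                }
            }
            long open = status.getOrDefault("Open_tables", 0L);     // tables open right now
            long opened = status.getOrDefault("Opened_tables", 0L); // cumulative opens since startup
            System.out.printf("Open_tables=%d Opened_tables=%d%n", open, opened);
            // Compare repeated samples: if Opened_tables keeps climbing while Open_tables sits
            // at the table_cache limit, the cache is too small for the working set of tables.
        }
    }
}
```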
P.S. The kernel generally has an upper limit on the number of file descriptors it will allow each process to open, so you might need to tune this if it is too low.