How to optimize Lucene.Net indexing - lucene

I need to index around 10GB of data. Each of my "documents" is pretty small, think basic info about a product, about 20 fields of data, most only a few words. Only 1 column is indexed, the rest are stored. I'm grabbing the data from text files, so that part is pretty fast.
Current indexing speed is only about 40 MB per hour. I've heard other people say they have achieved speeds 100x faster than this. For smaller files (around 20 MB) the indexing goes quite fast (5 minutes). However, when I have it loop through all of my data files (about 50 files totalling 10 GB), as time goes on the growth of the index seems to slow down a lot. Any ideas on how I can speed up the indexing, or what the optimal indexing speed is?
On a side note, I've noticed the API in the .Net port does not seem to contain all of the same methods as the original in Java...
Update--here are snippets of the indexing C# code:
First I set things up:
directory = FSDirectory.GetDirectory(txtIndexFolder.Text, true);
iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(25000);
iwriter.SetMergeFactor(1000);
iwriter.SetMaxBufferedDocs(Convert.ToInt16(txtBuffer.Text));
Then read from a tab-delim data file:
using (System.IO.TextReader tr = System.IO.File.OpenText(File))
{
string line;
while ((line = tr.ReadLine()) != null)
{
string[] items = line.Split('\t');
Then create the fields and add the document to the index:
fldName = new Field("Name", items[4], Field.Store.YES, Field.Index.NO);
doc.Add(fldName);
fldUPC = new Field("UPC", items[10], Field.Store.YES, Field.Index.NO);
doc.Add(fldUPC);
string Contents = items[4] + " " + items[5] + " " + items[9] + " " + items[10] + " " + items[11] + " " + items[23] + " " + items[24];
fldContents = new Field("Contents", Contents, Field.Store.NO, Field.Index.TOKENIZED);
doc.Add(fldContents);
...
iwriter.AddDocument(doc);
Once it's completely done indexing:
iwriter.Optimize();
iwriter.Close();
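For reference, the kind of writer tuning I've seen suggested for bulk loads looks roughly like the sketch below. This is just a sketch assuming a newer (2.9-era) Lucene.Net build where SetRAMBufferSizeMB exists; the 64 MB buffer is an illustrative guess, not a value from my code above.
// Flush by RAM usage instead of a fixed document count, keep the merge factor
// moderate (very large values mostly defer merge cost rather than remove it),
// and skip compound files to avoid an extra copy pass during merges.
iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetRAMBufferSizeMB(64);
iwriter.SetMergeFactor(10);
iwriter.SetUseCompoundFile(false);
iwriter.SetMaxFieldLength(25000);
// ... AddDocument() loop as above ...
iwriter.Optimize();   // expensive: do this once at the very end, or not at all
iwriter.Close();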

Apparently, I had downloaded a 3-year-old version of Lucene that, for some reason, is prominently linked from the project's home page. I downloaded the most recent Lucene source code, compiled it, used the new DLL, and that fixed just about everything. The documentation kinda sucks, but the price is right and it's real fast.
From a helpful blog:
First things first, you have to add the Lucene libraries to your project. On the Lucene.NET web site, you’ll see the most recent release builds of Lucene. These are two years old. Do not grab them, they have some bugs. There has not been an official release of Lucene for some time, probably due to resource constraints of the maintainers. Use Subversion (or TortoiseSVN) to browse around and grab the most recently updated Lucene.NET code from the Apache SVN Repository. The solution and projects are Visual Studio 2005 and .NET 2.0, but I upgraded the projects to Visual Studio 2008 without any issues. I was able to build the solution without any errors. Go to the bin directory, grab the Lucene.Net dll and add it to your project.

Since I can't comment on the marked answer above about the 3-year-old version, I would highly recommend installing the Visual Studio extension for the NuGet Package Manager when adding Lucene.NET to your projects. It should add the most recent DLL version for you unless you need a specific version.
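For example, from the Package Manager Console (assuming the package ID is simply Lucene.Net):
Install-Package Lucene.Net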

Related

How to use a compass lucene generated cfs index?

With (the latest) Lucene 8.7, is it possible to open, with the Lucene utility "Luke", a .cfs compound index file generated around 2009 by Lucene 2.2 inside a legacy application that I cannot modify?
Or, alternatively, could it be possible to generate the .idx file for Luke from the .cfs?
The .cfs was generated by Compass on top of Lucene 2.2, not by Lucene directly.
Is it possible to use a Compass-generated index containing:
_b.cfs
segments.gen
segments_d
possibly with Solr?
Are there any examples anywhere of how to open a file-based .cfs index with Compass?
The conversion tool won't work because the index version is too old. From lucene\build\demo:
java -cp ../core/lucene-core-8.7.0-SNAPSHOT.jar;../backward-codecs/lucene-backward-codecs-8.7.0-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader -verbose path_of_old_index
and the searchfiles demo:
java -classpath ../core/lucene-core-8.7.0-SNAPSHOT.jar;../queryparser/lucene-queryparser-8.7.0-SNAPSHOT.jar;./lucene-demo-8.7.0-SNAPSHOT.jar org.apache.lucene.demo.SearchFiles -index path_of_old_index
both fail with:
org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported
This version of Lucene only supports indexes created with release 6.0 and later.
Is it possible to use an old index with Lucene somehow? How do you use the old "codec"?
Also from Lucene.Net, if possible?
Current Lucene 8.7 yields an index containing these files:
segments_1
write.lock
_0.cfe
_0.cfs
_0.si
==========================================================================
Update: amazingly, it seems that Lucene.Net v3.0.3 from NuGet can open that very old format index!
This seems to work for extracting all terms from the index:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Globalization;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main()
        {
            var reader = IndexReader.Open(FSDirectory.Open("C:\\Temp\\ftsemib_opzioni\\v210126135604\\index\\search_0"), true);
            Console.WriteLine("number of documents: " + reader.NumDocs() + "\n");
            Console.ReadLine();

            TermEnum terms = reader.Terms();
            while (terms.Next())
            {
                Term term = terms.Term;
                String termField = term.Field;
                String termText = term.Text;
                int frequency = reader.DocFreq(term);
                Console.WriteLine(termField + " " + termText);
            }

            var fieldNames = reader.GetFieldNames(IndexReader.FieldOption.ALL);
            int numFields = fieldNames.Count;
            Console.WriteLine("number of fields: " + numFields + "\n");
            for (IEnumerator<String> iter = fieldNames.GetEnumerator(); iter.MoveNext();)
            {
                String fieldName = iter.Current;
                Console.WriteLine("field: " + fieldName);
            }

            reader.Close();
            Console.ReadLine();
        }
    }
}
Out of curiosity, would it be possible to find out what index version it is?
Are there any examples of (old) Compass with a file-system-based index?
Unfortunately you can't use an old codec to access index files from Lucene 2.2, because codecs were only introduced in Lucene 4.0. Prior to that, the code for reading and writing the index files was not grouped together into a codec; it was just inherently part of the overall Lucene library.
So in versions of Lucene prior to 4.0 there is no codec, just file reading and writing code baked into the library. It would be very difficult to track down all that code and create a codec that could be plugged into a modern version of Lucene. It's not an impossible task, but it would require an expert Lucene developer and a large amount of effort (i.e. an extremely expensive endeavor).
In light of all that, the answer to this SO question may be of some use: How to upgrade lucene files from 2.2 to 4.3.1
Update
Your best bet would be to use an old 3.x copy of Java Lucene or Lucene.Net 3.0.3 to open the index, then add and commit one doc (which will create a 2nd segment) and do an Optimize, which will cause the two segments to be merged into one new segment. The new segment will be a version 3 segment. Then you can use Lucene.Net 4.8 Beta or a Java Lucene 4.x to do the same thing again (but note that Optimize was renamed ForceMerge starting in version 4) to convert the index to a 4.x index.
Then you can use the current Java version of Lucene 8.x to do this once more to move the index all the way up to 8, since the current version of Java Lucene has codecs reaching all the way back to 5.0. See: https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/codecs
However, if you again receive the error that you reported:
This version of Lucene only supports indexes created with release 6.0 and later.
then you will have to play this game one more cycle with a version 6.x Java Lucene to get from a 5.x index to a 6.x index. :-)
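For the first hop (2.x to 3.x) with Lucene.Net 3.0.3, a rough sketch could look like the following; the path and the marker field name are made up for illustration, so verify the exact API against your build:
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Open the old index for appending (create: false), add and commit one dummy
// document so a new 3.x-format segment exists, then Optimize() to merge
// everything into a single 3.x segment.
var dir = FSDirectory.Open("C:\\Temp\\old_index");
using (var writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                                    false, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("upgrade_marker", "dummy", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.AddDocument(doc);
    writer.Commit();
    writer.Optimize();
}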

How to use very old iText (under 0.99) to create bookmarks / outlines?

May I know how to use old iText (a very old version, under 0.99, package path = com.lowagie.xxx) to create bookmarks that jump to locations inside the same PDF, please?
Like the API in the new iText jar:
PdfOutline outoline2 = com.itextpdf.pdf.PdfAction.gotoLocalPage("destinationName", false)
We have found the code below to create a bookmark, but the old iText needs to use a filename (see outFileName in the code below), whereas what we want is a jump within the same PDF (not to a remote PDF):
olineSignature = new PdfOutline(root, new PdfAction(outFileName, "Signature2TxtDestination"), "Signature2TxtOutline");
FYI, we don't know the page number in advance, so there is no way to use the old API PdfAction.gotoLocalPage(int, PdfDestination, PdfWriter).
Can anybody help me? Thanks. #Bruno Lowagie, #itext :)
We are in the process of upgrading to the new iText (iText 5+), but for now we have a request to create bookmarks (using the old iText) so that others can retrieve them.
My memory can't go that far back but local destinations are most probably not supported. Your only chance is to do an interim upgrade to the Jurassic 2.1.7 that should be more or less compatible with that Pleistocene 0.99.

Symfony2 performance tweaking

Symfony2 was looking so promising, powerful and flexible. So we were going to use Symfony2 + MongoDB for one of our projects. But it appeared too slow (Apache/2.2.25 + PHP/5.4.20). Currently the app is pretty simple, but I have noticed that httpd.exe loads the CPU up to 28% when some simple page is loaded. The page is quite light - just user profile info and the list of the user's posts. I can't even imagine how hundreds of users could be served (not even talking about numbers like 100k users) if performance does not get much better.
For instance, the CPU load is 2% when opening the heavy 'products' page of an ActivationCloud account (which fetches a good amount of data) (PHP+Smarty+SQL).
After taking a look at the Xdebug output, I found that a great deal of time (20%) is spent in ClassLoader->loadClass(...) - 265 calls.
After performing the following steps:
*generated class map
php composer.phar dump-autoload --optimize
*installed and enabled APC
[APC]
extension=php_apc.dll
apc.enabled=1
apc.shm_segments=1
;32M per WordPress install
apc.shm_size=128M
;Relative to the number of cached files (you may need to
;watch your stats for a day or two to find out a good number)
apc.num_files_hint=7000
;Relative to the size of WordPress
apc.user_entries_hint=4096
;The number of seconds a cache entry is allowed to idle
;in a slot before APC dumps the cache
apc.ttl=7200
apc.user_ttl=7200
apc.gc_ttl=3600
;Setting this to 0 will give you the best performance, as APC will
;not have to check the IO for changes. However, you must clear
;the APC cache to recompile already cached files. If you are still
;developing, updating your site daily in WP-ADMIN, and running W3TC
;set this to 1
apc.stat=1
;This MUST be 0, WP can have errors otherwise!
apc.include_once_override=0
;Only set to 1 while debugging
apc.enable_cli=0
;Allow 2 seconds after a file is created before
;it is cached to prevent users from seeing half-written/weird pages
apc.file_update_protection=2
;Leave at 2M or lower. WordPress doesn't have any file sizes close to 2M
apc.max_file_size=2M
;Ignore files
apc.filters = "/var/www/apc.php"
apc.cache_by_default=1
apc.use_request_time=1
apc.slam_defense=0
apc.mmap_file_mask=/var/www/temp/apc.XXXXXX
apc.stat_ctime=0
apc.canonicalize=1
apc.write_lock=1
apc.report_autofilter=0
apc.rfc1867=0
apc.rfc1867_prefix =upload_
apc.rfc1867_name=APC_UPLOAD_PROGRESS
apc.rfc1867_freq=0
apc.rfc1867_ttl=3600
apc.lazy_classes=0
apc.lazy_functions=0
I expected a miracle after that, but it did not happen.
*enabled the APC class loader - in Symfony\web\app.php, uncommented:
$loader = new ApcClassLoader('sf2', $loader);
$loader->register(true);
The ClassLoader->loadClass(...) got better: 'Self' is 11 instead of 21.
Frankly speaking, I was shocked by what I saw in Xdebug :( a lot of repetitive calls like Container->get(...) - 317 calls, and DocumentManager->getClassMetadata(...) - 301 calls. In total, more than 2k function calls. Hard to believe.
These bundles are installed:
class AppKernel extends Kernel
{
    public function registerBundles()
    {
        $bundles = array(
            new Symfony\Bundle\FrameworkBundle\FrameworkBundle(),
            new Symfony\Bundle\SecurityBundle\SecurityBundle(),
            new Symfony\Bundle\TwigBundle\TwigBundle(),
            new Symfony\Bundle\MonologBundle\MonologBundle(),
            new Symfony\Bundle\SwiftmailerBundle\SwiftmailerBundle(),
            new Symfony\Bundle\AsseticBundle\AsseticBundle(),
            new Doctrine\Bundle\DoctrineBundle\DoctrineBundle(),
            new Doctrine\Bundle\MongoDBBundle\DoctrineMongoDBBundle(),
            new Sensio\Bundle\FrameworkExtraBundle\SensioFrameworkExtraBundle(),
            new HWI\Bundle\OAuthBundle\HWIOAuthBundle(),
            new Knp\Bundle\MenuBundle\KnpMenuBundle(),
            ... our bundles ...
        );
        if (in_array($this->getEnvironment(), array('dev', 'test'))) {
            $bundles[] = new Symfony\Bundle\WebProfilerBundle\WebProfilerBundle();
            $bundles[] = new Sensio\Bundle\DistributionBundle\SensioDistributionBundle();
            $bundles[] = new Sensio\Bundle\GeneratorBundle\SensioGeneratorBundle();
        }
        return $bundles;
    }
It was sad to find that Symfony2 got one of the worst benchmark results among PHP frameworks: http://www.techempower.com/benchmarks/#section=data-r8&hw=i7&test=json&l=sg
At the same time, Francois Zaninotto says in his blog post http://symfony.com/blog/who-really-uses-symfony that Yahoo uses Symfony2 for its bookmarks service. I tried some apps from the list http://trac.symfony-project.org/wiki/ApplicationsDevelopedWithSymfony and they do not look slow, and on Quora http://www.quora.com/Who-is-using-Symfony2-in-production it is said that Dailymotion is using it as well.
How to make the performance acceptable?
Got Symfony working 10x faster after adding
realpath_cache_size = 4096k
to php.ini
First, you should use Linux (you mentioned httpd.exe, so I think you are using Windows). Then you should use nginx instead of Apache, PHP 5.5 with FPM instead of mod_php, and OPcache instead of APC (by the way, apc.stat should be turned off). Doctrine caches should be turned on, and then you should use HTTP caching wherever you can. (You can view Packagist's code for some hints.)

Using the TFS API, filetypes with extensions like .svnExe get ignored

I'm working on a tool which migrates from SVN to TFS using the TFS API.
workspace.CheckIn(
pendingChanges,
currentUser.TfsUser,
set.LogMessage + " on " + String.Format("{0:d/M/yyyy HH:mm:ss}", set.TimeStamp) + " by " + currentUser.SvnUser,
(CheckinNote)null,
(WorkItemCheckinInfo[])null,
(PolicyOverrideInfo)null
);
This is the way I check my revisions in, but sometimes it ignores files like .svnExe or other "unknown" file types.
Is there a way to check ALL filetypes in TFS?
There are two possibilities that I can think of:
Possibility 1: Something is causing the PendAdd() to fail.
For example, if the path already exists in Version Control, you have to use a PendEdit() instead.
To diagnose this possibility, you should subscribe to the VersionControlServer.NonFatalError event.
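For example (a hedged sketch; versionControlServer stands for your VersionControlServer instance, and the member names are from memory, so double-check them against your TFS API version):
// Non-fatal errors (such as individual files being skipped) are raised through
// this event instead of being thrown, so log them somewhere visible.
versionControlServer.NonFatalError += (sender, e) =>
{
    if (e.Exception != null)
        Console.WriteLine("Non-fatal exception: " + e.Exception.Message);
    if (e.Failure != null)
        Console.WriteLine("Non-fatal failure: " + e.Failure.Message);
};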
Possibility 2: You could have a corrupt workspace cache
You can refresh the cache by calling Workstation.Current.EnsureUpdateWorkspaceInfoCache() or by following the steps in this answer (run tf workspaces /collection:http://yourserver:8080/tfs/DefaultCollection, or delete the directories manually).
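A hedged sketch of the programmatic refresh (the overload shown takes the server and the workspace owner; verify the exact signature for your client API version):
// Rebuild the local workspace info cache for the authenticated user.
Workstation.Current.EnsureUpdateWorkspaceInfoCache(
    versionControlServer,
    versionControlServer.AuthorizedUser);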

RavenDB, RavenHQ and Appharbor - document size error with very first document

I have a completely empty RavenHQ database that's linked to my Appharbor application. The amount of space the database is currently using is 1.1mb out of an available 25mb for my bronze account. The database previously had records in it, but I have deleted them using "delete collection" in the management studio.
The very first time I call session.Store(myobject), and BEFORE I call .SaveChanges(), I get the following error.
System.InvalidOperationException: Url: "/docs/Raven/Hilo/AccItems"
Raven.Database.Exceptions.OperationVetoedException: PUT vetoed by Raven.Bundles.Quotas.Triggers.DatabaseSizeQoutaForDocumetsPutTrigger because: Database size is 45,347 KB, which is over the allowed quota of 25,600 KB. No more documents are allowed in.
Now, the document is definitely not that big, so I don't know what this error can mean, especially as I don't think I've even hit the database at that point since I haven't closed the session by calling SaveChanges(). Any ideas? Here's the code itself.
XDocument doc = XDocument.Parse(rawXml);
var accItems = ExtractItemsFromFeed(doc);
using (IDocumentSession session = _store.OpenSession())
{
    var dbItems = session.Query<AccItem>().ToList();
    foreach (var item in accItems)
    {
        var existingRecord = dbItems.SingleOrDefault(x => x.Source == item.Source && x.SourceId == item.SourceId);
        if (existingRecord == null)
        {
            session.Store(item);
            _logger.Info("Saved new item {0}.", item.ShortName);
        }
        else
        {
            existingRecord.ShortName = item.ShortName;
            _logger.Info("Updated item {0}.", item.ShortName);
        }
        session.SaveChanges();
    }
}
Any other comments about the style of this code would be most welcome, as I was unsure of the best way to approach the "update existing item or create if it isn't there" scenario.
The answer here was as follows.
RavenHQ support found that the database was indeed oversized, but the size reported in the Appharbor-branded RavenHQ control panel was incorrect. I had filled the database up way over the limit with a previous faulty version of the code posted above, so the error message I received was actually correct. (The error appears before SaveChanges() because calling Store() makes the client update the HiLo document used for ID generation - the PUT to /docs/Raven/Hilo/AccItems in the error message - and that write was vetoed by the quota.)
Fixing this problem without paying to upgrade the database wasn't straightforward, as it's not possible to shrink the database. As I also wasn't able to delete my single Appharbor/RavenHQ database or create another one, that left me with the choice of creating an entirely new Appharbor application or registering directly with RavenHQ for a new account. I chose the latter. The RavenHQ-branded control panel is slightly different from the Appharbor one, in that it has the ability to create and delete databases.
So to summarize: there doesn't seem to be any benefit to using RavenHQ as an add-on to Appharbor - you might as well go and get a proper free RavenHQ account.
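On the side question about code style: one commonly suggested tweak (a hedged sketch, not something RavenHQ support recommended) is to treat the session as a unit of work and call SaveChanges() once after the loop, so all stores and updates go to the server as a single batch:
using (IDocumentSession session = _store.OpenSession())
{
    var dbItems = session.Query<AccItem>().ToList();
    foreach (var item in accItems)
    {
        // ... store new items / update existing ones exactly as above ...
    }
    session.SaveChanges();   // one batched round-trip for the whole session
}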