With (the latest) Lucene 8.7, is it possible to open a .cfs compound index file generated around 2009 by Lucene 2.2, in a legacy application that I cannot modify, with the Lucene utility "Luke"?
Or, alternatively, could the .idx file for Luke be generated from the .cfs?
The .cfs was generated by Compass on top of Lucene 2.2, not by Lucene directly.
Is it possible to use a Compass-generated index containing:
_b.cfs
segments.gen
segments_d
possibly with Solr?
Are there any examples anywhere of how to open a file-based .cfs index with Compass?
The conversion tool won't work because the index version is too old.
From lucene\build\demo:
java -cp ../core/lucene-core-8.7.0-SNAPSHOT.jar;../backward-codecs/lucene-backward-codecs-8.7.0-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader -verbose path_of_old_index
And the SearchFiles demo:
java -classpath ../core/lucene-core-8.7.0-SNAPSHOT.jar;../queryparser/lucene-queryparser-8.7.0-SNAPSHOT.jar;./lucene-demo-8.7.0-SNAPSHOT.jar org.apache.lucene.demo.SearchFiles -index path_of_old_index
Both fail with:
org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported
This version of Lucene only supports indexes created with release 6.0 and later.
Is it possible to use an old index with Lucene somehow? How can the old "codec" be used?
Also from Lucene.NET, if possible?
Current Lucene 8.7 yields an index containing these files:
segments_1
write.lock
_0.cfe
_0.cfs
_0.si
==========================================================================
Update: amazingly, Lucene.NET v3.0.3 from NuGet seems to open that very old format index!
The following seems to work to extract all terms from the index:
using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Store;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main()
        {
            // Open the old index read-only.
            var reader = IndexReader.Open(FSDirectory.Open("C:\\Temp\\ftsemib_opzioni\\v210126135604\\index\\search_0"), true);
            Console.WriteLine("number of documents: " + reader.NumDocs() + "\n");
            Console.ReadLine();

            // Walk every term in the index and print its field and text.
            TermEnum terms = reader.Terms();
            while (terms.Next())
            {
                Term term = terms.Term;
                String termField = term.Field;
                String termText = term.Text;
                int frequency = reader.DocFreq(term); // document frequency, if needed
                Console.WriteLine(termField + " " + termText);
            }

            // List all field names present in the index.
            var fieldNames = reader.GetFieldNames(IndexReader.FieldOption.ALL);
            int numFields = fieldNames.Count;
            Console.WriteLine("number of fields: " + numFields + "\n");
            foreach (String fieldName in fieldNames)
            {
                Console.WriteLine("field: " + fieldName);
            }

            reader.Close();
            Console.ReadLine();
        }
    }
}
Out of curiosity, is it possible to find out which index version it is?
Are there any examples of (old) Compass with a file-system-based index?
Unfortunately you can't use an old codec to access index files from Lucene 2.2, because codecs were only introduced in Lucene 4.0. Prior to that, the code for reading and writing the files of the index was not grouped together into a codec; it was simply an inherent part of the overall Lucene library.
So in versions of Lucene prior to 4.0 there is no codec, just file reading and writing code baked into the library. It would be very difficult to track down all that code and turn it into a codec that could be plugged into a modern version of Lucene. It's not an impossible task, but it would require an expert Lucene developer and a large amount of effort (i.e. an extremely expensive endeavor).
In light of all that, the answer to this SO question may be of some use: How to upgrade lucene files from 2.2 to 4.3.1
Update
Your best bet would be to use an old 3.x copy of Java Lucene, or Lucene.NET 3.0.3, to open the index, then add and commit one document (which will create a second segment) and run an Optimize, which will cause the two segments to be merged into one new segment. The new segment will be a version 3 segment. Then you can use Lucene.NET 4.8 beta or a Java Lucene 4.x to do the same thing again (note that Optimize was renamed ForceMerge starting in version 4) to convert the index to a 4.x index.
Then you can use the current Java Lucene 8.x to do this once more and move the index all the way up to 8, since the current version of Java Lucene ships codecs reaching all the way back to 5.0; see: https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/codecs
However, if you again receive the error you reported:
This version of Lucene only supports indexes created with release 6.0 and later.
then you will have to play this game one more cycle with a version 6.x Java Lucene to get from a 5.x index to a 6.x index. :-)
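For illustration, here is a minimal sketch of the first hop using Lucene.NET 3.0.3 (the path is hypothetical, and this assumes the 3.0.3 API, where Optimize still exists):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class UpgradeHop
{
    static void Main()
    {
        // Open the existing old-format index for writing (create: false keeps its segments).
        var dir = FSDirectory.Open("C:\\Temp\\old_index"); // hypothetical path
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // Add one dummy document so a second, version-3 segment is created.
            var doc = new Document();
            doc.Add(new Field("dummy", "dummy", Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.AddDocument(doc);

            // Merge everything into a single new segment written in the 3.x format.
            writer.Optimize();
            writer.Commit();
        }
    }
}

The same add-one-document-and-force-merge pattern is then repeated with each newer major version (using ForceMerge from 4.x onward) until the index reaches a format the target release can read.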
Related
We are upgrading our .NET 2.0 application to .NET Core 3.x. It uses DiffGrams to capture table field updates (before/after values) for auditing purposes, and I have to achieve the same thing in .NET Core 3.x, but I am not sure what the equivalent is.
Could anyone help guide me on this? Thank you.
The DataSet.WriteXml/DataSet.ReadXml methods are also available in .NET Core 3.x.
The WriteXml method provides a way to write either data only, or both data and schema from a DataSet into an XML document.
private void WriteXmlToFile(DataSet thisDataSet)
{
    if (thisDataSet == null) { return; }

    // Create a file name to write to.
    string filename = "XmlDoc.xml";

    // Create the FileStream to write with.
    System.IO.FileStream stream = new System.IO.FileStream(filename, System.IO.FileMode.Create);

    // Create an XmlTextWriter with the fileStream.
    System.Xml.XmlTextWriter xmlWriter = new System.Xml.XmlTextWriter(stream, System.Text.Encoding.Unicode);

    // Write to the file with the WriteXml method.
    thisDataSet.WriteXml(xmlWriter, XmlWriteMode.DiffGram);
    xmlWriter.Close();
}
The resultant XML code is rooted in the <diffgr:diffgram> node and contains up to three distinct data sections, as follows:
<diffgr:diffgram>
<MyDataSet>
:
</MyDataSet>
<diffgr:before>
:
</diffgr:before>
<diffgr:errors>
:
</diffgr:errors>
</diffgr:diffgram>
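To restore the DiffGram on the .NET Core side, the matching ReadXml call can be used. A minimal sketch (the file name follows the example above; note that XmlReadMode.DiffGram expects the target DataSet to already have the same schema as the one that was written):

private void ReadXmlFromFile(DataSet thisDataSet)
{
    if (thisDataSet == null) { return; }

    // Read the DiffGram back; XmlReadMode.DiffGram restores both the
    // current and the original (before) row versions into the DataSet.
    thisDataSet.ReadXml("XmlDoc.xml", XmlReadMode.DiffGram);
}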
As described in "Can a raw Lucene index be loaded by Solr?", Lucene indexes can be imported into Solr. This works well when the Solr server is not running (creating a Solr core folder structure in the data folder with all the needed configuration files), but it does not work when the Solr server is up and running.
Is there any call (via REST endpoint or Java API) to tell Solr to re-scan the data folder?
You want to generate an index with Lucene (outside Solr) and insert it into Solr without a restart.
You must not change the index folder directly. But you can create a new core that points to the already-built index folder and switch/swap it with the (outdated) old core, or you can merge the new index folder into the old core (see the REST sketch after the examples below).
All of this can be done via the SolrJ admin API.
e.g. create:
CoreAdminRequest.Create req = new CoreAdminRequest.Create();
req.setConfigName(configName);
req.setSchemaName(schemaName);
req.setDataDir(dataDir);
req.setCoreName(coreName);
req.setInstanceDir(instanceDir);
req.setIsTransient(true);
req.setIsLoadOnStartup(false); // <= unless it's the production core.
return req.process(adminServer);
e.g. the swap:
CoreAdminRequest request = new CoreAdminRequest();
request.setAction(CoreAdminAction.SWAP);
request.setCoreName(coreName1);
request.setOtherCoreName(coreName2);
request.process(solrClient);
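For the merge alternative mentioned above, the Core Admin endpoint can also be called over HTTP (a sketch; the host, core name, and index path are placeholders):

curl 'http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=oldCore&indexDir=/path/to/new/index'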
For SolrCloud, use the first "create" approach with the Collections API, and use an alias instead of a swap.
e.g. the alias:
CollectionAdminRequest.CreateAlias req = new CollectionAdminRequest.CreateAlias();
req.setAliasedCollections(coreName);
req.setAliasName(aliasName);
return req.process(solrClient);
Getting Lucene 4.10 to read 3.2-version indexes
We upgraded to 4.10 but still need to read 3.2 indexes. We deployed JRE 7 as required and made all the required changes within the existing code base, which had become erroneous after the upgrade. We still need to read the 3.2 indexes before taking on re-indexing. How can existing 3.2 indexes be read by Lucene 4.10, and what changes, if any, need to be made in the code?
You can use IndexUpgrader, something like:
IndexUpgrader upgrader = new IndexUpgrader(myIndexDirectory, Version.LUCENE_4_10_0);
upgrader.upgrade();
or run it from the command line:
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader myIndexDirectory
You can set the codec used to decode the indexes in the IndexWriterConfig. Lucene3xCodec would be the codec to use here:
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
config.setCodec(new Lucene3xCodec());
IndexWriter writer = new IndexWriter(directory, config);
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(writer, true));
Bear in mind, this codec is strictly read-only. Any attempt to add, delete, or update a document will result in an UnsupportedOperationException being thrown. If you wish to support writing to the index, you must upgrade your index (see my original answer).
I've been trying to use Apache Sling's Scala 2.8 script engine, which was updated last month. I came from using Scala 2.7 along with Sling's Scala 2.7 script engine, and that worked great. I run into a problem when I try to use the new implementation: when calling ScalaScriptEngine's eval function I always receive an "Error executing script" caused by a NullPointerException. Has anyone else worked with the new build of the script engine and run into this as well?
Thanks!
Steven
There is a bug which prevents the Scala scripting engine from being used standalone. See https://issues.apache.org/jira/browse/SLING-1877 for details and a patch.
Also note that, with the patch applied, you still need to set the class path when using the scripting engine. This is a change from 2.7.7, where the default Java class path (i.e. java.class.path) was used automatically. In 2.8 you have to set it explicitly through the '-usejavacp' argument.
Here is some sample code demonstrating the standalone usage of the Scala scripting engine:
def testScalaScriptEngine() {
  val scriptEngineFactory = new ScalaScriptEngineFactory

  // Since 2.8 the default java.class.path is no longer picked up
  // automatically, so pass it explicitly.
  val settings = new ScalaSettings()
  settings.parse("-usejavacp")
  scriptEngineFactory.getSettingsProvider.setScalaSettings(settings)

  val scriptEngine = scriptEngineFactory.getScriptEngine

  val script = """
    package script {
      class Demo(args: DemoArgs) {
        println("Hello")
      }
    }
  """

  scriptEngine.getContext.setAttribute("scala.script.class", "script.Demo", ScriptContext.ENGINE_SCOPE)
  scriptEngine.eval(script)
}
I need to index around 10GB of data. Each of my "documents" is pretty small, think basic info about a product, about 20 fields of data, most only a few words. Only 1 column is indexed, the rest are stored. I'm grabbing the data from text files, so that part is pretty fast.
Current indexing speed is only about 40 MB per hour, and I've heard other people say they have achieved 100x faster than this. For smaller files (around 20 MB) the indexing goes quite fast (5 minutes). However, when I have it loop through all of my data files (about 50 files totaling 10 GB), the growth of the index seems to slow down a lot as time goes on. Any ideas on how I can speed up the indexing, or what the optimal indexing speed is?
On a side note, I've noticed the API in the .Net port does not seem to contain all of the same methods as the original in Java...
Update: here are snippets of the indexing C# code.
First I set things up:
directory = FSDirectory.GetDirectory(txtIndexFolder.Text, true);
iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(25000);
iwriter.SetMergeFactor(1000);
iwriter.SetMaxBufferedDocs(Convert.ToInt16(txtBuffer.Text));
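For comparison, here is a hedged sketch of more conventional write tuning, assuming a Lucene.NET build recent enough to expose SetRAMBufferSizeMB (added in Java Lucene 2.3); the path is hypothetical:

// Flush by RAM usage instead of by document count, and keep the
// merge factor in the usual 10-50 range; very high values defer
// merges but multiply the number of segments and open files.
var directory = FSDirectory.GetDirectory("C:\\Temp\\index", true); // hypothetical path
var analyzer = new StandardAnalyzer();
var iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(25000);
iwriter.SetMergeFactor(10);
iwriter.SetRAMBufferSizeMB(48);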
Then read from a tab-delim data file:
using (System.IO.TextReader tr = System.IO.File.OpenText(File))
{
    string line;
    while ((line = tr.ReadLine()) != null)
    {
        string[] items = line.Split('\t');
Then create the fields and add the document to the index:
fldName = new Field("Name", items[4], Field.Store.YES, Field.Index.NO);
doc.Add(fldName);
fldUPC = new Field("UPC", items[10], Field.Store.YES, Field.Index.NO);
doc.Add(fldUPC);
string Contents = items[4] + " " + items[5] + " " + items[9] + " " + items[10] + " " + items[11] + " " + items[23] + " " + items[24];
fldContents = new Field("Contents", Contents, Field.Store.NO, Field.Index.TOKENIZED);
doc.Add(fldContents);
...
iwriter.AddDocument(doc);
Once it's completely done indexing:
iwriter.Optimize();
iwriter.Close();
Apparently, I had downloaded a three-year-old version of Lucene that for some reason is prominently linked from the project's home page... I downloaded the most recent Lucene source code, compiled it, and used the new DLL, which fixed just about everything. The documentation kinda sucks, but the price is right and it's really fast.
From a helpful blog:
First things first, you have to add the Lucene libraries to your project. On the Lucene.NET web site, you'll see the most recent release builds of Lucene. These are two years old. Do not grab them; they have some bugs. There has not been an official release of Lucene for some time, probably due to resource constraints of the maintainers. Use Subversion (or TortoiseSVN) to browse around and grab the most recently updated Lucene.NET code from the Apache SVN repository. The solution and projects are Visual Studio 2005 and .NET 2.0, but I upgraded the projects to Visual Studio 2008 without any issues. I was able to build the solution without any errors. Go to the bin directory, grab the Lucene.Net DLL, and add it to your project.
Since I can't comment on the accepted answer above about the three-year-old version: I would highly recommend installing the Visual Studio extension for NuGet Package Manager when adding Lucene.NET to your projects. It will add the most recent DLL version for you, unless you need a specific version.
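For example, the 3.0.3 package that the question above reports working can be pinned from the Package Manager Console:

Install-Package Lucene.Net -Version 3.0.3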