Reading internals of Solr index file in Java - indexing

I am trying to read a Solr index file. The index was created with one of the examples shipped with the Solr 6.4 download.
I am using this code:
import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TestIndex {
    public static void main(String[] args) throws IOException {
        Directory dirIndex = FSDirectory.open(new File("D:\\data\\data\\index"));
        IndexReader indexReader = IndexReader.open(dirIndex);
        Document doc = null;
        for (int i = 0; i < indexReader.numDocs(); i++) {
            doc = indexReader.document(i);
        }
        System.out.println(doc.toString());
        indexReader.close();
        dirIndex.close();
    }
}
Solr jar : solr-solrj-6.5.1.jar
Lucene : lucene-core-r1211247.jar
Exception :
Exception in thread "main"
org.apache.lucene.index.IndexFormatTooOldException: Format version is not
supported (resource:
ChecksumIndexInput(MMapIndexInput(path="D:\data\data\index\segments_2"))):
1071082519 (needs to be between -9 and -12). This version of Lucene only
supports indexes created with release 3.0 and later.
Updated code with Lucene 6.5.1:
Path path = FileSystems.getDefault().getPath("D:\\data\\data\\index");
Directory dirIndex = FSDirectory.open(path);
DirectoryReader dr = DirectoryReader.open(dirIndex);
Document doc = null;
for (int i = 0; i < dr.numDocs(); i++) {
    doc = dr.document(i);
}
System.out.println(doc.toString());
dr.close();
dirIndex.close();
Exception :
java.lang.UnsupportedClassVersionError: org/apache/lucene/store/Directory : Unsupported major.minor version 52.0.
Could you please help me to run this code?
Thanks
Virendra Agarwal

I suggest using Luke.
https://github.com/DmitryKey/luke
Luke is the GUI tool for introspecting your Lucene / Solr / Elasticsearch index. It allows:
- Viewing your documents and analyzing their field contents (for stored fields)
- Searching in the index
- Performing index maintenance: index health checking, index optimization (take a backup before running this!)
- Reading the index from HDFS
- Exporting the index or a portion of it into an XML format
- Testing your custom Lucene analyzers
- Creating your own plugins!

That Lucene jar (lucene-core-r1211247.jar) seems to be from 2012, so it's over five years old. Use lucene-core-6.5.1 to read index files generated by Solr 6.5.1.
You can pin your dependencies in your build file if it's picking up the arbitrarily named file by mistake.
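For what it's worth, here is a minimal sketch of the read with lucene-core-6.5.1 on the classpath (the index path is the one from the question; the class name is arbitrary). Note that Lucene 6.x is compiled for Java 8, which is what the later UnsupportedClassVersionError (major.minor version 52.0) is complaining about, so this also has to run on a Java 8 or newer JVM:

import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReadIndex {
    public static void main(String[] args) throws Exception {
        // Index directory taken from the question.
        Directory dirIndex = FSDirectory.open(Paths.get("D:\\data\\data\\index"));
        try (DirectoryReader reader = DirectoryReader.open(dirIndex)) {
            // maxDoc() iterates over all doc ids; for a freshly built
            // example index without deletions this is fine.
            for (int i = 0; i < reader.maxDoc(); i++) {
                Document doc = reader.document(i);
                System.out.println(doc);
            }
        }
        dirIndex.close();
    }
}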

Related

No lucene documents created when serializer added

I created lucene index in gfsh using the following command create lucene index --name=myLucIndex --region=myRegion --field=title
--analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer --serializer=a.b.c.MyLatLongSerializer
My serializer is as follows :
class MyLatLongSerializer implements LuceneSerializer<Book> {
    @Override
    public Collection<Document> toDocuments(LuceneIndex luceneIndex, Book book) {
        logger.debug("inside custom lucene serializer ...");
        // Writes fields of Book into a document
        Document newDocument = new Document();
        newDocument.add(new StoredField("title", book.getTitle()));
        newDocument.add(new LatLonPoint("location", book.getLatitude(), book.getLongitude()));
        return Collections.singleton(newDocument);
    }
}
My spring boot configuration file is as follows:
@Configuration
@ClientCacheApplication
@EnableClusterDefinedRegions(clientRegionShortcut = ClientRegionShortcut.CACHING_PROXY)
@EnableIndexing
public class BookConfiguration {

    @Bean(name = "bookGemfireCache")
    ClientCacheConfigurer bookGemfireCache(
            @Value("${spring.data.geode.locator.host:localhost}") String hostname,
            @Value("${spring.data.geode.locator.port:10334}") int port) {
        // Get clientCache
    }

    @Bean
    Region<Long, Book> bookRegion(ClientCache clientCache) {
        logger.debug("inside regions ...");
        return clientCache.getRegion("myRegion");
    }

    @Bean
    LuceneService ukBikesLuceneService(ClientCache clientCache) {
        return LuceneServiceProvider.get(clientCache);
    }
}
I load data into Geode using the following code:
bookRegion.putAll(Map<bookId, Book>);
When I then run
describe lucene index --name=myLucIndex --region=myRegion
it reports 0 documents. But when I create the Lucene index without the serializer, i.e.
create lucene index --name=myLucIndex --region=myRegion --field=title --analyzer=org.apache.lucene.analysis.en.EnglishAnalyzer
load the data again and run
describe lucene index --name=myLucIndex --region=myRegion
it reports 96 documents.
I use Spring Data Geode 2.1.8.RELEASE, geode-core 1.9.0 and lucene-core 8.2.0.
What am I missing here?
Apache Geode currently uses Apache Lucene version 6.6.6, while you're using lucene-core 8.2.0, which is not backward compatible with older major versions like 6.x; that's the reason why you're getting these exceptions. Everything should work just fine if you use the Lucene version shipped with Geode.
As a side note, there are ongoing efforts to upgrade the Lucene version used by Geode; you can follow the progress through GEODE-7039.
Hope this helps.

How to merge 10000 PDFs into one using PDFBox in the most effective way

The PDFBox API works fine for a small number of files, but I need to merge 10000 PDF files into one, and when I pass 10000 files (about 5 GB) it takes about 5 GB of RAM and finally runs out of memory.
Is there an implementation for such a requirement in PDFBox?
I tried to tune it: I used AutoClosedInputStream, which gets closed automatically after reading, but the result is still the same.
I have a similar scenario here, but I need to merge only 1000 documents into a single one.
I tried to use the PDFMergerUtility class, but I got an OutOfMemoryError. So I refactored my code to read each document, load its first page (my source documents have one page only), and then merge, instead of using PDFMergerUtility. Now it works fine, with no more OutOfMemoryError.
public void merge(final List<Path> sources, final Path target) {
    final int firstPage = 0;
    try (PDDocument doc = new PDDocument()) {
        for (final Path source : sources) {
            // setupTempFileOnly() is presumably MemoryUsageSetting.setupTempFileOnly(),
            // statically imported, so PDFBox buffers each source to a temp file
            // instead of keeping it in main memory.
            try (final PDDocument sdoc = PDDocument.load(source.toFile(), setupTempFileOnly())) {
                final PDPage spage = sdoc.getPage(firstPage);
                doc.importPage(spage);
            }
        }
        doc.save(target.toAbsolutePath().toString());
    } catch (final IOException e) {
        throw new IllegalStateException(e);
    }
}
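For reference, a hypothetical call site for the merge() method above (the file names are placeholders; the usual java.util and java.nio.file imports are assumed):

// Single-page source PDFs to combine; the names are placeholders.
List<Path> sources = Arrays.asList(
        Paths.get("page1.pdf"),
        Paths.get("page2.pdf"),
        Paths.get("page3.pdf"));

// Write the combined document to merged.pdf.
merge(sources, Paths.get("merged.pdf"));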

Tess4J doOCR() for *First Page* of pdf / tif

Is there a way to tell Tess4J to only OCR a certain number of pages or characters?
I will potentially be working with 200+ page PDFs, but I really only want to OCR the first page, if that!
As far as I understand, the common sample
package net.sourceforge.tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        File imageFile = new File("eurotext.tif");
        Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
        // Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
would attempt to OCR the entire 200+ page document into a single String.
For my particular case, that is way more than I need it to do, and I'm worried it could take a very long time if I let it do all 200+ pages and then just substring the first 500 characters or so.
The library has a PdfUtilities class that can extract certain pages of a PDF.
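A rough sketch of that approach, assuming a Tess4J version where PdfUtilities lives in net.sourceforge.tess4j.util and exposes splitPdf(inputFile, outputFile, firstPage, lastPage) and convertPdf2Tiff(file); the package and exact signatures have changed between Tess4J releases (older versions take String paths), so check the variant in your version. It also uses the newer new Tesseract() constructor instead of the older Tesseract.getInstance() from the question's sample:

import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.PdfUtilities;

public class FirstPageOcr {
    public static void main(String[] args) throws Exception {
        File bigPdf = new File("scanned-200-pages.pdf"); // placeholder input file

        // Pull only page 1 out of the large PDF into a single-page PDF.
        File firstPageOnly = new File("first-page.pdf");
        PdfUtilities.splitPdf(bigPdf, firstPageOnly, 1, 1);

        // Render that single page to TIFF and OCR just that one image.
        File tiff = PdfUtilities.convertPdf2Tiff(firstPageOnly);

        Tesseract tesseract = new Tesseract();
        // tesseract.setDatapath("/path/to/tessdata"); // set if tessdata is not found automatically
        String firstPageText = tesseract.doOCR(tiff);
        System.out.println(firstPageText);
    }
}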

Import database from File PROGRAMMATICALLY?

Is there a way to programmatically import an entire database from a file with SQL? (Either .CSV, .SQL or .DB files are fine.)
Thanks!!
EDITED LATER TO CLARIFY:
I am interested in a solution that is database independent (it has to work with all types of databases: MySQL, SQL Server, PostgreSQL, Oracle...).
MySQL: LOAD DATA INFILE for CSVs; for .sql files generated with MySQL, use the shell.
For SQLite: see this SO question.
SQL Server: apparently there's the BULK INSERT command.
You are not going to find a database-independent syntax for an SQL command because there isn't one.
There may be a wrapper library around databases, but I'm not aware of one. (Or you could try to use ODBC, but that's connection-oriented and wouldn't allow direct access to a file.)
Perhaps there is an interactive GUI-related software tool out there to do this.
Note also that loading data directly from a file on a database server into a database almost certainly requires elevated privileges (otherwise it would be a security risk).
OK, so I actually found a solution that is database INDEPENDENT to import a database from a .sql file quite easily! :)
Whichever database you have (MySQL, SQLite, ...), do the following:
1) Export your database into .sql format.
(This .sql file will contain all the SQL commands, such as CREATE TABLE ... and INSERT INTO table ...)
(You may need to remove the lines that start with CREATE TABLE and leave only the lines that start with INSERT ...)
2) Then, in the language you are using, write some code that reads each line of the .sql file and stores it into an array of String (String[]).
3) Then execute each String contained in the String[] array as an SQL command.
I've implemented this in Java:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.LinkedList;
import java.util.List;

import android.content.Context;
import android.database.sqlite.SQLiteDatabase;

public class DatabaseImporter {

    private static DatabaseImporter instance;

    public static DatabaseImporter getInstance() {
        if (DatabaseImporter.instance == null)
            instance = new DatabaseImporter();
        return DatabaseImporter.instance;
    }

    private DatabaseImporter() {
    }

    public void importDatabaseFromFile(Context context, String databaseName, String filePath) {
        SQLiteDatabase database = // CREATE YOUR DATABASE WITH THE COMMAND FROM THE DATABASE API YOU ARE USING
        this.executeSqlCommands(database,
                this.readSqlCommandsFromFile(filePath));
    }

    private String[] readSqlCommandsFromFile(String filePath) {
        String[] sqlCommands = new String[0];
        List<String> sqlCommandsList = new LinkedList<String>();
        try {
            // Open the file at the given path
            FileInputStream fstream = new FileInputStream(filePath);
            BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
            String strLine;
            // Read the file line by line, skipping blank lines
            while ((strLine = br.readLine()) != null) {
                if (!strLine.trim().equals(""))
                    sqlCommandsList.add(strLine);
            }
            // Close the input stream
            br.close();
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
        sqlCommands = new String[sqlCommandsList.size()];
        sqlCommandsList.toArray(sqlCommands);
        return sqlCommands;
    }

    private void executeSqlCommands(SQLiteDatabase database, String[] sqlCommands) {
        for (int i = 0; i < sqlCommands.length; i++) {
            database.execSQL(sqlCommands[i]);
        }
    }
}
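For completeness, a hypothetical call site (the database name and dump path are placeholders, and an Android Context is assumed to be available as context):

DatabaseImporter.getInstance()
        .importDatabaseFromFile(context, "mydb", "/sdcard/dump.sql");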
mysql -u -p < dumpfile.sql
Importing CSV would require a script (using e.g. PHP) to put the right fields in the right part of the query.
If you are using SQL Server, check out SSIS.

Using RAMDirectory

When should I use Lucene's RAMDirectory? What are its advantages over other storage mechanisms? Finally, where can I find a simple code example?
When you don't want to permanently store your index data. I use this for testing purposes: add data to your RAMDirectory and do your unit tests against that RAMDir.
e.g.
public static void main(String[] args) {
    try {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        // ... add documents and close the writer ...
    } catch (IOException e) {
        e.printStackTrace();
    }
}
OR
public void testRAMDirectory() throws IOException {
    Directory dir = FSDirectory.getDirectory(indexDir);
    MockRAMDirectory ramDir = new MockRAMDirectory(dir);
    // close the underlying directory
    dir.close();
    // check size
    assertEquals(ramDir.sizeInBytes(), ramDir.getRecomputedSizeInBytes());
    // open a reader to check the document count
    IndexReader reader = IndexReader.open(ramDir);
    assertEquals(docsToAdd, reader.numDocs());
    // open a searcher to check that all docs are there
    IndexSearcher searcher = new IndexSearcher(reader);
    // fetch all documents
    for (int i = 0; i < docsToAdd; i++) {
        Document doc = searcher.doc(i);
        assertTrue(doc.getField("content") != null);
    }
    // cleanup
    reader.close();
    searcher.close();
}
Usually, if things work out with RAMDirectory, they will pretty much work fine with the others, i.e. the ones that permanently store your index.
The alternative to this is FSDirectory. You will have to take care of filesystem permissions in that case (which is not an issue with RAMDirectory).
Functionally, there is no distinct advantage of RAMDirectory over FSDirectory (other than the fact that RAMDirectory will be visibly faster than FSDirectory). They serve two different needs.
RAMDirectory -> primary memory
FSDirectory -> secondary memory
Pretty similar to RAM & hard disk.
I am not sure what will happen to RAMDirectory if it exceeds the memory limit. I'd expect an OutOfMemoryException (System.SystemException) to be thrown.
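For the "simple code example" part of the question, here is a minimal self-contained sketch of indexing and searching with a RAMDirectory, written against a Lucene 6.x/7.x-style API (the class names and version choice are my assumption, not from the answer above; in newer Lucene releases RAMDirectory has been deprecated and then removed in favour of ByteBuffersDirectory):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class RamDirectoryExample {
    public static void main(String[] args) throws Exception {
        // In-memory index: nothing is written to disk.
        RAMDirectory directory = new RAMDirectory();

        // Index a single document with one analyzed, stored text field.
        try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("content", "hello lucene in memory", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the in-memory index for the term "lucene".
        try (DirectoryReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("content", "lucene")), 10);
            System.out.println("hits: " + hits.totalHits);
        }

        directory.close();
    }
}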