How to index PDF / MS-Word / Excel files really fast for full text search?

We are building a real-time search feature for institutions. The index is built from user-uploaded files (mostly Word/Excel/PDF/PowerPoint, plus ASCII text files). I/O is expected at only 10-20 IOPS, but it can vary depending on the date; maximum I/O could reach 100 IOPS. The current database size is approaching 10GB, and it is 4 months old.
For the real-time search server I'm considering Solr / Lucene and possibly ElasticSearch. The challenge is how to index these files FAST, so that the search server can query the index in real time.
I have found some similar questions on how to index .doc/.xls/.pdf, but they did not mention how to ensure indexing performance:
Search for keywords in Word documents and index them
Index Word/PDF Documents From File System To SQL Server
How to extract text from MS office documents in C#
Using full-text search with PDF files in SQL Server 2005
So my question is: how do I build the index FAST?
Any suggestions on the architecture? Should I focus on building fast infrastructure (i.e. RAID, SSDs, more CPU, network bandwidth?), or on the indexing tools and algorithms?

We're building a high-performance full-text search for office documents, so we can share some insights:
We use ElasticSearch. It is difficult to make it perform well on large files; we have written several posts about it:
Highlighting Large Documents in ElasticSearch
Making ElasticSearch Perform Well with Large Text Fields
Use a microservice architecture and Docker to scale your application easily.
Do not store the original files in ElasticSearch as binary data; store them separately, for example in MongoDB.
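As a rough illustration of that split, here is a minimal sketch that extracts plain text with Apache Tika and indexes only the text plus a pointer to the original file. It assumes the ElasticSearch high-level REST Java client; the index name, field names and upload directory are made-up placeholders, not anything from the posts above.

    import org.apache.http.HttpHost;
    import org.apache.tika.Tika;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class IndexUploads {

        public static void main(String[] args) throws Exception {
            Tika tika = new Tika(); // auto-detects Word/Excel/PDF/PowerPoint/plain text

            List<Path> paths;
            try (Stream<Path> files = Files.walk(Paths.get("/data/uploads"))) { // hypothetical upload dir
                paths = files.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
                for (Path path : paths) {
                    // Only the extracted text and a reference to the original file
                    // go into ElasticSearch; the binary itself stays in MongoDB,
                    // a file server, or wherever you keep it.
                    String text = tika.parseToString(path.toFile());
                    IndexRequest request = new IndexRequest("docs")
                            .id(path.getFileName().toString())
                            .source(Map.of(
                                    "filename", path.getFileName().toString(),
                                    "content", text,
                                    "fileRef", path.toString()));
                    client.index(request, RequestOptions.DEFAULT);
                }
            }
        }
    }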
Hope it helps!

Related

Is it possible to query GZIP document stored as Binary data in SQL Server?

I have about thirty-thousand Binary records, all compressed using GZIP, and I need to search the contents of each document for a specified keyword. Currently, I am downloading and extracting all documents at startup. This works well enough, but I expect to be adding another ten-thousand each year. Ideally, I would like to perform a SELECT statement on the Binary column itself, but I have no idea how to go about it, or if this is even possible. I would like to perform this transaction with the least possible amount of data leaving the server. Any help would be appreciated.
EDIT: The SQL records are not compressed. What I mean to say is that I'm compressing the data locally and uploading the compressed files to a SQL Server column of Binary data type. I'm looking for a way to query that compressed data without downloading and decompressing each and every document. The data was stored this way to minimize overhead and reduce transfer cost, but the data must also be queried. It looks like I may have to store two versions of the data on the server, one compressed to be downloaded by the user, and one decompressed to allow search operations to be performed. Is there a more efficient approach?
SQL Server has a Full-Text Search feature. It will not work on the data that you compressed in your application, of course. You have to store it in plain-text in the database. But, it is designed specifically for this kind of search, so performance should be good.
SQL Server can also compress the data in rows or in pages, but this feature is not available in every edition of SQL Server. For more information, see Features Supported by the Editions of SQL Server. You have to measure the impact of compression on your queries.
Another possibility is to write your own CLR functions that would work on the server - load the compressed binary column, decompress it and do the search. Most likely performance would be worse than using the built-in features.
Taking your updated question into account.
I think your idea to store two versions of the data is good.
Store compressed binary data for efficient transfer to and from the server.
Store a secondary copy of the data in an uncompressed format with proper indexes (consider full-text indexes) for efficient keyword searches.
Consider using a CLR function to help during inserts. You can transfer only compressed data to the server, then call a CLR function that decompresses it on the server and populates the secondary table with uncompressed data and indexes.
Thus you'll have both efficient storage/retrieval and efficient searches, at the expense of extra storage on the server. You can think of that extra storage as an extra index structure that helps with searches.
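Here is a rough sketch of that two-column approach from the client side, in Java/JDBC rather than a CLR function. It assumes a hypothetical Documents(Id, CompressedData VARBINARY(MAX), PlainText NVARCHAR(MAX)) table with a full-text index on PlainText, and a SQL Server JDBC driver on the classpath; the connection string is a placeholder.

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.sql.*;
    import java.util.zip.GZIPOutputStream;

    public class TwoColumnStore {

        // Compress locally so the transfer to the server stays small.
        static byte[] gzip(String text) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            String text = "contents of the document ...";
            try (Connection con = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=Docs;integratedSecurity=true")) {

                // One row carries both representations: compressed bytes for
                // cheap transfer, plain text for the full-text index.
                try (PreparedStatement ins = con.prepareStatement(
                        "INSERT INTO Documents (Id, CompressedData, PlainText) VALUES (?, ?, ?)")) {
                    ins.setInt(1, 1);
                    ins.setBytes(2, gzip(text));
                    ins.setString(3, text);
                    ins.executeUpdate();
                }

                // Keyword search runs against the uncompressed column only,
                // via the full-text index assumed above.
                try (PreparedStatement sel = con.prepareStatement(
                        "SELECT Id FROM Documents WHERE CONTAINS(PlainText, ?)")) {
                    sel.setString(1, "keyword");
                    try (ResultSet rs = sel.executeQuery()) {
                        while (rs.next()) {
                            System.out.println("match: " + rs.getInt("Id"));
                        }
                    }
                }
            }
        }
    }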
Why compress 30,000 or 40,000 records? That does not sound like a whole lot of data, though of course it depends on the average size of a record.
For keyword searching, you should not compress the database records. To save disk space, most operating systems can compress data at the file level without SQL Server even noticing.
update:
As Vladimir pointed out, SQL Server does not run on a compressed file system. You could instead store the data in TWO columns: one uncompressed, for keyword searching, and one compressed, for improved data transfer.
Storing data in a separate searchable column is not uncommon. For example, if you want to search on a combination of fields, you might as well store that combination in a search column and index it to accelerate searching. In your case, you might store the data in the search column all lower-cased and with accented characters converted to ASCII, then add an index, to accelerate case-insensitive searching on ASCII keywords.
In fact, Vladimir already suggested this.
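A minimal sketch of that normalization, done in application code before the search column is written (the class and method names are made up):

    import java.text.Normalizer;
    import java.util.Locale;

    public class SearchColumn {

        // Lower-case and strip accents so "Résumé" matches a search for "resume".
        static String normalize(String input) {
            String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
            String asciiOnly = decomposed.replaceAll("\\p{M}", ""); // drop combining marks
            return asciiOnly.toLowerCase(Locale.ROOT);
        }

        public static void main(String[] args) {
            System.out.println(normalize("Résumé for Mr. Ångström")); // resume for mr. angstrom
        }
    }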

How much storage should a LucidWorks search engine index occupy?

I'm trying to use LucidWorks (http://www.lucidimagination.com/products/lucidworks-search-platform) as a search engine for my organization intranet.
I want it to index various document-types (Office formats, PDFs, web pages) from various data sources (web & wiki, file system, Subversion repositories).
So far I tried indexing several sites, directories & repositories (about 500K documents, with total size of about 50GB) - and the size of the index is 155GB.
Is this reasonable? Should the index occupy more storage than the data itself? What would be a reasonable rule of thumb for the data-size to index-size ratio?
There is no single reasonable index size; it basically depends on the data you have.
Ideally the index should be smaller than the data, but there is no rule of thumb.
The ratio of index size to data size depends on how you index the data, and many factors affect it.
Most of the space in the index is consumed by stored fields.
If you index the content of documents and store all of that content, the index will certainly grow huge.
Fine-tuning the attributes of the indexed fields also helps save space.
You may want to revisit which fields really need to be indexed and which need to be stored.
Also check whether you are using lots of copyFields to duplicate data, or are otherwise maintaining repetitive data.
Optimization might help as well.
More info at http://wiki.apache.org/solr/SolrPerformanceFactors
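To illustrate the indexed-versus-stored trade-off, here is a minimal sketch using the Lucene Java API directly (the field names are made up); in Solr the equivalent choice is stored="false" on the field definition in schema.xml.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class FieldStorageExample {

        static Document buildDocument(String id, String body) {
            Document doc = new Document();

            // Small identifier: stored so it can be returned in results.
            doc.add(new StringField("id", id, Field.Store.YES));

            // Large extracted text: indexed for search but NOT stored,
            // which keeps the index much smaller. Retrieve the original
            // content from its source system when displaying results.
            doc.add(new TextField("content", body, Field.Store.NO));

            return doc;
        }
    }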

Speeding up Solr Indexing

I am working on speeding up my Solr indexing. I just want to know how many threads (if any) Solr uses for indexing by default. Is there a way to increase or decrease that number?
When you index a document, several steps are performed:
the document is analyzed,
data is put in the RAM buffer,
when the RAM buffer is full, data is flushed to a new segment on disk,
if there are more than ${mergeFactor} segments, segments are merged.
The first two steps will be run in as many threads as you have clients sending data to Solr, so if you want Solr to run three threads for these steps, all you need is to send data to Solr from three threads.
You can configure the number of threads to use for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to configure the maximum number of threads from the Solr configuration files, so you need to write a custom class which calls setMaxThreadCount in its constructor (see the sketch below).
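A minimal sketch of such a class, assuming the Lucene 3.x API linked above; the class name and thread count are made up.

    import org.apache.lucene.index.ConcurrentMergeScheduler;

    // Hypothetical scheduler class: Solr's configuration can name a merge
    // scheduler class but cannot pass it a thread count, so the count is
    // hard-coded in the constructor, as described above.
    public class ThreeThreadMergeScheduler extends ConcurrentMergeScheduler {
        public ThreeThreadMergeScheduler() {
            setMaxThreadCount(3); // setMaxThreadCount(int) per the Lucene 3.x API
        }
    }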
My experience is that the main ways to improve indexing speed with Solr are:
buying faster hardware (especially I/O),
sending data to Solr from several threads (as many threads as cores is a good start),
using the Javabin format,
using faster analyzers.
Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 documents per request in my case, though with rather small documents) using CommonsHttpSolrServer and the Javabin format.
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
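A minimal sketch of that setup, using the SolrJ classes named above as they existed in the 1.4/3.x line; the URL, field names, and the batch size of 800 are placeholders taken loosely from the answer.

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class JavabinBulkIndexer {

        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Switch the request writer from XML to Javabin.
            server.setRequestWriter(new BinaryRequestWriter());

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 10000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("content", "extracted text for document " + i);
                batch.add(doc);

                // Send updates in bulk (800 per request, as in the answer above).
                if (batch.size() == 800) {
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();
        }
    }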
This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. This is for Solr 4.0 which hadn't been released at the time this question was originally posted.
You can also store the content in external storage, such as plain files.
For fields that contain very large content, set stored="false" on the corresponding field in the schema and keep that field's content in external files, using an efficient file-system hierarchy (see the sketch below).
In my case this reduced indexing time by 40 to 45%, but searching got somewhat slower: retrieving results took about 25% longer than a normal search.
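One way that "efficient file system hierarchy" could look. This is only a sketch: the class, the hashing scheme and the .txt suffix are my own choices, not anything prescribed by Solr.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    public class ExternalContentStore {

        private final Path root;

        public ExternalContentStore(Path root) {
            this.root = root;
        }

        // e.g. id "doc42" -> <root>/3f/7a/doc42.txt : the two hash-derived
        // levels keep any single directory from holding millions of files.
        Path pathFor(String docId) throws Exception {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(docId.getBytes(StandardCharsets.UTF_8));
            String level1 = String.format("%02x", hash[0]);
            String level2 = String.format("%02x", hash[1]);
            return root.resolve(level1).resolve(level2).resolve(docId + ".txt");
        }

        void save(String docId, String content) throws Exception {
            Path path = pathFor(docId);
            Files.createDirectories(path.getParent());
            Files.write(path, content.getBytes(StandardCharsets.UTF_8));
        }

        String load(String docId) throws Exception {
            return new String(Files.readAllBytes(pathFor(docId)), StandardCharsets.UTF_8);
        }
    }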

Lucene - is it the right answer for huge index?

Is Lucene capable of indexing 500M text documents of 50K each?
What performance can be expected from such an index, for a single-term search and for a 10-term search?
Should I be worried and directly move to distributed index environment?
Saar
Yes, Lucene should be able to handle this, according to the following article:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
Here's a quote:
Depending on a multitude of factors, a single machine can easily host a Lucene/Solr index of 5 – 80+ million documents, while a distributed solution can provide subsecond search response times across billions of documents.
The article goes into great depth about scaling to multiple servers. So you can start small and scale if needed.
A great resource about Lucene's performance is the blog of Mike McCandless, who is actively involved in the development of Lucene: http://blog.mikemccandless.com/
He often uses Wikipedia's content (25 GB) as test input for Lucene.
Also, it might be interesting that Twitter's real-time search is now implemented with Lucene (see http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html).
However, I am wondering if the numbers you provided are correct: 500 million documents x 50 KB = ~23 TB -- Do you really have that much data?

lucene file index

I have to index log records captured from enterprise networks. In the current implementation every protocol has its own index laid out as year/month/day/luceneindex. I want to know: if I use only one single Lucene index and update that single index every day, how does this affect search time? Is the increase considerable? In the current situation, when I search I query exactly the index for that day.
Current: smtp/year/month/day/luceneindex
If I do smtp/luceneindex, all the indexes are in a single place. Let me know the pros and cons.
That depends on a whole range of factors.
What do you mean by a single Lucene index?
Lucene stores an index using multiple types of files and segments, so there is more than one file on disk anyway.
What log data are you indexing, and how?
What do you use for querying across Lucene indexes: Solr, ElasticSearch, or something custom?
Are you running a single-instance, single-machine configuration?
Can you run multiple processes on separate hosts, using some for search tasks and others for index updates?
What are your typical search queries like? Optimise for those cases.
Have a look at http://elasticsearch.org/ or http://lucene.apache.org/solr/ for distributed search options.
Lucene also has options to run in memory, such as RAMDirectory, which you may want to investigate.
Is the size of the one-day index going to be problematic for administration?
Are the file sizes going to be so large, relative to disk and bandwidth constraints, that copying or moving them introduces issues?
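If you keep the per-day layout, you can still search several days in one query by opening the relevant indexes together. A minimal sketch, assuming a Lucene 5+ style API; the paths, field name and query string are made up.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class MultiDaySearch {

        public static void main(String[] args) throws Exception {
            // Keep one index per day, but open only the days a query needs.
            IndexReader day1 = DirectoryReader.open(FSDirectory.open(Paths.get("smtp/2013/01/15")));
            IndexReader day2 = DirectoryReader.open(FSDirectory.open(Paths.get("smtp/2013/01/16")));

            // MultiReader presents the per-day indexes as one logical index.
            try (MultiReader reader = new MultiReader(day1, day2)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("message", new StandardAnalyzer()).parse("login failure");
                TopDocs hits = searcher.search(query, 10);
                System.out.println("hits: " + hits.totalHits);
            }
        }
    }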