Can we tell Solr/Lucene max chars to analyze for a search? - lucene

I have a problem: in my Lucene index, a single document can contain a huge amount of text. When I search one of these huge documents, Lucene/Solr returns no results even though the search term exists in the document text. I suspect the cause is the large number of characters in the document. If so, how can I tell Solr/Lucene how many characters to analyze during indexing and search? Please explain.
I am using Solr 1.4.1. Can anyone help?
Thanks
Ahsan

Lucene can handle huge documents without trouble. It seems unlikely that the document size itself is the problem. Use a tool like Luke to inspect the index and see what terms are associated with some of these large documents.

Also, have you changed the maxFieldLength setting in solrconfig.xml? I am testing out indexing the Bible, at 25 MB of data, and with a maxFieldLength of 10,000, which is the default, only the first 10,000 tokens ever get analyzed, which leads to roughly 2,000 unique terms for my document.
If you are using Lucene directly, there are a couple of settings for maxFieldLength; you may have it set to "unlimited" and therefore be getting everything. Check the Javadocs for how to set maxFieldLength.
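If it helps, here is a rough sketch of setting that limit when writing the index with Lucene directly (assuming the Lucene 2.9.x API bundled with Solr 1.4.1; the index path and analyzer are just placeholders):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    Directory dir = FSDirectory.open(new File("/path/to/index")); // placeholder path

    // MaxFieldLength.LIMITED truncates each field at 10,000 tokens (the same
    // default Solr ships with); UNLIMITED indexes every token in the field.
    IndexWriter writer = new IndexWriter(
        dir,
        new StandardAnalyzer(Version.LUCENE_29),
        true,                                   // create a new index
        IndexWriter.MaxFieldLength.UNLIMITED);

    // Or pick an explicit cap on the number of tokens analyzed per field:
    // writer.setMaxFieldLength(50000);

In Solr, the equivalent is the maxFieldLength element in solrconfig.xml mentioned above; after raising it you need to reindex before terms beyond the first 10,000 tokens become searchable.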

Related

Google CSE limit indexing of single file?

I have been using Google CSE to index several long PDF files for searching (some 500+ pages long). I am noticing that the search will find terms close to the beginning of some of these documents, but not terms that are near the end of the document. Is there a limit to how much of a single file Google will index?
Since no one seems to know, I will provide my experience. We have requested a manual index of the PDF files several times, and still cannot get the search to pick up any search terms past pages 10-15. It seems like there is a character limit on how much of a single PDF gets indexed. Google support is not available to confirm this until the business version is purchased, which we won't be doing.

how lucene use skip list in inverted index?

From some blogs and the Lucene website, I know Lucene uses the "skip list" data structure in its inverted index, but I am puzzled by a couple of things:
1: In general, a skip list is used in memory, but the inverted index is stored on disk. So how does Lucene use it when searching the index? Does it just scan it on disk, or load it into memory?
2: A skip list's insert operation usually uses random(0,1) to decide whether to promote an entry to the next level, but in the Lucene documentation it seems to use a fixed interval between entries. So how does Lucene create its "skip list"? Is it different or not?
Please correct me if I am wrong.
Even though the index is persisted on disk, Lucene uses memory in a couple of different ways when the IndexReader is created for searching, and for operations like sorting (the field cache):
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
Basically, those binary files get copied into RAM for much faster scanning and reduced I/O. The link above also hints at how searching with certain parameters can force Lucene to skip terms during searching, which is where that data structure comes into play.
Lucene is open source, so you can read the code yourself to see what is being used, either in Java or in Lucene.NET for the C# implementation.
See also: "To accelerate posting list skips, Lucene uses skip lists."
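To illustrate the answer to the second question (this is only a sketch of the idea, not Lucene's actual code): the skip entries are laid down at a fixed interval over the already-sorted posting list, so no coin flips are needed, and a search can jump over whole blocks before scanning sequentially.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: a single-level, fixed-interval skip structure over a sorted
    // posting list, analogous in spirit to recording a skip entry every
    // "skipInterval" documents rather than promoting entries at random.
    class FixedIntervalSkip {
        private final int[] postings;                // sorted doc IDs for one term
        private final List<Integer> skipPoints = new ArrayList<Integer>();
        private final int skipInterval;

        FixedIntervalSkip(int[] postings, int skipInterval) {
            this.postings = postings;
            this.skipInterval = skipInterval;
            for (int i = skipInterval; i < postings.length; i += skipInterval) {
                skipPoints.add(i);                   // deterministic, no randomness
            }
        }

        /** Index of the first posting >= target, using skip entries to jump
         *  over blocks before falling back to a sequential scan. */
        int advance(int target) {
            int start = 0;
            for (int p : skipPoints) {
                if (postings[p] <= target) start = p; else break;
            }
            for (int i = start; i < postings.length; i++) {
                if (postings[i] >= target) return i;
            }
            return -1;                               // no matching document
        }
    }

With postings {2, 5, 9, 14, 21, 30, 37, 45, 52, 60} and a skip interval of 4, advance(50) jumps straight to the entry at doc 21 and only scans from there, instead of walking the whole list from the start.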

Lucene indexing with for structured document where each text line has meta-data

I have a document structure where each text line in the document has some meta-data associated with it. The search result must show the line and the meta-data for the line.
Currently I am storing each such line as a Lucene document, with the metadata as one of the non-indexed (stored-only) fields. That is, I create and add a Lucene Document for each line. My concern is that I may end up with too many Documents in the index.
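Roughly, the per-line indexing described above looks like this (a minimal sketch against the pre-4.0 Field API; lineText, lineMetadata and writer are placeholders):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // One Lucene Document per text line; the line's metadata is stored for
    // display in the search results but not indexed.
    Document doc = new Document();
    doc.add(new Field("text", lineText, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("metadata", lineMetadata, Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);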
Is there a more elegant approach?
Thanks
Personally I'd index the documents as normal, and figure out the metadata / line number later.
There is no question that Lucene can cope with that many documents; however, it might degrade the search results somewhat. For example, you can perform searches that look for multiple terms in close proximity to each other, but this obviously won't work when the terms are split across multiple documents (lines).
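For instance, a proximity query such as the following (sketched with the older, mutable PhraseQuery API) can only match terms that occur within the same Document, so it can never bridge two line-documents:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // "quick" and "fox" within 3 positions of each other, in the "text" field.
    // With one Document per line, both terms must appear on the same line.
    PhraseQuery query = new PhraseQuery();
    query.add(new Term("text", "quick"));
    query.add(new Term("text", "fox"));
    query.setSlop(3);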
How many is "too many"? Lucene has been known to handle hundreds of millions of records in a single index, so I doubt that you should have a problem. That being said, there's no substitute for testing and benchmarking yourself to see if this approach is good for your needs.

Index verification tools for Lucene

How can we know the index in Lucene is correct?
Detail
I created a simple program that builds Lucene indexes and stores them in a folder. Using the diagnostic tool Luke, I could look inside an index and view the content.
I realise Lucene is a standard framework for building a search engine, but I wanted to be sure that Lucene indexes every term that exists in a file.
Can I verify that the Lucene index creation is dependable? That not even a single term went missing?
You could always build a small program that will perform the same analysis you use when indexing your content. Then, for all the terms, query your index to make sure that the document is among the results. Repeat for all the content. But personally, I would not waste time on this. If you can open your index in Luke and if you can make a couple of queries, everything is most probably fine.
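A rough sketch of such a check, assuming the pre-4.0 TokenStream API (the analyzer, searcher and content variables, and the field name "text", are placeholders):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // Re-run the same analysis used at index time and verify that every term
    // it produces can be found in the index.
    TokenStream ts = analyzer.tokenStream("text", new StringReader(content));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        TopDocs hits = searcher.search(new TermQuery(new Term("text", termAtt.term())), 1);
        if (hits.totalHits == 0) {
            System.out.println("Missing term: " + termAtt.term());
        }
    }
    ts.close();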
Often, the real question is whether or not the analysis you did on the content will be appropriate for the queries that will be made against your index. You have to make sure that your index will have a good balance between recall and precision.

Couple o' quick questions on Apache Lucene

-- I don't want to start any religious wars, but a quick Google search indicates that Apache Lucene is the preferred open-source tool for indexing and searching. Are there others?
-- What file format does Lucene use to store its index file(s)?
Thanks in advance.
Doug
For alternatives, see "Which are the best alternatives to Lucene?". And as a Lucene user, I can say it has improved a lot performance-wise over the last couple of versions (NOT meaning it was slow before!)
It uses a proprietary format; see here.
I suggest you look at Sphinx.
I have experience with Lucene.Net, and we had many problems with multithreaded indexing. Lucene stores its index in files, and these files can be locked by anti-virus software.
Also, you cannot compare numbers in Lucene: it is impossible to filter products by size or price.