I have been using Google CSE to index several long PDF files for searching (some 500+ pages long). I am noticing that the search will find terms close to the beginning of some of these documents, but not terms that are near the end of the document. Is there a limit to how much of a single file Google will index?
Since no one seems to know, I will share my experience. We have requested a manual index of the PDF files several times, and still cannot get the search to pick up any search terms past pages 10-15. It seems like there is a character limit on how much of a single PDF gets indexed. Google support is not available to confirm this until the business version is purchased, which we won't be doing.
I create a PDF file with 20,000 pages, send it to a printer, and the individual pages are printed and mailed. These are tax bills to homeowners.
I would like to place the PDF file on my web server.
When a customer inputs a unique bill number on a search page, a search for that specific page is started.
When the page within the PDF file is located, only that page is displayed to the requester.
There are other issues (security, uniqueness of the bill number being searched for) that can be worked out.
The main questions are: 1) Can this be done? 2) Is a third-party program required?
I am a novice programmer and would like to try and do this myself.
Thank you
It is possible, but I would strongly recommend a different route. Instead of one 20,000-page document, which might be great for printing, can you instead make 20,000 individual documents and just name them with something unique (bill number or whatever)? PDFs are document presentations and aren't suited for searching or even for storing text information. There are no "words" or "paragraphs", and there's not even a guarantee that text is written letter after letter: "Hello World" could be written "Wo", "He", "llo", "rld". Your customer's number might be "H1234567" but be written as "1234567", "H". Text might be in-page, but it might also be in form fields, which adds to the complexity. There are many PDF libraries out there that try to solve these problems, but if you can avoid them in the first place your life will be much easier.
If you can't re-make the main document then I would suggest a compromise. Take some time now and use a library like iText (Java) or iTextSharp (.Net) to split the giant document into smaller documents arbitrarily named. Then try to write your text extraction logic using the same libraries to find your uniqueifiers in the documents and rename each document accordingly. This is really the only way that you can prove that your logic worked on every possible scenario.
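A rough sketch of that split-and-extract pass, assuming iText 5 in Java (the com.itextpdf packages; older iText 2.x uses com.lowagie instead, and the iTextSharp calls are nearly identical). File names and the renaming step are placeholders:

    import java.io.FileOutputStream;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.PdfCopy;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    public class SplitBills {
        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("tax-bills.pdf");  // the giant master file
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                // 1. Write this single page out to its own arbitrarily named file.
                Document doc = new Document();
                PdfCopy copy = new PdfCopy(doc, new FileOutputStream("bill-" + page + ".pdf"));
                doc.open();
                copy.addPage(copy.getImportedPage(reader, page));
                doc.close();

                // 2. Pull the raw text so you can look for the bill number
                //    and rename the file accordingly.
                String text = PdfTextExtractor.getTextFromPage(reader, page);
                // ... apply your bill-number matching / renaming logic to 'text' here ...
            }
            reader.close();
        }
    }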
Also, be careful with your uniqueifiers. If you have accounts like "H1234" and "H12345" you need to make sure that your search algorithm is aware that one is a subset (and therefore a match) of the other.
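For example, a whole-token match avoids "H1234" matching inside "H12345"; the pattern below is just one way to express that check:

    import java.util.regex.Pattern;

    public class BillNumberMatch {
        // True only when billNumber appears as a whole token,
        // not embedded inside a longer alphanumeric token.
        static boolean containsBillNumber(String pageText, String billNumber) {
            Pattern p = Pattern.compile(
                "(?<![A-Za-z0-9])" + Pattern.quote(billNumber) + "(?![A-Za-z0-9])");
            return p.matcher(pageText).find();
        }

        public static void main(String[] args) {
            System.out.println(containsBillNumber("Account H12345 due", "H1234")); // false
            System.out.println(containsBillNumber("Account H1234 due",  "H1234")); // true
        }
    }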
Finally, and this depends on how sensitive your client's data is, but if you're transporting very sensitive material I'd really suggest you spot-check every single document. Sucks, I know, I've had to do it. I'd get a copy of Ghostscript and convert all of the PDFs to images and then just run them through a program that can show me the document and the file name all at once. Google Picasa works nicely for this. You could also write a Photoshop action that crops the document to a specific region and then just use Windows Explorer.
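The Ghostscript call for that conversion is roughly along these lines (device, resolution, and file names are just examples):

    gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r150 -sOutputFile=bill-1.png bill-1.pdf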
I have a document structure where each text line in the document has some metadata associated with it. The search result must show the line and the metadata for the line.
Currently I am storing each such line as a Lucene document and storing the metadata as one of the non-indexed fields. That is, I create and add a Lucene Document for each line. My concern is that I may end up with too many documents in the index.
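Roughly, what I do per line looks like this (a sketch against the Lucene 3.0-style Java API; the field names are mine):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class LineIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    new RAMDirectory(),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // One Lucene Document per text line: the line is indexed and searchable,
            // the metadata is stored only (not indexed) so it can be shown with the hit.
            Document doc = new Document();
            doc.add(new Field("line", "some line of text", Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("meta", "page=12;source=fileA", Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);

            writer.close();
        }
    }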
Is there a more elegant approach ?
Thanks
Personally I'd index the documents as normal, and figure out the metadata / line number later.
There is no question about whether Lucene can cope with that many documents; however, splitting by line might degrade the search results somewhat. For example, you can perform searches that look for multiple terms in close proximity to each other, but this obviously won't work when the terms are split over multiple documents (lines).
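For instance, a sloppy phrase query like the one below (Lucene Java API; the field name is invented) only matches when both words fall inside a single Lucene document, i.e. a single line under your scheme:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    public class ProximityExample {
        public static void main(String[] args) {
            // "foo" and "bar" must occur within 5 positions of each other
            // inside one Lucene document -- one line, in this layout.
            PhraseQuery query = new PhraseQuery();
            query.add(new Term("line", "foo"));
            query.add(new Term("line", "bar"));
            query.setSlop(5);
            System.out.println(query);  // prints something like line:"foo bar"~5
        }
    }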
How many is "too many"? Lucene has been known to handle hundreds of millions of records in a single index, so I doubt that you should have a problem. That being said, there's no substitute for testing and benchmarking yourself to see if this approach is good for your needs.
I have Lucene files indexed by pageId (the unique key), and one document can have multiple pages. Once a user performs a search, it returns the pages that match the search criteria.
I am using Lucene.Net 2.9.2
We have 2 problems...
1- The index file is around 800 GB and has 130 million rows (pages), so searches were really slow: all queries took more than a minute, even though we only have to return a limited number of rows at a time.
To overcome the performance issue I shifted to Solr, which resolved it (which is quite strange, as I am not using any extra functionality Solr provides, like sharding - so could it be that Lucene.Net 2.9.2 does not really match the performance of the same version of Java Lucene?). But now I am having another issue...
2- Each individual Lucene document is one page, but I want to show results grouped by "real documents". How many results are returned should be configurable in terms of "real documents", not pages (because that is how I want to present them to the user).
So let's say I want 20 "real documents" and ALL the pages in them that match the search criteria (it doesn't matter if one document has 100 matching pages and another just 1).
From what I could gather on the Solr forums, this can be achieved with the SOLR-236 patch (field collapsing), but I have not been able to apply the patch correctly against trunk (it gives lots of errors).
This is really important for me and I don't have much time, so can someone please either send me a Solr 1.4.1 binary with this patch applied, or point me to another way of doing this?
I would really appreciate it. Thanks!!
If you have issues with the collapse patch, then the Solr issue tracker is the channel to report them. I can see that other people are currently having some issues with it, so I suggest getting involved in its development.
That said: I recommend that if your application needs to search for 'real documents', then build your index around these 'real documents', not their individual pages.
If your only requirement is to show page numbers, I would suggest playing with the highlighter or doing some custom development. You can store the word offsets of the start and end of each page in a custom structure, and knowing the position of the matched word in the whole document, you can work out which page it appears on. If the documents are very large you will get a good performance improvement.
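A rough sketch of that lookup (all names invented; you would persist the page-start offsets alongside each document): binary-search the matched word position against the sorted page-start offsets.

    import java.util.Arrays;

    public class PageLocator {
        // pageStartPositions[i] is the word offset at which page i+1 begins (sorted ascending).
        // Returns the 1-based page number containing the matched word position.
        static int pageForPosition(int[] pageStartPositions, int matchPosition) {
            int idx = Arrays.binarySearch(pageStartPositions, matchPosition);
            if (idx < 0) {
                idx = -idx - 2;  // insertion point minus one = last page starting before the match
            }
            return idx + 1;
        }

        public static void main(String[] args) {
            int[] pageStarts = {0, 350, 720, 1100};               // a 4-page document
            System.out.println(pageForPosition(pageStarts, 10));   // 1
            System.out.println(pageForPosition(pageStarts, 800));  // 3
        }
    }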
You could also have a look at SOLR-1682 (Implement CollapseComponent). I haven't tested it yet, but as far as I know it solves the collapsing problem too.
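If the patch route stays blocked, a stopgap is to group on the client side: walk the page-level hits in rank order, keep the first 20 distinct real-document ids, then re-query restricted to those ids so every matching page of those documents comes back. A SolrJ sketch, with the field name realDocId invented for illustration:

    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class ClientSideGrouping {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Step 1: collect the first 20 distinct real-document ids, in rank order.
            Set<String> realDocIds = new LinkedHashSet<String>();
            int start = 0;
            while (realDocIds.size() < 20) {
                SolrQuery q = new SolrQuery("text:someterm");
                q.setStart(start);
                q.setRows(500);
                SolrDocumentList batch = solr.query(q).getResults();
                if (batch.isEmpty()) {
                    break;  // ran out of hits before reaching 20 documents
                }
                for (SolrDocument page : batch) {
                    realDocIds.add((String) page.getFieldValue("realDocId"));
                    if (realDocIds.size() == 20) {
                        break;
                    }
                }
                start += 500;
            }

            // Step 2: re-run the search with a filter such as
            //   q.addFilterQuery("realDocId:(id1 OR id2 OR ... OR id20)")
            // so that ALL matching pages of those 20 documents are returned.
        }
    }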
I have a problem: in my Lucene index, a single document can contain a huge amount of text. When I search one of these huge documents, Lucene/Solr does not return any results even though the search term exists in the document text. Could the reason be the large number of characters in the document text? If so, how can we tell Solr/Lucene how many characters to analyze? Please explain.
I am using Solr 1.4.1. Can anyone help?
Thanks
Ahsan
Lucene can handle huge documents without trouble. It seems unlikely that the document size itself is the problem. Use a tool like Luke to inspect the index and see what terms are associated with some of these large documents.
Also, have you changed the maxFieldLength setting in solrconfig.xml? I am testing out indexing the Bible, at 25 MB of data, and with the default maxFieldLength of 10,000, only the first 10,000 tokens ever get analyzed, which leads to roughly 2,000 unique terms for my document.
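For reference, in a stock Solr 1.4 solrconfig.xml the cap looks like this; raising it (here to Integer.MAX_VALUE) lets the whole field be analyzed:

    <!-- solrconfig.xml: the default caps analysis at 10,000 tokens per field -->
    <maxFieldLength>2147483647</maxFieldLength>

You will need to re-index after changing it, since already-truncated documents don't get re-analyzed on their own.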
If you are using Lucene directly, then there are a couple of settings for maxFieldLength; you may have it set to "unlimited" and therefore be getting everything. Check the JavaDocs for how to set maxFieldLength.
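A minimal Java sketch of the two ways to set it on an IndexWriter (Lucene 2.9/3.0-style API; the Lucene.Net calls are analogous):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class UnlimitedFieldLength {
        public static void main(String[] args) throws Exception {
            // Passing UNLIMITED at construction lifts the default 10,000-token cap per field.
            IndexWriter writer = new IndexWriter(
                    new RAMDirectory(),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // The cap can also be adjusted on an existing writer (older-style setter).
            writer.setMaxFieldLength(Integer.MAX_VALUE);

            writer.close();
        }
    }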
I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is to break down the content by word count and/or create a complex set of search rules to give one piece of content priority over another. This might be more trouble than it's worth, but I was wondering if anybody knows of a way or product out there that would be able to help me.
To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting the search to those documents that have that website in that field.
The other approach is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written, so it may simply allow you to collect your searches into more logical units, which helps you with your solution.
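As a sketch of that partitioning with the Lucene Java API (field names are arbitrary): tag each indexed page with its website at index time, then combine the user's term with a mandatory website clause at query time.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class PartitionedSearch {
        // Index time: tag each page with the site it came from.
        static Document buildDoc(String site, String pageText) {
            Document doc = new Document();
            doc.add(new Field("website", site, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("content", pageText, Field.Store.YES, Field.Index.ANALYZED));
            return doc;
        }

        // Query time: restrict the content search to a single website.
        static BooleanQuery buildQuery(String site, String term) {
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term("website", site)), BooleanClause.Occur.MUST);
            q.add(new TermQuery(new Term("content", term)), BooleanClause.Occur.MUST);
            return q;
        }

        public static void main(String[] args) {
            System.out.println(buildQuery("example.com", "invoice"));
            // prints something like: +website:example.com +content:invoice
        }
    }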
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server, then its full-text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or search online for specifics.