Apache Solr - are the documents themselves stored internally apart from the index?

I have been trying to research how Solr works when documents such as DOC or PDF files are submitted to it. If I submit PDFs to Solr, does it end up storing the PDF file itself along with the index generated from parsing it?
Thanks,
-Keshav

Solr (Lucene) doesn't "end up storing the PDF file" itself. However, it can store the text extracted from the PDF by a text extractor such as Tika (provided the field is marked as stored in the schema).
If you wish to store the PDF file in its entirety, you will need to convert the PDF into (for example) a Base64 representation and persist the Base64 string as a stored field. When you retrieve the doc, you convert it back from Base64 to PDF.
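The Base64 round trip described above could look roughly like this in Python. It is only a sketch: the field name pdf_base64 and the document id are made up for illustration, and the sample bytes stand in for a real file read with open(path, "rb").

```python
import base64


def pdf_to_stored_value(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as an ASCII Base64 string for a stored field."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def stored_value_to_pdf(stored: str) -> bytes:
    """Decode the Base64 string read back from the index into PDF bytes."""
    return base64.b64decode(stored)


# Stand-in bytes; a real caller would read them from the PDF file on disk.
sample_pdf = b"%PDF-1.4\n% minimal stand-in, not a real document\n%%EOF\n"

# Hypothetical document as it might be sent to the index.
doc = {"id": "doc-1", "pdf_base64": pdf_to_stored_value(sample_pdf)}

# Later, after fetching the document, recover the original bytes.
restored = stored_value_to_pdf(doc["pdf_base64"])
```

Note that Base64 makes the stored value about one third larger than the original file.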

Related

How to generate PDF files using Liferay?

I tried to find proper services for generating PDF files in Liferay, but I have found only the class PDFProcessorUtil. How do I use it to generate a PDF file, and how do I save the generated file afterwards? I think I should use DLAppLocalServiceUtil.addFileEntry to save the file into Liferay storage.
Liferay's PDF conversion works by converting documents in the Document Library and offering them for download; this is implemented through OpenOffice. Install OpenOffice or LibreOffice, run it in server mode and configure Liferay to use it; you can then choose to download documents as PDF. The HTML format has a few limitations, since an HTML page can pull in many external resources, so I'm not sure what your result will be.
If you're generating the HTML output yourself, you might want to consider some other (Liferay-independent) means of generating PDF, as you might not need to upload your files to the Document Library (e.g. if you're generating reports on the fly and just want the generator result to be a PDF, without storing it). If this is what you need, you can use any PDF converter library you want - Liferay does not limit you in your choice.
You can also generate the PDFs from the serve resource phase of a portlet.
You put a button or a link somewhere, and when you click on it, you download the PDF.
In this simple example, the PDF is generated from a Freemarker template that produces HTML, which is then converted to PDF:
https://github.com/roclas/pdfUtil

Using ElasticSearch and/or Solr as a datastore for MS Office and PDF documents

I'm currently designing a full text search system where users perform text queries against MS Office and PDF documents, and the result will return a list of documents that best match the query. The user will then be able to select any document returned and view that document within MS Word, Excel, or a PDF viewer.
Can I use ElasticSearch or Solr to import the raw binary documents (i.e. .docx, .xlsx, .pdf files) into its "data store", and then export the document to the user's device on command for viewing?
Previously, I used MongoDB 2.6.6 to import the raw files into GridFS and the extracted text into a separate collection (the collection contained a text index) and that worked fine. However, MongoDB full text searching is quite basic and therefore I'm now looking at either Solr or ElasticSearch to perform more complex text searching.
Nick
Both Solr and Elasticsearch will index the content of the document. Solr has that built-in, Elasticsearch needs a plugin. Easy either way and both use Tika under the covers.
Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.
Additionally, neither Solr nor Elasticsearch is currently recommended as primary storage. They can do it, but durable storage is not as mission critical for them as it is for, say, a filesystem implementation.
So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
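A rough Python sketch of that split, under some assumptions: files are copied into a plain directory acting as the external store, and the search engine only receives a small record with the path and the extracted text. The names store_file and build_index_doc, the field names, and the placeholder text are all illustrative; real text extraction (Tika) and the actual indexing call are left out.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_file(source: Path, storage_root: Path) -> Path:
    """Copy the file into the store under a name derived from its SHA-256."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    target = storage_root / f"{digest}{source.suffix}"
    shutil.copyfile(source, target)
    return target


def build_index_doc(stored_path: Path, extracted_text: str) -> dict:
    """Build the record sent to the search engine: text plus a pointer."""
    return {
        "id": stored_path.stem,          # the content hash doubles as the id
        "path": str(stored_path),        # where the viewer fetches the file
        "content": extracted_text,       # what the full text search runs on
    }


# Demo with a throwaway directory and stand-in file contents.
work = Path(tempfile.mkdtemp())
storage = work / "storage"
storage.mkdir()
original = work / "report.pdf"
original.write_bytes(b"%PDF-1.4 stand-in bytes")

stored = store_file(original, storage)
index_doc = build_index_doc(stored, "text extracted by Tika would go here")
```

When a user picks a search hit, the application reads the path field and serves the file from the store, so the index never carries the binary.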
I would try the Elasticsearch attachment plugin. Details can be found here:
https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html
https://github.com/elasticsearch/elasticsearch-mapper-attachments
It's built on top of Apache Tika:
http://tika.apache.org/1.7/formats.html
Attachment Type
The attachment type allows to index different "attachment" type field
(encoded as base64), for example, Microsoft Office formats, open
document formats, ePub, HTML, and so on (full list can be found here).
The attachment type is provided as a plugin extension. The plugin is a
simple zip file that can be downloaded and placed under
$ES_HOME/plugins location. It will be automatically detected and the
attachment type will be added.
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
iWorks document formats
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Feed and Syndication formats
Help formats
Audio formats
Image formats
Video formats
Java class files and archives
Source code
Mail formats
CAD formats
Font formats
Scientific formats
Executable programs and libraries
Crypto formats
A bit late to the party but this may help someone :)
I had a similar problem and some research led me to fscrawler. Description:
This crawler helps to index binary documents such as PDF, Open Office, MS Office.
Main features:
Local file system (or a mounted drive) crawling and index new files, update existing ones and removes old ones.
Remote file system over SSH crawling.
REST interface to let you "upload" your binary documents to elasticsearch.
Regarding Solr:
If the docs only need to be returned on metadata searches, Solr features a BinaryField field type, to which you can send Base64-encoded binary data. Keep in mind that people generally recommend against doing this, as it increases the size of your index (with the RAM and performance costs that brings); if possible, a setup where you store the files externally (and the path to the file in Solr) might be a better choice.
If you want Solr to automatically index the text inside the PDF/DOC, that's possible with the ExtractingRequestHandler: https://wiki.apache.org/solr/ExtractingRequestHandler
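As a sketch of what a request to that handler looks like, the Python below only builds the URL. It assumes the handler is mounted at /update/extract (the usual path in Solr's example configuration, but it depends on your solrconfig.xml), a local Solr on port 8983, and a core called docs; the core name and document id are made up.

```python
from urllib.parse import urlencode


def extract_url(base_url: str, core: str, doc_id: str, commit: bool = True) -> str:
    """Build the URL for posting one file to the extracting handler."""
    # literal.<field> sets a field value directly instead of extracting it.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


# Assumed local instance and hypothetical core name.
url = extract_url("http://localhost:8983/solr", "docs", "report-1")
```

The file itself would then be sent as the body of a POST request to that URL, for example with curl or any HTTP client.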
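Elasticsearch does store documents (.pdf or .doc files, for instance) in the _source field, which keeps the original JSON body you indexed, including any Base64-encoded file content. It can be used as a NoSQL datastore (same as MongoDB).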

Genexus Report with Embedded BLOB (PDF) files

I'm trying to create a PDF report object that contains some PDF files saved as BLOB records in my DB.
At this point I'm able to embed images only...
How can I "append" other kinds of files into my genexus-report, such as PDF files?
Any suggestion will be appreciated
Client: GXEv2 - U5,
Environment: Java
I know it's possible to append other GX reports to your report by calling them.
However, I think there is no way to append a PDF file.

Creating an ics file from data on a PDF file

I'm looking for a way to convert a PDF document into multiple ics files that staff can use to add their fortnightly roster to their smartphone calendars or to the Outlook calendar on their desktops. The information required to create the files would be pulled from the PDF by searching each column for selected initials and then reading the data from the same row as the initials. Is there a particular order the data needs to appear in within the ics file for it to import into a smartphone calendar?
You can search for PDF APIs for more details on handling a PDF programmatically.
Here are some online converters that could help. They convert a PDF into a Word document:
http://www.pdftoword.com/success.aspx
http://www.pdfescape.com/account/?expired
However, reconstructing structured data from a PDF is not trivial, because a program has to deduce the semantics from the layout. So most programs can only recover scattered data from a PDF.
I've done this with Perl and the Windows Adobe PDF viewer: highlight all the text in the PDF, then cut and paste it into a text file. As the previous answer said, you have to write Perl (or use any other text processing language) to pick out the format of the PDF you have. Then you can print it with Perl to CSV, to iCal or to whatever format you want. I've shared my code on github.com. I'm not sure if you know Git, but send me a private message if you want me to send the Perl code outside of Git.
The PDFs I've converted are here:
http://recplexonline.com/sports/hockey/old-geezers-hockey-35
The GitHub repository with my Perl code and the input files I used is here:
https://github.com/jdeltoft/PdfParse
It's pretty ugly Perl, sorry for that. But it works. I'll try to clean it up soon.
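On the ordering question: under the iCalendar format (RFC 5545) the order of properties inside an event does not matter, but the BEGIN/END lines must nest correctly, lines end with CRLF, and each VEVENT needs at least UID, DTSTAMP and DTSTART. Here is a minimal Python sketch for writing one shift once the row has been pulled out of the PDF. The shift data, UID and PRODID are invented, and it assumes the summary has no commas, semicolons or backslashes (those would need escaping).

```python
from datetime import datetime, timezone


def fmt(dt: datetime) -> str:
    """Format a timezone-aware datetime as an iCalendar UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def make_ics(uid: str, summary: str, start: datetime, end: datetime,
             stamp: datetime) -> str:
    """Return the text of a single-event .ics file."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//roster-sketch//EN",  # placeholder product id
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{fmt(stamp)}",
        f"DTSTART:{fmt(start)}",
        f"DTEND:{fmt(end)}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"


# Hypothetical roster row: staff initials "JD", early shift.
ics = make_ics(
    uid="jd-20240108-early@example.invalid",
    summary="Early shift (JD)",
    start=datetime(2024, 1, 8, 7, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 8, 15, 0, tzinfo=timezone.utc),
    stamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

Writing the returned string to a file with an .ics extension, one file per staff member or one VEVENT per shift, gives something that calendar apps can import.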

advice on technology to use for document/form creation and indexing

My customer currently stores his documents, which are single-page automotive forfeits, in a single MS Word document. This method of course produces a huge file which is slow to open, not to mention slow to search.
After a user fills in a document, he may need to print it to sign it by hand. The document is then scanned back and stored in PDF format. The document may be printed again to be signed a second time by a manager. The doubly signed document is scanned again and saved, overwriting the singly signed one.
The user wants to be able to search the documents using a couple of search keys (the doc number and a kind of SSN). That is the reason they are using a single file: to be able to search in the file using Word's search feature.
I have to propose an IT solution. I was thinking about giving them a software tool that:
reads a pdf form/template; the template rarely changes
shows the template on the screen and allows the user to input his variable fields in the form
some of the fields must be defined as searchable
the user saves only the form fields, not the whole pdf.
the software is able to rebuild a document by coupling the template with the fields. I have to find a way to tie the template to the saved fields, so that the template can change (versioning) without breaking the old documents
the tool allows searching across multiple documents, using the defined search fields
the tool allows printing the document so it can be signed by hand; this is the hard part. Once the document is signed it cannot be changed anymore, but if the document is simply scanned and coupled with the form/fields PDF, then I'll lose the benefit of storing only the data, decoupled from the template. Should I only scan the signature and attach it to the document as an image?
What do you suggest to use?
Adobe XML Forms?
Adobe Forms Data Format?
An already existing software?
Other?
For the existing documents, I want to allow the customer to import his huge MS Word file into the new system.
Thanks.
Sounds like you want a PDF form template that submits data to a database that can be searched.
OTOH, if you just save the PDFs, Acrobat Pro can generate an index file from a directory, and that index can be searched (from Reader?). Yep, you can run searches on an index from Reader, but you can only build them with Acrobat.
I prefer AcroForms to LiveCycle forms myself. There's a lot more software out there that works with 'em. If you go with LiveCycle, you're almost completely locked into Adobe. And Adobe server software is EXPENSIVE.
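To make the "form data in a searchable database" idea a bit more concrete, here is a small Python sketch using SQLite. It is not tied to AcroForms or LiveCycle: the table layout, column names and sample values are all assumptions. Each row keeps only the filled-in fields plus the version of the template they belong to, so an old document can still be rebuilt after the template changes.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # a real system would use a file or server
conn.execute(
    """CREATE TABLE form_data (
           doc_number       TEXT PRIMARY KEY,  -- first search key
           person_id        TEXT NOT NULL,     -- second search key (SSN-like)
           template_version INTEGER NOT NULL,  -- which template to rebuild with
           fields_json      TEXT NOT NULL      -- remaining fields, not searched
       )"""
)
conn.execute("CREATE INDEX idx_person ON form_data(person_id)")


def save_form(doc_number, person_id, template_version, fields):
    """Store only the variable fields, never the rendered PDF."""
    conn.execute(
        "INSERT INTO form_data VALUES (?, ?, ?, ?)",
        (doc_number, person_id, template_version, json.dumps(fields)),
    )


def find_by_person(person_id):
    """Return (doc_number, template_version, fields) for one person."""
    rows = conn.execute(
        "SELECT doc_number, template_version, fields_json "
        "FROM form_data WHERE person_id = ? ORDER BY doc_number",
        (person_id,),
    ).fetchall()
    return [(num, ver, json.loads(raw)) for num, ver, raw in rows]


# Invented sample data: two documents saved against different templates.
save_form("D-0001", "P-123", 1, {"plate": "AB123CD", "amount": "50.00"})
save_form("D-0002", "P-123", 2, {"plate": "AB123CD", "amount": "75.00"})
save_form("D-0003", "P-999", 2, {"plate": "ZZ999ZZ", "amount": "20.00"})
results = find_by_person("P-123")
```

The scanned, signed copy could be kept as a separate file referenced from the same row, which leaves the searchable data untouched.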