I'm designing a website for a small company and they want a database of documents where a user can search for policies. They handed me a RAR archive with over 800 documents, all inside folders and in different file formats (for example, one policy is split across 3 JPEG files while another is a single .doc document). I'm trying to find a way to convert all these files to PDF without doing it manually, in order to build a SQL database. Does anyone have an idea?
I would suggest looking into Apache POI. You could write a module to automate the process and convert all of them into PDFs.
https://poi.apache.org/
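Note that POI reads the Office formats but does not render PDFs by itself; for the policies that are split across JPEG scans, a library such as Apache PDFBox can stitch the images into a single PDF. A minimal sketch, assuming PDFBox 2.x (the class and method names here are just for illustration):

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    public class JpegsToPdf {
        // Merge the JPEG scans of one policy (in page order) into a single PDF.
        public static void convert(List<File> jpegs, File outputPdf) throws IOException {
            try (PDDocument doc = new PDDocument()) {
                for (File jpeg : jpegs) {
                    PDImageXObject image = PDImageXObject.createFromFile(jpeg.getAbsolutePath(), doc);
                    // Size each page to match the scan so nothing is cropped or scaled.
                    PDPage page = new PDPage(new PDRectangle(image.getWidth(), image.getHeight()));
                    doc.addPage(page);
                    try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                        cs.drawImage(image, 0, 0);
                    }
                }
                doc.save(outputPdf);
            }
        }
    }

The .doc files are a separate problem; driving LibreOffice in headless mode is a common way to batch-convert those to PDF.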
I have numerous folder structures with a label name on each folder. Within each of these folders there are 11 individual PDFs. I have to merge these PDFs into one file, then move on to the next folder (which contains files for a different project, but the same report format) and merge all of those files into a single file. Does anyone know of an automated way to do this? Could Power Automate be utilized to accomplish it? Or can it be done directly in Adobe? I have to do this hundreds of times and am looking to improve the workflow.
I am not sure where to begin on this. I am looking into Power Automate, but I haven't been given Adobe yet. Currently I'm using Xodo by hand to drag and drop the documents.
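If you end up scripting it instead, this is only a few lines with Apache PDFBox. A minimal sketch, assuming PDFBox 2.x, that the labelled folders sit under one parent directory, and that the 11 PDFs merge correctly in filename order (all assumptions from your description):

    import org.apache.pdfbox.io.MemoryUsageSetting;
    import org.apache.pdfbox.multipdf.PDFMergerUtility;
    import java.io.File;
    import java.util.Arrays;

    public class FolderMerger {
        public static void main(String[] args) throws Exception {
            File root = new File(args[0]); // the parent directory holding the labelled folders
            for (File folder : root.listFiles(File::isDirectory)) {
                File[] pdfs = folder.listFiles((dir, name) -> name.toLowerCase().endsWith(".pdf"));
                if (pdfs == null || pdfs.length == 0) continue;
                Arrays.sort(pdfs); // merge in filename order; adjust if your reports sort differently
                PDFMergerUtility merger = new PDFMergerUtility();
                for (File pdf : pdfs) merger.addSource(pdf);
                // Name the merged file after the folder's label.
                merger.setDestinationFileName(new File(root, folder.getName() + ".pdf").getPath());
                merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
            }
        }
    }

Run it once per parent directory and you get one merged PDF per labelled folder.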
I have a Notes app that was designed for the browser, not the client. It allowed upload of files into the documents, so nearly all the documents have files. The files are stored in the NSF as $FILE and displayed in the documents as links.
I am using Adobe Acrobat Pro to create PDFs from the documents and need to include the file attachments within the PDFs; however, the PDFs just include links to the files, not the attachments. Can I write an agent to run against the documents to get those files and embed them within the documents? When I view those documents through the client, I see all of the HTML etc., and then at the bottom of the document the file attachments appear. When I view these same documents in the browser, the file attachments do not appear. If I could merely ensure that they are there, then when running the PDF generator in Acrobat Pro they would be included in the PDFs and could be opened from there.
I am really stuck here, with no other way to 'archive' this notes database with all the data intact.
Thanks in advance for any insights!!
Ginni
There is a commercial product from Swing Software that does this. I hear that it's quite good, but I've never used it. Let me explain why...
The way I usually end up doing this is just quick-and-dirty. I write an agent to export the files, using the document UNID as part of the filename. The same agent exports all the data fields from the document into a CSV file, and I add a column with the filename of the extracted attachment. In your case, I would add two columns: one for the extracted attachment(s), and one for the generated PDF. The CSV serves as an index for the exported data. It can be imported into something more friendly, or just left as-is and opened in Excel, depending on the customer's usage requirements and available systems.

I've recommended Swing Software's product and offered to explore other ideas for developing code (e.g., using wkhtmltopdf for Domino web apps to capture a WYSIWYG rendering based on an HTML crawl) for PDF rendering of Notes documents for a couple of clients, but none of them have justified the cost involved in buying licenses and/or writing the code. Quick and dirty always seems to win, even when retention and eDiscovery considerations are taken into account.
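For illustration, a minimal sketch of such an export agent as a Domino Java agent. The "Subject" field and the C:\export\ path are assumptions; the lotus.domino calls are the standard ones:

    import lotus.domino.*;
    import java.io.PrintWriter;
    import java.util.Vector;

    public class JavaAgent extends AgentBase {
        public void NotesMain() {
            try {
                Session session = getSession();
                Database db = session.getCurrentDatabase();
                DocumentCollection docs = db.getAllDocuments();
                PrintWriter csv = new PrintWriter("C:\\export\\index.csv");
                csv.println("unid,subject,attachment");
                Document doc = docs.getFirstDocument();
                while (doc != null) {
                    String unid = doc.getUniversalID();
                    // @AttachmentNames lists the $FILE items on the document.
                    Vector names = session.evaluate("@AttachmentNames", doc);
                    for (Object n : names) {
                        String fileName = (String) n;
                        if (fileName.length() == 0) continue; // no attachments on this doc
                        EmbeddedObject att = doc.getAttachment(fileName);
                        if (att != null) {
                            String exported = "C:\\export\\" + unid + "_" + fileName;
                            att.extractFile(exported); // write the attachment to disk
                            csv.println(unid + "," + doc.getItemValueString("Subject") + "," + exported);
                            att.recycle();
                        }
                    }
                    Document next = docs.getNextDocument(doc);
                    doc.recycle();
                    doc = next;
                }
                csv.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

The column for the generated PDF gets filled in later, once Acrobat has produced the files.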
I tried to find proper services for generating PDF files in Liferay; however, I have found only the class PDFProcessorUtil. How do I use it to generate a PDF file, and how do I save the generated file afterwards? I think I should use DLAppLocalServiceUtil.addFileEntry to save the file into Liferay's storage.
Liferay's PDF conversion works by converting documents in the document library and offering them for download; this is implemented through OpenOffice. Install OpenOffice or LibreOffice, run it in server mode, and configure Liferay to use it; then you can choose to download documents as PDF. HTML input has a few limitations, as it can reference many external resources, so I'm not sure what your result will be.
If you're generating the HTML output yourself, you might want to consider other (Liferay-independent) means of generating PDFs, as you might not need to upload your files to the Document Library (e.g., if you're generating reports on the fly and just want the generator's result to be a PDF, without storing it). If this is what you need, you can use any PDF converter library you want; Liferay does not limit your choice.
You can also generate the PDFs from the serveResource phase of a portlet.
You put a button or a link somewhere, and when you click on it, you download the PDF.
In this simple example, the PDF is generated from a FreeMarker template that produces HTML, which is then converted to PDF:
https://github.com/roclas/pdfUtil
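For reference, a minimal sketch of the HTML-to-PDF step using the Flying Saucer library (one common choice; the linked repo may do it differently, and the input has to be well-formed XHTML):

    import org.xhtmlrenderer.pdf.ITextRenderer;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    public class HtmlToPdf {
        public static void render(String html, OutputStream out) throws Exception {
            ITextRenderer renderer = new ITextRenderer();
            renderer.setDocumentFromString(html); // must be well-formed XHTML
            renderer.layout();
            renderer.createPDF(out);
        }

        public static void main(String[] args) throws Exception {
            String html = "<html><body><h1>Report</h1><p>Hello PDF</p></body></html>";
            try (OutputStream out = new FileOutputStream("report.pdf")) {
                render(html, out);
            }
        }
    }

In serveResource you would render your FreeMarker template to the HTML string, set the response content type to application/pdf, and pass resourceResponse.getPortletOutputStream() as the output stream.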
I'm currently designing a full text search system where users perform text queries against MS Office and PDF documents, and the result will return a list of documents that best match the query. The user will then be able to select any document returned and view it within MS Word, Excel, or a PDF viewer.
Can I use Elasticsearch or Solr to import the raw binary documents (i.e., .docx, .xlsx, and .pdf files) into their data store, and then export a document to the user's device on command for viewing?
Previously, I used MongoDB 2.6.6 to import the raw files into GridFS and the extracted text into a separate collection (the collection contained a text index) and that worked fine. However, MongoDB full text searching is quite basic and therefore I'm now looking at either Solr or ElasticSearch to perform more complex text searching.
Nick
Both Solr and Elasticsearch will index the content of the document. Solr has that built in; Elasticsearch needs a plugin. It's easy either way, and both use Tika under the covers.
Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.
Additionally, neither Solr nor Elasticsearch are currently recommended as a primary storage. They can do it, but it is not as mission critical for them as - say - for a filesystem implementation.
So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
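To illustrate that split: extract the text yourself with Tika, index the text plus the file's path, and leave the binary on disk. A minimal sketch, assuming the tika-parsers jar is on the classpath:

    import org.apache.tika.Tika;
    import java.io.File;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // Works for .docx, .xlsx, .pdf and many other formats.
            String text = tika.parseToString(new File(args[0]));
            // Index this text plus the file path; serve the original file from disk on demand.
            System.out.println(text);
        }
    }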
I would try the Elasticsearch attachment plugin. Details can be found here:
https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html
https://github.com/elasticsearch/elasticsearch-mapper-attachments
It's built on top of Apache Tika:
http://tika.apache.org/1.7/formats.html
Attachment Type
The attachment type allows to index different "attachment" type field (encoded as base64), for example, Microsoft Office formats, open document formats, ePub, HTML, and so on (full list can be found here).

The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under $ES_HOME/plugins location. It will be automatically detected and the attachment type will be added.
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
iWorks document formats
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Feed and Syndication formats
Help formats
Audio formats
Image formats
Video formats
Java class files and archives
Source code
Mail formats
CAD formats
Font formats
Scientific formats
Executable programs and libraries
Crypto formats
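A minimal sketch of feeding a file through the plugin with the Elasticsearch 2.x transport client; the index name "docs", type "doc", and a "file" field already mapped as type attachment are all assumptions:

    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;
    import java.net.InetAddress;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Base64;

    public class IndexAttachment {
        public static void main(String[] args) throws Exception {
            Client client = TransportClient.builder().build()
                .addTransportAddress(new InetSocketTransportAddress(
                    InetAddress.getByName("localhost"), 9300));
            // The plugin expects the raw bytes base64-encoded; it runs Tika on them at index time.
            byte[] raw = Files.readAllBytes(Paths.get(args[0]));
            String base64 = Base64.getEncoder().encodeToString(raw);
            client.prepareIndex("docs", "doc").setSource("file", base64).get();
            client.close();
        }
    }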
A bit late to the party but this may help someone :)
I had a similar problem and some research led me to fscrawler. Description:
This crawler helps to index binary documents such as PDF, Open Office, MS Office.
Main features:
Local file system (or a mounted drive) crawling: indexes new files, updates existing ones, and removes old ones.
Remote file system crawling over SSH.
A REST interface to let you "upload" your binary documents to Elasticsearch.
Regarding Solr:
If the docs only need to be returned on metadata searches, Solr features a BinaryField field type, to which you can send binary data base64-encoded. Keep in mind that in general people recommend against doing this, as it may increase your index size (RAM requirements/performance), and if possible a setup where you store the files externally (with the path to the file in Solr) might be a better choice.
If you want Solr to automatically index the text inside the PDF/DOC files, that's possible with the ExtractingRequestHandler: https://wiki.apache.org/solr/ExtractingRequestHandler
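A minimal SolrJ sketch of posting a file to that handler; the core name, the fmap mapping, and using the file path as the id are assumptions:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import java.io.File;

    public class ExtractAndIndex {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File(args[0]), "application/pdf");
            req.setParam("literal.id", args[0]);  // keep the file path as the document id
            req.setParam("fmap.content", "text"); // map the extracted body into the "text" field
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            solr.request(req);
            solr.close();
        }
    }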
Elasticsearch does store documents (.pdf or .doc files, for instance) in the _source field. It can be used as a NoSQL datastore (same as MongoDB).
We are trying to add a code to a multipage PDF file.
Let's say there are 1,000 pages in the PDF file.
Each block of six pages corresponds to one account, so those six pages need the same account number.
Then the following six would get a different account number, and so on.
Has anyone ever run across something that would perform this function?
Can the pdf link to a data file, or would it have to use simple page numbering?
Thanks in advance!
You can programmatically add whatever you want to any subset of pages in a PDF using a variety of PDF libraries, commercial or OSS.
Can you be more specific?
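In the meantime, a minimal sketch with Apache PDFBox 2.x that stamps an account number on each block of six pages. It reads the numbers from a plain data file, so the PDF itself doesn't need to carry them; the file names and the one-number-per-line format are assumptions:

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;
    import java.io.File;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class StampAccounts {
        public static void main(String[] args) throws Exception {
            // accounts.txt: one account number per line, in the same order as the page blocks.
            List<String> accounts = Files.readAllLines(Paths.get("accounts.txt"));
            try (PDDocument doc = PDDocument.load(new File("statements.pdf"))) {
                int i = 0;
                for (PDPage page : doc.getPages()) {
                    String account = accounts.get(i / 6); // same number for each block of six pages
                    try (PDPageContentStream cs = new PDPageContentStream(
                            doc, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                        cs.beginText();
                        cs.setFont(PDType1Font.HELVETICA, 10);
                        cs.newLineAtOffset(36, 20); // near the bottom-left corner
                        cs.showText("Account: " + account);
                        cs.endText();
                    }
                    i++;
                }
                doc.save("statements-coded.pdf");
            }
        }
    }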