Reduce PDF size when merging documents with the same fonts and resources - pdfbox

Does PDFBox have a feature to extract identical resources (e.g. fonts, images) into shared object references when merging PDFs? I can't seem to find any feature for reducing size in the docs.
I might be missing the function name in the docs. Similar products offer this:
iText
PdfSmartCopy
https://www.coderanch.com/how-to/javadoc/itext-2.1.7/com/lowagie/text/pdf/PdfSmartCopy.html
PDFNet ( pdfTron )
Optimizer
https://www.pdftron.com/api/PDFNet/html/T_pdftron_PDF_Optimizer.htm

Answered via comment: PDFBox provides no such feature.
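To illustrate what "extracting identical resources into object references" means, here is a minimal sketch in plain Java. It is not PDFBox API and not how PdfSmartCopy or the PDFTron Optimizer are actually implemented; the class `ResourcePool` and its method `intern` are hypothetical names. The idea is to key each resource's bytes by a SHA-256 digest so that a font or image already seen gets the existing object id instead of a second copy.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: stores each distinct resource once.
final class ResourcePool {
    private final Map<String, Integer> idByDigest = new HashMap<>();
    private final List<byte[]> stored = new ArrayList<>();

    // Returns the id of an identical resource seen earlier, or stores this one.
    int intern(byte[] content) {
        String key = digest(content);
        Integer existing = idByDigest.get(key);
        if (existing != null) {
            return existing; // duplicate: reuse the earlier object reference
        }
        int id = stored.size();
        stored.add(content.clone());
        idByDigest.put(key, id);
        return id;
    }

    // Number of distinct resources actually stored.
    int size() {
        return stored.size();
    }

    private static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(md.digest(content));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

A real implementation would also have to compare the resource dictionaries (not only the stream bytes) and rewrite every page's resource references to the shared object, which is the part PDFBox does not do for you.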

Related

Print multiple pages using PDFBox

I have a list of text data containing links (PDActions) that might need to be rendered on more than one page. (see below)
**Table of Contents**
document1 link 5
document2 link 8
document3 link 11
Is there a simple way to just print all this content and let PDFBox wrap the text and spread it over multiple pages as needed, then give me the final PDDocument?
There are multiple answers on this topic such as this one. However, the answers are quite old, and I'm checking if there is a newer and simpler way to do it.
PDFBox version: 2.0.26
PDFBox essentially still only has that very low-level text drawing API, but there are projects built on top of PDFBox that offer automatic layout.
Allow me to quote the PDFBox FAQ:
Can I use PDFBox to create complex layouts?
I'd like to use PDFBox to create a complex layout containing several paragraphs, tables, images etc. Is PDFBox fit for that purpose?
PDFBox being a low level PDF library provides the APIs to create page content such as text, images etc. But at this point in time it doesn't provide a higher level API to do page layout, paragraph handling, automatic line wrapping or create tables and such.
But PDFBox is the foundation of some projects which might help in that case. This includes projects such as
Boxable
BoxTable
easytable
pdfbox-layout
PdfLayoutManager
ph-pdf-layout
You may also want to consider using Apache FOP which allows to create complex documents from XML data and templates.
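Since PDFBox leaves line wrapping and page breaks to the caller, the layout step has to be done by hand if none of the projects above is used. The following is a minimal sketch of that step in plain Java, under stated assumptions: `TextPaginator`, `wrap` and `paginate` are hypothetical names, and the `widthOf` function is supplied by the caller (with PDFBox 2.x it could be based on `PDFont.getStringWidth(text) / 1000 * fontSize`). It only computes which lines go on which page; drawing them and placing link annotations is still up to the low-level API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of PDFBox: greedy word wrap plus page splitting.
final class TextPaginator {

    // Breaks text into lines whose measured width does not exceed maxWidth.
    // A single word wider than maxWidth is kept on a line of its own.
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty input
            }
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() > 0 && widthOf.applyAsDouble(candidate) > maxWidth) {
                lines.add(current.toString());
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    // Groups lines into pages holding at most linesPerPage lines each.
    static List<List<String>> paginate(List<String> lines, int linesPerPage) {
        if (linesPerPage <= 0) {
            throw new IllegalArgumentException("linesPerPage must be positive");
        }
        List<List<String>> pages = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerPage) {
            int end = Math.min(i + linesPerPage, lines.size());
            pages.add(new ArrayList<>(lines.subList(i, end)));
        }
        return pages;
    }
}
```

Each resulting page would then map to one `PDPage`, with one text-drawing call per line at a decreasing y offset.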

How to convert DJVU file into PDF

There are many books available only in DJVU format where text is selectable and size is quite small (300 pages less than 5 MB).
Since DJVU viewers are poor in terms of annotation of files, I want to convert them into PDF.
What are the options for converting a DJVU book into a PDF that keeps the text selectable and does not result in a huge (10x larger) PDF file?
Since this question has not been answered so far:
I recommend the following online converter, which to my knowledge is the only one that fulfills both criteria: djvu to pdf. However, I do not know of any stand-alone converter that achieves this.

How to merge PDF files without external dependencies

In one of my applications I need to merge many single PDF documents into one document, where each of the original PDFs is a page. Although many PDF libraries exist for most languages, I would like to write this myself if it's not too hard.
Is it necessary to implement a full-fledged PDF parser in order to merge PDF documents? Where and what would I start to read to find out what is needed for the task?
You can use the free Lite version of the Debenu QuickPDF Library to do it. Here is a very good example of how to do it:
http://www.debenu.com/kb/merge-pdf-files-together-programmatically/

Extract pages from DOC to new DOC

We are developing a printing server that allows users to upload a DOC and print it out via HP ePrint. It needs to support page extraction.
I tried to use a macro (with the help of Adobe Acrobat Reader Pro and MS Word) to extract pages into PDF. But it turns out that the PDF may be larger than expected.
Is there any way to extract pages from DOC to DOC without losing formatting (e.g. a table in the DOC), so that the result is approximately the same size?
This is a difficult requirement. It sounds like you have run into 2 problems (large PDFs and format loss) at the outset. You should probably say more about what you mean by "extraction" and why PDF is part of your solution because that's quite different from "upload and print" and "doc to doc". That way readers will have more suggestions for you.
I would suggest you try to approach the problem from a different angle if possible, because I suspect that you are unlikely to achieve a good, efficient, stable result. One possible approach is to turn the DOC into PDF and then use iText or some other PDF library to manipulate the PDF before printing. It really depends on what you are trying to achieve: the specifics of your extract/merge/convert.

OCR library that can insert OCR'd text back into the source PDF

Is there a library (or executable) that can OCR a PDF (typically a PDF created by scanning a paper), and inject the recognized text back into the PDF? Probably as invisible text behind the scanned images.
Preferably open source.
(Goal: I have a huge library of PDF files indexed by Lucene. It would be much easier for Lucene to find what PDFs are relevant if the PDFs contained text.)
Probably one of the best options is to use ABBYY FineReader (www.abbyy.com), as it will give you lots of options, including the creation of hidden text. I had a quick look at their site and also came across their Transformer product, which is probably even more suitable for your needs.
http://www.abbyy.com.au/pdftransformer/product_features/
If the PDFs don't contain text, what is indexed by Lucene?
Take a look at Docsplit (https://github.com/documentcloud/docsplit); it can use Tesseract to perform OCR. You will get plain text files that reflect the content of the PDFs. You can then build your Lucene index on top of these text files and store a reference to the PDF in the Lucene index. After querying the Lucene index you will get a list of documents with references to the original PDFs.
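The "index the extracted text, store a reference to the PDF" idea can be sketched without any library. The block below is a deliberately simplified stand-in for Lucene, not Lucene's API: `PdfTextIndex`, `add` and `search` are hypothetical names, and the text passed in is assumed to be whatever the OCR step wrote out for each PDF.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a Lucene index: term -> paths of PDFs.
final class PdfTextIndex {
    private final Map<String, Set<String>> pdfsByTerm = new HashMap<>();

    // Indexes the OCR'd text of one PDF under the path of the original file.
    void add(String pdfPath, String extractedText) {
        String[] terms = extractedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String term : terms) {
            if (!term.isEmpty()) {
                pdfsByTerm.computeIfAbsent(term, t -> new TreeSet<>()).add(pdfPath);
            }
        }
    }

    // Returns the paths of all PDFs whose extracted text contains the term.
    Set<String> search(String term) {
        return pdfsByTerm.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

With Lucene itself the same design would presumably use one document per PDF, with an indexed text field and a stored path field, so the PDFs never need to be modified.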