How to remove overlays from a PDF file using PDFBox?

I am using Apache Tika 1.17 to extract content from PDF files. One page of a PDF has a small image overlay, and because of it Tika is not able to extract any content from that page. The rest of the pages work fine.
Is there any way to remove the overlay from a PDF page using PDFBox before sending the file to Tika?
As a workaround, I converted the PDF to PNG so that Tika uses Tesseract OCR to extract the content. But I lose some content and the text formatting this way.

Related

Decreasing size of PDF when using puppeteer for pdf generation

We are using IDR for converting PDF documents to HTML.
After making some modifications, we use puppeteer to convert the document back to PDF. The resulting files are larger than the originals, even if I don't modify the HTML at all.
For example, if the original size is 500 KB, I get a result of 1000 KB.
The page only contains some text.
Please help me understand the reason behind this and how to solve it.

Apache PDFBOX renderImageWithDPI UTF-8 Issue

I am trying to export PDF pages to PNG using Apache PDFBox (2.0.8) PDFRenderer.renderImageWithDPI. Everything works fine except when the text contains UTF-8 characters: the PNG shows boxes instead of the characters. If I extract the text content I can see the correct characters, so only the image rendering has this issue.

ABCpdf - convert pdf stream to tif stream

My web page has a document viewer (canvas) where I will bind a multi-page tif file stream.
There is a feature to delete pages from the file. I am using the ABCpdf library to convert the tif file stream to a pdf stream and delete a particular page, but I don't see any way to convert the pdf stream back to a tif stream.
Please help.
You want the GetData() method, called as GetData("foo.tif"). The filename passed in is ignored, except that its extension is checked to decide which format to use. The return value is an array of Byte.
https://www.websupergoo.com/helppdfnet/default.htm?page=source%2F5-abcpdf%2Fxrendering%2F1-methods%2Fgetdata.htm

Read PDF Title from pdf content in PHP

How can I get the PDF title from the PDF content? The PDF metadata does not return the title.
I want to get the title and heading of a PDF file in PHP.
Extracting metadata from PDFs can be tricky, because there are multiple places it can be stored in the file (specifically, both the info dictionary and the XMP stream).
This post suggests some PHP toolsets that may be relevant: Reading PDF metadata in PHP
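To illustrate the two storage locations mentioned above, here is a rough sketch (in Java rather than PHP) that looks for a title in the XMP packet first and falls back to a /Title entry in the raw file. It is not a real PDF parser. The names PdfTitleSketch, xmpTitle, infoTitle and title are made up for this example, and it assumes the metadata is stored unencrypted and uncompressed and that the title is a plain literal string. The first /Title it finds may also belong to a bookmark rather than the info dictionary, so a proper PDF library is the safer route.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
final class PdfTitleSketch {
    // Literal-string entry such as: /Title (My Document)
    // Handles backslash escapes, but not nested unescaped parentheses.
    private static final Pattern INFO_TITLE =
            Pattern.compile("/Title\\s*\\(((?:\\\\.|[^\\\\()])*)\\)");

    // First rdf:li element that follows <dc:title> in an XMP packet.
    private static final Pattern XMP_TITLE =
            Pattern.compile("<dc:title>.*?<rdf:li[^>]*>(.*?)</rdf:li>", Pattern.DOTALL);

    static Optional<String> infoTitle(String raw) {
        Matcher m = INFO_TITLE.matcher(raw);
        if (!m.find()) {
            return Optional.empty();
        }
        // Undo the escapes \( \) and \\ inside the literal string.
        return Optional.of(m.group(1).replaceAll("\\\\([()\\\\])", "$1"));
    }

    static Optional<String> xmpTitle(String raw) {
        Matcher m = XMP_TITLE.matcher(raw);
        if (!m.find()) {
            return Optional.empty();
        }
        // The file was decoded byte-for-byte as ISO-8859-1, so recover the
        // original bytes and decode them as UTF-8, which XMP normally uses.
        byte[] utf8 = m.group(1).getBytes(StandardCharsets.ISO_8859_1);
        return Optional.of(new String(utf8, StandardCharsets.UTF_8).trim());
    }

    // Prefer the XMP title, fall back to the /Title entry.
    static Optional<String> title(byte[] pdfBytes) {
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Optional<String> fromXmp = xmpTitle(raw);
        return fromXmp.isPresent() ? fromXmp : infoTitle(raw);
    }
}
```

Calling PdfTitleSketch.title(bytes) with the bytes of a file returns an empty Optional when neither location yields a title, which is also what happens when the metadata sits inside a compressed stream.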

PDF Generation with High Resolution SVG Images (suggestions needed!)

Can anyone suggest/recommend a product that can be used to dynamically produce PDFs that can contain high-res images?
We're currently using a product called Highwire from a company called Corda to produce PDFs of our HTML pages.
Highwire is crap at producing PDFs though, because it does not conform to HTML standards (i.e. it requires table layouts rather than CSS/div layouts). We have to use it though, because it is capable of incorporating high-definition SVG images into its PDF output.
Thanks
Dave
What about Prince?
It can handle XHTML and CSS just fine as well as SVG.
I used Apache FOP together with this stylesheet for converting HTML to XSL-FO with success in some projects. Embedding SVG is straightforward, since FOP incorporates Batik, Apache's SVG library. You can copy SVG images 1:1 into the XSL-FO file.
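Copying an SVG 1:1 into the XSL-FO file could look roughly like the sketch below, which wraps the SVG markup in an fo:instream-foreign-object element. FoSvgSketch and wrapSvgInFo are made-up names, the A4 page geometry is arbitrary, and the SVG is assumed to declare the SVG namespace itself. Running the resulting document through FOP is not shown.

```java
// Hypothetical helper for illustration only; not part of FOP or Batik.
final class FoSvgSketch {
    // Builds a one-page XSL-FO document with the given SVG markup inlined.
    static String wrapSvgInFo(String svg) {
        // An XML declaration is only allowed at the very start of a document,
        // so drop the SVG's own declaration before nesting it.
        String inlineSvg = svg.replaceFirst("^\\s*<\\?xml[^>]*\\?>\\s*", "").trim();
        return String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">",
            "  <fo:layout-master-set>",
            "    <fo:simple-page-master master-name=\"page\"",
            "        page-height=\"297mm\" page-width=\"210mm\" margin=\"20mm\">",
            "      <fo:region-body/>",
            "    </fo:simple-page-master>",
            "  </fo:layout-master-set>",
            "  <fo:page-sequence master-reference=\"page\">",
            "    <fo:flow flow-name=\"xsl-region-body\">",
            "      <fo:block>",
            "        <fo:instream-foreign-object>",
            inlineSvg,
            "        </fo:instream-foreign-object>",
            "      </fo:block>",
            "    </fo:flow>",
            "  </fo:page-sequence>",
            "</fo:root>");
    }
}
```

The SVG stays vector data all the way into the FO document, which is why no rasterization step appears here.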