How to convert a text PDF to an image PDF using Ghostscript

I need to convert the text in a PDF file to images so that users cannot copy it from the PDF.
This should be equivalent to converting the entire PDF to a set of images and then merging them into one document. I did that, but it is slow; is there a way to do it with Ghostscript options?

Welp, looks like I only need to specify the -dNoOutputFonts option.
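For reference, here is a minimal sketch of that route, driven from Python via subprocess. The filenames are placeholders, and -dNoOutputFonts needs a reasonably recent Ghostscript; note that it converts glyphs to vector outlines rather than bitmaps, which is usually enough to stop copy/paste:

import subprocess

# Rewrite the PDF with every font converted to outlines, so there is
# no extractable text left. Assumes the gs binary is on PATH.
subprocess.run([
    "gs",
    "-o", "output.pdf",        # -o implies -dBATCH and -dNOPAUSE
    "-sDEVICE=pdfwrite",       # write a PDF rather than a raster image
    "-dNoOutputFonts",         # emit glyphs as outlines instead of fonts
    "input.pdf",
], check=True)

This avoids the rasterise-and-merge round trip entirely, which is why it is so much faster.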

Related

How to replace a specific image within a PDF?

I have a PDF with 3 images.
I want to find each image and replace it with another image.
In the PDF I saw the original paths under xmpMM:Ingredients.
I tried to change them via Notepad++, but it looks like the images are already embedded, so changing the path does nothing.
How can I find each image and replace it with another?
The XMP stuff is information only. The actual images are embedded streams in the PDF file. Finding the correct streams to replace and replacing them isn't a simple problem, and it can't be done with Notepad. You'll need a library/toolkit that can modify PDFs, like https://pdf-lib.js.org/ or similar; a rough sketch of the idea follows below.
The PDF file looks like an Illustrator file, which adds another layer of weirdness: Illustrator can write PDFs that contain both PDF and Illustrator versions of the content, and you see one in Acrobat and the other in Illustrator.
It's probably easier to recreate the PDF from whatever source produced it.
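To make the stream-replacement idea concrete, here is a rough sketch using pikepdf, a Python toolkit (the answer above mentions pdf-lib for JavaScript; this is the same idea in a different library). The image name /Im0 and the filenames are assumptions, and Illustrator's private copy of the artwork is left untouched:

import zlib

import pikepdf
from PIL import Image

pdf = pikepdf.open("input.pdf")
page = pdf.pages[0]

# List the image XObjects on the page to discover their names.
print(list(page.images.keys()))

raw = page.images["/Im0"]  # /Im0 is an assumption; use a name printed above

# Overwrite the stream with raw RGB pixels from the replacement image.
new = Image.open("replacement.png").convert("RGB")
raw.write(zlib.compress(new.tobytes()), filter=pikepdf.Name("/FlateDecode"))
raw.Width, raw.Height = new.width, new.height
raw.ColorSpace = pikepdf.Name("/DeviceRGB")
raw.BitsPerComponent = 8

pdf.save("output.pdf")

Note this only swaps the pixel data: the replacement is stretched into the original image's placement rectangle, so wildly different dimensions will look distorted.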

What's the best way to extract text from a PDF in Python without changing the layout and format?

I want text with the exact format and layout from the PDF.
If PDF to text is not the direct choice, is it possible to do PDF -> XML -> text?
I have already tried PyPDF2, pdfminer and pdftotext. I've even tried AWS Textract and got incorrect layout.
Basically, if I can construct sentences from the text extracted from the PDF, that's enough.
I used the Zamzar API, which gives exact output, but it's quite expensive.
Any possible solution?
If you are looking to keep the structure of the PDF but not the font, colour, size etc., then try the pdftables_api library; it should hold the layout of your PDF. Converting PDF to CSV works well here because a CSV file is just a comma-separated text file.
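A minimal sketch of that route (pdftables_api is a paid web API; the key and filenames below are placeholders):

import pdftables_api

# The API key is a placeholder; sign up at pdftables.com to get one.
client = pdftables_api.Client("my-api-key")

# Convert the PDF into a CSV file, keeping the table layout.
client.csv("input.pdf", "output.csv")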
If you are looking to keep font, colour etc., the Zamzar API is probably your best option.

How to convert Marathi data from PDF to Excel in the proper format

I am converting Marathi data from PDF to Excel or Word, but it does not come out in the proper format.
I copied some data from the PDF and pasted it into a Word document, but the formatting was wrong.
e.g. प्रविण सुधाकर शिरवाडकर is the line in the PDF,
but when I copy and paste it into Word it becomes
-प्रववर् सुधाकर शिरवाडकर
What should I do about this?
Can anyone please help me? Thank you in advance.
There seem to be problems in the way the PDF stores Unicode Devanagari text. Try this alternative route: convert your PDF to an image. You can use an online or downloaded tool, or, if on Linux, use this command in a terminal:
# convert each PDF into one 200 DPI JPEG per page (requires ImageMagick)
for f in *.pdf; do convert -density 200 "$f" "${f%.pdf}_200dpi.jpg"; done
Change the density from 200 as needed. Each page of your document will be converted into an image file. For a Windows tool, try https://www.pdfill.com/pdf_tools_free.html
Then go to http://www.i2ocr.com/free-online-hindi-ocr, upload the image, and convert. That site uses OCR (optical character recognition).
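If you would rather script the whole round trip, here is a rough sketch in Python. It assumes the pdf2image and pytesseract packages, the poppler and tesseract binaries, and Tesseract's Marathi language pack ("mar"); all of these are assumptions about your setup:

from pdf2image import convert_from_path
import pytesseract

# Rasterise each PDF page at 200 DPI (requires poppler).
pages = convert_from_path("input.pdf", dpi=200)

# OCR each page image with the Marathi traineddata.
for i, page in enumerate(pages, start=1):
    print(f"--- page {i} ---")
    print(pytesseract.image_to_string(page, lang="mar"))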
Check the fonts in your PDF and try making them available to the Word document.
I think you don't have the particular fonts that are used in the PDF.
In Adobe Reader, File menu > Properties > Fonts tab gives you a list of all fonts used in the document.

Recover text from a PDF file when normal methods fail

I have a few hundred PDF files from which I need to extract sections of text. For many, pdftotext works fine, but for others it misses large sections of text. If I open the PDF in Acrobat, select that text by hand, copy/paste it into emacs, and then view the file without an encoding, I get stuff like this:
Husband \364\200\200\272\364\200\201\213\364 etc.
How can I extract the text correctly?
I should mention that I've tried saving as text from Acrobat; I also tried applying Acrobat's Document => OCR feature before copying.
Why not convert the PDF to doc or txt first? See the guide:
http://www.aolor.com/pdf-converter/user-guide.html
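If converters don't help either, the octal escapes above suggest private-use codepoints, i.e. the font has no usable ToUnicode map, and OCR over the rendered pages is usually the only way out. A rough sketch using the ocrmypdf package (an assumption about your toolchain; it needs the tesseract binary installed):

import ocrmypdf

# Rasterise the pages and lay a fresh OCR text layer over them,
# discarding the broken original text.
ocrmypdf.ocr("input.pdf", "output.pdf", force_ocr=True)

pdftotext on the resulting file should then return real, selectable text.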

PDF data extraction gives symbols/gibberish?

I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file, though, gives me gibberish output that I can't even copy and paste into this text area correctly. All sorts of Unicode weirdness.
If I copy and paste as normal into Excel/Notepad, I get the same issue.
I assume it's something to do with a messed-up character-encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of software to do it.
Thanks!
There are an increasing number of PDF files that use subsetted fonts, which is basically a custom encoding. Normally the font descriptor in the PDF should have a ToUnicode table that allows text extraction to decode the font encoding and return the correct text.
Some PDF producers do this on purpose to prevent easy text extraction from things such as financial reports. If there is only one font, you could manually decode it, but in my experience I have seen PDFs with multiple random encodings, which makes it nearly impossible to decode automatically.
One way to test for these types of PDFs is to open the file in Acrobat, select some text, copy it, and paste it into Notepad. If the text is garbled, the PDF is using a subsetted font and there is not much more you can do: if Acrobat can't extract the text correctly, nothing else can. It may as well be a page of hieroglyphs.
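For a programmatic version of that copy/paste test, here is a diagnostic sketch using pikepdf (my choice of library, not something from the answer above) that flags fonts lacking a /ToUnicode map:

import pikepdf

# Report, per page, which fonts carry a /ToUnicode CMap. Text set in a
# font without one will extract as gibberish in Acrobat and everywhere else.
with pikepdf.open("input.pdf") as pdf:
    for page_num, page in enumerate(pdf.pages, start=1):
        resources = page.obj.get("/Resources")
        fonts = resources.get("/Font") if resources is not None else None
        if fonts is None:
            continue
        for name in fonts.keys():
            has_map = "/ToUnicode" in fonts[name]
            print(f"page {page_num}: {name} ToUnicode={'yes' if has_map else 'no'}")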