Through a web service I get a PDF response. When I hit the endpoint through my browser the file is downloaded and saved as I would expect.
The problem: when I open the downloaded PDF, the content appears to be encoded. If I paste the PDF content into something like MS Word or even Chrome, it becomes readable English.
I have opened the PDF in my code editor to inspect the metadata, but I don't know exactly what I'm looking for.
How can I get the content to display as readable English when the PDF is opened?
Any help would be much, much appreciated!
Here is a link to the PDF in question: testPDF
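One thing that may be worth checking when a PDF displays as gibberish but copies out as readable text is whether its fonts are actually embedded. Below is a rough Python sketch that scans the raw bytes for font names and embedded font streams. The function name `summarize_fonts` is made up for illustration, and a raw byte scan is only a heuristic: anything stored inside compressed object streams will not show up.

```python
import re


def summarize_fonts(pdf_bytes: bytes) -> dict:
    """Scan raw PDF bytes for font names and embedded font streams.

    Heuristic only: objects stored inside compressed object streams
    are invisible to a raw byte scan, so a low count proves nothing.
    """
    # "/BaseFont /ABCDEF+Arial" -> b"ABCDEF+Arial"
    names = re.findall(rb"/BaseFont\s*/([^\s/<>\[\]()]+)", pdf_bytes)
    # /FontFile, /FontFile2 and /FontFile3 keys reference embedded font programs
    embedded = re.findall(rb"/FontFile[23]?\b", pdf_bytes)
    return {
        "fonts": sorted({name.decode("latin-1") for name in names}),
        "embedded_font_streams": len(embedded),
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_fonts.py downloaded.pdf
    with open(sys.argv[1], "rb") as handle:
        print(summarize_fonts(handle.read()))
```

If fonts are listed but no embedded font streams are found, the viewer may be substituting a different font, which is one possible explanation for unreadable glyphs on screen.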
I'm trying to extract text from PDF files using the Google Cloud Vision API. It works most of the time, but I get gibberish in a few cases. I tried both DOCUMENT_TEXT_DETECTION and TEXT_DETECTION, and I tried forcing the language in the languageHints, but it didn't help.
Then I tried with a screenshot saved as tiff and this did work, so I'm guessing that Google tries to use the text in the PDF if it's not just a picture. Indeed, when I select all "text" in the PDF, I get gibberish.
When I print the tiff back into PDF, text extraction works. So it's really something weird with the PDF. But other extraction software (such as ABBYY) works well with the original PDF.
Has anyone had the same kind of issues?
One thing that could help would be an option to force the API to treat the PDF as an "image PDF". Is there such an option?
Thanks for your help!
FYI, I am unfortunately not allowed to show the PDF, and I use the dotnet library.
Edit:
The info on the PDF is:
Creator: "PScript5.dll Version 5.2.2"
Producer: Acrobat Distiller 10.1.16 (Windows)
I have successfully managed to download the PDF file from a site that uses PDF.js to create and show PDFs (using Selenium).
The downloaded PDF file does not open on my desktop (Mac and Linux).
It seems like the PDF is encoded, or encrypted.
On closer inspection, right after the PDF is downloaded, the network tab also shows pdf.js.worker. It seems like pdf.js.worker is decoding this file to show on the site.
How can I replicate, or follow the same flow of pdf.js.worker and decode this PDF?
Update
I have tried looking at the pdf.js.worker code to follow the code execution, but it seems like a really hard task. I'm hoping there is a simpler way.
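Before digging into the worker, it might help to check what was actually downloaded. Here is a small Python sketch, under the assumption that the file is available as raw bytes; the function name `inspect_download` is hypothetical. It relies only on well-known PDF markers: a `%PDF-` header near the start, a `%%EOF` marker near the end, and an `/Encrypt` entry when standard PDF encryption is used. It does not decode anything, it only reports which markers are present.

```python
def inspect_download(data: bytes) -> dict:
    """Report basic structural markers found in a downloaded file."""
    head = data[:1024]
    tail = data[-1024:]
    return {
        # A regular PDF starts with a "%PDF-x.y" header
        "has_pdf_header": b"%PDF-" in head,
        # ... and ends with an end-of-file marker
        "has_eof_marker": b"%%EOF" in tail,
        # Standard PDF encryption is announced by an /Encrypt entry
        "mentions_encrypt": b"/Encrypt" in data,
        # A login or error page saved with a .pdf name is a common surprise
        "looks_like_html": head.lstrip().lower().startswith((b"<!doctype", b"<html")),
    }


if __name__ == "__main__":
    import sys

    # Usage: python inspect_download.py downloaded.pdf
    with open(sys.argv[1], "rb") as handle:
        print(inspect_download(handle.read()))
```

If there is no `%PDF-` header at all, the site is probably serving the file in some custom wrapped form and the worker code really is where the decoding happens. If the header is there and `/Encrypt` shows up, it may be ordinary PDF encryption instead.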
I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but there is a kind of hidden text in the PDF that can be selected when you try to select a portion of the image that contains text. In this way you can copy and paste text or search for text in the scanned document. This is VERY useful. It's an awesome improvement over raw scanned images. I also have several apps on my mac that can create this kind of searchable PDF from a scanned document or a raw image.
Now it's obvious to anyone who has ever used OCR that the process of converting images to text is not 100% accurate, so the text that you search or copy will not be correct in some places.
So I have searched for quite some time for an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.
Does anyone know of a tool (or library API) that would allow this?
It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace the scanned image with the results of its own OCR process so that I could edit and save the document. But this would produce horrible results for any of the scanned documents that I am using. It seems designed for editing a "native PDF", not a scanned document.
I have also tried ABBYY FineReader with no luck.
I'm using ABBYY FineReader 12 Professional (not open source).
Just open a scanned image or scanned PDF and press Verify Text (or Ctrl + F7), then go over all the spelling errors or low-confidence characters and fix them.
The program is very good: it shows you the exact place in the image/PDF to correct and the OCR guess side by side for convenience. It iterates through all of them.
[By the way, I'm using the shortcuts to speed up things:
Alt+Enter to add the unrecognized word to dictionary.
Ctrl+Delete to skip word or confirm in case you fixed it.]
Then save the document as a PDF file (Menu: File > Save Document As > PDF File), and you can search it in any PDF reader. The saved file looks the same as the scanned one, but 'behind' it there is text.
It's weird that you tried ABBYY with no luck... it's working great for me. Maybe you did not try the Professional version.
Hope it helps you.
Creating a searchable PDF from images is not what the poster is after. He wants to start with an already searchable PDF and modify its text (e.g. because a searchable PDF was made initially, but an overlooked recognition error was found later and needs correction). I see no way and no tool that assists in doing this.
I am currently working on a web application which receives encoded text from a web service, decodes it, and saves it as a PDF file. When the user clicks for the details, I am supposed to display the PDF file in the web browser.
What is the best practice for displaying the PDF file in the browser? I am using VB.Net 2003.
All you need to do is set the link to point to your pdf file. And if the user has any PDF reader installed, it will be opened using that reader.
<a href="YourPdfFileName.pdf">The name you want to show as link</a>
EDIT:
The other way, if you don't want to display a link and would rather open the file directly, is to set the correct MIME type in the headers, so that the browser can detect the response as a PDF file instead of an HTML file.
' Declare the PDF MIME type and ask the browser to show the file inline
Response.ContentType = "application/pdf"
Response.AddHeader("content-disposition", "inline;filename=YourPdfFileName.pdf")
Response.End
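For anyone who wants to see the same header idea outside of VB.Net, here is a minimal Python sketch using the standard library's WSGI support. It is only an illustration of the two headers involved (the PDF MIME type and an inline content disposition), not a translation of the original application; `make_pdf_app` and the placeholder bytes are made up for the example.

```python
def make_pdf_app(pdf_bytes: bytes, filename: str = "YourPdfFileName.pdf"):
    """Build a tiny WSGI app that serves one PDF for inline display."""

    def app(environ, start_response):
        headers = [
            # MIME type so the browser treats the body as a PDF, not HTML
            ("Content-Type", "application/pdf"),
            # "inline" asks the browser to display it instead of downloading
            ("Content-Disposition", f'inline; filename="{filename}"'),
            ("Content-Length", str(len(pdf_bytes))),
        ]
        start_response("200 OK", headers)
        return [pdf_bytes]

    return app


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Placeholder content; a real app would pass the decoded PDF bytes here
    demo_app = make_pdf_app(b"%PDF-1.4\n%%EOF\n")
    with make_server("127.0.0.1", 8000, demo_app) as server:
        server.serve_forever()
```

Whether the PDF is shown inside the browser window or handed to an external reader still depends on the user's browser and its PDF settings.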
I have looked for weeks and I keep hitting dead ends. I know you can create a text or image link and tell it to "print page" in a browser. But so far, I can't get it to print a document, specifically a PDF. I would like the print dialog to appear after the link is clicked, and the linked PDF to be what gets printed.
Why does this seem to be such an impossible feat? I have seen it work in a Flash movie, but since I cannot access the native file I cannot see how it was done.
Any advice?
Thanks.
Many of today's printers support direct PDF printing. Lexmark, HP, and Xerox, to name a few, all have this on most of their 'business' printers. On these devices, simply sending the PDF file directly to the device over LPR, port 9100, or some other mechanism will result in a printed document. Some devices even support URLs. I do know that Lexmark had some devices to which a URL could be sent, and as long as the printer had access to the URL it would pull the document and print it. In this case it supported basic HTML, JPEG, TIF, and PDF.
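As a rough illustration of the port 9100 route, here is a Python sketch that opens a plain TCP connection and writes the PDF bytes to it. It assumes the printer accepts PDF directly on that port, which depends on the model; `send_pdf_raw` is a made-up helper name, and any host name you pass in is your own printer's address.

```python
import socket


def send_pdf_raw(host: str, pdf_bytes: bytes, port: int = 9100,
                 timeout: float = 10.0) -> int:
    """Send raw PDF bytes to a printer's raw TCP port.

    Assumes the device understands PDF natively; a printer that does
    not may print garbage or nothing at all.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(pdf_bytes)
    # Closing the connection signals the end of the job
    return len(pdf_bytes)


if __name__ == "__main__":
    import sys

    # Usage: python send_pdf.py <printer-host> <file.pdf>
    with open(sys.argv[2], "rb") as handle:
        sent = send_pdf_raw(sys.argv[1], handle.read())
    print(f"sent {sent} bytes")
```

Note that this runs wherever the script runs, so from a web page it would have to happen on a server that can reach the printer, not in the visitor's browser.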
Hope this helps.
A PDF must be rendered as an image before it can be printed. Usually when you're printing a PDF file on your desktop you could simply right-click on the file and select Print and if you have Adobe Reader or an alternative application set as your default PDF viewer, then the PDF that you have selected will be opened automatically -- at this stage the PDF is rendered as an image -- and then the printing process will begin.
But if there is no access to a PDF viewer that can render the PDF and then print it, you won't be able to print the PDF. Usually, if you have Adobe Reader, Foxit Reader, etc. installed, clicking on a URL to a PDF will open the PDF in the viewer inside the browser, and you will be able to print it from there.
Alternatively, you could find a PDF SDK that silently renders a PDF as an image and then sends that to the printer, without the need to have a PDF viewer installed on your machine.