Looking for a 64bit PDF highlighter package for large PDF files - pdf

I am trying to find what I thought would be one of the most simple pieces of code there are: a batch PDF keyword highlighter.
I found a few commercial 32-bit tools online which just crash; they just can't handle a 127 page 2D architect PDF file. I have an i5 8600K with 16 Gb of RAM.
All I wanted to do is highlight a keyword.
I have cancelled my Acrobat DC subscription because it is only 32-bits.
Please point me in the right free or paid direction that works for large PDF files.

Ok. I have finally found a PDF editor for you estimators out there who are tired of downloading OCR apps when what you need is a redactor with color fill opacity highlighting capabilities.
Nitro Pro
This isn't advertising.
If you are trying to load high resolution 2D architect prints and find all occurrences of your specific TRADE keyword, this is it. Hands down.
Example:
Search and Redact all occurrences of keyword PAINT. The app
highlights them all.
Next, DO NOT CLICK redact and instead, click Save As to save the
file as a different copy.
Because you didn't proceed with the final redact, the app with just
highlight all occurrences of the word PAINT in this example and that
is all.

Related

Batch check Adobe Acrobat .pdf's for files containing rotated text

Does anybody know if there is a way to check whether a list of Adobe Acrobat .pdf files contain rotated text (any text not at 0 degrees)?
I thought this would be simple, but I'm struggling to find an answer.
I am using ABBYY Recognition Server to OCR thousands of files and the results are quite poor where the text is rotated. I need to get a list of files that have rotated text to allow me to perform some pre-processing on them.
I usually use iTextSharp for .pdf automation and modification but don't seem to be able to find anything for checking text rotation.
Thanks
You could achieve your goal by extracting all words from these PDFs and checking if any of the words is rotated.
I would recommend you to use a PDF library higher level abilities for the task. Docotic.Pdf library is a good choice (of course, I am one of the developers of the library).
Here is an articles that shows how to extract words from PDFs with extra info about their position etc.
Each extracted word comes in PdfTextData object. The PdfTextData contains IsTransformed property to check if word is rotated, scaled, and / or flipped. You can also analyze PdfTextData.TransformationMatrix for more information about the transformation.

How can I edit the search text of a searchable PDF?

I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but there is a kind of hidden text in the PDF that can be selected when you try to select a portion of the image that contains text. In this way you can copy and paste text or search for text in the scanned document. This is VERY useful. It's an awesome improvement over raw scanned images. I also have several apps on my mac that can create this kind of searchable PDF from a scanned document or a raw image.
Now it's obvious from any who has ever used OCR that the process of converting images to text is not 100% accurate, so the text that you search or copy will not be correct in some places.
So I search for quite some time to find an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.
Does anyone know of a tool (or library API) that would allow this?
It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace scanned image with the results of it's own OCR process so that I could edit and save the document. But this would produce horrible results for any of the scanned documents that I am using. It seems designed for editing a "native PDF" not editing a scanned document.
I have also tried ABBYY FineReader with no luck.
i'm using ABBYY FineReader 12 Professional. (not open source)
Just open a scanned image or scanned pdf and press Verify Text(or Ctrl + F7), than you go over all the spelling errors or low-confidence charachters and fix them.
The program is very good, it shows you the exact place in image/pdf to correct and the OCR guessing side by side for convenience. It iterates all of them.
[By the way, I'm using the shortcuts to speed up things:
Alt+Enter to add the unrecognized word to dictionary.
Ctrl+Delete to skip word or confirm in case you fixed it.]
Than save the document as a pdf file Menu:File>Save Document As> PDF File, and you can search it on every pdf reader. The saved file look the same as the scanned one, but 'behind' it there text.
It's weird you tried ABBYY with no luck... it's working great for me. maybe you tried not the Professional version.
Hope it helps you.
It is not creating a searchable pdf from images the poster is after, he wants to start with an already searchable pdf and modify its text (e.g. because intially a searchable pdf was made but later an overlooked error in recognition was found and needs correction). I see no way and no tool that assists in doing this.

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advices to do this and to do that, which helps to some and does not to others (nothing helped me). Looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World ?
EDIT
A bit of the context. I am trying to convert a trivial HTML to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to the Internet searches the tool is supposed to work and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Acrobat and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. It's goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now and to ensure that it tests a whole range of properties of the file itself and the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct for example) but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
A sample for such a PDF browser tool is RUPS but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed does only contain vector drawing operations which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is 1.570 Bytes. So this task looks not as being overly huge.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3.875 Bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines contain 12 different groups of drawing commands, where each one represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator to actually fill black color into so far constructed (closed) path. Nothing in the other lines (which I didn't analyze in detail) does any drawing that may resemble the shapes of any glyphs.
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1.
Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job in converting one or multiple HTML documents into a single PDF book containing chapters, page-numbering, page headers and footers and more. For all options available, see its full documentation.
2.
Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job when it comes to support almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.

pdf see current line ruler

I'm looking for accessibility tool , to make it easier to read pdf's.
In short, it should be possible to easily see which line is being read ( a bit like a ruler,when it comes down to text ), to avoid losing the line that is being read.
I was wondering if anyone knows any solution for this , for example a plugin for Adobe Acrobat Reader, etc...
Any suggestions are welcome.
I don't think there is a plug-in for Acrobat Reader. You may want to look at ZoomText or ClaroRead. Of course these only work if the PDF has text, but not images of text.
A low tech solution would be to open a Notepad doc and size it how you need. If you are on Win7 you could do this with sticky notes.
Another approach I've used is to convert the PDF to HTML and then run a server with it. This is fairly simple to accomplish using Live Server in VScode.
In the Chrome browser, we may then use accessibility extensions, such as ReadingBuddies, that have reading ruler functions.
Otherwise consider,
Use a PDF reader that has a built-in reading ruler feature, such as Adobe Acrobat Reader DC or Foxit Reader.
Use a PDF reader that allows you to add a reading ruler as an annotation, such as Xodo PDF Reader.
Use an online tool that allows you to view PDFs with a reading ruler, such as Smallpdf's PDF Reader.
Use a screen ruler tool, such as the one offered by How-To Geek, to measure the PDF on your screen.
The academic term is sometimes called RSVP (Rapid Serial Visual Presentation), there are patented hardware and software versions but in principle it is simply a translucent masking added to the viewport. see https://softwarerecs.stackexchange.com/questions/28582/is-there-an-equivalent-to-a-reading-guide-strip-for-windows-os-x-or-linux and http://www.see-n-read.com/products/esee-n-read-2/
10 years later and its 2023 so software such as browsers should include such features here is Edge in some sites where Immersive Reader is supported but not StackOverflow !! The above example is using an edge extension. https://microsoftedge.microsoft.com/addons/detail/screen-mask/dfanfcmhbdocjfpmnoebccndgmhlincl others are available for other browsers https://chrome.google.com/webstore/detail/reading-ruler/phiedfcbjfjagnjikfbobmldbpmdcpfk
To get the Reader Mode options on Chrome: or Edge look at the available flags
However if you save page as PDF and read aloud it is then used there !
Some PDF readers like Mac Skim include such accessibility option.
However, simplest is :-
Most PDF readers can be reduced to focus viewport on single lines and with auto scrolling that allows for more focused "line by line" reading without the audio, plus fast and easy adjustments/enlarging for PDF variable lines with illustrations.
Note as per above PDF where much of the text is actually one or two lines out of order it is not trivial for a PDF reader to understand which text base line is independently to be used next. in reality "Read Aloud" will read two variable height lines then jump to top of page then back to the second visible line. PDF lines are not the visible order nor a constant height/spacing, you might expect.

PDF Outline Text - Automation of Acrobat Sequences

I have built an application that automates the filling out of form fields inside a pdf. It then takes various assets and combines them together to generate a "print ready" product. All of this is accomplished using the magic of iTextSharp. When form fields are populated, they are then flattened to text. The problem is that even with the fonts embedded they aren't really attached to the form fields in a meaningful way (like straight text elements are) and the printers are complaining that the pdf is generating licensing errors due to this. I researched this a bit and it just seems to be the nature of how form fields are.
The artists we are working with requested that we research a way to "outline" the text that is created from flattening the form fields. I found that running the PDF Optimizer with a custom preset allows for Text Outlining in Acrobat, and even better I can generate an Acrobat Sequence that runs this command on the pdf. The problem is that Sequences can not be automated, at all.
I found a plug-in called AutoBatch that allows for the execution of Sequences on the command line through a batch file. The downside is that this would require installing Acrobat Pro and the Plug-in on the server this application will be running on. Further it seems like an overkill solution just to outline the text in the pdf. For all I know at this point iTextSharp may allow me to do this programmatic, but searching for such a thing on google returns little results and nothing relevant.
So the question: Is there a better way to outline text in a pdf than the current solution I have implemented or am I kind of stuck?
TLDR; PDF is generated w/ non-standard fonts. I need to "outline" this text to send it to the printer. Currently using AutoBatch Acrobat Plug-In to execute Acrobat Sequence from the Command Line. Seems excessive, wondering if anyone knows a better way to automate font outlining.
I am also in a printing environment and have used forms for "Box Covers" plenty of times to shorten the code used to produce box covers.
I simple us "pdfStamper.FormFlattening = true;" and the printers (Xerox DP180 and DC5000) has no problems in using the PDF.
The moment I leave out FormFlattening the printer gives a lot of errors regarding the PDF.
If you are using FormFlattening then check if the printer has the font locally installed in order for it to reference the font from the print engine instead of the PDF resources.