Currently, I'm trying to write a program (in C) that would allow me to search and replace text in PDFs. I can already search for the string and edit the string itself, but somehow I cannot find a way to actually edit the corresponding object in the document.
Does anyone know how to "send" the changes back to the document using the mutool/MuPDF library?
If you are open to using command-line tools, you can use mutool clean.
Using this tool you can rewrite content streams.
"The clean command pretty prints and rewrites the syntax of a PDF file. It can be used to repair broken files, expand compressed streams, filter out a range of pages, etc."
I have many PDF files with the same layout, in which I need to fill in some details as plain text and sign in certain places with a company stamp.
I am trying to automate things, so I am after a way to fill in the details + stamp the first file and then "copy-paste" that operation to all files.
I managed to find:
online websites allowing me to annotate one file at a time
documentation on how to use pdftk and additional tools to create a script, but this can take a lot of manual operations on the command line (e.g. scaling the signature JPEG file, positioning it, adding text in the right location, etc.), which is very tedious.
Is there any way to annotate the first file using visual tools (like #1) and then extract a script of commands that performs these annotations from the command line?
I can then use that script on multiple files.
Thank you,
-Moshe
I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I started the process and it asked for the directory, I chose C:\, my main hard drive.
It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I had removed all the PDFs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the Word documents in the list.
So I manually removed those too. But Adobe still said I couldn't proceed until problem files were removed, even though there weren't any files remaining in the list that Adobe had flagged as having issues.
My firm is trying to make sure all the PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without removing them from their varied locations.
I think you can do this using a combination of
regular Java : to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
iText : to iterate over the PDF document and extract all images
Tess4J : a port of Tesseract (Google's OCR engine) for Java, to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you, but only for one PDF at a time. So you'd still need some Windows/Linux scripting to feed in all the files of a given directory.
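As a minimal sketch of the listing-plus-OCR part only, assuming Tess4J is on the classpath and skipping the iText image-extraction step (Tess4J can hand a whole PDF to Tesseract, provided Ghostscript is installed to rasterise it); the directory and tessdata paths are placeholders:

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrPdfFolder {
    public static void main(String[] args) throws TesseractException {
        File dir = new File("C:/scans");      // placeholder: folder containing the PDFs
        Tesseract ocr = new Tesseract();
        ocr.setDatapath("C:/tessdata");       // placeholder: Tesseract language data directory

        File[] files = dir.listFiles();
        if (files == null) return;
        for (File f : files) {
            if (f.getName().toLowerCase().endsWith(".pdf")) {
                String text = ocr.doOCR(f);   // run OCR on the whole document
                System.out.println(f.getName() + ": " + text.length() + " characters recognised");
            }
        }
    }
}

Extracting the embedded images with iText first, as described above, would give Tesseract cleaner input, but the structure of the loop stays the same.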
I know Ghostscript can use -dFirstPage/-dLastPage to make a new file from a range of pages, but I need it (or another command-line program) to delete the 2nd page of any PDF without the range of pages being explicitly given. I thought this would be far easier, because most printers let you specify "1,3-end", and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very, very messy) is to extract page 1, extract pages 3 to end, and then merge the two PDFs. But I also don't know how to have GS determine the number of pages.
Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, you should run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
OK, first things first: if you use Ghostscript's pdfwrite device you are NOT extracting, deleting, or performing any other 'manipulation' operation on your source PDF file. I keep reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript, it is completely interpreted into a series of graphical primitives which are passed to the device; in general the device will render the primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are re-assembled into a brand new file, in the case of pdfwrite a PDF file.
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. But you don't actually need to bother: run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), then run both resulting files through pdfwrite again to create a file with the pages from each combined.
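A sketch of that two-pass approach, with placeholder file names and Ghostscript invoked as gs (gswin64c on Windows):
gs -o page1.pdf -sDEVICE=pdfwrite -dLastPage=1 input.pdf
gs -o rest.pdf -sDEVICE=pdfwrite -dFirstPage=3 input.pdf
gs -o output.pdf -sDEVICE=pdfwrite page1.pdf rest.pdf
Bear in mind, as explained above, that output.pdf is a brand new PDF that merely reproduces the appearance of the original pages, minus page 2.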
I recently recovered a 1.5TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise; it's excellent. When the files were successfully recovered, they were all saved with a .efs extension, so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted; I spent 10 minutes on it and realized I just need to remove the .efs from the end of the filename and mydocument.docx works perfectly. The problem is that I now have over 55,000 files within hundreds of folders where I need to simply remove the .efs after each file. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare format file listing of the current directory plus all nested subdirectories without any extraneous information. The list will be contained in the text file named "filelist.txt" or whatever else you choose to call it. I would then use this text file in a text editor to convert every line of text from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with a columnar editing capability, since you have to modify so many files. Old programmer's editors such as CodeWright made this really simple, while modern editors such as Eclipse or Notepad++ make this a little more difficult and may require a columnar editing plugin, depending on the version. You basically have to make a columnar copy of all of the text in the file, and then paste the copy off to the far right, far enough that a second column of filenames and paths won't overwrite any of the existing file names and paths. You can then use the columnar editing features to select and delete the path names of the text in the 2nd column, since the rename command requires that the 2nd argument be simply the base filename and extension without the path information. You can also use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
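For illustration only (this is from memory rather than tested, and the exact backreference syntax varies between editors), a regex search and replace in an editor such as Notepad++ could turn each line of the DIR output into a rename command in one pass:
Search: ^(.*\\)([^\\]+)\.efs$
Replace: rename "\1\2.efs" "\2"
Applied to the example path above, this produces rename "C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs" "file1.png"; the quotes also keep the batch file working for paths that contain spaces.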
Regardless of what kind of editor you use, the goal is to turn this ASCII file list of paths and filenames into a batch file (simply rename filelist.txt to filelist.bat when you are finished editing). You can then run the batch file by typing filelist.bat at a command prompt.
I have just run into this same problem myself, using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very counter-intuitive, so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have to rename all kazillion files just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active@ UNDELETE documentation. The documentation offers no further explanation of what kinds of disks support encryption, etc.
They offer a Decrypt command that restores the files' proper names as a post-processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards or parsing subdirectories, so that is a non-starter in my opinion, given that both of us have hundreds of thousands of files to be renamed.
I did find that by selecting a normal fixed (non-removable) hard drive as the destination of the recovery effort, the resulting files do not end up "encrypted" (i.e., they are recovered with the proper file name and extension). I originally chose a large USB-based flash drive, and the files were stored in their "encrypted" state (not really encrypted, but potentially so, and thus they get the .efs extension). Of course, this meant that I had to run the recovery all over again after switching to a regular hard drive (it takes about 16 hours to recover 80GB worth of files due to the presence of many sector CRC errors).
My program downloads a PDF file from a source location every day. When I look at the raw content of the PDF file in Notepad, I find that sometimes the PDF file has the string <!-FTCACHE-1-> at the end. Sometimes this string is missing from the PDF file.
My program downloads this PDF daily and compares it with the previous day's PDF file using the Windiff binary comparison.
99% of the time, Windiff reports differences in the PDF file just because one PDF contains the string <!-FTCACHE-1-> at the end.
Does anyone know what the reason behind this is?
Thanks,
Praveen
<!--FTCACHE-1--> is generated by FatWire Content Server, a web content management solution that is probably generating your URL. FTCACHE means FutureTenseCache, the name of the original product component. The text is a "footer" flag that indicates to the caching module whether or not the page was properly generated. If the page is supposed to be cached, a 1 indicates that the page was properly built, and so is cacheable. If 0 is returned, it indicates that the page was corrupted and should not be cached. The Satellite Server caching engine is supposed to strip this footer once it reads it.
In other words, the key that is there to ensure that the cache is not corrupted is causing the corruption in your PDF.
This issue has been fixed in patches to FatWire ContentServer for quite some time now.
For your purposes, just ignore the string - strip it if you can.
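If the comparison is done in code rather than with Windiff, a small sketch along these lines (Java purely as an example; the file names are placeholders) drops a trailing FTCACHE comment before comparing the two downloads:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class ComparePdfs {
    // Remove a trailing <!-FTCACHE-n-> / <!--FTCACHE-n--> footer, if one is present.
    static byte[] stripFtcacheFooter(byte[] data) {
        int n = Math.min(40, data.length);
        String tail = new String(data, data.length - n, n, StandardCharsets.ISO_8859_1);
        int idx = tail.indexOf("<!-");
        if (idx >= 0 && tail.indexOf("FTCACHE", idx) > idx) {
            return Arrays.copyOf(data, data.length - (n - idx));
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        byte[] today = stripFtcacheFooter(Files.readAllBytes(Paths.get("today.pdf")));         // placeholder
        byte[] yesterday = stripFtcacheFooter(Files.readAllBytes(Paths.get("yesterday.pdf"))); // placeholder
        System.out.println(Arrays.equals(today, yesterday) ? "identical" : "different");
    }
}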
Sorry about that. That was my bug. :-)
The application that generates the PDF file has a bug: the FTCACHE tag should not be there, as it is not a valid PDF construct. Its presence actually damages the PDF file; it invalidates the Fast Web View feature of the PDF, as you have seen. It is safe to remove it before comparing the files.
"FT" could be FreeType, the open source font engine. The comment probably comes from the software that generates the PDF. If you can somehow identify that, you could (assuming it is open source) perhaps take a look through it and see what causes it to emit the comment.
FreeType has a source folder dedicated to caching, the root source file there is called ftcache.c. It doesn't do a lot though, just #includes (!) the other source files.
Googling the string you see reveals several more or less random PDFs that seem to contain it.