Find duplicate PDFs

I'm looking for a utility that will help me find duplicate PDFs. The problem: I have thousands of PDF files, some of which are duplicates, and they are not easy to detect due to differing file names and small differences in file size. Is there a utility/algorithm/library that can help me find the duplicates, or show me which files are very similar (with some degree of difference)?

Create an MD5 hash for each file and store it in a database. Identical files will then sort next to each other, or you can quickly search for a pre-existing key.
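A minimal sketch of that approach in Python, with an in-memory dict standing in for the database (the root path is a placeholder); note that this only catches byte-identical copies, not near-duplicates:

import hashlib
import os
from collections import defaultdict

def md5_of_file(path, chunk_size=1 << 20):
    # Hash the file in chunks so large PDFs don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # Group every PDF under root by digest; keep groups with 2+ members.
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                groups[md5_of_file(path)].append(path)
    return {h: p for h, p in groups.items() if len(p) > 1}

for digest, paths in find_duplicates("/path/to/pdfs").items():
    print(digest, paths)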

This problem isn't fully solved by any one tool. What I do is use fdupes (http://premium.caribe.net/~adrian2/fdupes.html) to find exact duplicates.
But most of all, I use a workflow which minimizes duplicates. Every document that enters my system gets indexed with a Perl script I wrote (http://seegras.discordia.ch/Programs/fileindex), which puts the name and an MD5 sum of each file into ~/.fileindex.md5. I can then change the metadata of the local PDF files or whatever (and run fileindex again), and whenever I accidentally download the same file again, I will still have the MD5 sum of the original file and can thus detect that it's a duplicate.
There are also exif-meta and exif-rename on http://seegras.discordia.ch/Programs/, which help with setting PDF metadata and with renaming PDF files according to that metadata; if you're tagging all the files correctly, duplicate filenames will emerge, indicating that two files might be the same document.
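For illustration, here is a hedged sketch of the "check a new download against the index" step; it assumes the index holds one "md5sum filename" pair per line, which may not match fileindex's actual format:

import hashlib
import os
import sys

INDEX = os.path.expanduser("~/.fileindex.md5")

def md5_of_file(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Load known digests (assumed line format: "<md5sum> <name>").
known = {}
with open(INDEX) as f:
    for line in f:
        parts = line.split(None, 1)
        if len(parts) == 2:
            known[parts[0]] = parts[1].strip()

digest = md5_of_file(sys.argv[1])
print("duplicate of: " + known[digest] if digest in known else "new file")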

If the files were created by different tools, they can look identical when rendered yet be structured totally differently internally, so naive comparisons give very different results. I made some suggestions in a blog article at https://blog.idrsolutions.com/2010/09/comparing-2-pdf-files/
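One way to compare what the files look like, rather than their bytes, is to rasterize both and compare the rendered pages. A rough sketch using poppler's pdftoppm (an exact pixel match is a very strict test, so treat a mismatch only as "needs a closer look"):

import hashlib
import os
import subprocess
import tempfile

def page_hashes(pdf_path, dpi=72):
    # Render every page to PNG with pdftoppm, then hash the image bytes.
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "page")
        subprocess.run(["pdftoppm", "-png", "-r", str(dpi), pdf_path, prefix],
                       check=True)
        hashes = []
        for name in sorted(os.listdir(tmp)):
            with open(os.path.join(tmp, name), "rb") as f:
                hashes.append(hashlib.md5(f.read()).hexdigest())
        return hashes

print("render identically" if page_hashes("a.pdf") == page_hashes("b.pdf")
      else "render differently")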

DiffPDF looks like something that might help you.

I remember that there is a UNIX utility called pdftotext (in the poppler-utils package). You can try extracting the text from the files and making a textual diff.
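A small sketch of that idea, shelling out to pdftotext and diffing the output; the similarity ratio also gives the "degree of difference" the question asks for:

import difflib
import subprocess

def pdf_text(path):
    # "-" tells pdftotext to write the extracted text to stdout.
    result = subprocess.run(["pdftotext", path, "-"],
                            check=True, capture_output=True, text=True)
    return result.stdout.splitlines()

a, b = pdf_text("one.pdf"), pdf_text("two.pdf")
ratio = difflib.SequenceMatcher(None, "\n".join(a), "\n".join(b)).ratio()
print("similarity: {:.1%}".format(ratio))
for line in difflib.unified_diff(a, b, fromfile="one.pdf", tofile="two.pdf", lineterm=""):
    print(line)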

Related

Reducing PDF size - From 5MB to ~200KB

INTRO
I have this 2.7 MB PDF file.
It's a certificate with two fields that I have to fill: name and course.
After filling those fields I save it for later printing.
THE PROBLEM
After saving, the new file comes out at ~5 MB.
I have tried many saving options, but I only managed to reduce it to a final size of 4.7 MB (still larger than the original file).
For instance, I tried opening the original file (2.7 MB) and saving it right after opening (without making any change). The result is the same: a new ~5 MB file.
That means it isn't the added information (name and course) that is at fault.
SOLVING
At some point, while trying new saving methods, I managed to save it at a size of 180 KB.
Unfortunately, I'm not able to reproduce that result.
After several hours of trying to achieve it again without success, I came here to ask for help :(
Since you are in Acrobat, you might use "Save As Optimized…" (where you already are, in order to show the space usage) and remove as much as possible: mainly structure information and private data (which means data allowing the originally-creating application to edit the file again), etc.
You might also start from a minimum-sized blank file and copy/paste the form fields into it (although I don't think that would cause much reduction since, AFAIK, fonts used in form fields are counted in the Fonts item).
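If Acrobat's optimizer doesn't get you there, re-distilling with Ghostscript is a common alternative. A hedged sketch (file names are placeholders; note that pdfwrite can flatten or drop interactive form fields, so verify the output is still fillable):

import subprocess

# Quality presets from smallest to largest output:
# /screen, /ebook, /printer, /prepress.
subprocess.run([
    "gs", "-sDEVICE=pdfwrite",
    "-dCompatibilityLevel=1.4",
    "-dPDFSETTINGS=/ebook",
    "-dNOPAUSE", "-dBATCH", "-dQUIET",
    "-sOutputFile=certificate-small.pdf",
    "certificate.pdf",
], check=True)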

Compare PDF files visually (drawings and highlights) and merge the differences

I am trying to compare and merge two PDF files which have text, drawings, and highlights/comments.
The old file has highlights and comments; the new file has changes to the text and drawings, but no highlights or comments. I need to be able to compare all the differences and merge the highlights and comments from the old file back into the new file where applicable.
So far I have found some tools that do the comparison, but not the merging of highlights. I have tested DiffPDF and it works for comparing, but I am not sure how I could use it to merge the files. Is there any software/tool that already does this, and is there a way to do the merge with DiffPDF?
There is no easy way to do what you are asking. Even if you go low level, there are big challenges to face. PDF is very different from other document formats in that there is no semantic structure embedded in the document, so it would be very hard for something like a merge process to be able to figure out what to do. You may need to try a completely different approach. Remember that PDF was designed essentially to display identically on different platforms. It was never designed for document editing.
Check out this library for comparing PDFs and highlighting the differences:
https://github.com/vinsguru/pdf-util
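If you are willing to script a partial merge yourself, here is a rough, assumption-laden sketch using PyMuPDF (a different library from pdf-util above) that copies highlight annotations across page by page. It only helps where the highlighted content has not moved between the two versions, which is exactly the hard part the answer above warns about:

import fitz  # PyMuPDF

old = fitz.open("old.pdf")
new = fitz.open("new.pdf")

for pno in range(min(old.page_count, new.page_count)):
    dst_page = new[pno]
    for annot in old[pno].annots():
        # Annotation type 8 is "Highlight" in the PDF spec.
        if annot.type[0] == 8:
            copied = dst_page.add_highlight_annot(annot.rect)
            copied.set_info(content=annot.info.get("content", ""))
            copied.update()

new.save("merged.pdf")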

How to remove the .efs file extension from 1000s of recovered files in one folder

I recently recovered a 1.5 TB external HDD that crashed. The program I used to recover the files was Active Undelete Enterprise; it's excellent. When the files were successfully recovered, they were all saved with a .efs extension, so files looked like mydocument.docx.efs. At first I thought they were encrypted and needed to be decrypted; I spent 10 minutes on it and realized I just needed to remove the .efs from the filename, and then mydocument.docx works perfectly. The problem is that I now have over 55,000 files within hundreds of folders where I need to simply remove the .efs from each filename. Does anyone know how to do this?
From a command prompt window, navigate to the top level directory where these files reside.
Type the command
DIR /S/B >>filelist.txt
This command will give you a bare format file listing of the current directory plus all nested subdirectories without any extraneous information. The list will be contained in the text file named "filelist.txt" or whatever else you choose to call it. I would then use this text file in a text editor to convert every line of text from, for example,
C:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs
to
rename c:\Users\dlucas\.gimp-2.8\mathmap\file1.png.efs file1.png
to give a simple example of a file that I just found on my system using this method.
You will need to use a text editor with a columnar editing capability, since you have to modify so many files. Old programmer's editors such as CodeWright made this really simple, while modern editors such as Eclipse or Notepad++ make this a little more difficult and may require a columnar editing plugin, depending on the version. You basically make a columnar copy of all of the text in the file, and then paste the copy off to the far right - far enough that the second column of filenames and paths won't overwrite any of the existing ones. You can then use the columnar editing features to select and delete the path names in the 2nd column, since the rename command requires that its 2nd argument be just the base filename and extension without path information. You can also use the columnar editing features to prepend every line with "RENAME ". If you attempt to do this without columnar editing features, you will find it slow going!
An alternate way to do this is to use a command formed from a "regular expression" to create the rename command. If you are not familiar with "regular expressions", ask a programmer friend as this is not an easy topic to learn from scratch. If you are familiar with regular expressions, this is probably the simplest way to perform this task. I haven't used them in many years and no longer recall the exact syntax to use or I would tell you myself.
Regardless of what kind of editor you use, the goal is to turn this ASCII file list of paths and filenames into a batch file (simply rename file1.txt to file1.bat when you are finished editing). You can then run the batch file by typing file1.bat at a command prompt.
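If scripting is an option, the whole rename can also be done directly, without hand-editing a batch file. A minimal Python sketch (set root to your top-level recovery directory):

import os

root = r"C:\recovered"  # top-level directory containing the recovered files

for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        if name.lower().endswith(".efs"):
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, name[:-4])  # strip the ".efs"
            if not os.path.exists(new_path):  # don't clobber an existing file
                os.rename(old_path, new_path)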
I have just run into this same problem myself, using the same really wonderful tool that you used. I am writing this while waiting for the undelete program to finish. That it restores files with this extra extension seems very counter-intuitive, so I will look for an option to make it not do this when it finishes. If I find one, I will post a new answer here that is more specific to this tool. Otherwise, I am going to have to rename all kazillion files just as you had to.
You experienced this problem because the disk that you recovered your files to "does not support encryption", according to the Active@ UNDELETE documentation. The documentation offers no further explanation of what kinds of disks support encryption, etc.
They offer a Decrypt command that restores the files' proper names as a post-processing step. Unfortunately, this requires that you "include" each and every file to be decrypted, with no support for wildcards or for recursing into subdirectories, so that is a non-starter in my opinion, given that both of us have hundreds of thousands of files to rename.
I did find that by selecting a normal fixed (non-removable) hard drive as the destination of the recovery effort, the resulting files do not end up "encrypted" (i.e., they are recovered with the proper file name and extension). I had originally chosen a large USB-based flash drive, and the files were stored in their "encrypted" state (not really encrypted, but potentially so, hence the .efs extension). Of course, this meant that I had to run the recovery all over again after switching to a regular hard drive (it takes about 16 hours to recover 80 GB worth of files, due to the presence of many sector CRC errors).

Extract Tabular Data from a PDF and sort it

I have a PDF file which has the mark list of a certain exam.
I am particularly interested in the first list, which unfortunately has 2112 entries, and they aren't properly formatted. I need to sort all these entries (based on the marks in the last 2 columns - the sum of the marks in Aptitude and Computer) to know what my rank is.
I tried to copy it into MS Word and Excel, but if you try it, you can see it won't help. After pasting it into a plain text file, I tried to format it using regular expressions (in Notepad++) and wrote a program in C to separate each field with '\t' (so that later I could copy them properly into an Excel sheet), but the inconsistency made me fail (some entries span multiple lines, and the "names" do not have a fixed number of fields).
Can someone come up with any idea that would make it possible to copy the first list in the PDF to a spreadsheet in tabular form, exactly as in the original file?
For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult
For an amazing open-source family of tools that gets better and better from week to week at extracting tabular data from PDFs (unless they are scanned pages) -- contradicting the gloomy article above! -- see these links, and the sketch that follows them:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
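A hedged sketch of the same workflow from Python, via the third-party tabula-py wrapper (not the upcoming API above); the file name, page range, and the assumption that the last two columns hold the Aptitude and Computer marks are all placeholders to adapt:

import pandas as pd
import tabula

# Read every table on the listed pages into pandas DataFrames.
tables = tabula.read_pdf("marklist.pdf", pages="all", lattice=True)
df = pd.concat(tables, ignore_index=True)

# Assumed: the last two columns are the Aptitude and Computer marks.
df["total"] = df.iloc[:, -1] + df.iloc[:, -2]
print(df.sort_values("total", ascending=False).head(20))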
Well, I sort of managed it. I first copied it to a plain text file and deleted all the letters, leaving only the serial numbers and the corresponding marks, separated by spaces or tabs. Then, using "import" in an OpenOffice spreadsheet, I told it the delimiters were spaces and tabs (combining them where necessary) and bingo! I got my rank.
But I would still like to know if it's possible to copy the whole table as-is, so I'm keeping this question open.
I was once tasked with building a parser to extract data from a PDF with tabular and non-tabular data, in a number of different encodings and with a mix of RTL and LTR text. That project took quite the effort, but with a simple English table you should be able to dissect the PDF in no time. Look for the PDF specs on adobe.com, and if you're that desperate, start digging in.
Also, you'll first need to use pdftk.exe to uncompress the file (for example: pdftk marklist.pdf output uncompressed.pdf uncompress, where the file names are placeholders).
A shortcut that may be of aid:
http://www.adobe.com/devnet/pdf/pdf_reference.html
This is the shortcut I meant: http://www.codeproject.com/KB/cs/PDFToText.aspx

Should I merge .pbxproj files with git using merge=union?

I'm wondering whether the merge=union option in .gitattributes makes sense for .pbxproj files.
The manpage states for this option:
Run 3-way file level merge for text files, but take lines from both versions, instead of leaving conflict markers. This tends to leave the added lines in the resulting file in random order and the user should verify the result.
Normally, this should be fine for the 90% case of adding files to the project. Does anybody have experience with this?
Not a direct experience, but:
This SO question really advises against merging .pbxproj files.
The pbxproj file isn't really human-mergeable.
While it is plain ASCII text, it's a form of JSON. Essentially you want to treat it as a binary file.
(hence a gitignore solution)
Actually, Peter Hosey adds in the comment:
It's a property list, not JSON. Same ideas, different syntax.
Yet, according to this question:
The truth is that it's way more harmful to disallow merging of that .pbxproj file than it is helpful.
The .pbxproj file is simply JSON (similar to XML). From experience, just about the ONLY merge conflict you will ever get is when two people have added files at the same time. The solution in 99% of merge-conflict cases is to keep both sides of the merge.
So a merge 'union' (with a gitattributes merge directive) makes sense, but do some tests to see whether it does the same thing as the script mentioned in the last question.
See also this question for potential conflicts.
See the Wikipedia article on Property List
I have been working with a large team lately and tried *.pbxproj merge=union, but ultimately had to remove it.
The problem was that braces would end up out of place on a regular basis, which made the files unreadable. It is true that this does work most of the time, but it fails maybe 1 time out of 4.
We are back to using *.pbxproj -crlf -merge for now. This seems to be the best solution that is workable for us.
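For reference, both setups discussed above are single-line entries in .gitattributes; whichever you pick, test it on a scratch branch first:

# Option A: union-merge project files (works most of the time, per the answers above):
*.pbxproj merge=union
# Option B (what the last answer settled on): treat them as effectively binary
# so git never auto-merges them:
# *.pbxproj -crlf -merge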