Reducing PDF size - From 5MB to ~200KB - pdf

INTRO
I have this 2,7 MB PDF file.
It's a certificate with two fields that I have to fill: name and course.
After filling those fields I save it for later printing.
THE PROBLEM
After saving, the new file comes up with ~5MB.
I have tried many saving options, but I only managed to reduce it to a final size of 4,7MB (still larger than the original file).
For instance, I tried opening the original file (2,7MB) and saving it right away, without making any change. The result is the same: a new ~5MB file.
That means the added information (name and course) is not what is at fault.
SOLVING
At some point, trying new methods of saving, I managed to save it to the size of 180KB.
Unfortunately, I haven't been able to reproduce this.
After several hours of trying to achieve this again without success, I came here to ask for help :(

As you are in Acrobat, you might use "save as optimized…" (where you are already, in order to show the space usage), and remove as much as possible (mainly structure information, private data (which means data allowing the original-creating application to edit the file again), etc.).
You might also start from a minimum-sized blank file and copy/paste the form fields into it (although I don't think that would cause much reduction, as, AFAIK, fonts used in form fields are counted in the Fonts item).

Related

Extract text from illustrator file without opening file

Any idea if it would be possible to extract text from an Illustrator file without opening it?
I have an AppleScript currently extracting the text but it takes a long time when I'm working on hundreds of files. I was wondering if it would be possible to get the information without opening the AI file.
+1 for showing your own code first. (Also, typo in first line: I think you meant “Illustrator”, not “photoshop”.)
If you’re only getting plain text it should only take a fraction of a second per document (opening the file will take longer):
tell application "Adobe Illustrator"
    -- a single query returns the text of every text frame in the front document
    get contents of every text frame of document 1
end tell
(i.e. Never iterate over individual application objects, querying each one, when a single query will do everything for you. Apple events are relatively expensive for apps to resolve; sending lots of them unnecessarily really kills performance.)
Also be aware that AppleScript has serious performance problems when iterating over large lists, but that’s a separate issue, the solution to which should already be covered elsewhere.

Why should applications read a PDF file backwards?

I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can - if my file is linearized. But that is optional and means some extra overhead both when writing and reading such file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by the objects on page 1, page 2, page 3...). But people at Adobe probably had their reasons to put them after it. I just don't see them. So...
Why is the cross-reference table placed after the body?
I would agree with the two reasons already mentioned, but not because of hardware limitations "back in the day", but rather scale. It's easy to think an invoice with a couple of pages of text could be done better differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end you can write images/text/fonts to the file as they are processed and then discard them from memory while simply storing the file offset of each object to be used to write the trailer.
If the trailer had to come first then you would have to read (or even generate in the case of an embedded font) all of these objects just to get their size so you could write out the trailer, then write all the objects to the file. So you would either be reading, sizing, discarding, then reading again, or trying to hold everything in RAM until you could write them to the file.
Write speed and RAM are still issues we contend with today when we're running in a Docker container on a VM on shared hardware.
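The streaming argument above can be sketched in Python. This is a minimal illustration under assumptions, not a real PDF writer: the function name `write_objects_then_xref` is made up here, the object bodies are placeholders, and the trailer dictionary simply assumes object 1 is the catalog. The point is only that each object is written and forgotten, and nothing but its byte offset is kept until the cross-reference table is written at the end.

```python
import io


def write_objects_then_xref(out, object_bodies):
    """Stream objects to `out`, keeping only their offsets, then append the xref.

    `object_bodies` may be a generator, so no more than one object body
    needs to be in memory at a time.
    """
    out.write(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(object_bodies, start=1):
        offsets.append(out.tell())      # remember where this object starts
        out.write(b"%d 0 obj\n" % number)
        out.write(body)
        out.write(b"\nendobj\n")        # the body itself can now be discarded

    entry_count = len(offsets) + 1      # +1 for the mandatory free entry 0
    xref_position = out.tell()
    out.write(b"xref\n0 %d\n" % entry_count)
    out.write(b"0000000000 65535 f \n")  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out.write(b"%010d 00000 n \n" % offset)

    # Assumption for this sketch: object 1 is the document catalog.
    out.write(b"trailer\n<< /Size %d /Root 1 0 R >>\n" % entry_count)
    out.write(b"startxref\n%d\n%%%%EOF\n" % xref_position)
    return offsets, xref_position
```

Putting the table first would require knowing every offset before writing the first object, which is exactly the read-size-discard-read-again problem described above.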
PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.
Not only were the drives slow (giving rise to the argument in @joelgeraci's answer), there was also much less RAM available in a typical computer. Thus, when creating a PDF one had to write data to the file early, much earlier than one had any idea how big the file or, as a consequence, the cross references would become. Writing the cross references at the end, therefore, was a natural consequence.
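On the reading side, "reading from the end" does not mean loading the whole file. A reader seeks to the last few hundred bytes, finds the `startxref` keyword, and jumps straight to the table. A minimal sketch, assuming a seekable binary stream; the function name `find_xref_offset` and the 1024-byte tail size are choices made for this example:

```python
import io
import re


def find_xref_offset(stream, tail_size=1024):
    """Return the byte offset stored after the last 'startxref' keyword.

    Only the final `tail_size` bytes of the stream are read.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - tail_size))
    tail = stream.read()

    # After incremental updates a file can hold several startxref markers;
    # the last one is the one that is current.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near the end of the file")
    return int(matches[-1])
```

With that offset a reader can load the cross-reference table and then fetch only the objects it needs, for example those of the first page.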

Find duplicate PDFs

I'm looking for a utility that will help me find duplicate PDFs. The problem: I have 1000s of PDF files. Some are duplicates. They are not easy to detect due to differing file names and small differences in file size. Is there a utility/algorithm/library that can help me find the duplicates or show me files that are very similar (or the degree of difference)?
Create an MD5 hash for each file and store it in a database. Identical files will then sort next to each other, or you can quickly search for a pre-existing key.
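A minimal sketch of the hashing approach in Python. It uses an in-memory dictionary instead of a database, which is an assumption made to keep the example short; the function names are invented for this illustration. Note that this only finds byte-for-byte identical files, so PDFs that differ in metadata will not match.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are never fully loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Return lists of paths under `folder` whose contents are identical."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*.pdf")):
        groups[md5_of_file(path)].append(path)
    # Keep only the hashes that were seen more than once.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates` on a folder returns one list per set of identical files, whatever their names are.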
The problem is not yet solved in any way. What I do, is I use fdupes http://premium.caribe.net/~adrian2/fdupes.html to find exact duplicates.
But most of all, I use a workflow which minimizes duplicates. Every document that enters my system gets indexed with this perl-script I wrote: http://seegras.discordia.ch/Programs/fileindex which puts some name and an md5-sum of it into ~/.fileindex.md5. Now I can change metadata of the local PDF-files or whatever (and run fileindex again), and whenever I accidentally download the same file again, I will still have the md5-sum of the original file, and thus can detect whether it's a duplicate.
There's also exif-meta and exif-rename on http://seegras.discordia.ch/Programs/ which help with setting PDF metadata and with renaming PDF-files according to metadata; and if you're tagging all the files correctly, you will end up with duplicate filenames, indicating that they might be the same document within a different file.
If the files were created by different tools, they could look the same but generate very different results because they are structured totally differently. I made some suggestions in a blog article at https://blog.idrsolutions.com/2010/09/comparing-2-pdf-files/
DiffPDF looks like something that might help you.
I remember that there is a UNIX utility called pdftotext (see the package poppler-utils). You can try to extract the text from the files and make a textual diff.
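Once the text has been extracted (by whatever tool, which is outside this sketch), a rough "degree of difference" can be computed with Python's standard `difflib`. The function name `text_similarity` and the whitespace normalisation are assumptions made for this example; extracted text from different PDF producers often differs only in line breaks and spacing, which is why whitespace is collapsed first.

```python
import difflib


def text_similarity(text_a, text_b):
    """Return a ratio between 0.0 (nothing in common) and 1.0 (identical)."""
    # Collapse all runs of whitespace so layout differences do not count.
    normalised_a = " ".join(text_a.split())
    normalised_b = " ".join(text_b.split())
    return difflib.SequenceMatcher(None, normalised_a, normalised_b).ratio()
```

Pairs scoring close to 1.0 are candidates for being the same document stored in different files. Comparing every pair is quadratic in the number of files, so for thousands of PDFs it is probably worth hashing for exact duplicates first.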

Extract Tabular Data from a PDF and sort it

I have a PDF file which has the mark list of a certain exam.
I am particularly interested in the first list, which unfortunately has 2112 entries, and they aren't properly formatted. I need to sort all these entries (based on the marks in the last 2 columns: the sum of the marks in Aptitude and Computer) to know what my rank is.
I tried to copy it into MS Word and Excel, but if you try it, you can see it won't help. After pasting it into a plain text file, I tried to format it using regular expressions (in Notepad++) and wrote some code in C to properly separate each field by '\t' (so that later I could properly copy them into an Excel sheet), but the inconsistency made me fail (some entries span multiple lines, and the "names" do not have a fixed number of fields).
Can someone come up with any idea that will make it possible to copy the first list in PDF to a spreadsheet in tabular form exactly as the original file?
1. For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
   Why Updating Dollars for Docs Was So Difficult
2. For an amazing open source family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages) -- contradicting point '1.' above! -- see these links:
   Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
   Tabula-Extractor: A Command Line Interface to Tabula
   Tabula source code repository
   Tabula API (upcoming, not ready yet)
Well I sort of managed it. I first copied it to a plain text file, deleted all letters from it leaving only the serial number and corresponding marks, separated by spaces or tabs. Then using "import" in an OpenOffice Spreadsheet, told it the delimiters are spaces and tabs (combine them if necessary) and bingo! I got my rank.
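The same workaround can be done without a spreadsheet. A minimal Python sketch, assuming the cleaned-up text has one entry per line with the serial number first and the two marks last, separated by spaces or tabs; the function name `rank_entries` and the sample values in the usage note are made up for illustration.

```python
def rank_entries(text):
    """Return (serial, total) pairs sorted by the sum of the last two columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) < 3:
            continue           # skip blank or incomplete lines
        try:
            total = float(fields[-2]) + float(fields[-1])
        except ValueError:
            continue           # skip lines whose last two fields are not numbers
        rows.append((fields[0], total))
    # Highest total first; entries with equal totals keep their original order.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

For example, `rank_entries("101 40 35\n102\t50 45")` puts serial 102 first with a total of 95.0. The rank of an entry is its position in the returned list plus one.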
But I would still like to know if it's possible to copy the whole table as it is. So keeping this question open.
I once was tasked with building a parser which would extract data from a PDF with tabular and non-tabular data in a number of different encodings and with a mix of RTL and LTR text. That project took quite some effort, but with a simple English table you should be able to dissect the PDF in no time. Look for the PDF specs on adobe.com and, if it is that desperate, start digging in.
Also you'll first need to use pdftk.exe to uncompress the file.
A shortcut that may be of aid:
http://www.adobe.com/devnet/pdf/pdf_reference.html
This is the shortcut I meant: http://www.codeproject.com/KB/cs/PDFToText.aspx

Does the string <!-FTCACHE-1-> in a PDF file mean anything?

My program downloads a PDF file from a source location every day. When I see the binary text of the PDF file in Notepad, I find that sometimes the PDF file has the string <!-FTCACHE-1-> at the end. Sometimes this word is missing from the PDF file.
My program downloads this PDF daily and compares it with the previous day's PDF file using the Windiff binary comparison.
99% of the time, Windiff reports differences in the PDF file just because one PDF contains the string <!-FTCACHE-1-> at the end.
Does anyone know what the reason behind this is?
Thanks,
Praveen
<!--FTCACHE-1--> is generated by FatWire Content Server, a web content management solution that is probably generating your URL. FTCACHE means FutureTenseCache, the name of the original product component. The text is a "footer" flag that indicates to the caching module whether or not the page was properly generated. If the page is supposed to be cached, a 1 indicates that the page was properly built, and so is cacheable. If 0 is returned, it indicates that the page was corrupted and should not be cached. The Satellite Server caching engine is supposed to strip this footer once it reads it.
In other words, the key that is there to ensure that the cache is not corrupted, is causing the corruption in your PDF.
This issue has been fixed in patches to FatWire ContentServer for quite some time now.
For your purposes, just ignore the string - strip it if you can.
Sorry about that. That was my bug. :-)
The application that generates the PDF file has a bug: the FTCACHE tag should not be there, as it is not a valid PDF construct. Its presence actually damages the PDF file and invalidates the FastWebView feature in the PDF file, as you have seen. It is safe to remove it before comparing the files.
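Removing the footer before the comparison could look like the following Python sketch. The function names are invented for this example, and two assumptions are built in: the pattern accepts both spellings seen in this thread (`<!-FTCACHE-1->` and `<!--FTCACHE-1-->`), and trailing whitespace after the end of the file is ignored so that a leftover newline does not count as a difference.

```python
import re

# Matches the footer only when it is the last thing in the file.
FTCACHE_FOOTER = re.compile(rb"<!--?FTCACHE-\d+--?>\s*\Z")


def strip_ftcache(data):
    """Return the PDF bytes without a trailing FTCACHE footer."""
    return FTCACHE_FOOTER.sub(b"", data).rstrip()


def same_pdf_ignoring_footer(data_a, data_b):
    """Compare two downloads while ignoring the stray caching footer."""
    return strip_ftcache(data_a) == strip_ftcache(data_b)
```

Two downloads that differ only by the footer then compare as equal, while any difference in the actual PDF content is still reported.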
"FT" could be FreeType, the open source font engine. The comment probably comes from the software that generates the PDF. If you can somehow identify that, you could (assuming it is open source) perhaps take a look through it and see what causes it to emit the comment.
FreeType has a source folder dedicated to caching, the root source file there is called ftcache.c. It doesn't do a lot though, just #includes (!) the other source files.
Googling the string you see reveals several more or less random PDFs that seem to contain it.