How do I shave a few KB off a PDF?

I have a scanned greyscale PDF of a set of official school transcripts that has been compressed to 1MB. Actually, it's 1023655 bytes. I am trying to upload the document to an online application that has a maximum file size of 1MB.
My attempts to compress the PDF further via the same website have not worked.
I have tried using Neevia, but any further compression makes the lightest of the three pages completely white (the first two pages are black print on a blue background, and the third is light grey print on a white background).
I've tried using Preview on the Mac to save it as black and white (unreadable) and to resize it (blurry).
I have GIMP at my disposal, but otherwise I don't have any experience with photo or document manipulation. How do I shave those kilobytes off this PDF?

You could try looking at the bit depth of the greyscale. For example, if it's currently 16-bit greyscale (2^16, or 65536 shades of grey), you could try using 8-bit greyscale (256 shades) or 4-bit (16 shades). You've already tried one form of this, going to 1-bit (2 shades, i.e. black and white), but unless you first adjust the contrast to make the text really stand out, you'll often end up with illegible files.
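To make the bit-depth arithmetic concrete, here is a minimal sketch of how the raw (uncompressed) size of one page falls as the depth drops. The page dimensions are a made-up example, not taken from the transcript file, and real PDFs compress the image data, so the actual saving will differ.

```python
def raw_image_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size of one greyscale page, with each row padded to whole bytes."""
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


def requantize(value, from_bits=8, to_bits=4):
    """Map a grey level to a lower bit depth by dropping low-order bits.

    Only valid when from_bits >= to_bits.
    """
    return value >> (from_bits - to_bits)


# Hypothetical scan: US Letter at 150 dpi (1275 x 1650 pixels).
WIDTH, HEIGHT = 1275, 1650
sizes = {bits: raw_image_bytes(WIDTH, HEIGHT, bits) for bits in (16, 8, 4, 1)}
for bits, size in sizes.items():
    print(f"{bits:>2}-bit: {2 ** bits:>5} shades, {size:>9,} bytes before compression")
```

Halving the bit depth roughly halves the raw data, which is why dropping from 8-bit to 4-bit can be worth trying before going all the way to 1-bit.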

If you download and install CutePDF, you can open the PDF file and go to print it, select the CutePDF printer, and you will be prompted to save a new PDF file. Chances are this new PDF file will be much smaller.

Related

How can I render a PDF into BMP fitting content to PDF page boundaries?

I am generating a BMP from a PDF with Ghostscript, but the content is not fitted to the page boundaries. Whatever options I try, I cannot get the content to fit.
I've tried generating the BMP with different Ghostscript options, but none of them fits the content exactly.
This is the last command I tried. Please don't expect it to do what I need; it is just copied and pasted from the tty.
gs -dBATCH -dNOPAUSE -sPAPERSIZE=a4 -dFIXEDMEDIA -dPSFitPage -sDEVICE=bmpmono -sOutputFile=Betlem.bmp -g1184x968 -c "<</PageSize [900 500]>> setpagedevice 0 0 translate" -c "<</PageOffset [-23 -100]>> setpagedevice" -f Betlem.pdf
I expect the content to be fitted to the BMP image borders, down to the pixel. I am using an OpenCV and Python function to extract the content and fit it into a new image, and this is the debug output:
initial BMP image resolution = (872, 900)
BMP image resolution after fit content into new page = (541, 870)
Have a look at the following thread for the fitting function in Python:
I can't find a way to fit contour on new image zero point

You are using PSFitPage for a PDF file; you should be using PDFFitPage or just FitPage.
Note that the 'fitting' in this case is fitting the PDF media size to the existing media. If the PDF content leaves white space around the edge of the media, then the resulting scaling will include that.
In addition you are using PostScript to offset the page origin, which will introduce white space, and you are trying to change the media size, which won't work because you've set -dFIXEDMEDIA. Using these in combination with any of the FitPage switches is not likely to work well.
Randomly stabbing at controls and copying bits of code intended to solve different problems isn't likely to help you I'm afraid.
Without seeing an example file I can't, of course, tell you how to solve your problem, and I'm not really sure exactly what you are trying to achieve. A bitmap with no white space? A bitmap of a given size with no white space? Something else?
[Edit]
OK so looking at the PDF file, the media box is 11.69x8.27 inches, there is white space at the top, bottom, left and right between the marks on the page and the edge of the media.
Running this through Ghostscript, to TIFF at 72 dpi results in a file which Adobe Photoshop says is 11.694x8.264 inches and has white space at top bottom left and right, just like the PDF file.
By default Ghostscript uses the Media size from the PDF to render to, however you can change this. If you were to change the media size to (say) 5.8x4.14 inches, set -dFIXEDMEDIA and then rendered the PDF file what would happen is that the top and right hand side of the PDF file would be 'off the page' so you would only get the left hand portion rendered. Try this:
gs -dDEVICEWIDTHPOINTS=421 -dDEVICEHEIGHTPOINTS=298 -dFIXEDMEDIA "A betlem m en vull anar(1).pdf"
You will see the white space is still present at bottom and left, and the top and right have fallen off the page.
Now, if you add FitPage that will scale the original media down until it fits the new media size (and all the content too, of course). If you try:
gs -dDEVICEWIDTHPOINTS=421 -dDEVICEHEIGHTPOINTS=298 -dFIXEDMEDIA -dFitPage "A betlem m en vull anar(1).pdf"
You'll see that the output has the same physical dimensions as the previous command, but now the whole of the PDF content can be seen because it's been scaled down. You should also see that the distribution of white space has changed, because I didn't strictly divide by 2 in each direction. The FitPage switch scaled the content in both directions by the same amount, and distributed the extra space in the x direction evenly to each side, as new white space.
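The scale-to-fit behaviour described above can be sketched as follows. This is only an illustration of uniform scaling with centring, not Ghostscript's actual code, and the sizes used are hypothetical rather than taken from the file under discussion.

```python
def fit_page(src_w, src_h, dst_w, dst_h):
    """Return (scale, x_offset, y_offset) for a uniform scale-to-fit, centred.

    The same scale is used in both directions, so one direction usually has
    space left over; that space is split evenly between the two sides.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    x_offset = (dst_w - src_w * scale) / 2
    y_offset = (dst_h - src_h * scale) / 2
    return scale, x_offset, y_offset


# Hypothetical media sizes in points: an 800x600 page fitted onto 500x300 media.
scale, dx, dy = fit_page(800, 600, 500, 300)
print(f"scale={scale}, new white space per side: x={dx}, y={dy}")
```

Any white space already present inside the original media is scaled along with the marks, so it does not disappear.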
Now I've no clue what you mean by 'simmetric'. You can undoubtedly do what you want using Ghostscript and the PostScript language, but I don't know what it is you want. Pointing me at Python code isn't going to help I'm afraid, I don't speak Python.
I can say that Ghostscript does not add extra white space that isn't present in the original unless you mess with the rendering by adding parameters like FitPage and FIXEDMEDIA.
If you can explain what you are trying to achieve I can probably tell you what to do.

How to rasterize "big" PDF files without losing thin lines?

I'm trying (in a script on a Linux server) to shrink and rasterize several thousand PDF files that come from various CAD/CAM packages and represent "big" drawings (as in, 800x600mm or the like) with lots of thin lines (as in, similar to a 0.2mm pen).
The rasterized files should have visible lines when printed on A5 or similar paper, so I have to kind of "shrink" the original drawing while preserving line thickness. As an example, when I open one of those PDF files on Mac OSX Preview, it does exactly that: when I zoom in and out it adjusts line thickness so they always look the same on screen.
I tried doing that with ImageMagick, using lots of -density, -resize and various other settings, without great success: the thin lines just get scaled down like everything else and end up too thin to be discernible (or disappear completely, in some cases) when printed at a small size. I've also read through its documentation without any success. Of course I'm also open to using other tools, as long as I can script them.
How could I "preserve line thickness" when rasterizing a vector PDF file in a script, just like Apple's Preview does when viewing the same file on screen?
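A quick calculation shows why the lines vanish when everything is scaled uniformly. This is a sketch under assumed numbers: A5 landscape (210mm wide) as the target and a 300 dpi raster, neither of which is stated in the question.

```python
MM_PER_INCH = 25.4


def scaled_line_px(line_mm, drawing_width_mm, paper_width_mm, dpi):
    """Width in device pixels of a line after the whole drawing is scaled to the paper width."""
    scale = paper_width_mm / drawing_width_mm
    printed_mm = line_mm * scale
    return printed_mm / MM_PER_INCH * dpi


# 0.2mm pen on an 800mm-wide drawing, shrunk to 210mm and rasterized at 300 dpi (assumed values).
px = scaled_line_px(0.2, 800, 210, 300)
print(f"line width after scaling: {px:.2f} px")
```

Under these assumptions the line ends up narrower than one pixel, so a rasterizer that scales strokes with the rest of the page will render it faintly or drop it.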

How do I export an A5 doc to an A4 pdf without rescaling?

I have an A5 sized doc file that needs to be printed, yet the press needs them on A4 sized pages, centered, unscaled. When trying to export it from Office Word, you can adjust paper size, but only the left and top margins are kept and the content is spread in width to fill the paper (text size remains unchanged). I've tried PDF Architect / PDF Creator, but when it's about printing on A4 sized pages, the result is messed up fonts, messed up line wrapping and worse quality images.
Are there any tools that can preserve size, scale (in this case, centered and no scale), font, line wrapping and image quality as well or is it too much to ask from free tools? Proprietary tools are no option at the moment.

MS Word has poor options for exporting to PDF.
I see a few ways to resolve the issue:
Change the paper size in Word, then manually recalculate the margins so that the original page sits in the centre of the larger one.
The best solution I can see, though, is to find the right options on the printing device itself (something like "don't resize original pages, centre them; output page size A4").
Try emulating the print with http://www.dopdf.com/ (or similar software). I'm pretty sure it's possible to "print to PDF" with your requirements, giving you a PDF that you can then print on the real device.
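For the first option, the margin recalculation is simple arithmetic. A minimal sketch, using the standard ISO sizes for A4 and A5 in portrait orientation:

```python
A4_MM = (210.0, 297.0)  # ISO 216 width, height
A5_MM = (148.0, 210.0)


def centring_margins(inner, outer):
    """Extra margin per side (horizontal, vertical) needed to centre inner on outer."""
    return ((outer[0] - inner[0]) / 2, (outer[1] - inner[1]) / 2)


extra_x, extra_y = centring_margins(A5_MM, A4_MM)
print(f"add {extra_x} mm to the left and right margins")
print(f"add {extra_y} mm to the top and bottom margins")
```

Adding these amounts to the existing A5 margins keeps the text block the same size, so line wrapping should not change.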

Eps file inside postscript file using ghostscript

I am trying to produce a production-ready PDF.
I have an EPS file uploaded by an admin and a PostScript file which I generate dynamically. I include the EPS within the PostScript file
using the script below:
%%BeginDocument: danske.eps
(".$bgeps_path.") run
%%EndDocument
Now my problem is that there should be 10mm of space around the image.
I managed to add the 10mm space to the PDF via translate.
But when it goes to print, the printer cuts two edges: one along the 10mm space and the other along the image edge.
What I want is for the cut to happen only along the edge with the 10mm space.
I tried to achieve this by playing with BoundingBox, but that doesn't help.

BoundingBox is a comment, nothing more, and as such is usually ignored. If you want to place an EPS then you need to follow the rules for EPS inclusion. You need to set up the Current Transformation Matrix to correctly scale and position the EPS on the canvas at a minimum.
Tech Note 5002, the EPSF specification v3.0, has guidelines for importing EPS files on page 13. You really should read this, particularly the co-ordinate system transformation on page 16. The tech note is available here:
http://partners.adobe.com/public/developer/en/ps/5002.EPSF_Spec.pdf
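As a rough illustration of setting up the transformation before running the EPS, the sketch below generates the PostScript from Python. The function name, the bounding box values and the idea of sizing the page as image plus margin on every side are assumptions for the example; the tech note also describes saving state and disabling showpage, which is left out here.

```python
MM_TO_PT = 72 / 25.4  # PostScript points per millimetre


def eps_placement(bbox, margin_mm=10.0, scale=1.0):
    """Build PostScript that places the EPS lower-left corner margin_mm in from the origin.

    bbox is (llx, lly, urx, ury) as read from the EPS %%BoundingBox comment.
    Returns the PostScript snippet and the page size (in points) that leaves
    the same margin on all four sides.
    """
    llx, lly, urx, ury = bbox
    margin_pt = margin_mm * MM_TO_PT
    snippet = (
        "gsave\n"
        f"{margin_pt:.3f} {margin_pt:.3f} translate\n"  # move to the target position
        f"{scale:g} {scale:g} scale\n"                  # scale the EPS if required
        f"{-llx:g} {-lly:g} translate\n"                # bring the bbox corner to 0,0
        "% run the EPS file here\n"
        "grestore"
    )
    page_w = (urx - llx) * scale + 2 * margin_pt
    page_h = (ury - lly) * scale + 2 * margin_pt
    return snippet, (page_w, page_h)


# Hypothetical bounding box; a real one comes from the uploaded EPS file.
snippet, page_size = eps_placement((20, 30, 220, 330))
print(snippet)
print("page size in points:", page_size)
```

The order matters: because the bounding-box origin is moved to 0,0 last in the source, it is applied first to the EPS co-ordinates, so the margin is measured from the visible edge of the artwork rather than from the EPS file's own origin.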

PDF compression: how does Adobe do it?

This is a bit more of a fun question than a serious one, but how does the Adobe PDF format make documents so... portable?
I just created a small Word document, 235kb in size, containing multiple color photos and a few textual phrases. A PDF created using CutePDF (which I understand isn't the most efficient method of PDF creation) is only 176kb. That's a 25% compression ratio. When those files are placed into a compressed folder, the PDF is capable of 3% compression where the .docx can only take 2%. I'm sure that larger files would have even greater differences in size.
My question is, how does Adobe manage to make their files so much smaller? I understand that they are drawn from raster graphics, but my 3 bitmap files really can't be helped from raster that much, can they?

If you have Acrobat 9 there is a nice tool built-in so you can see how the PDF was put together (and compressions used). There is a blog post explaining how to use it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects

There are a few ways it can be compressing this:
PDF files can use LZW and zip (Flate) compression.
If the image is scaled in the document, or has a higher dpi on disk than you allow for in CutePDF (for example, if CutePDF is set for 300 dpi and the image is 600 dpi), it can be scaled down in the PDF.
Microsoft stores TONS of info in the docx format, in xml. WAY more than is really needed to just export the info (for an example, try copying and pasting your text into a textbox cell, and look at the html info that comes out - I had a limit on a textbox size for a cms, and a 7 word sentence ballooned to 950 characters). This is so it can be later edited, and with a lot of esoteric info to make sure everything displays right in every possible permutation. The pdf doesn't need that info, and so it can just do the font and size, and strip out all the unnecessary info, saving a ton of space.

When you use such small files any overhead in the document format will have a disproportionate effect which is why you are seeing such large % differences.
I took a 2683KB JPEG and inserted it into a new Word 2003 document. The resulting .doc file was 2725KB (or 2697KB as docx). Turning this into a PDF gives me a 2701KB PDF. So I am seeing a difference of 25KB, but only about 1% difference because of the size of the image data. It is about half what you got, but maybe the version of Word you have is more verbose when making docx?
For the PDF, Acrobat shows space usage as 2691K image, 8.27K overhead and 1K fonts. PDF is quite a sparse format in its syntax, which limits overhead, and much of it consists of repeating strings, so it is easily compressible.
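The point about repeating strings can be checked with a small experiment. The sketch below uses Python's zlib (the same Deflate family as zip) on a made-up, PDF-like dictionary repeated many times, and on random bytes standing in for already-compressed image data; neither sample comes from a real file.

```python
import os
import zlib

# Made-up PDF-style page dictionary, repeated to mimic repetitive syntax.
page_dict = b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] /Contents 4 0 R >>\n"
pdf_like = page_dict * 200

# Random bytes as a stand-in for JPEG data, which is already compressed.
image_like = os.urandom(len(pdf_like))

syntax_packed = zlib.compress(pdf_like)
image_packed = zlib.compress(image_like)

print(f"syntax: {len(pdf_like)} -> {len(syntax_packed)} bytes")
print(f"image stand-in: {len(image_like)} -> {len(image_packed)} bytes")
```

The repetitive syntax shrinks to a small fraction of its size, while the image stand-in barely changes, which matches the observation that the image data dominates the file size.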
If you want to see what the PDF contains in a tree-like view you can download the demo version of CosEdit.
