In "How can I remove all images from a PDF?", Kurt Pfeifle gave a piece of PostScript code (courtesy of Chris Liddell) to filter out all bitmaps from a PDF, using Ghostscript.
This works like a charm; however, I'm also interested in the companion task: removing everything except bitmaps from the PDF, without recompressing the bitmaps. Or, better yet, separating the vector and bitmap "layers." (I know, this is not what a layer is in PDF terminology.)
AFAIU, Kurt's filter works by sending all bitmaps to a null device, while leaving everything else to pdfwrite. I read that it is possible to use different devices with GS, so my hope is that it is possible to send everything to a fake/null device by default, and only switch to pdfwrite for those images which are captured by the filter. But unfortunately I'm completely unable to translate such a thing into PostScript code.
Can anyone help, or at least tell me if this approach might be doomed to fail?
It's possible, but it's a large amount of work.
You can't start with the nulldevice and push the pdfwrite device as needed; that simply won't work, because the pdfwrite device will write out the accumulated PDF file as soon as you unload it. Reloading it will start a new PDF file.
Also, you need the same instance of the pdfwrite device for all the code, so you can't load the pdfwrite device, load the nulldevice, then load the pdfwrite device again just for the bits you want. Which means the only approach that (currently) works is the one Chris wrote: load pdfwrite, and push the null device into place whenever you want to silently consume an operation.
Filtering out just images is quite a limited change, because there aren't that many operators that deal with images.
To remove everything except images, however, you need to override a lot of operators: stroke, fill, eofill, rectstroke, rectfill, ustroke, ufill, ueofill, shfill, show, ashow, widthshow, awidthshow, xshow, xyshow, yshow, glyphshow, cshow and kshow. I may have missed a few, but those are the basics at least; the pattern for each override is sketched below.
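For illustration, each override looks roughly like this (an untested sketch, not Chris's actual code; the real thing has to be far more careful about the graphics state and the PDF interpreter's internals):

    % Sketch: divert one operator to the null device so pdfwrite never
    % sees its marks. The same pattern repeats for each operator above.
    /real_stroke /stroke load def
    /stroke {
      gsave              % the output device is part of the graphics state
      nulldevice         % install the null device: marks go nowhere
      real_stroke        % run the real operator against the current path
      grestore           % back to the pdfwrite device
      newpath            % stroke would normally consume the current path
    } bind def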
Note that the code Chris originally posted did actually filter various types of objects, not just images; you can find his code here:
http://www.ghostscript.com/~chrisl/filter-obs.ps
Please be aware this is unsupported example code only.
I am trying to produce high-quality 8bpp BMPs from a PDF file with Ghostscript. For that purpose, I use the bmp256 device.
So far, everything works well and is really fast, but Ghostscript uses halftoning to dither the image, leading to some ugly patterns when zooming in on the picture.
I've managed to reduce their size by playing with the -dDITHERPPI flag, but this is still not satisfying: the patterns are too regular and too easily spotted, even at low zoom levels.
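For reference, this is roughly the command I'm running (file names and numbers are placeholders):

    gs -sDEVICE=bmp256 -r300 -dDITHERPPI=150 -o out.bmp input.pdf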
Instead of halftoning, I would like to use an error-diffusion algorithm such as Floyd–Steinberg. I found that such algorithms are implemented in other devices, but those are all printer-related devices, so I can't really use them.
Plus, I need the PDF-to-8bpp-BMP conversion to be as fast as possible, and the output pictures are very large, so converting to 24 or 32bpp BMP first and dithering later with another tool is not an option.
I already downloaded the source to try to implement it myself, but the project is really big and complex, and I don't know where to start.
Is there any way to use an error-diffusion algorithm with Ghostscript without having to implement it myself?
If not, is there a preferred way of extending Ghostscript? Any guidelines?
I am trying to load image files like JPEGs into VisualWorks as part of an application. This seems to take very long and sometimes even crashes VW. The image is roughly 3.5 MB and is a simple JPEG picture. This is what causes the problem:
ImageReader fromFile:'pic.jpg'.
This operation takes about 5-10 seconds to complete. It happens in 32- and 64-bit projects alike.
Any ideas or suggestions as to how I can solve this problem? The same thing in Pharo seems to work okay.
Thanks!
ImageReader will automatically choose the correct subclass, like JPEGImageReader. Picking the subclass is not the slow part; decoding the JPG data is.
A JPEG file, unlike PNG, doesn't use zip compression but instead uses discrete cosine transforms (see https://en.wikipedia.org/wiki/JPG#JPEG_compression). This compression requires a lot of number crunching, which is slower in VisualWorks than it would be in C. The PNG reader, on the other hand, uses zlib to have the number-crunching part done in C, which is why it is so much faster.
You could use Cairo or GDI or whatever other C-API you have at hand to speed this up.
Try calling the JPEGImageReader directly:
JPEGImageReader fromFile:'pic.jpg'
If that's fast, then the slowdown is in finding the proper image reader to use for the file. What ImageReaders do you have installed and how do they implement the class method canRead:?
If the JPEGImageReader is still slow, then we can investigate from there.
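To put numbers on it, you could time both paths (this assumes Time class>>millisecondsToRun:, which VisualWorks provides):

    "Time the generic dispatch vs. the direct reader (results in ms)."
    Transcript showCR: (Time millisecondsToRun:
        [ImageReader fromFile: 'pic.jpg']) printString.
    Transcript showCR: (Time millisecondsToRun:
        [JPEGImageReader fromFile: 'pic.jpg']) printString.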
I work for a company designing t-shirts. We get transfers printed by another company. The transfers we received for a recent design were too small; I'm guessing they were printed portrait instead of landscape. The representative from the company we order prints from is claiming that it isn't their fault and that...
"Sometimes the images we receive are not the size you send. This is due to the different formats of the files and the way our computers convert them."
He says it's our fault because we didn't send the design dimensions to him along with the design. The file sent was a PDF. Am I correct in understanding that a PDF will always open at its intended size? I thought the size was embedded within the PDF. I'm fairly certain I'm correct, but I don't want to call him out on it without being sure. Do computers really convert PDFs so that they're a completely different size? That would be a terrible way for PDFs to operate, if that's the case.
The size is definitely embedded in the PDF, and the purpose of the PDF format is to serve as the final document format before printing. My advice is to send a note with the desired dimensions; perhaps it's their machine that resizes the picture.
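Concretely, every page in a PDF carries a MediaBox entry giving its size in points (1 point = 1/72 inch); a US Letter page, for instance, contains something like:

    /MediaBox [0 0 612 792]   % 8.5 x 11 inches, expressed in points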
This is by no means a showstopper problem, just something I've been curious about for some time.
There is this well-known -[UIImage resizableImageWithCapInsets:] API for creating resizable images, which comes in really handy when texturing variable-size buttons and frames, especially on the retina iPad and especially if you have lots of those and you want to avoid bloating the app bundle with image resources.
The cap insets are typically constant for a given image, no matter what size we want to stretch it to. We could also put it this way: the cap insets are characteristic of a given image. So here is the thing: if they logically belong to the image, why don't we store them together with the image (as some kind of metadata), instead of having to specify them everywhere we create a new instance?
In daily practice, this could have serious benefits, mainly by eliminating the possibility of human error in the process. If the designer who creates the images could embed the appropriate cap values in the image file itself upon exporting, then developers would no longer have to write magic numbers in the code and keep them updated each time the image changes. The resizableImage API could read and apply the caps automatically. Heck, even a category on UIImage would do.
Thus my question is: is there any reliable way of embedding metadata in images?
I'd like to emphasize these two words:
reliable: I have already seen some entries on the optional PNG chunks, but I'm afraid those are wiped out of existence once the iOS PNG optimizer kicks in. Or is there a way to prevent that (while still letting the optimizer do its job)?
embedding: I have thought of including the metadata in the filename similarly to what Apple does, i.e. "#2x", "~ipad" etc. but having kilometer-long names like "image-20.0-20.0-40.0-20.0#2x.png" just doesn't seem to be the right way.
Can anyone come up with a smart solution to this?
Android has a file type called nine-patch that is basically the image plus the metadata needed to construct it. Perhaps a class could be made to replicate it: http://developer.android.com/reference/android/graphics/NinePatch.html
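On iOS, a sidecar file next to the PNG can approximate this without fighting the PNG optimizer (it's not true embedding, but it survives optimization). A minimal sketch of such a helper follows; the resizableNamed method and the .insets.json naming convention are invented for illustration, while resizableImage(withCapInsets:) is the real API:

    import UIKit

    extension UIImage {
        // Sketch only: loads "name" from the bundle plus an optional
        // sidecar "name.insets.json" containing [top, left, bottom, right].
        // The helper name and sidecar convention are hypothetical.
        static func resizableNamed(_ name: String) -> UIImage? {
            guard let image = UIImage(named: name) else { return nil }
            guard let url = Bundle.main.url(forResource: name + ".insets",
                                            withExtension: "json"),
                  let data = try? Data(contentsOf: url),
                  let v = try? JSONDecoder().decode([CGFloat].self, from: data),
                  v.count == 4
            else { return image }   // no sidecar: return the plain image
            let caps = UIEdgeInsets(top: v[0], left: v[1],
                                    bottom: v[2], right: v[3])
            return image.resizableImage(withCapInsets: caps)
        }
    }

    // Usage: let background = UIImage.resizableNamed("button-bg")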
We are writing a Web-based events calendar with thousands of theatre shows, festivals, and concerts in a SQL database.
The user goes to the Website, performs a search, the SQL server returns JSON, and jQuery code on the client side displays the 200+ events.
Our problem is that each event has an image. If I return URLs, the browser has to make 200+ HTTP GET requests for these tiny (2-4 KB) images.
It is tempting to pack all the images into one CSS sprite, but since the user searches can return any combination of images, this would have to be done dynamically and we'd lose control over caching. Every search would have to return every image in a brand new CSS sprite. Also, we want to avoid any solution that requires dynamically reading or writing from a filesystem: that would be way too slow.
I'd rather return binary images as part of the JSON, and keep track of which images we already have cached and which we really need. But I can't figure out how. It seems to require converting to Base64? Is there any example code on this?
Thanks for any help you can give! :)
The web site you are using (Stack Overflow) has to provide 50 avatars for the 50 questions shown on the questions page. How does it do it? The browser makes 50 requests.
I would say you had better implement pagination so that the list does not get too big. You can also load images in the background, or lazily as the user scrolls down to them.
Keeping track of downloaded images is not our job; it is the browser's. Our job is to make sure the URL is unique and consistent and that the response headers allow caching.
Also, base64 will make the stream at least 33% longer, and its processing on the client side is not trivial; I have never seen an implementation, but probably someone has written some JavaScript for it.
I believe all you need is just pagination.
It looks like the original poster has proceeded with essentially this solution on their own; however, based on their comment about the 33% space overhead, I don't think they observed an unexpected phenomenon you get when base64-ing a bunch of images into one response entity and then gzipping it: believe it or not, this can easily produce an overall transfer size smaller than the total of the unencoded image files, even if each unencoded image was also gzipped separately.
I've covered this JSON technique for accelerated image presentation in depth here as an alternative to CSS sprites, complete with a couple of live samples. It aims to show that it is often the superior technique.
Data is never random. For example, you could name your sprites
date_genre_location.jpg
or however you organise your searches. This might be worth it. Maybe.
You'll need to do the math.
Here is what we decided.
Creating sprites has some downsides:
To reduce image loss, we need to store the originals as PNGs instead of JPEGs on the server. This is going to add to the slowness, and there's already considerable slowness in creating dynamic CSS sprites with ImageMagick.
To reduce image size, the final CSS sprite must be a JPEG, but JPEGs are lossy, and with a grid of images things get weird at the edges as the compressor tries to blend and compress. We can fix that by putting a blank border around all the images, but even so, this adds complexity.
With CSS sprites, it becomes harder to do intelligent caching; we have to pack up all the images regardless of what we've sent in the past. (Or we have to keep all the old CSS sprites, regardless of whether they contain many or few images we still need.)
I'm afraid we have too many combinations of date, category, and location to precompute CSS sprites, although surely we could handle part of the data this way.
Thus our solution is to Base64-encode the images and send each one inline in the JSON. Despite the 33% overhead, it is far less complex to code and manage, and when you take caching issues into account, it may even mean less data transfer.
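Roughly, the client side looks like this (field names are illustrative; the server packs each image into the JSON as a base64 string):

    // Server returns e.g. {"events":[{"title":"...","img":"<base64 JPEG>"},...]}
    $.getJSON('/search', { q: 'jazz' }, function (data) {
        $.each(data.events, function (i, ev) {
            // A data: URI renders the base64 payload directly,
            // with no extra HTTP request per image.
            $('<img>').attr('src', 'data:image/jpeg;base64,' + ev.img)
                      .appendTo('#results');
        });
    });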
Thank you all for your advice on this!