The main topic I am interested in is whether it is possible to store and extract well-structured scale information from a PDF document.
[ For example, most engineering or architectural drawings printed to a PDF would be scaled down, say 1/8 in. : 1 ft. So if the drawing were actually printed to paper, you could measure a part of it in inches and then infer the actual real-world size in feet. ]
Is there any way to programmatically look for that scale information in the PDF format? e.g., from the example above, to extract the 1/8" : 1' ratio from the PDF.
I thought this was not even possible until I came across this statement in the Adobe document Grids, guides, and measurements in PDFs:
Use Scales And Units From Document (When Present)
When enabled, measurements based on the units generated from the
original document, if present, are used. Deselect this option to
specify the units of measurements manually.
(Alternative document with same text, p. 92)
However, I can find no other references that explain how this feature works. I checked the PDF specification (maybe too old a version?) and it did not mention anything. But it's hard to know what to search for, and I'm not very familiar with PDF internals, so I may simply have missed it.
A related detail is at what level of the PDF this information would be stored (if it exists at all) - I would guess per page?
To be clear, I am absolutely not looking to scrape text from the rendered PDF. Instead, I want to find out if there is any metadata that would encode this information. The PDFs I have to deal with would have widely varying origins & contents.
If you look at the actual PDF specification (i.e. ISO 32000, part 1 or 2), you'll find a section on measurement properties.
E.g. in ISO 32000-1:
12.9 Measurement Properties
PDF documents, such as those created by CAD software, may contain graphics that are intended to represent real-world objects. Users of such documents often require information about the scale and units of measurement of the corresponding real-world objects and their relationship to units in PDF user space.
...
Beginning with PDF 1.6, such information may be stored in a measure dictionary (see Table 261). Measure dictionaries provide information about measurement units associated with a rectangular area of the document known as a viewport.
A viewport (PDF 1.6) is a rectangular region of a page. The optional VP entry in a page dictionary (see Table 30) shall specify an array of viewport dictionaries, whose entries shall be as shown in Table 260. Viewports allow different measurement scales (specified by the Measure entry) to be used in different areas of a page, if necessary.
etc. etc.
Thus, yes, you can programmatically look for that scale information in the PDF format. Beware, though: these properties are optional, so you'll find them only if the PDF producer was nice enough to provide them.
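To make that concrete, here is a minimal sketch of how you might look for such viewports programmatically. It assumes the open-source qpdf C++ library (purely my choice for illustration; any library with low-level dictionary access would do); the /VP, /Measure and /R key names are the ones from ISO 32000-1.

    #include <iostream>
    #include <qpdf/QPDF.hh>
    #include <qpdf/QPDFObjectHandle.hh>

    // Sketch: list the measurement viewports (if any) on each page.
    // In the file itself, a viewport with a rectilinear measure dictionary
    // looks roughly like this:
    //   /VP [ << /Type /Viewport /BBox [ 0 0 612 792 ]
    //            /Measure << /Type /Measure /Subtype /RL
    //                        /R (1/8 in = 1 ft) /X [...] /Y [...] >> >> ]
    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;

        QPDF pdf;
        pdf.processFile(argv[1]);

        int page_no = 0;
        for (QPDFObjectHandle page : pdf.getAllPages())
        {
            ++page_no;
            QPDFObjectHandle vp = page.getKey("/VP");
            if (!vp.isArray()) continue;   // no viewports on this page

            for (int i = 0; i < vp.getArrayNItems(); ++i)
            {
                QPDFObjectHandle viewport = vp.getArrayItem(i);
                if (!viewport.isDictionary() || !viewport.hasKey("/Measure")) continue;

                // /R is the human-readable scale ratio; /X and /Y are
                // number format arrays for the two axes.
                std::cout << "page " << page_no << ", viewport " << i << ": "
                          << viewport.getKey("/Measure").unparseResolved() << "\n";
            }
        }
        return 0;
    }

If nothing is printed, the producer simply didn't include any measurement information, which is the common case outside CAD-oriented tools.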
Under Appendix H of the Vulkan spec it says:
Aspect (Image):
An image may contain multiple kinds, or aspects, of data for each pixel, where each aspect is
used in a particular way by the pipeline and may be stored differently or separately from other
aspects. For example, the color components of an image format make up the color aspect of the
image, and may be used as a framebuffer color attachment. Some operations, like depth testing,
operate only on specific aspects of an image. Other operations, like image/buffer copies, only
operate on one aspect at a time.
In addition, there is the type VkImageAspectFlagBits, which (I think) lists the 7 possible aspects an image can have:
VK_IMAGE_ASPECT_COLOR_BIT
VK_IMAGE_ASPECT_DEPTH_BIT
VK_IMAGE_ASPECT_STENCIL_BIT
VK_IMAGE_ASPECT_METADATA_BIT
VK_IMAGE_ASPECT_PLANE_0_BIT
VK_IMAGE_ASPECT_PLANE_1_BIT
VK_IMAGE_ASPECT_PLANE_2_BIT
VkImageCreateInfo does not have an aspects field.
If I understand correctly, a given VkImage has a subset of these 7 aspects, right?
Is there a way to deduce which of the 7 aspects a VkImage has from its VkImageCreateInfo?
i.e. how would you write a function that returns a bit mask of which aspects an image has?
VkImageAspectFlags GetImageAspects(const VkImageCreateInfo& info) { ??? }
You can mostly derive which aspects an image has from its format:
VK_IMAGE_ASPECT_COLOR_BIT refers to all R, G, B, and/or A components available in the format.
VK_IMAGE_ASPECT_DEPTH_BIT refers to D components.
VK_IMAGE_ASPECT_STENCIL_BIT refers to S components.
VK_IMAGE_ASPECT_PLANE_n_BIT refer to the planes of multi-planar formats, i.e. ones with 2PLANE or 3PLANE in their name. These aspects are mostly used in copy commands, but can also be used to create a VkImageView of an individual plane rather than the whole image.
VK_IMAGE_ASPECT_METADATA_BIT is the odd one whose presence isn't based on format. Instead, images with sparse residency (VK_IMAGE_CREATE_SPARSE_RESIDENCY_BIT) have a metadata aspect.
If an image doesn't have the components or flag corresponding to the aspect, then it doesn't have that aspect.
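To make that concrete, here's a rough sketch of the function from the question, written against the rules above. It enumerates the depth/stencil and multi-planar formats explicitly, and the multi-planar list is abbreviated (the 10/12-bit packed variants are omitted), so treat it as an illustration rather than a complete implementation:

    #include <vulkan/vulkan.h>

    // Sketch: derive an aspect mask from VkImageCreateInfo using the rules above.
    VkImageAspectFlags GetImageAspects(const VkImageCreateInfo& info)
    {
        VkImageAspectFlags aspects = 0;

        switch (info.format)
        {
        // depth-only formats
        case VK_FORMAT_D16_UNORM:
        case VK_FORMAT_X8_D24_UNORM_PACK32:
        case VK_FORMAT_D32_SFLOAT:
            aspects = VK_IMAGE_ASPECT_DEPTH_BIT;
            break;
        // stencil-only format
        case VK_FORMAT_S8_UINT:
            aspects = VK_IMAGE_ASPECT_STENCIL_BIT;
            break;
        // combined depth/stencil formats
        case VK_FORMAT_D16_UNORM_S8_UINT:
        case VK_FORMAT_D24_UNORM_S8_UINT:
        case VK_FORMAT_D32_SFLOAT_S8_UINT:
            aspects = VK_IMAGE_ASPECT_DEPTH_BIT | VK_IMAGE_ASPECT_STENCIL_BIT;
            break;
        // "2PLANE" formats (abbreviated list)
        case VK_FORMAT_G8_B8R8_2PLANE_420_UNORM:
        case VK_FORMAT_G8_B8R8_2PLANE_422_UNORM:
        case VK_FORMAT_G16_B16R16_2PLANE_420_UNORM:
        case VK_FORMAT_G16_B16R16_2PLANE_422_UNORM:
            aspects = VK_IMAGE_ASPECT_PLANE_0_BIT | VK_IMAGE_ASPECT_PLANE_1_BIT;
            break;
        // "3PLANE" formats (abbreviated list)
        case VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM:
        case VK_FORMAT_G8_B8_R8_3PLANE_422_UNORM:
        case VK_FORMAT_G8_B8_R8_3PLANE_444_UNORM:
            aspects = VK_IMAGE_ASPECT_PLANE_0_BIT
                    | VK_IMAGE_ASPECT_PLANE_1_BIT
                    | VK_IMAGE_ASPECT_PLANE_2_BIT;
            break;
        // everything else only has R/G/B/A components
        default:
            aspects = VK_IMAGE_ASPECT_COLOR_BIT;
            break;
        }

        // Sparse-resident images additionally have a metadata aspect.
        if (info.flags & VK_IMAGE_CREATE_SPARSE_RESIDENCY_BIT)
            aspects |= VK_IMAGE_ASPECT_METADATA_BIT;

        return aspects;
    }

A production version would either cover the full VkFormat list or build the mask from a format-properties table.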
Are there any PDF tools that generate information about the loading time and memory usage needed to display a PDF in a browser, and also the total number of elements inside the PDF?
Unfortunately, not really. I've done some of this research, not for PDF in a browser but (and perhaps this is what you are looking at as well) for PDF on mobile devices.
There are a number of factors that contribute and that to some extent can be tested for:
Whether or not big images exist in the PDF and what resolution they are. This is linked directly to memory usage.
What compression method is used for image compression. Decompressing JPEG 2000 images specifically can increase load time significantly. Even worse, as JPEG 2000 can be progressively decompressed, it can give the appearance of a really bad PDF until the images have been fully decompressed and loaded (this is especially ugly on somewhat older tablets, for example).
How complex the transparency effects are that are used in the document.
How many fonts are used in the document.
How many line-art objects (vector elements) with a large number of nodes (points) are used on a page.
You can test what is in the document using Acrobat Pro to some extent (there is a well-hidden tool when you save an optimised PDF file that can audit how much of the space in a PDF document different kinds of objects use). You can also use a preflight solution such as pdfToolbox from callas (I'm affiliated with this company) or PitStop from Enfocus; these tools would allow you to get a report with the results of custom checks such as image resolution, compression, vector objects, color spaces, etc.
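If you'd rather script some of these checks yourself, here's a rough sketch of the image-related part (resolution and compression method). It assumes the open-source qpdf C++ library, purely as an illustration, and only looks at images referenced directly from a page's resources (not ones nested inside form XObjects):

    #include <iostream>
    #include <qpdf/QPDF.hh>
    #include <qpdf/QPDFObjectHandle.hh>

    // Sketch: report pixel dimensions and compression filter of every image
    // XObject referenced directly from a page's resources. Big dimensions and
    // /JPXDecode (JPEG 2000) filters are the usual red flags mentioned above.
    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;

        QPDF pdf;
        pdf.processFile(argv[1]);

        int page_no = 0;
        for (QPDFObjectHandle page : pdf.getAllPages())
        {
            ++page_no;
            QPDFObjectHandle resources = page.getKey("/Resources");
            if (!resources.isDictionary()) continue;
            QPDFObjectHandle xobjects = resources.getKey("/XObject");
            if (!xobjects.isDictionary()) continue;

            for (auto const& name : xobjects.getKeys())
            {
                QPDFObjectHandle xobj = xobjects.getKey(name);
                if (!xobj.isStream()) continue;

                QPDFObjectHandle dict = xobj.getDict();
                if (dict.getKey("/Subtype").unparse() != "/Image") continue;

                std::cout << "page " << page_no << ", image " << name << ": "
                          << dict.getKey("/Width").unparse() << " x "
                          << dict.getKey("/Height").unparse() << ", filter "
                          << dict.getKey("/Filter").unparse() << "\n";
            }
        }
        return 0;
    }

Transparency, font and line-art complexity are harder to quantify this way; that is where the preflight tools mentioned above earn their keep.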
I am working on a new in-house automated artwork workflow system. The new system delivers a stepped-up PDF ready for offset print. The artwork is black only, with barcodes and variable data text. There might be between 4 and 50 stepped-up artworks per SRA3 sheet.
However, the new system is unable to add tick marks, gripper marks and other information around the edge of the sheet that our production teams would like. Our attempts to add these are not always fit for purpose, as the off-the-shelf software used to generate the variable data was intended for thermal printers.
What would be better is if we had the tick marks saved in another series of template PDFs. We could then superimpose these on the automated artwork.
We are talking about hundreds of PDFs per day, and we would need this process to be as simple as possible, requiring little skill on the part of the operator, or even automated/scripted.
I have read a similar post where adding a watermark with Acrobat was recommended. This worked a treat for me, but considering the high volumes of artwork, even if made into an Acrobat Batch Process/Action, this would be too involved.
Any ideas welcome: RIP software, AppleScript, MS-DOS!!! Whatever!
Cheers
Tim
I have a particular requirement to generate a PDF document with some dynamic data overlaid on top.
That's the general gist. To be clear, I have a fair amount of experience in generating PDFs programmatically, so I'm not looking for a list of products that can simply churn out PDF.
The specifics are:
I have a pre-existing PDF template containing a vector representation of certain regions of the UK. I will be capturing geographical data via a web interface and will need to overlay these data on the PDF as vector graphics (little circles with numbers to be specific).
So, what I'm looking for is some advice on:
How to dynamically write vector graphics to a PDF
Translation of geographical coordinates to vector layer coordinates
Cheers.
Steve
iText[Sharp] can handle the vector graphics portion fairly easily with a PdfStamper.
There have been several questions on the itext-questions mailing list about geographic coordinates that should help you on that front as well. Searching your favorite list archive for "geospatial" should turn them up.
I don't think we've added higher-level functions for that sort of thing yet, so you'll have to dig into PdfDictionary, PdfArray, and so forth. Keep a copy of the PDF Reference handy (but you probably already do).
I have an input PDF file (usually, but not always, generated by pdfTeX) which I want to convert to an output PDF that is visually equivalent (at any resolution), has the same metadata (Unicode text info, hyperlinks, outlines, etc.), but has as small a file size as possible.
I know about the following methods:
java -cp Multivalent.jar tool.pdf.Compress input.pdf (from http://multivalent.sourceforge.net/). This recompresses all streams, removes unused objects, unifies equivalent objects, compresses whitespace, removes default values, compresses the cross-reference table.
Recompressing suitable images with jbig2 and PNGOUT.
Re-encoding Type1 fonts as CFF fonts.
Unifying equivalent images.
Unifying subsets of the same font to a bigger subset.
Remove fillable forms.
When distilling or otherwise converting (e.g. gs -sDEVICE=pdfwrite), make sure it doesn't degrade image quality, and doesn't increase (!) the image sizes.
I know about the following techniques, but they don't apply in my case, since I already have a PDF:
Use smaller and/or less fonts.
Use vector images instead of bitmap images.
Do you have any other ideas how to optimize PDF?
Optimize PDF Files
Avoid Refried Graphics
For graphics that must be inserted as bitmaps, prepare them for maximum compressibility and minimum dimensions. Use the best quality images that you can at the output resolution of the PDF. Inserting compressed JPEGs into PDFs and Distilling them may recompress JPEGs, which can create noticeable artifacts. Use black and white images and text instead of color images to allow the use of the newer JBIG2 standard that excels in monochromatic compression. Be sure to turn off thumbnails when saving PDFs for the Web.
Use Vector Graphics
Use vector-based graphics wherever possible for images that would normally be made into GIFs. Vector images scale perfectly, look marvelous, and their mathematical formulas usually take up less space than bitmapped graphics that describe every pixel (although there are some cases where bitmap graphics are actually smaller than vector graphics). You can also compress vector image data using ZIP compression, which is built into the PDF format. Acrobat Reader versions 5 and 6 also support the SVG standard.
Minimize Fonts
How you use fonts, especially in smaller PDFs, can have a significant impact on file size. Minimize the number of fonts you use in your documents to minimize their impact on file size. Each additional fully embedded font can easily take 40K in file size, which is why most authors create "subsetted" fonts that only include the glyphs actually used.
Flatten Fat Forms
Acrobat forms can take up a lot of space in your PDFs. New in Acrobat 8 Pro you can flatten form fields in the Advanced -> PDF Optimizer -> Discard Objects dialog. Flattening forms makes form fields unusable and form data is merged with the page. You can also use PDF Enhancer from Apago to reduce forms by 50% by removing information present in the file but never actually used. You can also combine a refried PDF with the old form pages to create a hybrid PDF in Acrobat (see "Refried PDF" section below).
see article
Starting with PDF specification version 1.5, there are two new methods of compression: object streams and cross-reference streams.
You mention that the Multivalent.jar compress tool compresses the cross reference table. This usually means the cross reference table is converted into a stream and then compressed.
The format of this cross reference stream is not fixed. You can change the bit size of the three "columns" of data. It's also possible to pre-process the stream data using a predictor function which will improve the compression level of the data. If you look inside the PDF with a text editor you might be able to find the /Predictor entry in the cross reference stream dictionary to check whether the tool you're using is taking advantage of this feature.
Using a predictor on the compression might be handy for images too.
The second type of compression offered is the use of object streams.
Often in a PDF you have many similar objects. These can now be combined into a single object and then compressed. The documentation for the Multivalent Compress tool mentions that object streams are used but doesn't have many details on the actual choice of which objects to group together. The compression will be better if you group similar objects together into an object stream.
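For what it's worth, here is a small sketch showing both mechanisms being applied when rewriting a file. It assumes the open-source qpdf C++ library (a choice for illustration only), which decides the grouping of objects into object streams itself and emits a cross-reference stream whenever object streams are generated:

    #include <qpdf/QPDF.hh>
    #include <qpdf/QPDFWriter.hh>

    // Sketch: rewrite a PDF so that eligible objects are packed into object
    // streams and the cross reference table is written as a compressed
    // cross reference stream (both need PDF 1.5 or later).
    int main(int argc, char** argv)
    {
        if (argc < 3) return 1;

        QPDF pdf;
        pdf.processFile(argv[1]);

        QPDFWriter writer(pdf, argv[2]);
        writer.setObjectStreamMode(qpdf_o_generate);  // group objects into object streams
        writer.setCompressStreams(true);              // compress streams that aren't already
        writer.write();
        return 0;
    }

How the objects are grouped is left to the library; as noted above, grouping similar objects together tends to compress better, but that is not something this sketch controls.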