Who uses the `UserUnit` property in PDF? - pdf

I am working on a PDF library which uses the UserUnit property in some computations. I would like to test the library on real data but I cannot find any real-world PDF documents with non-standard UserUnit. What are the use cases for this property?

You only need UserUnit if you want to specify a MediaBox (page size) in excess of (IIRC) 14400 points (200 inches) in either dimension.
Note that this is an Acrobat limitation: it can't handle MediaBox values larger than that, although other applications can. If you want a page size exceeding that and you expect your file to be viewable in Acrobat, you need to set UserUnit.
I do have real-world files which set UserUnit (very, very few) but cannot share them as they are customer files. I'm told that these are used for architectural plans: apparently some regulatory bodies require all the plans to be on a single 'sheet', and the only way to do that with (for example) a multi-story building is to have a very large 'sheet'.
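To make the arithmetic concrete, here is a minimal sketch (not taken from any particular library) of how a page larger than the limit could be expressed with UserUnit. It assumes the default user space unit of 1/72 inch and an Acrobat limit of 14400 units (200 inches at 72 units per inch); the function names are hypothetical.

```python
POINTS_PER_INCH = 72      # default user space: 1 unit = 1/72 inch
ACROBAT_LIMIT = 14400     # assumed maximum MediaBox extent, in user space units

def required_user_unit(width_in, height_in, limit=ACROBAT_LIMIT):
    """Smallest UserUnit (at least 1.0) that keeps both extents within `limit`."""
    largest = max(width_in, height_in) * POINTS_PER_INCH
    return max(1.0, largest / limit)

def media_box_extent(width_in, height_in, user_unit):
    """MediaBox width and height in user space units for a physical size in inches."""
    return (width_in * POINTS_PER_INCH / user_unit,
            height_in * POINTS_PER_INCH / user_unit)

# A hypothetical 400 x 100 inch plan: too wide for a UserUnit of 1.0.
unit = required_user_unit(400, 100)
extent = media_box_extent(400, 100, unit)
print(unit, extent)
```

An ordinary page such as 8.5 x 11 inches needs no scaling, so the function returns 1.0 for it, which matches the default when UserUnit is absent.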

Related

inserting an entire pdf into another by raw text manipulation

I need to include a pdf into another pdf that is being created by text manipulation, not through a package. (In particular, I'm using livecode, which is well suited to the generation of the information I need, and can easily do text manipulation).
Once included, I will be adding additional objects (primarily text, but also a few small squares).
I only need to be able to access the included pdf by page and area, such as (200,200) to (400,400) of page 5; I don't need any access to its objects.
Simply appending to the pdf won't do the job, as I'll actually be including multiple source pdfs into a single pdf output with my addition.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it. In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
Can this be done? Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
Something that could be used on a one-time basis to convert an entire multi-page pdf would also be a usable (but inferior) solution.
I've found that search engines aren't yielding usable results, as they are swamped with solutions for individual products, and not the pdf itself.
First of all, PDFs in general are not text data, they are binary. They may look textual as they contain identifiers built from ASCII values of words, but treating them as text, unless one and one's tools are extremely cautious, is a sure way to damage them.
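A small illustration of that risk, as a sketch only: the bytes below are a made-up stand-in for the start of a PDF, and the round trip through UTF-8 is an assumption about what a text-oriented tool might do. The point is that the byte offset of the first object changes, which is exactly what cross-reference data depends on.

```python
# A tiny stand-in for the start of a PDF. The second line imitates the
# customary binary comment that marks the file as non-text.
data = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< >>\nendobj\n"

def text_round_trip(raw: bytes) -> bytes:
    """Decode as UTF-8 text and encode again, as a text-oriented tool might."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")

offset_before = data.index(b"1 0 obj")
damaged = text_round_trip(data)
offset_after = damaged.index(b"1 0 obj")

# The object still "looks" the same, but it no longer sits at the same offset.
print(offset_before, offset_after, damaged == data)
```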
But even if we assume such caution, unless your input PDFs are internally of a very simple and similar structure, creating code that can merge them and manipulate their content is, complexity-wise, essentially akin to creating a generic PDF library/package.
"I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it."
Putting them into one indirect object each would work if you needed them merely as an unchanged attachment. But you want to change them.
"In particular, I would like to avoid having to 'disassemble' the source pdf into components to build a new cross-reference table."
You will at least have to parse ("disassemble") the objects related to the pages you want to manipulate, add the manipulated versions thereof, and add cross references for the changed objects.
And you only mention cross reference tables. Don't forget that in case of a general solution you also have to be able to handle cross reference streams and object streams.
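For the classic table case only, here is a rough sketch of what "adding cross references" involves: every entry is a fixed 20-byte line holding the byte offset of an object. The object bodies are placeholders rather than a valid document, the helper names are hypothetical, and cross reference streams and object streams are not covered at all.

```python
def xref_entry(offset, generation=0, in_use=True):
    """One 20-byte entry of a classic cross-reference table."""
    flag = b"n" if in_use else b"f"
    return b"%010d %05d %s \n" % (offset, generation, flag)

def xref_subsection(first_obj, offsets):
    """Subsection for consecutive in-use objects starting at number `first_obj`."""
    header = b"%d %d\n" % (first_obj, len(offsets))
    return header + b"".join(xref_entry(off) for off in offsets)

# Build a body and record the byte offset at which each object begins.
body = b"%PDF-1.4\n"
offsets = []
placeholders = [b"<< /Type /Catalog >>", b"<< /Type /Pages >>"]
for number, content in enumerate(placeholders, start=1):
    offsets.append(len(body))
    body += b"%d 0 obj\n%s\nendobj\n" % (number, content)

section = b"xref\n" + xref_subsection(1, offsets)
print(section.decode("ascii"))
```

Because the offsets are counted in bytes, any change to the length of earlier content invalidates every later entry.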
"Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects)."
If you really want to merge the source PDFs into a target one, you'll indeed need to renumber the objects from most source PDFs.
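As a deliberately naive sketch of what renumbering means, the snippet below shifts object numbers in one uncompressed object by a fixed amount. The regular expression and the name `renumber_refs` are my own illustration; it would also rewrite matching digit sequences inside strings or stream data, which is one reason plain pattern matching is not enough for general PDFs.

```python
import re

# Matches "N G R" (a reference) and "N G obj" (an object header).
REF = re.compile(rb"\b(\d+) (\d+) (R|obj)\b")

def renumber_refs(obj_bytes, shift):
    """Add `shift` to every object number found in `obj_bytes` (naive)."""
    def bump(match):
        number = int(match.group(1)) + shift
        return b"%d %s %s" % (number, match.group(2), match.group(3))
    return REF.sub(bump, obj_bytes)

source = b"2 0 obj\n<< /Parent 3 0 R /Kids [4 0 R 5 0 R] >>\nendobj\n"
print(renumber_refs(source, 10).decode("ascii"))
```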
If as a target a portable collection (aka portfolio) of the source PDFs would suffice, you might not need to do that. In that case you merely have to apply the changes you want to the source PDFs (by means of incremental updates, if you prefer), and then combine all those manipulated sources in a result portfolio.
"I've found that search engines aren't yielding usable results"
The cause most likely is that you underestimate the complexities of the PDF format. Combining and manipulating arbitrary existing PDFs usually requires you to use a third-party library or create the equivalent of such a library yourself.
Only manipulating existing PDFs is a bit easier, and so is combining PDFs in a portfolio. Nonetheless, even in this case you should have studied the PDF specification quite a bit.
Restricting oneself to string manipulations to implement this makes the task much more complex - I'd say impossible for generic PDFs, daring for PDFs of simple and similar build.

Headless LibreOffice or OpenOffice as a PDF report generator?

I hope it’s OK to post a completely naive question here for LO or OO experts.
I’m looking for advice on whether scripting LibreOffice or OpenOffice would be suitable for the following:
General Question
I’m looking to generate PDF reports, based on a combination of a “template” and a set of data (currently in JSON format) and inserted images.
This would act as a headless service that gets invoked when necessary from a web server, when a user requests a PDF report (on linux).
We need to frequently modify/customise/generate new templates, hence the reluctance to go down a route of using something like Reportlab (plus I don't know Reportlab at all, so I'd face a huge learning curve that way).
Background
This is in contrast to using a PDF library like Reportlab directly within the web server and having to build up the template/report programmatically.
As LibreOffice/OpenOffice is obviously a lot faster for generating good looking report "templates", this is a question about doing both the template generation, plus final template + data -> PDF generation all directly within LibreOffice.
Some more specifics
The data values would mostly be substituted into fields in the template, with little or no processing of the values required.
However, there would be situations where some of the data is in “sets” that would be shown in a table type view, and the number of fields (and so number of table rows for instance) would need to vary per report, based on the number of values in that particular JSON data.
Additionally, I’d need to be able to include (“import”) images into the report. Some of the JSON data would be paths to image files, and I’d like to include those. Again, the number of images may vary between reports.
This wouldn't be high frequency at all, so I would not need to run LO/OO as a service, but could simply invoke it when required with a system call. Conceptually something like "LibreOffice --template 'make_fancy.report' <data.json> <output_file.pdf>"
If this approach would be reasonable in either LO or OO, what languages are best to script in? (Hopefully python3).
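For the final "document to PDF" step only, a minimal Python 3 sketch of a one-off headless invocation might look like this. It assumes the binary is on the PATH as `soffice` and that the template has already been filled in by some other step and saved as a document (the file name `report_filled.odt` is a placeholder); merging the JSON data and images into the template is not covered here.

```python
import subprocess
from pathlib import Path

def build_convert_command(document, outdir, soffice="soffice"):
    """Command line for a one-off headless conversion of `document` to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(Path(outdir)), str(Path(document))]

def convert_to_pdf(document, outdir, soffice="soffice"):
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_convert_command(document, outdir, soffice), check=True)
    return Path(outdir) / (Path(document).stem + ".pdf")

# Only builds the command here; call convert_to_pdf() where LibreOffice is installed.
cmd = build_convert_command("report_filled.odt", "out")
print(" ".join(cmd))
```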

Losing Aria/accessibility when converting from HTML to PDF

I am using ABCpdf to generate a collection of PDFs from HTML markup, and am struggling with making it fully accessible.
The HTML pages include several graphs which are created by CSS, and which are completely ignored by the screenreader.
I have tried using aria-label to give a written explanation of the graphs, but it is lost in the conversion. I have tried configuring the Gecko engine within ABCpdf in numerous ways, including scaling back security options, altering markup options, and adding special tags to explicitly include an element. The PDF is tagged and is rated as fully accessible by our evaluation program.
I haven't been able to find a way to include "hidden" text in the PDF for the purpose of screenreaders. Any help is appreciated!
EDIT: Due to security concerns, I am unable to display the actual data behind the graphs. Manual steps are also not an option due to the sheer number of generated PDFs, and a short timeline.
HTML-to-PDF conversion utilities are usually pretty basic and typically don't handle complex CSS very well at all. You may be better off taking a screen capture and then using alt-text to describe the intent of the graph. Sometimes the simplest approach is the most reliable.
Another way of approaching the issue would be to present the complete data set to users via a data table. That way, they can "see" everything contained in the graph, and it won't matter if the graph itself is inaccessible. If placing a giant data table in the middle of your document doesn't fit with your formatting, you can also include the data set in an appendix with a note or hyperlink in the text directing readers where they can go to access the entirety of information.
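As a sketch of the data-table idea, the helper below builds plain table markup with a caption and column headers from rows of values. The function name and the sample values are placeholders, and whether a given converter carries this structure into the tagged PDF is an assumption that would need checking against its output.

```python
from html import escape

def data_table(caption, headers, rows):
    """Return simple table markup with a caption and column header cells."""
    parts = ["<table>", f"  <caption>{escape(caption)}</caption>", "  <tr>"]
    parts += [f'    <th scope="col">{escape(h)}</th>' for h in headers]
    parts.append("  </tr>")
    for row in rows:
        parts.append("  <tr>")
        parts += [f"    <td>{escape(str(cell))}</td>" for cell in row]
        parts.append("  </tr>")
    parts.append("</table>")
    return "\n".join(parts)

# Placeholder data standing in for the values behind one graph.
table_html = data_table("Totals per quarter", ["Quarter", "Total"],
                        [["Q1", 10], ["Q2", 12]])
print(table_html)
```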

How to create your own package for interaction with word, pdf etc

I know that there are a lot of packages around which allow you to create or read e.g. PDF, Word and other files.
What I'm interested in (and never learned at the university) is how you create such a package? Are you always relying on source code being given by the original company (such as Adobe or Microsoft), or is there another clever way of working around it? Should I analyze the individual bytes I see in e.g. PDF files?
It varies.
Some companies provide an SDK ("Software Development Kit") for their own data format; others provide only a specification (e.g., Adobe for PDF, Microsoft for Word), and it's up to the software developer to write a correct implementation.
Since that can be a lot of work – the PDF specification, for example, runs to over 700 pages and doesn't go deep into practically required material such as LZW, JPEG/JPEG2000, color theory, and math transformations – and you need a huge set of data to test against, it's way easier to use the work that others have done on it.
If you are interested in writing a support library for a certain file format which
is not legally protected,
has no, or only sparse (official) documentation,
and is not already under deconstruction elsewhere,[a]
then yes: you need to
gather as many possible different files;
from as many possible sources;
(ideally, you should have at least one program that can both read and create the files)
inspect them on the byte level;
create a 'reader' which works on all of the test files;
if possible, interesting, and/or required, create a 'writer' that can create a new file in that format from scratch or can convert data in another format to this one.
There is 'cleverness' involved, mainly in #3, as you need to be very well versed in how data representation works in general. You should be able to tell code from data, and string data from floating point, and UTF8 encoded strings from MacRoman-encoded strings (and so on).
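To give a flavour of byte-level inspection, here is a small sketch that interprets the same four bytes in several ways and prints a basic hex dump. The bytes are invented for the example; with a real file you would read them from disk and compare the interpretations across many samples.

```python
import struct

raw = bytes.fromhex("0000803f")   # four bytes from a hypothetical file

as_float_le = struct.unpack("<f", raw)[0]   # little-endian 32-bit float
as_uint_le = struct.unpack("<I", raw)[0]    # little-endian unsigned 32-bit int
as_uint_be = struct.unpack(">I", raw)[0]    # big-endian unsigned 32-bit int

def hexdump(data, width=16):
    """Offset, hex bytes and printable ASCII for each `width`-byte row."""
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start:08x}  {hex_part:<{width * 3}} {text}")
    return "\n".join(lines)

print(as_float_le, as_uint_le, as_uint_be)
print(hexdump(b"SPRT" + raw))
```

A value that comes out as a clean 1.0 when read as a float but as a large, arbitrary-looking integer otherwise is a hint (not proof) about which interpretation the format intends.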
I've done this a couple of times, primarily to inspect the data of various games, mainly because it's huge fun! (Fair warning: it can also be incredibly frustrating.) See Reverse Engineering's Reverse engineering file containing sprites for an example approach; notably, at the bottom of my answer in there I admit defeat and start using the phrases "possibly" and "may" and "probably", which is an indication I did not get any further on that.
[a] Not necessarily, of course. You can cooperate with others whose expertise lies elsewhere, or even do "grunt work" for existing projects – finding out and codifying fairly trivial subcases.
There are also advantages of working independently on existing projects. For example, with the experience of my own PDF reader (written from scratch), I was able to point out a bug in PDFBox.

Embed a serial number in a PDF file?

To prevent the casual distribution of a PDF document, is there any way to embed a serial number in the file?
My idea is to embed an ID bound to the user, making it possible to find out who distributed the file.
I know this doesn't prevent distribution, but it may discourage casual distribution to a certain degree.
Any other solution is also welcome.
Thanks!
A common way is to place the ID in the metadata, but that can easily be removed.
Let's look for hideouts (most of them low-level)!
Non-mark text
Text under overlapping objects
Objects of older versions (not noticed by the reader, but still there as redundant information)
Marks in streams between BX-EX (with weird information from the reader's point of view)
Information before %PDF-X
Information above %%EOF
Substitution of names for some elements (like font name)
Steganography
Manipulation of the fonts used
Whitespacing
Images with steganography
My favorites are steganography and a BX-EX block within a stream; with proper compression and/or encryption it is hard to find (if you do not know where it is). To make the search harder, wrap some normal blocks with BX-EX as well.
Some of these ways are easy to remove, some harder, but a determined attacker will be able to find and sanitize them all. Think about copy-pasting the text or printing through a PDF printer.
You can render transparent text. You can write text outside the media box of a page. You can add a custom document property. There are plenty of ways to do this.
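As one very simple low-level variant, a serial could be appended as a comment line after the end-of-file marker. This is only a sketch: the `%serial:` prefix and the function names are made up, it assumes the viewers you care about tolerate trailing bytes after `%%EOF` (worth verifying), and it is trivially easy to strip.

```python
MARKER = b"%serial:"   # hypothetical comment prefix, not part of any standard

def embed_serial(pdf_bytes, serial):
    """Append a comment line carrying `serial` after the final %%EOF."""
    if not pdf_bytes.rstrip().endswith(b"%%EOF"):
        raise ValueError("input does not end with %%EOF")
    return pdf_bytes.rstrip(b"\r\n") + b"\n" + MARKER + serial.encode("ascii") + b"\n"

def extract_serial(pdf_bytes):
    """Return the serial found after the last %%EOF, or None if there is none."""
    eof = pdf_bytes.rfind(b"%%EOF")
    if eof == -1:
        return None
    for line in pdf_bytes[eof:].splitlines():
        if line.startswith(MARKER):
            return line[len(MARKER):].decode("ascii")
    return None

# Placeholder bytes standing in for a real file read in binary mode.
sample = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
marked = embed_serial(sample, "USER-0042")
print(extract_serial(marked))
```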
Why not create a digital id on the documents?