Table of Contents (TOC) missing after using CGContextDrawPDFPage - pdf

I am a Cocoa programmer using Quartz to draw PDF files. The original PDF has a table of contents (TOC), but the resulting PDF loses the TOC after I use the following code.
for (int i = 1; i <= pageCount; i++)
{
    page = CGPDFDocumentGetPage (document, i);
    CGContextDrawPDFPage (myContext, page);
}
Am I doing something wrong, or is there a way to keep the TOC with Quartz? Any help would be appreciated. (English is not my native language; I hope you can understand what I am asking.)

Your code takes the page content from the source file and draws it onto a new document. This is the only content you can transfer from one document to another this way. The bookmarks (TOC), form fields, annotations, and links in the source file cannot be copied to the new document. This is a limitation of the CoreGraphics API.
So if you need to modify an existing PDF file you're out of luck.

Related

After content stream recreation (using PDFbox) getting error in axesPDF (insert spaces)?

When I recreate a PDF after making some changes and then use the output PDF in axesPDF to fix the spaces issue, I get "ERROR". The only difference I observed is that my input PDF has an array of tokens, whereas my recreated PDF has a dictionary of elements, as shown below. Does that cause the problem? How can I recreate a similar structure? (The left one is the input PDF; the right one is the output PDF after editing.)
[Image: input PDF]
[Image: output PDF with changes]
The code I am using to save the pdf is
PDStream newContents = new PDStream(document);
OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(out);
writer.writeTokens(tokens);
out.close();
document.getPage(pg_ind).setContents(newContents);
newPDF.addPage(document.getPage(pg_ind));
newPDF.save()
Please help me on this. Thanks in advance.
Updating question along with error.
[Image: another input file]
[Image: the error]
[Image: the button I used]
What puzzles me is that this time the error occurs even though the content stream is in COSDictionary format. Something else must be causing this.

iText 7 need to skip reading page header elements

I am using an EventHandler to create the page header for my PDF. The contents of the header are added to a Table before being added to the Canvas. As part of 508 compliance, I need to exclude the header content from being read out loud. How do I accomplish this?
public class TEirHeaderEventHandler : IEventHandler
{
    public void HandleEvent(Event e)
    {
        PdfDocumentEvent docEvent = (PdfDocumentEvent)e;
        PdfDocument pdf = docEvent.GetDocument();
        PdfPage page = docEvent.GetPage();
        PdfCanvas headerPdfCanvas = new PdfCanvas(page.NewContentStreamBefore(), page.GetResources(), pdf);
        Rectangle headerRect = new Rectangle(60, 725, 495, 96);
        Canvas headerCanvas = new Canvas(headerPdfCanvas, pdf, headerRect);
        //creating content for header
        CreateHeaderContent(headerCanvas);
        headerCanvas.Close();
    }
    private void CreateHeaderContent(Canvas canvas)
    {
        //Create header content
        Table table = new Table(UnitValue.CreatePercentArray(new float[] { 60, 25, 15 } ));
        table.SetWidth(UnitValue.CreatePercentValue(100));
        Cell cell1 = new Cell().Add(new Paragraph("Establishment Inspection Report").SetBold().SetTextAlignment(TextAlignment.LEFT));
        cell1.SetBorder(Border.NO_BORDER);
        table.AddCell(cell1);
        Cell cell2 = new Cell().Add(new Paragraph("FEI Number:").SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell2.SetBorder(Border.NO_BORDER);
        table.AddCell(cell2);
        Cell cell3 = new Cell().Add(new Paragraph(_feiNum).SetBold().SetTextAlignment(TextAlignment.RIGHT));
        cell3.SetBorder(Border.NO_BORDER);
        table.AddCell(cell3);
        canvas.Add(table);
    }
}
public static void CreatePdf()
{
    using (MemoryStream writeStream = new MemoryStream())
    using (FileStream inputHtmlStream = File.OpenRead(inputHtmlFile))
    {
        PdfDocument pdf = new PdfDocument(new PdfWriter(writeStream));
        pdf.SetTagged();
        iTextDocument document = new iTextDocument(pdf);
        TEirHeaderEventHandler teirEvent = new TEirHeaderEventHandler();
        pdf.AddEventHandler(PdfDocumentEvent.START_PAGE, teirEvent);
        //Convert html to pdf
        HtmlConverter.ConvertToDocument(inputHtmlStream, pdf, properties);
        document.Close();
        byte[] bytes = TEirReorderingPages(writeStream, numOfPages);
        File.WriteAllBytes(outputPdfFile, bytes);
    }
}
Note that I have set the document to be tagged, but I still get the "Reading Untagged Document" screen when I open the file. Moreover, all of the content is read, including the header, when I activate the Read Out Loud feature. Any input or suggestion would be appreciated. Thank you in advance for your help.
General
The approach suggested by Alexey Subach is generally correct. You mark the content as artifact to differentiate it from real content.
element.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This marks the content in the content stream and it excludes the element from the structure tree.
Your case
However, your specific case is more nuanced.
For a well tagged PDF document, the proper way to read it out loud is to process the structure tree, which is a data structure that represents the logical reading order of the (semantic) elements of the document, such as paragraphs, tables and lists.
Because of the way you are creating the header content, it is not automatically tagged: a Canvas instance that is created from a PdfCanvas instance has autotagging disabled by default. So the table in the header is not marked in the content stream and it is not included in the structure tree. Marking it explicitly as an artifact, with the approach described above in General, should not make a significant difference because it was not in the structure tree to begin with.
If you enable autotagging by adding headerCanvas.enableAutoTagging(page), you will notice that the table does appear in the structure tree.
If you then add table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT), the table is excluded from the structure tree again.
Summary: looking at the structure tree, there's no difference between your original code and the approach described under General.
Adobe reading order / accessibility settings
From your description, I think you are using Adobe Acrobat or Reader for the read out loud functionality. Under Preferences > Reading > Reading Order Options, you can configure how the content should be processed for the read out loud feature:
From https://helpx.adobe.com/reader/using/accessibility-features.html:
Infer Reading Order From Document (Recommended): Interprets the reading order of untagged documents by using an advanced method of structure inference layout analysis.
Left-To-Right, Top-To-Bottom Reading Order: Delivers the text according to its placement on the page, reading from left to right and then top to bottom. This method is faster than Infer Reading Order From Document. This method analyzes text only; form fields are ignored and tables aren’t recognized as such.
Override The Reading Order In Tagged Documents: Uses the reading order specified in the Reading preferences instead of what the tag structure of the document specifies. Use this preference only when you encounter problems in poorly tagged PDFs.
In my tests, the only way I could make Adobe Reader read the header content created with your original code out loud was to select Left-To-Right, Top-To-Bottom Reading Order and enable Override The Reading Order In Tagged Documents. In that case, it basically ignores the tagging and just processes the content according to its location on the page.
With Override The Reading Order In Tagged Documents disabled, the header content is not read, both with your original code and with explicit artifacts.
Conclusion
Although it's a good idea to always tag artifacts as such, so they can be properly differentiated from real content, in this case I believe the behaviour you're experiencing is more related to application configuration than to file structure.
Headers and footers are typically pagination artifacts and should be marked as such in the following way:
table.getAccessibilityProperties().setRole(StandardRoles.ARTIFACT);
This will exclude the table from being read. Please note that you can mark any element implementing the IAccessibleElement interface as an artifact.

Filter out anything but interactive form fields in PDF's

I'm looking for a way to filter out all objects apart from interactive form fields in PDF files.
The programming language isn't too important. I would love to be able to do it from the Linux command line, but I'm pretty much open to anything.
E.g. choose a PDF input file, and output a new PDF file containing only the interactive form fields from the first.
The ultimate goal is to be able to take an already printed but unfilled form, and print only the content of the filled-in form fields onto it.
The closest I've gotten is by using ghostscript:
gs -o outfile.pdf -sDEVICE=pdfwrite -dFILTERTEXT -dFILTERIMAGE infile.pdf
But that still leaves a lot of lines in my case, as well as an image, despite -dFILTERIMAGE.
There's also a -dFILTERVECTOR option, but sadly it removes the form fields as well.
I'm looking for a way to filter out all objects apart from interactive form fields in PDF files.
First and foremost, you have to get rid of the static page content. Using an arbitrary general-purpose PDF library, you can do that by clearing the Contents entry of every page.
E.g. using the Java version of iText7 this can be done as follows:
try (
    PdfReader pdfReader = new PdfReader(SOURCE);
    PdfWriter pdfWriter = new PdfWriter(RESULT);
    PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter)
) {
    for (int pageNr = 1; pageNr <= pdfDocument.getNumberOfPages(); pageNr++) {
        PdfPage pdfPage = pdfDocument.getPage(pageNr);
        pdfPage.getPdfObject().remove(PdfName.Contents);
        pdfPage.getPdfObject().setModified();
    }
}
(RemoveContent test testRemoveAllPageContentStreams)

Docx4J: Vertical text frame not exported to PDF

I'm using Docx4J to make an invoice model.
On the left side of the page, it is usual to show a legal sentence such as: Registered company in ... Book ... Page ...
I have inserted this in my template using a Word text frame.
My issue is this: when exporting to .docx, the legal text is shown perfectly, but when exporting to .pdf, it is shown as a horizontal table below the other data.
The code to export to PDF is:
FOSettings foSettings = Docx4J.createFOSettings();
foSettings.setFoDumpFile(foDumpFile);
foSettings.setWmlPackage(template);
fos = new FileOutputStream(new File("/C:/mypath/prueba_OUT.pdf"));
Docx4J.toFO(foSettings, fos, Docx4J.FLAG_EXPORT_PREFER_XSL);
Any help would be very appreciated.
Thanks.
You'd need to extend the PDF via FO code; for more, see "How to correctly position a header image with docx4j?"
Float left may or may not be easy; the same goes for the rotated text.
In general, the way to work on this is to take the FO generated by docx4j, then hand-edit it into something which FOP can convert to a PDF you are happy with. If you can do that, then it's a matter of modifying docx4j to generate that FO.

PDF metadata to open document in Actual Size (100%) view

I am generating a PDF document using jsPDF. Is there a way to store metadata in the PDF document that will force Acrobat to open it in 100% view mode (Actual Size) rather than sized to fit?
In other words, does the PDF specification allow this to be specified in the document itself?
This is definitely possible, because a PDF document can contain information on how it should open.
You might create such a document in Acrobat and then find the opening information, and/or you might have a look at the Portable Document Format Reference, which is part of the Acrobat SDK, downloadable from the Adobe website.
However, I don't know whether you can insert that structure into the PDF with your tool.
I figured it out: in the Catalog section of the PDF document, there is an OpenAction entry where we can specify how the viewer should display the file, among other things.
I changed this
putCatalog = function () {
    out('/Type /Catalog');
    out('/Pages 1 0 R');
    // #TODO: Add zoom and layout modes
    out('/OpenAction [3 0 R /FitH null]');
    out('/PageLayout /OneColumn');
    events.publish('putCatalog');
},
to this
putCatalog = function () {
    out('/Type /Catalog');
    out('/Pages 1 0 R');
    // #TODO: Add zoom and layout modes
    out('/OpenAction [3 0 R 1 100]'); //change from standard code to use zoom to 100 % instead of fit to width
    out('/PageLayout /OneColumn');
    events.publish('putCatalog');
},
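For comparison, the explicit destination syntax documented in the PDF reference is [page /XYZ left top zoom], where zoom is a magnification factor (1 corresponds to 100%) and null leaves a value unchanged. The following is only a minimal sketch of building such a catalog entry as a string. The helper name openActionEntry and its parameters are hypothetical and not part of jsPDF, and 3 0 R is simply assumed to be the first page object, as in the snippet above.

```javascript
// Hypothetical helper (not part of jsPDF): builds an /OpenAction catalog
// entry using the /XYZ destination form [page /XYZ left top zoom].
// zoomPercent is converted to a magnification factor (100 % -> 1).
function openActionEntry(pageObjectNumber, zoomPercent) {
    if (!Number.isInteger(pageObjectNumber) || pageObjectNumber <= 0) {
        throw new RangeError('pageObjectNumber must be a positive integer');
    }
    if (!(zoomPercent > 0)) {
        throw new RangeError('zoomPercent must be greater than 0');
    }
    const zoom = zoomPercent / 100;
    // null for left and top keeps the viewer's current scroll position.
    return '/OpenAction [' + pageObjectNumber + ' 0 R /XYZ null null ' + zoom + ']';
}

// Assumption: object 3 is the first page, as in the catalog code above.
console.log(openActionEntry(3, 100));
```

Inside putCatalog, the returned string could be passed to out(...) in place of the hard-coded /OpenAction line. Whether a given viewer honours the setting may still depend on its own preferences.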