Ghostscript PDFA1A to PDFA1B Validation VeraPdf - pdf

So I want to have a valid PDFA1B which validates correctly with my function:
public boolean isValidPdfA1B(File pdf) throws Exception {
    VeraGreenfieldFoundryProvider.initialise();
    PDFAFlavour flavour = PDFAFlavour.PDFA_1_B;
    try (PDFAParser parser = Foundries.defaultInstance().createParser(pdf, flavour)) {
        PDFAValidator validator = Foundries.defaultInstance().createValidator(flavour, false);
        ValidationResult result = validator.validate(parser);
        return result.isCompliant();
    } catch (IOException | ValidationException | ModelParsingException | EncryptedPdfException exception) {
        // Any failure during parsing or validation counts as "not valid"
        return false;
    }
}
First I created a PDF with Word, exported as ISO 19005-1 (PDF/A) compliant.
Then I used Ghostscript with AdobeRGB.icc and the following command to create a PDF/A-1b document:
gswin64 -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE -sColorConversionStrategy=UseDeviceIndependentColor -sDEVICE=pdfwrite -sOutputFile=/PATH/TO/output-a.pdf -dPDFACompatibilityPolicy=2 /PATH/TO/PDFA_def.ps /PATH/TO/word_created.pdf
Before that, I had to apply the pdfmarks solution from "Ghostscript won't generate PDF/A with UTF16BE text string detected in DOCINFO - in spite of PDFACompatibilityPolicy saying otherwise" to avoid the error with DocumentInfo.
So now (with UseDeviceIndependentColor set; RGB produces a lot more problems) I get the following error in my veraPDF check:
DeviceRGB may be used only if the file has a PDF/A-1 OutputIntent that
uses an RGB colour space

Related

Blank LatexPDF File issue with custom Latex Style with SPHINX

Recently I got my PDF file to build with the correct style formatting.
But now the content of my RST files from Sphinx is not included.
Does anyone know how to include the RST content when using a custom preamble (.sty file)?
main.sty
\PassOptionsToPackage{english}{babel}
\usepackage{amsmath}
\usepackage{color,pxfonts,fix-cm}
\usepackage{latexsym}
\usepackage[mathletters]{ucs}
\DeclareUnicodeCharacter{32}{$\ $}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{pict2e}
\usepackage{wasysym}
\usepackage{tikz}
\pagestyle{empty}
\geometry{left=0.2in, top=0.5in, paperwidth=595pt, paperheight=878pt}
\begin{document}
\definecolor{color_93343}{rgb}{0.25098,0.25098,0.25098}
\definecolor{color_217206}{rgb}{0.737255,0.839216,0.443137}
\begin{tikzpicture}[overlay]\path(0pt,0pt);\end{tikzpicture}
\begin{picture}(-5,0)(2.5,0)
\put(-15,-837.4999){\includegraphics[width=596.25pt,height=887.25pt]{latexImage_ab7aceb12f4add82da5ce423202fd278.png}}
\put(55.775,-99.78003){\fontsize{11}{1}\usefont{T1}{cmr}{m}{n}\selectfont\color{color_93343} }
\put(55.525,-810){\fontsize{11}{1}\usefont{T1}{cmr}{m}{n}\selectfont\color{color_93343} }
\put(282.38,-810){\fontsize{11}{1}\usefont{T1}{cmr}{m}{n}\selectfont\color{color_93343} }
\put(509.48,-810){\fontsize{11}{1}\usefont{T1}{cmr}{m}{n}\selectfont\color{color_93343} }
\put(55.775,-823.5){\fontsize{11}{1}\usefont{T1}{cmr}{m}{n}\selectfont\color{color_93343} }
\put(55.775,-118.05){\fontsize{16}{1}\usefont{T1}{cmr}{b}{n}\selectfont\color{color_217206} }
\put(55.775,-150.3){\fontsize{16}{1}\usefont{T1}{cmr}{b}{n}\selectfont\color{color_217206} }
\end{picture}
\end{document}
[SPHINX] Conf.py:
latex_engine = 'pdflatex'
latex_additional_files = [
'latexImage_ab7aceb12f4add82da5ce423202fd278.png',
'main.sty'
]
latex_elements = {
# Additional stuff for the LaTeX preamble.
'extraclassoptions': 'openany',
'preamble' : r"""
\usepackage{main}
\renewcommand{\subtitle}{%s}
""" % (project)
}
It creates the LaTeX PDF with the correct custom styling, but none of the RST content is included.
I also tried the 'latex_documents' setting in conf.py, but that also resulted in a blank PDF file.

Converting XDP to PDF using LiveCycle replaces spaces with question marks (?) in multiple places. How can that be fixed?

I have been trying to convert XDP to PDF using Adobe LiveCycle. Most of my forms turn out fine, but in some of them I find ??? in place of blank spaces in certain places. Any suggestions on how I can fix that?
Below is the code snippet that I am using:
public byte[] generatePDF(TextDocument xdpDocument) {
try {
Assert.notNull(xdpDocument, "XDP Document must be passed.");
Assert.hasLength(xdpDocument.getContent(), "XDPDocument content cannot be null");
// Create a ServiceClientFactory object
ServiceClientFactory myFactory = ServiceClientFactory.createInstance(createConnectionProperties());
// Create an OutputClient object
FormsServiceClient formsClient = new FormsServiceClient(myFactory);
formsClient.resetCache();
String text = xdpDocument.getContent();
String charSet = xdpDocument.getCharsetName();
if (charSet == null || charSet.trim().length() == 0) {
charSet = StandardCharsets.UTF_8.name();
}
byte[] bytes = text.getBytes();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
Document inTemplate = new Document(byteArrayInputStream);
// Set PDF run-time options
// Set rendering run-time options
RenderOptionsSpec renderOptionsSpec = new RenderOptionsSpec();
renderOptionsSpec.setLinearizedPDF(true);
renderOptionsSpec.setAcrobatVersion(AcrobatVersion.Acrobat_9);
PDFFormRenderSpec pdfFormRenderSpec = new PDFFormRenderSpec();
pdfFormRenderSpec.setGenerateServerAppearance(true);
pdfFormRenderSpec.setCharset("UTF8");
FormsResult formOut = formsClient.renderPDFForm2(inTemplate, null, pdfFormRenderSpec, null, null);
Document xfaPdfOutput = formOut.getOutputContent();
//If the input file is already static PDF, then below method will throw an exception - handle it
OutputClient outClient = new OutputClient(myFactory);
outClient.resetCache();
Document staticPdfOutput = outClient.transformPDF(xfaPdfOutput, TransformationFormat.PDF, null, null, null);
byte[] data = StreamIO.toBytes(staticPdfOutput.getInputStream());
return data;
} catch(IllegalArgumentException ex) {
logger.error("Input validation failed for generatePDF request " + ex.getMessage());
throw new EformsException(ErrorExceptionCode.INPUT_REQUIRED + " - " + ex.getMessage(), ErrorExceptionCode.INPUT_REQUIRED);
} catch (Exception e) {
logger.error("Exception occurred in Adobe Services while generating PDF from xdpDocument..", e);
throw new EformsException(ErrorExceptionCode.PDF_XDP_CONVERSION_EXCEPTION, e);
}
}
I suggest trying two things:
Check the font. Switch to something very common like Arial or Times New Roman and see if the characters are still lost.
Check the character encoding. It might not be a simple question mark character you are using, in which case the character encoding matters. The easiest check is to make sure your question mark is ASCII character 63 (decimal).
I hope that helps.
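The encoding check can be sketched in plain Java. Note that the question's code computes charSet but then calls text.getBytes() without it, so the platform default charset is used; whether that is the actual cause here is an assumption. The class and method names below are hypothetical, and the sample string is made up for illustration. The sketch shows how a character that the target charset cannot represent is silently turned into byte 63 ('?').

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Encodes the text with the given charset, decodes it again, and reports
    // whether anything changed. String.getBytes(Charset) replaces characters
    // the charset cannot represent, so a difference means data was lost.
    static boolean losesCharacters(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        String roundTripped = new String(encoded, charset);
        return !roundTripped.equals(text);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a non-breaking space (U+00A0) and an e-acute (U+00E9)
        String sample = "Name:\u00A0Jos\u00E9";

        System.out.println(losesCharacters(sample, StandardCharsets.US_ASCII)); // true
        System.out.println(losesCharacters(sample, StandardCharsets.UTF_8));    // false

        // Byte at index 5 is where the non-breaking space was; in US-ASCII it becomes 63 ('?')
        byte[] ascii = sample.getBytes(StandardCharsets.US_ASCII);
        System.out.println(ascii[5]); // 63
    }
}
```

If the same check reports a loss for the charset your template is actually encoded with, passing the charset explicitly (for example text.getBytes(charSet)) would be the first thing to try.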

Stop Controller from executing again once a request has been made

I'm new to Grails, so I'm hoping someone will be patient and give me a hand. I have a controller that creates a PDF. If the user clicks more than once before the PDF is created, I get the following error. The code that creates the PDF is below.
2016-03-09 09:32:11,549 ERROR errors.GrailsExceptionResolver - SocketException occurred when processing request: [GET] /wetlands-form/assessment/f3458c91-3435-4714-a0e0-3b24de238671/assessment/pdf
Connection reset by peer: socket write error. Stacktrace follows:
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at mdt.wetlands.AssessmentController$_closure11$$EPeyAg3t.doCall(AssessmentController.groovy:300)
at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:195)
at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2016-03-09 09:32:11,549 ERROR errors.GrailsExceptionResolver - IllegalStateException occurred when processing request: [GET] /wetlands-form/assessment/f3458c91-3435-4714-a0e0-3b24de238671/assessment/pdf
getOutputStream() has already been called for this response. Stacktrace follows:
org.codehaus.groovy.grails.web.pages.exceptions.GroovyPagesException: Error processing GroovyPageView: getOutputStream() has already been called for this response
at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:195)
at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalStateException: getOutputStream() has already been called for this response
at C__MDTDATA_gg_workspace_new_wetlands_grails_app_views_error_gsp.run(error.gsp:1)
... 5 more
2016-03-09 09:32:11,549 ERROR [/wetlands-form].[grails] - Servlet.service() for servlet grails threw exception
java.lang.IllegalStateException: getOutputStream() has already been called for this response
PDF CODE VIA rendering plugin
def pdf = {
def assessment = lookupAssessment()
if (!assessment){
return
}
// Trac 219 Jasper report for PDF output
Map reportParams = [:]
def report = params.report
def printType = params.printType
def mitigationType = params.mitigationType
def fileName
def fileType
fileType = 'PDF'
def reportDir =
grailsApplication.mainContext.servletContext.getRealPath(""+File.separatorChar+"reports"+File.separatorChar)
def resolver = new SimpleFileResolver(new File(reportDir))
reportParams.put("ASSESS_ID", assessment.id)
reportParams.put("RUN_DIR", reportDir+File.separatorChar)
reportParams.put("JRParameter.REPORT_FILE_RESOLVER", resolver)
reportParams.put("_format", fileType)
reportParams.put("_file", "assessment")
println params
def reportDef = jasperService.buildReportDefinition(reportParams, request.getLocale(), [])
def file = jasperService.generateReport(reportDef).toByteArray()
// Non-inline reports (e.g. PDF)
if (!reportDef.fileFormat.inline && !reportDef.parameters._inline)
{
response.setContentType("APPLICATION/OCTET-STREAM")
response.setHeader("Content-disposition", "attachment; filename=" + assessment.name + "." + reportDef.fileFormat.extension);
response.contentType = reportDef.fileFormat.mimeTyp
response.characterEncoding = "UTF-8"
response.outputStream << reportDef.contentStream.toByteArray()
}
else
{
// Inline report (e.g. HTML)
render(text: reportDef.contentStream, contentType: reportDef.fileFormat.mimeTyp, encoding: reportDef.parameters.encoding ? reportDef.parameters.encoding : 'UTF-8');
}
}
This is the WORD code.
def word = {
def assessment = lookupAssessment()
if (!assessment){
return
}
// get the assessment's data as xml
def assessmentXml = g.render(template: 'word', model: [assessment:assessment]).toString()
// open the Word template
def loader = new LoadFromZipNG()
def template = servletContext.getResourceAsStream('/word/template.docx')
WordprocessingMLPackage wordMLPackage = (WordprocessingMLPackage)loader.get(template)
// get custom xml piece from Word template
String itemId = '{44f68b34-ffd4-4d43-b59d-c40f7b0a2880}' // have to pull up part by ID. Watch out - this may change if you muck with the template!
CustomXmlDataStoragePart customXmlDataStoragePart = wordMLPackage.getCustomXmlDataStorageParts().get(itemId)
CustomXmlDataStorage data = customXmlDataStoragePart.getData()
// and replace it with our assessment's xml
ByteArrayInputStream bs = new ByteArrayInputStream(assessmentXml.getBytes())
data.setDocument(bs) // needs java.io.InputStream
// that's it! the data is in the Word file
// but in order to do the highlighting, we have to manipulate the Word doc directly
// gather the list of cells to highlight
def highlights = assessment.highlights()
// get the main document from the Word file as xml
MainDocumentPart mainDocPart = wordMLPackage.getMainDocumentPart()
def xml = XmlUtils.marshaltoString(mainDocPart.getJaxbElement(), true)
// use the standard Groovy tools to handle the xml
def document = new XmlSlurper(keepWhitespace:true).parseText(xml)
// for each value in highlight list - find node, shade cell and add bold element
highlights.findAll{it != null}.each{highlight ->
def tableCell = document.body.tbl.tr.tc.find{it.sdt.sdtPr.alias.'@w:val' == highlight}
tableCell.tcPr.shd[0].replaceNode{
'w:shd'('w:fill': 'D9D9D9') // shade the cell
}
def textNodes = tableCell.sdt.sdtContent.p.r.rPr
textNodes.each{
it.appendNode{
'w:b'() // bold element
}
}
}
// here's a good way to print out xml for debugging
// System.out.println(new StreamingMarkupBuilder().bindNode(document.body.tbl.tr.tc.find{it.sdt.sdtPr.alias.@'w:val' == '12.1.1'}).toString())
// or save xml to file for study
// File testOut = new File("C:/MDTDATA/wetlands-trunk/xmlout.xml")
// testOut.setText(new StreamingMarkupBuilder().bindNode(document).toString())
// get the updated xml back in the Word doc
Object obj = XmlUtils.unmarshallFromTemplate(new StreamingMarkupBuilder().bindNode(document).toString(), null);
mainDocPart.setJaxbElement((Object)obj)
File file = File.createTempFile('wordexport-', '.docx')
wordMLPackage.save(file)
response.setHeader('Content-Type', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document;')
response.setHeader('Content-Disposition', "attachment; filename=${assessment.name.encodeAsURL()}.docx")
response.setHeader('Content-Length', "${file.size()}")
response.outputStream << file.readBytes()
response.outputStream.flush()
file.delete()
}
// for checking XML during development
def word2 = {
def assessment = lookupAssessment()
if (!assessment){
return
}
render template: 'word', model: [assessment:assessment]
}
You need to catch the exception. If you do not want to do anything with it, leave the catch block empty, as below. If there is still no file after the try/catch, something has gone wrong, so render the same or another view with an error message. The action then returns, so it never reaches the later part that checks the report type (PDF or HTML).
..
// declare file (def means it could be any type of object)
def file
// wrap the call that may fail unexpectedly in a try/catch
try {
    file = jasperService.generateReport(reportDef).toByteArray()
} catch (Exception e) {
    //log.warn (e)
    //println "${e} ${e.errors}"
}
// on the second click the request ends up in the catch block,
// so the try block above produces no file
// in Groovy, !file is true when file is null or empty
if (!file) {
    // render something else
    render 'a message or return to page with error that its in use or something gone wrong'
    // return stops the controller action at this point
    return
}
// nothing below this point runs when no file was produced
...
A final note: try/catch blocks are expensive and should not be used everywhere. If you can anticipate a condition, handle it by checking the data. Fall back to try/catch in scenarios like this one, where a third-party API outside your control can fail unexpectedly.
1- Client side: it is better to disable the button on the first click and wait for the response from the server.
2- Catch the exception and do nothing, or just write an error log entry.
// get/set parameters
def file
def reportDef
try{
reportDef = jasperService.buildReportDefinition(reportParams, request.getLocale(), [])
file = jasperService.generateReport(reportDef).toByteArray()
}catch(Exception e){
// print log or do nothing
}
if (file){
// render file according to your conditions
}
else {
// render , return appropriate message.
}
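Neither suggestion above stops a second request from starting the report generation again on the server. One possible complement, sketched here in plain Java with a hypothetical InFlightGuard class and a made-up key, is to remember which assessments are currently being rendered and reject duplicates. This is an illustration under those assumptions, not part of the Grails or Jasper plugin API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class InFlightGuard {
    // Keys (for example assessment ids) whose work is currently running
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    // Runs the task unless another call with the same key is still running.
    // Returns null when the call was rejected as a duplicate.
    <T> T runOnce(String key, Supplier<T> task) {
        if (!inFlight.add(key)) {
            return null; // same key already being processed
        }
        try {
            return task.get();
        } finally {
            inFlight.remove(key); // allow later requests once this one is done
        }
    }

    public static void main(String[] args) {
        InFlightGuard guard = new InFlightGuard();
        // Simulate a second click arriving while the first report is still being built
        String outcome = guard.runOnce("assessment-42", () -> {
            String second = guard.<String>runOnce("assessment-42", () -> "second report");
            return "first report, second=" + second;
        });
        System.out.println(outcome); // first report, second=null
    }
}
```

In a controller, a null result would map to the "render an appropriate message and return" branch shown above.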
Instead of catching Exception, it's better to catch IOException; otherwise you will swallow all other exceptions as well. Here is how I handled it.
private def streamFile(File file) {
def outputStream
try {
response.contentType = "application/pdf"
response.setHeader "Content-disposition", "inline; filename=${file.name}"
outputStream = response.outputStream
file.withInputStream {
response.contentLength = it.available()
outputStream << it
}
outputStream.flush()
}
catch (IOException e){
log.info 'Probably User Cancelled the download!'
}
finally {
if (outputStream != null){
try {
outputStream.close()
} catch (IOException e) {
log.info 'Exception on close'
}
}
}
}
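The same idea, catching only IOException around the write to the client, can be sketched in plain Java without any Grails classes. The StreamCopy class and the simulated broken stream are hypothetical and only stand in for response.outputStream after the user has cancelled the download.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class StreamCopy {
    // Copies everything from in to out. Returns false when an IOException occurs
    // (for example because the client closed the connection) and lets every
    // other kind of exception propagate to the caller.
    static boolean copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[8192];
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            out.flush();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4 dummy".getBytes(StandardCharsets.US_ASCII);

        // Normal case: the whole payload is written
        System.out.println(copy(new ByteArrayInputStream(data), new ByteArrayOutputStream())); // true

        // Simulated "Connection reset by peer": every write fails
        OutputStream broken = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new IOException("Connection reset by peer");
            }
        };
        System.out.println(copy(new ByteArrayInputStream(data), broken)); // false
    }
}
```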

iText - Cleaning Up Text in Rectangle without cleaning full row

I'm trying to clean up text inside rectangle in pdf document using iText.
Following is the piece of code I’m using:
PdfReader pdfReader = null;
PdfStamper stamper = null;
try
{
int pageNo = 1;
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 202.3);
linkBounds.add(1, (float) 588.6);
linkBounds.add(2, (float) 265.8);
linkBounds.add(3, (float) 599.7);
pdfReader = new PdfReader("Test1.pdf");
stamper = new PdfStamper(pdfReader, new FileOutputStream("Test2.pdf"));
Rectangle linkLocation = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(pageNo, linkLocation, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
}
catch (Exception e)
{
e.printStackTrace();
}
finally
{
try {
stamper.close();
}
catch (Exception e) {
e.printStackTrace();
}
pdfReader.close();
}
After executing this code, the entire line of text is cleared instead of only the text inside the given rectangle.
To explain this better, I have attached PDF documents.
input PDF
output PDF
In the input PDF, I have highlighted the text to show the rectangle I’m specifying for clean-up.
In the output PDF you can see the grey rectangle, but the whole line of text has been cleaned up.
Any help will be appreciated.
The files input.pdf and output.pdf the OP originally presented did not allow the issue to be reproduced; they did not even seem to match each other. Thus, the original answer essentially demonstrated that the issue could not be reproduced.
The second set of files, Test1.pdf and Test2.pdf, did allow the issue to be reproduced, giving rise to the updated answer...
Updated answer referring to the OP's second set of sample files
There indeed is an issue in the current (up to 5.5.8) iText clean-up code: In case of tagged files some methods of PdfContentByte used here introduced extra instructions into the content stream which actually damaged it and relocated some text in the eyes of PDF viewers which ignored the damage.
In more detail:
PdfCleanUpContentOperator.writeTextChunks used canvas.setCharacterSpacing(0) and canvas.setWordSpacing(0) to initially set the character and word spacing to 0. Unfortunately these methods in case of tagged files check whether the canvas under construction currently is in a text object and (if not) start a text object. This check depends on a local flag set by beginText; but during clean-up text objects are not started using that method. Thus, writeTextChunks here inserts an extra "BT 1 0 0 1 0 0 Tm" sequence damaging the stream and relocating the following text.
private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
canvas.setCharacterSpacing(0);
canvas.setWordSpacing(0);
...
PdfCleanUpContentOperator.writeTextChunks instead should use hand-crafted Tc and Tw instructions to not trigger this side effect.
private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
if (Float.compare(characterSpacing, 0.0f) != 0 && Float.compare(characterSpacing, -0.0f) != 0) {
new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append(Tc);
}
if (Float.compare(wordSpacing, 0.0f) != 0 && Float.compare(wordSpacing, -0.0f) != 0) {
new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append(Tw);
}
canvas.getInternalBuffer().append((byte) '[');
With this change in place the OP's new sample file "Test1.pdf" is properly redacted by the sample code
@Test
public void testRedactJavishsTest1() throws IOException, DocumentException
{
try ( InputStream resource = getClass().getResourceAsStream("Test1.pdf");
OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "Test1-redactedJavish.pdf")) )
{
PdfReader reader = new PdfReader(resource);
PdfStamper stamper = new PdfStamper(reader, result);
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 202.3);
linkBounds.add(1, (float) 588.6);
linkBounds.add(2, (float) 265.8);
linkBounds.add(3, (float) 599.7);
Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
stamper.close();
reader.close();
}
}
(RedactText.java)
Original answer referring to the OP's original sample files
I just tried to reproduce your issue using this test method
@Test
public void testRedactJavishsText() throws IOException, DocumentException
{
try ( InputStream resource = getClass().getResourceAsStream("input.pdf");
OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "input-redactedJavish.pdf")) )
{
PdfReader reader = new PdfReader(resource);
PdfStamper stamper = new PdfStamper(reader, result);
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 200.7);
linkBounds.add(1, (float) 547.3);
linkBounds.add(2, (float) 263.3);
linkBounds.add(3, (float) 558.4);
Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
stamper.close();
reader.close();
}
}
(RedactText.java)
For your source PDF, the result was correct and did not look like your output.pdf.
I even re-tested using iText version 5.5.5, which you mention in a comment, and also 5.5.4, but in all cases I got the correct result.
Thus, I cannot reproduce your issue.
I had a closer look at your output.pdf. It is a bit peculiar, in particular it does not contain certain blocks typical for PDFs created or manipulated by current iText versions. Furthermore the content streams look extremely different.
Thus, I assume that after iText redacted your file some other tool post-processed and in doing so damaged it.
In particular the page content instructions preparing the insertion of the redacted line look like this in your input.pdf:
q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
[...] TJ
and like this in the version I received directly from iText:
q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
0 Tc
0 Tw
[...] TJ
but the corresponding lines in your output.pdf look like this
BT
1 0 0 1 113.3 548.5 Tm
0 Tc
BT
1 0 0 1 0 0 Tm
0 Tc
[...] TJ
Here the instructions in your output.pdf are
invalid, because a text object BT ... ET may not contain another text object, yet you have two BT operations following each other without an ET in between;
effectively positioning the text at 0, 0 if a PDF viewer ignores the error mentioned above.
And indeed, if you look at the bottom of your output.pdf page you'll see:
So if my assumption is correct that some other program post-processes the iText result, you should repair that post-processor.
If there is no such post-processor, you seem not to have the officially published iText version but something altogether different.
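The nested-BT problem described above can be detected with a very naive scan of a decoded content stream. The following is a minimal sketch with a hypothetical TextObjectCheck class; it splits on whitespace only, so it ignores string contents, comments and inline images, and the two sample streams are abbreviated from the excerpts above with an empty array and closing operators added as assumptions.

```java
class TextObjectCheck {
    // Returns true if a BT operator appears while a previous BT is still open,
    // that is, before the matching ET. Naive whitespace tokenisation only.
    static boolean hasNestedTextObject(String contentStream) {
        boolean inTextObject = false;
        for (String token : contentStream.trim().split("\\s+")) {
            if (token.equals("BT")) {
                if (inTextObject) {
                    return true; // second BT without an ET in between
                }
                inTextObject = true;
            } else if (token.equals("ET")) {
                inTextObject = false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shape of the stream received directly from iText (array contents elided)
        String intact = "q 0.24 0 0 0.24 113.7055 548.04 cm BT 0.0116 Tc "
                + "45 0 0 45 0 0 Tm /TT5 1 Tf 0 Tc 0 Tw [] TJ ET Q";
        // Shape of the stream found in output.pdf
        String damaged = "BT 1 0 0 1 113.3 548.5 Tm 0 Tc BT 1 0 0 1 0 0 Tm 0 Tc [] TJ ET";

        System.out.println(hasNestedTextObject(intact));  // false
        System.out.println(hasNestedTextObject(damaged)); // true
    }
}
```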

Split a "tagged" PDF document into multiple documents, keeping the tagging

In a project I have to split a PDF document into two documents, one containing all blank pages, and one containing all pages with content.
For this job, I use a PdfReader to read the source file, and two PdfCopy objects (one for the blank-pages document, one for the document with the content pages) to write the files to.
I use GetImportedPage to read a PdfImportedPage, which is then added to one of the PdfCopy writers.
Now, the problem is the following: the source file is using the "tagged PDF format". To preserve this (which is absolutely required), I use the SetTagged() method on both PdfCopy writers, and use the extra third parameter in GetImportedPage(...) to keep the tagged format. However, when calling the AddPage(...) on the PdfCopy writer, I get an invalid cast exception:
"Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' to type 'iTextSharp.text.pdf.PRIndirectReference'."
Does anyone have any ideas on how to solve this? Any hints?
Also: the project currently references version 5.1.0.0 of the iText libraries. In 5.4.4.0 the third parameter of GetImportedPage no longer seems to exist.
Below, you can find a code extract:
iTextSharp.text.Document targetPdf = new iTextSharp.text.Document();
iTextSharp.text.Document blankPdf = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfReader sourcePdfReader = new iTextSharp.text.pdf.PdfReader(inputFile);
iTextSharp.text.pdf.PdfCopy targetPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(targetPdf, new FileStream(outputFile, FileMode.Create));
iTextSharp.text.pdf.PdfCopy blankPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(blankPdf, new FileStream(blanksFile, FileMode.Append));
targetPdfWriter.SetTagged();
blankPdfWriter.SetTagged();
try
{
iTextSharp.text.pdf.PdfImportedPage page = null;
int n = sourcePdfReader.NumberOfPages;
targetPdf.Open();
blankPdf.Open();
blankPdf.Add(new iTextSharp.text.Phrase("This document contains the blank pages removed from " + inputFile));
blankPdf.NewPage();
for (int i = 1; i <= n; i++)
{
byte[] pageBytes = sourcePdfReader.GetPageContent(i);
string pageText = "";
iTextSharp.text.pdf.PRTokeniser token = new iTextSharp.text.pdf.PRTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(pageBytes));
while (token.NextToken())
{
if (token.TokenType == iTextSharp.text.pdf.PRTokeniser.TokType.STRING)
{
pageText += token.StringValue;
}
}
if (pageText.Length >= 15)
{
page = targetPdfWriter.GetImportedPage(sourcePdfReader, i, true);
targetPdfWriter.AddPage(page);
}
else
{
page = blankPdfWriter.GetImportedPage(sourcePdfReader, i, true);
blankPdfWriter.AddPage(page);
blankPageCount++;
}
}
}
catch (Exception ex)
{
Console.WriteLine("Exception at LOC1: " + ex.Message);
}
The error occurs in the call to targetPdfWriter.AddPage(page); near the end of the code sample.
Thank you very much for your help.
Koen.