Converting XDP to PDF using Live Cycle replaces question marks(?) with space at multiple places. How can that be fixed? - pdf

I have been trying to convert XDP to PDF using Adobe Live Cycle. Most of my forms turn up good. But while converting some of them, I fine ??? replaced in place of blank spaces at certain places. Any suggestions to how can I rectify that?
Below is the code snippet that I am using:
public byte[] generatePDF(TextDocument xdpDocument) {
try {
Assert.notNull(xdpDocument, "XDP Document must be passed.");
Assert.hasLength(xdpDocument.getContent(), "XDPDocument content cannot be null");
// Create a ServiceClientFactory object
ServiceClientFactory myFactory = ServiceClientFactory.createInstance(createConnectionProperties());
// Create an OutputClient object
FormsServiceClient formsClient = new FormsServiceClient(myFactory);
formsClient.resetCache();
String text = xdpDocument.getContent();
String charSet = xdpDocument.getCharsetName();
if (charSet == null || charSet.trim().length() == 0) {
charSet = StandardCharsets.UTF_8.name();
}
byte[] bytes = text.getBytes();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
Document inTemplate = new Document(byteArrayInputStream);
// Set PDF run-time options
// Set rendering run-time options
RenderOptionsSpec renderOptionsSpec = new RenderOptionsSpec();
renderOptionsSpec.setLinearizedPDF(true);
renderOptionsSpec.setAcrobatVersion(AcrobatVersion.Acrobat_9);
PDFFormRenderSpec pdfFormRenderSpec = new PDFFormRenderSpec();
pdfFormRenderSpec.setGenerateServerAppearance(true);
pdfFormRenderSpec.setCharset("UTF8");
FormsResult formOut = formsClient.renderPDFForm2(inTemplate, null, pdfFormRenderSpec, null, null);
Document xfaPdfOutput = formOut.getOutputContent();
//If the input file is already static PDF, then below method will throw an exception - handle it
OutputClient outClient = new OutputClient(myFactory);
outClient.resetCache();
Document staticPdfOutput = outClient.transformPDF(xfaPdfOutput, TransformationFormat.PDF, null, null, null);
byte[] data = StreamIO.toBytes(staticPdfOutput.getInputStream());
return data;
} catch(IllegalArgumentException ex) {
logger.error("Input validation failed for generatePDF request " + ex.getMessage());
throw new EformsException(ErrorExceptionCode.INPUT_REQUIRED + " - " + ex.getMessage(), ErrorExceptionCode.INPUT_REQUIRED);
} catch (Exception e) {
logger.error("Exception occurred in Adobe Services while generating PDF from xdpDocument..", e);
throw new EformsException(ErrorExceptionCode.PDF_XDP_CONVERSION_EXCEPTION, e);
}
}

I suggest trying 2 things:
Check the font. Switch to something very common like Arial / Times New Roman and see if the characters are still lost
Check the character encoding. It might not be a simple question mark character you are using and if so the character encoding will be important. The easiest way is to make sure your question mark is ascii char 63 (decimal).
I hope that helps.

Related

vb.net stream reader reads from a .accdb and .xml file without an error [duplicate]

How can I test whether a file that I'm opening in C# using FileStream is a "text type" file? I would like my program to open any file that is text based, for example, .txt, .html, etc.
But not open such things as .doc or .pdf or .exe, etc.
In general: there is no way to tell.
A text file stored in UTF-16 will likely look like binary if you open it with an 8-bit encoding. Equally someone could save a text file as a .doc (it is a document).
While you could open the file and look at some of the content all such heuristics will sometimes fail (eg. notepad tries to do this, by careful selection of a few characters notepad will guess wrong and display completely different content).
If you have a specific scenario, rather than being able to open and process anything, you should be able to do much better.
I guess you could just check through the first 1000 (arbitrary number) characters and see if there are unprintable characters, or if they are all ascii in a certain range. If the latter, assume that it is text?
Whatever you do is going to be a guess.
As others have pointed out there is no absolute way to be sure. However, to determine if a file is binary (which can be said to be easier than determining if it is text) some implementations check for consecutive NUL characters. Git apparently just checks the first 8000 chars for a NUL and if it finds one treats the file as binary. See here for more details.
Here is a similar C# solution I wrote that looks for a given number of required consecutive NUL. If IsBinary returns false then it is very likely your file is text based.
public bool IsBinary(string filePath, int requiredConsecutiveNul = 1)
{
const int charsToCheck = 8000;
const char nulChar = '\0';
int nulCount = 0;
using (var streamReader = new StreamReader(filePath))
{
for (var i = 0; i < charsToCheck; i++)
{
if (streamReader.EndOfStream)
return false;
if ((char) streamReader.Read() == nulChar)
{
nulCount++;
if (nulCount >= requiredConsecutiveNul)
return true;
}
else
{
nulCount = 0;
}
}
}
return false;
}
To get the real type of a file, you must check its header, which won't be changed even the extension is modified. You can get the header list here, and use something like this in your code:
using(var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
using(var reader = new BinaryReader(stream))
{
// read the first X bytes of the file
// In this example I want to check if the file is a BMP
// whose header is 424D in hex(2 bytes 6677)
string code = reader.ReadByte().ToString() + reader.ReadByte().ToString();
if (code.Equals("6677"))
{
//it's a BMP file
}
}
}
I have a below solution which works for me.This is general solution which check all types of Binary file.
/// <summary>
/// This method checks whether selected file is Binary file or not.
/// </summary>
public bool CheckForBinary()
{
Stream objStream = new FileStream("your file path", FileMode.Open, FileAccess.Read);
bool bFlag = true;
// Iterate through stream & check ASCII value of each byte.
for (int nPosition = 0; nPosition < objStream.Length; nPosition++)
{
int a = objStream.ReadByte();
if (!(a >= 0 && a <= 127))
{
break; // Binary File
}
else if (objStream.Position == (objStream.Length))
{
bFlag = false; // Text File
}
}
objStream.Dispose();
return bFlag;
}
public bool IsTextFile(string FilePath)
using (StreamReader reader = new StreamReader(FilePath))
{
int Character;
while ((Character = reader.Read()) != -1)
{
if ((Character > 0 && Character < 8) || (Character > 13 && Character < 26))
{
return false;
}
}
}
return true;
}

Using pdfbox - how to get the font from a COSName?

How to get the font from a COSName?
The solution I'm looking for looks somehow like this:
COSDictionary dict = new COSDictionary();
dict.add(fontname, something); // fontname COSName from below code
PDFontFactory.createFont(dict);
If you need more background, I added the whole story below:
I try to replace some string in a pdf. This succeeds (as long as all text is stored in one token). In order to keep the format I like to re-center the text. As far as I understood I can do this by getting the width of the old string and the new one, do some trivial calculation and setting the new position.
I found some inspiration on stackoverflow for replacing https://stackoverflow.com/a/36404377 (yes it has some issues, but works for my simple pdf's. And How to center a text using PDFBox. Unfortunatly this example uses a font constant.
So using the first link's code I get a handling for operator 'TJ' and one for 'Tj'.
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
java.util.List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof Operator)
{
Operator op = (Operator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
// Tj takes one operator and that is the string to display so lets
// update that operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
String replaced = prh.getReplacement(string);
if (!string.equals(replaced))
{ // if changes are there, replace the content
previous.setValue(replaced.getBytes());
float xpos = getPosX(tokens, j);
//if (true) // center the text
if (6 * xpos > page.getMediaBox().getWidth()) // check if text starts right from 1/xth page width
{
float fontsize = getFontSize(tokens, j);
COSName fontname = getFontName(tokens, j);
// TODO
PDFont font = ?getFont?(fontname);
// TODO
float widthnew = getStringWidth(replaced, font, fontsize);
setPosX(tokens, j, page.getMediaBox().getWidth() / 2F - (widthnew / 2F));
}
replaceCount++;
}
}
Considering the code between the TODO tags, I will get the required values from the token list. (yes this code is awful, but for now it let's me concentrate on the main issue)
Having the string, the size and the font I should be able to call the getWidth(..) method from the sample code.
Unfortunatly I run into trouble to create a font from the COSName variable.
PDFont doesn't provide a method to create a font by name.
PDFontFactory looks fine, but requests a COSDictionary. This is the point I gave up and request help from you.
The names are associated with font objects in the page resources.
Assuming you use PDFBox 2.0.x and that page is a PDPage instance, you can resolve the name fontname using:
PDFont font = page.getResources().getFont(fontname);
But the warning from the comments to the questions you reference remain: This approach will work only for very simple PDFs and might even damage other ones.
try {
//Loading an existing document
File file = new File("UKRSICH_Mo6i-Spikyer_z1560-FAV.pdf");
PDDocument document = PDDocument.load(file);
PDPage page = document.getPage(0);
PDResources pageResources = page.getResources();
System.out.println(pageResources.getFontNames() );
for (COSName key : pageResources.getFontNames())
{
PDFont font = pageResources.getFont(key);
System.out.println("Font: " + font.getName());
}
document.close();
}

Split a "tagged" PDF document into multiple documents, keeping the tagging

In a project I have to split a PDF document into two documents, one containing all blank pages, and one containing all pages with content.
For this job, I use a PdfReader to read the source file, and two pdfCopy objects (one for the blank pages document, one for the pages with content document) to write the files to.
I use GetImportedPage to read a PdfImportedPage, which is then added to one of the PdfCopy writers.
Now, the problem is the following: the source file is using the "tagged PDF format". To preserve this (which is absolutely required), I use the SetTagged() method on both PdfCopy writers, and use the extra third parameter in GetImportedPage(...) to keep the tagged format. However, when calling the AddPage(...) on the PdfCopy writer, I get an invalid cast exception:
"Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' to type 'iTextSharp.text.pdf.PRIndirectReference'."
Anyone has any ideas on how to solve this ? Any hints ?
Also: the project currently refers version 5.1.0.0 of the itext libraries. In 5.4.4.0 the third parameter to GetImportedPage does not seem to be there anymore.
Below, you can find a code extract:
iTextSharp.text.Document targetPdf = new iTextSharp.text.Document();
iTextSharp.text.Document blankPdf = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfReader sourcePdfReader = new iTextSharp.text.pdf.PdfReader(inputFile);
iTextSharp.text.pdf.PdfCopy targetPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(targetPdf, new FileStream(outputFile, FileMode.Create));
iTextSharp.text.pdf.PdfCopy blankPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(blankPdf, new FileStream(blanksFile, FileMode.Append));
targetPdfWriter.SetTagged();
blankPdfWriter.SetTagged();
try
{
iTextSharp.text.pdf.PdfImportedPage page = null;
int n = sourcePdfReader.NumberOfPages;
targetPdf.Open();
blankPdf.Open();
blankPdf.Add(new iTextSharp.text.Phrase("This document contains the blank pages removed from " + inputFile));
blankPdf.NewPage();
for (int i = 1; i <= n; i++)
{
byte[] pageBytes = sourcePdfReader.GetPageContent(i);
string pageText = "";
iTextSharp.text.pdf.PRTokeniser token = new iTextSharp.text.pdf.PRTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(pageBytes));
while (token.NextToken())
{
if (token.TokenType == iTextSharp.text.pdf.PRTokeniser.TokType.STRING)
{
pageText += token.StringValue;
}
}
if (pageText.Length >= 15)
{
page = targetPdfWriter.GetImportedPage(sourcePdfReader, i, true);
targetPdfWriter.AddPage(page);
}
else
{
page = blankPdfWriter.GetImportedPage(sourcePdfReader, i, true);
blankPdfWriter.AddPage(page);
blankPageCount++;
}
}
}
catch (Exception ex)
{
Console.WriteLine("Exception at LOC1: " + ex.Message);
}
The error occurs in the call to targetPdfWriter.AddPage(page); near the end of the code sample.
Thank you very much for your help.
Koen.

Split PDF into separate files based on text

I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK imagemagick won't work for this. May itext do this? I never used and it's
pretty complex so would need some hints.
EDIT:
Marked Answer led me to solution. For completeness here my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 1; i < n; i++) {
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found Open Document. Adding additonal page to Document...");
if (removeBlankPages && !isBlankPage(reader, i)){
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}
You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult iText in Action — 2nd Edition the code samples from which are online and searchable by keyword from here.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF in single-page PDFs. In your case you need to split by different criteria. But that only means that in the loop you sometimes have to add more than one imported page (and thus decouple loop index and page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.
Apache PDFBox has a PDFSplit utility that you can run from the command-line.
If you like Python, there's a nice library: PyPDF2. The library is pure python2, BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print input1.getDocumentInfo()
print input1.getNumPages()
text = input1.getPage(0).extractText()
print text.encode("windows-1250", errors='backslashreplacee')
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()
For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.

Attachments not showing up in pdf document - created using pdfbox

I m trying to attach an swf file to a pdf document. Below is my code (excerpted from the pdfbox-examples). while i can see that the file is attached based on the size of the file - with & without the attachment, I can't see / locate it in the pdf document. I do see textual content correctly displayed. Can someone tell me what I m doing wrong & help me fix the issue?
doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage( page );
PDFont font = PDType1Font.HELVETICA_BOLD;
String inputFileName = "sample.swf";
InputStream fileInputStream = new FileInputStream(new File(inputFileName));
PDEmbeddedFile ef = new PDEmbeddedFile(doc, fileInputStream );
PDPageContentStream contentStream = new PDPageContentStream(doc, page,true,true);
//embedded files are stored in a named tree
PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
//first create the file specification, which holds the embedded file
PDComplexFileSpecification fs = new PDComplexFileSpecification();
fs.setEmbeddedFile(ef);
//now lets some of the optional parameters
ef.setSubtype( "swf" );
ef.setCreationDate( new GregorianCalendar() );
//now add the entry to the embedded file tree and set in the document.
Map<String, COSObjectable> efMap = new HashMap<String, COSObjectable>();
efMap.put("My first attachment", fs );
efTree.setNames( efMap );
//attachments are stored as part of the "names" dictionary in the document catalog
PDDocumentNameDictionary names = new PDDocumentNameDictionary( doc.getDocumentCatalog() );
names.setEmbeddedFiles( efTree );
doc.getDocumentCatalog().setNames( names );
After struggling with the same thing, I've discovered this is a known issue. Attachments haven't worked for a while I guess.
Here's a link to the issue on the apache forum.
There is a hack suggested here that you can use.
I tried it and it worked!
the other work around i found is after you call setNames on your PDEmbeddedFilesNameTreeNode remove the limits: ((COSDictionary
)efTree.getCOSObject()).removeItem(COSName.LIMITS); ugly hack, but it
works, without having to recompile pdfbox
Attachment works fine with new version of PDFBox 2.0,
public static boolean addAtachement(final String fileName, final String... attachements) {
if (Objects.isNull(fileName)) {
throw new NullPointerException("fileName shouldn't be null");
}
if (Objects.isNull(attachements)) {
throw new NullPointerException("attachements shouldn't be null");
}
Map<String, PDComplexFileSpecification> efMap = new HashMap<String, PDComplexFileSpecification>();
/*
* Load PDF Document.
*/
try (PDDocument doc = PDDocument.load(new File(fileName))) {
/*
* Attachments are stored as part of the "names" dictionary in the
* document catalog
*/
PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
/*
* First we need to get all the existed attachments, after that we
* can add new attachments
*/
PDEmbeddedFilesNameTreeNode efTree = names.getEmbeddedFiles();
if (Objects.isNull(efTree)) {
efTree = new PDEmbeddedFilesNameTreeNode();
}
Map<String, PDComplexFileSpecification> existedNames = efTree.getNames();
if (existedNames == null || existedNames.isEmpty()) {
existedNames = new HashMap<String, PDComplexFileSpecification>();
}
for (String attachement : attachements) {
/*
* Create the file specification, which holds the embedded file
*/
PDComplexFileSpecification fs = new PDComplexFileSpecification();
fs.setFile(attachement);
try (InputStream is = new FileInputStream(attachement)) {
/*
* This represents an embedded file in a file specification
*/
PDEmbeddedFile ef = new PDEmbeddedFile(doc, is);
/* Set some relevant properties of embedded file */
ef.setCreationDate(new GregorianCalendar());
fs.setEmbeddedFile(ef);
/*
* now add the entry to the embedded file tree and set in
* the document.
*/
efMap.put(attachement, fs);
}
}
efTree.setNames(efMap);
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
doc.save(fileName);
return true;
} catch (IOException e) {
System.out.println(e.getMessage());
return false;
}
}
To 'locate' or see an attached file in the PDF, you can't flip through its pages to find any trace of it there (like, an annotation).
In Acrobat Reader 9.x for example, you have to click on the "View Attachments" icon (looking like a paper-clip) on the left sidebar.