PDFBox - include multiple color profiles during conversion to PDF/A

We are currently trying to merge multiple PDFs and create a PDF/A (1B) from the result.
We face a problem when we try to fix the color profiles. The PDFs we receive have no embedded color profiles, so PDFBox's merge does not carry over any OutputIntents. As the last step we therefore try to add the color profiles ourselves.
If we do not add any color profile, we get validation issues for RGB and CMYK. If we add both color profiles to the PDDocumentCatalog, then only the validation issues for the first one are gone. So if we add RGB first, we only get CMYK validation issues and vice versa.
Here is a part of the code when we add the color profiles:
public void convertToPDFA(PDDocument doc, String file){
    PDMetadata metadata = new PDMetadata(doc);
    PDDocumentCatalog cat = doc.getDocumentCatalog();
    cat.setMetadata(metadata);
    // do metadata stuff, just removed it for now
    InputStream colorProfile = PDFService.class.getResourceAsStream("/pdfa/sRGB Color Space Profile.icm");
    PDOutputIntent oi = new PDOutputIntent(doc, colorProfile);
    oi.setInfo("sRGB IEC61966-2.1");
    oi.setOutputCondition("sRGB IEC61966-2.1");
    oi.setOutputConditionIdentifier("sRGB IEC61966-2.1");
    oi.setRegistryName("http://www.color.org");
    cat.addOutputIntent(oi);
}
This is the code for RGB, we also add another *.icm color profile for CMYK.
So the color profiles themselves seem to be fine, because the validation issues disappear for whichever one we add first.
It feels like we are just missing a small detail that would make both color profiles accepted. Or could it be that only one color profile can be used when creating a PDF/A?
Thanks in advance and kind regards

Only a single output intent is allowed, see here. An alternative is also mentioned there, which would be to use only ICC based colorspaces.
What should be possible (although beyond the scope of the question) is to assign ICC profiles to /DeviceGray, /DeviceRGB, or /DeviceCMYK by adding DefaultGray, DefaultRGB, or DefaultCMYK entries to the ColorSpace subdictionary of the resource dictionary, as explained in section 8.6.5.6 of the PDF specification:
When a device colour space is selected, the ColorSpace subdictionary
of the current resource dictionary (see 7.8.3, "Resource
Dictionaries") is checked for the presence of an entry designating a
corresponding default colour space (DefaultGray, DefaultRGB, or
DefaultCMYK, corresponding to DeviceGray, DeviceRGB, or DeviceCMYK,
respectively). If such an entry is present, its value shall be used as
the colour space for the operation currently being performed.
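To make the quoted rule concrete, here is a minimal sketch of the lookup in plain Java. It is a toy model, not PDFBox code: the class name DefaultColourSpaceLookup, the resolve method, and the label "ICCBased sRGB" are made up for illustration, and the ColorSpace subdictionary is represented as a simple map.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the default colour space rule quoted above (not PDFBox API). */
final class DefaultColourSpaceLookup {

    // Device colour space name -> name of the entry that can override it.
    private static final Map<String, String> DEFAULT_KEYS = new HashMap<>();
    static {
        DEFAULT_KEYS.put("DeviceGray", "DefaultGray");
        DEFAULT_KEYS.put("DeviceRGB", "DefaultRGB");
        DEFAULT_KEYS.put("DeviceCMYK", "DefaultCMYK");
    }

    /**
     * Returns the colour space that is actually used when "selected" is chosen,
     * given the ColorSpace subdictionary of the current resource dictionary.
     */
    static String resolve(String selected, Map<String, String> colorSpaceDict) {
        String defaultKey = DEFAULT_KEYS.get(selected);
        if (defaultKey != null && colorSpaceDict.containsKey(defaultKey)) {
            return colorSpaceDict.get(defaultKey);
        }
        return selected; // no default entry, or not a device colour space
    }

    public static void main(String[] args) {
        Map<String, String> colorSpaceDict = new HashMap<>();
        colorSpaceDict.put("DefaultRGB", "ICCBased sRGB"); // hypothetical label
        System.out.println(resolve("DeviceRGB", colorSpaceDict));
        System.out.println(resolve("DeviceCMYK", colorSpaceDict));
    }
}
```

In this model, content that selects DeviceRGB is redirected to the ICC-based space because a DefaultRGB entry exists, while DeviceCMYK stays a device colour space and would still have to be covered by the single output intent or by its own DefaultCMYK entry.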
Be aware that making a PDF file PDF/A-1b conformant is often trickier than just adding output intents. Check your file with PDFBox preflight or with the online validator from PDF Tools; there are many possible errors. This is why there are products from Callas Software or PDF Tools that convert PDF files to PDF/A.

Problem with line breaks in PDF document generated by BIRT

I have some cell texts in a BIRT report which do not flow as nicely as I hoped.
For example,
The text is Long value resultwithaverylongname whichcannotbreak and I had hoped that it would be displayed like this:
Long value
resultwithaverylongname
whichcannotbreak
The render options are as follows:
renderOptions.setOutputFormat(IPDFRenderOption.OUTPUT_FORMAT_PDF);
renderOptions.setOption(IPDFRenderOption.PAGE_OVERFLOW, IPDFRenderOption.OUTPUT_TO_MULTIPLE_PAGES);
renderOptions.setOption(IPDFRenderOption.PDF_TEXT_WRAPPING, true);
renderOptions.setOption(IPDFRenderOption.PDF_WORDBREAK, true);
It seems to me that my desired output is physically possible but I don't know why BIRT does not break on a whitespace and breaks in the middle of the word.
I am using BIRT 4.16 (from Sourceforge). The texts contain normal whitespace (no non-breakable spaces) and are displayed via a data object.
3.Sep.21
I now have an example project which I am trying to commit to Github. In the meantime here is a screenshot showing breaks which look good and others which are not...
The git repo is here: https://github.com/pramsden/test.wordbreak
If the text "resultwithaverylongname" physically fits, then you are right:
BIRT should not break it in the middle of the word.
Your renderOptions seem right (depending on which BIRT version you are using).
At first glance this looks like a bug.
But: In German language, we often have quite long words, and I've created a lot of (complex) PDF reports with BIRT, but I never saw this issue.
So I guess it is a tiny silly detail which causes this.
Just to double-check:
Are the spaces between "Long", "value", "result..." normal spaces (0x20)? or non-breaking spaces?
Which BIRT release are you using?
Are you using a data item or a dynamic text item and if so, is it HTML or plain text?
Can you create a reproducible simple test case and post the rptdesign file somewhere?
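One way to answer the first of these questions without guessing is to scan the text for characters that render like a space but forbid a line break. The following is a small sketch in plain Java, independent of BIRT; the class and method names are hypothetical, and the list of characters checked is an assumption about what is likely to show up in report data.

```java
import java.util.ArrayList;
import java.util.List;

final class SpaceCheck {

    /** Returns the indexes of characters that look like a space but do not allow a line break. */
    static List<Integer> nonBreakingSpaceIndexes(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+202F NARROW NO-BREAK SPACE
            if (c == '\u00A0' || c == '\u2007' || c == '\u202F') {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String cellText = "Long value resultwithaverylongname whichcannotbreak";
        // An empty list means every space in the text is a normal, breakable one.
        System.out.println(nonBreakingSpaceIndexes(cellText));
    }
}
```

Running a check like this on the raw value bound to the data item would show whether the data itself rules out breaking at the spaces.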
Well, I don't use BIRT, but try using a newline (\n).
In my case I use the PDFFlow library to generate PDF documents, and to make a line break I just use \n.
This is a simple example that creates a PDF file and uses a line break:
DocumentBuilder.New()
    .AddSection()
    .AddParagraphToSection("Hello world! \n go to the next line")
    .ToDocument()
    .Build("Result.PDF");
Try it and tell me if it works.

Edit a Mainframe file in the RecordEditor without a copybook

How do you edit a (binary EBCDIC) Mainframe file in the RecordEditor without a Cobol Copybook?
How do you generate Java code to read the file using the RecordEditor?
Note: This is an attempt to split a question that is far too broad to answer meaningfully
into a series of simpler questions and answers.
Try to avoid editing a binary file without a Cobol Copybook if at all possible. This should only be attempted as a last resort!
Try to get that Cobol copybook (or some field layout document) for the file!
Some general advice:
It is feasible when dealing with 10 to 20 fields in a record, but not if there are thousands of fields in a record.
Take your time and do not rush the process. Try to get each step correct before moving on.
Finally, upgrade to the latest version of the RecordEditor (currently 0.98.4).
This process will also work for normal text files.
RecordEditor Layout Wizard
To start the wizard select option Record Layouts >>> Layout Wizard.
File Structure screen
The file structure screen has 3 purposes:
Get the File structure - It could be Fixed Width, VB, Windows/Unix Text file
Get the Record-Length (if it is a fixed width file).
Get the font (character-set / encoding)
The RecordEditor will try and work this out for you
Field Selection Screen
The RecordEditor will try and work out where fields start and end but
it is not perfect. You need to carefully check and correct its choices
On this screen, the fields are displayed in alternating colors
you create/delete a field by clicking on
use the Clear Fields button to clear all the fields
you can change which field types to search for using the various check boxes (e.g. Mainframe Zoned Decimal)
The Add Fields button will do another field search
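For readers unfamiliar with the Mainframe Zoned Decimal field type mentioned above, here is a rough sketch of how such a field decodes, which may help when checking the wizard's field boundaries by eye. This is not RecordEditor or JRecord code; the class name is hypothetical, and the sketch assumes the common EBCDIC convention where the low nibble of every byte is a digit and the high nibble of the last byte carries the sign (only D is treated as negative here).

```java
final class ZonedDecimal {

    /** Decodes an EBCDIC zoned-decimal field (sign in the last byte) into a long. */
    static long decode(byte[] field) {
        if (field.length == 0) {
            throw new IllegalArgumentException("empty field");
        }
        long value = 0;
        for (byte b : field) {
            int digit = b & 0x0F; // low nibble holds the digit
            if (digit > 9) {
                throw new IllegalArgumentException("not a digit nibble: " + digit);
            }
            value = value * 10 + digit;
        }
        // High nibble of the last byte is the sign: D = negative, C or F = positive.
        int sign = (field[field.length - 1] >> 4) & 0x0F;
        return sign == 0x0D ? -value : value;
    }

    public static void main(String[] args) {
        byte[] positive = {(byte) 0xF1, (byte) 0xF2, (byte) 0xC3};
        byte[] negative = {(byte) 0xF1, (byte) 0xF2, (byte) 0xD3};
        System.out.println(decode(positive));
        System.out.println(decode(negative));
    }
}
```

A run of bytes in the range F0 to F9 that ends in a byte starting with C or D is therefore a good hint that a numeric field ends at that position.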
Field Definition screen
On this screen you define the field names and types. You may need to go back to the Field Selection screen to adjust the fields.
Editing the file
Once the Record Layout has been defined, it can be used on the open file screen
Generating Java code
When editing your file, you can generate java~JRecord code to read the file
by selecting Generate >>> Java >>> ....
You can then enter a package-id and the generate options,
and finally your sample Java code to read / write the
file is generated.

how can i export DataGridView with ARABIC data from Visual Basic to PDF by using iTextSharp [duplicate]

I have a problem with inserting UNICODE characters in a PDF file in eclipse.
There is a solution for this, but it is not very efficient for me.
The solution is like this.
document.add(new Paragraph("Unicode: \u0418", new Font(bfComic, 12)));
I want to retrieve data from a database and show them to the user and my characters are in Arabic script and sometimes in Farsi script.
What solution do you suggest?
thanks
You are experiencing different problems:
Encoding of the data:
Please download chapter 2 of my book and go to section 2.2.2 entitled "The Phrase object: a List of Chunks with leading". In this section, look for the title "Database encoding versus the default CharSet used by the JVM".
You will see that database values are retrieved like this:
String name1 = new String(rs.getBytes("given_name"), "UTF-8");
That’s because the database contains different names with special characters. You risk that these special characters are displayed as gibberish if you would retrieve the field like this:
String name2 = rs.getString("given_name");
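The effect of decoding with the wrong character set can be reproduced without a database. The sketch below is only an illustration in plain Java: the byte array stands in for what rs.getBytes would return, the class and method names are hypothetical, and ISO-8859-1 is just an assumed example of a mismatched charset. Whether rs.getString actually misbehaves depends on the driver and database configuration.

```java
import java.nio.charset.StandardCharsets;

final class DecodeDemo {

    /** Decodes the raw column bytes as UTF-8, as in the name1 line above. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes with a mismatched single-byte charset. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Arabic letters LAM (U+0644) and WAW (U+0648), stored as UTF-8 bytes.
        byte[] raw = "\u0644\u0648".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(raw).length());   // two characters survive intact
        System.out.println(decodeLatin1(raw).length()); // one character per byte: gibberish
    }
}
```

Each Arabic letter occupies two bytes in UTF-8, so the mismatched decoder produces four unrelated characters instead of two letters.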
Encoding of the font:
You create your font like this:
Font font = new Font(bfComic, 12);
You don't show how you create bfComic, but I assume that this object is a BaseFont object using IDENTITY_H as encoding.
Writing from right to left / making ligatures
Although your code will work to show a single character, it won't work to show a sentence correctly.
Suppose that name1 is the Arabic version of the name "Lawrence of Arabia" and that we want to write this name to a PDF. This is done three times in the following screen shot:
The first line is wrong, because the characters are in the wrong order. They are written from left to right whereas they should be written from right to left. This is what will happen when you do:
document.add(new Paragraph(name1, font));
Even if the encoding is correct, you're rendering the text incorrectly.
The second line is also wrong. The characters are now in the correct order, but no ligatures are made: ل followed by و should be combined into a single glyph: لو
You can only achieve this by adding the content to a ColumnText or PdfPCell object, and by setting the run direction to PdfWriter.RUN_DIRECTION_RTL. For instance:
pdfCell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
Now the text will be rendered correctly.
This is explained in chapter 11 of my book. You can find a full example here: Ligatures2

PDF acroform fields become non editable in Adobe reader after writing to it using Pdfbox APIs

I am reading a PDF which has editable fields and the fields can be edited by opening it through Adobe Reader. I am using PDFBox APIs to generate an output PDF with data filled for the editable fields in input PDF. The output PDF can be opened using Adobe Reader and I am able to see the field values but I am unable to edit those fields directly from Adobe reader.
There is also a JIRA ticket for this issue and it is unresolved according to this link :
https://issues.apache.org/jira/browse/PDFBOX-1121
Can anybody please tell me if this got resolved? Also, if possible please answer the following questions related to my question:
Is there any protection policy or access permission that I need to explicitly set in order to edit the output PDF from Adobe reader?
Every time I open the PDF that was written to using pdfbox APIs, I get this message prompt:
" The document has been changed since it was created and use of extended features is no longer available...."
I am using PdfBox 1.8.6 jar and Adobe reader 11.0.8. I would really appreciate if anybody could help me with this issue.
Code snippet added to aid responders in debugging :
String outputFileNameWithPath = "C:\\myfolder\\testop.pdf";
PDDocument pdf = PDDocument.load(outputFileNameWithPath);
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
// The map pdfValues is a collection of the data that I need to set in the PDF.
// I am unable to go into the details of my data source.
// The key in the data map corresponds to the PDField's expanded name and the value
// corresponds to the data that I am trying to set.
// Iterate over all data and see if the PDF has a matching field.
for (Map.Entry<String, String> currentEntry : pdfValues.entrySet()) {
    String name = currentEntry.getKey();
    String value = currentEntry.getValue();
    // Look the field up afresh for every entry so a stale match is never reused.
    PDField field = null;
    if (name != null) {
        name = CommonUtils.fromSchemaNameToPdfName(name);
        field = acroForm.getField(name);
    }
    if (field != null && value != null) {
        field.setValue(value); // setting the values once field is found.
    }
}
// Set access permissions / encryption here before saving
pdf.save(outputFileNameWithPath);
Thanks.
The document has been changed since it was created and use of extended features is no longer available....
This indicates that the original form has been Reader-enabled, i.e. an integrated Usage-Rights digital signature has been applied to the document using a private key held by Adobe which tells the Adobe Reader that it shall make some extra functionality available to the user viewing that form.
If you don't want to break that signature during form fill-ins with PDFBox, you need to make sure that you
don't make any changes other than form fill-ins, and
save the changes as an incremental update.
If you provided your form fill-in code and your source PDF, this could be analyzed in more detail.

docx4j Differencer Showing More Differences Than Expected

I have two documents:
Document 1 (input)
Document 2 (output)
Document 2 is the result of passing Document 1 through a transformation process which leaves any content and formatting intact (verified by side-by-side compare in Word).
However, the process removes many id numbers from the .docx files.
For example,
<w:p w:rsidP="00B600D6" w:rsidR="00F55D78" w:rsidRDefault="00B600D6">
becomes
<w:p>
according to a dump of each document via the following code:
Body body = ((Document)newerPackage.getMainDocumentPart().getJaxbElement()).getBody();
Node node = org.docx4j.XmlUtils.marshaltoW3CDomDocument(body).getDocumentElement();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.transform(new DOMSource(node),
new StreamResult(new OutputStreamWriter(System.out, "UTF-8")));
Using the docx4j Differencer comparison method recommended here, everything (except the first line which has no formatting applied) is shown as a modification.
The question is: are the diffs a result of the missing ids, the formatting, or something else?
In case it's important, we're using docx4j in this context to perform automated sanity/regression tests on our round-tripping process (i.e. apply the "loss-less" process and expect no differences).
Disclosure: I work on docx4j
If the only difference between paragraphs is the rsid attributes, they will still be detected as different.
You could "clean" the documents before performing the comparison, so that neither docx has rsid attributes. See the Filter sample.
By the way, an easier way to see the XML for an object (eg a single paragraph, or the entire body) is to use XmlUtils.marshaltoString
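The Filter sample is the docx4j route for cleaning. Purely as an illustration of the same idea, the rsid attributes can also be stripped from a W3C DOM tree with nothing but the JDK. This is a sketch under assumptions, not docx4j code: the class RsidStripper and its methods are hypothetical, and it assumes a namespace-aware DOM in which the attributes live in the WordprocessingML main namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RsidStripper {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses an XML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Removes every w:rsid* attribute below (and on) the element; returns how many were removed. */
    static int stripRsids(Element element) {
        int removed = 0;
        NamedNodeMap attrs = element.getAttributes();
        // Walk backwards because the attribute map is live and shrinks on removal.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            Attr attr = (Attr) attrs.item(i);
            if (W_NS.equals(attr.getNamespaceURI()) && attr.getLocalName().startsWith("rsid")) {
                element.removeAttributeNode(attr);
                removed++;
            }
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                removed += stripRsids((Element) child);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\" w:rsidP=\"00B600D6\" w:rsidR=\"00F55D78\""
                + " w:rsidRDefault=\"00B600D6\"><w:r><w:t>text</w:t></w:r></w:p>";
        Element paragraph = parse(xml).getDocumentElement();
        System.out.println(stripRsids(paragraph) + " rsid attributes removed");
    }
}
```

Applying the same cleaning to both the input and the output document before handing them to the Differencer should leave only genuine content or formatting differences to be reported.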