html to pdf convert, cyrillic characters not displayed properly - pdf

I have a problem with pdf fonts. I have used a method for generating pdf from html which worked fine on my local machine which is windows OS, but now on linux Cyrillic text is displayed with question marks. I checked for fonts there but it turned out that there were required fonts. Now I switched to another method which is shown below.
Document document = new Document(PageSize.A4);
String myFontsDir = "C:\\";
String filePath = AppProperties.downloadLocation + "Order_" + orderID + ".pdf";
try {
OutputStream file = new FileOutputStream(new File(filePath));
PdfWriter writer = PdfWriter.getInstance(document, file);
int iResult = FontFactory.registerDirectory(myFontsDir);
if (iResult == 0) {
System.out.println("TestPDF(): Could not register font directory " + myFontsDir);
} else {
System.out.println("TestPDF(): Registered font directory " + myFontsDir);
}
document.open();
String htmlContent = "<html><head>"
+ "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=UTF-8\"/>"
+ "</head>"
+ "<body>"
+ "<h4 style=\"font-family: arialuni, arial; font-size:16px; font-weight: normal; \" >"
+ "Здраво Kristijan!"
+ "</h4></body></html>";
InputStream inf = new ByteArrayInputStream(htmlContent.getBytes("UTF-8"));
XMLWorkerFontProvider fontImp = new XMLWorkerFontProvider(myFontsDir);
FontFactory.setFontImp(fontImp);
XMLWorkerHelper.getInstance().parseXHtml(writer, document, inf, null, null, fontImp);
document.close();
System.out.println("Done.");
} catch (Exception e) {
e.printStackTrace();
}
with this peace of code I am able to generate proper pdf from latin text, but cyrillic is displayed with weird characters. This happens on Windows, I haven't yet test it on Linux. Any advice for encoding or font?
Thanks in advance

First this: it is very hard to believe that your font directory is C:\\. You are assuming that you have a file with path C:\\arialuni.ttf whereas I assume that the path to MS Arial Unicode is C:\\windows\fonts\arialuni.ttf.
Secondly: I don't think arialuni is the correct name. I'm pretty sure it's arial unicode ms. You can check this by running this code:
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register("c:/windows/fonts/arialuni.ttf");
for (String s : fontProvider.getRegisteredFamilies()) {
System.out.println(s);
}
The output should be:
courier
arial unicode ms
zapfdingbats
symbol
helvetica
times
times-roman
These are the values you can use; arialuni isn't one of them.
Also: aren't you defining the character set in the wrong place?
I have slightly adapted your source code in the sense that I stored the HTML in an HTML file cyrillic.html:
<html>
<head>
<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8"/>
</head>
<body>
<h4 style="font-family: Arial Unicode MS, FreeSans; font-size:16px; font-weight: normal; " >Здраво Kristijan!</h4>
</body>
</html>
Note that I replaced arialuni with Arial Unicode MS and that I used FreeSans as an alternative font. In my code, I used FreeSans.ttf instead of arialttf.
See ParseHtml11:
public static final String DEST = "results/xmlworker/cyrillic.pdf";
public static final String HTML = "resources/xml/cyrillic.html";
public static final String FONT = "resources/fonts/FreeSans.ttf";
public void createPdf(String file) throws IOException, DocumentException {
// step 1
Document document = new Document();
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
// step 3
document.open();
// step 4
XMLWorkerFontProvider fontImp = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontImp.register(FONT);
FontFactory.setFontImp(fontImp);
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML), null, Charset.forName("UTF-8"), fontImp);
// step 5
document.close();
}
As you can see, I use the Charset when parsing the HTML. The result looks like this:
If you insist on using Arial Unicode, just replace this line:
public static final String FONT = "resources/fonts/FreeSans.ttf";
With this one:
public static final String FONT = "c:/windows/fonts/arialuni.ttf";
I have tested this on a Windows machine and it works too:

Related

allow arabic text in pdf table using itext7 (xamarin android)

I have to put my list data in a table in a pdf file. My data has some Arabic words. When my pdf is generated, the Arabic words don't appear. I searched and found that I need itext7.pdfcalligraph so I installed it in my app. I found this code too https://itextpdf.com/en/blog/technical-notes/displaying-text-different-languages-single-pdf-document and tried to do something similar to allow Arabic words in my table but I couldn't figure it out.
This is a trial code before I apply it to my real list:
var path2 = global::Android.OS.Environment.ExternalStorageDirectory.AbsolutePath;
filePath = System.IO.Path.Combine(path2.ToString(), "myfile2.pdf");
stream = new FileStream(filePath, FileMode.Create);
PdfWriter writer = new PdfWriter(stream);
PdfDocument pdf2 = new iText.Kernel.Pdf.PdfDocument(writer);
Document document = new Document(pdf2, PageSize.A4);
FontSet set = new FontSet();
set.AddFont("ARIAL.TTF");
document.SetFontProvider(new FontProvider(set));
document.SetProperty(Property.FONT, "Arial");
string[] sources = new string[] { "يوم","شهر 2020" };
iText.Layout.Element.Table table = new iText.Layout.Element.Table(2, false);
foreach (string source in sources)
{
Paragraph paragraph = new Paragraph();
Bidi bidi = new Bidi(source, Bidi.DirectionDefaultLeftToRight);
if (bidi.BaseLevel != 0)
{
paragraph.SetTextAlignment(iText.Layout.Properties.TextAlignment.RIGHT);
}
paragraph.Add(source);
table.AddCell(new Cell(1, 1).SetTextAlignment(iText.Layout.Properties.TextAlignment.CENTER).Add(paragraph));
}
document.Add(table);
document.Close();
I updated my code and added the arial.ttf to my assets folder . i'm getting the following exception:
System.InvalidOperationException: 'FontProvider and FontSet are empty. Cannot resolve font family name (see ElementPropertyContainer#setFontFamily) without initialized FontProvider (see RootElement#setFontProvider).'
and I still can't figure it out. any ideas?
thanks in advance
- C #
I have a similar situation for Turkish characters, and I've followed these
steps :
Create a folder under projects root folder which is : /wwwroot/Fonts
Add OpenSans-Regular.ttf under the Fonts folder
Path for font is => ../wwwroot/Fonts/OpenSans-Regular.ttf
Create font like below :
public static PdfFont CreateOpenSansRegularFont()
{
var path = "{Your absolute path for FONT}";
return PdfFontFactory.CreateFont(path, PdfEncodings.IDENTITY_H, true);
}
and use it like :
paragraph.Add(source)
.SetFont(FontFactory.CreateOpenSansRegularFont()); //set font in here
table.AddCell(new Cell(1, 1)
.SetTextAlignment(iText.Layout.Properties.TextAlignment.CENTER)
.Add(paragraph));
This is how I used font factory for Turkish characters ex: "ü,i,ç,ş,ö"
For Xamarin-Android, you could try
string documentsPath = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
var path = Path.Combine(documentsPath, "Fonts/Arial.ttf");
Look i fixed it in java,this may help you:
String font = "your Arabic font";
//the magic is in the next 4 lines:
PdfFontFactory.register(font);
FontProgram fontProgram = FontProgramFactory.createFont(font, true);
PdfFont f = PdfFontFactory.createFont(fontProgram, PdfEncodings.IDENTITY_H);
LanguageProcessor languageProcessor = new ArabicLigaturizer();
//and look here how i used setBaseDirection and don't use TextAlignment ,it will work without it
com.itextpdf.kernel.pdf.PdfDocument tempPdfDoc = new com.itextpdf.kernel.pdf.PdfDocument(new PdfReader(pdfFile.getPath()), TempWriter);
com.itextpdf.layout.Document TempDoc = new com.itextpdf.layout.Document(tempPdfDoc);
com.itextpdf.layout.element.Paragraph paragraph0 = new com.itextpdf.layout.element.Paragraph(languageProcessor.process("الاستماره الالكترونية--الاستماره الالكترونية--الاستماره الالكترونية--الاستماره الالكترونية"))
.setFont(f).setBaseDirection(BaseDirection.RIGHT_TO_LEFT)
.setFontSize(15);

Converting XDP to PDF using Live Cycle replaces question marks(?) with space at multiple places. How can that be fixed?

I have been trying to convert XDP to PDF using Adobe Live Cycle. Most of my forms turn up good. But while converting some of them, I fine ??? replaced in place of blank spaces at certain places. Any suggestions to how can I rectify that?
Below is the code snippet that I am using:
public byte[] generatePDF(TextDocument xdpDocument) {
try {
Assert.notNull(xdpDocument, "XDP Document must be passed.");
Assert.hasLength(xdpDocument.getContent(), "XDPDocument content cannot be null");
// Create a ServiceClientFactory object
ServiceClientFactory myFactory = ServiceClientFactory.createInstance(createConnectionProperties());
// Create an OutputClient object
FormsServiceClient formsClient = new FormsServiceClient(myFactory);
formsClient.resetCache();
String text = xdpDocument.getContent();
String charSet = xdpDocument.getCharsetName();
if (charSet == null || charSet.trim().length() == 0) {
charSet = StandardCharsets.UTF_8.name();
}
byte[] bytes = text.getBytes();
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
Document inTemplate = new Document(byteArrayInputStream);
// Set PDF run-time options
// Set rendering run-time options
RenderOptionsSpec renderOptionsSpec = new RenderOptionsSpec();
renderOptionsSpec.setLinearizedPDF(true);
renderOptionsSpec.setAcrobatVersion(AcrobatVersion.Acrobat_9);
PDFFormRenderSpec pdfFormRenderSpec = new PDFFormRenderSpec();
pdfFormRenderSpec.setGenerateServerAppearance(true);
pdfFormRenderSpec.setCharset("UTF8");
FormsResult formOut = formsClient.renderPDFForm2(inTemplate, null, pdfFormRenderSpec, null, null);
Document xfaPdfOutput = formOut.getOutputContent();
//If the input file is already static PDF, then below method will throw an exception - handle it
OutputClient outClient = new OutputClient(myFactory);
outClient.resetCache();
Document staticPdfOutput = outClient.transformPDF(xfaPdfOutput, TransformationFormat.PDF, null, null, null);
byte[] data = StreamIO.toBytes(staticPdfOutput.getInputStream());
return data;
} catch(IllegalArgumentException ex) {
logger.error("Input validation failed for generatePDF request " + ex.getMessage());
throw new EformsException(ErrorExceptionCode.INPUT_REQUIRED + " - " + ex.getMessage(), ErrorExceptionCode.INPUT_REQUIRED);
} catch (Exception e) {
logger.error("Exception occurred in Adobe Services while generating PDF from xdpDocument..", e);
throw new EformsException(ErrorExceptionCode.PDF_XDP_CONVERSION_EXCEPTION, e);
}
}
I suggest trying 2 things:
Check the font. Switch to something very common like Arial / Times New Roman and see if the characters are still lost
Check the character encoding. It might not be a simple question mark character you are using and if so the character encoding will be important. The easiest way is to make sure your question mark is ascii char 63 (decimal).
I hope that helps.

When using iText to generate a PDF, if I need to switch fonts many times the file size becomes too large

I have a section of my PDF in which I need to use one font for its unicode symbol and the rest of the paragraph should be a different font. (It is something like "1. a 2. b 3. c" where "1." is the unicode symbol/font and "a" is another font) I have followed the method Bruno describes here: iText 7: How to build a paragraph mixing different fonts? and it works fine to generate the PDF. The issue is that the file size of the PDF goes from around 20MB to around 100MB compared to using only one font and one Text element. This section is used repeatedly in the document thousands of times. I am wondering if there is a way to reduce the impact of switching fonts or to reduce the file size of the entire document in some way.
Style creation pseudocode:
Style style1 = new Style();
Style style2 = new Style();
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile1), PdfEncodings.IDENTITY_H, true);
style1.setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(fontFile2), "", false);
style2.setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Writing text/paragraph pseudocode:
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph();
loop up to 25 times: {
Text unicodeText = new Text(unicodeSymbol + " ").addStyle(style1);
paragraph.add(unicodeText);
Text plainText = new Text(plainText + " ").addStyle(style2);
paragraph.add(plainText);
}
div.add(paragraph);
This writing of text/paragraph is done thousands of times and makes up most of the document. Basically the document consists of thousands of "buildings" that have corresponding codes and the codes have categories. I need to have the index for the category as the unicode symbol and then all of the corresponding codes within the paragraph for the building.
Here is reproducable code:
float offSet = 50;
Integer leading = 10;
DateFormat format = new SimpleDateFormat("yyyy_MM_dd_kkmmss");
String formattedDate = format.format(new Date());
String path = "/tmp/testing_pdf_"+formattedDate + ".pdf";
File targetPdfFile = new File(path);
PdfWriter writer = new PdfWriter(path, new WriterProperties().addXmpMetadata());
PdfDocument pdf = new PdfDocument(writer);
pdf.setTagged();
PageSize pageSize = PageSize.LETTER;
Document document = new Document(pdf, pageSize);
document.setMargins(offSet, offSet, offSet, offSet);
byte[] font1file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Garamond-Premier-Pro-Regular.ttf"));
byte[] font2file = IOUtils.toByteArray(FileUtility.getInputStreamFromClassPath("fonts/Quivira.otf"));
PdfFont font1 = PdfFontFactory.createFont(FontProgramFactory.createFont(font1file), "", true);
PdfFont font2 = PdfFontFactory.createFont(FontProgramFactory.createFont(font2file), PdfEncodings.IDENTITY_H, true);
Style style1 = new Style().setFont(font1).setFontSize(8f).setFontColor(Color.DARK_GRAY);
Style style2 = new Style().setFont(font2).setFontSize(8f).setFontColor(Color.DARK_GRAY);
float columnGap = 5;
float columnWidth = (pageSize.getWidth() - offSet * 2 - columnGap * 2) / 3;
float columnHeight = pageSize.getHeight() - offSet * 2;
Rectangle[] columns = {
new Rectangle(offSet, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth + columnGap, offSet, columnWidth, columnHeight),
new Rectangle(offSet + columnWidth * 2 + columnGap * 2, offSet, columnWidth, columnHeight)};
document.setRenderer(new ColumnDocumentRenderer(document, columns));
for (int j = 0; j < 5000; j++) {
Div div = new Div().setPaddingLeft(3).setMarginBottom(0).setKeepTogether(true);
Paragraph paragraph = new Paragraph().setFixedLeading(leading);
// StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 26; i++) {
paragraph.add(new Text("\u3255 ").addStyle(style2));
paragraph.add(new Text("test ").addStyle(style1));
// stringBuilder.append("\u3255 ").append(" test ");
}
// paragraph.add(stringBuilder.toString()).addStyle(style2);
div.add(paragraph);
document.add(div);
}
document.close();
In creating the reproducible code I have found this this is related to the document being tagged. If you remove the line that marks it as tagged it reduces the file size greatly.
You can also reduce the file size by using the commented out string builder with one font instead of two. (Comment out the two "paragraph.add"s in the for-loop) This mirrors the issue I have in my code.
The problem is not in fonts themselves. The issues comes from the fact that you are creating a tagged PDF. Tagged documents have a lot of PDF objects in them that need a lot of space in the file.
I wasn't able to reproduce your 20MB vs 100MB results. On my machine whether with one font or with two fonts, but with two Text elements, the resultant file size is ~44MB.
To compress file when creating large tagged documents, you should use full compression mode which compresses all PDF objects, not only streams.
To activate full compression mode, create a PdfWriter instance with WriterProperties:
PdfWriter writer = new PdfWriter(outFileName,
new WriterProperties().setFullCompressionMode(true));
This setting reduced the file size for me from >40MB to ~5MB.
Please note that you are using iText 7.0.x while 7.1.x line has already been released and is now the main line of iText, so I recommend that you update to the latest version.

Using pdfbox - how to get the font from a COSName?

How to get the font from a COSName?
The solution I'm looking for looks somehow like this:
COSDictionary dict = new COSDictionary();
dict.add(fontname, something); // fontname COSName from below code
PDFontFactory.createFont(dict);
If you need more background, I added the whole story below:
I try to replace some string in a pdf. This succeeds (as long as all text is stored in one token). In order to keep the format I like to re-center the text. As far as I understood I can do this by getting the width of the old string and the new one, do some trivial calculation and setting the new position.
I found some inspiration on stackoverflow for replacing https://stackoverflow.com/a/36404377 (yes it has some issues, but works for my simple pdf's. And How to center a text using PDFBox. Unfortunatly this example uses a font constant.
So using the first link's code I get a handling for operator 'TJ' and one for 'Tj'.
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
java.util.List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++)
{
Object next = tokens.get(j);
if (next instanceof Operator)
{
Operator op = (Operator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj"))
{
// Tj takes one operator and that is the string to display so lets
// update that operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
String replaced = prh.getReplacement(string);
if (!string.equals(replaced))
{ // if changes are there, replace the content
previous.setValue(replaced.getBytes());
float xpos = getPosX(tokens, j);
//if (true) // center the text
if (6 * xpos > page.getMediaBox().getWidth()) // check if text starts right from 1/xth page width
{
float fontsize = getFontSize(tokens, j);
COSName fontname = getFontName(tokens, j);
// TODO
PDFont font = ?getFont?(fontname);
// TODO
float widthnew = getStringWidth(replaced, font, fontsize);
setPosX(tokens, j, page.getMediaBox().getWidth() / 2F - (widthnew / 2F));
}
replaceCount++;
}
}
Considering the code between the TODO tags, I will get the required values from the token list. (yes this code is awful, but for now it let's me concentrate on the main issue)
Having the string, the size and the font I should be able to call the getWidth(..) method from the sample code.
Unfortunatly I run into trouble to create a font from the COSName variable.
PDFont doesn't provide a method to create a font by name.
PDFontFactory looks fine, but requests a COSDictionary. This is the point I gave up and request help from you.
The names are associated with font objects in the page resources.
Assuming you use PDFBox 2.0.x and that page is a PDPage instance, you can resolve the name fontname using:
PDFont font = page.getResources().getFont(fontname);
But the warning from the comments to the questions you reference remain: This approach will work only for very simple PDFs and might even damage other ones.
try {
//Loading an existing document
File file = new File("UKRSICH_Mo6i-Spikyer_z1560-FAV.pdf");
PDDocument document = PDDocument.load(file);
PDPage page = document.getPage(0);
PDResources pageResources = page.getResources();
System.out.println(pageResources.getFontNames() );
for (COSName key : pageResources.getFontNames())
{
PDFont font = pageResources.getFont(key);
System.out.println("Font: " + font.getName());
}
document.close();
}

asp.net itextsharp convert the file format files to PDF

When I tried to use the code to convert the file format files to PDF using itextsharp. The problem appears when converting texts written in Arabic. The result came without the written text in Arabic.
I hope you to help me overcome the problems.
Thank you very much
Here's a summary of the process:
Wrap Paragraph objects in one of the two iText IElement classes that support Arabic text: PdfPCell and ColumnText.
Use a font that has Arabic glyphs.
Set the text run direction and alignment.
Something like this:
using (Document document = new Document()) {
PdfWriter writer = PdfWriter.GetInstance(document, STREAM);
document.Open();
string arabicText = #"
iText ® هي المكتبة التي تسمح لك لخلق والتلاعب وثائق PDF. فإنه يتيح للمطورين تتطلع الى تعزيز شبكة الإنترنت وغيرها من التطبيقات مع دينامية الجيل ثيقة PDF و / أو تلاعب.
";
PdfPTable table = new PdfPTable(1);
table.WidthPercentage = 100;
PdfPCell cell = new PdfPCell();
cell.Border = PdfPCell.NO_BORDER;
cell.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
Font font = new Font(BaseFont.CreateFont(
"c:/windows/fonts/arialuni.ttf",
BaseFont.IDENTITY_H, BaseFont.EMBEDDED
));
Paragraph p = new Paragraph(arabicText, font);
p.Alignment = Element.ALIGN_LEFT;
cell.AddElement(p);
table.AddCell(cell);
document.Add(table);
}
Sorry if the example text above is poor, incorrect, or both. I had to use Google translate, since my native language is English.