HTML To PDF Turkish Character Problem

I want to convert an ASP.NET web page to PDF using iTextSharp. I wrote some code, but I cannot get it to show Turkish characters. Can anyone help me?
Here is the code:
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.Web.UI;
using System.Web;
using iTextSharp.text.html.simpleparser;
using System.Text;
using System.Text.RegularExpressions;
namespace Presentation
{
public partial class TemporaryStudentFormPrinter : System.Web.UI.Page
{
protected override void Render(HtmlTextWriter writer)
{
MemoryStream mem = new MemoryStream();
StreamWriter twr = new StreamWriter(mem);
HtmlTextWriter myWriter = new HtmlTextWriter(twr);
base.Render(myWriter);
myWriter.Flush();
myWriter.Dispose();
StreamReader strmRdr = new StreamReader(mem);
strmRdr.BaseStream.Position = 0;
string pageContent = strmRdr.ReadToEnd();
strmRdr.Dispose();
mem.Dispose();
writer.Write(pageContent);
CreatePDFDocument(pageContent);
}
public void CreatePDFDocument(string strHtml)
{
string strFileName = HttpContext.Current.Server.MapPath("test.pdf");
Document document = new Document(PageSize.A4, 80, 50, 30, 65);
PdfWriter.GetInstance(document, new FileStream(strFileName, FileMode.Create));
StringReader se = new StringReader(strHtml);
HTMLWorker obj = new HTMLWorker(document);
document.Open();
obj.Parse(se);
document.Close();
ShowPdf(strFileName);
}
public void ShowPdf(string strFileName)
{
Response.ClearContent();
Response.ClearHeaders();
Response.AddHeader("Content-Disposition", "inline;filename=" + strFileName);
Response.ContentType = "application/pdf";
Response.WriteFile(strFileName);
Response.Flush();
Response.Clear();
}
protected void Page_Load(object sender, EventArgs e)
{
}
}
}

iTextSharp.text.pdf.BaseFont STF_Helvetica_Turkish = iTextSharp.text.pdf.BaseFont.CreateFont("Helvetica", "CP1254", iTextSharp.text.pdf.BaseFont.NOT_EMBEDDED);
iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(STF_Helvetica_Turkish, 12, iTextSharp.text.Font.NORMAL);
You should pass the font as an argument to the iTextSharp calls that add text, like this:
pdftable.AddCell(new Phrase(nn.InnerText.Trim(), fontNormal));
You might want to consider a reporting tool with PDF export capability instead of working with PDF directly, which can be a real headache.

You'll need to make sure you are writing the text in a font that supports the Turkish character set (or at least the characters you are trying to write out).
I don't know what HtmlTextWriter does in terms of font use. It will presumably use one of the standard built-in fonts, which are unlikely to support the characters you want to print if they fall outside the Latin1 or Latin1-extended Unicode ranges.
I use BaseFont.createFont(...) in iText (Java) to include an external font in my PDF, one which supports all the characters that I am writing. You might be able to create your Font object and then pass it to HtmlTextWriter?
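To illustrate why the encoding matters, here is a minimal Java sketch that checks whether a given charset can represent a few Turkish letters. The class and method names are made up for this example, and it assumes the runtime provides the `windows-1254` charset (the Java name for CP1254); standard JDK builds ship it, but the Java specification does not guarantee it.

```java
import java.nio.charset.Charset;

public class TurkishCharsetCheck {
    // Returns true if every character of text can be encoded in the named charset.
    public static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Turkish-specific letters written as Unicode escapes so the result does
        // not depend on the source file encoding: g-breve, s-cedilla, dotless i,
        // capital dotted I.
        String sample = "\u011F\u015F\u0131\u0130";
        System.out.println("ISO-8859-1:   " + canEncode("ISO-8859-1", sample));
        System.out.println("windows-1254: " + canEncode("windows-1254", sample));
    }
}
```

A font that is created with a Latin-1 style encoding has no slot for these letters, which is consistent with them going missing in the generated PDF.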

PdfFont font = PdfFontFactory.CreateFont("Helvetica", "CP1254",PdfFontFactory.EmbeddingStrategy.FORCE_NOT_EMBEDDED);
doc.Add(new Paragraph("P").SetFont(font));

Related

What is the alternative to PdfStamper in iText 7?

I tried to find the iText 7 alternative to PdfStamper but could not work out how to use it. My code already works with iTextSharp, but not with iText 7.
I have one more question: what is the iText 7 alternative to AcroFields?
public byte[] GeneratePDF(string pdfPath, Dictionary<string, string> formFieldMap, bool formFlattening = true)
{
var output = new MemoryStream();
var reader = new PdfReader(pdfPath);
var stamper = new PdfStamper(reader, output);
//PdfDocument pdfDocument = new PdfDocument(reader, writer);
var formFields = stamper.AcroFields;
foreach (var fieldName in formFieldMap.Keys)
formFields.SetField(fieldName, formFieldMap[fieldName]);
stamper.FormFlattening = formFlattening;
stamper.Close();
reader.Close();
return output.ToArray();
}
The iText API was completely overhauled between versions 5.x and 7.x, so classes do not always correspond one-to-one. I would therefore suggest studying the introductory ebooks on the iText knowledge base site before porting code.
There actually is an example in those ebooks very similar to your code:
//Initialize PDF document
PdfDocument pdf = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdf, true);
IDictionary<String, PdfFormField> fields = form.GetFormFields();
PdfFormField toSet;
fields.TryGetValue("name", out toSet);
toSet.SetValue("James Bond");
fields.TryGetValue("language", out toSet);
toSet.SetValue("English");
fields.TryGetValue("experience1", out toSet);
toSet.SetValue("Off");
fields.TryGetValue("experience2", out toSet);
toSet.SetValue("Yes");
fields.TryGetValue("experience3", out toSet);
toSet.SetValue("Yes");
fields.TryGetValue("shift", out toSet);
toSet.SetValue("Any");
fields.TryGetValue("info", out toSet);
toSet.SetValue("I was 38 years old when I became an MI6 agent.");
form.FlattenFields();
pdf.Close();
("Flattening a Form" in "Chapter 4: Making a PDF interactive | .NET" of "iText 7: Jump-Start Tutorial for .NET")

Wrong rendering of thousands separator with custom font in itext

I have a rendering issue since I upgraded to iText 7.2 (from 7.1.7) with a custom font.
The thousand separator is not rendered correctly.
With iText 7.1.7
With iText 7.2
tableArticles.AddCell(new Cell(1, 2)
.Add(new Paragraph($"{bonCommande.Total:C}")
.SetFontSize(14)
.SetBold()
.SetTextAlignment(TextAlignment.RIGHT)
.SetFixedLeading(LeadingSize))
.SetBorder(new SolidBorder(1))
.SetVerticalAlignment(VerticalAlignment.MIDDLE)
.SetPadding(5));
I tried updating the font by drawing a glyph for the Unicode character ARABIC THOUSANDS SEPARATOR (U+066C), but without success.
I'm from Belgium and use "fr-BE".
Thanks for any help.
EDIT :
Here is some code example with the font and pdfs examples.
WeTransfer Link To PDFs and Font
public class PdfCurrencyModel : PageModel
{
private readonly IPdfCurrencyService _pdfCurrencyService;
public PdfCurrencyModel(IPdfCurrencyService pdfCurrencyService)
{
_pdfCurrencyService = pdfCurrencyService;
}
public IActionResult OnGetAsync()
{
//var stream = _pdfCurrencyService.GetPDFMemoryStream();
//return File(stream, MediaTypeNames.Application.Pdf, "Values.pdf");
var bytesArray = _pdfCurrencyService.GetPDFBytesArray();
return new FileContentResult(bytesArray, MediaTypeNames.Application.Pdf);
}
}
public interface IPdfCurrencyService
{
public MemoryStream GetPDFMemoryStream();
public byte[] GetPDFBytesArray();
}
public class PdfCurrencyService : IPdfCurrencyService
{
public MemoryStream GetPDFMemoryStream()
{
// Fonts
var fontTexte = PdfFontFactory.CreateFont(FontConstants.BrownBold,
PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED);
// Initialisation
var ms = new MemoryStream();
// Initialize PDF writer
var writer = new iText.Kernel.Pdf.PdfWriter(ms);
writer.SetCloseStream(false);
// Initialize PDF document
var pdfDoc = new iText.Kernel.Pdf.PdfDocument(writer);
// Initialize document
var document = new Document(pdfDoc, iText.Kernel.Geom.PageSize.A4)
.SetFont(fontTexte).SetFontSize(10);
// Values
for (var i = 1000; i <= 2000; i += 100)
document.Add(new Paragraph(i.ToString("C")));
// Close document
document.Close();
//ms.Flush();
ms.Position = 0;
return ms;
}
public byte[] GetPDFBytesArray()
{
return GetPDFMemoryStream().ToArray();
}
}
EDIT 2 : I created a simple web app with the problem
Link to project
When using the "fr-BE" culture this line:
1000.ToString("C")
evaluates to:
1.000,00 €
It seems that iText does some processing with the "." (dot), because neither file is correct.
It seems that 7.2.1 changed the way fonts are handled: 7.1.7 created a simple WinAnsi font, while 7.2.1 creates a Type0 font.
7.1.7 does not write the dot to the PDF file at all:
BT
/F1 10 Tf
36 787.93 Td
(1000,00 €)Tj
ET
7.2.1 writes <0000> for the dot's GID (instead of <0073>) in the PDF file (I added the spaces in the string to separate the GIDs):
BT
/F1 10 Tf
36 787.93 Td
<0093 0000 00a2 00a2 00a2 0080 00a2 00a2 0074 00e1>Tj
ET
I finally managed to make it work.
It appears that iText uses U+202F (NARROW NO-BREAK SPACE) as the thousands separator in 7.2.x versions, which was not the case before.
My font had no glyph defined for that code point, so it was rendered as a square.
I copied the glyph from U+0020 (SPACE) into that slot and I no longer have any problem.
In fr-BE and fr-FR the thousands separator is a blank space, not a dot.
var separator = new CultureInfo("fr-BE", false)
.NumberFormat
.CurrencyGroupSeparator;
Debug.WriteLine($"Currency Group Separator for fr-BE is {(int)separator[0]}");
Debug.WriteLine($"As HEX {((int)separator[0]).ToString("X")}");
// Currency Group Separator for fr-BE is 8239
//As HEX 202F
Thanks anyone for your time.
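If editing the font is not an option, a different workaround (not the one used above) is to replace the narrow no-break space with an ordinary space before the text reaches the PDF library. The sketch below shows the idea in Java with a hypothetical helper name; .NET strings offer an equivalent `Replace(char, char)` method. Note that this gives up the non-breaking behaviour of the original separator.

```java
public class SeparatorNormalizer {
    // Replaces NARROW NO-BREAK SPACE (U+202F) and NO-BREAK SPACE (U+00A0)
    // with a plain SPACE (U+0020), for fonts that lack glyphs for the former.
    public static String normalizeSpaces(String formatted) {
        return formatted.replace('\u202F', ' ').replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // A currency string with U+202F as group separator and U+00A0 before
        // the euro sign (U+20AC), as some cultures format it.
        String amount = "1\u202F000,00\u00A0\u20AC";
        System.out.println(normalizeSpaces(amount));
    }
}
```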

iTextSharp returns a whitespace-only string for a specific PDF file

I am testing this simple code:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
namespace PDF_TXT2
{
class Program
{
[STAThread]
static void Main(string[] args)
{
string path = args[0];
string pathFileName = System.IO.Path.GetFileNameWithoutExtension(path);
string pathFolder = System.IO.Path.GetDirectoryName(path);
PdfReader reader = new PdfReader(path);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
Clipboard.SetText(text);
MessageBox.Show(text);
}
}
}
This specific PDF file produces what looks like an empty string. Actually it is not empty; it consists only of whitespace.
Could you help me understand why?
Thanks a lot!
The fonts in your PDF have this entry
/ToUnicode/Identity-H
i.e. the value of ToUnicode is the name Identity-H.
According to the PDF specification, though, the value of ToUnicode must be a stream!
ToUnicode (type: stream): (Optional) A stream containing a CMap file that maps character codes to Unicode values (see 9.10, "Extraction of Text Content").
Thus, the ToUnicode mapping in your file is invalid which can result in arbitrary errors during text extraction.
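A rough way to spot this defect is to look at the text of a font dictionary and check whether /ToUnicode is followed by a name instead of an indirect reference. The Java sketch below is only a heuristic with made-up names: it assumes you already have the dictionary as uncompressed text, so it will not see dictionaries stored inside compressed object streams, and it is no substitute for a real PDF parser.

```java
import java.util.regex.Pattern;

public class ToUnicodeCheck {
    // Matches a /ToUnicode key whose value is a name (starts with '/')
    // rather than an indirect reference such as "12 0 R".
    private static final Pattern NAME_VALUED =
            Pattern.compile("/ToUnicode\\s*/[A-Za-z0-9.\\-]+");

    public static boolean hasNameValuedToUnicode(String dictionaryText) {
        return NAME_VALUED.matcher(dictionaryText).find();
    }

    public static void main(String[] args) {
        // Invalid: the value is the name Identity-H.
        System.out.println(hasNameValuedToUnicode("<</Type/Font/ToUnicode/Identity-H>>"));
        // Valid shape: the value is an indirect reference to a stream object.
        System.out.println(hasNameValuedToUnicode("<</Type/Font/ToUnicode 12 0 R>>"));
    }
}
```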

Error using OpenXML to read a .docx file from a memorystream to a WordprocessingDocument to a string and back

I have an existing library that I can use to receive a docx file and return it. The software is .Net Core hosted in a Linux Docker container.
It's very limited in scope though and I need to perform some actions it can't do. As these are straightforward I thought I would use OpenXML, and for my proof of concept all I need to do is to read a docx as a memorystream, replace some text, turn it back into a memorystream and return it.
However the docx that gets returned is unreadable. I've commented out the text replacement below to eliminate that, and if I comment out the call to the method below then the docx can be read so I'm sure the issue is in this method.
Presumably I'm doing something fundamentally wrong here but after a few hours googling and playing around with the code I am not sure how to correct this; any ideas what I have wrong?
Thanks for the help
private MemoryStream SearchAndReplace(MemoryStream mem)
{
mem.Position = 0;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(mem, true))
{
string docText = null;
StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream());
docText = sr.ReadToEnd();
//Regex regexText = new Regex("Hello world!");
//docText = regexText.Replace(docText, "Hi Everyone!");
MemoryStream newMem = new MemoryStream();
newMem.Position = 0;
StreamWriter sw = new StreamWriter(newMem);
sw.Write(docText);
return newMem;
}
}
If your real requirement is to search and replace text in a WordprocessingDocument, you should have a look at this answer.
The following unit test shows how you can make your approach work if the use case really demands that you read a string from a part, "massage" the string, and write the changed string back to the part. It also shows one of the shortcomings of any other approach than the one described in the answer already mentioned above, e.g., by demonstrating that the string "Hello world!" will not be found in this way if it is split across w:r elements.
[Fact]
public void CanSearchAndReplaceStringInOpenXmlPartAlthoughThisIsNotTheWayToSearchAndReplaceText()
{
// Arrange.
using var docxStream = new MemoryStream();
using (var wordDocument = WordprocessingDocument.Create(docxStream, WordprocessingDocumentType.Document))
{
MainDocumentPart part = wordDocument.AddMainDocumentPart();
var p1 = new Paragraph(
new Run(
new Text("Hello world!")));
var p2 = new Paragraph(
new Run(
new Text("Hello ") { Space = SpaceProcessingModeValues.Preserve }),
new Run(
new Text("world!")));
part.Document = new Document(new Body(p1, p2));
Assert.Equal("Hello world!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
// Act.
SearchAndReplace(docxStream);
// Assert.
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, false))
{
MainDocumentPart part = wordDocument.MainDocumentPart;
Paragraph p1 = part.Document.Descendants<Paragraph>().First();
Paragraph p2 = part.Document.Descendants<Paragraph>().Last();
Assert.Equal("Hi Everyone!", p1.InnerText);
Assert.Equal("Hello world!", p2.InnerText);
}
}
private static void SearchAndReplace(MemoryStream docxStream)
{
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(docxStream, true))
{
// If you wanted to read the part's contents as text, this is how you
// would do it.
string partText = ReadPartText(wordDocument.MainDocumentPart);
// Note that this is not the way in which you should search and replace
// text in Open XML documents. The text might be split across multiple
// w:r elements, so you would not find the text in that case.
var regex = new Regex("Hello world!");
partText = regex.Replace(partText, "Hi Everyone!");
// If you wanted to write changed text back to the part, this is how
// you would do it.
WritePartText(wordDocument.MainDocumentPart, partText);
}
docxStream.Seek(0, SeekOrigin.Begin);
}
private static string ReadPartText(OpenXmlPart part)
{
using Stream partStream = part.GetStream(FileMode.OpenOrCreate, FileAccess.Read);
using var sr = new StreamReader(partStream);
return sr.ReadToEnd();
}
private static void WritePartText(OpenXmlPart part, string text)
{
using Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write);
using var sw = new StreamWriter(partStream);
sw.Write(text);
}
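The split-run pitfall that the unit test demonstrates can also be seen without any Open XML library. The Java sketch below uses a hand-written fragment of WordprocessingML and a hypothetical helper that concatenates the contents of the w:t elements. It is a simplified illustration built on a regular expression; it does not decode XML entities and should not be taken as a way to parse real documents.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitRunDemo {
    // Matches <w:t> or <w:t ...attributes...> and captures the element's text.
    private static final Pattern TEXT_ELEMENT =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    // Concatenates the contents of all w:t elements in the given markup.
    public static String visibleText(String xml) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TEXT_ELEMENT.matcher(xml);
        while (m.find()) {
            sb.append(m.group(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One paragraph whose text is split across two runs.
        String xml = "<w:p><w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r>"
                   + "<w:r><w:t>world!</w:t></w:r></w:p>";
        // Searching the raw markup misses the phrase ...
        System.out.println(xml.contains("Hello world!"));
        // ... although the reader of the document sees it.
        System.out.println(visibleText(xml));
    }
}
```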

Apache PDFBox and PDF/A-3

Is it possible to use Apache PDFBox to process PDF/A-3 documents? (Especially for changing field values?)
The PDFBox 1.8 Cookbook says that it is possible to create PDF/A-1 documents with pdfaid.setPart(1);
Can I apply pdfaid.setPart(3) for a PDF/A-3 document?
If not: is it possible to read in a PDF/A-3 document, change some field values, and save it, so that I do not need any creation of or conversion to PDF/A-3 but the document is still PDF/A-3?
How to create a valid PDF/A-{2,3}{B,U,A} file: in this example I render the PDF page to an image and then create a valid PDF/A-x-y from that image, using PDFBox 2.0.x.
public static void main(String[] args) throws IOException, TransformerException
{
String resultFile = "result/PDFA-x.PDF";
FileInputStream in = new FileInputStream("src/PDFOrigin.PDF");
PDDocument doc = new PDDocument();
try
{
PDPage page = new PDPage();
doc.addPage(page);
doc.setVersion(1.7f);
/*
// A PDF/A file needs to have the font embedded if the font is used for text rendering
// in rendering modes other than text rendering mode 3.
//
// This requirement includes the PDF standard fonts, so don't use their static PDFType1Font classes such as
// PDFType1Font.HELVETICA.
//
// As there are many different font licenses it is up to the developer to check if the license terms for the
// font loaded allows embedding in the PDF.
String fontfile = "/org/apache/pdfbox/resources/ttf/ArialMT.ttf";
PDFont font = PDType0Font.load(doc, new File(fontfile));
if (!font.isEmbedded())
{
throw new IllegalStateException("PDF/A compliance requires that all fonts used for"
+ " text rendering in rendering modes other than rendering mode 3 are embedded.");
}
*/
PDPageContentStream contents = new PDPageContentStream(doc, page);
try
{
PDDocument docSource = PDDocument.load(in);
PDFRenderer pdfRenderer = new PDFRenderer(docSource);
int numPage = 0;
BufferedImage imagePage = pdfRenderer.renderImageWithDPI(numPage, 200);
PDImageXObject pdfXOImage = LosslessFactory.createFromImage(doc, imagePage);
contents.drawImage(pdfXOImage, 0,0, page.getMediaBox().getWidth(), page.getMediaBox().getHeight());
contents.close();
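}catch (Exception e) {
// Do not swallow rendering failures; otherwise an empty page would be saved silently.
throw new IllegalStateException("Could not render the source page", e);
}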
// add XMP metadata
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
PDDocumentCatalog catalogue = doc.getDocumentCatalog();
Calendar cal = Calendar.getInstance();
try
{
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
// dc.setTitle(file);
dc.addCreator("My APPLICATION Creator");
dc.addDate(cal);
PDFAIdentificationSchema id = xmp.createAndAddPFAIdentificationSchema();
id.setPart(3); //value => 2|3
id.setConformance("A"); // value => A|B|U
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
catalogue.setMetadata(metadata);
}
catch(BadFieldValueException e)
{
throw new IllegalArgumentException(e);
}
// sRGB output intent
InputStream colorProfile = CreatePDFA.class.getResourceAsStream(
"../../../pdmodel/sRGB.icc");
PDOutputIntent intent = new PDOutputIntent(doc, colorProfile);
intent.setInfo("sRGB IEC61966-2.1");
intent.setOutputCondition("sRGB IEC61966-2.1");
intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
intent.setRegistryName("http://www.color.org");
catalogue.addOutputIntent(intent);
catalogue.setLanguage("en-US");
// Use a fresh dictionary for the viewer preferences instead of reusing the page dictionary.
PDViewerPreferences pdViewer = new PDViewerPreferences(new COSDictionary());
pdViewer.setDisplayDocTitle(true);
catalogue.setViewerPreferences(pdViewer);
PDMarkInfo mark = new PDMarkInfo(); // new PDMarkInfo(page.getCOSObject());
PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
catalogue.setMarkInfo(mark);
catalogue.setStructureTreeRoot(treeRoot);
catalogue.getMarkInfo().setMarked(true);
PDDocumentInformation info = doc.getDocumentInformation();
info.setCreationDate(cal);
info.setModificationDate(cal);
info.setAuthor("My APPLICATION Author");
info.setProducer("My APPLICATION Producer");
info.setCreator("My APPLICATION Creator");
info.setTitle("PDF title");
info.setSubject("PDF to PDF/A{2,3}-{A,U,B}");
doc.save(resultFile);
}catch (Exception e) {
throw new IllegalArgumentException(e);
}
}
PDFBox supports that, but be aware that PDFBox is a low-level library, so you have to ensure conformance yourself; there is no 'Save as PDF/A-3'. You might want to take a look at http://www.mustangproject.org, which uses PDFBox to support ZUGFeRD (electronic invoicing), which also requires PDF/A-3.