itext html to pdf without embedding fonts - pdf

I'm following this guide in Chapter 6 of iText 7: Converting HTML to PDF with pdfHTML on adding extra fonts:
public static final String FONT = "src/main/resources/fonts/cardo/Cardo-Regular.ttf";

public void createPdf(String src, String font, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    FontProvider fontProvider = new DefaultFontProvider(false, false, false);
    FontProgram fontProgram = FontProgramFactory.createFont(font);
    fontProvider.addFont(fontProgram, "Winansi");
    properties.setFontProvider(fontProvider);
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}
While it's working as expected and embedding subsets of the fonts being used, I'm wondering if there is a way for the resulting PDF document to not embed the fonts at all. This is possible when creating BaseFont instances and setting the embedded property to false and using them to build various PDF building blocks. What I'm looking for is this same behavior when using the HtmlConverter.convertToPdf().

What you should normally do is override FontProvider:
FontProvider fontProvider = new DefaultFontProvider(false, false, false) {
    @Override
    public boolean getDefaultEmbeddingFlag() {
        return false;
    }
};
However, the problem is that at the moment this font provider would be overwritten by pdfHTML further into the pipeline in ProcessorContext#reset.
Until this issue is fixed in iText, you can build a custom version of pdfHTML for your needs. The repo is located at https://github.com/itext/i7j-pdfhtml and the line of interest is the one in ProcessorContext#reset mentioned above. Just replace it with the override shown above and build the jar.
Update: the fix is available starting from pdfHTML 2.1.3. From that version on you can use custom font providers freely.

Related

Wrong rendering of thousands separator with custom font in itext

I have a rendering issue since I upgraded to iText 7.2 (from 7.1.7) with a custom font.
The thousand separator is not rendered correctly.
[Screenshot: output with iText 7.1.7]
[Screenshot: output with iText 7.2]
tableArticles.AddCell(new Cell(1, 2)
.Add(new Paragraph($"{bonCommande.Total:C}")
.SetFontSize(14)
.SetBold()
.SetTextAlignment(TextAlignment.RIGHT)
.SetFixedLeading(LeadingSize))
.SetBorder(new SolidBorder(1))
.SetVerticalAlignment(VerticalAlignment.MIDDLE)
.SetPadding(5));
I tried to update the font by drawing something for unicode char arabic thousands separator (U+066C) but without success.
I'm from Belgium and use "fr-BE".
Thanks for any help.
EDIT :
Here is some code example with the font and pdfs examples.
WeTransfer Link To PDFs and Font
public class PdfCurrencyModel : PageModel
{
private readonly IPdfCurrencyService _pdfCurrencyService;
public PdfCurrencyModel(IPdfCurrencyService pdfCurrencyService)
{
_pdfCurrencyService = pdfCurrencyService;
}
public IActionResult OnGetAsync()
{
//var stream = _pdfCurrencyService.GetPDFMemoryStream();
//return File(stream, MediaTypeNames.Application.Pdf, "Values.pdf");
var bytesArray = _pdfCurrencyService.GetPDFBytesArray();
return new FileContentResult(bytesArray, MediaTypeNames.Application.Pdf);
}
}
public interface IPdfCurrencyService
{
public MemoryStream GetPDFMemoryStream();
public byte[] GetPDFBytesArray();
}
public class PdfCurrencyService : IPdfCurrencyService
{
public MemoryStream GetPDFMemoryStream()
{
// Fonts
var fontTexte = PdfFontFactory.CreateFont(FontConstants.BrownBold,
PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED);
// Initialisation
var ms = new MemoryStream();
// Initialize PDF writer
var writer = new iText.Kernel.Pdf.PdfWriter(ms);
writer.SetCloseStream(false);
// Initialize PDF document
var pdfDoc = new iText.Kernel.Pdf.PdfDocument(writer);
// Initialize document
var document = new Document(pdfDoc, iText.Kernel.Geom.PageSize.A4)
.SetFont(fontTexte).SetFontSize(10);
// Values
for (var i = 1000; i <= 2000; i += 100)
document.Add(new Paragraph(i.ToString("C")));
// Close document
document.Close();
//ms.Flush();
ms.Position = 0;
return ms;
}
public byte[] GetPDFBytesArray()
{
return GetPDFMemoryStream().ToArray();
}
}
EDIT 2 : I created a simple web app with the problem
Link to project
When using the "fr-BE" culture this line:
1000.ToString("C")
evaluates to:
1.000,00 €
It seems that iText does some processing with the "." (dot), because neither file is correct.
It seems that 7.2.1 changed the way fonts are handled: 7.1.7 created a simple WinAnsi font, while 7.2.1 creates a Type0 font.
7.1.7 does not write the dot to the PDF file at all:
BT
/F1 10 Tf
36 787.93 Td
(1000,00 €)Tj
ET
7.2.1 writes <0000> for the dot GID (instead of <0073>) in the PDF file (spaces in string added by me to emphasize each GID):
BT
/F1 10 Tf
36 787.93 Td
<0093 0000 00a2 00a2 00a2 0080 00a2 00a2 0074 00e1>Tj
ET
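To make the observation above easier to check, here is a minimal sketch that splits such a hex string into 2-byte codes and reports where code 0000 occurs. It assumes the font is written with 2-byte codes that map directly to glyph IDs, where glyph 0 is the font's "missing glyph" (.notdef). The class and method names are made up for illustration and are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

final class GlyphIdScan {
    /** Splits a PDF hex string such as "<0093 0000>" into 2-byte codes. */
    static List<Integer> decodeTwoByteCodes(String hexString) {
        // Drop the angle brackets and any whitespace inside the hex string.
        String hex = hexString.replaceAll("[<>\\s]", "");
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected groups of 4 hex digits");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    /** Returns the zero-based positions whose code is 0 (assumed to be .notdef). */
    static List<Integer> notdefPositions(String hexString) {
        List<Integer> codes = decodeTwoByteCodes(hexString);
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < codes.size(); i++) {
            if (codes.get(i) == 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String shown = "<0093 0000 00a2 00a2 00a2 0080 00a2 00a2 0074 00e1>";
        System.out.println(notdefPositions(shown)); // prints [1]
    }
}
```

For the string shown above, the only hit is the second code, which is the position of the thousands separator in "1.000,00 €".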
I finally managed to make it work.
It appears that iText uses U+202F (NARROW NO-BREAK SPACE) as the thousands separator in 7.2.x versions, which was not the case before.
My font had no glyph defined for that character, so it rendered a square.
I copied the glyph from U+0020 (SPACE) into that slot and the problem is gone.
In fr-BE and fr-FR the thousands separator is a blank space and not a dot.
var separator = new CultureInfo("fr-BE", false)
.NumberFormat
.CurrencyGroupSeparator;
Debug.WriteLine($"Currency Group Separator for fr-BE is {(int)separator[0]}");
Debug.WriteLine($"As HEX {((int)separator[0]).ToString("X")}");
// Currency Group Separator for fr-BE is 8239
//As HEX 202F
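If editing the font is not an option, an alternative is to swap the separator in the formatted string before handing it to the PDF library. The sketch below shows the idea in Java for consistency with the other examples on this page; it is not what the poster did, the class and method names are hypothetical, and the right replacement character depends on which glyphs the font actually contains.

```java
final class SeparatorFallback {
    /** U+202F NARROW NO-BREAK SPACE, the group separator reported above. */
    static final char NARROW_NO_BREAK_SPACE = '\u202F';

    /**
     * Replaces every U+202F in the text with a character the font is known
     * to contain, for example a plain space (U+0020).
     */
    static String replaceNarrowNoBreakSpace(String text, char replacement) {
        return text.replace(NARROW_NO_BREAK_SPACE, replacement);
    }

    public static void main(String[] args) {
        String formatted = "1\u202F000,00 \u20AC";
        // After the call the amount contains an ordinary space as separator.
        System.out.println(replaceNarrowNoBreakSpace(formatted, ' '));
    }
}
```

A plain space allows a line break inside the amount, so U+00A0 (NO-BREAK SPACE) may be the better choice when the font has a glyph for it.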
Thanks anyone for your time.

replace string in PDF document (ITextSharp or PdfSharp)

We use an unmanaged DLL that has a function to replace text in a PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to a managed solution (ITextSharp or PdfSharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?
According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general-purpose PDF library; it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
using (PdfReader reader = new PdfReader(OrigFile))
{
byte[] contentBytes = reader.GetPageContent(1);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace(origText, replaceText);
reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
}
}
WARNING: Just like in case of the Debenu function, for most documents this code wouldn’t have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
Thus, if you also change the way the source documents are created during your move to a managed solution, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.
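To illustrate why kerning defeats this kind of replacement, here is a small sketch in plain Java that runs the same naive string replacement on two hand-written content stream fragments. Both fragments are made-up examples, not output of any particular tool: the first shows the word as one string, the second splits it inside a TJ array with a kerning adjustment.

```java
final class NaiveReplaceDemo {
    /** The same idea as content.Replace("Moby", "Mary") on a decoded content stream. */
    static String naiveReplace(String contentStream, String search, String replacement) {
        return contentStream.replace(search, replacement);
    }

    public static void main(String[] args) {
        // Whole word in a single string operand: the search text is found.
        String simple = "BT /F1 12 Tf 72 720 Td (Call me Moby) Tj ET";
        // Same visible text, but split by a kerning adjustment: "Moby" never
        // appears as a contiguous run of bytes in the stream.
        String kerned = "BT /F1 12 Tf 72 720 Td [(Call me M) 15 (oby)] TJ ET";

        System.out.println(naiveReplace(simple, "Moby", "Mary"));
        System.out.println(naiveReplace(kerned, "Moby", "Mary"));
    }
}
```

The second fragment comes back unchanged. Subsetted or otherwise re-encoded fonts make matters worse, because the bytes in the stream then no longer correspond to the characters being searched for.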
For PdfSharp users, here is a somewhat usable function. I copied it from my project, where it relies on a utility method that is also consumed by other methods, hence the unused result.
It ignores whitespace created by kerning and may therefore mess up the result (all characters in the same space), depending on the source material.
public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
ModifyPdfContentStreams(contentPage, stream =>
{
if (!stream.TryUnfilter())
return false;
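// Escape each character so that regex metacharacters in the source text match literally
var search = string.Join("\\s*", source.Select(c => Regex.Escape(c.ToString())));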
var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
if (!Regex.IsMatch(stringStream, search))
return false;
stringStream = Regex.Replace(stringStream, search, target);
stream.Value = Encoding.Default.GetBytes(stringStream);
stream.Zip();
return false;
});
}
public static void ModifyPdfContentStreams(PdfPage contentPage,Func<PdfDictionary.PdfStream, bool> Modification)
{
for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
return;
var resources = contentPage.Elements?.GetDictionary("/Resources");
var xObjects = resources?.Elements.GetDictionary("/XObject");
if (xObjects == null)
return;
foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
{
var stream = (item.Value as PdfDictionary)?.Stream;
if (stream != null)
if (Modification(stream))
return;
}
}

How a font is detected to be bold/italic/plain that is used in PDF

While extracting content from a PDF using the MuPDF library, I only get the font name, not its font face.
Do I have to guess (e.g. look for "bold" in the font name, though that is not the right way), or is there another way to detect whether a specific font is bold, italic or plain?
I have used iTextSharp to extract the font family, font color, etc.:
public void Extract_inputpdf() {
    text_input_File = string.Empty;
    StringBuilder sb_inputpdf = new StringBuilder();
    PdfReader reader_inputPdf = new PdfReader(path); // read PDF
    // Page numbers are 1-based in iTextSharp, so start at 1 rather than 0
    for (int i = 1; i <= reader_inputPdf.NumberOfPages; i++) {
        TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
        text_input_File = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);
        sb_inputpdf.Append(text_input_File);
        input_pdf = sb_inputpdf.ToString();
    }
    reader_inputPdf.Close();
    clear();
}
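public class TextWithFont_inputPdf: iTextSharp.text.pdf.parser.ITextExtractionStrategy {
    // Collects the extracted text so that GetResultantText has something to return
    private readonly StringBuilder result = new StringBuilder();

    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
        // PostScript name of the font used for this text chunk
        string curFont = renderInfo.GetFont().PostscriptFontName;
        // Split the PostScript name here if its parts are needed separately
        result.Append(renderInfo.GetText());
    }

    public string GetResultantText() {
        return result.ToString();
    }

    // Remaining interface members; nothing to do here for font detection
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo) { }
}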
The PDF spec contains entries which allow you to specify the style of a font. However unfortunately in the real world you will often find that these are absent.
If the font is referenced rather than embedded, this generally means you are stuck with the PostScript name for the font. It requires some heuristics, but normally the name provides sufficient clues as to the style. It sounds like this is pretty much where you are.
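A name-based heuristic of this kind could look like the sketch below. It is only a guess by design: the keyword list is an assumption chosen for illustration, abbreviated style names (such as "Bd" or "It") are not handled, and the class and method names are hypothetical. The only part taken from the PDF format itself is the subset prefix, six uppercase letters followed by "+", which is stripped before matching.

```java
import java.util.Locale;

final class FontStyleGuess {
    enum Style { PLAIN, BOLD, ITALIC, BOLD_ITALIC }

    /** Removes a subset tag such as "ABCDEF+" from the front of a font name. */
    static String stripSubsetPrefix(String name) {
        return name.matches("[A-Z]{6}\\+.+") ? name.substring(7) : name;
    }

    /** Guesses the style from keywords in the PostScript font name. */
    static Style guess(String postScriptName) {
        String name = stripSubsetPrefix(postScriptName).toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) {
            return Style.BOLD_ITALIC;
        }
        if (bold) {
            return Style.BOLD;
        }
        return italic ? Style.ITALIC : Style.PLAIN;
    }

    public static void main(String[] args) {
        System.out.println(guess("ABCDEF+Arial-BoldItalicMT")); // prints BOLD_ITALIC
        System.out.println(guess("Helvetica-Oblique"));         // prints ITALIC
        System.out.println(guess("ArialMT"));                   // prints PLAIN
    }
}
```

A font whose name carries no style keyword at all is reported as plain, so the result should be treated as a hint rather than a fact.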
If the font is embedded you can parse it and try and find style information from the embedded font program. If it is subsetted then in theory this information might be removed but in general I don't think it will be. However parsing TrueType/OpenType fonts is boring and you may not feel that it is worth it.
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

fop in glassfish fail to render external resources

I am generating a PDF file via FOP 1.0 from a Java library. The unit tests are running fine and the PDF is rendered as expected, including an external graphic:
<fo:external-graphic content-width="20mm" src="url('images/image.png')" />
If I render this within a Java EE application in glassfish 3.1, I always get the following error:
Image not found. URI: images/image.png. (No context info available)
I double-checked whether the image is available. It is available within the .jar file in the .ear file and should therefore be found by the ClasspathUriResolver. This is a code snippet showing how I set up the FOP factory:
FopFactory fopFactory = FopFactory.newInstance();
URIResolver uriResolver = new ClasspathUriResolver();
fopFactory.setURIResolver(uriResolver);
Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
...
I also assigned the URI resolver to the TransformerFactory and the Transformer with no success. Would be great if someone can help me out.
-- Wintermute
Btw: the ClasspathUriResolver() looks like this
public class ClasspathUriResolver implements URIResolver {
    @Override
    public Source resolve(String href, String base) throws TransformerException {
        Source source = null;
        InputStream inputStream = ClassLoader.getSystemResourceAsStream(href);
        if (inputStream != null) {
            source = new StreamSource(inputStream);
        }
        return source;
    }
}
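Consider using a different class loader than the one behind ClassLoader.getSystemResourceAsStream(href). In an application server, the system class loader usually cannot see resources packaged inside the deployed .ear.
Try InputStream inputStream = getClass().getResourceAsStream(href); or something similar.
Does it work then?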
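One way to follow that suggestion is sketched below: a resolver that asks the thread context class loader first (in Java EE containers this is normally the application's loader) and falls back to the loader of the resolver class itself. The class name ContextClasspathUriResolver is made up for this example, and whether it fixes the GlassFish case depends on how the .ear is packaged, so treat it as something to try rather than a confirmed fix.

```java
import java.io.InputStream;
import javax.xml.transform.Source;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

final class ContextClasspathUriResolver implements URIResolver {
    @Override
    public Source resolve(String href, String base) {
        InputStream in = open(href);
        // Returning null tells the caller to fall back to its default resolution.
        return in == null ? null : new StreamSource(in);
    }

    /** Looks the resource up on the classpath, or returns null if it is absent. */
    static InputStream open(String href) {
        // ClassLoader lookups expect names without a leading slash.
        String name = href.startsWith("/") ? href.substring(1) : href;

        // 1. The context class loader, usually the deployed application's loader.
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        if (context != null) {
            InputStream in = context.getResourceAsStream(name);
            if (in != null) {
                return in;
            }
        }
        // 2. The loader that loaded this class, or the system loader as a last resort.
        ClassLoader own = ContextClasspathUriResolver.class.getClassLoader();
        return own != null
                ? own.getResourceAsStream(name)
                : ClassLoader.getSystemResourceAsStream(name);
    }
}
```

It would be registered in the same way as the original resolver, via fopFactory.setURIResolver(new ContextClasspathUriResolver()).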

What's the easiest way of converting an xhtml string to PDF using Flying Saucer?

I've been using Flying Saucer for a while now with awesome results.
I can set a document via uri like so
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(xhtmlUri);
Which is nice, as it will resolve all relative css resources etc relative to the given URI. However, I'm now generating the xhtml, and want to render it directly to a PDF (without saving a file). The appropriate methods in ITextRenderer seem to be:
private Document loadDocument(final String uri) {
return _sharedContext.getUac().getXMLResource(uri).getDocument();
}
public void setDocument(String uri) {
setDocument(loadDocument(uri), uri);
}
public void setDocument(Document doc, String url) {
setDocument(doc, url, new XhtmlNamespaceHandler());
}
As you can see, my existing code just gives the uri and ITextRenderer does the work of creating the Document for me.
What's the shortest way of creating the Document from my formatted xhtml String? I'd prefer to use the existing Flying Saucer libs without having to import another XML parsing jar (just for the sake of consistent bugs and functionality).
The following works:
Document document = XMLResource.load(new ByteArrayInputStream(templateString.getBytes())).getDocument();
Previously, I had tried
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
final DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
Document document = documentBuilder.parse(new ByteArrayInputStream(templateString.getBytes()));
but that fails as it attempts to download the HTML docType from http://www.w3.org (which returns 503's for the java libs).
I use the following without problem:
final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setValidating(false);
DocumentBuilder builder = documentBuilderFactory.newDocumentBuilder();
builder.setEntityResolver(FSEntityResolver.instance());
org.w3c.dom.Document document = builder.parse(new ByteArrayInputStream(doc.toString().getBytes()));
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(document, null);
renderer.layout();
renderer.createPDF(os);
The key differences here are passing in a null URI and providing the DocumentBuilder with an entity resolver.
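For reference, the entity-resolver idea can be reproduced with the JDK alone. The sketch below is an assumption-laden variant rather than the code from either answer: instead of FSEntityResolver it answers every external entity lookup with an empty document, which stops the DOCTYPE download but also means named entities such as &nbsp; are not defined. It reads the markup through a StringReader so that no byte encoding is involved. The class name XhtmlStringParser is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XhtmlStringParser {
    /** Parses an XHTML string into a DOM Document without fetching its DTD. */
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Resolve every external entity (including the DOCTYPE's DTD) to an
            // empty document so the parser never goes to the network.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XHTML string", e);
        }
    }
}
```

The resulting Document can then be handed to the renderer exactly as in the answer above, with renderer.setDocument(document, null). If the markup relies on XHTML named entities, FSEntityResolver.instance() remains the better resolver because it serves the real DTDs locally.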