iText7 convert HTML to PDF "System.NullReferenceException." - pdf

OLD TITLE: iTextSharp convert HTML to PDF "The document has no pages."
I am using iTextSharp and xmlworker to convert html from a view to PDF in ASP.NET Core 2.1
I tried many code snippets I found online but all generate an exception:
The document has no pages.
Here is my current code:
public static byte[] ToPdf(string html)
{
byte[] output;
using (var document = new Document())
{
using (var workStream = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(document, workStream);
writer.CloseStream = false;
document.Open();
using (var reader = new StringReader(html))
{
XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, reader);
document.Close();
output = workStream.ToArray();
}
}
}
return output;
}
UPDATE 1
Thanks to @Bruno Lowagie's advice, I upgraded to iText7 and pdfHTML, but I can't find many tutorials about it.
I tried this code:
public static byte[] ToPdf(string html)
{
html = "<html><head><title>Extremely Basic Title</title></head><body>Extremely Basic Content</body></html>";
byte[] output;
using (var workStream = new MemoryStream())
using (var pdfWriter = new PdfWriter(workStream))
{
using (var document = HtmlConverter.ConvertToDocument(html, pdfWriter))
{
//Passes the document to a delegated function to perform some content, margin or page size manipulation
//pdfModifier(document);
}
//Returns the written-to MemoryStream containing the PDF.
return workStream.ToArray();
}
}
but I get
System.NullReferenceException
when I call HtmlConverter.ConvertToDocument(html, pdfWriter)
Am I missing something?
UPDATE 2
I tried to debug using source code.
This is the stack trace
System.NullReferenceException
HResult=0x80004003
Message=Object reference not set to an instance of an object.
Source=itext.io
StackTrace: at iText.IO.Font.FontCache..cctor() in S:\Progetti\*****\itext7-dotnet-develop\itext\itext.io\itext\io\font\FontCache.cs:line 76
This is the code that generates the exception:
static FontCache()
{
try
{
LoadRegistry();
foreach (String font in registryNames.Get(FONTS_PROP))
{
allCidFonts.Put(font, ReadFontProperties(font));
}
}
catch (Exception) { }
}
registryNames count = 0 and .Get(FONTS_PROP) throws the exception
UPDATE 3
The problem was related to some sort of cache. I can't really understand what, but as you can see in the code the exception was generated when it tried to load fonts from cache.
I realized that, after having tried the same code on a new project where it worked.
So I cleaned the solution, deleted bin, obj and .vs, killed IIS Express, removed and reinstalled all NuGet packages, then ran it again, and it worked.
Then I had to make only one fix to the code:
Instead of HtmlConverter.ConvertToDocument, which generated only a 15-byte document, I used HtmlConverter.ConvertToPdf to generate a full PDF.
Here is the complete code:
public static byte[] ToPdf(string html)
{
using (var workStream = new MemoryStream())
{
using (var pdfWriter = new PdfWriter(workStream))
{
HtmlConverter.ConvertToPdf(html, pdfWriter);
return workStream.ToArray();
}
}
}

I had this EXACT same problem, and after digging down all the way to iText7's FontCache object and getting an error when trying to create my OWN FontProgram to use from a raw TTF file (which also failed with the same null reference error), I finally "solved" my problem.
Apparently iText throws some internal exceptions that it catches and moves past. I realized by accident that I had "Enable Just My Code" disabled in Visual Studio, so the debugger was breaking on exceptions inside iText7's code as well as mine. The moment I re-enabled it (Tools > Options > Debugging > General > Enable Just My Code checkbox), the problem went away.
So I spent four hours trying to troubleshoot a problem that was in THEIR code, but that they apparently found some way to work around and push through the method anyways even on a null reference failure.
My convert to PDF function is now working just fine.

I was getting this error as well, but noticed it happened only on the first attempted load of the SvgConverter. So I added this at the top of my class, and it seems to have hidden the bug.
using System;
using iText.Kernel.Pdf;
using iText.IO.Font;
public class PdfBuilder {
static PdfBuilder() {
try {
FontCache.GetRegistryNames();
}
catch(Exception) {
// ignored... this forces the FontCache to initialize
}
}
...
}

I was using iText 7 and everything worked fine in a console application.
When I used the same code in a Web/Function App project, I started getting the error below.
System.NullReferenceException
HResult=0x80004003
Message=Object reference not set to an instance of an object.
Source=itext.html2pdf
StackTrace:
at iText.Html2pdf.Attach.Impl.Tags.BrTagWorker..ctor(IElementNode element, ProcessorContext context)
at iText.Html2pdf.Attach.Impl.DefaultTagWorkerMapping.<>c.<.cctor>b__1_10(IElementNode lhs, ProcessorContext rhs)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.Visit(INode node)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.Visit(INode node)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.Visit(INode node)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.Visit(INode node)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.Visit(INode node)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.Visit(INode node)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.Visit(INode node)
at iText.Html2pdf.Attach.Impl.DefaultHtmlProcessor.ProcessDocument(INode root, PdfDocument pdfDocument)
at iText.Html2pdf.HtmlConverter.ConvertToPdf(String html, PdfDocument pdfDocument, ConverterProperties converterProperties)
at iTextSample.ConsoleApp.HtmlToPdfBuilder.RenderPdf() in C:\code\iTextSample.ConsoleApp\HtmlToPdfBuilder.cs:line 227
After some investigation I found that the <br /> tag was the problem. I removed all <br /> tags and it now works fine.
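A minimal sketch of that workaround, assuming the HTML is available as a string before it is handed to HtmlConverter. The class and method names (HtmlCleanup, StripLineBreaks) are made up for illustration, and the pattern only covers plain <br>, <br/> and <br /> tags without attributes:

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleanup
{
    // Matches <br>, <br/> and <br /> in any letter case (no attributes).
    private static readonly Regex BrTag =
        new Regex(@"<br\s*/?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the HTML with every plain <br> tag removed.
    public static string StripLineBreaks(string html)
    {
        return BrTag.Replace(html, string.Empty);
    }
}
```

Note that this also removes the visual line breaks, so wrapping the affected text in block elements such as <p> or <div> may be needed to keep the layout.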

Related

.NET Core pdf downloader "No output formatter was found for content types 'application/pdf'..."

I'm creating a .NET Core 3.1 web api method to download a pdf for a given filename. This method is shared across teams where their client code is generated using NSwag.
I recently changed produces attribute to Produces("Application/pdf") from json, this change is required so other teams can generate valid client code. However since this change, I haven't been able to download any files from this method. Requests to download documents return with a 406 error (in Postman) and the following error is logged to the server event viewer.
No output formatter was found for content types 'application/pdf, application/pdf' to write the response.
Reverting the produced content-type to 'application/json' does allow documents to be downloaded, but as mentioned, this value is required to be pdf.
Any suggestions would be greatly appreciated.
Method:
[HttpGet("{*filePath}")]
[ProducesResponseType(typeof(FileStreamResult), StatusCodes.Status200OK)]
[ProducesResponseType(StatusCodes.Status404NotFound)]
[ProducesResponseType(StatusCodes.Status400BadRequest)]
[ProducesResponseType(StatusCodes.Status401Unauthorized)]
[Produces("Application/pdf")]
public async Task<ActionResult> GetDocument(string fileName) {
RolesRequiredHttpContextExtensions.ValidateAppRole(HttpContext, _RequiredScopes);
var memoryStream = new MemoryStream();
using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 4096, useAsync: true)) {
stream.CopyTo(memoryStream);
}
memoryStream.Seek(offset: 0, SeekOrigin.Begin);
return new FileStreamResult(memoryStream, "Application/pdf");
}
I just came across the same error and after some investigation I found out that the cause of the exception was indeed in the model binding error. You already wrote about it in your answer, but on closer inspection it became obvious that the reason was not related to binding itself, rather to the response body.
Since you specified [Produces("application/pdf")] the framework assumes this content type is the only possible for this action, but when an exception is thrown, you get application/json containing error description instead.
So to make this work for both "happy path" and exceptions, you could specify multiple response types:
[Produces("application/pdf", "application/json")]
public async Task<ActionResult> GetDocument(string fileName)
{
...
}
I am using
public async Task<IActionResult> BuildPDF()
{
// Assumes GetData() returns a MemoryStream; the base Stream type has no ToArray().
MemoryStream pdfStream = _pdfService.GetData();
byte[] memoryContent = pdfStream.ToArray();
return File(memoryContent, "application/pdf");
}
and it works. Could you please try?
The issue was caused by renaming the method parameter and not updating [HttpGet("{*filePath}")] to [HttpGet("{*fileName}")]
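A minimal sketch of the corrected action, where the catch-all route token and the method parameter share the same name. The controller name, route prefix and DocumentRoot folder are hypothetical and not from the original post:

```csharp
using System.IO;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/documents")]              // hypothetical route prefix
public class DocumentsController : ControllerBase
{
    // Hypothetical folder that holds the PDFs.
    private const string DocumentRoot = @"C:\documents";

    // The route token {*fileName} matches the parameter name,
    // so model binding can fill it from the URL.
    [HttpGet("{*fileName}")]
    [Produces("application/pdf", "application/json")]
    public IActionResult GetDocument(string fileName)
    {
        // A real endpoint should also validate fileName against path traversal.
        var fullPath = Path.Combine(DocumentRoot, fileName);
        if (!System.IO.File.Exists(fullPath))
        {
            return NotFound();
        }
        return PhysicalFile(fullPath, "application/pdf");
    }
}
```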
I had the same error, it is very confusing in some cases.
I got this error after adding a new parameter of type int[] to my method and forgetting the [FromQuery] attribute for it.
After adding the [FromQuery] attribute, the error was gone.
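For illustration, a small sketch of an array parameter bound from the query string. The controller, route and parameter names are invented for the example:

```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/reports")]                // hypothetical route
public class ReportsController : ControllerBase
{
    // Called as GET /api/reports?ids=1&ids=2
    // Without [FromQuery], an [ApiController] action treats int[] as a body parameter.
    [HttpGet]
    public IActionResult GetReports([FromQuery] int[] ids)
    {
        return Ok(new { requested = ids.Length });
    }
}
```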

Adding text items to an Existing PDF w/ Telerik DocumentProcessing Library

I want to open an existing PDF document and add different annotations to it, namely bookmarks and some text.
I am using the Telerik Document Processing Library (DPL) v2019.3.1021.40.
I am new to DPL, but I believe the RadFlowDocument is the way to go.
I am having trouble creating the RadFlowDocument:
FlowProvider.PdfFormatProvider provider = new FlowProvider.PdfFormatProvider();
using (Stream stream = File.OpenRead(sourceFile))
{
--> RadFlowDocument flowDoc = provider.Import(stream);
}
The line indicated with the arrow gives the error "Import Not Supported".
There is a Telerik forum thread here:
https://www.telerik.com/forums/radflowdocument-to-pdf-error
It seems relevant, but not 100% sure.
It cautions to be sure the providers are matched correctly; I believe they are in my example.
Again, the ultimate goal is to open a PDF and add some stuff to it. I think the RadFlowDocument is the right direction. If there is a better solution, I'm happy to hear that too.
I figured it out. The DPL is pretty good, but its documentation is still growing; I hope this helps someone out.
This draws from a myriad of articles; I can't begin to cite them all.
There are two notions for working with PDFs in the DPL:
FixedDocument takes pages. I think this is meant for stitching documents together.
FlowDocument, I believe, lays things out like an HTML renderer would.
I am using Fixed, mainly because I can get it to work.
using System;
using System.IO;
using System.Windows; //nec for Size struct
using System.Diagnostics; //nec for launching the pdf at the end
using Telerik.Windows.Documents.Fixed.Model;
//if you have fixed and flow provider, you have to specify, so I make a shortcut
using FixedProvider = Telerik.Windows.Documents.Fixed.FormatProviders.Pdf;
using Telerik.Windows.Documents.Fixed.Model.Editing;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace DocAggregator
{
[TestClass]
public class UnitTest2
{
[TestMethod]
public void EditNewFIle_SrcAsFixed_TrgAsFixed()
{
String dt = @"C:\USERS\greg\DESKTOP\DPL\";
String sourceFile = dt + "output.pdf";
//Open the sourceDoc so you can add stuff to it
RadFixedDocument sourceDoc;
//a provider parses the actual file into the model.
FixedProvider.PdfFormatProvider fixedProv = new FixedProvider.PdfFormatProvider();
using (Stream stream = File.OpenRead(sourceFile))
{
//'populate' the doc object from the file
//using the FLOW classes, I get "Import Not Supported".
sourceDoc = fixedProv.Import(stream);
}
int pages = sourceDoc.Pages.Count;
int pageCounter = 1;
int xoffset = 150;
int yoffset = 50;
//editor is the thing that lets you add elements into the source doc
//Like the provider, the Editor needs to match the document class (Fixed or Flow)
RadFixedDocumentEditor editor = new RadFixedDocumentEditor(sourceDoc);
foreach (RadFixedPage page in sourceDoc.Pages)
{
FixedContentEditor pEd = new FixedContentEditor(page);
Size ps = page.Size;
pEd.Position.Translate(ps.Width - xoffset, ps.Height - yoffset);
Block block = new Block();
block.HorizontalAlignment = Telerik.Windows.Documents.Fixed.Model.Editing.Flow.HorizontalAlignment.Center;
block.TextProperties.FontSize = 22;
block.InsertText(string.Format("Page {0} of {1} ", pageCounter, pages));
pEd.DrawBlock(block);
pageCounter++;
}
string exportFileName = "addedPageNums.pdf";
if (File.Exists(exportFileName))
{
File.Delete(exportFileName);
}
File.WriteAllBytes(exportFileName, fixedProv.Export(sourceDoc));
//launch the app
Process.Start(exportFileName);
}
}
}

How to merge 10000 pdf into one using pdfbox in most effective way

The PDFBox API works fine for a small number of files, but I need to merge 10000 PDF files into one, and when I pass 10000 files (about 5 GB) it takes 5 GB of RAM and finally runs out of memory.
Is there an implementation for such a requirement in PDFBox?
I tried to tune it by using AutoClosedInputStream, which is closed automatically after being read, but the result is still the same.
I have a similar scenario here, but I need to merge only 1000 documents into a single one.
I tried to use the PDFMergerUtility class, but I was getting an OutOfMemoryError. So I refactored my code to load each document, take its first page (my source documents have one page only), and import it, instead of using PDFMergerUtility. It now works fine, with no more OutOfMemoryError.
public void merge(final List<Path> sources, final Path target) {
final int firstPage = 0;
try (PDDocument doc = new PDDocument()) {
for (final Path source : sources) {
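// MemoryUsageSetting (org.apache.pdfbox.io) buffers each source document in a temp file instead of RAM
try (final PDDocument sdoc = PDDocument.load(source.toFile(), MemoryUsageSetting.setupTempFileOnly())) {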
final PDPage spage = sdoc.getPage(firstPage);
doc.importPage(spage);
}
}
doc.save(target.toAbsolutePath().toString());
} catch (final IOException e) {
throw new IllegalStateException(e);
}
}

Text Extraction, Not Image Extraction

Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser. I'm getting exceptions because the ParseContentMethod tries to parse inline images? The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream but I have a PDF file failing to extract text because of inline images. It returns an UnsupportedPdfException of "The filter /DCTDECODE is not supported" and then it finally fails with an InlineImageParseException of "Could not find image data or EI", when all I really care about is the text. The BI/EI exists in my file so I assume this failure is because of the /DCTDECODE exception. But again, I don't care about images, I'm looking for text.
My current solution for this is to add a filterHandler in the InlineImageUtils class that assigns the Filter_DoNothing() filter to the DCTDECODE filterHandler dictionary. This way I don't get exceptions when I have InlineImages with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
try {
IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
PdfReader.DecodeBytes(samples, imageDictionary, handlers);
return true;
} catch (IOException e) {
return false;
}
}
public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
{
return b;
}
}
My problem with this "fix" is that I had to change the iTextSharp library. I'd rather not do that so I can try to stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5

Rss20FeedFormatter Ignores TextSyndicationContent type for SyndicationItem.Summary

While using the Rss20FeedFormatter class in a WCF project, I was trying to wrap the content of my description elements with a <![CDATA[ ]]> section. I found that no matter what I did, the HTML content of the description elements was always encoded and the CDATA section was never added. After peering into the source code of Rss20FeedFormatter, I found that when building the Summary node, it basically creates a new TextSyndicationContent instance which wipes out whatever settings were previously specified (I think).
My Code
public class CDataSyndicationContent : TextSyndicationContent
{
public CDataSyndicationContent(TextSyndicationContent content)
: base(content)
{
}
protected override void WriteContentsTo(System.Xml.XmlWriter writer)
{
writer.WriteCData(Text);
}
}
... (The following code should wrap the Summary with a CDATA section)
SyndicationItem item = new SyndicationItem();
item.Title = new TextSyndicationContent(name);
item.Summary = new CDataSyndicationContent(
new TextSyndicationContent(
"<div>This is a test</div>",
TextSyndicationContentKind.Html));
Rss20FeedFormatter Code
(AFAIK, the above code does not work because of this logic)
...
else if (reader.IsStartElement("description", ""))
result.Summary = new TextSyndicationContent(reader.ReadElementString());
...
As a workaround, I've resorted to using the RSS20FeedFormatter to build the RSS, and then patch the RSS manually. For example:
StringBuilder buffer = new StringBuilder();
XmlTextWriter writer = new XmlTextWriter(new StringWriter(buffer));
feedFormatter.WriteTo(writer ); // feedFormatter = RSS20FeedFormatter
PostProcessOutputBuffer(buffer);
WebOperationContext.Current.OutgoingResponse.ContentType =
"application/xml; charset=utf-8";
return new MemoryStream(Encoding.UTF8.GetBytes(buffer.ToString()));
...
public void PostProcessOutputBuffer(StringBuilder buffer)
{
var xmlDoc = XDocument.Parse(buffer.ToString());
foreach (var element in xmlDoc.Descendants("channel").First()
.Descendants("item")
.Descendants("description"))
{
VerifyCdataHtmlEncoding(buffer, element);
}
foreach (var element in xmlDoc.Descendants("channel").First()
.Descendants("description"))
{
VerifyCdataHtmlEncoding(buffer, element);
}
buffer.Replace(" xmlns:a10=\"http://www.w3.org/2005/Atom\"",
" xmlns:atom=\"http://www.w3.org/2005/Atom\"");
buffer.Replace("a10:", "atom:");
}
private static void VerifyCdataHtmlEncoding(StringBuilder buffer,
XElement element)
{
if (!element.Value.Contains("<") || !element.Value.Contains(">"))
{
return;
}
var cdataValue = string.Format("<{0}><![CDATA[{1}]]></{2}>",
element.Name,
element.Value,
element.Name);
buffer.Replace(element.ToString(), cdataValue);
}
The idea for this workaround came from the following location, I just adapted it to work with WCF instead of MVC. http://localhost:8732/Design_Time_Addresses/SyndicationServiceLibrary1/Feed1/
I'm just wondering if this is simply a bug in Rss20FeedFormatter or is it by design? Also, if anyone has a better solution, I'd love to hear it!
Well @Page Brooks, I see this more as a solution than as a question :). Thanks!!! And to answer your question ( ;) ), yes, I definitely think this is a bug in the Rss20FeedFormatter (though I did not chase it as far), because I had encountered precisely the same issue that you described.
You have a 'localhost:8732' reference in your post, but it wasn't available on my localhost ;). I think you meant to credit the 'PostProcessOutputBuffer' workaround to this post:
http://damieng.com/blog/2010/04/26/creating-rss-feeds-in-asp-net-mvc
Or actually it is not in this post, but in a comment on it by David Whitney, which he later put in a separate gist here:
https://gist.github.com/davidwhitney/1027181
Thank you for adapting this workaround closer to my needs, because I had found the workaround too, but was still struggling to adapt it from MVC. Now I only needed to tweak your solution to write the RSS feed to the current HTTP request in the .ashx handler that I was using it in.
Basically I'm guessing that the fix you mentioned using the CDataSyndicationContent is from February 2011, assuming you got it from this post (at least I did):
SyndicationFeed: Content as CDATA?
This fix stopped working in some newer ASP.NET version or something, due to the code of the Rss20FeedFormatter changing to what you put in your post. This code change might as well have been an improvement for other stuff that IS in the MVC framework, but for those using the CDataSyndicationContent fix it definitely causes a bug!
string stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform""><xsl:output cdata-section-elements=""description"" method=""xml"" indent=""yes""/></xsl:stylesheet>";
XmlReader reader = XmlReader.Create(new StringReader(stylesheet));
XslCompiledTransform t = new XslCompiledTransform(true);
t.Load(reader);
using (MemoryStream ms = new MemoryStream())
{
XmlWriter writer = XmlWriter.Create(ms, t.OutputSettings);
rssFeed.WriteTo(writer); // rssFeed is Rss20FeedFormatter
writer.Flush();
ms.Position = 0;
string niko = Encoding.UTF8.GetString(ms.ToArray());
}
I'm sure someone has pointed this out already, but this is a stupid workaround I used.
t.OutputSettings is of type XmlWriterSettings with cdataSections being populated with a single XmlQualifiedName "description".
Hope it helps someone else.
I found the code for Cdata elsewhere
public class CDataSyndicationContent : TextSyndicationContent
{
public CDataSyndicationContent(TextSyndicationContent content)
: base(content)
{
}
protected override void WriteContentsTo(System.Xml.XmlWriter writer)
{
writer.WriteCData(Text);
}
}
The code to call it is something along these lines:
item.Content = new Helpers.CDataSyndicationContent(new TextSyndicationContent("<span>TEST2</span>", TextSyndicationContentKind.Html));
However the "WriteContentsTo" function wasn't being called.
Instead of Rss20FeedFormatter I tried Atom10FeedFormatter - and it worked!
Obviously this gives Atom feed rather than traditional RSS - but worth mentioning.
Output code is:
//var formatter = new Rss20FeedFormatter(feed);
Atom10FeedFormatter formatter = new Atom10FeedFormatter(feed);
using (var writer = XmlWriter.Create(response.Output, new XmlWriterSettings { Indent = true }))
{
formatter.WriteTo(writer);
}