iText 7 Chinese characters and merge with existing pdf template - pdf

I have to rephrase my question, basically my request is very straight forward, i want to display Asian characters in the generated pdf file from iText7.
As of now i have download the NotoSansCJKsc-Regular.otf file and assign a variable to hold the path, below is my code:
public static string FONT = #"D:\Projects\Resources\NotoSansCJKsc-Regular.otf";
PdfWriter writer = new PdfWriter(#"C:\temp\test.pdf");
PdfDocument pdfDoc = new PdfDocument(writer);
Document doc = new Document(pdfDoc, PageSize.A4);
PdfFont fontChinese = PdfFontFactory.CreateFont(FONT, PdfEncodings.IDENTITY_H);
doc.SetFont(fontChinese);
but the issue i am facing now is whenever the code runs to this section:
PdfFont fontChinese = PdfFontFactory.CreateFont(FONT, PdfEncodings.IDENTITY_H);
i am always getting this error: The request could not be performed because of an I/O device error. and this error doesn't make sense to me and I am struggling to find out the solution, could someone in here had the similar issue plz, the code is in C#.
Many thanks.

I can confirm that above code is working as expected, the .otf file that I was originally downloaded was corrupted, hence I got above error.

Related

Error converting .docx with HTML to PDF using Graph API

I am trying to convert MS Word (.docx) file to PDF format using Graph API. The file is stored in SharePoint Office 365. I am using below code which works.
var httpClient = await CreateAuthorizedHttpClient();
string path = $"{GraphEndpoint}sites/{SiteId}/drive/items/";
string requestUrl = $"{path}{fileId}/content?format={targetFormat}";
var response = await httpClient.GetAsync(requestUrl);
However, when we try to convert .docx file which contains HTML added using below code converting fails.
string altChunkId = "myId123";
//Create an alternative format import part on the MainDocumentPart
AlternativeFormatImportPart altformatImportPart = wordDoc.MainDocumentPart
.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Html, altChunkId);
using (MemoryStream htmlMemoryStream = new MemoryStream(Encoding.UTF8.GetBytes($"<html><head></head><body>{value}</body></html>")))
{
//Add the HTML data into the alternative format import part
altformatImportPart.FeedData(htmlMemoryStream);
//create a new altChunk and link it to the id of the AlternativeFormatImportPart
AltChunk altChunk = new AltChunk();
altChunk.Id = altChunkId;
//p.InsertAfterSelf(altChunk);
documentBody.Append(altChunk);
break;
}
I get 406 Not Acceptable error when we try to convert the file using Graph API. Also I see that the file is not editable in browser and open in compatibility mode. If I try to open the document in edit mode I get error:
Sorry this document can't be opened because it contains objects that
word doesn't support
I tried removing the HTML part of the document and pasted that in another document and tried converting that document to PDF which worked. When I saw the XML of the document I saw Word App converted that HTML to word compatible XML tags.
Question 1: How can I convert the HTML to word compatible tags? So I can convert the document to PDF.
Also if I try to Download as PDF, the file is converted to PDF without any issue.
This option is using below API call:
https://word-view.officeapps.live.com/wv/WordViewer/request.pdf?WOPIsrc={SiteURL}%2F%5Fvti%5Fbin%2Fwopi%2Eashx%2Ffiles%2F{ID}&access_token=&access_token_ttl=&z=256&type=downloadpdf
Question 2: Is there a way I can use this API to convert .docx file to PDF? I saw the access token's audience value is "wopi/{TenantName}#{TenantID}". If I get the correct access token I think I will be able to use the above API.

Remove PDFont caching with Apache tika

I am trying to extract text only from a number of different coduments (rtf doc pdf). I naturally turned to Apache Tika because it can autodetect the document and extract text accordingly. I am only interested in the text and not formatting etc.
My application ends up with a big memory leak and on investigating it, this is coming from caching from PDFFont class from the PDFBox dependency. I am not interesting in caching Fontmetrics and other Font formatting issues from pdfs as I want to only extract the text.
I am using tika 1.12. Does anyone know how to get around this cahcing issue. This is how I am using Autodetect:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File(child.getPath()));
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);
String s=null;
s =handler.toString();
handler=null;
context=null;
inputstream.close();
PDFont.clearResources();
So I fudged a workaround and just called System.gc(); everytime the file had finished being processed which works a treat but doesn't really answer the question.

Error using Aspose.Email to open an embedded PDF attachment, then load into Datalogics

I am using using the Aspose.Email to get attachments out of an Outlook email like this:
var mailMessage = Aspose.Email.Mail.MailMessage.Load(stream);
var attachments = Aspose.Email.Outlook.MapiMessage.FromMailMessage(mailMessage).Attachments;
var pdfAttachment = attachments.ToList()[attachmentIndexDesired];
Then, I am loading the attachment into DataLogics like this:
var pdfStream = new MemoryStream(pdfAttachment.BinaryData);
var pdfDocument = new Datalogics.PDFL.Document(pdfStream);
Here I get the following exception:
PDF Library Error: File does not begin with '%PDF-'. Error number: 537001985
I cannot find anything on this error anywhere.
Note that the initial stream object above is a *.msg Outlook file and originates from a sharepoint SPFile. Also note that if the stream object SPFile is itself a PDF file (as opposed to an attachment to a *.msg file) I can load it into DataLogics just fine.
I know the error is being thrown by the DataLogics library, but is there something about how I am getting the attachment that could be changed/improved that would prevent this error from occurring?
Any ideas?
So just a few minutes ago I was trying to grab the BinaryData from the attachment and convert it to a string that I could read so I could visually inspect its contents. So I did this.
(new StreamReader(new MemoryStream(curAttachment.BinaryData))).ReadToEnd()
When I did that, it printed this error:
"Evaluation copy of Aspose.Email limits to extract only 3 attachments in the messages. Please contact sales#aspose.com to purchase a valid license."
Kind of strange as I am using my production license, but either way I went ahead and deleted some attachments from the parent *.msg file (it previously had 4) and tried to load it again. Then... poof... it started working.
Seems like it should throw an exception or something instead as that would have made this issue a lot easier to track down.

GhostScript .NET not continuing past certain pages

I've created a program which needs to convert PDF files into image files, and for this GhostScript is the best choice. But once in a while, the library stalls completely on a page and doesn't continue, it just keeps using CPU power and working, as though it might be caught in an infinite loop. The error is easily reproduce-able as it happens every time on the specific PDF files that it occurs on, though no error is given from GhostScript of any kind, and nothing is out of the ordinary in the PDF files themselves as far as I can see.
I have however been able to find out that the stalling is due to a specific element or elements in the pdf files, and by deleting the elements the pdf will easily render in GhostScript, but this is not a solution, nor an answer I can use.
PDF link* - http://www.filedropper.com/usjunis1-32webtest
*saved with free version of PDF-XChange Editor, so it has watermarks at the top, but it is the square that creates the stalling. I've also seen it happen on vector graphics objects, so it is not limited to squares.
Code -
private void startImageProcessing(String pdfFile)
{
GhostscriptVersionInfo gvi = new GhostscriptVersionInfo(new Version(0, 0, 0), Directory.GetCurrentDirectory() + #"\gsdll32.dll", string.Empty, GhostscriptLicense.GPL);
Ghostscript.NET.Processor.GhostscriptProcessor processor = new Ghostscript.NET.Processor.GhostscriptProcessor(gvi, true);
processor.StartProcessing(CreateTestArgs(pdfFile, pdfFile.Substring(0, pdfFile.Length - 4) + "\\"+prefix+"-%03d.jpg", 72 * scale), new ConsoleStdIO(true));
}
private static string[] CreateTestArgs(string inputPath, string outputPath, int dpi)
{
List<string> gsArgs = new List<string>();
gsArgs.Add("-dSAFER");
gsArgs.Add("-dBATCH");
gsArgs.Add("-dNOPAUSE");
gsArgs.Add("-sDEVICE=jpeg");
gsArgs.Add("-r" + dpi);
gsArgs.Add("-dJPEGQ=100");
gsArgs.Add("-dNumRenderingThreads=" + Environment.ProcessorCount.ToString());
gsArgs.Add("-dTextAlphaBits=4");
gsArgs.Add("-dGraphicsAlphaBits=4");
gsArgs.Add(#"-sOutputFile=" + outputPath);
gsArgs.Add(#"-f" + inputPath);
return gsArgs.ToArray();
}
I've also created a pdf file only containing one of the wrong elements for testing, and it has both had the error when saved by Adobe Acrobat, and PDF-XChange Editor, so the error is not due to a specific program that I've used to save the PDF either.

an error 3013 thrown when writing a file Adobe AIR

I'm trying to write/create a JSON file from a AIR app, I'm trying not so show a 'Save as' dialogue box.
Here's the code I'm using:
var fileDetails:Object = CreativeMakerJSX.getFileDetails();
var fileName:String = String(fileDetails.data.filename);
var path:String = String(fileDetails.data.path);
var f:File = File.userDirectory.resolvePath( path );
var stream:FileStream = new FileStream();
stream.open(f, FileMode.WRITE );
stream.writeUTFBytes( jsonToExport );
stream.close();
The problem I'm having is that I get a 'Error 3013. File or directory in use'. The directory/path is gathered from a Creative Suite Extension I'm building, this path is the same as the FLA being developed in CS that the Extension is being used with.
So I'm not sure if the problem is that there are already files in the directory I'm writing the JSON file to?
Do I need to add a timer in order to close the stream after a slight delay, giving some time to writing the file?
Can you set up some trace() commands? I would need to know what the values of the String variables are, and the f.url.
Can you read from the file that you are trying to write to, or does nothing work?
Where is CreativeMakerJSX.getFileDetails() coming from? Is it giving you data about a file that is in use?
And from Googling around, this seems like it may be a bug. Try setting up a listener for when you are finished, if you have had the file open previously.
I re-wrote how the file was written, no longer running into this issue.