I have successfully converted PDF to text using iTextSharp using the following code:
var reader = new PdfReader(filePath);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new
iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s =Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s + Environment.NewLine;
pdfTextBox.Text = strText;
}
reader.Close();
However, certain PDFs, which show text when viewing as PDF, show up as empty(no characters).
Does anyone have any ideas why?
All help would be appreciated
Thanks in advance
Related
I have to put my list data in a table in a pdf file. My data has some Arabic words. When my pdf is generated, the Arabic words don't appear. I searched and found that I need itext7.pdfcalligraph so I installed it in my app. I found this code too https://itextpdf.com/en/blog/technical-notes/displaying-text-different-languages-single-pdf-document and tried to do something similar to allow Arabic words in my table but I couldn't figure it out.
This is a trial code before I apply it to my real list:
var path2 = global::Android.OS.Environment.ExternalStorageDirectory.AbsolutePath;
filePath = System.IO.Path.Combine(path2.ToString(), "myfile2.pdf");
stream = new FileStream(filePath, FileMode.Create);
PdfWriter writer = new PdfWriter(stream);
PdfDocument pdf2 = new iText.Kernel.Pdf.PdfDocument(writer);
Document document = new Document(pdf2, PageSize.A4);
FontSet set = new FontSet();
set.AddFont("ARIAL.TTF");
document.SetFontProvider(new FontProvider(set));
document.SetProperty(Property.FONT, "Arial");
string[] sources = new string[] { "يوم","شهر 2020" };
iText.Layout.Element.Table table = new iText.Layout.Element.Table(2, false);
foreach (string source in sources)
{
Paragraph paragraph = new Paragraph();
Bidi bidi = new Bidi(source, Bidi.DirectionDefaultLeftToRight);
if (bidi.BaseLevel != 0)
{
paragraph.SetTextAlignment(iText.Layout.Properties.TextAlignment.RIGHT);
}
paragraph.Add(source);
table.AddCell(new Cell(1, 1).SetTextAlignment(iText.Layout.Properties.TextAlignment.CENTER).Add(paragraph));
}
document.Add(table);
document.Close();
I updated my code and added the arial.ttf to my assets folder . i'm getting the following exception:
System.InvalidOperationException: 'FontProvider and FontSet are empty. Cannot resolve font family name (see ElementPropertyContainer#setFontFamily) without initialized FontProvider (see RootElement#setFontProvider).'
and I still can't figure it out. any ideas?
thanks in advance
- C #
I have a similar situation for Turkish characters, and I've followed these
steps :
Create a folder under projects root folder which is : /wwwroot/Fonts
Add OpenSans-Regular.ttf under the Fonts folder
Path for font is => ../wwwroot/Fonts/OpenSans-Regular.ttf
Create font like below :
public static PdfFont CreateOpenSansRegularFont()
{
var path = "{Your absolute path for FONT}";
return PdfFontFactory.CreateFont(path, PdfEncodings.IDENTITY_H, true);
}
and use it like :
paragraph.Add(source)
.SetFont(FontFactory.CreateOpenSansRegularFont()); //set font in here
table.AddCell(new Cell(1, 1)
.SetTextAlignment(iText.Layout.Properties.TextAlignment.CENTER)
.Add(paragraph));
This is how I used font factory for Turkish characters ex: "ü,i,ç,ş,ö"
For Xamarin-Android, you could try
string documentsPath = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
var path = Path.Combine(documentsPath, "Fonts/Arial.ttf");
Look i fixed it in java,this may help you:
String font = "your Arabic font";
//the magic is in the next 4 lines:
PdfFontFactory.register(font);
FontProgram fontProgram = FontProgramFactory.createFont(font, true);
PdfFont f = PdfFontFactory.createFont(fontProgram, PdfEncodings.IDENTITY_H);
LanguageProcessor languageProcessor = new ArabicLigaturizer();
//and look here how i used setBaseDirection and don't use TextAlignment ,it will work without it
com.itextpdf.kernel.pdf.PdfDocument tempPdfDoc = new com.itextpdf.kernel.pdf.PdfDocument(new PdfReader(pdfFile.getPath()), TempWriter);
com.itextpdf.layout.Document TempDoc = new com.itextpdf.layout.Document(tempPdfDoc);
com.itextpdf.layout.element.Paragraph paragraph0 = new com.itextpdf.layout.element.Paragraph(languageProcessor.process("الاستماره الالكترونية--الاستماره الالكترونية--الاستماره الالكترونية--الاستماره الالكترونية"))
.setFont(f).setBaseDirection(BaseDirection.RIGHT_TO_LEFT)
.setFontSize(15);
I' m trying to convert a docx document to pdf and store the newly created pdf file as a new version.
This is the test code:
var document = search.findNode("workspace://SpacesStore/30f334f3-d357-4ea6-a09f-09eab2da7488");
var folder = document.parent
var pdf = document.transformDocument('application/pdf');
pdf.name = "tranformed-" + pdf.name;
pdf.save();
document.name = "new-" + document.name + ".pdf";
document.mimetype = "application/pdf";
document.content = pdf.content;
document.save();
The document ends up empty.
Is this type of conversion possible with javaScript?
This Code create new pdf from docx and created pdf stored as version 1.0
var document = search.findNode("workspace://SpacesStore/30f334f3-d357-4ea6-a09f-09eab2da7488");
var folder = document.parent
var pdf = document.transformDocument('application/pdf');
pdf.name = "tranformed-" + pdf.name;
pdf.save();
Thanks for your support.
The problem was assigning the pdf content.
The following code seems to work only with plain text content:
document.content = pdf.content;
Paradoxically, the following is needed when assigning pdf content to a document.
document.properties.content.write(pdf.properties.content);
Thanks.
In a project I have to split a PDF document into two documents, one containing all blank pages, and one containing all pages with content.
For this job, I use a PdfReader to read the source file, and two pdfCopy objects (one for the blank pages document, one for the pages with content document) to write the files to.
I use GetImportedPage to read a PdfImportedPage, which is then added to one of the PdfCopy writers.
Now, the problem is the following: the source file is using the "tagged PDF format". To preserve this (which is absolutely required), I use the SetTagged() method on both PdfCopy writers, and use the extra third parameter in GetImportedPage(...) to keep the tagged format. However, when calling the AddPage(...) on the PdfCopy writer, I get an invalid cast exception:
"Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' to type 'iTextSharp.text.pdf.PRIndirectReference'."
Anyone has any ideas on how to solve this ? Any hints ?
Also: the project currently refers version 5.1.0.0 of the itext libraries. In 5.4.4.0 the third parameter to GetImportedPage does not seem to be there anymore.
Below, you can find a code extract:
iTextSharp.text.Document targetPdf = new iTextSharp.text.Document();
iTextSharp.text.Document blankPdf = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfReader sourcePdfReader = new iTextSharp.text.pdf.PdfReader(inputFile);
iTextSharp.text.pdf.PdfCopy targetPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(targetPdf, new FileStream(outputFile, FileMode.Create));
iTextSharp.text.pdf.PdfCopy blankPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(blankPdf, new FileStream(blanksFile, FileMode.Append));
targetPdfWriter.SetTagged();
blankPdfWriter.SetTagged();
try
{
iTextSharp.text.pdf.PdfImportedPage page = null;
int n = sourcePdfReader.NumberOfPages;
targetPdf.Open();
blankPdf.Open();
blankPdf.Add(new iTextSharp.text.Phrase("This document contains the blank pages removed from " + inputFile));
blankPdf.NewPage();
for (int i = 1; i <= n; i++)
{
byte[] pageBytes = sourcePdfReader.GetPageContent(i);
string pageText = "";
iTextSharp.text.pdf.PRTokeniser token = new iTextSharp.text.pdf.PRTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(pageBytes));
while (token.NextToken())
{
if (token.TokenType == iTextSharp.text.pdf.PRTokeniser.TokType.STRING)
{
pageText += token.StringValue;
}
}
if (pageText.Length >= 15)
{
page = targetPdfWriter.GetImportedPage(sourcePdfReader, i, true);
targetPdfWriter.AddPage(page);
}
else
{
page = blankPdfWriter.GetImportedPage(sourcePdfReader, i, true);
blankPdfWriter.AddPage(page);
blankPageCount++;
}
}
}
catch (Exception ex)
{
Console.WriteLine("Exception at LOC1: " + ex.Message);
}
The error occurs in the call to targetPdfWriter.AddPage(page); near the end of the code sample.
Thank you very much for your help.
Koen.
When I tried to use the code to convert the file format files to PDF using itextsharp. The problem appears when converting texts written in Arabic. The result came without the written text in Arabic.
I hope you to help me overcome the problems.
Thank you very much
Here's a summary of the process:
Wrap Paragraph objects in one of the two iText IElement classes that support Arabic text: PdfPCell and ColumnText.
Use a font that has Arabic glyphs.
Set the text run direction and alignment.
Something like this:
using (Document document = new Document()) {
PdfWriter writer = PdfWriter.GetInstance(document, STREAM);
document.Open();
string arabicText = #"
iText ® هي المكتبة التي تسمح لك لخلق والتلاعب وثائق PDF. فإنه يتيح للمطورين تتطلع الى تعزيز شبكة الإنترنت وغيرها من التطبيقات مع دينامية الجيل ثيقة PDF و / أو تلاعب.
";
PdfPTable table = new PdfPTable(1);
table.WidthPercentage = 100;
PdfPCell cell = new PdfPCell();
cell.Border = PdfPCell.NO_BORDER;
cell.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
Font font = new Font(BaseFont.CreateFont(
"c:/windows/fonts/arialuni.ttf",
BaseFont.IDENTITY_H, BaseFont.EMBEDDED
));
Paragraph p = new Paragraph(arabicText, font);
p.Alignment = Element.ALIGN_LEFT;
cell.AddElement(p);
table.AddCell(cell);
document.Add(table);
}
Sorry if the example text above is poor, incorrect, or both. I had to use Google translate, since my native language is English.
Dear Team,
In my application, i want to split the pdf using itextsharp. If i upload PDF contains 10 pages with file size 10 mb for split, After splitting the combine file size of each pdfs will result into above 20mb file size. If this possible to reduce the file size(each pdf).
Please help me to solve the issue.
Thanks in advance
This may have to do with the resources in the file. If the original document uses an embedded font on each, for example, then there will only be one instance of the font in the original file. When you split it, each file will be required have that font as well. The total overhead will be n pages × sizeof(each font). Elements that will cause this kind of bloat include fonts, images, color profiles, document templates (aka forms), XMP, etc.
And while it doesn't help you in your immediate problem, if you use the PDF tools in Atalasoft dotImage, your task becomes a 1 liner:
PdfDocument.Separate(userpassword, ownerpassword, origPath, destFolder, "Separated Page{0}.pdf", true);
which will take the PDF in orig file and create new pages in the dest folder each named with the pattern. The bool at the end is to overwrite an existing file.
Disclaimer: I work for Atalasoft and wrote the PDF library (also used to work at Adobe on Acrobat versions 1, 2, 3, and 4).
Hi Guys i modified the above code to split a PDF file into multiple Pdf file.
iTextSharp.text.pdf.PdfReader reader = null;
int currentPage = 1;
int pageCount = 0;
//string filepath_New = filepath + "\\PDFDestination\\";
System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
//byte[] arrayofPassword = encoding.GetBytes(ExistingFilePassword);
reader = new iTextSharp.text.pdf.PdfReader(filepath);
reader.RemoveUnusedObjects();
pageCount = reader.NumberOfPages;
string ext = System.IO.Path.GetExtension(filepath);
for (int i = 1; i <= pageCount; i++)
{
iTextSharp.text.pdf.PdfReader reader1 = new iTextSharp.text.pdf.PdfReader(filepath);
string outfile = filepath.Replace((System.IO.Path.GetFileName(filepath)), (System.IO.Path.GetFileName(filepath).Replace(".pdf", "") + "_" + i.ToString()) + ext);
reader1.RemoveUnusedObjects();
iTextSharp.text.Document doc = new iTextSharp.text.Document(reader.GetPageSizeWithRotation(currentPage));
iTextSharp.text.pdf.PdfCopy pdfCpy = new iTextSharp.text.pdf.PdfCopy(doc, new System.IO.FileStream(outfile, System.IO.FileMode.Create));
doc.Open();
for (int j = 1; j <= 1; j++)
{
iTextSharp.text.pdf.PdfImportedPage page = pdfCpy.GetImportedPage(reader1, currentPage);
pdfCpy.SetFullCompression();
pdfCpy.AddPage(page);
currentPage += 1;
}
doc.Close();
pdfCpy.Close();
reader1.Close();
reader.Close();
}
Have you tried setting the compression on the writer?
Document doc = new Document();
using (MemoryStream ms = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(doc, ms);
writer.SetFullCompression();
}