I am developing a C# WinForms application that converts PDF contents to text. All the required content is extracted except the highlighted text in the PDF.
Please help me with a working sample that extracts the highlighted text from a PDF.
I am using iTextSharp.dll in the project.
Assuming that you're talking about Comments. Please try this:
for (int i = pageFrom; i <= pageTo; i++)
{
PdfDictionary page = reader.GetPageN(i);
PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if (annots != null)
foreach (PdfObject annot in annots.ArrayList)
{
PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
// now use the String value of contents
}
}
This is written from memory (I'm a Java developer, not a C# developer).
I am trying to add a watermark to a PDF file using PDFsharp. I tried the sample from this link:
http://www.pdfsharp.net/wiki/Watermark-sample.ashx
but I cannot work out how to get the page object of an existing PDF file and how to draw the watermark on that page.
Help?
Basically, the samples are only snippets. You can download the source and with that you get a bunch of samples, including this watermark example.
The following comes from PDFSharp-MigraDocFoundation-1_32/PDFsharp/samples/Samples C#/Based on GDI+/Watermark/Program.cs
Quite simple, really ... I am only showing the code up to the for loop that goes over each page. You should have a look at the full file.
[...]
const string watermark = "PDFsharp";
const int emSize = 150;
// Get a fresh copy of the sample PDF file
const string filename = "Portable Document Format.pdf";
File.Copy(Path.Combine("../../../../../PDFs/", filename),
Path.Combine(Directory.GetCurrentDirectory(), filename), true);
// Create the font for drawing the watermark
XFont font = new XFont("Times New Roman", emSize, XFontStyle.BoldItalic);
// Open an existing document for editing and loop through its pages
PdfDocument document = PdfReader.Open(filename);
// Set version to PDF 1.4 (Acrobat 5) because we use transparency.
if (document.Version < 14)
document.Version = 14;
for (int idx = 0; idx < document.Pages.Count; idx++)
{
//if (idx == 1) break;
PdfPage page = document.Pages[idx];
[...]
This is a case of OCR gone wrong. I need to remove the hidden text from a PDF and I'm having a hard time figuring out how to do it.
The hidden text resides in an object always named /QuickPDFsomething, which sits under an /XObject dictionary in the page's /Resources dictionary.
I have tried these two things and neither has worked so I'm clearly doing something wrong.
Option 1 - Kill obj - The PDF won't open in Acrobat and states, 'An error exists on this page. Acrobat may not display the page correctly' but it looks ok. Pitstop pukes with 'Critical parser failure: XObject resource missing'.
PdfReader.KillIndirect(obj);
oPdfFile.GetPdfReader().RemoveUnusedObjects();
var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(@"C:\temp.pdf", FileMode.Create));
stamper.Close();
Option 2 - CleanupProcessor - Throws an exception about 'A Graphics object cannot be created from an image that has an indexed pixel format'.
var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(@"C:\temp.pdf", FileMode.Create));
var cleanupLocations = new List<PdfCleanUpLocation>();
var pageRect = oPdfFile.GetPdfReader().GetCropBox(1);
cleanupLocations.Add(new PdfCleanUpLocation(1, pageRect));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanupLocations, stamper);
cleaner.CleanUp();
stamper.Close();
I'd like to remove the /QuickPDF object (41 0 R in my file) as well as the call to it in the content stream, /QuickPDF Do.
Unfortunately I cannot provide the PDF.
Any tips on how to do this?
I hate to answer my own question but I wanted to share the solution I found in case others need it.
After playing around with this for a couple of days I figured out that Option 1 above does indeed remove the object, and that the exception I was getting from PitStop occurred because the content stream still had a reference to the /QuickPDF XObject.
So I tried following @mkl's solution in "Removing Watermark from PDF iTextSharp", but it kept putting unwanted data in the content stream that rotated my PDF.
Then I found @Chris's solution in "Removing Watermark from a PDF using iTextSharp", and it seems to work, although I'm not sure how stable this solution will be.
This is my solution for removing /QuickPDF from the content stream:
int numPages = oPdfFile.GetPdfReader().NumberOfPages;
int pgNumber = 1;
PdfDictionary page = oPdfFile.GetPdfReader().GetPageN(pgNumber);
PdfArray contentarray = page.GetAsArray(PdfName.CONTENTS);
PRStream stream;
string content;
if (contentarray != null)
{
//Loop through content
for (int j = 0; j < contentarray.Size; j++)
{
stream = (PRStream)contentarray.GetAsStream(j);
content = Encoding.ASCII.GetString(PdfReader.GetStreamBytes(stream));
string[] tokens = content.Split('\n');
for (int i = 0; i< tokens.Length; i++)
{
if (tokens[i].Contains("/QuickPDF"))
{
tokens[i] = string.Empty;
}
}
string outstr = string.Join("\n", tokens);
byte[] outbytes = Encoding.ASCII.GetBytes(outstr);
stream.SetData(outbytes);
}
}
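The loop above drops every line that mentions /QuickPDF, and it round-trips the stream through ASCII, which turns any byte above 127 into '?'. A narrower variant can be sketched with the standard library only. The sketch below is in Java rather than C#, the class and method names are made up for illustration, and it assumes the XObject is invoked as "/QuickPDF<suffix> Do" in an uncompressed content stream. A plain regex can still misfire inside string literals or inline image data, so a tokenizing parser would be more robust.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText or iTextSharp.
final class ContentStreamFilter {

    // A name token starting with /QuickPDF, whitespace, then the Do operator
    // followed by whitespace or the end of the stream.
    private static final Pattern QUICKPDF_DO =
            Pattern.compile("/QuickPDF[^\\s/\\[\\]<>(){}%]*\\s+Do(?=\\s|$)");

    /** Removes "/QuickPDF... Do" invocations and leaves all other bytes untouched. */
    static byte[] stripQuickPdfInvocations(byte[] contentStream) {
        // ISO-8859-1 maps each byte to exactly one char and back, so binary
        // data elsewhere in the stream survives the round trip.
        String text = new String(contentStream, StandardCharsets.ISO_8859_1);
        String cleaned = QUICKPDF_DO.matcher(text).replaceAll("");
        return cleaned.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

For the input "q\n/QuickPDF1a2b Do\nQ\n/Im0 Do\n" this returns "q\n\nQ\n/Im0 Do\n": only the /QuickPDF call is removed and the other Do call stays.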
This code returns lots of \0\0s and extracts only a few English phrases from the PDF. Any Japanese text is not returned.
I am using Unicode encoding, so I am not sure what is happening here.
StringBuilder text = new StringBuilder(2000);
string fullFileName = @"c:\my_japanaese_pdf.pdf";
PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(fullFileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
(Windows 7 x64, iTextSharp 5.0.2.0)
Thanks
Ryan
I had this same problem, and here's what I did (note this code is extremely similar to the code in the question, but doesn't use any encoding conversion stuff).
using (iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(inputPDF))
{
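for (int i = 1; i <= reader.NumberOfPages; i++)
{
// Use a fresh strategy for each page; a reused LocationTextExtractionStrategy keeps the text of earlier pages
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);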
string[] lines = page.Split('\n');
foreach (string line in lines)
{
// do anything you want here
}
}
}
Even with the code above I still got no Japanese characters out of the PDF, so I changed the font used in the PDF to Meiryo UI, which solved the problem. Meiryo UI is a font that iTextSharp recognizes (at least in version 5.5.13.2), so Japanese text set in that font can be extracted successfully.
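As an aside on the encoding conversion in the question: encoding a well-formed string with one Unicode encoding and decoding it with the same encoding gives back the same string, so that step cannot restore characters the extractor never produced. A minimal Java sketch of the idea, using only the standard library (the class and method names are hypothetical; UTF-16LE is the Java counterpart of .NET's Encoding.Unicode):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, unrelated to iTextSharp itself.
final class EncodingRoundTrip {

    /** Encodes the text as UTF-16LE and decodes it again with the same charset. */
    static String roundTripUtf16(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16LE);
        return new String(bytes, StandardCharsets.UTF_16LE);
    }
}
```

Whatever the extractor returned, including any \0 characters, comes out of roundTripUtf16 unchanged, which is why removing the conversion loses nothing.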
I have some PDFs containing hyperlinks, both URLs and mailto links. Is there any way or tool (possibly third-party) to extract the hyperlink meta information from the PDF, such as coordinates, link type and destination address? Any help is highly appreciated.
I have already tried iText and PDFBox without much success, and even some third-party software does not give me the desired output.
I have tried the following code in Java using iText:
PdfReader myReader = new PdfReader("pdf File Path");
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
System.out.println(annots);
ArrayList<String> dests = new ArrayList<String>();
if(annots != null)
{
for(int i=0; i<annots.size(); ++i)
{
PdfDictionary annotDict = annots.getAsDict(i);
PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
if (subType != null && PdfName.LINK.equals(subType))
{
PdfDictionary action = annotDict.getAsDict(PdfName.A);
if(action != null && PdfName.URI.equals(action.getAsName(PdfName.S)))
{
dests.add(action.getAsString(PdfName.URI).toString());
} // else { its an internal link }
}
}
}
System.out.println(dests);
You can use Docotic.Pdf library for links extraction (disclaimer: I work for the company).
Below is code that opens the specified file, finds all hyperlinks, collects the position of each link and draws a rectangle around each link.
After that the code saves a new PDF (with the links in rectangles) and a text file with the collected information. At the end, both created files are opened in their default viewers.
public static void ListAndHighlightLinks(string inputFile, string outputFile, string outputTxt)
{
using (PdfDocument doc = new PdfDocument(inputFile))
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < doc.Pages.Count; i++)
{
PdfPage page = doc.Pages[i];
foreach (PdfWidget widget in page.Widgets)
{
PdfActionArea actionArea = widget as PdfActionArea;
if (actionArea == null)
continue;
PdfUriAction linkAction = actionArea.Action as PdfUriAction;
if (linkAction == null)
continue;
Uri url = linkAction.Uri;
PdfRectangle rect = actionArea.BoundingBox;
// add information about found link into string buffer
sb.Append("Page ");
sb.Append(i.ToString());
sb.Append(" : ");
sb.Append(rect.ToString());
sb.Append(" ");
sb.AppendLine(url.ToString());
// draw rectangle around found link
page.Canvas.DrawRectangle(rect);
}
}
// save document with highlighted links and text information about links to files
doc.Save(outputFile);
System.IO.File.WriteAllText(outputTxt, sb.ToString());
// open created PDF and text file in default viewers
System.Diagnostics.Process.Start(outputTxt);
System.Diagnostics.Process.Start(outputFile);
}
}
You can use the sample code with a call like this:
ListAndHighlightLinks("input.pdf", "output.pdf", "links.txt");
If your PDFs are copy-protected, start with step 1; if they can be copied freely, start with step 2.
Step 1: convert your PDFs to Word .doc files using Adobe Acrobat Pro or an online PDF-to-Word converter:
http://www.pdfonline.com/pdf2word/index.asp
Step 2: copy and paste the whole document into the input window here (you can also download the lightweight HTML tool):
http://www.surf7.net/services/value-added-services/free-web-tools/email-extractor-lite/
Select 'url' as 'Type of address to extract', select your separator, hit extract and that's it.
Hope it works, cheers.
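If you would rather not paste documents into a web tool, the same idea (scan the already-extracted plain text for addresses) can be sketched in a few lines of Java with the standard library. The class name and the regular expression are illustrative assumptions, not what the tool above does. Like the tool, this only finds links that are spelled out in the text, and it gives no coordinates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans plain text for http(s) and mailto links.
final class LinkTextScanner {

    // A scheme followed by everything up to whitespace, quotes or angle brackets.
    private static final Pattern LINK =
            Pattern.compile("(?i)\\b(?:https?://|mailto:)[^\\s<>\"']+");

    /** Returns the links in order of appearance, minus trailing punctuation. */
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(text);
        while (matcher.find()) {
            // Sentence punctuation directly after a link is rarely part of it;
            // note this also trims a URL that really ends in ')' or '.'.
            links.add(matcher.group().replaceAll("[.,;:!?)]+$", ""));
        }
        return links;
    }
}
```

For example, extractLinks("Visit http://example.com/a, or write to mailto:bob@example.com.") returns the two entries http://example.com/a and mailto:bob@example.com.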
One possibility would be a custom JavaScript in Acrobat that enumerates the "words" on the page and then reads out their Quads. From those you get the coordinates to create a link (or to compare with the links on the page), as well as the actual text (that's the "word(s)").
If it is "only" a matter of setting the border of the existing links, you could also write another Acrobat JavaScript that enumerates the links of the document and sets their border color property (you may need to set the width as well).
(If you prefer "buy" over "make", feel free to contact me in private; such things are part of my standard "repertoire".)
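The step from a word's quad to a rectangle that can be compared with a link's position is plain arithmetic. A minimal Java sketch, assuming each quad arrives as eight numbers (four x/y corner pairs); the class name and the {left, bottom, right, top} result layout are choices made for this illustration:

```java
// Hypothetical helper for turning one quad into an axis-aligned rectangle.
final class QuadBounds {

    /** Returns {left, bottom, right, top} for a quad given as x1,y1,...,x4,y4. */
    static double[] boundingBox(double[] quad) {
        if (quad == null || quad.length != 8) {
            throw new IllegalArgumentException("a quad needs exactly 8 numbers");
        }
        double minX = quad[0], maxX = quad[0];
        double minY = quad[1], maxY = quad[1];
        // Taking minima and maxima makes the result independent of corner order.
        for (int i = 2; i < 8; i += 2) {
            minX = Math.min(minX, quad[i]);
            maxX = Math.max(maxX, quad[i]);
            minY = Math.min(minY, quad[i + 1]);
            maxY = Math.max(maxY, quad[i + 1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }
}
```

A quad of {72, 700, 120, 700, 72, 688, 120, 688} yields {72, 688, 120, 700}, which can then be tested for overlap with a link rectangle.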
I have a PDF file containing Arabic text and a watermark. I am using PDFBox to print the PDF from Java. My issue is the PDF is printed with high quality, but all the lines with Arabic characters have junk characters instead. Could somebody help on this?
Code:
String pdfFile = "C:/AresEPOS_Home/Receipts/1391326264281.pdf";
PDDocument document = null;
try {
document = PDDocument.load(pdfFile);
//PDFont font = PDTrueTypeFont.loadTTF(document, "C:/Windows/Fonts/Arial.ttf");
PrinterJob printJob = PrinterJob.getPrinterJob();
printJob.setJobName(new File(pdfFile).getName());
PrintService[] printService = PrinterJob.lookupPrintServices();
boolean printerFound = false;
for (int i = 0; !printerFound && i < printService.length; i++) {
if (printService[i].getName().indexOf("EPSON") != -1) {
printJob.setPrintService(printService[i]);
printerFound = true;
}
}
document.silentPrint(printJob);
}
finally {
if (document != null) {
document.close();
}
}
In essence
Your PDF can properly be printed using PDFBox 2.0.0-SNAPSHOT but not using PDFBox 1.8.4. Thus, either the Arabic font in question requires a feature which is not yet supported in PDFBox up to version 1.8.4 or there was a bug in 1.8.4 which meanwhile has been fixed.
The details
Printing the OP's document using PDFBox 1.8.4 resulted in scrambled output,
but printing it using the current PDFBox 2.0.0-SNAPSHOT resulted in proper output.
In 2.0.0-SNAPSHOT the PDDocument methods print and silentPrint have been removed, though, so the original
document.silentPrint(printJob);
has to be replaced by something like
printJob.setPageable(new PDPageable(document, printJob));
printJob.print();