Hyperlink Detection from PDF - pdf

I have some PDFs containing Hyperlinks both in form of URL and mailto. Now Is there any way or tool(may be 3rd party) to extract the Hyperlink meta information form the PDF like coordinates, link type and destination address. Any help is highly appreciated.
I have already tried with iText and PDFBox but with no major success, even some third party software are not providing me the desired output.
I have tried the following code in Java using iText
PdfReader myReader = new PdfReader("pdf File Path");
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
System.out.println(annots);
ArrayList<String> dests = new ArrayList<String>();
if(annots != null)
{
for(int i=0; i<annots.size(); ++i)
{
PdfDictionary annotDict = annots.getAsDict(i);
PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
if (subType != null && PdfName.LINK.equals(subType))
{
PdfDictionary action = annotDict.getAsDict(PdfName.A);
if(action != null && PdfName.URI.equals(action.getAsName(PdfName.S)))
{
dests.add(action.getAsString(PdfName.URI).toString());
} // else { its an internal link }
}
}
}
System.out.println(dests);

You can use Docotic.Pdf library for links extraction (disclaimer: I work for the company).
Below is the code that opens specified file, finds all hyperlinks, collects information about position of each link and draws rectangle around each links.
After that the code creates new PDF (with links in rectangles) and a text file with collected information. In the end, both created files are opened in default viewers.
public static void ListAndHighlightLinks(string inputFile, string outputFile, string outputTxt)
{
using (PdfDocument doc = new PdfDocument(inputFile))
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < doc.Pages.Count; i++)
{
PdfPage page = doc.Pages[i];
foreach (PdfWidget widget in page.Widgets)
{
PdfActionArea actionArea = widget as PdfActionArea;
if (actionArea == null)
continue;
PdfUriAction linkAction = actionArea.Action as PdfUriAction;
if (linkAction == null)
continue;
Uri url = linkAction.Uri;
PdfRectangle rect = actionArea.BoundingBox;
// add information about found link into string buffer
sb.Append("Page ");
sb.Append(i.ToString());
sb.Append(" : ");
sb.Append(rect.ToString());
sb.Append(" ");
sb.AppendLine(url.ToString());
// draw rectangle around found link
page.Canvas.DrawRectangle(rect);
}
}
// save document with highlighted links and text information about links to files
doc.Save(outputFile);
System.IO.File.WriteAllText(outputTxt, sb.ToString());
// open created PDF and text file in default viewers
System.Diagnostics.Process.Start(outputTxt);
System.Diagnostics.Process.Start(outputFile);
}
}
You can use the sample code with a call like this:
ListAndHighlightLinks("input.pdf", "output.pdf", "links.txt");

if your pdfs are copy protected, you need to start with step 1, if they're free to copy, you can start with step 2
step 1: convert your pdfs into word .doc: use Adobe Acrobat Pro or an online pdf to word converter:
http://www.pdfonline.com/pdf2word/index.asp
step 2: copy-paste the whole document into the input window here, you can also download the lightweight html tool:
http://www.surf7.net/services/value-added-services/free-web-tools/email-extractor-lite/
select 'url' as 'Type of address to extract', select your separator, hit extract and that's it.
Hope it works cheers.

One possibility would be using a custom JavaScript in Acrobat, which would enumerate the "words" on the page and then read out their Quads. From that you get the coordinates to create a link (or to compare with the links on the page), as well as the actual text (that's the "word(s)".
If it is "only" to set the border of the existing links, you also do another Acrobat JavaScript which enumerates the links of the document, and set their border color property (and you may need to set the width as well).
(if you prefer "buy" over "make" feel free to contact me in private; such things are part of my standard "repertoire").

Related

Remove object in PDF with iTextSharp and save

This is a case of OCR gone wrong. I need to remove the hidden text from a PDF and I'm having a hard time figuring out how to do it.
The hidden text resides in an area always named /QuickPDFsomething which is under and /XObject dictionary that resides in the page's /Resources dictionary.
I have tried these two things and neither has worked so I'm clearly doing something wrong.
Option 1 - Kill obj - The PDF won't open in Acrobat and states, 'An error exists on this page. Acrobat may not display the page correctly' but it looks ok. Pitstop pukes with 'Critical parser failure: XObject resource missing'.
PdfReader.KillIndirect(obj);
oPdfFile.GetPdfReader().RemoveUnusedObjects();
var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(#"C:\temp.pdf", FileMode.Create));
stamper.Close();
Option 2 - CleanupProcessor - Throws an exception about 'A Graphics object cannot be created from an image that has an indexed pixel format'.
var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(#"C:\temp.pdf", FileMode.Create));
var cleanupLocations = new List<PdfCleanUpLocation>();
var pageRect = oPdfFile.GetPdfReader().GetCropBox(1);
cleanupLocations.Add(new PdfCleanUpLocation(1, pageRect));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanupLocations, stamper);
cleaner.CleanUp();
stamper.Close();
I'd like to remove the /QuickPDF object (41 0 R, in this image) as well as remove it from the content stream that calls it with /QuickPDF Do.
Unfortunately I cannot provide the PDF.
Any tips on how to do this?
I hate to answer my own question but I wanted to share the solution I found in case others need it.
After playing around with this for a couple days i figured out that Option 1 above would indeed remove the object and that the exception that I was getting from PitStop was because the content stream had a reference to the /QuickPDF XObject.
So I tried following #mkl's solution here Removing Watermark from PDF iTextSharp but it kept putting unwanted data in the content stream that rotated my PDF.
So then I found #Chris's solution here Removing Watermark from a PDF using iTextSharp and it seems to work although I'm not sure how stable this solution will be.
This is my solution for removing /QuickPDF from the content stream:
int numPages = oPdfFile.GetPdfReader().NumberOfPages;
int pgNumber = 1;
PdfDictionary page = oPdfFile.GetPdfReader().GetPageN(pgNumber);
PdfArray contentarray = page.GetAsArray(PdfName.CONTENTS);
PRStream stream;
string content;
if (contentarray != null)
{
//Loop through content
for (int j = 0; j < contentarray.Size; j++)
{
stream = (PRStream)contentarray.GetAsStream(j);
content = Encoding.ASCII.GetString(PdfReader.GetStreamBytes(stream));
string[] tokens = content.Split('\n');
for (int i = 0; i< tokens.Length; i++)
{
if (tokens[i].Contains("/QuickPDF"))
{
tokens[i] = string.Empty;
}
}
string outstr = string.Join("\n", tokens.Select(p => p).ToArray());
byte[] outbytes = Encoding.ASCII.GetBytes(outstr);
stream.SetData(outbytes);
}
}

iTextSharp PDF Reading highlighed text (highlight annotations) using C#

I am developing a C# winform application that converts the pdf contents to text. All the required contents are extracted except the content found in highlighted text of the pdf.
Please help to get the working sample to extract the highlighted text found in pdf.
I am using the iTextSharp.dll in the project
Assuming that you're talking about Comments. Please try this:
for (int i = pageFrom; i <= pageTo; i++)
{
PdfDictionary page = reader.GetPageN(i);
PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if (annots != null)
foreach (PdfObject annot in annots.ArrayList)
{
PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
// now use the String value of contents
}
}
This is written from memory (I'm a Java developer, not a C# developer).

iTextSharp: What is Lost When Copying PDF Content From Another PDF?

I am currently evaluating iTextSharp for potential use in a project. The code that I have written to achieve my goal is making use of PDFCopy.GetImportedPage to copy all of the pages from an existing PDF. What I want to know is what all do I need to be aware of that will be lost from a PDF and/or page when duplicating PDF content like this? For example, one thing that I already noticed is that I need to manually add in any bookmarks and named destinations into my new PDF.
Here's some rough sample code:
using (PdfReader reader = new PdfReader(inputFilename))
{
using (MemoryStream ms = new MemoryStream())
{
using (Document document = new Document())
{
using (PdfCopy copy = new PdfCopy(document, ms))
{
document.Open();
int n;
n = reader.NumberOfPages;
for (int page = 0; page < n; )
{
copy.AddPage(copy.GetImportedPage(reader, ++page));
}
// add content and make further modifications here
}
}
// write the content to disk
}
}
Basically anything that's document-level instead of page-level will get lost and both Bookmarks and Destinations are document-level. Pull up the PDF spec and look at section 3.6.1 for other entries in the document catalog including Threads, Open and Additional Actions and Meta Data.
You might already have seen these but here are some samples (in Java) of how to merge Named Destinations and how to merge Bookmarks.

Need alternative to local or remote goto/destinations with merged documents

BACKGROUND
I have a java program that analyzes data and creates a pdf report using itext 5.
I recently had to add a summary of major problems at the start of the document so a user would not have to read over a hundred pages to find problems. Problems are only discovered when serially looking through the data.
I solved the problem by creating 3 pdf documents and then merging them, a start/title pdf, the summary of problems pdf, and the body or analysis pdf. (Basically splitting the original document at the point I wanted to insert the summary)
I use PdfReader and PdfCopy to combine the documents. I am able to keep the chapter bookmarks OK.
THE PROBLEM
As I encounter a significant problem I add it to the 'summary' document. I want to add a link in the summary to point to the problem in the body.
I tried to use Chunk.setLocalDestination and setLocalGoto but realized why that did not work, so I tried using setLocalDestination and setRemoteGoto (with and without 'file://'), but that did not work either. (Also, I used the final pdf document name in the RemoteGoto, not the temporary pdf document name.)
I do not want to use bookmarks because that seems wrong and would not look right.
I am hoping someone could suggest an alternate method or make a suggestion.
To recap, in my current code a create a Chunk with setLocalDestination and that chunk goes into the 'body' document. At the same time I create a setRemoteGoto which is put into the summary document. I was hoping when they were combined the link would work, but when the link is clicked, you go to the first page of the combined document.
Thanks.....
PS I have both iText in action books
CLARIFICATION 3/5/2014
What I was calling 'bookmarks' are really Chapter class entities that are inserted into sections of the 3 documents as they are being created.
After saving the 3 documents, PdfReader is used to open each and PdfCopy is used to put them into a new, final document.
I get the data from the Chapters, which creates the 'bookmarks' on the left side of the Pdf reader used by the user, e.g. Acrobat Reader.
int thisPdfPages = reader.getNumberOfPages();
reader.consolidateNamedDestinations();
java.util.List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
if (bookmarks != null) {
if (pageOffset != 0) {
if (debug3) auditLogger.log("Shifting pages by " + pageOffset );
SimpleBookmark.shiftPageNumbers(bookmarks, pageOffset, null);
}
masterBookmarks.addAll(bookmarks);
}
for (int i = 0; i < thisPdfPages;) {
page = copy.getImportedPage(reader, ++i);
stamp = copy.createPageStamp(page);
// add page numbers
ColumnText.showTextAligned(stamp.getUnderContent(), Element.ALIGN_CENTER, new Phrase(String.format("page %d of %d", start + i, totalPages)), 297.5f, 28, 0);
stamp.alterContents();
copy.addPage(page);
}
PRAcroForm form = reader.getAcroForm();
if (form != null) {
copy.copyAcroForm(reader);
}
When analyzing the data I have 2 documents open, a base document which contains all the details and a summary document which contains notable events over some thresholds.
//NOTE section is part of the 'body' document
//NOTE summaryPhrase is a part of the 'summary' document
String linkName = "summaryPf_" + networkid ;
//create Link target
section.add(new Chunk("CHANGE TO EMPTY STRING WHEN WORKING").setLocalDestination( linkName ));
//create Link
Chunk linkChunk = new Chunk( "[Link] " );
Font linkFont = new Font( regularFont );
linkFont.setColor(BaseColor.BLUE);
linkFont.setStyle( Font.UNDERLINE );
linkChunk.setFont( linkFont );
boolean useLocal = true;
// both local and remote goto's fail
if (useLocal) {
linkChunk.setLocalGoto( linkName);
} else {
// all permutations of setting filename fail,
// but it does bring up a permissions dialog when the link is clicked.
//String remotePdfName = "file://./" + pdfReportName ;
//String remotePdfName = "file://" + pdfReportName ;
//String remotePdfName = "file:" + pdfReportName ;
String remotePdfName = pdfReportName ;
linkChunk.setRemoteGoto( remotePdfName, linkName);
}
// add link to summary document
summaryPhrase.add( linkChunk );
summaryPhrase.add( String.format("There were %d devices with ping failures", summaryCount));
summaryPhrase.add( Chunk.NEWLINE );
}
If I use setLocalGoto, when you click the link in the final document you goto the first page.
If I use setRemoteGoto, a dialog ask permission to go to a document, but the document fails to open, tried several permutations on filename.

Insert PDF in PDF (NOT merging files)

I'd like to insert a PDF page in another PDF page scaled. I'd like to use iTextSharp for this.
I have a vector drawing which can be exported as a single page PDF file. I would like to add this file into a page of other PDF document just like I would add an image to a PDF document.
Is this possible?
The purpose of this is to retain the ability to zoom in without losing quality.
It is very hard to reproduce the vector drawing using PDF vectors because it is an extremely complex drawing.
Exporting the vector drawing as high resolution image is not an option since I have to use a lot of them in a single PDF document. The final PDF would be very large and its writing too slow.
This is relatively easy to do although there's a couple of ways to go about it. If you're creating a new document that has the other documents inside of it and nothing else then the easiest thing to use is probably the PdfWriter.GetImportedPage(PdfReader, Int). This will give you a PdfImportedPage (which inherits from PdfTemplate). Once you have that you can add it to your new document by using PdfWriter.DirectContent.AddTemplate(PdfImportedPage, Matrix).
There's a couple of overloads to AddTemplate() but the easiest one (at least for me) is the one that takes a System.Drawing.Drawing2D.Matrix. If you use this you can easily scale and translate (change x,y) without having to think in "matrix" terms.
Below is sample code that shows this off. It targets iTextSharp 5.4.0 although it should work pretty much the same with 4.1.6 if you remove the using statements. It first creates a sample PDF with 12 pages with random background colors. Then it creates a second document and adds each page from the first PDF scaled by 50% so that 4 old pages fit onto 1 new page. See the code comments for further details. This code assumes that all pages are the same size, you might need to perform further calculations if your situation differs.
//Test files that we'll be creating
var file1 = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "File1.pdf");
var file2 = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "File2.pdf");
//For test purposes we'll fill the pages with a random background color
var R = new Random();
//Standard PDF creation, nothing special here
using (var fs = new FileStream(file1, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
//Create 12 pages with text on each one
for (int i = 1; i <= 12; i++) {
doc.NewPage();
//For test purposes fill the page with a random background color
var cb = writer.DirectContentUnder;
cb.SaveState();
cb.SetColorFill(new BaseColor(R.Next(0, 256), R.Next(0, 256), R.Next(0, 256)));
cb.Rectangle(0, 0, doc.PageSize.Width, doc.PageSize.Height);
cb.Fill();
cb.RestoreState();
//Add some text to the page
doc.Add(new Paragraph("This is page " + i.ToString()));
}
doc.Close();
}
}
}
//Create our combined file
using (var fs = new FileStream(file2, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, fs)) {
//Bind a reader to the file that we created above
using (var reader = new PdfReader(file1)) {
doc.Open();
//Get the number of pages in the original file
int pageCount = reader.NumberOfPages;
//Loop through each page
for (int i = 0; i < pageCount; i++) {
//We're putting four original pages on one new page so add a new page every four pages
if (i % 4 == 0) {
doc.NewPage();
}
//Get a page from the reader (remember that PdfReader pages are one-based)
var imp = writer.GetImportedPage(reader, (i + 1));
//A transform matrix is an easier way of dealing with changing dimension and coordinates on an rectangle
var tm = new System.Drawing.Drawing2D.Matrix();
//Scale the image by half
tm.Scale(0.5f, 0.5f);
//PDF coordinates put 0,0 in the bottom left corner.
if (i % 4 == 0) {
tm.Translate(0, doc.PageSize.Height); //The first item on the page needs to be moved up "one square"
} else if (i % 4 == 1) {
tm.Translate(doc.PageSize.Width, doc.PageSize.Height); //The second needs to be moved up and over
} else if (i % 4 == 2) {
//Nothing needs to be done for the third
} else if (i % 4 == 3) {
tm.Translate(doc.PageSize.Width, 0); //The fourth needs to be moved over
}
//Add our imported page using the matrix that we set above
writer.DirectContent.AddTemplate(imp,tm);
}
doc.Close();
}
}
}
}
In addition; while i was trying to add a rotated pdf to a rotated pdf, i got some rotation problems. Kind of confusing but you should check the "PdfImportedPage.Rotation" of the page which is gonna be added to pdf.
PdfImportedPage page;//page = writer.GetImportedPage(PdfReader reader, int pageNum);
PdfContentByte pcb;//pcb = PdfWriter.DirectContentUnder;
//create matrix to use for rotating imported page
Matrix matrix = new Matrix(a, b, c, d, e, f);
matrix.Rotate(-(page.Rotation));
if (page.Rotation != 0)
pcb.AddTemplate(page, matrix, true);
else
pcb.AddTemplate(page, a, b, c, d, e, f, true);
code looks like silly but i want to get your attention on "matrix.Rotate(negative rotation of imported page)"