To give you the context, I am trying to remove a watermark from a PDF generated by a third-party application. I am trying to write a utility with iText7 that will detect such a watermark and remove it.
Here is a sample PDF file: https://easyupload.io/54zhzs. The highlighted text is the watermark I am trying to remove.
I first thought it might have been added as an annotation, but the Annots array comes back as null. Had there been an Annots collection, I could have inspected it and removed the annotation. That works well when the annotation is stamped with iText itself, but that is not the case here.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath), new PdfWriter(destinationfile));
for (var i = 1; i <= pdfDoc.GetNumberOfPages(); i++)
{
    PdfDictionary pageDict = pdfDoc.GetPage(i).GetPdfObject();
    PdfArray annots = pageDict.GetAsArray(PdfName.Annots);
    if (annots != null)
    {
        // Iterate backwards so that removing an entry does not shift the ones still to be visited
        for (int j = annots.Size() - 1; j >= 0; j--)
        {
            PdfDictionary annotation = annots.GetAsDictionary(j);
            if (annotation != null && PdfName.Watermark.Equals(annotation.GetAsName(PdfName.Subtype)))
            {
                // Remove the entry from the array; Clear() would leave an empty dictionary behind
                annots.Remove(j);
            }
        }
    }
}
pdfDoc.Close();
In my second approach I tried to get at the PdfObject like this, but I don't know how to identify the object I want to delete.
PdfDictionary pageDict = pdfDoc.GetFirstPage().GetPdfObject();
PdfArray fields = pageDict.GetAsArray(PdfName.Fields);
PdfDictionary resources = pageDict.GetAsDictionary(PdfName.Resources);
PdfArray xObjects = resources.GetAsArray(PdfName.ProcSet);
foreach (PdfObject obj in xObjects)
{
    // not sure what to do from here
}
This is the debug result I get. I think the watermark is an image layer in the last ProcSet entry, but I am not sure how to identify it.
Am I on the right track, or is there another solution?
Related
This is a case of OCR gone wrong. I need to remove the hidden text from a PDF and I'm having a hard time figuring out how to do it.
The hidden text resides in an area always named /QuickPDFsomething, which is under an /XObject dictionary that resides in the page's /Resources dictionary.
I have tried these two things and neither has worked so I'm clearly doing something wrong.
Option 1 - Kill obj - The PDF won't open in Acrobat and states, 'An error exists on this page. Acrobat may not display the page correctly' but it looks ok. Pitstop pukes with 'Critical parser failure: XObject resource missing'.
PdfReader.KillIndirect(obj);
oPdfFile.GetPdfReader().RemoveUnusedObjects();
var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(@"C:\temp.pdf", FileMode.Create));
stamper.Close();
Option 2 - CleanupProcessor - Throws an exception about 'A Graphics object cannot be created from an image that has an indexed pixel format'.
var stamper = new PdfStamper(oPdfFile.GetPdfReader(), new FileStream(@"C:\temp.pdf", FileMode.Create));
var cleanupLocations = new List<PdfCleanUpLocation>();
var pageRect = oPdfFile.GetPdfReader().GetCropBox(1);
cleanupLocations.Add(new PdfCleanUpLocation(1, pageRect));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanupLocations, stamper);
cleaner.CleanUp();
stamper.Close();
I'd like to remove the /QuickPDF object (41 0 R, in this image) as well as remove it from the content stream that calls it with /QuickPDF Do.
Unfortunately I cannot provide the PDF.
Any tips on how to do this?
I hate to answer my own question, but I wanted to share the solution I found in case others need it.
After playing around with this for a couple of days, I figured out that Option 1 above does indeed remove the object, and that the error I was getting from PitStop occurred because the content stream still had a reference to the /QuickPDF XObject.
So I tried following @mkl's solution here, Removing Watermark from PDF iTextSharp, but it kept putting unwanted data in the content stream that rotated my PDF.
Then I found @Chris's solution here, Removing Watermark from a PDF using iTextSharp, and it seems to work, although I'm not sure how stable this solution will be.
This is my solution for removing /QuickPDF from the content stream:
int pgNumber = 1;
PdfDictionary page = oPdfFile.GetPdfReader().GetPageN(pgNumber);
// Note: /Contents may also be a single stream, in which case GetAsArray returns null
PdfArray contentarray = page.GetAsArray(PdfName.CONTENTS);
// ISO-8859-1 maps every byte to one char and back, so binary data survives the round trip (ASCII would not)
Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
if (contentarray != null)
{
    // Loop through the content streams
    for (int j = 0; j < contentarray.Size; j++)
    {
        PRStream stream = (PRStream)contentarray.GetAsStream(j);
        string content = latin1.GetString(PdfReader.GetStreamBytes(stream));
        string[] lines = content.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            // Blank every line that references the /QuickPDF XObject
            if (lines[i].Contains("/QuickPDF"))
            {
                lines[i] = string.Empty;
            }
        }
        string outstr = string.Join("\n", lines);
        stream.SetData(latin1.GetBytes(outstr));
    }
}
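One caveat with the loop above: it blanks the whole line, so any other operators that happen to share a line with /QuickPDF are dropped as well. Below is a minimal sketch of a narrower filter, written in plain Java with no iText dependency. It assumes the content stream has already been decoded to a string and removes only the name-plus-Do operator pair. The class and method names (XObjectCallFilter, removeDoCalls) are made up for this illustration, and regex matching on a content stream is only an approximation: it would misfire if the same characters appeared inside a string literal or an inline image.

```java
import java.util.regex.Pattern;

class XObjectCallFilter {
    // Hypothetical helper: strips "/<prefix>... Do" invocations from decoded content-stream text.
    static String removeDoCalls(String content, String namePrefix) {
        // A PDF name runs until whitespace or a delimiter; "Do" paints the named XObject.
        Pattern call = Pattern.compile(
                "/" + Pattern.quote(namePrefix) + "[^\\s/\\[\\]<>(){}%]*\\s+Do\\b");
        return call.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up content stream: one XObject call plus some ordinary text operators
        String content = "q 1 0 0 1 0 0 cm /QuickPDFIm1a2b Do Q\nBT /F1 12 Tf (Hello) Tj ET\n";
        System.out.println(removeDoCalls(content, "QuickPDF"));
    }
}
```

The surrounding q ... Q pair is left in place, which is harmless: it saves and restores the graphics state around what is now nothing.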
I am developing a C# WinForms application that converts PDF contents to text. All the required content is extracted except the highlighted text in the PDF.
Please help me with a working sample that extracts the highlighted text found in a PDF.
I am using iTextSharp.dll in the project.
Assuming that you're talking about comments (annotations), please try this:
for (int i = pageFrom; i <= pageTo; i++)
{
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null)
    {
        foreach (PdfObject annot in annots.ArrayList)
        {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            // /Contents is optional, so this may be null
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}
This is written from memory (I'm a Java developer, not a C# developer).
I have some PDFs containing hyperlinks, both URLs and mailto links. Is there any way or tool (possibly third-party) to extract the hyperlink meta information from the PDF, such as coordinates, link type, and destination address? Any help is highly appreciated.
I have already tried iText and PDFBox without much success, and even some third-party software does not give me the desired output.
I have tried the following code in Java using iText
PdfReader myReader = new PdfReader("pdf File Path");
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
System.out.println(annots);
ArrayList<String> dests = new ArrayList<String>();
if (annots != null)
{
    for (int i = 0; i < annots.size(); ++i)
    {
        PdfDictionary annotDict = annots.getAsDict(i);
        PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
        if (subType != null && PdfName.LINK.equals(subType))
        {
            PdfDictionary action = annotDict.getAsDict(PdfName.A);
            if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S)))
            {
                dests.add(action.getAsString(PdfName.URI).toString());
            } // else { its an internal link }
        }
    }
}
System.out.println(dests);
You can use Docotic.Pdf library for links extraction (disclaimer: I work for the company).
Below is the code that opens the specified file, finds all hyperlinks, collects information about the position of each link, and draws a rectangle around each link.
After that the code creates new PDF (with links in rectangles) and a text file with collected information. In the end, both created files are opened in default viewers.
public static void ListAndHighlightLinks(string inputFile, string outputFile, string outputTxt)
{
    using (PdfDocument doc = new PdfDocument(inputFile))
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < doc.Pages.Count; i++)
        {
            PdfPage page = doc.Pages[i];
            foreach (PdfWidget widget in page.Widgets)
            {
                PdfActionArea actionArea = widget as PdfActionArea;
                if (actionArea == null)
                    continue;
                PdfUriAction linkAction = actionArea.Action as PdfUriAction;
                if (linkAction == null)
                    continue;
                Uri url = linkAction.Uri;
                PdfRectangle rect = actionArea.BoundingBox;
                // add information about found link into string buffer
                sb.Append("Page ");
                sb.Append(i.ToString());
                sb.Append(" : ");
                sb.Append(rect.ToString());
                sb.Append(" ");
                sb.AppendLine(url.ToString());
                // draw rectangle around found link
                page.Canvas.DrawRectangle(rect);
            }
        }
        // save document with highlighted links and text information about links to files
        doc.Save(outputFile);
        System.IO.File.WriteAllText(outputTxt, sb.ToString());
        // open created PDF and text file in default viewers
        System.Diagnostics.Process.Start(outputTxt);
        System.Diagnostics.Process.Start(outputFile);
    }
}
You can use the sample code with a call like this:
ListAndHighlightLinks("input.pdf", "output.pdf", "links.txt");
If your PDFs are copy-protected, start with step 1; if they are free to copy, you can start with step 2.
Step 1: Convert your PDFs to Word .doc files using Adobe Acrobat Pro or an online PDF-to-Word converter:
http://www.pdfonline.com/pdf2word/index.asp
Step 2: Copy and paste the whole document into the input window of this tool (you can also download the lightweight HTML version):
http://www.surf7.net/services/value-added-services/free-web-tools/email-extractor-lite/
Select 'url' as 'Type of address to extract', choose your separator, hit extract, and that's it.
Hope it works. Cheers.
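If you would rather do the extraction step yourself than paste text into a web tool, the same idea can be sketched in a few lines of plain Java. This assumes you already have the document text as a string (for example from the Word conversion above). The class name LinkTextExtractor and the deliberately loose pattern are illustrative only. Note that this finds only addresses that are visible in the text; it gives you neither coordinates nor links whose visible text differs from their target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTextExtractor {
    // Loose pattern: http(s) URLs and mailto: addresses, up to the next
    // whitespace, quote, angle bracket, or closing bracket.
    private static final Pattern LINK =
            Pattern.compile("(?i)\\b(?:https?://|mailto:)[^\\s<>\"')\\]]+");

    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text
        String text = "See http://www.example.com/page.html or write to mailto:someone@example.com today.";
        System.out.println(extract(text));
    }
}
```

A URL that is directly followed by a full stop or comma would keep that punctuation, so the results may need trimming.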
One possibility would be to use a custom JavaScript in Acrobat that enumerates the "words" on the page and then reads out their quads. From those you get the coordinates to create a link (or to compare with the links on the page), as well as the actual text (that's the "word(s)").
If it is "only" to set the border of the existing links, you could also write another Acrobat JavaScript that enumerates the links of the document and sets their border color property (you may need to set the width as well).
(if you prefer "buy" over "make" feel free to contact me in private; such things are part of my standard "repertoire").
I am adding text to an already created pdf document using this method.
ITextSharp insert text to an existing pdf
Basically it uses the PdfContentByte and then adds the content template to the page.
I am finding that in some areas of the file, the text doesn't show up.
It seems that the text I am adding ends up behind the content that is already on the page. I flattened the PDF document down to just images, but the same issue happens with the flattened file.
Has anyone had issues with added text being hidden when using iTextSharp?
I also tried using DirectContentUnder, as suggested in this link, to no avail:
iTextSharp hides text when write
Here is the code I am using. With it I am basically trying to overlay graph paper on top of the PDF. In this example, there is a box in the upper left corner of every page that doesn't get populated; there is an image in the original PDF in this spot. On the 4th and 5th pages there are also boxes that don't get populated, but they don't seem to be images.
PdfReader reader = new PdfReader(oldFile);
iTextSharp.text.Rectangle size = reader.GetPageSizeWithRotation(1);
Document document = new Document(size);
// open the writer
FileStream fs = new FileStream(newFile, FileMode.Create, FileAccess.Write);
PdfWriter writer = PdfWriter.GetInstance(document, fs);
document.Open();
// the pdf content
PdfContentByte cb = writer.DirectContent;
for (int i = 0; i < reader.NumberOfPages; i++)
{
    document.NewPage();
    // select the font properties
    BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA_BOLD, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
    cb.SetFontAndSize(bf, 4);
    cb.SetColorStroke(BaseColor.GREEN);
    cb.SetLineWidth(1f);
    for (int j = 10; j < 600; j += 10)
    {
        WriteToDoc(ref cb, j.ToString(), j, 10);  // Write the line number
        WriteToDoc(ref cb, j.ToString(), j, 780); // Write the line number
        if (j % 20 == 0)
        {
            cb.MoveTo(j, 20);
            cb.LineTo(j, 760);
            cb.Stroke();
        }
    }
    for (int j = 10; j < 800; j += 10)
    {
        WriteToDoc(ref cb, j.ToString(), 5, j);   // Write the line number
        WriteToDoc(ref cb, j.ToString(), 590, j); // Write the line number
        if (j % 20 == 0)
        {
            cb.MoveTo(15, j);
            cb.LineTo(575, j);
            cb.Stroke();
        }
    }
    // create the new page and add it to the pdf
    PdfImportedPage page = writer.GetImportedPage(reader, i + 1);
    cb.AddTemplate(page, 0, 0);
}
// close the streams and voilá the file should be changed :)
document.Close();
fs.Close();
writer.Close();
reader.Close();
Thanks for any of the help you can provide...I really appreciate it!
-Greg
First of all: If you are trying to basically overlay graph paper on top of the PDF, why do you first draw the graph paper and stamp the original page onto it? You essentially are underlaying graph paper, not overlaying it.
Depending on the content of the page, your graph paper this way may easily get covered. E.g. if there is a filled rectangle in the page content, in the result there is a box in the upper left corner of every page that doesn't get populated.
Thus, simply first add the old page content, then add overlay changes.
That being said, for the task of applying changes to an existing PDF, using PdfWriter and GetImportedPage is less than optimal. This is actually a task for the PdfStamper class, which is made for stamping additional content onto existing PDFs.
E.g. have a look at the sample StampText, the pivotal code being:
PdfReader reader = new PdfReader(resource);
using (var ms = new MemoryStream())
{
    using (PdfStamper stamper = new PdfStamper(reader, ms))
    {
        PdfContentByte canvas = stamper.GetOverContent(1);
        ColumnText.ShowTextAligned(canvas, Element.ALIGN_LEFT, new Phrase("Hello people!"), 36, 540, 0);
    }
    return ms.ToArray();
}
I'd like to insert a scaled PDF page into a page of another PDF. I'd like to use iTextSharp for this.
I have a vector drawing which can be exported as a single page PDF file. I would like to add this file into a page of other PDF document just like I would add an image to a PDF document.
Is this possible?
The purpose of this is to retain the ability to zoom in without losing quality.
It is very hard to reproduce the vector drawing using PDF vectors because it is an extremely complex drawing.
Exporting the vector drawing as high resolution image is not an option since I have to use a lot of them in a single PDF document. The final PDF would be very large and its writing too slow.
This is relatively easy to do although there's a couple of ways to go about it. If you're creating a new document that has the other documents inside of it and nothing else then the easiest thing to use is probably the PdfWriter.GetImportedPage(PdfReader, Int). This will give you a PdfImportedPage (which inherits from PdfTemplate). Once you have that you can add it to your new document by using PdfWriter.DirectContent.AddTemplate(PdfImportedPage, Matrix).
There's a couple of overloads to AddTemplate() but the easiest one (at least for me) is the one that takes a System.Drawing.Drawing2D.Matrix. If you use this you can easily scale and translate (change x,y) without having to think in "matrix" terms.
Below is sample code that shows this off. It targets iTextSharp 5.4.0, although it should work pretty much the same with 4.1.6 if you remove the using statements. It first creates a sample PDF with 12 pages with random background colors. Then it creates a second document and adds each page from the first PDF scaled by 50% so that 4 old pages fit onto 1 new page. See the code comments for further details. This code assumes that all pages are the same size; you might need to perform further calculations if your situation differs.
//Test files that we'll be creating
var file1 = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "File1.pdf");
var file2 = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "File2.pdf");
//For test purposes we'll fill the pages with a random background color
var R = new Random();
//Standard PDF creation, nothing special here
using (var fs = new FileStream(file1, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();
            //Create 12 pages with text on each one
            for (int i = 1; i <= 12; i++) {
                doc.NewPage();
                //For test purposes fill the page with a random background color
                var cb = writer.DirectContentUnder;
                cb.SaveState();
                cb.SetColorFill(new BaseColor(R.Next(0, 256), R.Next(0, 256), R.Next(0, 256)));
                cb.Rectangle(0, 0, doc.PageSize.Width, doc.PageSize.Height);
                cb.Fill();
                cb.RestoreState();
                //Add some text to the page
                doc.Add(new Paragraph("This is page " + i.ToString()));
            }
            doc.Close();
        }
    }
}
//Create our combined file
using (var fs = new FileStream(file2, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            //Bind a reader to the file that we created above
            using (var reader = new PdfReader(file1)) {
                doc.Open();
                //Get the number of pages in the original file
                int pageCount = reader.NumberOfPages;
                //Loop through each page
                for (int i = 0; i < pageCount; i++) {
                    //We're putting four original pages on one new page so add a new page every four pages
                    if (i % 4 == 0) {
                        doc.NewPage();
                    }
                    //Get a page from the reader (remember that PdfReader pages are one-based)
                    var imp = writer.GetImportedPage(reader, (i + 1));
                    //A transform matrix is an easier way of dealing with changing dimension and coordinates on an rectangle
                    var tm = new System.Drawing.Drawing2D.Matrix();
                    //Scale the image by half
                    tm.Scale(0.5f, 0.5f);
                    //PDF coordinates put 0,0 in the bottom left corner.
                    if (i % 4 == 0) {
                        tm.Translate(0, doc.PageSize.Height); //The first item on the page needs to be moved up "one square"
                    } else if (i % 4 == 1) {
                        tm.Translate(doc.PageSize.Width, doc.PageSize.Height); //The second needs to be moved up and over
                    } else if (i % 4 == 2) {
                        //Nothing needs to be done for the third
                    } else if (i % 4 == 3) {
                        tm.Translate(doc.PageSize.Width, 0); //The fourth needs to be moved over
                    }
                    //Add our imported page using the matrix that we set above
                    writer.DirectContent.AddTemplate(imp, tm);
                }
                doc.Close();
            }
        }
    }
}
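To make the arithmetic in the sample above explicit: System.Drawing's Matrix prepends each operation by default, so the Translate calls are applied to the page before the 0.5 scale, and a shift of one full page height ends up as half a page height in the output. Here is a sketch in plain Java (no iTextSharp; QuadrantLayout and matrixFor are names invented for this illustration) that computes the resulting a, b, c, d, e, f values for each of the four slots, assuming all pages share the same size:

```java
class QuadrantLayout {
    // Slot order matches the sample: 0 = top-left, 1 = top-right,
    // 2 = bottom-left, 3 = bottom-right. Returns {a, b, c, d, e, f}.
    static double[] matrixFor(int slot, double pageWidth, double pageHeight) {
        double scale = 0.5;
        // Right-hand column: shift by half a page width
        double e = (slot % 2 == 1) ? pageWidth * scale : 0;
        // Top row: shift by half a page height (PDF origin is bottom-left)
        double f = (slot < 2) ? pageHeight * scale : 0;
        return new double[] { scale, 0, 0, scale, e, f };
    }

    public static void main(String[] args) {
        // 600 x 800 is an arbitrary page size chosen for easy numbers
        for (int slot = 0; slot < 4; slot++) {
            System.out.println(slot + ": " + java.util.Arrays.toString(matrixFor(slot, 600, 800)));
        }
    }
}
```

These six numbers correspond to the parameters of the AddTemplate overload that takes a, b, c, d, e, f as floats, which can be handy if you would rather not reference System.Drawing.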
In addition: while I was trying to add a rotated PDF page to a rotated PDF, I ran into some rotation problems. It is kind of confusing, but you should check the "PdfImportedPage.Rotation" of the page that is going to be added to the PDF.
PdfImportedPage page; // page = writer.GetImportedPage(PdfReader reader, int pageNum);
PdfContentByte pcb;   // pcb = PdfWriter.DirectContentUnder;
// create matrix to use for rotating imported page
Matrix matrix = new Matrix(a, b, c, d, e, f);
matrix.Rotate(-(page.Rotation));
if (page.Rotation != 0)
    pcb.AddTemplate(page, matrix, true);
else
    pcb.AddTemplate(page, a, b, c, d, e, f, true);
The code looks a bit silly, but I want to draw your attention to "matrix.Rotate(negative rotation of imported page)".
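For reference, the rotation can also be written out by hand instead of going through a Matrix object. Below is a sketch in plain Java (not iTextSharp; RotationMatrix, forRotation and apply are names made up for this illustration). It gives the six values a, b, c, d, e, f that draw an unrotated page of width w and height h the way a viewer would show it for /Rotate values of 90, 180 and 270. It assumes the page's lower-left corner is at the origin and is not taken from the code above, so treat it as a starting point to test against your own files.

```java
class RotationMatrix {
    // Returns {a, b, c, d, e, f}; the e/f offsets move the rotated page back
    // into the positive quadrant. /Rotate is clockwise and a multiple of 90.
    static double[] forRotation(int rotation, double w, double h) {
        switch (((rotation % 360) + 360) % 360) {
            case 90:  return new double[] { 0, -1, 1, 0, 0, w };
            case 180: return new double[] { -1, 0, 0, -1, w, h };
            case 270: return new double[] { 0, 1, -1, 0, h, 0 };
            default:  return new double[] { 1, 0, 0, 1, 0, 0 };
        }
    }

    // PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f
    static double[] apply(double[] m, double x, double y) {
        return new double[] { m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5] };
    }

    public static void main(String[] args) {
        // Where do the corners of a 600 x 800 page land for /Rotate 90?
        double[] m = forRotation(90, 600, 800);
        System.out.println(java.util.Arrays.toString(apply(m, 0, 0)));
        System.out.println(java.util.Arrays.toString(apply(m, 600, 800)));
    }
}
```

Whether you need this at all depends on whether your iTextSharp version already compensates for page rotation on import, so check that first.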