How to remove blank pages from a PDF using PDFsharp? - vb.net

How can I remove a blank page from a PDF file? I have a sample PDF whose first page contains a few strings and whose second page contains absolutely nothing. I tried looping over the pages and getting the element count per page, but I get the same number for both pages. How can that be, if the first page has a few strings and the second page is completely blank?
This is my code:
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.IO
Dim inputDOcument As PdfDocument = PdfReader.Open("")
Dim elemountCount As Integer = 0
Dim elemountCount2 As Integer = 0
Dim pdfPageCount As Integer = inputDOcument.PageCount
For x As Integer = 0 To pdfPageCount - 1
    ' Number of content streams attached to the page
    elemountCount = inputDOcument.Pages(x).Contents.Elements.Count
    ' Number of entries in the page dictionary
    elemountCount2 = inputDOcument.Pages(x).Elements.Count
Next

Try checking the stream length of each content element:
public bool HasContent(PdfPage page)
{
    for (var i = 0; i < page.Contents.Elements.Count; i++)
    {
        // A content stream longer than 76 bytes is treated as real page content
        if (page.Contents.Elements.GetDictionary(i).Stream.Length > 76)
        {
            return true;
        }
    }
    return false;
}
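Once a per-page test like HasContent exists, the removal itself is mostly an indexing question: deleting pages while counting upward shifts the pages that have not been checked yet. Here is a minimal sketch of that idea in JavaScript, working on plain arrays rather than PDFsharp objects. Each page is modeled as an array of content-stream byte lengths; `hasContent`, `removeBlankPages`, and the `threshold` parameter are hypothetical names for illustration, and the default of 76 is simply taken from the answer above.

```javascript
// Each page is modeled as an array of content-stream lengths in bytes.
// This is a stand-in for illustration, not a PDFsharp API.
function hasContent(streamLengths, threshold = 76) {
  // A page counts as non-blank if any stream is longer than the threshold.
  return streamLengths.some((len) => len > threshold);
}

function removeBlankPages(pages, threshold = 76) {
  // Walk backwards so that removing a page never shifts
  // the index of a page that has not been checked yet.
  for (let i = pages.length - 1; i >= 0; i--) {
    if (!hasContent(pages[i], threshold)) {
      pages.splice(i, 1);
    }
  }
  return pages;
}

// Hypothetical document: the second and third pages hold only tiny streams.
const pages = [[512], [40], [12, 30], [300, 20]];
removeBlankPages(pages);
console.log(pages.length); // 2
```

In PDFsharp the same backward loop would call `Pages.RemoveAt(i)` on the document in place of `splice`.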

You can try the PDFsharp Document Explorer that comes with PDFsharp to see what the PDF file really contains.
Or load and save the file with a PDFsharp DEBUG build; this will give you a "verbose" file. Viewing that in Notepad could help you understand what the file contains.

Related

iText: missing characters when converting PDF to Text

I am trying to extract the text from this PDF using LocationTextExtractionStrategy, but for some reason a number of characters are dropped during parsing.
The first page of the original PDF reads:
【表紙】
【提出書類】有価証券報告書
【根拠条文】金融商品取引法第24条第1項
【提出先】近畿財務局長
【提出日】平成22年6月28日
【事業年度】第27期(自 平成21年4月1日 至 平成22年3月31日)
【会社名】株式会社カネミツ
【英訳名】KANEMITSU CORPORATION
The resulting text output is missing numbers such as 22 and 28, as well as the English text "KANEMITSU CORPORATION":
【表紙】
【提出書類】 有価証券報告書
【根拠条文】 金融商品取引法第条第1項
【提出先】 近畿財務局長
【提出日】 平成年6月日
【事業年度】 第期(自 平成年4月1日 至 平成年3月日)
【会社名】 株式会社カネミツ
【英訳名】
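Here's the code:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
PdfReader reader = new PdfReader(sourceFileUrl);
// Class.forName expects the fully qualified class name, without a ".class" suffix
String strategyClass = "com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy";
int n = reader.getNumberOfPages();
// Pages are numbered from 1, so the loop has to include page n
for (int i = 1; i <= n; i++) {
    TextExtractionStrategy strategy = (TextExtractionStrategy) Class.forName(strategyClass).newInstance();
    String text = PdfTextExtractor.getTextFromPage(reader, i, strategy);
    …
}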
I have reviewed other questions of a similar nature on SO. This page is similar, although I am able to copy the text from the PDF directly, so this is probably a different issue.

Identify and extract or delete pages of a PDF based on a search string / text (action / javascript)

Good Evening (UK)
I'm trying to filter a 1500+ page PDF file down to only the pages that include a certain text string (typically one or two words). My laptop is locked down with respect to installing more software, but I have used actions (scripts) quite a bit.
I get the error below when I try to install this action into Adobe Acrobat X Pro (Win 7):
[screen dump of error]
The action is called "Extract Commented Pages". It is supposed to be OK for X and XI, and it looks like what I want.
I wondered if there was something simple causing the problem, but the action script file is rather... busy, to say the least.
I used to have an action that I think was based on a legal redaction script, but it is filed somewhere!
If you already have an action that does this, or a version of the above that doesn't give the error I get ("unable to import the Action.... The file is either invalid or corrupt"), I will be forever grateful.
Many thanks, have a good weekend!
I recently came across a script found at the following link: http://forums.adobe.com/thread/1077118
I'm having some issues getting the script to run in Acrobat, despite everything looking alright in the script itself. I'll update if I find any errors.
Here is a copy of the script:
// Set the word to search for here
var sWord = "forms";
// Source document = current document
var sd = this;
var nWords, currWord, fp, fpa = [], nd;
var fn = sd.documentFileName.replace(/\.pdf$/i, "");
// Loop through the pages
for (var i = 0; i < sd.numPages; i += 1) {
    // Get the number of words on the page
    nWords = sd.getPageNumWords(i);
    // Loop through the words on the page
    for (var j = 0; j < nWords; j += 1) {
        // Get the current word
        currWord = sd.getPageNthWord(i, j);
        if (currWord === sWord) {
            // Extract the current page to a new file
            fp = fn + "_" + i + ".pdf";
            fpa.push(fp);
            sd.extractPages({nStart: i, nEnd: i, cPath: fp});
            // Stop searching this page
            break;
        }
    }
}
// Combine the individual pages into one PDF
if (fpa.length) {
    // Open the document that's the first extracted page
    nd = app.openDoc({cPath: fpa[0], oDoc: sd});
    // Append any other pages that were extracted
    if (fpa.length > 1) {
        for (var i = 1; i < fpa.length; i += 1) {
            nd.insertPages({nPage: i - 1, cPath: fpa[i], nStart: 0, nEnd: 0});
        }
    }
    // Save to a new document and close this one
    nd.saveAs({cPath: fn + "_searched.pdf"});
    nd.closeDoc({bNoSave: true});
}
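The matching step of the script can be tried outside Acrobat. Below is a minimal sketch, assuming each page has already been reduced to an array of words (inside Acrobat those would come from getPageNumWords and getPageNthWord). `findPagesWithWord` and its `ignoreCase` option are hypothetical names; the case-insensitive mode is an addition that the script above does not have, since its `===` comparison is case-sensitive.

```javascript
// pagesWords: array of pages, each page an array of word strings.
// Returns the zero-based indices of the pages that contain the word.
function findPagesWithWord(pagesWords, word, ignoreCase = false) {
  const normalize = (s) => (ignoreCase ? s.toLowerCase() : s);
  const target = normalize(word);
  const hits = [];
  pagesWords.forEach((words, pageIndex) => {
    // some() stops at the first match, like the break in the script.
    if (words.some((w) => normalize(w) === target)) {
      hits.push(pageIndex);
    }
  });
  return hits;
}

// Hypothetical three-page document.
const sample = [
  ["All", "Forms", "are", "due"],
  ["nothing", "relevant", "here"],
  ["submit", "forms", "online"],
];
console.log(findPagesWithWord(sample, "forms"));       // [ 2 ]
console.log(findPagesWithWord(sample, "forms", true)); // [ 0, 2 ]
```

A search phrase of two words would need a sliding comparison over adjacent words, because this sketch, like the script, compares one word at a time.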

Split a "tagged" PDF document into multiple documents, keeping the tagging

In a project I have to split a PDF document into two documents, one containing all blank pages, and one containing all pages with content.
For this job, I use a PdfReader to read the source file, and two PdfCopy objects (one for the blank-pages document, one for the document of pages with content) to write the files to.
I use GetImportedPage to read a PdfImportedPage, which is then added to one of the PdfCopy writers.
Now, the problem is the following: the source file is using the "tagged PDF format". To preserve this (which is absolutely required), I use the SetTagged() method on both PdfCopy writers, and use the extra third parameter in GetImportedPage(...) to keep the tagged format. However, when calling the AddPage(...) on the PdfCopy writer, I get an invalid cast exception:
"Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' to type 'iTextSharp.text.pdf.PRIndirectReference'."
Does anyone have any ideas on how to solve this? Any hints?
Also: the project currently references version 5.1.0.0 of the iText libraries. In 5.4.4.0 the third parameter to GetImportedPage does not seem to be there anymore.
Below, you can find a code extract:
iTextSharp.text.Document targetPdf = new iTextSharp.text.Document();
iTextSharp.text.Document blankPdf = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfReader sourcePdfReader = new iTextSharp.text.pdf.PdfReader(inputFile);
iTextSharp.text.pdf.PdfCopy targetPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(targetPdf, new FileStream(outputFile, FileMode.Create));
iTextSharp.text.pdf.PdfCopy blankPdfWriter = new iTextSharp.text.pdf.PdfSmartCopy(blankPdf, new FileStream(blanksFile, FileMode.Append));
targetPdfWriter.SetTagged();
blankPdfWriter.SetTagged();
try
{
    iTextSharp.text.pdf.PdfImportedPage page = null;
    int n = sourcePdfReader.NumberOfPages;
    targetPdf.Open();
    blankPdf.Open();
    blankPdf.Add(new iTextSharp.text.Phrase("This document contains the blank pages removed from " + inputFile));
    blankPdf.NewPage();
    for (int i = 1; i <= n; i++)
    {
        byte[] pageBytes = sourcePdfReader.GetPageContent(i);
        string pageText = "";
        iTextSharp.text.pdf.PRTokeniser token = new iTextSharp.text.pdf.PRTokeniser(new iTextSharp.text.pdf.RandomAccessFileOrArray(pageBytes));
        while (token.NextToken())
        {
            if (token.TokenType == iTextSharp.text.pdf.PRTokeniser.TokType.STRING)
            {
                pageText += token.StringValue;
            }
        }
        if (pageText.Length >= 15)
        {
            page = targetPdfWriter.GetImportedPage(sourcePdfReader, i, true);
            targetPdfWriter.AddPage(page);
        }
        else
        {
            page = blankPdfWriter.GetImportedPage(sourcePdfReader, i, true);
            blankPdfWriter.AddPage(page);
            blankPageCount++;
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine("Exception at LOC1: " + ex.Message);
}
The error occurs in the call to targetPdfWriter.AddPage(page); near the end of the code sample.
Thank you very much for your help.
Koen.

How do I skip a blank page of a PDF when extracting text using iTextSharp?

My program reads through a PDF and extracts the text. When it reaches a blank page, I get the error "System.InvalidOperationException: Unable to handle Content of type iTextSharp.text.pdf.PdfDictionary", and the program stops.
How do I check to see if the page is blank before trying to read it? How do I continue in my program if it does hit a blank page?
Code:
for (int i = 1; i <= reader.NumberOfPages; i++)
    output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
Something like this?
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    string tmp = PdfTextExtractor.GetTextFromPage(reader, i,
        new SimpleTextExtractionStrategy());
    if (!string.IsNullOrEmpty(tmp))
        output.WriteLine(tmp);
}

Word OLE Automation - delete first page and manipulate header and footer

I am using PHP to drive Word Automation and manipulate Word documents, but I guess it can be done in any other language. What I need to do is quite simple: remove the first page and add a header and footer.
Here is my code:
$word = new COM('word.application');
$word->Documents->Open('xxx.docx');
$word->Documents[1]->SaveAs($result_file_name, 12);
Any samples?
This is the way you could do it in VBA. This can likely be ported to PHP fairly simply.
Sub RemoveFirstPageAndAddHeaderFooter()
    Dim d As Document
    Set d = ActiveDocument
    Dim pageOne As Range
    Set pageOne = d.Bookmarks("\page").Range
    pageOne.Select
    Selection.Delete
    d.Sections(1).Headers(1).Range.Text = "Some text"
    d.Sections(1).Footers(1).Range.InlineShapes.AddPicture "C:\beigeplum.jpg", False, True
End Sub
Note on the ...InlineShapes.AddPicture: the onus would be on you to ensure the picture is the right size. If you want more control over this, you would use .Footers(1).Shapes.AddPicture instead, as that lets you set the width/height, top/left, etc.
try
{
    $word = new COM("word.application") //$word = new COM("C:\x.docx");
        or die("couldnt create an instance of word");
    //bring word to the front
    $word->Visible = 1;
    //open a word document
    $word->Documents->Open("file.docx");
    // remove first page
    $range = $word->ActiveDocument->Bookmarks("\page");
    $range->Select();
    $word->Selection->Delete();
    //save the document as docx
    $word->Documents[1]->SaveAs("modified_file.docx", 12); // SaveAs('filename', format) // format: 0 - same?, 1 - doc?, 2 - text, 4 - text other encoding
}
catch(Exception $e)
{
    echo "error class.document.php - convert_to_docx: $e 20100816.01714";
}
//close word
if($word)
    $word->Quit();
//free object resources
//$word->Release();
$word = null;