PDF string extraction: telling whether a checkbox is checked or not - pdfbox

We have a method to check whether a checkbox in a PDF (no forms) is checked, and it works great on one company's PDFs. But on another company's, there is no way to tell whether the checkbox is checked or not.
Here is the code that works on one company's PDFs:
protected static final String[] HOLLOW_CHECKBOX = {"\uF06F", "\u0086"};
protected static final String[] FILLED_CHECKBOX = {"\uF06E", "\u0084"};

protected boolean isBoxChecked(String boxLabel, String content) {
    content = content.trim();
    for (String checkCharacter : FILLED_CHECKBOX) {
        String option = String.format("%s %s", checkCharacter, boxLabel);
        String option2 = String.format("%s%s", checkCharacter, boxLabel);
        if (content.contains(option) || content.contains("\u0084 ") || content.contains(option2)) {
            return true;
        }
    }
    return false;
}
However, when I do the same for another company's PDFs, there is nothing in the extracted text near the checkbox to tell us whether it is checked or not.
The big issue is that we have no XML schema, no metadata, and no forms on these PDFs; it is just a raw string. As you can imagine, a checkbox is difficult to represent in a string, but that is all we have. Here is an example of pulling the string from the PDF, from one page to another, with all the text in between:
protected String getTextFromPages(int startPage, int endPage, PDDocument document) throws IOException {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(startPage);
    stripper.setEndPage(endPage);
    return stripper.getText(document);
}
I wish the PDFs had an easier way to extract the text/data, but the vendors that make these PDFs decided it was better to leave that out.
No, we cannot have the vendors or other companies change anything: we receive these PDFs from the court system, they are submitted by lawyers we don't know, and those lawyers bought the PDF software that generates the files.
We also cannot take the even longer route of reverse-engineering the object model that PDFBox builds of the document, with things like
o.apache.pdfbox.util.PDFStreamEngine - processing substream token: PDFOperator{Tf}
because these are 80-100 page PDFs and it would take us years just to write code to parse one vendor's format.

Related

How to open a password protected PDF using VB6/VB.NET?

I want to open and view a password protected PDF file in a VB6/VB.NET program. I have tried using the Acrobat PDF Library but could not do it.
The reason I want to create a password protected PDF file is that I don't want the PDF file to be opened without the password externally, i.e. outside the program.
To open a password protected PDF you will need to develop at least a PDF parser, decryptor and generator. I wouldn't recommend doing that, though; it's nowhere near an easy task to accomplish.
With the help of a PDF library everything is much simpler. You might want to try the Docotic.Pdf library for the task.
Here is a sample for your task:
using System.IO;
using BitMiracle.Docotic.Pdf;

public static void unprotectPdf(string input, string output)
{
    bool passwordProtected = PdfDocument.IsPasswordProtected(input);
    if (passwordProtected)
    {
        string password = null; // retrieve the password somehow
        using (PdfDocument doc = new PdfDocument(input, password))
        {
            // clear both passwords in order
            // to produce an unprotected document
            doc.OwnerPassword = "";
            doc.UserPassword = "";
            doc.Save(output);
        }
    }
    else
    {
        // no decryption is required
        File.Copy(input, output, true);
    }
}
Docotic.Pdf can also extract text (formatted or not) from PDFs. That might be useful for indexing (I guess that's what you are up to, because you mentioned Adobe IFilter).
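For instance, extracting the plain text of a document can be as simple as this (a minimal sketch from memory of the Docotic.Pdf API; double-check the method names against the current documentation):
using BitMiracle.Docotic.Pdf;

public static string extractAllText(string input)
{
    // GetText() returns the plain text of the whole document;
    // GetTextWithFormatting() tries to preserve the layout instead.
    using (PdfDocument doc = new PdfDocument(input))
        return doc.GetText();
}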
You can convert the C# code to VB with one of the online code converters.

How can I extract words with their coordinates from a PDF using .NET?

I'm working with PDFs in Hebrew with diacritical marks. I want to extract all the words with their coordinates. I tried to use iTextSharp and PdfClown, and neither gave me what I want.
In PdfClown there are missing letters/characters; in iTextSharp I don't get the words' coordinates.
Is there a way to do it? (I'm looking for a free framework/code.)
EDIT:
PDFClown Code:
File file = new File(PDFFilePath);
TextExtractor te = new TextExtractor();
IDictionary<RectangleF?, IList<ITextString>> strs = te.Extract(file.Document.Pages[0].Contents);
List<string> correctText = new List<string>();
foreach (var key in strs.Keys)
{
    foreach (var value in strs[key])
    {
        string reversedText = new string(value.Text.Reverse().ToArray());
        string cleanText = RemoveDiacritics(reversedText);
        correctText.Add(cleanText);
    }
}
You aren't showing how you are trying to extract text using iText(Sharp). I am assuming that you are following the official documentation and that your code looks like this:
public string ExtractText(byte[] src) {
    PdfReader reader = new PdfReader(src);
    MyTextRenderListener listener = new MyTextRenderListener();
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.GetPageN(1);
    PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
    processor.ProcessContent(
        ContentByteUtils.GetContentBytesForPage(reader, 1), resourcesDic);
    return listener.Text.ToString();
}
If your code doesn't look like this, that already explains the first thing you're doing wrong.
In this method, there is one class that isn't part of iTextSharp: MyTextRenderListener. This is a class you should write yourself, and it looks, for instance, like this:
public class MyTextRenderListener : IRenderListener {
    public StringBuilder Text { get; set; }

    public MyTextRenderListener() {
        Text = new StringBuilder();
    }

    public void BeginTextBlock() {
        Text.Append("<");
    }

    public void EndTextBlock() {
        Text.AppendLine(">");
    }

    public void RenderImage(ImageRenderInfo renderInfo) {
    }

    public void RenderText(TextRenderInfo renderInfo) {
        Text.Append("<");
        Text.Append(renderInfo.GetText());
        LineSegment segment = renderInfo.GetBaseline();
        Vector start = segment.GetStartPoint();
        Text.Append("| x=");
        Text.Append(start[Vector.I1]);
        Text.Append("; y=");
        Text.Append(start[Vector.I2]);
        Text.Append(">");
    }
}
When you run this code and look at what's inside Text, you'll notice that a PDF document doesn't store words. Instead, it stores text blocks. In our special IRenderListener, we indicate the start and the end of text blocks using < and >. Inside these text blocks, you'll find text snippets. We'll mark text snippets like this: <text snippet| x=36.0000; y=806.0000>, where the x and y values give you the coordinate of the start of the baseline (as opposed to the ascent and descent positions). You can also get the end position of the baseline (and the ascent/descent).
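For example, if you also need the vertical extent of a snippet, TextRenderInfo exposes the ascent and descent lines as well (a small optional addition inside the RenderText() method above; method names from the iTextSharp 5 parser API):
// the ascent/descent lines bound the glyphs vertically
LineSegment ascent = renderInfo.GetAscentLine();
LineSegment descent = renderInfo.GetDescentLine();
float top = ascent.GetStartPoint()[Vector.I2];
float bottom = descent.GetStartPoint()[Vector.I2];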
Now how do you distill words out of all of this? The problem with the text snippets you get is that they don't correspond with words. See for instance this file: hello_reverse.pdf
When you open it in Adobe Reader, you read "Hello World Hello People." You'd hope you'd find four words in the content stream, wouldn't you? In reality, this is what you'll find:
<>
<<ld><Wor><llo><He>>
<<Hello People>>
To distill the words "World" and "Hello" from the first line, you need to do plenty of math. Instead of getting the baseline of the TextRenderInfo object returned in the RenderText() method of your render listener, you have to use the GetCharacterRenderInfos() method. This will return a list of TextRenderInfo objects that gives you more info about every character (including the position of those characters). You then need to compose the words from those different characters.
This is explained in mkl's answer to this question: Retrieve the respective coordinates of all words on the page with itextsharp
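A rough sketch of that character-merging approach could look like this (my own simplification, not mkl's exact code; the fixed 1-point gap threshold and the assumption of horizontal, unrotated text are mine, and in practice you would derive the threshold from the font size or character widths):
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Merges per-character render infos into words by looking at the
// horizontal gap between consecutive characters.
public class WordCollector : IRenderListener {
    public List<string> Words = new List<string>();
    private StringBuilder current = new StringBuilder();
    private Vector lastEnd;
    private bool hasLast;

    public void BeginTextBlock() { }
    public void EndTextBlock() { Flush(); }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo) {
        foreach (TextRenderInfo ch in renderInfo.GetCharacterRenderInfos()) {
            Vector start = ch.GetBaseline().GetStartPoint();
            // a new word starts when there is a visible gap after the previous character
            if (hasLast && start[Vector.I1] - lastEnd[Vector.I1] > 1f)
                Flush();
            current.Append(ch.GetText());
            lastEnd = ch.GetBaseline().GetEndPoint();
            hasLast = true;
        }
    }

    private void Flush() {
        if (current.Length > 0)
            Words.Add(current.ToString());
        current.Length = 0;
        hasLast = false;
    }
}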
We've done similar projects. One of them is described here: https://www.youtube.com/watch?v=lZnbhnU4m3Y
You'll need to do quite some coding to get it right. One word about PdfClown: your text is probably stored as Unicode in your PDF. To retrieve the correct characters, the parser needs to examine the mapping between the glyphs stored in the font and the corresponding Unicode characters. If PdfClown drops letters, that means PdfClown doesn't do this task correctly. PdfClown is a one-man project, so you'll have to ask that developer to fix it (if he has the time).
As you can tell from the video, iText could help you out, but iText is a company with subsidiaries in the US, Belgium and Singapore. It has many employees, and to keep the company running we need to make money (that's how we pay our employees). Hence you shouldn't expect us to help you for free. Surely you can understand this, as you wouldn't want to work for free either, would you?

Replace string in PDF document (iTextSharp or PdfSharp)

We use a non-managed DLL that has a function to replace text in a PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to a managed solution (iTextSharp or PdfSharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?
According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general purpose PDF library, it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        byte[] contentBytes = reader.GetPageContent(1);
        string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
        contentString = contentString.Replace(origText, replaceText);
        reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
WARNING: Just like in case of the Debenu function, for most documents this code wouldn’t have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
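To see why, here is how a kerned word typically ends up in a content stream. A viewer renders "Moby", but the text-showing operator carries it in pieces with kerning adjustments in between (illustrative content, not taken from a real file):
BT
/F1 12 Tf
[ (Mo) 24 (b) -13 (y) ] TJ
ET
A naive string replace of "Moby" over those bytes finds nothing to match.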
Thus, if during your move to a managed solution you also change the way the source documents are created, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.
For PdfSharp users, here's a somewhat usable function. I copied it from my project; it uses a utility method that is consumed by other methods, hence the unused result.
It ignores whitespace created by kerning, and therefore may mess up the result (all characters in the same space) depending on the source material.
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
    ModifyPdfContentStreams(contentPage, stream =>
    {
        if (!stream.TryUnfilter())
            return false;
        // match the search text even if whitespace was inserted between its
        // characters; Regex.Escape keeps regex metacharacters literal
        var search = string.Join("\\s*", source.Select(c => Regex.Escape(c.ToString())));
        var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
        if (!Regex.IsMatch(stringStream, search))
            return false;
        stringStream = Regex.Replace(stringStream, search, target);
        stream.Value = Encoding.Default.GetBytes(stringStream);
        stream.Zip();
        return false;
    });
}

public static void ModifyPdfContentStreams(PdfPage contentPage, Func<PdfDictionary.PdfStream, bool> modification)
{
    for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
        if (modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
            return;
    var resources = contentPage.Elements?.GetDictionary("/Resources");
    var xObjects = resources?.Elements.GetDictionary("/XObject");
    if (xObjects == null)
        return;
    foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
    {
        var stream = (item.Value as PdfDictionary)?.Stream;
        if (stream != null && modification(stream))
            return;
    }
}
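Hypothetical usage (the file names are mine; opening with PdfDocumentOpenMode.Modify is what allows PDFsharp to save changes):
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

PdfDocument document = PdfReader.Open("input.pdf", PdfDocumentOpenMode.Modify);
foreach (PdfPage page in document.Pages)
    ReplaceTextInPdfPage(page, "Moby", "Mary");
document.Save("output.pdf");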

How to detect whether a font used in a PDF is bold/italic/plain

While extracting content from a PDF using the MuPDF library, I am getting only the font name, not its font face.
Do I guess (e.g. look for "bold" in the font name, though that's not the right way), or is there another way to detect whether a specific font is bold/italic/plain?
I have used iTextSharp to extract the font family, font color, etc.:
public void Extract_inputpdf() {
    text_input_File = string.Empty;
    StringBuilder sb_inputpdf = new StringBuilder();
    PdfReader reader_inputPdf = new PdfReader(path); // read PDF
    for (int i = 1; i <= reader_inputPdf.NumberOfPages; i++) { // page numbers are 1-based
        TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
        text_input_File = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);
        sb_inputpdf.Append(text_input_File);
    }
    input_pdf = sb_inputpdf.ToString();
    reader_inputPdf.Close();
    clear();
}

public class TextWithFont_inputPdf : iTextSharp.text.pdf.parser.ITextExtractionStrategy {
    private readonly StringBuilder result = new StringBuilder();

    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
        // the PostScript name often encodes the style, e.g. "Arial-BoldMT";
        // split it if you want the family and style parts separately
        string curFont = renderInfo.GetFont().PostscriptFontName;
        result.Append(renderInfo.GetText());
    }

    // the remaining ITextExtractionStrategy members
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo) { }

    public string GetResultantText() {
        return result.ToString();
    }
}
The PDF spec contains entries which allow you to specify the style of a font. Unfortunately, in the real world you will often find that these are absent.
If the font is referenced rather than embedded, this generally means you are stuck with the PostScript name for the font. It requires some heuristics, but normally the name provides sufficient clues as to the style. It sounds like this is pretty much where you are.
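A minimal sketch of such a name-based heuristic (entirely my own illustration, not an API of MuPDF or ABCpdf):
using System;

[Flags]
public enum FontStyleGuess { Plain = 0, Bold = 1, Italic = 2 }

public static class FontStyleHeuristic {
    // Guess the style from a PostScript name like "ABCDEF+Arial-BoldItalicMT".
    // Purely heuristic: the name is chosen by the producing tool.
    public static FontStyleGuess Guess(string postScriptName) {
        string name = postScriptName ?? string.Empty;
        int plus = name.IndexOf('+'); // strip a subset tag like "ABCDEF+"
        if (plus >= 0)
            name = name.Substring(plus + 1);
        FontStyleGuess style = FontStyleGuess.Plain;
        if (name.IndexOf("Bold", StringComparison.OrdinalIgnoreCase) >= 0)
            style |= FontStyleGuess.Bold;
        if (name.IndexOf("Italic", StringComparison.OrdinalIgnoreCase) >= 0 ||
            name.IndexOf("Oblique", StringComparison.OrdinalIgnoreCase) >= 0)
            style |= FontStyleGuess.Italic;
        return style;
    }
}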
If the font is embedded you can parse it and try and find style information from the embedded font program. If it is subsetted then in theory this information might be removed but in general I don't think it will be. However parsing TrueType/OpenType fonts is boring and you may not feel that it is worth it.
I work on the ABCpdf .NET software component, so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

Programmatically convert PDF images to 8 bit

I have a set of PDFs in normal RGB colour. They would benefit from conversion to 8 bit to reduce file sizes. Are there any APIs or tools that would allow me to do this whilst retaining non-raster elements in the PDF?
This is a fun one. Atalasoft dotImage with the PDF Rasterizer and dotPdf can do this (disclaimer: I work for Atalasoft and wrote most of the PDF tools). I'd start by finding candidate pages:
List<int> GetCandidatePages(Stream pdf, string password)
{
    List<int> retVal = new List<int>();
    using (PageCollection pages = new PageCollection(pdf, password)) {
        for (int i = 0; i < pages.Count; i++) {
            if (pages[i].SingleImageOnly())
                retVal.Add(i);
        }
    }
    pdf.Seek(0, SeekOrigin.Begin); // restore file pointer
    return retVal;
}
Next, I'd rasterize only those pages, turning them into 8-bit images, but to keep things efficient, I'd use an ImageSource which manages memory well:
public class SelectPageImageSource : RandomAccessImageSource {
    private List<int> _pages;
    private Stream _stm;

    public SelectPageImageSource(Stream stm, List<int> pages)
    {
        _stm = stm;
        _pages = pages;
    }

    protected override ImageSourceNode LowLevelAcquire(int index)
    {
        PdfDecoder decoder = new PdfDecoder();
        _stm.Seek(0, SeekOrigin.Begin);
        AtalaImage image = decoder.Read(_stm, _pages[index], null);
        // change to 8 bit
        if (image.PixelFormat != PixelFormat.Pixel8bppIndexed) {
            AtalaImage changed = image.GetChangedPixelFormat(PixelFormat.Pixel8bppIndexed);
            image.Dispose();
            image = changed;
        }
        return new FileReloader(image, new PngEncoder());
    }

    protected override int LowLevelTotalImages() { return _pages.Count; }
}
Next you need to create a new PDF from this:
public void Make8BitImagePdf(Stream pdf, Stream outPdf, List<int> pages)
{
    PdfEncoder encoder = new PdfEncoder();
    SelectPageImageSource source = new SelectPageImageSource(pdf, pages);
    encoder.Save(outPdf, source, null);
}
Next you need to replace the original pages with the new ones:
public void ReplaceOriginalPages(Stream pdf, Stream image8Bit, Stream outPdf, List<int> pages)
{
    PdfDocument docOrig = new PdfDocument(pdf);
    PdfDocument doc8Bit = new PdfDocument(image8Bit);
    for (int i = 0; i < pages.Count; i++) {
        docOrig.Pages[pages[i]] = doc8Bit.Pages[i];
    }
    docOrig.Save(outPdf); // this is your final document
}
This will do what you want, more or less. The less-than-ideal bit is that the image pages have been rasterized, which is probably not what you want. The nice thing is that just by rasterizing, generating output is easy, but it might not be at the resolution of the original image. This can be fixed, but it is significantly more work: you need to extract the image from SingleImageOnly pages and then change their pixel format. The problem is that SingleImageOnly does NOT imply that the image fills the entire page, nor does it imply that the image is placed in any particular location. In addition to the PixelFormat change (actually, before the change), you would want to apply the matrix that places the image on the page to the image itself, and then use PdfEncoder with an appropriate set of margins and the original page size to get the image where it should be. This is all cut-and-dried, but it is a substantial amount of code.
There is another approach that might also work using our PDF generation API. It involves opening the document and swapping out the image resources for the document with 8-bit ones. This is also doable, but is not entirely trivial. You would do something like this:
public void ReplaceImageResources(Stream pdf, Stream outPdf, List<int> pages)
{
    PdfGeneratedDocument doc = new PdfGeneratedDocument(pdf);
    doc.Resources.Images.Compressors.Insert(0, new AtalaImageCompressor());
    foreach (int page in pages) {
        // GetSinglePageImage uses PageCollection, as above, to
        // pull a single image from the page (no need to use the matrix),
        // then converts it to 8 bpp indexed and returns it, or null if it
        // is already 8 bpp indexed (or 4 bpp or 1 bpp)
        using (AtalaImage image = GetSinglePageImage(pdf, page)) {
            if (image == null) continue;
            foreach (string resName in doc.Pages[page].ImportedImages) {
                doc.Resources.Images.Remove(resName);
                doc.Resources.Images.Add(resName, image);
                break;
            }
        }
    }
    doc.Save(outPdf);
}
As I said, this is tricky - the PDF generation suite was made for making new PDFs from whole cloth or adding new pages to an existing PDF (in the future, we want to add full editing). But PDF manages all of its images as resources within the document, and we have the ability to replace those resources entirely. So to make life easier, we add an ImageCompressor to the image resource collection that handles AtalaImage objects, remove the existing image resources, and replace them with the new ones.
Now I'm going to do something that you probably won't see any vendor do when talking about their own products - I'm going to be critical of it on a number of levels. First, it isn't super cheap. Sorry. You might get sticker shock when you look at the price, but the price includes technical support from a staff that is honestly second to none.
You can probably do a lot of this with iTextSharp, Bit Miracle's Docotic.Pdf library, or Tall Components' PDF libraries. The latter two also cost money. Bit Miracle's engineers have proven to be pretty helpful and you're likely to see them here (hi!). Maybe they can help you out too. iTextSharp is problematic in that you really need to understand the PDF spec to do the right thing, or you're likely to output garbage PDF. I've done this experiment with my own library side by side with iTextSharp and found a number of pain points for common tasks that require an in-depth knowledge of the PDF spec to fix. I tried to make decisions in my high-level tools such that you didn't need to know the PDF spec, nor did you need to worry about creating bad PDF.
I don't particularly like the fact that there are several apparently different tools in our code base that do similar things. PageCollection is part of our PDF rasterizer for historical reasons. PdfDocument is made strictly for manipulating pages and tries to be lightweight and stingy with memory. PdfGeneratedDocument is made for manipulating/creating page content. PdfDecoder is for generating raster images from existing PDF. PdfEncoder is for generating image-only PDF from images. It can be daunting to have all these apparently overlapping niche tools, but there is a logic to them and their relationship to each other.