PDFs: Extracting text associated with font (linux) - pdf

The general problem that I'm trying to solve is to determine how much text in a large set of PDFs is associated with different fonts. I know I can extract text from a PDF using pdftotext and font information with pdffonts, but I can't figure out how to link those together. I have 100,000+ PDFs to process, so I will need something I can program against (and I don't mind a commercial solution).

PDFTron PDFNet SDK can extract all the graphics operations on a page, including text objects together with a link to the font each one uses.
Starting with the ElementReader sample, you can get the Font for every text element.
https://www.pdftron.com/documentation/samples?platforms=windows#elementreader
https://www.pdftron.com/api/PDFNet/?topic=html/T_pdftron_PDF_Font.htm

The Adobe PDF Library - a product my company sells - can do that.
This is part of the sample code:
// This callback function is called for each PDWord object.
ACCB1 ASBool ACCB2 WordEnumProc(PDWordFinder wfObj, PDWord pdWord, ASInt32 pgNum, void* clientData)
{
    char str[128];
    char fontname[100];

    // Get the word text.
    PDWordGetString(pdWord, str, sizeof(str));

    // Get the font name from the style of the word's first character.
    PDStyle style = PDWordGetNthCharStyle(wfObj, pdWord, 0);
    PDFont wordFont = PDStyleGetFont(style);
    PDFontGetName(wordFont, fontname, sizeof(fontname));

    printf("%s [%s]\n", str, fontname);
    return true;
}
This is the output example:
...
Chapter [Arial,Bold]
2: [Arial,Bold]
Overview [Arial,Bold]
27 [Arial]
...
This [TimesNewRoman]
book [TimesNewRoman]
describes [TimesNewRoman]
the [TimesNewRoman]
Portable [TimesNewRoman]
Document [TimesNewRoman]
Format [TimesNewRoman]
...
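With output in that "word [font]" form, the per-font totals the question asks for come down to a simple aggregation. Here is a minimal sketch in Python, assuming every output line ends with the font name in square brackets. The function name chars_per_font and the choice to count characters rather than words are illustrative assumptions, not part of the Adobe sample.

```python
import re
from collections import Counter

# One output line looks like "Chapter [Arial,Bold]": text, whitespace, [font name].
LINE_RE = re.compile(r"^(?P<word>.*\S)\s+\[(?P<font>[^\[\]]+)\]$")


def chars_per_font(lines):
    """Sum the number of characters of text seen for each font name."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip "..." separators and anything that does not parse
        totals[match.group("font")] += len(match.group("word"))
    return totals


sample = [
    "Chapter [Arial,Bold]",
    "2: [Arial,Bold]",
    "Overview [Arial,Bold]",
    "27 [Arial]",
    "...",
    "This [TimesNewRoman]",
    "book [TimesNewRoman]",
]

for font, count in sorted(chars_per_font(sample).items()):
    print(font, count)
```

For 100,000+ files the same counter could be kept per document and merged at the end, since Counter objects can be added together.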

Related

PDF-Forms with Unicode chars [closed]

I am currently struggling with a PDF form created from a LibreOffice document.
I created it as suggested in the book "iText in Action" and am now trying to pre-fill the embedded form with a few values that can contain Unicode characters.
This includes characters that consist of a base character plus a combining character (e.g. M̂).
I have tried several hints I found on Stack Overflow and in the book, but I never got a PDF document with a form that works in all viewers and on all platforms (Okular and Evince on Linux, Acrobat DC, macOS Preview, etc.).
I'm aware that I need a font that covers the characters, and that I need to embed the font fully. Below are the code I used to fill the PDF document and the PDF files.
My questions are:
Is the different behavior of the PDF readers due to a weakness in the PDF specification that I have to live with?
The Linux PDF readers and Acrobat behave especially badly. Are there known bugs?
I'm not very familiar with internals of PDF, so any suggestions? Are the contents of my PDF files ok?
Any suggestions on how to improve the code to get better results?
Code to fill the form:
BaseFont uniFont = BaseFont.createFont("./src/main/resources/UnicodeDoc.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED, false, null, null, false);
uniFont.setSubset(false);

// Debugging code...
for (String codepage : uniFont.getCodePagesSupported()) {
    System.out.println("Codepage = " + codepage);
}

FileInputStream fis = new FileInputStream(src);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PdfReader reader = new PdfReader(fis);
PdfStamper stamper = new PdfStamper(reader, baos);

// Fill all fields in PDF form
String text = "aM\u0302a"; // Same as "aM̂a"
com.itextpdf.text.pdf.AcroFields form = stamper.getAcroFields();
for (String fname : form.getFields().keySet()) {
    System.out.println("form." + fname);
    form.setField(fname, text);
    form.setFieldProperty(fname, "textfont", uniFont, null);
}
form.setGenerateAppearances(true);
form.addSubstitutionFont(uniFont);
stamper.setFormFlattening(false);
stamper.close();
reader.close();
Template
Template filled
Font
Thanks in advance, Mik86
I'm not very familiar with internals of PDF, so any suggestions? Are the contents of my PDF files ok?
I'll have to dig into the PDF specification to see if there is something definitively incorrect going on, but to me there does appear to be some confusion about fonts.
Firstly, your input Template gives me an error when I attempt to open it in Acrobat, and LiveCycle complains that "UnicodeDoc" must be swapped out for a different font. "UnicodeDoc" is used within the original input file.
Note that the font "UnicodeDoc" is not embedded in your input file. When filling in the form you create and embed a font, but it looks like you don't overwrite the original (again, not to say this is correct or incorrect).
Without going too much into the inner workings of PDFs, the form field that is getting filled out still links to the original font, which isn't embedded.
This doesn't necessarily directly address the issue, but if I "fix" your document by removing the font from the original template:
input.pdf
And run it through your code it produces output.pdf which has the correct output in Acrobat and Reader.
Again, this isn't to say your PDF is wrong or iText is wrong in this case as I haven't looked through the entire specification to see what (if any) interaction is expected here, but as it stands the font that you are embedding is not the font that ends up getting used in the form field.

Renaming a pdf file (scanned document) using OCR. It should read 3 zones and rename accordingly. e.g. Streetname_LastName_Date.pdf

I am running into trouble with tons of paperwork. I want to have it digitized in order to simplify searching and cut down the huge amount of time spent looking through the paperwork.
It is rather simple: I want to scan documents that share the same layout and rename each file according to 3 areas within the document. In my case these are a reference number, a last name and the date listed on the document. It would be even better if it could move the files to folders named after an area in the document.
Here is an image. Basically this, but with hundreds of PDFs in a batch.
http://i.imgur.com/8vwwyEb.png
I couldn't find a solution for days, and yet the technology is there. Have you ever come across a problem like this and found a solution? I would really appreciate your help.
The closest thing I have found is a program called FileCenter, but you need to click a button for each scan. Using OCR on existing files requires you to go through a 3-click menu for each file. I wonder if there is an easy batch program where you just select the rectangles and it does the renaming part.
I will edit this post if a solution can be found, for anyone googling.
You can do this with the commercial component ByteScout PDF Extractor SDK, which is designed for this purpose. It can extract text from a given region by coordinates, with optional OCR (which also works within the selected extraction region), in a batch. The coordinates of the region to extract text from can be measured in a representative document with the free PDF Multitool utility (assuming all your PDF files use the same layout).
You can extract text from a given region in C# like this, using OCR (English language):
using System;
using System.IO;
using System.Text;
using Bytescout.PDFExtractor;
using System.Drawing;
using System.Diagnostics;

namespace Example
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create Bytescout.PDFExtractor.TextExtractor instance
            TextExtractor extractor = new TextExtractor();
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Enable OCR auto mode; it will use English by default
            extractor.OCRMode = OCRMode.Auto;

            string sourceFile = "sample.pdf";

            // Load source PDF file
            extractor.LoadDocumentFromFile(sourceFile);

            // Extract from the given area (measured in a typical file using the
            // PDF Multitool utility), assuming it contains the reference string
            extractor.SetExtractionArea(Rectangle.FromLTRB(10, 10, 100, 100));
            string extractedReference = extractor.GetTextFromPage(0).Trim();

            // Dispose the extractor to release the original file
            extractor.Dispose();

            // Copy the original file to a file named after the extracted
            // reference, so it will be like "1234-sample.pdf"
            string outputFile = extractedReference + "-" + sourceFile;
            File.Copy(sourceFile, outputFile);

            Console.WriteLine();
            Console.WriteLine(sourceFile + " has been copied to " + outputFile);
            Console.WriteLine("Press any key to continue...");
            Console.ReadKey();
        }
    }
}
Disclosure: I'm connected with ByteScout
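Whichever tool does the extraction, OCR text from the three zones usually needs cleaning before it is safe to use in a file name (slashes in dates, stray spaces, empty results). A minimal sketch of that step in Python follows. The helper names, the "unknown" fallback and the Reference_LastName_Date pattern are assumptions for illustration, and the sketch only computes names without touching the file system.

```python
import re
from pathlib import Path


def clean_zone(raw):
    """Turn OCR text from one zone into a token that is safe in a file name."""
    # Replace every run of characters that is not a letter, digit,
    # underscore or hyphen with a single hyphen.
    cleaned = re.sub(r"[^\w\-]+", "-", raw.strip()).strip("-")
    return cleaned or "unknown"  # hypothetical fallback for empty OCR results


def build_name(reference, last_name, date_text):
    """Combine the three zones into Reference_LastName_Date.pdf."""
    zones = (reference, last_name, date_text)
    return "_".join(clean_zone(zone) for zone in zones) + ".pdf"


def target_path(base_dir, folder_zone, reference, last_name, date_text):
    """Place the renamed file in a folder named after another zone."""
    return Path(base_dir) / clean_zone(folder_zone) / build_name(
        reference, last_name, date_text
    )


print(build_name(" 4711 ", "Müller", "12/03/2015"))
print(target_path("archive", "Main Street", "4711", "Smith", "2015-03-12"))
```

In a batch run it would also be worth checking whether the target name already exists, since two scans can yield the same three values.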

PDF acroform fields become non editable in Adobe reader after writing to it using Pdfbox APIs

I am reading a PDF which has editable fields and the fields can be edited by opening it through Adobe Reader. I am using PDFBox APIs to generate an output PDF with data filled for the editable fields in input PDF. The output PDF can be opened using Adobe Reader and I am able to see the field values but I am unable to edit those fields directly from Adobe reader.
There is also a JIRA ticket for this issue and it is unresolved according to this link :
https://issues.apache.org/jira/browse/PDFBOX-1121
Can anybody please tell me if this got resolved? Also, if possible please answer the following questions related to my question:
Is there any protection policy or access permission that I need to explicitly set in order to edit the output PDF from Adobe reader?
Every time I open the PDF that was written to using pdfbox APIs, I get this message prompt:
" The document has been changed since it was created and use of extended features is no longer available...."
I am using PdfBox 1.8.6 jar and Adobe reader 11.0.8. I would really appreciate if anybody could help me with this issue.
Code snippet added to aid responders in debugging :
String outputFileNameWithPath = "C:\\myfolder\\testop.pdf";
PDDocument pdf = null;
pdf = PDDocument.load( outputFileNameWithPath );
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();

// The map pdfValues is a collection of the data that I need to set in the PDF.
// I am unable to go into the details of my data source.
// The key in the data map corresponds to the PDField's expanded name and the
// value corresponds to the data that I am trying to set.
Iterator<Entry<String, String>> iter = pdfValues.entrySet().iterator();
String name = null;
String value = null;
PDField field = null;

// Iterate over all data and see if the PDF has a matching field.
while (iter.hasNext()) {
    Map.Entry<String, String> currentEntry = iter.next();
    name = currentEntry.getKey();
    value = currentEntry.getValue();
    if (name != null) {
        name = CommonUtils.fromSchemaNameToPdfName(name);
        field = acroForm.getField(name);
    }
    if (field != null && value != null) {
        field.setValue( value ); // setting the values once field is found.
    }
}

// Set access permissions / encryption here before saving
pdf.save(outputFileNameWithPath);
Thanks.
The document has been changed since it was created and use of extended features is no longer available....
This indicates that the original form has been Reader-enabled, i.e. an integrated Usage-Rights digital signature has been applied to the document using a private key held by Adobe which tells the Adobe Reader that it shall make some extra functionality available to the user viewing that form.
If you don't want to break that signature during form fill-ins with PDFBox, you need to make sure that you
don't make any changes other than form fill-ins, and
save the changes as an incremental update.
If you provided your form fill-in code and your source PDF, this could be analyzed in more detail.
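One quick sanity check on a saved file: an incremental update appends the new objects after the end of the original file and leaves the original bytes untouched. The Python sketch below (the function name is hypothetical) tests only this necessary condition. It does not validate the Usage-Rights signature itself, and a file that passes can still contain changes that Adobe Reader rejects.

```python
def looks_like_incremental_update(original: bytes, updated: bytes) -> bool:
    """Return True if `updated` is `original` plus appended data.

    A full rewrite of the PDF changes or reorders the original bytes, which
    invalidates any signature computed over them. An incremental update
    keeps them as an exact prefix of the new file.
    """
    return len(updated) > len(original) and updated.startswith(original)


# Tiny stand-ins for real file contents, just to show the check.
before = b"%PDF-1.6\n...original objects...\n%%EOF\n"
appended = before + b"...updated field objects...\n%%EOF\n"
rewritten = b"%PDF-1.6\n...all objects written again...\n%%EOF\n"

print(looks_like_incremental_update(before, appended))
print(looks_like_incremental_update(before, rewritten))
```

With real files, the two arguments would be the raw bytes of the form before and after filling, for example read with open(path, "rb").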

How to count specific words in locked pdfs

How can I count specific words within a PDF file that is locked?
I am talking about annual reports here. You can search within them, but you can't copy text out of them (for whatever reason; it doesn't make sense).
After googling forever, I still haven't found a solution.
If your file contains text (and not just scanned images) and the fonts used contain information about the mapping from glyphs to characters, then you should be able to extract text from the file using any PDF library that provides text extraction capabilities.
Copying of text is usually forbidden by setting usage rights. Many PDF libraries ignore these settings and allow text extraction from locked PDFs.
Depending on the library, you might try extracting the whole text and splitting it into words yourself, or extracting the text as a collection of words (if the library can split text into words for you).
Here is a sample code for Docotic.Pdf library that shows how to build dictionary that contains information about words found in a PDF document and how many times they are used.
public static Dictionary<string, int> countWords(string file)
{
    Dictionary<string, int> wordCounts = new Dictionary<string, int>();
    using (PdfDocument pdf = new PdfDocument(file))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            PdfCollection<PdfTextData> words = page.GetWords();
            foreach (PdfTextData word in words)
            {
                // count stays 0 if the word has not been seen yet
                int count = 0;
                wordCounts.TryGetValue(word.Text, out count);
                wordCounts[word.Text] = count + 1;
            }
        }
    }
    return wordCounts;
}
Disclaimer: I work for the vendor of Docotic.Pdf.
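If the library only returns the page text as one string, the "split it yourself" route can be sketched as follows. This is a minimal Python example that assumes the text has already been extracted. The function name, the case-insensitive matching and the simple word pattern are illustrative choices, not features of any particular PDF library.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def count_target_words(text, targets):
    """Count how often each target word occurs in text, ignoring case."""
    wanted = {target.casefold() for target in targets}
    counts = Counter(
        word.casefold()
        for word in WORD_RE.findall(text)
        if word.casefold() in wanted
    )
    return {target: counts[target.casefold()] for target in targets}


extracted = "Risk factors: the risk of loss. Risks remain; RISK is managed."
print(count_target_words(extracted, ["risk", "loss", "profit"]))
```

Note that this matches whole words only, so "Risks" is not counted as "risk". Stemming or substring matching would be a separate design decision.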

How to generate a tiff file with meta data

I have to generate a TIFF file with many images and metadata.
I found that it's possible to convert a PNG or a JPG to TIFF here:
But how do I add metadata? Is it possible with ImageMagick for iOS?
Thanks
Edit: I finally installed ImageMagick on the iPhone, but I haven't found how to create a multipage TIFF with MagickWand. It's also possible to use libtiff directly:
I found how to create a simple empty page in C code:
char buffer[25 * 144] = { /* boring hex omitted */ };
TIFF *image;
char szFileName[512];
strcpy(szFileName, getenv("HOME"));
strcat(szFileName, "/Documents/");
strcat(szFileName, "output.tif");
// Open the TIFF file
if ((image = TIFFOpen(szFileName, "w")) == NULL)
{
    printf("Could not open output.tif for writing\n");
}
// We need to set some values for basic tags before we can add any data
TIFFSetField(image, TIFFTAG_IMAGEWIDTH, 25 * 8);
TIFFSetField(image, TIFFTAG_IMAGELENGTH, 144);
TIFFSetField(image, TIFFTAG_BITSPERSAMPLE, 1);
TIFFSetField(image, TIFFTAG_SAMPLESPERPIXEL, 1);
TIFFSetField(image, TIFFTAG_ROWSPERSTRIP, 144);
TIFFSetField(image, TIFFTAG_COMPRESSION, COMPRESSION_CCITTFAX4);
TIFFSetField(image, TIFFTAG_PHOTOMETRIC, PHOTOMETRIC_MINISWHITE);
TIFFSetField(image, TIFFTAG_FILLORDER, FILLORDER_MSB2LSB);
TIFFSetField(image, TIFFTAG_PLANARCONFIG, PLANARCONFIG_CONTIG);
TIFFSetField(image, TIFFTAG_XRESOLUTION, 150.0);
TIFFSetField(image, TIFFTAG_YRESOLUTION, 150.0);
TIFFSetField(image, TIFFTAG_RESOLUTIONUNIT, RESUNIT_INCH);
// Write the information to the file
TIFFWriteEncodedStrip(image, 0, buffer, 25 * 144);
// Close the file
TIFFClose(image);
So is there any C tutorial about how to insert image data into the created TIFF file?
And how do I create a multipage TIFF?
Thanks
I have used LibTIFF - but not on iOS directly. But then, it is a plain old C library, so should be fine. I do note that the Apple Image I/O framework supposedly supports image meta-data (but again, I have not used this myself). See link here - but seemingly nothing for multi-page TIFF or bespoke tags, only standardised camera info tags...
However, in plain C, adding your own bespoke tags is usually performed most simply by modifying the core library and adding them to the main header file: tiff.h along with a few wrapper functions.
See section on "Adding New Tags"
And then you can refer/use them as you would other TIFF tags, e.g. what I have done to load in some embedded xml data:
TIFFGetField(lp_tif,MYTAG, &lp_xml)
Of course, you then have to ship/maintain your modified version of libTIFF (new "public" tags have to go through the process with Adobe).
The example you posted is fine for writing a single-page TIFF file (i.e. set all the tags, then write the contents of buffer to the file). A multi-page TIFF goes one step further. You need to understand the concept of an Image File Directory (IFD). I would suggest looking at this link further to understand the use of the functions:
TIFFWriteDirectory()
TIFFReadDirectory()
NB: Properly, every TIFF file should have at least one directory to associate all the tags and image data together.
Finally, you can of course go one level even further! If you know the fixed structure of the TIFF file you want to create - simply write the bytes without even using LibTIFF.
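To make the IFD idea concrete, here is a rough Python sketch that writes a multi-page TIFF byte by byte, without LibTIFF. Each page gets its own IFD, and every IFD ends with the file offset of the next one (0 for the last page). It assumes uncompressed 1-bit pages stored as a single strip, and all function names are made up for the example. The resolution tags that baseline TIFF also expects are left out for brevity, so strict readers may complain.

```python
import struct

SHORT, LONG = 3, 4  # TIFF field types


def ifd_entry(tag, field_type, value):
    """Pack one 12-byte IFD entry holding a single SHORT or LONG value."""
    if field_type == SHORT:
        # A SHORT sits in the first two bytes of the 4-byte value field.
        return struct.pack("<HHIHH", tag, field_type, 1, value, 0)
    return struct.pack("<HHII", tag, field_type, 1, value)


def build_multipage_tiff(pages):
    """pages: list of (width, height, packed_1bit_rows) tuples."""
    if not pages:
        raise ValueError("a TIFF file needs at least one page")
    # Header: little-endian marker, magic number 42, offset of the first IFD.
    out = bytearray(b"II" + struct.pack("<HI", 42, 0))
    link_pos = 4  # position of the "next IFD" offset that still needs a value
    for width, height, pixels in pages:
        if len(pixels) != ((width + 7) // 8) * height:
            raise ValueError("pixel data does not match the page size")
        strip_offset = len(out)
        out += pixels
        if len(out) % 2:
            out.append(0)  # IFDs must start on an even offset
        ifd_offset = len(out)
        entries = [  # entries must be sorted by tag number
            ifd_entry(256, LONG, width),          # ImageWidth
            ifd_entry(257, LONG, height),         # ImageLength
            ifd_entry(258, SHORT, 1),             # BitsPerSample
            ifd_entry(259, SHORT, 1),             # Compression: none
            ifd_entry(262, SHORT, 0),             # Photometric: MinIsWhite
            ifd_entry(273, LONG, strip_offset),   # StripOffsets
            ifd_entry(277, SHORT, 1),             # SamplesPerPixel
            ifd_entry(278, LONG, height),         # RowsPerStrip
            ifd_entry(279, LONG, len(pixels)),    # StripByteCounts
        ]
        out += struct.pack("<H", len(entries)) + b"".join(entries)
        # Link the previous IFD (or the header) to this IFD.
        struct.pack_into("<I", out, link_pos, ifd_offset)
        link_pos = len(out)
        out += struct.pack("<I", 0)  # "next IFD" offset, 0 unless patched later
    return bytes(out)


def count_pages(data):
    """Walk the IFD chain of a little-endian TIFF and count the directories."""
    (offset,) = struct.unpack_from("<I", data, 4)
    pages = 0
    while offset:
        (entry_count,) = struct.unpack_from("<H", data, offset)
        (offset,) = struct.unpack_from("<I", data, offset + 2 + 12 * entry_count)
        pages += 1
    return pages


tiff_bytes = build_multipage_tiff([(8, 2, b"\xff\x00"), (8, 2, b"\x00\xff")])
print(count_pages(tiff_bytes))
```

With LibTIFF the same chaining is what TIFFWriteDirectory() takes care of after each page has been written.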
In case it is an option for you to use a script for adding the meta data:
Use Phil Harvey's exiftool!
exiftool is a quite powerful, well-documented (and multi-platform) commandline utility to read and write meta data from/to lots of different file formats, including TIFF.