Extract text content from PDF


I have been extracting text from PDFs using pdftotext. I have also done this with Ghostscript. Recently, a utility provider changed their PDFs so that a portion of the text is no longer extracted by these methods. Specifically, I'm missing the due date and total due. When I open the PDF in a reader, the 'missing' text can be highlighted, copied, and pasted into an external editor. When I open it in Acrobat Pro and view the content (View -> Show/Hide -> Navigation Panes -> Content), the text I need is there. How can I get it out without manually copying and pasting (which is not an option, because I'll be doing this on thousands of PDFs)?
Here is an example of what I'm dealing with. I have removed all sensitive data:
link to PDF
EDIT: I noticed after posting this that when you follow the link to the file (hosted on Google Drive), it will allow you to select and copy most text on page, but not the stuff I'm missing. When you download the file, you are able to select the missing text in a PDF reader.

Recent releases of Ghostscript have a txtwrite device which is probably worth trying.

I have solved this by getting the newest unreleased version of Ghostscript from git and building it. Now the txtwrite device gives me exactly what I need. Thanks to chrisl for his answer and comments leading me in the right direction.
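
For running the txtwrite device over thousands of files, a minimal Python sketch follows. It assumes the Ghostscript executable is on the PATH as gs (on Windows it is usually gswin32c or gswin64c), and the function names and folder layout are hypothetical.

```python
import subprocess
from pathlib import Path


def txtwrite_command(pdf_path, gs_executable="gs"):
    """Build a Ghostscript command that writes the PDF's text to stdout."""
    return [
        gs_executable,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the file has been processed
        "-sDEVICE=txtwrite",  # the text extraction device
        "-sOutputFile=-",     # "-" sends the device output to stdout
        str(pdf_path),
    ]


def extract_text(pdf_path, gs_executable="gs"):
    """Run Ghostscript on one PDF and return the extracted text."""
    result = subprocess.run(
        txtwrite_command(pdf_path, gs_executable),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_folder(folder, gs_executable="gs"):
    """Write a .txt file next to every PDF in the folder (hypothetical layout)."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        pdf.with_suffix(".txt").write_text(extract_text(pdf, gs_executable))
```

Calling extract_folder("bills") would process every PDF in a bills directory; whether the due date and total due appear depends on the Ghostscript build, as described above.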

There is a VERY HACKY method to extract the data, but it only works with older versions of Ghostscript, such as 8.51 or 8.62. In the older versions, the PDF operators are defined in /lib/pdf_ops.ps. Newer versions do something else.
A tested version of version 8.62 is available here.
http://sourceforge.net/projects/ghostscript/files/GPL%20Ghostscript/8.62/gs862w32.exe/download
The text you are after is shown by the Tj and TJ operators (/Tj {} def and /TJ {} def in pdf_ops.ps). Adding a dup == to the beginning of each definition prints the operand before it is shown. (This could be made more sophisticated.) I also didn't bother to worry about the font warning messages, but these would be filtered out if the data were written to a file.
Some words are split into pieces and individual letters because kerning is being done. Given time, this could also be filtered.
modified /Tj from pdf_ops.ps
/Tj { dup ==
0 0 moveto Show settextposition
} bdef
modified /TJ from pdf_ops.ps
/TJ { dup ==
0 0 moveto {
dup type /stringtype eq {
Show
} { -1000 div
currentfont /ScaleMatrix .knownget { 0 get mul } if
0 Vexch rmoveto
} ifelse
} forall settextposition
} bdef
output
(Help a neighbor within your county each month by contributing to The Salvation )
(Army's Project SHARE and Georgia Power will match your gift. To help, simply check )
($1, $2, $5, or $10 on the return portion of this bill. Starting next month, your pledge )
(amount will be included on your monthly bill.)
(Our business offices will be closed on December 24 and 25 for Christmas and January )
(1 for New Year's Day. In case of an emergency, please call us at the number on your )
(bill 24 hours a day, 7 days a week.)
(PLEASE KEEP THIS PORTION FOR YOUR RECORDS.)
(PLEASE RETURN THIS PORTION WITH YOUR PAYMENT, MAKING SURE THE RETURN ADDRESS SHOWS IN THE ENVELOPE WINDOW.)
(Account Number)
(Mail To:)
Isn't postscript fun?
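
The filtering mentioned above could be done outside PostScript. The following Python sketch assumes the == output has been captured as lines of text, that Tj operands print as (text) and TJ operands print as arrays such as [(He) -20 (llo)], and that only the escapes \(, \) and \\ matter. The function names are hypothetical.

```python
def ps_strings(line):
    """Return the PostScript string literals found in one line of == output."""
    strings, current, depth, escaped = [], [], 0, False
    for ch in line:
        if depth == 0:
            # Outside a string: skip kerning numbers, brackets and whitespace.
            if ch == "(":
                depth, current = 1, []
            continue
        if escaped:
            current.append(ch)  # keep the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "(":
            depth += 1          # balanced parentheses may nest inside a string
            current.append(ch)
        elif ch == ")":
            depth -= 1
            if depth == 0:
                strings.append("".join(current))
            else:
                current.append(ch)
        else:
            current.append(ch)
    return strings


def join_output(lines):
    """Join the pieces of each Tj/TJ operand into one line of text."""
    joined = []
    for line in lines:
        pieces = ps_strings(line)
        if pieces:  # lines without a string literal are dropped
            joined.append("".join(pieces))
    return joined
```

Warning lines that happen to contain parentheses would still slip through, so a real filter would need a stricter test for what counts as an operand line.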

Related

Problem with line breaks in PDF document generated by BIRT

I have some cell texts in a BIRT report which do not flow as nicely as I hoped.
For example,
The text is Long value resultwithaverylongname whichcannotbreak and I had hoped that it would be displayed like this:
Long value
resultwithaverylongname
whichcannotbreak
The render options are as follows:
renderOptions.setOutputFormat(IPDFRenderOption.OUTPUT_FORMAT_PDF);
renderOptions.setOption(IPDFRenderOption.PAGE_OVERFLOW, IPDFRenderOption.OUTPUT_TO_MULTIPLE_PAGES);
renderOptions.setOption(IPDFRenderOption.PDF_TEXT_WRAPPING, true);
renderOptions.setOption(IPDFRenderOption.PDF_WORDBREAK, true);
It seems to me that my desired output is physically possible but I don't know why BIRT does not break on a whitespace and breaks in the middle of the word.
I am using BIRT 4.16 (from Sourceforge). The texts contain normal whitespace (no non-breakable spaces) and are displayed via a data object.
3.Sep.21
I now have an example project which I am trying to commit to Github. In the meantime here is a screenshot showing breaks which look good and others which are not...
The git repo is here: https://github.com/pramsden/test.wordbreak

If the text "resultwithaverylongname" physically fits, then you are right:
BIRT should not break it in the middle of the word.
Your renderOptions seem right (depending on which BIRT version you are using).
At first glance this looks like a bug.
But: in German we often have quite long words, and I've created a lot of (complex) PDF reports with BIRT without ever seeing this issue.
So I guess it is a tiny silly detail which causes this.
Just to double-check:
Are the spaces between "Long", "value", "result..." normal spaces (0x20)? or non-breaking spaces?
Which BIRT release are you using?
Are you using a data item or a dynamic text item and if so, is it HTML or plain text?
Can you create a reproducible simple test case and post the rptdesign file somewhere?
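
The first check can be done on the raw cell text outside BIRT. This is a minimal Python sketch, assuming the text can be exported or copied as a Unicode string; the set of characters checked and the function name are choices made for this illustration, not something BIRT defines.

```python
import unicodedata

# Space-like characters that do not offer a line-break opportunity.
NON_BREAKING = {"\u00a0", "\u202f", "\u2007", "\u2060"}


def find_non_breaking(text):
    """Return (index, Unicode name) for each non-breaking character in text."""
    return [
        (index, unicodedata.name(ch))
        for index, ch in enumerate(text)
        if ch in NON_BREAKING
    ]
```

An empty result means every space in the string is an ordinary 0x20 space or some other breakable character.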

Well, I don't use BIRT, but try using a newline (\n).
In my case I use the PDFFlow library to generate PDF documents, and to make a line break I just use \n.
This is a simple example that creates a PDF file and uses a line break:
DocumentBuilder.New()
.AddSection()
.AddParagraphToSection("Hello world! \n go to the next line")
.ToDocument()
.Build("Result.PDF");
Try it and tell me if it works.

Edit a Mainframe file in the RecordEditor without a copybook

How do you edit a (binary EBCDIC) mainframe file in the RecordEditor without a Cobol copybook?
How do you generate Java code to read the file using the RecordEditor?
Note: This is an attempt to split a question that is far too broad to answer meaningfully into a series of simpler questions and answers.

Try to avoid editing a binary file without a Cobol copybook if at all possible. This should only be attempted as a last resort!
Try to get that Cobol copybook (or some field layout document) for the file!
Some general advice:
It is feasible when dealing with 10 to 20 fields in a record, but not if there are thousands of fields in a record.
Take your time and do not rush the process. Try to get each step correct before moving on.
Finally, upgrade to the latest version of the RecordEditor (currently 0.98.4).
This process will also work for normal text files.
RecordEditor Layout Wizard
To start the wizard select option Record Layouts >>> Layout Wizard.
File Structure screen
The file structure screen has 3 purposes:
Get the file structure: it could be Fixed Width, VB, or a Windows/Unix text file.
Get the record length (if it is a fixed-width file).
Get the font (character set / encoding).
The RecordEditor will try to work this out for you.
Field Selection Screen
The RecordEditor will try to work out where fields start and end, but it is not perfect. You need to carefully check and correct its choices.
On this screen, the fields are displayed in alternating colors:
you create/delete a field by clicking on
use the Clear Fields button to clear all the fields
you can change which field types to search for using the various checkboxes (e.g. Mainframe Zoned Decimal)
the Add Fields button will do another field search
Field Definition screen
On this screen you define the field names and types. You may need to go back to the Field Selection screen to adjust the fields.
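
To make concrete what the three screens collect (record length, encoding, and field boundaries with names), here is a minimal Python sketch that applies such a layout to fixed-width EBCDIC text fields. It is not RecordEditor or JRecord code. The layout, the field names and the cp037 code page are assumptions for illustration, and binary or packed-decimal fields would need extra handling.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    start: int   # zero-based byte offset within the record
    length: int  # field width in bytes


# Hypothetical layout for a 20-byte record; real values come from the wizard.
LAYOUT = [Field("account", 0, 8), Field("name", 8, 12)]
RECORD_LENGTH = 20


def read_records(data, layout=LAYOUT, record_length=RECORD_LENGTH, encoding="cp037"):
    """Split fixed-width bytes into records and decode each text field."""
    records = []
    # Step through the data one whole record at a time; a short tail is ignored.
    for offset in range(0, len(data) - record_length + 1, record_length):
        record = data[offset:offset + record_length]
        records.append({
            field.name: record[field.start:field.start + field.length]
            .decode(encoding)
            .rstrip()
            for field in layout
        })
    return records
```

Getting the record length or a field boundary wrong by one byte shifts every later value, which is why each wizard step is worth checking before moving on.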
Editing the file
Once the Record Layout has been defined, it can be used on the open file screen
Generating Java code
When editing your file, you can generate Java~JRecord code to read the file by selecting Generate >>> Java >>> ....
You can then enter a package-id and generate options:
and finally your sample Java code is generated to read / write the file.

Aspose PdfFileEditor Cutting off Pages

I have a function that takes an unspecified number of PDFs (in a List) and turns them into one PDF using the PdfFileEditor.append function, like so:
pdfFileEditor.append(streams.get(x), streams.get(y), 1, 1500, outputStream);
The function that controls the merging is usually fine, except that there is one PDF in the application that always seems to eat anything that was appended before it.
For example, if we have 5 PDFs where number 3 is the bad one.
We can either use a forward loop (append 1 and 2, then 1&2 and 3, then 1&2&3 and 4, then 1&2&3&4 and 5) or a backwards loop (append 4 and 5, then 3 and 4&5, then 2 and 3&4&5, then 1 and 2&3&4&5) to combine the PDFs.
In a forward loop, we end up with only 3, 4, and 5 in the final PDF. In a backward loop, we end up with 1, 2, and 3 in the final PDF.
I am not sure what's wrong with PDF 3. It opens fine. But it does appear to be a dynamic PDF (has fields etc). I tried both forward and backwards loops because I thought maybe the PDF type was causing a reset to occur on the output stream somehow.
Has anyone seen the append method essentially just ignore a stream before?
Notes
I know this is a deprecated package from Aspose, but company standards mean we cannot update to the new package.
Code is helpful - I can include the method, but it is long and the issue is clearly with the 1 PDF. Everything works in all cases except when a certain PDF is included in the list.

I am a social media developer at Aspose. I would suggest you download and try the latest version of Aspose.Pdf to see if the problematic file works with it. Also, it would help if you shared your complete code, the version of the library you are using, and the problematic file with us.

Apple script that can scan a pdf document and copy all the annotated highlighted text to clipboard with a page reference

When doing research I find myself usually annotating a pdf document (highlighting, adding notes), then I will create a note in Evernote and index all my annotations.
For example,
p 3 - "is it possible for schools to change their practices and thereby have a strongly positive effect on student achievement?"
p 10 - "the district boldly moved forward with several new reforms"
My hope is to work with a PDF document, annotate it, then run the applet, which would copy all my annotations (highlights and notes) to the clipboard. I could then paste them into a note, giving me an index of all the points I found useful.
I am using a Mac and am open to whichever language would make this simplest to create. My thinking is that an AppleScript would be best.

Skim can export notes as text, and it also has an AppleScript dictionary.
tell application "Skim" to tell document 1 to save as "notes as text" in "/Users/username/Desktop/notes.txt"
The output looks like this:
* Highlight, page 1
ocument (highlighting
* Text Note, page 1
aa
* Highlight, page 1
ent, annotate it,
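
The exported file can then be turned into the page-indexed form from the question. The Python sketch below assumes every note starts with a header line shaped like the ones above ("* Kind, page N") followed by its text. The function name is hypothetical.

```python
import re

# Header lines in the export look like "* Highlight, page 1".
HEADER = re.compile(r"^\* (?P<kind>.+), page (?P<page>\d+)$")


def index_notes(text):
    """Turn Skim's 'notes as text' export into 'p N - "..."' index lines."""
    entries = []  # each entry is (page number, list of text lines)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            entries.append((int(match.group("page")), []))
        elif line and entries:
            entries[-1][1].append(line)
    return [
        f"p {page} - " + '"' + " ".join(body) + '"'
        for page, body in entries
        if body
    ]
```

On a Mac, the joined result could be piped to pbcopy to place it on the clipboard.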

Automatically Convert Prices on a Web Page to A Different Currency

I am interested in possible methods of automatically converting the prices given when a web page is loaded from the currency given to a specified currency. Ideally, the conversion would also make use of the current exchange rate to give valid prices.
For example, in my specific case, I would like to convert the prices given in Euros (€) on this web site to Sterling (£).
I am looking at using a GreaseMonkey script for this conversion, but can anyone suggest other methods?
Thanks, MagicAndi.

Try the API: http://thecurrencygraph.com
It uses Geo Location scripts to detect the user's country and through that their native currency. It then converts your prices into their currency using the latest exchange rates
Hope this helps!
W.

Since I dabble in AutoHotkey, here's a potential solution using that scripting language. It retrieves the page source from a webpage that does the conversion and parses out the converted value. This requires the httpQuery library to be included:
#Include httpQuery.ahk
InputBox, n, EUR to GBP, Enter the number., , 150, 120
if (ErrorLevel || !n)
return
url := "http://www.xe.com/ucc/convert.cgi?Amount=" n "&From=EUR&To=GBP&image.x=55&image.y=8"
html := URLDownloadToVar(url)
Gui, Add, Edit, w125, % RegExMatch(html,"[\d\.]+(?= GBP)",m) ? m "£" : "The value could not be retrieved."
Gui, Show, AutoSize Center, GBP
VarSetCapacity(html,0)
Return
GuiClose:
GuiEscape:
Gui, Destroy
return
URLDownloadToVar(url){
if !RegExMatch(url,"^http://")
url := "http://" url
httpQuery(html,url)
VarSetCapacity(html, -1)
Return html
}
There are obviously more thorough (and complex) methods for solving this problem but this at least solves it with minimal effort.
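
For comparison, the rewriting step on page text itself can be sketched in Python. This assumes prices are written as € followed by a number with at most two decimals and no thousands separators, and that the exchange rate is supplied by the caller from some live source. The function name and the rate used in any call are hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Matches prices such as "€10", "€ 10.50" or "€10,50".
PRICE = re.compile(r"€\s?(\d+(?:[.,]\d{2})?)")


def convert_prices(text, rate):
    """Replace each euro price in text with its sterling equivalent."""
    def to_sterling(match):
        euros = Decimal(match.group(1).replace(",", "."))
        pounds = (euros * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        return f"£{pounds}"

    return PRICE.sub(to_sterling, text)
```

A GreaseMonkey script would apply the same pattern to the text nodes of the page, with the rate fetched when the page loads.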

The quick and easy answer is to make use of a Firefox add-on. There are a number of currency converters available as add-ons, but I ended up using Exch, as it suited my needs best.
