Need a suitable scripting language for the below requirements for PDF documents

We have a project requirement to validate PDF files, which would contain the below things for different policies.
Page number
Images (screenshots)
Here we want to validate whether all the pages have images (screenshots), the number of images in the PDF, image duplication, and empty pages.
Please suggest a suitable scripting language and a way to fulfill our requirement.
Note: Each policy will have a different set of screenshots, and hence the total number of pages and the image content of each PDF will vary.
Thanks in Advance!

I've had to validate a lot of PDFs and found this toolkit very useful: http://euske.github.io/pdfminer/index.html . It's written in Python, but comes with an excellent dumppdf utility which lets you look at each page's number and all the elements on that page.
Having said that, I've only used it for text and am not sure how it identifies images.
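Since the question is specifically about images, here is a minimal sketch of how one might count images per page with pdfminer.six (the maintained fork of pdfminer); this is my own illustration rather than something from the question, and the file name is a placeholder.

# Rough sketch, assuming pdfminer.six; "policy.pdf" is just a placeholder.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage

def iter_images(container):
    # LTImage objects are usually nested inside LTFigure containers.
    for element in container:
        if isinstance(element, LTImage):
            yield element
        elif isinstance(element, LTFigure):
            yield from iter_images(element)

def image_counts(pdf_path):
    # Map page number -> number of images found on that page.
    return {
        page_number: sum(1 for _ in iter_images(page))
        for page_number, page in enumerate(extract_pages(pdf_path), start=1)
    }

for page, count in image_counts("policy.pdf").items():
    print(f"page {page}: {count} image(s)")

Pages reporting zero images would be the "empty" pages, and hashing each image's raw bytes would be one way to spot duplicates.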

I would comment on Kim Ryan's answer, except that I don't have enough reputation to comment yet, which seems pretty silly.
In any case, I agree with Kim that pdfminer is probably your best bet overall. However, I would mention that looking for images isn't all that difficult, and there is an "extract" example in the pdfrw library that will find images and pull them out to a separate PDF file. I don't think it would be very hard to modify it to match images to page numbers. I am the pdfrw author, so you can email me (address at github) if you have any questions on this.

Related

How to compress PDF to the limit

I have a 130,000-page PDF with a size of 1 GB. Each page of the PDF has a variable serial code; the rest of the content is the same. It can be said that 99% of the content is identical. Can this situation be compressed down to dozens of megabytes?
I hope everyone can give me some guidance, or point me to some articles about the composition principles of PDF documents, or the principles of PDF compression.
I don't want a software tool; I want to understand the deeper principles.
In fact, I feel that the repeated content could be stored once as a shared object instead of being duplicated, much like encapsulation when writing code.
I provided a screenshot showing how two programs process the PDF. The single-page content of the two files is similar, but the sizes are very different: the 130,000-page file on the right is only 91.51 MB, while the file I generated with the same content is 1 GB, which is really amazing. (The test file on the left is not mine.)
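That shared-object idea is exactly how a small file can be achieved at generation time. Here is a minimal sketch, assuming the PDF is produced with reportlab (an assumption on my part; the question does not say how the 1 GB file was generated): the boilerplate is defined once as a Form XObject and every page only adds a reference to it plus the serial code.

# Sketch only: a reportlab Form XObject stores the shared content once.
from reportlab.pdfgen import canvas

c = canvas.Canvas("serials.pdf")

# Define the 99%-identical content once, as a named, reusable form.
c.beginForm("shared_page")
c.setFont("Helvetica", 12)
c.drawString(72, 720, "Static boilerplate that is identical on every page ...")
c.endForm()

# Each page stores only a reference to the form plus its variable serial code.
for serial in range(1, 101):          # 130,000 in the real case
    c.doForm("shared_page")
    c.setFont("Helvetica", 12)
    c.drawString(72, 60, f"Serial: {serial:06d}")
    c.showPage()

c.save()

Whether an existing 1 GB file can be rewritten this way after the fact depends on how it was produced; tools that rebuild the file can deduplicate identical streams, which is presumably what the 91.51 MB file on the right is doing.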

extract dimensions from PDF using OCR

I am looking for a way to programmatically examine a PDF CAD drawing (a plain 2D print) and pull out all the dimensions, along with the locations of those dimensions on the page. I am in search of technologies that will allow me to do this.
I'm looking at LEADTOOLS, PDFBox, iText, TET, and the Adobe SDK, and trying to do some comparison among them. I am particularly interested in recognizing dimensions/numbers and shapes accurately, and the API must also be able to extract location info. Any past experience with any of these, or helpful insight on the good ones/bad ones, would be greatly appreciated!
We can provide relevant information about the LEADTOOLS part of your question since it's our product.
If the PDF contains actual text and not just an image of text, you can extract it directly without going through OCR. To do that, use the Leadtools.Pdf.PDFDocument.ParsePages() method.
If you’re dealing with images that contain both text and non-text areas, you could use Leadtools.ImageProcessing.Core.AutoZoningCommand to isolate the text zones (areas) and get their coordinates. You could then use either our OCR engine or your own code. If you try this and don’t get satisfactory results, there could be other advanced options to help you, but we might need to see actual samples you’re working with. If you like, email some sample files to our support address and mention what you tried so far.

Extracting information from PDFs of research papers [closed]

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.
At the very least, the title and abstract. The list of authors and their affiliations would be good. Extracting out the references would be amazing.
Ideally this would be an open source solution.
The problem is that not all PDFs encode the text, and many that do fail to preserve the logical order of the text, so just running pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1, etc.
I know there are a lot of libraries. It's identifying the abstract, title, authors, etc. in the document that I need to solve. This will never be possible every time, but 80% would save a lot of human effort.
I'm only allowed one link per posting so this is it:
pdfinfo Linux manual page
This might get the title and authors. Look at the bottom of the manual page; there's a link to www.foolabs.com/xpdf, where the source code for the program can be found, as well as binaries for various platforms.
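For what it's worth, here is a small sketch of how one might wrap pdfinfo from Python (my own illustration; the file name is a placeholder and the fields depend on what the PDF actually carries):

import subprocess

def pdf_metadata(path):
    # pdfinfo prints "Key:   value" lines; turn them into a dict.
    output = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    ).stdout
    info = {}
    for line in output.splitlines():
        key, _, value = line.partition(":")
        if value:
            info[key.strip()] = value.strip()
    return info

meta = pdf_metadata("paper.pdf")
print(meta.get("Title"), "--", meta.get("Author"))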
To pull out bibliographic references, look at cb2bib:
cb2Bib is a free, open source, and multiplatform application for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.
You might also want to check the discussion forums at www.zotero.org where this topic has been discussed.
We ran a contest to solve this problem at Dev8D in London, Feb 2010 and we got a nice little GPL tool created as a result. We've not yet integrated it into our systems but it's there in the world.
https://code.google.com/p/pdfssa4met/
Might be a tad simplistic, but Googling "bibtex + paper title" usually gets you a formatted BibTeX entry from the ACM, CiteSeer, or other such reference-tracking sites. Of course, this is assuming the paper isn't from a non-computing journal :D
-- EDIT --
I have a feeling you won't find a ready-made solution for this; you might want to write to citation trackers such as CiteSeer, the ACM, and Google Scholar to get ideas about what they have done. There are tons of others, and you may find that their implementations are not closed source but also not published in a usable form. There is plenty of research material on the subject.
The research team I am part of has looked at such problems, and we have come to the conclusion that hand-written extraction algorithms or machine learning are the way to do it. Hand-written algorithms are probably your best bet.
This is quite a hard problem due to the amount of variation possible. I suggest normalizing the PDFs to text (which you can get from any of the dozens of programmatic PDF libraries). You then need to implement custom text-scraping algorithms.
I would start backward from the end of the PDF and look at what sort of citation keys exist -- e.g., [1], [author-year], (author-year) -- and then try to parse the sentence following. You will probably have to write code to normalize the text you get from a library (removing extra whitespace and such). I would only look for citation keys as the first word of a line, and only scan 10 pages per document -- the first word must have a key delimiter, e.g., '[' or '('. If no keys can be found in 10 pages, ignore the PDF and flag it for human intervention.
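A rough sketch of that backward scan (the pattern, threshold, and file name are my own illustration, not from the answer above):

import re

# Lines that start with "[...]" or "(...)" followed by text.
KEY_PATTERN = re.compile(r"^\s*(\[[^\]]{1,40}\]|\([^)]{1,40}\))\s+(.+)")

def find_reference_lines(text, max_lines=400):
    # Walk backward from the end of the extracted text, collect candidate
    # reference lines, then put them back in reading order.
    references = []
    for line in reversed(text.splitlines()[-max_lines:]):
        match = KEY_PATTERN.match(line)
        if match:
            key, rest = match.groups()
            references.append((key, " ".join(rest.split())))  # squeeze whitespace
    return list(reversed(references))

with open("paper.txt", encoding="utf-8") as handle:
    for key, entry in find_reference_lines(handle.read()):
        print(key, entry)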
You might want a library that you can further programmatically consult for formatting metadata within citations -- e.g., italics have a special meaning.
I think you might end up spending quite some time getting to a working solution, and then face a continual process of tuning and extending the scraping algorithms/engine.
In this case I would recommend TET from PDFlib.
If you need to get a quick feel for what it can do, take a look at the TET Cookbook.
This is not an open source solution, but it's currently the best option in my opinion. It's not platform-dependent and has a rich set of language bindings and commercial backing.
I would be happy if someone pointed me to an equivalent or better open source alternative.
To extract text you would use the TET_xxx() functions and to query metadata you can use the pcos_xxx() functions.
You can also use the command-line tool to generate an XML file containing all the information you need.
tet --tetml word file.pdf
There are examples of how to process TETML with XSLT in the TET Cookbook.
What’s included in TETML?
TETML output is encoded in UTF-8 (on zSeries with USS or MVS: EBCDIC-UTF-8, see www.unicode.org/reports/tr16), and includes the following information:
general document information and metadata
text contents of each page (words or paragraphs)
glyph information (font name, size, coordinates)
structure information, e.g. tables
information about placed images on the page
resource information, i.e. fonts, colorspaces, and images
error messages if an exception occurred during PDF processing
CERMINE - Content ExtRactor and MINEr
Described in the paper: TKACZYK, Dominika, et al. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 2015, 18.4: 317-335.
Mainly written in Java and available as open source on GitHub.
Another Java library to try would be PDFBox. PDFs are really designed to be viewed and printed, so you definitely want a library to do some of the heavy lifting for you. Even so, you might have to do a little gluing of text pieces back together to get the data you want extracted. Good luck!
Just found pdftk... it's amazing; it comes in a binary distribution for Win/Lin/Mac as well as source.
In fact, I solved my other problem (look at my profile; I asked and then answered another PDF question... can't link due to the 1-link limitation).
It can do PDF metadata extraction; for example, this will return the line containing the title:
pdftk test.pdf dump_data | grep -A 1 "InfoKey: Title" | grep "InfoValue"
It can dump the title, author, modification date, and even bookmarks and page numbers (the test PDF had bookmarks)... obviously a bit of work will be needed to properly grep the output, but I think this should fit your needs.
If your PDFs don't have metadata (i.e., no "Abstract" metadata), you can dump the text using a different tool like pdf2text and use some grep tricks like the above. If your PDFs are not OCR'd, you have a much bigger problem, and ad-hoc querying of the PDFs will be painfully slow (best to OCR them first).
Regardless, I would recommend building an index of your documents instead of having each query scan the file metadata/text.
Take a look at iText. It is a Java library that will let you read PDFs. You will still face the problem of finding the right data, but the library will provide formatting and layout information that might be usable to infer purpose.
PyPDF might be of help. It provides an extensive API for reading and writing the content of a PDF file (un-encrypted), and it's written in an easy language, Python.
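A small sketch using the modern pypdf package (the current successor of that project, which is my assumption here; the older pyPdf API differs slightly, and the file name is a placeholder):

from pypdf import PdfReader

reader = PdfReader("paper.pdf")
meta = reader.metadata  # may be None, or contain empty/junk fields

print("Pages: ", len(reader.pages))
print("Title: ", meta.title if meta else None)
print("Author:", meta.author if meta else None)

# When the document metadata is missing or junk, the first page's text is
# often enough to guess the title and abstract.
print(reader.pages[0].extract_text()[:500])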
Have a look at this research paper - Accurate Information Extraction from Research Papers using Conditional Random Fields
You might want to use an open-source package like Stanford NER to get started on CRFs.
Or perhaps, you could try importing them (the research papers) to Mendeley. Apparently, it should extract the necessary information for you.
Hope this helps.
Here is what I do using Linux and cb2Bib.
Open up cb2Bib and make sure that the clipboard connection is ON and that your reference database is loaded.
Find your paper on Google Scholar.
Click 'import to bibtex' underneath the paper.
Select (highlight) everything on the next page (i.e., the BibTeX code).
It should now appear formatted in cb2Bib.
Optionally, now press network search (the globe icon) to add additional info.
Press save in cb2Bib to add the paper to your reference database.
Repeat this for all the papers. In the absence of a method that reliably extracts metadata from PDFs, I think this is the easiest solution I have found.
I recommend gscholar in combination with pdftotext.
Although PDF provides metadata, it is seldom populated with correct content. Often "None" or "Adobe-Photoshop" or other useless strings sit in the title field, for example. That is why none of the above tools can be relied on to derive correct information from PDFs, as the title might be anywhere in the document. Another example: many papers from conference proceedings also carry the title of the conference, or the names of the editors, which confuses automatic extraction tools. The results are then dead wrong when you are interested in the real authors of the paper.
So I suggest a semi-automatic approach involving Google Scholar.
First, render the PDF to text so you can extract the author and title.
Second, copy and paste some of this info and query Google Scholar. To automate this, I employ the cool Python script gscholar.py.
So in real life this is what I do:
me#box> pdftotext 10.1.1.90.711.pdf - | head
Computational Geometry 23 (2002) 183–194
www.elsevier.com/locate/comgeo
Voronoi diagrams on the sphere ✩
Hyeon-Suk Na a , Chung-Nim Lee a , Otfried Cheong b,∗
a Department of Mathematics, Pohang University of Science and Technology, South Korea
b Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
Received 28 June 2001; received in revised form 6 September 2001; accepted 12 February 2002
Communicated by J.-R. Sack
me#box> gscholar.py "Voronoi diagrams on the sphere Hyeon-Suk"
#article{na2002voronoi,
title={Voronoi diagrams on the sphere},
author={Na, Hyeon-Suk and Lee, Chung-Nim and Cheong, Otfried},
journal={Computational Geometry},
volume={23},
number={2},
pages={183--194},
year={2002},
publisher={Elsevier}
}
EDIT: Be careful, you might encounter captchas. Another great script is bibfetch.

online book reader - streaming/chunking parts of a book

I am trying to create an online book reader (all text, no graphics needed). The reader can be Flash or HTML/JavaScript. The trick is that I need to push out the book in chunks so I can limit non-paying readers to only the first chunk or so. I have thought about just pre-parsing a book file into several files (one per chunk) and serving each, but there is no good way to segment, as I cannot easily segment by chapter (I'm not always sure where a chapter begins). So a more fluid solution would be best.
Additionally, I need some basic security to make it difficult to copy the content and piece it back together as a whole book. I don't expect this part to be unbreakable, just good enough.
I've seen that I can make each book a PDF and then convert that into an SWF. But then how would I limit the frames served in Flash (assuming an SWF frame is a PDF page)?
Any ideas?
BTW, I've looked closely at Scribd's iPaper. Things like that look slick but don't really provide a good reading experience for text-only books. I do like the readability of some of Google's books. It's fluid and seems to read well for mostly-text books.
Consider this solution:
Convert the first chunk to a separate file that you serve to your free users, while the paid users get the larger one. If you serve up the SWF with all the content in it, then you have just given it away for free to any technical user.
The best solution is simply to limit what you return to only the portions you want them to see. I think Print2Flash does a good job for a reader (you can customize it a lot if you're so inclined).
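A sketch of that "serve only the free portion" idea (my own illustration, not from the answer; it assumes the book already exists as a PDF and uses pypdf, with an arbitrary page cut-off):

from pypdf import PdfReader, PdfWriter

FREE_PAGES = 20  # arbitrary cut-off for the free chunk

reader = PdfReader("book.pdf")
writer = PdfWriter()

# Copy only the first FREE_PAGES pages into the preview file.
for page in reader.pages[:FREE_PAGES]:
    writer.add_page(page)

with open("book-preview.pdf", "wb") as handle:
    writer.write(handle)

The preview file is what gets converted for the free readers; the full conversion is served only to paying accounts.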

Company insists on using a binary format for all our documentation [closed]

I work at a company that, for some reason, insists that all our development documentation should be in MS Word format. Since that is a binary format, it means we cannot:
Diff versions of a document against each other (so peer reviewing them is a pain - because of the domain we work in, peer reviews for all changes are essential)
Grep a folder-full of documents for keywords
What do you use to write documentation in and why?
Please also give me ammo to change this situation with...
I recently started using DocBook XML to author my documentation.
On the upside, it's a pure text format. You can break a large document into multiple files and use includes to bring them all together into a single book. A table of contents and an index are generated automatically. Intra-document links (within arbitrary text, pointing to chapters or sections) are very easy. And with the push of a button, I can create a single-HTML-file version, a chunked-HTML version (one file per chapter), and a PDF version.
After some tweaking and customization, I'm very happy with the output. The documents look great!
DocBook is used extensively by real publishers (most notably, O'Reilly), and it's been around for more than fifteen years, so it's reached a certain level of maturity.
On the other hand, all of the processing is done with XSLT, using an ad-hoc collection of tools. (My own DocBook pipeline includes Python, Java, Xerces, Xalan, Apache FOP, and PDF-SAM, plus the official XSLT stylesheet distribution and my own XSLT customizations.)
DocBook is not a turnkey solution. You won't be able to get going quickly, without reading the manual. And if you don't know anything about XSLT, you'll have to learn.
On the other hand, there are only a dozen or two XML tags that you really need to know to write the documents. (The real expertise comes into play during doc generation from the XML sources.) If one person on your team was willing to be responsible for writing the doc build script, then everyone else on the team could just learn the DTD and do a decent job contributing.
Anyhow... DocBook definitely has some faults. It's not the easiest system for tech authorship. But it's the best open source tool I know of.
The "Subversion Book" is written in DocBook. Here's a page with links to the different book versions (single-html, chunked-html, and PDF):
http://svnbook.red-bean.com/
And here's a link to the DocBook XML sources for the first chapter, so that you can get an idea for how it works:
http://sourceforge.net/p/svnbook/source/HEAD/tree/branches/1.7/en/book/ch01-fundamental-concepts.xml
For ammo, there's the trusty old Pragmatic Programmer, chapter 14: The Power of Plain Text.
As Pragmatic Programmers, our base material isn't wood or iron, it's knowledge. We gather requirements as knowledge, and then express that knowledge in our designs, implementations, tests, and documents. And we believe the best format for storing knowledge persistently is plain text. With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal.
We use a wiki (specifically the one provided by Trac) for the two reasons you mentioned. Plus, if we really need to we can get the text version of the markup and manipulate it in a text-only environment, too (e.g. as part of svn comments during commit).
A format that can be easily reduced to text-only (non-binary) is definitely a must. Having the ability to upconvert it to a pretty format like a PDF is, for us, not terribly important.
Word has change tracking for documents (although it only works up until you accept the changes) and you can also grep them (the text isn't encrypted). So I'm not sure either of your arguments will hold up under scrutiny. I'd love to give you the ammo to change this but I've become jaded and cynical with age.
We use MS Word for our docs (which is a huge improvement over the earlier choice, Lotus WordPro - ugh!).
We use a wiki - specifically Confluence by Atlassian.
It's a commercial product, and it's great. One of the reasons we picked it over free/open wiki engines is that it has a full-blown WYSIWYG editor and various other features that make it more easily accessible to users who are familiar with Word.
We've also come up with a neat trick where we store images, designs, wireframes, etc. in Subversion and then embed links in the wiki documents to those resource URLs via the Apache/SVN web interface module; notes on how we do this are here if you're interested.
Like Dylan's organisation, we also use the excellent Confluence wiki. I wrote an article about why this is the better approach, called Wiki is my word-processor, which should give you some reasons to change the situation.
Benefits of using a wiki for internal documentation include the following.
Word-processor users get sucked into changing the layout and typography, however good your templates are, which wastes time and reduces consistency.
A wiki provides full-text search, which you are unlikely to have for a body of MS Word documents written by everyone.
A wiki provides a document version history; I have never heard of a team successfully keeping all revisions in Word documents and always being able to compare old versions, or using a version control system for them (with the possible exception of SharePoint, but that's a whole different failure scenario).
A wiki makes hyperlinks between documents easy; it is too hard to reliably link between documents in a collection of Word documents, so new documents end up duplicating older content into new monolithic documents which means they take more time to read and write.
Separate wiki pages can be edited by different people at the same time, and Confluence can merge changes when multiple people edit the same page at the same time; collaboration is harder with a Word document that only one person can edit at a time.
A wiki like Confluence automatically generates navigation pages based on wiki structure and tags; you need a librarian and lots of discipline to make it possible to browse a large collection of Word documents.
A wiki page usually loads and displays more quickly than a Word document.
A wiki page has more automatic meta-data; you need templates and discipline to make sure that Word documents always have Title, Author and Version set in the document properties and visible in the document on-screen and in print.
If you want more ammunition than this, then there is lots of wiki-promotion on The Atlassian Blog.
You could ask for the documentation to be in OOXML (.docx, in the case of Word) format. Not as ideal as using ODT, in my opinion; however, it's still just a zip file with a bunch of XML files inside. :-)
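To illustrate that point, here is a quick sketch (my own illustration, with a placeholder file name) of pulling greppable text out of a .docx with nothing but the Python standard library, since the main body of the document lives in word/document.xml inside the zip:

import re
import zipfile

def docx_text(path):
    # Read the main document part and crudely strip the XML tags.
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    xml = xml.replace("</w:p>", "\n")      # paragraph ends become newlines
    return re.sub(r"<[^>]+>", "", xml)     # drop all remaining tags

print(docx_text("spec.docx"))

Run over a folder of .docx files, the output can be grepped, diffed, and checked into version control alongside the originals.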
A textual format facilitates merging your documentation with generated items such as Javadoc, API references, or data dictionaries. It also scales much better than Word, which is hard to use for large documents. Finally, a format that allows includes lets multiple authors work on a document concurrently.
LaTeX and FrameMaker (the two systems I have used for this) both have vastly superior indexing and cross-referencing capabilities, and both have either a native textual format or a textual version of their native format that can be included (MIF in the case of FrameMaker). They are also both much more stable than Word.
I've built tools that read data dictionaries and generate documentation that can be included in a larger document with stable indexing and two-way cross-referencing. The functional specification for this product was done with LaTeX in this way and got me another gig with the company. I have also developed a similar process with FrameMaker.
Is the entire development team against this requirement, or is it just a small group? If it's the entire team, just ignore the mandate and use a text-based format -- it wouldn't be the first time employees ignored a silly rule. This works especially well if you've not made a big fuss about it in the past. If you have, management might look especially hard at your docs.
MS Word supports change tracking and peer review of documents.
The new MS Office format is fully XML-based (to see this, rename an MS Word .docx file to .zip, then unpack it).
Maybe Office 2007 would fit both your company's requirements and your concerns?
You can at least compare Word documents; see the "Track Changes" command in the "Extra" menu, or use software like DeltaView (found via a Google search, first link at lifehacker.com). Searching in Word documents should be possible with Google Desktop Search or other similar programs that index all the files they are able to read.
Do they insist that you write it in Word or only that it's available in Word format? You could write in a text format and convert it to Word automatically.
Don't you store documentation files in some kind of version control system, ideally together with the source code? I would recommend doing this (it makes it easy to get the documentation for old software releases).
And if you do store the docs in a VCS, you will notice that plain text or XML-based files are much better for this, because you can get diffs; also, changes between text files are usually stored more efficiently than changes between binary files.
Not to defend MS products here, but MS Word can diff documents.
If you use Beyond Compare as the diff tool for your source-control system (as we do, with Perforce), it will show you the differences between revisions of your Word docs. Admittedly, it only shows the textual differences - formatting changes are not shown - but this is usually enough for you to see what changed.
This is just another reason to invest in Beyond Compare, as it is one of the most polished pieces of software I've ever used - and it's the best $30 (less if you buy several licenses) I've spent on software.
There are many tools for Word document comparison. I currently use a Python script that puts a command line on top of the built-in compare-and-merge functionality of Word.
http://nicolas.lehuen.com/index.php/post/2005/06/30/60-comparing-microsoft-word-documents-stored-in-a-subversion-repository
It should be easy to automate Word to extract all the text from a Word document into a text file. So you could write a script that creates text files from Word docs, and then grep, compare, version-control, and review those text files.
Of course this is not an ideal solution, since you lose your pretty formatting, but it should work.
I think there are programs that convert Word docs to plain text. Use one of them to convert the Word doc to plain text and then use diff, grep, etc.
Also have a look into recommended toolchain(s) for DocBook.