How to convert a DJVU file into PDF

There are many books available only in DJVU format, where the text is selectable and the file size is quite small (300 pages in less than 5 MB).
Since DJVU viewers are poor at annotating files, I want to convert these books into PDF.
What are the options for converting a DJVU book into a PDF that keeps the text selectable and does not result in a huge (10x larger) file?

Since this question has not been answered so far:
I recommend the following online converter, which to my knowledge is the only one that fulfills both criteria: djvu to pdf. However, I do not know of any stand-alone converter that achieves this.

Related

Convert text-based pdf to image-based pdf

Sometimes your nicely formatted (TeX'ed) PDF is converted to Microsoft Word because of the default process at some company. This can badly mangle the layout (and fonts?) of your PDF, potentially rendering it unusable. A possible solution to this problem is to convert the PDF from text-based to image-based to thwart the bad conversion to Word.
This question is about ways to convert a text-based PDF to an image-based PDF.
Your question is very broad, but you do point out a basic incompatibility between PDF and any structured document format. If you are looking for a programmatic answer to your question, the general approach is to create an image drawing context instead of a PDF context and render all the elements of your pages to that context. The result is an image which you then draw into a PDF context.
Now I do have an answer that seems to work, but I wonder about alternatives. My solution also has a shortcoming in that internal or external links are destroyed. In theory it should be possible to keep links intact. Finally, my solution works well for a single page document, but may not work (well) for other documents.
pdftoppm -r 300 text.pdf | convert -page A4 - text.pdf.ppm.pdf
This converts to a pixel-based format and increases file size significantly (10x for my test case).
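For reference, here is a minimal sketch of the same render-to-image-then-wrap-in-PDF approach in Python with PyMuPDF (pip install pymupdf); the file names are placeholders, and links are lost here too:

import fitz  # PyMuPDF

src = fitz.open("text.pdf")
out = fitz.open()  # new, empty PDF

for page in src:
    # Rasterize at 300 dpi (72 dpi is the PDF default, hence the scale factor)
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
    new_page = out.new_page(width=page.rect.width, height=page.rect.height)
    new_page.insert_image(new_page.rect, pixmap=pix)  # draw the bitmap full-page

out.save("text-image.pdf")

This keeps the page geometry but, like the pdftoppm pipeline above, produces a purely pixel-based file, so expect a similar size increase.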

Extract pages from DOC to new DOC

We are developing a printing server that allows users to upload a DOC file and print it out via HP ePrint. It needs to support page extraction.
I tried using a macro (with the help of Adobe Acrobat Pro and MS Word) to extract pages into a PDF, but it turns out that the resulting PDF may be larger than expected.
Is there any way to extract pages (without losing formatting, e.g. tables in the DOC) from DOC to DOC, so that the result stays approximately the expected size?
This is a difficult requirement. It sounds like you have run into two problems (large PDFs and format loss) at the outset. You should probably say more about what you mean by "extraction" and why PDF is part of your solution, because that's quite different from "upload and print" and "DOC to DOC". That way readers will have more suggestions for you.
I would suggest you try to approach the problem from a different angle if possible, because I suspect you are unlikely to achieve a good, efficient, stable result. One possible approach is to turn the DOC into a PDF and then use iText or some other PDF library to manipulate the PDF before printing. It really depends on what you are trying to achieve: the specifics of your extract/merge/convert.
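To illustrate that route, here is a hedged sketch of the extract step using pypdf (pip install pypdf) in place of the Java iText mentioned above; the file names and page range are placeholders:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("converted-from-doc.pdf")  # output of your DOC-to-PDF step
writer = PdfWriter()

for page in reader.pages[2:5]:  # extract pages 3-5 (zero-based slice)
    writer.add_page(page)

with open("extract.pdf", "wb") as f:
    writer.write(f)

Note that this only addresses the extract part; keeping the PDF small depends mostly on how the DOC-to-PDF conversion embeds fonts and images.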

Understanding the PDF DOM

I am writing an application that has to read and interpret data stored in some PDF files. The reading part is done but I am only able to get a dump of all the words on a page and not the format of the words. What I mean is that if I have to extract a table, I am getting the numbers in the table but not the markup which defines the table.
Further, there is some formatting used which displays a few of these numbers within parentheses (meaning that those numbers are negative) but the parentheses themselves are not part of the text. Hence, I am not able to distinguish between positive and negative numbers present in the PDF table!
How do you get the PDF markup along with the text? Is a PDF similar in structure to XML, with tags used to mark up tables etc.? If not, is there a resource which describes the salient features of the PDF DOM?
I am using VBA and the Acrobat library (AcroExch etc.).
There is no such thing as "PDF markup" in the sense of HTML etc. A table in PDF cannot be distinguished from line art, other than by using OCR, which can be error-prone if the layout is complex. It is simply drawn using geometrical shapes, like in a vector-based graphics program.
"Is a PDF similar in structure to an XML with tags used to markup tables etc.?"
No, not at all.
And there is no such thing as a 'DOM' either. Search for a file named PDF32000_2008.pdf: that file is the current PDF specification for v1.7 (the ISO spec), and you should be able to locate it on the Adobe website.
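To make the geometry point concrete, here is a hedged sketch in Python with PyMuPDF (pip install pymupdf); the asker is on VBA/Acrobat, but the coordinate-based idea is the same, and the file name and the 3-point proximity threshold are assumptions:

import fitz  # PyMuPDF

page = fitz.open("report.pdf")[0]

# Every word with its bounding box: group words by y to rebuild table rows,
# then sort each row by x to rebuild columns.
words = page.get_text("words")  # tuples of (x0, y0, x1, y1, word, ...)

# Parentheses drawn as line art show up here, not in the text dump.
drawings = page.get_drawings()  # vector paths, each with a bounding "rect"

for x0, y0, x1, y1, word, *_ in words:
    # Heuristic: treat a number as negative if some drawn path ends just left of it.
    flanked = any(abs(d["rect"].x1 - x0) < 3 and d["rect"].y0 < y1 and d["rect"].y1 > y0
                  for d in drawings)
    print(("(-) " if flanked else "    ") + word, round(x0), round(y0))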
As omz stated, text inside a PDF does not really have a structure. You can take a look at the specification here. However, for some very specific files there is something called PDF Tags, or PDF Marked Content, which is fairly new and aims to give PDF documents some kind of structure. If you target this kind of file specifically, you might be able to achieve something. Take a look at chapter 10 (Document Interchange) of Adobe's specification for further details.
Maybe what you want to achieve can be done with less effort and faster by using TET, the Text Extraction Toolkit made by the fine folks at pdflib.com ( http://www.pdflib.com/products/tet/ )?
AFAIR, TET has some (limited) support for table detection as well.

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few different things, but I did not get very far with any of them:
Convert PDF to text. This does not work for me, as I lose the images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to successfully parse the HTML.
Convert PDF to XML. Same as above.
Does anyone have suggestions on how to tackle this problem?
There is essentially no easy cut-and-paste solution, because PDF isn't really very interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).
On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".
To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider proper reading order).
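As a concrete starting point, here is a hedged sketch of the font-size heuristic in Python with PyMuPDF (pip install pymupdf); the file name and the 1.3 threshold are assumptions to tune per document:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for page in doc:
    d = page.get_text("dict")
    sizes = [s["size"] for b in d["blocks"] if b["type"] == 0
             for l in b["lines"] for s in l["spans"]]
    if not sizes:
        continue
    body = sorted(sizes)[len(sizes) // 2]  # median size ~ body text
    for b in d["blocks"]:
        if b["type"] != 0:
            continue  # skip image blocks
        for l in b["lines"]:
            for s in l["spans"]:
                if s["size"] > body * 1.3:  # noticeably larger than body text
                    print("heading?", s["text"])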
Parsing a PDF for headers and their sub-content is really very difficult (that doesn't mean it's impossible), as PDFs come in various formats. But I recently encountered a tool named GROBID which can help in this scenario. I know it's not perfect, but if we provide proper training it can accomplish our goals.
GROBID is available as open source on GitHub:
https://github.com/kermitt2/grobid
You may use the following approach with iTextSharp or other open-source libraries (a sketch follows the list):
Read the PDF file with iTextSharp or a similar open-source tool and collect all text objects into an array (or convert the PDF to HTML using a tool like pdftohtml and then parse the HTML)
Sort all text objects by coordinates so you have them all together
Then iterate through the objects and check the distance between them to see whether two or more objects can be merged into one paragraph
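A hedged sketch of that sort-and-merge idea, using PyMuPDF (pip install pymupdf) rather than iTextSharp; the 10-point gap threshold and the file name are assumptions:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
blocks = [b for b in doc[0].get_text("blocks") if b[6] == 0]  # text blocks only
blocks.sort(key=lambda b: (b[1], b[0]))  # top-to-bottom, then left-to-right

paragraphs, current, prev_bottom = [], "", None
for x0, y0, x1, y1, text, *_ in blocks:
    # A large vertical gap to the previous block starts a new paragraph.
    if prev_bottom is not None and y0 - prev_bottom > 10:
        paragraphs.append(current.strip())
        current = ""
    current += " " + text.replace("\n", " ")
    prev_bottom = y1
paragraphs.append(current.strip())

for p in paragraphs:
    print(p)
    print("---")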
Or you may use a commercial tool like ByteScout PDF Extractor SDK that is capable of doing exactly this:
extract text and images along with analyzing the layout of the text
produce XML or CSV where text objects are merged or split into paragraphs inside a virtual layout grid
access objects via a special API that makes it possible to address each object via its "virtual" row and column index, regardless of how it is stored inside the original PDF
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py or tabula-java.
I made a full tutorial on how to use tabula-py in this article. You can also use Tabula in a web browser, as long as you have Java installed.
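A minimal tabula-py sketch (pip install tabula-py; it drives tabula-java, so Java is required); the file name is a placeholder:

import tabula

# Returns a list of pandas DataFrames, one per table Tabula detects.
tables = tabula.read_pdf("input.pdf", pages="all")
for df in tables:
    print(df.head())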
Unless it is Marked Content, PDF does not have a structure. You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the answers above, PDFs aren't very easy to parse. However, if you have certain additional information regarding the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
PDFBox is a PDF parsing tool that you can use for extracting text and images on top of which you can define your custom rules for parsing.
However, for parsing PDFs you need to have some prior knowledge of the general format of the PDF file. You can check out the following blog post, Document parsing, for more information regarding document parsing.
Disclaimer: I was involved in writing the blog post.
iText API:
PdfReader pr = new PdfReader("C:\\test.pdf"); // note: the backslash must be escaped in a Java string literal
References:
PdfReader

Add PDF to AFP Output

We're trying to take a PDF file created in a web application and dynamically insert it into an AFP datastream on an IBM iSeries. Does anyone know if this is possible?
The short answer is: not easily.
There would be two ways to do that. The easiest would be to convert the PDF to a series of TIFF images and rewrite the AFP stream to include them as IOCA images. If they are two-color, fax G4 compressed, they can be converted to IOCA group 10 without too much trouble.
The second way would be to convert the PDF to AFP (the two formats have a lot of similarities) and rewrite the AFP stream to include it. If the PDF doesn't have embedded fonts, you shouldn't have too much trouble mapping the fonts to the AFP ones.
It may not look quite right at first: AFP allows you to position elements much more accurately than PDF does.
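As a hedged first step for the TIFF route, this Python sketch rasterizes each page and saves it as a two-color, G4-compressed TIFF with PyMuPDF and Pillow (pip install pymupdf pillow); wrapping the TIFFs as IOCA objects into the AFP stream is not shown, and the file names are placeholders:

import fitz  # PyMuPDF
from PIL import Image

doc = fitz.open("input.pdf")
for i, page in enumerate(doc):
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))  # 300 dpi
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    # Group 4 compression requires a 1-bit (two-color) image.
    img.convert("1").save(f"page-{i:03d}.tif", compression="group4")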
I agree with the previous answer that it's not an "easy" task...
But if you are familiar with the AFP structure (or at least willing to learn the spec), you can technically add single-page PDF objects into AFP using AFP's Object Containers.
But the downside is that you'll need a pretty modern AFP print environment to actually handle AFP that contains PDF object containers. Most printers I've dealt with won't handle it.
So converting the PDF to AFP and merging it into the parent document would be best. Applications like Xenos.com's d2e Vision will do a nice job of this.
You could try COMPART CPMILL, which can convert your PDF to AFP so you can then easily insert it into your AFP stream.
We use this tool to insert PNG images into AFP.