Reading a PDF file using Java [duplicate]


Possible Duplicates:
PDF to text tool or Java library?
How read PDF using java
Is there any way I can read PDF files using Java? The PDF file contains images and text, and its layout is kind of irregular. I need to extract the text alone. Are there any implementations?
Thanks in advance.

You should check out iText. It is an open source library for reading, creating and modifying PDF files. I have recently used it and it works very well.

Related

Objective c: Need SDK to unzip .7z file [duplicate]

I would like to unzip .7z files in Objective-C (for Mac development). I am using SSZipArchive, which is really nice, but it cannot unzip 7z archives. Could you recommend a good SDK for extracting 7z files? I also need to track progress during extraction (for example, the percentage done). Thanks!

You can check the source code developed by Mo Dejong at 7zip decompression SDK. It is based on LZMA SDK 9.21 beta. It includes only the decode functions, and the adler checksum logic is disabled at compile time to improve performance.
Good luck!

.Net Tool or Library to compare one PDF to another PDF [closed]

I am working on a project that currently takes a .tiff and compares a defined template document to the document in question. We are moving away from the .tiff format for a variety of reasons, mainly because new files will arrive as PDFs.
I see two potential solutions to the issue. The first is to convert the PDF to a TIFF and use the existing code.
The second is to use a PDF library that will compare the template PDF to the PDF that is received.
Because the received PDF will come from an outside source, we won't know for sure whether it is text based or image based, so the library or tool will have to be able to compare both.
Any suggestions on tools/libraries you have found helpful would be great!
Thank you in advance!
dj

How about i-net PDFC - it does a full content comparison - text, images, lines, header/footer-detection and so on. You can use it either on command line or with a GUI (2.0, currently in public beta-phase) or via API (I think we have an internal version being a .NET library).
Disclaimer: Yep, I work for the company who made this - so feedback highly appreciated.

What we ended up doing was using the Aspose.Pdf library.
Along the way I learned there are two types of PDFs:
Image based and
Text based
I did not have any issues comparing the text-based PDFs. However, when an image-based PDF was received and we converted it to a .tiff so that we could use Microsoft's MODI to compare it against our specified template, the .tiff would be a blank image rather than the actual content of the PDF. The Aspose.Pdf library did cost some money, but in the end it did exactly what we needed and allowed us to meet our client's needs.

I think your method of comparing TIFFs is the way to go, perhaps using ImageMagick or another library.
Converting PDFs to images can also be done via ImageMagick with the help of Ghostscript.
http://www.imagemagick.org/script/compare.php
I have a C# wrapper for Ghostscript that may help; send me a mail (address on my profile) and I can send it to you.
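The conversion step can be sketched as a plain Ghostscript call. This is only an illustration, written in Java here, though the same arguments apply from any language. The class name, the file names, and the executable name `gs` (on Windows it is typically `gswin64c`) are assumptions. The switches are standard Ghostscript options, and `tiff24nc` is one of its TIFF output devices.

```java
import java.util.List;

/** Hypothetical helper that only assembles a Ghostscript command line. */
class GhostscriptCommand {

    /**
     * Builds the argument list for rendering every page of a PDF to a
     * 24-bit colour TIFF. outputPattern may contain %03d for the page number.
     */
    static List<String> pdfToTiff(String gsExecutable, String inputPdf,
                                  String outputPattern, int dpi) {
        return List.of(
                gsExecutable,
                "-dNOPAUSE",            // do not wait for a key press between pages
                "-dBATCH",              // exit when the input has been processed
                "-dSAFER",              // restrict file access from within the PDF
                "-sDEVICE=tiff24nc",    // 24-bit RGB TIFF output device
                "-r" + dpi,             // rendering resolution in dots per inch
                "-sOutputFile=" + outputPattern,
                inputPdf);
    }

    public static void main(String[] args) {
        List<String> cmd = pdfToTiff("gs", "received.pdf", "page-%03d.tif", 300);
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires Ghostscript to be installed):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The resulting TIFFs can then be fed to whatever comparison code already exists.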

As far as I can see from your question, you want visual comparison of 2 PDFs, not structural comparison. (Because I can create you a thousand different PDF pages, which will have different internal structures and PDF source code, but will render identically on screen or on paper.)
In this case any comparison software will have to transform the 2 PDFs into raster images and compare those.
But since you have your own code already to do that for TIFFs, you can as well re-use it for PDFs (like you are considering already) which you convert to TIFFs.
Unless you find another, external tool that is better, faster, more precise, more funky, less resource-hungry... than your own solution! -- But that one will not be able to avoid converting the PDF pages to some sort of raster image before it can start the real visual comparison. (This may happen internally and unnoticeable for the user, but nevertheless it will have to take place...)
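Once both pages exist as raster images, the comparison itself can be quite small. A minimal sketch in plain Java, assuming both pages have already been rendered to images of the same size; `RasterCompare` and `countDifferingPixels` are hypothetical names, and a real comparison would probably tolerate small anti-aliasing differences instead of demanding exact equality:

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: compares two already-rasterized pages pixel by pixel. */
class RasterCompare {

    /**
     * Returns the number of pixels whose ARGB values differ,
     * or -1 if the two images do not have the same dimensions.
     */
    static long countDifferingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return -1;
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Two blank 4x4 "pages"; one pixel is changed to simulate a difference.
        BufferedImage template = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage received = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        received.setRGB(1, 2, 0xFFFFFF);
        System.out.println("Differing pixels: " + countDifferingPixels(template, received));
    }
}
```

In practice the images would come from `javax.imageio.ImageIO.read` on the converted files rather than being created in memory.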

Docotic.Pdf library can compare PDF documents for you.
Please have a look at the "Check that two PDF documents are equal" sample.
We use this feature for regression testing of the library itself (yes, I am part of the library's dev team).

Any other way to read/write a PDF file using java application other than itext, PDFbox?

I tried iText and PDFBox.
They are not simple; we need to understand a lot of code to use them.
Can anybody suggest a simple way of reading and writing PDFs from a Java application?
The application must be standalone, with no need for a web or application server.

There are loads of simple examples for manipulating PDFs with iText in the iText in Action book.
PDF is a complex file format. What are you trying to do exactly?

How does Google make those awesome PDF reports in Analytics and when you print a Google Doc etc? [closed]

When you print from Google Docs (using the "print" link, not File/Print) you end up printing a nicely formatted PDF file instead of relying on the print engine of the browser. The same is true for some of the reports in Google Analytics . . . the printed reports as PDFs are beautiful. How do they do that? I can't imagine they use something like Adobe Acrobat to facilitate it, but maybe they do. I've seen some expensive HTML to PDF converters online from time to time but have never tried one. Any thoughts?

If you are specifically asking how Google does it: according to the PDF properties page, they use Prince 6.0 (see princexml.com).
There are lots of other PDF generators out there. I've had great success with PDFlib for tricky jobs.

iTextSharp and iText are open source and free PDF generation libraries for .NET and Java respectively.
I've used them to generate report PDFs before and was quite happy with the results.
http://itextsharp.sourceforge.net/
http://www.lowagie.com/iText/

Great free alternative to PrinceXML: wkhtmltopdf. There are plenty of wrapper libraries for various languages, but I've only used the Ruby ones. However, the product itself is on par with PrinceXML IMHO.

I have had success with pd4ml. It has a tag library, so you can turn any existing HTML into PDF by
<pd4ml:transform>
<!-- Your HTML is here -->
<c:import url="/page.html" />
</pd4ml:transform>

Well, I doubt it's as easy as generating HTML . . . I mean, first of all, PDF is not a human-readable format and it's not plain text (like SVG). In fact, I would compare an SVG file to a PDF file in that with both you have precise control over the layout on a printed page. But SVG is different in that it's XML (and also in that it's not supported completely in the browser . . . still looking into SVG too). Come to think of it, SVG should probably be my next question.
I know Google doesn't use .NET and I doubt they use Java, so there must be some other libraries they use for generating the PDF files. More importantly, how do they create the PDFs without having to rewrite everything as a PDF instead of as HTML? I mean, there has to be some shared code between generating the HTML view and generating the PDF view. Come to think of it, maybe the PDF view and the HTML view are completely separate and they just have two views, which would be why the MVC development style seems to be the way to go.

Rendering a PDF is a hard, complex problem. Generating one, however, is not. Simply make up some entities and generate. It's about the same problem domain as generating HTML for a web page vs. displaying (rendering) it.
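To illustrate how little is needed on the generating side, here is a rough sketch that writes a one-page PDF using only the Java standard library. It is not how Google or any particular library does it. The class name `MinimalPdf`, the page size, the font, and the text position are assumptions chosen for the example, and the text is assumed to be plain ASCII so that character counts equal byte offsets.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical minimal generator: one page, one line of ASCII text. */
class MinimalPdf {

    static byte[] build(String text) {
        // Escape the characters that are special inside a PDF string literal.
        String escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
        String content = "BT /F1 24 Tf 72 720 Td (" + escaped + ") Tj ET";
        String[] objects = {
            "<< /Type /Catalog /Pages 2 0 R >>",
            "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
            "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
                + "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
            "<< /Length " + content.length() + " >>\nstream\n" + content + "\nendstream",
            "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"
        };

        StringBuilder out = new StringBuilder("%PDF-1.4\n");
        int[] offsets = new int[objects.length];
        for (int i = 0; i < objects.length; i++) {
            offsets[i] = out.length();           // byte offset (ASCII only)
            out.append(i + 1).append(" 0 obj\n").append(objects[i]).append("\nendobj\n");
        }

        // Cross-reference table: each entry must be exactly 20 bytes long.
        int xrefStart = out.length();
        out.append("xref\n0 ").append(objects.length + 1).append("\n");
        out.append("0000000000 65535 f \n");
        for (int offset : offsets) {
            out.append(String.format(Locale.ROOT, "%010d 00000 n \n", offset));
        }
        out.append("trailer\n<< /Size ").append(objects.length + 1).append(" /Root 1 0 R >>\n");
        out.append("startxref\n").append(xrefStart).append("\n%%EOF\n");
        return out.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        Files.write(Path.of("hello.pdf"), build("Hello, PDF"));
    }
}
```

Everything beyond this (fonts, images, layout, page breaks) is where the libraries mentioned in the other answers earn their keep.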

How to convert Word and Excel documents to PDF programmatically?

We are developing a little application that, given a directory of PDF files, creates a single PDF file containing all the PDF files in the directory. This is a simple task using iTextSharp. The problem appears if the directory also contains other files, such as Word or Excel documents.
My question is: is there a way to convert Word and Excel documents into PDF programmatically? And even better, is this possible without having the Office suite installed on the computer running the application?
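Whichever converter is chosen, the directory contents first have to be split into files that are already PDFs and files that need conversion. A small sketch, written in Java and assuming the decision is made purely by file extension; the class name, the enum, and the extension list are illustrative and not part of iTextSharp or any other library:

```java
import java.util.Locale;

/** Hypothetical pre-merge step: decide what to do with each file by its extension. */
class MergeInputs {

    enum Kind { PDF, NEEDS_CONVERSION, UNSUPPORTED }

    static Kind classify(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return Kind.PDF;                 // can be merged directly
        }
        if (lower.endsWith(".doc") || lower.endsWith(".docx")
                || lower.endsWith(".xls") || lower.endsWith(".xlsx")) {
            return Kind.NEEDS_CONVERSION;    // convert to PDF first, then merge
        }
        return Kind.UNSUPPORTED;             // skip or report
    }

    public static void main(String[] args) {
        for (String name : new String[] {"a.pdf", "b.DOCX", "c.xls", "d.txt"}) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```

The conversion itself is the part that needs one of the tools discussed in the answers.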

Office 2007 allows for this. I have found PDFCreator to be good, the VBA is included in sample files, and have heard that CutePDF is also good. PDFCreator and CutePDF are free.
To work without Office, you would need viewers, as far as I know:
http://www.microsoft.com/downloads/details.aspx?FamilyID=c8378bf4-996c-4569-b547-75edbd03aaf0&displaylang=EN
http://www.microsoft.com/downloads/details.aspx?familyid=95E24C87-8732-48D5-8689-AB826E7B8FDF&displaylang=en

I needed to do this myself, but managed to get it done with .Net and without 3rd party tools:
MSDN: Saving Word 2007 Documents to PDF and XPS Formats
Pretty simple, about 50 lines of code. However I think you will need Word 2007 installed on the machine as well as the ability to Save As PDF

To convert Word documents to PDF, take a look at jWordConvert, a Java library that can do exactly that. It will not work with Excel files though, only with Word files. It is a Java library, not C#, but you could switch to iText (which is Java) instead of iTextSharp.

You can also use a component like activePDF's DocConverter to convert a lot of formats to PDF.

Use the PDF maker that comes with Adobe 7-9.
I just used this code: Convert Doc to PDF

I'm surprised Aspose wasn't mentioned here, it's easy, simple, and reliable. Downside is that it is not free.
I've used iTextSharp in the past. It's really good and easy to install (one DLL, I believe). The merge takes a bit of tinkering, so it's not as easy to use as Aspose, but hey, it's free, so that is the best part.

TallPDF.NET (comes with a hefty price tag) allows you to serve dynamic PDF from any .NET application including ASP.NET pages and web services.
PDFEdit (free and open source) is an editor for manipulating PDF documents. It has a GUI version and a command-line interface. Scripting is used to a great extent in the editor and almost anything can be scripted. It is possible to create your own scripts or plugins.

The most common way to convert files to PDF is to print them to a PDF printer driver. There are a number of such drivers; one that I know of that will do the job is Black Ice.
Another way is to use Adobe Acrobat's SDK. From memory, it is very expensive.
It's been a while since I have actually done any work converting PDFs, and the landscape may have changed.
