Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
We have a bunch of documents in our organization that were inadvertently saved as Adobe PDF packages (also known as PDF 1.7 "collections").
We would like to convert these to normal PDFs (most of these "packages" contain one bog-standard PDF file), but given the number of files, doing this manually is not feasible.
Does any Adobe expert know whether:
There is an open-source or free library that handles the PDF package format and that I can write a script around?
Adobe Acrobat Pro 9 has a relevant scriptable interface that would allow me to extract the relevant file from each package?
Alternatively, I'm looking at a macro-based approach, but I'd rather not go this route until investigating other options.
Thanks!
After a bunch of digging around, I found pdftk, which is distributed as source and binary on many platforms.
It does almost all of what we need to do, and we can now iterate through our documents and recursively call pdftk on each (some are multi-level attachment chains).
Note pdftk will only burst pages of the visible document into individual documents. The hidden documents remain hidden.
The option you need to use is unpack_files.
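For anyone scripting this, here is a minimal Python sketch of the recursive approach described above. It assumes pdftk is on the PATH and is invoked as `pdftk <input> unpack_files output <directory>`. The function names, the `_attachments` directory suffix, and the depth limit are all made up for illustration.

```python
import subprocess
from pathlib import Path


def unpack_command(pdf_path, out_dir):
    # pdftk <input> unpack_files output <directory>
    return ["pdftk", str(pdf_path), "unpack_files", "output", str(out_dir)]


def unpack_recursive(pdf_path, out_dir, run=subprocess.run, max_depth=5):
    """Unpack the attachments of pdf_path into out_dir, then descend into
    every extracted PDF (for multi-level attachment chains).

    `run` can be swapped out for testing; `max_depth` is a hypothetical
    safety limit so a malformed chain cannot recurse forever.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    run(unpack_command(pdf_path, out_dir), check=True)

    extracted = sorted(p for p in out_dir.iterdir() if p.is_file())
    found = list(extracted)
    if max_depth <= 1:
        return found

    for item in extracted:
        if item.suffix.lower() == ".pdf":
            # Hypothetical naming scheme for the nested output directory.
            nested_dir = out_dir / (item.stem + "_attachments")
            found += unpack_recursive(item, nested_dir, run, max_depth - 1)
    return found
```

Calling `unpack_recursive("package.pdf", "extracted")` would leave the attached files under `extracted/`, with any nested attachments in `*_attachments` subdirectories. A PDF with no attachments just produces an empty directory.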
Yet another unwanted obfuscation format that hinders interoperability, and is therefore classified as malware.
Using Adobe Acrobat Professional, combine all the files into one PDF and then split it by bookmark level.
I understand this thread is a few years old, but if anyone is looking for a free utility to extract files from PDF packages (especially from large collections), check out ByteScout PDF Multitool: it was tested against 500+ MB package files to extract hundreds of multilevel chained attachments.
Disclaimer: I'm affiliated with ByteScout
Closed 2 years ago.
I'm looking for a tool to display code examples in a PDF file. I mean that I would like to colorize and indent code in my PDF (it's for lessons).
I'm not able to find anything on the web or on Stack Overflow. There are plenty of tutorials on using code to make a PDF, but none on displaying code in a PDF. When I search for 'display', I only get results on how to display a PDF in web pages or applications.
Sorry to disappoint you, but:
There is no such thing as you are looking for!
If you want code samples on a PDF page to be syntax highlighted, you must look for a tool that does this in the source document from which the PDF file is generated.
There is no tool in the world, neither Free and Open Source Software, nor commercial payware, that lets you edit a PDF and convert the source samples on its pages into properly syntax highlighted parts. (The only thing you can possibly do on this level is adding specific comments -- here you have to manually highlight specific words or sentences with a background color of your choice.)
If you are looking for a toolchain that makes it easy to generate PDFs from scratch containing syntax-highlighted code samples, look at:
Markdown: a very lean text markup language to write the document in (use any text editor you like)
Pandoc: a powerful Markdown-to-Anything converter. It's a command line tool available for all major OS platforms. Its output may be PDF, HTML, EPUB, LaTeX (all of the previous with syntax highlighting), as well as ODT, DOCX, DocBook (no syntax highlighting supported so far for the last few) and a few more...
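As a rough illustration of that toolchain, here is a small Python sketch that writes a Markdown lesson containing a fenced code block and builds the Pandoc command to convert it. It assumes `pandoc` is on the PATH and that a LaTeX engine is installed, which Pandoc needs for its default PDF output. The function names and file names are made up for this example.

```python
import subprocess
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence, built this way to keep this listing readable


def lesson_markdown(title, language, code):
    """Return Markdown with one heading and one language-tagged code block."""
    return f"# {title}\n\n{FENCE}{language}\n{code.rstrip()}\n{FENCE}\n"


def pandoc_command(source, target):
    # Pandoc infers the output format from the target file extension.
    return ["pandoc", str(source), "-o", str(target)]


def build_pdf(title, language, code, source="lesson.md", target="lesson.pdf"):
    """Write the Markdown file and run Pandoc on it (hypothetical helper)."""
    Path(source).write_text(lesson_markdown(title, language, code), encoding="utf-8")
    subprocess.run(pandoc_command(source, target), check=True)
```

The language tag after the opening fence is what tells Pandoc which highlighter to apply. Indentation inside the fence is preserved as written.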
Closed 7 years ago.
I'd like to embed a PDF file viewer in a window of my planned-to-be open-source application. I don't want to release my application under the GPL though, and most open-source PDF libraries are GPL-licensed (poppler, ghostscript, muPDF).
Is there a PDF viewer library that would be on a non-viral open-source license?
Thanks,
It seems that there is a new BSD-licensed contender: PDFium.
IANAL. Blah blah blah.
Using GhostScript by shelling out to a command line will not require you to change your licensing in any way. Batch files used to call GhostScript aren't automagically GPL'ed.
With GPL, I'd always understood that it boiled down to "Separate process? Separate license!".
So you just have GS whip up a relatively high-DPI version of the PDF page in question, and let the user pan and zoom around in that. Because GS is in a separate process, you could fire off additional page requests in the background so the user won't perceive a delay when paging back and forth. GS takes a page range as one of its conversion parameters.
What you couldn't do is generate an image of a small part of an individual PDF page at high DPI/zoom. IIRC, you have to render the whole page.
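A minimal Python sketch of the shell-out approach, assuming the Ghostscript executable is called `gs` and is on the PATH (on Windows it is typically `gswin64c`). The helper names, the 300 DPI default, and the two-page prefetch window are illustrative choices, not anything Ghostscript prescribes.

```python
import subprocess


def page_window(current_page, page_count, window=2):
    """Clamp the range of pages worth rendering around the current page."""
    first = max(1, current_page - window)
    last = min(page_count, current_page + window)
    return first, last


def render_command(pdf_path, out_pattern, first_page, last_page, dpi=300):
    # Render whole pages in the given range to 24-bit PNG files.
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def prefetch(pdf_path, out_pattern, current_page, page_count):
    """Start rendering nearby pages in a separate process and return at once."""
    first, last = page_window(current_page, page_count)
    return subprocess.Popen(render_command(pdf_path, out_pattern, first, last))
```

`prefetch("doc.pdf", "page-%03d.png", 5, 40)` would start a background render of pages 3 to 7. The caller can poll the returned process object and load the images once it exits.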
If your application is open source and free, then you should consider hosting the Adobe Reader ActiveX control (which requires Adobe Reader to be installed). The behavior would be the same as Adobe Reader embedded in the Internet Explorer or Firefox browsers.
A lot of users already have Adobe Reader or Foxit Reader installed on their computers.
Closed 7 years ago.
Is there a free way to read PDF files through VBA to extract basic text content? I need to automate a weekly data acquisition process at my company where data is contained in PDF files (which are updated weekly by the data provider). Also, is there a reference I can look into to understand the file structure (DOM?) of a PDF?
Adobe's PDF reference is online here: http://www.adobe.com/devnet/pdf/pdf_reference.html
I'm not sure about the best way to read PDFs from VBA directly, but if you can call an external Java or C# program, then I would recommend using iText for basic text extraction.
EDIT: I should maybe mention that Adobe's PDF reference is an 800-page beast. I found that it's good for looking up answers to particular questions (e.g., storing widths of embedded TrueType fonts), but it may not be a good place to start. For that, reading through the iText book helped me to get started on the format.
The iText book contains lots of worked examples for general PDF tasks and lots of background info to help you understand PDF files. It more than pays for itself very quickly!
Closed 11 months ago.
I am trying to join together several audio files into one mp4/m4a file containing chapter metadata.
I am currently using QTKit to do this, but unfortunately when QTKit exports to m4a format the metadata is all stripped out (Apple has confirmed this as a bug; see sample code). I think this rules QTKit out for this job, but I would be happy to be proven wrong, as it would be a really neat API for this if it worked.
So, I am looking for a way to concatenate audio files (input format does not really matter as I can do conversion) into an m4a file with chapters metadata.
As an alternative to code, I am open to the idea of using an existing command line tool to accomplish this as long as it is redistributable as part of another application.
Any ideas?
Audiobook Maker does something like this, and I believe it uses ffmpeg under the hood. It's open source, so maybe it's worth a look?
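If you go the ffmpeg route directly, chapters can be supplied through ffmpeg's own FFMETADATA text format. The sketch below is not taken from Audiobook Maker: it simply builds such a metadata file from a list of chapter titles and durations, plus a command that copies the audio and attaches the chapters. It assumes `ffmpeg` is on the PATH and that the audio has already been concatenated into one file. The function names are made up.

```python
def escape(value):
    """Backslash-escape the characters that are special in FFMETADATA values."""
    out = []
    for ch in value:
        if ch in "=;#\\\n":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def build_ffmetadata(chapters):
    """chapters: list of (title, duration_ms) pairs in playback order."""
    lines = [";FFMETADATA1"]
    start = 0
    for title, duration_ms in chapters:
        end = start + duration_ms
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START and END are in milliseconds
            f"START={start}",
            f"END={end}",
            f"title={escape(title)}",
        ]
        start = end  # the next chapter begins where this one ends
    return "\n".join(lines) + "\n"


def mux_command(audio_path, metadata_path, output_path):
    # Input 0 is the audio, input 1 is the metadata file; streams are copied.
    return [
        "ffmpeg", "-i", str(audio_path), "-i", str(metadata_path),
        "-map_metadata", "1", "-map_chapters", "1",
        "-c", "copy", str(output_path),
    ]
```

Write the result of `build_ffmetadata(...)` to a text file, then run the list from `mux_command(...)` with `subprocess.run`. Whether a given player shows the resulting chapters is something to check on your target devices.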
The command-line tool mp4chaps does the job. On Ubuntu it comes from the mp4v2-utils package. Remember to specify the qt format for QuickTime, because Nero-format chapter marks seem to be used less nowadays.
I discovered these guys: sensoryresearch who license an API for writing chapter/text/link tracks to MP4s (which is what an M4A is).
Depending on where the bug is, you could try going straight to the QuickTime C APIs to write the movie file. You might also try adding the chapters track using the C APIs.
Any word on when Apple will fix the bug? I am planning to create enhanced podcasts with QTKit, and need this to work.
Closed 7 years ago.
Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to.
Something that works with C# or classic ASP (VBScript) would be ideal and I also need to be able to separate the pages from the PDF.
This question had some interesting stuff, especially pdftotext, but I'd like to avoid calling an external command-line app if I can.
You can use the IFilter interface built into Windows to extract text and properties (author, title, etc.) from any supported file type. It's a COM interface, so you would have to use the .NET interop facilities.
You'd also have to download the free PDF IFilter driver from Adobe.
Here is a good list:
Open Source Libs for PDF/C#
Most of these are geared toward creating PDFs, but they should have read capability as well.
There is this one as well: iText
I have only played with iText before. Nothing major.
We've used Aspose with good results.
In addition to the approved answer: there are also commercial alternatives to the Adobe IFilter for text indexing (providing a similar API but also offering additional premium functionality):
Foxit PDF IFilter: provides much faster text indexing compared to Adobe's plugin.
PDFLib PDF iFilter: includes support for damaged PDF documents plus an additional API to run your own queries.
If you are looking for a single tool that can be used from both managed .NET apps and legacy programming languages like classic ASP or VB6, this is where the commercial ByteScout PDF Extractor SDK would fit, as it provides both .NET and ActiveX/COM APIs.
Disclaimer: I work for ByteScout
Docotic.Pdf library can be used to extract formatted or plain text from PDF documents.
The library can read PDF documents of any version (up to the latest published standard). Extraction of pages is also supported by the library.
Links to sample code:
How to extract text from PDF
How to extract PDF pages
Disclaimer: I work for the vendor of the library.