Can we extract PDF pages using Lua scripts? - pdf

Our application receives a 150-page PDF file from a business line, and I want to extract pages from this PDF file using Lua scripts.
Can anybody share their experience?
Thanks

Sure, you can do this, as long as you write a Lua module that can read PDF files.
There are some Lua modules for writing PDFs, but none for reading them, or at least no public ones. You may want to switch to Python for this, as there are quite a few Python modules for dealing with PDFs.

You could write a Lua wrapper calling something like pdftk.
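
A minimal sketch of that wrapper idea, written in Python (the language suggested in the other answer) rather than Lua. It assumes pdftk is installed and on the PATH; the function names and file names are made up for illustration. It relies on pdftk's cat operation, which copies a page range into a new file.

```python
import subprocess


def build_pdftk_command(source, first_page, last_page, target):
    """Build the pdftk argument list that copies a page range to a new file."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("page range must satisfy 1 <= first_page <= last_page")
    # Equivalent shell command: pdftk source cat FIRST-LAST output target
    return ["pdftk", source, "cat", f"{first_page}-{last_page}", "output", target]


def extract_pages(source, first_page, last_page, target):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.run(build_pdftk_command(source, first_page, last_page, target), check=True)
```

For example, extract_pages("input.pdf", 10, 12, "pages-10-12.pdf") would write pages 10 to 12 of a hypothetical input.pdf into a new file. A Lua wrapper would build and run the same command line.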

Related

Get selected "PostScript" from PDF

I wasn't able to find anything on the internet and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying code that describes the PDF document of a selected area from a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does there exist something that might be able to accomplish my needs here or anything that might be reused (like an open source reader) to get there a little faster and not having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of windows builds available.
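
As a rough sketch, the two poppler utilities mentioned above can be driven from a script. This assumes pdftops and pdfimages are on the PATH and uses their -f (first page) and -l (last page) options; the helper names and file names are hypothetical. Note that this restricts the output to pages, not to a selected area within a page.

```python
import subprocess


def page_range_args(first_page=None, last_page=None):
    """Translate an optional page range into poppler's -f / -l options."""
    args = []
    if first_page is not None:
        args += ["-f", str(first_page)]
    if last_page is not None:
        args += ["-l", str(last_page)]
    return args


def pdftops_command(pdf_path, ps_path, first_page=None, last_page=None):
    """Command line that converts the chosen pages to PostScript."""
    return ["pdftops"] + page_range_args(first_page, last_page) + [pdf_path, ps_path]


def pdfimages_command(pdf_path, image_root, first_page=None, last_page=None):
    """Command line that extracts bitmapped images from the chosen pages."""
    return ["pdfimages"] + page_range_args(first_page, last_page) + [pdf_path, image_root]


def run(command):
    """Execute one of the commands above; raises if the tool fails."""
    subprocess.run(command, check=True)
```

For example, run(pdftops_command("doc.pdf", "page3.ps", 3, 3)) would convert only page 3 of a hypothetical doc.pdf.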
The open source tool from iText called iText RUPS does what you want: it shows you all the PDF commands for a particular PDF and lets you visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/

How to merge PDF files without external dependencies

In one of my applications I need to merge many single PDF documents into one document, where each of the original PDFs is a page. Although many PDF libraries exist for most languages, I would like to write this myself if it's not too hard.
Is it necessary to implement a full-fledged PDF parser in order to merge PDF documents? Where should I start, and what should I read, to find out what is needed for the task?
You can use the Debenu QuickPDF Library Lite (free) version to do it. Here is a very good example of how to do it:
http://www.debenu.com/kb/merge-pdf-files-together-programmatically/

How to detect image in a document

How can I detect images in a document, say doc, xls, ppt or pdf?
I came across Apache Tika, and I am trying its command line option.
http://tika.apache.org/1.2/gettingstarted.html
But not quite sure how it will detect images.
Any help is appreciated.
Thanks
You've said you want to use a command line solution, and not write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much more nicely!
The first thing to do is to have the Tika App extract any embedded resources within your file. Use the --extract option for this, and have the extraction occur in a special temp directory your app controls, e.g.
$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
Grab the output of the extraction if you can, and parse that looking for images (but be aware that some images have an application/ prefix on their canonical mimetype!). You might need to run a second --detect step on a few; I'm not sure, so test how the parsers get on with the extraction.
Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
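
The steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: that the "Extracting ..." lines appear on standard output in the format shown in the sample, that running the Tika App from inside the temp directory makes it extract there, and that application/x-emf (taken from the sample output) is the only application/ type to treat as an image. The function names and the handle callback are hypothetical.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path

# Matches lines such as: Extracting 'image1.emf' (application/x-emf)
EXTRACT_LINE = re.compile(r"^Extracting '(.+)' \(([^)]+)\)$")

# Assumption: extend this set with other image formats reported under application/.
IMAGE_LIKE_APPLICATION_TYPES = {"application/x-emf"}


def parse_extract_output(output):
    """Return (filename, mimetype) pairs found in the --extract output."""
    pairs = []
    for line in output.splitlines():
        match = EXTRACT_LINE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def is_image(mimetype):
    """True for image/* types and the application/ types listed above."""
    return mimetype.startswith("image/") or mimetype in IMAGE_LIKE_APPLICATION_TYPES


def process_embedded_images(tika_jar, document, handle):
    """Extract into a temp dir, call handle(path, mimetype) per image, then clean up."""
    workdir = tempfile.mkdtemp(prefix="tika-extract-")
    try:
        # Paths are made absolute because the command runs inside workdir.
        result = subprocess.run(
            ["java", "-jar", str(Path(tika_jar).resolve()),
             "--extract", str(Path(document).resolve())],
            cwd=workdir, capture_output=True, text=True, check=True,
        )
        for name, mimetype in parse_extract_output(result.stdout):
            if is_image(mimetype):
                handle(Path(workdir) / name, mimetype)
    finally:
        # Zap the temp dir whether or not the extraction succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
```

The callback runs before the cleanup, so any image that should be kept has to be copied elsewhere inside handle.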
Having used Tika in the past, I couldn't see how Tika could help with images embedded within Office documents or PDFs, but I was wrong to answer no. You may still have to resort to native APIs like Apache POI and Apache PDFBox. Tika does use both libraries to parse text and metadata, but with no embedded image support (see the update below).
Using Tika makes these APIs automatically available (side effect of using Tika).
UPDATE:
Since Tika 0.8: look for EmbeddedResourceHandler and examples - thanks to Gagravarr.

generating pdf files with php

After some work with PHPExcel, I finally got it to generate sheets of 3000 cells in ~5 seconds by using a big array.
With the same data, I'll need to generate some PDF files. I've tried to do it with PHPExcel, but it is not a good choice: generating a PDF file with PHPExcel took a lot of time and a lot of resources.
I've also tried to generate a PDF file with the html2pdf PHP library. A file containing a table with 3000 cells took 20 seconds to generate.
My problem is that I can't find a good solution. Do you know any good library? Do you know any good practices for generating PDF files faster, with a low load on the server side?
You can use the FPDF library to generate PDF files in a fast manner and you can use the Write HTML tables add-on to achieve what you want (see example at the bottom of the page).
PhpExcel uses TCPDF to generate PDF, the same as HTML2PDF with PHP5:
HTML2PDF is a HTML to PDF converter written in PHP4 (use FPDF), and PHP5 (use TCPDF).
I think that when generating a PDF, PhpExcel first generates XLS, then converts it to HTML, then again converts it to PDF. Not very efficient.
That is why using HTML2PDF directly cuts the time to 20 seconds.
--
To cut the waiting time even more, maybe you could try another library, like dompdf, and keep skipping PhpExcel when what you need is a PDF.
If your table doesn't have formulas, you can generate all the content in an array, and pass it to one function to generate an XLS with PhpExcel, and to another to generate a PDF.

How to open PDF and read it?

How can I open a PDF file and read some of its contents with Python (this language is preferred, however Ruby, Perl or PHP are fine too), in case the text is recognized (not just an image), or report that it's impossible without OCR? TIA
Update: thanks for the solutions, I'm sure some of them will suit me fine.
@RichH, I have a PDF file, and don't know whether it is image- or text-based. I'm looking for a tool to help me find that out and, in case it's text-based, extract some of its contents.
For Perl, check out these modules:
PDF::API2
CAM::PDF
Parsing a PDF and making something useful out of it is hard, because the format is focused on preserving the layout. Text can be stored so that each letter is positioned individually, and depending on the font, the text might also be stored as graphics.
Libraries I know of for reading PDFs include the Zend Framework, which has a PDF component with a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings to different languages.
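
For the Python side of the question, one possible sketch is to run poppler's pdftotext utility on the first few pages and check whether any real text comes back. This assumes pdftotext is installed and on the PATH. The function names and the 20-character threshold are arbitrary choices for illustration, and the check is only a heuristic: a PDF with text stored as graphics, as described above, will look image-based.

```python
import subprocess


def extract_text(pdf_path, first_page=1, last_page=3):
    """Return the text pdftotext finds on the given pages."""
    # Passing "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", "-f", str(first_page), "-l", str(last_page), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def looks_text_based(text, min_chars=20):
    """Heuristic: enough non-whitespace characters suggests a text layer exists."""
    visible = [ch for ch in text if not ch.isspace()]
    return len(visible) >= min_chars
```

If looks_text_based(extract_text("some.pdf")) is false for a hypothetical some.pdf, the file is probably image-based and would need OCR.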