Concatenate PDFs and add a table of contents

I have several PDF files (say, chapters of a book), and I want to concatenate them and add a table of contents to the resulting PDF.
None of the original PDF files has a table of contents. A simple numbered table of contents is more than fine.
I know that I can concatenate the PDF files using tools such as pdftk. Another possibility would be to use LaTeX to create the table of contents and include each PDF file.
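
The LaTeX route mentioned above could be sketched roughly as follows. This is only a sketch, assuming the pdfpages package is installed and that its `addtotoc` option is used to register each included file in the table of contents. The function name `build_wrapper_tex`, the file names, the titles, and the `ch:N` labels are made up for illustration.

```python
from typing import Iterable, Tuple


def build_wrapper_tex(chapters: Iterable[Tuple[str, str]]) -> str:
    """Return LaTeX source that includes each PDF and lists it in the contents.

    `chapters` holds (pdf_path, title) pairs; both values are illustrative.
    """
    lines = [
        r"\documentclass{book}",
        r"\usepackage{pdfpages}",
        r"\begin{document}",
        r"\tableofcontents",
    ]
    for number, (pdf_path, title) in enumerate(chapters, start=1):
        # pages=- includes every page of the file.
        # addtotoc={page, sectioning name, depth, heading, label}
        lines.append(
            r"\includepdf[pages=-,addtotoc={1,chapter,0,%s,ch:%d}]{%s}"
            % (title, number, pdf_path)
        )
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical chapter files; write the result to book.tex and compile it.
    print(build_wrapper_tex([("ch1.pdf", "Introduction"), ("ch2.pdf", "Methods")]))
```

The generated file would typically need two LaTeX runs so that the table of contents picks up the page numbers. A title containing a comma would have to be wrapped in braces, since `addtotoc` is comma-separated.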

Related

search within a pdf file and print automatically

I have a PDF file with 1400 pages containing resumes. There is an Excel sheet with a list of unique IDs, one for each candidate. Is there some tool that could search for those IDs one by one and print the pages containing them?
Searching for them one by one and printing is very time-consuming. There are more than 700 candidates.
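
One way to automate the lookup could be sketched as below. It assumes the text of each page has already been extracted into a list of strings (by whatever PDF tool is available) and that the IDs have been exported from the spreadsheet as plain strings. The function names `pages_for_ids` and `to_page_ranges` are invented for this sketch.

```python
from typing import Dict, Iterable, List


def pages_for_ids(page_texts: List[str], ids: Iterable[str]) -> Dict[str, List[int]]:
    """Map each ID to the 1-based page numbers whose text contains it."""
    hits: Dict[str, List[int]] = {}
    for candidate_id in ids:
        # Plain substring match: "ID-1" would also match inside "ID-10".
        hits[candidate_id] = [
            page_no
            for page_no, text in enumerate(page_texts, start=1)
            if candidate_id in text
        ]
    return hits


def to_page_ranges(pages: List[int]) -> str:
    """Collapse page numbers into a print-dialog style string such as '3-5,9'."""
    ranges: List[List[int]] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1][1] = page  # extend the current run of consecutive pages
        else:
            ranges.append([page, page])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
```

The resulting range string can be pasted into the page-range field of a print dialog, or passed to a command-line tool that accepts page ranges.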

How to extract table data from pdf and store it in csv/excel using Automation Anywhere?

I want to extract table data from a PDF to Excel/CSV. How can I do this using Automation Anywhere?
Please find below the sample table from pdf document.
There are multiple ways to extract data from PDFs.
You can extract raw data, formatted data, or create form fields if the layout is consistent.
If the layout is more random, you might want to take a look at IQ Bot, where there are predefined classifications for things like Orders etc.
If you have a standard format, I would lean toward form fields when the text contains unusual characters, such as the " used for inches, since their encoding doesn't map well with the raw or formatted options.
The raw option has some quirks where you don't always get all the characters you expect, such as a missing first letter of a data item.
The formatted option is good at capturing tabular columns as they go across the line.

One input file - multiple output files

Is it possible to have one input HAML file that produces multiple output files in one folder?
The requirement is a price list for products with links to individual product descriptions, each of which is basically the same HTML file with small differences.
Is this possible using ONLY HAML?

Extract MS Word document chapters to SQL database records?

I have a 300+ page Word document containing hundreds of "chapters" (as defined by heading formats) and currently indexed by Word. Each chapter contains a medium amount of text (typically less than a page) and perhaps an associated graphic or two. I would like to split the document up into database records for use in an iPhone program - each chapter would be a record consisting of a title, id #, and content fields. I haven't decided yet if I would want the pictures to be a separate field (probably just containing a file name), or HTML or similar style links in the content text. In any case, the end result would be that I could display a searchable table of titles that the user could click on to pull up any given entry.
The difficulty I am having at the moment is getting from the Word document to the database. How can I most easily split the document up into records by chapter, while keeping the image associations? I thought of inserting some unique character between each chapter, saving to text format, and then writing a script to parse the document into a database based on that character, but I'm not sure that I can handle the graphics in this scenario. Other options?
To answer my own question:
Given a fairly simply formatted Word document, convert it to an Open Office XML document, then write a Python script to parse the document into a database using the xml.sax Python module.
Images are inserted into the record as HTML, to be displayed using a web interface.
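
A minimal sketch of such a script might look like the following. It is not the original script. It assumes the converted XML marks headings as `text:h` and paragraphs as `text:p` (OpenDocument-style names), uses SQLite as the database, and skips images entirely. The class name `ChapterHandler`, the function `load_chapters`, and the `chapters` table are invented for illustration.

```python
import sqlite3
import xml.sax


class ChapterHandler(xml.sax.ContentHandler):
    """Collect [title, content] pairs, starting a new chapter at each heading."""

    def __init__(self):
        super().__init__()
        self.chapters = []
        self._in_heading = False
        self._in_paragraph = False

    def startElement(self, name, attrs):
        if name == "text:h":  # assumed heading element
            self.chapters.append(["", ""])
            self._in_heading = True
        elif name == "text:p" and self.chapters:  # assumed paragraph element
            self._in_paragraph = True

    def endElement(self, name):
        if name == "text:h":
            self._in_heading = False
        elif name == "text:p" and self._in_paragraph:
            self._in_paragraph = False
            self.chapters[-1][1] += "\n"  # keep paragraph breaks

    def characters(self, content):
        if self._in_heading:
            self.chapters[-1][0] += content
        elif self._in_paragraph:
            self.chapters[-1][1] += content


def load_chapters(xml_bytes: bytes, conn: sqlite3.Connection) -> int:
    """Parse the XML and insert one row per chapter; return the chapter count."""
    handler = ChapterHandler()
    xml.sax.parseString(xml_bytes, handler)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapters "
        "(id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO chapters (title, content) VALUES (?, ?)",
        [(title.strip(), content.strip()) for title, content in handler.chapters],
    )
    conn.commit()
    return len(handler.chapters)
```

Paragraphs that appear before the first heading are ignored here. Handling images would need an extra branch for whatever element the converted file uses for them.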

Referencing cells in XWPFTable with Apache POI

I'm writing a docx parser with Apache's POI library. I'm having some trouble understanding how cells are referenced within an XWPFTable. Can someone explain how the referencing is done when non-uniform tables are presented (i.e. two columns with different numbers of rows)?
POI XWPF will give you the cells in the order that Word has stored them in the file. It's as (deceptively!) simple as that...
To check what Word does, one option is just to use POI and see what you get. The other is to unzip the Word file - a .docx is simply a special zip of XML files. Look at the document XML and see how Word has decided to store your complex set of table cells. Then, ask POI for them, and you should get the same ordering!
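
The unzip-and-look approach can also be scripted. The sketch below is in Python rather than POI and is only meant as an inspection aid: it reads `word/document.xml` from the .docx and lists the cell texts row by row in the order they are stored. The function name `table_cells_in_file_order` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def table_cells_in_file_order(docx):
    """Return, per table, a list of rows, each a list of cell texts, in stored order.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = []
    for tbl in root.iter(f"{{{W_NS}}}tbl"):
        rows = []
        for tr in tbl.findall(f"{{{W_NS}}}tr"):  # direct child rows only
            cells = []
            for tc in tr.findall(f"{{{W_NS}}}tc"):  # direct child cells only
                # Join all text runs inside the cell (includes nested tables).
                cells.append("".join(t.text or "" for t in tc.iter(f"{{{W_NS}}}t")))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Rows with differing numbers of cells simply show up as lists of differing lengths, which makes it easy to compare against what POI returns for the same file.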