How to convert a multi-page PDF table to a spreadsheet format?

I have a huge PDF file (300+ pages) across which a single table with 10+ columns is spread. I am using Linux and would like a simple command-line invocation that converts this table to text I can import into a spreadsheet.
Currently I am using pdftotext -layout, which gives quite good results, except that every page is handled independently and the column widths and positions change from page to page (because the maximum column content width differs on each page), so I cannot simply import the resulting text file into a spreadsheet application and split it into columns at fixed widths.
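For reference, that invocation looks like this (file names are placeholders):
pdftotext -layout input.pdf output.txt
Poppler's pdftotext also has a -fixed option, which assumes fixed-pitch/tabular text with a given character width in points and forces physical layout; it may be worth experimenting with here.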
I have tried cropping every column on every page (the column positions are identical across the whole PDF file), but in the result the empty rows are collapsed, so rows with content end up shifted relative to one another between columns.
If pdftotext had an option to convert the file with a STRICT LAYOUT (rather than sizing columns by their content), that would help. Or if I could stack all the pages of the PDF file onto a single page, that could also solve it.
What are the options to solve this problem?

You are misunderstanding the nature of the content of a PDF file. There are no tables in PDF; there is (generally) no metadata to describe the content as a table, and the text you see on the page may not be laid out in reading order.
For example the PDF file might contain a line of text drawn at the top of the page, then one at the bottom, then a paragraph in the middle before jumping back up to the top for a headline.
In addition there may be no spaces between two text fragments. Text is drawn at an absolute position on the page, so you can draw (for example) cell A, then move the current point by, say, 1 cm, then draw cell B, and so on. Since there are no 'space' characters between the two cells, a naive text extraction will naturally assume the two pieces of text are continuous.
The STRICT LAYOUT you want isn't impossible, but you can't get it from a simple text file, because the original layout isn't made up of simple text characters; often the space between two characters, or between two fragments of text, is produced by moving the current point before drawing the text.
Ghostscript's txtwrite device in its simplest mode attempts to replicate the layout by replacing the white space with actual space characters in a fixed-pitch font. This might be good enough for you, but it equally well might not: it works by defining the smallest distance used on the page as one space character, and every distance between pieces of text is then replaced by as many space characters as are required to make up that distance. This can (and often does) result in very wide output files with a lot of white space.
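A minimal invocation of that mode, with placeholder file names:
gs -sDEVICE=txtwrite -o output.txt input.pdf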
Essentially, what you seem to want isn't really possible: you can't take a rich format like PDF and replicate it, layout included, with nothing more than text characters.

Related

How to detect visible text in a text field in a PDF?

When using PDFBox to populate a text field in a form in a PDF, it is possible that the text overflows the text field and is not visible when opening the PDF in a viewer.
Question: Is it possible to use PDFBox to detect how much text within a text field is visible?
At the risk of falling victim to an XY problem, here is the context in which this came up.
I have a PDF which is provided by the Danish government, and the software I am creating needs to be able to fill out this form programmatically. On pages 5 and 6 of this document, there is a large blank area that needs to be filled out. The way the PDF creators designed it, they just made two text fields (named Text57 and Text58), which a person directly filling out the form would manually need to jump between.
The problem is, I need to be able to populate these fields with text, and if the text is too large to fit in the first text field, then it needs to overflow into the second text field. However, I do not seem to have any way of actually detecting when the text overflows in the first text field.
One workaround which could be acceptable would be to modify the document to remove the second text field and just have the first text field span multiple pages, but from playing around in Acrobat this does not seem to be possible.
The PDF in question can be found here: https://www.trafikstyrelsen.dk/~/media/Dokumenter/10%20Bolig/Bolig/Private%20lejeboliger/Lejekontrakt/typeformular-a.pdf
Here is a code snippet which populates the problematic field with 100 lines numbered from 1 to 100.
import java.io.File;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
PDDocument document = PDDocument.load(new File("typeformular-a.pdf"));
PDField text57 = document.getDocumentCatalog().getAcroForm().getField("Text57");
text57.setValue(IntStream.range(1, 101).mapToObj(Integer::toString)
        .collect(Collectors.joining(System.lineSeparator())));
document.save("typeformular-a.out.pdf");
After the code is run, we can see that the text gets cut off after line 44. Of course I cannot simply count lines in my text, because under normal circumstances the lines in the text will wrap, which would invalidate that approach.
Auxiliary question: Is there any other approach that could solve this original problem of splitting text across multiple pages?
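For what it's worth, one rough way to approximate the check is to measure the text against the field widget's rectangle yourself. Below is a minimal sketch; the font, font size, leading factor of 1.2, and space-only wrapping are all assumptions for illustration (a real implementation would take the font and size from the field's default appearance (/DA) string):
import java.io.IOException;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class OverflowEstimator {
    // Rough check: does `text` fit into `rect` at the given font size?
    static boolean fits(PDRectangle rect, PDFont font, float fontSize, String text)
            throws IOException {
        int maxLines = (int) (rect.getHeight() / (fontSize * 1.2f)); // assumed leading
        int lines = 0;
        for (String paragraph : text.split("\r?\n")) {
            String line = "";
            for (String word : paragraph.split(" ")) {
                String candidate = line.isEmpty() ? word : line + " " + word;
                // getStringWidth returns 1/1000ths of text space units at size 1
                float width = font.getStringWidth(candidate) / 1000f * fontSize;
                if (width > rect.getWidth() && !line.isEmpty()) {
                    lines++;        // the current line is full, wrap
                    line = word;
                } else {
                    line = candidate;
                }
            }
            lines++;                // the final line of each paragraph
        }
        return lines <= maxLines;
    }
}
The rectangle would come from something like field.getWidgets().get(0).getRectangle() in PDFBox 2.x; with a fits-style estimate you could split the text at the estimated cut-off and push the remainder into Text58.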

How to substitute all "\t" (tab characters) with white space in a PDF

Hello, I am trying to convert a PDF book about programming to the MOBI format with Calibre.
The problem I am facing is that the code blocks inside the converted version completely lose indentation.
I managed, with a regular expression, to correctly indent the lines that were indented using white spaces: I transformed every two white spaces into two non-breaking spaces.
Unfortunately, some of the code blocks are indented using the tab character, so the regular expression does not work in those cases.
I came to realize that during the conversion from PDF to MOBI there is an intermediate step in which the PDF is converted to HTML, and that is where the tab information is lost, because no special tag is generated to carry it.
So I think the best solution is to edit the PDF itself and replace all the tab characters (\t) with two white spaces. That way the regular expression I mentioned before will work for all the code blocks and the code will be indented properly.
But I have no idea which software offers this functionality of substituting PDF elements.
I doubt that the 'tabs' are contained in the PDF as tabs. The tab character (0x09 in ASCII) has no special significance in PDF; in particular it does not move the current point, it simply draws a glyph. As a result, if the content stream says (A\tB), what you will see when the PDF is rendered is 'AB', or 'A*B' where the * is some other character you didn't expect (often a square).
So you would probably actually have to convert current-point movement operators into white space drawing. There's no realistic way that can be automated, since no tool can tell where a movement was a 'tab' and where it was a reposition.
So you will need to do it manually.
The challenge here is that the page content stream is likely to be compressed, so the first thing you will have to do is decompress the PDF. There are a number of tools which will do this for you; MuPDF is one, and I think pdftk is another.
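For example, either of these should write an uncompressed copy (file names are placeholders):
mutool clean -d input.pdf decompressed.pdf
pdftk input.pdf output decompressed.pdf uncompress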
Then you will need to locate the position where you want to insert the space. This could be challenging, as the font may be re-encoded to something other than ASCII, so it may be hard to identify the correct position. Once you've done that, you can insert the space(s) you want into the text strings, again bearing in mind that the font in use may be re-encoded, and subset: a space may not be 0x20, and indeed the font may not even contain a space glyph at all. And of course you need to remove the operations that reposition the current point.
Finally, after you've modified the contents, you need to remember that PDF is a binary format, and the xref table contains the byte position of every object in the file. If you've edited the file, it's likely that you will have altered the length of one or more objects, which will change the offset of any following objects, so you will need to recalculate those offsets and update the xref table.
I suspect you are going to find it easier to modify the conversion from PDF to HTML, or to modify the HTML, than to try to alter the PDF file.

Get x/y and width/height of all characters in a PDF using Ghostscript

I need to get the x/y, width/height, and page number of each individual character in a PDF, ideally as percentages.
Clearly, Ghostscript is able to do this, as it wouldn't otherwise be possible to convert PDFs to raster images. Is there a simple way to get Ghostscript to give me this information, or am I going to need to modify the source to hook into this functionality?
Glyphs are rendered to bitmaps (using FreeType) and stored in the glyph cache tagged with the font and matrix so that they can be uniquely identified. When text is rendered to the page the cache is consulted first and if a hit exists that bitmap is drawn at the current point. If not then the glyph is rendered and cached.
However, extremely large point sizes are left uncached, and rendered each time to avoid filling up or overflowing the cache.
So in order to retrieve this information using Ghostscript you would need to write a device which has a set of text methods. You would need to capture the bitmaps from the glyph in order to determine the width and height of the glyphs, and the current point would give you the position on the page. The output_page method would tell you that a page had completed, so you would need to track the page number yourself.
You could look at the txtwrite device to see how text is processed, and the epswrite device to see how to retrieve bitmaps, you'll need some combination of both.
Be aware that 'text' in a PDF file need not be text. What appears to be text can be bitmaps, or vectors. Text can be encoded in unusual ways, and there may be no way to retrieve Unicode or other identifiable information about the glyphs (again the txtwrite device shows how you might extract such information if possible).
Also, fonts are not always embedded in PDF files, in which case a substitute font is used, which would mess up your width/height information.
This is quite a big project.
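That said, before starting such a project it may be worth checking whether the txtwrite device's more verbose output modes get you close enough: as far as I recall, the low -dTextFormat values are developer-oriented formats that include positional information for each text fragment, for example (file names are placeholders):
gs -sDEVICE=txtwrite -dTextFormat=0 -o output.txt input.pdf
You would still need to convert the reported coordinates into percentages yourself, using the page size.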

Is it possible to break a PDF file into pieces smaller than pages?

I have found a lot of tools for breaking big PDF files into smaller ones by splitting the original file PAGE WISE. For example, if I have a 10-page PDF document, they can break the original PDF file into 10 pieces, one per page.
But I want a similar kind of tool that breaks a PDF file into pieces smaller than pages. That means I need to split a PDF page into different documents based on some parameter like paragraph, section, or element.
For example:
if my PDF file has 2 pages with 10 paragraphs, then I would like to split the PDF file into 10 separate PDF files based on the paragraph parameter.
Also, I strongly believe PDF does not contain any structure like Open XML, but I am still left wondering:
how are the tools able to break PDF files into small PDF files by splitting page wise? What kind of mechanism do they use for page-wise splitting of a PDF file?
So, is there any way to do what I want? Please give me your valuable suggestions.
PDF is a vector-based document description language. It is page based, so in a way every page is independent of the next one, and splitting page wise is therefore pretty easy. Contrary to a raster image, where you can extract small subsets independently, in a PDF you have to render the whole page to know what a small subset looks like.
Say you have a page which contains a complex shaped object (a line, text, a shape, an image, etc.) and you want to extract some rectangular region of it. You would have to first find all the objects that produce visible output in the region of interest. Then you would have to modify them so they are rendered correctly, e.g. recalculating the points of a path that crosses the region boundary while preserving the shape of the object.
An easier approach would be to include the whole page and clip the viewing area to the dimensions of the region.
You could do this with pdfjam. Check the --trim/--offset/--delta options in conjunction with a custom paper size (examples 6 and 7 on the pdfjam website). You would still have to somehow calculate the coordinates of the region of interest, though.
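A sketch of that kind of invocation, with placeholder dimensions you would replace with the computed coordinates of your region:
pdfjam --trim '2cm 4cm 2cm 4cm' --clip true --papersize '{17cm,21.7cm}' input.pdf --outfile region.pdf
Here --trim takes the left, bottom, right, and top amounts to cut away, and --clip true makes the trimmed material actually invisible rather than merely shifted.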

PDF iText TOC generation

I have to merge multiple PDF documents into a single PDF document and, besides this, generate a TOC. The original documents will contain text with a specific style (say H1); this special text becomes part of the TOC.
I have used iText for merging multiple PDF files, but I am unable to find an example/API for parsing the document to find all the content having style H1.
Generating TOC is next challenge.
You don't. PDFs don't have styles. They have a "current graphics state", which includes:
the current transformation matrix (CTM)
stroke & fill colors
the clipping path
font & size
gobs of other text state stuff (character spacing, word spacing, leading, text render mode...), including a separate text transformation matrix which is combined with the CTM.
So first you have to track all this stuff (which iText can mostly do for you). Then you have to determine how big "H1" text is, and latch on to all the text that is rendered at that size on screen, taking the CTM, text matrix, and font size into account (which iText will again do for you, IIRC).
And just to make life more exciting for folks like yourself, it's entirely possible that the text you're looking at isn't text at all. It could be paths, or a bitmap... at which point you need OCR, and I don't think you'll get much in the way of size info with OCR.
You'll need to write a RenderListener that determines the final rendered size of a given piece of text (and whether or not it's part of the previous piece) and filters out all the text that is too small. You'll then build your TOC from the text you keep.
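As a concrete starting point, here is a minimal sketch against the iText 5 parser API. The 14pt threshold and the input file name are assumptions for illustration; a real implementation would also merge adjacent fragments and record page numbers for the TOC:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.*;

public class HeadingFinder implements RenderListener {
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}

    public void renderText(TextRenderInfo info) {
        // Rendered glyph height in device space: the distance between the
        // ascent and descent lines, with CTM and text matrix already applied.
        float height = info.getAscentLine().getStartPoint().get(Vector.I2)
                     - info.getDescentLine().getStartPoint().get(Vector.I2);
        if (height > 14f) { // assumed H1 size, tune for your documents
            System.out.println(height + " : " + info.getText());
        }
    }

    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("merged.pdf"); // assumed input
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            parser.processContent(page, new HeadingFinder());
        }
        reader.close();
    }
}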