How to automatically mark up plaintext and preserve formatting

I am maintaining a growing (250 pages) plaintext document that really needs to be a PDF technical document. Is there some automatic markup tool I can use that preserves my existing formatting, i.e. headings, subheadings, paragraphs, tables, columns, examples, etc.? Once the initial markup (to HTML/XML) is right, I can move it to PDF more directly, but I would really like to avoid an entire manual reformatting just to keep the formatting it already has.
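As a rough illustration (not part of the original question): if the plaintext already follows consistent conventions such as blank-line paragraphs and recognizable heading lines, it may be close enough to Markdown that a converter like pandoc can do the initial markup pass. The file name and options below are placeholders, not anything from the thread.

    # Sketch only: run pandoc to turn Markdown-ish plaintext into standalone HTML.
    # "manual.txt" and "manual.html" are placeholder names.
    import subprocess

    subprocess.run(
        ["pandoc", "manual.txt", "--from", "markdown", "--to", "html",
         "--standalone", "--output", "manual.html"],
        check=True,
    )

Pandoc can also go straight to PDF (given a PDF engine such as a LaTeX installation), but the HTML/XML intermediate the question describes keeps more control over the final layout.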

Related

Inserting an entire PDF into another by raw text manipulation

I need to include a PDF into another PDF that is being created by text manipulation, not through a package. (In particular, I'm using LiveCode, which is well suited to generating the information I need and can easily do text manipulation.)
Once included, I will be adding additional objects (primarily text, but also a few small squares).
I only need to be able to access the included pdf by page and area, such as (200,200) to (400,400) of page 5; I don't need any access to its objects.
Simply appending to the PDF won't do the job, as I'll actually be including multiple source PDFs into a single output PDF along with my additions.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it. In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
Can this be done? Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
Something that could be used on a one-time basis to convert an entire multi-page PDF would also be a usable (but inferior) solution.
I've found that search engines aren't yielding usable results, as they are swamped with solutions for individual products rather than the PDF format itself.
First of all, PDFs in general are not text data; they are binary. They may look textual because they contain identifiers built from ASCII values of words, but treating them as text is a sure way to damage them unless you and your tools are extremely cautious.
But even if we assume such caution, unless your input PDFs are internally of a very simple and similar structure, creating code that can merge them and manipulate their content is, complexity-wise, essentially akin to creating a generic PDF library/package.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it.
Putting them into one indirect object each would work if you needed them merely as an unchanged attachment. But you want to change them.
In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
You will at least have to parse ("disassemble") the objects related to the pages you want to manipulate, add the manipulated versions thereof, and add cross references for the changed objects.
And you only mention cross reference tables. Don't forget that in case of a general solution you also have to be able to handle cross reference streams and object streams.
Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
If you really want to merge the source PDFs into a target one, you'll indeed need to renumber the objects from most source PDFs.
If a portable collection (aka portfolio) of the source PDFs would suffice as a target, you might not need to do that. In that case you merely have to apply the changes you want to the source PDFs (by means of incremental updates, if you prefer), and then combine all those manipulated sources in a result portfolio.
I've found that search engines aren't yielding usable results
The cause most likely is that you underestimate the complexities of the PDF format. Combining and manipulating arbitrary existing PDFs usually requires you to use a third-party library or to create the equivalent of such a library yourself.
Only manipulating existing PDFs is a bit easier, and so is combining PDFs in a portfolio. Nonetheless, even in this case you should have studied the PDF specification quite a bit.
Restricting oneself to string manipulations to implement this makes the task much more complex - I'd say impossible for generic PDFs, daring for PDFs of simple and similar build.
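For comparison, this is roughly what the library route looks like; a minimal sketch using pypdf (an arbitrary choice of library, not something recommended in this thread), where object renumbering and cross-reference rebuilding happen inside the library rather than in hand-rolled string manipulation. The file names are placeholders.

    # Minimal merge sketch with pypdf (assumed library, not from the thread).
    # The library parses each source, renumbers objects and rebuilds the
    # cross-reference data; none of that has to be done by hand.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in ["source_a.pdf", "source_b.pdf"]:  # placeholder file names
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)

    with open("merged.pdf", "wb") as out:
        writer.write(out)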

Extract title and a specific table-of-contents section from a set of PDFs

I need to extract the title of each PDF and a specific piece of content along with its pages.
For example, I have a folder full of PDFs and I need to find in the Table of Contents a heading called Enhancements, if it is there. If the Enhancements content is there, copy the title of the PDF (usually on the first page) and copy the Enhancements section, and place them in another PDF as a chronology of enhancements.
You will need to extract text chunks with their coordinates from those PDFs first. You can use a PDF processing software of your choice for this.
Then you will need to analyze the extracted chunks and detect which chunks go into the Enhancements section. This is the hardest part. And I doubt there is software that can do such an analysis for you out of the box. Sorry.
Please note that text in PDFs is usually stored in chunks, not words or sentences. Each chunk is one or more characters. It might be one letter or one and a half words. There are no guarantees for what constitutes a chunk.
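As a small sketch of the first step (extracting chunks with their coordinates), here is what it might look like with pdfminer.six, chosen arbitrarily; the file name is a placeholder.

    # Sketch of chunk extraction with coordinates using pdfminer.six
    # (one arbitrary choice of PDF processing software; "report.pdf" is a placeholder).
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, layout in enumerate(extract_pages("report.pdf"), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                print(page_no, (x0, y0, x1, y1), element.get_text().strip())

The analysis step (deciding which of those chunks belong to the Enhancements section) remains your own logic on top of this output.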

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a visually-oriented way, as a sequence of operators which...generally do not convey information about higher level text units such as tokens, lines, or columns—information about boundaries between such units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation) and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10 page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
EDIT 7 April 2014. Cancel my last. The best extraction is achieved by MS Word. And this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you indeed have to look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at whitespace and then decides first what the lines are, then the paragraphs, and so on. Tables are notoriously difficult, for example, because they are so diverse.
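To make that concrete, a toy version of such a whitespace engine might start by grouping chunks into lines by their y-coordinate before worrying about paragraphs or columns. This is a sketch under the assumption that chunks with bounding boxes have already been extracted; nothing here comes from the tools discussed above.

    # Toy sketch of whitespace-based line grouping; `chunks` is assumed to be
    # a list of (x0, y0, x1, y1, text) tuples already extracted from one page.
    def group_into_lines(chunks, y_tolerance=2.0):
        lines = []
        for chunk in sorted(chunks, key=lambda c: -c[1]):  # top of page first
            for line in lines:
                if abs(line["y"] - chunk[1]) <= y_tolerance:
                    line["chunks"].append(chunk)
                    break
            else:
                lines.append({"y": chunk[1], "chunks": [chunk]})
        for line in lines:
            line["chunks"].sort(key=lambda c: c[0])  # order left to right
        return lines

A real engine then has to split each line wherever a horizontal gap is wide enough to be a column boundary, which is exactly where tables and multi-column layouts start to hurt.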
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
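As an illustration of why the first two points matter, the displacement of a single glyph already combines the font's width entry with several graphics-state parameters. This is a sketch of the text-space displacement computation described in the PDF specification, not code from any of the tools mentioned.

    # Sketch of the per-glyph horizontal advance: widths come from the font,
    # the remaining parameters from the current graphics/text state.
    def glyph_advance(width_1000, font_size, char_spacing=0.0,
                      word_spacing=0.0, horizontal_scaling=1.0,
                      is_space=False):
        tx = (width_1000 / 1000.0) * font_size + char_spacing
        if is_space:
            tx += word_spacing
        return tx * horizontal_scaling

Get any of those inputs wrong and every downstream decision about words, lines and columns is built on skewed coordinates.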
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details on a deeper level, and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help it excel in text extraction.

Soft PDF documents

In fact, we have two types of PDF documents:
Soft documents (conversion from Word to PDF or from LaTeX to PDF).
Hard documents (conversion from a scanned image to PDF).
By the way, I am only interested in soft documents.
In fact I am trying to conceal information (by using a specific steganography method...) in an existing PDF document, and I want to insert the embedded message by slightly modifying the position of the characters. I know that within a line all characters have the same y-coordinate but different x-coordinates. So I can insert some bits by slightly modifying the x-coordinate of each character, but if I insert bits by modifying the y-coordinate of characters that are placed on the same line, that will be easily detectable (because they share the same y-coordinate). That is why I want to insert some bits by modifying the x-coordinates of characters belonging to the same line, and some bits by modifying the y-coordinates of characters belonging to different lines (each character on its own line, though I don't know whether the gap between lines stays the same or not). In this case, I think my method will be harder to detect.
But before attempting that, I would like answers to the following questions:
1) If we have a PDF generated by a conversion from Microsoft Word to PDF: does the gap between lines remain the same, and is the gap between paragraphs constant?
2) Furthermore, if we have a PDF generated by a conversion from LaTeX to PDF: does the gap between lines remain the same, and is the gap between paragraphs constant? Please give me your opinion and a brief explanation.
3) When the text is justified, does the space between a given pair of letters remain the same? In other words, and for more precision, assume that we have text in a PDF that reads "happy new year and merry christmas, world is beautiful!". Does the space between "ea" in "year" remain the same as in "beautiful"? So if we have multiple words containing "ea", is the space between e and a always the same in all the "ea" pairs of all the words? (Assume that we do not change the font anywhere in the text of the PDF.)
You might have to explain more about what you want to do; that might make it easier to give good advice. Essentially it's important to understand the fundamental difference between applications such as Word (I'm hesitant to comment about Latex - I don't know enough about it) and PDF.
Word lives by words, sentences and paragraphs. Structured content is important, and how that's laid out on the page is - almost - an afterthought. In fact, while recent versions of Word are much better at this, older versions of Word could produce a completely different layout (including pagination) by simply selecting a different printer. Trust me, I got bitten by that very badly at one point (stupid me).
PDF lives by page representation, and structure is - literally - an afterthought. When a PDF file draws a paragraph, it draws individual characters or groups of characters. Sometimes in reading order, but possibly in a completely different order (depending on many factors). There is no concept of line height attributed to a character or paragraph style; the application generating the PDF simply moves the text pointer down a certain number of points and starts drawing the next characters.
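To illustrate, the operator sequence a producer might emit for two lines of a paragraph looks roughly like the invented fragment below (not taken from any particular file); nothing in it says "line" or "paragraph", there is only a relative move of the text pointer.

    # Illustrative content-stream fragment, held here as a bytes literal:
    # the producer sets a font, positions the text pointer, shows a string,
    # then just moves the pointer down 13 points and shows the next string.
    fragment = b"""
    BT
    /F1 11 Tf
    72 720 Td
    (First line of the paragraph) Tj
    0 -13 Td
    (Second line, simply drawn 13 points lower) Tj
    ET
    """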
So... to perhaps answer your question partially.
If you have Word documents generated by the same version of Word on the same operating system using the same font (not a font with the same name, the same font), you can generally assume that the basic text layout rules will be the same. So if you reproduce exactly the same text in both Word documents, you'll get exactly the same results.
However...
There are too many influencing parameters in Word to be absolutely certain. For example, line-height can be influenced by the actual words on a line. Having a bold word or a word in another font on a line (symbols can count!) can influence the amount of spacing between those particular lines. So while there might be overall the same distance between lines, individual lines may differ.
Also for example, word spacing is something that can quite easily be influenced with character styles and with text justification, as can inter-character spacing.
As for your question 3), apart from the fact that character spacing may change what you see, it's fair to assume that all things being equal the combination "ea" for example will always have the same distance. There are two types of fonts.
1) Those that define only character widths, which means that each combination of "ea" would logically always have the same width
2) Those that define character widths and specific kerning for specific character pairs. But because such kerning is for specific character pairs, the distance between "ea" would still always be the same.
I hope this makes sense, like I said, perhaps you need to share more about what you are trying to accomplish so that a better answer can be given...
@David's answer and @Jongware's comments to it already answered your explicit questions 1), 2), and 3). In essence, if you have an identical software setup (and at least in case of MS Word this may include system resources not normally considered), a source document (Word or LaTeX) is likely to produce the identical output concerning glyph positions. But small patches, maybe delivered as security updates from the manufacturer, may give rise to differences in this respect, most often minute but sometimes making lines or even pages break at different positions.
Thus, concerning your objective
to conceal information (by using a specific steganography method...) in an existing PDF document, [...] to insert the embedded message by slightly modifying the position of the characters.
Unless you want to have multiple identical software setups as part of your security concept, I would propose that you do not try to hide the information as a difference between your manipulated PDF and the PDF without manipulations, but instead in the less significant digits of the coordinates (e.g. hiding bits by making those digits odd or even, either before or after transformation, at a given precision) in your manipulated documents, making comparisons with "originals" unnecessary.
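A minimal sketch of that odd/even idea, assuming the hard part - safely rewriting the relevant coordinate values in the content stream - is handled elsewhere and is not shown here:

    # Sketch: hide one bit per coordinate in its last retained decimal digit
    # (even = 0, odd = 1). Rewriting the content stream safely is assumed to
    # be done by other code and is the genuinely difficult part.
    def embed_bit(coordinate, bit, precision=2):
        scaled = round(coordinate * 10 ** precision)
        if scaled % 2 != bit:
            scaled += 1
        return scaled / 10 ** precision

    def extract_bit(coordinate, precision=2):
        return round(coordinate * 10 ** precision) % 2

Because the bit lives in the written number itself, no comparison against an unmodified original is needed to read it back.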
For more definite propositions, please provide more information, e.g.
whom shall the information be concealed from: how knowledgeable and resourceful are they?
how shall the information extraction be possible: by visual comparison? By some small program runnable on any computer? By a very well defined software setup?
what post-processing steps shall be possible without destroying the hidden information; shall e.g. signing by certain software packages be possible? Such post-processors sometimes introduce minor changes, for example by parsing numbers into float variables and writing them back from those floats later.

Concatenate PDFs and preserve Extended Features in Acrobat Reader

We are using iText to automatically fill in form fields on a number of documents and then concatenating those documents into one resulting PDF.
Adobe has introduced the Extend Features in Acrobat Reader option to allow users of Acrobat Reader to save the PDF with changes to the form fields.
This is a proprietary Adobe feature that iText can only work around.
I have been able to execute the workaround for one specific document using the PdfStamper class in append mode. Since the PDFs contain form fields, we use the PdfCopyFields class to perform the concatenation. PdfCopyFields does not have an append mode.
Is there another way to do an append of a PDF into a preexisting PDF with iText (any version)?
It's possible, but would require you to know enough to modify PdfCopyFields so that it saves in append mode.
You could duplicate the functionality and use it on top of PdfStamper (in your own class or otherwise), subclass PdfCopyFields, or modify PdfCopyFields directly.
Big Stumbling Block
All fields with the same name in a PDF share the same value as well. If you have two copies of the same form in your resulting PDF, then you have two views of the same data.
Even with different forms, if you happen to have a name collision ("City" over here might be part of a current address, while over there it might be the city they were born in), they'll glom onto the same value.
If you have a Comprehensive System such that all your naming collisions will be deliberate, that's great, go for broke. If "FirstName" is always referring to the same person, and changing it SHOULD change the value across all the forms in question, you're golden. If not... that's why PdfStamper's flattening ability is so popular.
The alternative becomes "rename all your fields before gluing the forms together" to avoid such collisions.
Even with a Comprehensive System, I still suggest whipping up a little tool that'll go through the forms you propose to merge and look for collisions. Maybe list them along with their values in some test data. You might catch something along the lines of "Fly: House, Common" vs "Fly: Southwest Airlines".
Probably not that particular example, but who knows? ;)
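Such a collision checker could be as small as the following sketch; it uses pypdf rather than iText (an assumption for illustration only, not what this answer is based on), and the file names are placeholders.

    # Sketch of a field-name collision checker with pypdf (assumed library;
    # the original discussion is about iText). File names are placeholders.
    from collections import defaultdict
    from pypdf import PdfReader

    seen = defaultdict(list)
    for path in ["form_a.pdf", "form_b.pdf"]:
        fields = PdfReader(path).get_fields() or {}
        for name, field in fields.items():
            seen[name].append((path, field.value))

    for name, hits in seen.items():
        if len(hits) > 1:
            print("collision:", name, hits)

Running it over test data before merging gives you a list of names to either rename or deliberately keep shared.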