Space getting added between characters while writing to PDF using binary write - vb.net

Here is the issue screenshot.
Here is the sample code.
Dim rawData As Byte() = System.Text.Encoding.UTF8.GetBytes("sample data") ' plain text bytes, not a PDF
Response.ContentType = "application/pdf"
Response.ContentEncoding = System.Text.Encoding.UTF8
Response.BinaryWrite(rawData)
Response.End()

The underlying issue here is that you actually are not writing a PDF at all!
Your code essentially returns plain text data and then claims that it is a PDF. That claim doesn't change the data in any way, though; it remains text and does not become a PDF.
The PDF viewer you use apparently still attempts to display what it received, but the result is very unsatisfying: a proportional font appears to be used in a monospaced manner, which is where the extra space between characters comes from.
If you actually want to return a PDF, you have to explicitly create one. PDF is a complicated binary format that is best created with a dedicated library.
Look for PDF libraries for your environment. Some offer explicit ways to add table or paragraph structures to the PDF; others create content by converting from another structured format, e.g. HTML.
The output of such a library is binary data in PDF format, which you can then return from your code using Response.BinaryWrite.
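For illustration, here is a minimal sketch of that idea using the Java edition of iText 5 (the .NET port, iTextSharp, mirrors this API almost method for method). It is only an assumption-laden example, not a drop-in replacement for your handler: it builds a small but genuine PDF in memory and hands back the bytes that a call like Response.BinaryWrite would then send.
import java.io.ByteArrayOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class PdfBytesDemo {
    // Builds a real PDF in memory and returns its bytes.
    public static byte[] buildPdf() throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, buffer);      // direct the PDF output into the memory buffer
        document.open();
        document.add(new Paragraph("sample data"));   // actual PDF content, not raw text bytes
        document.close();                             // finalizes the PDF structure
        return buffer.toByteArray();                  // genuine application/pdf bytes
    }
}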
A number of recent questions follow the same pattern: people have data in text or HTML format, return it with some binary content type set (PDF in this question, MS Office formats in others), and then assume they have thereby generated a file in that format.
This is wrong; claiming a format does not transform the data into that format!
All that setting the content type does is tell the client what kind of viewer to use to open the data.
Probably this anti-pattern came up because MS Word (and most likely other word processors, too) can also open plain text and HTML files and display them fairly properly. Thus, at first glance the anti-pattern appears to work somehow.
If you promised your client that your application returns MS Office documents, though, don't return HTML or plain text and claim it is an Office document; create actual MS Office documents instead. Otherwise knowledgeable clients will not accept your implementation, and clients who did accept it will eventually be told by knowledgeable users that you cheated them, which will at least damage your reputation.

Related

PDF Copy Text Issue: Weird Characters

I tried to copy text from a PDF file but get some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is the video https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other Stack Overflow answers, so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by drawing on information from beyond the PDF, or by applying OCR to the glyph in question.
That the different programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification and
the heuristics used by those programs differ considerably, and Okular's heuristics work best for your document.
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine-readable form, or the source is otherwise obligated to do so, they will usually decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
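If you first want to check programmatically whether your PDF carries ToUnicode CMaps at all, a quick diagnostic is possible with a library such as PDFBox. The following is only a rough sketch (PDFBox 2.x, Java; the file name is a placeholder), and note that a missing ToUnicode CMap is not by itself proof that extraction must fail, since the font's Encoding can still provide the mapping:
import java.io.File;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class ToUnicodeCheck {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("problem.pdf"))) { // placeholder file name
            int pageNumber = 1;
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                for (COSName fontName : resources.getFontNames()) {
                    PDFont font = resources.getFont(fontName);
                    // does the font dictionary carry a ToUnicode CMap?
                    boolean hasToUnicode = font.getCOSObject().containsKey(COSName.TO_UNICODE);
                    System.out.println("Page " + pageNumber + ", font " + font.getName()
                            + ": ToUnicode present = " + hasToUnicode);
                }
                pageNumber++;
            }
        }
    }
}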

Pass a bitmap object to Interop Word function that is expecting a filename string for a bitmap file

The title sounds insane but bear with me. This is a problem that could exist with any object.
I am generating a bitmap object in memory and I would like to pass it directly to another function that wants to open a bitmap file. The simple solution is to write the file to disk, call the function against the file, and then delete the file. I don't want to do that. If I am pushing a high volume of image objects into a Word document with a VSTO add-in, it doesn't make sense to thrash my disk for no reason when the whole thing could be done in memory.
I guess I am looking for a different function to insert a picture into a Word document that accepts a bitmap object. Or a way to pass a filesystem object that actually points to memory (not a RAMDisk, but a RAMFile?). Or a way to wire "Image.Save" directly to the reader of the "AddPicture" function without actually making a file on disk.
Hopefully, there is a better way of doing this.
Here is the code example:
Dim newImage = GenerateImage(InputString, SelectedFormat)
Dim imagePath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
newImage.Save(imagePath, ImageFormat.Png)
With Globals.ThisAddIn.Application
.Selection.InlineShapes.AddPicture(imagePath)
End With
File.Delete(imagePath)
Word can't "stream" content (see "Background", below), so your choices are 1) the Clipboard or 2) wrapping the bitmap in valid Word Open XML OPC flat-file format, which means first converting the bitmap to base64.
For the first, you can use standard .NET methods to place the information on the Clipboard in the format you want Word to use. In the Word "interop", the Paste or PasteSpecial methods will insert it. The argument against this approach is, as ever, "interfering" with the user's Clipboard.
Using Word Open XML is as close as you can get to "streaming" content into Word, using the Range.InsertXML method.
Word documents (and other Office files) are essentially "zip packages" of XML and binary files that together make up the document. It's possible to create and edit these files without opening them in the Word (Office) application, which makes the format suitable for server-side work. Any tool that can work with zip files and XML can be used for this; the standard is the Microsoft Open XML SDK, which offers a complete API over the Office content.
Word, alone of all the Office applications, enables the developer to read and write content in the opened Word document using the OPC flat-file standard. This "concatenates" the entire content of the zip package into an XML string. The Word object model's Range.InsertXML method is used to write content in this format to a Word document open in the Word application.
Information on how to convert a zip package into OPC flat file can be found in this blog article. Information concerning minimal Word Open XML to have a valid OPC version is described in this article; there is a section in there specifically about working with graphics.
Background
Word is based on very old technology - the late 1980s. By the mid-1990s it had reached a very high standard as a professional word processor, and what has happened with it since has mostly been "sugar coating" - adding a bit of this and a bit of that to bring it closer to HTML / page layout. But the core of the application remains the same... and part of that means Word isn't able to do many of the things the modern developer expects - such as "streaming" data in and out.

Search MS Word binary file for specific content

I have some .doc binary files stored in my database and I would now like to search them all (without converting them to .doc) to see which one contains the word "hello", for instance.
Is there any way to do this search in the binary file?
You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.
If you have the stream from the DB, then your code would look like this:
Aspose.Words.Document doc = new Aspose.Words.Document(streamObjectFromDatabase);
if (doc.GetText().ToLower().Contains("hello world"))
MessageBox.Show("Hello World exists");
Note: The benefit of this tool is that it does not require Word objects to be installed and it can work with streams in memory.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
Of course you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document meta data that's also stored in plain text? And there's always the off chance that the binary data will match your text pattern.
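To make that caveat concrete, here is a rough sketch of such a naive raw-byte search (Java; the choice of encodings is an assumption). Legacy .doc files typically store text either as single-byte ANSI or as UTF-16LE, so a blind search has to look for both byte patterns, and it still inherits all the false-positive risks just described:
import java.nio.charset.StandardCharsets;

public class NaiveDocSearch {
    // True if the needle occurs in the raw bytes either as single-byte (ANSI-ish) text
    // or as UTF-16LE text, the two encodings a legacy .doc commonly uses.
    public static boolean containsText(byte[] docBytes, String needle) {
        return indexOf(docBytes, needle.getBytes(StandardCharsets.ISO_8859_1)) >= 0
                || indexOf(docBytes, needle.getBytes(StandardCharsets.UTF_16LE)) >= 0;
    }

    private static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}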
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few different things, but I did not get very far with any of them:
Convert PDF to text. It does not work for me as I lose images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation wise, but I haven't been able to successfully parse the HTML.
Convert PDF to XML. Same as above.
Does anyone have any suggestions on how to tackle this problem?
There is essentially no easy cut-and-paste solution, because PDF isn't really interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).
On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".
To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider to be proper reading order).
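As a rough illustration of the "look at the text properties" idea, here is a sketch with PDFBox (2.x, Java). It only dumps each text fragment with its font size; the notion that fragments noticeably larger than the body text are heading candidates is an assumption you would still have to tune per document, and the file name is a placeholder:
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class FontSizeDump extends PDFTextStripper {
    public FontSizeDump() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        if (!textPositions.isEmpty()) {
            // font size (in points) of the first glyph of this fragment;
            // fragments much larger than the body text are heading candidates
            float fontSize = textPositions.get(0).getFontSizeInPt();
            System.out.printf("%5.1f pt  %s%n", fontSize, text);
        }
        super.writeString(text, textPositions);
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) { // placeholder file name
            FontSizeDump stripper = new FontSizeDump();
            stripper.setSortByPosition(true); // approximate reading order by coordinates
            stripper.getText(document);       // triggers the writeString callbacks above
        }
    }
}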
Parsing PDFs for headers and their sub-content is really very difficult (that doesn't mean it's impossible), as PDFs come in various formats. But I recently encountered a tool named GROBID which can help in this scenario. I know it's not perfect, but if we provide proper training it can accomplish our goals.
GROBID is available as open source on GitHub:
https://github.com/kermitt2/grobid
You may use the following approach with iTextSharp or other open-source libraries (a sketch of these steps follows below):
Read the PDF file with iTextSharp or a similar open-source tool and collect all text objects into an array (or convert the PDF to HTML using a tool like pdftohtml and then parse the HTML)
Sort all text objects by coordinates so you have them grouped together
Then iterate through the objects and check the distance between them to see whether two or more objects can be merged into one paragraph
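Here is a minimal sketch of the first two steps using the Java iText 5 parser API (iTextSharp exposes the same classes under .NET). The file name is a placeholder, and the top-to-bottom, left-to-right sorting rule is a naive assumption that a real implementation would refine:
import java.util.ArrayList;
import java.util.List;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class PositionedTextDump {
    static class Chunk {
        final int page;
        final String text;
        final float x, y;
        Chunk(int page, String text, float x, float y) {
            this.page = page; this.text = text; this.x = x; this.y = y;
        }
    }

    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("input.pdf"); // placeholder file name
        List<Chunk> chunks = new ArrayList<>();
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            final int pageNumber = page;
            parser.processContent(page, new RenderListener() {
                public void beginTextBlock() { }
                public void endTextBlock() { }
                public void renderImage(ImageRenderInfo info) { }
                public void renderText(TextRenderInfo info) {
                    // record every text chunk together with the start of its baseline
                    Vector start = info.getBaseline().getStartPoint();
                    chunks.add(new Chunk(pageNumber, info.getText(),
                            start.get(Vector.I1), start.get(Vector.I2)));
                }
            });
        }
        // naive reading order: by page, then top of page first (PDF y grows upwards), then left to right
        chunks.sort((a, b) -> {
            if (a.page != b.page) return Integer.compare(a.page, b.page);
            if (a.y != b.y) return Float.compare(b.y, a.y);
            return Float.compare(a.x, b.x);
        });
        for (Chunk c : chunks) {
            System.out.printf("p%d (%.1f, %.1f) %s%n", c.page, c.x, c.y, c.text);
        }
        reader.close();
    }
}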
Or you may use a commercial tool like ByteScout PDF Extractor SDK that is capable of doing exactly this:
extract text and images along with analyzing the layout of the text
output XML or CSV where text objects are merged or split into paragraphs inside a virtual layout grid
access objects via a special API that makes it possible to address each object by its "virtual" row and column index, regardless of how it is stored inside the original PDF.
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py or tabula-java.
I made a full tutorial on how to use tabula-py in this article. You can also run tabula in a web browser, as long as you have Java installed.
Unless it is Marked Content, PDF does not have a structure... You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the answers above, PDFs aren't very easy to parse. However, if you have certain additional information regarding the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
PDFBox is a PDF parsing tool that you can use for extracting text and images on top of which you can define your custom rules for parsing.
However, for parsing PDFs you need to have some prior knowledge of the general format of the PDF file. You can check out the following blog post, Document parsing, for more information regarding document parsing.
Disclaimer: I was involved in writing the blog post.
iText API:
PdfReader pr = new PdfReader("C:\\test.pdf");
References:
PdfReader

Create destinations for all bookmarks in a PDF file with iText API

I'd like to write some (java) code that takes a PDF document, and creates named destinations from all of the bookmarks. I think the iText API is the easiest way of doing this, but I have never used the API before.
How would you go about writing this sort of code with the iText API? Can iText do the parsing needed to manipulate existing PDFs by itself? The kind of manipulations I am thinking of are:
Open,
Find bookmarks,
Create destinations,
Save,
Close.
Or is there a different API that would be better?
Followup: I submitted a patch to iText a few months ago (it has now been accepted and is part of HEAD) that adds text parsing capabilities to iText. PdfBox (mentioned below) has (had?) problems with reading newer PDFs that use xref streams instead of the older xref table format.
Another library that is very good at parsing existing PDF files is PdfBox. It can also be used for modifying an existing PDF. FYI - this is the text parser that Lucene uses.
I will also mention that iText does have the ability to parse a PDF file, it's just not great at parsing the text content on each page. If you are looking at accessing the PDF's higher-level constructs (dictionaries, etc...) that are used for storing bookmarks, etc., and you don't mind getting your hands a little dirty with reading the PDF spec, you can absolutely do what you are asking about (we do it quite a bit ourselves).
The PDF Spec is big, but readable for the most part, and you don't have to worry about the bulk of it (which is geared towards actual page content and rendering) if all you are trying to do is extract bookmarks.
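To illustrate that higher-level route: iText 5 (Java) ships a SimpleBookmark helper that returns the outline tree as a List of HashMaps, which is exactly the shape you would then inspect when deciding how to create a named destination for each entry. This is only a sketch with a placeholder file name, and the keys shown in the comment are the typical ones rather than an exhaustive list:
import java.util.HashMap;
import java.util.List;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.SimpleBookmark;

public class BookmarkDump {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("input.pdf"); // placeholder file name
        List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
        if (bookmarks != null) {
            dump(bookmarks, "");
        }
        reader.close();
    }

    @SuppressWarnings("unchecked")
    static void dump(List<HashMap<String, Object>> bookmarks, String indent) {
        for (HashMap<String, Object> bookmark : bookmarks) {
            // typical keys: "Title", "Action" (e.g. "GoTo"), "Page" (e.g. "12 Fit"), "Kids" (children)
            System.out.println(indent + bookmark.get("Title") + " -> " + bookmark.get("Page"));
            List<HashMap<String, Object>> kids = (List<HashMap<String, Object>>) bookmark.get("Kids");
            if (kids != null) {
                dump(kids, indent + "  ");
            }
        }
    }
}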
I'll just warn you up front that you may be disappointed with this. iText isn't really intended to be used as a parser. It's really more for creating entirely new PDF documents, but you can take a whack at it.
To start, using iText you won't be able to modify the existing PDF document. What you can do, though, is make a copy with the additional features that you want. (If somebody else knows better, please let me know; this drives me crazy.)
What you will want to do is create a PdfReader object from an input stream on your source file. Then create a PdfCopy object (which is just an extended PdfWriter that makes getting data from an existing source more convenient) for your destination.
As far as I can tell, the bookmarks cannot be obtained from iText at all. Another library may be needed. I think JPedal may have the ability to extract them (it can get them as an XML document, which you may then have to parse to get what you want). However you get them, you can then add them to a java.util.List and set that list as the outline on the PdfCopy. The bookmarks themselves are just HashMaps with a particular set of keys. I'm not sure what all of the values are, but they include "Title", "Action" (which seems to be where you'd specify that this is a named destination, though I don't know what that value would be), and "URI" (which is used if this is an external link -- I suspect that this would specify the name of the named destination that you're linking to). Again, the specifics are hard to find.
Then iterate over the pages of the reader, importing each page to the PdfCopy. This page may help you.
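A minimal sketch of that PdfReader/PdfCopy skeleton in iText 5 (Java), with placeholder file names: it copies every page and re-attaches the bookmark list as the outline of the copy, which is the natural place to hang any destinations you derive from the bookmarks.
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.List;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.SimpleBookmark;

public class CopyWithOutlines {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("source.pdf");                                // placeholder
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("destination.pdf")); // placeholder
        document.open();
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            copy.addPage(copy.getImportedPage(reader, i)); // import each page unchanged
        }
        List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
        if (bookmarks != null) {
            copy.setOutlines(bookmarks); // set the bookmark list as the outline of the copy
        }
        document.close();
        reader.close();
    }
}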
Sorry I'm not more helpful to you. Good luck.
P.S. If anybody else knows of a better tool that's either (L)GPL or BSD licensed, I'd love to hear about it.