This file I downloaded is supposed to be a PDF (I think; it could be just a text file for all I know), but see the picture below for what the file looks like. Does anyone know what this is or whether it can be converted?
If it's from a PDF file, it is likely to be Flate encoded (the same type of compression as is used with zip files, but no, you cannot open a PDF file with a zip utility). This is the most common compression in a PDF for non-image data. It's not ASCIIHex or ASCII85 encoded. It could be, but likely isn't, LZW or RunLength (RLE) encoded. If it is image data, it could be CCITTFax, JBIG2, DCT (essentially JPEG), or JPX (JPEG 2000) encoded.
In some cases, parts of a PDF might be encoded by more than one of these filters, so a combination of, say, DCT and ASCII85 could be used, but this isn't as common anymore.
Or the PDF file could be encrypted, in which case the data may use RC4 or one of several flavors of AES encryption. It's also possible that custom encryption was used (e.g. if the PDF file is an e-book).
The screenshot you provided doesn't contain enough information to determine which of these applies to that particular part of the file, but the bottom line is that you need to read your PDF file with software that understands the PDF format; a text editor won't do.
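If it does turn out to be a Flate-encoded stream and the file isn't encrypted, any zlib-capable tool can inflate the data once you've isolated the raw bytes between the stream and endstream keywords. A minimal VB.net sketch (the file names are illustrative, and this assumes no other filter is layered on top):

Imports System.IO
Imports System.IO.Compression

Module FlateSketch
    Sub Main()
        ' Raw bytes copied from between "stream" and "endstream" in the PDF
        ' (the file names here are made up for illustration).
        Dim streamBytes As Byte() = File.ReadAllBytes("raw_stream.bin")

        ' PDF FlateDecode data is zlib-wrapped (RFC 1950); DeflateStream
        ' expects raw deflate data, so skip the 2-byte zlib header.
        Using input As New MemoryStream(streamBytes, 2, streamBytes.Length - 2)
            Using inflater As New DeflateStream(input, CompressionMode.Decompress)
                Using decoded As New MemoryStream()
                    inflater.CopyTo(decoded)
                    File.WriteAllBytes("decoded_stream.bin", decoded.ToArray())
                End Using
            End Using
        End Using
    End Sub
End Module

If the output is still unreadable, the stream was probably image data or the file is encrypted, and a proper PDF library is the sensible route.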
Related
Is it possible to obfuscate the bytes that are visible when a PDF file is opened with a hex editor? Also, I wonder whether there is any problem viewing the contents of the PDF file even if it is obfuscated.
You will always be able to see whatever bytes are within a file using a hex editor.
There might be ways to generate your pdf pages using methods that don't involve directly writing the text into the pdf (for example using javascript that's obfuscated).
As answered above, the bytes of the file are always visible when it is viewed with a hex editor. However, there are some options for hiding/protecting data in the file:
You could encrypt either the whole PDF or parts of its data. Note that encryption/decryption always requires a secret; when the file is fully encrypted, you can't read it without the key.
You can add additional similar data frames but set them invisible in the PDF. Note that this technique blows up the size of the file.
You can use scripting languages which dynamically build up your PDF. Be aware that this could look suspicious to users or to anti-virus software.
You can use steganography tools to hide your data. For example, one tool you could use is steghide.
You can simply compress data streams in the PDF, e.g. using gzip or similar compression tools, so that they can't be read directly. However, that is easy for anyone to recognize and to uncompress (see the sketch after this list).
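As an illustration of how easy that last point is to undo: gzip data announces itself with the magic bytes 1F 8B, and any stock library can decompress it. A rough VB.net sketch (the file name is illustrative):

Imports System.IO
Imports System.IO.Compression

Module GzipPeek
    Sub Main()
        Dim data As Byte() = File.ReadAllBytes("hidden_stream.bin")

        ' gzip is trivial to recognize by its magic bytes 1F 8B.
        If data.Length >= 2 AndAlso data(0) = &H1F AndAlso data(1) = &H8B Then
            Using input As New MemoryStream(data)
                Using gunzip As New GZipStream(input, CompressionMode.Decompress)
                    Using revealed As New MemoryStream()
                        gunzip.CopyTo(revealed)
                        Console.WriteLine(revealed.Length & " bytes recovered")
                    End Using
                End Using
            End Using
        End If
    End Sub
End Module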
I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is the video https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other stack overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by drawing on information from beyond the PDF, or by applying OCR to the glyph in question.
The fact that the programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification and
the heuristics used by those programs differ substantially, and Okular's happen to work best for your document.
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0" (see the sketch of such a map after this list).
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
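For orientation, a ToUnicode entry is a small CMap stream attached to the font dictionary. A minimal one for a single-byte font might look roughly like this (the character codes and Unicode values below are invented purely for illustration):

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<01> <0044>   % code 01 -> U+0044 (D)
<02> <2014>   % code 02 -> U+2014 (em dash)
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

Each beginbfchar line maps one character code used by the font to the Unicode value that text extractors should report; that mapping is exactly what is missing from your original file.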
I have a PDF which was produced by Ghostscript 8.15. I need to process this PDF with my software, which extracts font names from the PDF file and then performs some operations. But when I extract font names from this PDF file, the names are not the same as they should be. For example: the original font name is 'NOORIN05', but the PDF file contains 'TTE25A5F90t00'. How can I decode these font names back to the original names? All the fonts are TTF.
NOTE:
Why I need to extract the fonts.
Actually, there is a piece of software named InPage which was very popular in India and Pakistan for writing documents in the Urdu language, because before word processors supported Unicode it was the only practical way to type Urdu on a computer. Due to the complexity of the Urdu script, this software uses 89 font files named NOORIN01 to NOORIN89. The reason for using so many font files is to cover all the Urdu ligatures, of which there are more than 19,000; each font file can contain only 255 ligatures, so that was the technique used before Unicode. Copying and pasting text from a PDF generated by this software produces garbage in MS Word, for the reason given above: the 89 font files. So there was no way to extract text from such old PDF files. (Nowadays this software supports Unicode, but I am talking about the old files.)

So I developed software in C# to extract text from such old PDF files. The algorithm I am using: I created a database file that contains the names of all 89 font files together with all the ASCII codes, and in the next column I typed the corresponding Urdu ligature in Unicode. I process the PDF file character by character along with its font, match the font name against my database file, get the Unicode ligature from the database, and then display it in a text box. In this way I get the Unicode text successfully.

My software was working fine on many PDF files. But a few days ago I got a complaint from a person that my software fails to extract text from his PDF. When I tested it, I found that the PDF file doesn't contain the original font names, which is why my software is unable to process it further. When I checked the properties of this PDF file, it shows the PDF producer as GPL Ghostscript 8.15. So I searched the net and studied the documentation related to fonts, but still couldn't find any clue for decoding and recovering the original font names.
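For context, the matching step my software performs is simple. A stripped-down sketch of it in VB.net (rather than the actual C#; all names, codes and ligature strings below are invented for illustration, the real table is loaded from the database file):

Imports System.Collections.Generic

Module LigatureLookup
    ' Placeholder entries; the real software loads this table from the database file.
    Private ReadOnly LigatureTable As New Dictionary(Of String, String) From {
        {"NOORIN05|65", "ligature #1"},
        {"NOORIN05|66", "ligature #2"}
    }

    Function MapCharacter(fontName As String, code As Integer) As String
        Dim ligature As String = Nothing
        If LigatureTable.TryGetValue(fontName & "|" & code, ligature) Then
            Return ligature
        End If
        ' This is where it breaks down: the PDF reports a name like
        ' TTE25A5F90t00 instead of NOORIN05, so no key ever matches.
        Return "?"
    End Function

    Sub Main()
        Console.WriteLine(MapCharacter("NOORIN05", 65))      ' finds the ligature
        Console.WriteLine(MapCharacter("TTE25A5F90t00", 65)) ' fails: unknown font name
    End Sub
End Module

The lookup fails as soon as the PDF reports TTE25A5F90t00 instead of NOORIN05, which is exactly the problem with this Ghostscript-produced file.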
The first thing you should do is try a more recent version of Ghostscript. 8.15 is 14 years old; the current version is 9.21.
If that does not preserve the original names (potentially including the usual subset prefix) then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names, possibly you are attempting something which simply isn't possible.
[EDIT]
OK, so now I understand the problem. I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file, then it's simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14-year-old file.
I'm just beginning to learn about encoding issues, and I've learned just enough to know that it's far more complicated (on Windows, at least) than I had imagined, and that I have a lot more to learn.
I have an xml document that I believe is encoded with UTF-8. I'm using a VB.net app (with XslCompiledTransform and XmlTextWriter) to transform the xml into a column-specific text file. Some characters in the xml are coming out bad in the output text file. Example: an em-dash (—) is being turned into the three characters "â€”". When that happens, the columns in the file are thrown off.
As I understand it, an em-dash is not even a "Unicode character". I wouldn't expect to have issues with it. But, I can make the problem come and go by changing the encoding specified in the VB.net app.
If I use this, the em-dash is preserved:
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(True))
If I use this, the em-dash is corrupted into "—":
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(False))
The True/False simply tells VB whether to write the BOM at the beginning of the file. As I understand it, the BOM is neither necessary nor recommended for UTF-8. So, my preference is False - but then I get the weird characters.
I have several questions:
How can I be certain that the xml file is UTF-8? Is there a Windows tool that can tell me that?
How can I be certain that the transformed file is actually bad? Could it be that the real problem is the editor I'm using to look at it? Both EmEditor and UltraEdit show the same thing.
I've tried using the XVI32 hex editor to look at the file. I want to know what is actually written to disk, rather than what some GUI program is displaying to me. But, even on a file that looks good in EmEditor, XVI32 shows me the bad characters. Could it be that XVI32 just doesn't understand non-ASCII characters? What Windows hex editor would you recommend for this purpose?
The XML file is 650 MB, and the final text file is 380 MB - so that limits the list of useful tools somewhat.
You say 'As I understand it, an em-dash is not even a "Unicode character".' What do you mean by that? The Unicode character set definitely contains a code for em dash: 2014 hex. In the UTF-8 encoding, it will be 3 bytes: E2, 80, 94.
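Those three bytes are in fact exactly where "â€”" comes from: they are the em dash's UTF-8 encoding being re-read with the Windows-1252 code page. A small VB.net sketch of the effect (purely illustrative):

Imports System.Text

Module MojibakeDemo
    Sub Main()
        ' On .NET Core/5+ you would first need to register
        ' System.Text.Encoding.CodePages to get code page 1252.
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes("—")          ' E2 80 94
        Dim misread As String = Encoding.GetEncoding(1252).GetString(utf8Bytes)
        Console.WriteLine(misread)   ' prints the three characters from your question: â€”
    End Sub
End Module

With the BOM present, whatever reads the file recognizes it as UTF-8 and decodes those same three bytes back to a single em dash.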
I suspect Martin Honnen is right that your editor is simply not showing the file properly. A couple of suggestions:
I'm not familiar with the editors you mention, but editors that handle different encodings will often silently choose an encoding by which to interpret the file (based on the BOM if there is one, and sometimes based on the character codes they see). They also typically have some way of showing what encoding they are interpreting the file as, and a way of telling them to load (or reload) the file as a particular encoding. If your editor does not have these features, I suggest you get one that does, such as EditPlus or Notepad++.
As for the hex editor, again I'm not familiar with the one you mention, but the whole point of a hex editor is to see the raw bytes. Such editors often offer a text view as well (often side by side with the hex view), and if they do, I would not rely on their handling of encoding. Just use them to view the hex bytes and see if the bytes for your em dash are the same in both files.
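If you'd rather not trust any editor's interpretation at all, a few lines of VB.net will tell you whether a file starts with the UTF-8 BOM that UTF8Encoding(True) writes (the file name is illustrative):

Imports System.IO
Imports System.Linq
Imports System.Text

Module BomCheck
    Sub Main()
        Dim preamble As Byte() = New UTF8Encoding(True).GetPreamble() ' EF BB BF
        Dim head(2) As Byte
        Using fs As New FileStream("output.txt", FileMode.Open, FileAccess.Read)
            fs.Read(head, 0, 3)
        End Using
        Console.WriteLine("Starts with UTF-8 BOM: " & head.SequenceEqual(preamble).ToString())
    End Sub
End Module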
Another way viewing the file can go wrong: even if your editor is interpreting the file as UTF-8, not all fonts will have all Unicode characters in them, and for those characters not in the font it may display a little square or nothing at all. Try a few different fonts, or find one that purports to support Unicode (though no font supports ALL of Unicode, and there are several revisions of the Unicode spec which add more characters). Lucida Sans Unicode, I think, is one that will be on most Windows systems.
Another trick: I highly recommend the utility BabelMap. You can look up any unicode character there and see what the unicode value is, and you can copy the character from there and paste it into the file in your text editor and see how it displays it.
UltraEdit offers several configuration settings for working with UTF-8 encoded files. There is an "Auto detect UTF-8 files" setting in the File Handling - Unicode/UTF-8 Detection configuration dialog, which is enabled by default.
With this setting enabled, UltraEdit searches for the UTF-8 BOM. If none is present, it searches the first few KB for a UTF-8 declaration such as is usually present in the head of HTML/XHTML files or in the first line of an XML file. If there is no BOM and no standardized encoding information at the top of the file, UltraEdit searches within the first 64 KB for byte sequences which look like UTF-8 encoded characters. If such a byte sequence is found, the file is interpreted by UltraEdit as a UTF-8 encoded file. For example, a file containing only the 3 bytes E2 80 94 is interpreted as a UTF-8 encoded file.
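That last step amounts to checking whether the bytes form valid multi-byte UTF-8 sequences. In .NET terms the check looks roughly like this (a sketch of the idea only, not UltraEdit's actual code; the file name is illustrative):

Imports System.IO
Imports System.Linq
Imports System.Text

Module Utf8Sniff
    Function LooksLikeUtf8(chunk As Byte()) As Boolean
        ' A strict decoder throws on any byte sequence that is not valid UTF-8.
        ' (A multi-byte sequence cut off at the end of the chunk would also throw;
        ' that edge case is ignored here for brevity.)
        Dim strictUtf8 As New UTF8Encoding(encoderShouldEmitUTF8Identifier:=False,
                                           throwOnInvalidBytes:=True)
        Try
            strictUtf8.GetString(chunk)
            ' Pure ASCII also passes; a real detector would additionally require
            ' at least one multi-byte sequence such as E2 80 94.
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function

    Sub Main()
        Dim head(65535) As Byte
        Using fs As New FileStream("somefile.xml", FileMode.Open, FileAccess.Read)
            Dim count As Integer = fs.Read(head, 0, head.Length)
            Console.WriteLine(LooksLikeUtf8(head.Take(count).ToArray()))
        End Using
    End Sub
End Module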
UltraEdit indicates in the status bar at the bottom of the main window which encoding is detected and active (on save) for the active file. The status bar shows either UTF-8 or U8-, depending on whether the advanced or basic status bar is used and on the version of UltraEdit, as older versions have only the basic status bar.
Only files encoded in UTF-8 with no BOM, no UTF-8 character set or encoding declaration, and no UTF-8 encoded character within the first 64 KB are opened incorrectly as ANSI files. In such cases the user can use UltraEdit's enhanced File - Open command and explicitly select UTF-8 encoding before opening the file with the Open button.
For completeness, there is also a configuration setting which can be manually added to uedit32.ini that results in all files not detected as UTF-16 being opened as UTF-8 encoded files. This setting is useful for those who want to work only with UTF-8 encoded files, even when a file very often contains no characters with a code value greater than 127.
For more information about working with UTF-8 encoded files take a look in the UltraEdit forums. There are a few topics with lots of information about editing UTF-8 encoded files in UltraEdit.
When I buy ebooks I download all of the available formats. I've noticed that the file sizes for the various formats can be markedly different and epub is typically much smaller.
For example:
PDF - 5.7 MB;
ePub - 2.7 MB;
Mobi - 8.1 MB.
Or:
PDF - 4.5 MB;
ePub - 1.8 MB;
Mobi - 5.3 MB.
I've flipped through them and tried to confirm that the contents are the same and they seem to be (i.e. no large images missing). Can anyone explain why epub is so much smaller than the other two?
The mobi versions can be larger because they include the legacy mobi format, the new KF8 format and a copy of the original epub; this assumes the mobi file was generated with the latest version of kindlegen.
For the PDFs, I'm guessing (and that's all it is here) that embedded fonts may be the cause of the larger file size. Another thing that comes into play is image optimisation: the settings used when the PDF was created will largely affect the final file size.
Epubs are basically just a bunch of HTML, CSS and image files, plus a few XML files defining the book's metadata, chapter order and table-of-contents navigation. The epub file is really just a zip file with a .epub extension, and since it doesn't contain three copies of the same book like the Kindle version does, it will always be much smaller.
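You can see this for yourself by pointing any zip library at the file; a quick VB.net sketch (the file name is illustrative):

Imports System.IO.Compression

Module EpubPeek
    Sub Main()
        ' An .epub opens like any other zip archive.
        Using book = ZipFile.OpenRead("mybook.epub")
            For Each entry In book.Entries
                Console.WriteLine("{0,10:N0} bytes  {1}", entry.Length, entry.FullName)
            Next
        End Using
    End Sub
End Module

Renaming the .epub to .zip and opening it in any archive manager shows the same thing.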
Because an epub is similar to a website. An epub book is made from XHTML and CSS2 (plus some CSS3 features); the software that reads the epub then interprets those files and renders a visual representation from that code.
.epub files are compressed (in fact, they are just zip files).
.mobi files are not compressed. If you zip a mobi file, you may get a smaller file than the epub.
Incidentally, this makes text searching much faster on mobi files than on epub.
That depends on the format of the mobi that you have. As you must already be aware, an epub file can be converted into any ebook format that you choose - you can consider the epub format as the base for any other format.
I am guessing that the mobi file that you have has the original epub embedded inside it. This is to assist editing tools (as direct editing of mobi files is cumbersome). Also, some mobi files contain several versions of the mobi (mobi-7 and KF8) to maintain backward compatibility with readers that do not support the latest format.
You can find more information about the file formats here