Kind of a strange question, but I'm testing error handling for the case where a DICOM file's tags can't be read.
Unfortunately I don't have a damaged DICOM file available.
Specifically, can anyone advise how to write an incorrectly encoded text tag or an invalid numeric tag into a file, such that it can't be read by Python's pydicom package?
You could have a look at the dcmodify tool from DCMTK. It can be used to insert, modify and delete attributes. I doubt that it is possible to specify invalid attribute values through the command line (although you can definitely write attribute values that exceed the maximum length allowed by the Value Representation), but you could surely modify the source code to accomplish that.
My approach would be to create a buffer of characters and write binary data to it. Then pass it to the method that writes the value to the attribute.
Examples:
write byte sequences which are not valid UTF-8, i.e. do not decode to a Unicode character
write characters which are not covered by the character set specified by (0008,0005) - not sure whether pydicom would run into problems, but it would be wrong from the DICOM perspective
write non-numeric characters to attributes with Value Representation "Decimal String" or "Integer String"
use formats other than YYYYMMDD for VR "Date"
use formats other than HHMMSS.FFFFFF for VR "Time"
use characters other than ['0'-'9'], '.' for VR "Unique Identifier"
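To make the list above concrete, here is a minimal Python sketch (standard library only) that encodes single data elements with deliberately invalid values. It assumes Explicit VR Little Endian and only handles VRs with a 16-bit length field; the helper name encode_short_element and the chosen tags are just for illustration. The resulting bytes would still have to be spliced into a real file's dataset, which this sketch does not do, and whether pydicom raises, warns, or silently accepts each value depends on its version and configuration, so test that rather than assume it.

```python
import struct

def encode_short_element(group: int, element: int, vr: str, value: bytes) -> bytes:
    """Encode one data element as Explicit VR Little Endian.

    Hypothetical helper: only valid for VRs that use a 16-bit length
    field (DA, TM, DS, IS, UI, PN, ...).
    """
    if len(value) % 2:
        # DICOM values have even length; UI pads with NUL, text VRs with a space.
        value += b"\x00" if vr == "UI" else b" "
    header = struct.pack("<HH2sH", group, element, vr.encode("ascii"), len(value))
    return header + value

# Deliberately broken values, one per item in the list above.
BROKEN_ELEMENTS = {
    "PN with bytes that are not valid UTF-8": encode_short_element(0x0010, 0x0010, "PN", b"\xff\xfe\xfd"),
    "DS with non-numeric text": encode_short_element(0x0010, 0x1030, "DS", b"heavy"),
    "IS with non-numeric text": encode_short_element(0x0020, 0x0013, "IS", b"one"),
    "DA not in YYYYMMDD form": encode_short_element(0x0008, 0x0020, "DA", b"31-12-2020"),
    "TM not in HHMMSS.FFFFFF form": encode_short_element(0x0008, 0x0030, "TM", b"25:61"),
    "UI with letters": encode_short_element(0x0020, 0x000D, "UI", b"1.2.abc"),
}

for label, raw in BROKEN_ELEMENTS.items():
    print(f"{label}: {raw.hex()}")
```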
[edit]: DCMTK, dcmodify: http://dicom.offis.de/dcmtk.php.en
Related
I'm trying to read the official pdf specification "Document management — Portable document format — Part 1: PDF 1.7" (PDF32000_2008.pdf) as bytes and then interpret them according to that specification.
In Annex D, Character Sets and Encodings, there is a list of all named characters, like:
or
When I parse PDF32000_2008.pdf, there are also named characters like "f_f", "uni00D0" and "a204", which are missing in that specification.
My guess is that "f_f" is a symbol for two 'f' characters, which might get printed with a special glyph. There is a Unicode character "Latin Small Ligature Ff" for 'ff'.
For example, there is also "f_i" in that file, which I expect to mean 'fi', one glyph showing the two characters 'f' and 'i'. However, the PDF specification already has 'fi' as the named character "fi", so what is the point of having two named characters pointing to the same symbol?
I can imagine that "uni00D0" means the Unicode character 'Ð'. However, PDF already defines it as the named character "Eth".
What could "a204" be? Maybe ANSI 204 'Ì', which also already has a named character, "Igrave"?
Why do they also use "a62", which would be just a '>'?
However, my main question is: where can I find a specification for these additional named characters?
Of course, Adobe Acrobat understands them, and Gmail also seems to have no problem with them. So I guess their meaning must be specified somewhere.
Situation: I have a PDF using version 1.6. In that PDF, there are several streams. These streams contained compressed (Flate) text, so I decompressed them. After that, I extracted the Tj parts of the decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual question: As I have no idea what I've got there, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
Further research: The PDFs which I'm trying to analyze were generated by iTextSharp (which seems to be a C# library for generating PDFs). I don't know whether this is relevant, but it might be that the library uses a special way of encrypting its text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these string or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.
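As a rough illustration of the font tracking described above, here is a small Python sketch. It assumes a heavily simplified content stream (only Tf and hex-string Tj operators, single-byte codes) and hypothetical per-font lookup tables of the kind you would derive from each font's ToUnicode map or Encoding entry. It ignores text positioning entirely, so it returns text in drawing order, not necessarily reading order.

```python
import re

# Hypothetical code-to-Unicode tables, one per font resource name.
# In a real PDF these come from the font's ToUnicode CMap or Encoding.
FONT_MAPS = {
    "F1": {1: "H", 2: "e", 3: "l", 4: "o"},
    "F2": {1: "W", 2: "o", 3: "r", 4: "l", 5: "d"},
}

# Matches either "/Name size Tf" or "<hex digits> Tj" (simplified syntax).
TOKEN = re.compile(rb"/(\w+)\s+[\d.]+\s+Tf|<([0-9A-Fa-f\s]*)>\s*Tj")

def extract_text(content: bytes) -> str:
    current_font = None
    pieces = []
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            # Tf: remember which font the following strings refer to.
            current_font = match.group(1).decode("ascii")
        else:
            # Tj: interpret each byte as a character code in the current font.
            codes = bytes.fromhex(match.group(2).decode("ascii"))
            table = FONT_MAPS.get(current_font, {})
            pieces.append("".join(table.get(code, "\ufffd") for code in codes))
    return "".join(pieces)

stream = b"BT /F1 12 Tf <0102030304> Tj /F2 12 Tf <0102030405> Tj ET"
print(extract_text(stream))  # prints "HelloWorld"
```

Note that the same codes (01, 02, 03, ...) produce different characters under F1 and F2, which is exactly why the strings look like garbage without the font information.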
I need to know the encoding of the values of PDF dictionaries (not the text displayed to the user but the "code behind").
I plan not to use any library for that.
Where can I find it?
the encoding of the values of PDF dictionaries
Values of PDF dictionaries are PDF objects.
You should take a look at the PDF specification ISO 32000-1, in particular chapter 7 Syntax, to find out about PDF objects. You will find:
The tokens that delimit objects and that describe the structure of a PDF file shall use the ASCII character set. In addition all the reserved words and the names used as keys in PDF standard dictionaries and certain types of arrays shall be defined using the ASCII character set.
Thus, most of the time you have to deal with ASCII values.
The situation is tricky with strings, though, because there are several types of strings which use the same string syntax options, so you have to interpret their contents according to their context.
Table 35 – String Object Types
Type | Description
text string | Shall be used for human-readable text, such as text annotations, bookmark names, article names, and document information. These strings shall be encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker. This type is described in 7.9.2.2, "Text String Type."
PDFDocEncoded string | Shall be used for characters and glyphs that are represented in a single byte, using PDFDocEncoding. This type is described in 7.9.2.3, "PDFDocEncoded String Type."
ASCII string | Shall be used for characters that are represented in a single byte using ASCII encoding.
byte string | Shall be used for binary data represented as a series of bytes, where each byte can be any value representable in 8 bits. The string may represent characters but the encoding is not known. The bytes of the string need not represent characters. This type shall be used for data such as MD5 hash values, signature certificates, and Web Capture identification values. This type is described in 7.9.2.4, "Byte String Type."
If a string is the value e.g. of the Author metadata, it is a text string, so it is encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker.
If on the other hand a string is the value e.g. of Contents in a signature dictionary, it is a byte string holding a binary object; any attempt to interpret it according to some encoding will fail.
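For the text string case, the distinction can be sketched in a few lines of Python. This is only an approximation: Python has no built-in PDFDocEncoding codec, so the sketch falls back to Latin-1 as a stand-in, which agrees with PDFDocEncoding for printable ASCII and most of the upper range but not for every code. A real implementation would need the PDFDocEncoding table from Annex D of the specification. The function name decode_text_string is made up for this example.

```python
import codecs

def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF *text string* (not a byte string)."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Leading FE FF byte-order marker: the rest is UTF-16BE.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # ASSUMPTION: Latin-1 used as a rough stand-in for PDFDocEncoding.
    return raw.decode("latin-1")

print(decode_text_string(b"\xfe\xff\x00J\x00\xf6\x00r\x00g"))  # UTF-16BE with BOM
print(decode_text_string(b"Jane Doe"))                         # plain single-byte text
```

Applying this to a byte string such as a signature's Contents value would of course just produce nonsense characters.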
The situation is even more tricky with streams.
First of all the stream content may be somehow processed, e.g. it may be compressed. To get to the actual stream contents, you first have to undo this processing.
Then the content may either be binary, e.g. a font program, or text, e.g. JavaScript, or it may be a content stream, e.g. the page contents.
A content stream is a PDF stream object whose data consists of a sequence of instructions describing the graphical elements to be painted on a page. The instructions shall be represented in the form of PDF objects, using the same object syntax as in the rest of the PDF document.
Thus, they are mostly ASCII values. The exception again are string arguments to text drawing instructions. Their encoding depends entirely on the font currently selected when the string is drawn, and fonts may use standard encodings, but they may also use completely chaotic, ad-hoc encodings.
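Undoing the stream processing mentioned above is, for the most common case, a one-liner. The following Python sketch assumes the stream dictionary names only /FlateDecode as its filter, with no predictor parameters and no filter chain; the sample content stream is made up so that the example is self-contained.

```python
import zlib

def decode_flate_stream(data: bytes) -> bytes:
    """Undo a single /FlateDecode filter (no predictor, no filter chain)."""
    return zlib.decompress(data)

# Made-up page content, compressed the way a PDF producer would store it.
original = b"BT /F1 12 Tf (Hello) Tj ET"
stored = zlib.compress(original)

decoded = decode_flate_stream(stored)
print(decoded)
```

What comes out is the content stream itself; the string operands inside it still need the font-dependent decoding described above.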
PS: If you happen to try and analyze an encrypted PDF, you will find that Encryption applies to all strings and streams in the document's PDF file, with very few exceptions. In particular encryption does not apply to dictionary and array structures, numbers and names. Thus, someone not aware of this might not recognize that the PDF is encrypted but instead assume that strings and streams are encoded in a very weird way.
You find that in the PDF specification (http://www.adobe.com/devnet/pdf/pdf_reference.html). To elaborate a bit on the most important points in your question...
1) PDF dictionaries can contain a variety of value types (booleans, numbers, strings...). The encoding you are going to encounter depends on the type of value.
2) Mostly, the interesting and complex case is that where the type of object is a string.
3) For a string, read section 7.9.2 in the PDF specification. That explains what encodings can be used for such strings (PDFDocEncoding, Unicode encoding...) and how to recognise what encoding you have for a particular string.
To complement @mkl's and @DavidvanDriessche's excellent answers...
Here are three open-source command line tools which can help you to transform any PDF into different forms which expand/uncompress/decode object streams (note that there is no single "one-and-only-correct" way to do this, so the outputs of the tools will differ):
pdftk
mutool
qpdf
Each of these should be available via your favorite operating system's package manager.
pdftk example usage:
pdftk in.pdf cat output out1.pdf uncompress
mutool example usage:
mutool clean -d in.pdf out2.pdf
qpdf example usage (my favorite tool for this purpose):
qpdf --qdf --object-streams=disable in.pdf out3.pdf
You should try each of these, compare their outputs for different input PDFs and then decide which one is your favorite (but remember the other tools when you encounter a case where your favorite shows unexpected results).
PDF content is saved in several ways: "(abc) Tj", "(<0035><0035>) Tj" or "\u065".
I want to know if there is a way to convert the PDF code to one type, no matter whether it is direct text "(abc) Tj", hexadecimal "(<0035><0035>) Tj", or octal "\u065".
I think that if the PDF were converted to a single encoding, it would be easier to analyse the content.
Is it possible to use Ghostscript or something to do that? Thanks
Essentially, no, there is no way to do so. There are two kinds of string: regular strings, delimited by '(' and ')', and hex strings, delimited by '<' and '>'. Hex strings need no escaping, whereas regular strings need escape sequences for 'special' characters such as carriage return and linefeed. Octal escapes are also permitted in regular strings.
PDF producers are free to mix and match these all they like, but in general a given PDF producer will usually use one technique throughout.
Because Ghostscript's pdfwrite device is a PDF producer, it will (I believe) generally produce all its output the same way.
What it won't do is 'convert' your original PDF file. It produces a brand new PDF file which should look visually identical but whose internals bear no resemblance to your original PDF. In addition some metadata or fidelity may be lost.
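That said, both string syntaxes describe the same thing, a sequence of bytes, so an analysis script can normalize them itself while reading instead of rewriting the file. Below is a minimal Python sketch following the string rules in section 7.3.4 of ISO 32000-1. The function names are made up, the inputs are the string bodies without their outer delimiters, and it leaves out one detail (an unescaped end-of-line inside a literal string should be normalized to a single linefeed).

```python
# Single-character escapes allowed after a backslash in a literal string.
ESCAPES = {ord("n"): 10, ord("r"): 13, ord("t"): 9, ord("b"): 8, ord("f"): 12,
           ord("("): 40, ord(")"): 41, ord("\\"): 92}

def decode_literal(body: bytes) -> bytes:
    """Decode the bytes between the outer '(' and ')' of a literal string."""
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:                      # ordinary byte, copy as-is
            out.append(c)
            i += 1
            continue
        i += 1                             # skip the backslash
        if i >= len(body):
            break
        c = body[i]
        if c in ESCAPES:
            out.append(ESCAPES[c])
            i += 1
        elif 0x30 <= c <= 0x37:            # octal escape, 1 to 3 digits
            j = i
            while j < len(body) and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)   # high-order overflow ignored
            i = j
        elif c in (0x0D, 0x0A):            # backslash + end-of-line: continuation
            i += 1
            if c == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:                              # unknown escape: drop the backslash
            out.append(c)
            i += 1
    return bytes(out)

def decode_hex(body: bytes) -> bytes:
    """Decode the characters between '<' and '>' of a hex string."""
    digits = "".join(body.decode("ascii").split())   # whitespace is ignored
    if len(digits) % 2:
        digits += "0"                      # odd count: a trailing 0 is assumed
    return bytes.fromhex(digits)

print(decode_literal(rb"a\142c"), decode_hex(b"61 62 63"))
```

Both calls in the last line yield the same three bytes; what those bytes mean as text still depends on the font in effect, as with any string operand.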
I have a string variable txt. It contains the "°" degree symbol. I would like to save the string into an ASCII-encoded CSV file. I use the procedure below, but the "°" symbol is converted to "?". Do you have any idea how to save the degree symbol properly?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
    Try
        ' Encoding.ASCII replaces characters outside 7-bit ASCII (such as "°") with "?"
        Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
            OutFile.Write(txt)
        End Using
        Me.Write_Log("Successfully Exported")
    Catch ex As Exception
        Me.Write_Log("Write Error during export")
    End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit ASCII encodings. For English, you'd probably be most interested in using the ISO 8859-1 code page, since that's the most standard-ish one there is of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
    OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
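To see the trade-off in actual bytes, here is a short sketch. It is written in Python rather than VB.NET purely because the byte values are easy to display there; the encodings behave the same way regardless of language. The sample text is made up.

```python
text = "25 °C"

# 7-bit ASCII has no degree sign, so the encoder substitutes "?".
ascii_bytes = text.encode("ascii", errors="replace")

# ISO 8859-1 stores the degree sign as the single byte 0xB0.
latin1_bytes = text.encode("iso-8859-1")

# UTF-8 stores it as the two bytes 0xC2 0xB0.
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 output with the wrong 8-bit code page garbles the symbol.
misread = utf8_bytes.decode("iso-8859-1")

print(ascii_bytes, latin1_bytes, utf8_bytes, misread)
```

The last value shows the kind of mismatch described above: the file is fine, but a reader assuming a different encoding sees an extra character in front of the degree sign.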
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually be wanting IBM437 encoding rather than ISO 8859-1. IBM437 is the old code page that was used by default in MS-DOS. If you really need to use that code page, you may have additional trouble for two reasons. As you'll see in the MSDN article, that code page is not well supported in the .NET framework. In my testing, outputting the Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", {167})
The second problem is that IBM437 is likely not the default code page for your Windows OS, so even when it is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a Windows application such as Notepad.
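If you are unsure which code page a target system expects, it can help to check what byte 167 and the degree sign actually map to in each candidate. The sketch below uses Python's built-in codec tables (again Python only for easy inspection, not as a replacement for the VB.NET code). According to those tables, byte 167 in IBM437 is the masculine ordinal indicator 'º', which looks much like a degree sign, while the degree sign itself sits at byte 248 (0xF8); in ISO 8859-1 and Windows-1252, byte 167 is '§' and the degree sign is byte 176 (0xB0).

```python
DEGREE = "\u00b0"  # the degree sign

# For each code page: which byte encodes the degree sign,
# and which character byte 167 (0xA7) decodes to.
table = {}
for name in ("cp437", "iso-8859-1", "cp1252"):
    degree_byte = DEGREE.encode(name)[0]
    char_at_167 = bytes([167]).decode(name)
    table[name] = (degree_byte, char_at_167)
    print(f"{name}: degree sign = byte {degree_byte}, byte 167 = {char_at_167!r}")
```

So it is worth confirming with the consumer of the CSV file which byte it really expects before hard-coding 167.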