How to prevent a plus sign from creating a line feed in an RDLC textbox

I need to print an encrypted string as-is in an RDLC report. My problem is that if the string contains a plus sign, it creates a new line in the textbox. How can I avoid this?

Encryption produces output that is binary and contains many bytes that have no displayable representation.
Because of this, if encrypted data needs to be displayed, it is generally encoded as either Base64 (best for computers) or hexadecimal (best for people).
It seems that you may have Base64-encoded encrypted data, which is generally composed of the uppercase and lowercase letters, the 10 digits, "+", "/" and "=". You cannot delete these characters and expect to recover the encrypted data.
If these characters present a problem, they can often be escaped in some manner, or another encoding can be chosen, such as hexadecimal or an alternate Base64 character set (see Base64). If you choose an alternate Base64 character set, interoperability will most likely be impaired.
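As a rough illustration of the alternatives mentioned above, here is a minimal Python sketch. The `ciphertext` bytes are an invented stand-in for real encrypted data, chosen so that the standard Base64 form contains "+" and "/". The URL-safe alphabet is one example of an alternate Base64 character set; whether the consumer of the report can decode it is an assumption you would need to check.

```python
import base64

# Stand-in for real ciphertext (an assumption for illustration only).
ciphertext = b"\xfb\xef\xbe\xff\xff\xff\x00"

# Standard Base64: may contain "+", "/" and "=".
standard = base64.b64encode(ciphertext).decode("ascii")

# URL-safe Base64: "-" and "_" replace "+" and "/".
url_safe = base64.urlsafe_b64encode(ciphertext).decode("ascii")

# Hexadecimal: only the digits 0-9 and the letters a-f.
hexadecimal = ciphertext.hex()

# Every form decodes back to exactly the same bytes.
assert base64.b64decode(standard) == ciphertext
assert base64.urlsafe_b64decode(url_safe) == ciphertext
assert bytes.fromhex(hexadecimal) == ciphertext
```

Hexadecimal doubles the length of the data, while either Base64 variant grows it by roughly a third, so the choice is a trade-off between size and a problem-free character set.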
Note: More information would produce a better answer.

I had to replace the "+" with "÷".
Users don't notice it, since the PDF is just a visual representation of the CFDI, and I haven't had any issues with it.

Related

How to render a DICOM file's header unreadable

Kind of a strange question, but I'm doing some testing to handle errors when a DICOM file's tags can't be read.
Unfortunately I don't have a damaged DICOM file available.
Specifically, can anyone advise how to apply some sort of incorrectly encoded text tag or some invalid numeric data tag onto the file, such that it can't be read by Python's pydicom package?
You could have a look at the dcmodify tool from DCMTK. It can be used to insert, modify and delete attributes. I doubt that it is possible to specify invalid attribute values through the command line (although you can definitely write attribute values that exceed the maximum length allowed by the Value Representation), but you could surely modify the source code to accomplish that.
My approach would be to create a buffer of characters and write binary data to it. Then pass it to the method that writes the value to the attribute.
Examples:
write UTF-8 byte sequences that do not form a valid Unicode character
write characters which are not covered by the character set specified by (0008,0005) - not sure whether pydicom would run into problems, but it would be wrong from the DICOM perspective
write non-numeric characters to attributes with Value Representation "Decimal String" or "Integer String"
use formats other than YYYYMMDD for VR "Date"
use formats other than HHMMSS.FFFFFF for VR "Time"
use characters other than ['0'-'9'] and '.' for VR "Unique Identifier"
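The examples above can be sketched in Python as raw byte strings, one per Value Representation. This is only an illustration using the standard library: the sample values and the simplified format checks are assumptions made here, looser than the full DICOM rules, and the bytes would still have to be written into an element's value field by whatever tool you choose.

```python
import re

# Deliberately malformed raw values, keyed by Value Representation (VR).
# These byte strings are invented for illustration.
INVALID_VALUES = {
    "PN": b"\xff\xfe\xfa",   # not a valid UTF-8 sequence
    "DS": b"12.5abc",        # non-numeric characters in a Decimal String
    "IS": b"forty-two",      # non-numeric characters in an Integer String
    "DA": b"2024-13-45",     # not YYYYMMDD
    "TM": b"25:61:61",       # not HHMMSS.FFFFFF
    "UI": b"1.2.840.x.y",    # characters other than digits and "."
}

# Simplified format checks (a sketch, not the complete DICOM definitions).
PATTERNS = {
    "DS": rb"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?",
    "IS": rb"[+-]?\d+",
    "DA": rb"\d{8}",
    "TM": rb"\d{2}(\d{2}(\d{2}(\.\d{1,6})?)?)?",
    "UI": rb"[0-9.]+",
}


def is_valid(vr: str, raw: bytes) -> bool:
    """Return True if raw passes the simplified check for the given VR."""
    if vr == "PN":
        # Text check: the bytes must at least decode as UTF-8.
        try:
            raw.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False
    return re.fullmatch(PATTERNS[vr], raw.strip()) is not None


# Every sample value is rejected by its own check.
for vr, raw in INVALID_VALUES.items():
    assert not is_valid(vr, raw)
```

Whether a given reader actually fails on such values depends on how strictly it validates, so each case is worth testing separately.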
[edit]: DCMTK, dcmodify: http://dicom.offis.de/dcmtk.php.en

Encoding of PDF dictionaries

I need to know the encoding of the values of PDF dictionaries (not the text displayed to the user but the "code behind").
I plan not to use any library for that.
Where can I find it?
the encoding of the values of PDF dictionaries
Values of PDF dictionaries are PDF objects.
You should take a look at the PDF specification ISO 32000-1, in particular chapter 7 Syntax, to find out about PDF objects. You will find:
The tokens that delimit objects and that describe the structure of a PDF file shall use the ASCII character
set. In addition all the reserved words and the names used as keys in PDF standard dictionaries and
certain types of arrays shall be defined using the ASCII character set.
Thus, most of the time you have to deal with ASCII values.
The situation is tricky with strings, though, because there are several types of strings which use the same string syntax options, so you have to interpret their contents according to their context.
Table 35 – String Object Types
Type: Description
text string: Shall be used for human-readable text, such as text annotations, bookmark names, article names, and document information. These strings shall be encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker. This type is described in 7.9.2.2, "Text String Type."
PDFDocEncoded string: Shall be used for characters and glyphs that are represented in a single byte, using PDFDocEncoding. This type is described in 7.9.2.3, "PDFDocEncoded String Type."
ASCII string: Shall be used for characters that are represented in a single byte using ASCII encoding.
byte string: Shall be used for binary data represented as a series of bytes, where each byte can be any value representable in 8 bits. The string may represent characters but the encoding is not known. The bytes of the string need not represent characters. This type shall be used for data such as MD5 hash values, signature certificates, and Web Capture identification values. This type is described in 7.9.2.4, "Byte String Type."
If a string is the value e.g. of the Author metadata, it is a text string, so it is encoded using either PDFDocEncoding or UTF-16BE with a leading byte-order marker.
If, on the other hand, a string is the value e.g. of Contents in a signature dictionary, it is a byte string holding a binary object; any attempt to interpret it according to some encoding will fail.
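To make the text string case concrete, here is a minimal Python sketch of the decision described above. The function name `decode_text_string` is hypothetical, and Latin-1 is used only as an approximation of PDFDocEncoding: the two differ for some byte values, so a complete implementation would need the mapping table from the specification.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string (for example the Author entry)."""
    if raw.startswith(b"\xfe\xff"):
        # A leading byte-order marker means the rest is UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Otherwise the string is PDFDocEncoded.
    # ASSUMPTION: Latin-1 stands in for PDFDocEncoding here; it matches
    # for plain ASCII text but not for every byte value.
    return raw.decode("latin-1")


print(decode_text_string(b"\xfe\xff\x00J\x00\xf6"))  # UTF-16BE branch
print(decode_text_string(b"Jane"))                   # PDFDocEncoding branch
```

A byte string such as a signature's Contents should never be passed through a function like this; it has to be kept as raw bytes.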
The situation is even more tricky with streams.
First of all the stream content may be somehow processed, e.g. it may be compressed. To get to the actual stream contents, you first have to undo this processing.
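As one example of undoing such processing, the sketch below assumes a stream whose filter is FlateDecode with no predictor, which corresponds to zlib compression. The sample content is invented, and locating the stream data inside a real file is left out.

```python
import zlib

# Hypothetical page content, compressed the way a FlateDecode stream would be.
original = b"BT /F1 12 Tf (abc) Tj ET"
stored = zlib.compress(original)

# Undo the filter to get back the actual stream contents.
decoded = zlib.decompress(stored)
assert decoded == original
```

Other filters, and chains of several filters, need their own decoders applied in the order given by the stream dictionary.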
Then the content may either be binary, e.g. a font program, or text, e.g. JavaScript, or it may be a content stream, e.g. the page contents.
A content stream is a PDF stream object whose data consists of a sequence of instructions describing the
graphical elements to be painted on a page. The instructions shall be represented in the form of PDF objects,
using the same object syntax as in the rest of the PDF document.
Thus, they are mostly ASCII values. The exceptions, again, are string arguments to text drawing instructions. Their encoding depends entirely on the font currently selected when the string is drawn, and fonts may use standard encodings, but they may also use completely chaotic, ad-hoc encodings.
PS: If you happen to try to analyze an encrypted PDF, you will find that encryption applies to all strings and streams in the document's PDF file, with very few exceptions. In particular, encryption does not apply to dictionary and array structures, numbers and names. Thus, someone not aware of this might not recognize that the PDF is encrypted but instead assume that strings and streams are encoded in a very weird way.
You can find that in the PDF specification (http://www.adobe.com/devnet/pdf/pdf_reference.html). To elaborate a bit on the most important points in your question...
1) PDF dictionaries can contain a variety of value types (booleans, numbers, strings...). The encoding you are going to encounter depends on the type of value.
2) Mostly, the interesting and complex case is that where the type of object is a string.
3) For a string, read section 7.9.2 in the PDF specification. That explains what encodings can be used for such strings (PDFDocEncoding, Unicode encoding...) and how to recognise what encoding you have for a particular string.
To complement @mkl's and @DavidvanDriessche's excellent answers...
Here are three open-source command line tools which can help you to transform any PDF into different forms which expand/uncompress/decode object streams (Note, there is not one single, "the-one-and-only-correct" way to do this -- so the outputs of each of the tools will be different):
pdftk
mutool
qpdf
Each of these should be available via your favorite operating system's package manager.
pdftk example usage:
pdftk in.pdf cat output out1.pdf uncompress
mutool example usage:
mutool clean -d in.pdf out2.pdf
qpdf example usage (my favorite tool for this purpose):
qpdf --qdf --object-streams=disable in.pdf out3.pdf
You should try each of these, compare their outputs for different input PDFs and then decide which one is your favorite (but keep the other tools in mind for cases where your favorite shows unexpected results).

How to convert the PDF content code to the type like "(<0034>) Tj"?

PDF content is saved in several ways: "(abc) Tj", "(<0035><0035>) Tj" or "\u065".
I want to know whether there is a way to convert the PDF code to a single type, regardless of whether it is direct text "(abc) Tj", hexadecimal "(<0035><0035>) Tj", or octal "\u065".
I think that if the PDF were converted and encoded to one type, it would be easier to analyse the content.
Is it possible to use Ghostscript or something similar to do that? Thanks
Essentially, no, there is no way to do so. There are two kinds of string, regular strings '(' and ')' delimited, and hex strings '<' and '>' delimited. Hex strings need not be escaped whereas regular text strings do need to be for 'special' characters, like carriage return and linefeed. Octal is also permitted in regular strings.
PDF producers are free to mix and match these all they like, but in general a given PDF producer will usually use one technique throughout.
Because Ghostscript's pdfwrite device is a PDF producer, it will (I believe) generally produce all its output the same way.
What it won't do is 'convert' your original PDF file. It produces a brand new PDF file which should look visually identical but whose internals bear no resemblance to your original PDF. In addition some metadata or fidelity may be lost.
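While no tool rewrites an existing file into a single string style, a program that reads the content can normalise both string syntaxes to the same bytes itself. The following Python sketch shows the idea under some assumptions: the function names are hypothetical, the input is the text between the delimiters, and the meaning of the resulting bytes still depends on the font in use.

```python
import re

# Single-character escapes allowed inside a '(...)' literal string.
_ESCAPES = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
}


def literal_to_bytes(body: bytes) -> bytes:
    """Decode the inside of a '(...)' literal string to raw bytes."""
    def replace(match):
        esc = match.group(1)
        if esc[:1] in b"01234567":
            # One to three octal digits give a single byte value.
            return bytes([int(esc, 8) & 0xFF])
        if esc in (b"\r\n", b"\r", b"\n"):
            # Backslash followed by end-of-line is a line continuation.
            return b""
        # Known escape, or an unknown one where the backslash is dropped.
        return _ESCAPES.get(esc, esc)

    return re.sub(rb"\\([0-7]{1,3}|\r\n|\r|\n|.)", replace, body, flags=re.DOTALL)


def hex_to_bytes(body: bytes) -> bytes:
    """Decode the inside of a '<...>' hex string to raw bytes."""
    digits = b"".join(body.split())   # whitespace is ignored
    if len(digits) % 2:
        digits += b"0"                # an odd final digit is padded with 0
    return bytes.fromhex(digits.decode("ascii"))


def to_hex_syntax(data: bytes) -> bytes:
    """Re-emit raw bytes as one uniform hex string."""
    return b"<" + data.hex().upper().encode("ascii") + b">"


print(to_hex_syntax(literal_to_bytes(rb"a\065\(b")))
print(to_hex_syntax(hex_to_bytes(b"0035 0035")))
```

Once both kinds of string are reduced to bytes, they can be compared or re-emitted in one style for analysis, without touching the original file.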

Saving CSV file with degree symbol and ASCII encoded

I have a string variable txt. It contains the "°" degree symbol. I would like to save the string into an ASCII-encoded CSV file. I use the procedure below, but the "°" symbol is converted to "?". Do you have any idea how to save the degree symbol properly?
Public Sub Write_File(ByVal txt As String, ByVal fName As String)
    Try
        Using OutFile As New StreamWriter(fName, False, Text.Encoding.ASCII)
            OutFile.Write(txt)
        End Using
        Me.Write_Log("Successfully Exported")
    Catch ex As Exception
        Me.Write_Log("Write Error during export")
    End Try
End Sub
Encoding.ASCII is for the standard 7-bit ASCII encoding, which does not contain a degree symbol at all. In order to get a degree symbol in ASCII, you would have to use one of the many 8-bit ASCII encodings. For English, you'd probably be most interested in using the ISO 8859-1 code page, since that's the most standard-ish one there is of the bunch. For instance, instead of using Encoding.ASCII, you could do something like this:
Using OutFile As New StreamWriter(fName, False, Text.Encoding.GetEncoding("iso-8859-1"))
    OutFile.Write(txt)
End Using
For a complete list of available encodings, use the Encoding.GetEncodings method, or look at the list of supported ones in the MSDN documentation.
Of course, none of the various 8-bit ASCII encodings are compatible with each other, so, if you do use that, the degree symbol will be a completely different symbol when viewed on a system that uses a different code page by default. That is precisely why UTF-8 has become the new standard. Usage of 8-bit ASCII is widely discouraged since it is practically unworkable in multi-cultural scenarios. If you can use UTF-8 instead, I would. If you must use ASCII, it's best to stick to the standard 7-bit encoding. If you must use an 8-bit ASCII encoding, please do so sparingly and with full awareness of its drawbacks.
One more thing. You mention the degree symbol as being character 167 (0xA7) in your desired target encoding. If that is the case, you may actually be wanting IBM437 encoding rather than ISO 8859-1. (Strictly speaking, byte 167 in IBM437 is the masculine ordinal indicator "º", which closely resembles a degree sign; the degree sign itself is byte 248, 0xF8, in that code page.) IBM437 is the old code page that was used by default in MS-DOS. If you really need to use that code page, you may have additional trouble for two reasons. First, as you'll see in the MSDN article, that code page is not well supported in the .NET framework. In my testing, outputting the Unicode string containing the degree symbol using that encoding did not work properly. Therefore, you may find yourself needing to use a byte array to represent the data rather than a String variable (which is Unicode). For instance:
File.WriteAllBytes("Test.txt", {167})
The second problem is that IBM437 is likely not the default code page for your Windows OS, so even when it is written to the file as byte value 167, it won't actually look like a degree symbol when you view it in a Windows application such as Notepad.
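To see how the same character maps to different bytes, here is a short Python sketch (Python is used only for brevity; the byte values do not depend on the language). It encodes the degree sign in the encodings discussed above and decodes byte 167 in two code pages.

```python
degree = "\u00b0"  # the degree sign

# 7-bit ASCII has no degree sign, so it is replaced with "?".
as_ascii = degree.encode("ascii", errors="replace")

# Single-byte code pages place it at different byte values.
as_latin1 = degree.encode("iso-8859-1")
as_cp437 = degree.encode("cp437")

# UTF-8 uses two bytes for it.
as_utf8 = degree.encode("utf-8")

# The same byte value means different characters in different code pages.
byte_167_cp437 = bytes([167]).decode("cp437")        # masculine ordinal indicator
byte_167_latin1 = bytes([167]).decode("iso-8859-1")  # section sign

print(as_ascii, as_latin1, as_cp437, as_utf8)
print(byte_167_cp437, byte_167_latin1)
```

This is why a file written with one code page and opened with another shows a different symbol, and why UTF-8 is usually the safer choice when the consumer supports it.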

Is there a field in which PDF files specify their encoding?

I understand that it is impossible to determine the character encoding of any string-form data just by looking at the data. This is not my question.
My question is: Is there a field in a PDF file where, by convention, the encoding scheme is specified (e.g.: UTF-8)? This would be something roughly analogous to <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in HTML.
Thank you very much in advance,
Blz
A quick look at the PDF specification seems to suggest that you can have different encodings inside a PDF file. Have a look at page 86. So a PDF library with some kind of low-level access should be able to provide you with the encoding used for a string. But if you just want the text and don't care about the internal encodings used, I would suggest letting the library take care of conversions for you.
PDF uses "named" characters, in the sense that a character is a name and not a numeric code. Character "a" has name "a", character "2" has name "two" and the euro sign has name "euro", to give a few examples. PDF defines a few "standard" "base" encodings (named "WinAnsiEncoding", "MacRomanEncoding" and a few more, can't remember exactly), an encoding being a one-to-one correspondence between character names and byte values (yes, only 0 to 255). The exact, normative values for these predefined encodings are in the PDF specification. All these encodings use the ASCII values for the US-ASCII characters, but they differ in higher byte values.
A PDF file may define new encodings by taking a "base" encoding (say, WinAnsiEncoding) and redefining a few bytes, so a PDF author may, for example, define a new encoding named "MySuperbEncoding" as WinAnsiEncoding but with byte value 65 changed to mean character "ntilde" (this definition goes inside the PDF file), and then specify that some strings in the file use encoding "MySuperbEncoding". In this case, a string containing byte values 65-66-67 would mean characters "ñBC" and not "ABC". And note that I mean characters, nothing to do with glyphs or fonts. Different strings within the PDF file may use different encodings (this provides a way of using more than 256 characters in the PDF file, even though every string is defined as a byte sequence, and one byte always corresponds to one character).
So, the answer to your question is: characters within a PDF file can well be encoded internally in an ad-hoc encoding made on the spot for that specific PDF file. PDF parsers should make the appropriate substitutions when necessary. I do not know PDFMiner but I'm surprised that it (being a PDF parser) gives incorrect values, as the specification is very clear on how this must be interpreted. It IS possible to get all the necessary information from the PDF file, but, as Mattias said, it might be a large project and I think a program named PDFMiner should do exactly this kind of job.