PDF decoding: what do the brackets and numbers do?

My PDF file uses Deflate encoding. When I inflate the stream, the output looks something like this:
[(Lorem)-21( ipsum)-55( dolor)-14( sit)-55( amet,)-56( consectetur)-8( adipiscing)-14( elit.)-34( Donec)-15( faucibus)-49( lorem)-42( varius2)-56( mauris)-28( porttitor,)-34( et)-28( pellentesque)-1( )]TJ
What do the numbers and brackets mean?
They do not seem to be character counts or spacing.
Does anyone know?

That is an array for showing text (Stuff in brackets denote array objects []), it should be followed by the TJ operator. The number is used to translate the text matrix (adjust the positioning of the text). Assuming horizontal text, a negative number moves the next glyph to the right.
From 9.4.3 Text-Showing Operators (Please see the spec for more details)
Show one or more text strings, allowing individual glyph positioning.
Each element of array shall be either a string or a number. If the
element is a string, this operator shall show the string. If it is a
number, the operator shall adjust the text position by that amount;
that is, it shall translate the text matrix, Tm. The number shall be
expressed in thousandths of a unit of text space (see 9.4.4, "Text
Space Details"). This amount shall be subtracted from the current
horizontal or vertical coordinate, depending on the writing mode. In
the default coordinate system, a positive adjustment has the effect of
moving the next glyph painted either to the left or down by the given
amount.
The parentheses denote string objects:
String objects shall be written in one of the following two ways:
As a sequence of literal characters enclosed in parentheses ( ) (using
LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2,
"Literal Strings."
...
A literal string shall be written as an arbitrary number of characters
enclosed in parentheses. Any characters may appear in a string except
unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS
(29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be
treated specially as described in this sub-clause. Balanced pairs of
parentheses within a string require no special treatment.
I suggest getting the PDF Spec and reading it to find out more info.
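To make the structure concrete, here is a minimal Python sketch that splits a TJ array like the one in the question into its strings and its numeric adjustments. The function name parse_tj_array is made up for illustration. It is deliberately simplified: it only handles literal strings in parentheses, it treats a backslash as "keep the next character" (so octal escapes and escapes such as \n are not decoded properly), and it ignores hexadecimal strings in angle brackets.

```python
def parse_tj_array(src):
    """Split a TJ array into str items (text) and float items (adjustments)."""
    items = []
    i = src.index("[") + 1
    while i < len(src) and src[i] != "]":
        ch = src[i]
        if ch == "(":
            # Literal string: read up to the matching, balanced ")".
            depth, i, buf = 1, i + 1, []
            while True:
                ch = src[i]
                if ch == "\\":
                    # Simplification: keep the escaped character as-is.
                    buf.append(src[i + 1])
                    i += 2
                    continue
                if ch == "(":
                    depth += 1
                elif ch == ")":
                    depth -= 1
                    if depth == 0:
                        break
                buf.append(ch)
                i += 1
            items.append("".join(buf))
            i += 1  # skip the closing parenthesis
        elif ch in "+-.0123456789":
            # Number: an adjustment in thousandths of a unit of text space.
            j = i
            while j < len(src) and src[j] in "+-.0123456789":
                j += 1
            items.append(float(src[i:j]))
            i = j
        else:
            i += 1  # whitespace between elements
    return items


items = parse_tj_array("[(Lorem)-21( ipsum)-55( dolor)]TJ")
text = "".join(item for item in items if isinstance(item, str))
```

The strings alone already give the visible text here; the numbers only nudge the glyph positions.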

Related

PDF TJ operator

Is it possible to determine whether a number in a TJ operator represents a space between words?
Example: [(Sta)28(ry)-333(Plzenec,)]TJ
The number 28 is too small to be a space, whereas 333 should be a space given the actual font size. The font size is 9.96.
First of all please be aware that there is no absolute limit number separating numbers for spaces between words from spaces for kerning. All you can do is develop heuristics which will fail for some documents, usually for very tightly set ones.
Now remember how those numbers are applied when calculating the text displacement tx or ty from the origin of the last character before the number to the origin of the first character thereafter:
(ISO 32000-1, section 9.4.4 Text Space Details, also discussed here)
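The formula itself is not reproduced in the text above. As I recall it from section 9.4.4, for horizontal writing it is tx = ((w0 - Tj/1000) × Tfs + Tc + Tw) × Th; please check the spec before relying on this. A minimal Python sketch under that assumption follows. The function name and the sample values are illustrative only, and Th is expected as a factor (1.0 for 100% horizontal scaling).

```python
def horizontal_displacement(w0, tj, tfs, tc=0.0, tw=0.0, th=1.0):
    """Horizontal text displacement tx, as recalled from ISO 32000-1, 9.4.4.

    w0  -- glyph displacement in text space units (font width / 1000)
    tj  -- number from the TJ array (0 if none follows the glyph)
    tfs -- font size, tc -- character spacing, tw -- word spacing
          (tw only applies to the single-byte space code 32, so pass 0 otherwise)
    th  -- horizontal scaling as a factor
    """
    return ((w0 - tj / 1000.0) * tfs + tc + tw) * th


# Hypothetical glyph of width 500 at font size 9.96:
plain = horizontal_displacement(0.5, 0, 9.96)
widened = horizontal_displacement(0.5, -333, 9.96)   # negative Tj widens the gap
tightened = horizontal_displacement(0.5, 28, 9.96)   # positive Tj narrows it
```

Note how tj is subtracted from w0 inside the parentheses, before the multiplication by the font size.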
Thus, first of all such a number only widens the gap to the next character if it's negative.
Furthermore, the number is applied before the font size is multiplied; thus, one does not have to take the font size into account as I incorrectly claimed in a comment to the question.
The number (scaled by 1/1000) is directly subtracted from the glyph's displacement. So one can compare it with the glyph displacements of the font in question to get an impression of the meaning of the number.
The glyph displacements essentially are the numbers from the corresponding font's Widths or W array (defaulting to the MissingWidth / DW value) scaled by 1/1000. As both the TJ numbers and the widths are scaled by 1/1000, you can directly compare them.
Thus, an obvious option would be to compare the absolute value of negative TJ numbers to the width of the space glyph in the font in question. This differs from font to font, e.g. it's 600 for Courier, 278 for Helvetica, and 250 for Times-Roman.
Spaces between words created by TJ numbers don't necessarily have to be as wide as the full space glyph of the font, but a relevant fraction of it, e.g. half its value (YMMV), can be used as minimum for interpreting a TJ number as a space between words.
Unfortunately, though, if a PDF generator creates all spaces between words by TJ numbers and none by space glyphs, and if the font is embedded as a subset only, there is no need to embed the space glyph at all. In that case you might want to use other glyphs to compare with; often the length of a capital 'M' is used as a measure for the widths of a font, you might want to use a relevant fraction thereof, e.g. one fifth (YMMV again).
You can improve your heuristics
by also taking the character spacing value Tc into account: If Tc / Tfs is negative with a relevant absolute value, the text is tightly set. In that case you might want to lessen the limit number determined as above; or
by an analysis of all the TJ numbers in your text or those in the surrounding text. Here I can only guess, though, what might be acceptable heuristics...
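The basic comparison described above could be sketched in Python like this. It is only a heuristic under stated assumptions: the function name is_word_gap and its parameters are made up, the fractions (one half of the space width, one fifth of the 'M' width) are the YMMV values mentioned above, and the width 278 in the example is simply the Helvetica space width, since the font in the question is not known.

```python
def is_word_gap(tj, space_width=None, m_width=None,
                space_fraction=0.5, m_fraction=0.2):
    """Guess whether a TJ number stands for a space between words.

    tj, space_width and m_width are all in thousandths of a text space
    unit, so they can be compared directly without the font size.
    """
    if tj >= 0:
        return False  # only negative numbers widen the gap
    if space_width:
        threshold = space_fraction * space_width
    elif m_width:
        # Fallback when the subset font has no space glyph.
        threshold = m_fraction * m_width
    else:
        raise ValueError("need the width of the space glyph or of 'M'")
    return -tj >= threshold


# [(Sta)28(ry)-333(Plzenec,)]TJ with an assumed space width of 278
guesses = [is_word_gap(n, space_width=278) for n in (28, -333)]
```

For tightly set text you would lower space_fraction or m_fraction, as suggested above.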

PDF extracted text seems to be unreadable

Situation: I have a PDF of version 1.6 that contains several streams. The text in those streams was compressed (Flate), so I decompressed them. After that, I extracted the Tj parts of the decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual question: As I have no idea what I've got there, I would like to know what type of content it is. Furthermore: Is it possible to get plain text out of these strings or do I need further information to extract plain text?
Further research: The PDFs that I am trying to analyze were generated by iTextSharp (apparently a C# library for generating PDFs). I don't know whether this is relevant, but it might be that the library uses a special way of encrypting its text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get plain text out of these strings or do I need further information to extract plain text?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.
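As a toy illustration of why the positions matter, here is a small Python sketch that reorders text chunks by position before joining them. The (x, y, text) tuples and the coordinates are invented; in a real extractor they would come from tracking the text matrix while interpreting the content stream, and you would also have to group chunks into lines and decide where to insert spaces.

```python
def reading_order(chunks):
    """Join (x, y, text) chunks top to bottom, then left to right.

    Assumes the default PDF coordinate system, where y grows upwards.
    """
    ordered = sorted(chunks, key=lambda chunk: (-chunk[1], chunk[0]))
    return "".join(text for _, _, text in ordered)


# "Hello" drawn out of order, as in the example above
# (each glyph is assumed to be 10 units wide).
drawn = [(40, 700, "o"), (10, 700, "el"), (0, 700, "H"), (30, 700, "l")]
word = reading_order(drawn)
```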

Which Unicode characters are "composing" characters (whose sole purpose is to add an accent or tilde)?

This is related to
What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?
This is how I plan to do this:
Use http://msdn.microsoft.com/en-us/library/dd374126%28v=vs.85%29.aspx to turn the string into normalization form KD.
Basically, that turns most variations, such as superscripts, into the normal number. It also decomposes a character with a tilde or accent into 2 characters.
The next step would be to remove all characters whose sole purpose is to add a tilde or accent to another character.
How do I know which characters are like that? Which characters are just "composing characters"?
How do I find such characters? After I find them, how do I get rid of them? Should I scan character by character and remove all such "combining characters"?
For example:
Characters from 300 to 362 can be gotten rid of.
Then what?
Combining characters are listed in UnicodeData.txt as having a nonzero Canonical_Combining_Class, and a General_Category of Mn (Mark, nonspacing).
For each character in the string, call GetUnicodeCategory and check the UnicodeCategory for NonSpacingMark, SpacingCombiningMark or EnclosingMark.
You may be able to do it more efficiently using a regex, e.g. Regex.Replace(str, "\p{M}", "").
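The question is about VB.net, but the same two steps (decompose to form KD, then drop every character whose general category is a mark) can be sketched in Python with the standard unicodedata module. The function name strip_marks is made up for illustration.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFKD, then remove all mark characters (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


plain = strip_marks("caf\u00e9 ma\u00f1ana x\u00b2")
```

Checking the category rather than a fixed code point range also catches combining marks outside the U+0300 block.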

RegEx to find % symbols in a string that don't form the start of a legal two-digit escape sequence?

I would like a regular expression to find the %s in the source string that don't form the start of a valid two-hex-digit escaped character (defined as a % followed by exactly two hexadecimal digits, upper or lower case) that can be used to replace only these % symbols with %25.
(The motivation is to make the best guess attempt to create legally escaped strings from strings of various origins that may be legally percent escaped and may not, and may even be a mixture of the two, without damaging the data intent if the original string was already correctly encoded, e.g. by blanket re-encoding).
Here's an example input string.
He%20has%20a%2050%%20chance%20of%20living%2C%20but%20there%27s%20only%20a%2025%%20chance%20of%20that.
This doesn't conform to any encoding standard because it is a mix of valid escaped characters, e.g. %20, and two loose percent symbols. I'd like to convert those %s to %25s.
My progress so far is to identify a regex %[0-9a-z]{2} that finds the % symbols that are legal but I can't work out how to modify it to find the ones that aren't legal.
%(?![0-9a-fA-F]{2})
Should do the trick. Use a look-ahead to find a % NOT followed by a valid two-digit hexadecimal value then replace the found % symbol with your %25 replacement.
(Hopefully this works with (presumably) NSRegularExpression, or whatever you're using)
%(?![a-fA-F0-9]{2})
That's a percent followed by a negative lookahead for two hex digits.
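For reference, here is the same negative lookahead in Python's re module, applied to part of the example string from the question. The helper name escape_loose_percents is made up; in another regex engine only the replace call should differ.

```python
import re

# A "%" that is NOT followed by two hexadecimal digits.
LOOSE_PERCENT = re.compile(r"%(?![0-9a-fA-F]{2})")


def escape_loose_percents(s):
    """Replace only the loose "%" symbols with "%25"."""
    return LOOSE_PERCENT.sub("%25", s)


fixed = escape_loose_percents("He%20has%20a%2050%%20chance%20of%20living%2C")
```

Keep in mind that a literal % that happens to be followed by two hex digits (say, "50%20" meant as text) cannot be told apart from a real escape, so this stays a best guess.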

Hsqldb - how to remove the padding on char fields

I'm finding that Char fields are being padded.
Is there any way to stop this from happening?
I've tried using the property
SET PROPERTY "sql.enforce_strict_size" FALSE
but it doesn't seem to help.
Indeed, the MySQL docs specify that "When CHAR values are retrieved, trailing spaces are removed." This is odd, as other databases seem to always keep the padding (I can confirm that for Oracle). The SQL-92 standard indicates that right-padded spaces are part of the char, for example in the definition of the CAST function on p. 148. Writing SV for the source value, TV for the target value, and LTD for the length of the target datatype:
ii) If the length in characters of SV is larger than LTD, then TV is the first LTD characters of SV. If any of the remaining characters of SV are non-<space> characters, then a completion condition is raised: warning-string data, right truncation.
iii) If the length in characters M of SV is smaller than LTD, then TV is SV extended on the right by LTD-M <space>s.
Maybe that's just another one of MySQL's many oddities and gotchas.
And to answer your question: if you don't want the trailing spaces, you should use VARCHAR instead.
I thought 'char' fields are by definition space padded to fill the field. They are considered fixed length and will be space padded up to that length.
The data type 'varchar' is defined as variable char where they are not space padded to fill the field.
I could be wrong though since I normally work on SQL Server.