PDF COSDictionary with key names like L, O, E, N, T, and H - pdfbox

When I looked into the Apache Java pdfbox parser code, the first dictionary has key names of single characters and values of simple integers. For example, there can be an entry like (COSName{N}:COSInt{606}), another entry like (COSName{T}:COSInt{3423924}) in the dictionary.
There must be some pre-defined meaning of these single-character key names. Why are the values simple integers? Could someone please provide more insights as to what they mean? Are they some offsets or codes defined by PDF specs?

The ISO 32000-1:2008 specification includes tables for known dictionary entries; they typically list columns for Key, Type, and Value. The Value column usually explains the meaning of the key and sometimes explicitly lists the allowed values.
O key
E.g. Section 12.3.5 Collections shows in Table 157 – Entries in a collection field dictionary:
Key: O
Type: integer
Value: (Optional) The relative order of the field name in the user interface. Fields shall be sorted by the conforming reader in ascending order.
However, such keys can have different meanings in different dictionaries.
E.g. Section 7.6.3.2 Standard Encryption Dictionary shows in Table 21 - Additional encryption dictionary entries for the standard security handler:
Key: O
Type: string
Value: (Required) A 32-byte string, based on both the owner and user passwords, that shall be used in computing the encryption key and in determining whether a valid owner password was entered. For more information, see 7.6.3.3, "Encryption Key Algorithm," and 7.6.3.4, "Password Algorithms."
You should be able to find explanations there for such keys.
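To make the "same key, different dictionary" point concrete, here is a small Python sketch of a lookup keyed by dictionary kind. The first two entries summarize the tables quoted above. The third entry is my own addition, not part of the answer: the key set in the question (L, O, E, N, T, H) looks like the linearization parameter dictionary described in Annex F of ISO 32000-1, and the one-line meanings are paraphrased, so check them against the specification. The names KEY_MEANINGS and meanings_of are made up for illustration.

```python
# The meaning of a key depends on which dictionary it appears in.
KEY_MEANINGS = {
    "collection field (Table 157)": {
        "O": "integer: relative order of the field name in the user interface",
    },
    "standard encryption (Table 21)": {
        "O": "string: 32-byte value based on the owner and user passwords",
    },
    # Assumption: not taken from the answer above; paraphrased from Annex F.
    "linearization parameters (Annex F)": {
        "L": "integer: length of the entire file in bytes",
        "H": "array: offset and length of the primary hint stream",
        "O": "integer: object number of the first page's page object",
        "E": "integer: offset of the end of the first page",
        "N": "integer: number of pages in the document",
        "T": "integer: offset used to locate the main cross-reference table",
    },
}


def meanings_of(key: str) -> dict[str, str]:
    """Return every known meaning of `key`, grouped by dictionary kind."""
    return {
        kind: entries[key]
        for kind, entries in KEY_MEANINGS.items()
        if key in entries
    }


for kind, meaning in meanings_of("O").items():
    print(f"{kind}: {meaning}")
```

If that reading is right, the integer values in the question would be byte offsets, object numbers, and counts, which fits entries such as N = 606.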

Related

jsonschema for a map stored as an array [key1, val1, key2, val2.....]

Is it possible to create a json schema for an array with undefined length (besides it always being an even number of elements) that captures a map stored as an array?
I.e., as described in the title: [key1, val1, key2, val2, ...]
It seems that the only option for an array of undetermined length is to have a single item "type" (though that type could conceptually be a oneOf type). However, that wouldn't enforce the ordering of the key/val schema restrictions: it would accept valid uses, but it would also accept invalid ones.
If I knew how long the array would be, I could just enforce it by specifying the types for all keys and values in their respective positions, but that's not the case here.
Yes, it would be nice if the API worked off a map/object instead of an array in this location, but this is an old API that I'm trying to create a JSON schema for, so it probably can't be changed.

Can the SQL function uuid_generate_v5(UUID, "namespace") be reversed?

I'm creating masked data from our production database, and part of that process is masking all the UUIDs with the function uuid_generate_v5(UUID, "namespace"). This produces masked data with valid UUIDs, and the same input generates the same masked value every time, across all the tables of our database.
It has the added benefit of allowing me to go from a production UUID to a masked ID for testing.
What I'm not sure about is whether this function is reversible. Can I go from my masked UUID back to the production ID? My thought (and hope) is that I can't.
Example:
SELECT uuid_generate_v5('0015b178-6102-43bc-89a8-5f7cafe78f2f', 'mask');
-- will return: 'fdb8a7dd-909b-544d-8340-41358c6b3495'
Can I go from the value 'fdb8a7dd-909b-544d-8340-41358c6b3495' back to '0015b178-6102-43bc-89a8-5f7cafe78f2f' using the uuid_generate_v5() function?
I'm not sure I can put this better than the Wikipedia article on UUIDs:
Version-3 and version-5 UUIDs are generated by hashing a namespace identifier and name. Version 3 uses MD5 as the hashing algorithm, and version 5 uses SHA-1.
...
Version-3 and version-5 UUIDs have the property that the same namespace and name will map to the same UUID. However, neither the namespace nor name can be determined from the UUID, even if one of them is specified, except by brute-force search.
In other words, the answer is the same as for "Is it possible to reverse a SHA-1 hash?": no (at least by design; a flaw may eventually be found in the SHA-1 algorithm that makes it more feasible).
This is confirmed by the PostgreSQL manual for the function you're using:
uuid_generate_v3 ... The name parameter will be MD5-hashed, so the cleartext cannot be derived from the generated UUID.
uuid_generate_v5 ... Generates a version 5 UUID, which works like a version 3 UUID except that SHA-1 is used as a hashing method.
However, your use of the function is not really what is intended, because you have mixed up the two parameters; the signature is documented as:
uuid_generate_v3 ( namespace uuid, name text ) → uuid
It is the UUID passed first which is the "namespace", with pre-defined values that can identify that the name used was a DNS name, URL, etc. The second parameter is not a "namespace" or "mask", but "an identifier in the selected namespace".
So, while your method will work, what you're doing is treating every input UUID as a namespace, and creating a UUID for the same name in each of them. The intended use is that you use a fixed UUID as the namespace, and the value you're masking (which happens to be a UUID in string form) as the name.
SELECT uuid_generate_v5(somehow_get_mask_namespace_uuid(), '0015b178-6102-43bc-89a8-5f7cafe78f2f');
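For comparison, the same intended usage can be sketched with Python's standard uuid module, whose uuid5(namespace, name) takes its arguments in the same order. This is only an illustration under assumptions: MASK_NAMESPACE and mask_uuid are hypothetical names, and the namespace is derived here from a made-up DNS-style name. Any fixed UUID that you keep constant would serve.

```python
import uuid

# Hypothetical fixed namespace for masking; any constant UUID would do.
MASK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "mask.example.invalid")


def mask_uuid(production_id: str) -> uuid.UUID:
    """Derive a version 5 UUID from a production UUID given as a string."""
    # Normalise to the canonical lower-case form first, so the same ID
    # always maps to the same masked value regardless of how it was written.
    canonical = str(uuid.UUID(production_id))
    # Fixed namespace first, the value being masked as the name.
    return uuid.uuid5(MASK_NAMESPACE, canonical)


masked = mask_uuid("0015b178-6102-43bc-89a8-5f7cafe78f2f")
print(masked, masked.version)
```

The mapping is deterministic, so the same production ID gives the same masked ID in every table, and getting back from the masked value still requires a brute-force search over candidate inputs.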

STRING type or SSTRING element for a text field in table? Pros and cons

I need to create a Z table to store reasons for modifications of a certain custom object.
In the UI, the user will pick a reason ID and then optionally fill a text box. The table will have more or less the fields below:
key objectID
key changeReasonID
changedOn
changedBy
comments
My doubt is with the comments field. I read the documentation about the limitations of STRING and SSTRING, but it's not clear to me if a STRING type field used in a transparent table has a limited length or not.
Even if the length is not limited (at least by the DB), I'm not sure whether this approach is a good idea. Would you recommend CHAR/SSTRING types with a fixed length instead?
Note: my system is running on an MSSQL database.
Strings have unlimited length, both in ABAP structures/tables, and in the database.
Most databases will store only a pointer in this column that points to the real CLOB value which is stored in a different memory segment. As a result, they restrict the usage of these columns, and may not allow you to use them as a key or index.
If I remember correctly, ABAP supports a maximum of 16 string fields per structure, which naturally limits its use cases. Also consider that ABAP structures have a maximum size.
For your case, if the comment will remain the only long field, and if you are actually fine with storing unlimited input (consider any security constraints), string sounds like a reasonable option.
If you are unsure what the future will bring, or to be on the safe side regarding security, you might want to opt for sstring or simply a long char instead.

Handling dynamic (user supplied) column names

When writing applications that manage data, it is often useful to allow the end user to create or remove classes of data that are best represented as columns. For example, I'm working on a dictionary building application; a user might decide they want to add, say, an "alternate spelling" field or something to data, which could be very easily represented as another column.
Usually, I just name the column based on whatever the user called it ("alternate_spelling" in this case); however, a user-defined string that isn't explicitly sanitized as a database identifier bothers me. Since column names can't be bound like values, I'm trying to figure out how to sanitize the column names.
So my question is: what should I be doing? Can I get away with just quoting things? There's lots of questions asking how to bind column names in SQL, and many responses saying one should never need to, but never explaining the correct approach to handling variable columns. I'm working in Python specifically, but I think this question is more general.
It depends on which database you are using...
According to PostgreSQL:
"SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable"
(Also keep in mind the maximum length allowed for a name.)
I was looking for something like this. I still wouldn't trust it with user-supplied names - I'd look those up from the database catalog instead, but I think it is robust enough to check data that is provided from your backend.
I.e., just because something comes from your internal data tables or YAML config files doesn't 100% mean that an attacker couldn't have hacked into those sources, so why not add another layer right before composing SQL queries?
This is for PostgreSQL but should mostly work elsewhere. No, it doesn't cover ALL possible characters for naming columns and tables, only those used in my databases.
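import re


class SecurityException(Exception):
    """Base class for security-related errors."""


class UnsafeSqlException(SecurityException):
    """Raised when an SQL fragment looks unsafe."""


# An ASCII letter followed by up to 254 ASCII letters, digits, or underscores.
# re.ASCII stops IGNORECASE from also matching a few non-ASCII letters.
SAFE_NAME_PATTERN = re.compile("[a-z][a-z0-9_]{0,254}", re.IGNORECASE | re.ASCII)


def is_safe_sql_name(sql: str, error_on_empty: bool = False, raise_on_false: bool = True) -> bool:
    """Check that something looks like an object name."""
    if not isinstance(sql, str):
        raise TypeError(f"sql should be a string {sql=}")
    if not sql:
        if error_on_empty:
            raise ValueError(f"empty sql {sql=}")
        return False
    # fullmatch instead of match with "$": "$" would also accept a trailing newline.
    res = bool(SAFE_NAME_PATTERN.fullmatch(sql))
    if not res and raise_on_false:
        raise UnsafeSqlException(f"{sql=}")
    return res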
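For the catalog lookup mentioned above (only accepting names the database itself reports), a rough sketch could look like the following. It uses sqlite3 from the standard library purely as a stand-in so that it runs anywhere; on PostgreSQL the equivalent lookup would query information_schema.columns instead of pragma_table_info. The helper names known_columns and select_column and the entries table are invented for the example.

```python
import sqlite3


def known_columns(conn: sqlite3.Connection, table: str) -> set[str]:
    """Return the column names the database itself reports for `table`."""
    # pragma_table_info is SQLite-specific; the table name is bound as a
    # value, so it never becomes part of the SQL text.
    rows = conn.execute("SELECT name FROM pragma_table_info(?)", (table,))
    return {row[0] for row in rows}


def select_column(conn: sqlite3.Connection, table: str, column: str) -> list[tuple]:
    """Select one column, accepting only names that exist in the catalog."""
    columns = known_columns(conn, table)
    if not columns:
        raise ValueError(f"unknown table {table!r}")
    if column not in columns:
        raise ValueError(f"unknown column {column!r}")
    # The table exists and the column was reported by the catalog; quote
    # both anyway, doubling any embedded double quotes.
    quoted_table = '"' + table.replace('"', '""') + '"'
    quoted_column = '"' + column.replace('"', '""') + '"'
    return conn.execute(f"SELECT {quoted_column} FROM {quoted_table}").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (word TEXT, alternate_spelling TEXT)")
conn.execute("INSERT INTO entries VALUES ('color', 'colour')")
print(select_column(conn, "entries", "alternate_spelling"))
```

The point of the design is that a user-supplied string is only ever compared against names that already exist; it is never trusted on its own. That complements a pattern check like the one above, it doesn't replace it.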

Distinct & compact word lists to represent e.g. fingerprints or other binary data for humans?

What word lists / "dictionaries" can you suggest for conveying binary data such as cryptographic fingerprints, hashes, etc.?
Criteria for such a word list are e.g.
Compact, i.e. a long list of rather short words so you need less of those words to transmit data
Distinctive words / simple to distinguish (no homonyms, not even accidentally caused by slight mispronunciation)
Simple to spell
One of the most familiar ones is the PGP word list.