Why does the trailer object report a previous value for the "Size" entry? - pdfbox

I'm trying to write code that investigates changes to a PDF document after signing (pointers welcome) and came across this strange issue.
I want to retrieve the number of objects in the PDF file as indexed in the xref tables. It seems that, while all other entries in the trailer dictionary are those of the final trailer, the value of Size is the one from the original trailer. In my particular case there have been 2 updates to the original document (adding 2 xref tables for a total of 3), adding objects up to the number 567, from the original of 550.
This is how I get the Size from the trailer dictionary:
private static long getMaxObjId(PDDocument doc) {
    COSDocument cosdoc = doc.getDocument();
    COSDictionary trailer = cosdoc.getTrailer();
    long maxobj = trailer.getLong(COSName.SIZE);
    return maxobj;
}
I'm using PDFBox 2.0.21.

You are right. The Size entry in that trailer contains the lowest (i.e. usually the oldest) Size value of all trailers in the document while all other entries in that trailer contain the newest value of their respective keys.
And the cause for this is even worse than I originally thought: That trailer object you get is not simply the latest (or, considering the Size value, the earliest) trailer dictionary in the document, it is the union of all trailer dictionaries, starting with the earliest trailer in the Prev chain up to the newest one.
So far so good. But shouldn't this mean that all entries in that union trailer should have the value from the newest trailer dictionary with the entry key? That's what I thought until I saw the COSDictionary.addAll(COSDictionary) code used to create that union:
/**
 * This will add all of the dictionaries keys/values to this dictionary.
 * Only called when adding keys to a trailer that already exists.
 *
 * @param dic The dictionaries to get the keys from.
 */
public void addAll(COSDictionary dic)
{
    dic.forEach((key, value) ->
    {
        /*
         * If we're at a second trailer, we have a linearized pdf file, meaning that the first Size entry represents
         * all of the objects so we don't need to grab the second.
         */
        if (!COSName.SIZE.equals(key) || !items.containsKey(COSName.SIZE))
        {
            setItem(key, value);
        }
    });
}
Here an existing Size entry is explicitly not replaced!
This explains the original observation that the Size entry in that trailer contains the lowest (i.e. usually the oldest) Size value of all trailers in the document while all other entries in that trailer contain the newest value of their respective keys.
The comments suggest that this is a relic from the time when PDFBox by default parsed a PDF from the front, ignoring cross-reference tables, and when the only relevant test PDFs were ones without any updates and linearized ones, which use the mechanisms defined for incremental updates in inverse order. Ordinary incremental updates were apparently not considered. Only for such linearized documents might this exception make sense.
But here is why I consider this worse than originally thought: addAll is a public COSDictionary method whose name parallels addAll in the Java Collections Framework. It therefore leads users to believe that the first JavaDoc line, "This will add all of the dictionaries keys/values to this dictionary", is true; they will use it for that task, never expecting that Size entries won't be replaced.
Indeed, even the PDFBox code itself uses COSDictionary.addAll(COSDictionary) in contexts other than trailer unions, in spite of the second JavaDoc line, "Only called when adding keys to a trailer that already exists".
This should be inspected and fixed. I created a Jira issue to that effect, PDFBOX-4999.
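To make the effect concrete, here is a minimal sketch that mimics the quoted merge rule with plain Java maps. It is not PDFBox code: the class, the method names, and the trailer values (including the Info and Prev entries) are hypothetical; only the 550 and 567 figures are borrowed from the question for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the trailer union; this is not PDFBox code.
class TrailerUnionSketch {

    // Mimics the quoted addAll: every key overwrites, except an already present "Size".
    static void addAllKeepingFirstSize(Map<String, Object> union, Map<String, Object> trailer) {
        trailer.forEach((key, value) -> {
            if (!"Size".equals(key) || !union.containsKey("Size")) {
                union.put(key, value);
            }
        });
    }

    // Merges the trailers oldest first, as described for the Prev chain.
    static Map<String, Object> unionOf(List<Map<String, Object>> oldestFirst) {
        Map<String, Object> union = new LinkedHashMap<>();
        for (Map<String, Object> trailer : oldestFirst) {
            addAllKeepingFirstSize(union, trailer);
        }
        return union;
    }

    // Hypothetical three-revision document; all values are illustrative.
    static Map<String, Object> demo() {
        Map<String, Object> original = new LinkedHashMap<>();
        original.put("Size", 550L);
        original.put("Info", "info-rev0");

        Map<String, Object> update1 = new LinkedHashMap<>();
        update1.put("Size", 560L);
        update1.put("Info", "info-rev1");
        update1.put("Prev", "offset-of-rev0-xref");

        Map<String, Object> update2 = new LinkedHashMap<>();
        update2.put("Size", 567L);
        update2.put("Info", "info-rev2");
        update2.put("Prev", "offset-of-rev1-xref");

        return unionOf(Arrays.asList(original, update1, update2));
    }

    public static void main(String[] args) {
        // Size keeps the oldest value while Info and Prev carry the newest ones.
        System.out.println(demo());
    }
}
```

In this sketch the merged map ends up with the oldest Size next to the newest Info and Prev, which matches the behaviour observed in the question.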

Remove Redis key deletion behavior on expiration

I'm using Redis Key Space Notification to get my application notified when a specified key gets expired.
But when the key expires, Redis deletes it. I need to remove this behavior because my application may use the expired information at a later moment.
Is there a way to remove this behavior?
As @sazzad and @nitrin0 said, there's no way to change that.
As another option to get a similar result, I'd suggest you use a sorted set to track these "pseudo-expirations", and when they "expire", a background process does whatever else you need the key for: move it, transform it, reset the expiration, etc.
Use the command zadd to both create a new sorted set and to add members to it. The key for the set can be anything, but I'd use the members as the keys from the data that expires so you can easily work with both the real data, and the member in the sorted set.
ZADD name-of-sorted-set NX timestamp-when-data-expires key-of-real-data
Let's break this down:
name-of-sorted-set is what you'd use in the other Z* commands to work with this specific sorted set.
NX means "Only add new elements. Don't update already existing elements.". The other option is XX which is "Only update elements that already exist. Don't add new elements." For this, the only options are NX or nothing.
timestamp-when-data-expires is the score for this member, and we'll use it as the exact timestamp when the data would otherwise "expire", so you'll have to do some calculations in your application to provide the timestamp instead of just the seconds until it expires.
key-of-real-data is the exact key used for the real data this represents. Using the exact key here will help easily combine the two when you're working with this sorted set to find which members have "expired", since the members are the keys you'd use to move, delete, transform, the data.
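The timestamp calculation mentioned above can be sketched as follows. This is only string and arithmetic handling, not a Redis client: the class and method names are hypothetical, and the sketch assumes scores are Unix timestamps in whole seconds.

```java
import java.time.Instant;

// Hypothetical helper; it only builds the command text and does not talk to Redis.
class PseudoExpiry {

    // Absolute Unix timestamp (seconds) at which the data should count as "expired".
    static long expiresAt(Instant now, long ttlSeconds) {
        return now.getEpochSecond() + ttlSeconds;
    }

    // Builds the ZADD command from the answer; setName and dataKey are caller-chosen.
    static String zaddCommand(String setName, long expiresAt, String dataKey) {
        return "ZADD " + setName + " NX " + expiresAt + " " + dataKey;
    }

    public static void main(String[] args) {
        long score = expiresAt(Instant.now(), 60);
        System.out.println(zaddCommand("name-of-sorted-set", score, "key-of-real-data"));
    }
}
```

A real application would pass these arguments to whatever Redis client it uses rather than concatenating a command string.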
Next I'd have a background process run zrangebyscore to see if there are any members whose scores (timestamps) are within some range:
ZRANGEBYSCORE name-of-sorted-set min-timestamp max-timestamp WITHSCORES LIMIT 0 10
Let's break this down too:
name-of-sorted-set is the key for the set we chose in ZADD above
min-timestamp is the lower end of the range to find members that have "expired"
max-timestamp is the higher end of the range
WITHSCORES tells Redis to return the name of the members AND their scores
LIMIT allows us to set an offset (the 0) and a count of items to return (the 10). This is just an example, but for very large data sets you'll likely have to make use of both the offset and count limits.
ZRANGEBYSCORE will return something like this if using redis-cli:
1) "first-member"
2) "1631648102"
3) "second-member"
4) "1631649154"
5) "third-member"
6) "1631650374"
7) "fourth-member"
8) "1631659171"
9) "fifth-member"
10) "1631659244"
Some Redis clients will change that, so you'll have to test it in your application. In redis-cli the member-score pair is returned over two lines.
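If your client hands back the same flat shape as redis-cli (member, score, member, score, ...), a small helper can pair them up. This is a sketch under that assumption; the class name is hypothetical, and it assumes the scores are integer timestamps as in the output above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for a flat WITHSCORES reply; adapt to what your client returns.
class FlatReply {

    // Pairs each member with the score that follows it, keeping the reply order.
    static Map<String, Long> toMemberScores(List<String> flat) {
        if (flat.size() % 2 != 0) {
            throw new IllegalArgumentException("expected member/score pairs");
        }
        Map<String, Long> out = new LinkedHashMap<>();
        for (int i = 0; i < flat.size(); i += 2) {
            out.put(flat.get(i), Long.parseLong(flat.get(i + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> reply = Arrays.asList(
                "first-member", "1631648102",
                "second-member", "1631649154");
        System.out.println(toMemberScores(reply));
    }
}
```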
Now that you have the members (keys of the actual data) that have "expired" you can do whatever it is you need to do with them, then probably either remove them from the set entirely (ZREM), or remove them and add them again with a new timestamp. Since in this example we added members with the NX option, we can't update existing members, only insert new ones.

CIL hex code to call method in another assembly

For example, I'm writing code in assembly A, and the method I want to call is in assembly B at 0x06000DF2. Here is the hex dnSpy creates for me: 6F8701000A, but I don't know how it's calculated. Please explain it to me. Thank you!
The first byte (6F) indicates that it is the callvirt instruction; the remaining 4 bytes are the metadata token for the method, in little-endian byte order.
callvirt 0x0A000187
The metadata token is a reference to a particular row in a particular table in the metadata of the current module (the module that contains the IL). The high order byte indicates the type of token (and hence, which metadata table to look in), while the remaining 3 bytes indicates the row number within the table. 0x0A indicates that the target row is in the MemberRef table and the referenced record will provide the details necessary to find the correct member.
The MemberRef table is described in ECMA-335 Partition II, section 22.25.
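The decoding described above can be sketched in a few lines. The class and method names are hypothetical; the sketch assumes the five bytes are exactly one opcode byte followed by a 4-byte little-endian token, as in the question. Note that the token in the IL (table 0x0A) is a reference inside assembly A's own metadata, which is why it does not match 0x06000DF2, the token the method has inside assembly B.

```java
// Hypothetical decoder for a single 5-byte call instruction.
class CilCallDecoder {

    // Converts a hex string such as "6F8701000A" into bytes.
    static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Reads the 4 bytes after the opcode as a little-endian metadata token.
    static int token(byte[] il) {
        return (il[1] & 0xFF)
                | ((il[2] & 0xFF) << 8)
                | ((il[3] & 0xFF) << 16)
                | ((il[4] & 0xFF) << 24);
    }

    // High-order byte: which metadata table the token points into.
    static int table(int token) {
        return token >>> 24;
    }

    // Low three bytes: the row number within that table.
    static int row(int token) {
        return token & 0x00FFFFFF;
    }

    public static void main(String[] args) {
        byte[] il = hex("6F8701000A");
        int t = token(il);
        System.out.printf("opcode=0x%02X token=0x%08X table=0x%02X row=%d%n",
                il[0] & 0xFF, t, table(t), row(t));
    }
}
```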

EMV TLV length restriction limitation to overcome

We have code to interrogate the values from various EMV TLVs.
However, in the case of the PED serial number, the spec for tag "9F1E" at http://www.emvlab.org/emvtags/ has:
Name: Interface Device (IFD) Serial Number
Description: Unique and permanent serial number assigned to the IFD by the manufacturer
Source: Terminal
Format: an 8
Template: (none listed)
Tag: 9F1E
Length: 8
P/C: primitive
But the above gives a limit of 8, while we have VeriFone PEDs with 9-long SNs.
So sample code relying on tag "9F1E" cannot retrieve the full length.
int GetPPSerialNumber()
{
    int rc = -1;
    rc = GetTLV("9F1E", &resultCharArray);
    return rc;
}
In the above, GetTLV() is written to take a tag arg and populate the value to a char array.
Have any developers found a nice way to retrieve the full 9?
You're correct -- there is a mismatch here. The good thing about TLV is that you don't really need a specification to tell you how long the value is going to be, because the length is encoded right next to the tag. Your GetTLV() is imposing this restriction itself; the obvious solution is to relax this.
We actually don't even look at the documented lengths on the TLV-parsing level. Each tag is mapped to an associated entity in the BL (sometimes more than one thanks to the schemes going their own routes for contactless), and we get to choose which entities we want to impose a length restriction on there.
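As an illustration of reading the length from the data instead of from the documentation, here is a minimal sketch of a BER-TLV reader. It is not the asker's GetTLV(): the class and method names are hypothetical, it only handles the first TLV in the buffer, and the serial number in the usage example is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical BER-TLV reader that trusts the encoded length, not the documented one.
class TlvSketch {

    // Returns the value bytes of the first TLV in buf.
    static byte[] firstValue(byte[] buf) {
        int pos = 0;
        // Tag: if the low five bits of the first byte are all set, further tag
        // bytes follow for as long as the high bit of each of them is set.
        if ((buf[pos++] & 0x1F) == 0x1F) {
            while ((buf[pos++] & 0x80) != 0) {
                // keep skipping tag bytes
            }
        }
        // Length: short form below 0x80, otherwise the low bits give the byte count.
        int len = buf[pos++] & 0xFF;
        if (len > 0x7F) {
            int count = len & 0x7F;
            len = 0;
            for (int i = 0; i < count; i++) {
                len = (len << 8) | (buf[pos++] & 0xFF);
            }
        }
        if (pos + len > buf.length) {
            throw new IllegalArgumentException("value runs past end of buffer");
        }
        return Arrays.copyOfRange(buf, pos, pos + len);
    }

    public static void main(String[] args) {
        // Tag 9F1E, length 9, made-up serial number "123456789".
        byte[] tlv = {(byte) 0x9F, 0x1E, 0x09, '1', '2', '3', '4', '5', '6', '7', '8', '9'};
        System.out.println(new String(firstValue(tlv), StandardCharsets.US_ASCII));
    }
}
```

Any length limit then becomes a decision for the layer that interprets the value, not for the parser.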

How to create a lazy-evaluated range from a file?

The File I/O API in Phobos is relatively easy to use, but right now I feel like it's not very well integrated with D's range interface.
I could create a range delimiting the full contents by reading the entire file into an array:
import std.file;
auto mydata = cast(ubyte[]) read("filename");
processData(mydata); // takes a range of ubytes
But this eager evaluation of the data might be undesired if I only want to retrieve a file's header, for example. The upTo parameter doesn't solve this issue if the file's format assumes a variable-length header or any other element we wish to retrieve. It could even be in the middle of the file, and read forces me to read all of the file up to that point.
But indeed, there are alternatives. readf, readln, byLine and most particularly byChunk let me retrieve pieces of data until I reach the end of the file, or just when I want to stop reading the file.
import std.stdio;
auto file = File("filename");
auto chunkRange = file.byChunk(1000); // a range of ubyte[]s
processData(chunkRange); // oops! not expecting chunks!
But now I have introduced the complexity of dealing with fixed size chunks of data, rather than a continuous range of bytes.
So how can I create a simple input range of bytes from a file that is lazy evaluated, either by characters or by small chunks (to reduce the number of reads)? Can the range in the second example be seamlessly encapsulated in a way that the data can be processed like in the first example?
You can use std.algorithm.joiner:
auto r = File("test.txt").byChunk(4096).joiner();
Note that byChunk reuses the same buffer for each chunk, so you may need to add .map!(chunk => chunk.idup) to lazily copy the chunks to the heap.

How do I find the middle element of an ArrayList?

How do I find the middle element of an ArrayList? What if the size is even or odd?
It turns out that a proper ArrayList object (in Java) maintains its size as a property of the object, so a call to arrayList.size() just accesses an internal integer. Easy.
/**
 * Returns the number of elements in this list.
 *
 * @return the number of elements in this list
 */
public int size() {
    return size;
}
It is both the shortest (in terms of characters) and fastest (in terms of execution speed) method available.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.0size
So, presuming you want the "middle" element (i.e. item 3 in a list of 5 items -- 2 items on either side), and remembering that get() takes a zero-based index, it'd be this:
Object item = arrayList.get(arrayList.size() / 2);
Now, it gets a little trickier if you are thinking about an even-sized array, because an exact middle doesn't exist. In an array of 4 elements, you have one item on one side, and two on the other.
If you accept that the "middle" will be biased toward the end of the array, the above logic also works. Otherwise, you'll have to detect when the size of the elements is even and behave accordingly, or subtract one before dividing, which picks the same element for odd sizes and the earlier candidate for even sizes. Wind up your propeller beanie friends...
Object item = arrayList.get((arrayList.size() - 1) / 2);
If the ArrayList size is odd: list.get(list.size() / 2);
If the ArrayList size is even: list.get((list.size() / 2) - 1);
If you are not allowed to use the arraylist.size() method, you can use two iterators. One iterates from the beginning to the end of the list, the other from the end to the beginning. When they reach the same index, you have found the middle element.
Some additional checks might be necessary to make sure each iterator waits for the other before its next step, so that you do not miss the meeting point.
While iterating, keep the total number of elements each iterator has read. Both should advance one element per cycle. By cycle, I mean a process consisting of these operations:
iteratorA reads one element from the beginning
iteratorB reads one element from the end
The iterators might need to read more than one index to read an element. In other words, you should skip one element per cycle, not one index.
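One possible shape of the two-iterator idea, as a sketch only: the class and method names are hypothetical, and the back iterator is positioned by first walking it to the end so that size() is never called. For even sizes it returns the earlier of the two middle candidates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: finds the middle element without calling size().
class TwoIteratorMiddle {

    static <T> T middle(List<T> list) {
        ListIterator<T> front = list.listIterator();
        if (!front.hasNext()) {
            throw new NoSuchElementException("empty list");
        }
        // Walk a second iterator to the end so it can read backwards.
        ListIterator<T> back = list.listIterator();
        while (back.hasNext()) {
            back.next();
        }
        while (true) {
            // One cycle: one element from the front, one from the back.
            int i = front.nextIndex();
            T fromFront = front.next();
            int j = back.previousIndex();
            back.previous();
            // Same index (odd size) or adjacent indices (even size): we have met.
            if (i == j || i + 1 == j) {
                return fromFront;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(middle(Arrays.asList("a", "b", "c", "d", "e")));
        System.out.println(middle(Arrays.asList("a", "b", "c", "d")));
    }
}
```

Positioning the back iterator costs a full pass over the list, so this is slower than get(size() / 2) and only makes sense under the stated restriction.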