What is the meaning of the two numbers before obj in a PDF? - pdf

A PDF object starts with obj and ends with endobj, but in all examples, in the specification and in real-world PDFs, objects also have two numbers in front of them, like this:
1 0 obj
% here is the object definition
endobj
I expected this to be explained in the official specification (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf page 13 and following and page 40), but it is never really explained, or am I missing it completely?
The first number seems to be just a running number, like a unique ID, but what is the second number?

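This is explained in section 7.3.10, "Indirect Objects": the two numbers are the object number and the generation number. The generation number is 0 for a newly created object and only increases when an object number is reused after the original object has been freed. Together the two numbers identify the object, which is why references elsewhere in the file look like 1 0 R.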
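As a rough illustration of how the two numbers appear in a file, the following Python sketch scans raw bytes for object headers and reports each (object number, generation number) pair. The function name and the sample bytes are made up for this example, and the scan is deliberately naive: it does not parse the cross-reference table, and it could misfire on matching text inside stream data.

```python
import re

# Matches the header of an indirect object, e.g. b"1 0 obj".
# Group 1 is the object number, group 2 the generation number.
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def list_indirect_objects(pdf_bytes):
    """Return (object_number, generation_number) pairs found in raw PDF bytes."""
    return [(int(num), int(gen)) for num, gen in OBJ_HEADER.findall(pdf_bytes)]


# Hypothetical two-object fragment; "2 0 R" is a reference, not a definition.
sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)
print(list_indirect_objects(sample))  # [(1, 0), (2, 0)]
```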

Wiki API - Parsing sentences from JSON extracts in JavaScript?

Is there a way to have wiki display extracts in an array of sentences?
Or does anyone have any ideas other than using string.split(".") to parse? There are cases where the sentence may include a . and I don't want to split if it occurs mid-sentence.
For example, "The Eagles were No. 1 in the U.S. in 1970" would be split into 4 sentences using str.split(), and that's not what I want.
Wiki must have some way of determining what defines a sentence, since it works when you limit the number of sentences in a call (they don't break a sentence on an in-line period). Is there a way to get the sentences individually?
Looking for a solution in JavaScript to parse a JSON excerpt string.
I ended up figuring out a work-around. Using exsentences, I made 10 calls, each with one more sentence than the previous call. I stored the results of each call in an array. So when the 10 calls were complete, I had 10 strings, ranging from one sentence in position 0, up to 10 sentences in the 9th position. Then I just iterated through the array, from 0 to length - 2, subtracting the string in the current position from the string at position [i + 1] (with string[i + 1].slice(string[i].length)), to get the nth string.
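The slicing step of this work-around can be sketched as follows. The sketch is in Python rather than JavaScript, it performs no API calls, and the two extract strings are invented for illustration; it assumes each extract is an exact prefix of the next one, which is what the work-around relies on.

```python
def split_cumulative(extracts):
    """extracts[i] holds the first i + 1 sentences; return the individual sentences."""
    sentences = []
    previous = ""
    for text in extracts:
        # Whatever follows the previous (shorter) extract is the new sentence.
        sentences.append(text[len(previous):].strip())
        previous = text
    return sentences


# Hypothetical results of two calls, with one and two sentences requested.
extracts = [
    "The Eagles were No. 1 in the U.S. in 1970.",
    "The Eagles were No. 1 in the U.S. in 1970. They toured widely.",
]
print(split_cumulative(extracts))
```

Starting from an empty string means the first sentence comes out of the same loop, so no special case is needed for position 0.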

How to treat numbers inside text strings when vectorizing words?

If I have a text string to be vectorized, how should I handle numbers inside it? Or if I feed a Neural Network with numbers and words, how can I keep the numbers as numbers?
I am planning on making a dictionary of all my words (as suggested here). In this case all strings will become arrays of numbers. How should I handle characters that are numbers? How do I output a vector that does not mix the word index with the number character?
Does converting numbers to strings weaken the information I feed the network?
Expanding your discussion with #user1735003 - Let's consider both ways of representing numbers:
1. Treating the number as a string, considering it as just another word and assigning an ID to it when forming the dictionary. Or
2. Converting the numbers to actual words: '1' becomes 'one', '2' becomes 'two' and so on.
Does the second one change the context in any way? To verify it, we can find the similarity of the two representations using word2vec. The scores will be high if they are used in similar contexts.
For example,
1 and one have a similarity score of 0.17, 2 and two have a similarity score of 0.23. They seem to suggest that the context of how they are used is totally different.
By treating the numbers as another word, you are not changing the context, but with any other transformation of those numbers you can't guarantee it's for the better. So it's better to leave them untouched and treat each one as another word.
Note: Both word2vec and GloVe were trained by treating the numbers as strings (case 1).
The link you provide suggests that everything resulting from a .split(' ') is indexed: words, but also numbers, possibly smileys, and so on. (I would still take care of punctuation marks.) Unless you have more prior knowledge about your data or your problem, you could start with that.
EDIT
Example literally using your string and their code:
corpus = {'my car number 3'}
dictionary = {}
i = 1  # next free index
for tweet in corpus:
    for word in tweet.split(" "):
        # Only new words get an index, so indices stay contiguous.
        if word not in dictionary:
            dictionary[word] = i
            i += 1
print(dictionary)
# {'my': 1, 'car': 2, 'number': 3, '3': 4}
The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf
Specifically, page 7.
Before falling back to an <unknown> tag, they try to replace alphanumeric symbol combinations with tags named after common patterns, such as:
FourDigits (good for years)
I've tried to implement it and it gave great results.
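A minimal sketch of that idea in Python follows. Only the FourDigits tag is named above; the other two tags, the vocabulary and the function name are assumptions made up for this illustration and are not taken from the paper.

```python
import re

# Pattern tags, checked in order. Only "FourDigits" comes from the text above;
# "TwoDigits" and "OtherNumber" are hypothetical additions.
PATTERNS = [
    (re.compile(r"\d{4}"), "FourDigits"),
    (re.compile(r"\d{2}"), "TwoDigits"),
    (re.compile(r"\d+"), "OtherNumber"),
]


def tag_token(token, vocabulary):
    """Keep known words, map number-like tokens to a pattern tag, else <unknown>."""
    if token in vocabulary:
        return token
    for pattern, tag in PATTERNS:
        if pattern.fullmatch(token):
            return tag
    return "<unknown>"


vocabulary = {"my", "car", "number"}
tokens = "my car number 1970 bought".split(" ")
print([tag_token(t, vocabulary) for t in tokens])
# ['my', 'car', 'number', 'FourDigits', '<unknown>']
```

The point of the ordering is that a token only becomes <unknown> after every pattern has failed, so years and other numbers share a handful of informative tags instead of collapsing into one.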

BACnet deserialization: How do I know when a new list element starts?

I'm implementing a generic BACnet decoder and came across the following question, of which I can't seem to find the answer within the BACnet standard. The chapter "20.2.1.3.2 Constructed Data" does not answer my question, or I might not fully understand it.
Let's assume I have a List (SEQUENCE OF) with elements of type Record (SEQUENCE).
Said record has 4 fields, identified by context tag, where field 0 and 1 are optional.
I further assume that the order, in which those fields are serialized, can be arbitrary (because they're identified by their context tags).
The data could look like that (number indicates field / column):
[{ "3", "0", "2" }, { "1", "2", "3" }]
Over the wire, the only "structure information" I assume I get are the open / close tags for the list.
That means:
Open Tag List
ctxTagColumn3, valueColumn3,
ctxTagColumn0, valueColumn0,
ctxTagColumn2, valueColumn2,
ctxTagColumn1, valueColumn1,
ctxTagColumn2, valueColumn2,
ctxTagColumn3, valueColumn3
Close Tag List
How do I know, after I've read the last column data ("2") of my first list item, that I must begin decoding the second item, starting with a value for column "1"?
Which of my assumptions is wrong?
Thank you and kind regards
Pascal
The order of the elements of a SEQUENCE is always known and, by definition, shall not be arbitrary. Further, not all conceivable combinations are possible to encode. Regarding BACnet, all type definitions shall be universally decodable.
Assuming I understand you correctly; the "order" cannot be "arbitrary"; i.e.:
SEQUENCE = *ordered* collection of variables of **different** types
SEQUENCE OF = *ordered* collection of variables of **same** type
The tag number of an item's (SD) context tag will differ from that of the containing (PD) context tag (possibly an incremented value, maybe +1), so you could check for that. Better still, look at the length/value/type field of the tag: if its value is <= 5 (a 'length' value), it is an SD context tag for one of your items, rather than the closing PD context tag (a 'type' value) that delimits the end of your items.
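To make the length/value/type check concrete, here is a small Python sketch that decodes the initial octet of a BACnet tag: the upper four bits carry the tag number, bit 3 the class (context-specific when set), and the lower three bits the length/value/type field, where 6 and 7 mark opening and closing tags. The names Tag and decode_initial_octet are invented for this example, and extended tag numbers and extended lengths are deliberately left out.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    number: int             # tag number (0..14 in this sketch)
    context_specific: bool  # True for context tags, False for application tags
    kind: str               # "opening", "closing" or "primitive"
    lvt: int                # raw length/value/type field (0..7)


def decode_initial_octet(octet):
    """Decode a single-octet BACnet tag header (no extended tag numbers)."""
    number = octet >> 4
    if number == 0x0F:
        raise ValueError("extended tag numbers are not handled in this sketch")
    context_specific = bool(octet & 0x08)
    lvt = octet & 0x07
    if context_specific and lvt == 6:
        kind = "opening"
    elif context_specific and lvt == 7:
        kind = "closing"
    else:
        kind = "primitive"  # lvt <= 5: a length (5 = extended length follows)
    return Tag(number, context_specific, kind, lvt)


print(decode_initial_octet(0x0E))  # opening tag, context tag number 0
print(decode_initial_octet(0x39))  # context tag number 3, one octet of data
print(decode_initial_octet(0x0F))  # closing tag, context tag number 0
```

A decoder built on this can tell item data from the end of the list by checking kind: it keeps reading primitive context tags until it meets the closing tag whose number matches the opening one.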

Converting Ordered Collection to Literal Array

I have an ordered collection that I would like to convert into a literal array. Below is the ordered collection and the desired result, respectively:
an OrderedCollection(1 2 3)
#(1 2 3)
What would be the most efficient way to achieve this?
The message asArray will create an Array from the OrderedCollection:
anOrderedCollection asArray
and this is probably what you want.
However, given that you say that you want a literal array it might happen that you are looking for the string '#(1 2 3)' instead. In that case I would use:
^String streamContents: [:stream | aCollection asArray storeOn: stream]
where aCollection is your OrderedCollection.
In case you are not yet familiar with streamContents: this could be a good opportunity to learn it. What it does in this case is equivalent to:
stream := '' writeStream.
aCollection asArray storeOn: stream.
^stream contents
in the sense that it captures the pattern:
stream := '' writeStream.
<some code here>
^stream contents
which is fairly common in Smalltalk.
UPDATE
Maybe it would help if we clarify a little what we mean by literal arrays in Smalltalk. Consider the following two methods:
method1
    ^Array with: 1 with: 2 with: 3
method2
    ^#(1 2 3)
Both methods answer with the same array, the one with entries 1, 2 and 3. However, the two implementations are different. In method1 the array is created dynamically (i.e., at runtime). In method2 the array is created statically (i.e., at compile time). In fact, when you accept (and therefore compile) method2, the array is created and saved into the method. In method1, instead, there is no stored array and the result is created every time the method is invoked.
Therefore, you would only need to create the string '#(1 2 3)' (i.e., the literal representation of the array) if you were generating Smalltalk code dynamically.
You can not convert an existing object into a literal array. To get a literal array you'd have to write it using the literal array syntax in your source code.
However, I believe you just misunderstood what literal array means, and you are in fact just looking for an array.
A literal array is just an array that (in Pharo and Squeak [1]) is created at compile time, that is, when you accept the method.
To turn an ordered collection into an array you use asArray.
Just inspect the results of
#(1 2 3).
(OrderedCollection with: 1 with: 2 with: 3) asArray.
You'll see that both are equal.
[1]: see here for an explanation: https://stackoverflow.com/a/29964346/1846474
In Pharo 5.0 (a beta release) you can do:
| oc ary |
oc := OrderedCollection new: 5.
oc addAll: #( 1 2 3 4 5).
Transcript show: oc; cr.
ary := oc asArray.
Transcript show: ary; cr.
The output on the transcript is:
an OrderedCollection(1 2 3 4 5)
#(1 2 3 4 5)
The literalArray encoding is a kind of "poor man's" persistency encoding: it produces a representation from which the object can be reconstructed out of a compilable literal array, i.e. an Array of literals which, by using decodeAsLiteralArray, reconstructs the object.
It is not a general mechanism, but was mainly invented to store UI specifications in a method (see UIBuilder).
Only a small subset of classes support this kind of encoding/decoding, and I am not sure if OrderedCollection does it in any dialect.
In the one I use (ST/X), it does not, and I get a doesNotUnderstand.
However, it would be relatively easy to add the required encoder/decoder and make it possible.
But, as I said, its intended use is for UI specs, not as a general persistency (compiled-object persistency) mechanism. So I would rather not recommend using it for that.

Difference between the ID of a pdf read from iTextSharp and pdf.js

I am trying to parse the ID of a particular pdf (this) using iTextSharp as mentioned in this answer. But I get a null array for the ID, whereas I can see that another PDF reader (pdf.js) reads the ID as 77a2a5c4fc17dc3a91a072c46fe69ec0. Why does the behaviour differ? Am I expected to read the ID field from some place other than the trailer?
Open a regular PDF with an ID in a text editor like this:
Right before where it says startxref, you see a dictionary (it starts with <<). That's the trailer dictionary of the PDF. One of the (optional) entries is the /ID which is an array containing two PDF strings.
If your PDF has such an entry, then the answer to the question Extract ID of a PDF document using iTextSharp won't return null.
Now open your PDF in a text editor:
Again you see a dictionary (the trailer dictionary) before startxref. However, in this case, the dictionary only has three entries: /Size (the number of objects in the cross-reference table), /Info (a reference to the dictionary containing the metadata) and /Root (a reference to the catalog dictionary).
There is no /ID entry, hence iText (and iTextSharp) should return null (and you confirmed that they do).
Now search for the value 77a2a5c4fc17dc3a91a072c46fe69ec0 in the PDF you've opened in a text editor. You won't find that value anywhere because it's just not there!
Summarized: your question Am I expected to read the ID field from some place other than the trailer? is wrong. You are asking how to read something that isn't there. Your question should be: Why is pdf.js creating an ID for PDFs that don't have one, and how do I retrieve it? The answer to the first part is reasonable: even iText tries to create an /ID when you manipulate a PDF because it's good practice for a PDF to have an ID. The answer to the second part is: look in the trailer (but you already knew that).
Conclusion: based on feedback in the comments, it turns out that the OP is using the fingerprint() method in pdf.js. This method returns the first element of the ID if an ID is present. If no ID is found, an MD5 hash is returned. See the source code of the fingerprint() method in pdf.js.
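The fallback behaviour can be imitated with a short Python sketch. This is not the pdf.js code: it looks for the /ID entry with a regular expression instead of parsing the trailer, it only handles IDs written as hex strings, and hashing just the first 1024 bytes of the file is an assumption about what pdf.js does, so the result is not guaranteed to match the value pdf.js reports.

```python
import hashlib
import re

# First element of an /ID array written as a hex string, e.g. /ID [<77A2...> <...>]
ID_ENTRY = re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>")

# Assumption: the hash covers only the start of the file.
FINGERPRINT_FIRST_BYTES = 1024


def fingerprint(pdf_bytes):
    """Return the first /ID element if present, else an MD5 hex digest."""
    matches = ID_ENTRY.findall(pdf_bytes)
    if matches:
        # Use the last occurrence, i.e. the most recently written trailer.
        return matches[-1].decode("ascii").lower()
    return hashlib.md5(pdf_bytes[:FINGERPRINT_FIRST_BYTES]).hexdigest()


# Hypothetical trailer fragments, with and without an /ID entry.
with_id = b"trailer\n<< /Size 3 /Root 1 0 R /ID [<77A2A5C4> <77A2A5C4>] >>\nstartxref\n"
without_id = b"trailer\n<< /Size 3 /Info 2 0 R /Root 1 0 R >>\nstartxref\n"
print(fingerprint(with_id))     # 77a2a5c4
print(fingerprint(without_id))  # a 32-character MD5 hex digest
```

This shows why a reader can report an "ID" for a file whose trailer has none: the value is computed from the file contents, not read from the trailer.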