About Lucene index postings list, why are all deltas between 0 and 255 during FOR encoding?

From this blog, I learned that postings lists are split into blocks of 256 docs and each block is then compressed separately. But what if a term's postings list is [72, 373]? Does Lucene do anything to avoid deltas greater than 255, like altering the doc sequence so the docs get appropriate doc ids?

The article doesn't say there is a limit of 256 on deltas; only the example in the article happens to use small ones.
Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits.
For example, if a postings list contains doc ids 1 to 256, i.e. [1, 2, 3, ..., 256], the delta-encoded block would be [1, 1, 1, ..., 1], which means the block needs only 1 bit per doc id.
Taking the example in your question, [72, 373, ...], the delta-encoded block would be [72, 301, ...], which means the block needs 9 bits per doc id (assuming 301 is the largest delta in the block).
Hope that clears it up.
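To make the arithmetic concrete, here is a minimal C sketch (my own illustration, not Lucene's actual code) that delta-encodes a block of doc ids and computes the bit width a FOR-style encoder would record in the block header:

#include <stdio.h>
#include <stdint.h>

/* Number of bits needed to represent v (at least 1). */
static int bits_needed(uint32_t v) {
    int bits = 1;
    while (v >>= 1) bits++;
    return bits;
}

int main(void) {
    uint32_t docs[] = {72, 373};   /* the posting list from the question */
    size_t n = sizeof docs / sizeof docs[0];
    uint32_t prev = 0, max_delta = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t delta = docs[i] - prev;  /* first delta is the doc id itself */
        prev = docs[i];
        if (delta > max_delta) max_delta = delta;
        printf("delta[%zu] = %u\n", i, delta);
    }
    printf("bits per delta for this block: %d\n", bits_needed(max_delta));
    return 0;
}

Run on [72, 373] it prints deltas 72 and 301 and a width of 9 bits, matching the example above.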

Related

DEFLATE: how to handle "no distance codes" case?

I mostly get RFC 1951; however, I'm not too clear on how to handle the case where (when using dynamic Huffman tables) no distance codes are needed or present. For example, let's take the input:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890987654321ZYXWVUTSR
where no backreference is possible since there are no repetitions of length >= 3.
According to RFC 1951, at least one distance code must be present regardless, otherwise it wouldn't be possible to encode HDIST - 1. I understand, according to the reference, that such a code should be of zero bits to signal "no distance codes".
One distance code of zero bits means that there are no distance codes
used at all (the data is all literals).
In infgen symbols, I'd expect to see a dist 0 0.
Analyzing what gzip does with infgen, however, I see that TWO distance codes are emitted (each 1 bit long) for the above input, even though neither is actually used:
! infgen 2.4 output
!
gzip
!
last
dynamic
litlen 48 6
litlen 49 6
litlen 50 6
...cut...
litlen 121 6
litlen 122 6
litlen 256 6
dist 0 1
dist 1 1
literal 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890987654321Z
literal 'YXWVUTSR
end
!
crc
length
So what's the correct behavior in these cases?
If there are no matches in the deflate block, there will be no lengths from the length/literal code, and so the decoder will never look for a distance code. In that case, what would make the most sense is to provide no information at all about a distance code.
However the format does not permit that, since the 5-bit HDIST value in the header is interpreted as 1 to 32 distance codes, for which lengths must be provided in the header. You must provide at least one distance code length in the header, even though it will never be used.
There are several valid things you can do in that case. RFC 1951 notes you can provide a single distance code (HDIST == 0, meaning one length), with length zero, which would be just one zero in the list of lengths.
It is also permitted to provide a single code of length one, or you could do as zlib is doing, which is to provide two codes of length one. You can actually put any valid distance code description you like there, and it will still be accepted.
As to why zlib's deflate chooses to define two codes there, I can only guess that Jean-loup was being conservative, writing something he knew even an over-simplified inflator would have to accept. gzip and zopfli do the same thing, and all three behave the same way when exactly one distance code is used: they could emit just the single one-bit distance code, per the RFC, but instead they emit two single-bit distance codes, one of which is never used.
Really the right thing to do would be to write a single zero length as noted in the RFC, which would take the fewest number of bits in the header. I will consider updating zlib to do that, to eke out a few more bits of compression.
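To make the options concrete, here is a small C sketch (illustrative only, not zlib code) of the three valid distance-code-length lists discussed above, together with the HDIST value each implies:

#include <stdio.h>

int main(void) {
    /* Each array lists the code lengths, in bits, for distance symbols
       0, 1, ... as they would appear in the dynamic block header. */
    int rfc_style[]  = {0};     /* one code, length zero: "no distance codes"
                                   (infgen: dist 0 0) -- fewest header bits */
    int one_code[]   = {1};     /* one code of length one; also accepted */
    int zlib_style[] = {1, 1};  /* what zlib/gzip/zopfli emit
                                   (infgen: dist 0 1 and dist 1 1) */

    /* HDIST is stored as (number of lengths - 1) in 5 bits */
    printf("HDIST: rfc=%d, one=%d, zlib=%d\n",
           (int)(sizeof rfc_style / sizeof *rfc_style) - 1,
           (int)(sizeof one_code / sizeof *one_code) - 1,
           (int)(sizeof zlib_style / sizeof *zlib_style) - 1);
    return 0;
}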

How to correctly understand TrueType cmap's subtable Format 4?

The following is the information which the TrueType font format documentation provides with regard to the fields of the "Format 4: Segment mapping to delta values" subtable format, which may be used in the cmap font table (the one used for mapping character codes to glyph indices):
Type Name Description
1. uint16 format Format number is set to 4.
2. uint16 length This is the length in bytes of the subtable.
3. uint16 language For requirements on use of the language field, see “Use of the language field in 'cmap' subtables” in this document.
4. uint16 segCountX2 2 × segCount.
5. uint16 searchRange 2 × (2**floor(log2(segCount)))
6. uint16 entrySelector log2(searchRange/2)
7. uint16 rangeShift 2 × segCount - searchRange
8. uint16 endCode[segCount] End characterCode for each segment, last=0xFFFF.
9. uint16 reservedPad Set to 0.
10. uint16 startCode[segCount] Start character code for each segment.
11. int16 idDelta[segCount] Delta for all character codes in segment.
12. uint16 idRangeOffset[segCount] Offsets into glyphIdArray or 0
13. uint16 glyphIdArray[ ] Glyph index array (arbitrary length)
(Note: I numbered the fields as to allow referencing them)
Most fields, such as 1. format, 2. length, 3. language, 9. reservedPad, are trivial basic info and easily understood.
The other fields, 4. segCountX2, 5. searchRange, 6. entrySelector, 7. rangeShift, I see as a somewhat odd way to store precomputed values, basically just a redundant (implicit) way of storing the number of segments segCount. Those fields also give me no major headache.
Lastly there remain the fields that represent arrays. Per segment there is a field 8. endCode, 10. startCode, 11. idDelta and 12. idRangeOffset, and there might or might not be a field 13. glyphIdArray. Those are the fields I still struggle to interpret correctly, and they are what this question is about.
To allow for a most helpful answer allow me to sketch quickly my take on those fields:
Working basically segment by segment, each segment maps character codes from startCode to endCode to the indices of the font's glyphs (reflecting the order in which they appear in the glyf table).
having the character code as input
having the glyph index as output
the segment is determined by iterating through the segments, checking that the input value lies inside the range startCode to endCode.
with the segment found, the respective fields idRangeOffset and idDelta are determined as well.
idRangeOffset conveys a special meaning
case A) idRangeOffset being set to the special value 0 means that the output can be calculated from the input value (character code) and idDelta. (I think it is either glyphId = inputCharCode + idDelta or glyphId = inputCharCode - idDelta)
case B) idRangeOffset not being 0 means something different happens, which is part of what I seek an answer about here.
With respect to case B) the documentation states:
If the idRangeOffset value for the segment is not 0, the mapping of
character codes relies on glyphIdArray. The character code offset from
startCode is added to the idRangeOffset value. This sum is used as an
offset from the current location within idRangeOffset itself to index
out the correct glyphIdArray value. This obscure indexing trick works
because glyphIdArray immediately follows idRangeOffset in the font
file. The C expression that yields the glyph index is:
glyphId = *(idRangeOffset[i]/2
+ (c - startCode[i])
+ &idRangeOffset[i])
which I think provides a way to map a continuous input range (hence "segment") to a list of values stored in the field glyphIdArray, possibly as a way to provide output values that cannot be computed via idDelta because they are unordered/non-consecutive. This at least is my reading of what the documentation itself calls an "obscure" indexing trick.
Because glyphIdArray[] follows idRangeOffset[] in the TrueType file, the code segment in question
glyphId = *(&idRangeOffset[i]
+ idRangeOffset[i]/2
+ c - startCode[i])
points to the memory address of the desired position in glyphIdArray[]. To elaborate on why:
&idRangeOffset[i] points to the memory address of idRangeOffset[i]
moving forward idRangeOffset[i] bytes (or idRangeOffset[i]/2 uint16's) brings you to the relevant section of glyphIdArray[]
c - startCode[i] is the position in glyphIdArray[] that contains the desired ID value
From here, in the event that this ID is not zero, you will add idDelta[i] to obtain the glyph number corresponding to c.
It is important to point out that *(&idRangeOffset[i] + idRangeOffset[i]/2 + (c - startCode[i])) is really pseudocode: the address arithmetic is relative to the position of idRangeOffset[i] within the font file (or a buffer holding it), not to an arbitrary location in your program's memory.
In a more modern language without pointers, the above code segment translates to:
glyphIndexArray[i - segCount + idRangeOffset[i]/2 + (c - startCode[i])]
The &idRangeOffset[i] in the original code segment has been replaced by i - segCount (where segCount = segCountX2/2). This is because the range offset (idRangeOffset[i]/2) is relative to the memory address &idRangeOffset[i].
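Putting both cases together, here is a hedged C sketch of the whole format 4 lookup. It assumes the subtable's arrays have already been parsed into the struct below, with idRangeOffset and glyphIdArray still contiguous exactly as they are laid out in the font file; the struct and function names are mine, not from any library:

#include <stdint.h>

typedef struct {
    uint16_t segCount;
    const uint16_t *endCode;        /* segCount entries */
    const uint16_t *startCode;      /* segCount entries */
    const int16_t  *idDelta;        /* segCount entries */
    const uint16_t *idRangeOffset;  /* segCount entries, glyphIdArray follows */
} CmapFormat4;

uint16_t lookup_glyph(const CmapFormat4 *t, uint16_t c) {
    /* Linear scan for clarity; a real implementation would binary-search
       using searchRange/entrySelector/rangeShift. */
    for (uint16_t i = 0; i < t->segCount; i++) {
        if (c <= t->endCode[i] && c >= t->startCode[i]) {
            if (t->idRangeOffset[i] == 0) {
                /* case A: glyph id computed directly, modulo 65536 */
                return (uint16_t)(c + t->idDelta[i]);
            }
            /* case B: the offset is in bytes, relative to the address of
               idRangeOffset[i] itself; dividing by 2 converts it to a
               count of uint16 slots, as in the spec's pointer expression */
            const uint16_t *p = &t->idRangeOffset[i]
                              + t->idRangeOffset[i] / 2
                              + (c - t->startCode[i]);
            uint16_t glyph = *p;
            if (glyph == 0) return 0;   /* missing glyph */
            return (uint16_t)(glyph + t->idDelta[i]);
        }
    }
    return 0; /* .notdef */
}

Note the cast back to uint16_t after adding idDelta: per the spec, idDelta arithmetic is modulo 65536, which also answers the +/- question for case A: the delta is added.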

How to manipulate bits in Smalltalk?

I am currently working on a file compressor based on Huffman decoding. So I have a decoding tree (shown as an image in the original post), and I have to encode this tree on an output file following certain criteria:
"for each leaf, write out a 0 bit, followed by the 8 bits of
the corresponding character. Write out the bits in the order bit 7, bit 6, . . ., bit 0, that is high bit first. As a special case, if the byte is 0, write out bit 8, which will be a 0 for a byte value of 0, and 1 for a byte value of 256 (the EOF marker)." For an internal node, just write a bit 1.
So what I plan to do is create a bit array and add the corresponding bits to it in the specified format. The problem is that I don't know how to convert a number to binary in Smalltalk.
For example, if I want to encode the first leaf, I would want to produce something like 001101011, i.e. a 0 followed by the bit representation of k, and then add every bit one by one into the array.
I don't know which dialect you are using exactly, but generally you can access the bits of an Integer. They are modelled as if the representation were two's complement, with an infinite sequence of bits.
2 is ....0000000000010
1 is ....0000000000001
0 is ....0000000000000 with infinitely many 0 on the left
-1 is ....1111111111111 with infinitely many 1 on the left
-2 is ....1111111111110
This is also true for LargeIntegers: even though they are generally implemented as sign-magnitude (the class encodes the sign), two's complement behaviour is emulated.
Then you can operate with bitAnd:, bitOr:, bitXor:, bitInvert, bitShift:, and in some flavours bitAt:put:.
You can access the bits with (2 bitAt: index), where the index starts at 1 for the least significant bit and grows toward more significant bits. If it's missing, implement it with bitAnd: and bitShift:.
For positive integers, you can ask for the rank of the high bit (2 highBit).
All these operations create a new Integer (no in-place modification is possible).
Conceptually, a ByteArray is a collection of unsigned 8-bit integers (between 0 and 255), so you can implement a bit array on top of one (if it does not already exist in the dialect). Or you can use an Integer (but you won't be able to control the size, which is conceptually infinite, nor modify it in place; every operation costs a copy).
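The dialect-specific selectors aside, the bit-writing itself is language-neutral. Here is a small C sketch (names are mine, not from any Smalltalk library) of a bit array that appends bits high-bit-first into a byte buffer, as the assignment requires:

#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint8_t buf[1024];  /* zero-initialized output buffer */
    size_t  nbits;      /* number of bits written so far */
} BitWriter;

static void put_bit(BitWriter *w, int bit) {
    if (bit)
        w->buf[w->nbits / 8] |= (uint8_t)(0x80 >> (w->nbits % 8));
    w->nbits++;
}

/* write the low `count` bits of v, high bit first (bit 7, bit 6, ..., bit 0) */
static void put_bits(BitWriter *w, unsigned v, int count) {
    for (int i = count - 1; i >= 0; i--)
        put_bit(w, (v >> i) & 1);
}

int main(void) {
    BitWriter w = {{0}, 0};
    put_bit(&w, 0);          /* leaf marker */
    put_bits(&w, 'k', 8);    /* the 8 bits of 'k' = 01101011 */
    printf("%zu bits, first byte = %02X\n", w.nbits, w.buf[0]);
    return 0;
}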

Structure Packing

I'm currently learning C# and my first project (as a learning experiment) is to create a DBF reader. I'm having some difficulty understanding "packing" according to this: http://www.developerfusion.com/pix/articleimages/dec05/structs1.jpg
If I specified a packing of 2, wouldn't all structure elements begin on a 2-byte boundary, and if I specified a packing of 4, wouldn't all structure elements begin on a 4-byte boundary, and also consume a minimum of 4 bytes each?
For instance, a byte element would be placed on a 4 byte boundary, and the element following it (in a sequential layout) would be located on the next 4-byte boundary (losing 3 bytes to padding)?
In the image shown, in the "pack=4" it shows a byte that is on a 2 byte boundary, following a short.
If I understand the picture correctly, pack equal to n means that one variable cannot be stored "across" two packs of length n. In other words, the bytes that compose a variable cannot cross a pack's boundary. This is only true if the size of the variable is less than or equal to the size of a pack.
Let's take Pack = 4 as an example. Here we can safely store a byte and a short in one pack, because together they require 3 bytes of memory. But since only one byte is left in the pack, storing an int requires one byte of padding first, because what's left in the pack is too little to hold the whole int.
I hope the explanation makes sense.
Looking at the picture again, I think it would be better if all data were aligned to the same side of a pack, either the bottom or the top. That would make it clearer what's going on.
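For instance, here is a small C sketch of that Pack = 4 case, using #pragma pack, the C analogue of C#'s [StructLayout(LayoutKind.Sequential, Pack = 4)]; the offsets it prints match the picture's short + byte + padding + int arrangement:

#include <stdio.h>
#include <stddef.h>

#pragma pack(push, 4)
struct Packed4 {
    short s;         /* offset 0..1 */
    unsigned char b; /* offset 2 -- a byte on a 2-byte boundary, after the short */
    int i;           /* offset 4 -- pushed into the next pack by 1 padding byte */
};
#pragma pack(pop)

int main(void) {
    printf("s at %zu, b at %zu, i at %zu, total %zu bytes\n",
           offsetof(struct Packed4, s),
           offsetof(struct Packed4, b),
           offsetof(struct Packed4, i),
           sizeof(struct Packed4));   /* expect: 0, 2, 4, 8 */
    return 0;
}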

Advice for bit level manipulation

I'm currently working on a project that involves a lot of bit-level manipulation of data, such as comparison, masking and shifting. Essentially I need to search through chunks of bitstreams 8-32 kbytes long for bit patterns 20-40 bytes long.
Does anyone know of general resources for optimizing for such operations in CUDA?
There have been at least a couple of questions on SO about how to do text searches with CUDA, that is, finding instances of short byte-strings in long byte-strings. That is similar to what you want to do: a byte-string search is much like a bit-string search in which the number of bits in the pattern can only be a multiple of 8 and the algorithm only checks for matches every 8 bits. Search SO for CUDA string searching or matching, and see if you can find them.
I don't know of any general resources for this, but I would try something like this:
Start by preparing 8 versions of each of the search bit-strings, each shifted by a different number of bits. Also prepare start and end masks:
start
01111111
00111111
...
00000001
end
10000000
11000000
...
11111110
Then, essentially, perform byte-string searches with the different bit-strings and masks.
If you're using a device with compute capability >= 2.0, store the shifted bit-strings in global memory. The start and end masks can probably just be constants in your program.
Then, for each byte position, launch 8 threads, each of which checks a different one of the 8 shifted bit-strings against the long bit-string (which you now treat like a byte-string). In each block, launch enough threads to check, for instance, 32 bytes, so that the total number of threads per block becomes 32 * 8 = 256. The L1 cache should be able to hold the shifted bit-strings for each block, so that you get good performance.
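As a rough illustration of the preparation step, here is a host-side C sketch (names are mine, not from any CUDA sample) that builds the 8 shifted copies of a pattern and the matching start masks; the device-side comparison kernel is not shown:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAT 40  /* patterns are 20-40 bytes per the question */

/* shift a byte-string right by `shift` bits (1..7) into out (len+1 bytes) */
static void shift_pattern(const uint8_t *pat, size_t len, int shift,
                          uint8_t *out) {
    uint8_t carry = 0;
    for (size_t i = 0; i < len; i++) {
        out[i] = carry | (uint8_t)(pat[i] >> shift);
        carry = (uint8_t)(pat[i] << (8 - shift));  /* bits pushed out */
    }
    out[len] = carry;   /* a shifted pattern spills into one extra byte */
}

int main(void) {
    const uint8_t pat[] = {0xDE, 0xAD, 0xBE, 0xEF};  /* example pattern */
    uint8_t shifted[8][MAX_PAT + 1];

    memcpy(shifted[0], pat, sizeof pat);   /* shift 0: the pattern itself */
    shifted[0][sizeof pat] = 0;
    for (int s = 1; s < 8; s++) {
        shift_pattern(pat, sizeof pat, s, shifted[s]);
        /* start mask 0xFF >> s ignores the s leading bits of the first
           byte, matching the 01111111 ... 00000001 list above */
        printf("shift %d: first byte %02X, start mask %02X\n",
               s, shifted[s][0], 0xFF >> s);
    }
    return 0;
}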