Static data-heavy Rust library seems bloated - size

I've recently been developing a Rust library to provide fast access to a large database (the Unicode Character Database, which is 160MB as a flat XML file). I also want it to have a small footprint, so I've used various approaches to reduce the size. The end result is that I have a series of static slices that look like:
#[derive(Clone, Copy, Eq, PartialEq, Debug)]
pub enum UnicodeCategory {
    UppercaseLetter,
    LowercaseLetter,
    TitlecaseLetter,
    ModifierLetter,
    OtherLetter,
    NonspacingMark,
    SpacingMark,
    EnclosingMark,
    DecimalNumber,
    // ...
}
pub static UCD_CAT: &'static [((u8, u8, u8), (u8, u8, u8), UnicodeCategory)] =
    &[((0, 0, 0), (0, 0, 31), UnicodeCategory::Control),
      ((0, 0, 32), (0, 0, 32), UnicodeCategory::SpaceSeparator),
      ((0, 0, 33), (0, 0, 35), UnicodeCategory::OtherPunctuation),
      /* ... */];
// ...
pub static UCD_DECOMP_MAP: &'static [((u8, u8, u8), &'static [(u8, u8, u8)])] =
    &[((0, 0, 160), &[(0, 0, 32)]),
      ((0, 0, 168), &[(0, 0, 32), (0, 3, 8)]),
      ((0, 0, 170), &[(0, 0, 97)]),
      ((0, 0, 175), &[(0, 0, 32), (0, 3, 4)]),
      ((0, 0, 178), &[(0, 0, 50)]),
      /* ... */];
In total, all the data should take up around 600kB at most (allowing extra space for alignment, etc.), but the library produced is 3.3MB in release mode. The source code itself (almost all data) is 2.6MB, so I don't understand why the result would be larger. I don't think the extra size is intrinsic, as the library was <50kB at the beginning of the project (when I only had ~2kB of data). If it makes a difference, I'm also using the #![no_std] attribute.
Is there any reason for the extra binary bloat, and is there a way to reduce the size? In theory I don't see why I shouldn't be able to reduce the library to a megabyte or less.
As per Matthieu's suggestion, I tried analysing the binary with nm.
Because all my tables were represented as borrowed slices, this wasn't very useful for calculating table sizes, as they all appeared as anonymous _refs. What I could determine was the maximum address, 0x1208f8, which would be consistent with a filesize of ~1MB rather than 3.3MB. I also looked through the hex dump to see if there were any null blocks that might explain it, but there weren't.
To see if the borrowed slices were the problem, I turned them into plain arrays ([T; N] form). The filesize didn't change much, but now I could interpret the nm data quite easily. Weirdly, the tables took up exactly as much space as I expected (even more weirdly, they matched my lower bounds without accounting for alignment, and there was no space between the tables).
I also looked at the tables with nested borrowed slices, e.g. UCD_DECOMP_MAP above. When I removed all of these (about 2/3 of the data), the filesize was ~1MB when it should have only been ~250kB (by my calculations and the highest nm address, 0x3d1d0), so it doesn't look like these tables were the problem either.
I tried extracting the individual files from the .rlib file (which is a simple ar-format archive). It turns out that 40% of the library is just metadata files, and the actual object file is 1.9MB. Further, when I do this to the library without the borrowed references, the object file is 261kB! I then went back to the original library, looked at the sizes of the individual _refs, and found that for a table like UCD_DECOMP_MAP: &'static [((u8,u8,u8),&'static [(u8,u8,u8)])], each value of type ((u8,u8,u8),&'static [(u8,u8,u8)]) takes up 24 bytes: 3 bytes for the u8 triplet, 5 bytes of padding, and 16 bytes for the slice reference (a pointer plus a length). As a result, these tables take up a lot more room than I had thought. I think I can now fully account for the filesize.
Of course, 3MB is still quite small, I just wanted to keep the file as small as possible!

Thanks to Matthieu M. and Chris Emerson for pointing me towards the solution. This is a summary of the updates in the question, sorry for the duplication!
It seems that there are two reasons for the supposed bloat:
The .rlib file that is output is not a pure object file but an ar archive. Usually such an archive would consist entirely of one or more object files, but Rust also includes metadata, partly to obviate the need for separate header files. This accounted for around 40% of the final filesize.
My calculations turned out to not be accurate for some of the tables, which also happened to be the largest ones. Using nm I was able to find that for normal tables such as UCD_CAT: &'static [((u8,u8,u8), (u8,u8,u8), UnicodeCategory)], the size was 7 bytes for each item (which is actually less than I originally anticipated, assuming 8 bytes for alignment). The total of all these tables was about 230kB, and the object file including just these came in at 260kB (after extraction), so this was all consistent.
However, examining the nm output more closely for the other tables (such as UCD_DECOMP_MAP: &'static [((u8,u8,u8),&'static [(u8,u8,u8)])]) was more difficult because they appear as anonymous borrowed objects. Nevertheless, it turned out that each ((u8,u8,u8),&'static [(u8,u8,u8)]) actually takes up 24 bytes: 3 bytes for the first tuple, 5 bytes of padding, and an unexpected 16 bytes for the pointer. I believe this is because slice references are fat pointers: alongside the data pointer they also store the length of the referenced array. This added around a megabyte of bloat to the library, but it does account for the entire filesize.
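Both per-entry sizes can be checked directly with std::mem::size_of; a minimal sketch (assumes a 64-bit target, and uses a hypothetical stand-in enum, since any small fieldless enum occupies one byte):

```rust
use std::mem::size_of;

// Stand-in for UnicodeCategory: any small fieldless enum is one byte.
#[derive(Clone, Copy)]
enum Category { Letter, Mark, Number }

fn main() {
    // A slice reference is a "fat" pointer: data pointer + length,
    // i.e. 2 * 8 bytes on a 64-bit target.
    assert_eq!(size_of::<&'static [(u8, u8, u8)]>(), 16);

    // Plain table entry: two u8 triplets plus a one-byte enum = 7 bytes,
    // with no padding, since every field has alignment 1.
    assert_eq!(size_of::<((u8, u8, u8), (u8, u8, u8), Category)>(), 7);

    // Entry holding a nested slice: 3 bytes of u8s + 5 bytes of padding
    // (the fat pointer needs 8-byte alignment) + 16 bytes = 24 bytes.
    assert_eq!(size_of::<((u8, u8, u8), &'static [(u8, u8, u8)])>(), 24);
}
```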

Related

Correctly computing index for offsetting into frame-resource buffers

Let's say, for a rendering setup, that we have a renderer that can be either double- or triple-buffered, and that this buffering can be changed dynamically at runtime. So, at any time, there are 2 or 3 frames in flight (FIF).
Frame resources, like buffers, are duplicated to match the FIF count. Buffers are created large enough to hold (frame-data-size * FIF-count), and reads/writes into those buffers are offset accordingly.
For the purpose of offsetting into buffers, is it good enough to use a monotonically cycling index, which would go like so:
double buffer: 0, 1, 0, 1, 0, 1, ...
triple buffer: 0, 1, 2, 0, 1, 2, ...
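In code, the cycling index would look something like this (a minimal sketch; all names are mine, not from any real renderer):

```rust
/// Monotonically cycling frame index: 0, 1, ..., fif_count - 1, 0, 1, ...
struct FrameIndex {
    current: usize,
    fif_count: usize,
}

impl FrameIndex {
    fn new(fif_count: usize) -> Self {
        Self { current: 0, fif_count }
    }

    /// Return the current slot and advance, wrapping at fif_count.
    fn advance(&mut self) -> usize {
        let i = self.current;
        self.current = (self.current + 1) % self.fif_count;
        i
    }

    /// On a FIF-count change: (after waiting on the GPU) restart at 0.
    fn reset(&mut self, fif_count: usize) {
        self.current = 0;
        self.fif_count = fif_count;
    }
}

fn main() {
    let mut idx = FrameIndex::new(2);
    let seq: Vec<usize> = (0..4).map(|_| idx.advance()).collect();
    assert_eq!(seq, vec![0, 1, 0, 1]); // double buffer

    idx.reset(3);
    let seq: Vec<usize> = (0..6).map(|_| idx.advance()).collect();
    assert_eq!(seq, vec![0, 1, 2, 0, 1, 2]); // triple buffer
}
```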
Then, if a change is made to the FIF count at runtime, we first WaitIdle() on the GPU, and then reset this index to 0.
Is this a safe way of offsetting into buffers, such that we don't trample on data still being used by the GPU?
I'm particularly unsure how this may play with a triple-buffer setup with a swapchain mailbox present-mode.
we first WaitIdle() on the GPU, and then reset this index to 0.
Presumably, you're talking about vkDeviceWaitIdle (which is a function you should almost never use). If so, then as far as non-swapchain assets are concerned, this is safe. vkDeviceWaitIdle will halt the CPU until the GPU device has done all of the work it has been given.
The rules for swapchains don't change just because you waited. You still need to acquire the next image before trying to use it and so forth.
However, waiting doesn't really make sense here. If all you do is just reset the index to 0, that means you didn't reallocate any memory. So if you went from 3 buffers to 2, you still have three buffers worth of storage, and you're only using two of them.
So what did your wait accomplish? You can just keep cycling through your 3 buffers' worth of storage even if you only have 2 swapchain images.
The only point in waiting would be if you need to release storage (or allocate more if you're going from 2 to 3 buffers). And even then, the only reason to wait is if it is absolutely imperative to delete your existing memory before allocating new memory.

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implementation realizes std::copy (ignoring all sorts of wrappers, concept checks, etc.) with the simple loop:
for (; __first != __last; ++__result, ++__first)
    *__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting part of the memory transaction. Instead, each thread should place 2 or 4 input elements in a register, and write that.
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements (up to the first multiple of 32B) and the last elements (after the last such multiple) differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements (up to the first multiple of this number) and the last elements (after the last such multiple) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
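The two mis-alignment points reduce to the same head/body/tail split. Here is a sketch of just the index arithmetic (host-side Rust purely for illustration, not CUDA code; it assumes elem_size divides align and that the base address is element-aligned):

```rust
/// Split `length` elements of `elem_size` bytes, starting at byte address
/// `addr`, into (head, body, tail): `head` elements before the first
/// `align`-byte boundary, an aligned `body`, and a `tail` remainder.
/// Assumes `elem_size` divides `align` and `addr` is element-aligned.
fn split_for_alignment(
    addr: usize,
    length: usize,
    elem_size: usize,
    align: usize,
) -> (usize, usize, usize) {
    let misalign = addr % align;
    let head = if misalign == 0 {
        0
    } else {
        ((align - misalign) / elem_size).min(length)
    };
    let per_transaction = align / elem_size;
    let body = (length - head) / per_transaction * per_transaction;
    (head, body, length - head - body)
}

fn main() {
    // 100 4-byte elements starting 4 bytes past a 32B boundary:
    // 7 head elements reach the boundary, 88 form whole transactions, 5 remain.
    assert_eq!(split_for_alignment(260, 100, 4, 32), (7, 88, 5));
    // Already aligned: no head, 96 elements in the body, 4 left over.
    assert_eq!(split_for_alignment(256, 100, 4, 32), (0, 96, 4));
}
```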
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

Creating a WAV file with an arbitrary bits per sample value?

Do WAV files allow any arbitrary number of bitsPerSample?
I have failed to get it to work with anything less than 8. I am not sure how to define the blockAlign for one thing.
Dim ss As New Speech.Synthesis.SpeechSynthesizer
Dim info As New Speech.AudioFormat.SpeechAudioFormatInfo(AudioFormat.EncodingFormat.Pcm, 5000, 4, 1, 2500, 1, Nothing) ' FAILS
ss.SetOutputToWaveFile("TEST4bit.wav", info)
ss.Speak("I am 4 bit.")
My.Computer.Audio.Play("TEST4bit.wav")
AFAIK, no: a 4-bit PCM format is undefined. It wouldn't make much sense to have only 16 volume levels of audio; the quality would be horrible.
While technically possible, I know of no decent software (e.g. WaveLab) that supports it, though your very own player could.
Formula: blockAlign = channels * (bitsPerSample / 8)
So for mono 4-bit it would be: blockAlign = 1 * ((double)4 / 8) = 0.5
Note that the double cast is necessary to avoid ending up with 0.
But if you look at the block align definition below, it really does not make much sense to have an alignment of 0.5 bytes, one would have to work at the bit-level (painful and useless because at this quality, non-compressed PCM would just sound horrible):
wBlockAlign
The block alignment (in bytes) of the waveform data. Playback
software needs to process a multiple of wBlockAlign bytes of data at
a time, so the value of wBlockAlign can be used for buffer
alignment.
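The formula and the truncation pitfall are easy to check; a quick sketch (the function name is mine):

```rust
/// blockAlign in bytes: channels * (bitsPerSample / 8).
fn block_align(channels: u32, bits_per_sample: u32) -> f64 {
    channels as f64 * (bits_per_sample as f64 / 8.0)
}

fn main() {
    assert_eq!(block_align(1, 16), 2.0); // mono 16-bit
    assert_eq!(block_align(2, 16), 4.0); // stereo 16-bit
    // Mono 4-bit falls below a whole byte, which wBlockAlign cannot express:
    assert_eq!(block_align(1, 4), 0.5);
    // Plain integer division silently truncates the same computation to 0.
    assert_eq!(1u32 * (4 / 8), 0);
}
```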
Reference:
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/Docs/riffmci.pdf page 59
Workaround:
If you really need 4-bit, switch to ADPCM format.

What does PACK8/16/32 mean in VkFormat names?

I'm trying to understand the names of the items in the VkFormat enum, and so far I think I get all the structure of the names of all of the (non-block) formats, but I can't figure out what it means when they have a suffix of PACK8, PACK16, PACK32. If I add up the channel sizes, they always add up to 8, 16, or 32, nothing irregular, so I don't understand what it would mean to bit-pack these values, since they seem to be 100% efficient, using all their bits.
As usual, the documentation is not very helpful, just saying the format is packed without saying what that means.
The PACK fields mean exactly what the specification says they mean:
whole texels or attributes are stored in a single data element, rather than individual components occupying a single data element
Though if you find that too confusing, you could just look at the actual format descriptions. Vulkan goes into excruciating detail about them, to the point of needless repetition.
The difference between VK_FORMAT_B8G8R8A8_RGB and VK_FORMAT_B8G8R8A8_RGB_PACK32 is the same difference between a uint8_t[4] and a uint32_t. One is an array ("individual components"), while the other is a single value ("single data element") made up of smaller values.
If you have a uint8_t color[4] array, which stores B8G8R8A8, then color[0] stores the blue component. The order of the components in the array is defined by the order of the components in the format's name.
If you have a uint32_t color value, which stores B8G8R8A8, then (color & 0xFF000000) >> 24 will retrieve the blue component. The highest byte is the first, followed by the next highest and so forth.
The reason the packed-vs-not-packed distinction matters is because of endian issues. Arrays of bytes don't have endian issues. But values packed into 16 or 32-bits do have endian issues. The endian of the packed formats is always assumed to be the native endian of the host.
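The array-vs-packed distinction, and its endian consequence, can be illustrated in a few lines (plain Rust on the host; the component values are arbitrary):

```rust
fn main() {
    // B8G8R8A8 as an array of bytes: the name's order is the array order,
    // so blue is the first element.
    let color_array: [u8; 4] = [0x11, 0x22, 0x33, 0x44]; // B, G, R, A
    let blue_from_array = color_array[0];

    // B8G8R8A8 packed into a single u32: the first-named component
    // occupies the most significant bits.
    let color_packed: u32 = 0x11223344; // B in the highest byte
    let blue_from_packed = ((color_packed >> 24) & 0xFF) as u8;
    assert_eq!(blue_from_array, blue_from_packed);

    // The endian issue: reinterpreting the array's bytes as a u32 yields
    // different values depending on the host's byte order.
    assert_eq!(u32::from_le_bytes(color_array), 0x44332211);
    assert_eq!(u32::from_be_bytes(color_array), 0x11223344);
}
```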

Memcpy and Memset on structures of Short Type in C

I have a query about using memset and memcpy on structures and their reliability. For example, I have code that looks like this:
typedef struct
{
    short a[10];
    short b[10];
} tDataStruct;

tDataStruct m, n;
memset(&m, 2, sizeof(m));
memcpy(&n, &m, sizeof(m));
My questions are:
1) With memset, if I set to 0 it is fine. But when setting 2, I get m.a and m.b filled with 514 instead of 2. When I make them char instead of short it is fine. Does this mean we cannot use memset for any initialization other than 0? Is it a limitation of short, for example?
2) Is it reliable to memcpy between the two structures of short above? I have huge strings of a, b, c, d, e..., and I need to make sure the copy is a perfect one-to-one.
3) Am I better off using memset and memcpy on the individual arrays rather than collecting them in a structure as above?
One more query: in the structure above I have arrays of values, but what if I am passed pointers to these arrays and want to collect those pointers in a structure?
typedef struct
{
    short *pa[10];
    short *pb[10];
} tDataStruct;

tDataStruct m, n;
memset(&m, 2, sizeof(m));
memcpy(&n, &m, sizeof(m));
In this case, if I memset or memcpy, it only changes the addresses rather than the values. How do I change the values instead? Is the prototype wrong?
Please suggest; your inputs are very important.
Thanks,
dsp guy
1. memset sets bytes, not shorts. Always. 514 = (256 * 2) + (1 * 2): the 2s appear on byte boundaries.
1.a. This does, admittedly, lessen its usefulness for purposes such as what you're trying to do (array fill).
2. Reliable as long as both structs are of the same type. Just to be clear, these structures are NOT of "type short" as you suggest.
3. If I understand your question, I don't believe it matters, as long as they are of the same type.
Just remember, these are byte level operations, nothing more, nothing less. See also this.
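The 514 figure is easy to reproduce without C: filling both bytes of a 16-bit value with 0x02 gives 0x0202 (a Rust sketch of the same byte-level fill):

```rust
fn main() {
    // memset(&m, 2, sizeof(m)) writes the byte 0x02 into every byte.
    // A short (two bytes) filled that way reads back as 0x0202 = 514:
    let filled = u16::from_ne_bytes([0x02, 0x02]);
    assert_eq!(filled, 0x0202);
    assert_eq!(filled, 514); // (256 * 2) + (1 * 2)

    // Zero is special: all-zero bytes give zero at every integer width,
    // which is why memset(_, 0, _) "works" where memset(_, 2, _) doesn't.
    assert_eq!(u16::from_ne_bytes([0, 0]), 0);
}
```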
For the second part of your question, try
memset(m.pa, 0, sizeof(*(m.pa)));
memset(m.pb, 0, sizeof(*(m.pb)));
Note the two operations to copy from two different addresses (m.pa and m.pb are effectively addresses, as you recognized). Note also the sizeof: not the sizeof of the references, but the sizeof of what's being referenced. Similarly for memcpy.