What is the size of a single character in a .txt file?

For some reason I need to create and save a 512 MB .txt file on my 512 MB pendrive. The file has to be made of alternating zeroes and ones.
I'm going to write a program that outputs n such characters, but I need to know how big n must be to make this file 512 MB. How do I determine it? What is the size of a single "1" or "0"? And how much space does the mere existence of the file take?

If you're asking how big the characters "1" and "0" are, the answer is 1 byte each, assuming ASCII or UTF-8 encoding. So if you want exactly 512 MB you will need 512 * 1024 * 1024 bytes, i.e. that many characters. But keep in mind that drives are rarely exactly the size they claim to be (MB vs. MiB), and each drive is slightly different.
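A minimal Python sketch of such a program (the file name and buffer size are arbitrary choices), writing in binary mode so one character is exactly one byte:

    TARGET = 512 * 1024 * 1024            # 512 MiB in bytes
    chunk = b"01" * (1 << 20)             # 2 MiB buffer of alternating characters

    with open("alternating.txt", "wb") as f:
        written = 0
        while written < TARGET:
            n = min(len(chunk), TARGET - written)
            written += f.write(chunk[:n])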

Related

TFRecord larger than the original data

Actually, I am dealing with many pictures from different videos, so I use tf.SequenceExample() to save them as separate sequences, with their labels attached, in a TFRecord.
But after running my code, the generated TFRecord is 29 GB, far larger than my original 3 GB of pictures.
Is that normal to create TFRecord larger than the original data?
You may be storing the decoded images instead of the jpeg encoded ones. TFRecord has no concept of image formats so you can use any encoding you want. To keep the size the same, convert the original image file contents to a BytesList and store that without calling decode_image or using any image libraries or anything that understands image formats.
Another possibility is you might be storing the image as an Int64List full of bytes which would be 8x the size. Instead, store it as a BytesList containing a single Bytes.
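For illustration, a minimal sketch of the suggestion above (paths and the feature name are placeholders), storing the encoded JPEG bytes directly in a BytesList:

    import tensorflow as tf

    with tf.io.TFRecordWriter("images.tfrecord") as writer:
        for path in ["frame_000.jpg", "frame_001.jpg"]:      # placeholder paths
            with open(path, "rb") as f:
                jpeg_bytes = f.read()                        # encoded file bytes, never decoded
            example = tf.train.Example(features=tf.train.Features(feature={
                "image_raw": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
            }))
            writer.write(example.SerializeToString())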
Check the type of data you load. I guess you load images as pixel data. Every pixel is uint8 (8 bits) and is likely converted to float (32 bits). Hence you should expect about 4 times the original size (3 GB -> 12 GB).
Also, the original format might have (better) compression than TFRecords. (I'm not sure whether TFRecords can use compression.)
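For what it's worth, recent TensorFlow versions do accept a compression option when writing; a sketch, assuming the tf.io API:

    import tensorflow as tf

    # GZIP-compress the record stream at write time; readers must pass the
    # same compression_type (e.g. to tf.data.TFRecordDataset) when reading.
    options = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter("images.tfrecord.gz", options=options) as writer:
        writer.write(b"<serialized example bytes>")          # placeholder payload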

Field: version.deployment_uri Error: The total size of files in gs://my-bucket/ml/ is x bytes, which exceeds the allowed maximum of 1073741824 bytes

When trying to create a new version in the google cloud console, I get an error like,
Field: version.deployment_uri Error: The total size of files in gs://my-bucket/ml/ is 2150116163 bytes, which exceeds the allowed maximum of 1073741824 bytes.
My model is an RNN model. I believe the embedding layer, given the vocab size, is likely the cause of the large model.
Is there a quota setting that can be adjusted for larger models?
Unfortunately, that limit is not adjustable at this time, although it may be in the future.
Are you comfortable sharing how large your model is? That information is valuable for us for planning purposes.
In the meantime, you will need to adjust the vocab and embedding sizes or otherwise reduce the size of the model.
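As a quick check before deploying, you can total the size of everything under the deployment URI; a sketch using the google-cloud-storage Python client (assumes the library is installed and credentials are configured):

    from google.cloud import storage

    client = storage.Client()
    # Sum the sizes of every object under the deployment URI from the error.
    total = sum(blob.size for blob in client.list_blobs("my-bucket", prefix="ml/"))
    print(f"{total} bytes ({'over' if total > 1073741824 else 'within'} the 1 GiB limit)")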

How to write integer value "60" in 16bit binary, 32bit binary & 64bit binary

How to write integer value "60" in other binary formats?
8bit binary code of 60 = 111100
16bit binary code of 60 = ?
32bit binary code of 60 = ?
64bit binary code of 60 = ?
Is it 111100000000 for 16-bit binary?
Why does the 8-bit binary code contain only 6 bits instead of 8?
I googled for the answers but couldn't find them. Please provide answers, as I'm still a beginner in this area.
Imagine you're writing the decimal value 60. You can write it using 2 digits, 4 digits or 8 digits:
1. 60
2. 0060
3. 00000060
In our decimal notation, the most significant digits are to the left, so increasing the number of digits for representation, without changing the value, means just adding zeros to the left.
Now, in most binary representations, this would be the same. The decimal 60 needs only 6 bits to represent it, so an 8bit or 16bit representation would be the same, except for the left-padding of zeros:
1. 00111100
2. 00000000 00111100
Note: Some OSs, software, hardware, or storage devices might have different endianness - meaning they store 16-bit values with the least significant byte first, then the most significant byte. Binary notation is still MSB-on-the-left, as above, but reading the memory of such little-endian devices will show any 16-bit chunk internally reversed:
1. 00111100 - 8bit - still the same.
2. 00111100 00000000 - 16bit, bytes are flipped.
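A quick Python illustration of both points, zero-padding 60 to each width and then showing the byte-order difference for a 16-bit store:

    import struct

    for width in (8, 16, 32, 64):
        print(f"{width:2}-bit:", format(60, f"0{width}b"))   # zeros padded on the left

    # The same 16-bit value stored little-endian vs big-endian:
    print(struct.pack("<H", 60).hex())   # '3c00' - least significant byte first
    print(struct.pack(">H", 60).hex())   # '003c' - most significant byte first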
Every number has exactly one binary representation; a wider format just adds zeros in front (usually not shown).
On a 16/32/64-bit system, 111100 (60) would look the same with zeros added in front of the number:
16-bit: 0000000000111100
32-bit: 00000000000000000000000000111100
and so on
For storage, endianness matters; otherwise, zeros are always prefixed up to the bit width, so 60 would be:
8-bit: 00111100
16-bit: 0000000000111100

Encoding - Efficiently send sparse boolean array

I have a 256 x 256 boolean array. This array is constantly changing, and the set bits are practically randomly distributed.
I need to send a current list of the set bits to many clients as they request it.
The following numbers are approximations.
If I send the coordinates for each set bit:
set bits    data transfer (bytes)
0           0
100         200
300         600
500         1000
1000        2000
If I send the distance (scanning from left to right) to the next set bit:
set bits    data transfer (bytes)
0           0
100         256
300         300
500         500
1000        1000
The typical number of bits that are set in this sparse array is around 300-500, so the second solution is better.
Is there a way I can do better than this without much added processing overhead?
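For concreteness, a minimal sketch of the option-2 scheme above (the 255-as-continuation convention is an assumption; the question doesn't say how gaps longer than one byte are handled):

    def encode_gaps(bits):
        """bits: the 256*256 booleans, scanned left to right.
        One byte per gap; 255 means 'add 255 and read the next byte'."""
        out = bytearray()
        gap = 0
        for b in bits:
            if b:
                while gap >= 255:      # escape gaps that don't fit in a byte
                    out.append(255)
                    gap -= 255
                out.append(gap)
                gap = 0
            else:
                gap += 1
        return bytes(out)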
Since you say "practically randomly distributed", let's assume that each location is a Bernoulli trial with probability p, where p is chosen to match the fill rate you expect. You can think of the length of a "run" (your option 2) as the number of Bernoulli trials needed to get a success. That number of trials follows the geometric distribution (with parameter p): http://en.wikipedia.org/wiki/Geometric_distribution
What you've done so far in option #2 is to recognize the maximum run length for each value of p and reserve that many bits to send every run. Note that this maximum is only probabilistic, and the scheme will fail if you get really, really unlucky and all your set bits cluster at the beginning and end.
As @Mike Dunlavey recommends in the comments, Huffman coding, or some other form of entropy coding, can redistribute the bits spent according to the frequency of each run length. That is, short runs are much more common, so use fewer bits to send those lengths. The theoretical limit of this encoding efficiency is the "entropy" of the distribution, which you can look up on that Wikipedia page and evaluate for different probabilities. In your case, this entropy ranges from 7.5 bits per run (for 1000 entries) to 10.8 bits per run (for 100).
Actually, this means you can't do much better than you're currently doing for the 1000-entry case: 8 bits = 1 byte per value. For the 100-entry case you're currently spending 20.5 bits per run instead of the theoretically possible 10.8, so that end has the most room for improvement. And in the 300-entry case, I think you haven't reserved enough bits to represent these sequences: the entropy comes out to 9.23 bits per run, and you're currently sending 8. You will hit many cases where the gap between set bits exceeds 255, which will overflow your one-byte representation.
All of this, of course, assumes that things really are random. If they're not, you need a different entropy calculation. You can always compute the entropy right out of your data with a histogram, and decide if it's worth pursuing a more complicated option.
Finally, also note that real-life entropy coders only approximate the entropy. Huffman coding, for example, has to assign an integer number of bits to each run length. Arithmetic coding can assign fractional bits.
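To put numbers on the entropy argument, a small sketch evaluating the per-run entropy of the geometric distribution, H(p)/p with H the binary entropy function, at the fill rates from the question:

    import math

    def bits_per_run(set_bits, total=256 * 256):
        """Entropy of Geometric(p) in bits: binary entropy H(p) divided by p."""
        p = set_bits / total
        h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
        return h / p

    for k in (100, 300, 500, 1000):
        print(f"{k:4} set bits: {bits_per_run(k):.2f} bits per run")
    # roughly 10.8 bits per run at 100 set bits, down to 7.5 at 1000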

Maximum number of objects in a PDF

What is the maximum number of objects that a PDF file can have?
From the PDF specification:
"In general, PDF does not restrict the size or quantity of things described in the file format, such as numbers, arrays, images, and so on. [...] PDF itself has one architectural limit. Because ten digits are allocated to byte offsets, the size of a file is limited to 10^10 bytes (approximately 10 GB)."