Doing this byte by byte is easy with a FileStream, but I can't get it to work with a BitArray.
I want to develop file compression algorithms, just as a hobby.
The method in question checks whether three chosen byte values together occur often enough for the file size to benefit from adding two bits in front of every byte. The two bits indicate whether the next byte is one of the three pre-stored bytes; if both bits are zero, the decoder assumes it is not and simply continues reading the file. Writing a whole extra byte for each byte would make all of this pointless.
If someone could tell me how to do this, I would much appreciate it.
The way I wanted to do it is not possible. Extra bits are inevitable, because the smallest unit that can be written to the hard drive is a byte. The extra bits have to be dealt with in the decompression logic. A robust solution is to add a few bits at the beginning of the file (or wherever the compression metadata is stored) that record how many bits of the last byte are padding. Three bits are enough, since they can represent values up to 7. (There is never a whole unnecessary extra byte, so if the compressed bit count is already divisible by eight these three bits should be 000, i.e. zero; 8 never needs to be written.) When the last byte is read, the program should then ignore as many bits as the three-bit value indicates.
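The padding scheme described above can be sketched roughly as follows. This is a minimal Python illustration rather than the original poster's code; the function names pack_bits and unpack_bits are made up for the example, and it assumes the 3-bit count sits at the very start of the stream and is itself counted when working out the padding.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, prefixed by a 3-bit padding count."""
    # Filler bits needed so that the 3 header bits plus the payload fill whole bytes.
    padding = (-(3 + len(bits))) % 8
    header = [(padding >> 2) & 1, (padding >> 1) & 1, padding & 1]
    stream = header + list(bits) + [0] * padding
    out = bytearray()
    for i in range(0, len(stream), 8):
        byte = 0
        for bit in stream[i:i + 8]:
            byte = (byte << 1) | bit  # most significant bit first
        out.append(byte)
    return bytes(out)


def unpack_bits(data):
    """Reverse pack_bits: read the 3-bit count, then drop that many trailing bits."""
    stream = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    padding = (stream[0] << 2) | (stream[1] << 1) | stream[2]
    return stream[3:len(stream) - padding]
```

With a 10-bit payload the stream is 13 bits long, so 3 filler bits are appended and the header stores 011; the decoder reads that value and discards the last 3 bits.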
I'm trying to find some exif data in an image.
So first I need to find the number 0x45786966 ('Exif' as unsignedInt32) and store the offset.
The next two bytes should be zeros and after that the endianness as unsignedInt16 (either 0x4d4d or 0x4949) which should be stored too.
I can get the image as Bytes with the elm/file module.
But how do I search the 'Exif' start and parse the endianness in those Bytes?
I looked at the loop-example from elm/bytes but do not fully understand it.
First it reads the length of a list (unsignedInt32) and then it reads byte by byte?
How would this work if I want to read unsignedInt32s instead of bytes?
How do I set an offset to indicate where functions like unsignedInt32 should read next?
The example is talking about structured data with a known size field at the start. In your case, what you want to do is a search, so it is a rather different problem.
The problem is elm/bytes isn't really designed to handle searching. If you can guarantee the part you are looking for will be byte aligned, it may well be possible to do this, but given just what you have said, there isn't an easy way, as you can't iterate bit-by-bit.
You would have to read in values without alignment and then manually search for the part of the number you want within that. Given the difficulty and inefficiency of that approach, I would recommend using ports instead for that use case.
If you can guarantee that what you are searching for will be byte-aligned (or better yet, aligned to the length of your number), you can decode a byte at a time until you find what you are looking for. There is no way to read from a given offset; if you want to read from a certain point, you need to read and throw away the values before it.
To do this, you would want to set up a loop where your state contains how much of the value you are looking for you have found. Each step, you check if you have the whole thing (success), you have the next part (continue), or you have something different (reset the state to search from the start again). If you reach the end without finding it, you have failed.
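The loop described above can be sketched like this. The sketch is in Python rather than Elm and only illustrates the state-machine logic, not the elm/bytes API; the function name find_exif is made up for the example. It assumes the marker is byte-aligned, and its simple reset on mismatch is only sufficient because no byte value repeats within "Exif".

```python
def find_exif(data):
    """Return (offset, byte_order) for the first valid Exif header, or None."""
    pattern = b"Exif"
    matched = 0  # state: how many leading bytes of the pattern have been seen
    for position, byte in enumerate(data):
        if byte == pattern[matched]:
            matched += 1  # we have the next part: continue
        else:
            # Something different: reset, but the current byte may start a new match.
            matched = 1 if byte == pattern[0] else 0
        if matched == len(pattern):
            offset = position - len(pattern) + 1
            rest = data[position + 1:position + 5]
            matched = 0
            # Expect two zero bytes followed by the byte-order mark.
            if len(rest) == 4 and rest[:2] == b"\x00\x00":
                if rest[2:] == b"MM":
                    return offset, "big"
                if rest[2:] == b"II":
                    return offset, "little"
    return None  # reached the end without finding it
```

In Elm the same state (the number of pattern bytes matched so far, plus the current offset) would be threaded through the decoder's loop step by step, since there is no way to jump to an offset directly.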
Occasionally I will store the state of some system as an integer. I often find myself using small values for these states (say 1-10), since the system is relatively simple.
In general, what's the best declaration for a variable which stores small positive integers - where best is defined as fastest read/write time & smallest memory consumption? Small is here defined as 1-10, although a complete list of integer storing methods and their ranges would be useful.
Originally I used Integer as on the face of it, it uses less memory. But I have since learned that that is not the case, as it is silently converted to Long
I then used Long for the above reason, and in the knowledge that it uses less memory than Double
I have since discovered Byte and switched to that, since it stores smaller integers (0-255 or 256, I never remember which), and I guess uses less memory from its minute name. But I don't really trust VBA and wonder if there's any internal type conversions done here too.
Boolean I thought was only 0 or 1, but I've read that any non-zero number is converted to True, does this mean it can also store numbers?
Originally I used Integer as on the face of it, it uses less memory. But I have since learned that that is not the case, as it is silently converted to Long
That's right, there is no advantage in using Integer over Long because of that conversion, but Integer might be necessary when communicating with old 16-bit APIs.
Also read "Why Use Integer Instead of Long?"
I then used Long for the above reason, and in the knowledge that it uses less memory than Double
You would not decide between Long or Double because one uses less memory. You decide between them because …
you need floating point numbers (Double)
or you only want whole numbers and must not get floating point values (Long).
Deciding on memory usage in this specific case is just a very bad idea because these types are fundamentally different.
I have since discovered Byte and switched to that, since it stores smaller integers (0-255 or 256, I never remember which), and I guess uses less memory from its minute name. But I don't really trust VBA and wonder if there's any internal type conversions done here too.
I don't see any case where you use Office/Excel and run into any memory issues by using Long instead of Byte to iterate from 1 to 10. If you need to limit it to 255 (some old APIs, whatever) then you might use Byte. If there is no need for that I would use Long just to be flexible and not run into any coding issues because you need to remember which counters are only Byte and which are Long.
E.g. If I use i for iterating I would expect Long. I see no advantage in using Byte for that case.
Stay as simple as possible. Don't do strange things nobody would expect just because you can. Avoiding future coding issues is worth more than one (or three) bytes of memory. Sometimes it is more valuable to write readable, maintainable code than faster code, especially if you can't notice the difference (which you really can't in this case). Hard-to-read code always results in errors or vulnerabilities sooner or later.
Boolean I thought was only 0 or 1, but I've read that any non-zero number is converted to True, does this mean it can also store numbers?
No, that's wrong. A Boolean is -1 for True and 0 for False. Note, however, that if you convert e.g. a non-zero Long to Boolean, the conversion happens automatically and the result is True.
But Boolean in VBA is clearly defined as:
0 = False
-1 = True
The smallest chunk of memory that can be addressed is a byte (8 bits).
I cannot guarantee that VBA Bytes are stored as bytes in all cases, but using this type you are on the safest side.
By the way, the largest byte value is 11111111b, i.e 255d. The value 256d is 100000000b which requires 9 bits.
Also note that using Bytes wherever possible might be counterproductive, as it can cost running time if numerical conversions are required, while the memory saved may be insignificant.
Except for very special applications, this kind of micro-optimization is of no use.
What data type do I use to store a single byte in a protocol buffer message? Seeing the list at https://developers.google.com/protocol-buffers/docs/proto#scalar it seems like one of the *int32 types are the best fit. Is there a more efficient way to store a single byte?
Well, you need to understand that it will take at least two bytes anyway: one for the tag and one for the data. (The tag will take more space if the field number is high, i.e. above 15.) If you use uint32, it will take 1 byte for the data for values up to 127, and 2 bytes for anything larger.
I don't believe there's anything that will be more efficient than that.
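To make the byte counts concrete, here is a small Python sketch of the protobuf varint wire encoding for a single uint32 field. It is hand-rolled for illustration and does not use the protobuf library; the function names are made up for the example.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_uint32_field(field_number, value):
    """Encode one varint field: the tag (field number plus wire type 0), then the value."""
    tag = (field_number << 3) | 0  # wire type 0 means varint
    return encode_varint(tag) + encode_varint(value)
```

For field number 1, the value 127 encodes to 2 bytes in total and the value 255 to 3 bytes. From field number 16 upwards the tag itself needs 2 bytes, because the tag value no longer fits in 7 bits.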
Quick question. Does it matter from the point of storing data if I will use decimal field limits or hexadecimal (say 16,32,64 instead of 10,20,50)?
I ask because I wonder if this will have anything to do with clusters on HDD?
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should choose the one that better fits the maximum length of the data you might need to store. You won't be able to measure the performance difference between them. All else apart, the DBMS probably only stores the data you send, so if your average string is, say, 16 bytes, it will only use 16 (or, more likely, 17 - allowing 1 byte for storing the length) bytes on disk. The bigger size might affect the calculation of how many rows can fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
If it were a C program I'd spend some time thinking about that, too. But with a database I'd leave it to the DB engine.
DB programmers have spent a lot of time thinking about the best memory layout, so just tell the database what you need and it will store the data in a way that suits the DB engine best (usually).
If you want to align your data, you'll need exact knowledge of the internal data organization: How is the string stored? Are one, two, or four bytes used to store the length? Is it stored as a plain byte sequence or encoded in UTF-8, UTF-16, or UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence, in which case one more byte is needed internally.
Also, with VARCHAR it is not necessarily true that the DB will always allocate 100 (128) bytes for your string. Maybe it stores just a pointer to where the space for the actual data is.
So I'd strongly suggest using VARCHAR(100) if that is your requirement. If the DB decides to align it somehow, there's room for extra internal data, too.
Other way around: Let's assume you use VARCHAR(128) and all things come together: The DB allocates 128 bytes for your data. Additionally it needs 2 bytes more to store the actual string length - makes 130 bytes - and then it could be that the DB aligns the data to the next (let's say 32 byte) boundary: The actual data needed on the disk is now 160 bytes 8-}
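The arithmetic in that thought experiment can be checked with a few lines of Python. The 2-byte length prefix and the 32-byte boundary are the hypothetical figures from the paragraph above, not properties of any particular database, and padded_size is a name made up for the example.

```python
def padded_size(data_bytes, length_prefix_bytes, boundary):
    """Round data plus length prefix up to the next multiple of boundary."""
    raw = data_bytes + length_prefix_bytes
    return -(-raw // boundary) * boundary  # ceiling division, then scale back up


size_128 = padded_size(128, 2, 32)  # 130 bytes rounded up to 160
size_100 = padded_size(100, 2, 32)  # 102 bytes rounded up to 128
```

Under these assumed numbers the "nicely aligned" VARCHAR(128) would end up occupying more space than VARCHAR(100), which is the point of the example.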
Yes but it's not that simple. Sometimes 128 can be better than 100 and sometimes, it's the other way around.
So what is going on? varchar only allocates space as necessary so if you store hello world in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 bytes (this value can be configured for some databases). So the question is: How many blocks does the DB have to read to fetch each row? Rows that span several blocks will need more I/O, so this will slow you down.
But again: This doesn't depend on the theoretical maximum size of the columns but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed width columns you have (number/decimal, char), and finally c) how much data you have in variable columns.
I'm currently trying to decipher WAV files. From headers to the PCM data.
I've found a PDF (http://www.tdt.com/T2Support/technical_notes/tn0132.pdf) detailing the anatomy of a WAV file, and I've been able to extract and make sense of the appropriate header data using Ghex2. But my questions are:
Why are the integers' bytes stored backwards? I.e. decimal 20 is stored as 0x14000000 instead of 0x00000014.
Are the integers of the PCM data also stored backwards?
WAV files are little-endian (least significant byte first) because the format originated on operating systems running on Intel-based machines, which store numbers in little-endian format.
If you think about it, this kind of makes sense: if you want to cast a long integer to a short one or even to a character, the starting address remains the same and you just look at fewer bytes.
Consequently, for 16-bit encoding and upwards, the PCM data is little-endian as well. This is quite handy, since you can pull the samples in directly as integers. Don't forget that 16-bit samples are stored as two's complement signed integers, whereas 8-bit samples are unsigned. (See http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html for more detail.)
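As a rough illustration of the byte order, the following Python sketch decodes the header value from the question and some PCM samples using the standard struct module. The helper names decode_pcm16 and decode_pcm8 are made up for the example, and the sample bytes are invented test data rather than content from a real file.

```python
import struct

# The four bytes from the question, exactly as they appear in the file.
raw = bytes.fromhex("14000000")
little = struct.unpack("<I", raw)[0]  # "<" = little-endian, "I" = unsigned 32-bit
big = struct.unpack(">I", raw)[0]     # the same bytes misread as big-endian


def decode_pcm16(data):
    """Decode 16-bit PCM: little-endian, two's complement signed samples."""
    count = len(data) // 2
    return list(struct.unpack("<%dh" % count, data[:count * 2]))


def decode_pcm8(data):
    """Decode 8-bit PCM: each byte is one unsigned sample in the range 0 to 255."""
    return list(data)
```

Here little is 20, while big is 0x14000000 (335544320), which is the "backwards" reading from the question.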
"Backwards" is subjective. Some machines are big-endian, others are little-endian. In byte-oriented contexts like file formats and network protocols, the order is arbitrary. Some formats like to specify big- or little-endian, others like to be flexible and accept either form, with a flag indicating which is in use.
Looks like WAV files just like little-endian.