Writing wav files of unknown length

Writing wav files of unknown length - header

The various headers of a WAV file contain file-length information. Suppose I generate a WAV file without knowing how long it is going to be, and possibly without the ability to alter the header after I have finished (e.g. when writing to a pipe). What should I write into these fields?

Either way, this isn't an ideal situation. But if there's absolutely no way to edit the file afterwards, I'd recommend writing 0xFFFFFFFF, the maximum value that the Subchunk2Size field of a standard WAV header can hold (albeit somewhat of a hack). Doing so will allow the whole file to be read/played by practically all players.
Some players rely solely on this field to calculate the audio's length (to know when to loop, how far to allow seeking, etc.). Declaring the file longer than it actually is therefore "tricks" the player into processing the entire file, although, depending on the player, an error may occur once it reaches the end of the audio.
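As a minimal sketch of this approach, the following builds a canonical 44-byte PCM header with the size placeholder and streams samples after it. The function names and default audio parameters are illustrative, and writing 0xFFFFFFFF into the RIFF ChunkSize field as well as Subchunk2Size is an assumption, not something every player is guaranteed to accept.

```python
import struct

UNKNOWN_SIZE = 0xFFFFFFFF  # placeholder for size fields that cannot be known up front


def streaming_wav_header(sample_rate=44100, channels=2, bits_per_sample=16):
    """Build a 44-byte PCM WAV header whose size fields hold the placeholder."""
    block_align = channels * bits_per_sample // 8
    byte_rate = sample_rate * block_align
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", UNKNOWN_SIZE, b"WAVE",       # ChunkSize is an assumption (see lead-in)
        b"fmt ", 16, 1, channels, sample_rate,  # 16-byte fmt chunk, format 1 = PCM
        byte_rate, block_align, bits_per_sample,
        b"data", UNKNOWN_SIZE,                # Subchunk2Size, as suggested in the answer
    )


def write_stream(out, pcm_blocks, **fmt):
    """Write the header once, then every block of raw PCM bytes as it arrives."""
    out.write(streaming_wav_header(**fmt))
    for block in pcm_blocks:
        out.write(block)
```

When the output is a pipe, `out` could be `sys.stdout.buffer`; nothing ever seeks back to patch the header.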

Making a lex/yacc script multithreaded

I have lex/yacc code which captures some data after parsing a file. The file is in a specific format. Consider this format:
File format:
ABC
Something something
ABC
Something something
....
....
The lex/yacc code is sequential right now. Is it possible to make the code multithreaded for a single file by dividing it into chunks separated by ABC?
Where to start?
I shall be happy to share more details, if needed.

The problem is that you don't know where the delimiters are until you lexically analyse the file. Unless the delimiters are contextually unambiguous (which won't be the case if your syntax includes things like comments or quoted strings), then you can't start the analysis at a random point in the file without knowing the context of that point. So in common cases, finding the delimiters requires a linear scan.
You could hand a chunk of the file to a separate chunk parser thread every time you identify the chunk's endpoints. That will require scanning the chunk twice (once for the initial delimiter scan and a second time for the precise parse) but it might still be a win if parsing is relatively slow. You could avoid the second scan by retaining the token stream in memory and passing the parser thread the list of tokens, but it may well turn out that the memory management and other bookkeeping for the token lists are as or more expensive than the lexical scan. That will depend on a lot of factors, so experimentation and benchmarking in your particular use case will be necessary.
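The scan-then-dispatch idea can be sketched as follows. This is only an illustration of the structure: `parse_chunk` is a hypothetical stand-in for the real yacc-generated parser, the delimiter is the `ABC` line from the question, and in CPython the global interpreter lock means threads would not actually speed up CPU-bound parsing.

```python
from concurrent.futures import ThreadPoolExecutor

DELIMITER = "ABC"  # delimiter line taken from the question's sample format


def split_chunks(lines):
    """Single linear pass: group lines into chunks, one per delimiter line."""
    chunks, current = [], None
    for line in lines:
        if line.strip() == DELIMITER:
            if current is not None:
                chunks.append(current)
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        chunks.append(current)
    return chunks


def parse_chunk(chunk):
    """Hypothetical stand-in for the real (slow) per-chunk parser."""
    return [line.split() for line in chunk]


def parse_parallel(lines, workers=4):
    """Find the chunk boundaries sequentially, then parse the chunks concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_chunk, split_chunks(lines)))
```

`pool.map` returns results in submission order, so the parsed chunks come back in file order even though they are processed concurrently.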
If the delimiters are completely unambiguous and the chunks are relatively short, you could divide the files into pages of some convenient size, and give each thread its own pages to analyse. When the thread is assigned a page, it will have to skip initial data until it finds the first delimiter; when it reaches the end of the page, it will have to continue scanning the rest of the current chunk in the next page. So there will be a bit of duplicate scanning, but if the chunks are short relative to the page size, it won't be too bad. Again, benchmarking your particular use case will be useful. And, again, you can only do this if you can unambiguously recognise a delimiter without knowing anything about the lexical context.
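A rough sketch of the page-based variant, assuming the delimiter is the byte string `ABC\n` and that it can never occur inside a chunk. Each worker owns the chunks whose delimiter begins inside its page, skips any leading data (which belongs to the previous page's last chunk), and reads past the end of its page to finish its own last chunk. The names and the page size are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

DELIM = b"ABC\n"  # assumed to be recognisable without any lexical context


def chunks_in_page(data, start, end):
    """Return the chunks whose delimiter begins inside data[start:end]."""
    chunks = []
    pos = data.find(DELIM, start)           # skip data owned by the previous page
    while pos != -1 and pos < end:
        body_start = pos + len(DELIM)
        nxt = data.find(DELIM, body_start)  # may run past `end` into later pages
        body_end = nxt if nxt != -1 else len(data)
        chunks.append(data[body_start:body_end])
        pos = nxt
    return chunks


def parse_by_pages(data, page_size, workers=4):
    """Give each worker one page and concatenate the results in file order."""
    starts = range(0, len(data), page_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_page = pool.map(lambda s: chunks_in_page(data, s, s + page_size), starts)
        return [chunk for page in per_page for chunk in page]
```

The duplicate scanning mentioned above shows up in the two `find` calls: the skipped prefix of each page and the overrun into the next page are both read by two workers.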

Organizing Data in EEPROM

I have a 64 KB EEPROM, organized as 128-byte pages, on my board, which talks to an ATmega1281. The board also has an SD card slot and can copy some configuration files onto the EEPROM (which acts as the internal memory). Due to the nature of the board, only two types of files are needed: one is known as the Circuit Data and the other as the Location Data. Both are binary files.
Up until now, I have just split the EEPROM into two 32 KB halves and written the Circuit Data in the top half and the Location Data in the bottom half. Both files also have a 25-byte header. I copy each header into the last page of its file's half, i.e. the page starting at address 0x7F80 holds the Circuit Data file's header and the page starting at 0xFF80 holds the other header. The data is always going to be of fixed width, which makes random access quite easy.
My question is: is there a better, simpler way to organize data in an EEPROM? At the moment, I don't even store the length of the data, as it's not really needed. But I'm thinking it might add another layer of safety if I include it in the header.
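For reference, the address arithmetic behind this layout can be sketched as below. The region base addresses are inferred from the two header addresses given above, and `record_size` is a hypothetical parameter, since the question does not state the record width.

```python
PAGE_SIZE = 128      # bytes per EEPROM page
HALF_SIZE = 0x8000   # 32 KB per file region
HEADER_SIZE = 25     # bytes actually used in the header page

# Inferred from the header addresses 0x7F80 and 0xFF80 (an assumption).
REGION_BASE = {"circuit": 0x0000, "location": 0x8000}


def header_address(kind):
    """The header occupies the last page of the file's half."""
    return REGION_BASE[kind] + HALF_SIZE - PAGE_SIZE


def record_address(kind, index, record_size):
    """Address of fixed-width record `index` (record_size is hypothetical)."""
    addr = REGION_BASE[kind] + index * record_size
    if index < 0 or addr + record_size > header_address(kind):
        raise IndexError("record would fall outside the data area")
    return addr
```

The bounds check is where a stored record count from the header could be used instead of the fixed region size, if the length were added as an extra safety measure.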

Better? It depends. Simpler? Not really. It depends on how strong your "always" is. How firmly do you believe that the files will always be of fixed length? The fact that you are asking this question probably means you have some doubts. Keep the KISS principle in mind. Microcontroller development is still an area where unnecessary features are a direct threat to the stability of the solution. Having a data length in the header would be useful if you wanted to make your EEPROM access more generic. But then again, generalizing for two files is overkill.
Second thought: rather than introducing file lengths which you don't actually need, I would like to know why you store the file headers at the opposite end of the respective memory chunk. To me, a "header" is something that needs to be read before the file itself. Placing it in front of the data could save you one transfer of the read address to the EEPROM.

I believe that in any embedded project the simplest solution is the best. Your way of organizing storage is simple, and it looks like it meets all your requirements.
Any attempt to "improve" or "optimize" this solution will lead to more complicated code and increase the probability of introducing a bug. So keep all your engineering solutions as simple as possible. If new requirements pop up, you can always find a new simple solution for them. Don't do any premature optimization.

Alternatives to using RFile in Symbian

This question is a continuation of my previous question related to file I/O.
I am using RFile to open a file and read/write data to it. Now, my requirement is such that I would have to modify certain fields within the file. I separate each field within a record with a colon and each record with a newline. Sample is below:
abc#def.com:Albert:1:2
def#ghi.com:Alice:3:1
Suppose I want to replace the '3' in the second record with '2'. I am finding it difficult to overwrite a specific field in the file using RFile, because RFile does not provide its users with such a facility.
Because of this, to modify a record I have to delete the contents of the file and serialize again (that is, loop through the in-memory representation of the records and write them to the file). Doing this every time a record's value changes is quite expensive, as there are hundreds of records and changes could be quite frequent.
I searched around for alternatives and found CPermanentFileStore. But I feel the API is hard to use, as I am not able to find any source on the Internet that demonstrates its use.
Is there a way around this? Please help.

Depending on which version(s) of Symbian OS you are targeting, you could store the information in a relational database. Since v9.4, Symbian OS includes an SQL implementation (based on the open source SQLite engine).

Using normal files for this type of record takes a lot of effort no matter the operating system. To do this efficiently, you need to reserve space in the file for each record to expand; otherwise you have to rewrite the entire file whenever a record value changes from, say, 9 to 10. Storing a lookup table in the file also makes it possible to jump directly to a record using RFile::Seek.
CPermanentFileStore simplifies the actual reading and writing of the file, but basically does what you would otherwise have to do yourself. A database may be a better choice in this instance. If you don't want to use a database, I think using stores would be a better solution.
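The "reserve space for each record" idea can be sketched in a platform-neutral way as follows. This is Python rather than Symbian C++, and `RECORD_WIDTH` and the helper names are hypothetical. The point is only that fixed-width slots turn an update into a seek followed by a write of one slot.

```python
RECORD_WIDTH = 64  # hypothetical slot size, chosen to leave room for growth


def encode(fields):
    """Join fields with ':' and pad to a full slot ending in a newline."""
    line = ":".join(fields).encode("ascii")
    if len(line) >= RECORD_WIDTH:
        raise ValueError("record too long for its slot")
    return line.ljust(RECORD_WIDTH - 1) + b"\n"


def read_record(f, index):
    """Seek straight to slot `index` and split it back into fields."""
    f.seek(index * RECORD_WIDTH)
    return f.read(RECORD_WIDTH).decode("ascii").rstrip().split(":")


def update_field(f, index, field_no, value):
    """Rewrite a single slot in place; the rest of the file is untouched."""
    fields = read_record(f, index)
    fields[field_no] = value
    f.seek(index * RECORD_WIDTH)
    f.write(encode(fields))
```

With this layout, changing the '3' in the second record to '2' (or to '10') is `update_field(f, 1, 2, "2")` on a file opened for reading and writing in binary mode.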

Functional testing of output files, when output is non-deterministic (or with low control)

A long time ago, I had to test a program that generated a PostScript image file. One quick way to figure out whether the program was producing the correct, expected output was to take an md5 of the result and compare it against the md5 of a "known good" output I had checked beforehand.
Unfortunately, the PostScript output contains the current time within the file. This time is, of course, different depending on when the test runs, so the md5 of the result changes even when the expected output is obtained. As a fix, I just stripped out the date with sed.
This is a nice and simple scenario. We are not always so lucky. For example, now I am programming a writer program, which creates a big fat RDF file containing a bunch of anonymous nodes and uuids. It is basically impossible to check the functionality of the whole program with a simple md5, and the only way would be to read the file with a reader, and then validate the output through this reader. As you probably realize, this opens a new can of worms: first, you have to write a reader (which can be time consuming), second, you are assuming the reader is functionally correct and at the same time in sync with the writer. If both the reader and the writer are in sync, but on incorrect assumptions, the reader will say "no problem", but the file format is actually wrong.
This is a general issue when you have to perform functional testing of a file format, and the file format is not completely reproducible through the input you provide. How do you deal with this case?

In the past I have used a third party application to validate such output (preferably converting it into some other format which can be mechanically verified). The use of a third party ensures that my assumptions are at least shared by others, if not strictly correct. At the very least this approach can be used to verify syntax. Semantic correctness will probably require the creation of a consumer for the test data which will likely always be prone to the "incorrect assumptions" pitfall you mention.

Is the randomness always in the same places? I.e. is most of the file fixed but there are some parts that always change? If so, you might be able to take several outputs and use a programmatic diff to determine the nondeterministic parts. Once those are known, you could use the information to derive a mask and then do a comparison (md5 or just a straight compare). Think about pre-processing the file to remove (or overwrite with deterministic data) the parts that are non-deterministic.
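One possible form of this pre-processing, assuming the non-deterministic parts are textual UUIDs: replace each distinct UUID with a stable placeholder in order of first appearance, then hash the result. The function names are illustrative, and the sketch does not address output whose statements appear in a different order from run to run.

```python
import hashlib
import re

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize(text):
    """Replace each distinct UUID with uuid-1, uuid-2, ... by first appearance."""
    seen = {}

    def repl(match):
        key = match.group(0).lower()
        # setdefault keeps the existing placeholder if this UUID was seen before
        return seen.setdefault(key, f"uuid-{len(seen) + 1}")

    return UUID_RE.sub(repl, text)


def stable_digest(text):
    """md5 of the normalized text, comparable across runs."""
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
```

Numbering the placeholders rather than blanking them keeps the check sensitive to which identifiers refer to the same node, so two outputs match only if their UUIDs are reused in the same pattern.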
If the whole file is non-deterministic then you'll have to come up with a different solution. I did testing of MPEG-2 decoders, which are non-deterministic. In that case we were able to compute a PSNR against a reference output and fail the test if it fell below some threshold (a higher PSNR means a closer match). That may or may not work depending on your data, but something similar might be possible.
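For illustration, a PSNR check over two equal-length sequences of 8-bit samples might look like the sketch below. PSNR is higher for closer matches, so the sketch fails when the value drops below the threshold. The 40 dB threshold is purely a hypothetical value and would have to be tuned for the data at hand.

```python
import math


def psnr(reference, decoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if not reference or len(reference) != len(decoded):
        raise ValueError("sequences must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)


def close_enough(reference, decoded, threshold_db=40.0):
    """Pass when the outputs are similar enough (threshold is hypothetical)."""
    return psnr(reference, decoded) >= threshold_db
```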

Binary file & saved game formatting

I am working on a small roguelike game, and need some help with creating save games. I have tried several ways of saving games, but the load always fails, because I am not exactly sure what is a good way to mark the beginning of different sections for the player, entities, and the map.
What would be a good way of marking the beginning of each section, so that the data can read back reliably without knowing the length of each section?
Edit: The language is C++. It looks like a readable format would be a better shot. Thanks for all the quick replies.

The easiest solution is usually to use a library to write the data as XML or INI, then compress it. This will be easier for you to parse, and will result in smaller files than a custom binary format.
Of course, such files will take slightly longer to load (though not much, unless your data files are hundreds of MB).
If you're determined to use a binary format, take a look at BER.

Are you really sure you need binary format?
Why not store the data in some text format that can be parsed easily, be it plain text, XML or YAML?

Since you're saving binary data, you can't use markers without lengths.
Simply write the number of records of each type followed by the structured data; then it will be easy to read back. If you have variable-length elements such as strings, they also need length information.
2
player record
player record
3
entities record
entities record
entities record
1
map
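A small sketch of the count-prefixed layout above, using fixed-width little-endian integers and length-prefixed strings. It is written in Python for brevity, and the record fields (a name and hit points for players; a kind and coordinates for entities) are hypothetical. The same byte layout could be produced from C++ with fixed-width integer writes.

```python
import struct


def write_string(out, text):
    """Variable-length data gets a 4-byte length prefix."""
    data = text.encode("utf-8")
    out.write(struct.pack("<I", len(data)))
    out.write(data)


def read_string(inp):
    (length,) = struct.unpack("<I", inp.read(4))
    return inp.read(length).decode("utf-8")


def save(out, players, entities):
    """Each section starts with its record count, then the records."""
    out.write(struct.pack("<I", len(players)))
    for name, hp in players:
        write_string(out, name)
        out.write(struct.pack("<i", hp))
    out.write(struct.pack("<I", len(entities)))
    for kind, x, y in entities:
        write_string(out, kind)
        out.write(struct.pack("<ii", x, y))


def load(inp):
    """Read the sections back in the same order they were written."""
    (count,) = struct.unpack("<I", inp.read(4))
    players = []
    for _ in range(count):
        name = read_string(inp)
        (hp,) = struct.unpack("<i", inp.read(4))
        players.append((name, hp))
    (count,) = struct.unpack("<I", inp.read(4))
    entities = []
    for _ in range(count):
        kind = read_string(inp)
        x, y = struct.unpack("<ii", inp.read(8))
        entities.append((kind, x, y))
    return players, entities
```

Because every count and length is read before the data it describes, the loader never has to search for a section boundary.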

If you have a marker, you have to guarantee that the pattern doesn't exist elsewhere in your binary stream. If it does exist, you must use a special escape sequence to differentiate it. The Telnet protocol uses 0xFF to mark special commands that aren't part of the data stream. Whenever the data stream contains a naturally occurring 0xFF, then it must be replaced by 0xFFFF.
So you'd use a 2-byte marker to start a new section, like 0xFF01. If your reader sees 0xFF01, it's a new section. If it sees 0xFFFF, you'd collapse it into a single 0xFF. Naturally you can expand this approach to use any length marker you want.
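The marker-and-escape scheme described above could be sketched like this, with 0xFF 0x01 as the section marker and 0xFF 0xFF standing for a literal 0xFF. The function names are illustrative, and treating any other byte after 0xFF as an error is a choice made for this sketch.

```python
ESC = 0xFF
SECTION = 0x01  # 0xFF 0x01 starts a new section


def encode_sections(sections):
    """Prefix each payload with the marker and double every literal 0xFF."""
    out = bytearray()
    for payload in sections:
        out += bytes([ESC, SECTION])
        out += payload.replace(b"\xff", b"\xff\xff")
    return bytes(out)


def decode_sections(stream):
    """Split the stream at markers and collapse 0xFF 0xFF back into 0xFF."""
    sections = []
    i = 0
    while i < len(stream):
        byte = stream[i]
        if byte == ESC:
            if i + 1 >= len(stream):
                raise ValueError("truncated escape sequence")
            code = stream[i + 1]
            if code == SECTION:
                sections.append(bytearray())
            elif code == ESC and sections:
                sections[-1].append(ESC)
            else:
                raise ValueError("unexpected byte after 0xFF")
            i += 2
        else:
            if not sections:
                raise ValueError("data before first section marker")
            sections[-1].append(byte)
            i += 1
    return [bytes(s) for s in sections]
```

The reader must decide what a 0xFF means by looking at the following byte, which is why the payload can safely contain bytes that look like a marker.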
(Although my personal preference is a text format (optionally compressed) or a binary format with length bytes instead of markers. I don't understand how you're serializing it without knowing when you're done reading a data structure.)
