What language is to binary, as Perl is to text? - scripting

I am looking for a scripting (or higher level programming) language (or e.g. modules for Python or similar languages) for effortlessly analyzing and manipulating binary data in files (e.g. core dumps), much like Perl allows manipulating text files very smoothly.
Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), converting data from one endianness to another, etc. That is, things you would normally use C or assembly for, but I'm looking for a language which allows for writing tiny pieces of code for highly specific, one-time purposes very quickly.
Any suggestions?

Well, while it may seem counter-intuitive, I found Erlang extremely well suited for this, thanks to its powerful support for pattern matching, even on bytes and bits (the "Erlang bit syntax"). This makes it very easy to write even quite advanced programs that inspect and manipulate data at the byte level and even at the bit level:
Since 2001, the functional language Erlang comes with a byte-oriented datatype (called binary) and with constructs to do pattern matching on a binary.
And to quote informIT.com:
(Erlang) Pattern matching really starts to get
fun when combined with the binary
type. Consider an application that
receives packets from a network and
then processes them. The four bytes in
a packet might be a network byte-order
packet type identifier. In Erlang, you
would just need a single processPacket
function that could convert this into
a data structure for internal
processing. It would look something
like this:
processPacket(<<1:32/big, RestOfPacket/binary>>) ->
    % Process type one packets
    ...
;
processPacket(<<2:32/big, RestOfPacket/binary>>) ->
    % Process type two packets
    ...
So Erlang, with its built-in support for pattern matching and its functional style, is pretty expressive. See for example this implementation of uuencode in Erlang:
uuencode(BitStr) ->
<< (X+32):8 || <<X:6>> <= BitStr >>.
uudecode(Text) ->
<< (X-32):6 || <<X:8>> <= Text >>.
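For readers who do not know Erlang, here is a rough Python sketch of what the two comprehensions above do: regroup the input bits into 6-bit (or 8-bit) chunks and add (or subtract) 32. The function names are made up for this illustration, and dropping leftover bits that do not fill a whole chunk is an assumption about how to mirror the comprehension, not a full uuencode implementation (no line lengths or framing).

```python
def regroup_6_to_8(data):
    """Split the bits of `data` into 6-bit chunks and emit chunk + 32 as bytes."""
    bits = "".join(format(b, "08b") for b in data)
    usable = len(bits) - len(bits) % 6  # ignore bits that do not fill a chunk
    return bytes(int(bits[i:i + 6], 2) + 32 for i in range(0, usable, 6))


def regroup_8_to_6(text):
    """Inverse direction: take (byte - 32) as a 6-bit value and repack into bytes."""
    bits = "".join(format((c - 32) & 0x3F, "06b") for c in text)
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


encoded = regroup_6_to_8(b"Cat")
decoded = regroup_8_to_6(encoded)
```

The Erlang version is a one-liner because the bit syntax does the regrouping for you; in Python the detour through a string of bits is the simplest way to show the same idea.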
For an introduction, see Bit-level Binaries and Generalized Comprehensions in Erlang. You may also want to check out some of the following pointers:
Parsing Binaries with erlang, lamers inside
More File Processing with Erlang
Learning Erlang and Adobe Flash format same time
Large Binary Data is (not) a Weakness of Erlang
Programming Efficiently with Binaries and Bit Strings
Erlang bit syntax and network programming
erlang, the language for network programming (1)
Erlang, the language for network programming Issue 2: binary pattern matching
An Erlang MIDI File Reader/Writer
Erlang Bit Syntax
Comprehending endianness
Playing with Erlang
Erlang: Pattern Matching Declarations vs Case Statements/Other
A Stream Library using Erlang Binaries
Bit-level Binaries and Generalized Comprehensions in Erlang
Applications, Implementation and Performance Evaluation of Bit Stream Programming in Erlang

Perl's pack and unpack?

Take a look at python bitstring, it looks like exactly what you want :)

The Python bitstring module was written for this purpose. It lets you take arbitrary slices of binary data and offers a number of different interpretations through Python properties. It also gives plenty of tools for constructing and modifying binary data.
For example:
>>> from bitstring import BitArray, ConstBitStream
>>> s = BitArray('0x00cf') # 16 bits long
>>> print(s.hex, s.bin, s.int) # Some different views
00cf 0000000011001111 207
>>> s[2:5] = '0b001100001' # slice assignment
>>> s.replace('0b110', '0x345') # find and replace
2 # 2 replacements made
>>> s.prepend([1]) # Add 1 bit to the start
>>> s.byteswap() # Byte reversal
>>> ordinary_string = s.bytes # Back to Python string
There are also functions for bit-wise reading and navigation in the bitstring, much like in files; in fact this can be done straight from a file without reading it into memory:
>>> s = ConstBitStream(filename='somefile.ext')
>>> hex_code, a, b = s.readlist('hex:32, uint:7, uint:13')
>>> s.find('0x0001') # Seek to next occurrence, if found
True
There are also views with different endiannesses as well as the ability to swap endianness and much more - take a look at the manual.

I use 010 Editor to view binary files all the time.
It's especially geared to work with binary files.
It has an easy-to-use C-like scripting language to parse binary files and present them in a very readable way (as a tree, with fields coded by color, and so on).
There are some example scripts to parse ZIP and BMP files.
Whenever I create a binary file format, I always make a little script for 010 editor to view the files. If you've got some header files with some structs, making a reader for binary files is a matter of minutes.

Any high-level programming language with pack/unpack functions will do. Perl, Python and Ruby can all do it; it's a matter of personal preference. I wrote a bit of binary parsing in each of these and felt that Ruby was the easiest and most elegant for this task.
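As a minimal sketch of the pack/unpack style in Python, using the standard struct module: the 8-byte record layout below is invented for illustration (a 4-byte big-endian type id followed by two 2-byte little-endian fields), not taken from any real format.

```python
import struct

# Hypothetical record: big-endian 32-bit type id, then two little-endian
# 16-bit fields. The layout is an assumption made for this example.
record = bytes([0x00, 0x00, 0x00, 0x02, 0x34, 0x12, 0xCF, 0x00])

(packet_type,) = struct.unpack_from(">I", record, 0)  # ">" = big-endian
a, b = struct.unpack_from("<HH", record, 4)           # "<" = little-endian

# Re-pack the same values with the opposite byte order for each field.
swapped = struct.pack("<I", packet_type) + struct.pack(">HH", a, b)

print(packet_type, hex(a), hex(b), swapped.hex())
```

Perl's pack/unpack and Ruby's Array#pack/String#unpack follow the same idea of a short format string describing the layout.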

Why not use a C interpreter? I always used them to experiment with snippets, but you could use one to script something like you describe without too much trouble.
I have always liked EiC. It was dead, but the project has been resurrected lately. EiC is surprisingly capable and reasonably quick. There is also CINT. Both can be compiled for different platforms, though I think CINT needs Cygwin on Windows.

Python's standard library has some of what you require -- the array module in particular lets you easily read parts of binary files, swap endianness, etc; the struct module allows for finer-grained interpretation of binary strings. However, neither is quite as rich as you require: for example, to present the same data as bytes or halfwords, you need to copy it between two arrays (the numpy third-party add-on is much more powerful for interpreting the same area of memory in several different ways), and, for example, to display some bytes in hex there's nothing much "bundled" beyond a simple loop or list comprehension such as [hex(b) for b in thebytes[start:stop]]. I suspect there are reusable third-party modules to facilitate such tasks yet further, but I can't point you to one...
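A small sketch of what this looks like in practice, assuming the input is big-endian 16-bit data and that the platform's unsigned short ('H') is 2 bytes wide (true on common platforms, but not guaranteed by the language). The sample bytes are made up.

```python
import sys
from array import array

raw = bytes([0x00, 0xCF, 0x12, 0x34])  # made-up sample data

# View the data as individual bytes.
as_bytes = array("B", raw)

# Copy the same data into an array of 16-bit halfwords (native byte order).
halfwords = array("H")
halfwords.frombytes(raw)

# The sample is assumed to be big-endian, so swap on little-endian hosts.
if sys.byteorder == "little":
    halfwords.byteswap()

# Hex display via a simple list comprehension.
hex_bytes = [format(b, "02x") for b in as_bytes]
hex_halfwords = [format(h, "04x") for h in halfwords]
print(hex_bytes, hex_halfwords)
```

Note that the two arrays hold separate copies of the data, which is exactly the limitation described above.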

Forth can also be pretty good at this, but it's a bit arcane.

Well, if speed is not a consideration, and you want Perl, then translate each line of binary into a line of chars - 0's and 1's. Yes, I know there are no linefeeds in binary :) but presumably you have some fixed size, e.g. a byte or some other unit, by which you can break up the binary blob.
Then just use Perl's string processing on that data :)
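The same trick, sketched in Python rather than Perl: one line of '0'/'1' characters per byte, after which ordinary regular expressions apply. The sample data and the pattern are invented for illustration.

```python
import re

data = bytes([0x00, 0xCF, 0xFF])  # made-up sample data

# One line of '0'/'1' characters per byte.
lines = [format(b, "08b") for b in data]

# Ordinary text tools now apply, e.g. find bytes whose low nibble is all ones.
matches = [i for i, line in enumerate(lines) if re.search(r"1111$", line)]

# Convert the text back to binary when done.
restored = bytes(int(line, 2) for line in lines)
print(lines, matches)
```

This uses eight characters of text per byte of input, so it only makes sense when speed and memory really are not a concern.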

If you're doing binary level processing, it is very low level and likely needs to be very efficient and have minimal dependencies/install requirements.
So I would go with C - handles bytes well - and you can probably google for some library packages that handle bytes.
Going with something like Erlang introduces inefficiencies, dependencies, and other baggage you probably don't want with a low-level library.

Related

Is there any way in Fortran 90 to read data at a specified byte

I have a problem that requires reading data at a specified byte offset in a binary input file, for example at 40000 bytes from the start of the file. I intended to use direct access, but that requires the file to be divided into records of equal size, as specified by the recl argument. Can anybody suggest a feasible solution? Some programming languages, such as C, provide a function that can jump to a specified byte.
The Fortran 2003 standard introduced unformatted stream access, to pretty much do exactly this. Once the file has been opened appropriately you can just use a POS specifier in the relevant read or write statement. Support for this Fortran 2003 feature is reasonably widespread amongst the Fortran compilers that are actively supported. The compiler needs to use a file storage unit of a byte, but all compilers that I am aware of do this (this is also what the standard recommends).
Otherwise, the closest standard Fortran 90 approach is to use unformatted direct access with a record length that is some reasonable common factor of the desired position and size of the elements of data to be read. For instance - if you were reading eight byte real numbers from the file, then a record length of eight might work - you would start reading at record number 5000. This requires both that the file storage unit of the Fortran processor be a byte (common, perhaps with compile options) and that no record delimiters or similar exists in the file for unformatted direct access (mostly the case, again perhaps with compile options).
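For comparison, this is the "jump to a byte offset" behaviour the question refers to, sketched in Python rather than Fortran. The throwaway file and the little-endian 8-byte real stored at offset 40000 are assumptions made so the example is self-contained.

```python
import os
import struct
import tempfile

OFFSET = 40000  # byte position mentioned in the question

# Build a throwaway file: 40000 zero bytes, then one 8-byte real.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(OFFSET))
    f.write(struct.pack("<d", 3.5))
    path = f.name

with open(path, "rb") as f:
    f.seek(OFFSET)  # jump straight to the byte offset
    (value,) = struct.unpack("<d", f.read(8))

os.remove(path)
print(value)
```

Note that seek() offsets are zero-based, whereas Fortran record numbers and stream positions start at 1.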

Normalize Audio File obj-c

Is there a library function to normalize a sound file? I have searched around but could not find any.
I would like to be able to normalize a sound file and setting that into the sound file so it only needs to be done once rather than on the fly.
Can this be done with Core-Audio?
Yes it can be done, but not with a single function call.
The functionality you want is not in fact in CoreAudio itself, but rather in ExtendedAudioFile.h, which is part of the AudioToolbox framework. This is available on both iOS and MacOSX. I can attest to this being rather hard to find.
Functions of interest in this header are ExtAudioFileOpenURL(), ExtAudioFileRead() and ExtAudioFileWrite().
In outline what you do:
Use ExtAudioFileOpenURL() to open the input file
Use ExtAudioFileGetProperty() with propertyId kExtAudioFileProperty_FileDataFormat to obtain an AudioStreamBasicDescription describing the file.
Possibly set the ASBD to get the format you want. AudioToolBox on MacOSX seems rather more amenable to this than on iOS.
Calculate and allocate a buffer large enough to hold the entire audio file
Read the entire file with ExtAudioFileRead() - NB: this call might not read it all in one go - operating in much the same way as POSIX read()
Perform normalisation
Use ExtAudioFileCreateWithURL() to create the output file
Use ExtAudioFileWrite() to write the normalised samples out.
Dispose of both audio files.
The documentation links to several example projects that can act as donors of working code. You'll find doing normalisation much easier with the samples as floats, but on iOS I could never get the conversion to work automatically, so you might have to do the format conversion yourself.
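The "Perform normalisation" step is not spelled out above. One common interpretation is peak normalisation on float samples, sketched here in Python rather than Objective-C; the function name and the choice of peak (rather than RMS or loudness) normalisation are assumptions for illustration.

```python
def normalise_peak(samples, target_peak=1.0):
    """Scale float samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


normalised = normalise_peak([0.1, -0.25, 0.5, -0.05])
print(normalised)
```

Because the gain depends on the peak of the whole file, the samples have to be scanned once before any of them can be scaled, which is why the outline above reads the entire file into a buffer first.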

Quick uses for scripting languages?

I feel that there are a lot of quick uses for scripting languages that you may only think of if you have the shell open at all times. I leave a terminal tab open with python running and have solved many problems with a few lines of code typed off the top of my head. What are some of your less obvious uses for the scripting language of your choice.
Most recently in my Windows centric world I have used it to rename large numbers of files, search/filter log files for a specific occurrence, perform network diagnostics, and a host of smaller things I can't think of at the moment that some of my colleagues not having a UNIX background would never have thought of.
I just used a Lua script in SciTE to take a selected SVG path and do some operations on it (find min values and translate to 0, scale, round up values to avoid having a ton of decimal digits). It is just handy.
Reformat text in some complicated way;
Prepare some text based on a template logic;
Rename multiple files (e.g. music collection or photos);
etc.
Something very similar was discussed in the Wikibooks article Ad Hoc Data Analysis From The Command Line.
This mostly discusses the use of Unix commands rather than scripting languages, but the principle is the same ... have a shell open at all times.
Use BeautifulSoup to clean up some HTML.

Which is it Perl or perl, TIF or TIFF, ant or Ant, ClearCase or Clear Case?

In one sentence I have managed to create 16 possible variations on how I present information. Does it matter as long as the context is clear? Do any common mistakes irritate you?
Regarding Perl: How should I capitalize Perl?
TIFF stands for Tagged Image File Format, whereas the extension of files using that format is often ".tif".
That is for the purpose of compatibility with 8.3 filenames, I believe.
I generally like the Perl way of capitalizing when used as a proper noun, but lowercasing when referring to the command itself (assuming the command is lowercase to begin with).
Well, Perl and TIFF have already been answered, so I'll add the last two:
The Apache Foundation writes "Apache Ant".
Rational ClearCase (or sometimes "IBM Rational ClearCase") is written as such at its web site.
Even though Perl was originally an acronym for Practical Extraction and Report Language, it is written Perl.
These things don't 'bother' me as much as they provide insights into the level of knowledge of the speaker/author. You see, we work in an industry that requires precision, so precision in language does matter as it affects the understanding of the consumer.
The one that really seems to bother me is when people fully upper case JAVA as though it was an acronym.

Process for reducing the size of an executable

I'm producing a hex file to run on an ARM processor which I want to keep below 32K. It's currently a lot larger than that and I wondered if someone might have some advice on what's the best approach to slim it down?
Here's what I've done so far
So I've run 'size' on it to determine how big the hex file is.
Then 'size' again to see how big each of the object files are that link to create the hex files. It seems the majority of the size comes from external libraries.
Then I used 'readelf' to see which functions take up the most memory.
I searched through the code to see if I could eliminate calls to those functions.
Here's where I get stuck: there are some functions which I don't call directly (e.g. _vfprintf) and I can't find what calls them, so I can't remove the calls (I think I don't need them).
So what are the next steps?
Response to answers:
As I can see there are functions being called which take up a lot of memory. I cannot however find what is calling it.
I want to omit those functions (if possible) but I can't find what's calling them! Could be called from any number of library functions I guess.
The linker is working as desired, I think, it only includes the relevant library files. How do you know if only the relevant functions are being included? Can you set a flag or something for that?
I'm using GCC
General list:
Make sure that you have the compiler and linker debug options disabled
Compile and link with all size options turned on (-Os in gcc)
Run strip on the executable
Generate a map file and check your function sizes. You can either get your linker to generate your map file (-M when using ld), or you can use objdump on the final executable (note that this will only work on an unstripped executable!) This won't actually fix the problem, but it will let you know of the worst offenders.
Use nm to investigate the symbols that are called from each of your object files. This should help in finding who's calling functions that you don't want called.
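As a rough sketch of how to rank the worst offenders, the following Python helper parses text in the format printed by GNU `nm --print-size` (address, size, type, name, with address and size in hex). The function name is made up, the sample output is invented for illustration, and the parser assumes symbol names contain no spaces (so demangled C++ names would need extra handling).

```python
def largest_symbols(nm_output, limit=5):
    """Return (size, name) pairs, largest first, from `nm --print-size` text."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Sized symbols have four fields: address, size, type, name.
        # Undefined or unsized symbols have fewer fields and are skipped.
        if len(parts) != 4:
            continue
        _address, size_hex, _type, name = parts
        symbols.append((int(size_hex, 16), name))
    symbols.sort(reverse=True)
    return symbols[:limit]


# Invented sample output, for illustration only.
sample = """\
00008000 00000010 T main
00008010 00000a3c T _vfprintf
         U memcpy
00008a4c 00000124 T strtol
"""
top = largest_symbols(sample, limit=2)
print(top)
```

In real use the text would come from running nm on an unstripped executable or object file and capturing its output.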
In the original question was a sub-question about including only relevant functions. gcc will include all functions within every object file that is used. To put that another way, if you have an object file that contains 10 functions, all 10 functions are included in your executable even if only 1 is actually called.
The standard libraries (eg. libc) will split functions into many separate object files, which are then archived. The executable is then linked against the archive.
By splitting into many object files the linker is able to include only the functions that are actually called. (this assumes that you're statically linking)
There is no reason why you can't do the same trick. Of course, you could argue that if the functions aren't called then you can probably remove them yourself.
If you're statically linking against other libraries you can run the tools listed above over them too to make sure that they're following similar rules.
Another optimization that might save you work is -ffunction-sections, -Wl,--gc-sections, assuming you're using GCC. A good toolchain will not need to be told that, though.
Explanation: GNU ld links sections, and GCC emits one section per translation unit unless you tell it otherwise. But in C++, the nodes in the dependency graph are objects and functions.
On deeply embedded projects I always try to avoid using any standard library functions. Even simple functions like "strtol()" blow up the binary size. If possible just simply avoid those calls.
In most deeply embedded projects you don't need a versatile "printf()" or dynamic memory allocation (many controllers have 32kb or less RAM).
Instead of just using "printf()" I use a very simple custom "printf()", this function can only print numbers in hexadecimal or decimal format not more. Most data structures are preallocated at compile time.
Andrew EdgeCombe has a great list, but if you really want to scrape every last byte, sstrip is a good tool that is missing from the list and can shave off a few more kB.
For example, when run on strip itself, it can shave off ~2kB.
From an old README (see the comments at the top of this indirect source file):
sstrip is a small utility that removes the contents at the end of an
ELF file that are not part of the program's memory image.
Most ELF executables are built with both a program header table and a
section header table. However, only the former is required in order
for the OS to load, link and execute a program. sstrip attempts to
extract the ELF header, the program header table, and its contents,
leaving everything else in the bit bucket. It can only remove parts of
the file that occur at the end, after the parts to be saved. However,
this almost always includes the section header table, and occasionally
a few random sections that are not used when running a program.
Note that due to some of the information that it removes, a sstrip'd executable is rumoured to have issues with some tools. This is discussed more in the comments of the source.
Also... for an entertaining/crazy read on how to make the smallest possible executable, this article is worth a read.
Just to double-check and document for future reference, but do you use Thumb instructions? They're 16 bit versions of the normal instructions. Sometimes you might need 2 16 bit instructions, so it won't save 50% in code space.
A decent linker should take just the functions needed. However, you might need compiler and linker settings to package functions for individual linking.
OK, so in the end I just reduced the project to its simplest form, then slowly added files one by one until the function that I wanted to remove appeared in the 'readelf' output. Then when I had the file I commented everything out and slowly added things back in until the function popped up again. So in the end I found out what called it and removed all those calls... Now it works as desired... sweet!
Must be a better way to do it though.
To answer this specific need:
• I want to omit those functions (if possible) but I can't find what's calling them! Could be called from any number of library functions I guess.
If you want to analyze your code base to see who calls what, by whom a given function is being called and things like that, there is a great tool out there called "Understand C" provided by SciTools.
https://scitools.com/
I have used it very often in the past to perform static code analysis. It can really help to determine the library dependency tree. Among other things, it lets you easily browse up and down the call tree.
They provide a limited time evaluation, then you must purchase a license.
You could look at something like executable compression.