Lazily Read Stream of Bytes from File in Java 8

Lazily Read Stream of Bytes from File in Java 8 - file-io

So Java 8 introduces a lot of lazily loaded Streams, including one for reading lines from a text file using a particular character encoding.
But after doing a LOT of reading I've determined that there is no out of the box method to lazily read chunks of bytes from file, and I'm a bit confused as to why this is the case. This is a pretty common use-case, so there must be a good reason for it to not have been included, right?
My best solution to this seems to be a custom implementation of a Spliterator to read byte chunks using some guidance from this post:
https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams
But I would love to know why Java 8 doesn't have this feature out of the box?

Related

Reading DIEs in ELF file

Hello I fairly new to the DWARF standard and ELF format. I have a few questions. I am using the DWARF 2 standard and I have a pretty basic understanding of how DIEs work and I was needing more clarity on how they are represented in bytes.
ELF Wiki provides a good table for in which order the bytes go in the program header, sections, and segments. But what is the correct way to represent DIEs in bytes for the DWARF 2 standard?
I have tried to dive deep into Dwarf Standards pdf documents to try to understand how DIEs are represented in bytes. Perhaps there is a section I am missing?
I would like to use this information to be able to delete certain DIEs to save space in the debugging section. I am only interest in the DIEs that provide variable address's.

I recommend that anyone starting out in DWARF begin with the Introduction to the DWARF Debugging Format. It's a very concise overview that provides an excellent foundation for exploring the format in further depth. Armed with this background, compile a debug version of a very simple program and compare a hex dump of the two ELF sections .debug_abbrev and .debug_info with the output of dwarfdump or readelf.
Once you are broadly familiar with the encoding of a DIE you will see that simply deleting its corresponding bytes from .debug_info would corrupt the entire file — in terms of both DWARF and ELF. For example, each DIE is identified by its relative file offset; deleting one DIE's bytes would alter the offsets of all subsequent DIEs and any references to them would therefore be broken. A robust solution would require parsing the DWARF to create an internal representation of the tree before eliminating unwanted nodes and writing out new DWARF. After modifying .debug_info you'd then need to edit the fabric of the ELF itself: at the very least, this would involve updating the section header table to reflect the new offsets for any shifted sections and updating any relocations.
If your principal concern is indeed space saving then I suggest you instead investigate what compiler options you have. The Oracle Studio Compilers, for example, allow fine control over the content included in the DWARF. Depending on your compiler and OS it may also be possible to emit files with compressed DWARF sections (e.g. .zdebug_info) or even leave the DWARF in different files altogether. The problem of DWARF bloat is well known and, if you are interested in tackling it at a low level yourself, you will find other suggestions in Michael Eager's introduction and in later versions of the standard.

The format is explained page 66 in sections 7.5.2 and 7.5.3.
The example in appendix 2, page 93 is is much clearer:
Each DIE references a corresponding entry in .debug_abbrev which defines a given DIE "signature" i.e.
its type (DW_TAG_*)
it has child DIE
its attribute (DW_AT_*) and their form (DW_FORM_*).
The format od the DIE is:
reference to a abbreviation (LEB128 i.e. variable length);
0 is used for ending a list of children̄;
une value per attribute (using the encoding associated with the given form).

How to match ZLib stream between VBA 6/VBA 7and Java 8?

We are being able to do the following.
In VBA 6/ VBA 7:
Refer a 32 bit zlibwapi.dll (VBA 6) or 64 bit zlibwapi.dll (VBA 7).
Invoke compress() or compress2() methods to generate compressed
streams
Invoke uncompress() and uncompress2() methods to decompress
compressed streams
In Java 8 (JDK 1.8 on Tomcat 8)
Have a simple java program that compresses data using the new
Deflater() instance
Have a simple Java program that decompresses using Inflater()
instance
We are failing when VBA sends out the compressed stream for Java Servlet to uncompress or when Java Servlet sends out compressed response data for VBA to decompress.
We are aware of following facts.
there are 3 formats provided by ZLib (raw, zlib and gzip).
The methods in zlibwapi.dll namely compress() and compress2()
generates compressed bytes in zlib format. This has been mentioned
in a similar thread at
Java decompressing array of bytes
Inflater() instance on Java side allows to uncompress zlib format
data as per a code sample posted at
Compression / Decompression of Strings using the deflater
Java 8 has zlib version 1.2.5 integrated as part of java.utils.zip
package.
We have ensured that we are using zlibwapi.dll version
1.2.5 on VBA side as well.
We have tried to use Hex editors to compare byte streams of compressed data independently generated by VBA and Java as well. We notice some difference in the generated compressed data. We think it is this difference that is causing both the environments to misunderstand each other.
Additionally, we think that when communication occurs, there has to be some common charset that governs the encoding/decoding scheme between both the endpoints. We have even tried to compare the hex code of byte stream generated by VBA and communicated across to Java Servlet.
The bytes seem to be getting some additional 0 bytes inserted in
between the actual set of compressed bytes while communication
occurs. This happens on VBA side. May be because of some unicode interpretation.
Whatever bytes get communicated across to Java appear entirely
different in their representation.
We need to fix our independently working code to communicate with one another and compress and decompress peacefully. We think there are 2 things to address - Getting format to match and using a charset that sends bytes as is. We are looking for any assistance from experts on this front that can help us find correct path to the possible solution. We need answers for
Does compress2() or compress() really generate zlib format?
Which charset will allow us to send bytes as is (if there are 10
bytes, we want to send 10 bytes. Not 20). If its unicode, 0 bytes
get inserted in between (10 bytes become 20 bytes because of this).

Yes.
Don't send characters. Send bytes.

Why should applications read a PDF file backwards?

I am trying to wrap my head around the PDF file structure. There is a header, a body with objects, a cross-reference table and a trailer. In the official PDF reference from Adobe, section 3.4.4 about file trailer, we can read that:
The trailer of a PDF file enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF file from its end.
This looks very inefficient to me. I can't show anything to users this way (not even the first page) before I load the whole file. Well, to be precise, I can - if my file is linearized. But that is optional and means some extra overhead both when writing and reading such file.
Instead of that whole linearization thing, it would be easier to just put the references in front of the body (followed by objects on page 1, page2, page 3... ). But people in Adobe probably had their reasons to put it after it. I just don't see them. So...
Why is the cross-reference table placed after the body?

I would agree with the two reasons already mentioned, but not because of hardware limitations "back in the day", but rather scale. It's easy to think an invoice with a couple of pages of text could be done better differently, but what about a book, or a PDF with 1,000 photos?
With the trailer at the end you can write images/text/fonts to the file as they are processed and then discard them from memory while simply storing the file offset of each object to be used to write the trailer.
If the trailer had to come first then you would have to read (or even generate in the case of an embedded font) all of these objects just to get their size so you could write out the trailer, then write all the objects to the file. So you would either be reading, sizing, discarding, then reading again, or trying to hold everything in ram until you could write them to the file.
Write speed and ram are still issues we contend with today when we're running in a docker container on a VM on shared hardware..

PDF was invented back when hard drives were slow to write files... really s-l-o-w. By putting the xref at the end, you could quickly change a file by simply appending new objects and an updated xref to the end of the file rather than rewriting the whole thing.

Not only were the drives slow (giving rise to the argument in #joelgeraci's answer), also was there much less RAM available in a typical computer. Thus, when creating a pdf one had to write data to file early, much earlier than one had any idea how big the file or, as a consequence, the cross references would become. Writing the cross references at the end, therefore, was a natural consequence.

What file type starts with BOSS 7?

I am looking at some files generated in the early 90s. One of them seems to hold references to data packed in some binary format in a number of large files.
The first six bytes of the file are 0x42 0x4f 0x53 0x53 0x20 0x37 which spells BOSS 7.
My searches of various sources of file type information, including /usr/share/file/magic have not turned up anything. Does anyone know what software might have been used to generate files that start with these bytes? Any information on file layout would be great.

It looks like the file might have been generated by VisualWorks Smalltalk:
[BOSS 7.5]
Contains the Binary Object Streaming Service, which supports efficient storage and
retrieval of objects, including code, to and from files.
Note that for code storage, the parcel system now supercedes BOSS.
I tried to load the file using the IDE provided at http://www.cincomsmalltalk.com/ and it generated a meaningful exception:
The identifier MediaCollectionDictionary has no binding
The file does contain:
MediaCollectionDictionary
MediaCollection*
CallMediaVehDict2
etc which means, if I could now figure out what the rest of the files do and learn enough SmallTalk, I could disentangle this mess.
Of course, I am not sure if this analysis is correct. So, please if you have any other ideas, let me know. Thank you.
Much later: So, my initial assessment seems to be correct. I got some useful tips on comp.lang.smalltalk: http://groups.google.com/group/comp.lang.smalltalk/browse_thread/thread/5d55d857e2f80158#

Ask on comp.lang.smalltalk
Ask on the vwnc mailing list

What language is to binary, as Perl is to text?

I am looking for a scripting (or higher level programming) language (or e.g. modules for Python or similar languages) for effortlessly analyzing and manipulating binary data in files (e.g. core dumps), much like Perl allows manipulating text files very smoothly.
Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), convert data from one endianess to another, etc. That is, things you normally would use C or assembly for, but I'm looking for a language which allows for writing tiny pieces of code for highly specific, one-time purposes very quickly.
Any suggestions?

Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), convert data from one endianess to another, etc. That is, things you normally would use C or assembly for, but I'm looking for a language which allows for writing tiny pieces of code for highly specific, one-time purposes very quickly.
Well, while it may seem counter-intuitive, I found erlang extremely well-suited for this, namely due to its powerful support for pattern matching, even for bytes and bits (called "Erlang Bit Syntax"). Which makes it very easy to create even very advanced programs that deal with inspecting and manipulating data on a byte- and even on a bit-level:
Since 2001, the functional language Erlang comes with a byte-oriented datatype (called binary) and with constructs to do pattern matching on a binary.
And to quote informIT.com:
(Erlang) Pattern matching really starts to get
fun when combined with the binary
type. Consider an application that
receives packets from a network and
then processes them. The four bytes in
a packet might be a network byte-order
packet type identifier. In Erlang, you
would just need a single processPacket
function that could convert this into
a data structure for internal
processing. It would look something
like this:
processPacket(<<1:32/big,RestOfPacket>>) ->
% Process type one packets
...
;
processPacket(<<2:32/big,RestOfPacket>>) ->
% Process type two packets
...
So, erlang with its built-in support for pattern matching and it being a functional language is pretty expressive, see for example the implementation of ueencode in erlang:
uuencode(BitStr) ->
<< (X+32):8 || <<X:6>> <= BitStr >>.
uudecode(Text) ->
<< (X-32):6 || <<X:8>> <= Text >>.
For an introduction, see Bitlevel Binaries and Generalized Comprehensions in Erlang.You may also want to check out some of the following pointers:
Parsing Binaries with erlang, lamers inside
More File Processing with Erlang
Learning Erlang and Adobe Flash format same time
Large Binary Data is (not) a Weakness of Erlang
Programming Efficiently with Binaries and Bit Strings
Erlang bit syntax and network programming
erlang, the language for network programming (1)
Erlang, the language for network programming Issue 2: binary pattern matching
An Erlang MIDI File Reader/Writer
Erlang Bit Syntax
Comprehending endianness
Playing with Erlang
Erlang: Pattern Matching Declarations vs Case Statements/Other
A Stream Library using Erlang Binaries
Bit-level Binaries and Generalized Comprehensions in Erlang
Applications, Implementation and Performance Evaluation of Bit Stream Programming in Erlang

perl's pack and unpack ?

Take a look at python bitstring, it looks like exactly what you want :)

The Python bitstring module was written for this purpose. It lets you take arbitary slices of binary data and offers a number of different interpretations through Python properties. It also gives plenty of tools for constructing and modifying binary data.
For example:
>>> from bitstring import BitArray, ConstBitStream
>>> s = BitArray('0x00cf') # 16 bits long
>>> print(s.hex, s.bin, s.int) # Some different views
00cf 0000000011001111 207
>>> s[2:5] = '0b001100001' # slice assignment
>>> s.replace('0b110', '0x345') # find and replace
2 # 2 replacements made
>>> s.prepend([1]) # Add 1 bit to the start
>>> s.byteswap() # Byte reversal
>>> ordinary_string = s.bytes # Back to Python string
There are also functions for bit-wise reading and navigation in the bitstring, much like in files; in fact this can be done straight from a file without reading it into memory:
>>> s = ConstBitStream(filename='somefile.ext')
>>> hex_code, a, b = s.readlist('hex:32, uint:7, uint:13')
>>> s.find('0x0001') # Seek to next occurence, if found
True
There are also views with different endiannesses as well as the ability to swap endianness and much more - take a look at the manual.

I'm using 010 Editor to view binary files all the time to view binary files.
It's especially geared to work with binary files.
It has an easy to use c-like scripting language to parse binary files and present them in a very readable way (as a tree, fields coded by color, stuff like that)..
There are some example scripts to parse zipfiles and bmpfiles.
Whenever I create a binary file format, I always make a little script for 010 editor to view the files. If you've got some header files with some structs, making a reader for binary files is a matter of minutes.

Any high-level programming language with pack/unpack functions will do. All 3 Perl, Python and Ruby can do it. It's matter of personal preference. I wrote a bit of binary parsing in each of these and felt that Ruby was easiest/most elegant for this task.

Why not use a C interpreter? I always used them to experiment with snippets, but you could use one to script something like you describe without too much trouble.
I have always liked EiC. It was dead, but the project has been resurrected lately. EiC is surprisingly capable and reasonably quick. There is also CINT. Both can be compiled for different platforms, though I think CINT needs Cygwin on windows.

Python's standard library has some of what you require -- the array module in particular lets you easily read parts of binary files, swap endianness, etc; the struct module allows for finer-grained interpretation of binary strings. However, neither is quite as rich as you require: for example, to present the same data as bytes or halfwords, you need to copy it between two arrays (the numpy third-party add-on is much more powerful for interpreting the same area of memory in several different ways), and, for example, to display some bytes in hex there's nothing much "bundled" beyond a simple loop or list comprehension such as [hex(b) for b in thebytes[start:stop]]. I suspect there are reusable third-party modules to facilitate such tasks yet further, but I can't point you to one...

Forth can also be pretty good at this, but it's a bit arcane.

Well, if speed is not a consideration, and you want perl, then translate each line of binary into a line of chars - 0's and 1's. Yes, I know there are no linefeeds in binary :) but presumably you have some fixed size -- e.g. by byte or some other unit, with which you can break up the binary blob.
Then just use the perl string processing on that data :)

If you're doing binary level processing, it is very low level and likely needs to be very efficient and have minimal dependencies/install requirements.
So I would go with C - handles bytes well - and you can probably google for some library packages that handle bytes.
Going with something like Erlang introduces inefficiencies, dependencies, and other baggage you probably don't want with a low-level library.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas