Read in 4-byte words from binary file in Julia - file-io

I have a simple binary file that contains 32-bit floats adjacent to each other.
Using Julia, I would like to read each number (i.e. each 32-bit word) and place them sequentially into an array of Float32 values.
I've tried a few different approaches from the documentation, but all have yielded impossible values (I am using a binary file with known values as dummy input). It appears that:
Julia is reading the binary file one byte at a time.
Julia is putting each byte into a Uint8 array.
For example, readbytes(f, 4) gives a 4-element array of unsigned 8-bit integers. read(f, Float32, DIM) also gives strange values.
Anyone have any idea how I should proceed?

I'm not sure of the best way to read it in as Float32 directly, but given an array of 4*n Uint8s, I'd turn it into an array of n Float32s using reinterpret:
raw = rand(Uint8, 4*10) # i.e. a vector of Uint8 aka bytes
floats = reinterpret(Float32, raw) # now a vector of 10 Float32s
With output:
julia> raw = rand(Uint8, 4*2)
8-element Array{Uint8,1}:
0xc8
0xa3
0xac
0x12
0xcd
0xa2
0xd3
0x51
julia> floats = reinterpret(Float32, raw)
2-element Array{Float32,1}:
1.08951e-27
1.13621e11

(EDIT 2020: Outdated, see newest answer.) I found the issue. The correct way of importing binary data in single precision floating point format is read(f, Float32, NUM_VALS), where f is the file stream, Float32 is the data type, and NUM_VALS is the number of words (values or data points) in the binary data file.
It turns out that every time you call read(f, [...]), the stream position advances past the item just read.
This lets you read the data value by value very simply:
f = open("my_file.bin")
first_item = read(f, Float32)
second_item = read(f, Float32)
# etc ...
However, I wanted to load all the data in one line of code. While debugging, I had called read() on the same file handle several times without re-opening it, so the stream position had already advanced. As a result, when I experimented with the correct operation, namely read(f, Float32, NUM_VALS), I got unexpected values.

The Julia language has changed a lot in the five years since this question was asked. read() no longer has a method that takes a type and a length simultaneously, and reinterpret() now creates a view into the byte array rather than an array of the desired type. It seems that the best way to do this now is to pre-allocate the desired array and fill it with read!:
data = Array{Float32, 1}(undef, 128)  # pre-allocate space for 128 Float32 values
read!(io, data)                       # fill data from the stream io
This fills data with the desired Float32 values.

Related

Convert bytestring to array of uint8 Python

I was wondering what the fastest way of converting a bytestring to an array of uint8 would be in the following code? I use hashlib for SHA-256:
x = hashlib.sha256(str(word).encode("ascii")).digest()
Now I need to convert x to an array of uint8. I currently do this with a list comprehension, but it seems slow.
h_new = [int(y) for y in x]
num_hashes[idx_perm, :] = h_new
Does anyone have suggestions for a faster way of conversion?
You can use NumPy's frombuffer function. Since the size of a sha256 digest is always 32 bytes and the output type is known, the function can convert the input buffer x very quickly.
num_hashes[idx_perm, :] = np.frombuffer(x, np.uint8, 32)
This takes about 0.7 us per call on my machine, while the initial code takes about 3.35 us. Thus, this version is about 4.8 times faster. Note that this is also faster than converting the result to a list (due to the many int objects to be allocated and reference-counted) and than the fromiter NumPy function (since the iterable interface introduces additional overhead).
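For reference, a minimal self-contained version of the above (the hashed word is hypothetical):
import hashlib
import numpy as np

x = hashlib.sha256("word".encode("ascii")).digest()  # 32 raw bytes
h_new = np.frombuffer(x, np.uint8, 32)               # zero-copy view as uint8
print(h_new.dtype, h_new.shape)                      # uint8 (32,)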

Pytorch copying inexact value of numpy floating point number

I'm converting a floating-point number (or numpy array) to a Pytorch tensor and it seems to copy an inexact value into the tensor. The error shows up in the 8th significant digit and beyond. This is significant (no pun intended) for my work, as I deal with chaotic dynamics, which is very sensitive to slight changes in the initial conditions.
I'm already using torch.set_printoptions(precision=16) to print 16 significant digits.
np_x = state
print(np_x)
x = torch.tensor(np_x,requires_grad=True,dtype=torch.float32)
print(x.data[0])
and the output is :
0.7575408585008059
tensor(0.7575408816337585)
It would be helpful to know what is going wrong or how it could be resolved.
That's because you're using the float32 dtype. If you convert these two numbers to binary, you will find they are actually the same: strictly speaking, the most accurate representations of those two numbers in float32 format are identical.
0.7575408585008059
Most accurate representation = 7.57540881633758544921875E-1
0.7575408816337585
Most accurate representation = 7.57540881633758544921875E-1
Binary: 00111111 01000001 11101110 00110011
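This can be checked with the standard struct module; a small sketch (the two literals are taken from the question):
import struct

def float32_bits(x):
    # Round x to float32 and return the raw bit pattern as an unsigned int.
    return struct.unpack(">I", struct.pack(">f", x))[0]

a = float32_bits(0.7575408585008059)
b = float32_bits(0.7575408816337585)
print(format(a, "032b"))  # 00111111010000011110111000110011
print(a == b)             # True: the same float32 bit pattern
If the extra digits matter, constructing the tensor with dtype=torch.float64 keeps the full double-precision value.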

Why is my TFRecord file so much bigger than csv?

I always thought that, being a binary format, TFRecord would consume less space than a human-readable csv. But when I compared them, I saw that this is not the case.
For example, here I create a num_rows x 10 matrix with num_rows labels and save it as a csv; I do the same by saving it to TFRecords:
import pandas as pd
import tensorflow as tf
from random import randint
num_rows = 1000000
df = pd.DataFrame([[randint(0,300) for r in xrange(10)] + [randint(0, 1)] for i in xrange(num_rows)])
df.to_csv("data/test.csv", index=False, header=False)
writer = tf.python_io.TFRecordWriter('data/test.bin')
for _, row in df.iterrows():
    arr = list(row)
    features, label = arr[:-1], arr[-1]
    example = tf.train.Example(features=tf.train.Features(feature={
        'features': tf.train.Feature(int64_list=tf.train.Int64List(value=features)),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    writer.write(example.SerializeToString())
writer.close()
Not only does it take far more time to create the binary file than the csv (2 s vs. 1 min 50 s), it also uses almost twice the space (38 MB vs. 67.7 MB).
Am I doing this correctly? How can I make the output file smaller (I saw TFRecordCompressionType), and what is the reason for the much bigger size?
Vijay's comment regarding int64 makes sense but still does not answer everything. An int64 consumes 8 bytes, and since I am storing the data in a csv, the string representation of an 8-digit integer is also 8 bytes long. So if I do df = pd.DataFrame([[randint(1000000,99999999) for r in xrange(10)] for i in xrange(num_rows)]) I still get a slightly bigger TFRecord file: now it is 90.9 MB vs. 89.1 MB. On top of that, the csv stores 1 byte for each comma between integers.
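As a rough sanity check on the csv side, the expected size can be computed directly (a sketch; it assumes ten 8-digit numbers, nine commas and a newline per row):
num_rows = 1000000
bytes_per_row = 10 * 8 + 9 + 1          # digits + commas + newline
print(num_rows * bytes_per_row / 1e6)   # 90.0 (MB), close to the observed 89.1 MB
(It comes out slightly high because randint(1000000, 99999999) also produces 7-digit numbers.)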
The fact that your file is bigger is due to the overhead TFRecords has for each row, in particular the fact that the feature names ('features', 'label') are stored in every record.
In your example, if you increase the number of features (from 10 to say 1000) you will observe that your tfrecord file is actually about half the size of the csv.
Also, the fact that integers are declared as 64-bit is ultimately irrelevant, because the serialization uses a "varint" encoding whose size depends on the value of the integer, not on its declared width. Take the example above and, instead of a random value between 0 and 300, use a constant value of 300: you will see that the file size increases.
Note that the number of bytes used in the encoding is not exactly the width of the integer itself: a varint carries 7 payload bits per byte, so a value of 255 still needs two bytes while a value of 127 takes one byte. Interesting to know: negative values come with a huge penalty, 10 bytes of storage no matter what.
The correspondence between values and storage requirements can be found in protobuf's function _SignedVarintSize.
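For intuition, here is a small sketch of how the size of an unsigned varint grows with the value (each byte carries 7 payload bits plus a continuation bit; negative int64s are sign-extended and always take 10 bytes):
def varint_len(value):
    # Bytes protobuf needs to encode a non-negative integer as a varint.
    n = 1
    while value > 0x7F:
        value >>= 7
        n += 1
    return n

print(varint_len(127))  # 1
print(varint_len(255))  # 2
print(varint_len(300))  # 2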
This may be because your generated numbers are in the range 0~300, so they need at most 3 bytes each to store in the csv, but when they are stored in tfrecords as int64 each number needs at least 8 bytes (not very sure). If your generated numbers were in the range 0 to 2^64-1, I think the tfrecords file would be much smaller than the csv file.

How to change four bytes into float32 in tensorflow?

I use tf.FixedLengthRecordReader to read a file and get a list of uint8 tensors, and I want to transform the first four bytes into one float32.
For example, if the first four bytes are 0xAA, 0xBB, 0xCC, 0xDD, I want to get 0xAABBCCDD and reinterpret it as a float32. We know that in C++ this is easy: just reinterpret the pointer, e.g. *(float*)address. But how can I do this in TensorFlow?
You need to use tf.decode_raw, which has an out_type argument that specifies the type to decode to (and a little_endian argument in case the byte order needs flipping), e.g.
record_bytes = tf.decode_raw(value, tf.float32)
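A minimal round-trip sketch (TF1-era API, matching the question; in TF2 the same op is tf.io.decode_raw):
import numpy as np
import tensorflow as tf

# Hypothetical record: two float32 values serialized to raw bytes.
raw = np.array([1.5, -2.25], dtype=np.float32).tobytes()
decoded = tf.decode_raw(raw, tf.float32, little_endian=True)
with tf.Session() as sess:
    print(sess.run(decoded))  # [ 1.5  -2.25]
Note that reading the bytes 0xAA 0xBB 0xCC 0xDD as the single word 0xAABBCCDD is a big-endian interpretation, so depending on how the file was written you may need little_endian=False.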

Mutable array, Objective-c, or variable array, c. Any difference in performance?

I have a multidimensional array (3D matrix) of unknown size, where each element in this matrix is of type short int.
The size of the matrix can be approximated to be around 10 x 10 x 1,000,000.
As I see it I have two options: a mutable array (Objective-C) or a plain C array.
Is there any difference in reading from and writing to these arrays?
How large will these files become when I save to file?
Any advice would be gratefully accepted.
Provided you know the size of the array at the point of creation, i.e. you don't need to dynamically change the bounds, a C array of short int with these dimensions will win easily, for reasons such as direct indexing and not having to encode values as objects.
If you write the array to a file in binary then it will just be the number of elements multiplied by sizeof(short int), without any overhead: 10 * 10 * 1,000,000 * 2 bytes = 200 MB. If you also need to store the dimensions, that adds 3 * sizeof(int), i.e. 12 or 24 bytes.
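To illustrate the raw-binary size claim, a sketch using NumPy as a stand-in for the C array (dimensions scaled down; the file name is hypothetical):
import os
import numpy as np

a = np.zeros((10, 10, 1000), dtype=np.int16)  # 16-bit ints, like short int
a.tofile("matrix.bin")                        # raw binary, no overhead
print(os.path.getsize("matrix.bin"))          # 200000 = 10 * 10 * 1000 * 2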
The mutable array will be slower (albeit not by much) since it's built on a C array. How large will the file be when you save this array?
It will take more than 10 x 10 x 1,000,000 bytes because you'll have to encode it in a way that lets you recall the matrix structure. This part is really up to you. For a 3D array you'll have to use a special character or format to denote the third dimension. It depends on how you want to do this, but as text it will take one byte for every digit of every number, plus one character for the space between elements in the same row, plus one newline per row of the 2nd dimension, plus one further separator character per 2D slice.
It might be easier to stick each row into its own file and then stack the columns below it like you normally would. Then, in a new file, I would start putting the 3D elements such that each line lines up with the column number of the 2nd dimension. That's just me though; it's up to you.