Why is 32768 used as a constant to normalize the wav data in VGGish? - tensorflow

I'm trying to follow along with what the code is doing for VGGish and I came across a piece that I don't really understand. In vggish_input.py there is this:
def wavfile_to_examples(wav_file):
  """Convenience wrapper around waveform_to_examples() for a common WAV format.
  Args:
    wav_file: String path to a file, or a file-like object. The file
    is assumed to contain WAV audio data with signed 16-bit PCM samples.
  Returns:
    See waveform_to_examples.
  """
  wav_data, sr = wav_read(wav_file)
  assert wav_data.dtype == np.int16, 'Bad sample type: %r' % wav_data.dtype
  samples = wav_data / 32768.0  # Convert to [-1.0, +1.0]
  return waveform_to_examples(samples, sr)
Where does the constant of 32768 come from and how does dividing that convert the data to samples?
I found this for converting to the range -1 to +1, but I'm not sure how to connect that with 32768.

32768 is 2^15. int16 has a range of -32768 to +32767. If you have int16 as input and divide it by 2^15, you get a number between -1 and +1: -32768 maps to exactly -1.0, and +32767 maps to 32767/32768, which is just under +1.
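As a quick check of that arithmetic, here is a minimal sketch using a hand-made int16 array instead of a real WAV file (the sample values are made up for illustration, not taken from VGGish):

```python
import numpy as np

# Hypothetical int16 samples covering the extremes of the type.
wav_data = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)

# Same operation as in wavfile_to_examples: divide by 2**15.
samples = wav_data / 32768.0

print(samples.dtype)  # float64: dividing by a Python float promotes the array
print(samples.min(), samples.max())
```

The smallest int16 value lands exactly on -1.0, while the largest lands slightly below +1.0, because the int16 range is not symmetric.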


Convert bytestring to array of uint8 Python

I was wondering what the fastest way of converting a bytestring to an array of uint8 would be in the following code? I use hashlib for SHA-256
x = hashlib.sha256(str(word).encode("ascii")).digest()
Now I need to convert x to an array of uint8. I currently do this by list comprehension, but this seems slow.
h_new = [int(y) for y in x]
num_hashes[idx_perm, :] = h_new
Does anyone have suggestions for a faster way of conversion?
You can use the function frombuffer of Numpy. Since the size of a sha256 is always 32 bytes and the output type is well known, the function can convert the input buffer x very quickly.
num_hashes[idx_perm, :] = np.frombuffer(x, np.uint8, 32)
This takes about 0.7 us per call on my machine while the initial code takes about 3.35 us. Thus, this version is about 4.8 times faster. Note that this is also faster than converting the result to a list (due to the many int objects to be allocated and reference-counted) and faster than the Numpy fromiter function (since the iterable interface introduces an additional overhead).
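For context, a self-contained sketch of that approach might look like the following. The word list and the shape of num_hashes are assumptions made for illustration; only the frombuffer call comes from the answer.

```python
import hashlib

import numpy as np

words = ["alpha", "beta", "gamma"]  # hypothetical inputs
# One row of 32 uint8 values per word (a SHA-256 digest is 32 bytes).
num_hashes = np.empty((len(words), 32), dtype=np.uint8)

for idx_perm, word in enumerate(words):
    x = hashlib.sha256(str(word).encode("ascii")).digest()
    # frombuffer reinterprets the digest bytes without building Python ints.
    num_hashes[idx_perm, :] = np.frombuffer(x, dtype=np.uint8, count=32)
```

The array returned by frombuffer is a read-only view of the bytes object; assigning it into a row of num_hashes copies the values, so the row stays writable.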

CRC of input data shorter than poly width

I'm in the process of writing a paper during my studies on implementing CRC in Excel with VBA.
I've created a fairly straightforward, modular algorithm that uses Ross's parametrized model.
It works flawlessly for any length polynomial and any combination of parameters except for one: when the length of the input data is shorter than the width of the polynomial and an initial value is chosen ("INIT") that has any bits set which are "past" the length of the input data.
Input Data: 0x4C
Poly: 0x1021
Xorout: 0x0000
Refin: False
Refout: False
If I choose no INIT or any INIT like 0x##00, I get the same checksum as any of the online CRC generators. If any bit of the last two hex characters is set - like 0x0001 - my result is invalid.
I believe the question boils down to "How is the register initialized if only one byte of input data is present for a two byte INIT parameter?"
It turns out I was misled by (or I very well may have misinterpreted) the explanation of how to use the INIT parameter on the sunshine2k website.
The INIT value must not be XORed with the first n input bytes per se (n being the width of the register / cropped poly / checksum in bytes), but must only be XORed in after the zero bits (as many as the register is wide) have been appended to the input data.
This distinction does not matter when the input data is n bytes or longer, but it does matter when the input data is shorter than that.
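The difference can be sketched in Python for the parameters from the question (16-bit poly 0x1021, no reflection, XOROUT 0). All function names here are made up for illustration, and crc16_init_on_input_only is only a guess at what the faulty variant did, not the asker's actual VBA code.

```python
def _divide(msg, poly):
    """Bit-by-bit division of msg by a 16-bit poly (implicit x^16 term)."""
    reg = 0
    for byte in msg:
        for bit in range(7, -1, -1):
            carry = reg & 0x8000
            reg = ((reg << 1) & 0xFFFF) | ((byte >> bit) & 1)
            if carry:
                reg ^= poly
    return reg


def crc16_augmented(data, init, poly=0x1021):
    """Append 16 zero bits first, then XOR INIT into the first 16 bits."""
    msg = bytearray(data) + bytearray(2)
    msg[0] ^= init >> 8
    msg[1] ^= init & 0xFF
    return _divide(msg, poly)


def crc16_init_on_input_only(data, init, poly=0x1021):
    """Assumed faulty variant: INIT is XORed into the raw input only, so
    INIT bytes beyond the end of the input are silently dropped."""
    msg = bytearray(data)
    init_bytes = init.to_bytes(2, "big")
    for i in range(min(len(msg), 2)):
        msg[i] ^= init_bytes[i]
    return _divide(msg + bytearray(2), poly)


def crc16_register(data, init, poly=0x1021):
    """Common register-based form: INIT is preloaded into the register."""
    reg = init
    for byte in data:
        reg ^= byte << 8
        for _ in range(8):
            reg = ((reg << 1) ^ poly if reg & 0x8000 else reg << 1) & 0xFFFF
    return reg


data = b"\x4c"
# INIT = 0x1200: the low byte is zero, so all three variants agree.
print(crc16_register(data, 0x1200) == crc16_init_on_input_only(data, 0x1200))
# INIT = 0x0001: only the augmented variant still matches the register form.
print(crc16_register(data, 0x0001) == crc16_augmented(data, 0x0001))
print(crc16_register(data, 0x0001) == crc16_init_on_input_only(data, 0x0001))
```

With two or more input bytes, both INIT bytes have something to be XORed with, which is why the problem only shows up for inputs shorter than the register.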

Why is my TFRecord file so much bigger than csv?

I always thought that being a binary format, TFRecord will consume less space than a human-readable csv. But when I tried to compare them, I saw that it is not the case.
For example here I create a num_rows X 10 matrix with num_rows labels and save it as a csv. I do the same by saving it to TFRecords:
import pandas as pd
import tensorflow as tf
from random import randint
num_rows = 1000000
# num_rows rows of 10 random features in [0, 300] plus one binary label
df = pd.DataFrame([[randint(0,300) for r in xrange(10)] + [randint(0, 1)] for i in xrange(num_rows)])
df.to_csv("data/test.csv", index=False, header=False)
writer = tf.python_io.TFRecordWriter('data/test.bin')
for _, row in df.iterrows():
    arr = list(row)
    features, label = arr[:-1], arr[-1]
    example = tf.train.Example(features=tf.train.Features(feature={
        'features' : tf.train.Feature(int64_list=tf.train.Int64List(value=features)),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    writer.write(example.SerializeToString())
writer.close()
Not only does it take far more time to create the binary file than the csv (1 min 50 sec vs. 2 sec), but it also uses almost 2 times more space (67.7Mb vs. 38Mb).
Am I doing this correctly? How can I make the output file smaller? I saw TFRecordCompressionType, but is there anything else I can do? And what is the reason for the much bigger size?
Vijay's comment regarding int64 makes sense but still does not answer everything. Int64 consumes 8 bytes, so for a fair comparison the string representation of each integer in the csv should be of length 8. So if I do this df = pd.DataFrame([[randint(1000000,99999999) for r in xrange(10)] for i in xrange(num_rows)]) I still get a slightly bigger size. Now it is 90.9Mb VS 89.1Mb. In addition to this, csv stores 1 byte for each comma between integers.
The fact that your file is bigger is due to the overhead that TFRecords has for each row, in particular the fact that the label names are stored every time.
In your example, if you increase the number of features (from 10 to say 1000) you will observe that your tfrecord file is actually about half the size of the csv.
Also, the fact that integers are stored on 64 bits is ultimately irrelevant, because the serialization uses a "varint" encoding that depends on the value of the integer, not on its initial encoding. Take your example above, and instead of a random value between 0 and 300, use a constant value of 300: you will see that your file size increases.
Note that the number of bytes used for the encoding is not exactly that of the integer itself. So a value of 255 will still need two bytes, but a value of 127 will take one byte. Interesting to know, negative values come with a huge penalty: 10 bytes for storage no matter what.
The correspondence between values and storage requirements is found in protobuf's function _SignedVarintSize.
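To make those byte counts concrete, here is a small sketch of the size rule for an int64 varint: 7 payload bits per byte, with negative values treated as 64-bit two's complement. The function name varint_size is made up for illustration; it is not protobuf's own implementation.

```python
def varint_size(value):
    """Bytes needed to store `value` as a protobuf int64 varint."""
    if value < 0:
        value += 1 << 64  # negatives are encoded as 64-bit two's complement
    size = 1
    while value >= 0x80:  # each byte carries 7 bits of payload
        value >>= 7
        size += 1
    return size


for v in (0, 127, 128, 255, 300, 16383, 16384, -1):
    print(v, varint_size(v))
```

Under this rule, any value from 128 to 16383 takes two bytes, so most of the values in the 0 to 300 range of the question cost two bytes each.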
This may be because your generated numbers are in the range 0~300, so each needs at most 3 bytes as text, but when they are stored in tfrecords as int64, each number needs at least 8 bytes (not very sure). If your generated numbers were in the range 0~2^64-1, I think the tfrecords file would be much smaller than the csv file.

Tensorflow: Reading binary files of varying length?

I'm trying to read binary files that contain information about a 3D scene, stored as 19 floats followed by a varying number of uint32 values. Since the scene is stored in run-length encoding (RLE), every binary file has a different size.
Is it possible to read that kind of data using tensorflow?
The equivalent in Matlab looks like this:
filename = 'myFile.bin';
fid = fopen(filename,'r');
vox_origin = fread(fid,3,'float');
camera_poses = fread(fid,16,'float');
labels = fread(fid,'uint32'); % Labels are saved in RLE
value = labels(1:2:end);
value_iter = labels(2:2:end);
I don't know of any mechanism for this provided by Tensorflow. This is very basic functionality that you should be able to implement easily yourself, or you can use something like this: https://gist.github.com/nvictus/66627b580c13068589957d6ab0919e66. Tensorflow works with numpy arrays.
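As a rough sketch of the numpy route, the Matlab snippet from the question could be mirrored as follows. The function name read_scene is made up, and the code assumes the file uses the machine's native byte order with 32-bit floats, as fread's 'float' implies.

```python
import numpy as np


def read_scene(path):
    """Read 3 + 16 float32 values, then the remaining uint32 RLE pairs."""
    with open(path, "rb") as f:
        vox_origin = np.fromfile(f, dtype=np.float32, count=3)
        camera_poses = np.fromfile(f, dtype=np.float32, count=16)
        labels = np.fromfile(f, dtype=np.uint32)  # everything up to EOF
    value = labels[0::2]       # Matlab labels(1:2:end)
    value_iter = labels[1::2]  # Matlab labels(2:2:end)
    return vox_origin, camera_poses, value, value_iter
```

If value_iter holds the repeat count for each entry of value (an assumption based on the RLE comment in the question), np.repeat(value, value_iter) would expand the labels to full length. The resulting arrays can then be handed to Tensorflow like any other numpy data.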

Read in 4-byte words from binary file in Julia

I have a simple binary file that contains 32-bit floats adjacent to each other.
Using Julia, I would like to read each number (i.e. each 32-bit word) and put them sequentially into an array of Float32 values.
I've tried a few different things through looking at the documentation, but all have yielded impossible values (I am using a binary file with known values as dummy input). It appears that:
Julia is reading the binary file one-byte at a time.
Julia is putting each byte into a Uint8 array.
For example, readbytes(f, 4) gives a 4-element array of unsigned 8-bit integers. read(f, Float32, DIM) also gives strange values.
Anyone have any idea how I should proceed?
I'm not sure of the best way of reading it in as Float32 directly, but given an array of 4*n Uint8s, I'd turn it into an array of n Float32s using reinterpret:
raw = rand(Uint8, 4*10) # i.e. a vector of Uint8 aka bytes
floats = reinterpret(Float32, raw) # now a vector of 10 Float32s
With output:
julia> raw = rand(Uint8, 4*2)
8-element Array{Uint8,1}:
julia> floats = reinterpret(Float32, raw)
2-element Array{Float32,1}:
(EDIT 2020: Outdated, see newest answer.) I found the issue. The correct way of importing binary data in single precision floating point format is read(f, Float32, NUM_VALS), where f is the file stream, Float32 is the data type, and NUM_VALS is the number of words (values or data points) in the binary data file.
It turns out that every time you call read(f, [...]) the data pointer advances to the next item in the binary file.
This makes it easy to read the data in one value at a time:
f = open("my_file.bin")
first_item = read(f, Float32)
second_item = read(f, Float32)
# etc ...
However, I wanted to load in all the data in one line of code. As I was debugging, I had used read() on the same file pointer several times without re-declaring the file pointer. As a result, when I experimented with the correct operation, namely read(f, Float32, NUM_VALS), I got an unexpected value.
The Julia language has changed a lot in the 5 years since. read() no longer has an API to specify the type and length simultaneously, and reinterpret() now creates a view of the byte array instead of a new array of the desired type. It seems that the best way to do this now is to pre-allocate the desired array and fill it with read!:
data = Array{Float32, 1}(undef, 128)
read!(io, data)
This fills data with desired float numbers.