How to properly select wanted data and discard unwanted data from binary files - numpy

I'm working on a project where I'm converting old 16-bit binary data files into 32-bit data files for later use.
The straight conversion is no issue, but then I noticed I needed to remove the header data from the data files.
The data consists of 8206-byte frames. Each frame has a 14-byte header followed by a data block of 4096 16-bit samples (8192 bytes). Depending on the file, there are either 70313 or 70312 frames in each file.
I couldn't find a neat way to find all the headers, remove them, and save only the data blocks to a new file.
So here's what I did:
results_array = np.empty([0,1], np.uint16)
for filename in file_list:
    num_files += 1
    # read data from file as 16-bit values and save it as 32-bit
    data16 = np.fromfile(data_dir + "/" + filename, dtype=np.uint16)
    filesize =
    if filesize == 288494239:
        total_frames = 70313
        #total_frames = 3000
    else:
        total_frames = 70312
        #total_frames = 3000
    frame_count = 0
    chunksize = 4103
    with open(data_dir + "/" + filename, 'rb') as file:
        while frame_count < total_frames:
            frame_count += 1
            read_data =
            if not read_data:
                break
            data = read_data[7:4103]
            results_array = np.append(results_array,data)
            converted = np.frombuffer(results_array, np.uint16)
            print(str(frame_count) + "/" + str(total_frames))
    converted = np.frombuffer(results_array, np.uint16)
    data32 = converted.astype(dtype=np.uint32) * 256
It works (at least I think it does), but it is very, very slow.
So the question is: is there a way to do the above much faster, maybe with some built-in function in NumPy or something else?
Thanks in advance.

Finally managed to crack this one, and it is 100x faster than the initial approach :)
data = np.fromfile(read_dir + "/" + file, dtype=np.int16)
frames = len(data) // 4103 # frame length in 16-bit words
# Reshape into array such that each row is a frame
data = np.reshape(data[:frames * 4103], (frames, 4103))
# Remove headers and convert to int32
data = data[:, 7:].astype(np.int32) * 256
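The reshape trick above can be tried without the real files. The following is a minimal sketch on synthetic data: the helper name strip_headers and the constants are made up for illustration, and the frame layout (7 header words followed by 4096 data words) is taken from the numbers in the question.

```python
import numpy as np

FRAME_WORDS = 4103   # 8206 bytes per frame / 2 bytes per 16-bit word
HEADER_WORDS = 7     # 14-byte header / 2 bytes per word


def strip_headers(words, frame_words=FRAME_WORDS, header_words=HEADER_WORDS):
    """Drop the header words of every frame and widen the payload to int32."""
    frames = len(words) // frame_words
    # Ignore any trailing partial frame, then view each frame as one row.
    table = words[:frames * frame_words].reshape(frames, frame_words)
    # Slice off the header columns; astype creates a new int32 array.
    return table[:, header_words:].astype(np.int32) * 256


# Synthetic input: 3 frames whose header words are -1 and payload words are 1.
frame = np.concatenate([
    np.full(HEADER_WORDS, -1, dtype=np.int16),
    np.ones(FRAME_WORDS - HEADER_WORDS, dtype=np.int16),
])
raw = np.tile(frame, 3)

payload = strip_headers(raw)
print(payload.shape, payload.dtype)
```

Because the headers are removed with a single column slice instead of one np.append per frame, no intermediate array is rebuilt inside a loop, which is presumably where the speed-up comes from. The result could then be written out with payload.tofile(...) if a flat 32-bit file is wanted.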


efficient way to join 65,000 .csv files

I have about 65,000 .csv files that I need to work with in Julia.
The goal is to perform basic statistics on the data set.
I have tried two ways of joining all the data sets:
#1 - set a common index and leftjoin(), then perform statistics row-wise
#2 - vcat() the data frames on top of each other (vertically stacked), then use group by
Either way, the final data frames are very large and become slow to process.
Is there an efficient way of doing this?
I thought of doing either #1 or #2 but splitting the joining operations into thirds: say, after every 20,000 joins save to .csv, operate in chunks, and then at the end join all 3 in one last operation.
I'm not sure how to replicate making 65k .csv files, but basically below I loop through the files in the directory, load each csv, then vcat() it onto one df. The question is more about whether there is a better way to manage the size of the operation, since vcat() grows the data frame on every iteration. Maybe ahead of time I can cycle through the .csv files, obtain the dimensions of each one, initialize the full data frame to the final output size, and then cycle through each .csv row by row and populate the initialized df.
using CSV
using DataFrames
# read all files in directory
csv_dir_tmax = cd(readdir, "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax")
# initialize outputs
tmax_all = DataFrame(Date = [], TMAX = [])
for c = 1:length(csv_dir_tmax)
    print("Starting csv file ", csv_dir_tmax[c]," - Iteration ",c,"\n")
    if c <= length(csv_dir_tmax)
        csv_tmax = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax/", csv_dir_tmax[c]]), DataFrame, header=true)
        tmax_all = vcat(tmax_all, csv_tmax)
    end
end
The following approach should be relatively efficient (assuming that data fits into memory):
tmax_all = reduce(vcat, [CSV.read("YOUR_DIR$x", DataFrame) for x in csv_dir_tmax])
Initializing the final output to its total size (the size vcat() would eventually build) and then populating it element-wise seems to work much better:
# get the dimensions of each .csv file
tmax_all_total_output_size = fill(0, size(csv_dir_tmax,1))
tmin_all_total_output_size = fill(0, size(csv_dir_tmin,1))
tavg_all_total_output_size = fill(0, size(csv_dir_tavg,1))
tmax_dim = Int64[]
tmin_dim = Int64[]
tavg_dim = Int64[]
for c = 1:length(csv_dir_tmin) # 47484 - last point
    print("Starting csv file ", csv_dir_tmin[c]," - Iteration ",c,"\n")
    if c <= length(csv_dir_tmax)
        tmax_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax/", csv_dir_tmax[c]]), DataFrame, header=true)
        global tmax_dim = size(tmax_csv,1)
        tmax_all_total_output_size[c] = tmax_dim
    end
    if c <= length(csv_dir_tmin)
        tmin_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmin/", csv_dir_tmin[c]]), DataFrame, header=true)
        global tmin_dim = size(tmin_csv,1)
        tmin_all_total_output_size[c] = tmin_dim
    end
    if c <= length(csv_dir_tavg)
        tavg_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tavg/", csv_dir_tavg[c]]), DataFrame, header=true)
        global tavg_dim = size(tavg_csv,1)
        tavg_all_total_output_size[c] = tavg_dim
    end
end
# sum total dimension of all .csv files
tmax_sum = sum(tmax_all_total_output_size)
tmin_sum = sum(tmin_all_total_output_size)
tavg_sum = sum(tavg_all_total_output_size)
# initialize final output to total final dimension
tmax_date_array = fill(Date("13000101", "yyyymmdd"),tmax_sum)
tmax_array = zeros(tmax_sum)
tmin_date_array = fill(Date("13000101", "yyyymmdd"),tmin_sum)
tmin_array = zeros(tmin_sum)
tavg_date_array = fill(Date("13000101", "yyyymmdd"),tavg_sum)
tavg_array = zeros(tavg_sum)
# initialize outputs
tmax_all = DataFrame(Date = tmax_date_array, TMAX = tmax_array)
tmin_all = DataFrame(Date = tmin_date_array, TMIN = tmin_array)
tavg_all = DataFrame(Date = tavg_date_array, TAVG = tavg_array)
tmax_count = 0
tmin_count = 0
tavg_count = 0
Then begin filling the initialized df.
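The count-first, fill-second idea is independent of the language. Below is a rough sketch of the two passes, written in Python with only the standard library; the Julia version would follow the same steps. The function names, file names, and the two-column Date/TMAX layout are assumptions made for the illustration.

```python
import csv
import os
import tempfile


def count_rows(path):
    """Pass 1 helper: number of data rows in a CSV file (header excluded)."""
    with open(path, newline="") as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)


def load_preallocated(paths):
    """Two passes: size the output columns once, then fill them by position."""
    total = sum(count_rows(p) for p in paths)
    dates = [None] * total      # allocated once at the final length
    values = [0.0] * total
    pos = 0
    for p in paths:
        with open(p, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            for date, value in reader:
                dates[pos] = date
                values[pos] = float(value)
                pos += 1
    return dates, values


# Tiny demonstration with two hypothetical station files.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, rows in [("a.csv", [("20200101", "1.5"), ("20200102", "2.5")]),
                   ("b.csv", [("20200101", "3.0")])]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Date", "TMAX"])
        writer.writerows(rows)
    paths.append(path)

dates, values = load_preallocated(paths)
print(len(values), sum(values) / len(values))
```

The trade-off is that every file is read twice, in exchange for never copying the growing result.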

Getting an IndexError when trying to run pcolormesh from a pandas DataFrame

I'm trying to generate a pcolormesh plot from a large dataset, where the columns are frequencies in hertz, the rows are individual files, and the body is an array of magnitude values per file for each frequency. My DataFrame gets constructed correctly with the correct labels, but when I pass it to pcolormesh, it throws the exception "arrays used as indices must be of integer (or boolean) type". The code I am attaching reflects a conversion of the frequency array to an integer array using .astype(int). Note that if I convert PSD_array (the magnitudes) to integers, it DOES work (but isn't helpful); otherwise it fails. I also tried generating other pcolormesh plots using decimals as the body of the DataFrame, and they worked fine.
Ideas would be lovely; I'll keep working on it.
Code (note: specific file paths redacted):
import os
import numpy as np
import pandas as pd

def file_List():
    files = [file for file in os.listdir('###')]
    file_list = []
    for file in files:
        file_list += [file]
    return file_list

###Using Fast Fourier Transform, take in file list generated from
###file_List program and perform FFT on each file.
###We read the files by adding them to the file directory. Could improve
###by making an overarching program that runs everything with an input
###file directory.
def FFT():
    '''
    runs through FFT for the files in a file list as determined by the
    file_List() program.
    '''
    file_list = file_List() #runs file_List() program and saves
                            #the list of files as a variable.
    df = pd.DataFrame()
    freq_array = np.empty((0,204800))
    PSD_array = np.empty((0,204800))
    count = 0
    for file in file_list:
        while count < 10:
            file_read = pd.read_csv('###'+file,skiprows=22,sep = '\t')
            df = pd.DataFrame(file_read, columns = ['X_Value','Acceleration'])
            q = df['Acceleration'] #data set input
            n = len(df['Acceleration']) #number of data points
            dt = 2/len(df['X_Value']) #
            f_hat = np.fft.fft(q,n) #Runs FFT
            PSD = f_hat * np.conj(f_hat) / n #Power Spectral Density
            freq = (1/(dt*n)) * np.arange(n) #Creates x axis of frequencies
            freq_array = np.append(freq_array,np.array([freq]),axis=0)
            PSD_array = np.append(PSD_array,np.array([PSD]),axis=0)
            count += 1
    #trans_freq = np.transpose(freq_array)
    #trans_PSD = np.transpose(PSD_array)
    int_freq = freq_array.astype(int)
    #PSD_int = PSD_array.astype(int)
    PSD_df = pd.DataFrame(PSD_array, index = np.arange(len(PSD_array)), columns = int_freq[0])

def heatmap(df):
    '''
    Constructs a heatmap given an input dataframe
    '''
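One thing that may be worth checking (this is a guess, not something confirmed in the thread): f_hat * np.conj(f_hat) keeps a complex dtype even though its imaginary part is zero, and .astype(int) discards that imaginary part, which would be consistent with the integer conversion "working". A minimal sketch with synthetic data standing in for the 'Acceleration' column:

```python
import numpy as np

# Hypothetical stand-in for one file's 'Acceleration' column.
rng = np.random.default_rng(0)
q = rng.standard_normal(1024)
n = len(q)

f_hat = np.fft.fft(q, n)
psd_complex = f_hat * np.conj(f_hat) / n   # same formula as in FFT() above
psd_real = np.abs(f_hat) ** 2 / n          # same magnitudes, real-valued

print(psd_complex.dtype)   # complex128
print(psd_real.dtype)      # float64
```

If the dtype does turn out to be the problem, building PSD_array from np.abs(f_hat) ** 2 / n (or taking .real of the product) keeps the decimal values while giving the plotting call a real-valued array.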

Can't convert 'bytes' object to str implicitly for DCM to raw file

I'm learning how to convert a DCM file to a raw file. I got the code from GitHub:
It raises the error "Can't convert 'bytes' object to str implicitly" on the line
"allInOne += dataset.PixelData"
I tried using encode("utf-8"), but that makes allInOne empty.
By the way, is there any code to generate the .mhd file corresponding to the .raw file?
import dicom
import os
import numpy
import sys
dicomPath = "C:/DataLuna16pen/dcmdata/"
lstFilesDCM = [] # create an empty list
for dirName, subdirList, fileList in os.walk(dicomPath):
    allInOne = ""
    for filename in fileList:
        if "".join(filename).endswith((".dcm", ".DCM")):
            path = dicomPath + "".join(filename)
            dataset = dicom.read_file(path)
            for n,val in enumerate(dataset.pixel_array.flat):
                dataset.pixel_array.flat[n] = val / 60
                if val < 0:
                    dataset.pixel_array.flat[n] = 0
            dataset.PixelData = numpy.uint8(dataset.pixel_array).tostring()
            allInOne += dataset.PixelData
            print ("slice " + "".join(filename) + " done ",end=" ")
            print (i)
    newFile = open("./all_in_one.raw", "wb")
    print ("RAW file generated")
There are several things:
PyDicom still doesn't read compressed DICOMs (lossless JPEG) properly. Check the Transfer Syntax of the files to see whether this is the case. As a workaround you can decompress them first with the DCMTK tool dcmdjpeg.
You should not convert the byte array into a string (np.array.tostring in fact returns an array of bytes), so accumulate the slices in a bytes object instead of a str.
For writing mha files, take a look at MedPy. You can also use ITK directly: there is a Python wrapper, and SimpleITK is a kind of lightweight variant of ITK.
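The second point can be shown without any DICOM files. This is a minimal sketch in which the slices list is a made-up stand-in for the per-slice pixel buffers; the only claim is about how Python 3 treats str and bytes.

```python
# Hypothetical stand-ins for the per-slice pixel buffers; in the question
# each one would come from numpy.uint8(...).tostring(), which returns bytes.
slices = [bytes([0, 1, 2, 3]), bytes([4, 5, 6, 7])]

all_in_one = bytearray()       # a bytes-like accumulator instead of ""
for pixel_data in slices:
    all_in_one += pixel_data   # bytes-like += bytes works

try:
    "" + slices[0]             # what starting from allInOne = "" leads to
except TypeError as exc:
    str_error = str(exc)       # str and bytes cannot be concatenated

print(len(all_in_one))
```

The accumulated buffer can then be written unchanged to a file opened in "wb" mode, with no encode or decode step.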

Retrieve indices for rows of a PyTables table matching a condition using `Table.where()`

I need the indices (as a NumPy array) of the rows matching a given condition in a table (with billions of rows). This is the line I currently use in my code, which works but is quite ugly:
indices = np.array([row.nrow for row in the_table.where("foo == 42")])
It also takes half a minute, and I'm sure that the list creation is one of the reasons why.
I could not find an elegant solution yet and I'm still struggling with the PyTables docs, so does anybody know a magical way to do this more beautifully and maybe also a bit faster? Maybe there is a special query keyword I am missing, since I have the feeling that PyTables should be able to return the matched row indices as a NumPy array.
tables.Table.get_where_list() gives the indices of the rows matching a given condition.
I read the PyTables source: where() is implemented in Cython, but it still doesn't seem fast enough. Here is a more complex method that can speed things up.
Create some data first:
from tables import *
import numpy as np
class Particle(IsDescription):
    name = StringCol(16) # 16-character String
    idnumber = Int64Col() # Signed 64-bit integer
    ADCcount = UInt16Col() # Unsigned short integer
    TDCcount = UInt8Col() # unsigned byte
    grid_i = Int32Col() # 32-bit integer
    grid_j = Int32Col() # 32-bit integer
    pressure = Float32Col() # float (single-precision)
    energy = Float64Col() # double (double-precision)
h5file = open_file("tutorial1.h5", mode = "w", title = "Test file")
group = h5file.create_group("/", 'detector', 'Detector information')
table = h5file.create_table(group, 'readout', Particle, "Readout example")
particle = table.row
for i in range(1001000):
    particle['name'] = 'Particle: %6d' % (i)
    particle['TDCcount'] = i % 256
    particle['ADCcount'] = (i * 256) % (1 << 16)
    particle['grid_i'] = i
    particle['grid_j'] = 10 - i
    particle['pressure'] = float(i*i)
    particle['energy'] = float(particle['pressure'] ** 4)
    particle['idnumber'] = i * (2 ** 34)
    # Insert a new particle record
    particle.append()
table.flush()
h5file.close()
Read the column in chunks, append the matching indices to a list, and finally concatenate the list into one array. You can change the chunk size according to your memory size:
h5file = open_file("tutorial1.h5")
table = h5file.get_node("/detector/readout")
size = 10000
col = "energy"
buf = np.zeros(size, dtype=table.coldtypes[col])
res = []
for start in range(0, table.nrows, size):
    length = min(size, table.nrows - start)
    data = table.read(start, start + length, field=col, out=buf[:length])
    tmp = np.where(data > 10000)[0]
    tmp += start
    res.append(tmp)
res = np.concatenate(res)
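The chunk-and-offset logic can be checked on its own, without an HDF5 file. In this minimal sketch a plain NumPy array stands in for the table column, and the helper name chunked_where is hypothetical.

```python
import numpy as np


def chunked_where(column, predicate, size=10000):
    """Collect the indices where predicate(chunk) is true, one chunk at a time."""
    res = []
    for start in range(0, len(column), size):
        chunk = column[start:start + size]
        hits = np.flatnonzero(predicate(chunk))  # positions within the chunk
        res.append(hits + start)                 # shift to whole-column positions
    return np.concatenate(res) if res else np.empty(0, dtype=np.intp)


# Hypothetical in-memory column standing in for the values read from the table.
energy = np.arange(25000, dtype=np.float64) % 7
indices = chunked_where(energy, lambda x: x > 5, size=10000)
print(indices[:3])
```

The result matches a single np.flatnonzero(energy > 5) over the whole column; the chunked form only matters when the column does not fit in memory at once.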

How to format input data for textsum data_convert_example

I was hoping someone might be able to see where I am failing here. I have scraped some data from BuzzFeed and am now trying to format a text file that I can then feed into the text_to_data formatter of data_convert_example.
I thought I had the answer a couple of times, but I keep running into a brick wall when I process this as binary and then try to train against the data.
What I did was run binary_to_text on the toy dataset, open the resulting file in Notepad++ under Windows with all characters shown, and match what I believed to be the format.
I apologize for the long function below, but I really am unsure where the issue might be and figured this was the best way to provide enough info. Does anyone have any ideas or recommendations?
def processPath(self, toPath):
fout = open(os.path.join(toPath, '{}-{}'.format(self.baseName, self.fileNdx)), 'a+')
for path, dirs, files in os.walk(self.fromPath):
for fn in files:
fullpath = os.path.join(path, fn)
if os.path.isfile(fullpath):
#with open(fullpath, "rb") as f:
with codecs.open(fullpath, "rb", 'ascii', "ignore") as f:
finalRes = ""
content = f.readlines()
sentences = sent_tokenize((content[1]).encode('ascii', "ignore").strip('\n'))
for sent in sentences:
textSumFmt = self.textsumFmt
finalRes = textSumFmt["artPref"] + textSumFmt["sentPref"] + sent.replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]
finalRes += (('\t' + textSumFmt["absPref"] + textSumFmt["sentPref"] + (content[0]).strip('\n').replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]) + '\t' +'publisher=BUZZ' + os.linesep)
if self.lineNdx != 0 and self.lineNdx % self.lines == 0:
fout = open(os.path.join(toPath, '{}-{}'.format(self.baseName, self.fileNdx)), 'a+')
fout.write( ("{}").format( finalRes.encode('utf-8', "ignore") ) )
except RuntimeError as e:
print "Runtime Error: {0} : {1}".format(e.errno, e.strerror)
After further analysis, it seems that the problem lies more with the source data and the way it is constructed than with the conversion script itself. I'm closing this as the heading is not in line with the source of the issue.
I found the source of my problem: I had a space between "Article" and the equals sign. After removing it I was able to train successfully.
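A stray space like that is easy to miss by eye, so a small check over the generated text file can help. The sketch below assumes the tab-separated key=value layout that processPath builds; the helper name, the lowercase key names, and the <d> <p> <s> markers in the sample line are illustrative assumptions, not taken from the textsum sources.

```python
import re


def find_bad_fields(line):
    """Return the tab-separated fields whose key is not directly followed by '='."""
    bad = []
    for field in line.rstrip("\r\n").split("\t"):
        # A well-formed field starts with a bare key immediately followed by '='.
        if not re.match(r"^[A-Za-z_]+=", field):
            bad.append(field)
    return bad


good = ("article=<d> <p> <s> some text . </s> </p> </d>"
        "\tabstract=<d> <p> <s> a title </s> </p> </d>"
        "\tpublisher=BUZZ")
bad = good.replace("article=", "article =")

print(find_bad_fields(good))
print(find_bad_fields(bad))
```

Running this over every line before the text-to-binary conversion would flag the offending field instead of letting the problem surface later during training.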