How to format input data for textsum data_convert_example - tensorflow

I was hoping someone may be able to see where I am failing here. I have scraped some data from BuzzFeed and am now trying to format a text file that I can feed into data_convert_example's text_to_data formatter.
I thought I had the answer a couple of times, but I keep running into a brick wall when I convert this to binary and then try to train against the data.
What I did was run binary_to_text on the toy dataset, open the output in Notepad++ under Windows with all characters shown, and match what I believed to be the format.
I apologize for the long function below, but I really am unsure where the issue might be and figured this was the best way to provide enough information. Does anyone have any ideas or recommendations?
def processPath(self, toPath):
    try:
        fout = open(os.path.join(toPath, '{}-{}'.format(self.baseName, self.fileNdx)), 'a+')
        for path, dirs, files in os.walk(self.fromPath):
            for fn in files:
                fullpath = os.path.join(path, fn)
                if os.path.isfile(fullpath):
                    #with open(fullpath, "rb") as f:
                    with codecs.open(fullpath, "rb", 'ascii', "ignore") as f:
                        try:
                            finalRes = ""
                            content = f.readlines()
                            self.populateVocab(content)
                            sentences = sent_tokenize((content[1]).encode('ascii', "ignore").strip('\n'))
                            for sent in sentences:
                                textSumFmt = self.textsumFmt
                                finalRes = textSumFmt["artPref"] + textSumFmt["sentPref"] + sent.replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]
                                finalRes += (('\t' + textSumFmt["absPref"] + textSumFmt["sentPref"] + (content[0]).strip('\n').replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]) + '\t' + 'publisher=BUZZ' + os.linesep)
                                if self.lineNdx != 0 and self.lineNdx % self.lines == 0:
                                    fout.close()
                                    self.fileNdx += 1
                                    fout = open(os.path.join(toPath, '{}-{}'.format(self.baseName, self.fileNdx)), 'a+')
                                fout.write("{}".format(finalRes.encode('utf-8', "ignore")))
                                self.lineNdx += 1
                        except RuntimeError as e:
                            print "Runtime Error: {0} : {1}".format(e.errno, e.strerror)
    finally:
        fout.close()

After further analysis, it seems the source of the problem lies more with the source data and the way it is constructed than with data_convert_example.py itself. I'm closing this, as the heading is not in line with the source of the issue.
I found the source of my problem: I had a space between "Article" and the equals sign. After removing it I was able to train successfully.
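For reference, based on a binary_to_text dump of the toy dataset, each line of the text input is assumed to be tab-separated key=value fields with no space around the equals signs, along these lines (<TAB> stands for a literal tab character, and the sentences are placeholders):
article=<d> <p> <s> first body sentence . </s> <s> second body sentence . </s> </p> </d><TAB>abstract=<d> <p> <s> the headline . </s> </p> </d><TAB>publisher=BUZZ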

Related

How to properly select wanted data and discard unwanted data from binary files

I'm working on a project where I'm trying to convert old 16-bit binary data files into 32-bit data files for later use.
Straight conversion is no issue, but then I noticed I needed to remove the header data from the files.
The data consists of 8206-byte frames; each frame consists of a 14-byte header and an 8192-byte data block (4096 16-bit values). Depending on the file, there are either 70313 or 70312 frames in each file.
I couldn't find a neat way to find all the headers, remove them, and save only the data blocks to a new file, so here's what I did:
results_array = np.empty([0, 1], np.uint16)
for filename in file_list:
    num_files += 1
    # read data from file as 16-bit values and save them as 32-bit
    data16 = np.fromfile(data_dir + "/" + filename, dtype=np.uint16)
    filesize = np.prod(data16.shape)
    if filesize == 288494239:
        total_frames = 70313
        #total_frames = 3000
    else:
        total_frames = 70312
        #total_frames = 3000
    frame_count = 0
    chunksize = 4103
    with open(data_dir + "/" + filename, 'rb') as file:
        while frame_count < total_frames:
            frame_count += 1
            read_data = file.read(chunksize)
            if not read_data:
                break
            data = read_data[7:4103]
            results_array = np.append(results_array, data)
            converted = np.frombuffer(results_array, np.uint16)
            print(str(frame_count) + "/" + str(total_frames))
converted = np.frombuffer(results_array, np.uint16)
data32 = converted.astype(dtype=np.uint32) * 256
It works (I think it does, at least), but it is very, very slow.
So the question is: is there a way to do the above much faster, maybe some built-in function in numpy or something else?
Thanks in advance
I finally managed to crack this one, and it is 100x faster than the initial approach :)
data = np.fromfile(read_dir + "/" + file, dtype=np.int16)
frames = len(data) // 4103  # frame length in 16-bit values
# Reshape into an array such that each row is a frame
data = np.reshape(data[:frames * 4103], (frames, 4103))
# Drop the 7-value (14-byte) headers and convert to int32
data = data[:, 7:].astype(np.int32) * 256
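If the converted frames then need to go back to disk as 32-bit files for the later use mentioned above, a minimal continuation (the output file name is just an assumption) would be:
# Flatten the (frames, 4096) array and write the raw 32-bit values out.
data.ravel().tofile(read_dir + "/" + file + "_32bit.raw")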

Split Tfrecord files

I have a tfrecord file that is about 8 GB. I want to split it into 4 files of about 2 GB each. How can I do this directly? Can I do this in tensorflow? Is there any application to split tfrecord data?
I don't know of a way to specify the resulting size of a tfrecord file. However, you can certainly limit the number of features inside each tfrecord file. While this is not exactly what you're asking for, it gets the job done similarly.
Here's example code showing how I dealt with this situation in the past (see the full code here):
(fragment_size is the number of features in one tfrecord file)
writer = None  # initialization added so the first check below works
for video_count in range(num_videos):
    if video_count % fragment_size == 0:
        if writer is not None:
            writer.close()
        filename = os.path.join(
            destination_path,
            name + str(current_batch_number) + '_of_' + str(total_batch_number) + '.tfrecords')
        print('Writing', filename)
        writer = tf.python_io.TFRecordWriter(filename)
    for image_count in range(num_images):
        path = 'blob' + '/' + str(image_count)
        image = data[video_count, image_count, :, :, :]
        image = image.astype(color_depth)
        image_raw = image.tostring()
        feature[path] = _bytes_feature(image_raw)
    feature['height'] = _int64_feature(height)
    feature['width'] = _int64_feature(width)
    feature['depth'] = _int64_feature(num_channels)
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())
if writer is not None:
    writer.close()
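If you just want to split an existing tfrecord file directly, without regenerating it from the source data, a minimal sketch using the same TF 1.x tf.python_io API could look like this (split_tfrecord and records_per_file are names I made up; pick records_per_file so each piece lands near your 2 GB target):
import tensorflow as tf

def split_tfrecord(in_path, out_prefix, records_per_file):
    # Copy serialized records from in_path into numbered shards,
    # without ever parsing the individual examples.
    writer = None
    for i, record in enumerate(tf.python_io.tf_record_iterator(in_path)):
        # Start a new shard every records_per_file records.
        if i % records_per_file == 0:
            if writer is not None:
                writer.close()
            writer = tf.python_io.TFRecordWriter(
                '{}-{}.tfrecords'.format(out_prefix, i // records_per_file))
        writer.write(record)
    if writer is not None:
        writer.close()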

Can't convert 'bytes' object to str implicitly for DCM to raw file

I am learning how to convert a DCM file to a raw file. I got the code from GitHub:
https://github.com/xiasun/dicom2raw/blob/master/dicom2raw.py
It throws the error "Can't convert 'bytes' object to str implicitly" on the line
"allInOne += dataset.PixelData"
I tried using "encode("utf-8")", but that makes allInOne empty.
By the way, is there any code to generate the .mhd file corresponding to the .raw file?
import dicom
import os
import numpy
import sys

dicomPath = "C:/DataLuna16pen/dcmdata/"
lstFilesDCM = []  # create an empty list
for dirName, subdirList, fileList in os.walk(dicomPath):
    allInOne = ""
    print(subdirList)
    i = 0
    for filename in fileList:
        i += 1
        if "".join(filename).endswith((".dcm", ".DCM")):
            path = dicomPath + "".join(filename)
            dataset = dicom.read_file(path)
            for n, val in enumerate(dataset.pixel_array.flat):
                dataset.pixel_array.flat[n] = val / 60
                if val < 0:
                    dataset.pixel_array.flat[n] = 0
            dataset.PixelData = numpy.uint8(dataset.pixel_array).tostring()
            allInOne += dataset.PixelData
            print("slice " + "".join(filename) + " done ", end=" ")
print(i)
newFile = open("./all_in_one.raw", "wb")
newFile.write(allInOne)
newFile.close()
print("RAW file generated")
There are several things:
PyDicom still doesn't read compressed DICOMs properly (lossless JPEG). You should check the Transfer Syntax of the files to see whether this is the case. As a workaround you can use the GDCM tool dcmdjpeg.
You should not convert the byte array into a string (np.array.tostring in fact returns an array of bytes).
For writing mha files, take a look at MedPy. You can also use ITK directly: there is a Python wrapper, and SimpleITK, a kind of lightweight wrapper around ITK.
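Concretely, for the error in the question: accumulate bytes instead of str. A minimal sketch of that fix under Python 3, keeping the question's variable names and assuming uncompressed DICOMs:
import dicom
import numpy
import os

dicomPath = "C:/DataLuna16pen/dcmdata/"  # same folder as in the question
allInOne = b""  # start from empty bytes, not an empty str
for dirName, subdirList, fileList in os.walk(dicomPath):
    for filename in fileList:
        if filename.lower().endswith(".dcm"):
            dataset = dicom.read_file(os.path.join(dirName, filename))
            # bytes += bytes works; no encode() needed
            allInOne += numpy.uint8(dataset.pixel_array).tostring()
with open("./all_in_one.raw", "wb") as newFile:
    newFile.write(allInOne)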

How to split a PDF every n page using PyPDF2?

I'm trying to learn how to split a pdf every n pages.
In my case I want to split a 64-page PDF into several chunks of four pages each: file 1: pp. 1-4, file 2: pp. 5-8, etc.
I'm trying to understand PyPDF2, but my noobness overwhelms me:
from PyPDF2 import PdfFileWriter, PdfFileReader
pdf = PdfFileReader('my_pdf.pdf')
I guess I need to make a loop of sorts using addPage and write files till there are no pages left?
A little late, but I ran into your question while looking for help trying to do the same thing.
I ended up doing the following, which does what you're asking. Mind you, it's probably more than you're asking for, but the answer is in there. It's a rough first draft, in heavy need of refactoring and some variable renaming.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

def split_pdf(in_pdf, step=1):
    """Splits a given pdf into separate pdfs and saves
    them to a subfolder of the parent pdf's folder, called
    <filename>_splitted.

    Arguments:
        in_pdf: [str] Absolute path (and filename) of the
                input pdf, or just the filename if the file
                is in the current directory.
        step:   [int] Desired number of pages in each of the
                output pdfs.

    Returns:
        dunno yet
    """
    # TODO: Add choice for output dir
    # TODO: Add logging instead of prints
    # TODO: Refactor
    try:
        with open(in_pdf, 'rb') as in_file:
            input_pdf = PdfFileReader(in_file)
            num_pages = input_pdf.numPages
            input_dir, filename = os.path.split(in_pdf)
            filename = os.path.splitext(filename)[0]
            output_dir = input_dir + "/" + filename + "_splitted/"
            os.mkdir(output_dir)
            intervals = range(0, num_pages, step)
            intervals = dict(enumerate(intervals, 1))
            naming = f'{filename}_p'
            count = 0
            for key, val in intervals.items():
                output_pdf = PdfFileWriter()
                if key == len(intervals):
                    # Last chunk: take everything up to the final page.
                    for i in range(val, num_pages):
                        output_pdf.addPage(input_pdf.getPage(i))
                else:
                    for i in range(val, intervals[key + 1]):
                        output_pdf.addPage(input_pdf.getPage(i))
                nums = f'{val + 1}' if step == 1 else f'{val + 1}-{val + step}'
                with open(f'{output_dir}{naming}{nums}.pdf', 'wb') as outfile:
                    output_pdf.write(outfile)
                print(f'{naming}{nums}.pdf written to {output_dir}')
                count += 1
    except FileNotFoundError as err:
        print('Cannot find the specified file. Check your input:')
        print(f'{count} pdf files written to {output_dir}')
Hope it helps you.
from PyPDF2 import PdfFileReader, PdfFileWriter
import os

# Method to split the pdf at every given n pages.
def split_at_every(self, infile, step=1):
    # Open the input file and wrap it in a reader.
    input_pdf = PdfFileReader(open(infile, "rb"))
    pdf_len = input_pdf.numPages
    # Take only the base file name, without its path and extension.
    fname = os.path.splitext(os.path.basename(infile))[0]
    # Get the list of starting page numbers for the given step.
    # If there are 10 pages in a pdf, and the step is 2:
    # page_numbers = [0, 2, 4, 6, 8]
    page_numbers = list(range(0, pdf_len, step))
    # Loop through the starting page numbers.
    for ind, val in enumerate(page_numbers):
        # Check if the index is the last in the given page numbers.
        # If the index is not the last one, carry on with the if block.
        if ind + 1 != len(page_numbers):
            # Initialize the PDF writer.
            output_1 = PdfFileWriter()
            # Loop from the current starting page up to the next one.
            # Ex: page_numbers = [0, 2, 4, 6, 8]
            # If the current index is 0, copy the 1st and 2nd pages of the pdf doc.
            for page in range(page_numbers[ind], page_numbers[ind + 1]):
                # Get the data from the given page number.
                page_data = input_pdf.getPage(page)
                # Add the page data to the pdf writer.
                output_1.addPage(page_data)
            # Frame the output file name.
            output_1_filename = '{}_page_{}.pdf'.format(fname, page + 1)
            # Write the output content to the file and save it.
            self.write_to_file(output_1_filename, output_1)
        else:
            output_final = PdfFileWriter()
            output_final_filename = "Last_Pages"
            # Loop from the current starting page to the last page of the pdf doc.
            # Ex: page_numbers = [0, 2, 4, 6, 8]
            # If the current index is 8, copy from the 8th page till the last page.
            for page in range(page_numbers[ind], pdf_len):
                # Get the data from the given page number.
                page_data = input_pdf.getPage(page)
                # Add the page data to the pdf writer.
                output_final.addPage(page_data)
            # Frame the output file name.
            output_final_filename = '{}_page_{}.pdf'.format(fname, page + 1)
            # Write the output content to the file and save it.
            self.write_to_file(output_final_filename, output_final)
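For the original 64-page, four-pages-per-chunk case specifically, a stripped-down sketch along the same lines (the output naming is just an assumption) would be:
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader('my_pdf.pdf')
step = 4
for start in range(0, pdf.numPages, step):
    writer = PdfFileWriter()
    end = min(start + step, pdf.numPages)
    for i in range(start, end):
        writer.addPage(pdf.getPage(i))
    # e.g. my_pdf_p1-4.pdf, my_pdf_p5-8.pdf, ...
    with open('my_pdf_p{}-{}.pdf'.format(start + 1, end), 'wb') as f:
        writer.write(f)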

Return multiple input (Python)

In Python 3 I have a line asking for input; the code then looks the input up in an imported dictionary and lists every entry that appears in it. My problem is that when I run the code and put in the input, it only returns the last word I entered.
For example, the dictionary contains (AIR, AMA), and if I input (AIR, AMA) it will only return AMA.
Any information to resolve this would be very helpful!
The dictionary:
EXCHANGE_DATA = [('AIA', 'Auckair', 1.50),
                 ('AIR', 'Airnz', 5.60),
                 ('AMP', 'Amp', 3.22),
The Code:
import shares

a = input("Please input")
s1 = a.replace(' ', "")
print('Please list portfolio: ' + a)
print(" ")
n = ["Code", "Name", "Price"]
print('{0: <6}'.format(n[0]) + '{0:<20}'.format(n[1]) + '{0:>8}'.format(n[2]))
z = shares.EXCHANGE_DATA[0:][0]
b = s1.upper()
c = b.split()
f = shares.EXCHANGE_DATA

def find(f, a):
    return [s for s in f if a.upper() in s]

x = find(f, str(a))
toDisplay = []
a = a.split()
for i in a:
    temp = find(f, i)
    if temp:
        toDisplay.append(temp)
for i in toDisplay:
    print('{0: <6}'.format(i[0][0]) + '{0:<20}'.format(i[0][1]) + "{0:>8.2f}".format(i[0][2]))
Ok, the code seems somewhat confused. Here's a simpler version that seems to do what you want:
#!/usr/bin/env python3

EXCHANGE_DATA = [('AIA', 'Auckair', 1.50),
                 ('AIR', 'Airnz', 5.60),
                 ('AMP', 'Amp', 3.22)]

user_input = input("Please Specify Shares: ")
names = set(user_input.upper().split())
print('Listing the following shares: ' + str(names))
print(" ")
# Print header
n = ["Code", "Name", "Price"]
print('{0: <6}{1:<20}{2:>8}'.format(n[0], n[1], n[2]))
# Print data
for i in [data for data in EXCHANGE_DATA if data[0] in names]:
    print('{0: <6}{1:<20}{2:>8}'.format(i[0], i[1], i[2]))
And here's an example of use:
➤ python3 program.py
Please Specify Shares: air amp
Listing the following shares: {'AMP', 'AIR'}
Code  Name                   Price
AIR   Airnz                    5.6
AMP   Amp                     3.22
The code sample you provided actually does what was expected, if you give it space-separated quote names.
Hope this helps.
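If the input really does contain commas, as in the (AIR, AMA) example, one small assumed tweak to the parsing line above strips them before splitting:
# Treat commas as whitespace so "AIR, AMA" parses the same as "AIR AMA".
names = set(user_input.upper().replace(',', ' ').split())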