How do you convert all the PDFs in a directory into txt format via R?

I'm trying to convert a list of PDF files in a directory on my computer into txt format so that R can read them and begin text mining. Do you know what is wrong with this code?
library(tm) #load text mining library
setwd('D:/Directory') #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("D:/Directory/NewsArticles"),readerControl=list(reader=readPlain))
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
system(paste("\"", exe, "\" \"", ae.corpus, "\"", sep = ""), wait = F)
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt); shell.exec(filetxt) # strangely the first try always throws an error..
summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
myStopwords <- c(stopwords('english'), "available", "via")
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 3))
inspect(ae.tdm)
findFreqTerms(ae.tdm, lowfreq=2)
findAssocs(ae.tdm, "economic",.7)
d <- Dictionary(c("economic", "uncertainty", "policy"))
inspect(DocumentTermMatrix(ae.corpus, list(dictionary = d)))

Try using this instead:
dest <- ""  # the folder that holds the PDFs (same path as setwd() above)
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
# convert each PDF file that is named in the vector into a text file
# the text file is created in the same directory as the PDFs
exe <- '""'  # put the path to pdftotext.exe (e.g. where the Xpdf binaries are saved) between the quotes
lapply(myfiles, function(i) system(paste(exe, paste0('"', i, '"')), wait = FALSE))
and then combine the resulting text files:
files <- list.files(pattern = "[.]txt$")
outFile <- file("output.txt", "w")
for (i in files) {
  x <- readLines(i)
  writeLines(x[2:(length(x) - 1)], outFile)  # drop the first and last line of each file
}
close(outFile)
# read the combined data
txt <- read.table('output.txt', sep = '\t', quote = "")
Hope that helps!

Related

How to avoid "missing input files" error in Snakemake's "expand" function

I get a MissingInputException when I run the following snakemake code:
import re
import os

glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs", "{fileName}.{ext}"))

rule end:
    input:
        expand(os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas"), fileName=glob_vars.fileName)

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        expand("inputs/{{fileName}}.{ext}", ext=glob_vars.ext)
    output:
        os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas")
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = re.sub(r"\W", "_", line.strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s", "", line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))
My Inputs folder contains these files:
G.bullatarudis.fasta
goldfish_protein.faa
guppy_protein.faa
gyrodactylus_salaris.fasta
protopolystoma_xenopodis.fa
salmon_protein.faa
schistosoma_mansoni.fa
The error message is:
Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NCBI/BLAST/RHB/test.rule:
Missing input files for rule rename:
inputs/guppy_protein.fasta
inputs/guppy_protein.fa
I assumed that the error is caused by the expand function, because only the guppy_protein.faa file exists, but expand also generates guppy_protein.fasta and guppy_protein.fa paths. Are there any solutions?
By default, expand will produce all combinations of the input lists, so this is expected behavior. You need your input to look up the proper extension given a fileName. I haven't tested this:
glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs", "{fileName}.{ext}"))

# create a dict to look up extensions given fileNames
glob_vars_dict = {fname: ex for fname, ex in zip(glob_vars.fileName, glob_vars.ext)}

def rename_input(wildcards):
    ext = glob_vars_dict[wildcards.fileName]
    return f"inputs/{wildcards.fileName}.{ext}"

rule rename:
    input: rename_input
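To see why the spurious paths show up at all, note that expand takes the Cartesian product of its argument lists by default. With the question's guppy_protein file and the three extensions found in the folder (the values here are just for illustration):
expand("inputs/{fileName}.{ext}", fileName=["guppy_protein"], ext=["fasta", "faa", "fa"])
# -> ["inputs/guppy_protein.fasta", "inputs/guppy_protein.faa", "inputs/guppy_protein.fa"]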
A few unsolicited style comments:
You don't have to prepend your glob_wildcards pattern with os.getcwd(); glob_wildcards("inputs/{fileName}.{ext}") should work, as snakemake uses paths relative to the working directory by default.
Try to stick with snake_case instead of camelCase for your variable names in Python.
In this case, fileName isn't a great descriptor of what you are capturing. Maybe species_name or species would be clearer.
Thanks to Troy Comi, I modified my code and it worked:
import re
import os
import itertools

speciess, exts = glob_wildcards(os.path.join(os.getcwd(), "inputs_test", "{species}.{ext}"))

rule end:
    input:
        expand("inputs_test/{species}_rename.fas", species=speciess)

def required_files(wildcards):
    list_combination = itertools.product([wildcards.species], list(set(exts)))
    exist_file = ""
    for file in list_combination:
        if os.path.exists(f"inputs_test/{'.'.join(file)}"):
            exist_file = f"inputs_test/{'.'.join(file)}"
    return exist_file

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        required_files
    output:
        "inputs_test/{species}_rename.fas"
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = ">" + re.sub(r"\W", "_", line.replace(">", "").strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s", "", line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))

How can I use a loop to apply a function to a list of csv files?

I'm trying to loop through all files in a directory and add "indicator" data to them. I had the code working where I could select one file and do this, but now I'm trying to make it work on all files. The problem is that when I make the loop it says:
ValueError: Invalid file path or buffer object type: <class 'list'>
The goal is for each loop iteration to read another file from the list, make changes, and save the file back to the folder with the changes.
Here is the complete code without imports. I copied one of the file paths from the list and put it in a comment at the bottom.
### open dialog to select file
#file_path = filedialog.askopenfilename()

### create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')

### append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
print(listdrs_path)

### start loop: for each "file" in listdrs run the 2 functions below and overwrite the saved csv
for file in listdrs_path:
    file_path = listdrs_path
    data = pd.read_csv(file_path, index_col=0)

    ########################################
    #### function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        print(listdr)
        # Convert date to timestamp and make index
        data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
        data.drop("Date", axis=1, inplace=True)
        return data

    df = data
    ##print(data)

    ###### Indicator data #####################
    def get_indicators(data):
        # Get MACD
        data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
        # Get MA10 and MA30
        data["ma10"] = talib.MA(data["Close"], timeperiod=10)
        data["ma30"] = talib.MA(data["Close"], timeperiod=30)
        # Get RSI
        data["rsi"] = talib.RSI(data["Close"])
        return data
    ##### end functions #######

    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)

###################################################
# here is an example of what a path from the list looks like:
# 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/A.csv'
The problem is in the first two lines of your loop (lines 13 and 14 of your script): your filename is in the variable file, but you are using file_path, which you've assigned the whole file list. Because of this you are getting the ValueError. Try this:
### open dialog to select file
#file_path = filedialog.askopenfilename()

### create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')

### append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
print(listdrs_path)

### start loop: for each file path in listdrs_path run the 2 functions below and overwrite the saved csv
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)

    ########################################
    #### function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        print(listdr)
        # Convert date to timestamp and make index
        data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
        data.drop("Date", axis=1, inplace=True)
        return data

    df = data
    ##print(data)

    ###### Indicator data #####################
    def get_indicators(data):
        # Get MACD
        data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
        # Get MA10 and MA30
        data["ma10"] = talib.MA(data["Close"], timeperiod=10)
        data["ma30"] = talib.MA(data["Close"], timeperiod=30)
        # Get RSI
        data["rsi"] = talib.RSI(data["Close"])
        return data
    ##### end functions #######

    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)
Let me know if it helps.
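If it does, a slightly tighter variant of the same loop moves the function definitions out of the loop body so they are only defined once. This is just a sketch reusing the question's get_indicators and directory; os.path.join is the only addition:
import os
import pandas as pd

stock_dir = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
for name in os.listdir(stock_dir):
    file_path = os.path.join(stock_dir, name)   # full path to one csv
    data = pd.read_csv(file_path, index_col=0)  # read it
    data2 = get_indicators(data)                # add the indicator columns
    data2.to_csv(file_path)                     # overwrite the csv in place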

Convert PDF to .txt file with Google Cloud Storage

I have this code for Python on a local file system. What is the equivalent Python object API for os.getcwd() and os.listdir? I want this code to work using files from GCS.
In order to use GCS folders, I included this code:
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
pdfDir = bucket.get_blob('uploads/pdf/')
txtDir = bucket.get_blob('uploads/txt/')
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt

# converts pdf, returns its text content as a string
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

# converts all pdfs in directory pdfDir, saves all resulting txt files to txtDir
def PDF2txt(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd() + "\\"  # if no pdfDir passed in
    for pdf in os.listdir(pdfDir):  # iterate through pdfs in pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf
            text = convert(pdfFilename)  # get string of text content of pdf
            textFilename = txtDir + pdf + ".txt"
            textFile = open(textFilename, "w")  # make text file
            textFile.write(text)  # write text to text file

pdfDir = "C:/pdftotxt/pdfs/"
txtDir = "C:/pdftotxt/txt/"
PDF2txt(pdfDir, txtDir)
I assume that what you want is to list objects in a bucket and objects in particular folders inside a bucket. For that you can use the Python client libraries that Google Cloud Storage provides directly. Use bucket.list_blobs() for listing the whole bucket and bucket.list_blobs(prefix=prefix, delimiter=delimiter) for listing a particular folder or object.
More detailed documentation can be found here [1] and the Git repository containing the whole client libraries here [2].
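As a concrete illustration, here is a minimal sketch of that listing applied to the question's conversion flow; the bucket name and prefixes come from the question, while downloading to /tmp and the output naming are my assumptions:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')

# GCS has no real directories: list objects whose names start with the prefix
for blob in bucket.list_blobs(prefix='uploads/pdf/'):
    if blob.name.endswith('.pdf'):
        local_pdf = '/tmp/' + blob.name.split('/')[-1]
        blob.download_to_filename(local_pdf)       # pull the PDF down locally
        text = convert(local_pdf)                  # the question's convert()
        out_name = 'uploads/txt/' + blob.name.split('/')[-1] + '.txt'
        bucket.blob(out_name).upload_from_string(text)  # push the text back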

Can't convert 'bytes' object to str implicitly for DCM to raw file

I'm learning how to convert a DCM file to a raw file. I got the code from GitHub:
https://github.com/xiasun/dicom2raw/blob/master/dicom2raw.py
It throws the error "Can't convert 'bytes' object to str implicitly" on the line
allInOne += dataset.PixelData
I tried to use encode("utf-8"), but that makes allInOne empty.
By the way, is there any code to generate the .mhd file corresponding to the .raw file?
import dicom
import os
import numpy
import sys

dicomPath = "C:/DataLuna16pen/dcmdata/"
lstFilesDCM = []  # create an empty list
for dirName, subdirList, fileList in os.walk(dicomPath):
    allInOne = ""
    print(subdirList)
    i = 0
    for filename in fileList:
        i += 1
        if "".join(filename).endswith((".dcm", ".DCM")):
            path = dicomPath + "".join(filename)
            dataset = dicom.read_file(path)
            for n, val in enumerate(dataset.pixel_array.flat):
                dataset.pixel_array.flat[n] = val / 60
                if val < 0:
                    dataset.pixel_array.flat[n] = 0
            dataset.PixelData = numpy.uint8(dataset.pixel_array).tostring()
            allInOne += dataset.PixelData
            print("slice " + "".join(filename) + " done ", end=" ")
            print(i)

newFile = open("./all_in_one.raw", "wb")
newFile.write(allInOne)
newFile.close()
print("RAW file generated")
There are several things:
PyDicom still doesn't read compressed DICOMs properly (lossless JPEG). You should check the Transfer Syntax of the files to see if this is the case. As a workaround you can decompress them first with the dcmdjpeg tool (from the DCMTK suite).
You should not concatenate the byte array onto a string (numpy's tostring() in fact returns an array of bytes); accumulate bytes instead, as in the sketch below.
For writing mhd/mha files, take a look at MedPy. You can also use ITK directly: there is a Python wrapper, and SimpleITK, a kind of lightweight simplification of ITK.
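A minimal sketch of the bytes fix plus the .mhd/.raw output; SimpleITK is used here by way of example, and "datasets" stands in for the pydicom objects read in the question's loop:
import numpy
import SimpleITK as sitk

allInOne = b""  # accumulate bytes, not str: "" += bytes is what raised the error
slices = []
for dataset in datasets:              # the datasets read in the question's loop
    pixels = numpy.uint8(dataset.pixel_array)
    allInOne += pixels.tostring()     # bytes += bytes works fine
    slices.append(pixels)

with open("all_in_one.raw", "wb") as f:  # same raw output as before
    f.write(allInOne)

# writing to a .mhd path makes SimpleITK emit the matching .mhd/.raw pair
image = sitk.GetImageFromArray(numpy.stack(slices))  # stack 2-D slices into a z,y,x volume
sitk.WriteImage(image, "all_in_one.mhd")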

How to convert an R data file to a Python type

I want to convert an R data type to a Python data type; below is the whole code:
def convert_datafiles(datasets_folder):
    import rpy2.robjects
    rpy2.robjects.numpy2ri.activate()
    pandas2ri.activate()
    for root, dirs, files in os.walk(datasets_folder):
        for name in files:
            # sort out .RData files
            if name.endswith('.RData'):
                name_ = os.path.splitext(name)[0]
                name_path = os.path.join(datasets_folder, name_)
                # create sub-directory
                if not os.path.exists(name_path):
                    os.makedirs(name_path)
                file_path = os.path.join(root, name)
                robj = robjects.r.load(file_path)
                # check out subfiles in the data frame
                for var in robj:
                    ###### error happens right here
                    myRData = pandas2ri.ri2py_dataframe(var)
                    ###### error happens right here
                    # convert to DataFrame
                    if not isinstance(myRData, pd.DataFrame):
                        myRData = pd.DataFrame(myRData)
                    var_path = os.path.join(datasets_folder, name_, var + '.csv')
                    myRData.to_csv(var_path)
                os.remove(os.path.join(datasets_folder, name))  # clean up
    print("=> Success!")
I want to convert the R data to a Python type, but this error keeps popping up: AttributeError: 'str' object has no attribute 'dtype'.
How can I resolve this error?
The rpy2 documentation is somewhat incomplete when it comes to interaction with pandas, but its unit tests provide examples of conversion. For example:
import rpy2.robjects as robjects
from rpy2.robjects import default_converter, pandas2ri
from rpy2.robjects.conversion import localconverter

rdataf = robjects.r('data.frame(a=1:2, '
                    '           b=I(c("a", "b")), '
                    '           c=c("a", "b"))')
with localconverter(default_converter + pandas2ri.converter) as cv:
    pandas_df = robjects.conversion.ri2py(rdataf)
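Applied to your loop, the likely culprit is that robjects.r.load() returns the names of the loaded objects, so var is a plain string, hence the 'str' object has no attribute 'dtype' error. The object itself has to be fetched from the R global environment before conversion; an untested sketch:
robj = robjects.r.load(file_path)          # returns the *names* of loaded objects
for var in robj:                           # var is a str, not an R object
    r_object = robjects.globalenv[var]     # fetch the actual R object by name
    myRData = pandas2ri.ri2py_dataframe(r_object)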