Find and/or Edit file names in a directory hierarchy - python-3.8

I created a Python program that uses import os... and I want to rewrite it without the os module. Can someone help me?
The main goal is for the user to be able to search a path and also change file names, but I only know the os approach.
tree.py
import os
import sys
import time
#logo
print("_______ _______________ ______ ___")
print("___ |___ __/__ __ \___ |/ /")
print("__ /| |__ / _ / / /__ /|_/ /")
print("_ ___ |_ / / /_/ / _ / / /")
print("/_/ |_|/_/ \____/ /_/ /_/")
# os.getcwd() returns the current working directory as a string.
# os.chdir() changes the current working directory; it returns None,
# so the path has to be captured separately.
os.chdir('/root')
path = os.getcwd()
bl = os.getcwd()
print(os.listdir())
os.chdir(bl)
print("your current location is: ", path)
change = input("Do you want to change your directory (yes/no): ")
if change == "yes":
    dir = input("Add new Location: ")
    if dir == bl and dir == path:
        print("cant go to the same directory")
    else:
        print("please wait")
        time.sleep(2)
        os.chdir(dir)
        print(os.path.exists(dir))
        print(os.getcwd())
prDIR = input("Do you want to print the files (yes/no) : ")
if prDIR == "yes":
    for f in os.listdir():
        file_name, file_ext = os.path.splitext(f)
        print(file_name.split('-'))
change2 = input("Do you want to change again your DIR: ")
if change2 == "yes":
    dir = input("Add a new location: ")
    if dir == bl:
        print("cant go to the same place")
    else:
        print("please wait")
        time.sleep(1)
        os.chdir(dir)
        print(os.path.exists(dir))
        print(os.getcwd())
        print("your current location is: ", path, dir)
pat = input("Do you want to rename this Dir (yes/no) : ")
if pat == "yes":
    ch = input("Enter the path but renamed: ")
    time.sleep(2)
    os.rename(dir, ch)
    print(ch)
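One possible direction, if the goal is to drop the os module: pathlib from the standard library covers listing a directory, checking that a path exists, and renaming. This is only a rough sketch of the idea, not a full rewrite of tree.py:
from pathlib import Path

target = Path(input("Add new Location: ")).expanduser()
if target.exists():
    # list the entries and split the file stems, like the os.listdir() version does
    for entry in target.iterdir():
        print(entry.stem.split('-'))
    new_name = input("Enter the new name for this directory: ")
    renamed = target.rename(target.with_name(new_name))  # Path.rename replaces os.rename
    print(renamed)
else:
    print("that path does not exist")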


! Package pdftex.def Error - when knitting to PDF

I am able to knit to PDF for the example below:
---
title: "R Notebook"
output:
  pdf_document: default
  html_notebook: default
  html_document:
    df_print: paged
---
Table 1 example:
```{r, warning=FALSE, message=FALSE, echo=FALSE, include=FALSE, fig.pos="H"}
library(magrittr)
library(tidyverse)
library(kableExtra)
library(readxl)
library(modelsummary)
library(scales)
tmp <- mtcars
# create a list with individual variables
# remove missing and rescale
tmp_list <- lapply(tmp, na.omit)
tmp_list <- lapply(tmp_list, scale)
# create a table with `datasummary`
# add a histogram with column_spec and spec_hist
# add a boxplot with column_spec and spec_box
emptycol = function(x) " "
final_4_table <- datasummary(mpg + cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb ~ N + Mean + SD + Heading("Boxplot") * emptycol + Heading("Histogram") * emptycol, data = tmp) %>%
column_spec(column = 5, image = spec_boxplot(tmp_list)) %>%
column_spec(column = 6, image = spec_hist(tmp_list))
```
```{r finaltable, echo=FALSE}
final_4_table
```
However, I cannot knit my own code to PDF, which involves more variables. My R Markdown starts by reading my Excel file and is then pretty much the same as the example above:
---
title: "table1"
output:
  pdf_document: default
  html_document:
    df_print: paged
---
Table 1
```{r prep-tableone, message=FALSE, warning=FALSE, echo=FALSE, include=FALSE, fig.pos="H"}
library(magrittr)
library(tidyverse)
library(kableExtra)
library(readxl)
library(modelsummary)
library(scales)
### set directory
setwd("/etcetc1")
## read dataset
my_dataset <- read_excel("my_dataset.xlsx")
my_dataset <- as.data.frame(my_dataset)
...
I can run this code in an R script and it works just fine. I can also knit it to HTML just fine. When trying to knit to PDF, however, I get the following error:
output file: table1test.knit.md
! Package pdftex.def Error: File `table1test_files/figure-latex//boxplot_65c214bae9bb.pdf' not found: using draft setting.
Error: LaTeX failed to compile table1test.tex. See https://yihui.org/tinytex/r/#debugging for debugging tips. See table1test.log for more info.
In addition: Warning messages:
1: package 'ggplot2' was built under R version 4.1.1
2: package 'tibble' was built under R version 4.1.1
3: package 'tidyr' was built under R version 4.1.1
4: In in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE, :
You changed the working directory to /etcetc1 (probably via setwd()). It will be restored to /etcetc2. See the Note section in ?knitr::knit
Execution halted
Am I missing any packages? Do you know what might be happening?
I use TeXShop for LaTeX.

How to avoid "missing input files" error in Snakemake's "expand" function

I get a MissingInputException when I run the following Snakemake code:
import re
import os

glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs", "{fileName}.{ext}"))

rule end:
    input:
        expand(os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas"), fileName=glob_vars.fileName)

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        expand("inputs/{{fileName}}.{ext}", ext=glob_vars.ext)
    output:
        os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas")
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = re.sub(r"\W", "_", line.strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s", "", line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))
My Inputs folder contains these files:
G.bullatarudis.fasta
goldfish_protein.faa
guppy_protein.faa
gyrodactylus_salaris.fasta
protopolystoma_xenopodis.fa
salmon_protein.faa
schistosoma_mansoni.fa
The error message is:
Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NCBI/BLAST/RHB/test.rule:
Missing input files for rule rename:
inputs/guppy_protein.fasta
inputs/guppy_protein.fa
I assume the error is caused by the expand function, because only the guppy_protein.faa file exists, but expand also generates guppy_protein.fasta and guppy_protein.fa. Are there any solutions?
By default, expand will produce all combinations of the input lists, so this is expected behavior. You need your input to look up the proper extension for a given fileName. I haven't tested this:
glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs", "{fileName}.{ext}"))

# create a dict to look up extensions given fileNames
glob_vars_dict = {fname: ex for fname, ex in zip(glob_vars.fileName, glob_vars.ext)}

def rename_input(wildcards):
    ext = glob_vars_dict[wildcards.fileName]
    return f"inputs/{wildcards.fileName}.{ext}"

rule rename:
    input: rename_input
A few unsolicited style comments:
You don't have to prepend your glob_wildcards pattern with os.getcwd(); glob_wildcards("inputs/{fileName}.{ext}") should work, as Snakemake uses paths relative to the working directory by default.
Try to stick with snake_case instead of camelCase for your variable names in Python.
In this case, fileName isn't a great descriptor of what you are capturing. Maybe species_name or species would be clearer
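To make the cross-product behaviour concrete, here is a tiny standalone illustration in plain Python (itertools.product stands in for what expand does over its keyword lists; the file names are just examples from the inputs folder):
import itertools

file_names = ["guppy_protein", "G.bullatarudis"]
exts = ["faa", "fasta", "fa"]

# expand("inputs/{fileName}.{ext}", fileName=file_names, ext=exts) behaves like
# a Cartesian product over its keyword arguments:
targets = ["inputs/{}.{}".format(name, ext) for name, ext in itertools.product(file_names, exts)]
print(targets)
# Every name is paired with every extension, including files that never
# existed on disk, e.g. inputs/guppy_protein.fasta and inputs/guppy_protein.fa.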
Thanks to Troy Comi, I modified my code and it worked:
import re
import os
import itertools

speciess, exts = glob_wildcards(os.path.join(os.getcwd(), "inputs_test", "{species}.{ext}"))

rule end:
    input:
        expand("inputs_test/{species}_rename.fas", species=speciess)

def required_files(wildcards):
    list_combination = itertools.product([wildcards.species], list(set(exts)))
    exist_file = ""
    for file in list_combination:
        if os.path.exists(f"inputs_test/{'.'.join(file)}"):
            exist_file = f"inputs_test/{'.'.join(file)}"
    return exist_file

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        required_files
    output:
        "inputs_test/{species}_rename.fas"
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = ">" + re.sub(r"\W", "_", line.replace(">", "").strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s", "", line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))

only when building a .exe from working code: AttributeError: Can only use .dt accessor with datetimelike values

I have a working Python script based on pandas.
Converting a similar script into a .exe worked on my computer at work. Unfortunately, this isn't the case for my computer at home. I tried PyInstaller and py2exe, and both bring up this error.
It seems to me that the conversion raises a number of errors (I already fixed some of them), so I don't think it's ultimately about the datetime issue.
import pandas as pd
import os
import glob
from datetime import datetime
import shutil
import os.path

try:
    parentfolder = os.path.dirname(__file__)
    parentfolder = os.path.abspath(os.path.join(parentfolder, '..'))  # parent folder of the script file
except NameError:  # We are the main py2exe script, not a module
    import sys
    parentfolder = os.path.dirname(sys.argv[0])
    parentfolder = os.path.abspath(os.path.join(parentfolder, '..'))  # parent folder of the script file

today = datetime.now()
day1 = today.strftime("%d-%m-%Y")
time1 = today.strftime("%d-%m-%Y_%H-%M-%S")
day1 = day1 + '_cleaned'
logname = "logfile_" + time1 + ".txt"
resultfolder = os.path.join(parentfolder, day1)
logfile = os.path.join(resultfolder, logname)

if os.path.exists(resultfolder):
    shutil.rmtree(resultfolder)  # deletes the folder and all subfolders
os.makedirs(resultfolder)

pd.set_option('display.max_columns', 5)
pd.set_option('display.max_colwidth', 99)

f = open(logfile, "w")
f.close()

all_files = glob.glob(parentfolder + "/*.xls")
filecounter = 0
first_run_counter = 0
first_day_counter = 0

for filename in all_files:
    file_name = (os.path.splitext(os.path.basename(filename))[0])
    writepath = os.path.join(resultfolder, '{}.xlsx'.format(str(file_name) + "_py"))
    writer = pd.ExcelWriter(writepath, engine='xlsxwriter')
    with open(logfile, "a") as file:
        file.write("{} \n".format(str(file_name)))
    filecounter += 1
    if filecounter > 1:
        print("WARNING, JUST CONVERT 1 FILE")
        break
    list1 = []
    dfs_by_day = []
    df = pd.read_excel(filename, header=None, parse_dates=False)  # read without a header (optionally decimal=",")
    # df = df.convert_dtypes(convert_string=True)
    df_help = df.copy()
    df_help[1] = df_help[1].astype(str)
    df_help[0] = df_help[0].astype(str)
    ##### sort and filter the file, etc.
    df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)  # deletes rows with empty cells
    df.drop_duplicates(inplace=True)  # also drops duplicate headers?!
    df.reset_index(drop=True, inplace=True)
    new_header = df.iloc[0]  # grab the first row for the header
    df = df[1:]  # take the data less the header row
    df.columns = new_header  # takes the second row as the header
    df = df.sort_values(['Date (MM/DD/YYYY)', 'Time (HH:mm:ss)'], ascending=[True, True])
    df.reset_index(drop=True, inplace=True)
    df.rename(columns={'Date (MM/DD/YYYY)': 'Date (DD/MM/YYYY)'}, inplace=True)
    # df['Date (DD/MM/YYYY)'] = df['Date (DD/MM/YYYY)'].astype(str)  # IMPORTANT! make the date independent of / or .
    # df['Date (DD/MM/YYYY)'] = df['Date (DD/MM/YYYY)'].str.replace('/', '.')  # replace / with .
    df_help2 = df.copy()  # deep copy of the trimmed dataframe (not yet datetime)
    #################################################################### split the file into days
    ## df_help2['Date (DD/MM/YYYY)'] = pd.to_datetime(df_help2['Date (DD/MM/YYYY)'], format='%d.%m.%Y')  # ADD FORMAT IF NEEDED, e.g. format='%d.%m.%Y'
    df_help2['next day'] = (df_help2['Date (DD/MM/YYYY)'].diff()).dt.days > 0  # does a new row mean a new day?
    ############### make the date independent of / or .
    for i in range(df_help2.shape[0]):
        if df_help2.at[i, 'next day'] == True:
            list1.append(i)
    # splitting algorithm: whole file into days
    l_mod = [0] + list1 + [df.shape[0]]
    dfs_by_day = [df.iloc[l_mod[n]:l_mod[n + 1]] for n in range(len(l_mod) - 1)]
    ################################################################# split the days into runs
    for j in dfs_by_day:
        memo = 0
        run_counter = 1
        df1 = j
        df1 = df1.reset_index(drop=True)
        df_help4 = df1.iloc[0:1, 0:2].reset_index(drop=True).copy()
        df1['Date (DD/MM/YYYY)'] = df1['Date (DD/MM/YYYY)'].dt.strftime('%d.%m.%Y')
        list3 = []
        dfdate = str(df1.at[0, 'Date (DD/MM/YYYY)'])
        print(dfdate)
        df_help3 = df1.copy()  # deep copy for the time-of-day / run analysis
        df_help3['Time (HH:mm:ss)'] = pd.to_datetime(df_help3['Time (HH:mm:ss)'], format='%H:%M:%S')
        df_help3['next run'] = (df_help3['Time (HH:mm:ss)'].diff()).dt.seconds > 2000
        df_help3.reset_index(drop=True, inplace=True)
        for i in range(df_help3.shape[0]):
            if df_help3.at[i, 'next run'] == True:
                list3.append(i)
        ### algorithm splits a day into runs
        l_mod2 = [0] + list3 + [df1.shape[0]]
        dfs_by_run = [df1.iloc[l_mod2[n]:l_mod2[n + 1]] for n in range(len(l_mod2) - 1)]
        for k in dfs_by_run:
            df_run = k
            df_run['Depth m'] = pd.to_numeric(df_run['Depth m'])
            df_run['depth rounded'] = df_run['Depth m'].astype(int)  # rounds (truncates) to int
            df_run = df_run.reset_index(drop=True)
            df_run = df_run.drop_duplicates(subset=['depth rounded'], keep='last')  # keep the last value
            del df_run['depth rounded']
            df_run = df_run.dropna(axis=0, how='any', thresh=2)
            df_run = df_run.reset_index(drop=True)
            run_name = str(dfdate) + '_run' + str(run_counter)
            ##### sensor info to the result file
            if first_run_counter == 0:
                last_df = df_run.copy()
                last_df = last_df[0:0]
            last_df = last_df.append(df_run)
            first_run_counter += 1
            with open(logfile, "a") as file:
                file.write("{0} has {1} last measurement(s) \n".format(run_name, df_run.shape[0]))
            run_counter += 1
        # all raw data, but with sensor info and header per day
        df_help4['Time (HH:mm:ss)'] = df_help4['Time (HH:mm:ss)'].astype(str)
        df_help4['Date (DD/MM/YYYY)'] = df_help4['Date (DD/MM/YYYY)'].astype(str)
        for i in range(df_help.shape[0]):
            if df_help4.at[0, 'Date (DD/MM/YYYY)'] == df_help.at[i, 0]:
                if df_help4.at[0, 'Time (HH:mm:ss)'] == df_help.at[i, 1]:
                    memo = i
                    break
        for n in reversed(list(range(memo))):
            if df_help.at[n, 3] == 'SENSOR SERIAL NUMBER:':
                sensor_info = df_help.iloc[n:n + 1, :]
                sensor_info.reset_index(drop=True, inplace=True)
                break
        sensor_info.at[0, 0:2] = '-'
        df1 = df1.columns.to_frame().T.append(df1, ignore_index=True)  # adds the header as the topmost row
        df1.columns = range(len(df1.columns))  # renumber the header columns 0 to n
        if first_day_counter == 0:
            raw_df = df1.copy()
            raw_df = raw_df[0:0]
        sensor_info.columns = range(len(df1.columns))
        df1 = pd.concat([df1.iloc[:(0)], sensor_info, df1.iloc[0:]]).reset_index(drop=True)
        raw_df = raw_df.append(df1)
        first_day_counter += 1
    last_df.to_excel(writer, sheet_name='{}'.format("last"), header=False, index=False)
    # raw_df['Date (DD/MM/YYYY)'] = raw_df['Date (DD/MM/YYYY)'].dt.strftime('%d.%m.%Y')
    raw_df.to_excel(writer, sheet_name='{}'.format("raw"), header=False, index=False)
    writer.save()
    with open(logfile, "a") as file:
        file.write("total number of last measurements: {} \n".format(last_df.shape[0]))
        file.write("total number of raw measurements: {} \n".format(raw_df.shape[0]))
f.close()
error:
Traceback (most recent call last):
File "tsk-py-convert.py", line 95, in <module>
File "pandas\core\generic.pyc", line 5458, in __getattr__
File "pandas\core\accessor.pyc", line 180, in __get__
File "pandas\core\indexes\accessors.pyc", line 494, in __new__
AttributeError: Can only use .dt accessor with datetimelike values
Within Spyder the code was using an old pandas version (0.23.4). My code doesn't seem to work with a newer version. I had the latest pandas version pip-installed on Windows and have now manually installed the version that ships with Anaconda (0.23.4).
I can now run the code through cmd and IDLE, and the .exe created with PyInstaller works!
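For anyone hitting the same thing: the traceback points at a .dt access on a date column that the script never explicitly converts (the pd.to_datetime call above it is commented out), so whether it works ends up depending on how that pandas version parsed the column. A minimal, illustrative sketch of the defensive pattern, using a made-up dataframe:
import pandas as pd

df = pd.DataFrame({'Date (DD/MM/YYYY)': ['01.02.2021', '01.02.2021', '02.02.2021']})

# Convert explicitly before touching the .dt accessor; errors='coerce' turns
# unparseable cells into NaT instead of raising.
df['Date (DD/MM/YYYY)'] = pd.to_datetime(df['Date (DD/MM/YYYY)'], format='%d.%m.%Y', errors='coerce')

# .dt now works regardless of the pandas version.
df['next day'] = df['Date (DD/MM/YYYY)'].diff().dt.days > 0
print(df)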

"Wrong" TF IDF Scores

I have 1000 .txt files and planned to search for various keywords and calculate their TF-IDF scores. But for some reason the results are > 1. I then did a test with two .txt files: "I am studying nfc" and "You don't need AI". For nfc and AI the TF-IDF should be 0.25, but when I open the .csv it says 1.4054651081081644.
I must admit that I did not choose the most efficient approach for the code. I think the mistake lies with the folders, since I originally planned to check the documents by year (annual reports from 2000-2010). But I cancelled those plans and decided to treat all annual reports as a single corpus. I still think the folder workaround is the problem. I placed the two .txt files into the folder "-". Is there a way to make it count correctly?
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from pathlib import Path

# root dir
root = '/Users/Tom/PycharmProjects/TextMining/'
#
words_to_find = ['AI', 'nfc']
# tf_idf file writing
wrote_tf_idf_header = False
tf_idf_file_idx = 0
#
vectorizer_tf_idf = TfidfVectorizer(max_df=.80, min_df=1, stop_words='english', use_idf=True, norm=None, vocabulary=words_to_find, ngram_range=(1, 3))
vectorizer_cnt = CountVectorizer(stop_words='english', vocabulary=words_to_find, ngram_range=(1, 3))
#
years = ['-']
year_folders = [root + folder for folder in years]
# remove previous results file
if os.path.isfile('summary.csv'):
    os.remove('summary.csv')
if os.path.isfile('tf_idf.csv'):
    os.remove('tf_idf.csv')
# process every folder (for every year)
for year_idx, year_folder in enumerate(year_folders):
    # get file paths in folder
    file_paths = []
    for file in Path(year_folder).rglob("*.txt"):
        file_paths.append(file)
    # count of files for each year
    file_cnt = len(file_paths)
    # read every file's text as string
    docs_per_year = []
    words_in_folder = 0
    for txt_file in file_paths:
        with open(txt_file, encoding='utf-8', errors="replace") as f:
            txt_file_as_string = f.read()
        words_in_folder += len(txt_file_as_string.split())
        docs_per_year.append(txt_file_as_string)
    #
    tf_idf_documents_as_array = vectorizer_tf_idf.fit_transform(docs_per_year).toarray()
    # tf_idf_documents_as_array = vectorizer_tf_idf.fit_transform([' '.join(docs_per_year)]).toarray()
    #
    cnt_documents_as_array = vectorizer_cnt.fit_transform(docs_per_year).toarray()
    #
    with open('summary.csv', 'a') as f:
        f.write('Index;Term;Count;Df;Idf;Rel. Frequency\n')
        for idx, word in enumerate(words_to_find):
            abs_freq = cnt_documents_as_array[:, idx].sum()
            f.write('{};{};{};{};{};{}\n'.format(idx + 1,
                                                 word,
                                                 np.count_nonzero(cnt_documents_as_array[:, idx]),
                                                 abs_freq,
                                                 vectorizer_tf_idf.idf_[idx],
                                                 abs_freq / words_in_folder))
        f.write('\n')
    with open('tf_idf.csv', 'a') as f:
        if not wrote_tf_idf_header:
            f.write('{}\n'.format(years[year_idx]))
            f.write('Index;Year;File;')
            for word in words_to_find:
                f.write('{};'.format(word))
            f.write('Sum\n')
            wrote_tf_idf_header = True
        for idx, tf_idfs in enumerate(tf_idf_documents_as_array):
            f.write('{};{};{};'.format(tf_idf_file_idx, years[year_idx], file_paths[idx].name))
            for word_idx, _ in enumerate(words_to_find):
                f.write('{};'.format(tf_idf_documents_as_array[idx][word_idx]))
            f.write('{}\n'.format(sum(tf_idf_documents_as_array[idx])))
            tf_idf_file_idx += 1
print()
I think the mistake is that you are setting norm=None; the norm should be 'l1' or 'l2', as specified in the documentation.
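A quick way to see the effect on the two toy sentences from the question (just an illustrative sketch; note the vocabulary is given in lowercase because TfidfVectorizer lowercases tokens by default):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am studying nfc", "You don't need AI"]
vocab = ['ai', 'nfc']  # lowercase, so the terms match the lowercased tokens

# norm=None keeps the raw tf * idf values, which can be greater than 1
raw = TfidfVectorizer(vocabulary=vocab, norm=None)
print(raw.fit_transform(docs).toarray())

# norm='l2' (the default) rescales each document vector to unit length,
# so the individual scores stay between 0 and 1
normalized = TfidfVectorizer(vocabulary=vocab, norm='l2')
print(normalized.fit_transform(docs).toarray())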

How do you convert all the pdfs in a directory, into txt format, via R?

I'm trying to convert a list of PDF files located in a directory on my computer into txt format so that R can read them and begin text mining. Do you know what is wrong with this code?
library(tm) #load text mining library
setwd('D:/Directory') #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("D:/Directory/NewsArticles"),readerControl=list(reader=readPlain))
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
system(paste("\"", exe, "\" \"", ae.corpus, "\"", sep = ""), wait = F)
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt); shell.exec(filetxt) # strangely the first try always throws an error..
summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
myStopwords <- c(stopwords('english'), "available", "via")
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 3))
inspect(ae.tdm)
findFreqTerms(ae.tdm, lowfreq=2)
findAssocs(ae.tdm, "economic",.7)
d<- Dictionary (c("economic", "uncertainty", "policy"))
inspect(DocumentTermMatrix(ae.corpus, list(dictionary = d)))
Try using this instead:
dest <- "" #same as setwd()
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
lapply(myfiles, function(i) system(paste('""', #the path to Program files where the pdftotext.exe is saved
paste0('"', i, '"')), wait = FALSE) )
and then
#combine files
files <- list.files(pattern = "[.]txt$")
outFile <- file("output.txt", "w")
for (i in files) {
  x <- readLines(i)
  writeLines(x[2:(length(x)-1)], outFile)
}
close(outFile)
#read data
txt<-read.table('output.txt',sep='\t', quote = "")
Hope that helps!