Python: index out of range but I can't see why - indexing

I'm trying to code a program that opens a file, creates a list that contains each line of this file, and removes some words from this list. I have an Index Out of Range error.
#! /usr/bin/python3
# open the file
f = open("test.txt", "r")
# a list that contains each line of my file (without \n)
lines = f.read().splitlines()
# close the file
f.close()
# some words I want to delete from the file
data = ["fire", "water"]
# for each line of the file...
for i in range(len(lines)):
# if this line is in [data]
if lines[i] in data:
# delete this line from [data]
print(lines[i])
del lines[i]
This is my text file:
sun
moon
*
fire
water
*
metal
This is my output:
fire
Traceback (most recent call last):
File "debug.py", line 16, in <module>
if lines[i] in data:
IndexError: list index out of range

I'll address the out of index error first.
What's happening, in your first post, is that it takes the length of the list
which is 7, so len(lines) = 7 which will result in range(7) = (0,7) which will do this:
lines[0] = sun
lines[1] = moon
lines[2] = *
lines[3] = fire
lines[4] = water
lines[5] = *
lines[6] = metal
=>
lines[0] = sun
lines[1] = moon
lines[2] = *
deleted lines[3] = fire
lines[3] = water
lines[4] = *
lines[5] = metal
If you now delete lines[3] it will still try to iterate over lines[6] although it won't exist anymore.
It will also continue to iterate over the next index lines[4] so water will never be checked, it is absorbed.
I hope this helps and I suggest you do this instead:
# open the file
f = open("test.txt", "r")
# a list that contains each line of my file (without \n)
lines = f.read().splitlines()
# close the file
f.close()
# some words I want to delete from the file
data = ["fire", "water"]
for i in data:
if i in lines:
lines.remove(i)
print(lines)
Output:
['sun', 'moon', '', '', 'metal']

Related

cx_oracle ORA-01036: illegal variable name/number

I have this group of python scripts where I do an API call, merge the resulting df together, then output to a table in an Oracle db. This exact script is working perfectly in three other scripts configured same, except a different API, but in this particular script, this error is getting thrown. I read up on binds, but I can't see how I'm doing it incorrectly for a tuple. Thanks in advance for your time.
sf_joined = pd.merge(sf_opp, sf_account,on=["CustID","CustID"])
# sf_joined.to_csv('sf_joined.csv', index=False)
# sf_types = sf_joined.dtypes
# print(sf_types)
char_columns = sf_joined.select_dtypes(include=['object']).columns
for col in char_columns:
if col not in ['rundate','Amount','Estimated_GC','Probability','MDC','DaysOpen']:
sf_joined[col] = sf_joined[col].fillna('')
# sf_joined[col] = sf_joined[col].map(lambda x: x.encode('utf-8'))
sf_joined[col] = sf_joined[col].map(lambda x: x[:1000])
pw = '****'
db_con = cx_Oracle.connect('mktg', pw, "prd-bia-db-***.o******.com:1521/BIPRD", encoding = "UTF-8", nencoding = "UTF-8")
cur = db_con.cursor()
print(db_con.version)
cur.execute('drop table cs_salesforce')
create_opps = """create table cs_salesforce(
rundate date,
sfoppid varchar(500)
)
"""
cur.execute(create_opps)
all_opps = []
for x in sf_joined.itertuples():
all_opps.append(x[1:])
insert_statement = """insert into cs_salesforce(rundate,sfoppid)values(:1, :2)"""
cur.executemany(insert_statement, all_opps)
db_con.commit()
Error:
runfile('C:/python_scripts_prod/cs_salesforce.py', wdir='C:/python_scripts_prod')
18.3.0.0.0
Traceback (most recent call last):
File "C:\python_scripts_prod\cs_salesforce.py", line 163, in <module>
cur.executemany(insert_statement, all_opps)
DatabaseError: ORA-01036: illegal variable name/number

How to avoid "missing input files" error in Snakemake's "expand" function

I get a MissingInputException when I run the following snakemake code:
import re
import os
glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs","{fileName}.{ext}"))
rule end:
input:
expand(os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas"), fileName=glob_vars.fileName)
rule rename:
'''
rename fasta file to avoid problems
'''
input:
expand("inputs/{{fileName}}.{ext}", ext=glob_vars.ext)
output:
os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas")
run:
list_ = []
with open(str(input)) as f2:
line = f2.readline()
while line:
while not line.startswith('>') and line:
line = f2.readline()
fas_name = re.sub(r"\W", "_", line.strip())
list_.append(fas_name)
fas_seq = ""
line = f2.readline()
while not line.startswith('>') and line:
fas_seq += re.sub(r"\s","",line)
line = f2.readline()
list_.append(fas_seq)
with open(str(output), "w") as f:
f.write("\n".join(list_))
My Inputs folder contains these files:
G.bullatarudis.fasta
goldfish_protein.faa
guppy_protein.faa
gyrodactylus_salaris.fasta
protopolystoma_xenopodis.fa
salmon_protein.faa
schistosoma_mansoni.fa
The error message is:
Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NCBI/BLAST/RHB/test.rule:
Missing input files for rule rename:
inputs/guppy_protein.fasta
inputs/guppy_protein.fa
I assumed that the error is caused by expand function, because only guppy_protein.faa file exists, but expand also generate guppy_protein.fasta and guppy_protein.fa files. Are there any solutions?
By default, expand will produce all combinations of the input lists, so this is expected behavior. You need your input to lookup the proper extension given a fileName. I haven't tested this:
glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs","{fileName}.{ext}"))
# create a dict to lookup extensions given fileNames
glob_vars_dict = {fname: ex for fname, ex in zip(glob_vars.fileName, glob_vars.ext)}
def rename_input(wildcards):
ext = glob_vars_dict[wildcards.fileName]
return f"inputs/{wildcards.fileName}.{ext}"
rule rename:
input: rename_input
A few unsolicited style comments:
You don't have to prepend your glob_wildcards with the os.getcwd, glob_wildcards("inputs", "{fileName}.{ext}")) should work as snakemake uses paths relative to the working directory by default.
Try to stick with snake_case instead of camalCase for your variable names in python
In this case, fileName isn't a great descriptor of what you are capturing. Maybe species_name or species would be clearer
Thanks to Troy Comi, I modified my code and it worked:
import re
import os
import itertools
speciess,exts = glob_wildcards(os.path.join(os.getcwd(), "inputs_test","{species}.{ext}"))
rule end:
input:
expand("inputs_test/{species}_rename.fas", species=speciess)
def required_files(wildcards):
list_combination = itertools.product([wildcards.species], list(set(exts)))
exist_file = ""
for file in list_combination:
if os.path.exists(f"inputs_test/{'.'.join(file)}"):
exist_file = f"inputs_test/{'.'.join(file)}"
return exist_file
rule rename:
'''
rename fasta file to avoid problems
'''
input:
required_files
output:
"inputs_test/{species}_rename.fas"
run:
list_ = []
with open(str(input)) as f2:
line = f2.readline()
while line:
while not line.startswith('>') and line:
line = f2.readline()
fas_name = ">" + re.sub(r"\W", "_", line.replace(">", "").strip())
list_.append(fas_name)
fas_seq = ""
line = f2.readline()
while not line.startswith('>') and line:
fas_seq += re.sub(r"\s","",line)
line = f2.readline()
list_.append(fas_seq)
with open(str(output), "w") as f:
f.write("\n".join(list_))

How can I use a loop to apply a function to a list of csv files?

I'm trying to loop through all files in a directory and add "indicator" data to them. I had the code working where I could select 1 file and do this, but now am trying to make it work on all files. The problem is when I make the loop it says
ValueError: Invalid file path or buffer object type: <class 'list'>
The goal would be for each loop to read another file from list, make changes, and save file back to folder with changes.
Here is complete code w/o imports. I copied 1 of the "file_path"s from the list and put in comment at bottom.
### open dialog to select file
#file_path = filedialog.askopenfilename()
###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')
###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [ string + x for x in listdrs]
print (listdrs_path)
###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file in listdrs_path:
file_path = listdrs_path
data = pd.read_csv(file_path, index_col=0)
########################################
####function 1
def get_price_hist(ticker):
# Put stock price data in dataframe
data = pd.read_csv(file_path)
#listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
print(listdr)
# Convert date to timestamp and make index
data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
data.drop("Date", axis=1, inplace=True)
return data
df = data
##print(data)
######Indicator data#####################
def get_indicators(data):
# Get MACD
data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
# Get MA10 and MA30
data["ma10"] = talib.MA(data["Close"], timeperiod=10)
data["ma30"] = talib.MA(data["Close"], timeperiod=30)
# Get RSI
data["rsi"] = talib.RSI(data["Close"])
return data
#####end functions#######
data2 = get_indicators(data)
print(data2)
data2.to_csv(file_path)
###################################################
#here is an example of what path from list looks like
#'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/A.csv'
The problem is in line number 13 and 14. Your filename is in variable file but you are using file_path which you've assigned the file list. Because of this you are getting ValueError. Try this:
### open dialog to select file
#file_path = filedialog.askopenfilename()
###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')
###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [ string + x for x in listdrs]
print (listdrs_path)
###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file_path in listdrs_path:
data = pd.read_csv(file_path, index_col=0)
########################################
####function 1
def get_price_hist(ticker):
# Put stock price data in dataframe
data = pd.read_csv(file_path)
#listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
print(listdr)
# Convert date to timestamp and make index
data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
data.drop("Date", axis=1, inplace=True)
return data
df = data
##print(data)
######Indicator data#####################
def get_indicators(data):
# Get MACD
data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
# Get MA10 and MA30
data["ma10"] = talib.MA(data["Close"], timeperiod=10)
data["ma30"] = talib.MA(data["Close"], timeperiod=30)
# Get RSI
data["rsi"] = talib.RSI(data["Close"])
return data
#####end functions#######
data2 = get_indicators(data)
print(data2)
data2.to_csv(file_path)
Let me know if it helps.

Converting xls to xlsx using xlrd

I am using the exact script below (except the file path) to try to convert xls to xlsx. The script is successful but not producing any output. Am I missing something basic like inputting some variable values or file names, or not saving the file properly as a xlsx?
import xlrd
import os
from openpyxl.workbook import Workbook
filenames = os.listdir("file_path")
for fname in filenames:
if fname.endswith(".xls"):
def cvt_xls_to_xlsx(fname):
book_xls = xlrd.open_workbook(fname)
book_xlsx = Workbook()
sheet_names = book_xls.sheet_names()
for sheet_index in range(0,len(sheet_names)):
sheet_xls = book_xls.sheet_by_name(sheet_names[sheet_index])
if sheet_index == 0:
sheet_xlsx = book_xlsx.active
sheet_xlsx.title = sheet_names[sheet_index]
else:
sheet_xlsx = book_xlsx.create_sheet(title=sheet_names[sheet_index])
for row in range(0, sheet_xls.nrows):
for col in range(0, sheet_xls.ncols):
sheet_xlsx.cell(row = row+1 , column = col+1).value = sheet_xls.cell_value(row, col)
cvt_xls_to_xlsx(fname)
book_xlsx.save(fname.xlsx)
I have updated the above script with the last two rows of commands, but I am now getting the following error message:
File "C:\Users\local\Documents\Tasks\Python\Excelconvert.py", line 26, in
cvt_xls_to_xlsx(fname)
File "C:\Users\local\Documents\Tasks\Python\Excelconvert.py", line 10, in cvt_xls_to_xlsx
book_xls = xlrd.open_workbook(fname)
File "C:\Users\local\Python34\lib\site-packages\xlrd__init__.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "C:\Users\local\Python34\lib\site-packages\xlrd\book.py", line 92, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Users\local\Python34\lib\site-packages\xlrd\book.py", line 1278, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Users\local\Python34\lib\site-packages\xlrd\book.py", line 1272, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'# - Copy'
What does the '# - Copy' mean? I have seen others suggest it relates to the file format but usually that is an xml error, and not my above. Any solutions would be appreciated.

How to split a PDF every n page using PyPDF2?

I'm trying to learn how to split a pdf every n page.
In my case I want to split a 64p PDF into several chunks containing four pages each: file 1: p.1-4, file 2: p.5-8 etc.
I'm trying to understand PyPDF2 but my noobness overwhelms me:
from PyPDF2 import PdfFileWriter, PdfFileReader
pdf = PdfFileReader('my_pdf.pdf')
I guess I need to make a loop of sorts using addPage and write files till there's no pages left?
Little late but I ran into your question while looking for help trying to do the same thing.
I ended up doing the following, which does what you're asking. Mind you it's probably more than you're asking for, but the answer is in there. It's a rough first draft, in heavy need of refactoring and some variable renaming.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
def split_pdf(in_pdf, step=1):
"""Splits a given pdf into seperate pdfs and saves
those to a supfolder of the parent pdf's folder, called
splitted_pdf.
Arguments:
in_pdf: [str] Absolute path (and filename) of the
input pdf or just the filename, if the file
is in the current directory.
step: [int] Desired number of pages in each of the
output pdfs.
Returns:
dunno yet
"""
#TODO: Add choice for output dir
#TODO: Add logging instead of prints
#TODO: Refactor
try:
with open(in_pdf, 'rb') as in_file:
input_pdf = PdfFileReader(in_file)
num_pages = input_pdf.numPages
input_dir, filename = os.path.split(in_pdf)
filename = os.path.splitext(filename)[0]
output_dir = input_dir + "/" + filename + "_splitted/"
os.mkdir(output_dir)
intervals = range(0, num_pages, step)
intervals = dict(enumerate(intervals, 1))
naming = f'{filename}_p'
count = 0
for key, val in intervals.items():
output_pdf = PdfFileWriter()
if key == len(intervals):
for i in range(val, num_pages):
output_pdf.addPage(input_pdf.getPage(i))
nums = f'{val + 1}' if step == 1 else f'{val + 1}-{val + step}'
with open(f'{output_dir}{naming}{nums}.pdf', 'wb') as outfile:
output_pdf.write(outfile)
print(f'{naming}{nums}.pdf written to {output_dir}')
count += 1
else:
for i in range(val, intervals[key + 1]):
output_pdf.addPage(input_pdf.getPage(i))
nums = f'{val + 1}' if step == 1 else f'{val + 1}-{val + step}'
with open(f'{output_dir}{naming}{nums}.pdf', 'wb') as outfile:
output_pdf.write(outfile)
print(f'{naming}{nums}.pdf written to {output_dir}')
count += 1
except FileNotFoundError as err:
print('Cannot find the specified file. Check your input:')
print(f'{count} pdf files written to {output_dir}')
Hope it helps you.
from PyPDF2 import PdfFileReader, PdfFileWriter
import os
# Method to split the pdf at every given n pages.
def split_at_every(self,infile , step = 1):
# Copy the input file path to a local variable infile
input_pdf = PdfFileReader(open(infile, "rb"))
pdf_len = input_pdf.number_of_pages
# Get the complete file name along with its path and split the text to take only the first part.
fname = os.path.splitext(os.path.basename(infile))[0]
# Get the list of page numbers in the order of given step
# If there are 10 pages in a pdf, and the step is 2
# page_numbers = [0,2,4,6,8]
page_numbers = list(range(0,pdf_len,step))
# Loop through the pdf pages
for ind,val in enumerate(page_numbers):
# Check if the index is last in the given page numbers
# If the index is not the last one, carry on with the If block.
if(ind+1 != len(page_numbers)):
# Initialize the PDF Writer
output_1 = PdfFileWriter()
# Loop through the pdf pages starting from the value of current index till the value of next index
# Ex : page numbers = [0,2,4,6,8]
# If the current index is 0, loop from 1st page till the 2nd page in the pdf doc.
for page in range(page_numbers[ind], page_numbers[ind+1]):
# Get the data from the given page number
page_data = input_pdf.getPage(page)
# Add the page data to the pdf_writer
output_1.addPage(page_data)
# Frame the output file name
output_1_filename = '{}_page_{}.pdf'.format(fname, page + 1)
# Write the output content to the file and save it.
self.write_to_file(output_1_filename, output_1)
else:
output_final = PdfFileWriter()
output_final_filename = "Last_Pages"
# Loop through the pdf pages starting from the value of current index till the last page of the pdf doc.
# Ex : page numbers = [0,2,4,6,8]
# If the current index is 8, loop from 8th page till the last page in the pdf doc.
for page in range(page_numbers[ind], pdf_len):
# Get the data from the given page number
page_data = input_pdf.getPage(page)
# Add the page data to the pdf_writer
output_final.addPage(page_data)
# Frame the output file name
output_final_filename = '{}_page_{}.pdf'.format(fname, page + 1)
# Write the output content to the file and save it.
self.write_to_file(output_final_filename,output_final)